“Research” Interest

This Friday, I met my future roommate in Beijing. Over lunch, we talked about each other’s research interests. My roommate, like me, is also a CS graduate student at Austin. However, unlike me, he has a clear vision of the direction he is going to pursue in graduate school. He just finished his undergraduate degree in the Department of Automation at Tsinghua University. The Automation department, as he explained, is something like a mixture of mechanical engineering and electrical engineering. He has been interested in mathematics since high school and, naturally, he wants to work on machine learning theory in graduate school, with an emphasis on computer vision (CV).

Then it was my turn. That’s a hard question I have been thinking about for a while. I don’t have a clear vision of what I’m going to pursue next. I think maybe I’m too greedy and want to keep everything. However, I also realize that I may not be as greedy as I initially thought. I know I don’t want to work on computer architecture, computation theory, algorithms, compilers, or networks. So my options really come down to choosing among operating systems, databases, and machine learning. Within machine learning, I even know I probably won’t choose computer vision in the end (I still want to try a course, though) and I lean more towards natural language processing (NLP). However, picking one of those areas is just too hard for me now, even after the analysis in my last post, where I tried to talk myself into picking machine learning only. There is always a question running in my head: why do I have to pick one? Sometimes I just envy people like my future roommate who don’t have this torture in their minds (maybe he does? I don’t know).

This feeling, to be honest, isn’t new to me. When I was an undergraduate facing the pressure of getting a job, my naive approach was to lock myself in my room and keep thinking about which profession might suit me best. After two years of working, I have grown up enough to know that this way of making choices is stupid, and I have also grown up enough to know that “giving up is an art that takes practice”. Why am I in such a rush to pick a direction before I have even taken a graduate course? Why can’t I sit down and try out several courses first? Because I want to get a PhD at a good school so badly. Let’s face the fact that people get smarter and smarter with each generation. Here “smarter and smarter” doesn’t necessarily mean that people won’t repeat the mistakes that happened before. It means that people have a better capability to improve themselves. Machine learning was not hot in 2014, from my experience in college. Back then, LeetCode only had around 100 problems. I had no particular emotional attachment to the machine learning material when I was taking the AI class. Maybe because Wisconsin has a tradition in the systems area? I don’t know. However, in 2017, everyone, even my mother, who is a retired accountant, can say a few words about AI and machine learning. Isn’t that crazy?

On my homepage, I wrote the following words:

I like to spend time on both system and machine learning: system programming is deeply rooted in my heart that cannot easily get rid of; machine learning is like the magic trick that the audience always want to know how it works. I come back to the academia in the hope of finding the spark between these two fascinating fields.

Trust me, I really mean it. Maybe because I graduated from Wisconsin, I have a natural passion for system-level programming, whether it comes from operating systems or databases. Professor Remzi’s systems class is just a blast for anyone who wants to know what’s really going on under the software application layer. Professor Naughton’s db course is full of insights that I keep referring to even now that I work on a DBMS in the real world. Wisconsin is just too good in the systems field, and this is something I can hardly say no to, even though I have worked so hard to lie to my own face that “systems are not worth your time”. What about machine learning? To be honest, the great AI dream may never be accomplished. The undergraduate AI course surveys almost every corner of AI development, but only machine learning has become really hot nowadays. Almost every AI-related area nowadays (e.g., NLP, robotics, CV) relies on machine learning techniques. Why am I attracted to machine learning? Because it’s so cool. I’m like a kid who is eager to know what is going on behind a magic trick. Machine learning is a technique for solving tasks we cannot program directly. We cannot come up with a procedure that teaches a machine to read text, identify objects in images, and so on. We can solve these tasks only because of the advancement of machine learning. Isn’t this great? Why both? I think machine learning and systems are becoming more and more inseparable. Without good knowledge of systems, one can hardly build a good machine learning system. Implementing batch gradient descent using map-reduce is a good example: each worker computes the gradient over its own shard of the data, and the partial gradients are summed before the weights are updated.
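
To make the map-reduce remark concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the data is a toy line y = 2x + 1 split into two made-up shards, the model is one-variable linear regression, and the function names (partial_gradient, combine, batch_gradient_descent) are hypothetical. A real system would run the map step on separate machines; here plain map and functools.reduce stand in for that.

```python
from functools import reduce


def partial_gradient(w, b, shard):
    """Map step: sum the squared-error gradient over one shard of (x, y) pairs."""
    grad_w = sum((w * x + b - y) * x for x, y in shard)
    grad_b = sum((w * x + b - y) for x, y in shard)
    return grad_w, grad_b, len(shard)


def combine(left, right):
    """Reduce step: add two partial results element-wise."""
    return left[0] + right[0], left[1] + right[1], left[2] + right[2]


def batch_gradient_descent(shards, lr=0.05, steps=2000):
    """Full-batch gradient descent where each step is one map + one reduce."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        partials = map(lambda shard: partial_gradient(w, b, shard), shards)
        grad_w, grad_b, n = reduce(combine, partials)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data on the line y = 2x + 1, split across two hypothetical workers.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = batch_gradient_descent(shards)
print(round(w, 3), round(b, 3))
```

The point of the split is that partial_gradient only ever touches its own shard, so the per-step communication is three numbers per worker regardless of how much data each worker holds.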

I just realized that I haven’t answered the question about why I am rushing the decision. To get into a good graduate school for a PhD, you need to demonstrate that you can do research. This is done by publishing papers. Many of the undergraduates applying nowadays already have papers under their belts. That’s huge pressure for me. A master’s program is only two years long. I cannot afford the time to look around. I need to get started with research immediately in order to be in good standing when I apply for a PhD in 2018.

So, as you can tell, I have a problem. And, as a future researcher, I need to solve it. Here is what I’m planning to do:

  • Take courses in machine learning in the first semester and begin working on a research project as soon as I can. I’ll give NLP problems a chance.
  • Meanwhile, sit in on the OS class and begin reading papers produced by the Berkeley Database group. People there seem to be interested in the intersection between machine learning and systems. This paper looks like a promising one.
  • Talk to more people in the area and seek advice from others.
  • Start reading “How to Stop Worrying and Start Living”

Will this solve the problem eventually? I don’t know. Only time can tell.


Takeaway from DTCC 2017

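Because a colleague was away on a business trip, I had the good fortune to attend the 8th Database Technology Conference China (DTCC 2017), held at the Beijing International Convention Center. This was my first time attending an industry conference, so I was especially excited. I took a lot away from the conference, and I want to record it in this blog post. I had planned to write in English, since for computing English is my “mother tongue”, but because the talks were mainly in Chinese, I wrote the original of this post in Chinese.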

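Conference Goals

Although the opportunity came suddenly, I still set a few goals to make the most of it (what follows is the English first draft of this post; I was too lazy to translate it back into Chinese, so please bear with it):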

Get some sense from the peers

Focusing on your own product is quite important. However, it’s even more important to see how your peers are doing. I’m not an architect yet, but I feel it’s helpful to begin thinking like one and to see what problems your peers are facing and how they try to solve them. In addition, by knowing how things are going with your peers, you can get a measure of yourself: is the work you are doing on the same level as theirs? Are you in good shape in the job market? What gaps do you need to fill skill-wise?

Deepen the understanding of the field

Even after almost two years of working in the database field, I still think of myself as a newbie. This is mainly because a database is arguably one of the most complex pieces of software people ever make, and there are tons of things I don’t know. So, I want to see, at a high level, what the trends in the field are and what lessons people derive from their day-to-day engineering practice. I think this may help me catch up with the masters.

AI or System?

As I disclosed in my last post, I have decided to head back to school and get a master’s degree. To be honest, my ultimate goal is to earn a PhD in Computer Science, and I’m actively preparing for it. The most important question is which field I want to study. I have two options, and I have some interest in both: AI and System. Why these two options and not others is worth a whole new post, and I don’t want to discuss it here. So, my task for now is to gather as much information as possible about these two fields and see which one looks more attractive to me. This event is extremely helpful because it has talks on System as well as on AI.

Day 1

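The first day was split into two halves. The morning had the opening and four talks. The afternoon had five parallel tracks, each with six talks on the same theme, which meant I could not attend every talk. My strategy for the first day was to cover everything: I attended the system talks as well as the AI-related ones. Below are my takeaways and comments on each talk I attended: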

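Interpreting the Annual Theme (曹鹏 – Vice President, JD Finance)

The theme of this conference was “Data-driven, value discovery”. This talk unpacked the theme from JD Finance’s own perspective. I remember two points from it:

  1. The finance field has been hit by machine learning, and more and more FinTech companies have appeared in recent years. Judging from this talk, the main application of machine learning at such companies is more precise targeting and analysis of customer groups. Its role in quantitative trading strategies, by contrast, was not covered. I have been following applications of machine learning in finance lately, but I did not find the answer I was looking for in this talk, because, in my view, precise customer targeting is a general-purpose application of machine learning and is not unique to the financial industry.
  2. A data company looks like a decent startup idea to me. The talk mentioned how important data is to JD Finance. They need not only breadth of data but also depth. One important issue is that data is highly time-sensitive and shifts between hot and cold. A customer’s purchase records from a year ago offer little guidance today. So JD Finance collects a large amount of data every day (~6TB) to keep its analysis accurate. Even so, the speaker revealed, they feel the data is still far from meeting their needs. This would explain why IBM recently acquired The Weather Company and the medical imaging company Merge Healthcare: it simply wanted the data these two companies hold. This makes me wonder: could being a data vendor be a good startup idea?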

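An Overview of Database Development (吴承杨 – Oracle)

Overall this talk had few highlights, but it still carried some important information:

  1. Today, after so many years of calls to “de-IOE” (move away from IBM, Oracle, and EMC), Oracle’s market share is still as high as 56%
  2. The future of databases is the cloud: the speaker used a case to illustrate the importance of hybrid cloud. The problem enterprises face now is how to connect data in the public cloud with data on local servers effectively, and how to make the public cloud private. The whole talk felt more like an Oracle solutions briefing with little technical content, but it did point out the future direction of databases: moving to the cloud.
  3. The speaker had good stage presence and was a good presenter.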

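The Next Stop for Data Technology – Data Applications (王桐 – Yonghong Tech)

This talk suggested that Yonghong Tech’s core business and technical strength may not be that deep. My impression from the talk was that Yonghong Tech builds applications on top of databases rather than database systems themselves. I learned that Yonghong has integrated traditional databases and big data systems into one platform and built services for different industries on top of it. This feels a lot like IBM’s own Bluemix, minus the Watson family. In my personal view, software system integration is much less difficult than building the systems themselves. The talk focused on the various data application services that Yonghong Tech offers. I looked it up: the company is a startup founded in 2012, and I think getting to where it is today is no small feat.

The talk did have highlights. One was a graph relating people and positions: the pending relationships between processes were shown as a network graph in which each node is a position. With this view we can clearly see which position and person is most critical, and how their absence or level of ability would affect the whole company’s business. Another highlight was the resource allocation chart, which shows metrics such as meeting room usage and utilization. Anyone who has spent time at IBM will know this one well: countless meeting rooms get booked but are never fully used. I imagine this kind of resource view would help a lot in a place like ours where meeting rooms are scarce.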

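How Dameng Breaks into Core Business Systems – The Product Development Path of a Domestic Database (韩朱忠 – Dameng)

I think this may have been the most inspiring talk of the day. It was all about how a small domestic vendor fought against foreign databases, won market share bit by bit, and grew into what it is today. One example was about improving the SQL optimization capability of their database, which is written in C. They once encountered a single SQL statement 3.9K lines long; in other words, pasted into a Word document it would fill more than 350 pages. It contained 17 inner joins, 557 subqueries, 831 OR filters, 1000+ selected columns, and 2731 CASE WHEN expressions. Through continuous optimization they brought this statement down from several hundred minutes to under 1 second. Another story was about how hard it is for a domestic database to survive. Because large enterprises and core industries such as banking and telecom all use foreign databases, domestic products simply cannot get in and can only compete in the small and medium enterprise market. But through its own persistent effort, this vendor finally won the State Grid contract, as well as deals with Tibet and China Eastern Airlines. In my view this is a remarkable achievement. It made me reflect on IBM. I don’t think our DB2 could handle such a complex SQL statement without targeted optimization. This example also makes me feel that either we are winning customers on our reputation and past accumulation, or our DB2 pre-sales colleagues are super tryhard when doing POCs. I clearly felt the gap in effort between us and these domestic databases. Perhaps one day our positions will be swapped? I believe that is something IBM’s senior management does not want to see. We really do need to work harder.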

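Applying and Extending the IO Determination Feature of SSDs in Database Workload Optimization (阳学仕 – 宝存科技)

This talk started from storage and discussed how to use software to emulate hardware to improve read and write speed. In other words, the question it left me with is how a database can exploit IO determination to improve read/write speed. IO determination, as I roughly understand it, means letting the applications on a drive coexist more harmoniously, and giving an application the resources it needs by raising its priority, setting upper and lower limits on IO resources, and optimizing the ordering of reads and writes over time. The talk also touched on how SSDs are keeping pace with networking: with better hardware, the gap between a local write and a write to a remote machine over the network is now only ten-odd microseconds. To me, these topics belong to the OS field. The direction of hardware boosting the DB felt refreshing.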

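Thoughts on Future-Oriented Database Architecture (张瑞 – Alibaba)

This talk mainly introduced the architecture of AliSQL inside Alibaba and reflections on database architecture in light of Alibaba’s business characteristics. There are two points I want to mention:

  1. Domestic vendors and IBM differ fundamentally in how they treat databases. Domestic vendors such as Alibaba, Tencent, and Baidu develop and rework their in-house databases starting from their own business pain points. Correspondingly, their changes and improvements are highly targeted, and their database architectures may not be very general. IBM, by contrast, sells its database as a product, so its design aims to be comprehensive, large and complete, meeting the needs of as many types of users as possible. As a result, in some scenarios IBM’s DB2 cannot be as strong as AliSQL, OceanDB, or TDB. So for databases at very large companies, the final direction may always be “custom-made”.
  2. Machine learning and systems are becoming more and more tightly coupled. The speaker mentioned that they want to move from automated operations to intelligent operations in the future. SQL would no longer be reviewed manually by DBAs but optimized through some form of ML. The Alibaba people have not figured out the details yet, but they believe this is the direction of the future.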

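Afternoon Session Summary

In the afternoon I attended “Baidu’s NewSQL Database System”, “Tencent MySQL Kernel Optimization Explained”, “Big Data Applications at Didi”, and “Applying Natural Language Technology in the 文智 Trend Analysis Product”. My biggest takeaway from the Baidu talk was that distributed transactional databases are very hot right now; if you study them thoroughly, there is no problem in China you cannot get through. Another takeaway was not to worship Google’s systems too much. I did not follow all the details, but from the speaker’s words I sensed the attitude that it doesn’t matter whether a cat is black or white, as long as it catches mice. Sometimes you cannot be too academic. Moreover, even when two systems share exactly the same ideas, different implementations can lead to huge performance differences.

The Tencent talk was very technical, and since the speaker came from a technical background, the whole session was an ordeal. My impression is that kernel optimization is a deep pit that requires very solid DB knowledge. For the last two talks I chose ones related to machine learning, and I have to say they fell short of my expectations. Didi introduced some scenarios where they apply mathematical models. I got the feeling that the speaker had not been at Didi for long and could not really explain the models in depth. It was the application scenarios that made me feel economists have a role to play here: for example, how to use surge pricing at peak hours to balance supply and demand between drivers and riders, and how to charge for the losses that behaviors such as order cancellations cause the platform. Perhaps because public complaints run high, the whole Didi talk felt like a press conference. The final talk on natural language technology was very boring. The speaker came from a product manager background and mainly described how Tencent applies NLP technology to news. It was very general and did not touch on any technical difficulties in NLP, which was very disappointing.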

Day 2

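Overall I felt the second day was not as exciting as the first. The main reason, I think, is that the share of content on direction and industry strategy went down while the share of specific technical content went up. I have to say that, from what I observed over these two days, the MySQL and Oracle families still dominate the database field in China, mainly because of the booming internet industry. Below I briefly go over the day’s observations and reflections:

  • Informix is now tightly bound to the Internet of Things (IoT)

At IBM my neighbor is the Informix Technical Support team. Their lead once gave a talk on Informix applications in IoT. To me, this is about finding a new point of leverage for Informix, a former giant, so that it can get a new life. This was confirmed by today’s talk titled “A Database Platform for the Internet of Everything Era – SinoDB”. SinoDB can be thought of as a fork of Informix, because the company obtained a license to the Informix source code from IBM. I have to say IBM became the target of complaints here: this company, founded by veteran Informix employees, feels that IBM has not treated its stepchild Informix well. They think it is time to take their “child” back and let it thrive. This also makes me wonder what IBM acquired Informix for in the first place. I asked the colleague attending with me, and whether the Informix code has been organically merged with DB2’s is still unknown. This also helped me understand why so many MySQL forks appeared after Oracle acquired MySQL: after all, it is not their own son.

  • The many facets of a problem and the importance of domain knowledge

In the afternoon I stayed in the machine learning track. The talk I found most interesting was “Applying Machine Learning to House Price Estimation” from Lianjia (链家). You can guess most of the content from the title. One important message was that machine learning is not centered on algorithms; it is built on processed data backed by domain knowledge. Lianjia’s problem is that their data is on the order of hundreds of thousands of records, far less than the hundreds of millions available in some image or text processing tasks. In addition, their data mixes categorical and continuous variables, the continuous variables differ by orders of magnitude, and dirty data is unavoidable. All of this largely dictates feature engineering based on domain knowledge and a choice of algorithm suited to the data’s characteristics. Thinking about it now, it is not hard to see why, from my undergraduate statistics courses to Prof. Andrew Ng’s ML course that I am watching now, everyone’s first step after getting data is plotting: it lets you bring your domain knowledge to bear when observing the data’s characteristics and preprocessing it. One more remark: in my view, from yesterday’s Didi big data talk to today’s Lianjia machine learning talk, the problems they deal with essentially belong to economics. Unlike econometrics, the machine learning approach is more brute-force: analyzing data is just analyzing data, rather than first classifying the problem as an economic one and then following the standard economics routine of building a model first and validating it with data. I don’t want to say, and am not qualified to say, which approach is better. My point is that the playbook for solving one problem really is completely different from different angles. Perhaps looking at the same problem from different positions can strike a brighter spark?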
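
As a small illustration of the preprocessing issue raised in that talk (categorical and continuous variables mixed together, with continuous variables on very different scales), here is a minimal Python sketch. Everything in it is my own assumption rather than anything shown at the conference: the three listings, the field names district, area_sqm and subway_m, and the helper names are all made up. It one-hot encodes the categorical column and z-score standardizes the continuous ones so that no single column dominates purely because of its units. It does not deal with dirty data at all.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns (categories sorted)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def standardize(values):
    """Rescale a continuous column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def build_features(rows, categorical, continuous):
    """Return one feature vector per row: one-hot columns, then scaled columns."""
    columns = [one_hot([r[name] for r in rows]) for name in categorical]
    for name in continuous:
        columns.append([[z] for z in standardize([r[name] for r in rows])])
    return [sum((col[i] for col in columns), []) for i in range(len(rows))]


# Made-up listings: area is in the tens, subway distance in the thousands.
rows = [
    {"district": "north", "area_sqm": 60.0, "subway_m": 300.0},
    {"district": "south", "area_sqm": 90.0, "subway_m": 1500.0},
    {"district": "north", "area_sqm": 120.0, "subway_m": 2700.0},
]
features = build_features(rows, ["district"], ["area_sqm", "subway_m"])
for vector in features:
    print([round(x, 3) for x in vector])
```

After scaling, the area and subway-distance columns live on the same scale even though their raw values differ by more than an order of magnitude, which is the kind of data-driven preparation the speaker argued has to come before choosing an algorithm.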

Day 3

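The last day consisted of all-day special tracks. After the first two days I had a rough sense of the System and ML directions, so on the third day I focused on other areas such as blockchain. The talk I thought was best was “Business Applications Combining Blockchain and Big Data Technology”. Blockchain is an emerging technology, and because the ledger itself is public, you can picture it as a huge database that supports only insert and select. Mining the data in this database, and the optimizations that can be made for it, have therefore become a focus of the blockchain community. According to the talk, the ledger has already reached 3,400G. I also learned that distributed ledger technology has very broad applications. For example, the Red Cross could use blockchain when accepting donations so that all donation information is fully transparent and public. As an aside, every project now needs different kinds of talent; both system and AI people have room to show what they can do.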
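
The picture of a ledger as a database that supports only insert and select can be sketched in a few lines of Python. This is a toy under my own assumptions, not how any real blockchain works: there is no network, no consensus, and no mining, and the Ledger class, its method names, and the donation records are all hypothetical. What it does show is the one property the analogy relies on: each entry stores the SHA-256 hash of the previous one, so there is no update or delete, and any in-place tampering is detectable.

```python
import hashlib
import json


class Ledger:
    """Toy append-only ledger: insert and select only, linked by SHA-256 hashes."""

    GENESIS = "0" * 64

    def __init__(self):
        self._blocks = []

    @staticmethod
    def _digest(prev_hash, record):
        payload = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def insert(self, record):
        """Append a record, chaining it to the hash of the previous block."""
        prev_hash = self._blocks[-1]["hash"] if self._blocks else self.GENESIS
        digest = self._digest(prev_hash, record)
        self._blocks.append({"prev": prev_hash, "record": record, "hash": digest})
        return digest

    def select(self, predicate=lambda record: True):
        """Return every record that satisfies the predicate, in insertion order."""
        return [b["record"] for b in self._blocks if predicate(b["record"])]

    def verify(self):
        """Recompute the chain; False means some stored block was altered."""
        prev_hash = self.GENESIS
        for block in self._blocks:
            expected = self._digest(prev_hash, block["record"])
            if block["prev"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


# Made-up donation records, in the spirit of the Red Cross example.
ledger = Ledger()
ledger.insert({"donor": "alice", "amount": 100})
ledger.insert({"donor": "bob", "amount": 250})
large = ledger.select(lambda r: r["amount"] > 200)
print(large, ledger.verify())
```

Because the only write path is insert, the data mining mentioned in the talk amounts to running select-style scans over an ever-growing, immutable history.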

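Summary

Attending a conference really is a very pleasant experience. A tech rookie like me can learn where each field is heading and find a direction to work towards and a future position. The colleague who came with me told me that the conference made them even more certain about the direction they had originally chosen. In addition, those with rich engineering experience can pick up lessons from their peers at such a conference and learn from others’ strengths to offset their own weaknesses. The abundant networking opportunities are also part of the value of this kind of conference.

The moment I walked out of the conference, I thought the sky was so blue.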

Under Construction (part 2/2)

In the previous post, I briefly recapped the effort I have made to build a personal website. As you can tell from my previous work, I’m super into the Sphinx-doc tool, and I’m looking to build a website that is more suitable for an industry professional. So, this post is more about technology selection and where my endeavor stands now.

“Everyone should have a blog”?

There is a tendency among people who work in the tech industry: they love to blog about technology. I’m not saying that I’m against this tendency. In fact, I’m super into the idea. It makes a programmer’s life much easier. When we run into technical questions, we can google them, and usually the solution shows up in some nice guy’s blog post. However, I’m not sure that writing blog posts is a great way to systematize your knowledge. Let’s take “Minimal Emacs Tutorial” as an example and imagine how we would translate this article into blog posts. The “Learn about Emacs” section can be one post, which I tag with “emacs-terms” and “emacs-commands”. The “HowTos” section is split into multiple posts, because each howto was collected on a different date and, as a blogger, I would write a post each time I found something new about using Emacs. The tag for the “HowTos” posts may be “emacs topics”. Now, if I want to look at what I have learned about Emacs, I need three operations: click on “emacs-terms” to find the posts with this tag, then “emacs-commands”, and finally “emacs topics”, to traverse all the Emacs-related posts. As you can see, this is tiresome. Some may argue that this issue can be fixed by adding a theme tag, like “emacs”, to every Emacs-related post. This works around the problem, but we need to be very careful about how we choose the tags.
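
The tag problem above is easy to see in a toy model. The sketch below is only an illustration in Python: the post list, the “HowTo: rectangle edit” title, and both helper functions are hypothetical, and no real blog engine is involved. It contrasts looking up three separate tags with looking up one theme tag that has been added to every Emacs-related post.

```python
# Hypothetical posts; only the tag names come from the example above.
posts = [
    {"title": "Learn about Emacs", "tags": {"emacs-terms", "emacs-commands"}},
    {"title": "HowTo: rectangle edit", "tags": {"emacs topics"}},
    {"title": "Trip in Nan Jing", "tags": {"travel"}},
]


def posts_with_any(posts, wanted):
    """Titles of posts carrying at least one of the wanted tags."""
    wanted = set(wanted)
    return [p["title"] for p in posts if p["tags"] & wanted]


def add_theme_tag(posts, theme, members):
    """Attach one umbrella tag to every post that carries a member tag."""
    members = set(members)
    for p in posts:
        if p["tags"] & members:
            p["tags"].add(theme)


emacs_tags = ["emacs-terms", "emacs-commands", "emacs topics"]
before = posts_with_any(posts, emacs_tags)   # needs all three tag names
add_theme_tag(posts, "emacs", emacs_tags)
after = posts_with_any(posts, ["emacs"])     # one lookup is now enough
print(before == after)
```

The catch is the one already noted: the umbrella tag only helps if it is applied consistently, so the membership list passed to add_theme_tag has to be chosen with care.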

That being said, I think the advantages of writing technical blog posts outweigh the disadvantages. The most important thing about writing a blog is that you can keep track of your progress daily. I have been building my personal knowledge base for more than a year, and I’m pretty satisfied with what I have accumulated so far. However, there are some caveats. Off the top of my head, I need a more direct visualization of my progress. I want to see whether I pick up something new each day, which would give me a lot more motivation to push myself to learn more. In addition, some pages can get enormously long. I have a page called “Collection of algorithms”, which takes forever to scroll from top to bottom. This page is naturally suited to a blog, with each algorithm as its own post. Lastly, there is the marketing purpose. Writing a technical blog post daily is an enormous effort, and it will definitely look good to hiring managers: “Zeyuan is passionate about technology!”. Even though I could showcase my private knowledge base, that is much less convenient than a blog that everyone can access.

You know you have a wordpress blog, right?

Yeah, I know. WordPress.com is a great place to write blogs. It makes me want to write something each month. However, I have to say that writing technical posts on wordpress.com is sort of painful. In general, if you need to insert code, you need to work with the “HTML” editor and use

[sourcecode language="csharp"]
//your code comes here
[/sourcecode]

This format is incompatible with reStructuredText, which means the content written here cannot be rendered by the Sphinx-doc engine. That essentially reduces the portability of the post to a minimum. In addition, I ran into a weird bug when inserting source code. In my “Sqoop2 7 Minutes Demo with DB2” post, I use lots of SQL statements inside the code blocks. Everything worked fine the first time. However, when I tried to update the post, all the quotation marks got translated into their HTML representation under the “Visual” editing tab. I had to manually change all the quotation marks back to their symbol form.
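
For what it’s worth, that manual clean-up can be scripted. The sketch below assumes the editor replaced the quotes with standard HTML entities such as &quot; and &#39; (the exact form is an assumption on my part), and the SQL string is a made-up stand-in for the real post content. It uses html.unescape from the Python standard library.

```python
import html


def restore_quotes(text):
    """Turn HTML entities such as &quot; and &#39; back into plain characters."""
    return html.unescape(text)


# Made-up example of a statement mangled by the editor.
broken = "SELECT * FROM t WHERE name = &quot;sqoop&quot; AND tag = &#39;db2&#39;"
fixed = restore_quotes(broken)
print(fixed)
```

One caveat: html.unescape decodes every entity it recognizes, including &lt; and &amp;, so it is only safe to run over the code blocks themselves, not over the whole post.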

On the other hand, a wordpress.com blog works best with photos. I really enjoy how wordpress.com manages the photos in my “Trip in Nan Jing” and “First Time Ever Hackathon” posts.

What should I use?

This is the question that I have been looking into for months. Let me list my expectations first:

  • Support reStructuredText
  • Static pages should be fine and must feel LIGHTWEIGHT
  • Support basic blog functionality – tags, categories, post date

Let’s list the commonly seen options:

  • Octopress
  • hexo
  • Sphinx-doc
  • Pelican

Let’s cross out Octopress first. I don’t want to touch the Ruby ecosystem, because the workflow for developing a personal site would be completely different. In addition, there are plugins that let Octopress support reStructuredText, but I figure they don’t offer the full capability that Sphinx-doc does.

Hexo is another beast. It is similar to Octopress in the sense that both use Markdown as their main language, and it is rooted in the front-end world. I admit Node.js is a pretty cool platform and the themes provided by Hexo look amazing. However, it feels too modern to me, and I want to take a slightly more conservative path.

I want to talk about Sphinx-doc and Pelican together. Sphinx-doc is the tool used to write the Python documentation, and Pelican is a Python static site generator that also supports reStructuredText but is geared towards blog posts. Tweaking Sphinx-doc into a blog site requires a fair amount of Jinja2 knowledge and some JavaScript knowledge. I don’t have enough time for that, which makes Pelican very tempting. However, I don’t want to use it for now, as its workflow and the way you control its behavior are quite different from Sphinx-doc. It is definitely worth revisiting in the future, though, simply because of the size of its community and its plentiful themes.

The engine I chose is called Tinkerer. It’s a direct derivative of Sphinx-doc: the workflow doesn’t deviate much from the standard Sphinx-doc workflow, and the setup is much the same. However, the latest release dates back to 2014, which worries me a little. But let’s stick with it for now. I can easily switch to Pelican whenever I want to.

What does it look like right now?

I’m not going to abandon this wordpress blog. However, I’m going to adjust the direction of the posts on this site. From now on, I will keep technical blog posts on the Tinkerer-powered site. By technical, I mean posts that involve code, mathematical expressions, and content that will eventually become part of my knowledge base docs. On the other hand, all the life thoughts and other non-technical posts will still be kept on this site, especially posts that involve lots of non-technical pictures. zhu45.org will still work as the main portal to my personal site until the new Tinkerer-powered site is GAed.

 

Under Construction (part 1/2)

Well, it’s almost the end of November and I haven’t posted anything yet this month. I feel pressured. Lots of things are actually going on this month, involving my cat, my future career path, and some mental struggles. I’m not gonna write about them right now because it’s not the right time. Hopefully, I can write about them someday in the future. I cannot promise this.

What I can actually write about today is the “little” project I just started: building a site. Let me take a break here and give a quick recap. Up till now, building a site or, more specifically, a personal website seems to have become my life-long effort.

Everything gotta start from somewhere …

If you are willing to take a quick look, Welcome to Zeyuan Hu’s World! is actually my starting point. I built this site during the summer of my sophomore year at college. At that time, I had been granted a summer research position, and my main duty was to help a professor run computer simulations of some mathematical models. I had just finished my first CS course, CS 302: Introduction to Computer Programming with Java, and I had heard from my uncle about this Python language. I have no clue why, among all those programming languages, Python caught my eye at that time: probably because of its cool name. So, I decided to take a serious shot at learning the Python programming language, both for my research job and for my personal interest.

It’s pretty damn hard to pick up a natural language by just skimming a textbook. The same applies to programming languages. So, I decided to take notes while I studied Python. Taking notes on paper is an option, but paper is really hard to carry around and easy to lose. When I read the Python API docs, I was pretty impressed by the elegance of the official Python website and how well-organized it keeps everything. So, I decided to use Sphinx-doc to make my own site, with the main purpose of hosting my Python study notes. That’s how everything got started.

Making this site allowed me to pick up CSS, reStructuredText, Python, and Matplotlib (which I used to make the header image of the site). Judged by my current standards, it’s still a solid website. Even though the aesthetic level may not be high, the site serves my main purpose well: I got a well-written set of Python study notes for myself. The downside of this site comes from my personal feeling. I’m definitely not a front-end expert, but I can tell this site feels quite HEAVY. There are multiple tabs (e.g., home, projects, cv/resume, and so on), but the content is sort of hidden, in that people have to click through the tabs to find it. This would work great if I had accumulated tons of material to show. However, as we both know, this is not the case. Another point is more of a pity: I can no longer maintain this site. My wisc computer lab account got deleted after I graduated. I’m still grateful that the school keeps hosting my site; that’s the most moving policy I have ever encountered.

If you are interested, the source code for this site is hosted on Github. Feel free to check it out; there is some advanced reStructuredText and Sphinx usage in the repo.

Struggles …

My second journey with my site started shortly after I began to work at IBM. At work, I felt the need to build a personal knowledge base, mainly because of the amount of new information you need to keep track of for future use. I feel this is super important for someone who has just joined a team and is eager to contribute to the project. At first, I used Sphinx-doc to build up my personal docs like I did previously. This doc, or site, is not available to the public, because there is no clear line between confidential and non-confidential information. For instance, when you record what you gained from reading code, a C++ technique on its own is perfectly fine for the public, but once we put it in the context of the DB2 source code, the technique combined with that source code is completely confidential. I don’t want to cross the line here. However, every coin has two sides. If some knowledge is completely fine for the public, then I want to share it. So, building a website once again became a task that I needed to finish.

I don’t want to get seriously engaged with web development (even though I gave it a try), and on my list of priorities, recording knowledge easily and neatly comes first. I don’t want to spend tons of effort working on the web, as understanding databases and mastering algorithms sound more fun to me.

Under these criteria, I made some prototypes. The first one was built upon the previous iteration (the Sphinx-doc site) but incorporated the Bootstrap framework to make the whole site mobile-friendly and give it a modern taste. This one was definitely better than my previous one. I liked it a lot when I first got the prototype out. However, I quickly felt the pain or, more precisely, the pressure of the writing. The main purpose was to document the knowledge I picked up at work, so it was never meant to be reader-friendly. I had to edit some of the writing so that everyone online could make sense of it. Plus, distinguishing between the confidential and non-confidential parts made this editing even worse. More importantly, I still think this site is HEAVY, especially for visitors in China, because it takes quite a while for everything to load properly (google-analytics, github service, and so on). So, I gave up. I’m not giving up on building my knowledge base, but I gave up on making everything public. Meanwhile, I also cleaned up the very first site and adjusted the style a little bit. Unfortunately, that site doesn’t attract me either.

Before the dawn …

The site I’m currently using is absolutely hideous to my friends (believe me, none of my friends really like it). Their main argument is that this site is just plain 70s style: it doesn’t even have tabs. That’s true. In fact, that’s kind of the style I’m seeking: everything is just there. You don’t have to dig through the site to find the things you want. Plus, it is super LIGHTWEIGHT. I think this site is perfectly fine for someone who works in academia: all the publications can be seen directly. In addition, I use a “hybrid architecture” for this site, in the sense that I use this site plus a wordpress blog to host all my writings. At that time, I couldn’t find a proper solution for building a blog site based on Sphinx-doc; I will talk more about this in part 2.

Putting everything straight out to the reader works well for academia, from my perspective. However, it becomes a downside as well: I’m working in industry. I barely publish anything. So, this site becomes really static :(. I hate actively maintaining a site, but never updating one does me no good either. This is actually why I wrote “This site is permanently under construction.” in the footer of the page. I know someday I will rebuild this site once again.

TO BE CONTINUED …

Chronological order of my work on building a homepage

First Time Ever Hackathon

I had never been to a hackathon before. In my imagination, a hackathon is like a festival for people who are passionate about creating cool stuff in a limited amount of time. A hackathon brings tons of fun if you can work with dedicated people on an interesting idea and try to turn it into reality. Luckily, the Early Professional Hire (EPH) Hackathon hosted by the China Development Laboratory (CDL) met my expectations perfectly.

Experience Recount

When I first heard about the hackathon on EPH Day 5, I was a little intimidated. This is because the solution demo you present at the end needs to fit one of the hackathon topics. The topics included business platform innovation, light-weight e-commerce reinvention, blockchain, application of Watson technology to medical services, and integration of Watson technology with 3D demonstration. All these topics are super cool and very advanced, and I barely get a chance to touch them in my day-to-day work. So, I wasn’t sure I could handle them well. However, I wanted to give this hackathon a shot: not only because it is part of the EPH program but, more importantly, because one of the topics really caught my eye, namely the integration of Watson technology with 3D demonstration.

This topic actually has a name: Watson Introspector. It is a cognitive tool for understanding software, answering questions, and interacting with software architecture and data flows in 3D. This topic suits my interests perfectly. It has always been a challenge for newcomers to study a code base, especially one that has evolved over several years. Which functions or data flows are involved in a certain feature of the software is the type of question we ask all the time, especially when a bug fix or enhancement request kicks in. Conventionally, there are tools that help us visualize the code path, like debug traces, UML diagrams, and so on. However, none of them is straightforward or fun to use in the context of newcomer education. I can hardly imagine anyone choosing to stare at a debug trace on Friday night over going out on a date. So, I think visualizing the code path in 3D and integrating a question answering system (like Watson) may make our software developers’ lives much easier.

After settling on the topic I most wanted to work on, the next step was to get a team. Originally, there were four team members besides me. None of them had any experience with the technologies involved in the topic. This was perfect, because neither did I. However, all of them were testers, which made my situation a little difficult. It meant that I was the only developer on the team and would take on much more responsibility than I had thought. But that was OK, because my teammates wanted to grow with me and wanted to give the hackathon a try. So, I became the captain of the team.

Everything went quite well for the first two weeks. We narrowed down the architecture of the solution: we were going to make a 3D space game modeled on the classic snake game. In study mode, the player is free to explore the 3D world, which is constructed from the classes and functions parsed from a randomly selected Java project. Inside the world, the player can tell Watson which feature he wants to learn about, and Watson returns the code path that best matches the request. The player can then spend time getting familiar with the call stack of the functions along with the purpose of each function. In test mode, the player has to visit the functions he just studied in the correct call order to win the game. Our goal in making this game was to offer a fun way to learn about a source code project, and we believed an educational game best suited our needs.

Then, on September 14th, everything changed. Just as we settled the architecture of the solution, two members decided to quit, citing limited time. I had a feeling that they would quit someday, but I didn’t expect this timing. They refused to work on the solution outside of work hours. This frustrated me quite a bit, because anyone who attends a hackathon should expect to spend a fair amount of time outside of work to finish a demo. Even worse, it meant we probably didn’t have enough resources to finish our solution. It looked like mission impossible with only one developer and one tester left on the team. But a second thought came to my mind. I serve as the president of the IBM Diamond & Ring Toastmasters Club. One important lesson I learned there is that, as a leader, the first priority is to take responsibility and get the job done no matter how difficult the situation is. In my situation, the goal was to at least finish this hackathon, and I needed to make that happen. Plus, I was not alone: I still had a teammate, Rachel, who wanted to give everything she had to succeed in the event. I just could not let her and myself down.

In the final week, we work super hard with our adviser, Trent, to get a demo working. Even during the Mid-Autumn Festival holiday, we still come to the office around 2 pm and hack through the rest of the day until 1-2 am. That is the theme for the whole week. On Tuesday, September 20th, right before the final day, we work over 30 hours, until 4 am on September 21st, fixing bugs and doing 3D modeling. Rachel and Trent live closer to the office, so they rush home to get some sleep. I, however, live really far from the company (near the southern 4th Ring Road of Beijing), so I have to take a nap on the office couch to avoid being late for the draw that decides the teams’ presentation order, which happens four hours later.

Even though we almost live inside the company, we still haven’t finished our demo on the morning of the final day. There is a performance issue during the launch phase of our game: since we talk with Watson at the same time the game assets are loading, the frame rate drops significantly. In addition, we haven’t figured out a way to grow our character’s body as in the classic snake game. This puts a lot of pressure on me because there isn’t enough time to fix everything in a nice, clean way. However, we somehow manage to finish all of it by demo time. We adopt an agile mindset: maybe we cannot fix these problems nicely, but we can definitely work around them for the demo’s sake. We construct the world objects incrementally during the loading phase: we only load the objects that the player can actually see through the camera, and we use a big skybox to hide the unloaded part of the world from the player. For the snake body problem, each time the character hits the target, we put a sphere behind the character and somehow manage to make the newly added part follow the character’s movement. Maybe the movement doesn’t mimic a snake’s body nicely, but for the sake of the demo, that’s enough.

During the demo, everything works well. Thanks to the public speaking practice I have kept up at the Toastmasters club, I delivered a successful presentation to the senior management of the lab, and we took 2nd place in this hackathon with the fewest team members among all the competing teams.

Here are some interesting stats that are worth mentioning:

  • We have 0 experience with the technology stack of the hackathon
  • We originally have 4 team members but are down to 2 halfway through the event
  • We end up with only 1 developer and 1 tester
  • We only have 1 person with sufficient programming experience
  • We work late into the night for at least 3 days
  • The longest non-stop hacking lasts for 30 hours
  • We consumed 50+ bottles of water and bags of snacks
  • We watched 60+ hours of video tutorials on YouTube and safaribooksonline.com
  • We write 2500+ lines of code for the demo

 

Hackathon Takeaway

Always remember you’re the captain

There are a couple of times when I want to quit the event. Thankfully, I don’t actually do that because I always remind myself that I’m the captain of the two-man squad. When you set a goal, you have got to do whatever you can to reach it and get the job done. Thanks to this hackathon, I can now clearly see this point.

Stay positive during the difficult situation

There are low points during the hackathon. We face a technology stack we have never actually experimented with before entering the contest; two of the original team members leave the team; we are unfamiliar with the development tools; we debug code late into the night ... All these things can drag morale down pretty quickly. However, I’m the leader, and I cannot let that happen. So, even in these difficult moments, I try to call the team together for a short break, and we entertain ourselves by coming up with jokes or having some random chats. These techniques work amazingly well because we don’t feel stressed and can actually enjoy the whole problem-solving process. Without the fun, the hackathon would never have been the same to me.

Don’t be afraid of making the tough call

As the chief solution architect of our demo, I sometimes have to make tough calls, especially when both options look tempting. For instance, implementing the game like the classic snake game and implementing it like Super Mario are both good options. However, once we take time and other resource limits into consideration, the two options are no longer equal. Super Mario has its advantage in 3D exploration, while the snake game reflects the core idea of our solution: make the body grow with the code path. I have to make the tough call on which path we pursue, and I have to say, it’s never easy.

Motivation and hard work are the keys to success

Motivation gives you the courage to take your first action, but only hard work can get you to the goal. In this hackathon, I feel lucky that I followed my interest and chose Watson Introspector as my topic. My interest provides enough motivation to power through the whole event. However, I know that deep in my heart, I want to win, and I want my demo to reflect the technical expertise we have. That requires hard work. Thankfully, we don’t let ourselves down, and we work super hard towards our goal. I’m so glad we finally make it.

Public speaking is crucial

Our solution may not use all the fancy technologies that some other groups use, but I feel that our public speaking and presentation skills compensate well for this “disadvantage”. During the presentation, I use the numbers above to describe our hackathon experience, in a speech style modeled on Steve Jobs and our beloved CEO, Ginni Rometty. This delivers a concrete message to the audience: we came here to win, and we deserve it. I want to give the judges the feeling that we are 120% confident in our solution and, more importantly, proud of it.

All in all, this hackathon has become one of the moments that I’m extremely proud of myself as an IBMer and I do learn a ton from it: not just technical stuff but how to be a leader as well.

I want to end my post with the slogan of our project:

Evolve ourselves, beyond the limit!

Understanding shebang + eval combo from ancient perl script

During daily development, we have a collection of scripts that help us automate mundane tasks. Most of them are written in perl and quite often, I feel ashamed to be a programmer who only knows how to use those scripts without ever looking at their source code. I don’t know perl by any measure, so recently I decided to take up this challenge: I quickly went through the basics of the language (with the help of this nice tutorial: Learn Perl in about 2 hours 30 minutes) and dived directly into some scripts to start reading. Then I met this daunting code chunk at the very beginning of a perl script:

#!/usr/bin/perl
eval 'exec perl5 -S $0 ${1+"$@"}'
   if 0;

So, I spent some days digging and finally figured this code chunk out. I’ll try my best to explain it in a newbie-friendly way (because I am one of them :)).

History

Usually, people do not need to know whether a script should be executed by Perl, a shell, or some other interpreter. In other words, they can execute the script by typing its filename, and the script itself will find the right interpreter to run it. This is usually done with a shebang. In perl, the common way to do so is to write


#!/usr/bin/perl

at the very first line of your perl script. However, not all systems support shebang, and most likely those systems will run your script as if it were a shell script, which, of course, makes the execution fail. In this case, we need a way to tell those systems that “even if you are running the script as a shell script, please invoke the perl interpreter to interpret the content of the script”, and that is exactly what that daunting code does. Now, let’s dive into this code chunk to see how it works.

Dive in

#!/usr/bin/perl
eval 'exec perl5 -S $0 ${1+"$@"}'
   if 0;

[1]: The first line uses a shebang to invoke the perl interpreter located under /usr/bin/, which is enough for systems that support shebang to know which perl interpreter should run the script.

[2-3]: On a system that supports shebang, the system already knows the content of the script should be interpreted as perl, so lines 2-3 are treated as perl. Since a line break (“carriage return”) is the same as whitespace in the perl world, lines 2-3 form a single statement, eval '...' if 0;, which never gets executed because of if 0. On a system that doesn’t support shebang, however, the whole script is treated as a shell script, so line 1 is treated as a shell comment and ignored. Then the system continues to run lines 2-3 as shell commands. One important difference between perl and shell (e.g., bash) is that perl sees line 3 as a continuation of line 2 (because a line break is the same as whitespace), but the shell does not. So, the shell first executes line 2 on its own. For now, let’s just say that line 2, executed by the shell, tells the system “re-run the whole script under perl, and here is where to find the perl interpreter” (we will investigate this in detail in the following section). So now our goal is achieved: a system that doesn’t support shebang uses the perl interpreter specified on line 2 to re-run the whole script as a perl script. During the re-run, lines 2-3 are ignored by perl.

Line 2: what the heck is this?

Now, let’s study code from line 2 in detail.

eval 'exec perl5 -S $0 ${1+"$@"}'

The eval in shell takes a string as its argument, and evaluates it as if you’d typed that string on a command line. So, the shell actually executes

exec perl5 -S $0 ${1+"$@"}
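Before unpacking the arguments, a quick aside on eval and exec themselves. The following is only a minimal sketch for a POSIX-like shell; the variable names (greeting, cmd, eval_out, exec_out) are made up for this illustration and do not come from the original script.

```shell
# eval: build a command as a string, then run it as if it had been typed.
greeting="hello"
cmd='echo "$greeting from eval"'   # single quotes keep $greeting unexpanded for now
eval_out=$(eval "$cmd")
echo "$eval_out"

# exec: replace the current shell process with another program.
# The command substitution runs in a subshell, so only that subshell is
# replaced, and the second echo is never reached.
exec_out=$(exec echo "replaced by exec"; echo "never reached")
echo "$exec_out"
```

As far as I understand, this is also why the shell never complains about the if 0; on line 3: once exec hands the process over to perl, there is no shell left to read the next line.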

$0 is expanded by the shell to the name of the script. However, ${1+"$@"} looks quite mysterious. It works around an ancient Bourne shell bug (when no arguments are provided, "$@" expands to one empty argument instead of nothing), and the article What does ${1+"$@"} mean explains it very clearly:

The ${1+"$@"} syntax first tests if $1 is set, that is, if there is an argument at all.
If so, then this expression is replaced with the whole "$@" argument list.
If not, then it collapses to nothing instead of an empty argument.

A side note on ${1+"$@"}: it follows the ${parameter+alt_value} pattern. If parameter is set, use alt_value; otherwise, use the null string. See more on it here.
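To see the pattern in action, here is a small sketch that can be run in a modern POSIX-like shell. The function names count_args and forward and the variable demo_var are hypothetical helpers invented for this example. Modern shells no longer have the old bug, so this only shows what the expansion produces, not the bug itself.

```shell
# count_args prints how many arguments it received.
count_args() {
    echo "$#"
}

# forward passes its own arguments along using the ${1+"$@"} idiom.
forward() {
    count_args ${1+"$@"}
}

forward              # no arguments: the expansion collapses to nothing, prints 0
forward a "b c"      # two arguments, the space inside "b c" is preserved, prints 2

# The general ${parameter+alt_value} form tests whether the parameter is set.
unset demo_var
echo "unset: [${demo_var+alt}]"   # prints: unset: []
demo_var=""
echo "set:   [${demo_var+alt}]"   # prints: set:   [alt]  (set, even though empty)
```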

So now we can put all the pieces together: when the shell executes line 2, the perl program (i.e., perl5) is invoked to execute the script itself (expanded from $0), and it is given the argument list, which may be required by the script.

Example

Let me give an example.

Suppose we have a perl script named foo:

#!/usr/bin/perl
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -S $0 ${1+"$@"}'
    if 0;

use Config;
my $perl = $Config{perlpath};

print $perl."\n";

Besides the daunting code chunk, the rest prints out the absolute path of the perl interpreter that executes our script. Now, let’s try out different ways of executing our perl script foo:

$ perl foo
/usr/bin/perl
$ ./foo
/usr/bin/perl
$ sh foo
/wsdb/oemtools/linux/bin/perl5.16.2

In the first two cases, we run the perl script in the standard way. Since my system (SUSE Linux 11) supports shebang, the script gets executed by the perl interpreter specified in the shebang line. However, if we mimic a system that doesn’t support shebang by executing our script with a shell (i.e., sh), the script is still interpreted as a perl script, but by the perl interpreter from the eval part. Notice the sh usage here: sometimes the user of a script assumes it is written in shell and tries to execute it with sh. In this case, our daunting code chunk provides a defensive mechanism that allows the perl script to be executed correctly even when it is run by a shell.

With this explanation, I don’t think the code chunk in find2perl will look daunting to you anymore:

#! /usr/bin/perl -w
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0; #$running_under_some_shell

 Modern days

Nowadays, people rarely use that daunting code chunk solely because some systems don’t support shebang. Increasingly, it appears when people want to use a specific version of perl (not the system default one like /usr/bin/perl) and, at the same time, maintain some portability to systems that don’t support shebang. However, if all we want is to use a specific version of perl instead of the default one, there is more than one way to do so. My list may not be complete. Please feel free to comment below if I miss some way of invoking a customized perl.

#!/usr/bin/perl
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -S $0 ${1+"$@"}'
    if 0;

The first way is again our daunting code chunk. If we run our script with sh, our customized perl (i.e., /wsdb/oemtools/linux/bin/perl5.16.2) is executed. The -S option makes perl use the PATH environment variable to search for the script, because on some systems $0 doesn’t always contain the full pathname of the script. You can read more about the -S option in the perlrun doc; in fact, the daunting code chunk is also explained there.

The second way is to put the following code at the first line of perl script:

#!/wsdb/oemtools/linux/bin/perl5.16.2

This way you directly hardcode the customized perl interpreter in your script. This may sacrifice portability of the script.

Another way to use a customized perl interpreter is to put this code chunk, again, at the first line of the script:

#!/usr/bin/env perl

This tells the system (one that understands the shebang) to find the first “perl” executable in the directories listed in $PATH. If you want to run your customized perl interpreter this way, put its directory at the beginning of the $PATH environment variable, so that the first “perl” executable the system finds in $PATH is indeed the perl interpreter you want to use.
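The PATH lookup can be tried out without touching any real perl installation. The sketch below is an illustration only: it creates a throwaway directory with a stand-in executable named perl (just a tiny shell script, not a real interpreter), prepends that directory to PATH, and checks which “perl” is found first. The names demo_dir, old_path, resolved, and env_out are made up for this example.

```shell
# Create a throwaway directory holding a stand-in executable named "perl".
demo_dir=$(mktemp -d)
cat > "$demo_dir/perl" <<'EOF'
#!/bin/sh
echo "custom perl stand-in"
EOF
chmod +x "$demo_dir/perl"

# Prepend the directory so it is searched before the system locations.
old_path=$PATH
PATH="$demo_dir:$PATH"
export PATH

# command -v reports which executable a bare "perl" now resolves to.
resolved=$(command -v perl)
echo "$resolved"

# env performs the same PATH search that "#!/usr/bin/env perl" relies on.
env_out=$(env perl)
echo "$env_out"

# Clean up: restore PATH and remove the throwaway directory.
PATH=$old_path
export PATH
rm -rf "$demo_dir"
```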

The last way to run your customized perl interpreter is somewhat similar to our daunting code chunk, but with a significant difference:

#!/bin/sh
#! -*-perl-*-
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -x -wS $0 ${1+"$@"}'
    if 0;

Let’s run it first to see what we get. As in the previous example section, we put the above code chunk inside a script called bar:

#!/usr/bin/sh
#!-*-perl-*-
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -x -wS $0 ${1+"$@"}'
    if 0;

use Config;
my $perl = $Config{perlpath};

print $perl."\n";

$ bar
/wsdb/oemtools/linux/bin/perl5.16.2
$ perl bar
/wsdb/oemtools/linux/bin/perl5.16.2
$ ./bar
/wsdb/oemtools/linux/bin/perl5.16.2
$ sh bar
/wsdb/oemtools/linux/bin/perl5.16.2

No matter how we execute our script, we always use our customized perl interpreter, even when the system perl is explicitly specified (i.e., perl bar). The significant difference from our original daunting code chunk is the use of the -x option. The -x option does the following:

tells Perl that the program is embedded in a larger chunk of unrelated text, such as in a mail message. Leading garbage will be discarded until the first line that starts with #! and contains the string “perl”. Any meaningful switches on that line will be applied.
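The skipping rule quoted above can be imitated with a few lines of shell and awk. This is only a rough emulation of which lines get discarded, not how perl implements -x, and the sample text and the names sample and kept are invented for the illustration.

```shell
# Hypothetical input: a mail-like preamble followed by an embedded perl program.
sample='Subject: script attached
#!/bin/sh
#! -*-perl-*-
print "hello\n";'

# Drop everything before the first line that starts with "#!" and contains
# "perl", then keep that line and everything after it.
kept=$(printf '%s\n' "$sample" | awk 'found || (/^#!/ && /perl/) { found = 1; print }')
printf '%s\n' "$kept"
```

Note that the #!/bin/sh line is dropped along with the preamble, because it starts with #! but does not contain “perl”.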

Let me walk through what exactly happens in our case. We will also use the following information taken from perldoc during the walkthrough:

 If the #! line does not contain the word “perl” nor the word “indir”, the program named after the #! is executed instead of the Perl interpreter. This is slightly bizarre, but it helps people on machines that don’t do #! , because they can tell a program that their SHELL is /usr/bin/perl, and Perl will then dispatch the program to the correct interpreter for them.

We launch our bar script as ./bar:

1. Shell executes our script ./bar
2. The system actually executes /bin/sh ./bar because of our shebang specification.
3. sh executes /wsdb/oemtools/linux/bin/perl5.16.2 -x -wS bar
4. /wsdb/oemtools/linux/bin/perl5.16.2 skips:

#!/usr/bin/sh
#!-*-perl-*-
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -x -wS $0 ${1+"$@"}'
    if 0;

and executes:

use Config;
my $perl = $Config{perlpath};

print $perl."\n";

Let’s break down this step into further detail:

4.1 Since -x is specified, perl discards the leading lines until it reaches the first line that starts with #! and contains the string “perl”. The first line, #!/usr/bin/sh, is a shebang but doesn’t contain the string “perl”, so it is skipped, and #!-*-perl-*- marks where the perl program begins.
4.2 The eval statement that follows is parsed as perl but never executed because of if 0.
4.3 So, the execution starts with use Config; and moves forward.

Let’s try launching our bar script using perl bar to see why the system perl is not used in this case:

1. Shell executes perl bar
2. perl (i.e. /usr/bin/perl) executes /bin/sh bar because it sees a shebang that doesn’t contain the word “perl”
3. The eval part gets executed (i.e., sh executes /wsdb/oemtools/linux/bin/perl5.16.2 -x -wS bar)
4. So our script bar is executed by /wsdb/oemtools/linux/bin/perl5.16.2 instead of /usr/bin/perl

Let’s Practice

Based on what we have learned, you should not have much trouble understanding why

#!/bin/sh
eval 'exec /wsdb/oemtools/linux/bin/perl5.16.2 -wS $0 ${1+"$@"}'
    if 0;

will lead to

/bin/sh: -S: invalid option

error. The key is that we are using neither 1) a #! line containing the word “perl” nor 2) the -x option. If you have a hard time finding out why, here is the answer.

Thanks for reading!

Linux Fork Bomb

Today, I learned about a fun shell trick called the Linux fork bomb, and this is the piece of code I’m reading about:

:(){:|:&};:

Code Analysis

Let’s dive into this and have a little appreciation of the power of shell:

  • :() defines a function called :
  • :|: & runs the function :, pipes its output to another call of :, and runs the pipeline in the background
  • {...} indicates that whatever is inside is the body of the function :
  • the final : calls the function for the very first time

Essentially, you are creating a function that calls itself twice on every call and has no way to terminate itself. It keeps doubling until you run out of system resources.
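The doubling can be observed safely with a bounded variant. The sketch below is not the bomb itself: the function bounded, its depth argument, and the temporary log file are all made up for this illustration, and the recursion stops once the depth reaches 0. The pipeline is also run in the foreground so the script waits for it to finish.

```shell
# For comparison, the bomb with a readable name would look like this
# (kept as a comment on purpose, do not run it):
#   bomb() { bomb | bomb & }; bomb

log=$(mktemp)

# bounded DEPTH: record one call, then pipe into itself like the bomb does,
# but only while DEPTH is still above 0.
bounded() {
    echo x >> "$log"
    if [ "$1" -le 0 ]; then
        return 0
    fi
    bounded $(( $1 - 1 )) | bounded $(( $1 - 1 ))
}

bounded 3

# Each level doubles the number of calls: 1 + 2 + 4 + 8 = 15 for depth 3.
calls=$(( $(wc -l < "$log") ))
echo "total calls: $calls"
rm -f "$log"
```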

Some fun observation

: is used as a placeholder in shell. For instance, while true is the same as while :. However, the bomb may only work in bash, because : is a built-in command, and in some shells the built-in : takes precedence over a function named :. In those shells, when we execute our bomb, the built-in : gets executed instead of our function, so the bomb is defused.

Here you can also find some insights on how to prevent a fork bomb like this. It involves RLIMIT_NPROC, and it is definitely worth digging into further.

You can watch a live demo and see how powerful the linux fork bomb can be.