Job hunting lessons learned

This post collects the lessons I have learned while job hunting. I’m still looking for an internship and a job. That’s good in one sense: it means this post will be updated frequently for the foreseeable future.

  1. Always attend career fairs. At UT, if you are a CS student in good standing, you can get an invitation to an event called FOCS Career Night. There will be a lot of company representatives. But be careful: most of them are actually engineers or UT students (that’s right, some companies have their Campus Ambassadors attend the event as if they were recruiters). There is a huge difference between recruiters and engineers: recruiters make the call on who gets an interview, not engineers! I made the mistake of attending only the FOCS Career Night and skipping the Career Fairs. In fact, recruiters do come to the Career Fairs, and some of them do on-campus interview signup on the spot. An on-campus interview is much better than an OA (online assessment). Even if you got an invitation to schedule an interview after FOCS Career Night, you still want to talk to the company at the Career Fairs, because recruiters can barely check their email while traveling and interview slots are always first come, first served. So always make sure to come to the Career Fairs and schedule an interview immediately, instead of replying to the invitation email, waiting for the response, and then getting one that says the interview slots are all filled. This happened to me with Indeed.
  2. Always do the OA immediately. When you receive an OA, the company will usually tell you that you can finish the test within a certain number of days. However, things can change rather quickly. Even if they give you a buffer, such as finishing the test within 4 days, ignore the message and do the OA immediately. Slots can fill quickly, and some companies have an unwritten rule: even when they say 4 days, they really mean immediately. This happened to me with Dropbox.
  3. Always finish the OA within min(60 minutes, the time limit). Some companies allow you to finish the OA over several days. In other words, even after you start the test, you have a couple of days to finish it. Ignore this, please! Even though the test window lasts for days, finish it as quickly as you can. The finishing time is taken as a strong indicator of your coding ability. This happened to me with Twitter.
  4. Always follow up with the recruiter. Sometimes there is a system error and they send a rejection email to the wrong person. Make sure you confirm it with the recruiter, and finish your OA no matter what happens. Even if you did get rejected, the OA is still an invaluable practice opportunity. This happened to me with Dropbox.
  5. Always make sure you apply in the University Recruiting section. Companies make specific web pages for new and recent graduates. Make sure you submit your resume there. If you submit it to the wrong place, you may end up in a pool filled with professionals with 5+ years of experience. That always leads to either hearing nothing back or an immediate rejection letter. This happened to me with Dropbox.
  6. Use LinkedIn and be aggressive. I’m a shy person, but job hunting, as the name suggests, is hunting. You have to be aggressive. Connect with as many people as you can, whether through Career Fairs, social events, or LinkedIn InMail. Be polite and be bold. Ask them for the opportunity. One special note: you may want to “harass” recruiters and senior developers on LinkedIn. Their words carry much more weight, and you may get an interview very quickly. If they are rude when you politely ask for a favor, you already know that this company is definitely not one you want to work with. This happened to me with Teradata (BTW, they were on the polite side).
  7. Prepare for the technical interview questions:
    1. The interviewer may slightly modify a question even if it comes from LeetCode. For example, instead of asking what exactly the shortest path is, as in the original LeetCode question, the interviewer may ask how many steps the shortest path has. The difference is that the former expects a list of coordinates (i.e., the steps) and the latter expects simply a number. This happened to me with Pocket Gems. The takeaway is that when you solve LeetCode questions, you should think about what the possible variations might be. It may seem infeasible to do this for every problem. You don’t have to, unless you have nothing else to do. The next point helps address this concern.
    2. Browse forums for recent interview questions from the company you are about to interview with. This helps address the previous concern. If you see that the company asks certain LeetCode questions, look at those questions and think about the possible variations. Also, a company usually has a pool of questions, so by getting some prior exposure from a forum you may already be well prepared. This point also helps if you are very short on preparation time. In that case, you just prepare the questions from the forum and you’re good to go. Sometimes this works much better than a long-term preparation strategy, with which you may feel over-prepared and feel that a good chunk of time was wasted on LeetCode when you could have simply prepared the questions from the forum.
    3. Practice on LeetCode. People usually emphasize the importance of practicing on LeetCode. That’s true. However, how much it helps depends on when you apply for the position. For new grad recruiting, some companies prefer to start early (e.g., in the fall) and others don’t (e.g., from January to March). People always think they should apply as early as possible to get a spot in the limited headcount. That’s true, but this strategy comes with a risk: you’ll see new interview questions that nobody else has seen before. Each year, companies may update their pool of questions. If you can solve LeetCode problems as easily as “1+1” and have solid preparation in system design, then applying ASAP is the best strategy. However, if you are in an OK-ish position in algorithm and design preparation, then you may want to delay applying by a week or two. The beauty of the delay comes directly from the previous point: you may get exposure to the pool of questions before the actual interview. How long to delay depends on the case. Some companies (e.g., Dropbox, Pocket Gems) are quite active and send you the OA almost immediately after you apply, so with them you may want the delay on your side. However, some companies have a long process before they set up any interviews; in that case you may want to apply ASAP and let the internal processing take its time.
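
The variation described in point 7.1 (returning the shortest path itself versus only its number of steps) can be sketched as follows. This is a minimal illustration, not any specific LeetCode problem or actual interview question: the grid encoding (0 for open, 1 for wall) and the function names `shortest_path` and `shortest_path_length` are hypothetical. It assumes the start cell is open.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS on a grid of 0 (open) and 1 (wall). Returns the list of cells
    from start to goal, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links backwards to recover the actual path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

def shortest_path_length(grid, start, goal):
    """The variation: only the number of steps, or -1 if unreachable."""
    path = shortest_path(grid, start, goal)
    return -1 if path is None else len(path) - 1

# Hypothetical example grid: the only route goes around the wall in row 1.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(shortest_path(grid, (0, 0), (2, 0)))
print(shortest_path_length(grid, (0, 0), (2, 0)))
```

Once the first version is written, the second is a one-liner, which is why thinking through variations while practicing is cheap. If only the length is needed, the `parent` map could be replaced by a distance counter to save memory.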

Leaving IBM

To be honest, this is probably the most difficult post I have ever written. This is mainly because there is a ton of stuff I want to say, but I’m unsure whether I should make it public or keep it to myself. Another factor that makes this post hard to write is how long the drafting has taken. I have been drafting this post since April 2016, right after I decided to start the whole quit-IBM-and-get-a-PhD project. I used to use this post as a log to record things and feelings whenever something happened around me at IBM. Frankly, when I look back at what I recorded (mostly rants), a lot of it still holds, but the anger has simply faded with time. So the year-long drafting makes me hesitate even more, because the mood in which those things were written is gone. However, two years is a significant amount of time, quitting IBM can be called “the end of an era”, and I should give closure to my happy-and-bitter experience with IBM anyway. So, here it goes.


Thank you, IBM!

I’m really thankful for the opportunity to work at IBM. This experience really made me grow both technically and mentally. Technically, I had the opportunity to get hands-on experience with DB2 development. DB2 as a database engine is extremely complex. It has over 10 million lines of code, way beyond the scope of any school project. Working on a project like this is quite challenging because there is no way to get a clear understanding of every part of it. I still remember that when I attended the new hire education on DB2, one guy said: “I have been working on the DB2 optimizer for over 10 years but I cannot claim with certainty that I know every bit of the component I own.” This really shocked me, and based on my experience so far, his claim still holds, but with one subtle assumption, which I’ll talk about later. Lots of tools are developed internally, and reading through both the code and the tool chains is a great fortune for any self-motivated developer. I picked up lots of skills along the way: C, C++, Makefile, Emacs, Perl, Shell, AIX and many more. I really appreciate this opportunity, and I feel my knowledge of databases and operating systems has grown a lot since I graduated from college.

Mentally, there were also lots of gains. Being a fresh grad is not easy. Lots of people burn out because they are like people who are learning to swim and get thrown into the water: either swim or drown. I’m lucky that my first job was with IBM because the atmosphere was so relaxed: people expect you to learn on your own, but they (the majority of them) are also friendly enough to give you a hand when you need help. I still remember that my first customer ticket was a severity one issue, which requires you to update the customer on your progress daily. There was a lot of pressure on me because I really had no clue about the product at the very beginning. I’m thankful to those who helped me at that time and in many difficult moments afterwards. That made me realize how important it is to be nice and stay engaged with the people around you, because no matter how good you are with the technology and the product, there is always stuff you don’t know. Staying engaged with the people around you may help you get through difficult moments like this by giving you a thread that you can at least start pulling. In addition, participating in the Toastmasters club really improved my communication and leadership skills and, more importantly, I made tons of friends inside the club. Without working at IBM, I probably wouldn’t even know that Toastmasters exists. If you happen to follow my posts, you’ll see that a lot went on around me while I worked at IBM. Every experience you go through offers you a great opportunity to learn and improve yourself. Some people may look at them as setbacks, but I look at them as opportunities.


(The picture on the left shows all the comments people gave me about my speeches, and the one on the right shows the awards I earned in the club over these two years.)

With the help of all these experiences, I have developed the good habits of writing blogs (both technical and non-technical), reading books, and working out six days a week. None of this would be possible if I worked at a place where overtime is common. I’m very thankful to IBM for this, because staying healthy both physically and mentally is critical for one’s career. Even though these things don’t come directly from IBM, IBM does provide the environment that nurtures them.


IBM has its own problems, and they are centered around people. There are many things I want to say, but I think I’ll keep them to myself. Instead, I want to make my point with a picture:


I don’t know why IBM’s term “resource action” for firing employees and the sentence “IBM recognizes that our employees are our most valuable resources.” bother me so much. I probably just hate the word “resource” as a way to describe people directly, and how much this word gets thrown around at IBM. I know everyone working for a big corporation is just a cog in a machine. However, what I feel, based on lots of things that happened around me, is that IBM, with its attitude as represented by its first-line managers (because those are the people I commonly work with), makes this fact very explicit. It hurts, to be honest. No matter how hard you work and no matter how many prizes you have earned for yourself and your first-line manager, you are nothing more than a cog in a machine, one not worth a high price to keep around because there are many cogs behind you ready to replace you. They are much cheaper, much younger, and can work more or less like you, because your duty in the machine is so precisely specified that it doesn’t really depend on how much experience you have under your belt. To me, that’s devastating.

This leads to the problem that talented people are reluctant to stay with the company. My mentor and the people who were so good with DB2 have bid farewell to the team. That’s really sad to me because they are the true assets of the company and the product. The consequence is that crucial knowledge leaves with people. Some quirks in the product are known only to certain people, and once they leave the company, the knowledge is gone with them. That makes mastering the product even harder. That’s the subtle assumption the speaker at the new hire education was making: that the people who hold the knowledge stay around. It’s also part of the problem of working with legacy code. The whole legacy code issue is worth another post, but one thing I now strongly believe is that any technical problem has its root cause in company culture and management style. As for me, I’m not a guru now, and I cannot see a way to become one in my current position, which scares me the most.

That’s it for this section and I’ll leave the rest to my journal.

Thoughts on PhD


This post serves as a record of my thoughts about a PhD. It comes from a person who is about to embark on the journey of getting one. The thoughts here may look stupid or naive to someone who has already gone through that phase. However, based on my past experience, if you don’t have a baseline for something you decide to fight for, you can barely measure how much you have progressed once you actually start fighting, and you are highly likely to fall into the same trap over and over again in similar situations in the future.

Let’s dive in …

I have been considering getting a PhD since my sophomore year at university. This page summarizes some commonly seen motivations for getting a PhD. I think mine can be partly described as “Dr. Hu - sounds cool!” and “Eternal quest for knowledge (yeah right!)”. Another part, I guess, is probably the childhood dream of becoming a scientist. However, now, after two years of working in industry, I realize that my most important motivation for getting a PhD is that I want the ability to solve problems that nobody has explored before. This is different from being a quick learner, because a quick learner quickly grasps material that has already been studied, but that doesn’t mean he can handle unexplored areas well. I like to ask “why” when I face a problem, but I have gradually realized that I don’t have enough knowledge and, more importantly, the confidence to work through some of the crazy ideas in my mind. So, to me, getting a PhD means building a good knowledge foundation in a specific area and possessing the ability to tackle any open question, even one from an area I haven’t explored before.

However, getting a PhD is a non-trivial task that demands a lot of commitment. I personally view a PhD and marriage as the two most serious commitments a person can make in his entire life. I tried to play the “rational” card here by doing some evaluation beforehand, because I’m a do-something-that-can-succeed type of person, and several years of studying economics made me more and more “risk-averse”. So I did various RA jobs in various departments (i.e., Math, Psychology, Biostatistics) to get a sense of what PhD life might look like. The result was not good for me, because I found that working on topics you have no interest in can be a lot like being in jail. However, that undergraduate research experience also had its positive side. Imagine if I had jumped directly into graduate school and chosen a direction I had no interest in, such as medical imaging: I’m sure I could not have survived to the end. Sometimes I feel that decision making can go two ways: one is to pick something that fits you from the pool; the other is to get rid of the choices that definitely do not work for you and then see what is left in the pool. Apparently, for me, the latter strategy works slightly better.

So, I chose to work in industry for two years to find the things I am passionate about.
There wasn’t much left in my choice pool by that time: I would pick either AI or systems. For AI, my focus is mainly on the application of ML techniques, such as CV and NLP. For systems, my choice is distributed systems (i.e., distributed databases and distributed file systems). So I need to think carefully about the pros and cons of each track. Like I said in my offer choice post, there can hardly be a perfect choice that meets your needs by every measure. Preference ranking in economics may be too idealized. There is always a trade-off. After spending two years working on databases, I realize that I’m not really a hardcore systems guy. The most attractive feature of systems is that I can do lots of coding. The coding here is naturally different from coding in, say, ML. In systems, most of the coding involves implementing data structures, data processing models, and so on. For ML, however, coding is more like a direct translation of mathematical formulas. The problem I find with systems research is that it is hard to propose problems that link directly to industry-level production. This became clear to me after I attended DTCC 2017 last week. The key element of success in building a system is a production pain point or user scenario. Alibaba and Tencent build systems precisely to cater to their specific business scenarios. In my view, a system has value once it solves specific problems that do not come from someone’s imagination. This can be very hard for newcomers who have just joined academic circles. In this case, the advisor may work like an offer manager who regularly visits companies to see what kinds of problems they are trying to solve, brings those practical problems back to the research group, and hopefully has his students resolve them. Research is all about solving problems, and great research comes from problems that have, or potentially have, great impact on industry or people’s daily lives.

In addition, when I recall the fun courses from my undergraduate years, I realize that I have much more fun manipulating formulas and working out problems that have a strong connection with people’s daily lives. The biggest trend right now is big data. However, to be honest, as a database system developer I can barely get in touch with actual big data in my daily work. So whether the system I build is robust enough to handle actual big data, I don’t know. The only thing I can say is that I implemented the design correctly. So I feel it is really hard to work out a good system while spending most of the time in school. This idea is partially confirmed by the trend of people jumping out of academia and heading to industry, like this. However, even though I have spent almost all the space so far talking about the “problems” I have observed in systems research, I do enjoy the “traditional” programming style that systems research has. Rather than taking some data and training a somewhat black-box network to achieve an outcome, traditional if-else programming feels more rewarding for a hardcore programmer.

For AI, things can be radically different from systems. Specifically for ML, one thing I learned is that ML is used for tasks that are hard to solve with traditional programming, like autonomous driving and pattern recognition. These things have a strong connection to people’s daily lives, which means they can make a lot more impact. There is a historical pattern that is easy to observe: serving individual people is a lot more profitable than serving big companies. Doing research on systems is a lot like serving big companies, if we consider the question: who needs to build infrastructure from scratch? Working on AI, however, is a lot like serving people by making iPhones. If we observe the trajectories of companies like IBM and Apple, this analogy holds up easily. So even though programming in ML is less satisfying to me, we just need to embrace the future to better maximize our utility. Of course, quite a bit of mathematics is involved in AI, and tweaking the parameters of learning models can feel quite subjective. However, I guess those are obstacles I need to face. The rationale is the same as before: there is no perfect choice, and we just need to try even if we have only 10% confidence in success.

Last word …

The motivation for considering this issue right now is that I need to start planning my course schedule for the upcoming semester. The schedule can be balanced between systems and AI, but it can also be AI-focused. So I really need to evaluate myself to see which direction I want to go. There is a famous saying in China: “Choice matters!”

Takeaways from DTCC 2017

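Because a colleague was away on a business trip, I had the good fortune to attend the 8th Database Technology Conference China (DTCC 2017), held at the Beijing International Convention Center. This was my first time attending an industry conference, and I was really excited. I did get a lot out of the conference, and I want to record it in this post. I originally wanted to take notes in English, since English is my “mother tongue” when it comes to computing, but because the talks were mostly in Chinese, I ended up recording them in Chinese.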



Get a sense of the peers

Focusing on your own product is quite important. However, it’s even more important to see how your peers are doing. I’m not an architect yet, but I feel it’s helpful to start thinking like one and see what problems your peers are facing and how they try to solve them. In addition, by knowing how things are going with your peers, you get a measure of yourself: is the work you are doing on the same level as theirs? Are you in good shape for the job market? What skill gaps do you need to fill?

Deepen the understanding of the field

Even after almost two years of working in the database field, I still think of myself as a newbie. This is mainly because a database is arguably the most complex software people can make, and there are tons of things I don’t know. So I wanted to see, at a high level, where the field is heading and what lessons people draw from their day-to-day engineering practice. I think this may help me catch up with the masters.

AI or System?

As I disclosed in my last post, I have decided to head back to school and get a master’s degree. To be honest, my ultimate goal is a PhD in Computer Science, and I’m currently actively preparing for it. The most important question is which field I want to study. I have two options, and I have some interest in both: AI and systems. Why these two options and not others is worth a whole new post, and I don’t want to discuss it here. So my task for now is to gather as much information as possible about these two fields and see which one looks more attractive to me. This event was extremely helpful because it had talks on systems as well as on AI.

Day 1


Interpreting the Annual Theme (Cao Peng 曹鹏 – Vice President, JD Finance)


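  1. The finance field has been hit by machine learning, and more and more FinTech companies have appeared in recent years. Judging from this talk, the main application of machine learning at such companies is more precise targeting and analysis of customer groups. The talk did not cover its role in quantitative trading strategies. I have been paying close attention to applications of machine learning in finance lately, but I did not find the answer I was looking for in this talk, because in my view precise customer targeting is a generic application of machine learning and is not unique to the financial industry.
  2. A data company looks like a decent startup idea to me. The talk mentioned how important data is to JD Finance. They require not only breadth of data but also depth. An important issue is that data is highly time-sensitive and shifts between hot and cold. A customer’s purchase records from a year ago do not offer much guidance today. So JD Finance collects a large amount of data every day (~6TB) to keep the whole analysis accurate. At the same time, the speaker revealed that even so, they feel the data is still far from meeting their needs. This explains why IBM recently acquired The Weather Company and the medical imaging company Merge Healthcare: it simply wanted the data these two companies hold. It makes me wonder: would being a data vendor be a good startup idea?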

Overview of Database Development (Wu Chengyang 吴承杨 – Oracle)


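  1. Even today, after so many years of calls to “de-IOE” (move off IBM, Oracle, and EMC), Oracle’s market share is still as high as 56%.
  2. The future of databases is the cloud: the speaker used a case to illustrate the importance of hybrid cloud. The problems enterprises face now include how to effectively connect data in the public cloud with data on local servers and how to make the public cloud private. The whole talk felt more like an Oracle solutions briefing with very little technical content, but it did point out the direction of future database development: moving to the cloud.
  3. The speaker had good stage presence and was a good presenter.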

The Next Stop for Data Technology – Data Applications (Wang Tong 王桐 – 永洪科技)



How Dameng Breaks into Core Business Systems – The Product Development Path of a Domestic Database (Han Zhuzhong 韩朱忠 – Dameng 达梦数据)

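I think this was probably the most inspiring talk of the day. The whole talk was about how a small domestic vendor fought against foreign databases, won market share bit by bit, and grew into what it is today. One example was about improving the SQL optimization capability of this database, which is written in C. They once ran into a single SQL statement 3.9K lines long; in other words, pasted into a Word document it would fill more than 350 pages. It contained 17 inner joins, 557 subqueries, 831 OR filters, 1000+ selected columns, and 2731 CASE WHEN expressions. Through continuous optimization they brought this statement down from several hundred minutes to under 1 second. The other story was about how hard it is for a domestic database to survive. Because the databases of large enterprises and core industries such as banking and telecom all come from foreign vendors, domestic products simply cannot get in and can only compete in the small and medium enterprise market. Yet through persistent effort, this company finally won contracts with State Grid as well as Tibet Airlines and China Eastern Airlines. To me that is a remarkable achievement. It also made me reflect on IBM. I don’t think our DB2 could handle such a complex SQL statement without targeted optimization. This example makes me think that either we are winning customers on our reputation and past accumulation, or our DB2 pre-sales colleagues are super tryhard when doing POCs. I clearly felt the gap in effort between us and these domestic databases. Perhaps one day our positions will be swapped? I believe that is something IBM’s senior management does not want to see. We really need to work harder.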

Applying and Extending the IO Determination Feature of SSDs in Database Workload Optimization (Yang Xueshi 阳学仕 – 宝存科技)

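This talk started from storage and discussed how to use software to emulate hardware to improve read/write speed. In other words, the question it left me with is how a database can take advantage of IO determination to improve read/write speed. As far as my shallow understanding goes, IO determination lets the applications on a disk coexist more harmoniously, and gives applications the resources they need by raising application priority, setting upper and lower limits on IO resources, and optimizing the ordering of reads and writes over time. The talk also touched on how SSDs keep pace with the development of networking: with hardware improvements, the gap between writing locally and writing to a remote node over the network is now only ten-odd microseconds. To me these topics belong to the OS domain. The idea of hardware giving the DB a boost felt refreshing to me.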

Thoughts on Future-Oriented Database Architecture (Zhang Rui 张瑞 – Alibaba)


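  1. Domestic vendors and IBM differ fundamentally in how they treat databases. Domestic vendors such as Alibaba, Tencent, and Baidu all develop and rework their own databases starting from their own business pain points. Accordingly, their database changes and improvements are highly targeted, and their database architectures may not be very general. IBM, by contrast, sells its database as a product, so the design of the database itself aims to be comprehensive, large and complete, to satisfy the needs of as many user types as possible. As a result, in some scenarios IBM’s DB2 cannot be as strong as AliSQL, OceanDB, or TDB. So at very large companies, the ultimate direction for database work may be “custom-made”.
  2. Machine learning and systems are becoming more and more tightly coupled. The speaker mentioned that they want to move from automated operations to intelligent operations in the future. SQL would no longer be reviewed manually by DBAs but optimized through some form of ML. The Alibaba folks haven’t worked this out yet, but they believe it is the direction of the future.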


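In the afternoon I attended “Baidu’s NewSQL Database System”, “Tencent MySQL Kernel Optimization Explained”, “Didi Big Data Applications”, and “Applying Natural Language Technology in the Wenzhi (文智) Trend Analysis Product”. My biggest takeaway from the Baidu talk was that distributed transactional databases are very hot right now, and if you study them thoroughly, there is no problem in China you cannot get through. Another takeaway was not to worship Google’s systems too much. Although I didn’t fully follow the details, what I sensed from the speaker’s words was: black cat or white cat, the one that catches mice is a good cat. Sometimes you can’t be too academic. Moreover, even when systems share exactly the same ideas, different implementations can lead to huge performance differences.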

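The Tencent talk was very technical, and since the speaker came from a technical background, the whole session was quite a grind. My impression is that kernel optimization is a deep pit that requires very solid DB knowledge. For the last two sessions I chose ML-related talks, and I have to say they fell short of my expectations. Didi introduced scenarios where they apply some mathematical models. I felt the speaker had probably not been at Didi long and did not really explain the models in any depth. The application scenarios, on the other hand, made me feel that economists have a role to play too: for example, how to use surge pricing to balance supply and demand between drivers and riders, and how to charge for the losses that behaviors such as order cancellations cause the platform. Perhaps because public resentment runs so high, the whole Didi talk felt like a press conference. The final talk on applying natural language technology was very boring. The speaker came from a product manager background and mainly introduced how Tencent applies NLP to news. It was very general and did not mention any technical challenges in NLP, which was very disappointing.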

Day 2


  • Informix is now tightly tied to the Internet of Things (IoT)

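At IBM, my neighbors are the Informix Technical Support team. Their lead has also given a talk on Informix’s applications in the IoT field. To me this looks like finding a new point of leverage for Informix, a giant of the past, so that it can be reborn. This was confirmed by today’s talk titled “A Database Platform for the Internet of Everything Era – SinoDB”. SinoDB can be understood as a fork of Informix, because the company obtained a license to the Informix source code from IBM. I have to say IBM became the target of complaints here: this company, founded by veteran Informix employees, believes IBM has not treated its stepchild Informix well. They think it is time to take their “child” back and let it grow strong. This also made me wonder what IBM acquired Informix for in the first place. I asked the colleague attending with me, and whether the Informix code has been organically merged with DB2’s is still unknown. It also helped me understand why so many MySQL forks appeared after Oracle acquired MySQL: after all, it is not their own son.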

  • The many facets of a problem and the importance of domain knowledge

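In the afternoon I stuck with the machine learning track. The talk I found most interesting was Lianjia’s “Applying Machine Learning to Housing Price Estimation”. You can guess most of the content from the title. One important message of the talk is that machine learning is not centered on algorithms but built on processed data that is backed by domain knowledge. Lianjia’s problem is that their data volume is on the order of hundreds of thousands of records, far less than the hundreds of millions seen in some image or text processing tasks. In addition, their data mixes categorical and continuous variables, the continuous variables differ by orders of magnitude, and dirty data is unavoidable. All of this largely dictates feature engineering based on domain knowledge and an algorithm choice tailored to the data’s characteristics. Thinking about it now, it is not hard to understand why, from my undergraduate statistics courses to Prof. Andrew Ng’s ML course that I am taking now, the first step everyone takes with data is plotting: it is to observe the data’s characteristics and preprocess it with one’s own domain knowledge in mind. One more remark: in my view, from yesterday’s Didi big data talk to today’s Lianjia machine learning talk, the problems they deal with essentially belong to economics. Unlike econometrics, the machine learning approach is more brute-force: analyzing data is just analyzing data, rather than first classifying the problem as an economics one and then following the textbook routine of building a model and validating it with data. I don’t want to say, and am not qualified to say, which way of solving problems is better. What I want to say is that the same problem approached from different angles gets solved in completely different ways. Perhaps looking at the same problem from different positions can strike even brighter sparks?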
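
To make the data issues from the Lianjia talk concrete (categorical variables mixed with continuous ones, continuous variables on very different scales, and dirty rows), here is a minimal preprocessing sketch using only the Python standard library. It is not Lianjia’s pipeline: the field names (`district`, `area_m2`, `subway_m`), the sample rows, and the cleaning rule are all hypothetical, chosen only to illustrate one-hot encoding and standardization.

```python
from statistics import mean, pstdev

# Hypothetical listings: "district" is categorical; "area_m2" (tens) and
# "subway_m" (hundreds to thousands) are continuous on very different scales.
listings = [
    {"district": "Haidian", "area_m2": 60.0, "subway_m": 300.0},
    {"district": "Chaoyang", "area_m2": 90.0, "subway_m": 1500.0},
    {"district": "Haidian", "area_m2": -1.0, "subway_m": 800.0},  # dirty row
    {"district": "Fengtai", "area_m2": 120.0, "subway_m": 2700.0},
]

def clean(rows):
    """Domain rule (an assumption): area must be positive, distance non-negative."""
    return [r for r in rows if r["area_m2"] > 0 and r["subway_m"] >= 0]

def one_hot(value, categories):
    """Encode one categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize(values):
    """Rescale to zero mean and unit variance so magnitudes become comparable."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

def build_features(rows):
    rows = clean(rows)
    categories = sorted({r["district"] for r in rows})
    area = standardize([r["area_m2"] for r in rows])
    subway = standardize([r["subway_m"] for r in rows])
    features = [one_hot(r["district"], categories) + [a, s]
                for r, a, s in zip(rows, area, subway)]
    return categories, features

categories, X = build_features(listings)
for row in X:
    print(row)
```

Which rows count as dirty and which transforms fit each column are exactly the decisions that depend on domain knowledge; the code only mechanizes them once they are made.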

Day 3

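The last day was a full day of specialized tracks. After the first two days I had a rough sense of the systems and ML directions, so on the third day I focused on other areas such as blockchain. The talk I thought was best here was “Commercial Applications Combining Blockchain and Big Data Technology”. Blockchain is an emerging technology, and since the ledger itself is public, you can think of it as a huge database that supports only insert and select. Data mining on this database and the optimizations that can be done for it have therefore become the focus of the blockchain community. According to the talk, the ledger is already as large as 3,400G. I also learned that distributed ledger technology has very broad applications. For example, the Red Cross could use blockchain when accepting donations to make all donation information fully transparent and public. As an aside, every project now needs different kinds of talent. Both systems and AI have room to show what they can do.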
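
The “database that supports only insert and select” picture can be sketched as a toy hash-chained ledger. This is only an illustration under simplifying assumptions: it runs in a single process with no network, consensus, or signatures, and the `Ledger` class with its `insert`, `select`, and `verify` methods is hypothetical, not any real blockchain API. The donation records echo the Red Cross example.

```python
import hashlib
import json

class Ledger:
    """Toy append-only ledger: records can be inserted and selected,
    and any later modification is detectable through the hash chain."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first block

    def __init__(self):
        self._blocks = []

    @staticmethod
    def _digest(index, prev_hash, record):
        payload = json.dumps(
            {"index": index, "prev": prev_hash, "record": record}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def insert(self, record):
        """Append a record, linking it to the hash of the previous block."""
        prev_hash = self._blocks[-1]["hash"] if self._blocks else self.GENESIS
        index = len(self._blocks)
        block = {
            "index": index,
            "prev": prev_hash,
            "record": record,
            "hash": self._digest(index, prev_hash, record),
        }
        self._blocks.append(block)
        return block["hash"]

    def select(self, predicate=lambda record: True):
        """Return all records that satisfy the predicate, in insertion order."""
        return [b["record"] for b in self._blocks if predicate(b["record"])]

    def verify(self):
        """Recompute every hash; False means some block was altered."""
        prev_hash = self.GENESIS
        for b in self._blocks:
            expected = self._digest(b["index"], b["prev"], b["record"])
            if b["prev"] != prev_hash or b["hash"] != expected:
                return False
            prev_hash = b["hash"]
        return True

ledger = Ledger()
ledger.insert({"donor": "A", "amount": 100})
ledger.insert({"donor": "B", "amount": 250})
print(ledger.select(lambda r: r["amount"] > 150))
print(ledger.verify())
```

There is deliberately no update or delete method. Changing a stored record breaks the hash recomputed in `verify`, which is the property that makes a public ledger auditable.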




First Time Ever Hackathon

I had never been to a hackathon before. In my imagination, a hackathon is like a festival for people who are passionate about creating cool stuff in a limited amount of time. A hackathon brings tons of fun if you can work with dedicated people on an interesting idea and try to turn it into reality. Luckily, the Early Professional Hire (EPH) Hackathon hosted by the China Development Laboratory (CDL) met my expectations perfectly.

Experience Recount

When I first heard about the hackathon on EPH Day 5, I was a little intimidated, because the solution demo you present at the end needs to fit one of the hackathon topics. The topics included business platform innovation, light-weight e-commerce reinvention, blockchain, application of Watson technology to medical services, and integration of Watson technology with 3D demonstration. All these topics are super cool and very advanced, and I barely get a chance to touch them in my day-to-day work. So I wasn’t sure I could handle them well. However, I wanted to give this hackathon a shot: not only because it is part of the EPH program but, more importantly, because one of the topics really caught my eye, namely the integration of Watson technology with 3D demonstration.

This topic actually has a name: Watson Introspector. It is a cognitive tool for understanding software, answering questions, and interacting with software architecture and data flows in 3D. This topic suits my interests perfectly. It has always been a challenge for newcomers to study a code base, especially one that has evolved over several years. Which functions or data flows are involved in a certain feature of the software is the type of question we ask all the time, especially when a bug fix or enhancement request kicks in. Conventionally, there are tools that help us visualize the code path, like debug traces, UML graphs, and so on. However, none of them is straightforward or fun to use in the context of newcomer education. I can hardly imagine anyone choosing to stare at a debug trace on a Friday night over going out on a date. So I think visualizing the code path in 3D and integrating a question answering system (like Watson) may make our software developers’ lives much easier.

After settling on the topic I wanted to work on the most, the next step was to get a team. Originally, there were four team members besides me. None of them had any experience with any technology involved in the topic. This was perfect because neither did I. However, all of them were testers, which made my situation a little difficult: I was the only developer on the team, and I would take on much more responsibility than I had thought. But that was OK, because my teammates wanted to grow with me and give the hackathon a try. So I became the captain of the team.

Everything went quite well for the first two weeks. We narrowed down the architecture of the solution: we would make a 3D space game modeled on the classic snake game. In study mode, the player is free to explore the 3D world, which is constructed from the classes and functions parsed from a randomly selected Java project. Inside the world, the player can tell Watson what kind of feature he wants to learn about, and Watson returns the code path that best matches the request. The player can then spend time getting familiar with the call stack of the functions along with the purpose of each function. In test mode, the player has to visit the functions he just studied in the correct call order to win the game. Our goal in making this game was to offer a fun way to learn about a source code project, and we believed an educational game best suited our needs.

Then, on September 14th, everything changed. Once we had settled on the architecture of the solution, two members decided to quit, citing limited time. I had a feeling that they would quit someday, but I didn’t expect this timing. They refused to work on the solution outside of work hours. This frustrated me, because anyone who joins a hackathon should expect to spend a fair amount of time outside of work to finish a demo. Even worse, it meant we probably didn’t have enough resources to finish our solution. It looked like mission impossible with only one developer and one tester left on the team. But then a second thought came to mind. I serve as the president of the IBM Diamond & Ring Toastmasters Club. One important lesson I learned there is that as a leader, the first priority is to take responsibility and get the job done no matter how difficult the situation is. In my situation, the goal was to at least finish this hackathon, and I needed to make that happen. Plus, I was not alone: I still had a teammate, Rachel, who wanted to give everything she had to succeed in the event. I just could not let her and myself down.

In the final week, we work super hard with our adviser, Trent, to get a demo working. Even during the Mid-Autumn Festival holiday, we come to the office around 2 pm and hack through the rest of the day until 1-2 am. That is the theme for the whole week. On Tuesday, September 20th, the day before the final day, we work for over 30 hours, until 4 am on September 21st, fixing bugs and doing 3D modeling. Rachel and Trent live closer to the office, so they rush home to get some sleep. I, however, live really far from the company (near the southern 4th Ring Road of Beijing), so I have to take a nap on the office couch to avoid being late for the draw that decides the teams’ presentation order, which takes place four hours later.

Even though we almost live inside the company, we still haven’t finished our demo on the morning of the final day. There is a performance issue during the launch phase: since we talk to Watson at the same time the game assets are loading, the framerate drops significantly. In addition, we haven’t figured out a way to grow our character’s body as in the classic snake game. This puts a lot of pressure on me because there isn’t enough time to fix everything in a nice, clean way. However, we somehow manage to finish all this by demo time. We adopt an agile mindset: maybe we cannot fix these problems nicely, but we can definitely work around them for the demo’s sake. We construct the world objects incrementally during the loading phase: we only load the objects that the player can actually see through the camera, and we use a big skybox to hide the unloaded part of the world from the user. For the snake body problem, each time the character hits the target, we put a sphere behind the character, and we somehow manage to make the newly added part follow the character’s movement. Maybe the movement doesn’t mimic a snake’s body nicely, but for the sake of the demo, that’s enough.
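For readers curious how "a sphere that follows the character" can work, here is a minimal sketch of one common approach. It is not our demo code: the names `SPACING`, `add_segment`, and `follow` are made up for illustration, and positions are plain (x, y, z) tuples rather than real game-engine objects. The idea is that every frame, each sphere is pulled toward the one ahead of it until it is at most a fixed distance away.

```python
import math

SPACING = 1.0  # assumed maximum gap between neighbouring segments


def add_segment(segments):
    """Grow the body: spawn a new sphere at the current tail position."""
    segments.append(segments[-1])


def follow(segments, head_pos):
    """Move the head, then pull each segment toward the one ahead of it."""
    segments[0] = head_pos
    for i in range(1, len(segments)):
        ax, ay, az = segments[i - 1]  # segment ahead
        bx, by, bz = segments[i]      # segment being pulled
        dx, dy, dz = bx - ax, by - ay, bz - az
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        if dist > SPACING:
            # Slide this segment along the line to its leader until it is
            # exactly SPACING away; closer segments stay where they are.
            scale = SPACING / dist
            segments[i] = (ax + dx * scale, ay + dy * scale, az + dz * scale)
    return segments
```

Calling `add_segment` whenever the character hits a target and `follow` once per frame with the character's new position gives a chain that trails behind the head. It will not look like a real snake, but it is probably good enough for a demo.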

During the demo, everything works well. Thanks to the public speaking practice I have kept up at the Toastmasters club, I deliver a successful presentation to the senior management of the lab, and we take 2nd place in the hackathon with the fewest team members of all the competing teams.

Here are some interesting stats that are worth mentioning:

  • We have 0 experience with the technology stack of the hackathon
  • We originally have 4 team members but are down to 2 halfway through the event
  • We only have 1 developer and 1 tester eventually
  • We only have 1 person with sufficient programming experience
  • We work late into the night for at least 3 days
  • The longest non-stop hacking lasts for 30 hours
  • We consume 50+ bottles of water and bags of snacks
  • We watch 60+ hours of video tutorials on YouTube
  • We write 2500+ lines of code for the demo


Hackathon Takeaway

Always remember you’re the captain

There are a couple of times when I want to quit the event. Thankfully, I don’t, because I keep reminding myself that I’m the captain of this two-man squad. When you set a goal, you have to do whatever you can to reach it and get the job done. Thanks to this hackathon, I can now see this point clearly.

Stay positive during the difficult situation

There are low points throughout the hackathon. We face technology we have never actually worked with before entering the contest; two of the original team members leave the team; we are unfamiliar with the development tools; we debug code late into the night … All these things can drag morale down pretty quickly. However, I’m the leader, and I cannot let that happen. So, even in these difficult moments, I call the team together for a short break and we entertain ourselves by coming up with jokes or having some random chats. These techniques work amazingly well because we don’t feel stressed and we can actually enjoy the whole problem-solving process. Without the fun, the hackathon would not have been the same for me.

Don’t be afraid of making the tough call

To be honest, I’m the chief solution architect of our demo, and sometimes I have to make tough calls, especially when both options look tempting. For instance, modeling the game on the classic snake game and modeling it on Super Mario are both good options. However, once we take time and our other resource limits into consideration, the two options are not equal. Super Mario has the advantage in 3D exploration, while the snake game reflects the core idea of our solution: make the body grow with the code path. I have to make the tough call on which path we pursue and, I have to say, it’s never easy.

Motivation and hard work are the key to success

Motivation gives you the courage to take the first step, but only hard work gets you to the goal. In this hackathon, I feel lucky that I follow my interest and choose Watson Introspector as the topic to work on. My interest gives me enough motivation to power through the whole event. However, deep in my heart, I know that I want to win and I want my demo to reflect the technical expertise we have. That requires hard work. Thankfully, we don’t let ourselves down and we work super hard towards our goal. I’m so glad we finally make it.

Public speaking is crucial

Our solution may not use all the fancy technologies that some other groups do, but I feel our public speaking and presentation skills more than compensate for this “disadvantage”. During the presentation, I use the numbers above to describe our hackathon experience, in a speech style modeled on Steve Jobs and our beloved CEO, Ginni Rometty. This delivers a concrete message to the audience: we come here to win and we deserve it. I want to give the judges the feeling that we are 120% confident in our solution and, more importantly, that we are proud of it.

All in all, this hackathon has become one of the moments that I’m extremely proud of myself as an IBMer and I do learn a ton from it: not just technical stuff but how to be a leader as well.

I want to end my post with the slogan of our project:

Evolve ourselves, beyond the limit!

A recap on EPH program

This week (05/16 – 05/20), I attended the 2016 GCG Early Professional Hire (EPH) Program offered by the company. The following is a recap of the whole program with some of my thoughts.


The GCG Early Professional Hire (EPH) program run by IBM is a 2-year program that specifically targets new employees with less than two years of working experience. It aims to develop core, valuable skills for new IBMers. When I first receive the advertisement email, my instinct tells me not to attend even though it is required for new hires (you can opt out by obtaining approval from your manager). However, I figure it is a good chance to take a break from work and meet some people (some beauties if I’m lucky and, in fact, there are some), so I withdraw my request not to attend.

Kick-off Event (05/16 – 05/19)

Day 1


The kick-off event is held at the Marco Polo Parkside Hotel in Beijing. It is a really fancy hotel, and I’m surprised that my company would spend so much money hosting an event in a hotel like this, especially since it has been a rough few years for GCG. The agenda for the first day consists of a bunch of speeches, BU introductions, and a welcome dinner. The 9 am sign-in is quite tough for me, as the hotel is 11.5 miles from my apartment! However, “Watson Coffee” (some fruit and yogurt) helps me get through the wait until the event starts at 10.

The morning speeches are not very impressive. The first is delivered by Shally Wang, GM of GCG. I can hardly recall what she talked about, but her opening remark about moving the start time earlier to make up for the people who arrived early is quite thoughtful.

The next speech is delivered by Anita Sabatino, a senior leader at IBM. I have to say her speech is the only shining point of Day 1. She recaps her career at IBM:

She starts as a software engineer at IBM and moves into sales once she has become an advisory software engineer. She then works for JP Morgan for a couple of years before rejoining IBM. After that, she moves to China with her daughter, who was adopted from China, and works with Bank of China. She gives examples of how to build trust with clients: for instance, meet with the leaders from BOC weekly and always be on time, and build personal relationships to a reasonable degree so that they facilitate the collaboration. She also shares some stories about her daughter.

This is quite a good speech because it has real substance. It is not hollow words without any meaning. It feels like a friend talking to you directly about her career. Plus, I’m always interested in people sharing their career stories and how they make decisions.

The last part of the morning consists of people from GCG sharing their stories with the new hires. They are not as senior as the previous speakers, but they are experienced. I don’t really listen to their talks because it’s already 12:30 pm when they start and I’m starving. All in all, it’s just some show-value stuff so that they can brag to their bosses. Nothing new.

Afternoon & Evening

Lunch is a buffet, and I hear it costs around 200 RMB per person. It is quite good, and I have tons of steak and ice cream. The key word for the afternoon is “BORING”. It consists of speeches from different BU leaders (GTS, CAMSS, GBS, Technology Partnership), who essentially want you to get the big picture of their BU and appreciate its business value. The rest of the day is a nice welcome dinner and some shows from fellow IBMers. The shows are quite nice, but unfortunately I cannot stay till the last minute because they are still going at 8:30 pm and I’m afraid of missing the last subway home.

Day 2 – 4

These three days consist of four main parts: building your professional reputation@IBM, workplace etiquette, delivering quality work with agility, and business writing. Overall they are quite boring, but there are indeed some shining points that are worth mentioning:


BU Session (05/20)

There are two great speeches delivered today: one by Ge Song, the CDL cloud leader, and the other by Zhong Tian, the only Distinguished Engineer (DE) at CDL.

Ge Song’s Speech

Ge Song’s speech mainly focuses on some takeaways she got from Things I Wish I Knew Before Working in Industry (I infer this source from her references during the speech, but I could be wrong as she didn’t explicitly cite it). The following are the key points she mentioned (I wrote them down from my audio recording):

  1. Attitude makes everything; be willing to do more. She draws on her own experience and offers an example: when she started her career at IBM, her manager sometimes got challenging tasks and asked if anyone was willing to take them. The most courageous sentence she could say at that time was “I can try it!” even though she knew she was totally capable of doing it. So, she suggests that if you are pretty sure you can handle the task, you should always say “I can do it!” This is because it is the manager’s responsibility to help you succeed at your task. They will do whatever they can to help you (frequent reviews …) and to control the risk. They will not blame you for a failure, because it is their failure if you fail. Also, be willing to take on more tasks whenever possible and necessary. Don’t be the kind of person who cannot hang around any longer after 5 pm and can’t wait to catch the first shuttle home. So, always remember “No pain, no gain”!
  2. Be visible (show value). The example she gives here is the global conference call. Usually, Chinese participants barely say anything during the call except “Hi! I’m Mark.” and “Bye Bye!”. That doesn’t work, because you don’t show your value. Here is a tip: if you know beforehand that the call will discuss some difficult problem, you can prepare for it. When the global team leader asks for input, you should speak up (because you’re already prepared).
  3. Find your mentor. Everybody knows what a mentor means for a person’s career. Here, she emphasizes that you should build a solid skill foundation before you ask to change mentors.
  4. Be yourself and build your identity (build your personal brand). You need to strive for excellence in the area you are working on (become a go-to person). However, you don’t have to care how people treat you. Build your expertise and keep learning! “Stay true to your passion for technology!”
  5. Think big and start small. Don’t reach for what is beyond your grasp! Don’t keep thinking about how strong a certain leader is while forgetting the hard work he has put into his technical field. Again, she offers a tip regarding conference calls. You need to focus on two points during the call: 1. Why does the speaker ask this kind of technical question? 2. Develop your English speaking skills.
  6. Manage your time. You become what you spend the most time on.
  7. Prioritize. (Briefly mentioned.)
  8. Manage risk. (Briefly mentioned.)
  9. Have the courage to say “No”! (Briefly mentioned.)
  10. Plan not only your career but also your life. (Briefly mentioned.)

Zhong Tian’s Speech

Zhong Tian’s speech focuses on sharing his experience of a technical career. Here are some of the inspirational sentences he mentioned:

  • Keep learning!
  • Excel at what you do, and the world is yours!
  • Don’t assume that because you are band 6 you should only do band 6 work. If you are band 6 and already doing band 7 work, your promotion is not far off!

Source code security

Well, this is a post that I started on 2016-04-15 and finally finished today…

Yesterday morning (04/15/16), when I came to the office, I got bad news from my manager: security had informed him that I had an abnormal checkout of code on Monday, 04/11/16. The way source code security works in our lab, and probably in other IBM labs, is that security tracks the frequency and quantity of each developer’s checkouts every day. They collect statistics and alert the first-line manager when something potentially terrible happens. For instance, if I usually check out code twice per day, around 20 source files each time, then checking out 3456 files in a single day, as I did on 04/11/16, will certainly set off the alarm. Believe me, that is exactly the number my manager gave me. What did I do on that day? It turns out that I needed to make a special build on top of a GA build for a client, and I needed to include all the past code changes made specifically for this client plus my new code. To make a special build, we use some scripts to check out the source files that need to be changed, merge the code, and run the test buckets on them. That involves tons of checkouts and checkins. In the end, I successfully explained this to my manager and everything worked out.
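I don’t know how security actually implements this check, so the following is only a guess at the idea, written as a small Python sketch. The function name `is_abnormal`, the 3-standard-deviation threshold, and the sample history are all my own assumptions for illustration; the only number taken from the incident is 3456.

```python
from statistics import mean, pstdev


def is_abnormal(history, today, threshold=3.0, min_std=1.0):
    """Flag today's checkout count if it is far above the developer's norm.

    history   -- files checked out per day on previous days (assumed data)
    today     -- files checked out today
    threshold -- how many standard deviations above the mean count as abnormal
    min_std   -- floor on the spread so a very steady history is not oversensitive
    """
    baseline = mean(history)
    spread = max(pstdev(history), min_std)
    return (today - baseline) / spread > threshold


# Roughly "twice per day, about 20 files each time".
usual_days = [40, 38, 42, 40, 41]

print(is_abnormal(usual_days, 42))    # an ordinary day
print(is_abnormal(usual_days, 3456))  # the special-build day
```

With a per-developer baseline like this, an ordinary day stays quiet while the special-build day stands out immediately, which matches what happened to me.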

What interests me about this incident is that this is the first time I realize the power of ClearCase. I had never heard of ClearCase until I joined IBM. Back in college, I worked solely with Git, and I felt extremely uncomfortable when I first worked with ClearCase. However, after this incident, I personally start to feel that ClearCase is probably more powerful than Git at the security level. Basically, in the Git world, I need to fork or clone the repository so that I have a local copy of ALL the source code before I start working on my branch. That is a problem in terms of security because I literally need to have all the code locally before I can work on my stuff. Making a branch on the remote repository has the same issue. In ClearCase, however, I only need to make a dynamic view first and then check out just the files I need to modify. If I check out too many files, that raises a warning, like this time. This security checking mechanism works great with ClearCase because:

  • There is a central server that holds all the source code. A corporation can simply monitor the checkout behavior against this central code repository.
  • The quantity of checkouts differs from person to person. In Git, it feels like the standard way is for everyone to check out all the source files even if you only need to modify one. With ClearCase, however, the quantity varies from person to person, and that is what makes statistical monitoring of checkouts meaningful.

I’m not saying Git is bad. In fact, at IBM, we are starting to have GitHub Enterprise hosted on SoftLayer behind the IBM firewall. That is really great news for me because I can finally have the “social coding” experience at work that I have been enjoying outside of it. It will make some of the work I have done specifically for fellow IBMers more organized and easier to get. I no longer need to attach the code to emails and send them one by one to each member of the teams we collaborate with. I can simply send the git repo to their team lead, and every member of the team can access it simultaneously. Plus, having GitHub inside IBM also helps me track issues with the code I own and, again, saves me a ton of communication cost.