On Reading CS Papers – Thoughts & Reflections

Be forewarned:

  • This is not an advice post. There are tons of people out there who desperately want to give people advice on reading papers. Read theirs, please.
  • This post is a continuous reflection on the topic “how to read a CS paper” from my personal practice. I will list out my academic status before each point so that it may be interesting to myself on how my view on the matter has changed as time goes forward.

2018

The first year of my CS master program. Just get started on CS research.

  • It’s OK to not like a paper

In my first semester, I majorly read papers on Human Computation and Crowdsourcing.  Very occasionally, I read papers on NLP. Some papers on NLP are from extra readings in Greg’s course. Some are related to Greg’s final project, which deals with both code and language.  I don’t really like and want to read papers back then. In NLP class, I prefer to read textbooks (Jufrasky’s one) and tutorial posts that I can find online. One roadblock for me to read papers is that there is certain background knowledge gap I need to fill and I just simply don’t know how to read a paper. So, for Greg’s NLP course, I only read some papers related to my final project. This paper is the base paper for my final project. I got this paper from professors in linguistics and software engineering and they want me to try out the same idea but using neural network model instead. I read this paper several times and the more I read, the more I want to throw up.  I just think this paper hides many critical implementation details and the score 95% is just too high for me to believe. The authors open source their code but their code has some nasty maven dependencies, which won’t compile under my environment. Their evaluation metric is non-standard in NLP and many “junk words” wrap around their results. Of course, the result of my experiment is quite negative.  I often think it is just a waste of life to spend your precious time on some paper you dislike.  Here, I’m more of talking about paper writing style and the reproducibility of papers’ results. I probably want to count shunning from some background gap as a legitime reason not like a paper.

  • Try to get most of the paper and go from there

I got this message from Matt’s Crowdsourcing class. In the class, I have read a very mathematical heavy paper, which invokes some combinations of PGM and variational inference on the credibility of fake news. I’m worried back then about how should I approach a paper like this one, which I’m extremely lack of background and mathematics formula looks daunting.  I pose my doubts on Canvas and Matt responds in class and gives the message.  I think the message really gives me some courage on continuing read papers.

  • It’s OK to skip (most) parts of a paper.  Remember: paper is not a textbook!

This semester I’m taking a distributed system class. To be honest, distributed system paper can be extremely boring if they are from industry. Even worse, system paper can be quite long: usually around 15 pages, double column. So, if I read every word from beginning to end, I’ll be super tired and the goal is not feasible for a four-paper-per-week class. So, I have to skip. Some papers are quite useful maybe just for one or two paragraphs. Some papers are useful maybe just because of one figure. As long as your expectation about a paper gets met, you can stop wherever you want.

  • Multiple views of reading a paper

I didn’t get the point until very recently. I did quite terrible on the first midterm of my distributed system class. The exam is about how to design a system to meet a certain requirement. In the first half of the course, I focus on the knowledge part presented by the paper but that doesn’t work out well. Until then, I realize that I need to read those systems paper from a system design point of view: what problems they need to solve, what challenges they have, how they solve the challenges.  OF course, those papers are valuable from knowledge perspective: how consistent hashing works, for example. But, depends on the goal of reading paper, I can prioritize different angles of reading a paper. If I need to implement the system mentioned in the paper, I probably need to switch to a different paper reading style.

  • Get every bit of details of paper if you need to

It’s time again for the final course projects. Again, I need to generate some ideas and find some baseline papers. In this case, “skip parts” and “get most out of the paper and move on” strategy probably won’t work well. All in all, I need to understand the paper and those are rely on the details from the paper. In this case, I need to sit through the whole journey and remove any blockers that I may encounter.

Advertisements

Job hunting lesson learned

This post contains a collection of lessons I learned during the job hunting. I’m still looking for internship & job. That’s good because that means this post will be at least frequently updated in the foreseeable future.

  1. Always attending career fairs. In UT, if you are a CS student with a good standing, you can get an invitation to an event called FOCS Career Night. There will be a lot of recruiters. But, be careful, most of the recruiters are actually engineers or UT students (that’s right, some companies make Campus Ambassador attend the event as if they are the recruiters). There is a huge difference between recruiters and engineers: recruiters get the call on who gets the interview, not engineers! I made a mistake by attending the FOCS Career Night only and skip the Career Fairs. In fact, recruiters are actually coming to Career Fairs and some of them doing on-campus interview signup immediately. The on-campus interview is much better than OA. Even you got an invitation for scheduling an interview after FOCS Career Night, you still want to talk to the company at Career Fairs because recruiters can barely check their emails when they are on travel and interview slots are always based on first come first serve policy. So, you always make sure to come to the Career Fairs and schedule an interview immediately instead of replying the invitation email and wait for the response and then got one said interview slots are all filled. This happens to me on Indeed.
  2. Always doing OA immediately. When you receive an OA, the company usually will tell you that you can finish the test within certain days. However, things can change rather quickly. Even they give you buffer like finishing this test within 4 days, ignore the message and do OA immediately. Slots can fill rather quickly and some company has this under-table rule on even they say 4 days, they really mean immediately. This happens to me on Dropbox.
  3. Always finishing OA within min(60 minutes, restricted time). Some company allows you to finish OA within days. In other words, even you start the test, you have a couple of days to finish it. Ignore this, please! Even though the test lasts for days, finish it as quick as you can. The finishing time is a strong indicator of your coding ability. This happens to me on Twitter.
  4. Always follow-up with the recruiter. Sometimes, there might be system error: they send reject email to the wrong person. Make sure you confirm this with the recruiter and finish your OA no matter what happens. Even you got rejected, OA is still an invaluable practice opportunity. This happens to me on Dropbox.
  5. Always make sure you apply in the University Recruiting section. Companies make specific web pages for fresh graduate and recent graduate. Make sure you submit your resume there. If you submit the resume to the wrong place, you may in a pool that is filled with professional with 5+ years of experience. That always leads to either no hear back at all or an immediately reject letter. This happens to me on Dropbox.
  6. Use the LinkedIn and be aggressive. I’m a shy person but job hunting like the name suggested, it’s a hunting. You have to be aggressive. Connect with as many people as you can whether it is from Career Fair, social events, LinkedIn in-mail. Be polite and be bold. Ask them for the opportunity. One special note is that you may want to “harass” recruiters and senior developers in LinkedIn. Their words have much more power and you may get an interview very quickly. If they being rude when you ask for the favor politely, you already know that this company is definitely not the one you want to work with. This happens to me on Teradata (BTW, they are on the polite side).
  7. Prepare for the technical interview questions:
    1. The interviewer may make some slight modification to the questions even they are from leetcode. For example, instead of asking what exactly the shortest path are in the original leetcode question, the interviewer may ask how many steps in the shortest path. The difference is the former one may expect a list of coordinates (i.e., steps) and the latter one may expect a simply a number. This happens to me on Pocket Gem. The takeaway is that when you solve leetcode questions, think about what possible variations might be. However, it may seem infeasible that you do it for every problem. You don’t have to unless you don’t have anything else to do. The next point will help to address this concern.
    2. Browse some recent interview questions from the company you are about to interview with from forum. This helps to address the previous concern. If you see the company interview some leetcode questions, you may want to look at that leetcode questions and think about the possible variations. Also, usually company has a pool of questions and get some prior exposure from a forum, you may have a good preparation already. Also, this point helps if you are very short of preparation time. In this case, you just prepare for the questions from the forum and you’re good to go. Sometimes, this works much better than a long-term preparation strategy, which you may feel over-prepared and feel a good chunk of time get wasted on leetcode when you can simply prepare the questions from the forum.
    3. Get practice on the leetcode. Usually, people emphasize the importance of getting practice on leetcode. That’s true. However, this depends on when you about to apply for the position. For recruiting new grads, some companies prefer to start early (e.g., in Fall) and others don’t (e.g. in Januarg till March). People always think they should start early as soon as possible to get a spot in the limited headcount. That’s true but this strategy usually comes with a risk: you’ll see new interview questions that no other has seen before. Each year, companies may update their pool of questions. If you think you can solve leetcode problem like “1+1” and have a solid preparation in system design, then start early A the AP is best strategy.  However, if you are in an OKish position in algorithm preparation and design preparation, then you may delay applying one or two weeks. The beauty of delay comes directly from previous point: you may get exposure to the pool of questions before the actual interview. How long to delay is a case-by-case situation. Some companies (e.g., Dropbox, Pocket Gem) will be quite active and send your OA almost immediately after your application and you may want on your side. However, some companies may have a long process to take before setting up any interviews, then you may want to apply ASAP and let the internal processing time takes its time.

“Research” Interest

This week Friday, I meet with my future roommate in Beijing. During the lunch, we had a conversation about each one’s research interest. My roommate, likes me, is also a CS graduate student at Austin. However, unlike me, he has a clear vision about what direction he is going to pursue in graduate school. He just finished his undergraduate degree in Automation department at Tsinghua University. Automation department, as he explained, is similar to a mixture of mechanical engineering and electrical engineering. He has interest in mathematics since high school and naturally, he wants to work on machine learning theory in graduate school with emphasis on computer vision (CV).

Now comes to my turn. That’s a hard question I have been thinking about for a while. I don’t have clear vision on what I’m going to pursue next. I think maybe I’m too greedy and want to keep everything. However, I also realize that I may not be as greedy as I thought initially. I know I don’t want to work on computer architecture, computation theory, algorithm, compiler, network. Now, my options really just choosing among operating system, database, and machine learning. For the machine learning, I even know I probably won’t choose computer vision eventually (still want to try a course though) and I more lean towards the natural language processing (NLP). However, picking one out of those areas is just too hard for me now, even after I did some analysis in my last post trying to buy myself into picking machine learning only. There is always a question running in my head: why I have to pick one? Sometimes I just envy the person like my future roommate who doesn’t have this torture in his mind (maybe he does? I don’t know).

This feeling, to be honest, doesn’t new to me. When I was undergraduate facing the pressure of getting a job, a naive approach is just locking oneself in the room and keeping thinking what profession might suit me the best. After two years of working, I grow up enough to know that this methodology on making choice is stupid and I also grow up enough to know that “give up is a practice of art”. Why I’m in this rush to pick the direction I want to pursue even before I’m taking any graduate course yet? Why can’t I sit down and try out several courses first? Because I want to get a PhD in good school so bad. Let’s face the fact that people get smarter and smarter in generations. Here “smarter and smarter” doesn’t necessarily mean that people won’t repeat the mistake that happened before. It means that people will have better capability to improve themselves. Machine learning is not hot in 2014 from my experience in college. Back that time, Leetcode only has around 100 problems. I have no particular emotional attachment to machine learning material when I’m taking the AI class. Maybe because wisconsin has tradition in system area? I don’t know. However, in 2017, everyone, even my mother who is a retired accountant, can say some words about AI, machine learning. Isn’t that crazy?

On my homepage,  I write the following words:

I like to spend time on both system and machine learning: system programming is deeply rooted in my heart that cannot easily get rid of; machine learning is like the magic trick that the audience always want to know how it works. I come back to the academia in the hope of finding the spark between these two fascinating fields.

Trust me, I really mean it. Maybe because I graduate from wisconsin, I have naturally passion for system-level programming, no matter it from operating system or database. Professor Remzi’s system class is just a blast for anyone who wants to know what’s going on really under the software application layer. Professor Naughton’s db course is fully of insights that I can keep referring to even I begin to work a DBMS in real world. Wisconsin is just too good in system field and this is something that I can hardly say no even I have work so hard lie to my face saying that “system is not worth your time”. What about machine learning? To be honest, great AI dream may never accomplish. Undergraduate AI course surveys almost every corner of AI development but only machine learning becomes the hottest nowadays. Almost every AI-related development nowadays (i.e. NLP,  Robotics, CV) relies on machine learning technique support. Why I’m attracted to machine learning? Because it’s so cool. I’m like a kid who is eager to know what is going on behind magic trick. Machine learning is a technique to solve un-programmable task. We cannot come up with a procedure to teach machine read text, identify image object, and so on. We can solve these tasks only because the advancement of machine learning. Isn’t this great? Why both? I think machine learning and system becomes more and more inseparable. Without good knowledge about system, one can hardly build a good machine learning system. Implementing batch gradient descent using map-reduce is a good example in this case.

I just realized that I haven’t answered the question about rushing towards the making decision. In order to get a good graduate school to pursue PhD, you need to demonstrate that you can do research. This is done by publishing papers. Most of undergraduates nowadays have papers under their belt. That’s huge pressure to me. Master program only has two years. I cannot afford the time to look around. I need to get started with research immediately in order to have a good standing when I apply to PhD in 2018.

So, as you can tell, I have problem. So, as a future researcher, I need to solve the problem. Here is what I’m planning to do:

  • Take courses in machine learning in first semester and begin to work on research project as soon as I can. I’ll give NLP problem a chance.
  • Meanwhile, sitting in OS class and begin to read papers produced by the Berkeley Database group. People their seem to have interest in the intersection between machine learning and system. This paper looks like promising one.
  • Talk to more people in the area and seek some advice from others.
  • Start reading “How to stop worrying and start living

Will this solve the problem eventually? I don’t know. Only time can tell.

Takeaway from DTCC 2017

由于同事出差,我有幸参加了在北京国际会议中心举办的第八届中国数据库技术大会(Database Technology Conference China 2017)。这是我第一次参加业界交流大会,内心还是格外兴奋的。这次大会确实有很多的收获,我想用这篇博客记录下来。本来我想用英文记录的,毕竟对于计算机领域,英文是我的“母语”,但是介于分享主要以中文为主,所以我就还是以中文来记录了。

会议目标

虽然机会来的很突然,但是我还是设立了一些目标以最大可能的利用好这次机会(以下是这篇博文的英文初稿,由于实在是懒着重新翻译成中文,各位就凑合着看吧):

Get some sense from the peers

Focus on your own product is quite important. However, it’s even more important to see how your peers doing. I’m not an architect yet but I feel it’s helpful to begin thinking like an architect and see what the problems that your peers are facing and how they try to solve them. In addition, by knowing how’s the going with your peers, you may get a measure of yourself: is the work you are doing on the same level as your peers? Are you in a good shape in the job market? What’s the gap you need to fulfill skill-wise?

Deepen the understanding of the field

Even almost two years working on the database field, I still think myself as a newbie. This is mainly because database is arguably the most complex software that people can ever make and there are tons of stuff I don’t know. So, I want to see in a high level that what’s the trend of the field and what kind of reflection that people derive from their day-to-day engineering practice. I think this may help me to catch-up with the masters.

AI or System?

As I disclosed in my last post, I decide to head back to school and get a master degree. To be honest, my ultimate goal is to acquire a PhD in Computer Science and currently I’m actively preparing for it. The most important question is that which field I want to study?  I have two options and I have some interests in both fields: AI and System. Why these two options and not others is worth a whole new post and I don’t want to discuss here. So, my task for now is to gather as much information as possible about these two fields and see which one looks more attractive to me. This event is extremely helpful because it has sharing on System as well as on AI.

Day 1

第一天分为上下半场。上午是开场及四个分享。下午则是五个同时进行的专场,每个专场有六个同一主题的分享。这就造成了我无法参加每一个分享。第一天我的策略就是面面俱到:系统的我也参加,AI相关的我也参加。以下就是针对我参加的每一场的一些心得感悟和评论:

年度主题解读 (曹鹏 – 京东金融副总裁)

本次会议的主题叫做“数据驱动,价值发现”。这个分享是从京东金融自身的角度对本次会议的主题进行了结构。从中我记住了两点:

  1. Finance领域受到了机器学习的冲击,最近几年有越来越多的FinTech公司出现。机器学习在这种公司的主要应用从这个分享来看是对客户群体更加精确的定位和分析。相应的,对于量化交易策略的作用,这个分享没有涉及。我最近一直比较关心机器学习在金融领域的应用,但是从这个分享上,我没有找到我想要找到的答案。因为,在我看来,对客户群体的精确定位是一种机器学习的通用应用,并不具备金融行业的独特性。
  2. 数据公司在我看来是一个不错的创业想法。分享中提到数据对于京东金融的重要性。他们不仅要求数据的广度,也要求数据的厚度。一个重要问题是数据是具有很强的时效性和冷热变化的。一年前顾客的消费记录对于现在来说并不具备非常强的指导意义。因此,京东金融每天都要收集大量的数据(~6TB)来保证整个分析的准确性。同时,演讲者透露出即便在这种情况下,他们觉得数据还是远远无法满足他们的需求的。这个就能解释为什么IBM最近收购了The Weather Company和医疗影像公司Merge Healthcare:无非就是看上了这两家公司的数据。这让我想做数据贩卖商会不会是一个不错的创业点子呢?

数据库发展概览 (吴承杨 – 甲骨文)

这场分享整体来说亮点不多。不过还是有一些重要信息的:

  1.  在去IOE喊了那么多年的今天,Oracle的市场占有率依然有56%之多
  2. 数据库的未来是云:这里演讲者用一个case讲述hybrid cloud的重要性。企业现在面临的问题是如何将公有云的数据和本地服务器上的数据有效的对接在一起以及如何将公有云私有化等。整场演讲更像是Oracle解决方案介绍会,技术方面很少涉及,但是指出了未来数据库发展的方向:上云。
  3. 演讲者台风不错,是一个不错的演讲者。

数据技术的下一站 – 数据应用 (王桐 – 永洪科技)

这个分享反应出永洪科技的主营业务和技术实力可能不是那么雄厚。整个分享我感受到永洪科技做的是数据库的应用开发,而不是数据库系统的本身。从这个分享中我了解到永洪把传统数据库以及大数据系统做了个集成平台,并在上面开发了针对不同行业应用的服务。这个感觉和IBM自家的Bluemix非常像,少的只是Watson系列。我个人看来做软件系统集成要比做系统本身难度要低很多。整个分享关注在永洪科技所提供的各种数据应用的服务。我查了一下,公司属于初创成立于2012年,我觉得走到今天这个地步也是不容易的。

整个分享亮点还是有的。一个是人物岗位关系图的展示,流程之间的pending关系以一种网状图的形式展现出来,每个节点是一个岗位。通过这种展示,我们能清晰看出哪个岗位人物最关键,他的缺席或者能力高低会对整个公司业务带来何种影响。另外一个亮点就是资源配置图。展现的是诸如会议室的使用情况,使用率等指标。但凡在IBM呆过的,对会议室这点肯定会深有体会:无数会议室被人预定却无发得到充分利用。我想这种资源展示应该是对我们这种会议室资源紧张的地方来讲会有很大帮助的吧?

达梦如何冲击核心业务系统 – 国产数据库的产品发展之路 (韩朱忠 – 达梦数据)

我觉得这个分享可能是今天最励志的分享了。整个分享讲的就是一个国产小厂商是如何奋斗和外资数据库斗争,一点点争取市场份额,成长到今天这个样子的。这里边讲到的一个关于他们对这个用C写的数据库的SQL优化能力进行提升的例子。 他们曾经遇到过一条SQL, 长达3.9K行,换句话说就是粘到word文档里能粘350多页。里边包含着17个inner join, 557个子查询, 831个or筛选, 1000+个查询字段,2731个case when。他们通过不断优化将这个SQL语句从几百分钟降到不到1秒。另外一个故事是讲国产数据库生存的艰辛。因为大企业及银行电信等核心产业的数据库都是采用外资的, 国产根本进不去。国产只能在中小企业市场去竞争。但是,这家数据库通过自身的不断努力,终于拿下国家电网的单子以及西藏和东方航空的单子。这在我看来是非常了不起的成就。这就让我对IBM产生了反思。我不觉得我们DB2能在不经过针对性的优化的情况下就能处理这么复杂的SQL语句。这个例子也让我觉得要么我们是在用我们的名声和过去的积累在赢得客户,要么就是DB2售前的同事在做POC的时候super tryhard。我明显感受到我们和这些国产数据库在努力程度上的差距。也许有一天我和他们的地位会呼唤?我相信这是IBM高层不愿意看到的事情。我们确实该努力了。

SSD的IO Determination特性在数据库业务优化中的应用与拓展 (阳学仕 – 宝存科技)

这个是从storage上出发来讲如何用软件模拟硬件来提升读写速度。换句话说,这个分享带给我的思考就是数据库怎样才能利用IO determination提升读写速度。这里讲的IO determination我粗浅理解看来就是让硬盘上的应用能更加和谐共处,并通过提升应用优先级,IO资源上下限,以及时间上对读写顺序进行优化等方式来使应用获得所需要的资源。另外SSD对于网络发展的匹配也有涉及:通过硬件的提升,我们现在基本可以做到本地写入和通过网络写入远程只有10几微秒的差距。这些在我看来是属于OS的领域。硬件对DB的加成这个方向让我感到耳目一新。

面向未来的数据库体系架构的思考 (张瑞 – 阿里巴巴)

这个主要介绍的是阿里巴巴里的AliSQL的架构以及针对阿里业务特点的数据库架构的反思。这里有两点我想提及:

  1. 国内厂商和IBM在对待数据库上有本质上的区别。国内厂商如阿里巴巴,腾讯,以及百度都是以自身业务痛点作为出发点对自家的数据库进行开发和改造。所以相应的,这些家的数据库改造,提升都是带有极强的针对性的。他们的数据库架构可能并不具备非常强的通用性。相反,IBM是把数据库作为产品来销售的,因此在数据库本身设计上考虑到的更多是面面俱到,大而全的尽可能满足所有用户类型的需求。这就导致在某些场景下,IBM的DB2做不到像AliSQL, OceanDB, TDB那样强劲。因此,在超大型公司做数据库,最终方向可能都是“私人订制”。
  2. 机器学习与系统结合的越来越紧密。这里演讲者提到他们想在未来把自动运维转换到智能运维上面来。SQL不再是DBA来手动看,而是通过ML的某种方式来进行优化。这些阿里的人还没有想好但是他们觉得这是未来的方向。

下午场综述

下午听的有”百度NewSQL数据库系统”, “Tencent MySQL内核优化解析”, “滴滴大数据应用”,“自然语言技术在文智趋势分析产品上的应用”。百度上最大收获是说现在分布式事物数据库非常的热,如果研究透,就没有在国内趟不过去的问题。另外一点收获就是不要过分崇拜Google系统。虽然细节我没有听的特别懂,但是从演讲者言语间我感受到,黑猫白猫抓到耗子就是好猫。有的时候不能太学究。而且系统之间即使是理念一模一样,但是由于implementation不同,也会导致巨大的性能差异。

腾讯的讲的非常Technical, 加上演讲者是技术出身,整个session非常的煎熬,感觉就是内核优化是个大坑,需要很扎实的DB知识。最后两场我选得是和机器学习相关的。不得不说没有达到我心中的理想。滴滴介绍的是他们一些数学模型应用的场景。我感觉演讲者应该是加入滴滴时间不长,并没有从一些模型上讲出个所以然来,反倒是应用场景上更让我感受到经济学家也是有用武之地的:比如说如何运用高峰涨价来调控司机和打车人之间的供求关系,以及如何收取取消订单等行为给平台所带来的损失。也许是民怨太重,整个滴滴分享感觉像是个新闻发布会。最后的自然语言技术应用是非常无聊的。演讲者是产品经理出身,主要介绍了下腾讯是如何针把NLP技术应用在新闻上的。非常泛泛,没有提及一些NLP上的技术难点,非常失望。

Day 2

第二天我觉得整体上不如第一天的精彩。主要原因我在想是方向性和行业发展战略性的内容比例在降低而具体技术内容所占比例在上升。不得不说的是通过这两天大会的观察,国内数据库领域MySQL系和Oracle系还是占主流,这主要是因为互联网行业的蓬勃发展。下面我就简单聊聊这一天的观察和体悟:

  • Informix现在是和物联网IOT紧密的捆绑在了一起

在IBM我的邻居就是Informix Technical Support组。他们组的老大之前也分享过Informix在物联网领域的应用。这在我看来是为Informix这个昔日的巨人在找新的发力点以获得新生。这点也在今天题为“万物互联时代的数据库支撑平台–SinoDB”上获得了印证。SinoDB可以理解为Informix的fork因为这个公司从IBM这里获得了Informix的源代码的授权。不得不说的是IBM在这里变成了吐槽的对象,这些以Informix元老员工成立的公司认为IBM并没有善待Informix这个继子。他们认为是时候把自己的“孩子”重新领回来让他茁壮成长了。这也让我不得不思考当初IBM收购Informix到底是为了什么?问了问和我一同参会的同事,Informix的代码是否已经和DB2的有机的融合在一起现在还是个未知数。这也让我明白为什么在Oracle收购MySQL之后会出现这么多MySQL的fork:毕竟不是亲儿子。

  • 问题的多重性和domain knowledge的重要性

下午场我就是盯着机器学习专场在听。其中我觉得来自连家的“机器学习技术在房屋估价中的应用”的分享最为有意思。分享的内容其实从标题就可以猜出个八九不离十。这个分享一个重要的信息就是机器学习并不是以算法为核心的而是以建立在以domain knowledge为支撑的加工过的data的基础上的。对于链家的问题就是他们的数据量是十万级的,远不及一些图像处理或者文本处理的亿级别的数据。另外他们的数据是类别变量和连续变量混合,连续变量有数量级差异;以及不可避免的脏数据。这些都很大程度上决定了要基于domain knowledge的feature engineering和针对数据特点的算法确定。现在想想也就不难理解为什么从在本科上统计课到现在看的Prof. Andrew Ng’s ML课程,大家拿到数据的第一步都是plotting:就是为了能更好的结合自己的domain knowledge来观察数据特点及预处理。另外说一句就是,在我看来从昨天的滴滴大数据应用到今天这场链家的机器学习应用,他们本质上处理的问题都是属于经济学范畴。与经济学中计量经济所不同的是,机器学习的方法更加暴力:分析数据就是分析数据,而不是先要把问题归类到经济,然后按照经济的科班套路先建模再通过数据验证模型的套路来解决问题。我这里不想说也不够资格说哪个解决问题的方式方法更好。我想说的是一个问题放在不同角度来解决套路真的是完全不一样。站在不同位置上看待同一个问题也许能会擦出更加明亮的火花?

Day 3

最后一天就是全天的专场了。前两天听下来基本上对System, ML方向有了个粗略的sense。到了第三天我就把重点放在了其他一些领域比如说区块链。这里我觉得讲的比较好的就是“区块链与大数据技术结合的商业应用”这场。可以看出的是区块链作为一个新兴技术,由于账本本身是公开的,可以把 这个想象成一个巨大的只支持insert和select的数据库,那么对于这个数据库里的数据挖掘和针对这个数据库所能做的一些优化就成为了现在区块链届关注的重点。据介绍现在这个账本已经有3,400G这么大。我另外了解到,分布式账本这种技术应用场景还是非常广泛的。比如说红十字会接受捐赠就可以利用区块链技术使得所有捐款信息完全透明公开。说句题外话,现在任何一个项目都需要不同类型的人才。系统,AI都有自己施展拳脚的空间。

小结

参加conference确实是一个非常愉快的体验。像我这种技术渣渣可以了解到各个领域的前进方向,找到自己努力的方向和未来的定位。和我一快来的同事就跟我说参加这个会议让自己更加坚定了当初自己选择的方向。另外,如果有丰富的工程经验也可以通过这次会议吸取同行的一些经验教训,取长补短。另外,丰富的networking机会也是这种会议的价值所在。

走出会议的那一刻,我觉得天空好蓝。

Under Construction (part 2/2)

In the previous post, I briefly recap the effort I have made to build a personal website. As you can tell from my previous work, I’m super into Sphinx-doc tool and I’m seeking to build a website that is more suitable for industry professional. So, this post is more about technology selection and what’s going on with my endeavor now.

“Everyone should have a blog”?

There is a tendency for person who works in tech industry: they love to blog about the technology. I’m not indicating that I’m against this tendency. In fact, I’m super into this idea. That makes programmer life much easier. When we run into technical questions, we can google and usually the solution is presented by some nice guy’s blog post. However, I’m not sure if making blog post is a great way to systematize your knowledge. Let’s take “Minimal Emacs Tutorial” as an example. Let’s imagine how we can translate this article into a blog post. “Learn about Emacs” section can be a post and then I tag it with “emacs-terms”, “emacs-commands”. “HowTos” section is separated into multiple posts because each howto is collected on different dates and as a blogger, I would make a blog post each time I found something new with emacs usage. The tags for “HowTos” section posts may be “emacs topics”. Now, if I want to take a look at what I have learned on Emacs, then I three operations: click on “emacs-terms” to find the posts with this tag, then “emacs-commands”, and finally “emacs topics” to traverse through the all emacs-related posts. As you can see, this is tiresome. Some may argue that this issue can be fixed if we add a theme tag, like “emacs” to every emacs-related post. This can workaround the problem but we need to be very careful about how we choose the tag.

That being said, however, I think making technical blog post’s advantage outweighs its disadvantage. Most important thing about writing blog is that you can keep track of your progress daily. I have been making my personal knowledge base for more than a year. I’m pretty satisfied with what I have accumulated so far. However, there are some caveats. Off top of my head,  I need more direct visualization of my progress. I want to see if I can pick up something new daily and that will give me a lot more motivation to push myself to learn more each day. In addition, some page can get enormously long. I have a page called “Collection of algorithms”, which takes like forever to scroll from top to bottom. This page is naturally suitable for blog post by making each algorithm into a post. Lastly, for marketing purpose. Making a technical blog post daily is an enormous effort and it will definitely look good for hiring managers – “Zeyuan is passionate about technology!”. Even though I can showcase my private knowledge base, that is much more inconvenient than a blog post that can be accessed by everyone.

You know you have a wordpress blog, right?

Yeah, I know. WordPress.com is a great place to write blogs. It makes me want to write something each month. However, I want to say writing technical posts on wordpress.com is sort of painful. In general, if you need to insert code, you need to work with “HTML” editor and use

[sourcecode language="csharp"]
//your code comes here
[/sourcecode]

This format is incompatible with reStructuredText. That means the content written here cannot be rendered using Sphinx-doc engine. That essentially makes the portability of the post to the minimum. In addition, I experience some weird bug when insert the source code. Inside “Sqoop2 7 Minutes Demo with DB2” post, I use lots of SQL statements inside the code block. Everything works fine the first time. However, when I try to update the post, all the quotation marks get translated to the html representation under “Visual” editing tab. I have to manually update all the quotation marks to its symbol form.

However, on the other hand, wordpress.com blog works best with photos. I really enjoy how wordpress.com manage photos in my “Trip in Nan Jing” and “First Time Ever Hackathon” posts.

What should use?

This is the question that I have been looking into for months. Let me list my expectation first:

  • Support reStructuredText
  • Static pages should be fine and must feel LIGHTWEIGHT
  • Support basic blog functionality – tags, categories, post date

Let’s list out what commonly-seen options are:

  • Octopress
  • hexo
  • Sphinx-doc
  • Pelican

Let’s cross Octopress first. I don’t want to touch Ruby ecosystem because the workflow to develop a personal site will completely be different. In addition, there are plugins that let Octopress support reStructuredText but I figure that doesn’t offer full capability like Sphinx-doc does.

Hexo is another beast. It is similar to Octopress in the sense that both of them use Markdown as their major language and it is also rooted in front-end world. I admit Node.js is a pretty cool language and the theme provided by Hexo looks amazng. However, it feels too modern to me and I want to take a little bit more conservative path.

I want to talk about Sphinx-doc and Pelican all together. Sphinx-doc is the foundation to write Python documentation and Pelican is based upon Sphinx-doc and tweak towards blog post. Tweaking Sphinx-doc towards a blog post site requires a fair amount of knowledge of Jinjia2 and some Javascript knowledge. I don’t have enough time for that. So that makes Pelican very tempting. However, I don’t want to use it for now as the workflow and how to control the Pelican behavior is quite different from Sphinx-doc. However, it is definitely worth revisiting in the future just because the size of community and plenty of themes.

The engine I choose is called Tinkerer. It’s the direct derivative of Sphinx-doc and the workflow doesn’t deviate from the standard Sphinx-doc workflow too much and the setup is quite the same. However, the latest release is back in 2014. That makes me a little worry. But let’s stick with this for now. I can easily switch to Pelican whenever I want to.

What does look like right now?

I’m not going to abandon this wordpress blog. However, I’m going to tweak the direction of the blog posting on this site. From now on, I will keep technical blog post to the Tinkerer-powered site. By technical, I mean that involves code, mathematical expression, and the content that will be part of my knowledge base doc eventually. However, on the other hand, all the life thoughts and any other non-technical posts will still be kept at this site, especially those posts involve lots of non-technical pictures. zhu45.org will still work as the main portal to my personal site until the new Tinkerer-powered site is GAed.

 

Under Construction (part 1/2)

Well, it’s almost the end of November and I haven’t posted anything yet this month. I feel pressured. Lots of things are actually going on this month, which involve my cat, my future career path, and some mentality struggles. I’m not gonna write about them right now because it’s not the right time. Hopefully, I can write about them someday in the future. I cannot promise this.

Today, I can actually write about is my “little” project I just started – building a site. Let me take a break here and give a quick recap. Up till now, building a site or, more specifically, a personal website seems to become my life-long effort.

Everything gotta start from somewhere …

If you are willing to take a quick look, Welcome to Zeyuan Hu’s World! is actually my starting point. I built this site during the summer time of my sophomore year at college. By that time, I was granted a summer research position and my main duty was to help out professor to do computer simulation of some mathematical models. I was just done with my first cs course – CS 302: Introduction to Computer Programming with Java and I heard from my uncle about this Python language. I have no clue among all those programming languages, Python caught my eyes during that time: probably, because of its cool name. So, I decided to give a serious shot at learning Python programming language both for my research job and for my personal interest.

It’s pretty damn hard to pick up a natural language by just skimming the textbook. This also applies for programming language. So, I decide to make notes while I study the Python. Making notes on paper is an option but paper is really hard to carry and easy to lose. When I read about the Python API, I’m pretty impressed by the elegance the Python official website shows and how it can keep everything so well-organized. So, I decide to use Sphinx-doc to make my own site with the main purpose to host my Python study notes. That’s how everything got started.

Making this site allows me to pick up CSS, reStructuredText, Python, and Matplotlib (which I used the library to make the header image of the site). If I evaluate this site under my current level, it’s still a solid website. Even though the aesthetic level may not be high, but the site serves my main purpose well – I got a well-written Python study notes for myself. The downside of this site comes from my personal feeling. I’m definitely not a front-end expert but I can tell this site feels quite HEAVY. There are multiple tabs (i.e., home, projects, cv/resume, and so on) but they are sort of hidden in the way that people have to click through the tab to find the content. This works great if I have accumulated tons of material to demonstrate. However, we both know, this is not the case. Another point is more of a pity. I can no longer maintain this site. My wisc computer lab account id got deleted after I graduated. I still feel grateful that school still hosts my site and that’s very moving policy I have ever encountered.

If you are interested, the source code for this site is hosted on Github. Feel free to check it out and there are some advanced reStructuredText and Sphinx usage in the repo.

Struggles …

My second journey with my site starts shortly after I begin to work at IBM. During the work, I feel the need to build my personal knowledge base. This is mainly  due to the amount of new information you need to keep track of for future use. I feel this is super important for someone who just joined the team and eager to make contribution to the project. At first, I use Sphinx-doc to build up my personal docs like I did previously. This doc or site is not available to public because there is no clear cut between confidential and non-confidential information. For instance, when you try to record the gain from code reading, even though C++ technique is perfectly fine for the public but if we put it under the context of DB2 source code, then the technique combined with DB2 source code is completely confidential. I don’t want to cross the line here. However, every coin has two sides. If some knowledge is completely fine for the public, then I want to share them out. So, building a website once again becomes a task that I need to finish.

I don’t want to get serious engagement with web development (even though I made a try) and on my preference ranking list, easily recording knowledge in a neat way comes first. I don’t want to spend tons of effort working on the web as I feel like understanding database and mastering algorithm sound more fun to me.

Under this criteria, I made some prototypes. The first one is built upon the previous iteration (sphinx-doc web) but incorporates bootstrap framework to make the whole site mobile-friendly and have a modern taste. This one is definitely better than my previous one. I liked it a lot when I first got the prototype out. However, I quickly felt the pain or the pressure of the writing, more precisely. The main purpose is to document my knowledge picked up from the work and it meant to be less reader-friendly. I have to edit some writings in order to make everyone online can make sense out of it. Plus, make distinguish between confidential part and non-confidential part can make this editing thing even worse. More importantly, I still think this site is HEAVY, especially for Chinese because it takes quite a while for everything loads up properly (google-analytics, github servic, and so on). So, I give up. I’m not giving up making my knowledge base but  I give up to make everything public. Meanwhile, I also cleaned up the very first web and adjusted the style a little bit. Unfortunately, this site doesn’t attract me as well.

Before the dawn …

The site I’m currently using is absolutely hideous for my friends (believe me, none of my friends really like it). The main argument they is that this site is just plainly 70s style: it doesn’t even have tabs. That’s true. In fact, that’s kind of the style of I’m seeking: everything is just there. You don’t have to dig the site to find the things you want. Plus, it is super LIGHTWEIGHT. I think this site is perfectly fine for someone who works in the academia: all the publications can be directly seen. In addition, I use a “hybrid architecture” for this site in the sense that I use this site + wordpress blog to host all my writings. By that time, I couldn’t find a proper solution to build a blog post website based on Sphinx-doc and I will talk more about this in my part 2 post.

Putting everything straight out to the reader works well for academia in my perspective. However, it becomes a downside as well: I’m working in the industry. I barely make publications. So, this site becomes really static :(. I hate to actively maintain a site but never update one does no good to me as well. This is actually why I write “This site is permanently under construction.” in my footnote of the page. I know someday I will once again to rebuild this site.

TO BE CONTINUED …

Chronological order of my work on building a homepage

First Time Ever Hackathon

I have never been to hackathon before. In my imagination, hackathon is like a festival for people who have passion about creating cool stuff during a limited amount of time. Hackathon is going to bring tons of fun if you can work with dedicated people on some interesting idea and try to make the idea into reality. Luckily, Early Professional Hire (EPH) Hackathon event hosted by China Development Laboratory (CDL) meets my expectation perfectly.

Experience Recount

When I first time heard about the hackathon on EPH Day 5, I’m a little intimidated. This is because the solution demo you present at last needs to fit in certain hackathon topics. The topics include business platform innovation, light-weight e-commerce reinvention, blockchain, application of Watson technology into medical services, and integration of Watson technology with 3D demonstration. All these topics are super cool, very advanced and I barely have a chance to get touched in my day-to-day work. So, I’m not sure if I can handle those topics well. However, I want to give this hackathon a shot: not only because this is part of EPH program but, more importantly, one of the topics really catches my eyes- that is, the integration of Watson technology with 3D demonstration.

This topic actually has a name, which is called Watson Introspector. It is a cognitive tool for understanding software, answering questions, and interacting with software architecture and data flows in 3D.  This topic suits my interests perfectly. It has always been a challenge for new comers to study a code base especially when the code base has been evolved for several years. What sets of functions or data flows get involved in certain feature of the software has always been the type of questions we are asking all the time, especially when bug fix or enhancement request kicks in. Conventionally, there are tools to help us to visualize the code path like debug trace, UML graph, and so on. However, none of them are straightforward and fun to use if we put them under new comer education context. I can hardly imagine some guy will choose staring at the debug trace on Friday night over going out for a date. So, I think the visualization of code path in 3D and get some question answering system integrated (like Watson) may probably make our software developer’s lives much easier.

After I set the topic that I want to work on the most, the next step is to get a team. Originally, there are four team members besides me within the team. All of them don’t have any experience with any technology involved in the topic. This is perfect because neither do I. However, all of them are testers, which make my situation a little bit difficult. This means that  I am the only developer in the team and I will take much more responsibility than I thought I would. But, that’s OK because my teammates want to grow with me and want to give the hackathon a try. So, I become the captain of the team.

Everything goes quite well for the first two weeks. We narrow down the architecture of the solution: we are going to make a 3D space game just like the classic snake game. In the study mode, player is free to explore the 3D world, which is constructed from the classes and functions parsed from a random-selected Java project. Inside the world, the player can interact with Watson on what kind of feature he wants to learn and Watson will return a code path that best meets the player’s request. Then player can spend his time getting familiar with the call stack of the functions along with the purpose of each function. In the test mode, the player is required to visit the functions he just studied in the correct function call order so that he can win the game. Our goal of making this game is to offer a fun way to learn about a source code project and we believe educational game best suits our needs.

Then, on September 14th, everything is just changed. Once we settle down the architecture of the solution, two members decide to quit with the excuse of limited time. I got this feeling that someday they are going to quit but I didn’t expect this timing. They refuse to work on the solution outside of the work time. This makes me quite frustration because anyone who attends hackathon should expect that he is going to spend fair amount of time outside of work to finish a demo. Even worse, this means we probably don’t have enough resources to finish our solution. It looks like mission impossible with only one developer and one tester left with the team.  But, a second thought comes into my mind. I work as the president of IBM Diamond & Ring Toastmasters Club. One important lesson I learn from it is that as a leader, the first priority task is to take responsibility and get the job done no matter how difficult the situation is. Under my current situation, my goal is to at least finish this hackathon, and I need to make this happen. Plus, I’m not alone: I still have a teammate, Rachel, who wants to give out all she has in order to succeed in the event. I just cannot let her and myself down.

In the final week,  we work super hard with our adviser, Trent, in order to get a demo working.  Even during Mid-Autumn Festival holiday, we still come to the office around 2 pm in the afternoon and hack through the rest of day to 1-2 am. That has been the theme for the whole week. On Tuesday, September 20th, right before the final day, we work over 30 hours to 4am, September 21th to do bug fix and 3D modeling. Rachel and Trent live closer to the office, so they rush home to get some sleep. I, however, live really far away from the company (I live southern 4th ring of Beijing) and unfortunately, I have to take a nap at the office coach to avoid being late for the team show order decision draw happened four hours later.

Even we almost live inside the company, we still haven’t finished our demo on the final day morning. There is some performance issue with our game during the launch phase: since we talk with Watson at the same time the game assets are loading, the framerate of the game drops significantly. In addition, we haven’t figured out a way to grow our character body just like the classic snake game. These put a lot of pressure on me because there isn’t enough time to fix everything in a nice clean way. However, we somehow manage to finish all this by the demo time. We adopt agile practice. Maybe we cannot fix these problems nicely but we can definitely walk around the problem just for the demo’s sake. We do incremental world object construction during the game loading phase: we only load the objects that player can actually see through the camera and we use a big skybox to block unloaded part of the world from the user. For the snake body problem, each time the character hits the target, we put a sphere behind the character and we somehow manage to let the newly added part follow the movement of the character. Maybe the movement doesn’t mimic the snake body movement nicely but for the sake of demo, that’s enough.

During the demo time, everything works well. Thanks to the public speaking practice I have kept doing at the toastmasters club, I delivered a successful presentation to the senior management level at the lab and we obtained 2nd place in this hackathon with the fewest team members among all the contest teams.

Here are some interesting stats that are worth mentioning:

  • We have 0 experience with the technology stack of the hackathon
  • We originally have 4 team members but down to 2 halfway through the event
  • We only have 1 developer and 1 tester eventually
  • We only have 1 person with sufficient programming experience
  • We work to the super late nights for at least 3 days
  • The longest non-stop hacking lasts for 30 hours
  • We consumed 50+ bottles of water and bags of snack
  • We watched 60+ hours video tutorial on YouTube and safaribooksonline.com
  • We write 2500+ lines of code for the demo

 

Hackathon Takeaway

Always remember you’re the captain

There are couple of times I want to quit the event. Thankfully, I don’t actually do that because I always remind myself I’m the captain of the two-man squad. When you set a goal to meet, you have got to do whatever you can to reach that goal and get the job done. Thanks to this hackathon event, I can now clearly see this point.

Stay positive during the difficult situation

There are downtime during the whole hackathon event. Face the technology we only have never actually experimented with when we enter the contest; Two of the original team members leave the team; Unfamiliar with the development tool; Debug the code to the late nights … All these things can drag the moral down pretty quickly. However, I’m the leader, I cannot do this. So, even in these difficult moments, I try to call the team for a short break and entertain ourselves by coming up jokes or have some random chats. These techniques work amazingly well because we don’t feel stressed and we can actually enjoy the whole problem-solving process. Without fun, the hackathon will never be the same to me.

Don’t be afraid of making the tough call

To be honest, I’m the chief solution architect of our demo and sometimes I have to make some tough calls, especially when both options look tempting. For instance, implementing the game like the classic snake game or like Super Mario are both good options. However, if we put time and various other resource limit under consideration, two options cannot be the same. Super Mario has its advantage in 3D exploration and the snake game reflects the core idea of our solution – make the body grows with the code path. I have to make the tough call on what path we want to pursue and I have to say, it’s always not easy.

Motivation and hardworking is the key to success

Motivation will give you the courage to take your first action but only can hardworking make you reach the goal. In this hackathon, I feel lucky that I follow my interest and choose Watson Introspector as my topic to work on. My interest provides me enough motivation to power through the whole event. However, I know that in the deep of my heart, I want to win and I want to my demo to reflect the technical expertise we equip. That requires hard working. Thankfully, we don’t let ourselves down and we work super hard towards our goal. I’m so glad we finally make it.

Public speaking is crucial

I feel our solution may not use all these fancy technologies like some other group does but I feel the public speaking or the presentation skill well compensated for this “disadvantage”. During the presentation, I use numbers from above to define our hackathon experience with the speech style like Steven Jobs and our beloved CEO, Ginni Rometty. This delivers a concrete message to the audience: we come here to win and we deserve it. I want to give the judges the feeling that we are 120% confident with our solution and more importantly, we feel proud of it.

All in all, this hackathon has become one of the moments that I’m extremely proud of myself as an IBMer and I do learn a ton from it: not just technical stuff but how to be a leader as well.

I want to end my post with the slogan of our project:

Evolve ourselves, beyond the limit!