## Sunday, May 03, 2015

### Working Environment & Personal Career - My Fourth Week in Cambridge

My fourth week in Cambridge was filled with all kinds of work as usual. However, the time spent on life-related and administration-related stuff has been significantly reduced, so I can spend much more time on pure research and reading books.

Looking back at what I have done lately: got the scandex paper accepted and gave a talklet in the Lab; finished the scoped-flooding draft; finished all the simulation work on different topologies; set up the website for N4D Lab; set up the website for ACM DEV; finalised the user scenarios in the UMobile project; got familiar with the docker system and started the system design; even finished a history book, and so on. Working efficiency seems very high here in the Computer Lab; I do not know if it is because of pressure or because I simply feel happy to work more.

The atmosphere in the Lab is really nice. The Computer Lab at Cambridge University is the biggest computer science department I have ever seen, and is well-known for its Systems Research Group. In terms of working environment and infrastructure, however, I must admit the University of Helsinki certainly wins, maybe because of its much smaller size. As we (system scientists) know, scalability issues generally apply to every system. The University of Helsinki is so generous to its employees that you can hardly find a competitor for it. However, having good infrastructure is not a sufficient condition for being a top university. In most cases, "software" can be as important as "hardware", and we researchers are the software, continuously contributing to the university's research output and building up its reputation.

In my opinion, a good lab/institute should at least have the following characteristics:

• Trust - You simply cannot work (efficiently) in an uneasy environment. Pressure and tension kill creativity.
• Fairness - Reward people for their achievements, no matter how big or small. Being mean simply kills people's motivation.
• Collaboration - A moderate level of competition within or even between groups is fine. However, collaboration is more important, since nowadays excellent work can hardly be done by one person.
• Promotion - It does not necessarily mean an increase in salary or a change in position, especially in academia. It is more about opportunities. Well, it might be a big problem for Finnish universities, and hopefully not for here.
• Flat structure - A flat structure makes communication very efficient. However, it is not going to be the case in Cambridge due to its super long history. The good thing is that the university is at least very liberal and democratic.
• Be nice - Last but probably the most important one. Good universities (should) never treat their researchers like slaves, especially those with a high throughput of incoming and outgoing scholars. Building up a network is the key to boosting both personal careers and institutional reputation.
So far, the Computer Lab seems to be doing a great job from my own perspective. I do hope my following years here are going to be productive, and more importantly - happy. :)

My first talklet on service caching in the Computer Lab of Cambridge University. The audience was quite nice and gentle to me during the talk, probably because I am new and junior in the Lab. :)

## Saturday, April 25, 2015

### Walking Around in the City and History - My Second and Third Week in Cambridge

Three weeks have passed; I bought a bike, got a SIM card, opened a bank account, received my first month's salary, and even made a business trip to meet the UCL folks in London. I think I have really settled down in Cambridge, except that I occasionally deploy very bad Scandinavian humour at untimely moments. I am getting more and more familiar with everything here, even with that peculiar accent.

Besides always being overwhelmed with research work, I finished a very thick book on British history. The book is in Chinese, almost 400 pages, and covers the history from 6000 BC to 2000 AD. Due to this large time span, the book cannot really dig into details. Luckily, the author is a very good storyteller. Instead of stating "cold" facts, she narrated a lively and coherent story. It is quite funny that history used to be the subject I hated most when I was in middle school and high school. I remember clearly complaining to my mom every day about having to remember so many names, and asking why we had such a long history. However, things started changing as I got older. I realised you simply cannot get rid of history, and history always repeats itself. I sometimes make fun of myself in front of my colleagues -- "Well, you know, maybe it is because I am becoming part of the history now ... fading away ...."

Apart from the Friday happy hour (a long tradition in the Computer Laboratory at Cambridge University), when I usually go and grab a pint, I did not really visit pubs or bars. But I did visit the Fitzwilliam Museum, and King's College and its famous Chapel. Recall that I told you about the importance of your university card in my previous post. Because of the card, I am not only able to visit these places for free but can also take two guests free of charge. Lovely!

The Fitzwilliam Museum and a corner of King's College.

Magnificent view of King's College and its famous Chapel.

People punting on the famous River Cam in such lovely weather.

I certainly have to mention my first visit to the Cambridge University Library. In total, Cambridge University has over one hundred libraries, and every faculty may also have its own. The main university library is very close to the Lab, less than a five-minute bike ride. To get into the library, you need a university card; I think the public can also get in as long as they apply for a card at the registration office. I was completely impressed by the historical atmosphere the moment I set foot in the hall. Such a sharp and strong contrast to the Kaisa and Kumpula libraries, which are exemplars of modernism and functionalism in Scandinavian design.

The Cambridge University main library; it feels like being in a Harry Potter movie.

One last side note: the water in Cambridge is really, really hard. The moment I opened the kettle lid in the kitchen, I was impressed again, just as I had been by the library. I made fun of my British colleagues - "Are you growing (sea) fish in the kettle? This definitely looks like coral ...." However, I have been drinking it anyway, since everyone else drinks it. You know, "When in Rome, do as the Romans do".

## Sunday, April 19, 2015

### Settling Down into Cambridge

During the four years of struggling with my PhD research, I had always been planning to take a long holiday after getting my PhD hat: a holiday of at least three months, doing nothing, just keeping my brain blank and relaxed, or maybe travelling around Europe. But life always surprises you. After I passed my PhD defence, I got only about a one-week break, then hastily moved to Cambridge, UK to start my postdoc research in the Computer Laboratory.

Finding accommodation in Cambridge is not difficult, but finding a good one at a reasonable price is quite tricky. Eventually, I got a bedroom with a family, and the rent costs me about 550 EUR per month. The downside of living with a family is obvious -- privacy. Since I am not very social and my life is rather simple (maybe even a bit tedious and boring), privacy is not really a problem for me. On the other hand, the advantage of living with a family is also significant: you always have someone to ask, and it is faster to get familiar with the local society. My host family, Mr and Mrs Kell, are typical middle class, gentle and kind. Mr Kell gave me a lift when I first arrived in Cambridge, and even kindly invited me to their Easter dinner.

The first week is always difficult. I was totally overwhelmed with all kinds of administrative and work-related stuff. Though the HR office in the Computer Laboratory did a great job in assisting me to settle down, there were still a lot of things that were not written in the instructions and that I had to handle myself. In the following, I will list some important things you might want to know before moving here, so that your migration will be much smoother than mine.

Transportation

Cambridge is not big at all, especially the university area. In fact, you can go everywhere on foot if you do not mind a 30- to 40-minute walk. Walking is definitely OK for the first couple of days, but eventually you will need a bike. There are many cyclists in the city, but some roads do not really have a well-defined bike lane, so you need to be very careful, especially before you get familiar with the town. My landlord told me local drivers do not really like cyclists. As for the bike, I got mine from a shop on King's Street: 135 pounds with a three-month warranty. It is second-hand but seems to be in good condition. Later, I noticed you can also get pretty good and cheap ones on Amazon UK.

Bank Account

If you come to Cambridge to work, like me, a bank account is definitely the first thing you want to have; the university needs your account number in order to pay your salary. I almost missed the payroll deadline for my first month. In order to open an account, you need a current UK address; the department will issue a document stating your current address and showing that you work for the university. Besides, it is better to have a UK phone number. Which bank to choose? They are more or less the same. I chose the one recommended by both my landlord and the department -- Barclays.

National Insurance Number

You need to apply for a National Insurance number if you want to use the local health service. First, you need to call the centre for a short interview; then they will send a paper application by post. You can find their phone number [here]. Remember, you definitely need a UK phone number before calling them, since they require one for the registration. The paper application form will arrive within 10 working days; fill in the form and send it back. There is a pre-paid envelope in the application package, so you do not have to pay. This part is pretty easy.

UK Mobile Phone Number

This part really depends on what kind of user you are. If you are not an active user, you can just order Pay as You Go, which does not have any monthly fee. I am a heavy Internet user, so I chose giffgaff, which is considered a cheap option in the UK. Of course, BT is also good in the sense that they have a lot of wifi hotspots in the city; you can always use BT's free wifi if you are their subscriber. In general, the prices are not very high, but it is definitely cheaper in Finland. :)

University Card

You will get the card from HR, so you do not need to do anything except email them a nice photo of yourself before coming here. Note that this is not just a card for opening doors; you will realise how useful it is in the future. As my landlord told me, Cambridge University is (arguably) the most powerful and influential entity in this town. ;-)

Then What Now?

Well, it totally depends on you now. As for me, I will focus on my work, do interesting research, and enjoy life at the No.1 European university. :)

## Wednesday, April 01, 2015

### Book Review of "Apache Hive Essentials"

I am currently reviewing the book "Apache Hive Essentials" published by PACKT. PACKT is a pretty young publishing company based in Birmingham, UK. The company has a very interesting and agile publishing model called Print on Demand.

Given the hyped data science and big data buzzwords, the topic this book covers is definitely relevant and important to big data practitioners. The author appears to have long, solid experience in the industry, which has given him much practical knowledge of the subject. Having quickly skimmed through the book, my first impression is that it has a broad coverage of Apache Hive, ranging from the basic setup to security, data manipulation and a detailed explanation of the grammar, complemented with relatively straightforward examples.

My current feeling is that, as a thin book of 200 pages, it does quite a decent job. Of course, I will read the book more carefully later this month and then post a more detailed review on my blog.

## Tuesday, March 24, 2015

### The Philosophy of PANNS -- A Very Preliminary Evaluation & A Not-So-Fair Comparison (maybe)

Though I promised a long time ago [here] to perform a simple evaluation of PANNS, I only got it done today, due to my overwhelming workload before moving to Cambridge.

I. Caveat

Before we start, I must clarify several things. The purpose is not merely to deflect criticism, but to emphasize the context and goal of making PANNS as a small tool.

First, frankly, a large part of the hyped data science deals with data engineering. However, it is well-known that there is no silver bullet in the engineering world, which means a good engineer has to look into a specific application area to determine the proper tool. Different data sets may lead to different conclusions when you are evaluating your machine learning or data mining algorithms, as we will see in this post.

Second, I only evaluated the accuracy and the index size here. I admit that PANNS will be beaten miserably by other software in terms of index building time. However, we assume that the index will not be built frequently; once built, it will only be queried. Besides, the index should be as small as possible so that it can work on large data sets. Please check "The Philosophy of PANNS" for more details.

Third, I only compare against Annoy in this post. I am not saying the other candidates are bad; it is simply that such comparisons already exist, and Annoy seems superior to the others in many aspects. In fact, Annoy can be an order of magnitude faster than PANNS in terms of speed. But you will also see how PANNS wins out in other aspects in the rest of this post.

II. Evaluation

The data set used in the evaluation is synthetic. Each data point is 2000-dimensional and follows a standard normal distribution $\mathcal{N}(0,1)$. The index contains 256 RPTrees. The accuracy and index size are measured, and the numbers presented are the averages of 50 experiments. The two tables below summarize our results, one for Euclidean similarity and one for Angular similarity. In most cases, I prefer tables to figures when presenting numbers, so I did not bother to plot the results.
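As a side note, such a synthetic data set is easy to reproduce. The sketch below is only illustrative: the function name, seed and sizes are my own choices, not the exact evaluation script.

```python
import numpy as np

# Illustrative sketch: each point is 2000-dimensional, i.i.d. standard
# normal N(0, 1). The function name and seed are hypothetical.
def make_dataset(n_points, dim=2000, seed=42):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_points, dim))

data = make_dataset(5000)
print(data.shape)    # (5000, 2000)
```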

| Data Set Size | 5000 | 10000 | 15000 | 20000 | 25000 | 30000 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy - PANNS | 75.0 % | 51.2 % | 39.2 % | 30.4 % | 27.2 % | 25.2 % |
| Accuracy - Annoy | 58.2 % | 38.0 % | 29.4 % | 23.2 % | 20.6 % | 18.0 % |
| Index - PANNS | 29.6 MB | 59 MB | 88 MB | 116 MB | 149 MB | 174 MB |
| Index - Annoy | 56.0 MB | 112 MB | 169 MB | 224 MB | 279 MB | 334 MB |

Table 1. Comparison between PANNS and Annoy using Euclidean similarity.

| Data Set Size | 5000 | 10000 | 15000 | 20000 | 25000 | 30000 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy - PANNS | 75.0 % | 53.6 % | 36.0 % | 37.0 % | 27.0 % | 25.0 % |
| Accuracy - Annoy | 65.0 % | 36.8 % | 26.4 % | 26.4 % | 19.2 % | 17.2 % |
| Index - PANNS | 29.6 MB | 59 MB | 88 MB | 116 MB | 149 MB | 174 MB |
| Index - Annoy | 35.0 MB | 70 MB | 93 MB | 140 MB | 159 MB | 188 MB |

Table 2. Comparison between PANNS and Annoy using Angular similarity.

From both tables, the first thing we notice is the difference in index size. PANNS is able to save much more space when using the same number of RPTrees. In the case of Euclidean similarity, the PANNS index is only half the size of Annoy's. This benefit becomes even more noticeable when dealing with extremely large data sets and many RPTrees. From these results, we can understand why PANNS lost quite badly in some evaluations where the index only had a small number of RPTrees.

In terms of accuracy, PANNS consistently outperforms Annoy in all cases in our evaluation, though the difference starts diminishing as there are more and more data points. Moreover, since PANNS uses much less space for storing the index, we can incorporate more RPTrees in the index for the same index size to achieve better accuracy. However, it then becomes difficult to argue whether the comparison is still fair.

In terms of building time, it is true that Annoy is faster. However, I must point out this only holds for serial execution. PANNS provides parallel index building to take advantage of multiple cores on your workstation. In my case, PANNS turned out to be much faster than Annoy because I have 16 cores on my workstation. I hope this does not count as cheating. ;-)

In terms of scalability, I also tried building the index from the English Wikipedia dump, which consists of approximately 3.8 million documents. PANNS was able to get the job decently done (though it took a long time), whereas Annoy always failed due to memory issues. However, further investigation is definitely needed.

III. Inconclusive Conclusion

Well, Erik already gave good summaries on "what to use in which context" in his post. I only provide some complementary advice in the following.

In general, PANNS generates a much smaller index without sacrificing accuracy. This becomes more important when you are dealing with large data sets and still want to incorporate as many RPTrees as possible to achieve satisfactory accuracy.

The future trend is parallel computing. We need to squeeze all the computational resources out of our devices. The slowness of PANNS can be effectively ameliorated by parallel building and parallel querying on multiple cores (or even across the network). Please check out our paper on parallel RPTree building on Apache Spark [here].

We are still carrying on the research work on improving PANNS's accuracy. As mentioned, this relates more to data engineering. There are even better algorithms which can outperform PANNS by 20% to 100%, of course at the price of even longer building time. In the future, we will gradually incorporate these algorithms, balancing efficiency and accuracy. All in all, PANNS is going to remain as simple and compact as possible for teaching purposes, as we already mentioned in the [previous article].

## Saturday, March 07, 2015

### The Spherical Cow in System Science - Matrix Multiplication on Apache Spark (Part III)

As mentioned in the previous [blog], we introduced the big data framework into our teaching in 2014, and we specifically chose to use Apache Spark in 2015. Besides calculating basic statistics, we also asked the students to do something more skillful -- matrix multiplication. The detailed assignment description can be found [here].

I. Two Simple (But Representative) Matrix Multiplication Problems

In the assignment, we gave the students a tall and thin matrix of size $10^6 \times 10^3$. Let's call it matrix $A$. The matrix $A$ has one million rows, and each row contains one thousand double numbers; in total, there are one billion double numbers. We asked the students to calculate $X$ as below:

1. $X = A \times A^T \times A$
2. $X = diag(A \times A^T)$

Solving large linear equation systems is very common in both engineering and scientific fields, and knowing how to manipulate matrices is a basic skill for engineers and scientists. These two examples, though seemingly simple, are quite representative of scientific computing. So how do we actually solve them?

II. Order Matters

Let's start with the first question, $X = A \times A^T \times A$. There are two ways to calculate $X$, namely $X = (A \times A^T) \times A$ and $X = A \times (A^T \times A)$. Mathematically, they are equivalent in the sense that they give the same result. However, from an engineering and systems perspective, there is a big difference! Why?

Let me first ask you: "What is the size of $X$?" Yes, it is exactly the same as $A$, namely $10^6 \times 10^3$. No matter whether we compute the first product first (i.e., $A \times A^T$) or the latter one first (i.e., $A^T \times A$), the intermediate result needs to be transmitted across many nodes to finish the calculation and give us an $X$ of the same size as $A$. However, the two intermediate results are quite different, as the figure below shows:

You must get the point after seeing this figure, right? Let's call the intermediate result matrix $B$. We can see that $A \times A^T$ leads to a huge $B$ of size $10^6 \times 10^6$ in case I, whereas $A^T \times A$ leads to a very small $B$ of size $10^3 \times 10^3$. Now, you tell me: "Which matrix is easier to distribute across the network?" There are going to be several orders of magnitude of difference in the network traffic. If the matrix $A$ is very, very tall (i.e., contains a huge number of rows), it may not even be possible to finish the calculation of $A \times A^T$. So, the lesson we learnt here is: "Order does matter!"
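The size difference is easy to verify numerically. Here is a toy numpy sketch, shrinking the real $10^6 \times 10^3$ matrix down to $100 \times 10$ so it runs instantly (the sizes are purely illustrative):

```python
import numpy as np

A = np.random.rand(100, 10)      # tall-and-thin toy stand-in for A

B_big   = A @ A.T                # (A x A^T): a 100 x 100 intermediate
B_small = A.T @ A                # (A^T x A): only a 10 x 10 intermediate
print(B_big.shape, B_small.shape)    # (100, 100) (10, 10)

# Both orders yield the same final X, via wildly different intermediates.
X1 = B_big @ A
X2 = A @ B_small
print(np.allclose(X1, X2))           # True
```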

III. Decomposition and Aggregation

We know that a smaller $B$ is more beneficial, but how shall we proceed to compute $A^T \times A$? We know Spark cuts $A$ into many chunks and stores them separately on different machines. Let's call these chunks of $A$ $A_1, A_2, A_3, A_4 ... A_m$. Note these chunks have the same number of columns (i.e., $10^3$) but may have different numbers of rows; in other words, their chunk sizes may differ.

Now, let me ask you another question: "Choose any $i \leq m$; what is the size of $A_i^T \times A_i$?" Yes, they are all the same, namely $10^3 \times 10^3$. Actually, each one is just a partial result of the final $B$. To calculate the final $B$, we just need to aggregate these partial results and sum them up as below:

$B = \sum_{i=1}^m A_i^T \times A_i$

After having $B$, getting the final result $X$ is trivial. We just need to broadcast $B$ to every node, then finish the calculation on every chunk as below:

$X_i = A_i \times B$

Assemble (more precisely, concatenate) all the $X_i$, and you have $X$.
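The whole scheme is easy to prototype on a single machine. Below is a minimal numpy sketch of the decompose-and-aggregate logic under the same chunking assumptions; in Spark, the sum over chunks would be a reduce over the row partitions, and the final step a map with $B$ broadcast to every node.

```python
import numpy as np

A = np.random.rand(1000, 10)          # tall-and-thin toy matrix
chunks = np.array_split(A, 7)         # row chunks; uneven sizes are fine

# Aggregate the partial results: B = sum_i A_i^T x A_i (10 x 10).
B = sum(Ai.T @ Ai for Ai in chunks)

# "Broadcast" B and finish each chunk: X_i = A_i x B, then concatenate.
X = np.concatenate([Ai @ B for Ai in chunks])

print(np.allclose(X, A @ (A.T @ A)))  # True
```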

IV. Getting Diagonal Elements

I was actually a bit hesitant about whether to give the second question at all; I thought it might be too easy for the students. However, the truth is that most of the students actually tried to compute the complete matrix $A \times A^T$, which is so big that it cannot even fit into normal memory, and then extract the diagonal elements.

When the students brought such solutions to me and complained about the memory issue, I just asked them to do one thing: "Can you tell me the definition of the diagonal elements and describe how you calculate them?"

They usually started with -- "Well, the first row of $A$ dot-multiplied by the first column of $A^T$ ...".

Then I continued to ask -- "Then what is the first column of $A^T$?"

Suddenly, the students got the point -- "Haha, it is still the first row of the $A$!!!"

So, you do not need to calculate the whole matrix in order to get the diagonal elements. You just need to process the data set line by line; each row gives you one diagonal element.

$X_{i,i} = \sum_{j=1}^{1000} A_{i,j}^2$
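In code, the row-by-row computation looks like this (a numpy sketch; the full product is never materialised):

```python
import numpy as np

A = np.random.rand(500, 100)    # toy stand-in for the tall matrix

# Each diagonal element of A x A^T is the squared norm of one row,
# so a single line-by-line pass over the data suffices.
diag = np.array([np.sum(row ** 2) for row in A])

print(np.allclose(diag, np.diag(A @ A.T)))   # True
```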

V. Common Issues

As with the previous [median example], this set of questions was also very successful. The students quite liked the problem design, even though very few of them actually got all the solutions right. One common issue was that the students tried very hard to control precisely how the matrix/data set is partitioned in Spark. Controlling how the data is divided is not impossible, but it is simply unnecessary for solving the problem, given sufficient knowledge of linear algebra and matrix operations.

Read this series: The Spherical Cow in System Science - Part I, Part II, Part III

## Friday, March 06, 2015

### The Spherical Cow in System Science - Calculating Median Value in Big Data Sets (Part II)

As systems students of the 90s, we remember that the classic distributed systems course usually had strong assumptions about the heterogeneity of nodes' geography and configurations (regarding both software and hardware). Many universities used Tanenbaum's classic textbooks, and the distributed algorithms taught in class mostly dealt with fault tolerance and consistency, and focused on a point-to-point communication paradigm. These algorithms are still very valuable and widely used in various systems. However, industry forces shifted the paradigm to more centralized, more homogeneous, higher-bandwidth high-performance computer clusters. Or, using the more widely known and hyped name, cloud computing.

To catch up with the trend, we introduced the big data framework into the DSP course (check my previous blog) in 2014 and 2015. We chose Hadoop for the 2014 course and "upgraded" to Spark in 2015, but the questions remained more or less the same. Compared with the previous example I introduced, this set of exercises turned out to be a big success. So what did we actually ask the students to do?

I. Finding A Needle in A Haystack

The power of a big data framework lies in quickly processing and analyzing huge data sets by taking advantage of the computational and storage resources of high-performance computer clusters. In the question set, we gave the students a data set of a very simple structure -- just one double number per line. The data set contains 10 billion lines. For generating the data, I used a Weibull and Gaussian mixture model. Technically, this is not really "Big Data", but I think it is big enough for the students to grasp the idea. We asked the students to provide the following statistics of the data set.

• Minimum value
• Maximum value
• Mean value
• Variance value
• Median value
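For the curious, such a data set could be generated along the following lines. This is only a hypothetical sketch: the mixture weights and distribution parameters below are made up, not the ones used in the course.

```python
import numpy as np

# Hypothetical generator: each line is drawn either from a (scaled)
# Weibull or from a Gaussian component, picked with equal probability.
def generate(n, seed=0):
    rng = np.random.default_rng(seed)
    pick = rng.random(n) < 0.5                       # component per line
    weibull = rng.weibull(1.5, size=n) * 10.0        # shape/scale made up
    gauss = rng.normal(loc=50.0, scale=5.0, size=n)  # mean/std made up
    return np.where(pick, weibull, gauss)

sample = generate(1000)
print(sample.shape)    # (1000,)
```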

By the way, we did not allow the students to use the default functions provided by the framework. We asked them to implement the functions themselves using the map-reduce paradigm (see the figure below). Some students did complain about reinventing the wheel, but in my opinion, as a system engineer, you should know what happens behind the scenes. However, to help the students start, I provided example code for calculating the minimum value.

 A figure summarizes the map-reduce-based data-parallel computing paradigm.
Getting the maximum is very easy by simply tweaking my example code. Map-reduce is very good at handling data-parallel computation: a data set is divided into chunks, then distributed across multiple nodes and processed in parallel. The minimum of the minimums (over all the chunks) is the minimum of the whole data set, and the same applies to the maximum. However, the students need to pay a bit more attention when dealing with the mean value, since the mean of all the per-chunk means is simply not the mean of the whole data set. It turned out some students did make this silly mistake.
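The combining logic can be sketched in a few lines of plain Python. Note how min and max combine trivially per chunk, while the mean must be reduced as a (sum, count) pair:

```python
# Toy chunks standing in for the partitions of the data set.
chunks = [[1.0, 5.0, 9.0], [2.0], [7.0, 3.0, 8.0, 4.0]]

minimum = min(min(c) for c in chunks)   # min of per-chunk minima
maximum = max(max(c) for c in chunks)   # max of per-chunk maxima

# Mean: reduce (sum, count) pairs -- NOT the mean of per-chunk means.
pairs = [(sum(c), len(c)) for c in chunks]      # map step
total = sum(p[0] for p in pairs)                # reduce step
count = sum(p[1] for p in pairs)
mean = total / count

print(minimum, maximum, mean)   # 1.0 9.0 4.875
```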

The variance value needs even more thinking. We know there are (at least) two ways to calculate variance.

1. $Var(X) = E[(X - \mu)^2]$
2. $Var(X) = E[X^2] - (E[X])^2$

The first method is familiar to almost everyone, and indeed most students chose to use it. However, it needs to iterate over the data set twice, whereas the second method only needs to iterate over the data set once. It is worth noting that iteration used to be expensive, especially for Hadoop in the early days. Though Spark has improved things quite a lot, scheduling still contributes significant time overhead to the calculations, so reducing the number of iterations is still a good idea when dealing with big data.
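The one-pass version is easy to express in the same chunked style: each chunk emits a (sum, sum of squares, count) triple, and a single reduce gives the variance:

```python
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Map: one (sum, sum of squares, count) triple per chunk.
partials = [(sum(c), sum(x * x for x in c), len(c)) for c in chunks]

# Reduce: combine the triples, then Var(X) = E[X^2] - (E[X])^2.
s = sum(p[0] for p in partials)
sq = sum(p[1] for p in partials)
n = sum(p[2] for p in partials)
variance = sq / n - (s / n) ** 2

# Two-pass check, E[(X - mu)^2], needs a second sweep over the data.
mu = s / n
check = sum((x - mu) ** 2 for c in chunks for x in c) / n
print(abs(variance - check) < 1e-9)   # True
```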

II. Handling Order Statistic

Things turned out to be extremely interesting when the students started calculating the median value. Similar to the mean value, the median of the medians is not the median of the whole data set. The median is one of the order statistics, like quantiles, percentiles, and deciles. By definition, the median separates a data set into two equal parts. Technically, you need to sort the data from smallest to largest and choose the middle one (breaking the tie if necessary). However, such a conventional way of getting the median is not really suitable when dealing with large data sets.

Can you imagine sorting a data set of hundreds of GB, or TB, or even PB? First, you simply do not have enough memory to do that on one node. Second, you will introduce huge traffic if you try to do it in a distributed way. Well, some of the students did try to sort the data, then quickly got frustrated. At this point, they realized they needed to think differently in the context of big data. What's more, they also realized that map-reduce may not be as straightforward as they had thought.

The students eventually came up with various solutions, which can be categorized as below:

Median of medians - an approximation algorithm which gives a quick estimate of the median. But you still need to sort the data set at least partially, and you only get an estimate, which is not good. For the algorithmic details, please check the [wiki page].

Random sampling - this method takes a random sample from the data set and uses the median of the sample to estimate the median of the population. It may work well on data that follow well-behaved distributions like Gaussian or uniform. However, imagine a data set wherein the first half of the numbers are 1 and the other half are 100: the true median is 50.5 (averaging the two middle values), yet a random sample will only ever yield 1 or 100. The random sampling method is doomed to fail on such data.

Binary search - if we think about the definition of the median, it is not difficult to notice that the median is bigger than 50% of the data and smaller than the other 50%; this is simply its definition. So we can just guess a number and check how many data points fall on the left (i.e., are smaller than the guessed number) and how many fall on the right (i.e., are bigger). Then we know whether we under- or over-estimated the median, and we can narrow down the range, iterating this search until we find the true median.

Buckets/Histogram method - binary search is already very close to the correct solution. But let me ask another question: how many iterations does binary search need to find the actual median? Easy -- it is just $\log_2(N)$, where $N$ is the size of the data set. For those students who successfully reached the binary search solution, I asked: "Can you further improve the performance by reducing the number of iterations?"

Of course you can! In some sense, binary search uses two buckets and counts the number of data points in each bucket. We can certainly use more buckets and apply the same algorithmic logic to narrow down the range where the median falls much faster. E.g., if we use $b$ buckets, we only need $\log_b(N)$ iterations. So feel free to choose $b = 1000$ or $b = 10000$; the introduced traffic overhead is negligible. This method also has another name - the histogram method. The figure below sketches the crude idea, and there are some other materials on the Internet discussing this method, e.g., this YouTube video [here].
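A single-machine sketch of the idea is below (a hypothetical helper, not the course solution). Each pass buckets the remaining data, keeps the bucket holding the middle rank, and narrows the range by a factor of $b$; in Spark, the per-bucket counts would be computed per partition and summed in a reduce step.

```python
def histogram_median(data, b=1000):
    """Median via iterative bucketing (returns the lower middle element)."""
    current = list(data)
    lo, hi = min(current), max(current)
    rank = (len(current) - 1) // 2            # rank of the target element
    while len(current) > b:                   # until cheap to sort directly
        if hi == lo:
            return lo                         # all remaining values equal
        width = (hi - lo) / b
        counts = [0] * b                      # histogram of current range
        for x in current:
            counts[min(int((x - lo) / width), b - 1)] += 1
        for i, c in enumerate(counts):        # find bucket holding the rank
            if rank < c:
                bucket = i
                break
            rank -= c
        current = [x for x in current         # keep only that bucket's data
                   if min(int((x - lo) / width), b - 1) == bucket]
        lo, hi = lo + bucket * width, lo + (bucket + 1) * width
    return sorted(current)[rank]

# With b = 2 this degenerates into the binary-search variant above.
print(histogram_median([7.0, 1.0, 5.0, 3.0, 9.0, 2.0, 8.0], b=2))  # 5.0
```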

III. Do the Students Like This Exercise?

It turned out this exercise set was extremely successful. From a teaching perspective, the exercises use only simple statistics, which every student can understand with very little mathematical background. The wonderful part is that all these statistics look similar and straightforward in the mathematical sense; however, from a (data) engineering perspective, you need to treat them completely differently. The exercise set reminds the students that porting classic algorithms into the big data context may not be as straightforward as they originally thought.

In the actual teaching, all the students solved min and max without any problem. Surprisingly, over 60% of the students still chose the slower way to calculate the variance. For the median, it is even more striking: over these two years, only three students independently came up with the histogram method without any help from me. As for the students' attitudes, in general they liked the question set very much, the main reason for which, I believe, is their strong curiosity about the hyped big data.

Read this series: The Spherical Cow in System Science - Part I, Part II, Part III

## Sunday, March 01, 2015

### The Popular Abstract of My PhD Dissertation

In less than one month's time, I am going to have my doctoral defense in Helsinki, Finland. In order to register my thesis in the university E-thesis System, I need to provide two different versions of the abstract. One is the scientific abstract, which is mainly for experts working in the related fields; the other is the popular abstract, which targets the public without prior knowledge of the subject and must be written in Finnish or Swedish. I am not sure whether this is a unique invention in Finland, but asking a non-Finnish speaker to provide such an abstract without offering enough help within the University seems a bit ridiculous.

Anyway, thanks to Maria's great help, I got the popular abstract done, though I really doubted at some points whether she knew what she was doing there. :D Interestingly, having no field knowledge at all turned out to be a great advantage when Maria was writing the popular abstract for me. You can find the two abstracts below.

The Scientific Abstract

In-network caching aims at improving content delivery and alleviating pressure on network bandwidth by leveraging universally networked caches. This thesis studies the design of cooperative in-network caching strategies from three perspectives: content, topology and cooperation, focusing specifically on the mechanisms of content delivery, on cooperation policy, and on their impact on the performance of cache networks.

The main contributions of this thesis are twofold. From the measurement perspective, we show that the conventional hit-rate metric is not sufficient for evaluating a caching strategy on non-trivial topologies; therefore we introduce footprint reduction and coupling factor, which carry richer information. We show that cooperation policy is the key to balancing the various tradeoffs in caching strategy design, and we further investigate the performance impact of the content per se via different chunking schemes.

From the design perspective, we first show that different caching heuristics and smart routing schemes can significantly improve caching performance and facilitate content delivery. We then incorporate a well-defined fairness metric into the design and derive the unique optimal caching solution on the Pareto boundary within a bargaining game framework. In addition, our study of the functional relationship between cooperation overhead and neighborhood size indicates that collaboration should be constrained to a small neighborhood, since its cost grows exponentially on general network topologies.

The Popular Abstract

Verkonsisäinen välimuistitallennus pyrkii parantamaan sisällöntoimitusta ja helpottamaan painetta verkon siirtonopeudessa hyödyntämällä universaaleja verkottuneita välimuisteja. Tämä väitöskirja tutkii yhteistoiminnallisen verkonsisäisen välimuistitallennuksen suunnittelua kolmesta näkökulmasta: sisällön, topologian ja yhteistyön kautta, erityisesti keskittyen sisällöntoimituksen mekanismeihin ja yhteistyökäytäntöihin sekä näiden vaikutuksiin välimuistiverkkojen performanssiin.

Väitöskirjan suurimmat aikaansaannokset ovat kahdella saralla. Mittaamisen näkökulmasta näytämme, että perinteinen metrinen välimuistin osumatarkkuus ei ole riittävä ei-triviaalin välimuistitallennusstrategian arvioinnissa, joten esittelemme parempaa informaatiota sisältävät jalanjäljen pienentämisen sekä yhdistämistekijän. Näytämme, että yhteistyökäytäntö on avain erilaisten välimuistitallennusstrategian suunnitteluun liittyvien kompromissien tasapainotukseen ja tutkimme lisää sisällön erilaisten lohkomisjärjestelmien kautta aiheuttamaa vaikutusta performanssiin.

Suunnittelun näkökulmasta näytämme ensin, kuinka erilaiset välimuistitallennuksen heuristiikat ja viisaan reitityksen järjestelmät parantavat merkittävästi välimuistitallennusperformanssia (Did you see how long this word is?!?! Oh, dear, Finnish!) sekä helpottavat sisällön toimitusta. Sisällytämme sitten suunnitteluun hyvin määritellyn oikeudenmukaisuusmittarin ja johdamme uniikin optimaalin välimuistitallennusratkaisun Pareto-rintamalla neuvottelupelin kehyksissä. Lisäksi tutkimuksemme yhteistyökustannusten ja naapurustokoon funktionaalisesta suhteesta viittaa siihen, että yhteistyö on syytä rajoittaa pieneen naapurustoon sen kustannusten kasvaessa eksponentiaalisesti yleisessä verkkotopologiassa.

P.S. If you happen to have any interest in information-centric networking, you can find my online thesis via [this link].

### The Spherical Cow in System Science - Teaching Distributed System Course (Part I)

I. What Do We Want to Teach Students in System Courses?

I have been the teaching assistant of the Distributed System Project course, one of the compulsory courses for every master's student specializing in systems and networking, for over five years at the University of Helsinki. Now it is my last year handling this course before moving to the University of Cambridge. During these years, I learnt quite a lot by interacting with different students in the class. Since I designed most of the exercises for the course and graded all the students' solutions, I gained first-hand experience of how the students thought about the teaching and the content. I feel obliged to write down some of my thoughts and share the experience with current and future teaching assistants and staff.

The Distributed System Project was the first course I ever got involved in teaching. I independently taught this course in 2013 while my PhD supervisor was away on sabbatical in the U.S. Because it is a project course, the core philosophy is "learning by doing": we do not really need to prepare any slides to lecture the students in class. Instead, we try to let the students gain hands-on experience by finishing some small projects. Therefore, the course assignments need to be carefully designed so that the students can apply the abstract theories (i.e., the spherical cow mentioned in the title) they have learnt in the distributed systems course to a more concrete scenario. You can find all the information about the course via the following links:

The exercises are generally divided into two categories. The first category usually covers the classic distributed algorithms (e.g., Lamport clocks, vector clocks, election algorithms, etc.), which are quite trivial and do not require much effort to design. The second category is much more challenging, since we try to make the students take a system-level view when solving the problems. Some of the questions were very successful, some were not. In this series of posts, I will choose some examples and discuss them one by one.

About the title: assuming you know the joke, my purpose should be quite obvious. So let me save the effort of explaining it and simply start with our first example.

II. Example I - The Migration of Complexity in the Distributed System Design

This was the exercise we gave in both 2011 and 2012. The exercise is very straightforward; the detailed assignment description can be found [here]. In short, we simply ask the students to implement a naive calculator application which can plot the sine function using the client/server model. We chose to use a web browser, AJAX and a RESTful API to avoid making the students develop everything from scratch. We also specify that every request to the service can only carry one operator and two operands.

However, the story was not that simple. We asked the students to implement three versions of the aforementioned calculator, covering three different cases:

• The server is smart enough to know how to calculate the sine function. The client only plots the figure.
• The server is stupid and only plots the figure for the given points. The client is smart enough to know how to calculate the sine function.
• Both the server and the client are stupid. The server only knows the four arithmetic operations +, -, *, /, and the client only knows how to plot.

I think you probably already get the idea. Cases 1 and 2 actually represent the evolution of our computation model over the past several decades. In the beginning of computer systems, we had a really "powerful" server (powerful in a relative sense) which took care of all the computations. A user only needed a terminal to connect to the server. A terminal, in some sense, is just a naked and stupid client responsible for submitting jobs and displaying the results. In technical terms, we call this the thin-client, fat-server model. This model is reflected in our first case.

As time went by, functionality started shifting to the client side. CPUs became much faster, storage grew much bigger, and prices dropped a lot. More and more features were added to the client side. As PCs became more powerful, applications also grew in complexity, and more of them started running at the client side to improve the user experience. The shift in usage pattern eventually led to the fat-client, thin-server model. Nowadays, even a moderate smartphone is more powerful than a mainframe from decades ago. Apparently, our second case tries to capture this model.

It is very interesting to notice that the hyped cloud computing is shifting computation back to a centralized entity again. Note that the entity here can refer to a cluster of computers instead of a standalone server. The key enabler of this trend is virtualization technology. People realized that horizontal scalability is a more feasible solution from both the economic and the engineering perspective.

III. Do Not Forget the Communication Channel Is Also An Integral Part of A Distributed System!

However, our third assumption tries to capture another (maybe a bit unfortunate) case wherein both the server and the client are stupid. In such a case, how can a client collaborate with a server supporting only the four basic arithmetic operations +, -, *, / to finish the calculation of the sine function? One solution is using a Taylor series to approximate the sine function. Because we explicitly required that the solution reach a certain precision, the students had to figure out the minimal degree needed in the polynomial. When solving this case, we can clearly see that the complexity migrates to the communication channel. Namely, the client and server need to collaborate over many rounds in order to finish the computation task.

I am pretty sure you already get the big picture of this assignment. In the first case, the server takes care of the (computation) complexity, while the client handles it in the second case. However, if neither can deal with the complexity, the complexity needed to finish the computation has to go somewhere - in our third case, it goes to the communication channel. The exercise reminds the students that the communication channel (usually our network) is also an integral part of a distributed system design.

IV. Is This Really A Good Exercise?

About this exercise, the comments from my colleagues were quite positive. However, the actual feedback from the students in the class was rather negative. Most students failed to grasp the core idea of the exercise and simply thought we had asked them to develop a naive AJAX application. I even got an angry complaint from a student saying "I am not coming here to become a web developer!" Oh, well, you know what? The web is actually the biggest distributed computer system in the world nowadays (excluding the mobile devices ;-) ). Besides, being a web developer is not a bad idea at all in the first place :D

All in all, we finally realized that, without explicitly explaining the purpose of this exercise, the students could not really grasp its idea. However, if we put everything across in the first place, there is simply no fun any more in solving the problem. Eventually, we dropped this exercise after its two trials in 2011 and 2012, which was really sad in my opinion!

Read this series: The Spherical Cow in System Science - Part I, Part II, Part III

## Friday, February 20, 2015

### The Philosophy of PANNS -- A "Biased" Story for the Nearest Neighbor Search in High Dimensional Spaces

I put the Panns source code online about half a year ago, more precisely, in the middle of 2014. After that, except for occasionally replying to emails inquiring about the technical details of this small tool, I almost completely forgot about it due to my stressful PhD studies. Then one day, one of my colleagues started talking about Panns during lunch, and I was a bit surprised that the tool was gaining popularity. What's more, some people even compared it to other popular similar tools, which flattered me quite a lot, even though Panns actually performed badly (regarding its building and querying time) in the comparison. Still, I was very happy about the fact that people are using it, testing it, and maybe also tweaking it.

So why did I decide to write a post about Panns? Well, I am not really going to present the algorithm line by line, nor am I going to talk about any theoretical stuff here. Instead, I want to talk about a little bit of philosophy. Yes, you heard me, philosophy. ;) After I started my PhD journey, I realized that you need to have a philosophical stance on almost everything. Especially if you are working in the system science field, you must know there is simply no silver bullet which can solve all your problems. During development, you need to make various design decisions in order to balance different tradeoffs. All such design decisions are based on your philosophical stance on the system.

In the following text, I want to explain the philosophy of Panns. Then you will understand why it is designed the way it is, which further explains why it is much better than others in certain aspects and doomed to lose in others. So, let's start with a brief background story.

How did Panns start in the first place?

My PhD research has been focusing on Information-Centric Networking. It does not matter at all if you have never heard of this term. The point is that we want to interconnect the content within a network by its semantic meaning, which essentially requires a language model to back up the idea. In the beginning of 2014, we started a new project - Kvasir. In the Kvasir project, we explored different innovative ways of seamlessly integrating intelligent content provision and recommendation into web browsing by utilizing the aforementioned idea. Kvasir turned into a more and more interesting and promising project, so we decided to develop a real system. Panns, technically speaking, is only a small function module in the Kvasir system.

Why vector-based language model?

We chose to use a vector-based language model in Kvasir. Certainly, I know there is a whole bunch of different language models out there. However, there is no evidence in the literature showing that one specific model dominates the others. Beyond that, if there is really a reason for choosing vector-based models, it is because I want to use LSA (Latent Semantic Analysis) and I am simply good at algebraic models.

Why Python?

The reason is not that Python is my working language. The primary consideration is teaching, since I work in a university where teaching is an important part of my daily work along with research. We know C/C++ can definitely lead to better performance compared to Python. However, we also noticed that most students are more comfortable with Python, partly due to its popularity and partly due to its simplicity and friendly learning curve. We worried that using C/C++ might make the code less accessible to the students. With Python, I can write more elegant code and the students can focus on the key ideas instead of being distracted by mechanical details - of course, at the price of performance. However, in reality, Panns's performance is not that bad in our context.

Why random projection trees?

The essential problem we are trying to solve is looking for the k nearest neighbors of a given point in very high dimensional spaces. What's more, you need to finish the job fast. This is a really hard problem and there is a large body of literature behind it. In general, you need to trade accuracy for speed, and that is totally fine in the language model context, where you are trying to find the most similar articles. A small numerical error will not destroy the results, and users will mostly still feel the returned articles are relevant. Choosing random projection trees (RP-trees) is based on the following considerations: first, the simplicity of the structure, which makes theoretical analysis feasible and shows that the algorithm has a bounded error; second, both tree building and random projections can be easily parallelized on a computer cluster. As you can see, we are pursuing horizontal scalability.

Why separate the index file into two parts?

The index file generated by Panns is separated into two parts. One contains all the RP-trees, and the other contains the raw vectors. The rationale is that the tree structure needs to be small so that it can reside in physical memory for fast access, whereas the raw data vectors are usually much larger, so it is better to utilize the mmap technique and avoid loading the whole file into memory. This design is based on the well-known fact that content popularity distribution is highly skewed and only a very small amount of content is indeed popular. This further indicates that only a small part of the data set is frequently visited.

Why spend so much time on indexing?

Panns spends much longer on indexing compared to other tools. The theoretical argument goes like this: as the number of points grows larger and the dimensionality increases, those highly condensed points tend to become uniformly distributed in the space. The split point therefore needs to be carefully placed on the random projection to reduce the probability of misclassification. Recall that an RP-tree is essentially a clustering algorithm. If you find the theoretical argument boring, then the following argument may make more sense: we care more about accuracy, and we have time to wait for the indexing to finish. Once the index is constructed, we seldom rebuild it within a short period, so we do not really mind waiting a bit longer - 10 seconds or 10 minutes really makes no difference to us. Besides, the indexing can be much faster if the parallel indexing mode is enabled.

When do you really need Panns?

Simply put, if you have a huge data set (e.g., over several million points) of very high dimensionality (e.g., at least over one thousand), and you are happy with a query speed of several hundred milliseconds, then Panns is for you. Or if you are a student just starting out in this subject and want to get some hands-on experience, Panns is for you. For everyone else, well, why not give it a try :)

Is it finished?

Of course not. But as I promised, I only talk about the design philosophy in this post. I am not planning to do any experiments or plot anything right now. In the [next post], I will spend some more time explaining the performance of Panns in a measurement study, and talk about why the results look the way they do and what causes such behavior. Then we can have a better understanding of what Panns is really good at.