Information + Network + System: panns

Though I have promised a long time ago [here] to perform a simple evaluation on PANNS, I finally got it done just today due to my overwhelming workload before moving to the Cambridge.

I. Caveat

Before we start, I must clarify several things beforehand. The purpose is not merely trying to avoid criticism, but to emphasize the context and goal of making PANNS as a small tool.

First, frankly, a large part of the hyped data science deals with data engineering stuff. However, it is well-known that there is no silver bullet in the engineering world. It further means that a good engineer has to look into a specific application area to determine the proper tool. Different data sets may lead to different conclusions when you are evaluating your machine learning/data mining algorithms, as we will see in this post.

Second, I only evaluated the accuracy and the index size here. I admit that PANNS will be beaten miserably by other software in terms of index building time. However, we assume that index will not be built frequently. Once the index is built, it will be only queried all the time. Besides, the index should be as small as possible so that it can work on the large data set. Please check "The Philosophy of PANNS" for more details.

Third, I only compare to Annoy in this post. I am not saying the other candidates are bad. Simply because there are already such comparisons and Annoy seems superior to others in many aspects. In fact, Annoy can be an order of magnitude faster than PANNS in terms of speed. But you will also see how PANNS will win out in other aspects in the follow of this post.

II. Evaluation

The data set used in the evaluation is synthetic. Each data point in the data set is 2000-dimension and follows a standard normal distribution $\mathcal{N}(0,1)$. The index contains 256 RPTrees. The accuracy and index size are measured, the numbers presented are the average results of 50 experiments. The two tables below summarize our results, one for Euclidean similarity and one for Angular similarity. In most cases, I prefer tables to figures when presenting numbers, so I did not bother to plot the result.

Data Set Size	5000	10000	15000	20000	25000	30000
Accuracy - Panns	75.0 %	51.2 %	39.2 %	30.4 %	27.2 %	25.2 %
Accuracy - Annoy	58.2 %	38.0 %	29.4 %	23.2 %	20.6 %	18.0 %
Index - Panns	29.6 MB	59 MB	88 MB	116 MB	149 MB	174 MB
Index - Annoy	56.0 MB	112 MB	169 MB	224 MB	279 MB	334 MB

Table 1. Comparison between PANNS and Annoy using Euclidean similarity.

Data Set Size	5000	10000	15000	20000	25000	30000
Accuracy - Panns	75.0 %	53.6 %	36.0 %	37.0 %	27.0 %	25.0 %
Accuracy - Annoy	65.0 %	36.8 %	26.4 %	26.4 %	19.2 %	17.2 %
Index - Panns	29.6 MB	59 MB	88 MB	116 MB	149 MB	174 MB
Index - Annoy	35.0 MB	70 MB	93 MB	140 MB	159 MB	188 MB

Table 2. Comparison between PANNS and Annoy using Angular similarity.

From both tables, the first thing we noticed is the difference in the index size. PANNS is able to save much more space when using the same amount of RPTrees. In the case of Euclidean similarity, the PANNS index is only half size of the Annoy. This benefit becomes even more noticeable when dealing with extremely large data sets and many RPTrees. From this results, we can understand why Panns lost quite a lot in some evaluations where the index only had small number of RPTrees.

In terms of accuracy, we can see PANNS consistently outperforms Annoy in all cases in our evaluation. But the difference starts diminishing as there are more and more data points. However, since PANNS uses much less space for storing the index, we can incorporate more RPTrees in the index given the same index size to achieve better accuracy. However, it becomes difficult to argue whether it is fair comparison to some extent.

In terms of building time, it is so true that Annoy is faster. However, I must point out this only holds when you use serial execution. PANNS provides parallel index building to take advantage of multiple cores on your workstation. In my case, it turned out PANNS is much faster than Annoy because I have 16 cores on my workstation. I hope this does not count as cheating ;-)

In terms of scalability, I also tried building the index from the English Wikepedia dump which consists of 3.8 million documents approximately. PANNS was able to get the job decently done (though took a long time) whereas Annoy always failed due to memory issue. However, I think further investigation is definitely needed.

III. Inconclusive Conclusion

Well, Erik already gave good summaries on "what to use in which context" in his post. I only provide some complementary advice in the following.

In general, PANNS generates much smaller index without sacrificing the accuracy. This becomes more important when you are dealing with large data sets and still want to incorporate as many RPTrees as possible to achieve satisfying accuracy.

The future trend is parallel computing. We need squeeze out all your computational resources from your devices. The slowness of PANNS can be effectively ameliorated by using parallel building and parallel query on multiple cores (even across the network). Please check out our paper on parallel RPTree building on Apache Spark [here].

We are still carrying on the research work in improving the PANNS accuracy. As mentioned, this more relates to the data engineering stuff. There are even better algorithms which can outperform PANNS from 20% to 100%, of course at the price of even longer building time. In the future, we will gradually incorporate these algorithms with a balance between efficiency and accuracy. All in all, PANNS is going to remain as simple and compact as possible for teaching purpose, as we already mention in the [previous article].

I put Panns source code online about a half year ago, more precisely, in the middle of 2014. After that, except occasionally I replied to some emails inquiring about the technical details of this small tool, I almost completely forgot such a thing due to my stressful PhD studies. Until one day, one of my colleagues started talking about Panns during the lunch time, I was a bit surprised that the tool was gaining its popularity. What's more, some people even compared it to other popular similar tools, which really flattered me quite a lot though Panns performed actually badly (regarding its building and querying time) in the comparison. Still, I was very happy about the fact that people are using it, testing it, and maybe also tweaking it.

So why did I decide to write a post about Panns? Well, I am not really going to present the algorithm line by line, nor I am going to talk about any theoretical stuff here. Instead, I want to talk about a little bit philosophy. Yes, you heard me, philosophy. ;) After I started my PhD journey, I realized that you need to have a philosophical stance on almost everything. Especially if you are working in the system science field, you must know there is simply no such a silver bullet which can solve all your problems. During the development, you need to make various design decisions in order to balance the different tradeoffs. All such design decisions are based on your philosophical stance on the system.

In the following text, I want to explain the philosophy of Panns. Then you will understand why it is designed in such a way which further explains why it is much better than others on certain aspects and doomed to lose in other aspects. So, let's start with a brief background story.

How did Panns start in the first place?

My PhD research has been focusing on Information-Centric Networking. It does not matter at all if you have never heard of this term. The point is we want to interconnect the content within a network with their semantic meanings, which essentially requires a language model to back up the idea. In the beginning of 2014, we started a new project -- Kvasir. In Kvasir project, we explored different innovative ways of seamlessly integrating intelligent content provision and recommendation into the web browsing by utilizing the aforementioned idea. Kvasir eventually turned into a more and more interesting and promising project, so we decided to develop a real system. Panns, technically speaking, is only a small function module in Kvasir system.

Why vector-based language model?

We chose to use vector-based language model in Kvasir. Certainly I know that there is a whole bunch of different language models out there for use. However, there is no evidence in the literature showing one specific model dominates others. Besides that, if there is really a reason for choosing vector-based models, it is because I want to use LSA (Latent Semantic Analysis) and I am simply good at algebraic models.

Why Python?

The reason is not because Python is my working language. The primary consideration is based on the teaching purpose. Since I am working in a university wherein teaching is an important part of my daily work along with research. We know C/C++ can definitely lead to better performance comparing to Python. However, we also noticed that most students are more comfortable with Python partly due to its popularity and partly due to its simplicity and friendly learning curve. We did worry that using C/C++ might cause the code less accessible to the students. With Python, I can write more elegant code and the students can focus on the key ideas instead of being distracted by mechanical details, of course, at the price of performance. However, in reality, Panns performance is not that bad in our context.

Why random projection trees?

The essential problem we are trying to solve is looking for k nearest neighbors for a given point in very high dimensional spaces. What's more, you need to finish the job fast. This is a really hard question and there is a large body of literature behind it. In general, you need to trade accuracy for speed, and it is totally fine in the language model context where you are trying to find the most similar articles. The small numerical error will not destroy the results and users mostly still feel the returned articles are relevant. Choosing random projection trees (RP-tree) is based on the following several considerations - first, simplicity on the structure, which makes theoretical analysis feasible which further shows the algorithm has a bounded error. Second, both tree building and random projections can be easily parallelized on a computer cluster. As you can see, we are pursuing horizontal scalability.

Why separating the index file into two parts?

The index file generated by Panns are separated into two parts. One contains all the RP-trees, and the other contains the raw vectors. The rationale behind is that the tree-structure needs to be small so that it can reside within the physical memory for fast access, whereas the raw data vectors are usually much larger and it is better to utilize mmap technique so that we do not need load the whole file into the memory. Such design is based on the well-known fact that content popularity distribution is highly skewed and only a very small amount of content are indeed very popular. It further indicates only a small part of the data set is frequently visited.

Why spending so much time in indexing?

Panns spends much longer time in indexing comparing to other tools. The theoretical argument goes like this: as the number of points grows larger and larger, as the dimensionality increases higher and higher, those highly condensed points tend to be uniformly distributed in the space. The split point needs to be carefully placed on the random projection to reduce the probability of misclassifications. Recall that RP-tree is essentially a clustering algorithm. If you find the theoretical argument is boring, then the following argument may make more sense: we care more about the accuracy and we have time to wait for the indexing to finish. Because once the index is constructed, we seldom re-build it within a short period. In that case, we do not really mind to wait a bit longer. 10 seconds, 10 minutes, really make no difference to us. Besides, the indexing can be much faster if parallel indexing mode is enabled.

When do you really need Panns?

Simply put, if you have a huge data set (e.g., over several millions of points) of very high dimensionality (e.g., at least over one thousand), and you are happy with several hundred milliseconds query speed. Then Panns is for you. Or if you are a student who just start this subject and want to get some hands-on experience, Panns is for you. For others, well, why not give it a try :)

Is it finished?

Of course not. But as I promised, I only talk about the design philosophy in this post. I am not planning to do any experiments or plot anything right now. In the [next post], I would like to spend some more time in explaining the performance of Panns in some measurement study, and talk about why the result is so and what causes such behaviors. So we can have a better understanding on what Panns is really good at.

Information + Network + System

Tuesday, March 24, 2015

The Philosophy of PANNS -- A Very Preliminary Evaluation & A Not-So-Fair Comparison (maybe)

Friday, February 20, 2015

The Philosophy of PANNS -- A "Biased" Story for the Nearest Neighbor Search in High Dimensional Spaces