Saturday, March 07, 2015

The Spherical Cow in System Science - Matrix Multiplication on Apache Spark (Part III)

As mentioned in the previous [blog], we introduced the big data framework into teaching in 2014, and we specifically chose to use Apache Spark in 2015. Besides calculating the basic statistic, we also asked the students to do something more skillful -- matrix multiplications. The detailed assignment description can be found [here].


I. Two Simple (But Representative) Matrix Multiplication Problems

In the assignment, we gave the students a tall and thin matrix which is of size $10^3 \times 10^6$. Let's call it matrix $A$. The matrix $A$ has one million rows and each row has one thousand double numbers. In total, there are one billion double numbers. We asked the students to calculate $X$ as blow:

  1. $X = A \times A^T \times A$
  2. $X = diag(A \times A^T)$

Solving large linear equation systems is very common in both engineering and scientific fields, and knowing how to manipulate matrices is basic skills for engineers and scientists. These two examples, though seemingly simple, are quite representative in scientific computing. Then how to solve them actually?


II. Order Matters

Let's start from the first question $X = A \times A^T \times A$. There are two ways to calculate $X$, namely $X = (A \times A^T ) \times A$ and $X = A \times (A^T \times A)$. Mathematically, they are equivalent, namely same result and same computation complexity. However, from engineering and system perspective, there is a big difference! Why?

Let me first ask you "What is the size of $X$?". Yes, it is exactly the same as $A$, namely $10^3 \times 10^6$. No matter whether we calculate the first two first (i.e., $(A \times A^T )$) or the latter two first (i.e., $(A^T \times A)$), the intermediate result need to be transmitted on many nodes to finish the calculation in order to give us the $X$ of the same size as $A$. However, the two intermediate results are quite different, see the figure below:


You must get the point after seeing this figure, right? If we call it matrix $B$ as our intermediate result. We can see $(A \times A^T )$ leads to a huge $B$ which is $10^6 \times 10^6$ in case I, whereas $(A^T \times A )$ only leads to a very small $B$ which is $10^3 \times 10^3$. Now, you tell me, "which matrix is easier to distribute in the network?". There is going to be several orders of magnitudes difference in the network traffic. If the matrix $A$ is very very tall (i.e., contains huge amount of rows), it is not even possible to finish the calculate of $(A \times A^T )$. So, the lesson we learnt here is "Order does matter!"


III. Decomposition and Aggregation

We know getting a smaller $B$ is more beneficial, but how shall we carry on to compute $(A^T \times A )$? We know Spark cuts the $A$ into many chunks and stores them separately on different machine. Let's call these chunks of $A$ as $A_1, A_2, A_3, A_4 ... A_m$. Note these chunks have same number of columns (i.e., $10^3$) but may have different number of rows, in other words, their chunk size may be different.

Now, let me ask you another question - "Choose any $i \leq m$, what is the size of $A_i \times A_i^T$?". Yes, they are all the same, namely $10^3 \times 10^3$. Actually, this is just the partial result of final $B$. To calculate the final $B$, we just need to aggregate these partial results and sum them up as below:

$B = \sum_1^m A_i \times A_i^T$

After having $B$, getting the final result $A$ is trivial. We just need to broadcast $B$ onto every node, then finish the calculation on every chunk as below:

$X_i = A_i \times B$

Assemble all the $X_i$. More precisely, concatenate them. Then you have $X$.


IV. Getting Diagonal Elements

I was actually a bit hesitant about whether to give the second question or not. I thought it might be too easy for the students. However, the truth is that most of the students actually did try to compute the complete matrix $A \times A^T$ which is so big that you cannot even fit into the normal memory, then extract the diagonal elements.

When the students brought such solutions to me and complained about the memory issue, I just ask them to do one thing - "Can you tell me the definition of the diagonal elements and describe how you calculate them? "

They usually started with -- "Well, the first row of  $A$ (dot) multiply the first column of $A^T$ ...".

Then I continued to ask -- "Then what is the first column of $A^T$?"

Suddenly, the students got the point -- "Haha, it is still the first row of the $A$!!!"

So, you do not need to calculate the whole matrix in order to get the diagonal elements. You just need process the data set line by line, each row will give you one diagonal element.

$X_(i,i) = \sum_{j=1}^{1000} A_{i,j}^2$


V. Common Issues

Same as the previous [median example], this set of questions were also very successful. The students quite liked the problem design even though very few of them actually got all the solutions right. One common issue from the students is that they tried very hard to have precise control on how the matrix/data set is cut in Spark. Controlling how the data is divided is not impossible but just meaningless regarding solving the problem, of course given you have sufficient knowledge in linear algebra and matrix operations.


Read this series: The Spherical Cow in System Science - Part IPart IIPart III

No comments: