Thursday, March 15, 2012

GraphLab workshop, why should you care?

If you are on the Advanced Matrix Factorization group or reading Danny Bickson's blog, you have already seen the news: Danny is organizing the first GraphLab workshop in July in the Bay Area. What is GraphLab? From the overview:

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. 
In short, it is a way to perform iterative algorithms on sparse graphs, with parallel processing built in. With the advent of cheap cloud computing, and the underlying need for post-processing in sparse recovery or in advanced matrix factorizations like dictionary learning, robust PCA and the like, it might be interesting to investigate the matter and even present something at this workshop. If you want to know more about how GraphLab can be helpful for your very large jobs or how it fits into your algorithm, you may want to check some of the examples, in particular, for rapid prototyping, the following two pages:
Looking through an example like this one on an instance of the LASSO algorithm, I can begin to see more clearly which aspects of compressive sensing could benefit from this. First, it looks like the sparsity of the measurement matrix is central to getting the parallel algorithm to work efficiently. We already have some of those listed here, so from memory we have:
I note that the last ones are being used with Belief Propagation solvers, also called AMP solvers, which happen to be one of the reasons GraphLab was started in the first place. These AMP solvers seem to be some of the best sparse recovery solvers we have in our toolboxes. From the Big Picture in Compressive Sensing, the AMP algorithms out there include:
If you recall, Danny implemented GAMP on GraphLab a while back (see also here). We also know that these AMP solvers can be applied to matrix factorization problems (MMV, low-rank decomposition [here]...). Ultimately, the real question is: how will GraphLab + cheap cloud computing + BP solvers (or others) + sparse measurement matrices compete with silicon?
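To make that connection a little more concrete, here is a minimal sketch of one AMP iteration for plain sparse recovery, with soft thresholding as the denoiser. This is a toy, single-machine Python/NumPy version for illustration only; the threshold rule (a multiple of the residual RMS) and the function names are my own choices, not code from GraphLab or from the GAMP package.

import numpy as np

def soft_threshold(v, t):
    # entrywise soft-thresholding denoiser eta(v; t)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def amp(A, y, n_iter=30, alpha=1.5):
    # A: (m, n) measurement matrix, y: (m,) measurements
    # alpha: threshold multiplier (illustrative tuning parameter)
    m, n = A.shape
    x = np.zeros(n)
    z = y.copy()
    for _ in range(n_iter):
        r = x + A.T @ z                        # pseudo-data
        tau = alpha * np.sqrt(np.mean(z ** 2))  # adaptive threshold
        x = soft_threshold(r, tau)
        # Onsager correction term: this is what distinguishes AMP from
        # plain iterative soft thresholding
        z = y - A @ x + (z / m) * np.count_nonzero(x)
    return x

# quick check on a synthetic sparse recovery problem
rng = np.random.default_rng(0)
m, n, k = 80, 200, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = amp(A, A @ x0)
print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))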
[Update]
  Danny added a few items on top of what I just said, namely:

  1. Initially, we thought most users would use our parallel processing API, but we found out that users prefer using pre-implemented algorithms. Thus we have released several libraries of implemented algorithms, of which the matrix factorization library is currently the most popular - see http://graphlab.org/pmf.html
  2. A solution based on our matrix factorization library took 5th place (out of more than 1000 participants) in last year's ACM KDD Cup 2011. http://kddcup.yahoo.com/workshop.php
  3. A relevant algorithm for the CS community is Shotgun, our parallel LASSO/logistic regression solver, which is implemented in GraphLab: http://bickson.blogspot.com/2011/06/large-scale-logistic-regression-on.html (a rough sketch of the kind of update it parallelizes follows this list).
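For readers curious about what Shotgun actually parallelizes, here is a small sequential coordinate-descent sketch for the LASSO objective in Python/NumPy. To be clear, this is not the Shotgun code: Shotgun picks coordinates at random and applies updates of this kind concurrently, which stays stable as long as the columns of A are only weakly correlated (sparse matrices help there).

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cd_lasso(A, y, lam, n_sweeps=50):
    # minimizes 0.5*||y - A x||^2 + lam*||x||_1 by cycling through coordinates
    m, n = A.shape
    x = np.zeros(n)
    r = y.copy()                    # running residual y - A x
    col_sq = (A ** 2).sum(axis=0)   # precomputed ||A_j||^2
    for _ in range(n_sweeps):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            rho = A[:, j] @ r + col_sq[j] * x[j]   # correlation with partial residual
            x_new = soft_threshold(rho, lam) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)          # keep the residual in sync
            x[j] = x_new
    return x

# usage: x_hat = cd_lasso(A, y, lam=0.1 * np.max(np.abs(A.T @ y)))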

6 comments:

Rob said...

Thanks for making a complex idea understandable.

Danny Bickson said...

Igor, you are amazing! I am sure we will now get much improved visibility for the CS / matrix factorization community.

Thomas Arildsen said...

Thanks for enlightening us. Now that you mention sparse measurement matrices, LDPC matrices have shown some promise in a couple of papers under way recently. Could they be relevant here?
I wonder about the somewhat sparse measurement matrices of the random demodulator as well.

Igor said...

Yes, the LDPC matrices should also work. My understanding is that GraphLab allows the "chunking" of the work in iterative algorithms because, at every step, the multiplication is performed with a sparse measurement matrix, thereby reducing (not nulling) the communication between CPUs. So, to a certain extent, the sparsity of the matrix being multiplied at every step of the iterative algorithm "maps" onto the graph between CPUs. Danny?
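A toy sketch of what I mean, in Python/SciPy (this is not GraphLab code; the row partitioning and the names are purely illustrative):

import numpy as np
from scipy import sparse

def rowblock_matvec(A_csr, x, n_workers=4):
    # serial stand-in for a row-partitioned sparse matrix-vector product:
    # each "worker" owns a block of rows and only needs the entries of x
    # touched by the nonzero columns of its block, so the sparser the
    # matrix, the less data has to move between CPUs at each iteration
    m, n = A_csr.shape
    bounds = np.linspace(0, m, n_workers + 1, dtype=int)
    y = np.zeros(m)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        block = A_csr[lo:hi]
        needed = np.unique(block.indices)  # columns this worker actually reads
        print(f"rows {lo}:{hi} read {needed.size} of {n} entries of x")
        y[lo:hi] = block @ x
    return y

A = sparse.random(400, 1000, density=0.01, format="csr", random_state=0)
y = rowblock_matvec(A, np.ones(1000))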

Danny Bickson said...

Yes, exactly. GraphLab is designed to support sparse graphs better than dense graphs. This is because the amount of correlation between variables/factors/etc. is significantly reduced in sparse graphs, thus allowing better parallelism.

Unknown said...

Dear Igor,

Thanks for the post. I have some formal training in machine learning and I'm pretty comfortable with Python. So, is GraphLab a good entry point to large-scale or distributed machine learning?

Also, I think the links that you provided to MATLAB and the Python/Jython API are both broken. Could you please update those?

Thanks
