Sunday, August 10, 2014

Sunday Morning Insight: Why Kaggle Changes Everything



If you are reading Nuit Blanche, you probably know about Kaggle, the site where a whole slew of datasets are pitted against different supervised learning algorithms. Kaggle changes everything because each of its competitions amounts to a distributed attack on a single dataset by different families of algorithms, each with its own biases. It is as if one were to test a compressive sensing dataset against all of these algorithms at once. That richness of viewpoints and angles of attack is rare in academia. In recent times, some algorithms have consistently produced good results: CNNs for image-related tasks, and Random Forests for many other types of data. While CNNs hold the attention of everyone from academia to industry, Random Forests are, in my view, clearly the unexpected algorithm that seems to be doing well across a wide range of tasks. I am not sure we could have gotten that insight from just reading the academic literature.
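To make that point concrete, here is a minimal sketch of the kind of out-of-the-box Random Forest baseline that keeps showing up near the top of leaderboards. It uses scikit-learn on a synthetic stand-in for a competition dataset; nothing in it comes from an actual competition.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular competition dataset.
X, y = make_classification(n_samples=5000, n_features=40,
                           n_informative=10, random_state=0)

# An out-of-the-box forest: no tuning, no feature engineering.
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)

# Cross-validated accuracy, the usual sanity check before
# submitting anything to a leaderboard.
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

Here are two ongoing competitions that have retained my interest: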

The Higgs Boson competition (1292 teams)
Why? Whichever algorithm does well in this competition will likely get to run on real data, and with real data there is the potential to discover more than the Higgs.
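For reference, this competition ranks submissions with the Approximate Median Significance (AMS). Here is a sketch of that formula as I understand it from the competition documentation, where s and b are the weighted counts of true and false positives among events classified as signal; the numbers in the example are made up.

import numpy as np

def ams(s, b, b_reg=10.0):
    # Approximate Median Significance; b_reg is the constant
    # regularization term used in the competition's formula.
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

print("AMS: %.3f" % ams(s=100.0, b=1000.0))  # toy numbers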

The Criteo competition (283 teams)
Why? It may well be that "the best minds of my generation are thinking about how to make people click ads," but this is the first time we have public access to a large corpus of real CTR (click-through rate) data.
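For this kind of data, a natural first baseline is the hashing trick feeding a logistic model, scored with log loss. Here is a minimal sketch with scikit-learn; the feature names and click records are invented for illustration.

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Invented click records: each row is a dict of categorical
# features plus a 0/1 click label.
rows = [({"site": "news", "ad": "a13", "device": "mobile"},  1),
        ({"site": "blog", "ad": "a7",  "device": "desktop"}, 0),
        ({"site": "news", "ad": "a7",  "device": "mobile"},  0),
        ({"site": "shop", "ad": "a13", "device": "desktop"}, 1)]

# Hashing trick: maps arbitrary categorical values into a fixed-size
# sparse feature space, so no vocabulary has to be kept in memory.
hasher = FeatureHasher(n_features=2**20, input_type="dict")
X = hasher.transform(features for features, _ in rows)
y = [label for _, label in rows]

# Logistic loss trained with SGD yields click probabilities.
model = SGDClassifier(loss="log_loss", alpha=1e-6)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
print("log loss: %.4f" % log_loss(y, probs, labels=[0, 1]))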

Why is it that an academic paper is not produced at the end of each competition? I don't know, but the insight gathered from pitting so many algorithms against the same dataset is simply invaluable.
Credit: Kaggle




2 comments:

Dick Gordon said...

Okay, so how do we get a competition going between algorithms for X-ray dose reduction in computed tomography? CT now accounts for 49% of the per capita radiation dose in the USA:
Mettler Jr, F.A., M. Bhargavan, K. Faulkner, D.B. Gilley, J.E. Gray, G.S. Ibbott, J.A. Lipoti, M. Mahesh, J.L. McCrohan & M.G. Stabin (2009). Radiologic and nuclear medicine studies in the United States and worldwide: Frequency, radiation dose, and comparison with other radiation sources—1950–2007. Radiology 253(2), 520-531.
http://www.researchgate.net/publication/26856063_Radiologic_and_nuclear_medicine_studies_in_the_United_States_and_worldwide_frequency_radiation_dose_and_comparison_with_other_radiation_sources--1950-2007/file/d912f506566c4e8dfe.pdf

Yours, -Dr. Richard Gordon DickGordonCan@gmail.com

Igor said...

Dick,

It's a very good question. How could we frame reconstruction in terms of supervised learning? And how do we go about being adversarial in setting up the contest? Recall that each team can submit several sets of predictions on the test dataset, so we need a good metric.
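For instance, one simple candidate is the PSNR of each submitted reconstruction against a held-out ground-truth phantom. A minimal sketch, assuming submissions arrive as arrays; the metric choice is just an assumption on my part.

import numpy as np

def psnr(truth, recon, peak=1.0):
    # Peak signal-to-noise ratio in dB; higher is better.
    mse = np.mean((truth - recon) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy check: a random phantom and a noisy "reconstruction" of it.
rng = np.random.default_rng(0)
truth = rng.random((64, 64))
recon = truth + 0.05 * rng.standard_normal((64, 64))
print("PSNR: %.1f dB" % psnr(truth, recon))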

Igor.
