Originally posted on The MITX What’s Next Blog, April 26, 2013
By Beth Logan, Director of Optimization
I started working in “Big Data” before it had a name. My roots are in speech recognition, where at the time – and likely still – it was hard to publish at a respectable conference unless you could demonstrate a significant improvement on a large dataset. “Large” meant hundreds of hours of speech, which took many hours to process on a cluster. Fortunately, the community was quite mature, and many such large datasets were readily available with excellent labels and tools to get started (e.g. see http://www.ldc.upenn.edu/).
The availability of such public datasets combined with cheap and powerful computing resources was the breakthrough speech recognition needed to become ubiquitous. These datasets allowed accurate benchmarking of results across teams, facilitating innovation. However, they cost tens of thousands of dollars for non-academic institutions, reflecting the high cost and labor intensity of labeling. Yet the value generated by commercial speech recognition algorithms no doubt outweighs this cost. This raises the obvious question of what other communities would benefit from large, shared datasets.
To read the rest of Beth’s case for sharing big data, please see the full post on the MITX blog, Why You Should Share Your Big Data.