Computers can recognise sounds by watching videos.

In recent years, computers have gotten remarkably good at recognising speech and images: think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook. But recognition of sounds such as drilling or cheering has lagged behind. That's because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data.

The researchers’ machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected.
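The idea of densely interconnected nodes performing rudimentary computations can be sketched in a few lines. This is a generic illustration in NumPy, not the researchers' actual network; the layer sizes and the ReLU nonlinearity are assumptions chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_layer(x, weights, bias):
    """Each node computes a weighted sum of ALL inputs (dense
    interconnection), then applies a simple nonlinearity (ReLU).
    That weighted sum is the 'rudimentary computation' of one node."""
    return np.maximum(0.0, weights @ x + bias)

x = rng.normal(size=4)       # an input signal (4 values, illustrative)
W = rng.normal(size=(3, 4))  # every one of 3 nodes connects to every input
b = np.zeros(3)

activations = dense_layer(x, W, b)
print(activations.shape)     # one activation value per node
```

A full network simply stacks layers like this one, feeding each layer's activations into the next.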

The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data, such as identifying the objects in an image. Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images.
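The "continually modifies the settings" step above is gradient-descent training. The toy sketch below shows the idea on a made-up two-feature task with a single logistic output node; the data, model size, and learning rate are all illustrative assumptions, not details of the researchers' system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny synthetic task (assumed for illustration): classify points
# by the sign of their first coordinate.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(2)  # the node's "settings": its weights...
b = 0.0          # ...and its bias
lr = 0.5

for _ in range(200):
    # Forward pass: logistic output for each example.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Backward pass: the cross-entropy gradient nudges the settings
    # slightly toward better classification, over and over.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# After training, the output reliably classifies the data.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = np.mean((p > 0.5) == y)
```

In a deep network the same nudging is applied to every layer's weights via backpropagation, but the principle is identical.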

To compare the sound-recognition network’s performance to that of its predecessors, however, the researchers needed a way to translate its language of images into the familiar language of sound names. So they trained a simple machine-learning system to associate the outputs of the sound-recognition network with a set of standard sound labels.
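The "simple machine-learning system" that maps the network's outputs onto standard sound labels can be pictured as a lightweight classifier fitted on top of fixed features. The sketch below uses a nearest-centroid classifier on synthetic feature vectors; the feature size, label count, and classifier choice are assumptions for illustration, since the article does not specify them.

```python
import numpy as np

rng = np.random.default_rng(2)

n_labels = 3     # stand-ins for labels like "drilling" or "cheering"
n_features = 8   # stand-in for the size of the network's output vector

# Fake "network outputs": each sound class clusters near a prototype.
prototypes = rng.normal(size=(n_labels, n_features))
labels = rng.integers(0, n_labels, size=300)
features = prototypes[labels] + 0.1 * rng.normal(size=(300, n_features))

# Learn one centroid per label from the labeled examples: this is the
# simple system associating network outputs with sound labels.
centroids = np.stack([features[labels == k].mean(axis=0)
                      for k in range(n_labels)])

def predict(f):
    """Assign the label whose learned centroid is closest."""
    return int(np.argmin(np.linalg.norm(centroids - f, axis=1)))

preds = np.array([predict(f) for f in features])
accuracy = np.mean(preds == labels)
```

The key point is that the big network is left untouched: only this small mapping from its outputs to human-readable labels needs labeled audio.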

For that, the researchers did use a database of annotated audio — one with 50 categories of sound and about 2,000 examples. Those annotations had been supplied by humans. But it’s much easier to label 2,000 examples than to label 2 million. And the MIT researchers’ network, trained first on unlabeled video, significantly outperformed all previous networks trained solely on the 2,000 labeled examples.