Searching by sound currently has limited uses, but a handful of applications, particularly on mobile devices, make use of it. Shazam, SoundHound (previously Midomi), Doreso, and others have seen considerable success by using a simple algorithm to match an acoustic fingerprint to a song in a library. These applications take a sample clip of a song, or a user-generated melody, and check a music database to see whether the clip matches a known song. From there, the song's information is queried and displayed to the user.
These kinds of applications are mainly used to identify a song that the user does not already know.
Searching by sound is not limited to identifying songs; it is also used to identify melodies, tunes, and advertisements, and in sound library management and video files.
These apps search by sound by generating an acoustic fingerprint: a condensed digital summary of the sound. A microphone picks up an audio sample, which is then broken down into a compact numeric signature, a code unique to each track. When Shazam picks up a sound clip, it generates a signature for that clip using the same fingerprinting method; from there, it is simple pattern matching against an extensive music database.
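The fingerprint-and-lookup flow described above can be sketched in a few lines. The fingerprint function here is a deliberately naive stand-in (coarse energy quantization over short frames), not Shazam's actual algorithm; the song titles and database structure are illustrative only.

```python
# Toy sketch of the fingerprint-and-lookup flow: reduce a clip to a
# numeric signature, then use it as a key into a song database.

def fingerprint(samples, frame=4):
    """Reduce an audio clip to a short tuple of coarse energy levels."""
    sig = []
    for i in range(0, len(samples) - frame + 1, frame):
        energy = sum(abs(s) for s in samples[i:i + frame]) / frame
        sig.append(round(energy, 1))  # quantize so near-identical clips agree
    return tuple(sig)

# A tiny "music database": fingerprint -> track metadata.
database = {}

def register(samples, title):
    database[fingerprint(samples)] = title

def identify(samples):
    return database.get(fingerprint(samples), "no match")

register([0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 0.8], "Song A")
register([0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1], "Song B")

print(identify([0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 0.8]))  # Song A
```

A real system must of course match clips that are noisy and misaligned, which is what the spectrogram-peak approach described below the fold addresses.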
The practice of using acoustic fingerprints is not limited to music, however, but extends to other areas of the entertainment business as well. Shazam can also identify television shows with the same technique of acoustic fingerprinting. Of course, this method of breaking a sound sample down into a unique signature is useless without an extensive database of music keyed for matching against those samples. Shazam has over 11 million songs in its database.
Other services, such as Midomi and SoundHound, allow users to add to that library of music, increasing the chances of matching a sound sample with its corresponding song.
Generating a signature from a song is essential for searching by sound, and it can be tricky. Applications such as Shazam work around this by creating a time-frequency graph called a spectrogram.
Any piece of music can be translated into a spectrogram. Each song in the database is essentially a graph that plots the three dimensions of music: frequency versus amplitude (intensity) versus time. The algorithm then picks out the points that peak in the graph, labeled as "higher energy content". In practice, this works out to about three points per song.
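The spectrogram-and-peaks idea can be sketched with NumPy's FFT: slice the signal into overlapping frames, take the magnitude spectrum of each, and keep each frame's strongest frequency bin. The frame size, hop size, and "strongest bin per frame" peak criterion are illustrative choices, not Shazam's actual parameters.

```python
import numpy as np

def spectrogram(samples, frame_size=256, hop=128):
    """Magnitude spectrogram: rows are time frames, columns frequency bins."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

def peaks(spec):
    """Return (frame_index, bin_index) of each frame's strongest bin."""
    return [(t, int(np.argmax(row))) for t, row in enumerate(spec)]

# A 440 Hz tone sampled at 8 kHz should peak near bin 440/8000*256 ≈ 14.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
print(peaks(spec)[0])  # → (0, 14)
```

Each surviving peak is a (time, frequency) point; real systems keep several peaks per frame and discard everything below an energy threshold, which is what makes the fingerprint robust to background noise.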
This is how a song can be identified from just two or three notes, and it greatly reduces the impact that background noise has on searching by sound. The key values taken from each peak are frequency in hertz and time in seconds. Shazam builds its fingerprint catalog as a hash table, where the key is the frequency. It does not mark just a single point in the spectrogram; rather, it marks a pair of points: the "peak intensity" plus a second "anchor point". The key is therefore not a single frequency but a hash of the frequencies of both points. This leads to fewer hash collisions, which in turn speeds up catalog searching by several orders of magnitude by taking greater advantage of the table's constant (O(1)) look-up time.
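The pairing-and-hashing step above can be sketched as follows: each peak is paired with a few nearby anchor points, and both frequencies plus the time delta are packed into one integer key. The bit widths, fan-out, and peak values here are illustrative guesses, not Shazam's published layout.

```python
# Sketch of pairing a spectrogram peak with an "anchor" point and
# hashing the pair into a catalog keyed for O(1) lookup.

def pair_hashes(peak_list, fan_out=3):
    """peak_list: list of (time, frequency). Yield (hash, anchor_time)."""
    for i, (t1, f1) in enumerate(peak_list):
        for t2, f2 in peak_list[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            # Pack both frequencies and the time delta into one integer key.
            yield (f1 << 20) | (f2 << 10) | dt, t1

# Build the catalog as a hash table: key -> list of (song_id, anchor_time).
catalog = {}
song_peaks = [(0, 100), (1, 150), (2, 90), (3, 120)]
for h, t in pair_hashes(song_peaks):
    catalog.setdefault(h, []).append(("song-42", t))

# A query clip generates the same hashes, so each lookup is O(1).
query_peaks = [(0, 100), (1, 150), (2, 90)]
matches = [catalog[h] for h, _ in pair_hashes(query_peaks) if h in catalog]
print(matches[0])  # [('song-42', 0)]
```

Because the composite key is far more selective than a single frequency, most lookups hit exactly one song, which is the collision reduction the text describes.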
This method of acoustic fingerprinting lets applications such as Shazam differentiate between two closely related covers of a song, without having to account for a song's popularity.
Midomi and SoundHound both utilize Query by Humming (QbH). This is an offshoot of acoustic fingerprinting, but it is still a music retrieval system: it receives a user-generated hummed melody as the input query and returns a ranked list of the songs closest to that query.
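QbH retrieval is commonly illustrated with melodic contour matching: reduce each melody to its up/down/repeat step pattern (a "Parsons code") and rank library songs by edit distance to the hummed query. This is a textbook QbH sketch, not SoundHound's actual, unpublished algorithm; the library entries and pitch sequences are made-up examples.

```python
def contour(pitches):
    """Encode a pitch sequence as U(p)/D(own)/R(epeat) steps."""
    steps = []
    for a, b in zip(pitches, pitches[1:]):
        steps.append("U" if b > a else "D" if b < a else "R")
    return "".join(steps)

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

library = {
    "Ode to Joy":      contour([64, 64, 65, 67, 67, 65, 64, 62]),
    "Twinkle Twinkle": contour([60, 60, 67, 67, 69, 69, 67]),
}

def rank(hummed_pitches):
    """Return library titles sorted by contour distance to the hum."""
    q = contour(hummed_pitches)
    return sorted(library, key=lambda title: edit_distance(q, library[title]))

# A slightly off-key hum of Ode to Joy still ranks it first, because the
# up/down contour survives even when the absolute pitches are wrong.
print(rank([63, 63, 66, 68, 68, 66, 63, 61])[0])  # Ode to Joy
```

Ranking by distance rather than requiring an exact hash match is what distinguishes QbH from the fingerprint lookup described earlier: hummed queries are approximate, so the system must return its nearest candidates.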