Until 2008 there was no service that could reliably recognize a track playing in an ambient-noise environment. The Shazam app, launched on the Android platform in October 2008, revolutionized song identification. The challenge is to match an audio sample against a database of around a couple million tracks in the shortest possible time. The basic idea is that Shazam matches the input recording against its stored database and returns the result to the user. The key principle is audio fingerprinting: a compact, content-based signature that summarizes an audio recording. It is a straightforward approach to recognizing audio independent of its format, with no need for metadata or watermark embedding. Shazam stores three-dimensional spectrographs of all the songs in its database; the three dimensions are time, frequency, and intensity. Only the high-intensity points of each track, at their respective times and frequencies, are stored. Shazam generates a similar spectrograph of the recorded audio sample, and these high-intensity points are then matched against the database to identify the track.

In this report we discuss in detail all the steps involved in recognizing an audio sample. The basic idea is that the Shazam app, when opened in a musical environment, creates an audio fingerprint of the song being played. This fingerprint is matched against the pre-saved audio fingerprints in the company's database. Upon a successful match, track information such as the track name, album, lyrics, artist, and genre is displayed.


This whole process starts with the generation of a spectrograph. The spectrograph is a 3D graph: it has time on one axis, frequency on another, and the sound intensity at a particular time and frequency on the third. The spectrogram is the squared magnitude of a short-time Fourier transform. The mathematical representation is shown below.
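A standard way to write this relation, with x[n] the sampled signal and w[n] an analysis window (the exact window used by Shazam is not specified here), is:

```latex
\mathrm{spectrogram}\{x\}(t,\omega)
  = \left| \mathrm{STFT}\{x\}(t,\omega) \right|^{2}
  = \left| \sum_{n=-\infty}^{\infty} x[n]\, w[n-t]\, e^{-j\omega n} \right|^{2}
```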


An example of a spectrograph used by Shazam is shown below in the figure. This spectrograph was generated for the track "Blurred Lines" by Robin Thicke.



The audio fingerprint is extracted from this spectrograph by a process called feature extraction. Track identification using audio fingerprinting is considered one of the fastest and most accurate forms of music recognition. Audio fingerprints are compact, content-based signatures of audio recordings that capture highly specific characteristics of a short audio fragment. This makes it possible to identify a fingerprint accurately and distinguish it from those of millions of other songs. The key strength of audio fingerprinting is that it links unknown audio to its corresponding metadata, irrespective of the audio format. An ideal audio fingerprinting system should recognize an unknown audio fragment from the database regardless of environmental noise, and it should be efficient. Efficiency is attained by making the fingerprints compact and by implementing a smart search algorithm; it also requires a robust fingerprint extraction process, so that the fingerprints and matching algorithms identify the same content even when the audio recording is distorted. Fingerprinting extracts the characteristics of the audio recording in a robust and concise format. [1] The requirements for an audio fingerprint include:

1. Discrimination power over a huge number of other fingerprints.

2. Invariance to distortions

3. Compactness

4. Computational simplicity

An example of matching using audio fingerprints extracted from the spectrograph is shown below in the figure.




A sound is merely a physical vibration that propagates through the air and hits the ear, which decodes the information. This vibration can be represented as a sinusoidal waveform in time and frequency.


y = A sin(2πft)

The above equation describes a pure tone of a single frequency f. Here y is the instantaneous voltage and A is the maximum signal amplitude. The intensity plotted on the spectrograph is the square of the magnitude of the maximum amplitude A. A pure tone audible to humans has a frequency in the range 20 Hz to 20 kHz. Every real sound is a combination of multiple pure tones of different amplitudes. The tracks we listen to are analog signals, which the human ear can decode directly, but a computer cannot process analog signals: for analysis and manipulation they have to be converted into the digital domain. This conversion is known as sampling (analog-to-digital conversion). The analog signal is continuous in time; to store it, we break it down into digital samples, and the information in each digital sample does not change over time. To preserve the information of the original sound, a sufficient number of digital samples has to be taken, and that number depends on the frequency of the sound wave. According to the Nyquist criterion, the sampling frequency has to be more than twice the maximum signal frequency.
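As a minimal sketch of these ideas (the tone frequency, amplitude, and sampling rate below are illustrative choices, not Shazam's actual parameters):

```python
import numpy as np

f = 440.0    # pure tone frequency in Hz (A4, chosen for illustration)
fs = 2000.0  # sampling frequency, comfortably above 2*f
A = 1.0      # maximum signal amplitude

# 200 digital samples of y = A sin(2*pi*f*t), i.e. 0.1 s of audio
t = np.arange(200) / fs
y = A * np.sin(2 * np.pi * f * t)

# The Nyquist criterion requires fs > 2*f for the tone to be reconstructible
nyquist_ok = fs > 2 * f
```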


These samples are digitized and converted to the frequency domain. The transformation used is the DFT (Discrete Fourier Transform).
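A sketch of this step using NumPy's FFT (the sampling rate, window length, and test frequency are assumptions for illustration):

```python
import numpy as np

fs = 8000       # assumed sampling rate (Hz)
n = 1024        # number of digital samples in the analysis window
f = 1000.0      # pure tone frequency to analyse (illustrative)

t = np.arange(n) / fs
x = np.sin(2 * np.pi * f * t)  # digitized time-domain samples

# Discrete Fourier Transform: time domain -> frequency domain
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, d=1 / fs)

# The dominant frequency bin should sit at (approximately) the tone frequency
peak_freq = freqs[np.argmax(np.abs(X))]
```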


The intensity of the sound is recorded against time and frequency using a short-time Fourier transform, and the spectrogram is generated.
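The short-time Fourier transform can be sketched by windowing the signal and applying the DFT to each segment; the window length, hop size, and Hann window below are common defaults, not Shazam's published parameters:

```python
import numpy as np

fs = 8000   # assumed sampling rate (Hz)
win = 256   # short-time window length in samples
hop = 128   # hop between successive windows

t = np.arange(fs) / fs             # one second of audio
x = np.sin(2 * np.pi * 440.0 * t)  # a 440 Hz test tone

window = np.hanning(win)
frames = []
for start in range(0, len(x) - win + 1, hop):
    segment = x[start:start + win] * window
    spectrum = np.fft.rfft(segment)       # short-time Fourier transform
    frames.append(np.abs(spectrum) ** 2)  # intensity = |STFT|^2

# Rows are time frames, columns are frequency bins: the spectrogram matrix
spectrogram = np.array(frames)
```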


The audio fingerprint of the sound is generated from the spectrograph. The working process is shown below in the figure.

[Figure: Feature extraction]

The recorded intensity values are quantized into integers and a spectrogram matrix is created. Considering a fingerprint for a duration of one second, the mean of the intensities is calculated. The next step is to create a binary image: if an intensity value is greater than the mean, a 1 is recorded; if it is less than the mean, a 0 is recorded. The binary image is shown in the figure. The binary image is then divided into 2x2 tiles, creating 16 tiles in total, and each tile is assigned the sum of its binary values. A brief analysis is then done to pick the 6 vectors that represent the most salient points in the quantized image. These 6 vectors form the fingerprint of the sound for one second. This process is called feature extraction.
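A sketch of the binarize-and-tile step, assuming an 8x8 quantized spectrogram matrix so that 2x2 tiles yield the 16 tiles described (the matrix size and random intensities are illustrative assumptions):

```python
import numpy as np

# Hypothetical quantized spectrogram matrix for a one-second fingerprint
rng = np.random.default_rng(0)
intensities = rng.integers(0, 100, size=(8, 8))

# Binarize: 1 where the intensity exceeds the mean, 0 otherwise
mean = intensities.mean()
binary = (intensities > mean).astype(int)

# Divide the binary image into 2x2 tiles (4x4 = 16 tiles) and
# assign each tile the sum of its binary values
tiles = binary.reshape(4, 2, 4, 2).sum(axis=(1, 3))

# Pick the 6 most salient tiles (largest sums) as the feature vectors
salient = np.argsort(tiles.ravel())[-6:]
```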


Shazam's database holds fingerprints of over 11 million tracks. Matching an input fingerprint against this database and returning accurate results is a complex task. This section discusses the matching process that takes place at Shazam's database.

Hashing is a technique for storing and retrieving data: a hash function maps an input to an integer index in a hash table. Shazam's audio recognition tool uses hashing by storing the times and frequencies of the fingerprint; the peaks of the fingerprint are combined into a standalone hash value. The hash table in the database has entries for the frequency and time of a particular track, and the hash of the input query is computed in the same way. An example of a query hash table and a database hash table is shown below in the figure.
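The database side can be sketched as a hash table keyed on quantized peak frequency, with each entry storing the time offset and the track it came from; the track names, peak values, and rounding-based quantization are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical database: per-track lists of (time_s, frequency_hz) peak points
database = {
    "Song A": [(12.310, 823.44), (12.650, 1892.31)],
    "Song B": [(34.678, 823.44), (34.945, 1892.31)],
}

# Hash table keyed on quantized peak frequency
hash_table = defaultdict(list)
for track, peaks in database.items():
    for time_s, freq in peaks:
        key = round(freq)  # simple quantization as the hash key
        hash_table[key].append((time_s, track))

# A query peak at 823.44 Hz retrieves every candidate entry in one lookup
matches = hash_table[round(823.44)]
```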


The query sample shows the frequency of the sound at the recorded time offset. Shazam takes in a window of at most 10 seconds, sampled approximately 3 times a second. At the database end, large hash tables store the frequencies and times of the various songs in ascending order.

Suppose the first entry in the query sample is matched with the 2nd row of the database sample (823.44 Hz), and the 2nd entry in the query sample is matched with the last row of the database sample (1892.31 Hz). [2] The timing difference is computed as


δt_k = t_k' - t_k

The offsets computed for the two matches must be equal. In this case,

(1.321 - 1.054) = (34.945 - 34.678)

Once this condition is satisfied, we can flag 1 hit. Executing the same computation over a period of 30 samples, a sufficient number of hits results in a match. In this case the query sample matches "Song B" by "Artist 2". A clearer illustration of a hash function is shown in the figure.



1. Background noise and significant noise power: this tool is mostly used in clubs and other noisy environments. The noise power reduces the signal-to-noise ratio, so sensing the peak intensities can be a problem.

2. Distortion arising from the sampling equipment: microphone technology varies across handheld devices, and the sampling frequency can be faster or slower than expected. The Shazam system has to adapt to different sensing devices.

3. Types of playback: some songs are played at a faster rate (remixes) and some at a slower rate (instrumentals). The time difference between two recorded intensity points may therefore differ from the original song in the database.

4. Database management: Shazam has to store the fingerprints of more than 11 million songs without increasing the wait time for a result. The probability of a hit has to keep improving even as new songs are continually added to the database.

5. System traffic: Shazam has more than 500 million users. Serving this many users with unique song requests is a challenge, and delays in computation cannot be tolerated as they might result in a failed match.


This technique, as shown in the figure, helps tackle challenges such as noise and distortion.


The working overview of the clustering-based technique is a straightforward approach. Noise is best tackled by averaging the amplitudes of the signal; in this case, we average the intensity of small clusters. To understand clustering, consider the figure shown below.


Consider 4 clusters of many intensity points. Let them be P1, P2, P3 and P4 where

P1 = {1, 2, 3, 5, 6}

P2 = {2, 3, 5, 6, 9}

P3 = {2, 5, 6, 11, 15}

P4 = {3, 7, 8, 10, 15}

The new centroid (average) of this cluster combination will be P = {2, 3, 5, 6, 15}, because 2, 3, 5, 6 and 15 are the most frequently repeated intensity points across all 4 clusters. In a similar way, we can compute the average clusters for a database of songs and save them as references. The reference fingerprint in the database is partitioned into different clusters. When the input query sample is received, we assign it the respective cluster value depending on its intensity values. The end-to-end process of the clustering algorithm is shown below in the figure; here the matching is done using clustering.
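This centroid computation can be sketched directly from the P1 to P4 example above, selecting the five most frequently occurring intensity points:

```python
from collections import Counter

# The four example clusters of intensity points from the text
clusters = [
    {1, 2, 3, 5, 6},    # P1
    {2, 3, 5, 6, 9},    # P2
    {2, 5, 6, 11, 15},  # P3
    {3, 7, 8, 10, 15},  # P4
]

# Count how often each intensity point appears across the clusters and
# keep the five most frequent ones as the representative centroid P
counts = Counter()
for cluster in clusters:
    counts.update(cluster)

centroid = sorted(p for p, _ in counts.most_common(5))
```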


Once the input query samples are labeled with their respective cluster values (C1, C2, C3, ...), a look-up table (LUT) is formed (d). This look-up table maps each query frame number to the corresponding reference cluster value. The test frames in the reference database are then labeled by the query frames (e). The query frame is slid over the reference frames and the number of matches at each position is recorded (f). The start frame with the most matches is the resulting matching frame (g). This process is carried out over a number of recorded sample frames and the result is computed. It uses a GPU for faster computation of the comparisons between the query frame and the reference frames.
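The slide-and-count step can be sketched as follows; the cluster labels are represented as small integers and their values are purely illustrative (a real implementation would run these comparisons in parallel on the GPU):

```python
# Query and reference frames labeled by cluster value (C1, C2, ...)
query = [2, 3, 1]                     # cluster labels of the query frames
reference = [1, 2, 3, 1, 4, 2, 3, 1]  # cluster labels of the database frames

# Slide the query over the reference and count matching labels at each
# start position; the position with the most matches is the result
best_start, best_hits = 0, -1
for start in range(len(reference) - len(query) + 1):
    window = reference[start:start + len(query)]
    hits = sum(q == r for q, r in zip(query, window))
    if hits > best_hits:
        best_start, best_hits = start, hits
```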


The techniques behind audio track recognition were discussed in this report. We discussed the generation of the spectrogram, wherein the most critical data resides, followed by the extraction of the fingerprint using the feature extraction technique. The match at the database was shown using frequency-time hash tables. A faster approach to audio recognition, the clustering-based technique, was also discussed. Audio fingerprinting technology gives Shazam the edge over competing sound recognition apps across the various platforms.


1. Cano, P., Batlle, E., Kalker, T., & Haitsma, J. (2002, December). A review of algorithms for audio fingerprinting. In Multimedia Signal Processing, 2002 IEEE Workshop on (pp. 169-173). IEEE.

2. Ouali, C., Dumouchel, P., & Gupta, V. Fast audio fingerprinting system using GPU and a clustering-based technique.

3. Wang, A. (2003, October). An industrial strength audio search algorithm. In ISMIR (pp. 7-13).

4. Wang, A. (2006). The Shazam music recognition service. Communications of the ACM, 49(8), 44-48.