Panako - A Scalable Acoustic Fingerprinting System Handling Time-Scale and Pitch Modification
Joren Six and Marc Leman - joren.six@ugent.be - IPEM, University Ghent - Belgium
15th ISMIR Conference, Taipei, Taiwan, 27-31 October, 2014
References
[1] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. The Journal of VLSI Signal Processing, 2005.
[2] A. L. Wang. An industrial-strength audio search algorithm. In Proceedings of ISMIR 2003, 2003.
[3] D. Ellis, B. Whitman, and A. Porter. Echoprint - an open music identification service. In Proceedings of ISMIR 2011, 2011.
[4] A. Arzt, S. Böck, and G. Widmer. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of ISMIR 2012, 2012.
[5] J. Six, O. Cornelis, and M. Leman. TarsosDSP, a real-time audio processing framework in Java. In Proceedings of the 53rd AES Conference, 2014.
[6] S. Fenet, G. Richard, and Y. Grenier. A scalable audio fingerprint method with robustness to pitch-shifting. In Proceedings of ISMIR 2011, 2011.
[7] R. Sonnleitner and G. Widmer. Quad-based audio fingerprinting robust to time and frequency scaling. In Proceedings of the Conference on Digital Audio Effects (DAFx 2014), 2014.
[8] B. Zhu, W. Li, Z. Wang, and X. Xue. A novel audio fingerprinting method robust to time scale modification and pitch shifting. In Proceedings of the ACM Conference on Multimedia, 2010.
[9] M. Malekesmaeili and R. K. Ward. A local fingerprinting approach for audio copy detection. Computing Research Repository (CoRR), 2013.
Abstract
This poster presents a scalable granular acoustic fingerprinting system. An acoustic fingerprinting system uses condensed representations of audio signals, called acoustic fingerprints, to identify short audio fragments in large audio databases. The system presented here is shown to answer queries quickly and reliably, even when queries are subjected to time-scale and pitch modifications. The design of this system is the main contribution of this research.
Fig 1. General acoustic fingerprinting scheme: features are extracted from audio, fingerprints are constructed from the features, and the fingerprints are matched against other fingerprints to identify the audio.
Fig 3. The effect of time-scale and pitch modifications on a fingerprint (time (step) vs. frequency (bin); the time differences Δt1 and Δt2 are shown for the reference, pitch-shifted, time-stretched, and time-scale-modified versions).
Availability & Reproducibility
The Panako software is available at http://panako.be under the AGPL. Panako is tested on Debian and Mac OS X, but should work on every platform with a recent Java Runtime Environment. The dataset used in the validation is freely available from Jamendo.com, a website where artists share their work freely under various Creative Commons licenses. To reproduce the results, scripts are available to download the audio dataset, generate query files, store the reference audio, and query the system. Supporting tools to analyse the query results are available as well.
Results & Conclusions
The system has been evaluated using a freely available dataset of 30,000 songs and compared with a baseline system (see Figures 4-7). This work presented a practical acoustic fingerprinting system that allows fast and reliable identification of short audio fragments in a large set of reference audio, even when the fragment has been pitch-shifted and time-stretched with respect to the reference audio. If a match is found, the system reports where in the reference audio the query matches, and by how much time and frequency have been modified. To achieve this, the system uses local maxima in a Constant-Q spectrogram. It combines event points into groups of three and uses time ratios to form a time-scale-invariant fingerprint component. To form pitch-shift-invariant fingerprint components, only frequency differences are stored. For retrieval, an exact-hashing matching algorithm is used.
Introduction
Acoustic fingerprinting systems have many practical use cases. They follow the scheme depicted in Figure 1. Ideally, a fingerprinting system needs only a short audio fragment to find a match in a large set of reference audio. One challenge is to design the system so that the reference database can grow to contain millions of entries. Another is that a robust fingerprinting system should handle noise and other modifications well, while limiting false positives and processing time [1]. These modifications typically include dynamic range compression, equalization, added background noise, and artifacts introduced by audio coders or A/D-D/A conversions.
Over the years, several efficient acoustic fingerprinting methods have been introduced [2,3]. These methods perform well even with degraded audio quality and industrial-sized reference databases. However, they are not designed to handle queries with modified time-scale or pitch, although these distortions can be present in replayed material. During radio broadcasts, songs are occasionally played faster to make them fit into a time slot. During a DJ set, pitch-shifting and time-stretching are present almost continuously. To correctly identify audio in these cases as well, a fingerprinting system robust against pitch-shifting and time-stretching is desired.
Some fingerprinting systems have been developed that take pitch-shifts into account [6]. Others are designed to handle both pitch and time-scale modification [8,9], but to find a match these computationally expensive systems iterate over the whole database. To the best of our knowledge, a description of a practical fingerprinting system that allows substantial pitch-shift and time-scale modification can only be found in [7] and in this work.
Method
The proposed method is inspired by three works [2,4,6]. Combining key components of those works results in a granular acoustic fingerprinter that is robust to noise and substantial compression, has a scalable method for fingerprint storage and matching, and handles time-scale modification and pitch-shifting. The method uses local maxima in a spectral representation [2]. It combines three event points and takes time ratios to form time-scale-invariant fingerprints [4]. It leverages the Constant-Q transform, and only stores frequency differences for pitch-shift invariance [6]. The fingerprints are designed with an exact-hashing matching algorithm in mind [2]. The whole process is depicted in Figure 2.
Feature extraction: the first step transforms the audio to a spectral representation. The Constant-Q transform is used to get an equal number of frequency bins in each octave (Fig 2a). The next step locates peaks within the time-frequency plane (Fig 2b), using a tiled 2D peak extraction algorithm.
Fingerprint construction: to form fingerprints, three peaks are combined, as in Figure 2c. The effects of pitch-shifting, time-stretching, and time-scale modification on a fingerprint extracted from reference audio can be seen in Figure 3. The ratios between the time differences and the differences between the frequency components are invariant under these modifications; these constants are employed in the fingerprint hash. To add discriminative power to the hash, coarse frequency location indicators are included as well. The hashes are stored in a B-tree.
Matching: to match a query with the reference audio, fingerprints and their corresponding hashes are extracted from the query. For each hash, matching reference audio items are fetched from the datastore. Random matches are removed from this result set by keeping only reference audio items that are present multiple times.
To further limit the number of false positives, alignment in time is checked. A match is only marked as valid if the time-stretch factor and pitch-shift factor between query and reference audio are constant. Finally, if a match is found, the reference audio identifier is returned.
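The fingerprint construction described above can be sketched in a few lines. The sketch below is illustrative only, not Panako's implementation: the quantization steps and bit widths are hypothetical, as is the `fingerprint_hash` function itself. It shows why the hash survives the modifications: a uniform time stretch scales all time differences equally, so their ratio is unchanged, and a pitch shift moves all constant-Q bins by the same offset, so bin differences are unchanged.

```python
# Illustrative sketch of a time-scale- and pitch-shift-invariant hash built
# from three spectral peaks. Bit widths and quantization are assumptions.

def fingerprint_hash(p1, p2, p3):
    """Each peak is a (time_step, frequency_bin) pair, ordered in time."""
    (t1, f1), (t2, f2), (t3, f3) = p1, p2, p3
    # Ratio of time differences: invariant under uniform time stretching,
    # since both numerator and denominator scale by the same factor.
    dt_ratio = (t2 - t1) / (t3 - t1)
    # Frequency-bin differences: invariant under pitch shifts, because a
    # pitch shift offsets every constant-Q bin by the same amount.
    df1 = f2 - f1
    df2 = f3 - f2
    # Coarse frequency location adds discriminative power; it only stays
    # stable for pitch shifts smaller than the coarse bin width.
    coarse_f = f1 // 8
    # Quantize the continuous ratio and pack everything into one integer key.
    q_ratio = int(dt_ratio * 32)
    return (q_ratio & 0x1F) << 16 | (df1 & 0xFF) << 8 | \
           (df2 & 0x3F) << 2 | (coarse_f & 0x3)
```

With this packing, a query stretched to twice the speed or shifted up a few bins still yields the same key as the reference, so an exact lookup suffices.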
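The matching and filtering steps can likewise be sketched. This is a simplified stand-in, not Panako's datastore API: `match`, `min_hits`, and the plain-dict datastore are hypothetical names, and only the time-stretch consistency check is shown (the pitch-shift check would follow the same pattern on frequency bins).

```python
# Sketch of candidate filtering: query hashes are looked up in the
# reference store; random collisions are discarded by requiring several
# hits per reference item and a consistent time-stretch factor.
from collections import defaultdict

def match(query_prints, datastore, min_hits=3, tolerance=0.05):
    """query_prints: list of (hash, query_time) pairs.
    datastore: maps a hash to a list of (reference_id, reference_time)."""
    hits = defaultdict(list)  # reference_id -> [(t_ref, t_query), ...]
    for h, t_query in query_prints:
        for ref_id, t_ref in datastore.get(h, []):
            hits[ref_id].append((t_ref, t_query))
    matches = []
    for ref_id, pairs in hits.items():
        if len(pairs) < min_hits:        # drop random hash collisions
            continue
        pairs.sort()
        (r0, q0), (r1, q1) = pairs[0], pairs[-1]
        if r1 == r0:
            continue
        # The time-stretch factor must be constant over all aligned pairs.
        factor = (q1 - q0) / (r1 - r0)
        aligned = all(abs((q - q0) - factor * (r - r0)) <= tolerance * (r1 - r0)
                      for r, q in pairs)
        if aligned:
            matches.append((ref_id, factor))
    return matches
```

A returned factor of, say, 1.2 indicates the query runs 20% slower than the reference, which is exactly the "how much time has been modified" report mentioned above.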
Fig 2. The Panako fingerprinting system combines triplets of peaks in a Constant-Q spectrogram to form fingerprints. All panels plot f (bins) against t (frames).
Fig 2a. Step 1: Constant-Q transform of the incoming audio.
Fig 2b. Step 2: extract event points from the spectrogram with a tiled 2D peak extraction algorithm.
Fig 2c. Step 3: connect the event points in sets of three; each triplet forms a fingerprint.
Fig 2d. Step 4: each fingerprint is hashed; the hashes are matched with the reference database to identify the audio.
Fig 2e. Fingerprint hashes in the reference database; here, the query is time-stretched.
Fig 6. The effect of time-scale modification on retrieval performance: true positives (%) vs. time-scale modification factor (80-120%), for 20 s, 40 s, and 60 s fragments and Audfprint (40 s).
Fig 7. The effect of pitch shift on retrieval performance: true positives (%) vs. pitch shift (80-120%), for 20 s, 40 s, and 60 s fragments and Audfprint (40 s).
Fig 4. The effect of time-stretching on retrieval performance: true positives (%) vs. time-stretch factor (80-120%), for 20 s, 40 s, and 60 s fragments and Audfprint (40 s).
Fig 5. The effect of various audio effects (band-pass, echo, tremolo, flanger, GSM, chorus) on retrieval performance: true positives (%) for 20 s, 40 s, and 60 s fragments and Audfprint (40 s), compared with the unmodified reference.