Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen
Joren Six
Proefschrift voorgelegd tot het behalen van
de graad van Doctor in de Kunstwetenschappen
Academisch jaar 2017-2018
Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen
Promotor: Prof. dr. Marc Leman
Doctoraatsbegeleidingscommissie: Dr. Olmo Cornelis, Dr. Frans Wiering
Examencommissie: Dr. Federica Bressan, Dr. Olmo Cornelis, Prof. dr. ir. Tijl De Bie, Prof. dr. Pieter-Jan Maes, Dr. Frans Wiering, Dr. Micheline Lesaffre, Dr. Luc Nijs
Proefschrift voorgelegd tot het behalen van
de graad van Doctor in de Kunstwetenschappen
Academisch jaar 2017-2018
Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen, IPEM
De Krook, Miriam Makebaplein 1, B-9000 Gent, België
This doctoral dissertation is the culmination of my research carried out at both IPEM, Ghent University and the School of Arts, also in Ghent. I have been lucky enough to pursue and combine my interests in both music and computer science in my research. As a trained computer scientist I have been applying my engineering background to problems in systematic musicology. The output of this work has been described in various articles, some of which are bundled in this dissertation.
Admittedly, my research trajectory does not follow the straightest path but meanders around several fields. This meandering has enabled me to enjoy various vistas and led me to a plethora of places - physically and intellectually - not easily reached without taking a turn now and again. I think this multi-disciplinary approach prepared me better for a hopefully somewhat stable career in research. I also had the time required to cover a lot of ground. At the School of Arts, Ghent I was employed for four years as a scientific employee. At IPEM, Ghent University I was given the opportunity to continue my work as a doctoral student, again for four years. This allowed me not only to gain a broad perspective but also to reach the depth required to contribute new knowledge and propose innovative methods.
It is safe to say that without Olmo Cornelis I would not have started this PhD project. Thanks to Olmo for writing the project proposal which eventually allowed me to start my research career at the School of Arts. The concept of having an engineer next to a humanities scholar was definitely enriching to me and I do hope that the opposite is also somewhat true. His guidance during those first four (and following) years was indispensable. His pointers on music theory, support with academic writing and ideas on computer aided (ethno)musicology are just a few examples.
Actually, I want to profoundly thank the whole group of colleague researchers at the School of Arts, or what was then known as the Royal Conservatory of Ghent. Undeniably, they had a defining influence on my research and personal life. I fondly remember many discussions in the cellar and at my desk at the Wijnaert. I struggle to think of another case where gossiping, scheming and collusion between a colleague and a colleague's partner could have a more positive outcome. Indeed, I do mean you: Ruth and Clara.
Later on, Marc Leman gave me the opportunity to continue my research at IPEM. I am very grateful to have been offered this privilege. I am quite aware that being able to pursue one’s interests by doing research is exactly that: a privilege. IPEM provided fertile ground to further my research. Marc’s hands-off approach displays a great amount of trust in his research staff. This freedom worked especially well for me since it made me self-critical on priorities and planning.
I would also like to acknowledge the IPEM bunch: a diverse collective of great individuals each in their own way. I especially would like to thank Ivan for the many hardware builds and creative ideas for practical solutions. Katrien for taking care of the administrative side of things. Esther for having the patience to listen to me whining about my kids. Jeska and Guy for proofreading this work. And all the others for the many discussions at the kitchen table during lunch and generally for being great colleagues.
Furthermore, I am very grateful to the RMCA (Royal Museum for Central Africa), Tervuren, Belgium for providing access to its unique archive of Central African music.
Thanks to my friends and family for the support over the years. I would especially want to thank Will for proofreading parts of this work and Emilie for pointing me to VEWA. Of course, it is no exaggeration to claim that this work would not be here without my parents. Thanks for kindling an interest in music, letting me attend piano lessons and for keeping me properly fed and relatively clean, especially in my first years. Thanks also to Bobon for caring for Oscar and Marit, often on short notice in those many small emergency situations. On the topic: I would like to leave Oscar and Marit out of this acknowledgment since they only sabotaged this work, often quite successfully. But they are a constant reminder of the relativity of things and I love them quite unconditionally. Finally, I would like to thank the light of my life, the daydream that accompanies me at night, the mother of my children: Barbara.
Een van de grote onderzoeksvragen in systematische muziekwetenschappen is hoe mensen met muziek omgaan en deze begrijpen. Deze wetenschap onderzoekt hoe muzikale structuren in relatie staan met de verschillende effecten van muziek. Deze fundamentele relatie kan op verschillende manieren benaderd worden. Bijvoorbeeld een perspectief vertrekkende vanuit traditie waarbij muziek aanzien wordt als een fenomeen van menselijke expressie. Een cognitief perspectief is een andere benadering, daarbij wordt muziek gezien als een akoestische informatiestroom gemoduleerd door perceptie, categorisatie, blootstelling en allerhande leereffecten. Een even geldig perspectief is dat van de uitvoerder waarbij muziek voortkomt uit gecoördineerde menselijke interactie. Om een muzikaal fenomeen te begrijpen is een combinatie van (deel)perspectieven vaak een meerwaarde.
Elk perspectief brengt methodes met zich mee die onderzoeksvragen naar concrete musicologische onderzoeksprojecten kunnen omvormen. Digitale data en software vormen tegenwoordig bijna altijd de kern van deze methodes. Enkele van die algemene methodes zijn: extractie van akoestische kenmerken, classificatie, statistiek en machine learning. Een probleem hierbij is dat het toepassen van deze empirische en computationele methodes technische oplossingen vraagt. Het ontwikkelen van deze technische oplossingen behoort vaak niet tot de competenties van onderzoekers, die doorgaans een achtergrond in de zachte wetenschappen hebben. Toegang tot gespecialiseerde technische kennis kan op een bepaald punt noodzakelijk worden om hun onderzoek verder te zetten. Mijn doctoraatsonderzoek situeert zich in deze context.
Ik presenteer in dit werk concrete technische oplossingen die bijdragen aan de systematische muziekwetenschappen. Dit gebeurt door oplossingen te ontwikkelen voor meetproblemen in empirisch onderzoek en door implementatie van onderzoekssoftware die computationeel onderzoek faciliteert. Om over de verschillende aspecten van deze oplossingen een overzicht te krijgen worden ze in een vlak geplaatst.
De eerste as van dit vlak contrasteert methodes met dienstverlenende oplossingen (services). Methodes dragen manieren aan om nieuwe inzichten te verwerven of geven aan hoe onderzoek in de systematische muziekwetenschappen kan gebeuren. Dienstverlenende oplossingen ondersteunen of automatiseren onderzoekstaken; ze kunnen de omvang van onderzoek vergroten door het eenvoudiger te maken om met grotere datasets aan de slag te gaan. De tweede as in het vlak geeft aan hoe sterk een oplossing leunt op Music Information Retrieval (MIR) technieken. MIR-technieken worden gecontrasteerd met verschillende technieken ter ondersteuning van empirisch onderzoek.
Mijn onderzoek resulteerde in dertien oplossingen die in dit vlak geplaatst worden. De beschrijving van zeven van die oplossingen is opgenomen in dit werk. Drie ervan vallen onder methodes en de resterende vier zijn dienstverlenende oplossingen (services). Het softwaresysteem Tarsos stelt bijvoorbeeld een methode voor om toonhoogtegebruik in de muzikale praktijk op grote schaal te vergelijken met theoretische modellen van toonladders. Het softwaresysteem SyncSink is een voorbeeld van een service. Het laat toe om onderzoeksdata te synchroniseren, wat het eenvoudiger maakt om meerdere sensorstromen of participanten op te nemen. Andere services zijn TarsosDSP en Panako. TarsosDSP kan kenmerken uit audio halen en Panako is een acoustic fingerprinting systeem.
In het algemeen volgen de gepresenteerde oplossingen een reproduceerbare methodologie. Computationeel en MIR onderzoek is niet altijd even makkelijk te reproduceren. In de voorgestelde oplossingen werd aandacht gegeven aan dit aspect. De software werd via open source licenties online geplaatst en de systemen werden zo veel als mogelijk getest met publiek beschikbare data. Dit maakt de processen transparant en verifieerbaar. Het stelt ook anderen in staat om de software te gebruiken, te bekritiseren en te verbeteren.
De belangrijkste bijdragen van dit doctoraatsonderzoek zijn de individuele oplossingen. Met Panako [175] werd een nieuw acoustic fingerprinting algoritme beschreven in de academische literatuur. Vervolgens werden toepassingen van Panako voor het beheer van digitale muziekarchieven beschreven en getest [168]. Tarsos [173] laat toe om toonhoogtegebruik op grote schaal te onderzoeken. Ik heb bijdragen geleverd aan de discussie rond reproduceerbaarheid van MIR-onderzoek [167]. Ook werd een systeem voor verrijkte muziekervaring voorgesteld [177]. Naast deze specifieke bijdragen zijn er ook algemene, zoals het conceptualiseren van technologische bijdragen aan de systematische muziekwetenschappen via het onderscheid tussen services en methodes. Als laatste werd het concept augmented humanities geïntroduceerd als een richting voor verder onderzoek.
One of the main research questions of systematic musicology is concerned with how people make sense of their musical environment. It is concerned with signification and meaning-formation and relates musical structures to effects of music. These fundamental aspects can be approached from many different directions. One could take a cultural perspective where music is considered a phenomenon of human expression, firmly embedded in tradition. Another approach would be a cognitive perspective, where music is considered as an acoustical signal of which perception involves categorizations linked to representations and learning. A performance perspective, where music is the outcome of human interaction, is equally valid. To understand a musical phenomenon, combining multiple perspectives often makes sense.
The methods employed within each of these approaches turn questions into concrete musicological research projects. It is safe to say that today many of these methods draw upon digital data and tools. Some of those general methods are feature extraction from audio and movement signals, machine learning, classification and statistics. However, the problem is that, very often, the empirical and computational methods require technical solutions beyond the skills of researchers who typically have a humanities background. At that point, those researchers need access to specialized technical knowledge to advance their research. My PhD-work should be seen within this context. In many respects I adopt a problem-solving attitude to problems that are posed by research in systematic musicology.
This work explores solutions that are relevant for systematic musicology. It does this by engineering solutions for measurement problems in empirical research and developing research software which facilitates computational research. These solutions are placed in an engineering-humanities plane. The first axis of the plane contrasts services with methods. Methods in systematic musicology propose ways to generate new insights into music related phenomena or contribute to how research can be done. Services for systematic musicology, on the other hand, support or automate research tasks, which allows the scope of research to change. A shift in scope allows researchers to cope with larger data sets, which offers a broader view on the phenomenon. The second axis indicates how important Music Information Retrieval (MIR) techniques are in a solution. MIR-techniques are contrasted with various techniques to support empirical research.
My research resulted in a total of thirteen solutions which are placed in this plane. The descriptions of seven of these are bundled in this dissertation. Three fall into the methods category and four into the services category. For example, Tarsos presents a method to compare performance practice with theoretical scales on a large scale. SyncSink is an example of a service. It offers a solution for synchronization of multi-modal empirical data and enables researchers to easily use more streams of sensor data or to process more participants. Other services are TarsosDSP and Panako. The former offers real-time feature extraction and the latter an acoustic fingerprinting framework.
Generally, the solutions presented in this dissertation follow a reproducible methodology. Computational research and MIR research are often problematic to reproduce due to code that is not available, copyrights on music which prevent sharing evaluation data sets, and a lack of incentive to spend time on reproducible research. The works bundled here do pay attention to aspects relating to reproducibility. The software is made available under open source licenses and the systems are evaluated using publicly available music as much as possible. This makes processes transparent and systems verifiable. It also allows others, from in and outside academia, to use, criticize and improve the systems.
The main contributions of my doctoral research are found in the individual solutions. Panako [175] contributed a new acoustic fingerprinting algorithm to the academic literature. Subsequently, applications of Panako for digital music archive management were described and evaluated [168]. Tarsos [173] facilitates large-scale analysis of tone scale use. I have contributed to the discussion on meaningful contributions to and reproducibility in MIR [167]. I have also presented a framework for active listening which enables augmented musical realities [177]. Next to these specific contributions, the more general contributions include a way to conceptualize contributions to systematic musicology along a methods versus services axis and the concept of augmented humanities as a future direction of systematic musicological research.
The first chapter outlines the problem and situates my research in the context of systematic musicology, engineering and digital humanities. It also introduces a plane in which solutions can be placed. One axis of this plane contrasts methods and services. The other axis differentiates between MIR and other techniques. The first chapter continues with a section on the general methodology which covers aspects of reproducibility. It concludes with a summary.
The next two chapters bundle seven publications in total (Chapters 2 and 3). The publications bundled in these chapters only underwent minor cosmetic changes to fit the citation style and layout of this dissertation. Each bundled publication has been subjected to peer review. The two chapters which bundle publications start with an additional introduction that focuses on how the works are placed in the broader framework of the overarching dissertation. Each introduction also contains bibliographical information which mentions co-authors together with information on the journal or conference proceedings where the research was originally published. I have limited the bundled publications to the ones for which I am the first author and which are most central to my research. This means that some works I co-authored are not included, which keeps the length in check. Some of those works are [46], [200] and [198]. However, they are situated in the plane introduced in chapter one. For a complete list of output, see Appendix A.
The fourth and final chapter offers a discussion together with concluding remarks. The contributions of my research are summarized there as well. Additionally, the term augmented humanities is introduced as a way to conceptualize future work.
Finally, the appendix contains a list of output (Appendix A). As output in the form of software should be seen as an integral part of this dissertation, software is listed as well. The appendix also includes a list of figures, tables and acronyms (Appendix B). The last part of the dissertation includes a summary in Dutch and the list of referenced works.
Systematic musicology
One of the main research questions of systematic musicology is concerned with how people make sense of their musical environment. It deals with signification and meaning-formation and relates to how music empowers people [111], how relations between musical structures and meaning formation should be understood and which effects music has. These ‘fundamental questions are non-historical in nature’ [55], which contrasts systematic musicology with historical musicology.
There are many ways in which the research questions above can be approached or rephrased. For example, the questions can also be approached from a cultural perspective, where music is considered as a phenomenon of human expression embedded in tradition, and driven by innovation and creativity. The questions can be approached from a cognitive perspective, where music is considered as information, or better: as an acoustical signal, of which the perception involves particular categorizations, cognitive structures, representations and ways of learning. Or they can be approached from a performance perspective, where music is considered as the outcome of human interactions, sensorimotor predictions and actions and where cognitive, perceptive processing goes together with physical activity, emotions, and expressive capacities. All these perspectives have their merits and to understand a phenomenon a multi-perspective approach is often adopted, based on bits and pieces taken from each approach.
Admittedly, the above list of approaches may not be exhaustive. The list is only meant to indicate that there are many ways in which musicology approaches the question of meaning formation, signification, and empowerment. Likewise, there are many ways to construct a multi-perspectivistic approach to the study of musical meaning formation. A more exhaustive historical overview of the different sides of musicology and the research questions driving its (sub)fields is given by [55].
Accordingly, the same remark can be made with respect to the methods that turn the above mentioned approaches and perspectives into concrete musicological research projects. It would be possible to list different methods that are used in musicology but it is not my intention to attempt to give such a list, and certainly not an exhaustive list. Instead, what concerns me here is the level below these research perspectives; one could call it the foundational level of the different methodologies that materialize the human science perspectives.
At this point I believe that it is safe to say that many of the advanced research methodologies in musicology today draw upon digital data and digital tools.
Research in systematic musicology provides a prototypical example of such an environment since this research tradition is at the forefront of development in which advanced digital data and tools are used and required. [7] compiled an up-to-date reference work. Indeed, while the research topics in systematic musicology have kept their typical humanities flavor – notions such as ‘expression’, ‘value’, ‘intention’, ‘meaning’, ‘agency’ and so on are quite common – the research methods have gradually evolved in the direction of empirical and computational methods that are typically found in the natural sciences [79]. A few examples of such general methods are feature extraction from audio and movement signals, machine learning, classification and statistics. This combination of methods gives systematic musicology its inter-disciplinary character.
However, the problem is that, very often, the empirical and computational methods require technical solutions beyond the skills of researchers that typically have a humanities background. At that point, those researchers need access to specialized technical knowledge to advance their research.
Let me give a concrete example to clarify my point. The example comes from a study about African music, where I collaborated with musicologist and composer dr. Olmo Cornelis on an analysis of the unique and rich archive of audio recordings at the Royal Museum for Central Africa [42]. The research question concerned music scales: have these scales changed over time due to African acculturation to European influences? Given the large number of audio recordings (approximately 35000), it is useful to apply automatic music information retrieval tools that assist the analysis; for example, tools that can extract scales from the recordings automatically, and tools that can compare the scales from African music with scales from European music. Traditionally, such tools are not made by the musicologists that do this kind of analysis, but by engineers that provide digital solutions to such a demand. If the engineers do not provide the tools, then the analysis is not possible or extremely difficult and time consuming. However, if the musicologists do not engage with engineers to specify needs in view of a research question, then engineers cannot provide adequate tools. Therefore, both the engineer and the musicologist have to collaborate in close interaction in order to proceed and advance research.
My PhD-work should be seen within the context of that tradition. In many respects I adopt a problem-solving attitude to problems that are posed by research in systematic musicology. Often the problems themselves are ill-defined. My task, therefore, is to break down ill-defined problems and to offer well-defined solutions. Bridging this gap requires close collaboration with researchers from the humanities and continuous feedback on solutions to gain a deep understanding of the problems at hand. To summarize my research goal in one sentence: the goal of my research is to engineer solutions that are relevant for systematic musicology.
Overall, it is possible to consider my contribution along two different axes. One axis covers the distinction between engineering methods and engineering services that are relevant to musicology. The other axis covers the range of engineering techniques used: they either draw on Music Information Retrieval techniques or on a number of other techniques here categorized as techniques for empirical research. These two axes together define a plane. This plane could be called the engineering-humanities plane since it offers a way to think about engineering contributions in and for the humanities. Note that the relation between this work and the humanities will be more clearly explained in a section below (see ?). The plane allows me to situate my solutions, as shown in Figure 1.
The vertical axis specifies whether an engineering solution is a service or a method. Some solutions have aspects of both services and methods and are placed more towards the center of this axis. [204] makes a similar distinction but calls it: computation for and computation in humanities. Computation for humanities is defined as the ‘instrumental use of computing for the sake of humanities’. The engineering solutions are meant to support or facilitate research, and therefore, computation-for can also be seen as a service provided by engineering. Computation-in humanities is defined as that which ‘actively contributes to meaning-generation’. The engineering solutions are meant to be used as part of the research methodology, for example, to gather new insights by modeling aspects of cultural activities, and therefore, the computation-in can be seen as methods provided by engineering.
In my PhD I explore this axis actively by building tools, prototypes, and experimental set-ups. Collectively these are called solutions. These solutions should be read with the indefinite article in mind: they present a solution in a specific context, not the solution. They are subsequently applied in musicological research. My role is to engineer innovative solutions which support or offer opportunities for new types of research questions in musicology.
An example of a solution situated at the service-for side: a word processor that allows a researcher to easily lay out and edit a scientific article. Nowadays, this seems like a trivial example but it is hard to quantify the productivity gained by employing word processors for a key research task, namely describing research. TeX, the predecessor of the typesetting system used for this dissertation, was invented specifically as a service for the (computer) science community by [91]. For the interested reader: a technological history of the word processor and its effects on literary writing is described by [87].
The services-for are contrasted with methods-in. Software situated at the method-in side is, for example, a specialized piece of software that models dance movements and is able to capture, categorize and describe prototypical gestures, so that new insights into that dance repertoire can be generated. It can be seen as a method in the humanities. The distinction can become blurry when solutions appear as method-in and as service-for depending on the usage context. An example is Tarsos (Figure 1, [173]; the article on Tarsos is also included in section [173]). If Tarsos is used to verify a pitch structure it is used as a service-for. If research is done on the pitch structures of a whole repertoire and generates novel insights, it can be seen as a method-in.
Engineering solutions become methods-in humanities
The relationship between computing and the humanities, of which musicology is a sub-discipline, has been actively investigated in the digital humanities.
Please note that I do not deal with the question whether the specific perspective is relevant for the multi-layered meanings existing in a population of users of those artifacts. What interests me here is how engineering can provide solutions to methods used by scholars studying cultural artifacts, even if those methods cover only part of the multi-layered meaning that is attached to the artifacts.
To come to a solution, it is possible to distinguish three steps. In the first step, the solution does not yet take into account a specific hardware or software implementation. Rather, the goal is to get a clear view on the scholar’s approach to the analysis of cultural artifacts. In the second step, then, it is necessary to take into account the inherent restrictions of an actual hard- or software implementation.
Finally, in the third step, an implementation follows. This results in a working tool, a solution. The solution can be seen as a model of the method. It works as a detailed, (quasi) deterministic model of a methodology.
Rather than building new solutions, it is possible that a researcher finds an off-the-shelf solution that is readily available. However, it often turns out that a set of thoughtful application-specific modifications may be necessary in order to make the standard piece of software applicable to the study of specific cultural artifacts.
While the above mentioned steps towards an engineering solution may appear as a rather deterministic algorithm, the reality is much different. Due to technical limitations the last two steps may strongly influence the ‘preceding’ step. Or, to put a more positive spin on it: the technical possibilities and opportunities may in turn influence the research questions of a researcher in the humanities. To go even one step further, technical solutions can be an incentive for inventing and adopting totally new methods for the study of cultural artifacts. Accordingly, it is this interchange between the technical implementation, modeling and application in and for the humanities that forms the central theme of my dissertation.
This view, where technical innovations serve as a catalyst for scientific pursuits in the humanities, reverses the idea that a humanities scholar should find a subservient engineer to implement a computational model. It replaces subservience with an equal partnership between engineers and humanities scholars. The basic argument is that technical innovations often have a profound effect on how research is conducted, and technical solutions may even redirect the research questions that can be meaningfully handled. In this section I have been exclusively dealing with computational solutions for modeling cultural artifacts (methods-in). Now it is time to go into detail on services-for which may also have a profound effect on research.
Engineering solutions become services-for humanities
Engineering solutions for humanities are often related to the automation or facilitation of research tasks. Such solutions facilitate research in humanities in general, or musicology in particular, but the solutions cannot be considered methods. The research tasks may concern tasks that perhaps can be done by hand but that are tedious, error-prone and/or time consuming. When automated by soft- and hardware systems, they may be of great help to the researcher so that the focus can be directed towards solving the research question instead of practical matters. A simple example is a questionnaire. When done on paper, a lot of time is needed to transcribe data into a workable format. However, when filled out digitally, it may be easy to get the data in a workable format.
Solutions that work as services for the humanities often have the power to change the scope of research. Without an engineered solution, a researcher may have been able to analyze a selected subset of artifacts. With an engineered solution, a researcher may be able to analyze large quantities of artifacts. Successful services are the result of close collaboration and tight integration. Again, I claim that an equal partnership between humanists and engineers is a better model to understand how services materialize in practice.
For example, the pitch analysis tool implemented in Tarsos [173] can handle the entire collection of 35000 recordings of the KMMA collection. Accordingly, the scope of research changes dramatically. From manual analysis of a small set of perhaps 100 songs, to automatic analysis of the entire collection over a period of about 100 years. This long-term perspective opens up entirely new research questions, such as whether Western influence affected tone-scale use in African music. Note that by offering Tarsos as a service, methods may need to be reevaluated.
Another example in this domain concerns the synchronization of different data streams used in studies of musical performance. Research on the interaction between movement and music indeed involves the analysis of multi-track audio, video streams and sensor data. These different data streams need to be synchronized in order to make a meaningful analysis possible and my solution offers an efficient way to synchronize all these data streams [176]. This solution saves a lot of effort and tedious alignment work that otherwise has to be done by researchers whose focus is not synchronization of media, but the study of musical performance. The solution is also the result of an analysis of the research needs on the work floor and has been put in practice [51]. It again enables research on a different scope: a larger number of independent sensor streams with more participants can be easily handled.
To summarize the method-services axis: methods have cultural artifacts at the center and generate new insights, whilst services facilitate research tasks which have the potential to profoundly influence research praxis (e.g. research scope). The other axis in the plane deals with the centrality of MIR-techniques in each solution.
The second axis of the engineering-humanities plane specifies an engineering approach. It maps how central music information retrieval (MIR) techniques are in each solution. The category of techniques for empirical research includes sensor technology, micro-controller programming, analog to digital conversion and scripting techniques. The MIR-techniques turned out to be very useful for work on large music collections, while the techniques in analogue-digital engineering turned out to be useful for the experimental work in musicology.
One of the interesting aspects of my contribution, I believe, is concerned with solutions that are situated in between MIR and tools for experimental work, such as the works described in [176] and [177]. To understand this, I provide some background to the techniques used along this axis. To get a grasp of these techniques, it is perhaps best to start with a short introduction to MIR and related fields.
Symbol-based MIR
The most generally agreed upon definition of MIR is given by [54].
Originally the field was mainly involved in the analysis of symbolic music data. A music score, encoded in a machine readable way, was the main research object. Starting from the 1960s computers became more available and the terms computational musicology and music information retrieval were coined. The terms immediately hint at the duality between searching for music - accessibility, information retrieval - and improved understanding of the material: computational musicology. [26] provides an excellent introduction and historic overview of the MIR research field.
Signal-based MIR
As computers became more powerful in the mid to late nineties, desktop computers performed better and better on digital signal processing tasks. This, combined with advances in audio compression techniques, cheaper digital storage, and accessibility to Internet technologies, led to vast amounts, or big data collections, of digital music. The availability of large data sets, in turn, boosted research in MIR but now with musical signals at the center of attention.
Signal-based MIR aims to extract descriptors from musical audio. MIR-techniques are based on low-level feature extraction and on classification into higher-level descriptors. Low-level features contain non-contextual information close to the acoustic domain such as frequency spectrum, duration and energy. Higher-level musical content description focuses on aspects such as timbre, pitch, melody, rhythm and tempo. The highest level is about expressive, perceptual, contextual interpretations that typically focus on factors related to movement, emotion, or corporeal aspects. This concept has been captured in a schema by [110] and copied here in Figure 3. The main research question in MIR is often how to turn a set of low-level features into a set of higher-level concepts.
For example, harmony detection can be divided into low-level instantaneous frequency detection and a perceptual model that transforms frequencies into pitch estimations. Finally, multiple pitch estimations are integrated over time - contextualized - and contrasted with tonal harmony, resulting in a harmony estimation. In the taxonomy of [110], this means going from the acoustical over the sensorial and perceptual to the cognitive or structural level.
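To make this step from low-level features to a higher-level concept more tangible, the toy sketch below (a hypothetical illustration, not code from any of the solutions in this dissertation) folds frame-level frequency estimates into the twelve Western pitch classes and matches the resulting profile against simple major and minor triad templates.

```python
import numpy as np

# Toy illustration: from low-level frequency estimates to a harmony label.
# All names and values are illustrative assumptions.

A4 = 440.0  # reference tuning; assumes 12-tone equal temperament

def freq_to_pitch_class(freq_hz):
    """Map a frequency to one of the 12 Western pitch classes (0 = A)."""
    semitones_from_a4 = 12 * np.log2(freq_hz / A4)
    return int(round(semitones_from_a4)) % 12

def chord_label(frame_frequencies):
    """Accumulate frame-level frequency estimates into a pitch-class
    profile and match it against simple triad templates."""
    profile = np.zeros(12)
    for f in frame_frequencies:
        profile[freq_to_pitch_class(f)] += 1
    templates = {}
    for root in range(12):
        templates[f"major-{root}"] = {root, (root + 4) % 12, (root + 7) % 12}
        templates[f"minor-{root}"] = {root, (root + 3) % 12, (root + 7) % 12}
    scores = {name: profile[list(pcs)].sum() for name, pcs in templates.items()}
    return max(scores, key=scores.get)

# Frequencies close to A, C# and E should yield the A major template (root 0).
print(chord_label([220.1, 277.0, 329.9, 440.2]))
```

Real systems add a perceptual model, temporal smoothing and key context; the sketch only shows the direction of the mapping from the acoustical to the structural level.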
Usually, the audio-mining techniques deliver descriptions for large sets of audio. The automated approach is time-saving and often applied to audio collections that are too large to annotate manually. The descriptors related to such large collections and archives provide a basis for further analysis. In that context, data-mining techniques become useful. The techniques focus on the computational analysis of a large volume of data, using statistical correlation and categorization techniques that look for patterns, tendencies, groupings and changes.
A more extensive overview of the problems that are researched in MIR is given in the books by [89], [135] and [152]. The overview article by [26] gives more insights into the history of MIR. The proceedings of the yearly ISMIR conference offer a detailed view of ongoing research in the field.
MIR and other research fields
Despite the historically close link between symbol-based MIR and computational musicology, the link between signal-based MIR and music cognition has not been as strong. At first sight, signal-based MIR and music cognition are both looking at how humans perceive, model and interact with music. However, MIR is more interested in looking for pragmatic solutions for concrete problems, less in explaining processes, while music cognition research, on the other hand, is more interested in explaining psychological and neurological processes. The gap between the fields has been described by [5].
An example in instrument recognition may clarify the issue. In the eyes of a MIR researcher, instrument recognition is a matter of ‘applying instrument labels correctly to a music data set’. Typically, the MIR researcher extracts one or more low-level features from music with the instruments of interest, trains a classifier and applies it to unlabeled music. Finally, the MIR researcher shows that the approach improves the current state of the art. [115] present such an algorithm, in this case based on MFCCs (Mel-frequency cepstral coefficients).
In contrast, music cognition research tends to approach the same task from a different perspective. The question may boil down to ’How are we able to recognize instruments?’. Neuropsychological experiments carried out by [139] suggest how music instrument recognition is processed. The result is valuable and offers useful insights in processes. However, the result does not mention how the findings could be exploited using a computational model.
MIR and music cognition can be considered as two prongs of the same fork. Unfortunately the handle seems to be missing: only a few exceptions have tried to combine insights into auditory perception with computational modeling and MIR-style feature extraction. One such exception is the IPEM-Toolbox [104]. While this approach was picked up and elaborated in the music cognition field [92], it fell on deaf ears in the MIR community. Still, in recent years, some papers claim that it is possible to improve the standards, the evaluation practices, and the reproducibility of MIR research by incorporating more perception-based computational tools. This approach may generate some impact in MIR [6].
The specific relation between MIR and systematic musicology is examined in a paper by [102]. It is an elaboration on a lecture given at the University of Cologne in 2003, which had the provocative title “Who stole musicology?”. Leman observes that in the early 2000s there was a sudden jump in the number of researchers working on music. While the number of musicology scholars remained relatively small, engineers and neuroscientists massively flocked to music. They offered intricate computational models and fresh views on music perception and performance. Engineers and neuroscientists actively and methodologically contributed to the understanding of music with such advances and in such large numbers that Leman posed that “if music could be better studied by specialized disciplines, then systematic musicology had no longer a value”. However, Leman further argues that there is a value in modern systematic musicology that is difficult to ‘steal’, which is ‘music’. This value (of music) depends on the possibility to develop a trans-disciplinary research methodology, while also paying attention to a strong corporeal aspect that is currently largely ignored by other fields. This aspect includes “the viewpoint that music is related to the interaction between body, mind, and physical environment; in a way that does justice to how humans perceive and act in the world, how they use their senses, their feelings, their emotions, their cognitive apparatus and social interactions”. While this point was elaborated in his book on embodied music cognition [101], it turns out that this holistic approach is still - ten years later - only very rarely encountered in MIR research.
Computational Ethnomusicology
It is known that the signal-based methods developed by the MIR community overwhelmingly target classical Western music or commercial pop [43], whereas the immense diversity in music all over the world is largely ignored.
For example, a chroma feature shows the intensity of each of the 12 Western notes at a point in time in a musical piece. Chroma features consequently imply a tonal organization with an octave divided into 12 equal parts, preferably with A tuned to 440 Hz. Methods that build upon such chroma features perform well on Western music but they typically fail on non-Western music that has another tonal organization. By itself this is no problem; it is simply a limitation of the method (model), and chroma can be adapted for other tonal organizations. However, the limitation is a problem when such a tool is applied to music that does not have a tonal space that the tool can readily measure. In other words, it is necessary to keep in mind that the toolbox of a MIR researcher is full of methods that make assumptions about music, while these assumptions do not hold universally. Therefore, one should be careful about Western music concepts in standardized MIR methods. They cannot be applied blindly to other musics without careful consideration.
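The contrast can be made concrete with a minimal sketch (illustrative only, with an assumed 440 Hz reference): a 12-bin chroma vector forces every frequency into the Western pitch classes, while a fine-grained histogram over cents makes no assumption about the number of scale steps per octave.

```python
import numpy as np

REF = 440.0  # 440 Hz reference; itself a Western convention

def chroma_vector(freqs):
    """12-bin chroma: assumes an octave divided into 12 equal parts."""
    bins = np.zeros(12)
    for f in freqs:
        bins[int(round(12 * np.log2(f / REF))) % 12] += 1
    return bins

def cents_histogram(freqs, resolution=6):
    """Pitch class histogram with `resolution`-cent bins (1200/resolution
    bins per octave); no assumption about the number of scale steps."""
    bins = np.zeros(1200 // resolution)
    for f in freqs:
        cents = (1200 * np.log2(f / REF)) % 1200
        bins[int(cents // resolution) % len(bins)] += 1
    return bins

# A pitch 50 cents 'between' two Western semitones is forced into one of
# the 12 chroma bins, but keeps its own bin in the cents histogram.
odd_pitch = REF * 2 ** (50 / 1200)
print(np.argmax(chroma_vector([odd_pitch])), np.argmax(cents_histogram([odd_pitch])))
```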
Computational ethnomusicology is a research field in which MIR tools are adopted or re-purposed so that they can provide specialized methods and models for all kinds of musics. For a more detailed discussion on this see page eight of [41]. The field aims to provide better access to different musics and to offer a broader view, including video, dance and performance analysis. [194] redefine this field.
Accordingly, re-use of MIR tools demands either a culture-specific approach or a general approach. In the first case, specific structural elements of the music under analysis are encoded into models. For example in Afro-Cuban music, elements specific to the ‘clave’ can be encoded so that a better understanding of timing in that type of music becomes possible [213]. While this solution may limit the applicability to a specific (sub)culture, it also allows deep insights into mid- and high-level concepts of a possible instantiation of music. In that sense, the MIR solutions to harmonic analysis can be seen as a culture-specific approach, yielding meaningful results for Western tonal music only.
One of the goals of the general approach is to understand elements in music that are universal. In that sense, it corresponds quite well with what systematic musicology is aiming at: finding effects of music on humans, independent of culture or historical period. Given that perspective, one could say that rhythm and pitch are fundamental elements of such universals in music. They appear almost always, across all cultures and all periods of time. Solutions that offer insights into frequently reused pitches, for example, may therefore be generally applicable. They have been applied to African, Turkish, Indian and Swedish folk music [173]. Most often, however, such solutions are limited to low-level musical concepts. The choice seems to boil down to: being low-level and universal, or high-level and culture-specific.
Both culture-specific and what could be called culture-invariant approaches to computational ethnomusicology should further understanding and allow innovative research. Computational ethnomusicology is defined by [194] as “the design, development and usage of computer tools that have the potential to assist in ethnomusicological research.”. This limits the research field to a service-for ethnomusicology. I would argue that the method-in should be part of this field as well. A view that is shared, in slightly different terms, by [69]: “computer models can be ‘theories’ or ‘hypotheses’ (not just ‘tools’) about processes and problems studied by traditional ethnomusicologists”. While computational ethnomusicology is broader in scope, the underlying techniques have a lot in common with standard MIR-techniques which are present also in my own research.
MIR-techniques
With the history and context of MIR in mind, it is now possible to highlight the techniques in my own work. It helps to keep the taxonomy by [110], included in Figure 3, in mind.
Tarsos [173] is a typical signal-based MIR-system in that it draws upon low-level pitch estimation features and combines those to form higher-level insights: pitch and pitch class histograms relating to scales and pitch use. Tarsos also offers similarity measures for these higher-level features. They allow one to define how close two musical pieces are in that specific sense. In the case of Tarsos these are encoded as histogram similarity measures (overlap, intersection). Several means to encode, process and compare pitch interval sets are present as well. Tarsos has a plug-in system that allows starting from any system able to extract pitch from audio, but by default the TarsosDSP [174] library is used.
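As an illustration of such a histogram similarity measure, the sketch below computes a normalized histogram intersection between two pitch class histograms and adds a rotation step so that transpositions of the same scale still match. It only illustrates the idea behind the measures mentioned above; it is not Tarsos' actual implementation.

```python
import numpy as np

def normalize(hist):
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Returns a value between 0 (no overlap) and 1 (identical shapes)."""
    return np.minimum(normalize(h1), normalize(h2)).sum()

def best_shift_similarity(h1, h2):
    """Pitch class histograms are circular: compare under all rotations so
    that a transposition of the same scale still scores high."""
    return max(histogram_intersection(h1, np.roll(h2, s)) for s in range(len(h2)))

# Hypothetical 5-tone scale in a 120-bin pitch class histogram, and the
# same scale transposed by one bin.
pentatonic = np.zeros(120)
pentatonic[[0, 24, 48, 72, 96]] = 1
shifted = np.roll(pentatonic, 12)
print(histogram_intersection(pentatonic, shifted))  # low: bins do not line up
print(best_shift_similarity(pentatonic, shifted))   # 1.0 after rotation
```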
TarsosDSP is a low-level feature extractor. As mentioned previously, it has several pitch extraction algorithms but also includes onset extraction and beat tracking. It is also capable of extracting spectral features and much more. For details see [174]. It is a robust foundation to build MIR systems on. Tarsos is one example, but my work in acoustic fingerprinting is also based on TarsosDSP.
The acoustic fingerprinting work mainly draws upon spectral representations of audio (FFT or Constant-Q).
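To give an impression of this family of techniques, the sketch below implements a generic landmark-style fingerprint: prominent spectral peaks are paired into compact hashes that can be stored and looked up later. This is a simplified, hypothetical variant for illustration; it is not the published Panako algorithm, which additionally handles pitch shifts and time stretching.

```python
import numpy as np
from scipy.signal import stft

def spectral_peaks(samples, fs, peaks_per_frame=3):
    """Pick the most prominent frequency bins per STFT frame."""
    f, t, spec = stft(samples, fs=fs, nperseg=1024, noverlap=512)
    mag = np.abs(spec)
    peaks = []
    for frame in range(mag.shape[1]):
        top = np.argsort(mag[:, frame])[-peaks_per_frame:]
        peaks.extend((frame, int(b)) for b in top)
    return peaks

def fingerprints(peaks, fan_out=5, max_dt=32):
    """Combine nearby peak pairs into hashes: (bin1, bin2, time delta)."""
    hashes = []
    for i, (t1, b1) in enumerate(peaks):
        for t2, b2 in peaks[i + 1: i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.append(((b1, b2, dt), t1))  # hash plus anchor time
    return hashes

# Usage sketch: store hashes of reference audio in a database; hashes from a
# query fragment that agree on a constant time offset identify the recording.
fs = 8000
audio = np.random.default_rng(0).normal(size=fs * 2)  # stand-in for real audio
print(len(fingerprints(spectral_peaks(audio, fs))))
```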
Techniques for empirical research
While MIR-techniques form an important component in my research, here they are contrasted with various techniques for empirical research. The solutions that appear at the other extreme of the horizontal axis of the plane do not use MIR-techniques. However, there are some solutions for empirical research that do use MIR-techniques to some degree. These solutions are designed to support specific experiments with particular experimental designs. While the experimental designs can differ a lot, they do share several components that appear regularly. Each of these can present an experimenter with a technical challenge. I can identify five:
Activation. The first component is to present a subject with a stimulus. Examples of stimuli are sounds that need to be presented with millisecond-accurate timing, tactile feedback with vibration motors, or music modified in a certain way.
Measurement. The second component is the measurement of the phenomenon of interest. It is essential to use sensors that capture the phenomenon precisely and that do not interfere with the subject's task. Examples of measurement devices are wearables to capture a musician's movement, microphones to capture sound, witness video cameras and motion capture systems. Note that a combination of measurement devices is often needed.
Transmission. The third component aims to expose measurements in a usable form. This can involve translation via calibration to a workable unit (Euler angles, acceleration expressed in g-forces, strain-gauge measurements in kg). For example, it may be needed to go from raw sensor readings on a micro-controller to a computer system or from multiple measurement nodes to a central system.
Accumulation. The fourth component deals with aggregating, synchronizing and storing measured data. For example, it might be needed to capture measurements and events in consistently named text files with time stamps or a shared clock.
Analysis. The final component is the analysis of the measured data with the aim to support or disprove a hypothesis with a certain degree of confidence. Often standard (statistical) software suffices, but it might be needed to build custom solutions for the analysis step as well.
These components need to be combined to reflect the experimental design and to form reliable conclusions. Note that each of the components can either be trivial and straightforward or pose a significant challenge. A typical experiment combines available off-the-shelf hardware and software with custom solutions, which allows successful use in an experimental setting. In innovative experiments it is rare to find designs completely devoid of technical challenges.
For example, take the solution devised for [202]. The study shines a light on the effect of music on heart rate at rest. It uses musical fragments that are time-stretched to match the subjects’ heart rate as stimulus (activation). A heart-rate sensor is the main measurement device (measurement). A micro-controller, in this case an Arduino, takes the sensor values and sends them (transmission) to a computer. These components are controlled by a small program on the computer that initiates each component at the expected time, resulting in a research data set (accumulation) that can be analyzed (analysis). In this case the data are ‘changes in heart rate’ and ‘heart-rate variability’ when subjects are at rest or listen to specific music. The technical challenges and contributions were mainly found in high-quality time-stretching of the stimuli (activation) and reliably measuring heart rate at the fingertips (measurement). The other components could be handled with off-the-shelf components. Note that measuring heart rate with chest straps might have been more reliable but this would also have been more invasive, especially for female participants. Aspects of user-friendliness are almost always a concern and even more so if measurement devices interfere with the task: in this setup participants needed to feel comfortable in order to relax.
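As an illustration of the transmission and accumulation components in such a set-up, the hypothetical sketch below reads values sent by a micro-controller over a serial connection and stores them with timestamps. The port name, baud rate and line format are assumptions for the example and are not taken from the actual study.

```python
import csv
import time

import serial  # pyserial

# Hypothetical port and settings; adapt to the actual micro-controller.
PORT, BAUD = "/dev/ttyACM0", 115200

def record(duration_s=60.0, outfile="heart_rate.csv"):
    """Read one sensor value per line from the serial port and store it
    together with a timestamp relative to the start of the recording."""
    with serial.Serial(PORT, BAUD, timeout=1) as link, \
         open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_s", "sensor_value"])
        start = time.monotonic()
        while time.monotonic() - start < duration_s:
            line = link.readline().decode("ascii", errors="ignore").strip()
            if line.isdigit():  # assume the micro-controller prints one integer per line
                writer.writerow([round(time.monotonic() - start, 4), int(line)])

if __name__ == "__main__":
    record()
```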
Another example is the solution called the LI-Jogger (short for Low Impact Jogger, see [198]). A schema of the set-up can be found in Figure 4. This research aims to build a system that lowers footfall impact for at-risk amateur runners via biofeedback with music. Footfall impact is a known risk factor for a common running injury and lowering this impact in turn lowers this risk. Measurement involves measuring peak tibial (shin) acceleration in three dimensions with a high time resolution via a wearable sensor on the runner. Running speed needs to be controlled for. This is done via a sonar (as described by [114]). Accumulation of the data is a technical challenge as well since footfall measured by the accelerometer needs to be synchronized precisely with other sensors embedded in the sports science lab. To allow this, an additional infra-red (IR) sensor was used to capture the clock signal of the motion capture system, which is also followed by several other devices (force plates). In this case an extra stream of measurements was required purely to allow synchronization. In this project each component is challenging:
Activation. The biofeedback needs measurements in real time and modifies music to reflect these measurements in a certain way. This biofeedback is done on a battery-powered wearable device that should hinder the runner as little as possible.
Measurement. The measurement requires a custom system with high-quality 3D accelerometers. The accelerometers need to be light and unobtrusive. To allow synchronization and speed control, an additional IR sensor and a sonar were needed.
Transmission. To expose acceleration and other sensor data, custom software is required both on the transmitting micro-controller and on the receiving system.
Accumulation. Accumulation requires scripts that use the IR sensor stream to synchronize acceleration data with the motion capture system and other measurement devices (force plate, foot-roll measurement).
Analysis. Analysis of the multi-modal data is also non-trivial. The article that validates the measurement system [198] also required custom software to compare the gold-standard force-plate data with the acceleration data; a minimal sketch of such an analysis step is given after this list.
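The sketch referred to in the analysis item above could look as follows: peak tibial acceleration per footfall is estimated from a 1000 Hz accelerometer stream with a simple peak picker. The thresholds are placeholder values for illustration, not the parameters of the validated system described in [198].

```python
import numpy as np
from scipy.signal import find_peaks

FS = 1000  # samples per second, matching the 1000 Hz data rate mentioned above

def peak_tibial_acceleration(ax, ay, az, min_peak_g=3.0, min_interval_s=0.25):
    """Return the acceleration magnitude (in g) at each detected footfall.
    Threshold and minimal interval between footfalls are placeholder values."""
    magnitude = np.sqrt(np.asarray(ax) ** 2 + np.asarray(ay) ** 2 + np.asarray(az) ** 2)
    peaks, _ = find_peaks(magnitude,
                          height=min_peak_g,
                          distance=int(min_interval_s * FS))
    return magnitude[peaks]

# Synthetic example: two 'impacts' on top of low-level noise.
t = np.arange(0, 2, 1 / FS)
vertical = 0.2 * np.random.default_rng(1).normal(size=t.size) + 1.0
vertical[500] += 6.0
vertical[1500] += 5.0
print(peak_tibial_acceleration(vertical, np.zeros_like(vertical), np.zeros_like(vertical)))
```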
The techniques used in these tools for empirical research are often relatively mundane. However, they do span a broad array of hardware and software technologies that need to be combined efficiently to offer a workable solution. Often there is considerable creativity involved in engineering a cost-effective, practical solution in a limited amount of time. As can be seen from the examples it helps to be able to incorporate knowledge on micro-controllers, sensors, analog/digital conversion, transmission protocols, wireless technologies, data-analysis techniques, off-the-shelf components, scripting techniques, algorithms and data structures.
Now that the two axes of the humanities-engineering plane have been sufficiently clarified it is time to situate my solutions in this plane.
Given the above explanation of the axes, it is now possible to assign a location in this plane for each of the solutions that form the core of my doctoral research (Figure 1):
This solution introduces and validates a method to gauge the metric complexity of a musical piece, depending on the level of agreement between automatic beat estimation algorithms. The validation is done by comparing expert human annotations with annotations by a committee of beat estimation algorithms. The solution is based on MIR-techniques and it is used as a method to get insights into rhythmic ambiguity in sets of music. It is, therefore, placed in the upper left of the engineering-humanities plane. It is described by [46] and not included in this dissertation.
This solution introduces a method to automatically extract pitch and pitch class histograms from any pitched musical recording. It also offers many tools to process the pitch data. It has a graphical user interface but also a batch processing option. It can be used on a single recording or on large databases. It covers quite a big area in the plane: depending on the research it can be used as a method for large-scale analysis of scales or as a service to get insights into pitch use of a single musical piece. Tarsos is situated to the side of MIR-techniques since it depends on techniques such as feature extraction. It is described in [173], which is included in chapter [173].
The main contribution of this work is a reflection on reproducible research methodologies for computational research in general and MIR research in particular. As an illustration a seminal MIR-article is replicated and a reproducible evaluation method is presented. The focus is on methods of computational research and it utilizes MIR-techniques so it is placed at the methods/MIR-techniques side. It is described in [167], which is included in chapter [167].
This solution presents a method to compare meta-data, to reuse segmentation boundaries, to improve listening experiences and to merge digital audio. The method is applied to the data set of the RMCA archive, which offers new insights into the meta-data quality. The underlying technique is acoustic fingerprinting, a classical MIR-technique. The case study uses a service provided by [175]. The article [168] is included in chapter [168].
This digital signal processing (DSP) library is the foundation of Tarsos. TarsosDSP offers many low-level feature extraction algorithms in a package aimed at MIR researchers, students and developers. It is a piece of work in its own right and is situated on the MIR-techniques/service side. The service is employed in Tarsos. It has been used for speech rehabilitation [24], serious gaming contexts [157] and human machine interaction [155]. The article [174] is included in chapter [174].
This solution is an acoustic fingerprinting algorithm that allows efficient lookup of small audio excerpts in large reference databases. Panako works even if the audio underwent changes in pitch. It is placed firmly on the MIR and service side. There are many ways in which Panako can be used to manage large music archives. These different ways are discussed in [20]. See [175], which is included in chapter [175].
This solution offers a technology for augmented listening experiences. It effectively provides the tools for a computer-mediated reality. As with typical augmented reality technology, it takes the context of a user and modifies – augments or diminishes – it with additional layers of information. In this work the music playing in the environment of the user is identified with precise timing. This allows listening experiences to be enriched. The solution employs MIR-techniques to improve engagement of a listener with the music in the environment. There are, however, also applications in experimental designs for empirical research. So it is a service situated in between MIR-techniques and techniques for empirical research. It is described by [177], which is included in chapter [177].
This solution presents a general system to synchronize heterogeneous experimental data streams. By adding an audio stream to each sensor stream, the problem of synchronization is reduced to audio-to-audio alignment. It employs MIR-techniques to solve a problem often faced in empirical research. The service is placed more to the side of techniques for empirical research. [195] extended the off-line algorithm with a real-time version in his master’s thesis. The service is described in [176], which is included in chapter [176].
This empirical study compares tapping behaviour when subjects are presented with tactile, auditory and combined tactile and auditory cues. To allow this, a system is required that can register tapping behaviour and present subjects with stimuli with very precise timing. The main aim of the study is to present a method and report results on the tapping behaviour of subjects. So while the measurement/stimulus system could be seen as a service, the main contribution lies in the method and results. These are described in [166], which is not included in this dissertation.
This solution is a software/hardware system that measures foot-fall impact and provides immediate auditory feedback with music. It has been used in the previous section as an example (see ?) where every aspect of an experimental design poses a significant challenge. The innovative aspects are that impact is measured at a high data rate (1000Hz) in three dimensions by a wearable system that is synchronized precisely with other measurement modalities. LI-Jogger supports empirical research and is placed to the right of the plane. It aims to provide a view into how music modifies overground (vs treadmill) running movement, but the system itself is a service needed to achieve that goal. This solution is described in [198]. A large intervention study that will apply this solution is planned. The results of this study will be described in follow-up articles.
This solution includes an analysis and comparison of harmonics in a singing voice while inhaling and exhaling. MIR-techniques are used to gain insights into this performance practice. It is, therefore, situated at the methods/MIR-techniques side of the plane. My contribution was primarily in the data-analysis part. The findings have been described in [203], which is not included in this dissertation.
This solution is a heart-rate measurement system with supporting software to initiate and record conditions and present stimuli to participants. The stimulus is music with modified tempo to match a subject’s heart-rate. It has been used in the previous section as an example (see ?). The system made the limited influence of music on heart-rate clear in a systematic way. It is situated in the upper right quadrant. See [202].
This solution concerns a hardware and software system to measure engagement with live performances versus prerecorded performances for patients with dementia. My contribution lies in the software that combines various movement sensors and web-cameras that register engagement and in presenting the stimuli. The sensor streams and videos are synchronized automatically. Figure 5 shows synchronized video, audio and movement imported in the ELAN software [212]. It supports empirical research and is placed in the services category. See [51] for a description of the system and the data analysis. The article is not included in this dissertation.
Several of the listed solutions and the projects that use them show similarities to projects from the digital humanities. It is of interest to dive further into this concept and clarify this overlap.
As a final step in my attempt to situate my work in engineering and humanities I want to briefly come back to the digital humanities research field. Digital humanities is an umbrella term for research practices that combine humanities with digital technologies, which shows much overlap with my activities. It may therefore be useful for conceptually situating my activities in this dissertation.
There are a number of articles and even books that try to define the digital humanities [13]. There is even disagreement about whether it is a research field, “new modes of scholarship” [25] or a “set of related methods” [161]. [211] have given up: “Since the field is constantly growing and changing, specific definitions can quickly become outdated or unnecessarily limit future potential.” However, a definition is presented as broadly accepted by [86] as:
This definition is sufficiently broad that it would also work as a definition for systematic musicology, especially in relation to this dissertation. The works bundled here are at the intersection of computing and musicology; they involve invention and also make limited contributions to the knowledge of computing. Another relevant observation with respect to this dissertation is the following: “[Digital Humanities is inspired by]...the conviction that computational tools have the potential to transform the content, scope, methodologies, and audience of humanistic inquiry.” [25]. Again, this is similar to what is attempted in this thesis. Moreover, digital humanities projects generally have the following common traits, which they share with this dissertation project:
The projects are collaborative. Often engineers and humanities scholars collaborate with a shared aim.
The nature of the methods requires a transdisciplinary approach. Often computer science is combined with deep insights into humanities subjects.
The main research object is available in the digital domain. This immediately imposes a very stringent limitation. The research object and relations between research objects need a practical digital representation to allow successful analysis.
The term has its history in the literary sciences and grew out of ‘humanities computing’ or ‘computing in the humanities’ [13], where it was originally seen as “a technical support to the work of the ‘real’ humanities scholars” [13]. The term ‘digital humanities’ was coined to mark a paradigm shift from merely technical support of humanities research to a “genuinely intellectual endeavor with its own professional practices, rigorous standards, and exciting theoretical explorations” [74]. Today most digital humanities scholars are firmly embedded in either library sciences, history or literary science. Due to this history there are only a few explicit links with musicology, although the methods described in the digital humanities are sufficiently broad and inclusive.
One of the problems tackled by the digital humanities is the abundance of available digital data. The problem presents itself for example for historians: “It is now quite clear that historians will have to grapple with abundance, not scarcity. Several million books have been digitized by Google and the Open Content Alliance in the last two years, with millions more on the way shortly; nearly every day we are confronted with a new digital historical resource of almost unimaginable size.” [36].
As a way to deal with this abundance, [134] makes the distinction between ‘close reading’ and ‘distant reading’ of a literary text. With ‘close’ meaning attentive critical reading by a scholar, while ‘distant reading’ focuses on ‘fewer elements, interconnections, shapes, relations, structures, forms and models’. Moretti argues there is an urgent need for distant reading due to the sheer size of the literary field and the problem that only relatively few works are subjected to ‘close reading’: “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases [close reading], because it’s a collective system, that should be grasped as such, as a whole.” [134]
One example of this ‘distant reading’ approach is given by [153]. In this article each word is labeled with an emotional value: positive, neutral or negative. Words like ‘murder’ or ‘death’ are deemed negative, ‘friends’ or ‘happy’ are positive, while the neutral category contains words like ‘door’, ‘apple’ or ‘lever’. With this simple model a whole corpus is examined and six prototypical story arcs are detected. Examples include the Tragedy (a downward fall) and Cinderella (rise - fall - rise). The approach is a good example of how computational tools can quickly summarize features and how insights can be gathered for a whole system, something for which the ‘close reading’ approach would not work.
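To make the ‘distant reading’ idea concrete, the toy sketch below labels words as positive, negative or neutral and keeps a running sum over a text, which roughly traces an emotional arc. It only uses the example words mentioned above and is an illustration of the principle, not the method of [153].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy 'distant reading': label each word as positive (+1), negative (-1) or
// neutral (0) and keep a running sum; plotting the sum sketches the
// emotional arc of a story. The word lists only contain the examples
// mentioned in the text.
public class EmotionalArc {
    static final Set<String> POSITIVE = Set.of("friends", "happy");
    static final Set<String> NEGATIVE = Set.of("murder", "death");

    public static List<Integer> arc(List<String> words) {
        List<Integer> runningSum = new ArrayList<>();
        int sum = 0;
        for (String word : words) {
            String w = word.toLowerCase();
            if (POSITIVE.contains(w)) sum += 1;
            else if (NEGATIVE.contains(w)) sum -= 1; // neutral words leave the sum unchanged
            runningSum.add(sum);
        }
        return runningSum;
    }
}
```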
To some extent, the field of musicology followed a path similar to that of literary science. Large amounts of scores have been encoded into digital formats. The number of digitized - or digitally recorded - music recordings easily runs into the millions. Using terminology similar to [134], a distinction can be made between ‘close listening’ and ‘distant listening’.
To allow this broad, fair view on whole musical collections, this ‘distant listening’ needs to be as intelligent as possible. The construction of models of music, the selection of features and the corresponding similarity measures need to be done diligently. This is one of the main challenges in the MIR research field.
However, this focus on databases covers only one aspect of musicologists’ interests. As mentioned, there is a vast domain of research in musicology that focuses on music interaction. In fact, the work at IPEM is known world-wide for establishing a science of interaction (e.g. see [111]). Digital humanities is usually associated with fixed items like archives, databases, sets of literary works, historical GIS (Geographical Information System) data which, in my view, is too limited. Therefore, I introduce the term ‘augmented humanities’ later on (see Section 4.2) in order to cover that aspect of my work in which engineering solutions work as an element in a context of human-music interaction, rather than archives and MIR. However, as my dissertation shows, it turns out that MIR techniques are very useful for solving problems in human-music interaction.
First, more details are given on the reproducible methodology that is followed in the works presented in this dissertation.
In the works bundled in this dissertation, special efforts have been made to reach a reproducible, verifiable methodology. Reproducibility is one of the cornerstones of scientific methodology. A claim made in a scientific publication should be verifiable and the described method should provide enough detail to allow replication. If the research is carried out on a specific set of data, this data should be available as well. If not to the general public, then at least to peers or - even more limiting - to reviewers of the work. If those basics are upheld, then the work becomes verifiable, reproducible and comparable. It also facilitates improvements by other researchers.
In a journal article [167] on this topic I have bundled the main concerns with, and suggestions for, methodological computational research on music. The journal article details the problems with reproducibility in computational research and illustrates this by replicating, in full, a seminal acoustic fingerprinting paper. The main points of that article are repeated here with additional links to the works presented in this dissertation. The aim of this chapter is to strike a balance between providing common methodological background while avoiding too much repeated text.
The ideal, where methods are described in sufficient detail and data is available, is often not reached. From a technical standpoint, sharing tools and data has never been easier. Reproducibility, however, remains a problem, especially for Music Information Retrieval research and, more generally, research involving moderately complex software systems. Below, a number of general problems are identified and subsequently it is detailed how these are handled in the presented works.
Journal articles and especially conference papers have limited space for detailed descriptions of methods or algorithms. For moderately complex systems there are numerous parameters, edge cases and details which are glossed over in textual descriptions. This makes articles readable and the basic method intelligible, but those details need to be expounded somewhere, otherwise too much is left to assumptions and a perhaps missing shared understanding. The ideal place for such details is well documented, runnable code. Unfortunately, the intellectual property policies of universities or research institutions often limit researchers’ freedom to distribute code.
In the works presented in this dissertation attention has been given to sharing research code. While Ghent University is increasingly striving for open-access publication and is even working on a policy and support for sharing research data, there is to date little attention for research software. Arguably, it makes little sense to publish only part of the research in the open - the textual description - while keeping code and data behind closed doors, especially if the research is funded by public money. A clear stance on the intellectual property rights of research code would help researchers.
One way to counter this problem is to use open source licenses. These licenses “allow software to be freely used, shared and modified”.
On GitHub, a code repository hosting service, TarsosDSP has 14 contributors and has been forked more than 200 times. This means that 14 people contributed to the main code of TarsosDSP and that there are about 200 different flavors of TarsosDSP. These flavors were at some point split - forked - from the main code and are developed by people with a slightly different focus. On GitHub more than 100 bug reports have been submitted. This example highlights another benefit of opening up source code: the code is tested by others, bugs are reported and some users even contribute fixes and improvements.
The fact that the GPL was used also made it possible for me to further improve the Tarsos software myself after my transition from the School of Arts, Ghent to Ghent University, a process that could have been difficult if the code had not been required to be released under the GPL. Panako and SyncSink [175] also build upon GPL-licensed software and are therefore released under a similar license.
To make computational research reproducible both the source code and the data that was used need to be available. Copyrights on music make it hard to share music freely. Redistribution of historic field-recordings in museum archives is even more problematic. Due to the nature of the recordings, copyright status is often unclear. Clearing the status of tracks involves international, historical copyright laws and multiple stakeholders such as performers, soloists, the museum, the person who performed the field recording and potentially a publisher that already published parts on an LP. The rights of each stakeholder need to be carefully considered while at the same time they can be hard to identify due to a lack of precise meta-data and the passage of time. I see two ways to deal with this:
Pragmatic versus ecological, or Jamendo versus iTunes. There is a great deal of freely available music published under various Creative Commons licenses. Jamendo, for example, contains half a million Creative Commons-licensed tracks.
Audio versus features. Research on features extracted from audio does not need the audio itself; if the features are available, this can suffice. There are two large sets of audio features: the Million Song Dataset by [14] and AcousticBrainz.
In my own work an acoustic fingerprinting evaluation methodology was developed using music from Jamendo for Panako [175]. The exact same methodology was copied by [182] and elaborated on by myself [167] with a focus on methodological aspects. For acoustic fingerprinting the Jamendo dataset fits, since it offers a large variability in genres and is representative for those use cases.
For the research in collaboration with the Museum for Central Africa [174], the issues of unclear copyright and of ethical questions about sharing field recordings are pertinent. Indeed, the copyright status of most recordings is unclear. Ethical questions can be raised about the conditions in which the recordings were made and to what extent the recorded musicians gave informed consent for further use. The way this research was done follows the second guideline. The research is done on extracted features and only partially on audio. These features are not burdened by copyright issues and can be shared freely. More specifically, a large set of features was extracted by placing a computer at the location of the museum and processing each field recording. This method enables the research to be replicated – starting from the features – and verified.
Output by researchers is still mainly judged by the number of articles they publish in scientific journals or conferences. Other types of output are not valued as much. The incentive to put a lot of work into documenting, maintaining and publishing reproducible research or supplementary material is lacking. This focus on publishing preferably novel findings in journal articles probably affects the research conducted. It drives individual researchers - consciously or unconsciously - to further their careers by publishing underpowered small studies instead of furthering the knowledge in their fields of research [77].
A way forward is to provide an incentive for researchers to make their research reproducible. This requires a mentality shift. Policies of journals, conference organizers and research institutions should gradually change to require reproducibility. There are a few initiatives to foster reproducible research, specifically for music informatics research. The 53rd Audio Engineering Society (AES) conference had a prize for reproducibility. [174] was submitted to that conference and subsequently acknowledged as reproducibility-enabling. ISMIR 2012 had a tutorial on “Reusable software and reproducibility in music informatics research”, but structural attention for this issue at ISMIR seems to be lacking. Queen Mary University of London (QMUL) is one of the few places with continuous attention to the issue, and researchers there are trained in software craftsmanship. They also host a repository for software dealing with sound at http://soundsoftware.ac.uk and offer a yearly workshop on “Software and Data for Audio and Music Research”:
The third SoundSoftware.ac.uk one-day workshop on “Software and Data for Audio and Music Research” will include talks on issues such as robust software development for audio and music research, reproducible research in general, management of research data, and open access.
Another incentive to spend time documenting and publishing research software has already been mentioned above: code is reviewed by others, bugs are submitted and some users even take the time to contribute to the code by fixing bugs or extending functionality. A more indirect incentive is that it forces a different approach to writing code. Quick and dirty hacks are far less appealing if one knows beforehand that the code will be out in the open and will be reviewed by peers. Publishing code benefits reusability, modularity, clarity, longevity and software quality in general. It also forces one to think about installability, buildability and other aspects that make software sustainable [48].
In my work the main incentive to publish code is to make the tools and techniques available and to attempt to put them in the hands of end-users. While the aim is not to develop end-user software, establishing a feedback loop with users can be inspiring and even drive further research. In an article I co-authored with a number of colleagues, a possible approach is presented to fill the gap between research software and end-users [50]. One of the hurdles is to make users - in this case managers of archives of ethnic music - aware of the available tools and techniques. Two other articles I (co-)authored [20] do exactly this: they detail how archives can benefit from a mature MIR technology, in this case acoustic fingerprinting.
The distinction between end-user ready software, a (commercial) product, and useful contributions to a field in the form of research software may not be entirely clear. One of the outputs of computational research is often research software. This software should focus on novel methods encoded in software and not on creating end-user software. It is therefore of interest to stress the differences. Below, an attempt is made to draw this distinction by focusing on several aspects of software.
Transparency. The processes encoded in end-user software are not necessarily transparent. End-user software can be used effectively as a black box. The outcome - what you can achieve with the software - is more important than how it is done. Research software should focus exactly on making the process transparent while getting tasks done.
Complexity. The complexity of research software should not be hidden: it should be clear which parameters there are and how parameters change results. Researchers can be expected to put effort into getting to know the details of software they use. This is again different for end-user software, where ease-of-use and intuitiveness matter.
Openness. Researchers should be able to improve, adapt and experiment with the software. It stands to reason that research software should be open and allow, even encourage, improvement. Source control and project management websites such as GitHub and SoundSoftware.ac.uk facilitate these kinds of interactions. For end-user software this may not be a requirement.
Note that these characteristics of research software do not necessarily prevent such software from being applied as-is in practice. The research software Panako [175], which serves as an experimental platform for acoustic fingerprinting algorithms, is being used by Musimap and the International Federation of the Phonographic Industry (IFPI). Tarsos [173] has an attractive, easy-to-use graphical interface and is being used in workshops and during lectures by students. SyncSink [176] exposes many parameters to tweak but can be and is effectively used by researchers to synchronize research data. So some research software can be used as-is.
Conversely, there is also transparent and open end-user software available that does not hide its complexity such as the relational database system PostgreSQL. This means that the characteristics (transparency, complexity, openness) are not exclusive to research software but they are, in my view, requirements for good research software. The focus should be on demonstration of a process while getting (academic) tasks done and not on simply getting tasks done. This can be found, for example, in the ‘mission statement’ of TarsosDSP:
TarsosDSP is a Java library for audio processing. Its aim is to provide an easy-to-use interface to practical music processing algorithms implemented as simply as possible ... The library tries to hit the sweet spot between being capable enough to get real tasks done but compact and simple enough to serve as a demonstration on how DSP algorithms work.
This distinction between ‘real’ and ‘academic’ tasks is perhaps of interest. In ‘real’ tasks, practical considerations with regards to computational load, scalability and context need to be taken into account. This is much less the case for ‘academic’ tasks: there the process is the most important contribution, whereas performance and scalability may be an afterthought. For example, take research software that encodes a multi-pitch estimation algorithm that scores much better than the current state of the art. If this algorithm has an enormous computational load and takes many hours to process a couple of seconds of music, it is still a valid contribution: it shows how accurate multi-pitch estimation can be. It is, however, completely impractical to use if thousands of songs need to be processed. The system serves an academic purpose but cannot be used for ‘real’ tasks. A fingerprinting system that has desirable features but can only handle a couple of hundred reference items is another example. The previously mentioned publication by [50], which I co-authored, deals with this topic and gives several examples of research software capable enough to be used effectively by archivists to manage digital music archives.
To summarize: in my work research software packages form a considerable type of output. These systems serve an academic purpose in the first place but if they can be used for ‘real’ tasks then this is seen as an added benefit. I have identified common problems with reproducibility in computational research and have strived to make my own contributions reproducible by publishing source code and evaluating systems with publicly available data sets as much as possible. This makes my contributions verifiable and transparent but also allows others to use, criticize and improve these systems.
To contextualize my research I have given a brief overview of the interdisciplinary nature of the systematic musicology research field. The main point was that advanced research practices almost always involve challenging technological problems not easily handled by researchers with a background in humanities. The problems concern dealing with various types of digital data, computational models and complex experimental designs. I have set myself the task to engineer solutions that are relevant for systematic musicology. I have created such solutions and these are placed in a plane. One axis of that plane goes from methods to services. The other axis contrasts MIR-technologies with technologies for empirical research. These concepts are delineated and contextualized. A total of thirteen solutions explore this plane and are briefly discussed. The solutions have many attributes also found in digital humanities research projects. This link with the digital humanities was made explicit as well.
The solutions presented in my work strive to follow a reproducible methodology. Problems with reproducible computational research were identified and the importance of reproducibility was stressed. It was explained how my own research strives for this ideal: the source code that implements the solutions is open sourced and the solutions are evaluated with publicly available data as much as possible. Finally, a distinction was clarified between research prototypes and end-user ready software.
The following two chapters contain several publications that have been published elsewhere. They are self-contained works which means that some repetition might be present. The next chapter deals with publications that describe methods. The chapter that bundles publications describing services follows. For each chapter an additional introduction is included.
This chapter bundles three articles that are placed in the methods category of the humanities-engineering plane depicted in Figure 1. The focus of the papers is to present and apply methods which can yield new insights:
The first paper [173] describes Tarsos. It details a method for large-scale extraction, comparison and analysis of pitch class histograms. The method is encoded in a software system called Tarsos. Tarsos features a graphical user interface to analyze a single recording quickly and an API to allow analysis of many recordings. The final parts of the paper give an example of extracting and matching scales for hundreds of recordings of the Makam tradition, effectively contrasting theoretical models with (historical) performance practice. This serves as an illustration of how models can be contrasted with practice on a large scale with Tarsos.
The second paper [168] presents a method to find duplicates in large music archives. It shows how duplicate detection technology can be employed to estimate the quality of meta-data and to contrast meta-data of an original with a duplicate. The method is applied to the data set of the RMCA as a case study.
Reproducibility is the main topic of the third work [167]. It details the problems with reproducibility in computational research and MIR research. These problems are illustrated by replicating a seminal acoustic fingerprinting paper. While the results of the replication come close to the originally published results and the main findings are solidified, there is a problematic unexplained discrepancy, an unknown unknown. The main contribution of the paper lies in the method, which shows how new insights in MIR can be reported in a sustainable manner.
The article by [46] describes a method to automatically estimate the rhythmic complexity of a piece of music by using a set of beat tracking algorithms. The method is validated by comparing the results of the set of algorithms with human expert annotations. I co-authored the article and it could have been bundled here as well but I chose to limit bundled works to articles for which I serve as the main author.
In the past decade, several computational tools have become available for extracting pitch from audio recordings [35]. Pitch extraction tools are prominently used in a wide range of studies that deal with analysis, perception and retrieval of music. However, until recently, less attention has been paid to tools that deal with distributions of pitch in music.
The present paper presents a tool, called Tarsos, that integrates existing pitch extraction tools in a platform that allows the analysis of pitch distributions. Such pitch distributions contain a lot of information, and can be linked to tunings, scales, and other properties of musical performance. The tuning is typically reflected in the distance between pitch classes. Properties of musical performance may relate to pitch drift within a single piece, or to the influence of enculturation (as is the case in African music culture, see [131]). A major feature of Tarsos is concerned with processing audio-extracted pitches into pitch and pitch class distributions from which further properties can be derived.
Tarsos provides a modular platform used for pitch analysis - based on pitch extraction from audio and pitch distribution analysis - with a flexibility that includes:
The possibility to focus on a part of a song by selecting graphically displayed pitch estimations in the melograph.
A zoom function that allows focusing on global or detailed properties of the pitch distribution.
Real-time auditory feedback. A tuned midi synthesizer can be used to hear pitch intervals.
Several filtering options to get clearer pitch distributions or a more discretized melograph, which helps during transcription.
In addition, a change in one of the user interface elements is immediately propagated through the whole processing chain, so that pitch analysis becomes easy, adjustable and verifiable.
This paper is structured as follows. First, we present a general overview of the different processing stages of Tarsos, beginning with the low level audio signal stage and ending with pitch distributions and their musicological meaning. In the next part, we focus on some case studies and give a scripting example. The next part elaborates on the musical aspects of Tarsos and refers to future work. The fifth and final part of the main text contains a conclusion.
The Tarsos platform
Figure 6 shows the general flow of information within Tarsos. It starts with an audio file as input. The selection of a pitch estimation algorithm leads to pitch estimations, which can be represented in different ways. This representation can be further optimized, using different types of filters for peak selection. Finally, it is possible to produce an audio output of the obtained results. Based on that output, the analysis-representation-optimization cycle can be refined. All steps contain data that can be exported in different formats. The obtained pitch distribution and scale itself can be saved as a Scala file, which in turn can be used as input, overlaying the estimation of another audio file for comparison.
In what follows, we go deeper into the several processing aspects, dependencies, and particularities. In this section we first discuss how to extract pitch estimations from audio. We illustrate how these pitch estimations are visualized within Tarsos. The graphical user interface is discussed. The real-time and output capabilities are described, and this section ends with an explanation about scripting for the Tarsos api. As a reminder: there is a manual available for Tarsos at http://0110.be/tag/JNMR.
Extracting pitch estimations from audio
Prior to the step of pitch estimation, one should take into consideration that in certain cases audio preprocessing can improve the subsequent analysis within Tarsos. Depending on the source material and on the research question, preprocessing steps could include noise reduction, band-pass filtering, or harmonic/percussive separation [136]. Audio preprocessing should be done outside of the Tarsos tool. The optionally preprocessed audio is then fed into Tarsos and converted to a standardized format.
The next step is to generate pitch estimations. Each selected block of the audio file is examined and pitches are extracted from it. In Figure 7, this step is located between the input and the signal block phases. Tarsos can be used with external and internal pitch estimators. Currently, there is support for the polyphonic MAMI pitch estimator [35] and any VAMP plug-in [30] that generates pitch estimations. The external pitch estimators are platform dependent and some configuration needs to be done to get them working. For practical purposes, platform independent implementations of two pitch detection algorithms are included, namely YIN [49] and MPM [127]. They are available without any configuration. Thanks to a modular design, internal and external pitch detectors can easily be added. Once correctly configured, the use of these pitch modules is completely transparent, as extracted pitch estimations are transformed to a unified format, cached, and then used for further analysis at the symbolic level.
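As an illustration of what such an internal pitch estimator looks like in practice, the sketch below extracts YIN pitch estimations with TarsosDSP [174], the library that underlies Tarsos. The package and class names are those of recent TarsosDSP releases and may differ in other versions, so this is an indicative sketch rather than the exact code used inside Tarsos.

```java
import java.io.File;

import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.io.jvm.AudioDispatcherFactory;
import be.tarsos.dsp.pitch.PitchDetectionHandler;
import be.tarsos.dsp.pitch.PitchDetectionResult;
import be.tarsos.dsp.pitch.PitchProcessor;
import be.tarsos.dsp.pitch.PitchProcessor.PitchEstimationAlgorithm;

public class PitchEstimationExample {
    public static void main(String... args) throws Exception {
        // Analyse blocks of 2048 samples with 50% overlap.
        int bufferSize = 2048, overlap = 1024;
        float sampleRate = 44100;
        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromFile(new File(args[0]), bufferSize, overlap);
        PitchDetectionHandler handler = new PitchDetectionHandler() {
            @Override
            public void handlePitch(PitchDetectionResult result, AudioEvent e) {
                if (result.isPitched()) {
                    // print time stamp (seconds) and pitch estimation (Hz)
                    System.out.printf("%.3f;%.2f%n", e.getTimeStamp(), result.getPitch());
                }
            }
        };
        dispatcher.addAudioProcessor(
                new PitchProcessor(PitchEstimationAlgorithm.YIN, sampleRate, bufferSize, handler));
        dispatcher.run();
    }
}
```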
Visualizations of pitch estimations
Once the pitch detection has been performed, pitch estimations are available for further study. Several types of visualizations can be created, which lead, step by step, from pitch estimations to pitch distribution and scale representation. In all these graphs the cent unit is used. The cent divides each octave into 1200 equal parts. In order to use the cent unit for determining absolute pitch, a reference frequency of 8.176 Hz has been defined.
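Concretely, a frequency in Hz maps to absolute cents relative to that 8.176 Hz reference as in the minimal sketch below.

```java
public final class CentConversion {
    // Convert a frequency in Hz to absolute cents, with 8.176 Hz as the
    // reference defined in the text; each octave then spans 1200 cents.
    public static double hzToAbsoluteCents(double frequencyInHz) {
        final double REFERENCE_HZ = 8.176;
        return 1200.0 * Math.log(frequencyInHz / REFERENCE_HZ) / Math.log(2.0);
    }
}
```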
A first type of visualization is the melograph representation, which is shown in Figure 8. In this representation, each estimated pitch is plotted over time. As can be observed, the pitches are not uniformly distributed over the pitch space, and form a clustering around 5883 cents.
A second type of visualization is the pitch histogram, which shows the pitch distribution regardless of time. The pitch histogram is constructed by assigning each pitch estimation in time to a bin between 0 and 14400 cents.
A third type of visualization is the pitch class histogram, which is obtained by adding each bin from the pitch histogram to a corresponding modulo 1200 bin. Such a histogram reduces the pitch distribution to one single octave. A peak thus represents the total duration of a pitch class in a selected block of audio. Notice that the peak at 5883 cents in the pitch histogram (Figure 9) now corresponds to the peak at 1083 cents in the pitch class histogram (Figure 10).
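A minimal sketch of how the two histograms relate, assuming the estimations have already been converted to cents and a bin width of one cent: the pitch histogram uses bins between 0 and 14400 cents, and the pitch class histogram folds those bins modulo 1200.

```java
public final class Histograms {

    // Pitch histogram: one bin per cent between 0 and 14400 cents.
    public static long[] pitchHistogram(double[] estimationsInCents) {
        long[] histogram = new long[14400];
        for (double cents : estimationsInCents) {
            int bin = (int) Math.round(cents);
            if (bin >= 0 && bin < histogram.length) {
                histogram[bin]++;
            }
        }
        return histogram;
    }

    // Pitch class histogram: fold the pitch histogram into a single octave.
    public static long[] pitchClassHistogram(long[] pitchHistogram) {
        long[] folded = new long[1200];
        for (int bin = 0; bin < pitchHistogram.length; bin++) {
            folded[bin % 1200] += pitchHistogram[bin];
        }
        return folded;
    }
}
```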
It can also be useful to filter the pitch estimations that make up the pitch class histogram. The most obvious ‘filter’ is to select only an interesting timespan and pitch range. The distributions can be further manipulated using other filters and peak detection. The following three filters are implemented in Tarsos:
The first is an estimation quality filter. It simply removes pitch estimations from the distribution below a certain quality threshold. Using YIN, the quality of an estimation is related to the periodicity of the block of sound analyzed. Keeping only high quality estimations should yield clearer pitch distributions.
The second is called a near to pitch class filter. This filter only allows pitch estimations which are close to previously identified pitch classes. The pitch range parameter (in cents) defines how far ornamentations can deviate from the pitch classes. Depending on the music and the research question, one needs to be careful with this - and other - filters. For example, a vibrato makes pitch go up and down - pitch modulation - and is centered around a pitch class. Figure ? gives an example of Western vibrato singing. The melograph reveals the ornamental singing style, based on two distinct pitch classes. The two pitch classes are hard to identify in the histogram (Figure ?) but are perceptually there; they are made clear with the dotted gray line. In contrast, Figure ? depicts a more continuous glissando which is used as a building block to construct a melody in an Indian raga. For these cases, [94] introduced the concept of two-dimensional ’melodic atoms’. (In [75] it is shown how elementary bodily gestures are related to pitch and pitch gestures.) The histogram of the pitch gesture in Figure ? suggests one pitch class while a fundamentally different concept of tone is used. Applying the near to pitch class filter to this type of music could lead to incorrect results. The goal of this filter is to get a clearer view on the melodic contour by removing pitches between pitch classes, and to get a clearer pitch class histogram.
The third filter is a steady state filter. The steady state filter has a time and a pitch range parameter. The filter keeps only consecutive estimations that stay within a pitch range for a defined number of milliseconds. The default values are 100ms within a range of 15 cents. The idea behind it is that only ’notes’ are kept while transition errors, octave errors and other short events are removed; a sketch of such a filter follows below.
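The sketch below illustrates the idea of the steady state filter with the default values from the text (100 ms within 15 cents); the exact implementation in Tarsos may differ in details such as how a run of estimations is delimited.

```java
import java.util.Arrays;

public final class SteadyStateFilter {

    // Keep only runs of consecutive estimations that stay within maxRangeCents
    // for at least minDurationSec. Estimations are given as parallel arrays of
    // time stamps (seconds) and pitches (cents); returns a keep/discard mask.
    public static boolean[] filter(double[] timesSec, double[] cents,
                                   double minDurationSec, double maxRangeCents) {
        boolean[] keep = new boolean[cents.length];
        if (cents.length == 0) return keep;
        int start = 0;
        double runMin = cents[0], runMax = cents[0];
        for (int i = 1; i <= cents.length; i++) {
            boolean endOfData = (i == cents.length);
            boolean outOfRange = !endOfData
                    && Math.max(runMax, cents[i]) - Math.min(runMin, cents[i]) > maxRangeCents;
            if (endOfData || outOfRange) {
                // close the current run and keep it if it lasted long enough
                if (timesSec[i - 1] - timesSec[start] >= minDurationSec) {
                    Arrays.fill(keep, start, i, true);
                }
                if (!endOfData) {
                    start = i;
                    runMin = cents[i];
                    runMax = cents[i];
                }
            } else {
                runMin = Math.min(runMin, cents[i]);
                runMax = Math.max(runMax, cents[i]);
            }
        }
        return keep;
    }
}
```

With the defaults from the text this would be called as `filter(times, cents, 0.1, 15)`.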
Once a selection of the estimations is made or, optionally, other filters are used, the distribution is ready for peak detection. The peak detection algorithm looks for each position where the derivative of the histogram is zero, and a local height score is calculated with the formula in Equation 1. The local height score $s_i$ of a peak with height $h_i$ is defined for a certain window $w$ as $s_i = (h_i - \bar{h}_w)/\sigma_w$, where $\bar{h}_w$ is the average height in the window and $\sigma_w$ refers to the standard deviation of the height in the window. The peaks are ordered by their score and iterated, starting from the peak with the highest score. If peaks are found within the window of the current peak, they are removed. Peaks with a local height score lower than a defined threshold are ignored. Since we are looking for pitch classes, the window wraps around the edges: there is a difference of 20 cents between 1190 cents and 10 cents.
Figure 11 shows the local height score function applied to the pitch class histogram shown in Figure 10. The desired leveling effect of the local height score is clear, as the small peak becomes much more defined. The threshold is also shown; in this case, it eliminates the noise caused by the small window size and local height deviations. The performance of the peak detection depends on two parameters, namely the window size and the threshold. Automatic analysis either uses a general preset for the parameters or tries to find the most stable setting with an exhaustive search. Optionally, Gaussian smoothing can be applied to the pitch class histogram, which makes peak detection more straightforward. Manual intervention is sometimes needed: by fiddling with the two parameters a user can quickly browse through several peak detection result candidates.
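A sketch of the local height score, following the verbal description above; the window wraps around the octave edges and peaks scoring below the threshold are discarded. The real Tarsos implementation may differ in detail.

```java
public final class PeakScore {

    // Local height score of bin 'index' in a pitch class histogram:
    // (height - mean of the surrounding window) / standard deviation of that
    // window, with the window wrapping around the 0..1199 cent edges.
    public static double localHeightScore(long[] pitchClassHistogram, int index, int windowInCents) {
        int half = windowInCents / 2;
        int n = 2 * half + 1;
        double sum = 0, sumOfSquares = 0;
        for (int d = -half; d <= half; d++) {
            double h = pitchClassHistogram[Math.floorMod(index + d, pitchClassHistogram.length)];
            sum += h;
            sumOfSquares += h * h;
        }
        double mean = sum / n;
        double std = Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));
        return std == 0 ? 0 : (pitchClassHistogram[index] - mean) / std;
    }
}
```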
Once the pitch classes are identified, a pitch class interval matrix can be constructed. This is the fourth type of representation, which is shown in Table 1. The pitch class interval matrix represents the found pitch classes and shows the intervals between the pitch classes. In our example, a perfect fourth (498 cents) can be found between the pitch classes at 585 and 1083 cents.
P.C. | 107 | 364 | 585 | 833 | 1083 |
---|---|---|---|---|---|
107 | 0 | 256 | 478 | 726 | 976 |
364 | 944 | 0 | 221 | 470 | 719 |
585 | 722 | 979 | 0 | 248 | 498 |
833 | 474 | 730 | 952 | 0 | 250 |
1083 | 224 | 481 | 702 | 950 | 0 |
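The interval matrix itself is straightforward to compute from the detected pitch classes: entry (i, j) is the ascending interval from pitch class i to pitch class j, modulo 1200 cents. The sketch below reproduces the structure of Table 1; the values in the table differ by a cent here and there, presumably because they were computed from unrounded pitch class estimates.

```java
public final class IntervalMatrix {

    // Entry (i, j) is the ascending interval, in cents, from pitch class i to
    // pitch class j, taken modulo 1200 so that it stays within one octave.
    public static int[][] compute(int[] pitchClassesInCents) {
        int n = pitchClassesInCents.length;
        int[][] intervals = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                intervals[i][j] = Math.floorMod(pitchClassesInCents[j] - pitchClassesInCents[i], 1200);
            }
        }
        return intervals;
    }
}
```

Called with the pitch classes {107, 364, 585, 833, 1083}, this yields the layout of Table 1.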
The interface
Most of the capabilities of Tarsos are used through the graphical user interface (Figure ?). The interface provides a way to explore pitch organization within a musical piece. However, the main flow of the process, as described above, is not always as straightforward as the example might suggest. More particularly, in many cases of music from oral traditions, the peaks in the pitch class histogram are not always well-defined (see Section ?). Therefore, the automated peak detection may need manual inspection and further manual fine-tuning in order to correctly identify a song’s pitch organization. The user interface was designed specifically to offer a flexible environment where all windows with representations communicate their data. Tarsos has the attractive feature that all actions, like the filtering actions mentioned in Section ?, are updated for each window in real-time.
One way to closely inspect pitch distributions is to select only a part of the estimations. In the block diagram of Figure 7, this is represented by the funnel. Selection in time is possible using the waveform view (Figure ?-5). For example, the aim could be a comparison of pitch distributions at the beginning and the end of a piece, to reveal whether a choir lowered or raised its pitch during a performance (see Section ? for a more elaborate example).
Selection in pitch range is possible and can be combined with a selection in time using the melograph (Figure ?-3). One may select the melodic range so as to exclude pitched percussion, and this could yield a completely different pitch class histogram. This feature is practical, for example, when a flute melody is accompanied by a low-pitched drum and only the flute tuning is of interest. With the melograph it is also possible to zoom in on one or two notes, which is interesting for studying pitch contours. As mentioned earlier, not all music is organized by fixed pitch classes. An example of such pitch organization is given in Figure ?, a fragment of Indian music where the estimations contain information that cannot be reduced to fixed pitch classes.
To allow efficient selection of estimations in time and frequency, they are stored in a kd-tree [12]. Once such a selection of estimations is made, a new pitch histogram is constructed and the pitch class histogram view (Figure ?-1) changes instantly.
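The query itself is a two-dimensional range selection; the naive linear scan below shows what is being asked of the data structure, while the kd-tree in Tarsos keeps the same query fast for long recordings.

```java
import java.util.ArrayList;
import java.util.List;

public final class EstimationSelection {

    // Return the indices of all estimations that fall inside the selected
    // time span (seconds) and pitch range (cents). Tarsos answers this kind
    // of query with a kd-tree; the linear scan here is only for illustration.
    public static List<Integer> select(double[] timesSec, double[] cents,
                                       double minTime, double maxTime,
                                       double minCents, double maxCents) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < timesSec.length; i++) {
            if (timesSec[i] >= minTime && timesSec[i] <= maxTime
                    && cents[i] >= minCents && cents[i] <= maxCents) {
                selected.add(i);
            }
        }
        return selected;
    }
}
```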
Once a pitch class histogram is obtained, peak detection is a logical next step. With the user interface, manual adjustment of the automatically identified peaks is possible. New peak locations can be added and existing ones can be moved or deleted. In order to verify the pitch classes manually, it is possible to click anywhere on the pitch class histogram. This sends a midi-message with a pitch bend to synthesize a sound with a pitch that corresponds to the clicked location. Changes made to the peak locations propagate instantly throughout the interface.
The pitch class interval matrix (Figure ?-2) shows all new pitch class intervals. Reference pitches are added to the melograph and midi tuning messages are sent (see Section ?). The pitch class interval matrix is also interactive: when an interval is clicked on, the two pitch classes that create the interval sound at the same time. The dynamics of the process and the combination of both visual and auditory clues make manually adjusted, precise peak extraction, and therefore tone scale detection, possible. Finally, the graphical display of a piano keyboard in Tarsos allows us to play in the (new) scale. This feature can also be used from a computer keyboard, where notes are mapped onto keys. Any of the standard midi instrument sounds can be chosen.
It is possible to shift the pitch class histogram up- or downwards. The data is then viewed as a repetitive, octave based, circular representation. In order to compare scales, it is possible to upload a previously detected scale (see Section ?) and shift it, to find a particular fit. This can be done by hand, exploring all possibilities of overlaying intervals, or the best fit can be suggested by Tarsos.
Real-time capabilities
Tarsos is capable of real-time pitch analysis. Sound from a microphone can be analyzed and immediate feedback can be given on the played or sung pitch. This feature offers some interesting new use-cases in education, composition, and ethnomusicology.
For educational purposes, Tarsos can be used to practice singing quarter tones. Not only can the real-time audio be analyzed, but an uploaded scale or a previously analyzed file can also be listened to by clicking on the interval table or by using the keyboard. Singers or string players could use this feature to improve their intonation regardless of the scale they try to reach.
For compositional purposes, Tarsos can be used to experiment with microtonality. The peak detection and manual adjustment of pitch histograms allow the construction of any possible scale, with the possibility of immediate harmonic and melodic auditory feedback. Use of the interval table and the keyboard makes experiments in interval tension and scale characteristics possible. Musicians can tune (ethnic) instruments according to specific scales using the direct feedback of the real-time analysis. Because of the midi messages, it is also possible to play the keyboard in the same scale as the instruments at hand.
In ethnomusicology, Tarsos can be a practical tool for direct pitch analysis of various instruments. Given the fact that pitch analysis results show up immediately, microphone positions during field recordings can be adjusted on the spot to optimize measurements.
Output capabilities
Tarsos contains export capabilities for each step, from the raw pitch estimations to the pitch class interval matrix. The built-in functions can export the data as comma separated text files, charts, TeX-files, and there is a way to synthesize estimations. Since Tarsos is scriptable, there is also a possibility to add other export functions or modify the existing ones. The api and scripting capabilities are documented on the Tarsos website: http://0110.be/tag/JNMR.
For pitch class data, there is a special standardized text file format defined by the Scala program: the .scl extension. The Scala program comes with a dataset of over 3900 scales, ranging from historical harpsichord temperaments over ethnic scales to scales used in contemporary music. Recently this dataset has been used to find universal properties of scales [80]. Since Tarsos can export Scala files it is possible to see if the star-convex structures discussed by [80] can be found in scales extracted from real audio. Tarsos can also parse Scala files, so that comparison of theoretical scales with tuning practice is possible. This feature is visualized by the upwards Scala arrow in Figure 7. When a scale is overlaid on a pitch class histogram, Tarsos finds the best fit between the histogram and the Scala file.
A completely different output modality is midi. The midi Tuning Standard defines midi messages to specify the tuning of midi synthesizers. Tarsos can construct Bulk Tuning Dump messages with pitch class data to tune a synthesizer, enabling the user to play along with a song in tune. Tarsos contains the Gervill synthesizer, one of the very few (software) synthesizers that offer support for the midi Tuning Standard. Another approach to enable users to play in tune with an extracted scale is to send pitch bend messages to the synthesizer when a key is pressed. Pitch bend is a midi message that tells how much higher or lower a pitch needs to sound in comparison with a standardized pitch. Virtually all synthesizers support pitch bend, but pitch bends operate at the midi-channel level. This makes it impossible to play polyphonic music in an arbitrary tone scale.
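As a sketch of that second approach: the deviation, in cents, of a desired pitch from the nearest equal-tempered midi note can be translated into a 14-bit pitch bend value, assuming the common default bend range of ±2 semitones. The actual messages sent by Tarsos may be constructed differently.

```java
public final class PitchBend {

    // Translate a deviation in cents from the nearest equal-tempered midi
    // note into a 14-bit pitch bend value (0..16383, 8192 = no bend),
    // assuming the default bend range of +-2 semitones (+-200 cents).
    public static int pitchBendValue(double deviationInCents) {
        double bendRangeInCents = 200.0;
        double normalized = deviationInCents / bendRangeInCents; // -1.0 .. 1.0
        int value = (int) Math.round(8192 + normalized * 8192);
        return Math.max(0, Math.min(16383, value));
    }
}
```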
Scripting capabilities
Processing many audio files with the graphical user interface quickly becomes tedious. Scripts written for the Tarsos api can automate tasks and offer a possibility to utilize Tarsos’ building blocks in entirely new ways. Tarsos is written in Java, and is extendable using scripts in any language that targets the JVM (Java Virtual Machine), like JRuby or Scala. Tasks that can be implemented with the Tarsos api include the following:
Tone scale recognition: given a large number of songs and a number of tone scales in which each song can be brought, guess the tone scale used for each song. In Section ? this task is explained in detail and effectively implemented.
Modulation detection: this task tries to find the moments in a piece of music where the pitch class histogram changes from one stable state to another. For Western music this could indicate a change of mode, a modulation. This task is similar to the one described by [112]. With the Tarsos api you can compare windowed pitch histograms and detect modulation boundaries.
Evolution in tone scale use: this task tries to find evolutions in tone scale use in a large number of songs from a certain region over a long period of time. Are some pitch intervals becoming more popular than others? This is done by [131] for a set of African songs.
Acoustic fingerprinting: it is theorized by [193] that pitch class histograms can serve as an acoustic fingerprint for a song. With the building blocks of Tarsos - pitch detection, pitch class histogram creation and comparison - this was put to the test by [170].
The article by [193] gives a good overview of what can be done using pitch histograms and, by extension, the Tarsos api. To conclude: the Tarsos api enables developers to quickly test ideas, execute experiments on large sets of music and leverage the features of Tarsos in new and creative ways.
In what follows, we explore Tarsos’ capabilities using case studies in non-Western music. The goal is to focus on problematic issues such as the use of different pitch extractors, music with pitch drift, and last but not least, the analysis of large databases.
Analysing a pitch histogram
We will first consider the analysis of a song that was recorded in 1954 by missionary Scohy-Stroobants in Burundi. The song is performed by a singing soloist, Léonard Ndengabaganizi. The recording was analysed with the YIN pitch detection method and a pitch class histogram was calculated: it can be seen in Figure ?. After peak detection on this histogram, the following pitch intervals were detected: 168, 318, 168, 210, and 336 cents. The detected peaks and all intervals are shown in an interval matrix (see Figure ?). It can be observed that this is a pentatonic division that comprises small and large intervals, which is different from an equal tempered or meantone division. Interestingly, the two largest peaks define a fifth interval, which is made up of a pure minor third (318 cents) and a pure major third (378 cents). In addition, a mirrored set of intervals is present, based on 168-318-168 cents. This phenomenon is also illustrated by Figure ?.
Different pitch extractors
However, Tarsos has the capability to use different pitch extractors. Here we show the difference between seven pitch extractors at the histogram level. A detailed evaluation of each algorithm cannot be covered in this article but can be found in the cited papers. The different pitch extractors are:
YIN [49] (YIN) and the McLeod Pitch Method (MPM), which is described by [127], are two time-domain pitch extractors. Tarsos contains a platform independent implementation of the algorithms.
Spectral Comb (SC), Schmitt trigger (Schmitt) and Fast Harmonic Comb (FHC) are described by [21]. They are available for Tarsos through VAMP plug-ins [29];
MAMI 1 and MAMI 6 are two versions of the same pitch tracker. MAMI 1 uses only the most salient pitch at a certain time; MAMI 6 takes the six most salient pitches at a certain time into account. The pitch tracker is described by [35].
Figure 12 shows the pitch histogram of the same song as in the previous section, which is sung by an unaccompanied young man. The pitch histogram shows a small tessitura and wide pitch classes. However, the general contour of the histogram is more or less the same for each pitch extraction method: five pitch classes can be distinguished in about one and a half octaves, ranging from 5083 to 6768 cents. Two methods stand out. Firstly, MAMI 6 detects pitch in the lower and higher regions. This is due to the fact that MAMI 6 always gives six pitch estimations in each measurement sample. In this monophonic song this results in octave errors - halving and doubling - and overtones. Secondly, the Schmitt method also stands out because it detects pitch in regions where the other methods detect far fewer pitches, e.g. between 5935 and 6283 cents.
Figure 13 shows the pitch class histogram for the same song as in Figure 12, now collapsed into one octave. It clearly shows that it is hard to determine the exact location of each pitch class. However, all histogram contours look similar except for the one of the Schmitt method, which results in much less well defined peaks. The following evaluation shows that this is not limited to this single song.
 | YIN | MPM | Schmitt | FHC | SC | MAMI 1 | MAMI 6 |
---|---|---|---|---|---|---|---|
YIN | 1.00 | 0.81 | 0.41 | 0.65 | 0.62 | 0.69 | 0.61 |
MPM | 0.81 | 1.00 | 0.43 | 0.67 | 0.64 | 0.71 | 0.63 |
Schmitt | 0.41 | 0.43 | 1.00 | 0.47 | 0.53 | 0.42 | 0.56 |
FHC | 0.65 | 0.67 | 0.47 | 1.00 | 0.79 | 0.67 | 0.66 |
SC | 0.62 | 0.64 | 0.53 | 0.79 | 1.00 | 0.65 | 0.70 |
MAMI 1 | 0.69 | 0.71 | 0.42 | 0.67 | 0.65 | 1.00 | 0.68 |
MAMI 6 | 0.61 | 0.63 | 0.56 | 0.66 | 0.70 | 0.68 | 1.00 |
Average | 0.69 | 0.70 | 0.55 | 0.70 | 0.70 | 0.69 | 0.69 |
In order to gain some insight into the differences between the pitch class histograms resulting from different pitch detection methods, the following procedure was used: for each song in a data set of more than 2800 songs - a random selection from the music collection of the Belgian Royal Museum for Central Africa (RMCA) - seven pitch class histograms were created by the pitch detection methods. The overlap - a number between zero and one - between each pitch class histogram pair was calculated. A sum of the overlap between each pair was made and finally divided by the number of songs. The resulting data can be found in Table 2. Here histogram overlap or intersection is used as a distance measure because [62] show that this measure works best for pitch class histogram retrieval tasks. The overlap $c(h_1, h_2)$ between two histograms $h_1$ and $h_2$ with $K$ classes is calculated with Equation 2 as the intersection $\sum_{k=1}^{K}\min(h_1(k), h_2(k))$ of the two normalized histograms. For an overview of alternative correlation measures between probability density functions see [33].
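A sketch of such an overlap measure is given below: the histogram intersection of two pitch class histograms, normalized so that the result lies between zero and one. The exact normalization used in the evaluation may differ slightly.

```java
public final class HistogramOverlap {

    // Histogram intersection of two pitch class histograms with the same
    // number of bins, normalized by the smaller histogram mass so that the
    // result lies between zero (no overlap) and one (identical shapes).
    public static double overlap(long[] h1, long[] h2) {
        long intersection = 0, sum1 = 0, sum2 = 0;
        for (int k = 0; k < h1.length; k++) {
            intersection += Math.min(h1[k], h2[k]);
            sum1 += h1[k];
            sum2 += h2[k];
        }
        long smallest = Math.min(sum1, sum2);
        return smallest == 0 ? 0.0 : (double) intersection / smallest;
    }
}
```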
Table 2 shows that there is, on average, a large overlap of 81% between the pitch class histograms created by YIN and those by MPM. This can be explained by the fact that the two pitch extraction algorithms are very much alike: both operate in the time domain and are based on autocorrelation. The table also shows that Schmitt generates rather unique pitch class histograms: on average there is only 55% overlap with the other pitch class histograms. This behaviour was already expected from the analysis of the single song above.
The choice of a particular pitch detection method depends on the music and the analysis goals. The music can be monophonic, homophonic or polyphonic, and different instrumentation and recording quality all have an influence on pitch estimators. Users of Tarsos are encouraged to try out which pitch detection method suits their needs best. Tarsos’ scripting api - see Section ? - can be helpful when optimizing combinations of pitch detection methods and parameters for an experiment.
Shifted pitch distributions
Several difficulties in analysis and interpretation may arise due to pitch shift effects during musical performances. This is often the case with a cappella choirs. Figure ? shows a nice example of an intentionally raised pitch, during solo singing in the Scandinavian Sami culture. The short and repeated melodic motive remains the same during the entire song, but the pitch rises gradually, ending up 900 cents higher than at the beginning. Retrieving a scale for the entire song is in this case irrelevant, although the scale is significant for the melodic motive. Figure 15 shows an example where scale organization depends on the characteristics of the instrument. This type of African fiddle, the iningidi, does not use a soundboard to shorten the strings. Instead the string is shortened by the fingers that are in a (floating) position above the string: an open string and three fingers give a tetratonic scale. Figure 14 shows an iningidi being played. This use case shows that pitch distributions for entire songs can be misleading; in both cases it is much more informative to compare the distribution from the first part of the song with that of the last part. Then it becomes clear how much the pitch shifted and in which direction.
It is interesting to remark that these intervals are more or less equally spaced - a natural consequence of the distance between the fingers - and that, consequently, not the entire octave tessitura is used. In fact only 600 cents, half an octave, is used: a scale that typically occurs in fiddle recordings and that can rather be seen as a tetrachord. The open string (lowest note) is much more stable than the three other pitches, which deviate more, as shown by the broader peaks in the pitch class histogram. The hand position without a fingerboard is directly related to the variance of these three pitch classes. When comparing the second minute of the song with the seventh, one sees a clear shift in pitch, which can be explained by the fact that the musician changed the hand position slightly. In addition, another phenomenon can be observed: while performing, the open string gradually loses tension, causing a small lowering in pitch which can be noticed when comparing the two fragments. This is not uncommon for ethnic music instruments.
Tarsos’ scripting applied to Makam recognition

To make the use of scripting more concrete, an example is given here. It concerns the analysis of Turkish classical music. In an article by [62], pitch histograms were used for - amongst other tasks - makam recognition, a task that can be described as follows:
For a small set of tone scales and a large set of musical performances, each brought in one of those scales, identify the tone scale of each musical performance automatically.
An example of makam recognition can be seen in Figure 16. A theoretical template - the dotted, red line - is compared to a pitch class histogram - the solid, blue line - by calculating the maximum overlap between the two. Each template is compared with the pitch class histogram and the template with the maximum overlap is the guessed makam. Pseudocode for this procedure can be found in Algorithm ?.
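The following sketch illustrates the idea. It uses hypothetical names, assumes that histograms and templates are normalized and have the same resolution, and takes the maximum overlap over all circular shifts; it is an illustration of the procedure, not the actual Tarsos script.

```java
import java.util.Map;

/** Sketch of makam recognition by template matching (hypothetical names). */
public class MakamRecognitionSketch {

    /** Returns the name of the template with the largest shift-invariant overlap. */
    public static String recognize(double[] pitchClassHistogram, Map<String, double[]> templates) {
        String bestMakam = null;
        double bestOverlap = -1.0;
        for (Map.Entry<String, double[]> entry : templates.entrySet()) {
            double overlap = maxOverlapOverShifts(pitchClassHistogram, entry.getValue());
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                bestMakam = entry.getKey();
            }
        }
        return bestMakam;
    }

    /** Maximum histogram overlap over all circular shifts of the template. */
    private static double maxOverlapOverShifts(double[] histogram, double[] template) {
        double max = 0.0;
        for (int shift = 0; shift < template.length; shift++) {
            double overlap = 0.0;
            for (int i = 0; i < histogram.length; i++) {
                // both histograms are assumed normalized and of equal length
                overlap += Math.min(histogram[i], template[(i + shift) % template.length]);
            }
            max = Math.max(max, overlap);
        }
        return max;
    }
}
```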
Tarsos is a tool for the analysis of pitch distributions. To that aim, Tarsos incorporates several pitch extraction modules, pitch distribution filters, audio feedback tools, and scripting tools for batch processing of large databases of musical audio. However, pitch distributions can be considered from different perspectives, such as ethnographical studies of scales [159], theoretical studies in scale analysis [164], harmonic and tonal analysis [96], and other structural analysis approaches to music (such as set theoretical and Schenkerian). Clearly, Tarsos does not offer a solution to all these different approaches to pitch distributions. In fact, seen from the viewpoint of Western music analysis, Tarsos is a rather limited tool as it offers neither harmonic analysis, nor tonal analysis, nor even statistical analysis of pitch distributions. All of this should be applied alongside Tarsos, when needed. Instead, what Tarsos provides is an intermediate level between pitch extraction (done by pitch extractor tools) and music theory. The major contribution of Tarsos is that it offers an easy to use tool for pitch distribution analysis that applies to all kinds of music, including Western and non-Western; it offers, so to speak, pitch distribution analysis without imposing a music theory. In what follows, we explain why such tools are needed and why they are useful.
Tarsos and Western music theoretical concepts

Until recently, musical pitch was often considered from the viewpoint of traditional music theory, which assumes that pitch is stable (e.g. vibrato is an ornament of a stable pitch), that pitch can be segmented into tones, that pitches are based on octave equivalence, that octaves are divided into 12 equal-sized intervals of 100 cents each, and so on. These assumptions have the advantage that music can be reduced to symbolic representations, a written notation, or notes, whose structures can be studied at an abstract level. As such, music theory has conceptualized pitch distributions as chords, keys, modes and sets, using a symbolic notation.
So far so good, but tools based on these concepts may not work for many nuances of Western music, and especially not for non-Western music. In Western music, tuning systems have a long history. Proof of this can be found in the tunings of historical organs, and in tuning systems that have been explored by composers in the 20th century (cf. Alois Haba, Harry Partch, Ivo Darreg, and La Monte Young). Especially in non-Western classical music, pitch distributions are used that radically differ from Western theoretical concepts, in terms of tuning as well as pitch occurrence and timbre. For example, the use of small intervals in Arab music contributes to nuances in melodic expression. To better understand how small pitch intervals contribute to the organization of this music, we need tools that do not assume a division of the octave in 12 equal-sized intervals (see [62]). Other types of music do not have octave equivalence (cf. the Indonesian gamelan), and some music works with modulated pitch. For example, [75] describe classical Chinese guqin music, in which tones contain sliding patterns (pitch modulations) that form a substantial component of the tone, and consider a tone as a succession of prototypical gestures. [94] introduces a set of 2D melodic units, melodic atoms, to describe Carnatic (South-Indian classical) music; these units represent or synthesize a melodic phrase and are not bound to a scale type. Hence, tools based on common Western music theoretical conceptions of pitch organization may not work for this type of music.
Oral musical traditions (also called ethnic music) provide a special case since there is no written music theory underlying the pitch organization. An oral culture depends on societal coherence, interpersonal influence and individual musicality, and this has implications for how pitch gets organized. Although oral traditions often rely on a peculiar pitch organization, often using a unique system of micro-tuned intervals, it is also the case that instruments may lack a fixed tuning, or that tunings may strongly differ from one instrument to another, or from one region to another. The myriad of ways in which people succeed in making sense out of different types of pitch organization can be considered cultural heritage that necessitates a proper way of documentation and study [131].
Several studies have attempted to develop a proper approach to pitch distributions. [68] look for pitch gestures in European folk music as an additional aspect to pitch detection. Moving from tone to scale research, [34] acknowledges interval differences in Indian classical music, but reduces them to a chromatic scale for similarity analysis and classification. [188] developed, as early as 1969, an automated method for extracting pitch information from monophonic audio in order to assemble the scale of the spilåpipa from frequency histograms. [18] build a system to classify and recognize Turkish maqams from audio files, using overall frequency histograms to characterize the maqam scales and to detect the tonic centre. Maqams contain intervals of different sizes, often not compatible with the chromatic scale, partly relying on smaller intervals. [132] focuses on the pitch distributions of especially African music, which exhibits a large diversity of irregular tuning systems, and avoids a priori pitch categories by using a quasi-continuous rather than a discrete interval representation. In [131] they show that African songs have shifted more and more towards Western well temperament from the 1950s to the 1980s.
To sum up, the study of pitch organization needs tools that go beyond elementary concepts of the Western music theoretical canon (such as octave equivalence, stability of tones, the equal-tempered scale, and so on). This is evident from the nuances of pitch organization in Western music, in non-Western classical music, as well as in oral music cultures. Several attempts have been undertaken, but we believe that a proper way of achieving this is by means of a tool that combines audio-based pitch extraction with a generalized approach to pitch distribution analysis. Such a tool should be able to automatically extract pitch from musical audio in a culture-independent manner, and it should offer an approach to the study of pitch distributions and their relationship with tunings and scales. The envisioned tool should be able to perform this kind of analysis in an automated way, but it should be flexible enough to allow a musicologically grounded manual fine-tuning, using filters that define the scope at which we look at distributions. The latter is indeed needed in view of the large variability of pitch organization in music all over the world. Tarsos is an attempt at supplying such a tool. On the one hand, Tarsos tries to avoid music theoretical concepts that could contaminate music that does not subscribe to the constraints of the Western music theoretical canon. On the other hand, the use of Tarsos may still be too limited, as pitch distributions may further draw upon melodic units that require an approach to segmentation (similar to the way segmented pitch relates to notes in Western music) and further gestural analysis (see the studies mentioned above).
Tarsos pitfalls

The case studies from section ? illustrate some of the capabilities of Tarsos as a tool for the analysis of pitch distributions. As shown, Tarsos offers a graphical interface that allows a flexible way to analyse pitch, similar to other editors that focus on sound analysis (Sonic Visualiser, Audacity, Praat). Tarsos offers support for different pitch extractors, real-time analysis (see section ?), and has numerous output capabilities (see section ?). The scripting facility allows us to use Tarsos’ building blocks efficiently and in unique ways.
However, Tarsos-based pitch analysis should be handled with care. The following three recommendations should be taken into account. First of all, one cannot extract scales without considering the music itself. Pitch classes that are not frequently used won’t show up clearly in a histogram and hence might be missed. Also, not all music uses distinct pitch classes: the Chinese and Indian music traditions were mentioned in this regard. Because of the physical characteristics of the human voice, voices can glide between the tones of a scale, which makes an accurate measurement of pitch less straightforward. It is recommended to zoom in on the estimations in the melograph representation for a correct understanding.
Secondly, the analysis of polyphonic recordings should be handled with care, since current pitch detection algorithms are primarily geared towards monophonic signals. The analysis of homophonic singing, for example, may give incomplete results. It is advisable to try out different pitch extractors on the same signal to see whether the results are trustworthy.
Finally, [159] recognizes the use of “pitch categories” but warns that, especially for complex inharmonic sounds, a scale is more than a one-dimensional series of pitches and that spectral components need to be taken into account to get better insights into tuning and scales. Indeed, in recent years it has become clear that the timbre of tones and the musical scales in which these tones are used are somehow related [164]. The spectral content of pitch (i.e. the timbre) determines the perception of consonant and dissonant pitch intervals, and therefore also the pitch scale, as the latter is a reflection of the preferred melodic and harmonic combinations of pitch. Based on the principle of minimal dissonance in pitch intervals, it is possible to derive pitch scales from the spectral properties of the sounds and from principles of auditory interference (or critical bands). [162] argue that perception is based on the disambiguation of action-relevant cues, and they manage to show that the harmonic musical scale can be derived from the way speech sounds relate to the resonant properties of the vocal tract. Therefore, the annotated scale resulting from the precise use of Tarsos does not imply the assignment of any characteristic to the music itself. It is up to the user to correctly interpret a possible scale, tonal center, or melodic development.
Tarsos - future work

The present version of Tarsos is a first step towards a tool for pitch distribution analysis. A number of extensions are possible.
For example, given the tight connection between timbre and scale, it would be nice to select a representative tone from the music and transpose it to the entire scale, using a phase vocoder. This sound sample and its transpositions could then be used as a sound font for the midi synthesizer. This would give the scale a more natural feel compared to the general midi device instruments that are currently present.
Another possible feature is tonic detection. Some types of music have a well-defined tonic, e.g. Turkish classical music. It would make sense to use this tonic as a reference pitch class. Pitch histograms and pitch class histograms would then not use the reference frequency defined in appendix ? but a better suited, automatically detected reference: the tonic. This would make the intervals and the scale more intelligible.
Tools for comparing two or more scales may also be added. For example, by creating pitch class histograms for a sliding window and comparing those with each other, it should be possible to automatically detect modulations. Using this technique, it should also be possible to detect pitch drift in choral or other music.
Another research area is to extract features from a large data set and use the pitch class histogram or interval data as a basis for pattern recognition and cluster analysis. With a time-stamped and geo-tagged musical archive, it could be possible to detect geographical or chronological clusters of similar tone scale use.
In the longer term, we plan to add representations of other musical parameters to Tarsos as well, such as rhythmic and instrumental information, and temporal and timbral features. Our ultimate goal is to develop an objective albeit partial view on music by combining these parameters within an easy-to-use interface.
In this paper, we have presented Tarsos, a modular software platform to extract and analyze pitch distributions in music. The concept and main features of Tarsos have been explained and some concrete examples have been given of its usage. Tarsos is a tool in full development. Its main power is related to its interactive features which, in the hands of a skilled music researcher, can become a tool for exploring pitch distributions in Western as well as non-Western music.
Since different representations of pitch are used by Tarsos and other pitch extractors, this section contains definitions of, and remarks on, the different pitch and pitch interval representations.
For humans, the perceptual distance between 220Hz and 440Hz is the same as the perceptual distance between 440Hz and 880Hz. A pitch representation that takes this logarithmic relation into account is more practical for some purposes. Luckily there are a few:
The midi standard defines note numbers from 0 to 127, inclusive. Normally only integers are used, but any frequency in Hz can be represented with a fractional note number using Equation 3.
Rewriting Equation 3 to Equation ? shows that midi note number 0 corresponds with a reference frequency of 8.176Hz, which is C-1 on a keyboard with A4 tuned to 440Hz. It also shows that the midi standard divides the octave in 12 equal parts.
To convert a midi note number n to a frequency f in Hz, the inverse relation can be used: f = 440 × 2^((n − 69)/12).
Using pitch represented as fractional midi note numbers makes sense when working with midi instruments and midi data. Although the midi note numbering scheme seems oriented towards Western pitch organization (12 semitones), it is conceptually equal to the cent unit, which is more widely used in ethnomusicology.
[206] introduced the nowadays widely accepted cent unit. To convert a frequency f in Hz to a cent value relative to a reference frequency r, also in Hz, Equation 4 is used: cent = 1200 × log2(f / r).
With the same reference frequency, Equation 4 and Equation 3 differ only by a constant factor of exactly 100. In an environment with pitch representations in both midi note numbers and cent values, it is practical to use the standardized reference frequency of 8.176Hz, the frequency of midi note 0.
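These relations translate directly into code. The following is only a sketch of the standard formulas (midi note 69 = A4 = 440Hz; absolute cents relative to 8.176Hz, the frequency of midi note 0); it is not the Tarsos source.

```java
/** Sketch of the pitch unit conversions discussed above (not the Tarsos source). */
public class PitchConversionSketch {

    // Frequency of midi note 0, used as the reference for absolute cent values.
    public static final double REF_FREQ = 8.176;

    /** Hz to fractional midi note number (midi note 69 = A4 = 440Hz). */
    public static double hzToMidi(double hz) {
        return 69 + 12 * (Math.log(hz / 440.0) / Math.log(2));
    }

    /** Fractional midi note number to Hz. */
    public static double midiToHz(double midi) {
        return 440.0 * Math.pow(2, (midi - 69) / 12.0);
    }

    /** Hz to an absolute cent value relative to the 8.176Hz reference. */
    public static double hzToAbsoluteCent(double hz) {
        return 1200 * (Math.log(hz / REF_FREQ) / Math.log(2));
    }

    public static void main(String[] args) {
        // 440Hz is midi note 69 and, with the 8.176Hz reference, about 6900 cents:
        System.out.println(hzToMidi(440));         // 69.0
        System.out.println(hzToAbsoluteCent(440)); // ~6900, i.e. 100 times the midi note number
    }
}
```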
The savart and the millioctave are similar logarithmic units; they divide the octave in about 301 and exactly 1000 parts respectively, which is the only difference with cents.
Pitch Ratio Representation

Pitch ratios are essentially pitch intervals: an interval of one octave, 1200 cents, corresponds to a frequency ratio of 2/1. To convert a ratio r to a value in cents c: c = 1200 × log2(r) = (1200 / ln 2) × ln(r).
The natural logarithm, the logarithm with base e (Euler’s number), is noted as ln. To convert a value in cents c to a ratio r: r = 2^(c/1200) = e^(c × ln 2 / 1200).
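These two conversions, again only as a hedged sketch rather than the Tarsos code:

```java
/** Sketch of the ratio and cent interval conversions. */
public class RatioCentSketch {

    /** Frequency ratio to a pitch interval in cents: c = (1200 / ln 2) * ln(r). */
    public static double ratioToCent(double ratio) {
        return 1200 * Math.log(ratio) / Math.log(2);
    }

    /** Pitch interval in cents to a frequency ratio: r = 2^(c / 1200). */
    public static double centToRatio(double cent) {
        return Math.pow(2, cent / 1200.0);
    }

    public static void main(String[] args) {
        System.out.println(ratioToCent(2.0));    // an octave: 1200 cents
        System.out.println(ratioToCent(1.5));    // a just fifth: about 702 cents
        System.out.println(centToRatio(1200.0)); // 2.0
    }
}
```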
Further discussion on cents as pitch ratios can be found in appendix B of [164]. There it is noted that:
There are two reasons to prefer cents to ratios: Where cents are added, ratios are multiplied; and it is always obvious which of two intervals is larger when both are expressed in cents. For instance, an interval of a just fifth (3/2), followed by a just third (5/4), is 3/2 × 5/4 = 15/8, a just seventh. In cents, this is 702 + 386 = 1088. Is this larger or smaller than the Pythagorean seventh 243/128? Knowing that the latter is 1110 cents makes the comparison obvious.
Conclusion

The cent unit is mostly used for pitch interval representation, while the midi key and Hz units are mainly used to represent absolute pitch. The main difference between cents and fractional midi note numbers is the standardized reference frequency. In our software platform Tarsos we use the same standardized reference frequency of 8.176Hz (midi note 0), which enables us to use cents to represent absolute pitch and makes conversion to midi note numbers trivial. Tarsos also uses cents to represent pitch intervals and ratios.
Several audio files were used in this paper to demonstrate how Tarsos works and to clarify musical concepts. In this appendix you can find pointers to these audio files.
The thirty second excerpt of the musical example used throughout chapter ? can be downloaded from http://tarsos.0110.be/tag/JNMR and is courtesy of: wergo/Schott Music & Media, Mainz, Germany, www.wergo.de and Museum Collection Berlin. Ladrang Kandamanyura (slendro pathet manyura) is track eight on Lestari - The Hood Collection, Early Field Recordings from Java - SM 1712 2. It was recorded in 1957 and 1958 in Java.
The yoiking singer of Figure ? can be found on a production released on the label Caprice Records in the series of Musica Sveciae Folk Music in Sweden. The album is called Jojk CAP 21544 CD 3, Track No 38 Nila, hans svager/His brother-in-law Nila.
The api example (section ?) was executed on the data set by Bozkurt, which was also used in [62]. The Turkish song, brought in the makam Hicaz, from Figure 16 is also one of the songs in this data set.
For the comparison of different pitch trackers at the pitch class histogram level (section ?), a subset of the music collection of the Royal Museum for Central Africa (RMCA, Tervuren, Belgium) was used. We are grateful to the RMCA for providing access to its unique archive of Central African music. A song from the RMCA collection was also used in section ?. It has the tape number MR.1954.1.18-4 and was recorded in 1954 by missionary Scohy-Stroobants in Burundi. The song is performed by a singing soloist, Léonard Ndengabaganizi. Finally, the song with tape number MR.1973.9.41-4, also from the collection of the RMCA, was used to show pitch shift within a song (Figure 15). It is called Kana nakunze and was recorded by Jos Gansemans in Mwendo, Rwanda, in 1973.
Keywords: Replication, Acoustic fingerprinting, Reproducibility.
Reproducibility is one of the corner-stones of scientific methodology. A claim made in a scientific publication should be verifiable and the described method should provide enough detail to allow replication, “reinforcing the transparency and accountability of research processes” [113]. The Open Science movement has recently gained momentum among publishers, funders, institutions and practicing scientists across all areas of research. It is based on the assumption that promoting “openness” will foster equality, widen participation, and increase productivity and innovation in science. Re-usability is a keyword in this context: data must be “useful rather than simply available” [113], with a focus on facilitating the advancement of knowledge based on previous work (and on avoiding needlessly repeated work) rather than on verifying the correctness of previous work.
From a technical standpoint, sharing tools and data has never been easier. Reproducibility, however, remains a problem, especially for Music Information Retrieval (MIR) research and, more generally, for research involving complex software systems. This problem has several causes:
Journal articles and especially conference papers have limited space for detailed descriptions of methods or algorithms. Even for only moderately complex systems there are numerous parameters, edge cases and details which are glossed over in textual descriptions. This makes articles readable and the basic method intelligible, but those details need to be expounded somewhere. The ideal place for such details is well documented, runnable code. Unfortunately, intellectual property rights claimed by universities or research institutions often prevent researchers from freely distributing their code. This is problematic since it leaves those reproducing the work guessing at details, and it makes replicating a study prohibitively hard. Even if code is available, it is often not well documented or it is very hard to actually run it and reproduce results.
Copyrights on music make it hard to share music freely. MIR research often has commercial goals and focuses on providing access to commercial, popular music. It is sensible to use commercial music while doing research as well. Unfortunately this makes it potentially very expensive to reproduce an experiment: all music needs to be purchased again and again by researchers reproducing the work.
The original research also needs to uniquely identify the music used, which is challenging if there are several versions, re-issues or recordings of a similarly titled track. Audio fingerprinting techniques allow us to share unique identifiers for such audio.
Redistribution of historical field recordings in museum archives is even more problematic. Due to the nature of the recordings, their copyright status is often unclear. Clearing the status of such tracks involves international, historical copyright laws and multiple stakeholders such as the performers, soloists, the museum, the person who made the field recording and potentially a publisher that already published parts on an LP. The rights of each stakeholder need to be carefully considered, while the stakeholders themselves are difficult to identify due to a lack of precise meta-data and owing also to the passage of time. While it is possible to clear a few tracks, it quickly becomes an insurmountable obstacle to clear a representative set of recordings for the research community. For very old recordings where copyright is not a problem, there are sometimes ethical issues related to public sharing: some Australian indigenous music, for instance, is considered very private and not meant to be listened to by others.
The evaluation of research work (and, most importantly, of researchers) is currently based on the number of articles published in ranked scientific journals or conferences. Other types of scientific output are not valued as much. The advantage of investing resources in documenting, maintaining and publishing reproducible research and supplementary material is not often obvious when prioritising and strategising research outputs [113]. Short-lived project funding is also a factor that directs the attention of researchers to short-term output (publications), and not to long-term aspects of reproducible contributions to a field. In short, there is no incentive to spend much time on non-textual output.
Reproducing works is not an explicit tradition in computer science research. In the boundless hunt to further the state of the art there seems to be no time or place for a sudden standstill and reflection on previous work. Implicitly, however, a lot of replication seems to be going on. It is standard practice to compare the results of a new method with earlier methods (baselines), but it is often not clear whether the authors reimplemented those baselines or managed to find or modify an implementation. It is also not clear whether those baselines were verified for correctness by reproducing the results reported in the original work. Moreover, due to a lack of standardized data sets, approaches are often hard to compare directly.
If all goes well, exact replications do not contain any new findings and may therefore be less likely to get published. Doing a considerable amount of work that risks remaining unpublished is not a proposition many researchers look forward to, which expresses the tension between acting for the good of the community and acting in one’s own interest [137].
In the social sciences, the reproducibility project illustrated that the results of many studies could not be successfully reproduced [143], mainly due to small sample sizes and selection bias; a similar finding was reported in a special issue of Musicae Scientiae on replication in music psychology [59]. In these replicated studies the main problem did not lie in replicating the methods.
For research on complex software systems (MIR), it is expected that replicated results will closely match the original if the method can be accurately replicated and if the data are accessible. But those two conditions are hard to meet. The replication problem lies exactly in the difficulty of replicating the method and accessing the data. Once method and data are available, a statistical analysis of the behavior of deterministic algorithms is inherently less problematic than one of erratic humans. [185] showed that, even if data and method are available, replication can be challenging if the problem is ill-defined and the test data contain inconsistencies.
Even if there is little doubt about the accuracy of reported results, the underlying need for replication remains. First of all, it checks whether the problem is well-defined. Secondly, it tests whether the method is described well and in fine enough detail. Thirdly, it tests whether the data used are described well and accessible. Finally, the results are confirmed. Replication basically serves to check whether proper scientific methodology was used, and it solidifies the original work.
Open Science and MIR

Open Science doesn’t come as a set of prescriptive rules, but rather as a set of principles centred around the concept of “openness”, with (i) theoretical, (ii) technological/practical and (iii) ethical implications. Each scientific community needs to identify how Open Science applies to its own domain, developing “the infrastructures, algorithms, terminologies and standards required to disseminate, visualise, retrieve and re-use data” [107]. A general survey on Open Science policies in the field of MIR has never been performed, so an overview of their current application and their specific interpretation is not clearly defined. However, the members of this community have an implicit understanding of their own methods and their common practices to spread their materials and outputs, making it possible to lay out some fixed points. For example, methods, runnable annotated code and data sets are key to MIR reproducibility. Often research serves precisely to introduce an improvement, a variation or an extension of an existing algorithm. When the algorithm is not available, it needs to be re-created in order to implement the modification - which is not only resource consuming, but also never guarantees that the re-created code matches the antecedent down to the last detail [145]. Data sets are also very important, and they should be made available - if not for the general public, at least for peers and/or reviewers.
Implementing Open Science policies to their full potential would change the face of science practice as we know it today. But this requires a pervasive change in how we understand our day-to-day research activities and how we carry them out, and right now we are in a situation where most researchers endorse openness yet “struggle to engage in community-oriented work because of the time and effort required to format, curate, and make resources widely available” [108]. At the same time, the adoption of Open Science policies is encouraged but not mandatory, and the “variety of constraints and conditions relevant to the sharing of research materials” creates “confusion and disagreement” among researchers [113]. A recent survey of biomedical researchers in the United Kingdom [113] identified nine external factors that affect the practice of Open Science, including the existence (or lack) of repositories and databases for data, materials, software and models; the credit system in academic research; models and guidelines for intellectual property; collaborations with industrial partners, as well as attempts at commercialization; and the digital nature of research. These constraints are generally applicable across scientific domains, thus including MIR - where the aspect of commercialization emerges much earlier in the research workflow, at the level of the music collections that need to be purchased.
Thinking of Open Science in MIR, where systematic support of reproducibility is but one of the possible applications, is thus an invitation to think about openness in relation to “all components of research, including data, models, software, papers, and materials such as experimental samples” [113]. An important and cross-domain side aim of Open Science is also to show the importance of “encouraging critical thinking and ethical reflection among the researchers involved in data processing practices” [107]. Open Science is not only about materials and platforms, but also about people; the ’social’ is not merely ’there’ in science: “it is capitalised upon and upgraded to become an instrument of scientific work” [90].
This work replicates an acoustic fingerprinting system, which makes it one of the very few reported replication articles in Music Information Retrieval. [187] also replicated MIR systems: they replicated two musical genre classification systems to critically review them and to challenge the reported high performances. Our aim is to highlight the reproducibility aspects of a milestone acoustic fingerprinting paper and to provide an illustration of good research practices. In doing so we also provide an implementation to the research community and solidify the original acoustic fingerprinting research.
An acoustic fingerprint is a condensed representation of audio that can be matched reliably and quickly against a large set of fingerprints extracted from reference audio. The general acoustic fingerprinting process is depicted in Figure 17. A short query is introduced into the system, fingerprints are extracted from the query audio and subsequently compared with a large set of fingerprints in the reference database. Finally, either a match is found or the system reports that the query is not present in the database. Such acoustic fingerprinting systems have many use cases, such as digital rights management, identifying duplicates [47], audio synchronization [176] or labeling untagged audio with meta-data [20].
The requirements for an acoustic fingerprinting system are described by [32]: a system needs to be granular, robust, reliable and economic in terms of storage requirements and of the computational load of resolving a query. Granular means that only a short fragment is needed for identification. Robustness is determined by the various degradations a query can be subjected to while remaining recognizable; degradations can include additional noise, low-quality encoding, compression, equalization, pitch-shifting and time-stretching. The ratios between true/false positives/negatives determine the reliability of the system. To allow potentially millions of reference items, economy in terms of storage space is needed. Finally, resolving a query needs to be economic in terms of computational load. The weight of each requirement can shift depending on the context: if only a couple of hundred items end up in the reference database, the low storage space requirement is significantly relaxed.
Acoustic fingerprinting is a well researched MIR topic, and over the years several efficient acoustic fingerprinting methods have been introduced [76]. These methods perform well even with degraded audio quality and with industrial-sized reference databases. Some systems are able to recognize audio even when pitch-shifts are present [58] but without allowing for time-scale modification; other systems are designed to handle both pitch and time-scale modification at the same time. Some of these only support small data sets [214], others relatively large ones [208].
This work replicates and critically reviews the acoustic fingerprinting system by [70]. The ISMIR proceedings article is from 2002 and is elaborated upon in an article in the Journal of New Music Research [71]. The paper was chosen for several reasons:
It is widely cited: the ISMIR paper has been cited more than 750 times in total and more than 250 times since 2013, according to Google Scholar. This indicates that it was relevant and is still relevant today. A recent study, for example, improved the system by replacing the FFT with a filter bank [147]. Another study [40] improved its robustness against noise.
The paper has the very prototypical structure of work that presents and evaluates an MIR system - in this case an acoustic fingerprinting system. Replicating this work, in other words, should be similar to replicating many others.
The described algorithm and the evaluation method are only moderately complex and self-contained, and they only depend on generally available tools or methods. Note that this reason is symptomatic of the reproducibility problem: some papers are borderline impossible to replicate.
Contributions

The contributions of this article are either generally applicable or specific to the replicated work. The specific contributions are the verification of the results described by [70] and a solidification of that work. A second contribution lies in a publicly available, verifiable, documented implementation of the method of that paper.
The paper continues by introducing the method that is replicated and the problems encountered while replicating it. Subsequently the same is done for the evaluation. To ameliorate problems with respect to replicability in the original evaluation, an alternative evaluation is proposed. The results are compared and finally a discussion follows in which guidelines are proposed.
As with most acoustic fingerprinting systems, this method consists of a fingerprint extraction step and a search strategy. In the terminology of Figure 17, these are the feature extraction/fingerprint construction step and the matching step.
Fingerprint extraction

The fingerprint extraction algorithm is described in more detail in section 4.2 of [70] but is summarized here as well. First of all, the input audio is resampled to about 5000Hz. On the resampled signal a Hamming windowed FFT with a length of 2048 samples is taken every 64 samples, an overlap of 31/32 or about 97%. In the FFT output, only 33 logarithmically spaced bins between 300Hz and 2000Hz of the magnitude spectrum are used. The energy of frequency band m at frame index n is called E(n, m). Finally, fingerprint bit F(n, m) is constructed from these energies with the following formula: F(n, m) = 1 if E(n, m) - E(n, m+1) - (E(n-1, m) - E(n-1, m+1)) > 0, and F(n, m) = 0 otherwise.
Since the last frequency band is discarded - there is no E(n, m+1) for the last band - only 32 of the original 33 values remain and every FFT frame is reduced to a 32 bit word. Figure ? shows, for a three second audio fragment, a) the fingerprints of the original, b) those of a 128kb/s CBR MP3 encoded version and c) the difference between the two: the number of places where the two binary words differ (marked in red in Figure ?). This distance measure is also known as the Hamming distance or the bit error rate (BER).
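The bit derivation and the distance measure can be summarized in a short sketch. The matrix of band energies is assumed to be computed beforehand (33 band energies per FFT frame); the names are hypothetical and the code is an interpretation of the description above, not the original implementation.

```java
/** Sketch of the fingerprint bit derivation and the bit error rate (BER). */
public class FingerprintBitsSketch {

    /**
     * Derives the 32 bit fingerprint of frame n from a matrix of band energies
     * energy[frame][band], with 33 logarithmically spaced bands per frame.
     * Bit m is set when the energy difference between bands m and m+1 increases
     * with respect to the previous frame, following the formula above.
     */
    public static int fingerprint(double[][] energy, int n) {
        int word = 0;
        for (int m = 0; m < 32; m++) {
            double diff = (energy[n][m] - energy[n][m + 1])
                        - (energy[n - 1][m] - energy[n - 1][m + 1]);
            if (diff > 0) {
                word |= 1 << m;
            }
        }
        return word;
    }

    /** Hamming distance between two 32 bit fingerprints: the number of differing bits. */
    public static int hammingDistance(int fingerprintA, int fingerprintB) {
        return Integer.bitCount(fingerprintA ^ fingerprintB);
    }
}
```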
Figure ? provides a bit more insight into the BER in two cases. In the first case a high quality query, a 128 kb/s CBR encoded MP3, is compared with the reference and only a small number of bits change; note that there are quite a few places where the BER is zero. The other case uses a low quality GSM codec; the BER, in this case, is always above zero.
The original paper includes many details about the chosen parameters. It defines an FFT size, window function and sample rate, which is a good start. Unfortunately the parameters are not used consistently throughout the paper: two different values are each reported twice for the FFT step size. In the replicated implementation a step size of 64 samples is used.
We further argue that even a detailed, consistent textual description of an algorithm always leaves some wiggle room for different interpretations [145]. Only if source code is available, together with details on which system - software and hardware - the evaluation was done, does an exact replication become feasible. The source code could also include bugs that perhaps affect the results; bugs will, by definition, not be described as such in a textual description.
This strengthens the case that source code should be an integral part of a scientific work. Readers interested in further details of the new implementation are referred to the source code in the supplementary material.
Search strategy

The basic principle of the search strategy is a nearest neighbor search in Hamming space. For each fingerprint extracted from a query, a list of near neighbors is fetched which ideally includes a fingerprint from the matching audio fragment. The actual matching fragment will be present in most lists of near neighbors. A naive approach would compute the Hamming distance between a query fingerprint and each fingerprint in the reference database. This approach can be improved with algorithms that evaluate only a tiny fraction of the reference database yet yield the same retrieval rates. The details of the search strategy are much less critical than the parameters of the fingerprint extraction step: as long as the nearest neighbor search algorithm is implemented correctly, the only difference will be the speed at which a query is resolved.
The search strategy’s main parameter is the supported Hamming distance. With an increased Hamming distance more degraded audio can be retrieved, but the search space quickly explodes: for fingerprints of n bits and a maximum Hamming distance d, the number of candidates to consider per fingerprint equals the sum of the binomial coefficients C(n, k) for k from 0 to d.
A good search strategy strikes a balance between query performance and retrieval rate. The search strategy from the original work does this by keeping track of which bits of a fingerprint are uncertain: the uncertain bits are those for which the energy difference is close to the threshold. It assigns each bit a value from 1 to 32 that describes the confidence in the bit, with 1 being the least reliable bit and 32 the most reliable. Subsequently, a search is done for the fingerprint itself and for fingerprints which are permutations of the original with one or more uncertain bits toggled. To strike that balance between performance and retrieval rate, the number of toggled bits needs to be chosen. If the three least reliable bits are toggled, this generates 2^3 = 8 permutations, which is much less than the C(32,3) = 4960 possibilities of flipping 3 bits anywhere in the 32 bit fingerprint.
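A sketch of this candidate generation is given below. It assumes the reliability information is available as an array of bit positions ordered from least to most reliable; names are hypothetical and the code illustrates the idea rather than the original implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: generate candidate fingerprints by toggling the least reliable bits. */
public class CandidateGenerationSketch {

    /**
     * Returns the 2^k permutations of a fingerprint obtained by toggling any
     * subset of its k least reliable bits (the original fingerprint included).
     *
     * @param fingerprint       a 32 bit fingerprint extracted from the query
     * @param leastReliableBits bit positions (0-31), ordered from least to most reliable
     * @param k                 how many unreliable bits may be toggled (e.g. 3 gives 8 candidates)
     */
    public static List<Integer> candidates(int fingerprint, int[] leastReliableBits, int k) {
        List<Integer> result = new ArrayList<>();
        for (int subset = 0; subset < (1 << k); subset++) {
            int candidate = fingerprint;
            for (int i = 0; i < k; i++) {
                if ((subset & (1 << i)) != 0) {
                    candidate ^= 1 << leastReliableBits[i]; // toggle this unreliable bit
                }
            }
            result.add(candidate);
        }
        return result;
    }
}
```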
Once a matching fingerprint is found, the next step is to compare a set of fingerprints of the query with the corresponding set of fingerprints of the reference audio. The Hamming distance for each fingerprint pair is calculated; if the sum of the distances is below a threshold, a match is declared, otherwise the search continues until either a match is found or the query is labeled as unknown. The parameters were determined experimentally in the original work: 256 fingerprints are checked and the threshold for the summed Hamming distance is 2867 bits. So from a total of 256 × 32 = 8192 bits, 2867 or about 35% are allowed to be different.
The implementation is done with two hash tables. The first is a lookup table with fingerprints as keys and lists of (track identifier, offset) pairs as values: the identifier refers uniquely to a track in the reference database and the offset points precisely to the time at which the fingerprint appears in that track. The second hash table has a track identifier as key and an array of fingerprints as value. Using the offset, the index into the fingerprint array can be determined. Subsequently, the previous 256 fingerprints of the query can be compared with the corresponding fingerprints in the reference set, and a match can be verified.
Implementing this search strategy is relatively straightforward.
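A minimal sketch of the two hash tables and the match verification is given below. Class and method names are hypothetical; it illustrates the data structures described above rather than the implementation in the supplementary material.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the two hash tables described above (hypothetical names). */
public class FingerprintIndexSketch {

    /** A (track identifier, offset) pair: where a fingerprint occurs in the reference set. */
    public static class Occurrence {
        final int trackId;
        final int offset; // frame index within the track
        Occurrence(int trackId, int offset) {
            this.trackId = trackId;
            this.offset = offset;
        }
    }

    // Fingerprint -> places where that fingerprint occurs in the reference database.
    private final Map<Integer, List<Occurrence>> lookup = new HashMap<>();
    // Track identifier -> the full array of fingerprints of that track.
    private final Map<Integer, int[]> fingerprintsByTrack = new HashMap<>();

    /** Adds a reference track to the index. */
    public void add(int trackId, int[] fingerprints) {
        fingerprintsByTrack.put(trackId, fingerprints);
        for (int offset = 0; offset < fingerprints.length; offset++) {
            lookup.computeIfAbsent(fingerprints[offset], key -> new ArrayList<>())
                  .add(new Occurrence(trackId, offset));
        }
    }

    /** Candidate occurrences for one (possibly bit-toggled) query fingerprint. */
    public List<Occurrence> candidatesFor(int fingerprint) {
        return lookup.getOrDefault(fingerprint, new ArrayList<>());
    }

    /**
     * Verifies a candidate: sums the Hamming distances over a block of aligned
     * fingerprints and compares the sum with a threshold (2867 bits for a block
     * of 256 fingerprints in the original work).
     */
    public boolean verify(int[] queryBlock, int trackId, int refStart, int threshold) {
        int[] reference = fingerprintsByTrack.get(trackId);
        int total = 0;
        for (int i = 0; i < queryBlock.length && refStart + i < reference.length; i++) {
            total += Integer.bitCount(queryBlock[i] ^ reference[refStart + i]);
        }
        return total <= threshold;
    }
}
```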
The evaluation of the system is done in two ways. First we aim to replicate the original evaluation and match the original results as closely as possible, to validate both the new implementation and the original work. The original evaluation is not easily replicated since it uses copyrighted evaluation audio with ambiguous descriptions, a data set that is neither available nor described, and modifications that are detailed only up to a certain degree.
The second evaluation is fully replicable: it uses freely available evaluation audio, a data set with creative commons music and modifications that are encoded in scripts. Interested readers are encouraged to replicate the results in full.
Replication of the original evaluation

The evaluation of the original system is done on four short excerpts from commercially available tracks:
’We selected four short audio excerpts (Stereo, 44.1kHz, 16bps) from songs that belong to different musical genres: “O Fortuna” by Carl Orff, “Success has made a failure of our home” by Sinead o’Connor, “Say what you want” by Texas and “A whole lot of Rosie” by AC/DC.’
Unfortunately, it fails to mention which excerpts were used or even how long these excerpts were. The selection does have an effect on performance: if, for instance, a part with little acoustic information is selected versus a dense part, different results can be expected. It also fails to mention which edition, version or release is employed, which is especially problematic for the classical piece: many performances exist, with varying lengths and intensities. The paper also mentions a reference database of 10 000 tracks but fails to specify which tracks it contains. The fact that only one excerpt from each song is used for evaluation makes the selection critical, which is problematic in itself. Reporting an average performance with standard deviations would have been more informative.
To evaluate the robustness of the system, each short excerpt is modified in various ways. The modifications to the query are described well, but there is still room for improvement. For example, it is not mentioned how time-scale modification is done: there are different audible artifacts - i.e. different results - when a time domain or a frequency domain method for time-scale modification is used. The description of the echo modification seems to have been forgotten altogether, while the dry/wet mix or delay length parameters definitely have a large effect on the sound and the subsequent results.
To summarize: essential information to replicate the results exactly is missing. The next best thing is to follow the basic evaluation method, which can be replicated by following various clues and making assumptions. To this end, the four tracks mentioned previously were bought from a digital music store (7digital, see Table ?). Two were available in a lossless format and two in a high quality MP3 format (320 kb/s CBR). The test data set cannot be freely shared since commercial music is used, which, again, hinders replicability.
| Identifier | Track | Format |
|---|---|---|
| 56984036 | Sinead | 320kbs MP3 |
| 52740482 | AC/DC | 16-bit/44.1kHz FLAC |
| 122965 | Texas | 320kbs MP3 |
| 5917942 | Orff | 16-bit/44.1kHz FLAC |
The original evaluation produces two tables. The first documents the bit error rates (BER, Table ?). It compares the fingerprints extracted from a reference recording with those of modified versions. If all bits are equal, the error rate is zero; if all bits are different, the error rate is one. Comparing random fingerprints results in a bit error rate of around 0.5. The original article suggests that 256 fingerprints (about three seconds of audio) are compared and that the average is reported. Experimentally, the original article determines that a BER of 0.35 or less is sufficient to claim that two excerpts are the same with only a very small chance of yielding a false positive. The BER evaluation has been replicated, but because the excerpts are not identical and the modifications also deviate slightly, the replicated BER values differ. However, if the original and replicated results are compared using a Pearson correlation, there is a very strong linear relation. This analysis suggests that the system behaves similarly for the various modifications. The analysis left out the white noise condition, which is an outlier: the replicated modification probably mixed more noise into the signal than the original. Some modifications could not be successfully replicated, either because they are no longer technically relevant (cassette tape, Real Media encoding) or because the method to perform the modification was unclear (GSM C/I).
A second table (Table ?) shows how many of 256 fingerprints could be retrieved in two cases. The first case tries to find only exact matches in the database; the reported number shows how many of the 256 fingerprints point to the matching fingerprint block in the database, with a maximum of 256 if all fingerprints match. In the second case the 10 most unreliable bits are flipped, resulting in 1024 candidates per fingerprint, which are then matched with the database. In both cases only one correct hit is needed to successfully identify an audio excerpt.
The original results are compared with the replicated results using a Pearson correlation. The exact matching case shows a strong linear correlation and the case with 10 flipped bits shows similar results. This suggests that the system behaves similarly, considering that the audio excerpts, the modifications and the implementation include differences and that various assumptions had to be made.
A replicable evaluation

The original evaluation has several problems with respect to replicability. It uses commercial music but fails to mention which exact audio is used, both for the reference database and for the evaluation. The process to generate the modifications is documented but still leaves room for interpretation. There are also other problems: the evaluation depends on the selection of only four audio excerpts.
The ideal acoustic fingerprinting system evaluation depends on the use-case. For example, the evaluation method described by [150] focuses mainly on broadcast monitoring and the specific modifications that appear when broadcasting music over the radio. The SyncOccur corpus [151] also focuses on this use-case. An evaluation of an acoustic fingerprinting system for DJ-set monitoring [180] or sample identification [197] needs another approach. These differences in focus lead to a wide variety of evaluation techniques, which makes systems hard to compare directly. The evaluation described here evaluates a fingerprinting system for (re-encoded) duplicate detection with simple degradations.
The evaluation is done as follows. Using a script, available as supplementary material, 10,100 creative commons licensed musical tracks are downloaded from Jamendo, a music sharing service. 10,000 of these tracks are added to the reference database; the remaining 100 are not. The script provides the list of Jamendo track identifiers that make up the reference database. Using another script, 1100 queries are selected at random.
Once the results are available, each query is checked to determine whether it is a true positive (TP), false positive (FP), true negative (TN) or false negative (FN). Figure 18 is a graphical reminder of these measures. Next to TP, FP, TN and FN, the sensitivity, specificity, precision and accuracy are calculated as well. Table ? gives the relation between these measures.
Sensitivity | TP / (TP + FN) |
Specificity | TN / (TN + FP) |
Precision | TP / (TP + FP) |
Accuracy | (TP + TN) / (TP + TN + FP + FN) |
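These definitions translate directly into code; the following trivial sketch (hypothetical names) is included only for completeness.

```java
/** Sketch of the evaluation measures computed from the query counts. */
public class EvaluationMeasuresSketch {
    public static double sensitivity(int tp, int fn) { return tp / (double) (tp + fn); }
    public static double specificity(int tn, int fp) { return tn / (double) (tn + fp); }
    public static double precision(int tp, int fp)   { return tp / (double) (tp + fp); }
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (tp + tn) / (double) (tp + tn + fp + fn);
    }
}
```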
Table ? summarizes the results. As expected, the system’s specificity and precision are very high. The few cases where a false positive is reported are due to audio duplicates in the reference database: the reference database contains a few duplicate items where the audio is either completely the same or where parts of another track are sampled. Note that the evaluation is done at the track level; the time offset is not taken into account. Since exact repetition is not uncommon, especially in electronic music, a query can be found at multiple, equally correct time offsets. If the system returns the correct track identifier with an unexpected offset, it is still counted as a true positive.
The sensitivity and accuracy of the system go down when the average number of bit errors per fingerprint approaches the threshold of 10 erroneous bits. True positives for GSM encoded material are only found about half of the time. The average Hamming distance in bits for queries with a changed time scale is higher than for the GSM encoded queries, while the accuracy is much higher. This means that for the GSM encoded material the reliability information is itself not reliable: the 10 least reliable bits are flipped, but still the original fingerprint is not found for about half of the queries.
There are some discrepancies between these results and the results reported in the original study. The average Hamming distance between queries and reference is higher in the new evaluation. This is potentially due to the use of 128kbs MP3s during the evaluation: the original material is decoded before being stored in the reference database and the queries are re-encoded after modification. Another discrepancy is related to the GSM encoded queries: the original results seem to suggest that all GSM encoded queries would yield a true positive (see Table ?), which was not achieved in the replication. Whether this is due to incorrect assumptions, different source material, the evaluation method or other causes is not clear.
As the statistical comparison showed, the replicated system generally behaves in a similar way to the originally described system. On top of that, an alternative, reproducible evaluation showed that following the system’s design yields a functional acoustic fingerprinting system. There are, however, unexplained discrepancies between both systems, especially concerning the GSM modification. It is worrisome that it is impossible to pinpoint the source of these discrepancies, since neither the original evaluation material, nor the evaluation method, nor the implementation is available. While there is no guarantee that the replication is bug free, at least its source can be checked.
All in all, the results are quite similar to the original. As stated in the introduction, the replication of results should be expected to pose no problem; it is the replication of methods and the accessibility of data that make replication prohibitively time-consuming. This could be alleviated by releasing research code and data. While the focus of the MIR community should remain on producing novel techniques to deal with musical information and not on producing end-user ready software, it would be beneficial for the field to keep sustainable software aspects in mind when releasing research prototypes - aspects such as those identified by [83], where a distinction is made between usability (documentation, installability, ...) and maintainability (identity, copyright, accessibility, interoperability, ...).
Intellectual property rights, copyrights on music and a lack of incentive pose a problem for reproducibility of MIR research work. There are, however, ways to deal with these limiting factors and foster reproducible research. We see a couple of work-arounds and possibilities, which are described below.
As universities are striving more and more for open-access publications, there could be a similar movement for data and code. After all, it makes little sense to publish only part of the research in the open (the textual description) while keeping code and data behind closed doors, especially if the research is funded by public funds. In Europe, there is an ambition to make all scientific articles freely available by 2020 and to achieve optimal reuse of scientific data.
Copyrights on music make it hard to share music freely. We see two ways to deal with this:
Pragmatic vs Ecological or Jamendo vs iTunes. There is a great deal of freely available music published under various creative commons licenses. Jamendo, for example, contains half a million cc-licensed tracks which are uniquely identifiable and can be downloaded via an API. Much of the music found there is recorded at home with limited means; it contains only a few professionally produced recordings. This means that systems can behave slightly differently on the Jamendo set compared with a set of commercial music: what is gained in pragmatism is perhaps lost in ecological validity. Whether this is a problem depends very much on the research question at hand. In the evaluation proposed here, Jamendo was used (similarly to [182]) since it offers a large variability in genres and is representative for this use-case.
Audio vs Features. Research on features extracted from audio does not need the audio itself: if the features are available, this can suffice. There are two large sets of audio features: the Million Song Dataset by [14] and AcousticBrainz, described by [149]. Both ran feature extractors on millions of commercial tracks and have an API to query or download the data. Unfortunately the source of the feature extractors used in the Million Song Dataset is not available and is only described up to a certain level of detail, which makes it a black box and, in my eyes, unfit for decent reproducible science. Indeed, due to internal reorganizations and mergers the API and the data have become less and less available; the science built on the Million Song Dataset is on shaky ground. Fortunately AcousticBrainz is completely transparent: it uses well documented, open source software [16] and the feature extractors are reproducible. The main shortcoming of this approach is that only a curated set of features is available. If another feature is needed, then you are out of luck: adding a feature is far from trivial, since even AcousticBrainz has no access to all audio; they rely on crowdsourced feature extraction.
Providing an incentive for researchers to make their research reproducible is hard and requires a mentality shift. Policies of journals, conference organizers and research institutions should gradually change to require reproducibility. There are a few initiatives to foster reproducible research specifically for music informatics research. The 53rd Audio Engineering Society (AES) conference had a prize for reproducibility. ISMIR 2012 had a tutorial on “Reusable software and reproducibility in music informatics research”, but structural attention for this issue at ISMIR seems to be lacking. There is, however, a yearly workshop organized by Queen Mary University London (QMUL) on “Software and Data for Audio and Music Research”:
The third SoundSoftware.ac.uk one-day workshop on “Software and Data for Audio and Music Research” will include talks on issues such as robust software development for audio and music research, reproducible research in general, management of research data, and open access.
At QMUL there seems to be continuous attention to the issue, and researchers are trained in software craftsmanship.
In this article we problematized reproducibility in MIR and illustrated this by replicating an acoustic fingerprinting system. While similar results were obtained, there are unexplained and unexplainable discrepancies due to the fact that the original data, method and evaluation are only partly available and assumptions had to be made. We proposed an alternative, reproducible evaluation and extrapolated general guidelines aiming to improve the reproducibility of MIR research in general.
Keywords: MIR applications, documentation, collaboration, digital music archives.
Music Information Retrieval (MIR) technologies have a lot of untapped potential in the management of digital music archives, and there seem to be several reasons for this. One is that MIR technologies are simply not well known to archivists. Another reason is that it is often unclear how MIR technology can be applied in a digital music archive setting. A third reason is that considerable effort is often needed to transform a potentially promising MIR research prototype into a working solution for archivists as end-users.
In this article we focus on duplicate detection. It is an MIR technology that has matured over the last two decades for which there is usable software available. The aim of the article is to describe several applications for duplicate detection and to encourage the communication about them to the archival community. Some of these applications might not be immediately obvious since duplicate detection is used indirectly to complement meta-data, link or merge archives, improve listening experiences and it has opportunities for segmentation. These applications are grounded in experience with working on the archive of the Royal Museum for Central Africa, a digitised audio archive of which the majority of tracks are field recordings from Central Africa.
The problem of duplicate detection is defined as follows:
How to design a system that is able to compare every audio fragment in a set with all other audio in the set to determine if the fragment is either unique or appears multiple times in the complete set. The comparison should be robust against various artefacts.
The artefacts in the definition above include noise from various sources: imperfections introduced during the analog-to-digital (A/D) conversion, and artefacts resulting from mechanical defects, such as clicks from gramophone discs or magnetic tape hum. Detecting duplicates should also be possible when changes in volume, compression or dynamics are introduced.
[140] distinguishes between exact, near and far duplicates. Exact duplicates contain exactly the same information; near duplicates are two tracks with minor differences, e.g. a lossless and a lossy version of the same audio. Far duplicates are less straightforward. A far duplicate can be an edit where parts are added to the audio – e.g. a radio versus an album edit with a solo added. Live versions or covers of the same song can also be regarded as far duplicates, as can a song that samples an original. In this work we focus on duplicates which contain the same recorded material as the original. This includes samples and edits but excludes live versions and covers.
The need for duplicate detection is there since, over time, it is almost inevitable that duplicates of the same recording end up in a digitised archive. For example, an original field recording is published on an LP, and both the LP and the original version get digitised and stored in the same lot. It is also not uncommon that an archive contains multiple copies of the same recording because the same live event was captured from two different angles (normally on the side of the parterre and from the orchestra pit), or because, before the advent of digital technology, copies of degrading tapes were already being made on other tapes. Last but not least, the chance of duplicates grows exponentially when different archives or audio collections get connected or virtually merged, which is a desirable operation and one of the advantages introduced by digital technology (see ?).
From a technical standpoint and using the terminology by [32] a duplicate detector needs to have the following requirements:
It needs to be capable of marking duplicates without generating false positives or missing true positives. In other words, precision and recall need to be acceptable.
It should be capable of operating on large archives: it should be efficient. Efficient here means quick when resolving a query, and economical with storage and memory when building an index.
Duplicates should be marked as such even if there is noise or the speed is not kept constant. It should be robust against various modifications.
Lookup of short audio fragments should be possible: the algorithm should be granular. A resolution of 20 seconds or less is beneficial.
Once such a system is available, several applications are possible. [140] describes many of these applications as well but, notably, the application of re-using segmentation boundaries is missing.
Being aware of duplicates is useful to check or complement meta-data. If an item has richer meta-data than a duplicate, the meta-data of the duplicate can be integrated. With a duplicate detection technology conflicting meta-data between an original and a duplicate can be resolved or at least flagged. The problem of conflicting meta-data is especially prevalent in archives with ethnic music where often there are many different spellings of names, places and titles. Naming instruments systematically can also be very challenging.
When multiple recordings in sequence are marked as exact duplicates, meaning they contain the exact same digital information, this indicates inefficient storage use. If they do not contain exactly the same information, it is possible that either the same analogue carrier was accidentally digitised twice or there are effectively two analogue copies with the same content. To improve the listening experience the highest-quality digitised version can be returned on request, or, alternatively, to assist philological research all the different versions (variants, witnesses of the archetype) can be returned.
It potentially solves segmentation issues. When an LP is digitised as one long recording and the same material has already been segmented in another digitisation effort, the segmentation boundaries can be reused. Duplicate detection also makes it possible to identify when different segmentation boundaries are used: perhaps an item was not segmented in one digitisation effort while a partial duplicate is split and has an extra meta-data item – e.g. an extra title. Duplicate detection thus allows re-use of segmentation boundaries or, at the bare minimum, indicates segmentation discrepancies.
Technology makes it possible to merge or link digital archives from different sources – e.g. the creation of a single point of access to documentation from different institutions concerning a special subject; the implementation of the “virtual re-unification” of collections and holdings from a single original location or creator now widely scattered [82]. More and more digital music archives ‘islands’ are bridged by efforts such as Europeana Sounds. Europeana Sounds is a European effort to standardise meta-data and link digital music archives. The EuropeanaConnect/DISMARC Audio Aggregation Platform provides this link and could definitely benefit from duplicate detection technology and provide a view on unique material.
If duplicates are found in one of these merged archives, all previous duplicate detection applications come into play as well. How similar is the meta-data between original and duplicate? How large is the difference in audio quality? Are both original and duplicate segmented similarly or is there a discrepancy?
Robustness to speed change. Duplicate detection robust to speed changes has an important added value. When the playback (or recording) speed of an analogue carrier changes, both tempo and pitch change accordingly. Most people are familiar with the effect of playing a 33 rpm LP at 45 rpm. But the problem with historic archives and analogue carriers is more subtle: the speed at which a tape gets digitised might not match the original recording speed, which affects the resulting pitch. Often it is impossible to determine with reasonable precision whether the recording device was defective or inadequately operated, or whether the portable recorder was slowly running out of battery.
So not only is it nearly impossible to make a good estimation of the original non-standard recording speed, the speed might not be constant at all: it could actually fluctuate ‘around’ a standard speed. This is also a problem with wax cylinders, where numerous speed indications exist but are not systematically used – if indications are present at all. Since this problem cannot be solved with exact precision, a viable approach, balancing technical needs and philological requirements, is normally to transfer the audio information at standard speed with state-of-the-art, perfectly calibrated machinery. The precision of the A/D transfer system in a way compensates for the uncertainty of the source materials. We still obtain potentially sped-up or slowed-down versions of the recording, but when the original context in which the recording was produced can be reconstructed, it is possible to add and subtract quantities from the digitised version because the transfer speed is exactly known (and its parameters ought to be documented in the preservation meta-data). If the playback speed during transfer is tampered with, adapted or guessed, anything that results in non-standard behaviour in the attempt to match the original recording speed will do nothing but add uncertainty to uncertainty, imprecision to imprecision.
An additional reason to digitise historical audio recordings at standard speed and with state-of-the-art, perfectly calibrated machinery is that, by doing so, the archive master [81] preserves the information on the fluctuations of the original. If we are to “save history, not rewrite it” [17], then our desire to “improve” the quality of the recording during the process of A/D conversion should be held back. Noises and imperfections present in the source carrier bear witness to its history of transmission and as such constitute part of the historical document. Removing or altering any of these elements violates basic philological principles [19] that should be assumed in any act of digitisation which has the ambition to be culturally significant. The output of a process where sources have been altered (with good or bad intentions, consciously or unconsciously, intentionally or unintentionally, or without documenting the interventions) is a corpus that is not authentic, unreliable and for all intents and purposes useless for scientific studies. Therefore, in the light of what has been said so far, the problem of speed fluctuation is structural and endemic in historical analogue sound archives and cannot easily be dismissed. Hence it is crucially important that algorithms treating this type of material consider this problem and operate accordingly.
Some possible applications of duplicate detection have been presented in the previous section; now we see how they can be put into practice. It is clear that naively comparing every audio fragment – e.g. every five seconds – with all other audio in an archive quickly becomes impractical, especially for medium-to-large archives. Adding robustness to speed changes to this naive approach makes it downright impossible. An efficient alternative is needed, and this is where acoustic fingerprinting techniques come into play, a well-researched MIR topic.
The aim of acoustic fingerprinting is to generate a small representation of an audio signal that can be used to reliably identify identical, or recognise similar, audio signals in a large set of reference audio. One of the main challenges is to design a system in such a way that the reference database can grow to contain millions of entries. Over the years several efficient acoustic fingerprinting methods have been introduced [209]. These methods perform well, even with degraded audio quality and with industrial-sized reference databases. However, these systems are not designed to handle duplicate detection when the speed differs between the original and the duplicate. To this end, fingerprinting systems robust against speed changes are desired.
Some fingerprinting systems have been developed that take pitch-shifts into account [58] without allowing time-scale modification. Others are designed to handle both pitch and time-scale modification [214]. The system by [214] employs an image processing algorithm on an auditory image to counter time-scale modification and pitch-shifts. Unfortunately, the system is computationally expensive: it iterates the whole database to find a match. The system by [118] allows extreme pitch-shifting and time-stretching, but has the same problem.
The ideas behind [175] allow efficient duplicate detection robust to speed changes. Such systems are built mainly with the recognition of original tracks in DJ-sets in mind: tracks used in DJ-sets are manipulated in various ways, and often the speed is changed as well. The problem translates almost directly to duplicate detection for archives. The respective research articles show that these systems are efficient and able to recognise audio with a speed change.
Only [175] seems directly applicable in practice since it is the only system for which there is runnable software and documentation available. It can be downloaded from http://panako.be and has been tested with datasets containing tens of thousands of tracks on a single computer. The output is data about duplicates: which items are present more than once, together with time offsets.
The idea behind Panako is relatively simple. Audio enters the system and is transformed into a spectral representation. In the spectral domain, peaks are identified. Some heuristics are used to detect only salient, identifiable peaks and to ignore spectral peaks in areas with equal energy – e.g. silent parts. Once peaks are identified, they are bundled to form triplets. Valid triplets only use peaks that are near both in frequency and in time. For performance reasons a peak is also only used in a limited number of triplets. These triplets are the fingerprints that are hashed, stored and ultimately queried for matches.
Exact hashing makes lookup fast but needs to be done diligently to allow retrieval of audio with modified speed. A fingerprint together with a fingerprint extracted from the same audio but with modified speed can be seen in Figure 19. While the absolute time values change, the ratios remain the same: for three peaks at times $t_1$, $t_2$ and $t_3$, the ratio $(t_2 - t_1)/(t_3 - t_1)$ is unaffected by a speed change. The same holds true for the frequency ratios. This information is used in a hash. Next to the hash, the identifier of the audio is stored together with the start time of the first spectral peak.
Lookup follows a similar procedure: fingerprints are extracted and hashes are formed. Matching hashes from the database are returned and these lists are processed. If the list contains an audio identifier multiple times, and the start times of the matching fingerprints align in time after accounting for an optional linear scaling factor, then a match is found. The linear time-scaling factor is returned together with the match. An implementation of this system was used in the case study.
The Royal Museum for Central Africa, Tervuren, Belgium preserves a large archive with field recordings mainly from Central Africa. The first recordings were made on wax cylinders in the late 19th century; later on all kinds of analogue carriers were used, from various types of gramophone discs to sonofil. During a digitisation project called DEKKMMA (digitisation of the Ethnomusicological Sound Archive of the Royal Museum for Central Africa) [42] the recordings were digitised. Due to its history and size it is reasonable to expect that duplicates can be found in the collection. In this case study we want to identify the duplicates, quantify the similarity in meta-data between duplicates and report the number of duplicates with modified speed. The aim here is not to improve the data itself, since that requires specialists with deep knowledge of the archive to resolve or explain (meta-data) conflicts; we mainly want to illustrate the practical use of duplicate detection.
With Panako [175], fingerprints of 35,306 recordings of the archive were extracted. With the default parameters of Panako this resulted in an index of 65 million fingerprints for 10 million seconds of audio, or 6.5 fingerprints per second. After indexing, each recording was split into pieces of 25 seconds with 5 seconds of overlap, which means a granularity of 20 seconds. Each of those pieces (10,000,000 s / 20 s = 500,000 items) was compared with the index and resulted in a match with itself and potentially one or more duplicates. After filtering out identical matches, 4,940 fragments of 25 seconds were found to be duplicates. The duplicate fragments originated from 887 unique recordings, which means that 887 recordings (2.5%) were found to be (partial) duplicates. Thanks to the efficient algorithm, this whole process requires only modest computational power: it was performed on an Intel Core2 Quad CPU Q9650 @ 3.00GHz with 8GB RAM, introduced in 2009.
Due to the nature of the collection, some duplicates were expected. In some cases the collection contains both the digitised version of a complete side of an analogue carrier and segmented recordings. Eighty duplicates could potentially be explained in this way thanks to similarities in the recording identifier. In the collection, recordings have an identifier that follows this scheme:
collection_name.year.collection_id.subidentifier-track
If a track identifier contains A or B it refers to a side of an analog carrier (cassette or gramophone disc). The pair of recordings MR.1979.7.1-A1 and MR.1979.7.1-A6 suggests that A1 contains the complete side and A6 is track 6 on that side. The following duplicate pair suggests that the same side of a carrier has been digitised twice but stored under two identifiers: MR.1974.23.3-A and MR.1974.23.3-B. Unfortunately this means that one side is probably not digitised.
The roughly 800 other duplicates do not have similar identifiers and lack a straightforward explanation. These duplicates must have accumulated over the years; potentially, duplicates entered the archive in the form of analogue copies in donated collections. Listening to both versions makes clear that some do not originate from the same analogue carrier. The supplementary material contains some examples. Next, we compare the meta-data differences between originals and duplicates.
Differences in meta-data. Since the duplicates originate from the same recorded event, the original and duplicate should have identical or very similar meta-data describing their content. Unfortunately this is not the case. In general, meta-data implementation depends on the history of an institution. In this case the older field recordings were often made by priests or members of the military who did not follow a strict methodology to describe the musical audio and its context. Changes in geographical nomenclature over time, especially in Africa, are also a confounding factor [43]. There is also a large number of vernacular names for musical instruments: the lamellophone, for example, is known as Kombi, Kembe, Ekembe, Ikembe, Dikembe and Likembe [43], to name only a few variations. On top of that, the majority of the Niger-Congo languages are tonal (Yoruba, Igbo, Ashanti, Ewe), which further limits accurate, consistent description with a western alphabet. These factors, combined with human error in transcribing and digitising information, result in an accumulation of inaccuracies. Figure ? shows the physical meta-data files. If there are enough duplicates in an archive, duplicate detection can serve as a window on the quality of the meta-data in general.
Field | Empty | Different | Exact match | Fuzzy or exact match
---|---|---|---|---
Identifier | 0.00% | 100.00% | 0.00% | 0.00%
Year | 20.83% | 13.29% | 65.88% | 65.88%
People | 21.17% | 17.34% | 61.49% | 64.86%
Country | 0.79% | 3.15% | 96.06% | 96.06%
Province | 55.52% | 5.63% | 38.85% | 38.85%
Region | 52.03% | 12.16% | 35.81% | 37.95%
Place | 33.45% | 16.67% | 49.89% | 55.86%
Language | 42.34% | 8.45% | 49.21% | 55.74%
Functions | 34.12% | 25.34% | 40.54% | 40.54%
Title | 42.23% | 38.40% | 19.37% | 30.18%
Collector | 10.59% | 14.08% | 75.34% | 86.71%
Table 3 shows the results of the meta-data analysis. For every duplicate a pair of meta-data elements is retrieved and compared; they are either empty, match exactly or differ. Some pairs match quite well but not exactly: it is clear that the title of the original O ho yi yee yi yee is very similar to the title of the duplicate O ho yi yee yie yee. To capture such similarities as well, a fuzzy string matching algorithm based on Sørensen–Dice coefficients is employed (a minimal sketch follows the table below). When comparing the title of an original with that of a duplicate, only 19% match exactly; if fuzzy matches are included, 30% match. The table makes clear that titles often differ, while country is the most stable meta-data field. It also makes clear that the overall quality of the meta-data leaves much to be improved. Correctly merging meta-data fields requires specialist knowledge - is it yie or yi - and individual inspection, which falls outside the scope of this case study.
Original title | Duplicate title
---|---
Warrior dance | Warriors dance
Amangbetu Olia | Amangbetu olya
Coming out of walekele | Walekele coming out
Nantoo | Yakubu Nantoo
O ho yi yee yi yee | O ho yi yee yie yee
Enjoy life | Gently enjoy life
Eshidi | Eshidi (man’s name)
Green Sahel | The green Sahel
Ngolo kele | Ngolokole
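As announced above, a bigram-based Sørensen–Dice similarity can be implemented in a few lines. The sketch below illustrates the general technique used for the fuzzy comparison; it is not the exact implementation used in the case study, and the threshold for accepting a fuzzy match is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

public final class DiceSimilarity {

    /** Returns a similarity in [0,1] based on shared character bigrams. */
    static double dice(String a, String b) {
        Map<String, Integer> bigramsA = bigrams(a.toLowerCase());
        Map<String, Integer> bigramsB = bigrams(b.toLowerCase());
        int overlap = 0, total = 0;
        for (Map.Entry<String, Integer> e : bigramsA.entrySet()) {
            overlap += Math.min(e.getValue(), bigramsB.getOrDefault(e.getKey(), 0));
        }
        for (int c : bigramsA.values()) total += c;
        for (int c : bigramsB.values()) total += c;
        return total == 0 ? 0.0 : 2.0 * overlap / total;
    }

    /** Counts overlapping two-character substrings. */
    private static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A close pair from the table above scores near 1.0.
        System.out.println(dice("O ho yi yee yi yee", "O ho yi yee yie yee"));
    }
}
```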
Speed modifications. In our dataset only very few items with modified speed were detected: for 98.8% of the identified duplicates the speed matches exactly between original and duplicate. For the remaining 12 identified duplicates the speed change lies in a limited range, from -5% to +4%. These 12 pieces must have multiple analogue carriers in the archive. Perhaps copies were made with recording equipment that was not calibrated, or, if the live event was captured from multiple angles, the calibration of the original recorders may not have been consistent. There are a number of reasons why a digitised archive ends up containing copies of the same content at slightly different speeds, but it is normally desirable that the cause lies in the attributes of the recordings before digitisation and is not introduced during the digitisation process. Our case study shows that duplicates can be successfully detected even when speed is modified. How this is done is explained in the following section.
In this section, the practical functioning of Panako is described. The Panako acoustic fingerprinting suite is Java software and needs a recent Java Runtime. The Java Runtime and TarsosDSP [174] are the only dependencies of the Panako system; no other software needs to be installed. Java makes the application multi-platform and compatible with most software environments. Panako has a command-line interface, so users are expected to have a basic understanding of their command-line environment.
Panako contains a deduplicate command which expects either a list of audio files or a text file containing the full paths of audio files separated by newlines. The text file approach is more practical for large archives. After running the deduplicate program, a text file contains the full paths of duplicate files together with the times at which the duplicate audio was detected.
Several parameters need to be set for a successful de-duplication. The main parameters determine the granularity level, the allowed modifications and the performance level. The granularity level determines the size of the audio fragments that are used for de-duplication: if this is set to 20 seconds instead of 10, the number of queries is, obviously, halved. If the speed is expected to be relatively stable, a parameter can be set to limit the allowed speed change. The performance can be tuned by choosing the number of fingerprints that are extracted per second. Together, these parameters determine trade-offs between query speed, storage size and retrieval performance. The default parameters should make the system perform reasonably effectively in most cases.
The indirect application of linking meta-data depends on the organization of the meta-data of the archive, but has some common aspects. First, the audio identifiers of duplicates are arranged in original/duplicate pairs. Subsequently, the meta-data of these pairs is retrieved from the meta-data store (e.g. a relational database system). Finally, the meta-data element pairs are compared and resolved. The last step can use a combination of rules to automatically merge meta-data and manual intervention when a meta-data conflict arises. The manual intervention requires analysis to determine the correct meta-data element for both original and duplicate.
Reuse of segmentation boundaries needs similarly custom solutions. However, there are again some commonalities. First, audio identifiers from the segmented set are identified within the unsegmented set, resulting in a situation as in Figure 20. The identified segment boundaries can subsequently be reused. Finally, segments are labelled. Since these tasks depend heavily on file formats, database types, meta-data formats and context in general, it is hard to offer a general solution. This means that, while the duplicate detection system is relatively user-friendly and ready to use, applying it still needs a software developer but not, and this is crucial, an MIR specialist.
In this paper we described possible applications of duplicate detection techniques and presented a practical solution for duplicate detection in an archive of digitised African field recordings. More specifically, applications were discussed to complement meta-data, to link or merge digital music archives, to improve listening experiences and to re-use segmentation data. In the case study on the archive of the Royal Museum for Central Africa we were able to show that duplicates can be successfully identified. We have shown that the meta-data in that archive differs significantly between original and duplicate. We have also shown that duplicate detection is robust to speed variations.
The archive used in the case study is probably very similar to many other archives of historic recordings, and similar results can be expected. In the case study we have shown that the acoustic fingerprinting software Panako is mature enough for practical application in the field today. We have also given practical instructions on how to use the software. It should be clear that all music archives can benefit from this technology, and we encourage archives to experiment with duplicate detection since only modest computing power is needed, even for large collections.
This chapter bundles four articles that are placed more towards the services region of the humanities-engineering plane depicted in Figure 1. They offer designs and implementations of tools to support certain research tasks. The four works bundled here are:
The first [174] describes a software library which originated as a set of pitch estimation algorithms for Tarsos [173]. Over the years more and more audio processing algorithms were added to the library and to highlight its inherent value it was separated from Tarsos and made into a reusable component. It was presented with a poster presentation at an Audio Engineering Society conference in London. At the conference TarsosDSP was acknowledged as a ‘reproducibility-enabling work’ which is in my view a requirement for valuable services to research communities. TarsosDSP was picked up in research [155] and in many interactive music software projects.
The second work [175] describes a novel acoustic fingerprinting algorithm. It can be placed in the services category since it offers a publicly verifiable implementation of this new algorithm together with several baseline algorithms. Next to the implementation, a reproducible evaluation methodology is described, and the code to run the evaluation is open as well; this can be seen as a second service to the community. This approach has been moderately successful, since Panako has already been used multiple times as a baseline to compare with other, newer systems [181]. It was presented at the ISMIR conference of 2014 in Taipei, Taiwan during a poster presentation.
The third work [176] included in this chapter is a journal article that describes a way to synchronize heterogeneous research data geared towards music and movement research. Next to the description there is also a verifiable implementation publicly available. It falls into the services category since synchronizing data before analysis is a problematic research task many researchers deal with. The method is especially applicable to research on the interaction between movement and music, since this type of research needs many different measurement devices and wearable sensors that are not easily synchronized. It has been used to sync, amongst others, the dataset of [51].
Finally, the fourth and last work [177] presents meta-data to a user synchronized with the music in their environment. Meta-data is defined broadly, meaning any type of additional information stream that enriches the listening experience (for example lyrics, light effects or video). This technology could also be employed as a service to support a research task: for example, if during an experiment dancers need to be presented with tactile stimuli, this technology could be used. Essentially it is an augmented reality technology or, more generally, a technology to establish a computer-mediated reality.
Frameworks or libraries
Name | Extr. | Synth | R-T | Tech
---|---|---|---|---
Aubio | True | False | True | C
CLAM | True | True | True | C
CSL | True | True | True | C++
Essentia | True | False | True | C++
Marsyas | True | True | False | C++
SndObj | True | True | True | C++
Sonic Visualizer | True | False | False | C++
STK | False | True | True | C++
Tartini | True | False | True | C++
YAAFE | True | False | False | C++
Beads | False | True | True | Java
JASS | False | True | True | Java
jAudio | True | False | False | Java
Jipes | True | False | False | Java
jMusic | False | True | False | Java
JSyn | False | True | True | Java
Minim | False | True | True | Java
TarsosDSP | True | True | True | Java
TarsosDSP also fills a need for educational tools for Music Information Retrieval. As identified by [67], there is a need for comprehensible, well-documented MIR frameworks which perform useful tasks on every platform, without the requirement of a costly software package like Matlab. TarsosDSP serves this educational goal; it has already been used by several master's students as a starting point into music information retrieval [11].
The framework tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration for beginning MIR-researchers on how audio processing works in practice. TarsosDSP therefore targets both students and more experienced researchers who want to make use of the implemented features.
After this introduction, a section about the design decisions follows; then the main features of TarsosDSP are highlighted. The fourth section is about the availability of the framework. The paper ends with a conclusion and future work.
To meet the goals stated in the introduction a couple of design decisions were made.
Java based. TarsosDSP was written in Java to allow portability from one platform to another. The automatic memory management facilities of Java are a great boon and allow a clean implementation of audio processing algorithms: the clutter introduced by memory management instructions and the platform-dependent ifdef's typically found in C++ implementations is avoided. The Dalvik Java runtime enables TarsosDSP's algorithms to run unmodified on the Android platform. Java or C++ libraries are often hard to use due to external dependencies; TarsosDSP has no external dependencies except for the standard Java Runtime. Java does have a serious drawback: it struggles to offer a low-latency audio pipeline. If real-time low latency is needed, the environment in which TarsosDSP operates needs to be optimized, e.g. by following the instructions by [85].
The processing pipeline is kept as simple as possible. Currently, only single-channel audio is allowed, which keeps the processing chain extremely straightforward. An AudioDispatcher chops incoming audio into blocks of a requested number of samples, with a defined overlap. Subsequently the blocks of audio are scaled to floats in the range [-1,1]. The wrapped blocks are encapsulated in an AudioEvent object which contains a pointer to the audio, the start time in seconds, and some auxiliary methods, e.g. to calculate the energy of the audio block. The AudioDispatcher sends the AudioEvent through a series of AudioProcessor objects, which each execute an operation on the audio. The core of the algorithms is contained in these AudioProcessor objects; they can e.g. estimate pitch or detect onsets in a block of audio. Note that the size of a block of audio can change during the processing flow, as is the case when a block of audio is stretched in time. For more examples of available AudioProcessor operations see section ?. Figure ? shows a processing pipeline: it shows how the dispatcher chops up audio and how the AudioProcessor objects are linked. Also interesting to note is line 8 of that example, where an anonymous inner class is declared to handle pitch estimation results. The example covers filtering, analysis, effects and playback; the last statement on line 23 bootstraps the whole process.
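Since the code figure referred to above is not reproduced here, the following minimal sketch illustrates such a pipeline: microphone input is dispatched in overlapping blocks and a pitch estimator reports its results through a handler. Package names and constructor signatures are taken from a recent TarsosDSP release and may differ slightly between versions.

```java
import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.io.jvm.AudioDispatcherFactory;
import be.tarsos.dsp.pitch.PitchDetectionHandler;
import be.tarsos.dsp.pitch.PitchDetectionResult;
import be.tarsos.dsp.pitch.PitchProcessor;
import be.tarsos.dsp.pitch.PitchProcessor.PitchEstimationAlgorithm;

public class PitchPipeline {
    public static void main(String[] args) throws Exception {
        int sampleRate = 44100, bufferSize = 2048, overlap = 1024;
        // The dispatcher chops microphone input into overlapping blocks of samples.
        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromDefaultMicrophone(sampleRate, bufferSize, overlap);
        // The handler receives a pitch estimate for every processed block.
        PitchDetectionHandler handler = (PitchDetectionResult result, AudioEvent event) -> {
            if (result.getPitch() != -1) {
                System.out.printf("%.2f s: %.1f Hz%n", event.getTimeStamp(), result.getPitch());
            }
        };
        dispatcher.addAudioProcessor(
                new PitchProcessor(PitchEstimationAlgorithm.YIN, sampleRate, bufferSize, handler));
        dispatcher.run(); // blocks; start on a separate thread in interactive applications
    }
}
```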
Optimizations. TarsosDSP serves an educational goal, therefore the implementations of the algorithms are kept as pure as possible and no obfuscating optimizations are made. Readability of the source code is put before its execution speed; if algorithms are not quick enough, users are invited to optimize the Java code themselves or to look for alternatives, perhaps in another programming language like C++. This is a rather unique feature of the TarsosDSP framework; other libraries take a different approach. jAudio [125] and YAAFE [120], for example, reuse calculations for feature extraction, which makes algorithms more efficient but also harder to grasp. Other libraries still, like SoundTouch, are heavily optimized for execution speed at the cost of readability.
Implemented features. In this section the main implemented features are highlighted. Next to the list below there are boiler-plate features, e.g. to adjust gain, write a WAV file, detect silence, follow an envelope or play back audio. Figure 22 shows a visualization of several features computed with TarsosDSP.
TarsosDSP was originally conceived as a library for pitch estimation and therefore contains several pitch estimators: YIN [49], MPM [127], AMDF [156].
Two onset detectors are provided: one described in [9], and the one used by the BeatRoot system [52].
The WSOLA time-stretch algorithm [205], which allows the speed of an audio stream to be altered without altering its pitch, is included. At moderate time-stretch factors - 80%-120% of the original speed - only limited audible artifacts are noticeable.
A resampling algorithm based on [179] and the related open-source resample software package is included.
A pitch-shifting algorithm, which allows the pitch of audio to be changed without affecting its speed, is formed by chaining the time-stretch algorithm with the resampling algorithm (see the sketch after this list).
As examples of audio effects, TarsosDSP contains a delay and a flanger effect. Both are implemented as minimalistically as possible.
Several IIR filters are included: a single-pass and a four-stage low-pass filter, a high-pass filter, and a band-pass filter.
TarsosDSP also allows audio synthesis and includes generators for sine waves and noise. Also included is a Low Frequency Oscillator (LFO) to control the amplitude of the resulting audio.
A spectrum can be calculated with the inevitable FFT or using the provided implementation of the Constant-Q transform [23].
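As a small illustration of the chaining mentioned in the pitch-shifting item above, the sketch below only shows the factor arithmetic involved; it deliberately avoids library calls and is not TarsosDSP code. Resampling by a factor changes pitch and duration together, and a WSOLA time stretch by the same factor restores the original duration.

```java
public final class PitchShiftFactors {

    /** Frequency ratio corresponding to a shift of the given number of semitones. */
    static double pitchFactor(double semitones) {
        return Math.pow(2.0, semitones / 12.0);
    }

    public static void main(String[] args) {
        double factor = pitchFactor(3); // shift up by three semitones, roughly 1.189
        // Speeding playback up by 'factor' raises the pitch by 'factor'
        // but divides the duration by the same amount ...
        double resampleRatio = factor;
        // ... so a WSOLA time stretch by 'factor' restores the original duration
        // without affecting the (already shifted) pitch.
        double timeStretchFactor = factor;
        System.out.printf("resample ratio %.3f, time-stretch factor %.3f%n",
                resampleRatio, timeStretchFactor);
    }
}
```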
To show the capabilities of the framework, seventeen example applications were built. Most are small programs with a simple user interface, showcasing one algorithm. They not only show which functionality is present in the framework, but also how to use it in other applications. There are example applications for time stretching, pitch shifting, pitch estimation, onset detection, and so forth. Figure ? shows an example application featuring the pitch-shifting algorithm.
TarsosDSP is used by Tarsos [173] a software tool to analyze and experiment with pitch organization in non-western music. It is an end-user application with a graphical user interface that leverages a lot of TarsosDSP’s features. It can be seen as a showcase for the framework.
The source code is available under the GPL license terms at GitHub:
https://github.com/JorenSix/TarsosDSP.
Contributions are more than welcome. TarsosDSP releases, the manual, and documentation can all be found at the release directory which is available at the following url:
http://0110.be/releases/TarsosDSP/.
Nightly builds can be found there as well. Other downloads, documentation on the example applications and background information are available on:
http://0110.be
Providing the source code under the GPL license makes sure that derivative works also need to provide the source code, which enables reproducibility.
In this paper TarsosDSP was presented: an open-source Java library for real-time audio processing without external dependencies. It allows real-time pitch and onset extraction, a unique feature in the Java ecosystem. It also contains algorithms for time stretching, pitch shifting, filtering, resampling, effects and synthesis. TarsosDSP serves an educational goal; therefore the algorithms are implemented as simply and self-contained as possible, using a straightforward pipeline. The library can be used on the Android platform, as a back-end for Java applications, or stand-alone by using one of the provided example applications. After two years of active development it has become a valuable addition to the MIR community.
The ability to identify a small piece of audio by comparing it with a large reference audio database has many practical use cases. This is generally known as audio fingerprinting or acoustic fingerprinting. An acoustic fingerprint is a condensed representation of an audio signal that can be used to reliably identify identical, or recognize similar, audio signals in a large set of reference audio. The general process of an acoustic fingerprinting system is depicted in Figure 23. Ideally, a fingerprinting system only needs a short audio fragment to find a match in a large set of reference audio. One of the challenges is to design the system in such a way that the reference database can grow to contain millions of entries. Another challenge is that a robust fingerprinting system should handle noise and other modifications well, while limiting the number of false positives and the processing time [32]. These modifications typically include dynamic range compression, equalization, added background noise and artifacts introduced by audio coders or A/D-D/A conversions.
Over the years several efficient acoustic fingerprinting methods have been introduced [209]. These methods perform well, even with degraded audio quality and with industrial-sized reference databases. However, these systems are not designed to handle queries with a modified time-scale or pitch, although these distortions can be present in replayed material. Changes in replay speed can occur either by accident during an analog-to-digital conversion or be introduced deliberately.
Accidental replay speed changes can occur when working with physical, analogue media. Large music archives often consist of wax cylinders, magnetic tapes and gramophone records. These media are sometimes digitized using an incorrect or varying playback speed. Even when calibrated mechanical devices are used in a digitization process, the media could already have been recorded at an undesirable or undocumented speed. A fingerprinting system should therefore allow changes in replay speed to correctly detect duplicates in such music archives.
Deliberate time-scale manipulations are sometimes introduced as well. During radio broadcasts, for example, songs are occasionally played a bit faster to make them fit into a time slot. During a DJ-set pitch-shifting and time-stretching are present almost continuously. To correctly identify audio in these cases as well, a fingerprinting system robust against pitch-shifting and time-stretching is desired.
Some fingerprinting systems have been developed that take pitch-shifts into account [58] without allowing time-scale modification. Others are designed to handle both pitch and time-scale modification [214]. The system by [214] employs an image processing algorithm on an auditory image to counter time-scale modification and pitch-shifts. Unfortunately, the system is computationally expensive: it iterates the whole database to find a match. The system by [118] allows extreme pitch-shifting and time-stretching, but has the same problem. To the best of our knowledge, a description of a practical acoustic fingerprinting system that allows substantial pitch-shifting and time-scale modification is nowhere to be found in the literature. This description is the main contribution of this paper.
The proposed method is inspired by three works. Combining key components of those works results in a design of a granular acoustic fingerprinter that is robust to noise and substantial compression, has a scalable method for fingerprint storage and matching, and allows time-scale modification and pitch-shifting.
Firstly, the method used by [209] establishes that local maxima in a time-frequency representation can be used to construct fingerprints that are robust to quantization effects, filtering, noise and substantial compression. The described exact-hashing method for storing and matching fingerprints has proven to be very scalable. Secondly, [4] describe a method to align performances and scores; especially interesting is the way triplets of events are used to search for performances with different timings. Thirdly, the method by [58] introduces the idea of extracting fingerprints from a Constant-Q transform [23], a time-frequency representation that has a constant number of bins for every octave. In their system a fingerprint remains constant when a pitch shift occurs. However, since time is encoded directly within the fingerprint, the method does not allow time-scale modification.
Considering previous works, the method presented here uses local maxima in a spectral representation. It combines three event points, and takes time ratios to form time-scale invariant fingerprints. It leverages the Constant-Q transform, and only stores frequency differences for pitch-shift invariance. The fingerprints are designed with an exact hashing matching algorithm in mind. Below each aspect is detailed.
Finding local maxima. Suppose a time-frequency representation of a signal is provided. To locate the points where the energy reaches a local maximum, a tiled two-dimensional peak-picking algorithm is applied. First the local maxima for each spectral analysis frame are identified. Next, each local maximum is put in the centre of a tile of dimensions $\Delta t \times \Delta f$; if the local maximum is also the maximum within the tile it is kept, otherwise it is discarded. This makes sure only one point is identified for every tile of $\Delta t \times \Delta f$. The approach is similar to the one by [58]. This results in a list of event points, each with a frequency component $f$, expressed in bins, and a time component $t$, expressed in time steps. $\Delta t$ and $\Delta f$ are chosen so that the number of event points per second stays within a fixed range.
A spectral representation of an audio signal has a certain granularity: it is essentially a grid with bins both in time and in frequency. When an audio signal is modified, the energy that was originally located in one single bin can be smeared over two or more bins. This poses a problem, since the goal is to locate the event points with maximum energy reliably. To improve reliability, a post-processing step refines the location of each event point by taking its energy and mixing it with the energy of the surrounding bins; the same is done for the surrounding bins. If a new maximum is found in the surroundings of the initial event point, the event point is relocated accordingly. Effectively, a small rectangular blur kernel is applied at each event point and its surrounding bins.
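A minimal sketch of the tiled peak picking described above is given below. The tile dimensions are plain parameters here; the values actually used by the system are not assumed.

```java
import java.util.ArrayList;
import java.util.List;

public final class TiledPeakPicking {

    /** An event point: a time step index and a frequency bin index. */
    record EventPoint(int t, int f) {}

    /**
     * @param mag magnitude spectrogram indexed as mag[timeStep][frequencyBin]
     * @param dT  tile width in time steps (illustrative parameter)
     * @param dF  tile height in frequency bins (illustrative parameter)
     */
    static List<EventPoint> pick(float[][] mag, int dT, int dF) {
        List<EventPoint> points = new ArrayList<>();
        for (int t = 0; t < mag.length; t++) {
            for (int f = 1; f < mag[t].length - 1; f++) {
                // Local maximum within its own analysis frame.
                if (mag[t][f] <= mag[t][f - 1] || mag[t][f] <= mag[t][f + 1]) continue;
                // Keep it only if it is also the maximum of the dT x dF tile centred on it.
                boolean tileMax = true;
                for (int ti = Math.max(0, t - dT / 2);
                     tileMax && ti <= Math.min(mag.length - 1, t + dT / 2); ti++) {
                    for (int fi = Math.max(0, f - dF / 2);
                         fi <= Math.min(mag[ti].length - 1, f + dF / 2); fi++) {
                        if (mag[ti][fi] > mag[t][f]) { tileMax = false; break; }
                    }
                }
                if (tileMax) points.add(new EventPoint(t, f));
            }
        }
        return points;
    }
}
```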
Once the event points with locally maximal energy are identified, the next step is to combine them to form a fingerprint. A fingerprint consists of three event points, as seen in Figure 24. To construct a fingerprint, each event point is combined with two nearby event points, and each event point can be part of multiple fingerprints. Only a limited number of fingerprints is kept every second; the fingerprints whose event points have the least cumulative energy are discarded. Now that a list of fingerprints has been created, a method to encode time information in a fingerprint hash is needed.
Handling time stretching: event triplets. Figure 24 shows the effect of time stretching on points in the time-frequency domain. There, a fingerprint extracted from reference audio (Figure 24, red, triangle) is compared with a fingerprint from time-stretched audio (Figure 24, orange, full circle). Both fingerprints are constructed using three local maxima $e_1$, $e_2$ and $e_3$. While the frequency components stay the same, the time components do change. However, the ratios between the time differences remain constant; the following equation holds:

$$\frac{t_2 - t_1}{t_3 - t_1} = \frac{t'_2 - t'_1}{t'_3 - t'_1}$$

with event point $e_1$ having a time and frequency component $(t_1, f_1)$, the corresponding event points $e_2$ and $e_3$ having the components $(t_2, f_2)$ and $(t_3, f_3)$, and the primed values belonging to the time-stretched version. Since $t_1 < t_2 < t_3$, the ratio always resolves to a number in the range $[0,1]$. This number, scaled and rounded, is a component of the eventual fingerprint hash (an approach similar to [4]).
Now that a way to encode time information, indifferent to time stretching, has been found, a method to encode frequency, indifferent to pitch shifting, is desired.
Handling pitch-shifts: Constant-Q transform. Figure 24 compares a fingerprint from pitch-shifted audio (blue, clear circle) with a fingerprint from reference audio (red, triangle). In the time-frequency domain a pitch shift is a vertical translation and time information is preserved. Since every octave has the same number of bins [23], a pitch shift on event point $e_1$ has the following effect on its frequency component: $f'_1 = f_1 + K$, with $K$ a constant number of bins. It is clear that the difference between the frequency components remains the same before and after pitch shifting: $f_1 - f_2 = f'_1 - f'_2$ [58]. Since in the proposed system three event points are available, the following information is stored in the fingerprint hash: the frequency differences $f_1 - f_2$ and $f_2 - f_3$, the time ratio $\frac{t_2 - t_1}{t_3 - t_1}$, and two coarse frequency locations $\tilde{f}_1$ and $\tilde{f}_3$.
The last two elements, $\tilde{f}_1$ and $\tilde{f}_3$, are sufficiently coarse locations of the first and third frequency component. They are determined by the index of the frequency band they fall into after dividing the spectrum into eight bands. They give the hash more discriminative power, but they also limit how much the audio can be pitch-shifted while maintaining the same fingerprint hash.
Handling time-scale modification. Figure 24 compares a fingerprint of reference audio (red, triangle) with a fingerprint from the same audio that has been sped up (green, x). The figure makes clear that a speed change is a combination of time stretching and pitch shifting. Since both are handled by the previous measures, no extra precautions need to be taken. The next step is to combine these properties into a fingerprint that is efficient to store and match.
Fingerprint hash. A fingerprint with a corresponding hash needs to be constructed carefully to maintain the aforementioned properties. The result of a query should report the amount of pitch shift and time stretching that occurred. To that end, the absolute values of $t_1$ and $f_1$ are stored, so they can be compared with $t'_1$ and $f'_1$ from the query. The time offset at which a match was found should be returned as well, so $t_1$ needs to be stored in any case. The complete information to store for each fingerprint is $(\text{hash}; t_1; f_1; \text{audio identifier})$. The hash, the first element between brackets, can be packed into a single integer. To save space, $t_1$ and $f_1$ can be combined into a second integer, and the reference audio identifier forms a third, so a complete fingerprint consists of only three integers. At eight fingerprints per second, a song of four minutes is reduced to roughly two thousand such fingerprints, and even an industrial-size data set of one million songs translates to a manageable amount of storage.
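The sketch below shows one possible way to pack the invariant components described above into a single integer hash. The field widths and the coarse-band computation are assumptions made for the illustration; they do not reflect the exact bit layout used by Panako.

```java
public final class FingerprintHash {

    /**
     * Packs the invariant components of a three-point fingerprint into one int.
     * t1..t3 are time steps, f1..f3 are Constant-Q bin indices (0..127 assumed here).
     */
    static int pack(int t1, int t2, int t3, int f1, int f2, int f3) {
        int df12 = (f1 - f2) & 0xFF;   // frequency difference, pitch-shift invariant (8 bits)
        int df23 = (f2 - f3) & 0xFF;   // second frequency difference (8 bits)
        // Time ratio in [0,1], scaled to 6 bits: invariant under time stretching.
        int ratio = (int) Math.round(63.0 * (t2 - t1) / (double) (t3 - t1)) & 0x3F;
        int band1 = (f1 * 8 / 128) & 0x07; // coarse band of f1, spectrum split into 8 bands
        int band3 = (f3 * 8 / 128) & 0x07; // coarse band of f3
        return (df12 << 20) | (df23 << 12) | (ratio << 6) | (band1 << 3) | band3;
    }
}
```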
Matching algorithm The matching algorithm is inspired by [209], but is heavily modified to allow time stretched and pitch-shifted matches. It follows the scheme in Figure 23 and has seven steps.
Local maxima are extracted from a Constant-Q spectrogram of the query. The local maxima are combined in groups of three to form fingerprints, as explained in Sections ?, ? and ?.
For each fingerprint a corresponding hash value is calculated, as explained in Section ?.
The set of hashes is matched with the hashes stored in the reference database, and each exact match is returned.
The matches are iterated while counting how many times each individual audio identifier occurs in the result set.
Matches with an audio identifier count lower than a certain threshold are removed, effectively dismissing random chance hits. In practice there is almost always only one item with a lot of matches, the rest being random chance hits. A threshold of three or four suffices.
The residual matches are checked for alignment, both in frequency and time, with the reference fingerprints using the information that is stored along with the hash.
A list of audio identifiers is returned, ordered by the number of fingerprints that align both in frequency and in time.
In step six, frequency alignment is checked by comparing the $f_1$ component of the stored reference with $f'_1$, the frequency component of the query. If, for each match, the difference between $f_1$ and $f'_1$ is constant, the matches align.
Alignment in time is checked using the reference time information $t_1$ and $t_3$ and the time information of the corresponding fingerprint extracted from the query fragment, $t'_1$ and $t'_3$. For each matching fingerprint a time offset is calculated; it resolves to the number of time steps between the beginning of the query and the beginning of the reference audio, even if a time modification took place. It stands to reason that this offset is constant for matching audio.
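A sketch of this scale-aware alignment check is given below. It assumes that every hash match carries the reference and query times of the first and third event point; grouping matches on a rounded offset and scale is one plausible way to count aligned matches, not necessarily the exact bookkeeping used by the implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class AlignmentCheck {

    /** A hash match: reference item, reference times t1/t3 and query times q1/q3 (time steps). */
    record Match(int refId, int t1, int t3, int q1, int q3) {}

    /** Counts, per reference item, offset and scale, how many matches agree. */
    static Map<String, Integer> alignedCounts(List<Match> matches) {
        Map<String, Integer> counts = new HashMap<>();
        for (Match m : matches) {
            double scale = (m.t3() - m.t1()) / (double) (m.q3() - m.q1()); // time-scale factor
            double offset = m.t1() - scale * m.q1(); // where the query starts in the reference
            String key = m.refId() + ":" + Math.round(offset) + ":" + Math.round(scale * 100);
            counts.merge(key, 1, Integer::sum);
        }
        return counts; // a true match shows up as a single key with a high count
    }
}
```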
The matching algorithm also provides information about the query. The time offset tells at which point in time the query starts in the reference audio. The time difference ratio represents how much time is modified, in percent. How much the query is pitch-shifted with respect to the reference audio can be deduced from the difference in frequency bins $\Delta f = f_1 - f'_1$. To convert a difference in frequency bins to a percentage the following relation can be used, with $c$ the number of cents per bin, $e$ Euler's number and $\ln$ the natural logarithm: the frequency ratio equals $e^{(\Delta f \times c \times \ln 2)/1200}$, which, multiplied by one hundred, yields a percentage.
The matching algorithm ensures that random chance hits are very uncommon; the number of false positives can be effectively reduced to zero by setting a threshold on the number of aligned matches. The matching algorithm also provides the query time offset and the percentage of pitch-shift and time-scale modification of the query with respect to the reference audio.
To test the system, it was implemented in the Java programming language. The implementation is called Panako and is available under the GNU Affero General Public License on http://panako.be. The DSP is also done in Java, using the DSP library by [174]. To store and retrieve hashes, Panako uses a key-value store. Kyoto Cabinet, BerkeleyDB, Redis, LevelDB, RocksDB, Voldemort and MapDB were considered. MapDB is an implementation of a storage-backed B-Tree with efficient concurrent operations [100] and was chosen for its simplicity, performance and good Java integration. Also, the storage overhead introduced when storing fingerprints on disk is minimal. Panako is compared with Audfprint by Dan Ellis, an implementation of a fingerprinting system based on [209].
The test data set consists of freely available music downloaded from Jamendo.
Each fragment is presented to both Panako and Audfprint and the detection results are recorded. The systems are regarded as binary classifiers for which the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) are counted. During the experiment with Panako no false positives were detected (FP = 0). Also, all fragments that are not present in the reference database were rejected correctly as true negatives. So Panako's specificity, TN/(TN + FP), is 100%. This can be explained by the design of the matching algorithm: a match is identified as such if a number of hashes, each consisting of three points in a spectrogram, align in time. A random match between hashes is rare, and the chance of a random match between consecutively aligned hashes is almost non-existent, resulting in a 100% specificity.
The sensitivity of the system, however, depends on the type of modification of the fragment. Figure 25 shows the results after pitch shifting. It is clear that the amount of pitch shift affects the performance, but in a fluctuating pattern. The effect can be explained by taking the Constant-Q bins into account: each bin spans a fixed number of cents, and a pitch shift of an odd multiple of half a bin spreads spectral information over two bins, so performance is expected to degrade severely at exactly those shifts, an effect clearly visible in Figure 25. The figure also shows that performance is better if longer fragments are presented to the system. The performance of Audfprint, however, does not recover after pitch shifts of more than three percent.
Figure 26 shows the results after time stretching. Due to the granularity of the time bins, and considering that the step size stays the same for each query type, time modifications have a negative effect on the performance. Still, more than a third of the queries are resolved correctly after a time-stretching modification of 8%. Performance improves with the length of the fragment. Surprisingly, Audfprint is rather robust against time stretching, thanks to the way time is encoded in its fingerprints.
Figure 27 shows the results after time-scale modification. The performance decreases severely above eight percent. The figure shows that there is some improvement when comparing the results for 20s fragments with those for 40s fragments, but going from 40s to 60s does not change much. Audfprint is unable to cope with time-scale modification due to the changes in both frequency and time.
In Figure 28, the results for other modifications like echo, chorus, flanger, tremolo and a band-pass filter can be seen. The parameters of each effect are chosen to represent typical use, but on the heavy side. For example, the echo effect applied has a delay line of 0.5 seconds and a decay of 30%. The system has the most problems with the chorus effect: chorus has a blurring effect on the spectrogram, which makes it hard for the system to find matches. Still, it can be said that the algorithm is rather robust against very present, clearly audible, commonly used audio effects. The result for the band-pass filter with a centre frequency of 2000Hz is especially good. To test the system's robustness against severe audio compression, a test was executed with GSM-compressed queries. The performance on 20s fragments is about 30%, but it improves a lot with query length; 60s fragments yield 65%. The results for Audfprint show that there is room for improvement in the performance of Panako.
A practical fingerprinting system performs well, in terms of speed, on commodity hardware. With Panako, extracting and storing fingerprints for 25s of audio is done in one second using a single core of a dated processor.
Failure analysis shows that the system does not perform well on music with a spectrogram that contains either very little energy or energy spread evenly across the whole range. Extremely repetitive music, with a spectrogram similar to a series of Dirac impulses, is also problematic. Furthermore, performance drops when time modifications of more than 8% are present. This could be partially alleviated by redesigning the time parameters used in the fingerprint hash, but this would reduce the discriminative power of the hash.
In this paper a practical acoustic fingerprinting system was presented. The system allows fast and reliable identification of small audio fragments in a large set of audio, even when the fragment has been pitch-shifted and time-stretched with respect to the reference audio. If a match is found, the system reports where in the reference audio the query matches and how much the time and frequency have been modified. To achieve this, the system uses local maxima in a Constant-Q spectrogram. It combines event points into groups of three and uses time ratios to form a time-scale invariant fingerprint component. To form pitch-shift invariant fingerprint components, only frequency differences are stored. For retrieval, an exact-hashing matching algorithm is used.
The system has been evaluated using a freely available data set of 30,000 songs and compared with a baseline system. The results can be reproduced entirely using this data set and the open-source implementation of Panako; the scripts to run the experiment are available as well. The results show that the system's performance decreases with time-scale modifications of more than eight percent. The system is shown to cope with pitch-shifting, time-stretching, severe compression and other modifications such as echo, flanger and band-pass filtering.
To improve the system further, the Constant-Q transform could be replaced by an efficient implementation of the non-stationary Gabor transform. This is expected to improve the extraction of event points and fingerprints without affecting performance. Panako could also benefit from a more extensive evaluation and a detailed comparison with other techniques. An analysis of the minimum, most discriminative, information needed for retrieval purposes could be especially interesting.
During the past decades there has been a growing interest in the relation between music and movement; an overview of ongoing research is given by [63]. This type of research often entails the analysis of data from various (motion) sensors combined with multi-track audio and video recordings. These multi-modal signals need to be synchronized reliably and precisely to allow successful analysis, especially when aspects of musical timing are under scrutiny.
Synchronization of heterogeneous sources poses a problem due to large variability in sample rates and due to the latencies introduced by each recording modality. For example, it could be the case that accelerometer data, sampled at 100Hz, needs to be synchronized with multi-track audio recorded at 48kHz and with two video streams recorded using webcams at 30 frames per second.
Several methods have been proposed to address synchronization problems when recording multi-modal signals. The most straightforward approach is to route a master clock signal through each device and synchronize using this pulse. [84] show a system where an SMPTE signal serves as a clock for video cameras and other sensors as well. In a system by [78], a clock signal generated with an audio card is used to synchronize OSC, MIDI, serial data and audio. A drawback of this approach is that every recording modality needs to be fitted with a clock signal input. When working with video this means that expensive cameras are needed that are able to control shutter timing via a sync port. Generally available webcams do not have such functionality. The EyesWeb system [28] has similar preconditions.
Another approach is to use instantaneous synchronization markers in the data streams. In an audio stream such a marker could be a hand clap; in a video stream a bright flash could be used. These markers are subsequently employed to calculate timing offsets and synchronize streams, either by hand or assisted by custom software. This method does not scale well to multiple sensor streams and does not cope well with drift or dropped samples. Some types of sensor streams are hard to manipulate with markers, e.g. ECG recordings, which prohibits the use of this method. Although the method has drawbacks, it can be put to use effectively in controlled environments, as is shown by [66].
In this article a novel low-cost approach is proposed to synchronize streams by embedding ambient audio into each stream. With the stream and ambient audio being recorded synchronously, the problem of mutual synchronization between streams is effectively reduced to audio-to-audio alignment. As a second contribution of this paper, a robust, computationally efficient audio-to-audio alignment algorithm is introduced. The algorithm extends audio fingerprinting techniques with a cross-covariance step. It offers precise and reliable synchronization of audio streams of varying quality. The algorithm proposes a synchronization solution even if drift is present or when samples are dropped in one of the streams.
There are several requirements for the audio-to-audio alignment algorithm. First and foremost, it should offer reliable and precise time offsets to align multiple audio streams. Offsets should be provided not only at the start of the stream but continuously, in order to spot drifting behavior or dropped samples. The algorithm needs to be computationally efficient so it can handle multiple recordings that are potentially several hours long. Highly varying signal quality should not pose a problem for the alignment algorithm: it should be designed to reliably match a stream sampled at a low bit rate and sample rate with a high-fidelity signal.
The requirements are similar to those of acoustic fingerprinting algorithms. An acoustic fingerprinting algorithm uses condensed representations of audio signals, acoustic fingerprints, to identify short audio fragments in large audio databases. A robust fingerprinting algorithm generates similar fingerprints for perceptually similar audio signals, even if there is a large difference in quality between the signals. [209] describes such an algorithm. Wang's algorithm is able to recognize short audio fragments reliably even if the audio has been subjected to modifications like dynamic range compression, equalization, added background noise and artifacts introduced by audio coders or A/D-D/A conversions. The algorithm is computationally efficient, relatively easy to implement and yields precise time-offsets. All these features combined make it a good candidate for audio-to-audio alignment. This has been recognized by others as well, since variations on the algorithm have been used to identify multiple videos of an event [47] and to identify repeating acoustic events in long audio recordings [138]. The patent application for the algorithm by [210] mentions that it could be used for precise synchronization, but it does not detail how this could be done. Below, a novel post-processing step to achieve precise synchronization is proposed.
Another approach to audio-to-audio synchronization is described by [165]. Their algorithm offers an accuracy of ±11ms in the best case and does not cope with drift or dropped samples, two aspects that the algorithm proposed below improves upon.
The algorithm works as follows. First, the audio is transformed to the time-frequency domain, in which peaks are extracted. Each peak has a frequency component f and a time component t. The frequency component is expressed as an FFT bin index and the time component as an analysis frame index. The peaks are chosen to be spaced evenly over the time-frequency plane. Two nearby peaks are combined to form one fingerprint, as shown in Figure 29. The fingerprints of the reference audio are stored in a hashtable, with the key being a hash of the two frequency components f1 and f2 and the time difference Δt between the peaks. Also stored in the hashtable, along with the fingerprint, is t1, the absolute time of the first peak. Further details can be found in [209].
For audio that needs to be synchronized with a reference, fingerprints are extracted and hashed in exactly the same way. Subsequently, the hashes that match hashes in the reference hashtable are identified. For each matching hash, a time offset is calculated between the query and the reference audio, using the absolute time t1 stored in the hashtable. If query and reference audio match, an identical time offset will appear multiple times; random chance matches do not have this property. If a match is found, the time offset is reported. The matching step is visualized in Figure 30, in which several matching fingerprints are found. For two of those matching fingerprints the time offset is indicated using the dotted lines. The figure also makes clear that the method is robust to noise or audio that is only present in one of the streams. Since only a few fingerprints need to match with the same offset, the other fingerprints - introduced by noise or other sources - can be safely discarded (the red fingerprints in Figure 30).
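The core of this matching step can be sketched in a few lines of Java: each matching hash votes for an offset, and an offset with enough votes is reported. The data structures and names below are illustrative assumptions, not the actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the matching step: hashes vote for a time offset (in analysis frames).
// The index layout is an illustrative assumption.
final class MatchSketch {
    // referenceIndex: hash -> absolute frame times t1 of the reference fingerprints
    // queryFingerprints: hash -> frame times at which the hash occurs in the query
    static int bestOffset(Map<Integer, List<Integer>> referenceIndex,
                          Map<Integer, List<Integer>> queryFingerprints,
                          int minimumHits) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> q : queryFingerprints.entrySet()) {
            List<Integer> refTimes = referenceIndex.get(q.getKey());
            if (refTimes == null) continue; // hash not present in the reference
            for (int queryTime : q.getValue()) {
                for (int refTime : refTimes) {
                    int offset = refTime - queryTime; // identical for true matches
                    votes.merge(offset, 1, Integer::sum);
                }
            }
        }
        // Report the offset with the most votes, if it is supported well enough.
        return votes.entrySet().stream()
                .filter(e -> e.getValue() >= minimumHits)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no match found"));
    }
}
```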
In this algorithm time offsets are expressed using the analysis frame index. The time resolution of such an audio frame is not sufficient for precise synchronization, so the coarse offset is subsequently refined with a cross-covariance calculation on the raw audio blocks.
The cross-covariance calculation step is not efficient, so it should be done for as few audio blocks as possible. Since it is known beforehand at which audio block indices similar audio is present in the reference and the query - the blocks with matching fingerprint offsets - the calculation can be limited to only those blocks. The number of cross-covariance calculations can be further reduced by only calculating covariances until agreement is reached.
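As an illustration, a deliberately naive cross-covariance over two matched audio blocks could look as follows; the lag with the highest covariance refines the block-level offset to sample precision. This is a textbook sketch, not the exact code used in the implementation.

```java
// Sketch: refine a coarse (block-level) offset to sample precision by finding
// the lag that maximizes the cross-covariance of two matched audio blocks.
final class CrossCovarianceSketch {
    static int bestLag(float[] reference, float[] query) {
        double refMean = mean(reference);
        double queryMean = mean(query);
        int bestLag = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        int n = reference.length;
        for (int lag = 0; lag < n; lag++) {
            double sum = 0;
            for (int i = 0; i < n; i++) {
                // a circular lag keeps the sketch simple
                sum += (reference[i] - refMean) * (query[(i + lag) % n] - queryMean);
            }
            if (sum > bestValue) { bestValue = sum; bestLag = lag; }
        }
        return bestLag; // in samples, relative to the start of the blocks
    }

    private static double mean(float[] x) {
        double s = 0;
        for (float v : x) s += v;
        return s / x.length;
    }
}
```

The quadratic cost in the block length is exactly why the computation is restricted to blocks that the fingerprints have already matched.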
Until now no precautions have been taken to deal with drift or dropped samples. To deal with drift, the algorithm above is expanded to allow multiple time offsets between the audio that needs to be synchronized and a reference. While iterating over the fingerprint hash matches, a list of matching fingerprints is kept for each offset. If the list of matching fingerprints for an offset reaches a certain threshold, the corresponding offset is counted as valid. Drift can then be identified since many gradually increasing or decreasing offsets will be reported. If samples are dropped, two distinct offsets will be reported. The time at which samples are dropped or drift occurs is found in the list of matching fingerprints for an offset: the time information of the first and last fingerprint match marks the beginning and end of a sequence of detected matches.
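One possible way to bookkeep these multiple offsets is sketched below; the threshold, names and data structures are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: keep a list of fingerprint match times per offset. Offsets with enough
// support are considered valid; several gradually changing valid offsets indicate
// drift, two distinct valid offsets indicate dropped samples.
final class OffsetTrackerSketch {
    private final Map<Integer, List<Integer>> matchesPerOffset = new TreeMap<>();
    private final int threshold; // minimum number of matches for a valid offset

    OffsetTrackerSketch(int threshold) { this.threshold = threshold; }

    void addMatch(int offsetInFrames, int queryTimeInFrames) {
        matchesPerOffset.computeIfAbsent(offsetInFrames, k -> new ArrayList<>())
                        .add(queryTimeInFrames);
    }

    // Valid offsets, each with the query time range over which it was observed.
    List<int[]> validOffsets() {
        List<int[]> result = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : matchesPerOffset.entrySet()) {
            List<Integer> times = e.getValue();
            if (times.size() >= threshold) {
                int first = times.get(0);
                int last = times.get(times.size() - 1);
                result.add(new int[] { e.getKey(), first, last });
            }
        }
        return result; // more than one entry suggests drift or dropped samples
    }
}
```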
With a working audio-to-audio alignment strategy in place, a setup for multimodal recording should include an audio stream for each sensor stream. While building a setup it is of utmost importance that the sensor stream and the corresponding audio stream are kept in sync. For a video recording this means that AV-sync needs to be guaranteed. To make sure that analog sensors are correctly synchronized with audio, the use of a data acquisition module is advised. Generally these modules are able to sample data at sufficiently high sampling rates and precision. The main advantage is that such a module generally has many analog input ports, and by recording the audio via the same path it is guaranteed to be in sync with the sensor streams. In the setup detailed below (Section ?), for example, a USB data acquisition module with 16 inputs, 16 bit resolution and a maximum sample rate of 200kHz is used.
To accommodate data streams with an inherently low sampling rate - 8kHz or less - a method can be devised that does not sample the audio at the same rate as the data stream. In the setup detailed below, an accelerometer (Section ?) is sampled at 345Hz by dividing the audio into blocks of 128 samples and measuring one acceleration value per block of 128 audio samples. Since the audio is sampled at 44.1kHz, the data is indeed sampled at 44100 Hz / 128 ≈ 345 Hz. Depending on the type of data stream and audio recording device, other sampling rates can be achieved by using other divisors. Conversely, high data sampling rates can be accommodated by using a multiplier instead of a divisor.
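As a small illustration of this bookkeeping, the sketch below computes the effective data rate and the audio-aligned timestamp of each data sample from its block index; the numbers follow the setup described here, the class itself is a hypothetical helper.

```java
// Sketch: timestamps for a data stream sampled once per audio block.
final class BlockTimestampSketch {
    static final double AUDIO_SAMPLE_RATE = 44100.0;
    static final int BLOCK_SIZE = 128; // audio samples per data sample

    // effective data rate: 44100 / 128 ≈ 344.5 Hz (rounded to 345 Hz in the text)
    static double dataSampleRate() { return AUDIO_SAMPLE_RATE / BLOCK_SIZE; }

    // time (in seconds) of the n-th data sample, aligned with the audio stream
    static double timestamp(int dataSampleIndex) {
        return dataSampleIndex * BLOCK_SIZE / AUDIO_SAMPLE_RATE;
    }
}
```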
Note that during such a recording session a lot of freedom concerning the workflow is gained. While one stream is recording continuously, other recording modalities can be stopped and started without affecting synchronization. The order in which recording devices are started also has no effect on synchronization. This is in stark contrast with other synchronization methods and is illustrated in Figure 31.
Once all data is synchronized, analysis can take place. If video analysis is needed, a tool like ELAN [212] can be used. If audio and sensor streams are combined without video, Sonic Visualizer [31] is helpful to check mutual alignment. To store and share multimodal data, RepoVIZZ [122] is useful.
To test the audio-to-audio alignment algorithm, it was implemented in Java, together with a user-friendly interface called SyncSink.
To measure the accuracy of the time offsets reported by the algorithm, the following experiment was done. For each audio file in a dataset a random snippet of ten seconds is copied. The ten seconds of audio is stored together with an accurate representation of the offset at which it starts in the reference audio. Subsequently the snippet is aligned with the reference audio and the actual offset is compared with the offset reported by the algorithm. To make the task more realistic the snippet is GSM 06.10 encoded. The GSM 06.10 encoding is a low-quality 8kHz, 8-bit encoding. This degradation is done to ensure that the algorithm reports precise time offsets even when high-fidelity signals - the reference audio files - are aligned with low-quality audio - the snippets.
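The evaluation loop can be summarized in a short sketch; the align() call is a placeholder for the alignment algorithm and the GSM encoding step is omitted, so this is an outline of the procedure rather than the actual evaluation code.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the evaluation loop: cut a random 10 s snippet from a reference file,
// align it and compare the reported offset with the known true offset.
final class EvaluationSketch {
    static double errorInMs(float[] reference, double sampleRate, Random random) {
        int snippetLength = (int) (10 * sampleRate);
        int trueStart = random.nextInt(reference.length - snippetLength);
        float[] snippet = Arrays.copyOfRange(reference, trueStart, trueStart + snippetLength);
        // In the real experiment the snippet is additionally GSM 06.10 encoded here.
        double reportedOffsetSeconds = align(reference, snippet); // placeholder
        double trueOffsetSeconds = trueStart / sampleRate;
        return Math.abs(reportedOffsetSeconds - trueOffsetSeconds) * 1000.0;
    }

    private static double align(float[] reference, float[] query) {
        throw new UnsupportedOperationException("stand-in for the alignment algorithm");
    }
}
```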
The procedure described above was executed for a thousand snippets of ten seconds. For 17 snippets an incorrect offset was found due to identically repeating audio; it could be said that these offsets yield an alternative, but equally correct, alignment. For 10 snippets no alignment was found. For the remaining 973 snippets the offsets were on average 1.01ms off, with a standard deviation of 2.2ms. The worst case of 16ms (128/8000Hz) was reported once. Note that a delay in audio of 1-14ms affects spatial localization, a 15-34ms delay creates a chorus/flanger-like effect, and starting from 35ms discrete echoes can be heard.
To get an idea of how quickly the algorithm returns a precise offset, the runtime was measured. Four reference audio files were created, each with a duration of one hour and each with a corresponding query file. The queries consist of the same audio but GSM encoded and with a small time offset. A query performance of on average 81 times real-time is reached on modest computing hardware.
This method was used in a study with dementia patients. The study aimed at measuring how well participants can synchronize to a musical stimulus. A schematic of the system can be found in Figure 33. The system has the following components:
A balance board equipped with an analog pressure sensor at each corner.
Two HD-webcams (Logitech C920), recording the balance board and ambient audio using the internal microphones.
An electret microphone (CMA-4544PF-W) with amplifier (MAX4466) circuit.
A data acquisition module with analog inputs. Here an Advantech USB4716 DAQ was used. It has 16 single-ended inputs with 16-bit resolution and is able to sample up to 200 kHz.
A wearable microcontroller with an electret microphone (CMA-4544PF-W), a MicroSD-card slot and an analog accelerometer (MMA7260Q) attached to it. Here we used the Teensy 3.1 with audio shield. It runs at 96MHz and has enough memory and processing power to handle audio sampled at 44.1kHz in real-time. The microcontroller can be seen in Figure 34.
The microcontroller shown in Figure 34 was programmed to stream audio data sampled at 44.1kHz to the SD-card in blocks of 128 samples. Using the same processing pipeline, the instantaneous acceleration was measured for each block of audio. This makes sure that the measurements and audio stay in sync even if there is a temporary bottleneck in the pipeline. During the recording session this proved to be of value due to a slow micro SD-card.
Once all data was transferred to a central location, mutual time offsets were calculated automatically. Subsequently the files were trimmed in order to synchronize them. In practice this means chopping off a part of the video or audio file (using a tool like ffmpeg) and modifying the data files accordingly. The data of this recording session and the software used are available at http://0110.be/syncsink. The next step is to analyse the data with tools like ELAN [212] or Sonic Visualizer [31] (Figure 35 and Figure 36) or any other tool. The analysis itself falls outside the scope of this paper.
Since ambient audio is used as a synchronization clock signal, the speed of sound needs to be taken into account. If microphones are spread out over a room, the physical latency quickly adds up to 10ms (for 3.4m) or more. If microphone placement is fixed this can be compensated for. If microphone placement is not fixed, it should be determined whether the precision the proposed method can provide is sufficient for the measurement.
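The magnitude of this acoustic latency follows directly from the speed of sound, roughly 343 m/s at room temperature: t = d / v ≈ 3.4 m / 343 m/s ≈ 9.9 ms, which is where the 10ms figure above comes from.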
The proposed method should not be seen as a universal synchronization solution, but it provides a method that can fit some workflows. Combining audio-to-audio alignment with other synchronization methods is of course possible. If, for example, motion capture needs to be synchronized with other sources, the motion capture clock pulse can be routed through a device that records the clock together with ambient audio, making it possible to sync with other modalities. The same could be done for an EEG system and its clock. This setup would make it possible to sync EEG with motion capture data, an otherwise difficult task. Combining the method with other synchronization approaches - e.g. synchronization markers - is also possible.
The current system is presented as a post-processing step, but if the latencies of each recording system are relatively stable there is potential to use the approach in real time. It would work as follows: each recording device starts streaming both audio and data. After about ten seconds, audio-to-audio alignment can be done and the mutual offsets can be determined. Once this information is known, the sensor data can be buffered and released in a timely manner to form one fused, synchronized sensor data stream. The overall latency of the stream is then at best equal to that of the recording modality with the largest latency. While streaming data, the audio-to-audio alignment should be repeated periodically to check or adapt the offsets of the sensor streams.
An efficient audio-to-audio alignment algorithm was presented and used effectively to synchronize recordings and linked data streams. The algorithm is based on audio fingerprinting techniques. It finds a rough offset using fingerprints and subsequently refines the offset with a cross-covariance step. During synthetic benchmarks an average synchronization accuracy of 1.1ms was reached with a standard deviation of 2.2ms. A query performance of 81 times real-time is reached on modest computing hardware when synchronizing two streams. A case study showed how the method is used in research practice: recordings from two webcams and a balance board, together with acceleration data recorded using a microcontroller, were all synchronized reliably and precisely.
The ability to identify which music is playing in the environment of a user has several use cases. After a successful recognition, meta-data about the music is immediately available: artist, title, album. More indirect information can also be made available: related artists, upcoming concerts by the artist or where to buy the music. Such systems have been in use for more than a decade now.
A system that is able to not only recognize the music but also determine a sufficiently precise playback time opens new possibilities. It would allow lyrics, scores or tablature to be shown in sync with the music. If the time resolution is fine enough, it would even allow music videos to be played in sync with the environment. In this work a design of such a system is proposed. The paper focuses on yet another type of time-dependent contextual data: beat lists. A prototype was developed that provides feedback exactly on the beat, for the following three reasons:
For its inherent value. Humans are generally able to track musical beat and rhythm, and synchronizing movement with perceived beats is a process that is natural to most; both abilities develop during early childhood [73]. However, some people are unable to follow a musical beat. They fall into two categories. The first category consists of people with hearing impairments who have difficulty perceiving sound in general and music in particular. Especially users of cochlear implants who were early-deafened but only implanted during adolescence or later have difficulties following rhythm [61]. In contrast, post-lingually deafened CI users show similar performance to normal hearing persons [124]. The second category consists of people suffering from beat deafness [146]. Beat deafness is a type of congenital amusia which makes it impossible to extract music's beat. Both groups could benefit from a technology that finds the beat in music and provides tactile or visual feedback on the beat.
For evaluation purposes. Using discrete events - the beats - makes evaluation relatively straightforward. It is a matter of comparing the expected beat timing with the timing of the feedback event.
For pragmatic reasons. The contextual data - the beat lists - are available or can be generated easily. There are a few options to extract beat timestamps. The first is to manually annotate beat information for each piece of music in the reference database. It is the most reliable method, but also the most laborious. The second option is to use a state-of-the-art beat tracking algorithm, e.g. the one available in Essentia [16]. The third option is to request beat timestamps from a specialized web service, such as the one provided by the AcousticBrainz project [149].
The following sections present the system and its evaluation. The paper ends with the discussion and the conclusions.
A system that provides time-dependent context for music has several requirements. The system needs to be able to recognize audio or music being played in the environment of the user, together with a precise time offset. It also needs contextual, time-dependent information to provide to the user, e.g. lyrics, scores, music videos, tablature or triggers. This information should be prepared beforehand and stored in a repository. The system also needs an interface to provide the user with the information to enhance the listening experience. As a final, soft requirement, the system should preferably minimize computational load and resources so it could be implemented on smartphones.
Acoustic fingerprinting algorithms are designed to recognize which music is playing in the environment. The algorithms use condensed representations of audio signals to identify short audio fragments in vast audio databases. A well-designed fingerprinting algorithm generates similar fingerprints for perceptually similar audio signals, even if there is a large difference in quality between the signals. [209] describes such an algorithm. The algorithm is based on pairs of spectral peaks which are hashed and compared with the peaks of reference audio. Wang's algorithm is able to recognize short audio fragments reliably even in the presence of background noise or artifacts introduced by A/D or D/A conversions. The algorithm is computationally efficient, relatively easy to implement and yields precise time-offsets. All these features combined make it a good algorithm to detect which music is being played and to determine the time offset precisely. Figure ? shows fingerprints extracted from two audio streams using Panako [175], an implementation of the aforementioned algorithm. The reference audio is at the top, the query at the bottom. Using the matching fingerprints (in green), the query is aligned with the reference audio.
For the prototype, the complete process is depicted in Figure ?. A client uses its microphone to register sounds in the environment. Next, fingerprints are extracted from the audio stream and sent to a server. The server matches the fingerprints with a reference database. If a match is found, a detailed time offset between the query and the reference audio is calculated. Subsequently, the server returns this time offset together with a list of beat timestamps. Using this information the client is able to generate feedback events that coincide with the beat of the music playing in the user's environment. This process is repeated to make sure that the feedback events remain in sync with the music in the room. If the server fails to find a match, the feedback events stop.
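The client-side scheduling that this implies can be sketched as follows. The names and parameters are assumptions for illustration; the actual prototype builds on Panako and a JSON web service, but the bookkeeping amounts to translating the reported offset and beat list into locally timed feedback events.

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

// Sketch: schedule feedback events on the beats of the music playing in the room.
// The server reported that the query matches the reference at referenceOffsetMs;
// the query started at queryStartWallClockMs on the local clock.
final class BeatFeedbackSketch {
    static void schedule(List<Double> beatTimesMs, double referenceOffsetMs,
                         long queryStartWallClockMs, double latencyCompensationMs,
                         Runnable feedback) {
        Timer timer = new Timer(true);
        long now = System.currentTimeMillis();
        for (double beatMs : beatTimesMs) {
            // position of the beat relative to the start of the recorded query
            double beatInQueryMs = beatMs - referenceOffsetMs;
            long fireAt = queryStartWallClockMs + (long) (beatInQueryMs - latencyCompensationMs);
            if (fireAt <= now) continue; // beat already passed
            timer.schedule(new TimerTask() {
                @Override public void run() { feedback.run(); }
            }, fireAt - now);
        }
    }
}
```

The latencyCompensationMs parameter plays the role of the latency parameter mentioned in the evaluation below: it shifts all feedback events slightly earlier or later to compensate for output and perception delays.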
With the list of beat events available, the system needs to generate feedback events. The specific feedback mode depends on the use case. Perhaps it suffices to accentuate the beat using an auditory signal, e.g. a loud sound with a sharp attack. A bright flash on each beat could also help some users. Haptic feedback can be given with vibration motors attached to the wrist using a bracelet. A commercially available wireless tactile metronome - the Soundbrenner Pulse - lends itself well to this purpose. A combination of feedback modes could prove beneficial, since multisensory feedback can improve sensorimotor synchronization [56].
The evaluation makes clear how accurately synchronized context can be delivered for ambient audio or music. It quantifies the time offset between the beats - annotated beforehand - and the time of the feedback event that should correspond with a beat. For an evaluation of the underlying fingerprinting algorithm, readers are referred to the evaluation by [209].
The evaluation procedure is as follows: a device plays a piece of music while it also keeps track of the current playback position accurately in real time.
To counter problems arising from soft real-time thread scheduling and audio output latency, the evaluation was done using a microcontroller with a hard real-time scheduler and low-latency audio playback. An extra benefit is that timing measurements are very precise. Cheap, easily programmable microcontrollers come with enough computing power these days to handle high-quality audio. One such device is the Axoloti: a microcontroller for audio applications that can be programmed using a patcher environment. Another is the Teensy equipped with a Teensy Audio Board. Here we use a Teensy 3.2 with Audio Board for audio playback. It supports the Arduino environment, which makes it easy to program it as a measuring device. It has an audio output latency of about 1 ms. It is able to render 16-bit audio sampled at 44100 Hz and to read PCM-encoded audio from an SD-card. In the experimental setup, the Teensy is connected to a Behringer B2031 active speaker.
BPM | Artist - Title |
82 | Arctic Monkeys - Brianstorm |
87 | Pendulum - Propane Nightmares |
90 | Ratatat - Mirando |
91 | C2C - Arcades |
95 | Hotei - Battle Without Honor or Humanity |
95 | Skunk Anansie - Weak |
100 | Really Slow Motion - The Wild Card |
105 | Muse - Panic Station |
108 | P.O.D. - Alive |
111 | Billie - Give Me the Knife |
121 | Daft Punk - Around The World |
128 | Paul Kalkbrenner - Das Gezabel de Luxe |
144 | Panda Dub - Purple Trip |
146 | Digicult - Out Of This World |
153 | Rage Against the Machine - Bombtrack |
162 | Pharrell Williams - Happy |
Audio enters the client by means of the built-in microphone. The client part of Figure ? is handled by a laptop: a late 2010 MacBook Air, model A1369, running Mac OS X 10.10. Next the audio is fed into the fingerprinting system. The system presented here is based on Panako [175]. The source code of Panako is available online and it is implemented in Java 1.8. Panako was modified to allow a client/server architecture. The client is responsible for extracting fingerprints. A server matches fingerprints and computes time offsets. The server also stores a large reference database with fingerprints together with beat positions. The client and server communicate via a web service using JSON, a standard way to encode data so that computers can exchange it.
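As an illustration of what such a message could carry, the class below models a hypothetical server response; the field names are assumptions, not the actual Panako protocol.

```java
import java.util.List;

// Sketch of the data a JSON response from the server could carry; the field
// names are illustrative assumptions, not the actual protocol.
final class MatchResponse {
    boolean matchFound;        // false: the client stops the feedback events
    String referenceId;        // identifier of the matched reference track
    double offsetInSeconds;    // where in the reference the query starts
    List<Double> beatTimesMs;  // pre-computed beat positions of the reference
}
```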
When the time offsets and the beat list are available on the client, feedback events are generated. To evaluate the system, the feedback events are sent over a USB-serial port to the Teensy. The Teensy replies over the serial port with the current playback position of the audio. The playback position is compared with the expected position of the beat and the difference is reported in milliseconds. Negative values mean that the feedback event came before the actual audible beat; positive values mean the opposite. The system is tested using the data set presented in Table 6. It features music in a broad BPM range with a clear, relatively stable beat.
The results are depicted in Figure ?. The system responds on average 16 ms before the beat. This allows feedback events to be perceived together with the beat. Depending on the tempo (BPM) of the music and the type of feedback, it might be needed to schedule events later or even sooner. This can be done by adapting the latency parameter, which modifies the timing of the scheduled feedback events. However, there is a large standard deviation of 42 ms. The current system is limited to an accuracy of 32 ms: the duration of a block of 256 audio samples, sampled at 8000 Hz, which is the block size used in the fingerprinter. In Figure ? each histogram bin is 32 ms wide and centered around -16 ms. The results show that the system is able to recognize the correct audio block but is sometimes one block off. The main issue here is the unpredictable nature of scheduling in Java: Java threads are not guaranteed to start with predictable, millisecond-accurate timing, and garbage collection can cause even larger delays. The largest delays are due to incorrectly recognized audio: repetition in music can cause the algorithm to return an incorrect absolute offset, which makes the beat drift. The results, however, do show that the concept of the system is very promising and that it can deliver time-dependent context.
In its current state the system listens to 12 seconds of audio and sends the fingerprints of those 12 seconds to the server; the state is reset after that. The system does not employ use-case-dependent heuristics. If it is known beforehand that the user will most likely listen to full tracks, the current state, time offset and beat lists could be reused intelligently to improve accuracy, especially in the case of repeating audio. Accuracy could also be improved by running a real-time onset detector and correlating the detected onsets with the beat list returned by the server; this would however make the system more computationally expensive.
The proposed system only supports recorded music. Supporting live music is challenging but could be done. The MATCH algorithm [53], for example, supports tracking live performances in real time via dynamic time warping. The basic song structure, however, needs to be kept intact during the live rendition, otherwise the pre-computed contextual data becomes useless. Another challenge is DJ-sets. Although recorded music is used, during DJ-sets the tempo of the music is often modified to match a previous piece of music. To support such situations a more involved acoustic fingerprinting algorithm is needed. Currently there are two algorithms described in the literature that report both time offsets and tempo modifications accurately [182].
Repetition is inherent in music. Especially in electronic music the exact same audio can be repeated several times. A fingerprinting algorithm that only uses a short audio excerpt could, in such cases, return an incorrect absolute offset. To alleviate this problem, context could also be taken into account. The type of data returned also needs to be considered: lyrics could be incorrect while tablature or beats could still be aligned correctly, since they do not depend as much on an absolute offset.
Since the system uses a computationally inexpensive algorithm, it can be executed on a smartphone. The implementation used here is compatible with Android, since it depends only on two Android-compatible libraries [175]. If only a small reference music library is used, all server components of the system could be moved to the smartphone. An app that offers aligned music videos for the music of one album could easily run all components on a smartphone without the need for an external server.
For the prototype, a database with pre-computed beat positions is created off-line using all the acoustic information of each song. However, it is possible to determine beat positions with a real-time beat tracking algorithm such as the one by [65]. Unfortunately, this poses several problems. Beat tracking involves an important predictive and reflective element: to correctly model beats based on a list of onsets extracted in real time, musical context is needed, and this context may simply not be available. Another issue is that the computational load of beat-tracking systems is often high; [135] gives an overview of beat tracking techniques, which are challenging to implement on smartphones. A third problem is that feedback needs to be provided before the actual acoustic beat is perceived by the user. Tactile feedback, for example, takes around 35 ms to be processed [56]. Feedback based on a real-time beat tracker - which introduces a latency by itself - would always be late. Generating feedback based on real-time beat tracking algorithms is therefore impractical, especially in the context of smartphones with low-quality microphones and restricted computational resources.
To further develop the prototype into an assistive technology, more fundamental research is needed to pinpoint the optimal type of feedback for each user group. The early-deafened, late-implanted CI user group is recognized as an ’understudied clinical population’ [61] for which models of auditory rhythm perception are underdeveloped. Insights into tactile or multi-modal rhythm perception for this specific group seem to be lacking from the academic literature. There is, however, a study that suggests that multi-sensory cues improve sensorimotor synchronization [56]. In [183] another fundamental issue is raised: in that study two participants seem to be able to perceive small timing deviations in audio but are unable to move accordingly. As the authors put it, “This mismatch of perception and action points toward disrupted auditory-motor mapping as the key impairment accounting for poor synchronization to the beat”. The question remains whether this holds for tactile-motor mappings, especially in the late-implanted CI user group.
A system was described that employs acoustic fingerprinting techniques to provide an augmented music listening experience. A prototype was developed that provides feedback synchronized with music being played in the environment. The system needs a dataset with fingerprints from reference audio and pre-computed beat lists. Since it offers fine-grained context awareness, it can show lyrics, scores, visuals, aligned music videos or other meta-data that enrich the listening experience. The system can also be used to trigger events linked to audio during, for example, a theater performance.
This chapter starts with a list of contributions. It also provides a place for discussing limitations and future work by introducing the term augmented humanities. It finishes with concluding remarks.
A general contribution of this doctoral thesis is a way to think about technical contributions to a field using the concepts of methods and services. The specific contributions of this doctoral research project are found in the individual articles.
The main contribution of Tarsos [173] is that it lowered the barrier for musicologists and students to quickly extract pitch class histograms. As an accessible, easy-to-use analysis tool it serves as a stepping stone to the automated methods for large-scale analysis which are also available in Tarsos. The underlying DSP library [174] became a stand-alone service and has been used in research and interactive music applications.
The contribution of the Panako [175] acoustic fingerprinting system is threefold. First, it features a novel algorithm for acoustic fingerprinting that is able to match queries with reference audio even if pitch-shifting and time-stretching are present. Although similarities to a previously patented method [208] were pointed out after publication, it remains a valuable contribution to the academic literature. The second contribution lies in the publicly available and verifiable implementation of the system and three other baseline algorithms [209], which serve as a service to the MIR community. The final contribution is the reproducible evaluation method, which has been copied by [182] and is part of the focus of [167]. To make the technology and capabilities of Panako better known, it has been featured in articles targeting the archival and library science communities [168].
[167] contributed to the discussion on reproducibility and its challenges in computational research. It summarizes the common problems and illustrates them by replicating – in full – a seminal acoustic fingerprinting paper. It also proposes ways to deal with reproducibility problems and applies these to the reproduced work.
[176] describes a novel way to synchronize multi-modal research data. The idea is to reduce the problem of data synchronization to audio-to-audio alignment. The main contribution of the work is the described method and its implementation. The method was used in practice [51] and extended with a real-time component during a master's thesis project [195].
[177], finally, enables augmented musical realities. The musical environment of a user can be enriched with all types of meta-data by using the blueprints of the system described in that article.
Another set of contributions are the solutions that enabled empirical studies. The development of various hard- and software systems with considerable technical challenges allowed research on various aspects of interaction with music. These solutions are featured in articles by [203]. Each of these solutions addressed one or more components typically present in an empirical research project: activation, measurement, transmission, accumulation or analysis. For more detail on each of these, please see ?.
To hint at the limitations of my current research and immediately offer perspectives on future work, the term augmented humanities is introduced.
In my research on MIR and archives, and in the digital humanities in general, cultural objects are often regarded as static, immutable objects. An example of this vision is the instrument collection of the Royal Museum for Central Africa in Tervuren, Belgium (RMCA). There, thousands of musical instruments are conserved and studied. These physical objects are cataloged according to sound producing mechanism or physical characteristics and digitized via photographs by the MIMO Consortium.
However, cultural artifacts typically thrive on interaction: interaction between performer, public, time, space, context, body and expressive capabilities. An unplayable musical instrument studied as a static ‘thing’ loses its interactive dynamics. The instrument is very much part of an interactive system between an expert performer, listeners and context. When a musical record of a participatory act of music making is listened to in silence over headphones, it loses much of its meaning. The static witness of this act captures only a ghost of its ‘true multi-dimensional self’. Other research questions can be answered when this interactive aspect is re-introduced.
This distinction between static and interactive could be summarized as a path from digital humanities to augmented humanities. The augmented humanities could re-introduce, augment or diminish interactive behavior to test certain hypotheses. The interactive behavior could be elicited or, conversely, prevented or sabotaged to gain insights into this behavior. The cultural object would be seen as an actor in this model. To interfere with or mediate interactions, augmented reality technology could be re-purposed. A few examples may clarify this vision.
The D-Jogger system by [133] could be seen as rudimentary augmented humanities research. It is a system for runners who listen to specific music while running. The music is chosen to have a tempo in the range of the runner's number of steps per minute. The system exploits the tendency many runners have to align their footfalls with the musical beats. So far, music is still an immutable actor in this interaction: runners simply align their actions with the music. However, if the music is modified dynamically to match its tempo with the runner's, a system with two active actors appears, each drawn to the other. The interaction can then be elicited by allowing the music to always sync to the runner, or the interaction can be prevented: the beat would then never align with a footfall. The runner can also be sped up by first allowing an alignment and then dynamically increasing the musical tempo slightly. The point is that music stops being an immutable, static object and becomes an actor that offers a view on the interactive system. Modifying the interaction sheds light on which musical features activate or relax movement [105], on the coupling strength between runner and musical tempo, and on the adaptive behavior of a runner.
A study by [39] examines musical scores of baroque dance music in which tempo indications are not present. The metronome and precise tempo indications on the score only appeared later, in the romantic period. Inferring a tempo from a static score is very challenging. A more interactive approach is to play the music in different tempi, let people with experience in dancing to baroque music move to the music, and subsequently determine an optimal tempo. In this example the static cultural object - a musical score - is again transformed into a malleable (with respect to tempo) interaction between dancers and performers.
Historically, MIR technologies were developed to describe and query large databases of music [26]. However, MIR techniques can be used in interactive settings as well. For example, in [177] acoustic fingerprinting techniques are used to evoke interaction. This system is capable of presenting the user with an augmented musical reality or, more generally, a computer-mediated reality. It allows the music in a user's environment to be augmented with additional layers of information. In the article a case is built around a cochlear implant user and dance music playing in his or her environment. The system emphasizes the musical beat with a tactile pulse in sync with the musical environment. This allows users with limited means of tracking musical beats to synchronize movement to beats.
To further this type of research, in which musical realities are constructed and interaction is modified, three aspects are required. First, insights are needed into core topics of the humanities and hypotheses on how humans develop meaningful, sense-giving interactions with their (musical) environment (many can be found in [103]). Secondly, a technological aspect is required to allow a laboratory setting in which environments can be adjusted (augmented) and responses recorded. The third aspect concerns the usability of these new technologies: it should be clear how convincing the augmented reality is, which relates to ecological validity.
The ArtScience Lab at the Krook in Ghent, together with the usability laboratories also at the Krook, provides the right infrastructure to show the potential of an augmented humanities approach. With [177] I have already taken a step in this direction and would like to continue along this path.
I have introduced a way to organize solutions for problems in systematic musicology by mapping them on a plane. One axis contrasts methods with services while the other axis deals with techniques: MIR-techniques versus techniques for empirical research. By engineering prototypes, tools for empirical research and software systems I have explored this plane and contributed solutions that are relevant for systematic musicology. More specifically I have contributed to tone scale research, acoustic fingerprinting, reproducibility in MIR and to several empirical research projects.
I have aimed for a reproducible methodology by releasing the source code of software systems under open source licenses and have evaluated systems with publicly available music when possible.
To illustrate the limitations of my current research and to propose a direction for future work, I have proposed the term augmented humanities, where hypotheses on musical interactions are tested by interfering in these interactions. Augmented reality technologies offer opportunities to do this in a convincing manner.
Articles in chronological order. Bold means included in the dissertation.
Articles in bold are included in the dissertation. The type of presentation is included at the end of each item.
Panel discussion, 2012: “Technological challenges for the computational modelling of the world’s musical heritage”, Folk Music Analysis Conference 2012 – FMA 2012, organizers: Polina Proutskova and Emilia Gomez, Seville, Spain
Guest lecture, 2012: Non-western music and digital humanities, for: “Studies in Western Music History: Quantitative and Computational Approaches to Music History”, MIT, Boston, U.S.
Guest lecture, 2011: “Presenting Tarsos, a software platform for pitch analysis”, Electrical and Electronics Eng. Dept., IYTE, Izmir, Turkey
Workshop 2017: “Computational Ethnomusicology – Methodologies for a new field”, Leiden, The Netherlands
Guest lectures, A002301 (2016–2017 and 2017-2018) “Grondslagen van de muzikale acoustica en sonologie” – Theory and Practice sessions together with dr. Pieter-Jan Maes, UGent
This dissertation has produced output in the form of scientific, peer-reviewed articles, but also considerable output in the form of research software systems and source code. This section lists the research software with a small description and links where source code and further information can be found.
Panako is an extendable acoustic fingerprinting framework. The aim of acoustic fingerprinting is to find small audio fragments in large audio databases. Panako contains several acoustic fingerprinting algorithms to make comparisons between them easy. The main Panako algorithm uses key points in a Constant-Q spectrogram as a fingerprint to allow pitch-shifting, time-stretching and speed modification. The aim of Panako is to serve as a platform for research on acoustic fingerprinting systems while striving to be applicable in small to medium scale situations.
Described in | [175] |
License | AGPL |
Repository | https://github.com/JorenSix/Panako |
Downloads | https://0110.be/releases/Panako |
Website | https://panako.be |
Tarsos is a software tool to analyze and experiment with pitch organization in all kinds of musics. Most of the analysis is done using pitch histograms and octave reduced pitch class histograms. Tarsos has an intuitive user interface and contains a couple of command line programs to analyze large sets of music.
Described in | [173] and [169] |
License | AGPL |
Repository | https://github.com/JorenSix/Tarsos |
Downloads | https://0110.be/releases/Tarsos |
Website | https://0110.be/tags/Tarsos |
TarsosDSP is a Java library for audio processing. Its aim is to provide an easy-to-use interface to practical music processing algorithms, implemented as simply as possible in pure Java and without any external dependencies. TarsosDSP features an implementation of a percussion onset detector and a number of pitch detection algorithms: YIN, the McLeod Pitch Method and the “Dynamic Wavelet Algorithm Pitch Tracking” algorithm. Also included are a Goertzel DTMF decoding algorithm, a time stretching algorithm (WSOLA), resampling, filters, simple synthesis, some audio effects, a pitch shifting algorithm and wavelet filters.
Described in | [174] |
License | GPL |
Repository | https://github.com/JorenSix/TarsosDSP |
Downloads | https://0110.be/releases/TarsosDSP |
Website | https://0110.be/tags/TarsosDSP |
SyncSink is able to synchronize video and audio recordings of the same event. As long as some audio is shared between the multimedia files, a reliable synchronization solution will be proposed. SyncSink is ideal for synchronizing video recordings of the same event made by multiple cameras, or for aligning a high-definition audio recording with a video recording (with lower-quality audio).
SyncSink is also used to facilitate synchronization of multimodal research data e.g. to research the interaction between movement and music.
Described in | [176] |
License | AGPL |
Repository | https://github.com/JorenSix/Panako |
Downloads | https://0110.be/releases/Panako |
Website | http://panako.be |
TeensyDAQ is a Java application to quickly visualize and record analog signals with a Teensy micro-controller and some custom software. It is mainly useful for quickly getting an idea of how an analog sensor reacts to different stimuli. Some of the features of TeensyDAQ:
Visualize or sonify up to five analog signals simultaneously in real-time.
Capture analog input signals with sampling rates up to 8000Hz.
Record analog input to a CSV-file and, using drag-and-drop, visualize previously recorded CSV-files.
While a capture session is in progress, go back in time and zoom, pan and drag to get a detailed view of your data.
License | GPL |
Repository | https://github.com/JorenSix/TeensyDAQ |
Downloads | https://0110.be/releases/TeensyDAQ |
AES | Audio Engineering Society |
AGPL | Affero General Public License |
AMDF | Average Magnitude Difference Function |
API | Application Programming Interface |
BER | Bit Error Rate |
BPM | Beats Per Minute |
CBR | Constant Bit Rate |
CC | Creative Commons |
CNRS | Centre National de la Recherche Scientifique |
CREM | Centre de Recherche en Ethnomusicologie |
CSV | Comma Separated Values |
DAQ | Data Acquisition |
DB | Database |
DEKKMMA | Digitalisatie van het Etnomusicologisch Klankarchief van het Koninklijk Museum voor Midden-Afrika |
DSP | Digital Signal Processing |
DTMF | Dual-Tone Multi-Frequency |
ECG | Electrocardiogram |
EEG | Electroencephalography |
ESCOM | European Society for the Cognitive Sciences of Music |
FFT | Fast Fourier Transform |
FMA | Folk Music Analysis |
FN | False Negative |
FP | False Positive |
FPS | Frames Per Second |
FWO | Fonds Wetenschappelijk Onderzoek |
GNU | GNU's Not Unix |
GPL | General Public License |
GPU | Graphical Processing Unit |
HD | Hard Disk |
HR | Heart Rate |
IASA | International Association of Sound and Audiovisual Archives |
IRCDL | Italian Research Conference on Digital Libraries |
ISMIR | International Society for Music Information Retrieval |
JNMR | Journal of New Music Research |
JVM | Java Virtual Machine |
KMMA | Koninklijk Museum voor Midden-Afrika |
LFO | Low Frequency Oscillator |
MFCC | Mel-Frequency Cepstral Coefficients |
MIDI | Musical Instrument Digital Interface |
MIR | Music Information Retrieval |
MIREX | Music Information Retrieval Evaluation eXchange |
MPM | McLeod Pitch Method |
PCH | Pitch Class Histogram |
PCM | Pulse Code Modulation |
QMUL | Queen Mary University of London |
RMCA | Royal Museum for Central Africa |
TN | True Negative |
TP | True Positive |
USB | Universal Serial Bus |
WSOLA | Waveform Similarity and OverLap Add |
XML | eXtensible Markup Language |
XSD | XML Schema Definition |
XSL | eXtensible Stylesheet Language |
YAAFE | Yet Another Audio Feature Extractor |