Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen
Joren Six
Proefschrift voorgelegd tot het behalen van
de graad van Doctor in de Kunstwetenschappen
Academisch jaar 2017-2018
Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen
Promotor: Prof. dr. Marc Leman
Doctoraatsbegeleidingscommissie: Dr. Olmo Cornelis, Dr. Frans Wiering
Examencommissie: Dr. Federica Bressan, Dr. Olmo Cornelis, Prof. dr. ir. Tijl De Bie, Prof. dr. Pieter-Jan Maes, Dr. Frans Wiering, Dr. Micheline Lesaffre, Dr. Luc Nijs
Proefschrift voorgelegd tot het behalen van
de graad van Doctor in de Kunstwetenschappen
Academisch jaar 2017-2018
Universiteit Gent
Faculteit Letteren en Wijsbegeerte
Vakgroep Kunst-, Muziek- en Theaterwetenschappen, IPEM
De Krook, Miriam Makebaplein 1, B-9000 Gent, België
This doctoral dissertation is the culmination of my research carried out at both IPEM, Ghent University and the School of Arts, also in Ghent. I have been lucky enough to pursue and combine my interests in both music and computer science in my research. As a trained computer scientist I have been applying my engineering background to problems in systematic musicology. The output of this work has been described in various articles, some of which are bundled in this dissertation.
Admittedly, my research trajectory does not follow the straightest path but meanders around several fields. This meandering has enabled me to enjoy various vistas and led me to a plethora of places - physically and intellectually - not easily reached without taking a turn now and again. I think this multi-disciplinary approach prepared me better for a hopefully somewhat stable career in research. I also had the time required to cover a lot of ground. At the School of Arts, Ghent I was employed for four years as a scientific employee. At IPEM, Ghent University I was given the opportunity to continue my work as a doctoral student, again for four years. This allowed me not only to gain a broad perspective but also to reach the depth required to contribute new knowledge and propose innovative methods.
It is safe to say that without Olmo Cornelis I would not have started this PhD project. Thanks to Olmo for writing the project proposal which eventually allowed me to start my research career at the School of Arts. The concept of having an engineer next to a humanities scholar was definitely enriching to me and I do hope that the opposite is also somewhat true. His guidance during those first four (and following) years was indispensable. His pointers on music theory, support with academic writing and ideas on computer aided (ethno)musicology are just a few examples.
Actually, I want to profoundly thank the whole group of colleague researchers at the School of Arts, or what was then known as the Royal Conservatory of Ghent. Undeniably, they had a defining influence on my research and personal life. I fondly remember many discussions in the cellar and at my desk at the Wijnaert. I struggle to think of another case where gossiping, scheming and collusion between a colleague and a colleague's partner could have a more positive outcome. Indeed, I do mean you: Ruth and Clara.
Later on, Marc Leman gave me the opportunity to continue my research at IPEM. I am very grateful to have been offered this privilege. I am quite aware that being able to pursue one’s interests by doing research is exactly that: a privilege. IPEM provided fertile ground to further my research. Marc’s hands-off approach displays a great amount of trust in his research staff. This freedom worked especially well for me since it made me self-critical on priorities and planning.
I would also like to acknowledge the IPEM bunch: a diverse collective of great individuals each in their own way. I especially would like to thank Ivan for the many hardware builds and creative ideas for practical solutions. Katrien for taking care of the administrative side of things. Esther for having the patience to listen to me whining about my kids. Jeska and Guy for proofreading this work. And all the others for the many discussions at the kitchen table during lunch and generally for being great colleagues.
Furthermore, I am very grateful to the RMCA (Royal Museum for Central Africa), Tervuren, Belgium for providing access to its unique archive of Central African music.
Thanks to my friends and family for the support over the years. I would especially want to thank Will for proofreading parts of this work and Emilie for pointing me to VEWA. Of course, it is no exaggeration to claim that this work would not be here without my parents. Thanks for kindling an interest in music, letting me attend piano lessons and for keeping me properly fed and relatively clean, especially in my first years. Thanks also to Bobon for caring for Oscar and Marit, often on short notice in those many small emergency situations. On the topic: I would like to leave Oscar and Marit out of this acknowledgment since they only sabotaged this work, often quite successfully. But they are a constant reminder of the relativity of things and I love them quite unconditionally. Finally, I would like to thank the light of my life, the daydream that accompanies me at night, the mother of my children: Barbara.
Een van de grote onderzoeksvragen in systematische muziekwetenschappen is hoe mensen met muziek omgaan en deze begrijpen. Deze wetenschap onderzoekt hoe muzikale structuren in relatie staan met de verschillende effecten van muziek. Deze fundamentele relatie kan op verschillende manieren benaderd worden. Bijvoorbeeld een perspectief vertrekkende vanuit traditie waarbij muziek aanzien wordt als een fenomeen van menselijke expressie. Een cognitief perspectief is een andere benadering, daarbij wordt muziek gezien als een akoestische informatiestroom gemoduleerd door perceptie, categorisatie, blootstelling en allerhande leereffecten. Een even geldig perspectief is dat van de uitvoerder waarbij muziek voortkomt uit gecoördineerde menselijke interactie. Om een muzikaal fenomeen te begrijpen is een combinatie van (deel)perspectieven vaak een meerwaarde.
Elk perspectief brengt methodes met zich mee die onderzoeksvragen naar concrete musicologische onderzoeksprojecten kunnen omvormen. Digitale data en software vormen tegenwoordig bijna altijd de kern van deze methodes. Enkele van die algemene methodes zijn: extractie van akoestische kenmerken, classificatie, statistiek en machine learning. Een probleem hierbij is dat het toepassen van deze empirische en computationele methodes technische oplossingen vraagt. Het ontwikkelen van deze technische oplossingen behoort vaak niet tot de competenties van onderzoekers, die doorgaans een achtergrond in de zachte wetenschappen hebben. Toegang tot gespecialiseerde technische kennis kan op een bepaald punt noodzakelijk worden om hun onderzoek verder te zetten. Mijn doctoraatsonderzoek situeert zich in deze context.
Ik presenteer in dit werk concrete technische oplossingen die bijdragen aan de systematische muziekwetenschappen. Dit gebeurt door oplossingen te ontwikkelen voor meetproblemen in empirisch onderzoek en door implementatie van onderzoekssoftware die computationeel onderzoek faciliteert. Om over de verschillende aspecten van deze oplossingen een overzicht te krijgen worden ze in een vlak geplaatst.
De eerste as van dit vlak contrasteert methodes met dienstverlenende oplossingen (services). Methodes dragen manieren aan om nieuwe inzichten te verwerven of geven aan hoe onderzoek in de systematische muziekwetenschappen kan gebeuren. Dienstverlenende oplossingen ondersteunen of automatiseren onderzoekstaken; ze kunnen de omvang van onderzoek vergroten door het eenvoudiger te maken om met grotere datasets aan de slag te gaan. De tweede as in het vlak geeft aan hoe sterk een oplossing leunt op Music Information Retrieval (MIR) technieken. MIR-technieken worden gecontrasteerd met verschillende technieken ter ondersteuning van empirisch onderzoek.
Mijn onderzoek resulteerde in dertien oplossingen die in dit vlak geplaatst worden. De beschrijving van zeven van die oplossingen is opgenomen in dit werk. Drie ervan vallen onder methodes en de resterende vier zijn dienstverlenende oplossingen (services). Het softwaresysteem Tarsos stelt bijvoorbeeld een methode voor om toonhoogtegebruik in de muzikale praktijk op grote schaal te vergelijken met theoretische modellen van toonladders. Het softwaresysteem SyncSink is een voorbeeld van een service. Het laat toe om onderzoeksdata te synchroniseren, wat het eenvoudiger maakt om meerdere sensorstromen of participanten op te nemen. Andere services zijn TarsosDSP en Panako. TarsosDSP kan kenmerken uit audio halen en Panako is een acoustic fingerprinting systeem.
In het algemeen volgen de gepresenteerde oplossingen een reproduceerbare methodologie. Computationeel en MIR onderzoek is niet altijd even makkelijk te reproduceren. In de voorgestelde oplossingen werd aandacht gegeven aan dit aspect. De software werd via open source licenties online geplaatst en de systemen werden zo veel als mogelijk getest met publiek beschikbare data. Dit maakt de processen transparant en verifieerbaar. Het stelt ook anderen in staat om de software te gebruiken, te bekritiseren en te verbeteren.
De belangrijkste bijdragen van dit doctoraatsonderzoek zijn de individuele oplossingen. Met Panako [175] werd een nieuw acoustic fingerprinting algoritme beschreven in de academische literatuur. Vervolgens werden toepassingen van Panako voor het beheer van digitale muziekarchieven beschreven en getest [168]. Tarsos [173] laat toe om toonhoogtegebruik op grote schaal te onderzoeken. Ik heb bijdragen geleverd aan de discussie rond reproduceerbaarheid van MIR-onderzoek [167]. Ook werd een systeem voor verrijkte muziekervaring voorgesteld [177]. Naast deze specifieke bijdragen zijn er ook algemene, zoals het conceptualiseren van technologische bijdragen aan de systematische muziekwetenschappen via het onderscheid tussen services en methodes. Als laatste werd het concept augmented humanities geïntroduceerd als een richting voor verder onderzoek.
One of the main research questions of systematic musicology is concerned with how people make sense of their musical environment. It is concerned with signification and meaning-formation and relates musical structures to effects of music. These fundamental aspects can be approached from many different directions. One could take a cultural perspective where music is considered a phenomenon of human expression, firmly embedded in tradition. Another approach would be a cognitive perspective, where music is considered as an acoustical signal of which perception involves categorizations linked to representations and learning. A performance perspective, where music is the outcome of human interaction, is equally valid. To understand a musical phenomenon, combining multiple perspectives often makes sense.
The methods employed within each of these approaches turn questions into concrete musicological research projects. It is safe to say that today many of these methods draw upon digital data and tools. Some of those general methods are feature extraction from audio and movement signals, machine learning, classification and statistics. However, the problem is that, very often, the empirical and computational methods require technical solutions beyond the skills of researchers who typically have a humanities background. At that point, those researchers need access to specialized technical knowledge to advance their research. My PhD-work should be seen within this context. In many respects I adopt a problem-solving attitude to problems that are posed by research in systematic musicology.
This work explores solutions that are relevant for systematic musicology. It does this by engineering solutions for measurement problems in empirical research and developing research software which facilitates computational research. These solutions are placed in an engineering-humanities plane. The first axis of the plane contrasts services with methods. Methods in systematic musicology propose ways to generate new insights into music related phenomena or contribute to how research can be done. Services for systematic musicology, on the other hand, support or automate research tasks, which allows the scope of research to change. A shift in scope allows researchers to cope with larger data sets, which offers a broader view on the phenomenon. The second axis indicates how important Music Information Retrieval (MIR) techniques are in a solution. MIR-techniques are contrasted with various techniques to support empirical research.
My research resulted in a total of thirteen solutions which are placed in this plane. The descriptions of seven of these are bundled in this dissertation. Three fall into the methods category and four into the services category. For example, Tarsos presents a method to compare performance practice with theoretical scales on a large scale. SyncSink is an example of a service. It offers a solution for synchronization of multi-modal empirical data and enables researchers to easily use more streams of sensor data or to process more participants. Other services are TarsosDSP and Panako. The former offers real-time feature extraction and the latter an acoustic fingerprinting framework.
Generally, the solutions presented in this dissertation follow a reproducible methodology. Computational research and MIR research are often problematic to reproduce due to code that is not available, copyrights on music which prevent sharing evaluation data sets, and a lack of incentive to spend time on reproducible research. The works bundled here do pay attention to aspects relating to reproducibility. The software is made available under open source licenses and the systems are evaluated using publicly available music as much as possible. This makes processes transparent and systems verifiable. It also allows others, from in and outside academia, to use, criticize and improve the systems.
The main contributions of my doctoral research are found in the individual solutions. Panako [175] contributed a new acoustic fingerprinting algorithm to the academic literature. Subsequently, applications of Panako for digital music archive management were described and evaluated [168]. Tarsos [173] facilitates large-scale analysis of tone scale use. I have contributed to the discussion on meaningful contributions to and reproducibility in MIR [167]. I have also presented a framework for active listening which enables augmented musical realities [177]. Next to these specific contributions, the more general contributions include a way to conceptualize contributions to systematic musicology along a methods versus services axis and the concept of augmented humanities as a future direction of systematic musicological research.
The first chapter outlines the problem and situates my research in the context of systematic musicology, engineering and digital humanities. It also introduces a plane in which solutions can be placed. One axis of this plane contrasts methods and services. The other axis differentiates between MIR and other techniques. The first chapter continues with a section on the general methodology which covers aspects of reproducibility. It concludes with a summary.
The next two chapters bundle seven publications in total (Chapters 2 and 3). The publications bundled in these chapters only underwent minor cosmetic changes to fit the citation style and layout of this dissertation. Each bundled publication has been subjected to peer review. The two chapters which bundle publications start with an additional introduction that focuses on how the works are placed in the broader framework of the overarching dissertation. Each introduction also contains bibliographical information which mentions co-authors together with information on the journal or conference proceedings where the research was originally published. I have limited the bundled publications to the ones for which I am the first author and which are most central to my research. This means that some works I co-authored are not included, which keeps the length in check. Some of those works are [46], [200] and [198]. However, they are situated in the plane introduced in chapter one. For a complete list of output, see Appendix A.
The fourth and final chapter offers a discussion together with concluding remarks. The contributions of my research are summarized there as well. Additionally, the term augmented humanities is introduced as a way to conceptualize future work.
Finally, the appendix contains a list of output (Appendix A). As output in the form of software should be seen as an integral part of this dissertation, software is listed as well. The appendix also includes a list of figures, tables and acronyms (Appendix B). The last part of the dissertation includes a summary in Dutch and the list of referenced works.
Systematic musicology
One of the main research questions of systematic musicology is concerned with how people make sense of their musical environment. It deals with signification and meaning-formation and relates to how music empowers people [111], how relations between musical structures and meaning formation should be understood and which effects music has. These ‘fundamental questions are non-historical in nature’ [55], which contrasts systematic musicology with historical musicology.
There are many ways in which the research questions above can be approached or rephrased. For example, the questions can also be approached from a cultural perspective, where music is considered as a phenomenon of human expression embedded in tradition, and driven by innovation and creativity. The questions can be approached from a cognitive perspective, where music is considered as information, or better: as an acoustical signal, of which the perception involves particular categorizations, cognitive structures, representations and ways of learning. Or they can be approached from a performance perspective, where music is considered as the outcome of human interactions, sensorimotor predictions and actions and where cognitive, perceptive processing goes together with physical activity, emotions, and expressive capacities. All these perspectives have their merits and to understand a phenomenon a multi-perspective approach is often adopted, based on bits and pieces taken from each approach.
Admittedly, the above list of approaches may not be exhaustive. The list is only meant to indicate that there are many ways in which musicology approaches the question of meaning formation, signification, and empowerment. Likewise, there are many ways to construct a multi-perspectivistic approach to the study of musical meaning formation. A more exhaustive historical overview of the different sides of musicology and the research questions driving its (sub)fields is given by [55].
Accordingly, the same remark can be made with respect to the methods that turn the above mentioned approaches and perspectives into concrete musicological research projects. It would be possible to list different methods that are used in musicology but it is not my intention to attempt to give such a list, and certainly not an exhaustive list. Instead, what concerns me here is the level below these research perspectives; one could call it the foundational level of the different methodologies that materialize the human science perspectives.
At this point I believe that it is safe to say that many of the advanced research methodologies in musicology today draw upon digital data and digital tools.
Research in systematic musicology provides a prototypical example of such an environment since this research tradition is at the forefront of development in which advanced digital data and tools are used and required. [7] compiled an up-to-date reference work. Indeed, while the research topics in systematic musicology have kept their typical humanities flavor – notions such as ‘expression’, ‘value’, ‘intention’, ‘meaning’, ‘agency’ and so on are quite common – the research methods have gradually evolved in the direction of empirical and computational methods that are typically found in the natural sciences [79]. A few examples of such general methods are feature extraction from audio and movement signals, machine learning, classification and statistics. This combination of methods gives systematic musicology its inter-disciplinary character.
However, the problem is that, very often, the empirical and computational methods require technical solutions beyond the skills of researchers that typically have a humanities background. At that point, those researchers need access to specialized technical knowledge to advance their research.
Let me give a concrete example to clarify my point. The example comes from a study about African music, where I collaborated with musicologist and composer dr. Olmo Cornelis on an analysis of the unique and rich archive of audio recordings at the Royal Museum for Central Africa [42]. The research question concerned music scales: have these scales changed over time due to African acculturation to European influences? Given the large number of audio recordings (approximately 35000), it is useful to apply automatic music information retrieval tools that assist the analysis; for example, tools that can extract scales from the recordings automatically, and tools that can compare the scales from African music with scales from European music. Traditionally, such tools are not made by the musicologists that do this kind of analysis, but by engineers that provide digital solutions to such a demand. If the engineers do not provide the tools, then the analysis is not possible or extremely difficult and time consuming. However, if the musicologists do not engage with engineers to specify needs in view of a research question, then engineers cannot provide adequate tools. Therefore, both the engineer and the musicologist have to collaborate in close interaction in order to proceed and advance research.
My PhD-work should be seen within the context of that tradition. In many respects I adopt a problem-solving attitude to problems that are posed by research in systematic musicology. Often the problems themselves are ill-defined. My task, therefore, is to break down ill-defined problems and to offer well-defined solutions. Bridging this gap requires close collaboration with researchers from the humanities and continuous feedback on solutions to gain a deep understanding of the problems at hand. To summarize my research goal in one sentence: the goal of my research is to engineer solutions that are relevant for systematic musicology.
Overall, it is possible to consider my contribution along two different axes. One axis covers the distinction between engineering methods and engineering services that are relevant to musicology. The other axis covers the range of engineering techniques used: they either draw on Music Information Retrieval techniques or on a number of other techniques here categorized as techniques for empirical research. These two axes together define a plane. This plane could be called the engineering-humanities plane since it offers a way to think about engineering contributions in and for the humanities. Note that the relation between this work and the humanities will be more clearly explained in a section below (see ?). The plane allows me to situate my solutions, as shown in Figure 1.
The vertical axis specifies whether an engineering solution is a service or a method. Some solutions have aspects of both services and methods and are placed more towards the center of this axis. [204] makes a similar distinction but calls it: computation for and computation in humanities. Computation for humanities is defined as the ‘instrumental use of computing for the sake of humanities’. The engineering solutions are meant to support or facilitate research, and therefore, computation-for can also be seen as a service provided by engineering. Computation-in humanities is defined as that which ‘actively contributes to meaning-generation’. The engineering solutions are meant to be used as part of the research methodology, for example, to gather new insights by modeling aspects of cultural activities, and therefore, the computation-in can be seen as methods provided by engineering.
In my PhD I explore this axis actively by building tools, prototypes, and experimental set-ups. Collectively these are called solutions. These solutions should be read with the indefinite article in mind: they present a solution in a specific context, not the solution. They are subsequently applied in musicological research. My role is to engineer innovative solutions which support or offer opportunities for new types of research questions in musicology.
An example of a solution situated at the service-for side: a word processor that allows a researcher to easily lay out and edit a scientific article. Nowadays, this seems like a trivial example but it is hard to quantify the productivity gained by employing word processors for a key research task, namely describing research. TeX, the predecessor of the typesetting system used for this dissertation, was invented specifically as a service for the (computer) science community by [91]. For the interested reader: a technological history of the word processor and its effects on literary writing is described by [87].
The services-for are contrasted with methods-in. Software situated at the method-in side is, for example, a specialized piece of software that models dance movements and is able to capture, categorize and describe prototypical gestures, so that new insights into that dance repertoire can be generated. It can be seen as a method in the humanities. The distinction can become blurry when solutions appear as method-in and as service-for depending on the usage context. An example is Tarsos (Figure 1, [173]; the article on Tarsos is also included in section [173]). If Tarsos is used to verify a pitch structure it is used as a service-for. If research is done on the pitch structures of a whole repertoire and generates novel insights, it can be seen as a method-in.
Engineering solutions become methods-in humanities
The relationship between computing and the humanities, of which musicology is a sub-discipline, has been actively investigated in the digital humanities.
Please note that I do not deal with the question whether the specific perspective is relevant for the multi-layered meanings existing in a population of users of those artifacts. What interests me here is how engineering can provide solutions to methods used by scholars studying cultural artifacts, even if those methods cover only part of the multi-layered meaning that is attached to the artifacts.
To come to a solution, it is possible to distinguish three steps. In the first step, the solution does not yet take into account a specific hardware or software implementation. Rather, the goal is to get a clear view on the scholar’s approach to the analysis of cultural artifacts. In the second step, then, it is necessary to take into account the inherent restrictions of an actual hard- or software implementation.
Finally, in the third step, an implementation follows. This results in a working tool, a solution. The solution can be seen as a model of the method. It works as a detailed, (quasi) deterministic model of a methodology.
Rather than building new solutions, it is possible that a researcher finds an off-the-shelf solution that is readily available. However, it often turns out that a set of thoughtful application-specific modifications may be necessary in order to make the standard piece of software applicable to the study of specific cultural artifacts.
While the above mentioned steps towards an engineering solution may appear as a rather deterministic algorithm, the reality is much different. Due to technical limitations the last two steps may strongly influence the ‘preceding’ step. Or, to put a more positive spin on it: the technical possibilities and opportunities may in turn influence the research questions of a researcher in the humanities. To go even one step further, technical solutions can be an incentive for inventing and adopting totally new methods for the study of cultural artifacts. Accordingly, it is this interchange between the technical implementation, modeling and application in and for the humanities that forms the central theme of my dissertation.
This view, where technical innovations serve as a catalyst for scientific pursuits in the humanities, reverses the idea that a humanities scholar should find a subservient engineer to implement a computational model. It replaces subservience with an equal partnership between engineers and humanities scholars. The basic argument is that technical innovations often have a profound effect on how research is conducted, and technical solutions may even redirect the research questions that can be meaningfully handled. In this section I have been exclusively dealing with computational solutions for modeling cultural artifacts (methods-in). Now it is time to go into detail on services-for which may also have a profound effect on research.
Engineering solutions become services-for humanities
Engineering solutions for humanities are often related to the automation or facilitation of research tasks. Such solutions facilitate research in humanities in general, or musicology in particular, but the solutions cannot be considered methods. The research tasks may concern tasks that perhaps can be done by hand but that are tedious, error-prone and/or time consuming. When automated by soft- and hardware systems, they may be of great help to the researcher so that the focus can be directed towards solving the research question instead of practical matters. A simple example is a questionnaire. When done on paper, a lot of time is needed to transcribe data into a workable format. However, when filled out digitally, it may be easy to get the data in a workable format.
Solutions that work as services for the humanities often have the power to change the scope of research. Without an engineered solution, a researcher may have been able to analyze a selected subset of artifacts. With an engineered solution, a researcher may be able to analyze large quantities of artifacts. Successful services are the result of close collaboration and tight integration. Again, I claim that an equal partnership between humanists and engineers is a better model to understand how services materialize in practice.
For example, the pitch analysis tool implemented in Tarsos [173] can handle the entire collection of 35000 recordings of the KMMA collection. Accordingly, the scope of research changes dramatically. From manual analysis of a small set of perhaps 100 songs, to automatic analysis of the entire collection over a period of about 100 years. This long-term perspective opens up entirely new research questions, such as whether Western influence affected tone-scale use in African music. Note that by offering Tarsos as a service, methods may need to be reevaluated.
Another example in this domain concerns the synchronization of different data streams used in studies of musical performance. Research on the interaction between movement and music indeed involves the analysis of multi-track audio, video streams and sensor data. These different data streams need to be synchronized in order to make a meaningful analysis possible and my solution offers an efficient way to synchronize all these data streams [176]. This solution saves a lot of effort and tedious alignment work that otherwise has to be done by researchers whose focus is not synchronization of media, but the study of musical performance. The solution is also the result of an analysis of the research needs on the work floor and has been put in practice [51]. It again enables research on a different scope: a larger number of independent sensor streams with more participants can be easily handled.
To summarize the method-services axis: methods have cultural artifacts at the center and generate new insights, whilst services facilitate research tasks which have the potential to profoundly influence research praxis (e.g. research scope). The other axis in the plane deals with the centrality of MIR-techniques in each solution.
The second axis of the engineering-humanities plane specifies an engineering approach. It maps how central music information retrieval (MIR) techniques are in each solution. The category of techniques for empirical research includes sensor technology, micro-controller programming, analog to digital conversion and scripting techniques. The MIR-techniques turned out to be very useful for work on large music collections, while the techniques in analogue-digital engineering turned out to be useful for the experimental work in musicology.
One of the interesting aspects of my contribution, I believe, is concerned with solutions that are situated in between MIR and tools for experimental work, such as the works described in [176] and [177]. To understand this, I provide some background to the techniques used along this axis. To get a grasp of these techniques, it is perhaps best to start with a short introduction to MIR and related fields.
Symbol-based MIR
The most generally agreed upon definition of MIR is given by [54].
Originally the field was mainly involved in the analysis of symbolic music data. A music score, encoded in a machine readable way, was the main research object. Starting from the 1960s computers became more available and the terms computational musicology and music information retrieval were coined. The terms immediately hint at the duality between searching for music - accessibility, information retrieval - and improved understanding of the material: computational musicology. [26] provides an excellent introduction and historic overview of the MIR research field.
Signal-based MIR
As computers became more powerful in the mid to late nineties, desktop computers performed better and better on digital signal processing tasks. This, combined with advances in audio compression techniques, cheaper digital storage, and accessibility to Internet technologies, led to vast amounts, or big data collections, of digital music. The availability of large data sets, in turn, boosted research in MIR but now with musical signals at the center of attention.
Signal-based MIR aims to extract descriptors from musical audio. MIR-techniques are based on low-level feature extraction and on classification into higher-level descriptors. Low-level features contain non-contextual information close to the acoustic domain such as frequency spectrum, duration and energy. Higher-level musical content description focuses on aspects such as timbre, pitch, melody, rhythm and tempo. The highest level is about expressive, perceptual, contextual interpretations that typically focus on factors related to movement, emotion, or corporeal aspects. This concept has been captured in a schema by [110] and copied here in Figure 3. The main research question in MIR is often how to turn a set of low-level features into a set of higher-level concepts.
For example, harmony detection can be divided into low-level instantaneous frequency detection and a perceptual model that transforms frequencies into pitch estimations. Finally, multiple pitch estimations are integrated over time - contextualized - and contrasted with tonal harmony, resulting in a harmony estimation. In the taxonomy of [110], this means going from the acoustical over the sensorial and perceptual to the cognitive or structural level.
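To make this step from low-level features to a higher-level concept more tangible, the toy sketch below (a hypothetical illustration, not code from any of the solutions in this dissertation) folds frame-level frequency estimates into the twelve Western pitch classes and matches the resulting profile against simple major and minor triad templates.

```python
import numpy as np

# Toy illustration: from low-level frequency estimates to a harmony label.
# All names and values are illustrative assumptions.

A4 = 440.0  # reference tuning; assumes 12-tone equal temperament

def freq_to_pitch_class(freq_hz):
    """Map a frequency to one of the 12 Western pitch classes (0 = A)."""
    semitones_from_a4 = 12 * np.log2(freq_hz / A4)
    return int(round(semitones_from_a4)) % 12

def chord_label(frame_frequencies):
    """Accumulate frame-level frequency estimates into a pitch-class
    profile and match it against simple triad templates."""
    profile = np.zeros(12)
    for f in frame_frequencies:
        profile[freq_to_pitch_class(f)] += 1
    templates = {}
    for root in range(12):
        templates[f"major-{root}"] = {root, (root + 4) % 12, (root + 7) % 12}
        templates[f"minor-{root}"] = {root, (root + 3) % 12, (root + 7) % 12}
    scores = {name: profile[list(pcs)].sum() for name, pcs in templates.items()}
    return max(scores, key=scores.get)

# Frequencies close to A, C# and E should yield the A major template (root 0).
print(chord_label([220.1, 277.0, 329.9, 440.2]))
```

Real systems add a perceptual model, temporal smoothing and key context; the sketch only shows the direction of the mapping from the acoustical to the structural level.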
Usually, the audio-mining techniques deliver descriptions for large sets of audio. The automated approach is time-saving and often applied to audio collections that are too large to annotate manually. The descriptors related to such large collections and archives provide a basis for further analysis. In that context, data-mining techniques become useful. The techniques focus on the computational analysis of a large volume of data, using statistical correlation and categorization techniques that look for patterns, tendencies, groupings and changes.
A more extensive overview of the problems that are researched in MIR is given in the books by [89], [135] and [152]. The overview article by [26] gives more insights into the history of MIR. The proceedings of the yearly ISMIR conference offer a detailed view of ongoing research in the field.
MIR and other research fields
Despite the historically close link between symbol-based MIR and computational musicology, the link between signal-based MIR and music cognition has not been as strong. At first sight, signal-based MIR and music cognition are both looking at how humans perceive, model and interact with music. However, MIR is more interested in looking for pragmatic solutions for concrete problems, less in explaining processes, while music cognition research, on the other hand, is more interested in explaining psychological and neurological processes. The gap between the fields has been described by [5].
An example in instrument recognition may clarify the issue. In the eyes of a MIR researcher, instrument recognition is a matter of ‘applying instrument labels correctly to a music data set’. Typically, the MIR researcher extracts one or more low-level features from music with the instruments of interest, trains a classifier and applies it to unlabeled music. Finally, the MIR researcher shows that the approach improves the current state of the art. [115] present such an algorithm, in this case based on MFCCs (Mel-frequency cepstral coefficients).
In contrast, music cognition research tends to approach the same task from a different perspective. The question may boil down to ’How are we able to recognize instruments?’. Neuropsychological experiments carried out by [139] suggest how music instrument recognition is processed. The result is valuable and offers useful insights in processes. However, the result does not mention how the findings could be exploited using a computational model.
MIR and music cognition can be considered as two prongs of the same fork. Unfortunately the handle seems to be missing: only a few exceptions have tried to combine insights into auditory perception with computational modeling and MIR-style feature extraction. One such exception is the IPEM-Toolbox [104]. While this approach was picked up and elaborated in the music cognition field [92], it fell on deaf ears in the MIR community. Still, in recent years, some papers claim that it is possible to improve the standards, the evaluation practices, and the reproducibility of MIR research by incorporating more perception-based computational tools. This approach may generate some impact in MIR [6].
The specific relation between MIR and systematic musicology is examined in a paper by [102]. It is an elaboration on a lecture given at the University of Cologne in 2003, which had the provocative title “Who stole musicology?”. Leman observes that in the early 2000s there was a sudden jump in the number of researchers working on music. While the number of musicology scholars remained relatively small, engineers and neuroscientists massively flocked to music. They offered intricate computational models and fresh views on music perception and performance. Engineers and neuroscientists actively and methodologically contributed to the understanding of music with such advances and in such large numbers that Leman posed that “if music could be better studied by specialized disciplines, then systematic musicology had no longer a value”. However, Leman further argues that there is a value in modern systematic musicology that is difficult to ‘steal’, which is ‘music’. This value (of music) depends on the possibility to develop a trans-disciplinary research methodology, while also paying attention to a strong corporeal aspect that is currently largely ignored by other fields. This aspect includes “the viewpoint that music is related to the interaction between body, mind, and physical environment; in a way that does justice to how humans perceive and act in the world, how they use their senses, their feelings, their emotions, their cognitive apparatus and social interactions”. While this point was elaborated in his book on embodied music cognition [101], it turns out that this holistic approach is still - ten years later - only very rarely encountered in MIR research.
Computational Ethnomusicology
It is known that the signal-based methods developed by the MIR community overwhelmingly target classical Western music or commercial pop [43], whereas the immense diversity in music all over the world is largely ignored.
For example, a chroma feature shows the intensity of each of the 12 Western notes at a point in time in a musical piece. Chroma features consequently imply a tonal organization with an octave divided into 12 equal parts, preferably with A tuned to 440 Hz. Methods that build upon such chroma features perform well on Western music but they typically fail on non-Western music that has another tonal organization. By itself this is no problem; it is simply a limitation of the method (model), and chroma can be adapted for other tonal organizations. However, the limitation is a problem when such a tool is applied to music that does not have a tonal space that the tool can readily measure. In other words, it is necessary to keep in mind that the toolbox of a MIR researcher is full of methods that make assumptions about music, while these assumptions do not hold universally. Therefore, one should be careful about Western music concepts in standardized MIR methods. They cannot be applied blindly to other musics without careful consideration.
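The contrast can be made concrete with a minimal sketch (illustrative only, with an assumed 440 Hz reference): a 12-bin chroma vector forces every frequency into the Western pitch classes, while a fine-grained histogram over cents makes no assumption about the number of scale steps per octave.

```python
import numpy as np

REF = 440.0  # 440 Hz reference; itself a Western convention

def chroma_vector(freqs):
    """12-bin chroma: assumes an octave divided into 12 equal parts."""
    bins = np.zeros(12)
    for f in freqs:
        bins[int(round(12 * np.log2(f / REF))) % 12] += 1
    return bins

def cents_histogram(freqs, resolution=6):
    """Pitch class histogram with `resolution`-cent bins (1200/resolution
    bins per octave); no assumption about the number of scale steps."""
    bins = np.zeros(1200 // resolution)
    for f in freqs:
        cents = (1200 * np.log2(f / REF)) % 1200
        bins[int(cents // resolution) % len(bins)] += 1
    return bins

# A pitch 50 cents 'between' two Western semitones is forced into one of
# the 12 chroma bins, but keeps its own bin in the cents histogram.
odd_pitch = REF * 2 ** (50 / 1200)
print(np.argmax(chroma_vector([odd_pitch])), np.argmax(cents_histogram([odd_pitch])))
```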
Computational ethnomusicology is a research field in which MIR tools are adopted or re-purposed so that they can provide specialized methods and models for all kinds of musics. For a more detailed discussion on this see page eight of [41]. The field aims to provide better access to different musics and to offer a broader view, including video, dance and performance analysis. [194] redefine this field.
Accordingly, re-use of MIR tools demands either a culture-specific approach or a general approach. In the first case, specific structural elements of the music under analysis are encoded into models. For example in Afro-Cuban music, elements specific to the ‘clave’ can be encoded so that a better understanding of timing in that type of music becomes possible [213]. While this solution may limit the applicability to a specific (sub)culture, it also allows deep insights into mid- and high-level concepts of a possible instantiation of music. In that sense, the MIR solutions to harmonic analysis can be seen as a culture-specific approach, yielding meaningful results for Western tonal music only.
One of the goals of the general approach is to understand elements in music that are universal. In that sense, it corresponds quite well with what systematic musicology is aiming at: finding effects of music on humans, independent of culture or historical period. Given that perspective, one could say that rhythm and pitch are fundamental elements of such universals in music. They appear almost always, across all cultures and all periods of time. Solutions that offer insights into frequently reused pitches, for example, may therefore be generally applicable. They have been applied to African, Turkish, Indian and Swedish folk music [173]. Most often, however, such solutions are limited to low-level musical concepts. The choice seems to boil down to: being low-level and universal, or high-level and culture-specific.
Both culture-specific and what could be called culture-invariant approaches to computational ethnomusicology should further understanding and allow innovative research. Computational ethnomusicology is defined by [194] as “the design, development and usage of computer tools that have the potential to assist in ethnomusicological research.”. This limits the research field to a service-for ethnomusicology. I would argue that the method-in should be part of this field as well. A view that is shared, in slightly different terms, by [69]: “computer models can be ‘theories’ or ‘hypotheses’ (not just ‘tools’) about processes and problems studied by traditional ethnomusicologists”. While computational ethnomusicology is broader in scope, the underlying techniques have a lot in common with standard MIR-techniques which are present also in my own research.
MIR-techniques
With the history and context of MIR in mind, it is now possible to highlight the techniques in my own work. It helps to keep the taxonomy by [110], included in Figure 3, in mind.
Tarsos [173] is a typical signal-based MIR-system in that it draws upon low-level pitch estimation features and combines those to form higher-level insights: pitch and pitch class histograms relating to scales and pitch use. Tarsos also offers similarity measures for these higher-level features. They allow one to define how close two musical pieces are in that specific sense. In the case of Tarsos these are encoded as histogram similarity measures (overlap, intersection). Several means to encode, process and compare pitch interval sets are present as well. Tarsos has a plug-in system that allows starting from any system able to extract pitch from audio, but by default the TarsosDSP [174] library is used.
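As an illustration of such a histogram similarity measure, the sketch below computes a normalized histogram intersection between two pitch class histograms and adds a rotation step so that transpositions of the same scale still match. It only illustrates the idea behind the measures mentioned above; it is not Tarsos' actual implementation.

```python
import numpy as np

def normalize(hist):
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Returns a value between 0 (no overlap) and 1 (identical shapes)."""
    return np.minimum(normalize(h1), normalize(h2)).sum()

def best_shift_similarity(h1, h2):
    """Pitch class histograms are circular: compare under all rotations so
    that a transposition of the same scale still scores high."""
    return max(histogram_intersection(h1, np.roll(h2, s)) for s in range(len(h2)))

# Hypothetical 5-tone scale in a 120-bin pitch class histogram, and the
# same scale transposed by one bin.
pentatonic = np.zeros(120)
pentatonic[[0, 24, 48, 72, 96]] = 1
shifted = np.roll(pentatonic, 12)
print(histogram_intersection(pentatonic, shifted))  # low: bins do not line up
print(best_shift_similarity(pentatonic, shifted))   # 1.0 after rotation
```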
TarsosDSP is a low-level feature extractor. As mentioned previously, it has several pitch extraction algorithms but also includes onset extraction and beat tracking. It is also capable of extracting spectral features and much more. For details see [174]. It is a robust foundation to build MIR systems on. Tarsos is one example, but my work in acoustic fingerprinting is also based on TarsosDSP.
The acoustic fingerprinting work mainly draws upon spectral representations of audio (FFT or Constant-Q).
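To give an impression of this family of techniques, the sketch below implements a generic landmark-style fingerprint: prominent spectral peaks are paired into compact hashes that can be stored and looked up later. This is a simplified, hypothetical variant for illustration; it is not the published Panako algorithm, which additionally handles pitch shifts and time stretching.

```python
import numpy as np
from scipy.signal import stft

def spectral_peaks(samples, fs, peaks_per_frame=3):
    """Pick the most prominent frequency bins per STFT frame."""
    f, t, spec = stft(samples, fs=fs, nperseg=1024, noverlap=512)
    mag = np.abs(spec)
    peaks = []
    for frame in range(mag.shape[1]):
        top = np.argsort(mag[:, frame])[-peaks_per_frame:]
        peaks.extend((frame, int(b)) for b in top)
    return peaks

def fingerprints(peaks, fan_out=5, max_dt=32):
    """Combine nearby peak pairs into hashes: (bin1, bin2, time delta)."""
    hashes = []
    for i, (t1, b1) in enumerate(peaks):
        for t2, b2 in peaks[i + 1: i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.append(((b1, b2, dt), t1))  # hash plus anchor time
    return hashes

# Usage sketch: store hashes of reference audio in a database; hashes from a
# query fragment that agree on a constant time offset identify the recording.
fs = 8000
audio = np.random.default_rng(0).normal(size=fs * 2)  # stand-in for real audio
print(len(fingerprints(spectral_peaks(audio, fs))))
```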
Techniques for empirical research
While MIR-techniques form an important component in my research, here they are contrasted with various techniques for empirical research. The solutions that appear at the other extreme of the horizontal axis of the plane do not use MIR-techniques. However, there are some solutions for empirical research that do use MIR-techniques to some degree. These solutions are designed to support specific experiments with particular experimental designs. While the experimental designs can differ a lot, they do share several components that appear regularly. Each of these can present an experimenter with a technical challenge. I can identify five:
Activation. The first component is to present a subject with a stimulus. Examples of stimuli are sounds that need to be presented with millisecond-accurate timing, tactile feedback with vibration motors, or music modified in a certain way.
Measurement. The second component is the measurement of the phenomenon of interest. It is essential to use sensors that capture the phenomenon precisely and that do not interfere with the subject's task. Examples of measurement devices are wearables to capture a musician's movement, microphones to capture sound, witness video cameras and motion capture systems. Note that a combination of measurement devices is often needed.
Transmission. The third component aims to expose measurements in a usable form. This can involve translation via calibration to a workable unit (Euler angles, acceleration expressed in g-forces, strain-gauge measurements in kg). For example, it may be needed to go from raw sensor readings on a micro-controller to a computer system or from multiple measurement nodes to a central system.
Accumulation. The fourth component deals with aggregating, synchronizing and storing measured data. For example, it might be needed to capture measurements and events in consistently named text files with time stamps or a shared clock.
Analysis. The final component is the analysis of the measured data with the aim to support or disprove a hypothesis with a certain degree of confidence. Often standard (statistical) software suffices, but it might be needed to build custom solutions for the analysis step as well.
These components need to be combined to reflect the experimental design and to form reliable conclusions. Note that each of the components can either be trivial and straightforward or pose a significant challenge. A typical experiment combines available off-the-shelf hardware and software with custom solutions, which allows successful use in an experimental setting. In innovative experiments it is rare to find designs completely devoid of technical challenges.
For example, take the solution devised for [202]. The study shines a light on the effect of music on heart rate at rest. It uses musical fragments that are time-stretched to match the subjects’ heart rate as stimulus (activation). A heart-rate sensor is the main measurement device (measurement). A micro-controller, in this case an Arduino, takes the sensor values and sends them (transmission) to a computer. These components are controlled by a small program on the computer that initiates each component at the expected time, resulting in a research data set (accumulation) that can be analyzed (analysis). In this case the data are ‘changes in heart rate’ and ‘heart-rate variability’ when subjects are at rest or listen to specific music. The technical challenges and contributions were mainly found in high-quality time-stretching of the stimuli (activation) and reliably measuring heart rate at the fingertips (measurement). The other components could be handled with off-the-shelf components. Note that measuring heart rate with chest straps might have been more reliable but this would also have been more invasive, especially for female participants. Aspects of user-friendliness are almost always a concern and even more so if measurement devices interfere with the task: in this setup participants needed to feel comfortable in order to relax.
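As an illustration of the transmission and accumulation components in such a set-up, the hypothetical sketch below reads values sent by a micro-controller over a serial connection and stores them with timestamps. The port name, baud rate and line format are assumptions for the example and are not taken from the actual study.

```python
import csv
import time

import serial  # pyserial

# Hypothetical port and settings; adapt to the actual micro-controller.
PORT, BAUD = "/dev/ttyACM0", 115200

def record(duration_s=60.0, outfile="heart_rate.csv"):
    """Read one sensor value per line from the serial port and store it
    together with a timestamp relative to the start of the recording."""
    with serial.Serial(PORT, BAUD, timeout=1) as link, \
         open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_s", "sensor_value"])
        start = time.monotonic()
        while time.monotonic() - start < duration_s:
            line = link.readline().decode("ascii", errors="ignore").strip()
            if line.isdigit():  # assume the micro-controller prints one integer per line
                writer.writerow([round(time.monotonic() - start, 4), int(line)])

if __name__ == "__main__":
    record()
```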
Another example is the solution called the LI-Jogger (short for Low Impact Jogger, see [198]). A schema of the set-up can be found in Figure 4. This research aims to build a system that lowers footfall impact for at-risk amateur runners via biofeedback with music. Footfall impact is a known risk factor for a common running injury and lowering this impact in turn lowers this risk. Measurement involves measuring peak tibial (shin) acceleration in three dimensions with a high time resolution via a wearable sensor on the runner. Running speed needs to be controlled for. This is done via a sonar (as described by [114]). Accumulation of the data is a technical challenge as well since footfall measured by the accelerometer needs to be synchronized precisely with other sensors embedded in the sports science lab. To allow this, an additional infra-red (IR) sensor was used to capture the clock signal of the motion capture system, which is also followed by several other devices (force plates). In this case an extra stream of measurements was required purely to allow synchronization. In this project each component is challenging:
Activation. The biofeedback needs measurements in real time and modifies music to reflect these measurements in a certain way. This biofeedback is done on a battery-powered wearable device that should hinder the runner as little as possible.
Measurement. The measurement requires a custom system with high-quality 3D accelerometers. The accelerometers need to be light and unobtrusive. To allow synchronization and speed control, an additional IR sensor and a sonar were needed.
Transmission. To expose acceleration and other sensor data, custom software is required both on the transmitting micro-controller and on the receiving system.
Accumulation. Accumulation requires scripts that use the IR sensor stream to synchronize acceleration data with the motion capture system and other measurement devices (force plate, foot-roll measurement).
Analysis. Analysis of the multi-modal data is also non-trivial. The article that validates the measurement system [198] also required custom software to compare the gold-standard force-plate data with the acceleration data; a minimal sketch of such an analysis step is given after this list.
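The sketch referred to in the analysis item above could look as follows: peak tibial acceleration per footfall is estimated from a 1000 Hz accelerometer stream with a simple peak picker. The thresholds are placeholder values for illustration, not the parameters of the validated system described in [198].

```python
import numpy as np
from scipy.signal import find_peaks

FS = 1000  # samples per second, matching the 1000 Hz data rate mentioned above

def peak_tibial_acceleration(ax, ay, az, min_peak_g=3.0, min_interval_s=0.25):
    """Return the acceleration magnitude (in g) at each detected footfall.
    Threshold and minimal interval between footfalls are placeholder values."""
    magnitude = np.sqrt(np.asarray(ax) ** 2 + np.asarray(ay) ** 2 + np.asarray(az) ** 2)
    peaks, _ = find_peaks(magnitude,
                          height=min_peak_g,
                          distance=int(min_interval_s * FS))
    return magnitude[peaks]

# Synthetic example: two 'impacts' on top of low-level noise.
t = np.arange(0, 2, 1 / FS)
vertical = 0.2 * np.random.default_rng(1).normal(size=t.size) + 1.0
vertical[500] += 6.0
vertical[1500] += 5.0
print(peak_tibial_acceleration(vertical, np.zeros_like(vertical), np.zeros_like(vertical)))
```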
The techniques used in these tools for empirical research are often relatively mundane. However, they do span a broad array of hardware and software technologies that need to be combined efficiently to offer a workable solution. Often there is considerable creativity involved in engineering a cost-effective, practical solution in a limited amount of time. As can be seen from the examples it helps to be able to incorporate knowledge on micro-controllers, sensors, analog/digital conversion, transmission protocols, wireless technologies, data-analysis techniques, off-the-shelf components, scripting techniques, algorithms and data structures.
Now that the two axes of the humanities-engineering plane have been sufficiently clarified it is time to situate my solutions in this plane.
Given the above explanation of the axes, it is now possible to assign a location in this plane for each of the solutions that form the core of my doctoral research (Figure 1):
This solution introduces and validates a method to gauge the metric complexity of a musical piece, depending on the level of agreement between automatic beat estimation algorithms. The validation is done by comparing expert human annotations with annotations by a committee of beat estimation algorithms. The solution is based on MIR-techniques and it is used as a method to get insights into rhythmic ambiguity in sets of music. It is, therefore, placed in the upper left of the engineering-humanities plane. It is described by [46] and not included in this dissertation.
This solution introduces a method to automatically extract pitch and pitch class histograms from any pitched musical recording. It also offers many tools to process the pitch data. It has a graphical user interface but also a batch processing option. It can be used on a single recording or on large databases. It covers quite a big area in the plane: depending on the research it can be used as a method for large-scale analysis of scales or as a service to get insights into pitch use of a single musical piece. Tarsos is situated to the side of MIR-techniques since it depends on techniques such as feature extraction. It is described in [173], which is included in chapter [173].
The main contribution of this work is a reflection on reproducible research methodologies for computational research in general and MIR research in particular. As an illustration a seminal MIR-article is replicated and a reproducible evaluation method is presented. The focus is on methods of computational research and it utilizes MIR-techniques so it is placed at the methods/MIR-techniques side. It is described in [167], which is included in chapter [167].
This solution presents a method to compare meta-data, to reuse segmentation boundaries, to improve listening experiences and to merge digital audio. The method is applied to the data set of the RMCA archive, which offers new insights into the meta-data quality. The underlying technique is acoustic fingerprinting, a classical MIR-technique. The case study uses a service provided by [175]. The article [168] is included in chapter [168].
This digital signal processing (DSP) library is the foundation of Tarsos. TarsosDSP offers many low-level feature extraction algorithms in a package aimed at MIR researchers, students and developers. It is a piece of work in its own right and is situated on the MIR-techniques/service side. The service is employed in Tarsos. It has been used for speech rehabilitation [24], serious gaming contexts [157] and human machine interaction [155]. The article [174] is included in chapter [174].
This solution is an acoustic fingerprinting algorithm that allows efficient lookup of small audio excerpts in large reference databases. Panako works even if the audio underwent changes in pitch. It is placed firmly on the MIR and service side. There are many ways in which Panako can be used to manage large music archives. These different ways are discussed in [20]. See [175], which is included in chapter [175].
This solution offers a technology for augmented listening experiences. It effectively provides the tools for a computer-mediated reality. As with typical augmented reality technology, it takes the context of a user and modifies – augments or diminishes – it with additional layers of information. In this work the music playing in the environment of the user is identified with precise timing. This allows listening experiences to be enriched. The solution employs MIR-techniques to improve engagement of a listener with the music in the environment. There are, however, also applications in experimental designs for empirical research. So it is a service situated in between MIR-techniques and techniques for empirical research. It is described by [177], which is included in chapter [177].
This solution presents a general system to synchronize heterogeneous experimental data streams. By adding an audio stream to each sensor stream, the problem of synchronization is reduced to audio-to-audio alignment. It employs MIR-techniques to solve a problem often faced in empirical research. The service is placed more to the side of techniques for empirical research. [195] extended the off-line algorithm with a real-time version in his master’s thesis. The service is described in [176], which is included in chapter [176].
This empirical study compares tapping behaviour when subjects are presented with tactile, auditory and combined tactile and auditory cues. To allow this, a system is required that can register tapping behaviour and present subjects with stimuli with very precise timing. The main aim of the study is to present a method and report results on the tapping behaviour of subjects. So while the measurement/stimulus system could be seen as a service, the main contribution lies in the method and results. These are described in [166], which is not included in this dissertation.
This solution is a software/hardware system that measures foot-fall impact and provides immediate auditory feedback with music. It has been used in the previous section as an example (see ?) where every aspect of an experimental design poses a significant challenge. The innovative aspects are that impact is measured at a high data rate (1000Hz) in three dimensions by a wearable system that is synchronized precisely with other measurement modalities. LI-Jogger supports empirical research and is placed to the right of the plane. It aims to provide a view into how music modifies overground (vs treadmill) running movement, but the system itself is a service needed to achieve that goal. This solution is described in [198]. A large intervention study that will apply this solution is planned. The results of this study will be described in follow-up articles.
This solution includes an analysis and comparison of harmonics in a singing voice while inhaling and exhaling. MIR-techniques are used to gain insights into this performance practice. It is, therefore, situated at the methods/MIR-techniques side of the plane. My contribution was primarily in the data-analysis part. The findings have been described in [203], which is not included in this dissertation.
This solution is a heart-rate measurement system with supporting software to initiate and record conditions and present stimuli to participants. The stimulus is music with modified tempo to match a subject’s heart-rate. It has been used in the previous section as an example (see ?). The system made the limited influence of music on heart-rate clear in a systematic way. It is situated in the upper right quadrant. See [202].
This solution concerns a hardware and software system to measure engagement with live performances versus prerecorded performances for patients with dementia. My contribution lies in the software that combines various movement sensors and web-cameras that register engagement and in presenting the stimuli. The sensor streams and videos are synchronized automatically. Figure 5 shows synchronized video, audio and movement imported in the ELAN software [212]. It supports empirical research and is placed in the services category. See [51] for a description of the system and the data analysis. The article is not included in this dissertation.
Several of the listed solutions and the projects that use them show similarities to projects from the digital humanities. It is of interest to dive further into this concept and clarify this overlap.
As a final step in my attempt to situate my work in engineering and humanities I want to briefly come back to the digital humanities research field. Digital humanities is an umbrella term for research practices that combine humanities with digital technologies, which shows much overlap with my activities. It may therefore be useful for conceptually situating my activities in this dissertation.
There are a number of articles and even books that try to define the digital humanities [13]. There is even disagreement about whether it is a research field, “new modes of scholarship” [25] or a “set of related methods” [161]. [211] have given up: “Since the field is constantly growing and changing, specific definitions can quickly become outdated or unnecessarily limit future potential.” However, a definition is presented as broadly accepted by [86] as:
This definition is sufficiently broad that it would also work as a definition for systematic musicology, especially in relation to this dissertation. The works bundled here are at the intersection of computing and musicology; they involve invention and also make limited contributions to the knowledge of computing. Another relevant observation with respect to this dissertation is the following: “[Digital Humanities is inspired by]...the conviction that computational tools have the potential to transform the content, scope, methodologies, and audience of humanistic inquiry.” [25]. Again, this is similar to what is attempted in this thesis. Moreover, digital humanities projects generally have the following common traits, which they share with this dissertation project:
The projects are collaborative. Often engineers and humanities scholars collaborate with a shared aim.
The nature of the methods requires a transdisciplinary approach. Often computer science is combined with deep insights into humanities subjects.
The main research object is available in the digital domain. This immediately imposes a very stringent limitation. The research object and relations between research objects need a practical digital representation to allow successful analysis.
The term has its history in the literary sciences and grew out of ‘humanities computing’ or ‘computing in the humanities’ [13], where it was originally seen as “a technical support to the work of the ‘real’ humanities scholars” [13]. The term ‘digital humanities’ was coined to mark a paradigm shift from merely technical support of humanities research to a “genuinely intellectual endeavor with its own professional practices, rigorous standards, and exciting theoretical explorations” [74]. Today most digital humanities scholars are firmly embedded in either library sciences, history or literary science. Due to this history there are only a few explicit links with musicology, although the methods described in the digital humanities are sufficiently broad and inclusive.
One of the problems tackled by the digital humanities is the abundance of available digital data. The problem presents itself for example for historians: “It is now quite clear that historians will have to grapple with abundance, not scarcity. Several million books have been digitized by Google and the Open Content Alliance in the last two years, with millions more on the way shortly; nearly every day we are confronted with a new digital historical resource of almost unimaginable size.” [36].
As a way to deal with this abundance, [134] makes the distinction between ‘close reading’ and ‘distant reading’ of a literary text. With ‘close’ meaning attentive critical reading by a scholar, while ‘distant reading’ focuses on ‘fewer elements, interconnections, shapes, relations, structures, forms and models’. Moretti argues there is an urgent need for distant reading due to the sheer size of the literary field and the problem that only relatively few works are subjected to ‘close reading’: “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases [close reading], because it’s a collective system, that should be grasped as such, as a whole.” [134]
One example of this ‘distant reading’ approach is given by [153]. In this article each word is labeled with an emotional value: positive, neutral or negative. Words like ‘murder’ or ‘death’ are deemed negative, ‘friends’ or ‘happy’ are positive, while the neutral category contains words like ‘door’, ‘apple’ or ‘lever’. With this simple model a whole corpus is examined and six prototypical story arcs are detected. Examples include the Tragedy (a downward fall) and Cinderella (rise - fall - rise). The approach is a good example of how computational tools can quickly summarize features and how insights can be gathered for a whole system, something for which the ‘close reading’ approach would not work.
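To make the ‘distant reading’ idea concrete, the toy sketch below labels words as positive, negative or neutral and keeps a running sum over a text, which roughly traces an emotional arc. It only uses the example words mentioned above and is an illustration of the principle, not the method of [153].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy 'distant reading': label each word as positive (+1), negative (-1) or
// neutral (0) and keep a running sum; plotting the sum sketches the
// emotional arc of a story. The word lists only contain the examples
// mentioned in the text.
public class EmotionalArc {
    static final Set<String> POSITIVE = Set.of("friends", "happy");
    static final Set<String> NEGATIVE = Set.of("murder", "death");

    public static List<Integer> arc(List<String> words) {
        List<Integer> runningSum = new ArrayList<>();
        int sum = 0;
        for (String word : words) {
            String w = word.toLowerCase();
            if (POSITIVE.contains(w)) sum += 1;
            else if (NEGATIVE.contains(w)) sum -= 1; // neutral words leave the sum unchanged
            runningSum.add(sum);
        }
        return runningSum;
    }
}
```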
To some extent, the field of musicology followed a path similar to that of literary science. Large amounts of scores have been encoded into digital formats. The number of digitized - or digitally recorded - music recordings easily runs into the millions. Using terminology similar to [134], a distinction can be made between ‘close listening’ and ‘distant listening’.
To allow this broad, fair view on whole musical collections, this ‘distant listening’ needs to be as intelligent as possible. The construction of models of music, the selection of features and the corresponding similarity measures need to be done diligently. This is one of the main challenges in the MIR research field.
However, this focus on databases covers only one aspect of musicologists’ interests. As mentioned, there is a vast domain of research in musicology that focuses on music interaction. In fact, the work at IPEM is known world-wide for establishing a science of interaction (e.g. see [111]). Digital humanities is usually associated with fixed items like archives, databases, sets of literary works, historical GIS (Geographical Information System) data which, in my view, is too limited. Therefore, I introduce the term ‘augmented humanities’ later on (see Section 4.2) in order to cover that aspect of my work in which engineering solutions work as an element in a context of human-music interaction, rather than archives and MIR. However, as my dissertation shows, it turns out that MIR techniques are very useful for solving problems in human-music interaction.
First, more details are given on the reproducible methodology that is followed in the works presented in this dissertation.
In the works bundled in this dissertation, special efforts have been made to reach a reproducible, verifiable methodology. Reproducibility is one of the cornerstones of scientific methodology. A claim made in a scientific publication should be verifiable and the described method should provide enough detail to allow replication. If the research is carried out on a specific set of data, this data should be available as well. If not to the general public, then at least to peers or - even more limiting - to reviewers of the work. If those basics are upheld, then the work becomes verifiable, reproducible and comparable. It also facilitates improvements by other researchers.
In a journal article [167] on this topic I have bundled the main concerns with, and suggestions for, methodological computational research on music. The journal article details the problems with reproducibility in computational research and illustrates this by replicating, in full, a seminal acoustic fingerprinting paper. The main points of that article are repeated here with additional links to the works presented in this dissertation. The aim of this chapter is to strike a balance between providing common methodological background while avoiding too much repeated text.
The ideal, where methods are described in sufficient detail and data is available, is often not reached. From a technical standpoint, sharing tools and data has never been easier. Reproducibility, however, remains a problem, especially for Music Information Retrieval research and, more generally, research involving moderately complex software systems. Below, a number of general problems are identified and subsequently it is detailed how these are handled in the presented works.
Journal articles and especially conference papers have limited space for detailed descriptions of methods or algorithms. For moderately complex systems there are numerous parameters, edge cases and details which are glossed over in textual descriptions. This makes articles readable and the basic method intelligible, but those details need to be expounded somewhere, otherwise too much is left to assumptions and a perhaps missing shared understanding. The ideal place for such details is well documented, runnable code. Unfortunately, the intellectual property policies of universities or research institutions often limit researchers’ freedom to distribute code.
In the works presented in this dissertation attention has been given to sharing research code. While Ghent University is increasingly striving for open-access publication and is even working on a policy and support for sharing research data, there is to date little attention for research software. Arguably, it makes little sense to publish only part of the research in the open - the textual description - while keeping code and data behind closed doors, especially if the research is funded by public money. A clear stance on the intellectual property rights of research code would help researchers.
One way to counter this problem is to use open source licenses. These licenses “allow software to be freely used, shared and modified”.
On GitHub, a code repository hosting service, TarsosDSP has 14 contributors and has been forked more than 200 times. This means that 14 people contributed to the main code of TarsosDSP and that there are about 200 different flavors of TarsosDSP. These flavors were at some point split - forked - from the main code and are developed by people with a slightly different focus. On GitHub more than 100 bug reports have been submitted. This example highlights another benefit of opening up source code: the code is tested by others, bugs are reported and some users even contribute fixes and improvements.
The fact that the GPL was used also made it possible for me to further improve the Tarsos software myself after my transition from the School of Arts, Ghent to Ghent University, a process that could have been difficult if the code had not been required to be released under the GPL. Panako and SyncSink [175] also build upon GPL-licensed software and are therefore released under a similar license.
To make computational research reproducible both the source code and the data that was used need to be available. Copyrights on music make it hard to share music freely. Redistribution of historic field-recordings in museum archives is even more problematic. Due to the nature of the recordings, copyright status is often unclear. Clearing the status of tracks involves international, historical copyright laws and multiple stakeholders such as performers, soloists, the museum, the person who performed the field recording and potentially a publisher that already published parts on an LP. The rights of each stakeholder need to be carefully considered while at the same time they can be hard to identify due to a lack of precise meta-data and the passage of time. I see two ways to deal with this:
Pragmatic versus ecological, or Jamendo versus iTunes. There is a great deal of freely available music published under various Creative Commons licenses. Jamendo, for example, contains half a million Creative Commons-licensed tracks.
Audio versus features. Research on features extracted from audio does not need the audio itself; if the features are available, this can suffice. There are two large sets of audio features: the Million Song Dataset by [14] and AcousticBrainz.
In my own work an acoustic fingerprinting evaluation methodology was developed using music from Jamendo for Panako [175]. The exact same methodology was copied by [182] and elaborated on by myself [167] with a focus on methodological aspects. For acoustic fingerprinting the Jamendo dataset fits, since it offers a large variability in genres and is representative for those use cases.
For the research in collaboration with the Museum for Central Africa [174], the issues of unclear copyright and of ethical questions about sharing field recordings are pertinent. Indeed, the copyright status of most recordings is unclear. Ethical questions can be raised about the conditions in which the recordings were made and to what extent the recorded musicians gave informed consent for further use. The way this research was done follows the second guideline. The research is done on extracted features and only partially on audio. These features are not burdened by copyright issues and can be shared freely. More specifically, a large set of features was extracted by placing a computer at the location of the museum and processing each field recording. This method enables the research to be replicated – starting from the features – and verified.
Output by researchers is still mainly judged by the number of articles they publish in scientific journals or conferences. Other types of output are not valued as much. The incentive to put a lot of work into documenting, maintaining and publishing reproducible research or supplementary material is lacking. This focus on publishing preferably novel findings in journal articles probably affects the research conducted. It drives individual researchers - consciously or unconsciously - to further their careers by publishing underpowered small studies instead of furthering the knowledge in their fields of research [77].
A way forward is to provide an incentive for researchers to make their research reproducible. This requires a mentality shift. Policies of journals, conference organizers and research institutions should gradually change to require reproducibility. There are a few initiatives to foster reproducible research, specifically for music informatics research. The 53rd Audio Engineering Society (AES) conference had a prize for reproducibility. [174] was submitted to that conference and subsequently acknowledged as reproducibility-enabling. ISMIR 2012 had a tutorial on “Reusable software and reproducibility in music informatics research”, but structural attention for this issue at ISMIR seems to be lacking. Queen Mary University of London (QMUL) is one of the few places with continuous attention to the issue, and researchers there are trained in software craftsmanship. They also host a repository for software dealing with sound at http://soundsoftware.ac.uk and offer a yearly workshop on “Software and Data for Audio and Music Research”:
The third SoundSoftware.ac.uk one-day workshop on “Software and Data for Audio and Music Research” will include talks on issues such as robust software development for audio and music research, reproducible research in general, management of research data, and open access.
Another incentive to spend time documenting and publishing research software has already been mentioned above: code is reviewed by others, bugs are submitted and some users even take the time to contribute to the code by fixing bugs or extending functionality. A more indirect incentive is that it forces a different approach to writing code. Quick and dirty hacks are far less appealing if one knows beforehand that the code will be out in the open and will be reviewed by peers. Publishing code benefits reusability, modularity, clarity, longevity and software quality in general. It also forces one to think about installability, buildability and other aspects that make software sustainable [48].
In my work the main incentive to publish code is to make the tools and techniques available and to attempt to put them in the hands of end-users. While the aim is not to develop end-user software, establishing a feedback loop with users can be inspiring and even drive further research. In an article I co-authored with a number of colleagues, a possible approach is presented to fill the gap between research software and end-users [50]. One of the hurdles is to make users - in this case managers of archives of ethnic music - aware of the available tools and techniques. Two other articles I (co-)authored [20] do exactly this: they detail how archives can benefit from a mature MIR technology, in this case acoustic fingerprinting.
The distinction between end-user ready software, a (commercial) product, and useful contributions to a field in the form of research software may not be entirely clear. One of the outputs of computational research is often research software. This software should focus on novel methods encoded in software and not on creating end-user software. It is therefore of interest to stress the differences. Below, an attempt is made to draw this distinction by focusing on several aspects of software.
Transparency. The processes encoded in end-user software are not necessarily transparent. End-user software can be used effectively as a black box. The outcome - what you can achieve with the software - is more important than how it is done. Research software should focus exactly on making the process transparent while getting tasks done.
Complexity. The complexity of research software should not be hidden: it should be clear which parameters there are and how parameters change results. Researchers can be expected to put effort into getting to know the details of software they use. This is again different for end-user software, where ease-of-use and intuitiveness matter.
Openness. Researchers should be able to improve, adapt and experiment with the software. It stands to reason that research software should be open and allow, even encourage, improvement. Source control and project management websites such as GitHub and SoundSoftware.ac.uk facilitate these kinds of interactions. For end-user software this may not be a requirement.
Note that these characteristics of research software do not necessarily prevent such software from being applied as-is in practice. The research software Panako [175], which serves as an experimental platform for acoustic fingerprinting algorithms, is being used by Musimap and the International Federation of the Phonographic Industry (IFPI). Tarsos [173] has an attractive, easy-to-use graphical interface and is being used in workshops and during lectures by students. SyncSink [176] exposes many parameters to tweak but can be and is effectively used by researchers to synchronize research data. So some research software can be used as-is.
Conversely, there is also transparent and open end-user software available that does not hide its complexity such as the relational database system PostgreSQL. This means that the characteristics (transparency, complexity, openness) are not exclusive to research software but they are, in my view, requirements for good research software. The focus should be on demonstration of a process while getting (academic) tasks done and not on simply getting tasks done. This can be found, for example, in the ‘mission statement’ of TarsosDSP:
TarsosDSP is a Java library for audio processing. Its aim is to provide an easy-to-use interface to practical music processing algorithms implemented as simply as possible ... The library tries to hit the sweet spot between being capable enough to get real tasks done but compact and simple enough to serve as a demonstration on how DSP algorithms work.
This distinction between ‘real’ and ‘academic’ tasks is perhaps of interest. In ‘real’ tasks, practical considerations with regards to computational load, scalability and context need to be taken into account. This is much less the case for ‘academic’ tasks: there the process is the most important contribution, whereas performance and scalability may be an afterthought. For example, take research software that encodes a multi-pitch estimation algorithm that scores much better than the current state of the art. If this algorithm has an enormous computational load and takes many hours to process a couple of seconds of music, it is still a valid contribution: it shows how accurate multi-pitch estimation can be. It is, however, completely impractical to use if thousands of songs need to be processed. The system serves an academic purpose but cannot be used for ‘real’ tasks. A fingerprinting system that has desirable features but can only handle a couple of hundred reference items is another example. The previously mentioned publication by [50], which I co-authored, deals with this topic and gives several examples of research software capable enough to be used effectively by archivists to manage digital music archives.
To summarize: in my work research software packages form a considerable type of output. These systems serve an academic purpose in the first place but if they can be used for ‘real’ tasks then this is seen as an added benefit. I have identified common problems with reproducibility in computational research and have strived to make my own contributions reproducible by publishing source code and evaluating systems with publicly available data sets as much as possible. This makes my contributions verifiable and transparent but also allows others to use, criticize and improve these systems.
To contextualize my research I have given a brief overview of the interdisciplinary nature of the systematic musicology research field. The main point was that advanced research practices almost always involve challenging technological problems not easily handled by researchers with a background in humanities. The problems concern dealing with various types of digital data, computational models and complex experimental designs. I have set myself the task to engineer solutions that are relevant for systematic musicology. I have created such solutions and these are placed in a plane. One axis of that plane goes from methods to services. The other axis contrasts MIR-technologies with technologies for empirical research. These concepts are delineated and contextualized. A total of thirteen solutions explore this plane and are briefly discussed. The solutions have many attributes also found in digital humanities research projects. This link with the digital humanities was made explicit as well.
The solutions presented in my work strive to follow a reproducible methodology. Problems with reproducible computational research were identified and the importance of reproducibility was stressed. It was explained how my own research strives for this ideal: the source code that implements the solutions is open sourced and the solutions are evaluated with publicly available data as much as possible. Finally, a distinction was clarified between research prototypes and end-user ready software.
The following two chapters contain several publications that have been published elsewhere. They are self-contained works which means that some repetition might be present. The next chapter deals with publications that describe methods. The chapter that bundles publications describing services follows. For each chapter an additional introduction is included.
This chapter bundles three articles that are placed in the methods category of the humanities-engineering plane depicted in Figure 1. The focus of the papers is to present and apply methods which can yield new insights:
The first paper [173] describes Tarsos. It details a method for large-scale extraction, comparison and analysis of pitch class histograms. The method is encoded in a software system called Tarsos. Tarsos features a graphical user interface to analyze a single recording quickly and an API to allow analysis of many recordings. The final parts of the paper give an example of extracting and matching scales for hundreds of recordings of the Makam tradition, effectively contrasting theoretical models with (historical) performance practice. This serves as an illustration of how models can be contrasted with practice on a large scale with Tarsos.
The second paper [168] presents a method to find duplicates in large music archives. It shows how duplicate detection technology can be employed to estimate the quality of meta-data and to contrast meta-data of an original with a duplicate. The method is applied to the data set of the RMCA as a case study.
Reproducibility is the main topic of the third work [167]. It details the problems with reproducibility in computational research and MIR research. These problems are illustrated by replicating a seminal acoustic fingerprinting paper. While the results of the replication come close to the originally published results and the main findings are solidified, there is a problematic unexplained discrepancy, an unknown unknown. The main contribution of the paper lies in the method, which shows how new insights in MIR can be reported in a sustainable manner.
The article by [46] describes a method to automatically estimate the rhythmic complexity of a piece of music by using a set of beat tracking algorithms. The method is validated by comparing the results of the set of algorithms with human expert annotations. I co-authored the article and it could have been bundled here as well but I chose to limit bundled works to articles for which I serve as the main author.
In the past decade, several computational tools have become available for extracting pitch from audio recordings [35]. Pitch extraction tools are prominently used in a wide range of studies that deal with analysis, perception and retrieval of music. However, until recently, less attention has been paid to tools that deal with distributions of pitch in music.
The present paper presents a tool, called Tarsos, that integrates existing pitch extraction tools in a platform that allows the analysis of pitch distributions. Such pitch distributions contain a lot of information, and can be linked to tunings, scales, and other properties of musical performance. The tuning is typically reflected in the distance between pitch classes. Properties of musical performance may relate to pitch drift within a single piece, or to the influence of enculturation (as is the case in African music culture, see [131]). A major feature of Tarsos is concerned with processing audio-extracted pitches into pitch and pitch class distributions from which further properties can be derived.
Tarsos provides a modular platform used for pitch analysis - based on pitch extraction from audio and pitch distribution analysis - with a flexibility that includes:
The possibility to focus on a part of a song by selecting graphically displayed pitch estimations in the melograph.
A zoom function that allows focusing on global or detailed properties of the pitch distribution.
Real-time auditory feedback. A tuned midi synthesizer can be used to hear pitch intervals.
Several filtering options to get clearer pitch distributions or a more discretized melograph, which helps during transcription.
In addition, a change in one of the user interface elements is immediately propagated through the whole processing chain, so that pitch analysis becomes easy, adjustable and verifiable.
This paper is structured as follows. First, we present a general overview of the different processing stages of Tarsos, beginning with the low level audio signal stage and ending with pitch distributions and their musicological meaning. In the next part, we focus on some case studies and give a scripting example. The next part elaborates on the musical aspects of Tarsos and refers to future work. The fifth and final part of the main text contains a conclusion.
The Tarsos platform
Figure 6 shows the general flow of information within Tarsos. It starts with an audio file as input. The selection of a pitch estimation algorithm leads to pitch estimations, which can be represented in different ways. This representation can be further optimized, using different types of filters for peak selection. Finally, it is possible to produce an audio output of the obtained results. Based on that output, the analysis-representation-optimization cycle can be refined. All steps contain data that can be exported in different formats. The obtained pitch distribution and scale itself can be saved as a Scala file, which in turn can be used as input, overlaying the estimation of another audio file for comparison.
In what follows, we go deeper into the several processing aspects, dependencies, and particularities. In this section we first discuss how to extract pitch estimations from audio. We illustrate how these pitch estimations are visualized within Tarsos. The graphical user interface is discussed. The real-time and output capabilities are described, and this section ends with an explanation about scripting for the Tarsos api. As a reminder: there is a manual available for Tarsos at http://0110.be/tag/JNMR.
Extracting pitch estimations from audio
Prior to the step of pitch estimation, one should take into consideration that in certain cases audio preprocessing can improve the subsequent analysis within Tarsos. Depending on the source material and on the research question, preprocessing steps could include noise reduction, band-pass filtering, or harmonic/percussive separation [136]. Audio preprocessing should be done outside of the Tarsos tool. The optionally preprocessed audio is then fed into Tarsos and converted to a standardized format.
The next step is to generate pitch estimations. Each selected block of the audio file is examined and pitches are extracted from it. In Figure 7, this step is located between the input and the signal block phases. Tarsos can be used with external and internal pitch estimators. Currently, there is support for the polyphonic MAMI pitch estimator [35] and any VAMP plug-in [30] that generates pitch estimations. The external pitch estimators are platform dependent and some configuration needs to be done to get them working. For practical purposes, platform independent implementations of two pitch detection algorithms are included, namely YIN [49] and MPM [127]. They are available without any configuration. Thanks to a modular design, internal and external pitch detectors can easily be added. Once correctly configured, the use of these pitch modules is completely transparent, as extracted pitch estimations are transformed to a unified format, cached, and then used for further analysis at the symbolic level.
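As an illustration of what such an internal pitch estimator looks like in practice, the sketch below extracts YIN pitch estimations with TarsosDSP [174], the library that underlies Tarsos. The package and class names are those of recent TarsosDSP releases and may differ in other versions, so this is an indicative sketch rather than the exact code used inside Tarsos.

```java
import java.io.File;

import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.io.jvm.AudioDispatcherFactory;
import be.tarsos.dsp.pitch.PitchDetectionHandler;
import be.tarsos.dsp.pitch.PitchDetectionResult;
import be.tarsos.dsp.pitch.PitchProcessor;
import be.tarsos.dsp.pitch.PitchProcessor.PitchEstimationAlgorithm;

public class PitchEstimationExample {
    public static void main(String... args) throws Exception {
        // Analyse blocks of 2048 samples with 50% overlap.
        int bufferSize = 2048, overlap = 1024;
        float sampleRate = 44100;
        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromFile(new File(args[0]), bufferSize, overlap);
        PitchDetectionHandler handler = new PitchDetectionHandler() {
            @Override
            public void handlePitch(PitchDetectionResult result, AudioEvent e) {
                if (result.isPitched()) {
                    // print time stamp (seconds) and pitch estimation (Hz)
                    System.out.printf("%.3f;%.2f%n", e.getTimeStamp(), result.getPitch());
                }
            }
        };
        dispatcher.addAudioProcessor(
                new PitchProcessor(PitchEstimationAlgorithm.YIN, sampleRate, bufferSize, handler));
        dispatcher.run();
    }
}
```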
Visualizations of pitch estimations
Once the pitch detection has been performed, pitch estimations are available for further study. Several types of visualizations can be created, which lead, step by step, from pitch estimations to pitch distribution and scale representation. In all these graphs the cent unit is used. The cent divides each octave into 1200 equal parts. In order to use the cent unit for determining absolute pitch, a reference frequency of 8.176 Hz has been defined.
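Concretely, a frequency in Hz maps to absolute cents relative to that 8.176 Hz reference as in the minimal sketch below.

```java
public final class CentConversion {
    // Convert a frequency in Hz to absolute cents, with 8.176 Hz as the
    // reference defined in the text; each octave then spans 1200 cents.
    public static double hzToAbsoluteCents(double frequencyInHz) {
        final double REFERENCE_HZ = 8.176;
        return 1200.0 * Math.log(frequencyInHz / REFERENCE_HZ) / Math.log(2.0);
    }
}
```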
A first type of visualization is the melograph representation, which is shown in Figure 8. In this representation, each estimated pitch is plotted over time. As can be observed, the pitches are not uniformly distributed over the pitch space, and form a clustering around 5883 cents.
A second type of visualization is the pitch histogram, which shows the pitch distribution regardless of time. The pitch histogram is constructed by assigning each pitch estimation in time to a bin between 0 and 14400 cents.
A third type of visualization is the pitch class histogram, which is obtained by adding each bin from the pitch histogram to a corresponding modulo 1200 bin. Such a histogram reduces the pitch distribution to one single octave. A peak thus represents the total duration of a pitch class in a selected block of audio. Notice that the peak at 5883 cents in the pitch histogram (Figure 9) now corresponds to the peak at 1083 cents in the pitch class histogram (Figure 10).
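A minimal sketch of how the two histograms relate, assuming the estimations have already been converted to cents and a bin width of one cent: the pitch histogram uses bins between 0 and 14400 cents, and the pitch class histogram folds those bins modulo 1200.

```java
public final class Histograms {

    // Pitch histogram: one bin per cent between 0 and 14400 cents.
    public static long[] pitchHistogram(double[] estimationsInCents) {
        long[] histogram = new long[14400];
        for (double cents : estimationsInCents) {
            int bin = (int) Math.round(cents);
            if (bin >= 0 && bin < histogram.length) {
                histogram[bin]++;
            }
        }
        return histogram;
    }

    // Pitch class histogram: fold the pitch histogram into a single octave.
    public static long[] pitchClassHistogram(long[] pitchHistogram) {
        long[] folded = new long[1200];
        for (int bin = 0; bin < pitchHistogram.length; bin++) {
            folded[bin % 1200] += pitchHistogram[bin];
        }
        return folded;
    }
}
```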
It can also be useful to filter the pitch estimations that make up the pitch class histogram. The most obvious ‘filter’ is to select only an interesting timespan and pitch range. The distributions can be further manipulated using other filters and peak detection. The following three filters are implemented in Tarsos:
The first is an estimation quality filter. It simply removes pitch estimations from the distribution below a certain quality threshold. Using YIN, the quality of an estimation is related to the periodicity of the block of sound analyzed. Keeping only high quality estimations should yield clearer pitch distributions.
The second is called a near to pitch class filter. This filter only allows pitch estimations which are close to previously identified pitch classes. The pitch range parameter (in cents) defines how far ornamentations can deviate from the pitch classes. Depending on the music and the research question, one needs to be careful with this - and other - filters. For example, a vibrato makes pitch go up and down - pitch modulation - and is centered around a pitch class. Figure ? gives an example of Western vibrato singing. The melograph reveals the ornamental singing style, based on two distinct pitch classes. The two pitch classes are hard to identify in the histogram (Figure ?) but are perceptually there; they are made clear with the dotted gray line. In contrast, Figure ? depicts a more continuous glissando which is used as a building block to construct a melody in an Indian raga. For these cases, [94] introduced the concept of two-dimensional ’melodic atoms’. (In [75] it is shown how elementary bodily gestures are related to pitch and pitch gestures.) The histogram of the pitch gesture in Figure ? suggests one pitch class while a fundamentally different concept of tone is used. Applying the near to pitch class filter to this type of music could lead to incorrect results. The goal of this filter is to get a clearer view on the melodic contour by removing pitches between pitch classes, and to get a clearer pitch class histogram.
The third filter is a steady state filter. The steady state filter has a time and a pitch range parameter. The filter keeps only consecutive estimations that stay within a pitch range for a defined number of milliseconds. The default values are 100ms within a range of 15 cents. The idea behind it is that only ’notes’ are kept while transition errors, octave errors and other short events are removed; a sketch of such a filter follows below.
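The sketch below illustrates the idea of the steady state filter with the default values from the text (100 ms within 15 cents); the exact implementation in Tarsos may differ in details such as how a run of estimations is delimited.

```java
import java.util.Arrays;

public final class SteadyStateFilter {

    // Keep only runs of consecutive estimations that stay within maxRangeCents
    // for at least minDurationSec. Estimations are given as parallel arrays of
    // time stamps (seconds) and pitches (cents); returns a keep/discard mask.
    public static boolean[] filter(double[] timesSec, double[] cents,
                                   double minDurationSec, double maxRangeCents) {
        boolean[] keep = new boolean[cents.length];
        if (cents.length == 0) return keep;
        int start = 0;
        double runMin = cents[0], runMax = cents[0];
        for (int i = 1; i <= cents.length; i++) {
            boolean endOfData = (i == cents.length);
            boolean outOfRange = !endOfData
                    && Math.max(runMax, cents[i]) - Math.min(runMin, cents[i]) > maxRangeCents;
            if (endOfData || outOfRange) {
                // close the current run and keep it if it lasted long enough
                if (timesSec[i - 1] - timesSec[start] >= minDurationSec) {
                    Arrays.fill(keep, start, i, true);
                }
                if (!endOfData) {
                    start = i;
                    runMin = cents[i];
                    runMax = cents[i];
                }
            } else {
                runMin = Math.min(runMin, cents[i]);
                runMax = Math.max(runMax, cents[i]);
            }
        }
        return keep;
    }
}
```

With the defaults from the text this would be called as `filter(times, cents, 0.1, 15)`.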
Once a selection of the estimations is made or, optionally, other filters are used, the distribution is ready for peak detection. The peak detection algorithm looks for each position where the derivative of the histogram is zero, and a local height score is calculated with the formula in Equation 1. The local height score $s_i$ of a peak with height $h_i$ is defined for a certain window $w$ as $s_i = (h_i - \bar{h}_w)/\sigma_w$, where $\bar{h}_w$ is the average height in the window and $\sigma_w$ refers to the standard deviation of the height in the window. The peaks are ordered by their score and iterated, starting from the peak with the highest score. If peaks are found within the window of the current peak, they are removed. Peaks with a local height score lower than a defined threshold are ignored. Since we are looking for pitch classes, the window wraps around the edges: there is a difference of 20 cents between 1190 cents and 10 cents.
Figure 11 shows the local height score function applied to the pitch class histogram shown in Figure 10. The desired leveling effect of the local height score is clear, as the small peak becomes much more defined. The threshold is also shown; in this case, it eliminates the noise caused by the small window size and local height deviations. The performance of the peak detection depends on two parameters, namely the window size and the threshold. Automatic analysis either uses a general preset for the parameters or tries to find the most stable setting with an exhaustive search. Optionally, Gaussian smoothing can be applied to the pitch class histogram, which makes peak detection more straightforward. Manual intervention is sometimes needed: by fiddling with the two parameters a user can quickly browse through several peak detection result candidates.
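A sketch of the local height score, following the verbal description above; the window wraps around the octave edges and peaks scoring below the threshold are discarded. The real Tarsos implementation may differ in detail.

```java
public final class PeakScore {

    // Local height score of bin 'index' in a pitch class histogram:
    // (height - mean of the surrounding window) / standard deviation of that
    // window, with the window wrapping around the 0..1199 cent edges.
    public static double localHeightScore(long[] pitchClassHistogram, int index, int windowInCents) {
        int half = windowInCents / 2;
        int n = 2 * half + 1;
        double sum = 0, sumOfSquares = 0;
        for (int d = -half; d <= half; d++) {
            double h = pitchClassHistogram[Math.floorMod(index + d, pitchClassHistogram.length)];
            sum += h;
            sumOfSquares += h * h;
        }
        double mean = sum / n;
        double std = Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));
        return std == 0 ? 0 : (pitchClassHistogram[index] - mean) / std;
    }
}
```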
Once the pitch classes are identified, a pitch class interval matrix can be constructed. This is the fourth type of representation, which is shown in Table 1. The pitch class interval matrix represents the found pitch classes and shows the intervals between the pitch classes. In our example, a perfect fourth (498 cents) can be found between the pitch classes at 585 and 1083 cents.
P.C. | 107 | 364 | 585 | 833 | 1083 |
---|---|---|---|---|---|
107 | 0 | 256 | 478 | 726 | 976 |
364 | 944 | 0 | 221 | 470 | 719 |
585 | 722 | 979 | 0 | 248 | 498 |
833 | 474 | 730 | 952 | 0 | 250 |
1083 | 224 | 481 | 702 | 950 | 0 |
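The interval matrix itself is straightforward to compute from the detected pitch classes: entry (i, j) is the ascending interval from pitch class i to pitch class j, modulo 1200 cents. The sketch below reproduces the structure of Table 1; the values in the table differ by a cent here and there, presumably because they were computed from unrounded pitch class estimates.

```java
public final class IntervalMatrix {

    // Entry (i, j) is the ascending interval, in cents, from pitch class i to
    // pitch class j, taken modulo 1200 so that it stays within one octave.
    public static int[][] compute(int[] pitchClassesInCents) {
        int n = pitchClassesInCents.length;
        int[][] intervals = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                intervals[i][j] = Math.floorMod(pitchClassesInCents[j] - pitchClassesInCents[i], 1200);
            }
        }
        return intervals;
    }
}
```

Called with the pitch classes {107, 364, 585, 833, 1083}, this yields the layout of Table 1.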
The interface
Most of the capabilities of Tarsos are used through the graphical user interface (Figure ?). The interface provides a way to explore pitch organization within a musical piece. However, the main flow of the process, as described above, is not always as straightforward as the example might suggest. More particularly, in many cases of music from oral traditions, the peaks in the pitch class histogram are not always well-defined (see Section ?). Therefore, the automated peak detection may need manual inspection and further manual fine-tuning in order to correctly identify a song’s pitch organization. The user interface was designed specifically to offer a flexible environment where all windows with representations communicate their data. Tarsos has the attractive feature that all actions, like the filtering actions mentioned in Section ?, are updated for each window in real-time.
One way to closely inspect pitch distributions is to select only a part of the estimations. In the block diagram of Figure 7, this is represented by the funnel. Selection in time is possible using the waveform view (Figure ?-5). For example, the aim could be a comparison of pitch distributions at the beginning and the end of a piece, to reveal whether a choir lowered or raised its pitch during a performance (see Section ? for a more elaborate example).
Selection in pitch range is possible and can be combined with a selection in time using the melograph (Figure ?-3). One may select the melodic range so as to exclude pitched percussion, and this could yield a completely different pitch class histogram. This feature is practical, for example, when a flute melody is accompanied by a low-pitched drum and only the flute tuning is of interest. With the melograph it is also possible to zoom in on one or two notes, which is interesting for studying pitch contours. As mentioned earlier, not all music is organized by fixed pitch classes. An example of such pitch organization is given in Figure ?, a fragment of Indian music where the estimations contain information that cannot be reduced to fixed pitch classes.
To allow efficient selection of estimations in time and frequency, they are stored in a kd-tree [12]. Once such a selection of estimations is made, a new pitch histogram is constructed and the pitch class histogram view (Figure ?-1) changes instantly.
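The query itself is a two-dimensional range selection; the naive linear scan below shows what is being asked of the data structure, while the kd-tree in Tarsos keeps the same query fast for long recordings.

```java
import java.util.ArrayList;
import java.util.List;

public final class EstimationSelection {

    // Return the indices of all estimations that fall inside the selected
    // time span (seconds) and pitch range (cents). Tarsos answers this kind
    // of query with a kd-tree; the linear scan here is only for illustration.
    public static List<Integer> select(double[] timesSec, double[] cents,
                                       double minTime, double maxTime,
                                       double minCents, double maxCents) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < timesSec.length; i++) {
            if (timesSec[i] >= minTime && timesSec[i] <= maxTime
                    && cents[i] >= minCents && cents[i] <= maxCents) {
                selected.add(i);
            }
        }
        return selected;
    }
}
```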
Once a pitch class histogram is obtained, peak detection is a logical next step. With the user interface, manual adjustment of the automatically identified peaks is possible. New peak locations can be added and existing ones can be moved or deleted. In order to verify the pitch classes manually, it is possible to click anywhere on the pitch class histogram. This sends a midi-message with a pitch bend to synthesize a sound with a pitch that corresponds to the clicked location. Changes made to the peak locations propagate instantly throughout the interface.
The pitch class interval matrix (Figure ?-2) shows all new pitch class intervals. Reference pitches are added to the melograph and midi tuning messages are sent (see Section ?). The pitch class interval matrix is also interactive: when an interval is clicked on, the two pitch classes that create the interval sound at the same time. The dynamics of the process and the combination of both visual and auditory clues make manually adjusted, precise peak extraction, and therefore tone scale detection, possible. Finally, the graphical display of a piano keyboard in Tarsos allows us to play in the (new) scale. This feature can also be used from a computer keyboard, where notes are mapped onto keys. Any of the standard midi instrument sounds can be chosen.
It is possible to shift the pitch class histogram up- or downwards. The data is then viewed as a repetitive, octave based, circular representation. In order to compare scales, it is possible to upload a previously detected scale (see Section ?) and shift it, to find a particular fit. This can be done by hand, exploring all possibilities of overlaying intervals, or the best fit can be suggested by Tarsos.
Real-time capabilities
Tarsos is capable of real-time pitch analysis. Sound from a microphone can be analyzed and immediate feedback can be given on the played or sung pitch. This feature offers some interesting new use-cases in education, composition, and ethnomusicology.
For educational purposes, Tarsos can be used to practice singing quarter tones. Not only can the real-time audio be analyzed, but an uploaded scale or a previously analyzed file can also be listened to by clicking on the interval table or by using the keyboard. Singers or string players could use this feature to improve their intonation regardless of the scale they try to reach.
For compositional purposes, Tarsos can be used to experiment with microtonality. The peak detection and manual adjustment of pitch histograms allow the construction of any possible scale, with the possibility of immediate harmonic and melodic auditory feedback. Use of the interval table and the keyboard makes experiments in interval tension and scale characteristics possible. Musicians can tune (ethnic) instruments according to specific scales using the direct feedback of the real-time analysis. Because of the midi messages, it is also possible to play the keyboard in the same scale as the instruments at hand.
In ethnomusicology, Tarsos can be a practical tool for direct pitch analysis of various instruments. Given the fact that pitch analysis results show up immediately, microphone positions during field recordings can be adjusted on the spot to optimize measurements.
Output capabilities
Tarsos contains export capabilities for each step, from the raw pitch estimations to the pitch class interval matrix. The built-in functions can export the data as comma separated text files, charts, TeX-files, and there is a way to synthesize estimations. Since Tarsos is scriptable, there is also a possibility to add other export functions or modify the existing ones. The api and scripting capabilities are documented on the Tarsos website: http://0110.be/tag/JNMR.
For pitch class data, there is a special standardized text file format defined by the Scala program: the .scl extension. The Scala program comes with a dataset of over 3900 scales, ranging from historical harpsichord temperaments over ethnic scales to scales used in contemporary music. Recently this dataset has been used to find universal properties of scales [80]. Since Tarsos can export Scala files it is possible to see if the star-convex structures discussed by [80] can be found in scales extracted from real audio. Tarsos can also parse Scala files, so that comparison of theoretical scales with tuning practice is possible. This feature is visualized by the upwards Scala arrow in Figure 7. When a scale is overlaid on a pitch class histogram, Tarsos finds the best fit between the histogram and the Scala file.
A completely different output modality is midi. The midi Tuning Standard defines midi messages to specify the tuning of midi synthesizers. Tarsos can construct Bulk Tuning Dump messages with pitch class data to tune a synthesizer, enabling the user to play along with a song in tune. Tarsos contains the Gervill synthesizer, one of the very few (software) synthesizers that offer support for the midi Tuning Standard. Another approach to enable users to play in tune with an extracted scale is to send pitch bend messages to the synthesizer when a key is pressed. Pitch bend is a midi message that tells how much higher or lower a pitch needs to sound in comparison with a standardized pitch. Virtually all synthesizers support pitch bend, but pitch bends operate at the midi-channel level. This makes it impossible to play polyphonic music in an arbitrary tone scale.
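As a sketch of that second approach: the deviation, in cents, of a desired pitch from the nearest equal-tempered midi note can be translated into a 14-bit pitch bend value, assuming the common default bend range of ±2 semitones. The actual messages sent by Tarsos may be constructed differently.

```java
public final class PitchBend {

    // Translate a deviation in cents from the nearest equal-tempered midi
    // note into a 14-bit pitch bend value (0..16383, 8192 = no bend),
    // assuming the default bend range of +-2 semitones (+-200 cents).
    public static int pitchBendValue(double deviationInCents) {
        double bendRangeInCents = 200.0;
        double normalized = deviationInCents / bendRangeInCents; // -1.0 .. 1.0
        int value = (int) Math.round(8192 + normalized * 8192);
        return Math.max(0, Math.min(16383, value));
    }
}
```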
Scripting capabilities
Processing many audio files with the graphical user interface quickly becomes tedious. Scripts written for the Tarsos api can automate tasks and offer a possibility to utilize Tarsos’ building blocks in entirely new ways. Tarsos is written in Java, and is extendable using scripts in any language that targets the JVM (Java Virtual Machine), like JRuby or Scala. Tasks that can be implemented with the Tarsos api include the following:
Tone scale recognition: given a large number of songs and a number of tone scales in which each song can be brought, guess the tone scale used for each song. In Section ? this task is explained in detail and effectively implemented.
Modulation detection: this task tries to find the moments in a piece of music where the pitch class histogram changes from one stable state to another. For Western music this could indicate a change of mode, a modulation. This task is similar to the one described by [112]. With the Tarsos api you can compare windowed pitch histograms and detect modulation boundaries.
Evolution in tone scale use: this task tries to find evolutions in tone scale use in a large number of songs from a certain region over a long period of time. Are some pitch intervals becoming more popular than others? This is done by [131] for a set of African songs.
Acoustic fingerprinting: it is theorized by [193] that pitch class histograms can serve as an acoustic fingerprint for a song. With the building blocks of Tarsos - pitch detection, pitch class histogram creation and comparison - this was put to the test by [170].
The article by [193] gives a good overview of what can be done using pitch histograms and, by extension, the Tarsos api. To conclude: the Tarsos api enables developers to quickly test ideas, execute experiments on large sets of music and leverage the features of Tarsos in new and creative ways.
In what follows, we explore Tarsos’ capabilities using case studies in non-Western music. The goal is to focus on problematic issues such as the use of different pitch extractors, music with pitch drift, and last but not least, the analysis of large databases.
Analysing a pitch histogram
We will first consider the analysis of a song that was recorded in 1954 by missionary Scohy-Stroobants in Burundi. The song is performed by a singing soloist, Léonard Ndengabaganizi. The recording was analysed with the YIN pitch detection method and a pitch class histogram was calculated: it can be seen in Figure ?. After peak detection on this histogram, the following pitch intervals were detected: 168, 318, 168, 210, and 336 cents. The detected peaks and all intervals are shown in an interval matrix (see Figure ?). It can be observed that this is a pentatonic division that comprises small and large intervals, which is different from an equal tempered or meantone division. Interestingly, the two largest peaks define a fifth interval, which is made up of a pure minor third (318 cents) and a pure major third (378 cents). In addition, a mirrored set of intervals is present, based on 168-318-168 cents. This phenomenon is also illustrated by Figure ?.
Different pitch extractors
However, Tarsos has the capability to use different pitch extractors. Here we show the difference between seven pitch extractors at the histogram level. A detailed evaluation of each algorithm cannot be covered in this article but can be found in the cited papers. The different pitch extractors are:
YIN [49] (YIN) and the McLeod Pitch Method (MPM), which is described by [127], are two time-domain pitch extractors. Tarsos contains a platform independent implementation of the algorithms.
Spectral Comb (SC), Schmitt trigger (Schmitt) and Fast Harmonic Comb (FHC) are described by [21]. They are available for Tarsos through VAMP plug-ins [29];
MAMI 1 and MAMI 6 are two versions of the same pitch tracker. MAMI 1 uses only the most salient pitch at a certain time; MAMI 6 takes the six most salient pitches at a certain time into account. The pitch tracker is described by [35].
Figure 12 shows the pitch histogram of the same song as in the previous section, which is sung by an unaccompanied young man. The pitch histogram shows a small tessitura and wide pitch classes. However, the general contour of the histogram is more or less the same for each pitch extraction method: five pitch classes can be distinguished in about one and a half octaves, ranging from 5083 to 6768 cents. Two methods stand out. Firstly, MAMI 6 detects pitch in the lower and higher regions. This is due to the fact that MAMI 6 always gives six pitch estimations in each measurement sample. In this monophonic song this results in octave errors - halving and doubling - and overtones. Secondly, the Schmitt method also stands out because it detects pitch in regions where the other methods detect far fewer pitches, e.g. between 5935 and 6283 cents.
Figure 13 shows the pitch class histogram for the same song as in Figure 12, now collapsed into one octave. It clearly shows that it is hard to determine the exact location of each pitch class. However, all histogram contours look similar except for the one of the Schmitt method, which results in much less well defined peaks. The following evaluation shows that this is not limited to this single song.
 | YIN | MPM | Schmitt | FHC | SC | MAMI 1 | MAMI 6 |
---|---|---|---|---|---|---|---|
YIN | 1.00 | 0.81 | 0.41 | 0.65 | 0.62 | 0.69 | 0.61 |
MPM | 0.81 | 1.00 | 0.43 | 0.67 | 0.64 | 0.71 | 0.63 |
Schmitt | 0.41 | 0.43 | 1.00 | 0.47 | 0.53 | 0.42 | 0.56 |
FHC | 0.65 | 0.67 | 0.47 | 1.00 | 0.79 | 0.67 | 0.66 |
SC | 0.62 | 0.64 | 0.53 | 0.79 | 1.00 | 0.65 | 0.70 |
MAMI 1 | 0.69 | 0.71 | 0.42 | 0.67 | 0.65 | 1.00 | 0.68 |
MAMI 6 | 0.61 | 0.63 | 0.56 | 0.66 | 0.70 | 0.68 | 1.00 |
Average | 0.69 | 0.70 | 0.55 | 0.70 | 0.70 | 0.69 | 0.69 |
In order to gain some insight into the differences between the pitch class histograms resulting from different pitch detection methods, the following procedure was used: for each song in a data set of more than 2800 songs - a random selection from the music collection of the Belgian Royal Museum for Central Africa (RMCA) - seven pitch class histograms were created by the pitch detection methods. The overlap - a number between zero and one - between each pitch class histogram pair was calculated. A sum of the overlap between each pair was made and finally divided by the number of songs. The resulting data can be found in Table 2. Here histogram overlap or intersection is used as a distance measure because [62] show that this measure works best for pitch class histogram retrieval tasks. The overlap $c(h_1, h_2)$ between two histograms $h_1$ and $h_2$ with $K$ classes is calculated with Equation 2 as the intersection $\sum_{k=1}^{K}\min(h_1(k), h_2(k))$ of the two normalized histograms. For an overview of alternative correlation measures between probability density functions see [33].
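A sketch of such an overlap measure is given below: the histogram intersection of two pitch class histograms, normalized so that the result lies between zero and one. The exact normalization used in the evaluation may differ slightly.

```java
public final class HistogramOverlap {

    // Histogram intersection of two pitch class histograms with the same
    // number of bins, normalized by the smaller histogram mass so that the
    // result lies between zero (no overlap) and one (identical shapes).
    public static double overlap(long[] h1, long[] h2) {
        long intersection = 0, sum1 = 0, sum2 = 0;
        for (int k = 0; k < h1.length; k++) {
            intersection += Math.min(h1[k], h2[k]);
            sum1 += h1[k];
            sum2 += h2[k];
        }
        long smallest = Math.min(sum1, sum2);
        return smallest == 0 ? 0.0 : (double) intersection / smallest;
    }
}
```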
Table 2 shows that there is, on average, a large overlap of 81% between the pitch class histograms created by YIN and those by MPM. This can be explained by the fact that the two pitch extraction algorithms are very much alike: both operate in the time domain and are based on autocorrelation. The table also shows that Schmitt generates rather unique pitch class histograms: on average there is only 55% overlap with the other pitch class histograms. This behaviour was already expected from the analysis of the single song above.
The choice of a particular pitch detection method depends on the music and the analysis goals. The music can be monophonic, homophonic or polyphonic, and different instrumentation and recording quality all have an influence on pitch estimators. Users of Tarsos are encouraged to try out which pitch detection method suits their needs best. Tarsos’ scripting api - see Section ? - can be helpful when optimizing combinations of pitch detection methods and parameters for an experiment.
Shifted pitch distributions
Several difficulties in analysis and interpretation may arise due to pitch shift effects during musical performances. This is often the case with a cappella choirs. Figure ? shows a nice example of an intentionally raised pitch, during solo singing in the Scandinavian Sami culture. The short and repeated melodic motive remains the same during the entire song, but the pitch rises gradually, ending up 900 cents higher than at the beginning. Retrieving a scale for the entire song is in this case irrelevant, although the scale is significant for the melodic motive. Figure 15 shows an example where scale organization depends on the characteristics of the instrument. This type of African fiddle, the iningidi, does not use a soundboard to shorten the strings. Instead the string is shortened by the fingers that are in a (floating) position above the string: an open string and three fingers give a tetratonic scale. Figure 14 shows an iningidi being played. This use case shows that pitch distributions for entire songs can be misleading; in both cases it is much more informative to compare the distribution from the first part of the song with that of the last part. Then it becomes clear how much the pitch shifted and in which direction.
It is interesting to remark that these intervals are more or less equally spaced - a natural consequence of the distance between the fingers - and that, consequently, not the entire octave tessitura is used. In fact only 600 cents, half an octave, is used: a scale that typically occurs in fiddle recordings and that can rather be seen as a tetrachord. The open string (lowest note) is much more stable than the three other pitches, which deviate more, as shown by the broader peaks in the pitch class histogram. The hand position without a fingerboard is directly related to the variance of these three pitch classes. When comparing the second minute of the song with the seventh, one sees a clear shift in pitch, which can be explained by the fact that the musician changed the hand position slightly. In addition, another phenomenon can be observed: while performing, the open string gradually loses tension, causing a small lowering in pitch which can be noticed when comparing the two fragments. This is not uncommon for ethnic music instruments.
Tarsos’ scripting applied to Makam recognition

To make the use of scripting more concrete, an example is given here. It concerns the analysis of Turkish classical music. In an article by [62], pitch histograms were used for - amongst other tasks - makam recognition, a task that can be described as follows:
For a small set of tone scales and a large set of musical performances, each brought in one of those scales, identify the tone scale of each musical performance automatically.
An example of makam recognition can be seen in Figure 16. A theoretical template - the dotted, red line - is compared to a pitch class histogram - the solid, blue line - by calculating the maximum overlap between the two. Each template is compared with the pitch class histogram and the template with the maximum overlap is the guessed makam. Pseudocode for this procedure can be found in Algorithm ?.
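The following sketch illustrates the idea. It uses hypothetical names, assumes that histograms and templates are normalized and have the same resolution, and takes the maximum overlap over all circular shifts; it is an illustration of the procedure, not the actual Tarsos script.

```java
import java.util.Map;

/** Sketch of makam recognition by template matching (hypothetical names). */
public class MakamRecognitionSketch {

    /** Returns the name of the template with the largest shift-invariant overlap. */
    public static String recognize(double[] pitchClassHistogram, Map<String, double[]> templates) {
        String bestMakam = null;
        double bestOverlap = -1.0;
        for (Map.Entry<String, double[]> entry : templates.entrySet()) {
            double overlap = maxOverlapOverShifts(pitchClassHistogram, entry.getValue());
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                bestMakam = entry.getKey();
            }
        }
        return bestMakam;
    }

    /** Maximum histogram overlap over all circular shifts of the template. */
    private static double maxOverlapOverShifts(double[] histogram, double[] template) {
        double max = 0.0;
        for (int shift = 0; shift < template.length; shift++) {
            double overlap = 0.0;
            for (int i = 0; i < histogram.length; i++) {
                // both histograms are assumed normalized and of equal length
                overlap += Math.min(histogram[i], template[(i + shift) % template.length]);
            }
            max = Math.max(max, overlap);
        }
        return max;
    }
}
```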
Tarsos is a tool for the analysis of pitch distributions. To that aim, Tarsos incorporates several pitch extraction modules, pitch distribution filters, audio feedback tools, and scripting tools for batch processing of large databases of musical audio. However, pitch distributions can be considered from different perspectives, such as ethnographical studies of scales [159], theoretical studies in scale analysis [164], harmonic and tonal analysis [96], and other structural analysis approaches to music (such as set theoretical and Schenkerian). Clearly, Tarsos does not offer a solution to all these different approaches to pitch distributions. In fact, seen from the viewpoint of Western music analysis, Tarsos is a rather limited tool as it offers neither harmonic analysis, nor tonal analysis, nor even statistical analysis of pitch distributions. All of this should be applied alongside Tarsos, when needed. Instead, what Tarsos provides is an intermediate level between pitch extraction (done by pitch extractor tools) and music theory. The major contribution of Tarsos is that it offers an easy to use tool for pitch distribution analysis that applies to all kinds of music, including Western and non-Western; it offers, so to speak, pitch distribution analysis without imposing a music theory. In what follows, we explain why such tools are needed and why they are useful.
Tarsos and Western music theoretical concepts

Until recently, musical pitch was often considered from the viewpoint of traditional music theory, which assumes that pitch is stable (e.g. vibrato is an ornament of a stable pitch), that pitch can be segmented into tones, that pitches are based on octave equivalence, that octaves are divided into 12 equal-sized intervals of 100 cents each, and so on. These assumptions have the advantage that music can be reduced to symbolic representations, a written notation, or notes, whose structures can be studied at an abstract level. As such, music theory has conceptualized pitch distributions as chords, keys, modes and sets, using a symbolic notation.
So far so good, but tools based on these concepts may not work for many nuances of Western music, and especially not for non-Western music. In Western music, tuning systems have a long history. Proof of this can be found in the tunings of historical organs, and in tuning systems that have been explored by composers in the 20th century (cf. Alois Haba, Harry Partch, Ivo Darreg, and La Monte Young). Especially in non-Western classical music, pitch distributions are used that radically differ from Western theoretical concepts, in terms of tuning as well as pitch occurrence and timbre. For example, the use of small intervals in Arab music contributes to nuances in melodic expression. To better understand how small pitch intervals contribute to the organization of this music, we need tools that do not assume a division of the octave in 12 equal-sized intervals (see [62]). Other types of music do not have octave equivalence (cf. the Indonesian gamelan), and some music works with modulated pitch. For example, [75] describe classical Chinese guqin music, in which tones contain sliding patterns (pitch modulations) that form a substantial component of the tone, and consider a tone as a succession of prototypical gestures. [94] introduces a set of 2D melodic units, melodic atoms, to describe Carnatic (South-Indian classical) music; these units represent or synthesize a melodic phrase and are not bound to a scale type. Hence, tools based on common Western music theoretical conceptions of pitch organization may not work for this type of music.
Oral musical traditions (also called ethnic music) provide a special case since there is no written music theory underlying the pitch organization. An oral culture depends on societal coherence, interpersonal influence and individual musicality, and this has implications for how pitch gets organized. Although oral traditions often rely on a peculiar pitch organization, often using a unique system of micro-tuned intervals, it is also the case that instruments may lack a fixed tuning, or that tunings may strongly differ from one instrument to another, or from one region to another. The myriad of ways in which people succeed in making sense out of different types of pitch organization can be considered cultural heritage that necessitates a proper way of documentation and study [131].
Several studies have attempted to develop a proper approach to pitch distributions. [68] look for pitch gestures in European folk music as an additional aspect to pitch detection. Moving from tone to scale research, [34] acknowledges interval differences in Indian classical music, but reduces them to a chromatic scale for similarity analysis and classification. [188] developed, as early as 1969, an automated method for extracting pitch information from monophonic audio in order to assemble the scale of the spilåpipa from frequency histograms. [18] build a system to classify and recognize Turkish maqams from audio files, using overall frequency histograms to characterize the maqam scales and to detect the tonic centre. Maqams contain intervals of different sizes, often not compatible with the chromatic scale, partly relying on smaller intervals. [132] focuses on the pitch distributions of especially African music, which exhibits a large diversity of irregular tuning systems, and avoids a priori pitch categories by using a quasi-continuous rather than a discrete interval representation. In [131] they show that African songs have shifted more and more towards Western well temperament from the 1950s to the 1980s.
To sum up, the study of pitch organization needs tools that go beyond elementary concepts of the Western music theoretical canon (such as octave equivalence, stability of tones, the equal-tempered scale, and so on). This is evident from the nuances of pitch organization in Western music, in non-Western classical music, as well as in oral music cultures. Several attempts have been undertaken, but we believe that a proper way of achieving this is by means of a tool that combines audio-based pitch extraction with a generalized approach to pitch distribution analysis. Such a tool should be able to automatically extract pitch from musical audio in a culture-independent manner, and it should offer an approach to the study of pitch distributions and their relationship with tunings and scales. The envisioned tool should be able to perform this kind of analysis in an automated way, but it should be flexible enough to allow a musicologically grounded manual fine-tuning, using filters that define the scope at which we look at distributions. The latter is indeed needed in view of the large variability of pitch organization in music all over the world. Tarsos is an attempt at supplying such a tool. On the one hand, Tarsos tries to avoid music theoretical concepts that could contaminate music that does not subscribe to the constraints of the Western music theoretical canon. On the other hand, the use of Tarsos may still be too limited, as pitch distributions may further draw upon melodic units that require an approach to segmentation (similar to the way segmented pitch relates to notes in Western music) and further gestural analysis (see the studies mentioned above).
Tarsos pitfalls

The case studies from section ? illustrate some of the capabilities of Tarsos as a tool for the analysis of pitch distributions. As shown, Tarsos offers a graphical interface that allows a flexible way to analyse pitch, similar to other editors that focus on sound analysis (Sonic Visualiser, Audacity, Praat). Tarsos offers support for different pitch extractors, real-time analysis (see section ?), and has numerous output capabilities (see section ?). The scripting facility allows us to use Tarsos’ building blocks efficiently and in unique ways.
However, Tarsos-based pitch analysis should be handled with care. The following three recommendations should be taken into account. First of all, one cannot extract scales without considering the music itself. Pitch classes that are not frequently used won’t show up clearly in a histogram and hence might be missed. Also, not all music uses distinct pitch classes: the Chinese and Indian music traditions were mentioned in this regard. Because of the physical characteristics of the human voice, voices can glide between the tones of a scale, which makes an accurate measurement of pitch less straightforward. It is recommended to zoom in on the estimations in the melograph representation for a correct understanding.
Secondly, the analysis of polyphonic recordings should be handled with care, since current pitch detection algorithms are primarily geared towards monophonic signals. The analysis of homophonic singing, for example, may give incomplete results. It is advisable to try out different pitch extractors on the same signal to see whether the results are trustworthy.
Finally, [159] recognizes the use of “pitch categories” but warns that, especially for complex inharmonic sounds, a scale is more than a one-dimensional series of pitches and that spectral components need to be taken into account to get better insights into tuning and scales. Indeed, in recent years it has become clear that the timbre of tones and the musical scales in which these tones are used are somehow related [164]. The spectral content of pitch (i.e. the timbre) determines the perception of consonant and dissonant pitch intervals, and therefore also the pitch scale, as the latter is a reflection of the preferred melodic and harmonic combinations of pitch. Based on the principle of minimal dissonance in pitch intervals, it is possible to derive pitch scales from the spectral properties of the sounds and from principles of auditory interference (or critical bands). [162] argue that perception is based on the disambiguation of action-relevant cues, and they manage to show that the harmonic musical scale can be derived from the way speech sounds relate to the resonant properties of the vocal tract. Therefore, the annotated scale resulting from the precise use of Tarsos does not imply the assignment of any characteristic to the music itself. It is up to the user to correctly interpret a possible scale, tonal center, or melodic development.
Tarsos - future work

The present version of Tarsos is a first step towards a tool for pitch distribution analysis. A number of extensions are possible.
For example, given the tight connection between timbre and scale, it would be nice to select a representative tone from the music and transpose it to the entire scale, using a phase vocoder. This sound sample and its transpositions could then be used as a sound font for the midi synthesizer. This would give the scale a more natural feel compared to the general midi device instruments that are currently present.
Another possible feature is tonic detection. Some types of music have a well-defined tonic, e.g. Turkish classical music. It would make sense to use this tonic as a reference pitch class. Pitch histograms and pitch class histograms would then not use the reference frequency defined in appendix ? but a better suited, automatically detected reference: the tonic. This would make the intervals and the scale more intelligible.
Tools for comparing two or more scales may also be added. For example, by creating pitch class histograms for a sliding window and comparing those with each other, it should be possible to automatically detect modulations. Using this technique, it should also be possible to detect pitch drift in choral or other music.
Another research area is to extract features from a large data set and use the pitch class histogram or interval data as a basis for pattern recognition and cluster analysis. With a time-stamped and geo-tagged musical archive, it could be possible to detect geographical or chronological clusters of similar tone scale use.
In the longer term, we plan to add representations of other musical parameters to Tarsos as well, such as rhythmic and instrumental information, and temporal and timbral features. Our ultimate goal is to develop an objective albeit partial view on music by combining these parameters within an easy-to-use interface.
In this paper, we have presented Tarsos, a modular software platform to extract and analyze pitch distributions in music. The concept and main features of Tarsos have been explained and some concrete examples have been given of its usage. Tarsos is a tool in full development. Its main power is related to its interactive features which, in the hands of a skilled music researcher, can become a tool for exploring pitch distributions in Western as well as non-Western music.
Since different representations of pitch are used by Tarsos and other pitch extractors, this section contains definitions of, and remarks on, the different pitch and pitch interval representations.
For humans, the perceptual distance between 220Hz and 440Hz is the same as the perceptual distance between 440Hz and 880Hz. A pitch representation that takes this logarithmic relation into account is more practical for some purposes. Luckily there are a few:
The midi standard defines note numbers from 0 to 127, inclusive. Normally only integers are used, but any frequency in Hz can be represented with a fractional note number using Equation 3.
Rewriting Equation 3 to Equation ? shows that midi note number 0 corresponds with a reference frequency of 8.176Hz, which is C-1 on a keyboard with A4 tuned to 440Hz. It also shows that the midi standard divides the octave in 12 equal parts.
To convert a midi note number n to a frequency f in Hz, the inverse relation can be used: f = 440 × 2^((n − 69)/12).
Using pitch represented as fractional midi note numbers makes sense when working with midi instruments and midi data. Although the midi note numbering scheme seems oriented towards Western pitch organization (12 semitones), it is conceptually equal to the cent unit, which is more widely used in ethnomusicology.
[206] introduced the nowadays widely accepted cent unit. To convert a frequency f in Hz to a cent value relative to a reference frequency r, also in Hz, Equation 4 is used: cent = 1200 × log2(f / r).
With the same reference frequency, Equation 4 and Equation 3 differ only by a constant factor of exactly 100. In an environment with pitch representations in both midi note numbers and cent values, it is practical to use the standardized reference frequency of 8.176Hz, the frequency of midi note 0.
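These relations translate directly into code. The following is only a sketch of the standard formulas (midi note 69 = A4 = 440Hz; absolute cents relative to 8.176Hz, the frequency of midi note 0); it is not the Tarsos source.

```java
/** Sketch of the pitch unit conversions discussed above (not the Tarsos source). */
public class PitchConversionSketch {

    // Frequency of midi note 0, used as the reference for absolute cent values.
    public static final double REF_FREQ = 8.176;

    /** Hz to fractional midi note number (midi note 69 = A4 = 440Hz). */
    public static double hzToMidi(double hz) {
        return 69 + 12 * (Math.log(hz / 440.0) / Math.log(2));
    }

    /** Fractional midi note number to Hz. */
    public static double midiToHz(double midi) {
        return 440.0 * Math.pow(2, (midi - 69) / 12.0);
    }

    /** Hz to an absolute cent value relative to the 8.176Hz reference. */
    public static double hzToAbsoluteCent(double hz) {
        return 1200 * (Math.log(hz / REF_FREQ) / Math.log(2));
    }

    public static void main(String[] args) {
        // 440Hz is midi note 69 and, with the 8.176Hz reference, about 6900 cents:
        System.out.println(hzToMidi(440));         // 69.0
        System.out.println(hzToAbsoluteCent(440)); // ~6900, i.e. 100 times the midi note number
    }
}
```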
The savart and the millioctave are similar logarithmic units; they divide the octave in about 301 and exactly 1000 parts respectively, which is the only difference with cents.
Pitch Ratio Representation

Pitch ratios are essentially pitch intervals: an interval of one octave, 1200 cents, corresponds to a frequency ratio of 2/1. To convert a ratio r to a value in cents c: c = 1200 × log2(r) = (1200 / ln 2) × ln(r).
The natural logarithm, the logarithm with base e (Euler’s number), is noted as ln. To convert a value in cents c to a ratio r: r = 2^(c/1200) = e^(c × ln 2 / 1200).
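These two conversions, again only as a hedged sketch rather than the Tarsos code:

```java
/** Sketch of the ratio and cent interval conversions. */
public class RatioCentSketch {

    /** Frequency ratio to a pitch interval in cents: c = (1200 / ln 2) * ln(r). */
    public static double ratioToCent(double ratio) {
        return 1200 * Math.log(ratio) / Math.log(2);
    }

    /** Pitch interval in cents to a frequency ratio: r = 2^(c / 1200). */
    public static double centToRatio(double cent) {
        return Math.pow(2, cent / 1200.0);
    }

    public static void main(String[] args) {
        System.out.println(ratioToCent(2.0));    // an octave: 1200 cents
        System.out.println(ratioToCent(1.5));    // a just fifth: about 702 cents
        System.out.println(centToRatio(1200.0)); // 2.0
    }
}
```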
Further discussion on cents as pitch ratios can be found in appendix B of [164]. There it is noted that:
There are two reasons to prefer cents to ratios: Where cents are added, ratios are multiplied; and it is always obvious which of two intervals is larger when both are expressed in cents. For instance, an interval of a just fifth (3/2), followed by a just third (5/4), is 3/2 × 5/4 = 15/8, a just seventh. In cents, this is 702 + 386 = 1088. Is this larger or smaller than the Pythagorean seventh 243/128? Knowing that the latter is 1110 cents makes the comparison obvious.
Conclusion

The cent unit is mostly used for pitch interval representation, while the midi key and Hz units are mainly used to represent absolute pitch. The main difference between cents and fractional midi note numbers is the standardized reference frequency. In our software platform Tarsos we use the same standardized reference frequency of 8.176Hz (midi note 0), which enables us to use cents to represent absolute pitch and makes conversion to midi note numbers trivial. Tarsos also uses cents to represent pitch intervals and ratios.
Several audio files were used in this paper to demonstrate how Tarsos works and to clarify musical concepts. In this appendix you can find pointers to these audio files.
The thirty second excerpt of the musical example used throughout chapter ? can be downloaded from http://tarsos.0110.be/tag/JNMR and is courtesy of: wergo/Schott Music & Media, Mainz, Germany, www.wergo.de and Museum Collection Berlin. Ladrang Kandamanyura (slendro pathet manyura) is track eight on Lestari - The Hood Collection, Early Field Recordings from Java - SM 1712 2. It was recorded in 1957 and 1958 in Java.
The yoiking singer of Figure ? can be found on a production released on the label Caprice Records in the series of Musica Sveciae Folk Music in Sweden. The album is called Jojk CAP 21544 CD 3, Track No 38 Nila, hans svager/His brother-in-law Nila.
The api example (section ?) was executed on the data set by Bozkurt, which was also used in [62]. The Turkish song, brought in the makam Hicaz, from Figure 16 is also one of the songs in this data set.
For the comparison of different pitch trackers at the pitch class histogram level (section ?), a subset of the music collection of the Royal Museum for Central Africa (RMCA, Tervuren, Belgium) was used. We are grateful to the RMCA for providing access to its unique archive of Central African music. A song from the RMCA collection was also used in section ?. It has the tape number MR.1954.1.18-4 and was recorded in 1954 by missionary Scohy-Stroobants in Burundi. The song is performed by a singing soloist, Léonard Ndengabaganizi. Finally, the song with tape number MR.1973.9.41-4, also from the collection of the RMCA, was used to show pitch shift within a song (Figure 15). It is called Kana nakunze and was recorded by Jos Gansemans in Mwendo, Rwanda, in 1973.
Keywords: Replication, Acoustic fingerprinting, Reproducibility.
Reproducibility is one of the corner-stones of scientific methodology. A claim made in a scientific publication should be verifiable and the described method should provide enough detail to allow replication, “reinforcing the transparency and accountability of research processes” [113]. The Open Science movement has recently gained momentum among publishers, funders, institutions and practicing scientists across all areas of research. It is based on the assumption that promoting “openness” will foster equality, widen participation, and increase productivity and innovation in science. Re-usability is a keyword in this context: data must be “useful rather than simply available” [113], with a focus on facilitating the advancement of knowledge based on previous work (and on avoiding needlessly repeated work) rather than on verifying the correctness of previous work.
From a technical standpoint, sharing tools and data has never been easier. Reproducibility, however, remains a problem, especially for Music Information Retrieval (MIR) research and, more generally, for research involving complex software systems. This problem has several causes:
Journal articles and especially conference papers have limited space for detailed descriptions of methods or algorithms. Even for only moderately complex systems there are numerous parameters, edge cases and details which are glossed over in textual descriptions. This makes articles readable and the basic method intelligible, but those details need to be expounded somewhere. The ideal place for such details is well documented, runnable code. Unfortunately, intellectual property rights claimed by universities or research institutions often prevent researchers from freely distributing their code. This is problematic since it leaves those reproducing the work guessing at details, and it makes replicating a study prohibitively hard. Even if code is available, it is often not well documented or it is very hard to actually run it and reproduce results.
Copyrights on music make it hard to share music freely. MIR research often has commercial goals and focuses on providing access to commercial, popular music. It is sensible to use commercial music while doing research as well. Unfortunately this makes it potentially very expensive to reproduce an experiment: all music needs to be purchased again and again by researchers reproducing the work.
The original research also needs to uniquely identify the music used, which is challenging if there are several versions, re-issues or recordings of a similarly titled track. Audio fingerprinting techniques allow us to share unique identifiers for such audio.
Redistribution of historical field recordings in museum archives is even more problematic. Due to the nature of the recordings, their copyright status is often unclear. Clearing the status of such tracks involves international, historical copyright laws and multiple stakeholders such as the performers, soloists, the museum, the person who made the field recording and potentially a publisher that already published parts on an LP. The rights of each stakeholder need to be carefully considered, while the stakeholders themselves are difficult to identify due to a lack of precise meta-data and owing also to the passage of time. While it is possible to clear a few tracks, it quickly becomes an insurmountable obstacle to clear a representative set of recordings for the research community. For very old recordings where copyright is not a problem, there are sometimes ethical issues related to public sharing: some Australian indigenous music, for instance, is considered very private and not meant to be listened to by others.
The evaluation of research work (and, most importantly, of researchers) is currently based on the number of articles published in ranked scientific journals or conferences. Other types of scientific output are not valued as much. The advantage of investing resources in documenting, maintaining and publishing reproducible research and supplementary material is not often obvious when prioritising and strategising research outputs [113]. Short-lived project funding is also a factor that directs the attention of researchers to short-term output (publications), and not to long-term aspects of reproducible contributions to a field. In short, there is no incentive to spend much time on non-textual output.
Reproducing works is not an explicit tradition in computer science research. In the boundless hunt to further the state of the art there seems to be no time or place for a sudden standstill and reflection on previous work. Implicitly, however, a lot of replication seems to be going on. It is standard practice to compare the results of a new method with earlier methods (baselines), but it is often not clear whether the authors reimplemented those baselines or managed to find or modify an implementation. It is also not clear whether those baselines were verified for correctness by reproducing the results reported in the original work. Moreover, due to a lack of standardized data sets, approaches are often hard to compare directly.
If all goes well, exact replications do not contain any new findings and may therefore be less likely to get published. Doing a considerable amount of work that risks remaining unpublished is not a proposition many researchers look forward to, which expresses the tension between acting for the good of the community and acting in one’s own interest [137].
In the social sciences, the reproducibility project illustrated that the results of many studies could not be successfully reproduced [143], mainly due to small sample sizes and selection bias; a similar finding was reported in a special issue of Musicae Scientiae on replication in music psychology [59]. In these replicated studies the main problem did not lie in replicating the methods.
For research on complex software systems (MIR), it is expected that replicated results will closely match the original if the method can be accurately replicated and if the data are accessible. But those two conditions are hard to meet. The replication problem lies exactly in the difficulty of replicating the method and accessing the data. Once method and data are available, a statistical analysis of the behavior of deterministic algorithms is inherently less problematic than one of erratic humans. [185] showed that, even if data and method are available, replication can be challenging if the problem is ill-defined and the test data contain inconsistencies.
Even if there is little doubt about the accuracy of reported results, the underlying need for replication remains. First of all, it checks whether the problem is well-defined. Secondly, it tests whether the method is described well and in fine enough detail. Thirdly, it tests whether the data used are described well and accessible. Finally, the results are confirmed. Replication basically serves to check whether proper scientific methodology was used, and it solidifies the original work.
Open Science and MIR

Open Science doesn’t come as a set of prescriptive rules, but rather as a set of principles centred around the concept of “openness”, with (i) theoretical, (ii) technological/practical and (iii) ethical implications. Each scientific community needs to identify how Open Science applies to its own domain, developing “the infrastructures, algorithms, terminologies and standards required to disseminate, visualise, retrieve and re-use data” [107]. A general survey on Open Science policies in the field of MIR has never been performed, so an overview of their current application and their specific interpretation is not clearly defined. However, the members of this community have an implicit understanding of their own methods and their common practices to spread their materials and outputs, making it possible to lay out some fixed points. For example, methods, runnable annotated code and data sets are key to MIR reproducibility. Often research serves precisely to introduce an improvement, a variation or an extension of an existing algorithm. When the algorithm is not available, it needs to be re-created in order to implement the modification - which is not only resource consuming, but also never guarantees that the re-created code matches the antecedent down to the last detail [145]. Data sets are also very important, and they should be made available - if not for the general public, at least for peers and/or reviewers.
Implementing Open Science policies to their full potential would change the face of science practice as we know it today. But this requires a pervasive change in how we understand our day-to-day research activities and how we carry them out, and right now we are in a situation where most researchers endorse openness yet “struggle to engage in community-oriented work because of the time and effort required to format, curate, and make resources widely available” [108]. At the same time, the adoption of Open Science policies is encouraged but not mandatory, and the “variety of constraints and conditions relevant to the sharing of research materials” creates “confusion and disagreement” among researchers [113]. A recent survey of biomedical researchers in the United Kingdom [113] identified nine external factors that affect the practice of Open Science, including the existence (or lack) of repositories and databases for data, materials, software and models; the credit system in academic research; models and guidelines for intellectual property; collaborations with industrial partners, as well as attempts at commercialization; and the digital nature of research. These constraints are generally applicable across scientific domains, thus including MIR - where the aspect of commercialization emerges much earlier in the research workflow, at the level of the music collections that need to be purchased.
Thinking of Open Science in MIR, where systematic support of reproducibility is but one of the possible applications, is thus an invitation to think about openness in relation to “all components of research, including data, models, software, papers, and materials such as experimental samples” [113]. An important and cross-domain side aim of Open Science is also to show the importance of “encouraging critical thinking and ethical reflection among the researchers involved in data processing practices” [107]. Open Science is not only about materials and platforms, but also about people; the ’social’ is not merely ’there’ in science: “it is capitalised upon and upgraded to become an instrument of scientific work” [90].
This work replicates an acoustic fingerprinting system, which makes it one of the very few reported replication articles in Music Information Retrieval. [187] also replicated MIR systems: they replicated two musical genre classification systems to critically review them and to challenge the reported high performances. Our aim is to highlight the reproducibility aspects of a milestone acoustic fingerprinting paper and to provide an illustration of good research practices. In doing so we also provide an implementation to the research community and solidify the original acoustic fingerprinting research.
An acoustic fingerprint is a condensed representation of audio that can be matched reliably and quickly against a large set of fingerprints extracted from reference audio. The general acoustic fingerprinting process is depicted in Figure 17. A short query is introduced into the system, fingerprints are extracted from the query audio and subsequently compared with a large set of fingerprints in the reference database. Finally, either a match is found or the system reports that the query is not present in the database. Such acoustic fingerprinting systems have many use cases, such as digital rights management, identifying duplicates [47], audio synchronization [176] or labeling untagged audio with meta-data [20].
The requirements for an acoustic fingerprinting system are described by [32]: a system needs to be granular, robust, reliable and economic in terms of storage requirements and of the computational load of resolving a query. Granular means that only a short fragment is needed for identification. Robustness is determined by the various degradations a query can be subjected to while remaining recognizable; degradations can include additional noise, low-quality encoding, compression, equalization, pitch-shifting and time-stretching. The ratios between true/false positives/negatives determine the reliability of the system. To allow potentially millions of reference items, economy in terms of storage space is needed. Finally, resolving a query needs to be economic in terms of computational load. The weight of each requirement can shift depending on the context: if only a couple of hundred items end up in the reference database, the low storage space requirement is significantly relaxed.
Acoustic fingerprinting is a well researched MIR topic, and over the years several efficient acoustic fingerprinting methods have been introduced [76]. These methods perform well even with degraded audio quality and with industrial-sized reference databases. Some systems are able to recognize audio even when pitch-shifts are present [58] but without allowing for time-scale modification; other systems are designed to handle both pitch and time-scale modification at the same time. Some of these only support small data sets [214], others relatively large ones [208].
This work replicates and critically reviews the acoustic fingerprinting system by [70]. The ISMIR proceedings article is from 2002 and is elaborated upon in an article in the Journal of New Music Research [71]. The paper was chosen for several reasons:
It is widely cited: the ISMIR paper has been cited more than 750 times in total and more than 250 times since 2013, according to Google Scholar. This indicates that it was relevant and is still relevant today. A recent study, for example, improved the system by replacing the FFT with a filter bank [147]. Another study [40] improved its robustness against noise.
The paper has the very prototypical structure of work that presents and evaluates an MIR system - in this case an acoustic fingerprinting system. Replicating this work, in other words, should be similar to replicating many others.
The described algorithm and the evaluation method are only moderately complex and self-contained, and they only depend on generally available tools or methods. Note that this reason is symptomatic of the reproducibility problem: some papers are borderline impossible to replicate.
Contributions

The contributions of this article are either generally applicable or specific to the replicated work. The specific contributions are the verification of the results described by [70] and a solidification of that work. A second contribution lies in a publicly available, verifiable, documented implementation of the method of that paper.
The paper continues by introducing the method that is replicated and the problems encountered while replicating it. Subsequently the same is done for the evaluation. To ameliorate problems with respect to replicability in the original evaluation, an alternative evaluation is proposed. The results are compared and finally a discussion follows in which guidelines are proposed.
As with most acoustic fingerprinting systems, this method consists of a fingerprint extraction step and a search strategy. In the terminology of Figure 17, these are the feature extraction/fingerprint construction step and the matching step.
Fingerprint extraction

The fingerprint extraction algorithm is described in more detail in section 4.2 of [70] but is summarized here as well. First of all, the input audio is resampled to about 5000Hz. On the resampled signal a Hamming windowed FFT with a length of 2048 samples is taken every 64 samples, an overlap of 31/32 or about 97%. In the FFT output, only 33 logarithmically spaced bins between 300Hz and 2000Hz of the magnitude spectrum are used. The energy of frequency band m at frame index n is called E(n, m). Finally, fingerprint bit F(n, m) is constructed from these energies with the following formula: F(n, m) = 1 if E(n, m) - E(n, m+1) - (E(n-1, m) - E(n-1, m+1)) > 0, and F(n, m) = 0 otherwise.
Since the last frequency band is discarded - there is no E(n, m+1) for the last band - only 32 of the original 33 values remain and every FFT frame is reduced to a 32 bit word. Figure ? shows, for a three second audio fragment, a) the fingerprints of the original, b) those of a 128kb/s CBR MP3 encoded version and c) the difference between the two: the number of places where the two binary words differ (marked in red in Figure ?). This distance measure is also known as the Hamming distance or the bit error rate (BER).
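The bit derivation and the distance measure can be summarized in a short sketch. The matrix of band energies is assumed to be computed beforehand (33 band energies per FFT frame); the names are hypothetical and the code is an interpretation of the description above, not the original implementation.

```java
/** Sketch of the fingerprint bit derivation and the bit error rate (BER). */
public class FingerprintBitsSketch {

    /**
     * Derives the 32 bit fingerprint of frame n from a matrix of band energies
     * energy[frame][band], with 33 logarithmically spaced bands per frame.
     * Bit m is set when the energy difference between bands m and m+1 increases
     * with respect to the previous frame, following the formula above.
     */
    public static int fingerprint(double[][] energy, int n) {
        int word = 0;
        for (int m = 0; m < 32; m++) {
            double diff = (energy[n][m] - energy[n][m + 1])
                        - (energy[n - 1][m] - energy[n - 1][m + 1]);
            if (diff > 0) {
                word |= 1 << m;
            }
        }
        return word;
    }

    /** Hamming distance between two 32 bit fingerprints: the number of differing bits. */
    public static int hammingDistance(int fingerprintA, int fingerprintB) {
        return Integer.bitCount(fingerprintA ^ fingerprintB);
    }
}
```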
Figure ? provides a bit more insight into the BER in two cases. In the first case a high quality query, a 128 kb/s CBR encoded MP3, is compared with the reference and only a small number of bits change; note that there are quite a few places where the BER is zero. The other case uses a low quality GSM codec; the BER, in this case, is always above zero.
The original paper includes many details about the chosen parameters. It defines an FFT size, window function and sample rate, which is a good start. Unfortunately the parameters are not used consistently throughout the paper: two different values are each reported twice for the FFT step size. In the replicated implementation a step size of 64 samples is used.
We further argue that even a detailed, consistent textual description of an algorithm always leaves some wiggle room for different interpretations [145]. Only if source code is available, together with details on which system - software and hardware - the evaluation was done, does an exact replication become feasible. The source code could also include bugs that perhaps affect the results; bugs will, by definition, not be described as such in a textual description.
This strengthens the case that source code should be an integral part of a scientific work. Readers interested in further details of the new implementation are referred to the source code in the supplementary material.
Search strategy

The basic principle of the search strategy is a nearest neighbor search in Hamming space. For each fingerprint extracted from a query, a list of near neighbors is fetched which ideally includes a fingerprint from the matching audio fragment. The actual matching fragment will be present in most lists of near neighbors. A naive approach would compute the Hamming distance between a query fingerprint and each fingerprint in the reference database. This approach can be improved with algorithms that evaluate only a tiny fraction of the reference database yet yield the same retrieval rates. The details of the search strategy are much less critical than the parameters of the fingerprint extraction step: as long as the nearest neighbor search algorithm is implemented correctly, the only difference will be the speed at which a query is resolved.
The search strategy’s main parameter is the supported Hamming distance. With an increased Hamming distance more degraded audio can be retrieved, but the search space quickly explodes: for fingerprints of n bits and a maximum Hamming distance d, the number of candidates to consider per fingerprint equals the sum of the binomial coefficients C(n, k) for k from 0 to d.
A good search strategy strikes a balance between query performance and retrieval rate. The search strategy from the original work does this by keeping track of which bits of a fingerprint are uncertain: the uncertain bits are those for which the energy difference is close to the threshold. It assigns each bit a value from 1 to 32 that describes the confidence in the bit, with 1 being the least reliable bit and 32 the most reliable. Subsequently, a search is done for the fingerprint itself and for fingerprints which are permutations of the original with one or more uncertain bits toggled. To strike that balance between performance and retrieval rate, the number of toggled bits needs to be chosen. If the three least reliable bits are toggled, this generates 2^3 = 8 permutations, which is much less than the C(32,3) = 4960 possibilities of flipping 3 bits anywhere in the 32 bit fingerprint.
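A sketch of this candidate generation is given below. It assumes the reliability information is available as an array of bit positions ordered from least to most reliable; names are hypothetical and the code illustrates the idea rather than the original implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: generate candidate fingerprints by toggling the least reliable bits. */
public class CandidateGenerationSketch {

    /**
     * Returns the 2^k permutations of a fingerprint obtained by toggling any
     * subset of its k least reliable bits (the original fingerprint included).
     *
     * @param fingerprint       a 32 bit fingerprint extracted from the query
     * @param leastReliableBits bit positions (0-31), ordered from least to most reliable
     * @param k                 how many unreliable bits may be toggled (e.g. 3 gives 8 candidates)
     */
    public static List<Integer> candidates(int fingerprint, int[] leastReliableBits, int k) {
        List<Integer> result = new ArrayList<>();
        for (int subset = 0; subset < (1 << k); subset++) {
            int candidate = fingerprint;
            for (int i = 0; i < k; i++) {
                if ((subset & (1 << i)) != 0) {
                    candidate ^= 1 << leastReliableBits[i]; // toggle this unreliable bit
                }
            }
            result.add(candidate);
        }
        return result;
    }
}
```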
Once a matching fingerprint is found, the next step is to compare a set of fingerprints of the query with the corresponding set of fingerprints of the reference audio. The Hamming distance for each fingerprint pair is calculated; if the sum of the distances is below a threshold, a match is declared, otherwise the search continues until either a match is found or the query is labeled as unknown. The parameters were determined experimentally in the original work: 256 fingerprints are checked and the threshold for the summed Hamming distance is 2867 bits. So from a total of 256 × 32 = 8192 bits, 2867 or about 35% are allowed to be different.
The implementation is done with two hash tables. The first is a lookup table with fingerprints as keys and lists of (track identifier, offset) pairs as values: the identifier refers uniquely to a track in the reference database and the offset points precisely to the time at which the fingerprint appears in that track. The second hash table has a track identifier as key and an array of fingerprints as value. Using the offset, the index into the fingerprint array can be determined. Subsequently, the previous 256 fingerprints of the query can be compared with the corresponding fingerprints in the reference set, and a match can be verified.
Implementing this search strategy is relatively straightforward.
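A minimal sketch of the two hash tables and the match verification is given below. Class and method names are hypothetical; it illustrates the data structures described above rather than the implementation in the supplementary material.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the two hash tables described above (hypothetical names). */
public class FingerprintIndexSketch {

    /** A (track identifier, offset) pair: where a fingerprint occurs in the reference set. */
    public static class Occurrence {
        final int trackId;
        final int offset; // frame index within the track
        Occurrence(int trackId, int offset) {
            this.trackId = trackId;
            this.offset = offset;
        }
    }

    // Fingerprint -> places where that fingerprint occurs in the reference database.
    private final Map<Integer, List<Occurrence>> lookup = new HashMap<>();
    // Track identifier -> the full array of fingerprints of that track.
    private final Map<Integer, int[]> fingerprintsByTrack = new HashMap<>();

    /** Adds a reference track to the index. */
    public void add(int trackId, int[] fingerprints) {
        fingerprintsByTrack.put(trackId, fingerprints);
        for (int offset = 0; offset < fingerprints.length; offset++) {
            lookup.computeIfAbsent(fingerprints[offset], key -> new ArrayList<>())
                  .add(new Occurrence(trackId, offset));
        }
    }

    /** Candidate occurrences for one (possibly bit-toggled) query fingerprint. */
    public List<Occurrence> candidatesFor(int fingerprint) {
        return lookup.getOrDefault(fingerprint, new ArrayList<>());
    }

    /**
     * Verifies a candidate: sums the Hamming distances over a block of aligned
     * fingerprints and compares the sum with a threshold (2867 bits for a block
     * of 256 fingerprints in the original work).
     */
    public boolean verify(int[] queryBlock, int trackId, int refStart, int threshold) {
        int[] reference = fingerprintsByTrack.get(trackId);
        int total = 0;
        for (int i = 0; i < queryBlock.length && refStart + i < reference.length; i++) {
            total += Integer.bitCount(queryBlock[i] ^ reference[refStart + i]);
        }
        return total <= threshold;
    }
}
```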
The evaluation of the system is done in two ways. First we aim to replicate the original evaluation and match the original results as closely as possible, to validate both the new implementation and the original work. The original evaluation is not easily replicated since it uses copyrighted evaluation audio with ambiguous descriptions, a data set that is neither available nor described, and modifications that are detailed only up to a certain degree.
The second evaluation is fully replicable: it uses freely available evaluation audio, a data set with creative commons music and modifications that are encoded in scripts. Interested readers are encouraged to replicate the results in full.
Replication of the original evaluation

The evaluation of the original system is done on four short excerpts from commercially available tracks:
’We selected four short audio excerpts (Stereo, 44.1kHz, 16bps) from songs that belong to different musical genres: “O Fortuna” by Carl Orff, “Success has made a failure of our home” by Sinead o’Connor, “Say what you want” by Texas and “A whole lot of Rosie” by AC/DC.’
Unfortunately, it fails to mention which excerpts were used or even how long these excerpts were. The selection does have an effect on performance: if, for instance, a part with little acoustic information is selected versus a dense part, different results can be expected. It also fails to mention which edition, version or release is employed, which is especially problematic for the classical piece: many performances exist, with varying lengths and intensities. The paper also mentions a reference database of 10 000 tracks but fails to specify which tracks it contains. The fact that only one excerpt from each song is used for evaluation makes the selection critical, which is problematic in itself. Reporting an average performance with standard deviations would have been more informative.
To evaluate the robustness of the system, each short excerpt is modified in various ways. The modifications to the query are described well, but there is still room for improvement. For example, it is not mentioned how time-scale modification is done: there are different audible artifacts - i.e. different results - when a time domain or a frequency domain method for time-scale modification is used. The description of the echo modification seems to have been forgotten altogether, while the dry/wet mix or delay length parameters definitely have a large effect on the sound and the subsequent results.
To summarize: essential information to replicate the results exactly is missing. The next best thing is to follow the basic evaluation method, which can be replicated by following various clues and making assumptions. To this end, the four tracks mentioned previously were bought from a digital music store (7digital, see Table ?). Two were available in a lossless format and two in a high quality MP3 format (320 kb/s CBR). The test data set cannot be freely shared since commercial music is used, which, again, hinders replicability.
| Identifier | Track | Format |
|---|---|---|
| 56984036 | Sinead | 320kbs MP3 |
| 52740482 | AC/DC | 16-bit/44.1kHz FLAC |
| 122965 | Texas | 320kbs MP3 |
| 5917942 | Orff | 16-bit/44.1kHz FLAC |
The original evaluation produces two tables. The first documents the bit error rates (BER, Table ?). It compares the fingerprints extracted from a reference recording with those of modified versions. If all bits are equal, the error rate is zero; if all bits are different, the error rate is one. Comparing random fingerprints results in a bit error rate of around 0.5. The original article suggests that 256 fingerprints (about three seconds of audio) are compared and that the average is reported. Experimentally, the original article determines that a BER of 0.35 or less is sufficient to claim that two excerpts are the same with only a very small chance of yielding a false positive. The BER evaluation has been replicated, but because the excerpts are not identical and the modifications also deviate slightly, the replicated BER values differ. However, if the original and replicated results are compared using a Pearson correlation, there is a very strong linear relation. This analysis suggests that the system behaves similarly for the various modifications. The analysis left out the white noise condition, which is an outlier: the replicated modification probably mixed more noise into the signal than the original. Some modifications could not be successfully replicated, either because they are no longer technically relevant (cassette tape, Real Media encoding) or because the method to perform the modification was unclear (GSM C/I).
A second table (Table ?) shows how many of 256 fingerprints could be retrieved in two cases. The first case tries to find only exact matches in the database; the reported number shows how many of the 256 fingerprints point to the matching fingerprint block in the database, with a maximum of 256 if all fingerprints match. In the second case the 10 most unreliable bits are flipped, resulting in 1024 candidates per fingerprint, which are then matched with the database. In both cases only one correct hit is needed to successfully identify an audio excerpt.
The original results are compared with the replicated results using a Pearson correlation. The exact matching case shows a strong linear correlation and the case with 10 flipped bits shows similar results. This suggests that the system behaves similarly, considering that the audio excerpts, the modifications and the implementation include differences and that various assumptions had to be made.
A replicable evaluation

The original evaluation has several problems with respect to replicability. It uses commercial music but fails to mention which exact audio is used, both for the reference database and for the evaluation. The process to generate the modifications is documented but still leaves room for interpretation. There are also other problems: the evaluation depends on the selection of only four audio excerpts.
The ideal acoustic fingerprinting system evaluation depends on the use-case. For example, the evaluation method described by [150] focuses mainly on broadcast monitoring and the specific modifications that appear when broadcasting music over the radio. The SyncOccur corpus [151] also focuses on this use-case. An evaluation of an acoustic fingerprinting system for DJ-set monitoring [180] or sample identification [197] needs another approach. These differences in focus lead to a wide variety of evaluation techniques, which makes systems hard to compare directly. The evaluation described here evaluates a fingerprinting system for (re-encoded) duplicate detection with simple degradations.
The evaluation is done as follows. Using a script, available as supplementary material, 10,100 creative commons licensed musical tracks are downloaded from Jamendo, a music sharing service. 10,000 of these tracks are added to the reference database; the remaining 100 are not. The script provides the list of Jamendo track identifiers that make up the reference database. Using another script, 1100 queries are selected at random.
Once the results are available, each query is checked to determine whether it is a true positive (TP), false positive (FP), true negative (TN) or false negative (FN). Figure 18 is a graphical reminder of these measures. Next to TP, FP, TN and FN, the sensitivity, specificity, precision and accuracy are calculated as well. Table ? gives the relation between these measures.
Sensitivity | TP / (TP + FN) |
Specificity | TN / (TN + FP) |
Precision | TP / (TP + FP) |
Accuracy | (TP + TN) / (TP + TN + FP + FN) |
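These definitions translate directly into code; the following trivial sketch (hypothetical names) is included only for completeness.

```java
/** Sketch of the evaluation measures computed from the query counts. */
public class EvaluationMeasuresSketch {
    public static double sensitivity(int tp, int fn) { return tp / (double) (tp + fn); }
    public static double specificity(int tn, int fp) { return tn / (double) (tn + fp); }
    public static double precision(int tp, int fp)   { return tp / (double) (tp + fp); }
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (tp + tn) / (double) (tp + tn + fp + fn);
    }
}
```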
Table ? summarizes the results. As expected, the system’s specificity and precision are very high. The few cases where a false positive is reported are due to audio duplicates in the reference database: the reference database contains a few duplicate items where the audio is either completely the same or where parts of another track are sampled. Note that the evaluation is done at the track level; the time offset is not taken into account. Since exact repetition is not uncommon, especially in electronic music, a query can be found at multiple, equally correct time offsets. If the system returns the correct track identifier with an unexpected offset, it is still counted as a true positive.
The sensitivity and accuracy of the system go down when the average number of bit errors per fingerprint approaches the threshold of 10 erroneous bits. True positives for GSM encoded material are only found about half of the time. The average Hamming distance in bits for queries with a changed time scale is higher than for the GSM encoded queries, while the accuracy is much higher. This means that for the GSM encoded material the reliability information is itself not reliable: the 10 least reliable bits are flipped, but still the original fingerprint is not found for about half of the queries.
There are some discrepancies between these results and the results reported in the original study. The average Hamming distance between queries and reference is higher in the new evaluation. This is potentially due to the use of 128kbs MP3s during the evaluation: the original material is decoded before being stored in the reference database and the queries are re-encoded after modification. Another discrepancy is related to the GSM encoded queries: the original results seem to suggest that all GSM encoded queries would yield a true positive (see Table ?), which was not achieved in the replication. Whether this is due to incorrect assumptions, different source material, the evaluation method or other causes is not clear.
As the statistical comparison showed, the replicated system generally behaves in a similar way to the originally described system. On top of that, an alternative, reproducible evaluation showed that following the system’s design yields a functional acoustic fingerprinting system. There are, however, unexplained discrepancies between both systems, especially concerning the GSM modification. It is worrisome that it is impossible to pinpoint the source of these discrepancies, since neither the original evaluation material, nor the evaluation method, nor the implementation is available. While there is no guarantee that the replication is bug free, at least its source can be checked.
All in all, the results are quite similar to the original. As stated in the introduction, the replication of results should be expected to pose no problem; it is the replication of methods and the accessibility of data that make replication prohibitively time-consuming. This could be alleviated by releasing research code and data. While the focus of the MIR community should remain on producing novel techniques to deal with musical information and not on producing end-user ready software, it would be beneficial for the field to keep sustainable software aspects in mind when releasing research prototypes - aspects such as those identified by [83], where a distinction is made between usability (documentation, installability, ...) and maintainability (identity, copyright, accessibility, interoperability, ...).
Intellectual property rights, copyrights on music and a lack of incentive pose a problem for reproducibility of MIR research work. There are, however, ways to deal with these limiting factors and foster reproducible research. We see a couple of work-arounds and possibilities, which are described below.
As universities are striving more and more for open-access publications, there could be a similar movement for data and code. After all, it makes little sense to publish only part of the research in the open (the textual description) while keeping code and data behind closed doors, especially if the research is funded by public funds. In Europe, there is an ambition to make all scientific articles freely available by 2020 and to achieve optimal reuse of scientific data.
Copyrights on music make it hard to share music freely. We see two ways to deal with this:
Pragmatic vs Ecological or Jamendo vs iTunes. There is a great deal of freely available music published under various creative commons licenses. Jamendo, for example, contains half a million cc-licensed tracks which are uniquely identifiable and can be downloaded via an API. Much of the music found there is recorded at home with limited means; it contains only a few professionally produced recordings. This means that systems can behave slightly differently on the Jamendo set compared with a set of commercial music: what is gained in pragmatism is perhaps lost in ecological validity. Whether this is a problem depends very much on the research question at hand. In the evaluation proposed here, Jamendo was used (similarly to [182]) since it offers a large variability in genres and is representative for this use-case.
Audio vs Features. Research on features extracted from audio does not need the audio itself: if the features are available, this can suffice. There are two large sets of audio features: the Million Song Dataset by [14] and AcousticBrainz, described by [149]. Both ran feature extractors on millions of commercial tracks and have an API to query or download the data. Unfortunately the source of the feature extractors used in the Million Song Dataset is not available and is only described up to a certain level of detail, which makes it a black box and, in my eyes, unfit for decent reproducible science. Indeed, due to internal reorganizations and mergers the API and the data have become less and less available; the science built on the Million Song Dataset is on shaky ground. Fortunately AcousticBrainz is completely transparent: it uses well documented, open source software [16] and the feature extractors are reproducible. The main shortcoming of this approach is that only a curated set of features is available. If another feature is needed, then you are out of luck: adding a feature is far from trivial, since even AcousticBrainz has no access to all audio; they rely on crowdsourced feature extraction.
Providing an incentive for researchers to make their research reproducible is hard and requires a mentality shift. Policies of journals, conference organizers and research institutions should gradually change to require reproducibility. There are a few initiatives to foster reproducible research specifically for music informatics research. The 53rd Audio Engineering Society (AES) conference had a prize for reproducibility. ISMIR 2012 had a tutorial on “Reusable software and reproducibility in music informatics research”, but structural attention for this issue at ISMIR seems to be lacking. There is, however, a yearly workshop organized by Queen Mary University London (QMUL) on “Software and Data for Audio and Music Research”:
The third SoundSoftware.ac.uk one-day workshop on “Software and Data for Audio and Music Research” will include talks on issues such as robust software development for audio and music research, reproducible research in general, management of research data, and open access.
At QMUL there seems to be continuous attention to the issue, and researchers are trained in software craftsmanship.
In this article we problematized reproducibility in MIR and illustrated this by replicating an acoustic fingerprinting system. While similar results were obtained, there are unexplained and unexplainable discrepancies due to the fact that the original data, method and evaluation are only partly available and assumptions had to be made. We proposed an alternative, reproducible evaluation and extrapolated general guidelines aiming to improve the reproducibility of MIR research in general.
Keywords: MIR applications, documentation, collaboration, digital music archives.
Music Information Retrieval (MIR) technologies have a lot of untapped potential in the management of digital music archives, and there seem to be several reasons for this. One is that MIR technologies are simply not well known to archivists. Another reason is that it is often unclear how MIR technology can be applied in a digital music archive setting. A third reason is that considerable effort is often needed to transform a potentially promising MIR research prototype into a working solution for archivists as end-users.
In this article we focus on duplicate detection. It is an MIR technology that has matured over the last two decades for which there is usable software available. The aim of the article is to describe several applications for duplicate detection and to encourage the communication about them to the archival community. Some of these applications might not be immediately obvious since duplicate detection is used indirectly to complement meta-data, link or merge archives, improve listening experiences and it has opportunities for segmentation. These applications are grounded in experience with working on the archive of the Royal Museum for Central Africa, a digitised audio archive of which the majority of tracks are field recordings from Central Africa.
The problem of duplicate detection is defined as follows:
How to design a system that is able to compare every audio fragment in a set with all other audio in the set to determine if the fragment is either unique or appears multiple times in the complete set. The comparison should be robust against various artefacts.
The artefacts in the definition above include noise from various sources: imperfections introduced during the analog-to-digital (A/D) conversion, and artefacts resulting from mechanical defects, such as clicks from gramophone discs or magnetic tape hum. Detecting duplicates should also be possible when changes in volume, compression or dynamics are introduced.
[140] distinguishes between exact, near and far duplicates. Exact duplicates contain exactly the same information; near duplicates are two tracks with minor differences, e.g. a lossless and a lossy version of the same audio. Far duplicates are less straightforward. A far duplicate can be an edit where parts are added to the audio – e.g. a radio versus an album edit with a solo added. Live versions or covers of the same song can also be regarded as far duplicates, as can a song that samples an original. In this work we focus on duplicates which contain the same recorded material as the original. This includes samples and edits but excludes live versions and covers.
The need for duplicate detection is there since, over time, it is almost inevitable that duplicates of the same recording end up in a digitised archive. For example, an original field recording is published on an LP, and both the LP and the original version get digitised and stored in the same lot. It is also not uncommon that an archive contains multiple copies of the same recording because the same live event was captured from two different angles (normally on the side of the parterre and from the orchestra pit), or because, before the advent of digital technology, copies of degrading tapes were already being made on other tapes. Last but not least, the chance of duplicates grows exponentially when different archives or audio collections get connected or virtually merged, which is a desirable operation and one of the advantages introduced by digital technology (see ?).
From a technical standpoint and using the terminology by [32] a duplicate detector needs to have the following requirements:
It needs to be capable of marking duplicates without generating false positives or missing true positives. In other words, precision and recall need to be acceptable.
It should be capable of operating on large archives: it should be efficient. Efficient here means quick when resolving a query, and economical with storage and memory when building an index.
Duplicates should be marked as such even if there is noise or the speed is not kept constant. It should be robust against various modifications.
Lookup of short audio fragments should be possible: the algorithm should be granular. A resolution of 20 seconds or less is beneficial.
Once such a system is available, several applications are possible. [140] describes many of these applications as well but, notably, the application of re-using segmentation boundaries is missing.
Being aware of duplicates is useful to check or complement meta-data. If an item has richer meta-data than a duplicate, the meta-data of the duplicate can be integrated. With a duplicate detection technology conflicting meta-data between an original and a duplicate can be resolved or at least flagged. The problem of conflicting meta-data is especially prevalent in archives with ethnic music where often there are many different spellings of names, places and titles. Naming instruments systematically can also be very challenging.
When multiple recordings in sequence are marked as exact duplicates, meaning they contain the exact same digital information, this indicates inefficient storage use. If they do not contain exactly the same information, it is possible that either the same analogue carrier was accidentally digitised twice or there are effectively two analogue copies with the same content. To improve the listening experience the highest-quality digitised version can be returned on request, or, alternatively, to assist philological research all the different versions (variants, witnesses of the archetype) can be returned.
It potentially solves segmentation issues. When an LP is digitised as one long recording and the same material has already been segmented in another digitisation effort, the segmentation boundaries can be reused. Duplicate detection also makes it possible to identify when different segmentation boundaries are used: perhaps an item was not segmented in one digitisation effort while a partial duplicate is split and has an extra meta-data item – e.g. an extra title. Duplicate detection thus allows re-use of segmentation boundaries or, at the bare minimum, indicates segmentation discrepancies.
Technology makes it possible to merge or link digital archives from different sources – e.g. the creation of a single point of access to documentation from different institutions concerning a special subject; the implementation of the “virtual re-unification” of collections and holdings from a single original location or creator now widely scattered [82]. More and more digital music archives ‘islands’ are bridged by efforts such as Europeana Sounds. Europeana Sounds is a European effort to standardise meta-data and link digital music archives. The EuropeanaConnect/DISMARC Audio Aggregation Platform provides this link and could definitely benefit from duplicate detection technology and provide a view on unique material.
If duplicates are found in one of these merged archives, all previous duplicate detection applications come into play as well. How similar is the meta-data between original and duplicate? How large is the difference in audio quality? Are both original and duplicate segmented similarly or is there a discrepancy?
Robustness to speed change. Duplicate detection robust to speed changes has an important added value. When the playback (or recording) speed of an analogue carrier changes, both tempo and pitch change accordingly. Most people are familiar with the effect of playing a 33 rpm LP at 45 rpm. But the problem with historic archives and analogue carriers is more subtle: the speed at which a tape gets digitised might not match the original recording speed, which affects the resulting pitch. Often it is impossible to determine with reasonable precision whether the recording device was defective or inadequately operated, or whether the portable recorder was slowly running out of battery.
So not only is it nearly impossible to make a good estimation of the original non-standard recording speed, the speed might not be constant at all: it could actually fluctuate ‘around’ a standard speed. This is also a problem with wax cylinders, where numerous speed indications exist but are not systematically used – if indications are present at all. Since this problem cannot be solved with exact precision, a viable approach, balancing technical needs and philological requirements, is normally to transfer the audio information at standard speed with state-of-the-art, perfectly calibrated machinery. The precision of the A/D transfer system in a way compensates for the uncertainty of the source materials. We still obtain potentially sped-up or slowed-down versions of the recording, but when the original context in which the recording was produced can be reconstructed, it is possible to add and subtract quantities from the digitised version because the transfer speed is exactly known (and its parameters ought to be documented in the preservation meta-data). If the playback speed during transfer is tampered with, adapted or guessed, anything that results in non-standard behaviour in the attempt to match the original recording speed will do nothing but add uncertainty to uncertainty, imprecision to imprecision.
An additional reason to digitise historical audio recordings at standard speed and with state-of-the-art, perfectly calibrated machinery is that, by doing so, the archive master [81] preserves the information on the fluctuations of the original. If we are to “save history, not rewrite it” [17], then our desire to “improve” the quality of the recording during the process of A/D conversion should be held back. Noises and imperfections present in the source carrier bear witness to its history of transmission and as such constitute part of the historical document. Removing or altering any of these elements violates basic philological principles [19] that should be assumed in any act of digitisation which has the ambition to be culturally significant. The output of a process where sources have been altered (with good or bad intentions, consciously or unconsciously, intentionally or unintentionally, or without documenting the interventions) is a corpus that is not authentic, unreliable and for all intents and purposes useless for scientific studies. Therefore, in the light of what has been said so far, the problem of speed fluctuation is structural and endemic in historical analogue sound archives and cannot easily be dismissed. Hence it is crucially important that algorithms treating this type of material consider this problem and operate accordingly.
Some possible applications of duplicate detection have been presented in the previous section; now we see how they can be put into practice. It is clear that naively comparing every audio fragment – e.g. every five seconds – with all other audio in an archive quickly becomes impractical, especially for medium-to-large archives. Adding robustness to speed changes to this naive approach makes it downright impossible. An efficient alternative is needed, and this is where acoustic fingerprinting techniques come into play, a well-researched MIR topic.
The aim of acoustic fingerprinting is to generate a small representation of an audio signal that can be used to reliably identify identical, or recognise similar, audio signals in a large set of reference audio. One of the main challenges is to design a system in such a way that the reference database can grow to contain millions of entries. Over the years several efficient acoustic fingerprinting methods have been introduced [209]. These methods perform well, even with degraded audio quality and with industrial-sized reference databases. However, these systems are not designed to handle duplicate detection when the speed differs between the original and the duplicate. To this end, fingerprinting systems robust against speed changes are desired.
Some fingerprinting systems have been developed that take pitch-shifts into account [58] without allowing time-scale modification. Others are designed to handle both pitch and time-scale modification [214]. The system by [214] employs an image processing algorithm on an auditory image to counter time-scale modification and pitch-shifts. Unfortunately, the system is computationally expensive: it iterates the whole database to find a match. The system by [118] allows extreme pitch-shifting and time-stretching, but has the same problem.
The ideas behind [175] allow efficient duplicate detection robust to speed changes. Such systems are built mainly with the recognition of original tracks in DJ-sets in mind: tracks used in DJ-sets are manipulated in various ways, and often the speed is changed as well. The problem translates almost directly to duplicate detection for archives. The respective research articles show that these systems are efficient and able to recognise audio with a speed change.
Only [175] seems directly applicable in practice since it is the only system for which there is runnable software and documentation available. It can be downloaded from http://panako.be and has been tested with datasets containing tens of thousands of tracks on a single computer. The output is data about duplicates: which items are present more than once, together with time offsets.
The idea behind Panako is relatively simple. Audio enters the system and is transformed into a spectral representation. In the spectral domain, peaks are identified. Some heuristics are used to detect only salient, identifiable peaks and to ignore spectral peaks in areas with equal energy – e.g. silent parts. Once peaks are identified, they are bundled to form triplets. Valid triplets only use peaks that are near both in frequency and in time. For performance reasons a peak is also only used in a limited number of triplets. These triplets are the fingerprints that are hashed, stored and ultimately queried for matches.
Exact hashing makes lookup fast but needs to be done diligently to allow retrieval of audio with modified speed. A fingerprint together with a fingerprint extracted from the same audio but with modified speed can be seen in Figure 19. While the absolute time values change, the ratios remain the same: for three peaks at times $t_1$, $t_2$ and $t_3$, the ratio $(t_2 - t_1)/(t_3 - t_1)$ is unaffected by a speed change. The same holds true for the frequency ratios. This information is used in a hash. Next to the hash, the identifier of the audio is stored together with the start time of the first spectral peak.
Lookup follows a similar procedure: fingerprints are extracted and hashes are formed. Matching hashes from the database are returned and these lists are processed. If the list contains an audio identifier multiple times, and the start times of the matching fingerprints align in time after accounting for an optional linear scaling factor, then a match is found. The linear time-scaling factor is returned together with the match. An implementation of this system was used in the case study.
The Royal Museum for Central Africa, Tervuren, Belgium preserves a large archive with field recordings mainly from Central Africa. The first recordings were made on wax cylinders in the late 19th century; later on all kinds of analogue carriers were used, from various types of gramophone discs to sonofil. During a digitisation project called DEKKMMA (digitisation of the Ethnomusicological Sound Archive of the Royal Museum for Central Africa) [42] the recordings were digitised. Due to its history and size it is reasonable to expect that duplicates can be found in the collection. In this case study we want to identify the duplicates, quantify the similarity in meta-data between duplicates and report the number of duplicates with modified speed. The aim here is not to improve the data itself, since that requires specialists with deep knowledge of the archive to resolve or explain (meta-data) conflicts; we mainly want to illustrate the practical use of duplicate detection.
With Panako [175], fingerprints of 35,306 recordings of the archive were extracted. With the default parameters of Panako this resulted in an index of 65 million fingerprints for 10 million seconds of audio, or 6.5 fingerprints per second. After indexing, each recording was split into pieces of 25 seconds with 5 seconds of overlap, which means a granularity of 20 seconds. Each of those pieces (10,000,000 s / 20 s = 500,000 items) was compared with the index and resulted in a match with itself and potentially one or more duplicates. After filtering out identical matches, 4,940 fragments of 25 seconds were found to be duplicates. The duplicate fragments originated from 887 unique recordings, which means that 887 recordings (2.5%) were found to be (partial) duplicates. Thanks to the efficient algorithm, this whole process requires only modest computational power: it was performed on an Intel Core2 Quad CPU Q9650 @ 3.00GHz with 8GB RAM, introduced in 2009.
Due to the nature of the collection, some duplicates were expected. In some cases the collection contains both the digitised version of a complete side of an analogue carrier and segmented recordings. Eighty duplicates could potentially be explained in this way thanks to similarities in the recording identifier. In the collection, recordings have an identifier that follows this scheme:
collection_name.year.collection_id.subidentifier-track
If a track identifier contains A or B it refers to a side of an analog carrier (cassette or gramophone disc). The pair of recordings MR.1979.7.1-A1 and MR.1979.7.1-A6 suggests that A1 contains the complete side and A6 is track 6 on that side. The following duplicate pair suggests that the same side of a carrier has been digitised twice but stored under two identifiers: MR.1974.23.3-A and MR.1974.23.3-B. Unfortunately this means that one side is probably not digitised.
The roughly 800 other duplicates do not have similar identifiers and lack a straightforward explanation. These duplicates must have accumulated over the years; potentially, duplicates entered the archive in the form of analogue copies in donated collections. Listening to both versions makes clear that some do not originate from the same analogue carrier. The supplementary material contains some examples. Next, we compare the meta-data differences between originals and duplicates.
Differences in meta-data. Since the duplicates originate from the same recorded event, the original and duplicate should have identical or very similar meta-data describing their content. Unfortunately this is not the case. In general, meta-data implementation depends on the history of an institution. In this case the older field recordings were often made by priests or members of the military who did not follow a strict methodology to describe the musical audio and its context. Changes in geographical nomenclature over time, especially in Africa, are also a confounding factor [43]. There is also a large number of vernacular names for musical instruments: the lamellophone, for example, is known as Kombi, Kembe, Ekembe, Ikembe, Dikembe and Likembe [43], to name only a few variations. On top of that, the majority of the Niger-Congo languages are tonal (Yoruba, Igbo, Ashanti, Ewe), which further limits accurate, consistent description with a western alphabet. These factors, combined with human error in transcribing and digitising information, result in an accumulation of inaccuracies. Figure ? shows the physical meta-data files. If there are enough duplicates in an archive, duplicate detection can serve as a window on the quality of the meta-data in general.
Field | Empty | Different | Exact match | Fuzzy or exact match
---|---|---|---|---
Identifier | 0.00% | 100.00% | 0.00% | 0.00%
Year | 20.83% | 13.29% | 65.88% | 65.88%
People | 21.17% | 17.34% | 61.49% | 64.86%
Country | 0.79% | 3.15% | 96.06% | 96.06%
Province | 55.52% | 5.63% | 38.85% | 38.85%
Region | 52.03% | 12.16% | 35.81% | 37.95%
Place | 33.45% | 16.67% | 49.89% | 55.86%
Language | 42.34% | 8.45% | 49.21% | 55.74%
Functions | 34.12% | 25.34% | 40.54% | 40.54%
Title | 42.23% | 38.40% | 19.37% | 30.18%
Collector | 10.59% | 14.08% | 75.34% | 86.71%
Table 3 shows the results of the meta-data analysis. For every duplicate a pair of meta-data elements is retrieved and compared; they are either empty, match exactly or differ. Some pairs match quite well but not exactly: it is clear that the title of the original O ho yi yee yi yee is very similar to the title of the duplicate O ho yi yee yie yee. To capture such similarities as well, a fuzzy string matching algorithm based on Sørensen–Dice coefficients is employed (a minimal sketch follows the table below). When comparing the title of an original with that of a duplicate, only 19% match exactly; if fuzzy matches are included, 30% match. The table makes clear that titles often differ, while country is the most stable meta-data field. It also makes clear that the overall quality of the meta-data leaves much to be improved. Correctly merging meta-data fields requires specialist knowledge - is it yie or yi - and individual inspection, which falls outside the scope of this case study.
Original title | Duplicate title
---|---
Warrior dance | Warriors dance
Amangbetu Olia | Amangbetu olya
Coming out of walekele | Walekele coming out
Nantoo | Yakubu Nantoo
O ho yi yee yi yee | O ho yi yee yie yee
Enjoy life | Gently enjoy life
Eshidi | Eshidi (man’s name)
Green Sahel | The green Sahel
Ngolo kele | Ngolokole
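As announced above, a bigram-based Sørensen–Dice similarity can be implemented in a few lines. The sketch below illustrates the general technique used for the fuzzy comparison; it is not the exact implementation used in the case study, and the threshold for accepting a fuzzy match is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

public final class DiceSimilarity {

    /** Returns a similarity in [0,1] based on shared character bigrams. */
    static double dice(String a, String b) {
        Map<String, Integer> bigramsA = bigrams(a.toLowerCase());
        Map<String, Integer> bigramsB = bigrams(b.toLowerCase());
        int overlap = 0, total = 0;
        for (Map.Entry<String, Integer> e : bigramsA.entrySet()) {
            overlap += Math.min(e.getValue(), bigramsB.getOrDefault(e.getKey(), 0));
        }
        for (int c : bigramsA.values()) total += c;
        for (int c : bigramsB.values()) total += c;
        return total == 0 ? 0.0 : 2.0 * overlap / total;
    }

    /** Counts overlapping two-character substrings. */
    private static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A close pair from the table above scores near 1.0.
        System.out.println(dice("O ho yi yee yi yee", "O ho yi yee yie yee"));
    }
}
```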
Speed modifications. In our dataset only very few items with modified speed were detected: for 98.8% of the identified duplicates the speed matches exactly between original and duplicate. For the remaining 12 identified duplicates the speed change lies in a limited range, from -5% to +4%. These 12 pieces must have multiple analogue carriers in the archive. Perhaps copies were made with recording equipment that was not calibrated, or, if the live event was captured from multiple angles, the calibration of the original recorders may not have been consistent. There are a number of reasons why a digitised archive ends up containing copies of the same content at slightly different speeds, but it is normally desirable that the cause lies in the attributes of the recordings before digitisation and is not introduced during the digitisation process. Our case study shows that duplicates can be successfully detected even when speed is modified. How this is done is explained in the following section.
In this section, the practical functioning of Panako is described. The Panako acoustic fingerprinting suite is Java software and needs a recent Java Runtime. The Java Runtime and TarsosDSP [174] are the only dependencies of the Panako system; no other software needs to be installed. Java makes the application multi-platform and compatible with most software environments. Panako has a command-line interface, so users are expected to have a basic understanding of their command-line environment.
Panako contains a deduplicate command which expects either a list of audio files or a text file containing the full paths of audio files separated by newlines. The text file approach is more practical for large archives. After running the deduplicate program, a text file contains the full paths of duplicate files together with the times at which the duplicate audio was detected.
Several parameters need to be set for a successful de-duplication. The main parameters determine the granularity level, the allowed modifications and the performance level. The granularity level determines the size of the audio fragments that are used for de-duplication: if this is set to 20 seconds instead of 10, the number of queries is, obviously, halved. If the speed is expected to be relatively stable, a parameter can be set to limit the allowed speed change. The performance can be tuned by choosing the number of fingerprints that are extracted per second. Together, these parameters determine trade-offs between query speed, storage size and retrieval performance. The default parameters should make the system perform reasonably effectively in most cases.
The indirect application of linking meta-data depends on the organization of the meta-data of the archive, but has some common aspects. First, the audio identifiers of duplicates are arranged in original/duplicate pairs. Subsequently, the meta-data of these pairs is retrieved from the meta-data store (e.g. a relational database system). Finally, the meta-data element pairs are compared and resolved. The last step can use a combination of rules to automatically merge meta-data and manual intervention when a meta-data conflict arises. The manual intervention requires analysis to determine the correct meta-data element for both original and duplicate.
Reuse of segmentation boundaries needs similarly custom solutions. However, there are again some commonalities. First, audio identifiers from the segmented set are identified within the unsegmented set, resulting in a situation as in Figure 20. The identified segment boundaries can subsequently be reused. Finally, segments are labelled. Since these tasks depend heavily on file formats, database types, meta-data formats and context in general, it is hard to offer a general solution. This means that, while the duplicate detection system is relatively user-friendly and ready to use, applying it still needs a software developer but not, and this is crucial, an MIR specialist.
In this paper we described possible applications of duplicate detection techniques and presented a practical solution for duplicate detection in an archive of digitised African field recordings. More specifically, applications were discussed to complement meta-data, to link or merge digital music archives, to improve listening experiences and to re-use segmentation data. In the case study on the archive of the Royal Museum for Central Africa we were able to show that duplicates can be successfully identified. We have shown that the meta-data in that archive differs significantly between original and duplicate. We have also shown that duplicate detection is robust to speed variations.
The archive used in the case study is probably very similar to many other archives of historic recordings, and similar results can be expected. In the case study we have shown that the acoustic fingerprinting software Panako is mature enough for practical application in the field today. We have also given practical instructions on how to use the software. It should be clear that all music archives can benefit from this technology, and we encourage archives to experiment with duplicate detection since only modest computing power is needed, even for large collections.
This chapter bundles four articles that are placed more towards the services region of the humanities-engineering plane depicted in Figure 1. They offer designs and implementations of tools to support certain research tasks. The four works bundled here are:
The first [174] describes a software library which originated as a set of pitch estimation algorithms for Tarsos [173]. Over the years more and more audio processing algorithms were added to the library and to highlight its inherent value it was separated from Tarsos and made into a reusable component. It was presented with a poster presentation at an Audio Engineering Society conference in London. At the conference TarsosDSP was acknowledged as a ‘reproducibility-enabling work’ which is in my view a requirement for valuable services to research communities. TarsosDSP was picked up in research [155] and in many interactive music software projects.
The second work [175] describes a novel acoustic fingerprinting algorithm. It can be placed in the services category since it offers a publicly verifiable implementation of this new algorithm together with several baseline algorithms. Next to the implementation, a reproducible evaluation methodology is described, and the code to run the evaluation is open as well; this can be seen as a second service to the community. This approach has been moderately successful, since Panako has already been used multiple times as a baseline to compare with other, newer systems [181]. It was presented at the ISMIR conference of 2014 in Taipei, Taiwan during a poster presentation.
The third work [176] included in this chapter is a journal article that describes a way to synchronize heterogeneous research data geared towards music and movement research. Next to the description there is also a verifiable implementation publicly available. It falls into the services category since synchronizing data before analysis is a problematic research task many researchers deal with. The method is especially applicable to research on the interaction between movement and music, since this type of research needs many different measurement devices and wearable sensors that are not easily synchronized. It has been used to sync, amongst others, the dataset of [51].
Finally, the fourth and last work [177] presents meta-data to a user synchronized with the music in their environment. Meta-data is defined broadly, meaning any type of additional information stream that enriches the listening experience (for example lyrics, light effects or video). This technology could also be employed as a service to support a research task: for example, if during an experiment dancers need to be presented with tactile stimuli, this technology could be used. Essentially it is an augmented reality technology or, more generally, a technology to establish a computer-mediated reality.
Frameworks or libraries
Name | Extr. | Synth | R-T | Tech
---|---|---|---|---
Aubio | True | False | True | C
CLAM | True | True | True | C
CSL | True | True | True | C++
Essentia | True | False | True | C++
Marsyas | True | True | False | C++
SndObj | True | True | True | C++
Sonic Visualizer | True | False | False | C++
STK | False | True | True | C++
Tartini | True | False | True | C++
YAAFE | True | False | False | C++
Beads | False | True | True | Java
JASS | False | True | True | Java
jAudio | True | False | False | Java
Jipes | True | False | False | Java
jMusic | False | True | False | Java
JSyn | False | True | True | Java
Minim | False | True | True | Java
TarsosDSP | True | True | True | Java
TarsosDSP also fills a need for educational tools for Music Information Retrieval. As identified by [67], there is a need for comprehensible, well-documented MIR frameworks which perform useful tasks on every platform, without the requirement of a costly software package like Matlab. TarsosDSP serves this educational goal; it has already been used by several master's students as a starting point into music information retrieval [11].
The framework tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration for beginning MIR-researchers on how audio processing works in practice. TarsosDSP therefore targets both students and more experienced researchers who want to make use of the implemented features.
After this introduction, a section about the design decisions follows; then the main features of TarsosDSP are highlighted. The fourth section is about the availability of the framework. The paper ends with a conclusion and future work.
To meet the goals stated in the introduction a couple of design decisions were made.
Java based. TarsosDSP was written in Java to allow portability from one platform to another. The automatic memory management facilities of Java are a great boon and allow a clean implementation of audio processing algorithms: the clutter introduced by memory management instructions and the platform-dependent ifdef's typically found in C++ implementations is avoided. The Dalvik Java runtime enables TarsosDSP's algorithms to run unmodified on the Android platform. Java or C++ libraries are often hard to use due to external dependencies; TarsosDSP has no external dependencies except for the standard Java Runtime. Java does have a serious drawback: it struggles to offer a low-latency audio pipeline. If real-time low latency is needed, the environment in which TarsosDSP operates needs to be optimized, e.g. by following the instructions by [85].
The processing pipeline is kept as simple as possible. Currently, only single-channel audio is allowed, which keeps the processing chain extremely straightforward. An AudioDispatcher chops incoming audio into blocks of a requested number of samples, with a defined overlap. Subsequently the blocks of audio are scaled to floats in the range [-1,1]. The wrapped blocks are encapsulated in an AudioEvent object which contains a pointer to the audio, the start time in seconds, and some auxiliary methods, e.g. to calculate the energy of the audio block. The AudioDispatcher sends the AudioEvent through a series of AudioProcessor objects, which each execute an operation on the audio. The core of the algorithms is contained in these AudioProcessor objects; they can e.g. estimate pitch or detect onsets in a block of audio. Note that the size of a block of audio can change during the processing flow, as is the case when a block of audio is stretched in time. For more examples of available AudioProcessor operations see section ?. Figure ? shows a processing pipeline: it shows how the dispatcher chops up audio and how the AudioProcessor objects are linked. Also interesting to note is line 8 of that example, where an anonymous inner class is declared to handle pitch estimation results. The example covers filtering, analysis, effects and playback; the last statement on line 23 bootstraps the whole process.
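Since the code figure referred to above is not reproduced here, the following minimal sketch illustrates such a pipeline: microphone input is dispatched in overlapping blocks and a pitch estimator reports its results through a handler. Package names and constructor signatures are taken from a recent TarsosDSP release and may differ slightly between versions.

```java
import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.io.jvm.AudioDispatcherFactory;
import be.tarsos.dsp.pitch.PitchDetectionHandler;
import be.tarsos.dsp.pitch.PitchDetectionResult;
import be.tarsos.dsp.pitch.PitchProcessor;
import be.tarsos.dsp.pitch.PitchProcessor.PitchEstimationAlgorithm;

public class PitchPipeline {
    public static void main(String[] args) throws Exception {
        int sampleRate = 44100, bufferSize = 2048, overlap = 1024;
        // The dispatcher chops microphone input into overlapping blocks of samples.
        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromDefaultMicrophone(sampleRate, bufferSize, overlap);
        // The handler receives a pitch estimate for every processed block.
        PitchDetectionHandler handler = (PitchDetectionResult result, AudioEvent event) -> {
            if (result.getPitch() != -1) {
                System.out.printf("%.2f s: %.1f Hz%n", event.getTimeStamp(), result.getPitch());
            }
        };
        dispatcher.addAudioProcessor(
                new PitchProcessor(PitchEstimationAlgorithm.YIN, sampleRate, bufferSize, handler));
        dispatcher.run(); // blocks; start on a separate thread in interactive applications
    }
}
```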
Optimizations. TarsosDSP serves an educational goal, therefore the implementations of the algorithms are kept as pure as possible and no obfuscating optimizations are made. Readability of the source code is put before its execution speed; if algorithms are not quick enough, users are invited to optimize the Java code themselves or to look for alternatives, perhaps in another programming language like C++. This is a rather unique feature of the TarsosDSP framework; other libraries take a different approach. jAudio [125] and YAAFE [120], for example, reuse calculations for feature extraction, which makes algorithms more efficient but also harder to grasp. Other libraries still, like SoundTouch, are heavily optimized for execution speed at the cost of readability.
Implemented features. In this section the main implemented features are highlighted. Next to the list below there are boiler-plate features, e.g. to adjust gain, write a WAV file, detect silence, follow an envelope or play back audio. Figure 22 shows a visualization of several features computed with TarsosDSP.
TarsosDSP was originally conceived as a library for pitch estimation and therefore contains several pitch estimators: YIN [49], MPM [127], AMDF [156].
Two onset detectors are provided: one described in [9], and the one used by the BeatRoot system [52].
The WSOLA time-stretch algorithm [205], which allows the speed of an audio stream to be altered without altering its pitch, is included. At moderate time-stretch factors - 80%-120% of the original speed - only limited audible artifacts are noticeable.
A resampling algorithm based on [179] and the related open-source resample software package is included.
A pitch-shifting algorithm, which allows the pitch of audio to be changed without affecting its speed, is formed by chaining the time-stretch algorithm with the resampling algorithm (see the sketch after this list).
As examples of audio effects, TarsosDSP contains a delay and a flanger effect. Both are implemented as minimalistically as possible.
Several IIR filters are included: a single-pass and a four-stage low-pass filter, a high-pass filter, and a band-pass filter.
TarsosDSP also allows audio synthesis and includes generators for sine waves and noise. Also included is a Low Frequency Oscillator (LFO) to control the amplitude of the resulting audio.
A spectrum can be calculated with the inevitable FFT or using the provided implementation of the Constant-Q transform [23].
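As a small illustration of the chaining mentioned in the pitch-shifting item above, the sketch below only shows the factor arithmetic involved; it deliberately avoids library calls and is not TarsosDSP code. Resampling by a factor changes pitch and duration together, and a WSOLA time stretch by the same factor restores the original duration.

```java
public final class PitchShiftFactors {

    /** Frequency ratio corresponding to a shift of the given number of semitones. */
    static double pitchFactor(double semitones) {
        return Math.pow(2.0, semitones / 12.0);
    }

    public static void main(String[] args) {
        double factor = pitchFactor(3); // shift up by three semitones, roughly 1.189
        // Speeding playback up by 'factor' raises the pitch by 'factor'
        // but divides the duration by the same amount ...
        double resampleRatio = factor;
        // ... so a WSOLA time stretch by 'factor' restores the original duration
        // without affecting the (already shifted) pitch.
        double timeStretchFactor = factor;
        System.out.printf("resample ratio %.3f, time-stretch factor %.3f%n",
                resampleRatio, timeStretchFactor);
    }
}
```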
To show the capabilities of the framework, seventeen example applications were built. Most are small programs with a simple user interface, showcasing one algorithm. They not only show which functionality is present in the framework, but also how to use it in other applications. There are example applications for time stretching, pitch shifting, pitch estimation, onset detection, and so forth. Figure ? shows an example application featuring the pitch-shifting algorithm.
TarsosDSP is used by Tarsos [173] a software tool to analyze and experiment with pitch organization in non-western music. It is an end-user application with a graphical user interface that leverages a lot of TarsosDSP’s features. It can be seen as a showcase for the framework.
The source code is available under the GPL license terms at GitHub:
https://github.com/JorenSix/TarsosDSP.
Contributions are more than welcome. TarsosDSP releases, the manual, and documentation can all be found at the release directory which is available at the following url:
http://0110.be/releases/TarsosDSP/.
Nightly builds can be found there as well. Other downloads, documentation on the example applications and background information are available on:
http://0110.be
Providing the source code under the GPL license makes sure that derivative works also need to provide the source code, which enables reproducibility.
In this paper TarsosDSP was presented: an open-source Java library for real-time audio processing without external dependencies. It allows real-time pitch and onset extraction, a unique feature in the Java ecosystem. It also contains algorithms for time stretching, pitch shifting, filtering, resampling, effects and synthesis. TarsosDSP serves an educational goal; therefore the algorithms are implemented as simply and self-contained as possible, using a straightforward pipeline. The library can be used on the Android platform, as a back-end for Java applications, or stand-alone by using one of the provided example applications. After two years of active development it has become a valuable addition to the MIR community.
The ability to identify a small piece of audio by comparing it with a large reference audio database has many practical use cases. This is generally known as audio fingerprinting or acoustic fingerprinting. An acoustic fingerprint is a condensed representation of an audio signal that can be used to reliably identify identical, or recognize similar, audio signals in a large set of reference audio. The general process of an acoustic fingerprinting system is depicted in Figure 23. Ideally, a fingerprinting system only needs a short audio fragment to find a match in a large set of reference audio. One of the challenges is to design the system in such a way that the reference database can grow to contain millions of entries. Another challenge is that a robust fingerprinting system should handle noise and other modifications well, while limiting the number of false positives and the processing time [32]. These modifications typically include dynamic range compression, equalization, added background noise and artifacts introduced by audio coders or A/D-D/A conversions.
Over the years several efficient acoustic fingerprinting methods have been introduced [209]. These methods perform well, even with degraded audio quality and with industrial-sized reference databases. However, these systems are not designed to handle queries with a modified time-scale or pitch, although these distortions can be present in replayed material. Changes in replay speed can occur either by accident during an analog-to-digital conversion or be introduced deliberately.
Accidental replay speed changes can occur when working with physical, analogue media. Large music archives often consist of wax cylinders, magnetic tapes and gramophone records. These media are sometimes digitized using an incorrect or varying playback speed. Even when calibrated mechanical devices are used in a digitization process, the media could already have been recorded at an undesirable or undocumented speed. A fingerprinting system should therefore allow changes in replay speed to correctly detect duplicates in such music archives.
Deliberate time-scale manipulations are sometimes introduced as well. During radio broadcasts, for example, songs are occasionally played a bit faster to make them fit into a time slot. During a DJ-set pitch-shifting and time-stretching are present almost continuously. To correctly identify audio in these cases as well, a fingerprinting system robust against pitch-shifting and time-stretching is desired.
Some fingerprinting systems have been developed that take pitch-shifts into account [58] without allowing time-scale modification. Others are designed to handle both pitch and time-scale modification [214]. The system by [214] employs an image processing algorithm on an auditory image to counter time-scale modification and pitch-shifts. Unfortunately, the system is computationally expensive: it iterates the whole database to find a match. The system by [118] allows extreme pitch-shifting and time-stretching, but has the same problem. To the best of our knowledge, a description of a practical acoustic fingerprinting system that allows substantial pitch-shifting and time-scale modification is nowhere to be found in the literature. This description is the main contribution of this paper.
The proposed method is inspired by three works. Combining key components of those works results in a design of a granular acoustic fingerprinter that is robust to noise and substantial compression, has a scalable method for fingerprint storage and matching, and allows time-scale modification and pitch-shifting.
Firstly, the method used by [209] establishes that local maxima in a time-frequency representation can be used to construct fingerprints that are robust to quantization effects, filtering, noise and substantial compression. The described exact-hashing method for storing and matching fingerprints has proven to be very scalable. Secondly, [4] describe a method to align performances and scores; especially interesting is the way triplets of events are used to search for performances with different timings. Thirdly, the method by [58] introduces the idea of extracting fingerprints from a Constant-Q transform [23], a time-frequency representation that has a constant number of bins for every octave. In their system a fingerprint remains constant when a pitch shift occurs. However, since time is encoded directly within the fingerprint, the method does not allow time-scale modification.
Considering previous works, the method presented here uses local maxima in a spectral representation. It combines three event points, and takes time ratios to form time-scale invariant fingerprints. It leverages the Constant-Q transform, and only stores frequency differences for pitch-shift invariance. The fingerprints are designed with an exact hashing matching algorithm in mind. Below each aspect is detailed.
Finding local maxima. Suppose a time-frequency representation of a signal is provided. To locate the points where the energy reaches a local maximum, a tiled two-dimensional peak-picking algorithm is applied. First the local maxima for each spectral analysis frame are identified. Next, each local maximum is put in the centre of a tile of dimensions $\Delta t \times \Delta f$; if the local maximum is also the maximum within the tile it is kept, otherwise it is discarded. This makes sure only one point is identified for every tile of $\Delta t \times \Delta f$. The approach is similar to the one by [58]. This results in a list of event points, each with a frequency component $f$, expressed in bins, and a time component $t$, expressed in time steps. $\Delta t$ and $\Delta f$ are chosen so that the number of event points per second stays within a fixed range.
A spectral representation of an audio signal has a certain granularity: it is essentially a grid with bins both in time and in frequency. When an audio signal is modified, the energy that was originally located in one single bin can be smeared over two or more bins. This poses a problem, since the goal is to locate the event points with maximum energy reliably. To improve reliability, a post-processing step refines the location of each event point by taking its energy and mixing it with the energy of the surrounding bins; the same is done for the surrounding bins. If a new maximum is found in the surroundings of the initial event point, the event point is relocated accordingly. Effectively, a small rectangular blur kernel is applied at each event point and its surrounding bins.
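A minimal sketch of the tiled peak picking described above is given below. The tile dimensions are plain parameters here; the values actually used by the system are not assumed.

```java
import java.util.ArrayList;
import java.util.List;

public final class TiledPeakPicking {

    /** An event point: a time step index and a frequency bin index. */
    record EventPoint(int t, int f) {}

    /**
     * @param mag magnitude spectrogram indexed as mag[timeStep][frequencyBin]
     * @param dT  tile width in time steps (illustrative parameter)
     * @param dF  tile height in frequency bins (illustrative parameter)
     */
    static List<EventPoint> pick(float[][] mag, int dT, int dF) {
        List<EventPoint> points = new ArrayList<>();
        for (int t = 0; t < mag.length; t++) {
            for (int f = 1; f < mag[t].length - 1; f++) {
                // Local maximum within its own analysis frame.
                if (mag[t][f] <= mag[t][f - 1] || mag[t][f] <= mag[t][f + 1]) continue;
                // Keep it only if it is also the maximum of the dT x dF tile centred on it.
                boolean tileMax = true;
                for (int ti = Math.max(0, t - dT / 2);
                     tileMax && ti <= Math.min(mag.length - 1, t + dT / 2); ti++) {
                    for (int fi = Math.max(0, f - dF / 2);
                         fi <= Math.min(mag[ti].length - 1, f + dF / 2); fi++) {
                        if (mag[ti][fi] > mag[t][f]) { tileMax = false; break; }
                    }
                }
                if (tileMax) points.add(new EventPoint(t, f));
            }
        }
        return points;
    }
}
```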
Once the event points with locally maximal energy are identified, the next step is to combine them to form a fingerprint. A fingerprint consists of three event points, as seen in Figure 24. To construct a fingerprint, each event point is combined with two nearby event points, and each event point can be part of multiple fingerprints. Only a limited number of fingerprints is kept every second; the fingerprints whose event points have the least cumulative energy are discarded. Now that a list of fingerprints has been created, a method to encode time information in a fingerprint hash is needed.
Handling time stretching: event triplets. Figure 24 shows the effect of time stretching on points in the time-frequency domain. There, a fingerprint extracted from reference audio (Figure 24, red, triangle) is compared with a fingerprint from time-stretched audio (Figure 24, orange, full circle). Both fingerprints are constructed using three local maxima $e_1$, $e_2$ and $e_3$. While the frequency components stay the same, the time components do change. However, the ratios between the time differences remain constant; the following equation holds:

$$\frac{t_2 - t_1}{t_3 - t_1} = \frac{t'_2 - t'_1}{t'_3 - t'_1}$$

with event point $e_1$ having a time and frequency component $(t_1, f_1)$, the corresponding event points $e_2$ and $e_3$ having the components $(t_2, f_2)$ and $(t_3, f_3)$, and the primed values belonging to the time-stretched version. Since $t_1 < t_2 < t_3$, the ratio always resolves to a number in the range $[0,1]$. This number, scaled and rounded, is a component of the eventual fingerprint hash (an approach similar to [4]).
Now that a way to encode time information, indifferent to time stretching, has been found, a method to encode frequency, indifferent to pitch shifting, is desired.
Handling pitch-shifts: Constant-Q transform. Figure 24 compares a fingerprint from pitch-shifted audio (blue, clear circle) with a fingerprint from reference audio (red, triangle). In the time-frequency domain a pitch shift is a vertical translation and time information is preserved. Since every octave has the same number of bins [23], a pitch shift on event point $e_1$ has the following effect on its frequency component: $f'_1 = f_1 + K$, with $K$ a constant number of bins. It is clear that the difference between the frequency components remains the same before and after pitch shifting: $f_1 - f_2 = f'_1 - f'_2$ [58]. Since in the proposed system three event points are available, the following information is stored in the fingerprint hash: the frequency differences $f_1 - f_2$ and $f_2 - f_3$, the time ratio $\frac{t_2 - t_1}{t_3 - t_1}$, and two coarse frequency locations $\tilde{f}_1$ and $\tilde{f}_3$.
The last two elements, $\tilde{f}_1$ and $\tilde{f}_3$, are sufficiently coarse locations of the first and third frequency component. They are determined by the index of the frequency band they fall into after dividing the spectrum into eight bands. They give the hash more discriminative power, but they also limit how much the audio can be pitch-shifted while maintaining the same fingerprint hash.
Handling time-scale modification. Figure 24 compares a fingerprint of reference audio (red, triangle) with a fingerprint from the same audio that has been sped up (green, x). The figure makes clear that a speed change is a combination of time stretching and pitch shifting. Since both are handled by the previous measures, no extra precautions need to be taken. The next step is to combine these properties into a fingerprint that is efficient to store and match.
Fingerprint hash. A fingerprint with a corresponding hash needs to be constructed carefully to maintain the aforementioned properties. The result of a query should report the amount of pitch shift and time stretching that occurred. To that end, the absolute values of $t_1$ and $f_1$ are stored, so they can be compared with $t'_1$ and $f'_1$ from the query. The time offset at which a match was found should be returned as well, so $t_1$ needs to be stored in any case. The complete information to store for each fingerprint is $(\text{hash}; t_1; f_1; \text{audio identifier})$. The hash, the first element between brackets, can be packed into a single integer. To save space, $t_1$ and $f_1$ can be combined into a second integer, and the reference audio identifier forms a third, so a complete fingerprint consists of only three integers. At eight fingerprints per second, a song of four minutes is reduced to roughly two thousand such fingerprints, and even an industrial-size data set of one million songs translates to a manageable amount of storage.
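The sketch below shows one possible way to pack the invariant components described above into a single integer hash. The field widths and the coarse-band computation are assumptions made for the illustration; they do not reflect the exact bit layout used by Panako.

```java
public final class FingerprintHash {

    /**
     * Packs the invariant components of a three-point fingerprint into one int.
     * t1..t3 are time steps, f1..f3 are Constant-Q bin indices (0..127 assumed here).
     */
    static int pack(int t1, int t2, int t3, int f1, int f2, int f3) {
        int df12 = (f1 - f2) & 0xFF;   // frequency difference, pitch-shift invariant (8 bits)
        int df23 = (f2 - f3) & 0xFF;   // second frequency difference (8 bits)
        // Time ratio in [0,1], scaled to 6 bits: invariant under time stretching.
        int ratio = (int) Math.round(63.0 * (t2 - t1) / (double) (t3 - t1)) & 0x3F;
        int band1 = (f1 * 8 / 128) & 0x07; // coarse band of f1, spectrum split into 8 bands
        int band3 = (f3 * 8 / 128) & 0x07; // coarse band of f3
        return (df12 << 20) | (df23 << 12) | (ratio << 6) | (band1 << 3) | band3;
    }
}
```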
Matching algorithm The matching algorithm is inspired by [209], but is heavily modified to allow time stretched and pitch-shifted matches. It follows the scheme in Figure 23 and has seven steps.
Local maxima are extracted from a Constant-Q spectrogram of the query. The local maxima are combined in groups of three to form fingerprints, as explained in Sections ?, ? and ?.
For each fingerprint a corresponding hash value is calculated, as explained in Section ?.
The set of hashes is matched with the hashes stored in the reference database, and each exact match is returned.
The matches are iterated while counting how many times each individual audio identifier occurs in the result set.
Matches with an audio identifier count lower than a certain threshold are removed, effectively dismissing random chance hits. In practice there is almost always only one item with a lot of matches, the rest being random chance hits. A threshold of three or four suffices.
The residual matches are checked for alignment, both in frequency and time, with the reference fingerprints using the information that is stored along with the hash.
A list of audio identifiers is returned, ordered by the number of fingerprints that align both in frequency and in time.
In step six, frequency alignment is checked by comparing the $f_1$ component of the stored reference with $f'_1$, the frequency component of the query. If, for each match, the difference between $f_1$ and $f'_1$ is constant, the matches align.
Alignment in time is checked using the reference time information $t_1$ and $t_3$ and the time information of the corresponding fingerprint extracted from the query fragment, $t'_1$ and $t'_3$. For each matching fingerprint a time offset is calculated; it resolves to the number of time steps between the beginning of the query and the beginning of the reference audio, even if a time modification took place. It stands to reason that this offset is constant for matching audio.
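A sketch of this scale-aware alignment check is given below. It assumes that every hash match carries the reference and query times of the first and third event point; grouping matches on a rounded offset and scale is one plausible way to count aligned matches, not necessarily the exact bookkeeping used by the implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class AlignmentCheck {

    /** A hash match: reference item, reference times t1/t3 and query times q1/q3 (time steps). */
    record Match(int refId, int t1, int t3, int q1, int q3) {}

    /** Counts, per reference item, offset and scale, how many matches agree. */
    static Map<String, Integer> alignedCounts(List<Match> matches) {
        Map<String, Integer> counts = new HashMap<>();
        for (Match m : matches) {
            double scale = (m.t3() - m.t1()) / (double) (m.q3() - m.q1()); // time-scale factor
            double offset = m.t1() - scale * m.q1(); // where the query starts in the reference
            String key = m.refId() + ":" + Math.round(offset) + ":" + Math.round(scale * 100);
            counts.merge(key, 1, Integer::sum);
        }
        return counts; // a true match shows up as a single key with a high count
    }
}
```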
The matching algorithm also provides information about the query. The time offset tells at which point in time the query starts in the reference audio. The time difference ratio represents how much time is modified, in percent. How much the query is pitch-shifted with respect to the reference audio can be deduced from the difference in frequency bins $\Delta f = f_1 - f'_1$. To convert a difference in frequency bins to a percentage the following relation can be used, with $c$ the number of cents per bin, $e$ Euler's number and $\ln$ the natural logarithm: the frequency ratio equals $e^{(\Delta f \times c \times \ln 2)/1200}$, which, multiplied by one hundred, yields a percentage.
The matching algorithm ensures that random chance hits are very uncommon; the number of false positives can be effectively reduced to zero by setting a threshold on the number of aligned matches. The matching algorithm also provides the query time offset and the percentage of pitch-shift and time-scale modification of the query with respect to the reference audio.
To test the system, it was implemented in the Java programming language. The implementation is called Panako and is available under the GNU Affero General Public License on http://panako.be. The DSP is also done in Java, using the DSP library by [174]. To store and retrieve hashes, Panako uses a key-value store. Kyoto Cabinet, BerkeleyDB, Redis, LevelDB, RocksDB, Voldemort and MapDB were considered. MapDB is an implementation of a storage-backed B-Tree with efficient concurrent operations [100] and was chosen for its simplicity, performance and good Java integration. Also, the storage overhead introduced when storing fingerprints on disk is minimal. Panako is compared with Audfprint by Dan Ellis, an implementation of a fingerprinting system based on [209].
The test data set consists of freely available music downloaded from Jamendo.
Each fragment is presented to both Panako and Audfprint and the detection results are recorded. The systems are regarded as binary classifiers for which the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) are counted. During the experiment with Panako no false positives were detected (FP = 0). Also, all fragments that are not present in the reference database were rejected correctly as true negatives. So Panako's specificity, TN/(TN + FP), is 100%. This can be explained by the design of the matching algorithm: a match is identified as such if a number of hashes, each consisting of three points in a spectrogram, align in time. A random match between hashes is rare, and the chance of a random match between consecutively aligned hashes is almost non-existent, resulting in a 100% specificity.
The sensitivity of the system, however, depends on the type of modification of the fragment. Figure 25 shows the results after pitch shifting. It is clear that the amount of pitch shift affects the performance, but in a fluctuating pattern. The effect can be explained by taking the Constant-Q bins into account: each bin spans a fixed number of cents, and a pitch shift of an odd multiple of half a bin spreads spectral information over two bins, so performance is expected to degrade severely at exactly those shifts, an effect clearly visible in Figure 25. The figure also shows that performance is better if longer fragments are presented to the system. The performance of Audfprint, however, does not recover after pitch shifts of more than three percent.
Figure 26 shows the results after time stretching. Due to the granularity of the time bins, and considering that the step size stays the same for each query type, time modifications have a negative effect on the performance. Still, more than a third of the queries are resolved correctly after a time-stretching modification of 8%. Performance improves with the length of the fragment. Surprisingly, Audfprint is rather robust against time stretching, thanks to the way time is encoded in its fingerprints.
Figure 27 shows the results after time-scale modification. The performance decreases severely above eight percent. The figure shows that there is some improvement when comparing the results for 20s fragments with those for 40s fragments, but going from 40s to 60s does not change much. Audfprint is unable to cope with time-scale modification due to the changes in both frequency and time.
In Figure 28, the results for other modifications like echo, chorus, flanger, tremolo and a band-pass filter can be seen. The parameters of each effect are chosen to represent typical use, but on the heavy side. For example, the echo effect applied has a delay line of 0.5 seconds and a decay of 30%. The system has the most problems with the chorus effect: chorus has a blurring effect on the spectrogram, which makes it hard for the system to find matches. Still, it can be said that the algorithm is rather robust against very present, clearly audible, commonly used audio effects. The result for the band-pass filter with a centre frequency of 2000Hz is especially good. To test the system's robustness against severe audio compression, a test was executed with GSM-compressed queries. The performance on 20s fragments is about 30%, but it improves a lot with query length; 60s fragments yield 65%. The results for Audfprint show that there is room for improvement in the performance of Panako.
A practical fingerprinting system performs well, in terms of speed, on commodity hardware. With Panako, extracting and storing fingerprints for 25s of audio is done in one second using a single core of a dated processor.
Failure analysis shows that the system does not perform well on music with a spectrogram that contains either very little energy or energy spread evenly across the whole range. Extremely repetitive music, with a spectrogram similar to a series of Dirac impulses, is also problematic. Furthermore, performance drops when time modifications of more than 8% are present. This could be partially alleviated by redesigning the time parameters used in the fingerprint hash, but this would reduce the discriminative power of the hash.
In this paper a practical acoustic fingerprinting system was presented. The system allows fast and reliable identification of small audio fragments in a large set of audio, even when the fragment has been pitch-shifted and time-stretched with respect to the reference audio. If a match is found, the system reports where in the reference audio the query matches and how much the time and frequency have been modified. To achieve this, the system uses local maxima in a Constant-Q spectrogram. It combines event points into groups of three and uses time ratios to form a time-scale invariant fingerprint component. To form pitch-shift invariant fingerprint components, only frequency differences are stored. For retrieval, an exact-hashing matching algorithm is used.
The system has been evaluated using a freely available data set of 30,000 songs and compared with a baseline system. The results can be reproduced entirely using this data set and the open-source implementation of Panako; the scripts to run the experiment are available as well. The results show that the system's performance decreases with time-scale modifications of more than eight percent. The system is shown to cope with pitch-shifting, time-stretching, severe compression and other modifications such as echo, flanger and band-pass filtering.
To improve the system further, the Constant-Q transform could be replaced by an efficient implementation of the non-stationary Gabor transform. This is expected to improve the extraction of event points and fingerprints without affecting performance. Panako could also benefit from a more extensive evaluation and a detailed comparison with other techniques. An analysis of the minimum, most discriminative, information needed for retrieval purposes could be especially interesting.
During the past decades there has been a growing interest in the relation between music and movement; an overview of ongoing research is given by [63]. This type of research often entails the analysis of data from various (motion) sensors combined with multi-track audio and video recordings. These multi-modal signals need to be synchronized reliably and precisely to allow successful analysis, especially when aspects of musical timing are under scrutiny.
Synchronization of heterogeneous sources poses a problem due to large variability in sample rates and due to the latencies introduced by each recording modality. For example, it could be the case that accelerometer data, sampled at 100Hz, needs to be synchronized with multi-track audio recorded at 48kHz and with two video streams recorded using webcams at 30 frames per second.
Several methods have been proposed to address synchronization problems when recording multi-modal signals. The most straightforward approach is to route a master clock signal through each device and synchronize using this pulse. [84] show a system where an SMPTE signal serves as a clock for video cameras and other sensors as well. In a system by [78], a clock signal generated with an audio card is used to synchronize OSC, MIDI, serial data and audio. A drawback of this approach is that every recording modality needs to be fitted with a clock signal input. When working with video this means that expensive cameras are needed that are able to control shutter timing via a sync port. Generally available webcams do not have such functionality. The EyesWeb system [28] has similar preconditions.
Another approach is to use instantaneous synchronization markers in the data streams. In an audio stream such a marker could be a hand clap; in a video stream a bright flash could be used. These markers are subsequently employed to calculate timing offsets and synchronize streams, either by hand or assisted by custom software. This method does not scale well to multiple sensor streams and does not cope well with drift or dropped samples. Some types of sensor streams are hard to manipulate with markers, e.g. ECG recordings, which prohibits the use of this method. Although the method has drawbacks, it can be put to use effectively in controlled environments, as is shown by [66].
In this article a novel low-cost approach is proposed to synchronize streams by embedding ambient audio into each stream. With the stream and ambient audio being recorded synchronously, the problem of mutual synchronization between streams is effectively reduced to audio-to-audio alignment. As a second contribution of this paper, a robust, computationally efficient audio-to-audio alignment algorithm is introduced. The algorithm extends audio fingerprinting techniques with a cross-covariance step. It offers precise and reliable synchronization of audio streams of varying quality. The algorithm proposes a synchronization solution even if drift is present or when samples are dropped in one of the streams.
There are several requirements for the audio-to-audio alignment algorithm. First and foremost, it should offer reliable and precise time offsets to align multiple audio streams. Offsets should be provided not only at the start of the stream but continuously, in order to spot drifting behavior or dropped samples. The algorithm needs to be computationally efficient so it can handle multiple recordings that are potentially several hours long. Highly varying signal quality should not pose a problem for the alignment algorithm: it should be designed to reliably match a stream sampled at a low bit rate and sample rate with a high-fidelity signal.
The requirements are similar to those of acoustic fingerprinting algorithms. An acoustic fingerprinting algorithm uses condensed representations of audio signals, acoustic fingerprints, to identify short audio fragments in large audio databases. A robust fingerprinting algorithm generates similar fingerprints for perceptually similar audio signals, even if there is a large difference in quality between the signals. [209] describes such an algorithm. Wang's algorithm is able to recognize short audio fragments reliably even if the audio has been subjected to modifications like dynamic range compression, equalization, added background noise and artifacts introduced by audio coders or A/D-D/A conversions. The algorithm is computationally efficient, relatively easy to implement and yields precise time-offsets. All these features combined make it a good candidate for audio-to-audio alignment. This has been recognized by others as well, since variations on the algorithm have been used to identify multiple videos of an event [47] and to identify repeating acoustic events in long audio recordings [138]. The patent application for the algorithm by [210] mentions that it could be used for precise synchronization, but it does not detail how this could be done. Below, a novel post-processing step to achieve precise synchronization is proposed.
Another approach to audio-to-audio synchronization is described by [165]. Their algorithm offers an accuracy of ±11ms in the best case and does not cope with drift or dropped samples, two aspects that the algorithm proposed below improves upon.
The algorithm works as follows. First, the audio is transformed to the time-frequency domain, in which peaks are extracted. Each peak has a frequency component f and a time component t. The frequency component is expressed as an FFT bin index and the time component as an analysis frame index. The peaks are chosen to be spaced evenly over the time-frequency plane. Two nearby peaks are combined to form one fingerprint, as shown in Figure 29. The fingerprints of the reference audio are stored in a hashtable, with the key being a hash of the two frequency components f1 and f2 and the time difference Δt between the peaks. Also stored in the hashtable, along with the fingerprint, is t1, the absolute time of the first peak. Further details can be found in [209].
For audio that needs to be synchronized with a reference, fingerprints are extracted and hashed in exactly the same way. Subsequently, the hashes that match hashes in the reference hashtable are identified. For each matching hash, a time offset is calculated between the query and the reference audio, using the absolute time t1 stored in the hashtable. If query and reference audio match, an identical time offset will appear multiple times; random chance matches do not have this property. If a match is found, the time offset is reported. The matching step is visualized in Figure 30, in which several matching fingerprints are found. For two of those matching fingerprints the time offset is indicated using the dotted lines. The figure also makes clear that the method is robust to noise or audio that is only present in one of the streams. Since only a few fingerprints need to match with the same offset, the other fingerprints - introduced by noise or other sources - can be safely discarded (the red fingerprints in Figure 30).
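The core of this matching step can be sketched in a few lines of Java: each matching hash votes for an offset, and an offset with enough votes is reported. The data structures and names below are illustrative assumptions, not the actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the matching step: hashes vote for a time offset (in analysis frames).
// The index layout is an illustrative assumption.
final class MatchSketch {
    // referenceIndex: hash -> absolute frame times t1 of the reference fingerprints
    // queryFingerprints: hash -> frame times at which the hash occurs in the query
    static int bestOffset(Map<Integer, List<Integer>> referenceIndex,
                          Map<Integer, List<Integer>> queryFingerprints,
                          int minimumHits) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> q : queryFingerprints.entrySet()) {
            List<Integer> refTimes = referenceIndex.get(q.getKey());
            if (refTimes == null) continue; // hash not present in the reference
            for (int queryTime : q.getValue()) {
                for (int refTime : refTimes) {
                    int offset = refTime - queryTime; // identical for true matches
                    votes.merge(offset, 1, Integer::sum);
                }
            }
        }
        // Report the offset with the most votes, if it is supported well enough.
        return votes.entrySet().stream()
                .filter(e -> e.getValue() >= minimumHits)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no match found"));
    }
}
```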
In this algorithm time offsets are expressed using the analysis frame index. The time resolution of such an audio frame is not sufficient for precise synchronization, so the coarse offset is subsequently refined with a cross-covariance calculation on the raw audio blocks.
The cross-covariance calculation step is not efficient, so it should be done for as few audio blocks as possible. Since it is known beforehand at which audio block indices similar audio is present in the reference and the query - the blocks with matching fingerprint offsets - the calculation can be limited to only those blocks. The number of cross-covariance calculations can be further reduced by only calculating covariances until agreement is reached.
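As an illustration, a deliberately naive cross-covariance over two matched audio blocks could look as follows; the lag with the highest covariance refines the block-level offset to sample precision. This is a textbook sketch, not the exact code used in the implementation.

```java
// Sketch: refine a coarse (block-level) offset to sample precision by finding
// the lag that maximizes the cross-covariance of two matched audio blocks.
final class CrossCovarianceSketch {
    static int bestLag(float[] reference, float[] query) {
        double refMean = mean(reference);
        double queryMean = mean(query);
        int bestLag = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        int n = reference.length;
        for (int lag = 0; lag < n; lag++) {
            double sum = 0;
            for (int i = 0; i < n; i++) {
                // a circular lag keeps the sketch simple
                sum += (reference[i] - refMean) * (query[(i + lag) % n] - queryMean);
            }
            if (sum > bestValue) { bestValue = sum; bestLag = lag; }
        }
        return bestLag; // in samples, relative to the start of the blocks
    }

    private static double mean(float[] x) {
        double s = 0;
        for (float v : x) s += v;
        return s / x.length;
    }
}
```

The quadratic cost in the block length is exactly why the computation is restricted to blocks that the fingerprints have already matched.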
Until now no precautions have been taken to deal with drift or dropped samples. To deal with drift, the algorithm above is expanded to allow multiple time offsets between the audio that needs to be synchronized and a reference. While iterating over the fingerprint hash matches, a list of matching fingerprints is kept for each offset. If the list of matching fingerprints for an offset reaches a certain threshold, the corresponding offset is counted as valid. Drift can then be identified since many gradually increasing or decreasing offsets will be reported. If samples are dropped, two distinct offsets will be reported. The time at which samples are dropped or drift occurs is found in the list of matching fingerprints for an offset: the time information of the first and last fingerprint match marks the beginning and end of a sequence of detected matches.
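One possible way to bookkeep these multiple offsets is sketched below; the threshold, names and data structures are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: keep a list of fingerprint match times per offset. Offsets with enough
// support are considered valid; several gradually changing valid offsets indicate
// drift, two distinct valid offsets indicate dropped samples.
final class OffsetTrackerSketch {
    private final Map<Integer, List<Integer>> matchesPerOffset = new TreeMap<>();
    private final int threshold; // minimum number of matches for a valid offset

    OffsetTrackerSketch(int threshold) { this.threshold = threshold; }

    void addMatch(int offsetInFrames, int queryTimeInFrames) {
        matchesPerOffset.computeIfAbsent(offsetInFrames, k -> new ArrayList<>())
                        .add(queryTimeInFrames);
    }

    // Valid offsets, each with the query time range over which it was observed.
    List<int[]> validOffsets() {
        List<int[]> result = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : matchesPerOffset.entrySet()) {
            List<Integer> times = e.getValue();
            if (times.size() >= threshold) {
                int first = times.get(0);
                int last = times.get(times.size() - 1);
                result.add(new int[] { e.getKey(), first, last });
            }
        }
        return result; // more than one entry suggests drift or dropped samples
    }
}
```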
With a working audio-to-audio alignment strategy in place, a setup for multimodal recording should include an audio stream for each sensor stream. While building a setup it is of utmost importance that the sensor stream and the corresponding audio stream are kept in sync. For a video recording this means that AV-sync needs to be guaranteed. To make sure that analog sensors are correctly synchronized with audio, the use of a data acquisition module is advised. Generally these modules are able to sample data at sufficiently high sampling rates and precision. The main advantage is that such a module generally has many analog input ports, and by recording the audio via the same path it is guaranteed to be in sync with the sensor streams. In the setup detailed below (Section ?), for example, a USB data acquisition module with 16 inputs, 16 bit resolution and a maximum sample rate of 200kHz is used.
To accommodate data streams with an inherently low sampling rate - 8kHz or less - a method can be devised that does not sample the audio at the same rate as the data stream. In the setup detailed below, an accelerometer (Section ?) is sampled at 345Hz by dividing the audio into blocks of 128 samples and measuring one acceleration value per block of 128 audio samples. Since the audio is sampled at 44.1kHz, the data is indeed sampled at 44100 Hz / 128 ≈ 345 Hz. Depending on the type of data stream and audio recording device, other sampling rates can be achieved by using other divisors. Conversely, high data sampling rates can be accommodated by using a multiplier instead of a divisor.
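As a small illustration of this bookkeeping, the sketch below computes the effective data rate and the audio-aligned timestamp of each data sample from its block index; the numbers follow the setup described here, the class itself is a hypothetical helper.

```java
// Sketch: timestamps for a data stream sampled once per audio block.
final class BlockTimestampSketch {
    static final double AUDIO_SAMPLE_RATE = 44100.0;
    static final int BLOCK_SIZE = 128; // audio samples per data sample

    // effective data rate: 44100 / 128 ≈ 344.5 Hz (rounded to 345 Hz in the text)
    static double dataSampleRate() { return AUDIO_SAMPLE_RATE / BLOCK_SIZE; }

    // time (in seconds) of the n-th data sample, aligned with the audio stream
    static double timestamp(int dataSampleIndex) {
        return dataSampleIndex * BLOCK_SIZE / AUDIO_SAMPLE_RATE;
    }
}
```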
Note that during such a recording session a lot of freedom concerning the workflow is gained. While one stream is recording continuously, other recording modalities can be stopped and started without affecting synchronization. The order in which recording devices are started also has no effect on synchronization. This is in stark contrast with other synchronization methods and is illustrated in Figure 31.
Once all data is synchronized, analysis can take place. If video analysis is needed, a tool like ELAN [212] can be used. If audio and sensor streams are combined without video, Sonic Visualizer [31] is helpful to check mutual alignment. To store and share multimodal data, RepoVIZZ [122] is useful.
To test the audio-to-audio alignment algorithm, it was implemented in Java, together with a user-friendly interface called SyncSink.
To measure the accuracy of the time offsets reported by the algorithm, the following experiment was done. For each audio file in a dataset a random snippet of ten seconds is copied. The ten seconds of audio is stored together with an accurate representation of the offset at which it starts in the reference audio. Subsequently the snippet is aligned with the reference audio and the actual offset is compared with the offset reported by the algorithm. To make the task more realistic the snippet is GSM 06.10 encoded. The GSM 06.10 encoding is a low-quality 8kHz, 8-bit encoding. This degradation is done to ensure that the algorithm reports precise time offsets even when high-fidelity signals - the reference audio files - are aligned with low-quality audio - the snippets.
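The evaluation loop can be summarized in a short sketch; the align() call is a placeholder for the alignment algorithm and the GSM encoding step is omitted, so this is an outline of the procedure rather than the actual evaluation code.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the evaluation loop: cut a random 10 s snippet from a reference file,
// align it and compare the reported offset with the known true offset.
final class EvaluationSketch {
    static double errorInMs(float[] reference, double sampleRate, Random random) {
        int snippetLength = (int) (10 * sampleRate);
        int trueStart = random.nextInt(reference.length - snippetLength);
        float[] snippet = Arrays.copyOfRange(reference, trueStart, trueStart + snippetLength);
        // In the real experiment the snippet is additionally GSM 06.10 encoded here.
        double reportedOffsetSeconds = align(reference, snippet); // placeholder
        double trueOffsetSeconds = trueStart / sampleRate;
        return Math.abs(reportedOffsetSeconds - trueOffsetSeconds) * 1000.0;
    }

    private static double align(float[] reference, float[] query) {
        throw new UnsupportedOperationException("stand-in for the alignment algorithm");
    }
}
```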
The procedure described above was executed for a thousand snippets of ten seconds. For 17 snippets an incorrect offset was found due to identically repeating audio; it could be said that these offsets yield an alternative, but equally correct, alignment. For 10 snippets no alignment was found. For the remaining 973 snippets the offsets were on average 1.01ms off, with a standard deviation of 2.2ms. The worst case of 16ms (128/8000Hz) was reported once. Note that a delay in audio of 1-14ms affects spatial localization, a 15-34ms delay creates a chorus/flanger-like effect, and starting from 35ms discrete echoes can be heard.
To get an idea of how quickly the algorithm returns a precise offset, the runtime was measured. Four reference audio files were created, each with a duration of one hour and each with a corresponding query file. The queries consist of the same audio but GSM encoded and with a small time offset. A query performance of on average 81 times real-time is reached on modest computing hardware.
This method was used in a study with dementia patients. The study aimed at measuring how well participants can synchronize to a musical stimulus. A schematic of the system can be found in Figure 33. The system has the following components:
A balance board equipped with an analog pressure sensor at each corner.
Two HD-webcams (Logitech C920), recording the balance board and ambient audio using the internal microphones.
An electret microphone (CMA-4544PF-W) with amplifier (MAX4466) circuit.
A data acquisition module with analog inputs. Here an Advantech USB4716 DAQ was used. It has 16 single-ended inputs with 16-bit resolution and is able to sample up to 200 kHz.
A wearable microcontroller with an electret microphone (CMA-4544PF-W), a MicroSD-card slot and an analog accelerometer (MMA7260Q) attached to it. Here we used the Teensy 3.1 with audio shield. It runs at 96MHz and has enough memory and processing power to handle audio sampled at 44.1kHz in real-time. The microcontroller can be seen in Figure 34.
The microcontroller shown in Figure 34 was programmed to stream audio data sampled at 44.1kHz to the SD-card in blocks of 128 samples. Using the same processing pipeline, the instantaneous acceleration was measured for each block of audio. This makes sure that the measurements and audio stay in sync even if there is a temporary bottleneck in the pipeline. During the recording session this proved to be of value due to a slow micro SD-card.
Once all data was transferred to a central location, mutual time offsets were calculated automatically. Subsequently the files were trimmed in order to synchronize them. In practice this means chopping off a part of the video or audio file (using a tool like ffmpeg) and modifying the data files accordingly. The data of this recording session and the software used are available at http://0110.be/syncsink. The next step is to analyse the data with tools like ELAN [212] or Sonic Visualizer [31] (Figure 35 and Figure 36) or any other tool. The analysis itself falls outside the scope of this paper.
Since ambient audio is used as a synchronization clock signal, the speed of sound needs to be taken into account. If microphones are spread out over a room, the physical latency quickly adds up to 10ms (for 3.4m) or more. If microphone placement is fixed this can be compensated for. If microphone placement is not fixed, it should be determined whether the precision the proposed method can provide is sufficient for the measurement.
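The magnitude of this acoustic latency follows directly from the speed of sound, roughly 343 m/s at room temperature: t = d / v ≈ 3.4 m / 343 m/s ≈ 9.9 ms, which is where the 10ms figure above comes from.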
The proposed method should not be seen as a universal synchronization solution, but it provides a method that can fit some workflows. Combining audio-to-audio alignment with other synchronization methods is of course possible. If, for example, motion capture needs to be synchronized with other sources, the motion capture clock pulse can be routed through a device that records the clock together with ambient audio, making it possible to sync with other modalities. The same could be done for an EEG system and its clock. This setup would make it possible to sync EEG with motion capture data, an otherwise difficult task. Combining the method with other synchronization approaches - e.g. synchronization markers - is also possible.
The current system is presented as a post-processing step, but if the latencies of each recording system are relatively stable there is potential to use the approach in real time. It would work as follows: each recording device starts streaming both audio and data. After about ten seconds, audio-to-audio alignment can be done and the mutual offsets can be determined. Once this information is known, the sensor data can be buffered and released in a timely manner to form one fused, synchronized sensor data stream. The overall latency of the stream is then at best equal to that of the recording modality with the largest latency. While streaming data, the audio-to-audio alignment should be repeated periodically to check or adapt the offsets of the sensor streams.
An efficient audio-to-audio alignment algorithm was presented and used effectively to synchronize recordings and linked data streams. The algorithm is based on audio fingerprinting techniques. It finds a rough offset using fingerprints and subsequently refines the offset with a cross-covariance step. During synthetic benchmarks an average synchronization accuracy of 1.1ms was reached with a standard deviation of 2.2ms. A query performance of 81 times real-time is reached on modest computing hardware when synchronizing two streams. A case study showed how the method is used in research practice: recordings from two webcams and a balance board, together with acceleration data recorded using a microcontroller, were all synchronized reliably and precisely.
The ability to identify which music is playing in the environment of a user has several use cases. After a successful recognition, meta-data about the music is immediately available: artist, title, album. More indirect information can also be made available: related artists, upcoming concerts by the artist or where to buy the music. Such systems have been in use for more than a decade now.
A system that is able to not only recognize the music but also determine a sufficiently precise playback time opens new possibilities. It would allow lyrics, scores or tablature to be shown in sync with the music. If the time resolution is fine enough, it would even allow music videos to be played in sync with the environment. In this work a design of such a system is proposed. The paper focuses on yet another type of time-dependent contextual data: beat lists. A prototype was developed that provides feedback exactly on the beat, for the following three reasons:
For its inherent value. Humans are generally able to track musical beat and rhythm, and synchronizing movement with perceived beats is a process that is natural to most; both abilities develop during early childhood [73]. However, some people are unable to follow a musical beat. They fall into two categories. The first category consists of people with hearing impairments who have difficulty perceiving sound in general and music in particular. Especially users of cochlear implants who were early-deafened but only implanted during adolescence or later have difficulties following rhythm [61]. In contrast, post-lingually deafened CI users show similar performance to normal hearing persons [124]. The second category consists of people suffering from beat deafness [146]. Beat deafness is a type of congenital amusia which makes it impossible to extract music's beat. Both groups could benefit from a technology that finds the beat in music and provides tactile or visual feedback on the beat.
For evaluation purposes. Using discrete events - the beats - makes evaluation relatively straightforward. It is a matter of comparing the expected beat timing with the timing of the feedback event.
For pragmatic reasons. The contextual data - the beat lists - are available or can be generated easily. There are a few options to extract beat timestamps. The first is to manually annotate beat information for each piece of music in the reference database. It is the most reliable method, but also the most laborious. The second option is to use a state-of-the-art beat tracking algorithm, e.g. the one available in Essentia [16]. The third option is to request beat timestamps from a specialized web service, such as the one provided by the AcousticBrainz project [149].
The following sections present the system and its evaluation. The paper ends with the discussion and the conclusions.
A system that provides time-dependent context for music has several requirements. The system needs to be able to recognize audio or music being played in the environment of the user, together with a precise time offset. It also needs contextual, time-dependent information to provide to the user, e.g. lyrics, scores, music videos, tablature or triggers. This information should be prepared beforehand and stored in a repository. The system also needs an interface to provide the user with the information to enhance the listening experience. As a final, soft requirement, the system should preferably minimize computational load and resources so it could be implemented on smartphones.
Acoustic fingerprinting algorithms are designed to recognize which music is playing in the environment. The algorithms use condensed representations of audio signals to identify short audio fragments in vast audio databases. A well-designed fingerprinting algorithm generates similar fingerprints for perceptually similar audio signals, even if there is a large difference in quality between the signals. [209] describes such an algorithm. The algorithm is based on pairs of spectral peaks which are hashed and compared with the peaks of reference audio. Wang's algorithm is able to recognize short audio fragments reliably even in the presence of background noise or artifacts introduced by A/D or D/A conversions. The algorithm is computationally efficient, relatively easy to implement and yields precise time-offsets. All these features combined make it a good algorithm to detect which music is being played and to determine the time offset precisely. Figure ? shows fingerprints extracted from two audio streams using Panako [175], an implementation of the aforementioned algorithm. The reference audio is at the top, the query at the bottom. Using the matching fingerprints (in green), the query is aligned with the reference audio.
For the prototype, the complete process is depicted in Figure ?. A client uses its microphone to register sounds in the environment. Next, fingerprints are extracted from the audio stream and sent to a server. The server matches the fingerprints with a reference database. If a match is found, a detailed time offset between the query and the reference audio is calculated. Subsequently, the server returns this time offset together with a list of beat timestamps. Using this information the client is able to generate feedback events that coincide with the beat of the music playing in the user's environment. This process is repeated to make sure that the feedback events remain in sync with the music in the room. If the server fails to find a match, the feedback events stop.
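The client-side scheduling that this implies can be sketched as follows. The names and parameters are assumptions for illustration; the actual prototype builds on Panako and a JSON web service, but the bookkeeping amounts to translating the reported offset and beat list into locally timed feedback events.

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

// Sketch: schedule feedback events on the beats of the music playing in the room.
// The server reported that the query matches the reference at referenceOffsetMs;
// the query started at queryStartWallClockMs on the local clock.
final class BeatFeedbackSketch {
    static void schedule(List<Double> beatTimesMs, double referenceOffsetMs,
                         long queryStartWallClockMs, double latencyCompensationMs,
                         Runnable feedback) {
        Timer timer = new Timer(true);
        long now = System.currentTimeMillis();
        for (double beatMs : beatTimesMs) {
            // position of the beat relative to the start of the recorded query
            double beatInQueryMs = beatMs - referenceOffsetMs;
            long fireAt = queryStartWallClockMs + (long) (beatInQueryMs - latencyCompensationMs);
            if (fireAt <= now) continue; // beat already passed
            timer.schedule(new TimerTask() {
                @Override public void run() { feedback.run(); }
            }, fireAt - now);
        }
    }
}
```

The latencyCompensationMs parameter plays the role of the latency parameter mentioned in the evaluation below: it shifts all feedback events slightly earlier or later to compensate for output and perception delays.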
With the list of beat events available, the system needs to generate feedback events. The specific feedback mode depends on the use case. Perhaps it suffices to accentuate the beat using an auditory signal, e.g. a loud sound with a sharp attack. A bright flash on each beat could also help some users. Haptic feedback can be given with vibration motors attached to the wrist using a bracelet. A commercially available wireless tactile metronome - the Soundbrenner Pulse - lends itself well to this purpose. A combination of feedback modes could prove beneficial, since multisensory feedback can improve sensorimotor synchronization [56].
The evaluation makes clear how accurately synchronized context can be delivered for ambient audio or music. It quantifies the time offset between the beats - annotated beforehand - and the time of the feedback event that should correspond with a beat. For an evaluation of the underlying fingerprinting algorithm, readers are referred to the evaluation by [209].
The evaluation procedure is as follows: a device plays a piece of music while it also keeps track of the current playback position accurately in real time.
To counter problems arising from soft real-time thread scheduling and audio output latency, the evaluation was done using a microcontroller with a hard real-time scheduler and low-latency audio playback. An extra benefit is that timing measurements are very precise. Cheap, easily programmable microcontrollers come with enough computing power these days to handle high-quality audio. One such device is the Axoloti: a microcontroller for audio applications that can be programmed using a patcher environment. Another is the Teensy equipped with a Teensy Audio Board. Here we use a Teensy 3.2 with Audio Board for audio playback. It supports the Arduino environment, which makes it easy to program it as a measuring device. It has an audio output latency of about 1 ms. It is able to render 16-bit audio sampled at 44100 Hz and to read PCM-encoded audio from an SD-card. In the experimental setup, the Teensy is connected to a Behringer B2031 active speaker.
BPM | Artist - Title |
82 | Arctic Monkeys - Brianstorm |
87 | Pendulum - Propane Nightmares |
90 | Ratatat - Mirando |
91 | C2C - Arcades |
95 | Hotei - Battle Without Honor or Humanity |
95 | Skunk Anansie - Weak |
100 | Really Slow Motion - The Wild Card |
105 | Muse - Panic Station |
108 | P.O.D. - Alive |
111 | Billie - Give Me the Knife |
121 | Daft Punk - Around The World |
128 | Paul Kalkbrenner - Das Gezabel de Luxe |
144 | Panda Dub - Purple Trip |
146 | Digicult - Out Of This World |
153 | Rage Against the Machine - Bombtrack |
162 | Pharrell Williams - Happy |
Audio enters the client by means of the built-in microphone. The client part of Figure ? is handled by a laptop: a late 2010 MacBook Air, model A1369, running Mac OS X 10.10. Next the audio is fed into the fingerprinting system. The system presented here is based on Panako [175]. The source code of Panako is available online and it is implemented in Java 1.8. Panako was modified to allow a client/server architecture. The client is responsible for extracting fingerprints. A server matches fingerprints and computes time offsets. The server also stores a large reference database with fingerprints together with beat positions. The client and server communicate via a web service using JSON, a standard way to encode data so that computers can exchange it.
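As an illustration of what such a message could carry, the class below models a hypothetical server response; the field names are assumptions, not the actual Panako protocol.

```java
import java.util.List;

// Sketch of the data a JSON response from the server could carry; the field
// names are illustrative assumptions, not the actual protocol.
final class MatchResponse {
    boolean matchFound;        // false: the client stops the feedback events
    String referenceId;        // identifier of the matched reference track
    double offsetInSeconds;    // where in the reference the query starts
    List<Double> beatTimesMs;  // pre-computed beat positions of the reference
}
```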
When the time offsets and the beat list are available on the client, feedback events are generated. To evaluate the system, the feedback events are sent over a USB-serial port to the Teensy. The Teensy replies over the serial port with the current playback position of the audio. The playback position is compared with the expected position of the beat and the difference is reported in milliseconds. Negative values mean that the feedback event came before the actual audible beat; positive values mean the opposite. The system is tested using the data set presented in Table 6. It features music in a broad BPM range with a clear, relatively stable beat.
The results are depicted in Figure ?. The system responds on average 16 ms before the beat. This allows feedback events to be perceived together with the beat. Depending on the tempo (BPM) of the music and the type of feedback, it might be needed to schedule events later or even sooner. This can be done by adapting the latency parameter, which modifies the timing of the scheduled feedback events. However, there is a large standard deviation of 42 ms. The current system is limited to an accuracy of 32 ms: the duration of a block of 256 audio samples, sampled at 8000 Hz, which is the block size used in the fingerprinter. In Figure ? each histogram bin is 32 ms wide and centered around -16 ms. The results show that the system is able to recognize the correct audio block but is sometimes one block off. The main issue here is the unpredictable nature of scheduling in Java: Java threads are not guaranteed to start with predictable, millisecond-accurate timing, and garbage collection can cause even larger delays. The largest delays are due to incorrectly recognized audio: repetition in music can cause the algorithm to return an incorrect absolute offset, which makes the beat drift. The results, however, do show that the concept of the system is very promising and that it can deliver time-dependent context.
In its current state the system listens to 12 seconds of audio and sends the fingerprints of those 12 seconds to the server; the state is reset after that. The system does not employ use-case-dependent heuristics. If it is known beforehand that the user will most likely listen to full tracks, the current state, time offset and beat lists could be reused intelligently to improve accuracy, especially in the case of repeating audio. Accuracy could also be improved by running a real-time onset detector and correlating the detected onsets with the beat list returned by the server; this would however make the system more computationally expensive.
The proposed system only supports recorded music. Supporting live music is challenging but could be done. The MATCH algorithm [53], for example, supports tracking live performances in real time via dynamic time warping. The basic song structure, however, needs to be kept intact during the live rendition, otherwise the pre-computed contextual data becomes useless. Another challenge is DJ-sets. Although recorded music is used, during DJ-sets the tempo of the music is often modified to match a previous piece of music. To support such situations a more involved acoustic fingerprinting algorithm is needed. Currently there are two algorithms described in the literature that report both time offsets and tempo modifications accurately [182].
Repetition is inherent in music. Especially in electronic music the exact same audio can be repeated several times. A fingerprinting algorithm that only uses a short audio excerpt could, in such cases, return an incorrect absolute offset. To alleviate this problem, context could also be taken into account. The type of data returned also needs to be considered: lyrics could be incorrect while tablature or beats could still be aligned correctly, since they do not depend as much on an absolute offset.
Since the system uses a computationally inexpensive algorithm, it can be executed on a smartphone. The implementation used here is compatible with Android, since it depends only on two Android-compatible libraries [175]. If only a small reference music library is used, all server components of the system could be moved to the smartphone. An app that offers aligned music videos for the music of one album could easily run all components on a smartphone without the need for an external server.
For the prototype, a database with pre-computed beat positions is created off-line using all the acoustic information of each song. However, it is possible to determine beat positions with a real-time beat tracking algorithm such as the one by [65]. Unfortunately, this poses several problems. Beat tracking involves an important predictive and reflective element: to correctly model beats based on a list of onsets extracted in real time, musical context is needed, and this context may simply not be available. Another issue is that the computational load of beat-tracking systems is often high; [135] gives an overview of beat tracking techniques, which are challenging to implement on smartphones. A third problem is that feedback needs to be provided before the actual acoustic beat is perceived by the user. Tactile feedback, for example, takes around 35 ms to be processed [56]. Feedback based on a real-time beat tracker - which introduces a latency by itself - would always be late. Generating feedback based on real-time beat tracking algorithms is therefore impractical, especially in the context of smartphones with low-quality microphones and restricted computational resources.
To further develop the prototype into an assistive technology, more fundamental research is needed to pinpoint the optimal type of feedback for each user group. The early-deafened, late-implanted CI user group is recognized as an ’understudied clinical population’ [61] for which models of auditory rhythm perception are underdeveloped. Insights into tactile or multi-modal rhythm perception for this specific group seem to be lacking from the academic literature. There is, however, a study that suggests that multi-sensory cues improve sensorimotor synchronization [56]. In [183] another fundamental issue is raised: in that study two participants seem to be able to perceive small timing deviations in audio but are unable to move accordingly. As the authors put it, “This mismatch of perception and action points toward disrupted auditory-motor mapping as the key impairment accounting for poor synchronization to the beat”. The question remains whether this holds for tactile-motor mappings, especially in the late-implanted CI user group.
A system was described that employs acoustic fingerprinting techniques to provide an augmented music listening experience. A prototype was developed that provides feedback synchronized with music being played in the environment. The system needs a dataset with fingerprints from reference audio and pre-computed beat lists. Since it offers fine-grained context awareness, it can show lyrics, scores, visuals, aligned music videos or other meta-data that enrich the listening experience. The system can also be used to trigger events linked to audio during, for example, a theater performance.
This chapter starts with a list of contributions. It also provides a place for discussing limitations and future work by introducing the term augmented humanities. It finishes with concluding remarks.
A general contribution of this doctoral thesis is a way to think about technical contributions to a field using the concepts of methods and services. The specific contributions of this doctoral research project are found in the individual articles.
The main contribution of Tarsos [173] is that it lowered the barrier for musicologists and students to quickly extract pitch class histograms. As an accessible, easy-to-use analysis tool it serves as a stepping stone to the automated methods for large-scale analysis which are also available in Tarsos. The underlying DSP library [174] became a stand-alone service and has been used in research and interactive music applications.
The contribution of the Panako [175] acoustic fingerprinting system is threefold. First, it features a novel algorithm for acoustic fingerprinting that is able to match queries with reference audio even if pitch-shifting and time-stretching are present. Although similarities to a previously patented method [208] were pointed out after publication, it remains a valuable contribution to the academic literature. The second contribution lies in the publicly available and verifiable implementation of the system and three other baseline algorithms [209], which serve as a service to the MIR community. The final contribution is the reproducible evaluation method, which has been copied by [182] and is part of the focus of [167]. To make the technology and capabilities of Panako better known, it has been featured in articles targeting the archival and library science communities [168].
[167] contributed to the discussion on reproducibility and its challenges in computational research. It summarizes the common problems and illustrates them by replicating – in full – a seminal acoustic fingerprinting paper. It also proposes ways to deal with reproducibility problems and applies these to the reproduced work.
[176] describes a novel way to synchronize multi-modal research data. The idea is to reduce the problem of data synchronization to audio-to-audio alignment. The main contribution of the work is the described method and its implementation. The method was used in practice [51] and extended with a real-time component during a master's thesis project [195].
[177], finally, enables augmented musical realities. The musical environment of a user can be enriched with all types of meta-data by using the blueprints of the system described in that article.
Another set of contributions are the solutions that enabled empirical studies. The development of various hard- and software systems with considerable technical challenges allowed research on various aspects of interaction with music. These solutions are featured in articles by [203]. Each of these solutions addressed one or more components typically present in an empirical research project: activation, measurement, transmission, accumulation or analysis. For more detail on each of these, please see ?.
To hint at the limitations of my current research and immediately offer perspectives on future work, the term augmented humanities is introduced.
In my research on MIR and archives, and in the digital humanities in general, cultural objects are often regarded as static, immutable objects. An example of this vision is the instrument collection of the Royal Museum for Central Africa in Tervuren, Belgium (RMCA). There, thousands of musical instruments are conserved and studied. These physical objects are cataloged according to sound producing mechanism or physical characteristics and digitized via photographs by the MIMO Consortium.
However, cultural artifacts typically thrive on interaction: interaction between performer, public, time, space, context, body and expressive capabilities. An unplayable musical instrument studied as a static ‘thing’ loses its interactive dynamics. The instrument is very much part of an interactive system between an expert performer, listeners and context. When a musical record of a participatory act of music making is listened to in silence over headphones, it loses much of its meaning. The static witness of this act captures only a ghost of its ‘true multi-dimensional self’. Other research questions can be answered when this interactive aspect is re-introduced.
This distinction between static and interactive could be summarized as a path from digital humanities to augmented humanities. The augmented humanities could re-introduce, augment or diminish interactive behavior to test certain hypotheses. The interactive behavior could be elicited or, conversely, prevented or sabotaged to gain insights into this behavior. The cultural object would be seen as an actor in this model. To interfere with or mediate interactions, augmented reality technology could be re-purposed. A few examples may clarify this vision.
The D-Jogger system by [133] could be seen as rudimentary augmented humanities research. It is a system for runners who listen to specific music while running. The music is chosen to have a tempo in the range of the runner's number of steps per minute. The system exploits the tendency many runners have to align their footfalls with the musical beats. So far, music is still an immutable actor in this interaction: runners simply align their actions with the music. However, if the music is modified dynamically to match its tempo with the runner's, a system with two active actors appears, each drawn to the other. The interaction can then be elicited by allowing the music to always sync to the runner, or the interaction can be prevented: the beat would then never align with a footfall. The runner can also be sped up by first allowing an alignment and then dynamically increasing the musical tempo slightly. The point is that music stops being an immutable, static object and becomes an actor that offers a view on the interactive system. Modifying the interaction sheds light on which musical features activate or relax movement [105], on the coupling strength between runner and musical tempo, and on the adaptive behavior of a runner.
A study by [39] examines musical scores of baroque dance music in which tempo indications are not present. The metronome and precise tempo indications on the score only appeared later, in the romantic period. Inferring a tempo from a static score is very challenging. A more interactive approach is to play the music in different tempi, let people with experience in dancing to baroque music move to the music, and subsequently determine an optimal tempo. In this example the static cultural object - a musical score - is again transformed into a malleable (with respect to tempo) interaction between dancers and performers.
Historically, MIR technologies were developed to describe and query large databases of music [26]. However, MIR techniques can be used in interactive settings as well. For example, in [177] acoustic fingerprinting techniques are used to evoke interaction. This system is capable of presenting the user with an augmented musical reality or, more generally, a computer-mediated reality. It allows the music in a user's environment to be augmented with additional layers of information. In the article a case is built around a cochlear implant user and dance music playing in his or her environment. The system emphasizes the musical beat with a tactile pulse in sync with the musical environment. This allows users with limited means of tracking musical beats to synchronize movement to beats.
To further this type of research, in which musical realities are constructed and interaction is modified, three aspects are required. First, insights are needed into core topics of the humanities and hypotheses on how humans develop meaningful, sense-giving interactions with their (musical) environment (many can be found in [103]). Secondly, a technological aspect is required to allow a laboratory setting in which environments can be adjusted (augmented) and responses recorded. The third aspect concerns the usability of these new technologies: it should be clear how convincing the augmented reality is, which relates to ecological validity.
The ArtScience Lab at the Krook in Ghent, together with the usability laboratories also at the Krook, provides the right infrastructure to show the potential of an augmented humanities approach. With [177] I have already taken a step in this direction and would like to continue along this path.
I have introduced a way to organize solutions for problems in systematic musicology by mapping them on a plane. One axis contrasts methods with services while the other axis deals with techniques: MIR-techniques versus techniques for empirical research. By engineering prototypes, tools for empirical research and software systems I have explored this plane and contributed solutions that are relevant for systematic musicology. More specifically I have contributed to tone scale research, acoustic fingerprinting, reproducibility in MIR and to several empirical research projects.
I have aimed for a reproducible methodology by releasing the source code of software systems under open source licenses and have evaluated systems with publicly available music when possible.
To illustrate the limitations of my current research and to propose a direction for future work, I have proposed the term augmented humanities, where hypotheses on musical interactions are tested by interfering in these interactions. Augmented reality technologies offer opportunities to do this in a convincing manner.
Articles in chronological order. Bold means included in the dissertation.
Articles in bold are included in the dissertation. The type of presentation is included at the end of each item.
Panel discussion, 2012: “Technological challenges for the computational modelling of the world’s musical heritage”, Folk Music Analysis Conference 2012 – FMA 2012, organizers: Polina Proutskova and Emilia Gomez, Seville, Spain
Guest lecture, 2012: Non-western music and digital humanities, for: “Studies in Western Music History: Quantitative and Computational Approaches to Music History”, MIT, Boston, U.S.
Guest lecture, 2011: “Presenting Tarsos, a software platform for pitch analysis”, Electrical and Electronics Eng. Dept., IYTE, Izmir, Turkey
Workshop 2017: “Computational Ethnomusicology – Methodologies for a new field”, Leiden, The Netherlands
Guest lectures, A002301 (2016–2017 and 2017-2018) “Grondslagen van de muzikale acoustica en sonologie” – Theory and Practice sessions together with dr. Pieter-Jan Maes, UGent
This dissertation has produced output in the form of scientific, peer-reviewed articles, but also considerable output in the form of research software systems and source code. This section lists the research software with a small description and links where source code and further information can be found.
Panako is an extendable acoustic fingerprinting framework. The aim of acoustic fingerprinting is to find small audio fragments in large audio databases. Panako contains several acoustic fingerprinting algorithms to make comparisons between them easy. The main Panako algorithm uses key points in a Constant-Q spectrogram as a fingerprint to allow pitch-shifting, time-stretching and speed modification. The aim of Panako is to serve as a platform for research on acoustic fingerprinting systems while striving to be applicable in small to medium scale situations.
Described in | [175] |
License | AGPL |
Repository | https://github.com/JorenSix/Panako |
Downloads | https://0110.be/releases/Panako |
Website | https://panako.be |
Tarsos is a software tool to analyze and experiment with pitch organization in all kinds of musics. Most of the analysis is done using pitch histograms and octave reduced pitch class histograms. Tarsos has an intuitive user interface and contains a couple of command line programs to analyze large sets of music.
Described in | [173] and [169] |
License | AGPL |
Repository | https://github.com/JorenSix/Tarsos |
Downloads | https://0110.be/releases/Tarsos |
Website | https://0110.be/tags/Tarsos |
TarsosDSP is a Java library for audio processing. Its aim is to provide an easy-to-use interface to practical music processing algorithms, implemented as simply as possible in pure Java and without any external dependencies. TarsosDSP features an implementation of a percussion onset detector and a number of pitch detection algorithms: YIN, the McLeod Pitch Method and the “Dynamic Wavelet Algorithm Pitch Tracking” algorithm. Also included are a Goertzel DTMF decoding algorithm, a time stretching algorithm (WSOLA), resampling, filters, simple synthesis, some audio effects, a pitch shifting algorithm and wavelet filters.
Described in | [174] |
License | GPL |
Repository | https://github.com/JorenSix/TarsosDSP |
Downloads | https://0110.be/releases/TarsosDSP |
Website | https://0110.be/tags/TarsosDSP |
SyncSink is able to synchronize video and audio recordings of the same event. As long as some audio is shared between the multimedia files, a reliable synchronization solution will be proposed. SyncSink is ideal for synchronizing video recordings of the same event made by multiple cameras, or for aligning a high-definition audio recording with a video recording (with lower-quality audio).
SyncSink is also used to facilitate synchronization of multimodal research data e.g. to research the interaction between movement and music.
Described in | [176] |
License | AGPL |
Repository | https://github.com/JorenSix/Panako |
Downloads | https://0110.be/releases/Panako |
Website | http://panako.be |
TeensyDAQ is a Java application to quickly visualize and record analog signals with a Teensy micro-controller and some custom software. It is mainly useful for quickly getting an idea of how an analog sensor reacts to different stimuli. Some of the features of TeensyDAQ:
Visualize or sonify up to five analog signals simultaneously in real-time.
Capture analog input signals with sampling rates up to 8000Hz.
Record analog input to a CSV-file and, using drag-and-drop, visualize previously recorded CSV-files.
While a capture session is in progress, go back in time and zoom, pan and drag to get a detailed view of your data.
License | GPL |
Repository | https://github.com/JorenSix/TeensyDAQ |
Downloads | https://0110.be/releases/TeensyDAQ |
AES | Audio Engineering Society |
AGPL | Affero General Public License |
AMDF | Average Magnitude Difference Function |
API | Application Programming Interface |
BER | Bit Error Rate |
BPM | Beats Per Minute |
CBR | Constant Bit Rate |
CC | Creative Commons |
CNRS | Centre National de la Recherche Scientifique |
CREM | Centre de Recherche en Ethnomusicologie |
CSV | Comma Separated Values |
DAQ | Data Acquisition |
DB | Database |
DEKKMMA | Digitalisatie van het Etnomusicologisch Klankarchief van het Koninklijk Museum voor Midden-Afrika |
DSP | Digital Signal Processing |
DTMF | Dual-Tone Multi-Frequency |
ECG | Electrocardiogram |
EEG | Electroencephalography |
ESCOM | European Society for the Cognitive Sciences of Music |
FFT | Fast Fourier Transform |
FMA | Folk Music Analysis |
FN | False Negative |
FP | False Positive |
FPS | Frames Per Second |
FWO | Fonds Wetenschappelijk Onderzoek |
GNU | GNU's Not Unix |
GPL | General Public License |
GPU | Graphical Processing Unit |
HD | Hard Disk |
HR | Heart Rate |
IASA | International Association of Sound and Audiovisual Archives |
IRCDL | Italian Research Conference on Digital Libraries |
ISMIR | International Society for Music Information Retrieval |
JNMR | Journal of New Music Research |
JVM | Java Virtual Machine |
KMMA | Koninklijk Museum voor Midden-Afrika |
LFO | Low Frequency Oscillator |
MFCC | Mel-Frequency Cepstral Coefficients |
MIDI | Musical Instrument Digital Interface |
MIR | Music Information Retrieval |
MIREX | Music Information Retrieval Evaluation eXchange |
MPM | McLeod Pitch Method |
PCH | Pitch Class Histogram |
PCM | Pulse Code Modulation |
QMUL | Queen Mary University of London |
RMCA | Royal Museum for Central Africa |
TN | True Negative |
TP | True Positive |
USB | Universal Serial Bus |
WSOLA | Waveform Similarity and OverLap Add |
XML | eXtensible Markup Language |
XSD | XML Schema Definition |
XSL | eXtensible Stylesheet Language |
YAAFE | Yet Another Audio Feature Extractor |