From 11-16 October 2020 the latest instalment of the ISMIR conference series was held. Due to the pandemic, the 21st ISMIR conference was the first virtual one. As usual, participants and presenters from around the world joined the conference. For the first time, however, not all participants synchronised their circadian rhythm. By repeating most events with 12h in between, the organisers managed to put together a schedule befitting nearly all participants.
The virtual format had some clear advantages: travel was not needed, so (environmental) cost was low. Attendance fees were lower than usual since no spaces or catering was needed. This democratised the conference experience and attendance reached a record high.
Form the 1st of October 2020 I will start on a new research project. The BOF fund of Ghent University is kind enough to sponsor the project for three years. The abstract is as follows:
Music is present in every culture in the world. We as a species seem to have an urge to make music. While the diversity of music cultures around the world is phenomenal, they do seem to have patterns in common. Especially for pitch, one of the fundamental building blocks of music, there are strong reasons to believe that there are commonalities amongst cultures on how pitch is organised A better insight in these common patterns may help to answer questions on the definition, origins and evolution of music.
Common patterns in pitch organisation can be studied from two perspectives. Firstly, the perspective of how humans perceive and make music can be gained from systematic, experimental work. Over the years this has yielded insights in which pitch organisations might be most fit for our perceptual, neurophysiological system. Secondly, these patterns can be observed directly in large-scale, corpus-based, cross-cultural studies which has a potential that is not exploited as of yet.
During this fellowship a large-scale global corpus with field recordings will be compiled in collaboration. Music Information Retrieval techniques will be employed to describe how pitch is organised in the corpus. More specifically, it will support claims on the use of discrete pitches, octave equivalence, the number of pitch classes in use and the pitch interval structures. The uncovered fundamental properties of pitch will be confronted with findings from experimental work.
Recently I presented the outline of the project with the following slides:
A good year ago I was asked to develop audio recognition technology for an e-costume. The idea was that lights in the costume would follow a sequence synchronised to a certain song. Only a single song should trigger the lights, all other music should be ignored. Recognition of music and synchronisation is typically done using audio fingerprinting techniques. The challenge was that the recognition needed to run on a cheap, battery-powered microcontroller with limited CPU and memory. I delivered a prototype but eventually a cheap, battle-tested, off-the-shelf, IP-cleared, alternative was found.
The prototype gathered dust for a while but the idea stuck in my head. With my daughters fourth birthday approaching during the lockdown, I decided to turn the prototype into an over-engineered birthday gift and let an ‘Elsa-dress’ react to ‘Let It Go’ from the Frozen soundtrack. With the prototype as a starting point, I ordered an RGB-LED-strip, a beefy Li-Ion Battery, an I2S digital microphone and, of course, an Elsa-dress.
I had an ESP32 microcontroller laying around and used it as the core of the system: it supports I²S, has a floating point unit (FPU), is easy to use together with LED strips and has enough memory. The FPU makes it straightforward to use the same code on traditional computers as on embedded devices: fixed-point math can be avoided.
After soldering the components together and with the help from my better half to sew in the LED strip, it all came together. In the video below, the result of our work can be seen. The video first shows a song that should not and is not recognised. Then, “Let It Go” is played and correctly recognised. After the song is stopped, the lights go on for a while and finally stop: this is by design to allow gaps in recognition. Lastly, the song is continued and again correctly recognised.
The code went through several iterations and was expanded beyond the original scope and became a capable general purpose acoustic fingerprinting system with its many applications. Olaf performs quite well thanks to its resource friendly design and the use of PFFT and LMDB. Especially LMDB, a fast, B+-tree backed key value store with low storage overhead enables performant storage and lookups.
The GitHub does not contain an example for the ESP32. That code depends on the microcontroller, digital microphone and pins used and Olaf needs to be hacked to exhibit the requested behaviour. All in all that code is much less reusable (and sharable, testable, maintainable). I have, however, included a platformIO project for Olaf on ESP32 for reference.
WASM: Olaf in the browser
Olaf, being written in ANSI C, can run in the browser thanks to the Emscripten compiler. According to its website, Emscripten ‘…lets you run C and C++ on the web at near-native speed without plugins’ Combining the Web Audio API and the WASM version of Olaf makes web-based acoustic fingerprinting applications possible.
Below you can try out Olaf. The exact same code is running on your browser as on the ESP32 demonstrated above. This means that Olaf is listening to recognise ‘Let It Go’ from the Frozen soundtrack. For your convenience the song can be started below on the left. On the right, you can start Olaf by allowing incoming audio to be analysed. The FFT is calculated by Olaf and visualised using Pixi.js. After a few seconds the red fingerprints should become green, indicating a match. Once you stop the song, the fingerprints will eventually turn red again. As with the video above: going from a match to no match takes a couple of seconds to allow gaps in recognition.
1. Start the song and play it aloud. Singing along is encouraged.
2. Start the microphone and check whether recognition succeeds.
Olaf was featured on hackaday. There is also a small discussion about Olaf on Hacker News. A write-up of this project also ended up as a contribution to the Late Breaking Demo track of the first virtual ISMIR conference: Olaf ISMIR 2020 LBD abstract.
The audio shield takes care of the line level audio input. This audio input is then decoded. The decoding is done by libltc. The library runs as is on a Teensy without modification. The three elements are combined in a relatively simple teensy patch
To use the decoder connect the line level input left channel to an SMPTE source via e.g. an RCA plug.
During an experiment which monitors a music performance it might be a requirement to record music, video and sensor data synchronously. Recording analog sensors (balance boards, accelerometers, light sensors, distance sensors) together with audio and video is often problematic. Ideally standard DAW software can be used to record both audio and sensor data. A system is presented here that makes it relatively straightforward to record sensor data together with audio/video.
The basic idea is simple: a microcontroller is programmed to appear as a class compliant MIDI device. Analog measurements on the micro-controller are translated to a specific MIDI protocol. The MIDI data, on the capturing side, can then be converted again into the original sensor data. This setup has several advantages:
It makes it easy to record sensor data together with audio data in a standard DAW software package. Recording a recording a midi track and audio track simultaneously in, e.g., Ableton Live, is easy.
Communication with the micro-controller is bi-directional. The micro-controller can be programmed to react to certain MIDI messages. A note-on can, for example, be used to start analog sensor recording. These MIDI commands can be send from any possible source that can ‘speak’ MIDI.
Thanks to the Web MIDIAPI this construct presents an easy way to let analog sensors and websites interact.
Real time sonification of the sensor-data is also supported. There are many ready to go options to sonify MIDI. Axoloti, Max/MSP, Zupiter are some of the environments that can be pugged into.
Fig: Visualization in html of analog sensor data, captured as MIDI
While the concept is relatively simple, there are many details to get right. Please consult the MIDImorphosis github page which details the system that consists of an analog sensor, a MIDI protocol and a clocking infrastructure.
Together with Jeska, I presented an ongoing study on musical interaction. In the study one of the measurements was the body movement of two participants. This is done with boards that are equipped with weight sensors. The data that comes out of this can be inspected for synchronisation, quality and quantity of movement, movement periodicities.
The hardware is the work of Ivan Schepers, the software used to capture and transmit messages is called “the MIDImorphosis” and developed by me. The research is in collaboration with Jeska Buhman, Marc Leman and Alessandro Dell’Anna. An article with detailed findings is forthcoming.
The uniqueness of human music relative to speech and animal song has been extensively debated, but rarely directly measured. We applied an automated scale analysis algorithm to a sample of 86 recordings of human music, human speech, and bird songs from around the world. We found that human music throughout the world uniquely emphasized scales with small-integer frequency ratios, particularly a perfect 5th (3:2 ratio), while human speech and bird song showed no clear evidence of consistent scale-like tunings. We speculate that the uniquely human tendency toward scales with small-integer ratios may relate to the evolution of synchronized group performance among humans.
Automatic comparison of global children’s and adult songs
Music throughout the world varies greatly, yet some musical features like scale structure display striking crosscultural similarities. Are there musical laws or biological constraints that underlie this diversity? The “vocal mistuning” hypothesis proposes that cross-cultural regularities in musical scales arise from imprecision in vocal tuning, while the integer-ratio hypothesis proposes that they arise from perceptual principles based on psychoacoustic consonance. In order to test these hypotheses, we conducted automatic comparative analysis of 100 children’s and adult songs from throughout the world. We found that children’s songs tend to have narrower melodic range, fewer scale degrees, and less precise intonation than adult songs, consistent with motor limitations due to their earlier developmental stage. On the other hand, adult and children’s songs share some common tuning intervals at small-integer ratios, particularly the perfect 5th (~3:2 ratio). These results suggest that some widespread aspects of musical scales may be caused by motor constraints, but also suggest that perceptual preferences for simple integer ratios might contribute to cross-cultural regularities in scale structure. We propose a “sensorimotor hypothesis” to unify these competing theories.
At work we have a really nice piano and I wanted to be able to broadcast a live performance over the internet with low latency to potential live listeners. In all honesty, only my significant other gets moderately lukewarm about the idea of hearing me play live. Anyhow:
I did not find any practical tool to easily pump audio over the internet. I did find something that was very close called trx by Mark Hills: trx is a simple toolset for broadcasting live audio from Linux. It unfortunately only works with the ALSA audio system and is limited to Linux. I decided to extend it to support macOS and Pulse Audio. I also extended its name to form trix.
Audio Transmitter/Receiver over Ip eXchange (trix) is a simple toolset for broadcasting live audio from Linux or macOS. It sends and receives encoded audio over IP networks, via an audio interface. If audio interfaces are properly configured, a low-latency point-to-point or multicast broadband audio connection can be achieved. This could be used for networked music performances. The inclusion of the intermediate rtAudio library provides support for various audio input and outputs.
More information on trix can be found on the trix github page.
The system can be configured for low latency use. The whole chain is dependent several different components which each add to the total latency: audio input latency, encoder (algorithmic) delay, network latency and finally audio output latency.
Thanks to the use of RtAudio it should be possible to use low latency API’s to access audio devices (ASIO on windows or Jack on Unix). This means that audio input and output latencies can be as low as the hardware allows. The opus encoder/decoder that is used has a low algorithmic delay. By default it has a 25ms delay but it can be configured to only 2.5ms (see here). The network latency (and jitter) is very much dependent on the distance to cover. On a local network this can be kept low, when using wide area networks (the internet) control is lost and latencies can add up depending on the number of hops to take. Jitter can be problematic if the smallest possible buffers are used: then dropouts might occur and this might affect the audio in a noticeable way.
I have uploaded a small piece of software which allows users to find a specific audio marker in audio streams. It is mainly practical to synchronise a camera (audio/video) recording with other audio with the same marker. The marker is a set of three beeps. These three beeps are found with millisecond accurate precision within the audio streams under analysis. By comparing the timing of marker synchronization becomes possible. It can be regarded as an alternative for the movie clapper boards.
With the goal in mind to reduce common runner injuries we first need to measure some running style characteristics. Therefore, we have developed a sensor to measure how hard a runners foot repeatedly hits the ground. This sensor has been compared with laboratory equipment which proofs that its measurements are valid and can be repeated. The main advantages of our sensor is that it can be used ‘in the wild’, outside the lab on the runners regular tours. We want to use this sensor to provide real-time biofeedback in order to change running style and ultimately reduce injury risk.
Studies seeking to determine the effects of gait retraining through biofeedback on peak tibial acceleration (PTA) assume that this biometric trait is a valid measure of impact loading that is reliable both within and between sessions. However, reliability and validity data were lacking for axial and resultant PTAs along the speed range of over-ground endurance running. A wearable system was developed to continuously measure 3D tibial accelerations and to detect PTAs in real-time. Thirteen rearfoot runners ran at 2.55, 3.20 and 5.10 m*s-1 over an instrumented runway in two sessions with re-attachment of the system. Intraclass correlation coefficients (ICCs) were used to determine within-session reliability. Repeatability was evaluated by paired T-tests and ICCs. Concerning validity, axial and resultant PTAs were correlated to the peak vertical impact loading rate (LR) of the ground reaction force. Additionally, speed should affect impact loading magnitude. Hence, magnitudes were compared across speeds by RM-ANOVA. Within a session, ICCs were over 0.90 and reasonable for clinical measurements. Between sessions, the magnitudes remained statistically similar with ICCs ranging from 0.50 to 0.59 for axial PTA and from 0.53 to 0.81 for resultant PTA. Peak accelerations of the lower leg segment correlated to LR with larger coefficients for axial PTA (r range: 0.64–0.84) than for the resultant PTA per speed condition. The magnitude of each impact measure increased with speed. These data suggest that PTAs registered per stand-alone system can be useful during level, over-ground rearfoot running to evaluate impact loading in the time domain when force platforms are unavailable in studies with repeated measurements.
‘Team Scheire’ is a Flemish TV program with a similar concept as BBC Two’s ‘The Big Life Fix’. In the program, makers create ingenious new solutions to everyday problems and build life-changing solutions for people in desperate need.
One of the cases is Ben. Ben loves to run but has a recurring running related injury. To monitor Ben’s running and determine a maximum training length a sensor was developed that measures the impact and the amount of steps taken. The program makers were interested in the results of the Nano4Sports project at UGent. One of the aims of that project is to build those type of sensors and knowhow related to correct interpretation of data and use of such devices. Below a video with some background information can be found:
Thanks to the support of a travel grant by the faculty of Arts and Philosophy of Ghent University I was able to attend the ISMIR 2018 conference. A conference on Music Information Retrieval. I am co author on a contribution for the the Late-Breaking / Demos session
The structure of musical scales has been proposed to reflect universal acoustic principles based on simple integer ratios. However, some studying tuning in small samples of non-Western cultures have argued that such ratios are not universal but specific to Western music. To address this debate, we applied an algorithm that could automatically analyze and cross-culturally compare scale tunings to a global sample of 50 music recordings, including both instrumental and vocal pieces. Although we found great cross-cultural diversity in most scale degrees, these preliminary results also suggest a strong tendency to include the simplest possible integer ratio within the octave (perfect fifth, 3:2 ratio, ~700 cents) in both Western and non-Western cultures. This suggests that cultural diversity in musical scales is not without limit, but is constrained by universal psycho-acoustic principles that may shed light on the evolution of human music.
Recently I have published a small library on github called JGaborator. The library calculates fine grained constant-Q spectral representations of audio signals quickly from Java. The calculation of a Gabor transform is done by a C++ library named Gaborator. A Java native interface (JNI) bridge to the C++ Gaborator is provided. A combination of Gaborator and a fast FFT library (such as pfft) allows fine grained constant-Q transforms at a rate of about 200 times real-time on moderate hardware. It can serve as a front-end for several audio processing or MIR applications.
While the gaborator allows reversible transforms, only a forward transform (from time domain to the spectral domain) is currently supported from Java.A spectral visualization tool for sprectral information is part of this package. See below for a screenshot:
Claims made in many Music Information Retrieval (MIR) publications are hard to verify due to the fact that (i) often only a textual description is made available and code remains unpublished – leaving many implementation issues uncovered; (ii) copyrights on music limit the sharing of datasets; and (iii) incentives to put effort into reproducible research – publishing and documenting code and specifics on data – is lacking. In this article the problems around reproducibility are illustrated by replicating an MIR work. The system and evaluation described in ‘A Highly Robust Audio Fingerprinting System’ is replicated as closely as possible. The replication is done with several goals in mind: to describe difficulties in replicating the work and subsequently reflect on guidelines around reproducible research. Added contributions are the verification of the reported work, a publicly available implementation and an evaluation method that is reproducible.
A philologist’s approach to heritage is traditionally based on the curation of documents, such as text, audio and video. However, with the advent of interactive multimedia, heritage becomes floating and volatile, and not easily captured in documents. We propose an approach to heritage that goes beyond documents. We consider the crucial role of institutes for interactive multimedia (as motor of a living culture of interaction) and propose that the digital philologist’s task will be to promote the collective/shared responsibility of (interactive) documenting, engage engineering in developing interactive approaches to heritage, and keep interaction-heritage alive through the education of citizens.
I was kindly invited by SoundCloud to give a presentation on “Acoustic fingerprinting in research”. The presentation took place during one of the “MIR Meetups” in Berlin on Monday, April 23, 2018. Before my presentation there was a presentation by Derek and Josh (both SoundCloud engineers) detailing the state of the internal fingerprinting system of SoundCloud.
During my presentation I gave an overview of various applications of acoustic fingerprinting in a music research environment and detailed how these applications can be handled and are implemented in Panako: an open source fingerprinting system
Below the slides used during the presentation can be found:
The presentation during my defense was meant for a broader audience. During the presentation I gave examples of the research topics I have been working and focused on how these are connected. The presentation titled Engineering systematic musicology can be seen by following the previous link and is included below. The slide with the live spectrogram and the slide with the map need to be started by double clicking otherwise they remain empty.
The presentation is essentially an interactive HTML5 website build with the reveal.js framework. This has the advantage that multimedia is well supported and all kinds of interactions can be scripted. The presentation above, for example, uses the web audio API for live audio visualization and the google maps API for interactive maps. Video integration is also seamless. It would be a struggle to achieve similar multi-media heavy presentations with other presentation software packages such as Impress, Keynote or Powerpoint.
I often need audio and video material embedded into presentations. I have had bad experiences with powerpoint/keynote and especially the LaTeX beamer package and multimedia: audio/video material does not start playing or at the wrong moment, finicky on codecs, limited compatibility, a clunky UX (whoever came up with the idea to show multimedia controls while hovering over e.g. an audio thumbnail should be reoriented towards back-end programming) all contribute to errors while handling audio/video. Moreover the interactive capabilities are limiting.
“Since 2005, the Italian Research Conference on Digital Libraries has served as an important national forum focused on digital libraries and associated technical, practical, and social issues. IRCDL encompasses the many meanings of the term “digital libraries”, including new forms of information institutions; operational information systems with all manner of digital content; new means of selecting, collecting, organizing, and distributing digital content…"
The 26th of January Federica presented our joint contribution titled “Applications of Duplicate Detection in Music Archives: from Metadata Comparison to Storage Optimisation”. The work focuses on applications of duplicate detection for managing digital music archives. It aims to make this mature music information retrieval (MIR) technology better known to archivists and provide clear suggestions on how this technology can be used in practice. More specifically applications are discussed to complement meta-data, to link or merge digital music archives, to improve listening experiences and to re-use segmentation data.
This weekend the University Hamburg – Institute for Systematic Musicology and more specifically Christian D. Koehn organized the International Symposium on Computational Ethnomusicological Archiving. The symposium featured a broad selection of research topics (physical modelling of instruments, MIR research, 3D scanning techniques, technology for (re)spacialisation of music, library sciences) which all had a relation with archiving musics of the world:
How could existing digital technologies in the field of music information retrieval, artificial intelligence, and data networking be efficiently implemented with regard to digital music archives? How might current and future developments in these fields benefit researchers in ethnomusicology? How can analytical data about musical sound and descriptive data about musical culture be more comprehensively integrated?
In this presentation we describe our experience of working with computational analysis on digitized wax cylinder recordings. The audio quality of these recordings is limited which poses challenges for standard MIR tools. Unclear recording and playback speeds further hinder some types of audio analysis. Moreover, due to a lack of systematical meta-data notation it is often uncertain where a single recording originates or when exactly it was recorded. However, being the oldest available sound recordings, they are invaluable witnesses of various musical practices and they are opportunities to improve the understanding of these practices. Next to sketching these general concerns, we present results of the analysis of pitch content of 400 wax cylinder recordings from Indiana University (USA) and from the Royal Museum from Central Africa (Belgium). The scales of the 400 recordings are mapped and analyzed as a set. It is found that the fifth is almost always present and that scales with four and five pitch classes are organized similarly and differ from those with six and seven pitch classes, latter center around intervals of 170 cents, and former around 240 cents.