
~ An audio focused ffmpeg build for the web

I have prepared an audio focused ffmpeg build for the web which facilitates browser based audio applications. There are three demos:

  1. Audio transcoding and playback demo: converts any media file into audio compatible with the Web Audio API for in-browser playback or analysis.
  2. High quality time-stretching or pitch-shifting: demonstrates how pitch and tempo can be modified independently thanks to the Rubber Band Library.
  3. Basic media info: gives information about the streams and encodings used in a media file.


Fig: audio transcoding in the browser. A wav file is converted to an mp3.

A bit more about the rationale behind this effort: browsers have become practical platforms for audio processing applications thanks to the combination of the Web Audio API, a performant Javascript environment and WebAssembly. Have a look, for example, at essentia.JS.

However, browsers only support a small subset of audio formats and container formats. Dealing with many (legacy) audio formats is often a rather painful experience since there are so many media container formats which can contain a surprising variation of audio (and video) encodings. In short, decoding audio for in-browser analysis or playback is often problematic.

Luckily there is FFmpeg, which claims to be ‘a complete, cross-platform solution to record, convert and stream audio and video’. It is, indeed, capable of decoding almost any audio encoding known to man from just about any container. Additionally, it also contains tools to filter, manipulate, resample, stretch, … audio. FFmpeg is a must-have when working with audio. It would be ideal to have FFmpeg running in a browser…

Thanks to WebAssembly, ffmpeg can be compiled for use in the browser. There have been efforts to get ffmpeg working in the browser, but these have focused on the complete ffmpeg suite. I have now prepared an audio focused ffmpeg build for the web based on these efforts. I have selected only the audio parts, which makes the resulting .wasm binary four to five times smaller (from ~20MB to ~5MB). I also provided a simplified Javascript wrapper. The project brings audio decoding to the browser, but also audio filtering, transcoding, pitch-shifting, sample rate conversion, audio channel manipulation, and so forth. It is also capable of extracting audio streams from video container formats.

Next to the pure functionality of ffmpeg, there are also general advantages to running audio analysis software client-side, in the browser.

Check out the audio focused ffmpeg build for the web on GitHub.


~ pffft.wasm: an FFT library for the web

PFFFT is a small, pretty fast FFT library programmed in C with a BSD-like license. I have taken it upon myself to compile a WebAssembly version of PFFFT to make it available for browsers and node.js environments. It is called pffft.wasm and available on GitHub.
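For those curious what using PFFFT looks like, the underlying C API is compact. Below is a minimal sketch, in C-style code, of a real forward transform of 1024 samples. Note that this is the native API that gets compiled to WebAssembly, not the JavaScript wrapper exposed by pffft.wasm, and that the buffer contents and printed value are only illustrative. PFFFT comes with its own aligned allocator since the SIMD code paths assume aligned buffers.

#include "pffft.h"
#include <stdio.h>

int main() {
  const int N = 1024; // transform size: real transforms need a multiple of 32
  PFFFT_Setup *setup = pffft_new_setup(N, PFFFT_REAL);

  // PFFFT expects aligned buffers, so use its own allocator
  float *input  = (float *) pffft_aligned_malloc(N * sizeof(float));
  float *output = (float *) pffft_aligned_malloc(N * sizeof(float));
  float *work   = (float *) pffft_aligned_malloc(N * sizeof(float));

  for (int i = 0; i < N; i++) input[i] = 0.0f; // fill with a real audio frame here

  // forward FFT with the output bins in ordered (canonical) layout
  pffft_transform_ordered(setup, input, output, work, PFFFT_FORWARD);

  printf("first output value: %f\n", output[0]);

  pffft_aligned_free(work);
  pffft_aligned_free(output);
  pffft_aligned_free(input);
  pffft_destroy_setup(setup);
  return 0;
}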

The pffft.wasm library comes in two flavours. One is compiled with SIMD instructions while the other comes without these instructions. SIMD stands for ‘single instruction, multiple data’ and does what it advertises: in a single step it processes multiple datapoints. The aim of SIMD is to make calculations several times faster. Especially for workloads where the same calculations are repeated over and over again on similar data, SIMD optimisation is relevant. FFT calculation is such a workload.
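To make the idea concrete, here is a minimal, hypothetical example, unrelated to the internals of pffft, of what SIMD looks like for WebAssembly: two float arrays are added four elements at a time using the intrinsics from Emscripten's wasm_simd128.h header. It needs to be compiled with the -msimd128 flag.

#include <wasm_simd128.h>

// Add two float arrays, four elements per instruction, with a scalar tail
// for the remaining elements.
void add_arrays(const float *a, const float *b, float *result, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    v128_t va = wasm_v128_load(&a[i]);
    v128_t vb = wasm_v128_load(&b[i]);
    wasm_v128_store(&result[i], wasm_f32x4_add(va, vb));
  }
  for (; i < n; i++) {
    result[i] = a[i] + b[i];
  }
}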

Evidently the SIMD version is much faster but there is no need to take my word for it. Below you can benchmark the SIMD version of pffft.wasm and compare it with the non-SIMD version on your machine. A pure Javascript FFT library called FFT.js serves as a baseline.

When running the same benchmark on Firefox and on Chrome it becomes clear that FFT.js runs about twice as fast on Chrome as on Firefox, thanks to a Javascript engine that handles this workload better. The performance of the WebAssembly versions in Chrome and Firefox is nearly identical. Safari unfortunately does not (yet) support SIMD WebAssembly binaries and fails to complete the benchmark.

The source code, the limitations and other info can be found at the pffft.wasm GitHub repository.

Edit: PulseFFT might be of interest as well: a (as far as I can tell non-SIMD) WASM version of KissFFT.



~ Panako 2.0 - Updates for an acoustic fingerprinting system

At the online ISMIR 2021 conference I presented updates to Panako, an audio fingerprinting system:

This work presents updates to Panako, an acoustic fingerprinting system that was introduced at ISMIR 2014. The notable feature of Panako is that it matches queries even after a speedup, time-stretch or pitch-shift. It is freely available and has no problems indexing and querying 100k sea shanties. The updates presented here improve query performance significantly and allow a wider range of time-stretch, pitch-shift and speed-up factors: e.g. the top 1 true positive rate for 20s queries that were sped up by 10 percent increased from 18% to 83% from the 2014 version of Panako to the new version. The aim of this short write-up is to reintroduce Panako, evaluate the improvements and highlight two techniques with wider applicability. The first of the two techniques is the use of a constant-Q non-stationary Gabor transform: a fast, reversible, fine-grained spectral transform which can be used as a front-end for many MIR tasks. The second is how near-exact hashing is used in combination with a persistent B-Tree to allow some margin of error while maintaining reasonable query speeds.

Together with the paper there is also a poster and a short video presentation which explains the work:


~ Decoding LTC and SMPTE on Teensy - Now using interrupts

Have you ever found yourself wondering how to build an accurate, low-latency LTC decoder with a common micro-controller? Well! Wonder no more and read on! Or, stop reading and do go read something that is more appealing to your predispositions.

SMPTE timecodes were originally used to synchronize audio and video material. SMPTE timecode data is often encoded into audio using LTC or linear time code. This special audio stream can be recorded together with other audio and video material. By decoding the LTC audio afterwards and working back to SMPTE timecodes, synchronization of multiple camera angles and audio material becomes straightforward. This concept of tagging data streams with SMPTE timecodes is also used for other types of data.


Fig: LTC is a ‘self-clocking’ protocol for which a period can be found automatically. Once the period is found, transitions within the period are counted. A period with a transition translates to a 1, a period without any transitions to a 0.

SMPTE timecode supports up to 30 frames per second and this resolution might not be sufficient for some data streams. It would help if each frame could be split up so that 60 or 120 pulses per second could be generated. With a low latency LTC decoder it would be possible to support this case and, for example, provide four pulses for every SMPTE frame. To be more precise: a SMPTE frame consists of 80 bits and in this case we would send a pulse exactly when decoding bit 0, bit 20, bit 40 and bit 60. Since 30 frames per second times four pulses per frame gives 120 pulses per second, we would be able to sample at 120Hz while staying in sync with the SMPTE timecode.

My first attempt was to treat the signal as audio and use a ready-built library for LTC audio decoding. The problem with that approach is that the audio is sampled at a rate which might not match the SMPTE bit transition period exactly, and relatively large buffers are used to decode the LTC. Bit-exact decoding is not possible with this method: the latency is too large and it uses excessive computational power and memory.

Fig: Biasing circuit to offset voltages from zero centred to having a bias.

In my second attempt, the current iteration, interrupts are used to detect rising and falling edges in the LTC stream. By counting the number of microseconds between these edges a bit string is constructed. This effectively decodes LTC without any wasted computational power or memory and at a very low latency. If the LTC stream is well-formed, following each incoming bit and reacting to it becomes straightforward. Finally, after gently massaging the LTC bit string, SMPTE timecodes ooze out of the system at a low latency.
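The core of this approach can be sketched in a few lines of Teensy/Arduino style C++. The sketch below is a simplified, hypothetical version: the pin number, the hard-coded threshold and the frame handling are illustrative only. A real decoder derives the bit period from the self-clocking signal instead of assuming 30fps, and also detects the LTC sync word before emitting SMPTE timecodes.

const int LTC_PIN = 2;                  // hypothetical input pin
volatile unsigned long lastEdgeUs = 0;  // time of the previous edge
volatile unsigned int bitIndex = 0;     // position within the 80 bit LTC frame
volatile bool halfBitSeen = false;      // a short interval was seen, waiting for its twin
volatile uint8_t ltcFrame[10];          // 80 bits = 10 bytes

// At 30fps a frame carries 80 bits: 30 x 80 = 2400 bits per second, so a full
// bit period is about 417µs and a half period about 208µs. A threshold in
// between separates the two.
const unsigned long HALF_BIT_THRESHOLD_US = 313;

void storeBit(bool value) {
  if (value) ltcFrame[bitIndex / 8] |=  (1 << (bitIndex % 8));
  else       ltcFrame[bitIndex / 8] &= ~(1 << (bitIndex % 8));
  bitIndex = (bitIndex + 1) % 80;
}

void onEdge() {
  unsigned long now = micros();
  unsigned long interval = now - lastEdgeUs;
  lastEdgeUs = now;

  if (interval < HALF_BIT_THRESHOLD_US) {
    // short interval: a transition within the bit period, two of them make a 1
    if (halfBitSeen) { storeBit(true); halfBitSeen = false; }
    else             { halfBitSeen = true; }
  } else {
    // long interval: a bit period without an inner transition, a 0
    storeBit(false);
    halfBitSeen = false;
  }
}

void setup() {
  pinMode(LTC_PIN, INPUT);
  attachInterrupt(digitalPinToInterrupt(LTC_PIN), onEdge, CHANGE);
}

void loop() {
  // inspect ltcFrame and bitIndex here: find the sync word, decode the SMPTE
  // timecode and send out the derived pulses
}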

I have implemented a low latency LTC and SMPTE timecode data decoder for a Teensy microcontroller. One of the current limitations is that only 30fps SMPTE without skipped frames is supported. Another limitation is that the precision of the derived 120Hz clock depends on the sampling rate of the encoded audio signal: if e.g. only 8000Hz is used, transitions can only be located with 125µs precision. The derived clock will jitter slightly but will not drift.

There is still a slight problem with audio and the Teensy input: audio is generally transmitted as a signal from -1.8V to +1.8V and not – as a Teensy expects – from 0 to 3.3V. To adapt the signal, a small biasing circuit is placed before the Teensy input. In my case two 100k resistors and a 0.1uF capacitor worked best. The interrupt-based approach is relatively robust against signals that are clipping (outside the 0 – 3.3V range) or slightly too quiet. If the signal becomes too small, LTC decoding obviously fails.

For more information and updates see the GitHub Repository for the low latency LTC and SMPTE timecode data decoder.


~ Updates for Panako - an acoustic fingerprinting system

Panako is an acoustic fingerprinting system I developed a couple of years ago. With acoustic fingerprinting systems it is possible to find duplicates in digital music archives and compare meta-data or identify unlabelled audio fragments. In the margins of my post-doc project working with large music archives, I have found the time to update Panako significantly. The updates simplify, improve and speed up Panako.

Fig: General content based audio search scheme.

The main algorithms are simplified. There is also a reduction of dependencies and a refocus to core functionality. This also simplifies building the software. The retrieval characteristics are improved, mainly thanks to the use of a fine-grained Gabor transform. Also new is the near-exact hashing construct which helps with off-by-one issues when matching time bins. The key-value store used is now LMDB, which speeds up the query performance of Panako significantly. The updates should make Panako stand the test of time somewhat better.
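The near-exact hashing idea itself is broadly applicable and can be illustrated with a small conceptual C++ sketch. This is not Panako's actual implementation, which is written in Java and stores its fingerprints in LMDB; the point is only that hashes live in an ordered key-value store and a lookup becomes a small range query, so a hash that ends up off by one, for example because an event falls on a time bin boundary, still matches.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// A fingerprint match: which reference track and which time bin it came from.
struct Match { int trackId; int timeBin; };

// An ordered map stands in for the persistent B-tree: keys are fingerprint
// hashes, values identify where the fingerprint was extracted.
std::multimap<uint64_t, Match> fingerprintIndex;

// Look up a query hash while tolerating an off-by-one error in the hash value.
std::vector<Match> nearExactLookup(uint64_t queryHash) {
  std::vector<Match> results;
  auto from = fingerprintIndex.lower_bound(queryHash == 0 ? 0 : queryHash - 1);
  auto to   = fingerprintIndex.upper_bound(queryHash + 1);
  for (auto it = from; it != to; ++it) results.push_back(it->second);
  return results;
}

int main() {
  fingerprintIndex.insert({1000, Match{42, 17}});
  fingerprintIndex.insert({1001, Match{43, 80}}); // off by one, still found
  for (const Match &m : nearExactLookup(1000))
    printf("track %d, time bin %d\n", m.trackId, m.timeBin);
  return 0;
}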

A more complete list of updates can be found below and on the Panako GitHub repository:

  • The number of dependencies has been drastically cut by removing support for multiple key-value stores.
  • The key-value store has been changed to a faster and simpler system (from MapDB to LMDB).
  • The SyncSink functionality has been moved to another project (with Panako as dependency).
  • The main algorithms have been replaced with simpler and better working versions:
    • Olaf is a new implementation of the classic Shazam algorithm.
    • The algorithm described in the Panako paper has also been replaced. The core ideas are still the same. The main change is the use of a Gabor transform to go from the time domain to the spectral domain (previously a constant-Q transform was used). The Gabor transform is implemented by JGaborator, which in turn relies on the Gaborator C++ library via JNI.
  • Folder structure has been simplified.
  • The UI which was mainly used for debugging has been removed.
  • A new set of helper scripts has been added in the scripts directory. They help with evaluation, parsing results, checking results, building Panako, creating documentation,…
  • The default Panako location has been changed to ~/.panako, so users can install and use Panako more easily (without needing sudo rights).

Fig: An interactive CLI session with Panako.


~ SyncSink - Synchronize media by aligning audio

I have just released a new version of SyncSink. SyncSink is a tool to synchronize media files with shared audio. It is ideal to synchronize video captured by multiple cameras or audio captured by many microphones. It finds a rough alignment between audio captured from the same event and subsequently refines that offset with a cross-correlation step. Below you can see SyncSink in action or you can try out SyncSink (you will need ffmpeg and Java installed on your system).

SyncSink used to be part of the Panako acoustic fingerprinting system but I decided that it was better to keep the Panako package focused and made a separate repository for SyncSink. More information can be found at the SyncSink GitHub repo.

SyncSink is a tool to synchronize media files with shared audio. SyncSink matches and aligns shared audio and determines offsets in seconds. With these precise offsets it becomes trivial to sync files. SyncSink is, for example, used to synchronize video files: when you have many video captures of the same event, the audio attached to these video captures is used to align and sync multiple (independently operated) cameras.

Evidently, SyncSink can also synchronize audio captured from many (independent) microphones if some environmental sound is shared with (leaked into) each recording.
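The cross-correlation refinement step mentioned above can be illustrated with a short conceptual C++ sketch. This is not SyncSink's actual code; it only shows the idea: starting from the rough offset produced by the fingerprint match, one signal is slid over the other within a small search window and the lag with the highest correlation wins.

#include <cstddef>
#include <vector>

// Refine a rough offset (in samples) between two audio signals by brute-force
// cross-correlation within +/- searchRadius samples around that offset.
long refineOffset(const std::vector<float> &a, const std::vector<float> &b,
                  long roughOffset, long searchRadius) {
  long bestLag = roughOffset;
  double bestScore = -1e300;
  for (long lag = roughOffset - searchRadius; lag <= roughOffset + searchRadius; lag++) {
    double score = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
      long j = (long) i + lag;
      if (j >= 0 && j < (long) b.size()) score += (double) a[i] * b[j];
    }
    if (score > bestScore) { bestScore = score; bestLag = lag; }
  }
  return bestLag; // offset in samples; divide by the sample rate to get seconds
}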


Fig: SyncSink in action: syncing some audio files


~ Calling JNI code from multiple Java threads: sharing state

Fig: Mapping Java threads to C++ states in a JNI bridge.

This post deals with the problem of using stateful C++ code from multiple Java threads. With JNI (Java Native Interface) it is possible to glue C++ code to a Java environment. There are many helpful tutorials on how to call C++ code and receive results. JNI helps to reuse existing, often highly complex and computationally expensive, C++ code.

The introductory tutorials often stop once it is made clear how to repackage (simple) datatypes and do not mention threads. It is, however, reasonable to expect JNI code to take thread-safety and proper multi-threading into account. In all but the simplest cases it is not that straightforward to share state at the C++ side and allow JNI code to be called from multiple Java threads. Incorrectly sharing state can lead to memory leaks, segmentation faults (segfaults) and application crashes. In what follows, a way to share thread-local state is presented.

It is quite common to have an init, work and dispose method: create a state, use that state to do some work and finally dispose of the used resources. Each Java thread independently calls these methods and expects results. These results should not change if multiple Java threads are calling the same methods. In other words: the state should remain local to each Java thread. A typical Java class could look like the code below.

With the Java code in mind, the C++ code should know which Java thread is used and which state needs to be used for the work. Luckily there is a way to find out: The JNI specification states that each JNIEnv is local to a Java thread. So we can use the JNIEnv pointer to identify a thread. This is the idea that is used below.

The code maps a JNIEnv pointer to a structure with (any) state information. An unordered map is used for this mapping. There is, however, still a problem: multiple threads can call the init method at once. So multiple threads potentially write to the unordered_map at the same time which leads to problems. To prevent this from happening a mutex is used. The mutex, together with a unique lock, makes sure that only a single thread writes to the unordered map. The same holds for the dispose method.

The work method does not need a unique lock since it does not write to the unordered map and reading from multiple threads is no problem.

#include <unordered_map>
#include <mutex>

const int DATA_ARRAY_SIZE = 300000 * 2;

struct BridgeState{
  jfloat *data;
};

//A hash map with a JNIEnv * as key and a BridgeState * as value
std::unordered_map<uintptr_t, uintptr_t> stateMap;

//A mutex to ensure that writes to the stateMap are synchronized.
std::mutex stateMutex;

JNIEXPORT jint JNICALL Java_init(JNIEnv * env, jobject object){
  //Makes sure only one thread writes to the stateMap
  std::unique_lock<std::mutex> lck (stateMutex);

  BridgeState * state =  new BridgeState();
  uintptr_t env_addresss = reinterpret_cast<uintptr_t>(env);

  state->data = new jfloat[DATA_ARRAY_SIZE]();
 
  uintptr_t state_addresss = reinterpret_cast<uintptr_t>(state);
  stateMap[env_addresss] = state_addresss;
  
  return 1;
}

JNIEXPORT jint JNICALL Java_work(JNIEnv * env, jobject object){
  //get a ref to the state pointer
  uintptr_t env_addresss = reinterpret_cast<uintptr_t>(env);
  BridgeState * state = reinterpret_cast<BridgeState *>(stateMap[env_addresss]);

  //do something with state->data, e.g. calculate the sum
  int sum = 0;
  for(int i = 0 ; i < DATA_ARRAY_SIZE ; i++){
    state->data[i] = state->data[i] + 1;
    sum += (int) state->data[i];
  }
  return sum; 
}

JNIEXPORT jint JNICALL Java_dispose(JNIEnv * env, jobject object){
  //Makes sure only one thread writes to the stateMap
  std::unique_lock<std::mutex> lck (stateMutex);

  uintptr_t env_addresss = reinterpret_cast<uintptr_t>(env);
  BridgeState * state = reinterpret_cast<BridgeState *>(stateMap[env_addresss]);
  stateMap.erase(env_addresss); 

  //cleanup memory
  delete [] state->data;
  delete state;
  return 0;
}
public class Bridge{
        
  //Load a native library
  static {
    try {
      System.loadLibrary("bridge");
    } catch (UnsatisfiedLinkError e){
      e.printStackTrace();
    }
  }
  
  private native int init();

  private native int work();

  private native int dispose();

  public static void main (String[] args){
    //start work on 20 threads
    for(int i = 0 ; i < 20 ; i++){
      new Thread(new Runnable() {
        @Override
        public void run() {
          Bridge b = new Bridge();
          b.init();
          b.work();
          b.dispose();
        }
      }).start();
    }
  }
}

This conceptual code has been lifted from a JNI library doing actual work: the JGaborator JNI bridge. If you need more information on how to compile and use this construct in actual code, please have a look at the JGaborator GitHub repository.


~ JGaborator Updated - Fine grained spectral transforms from Java

I have updated the JGaborator library. The library quickly calculates fine-grained constant-Q spectral representations of audio signals from Java. Such a spectral transform can be used for visualisation or as a front-end for audio processing or music information retrieval applications.

The calculation of the Gabor transform is done by a C++ library named Gaborator. JGaborator provides a Java Native Interface (JNI) bridge to that library. Thanks to the recent updates, the native library is now unpacked automatically, which makes it easy to use on the supported platforms (Intel macOS and x64 Linux).

The new version of JGaborator now also allows multiple Java threads to call the transform. This has the potential to speed up some audio processing chains dramatically.

The visualisation parts of JGaborator also received light touch-ups. Below, a number of screenshots show spectral representations of several audio files. If you want to try it yourself, download the JGaborator JAR-file. Note that it should only work out of the box on Intel macOS and x64 Linux with ffmpeg installed on your path. For other environments, please read and follow the JGaborator instructions to get it working.


~ Music-based biofeedback to reduce tibial shock in over-ground running: a proof-of-concept study


For the last couple of years there has been a fruitful collaboration between the systematic musicology (IPEM) and sports science departments at Ghent University. IPEM has a rich history of fundamental research on the link between movement and music. In a newly published proof-of-concept study this music-movement link is used to improve running style: the runner is equipped with a musical biofeedback system to lower foot impact. For more details, see:

Music-based biofeedback to reduce tibial shock in over-ground running: a proof-of-concept study, published in Scientific Reports (2021) by Van den Berghe, P., Lorenzoni, V., Derie, R. et al.

Abstract Methods to reduce impact in distance runners have been proposed based on real-time auditory feedback of tibial acceleration. These methods were developed using treadmill running. In this study, we extend these methods to a more natural environment with a proof-of-concept. We selected ten runners with high tibial shock. They used a music-based biofeedback system with headphones in a running session on an athletic track. The feedback consisted of music superimposed with noise coupled to tibial shock. The music was automatically synchronized to the running cadence. The level of noise could be reduced by reducing the momentary level of tibial shock, thereby providing a more pleasant listening experience. The running speed was controlled between the condition without biofeedback and the condition of biofeedback. The results show that tibial shock decreased by 27% or 2.96 g without guided instructions on gait modification in the biofeedback condition. The reduction in tibial shock did not result in a clear increase in the running cadence. The results indicate that a wearable biofeedback system aids in shock reduction during over-ground running. This paves the way to evaluate and retrain runners in over-ground running programs that target running with less impact through instantaneous auditory feedback on tibial shock.


~ ISMIR 2020 - Virtual Conference


From 11-16 October 2020 the latest instalment of the ISMIR conference series was held. Due to the pandemic, the 21st ISMIR conference was the first virtual one. As usual, participants and presenters from around the world joined the conference, but for the first time they did not all have to synchronise their circadian rhythms. By repeating most events with 12 hours in between, the organisers managed to put together a schedule that suited nearly all participants.

The virtual format had some clear advantages: no travel was needed, so the (environmental) cost was low. Attendance fees were lower than usual since no venues or catering were needed. This democratised the conference experience and attendance reached a record high.

The scientific program of the conference was impressive and varied. At the conference's Late Breaking/Demo session I presented Olaf: Overly Lightweight Acoustic Fingerprinting.


Previous blog posts

30-09-2020 ~ PaPiOM: Patterns in Pitch Organization in Music

20-08-2020 ~ Olaf - Acoustic fingerprinting on the ESP32 and in the Browser

06-12-2019 ~ LTC - SMPTE Decoder on Teensy

03-10-2019 ~ MIDImorphosis: recording audio and sensor data

10-09-2019 ~ LW Research Day 2019 on Digital Humanities

02-07-2019 ~ AAWM/FMA 2019 - Birmingham

13-03-2019 ~ trix: Realtime audio over IP

22-02-2019 ~ Audio marker finder

05-02-2019 ~ Validity and reliability of peak tibial accelerations as real-time measure of impact loading during over-ground rearfoot running at different speeds - Journal of Biomechanics

05-02-2019 ~ Nano4Sports in Team Scheire