Imagine you have recorded lots of very long audio files. Hours long, at least; gigabytes in size, at least. You can store these in standard formats such as WAV or FLAC, and you can share them via cloud storage too. But now imagine that you want to do data analysis involving random access, for example: "I only want to analyse a particular 10-second region from midday on the 7th of October." What can you do?
In most cases you'll need to download the entire WAV or FLAC file that contains your chunk. Depending on your software, you may also need to load the whole file into memory to find the excerpt you need. Bah! That's two very clunky and wasteful steps: downloading too much, and loading too much into memory.
Enter cloud-optimised data formats such as Zarr. With Zarr you can store and share multi-dimensional arrays, and when you want to access a chunk you simply specify the range you want; Zarr only needs to transfer the stored chunks that cover that range.
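To make that concrete, here's a tiny sketch of the chunk arithmetic involved (the helper is my own illustration, not part of the Zarr API): given a requested sample range and a chunk size, only a handful of stored chunks need to be fetched, however big the whole array is.

```python
def chunks_needed(start, stop, chunk_size):
    """Indices of the stored chunks that cover samples [start, stop)."""
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))

# A 10-second stereo read at a hypothetical 48 kHz, with 4096-sample chunks:
sr = 48000
start = 123_456_789
needed = chunks_needed(start, start + 10 * sr, 4096)
print(len(needed))  # 119 chunks out of potentially millions
```

A cloud-hosted Zarr store would only have to transfer those 119 small chunks, not the whole array.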
So: can we use this for audio? Does it even make sense? I think so: audio data usually only has one "long" axis (the time dimension), but we record lots of long soundscape audio, and it's hard to share these datasets conveniently. I suspect that Zarr can help us with cloud-hosting audio data. And one more thing: Zarr's pluggable compression codecs even allow us to use FLAC, which is designed for audio, meaning we should be able to get great data compression AT THE SAME TIME as making the data really convenient for analysis.
I'm new to Zarr. But it seems easy to use. Here's my method and Python code to try Zarr with FLAC:
First here's how I set up a virtual environment for it:
python -m venv venv_zarr
source ./venv_zarr/bin/activate
pip install -U pip
pip install ipython zarr librosa flac-numcodecs
Then I'll simply use librosa to read the file, and Zarr to write it out to disk: once with Zarr's default compressor (Blosc) and once with FLAC.
Note that you need to be careful to decide what audio sample format you want to use. Since my sound file is stored on disk as 16-bit, I'll use the "int16" dtype in Python, so that the audio data points keep the same precision in every case.
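To illustrate why the dtype matters (a pure-Python sketch, separate from the main code below): a 16-bit sample survives a float round-trip only if the exact same scaling factor is used in both directions, so asking librosa for "int16" directly sidesteps the issue entirely — and FLAC is an integer-PCM format anyway.

```python
# 16-bit samples at the extremes of the int16 range.
samples = [-32768, -1, 0, 1, 32767]

# Scale to float in [-1.0, 1.0) the way audio libraries typically do,
# then back again with the *same* factor:
as_float = [s / 32768.0 for s in samples]
back = [round(f * 32768.0) for f in as_float]
print(back == samples)  # True — lossless only because the scaling matches
```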
import librosa
import zarr
import os
from math import ceil
from flac_numcodecs import Flac
filename = os.path.expanduser("~/ZOOM0001_LR_Leiden_20230308_0800.WAV")
sr = librosa.get_samplerate(filename)
dur = librosa.get_duration(path=filename)
audio_dtype = "int16"
# NB I don't have any strong preferences about chunk size.
chunk_size = 4096
######################################
# First using Zarr default compression
stream = librosa.stream(filename,
                        block_length=chunk_size,
                        hop_length=chunk_size,
                        frame_length=chunk_size,
                        mono=False,
                        dtype=audio_dtype)
durspls = int(ceil(dur * sr))

# Create the zarr array.
z = zarr.open("/tmp/bigaudio_default.zarr", mode='w', shape=(2, durspls),
              chunks=(2, chunk_size), dtype=audio_dtype)

# Stream the audio through, block by block.
totspls = 0
for y_block in stream:
    z[:, totspls:totspls+y_block.shape[1]] = y_block
    totspls += y_block.shape[1]
stream.close()

z.info  # BLOSC: No. bytes stored: 1.1G, Storage ratio: 1.9
#######################################
# Second using FLAC compression in Zarr
stream = librosa.stream(filename,
                        block_length=chunk_size,
                        hop_length=chunk_size,
                        frame_length=chunk_size,
                        mono=False,
                        dtype=audio_dtype)
durspls = int(ceil(dur * sr))

# Create the zarr array, this time with a FLAC compressor.
flac_compressor = Flac(level=8)
z = zarr.open("/tmp/bigaudio_flac.zarr", mode='w', shape=(2, durspls),
              chunks=(2, chunk_size), dtype=audio_dtype,
              compressor=flac_compressor)

# Stream the audio through, block by block.
totspls = 0
for y_block in stream:
    z[:, totspls:totspls+y_block.shape[1]] = y_block
    totspls += y_block.shape[1]
stream.close()

z.info  # FLAC: No. bytes stored: 481.1M, Storage ratio: 4.3
And for completeness here's a simple test of loading a 10-second chunk from the middle of the file:
z = zarr.open("/tmp/bigaudio_flac.zarr", mode='r')
readoffset = 200_000_000
readdur = int(10 * sr)
subset = z[:, readoffset:readoffset+readdur]
print(subset)
It's that line subset = z[...] which loads some data into memory, giving me a numpy array that I can use as normal. And for me that was lightning-fast. It definitely didn't load the whole audio, and I didn't have to tell the system anything about how to achieve that :)
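Tying this back to the opening example ("a 10-second region from midday"): the read offset can be computed from wall-clock time. This is a sketch with a helper name, start time, and 48 kHz sample rate of my own invention (the filename hints at an 08:00 start, but check your own recording's metadata):

```python
from datetime import datetime

def time_to_sample(recording_start, moment, sr):
    # Sample index of a wall-clock moment, given the recording's start time.
    return int((moment - recording_start).total_seconds() * sr)

sr = 48000                                # assumed sample rate
start = datetime(2023, 3, 8, 8, 0, 0)     # assumed recording start time
noon = datetime(2023, 3, 8, 12, 0, 0)
readoffset = time_to_sample(start, noon, sr)
print(readoffset)  # 4 h * 3600 s * 48000 Hz = 691200000
```

That computed readoffset could then stand in for the hard-coded 200_000_000 in the read above.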
According to my usual filesize tool ("du"), the on-disk sizes line up with what z.info reported above: roughly 1.1G for the default (Blosc) store versus 481M for the FLAC store.
Great! The FLAC settings might not be identical in the two setups, but that doesn't matter here. We get the best of both worlds: highly compressed audio data, and lightning-fast access to arbitrary little chunks of the massive file. Specifically, it's really notable that for soundscape audio the default compression (Blosc) produces a store more than twice the size of the FLAC one (storage ratios of 1.9 versus 4.3).