Parallelizing FLAC Encoding
Fred Akalin
One thing I noticed ever since getting a multi-core system was that the reference FLAC encoder is not multi-threaded. This isn't a huge problem for most people as you can simply encode multiple files at the same time but I usually rip my audio CDs into a single audio file with a cue sheet instead of separate track files and so I am usually encoding a single large audio file instead of multiple smaller ones. Even so, encoding a CD-length audio file takes under a minute but I thought it would be a fun and useful weekend project to see if I could parallelize the simpler example encoder. The format specification indicates that input blocks are encoded independently which makes the problem embarassingly parallel and trawling through the FLAC mailing lists reveals that no one has had the time nor the inclination to look into it.
However, I was able to write a multithreaded FLAC encoder that achieves near-linear speedup with only minor hacks to the libFLAC API. Here are some encode times on an 8-core 2.8 GHz Xeon 5400 for a 636 MB wave file (some caveats are discussed below):
baseline | 34.906s |
---|---|
1 threads | 31.424s |
2 threads | 16.936s |
4 threads | 10.173s |
8 threads | 6.808s |
I took the simple approach of sharding the input file into n roughly equal pieces and passing them to n encoder threads, assembling the output file from the n output buffers. In general this is not a good way of partitioning the workload as time is wasted if one shard takes significantly more time to process but for my use case this isn't a significant problem.
#include <fcntl.h> /* open() */
#include <sys/mman.h> /* mmap()/munmap() */
#include <sys/stat.h> /* stat() */
#include <unistd.h> /* close() */
int main(int argc, char *argv[]) {
int fdin;
struct stat buf;
char *bufin;
fdin = open(argv[1], O_RDONLY);
fstat(fdin, &buf);
bufin = mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fdin, 0);
/* The input file (passed in via argv[1]) is now mapped read-only to
the memory region in bufin up to bufin + buf.st_size. */
/* Note that you can work directly with the mapped input file
instead of fread()ing the header into a buffer. */
if((buf.st_size < WAV_HEADER_SIZE) ||
memcmp(bufin, "RIFF", 4) ||
memcmp(bufin+8, "WAVEfmt \020\000\000\000\001\000\002\000", 16) ||
memcmp(bufin+32, "\004\000\020\000data", 8)) {
/* Invalid input file: print error and exit. */
}
for (i = 0; i < num_threads; ++i) {
shard_infos[i].bufin = bufin + WAV_HEADER_SIZE + i * bytes_per_thread;
/* bufsize for the last thread may be slightly larger. */
shard_infos[i].bufsize = bytes_per_thread;
}
/* Spawn encode threads (which calls encode_shard() below) passing
an element of shard_infos to each. */
...
munmap(bufin, buf.st_size);
close(fdin);
}
FLAC__bool encode_shard(struct shard_info *shard_info) {
FLAC__StreamEncoder *encoder = FLAC__stream_encoder_new();
...
/* The input file is paged in lazily as this function accesses
bufin from shard_info->bufin. */
FLAC__stream_encoder_process_interleaved(encoder,
shard_info->bufin,
shard_info->bufsize);
...
FLAC__stream_encoder_delete(encoder);
}
However, handling the output file is a bit trickier. Since the
encoded FLAC data output by the threads vary in size we have to wait
until all encoding threads are done before we know the right offsets
to write the output data. A convenient and fast way to handle this is
to use asynchronous I/O; we know where to write the output data for a
shard as soon as the encoding thread for all previous shards finish so
we simply wait for the encoding threads in shard order and queue up a
write request after each thread finishes. Here I use the POSIX
asynchronous I/O API in my multithreaded encoder (again, slightly
paraphrased):
#include <aio.h> /* aio_*() */
#include <pthread.h> /* pthread_*() */
#include <string.h> /* memset() */
int main(int argc, char *argv[]) {
int fdout;
pthread_t threads[MAX_THREADS];
struct aiocb aiocbs[MAX_THREADS];
unsigned long byte_offset = 0;
/* mmap input file in. */
...
fdout = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC);
/* Spawn encode threads passing an element of shard_infos to
each. */
...
/* Wait for each thread in sequence and queue up output writes. */
/* We need to zero out any aiocb struct that we use before we fill
in any members. */
memset(aiocbs, 0, num_threads * sizeof(*aiocbs));
for (i = 0; i < num_threads; ++i) {
pthread_join(threads[i], NULL);
aiocbs[i].aio_buf = shard_infos[i].bufout;
aiocbs[i].aio_nbytes = shards_infos[i].bytes_written;
aiocbs[i].aio_offset = byte_offset;
aiocbs[i].aio_fildes = fdout;
aio_write(&aiocbs[i]);
byte_offset += shard_infos[i].bytes_written;
}
/* Wait for all output writes to finish. */
for (i = 0; i < num_threads; ++i) {
const struct aiocb *aiocbp = &aiocbs[i];
aio_suspend(&aiocbp, 1, NULL);
aio_return(&aiocbs[i]);
}
close(fdout);
}
The POSIX API is a bit unwieldy for this use case; ideally, there
would be a version of aio_suspend()
that would suspend the
calling process until all of the specified requests have completed.
As it is now the simplest way is to loop through the requests as
above, especially since the maximum number of simultaneous
asynchronous I/O requests is usually quite small (16 on my system).
Also, I found that the OS X implementation of aio_write()
did not obey this part of the specified behavior:
If O_APPEND is set for aiocbp->aio_fildes, aio_write() operations append to the file in the same order as the calls were made. If O_APPEND is not set for the file descriptor, the write operation will occur at the abso- lute position from the beginning of the file plus aiocbp->aio_offset.
but it was just as easy (and clearer) to explicitly set the correct offset.
I had to hack up libFLAC a bit to implement my multithreaded encoder.
I exposed the update_metadata_()
to make it easy to write the
correct number of total samples in the metadata field and also to zero
out the min/max framesize fields. I also exposed the
FLAC__stream_encoder_set_do_md5()
function (which it should
have been in the first place) so that I can turn off the writing of
md5 field in the metadata. Finally, I added the function
FLAC__stream_encoder_set_current_frame_number()
so that the
correct frame numbers are written at encode time.
For comparison purposes I turn off md5 calculation in my multithreaded
encoder as well as the baseline one. Since calling
FLAC__stream_encoder_set_current_frame_number()
causes
crashes with vericiation turned on I also turn that off. The numbers
above reflect that so they're underestimates of how a production
multithreaded encoder would perform. However, the essential behavior
of the program shouldn't change much.
Here is a patch file for the flac 1.2.1
source that implements the hacks I described
above. Here is the source for my multithreaded FLAC
encoder. I've tested it with i686-apple-darwin9-gcc-4.0.1
and i686-apple-darwin9-gcc-4.2.1
on Mac OS X. I got the
above numbers compiling
mt_encode.c
with gcc 4.2.1 and the switches -Wall
-Werror -g -O2 -ansi
.
Like this post? Subscribe to my feed or follow me on Twitter .