Outsmarting AI-Generated Audio

by Joshua McKenty

With modern AI software and services, a near-perfect replica of your voice can be produced with just 30 seconds of recorded audio. Passable imitations can be generated with as little as 5 seconds. And both can be produced in real-time. What is the right way to combat this threat?

There are two common approaches to combating synthetic media: provenance, and detection. Let's look at how they fare in this test.

Provenance for audio?

The basic mechanism of "provenance" technology is simple - it establishes a "chain of custody" from the original capture device to the output device. (From My Lips to God's Ears.) In the case of visual data, this provenance is a strong guarantee of authenticity - at least until we have life-size holographic displays. But for audio, we're already living in that unfortunate future: to the microphone, there is no discernible difference between your voice, a pre-recorded copy of your voice, and a synthetic impersonation of your voice. They all sound the same. While we can use various forms of steganography to establish when the microphone captured the audio, none of that helps to confirm where those sound waves came from. So raw audio provenance is of no use here.
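
To make that concrete, here's a minimal sketch of one way a capture-time attestation could work: the device hashes the raw samples together with a timestamp and signs the digest with a device key. (The function names and record format are ours, purely for illustration, and it uses the third-party cryptography package.) Note what it proves - and what it can't:

```python
import hashlib
import json
import time

from cryptography.hazmat.primitives.asymmetric import ed25519

def attest_capture(audio_bytes: bytes,
                   device_key: ed25519.Ed25519PrivateKey) -> dict:
    """Bind a hash of the captured audio to a timestamp, signed by the device."""
    record = {
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "captured_at": time.time(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    return {"record": record, "signature": device_key.sign(payload).hex()}

def verify_capture(attestation: dict, audio_bytes: bytes,
                   public_key: ed25519.Ed25519PublicKey) -> bool:
    """Check the device signature and that the audio matches the signed hash."""
    payload = json.dumps(attestation["record"], sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(attestation["signature"]), payload)
    except Exception:
        return False
    return attestation["record"]["sha256"] == hashlib.sha256(audio_bytes).hexdigest()

device_key = ed25519.Ed25519PrivateKey.generate()
audio = b"..."  # raw PCM bytes from the microphone (stand-in)
attestation = attest_capture(audio, device_key)
assert verify_capture(attestation, audio, device_key.public_key())
# This proves WHEN these bytes hit the microphone. But a loudspeaker replaying
# a voice clone into that same microphone earns an equally valid attestation -
# which is exactly why raw audio provenance fails.
```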

Detection tech: spotty at best, biased at worst

Detection technology, unfortunately, fares no better. The small imperfections that current detection approaches rely on to identify synthetic content produce spotty results at best. Worse still, every effort we make to improve detection actually makes the problem worse! (See our previous article on the vicious cycle of GAN development.) And there's a nasty side effect, too: many of today's detectors have an upsetting amount of bias baked into their training data. Rather than detecting synthetic audio, they're simply detecting non-native speakers. This might be passable in the stochastic land of large-scale platform content screening, which is where detection approaches are (most appropriately) applied - but for personal use, it's not a great look.
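
To see how thin the signal is, here's an illustrative sketch of the kind of artifact-hunting a naive detector might do. Everything here is hypothetical - the features and thresholds are ours, and real detectors are learned models rather than two-line rules - but the failure mode is the same: these statistics shift with accent, microphone, and codec just as readily as with synthesis.

```python
import numpy as np

def spectral_features(samples: np.ndarray, sr: int = 16000) -> tuple[float, float]:
    """Return (high-band energy ratio, spectral flatness) for one audio clip."""
    spectrum = np.abs(np.fft.rfft(samples)) + 1e-12
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    high_ratio = spectrum[freqs > 4000.0].sum() / spectrum.sum()
    flatness = np.exp(np.log(spectrum).mean()) / spectrum.mean()
    return float(high_ratio), float(flatness)

def looks_synthetic(samples: np.ndarray, sr: int = 16000) -> bool:
    """Naive rule: flag overly smooth spectra with little high-band energy.

    Illustrative thresholds only. A band-limited phone call from a non-native
    speaker can trip this same rule, while a well-trained vocoder sails
    straight through - spotty at best, biased at worst.
    """
    high_ratio, flatness = spectral_features(samples, sr)
    return flatness > 0.5 and high_ratio < 0.05
```

And that asymmetry - easy for the generator to fix, hard for the detector to keep exploiting - is the vicious cycle in action.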

So what to do, then? Audio provenance won't help, and detection doesn't either. Fortunately, there is a third option: hybrid analysis.

Solution: hybrid analysis

Detection algorithms can be used very effectively to tell whether or not a given audio source is in sync with a video signal. And provenance technology can reliably establish the authenticity of that video. Combining the two finally delivers what we've been searching for: a voice we can trust. (As an added bonus, we can also verify the identity of the speaker!)
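
For intuition, here's a toy sketch of the sync check: resample the audio's loudness envelope to the video's frame rate, then correlate it against a per-frame mouth-openness signal extracted from the provenance-verified video. (The interface and signals are hypothetical; production systems learn this audio-visual alignment with trained models rather than a raw correlation.)

```python
import numpy as np

def audio_envelope(samples: np.ndarray, sr: int, fps: float) -> np.ndarray:
    """RMS loudness per video frame, putting audio and video on one time base."""
    hop = int(sr / fps)
    frames = len(samples) // hop
    clipped = samples[: frames * hop].reshape(frames, hop)
    return np.sqrt((clipped ** 2).mean(axis=1))

def sync_score(envelope: np.ndarray, mouth_openness: np.ndarray,
               max_lag_frames: int = 5) -> float:
    """Best normalized correlation within a few frames of lag (roughly 0 to 1)."""
    a = (envelope - envelope.mean()) / (envelope.std() + 1e-12)
    b = (mouth_openness - mouth_openness.mean()) / (mouth_openness.std() + 1e-12)
    n = min(len(a), len(b))
    best = 0.0
    for lag in range(-max_lag_frames, max_lag_frames + 1):
        ai = a[max(lag, 0): n + min(lag, 0)]
        bi = b[max(-lag, 0): n - max(lag, 0)]
        m = min(len(ai), len(bi))
        best = max(best, float(np.dot(ai[:m], bi[:m]) / m))
    return best

# Toy demo: a shared slow modulation stands in for genuinely correlated speech.
rng = np.random.default_rng(0)
fps, sr, seconds = 25, 16000, 4
drive = rng.random(fps * seconds)                      # per-frame speech activity
mouth = drive + 0.1 * rng.standard_normal(len(drive))  # noisy mouth-openness track
samples = np.repeat(drive, sr // fps) * rng.standard_normal(sr * seconds)
print(sync_score(audio_envelope(samples, sr, fps), mouth))  # close to 1.0
```

A high score anchors the audio to a visual channel whose provenance we can actually verify - and a face match on that same video tells us who is speaking.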