Researchers are in an arms race with hackers to create ever-better security systems to prevent data theft, including two-factor authentication, thumbprints and retinal scans. One type of security system gaining popularity is automatic speaker identification, which uses a person's voice as a passcode.
These systems, already in use for phone banking and other applications, are good at weeding out digital attacks—for instance, if someone plays a recording or digitally manipulated voice over the phone. But digital security engineers at the University of Wisconsin-Madison have found the systems are not quite as foolproof when it comes to a novel analog attack. By speaking through customized PVC pipes, the type found at most hardware stores, they were able to trick the machine learning algorithms behind automatic speaker identification systems.
The team, led by PhD student Shimaa Ahmed and Kassem Fawaz, an associate professor of electrical and computer engineering, presented their findings at the USENIX Security Symposium, held August 9-11 in Anaheim, California.
“There are a lot of commercial companies selling this technology and a lot of banks using it. It’s also used for personal assistants like Siri. The systems are advertised now as secure as a fingerprint, but that’s not very accurate,” says Ahmed. “All of those are susceptible to attacks on speaker identification. The attack we developed is very cheap; just get a tube from the hardware store and change your voice.”
The project began when the team started looking into automatic speaker identification, probing the proprietary security systems for weaknesses. They found that if they spoke through their hands or talked into a box instead of speaking clearly, the models did not behave as expected.
Ahmed investigated whether it was possible to alter the resonance, or specific frequency vibrations, of a voice to defeat the security system. Because her work started in the middle of the COVID-19 quarantine, she first tested the idea with paper towel tubes. Later, after returning to the lab, the group hired Yash Wani, then an undergraduate and now a PhD student, to help modify PVC pipes in the UW Makerspace. Using various diameters of pipe purchased at a local hardware store, Ahmed, Wani and their team altered the length and diameter of the pipe until they could produce the same resonance as a target voice.
Eventually, the team developed an algorithm that determines the tube dimensions needed to transform the resonance of almost any voice into that of another. In fact, the researchers found that mimicking a target's resonance with the tube attack spoofed 60% of the voices in a test set of 91 speakers, while unaltered human impersonators fooled the systems only 6% of the time.
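The researchers' actual dimension-finding algorithm is not reproduced here, but the acoustics behind it can be sketched. An idealized cylindrical tube open at both ends resonates near f ≈ v / (2L), where v is the speed of sound and L the tube's effective length, so picking a length places a resonant peak at a chosen frequency. The following is a minimal Python sketch under that simplified open-tube assumption; the function names, the 4 cm diameter and the 500 Hz target are illustrative, not values from the paper.

```python
# Crude open-tube resonance model (illustrative only; not the authors'
# published algorithm). For a cylinder open at both ends, the n-th
# resonance is approximately
#     f_n = n * v / (2 * (L + 0.6 * d))
# where v = speed of sound, L = tube length, d = diameter, and 0.6*d is
# the standard end correction (about 0.3*d at each open end).

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C

def resonant_frequency(length_m: float, diameter_m: float, harmonic: int = 1) -> float:
    """Approximate the n-th resonance (Hz) of an open cylindrical tube."""
    effective_length = length_m + 0.6 * diameter_m  # end correction
    return harmonic * SPEED_OF_SOUND / (2.0 * effective_length)

def tube_length_for_target(target_hz: float, diameter_m: float) -> float:
    """Invert the model: tube length (m) whose fundamental lands on target_hz."""
    return SPEED_OF_SOUND / (2.0 * target_hz) - 0.6 * diameter_m

if __name__ == "__main__":
    # Hypothetical example: aim the fundamental at a 500 Hz, formant-like target.
    d = 0.04  # 4 cm pipe diameter (illustrative)
    L = tube_length_for_target(500.0, d)
    print(f"Cut the tube to ~{L:.3f} m")                       # ~0.319 m
    print(f"Check: fundamental = {resonant_frequency(L, d):.1f} Hz")  # 500.0 Hz
```

Real vocal resonances involve several formants and a non-uniform vocal tract, so the team's method of matching a target voice is considerably more involved than this single-frequency estimate.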
The spoof attack works for a couple of reasons. First, because the sound is analog, it bypasses the voice authentication system’s digital attack filters. Second, the tube does not transform one voice into an exact copy of another, but instead spoofs the resonance of the target voice, which is enough to cause the machine learning algorithm to misclassify the attacking voice.
Fawaz says part of the motivation behind the project is simply to alert the security community that voice identification is not as secure as many people believe, though he notes that many researchers are already aware of the technology's flaws.
The project has a bigger goal as well. “We’re trying to say something more fundamental,” Fawaz says. “Generally, all machine learning applications that are analyzing speech signals make an assumption that the voice is coming from a speaker, through the air to a microphone. But you shouldn’t make assumptions that the voice is what you expect it to be. There are all sorts of potential transformations in the physical world to that speech signal. If that breaks the assumptions underlying the system, then the system will misbehave.”
Other authors include Ali Shahin Shamsabadi of the Alan Turing Institute; Mohammed Yaghini and Nicholas Papernot of the University of Toronto and the Vector Institute; and Ilia Shumailov of the University of Oxford and the Vector Institute.
The authors acknowledge support from DARPA (through the GARD program); the Wisconsin Alumni Research Foundation; the NSF (through awards CNS-1838733 and CNS-2003129); CIFAR (through a Canada CIFAR AI Chair); NSERC (under the Discovery Program and the COHESA strategic research network); a gift from Intel; and a gift from NVIDIA.
Featured image caption: Shimaa Ahmed, a PhD student working in the lab of Associate Professor Kassem Fawaz, devised a method of defeating automatic speaker identification systems using PVC pipe found at any hardware store. Credit: Todd Brown.