Speech recognition systems seduced by masked messages
The researchers behind the attack refer to these tweaked tunes – songs that issue mostly inaudible commands to speech recognition devices within earshot – as CommanderSongs.
In CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition, a paper distributed through the preprint service arXiv, the ten authors involved in the project – Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter – describe their technique for duping deep-learning models used to recognize speech with “adversarial perturbations.”
Adversarial attacks deceive AI systems by subtly altering input data so that a specific model produces the output the attacker wants. They’ve been explored extensively for images. For example, MIT students recently demonstrated that they could trick Google’s image recognition system into labeling a turtle as a rifle.
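For the curious, the gist of such an image attack can be sketched in a few lines of Python with PyTorch. The toy classifier, tensor shapes and step size below are purely illustrative assumptions – this is not Google’s system or the MIT students’ code, just a minimal version of the standard fast-gradient-sign recipe for nudging pixels in the direction that confuses a model.

```python
# Hypothetical sketch: a fast-gradient-sign perturbation against a toy
# image classifier. The model, input shapes and epsilon are illustrative
# assumptions, not anything from the CommanderSong paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier: flattens a 3x32x32 "image" and maps it to 10 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

image = torch.rand(1, 3, 32, 32)   # a fake input image
true_label = torch.tensor([3])     # its (pretend) correct class
epsilon = 0.03                     # maximum per-pixel change

image.requires_grad_(True)
loss = nn.functional.cross_entropy(model(image), true_label)
loss.backward()

# Step each pixel slightly in the direction that increases the loss,
# then clamp back to a valid pixel range.
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("original prediction:   ", model(image).argmax(dim=1).item())
print("adversarial prediction:", model(adversarial).argmax(dim=1).item())
```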
Alterations
Less work has been done on audio and speech recognition. The researchers say that while image pixels can easily be altered to trip up algorithms without noticeable visual artifacts, it isn’t obvious whether audio attacks can slip under the radar in the same way, because alterations added to voice recordings typically cannot be recognized by voice-controlled devices like Amazon Echo.
Last year, a different group of clever people proposed what they called DolphinAttack, a technique for manipulating software-based voice recognition apps using sound outside the range of human hearing. That approach, however, can be mitigated by technology capable of suppressing ultrasound signals.
The CommanderSong researchers – from the State Key Laboratory of Information Security (SKLOIS), University of Chinese Academy of Sciences, Florida Institute of Technology, University of Illinois at Urbana-Champaign, IBM T. J. Watson Research Center, and Indiana University – say their technique differs in two ways: it does not rely on any other technology to hide the command, and it cannot be blocked by audio frequency filters.
“Our idea to make a voice command unnoticeable is to integrate it in a song,” they explain in their paper. “In this way, when the crafted song is played, the [speech recognition] system will decode and execute the injected command inside, while users are still enjoying the song as usual.”
In a phone interview with The Register, Gunter, a computer science professor at the University of Illinois, said that while previous work has shown garbled sounds can trigger voice recognition systems, masking the command in a song would be less noticeable because music is so often playing in the background anyway.
“It has a more practical attack vector,” he said.
The researchers started with a randomly selected song and a command track generated by a text-to-speech engine. They then decoded each audio file using the open-source Kaldi speech-recognition toolkit and extracted the output of its deep neural network (DNN).
After identifying the specific DNN outputs that represent the desired command, they manipulated the song and command audio using gradient descent, a machine-learning optimization algorithm.
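The paper builds this step on Kaldi’s own acoustic model; as a rough, hypothetical illustration of the optimization alone, the PyTorch sketch below adjusts a perturbation on a framed song waveform so that a stand-in network emits a chosen sequence of outputs, while penalizing how far the audio drifts from the original. The toy model, frame length, target labels and loss weight are assumptions for illustration, not the authors’ implementation.

```python
# Hypothetical sketch of the optimization idea only: gradient descent on a
# perturbation added to a song so that a (stand-in) acoustic model produces
# the DNN outputs associated with the attacker's command. The toy model,
# frame length, target labels and loss weight are illustrative assumptions,
# not the CommanderSong authors' Kaldi-based implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)

FRAME = 400       # samples per frame (assumed)
N_FRAMES = 50     # frames in the clip (assumed)
N_OUTPUTS = 200   # size of the acoustic model's output layer (assumed)

# Stand-in for the speech recognizer's DNN acoustic model:
# one frame of audio in, a score for each output unit out.
acoustic_model = nn.Sequential(nn.Linear(FRAME, 128), nn.ReLU(),
                               nn.Linear(128, N_OUTPUTS))
acoustic_model.eval()

song = torch.randn(N_FRAMES, FRAME)                # the original song, framed
target = torch.randint(0, N_OUTPUTS, (N_FRAMES,))  # DNN outputs for the command

# The perturbation is what gradient descent actually adjusts.
delta = torch.zeros_like(song, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=0.01)

for step in range(300):
    optimizer.zero_grad()
    logits = acoustic_model(song + delta)
    # Push each frame toward the command's target output...
    command_loss = nn.functional.cross_entropy(logits, target)
    # ...while keeping the change to the song small enough to pass as noise.
    distortion = delta.pow(2).mean()
    loss = command_loss + 10.0 * distortion
    loss.backward()
    optimizer.step()

adversarial_song = (song + delta).detach()
hits = (acoustic_model(adversarial_song).argmax(dim=1) == target).float().mean()
print(f"frames decoded as the command: {hits.item():.0%}")
```

In a real attack, the weight on the distortion term is the trade-off that matters: too low and the added sound becomes obvious to listeners, too high and the recognizer no longer hears the command.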
Chord cutters
In essence, they used their knowledge of the way the audio would be processed to ensure the speech recognition system would hear the command within the music.
The result is adversarial audio – songs containing a command interpretable by Kaldi code but unlikely to be noticed by a human listener.
The altered audio may be perceptible to a listener, but it’s doubtful the added sound would be recognized as anything other than distortion.
“You mistake some of these signals as defects in the media,” said Gunter, allowing that some songs masked the command better than others. “Some of the examples, they would make you grimace. Others are more subtle.”
The researchers tested a variety of in-song commands delivered directly to Kaldi as audio recordings, such as: “Okay Google, read mail” and “Echo, open the front door.” The success rate of these was 100 per cent.
They also tested in-song commands delivered audibly, where environmental noise can hinder recognition, including “Echo, ask Capital One to make a credit card payment” and “Okay Google, call one one zero one one nine one two zero.”
As a stand-in for actual devices, the boffins had the Kaldi software listen to songs with embedded commands, delivered via a JBL clip2 portable speaker, TAKSTAR broadcast gear and an ASUS laptop, from a distance of 1.5 metres.
For the open air test, success rates varied from 60 per cent to 94 per cent.
Gunter said that to be certain the attack would work with, say, Amazon’s Echo, you’d have to reverse engineer the Alexa speech recognition engine. But he said he knows of colleagues working on that.
The researchers suggest that CommanderSongs could prompt voice-recognition devices to execute any command delivered over the air without anyone nearby noticing. And they say such attacks could be delivered through radio, TV or media players.
We already have the proof-of-concept for overt commands sent over the airwaves. In time, we may get a covert channel too.
“It’s going to take continued work on it to get it to the point where it’s less noticeable,” said Gunter. ®