There are more and more ways to identify a person by voice. In parallel, researchers keep devising ways to circumvent these mechanisms, both to protect their own personal information and to break into systems protected this way. I decided to look into the latest research in this field and tell you about it.
A person’s voice is produced by the movement of the vocal cords, tongue, and lips. The computer, however, sees only the numbers that represent the wave recorded by the microphone. How, then, does a computer create sound that we can hear from speakers or headphones?
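To make the idea of sound as "only numbers" concrete, here is a minimal sketch that builds a tone as a list of integers and saves it as a WAV file using only the Python standard library. The sample rate, the 440 Hz tone, and the file name are illustrative assumptions, not values from the article.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # samples per second (assumed value for this sketch)


def sine_samples(freq_hz, duration_s, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a list of signed 16-bit integers describing a sine wave."""
    n = int(duration_s * sample_rate)
    peak = int(amplitude * 32767)  # 32767 is the largest 16-bit sample value
    return [
        round(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit samples to a WAV file that any player can open."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(samples)}h", *samples))


if __name__ == "__main__":
    # "tone.wav" is a hypothetical output path.
    write_wav("tone.wav", sine_samples(440, 1.0))
```

The sound card does the reverse of the microphone: it reads these numbers at the stated rate and turns them into speaker movement.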
Text to speech
One of the most popular and best-studied methods of generating speech is to convert the text directly into sound. The earliest programs of this kind glued recordings of individual letters into words, and words into sentences.
As synthesizers developed, the set of pre-recorded letters grew into a set of syllables, and then whole words.
The advantages of such programs are obvious: they are easy to write, use, and maintain; they can reproduce any word in the language; and they are predictable. At one time, all this made them commercially viable. But the quality of the voice they produce leaves much to be desired. We all remember the hallmarks of such a generator: flat, emotionless speech, misplaced stress, and words and letters disconnected from one another.
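The gluing approach described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the "recordings" are synthetic tones standing in for studio samples, and the names `UNITS`, `PAUSE`, and `synthesize` are hypothetical.

```python
import math

SAMPLE_RATE = 8_000   # assumed sample rate
UNIT_LEN = 400        # samples per stand-in "letter recording"
PAUSE_LEN = 240       # samples of silence for spaces and unknown characters


def tone(freq_hz, n_samples=UNIT_LEN):
    """Stand-in for a pre-recorded unit: a short sine tone as float samples."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


# Hypothetical unit inventory: one "recording" per letter of the alphabet.
UNITS = {ch: tone(200 + 20 * i)
         for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
PAUSE = [0.0] * PAUSE_LEN


def synthesize(text):
    """Glue the unit for each character end to end, with no smoothing."""
    out = []
    for ch in text.lower():
        out.extend(UNITS.get(ch, PAUSE))
    return out


if __name__ == "__main__":
    wave_data = synthesize("hi there")
    print(len(wave_data), "samples")
```

Because each unit starts and stops abruptly, the joins are audible, which is the "torn" quality the section describes.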
Sounds to Speech
This method of speech generation replaced the first one relatively quickly because it imitates human speech better: we pronounce sounds, not letters. That is why systems based on the International Phonetic Alphabet (IPA) sound better and are more pleasant to the ear.
The method is based on individual sounds pre-recorded in a studio, which are then glued into words. Compared with the first approach, the improvement in quality is noticeable: instead of simply gluing audio tracks together, these systems blend the sounds, using either mathematical rules or neural networks.
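One simple example of blending by a mathematical rule is a linear crossfade, in which the end of one unit fades out while the start of the next fades in. The sketch below is an assumption about what such mixing might look like, not the method of any particular synthesizer; the function name and the linear weights are illustrative.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using linear weights."""
    overlap = max(0, min(overlap, len(a), len(b)))
    if overlap == 0:
        return list(a) + list(b)  # nothing to blend: plain gluing

    head = list(a[:len(a) - overlap])   # part of `a` left untouched
    tail = list(b[overlap:])            # part of `b` left untouched
    mixed = []
    for i in range(overlap):
        # Weight of `b` rises linearly from just above 0 to just below 1.
        w = (i + 1) / (overlap + 1)
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


if __name__ == "__main__":
    loud = [1.0] * 4
    quiet = [0.0] * 4
    print(crossfade_concat(loud, quiet, 3))
```

The result is shorter than plain concatenation by `overlap` samples, because the two units share the blended region.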
Speech to Speech
A relatively new approach is based entirely on neural networks. WaveNet, an autoregressive architecture built by researchers at DeepMind, converts a sound or text into another sound directly, without pre-recorded building blocks (research article).
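The key to this technology is a network's ability to remember long stretches of context. Long short-term memory (LSTM) cells do this by retaining their state not only at the level of each individual cell of the neural network, but also at the level of the entire layer. Strictly speaking, the published WaveNet model gets its long memory from stacks of dilated causal convolutions rather than from recurrent cells.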
In general, this architecture works with any kind of sound wave, whether it is music or a human voice.
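As published, WaveNet predicts each audio sample from the samples before it using stacked dilated causal convolutions. The toy sketch below shows only that one ingredient, with fixed weights and none of the gated activations, residual connections, or training of the real model. The function names are hypothetical and this is not DeepMind's implementation.

```python
def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal count as zero, so y[t]
    never depends on any input later than t (the "causal" property).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of past-and-present input samples one output sample can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # Feed a single impulse through three layers with dilations 1, 2, 4.
    signal = [1.0] + [0.0] * 7
    for d in (1, 2, 4):
        signal = causal_dilated_conv(signal, [1.0, 1.0], d)
    print(signal)
    print(receptive_field(2, [1, 2, 4]))
```

Doubling the dilation at each layer makes the visible context grow exponentially with depth, which is how a convolutional model can cover thousands of audio samples with relatively few layers.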
There are several projects based on WaveNet.
To recreate speech, such systems pair a generator that turns text into phonetic notation with a generator of intonation (stress, pauses) to produce a natural-sounding voice.
This is the most advanced speech-generation technology: rather than merely gluing or blending sounds that mean nothing to the machine, it creates the transitions between them on its own, places pauses between words, and changes the pitch, loudness, and timbre of the voice to achieve correct pronunciation, or any other goal.
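The front end mentioned above, which turns text into phonetic notation with intonation marks, can be illustrated with a toy lookup. Real systems learn this mapping or use large pronunciation dictionaries; the three-word `LEXICON`, the pause tokens, and the function name here are hypothetical and exist only for this sketch.

```python
import re

# Hypothetical toy lexicon: IPA transcription with a stress mark (ˈ) per word.
LEXICON = {
    "the": "ðə",
    "cat": "ˈkæt",
    "sat": "ˈsæt",
}

# Hypothetical pause tokens that an intonation generator might emit.
PAUSES = {
    ",": "<short_pause>",
    ".": "<long_pause>",
    "?": "<long_pause>",
    "!": "<long_pause>",
}


def to_notation(text):
    """Convert text into a list of phonetic words and pause markers."""
    tokens = re.findall(r"[a-z']+|[,.?!]", text.lower())
    out = []
    for tok in tokens:
        if tok in PAUSES:
            out.append(PAUSES[tok])
        elif tok in LEXICON:
            out.append(LEXICON[tok])
        else:
            out.append(f"<unk:{tok}>")  # word missing from the toy lexicon
    return out


if __name__ == "__main__":
    print(to_notation("The cat sat."))
```

A sequence like this, rather than raw letters, is what the sound-generating network would receive as its conditioning input.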
Making a fake voice
For the simplest kind of identification, which I covered in my previous article, almost any method will do; for an especially lucky hacker, even five seconds of unprocessed recorded voice can be enough. But to bypass a more serious system, built on neural networks for example, we need a real, high-quality voice generator.