Microsoft, a technology giant with plans to invest $10 billion in OpenAI, the maker of ChatGPT, is working on an artificial intelligence (AI) model called VALL-E that can clone someone’s voice from just a three-second audio clip. According to a paper posted on arXiv, the preprint server hosted by Cornell University, VALL-E was trained on 60,000 hours of English speech and can mimic a voice in “zero-shot scenarios”. This means that the AI tool can imitate a speaker it never encountered during training and make that voice say words the speaker never recorded.


VALL-E uses text-to-speech technology to convert written words into “high-quality personalized” speech. The AI tool was trained using recordings of more than 7,000 real speakers from LibriLight, an audiobook dataset made up of public-domain texts read by volunteers. Microsoft has released audio samples that showcase how VALL-E clones a speaker’s voice.


The AI tool is not currently available for public use, and Microsoft has not made clear what its intended purpose is. The researchers state that, in their results so far, VALL-E “significantly outperforms” the most advanced systems of its kind in terms of speech naturalness and speaker similarity. However, they also pointed to a lack of accent diversity among the speakers and noted that some words in the synthesized speech were “unclear, missed, or duplicated.”


In addition, the researchers included an ethical warning about VALL-E and its potential risks. They stated that the tool could be misused, for example in “spoofing voice identification or impersonating a specific speaker.” To mitigate such risks, the researchers suggested that a detection model could be built to determine whether an audio clip was synthesized by VALL-E. However, they did not provide details on how this could be done. They also added that if the model is generalized to unseen speakers in the real.

