    Microsoft shares tool that can mimic voice and speech with 3 seconds of sample audio

    Published 10:56 11 Jan 2023 GMT

    The software company has shared a new piece of technology that can mimic speech with just a few seconds of a voice sample

    Rhiannon Ingle

    Featured Image Credit: Shutterstock

    Topics: Technology, Microsoft

    Rhiannon Ingle is a Senior Journalist at Tyla, specialising in TV, film, travel, and culture. A graduate of the University of Manchester with a degree in English Literature, she honed her editorial skills as the Lifestyle Editor of The Mancunian, the UK’s largest student newspaper. With a keen eye for storytelling, Rhiannon brings fresh perspectives to her writing, blending critical insight with an engaging style. Her work captures the intersection of entertainment and real-world experiences.

    The software giant has shared a tool that can replicate voice and speech with just a few seconds of an audio sample.

    The new technology is being referred to as "text to speech synthesis".

    Microsoft explains that it has trained a neural codec language model, called VALL-E, to synthesise speech.

    In a post on GitHub, Microsoft revealed that the text to speech synthesis training data stretched to "60K hours of English speech which is hundreds of times larger than existing systems."

    Using just a three-second audio sample, VALL-E is able to accurately mimic voice and speech.

    "VALL-E emerges in-context learning capabilities and can be used to synthesise high-quality personalised speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt," Microsoft outlines.

    Not only is this new technology able to synthesise speech, but it can also take into consideration different emotions and moods that can influence the tone or pitch of speech.

    The software company used 60,000 hours of speech data.
    Microsoft

    The post details: "In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis."

    The emotions demonstrated include angry, sleepy, neutral, amused and disgusted, with distinct variations between each one.

    In short, the technology, created by Microsoft, analyses how a person sounds and breaks that information down into discrete components.

    These components are called tokens. The model uses them to predict how that particular voice would sound if it spoke words or phrases other than those in the three-second audio sample.

    Or, as Microsoft puts it: "To synthesise personalised speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively.

    The technology only needs a three-second audio sample to synthesise speech.
    Pixabay

    "Finally, the generated acoustic tokens are used to synthesise the final waveform with the corresponding neural codec decoder."

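The three steps Microsoft describes (turn the sample into acoustic tokens, generate new tokens conditioned on the sample and the phonemes, then decode to a waveform) can be sketched roughly as below. This is a toy illustration only, not VALL-E's code: every name here (`encode`, `generate_acoustic_tokens`, `decode`, `synthesise`, `LEVELS`) is hypothetical, the "codec" is a plain quantiser rather than a neural codec, and the "language model" is a fixed rule rather than a trained network.

```python
import math
from typing import List

LEVELS = 8  # hypothetical codebook size for the toy codec


def encode(waveform: List[float]) -> List[int]:
    """Toy stand-in for a codec encoder: quantise samples in [-1, 1] to tokens."""
    tokens = []
    for sample in waveform:
        clipped = max(-1.0, min(1.0, sample))
        tokens.append(min(LEVELS - 1, int((clipped + 1.0) / 2.0 * LEVELS)))
    return tokens


def decode(tokens: List[int]) -> List[float]:
    """Toy stand-in for a codec decoder: map each token to the centre of its bin."""
    return [(t + 0.5) / LEVELS * 2.0 - 1.0 for t in tokens]


def generate_acoustic_tokens(
    prompt_tokens: List[int], phonemes: List[str], tokens_per_phoneme: int = 4
) -> List[int]:
    """Stand-in for the language model (a fixed rule, nothing is learned).

    The prompt tokens constrain the 'speaker' and the phonemes constrain
    the 'content', mirroring the roles described in the quote.
    """
    if not prompt_tokens:
        raise ValueError("an acoustic prompt is required")
    speaker = round(sum(prompt_tokens) / len(prompt_tokens))
    generated = []
    for phoneme in phonemes:
        content = sum(ord(ch) for ch in phoneme) % LEVELS
        for step in range(tokens_per_phoneme):
            generated.append((speaker + content + step) % LEVELS)
    return generated


def synthesise(prompt_waveform: List[float], phonemes: List[str]) -> List[float]:
    """Run the three steps: encode the prompt, generate tokens, decode."""
    prompt_tokens = encode(prompt_waveform)
    acoustic_tokens = generate_acoustic_tokens(prompt_tokens, phonemes)
    return decode(acoustic_tokens)


if __name__ == "__main__":
    # A short sine wave stands in for the three-second enrolled recording.
    prompt = [math.sin(2 * math.pi * i / 16) for i in range(48)]
    waveform = synthesise(prompt, ["HH", "AH", "L", "OW"])
    print(len(waveform), "samples generated")
```

In this sketch, changing the prompt changes the generated tokens for the same phonemes, which is the sense in which the sample "constrains the speaker".
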
    While the technology is revolutionary in many ways, the software corporation has acknowledged the "potential risks" that surround VALL-E in an ethics statement also posted on GitHub.

    It reads: "Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.

    "We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis.

    "When the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model," it concludes.
