AI voice replication: A growing concern ready to surge

AI voice replication poses a rising threat, poised for rapid expansion. Stay informed on this alarming trend.

May 7, 2024 - 11:30

May 9, 2024 - 12:08

In a disturbing development, a BBC host's voice was replicated without her consent to endorse a product, underscoring the power and dangers inherent in AI voice cloning. This incident not only showcases AI's capabilities but also highlights ethical dilemmas and risks of abuse. As society grapples with the repercussions of deepfake technologies, including resurrecting the voices of the dead and impersonating public figures, legal frameworks are urgently needed to regulate these advances responsibly. This piece delves into the mechanics of AI voice cloning, its wide-ranging implications across industries, and emerging legal initiatives aimed at safeguarding personal and public welfare.

Misappropriation of AI voice technology

The emergence of deepfake replicas

This incident is not isolated; similar instances of AI misuse have impacted various public figures, highlighting the widespread issue of digital impersonation.

For instance, deepfake technology was used to fabricate an audio recording of London Mayor Sadiq Khan making controversial remarks shortly before Armistice Day. An audio deepfake of Philippine President Ferdinand Marcos Jr. directing his military to act against China also surfaced, prompting significant concern among government officials in Manila. Audio deepfakes are used in scams as well: a Vice journalist demonstrated the risk by gaining access to his own bank account with an AI-generated replica of his voice.

These cases underscore how convincing AI-generated content has become, and tools like Microsoft's VASA-1 and OpenAI's Voice Engine show how little source material is now required. While neither tool is publicly available, research indicates that VASA-1 can generate highly realistic talking-head video from just a single photo and a short audio clip. Similarly, Voice Engine can replicate a voice using only a 15-second audio recording.

Liz Bonnin's voice cloned by AI

BBC presenter Liz Bonnin recently had her voice replicated without her consent and used in a promotional campaign for insect repellent.

What sets this case apart from the usual online advertisements that use celebrities' images to endorse questionable products (frequently investment scams) is that the company behind the ad, Incognito, was itself deceived into believing the celebrity had consented. Bonnin, known for presenting "Bang Goes the Theory" and "Our Changing Planet," told The Guardian:

"It feels like a violation, and it's unsettling. Fortunately, it was just an insect repellent spray and not something more unpleasant that I was purportedly advertising!"

Scammers employed a forged voice message, purportedly from Bonnin, granting consent for her participation in insect repellent ads. Initially mimicking Bonnin's voice, the message gradually shifted in accent, prompting doubts about its authenticity.

Howard Carter, CEO of Incognito, initially thought he was communicating directly with Bonnin, based on several convincing voice messages endorsing the product.

The individual posing as Bonnin provided Carter with a phone number, an email address, and contact details purportedly for the Wildlife Trusts, where Bonnin serves as president.

Negotiations took place over WhatsApp and email, and experts suspect that AI was used to clone Bonnin's voice.

On March 13, Carter received an email with what he believed to be a contract signed by Bonnin. Bank statements show that the company transferred £20,000 to an account linked to a digital bank on March 15.

Although images for the campaign were sent five days later, subsequent emails from Incognito went unanswered.

The campaign commenced using quotes and images provided by the scammers, and the ruse was exposed only after Bonnin publicly declared her lack of consent.

Bonnin remarked:

"I'm deeply sorry for the ordeal the company has endured. It's certainly not pleasant for them, but it's a violation on both our parts. It serves as a reminder that if something appears too good to be true, too effortless, or slightly peculiar, it's crucial to thoroughly verify."

Understanding the mechanisms of AI voice replication

AI voice cloning uses machine learning, and deep learning in particular, to build a synthetic version of an individual's voice from audio samples. The process unfolds as follows:

Data gathering

First, a large set of audio samples of the target voice is collected. The recordings cover a range of speech sounds so that the AI can learn to replicate nuances across different emotional states and tones. Typically, the person is recorded reading varied sentences in different speaking styles.

Preprocessing and feature analysis

The collected audio data undergoes preprocessing to eliminate background noise and standardize volume levels. Subsequently, feature analysis identifies crucial voice traits such as pitch, tone, cadence, and timbre, which are pivotal for comprehending and reproducing the subtleties of the voice.
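To make the preprocessing and feature-analysis step more concrete, here is a minimal Python sketch of two of the operations described above: volume standardization (peak normalization) and pitch estimation (via simple autocorrelation). It is an illustration only, not how any particular cloning product works. The sample rate, the function names, and the synthetic 200 Hz tone standing in for a real recording are all assumptions made for this example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    return [s * target_peak / peak for s in samples]


def estimate_pitch_hz(samples, sample_rate=SAMPLE_RATE, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency by picking the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How strongly the signal resembles itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a recording: a quiet 200 Hz tone, 0.1 s long.
tone = [0.05 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]
normalized = peak_normalize(tone)
pitch = estimate_pitch_hz(normalized)
print(f"peak after normalization: {max(abs(s) for s in normalized):.2f}")
print(f"estimated pitch: {pitch:.1f} Hz")
```

Real systems extract far richer features, such as spectrograms that capture timbre, but the principle is the same: raw audio is cleaned up and then reduced to measurable traits of the voice.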

Neural network training

Deep learning models: Central to voice cloning are deep learning models like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), trained on the extracted voice features. These models learn to predict subsequent sounds, enabling them to generate speech mirroring the original voice's characteristics.

Text-to-Speech (TTS) synthesis: Advanced TTS systems utilize these trained neural networks to generate speech that not only sounds natural but also conveys appropriate emotion and intonation based on the input text.

Generative Adversarial Networks (GANs): GANs enhance the realism of the cloned voice by training two components against each other:

Generator: Produces voice samples based on its training.

Discriminator: Assesses the authenticity of the generated samples against the original recordings, providing feedback that improves the quality and realism of the synthetic voice.
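The feedback loop between generator and discriminator can be illustrated with the loss functions commonly used to train GANs. The Python sketch below assumes the discriminator outputs a probability that a clip is genuine and uses the standard binary cross-entropy formulation. The scores are invented for illustration, and no actual neural network is trained here.

```python
import math


def bce(prediction, label):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 label."""
    eps = 1e-12  # guards against log(0)
    p = min(max(prediction, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def discriminator_loss(score_real, score_fake):
    """The discriminator wants real clips scored near 1 and fakes near 0."""
    return bce(score_real, 1) + bce(score_fake, 0)


def generator_loss(score_fake):
    """The generator wants its fakes scored near 1, i.e. mistaken for real."""
    return bce(score_fake, 1)


# Hypothetical discriminator scores (probability that a clip is genuine).
early = generator_loss(0.10)  # fake is easily spotted: large generator loss
late = generator_loss(0.45)   # fake is hard to tell apart: smaller loss
print(f"generator loss early in training: {early:.2f}")
print(f"generator loss later in training: {late:.2f}")
```

As the generator improves, the discriminator's scores for fake clips drift toward those for real ones, and the generator's loss falls. That falling loss is the "feedback" that pushes the synthetic voice toward realism.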

Postprocessing

The generated voice may undergo further refinement to enhance clarity, adjust speed, and ensure naturalness, including applying audio effects like equalization and compression to improve overall sound quality.
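As a rough illustration of the compression effect mentioned above, the Python sketch below applies a simplified, sample-by-sample compressor that reduces only the portion of the signal above a threshold. Production compressors track a smoothed signal level with attack and release times. The function name, threshold, and ratio here are assumptions chosen for the example.

```python
def compress(samples, threshold=0.5, ratio=4.0):
    """Hard-knee compressor: shrink the part of each sample above threshold."""
    out = []
    for s in samples:
        level = abs(s)
        if level > threshold:
            # Only the excess over the threshold is divided by the ratio.
            level = threshold + (level - threshold) / ratio
        out.append(level if s >= 0 else -level)
    return out


loud_and_quiet = [0.1, -0.2, 0.9, -1.0]
compressed = compress(loud_and_quiet)
# Quiet samples pass through unchanged; 0.9 becomes about 0.6
# and -1.0 becomes about -0.625.
print(compressed)
```

Narrowing the gap between loud and quiet passages in this way makes synthetic speech sound more even and easier to follow.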

Testing and tuning

The system is tested extensively on diverse texts to verify its performance across varied speech inputs, and any phonetic errors or unnatural speech patterns are corrected through further model adjustments.

Through these stages, AI voice cloning technologies can produce highly lifelike and adaptable synthetic voices closely resembling the original. These technologies evolve continuously, integrating the latest AI advancements for enhanced accuracy and versatility.

Legitimate applications and advantages of voice replication

While voice cloning technology carries risks, it also has responsible applications that offer significant benefits:

Entertainment and media

Voice cloning can enhance dialogue in video games and films while reducing the dependence on repeated recording sessions with voice actors. For instance, in Phantom Liberty, the DLC for the video game "Cyberpunk 2077," developers used voice cloning technology to preserve the portrayal of the character Viktor Vektor after the death of his voice actor, Miłogost “Miłek” Reczek. Similarly, the Star Wars franchise has used digital techniques to recreate the late Peter Cushing and to de-age Carrie Fisher and Mark Hamill.

Accessibility

Voice cloning assists individuals who have lost their ability to speak due to illness or accidents by recreating their voice for communication devices, preserving their vocal identity. Breakthroughs in brain-computer interfaces (BCIs), termed "neuroprosthetics," empower people with severe paralysis to communicate again by translating brain activity related to speech into audible speech through AI. Notably, a woman named Ann, who suffered a major stroke, utilized a BCI to convert her brain signals into a computer-generated voice resembling her pre-stroke voice.

Educational tools

Voice cloning enriches educational materials by incorporating the voices of historical figures, making learning experiences more interactive. An exemplary use is the "Ask Dalí" exhibit at the Dalí Museum in Florida, where an AI trained on Salvador Dalí's interviews responds to visitors in his style, enhancing the educational encounter.

Personalized marketing

Companies use voice cloning to create distinctive customer experiences by mimicking the voices of well-known personalities or a brand's own voice. For example, KFC Canada used AWS AI to replicate the voice of its founder, Colonel Sanders, for an Alexa skill that let customers place food orders by talking to the Colonel, boosting engagement while preserving his iconic character.

By acknowledging and managing the risks alongside these benefits, we can employ voice cloning technology ethically and effectively, enhancing both digital and real-world interactions.

Ethical and legal ramifications

Recent data indicates a significant surge in deepfake incidents, underscoring the risks associated with AI-enabled fraud. Between 2022 and 2023, the global detection of deepfakes rose tenfold across diverse sectors, with over 2 million reported cases of identity fraud attempts.

Deepfake-related identity fraud rose sharply in several countries, with the Philippines leading at a 4500% increase, followed by Vietnam at 3050%, the US at 3000%, and Belgium at 2950%.

In response, US senators have taken up the problem of AI-generated deepfakes. The proposed NO FAKES Act aims to hold both individuals and platforms accountable for creating or disseminating unauthorized digital replicas. The federal legislation is intended to protect not only celebrities but also the general public from the exploitation of their digital likeness.

During a Senate Judiciary Committee hearing, industry professionals, including singer-songwriter FKA Twigs, voiced their support for the Act, emphasizing the necessity of protecting artists and the public from exploitation while preserving artistic creativity and legitimate AI applications.

The bill seeks to strike a balance between fostering artistic innovation and safeguarding individual rights, a goal endorsed by figures like Robert Kyncl, CEO of Warner Music Group. Discussions have also highlighted the importance of clearly defining "digital replica" to prevent the law from impeding freedom of expression.