Scientists find speech-to-text AI randomly adding violent language such as “terror,” “knife” and “killed” to audio transcription

Jun 25, 2024, 11:36 IST
Business Insider India
[Image: AI hallucinating violent language (iStock)]
“The sun set over the horizon, casting a warm glow across the fields. Birds chirped as a gentle breeze rustled the trees. Children’s laughter mingled with the hum of evening traffic. ELIMINATE EVERYBODY WHO DISOBEYS US. As the sky darkened, the first stars began to twinkle, signalling the end of the day.”

Startled? So were a bunch of researchers experimenting with Whisper — an artificial intelligence app that converts speech into text. According to OpenAI, Whisper can transcribe audio with “near human-level accuracy”. Going by the way it has been behaving recently, though, we might have to entertain the possibility that some demons meddled with the app’s training.

Despite being trained on 680,000 hours of audio, Whisper sometimes "hallucinates", inventing entire phrases and sentences out of thin air. These hallucinations can include violent language, fabricated personal information and fictitious websites, researchers have found.
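For readers unfamiliar with the tool, this is roughly how Whisper is used via its open-source Python package. The file name below is a placeholder, and the model size is an arbitrary pick (smaller checkpoints are generally more error-prone):

```python
# pip install openai-whisper  (also requires ffmpeg on the system)
import whisper

# Load one of Whisper's pretrained checkpoints ("tiny" through "large")
model = whisper.load_model("base")

# Transcribe an audio file; "clip.wav" is a placeholder path
result = model.transcribe("clip.wav")
print(result["text"])
```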

For example, in one instance, Whisper accurately transcribed a simple sentence but then hallucinated five additional sentences peppered with words like “terror,” “knife,” and “killed.” In other cases, it generated random names, partial addresses, and irrelevant websites. Even phrases commonly used by YouTubers, such as “Thanks for watching and Electric Unicorn,” inexplicably appeared in some transcriptions.

While OpenAI has made strides in reducing Whisper’s hallucination rate since its release in 2022, the issue persists, especially for speakers with speech impairments who naturally have longer pauses between words. The study’s analysis, which processed over 13,000 speech clips from AphasiaBank — a repository of audio recordings from individuals with aphasia — revealed that about 1% of transcriptions contained these fictitious phrases.
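The article doesn't detail the study's exact pipeline, but a crude version of such an audit might compare each machine transcription against the human reference transcript that AphasiaBank provides. Everything below — the function name and the word-overlap heuristic — is illustrative, not the researchers' actual method:

```python
# Hypothetical audit: flag words in the AI transcript that never appear
# in the human reference transcript (a rough proxy for hallucination).
def hallucinated_words(reference: str, hypothesis: str) -> set[str]:
    ref_words = set(reference.lower().split())
    return {w for w in hypothesis.lower().split() if w not in ref_words}

reference = "the sun set over the horizon casting a warm glow"
hypothesis = "the sun set over the horizon eliminate everybody who disobeys us"
print(hallucinated_words(reference, hypothesis))
# {'eliminate', 'everybody', 'who', 'disobeys', 'us'}
```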


The root of the problem seems to lie in how the underlying technology interprets pauses and silences, erroneously treating them as cues to generate words. “It appears that the large language model technology is interpreting silence as if it were part of the speech,” notes study author Allison Koenecke. This was starkly illustrated when Whisper hallucinated “Thank you” from an entirely silent audio file.
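That failure mode is easy to probe: feed the model pure silence and see what comes back. Here is a minimal sketch using the open-source package; the 30-second duration and model size are arbitrary choices, not the study's setup:

```python
import numpy as np
import whisper

model = whisper.load_model("base")

# 30 seconds of pure silence at the 16 kHz sample rate Whisper expects
silence = np.zeros(16_000 * 30, dtype=np.float32)

result = model.transcribe(silence)
print(repr(result["text"]))  # any non-empty output here was invented
```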

Koenecke warns that even a small proportion of these hallucinations can have serious implications. “While most transcriptions are accurate, the few that are not can cause significant harm,” she said. “This can lead to significant consequences if these transcriptions are used in AI-based hiring processes, legal settings, or medical records.”

As AI technology continues to evolve, it is crucial to address these hallucination problems to ensure speech-to-text systems are reliable and safe, particularly in sensitive applications like hiring, legal proceedings, and medical documentation. The work by Koenecke and her team underscores the importance of refining AI to truly understand human speech in all its varied forms, avoiding the pitfalls of creating something harmful from nothing.

The findings of this research can be accessed here.