
I stumbled into a really powerful use for the ChatGPT iPhone app: super fast, super accurate voice-to-text transcription

Nicholas Carlson   

  • The ChatGPT iOS app lets you prompt it with your voice.
  • After using it for a while, I realized it's the best voice-to-text transcription I've ever seen.

A couple weeks ago, OpenAI finally came out with a native iOS app for ChatGPT.

One cool feature of the app is that you can just speak your prompt to ChatGPT instead of typing it out with your thumbs.

After I did that a few times, I noticed something that wasn't obvious at first. The voice-to-text transcription the app was doing was better than any voice-to-text transcription I'd ever seen. It was miles better than what you get when you use Apple's native transcription tech in iMessage or whatever. And it's even better than some popular standalone AI-powered transcription tools out there.

Let me show you what I mean.

The other day, I was writing a post about the latest Trump indictment, and I wanted to quote from one of my favorite podcasts, Serious Trouble.

Toward the end of the episode, cohost Josh Barro asks former prosecutor turned big-shot defense attorney Ken White if Trump is going to be arrested and brought to court.

Here's how one of the leading AI-powered transcription services, Otter.ai, transcribed White's answer:

"Now likely, since a summons demand, he'll he'll technically surrender to the US Marshal. And they will technically have him under arrest, and take his finger prints and picture and that type of thing. But it's not going to be a handcuffs thing like layup you know, you're just sort of walks politely into the marshal's office and they'll watch politely into the courtroom, particularly when you volunteer and be shown up on 99.9% Certain he'll get, you know, a lenient bond probably even you know, just as though we're calling this it's not a bond at all. yond and he'll be arranged You'll be informed of the charges. And maybe they'll take a not guilty plea now. Maybe they'll send it to judge cannon and she'll take the plate. Sometimes federal judges they'd like the magistrate judge to take the place. Sometimes the judge likes to take the plate is a court by courtroom thing, but not a lot of big substance will happen to the first appearance."

And here's how ChatGPT transcribed White:

"Now, likely, since they summonsed him in, he'll technically surrender to the US Marshal, and they will technically have him under arrest, and take his fingerprints and picture and that type of thing, but it's not going to be a handcuffs thing, likely. You know, you just sort of walk politely into the Marshal's office, and they'll walk politely into the courtroom. Particularly when you voluntarily show up. I'm 99.9% certain he'll get a lenient bond, probably even just a zone of recognizance, not a bond at all. And he'll be arraigned, he'll be informed of the charges, and maybe they'll take a not guilty plea now, maybe they'll send it to Judge Cannon, and she'll take the plea. Sometimes federal judges, they'd like a magistrate judge to take the plea, sometimes the judge herself likes to take the plea. It's a court-by-courtroom thing, but not a lot of big substance will happen at the first appearance."

I didn't edit either of those passages. I just copied and pasted one from Otter.ai and the other from ChatGPT.

The difference is remarkable, right? You can almost hear the person speaking in the ChatGPT version.

I emailed Sam Altman, CEO of OpenAI, to ask why ChatGPT is so good at this. He told me it's because the app uses another OpenAI technology called "Whisper."

Here's how OpenAI markets/explains "Whisper" on its website:

"The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation."

OpenAI's explanation left me a little…confused. So I went looking for more and found an academic paper on the Whisper model. I tried to parse it, but, hoo boy, it might as well have been written in Greek.

So I loaded up another favorite AI tool of mine, ChatPDF, in my browser and dropped a PDF of the paper into it. ChatPDF lets you feed a PDF to ChatGPT as a prompt and then answers whatever questions you have about the document.

After some prompting, ChatPDF gave me a useful explanation for why Whisper is so good.

"The authors of this paper developed a new approach to training speech recognition systems that uses large amounts of audio data from the internet, without the need for human labeling. They found that this approach resulted in speech recognition models that are more accurate and robust than previous models, even when tested on different types of speech."

"Traditional supervised learning requires a large amount of labeled data, which is expensive and time-consuming to obtain. In contrast, the approach used in this paper is based on weak supervision, which means that the models are trained on large amounts of unlabeled data, without the need for human labeling. The authors used a technique called unsupervised pre-training, which allows the models to learn directly from raw audio data. This approach is more scalable and can make use of much larger datasets than traditional supervised learning. The authors found that their approach resulted in speech recognition models that are more accurate and robust than previous models, even when tested on different types of speech."

So there you go! Whisper can train itself on much bigger datasets of audio because it doesn't need humans to label it all. Sounds like AI powering AI to me. I assume that's also a sign of things to come. The other day, a VC stunned a group of journalists he was briefing when he said that ChatGPT 7 will be built by ChatGPT 6. Huh!

Altman mentioned to me that there are some apps and services beyond the ChatGPT app using Whisper. I found one in the Mac App Store called Whisper Transcriptions.

So far, the user interface is pretty lousy compared to Otter.ai, but I'm excited to see if the transcriptions themselves are superior.
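And if you'd rather skip the apps entirely, you can call Whisper yourself through OpenAI's API, where the hosted model is named "whisper-1." Here's roughly what that looks like with the openai Python library as it works at the time I'm writing this (the file name and API key are placeholders, and the library's interface could change):

```python
import openai

openai.api_key = "sk-..."  # placeholder; use your own API key

# Send an audio file to the hosted Whisper model and get plain text back.
with open("podcast_clip.mp3", "rb") as audio_file:  # placeholder file name
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])
```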

