Cloud English

232,000 subscribers

66,326 views

"This is a Breakthrough" | Google Gemini Gives REAL English Pronunciation Feedback

Video Overview & Insights

Google Gemini Pro can analyze an audio file for pronunciation mistakes, and provide feedback. But how accurate is it?

Thank you. This is cool

— @bibs.42

✅ Check out the 90-day program:

https://www.lukepriddy.com

Good stuff! This prompt is well thought-out and can be recycled with minimal tweaks into the opposite use case: a comic accent coach that helps you learn how to speak in a different accent (US regional or foreign). Next stage is to learn how to imitate other people (politicians, celebrities, etc.). It might require some fiddling with the prompt to put the emphasis on language tics, recurring phrasing, and rhythm identification, but the pronunciation coach prompt provides a well-developed template, making it easy to add what the bot should identify and dissect from a person's speech patterns and provide tips and drills to get there.
Great social skill when telling a story that involves regional accents!

— @SpinDriftTheory

Get a free English course:

https://www.lukepriddy.com/naturalconversations

Actually, I thought the AI was more accurate than was said in the video. I heard the 'd' sound in Jack Ma's speech. I didn't hear 'jew' but it didn't sound like a regular d sound either. Something in-between.

— @markc6411

____

Listen to the audio:

Can you check this out? personaplex-7b

— @batuhan.ozturk

https://anchor.fm/cloud-english

More User Perspectives

Gemini is the only thing I use right now. No tool comes close to its abilities. And it's free in the AI Studio!

@Truck_Kun_Driver

This seems to mix up tooling with the underlying LLMs. Whether an assistant can process formats like m4a vs wav/mp3 mostly depends on the surrounding tooling, not the model itself (wav/mp3 are usually safer).
Also, “actually hearing” isn’t the same as speech-to-text. Multimodal LLMs can work directly with audio, and a consistent test could be set up across models like GPT-5.2, Gemini, and others. Accent and pronunciation handling depends on audio tokenization and training data, not just STT.

@JanBadertscher

I am a language teacher too. I am wondering if there is an AI tool I can use to help me give feedback to my speech students.

@johnpauladlawan9753

this is crazy, especially using it in Suno, thx

@Unicode-z7b

Forget English. Or Spanish, or German or French, the only languages the western world seems to know. Gemini is so good at Indian languages. Its Marathi Hindi or Gujrathi pronunciation is perfect.

@ChanduKale

ChatGPT has been able to do this for a long time and did it first. But you can't upload a clip, you have to talk to it in the live streaming conversation mode.

@human_shaped

Did they spring for the $10 an hour Fiverr engineers to build this? I'm sure it's awesome, if that's so...

@jontpt

Does this also work when directly speaking to Gemini using a mic?

@광동아재廣東大叔

It would probably be best to make it transcribe the audio fully into IPA first, so it can't hallucinate more errors when requested without contradicting itself.

@lmao4982

This has been possible with Gemini Live for a long while, I have to say... Not perfect, but it has been working for months already. Being Spanish, I can make use of it at least; if you have a higher English level, maybe it wasn't usable for you until now, though.

@XtremeGameplaysHD101

It is absolutely fantastic!!!!

@zecalex1

Try: "Give me a 320x320 PNG 16bit tileset of 10x10 where we will have characters, objects and terrain". Good luck

@carlosduran5460

During tokenization a lot of context is lost, sorry. You may want to use Google studio for better results and change Media quality and Top Q.

@HelamanGile

I like how you thought the multimodal models were turning text to speech then processing that when that hasn’t been the case for a while. (Different than their talk to type option).

@jonsmith6331

Thanks to you, I made a shadow speaking app, using vibe coding on Gemini. I give it a sound file, it separates it into sentence chunks, and I shadow speak each chunk. I record my voice and select my best take for each sentence, then I merge my voice and download it. You are awesome man, now with this video I can ask Gemini to analyze my accent, and I can add that to my app.

@mcx23-o5z

Nothing convinced me this is real. I mean, if it says ones are there that aren't: HOW ON EARTH do we know it didn't simply convert to text and then guess the speaker would make stereotypical mistakes and then locate in the TEXT where you'd expect to find those? Bam. Probably what happened, it may be a total scam. The fact that it recognizes accent and style prove nothing: I believe YouTube likely developed algorithms to reach those conclusions for their own purposes and Gemini is just reusing them. (YouTube probably needs the info for its recommendation engine.)

@parasitius

Thanks for reviewing this capability of Gemini. They teased this, but there weren't enough reviews to stress test it and I didn't have time to actually test it myself. You pointing out some weaknesses will make the developers notice them too and also makes me aware of them, though I still prefer to use it to practice English. It's definitely great already to give you a head start!

@theplanetearth7176

The prompt used for Gemini in the video (1:52-2:58) is:

"Act as an expert English pronunciation coach. Listen to the attached audio clip and analyze the speaker's pronunciation flow and clarity. No stereotypes. Do not list common mistakes for this accent. I want to avoid any possibility that it's just assuming mistakes because, oh, this is common for a native Spanish speaker or a native Chinese speaker. Only list mistakes that actually happen in the specific recording. If you don't hear it, don't list it. I want it to be written with simple spelling like sheep versus ship. I want proof for every mistake you find. You must quote the exact sentence where it happened. I want the overall vibe. Guess the accent origin. Describe the speaker's energy. Choppy, smooth, robotlike. Specific sounds. Three to four specific mistakes with letters or sounds with the quote. With the target word. Word stress. Do they emphasize the wrong part of a word? For example, photography or phototography. It should be photography. And then finally, specific drills that this person could use to fix any mistakes."

@JohnJohn-f2l5e

I’ve been doing American accent training occasionally with ChatGPT for over a year now in voice mode. Also using BoldVoice

@themariogreiner

Andddd thats why i wont be investing in Duolingo!!

@Theartcat123

Unfortunately, this is just Gemini trying to serve your query without any skills to actually do it.

Essentially, it's cleverly gaslighting you with common pronunciation and delivery mistakes given the speech-to-text content and producing a plausible result while your brain fills in the gaps and makes it "believable".

A simple test to verify it's not really analyzing the audio is to give it a clip of a native English speaker - it will assume it's not an English speaker and give you comments on the mistakes even though it was a perfect clip.

@ketanadotcom

At 9:38 you can hear "de people"

@Sebastian-nb4cu

The Live Voice Mode is native audio so it can hear you as well. No TTS or STT.

@Jonathan-zg3sc

What about specific speech impediments eg types of lisps

@youroob

Very good discovery my friend!

@marcelotemer

Here is how to fix Gemini giving only a general accent overview, for example listing common Chinese errors for a Chinese speaker.

Step 1: Create a new notebook in NotebookLM and upload your audio there.

Step 2: Connect that notebook to Gemini by selecting NotebookLM via the "+" sign.

Step 3: Upload your audio in Gemini along with the prompt below:

Role:
You are an expert English pronunciation and speech clarity coach.
Input:
You will receive one audio recording of a single speaker speaking English.
🎯 Your Task
Listen carefully to the audio and analyze the speaker’s actual pronunciation, flow, and clarity based only on what is audible in this recording.
🚫 Strict Rules (Must Follow Exactly)
No Stereotypes
Do NOT mention “common mistakes” of any accent.
Only report mistakes that clearly occur in this exact recording.
Evidence-Only Rule
If you do not clearly hear an error, do not include it.
No guessing, no assumptions.
Simple Spelling Only
❌ Do NOT use phonetic symbols (IPA).
✅ Use simple English spelling to show sound differences
Example: Sheep vs Ship
Mandatory Proof
For every pronunciation issue, you MUST quote the exact sentence from the audio where it occurred.
🧱 Output Structure (Follow Exactly)
1. Accent & Vibe
Best guess of the speaker’s accent origin (based only on the audio)
Overall speaking energy
(e.g., fast, hesitant, smooth, choppy, robotic, relaxed)
2. The Specific Sounds (Vowels & Consonants)
Identify 3–4 real pronunciation mistakes only.
Use this exact format for each mistake:
The Quote: "exact sentence from the audio"
Target Word: (the intended correct word)
What it Sounded Like: (simple spelling of what was actually said)
3. The Flow & Music of Speech
Analyze only what you hear:
Word Stress:
Did the speaker stress the wrong part of a word?
(Example: pho-TOG-raphy vs PHO-to-graphy)
Rhythm:
Does the speech sound natural or choppy (machine-gun style)?
Tone / Intonation:
Does the voice rise or fall unnaturally at sentence endings?
4. The Fix (Teacher’s Physical Tips Only)
Provide exactly 3 drills that fix these specific issues.
Rules for drills:
✅ Physical instructions only
Focus on:
Tongue position
Lip shape
Jaw opening
❌ No theory, no motivation, no generic advice
⚠️ Final Instruction
If a mistake is not clearly audible, do not include it under any section.

Boom, you have personalized feedback, as Gemini is going to use your notebook as a source.

@Playfulmomentsai
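
The structured prompt in the comment above can also be kept as a reusable template instead of retyped text. The following is only a minimal sketch, not the commenter's method: the `CoachPrompt` class, its field names, and the shortened rule wording are all hypothetical, and the sketch only builds the prompt text. Sending it to Gemini together with the audio is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CoachPrompt:
    """Hypothetical container for the prompt sections quoted above."""

    role: str = "You are an expert English pronunciation and speech clarity coach."
    # Shortened paraphrases of the commenter's rules (an assumption, not a quote).
    rules: list[str] = field(default_factory=lambda: [
        "Only report mistakes that clearly occur in this exact recording.",
        "If you do not clearly hear an error, do not include it.",
        "Use simple English spelling (sheep vs ship), not IPA symbols.",
        "Quote the exact sentence from the audio for every issue.",
    ])
    max_mistakes: int = 4
    drills: int = 3

    def render(self) -> str:
        """Join the role, numbered rules, and output limits into one prompt string."""
        lines = [f"Role: {self.role}", "Rules:"]
        lines += [f"{i}. {rule}" for i, rule in enumerate(self.rules, start=1)]
        lines.append(f"Report at most {self.max_mistakes} pronunciation mistakes.")
        lines.append(f"Finish with exactly {self.drills} physical drills.")
        return "\n".join(lines)


if __name__ == "__main__":
    print(CoachPrompt().render())
```

Keeping the rules in a list makes it easy to change or drop one rule at a time and compare how the feedback changes.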

Currently, unless the interface allows for it by providing a specific input area/box, the models respond paradoxically to negative prompting ('No X, Y, or Z' or 'Don't/do not X, Y, or Z'), especially in TTS.

You'll find if you remove those you fare much better in the output.

@esspyarrow8772

Can it get rid of my fucking stupid American accent? A posh English accent is what I would like

@bobbyboyderecords

I guess someone has never heard of multimodality!

@stanvanillo9831

Try to ask AI the same. But instead of uploading some files: speak to it.

@Ukuraina-cs6su

true multimodality instead of some AIs converting voice into text first

@sobir128

It still seems like it's based on stereotypes

@MixturaLife

Gemini can only impress deaf halfwits.

@Cambridge_English_Teacher

I think something is wrong here. ChatGPT might not do it with an audio file, but go to Advanced Voice Mode and talk live. It can understand accents

@relax_itsjustmyopinion

Great video, thanks! Next time I suggest you start a new chat for each test. Otherwise, all previous info (and recordings) is included in the 2nd and 3rd tests. It would be better to start fresh for each recording with the exact same initial prompt. It would be more scientific and the results would be more comparable.

@ZohoExpert
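
The fresh-chat suggestion above can be sketched as a small test harness. This is a sketch under stated assumptions, not a real client: `Session`, `new_session`, and `run_isolated` are hypothetical names, and the actual call to Gemini (or any other model) would sit behind `new_session`.

```python
from typing import Callable

# Hypothetical type: a session takes (prompt, audio_path) and returns the reply text.
Session = Callable[[str, str], str]


def run_isolated(
    new_session: Callable[[], Session],
    prompt: str,
    recordings: list[str],
) -> dict[str, str]:
    """Analyze each recording in its own fresh session with the same prompt."""
    results: dict[str, str] = {}
    for path in recordings:
        # A new session per recording keeps earlier clips out of the context.
        session = new_session()
        results[path] = session(prompt, path)
    return results
```

Because every recording gets the identical prompt and an empty history, differences between the replies are easier to attribute to the audio itself.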

Ask about ALUMINIUM 😂 Most Americans say ALUMINUM, which is incorrect.

@AnotherJoe

This exactly illustrates why I moan every time some company or reviewer advertizes how you can ask AI to "be my expert" in anything.

@5idi

Gemini and ChatGPT are still a long way from teaching English, unfortunately. Most likely, IT giants are simply not interested in this area.

@AndrewGlobus

Look into the "real-time" models (e.g. OpenAI Realtime) that can detect tone and inclination; other models are text-to-speech based, as you stated.

@themoose8

Yeah, it's been capable of this for maybe a year and a half now. I have it correct my German pronunciation as an English speaker.

@avi7278

6:47 it did say it and you even read it, in 6:08

@ShoyebOP007

Having it say it doesn't support the file type and then just giving up without trying other file types seems like a missed opportunity. Even if that file type should work in your opinion, that was not the test. I would have either looked up the supported file types or asked it which ones it accepts, to give it a fair chance to even attempt this task.

@AnesuC

#American English #English pronunciation #Spoken English #Travel English #English language #Cloud English #Luke Priddy #Free English Lesson #American culture #ESL #Learn American English #American English Pronunciation #American English Teacher #English words #English learning best practices #Master English #Intermediate English lessons #Free English lessons #How to do things in English #English #English literacy #English learning lifestyle #English learning podcast