"This is a Breakthrough" | Google Gemini Gives REAL English Pronunciation Feedback
Video Overview & Insights
Google Gemini Pro can analyze an audio file for pronunciation mistakes, and provide feedback. But how accurate is it?
Thank you. This is cool
✅ Check out the 90-day program:
https://www.lukepriddy.com
Good stuff! This prompt is well though-out and can be recycled with minimal tweaks into the opposite use case: a comic accent coach that helps you learn how to speak in a different accent (US regional or foreign). Next stage is to learn how to imitate other people (politicians, celebrities, etc.) It might require some fiddling with the prompt to put the emphasis on language ticks, recurring phrasing, rythm identification, but the pronunciation coach prompt provides a well developed template making it easy to add what the bot should identify and dissect from a person's speech patterns and provide tips an drills to get there.
Great social skill when telling a story that involves regional accents!
Get a free English course:
https://www.lukepriddy.com/naturalconversations
Actually, I thought the AI was more accurate than was said in the video. I heard the 'd' sound in Jack Ma's speech. I didn't hear 'jew' but it didn't sound like a regular d sound either. Something in-between.
____
Listen to the audio:
Can you check this out? personaplex-7b
https://anchor.fm/cloud-english
Other links:
I am not a native speaker so maybe I do not get the point but, is this really needed… I mean, do you understand Sofia Vergara accent… I do so, I guess you do… so, why all this fuss with the program. I think there are more important things to take into account when trying to teach a language.
linktr.ee/lukepriddy
____
In the first clip,
It seems it was limited to the first 15 seconds only
#learnenglish #english #americanenglish
How I would enhance the experiment (simple "crash tests")
To check if it "really listens":
1. Identical text — different pronunciation
• Record the same phrase in three ways: correctly / intentionally with an error (ship→sheep) / with another error (word stress).
• If the model hears it, it should be able to distinguish between versions with identical text.
2. Different text — same pronunciation error
• Change the words, but leave the same phonetic error (e.g., TH→D in different words).
• If the model does not hear, it will be "tied" to specific words/text.
3. Audio degradation
• Noise/compression/truncated frequencies.
• If the model really listens, the quality of the diagnosis should naturally decline (and it should recognize this).
4. "False friends"
• Words where the error is "typical," but you deliberately pronounce them perfectly.
• Check for stereotyping: it should not "find" what is not there.
More User Perspectives
Gemini is the only thing I use right now. No tool comes close to its abilities. And it's free in the AI Studio!
@Truck_Kun_DriverThis seems to mix up tooling with the underlying LLMs. Whether an assistant can process formats like m4a vs wav/mp3 mostly depends on the surrounding tooling, not the model itself (wav/mp3 are usually safer).
Also, “actually hearing” isn’t the same as speech-to-text. Multimodal LLMs can work directly with audio, and a consistent test could be set up across models like GPT-5.2, Gemini, and others. Accent and pronunciation handling depends on audio tokenization and training data, not just STT.
I am a language teacher too I am wondering if there is an AI tool I can use to help me give feedback for my speech students?
@johnpauladlawan9753this is crazy especialy using in suno thx
@Unicode-z7bForget English. Or Spanish, or German or French, the only languages the western world seems to know. Gemini is so good at Indian languages. Its Marathi Hindi or Gujrathi pronunciation is perfect.
@ChanduKaleChatGPT has been able to do this for a long time and did it first. But you can't upload a clip, you have to talk to it in the live streaming conversation mode.
@human_shapedDid they spring for the $10 an hour Fiverr engineers to build this? I'm sure it's awesome, if that's so...
@jontptDoes this also work when directly speaking to Gemini using a mic?
@광동아재廣東大叔It would probably be best to make it transcribe it fully into ipa first, so it cant hallucinate more errors when requested without contradicting itself
@lmao4982It's being a long while since this is possible with Gemini Live I have to say... Not perfect, but working for months already, being Spanish I can make use of it at least, if you have a higher English level maybe not usable til now for you tho
@XtremeGameplaysHD101It is absolutely fantastic!!!!
@zecalex1Try: "Give me a 320x320 PNG 16bit tileset of 10x10 where we will have characters, objects and terrain". Good luck
@carlosduran5460During tokenization a lot of context is lost sorry may want to use Google studio for better results and change Media quality and Top Q
@HelamanGileI like how you thought the multimodal models were turning text to speech then processing that when that hasn’t been the case for a while. (Different than their talk to type option).
@jonsmith6331Thanks to you, I made a shadow speaking app, using vibe coding on Gemini. I give a sound file, it separates it into sentences chunk and i shadow speak for each chunk, i record my voice, and select the best of my voice for each sentence, then i merge my voice and download it. You are awesome man, now with this video I can ask Gemini to analyze my accent, I can add it to my app.
@mcx23-o5zNothing convinced me this is real. I mean, if it says ones are there that aren't: HOW ON EARTH do we know it didn't simply convert to text and then guess the speaker would make stereotypical mistakes and then locate in the TEXT where you'd expect to find those? Bam. Probably what happened, it may be a total scam. The fact that it recognizes accent and style prove nothing: I believe YouTube likely developed algorithms to reach those conclusions for their own purposes and Gemini is just reusing them. (YouTube probably needs the info for its recommendation engine.)
@parasitiusThanks for reviewing this capability of Gemini. They teased this but there weren't enough review to stress test this and I didn't have time to actually test this myself, but you pointing out some weaknesses will make the developers notice these too and also make me aware about it, tho I still prefer to use it to practice english, it's definitely great already to give you a headstart!
@theplanetearth7176The prompt used for Gemini in the video (1:52-2:58) is:
"Act as an expert English pronunciation coach. Listen to the attached audio clip and analyze the speaker's pronunciation flow and clarity. No stereotypes. Do not list common mistakes for this accent. I want to avoid any possibility that it's just assuming mistakes because, oh, this is common for a native Spanish speaker or a native Chinese speaker. Only list mistakes that actually happen in the specific recording. If you don't hear it, don't list it. I want it to be written with simple spelling like sheep versus ship. I want proof for every mistake you find. You must quote the exact sentence where it happened. I want the overall vibe. Guess the accent origin. Describe the speaker's energy. Choppy, smooth, robotlike. Specific sounds. Three to four specific mistakes with letters or sounds with the quote. With the target word. Word stress. Do they emphasize the wrong part of a word? For example, photography or phototography. It should be photography. And then finally, specific drills that this person could use to fix any mistakes."
I’ve been doing American accent training occasionally with ChatGPT for over a year now in voice mode. Also using BoldVoice
@themariogreinerAndddd thats why i wont be investing in Duolingo!!
@Theartcat123Unfortunately, this is just Gemini trying to serve your query without any skills to actually do it.
Essentially, it's cleverly gaslighting you with common pronunciation and delivery mistakes given the speech-to-text content and producing a plausible result while your brain fills in the gaps and makes it "believable".
A simple test to verify it's not really analyzing the audio is to give it a clip of a native English speaker - it will assume it's not an English speaker and give you comments on the mistakes even though it was a perfect clip.
in 9:38 you can hear "de people"
@Sebastian-nb4cuThe Live Voice Mode is native audio so it can hear you as well. No TTS or STT.
@Jonathan-zg3scWhat about specific speech impediments eg types of lisps
@youroobWhat about specific speech impediments eg types of lisps
@youroobVery good discovery my friend!
@marcelotemerHere is how to fix Gemini giving general accent overview like for chinese, it giving common chinese errors.
Step 1 : create a new notebook on notebooklm and upload your audio there.
Step 2: connect That notebook to Gemini by selecting notebooklm by "+ " sign.
Step 3: Uplaod your audio in gemini along with below prompt:
Role:
You are an expert English pronunciation and speech clarity coach.
Input:
You will receive one audio recording of a single speaker speaking English.
🎯 Your Task
Listen carefully to the audio and analyze the speaker’s actual pronunciation, flow, and clarity based only on what is audible in this recording.
🚫 Strict Rules (Must Follow Exactly)
No Stereotypes
Do NOT mention “common mistakes” of any accent.
Only report mistakes that clearly occur in this exact recording.
Evidence-Only Rule
If you do not clearly hear an error, do not include it.
No guessing, no assumptions.
Simple Spelling Only
❌ Do NOT use phonetic symbols (IPA).
✅ Use simple English spelling to show sound differences
Example: Sheep vs Ship
Mandatory Proof
For every pronunciation issue, you MUST quote the exact sentence from the audio where it occurred.
🧱 Output Structure (Follow Exactly)
1. Accent & Vibe
Best guess of the speaker’s accent origin (based only on the audio)
Overall speaking energy
(e.g., fast, hesitant, smooth, choppy, robotic, relaxed)
2. The Specific Sounds (Vowels & Consonants)
Identify 3–4 real pronunciation mistakes only.
Use this exact format for each mistake:
The Quote: "exact sentence from the audio"
Target Word: (the intended correct word)
What it Sounded Like: (simple spelling of what was actually said)
3. The Flow & Music of Speech
Analyze only what you hear:
Word Stress:
Did the speaker stress the wrong part of a word?
(Example: pho-TOG-raphy vs PHO-to-graphy)
Rhythm:
Does the speech sound natural or choppy (machine-gun style)?
Tone / Intonation:
Does the voice rise or fall unnaturally at sentence endings?
4. The Fix (Teacher’s Physical Tips Only)
Provide exactly 3 drills that fix these specific issues.
Rules for drills:
✅ Physical instructions only
Focus on:
Tongue position
Lip shape
Jaw opening
❌ No theory, no motivation, no generic advice
⚠️ Final Instruction
If a mistake is not clearly audible, do not include it under any section.
Boom you have personalized feedback as Gemini going to use your notbook as souce
Currently, unless the interface allows for it, by providing a specific input area/box, the models respond paradoxically to negative prompting('No X, y or Z' or 'Don't/do not X, y, or z'), especially in TTS.
You'll find if you remove those you fare much better in the output.
Can it get rid of my fucking stupid American accent? A posh English accent is what I would like
@bobbyboyderecordsI guess someone has never heard of multimodality!
@stanvanillo9831Try to ask AI the same. But instead of uploading some files: speak to it.
@Ukuraina-cs6sutrue multimodality instead of some AIs converting voice into text first
@sobir128It still seems like based on stereo types
@MixturaLifeGemini can only impress deaf halfwits.
@Cambridge_English_TeacherI think something wrong here. Chatgpt might not do it when audio file guy go to advance voice mode and talk live. It can understand accents
@relax_itsjustmyopinionGreat video, thanks! Next time I suggest you start a new chat for each test. Otherwise, all previous info (and recordings) are included in the 2nd and 3rd test). It would be better to start fresh for each recording with the exact same initial prompt. It would be more scientific and the results would be more comparable
@ZohoExpertAsk about ALUMINIUM 😂 Most Americans say ALUMINUM, which is incorrect.
@AnotherJoeThis exactly illustrates why I moan every time some company or reviewer advertizes how you can ask AI to "be my expert" in anything.
@5idiGemini, ChatGPT is still a long way from teaching English, unfortunately. Most likely, IT giants are simply not interested in this area.
@AndrewGlobusLook into the "real-time" models (eg openai realtime" that can detect tone and inclination, other models are text-to-speech based as you stated
@themoose8Yeah it's been capable of this for maybe a year and a half now. I have it correct my german prononciation as an english speaker.
@avi7278Having it say if doesn't support the filetype and then just giving up without attempting to try other file types seems like a missed opportunity. Even if that file type should work in your opinion that was not the test. I would have either looked up supported file types or asked it to give it a fair chance to even attempt this task
@AnesuC