You say おばさん to your friend's grandmother, smiling, proud of your clean vowels. You just called her an aunt. The word for grandmother is おばあさん, and the only difference is that you hold the ba for one extra beat: oba-san versus obaa-san. Same consonants, same vowels. One beat longer, completely different person.
This is the part of Japanese pronunciation that the "it's easy, just five clean vowels" advice quietly skips. The five vowels really are easy. What trips up English speakers for years is length: how long you hold a sound. In Japanese, length is not an accent flourish you add for polish. It changes which word you said. And before it ever becomes a speaking problem, it's a listening problem, because if your ear doesn't catch the extra beat, you file the wrong word away in the first place.
The five vowels aren't the hard part. The length is.
ビル means building. ビール means beer. The difference is one held vowel: biru versus bii-ru. Walk into an izakaya, ask for a ビル, and at best you get a confused look. This contrast is phonemic, which is the linguist's way of saying it carries meaning the way a whole different consonant would in English (Wikipedia: Japanese phonology).
It runs through everyday vocabulary. ゆき (yuki) is snow; ゆうき (yūki) is courage. おじさん (ojisan) is uncle; おじいさん (ojiisan) is grandfather. The pattern is always the same: one extra beat on a single vowel swaps the word out from under you. English does nothing like this. We stretch vowels for emphasis ("sooo good") without ever changing the dictionary entry, so the instinct to treat length as optional is baked in deep, and it has to be unlearned.
Japanese counts beats, not syllables
Here is the model that makes all of this click: Japanese counts time in equal beats called mora, not in syllables. A mora is one short, even tick of the clock, and every one gets roughly the same duration (Wikipedia: Mora).
Take Tokyo. English hears two syllables: TOH-kyo. Japanese hears four beats: と・う・きょ・う (to-u-kyo-u). Each "ō" in Tōkyō is not one long sound; it's two beats of the same vowel stacked together, and the romanized spelling "Tokyo" hides that completely. A long vowel is simply two mora in a row.
The nasal ん is its own beat too. 簡単 (kantan, "simple") is four beats: ka-n-ta-n, with the ん standing alone as a full tick. 日本 (Nihon, "Japan") is three: ni-ho-n. This is the same unit haiku is built on. A 5-7-5 haiku counts seventeen mora, not seventeen English syllables, which is why English "translations" that hit 5-7-5 syllables usually overshoot the original (Tofugu). Once you start hearing words as a string of even beats, length stops being mysterious. You're just counting.
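If you like seeing the counting rule written down, here is a minimal sketch in Python. It assumes the word is already written in kana only (no kanji, spaces, or punctuation), and the function names and the list of small kana are my own choices for illustration, not a standard library. The rule it encodes is the one described above: every kana is one beat, except the small glide kana, which fuse with the kana before them.

```python
# Small kana that fuse with the kana before them into a single beat
# (きょ is one mora, not two). Assumed list for illustration, not exhaustive.
SMALL_COMBINING = set("ゃゅょゎぁぃぅぇぉャュョヮァィゥェォ")


def split_morae(kana: str) -> list[str]:
    """Split a kana-only word into its beats (hypothetical helper)."""
    morae: list[str] = []
    for ch in kana:
        if ch in SMALL_COMBINING and morae:
            morae[-1] += ch  # glide kana joins the previous beat
        else:
            morae.append(ch)  # ん, っ and ー each count as a full beat
    return morae


def count_morae(kana: str) -> int:
    """Number of beats in a kana-only word."""
    return len(split_morae(kana))


for word in ["とうきょう", "かんたん", "にほん", "ビル", "ビール"]:
    print(word, count_morae(word), "・".join(split_morae(word)))
```

Run on these words, it reports four beats for とうきょう and かんたん, three for にほん and ビール, and two for ビル. It is a counting aid, not a pronunciation model: it says nothing about how long a beat actually lasts in real speech.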
The little っ is a beat of silence
きて, きって, きいて. Three words. To an English ear, mush. To a Japanese ear, they're as different as "cat," "cot," and "coat" (The Japanese Page).
きて (kite) is "come," two beats. きいて (kiite) is "listen," three beats, with a long vowel: ki-i-te. And きって (kitte) is "stamp," three beats, but the middle beat is silent. That little っ, the small tsu, marks a geminate (a doubled consonant), and it works by stopping. You close off the airflow, hold the silence for one full beat, then release into the next consonant. Say "kit" and freeze for a tick before "te": kit (hold) te.
The trick that helps most: you're not making a new sound, you're inserting a tiny gap. English does this across word boundaries ("hot tea" versus "hottie") but almost never inside a single word, so it feels unnatural at first. It shows up everywhere. さか (saka) is a slope; さっか (sakka) is an author. かた (kata) is a shoulder; かった (katta) is "bought." 来た (kita), "came," versus 切った (kitta), "cut." The same held beat even shows up when you count things, which is why 一本 is ippon and 六本 is roppon rather than ichi-hon and roku-hon (Japanese counters). Skip the silent beat and you've said a different word, or no word at all.
Why this is a listening problem first
Most guides frame length as something you fix in your own mouth. The harder half is your ear. If you don't hear the extra beat in おばあさん, your brain doesn't store "grandmother," it stores "aunt," and no amount of speaking practice fixes a word you misheard going in.
Romaji makes this worse. Spellings like "Tokyo," "Osaka," and "judo" flatten the long vowels right out of the written form, so learners reading romaji never see the beat they're supposed to hold, and tend to under-hold it. (Whether romaji causes the habit is hard to prove, but it certainly doesn't help you notice length.) The good news is that the fix runs both directions at once. Train your ear to expect the beat and your mouth starts producing it. Length, like pitch accent, is part of what separates a flat textbook delivery from speech that lands as natural.
The clap-per-beat drill
Say a word out loud and clap once on every mora. おばあさん gets five claps: o-ba-a-sa-n. おばさん gets four: o-ba-sa-n. Feel the difference land in your hands instead of only your ears. The grandmother gets an extra clap.
Do the same with the geminate, and clap on the silence too. きって is three claps: ki-(clap on the held beat)-te. きて is two. It feels strange to clap on nothing, which is exactly the point, because that "nothing" is a real beat that English speakers drop. とうきょう is four claps, not two.
Check yourself against real audio rather than your own guesswork. Jisho.org has audio on many entries, and Forvo has native recordings of whole words. Shadow them at full speed once the slow version feels solid, because length compresses in fast speech and you need to hear it there too. This is also the kind of thing an AI conversation partner is good for: you can say ビール out loud, get understood or not, and find out immediately whether your beat landed.
None of this is hard in the way grammar is hard. There's no table to memorize. It's a habit of attention: hearing Japanese as a row of even ticks instead of a blur. So the next time you meet a new word, count its beats before you say it. Clap it out if you have to. The grandmother you didn't accidentally call "auntie" will appreciate the effort.
