Welch Labs

885,000 subscribers

918,855 views

How DeepSeek Rewrote the Transformer [MLA]

Video Overview & Insights

Thanks to KiwiCo for sponsoring today’s video! Go to https://www.kiwico.com/welchlabs and use code WELCHLABS for 50% off your first monthly club crate or for 20% off your first Panda Crate!

MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):

https://www.welchlabs.com/resources/mladeepseek-attention-poster-13x19

“

Great video, you should upgrade it to include missing RoPE

— @martinriveros3470

Limited edition MLA Poster and Signed Book:

https://www.welchlabs.com/resources/deepseek-bundle-mla-poster-and-signed-book-limited-run

“

banger

— @BartTrojanowski

Imaginary Numbers book is back in stock!

https://www.welchlabs.com/resources/imaginary-numbers-book

“

It looks obvious when properly explained. Amazing Video!

— @songpandy9590

Special Thanks to Patrons https://www.patreon.com/c/welchlabs

Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich

“

Did anyone here question themselves why they would use "The American flag is red, white, and ..." as an example, while knowing that the deepseek model is Chinese.
I don't fall for American propaganda any more. Could've used ANY other example, but it had to be something patriotic in order to get the message across. What a shame.
The most funny part about this is that China doesn't even care about all of that propaganda effort at all.

— @OetziOfficial

References

DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434

“

This is the best explanation ever. Thanks so much. It is so clear. I love it.

— @bayesian7404

DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948

Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture

“

Actually there is a small error/left out detail, The query vector is also converted into latent query vector with dimension of 1536 which is 3x the the latent dimension of K and V of 512. The video has it as 576. Not accurate.

The query vector is down-projected only for the sake of saving training time and memory. Doesnt help in any way to the cause of reducing KV cache.

— @mr.fearless7594

GPT-2 Visualization: https://github.com/TransformerLensOrg/TransformerLens

Manim Animations: https://github.com/stephencwelch/manim_videos

“

Holy shit this is so smart. Amazing. Mindblowing even. And honestly, quite obvious in hindsight.

— @advaitamallik7703

Technical Notes

1. Note that the DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something like this: start with the DeepSeek-V2 hyperparameters here: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/configuration_deepseek.py. num_hidden_layers=30, num_attention_heads=32, v_head_dim=128. If DeepSeek-V2 were implemented with traditional MHA, the KV cache size would be 2*32*128*30*2=491,520 B/token (2 for keys and values, 32 heads, 128 dimensions per head, 30 layers, 2 bytes per value). With MLA and a per-layer KV cache size of 576, we get a total cache size of 576*30*2=34,560 B/token. The percent reduction in KV cache size is then (491,520-34,560)/491,520=93.0%. The numbers I present in this video follow the same approach but are for the DeepSeek-V3/R1 architecture: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json. num_hidden_layers=61, num_attention_heads=128, v_head_dim=128. So the traditional MHA cache would be 2*128*128*61*2=3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. For the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272=56.9X.
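
The cache-size arithmetic in note 1 can be sketched in a few lines of Python. This is a rough sketch, not DeepSeek's published methodology: the function names are made up for illustration, and it assumes 2 bytes per cached value and an MLA cache width of 576 per layer per token, as in the note.

```python
def mha_kv_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """KV cache per token for traditional multi-head attention.

    Each layer stores one key and one value vector per head (hence the
    leading factor of 2). bytes_per_value=2 is an assumption.
    """
    return 2 * num_heads * head_dim * num_layers * bytes_per_value


def mla_kv_bytes_per_token(num_layers, latent_dim=576, bytes_per_value=2):
    """KV cache per token for MLA: one shared latent vector per layer.

    latent_dim=576 is the per-layer cache width quoted in the note.
    """
    return latent_dim * num_layers * bytes_per_value


# Hyperparameters as quoted in the note (illustrative labels).
configs = {
    "DeepSeek-V2 (configuration_deepseek.py)": dict(num_layers=30, num_heads=32, head_dim=128),
    "DeepSeek-V3/R1 (config.json)": dict(num_layers=61, num_heads=128, head_dim=128),
}

for name, cfg in configs.items():
    mha = mha_kv_bytes_per_token(**cfg)
    mla = mla_kv_bytes_per_token(cfg["num_layers"])
    reduction_pct = 100 * (mha - mla) / mha
    print(f"{name}: MHA {mha:,} B/token, MLA {mla:,} B/token, "
          f"{mha / mla:.1f}x smaller ({reduction_pct:.1f}% reduction)")
```

Under these assumptions the byte counts come out to 491,520 and 34,560 B/token for the V2 hyperparameters, and 3,997,696 and 70,272 B/token for V3/R1, a factor of about 56.9.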

“

14:59 youtube short starts here

— @grownupgaming

2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer”. In reality it’s probably significantly more than 6x for the V3/R1 architecture.

3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.

“

I hope you know this paper was a word for word copy of a single post on 4chan lmao. Some autist on 4 Chan did this by himself in his bedroom, when this leaked 4chan was hacked the next day.

— @spaghettisquasher

4. We’re ignoring bias terms in the matrix equations.

5. We’re ignoring positional embeddings. These are fascinating. See the DeepSeek papers and RoPE.

“

I don't understand anything, but someday i will.

— @Bleak_Hope

More User Perspectives

@

DEEPSEEK BREACH EVIDENCE CONTACT ME

@Justnate787

@

At 2:46, whats that in the background? "Papa"? xD

@goldengamer_lp

@

who else here is 19 and understands this video? i wanna know how many of them are of my age group

@ErenYeager-b2h

@

Freaking great video! Thanks!

@dybenkog.5350

@

What the fuck is those English captions???

@greatguy64

@

amazing chinese technology (nobody uses it and it sucks)

@pegatrisedmice

@

This is the reason why 2 sticks of ram costs 899

@TheReal_LuaDipa

@

Thank you so much for this! I’ve watched a few videos about this and didn’t understand it. This one finally helped me visualize the process and understand why it decreases memory usage while preserving performance

@calebrowley4419

@

Brilliant way to explain the math behind attention.

@karana2260

@

how is this content free goddangit

@Bravodie

@

Sponsored by Kiwi cache

@samhuang9120

@

How do you people understand this? I don't understand this at all, even with the visuals.

@irtazaazam2573

@

Amazing video. Thankyou

@andrewrossy

@

It would be great to find a course that would help me make a chatgpt2. At least get a better understanding of nueral nets

@rverm1000

@

This is by far the most innovative video that i watched on the LLM's and these new architecture patterns

@naitikpatel3904

@

still best deep dive I've ever seen. please don't stop doing this

@waagnermann

@

Holy Christ... I was on an interview at AT&T's headquarters. I am a technical guy so they took me into the development lab to show me around. While in the dev lab, a room that that is a clean room and has certain requirement including an air lock with powerful air blowers and filters before you can enter the area. Anyways while I was being shown one of the projects they were working on I farted. I farted so powerfully that it penetrated my pressurized suit. Personally I thought it smelled like bubble gum, but others disagreed. Anyways I knew that it penetrated my suit because it caused the security system to perform an automatic lock down of the secure areas of the complex. I don't know what else to say, oh yeah, I did get the contract and the entire experience was a gas.

@MadAndblack

@

This video is a masterpiece!

@ympeng7969

@

16:27 - Thats crazy...

@sto2779

@

Bruh!!!
Every llm is made with transformers
What's different with deepseek

You are not explaining deepseek you are explaining transformers and attention, these two things are core of every llm we know these days....

@JoshiJii-v5v3w

@

From what I've understood they also did some undocumented manual assembly code on the gpu work to improve performance

@spider853

@

damn, i still don't understand this well enough to explain it. i can only explain ML in very broad details.

@OnionKnight541

@

Thankyou very much for the video. now I understand more deeply of the architecture.

@nufh

@

A tempting to brute force encryption

@Kj113f

@

I love the visualization!

@RichardWieditz

@

Be blessed. What an amazing explanation.

@denismaringo2729

@

I love the way you presented this so elegantly. The narration, the animation, the script, can see all the effort put into this to make this as effortless for the viewer as possible

@jatinagarwal464

@

First of all huge gratitude for making this video. The effort that must have gone into creating visuals of this quality, alone is worth millions.

More than making the transformer architecture clear, it makes me think how there's nothing special about attention, both the K, Q, V implementation and conceptually in general. There could be other ways to generate the association values of every pair of tokens.

@farhanhubble

@

13:35 is the explanation, everything prior is the context

@alecfox3309

@

Great video, how do you create the graphics for such videos?

@johnpill1

@

@TecHomer26 ✅ “crazy technological contents”

@TecHomer26

@

Perfectly shows how corrupt the OpenAI ecosystem really is. They started as "non-profit", then went private. Barely shared any information, spent billions on computing power and infrastructure, without improving efficiency. Now they want tax payers to bail them out of the massive debt they took on.

Meanwhile, DeepSeek made their results publicly available. And they did it on inferior hardware. OpenAI then tried to sue them. Truly despicable. 🤮

@lilhaxxor

@

Man, we need the MLA Poster to be shipped to Australia

@vfeclair2749

@

This probably is one of the most informative on AI than.

@SyncExplorer

@

you should make a mouse pad of the poster!

@nickgrekos

@

Hi, I just wanted to say thank you for creating such amazing content! I'm a Japanese viewer who doesn't speak English, and I rely on the auto-dubbed versions to enjoy your videos.
I noticed that some videos, like 'How Models Learn Part 1,' don't have the auto-dubbing option yet, which makes them a bit difficult for me to follow completely.
If possible, could you please consider adding auto-dubbed versions to these videos (and others)? I believe it would help many other Japanese viewers like me to fully understand and appreciate your brilliant work.
On another note (P.S.), I was also very interested in reading your book, possibly by buying and translating it. Do you have any plans to release an e-book version?
Thank you for your consideration and keep up the great videos!
(Translated by Gemini)

@sn_2293

@

Great vid I learned a lot...

@brucewayne3633

@

Pls translate in Hindi also

@AdityaBhore-m7z

@

Trending topics on YouTube often mirror wider cultural conversations, making timing crucial

@RamirezKaylen