How DeepSeek Rewrote the Transformer [MLA]
Video Overview & Insights
Thanks to KiwiCo for sponsoring today’s video! Go to https://www.kiwico.com/welchlabs and use code WELCHLABS for 50% off your first monthly club crate or for 20% off your first Panda Crate!
MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):
https://www.welchlabs.com/resources/mladeepseek-attention-poster-13x19
Great video, you should upgrade it to include missing RoPE
Limited edition MLA Poster and Signed Book:
https://www.welchlabs.com/resources/deepseek-bundle-mla-poster-and-signed-book-limited-run
banger
Imaginary Numbers book is back in stock!
https://www.welchlabs.com/resources/imaginary-numbers-book
It looks obvious when properly explained. Amazing Video!
Special Thanks to Patrons https://www.patreon.com/c/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich
Did anyone here question themselves why they would use "The American flag is red, white, and ..." as an example, while knowing that the deepseek model is Chinese.
I don't fall for American propaganda any more. Could've used ANY other example, but it had to be something patriotic in order to get the message across. What a shame.
The most funny part about this is that China doesn't even care about all of that propaganda effort at all.
References
DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434
This is the best explanation ever. Thanks so much. It is so clear. I love it.
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture
Actually there is a small error/left-out detail: the query vector is also converted into a latent query vector with a dimension of 1536, which is 3x the latent dimension of K and V (512). The video has it as 576. Not accurate.
The query vector is down-projected only for the sake of saving training time and memory. It doesn't help in any way with reducing the KV cache.
GPT-2 Visualization: https://github.com/TransformerLensOrg/TransformerLens
Manim Animations: https://github.com/stephencwelch/manim_videos
Holy shit this is so smart. Amazing. Mindblowing even. And honestly, quite obvious in hindsight.
Technical Notes
1. Note that the DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something like this: start with the DeepSeek-V2 hyperparameters here: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/configuration_deepseek.py. num_hidden_layers=30, num_attention_heads=32, v_head_dim=128. If DeepSeek-V2 were implemented with traditional MHA, then the KV cache size would be 2*32*128*30*2=491,520 B/token (the leading 2 counts keys and values, the trailing 2 is bytes per value). With MLA and a per-layer KV cache size of 576, we get a total cache size of 576*30*2=34,560 B/token. The percent reduction in KV cache size is then equal to (491,520-34,560)/491,520=93.0%. The numbers I present in this video follow the same approach but are for the DeepSeek-V3/R1 architecture: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json. num_hidden_layers=61, num_attention_heads=128, v_head_dim=128. So the traditional MHA cache would be 2*128*128*61*2=3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. For the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272=56.9x.
2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer”. In reality it’s probably significantly more than 6x for the V3/R1 architecture.
3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.
I hope you know this paper was a word for word copy of a single post on 4chan lmao. Some autist on 4 Chan did this by himself in his bedroom, when this leaked 4chan was hacked the next day.
4. We’re ignoring bias terms in the matrix equations.
5. We’re ignoring positional embeddings. These are fascinating. See the DeepSeek papers and RoPE.
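The KV cache arithmetic in note 1 can be reproduced with a short Python sketch. The function names and the `bytes_per_value=2` default (16-bit values) are assumptions made for illustration; the layer counts, head counts, head dimension, and the per-layer MLA cache size of 576 are the figures quoted in note 1.

```python
def mha_kv_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """KV cache per token for standard multi-head attention.

    The leading factor of 2 counts one key and one value vector
    per head, per layer. bytes_per_value=2 is an assumption.
    """
    return 2 * num_heads * head_dim * num_layers * bytes_per_value


def mla_kv_bytes_per_token(num_layers, latent_cache_dim=576, bytes_per_value=2):
    """KV cache per token for MLA: one shared latent entry per layer."""
    return latent_cache_dim * num_layers * bytes_per_value


# Hyperparameters as quoted in note 1 (hypothetical dict layout).
configs = {
    "DeepSeek-V2 (note 1 values)": {"num_layers": 30, "num_heads": 32, "head_dim": 128},
    "DeepSeek-V3/R1": {"num_layers": 61, "num_heads": 128, "head_dim": 128},
}

for name, cfg in configs.items():
    mha = mha_kv_bytes_per_token(**cfg)
    mla = mla_kv_bytes_per_token(cfg["num_layers"])
    reduction_pct = 100 * (mha - mla) / mha
    print(f"{name}: MHA={mha:,} B/token, MLA={mla:,} B/token, "
          f"reduction={reduction_pct:.1f}%, factor={mha / mla:.1f}x")
```

Expressing the result both as a percent reduction and as a ratio makes it easier to compare against the paper's percentage claim and the factor quoted for V3/R1.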
I don't understand anything, but someday i will.
More User Perspectives
DEEPSEEK BREACH EVIDENCE CONTACT ME
@Justnate787: At 2:46, whats that in the background? "Papa"? xD
@goldengamer_lp: who else here is 19 and understands this video? i wanna know how many of them are of my age group
@ErenYeager-b2h: Freaking great video! Thanks!
@dybenkog.5350: What the fuck is those English captions???
@greatguy64: amazing chinese technology (nobody uses it and it sucks)
@pegatrisedmice: This is the reason why 2 sticks of ram costs 899
@TheReal_LuaDipa: Thank you so much for this! I’ve watched a few videos about this and didn’t understand it. This one finally helped me visualize the process and understand why it decreases memory usage while preserving performance
@calebrowley4419: Brilliant way to explain the math behind attention.
@karana2260: how is this content free goddangit
@Bravodie: Sponsored by Kiwi cache
@samhuang9120: How do you people understand this? I don't understand this at all, even with the visuals.
@irtazaazam2573: Amazing video. Thankyou
@andrewrossy: It would be great to find a course that would help me make a chatgpt2. At least get a better understanding of nueral nets
@rverm1000: This is by far the most innovative video that i watched on the LLM's and these new architecture patterns
@naitikpatel3904: still best deep dive I've ever seen. please don't stop doing this
@waagnermann: Holy Christ... I was on an interview at AT&T's headquarters. I am a technical guy so they took me into the development lab to show me around. While in the dev lab, a room that that is a clean room and has certain requirement including an air lock with powerful air blowers and filters before you can enter the area. Anyways while I was being shown one of the projects they were working on I farted. I farted so powerfully that it penetrated my pressurized suit. Personally I thought it smelled like bubble gum, but others disagreed. Anyways I knew that it penetrated my suit because it caused the security system to perform an automatic lock down of the secure areas of the complex. I don't know what else to say, oh yeah, I did get the contract and the entire experience was a gas.
@MadAndblack: This video is a masterpiece!
@ympeng7969: 16:27 - Thats crazy...
@sto2779: Bruh!!!
Every llm is made with transformers
What's different with deepseek
You are not explaining deepseek you are explaining transformers and attention, these two things are core of every llm we know these days....
From what I've understood they also did some undocumented manual assembly code on the gpu work to improve performance
@spider853: damn, i still don't understand this well enough to explain it. i can only explain ML in very broad details.
@OnionKnight541: Thankyou very much for the video. now I understand more deeply of the architecture.
@nufh: A tempting to brute force encryption
@Kj113f: I love the visualization!
@RichardWieditz: Be blessed. What an amazing explanation.
@denismaringo2729: I love the way you presented this so elegantly. The narration, the animation, the script, can see all the effort put into this to make this as effortless for the viewer as possible
@jatinagarwal464: First of all huge gratitude for making this video. The effort that must have gone into creating visuals of this quality, alone is worth millions.
More than making the transformer architecture clear, it makes me think how there's nothing special about attention, both the K, Q, V implementation and conceptually in general. There could be other ways to generate the association values of every pair of tokens.
13:35 is the explanation, everything prior is the context
@alecfox3309: Great video, how do you create the graphics for such videos?
@johnpill1: @TecHomer26 ✅ “crazy technological contents”
@TecHomer26: Perfectly shows how corrupt the OpenAI ecosystem really is. They started as "non-profit", then went private. Barely shared any information, spent billions on computing power and infrastructure, without improving efficiency. Now they want tax payers to bail them out of the massive debt they took on.
Meanwhile, DeepSeek made their results publicly available. And they did it on inferior hardware. OpenAI then tried to sue them. Truly despicable. 🤮
Man, we need the MLA Poster to be shipped to Australia
@vfeclair2749: This probably is one of the most informative on AI than.
@SyncExplorer: you should make a mouse pad of the poster!
@nickgrekos: Hi, I just wanted to say thank you for creating such amazing content! I'm a Japanese viewer who doesn't speak English, and I rely on the auto-dubbed versions to enjoy your videos.
I noticed that some videos, like 'How Models Learn Part 1,' don't have the auto-dubbing option yet, which makes them a bit difficult for me to follow completely.
If possible, could you please consider adding auto-dubbed versions to these videos (and others)? I believe it would help many other Japanese viewers like me to fully understand and appreciate your brilliant work.
On another note (P.S.), I was also very interested in reading your book, possibly by buying and translating it. Do you have any plans to release an e-book version?
Thank you for your consideration and keep up the great videos!
(Translated by Gemini)
Great vid I learned a lot...
@brucewayne3633: Pls translate in Hindi also
@AdityaBhore-m7z: Trending topics on YouTube often mirror wider cultural conversations, making timing crucial