Last week’s news from China about Vidu and DeepSeek v2 gives us the cue to explore the inference market in depth. Let’s dive in.
14M$ vs OpenAI
Vidu is a text-to-video model created by Shengshu AI, a company run by a former Tencent PM. Don’t Google for Vidu: you will end up on vidu.io, which offers a different text-to-video model that is not the Vidu I’m referring to - confusing, I know.
Shengshu AI developed an impressive model in just 10 months, and they did it with very little funding: only 14M$ from a single angel round back in 2023. I say “little” because, in the context of GenAI, it’s getting nearly impossible to find companies raising less than 30-40M$.
Vidu is capable of creating videos up to 16 seconds long, of remarkable quality and consistency, at least judging from the company’s trailer. When Sora was announced, OpenAI published a series of videos - and you can expect those were as cherry-picked as they come, since they were intended for a public announcement - in which you could see not only a variety of inconsistencies, but also the conspicuous absence of something that, for lack of a better term, we will call: the dreaded rotation from hell.
The Dreaded Rotation from Hell doesn’t roll off the tongue, so we will use a much smoother acronym: TDRFH (👋 RWKV, see, I’m learning from you!).
When a camera pans, the model only needs to learn that things look more or less the same, just translated to a different position. But when something rotates, the model must learn a representation of its 3D structure, in relation to the space it sits in, and it needs to maintain coherency.
To give you an example: if a camera rotates around a banana, at the end of the rotation the frame should show the same banana. The process shouldn’t result in an alchemical transmutation into a cucumber 🥒. This is hard.
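A toy pinhole-camera sketch makes the difference concrete - this is just napkin geometry to illustrate the point, and has nothing to do with how Vidu or Sora are actually built. Sliding the camera mostly translates the image, while orbiting it produces views that can only stay coherent if the 3D structure is represented somewhere:

```python
import numpy as np

def project(points, cam_rot, cam_pos, focal=1.0):
    """Project 3D world points through a pinhole camera with orientation cam_rot and position cam_pos."""
    pts_cam = (points - cam_pos) @ cam_rot.T          # world -> camera coordinates
    return focal * pts_cam[:, :2] / pts_cam[:, 2:3]   # perspective divide

# A toy "banana": a few 3D points sitting ~5 units in front of the camera.
banana = np.array([[0.0, 0.0, 5.0], [0.2, 0.1, 5.0], [-0.2, -0.1, 5.2]])
I = np.eye(3)

# Sliding the camera sideways (a pan-like move): every point shifts by roughly the
# same amount, so the next frame is basically the old one, translated.
frame_a = project(banana, I, np.zeros(3))
frame_b = project(banana, I, np.array([0.5, 0.0, 0.0]))

# Orbiting the camera 90 degrees around the banana: what lands in the frame now
# depends on the full 3D structure (depth, occlusion), which the model has to have
# internalized to keep the banana a banana.
theta = np.pi / 2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
frame_c = project(banana, R, np.array([5.0, 0.0, 5.0]))
```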
Sora does it to an extent, but as we pointed out in Update #9 (not on Substack, sorry) there were plenty of limbs floating around and merging with other living things, while most of the other videos got around the problem by sticking to pans and tilts.
Vidu appears to be somewhat good at this, scrub to second 28 of their video to understand what I mean, or check the picture below, because you are made of flesh and laziness.
There are certainly some minor inconsistencies, but I was quite impressed that, as the camera rotates, the gap between the TVs ends up showing the background wall with the correct painting:
Unlike OpenAI, Vidu has published a detailed description of their model architecture that explains how they achieve such good results. The model is not just spatially consistent, it’s also capable of generating cuts: showing the same scene from a different angle while maintaining coherency. When requested, it can also create clean transitions between scenes. Neat!
Sure, Sora sets a “standard” in terms of video length, but Vidu shows that even on a relatively modest budget (last month, the company raised another round for an undisclosed amount of “several hundred million yuan” - 100M RMB is ~14M USD), it’s possible to match the SOTA with the right approach, team and data. This also reinforces my belief that OpenAI’s edge is starting to slip. More on this later.
Coherency, prompt adherence and the ability to steer the model in the direction we want are all key elements for the adoption of text-to-video models in the movie industry. Vidu represents a significant step towards higher-quality content, and it opens a new range of possibilities that appears to be very, very close to what Sora can (will?) offer.
DeepSeek v2
The second piece of notable news from China was about DeepSeek v2, which is “just another LLM” (jaLLM? 🤔 should I stop making up acronyms?) - only it’s kind of special.
LLMs all go through different levels of censorship - or alignment, if you prefer a less charged term (just to clarify: I consider them logically separate, but they can happen at the same time during training):
Company-safety alignment: to ensure the content generated by the AI doesn’t hurt the image of the company; think of Google making sure Gemini doesn’t generate images of, or discuss, white people.
User-safety alignment: to ensure the AI doesn’t try to hurt you by suggesting you mix bleach and rubbing alcohol to clean your stovetop stains (spoiler alert: this creates chloroform, which is highly toxic and does not work like it does in the movies - you’ve been warned!)
(Optional) Performance alignment: where the model is trained to provide concise answers that do not encourage engagement, in order to reduce resource consumption. Think of GPT-4 in the past 6 months.
I’m not a big fan of 1. - though I understand the need for it, it doesn’t add any benefit for customers - or of 3., as they both degrade the model’s capabilities in a significant way. I would probably do without 2. as well, but your mileage may vary.
Anyway, if you’d like to put it in OpenAI’s words:
In China, though, an additional step is required before companies can offer their LLM services to the masses:
Doctrine alignment: to ensure the AI doesn’t offer views that are not in line with the Party’s narrative.
Indoctrination of AI was bound to come up, eventually. In the current geopolitical climate you can expect this to become a common trend. After all, step 1. can already be seen as a form of Company indoctrination.
Last March, China’s National Information Security Standardization Technical Committee (CNISSTC? Nope, it’s called TC260, I swear) published the Basic Security Requirements for Generative AI. The document highlights that:
The main security risks of the training data and generated answers are those related to the violation of the socialist values, for example, incitement to overthrow the socialist system, damaging the national image and endangering national security, undermining national unity and social stability, promoting terrorism and violence and spreading false and harmful information.
Among many other things, the document requires developers to create a database containing at least 10,000 keywords to be used to detect when conversations around any of the monitored topics are taking place.
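For flavor, here is a minimal sketch of the kind of keyword screening that requirement implies - the file name and helpers are hypothetical, and real deployments layer classifiers and human review on top of something like this:

```python
# Load a block list (the regulation asks for at least 10,000 entries) and flag any
# message that mentions a monitored term. Illustration only, not a real filter.
def load_keywords(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def flag_message(message: str, keywords: set[str]) -> list[str]:
    # Plain substring matching: crude, but it sidesteps tokenization, which is
    # non-trivial for Chinese text.
    return [kw for kw in keywords if kw in message]

# keywords = load_keywords("monitored_terms.txt")   # hypothetical file
# hits = flag_message(user_message, keywords)
# if hits: route_to_review(user_message, hits)      # hypothetical downstream step
```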
DeepSeek v2 appears to be the first model to satisfy these requirements, and it’s now widely available in China and to the rest of the world.
This LLM is also interesting for another reason. The architecture used - both the model definition and the checkpoints are available, making it effectively the first Chinese open LLM - has been developed to greatly reduce the cost of training and inference.
It does this by using a Mixture-of-Experts architecture with 2 shared experts and 160 routed experts! While the model is a massive 236B parameters, each token activates “only” 21B parameters (less than a third of what Llama 3 70B uses). This is where things get interesting.
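To make the shared-plus-routed split concrete, here is a minimal sketch of such a layer - toy dimensions and a made-up top-k, not the values from the paper, and without the load balancing and device-aware dispatch a real implementation needs:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Toy MoE layer in the spirit of DeepSeek v2: a couple of always-on shared
    experts plus a large pool of routed experts, only a few of which fire per token."""
    def __init__(self, d_model=512, d_ff=1024, n_shared=2, n_routed=160, top_k=6):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                # x: (num_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)      # shared experts see every token
        weights = self.router(x).softmax(dim=-1)         # (num_tokens, n_routed)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        routed_rows = []
        for t in range(x.size(0)):                       # naive per-token loop; real kernels batch this
            row = sum(w * self.routed[int(i)](x[t]) for w, i in zip(top_w[t], top_idx[t]))
            routed_rows.append(row)
        return shared_out + torch.stack(routed_rows)
```

Only a handful of the 162 experts (2 shared plus the routed top-k) run for any given token, which is how a 236B parameter count turns into roughly 21B active parameters per token.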
This architecture - together with the use of Multi-head Latent Attention (MLA), which compresses keys and values into a low-rank latent - saved the team 42% in training cost, cut the KV cache by 93% and boosted generation throughput by 5.76x compared to the first DeepSeek! To understand the impact of these choices, we need to run the numbers, so…
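The KV cache saving is the easiest one to visualize. Here is a deliberately simplified sketch of the low-rank idea - the dimensions are made up, and the real MLA also handles the RoPE portion of the keys separately:

```python
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy sketch of the low-rank KV idea behind MLA: instead of caching full per-head
    keys and values, cache one small latent per token and re-expand it at attention time."""
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress once per token
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to keys on demand
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to values on demand

    def cache_entry(self, h):        # h: (batch, d_model) hidden state of the new token
        return self.down(h)          # only d_latent numbers per token go into the KV cache

    def expand(self, cached):        # cached: (batch, seq_len, d_latent)
        return self.up_k(cached), self.up_v(cached)

# Per token, the cache shrinks from 2 * n_heads * d_head = 8,192 values (vanilla MHA)
# to d_latent = 512 values, a ~94% reduction in this toy setup - the same ballpark as
# the 93% the DeepSeek team reports.
```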
Let’s Run the Numbers
Every service in China runs at a scale that is unmatched by any other single country, including India, where the penetration of digital services is still noticeably lower (48% vs 74%).
Scale means cost, and DeepSeek v2 has been engineered to address this issue as much as possible. We have discussed at length why Transformers are expensive; if you missed it, go back and read the previous update. Bring kombucha on your way back, please.
To understand the cost impact we need to choose a metric, and we have two choices: model size or performance.
The first doesn’t make a lot of sense - there’s no point in doing an apples-to-apples comparison if one model is 7B and another is 400B - so let’s aim for performance. Even then, performance where? Languages? Math? Astrology? Kombuchianism?
We have to make an assumption: that a chatbot will mostly be used to chat and to answer questions. So, instead of coming up with a fancy formula to find the minimum distance across all dimensions between models - a cosine distance, to be frank, I just didn’t want to type in all the numbers from the benchmarks - let’s eyeball it from here:
Alright, our contender has been identified: Llama 3 70B!
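For the curious, the fancy formula I skipped would look something like this - a sketch with a hypothetical closest_model helper, where you would fill the dicts with the published benchmark scores yourself:

```python
import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def closest_model(target: dict, candidates: dict) -> str:
    """target: {benchmark: score} for DeepSeek v2; candidates: {model: {benchmark: score}}.
    Returns the candidate whose benchmark profile sits closest to the target."""
    benchmarks = sorted(target)
    target_vec = [target[b] for b in benchmarks]
    return min(candidates,
               key=lambda m: cosine_distance(target_vec, [candidates[m][b] for b in benchmarks]))

# Fill the dicts with the MMLU/BBH/GSM8K/etc. scores reported in the respective papers;
# the eyeballed answer above (Llama 3 70B) is what this would formalize.
```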
The paper reports an absolutely insane generation speed of 50,000+ TPS (tokens per second) on an 8xH800 node. Right, technically in China they don’t have access to a lot of H100s. The H800 is similar to the H100, but it has lower NVLink bandwidth and capped FP64 performance.
I had to do a lot of digging, but Together.ai claims they can run Llama 3 70B at 150 TPS at full FP16. Time for cocktail napkin math!
DeepSeek v2: 8xH800 or 640GB VRAM, 50,000+ TPS
Llama 3 70B: 2xH100 or 160GB VRAM, 150 TPS
Let’s pretend that the H800 and the H100 are more or less the same: in the case of DeepSeek v2 we get roughly 6,250 TPS per GPU, in the case of Llama 3 we get 75 TPS. Let me paste a quote from Nvidia:
Applying these metrics, a single NVIDIA H200 Tensor Core GPU generated about 3,000 tokens/second — enough to serve about 300 simultaneous users — in an initial test using the version of Llama 3 with 70 billion parameters.
That means a single NVIDIA HGX server with eight H200 GPUs could deliver 24,000 tokens/second, further optimizing costs by supporting more than 2,400 users at the same time.
Translated to our example we have:
DeepSeek v2: 8xH800 or 640GB VRAM, 50,000+ TPS
Llama 3 70B: 8xH200 or 1128GB VRAM, 24,000 TPS
In short: even on the most powerful Nvidia GPU, Llama 3 delivers less than half the throughput that DeepSeek v2 reports on its benchmarks, on a much lower GPU class.
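The same arithmetic, spelled out - nothing fancier than division, using the figures quoted above:

```python
# Cocktail napkin math, redone with the throughput figures quoted above.
deepseek_tps, deepseek_gpus = 50_000, 8        # DeepSeek v2 paper, 8x H800
llama_h100_tps, llama_h100_gpus = 150, 2       # Together.ai figure for Llama 3 70B at FP16
llama_h200_tps, llama_h200_gpus = 24_000, 8    # Nvidia's 8x H200 HGX figure

print(deepseek_tps / deepseek_gpus)        # 6250.0 tokens/s per H800
print(llama_h100_tps / llama_h100_gpus)    # 75.0 tokens/s per H100
print(llama_h200_tps / llama_h200_gpus)    # 3000.0 tokens/s per H200
print(deepseek_tps / llama_h200_tps)       # ~2.08x: DeepSeek v2 on H800s vs Llama 3 70B on H200s, per node
```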
Llama 3 vs DeepSeek v2 (as a Service)
What if we want to run these models as a service? The calculation here is easier. Granted, we don’t know what markup is applied by each company, but this exercise will give us a clear picture anyway (I’m assuming a 3:1 blend, i.e. 3 input tokens for every output token):
Amazon Bedrock Llama 3 70B (1M tokens): 2.9$
DeepSeek v2 (1M tokens): 0.18$
DeepSeek v2 is 16x cheaper than Llama 3 70B!
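A quick sanity check on that 16x, with a hypothetical blended_price helper thrown in for anyone who wants to redo the math from per-token prices:

```python
# blended_price is a hypothetical helper: given per-1M-token input/output prices,
# it returns the blended price for a 3:1 input:output mix.
def blended_price(price_in, price_out, in_tokens=3, out_tokens=1):
    return (price_in * in_tokens + price_out * out_tokens) / (in_tokens + out_tokens)

# Using the blended figures from the comparison above:
bedrock_llama3_70b = 2.90   # $/1M tokens
deepseek_v2 = 0.18          # $/1M tokens
print(bedrock_llama3_70b / deepseek_v2)   # ~16.1
```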
What About Performance?
Yes, the model is dirt cheap, but maybe it only produces random garbage?
I suggest you review the benchmarks provided on their project page and on HuggingFace🤗. In general though, DeepSeek v2 performs similarly to Llama 3 on most tasks; it’s much better on Chinese benchmarks, and it does a little better than Llama 3 on the Math dataset.
I’m eager to see benchmarks from third parties, just to confirm that the model really has such crazy throughput! The pricing speaks for itself, unless they’re operating at a massive loss just to acquire a customer base. But if the numbers are inflated, the lie will be short-lived: verifying throughput won’t take long.
In short: DeepSeek v2 appears to deliver on value. Yes, it’s large compared to Llama 3 70B, but it also activates less than a third of the parameters per token and, if we believe the benchmarks, it’s much, much faster.
The Inference Market
At this point you might be wondering where we are headed, and I am too. While no one knows the future, we can make a couple of educated guesses.
Chinese CSPs have made it clear to Nvidia that they do not want its downgraded GPUs, especially when, after initial testing, the performance is worse than expected. On the horizon we have an increasingly competitive Huawei: FunTalk tested the Ascend 910B and reported that:
Huawei Ascend 910B currently achieves nearly 80% of the performance level of Nvidia’s A100. Data-center service providers consider this chip to be the preferred “alternative” for domestic companies.
Sure, Huawei is not yet competing head to head with Nvidia, but it has shown it can bring a competitive AI chip to the local market, and that chip will only get better from here on.
A chip at 80% of an A100 is well on track to reach H100 territory in the medium term, so CSPs in China have a good reason to reject the new GPUs from Nvidia. They can opt for a valid alternative without incurring the risk of supply chain constraints from potential new sanctions.
We have already discussed how an important part of AI research is shifting towards model optimization, rather than developing brand new capabilities, and this is a necessary step to ensure the viability of the entire AI business model. AI is powerful, but if it is to run at scale, it needs to be sustainable, and that means: reduced cost of inference and reduced energy requirements.
All major CSPs are developing their own chips, both to reduce reliance on Nvidia and to reduce the cost of inference. Groq, Cerebras, Tenstorrent and SambaNova are all scaling up their capacity, providing dedicated chips to serve models at scale, and in a cost-effective way, to those organizations that won’t be able to access Google/Microsoft/Amazon/Meta chips directly.
Meta has deployed a major LLM to billions of people, but it still has to find a viable business model for it; meanwhile, it’s burning through a lot of cash. OpenAI bleeds more money than it makes with ChatGPT. On the bright side, it looks like Altman’s attempt at lobbying against open models is taking a pause. But the threat from Meta charging ahead with Llama 3 70B and beating GPT-3.5 in every respect, and Anthropic getting on par with (or better than) GPT-4, should not be taken lightly.
A Fast-Approaching Turning Point
The inference market is maturing and seems to be close to its first turning point. Companies have realized that, as much as they love LLMs, they also love their own data remaining private. Additionally, they like to maintain control over their models’ lifecycle (updates, alignment, fine-tuning), a task for which the heavily guarded public APIs are not always suitable.
In a rare move towards openness, OpenAI introduced the Model Spec, from their blog: “a new document that specifies how we want our models to behave in the OpenAI API and ChatGPT”.
A sign that OpenAI is taking notes, as customers complain in frustration about models changing behavior in seemingly random ways - for the worse, at least in the past 6 months - and without explanation from update to update. Also a possible realization, I hope, that their extreme opacity is not helping companies build trust in them.
As organizations start to realize how expensive it is to run - and maintain - a RAG+GPT-4 pipeline, they are looking at ways to mitigate these costs without taking a massive hit on performance and, while they’re at it, at safeguarding their own data.
With these elements now clear, let’s draw a bottom line:
Nvidia: there is still strong demand for Nvidia GPUs, and there will be for the foreseeable future, but I heard - in a podcast, not in some elevator during a secret meeting of chip manufacturers, unfortunately - that Jensen is now shifting attention to give more priority to the likes of CoreWeave. The strategy makes sense: CSPs are making their own chips, CoreWeave and similar providers won’t, and they will fuel the majority of demand after this first wave.
Meta: they recently presented the second generation of MTIA chips; they have Llama 3 400B in training, which will almost certainly surpass GPT-4, and it will be open. They also have access to one of the largest sources of data in existence: Facebook & Instagram.
Amazon/Google/Microsoft: they have all announced new accelerator chips, intended for the most part for internal use - imagine a ModelX-as-a-Service type of offering plus internal workloads (recommendation, advertising, assistants, etc.). Microsoft is adding internal AI talent, possibly in a move to reduce its dependency on OpenAI.
OpenAI: they’re testing this gpt2 model (literally: im-a-good-gpt2-chatbot) in the LMSYS Chatbot Arena, which might or might not be an early preview of GPT-5, but they’re definitely getting squeezed by both Microsoft and Meta. What will happen to the 100Bn$ Stargate datacenter Microsoft+OpenAI are planning to open in 2028? And more importantly: who’s going to fund it? I sense that OpenAI might end up in a difficult position without preferential access to accelerators at scale.
Groq/Cerebras/Tenstorrent/SambaNova: they’re not targeting CSPs (which need and want their own custom designs); rather, they aim for those datacenters and entities that are cost-sensitive and need to accelerate inference and training beyond what’s possible with GPUs. The plan is solid, but all these vendors are still relatively expensive, and they will remain so until economies of scale kick in. While we are not there yet, the time seems to be right. Who’s going to win? As cost per PFLOP converges, the software ecosystem will be a decisive factor, unless one of these architectures suddenly turns out to be vastly superior to the others (so far, this doesn’t seem to be the case).
Putting everything together, you’ll notice that - if you’re an enterprise or, even better, a government - OpenAI’s appeal is waning: do you really need to pay a premium - and give up control altogether - for GPT-4 when an open model is available to you at a literal fraction of the cost, with none of the downsides? Anthropic, I’m 👀 looking at you too.
We cannot make a case for OpenAI’s data edge anymore. While that was true in 2022 to some extent, Meta and Google have no shortage of customer data, and those datasets, unlike the ones acquired by OpenAI, will keep growing. Sure, GPT-5 will come out and it will be glorious, and they will add data control capabilities too, but if their lead only lasts 6 months, you might be better off with a model you can fully control.
Even setting OpenAI aside, most companies don’t seem to be able to run LLMs at scale in a sustainable way. One of the reasons (excluding the lack of a business model) might be simple: the way models are served today is not cost-effective due to hardware costs (capex and opex). But there might be a way around it: accelerators.
Accelerators, whether for internal use or not, are becoming hot. There is an objective need for specialized hardware for AI - and not because we don’t like Jensen 🫶, but because GPUs were adapted to this task, while this new hardware has been designed for it, with clear upsides: lower cost, better performance and (soon) wider availability.
Mass availability of custom chips will probably have two effects: a shift from API providers to MaaS (Model as a Service), unless top performance is an absolute requirement, and the growth of flexible inference services. And mind you, these will be hosted on CSPs as well and exposed as APIs, but customers will have full control over the models’ lifecycle, updates and alignment.
See you next time!