

For a long time, there’s been an active discussion about finding a better architecture for large language models (LLMs) beyond the transformer. Well, two months into 2025, a California-based startup seems to have a promising solution.

Inception Labs, founded by professors from Stanford, the University of California, Los Angeles (UCLA), and Cornell, has introduced Mercury, which the company claims to be the first commercial-scale diffusion large language model. 

Mercury is ten times faster than current frontier models. According to Artificial Analysis, an independent benchmarking platform, the model’s output speed exceeds 1,000 tokens per second on NVIDIA H100 GPUs, a speed previously possible only with custom chips.

“Transformers have dominated LLM text generation and generate tokens sequentially. This is a cool attempt to explore diffusion models as an alternative by generating the entire text at the same time using a coarse-to-fine process,” Andrew Ng, founder of DeepLearning.AI, wrote in a post on X.  

Ng’s last phrase is key to understanding why Inception Labs’ approach seems interesting. Andrej Karpathy, a former researcher at OpenAI who is currently leading Eureka Labs, helps us understand this better. In a post on X, he said that LLMs based on transformers are trained autoregressively, meaning they predict words (or tokens) from left to right.

However, diffusion is a technique that AI models use to generate images and videos. “Diffusion is different – it doesn’t go left to right, but all at once. You start with noise and gradually denoise into a token stream,” added Karpathy. 
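To make the contrast concrete, here is a minimal, purely illustrative Python sketch of the two generation styles. The “model” is a random stand-in, the vocabulary is toy data, and the masked-denoising loop is just one simplified way to realise coarse-to-fine generation; Inception Labs has not published the details of Mercury’s actual method.

```python
# Toy contrast between autoregressive decoding and a masked, diffusion-style
# coarse-to-fine loop. Illustrative only: fake_model() is a hypothetical
# stand-in that samples random tokens instead of running a trained network.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
MASK = "<mask>"

def fake_model(context):
    """Hypothetical stand-in for a trained network: returns a random token."""
    return random.choice(VOCAB)

def autoregressive_generate(length=8):
    """Transformer-LLM style decoding: one token at a time, left to right."""
    tokens = []
    for _ in range(length):
        tokens.append(fake_model(tokens))  # each step sees only the prefix so far
    return tokens

def diffusion_generate(length=8, steps=4):
    """Coarse-to-fine generation: start from a fully masked ('noisy') draft
    and fill in batches of positions in parallel at every step."""
    tokens = [MASK] * length
    masked = list(range(length))
    for step in range(steps):
        # Unmask a chunk of positions this step; everything is filled by the end.
        k = max(1, len(masked) // (steps - step))
        for i in random.sample(masked, k):
            tokens[i] = fake_model(tokens)  # conditions on the whole draft, not just a prefix
            masked.remove(i)
        print(f"step {step + 1}: {' '.join(tokens)}")
    return tokens

if __name__ == "__main__":
    print("autoregressive:", " ".join(autoregressive_generate()))
    print("diffusion-style:")
    diffusion_generate()
```

Because the diffusion-style loop revisits the whole draft at every step rather than committing to a left-to-right prefix, many positions can be produced and corrected in parallel, which is where the claimed speed and refinement advantages come from.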

Karpathy also indicated that Mercury has the potential to be different and showcase new possibilities. And, as per the company’s testing, it does make a difference in output speed.

In the company’s evaluation across standard coding benchmarks, Mercury surpasses the performance of speed-focused small models like GPT-4o Mini, Gemini 2.0 Flash, and Claude 3.5 Haiku. The Mercury Coder Mini model achieved 1,109 tokens per second.

Source: Artificial Analysis

Moreover, the startup said diffusion models have an advantage in reasoning and in structuring their responses because they are not restricted to considering only their previous outputs; they can also continuously refine their output to reduce hallucinations and errors. Diffusion techniques already power image and video generation tools such as Midjourney and Sora.

The company also took a subtle dig at the techniques used by current reasoning models and their bet on inference-time scaling, which uses additional compute while generating the output.

“Generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. A paradigm shift is needed to make high-quality AI solutions truly accessible,” the company said. 
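A quick back-of-the-envelope calculation shows why latency balloons with long traces. The trace length and the autoregressive decoding rate below are illustrative assumptions, not figures from Inception Labs; only the 1,000 tokens-per-second rate reflects the Artificial Analysis measurement cited above.

```python
# Back-of-the-envelope latency comparison for a long reasoning trace.
# The 10,000-token trace length and the 100 tokens/s autoregressive rate
# are illustrative assumptions; 1,000 tokens/s is roughly the Mercury
# speed reported by Artificial Analysis.
trace_tokens = 10_000

for name, tokens_per_second in [("assumed autoregressive model", 100),
                                ("Mercury-class diffusion model", 1_000)]:
    seconds = trace_tokens / tokens_per_second
    print(f"{name}: ~{seconds:.0f} s to generate the trace")
```

Under these assumptions, the same reasoning trace takes on the order of minutes with conventional decoding but only seconds at Mercury-class throughput.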

Inception Labs has released a preview version of the Mercury Coder, which allows users to test the model’s capabilities.

Small models optimised for speed appear to be under threat, but what about specialised hardware providers like Groq, Cerebras and SambaNova?

Are Groq, Cerebras, and SambaNova Under Threat?

It isn’t without reason that NVIDIA became the world’s most valuable company amid the AI frenzy. Its GPUs are the ubiquitous choice for training AI models.

However, the company’s Achilles heel has been delivering low-latency, high-speed outputs, something even Jensen Huang, CEO of NVIDIA, has acknowledged. This opened up the opportunity for companies like Groq, Cerebras, and SambaNova to build hardware dedicated to high-speed inference.

Until now, Mercury’s speed had only been matched by models hosted on specialised inference platforms, for instance, Mistral’s Le Chat running on Cerebras.

Recently, Jonathan Ross, CEO of Groq, said that people will continue to buy NVIDIA GPUs for training, but high-speed inference will necessitate specialised hardware. Does Mercury’s breakthrough suggest a threat to this ecosystem? 

Moreover, Inception Labs said that diffusion LLMs can serve as a drop-in replacement in all current use cases, such as RAG, tool use, and agentic workflows. But this isn’t the first time a diffusion model for language has been explored. In 2022, a group of Stanford researchers published research on a similar technique but observed that inference was slow.

“Interestingly, the main advantage now [with Mercury] is speed. Impressive to see how far diffusion LMs have come!” said Percy Liang, a Stanford professor, comparing Mercury with the earlier study.

Similarly, a group of researchers from China recently published a study on a diffusion language model they built called LLaDA. The researchers said the 8-billion-parameter version of the model offered competitive performance, with benchmark evaluations showing it outperformed comparable models on several tests.


