
    DeepSeek claims its new AI model can cut the cost of predictions by 75% – here’s how

    By AI Logic News | October 7, 2025
    [Image: DeepSeek. Nikolas Kokovlis/NurPhoto via Getty Images]



    ZDNET key takeaways

    • DeepSeek unveils a new AI model focused on cost efficiency.
    • The main innovation is a reduction in compute to run attention.
    • The innovation is not revolutionary; it’s evolutionary. 

    The Chinese artificial intelligence startup DeepSeek AI, which stunned the world in January with claims of dramatic cost efficiency for generative AI, is back with the latest twist on its use of the technology to drive down the price of computing. 

    Last week, DeepSeek unveiled its latest research, DeepSeek-V3.2-Exp. On its corporate blog, the company claims the new model can cut the cost of making predictions, known as inference, by 75%, from $1.68 per million tokens to 42 cents. 
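
    As a quick sanity check on that claim, a 75% cut from the quoted rate does land on the quoted figure (per-million-token prices as given in DeepSeek's blog post):

```python
# Quick arithmetic check of the quoted inference-cost reduction.
old_cost = 1.68                    # dollars per million tokens, as quoted
new_cost = old_cost * (1 - 0.75)   # a 75% reduction
print(f"${new_cost:.2f} per million tokens")   # -> $0.42
```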

    Also: DeepSeek may be about to shake up the AI world again – what we know 

    As was the case in January, DeepSeek is drawing on design techniques for gen AI neural nets, part of a broad approach within deep learning, to squeeze more out of computer chips by exploiting a phenomenon known as “sparsity.”

    The magic of sparsity

    Sparsity is like a magic dial that finds the best match for your AI model and available compute.

    Sparsity comes in many forms. Sometimes, it involves eliminating data that doesn’t materially affect the AI model’s output, so the model produces roughly the same answer with less work. The same economic rule of thumb that has held for every new generation of personal computers applies: either a better result for the same money or the same result for less money.

    Also: What is sparsity? DeepSeek AI’s secret, revealed by Apple researchers

    In its earlier work, DeepSeek used the sparsity approach of turning off large sections of neural network “weights” or “parameters” to reduce total computational cost. 
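
    As an illustration of that earlier style of sparsity (not DeepSeek's actual procedure), magnitude pruning simply zeroes out the smallest weights so they no longer contribute to, or cost anything in, the computation:

```python
import numpy as np

# Illustrative magnitude pruning: switch off the smallest-magnitude weights.
# A generic sketch of weight sparsity, not DeepSeek's actual procedure.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))

sparsity = 0.75                                     # fraction of weights to switch off
threshold = np.quantile(np.abs(weights), sparsity)  # cut-off below which weights are dropped
mask = np.abs(weights) >= threshold                 # keep only the largest 25%
pruned = weights * mask                             # zeroed weights cost nothing downstream

print(f"nonzero weights: {mask.sum()} of {weights.size}")
```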

    In the new work, as detailed in the technical paper posted on GitHub by DeepSeek researchers, the key is retraining the neural net to pay attention to only a subset of the data rather than all of it.

    Paying better attention

    One of the most expensive computing operations in training a neural network for applications, such as chatbots, is what’s known as the “attention” mechanism. Attention compares each word you type to prior words, known as the context, and to a vocabulary of words the AI model has in its memory. 

    The technical term for what you type at the prompt is the “query,” and the words to compare to, or stored in memory, are known as “keys.” When the attention mechanism finds a match between your query and a stored key, it can select what’s called a “value” from the vocabulary to output as the next word or words. 
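
    In code, standard dense attention amounts to scoring the query against every key and using the resulting weights to mix the values. A minimal NumPy sketch with made-up dimensions (not DeepSeek's implementation):

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard scaled dot-product attention: every query is scored against every key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n_queries, n_keys) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted mix of the values

rng = np.random.default_rng(0)
n, d = 8, 16                                           # 8 context tokens, 16-dim embeddings
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(dense_attention(Q, K, V).shape)                  # (8, 16): one output vector per query
```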

    Also: Companies are making the same mistake with AI that Tesla made with robots

    The term “word” here is a shorthand for what goes on under the hood. As with all AI models, DeepSeek’s program turns words, word fragments, letters, and punctuation into “tokens,” atomic objects that are each given a numeric value when stored in the model’s vocabulary.
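
    A toy example of the idea, using a hypothetical six-entry vocabulary (real tokenizers learn tens of thousands of sub-word pieces from data, but the mapping from text to integer IDs works the same way):

```python
# Hypothetical toy vocabulary for illustration only.
vocab = {"Deep": 0, "Seek": 1, " cuts": 2, " inference": 3, " cost": 4, ".": 5}

def tokenize(text, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    while text:
        match = max((p for p in vocab if text.startswith(p)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token for: {text!r}")
        tokens.append(vocab[match])
        text = text[len(match):]
    return tokens

print(tokenize("DeepSeek cuts inference cost.", vocab))   # -> [0, 1, 2, 3, 4, 5]
```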

    The attention operation needs to compare a numeric score of the query token to every key token, which it does by matrix multiplication. As the number of tokens a model handles grows, and as more “context,” meaning recent tokens, is employed, the compute cost grows quadratically.
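
    Because every query token is scored against every key token, the number of comparisons is the square of the context length, which is why long contexts get expensive so quickly:

```python
# Dense attention compares every query token against every key token,
# so the number of comparisons grows with the square of the context length.
for context_length in (1_000, 2_000, 4_000, 8_000):
    comparisons = context_length ** 2
    print(f"{context_length:>6} tokens -> {comparisons:>12,} query-key comparisons")
```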

    As an alternative approach, the researchers take the prior version of the AI model, DeepSeek-V3.1, “Terminus,” and add what they call a “lightning indexer.”

    In what is known as a “sparse training” procedure, they separately train both the V3.1 model and the lightning indexer from scratch. The V3.1 part has the normal attention mechanism. The lightning indexer doesn’t and is instead trained to find, from among the entire vocabulary of tokens, a much smaller subset that is more likely to be relevant.
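
    The sketch below is only a stand-in for the idea, not the paper's design: a cheap, low-dimensional scoring pass picks the k keys most likely to matter for each query, and full attention then runs over just that subset.

```python
import numpy as np

def sparse_topk_attention(Q, K, V, k, d_index=4, seed=0):
    """Illustrative top-k sparse attention: a cheap low-dimensional index score
    selects a small subset of keys per query, and full attention runs only on
    that subset. A stand-in for the idea behind DeepSeek's lightning indexer,
    not its actual design."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(d, d_index))                 # hypothetical low-dim projection for indexing
    index_scores = (Q @ proj) @ (K @ proj).T             # cheap relevance scores, (n_queries, n_keys)
    top_k = np.argsort(-index_scores, axis=-1)[:, :k]    # k most promising keys per query
    out = np.empty_like(Q)
    for i, idx in enumerate(top_k):
        scores = Q[i] @ K[idx].T / np.sqrt(d)            # full attention over only k keys, not all n
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(1)
n, d, k = 1_000, 64, 32                                  # 1,000 context tokens; attend to only 32 per query
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(sparse_topk_attention(Q, K, V, k).shape)           # (1000, 64)
```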

    Lightning strikes

    The point of this approach is that, by working with a subset, the indexer reduces the mass of query-key searches at prediction time and thereby consumes less compute power each time a prediction needs to be made.

    “Its computational efficiency is remarkable,” the research authors said of the indexer.

    Also: OpenAI’s Altman calls AI sector ‘bubbly’, but says we shouldn’t worry – here’s why

    The result of the lightning indexer is that their sparsity approach, which DeepSeek calls DeepSeek Sparse Attention, “requires much less computation” in their tests against V3.1, and results in “a significant end-to-end speedup in long-context scenarios.” 

    Moreover, the authors said: “We do not observe substantial performance degradation compared with DeepSeek-V3.1-Terminus, on both short- and long-context tasks” with respect to accuracy.

    Mind you, it’s not only sparsity. There are a couple of other tweaks they used, including training V3.2 on domain-specific task data, such as for mathematics problems and coding. 

    The authors said that more extensive real-world testing is necessary and is underway. 

    Evolutionary not revolutionary

    Given the hype that has surrounded DeepSeek since January, it’s worth keeping in mind that the lightning indexer and DeepSeek Sparse Attention are simply the latest offerings in a long tradition of sparsity exploitation, as I pointed out in a previous article.

    For many years, researchers have specifically explored ways to reduce the computational burden of the key-value calculations. There have been numerous variants of attention used to reduce query-key cost, leading researchers to develop a taxonomy. 

    The original attention method is referred to as “multi-head attention.” Other approaches have been “multi-query attention,” “grouped-query attention,” and “flash attention.” DeepSeek even has its own brand of attention, “multi-head latent attention,” introduced in V3.1 and preserved in V3.2, an approach that already brought benefits to that earlier model.
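
    Several of those variants mainly differ in how many separate key/value heads the model keeps around (flash attention, by contrast, reorders the computation for memory efficiency rather than changing the heads). A rough comparison with made-up head counts:

```python
# Rough comparison of attention variants by how many separate key/value heads
# they keep. Head counts and sizes are made up for illustration (32 query heads).
head_dim, context_len = 128, 4_096

variants = {
    "multi-head attention (one KV head per query head)": 32,
    "grouped-query attention (one KV head per group of 4 query heads)": 8,
    "multi-query attention (a single KV head shared by all query heads)": 1,
}

for name, n_kv_heads in variants.items():
    # Entries in the key/value cache the model must hold for this context length.
    kv_cache_entries = 2 * n_kv_heads * head_dim * context_len
    print(f"{name}: {kv_cache_entries:,} cached values")
```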

    Given that there have been, and likely will continue to be, innovations to the attention mechanism from many parties, this DeepSeek innovation looks more evolutionary than revolutionary.  
