Tencent Hunyuan cuts AI compute 75% with new sparse attention algorithm

Tencent Holdings Ltd.'s Hunyuan AI team has developed a sparse attention algorithm that achieves near-dense-attention accuracy using 75% less computing power, an efficiency gain that could cut inference costs for long-context reasoning by millions of dollars a year at scale.

"Stem re-examines block-level sparsity from the perspective of causal information flow, which prior approaches overlooked," the Tencent Hunyuan research team said in a technical paper detailing the algorithm.

The algorithm introduces two innovations: Token Position Decay, which weights tokens based on their distance in the sequence, and Output-Aware Metric, which selects attention blocks based on their contribution to the final output. At the operator level, the open-sourced HPC Stem+BSA operators cut first-token latency by a factor of 3.7 at a 128,000-token context window, the team reported.
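To make the two ideas concrete, the following is a minimal Python sketch of block-level selection for causal attention. It is not the Stem algorithm or the HPC Stem+BSA operators. The function name select_blocks, the scoring formula, the use of value-block norms as a stand-in for "contribution to the output", and the linear distance penalty are all assumptions chosen for illustration.

```python
import numpy as np

def select_blocks(q, k, v, block=4, keep=0.25, decay=0.05):
    """Toy block selection for causal attention (illustrative only, not Stem)."""
    n, d = q.shape
    nb = n // block  # assumes n is a multiple of block
    # Mean-pool each block of queries and keys into one representative vector.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    # Assumed output-aware term: key blocks whose values have larger norms
    # can move the attention output more, so they receive a bonus.
    vnorm = np.linalg.norm(v.reshape(nb, block, d), axis=2).mean(axis=1)
    score = qb @ kb.T / np.sqrt(d) + np.log(vnorm + 1e-9)[None, :]
    # Assumed position decay: penalise key blocks far behind the query block.
    i = np.arange(nb)[:, None]
    j = np.arange(nb)[None, :]
    score = score - decay * (i - j)
    # Causal constraint: a query block sees only itself and earlier blocks.
    score = np.where(j <= i, score, -np.inf)
    # Always keep the diagonal (local) block.
    score[np.arange(nb), np.arange(nb)] = np.inf
    budget = max(1, int(np.ceil(keep * nb)))
    mask = np.zeros((nb, nb), dtype=bool)
    for r in range(nb):
        # Row r has only r + 1 causal candidates, so cap the budget there.
        top = np.argsort(-score[r])[: min(budget, r + 1)]
        mask[r, top] = True
    return mask

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
mask = select_blocks(q, k, v)
print(mask.astype(int))
```

Each row of the printed mask marks the key blocks that one query block would attend to. With keep=0.25, about a quarter of the blocks survive per row, which is the kind of budget a 75% compute saving implies.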

Tencent, which trades at about 20 times forward earnings, has been investing heavily in its Hunyuan model to compete with Alibaba Group Holding Ltd.'s Qwen, Baidu Inc.'s Ernie and DeepSeek. Lower inference costs could improve margins for Tencent's cloud business and enable more affordable AI features across WeChat, which has more than 1.3 billion monthly active users.

Competitive Landscape Intensifies

The efficiency gain arrives as China's AI model race enters a cost-cutting phase. DeepSeek's V3 model, released in late 2024, demonstrated that competitive performance was possible at a fraction of the training cost of frontier US models. Tencent's Stem algorithm targets the inference side — the recurring expense of running models in production — which accounts for 60 percent to 80 percent of total AI workload costs for deployed applications, according to industry estimates.

Alibaba's Qwen team has also published sparse attention research, while Baidu has optimized its Ernie model for long-context tasks. Tencent's decision to open-source the HPC Stem+BSA operators distinguishes its approach, allowing developers to integrate the efficiency gains without proprietary licensing.

What the 3.7x Latency Reduction Means

A 3.7-fold reduction in first-token latency at 128,000-token contexts, meaning the first token arrives in roughly 27% of the baseline time, is significant for real-time applications. For a WeChat AI agent processing a long customer-service conversation, that could mean a response starting in seconds rather than tens of seconds. Separately, Citi analysts said in a note that Tongcheng Travel Holdings Ltd. could benefit from potential close collaboration with Tencent's WeChat AI Agent, and reiterated a buy rating on the stock.

The 128,000-token context window is comparable to what leading models offer: OpenAI's GPT-4 Turbo supports 128,000 tokens, while Anthropic's Claude 3.5 Sonnet supports 200,000. Tencent's algorithm could give Hunyuan a cost advantage in the long-context segment, where the compute cost of standard dense attention scales quadratically with sequence length.
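The quadratic scaling can be checked with simple arithmetic. The sketch below counts query-key scores for dense attention at a few context lengths and compares them with a sparse variant that keeps a quarter of them. The function name attention_score_count is hypothetical, and applying the reported 75% saving uniformly to score computation is an assumption, not something the paper is quoted as stating.

```python
def attention_score_count(n_tokens: int) -> int:
    # Dense attention compares every query with every key: n * n scores.
    return n_tokens * n_tokens

for n in (32_000, 64_000, 128_000):
    dense = attention_score_count(n)
    # Assumption: a 75% compute saving means keeping one score in four.
    sparse = dense // 4
    print(f"{n:>7} tokens: dense={dense:,} sparse={sparse:,}")
```

Doubling the context from 64,000 to 128,000 tokens quadruples the dense score count, so a fixed percentage saving is worth more in absolute terms as contexts grow.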

Investment Implications

For Tencent, the cost savings compound across its AI footprint. The company reported cloud revenue of 53.3 billion yuan ($7.4 billion) in fiscal 2024, with AI-related workloads a growing component. Every percentage-point reduction in inference cost improves margins in a business where Tencent competes with Alibaba Cloud and Huawei Cloud on price.

The open-source strategy also carries strategic logic. By releasing the HPC operators publicly, Tencent stands to gain community contributions and ecosystem adoption: developers who build on Stem-optimized infrastructure are more likely to deploy Hunyuan models. This mirrors Meta Platforms Inc.'s approach with its Llama model series, which has become one of the most widely adopted open-weight AI model families.

This article is for informational purposes only and does not constitute investment advice.