AI inference reshapes memory demand, creating two new growth markets

The shift from AI training to inference is reshaping the memory industry in ways that extend well beyond HBM, with KV cache offloading and agentic AI workloads creating two distinct growth markets for enterprise SSDs and LPDRAM.

"AI's memory system will completely transform storage systems," Nvidia founder and Chief Executive Officer Jensen Huang said at the GTC Taipei conference in June 2026, calling memory infrastructure one of the most challenging parts of the AI stack.

The structural shift is driven by two forces. First, inference workloads are generating an explosion in demand for KV cache, the dynamic memory that holds the key and value vectors computed during the prefill phase so that they do not have to be recomputed during decoding. Nvidia data shows the average output token count per query has surged more than fivefold annually since the second half of 2024, reaching about 30,000 to 40,000 tokens. When GPU HBM capacity is exhausted, systems must discard cached entries and recompute them, raising latency and total cost of ownership.
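To give a sense of scale for the paragraph above, here is a rough back-of-envelope sketch of how KV cache size grows with token count. The model shape (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical example, not a figure from Nvidia or from this article, and the function name kv_cache_bytes is made up for illustration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate the KV cache size of a single sequence, in bytes.

    Every token stores one key vector and one value vector (the factor
    of 2) for each layer and each KV head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

# Token counts taken from the range cited in the article.
for tokens in (30_000, 40_000):
    size_gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>6} tokens -> {size_gib:.1f} GiB per sequence")
```

Under these assumptions a single sequence needs roughly 9 to 12 GiB for its output tokens alone, before counting the prompt, which suggests why many concurrent long queries can exhaust HBM.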

To address this, Nvidia released its Dynamo software in March 2025. Dynamo offloads less frequently accessed KV cache to cheaper memory tiers, including CPU DRAM and SSDs. In January 2026, the company followed with the CMX Context Memory Storage Platform, managed by the BlueField-4 DPU. Each rack uses 64 BlueField-4 DPUs to manage about 9,600 terabytes of capacity, inserting a new "G3.5" pod-level context storage layer between local SSD and shared storage. At Computex 2026, Nvidia's BlueField-4 DPU structural model already contained SK Hynix PEB210 E1.S and PE9010 M.2 SSD samples, signaling that the SSD POD sub-market is moving from concept to hardware.
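The general idea of tiered offloading can be sketched as a toy cache that demotes the coldest block to the next cheaper tier instead of discarding it. This is a minimal illustration only: it is not Dynamo's or CMX's actual API or eviction policy, and the class name TieredKVCache, the tier names and the tier sizes are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU cache: cold blocks move down a tier rather than being dropped."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id, data):
        """Store a block in the fastest tier, replacing any older copy."""
        for _, _, store in self.tiers:
            store.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return (data, tier_name) and promote the block to the fastest tier."""
        for name, _, store in self.tiers:
            if block_id in store:
                data = store.pop(block_id)
                self._insert(0, block_id, data)
                return data, name
        return None, None  # miss: the caller would have to recompute

    def location(self, block_id):
        """Name of the tier currently holding the block, or None."""
        for name, _, store in self.tiers:
            if block_id in store:
                return name
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the last tier: the block is finally discarded
        _, cap, store = self.tiers[level]
        store[block_id] = data  # newest entry sits at the end
        if len(store) > cap:
            # Demote the least recently used block to the next tier.
            cold_id, cold_data = store.popitem(last=False)
            self._insert(level + 1, cold_id, cold_data)


# Hypothetical tier sizes, counted in cache blocks.
cache = TieredKVCache([("hbm", 2), ("dram", 2), ("ssd", 4)])
for block in ("a", "b", "c"):
    cache.put(block, f"kv-{block}")
print(cache.location("a"))  # "a" was the coldest block, so it moved to dram
```

A hit in a slower tier costs a transfer rather than a full recomputation, which is the trade-off the offloading approach relies on.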

Agentic AI Reshapes CPU Memory Demand

The second driver is agentic AI, where models must actively plan, call tools, make decisions and execute agent loops — all tasks handled by the CPU. Huang has said agents live in a nanosecond-scale world where ultra-low latency is paramount, elevating the importance of CPU architecture.

TrendForce estimates that as agentic AI deployments scale, the CPU-to-GPU workload ratio will shift from the traditional 1:4 or 1:8 toward roughly 1:1, creating significant incremental demand for CPU-attached memory. Nvidia's Vera CPU, launched in 2026 for agentic workloads, supports up to 1.5 terabytes of LPDDR5X — three times the capacity of its predecessor Grace.
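The arithmetic behind that incremental demand can be sketched as follows. Two assumptions go beyond the article: the workload ratio is treated as a device-count ratio, and the pool of 8 GPUs is hypothetical. The 1.5 terabytes per CPU is the Vera figure quoted above, and the function name cpu_memory_tb is made up for illustration.

```python
def cpu_memory_tb(num_gpus, gpus_per_cpu, tb_per_cpu):
    """CPU-attached memory (TB) for a GPU pool at a given GPUs-per-CPU ratio."""
    cpus = num_gpus / gpus_per_cpu
    return cpus * tb_per_cpu


VERA_TB = 1.5  # maximum LPDDR5X per Vera CPU, as cited in the article
NUM_GPUS = 8   # hypothetical pool size, for illustration only

# Assumption: the CPU-to-GPU workload ratio maps directly to device counts.
for label, gpus_per_cpu in (("1:8", 8), ("1:4", 4), ("1:1", 1)):
    total = cpu_memory_tb(NUM_GPUS, gpus_per_cpu, VERA_TB)
    print(f"CPU:GPU {label} -> {total:.1f} TB of CPU-attached memory")
```

Under these assumptions, moving from 1:8 to 1:1 multiplies CPU-attached memory for the same GPU pool by eight, from 1.5 TB to 12 TB.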

However, TrendForce reported that Nvidia has halved the SOCAMM memory capacity on the next-generation Vera Rubin superchip module, citing insufficient LPDRAM capacity allocated to Nvidia in suppliers' preliminary production plans for 2027. The adjustment reflects near-term supply constraints rather than a reduction in Nvidia's overall memory demand.

The broader CPU market is undergoing its own generational refresh for agentic AI. Intel launched Xeon 6+ (Clearwater Forest), AMD released EPYC Venice, Arm introduced the Arm AGI CPU, and Ampere's AmpereOne MX is expected to enter production this year. The multi-vendor competition is accelerating CPU memory demand growth across the industry.

Investment Implications

For memory investors, the two trends point to growth markets beyond HBM. Enterprise SSDs are gaining a new demand vector from KV cache offloading as Nvidia, Google and other platform vendors roll out SSD POD architectures. LPDRAM is seeing structural demand expansion from the CPU side as agentic AI pushes server architectures toward balanced CPU-GPU configurations.

The supply constraint flagged for Nvidia's Vera Rubin suggests near-term LPDRAM capacity may be tight, which would benefit established memory manufacturers including SK Hynix, Samsung Electronics and Micron Technology, which control the bulk of LPDRAM production. For SSD makers, the emergence of dedicated context storage tiers in AI infrastructure represents a new addressable market that did not exist two years ago.

This article is for informational purposes only and does not constitute investment advice.