Google released a new compression algorithm this week that it says can shrink the memory an AI model needs during inference by at least six times, and the announcement rattled one of the hottest trades in tech. Shares of SK Hynix fell as much as 6.4% on the Korea Exchange. Samsung dropped nearly 5%. Japan's Kioxia, which had surged more than 700% since August on AI-fuelled optimism, slid sharply. Micron and Sandisk both took hits in New York.

The two-day selloff cracked open a fault line that had been quietly building under memory stocks: the assumption that AI demand for chips would only ever go up.
The algorithm targets the part of AI that ‘remembers’ your conversation
The algorithm, called TurboQuant, goes after something called the key-value (KV) cache. Every time a large language model processes a conversation, it stores past calculations in this cache, a digital cheat sheet, so it doesn't have to recompute them from scratch. The longer the conversation, the heavier the cheat sheet gets, and the faster it eats into GPU memory.

TurboQuant compresses that cheat sheet aggressively using two techniques. The first, PolarQuant, converts the high-dimensional vectors that AI models rely on from standard XYZ coordinates into polar form: the difference between saying "Go 3 blocks East, 4 blocks North" and "Go 5 blocks at a 37-degree bearing." Same destination, expressed in a form that quantizes far more gracefully.

The second technique, called Quantized Johnson-Lindenstrauss (QJL), applies a 1-bit error-correction pass to clean up whatever inaccuracies PolarQuant leaves behind. The result, Google says: models can run at just 3-bit precision with no quality loss and no retraining. On Nvidia H100 accelerators, Google recorded an 8x speedup in computing attention logits, the step in which a model decides which parts of a prompt actually matter.
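The coordinate-change idea can be sketched in a few lines. This is a toy illustration of the Cartesian-to-polar step and of coarse angle quantization, not Google's actual implementation; the function names and the 3-bit grid are illustrative assumptions.

```python
import numpy as np

def to_polar(x, y):
    """Cartesian (x, y) -> polar (radius, angle from the x-axis in radians)."""
    return np.hypot(x, y), np.arctan2(y, x)

def quantize_uniform(value, lo, hi, bits):
    """Snap `value` onto a uniform grid with 2**bits levels over [lo, hi]."""
    levels = 2**bits - 1
    step = (hi - lo) / levels
    return lo + round((value - lo) / step) * step

# "3 blocks East, 4 blocks North" becomes "5 blocks at ~53 degrees
# from due East" (a ~37-degree bearing from North):
r, theta = to_polar(3.0, 4.0)

# Store the angle at 3-bit precision; the radius carries the vector's
# magnitude and can be kept at higher precision or quantized separately.
theta_q = quantize_uniform(theta, -np.pi, np.pi, bits=3)

# Reconstruct: a coarse angle introduces error, which a follow-up
# correction pass (the role the article assigns to QJL) must absorb.
# Note the magnitude survives the round trip exactly.
x_q, y_q = r * np.cos(theta_q), r * np.sin(theta_q)
```

The design point the sketch makes is separation of concerns: the magnitude is preserved losslessly in the radius, so aggressive quantization only has to contend with the angle.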
Flash memory got crushed; HBM walked away mostly fine
Not every corner of the memory market was equally spooked, and analysts were quick to make the distinction. TurboQuant's efficiency gains are specific to inference and the KV cache, which means the real threat falls on NAND flash memory, not high-bandwidth memory (HBM).

HBM is what sits inside Nvidia's AI accelerators and powers the training infrastructure at companies like Microsoft and Meta. Bloomberg Intelligence analyst Jake Silverman wrote that HBM demand, along with DRAM made by Micron, would "likely be unaffected." Morgan Stanley made the same call. It's Kioxia and Sandisk, both heavily exposed to NAND, that absorbed the worst of the damage.
Why some analysts think this actually means more chip demand, not less
Several analysts pushed back on the panic entirely, reaching for a 19th-century economic theory to make their case. The Jevons Paradox, originally formulated about coal, holds that making a resource more efficient tends to increase its consumption, because efficiency opens up use cases that didn't exist before.

JPMorgan's trading desk cited the paradox in a note, arguing there is no near-term threat to memory consumption. SemiAnalysis analyst Ray Wang told CNBC that fixing a bottleneck makes AI hardware more capable, and more capable models will eventually need more memory, not less. Quilter Cheviot's Ben Barringer put it plainly: TurboQuant is "evolutionary, not revolutionary."
Developers already running it locally—and it works
Away from the market noise, the algorithm has immediate practical value for anyone running AI outside a data centre. Google released TurboQuant publicly with no licensing restrictions and no retraining requirement, meaning it can be dropped into existing models immediately.

Within 24 hours, developers had already ported it to local AI frameworks, including MLX for Apple Silicon. One community benchmark ran the Qwen3.5-35B model across context lengths up to 64,000 tokens at 2.5-bit TurboQuant and found perfect accuracy across the board. For enterprises with strict data privacy needs, or for developers pushing the limits of on-device AI, this is the kind of software efficiency gain that quietly changes what's possible on existing hardware.
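The scale of the saving is easy to check with back-of-the-envelope arithmetic: dropping KV-cache entries from 16-bit to 2.5-bit storage is a 16/2.5 = 6.4x reduction, consistent with the "at least six times" headline figure. The model shape below is an assumption for illustration, not the real Qwen3.5-35B configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Total KV-cache size: keys and values for every layer and position."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

# Illustrative shape (assumed): 48 layers, 8 KV heads of dimension 128,
# at the 64,000-token context length from the community benchmark.
shape = dict(layers=48, kv_heads=8, head_dim=128, seq_len=64_000)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
tq25 = kv_cache_bytes(**shape, bits_per_value=2.5)

print(f"fp16 cache:    {fp16 / 2**30:.1f} GiB")
print(f"2.5-bit cache: {tq25 / 2**30:.1f} GiB ({fp16 / tq25:.1f}x smaller)")
```

For this assumed shape, a cache that would overflow a consumer GPU or a laptop's unified memory at fp16 fits comfortably after compression, which is why the on-device crowd picked the release up so quickly.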