The post FlashAttention-4 Hits 1,605 TFLOPS on NVIDIA Blackwell GPUs appeared on BitcoinEthereumNews.com.

FlashAttention-4 Hits 1,605 TFLOPS on NVIDIA Blackwell GPUs



Alvin Lang
Jan 22, 2026 23:03

NVIDIA’s FlashAttention-4 achieves 71% hardware efficiency on Blackwell chips, delivering 3.6x speedup over FA2 for AI training workloads.

NVIDIA has released FlashAttention-4, the latest optimization for transformer neural networks that squeezes 1,605 TFLOPS out of its Blackwell architecture—capturing 71% of the hardware’s theoretical maximum performance.
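As a back-of-envelope check on those headline figures, 1,605 TFLOPS at 71% efficiency implies a theoretical peak of roughly 2,260 TFLOPS for the precision being measured (the article does not say which precision; the peak value below is derived from its numbers, not quoted):

```python
# Back-of-envelope: theoretical peak implied by the article's figures.
achieved_tflops = 1605   # reported FA4 throughput on Blackwell
efficiency = 0.71        # reported fraction of theoretical maximum

implied_peak = achieved_tflops / efficiency
print(f"Implied theoretical peak: ~{implied_peak:.0f} TFLOPS")
```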

The announcement matters for anyone watching AI infrastructure investments. As large language models push toward longer context windows, the attention mechanism’s quadratic memory complexity becomes a brutal bottleneck. FlashAttention-4 attacks this problem directly, and the benchmark numbers suggest meaningful gains for production AI workloads.

What the Numbers Show

On the B200 GPU, FA4 delivers a 3.6x speedup over FlashAttention-2 on the forward pass at a sequence length of 32,768, and a 3.15x speedup on the backward pass under the same conditions. Against existing kernels, FA4 posts a 1.3x improvement over cuDNN and 2.4x over Triton-based implementations.

The memory efficiency gains are equally significant. Standard attention's memory footprint scales as O(N²) with sequence length: doubling the context window quadruples memory requirements. FA4 brings this down to O(N) through tiling and incremental (online) softmax normalization. NVIDIA claims 20x lower memory usage than PyTorch baselines.
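The tiling idea can be sketched in plain NumPy. This is a minimal, illustrative single-head version (not NVIDIA's kernel): it streams key/value tiles and keeps only a running row maximum, a running softmax denominator, and a partial output per query row, so the full N×N score matrix is never materialized:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full N x N score matrix: O(N^2) memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=128):
    # Streams K/V in tiles with online softmax: O(N) extra memory.
    d = Q.shape[-1]
    m = np.full(Q.shape[0], -np.inf)   # running row maximum
    l = np.zeros(Q.shape[0])           # running softmax denominator
    O = np.zeros_like(Q)               # unnormalized partial output
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = Q @ Kj.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)          # rescale previous partials
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions produce identical outputs; only the tiled version's working set stays constant as the sequence grows, which is the property FA4 exploits at kernel level.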

Hardware-Software Co-Design

FA4 was built specifically for Blackwell’s quirks. The architecture presents an asymmetric scaling problem: compute power roughly doubles while memory bandwidth doesn’t keep pace. Traditional approaches leave tensor cores sitting idle while waiting for data.

The solution leverages Blackwell’s dedicated Tensor Memory (TMEM)—256 KB of on-chip memory per streaming multiprocessor. By storing intermediate calculations directly in TMEM instead of shared memory, FA4 sidesteps the bandwidth bottleneck that would otherwise throttle the faster compute units.

Larger tile sizes (up to 128×128) and deeper pipelines keep the hardware busy. The backward pass—typically the slower half of training—benefits from bypassing register accumulation entirely.

Production Integration

Major inference frameworks including SGLang and vLLM already support FA4 prefill operations. NVIDIA has incorporated these techniques into cuDNN 9.14, making the optimizations accessible to developers without custom kernel work.

For AI companies burning through compute budgets, the efficiency gains translate directly to cost savings. A 3x+ speedup on training passes means either faster iteration cycles or the ability to train larger models within existing infrastructure constraints.

The broader trend here: as transformer models grow, algorithmic efficiency at the kernel level becomes as important as raw hardware capability. FlashAttention-4 represents the current frontier of that optimization work.

Image source: Shutterstock

Source: https://blockchain.news/news/flashattention-4-nvidia-blackwell-gpu-performance

