NVIDIA's TiDAR
From the link below, on a new technique:
TiDAR delivers impressive performance improvements across multiple dimensions:
Speed: 4.71x to 5.91x higher token throughput than baseline autoregressive (AR) models
Efficiency: Generates 7.45 tokens per forward pass (1.5B model) and 8.25 tokens per forward pass (8B model)
Quality: Maintains competitive performance with AR models on coding tasks (HumanEval, MBPP) and math reasoning (GSM8K)
Versatility: Outperforms existing diffusion models (Dream, LLaDA) and speculative decoding methods (EAGLE-3) in both efficiency and quality
System-friendly: Supports exact KV caching and requires no hyperparameter tuning during inference
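As a quick sanity check on the numbers above (an illustrative back-of-envelope calculation, assuming the baseline AR model emits one token per forward pass), tokens per forward pass does not convert 1:1 into wall-clock speedup, because each hybrid pass costs more than a plain AR pass:

```python
# Illustrative arithmetic only; the per-pass cost figure is inferred, not
# reported in the source.
tokens_per_pass = 7.45     # reported for the 1.5B model
reported_speedup = 4.71    # reported throughput gain vs. the AR baseline

# Implied relative cost of one TiDAR pass vs. one AR pass (assumption:
# baseline emits exactly 1 token per pass):
relative_pass_cost = tokens_per_pass / reported_speedup
print(round(relative_pass_cost, 2))  # prints 1.58
```

In other words, the figures are consistent with each TiDAR forward pass costing roughly 1.6x an AR pass while emitting ~7.5 tokens, which is where the net ~4.7x speedup comes from.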
The method is the first to close the quality gap with AR models while delivering substantial speedups.
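The draft-and-verify idea behind this family of methods can be illustrated with a toy loop. This is a minimal sketch, not the paper's actual architecture: the drafter, verifier, and per-token acceptance rate below are all simulated assumptions.

```python
import random

random.seed(0)

def draft_block(context, k):
    # Hypothetical diffusion-style drafter: proposes k tokens in one
    # parallel step (simulated here with random token ids).
    return [random.randint(0, 9) for _ in range(k)]

def verify_block(context, proposal):
    # Hypothetical AR-style verifier: scores all proposed tokens in one
    # forward pass and keeps the longest accepted prefix (simulated with
    # an assumed 0.8 per-token acceptance rate).
    accepted = []
    for tok in proposal:
        if random.random() < 0.8:
            accepted.append(tok)
        else:
            break
    return accepted

def generate(n_tokens, k=8):
    out, passes = [], 0
    while len(out) < n_tokens:
        proposal = draft_block(out, k)
        accepted = verify_block(out, proposal)
        # Always make progress: on total rejection, fall back to one token.
        out.extend(accepted if accepted else proposal[:1])
        passes += 1
    return out[:n_tokens], passes

tokens, passes = generate(64)
print(f"{len(tokens) / passes:.2f} tokens per forward pass")
```

The acceptance rate controls the payoff: the more often the verifier agrees with the drafter, the more tokens each forward pass yields, which is the mechanism behind the 7-8 tokens-per-pass figures quoted above.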
Practical Applications:
Real-time AI Assistants: Enable faster response times in chatbots and virtual assistants without sacrificing answer quality, improving user experience in customer service and educational applications.
Code Generation Tools: Accelerate AI-powered development environments like GitHub Copilot, allowing developers to receive high-quality code suggestions with reduced latency.
Content Creation Platforms: Speed up AI writing assistants, marketing copy generators, and creative writing tools while maintaining output quality for professional use cases.
Edge AI Deployment: Optimize inference on resource-constrained devices by maximizing GPU utilization, making sophisticated language models more viable for mobile and embedded applications.
https://arxivexplained.com/papers/tidar-think-in-diffusion-talk-in-autoregression