Episode notes
Discover a groundbreaking approach to optimizing Large Language Models with Tomasz Kolinko, a true OG tinkerer and entrepreneur. In this One-Shot interview, Tomasz unveils his 'Effort Engine,' a novel algorithm that dynamically selects which computations are performed during LLM inference, allowing for significant speed improvements while keeping output quality surprisingly high. Learn how this method goes beyond traditional quantization by choosing computations at inference time, and how it can even load only part of a model to save VRAM.
Tomasz shares his unique benchmarking techniques, including the use of Kullback-Leibler divergence and heat maps, offering a new lens to understand how models behave under reduced 'effort.' This conversation provides practical insights into the underlying mechanics of AI models and offers a fully open-source proje ...
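For listeners who want a concrete feel for the Kullback-Leibler divergence mentioned above, here is a minimal sketch of how one might compare a model's next-token distribution at full effort against the same model at reduced effort. It is not taken from Tomasz's project: the function names and the example logits are made up for illustration, and the assumption is simply that both runs produce logits over the same vocabulary.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference, q the approximation.

    eps guards against division by zero when q assigns ~0 probability.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one token position (illustrative numbers only).
full_effort_logits = [2.0, 1.0, 0.1]
reduced_effort_logits = [1.8, 1.1, 0.3]

p = softmax(full_effort_logits)
q = softmax(reduced_effort_logits)
kl = kl_divergence(p, q)
print(f"KL(full || reduced) = {kl:.6f} nats")
```

A value near zero means the reduced-effort run predicts almost the same distribution as the full run. Computing this per token position and plotting the values on a grid is one plausible way to arrive at a heat map like the ones discussed in the episode.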