oLLM: Lightweight Python Library Enables 100K-Context LLMs on 8GB GPUs with SSD Offload
By MarkTechPost

oLLM is a lightweight Python library for large-context LLM inference on consumer GPUs. It offloads model weights and the KV cache to SSD, keeps weights at full precision rather than quantizing them, and supports models such as Qwen3-Next-80B, GPT-OSS-20B, and Llama-3. This makes it feasible to run large models with contexts on the order of 100K tokens on 8 GB GPUs for offline workloads, at the cost of lower throughput and substantial SSD storage requirements.
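To make the idea concrete, the sketch below shows the same general technique, offloading to disk the layers that do not fit in an 8 GB GPU, using Hugging Face Transformers with Accelerate's built-in disk offload. This is not oLLM's own API; the model name, memory limits, and paths are placeholder assumptions.

```python
# Illustrative sketch only (not oLLM's API): weight offload to an SSD folder
# via Transformers + Accelerate, keeping fp16 precision instead of quantizing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,               # full fp16 weights, no quantization
    device_map="auto",                       # let Accelerate place layers automatically
    max_memory={0: "7GiB", "cpu": "16GiB"},  # cap GPU/CPU use; overflow goes to disk
    offload_folder="./offload",              # SSD directory for layers that don't fit
)

prompt = "Summarize the trade-offs of SSD offloading for LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Offloaded layers are streamed back from the SSD as they are needed,
# so generation is slower than fully GPU-resident inference.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The design trade-off this illustrates is the one the article describes: precision and context length are preserved, while throughput is bounded by SSD read bandwidth and the offload directory can grow to tens of gigabytes for large models.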