oLLM: Lightweight Python Library Enables 100K-Context LLMs on 8GB GPUs with SSD Offload
By MarkTechPost

oLLM is a lightweight Python library for large-context LLM inference on consumer GPUs. It offloads model weights and the KV cache to SSD, keeps weights at full precision rather than quantizing them, and supports models such as Qwen3-Next-80B, GPT-OSS-20B, and Llama-3. This makes it feasible to run large models with contexts on the order of 100K tokens on 8 GB GPUs for offline workloads, at the cost of lower throughput and substantial SSD storage requirements.
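To make the idea concrete, the sketch below shows the same general technique, offloading to disk the layers that do not fit in an 8 GB GPU, using Hugging Face Transformers with Accelerate's built-in disk offload. This is not oLLM's own API; the model name, memory limits, and paths are placeholder assumptions.

```python
# Illustrative sketch only (not oLLM's API): weight offload to an SSD folder
# via Transformers + Accelerate, keeping fp16 precision instead of quantizing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,               # full fp16 weights, no quantization
    device_map="auto",                       # let Accelerate place layers automatically
    max_memory={0: "7GiB", "cpu": "16GiB"},  # cap GPU/CPU use; overflow goes to disk
    offload_folder="./offload",              # SSD directory for layers that don't fit
)

prompt = "Summarize the trade-offs of SSD offloading for LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Offloaded layers are streamed back from the SSD as they are needed,
# so generation is slower than fully GPU-resident inference.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The design trade-off this illustrates is the one the article describes: precision and context length are preserved, while throughput is bounded by SSD read bandwidth and the offload directory can grow to tens of gigabytes for large models.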