It’s been a while since I last wrote here. Lately, I’ve been diving deep into AI inference (the process of running a trained model to generate responses), and specifically into whether we truly need expensive GPUs to run modern language models. Spoiler alert: the answer might surprise you. After extensive testing on Oracle Cloud Infrastructure (OCI), comparing ARM-based Ampere processors against the latest AMD EPYC chips, I found that the right combination of software optimizations and compressed models can deliver remarkable performance, all without a single GPU. ...