Jan 7, 2026

ByteShape Outperforms Traditional Quantizers for 30B Qwen3‑2507 on Edge and GPU Devices

ByteShape leverages shape-aware bit-length learning to produce 30B Qwen3-2507 models that hit real-time throughput on Raspberry Pi 5 and maintain superior quality on RTX 5090 and RTX 4080 GPUs. Compared to Unsloth and MagicQuant, ByteShape delivers higher tokens-per-second (TPS) or lower error rates within the same VRAM budget, making it the default choice for on-device inference.

Executive Summary
=================

In the quest to run large language models (LLMs) on constrained hardware, the choice of quantisation format can make the difference between a usable system and an unusable one. The "ShapeLearn" framework applies per-tensor bit-length selection to optimise not just for memory footprint, but for the practical metrics that matter in deployment: tokens per second (TPS) and output quality measured against a full BF16 baseline. Across a spectrum of devices, from the modest Raspberry Pi 5 to high-end NVIDIA GPUs, ByteShape consistently outperforms the most popular alternatives (Unsloth and MagicQuant), delivering either higher throughput for the same quality or higher quality without sacrificing speed.

Model Selection Strategy
------------------------

The optimisation pipeline begins by ensuring the quantised model fits within the target device's memory. Only after a model comfortably resides in RAM or VRAM do we shrink it further, with the goal of improving the real-world trade-off between speed and fidelity. ShapeLearn's core contribution is recognising that bit-length alone does not dictate speed; the kernel implementation and hardware-specific execution paths also play a decisive role. Shape-aware selection therefore tailors the datatype of each tensor so that the resulting runtime path aligns with the device's performance sweet spot. (A minimal sketch of this fit-first selection loop follows the GPU results below.)

Raspberry Pi 5 (16 GB)
----------------------

On the Raspberry Pi 5, the Qwen3-30B model runs in real time when quantised to 2.70 bits per weight (BPW) using the Q3_K_S format. This configuration achieves 8.03 TPS while retaining 94.18 % of BF16 quality. That rate comfortably exceeds typical human reading speed: at roughly 0.75 English words per token, 8 TPS is about 6 words per second, while silent reading usually falls in the 3-5 words-per-second range. In comparison, the best Unsloth model at comparable quality sits at roughly 5-6 TPS. ByteShape's ability to push the throughput-quality curve outward means it can sustain interactive responses on a low-end device where other methods fall short.

Speed-First vs Accuracy-First
-----------------------------

For latency-sensitive scenarios, the Q3_K_S-2.70 bpw variant offers the sweet spot. Conversely, if application accuracy is paramount, ByteShape's IQ4_XS-3.25 bpw and IQ4_XS-3.87 bpw variants keep the relative error below 1.3 % with TPS in the 6-7 range. When a larger error budget is tolerable, Q3_K_S-3.25 bpw presents the most efficient balance: at 2.03 % relative error, it delivers 6.68 TPS, outperforming Unsloth's fastest entry by nearly 25 %.

Intel i7 (64 GB)
----------------

On a high-capacity CPU, ByteShape models dominate the TPS-quality scatter, with the Q3_K_S-3.25 bpw model offering the best balanced point: 23.1 TPS at 98 % quality. In quality-first tests, ByteShape's IQ4_XS-4.67 bpw outperforms Unsloth's Q6_K and Q5_K_M at 0.25 % versus 0.36-0.44 % relative error, respectively. MagicQuant lags behind on both TPS and error rate.

GPU Performance
---------------

For GPUs, TPS depends not only on weight precision but also on the specific decoding kernel. On an RTX 5090 (32 GB), a clear 4-bit sweet spot emerges: several 4-bit models (e.g., Unsloth Q4_0, Unsloth IQ4_XS, MagicQuant iq4_nl) cluster around 300 TPS with ~99 % accuracy. Beyond this cluster, ByteShape maintains a predictable TPS-quality descent, providing higher accuracy where needed.
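Here is the minimal sketch promised under Model Selection Strategy: filter by memory fit first, then keep only the Pareto frontier of the TPS-quality scatter. This is an illustration, not ShapeLearn's actual code; the `Variant` fields, the 1 GB head-room default, and the model sizes are assumptions, while the TPS and quality figures echo the Raspberry Pi 5 numbers above.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str        # quant label, e.g. "Q3_K_S-2.70bpw"
    size_gb: float   # resident weight size (illustrative values below)
    tps: float       # measured tokens per second on the target device
    quality: float   # normalised score relative to BF16 (1.0 = parity)

def select_candidates(variants, mem_budget_gb, headroom_gb=1.0):
    """Fit-first: drop anything that does not comfortably fit, then keep
    only the Pareto frontier of the remaining TPS-quality points."""
    fitting = [v for v in variants if v.size_gb + headroom_gb <= mem_budget_gb]
    def dominated(v):
        return any(o.tps >= v.tps and o.quality >= v.quality
                   and (o.tps > v.tps or o.quality > v.quality)
                   for o in fitting)
    return sorted((v for v in fitting if not dominated(v)),
                  key=lambda v: v.tps, reverse=True)

# Raspberry Pi 5 (16 GB) example; TPS/quality echo the post, sizes are guesses.
pi5 = [
    Variant("Q3_K_S-2.70bpw", 11.0, 8.03, 0.9418),
    Variant("IQ4_XS-3.25bpw", 12.5, 6.50, 0.9870),
    Variant("Unsloth-best",   12.0, 5.50, 0.9400),  # dominated, gets dropped
]
for v in select_candidates(pi5, mem_budget_gb=16.0):
    print(f"{v.name}: {v.tps} TPS at {v.quality:.2%} of BF16")
```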
Key Accuracy Milestone
----------------------

The IQ4_XS-4.67 bpw model on the RTX 5090 delivers 272.98 TPS at 99.75 % accuracy, ahead of Unsloth's Q6_K (264.88 TPS, 99.64 %) and MagicQuant's mxfp4_moe (240.42 TPS, 99.32 %). Under a 16 GB VRAM budget (RTX 4080), ByteShape remains superior: its IQ4_XS-3.87 bpw achieves 214.81 TPS at 98.66 % accuracy, running 39 % faster with 59 % lower relative error than Unsloth's best-fitting model.

Hardware-Aware Kernel Behaviour
-------------------------------

Lower BPW does not automatically translate to faster performance on NVIDIA GPUs. 4-bit kernels make efficient use of fixed-size VRAM reads and minimise decode overhead, whereas 3- and 2-bit kernels incur higher instruction counts and potential bandwidth fragmentation. The llama.cpp design prioritises portability over absolute throughput: weights are stored in 256-value blocks to keep the format implementation straightforward across hardware (see the bits-per-weight sketch at the end of this post). However, this can lead to sub-optimal VRAM utilisation on GPUs, where block decoding must be parallelised. ShapeLearn mitigates this by selecting per-tensor datatypes that align better with the hardware's efficient execution paths.

Methodology Overview
--------------------

Performance metrics are gathered on the target device for each quantised variant, producing a TPS value. A single normalised quality score, derived from MMLU, GSM8K, IFEval, and LiveCodeBench V4, quantifies how much fidelity the variant retains relative to BF16 (a small aggregation sketch appears at the end of this post). Each data point in the plots therefore answers two questions at once: how fast does this model run, and how well does it preserve accuracy, given that it fits in memory?

Conclusion
----------

The evidence is clear: treat memory as a constraint, not an optimisation target. Once a model fits, your focus should shift entirely to the TPS-quality trade-off. ByteShape, by integrating shape-aware bit-length learning with a hardware-friendly quantisation format, consistently lands on the upper-right of this curve across CPUs and GPUs, delivering either higher speed at a given quality or higher quality at a given speed.

Practical Recommendations
-------------------------

* **Raspberry Pi 5 (16 GB)** – use Q3_K_S-2.70 bpw for responsive, real-time inference.
* **High-end CPUs (e.g., Intel i7, 64 GB)** – select IQ4_XS-4.67 bpw for the best accuracy or Q3_K_S-3.25 bpw for a balanced sweet spot.
* **RTX 5090 (32 GB) or RTX 4080 (16 GB)** – start with the 4-bit sweet-spot cluster; if higher accuracy is required, choose ByteShape's IQ4_XS-4.67 bpw (5090) or IQ4_XS-3.87 bpw (4080).

By following the "fit-first, optimise-later" mantra and leveraging ByteShape's per-tensor datatype selection, developers can unlock near-real-time performance and top-tier accuracy for 30B Qwen3-2507 models on a wide variety of hardware platforms.
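As a back-of-the-envelope companion to the Hardware-Aware Kernel Behaviour section, the sketch below shows how llama.cpp-style block storage determines effective bits per weight: the scale and min metadata inside each block counts toward the footprint, which is why nominal "4-bit" formats land above 4.0 BPW. The byte counts in the comments reflect the upstream K-quant layouts as commonly documented; treat them as illustrative rather than normative.

```python
def effective_bpw(bytes_per_block: int, weights_per_block: int = 256) -> float:
    """Effective bits per weight of a block-quantised tensor: every byte in
    the block, including scale/min metadata, counts toward the footprint."""
    return 8 * bytes_per_block / weights_per_block

# llama.cpp's Q4_K packs 256 weights into 144 bytes:
#   4 B of fp16 super-block scale/min + 12 B of packed sub-block
#   scales/mins + 128 B of 4-bit quants.
print(effective_bpw(144))   # 4.5  -> a "4-bit" format costs 4.5 BPW
print(effective_bpw(110))   # ~3.44 -> roughly where Q3_K-style blocks land
```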
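Finally, to make the Methodology Overview concrete: the post does not specify how the four benchmark results are folded into one normalised score, so the equal-weighted mean of per-benchmark ratios below is an assumption, and every score in the example is a made-up placeholder rather than a measured result.

```python
def quality_vs_bf16(scores: dict[str, float], bf16: dict[str, float]) -> float:
    """Normalised quality: mean of per-benchmark ratios against the BF16
    baseline (equal weighting is an assumption, not the post's exact recipe)."""
    ratios = [scores[bench] / bf16[bench] for bench in bf16]
    return sum(ratios) / len(ratios)

# Placeholder numbers only; the benchmark names match those used in the post.
bf16_baseline = {"MMLU": 0.80, "GSM8K": 0.92,
                 "IFEval": 0.85, "LiveCodeBench V4": 0.45}
quant_scores  = {"MMLU": 0.79, "GSM8K": 0.91,
                 "IFEval": 0.84, "LiveCodeBench V4": 0.44}

q = quality_vs_bf16(quant_scores, bf16_baseline)
print(f"{q:.2%} of BF16 quality; relative error {1 - q:.2%}")
```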