Hierarchical Autoregressive Modeling for Memory-Efficient Language Generation
Abstract: Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes, rather than arithmetic computation, dominate inference throughput. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. This design reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory. Experimental results show that PHOTON achieves a better throughput-quality trade-off than competitive Transformer-based language models, with significant advantages in long-context and multi-query tasks.
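The abstract does not spell out PHOTON's architecture, but the core idea it describes, a bottom-up encoder that compresses tokens into a low-rate latent stream and a lightweight top-down decoder that re-expands it for token prediction, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the class names, chunk size, single-level hierarchy, and simple linear compression are assumptions for exposition (the sketch also ignores strict token-level causality within a chunk), not the paper's actual design.

import torch
import torch.nn as nn


class BottomUpEncoder(nn.Module):
    """Compresses the token stream into a shorter stream of latent states (assumed design)."""
    def __init__(self, d_model: int, chunk: int):
        super().__init__()
        self.chunk = chunk
        self.proj = nn.Linear(d_model * chunk, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by chunk
        b, t, d = x.shape
        x = x.view(b, t // self.chunk, self.chunk * d)  # group tokens into chunks
        return self.proj(x)                              # (batch, seq_len / chunk, d_model)


class TopDownDecoder(nn.Module):
    """Expands each latent state back into chunk-many token-level states (assumed design)."""
    def __init__(self, d_model: int, chunk: int):
        super().__init__()
        self.chunk = chunk
        self.proj = nn.Linear(d_model, d_model * chunk)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, s, d = z.shape
        out = self.proj(z)                     # (batch, s, chunk * d_model)
        return out.view(b, s * self.chunk, d)  # (batch, s * chunk, d_model)


class HierarchicalLM(nn.Module):
    """Two-level sketch: causal attention runs only over the compressed latent stream,
    so the KV cache grows at roughly 1/chunk the rate of a flat Transformer."""
    def __init__(self, vocab: int, d_model: int = 256, chunk: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = BottomUpEncoder(d_model, chunk)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.latent_lm = nn.TransformerEncoder(layer, n_layers)
        self.decoder = TopDownDecoder(d_model, chunk)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens)                  # token-level states
        z = self.encoder(h)                     # low-rate contextual states
        mask = nn.Transformer.generate_square_subsequent_mask(z.size(1))
        z = self.latent_lm(z, mask=mask)        # causal modeling over the latent stream
        h_hat = self.decoder(z)                 # reconstruct fine-grained token states
        return self.head(h_hat)                 # token logits


# Smoke test with made-up sizes
model = HierarchicalLM(vocab=1000, chunk=4)
logits = model(torch.randint(0, 1000, (2, 32)))
print(logits.shape)  # torch.Size([2, 32, 1000])

Because the causal attention, and hence the KV cache, lives only on the compressed stream in this sketch, cache growth per generated token shrinks by roughly the chunk factor; this is the mechanism the abstract credits for reduced decode-time KV-cache traffic.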
Comments: 12 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2512.20687 [cs.LG] (arXiv:2512.20687v1 for this version)
DOI: https://doi.org/10.48550/arXiv.2512.20687 (arXiv-issued DOI via DataCite, pending registration)
Submission history:
[v1] From: Yuma Ichikawa, Mon, 22 Dec 2025 19:26:59 UTC (1,233 KB)