Abstract
This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
Contributions
- Hierarchical memory. We introduce a training-free hierarchical memory with two lightweight adaptive modules, TAS and SDC, equipping MLLMs with unified short- and long-term video modeling in both online and offline settings.
- Accuracy-efficiency trade-off. Our approach achieves state-of-the-art performance on diverse video tasks in both online and offline settings while discarding 60-70% of visual tokens and reducing both latency and GPU memory usage.
- Adaptive thresholds. We demonstrate that an adaptive token reduction threshold, based on video-specific information density, outperforms fixed-rule methods. This adaptive capability is natively supported by TAS and SDC.
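The adaptive-threshold idea above can be illustrated with a minimal sketch: derive the token keep rate from the statistics of inter-frame similarities rather than from a fixed rule. The specific mapping below (one minus the mean similarity, clipped to a floor) is a hypothetical stand-in; the paper's exact statistic may differ.

```python
import numpy as np

def adaptive_keep_rate(frame_sims, floor=0.1, ceil=1.0):
    """Map a frame's inter-frame token similarities to a keep rate.

    Hypothetical rule for illustration: a mostly static scene yields
    high similarities and thus a low keep rate (aggressive pruning),
    while a dynamic scene keeps more tokens.
    """
    rate = 1.0 - float(np.mean(frame_sims))
    return float(np.clip(rate, floor, ceil))
```

For example, a near-static frame whose tokens have mean similarity 0.95 to the previous frame is pruned down to the floor rate, whereas a fast-moving scene with mean similarity 0.4 retains about 60% of its tokens.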
Model Design
FluxMem maintains an adaptive hierarchical memory with three levels—short-term, mid-term, and long-term—organized by temporal distance to the current query. Incoming frames are encoded into visual tokens and first stored in short-term memory to preserve fine-grained, immediate context. As the stream grows, a Temporal Adjacency Selection (TAS) module compares tokens across adjacent frames, uses data-driven thresholds to drop redundant tokens, and compacts them into mid-term memory. When mid-term memory becomes saturated, a Spatial Domain Consolidation (SDC) module groups spatially similar regions within each frame into representative anchors, forming a long-term memory that keeps global structure while aggressively reducing redundancy. This progressive, self-adaptive design allows FluxMem to retain salient dynamics over long streams while significantly lowering the visual token budget.
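The two compression stages described above can be sketched as follows. This is a minimal illustration under assumed rules, not the paper's implementation: TAS drops current-frame tokens that closely match the same spatial position in the previous frame, with a keep rate derived from mean inter-frame similarity; SDC greedily merges tokens into running-mean anchors using a threshold set from the tokens' own similarity statistics.

```python
import numpy as np

def _unit(x):
    """L2-normalize along the feature dimension."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tas(prev, curr, floor=0.1):
    """Temporal Adjacency Selection (sketch): keep the most temporally
    novel tokens of the current frame; the keep rate adapts to scene
    motion (an assumed rule, not the paper's exact formula)."""
    sims = (_unit(prev) * _unit(curr)).sum(-1)      # per-position cosine similarity
    keep_rate = max(floor, 1.0 - float(sims.mean()))  # static scene -> prune more
    k = max(1, int(round(keep_rate * len(curr))))
    keep_idx = np.argsort(sims)[:k]                 # least similar = most novel
    return curr[np.sort(keep_idx)]

def sdc(tokens, tau=None):
    """Spatial Domain Consolidation (sketch): greedily fold each token
    into its most similar anchor when similarity exceeds an adaptive
    threshold, otherwise start a new anchor."""
    if tau is None:
        # Adaptive threshold: mean off-diagonal pairwise similarity.
        sims_all = _unit(tokens) @ _unit(tokens).T
        tau = float(sims_all[~np.eye(len(tokens), dtype=bool)].mean())
    anchors, counts = [tokens[0].copy()], [1]
    for t in tokens[1:]:
        sims = [float(_unit(a[None])[0] @ _unit(t[None])[0]) for a in anchors]
        j = int(np.argmax(sims))
        if sims[j] > tau:
            # Merge into the anchor as a running mean of its members.
            anchors[j] = (anchors[j] * counts[j] + t) / (counts[j] + 1)
            counts[j] += 1
        else:
            anchors.append(t.copy())
            counts.append(1)
    return np.stack(anchors)
```

In this sketch, two identical adjacent frames collapse to the floor keep rate under `tas`, and a frame dominated by one repeated region consolidates into a handful of anchors under `sdc`; both behaviors follow from the scene statistics rather than a hand-tuned rate.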
Experiments
Quantitative results: FluxMem improves Qwen2.5-VL across both online and offline benchmarks while using markedly fewer visual tokens.
Ablations: TAS prunes adjacent redundancy, SDC fuses spatial repeats, and adaptive thresholds give the best accuracy-compression trade-off on MLVU.
Visualizations
Token retention visualizations: TAS keeps temporally novel context, while SDC preserves the spatially salient objects and regions needed for reasoning.
BibTeX
@inproceedings{xie2026fluxmem,
title={FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding},
author={Xie, Yiweng and He, Bo and Wang, Junke and Zheng, Xiangyu and Ye, Ziyi and Wu, Zuxuan},
booktitle={CVPR},
year={2026}
}