Efficient Frequency-Aware Multiscale Vision Transformer for Event-to-Video Reconstruction

Maqsood, R.; Nunes, P.; Soares, L. D.; Conti, C.

Efficient Frequency-Aware Multiscale Vision Transformer for Event-to-Video Reconstruction, Proc. European Signal Processing Conference (EUSIPCO), Palermo, Italy, September 2025.

Digital Object Identifier: 10.23919/EUSIPCO63237.2025.11226686


Abstract
Event-to-video (E2V) reconstruction is a critical task in event-based vision, benefiting from the advantages of event cameras, such as high dynamic range and low latency. However, existing deep learning reconstruction methods often prioritize temporal consistency and over-emphasize low-frequency features, leading to blur artifacts and loss of fine details. To overcome these limitations, we propose a novel frequency-aware multiscale vision transformer model for E2V reconstruction (MSViT-E2V). Our model employs wavelet-based decomposition to extract features at multiple scales, preserving fine-grained details through multi-level wavelet-based downsampling blocks, followed by transformer blocks for multiscale feature aggregation and long-range dependency modeling. Extensive experiments on various event datasets demonstrate that our model not only minimizes artifacts and preserves fine details but also reduces computational costs by up to 50% compared to the transformer-based model ET-Net.
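The multi-level wavelet-based downsampling described above can be illustrated with a minimal 2-D Haar decomposition. This is a simplified sketch, not the paper's implementation: the actual wavelet basis, feature channels, and the transformer blocks that follow are not reproduced here. The key idea shown is that each downsampling level splits a feature map into a coarse low-frequency approximation (LL) and three high-frequency subbands (LH, HL, HH) that retain the fine details which plain strided downsampling would blur away.

```python
import numpy as np

def haar_dwt2(x):
    """One level of 2-D Haar wavelet decomposition (orthonormal).

    Returns (LL, (LH, HL, HH)); each subband is half the spatial size.
    LL holds the low-frequency approximation, the other three hold
    horizontal, vertical, and diagonal high-frequency detail.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)

def multilevel_decompose(frame, levels=3):
    """Recursively decompose the LL band, keeping the high-frequency
    subbands at every scale so fine detail is preserved per level."""
    detail_pyramid = []
    ll = frame
    for _ in range(levels):
        ll, highs = haar_dwt2(ll)
        detail_pyramid.append(highs)
    return ll, detail_pyramid
```

Because the Haar transform above is orthonormal, the subbands preserve the input's total energy, so no information is discarded during downsampling; in a network such as the one described, learned layers would then process and aggregate these subbands at each scale.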