Optimizing Memory Usage and Accesses on CUDA-Based Recurrent Pattern Matching Image Compression
Domingues, P.; Silva, J.S.; Ribeiro, TR; Rodrigues, Nuno M. M.; Carvalho, M.; Faria, S.M.M.
Optimizing Memory Usage and Accesses on CUDA-Based Recurrent Pattern Matching Image Compression, Proc Computational Science and Its Applications - ICCSA, Guimarães, Portugal, Vol. 8582, pp. 560 - 575 - DOI: 10.1007/978-3-319-09147-1_41, July, 2014.
Digital Object Identifier: 10.1007/978-3-319-09147-1_41
Abstract
This paper reports the adaptation of the Multidimensional Multiscale Parser (MMP) algorithm to CUDA, focusing on memory optimization issues. MMP is a lossy image compression algorithm whose compression quality and bitrate surpass those of standards such as H.264 and JPEG2000. However, the high computational complexity of the algorithm results in long execution times for encoding operations. For example, encoding the 512 × 512 Lenna image on a state-of-the-art 2013 Intel Xeon takes MMP nearly 9000 seconds. The goal of porting MMP to CUDA manycore platforms is to speed up the algorithm while maintaining full compatibility with the CPU-based version. One of the main challenges in adapting MMP to manycore platforms is the algorithm's dependency on a pattern codebook that is built dynamically during execution. This dependency forces the blocks of the input image to be processed sequentially, making MMP a non-trivial candidate for adaptation to manycore GPUs. Nonetheless, by porting the costliest MMP operations, CUDA-MMP achieves a 12× speedup over the sequential version when run on an NVIDIA GTX 680. Moreover, by further heavily optimizing the memory operations of CUDA-MMP, we attain an overall 17.1× speedup relative to the sequential version. This paper deals essentially with the memory optimizations performed at the GPU level, focusing on the layout of data structures in memory, on the type of GPU memory used (shared, constant and global) and on achieving coalesced accesses.
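To make the point about data layout and coalesced accesses concrete, the following is a minimal host-side sketch in C++, not the layout used in the paper. It assumes a hypothetical codebook of 1024 patterns of 16 pixels each, stored as 4-byte values, and models a warp of 32 threads in which thread t reads the same pixel of pattern t. The names `patternMajorIndex`, `pixelMajorIndex` and `segmentsTouchedByWarp`, and the 128-byte segment size, are illustrative assumptions. Counting how many distinct segments the warp touches gives a rough idea of how many memory transactions each layout would need.

```cpp
#include <cstddef>
#include <set>

// All sizes below are hypothetical and chosen only for illustration.
constexpr std::size_t kPatternPixels = 16;    // pixels per pattern, e.g. a 4x4 block
constexpr std::size_t kNumPatterns   = 1024;  // patterns in the codebook
constexpr std::size_t kElementBytes  = 4;     // bytes per stored pixel value
constexpr std::size_t kSegmentBytes  = 128;   // modelled memory segment size
constexpr std::size_t kWarpSize      = 32;    // threads per CUDA warp

// Maps (pattern, pixel) to a linear element index in the codebook buffer.
using IndexFn = std::size_t (*)(std::size_t pattern, std::size_t pixel);

// Pattern-major layout: all pixels of one pattern are stored contiguously,
// so the same pixel of neighbouring patterns is kPatternPixels elements apart.
inline std::size_t patternMajorIndex(std::size_t pattern, std::size_t pixel) {
    return pattern * kPatternPixels + pixel;
}

// Pixel-major layout: pixel k of every pattern is stored contiguously,
// so the same pixel of neighbouring patterns is one element apart.
inline std::size_t pixelMajorIndex(std::size_t pattern, std::size_t pixel) {
    return pixel * kNumPatterns + pattern;
}

// Counts the distinct segments touched when thread t of one warp reads
// the given pixel of pattern t. Fewer segments means better coalescing.
inline std::size_t segmentsTouchedByWarp(IndexFn index, std::size_t pixel) {
    std::set<std::size_t> segments;
    for (std::size_t thread = 0; thread < kWarpSize; ++thread) {
        const std::size_t byteOffset = index(thread, pixel) * kElementBytes;
        segments.insert(byteOffset / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumed sizes, `segmentsTouchedByWarp(patternMajorIndex, 0)` yields 16 while `segmentsTouchedByWarp(pixelMajorIndex, 0)` yields 1, which illustrates why reorganizing a data structure so that consecutive threads read consecutive addresses can reduce global memory traffic.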