Optimizing GPU code for CPU execution using OpenCL and vectorization: a case study on image coding
Pereira, Pedro M. M.
Rodrigues, Nuno M. M.
Optimizing GPU code for CPU execution using OpenCL and vectorization: a case study on image coding, Proc. ICA3PP: International Conference on Algorithms and Architectures for Parallel Processing, Granada, Spain, Vol. 1, pp. 1-8, December 2016.
Digital Object Identifier:
Although OpenCL (Open Computing Language) aims to achieve portability at the code level, different hardware platforms require different approaches in order to extract the best performance from OpenCL-based code. In this work, we take an image encoder originally tuned for OpenCL on GPU (OpenCL-GPU) and optimize it for multi-CPU-based platforms. We produce two OpenCL-based versions: i) a regular one (OpenCL-CPU) and ii) a CPU vector-based one (OpenCL-CPU-Vect). The CPU vectorization exploits the vector support provided by OpenCL, which makes it much simpler than coding directly with SIMD instructions such as SSE and AVX. Overall, the OpenCL-GPU version is the fastest when run on a high-end GPU, requiring around 580 seconds to encode the Lenna image, but its performance drops by roughly 65% when run unchanged on a multicore CPU machine. Of the versions tuned for CPU, OpenCL-CPU encodes the Lenna image in 805 seconds, while the vectorization-based approach performs the same operation in 672 seconds. The results show that meaningful performance gains can be achieved by tailoring the OpenCL code to the CPU, and that using CPU vector instructions through OpenCL is both fairly simple and rewarding in terms of performance.
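
The reported timings can be compared with a few lines of arithmetic. The following is a minimal sketch, not part of the paper: it uses only the three encoding times quoted in the abstract, and the helper names `speedup` and `time_saved_percent` are hypothetical, introduced here for illustration.

```python
# Encoding times for the Lenna image, in seconds, as quoted in the abstract.
# The dictionary layout and key names are illustrative assumptions.
REPORTED_SECONDS = {
    "OpenCL-GPU (high-end GPU)": 580.0,
    "OpenCL-CPU": 805.0,
    "OpenCL-CPU-Vect": 672.0,
}


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def time_saved_percent(baseline_s: float, optimized_s: float) -> float:
    """Return the percentage of the baseline run time removed by the optimization."""
    return 100.0 * (baseline_s - optimized_s) / baseline_s


if __name__ == "__main__":
    cpu = REPORTED_SECONDS["OpenCL-CPU"]
    vect = REPORTED_SECONDS["OpenCL-CPU-Vect"]
    gpu = REPORTED_SECONDS["OpenCL-GPU (high-end GPU)"]

    # Gain of the vectorized CPU version over the regular CPU version.
    print(f"Vect vs CPU speedup: {speedup(cpu, vect):.2f}x")
    print(f"Vect vs CPU time saved: {time_saved_percent(cpu, vect):.1f}%")

    # Remaining gap between the vectorized CPU version and the GPU version.
    print(f"Vect time relative to GPU: {vect / gpu:.2f}x")
```

Under these figures, the vectorized version is about 1.20 times faster than the regular CPU version (roughly 16.5% less encoding time), while still taking about 1.16 times as long as the GPU version on the high-end GPU.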