Speeding up MR Image Reconstruction with GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA)
This time we would like to share some details about realization of one of our projects in medical imaging.
Currently one of the bottlenecks in MR image reconstruction is speed improvement. Improving the speed of image reconstruction is difficult from algorithmic point of view. But it’s becoming more popular to improve algorithm performance using GPU.
In magnetic resonance (MR) image reconstruction raw data measured from the scanner correspond to the Fourier coefficients of the target image, so the fast Fourier transform (FFT) is used for reconstruction. An important procedure called the density compensation is necessary in the very beginning to account for the non-uniform sampling. GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) is a partially parallel acquisition (PPA) method which is used to reduce scan times. In this case only partial data is acquired and missing data is mathematically calculated from available data.
Subject for optimization
Original implementation was based on FFTW library for Fast Fourier transforms and adapted Singular Value Decomposition (SVD) algorithm from ACM Collected Algorithms for GRAPPA preprocessing. These two algorithms are the most computationally intensive parts of the whole image reconstruction. FFTW library is claimed to be the fastest CPU implementation using all possible optimizations like SSE2/3 and hyper threading, however it does not leverages the power of modern GPU cards. SVD algorithm was done on CPU as well. It is known to be badly parallelizable for small matrices, but in case of GRAPPA algorithm we have many image frames with same size which can be processed in parallel. Besides there are many intermediate steps which consume a lot of CPU and they can be easily parallelized on GPU.
FFTW library performance is comparable with Intel MKL implementation. NVidia provides comparison for their CUDA based cuFFT library with MKL implementation (Figure 1):
|Figure 1 Comparison of CUDA based cuFFT library with MKL implementation|
According to this we should achieve up to 8-10x faster FFT processing when using GPU accelerated cuFFT library from Nvidia. GPU accelerated SVD algorithm is also available, for example CULA library by EM Photonics. However, current CULA library implementation does not support batch mode, so we will need to process all image frames as a sequence. Brief testing showed that 64 image frames (256*256) are processed even slower than CPU based version. Since we haven’t found any good alternative to CULA library we decided to implement our own GPU accelerated version of SVD algorithm.
FFT part of image reconstruction when using cuFFT library was straightforward, however we had to deal with image frames which does not fit into available GPU memory. We had to write algorithm to run FFT over portions of the large data frame with subsequent aggregation. Figure 2.1 below shows case when all data fits into GPU memory.
Figure 2.2 illustrates the case when huge data is processed. Solid lines in figure below show measured performance, dashed lines show estimated time in case data fits into GPU memory.