IROO Based Image Composition and the General Optimization

2.5 The In-practice Real-time over Operator

2.5.1 IROO Based Image Composition and the General Optimization

The image composition problem considered here involves n partial images of size h × k that arrive asynchronously, and m homogeneously configured composition devices c_j, j = 0, 1, …, m − 1. Each partial image is divided into m identically sized tiles t_j, j = 0, 1, …, m − 1. The fragment (x, y), 0 ≤ x < w, 0 ≤ y < h, belongs to tile t_j if j = ⌊y/n⌋. The composition unit c_j, 0 ≤ j < n, only composites fragments belonging to tile t_j.

To be more specific, each composition unit considered here provides a heterogeneous computing environment that contains both a GPU and a CPU. The GPU performs most of the parallel work, and the CPU coordinates the communication.² We study several performance-gaining strategies within such an environment: 1) reducing data movement between CPU and GPU, 2) increasing data access rates, 3) modeling the amounts of requisite resources on the GPU (mainly global memory and threads, which are critical to the operator’s parallelization), and 4) exploring the parallelization opportunities with the given amounts of resources. We first address each of them individually, then apply and integrate them in the proposed algorithms.

1) Data movement between host memory and device memory occurs when data on one side is requested from the other. Until the requested data arrive, all operations that depend on them are halted, which introduces non-negligible performance degradation. Our main optimization principle is thus to place the requested data on the requesting side, i.e., to keep data requested by the GPU in device memory and data requested by the CPU in host memory. In the current form of the proposed operators, operations on the GPU require the W component of each participating fragment. Since the W components are initialized uniformly for all fragments, we initialize them directly on the device instead of initializing them on the host and then copying them to the device.

2) We consider increasing data access rates for GPU-related operations by exploiting shared memory on the GPU. Considering the scarcity of shared memory and its dependence on certain access patterns to deliver the highest access rates, we propose the following acceleration strategies: 1. reserving the shared memory for the newly arriving image’s alpha values, which are in highly concurrent demand while updating all existing images’ W components; 2. swapping data in and out of shared memory in units equal to the shared memory’s capacity so as to minimize the relevant data movement overheads, i.e., all required alpha components are split into n parts, where n is the number of locations at which shared memory exists, with one alpha component part corresponding to one shared memory location; 3. coordinating each thread’s shared-memory access scheme so as to minimize bank conflicts.

² The GPU devices we select are CUDA-compatible; the terms follow CUDA conventions.
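As a rough illustration of strategies 2 and 3, the Python sketch below splits a list of alpha values into one contiguous part per shared-memory location and estimates the bank-conflict degree of a strided access pattern. The function names, the contiguous near-equal split, and the model of 32 one-word-wide banks serving a warp of 32 threads (typical of recent CUDA devices) are assumptions made for illustration, not details taken from the text.

```python
def split_alpha_parts(alphas, n_locations):
    """Split alpha values into n_locations contiguous, near-equal parts.

    Hypothetical helper: one part per shared-memory location.
    """
    base, extra = divmod(len(alphas), n_locations)
    parts, start = [], 0
    for i in range(n_locations):
        # The first `extra` parts take one additional element each.
        size = base + (1 if i < extra else 0)
        parts.append(alphas[start:start + size])
        start += size
    return parts


def bank_conflict_degree(stride, n_threads=32, n_banks=32):
    """Largest number of distinct words that fall into one bank when
    thread t reads word index t * stride (assumed bank = word % n_banks)."""
    words_per_bank = {}
    for t in range(n_threads):
        word = t * stride
        words_per_bank.setdefault(word % n_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())
```

Under this simplified bank model, a stride of 1 or 33 words gives a degree of 1 (no conflict), while a stride of 32 maps every access to the same bank.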

3) In our real-time composition problem, as new images arrive, the memory needed to keep the W components and alpha values of all available operands on the device also increases. At some point the required amount may exceed the available amount. We can avoid or delay that point based on the observation that, when a new image arrives, the requisite alpha values come from at most two images: the newly arriving one and the one immediately in front of it among the currently available ones. The alpha values of all the rest are eligible for replacement without affecting performance. In addition, the immediately in-front image has lower priority than the new one, so the memory requirement of the new image should always be considered first. When a memory shortage occurs, we further propose to replace the alpha values of existing partial images with the W component and alpha values of the newly arriving one. Methods to detect the shortage and the corresponding treatment are given in the modeling and the algorithm design, respectively.
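The selection of replaceable alpha values can be sketched as follows. This is a minimal Python illustration, assuming each image is identified by a distinct depth value and that a smaller depth means closer to the viewer; the function name and this depth convention are hypothetical and not part of the original method.

```python
from bisect import bisect_left


def alpha_eviction_candidates(resident_depths, new_depth):
    """Return (in_front, candidates) for a newly arriving image.

    resident_depths: ascending list of distinct depths of the images whose
    alpha values are currently in device memory (assumed representation).
    in_front: depth of the resident image immediately in front of the new
    one, or None if no resident image lies in front of it.
    candidates: depths whose alpha values may be replaced.
    """
    pos = bisect_left(resident_depths, new_depth)
    in_front = resident_depths[pos - 1] if pos > 0 else None
    # Every resident image except the immediately in-front one is replaceable.
    candidates = [d for d in resident_depths if d != in_front]
    return in_front, candidates
```

For example, with resident depths `[1.0, 3.0, 7.0]` and a new image at depth `5.0`, the image at depth `3.0` is kept and the other two become replacement candidates.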

In compliance with the problem specification and the proposed optimizations, we model the proposed operator’s resource requirements, covering both memory and threads. For the memory requirement, the following quantities exist: 1) a constant size C to hold the permanent variables that live through the whole composition procedure; 2) a varying size D(j) to support updating all j existing images’ W components in parallel, which requires the simultaneous availability of all related W components of these images and all related alpha values of at least two images. This gives D(j) = (j + 2) × h × k, where h × k is the size of each considered image, j × h × k is allocated to the W components of all considered images, and 2 × h × k is for the alpha values of the newly arriving image and its immediately-before image; 3) D(j) is the size of the memory actually accessed in the parallel execution, but reserving only D(j) does not guarantee optimal performance. Since the new image’s immediately-before image is determined only after the new image is ready, another round of data transfer is required if that image is not in device memory yet. If instead all existing images’ alpha components are already in memory when the new image arrives, no such follow-up transfer is needed; the required amount of memory in that situation is 2j × h × k.

Given the available amount of memory m_a, we categorize the composition of the j images as: 1) sufficient memory (SM) if m_a > 2j × h × k; 2) acceptable memory (AM) if (j + 2) × h × k < m_a < 2j × h × k; 3) deficient memory (DM) if m_a < (j + 2) × h × k.
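The memory categorization can be written as a small Python helper. This is only a sketch: the function name is hypothetical, sizes are counted in per-fragment values rather than bytes, the constant size C is left out as in the thresholds above, and boundary values (which the strict inequalities leave open) are assigned to the better-provisioned category.

```python
def classify_memory(j, h, k, m_a):
    """Classify available memory m_a for composing j images of size h x k.

    Returns "SM" (sufficient), "AM" (acceptable) or "DM" (deficient).
    """
    image_size = h * k
    minimum = (j + 2) * image_size  # D(j): W of j images + alpha of two images
    full = 2 * j * image_size       # W and alpha values of all j images
    # Assumption: the deficiency test comes first, so for j <= 2
    # (where full <= minimum) the result is never "AM".
    if m_a < minimum:
        return "DM"
    if m_a >= full:
        return "SM"
    return "AM"
```

With j = 4 and 2 × 2 images, the thresholds are 24 and 32 values, so 20 values are deficient, 28 acceptable, and 40 sufficient.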

For the thread requirement, we follow similar logic. The number of threads required for fully parallel composition of the j images is j × h × k. Comparing this with the number of available threads classifies the situation as 1) insufficient threads (IT) or 2) sufficient threads (ST).
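The thread classification follows the same pattern. The sketch below uses a hypothetical function name, treats an exact match as sufficient, and adds, purely as an illustration, the number of sequential rounds needed if each thread handles one fragment per round; the text does not prescribe that scheduling.

```python
def classify_threads(j, h, k, available_threads):
    """Return (label, rounds) for composing j images of size h x k.

    label: "ST" if available_threads covers all j*h*k fragments, else "IT".
    rounds: assumed number of sequential passes, one fragment per thread
    per pass (illustrative only). available_threads must be positive.
    """
    required = j * h * k
    label = "ST" if available_threads >= required else "IT"
    # Integer ceiling division without floating-point rounding.
    rounds = -(-required // available_threads)
    return label, rounds
```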

The above analysis indicates the extent of possible parallelism in a given situation, points out the issues to consider while exploring parallelization, and provides a concise framework for organizing the corresponding algorithm design. We thus base our algorithm design on the above categorization of memory availability and cover the cases 1. SM, 2. AM, and 3. DM individually.

In document Accelerating data-intensive scientific visualization and computing through parallelization (Page 63-66)