Understanding and using Video Codecs

PROJECT REPORT

(Project Semester Jan -Jun 2011)

AT

Understanding and using Video Codecs

Submitted By:

Name: Manish Chhabra

Roll No.: 100806060

Under the guidance of:

Mr. H.K.S. Randhawa
Training Coordinator, ECED
Thapar University

Dr. MONA MATHUR
ST-IITD Research Initiative
ST Microelectronics, India

Department of Electronics & Communication Engineering

THAPAR UNIVERSITY, PATIALA

DECLARATION

I hereby declare that the project work entitled Understanding and using Video Codecs is an authentic record of my own work carried out at IIT-Delhi in the ST-IITD Research & Innovation Lab as a requirement of the six-month project semester for the award of the degree of B.E. (Electronics Instrumentation & Control Engineering), Thapar Institute of Engineering & Technology (Deemed University), Patiala, under the guidance of Dr. Mona Mathur and Mr. H.K.S. Randhawa during January to June, 2011.

Manish Chhabra

100806060

Date: 31/07/2011

Certified that the above statement made by the student is correct to the best of our knowledge and belief.

Mr. H.K.S. Randhawa
IAP Co-ordinator
ECED Deptt.

Dr. Mona Mathur
ST-IITD Research Initiative
ST Microelectronics, India

ACKNOWLEDGEMENT

It gives me immense pleasure to thank my faculty mentors Dr. Mona Mathur and Dr. Brejesh Lall for giving me a project of my interest. I am also thankful to my program co-ordinator Dr. Subrat Kar, my student mentor Mr. Satyam Naolekar and the other team members of the ST-IITD Multimedia Group for their constant encouragement and guidance.

I am grateful to Dr. A. K. Chatterjee (Head), Electronics and Communication Engineering, for allowing me to do the project as a part of my course curriculum. My sincere thanks to Mr. H.K.S. Randhawa (Assistant Professor), Electronics and Communication Engineering, Thapar University, Patiala, for being extremely cooperative and helpful.

I am also thankful to Siddharth Gupta (Student, EIC IV yr), with whom I did this project.

I am also grateful to my peers and the senior staff of MM Lab, IIT Delhi, for their support, motivation and valuable suggestions.

TABLE OF CONTENTS

Motivation behind the Work

1. Introduction to Video Coding
1.1 Video Codec
1.2 Temporal Model
1.2.1 Changes due to Motion
1.2.2 Block-based Motion Estimation and Compensation
1.3 Transform and Quantization
1.4 The Hybrid DPCM/DCT Video Codec
1.4.1 Encoder Data Flow
1.4.2 Decoder Data Flow

2. Enhanced Predictive Zonal Search Technique
2.1 Early Termination Algorithm

3. Introduction to High Efficiency Video Coding
3.1 Background
3.2 Features
3.3 History
3.4 When will it be finished?

4. Introduction to Profiling
4.1 Compiling a Program for Profiling
4.2 Executing the Program
4.3 gprof Command Summary
4.4 Profiling the HM Code

5. Parallel Computing
5.1 Introduction
5.2 Parallel Computing Basics
5.3 Why Parallel?
5.4 Models

6. OpenCL
6.1 The OpenCL Architecture
6.2 Graphics Processor
6.3 NVIDIA CUDA
6.4 Programming Interface
6.5 Working with OpenCL
6.5.1 Workflow
6.5.2 Basic Kernels
6.6 Samples

8. Workshops Attended
9. Conclusion
10. Challenges and Problems Faced

Motivation behind the Work

My internship focused mainly on video compression and the techniques involved in it. This report first covers the basic concepts of video compression and then moves on to the EPZS technique, a motion estimation method implemented for the existing video compression standard, H.264. After that, I studied the evolving video coding standard and examined its execution behaviour and working requirements by profiling it in various configurations, mainly low delay high efficiency and low delay low complexity. As the evolving code is considerably more complex than the existing code (by about 50%), we focused on optimizing it for multi-core architectures, using OpenCL to parallelize the deblocking filter implemented in the HM code (the test model for HEVC).

1. Introduction to Video Coding

Compression is the process of compacting data into a smaller number of bits. Video compression (video coding) is the process of compacting or condensing a digital video sequence into a smaller number of bits. ‘Raw’ or uncompressed digital video typically requires a large bit rate (approximately 216 Mbits for 1 second of uncompressed TV-quality video), and compression is necessary for practical storage and transmission of digital video.
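The report does not say how the 216 Mbit figure is obtained. One derivation that reproduces it, given here only as an assumption, uses the ITU-R BT.601 sampling rates for 4:2:2 video at 8 bits per sample. The constant and function names below are illustrative.

```python
# Assumed sampling parameters (ITU-R BT.601, 4:2:2, 8 bits per sample).
# They are not stated in the report; they are one way to arrive at 216 Mbit/s.
LUMA_RATE_HZ = 13_500_000    # luma samples per second
CHROMA_RATE_HZ = 6_750_000   # samples per second for each of the two chroma components
BITS_PER_SAMPLE = 8


def raw_bit_rate(luma_hz, chroma_hz, bits_per_sample):
    """Return the uncompressed bit rate in bits per second."""
    samples_per_second = luma_hz + 2 * chroma_hz
    return samples_per_second * bits_per_sample


rate = raw_bit_rate(LUMA_RATE_HZ, CHROMA_RATE_HZ, BITS_PER_SAMPLE)
print(f"{rate / 1e6:.0f} Mbit/s")  # prints "216 Mbit/s"
```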

Compression involves a complementary pair of systems, a compressor (encoder) and a decompressor (decoder). The encoder converts the source data into a compressed form prior to transmission or storage, and the decoder converts the compressed form back into a representation of the original video data. The encoder/decoder pair is often described as a CODEC (enCOder/DECoder).

Data compression is achieved by removing redundancy, i.e. components that are not necessary for faithful reproduction of the data. Many types of data contain statistical redundancy and can be effectively compressed using lossless compression, so that the reconstructed data at the output of the decoder is a perfect copy of the original data. Unfortunately, lossless compression of image and video information gives only a moderate amount of compression. The best that can be achieved with current lossless image compression standards such as JPEG-LS is a compression ratio of around 3–4 times. Lossy compression is necessary to achieve higher compression. In a lossy compression system, the decompressed data is not identical to the source data, and much higher compression ratios can be achieved at the expense of a loss of visual quality. Lossy video compression systems are based on the principle of removing subjective redundancy, i.e. elements of the image or video sequence that can be removed without significantly affecting the viewer’s perception of visual quality.

Figure 1 : Encoder/Decoder

Most video coding methods exploit both temporal and spatial redundancy to achieve compression. In the temporal domain, there is usually a high correlation (similarity) between frames of video that were captured at around the same time. Temporally adjacent frames (successive frames in time order) are often highly correlated, especially if the temporal sampling rate (the frame rate) is high. In the spatial domain, there is usually a high correlation between pixels (samples) that are close to each other, i.e. the values of neighbouring samples are often very similar.

The H.264 and MPEG-4 Visual standards share a number of common features. Both standards assume a CODEC ‘model’ that uses block-based motion compensation, transform, quantization and entropy coding.

1.1 Video Codec

A video CODEC encodes a source image or video sequence into a compressed form and decodes this to produce a copy or approximation of the source sequence.

Figure 2 : Video Encoder Block Diagram

If the decoded video sequence is identical to the original, then the coding process is lossless; if the decoded sequence differs from the original, the process is lossy.

The CODEC represents the original video sequence by a model (an efficient coded representation that can be used to reconstruct an approximation of the video data). Ideally, the model should represent the sequence using as few bits as possible and with as high a fidelity as possible. These two goals (compression efficiency and high quality) are usually conflicting, because a lower compressed bit rate typically produces reduced image quality at the decoder.

A video encoder consists of three main functional units: a temporal model, a spatial model and an entropy encoder. The input to the temporal model is an uncompressed video sequence. The temporal model attempts to reduce temporal redundancy by exploiting the similarities between neighbouring video frames, usually by constructing a prediction of the current video frame. In MPEG-4 Visual and H.264, the prediction is formed from one or more previous or future frames and is improved by compensating for differences between the frames (motion compensated prediction). The output of the temporal model is a residual frame (created by subtracting the prediction from the actual current frame) and a set of model parameters, typically a set of motion vectors describing how the motion was compensated.

The residual frame forms the input to the spatial model, which makes use of similarities between neighbouring samples in the residual frame to reduce spatial redundancy. In MPEG-4 Visual and H.264 this is achieved by applying a transform to the residual samples and quantizing the results. The transform converts the samples into another domain in which they are represented by transform coefficients. The coefficients are quantised to remove insignificant values, leaving a small number of significant coefficients that provide a more compact representation of the residual frame. The output of the spatial model is a set of quantised transform coefficients. The parameters of the temporal model (typically motion vectors) and the spatial model (coefficients) are compressed by the entropy encoder. This removes statistical redundancy in the data (for example, representing commonly-occurring vectors and coefficients by short binary codes) and produces a compressed bit stream or file that may be transmitted and/or stored. A compressed sequence consists of coded motion vector parameters, coded residual coefficients and header information.

The video decoder reconstructs a video frame from the compressed bit stream. The coefficients and motion vectors are decoded by an entropy decoder after which the spatial model is decoded to reconstruct a version of the residual frame. The decoder uses the motion vector parameters, together with one or more previously decoded frames, to create a prediction of the current frame and the frame itself is reconstructed by adding the residual frame to this prediction.

1.2 TEMPORAL MODEL

The goal of the temporal model is to reduce redundancy between transmitted frames by forming a predicted frame and subtracting this from the current frame. The output of this process is a residual (difference) frame, and the more accurate the prediction process, the less energy is contained in the residual frame. The residual frame is encoded and sent to the decoder, which re-creates the predicted frame, adds the decoded residual and reconstructs the current frame. The predicted frame is created from one or more past or future frames (‘reference frames’). The accuracy of the prediction can usually be improved by compensating for motion between the reference frame(s) and the current frame.

The simplest method of temporal prediction is to use the previous frame as the predictor for the current frame. Given two successive frames from a video sequence, frame 1 is used as a predictor for frame 2 and the residual is formed by subtracting the predictor (frame 1) from the current frame (frame 2). In the residual image, mid-grey represents a difference of zero and light or dark greys correspond to positive and negative differences respectively. The obvious problem with this simple prediction is that a lot of energy remains in the residual frame (indicated by the light and dark areas), which means that there is still a significant amount of information to compress after temporal prediction. Much of the residual energy is due to object movements between the two frames, and a better prediction may be formed by compensating for motion between the two frames.

1.2.1 Changes due to Motion

Changes between video frames may be caused by object motion (rigid object motion, for example a moving car, and deformable object motion, for example a moving arm), camera motion (panning, tilt, zoom, rotation), uncovered regions (for example, a portion of the scene background uncovered by a moving object) and lighting changes. With the exception of uncovered regions and lighting changes, these differences correspond to pixel movements between frames. It is possible to estimate the trajectory of each pixel between successive video frames, producing a field of pixel trajectories known as the optical flow (optic flow). The complete field contains a flow vector for every pixel position but for clarity, the field is sub-sampled so that only the vector for every 2nd pixel is shown.

If the optical flow field is accurately known, it should be possible to form an accurate prediction of most of the pixels of the current frame by moving each pixel from the reference frame along its optical flow vector. However, this is not a practical method of motion compensation for several reasons. An accurate calculation of optical flow is very computationally intensive (the more accurate methods use an iterative procedure for every pixel) and it would be necessary to send the optical flow vector for every pixel to the decoder in order for the decoder to re-create the prediction frame (resulting in a large amount of transmitted data and negating the advantage of a small residual).

Figure 3 : Frame 1

Figure 4 : Frame 2

Figure 6 : Optical Flow

1.2.2. Block-based Motion Estimation and Compensation

A practical and widely-used method of motion compensation is to compensate for movement of rectangular sections or ‘blocks’ of the current frame.

The following procedure is carried out for each block of M × N samples in the current frame:

1. Search an area in the reference frame (past or future frame, previously coded and transmitted) to find a ‘matching’ M × N -sample region. This is carried out by comparing the M × N block in the current frame with some or all of the possible M × N regions in the search area (usually a region centred on the current block position) and finding the region that gives the best match. A popular matching criterion is the energy in the residual formed by subtracting the candidate region from the current M × N block, so that the candidate region that minimises the residual energy is chosen as the best match. This process of finding the best match is known as motion estimation.

2. The chosen candidate region becomes the predictor for the current M × N block and is subtracted from the current block to form a residual M × N block (motion compensation).

3. The residual block is encoded and transmitted and the offset between the current block and the position of the candidate region (motion vector) is also transmitted.
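The three steps above can be sketched as an exhaustive block search. This is a minimal illustration, not the procedure of any particular codec: the sum of absolute errors (SAE) is used here as one possible measure of residual energy, frames are plain lists of rows, and all function names are made up for the example.

```python
def sae(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute errors between the block at (bx, by) in cur and (rx, ry) in ref."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Step 1: return (motion_vector, cost) of the best match within +/- search_range."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sae(cur, ref, bx, by, bx, by, size)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidate regions that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sae(cur, ref, bx, by, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def residual_block(cur, ref, bx, by, mv, size):
    """Step 2: subtract the chosen candidate region from the current block."""
    dx, dy = mv
    return [[cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i]
             for i in range(size)] for j in range(size)]
```

Step 3 then transmits the residual block together with the motion vector returned by `full_search`. The search cost grows with the square of `search_range`, which is why faster search strategies matter in practice.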

1.3 Transform and quantization

A block of residual samples is transformed using a 4x4 or 8x8 integer transform, an approximate form of the Discrete Cosine Transform (DCT). The transform outputs a set of coefficients, each of which is a weighting value for a standard basis pattern. When combined, the weighted basis patterns re-create the block of residual samples.

The output of the transform, a block of transform coefficients, is quantized, i.e. each coefficient is divided by an integer value. Quantization reduces the precision of the transform coefficients according to a quantization parameter (QP). Typically, the result is a block in which most or all of the coefficients are zero, with a few non-zero coefficients. Setting QP to a high value means that more coefficients are set to zero, resulting in high compression at the expense of poor decoded image quality. Setting QP to a low value means that more non-zero coefficients remain after quantization, resulting in better decoded image quality but lower compression.
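The effect of the quantizer setting can be illustrated with a small sketch. It divides by a plain step size, which stands in for the step that a codec derives from QP (the actual mapping from QP to step size is defined by each standard and is not reproduced here). The coefficient values are invented for the example.

```python
def quantize(coeffs, step):
    """Divide each coefficient by the step size and round to the nearest integer."""
    return [[round(c / step) for c in row] for row in coeffs]


def rescale(levels, step):
    """Inverse operation at the decoder: multiply each level by the step size."""
    return [[level * step for level in row] for row in levels]


def count_nonzero(block):
    """Number of coefficients that survive quantization."""
    return sum(1 for row in block for v in row if v != 0)


# Hypothetical 4x4 block of transform coefficients (illustrative values only).
block = [
    [126, -49, 24, -8],
    [-31, 17, -6, 3],
    [11, -5, 2, -1],
    [-4, 2, -1, 0],
]

for step in (8, 32):
    levels = quantize(block, step)
    print(step, count_nonzero(levels), rescale(levels, step)[0])
```

A larger step leaves fewer non-zero levels to code, and the rescaled values drift further from the original coefficients, which is the compression versus quality trade-off described above.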

Figure 8 : DCT Video Decoder

1.4 THE HYBRID DPCM/DCT VIDEO CODEC MODEL

The major video coding standards released since the early 1990s have been based on the same generic design (or model) of a video CODEC that incorporates a motion estimation and compensation front end (sometimes described as DPCM), a transform stage and an entropy encoder. The model is often described as a hybrid DPCM/DCT CODEC. Any CODEC that is compatible with H.261, H.263, MPEG-1, MPEG-2, MPEG-4 Visual and H.264 has to implement a similar set of basic coding and decoding functions (although there are many differences of detail between the standards and between implementations).

The figure above shows a generic DPCM/DCT hybrid encoder and decoder. In the encoder, video frame n (Fn) is processed to produce a coded (compressed) bitstream, and in the decoder, the compressed bitstream (shown at the right of the figure) is decoded to produce a reconstructed video frame F′n (not usually identical to the source frame). The figures have been deliberately drawn to highlight the common elements within encoder and decoder. Most of the functions of the decoder are actually contained within the encoder (the reason for this will be explained below).

1.4.1 Encoder Data Flow

There are two main data flow paths in the encoder, left to right (encoding) and right to left (reconstruction). The encoding flow is as follows:

1. An input video frame Fn is presented for encoding and is processed in units of a macroblock (corresponding to a 16 × 16 luma region and associated chroma samples).

2. Fn is compared with a reference frame to find a 16 × 16 region that matches the current macroblock (motion estimation). The offset between the current macroblock position and the chosen reference region is a motion vector MV.

3. Based on the chosen motion vector MV, a motion compensated prediction P is generated (the 16 × 16 region selected by the motion estimator).

4. P is subtracted from the current macroblock to produce a residual or difference macroblock D.

5. D is transformed using the DCT. Typically, D is split into 8 × 8 or 4 × 4 sub-blocks and each sub-block is transformed separately.

6. Each sub-block is quantised (X).

7. The DCT coefficients of each sub-block are reordered and run-level coded.

8. Finally, the coefficients, motion vector and associated header information for each macroblock are entropy encoded to produce the compressed bitstream.

The reconstruction data flow is as follows:

1. Each quantised macroblock X is rescaled and inverse transformed to produce a decoded residual D′. Note that the nonreversible quantisation process means that D′ is not identical to D (i.e. distortion has been introduced).

2. The motion compensated prediction P is added to the residual D′ to produce a reconstructed macroblock, and the reconstructed macroblocks are saved to produce the reconstructed frame F′n.

After encoding a complete frame, the reconstructed frame F′n may be used as a reference frame for the next encoded frame Fn+1.

1.4.2 Decoder Data Flow

1. A compressed bitstream is entropy decoded to extract coefficients, motion vector and header for each macroblock.

2. Run-level coding and reordering are reversed to produce a quantised, transformed macroblock X.

3. X is rescaled and inverse transformed to produce a decoded residual D′.

4. The decoded motion vector is used to locate a 16 × 16 region in the decoder’s copy of the previous (reference) frame F′n−1. This region becomes the motion compensated prediction P.

5. P is added to D′ to produce a reconstructed macroblock. The reconstructed macroblocks are saved to produce the decoded frame F′n.

After a complete frame is decoded, F′n is ready to be displayed and may also be stored as a reference frame for the next decoded frame F′n+1. It is clear from the figures and from the above explanation that the encoder includes a decoding path (rescale, IDCT, reconstruct). This is necessary to ensure that the encoder and decoder use identical reference frames F′n−1 for motion compensated prediction.
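The need for a decoding path inside the encoder can be seen in a toy one-dimensional DPCM loop, where each sample stands in for a whole frame and the previous reference value is the prediction. This is a simplified sketch with no transform or motion compensation, and the function names are illustrative. With `closed_loop=False` the encoder predicts from the original sample, which the decoder never has, so the two drift apart.

```python
def quantize(value, step):
    """Quantise a residual to an integer level."""
    return round(value / step)


def encode(samples, step, closed_loop=True):
    """Return the quantised residual levels for a sequence of samples."""
    levels, ref = [], 0
    for s in samples:
        level = quantize(s - ref, step)
        levels.append(level)
        if closed_loop:
            ref = ref + level * step  # same reconstruction the decoder performs
        else:
            ref = s                   # original sample, unavailable to the decoder
    return levels


def decode(levels, step):
    """Rebuild the samples by adding each rescaled residual to the reference."""
    out, ref = [], 0
    for level in levels:
        ref = ref + level * step
        out.append(ref)
    return out


samples = [3, 6, 9, 12, 15, 18, 21, 24]
step = 4
for closed in (True, False):
    decoded = decode(encode(samples, step, closed_loop=closed), step)
    errors = [abs(d - s) for d, s in zip(decoded, samples)]
    print("closed loop" if closed else "open loop", errors)
```

In the closed-loop case the error stays within half a quantiser step, while in the open-loop case it grows with every sample because the quantisation error is never fed back into the prediction.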

Figure 10 : Reconstructed Reference Frame Fn-1

Figure 12 : 16 × 16 motion vectors

Figure 14 : Motion Compensated Residual Frame

2. EPZS (Enhanced Predictive Zonal Search)

Block-matching algorithms for motion estimation (ME) have been widely adopted by current video compression standards, such as H.261, H.263, MPEG-1, MPEG-2, MPEG-4 and H.264, due to their effectiveness and simple implementation.

The most straightforward one is the full search (FS). However, this method is very computationally intensive, and can consume up to 80% of the computational power of the encoder. This limitation makes ME the main bottleneck in real-time video coding applications. Consequently, fast algorithms are indispensable to decrease the computational cost.

Predictive motion estimation algorithms have become quite popular in several video coding implementations and standards, such as MPEG-4 and H.263, due to their very low encoding complexity and high efficiency compared to the brute force Full Search (FS) algorithm. Their efficiency comes mainly from initially considering several highly likely predictors and from introducing very reliable early-stopping criteria. This technique aims to stop the motion estimation early, without checking all predictors, by calculating a thresholding parameter.

There are five types of predictors used in the algorithm:

• Spatial
• Spatial Memory
• Temporal
• Windows
• Block Type

2.1 EARLY TERMINATION ALGORITHM

To reduce the motion estimation complexity, the EPZS algorithm introduces an early termination process by taking advantage of the distortion correlation of the adjacent blocks. The EPZS idea consists of choosing a list of predictors as checking points based on correlation aspects. The best predictor is chosen by minimizing the cost function in equation 1 which is the distortion. Instead of computing the distortions of all the predictors, the EPZS compares the distortion to a threshold criterion and if it’s smaller, it enables skipping the remaining predictors considering that their distortions could not be much smaller than the found distortion.

m = (Mx, My): motion vector

p = (Px, Py): predictor for the motion vector

λ(m): Lagrangian multiplier

R(m − p): rate term, representing the motion information
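Equation 1 itself is not reproduced in this copy of the report. The sketch below therefore assumes a commonly used Lagrangian form, J(m) = SAD(m) + λ · R(m − p), which is consistent with the symbols listed above but is not confirmed by the source. The rate model, the threshold value and all function names are hypothetical, and the predictor list is supplied by the caller rather than derived as EPZS does.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def rate(m, p):
    """Hypothetical stand-in for R(m - p): larger differences cost more bits."""
    return abs(m[0] - p[0]) + abs(m[1] - p[1])


def cost(cur_block, ref, bx, by, m, p, lam):
    """Assumed cost J(m) = SAD + lam * R(m - p) for the candidate vector m."""
    size = len(cur_block)
    x, y = bx + m[0], by + m[1]
    candidate = [row[x:x + size] for row in ref[y:y + size]]
    return sad(cur_block, candidate) + lam * rate(m, p)


def search_predictors(cur_block, ref, bx, by, predictors, p, lam, threshold):
    """Check predictors in order and stop as soon as the best cost is below threshold."""
    size = len(cur_block)
    best_mv, best_cost, checked = None, None, 0
    for m in predictors:
        x, y = bx + m[0], by + m[1]
        # Ignore predictors that point outside the reference frame.
        if x < 0 or y < 0 or x + size > len(ref[0]) or y + size > len(ref):
            continue
        j = cost(cur_block, ref, bx, by, m, p, lam)
        checked += 1
        if best_cost is None or j < best_cost:
            best_mv, best_cost = m, j
        if best_cost < threshold:
            break  # early termination: remaining predictors are skipped
    return best_mv, best_cost, checked
```

The third return value counts how many predictors were actually evaluated, which makes the saving from early termination visible when the function is called with a long predictor list.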

3. Introduction to High Efficiency Video Coding (H.265)

High Efficiency Video Coding (HEVC) is a proposed video compression standard, a successor to H.264/MPEG-4 AVC (Advanced Video Coding), currently under joint development by the ISO/IEC Moving Picture Experts Group (MPEG) and ITU-T Video Coding Experts Group (VCEG). MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the proposed HEVC standard.

3.1 Background

HEVC aims to substantially improve coding efficiency compared to AVC High Profile, i.e. reduce bit rate requirements by half with comparable image quality, probably at the expense of increased computational complexity. Depending on the application requirements, HEVC should be able to trade off computational complexity, compression rate, robustness to errors and processing delay time. HEVC is targeted at next-generation HDTV displays and content capture systems which feature progressive scanned frame rates and display resolutions from QVGA (320x240) up to 1080p and Ultra HDTV (7680x4320), as well as improved picture quality in terms of noise level, colour gamut and dynamic range.

3.2 Features

The JCT-VC is currently evaluating new coding tools and also modifying some of the existing coding tools, such as

• adaptive loop filter (ALF),
• sample adaptive offset (SAO),
• extended macro block size (EMS),
• larger transform size (LTS),
• internal bit depth increasing (IBDI),
• adaptive quantization matrix selection (AQMS),
• modified intra prediction, and
• decoder-side motion vector derivation (DMVD).

Many new features are proposed to meet the requirements:

• 2-D non-separable adaptive interpolation filter (AIF)
• Separable AIF
• Directional AIF
• "Supermacroblock" structure up to 64x64 with additional transforms
• Adaptive prediction error coding (APEC) in spatial and frequency domain
• Competition-based scheme for motion vector selection and coding
• Mode-dependent KLT for intra coding

It is speculated that these techniques are most beneficial with multi-pass encoding.

3.3 History

The ITU-T Video Coding Experts Group (VCEG) began significant study of technology advances that could enable creation of a new video compression standard (or substantial compression-oriented enhancements of the H.264/MPEG-4 AVC standard) in about 2004. Various techniques for potential enhancement of the H.264/MPEG-4 AVC standard were surveyed in October 2004. At the next meeting of VCEG, in January 2005, VCEG began designating certain topics as "Key Technical Areas" (KTA) for further investigation. A software codebase called the KTA codebase was established for evaluating such proposals in 2005. The KTA software was based on the Joint Model (JM) reference software that was developed by the MPEG & VCEG Joint Video Team for H.264/MPEG-4 AVC. Additional proposed technologies were integrated into the KTA software and tested in experiment evaluations over the next four years.

Two approaches for standardizing enhanced compression technology were considered: either creating a new standard or creating extensions of H.264/MPEG-4 AVC. The project had tentative names H.265 and H.NGVC (Next-generation Video Coding), and was a major part of the work of VCEG until its evolution into the HEVC joint project with MPEG in 2010. The "H.265" nickname was especially associated with the potential creation of a new standard.

The preliminary requirements for NGVC were bit rate reduction of 50% at the same subjective image quality comparing to H.264/MPEG-4 AVC High profile, with computational complexity ranging from 1/2 to 3 times that of the High profile. NGVC would be able to provide 25% bit rate reduction along with 50% reduction in complexity at the same perceived video quality as the High profile, or to provide greater bit rate reduction with somewhat higher complexity.

"H.265" was used as a nickname for an entirely new standard, as was the "High-performance Video Coding" work by the ISO/IEC Moving Picture Experts Group (MPEG). Although some agreements about the goals of the project had been reached by early 2009, e.g. computational efficiency and high compression performance, the state of technology at the time seemed not yet mature for creation of an entirely new "H.265" standard, as all contributions were essentially modifications closely based on the H.264/MPEG-4 AVC design.

The ISO/IEC Moving Picture Experts Group (MPEG) started a similar project in 2007, tentatively named High-performance Video Coding. Early evaluations were performed with modifications of the KTA reference software encoder developed by VCEG. By July 2009, experimental results showed average bit reduction of around 20% compared with AVC High Profile; these results prompted MPEG to initiate its standardization effort in collaboration with VCEG.

A formal joint Call for Proposals (CfP) on video compression technology was issued in January 2010 by VCEG and MPEG, and proposals were evaluated at the first meeting of the MPEG & VCEG Joint Collaborative Team on Video Coding (JCT-VC), which took place in April 2010. A total of 27 full proposals were submitted. Evaluations showed that some proposals could reach the same visual quality as AVC at only half the bit rate in many of the test cases, at the cost of 2x-10x increase in computational complexity; and some proposals achieved good subjective quality and bit rate results with lower computational complexity than the reference AVC High profile encodings. At that meeting, the name High Efficiency Video Coding (HEVC) was adopted for the joint project. The JCT-VC is currently working to integrate features of some of the best proposals into a single software codebase and to perform further experiments to evaluate those features; the results will be discussed at future meetings.

3.4 When will it be finished?

The timescale for completing the HEVC standard is as follows:

• February 2012: Committee Draft (complete draft of standard)
• July 2012: Draft International Standard
• January 2013: Final Draft International Standard (ready to be ratified as a Standard)

Table of main Linux commands

Command    Description                        DOS equivalent
ls         lists the content of a directory   dir
cd         changes directory                  cd
cd ..      moves to the parent directory      cd..
mkdir      creates a new directory            md
rmdir      removes a directory                deltree
cp         copies a file                      copy, xcopy
mv         moves a file                       move
rm         removes a file                     del
passwd     changes the user's password
cat        displays the file's content        type
chmod      changes the attributes of a file

Usage: chmod XXX file, where XXX = User|Group|Other and each X is an integer with 0 ≤ X ≤ 7.

Read = 4, Write = 2, Run = 1, and X = Read + Write + Run:

0 means no rights
1 means running right
2 means writing right
3 means writing and running rights
4 means reading right
5 means reading and running rights
6 means reading and writing rights
7 means all rights
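The rule X = Read + Write + Run can be checked with a few lines of code. This is only an illustration of the arithmetic; the function names are made up for the example.

```python
READ, WRITE, RUN = 4, 2, 1


def permission_digit(read=False, write=False, run=False):
    """Return the digit X for one class of user (User, Group or Other)."""
    return READ * read + WRITE * write + RUN * run


def mode_string(user, group, other):
    """Join the three digits into the XXX argument passed to chmod."""
    return f"{user}{group}{other}"


owner = permission_digit(read=True, write=True, run=True)
others = permission_digit(read=True, run=True)
print(mode_string(owner, others, others))  # prints "755"
```

The resulting string would be used as in chmod 755 file, giving all rights to the user and reading and running rights to the group and to others.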

4. Introduction to Profiling

Profiling allows you to learn where your program spent its time and which functions called which other functions while it was executing. This information can show you which pieces of your program are slower than you expected, and might be candidates for rewriting to make your program execute faster. It can also tell you which functions are being called more or less often than you expected. This may help you spot bugs that had otherwise been unnoticed.

Since the profiler uses information collected during the actual execution of your program, it can be used on programs that are too large or too complex to analyze by reading the source. However, how your program is run will affect the information that shows up in the profile data. If you don’t use some feature of your program while it is being profiled, no profile information will be generated for that feature.

4.1 Compiling a Program for Profiling

The first step in generating profile information for your program is to compile and link it with profiling enabled. To compile a source file for profiling, specify the ‘-pg’ option when you run the compiler. (This is in addition to the options you normally use.) To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify ‘-pg’ in addition to your usual options. The same option, ‘-pg’, alters either compilation or linking to do what is necessary for profiling. Here are examples:

cc -g -c myprog.c utils.c -pg

cc -o myprog myprog.o utils.o -pg

The ‘-pg’ option also works with a command that both compiles and links:

cc -o myprog myprog.c utils.c -g -pg

4.2 Executing the Program

Once the program is compiled for profiling, you must run it in order to generate the information that gprof needs. Simply run the program as usual, using the normal arguments, file names, etc. The program should run normally, producing the same output as usual. It will, however, run somewhat slower than normal because of the time spent collecting and writing the profile data.

The way you run the program—the arguments and input that you give it—may have a dramatic effect on what the profile information shows. The profile data will describe the parts of the program that were activated for the particular input you use. For example, if the first command you give to your program is to quit, the profile data will show the time used in initialization and in cleanup, but not much else. Your program will write the profile data into a file called ‘gmon.out’ just before exiting. If there is already a file called ‘gmon.out’, its contents are overwritten. There is currently no way to tell the program to write the profile data under a different name, but you can rename the file afterward if you are concerned that it may be overwritten. In order to write the ‘gmon.out’ file properly, your program must exit normally: by returning from main or by calling exit.

Calling the low-level function _exit does not write the profile data, and neither does abnormal termination due to an unhandled signal.

The ‘gmon.out’ file is written in the program’s current working directory at the time it exits. This means that if your program calls chdir, the ‘gmon.out’ file will be left in the last directory your program chdir’d to. If you don’t have permission to write in this directory, the file is not written, and you will get an error message.

4.3 gprof Command Summary

After you have a profile data file ‘gmon.out’, you can run gprof to interpret the information in it. The gprof program prints a flat profile and a call graph on standard output. Typically you would redirect the output of gprof into a file with ‘>’.

You run gprof like this:

gprof options [executable-file [profile-data-files...]] [> outfile]

If you omit the executable file name, the file ‘a.out’ is used. If you give no profile data file name, the file ‘gmon.out’ is used. If any file is not in the proper format, or if the profile data file does not appear to belong to the executable file, an error message is printed. You can give more than one profile data file by entering all their names after the executable file name; then the statistics in all the data files are summed together. The order of these options does not matter.

4.4 Profiling the HM Code

Points to check before starting the profiling:

• Make sure that g++ and dos2unix are installed, using: sudo apt-get install (package)

• Make sure that all the directories under /HM-1.0/source/ are converted using: dos2unix ./(directory)

• /HM-1.0/build/linux/common/makefile.base is our rule file.

Compiling:

• In /HM-1.0/build/linux/common/makefile.base we have to set:

CPPFLAGS = -pg
ALL_LDFLAGS = -pg

• Now, in /HM-1.0/build/linux/, run make clean (to clean the existing build, if any).

• In /HM-1.0/build/linux/, run make all (to compile the code).

• Now we have to define the parameters in the cfg file.

Executing:

• (Encoder) /HM-1.0/bin/TApp(Project-Encoder) -c (path of the cfg file)

• Following this we get Videosequence.bin.

• (Decoder) /HM-1.0/bin/TApp(Project-Decoder) -b (path of the .bin file) -o (output file; specify the path if you want it in any other location)

Now gmon.out is generated in the bin folder in both cases, since we were working in this directory.

Profiling:

• In /HM-1.0/bin/, run: gprof ./TApp(Project-Encoder/Decoder) ./gmon.out > (output file)

(this gives the flat profile, followed by the call graph)

• In /HM-1.0/bin/, run: gprof -q ./TApp(Project-Encoder/Decoder) ./gmon.out > (output1 file)

(this gives only the call graph profile)


Figure 16 : A Sample Call Graph


5. Parallel Computing

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. It has been adopted into graphics card drivers by both AMD/ATI (which made it its sole GPGPU offering branded as Stream SDK) and Nvidia, which offers OpenCL as equal choice to its Compute Unified Device Architecture (CUDA) in its drivers. OpenCL's architecture shares a range of computational interfaces with both CUDA and Microsoft's competing DirectCompute. OpenCL gives any application access to the Graphics Processing Unit for non-graphical computing. Thus, OpenCL extends the power of the Graphics Processing Unit beyond graphics (General-purpose computing on graphics processing units). Academic researchers have investigated automatically compiling OpenCL programs into application-specific processors running on FPGAs, and commercial FPGA vendors are developing tools to translate OpenCL to run on their FPGA devices. OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. OpenCL is managed by the non-profit technology consortium Khronos Group.

5.1 Introduction

Recent development of the Graphics Processing Unit (GPU) has transformed it from highly specialised hardware with fixed functions into a fully programmable device, first through small programs executed in a specialised fashion within the graphics pipeline, and eventually through fully generalised interfaces such as Nvidia’s CUDA and ATI’s Stream. The continuous consumer demand for high-quality graphics has meant that GPU devices boast impressive rendering performance. However, there are many problems that can be expressed in a similar fashion and thus take advantage of this relatively cheap source of high-performance computing. Open Computing Language (OpenCL) is, like OpenGL, an open specification that vendors may choose to implement. It offers a common interface for general computation on heterogeneous hardware, in the same way that OpenGL offers a common interface for rendering on heterogeneous hardware. While it is not explicitly designed for GPUs, the computational model is highly adapted to the hardware of modern GPUs and as such requires a different approach to problem solving. Most importantly, GPUs achieve their high performance through massive parallelisation and multi-threading, by making an inherent assumption of locality and thus offering few means of synchronisation. Moreover, by using a relaxed memory consistency model, costly validation can be avoided to further improve performance.

Image analysis in general seems to be a good candidate for GPU implementation: many operations rely on properties that are local to the pixel being considered and, thanks to its primary purpose, graphics hardware has memory optimised for storing image data. Interpolation for scaling and border value handling are implemented in hardware.

5.2 Parallel Computing

Parallelisation as a means of computational speed-up is by no means a new concept. Many problems have been observed to be local in their nature and well suited for parallel execution. The way in which this can be achieved varies from problem to problem, and one usually makes a distinction between so-called task parallelism and data parallelism. Task parallelism refers, as one might think, to problems which can be split into several distinct sub-problems or tasks, which in turn can be executed in parallel. Data parallelism, on the other hand, refers to problems where the same task is executed across a large set of data.

Regardless of the strategy used, parallelisation poses new problems for programmers when compared to the traditional sequential way of executing programs. Firstly, we must devise a way for the processing units to communicate, be it through shared memory, Ethernet or any other means of interconnecting the hardware. Secondly, multiple concurrent processes mean that consistency in memory and execution (through order of execution) can no longer be guaranteed and hence must be considered by the developer. A third inherent assumption of a parallel strategy is optimisation: if performance were not an issue, we could just as well have chosen a sequential approach. Conventional wisdom tells us we must minimise the number of instructions; however, as we shall see, when opting for a parallelised approach this might not always be the case.

5.3 Why Parallel

In the good old days, software speedup was achieved by using a CPU with a higher clock speed, which increased significantly with each passing year. However, around 2004, when Intel’s CPU clock speeds approached 4 GHz, the increase in power consumption and heat dissipation formed what is now known as the “Power Wall”, which effectively caused the CPU clock speed to level off. The processor vendors were forced to give up their efforts to increase the clock speed, and instead adopted a new method of increasing the number of cores within the processor. Since the CPU clock speed has either remained the same or even slowed down in order to economise on power usage, old software designed to run on a single processor will not get any faster just by replacing the CPU with the newest model. To get the most out of current processors, the software must be designed to take full advantage of the multiple cores and perform processes in parallel. Today, dual-core CPUs are commonplace even in basic consumer laptops. This shows that parallel processing is not just useful for performing advanced computations, but that it is becoming common in various applications.

5.4 Execution models

In order to discuss how one might exploit parallelism, it is necessary to define more precisely how execution, and in particular parallel execution, is performed. Flynn (1972) defined the following four modes of execution:

• Single Instruction, Single Data stream (SISD) is the traditional sequential mode of execution, in which a single processing unit applies one instruction stream to one data stream.

• Single Instruction, Multiple Data streams (SIMD) uses a single instruction stream across multiple processors, each with its own data stream, to achieve parallel execution. As all processing units operate on the same stream of instructions, they move in lockstep.

• Multiple Instruction, Multiple Data streams (MIMD) is perhaps the most obvious mode of parallel execution. Each processing unit has its own set of instructions and its own set of data to work on.

• Multiple Instruction, Single Data stream (MISD) is a less common structure. One potential use is fault tolerance, where the outputs of the different processing units must agree.

These four major categories have since been extended and further divided, with variations such as Single Program, Multiple Data (SPMD, also known as Single Process, Multiple Data) (Darema et al., 1988). This model is similar to SIMD but, rather than moving in lockstep, each processing unit has its own program counter, allowing the units to work independently and thus placing the model in the MIMD category.

6. OpenCL

OpenCL is designed as a homogeneous interface for the multitude of different parallel computing devices available today. The vastly different architectures used by these devices have meant that utilising them has required hardware-specific libraries, and as a result software has been strongly dependent on hardware. OpenCL attempts to solve this problem by offering a unified architecture specification. While it supports both the task-parallel and data-parallel paradigms, through SPMD and SIMD respectively, its primary focus is data parallelism (Khronos OpenCL Working Group, 2008b, p.25). This is not to say that it offers a major abstraction from the hardware; quite the contrary, it is pointed out that OpenCL is intended to offer a “low-level, high-performance, portable abstraction” (Khronos OpenCL Working Group, 2008b, p.11). It is important to note here that OpenCL is not an actual implementation but rather a specification that hardware vendors may choose to support. In other words, while an application may run on hardware from two different vendors, the performance may be radically different. It is therefore necessary to consider the actual hardware to be used when writing OpenCL applications if one is to expect any performance gains.

6.1 The OpenCL Architecture

The OpenCL architecture is split into four models: the platform model, the execution model, the memory model and finally the programming model.

Platform model

The platform model consists of a host with one or more connected OpenCL devices. These devices in turn consist of Compute Units (CUs), and each Compute Unit consists of several processing elements (PEs). Using a modern multi-core CPU as an example, the CPU is the compute unit and each core is a processing element. Execution of applications using OpenCL is achieved by running a native application on the host, which then issues commands to the OpenCL devices via an OpenCL context and command queue.

[Figure: 1D, 2D and 3D NDRanges, each divided into work-groups of work-items]

Execution model

As the OpenCL platform is designed to utilise multiple additional computational devices, the execution of an OpenCL application is split between the host and the CUs. The part that runs on the host is aptly named the host program, and the part that runs on the CU(s) is called a kernel. To facilitate parallel execution, an index space is created when the kernel is submitted to the device for computation. An instance of the kernel, called a work-item, is then created for each index and can subsequently be identified by this index. Work-items are then grouped into work-groups. Just like the work-items, each work-group holds a unique ID derived from the same index space. Apart from its global ID, each work-item is also given a local ID to identify its location within its work-group. During execution all work-items in a work-group will execute concurrently.
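The relation between the global ID, the work-group ID and the local ID can be sketched in plain C for one dimension. This is only an illustration under the assumption of a zero global offset and a global size that is a multiple of the work-group size; the type work_item_pos and the function names are hypothetical and not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item inside the index space (one dimension). */
typedef struct {
    size_t group_id; /* which work-group the item belongs to */
    size_t local_id; /* position of the item inside that work-group */
} work_item_pos;

/* Split a global ID into work-group ID and local ID. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    assert(local_size > 0);
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Inverse mapping: rebuild the global ID from group ID and local ID. */
size_t join_ids(work_item_pos pos, size_t local_size)
{
    assert(pos.local_id < local_size);
    return pos.group_id * local_size + pos.local_id;
}
```

For example, with a work-group size of 16 the work-item with global ID 37 is item 5 of work-group 2. For 2D and 3D NDRanges the same split would be applied to each dimension separately.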

The index space used to partition problems in OpenCL is called an NDRange. As the name suggests, the NDRange (N-Dimensional Range) supports multidimensional indexing; however, in its current form OpenCL only supports indexing in up to three dimensions (Khronos OpenCL Working Group, 2008a, p.19). In order for the host to start execution of a kernel it must be able to enqueue commands. OpenCL achieves this through what it calls a context. The context contains the devices to be used, the kernels and their program objects, and memory objects. Using this context the host then creates a command-queue using the OpenCL API.

The commands the host can issue via the command-queue break down into the following three categories:

• Kernel execution commands

• Memory commands

• Synchronisation commands

The command-queue allows for two general modes of execution of enqueued commands: in-order and out-of-order. In-order execution forces a serialisation of the commands, whereas out-of-order execution allows commands to start executing before prior commands have finished.

Memory model

The OpenCL memory model only concerns the memory on the compute devices, i.e. the memory of the host application is not bound by this model but rather by that of its native platform. The work-items generated do adhere to this model and have access to the following four types of memory.

• Global Memory: Device memory from which all items in all work-groups can both read and write.

• Constant Memory: Global memory which cannot be written to during run-time.

• Local Memory: Memory shared within a work-group, i.e. work-items in the same work-group can access this memory space, but it remains hidden to work-items in other work-groups.

• Private Memory: Memory only visible to the work-item.

In order to actually bind or map data to the device the host must interact with the OpenCL memory model. This is done by issuing memory commands to the command-queue. More specifically the host can allocate, deallocate, read and write to device memory.


OpenCL uses a relaxed memory consistency model (Khronos OpenCL Working Group, 2008a, p.25) and memory is only guaranteed to be consistent over barriers (synchronisation points). The actual hardware implementation of this memory model may vary from vendor to vendor, as OpenCL only specifies the access levels of these different areas of memory. However, as we shall see later, these choices are heavily influenced by existing hardware architectures, and proper usage can result in rather large performance boosts.

Figure 19 : Memory Model

Programming Model

In essence the programming model is split, allowing a data-parallel and a task-parallel approach. However, as we are only concerned with data-parallel execution, only this model will be described. OpenCL uses a relaxed data-parallel model where a strict one-to-one mapping between work-items and memory elements is not required, i.e. given an array, one work-item may process more than a single element. While the programmer must always specify the number of work-items, the number of work-items per work-group can be left for OpenCL to decide implicitly. However, as we have pointed out earlier, all work-items in a work-group must execute concurrently, and so the work-group size can be set manually to better suit the hardware that is being used.
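Two small pieces of arithmetic follow from this relaxed model and can be sketched in plain C. The first rounds a problem size up to a multiple of a manually chosen work-group size, which is a common way of obtaining a global size that divides evenly. The second emulates one work-item that handles several array elements by striding through the array. Both function names are hypothetical, and the loop only imitates on the CPU what a kernel would do per work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of `multiple` that is greater than or equal to n. */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Emulates work-item `id` out of `num_items` work-items. The work-item
   doubles every element it owns: id, id + num_items, id + 2*num_items, ...
   Together the work-items touch each of the n elements exactly once. */
void scale_strided(float *data, size_t n, size_t id, size_t num_items)
{
    size_t i;
    assert(num_items > 0 && id < num_items);
    for (i = id; i < n; i += num_items)
        data[i] *= 2.0f;
}
```

With a work-group size of 16, a problem of 1000 elements would be padded to a global size of 1008, and the kernel would then have to skip the indices beyond the end of the data.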

Figure 20 : Different memory types and access level

6.2 Graphics Processors

Graphics Processing Units (GPUs) are separate processing units that today can be found in most mainstream computing systems(Owens et al., 2008, p.879).

Modern GPUs are highly optimised and highly parallel computational units that can outperform the traditional CPU both in terms of arithmetic operations and memory bandwidth. As the name suggests, GPUs have their origins in graphics processing, more specifically rasterisation.

One key detail of OpenCL, and indeed of most other general-purpose GPU (GPGPU) programming interfaces, is the “close-to-the-metal” approach, i.e. performance is tightly coupled with the hardware used. It is therefore important to have intimate knowledge of the hardware the kernels are designed to run on.

6.3 Nvidia Compute Unified Device Architecture

As mentioned earlier, OpenCL is just a specification that hardware vendors may choose to support. Nvidia had already released a similar parallel programming architecture specifically for their GPUs, called Compute Unified Device Architecture (CUDA). Unlike OpenCL, CUDA is not device agnostic and will only run on Nvidia GPUs. OpenCL for Nvidia GPUs has since been implemented on top of CUDA, as most of OpenCL’s features and functions map directly to CUDA’s. As a result, it is the CUDA architecture that should be considered when optimising OpenCL kernels running on Nvidia GPUs.

The CUDA architecture was first introduced with the Nvidia Tesla GPU architecture (Lindholm et al., 2008) and is based on a multithreaded approach to parallelisation. The approach is very much like that described for OpenCL, i.e. threads are not strictly SIMD but can work on more than one data element. However, since thread management is handled in hardware and incurs very little overhead, if any (Lindholm et al., 2008, p.43), using more threads than available PEs is not going to limit performance. In fact, threads waiting for data from memory can be put to sleep while threads with available resources are scheduled. In this way multi-threading is used to hide memory latency.

Operating systems and drivers

OpenCL is intended as a cross-platform specification; however, for it to work the hardware must have drivers that support OpenCL. These in turn are delivered by the hardware vendor and as such may vary. In the case of Apple hardware, and more specifically Apple hardware running the latest version of their operating system, Snow Leopard, support for OpenCL is integrated into the operating system. Support for OpenCL in other operating systems is left for the GPU vendors to provide. The implementation of OpenCL in Snow Leopard is Apple’s own, and by and large it is device agnostic, as Apple delivers computers with both ATI and Nvidia GPUs. In the case of Nvidia hardware it has been confirmed that Apple’s implementation is made on top of Nvidia’s CUDA platform (ref press release).

6.4 Programming Interface

The OpenCL specification only specifies a C interface for interaction with the compute devices, and Apple has stayed true to this in its implementation in Snow Leopard. However, as one of the requirements for this thesis was to use Java when possible, one has to either implement native bindings or choose from those publicly available. There are currently two open-source alternatives to choose from: “Java bindings for OpenCL” (JOCL) and an implementation in the “Nativelibs 4 Java” family (JavaCL). JOCL offers a more lightweight, C-like binding, while JavaCL uses a more Java-like, object-oriented approach. Neither library is extensively documented; however, if one is familiar with the C implementation of OpenCL, the provided Javadoc coupled with examples is enough to render both options workable. For this thesis we have chosen the JavaCL library.

6.5 Working with OpenCL

OpenCL is a relatively new technology, and teething problems are a very real possibility. We have found no previously published work on the stability and reliability of OpenCL implementations, and the fact that we have found very few papers using OpenCL at all means there is little data to work with. This evaluation will therefore also cover issues such as specification compliance and bugs in the OpenCL implementation, and how those affect the ease of use and performance of OpenCL.

6.5.1 Workflow

The nature of the OpenCL specification lends itself to a rather straightforward workflow:

• Write or generate kernel code and compile it.

• Allocate device memory and copy data to device.

• Set kernel arguments and execute kernel.

• Copy results back to host.
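The order of these steps can be illustrated without any OpenCL runtime. The sketch below is a plain C emulation, not OpenCL host code: the “device” buffers are ordinary heap allocations, the “kernel” is an ordinary function called once per index, and the names square_kernel and run_square are hypothetical. In a real host program each step would be an OpenCL API call enqueued on the command-queue.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Step 1 (emulated): the "kernel", written so that one call handles one
   index of the index space, like one work-item. */
static void square_kernel(const float *in, float *out, size_t gid)
{
    out[gid] = in[gid] * in[gid];
}

/* Runs the emulated workflow. Returns 0 on success, -1 if the "device"
   buffers cannot be allocated. */
int run_square(const float *host_in, float *host_out, size_t n)
{
    size_t gid;
    float *dev_in, *dev_out;
    assert(host_in != NULL && host_out != NULL && n > 0);

    /* Step 2: allocate "device" memory and copy the input to it. */
    dev_in = malloc(n * sizeof *dev_in);
    dev_out = malloc(n * sizeof *dev_out);
    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }
    memcpy(dev_in, host_in, n * sizeof *dev_in);

    /* Step 3: pass the arguments and execute over a 1D index space. */
    for (gid = 0; gid < n; gid++)
        square_kernel(dev_in, dev_out, gid);

    /* Step 4: copy the results back to the host and release the buffers. */
    memcpy(host_out, dev_out, n * sizeof *dev_out);
    free(dev_in);
    free(dev_out);
    return 0;
}
```

The sequential loop in step 3 is where the device would instead run all work-items in parallel.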

Exactly how this is utilised in an application may of course vary, and other advanced features such as streams and OpenGL interoperability may change the way in which OpenCL is used, but the order of the actions above is generally followed. Considering the properties of the different memory types in the CUDA architecture, Nvidia further suggests the following layout of kernels:

• Load data from global memory into local memory.

• Synchronise threads.

• Perform calculations on data in local memory.

• Write data to global memory.

Obviously, the first two steps only make sense if we have data that will be used multiple times; however, in image processing, where pixel neighbourhoods are considered, this workflow is indeed very relevant.

6.5.2 Basic kernels

The nature of our problem, then, seems well suited to the NDRange structure of OpenCL. We create a one-to-one mapping between the pixel domain of the image and the NDRange. Upon execution this creates a thread for each pixel in the image, whose dimensional identity corresponds to the coordinate of the pixel it should process. Each thread then simply reads the surrounding pixels as defined by the kernel/SE and performs the corresponding reduction operation. The complexity of a thread is then O(n) in the area or volume of the window, depending on the dimensionality of the image, and correspondingly O(n^2) or O(n^3) in the radius of the window.
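The work done by a single thread of such a basic kernel can be sketched in plain C for the 2D case. This is an illustration only, under two assumptions that the text does not fix: border pixels are handled by clamping coordinates to the image edge, and the filter is applied without flipping (which gives the same result as convolution for symmetric filters). The function names are hypothetical.

```c
#include <assert.h>

/* Clamp a coordinate to the valid range [0, size - 1]. */
static int clamp_index(int v, int size)
{
    return v < 0 ? 0 : (v >= size ? size - 1 : v);
}

/* Result for output pixel (x, y): weighted sum over the
   (2r + 1) x (2r + 1) window centred on that pixel. The two nested
   loops make the cost per pixel quadratic in the radius r. */
float convolve_pixel(const float *img, int width, int height,
                     const float *filter, int r, int x, int y)
{
    int dx, dy;
    int k = 2 * r + 1; /* window side length */
    float sum = 0.0f;
    assert(width > 0 && height > 0 && r >= 0);
    for (dy = -r; dy <= r; dy++) {
        for (dx = -r; dx <= r; dx++) {
            int sx = clamp_index(x + dx, width);
            int sy = clamp_index(y + dy, height);
            sum += filter[(dy + r) * k + (dx + r)] * img[sy * width + sx];
        }
    }
    return sum;
}
```

In the OpenCL version, x and y would come from the two global IDs of the work-item, and the function body would form the kernel.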

Separable kernel

The separable algorithm splits the one-pass approach of the basic kernel into multiple passes. The exact shape of the windows used in the individual passes depends on the decomposition used; however, in the algorithmic sense, when we say separable we mean separable across dimensions, that is, an n-dimensional window can be split into at most n passes, each of dimensionality at most n − 1. After all, if it is not split across dimensions the algorithm is identical, albeit with a different kernel. Ideally we want a split into n passes where each pass is 1-dimensional. The arithmetic operations required for each pixel in a pass will then be O(n) in the radius of the kernel.
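For a 2D image the ideal split is a row pass followed by a column pass, each using the same kind of 1D filter. The plain C sketch below illustrates this on the CPU; it assumes clamped borders and one set of filter taps shared by both passes, and all names are hypothetical. Note the intermediate buffer between the passes.

```c
#include <assert.h>
#include <stdlib.h>

/* Clamp a coordinate to the valid range [0, size - 1]. */
static int clamp_index(int v, int size)
{
    return v < 0 ? 0 : (v >= size ? size - 1 : v);
}

/* One 1D pass over the whole image: along rows if horizontal is
   nonzero, along columns otherwise. Cost per pixel is linear in r. */
static void pass_1d(const float *src, float *dst, int width, int height,
                    const float *taps, int r, int horizontal)
{
    int x, y, d;
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            float sum = 0.0f;
            for (d = -r; d <= r; d++) {
                int sx = horizontal ? clamp_index(x + d, width) : x;
                int sy = horizontal ? y : clamp_index(y + d, height);
                sum += taps[d + r] * src[sy * width + sx];
            }
            dst[y * width + x] = sum;
        }
    }
}

/* Row pass into a temporary buffer, then column pass into out.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int convolve_separable(const float *img, float *out, int width, int height,
                       const float *taps, int r)
{
    float *tmp;
    assert(width > 0 && height > 0 && r >= 0 && img != out);
    tmp = malloc((size_t)width * (size_t)height * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    pass_1d(img, tmp, width, height, taps, r, 1);
    pass_1d(tmp, out, width, height, taps, r, 0);
    free(tmp);
    return 0;
}
```

With taps of all ones and r = 1 this produces the same sums as a 3 x 3 window of ones, but with 2(2r + 1) instead of (2r + 1)^2 multiply-adds per pixel.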


It is important to note that the dependency on neighbouring pixels means we must use a separate buffer to write our results. By default OpenCL does not support paging between device and host memory. As a result, the largest possible image (in bytes of memory) that can be convolved/eroded/dilated using the GPU is (m − k)/2, where m denotes the total size of the device memory and k the size of the kernel, both in bytes, since the input image and the equally sized result buffer must fit in device memory alongside the kernel. In practice the OpenCL specification also allows vendors to impose a maximum size for a single memory object, i.e. if an image would require more space than this limit, two separate buffers must be allocated to store the entire image in device memory (Khronos OpenCL Working Group, 2008a, p.32).

Kernel optimisation

In Image Convolution with CUDA, Podlozhnyuk (2007a) presents an optimised strategy for spatial convolution in general and an optimised separable spatial convolution algorithm using the CUDA architecture. The algorithm provided uses the kernel workflow described in 6.5.1. The strategy (and indeed the algorithm) splits the image into smaller sub-images and assigns each sub-image to a work-group for computation. Each work-group then loads the pixels needed for computation of its segment into shared memory, synchronises the threads and computes the result before writing it back to memory. The actual optimisations lie in how memory is loaded and how the work is partitioned to attain maximum computational bandwidth.

Memory coalescence

The actual area of the image each work-group must load into shared memory depends on the size of the kernel. Pixels on the edges of the work-group will have neighbourhoods that lie partially outside the pixels of the work-group; this surrounding region is referred to as the apron. More precisely, the area to be considered is extended by k_i − 1 in dimension i, where k_i denotes the size of the filter kernel in dimension i. We subtract one as we have already considered the origin pixels in the work-group. This poses a problem in terms of memory coalescence, as the size of the kernel does not necessarily align with warp sizes and, by extension, memory. To get around this problem Podlozhnyuk suggests using additional threads or offsetting the starting address in the loading stage to force alignment in each load. It may seem wasteful to load memory that will not be used, but the alternative is serialisation of all global memory requests in the warp for compute capability < 1.2. Obviously one request with 128 bytes of data, 64 of which are useful, is preferable to 16 requests each with 4 bytes of useful data.
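The arithmetic behind the apron and the offset starting address can be written out in a few lines of plain C. This is a sketch with hypothetical function names; sizes and indices are counted in elements of one dimension, and the alignment value is whatever boundary the hardware requires.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements a work-group must load in one dimension: its own
   tile plus the apron, i.e. tile + k - 1 for a filter of size k. */
size_t extent_with_apron(size_t tile, size_t k)
{
    assert(k >= 1);
    return tile + k - 1;
}

/* Move a start index down to the nearest multiple of `alignment`. */
size_t align_down(size_t index, size_t alignment)
{
    assert(alignment > 0);
    return index - index % alignment;
}

/* Elements that are loaded without being needed when a load that should
   begin at `start` is moved down to the aligned boundary. */
size_t wasted_elements(size_t start, size_t alignment)
{
    return start - align_down(start, alignment);
}
```

As an example, a tile of 16 pixels with a filter of size 17 (radius 8) needs 32 pixels in that dimension. If the tile begins at index 32, the apron begins at index 24; aligning the load to a boundary of 16 moves the start to index 16 and loads 8 pixels that are never used.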

Work partitioning

To achieve coalescence in the load stage we rely on thread IDs to determine which pixel to load. These in turn are defined by the work-size and work-group definition. Some general observations can be made, such as that the greater the work-group size, the fewer global memory accesses we must make. Furthermore, using work-group sizes that are not half-warp aligned (i.e. not a multiple of the half-warp size) will cause wrapping of dimensional IDs in the half-warps, as thread warp numbering follows the row-major convention. This on its own is not a problem, but when considering memory access it may cause problems both in terms of coalescence and bank conflicts. Generally speaking it is safer to simply use multiples of the half-warp size so that the thread dimensionality composition of all warps is identical. This makes reading and writing to consecutive areas of memory consistent without the use of conditional statements, thus further improving performance. With this method, bank conflicts in local memory will not arise during the loading or convolution stage, assuming 4-byte words. In the loading stage each thread reads from, and consequently writes to, consecutive memory addresses. In the convolution stage, each thread in the half-warp reads from the same pixel coordinate in its corresponding neighbourhood and, as the threads are processing consecutive pixels, each simultaneous read in the half-warp is to consecutive local memory addresses and thus separate banks.
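Why consecutive 4-byte words avoid bank conflicts can be checked with a small plain C model. The model assumes 16 banks of 4-byte words and a half-warp of 16 threads, as on compute capability 1.x hardware; both constants and the function name are assumptions of this sketch, and it ignores the special case of several threads reading the very same word.

```c
#include <assert.h>

#define NUM_BANKS 16 /* assumed number of local memory banks */
#define HALF_WARP 16 /* assumed number of threads per half-warp */

/* Thread t of a half-warp accesses the 4-byte word with index
   t * stride_words. Returns the largest number of threads that fall on
   the same bank: 1 means conflict free, 2 a two-way conflict, etc. */
int max_bank_load(int stride_words)
{
    int load[NUM_BANKS] = {0};
    int t;
    int worst = 0;
    assert(stride_words >= 1);
    for (t = 0; t < HALF_WARP; t++) {
        int bank = (t * stride_words) % NUM_BANKS;
        load[bank]++;
        if (load[bank] > worst)
            worst = load[bank];
    }
    return worst;
}
```

A stride of one word, which is the access pattern described above, places every thread on its own bank. A stride of two words gives a two-way conflict, and a stride of 16 words sends all 16 threads to the same bank.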

Occupancy

Occupancy is the ratio of active warps to the maximum number of active warps on a streaming multiprocessor. Strictly speaking there are only three factors limiting the occupancy of a multiprocessor:

• Register usage per thread.

• Local memory usage per work-group.

• Work-group size.

For a warp to be started it must have access to enough registers and local memory to complete its execution. However, reduction of multiprocessor load is done on a per work-group basis. That is, an entire work-group must be able to start for it to be executed. Considering that local memory is shared across the work-group, this requirement comes as no surprise. Likewise, if the work-group size does not align with the maximum number of threads a multiprocessor can handle, there might be spare room for more threads to execute, but since the work-group is larger than the number of slots available it will not be able to execute. As a result of this property, the throughput of an SM is limited to the throughput of the work-groups. If some threads are only used in the loading stage (e.g. for alignment), the throughput in the computational stage will be limited, as thread slots are occupied by idling threads from the loading stage. To avoid this, Podlozhnyuk (2007a) suggests limiting the number of load threads and performing several loads per thread to increase computational throughput in the later stages of the algorithm.
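The effect of whole work-groups competing for registers, local memory and thread slots can be estimated with a simple model in plain C. The sketch below is an approximation under stated assumptions: the limits are passed in by the caller rather than taken from any real device, allocation granularity is ignored, and the work-group size is assumed to be a multiple of the warp size so that the ratio of threads equals the ratio of warps. The type sm_limits and the function names are hypothetical.

```c
#include <assert.h>

/* Per-multiprocessor limits; the values are supplied by the caller. */
typedef struct {
    int max_threads;     /* thread slots */
    int max_groups;      /* resident work-groups */
    int registers;       /* size of the register file */
    int local_mem_bytes; /* local (shared) memory */
} sm_limits;

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Number of whole work-groups that fit on one multiprocessor. */
int resident_groups(sm_limits sm, int group_size, int regs_per_thread,
                    int local_bytes_per_group)
{
    int by_threads, by_regs, by_local;
    assert(group_size > 0 && regs_per_thread > 0 && local_bytes_per_group >= 0);
    by_threads = sm.max_threads / group_size;
    by_regs = sm.registers / (regs_per_thread * group_size);
    by_local = local_bytes_per_group > 0
                   ? sm.local_mem_bytes / local_bytes_per_group
                   : sm.max_groups;
    return min_int(min_int(by_threads, by_regs),
                   min_int(by_local, sm.max_groups));
}

/* Fraction of the thread slots that is occupied by resident work-groups. */
double occupancy(sm_limits sm, int group_size, int regs_per_thread,
                 int local_bytes_per_group)
{
    int groups = resident_groups(sm, group_size, regs_per_thread,
                                 local_bytes_per_group);
    return (double)(groups * group_size) / (double)sm.max_threads;
}
```

Taking, purely as an illustration, limits of 768 thread slots, 8 work-groups, 8192 registers and 16384 bytes of local memory, a work-group of 256 threads using 10 registers per thread fills the multiprocessor completely, whereas a work-group of 512 threads leaves 256 slots idle because a second work-group no longer fits.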

Constant memory

When we’ve dealt with memory handling of the image we still have the issue of the filter kernel. Using local memory would mean more loads from global memory and a higher local memory usage per work-group, potentially further limiting the possible number of threads that can be run concurrently by the streaming multiprocessor. Constant memory however, would not affect the amount of local memory used by the work-group while having the benefit of on-chip store in the constant cache. The amount of constant memory is very limited compared to device memory, only 64KB in the Tesla architecture but for any reasonable filter kernel size this is more than enough.

Instruction throughput and cost

Through the OpenCL specification we are guaranteed support for all basic operators and instructions, but it makes no requirements in terms of the performance of these instructions. The Nvidia CUDA architecture offers its best performance when working with floating point variables, but as we are only concerned with greyscale images this is of little hindrance: any image data can easily be represented as floating point numbers. Another key to performance is avoiding branch instructions. As mentioned previously, branching within a warp causes stalling of threads not following the current path of execution; however, even if branching is strictly along warp divisions it may still have a seriously detrimental effect on performance. In particular, loops with few instructions within the loop body will suffer from this, as the ratio of work to branch instructions will be very low. To avoid this, loops that have a known number of iterations at compile time should, when possible, be unrolled.
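What unrolling means for a filter loop can be shown with two equivalent functions in plain C. The sketch assumes a radius that is fixed at compile time (here 2, giving five taps); the constant RADIUS and both function names are hypothetical. In practice the compiler may perform this transformation itself when asked to.

```c
#include <assert.h>

#define RADIUS 2 /* assumed to be known at compile time */

/* Rolled version: every multiply-add is accompanied by a loop counter
   update and a conditional branch. */
float dot_rolled(const float *window, const float *taps)
{
    float sum = 0.0f;
    int i;
    assert(window != 0 && taps != 0);
    for (i = 0; i < 2 * RADIUS + 1; i++)
        sum += window[i] * taps[i];
    return sum;
}

/* Unrolled by hand: the same five multiply-adds in the same order,
   without any loop branches. */
float dot_unrolled(const float *window, const float *taps)
{
    float sum = 0.0f;
    assert(window != 0 && taps != 0);
    sum += window[0] * taps[0];
    sum += window[1] * taps[1];
    sum += window[2] * taps[2];
    sum += window[3] * taps[3];
    sum += window[4] * taps[4];
    return sum;
}
```

Both functions return the same value for the same input; only the ratio of arithmetic to control-flow instructions differs.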

Final kernels

The separable algorithm used in this thesis is the one provided by Podlozhnyuk (2007a), as it has been developed by the vendor (Nvidia) for optimal performance on their hardware. For the non-separable case, however, no direct algorithm is supplied and as such it falls to the developer to find an optimal solution. The convolution kernels described here can easily be converted to erosion and dilation kernels by replacing the floating point filter kernel with a {0, 1} integer array representing the structuring element. Obviously, in the case of a non-flat SE we will still have to use a floating point array representing our SE. We then simply take the min or max, respectively, of the current smallest/greatest value and the pixel value, and replace the former with the result if the SE has a nonzero value at the current position. Instruction-wise this is largely equivalent to convolution.
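The conversion from convolution to erosion and dilation with a flat SE can be sketched in plain C for a single output pixel. As before this is an illustration with hypothetical names, and clamped borders are an assumption of the sketch rather than something the text specifies.

```c
#include <assert.h>
#include <float.h>

/* Clamp a coordinate to the valid range [0, size - 1]. */
static int clamp_index(int v, int size)
{
    return v < 0 ? 0 : (v >= size ? size - 1 : v);
}

/* Erosion (dilate == 0) or dilation (dilate != 0) at pixel (x, y) with a
   flat structuring element given as a (2r + 1) x (2r + 1) array of 0/1.
   If the SE is empty, the initial value FLT_MAX or -FLT_MAX is returned. */
float morph_pixel(const float *img, int width, int height,
                  const int *se, int r, int x, int y, int dilate)
{
    int dx, dy;
    int k = 2 * r + 1;
    float best = dilate ? -FLT_MAX : FLT_MAX;
    assert(width > 0 && height > 0 && r >= 0);
    for (dy = -r; dy <= r; dy++) {
        for (dx = -r; dx <= r; dx++) {
            float v;
            if (se[(dy + r) * k + (dx + r)] == 0)
                continue; /* position is not part of the SE */
            v = img[clamp_index(y + dy, height) * width
                    + clamp_index(x + dx, width)];
            if (dilate ? v > best : v < best)
                best = v; /* keep the running max or min */
        }
    }
    return best;
}
```

The loop structure is the same as for convolution; the multiply-add is replaced by a test of the SE and a comparison.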

Separable kernel

Podlozhnyuk (2007a) opts for a kernel with maximum computational output, having all threads in the work-group compute convolved values in the computational stage. The y-dimensional size of the work-groups varies across the two passes, but the width remains fixed at 16 for both, as this is the minimum required for memory coalescence. In the separable case it is obviously desirable to have a large size in the dimension along which we are performing the convolution in the given pass, as this reduces the number of reads from global memory. As a result Podlozhnyuk opts for a height of 8 in the column pass but only 4 in the row pass. Data is then loaded on a work-group size basis. The number of result steps (i.e. the number of pixels for which the convolved value is computed) should be set according to the kernel size for optimal performance; for a radius of 8, Podlozhnyuk (2007a) uses 4 result steps.

Nonseparable kernel

For the nonseparable kernel we chose a more load-centric approach, using a thread for every pixel to be loaded in the x-dimension of the image. For the y-dimension multiple steps are used, to still maintain an acceptable ratio of computation in the convolution stage of the algorithm. Using this approach we significantly limit the number of control-flow statements in the loading stage and thus improve performance when compared to the computation-centric approach used by Podlozhnyuk (2007a) in the separable algorithm.

6.6 Sample Codes:

Hello World

Hello world kernel (Hello.cl)

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void hello(__global char* string)
{
    string[0] = 'H';
    string[1] = 'e';
    string[2] = 'l';
    string[3] = 'l';
    string[4] = 'o';
    string[5] = ',';
    string[6] = ' ';
    string[7] = 'W';
    string[8] = 'o';
    string[9] = 'r';
    string[10] = 'l';
    string[11] = 'd';
    string[12] = '!';
    string[13] = '\0';
}

Hello world host (hello.C)

#include <stdio.h>
#include <stdlib.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define MEM_SIZE (128)
#define MAX_SOURCE_SIZE (0x100000)

const char *source_size = "\n"\
