
MASARYK UNIVERSITY

FACULTY OF INFORMATICS

Using low cost FPGAs for realtime video processing

MASTER'S THESIS

Filip Roth


Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.


Acknowledgement

I would like to thank IBSmm Engineering for allowing parts of my work to be published as this thesis and also for being understanding and supporting during my studies.

I would like to thank my advisor, prof. Ing. Václav Přenosil, CSc., for his helpfulness and guidance during the writing of this thesis.

Last but not least, I would like to thank my family and friends for their support during my studies.


Abstract

This thesis describes the use of present-day low cost Field Programmable Gate Arrays (FPGAs) for realtime broadcast video processing. The capabilities of the selected device family (Altera Cyclone IV) are discussed with regard to video processing. Example IP cores (deinterlacer, alpha blender and frame rate converter) are designed in Verilog HDL and the design flow is described. The IP cores are implemented in a real hardware system. The overall hardware system is described, together with the individual FPGA components providing video input/output and other I/O functions.


Keywords

FPGA, video processing, deinterlacing, alpha blending, frame rate conversion, Verilog, HDL, hardware design flow


Contents

1 Preface
2 Introduction
2.1 History of Field Programmable Gate Arrays
2.2 Present day FPGAs
2.2.1 Programmable logic
2.2.2 Routing resources
2.2.3 Embedded memory
2.2.4 Embedded multipliers
2.2.5 Development software
2.3 Future possibilities of FPGAs
2.4 Video processing on an FPGA
3 Broadcast video transport standards
3.1 Parallel digital data
3.2 Serial digital interface (SDI)
3.3 Digital Visual Interface (DVI)
4 Project requirements
4.1 Video deinterlacing
4.2 Low latency
4.3 On-screen display generation
4.4 Video stream switching
4.5 Image capture
5 Device family selection
5.1 Design requirements
5.2 Altera Cyclone family
5.3 Xilinx Spartan family
5.4 Lattice Semiconductor Corporation
5.5 Final selection
6 Evaluation of commercial IP cores from Altera
6.1 Video and Image Processing Suite (VIP)
6.2 DDR2 High Performance Controller II
6.3 NIOS II soft processor
7 Selected system structure
7.1 Block diagram
7.2 Camera video input
7.3 Frame buffer
7.4 USB link
7.6 PC video input
7.7 Alpha blender
8 Example video processing cores
8.1 Deinterlacer
8.1.1 Algorithm overview
Line duplication
Line interpolation
Weave algorithm
Motion adaptive algorithm
8.1.2 Algorithm selection
8.1.3 Principle of operation
8.1.4 Implementation
8.2 Alpha blender
8.2.1 Principle of operation
8.2.2 Implementation
8.3 Frame buffer
8.3.1 Principle of operation
8.3.2 Implementation
9 FPGA design flow
9.1 Separate projects for custom components
9.2 Use standard interfaces
9.3 Optimize the memory access pattern
9.4 SignalTap II logic analyzer
9.5 Horizontal and vertical device migration
9.6 Physical I/O pin mapping
10 Resulting hardware
10.1 Verification of the hardware
11 Conclusion
Bibliography
A Pin placement
B Device floorplan
C Slack histogram


Chapter 1

Preface

This work originates in the author's work as a hardware developer at IBSmm Engineering [1], a hardware design house located in Brno, Czech Republic. The video processing IP (intellectual property) cores presented in this work are part of a video processing device developed at IBSmm. The subject of this thesis is therefore a commercial product into which significant time and effort was invested.

As is common in the industry, the IBSmm Engineering management is not willing to release the entire product documentation, including board design files, firmware sources, intellectual property sources for the programmable logic or schematic documentation, to the public domain.

Therefore, a decision was made to make public only selected parts of the design, demonstrating the approaches and algorithms used to accomplish the required functions, but not the entire source code or project files.

For this reason, this work describes the overall system only briefly, and a full description is given only for the IP cores developed by the author to provide the video processing functions of the system. The IP core source code is available on the enclosed CD and in the online archive at Masaryk University, each core as a separate Quartus II 9.1SP2 project.

The entire FPGA design is the original work of the author of this thesis, together with the FPGA pin assignments, timing constraints and the major part of the resulting hardware board schematic (some blocks in the board schematic were reused from earlier projects and were not done by the author).

The complete project documentation is available upon request, provided that the requestor signs an NDA with IBSmm Engineering.

The author hopes that, despite these limitations, this work will give useful information to readers interested in video processing on an FPGA and also provide a "real world" demonstration of the development of a product using these technologies.


Chapter 2

Introduction

Nowadays, as the requirements for processing power of embedded systems are growing, many systems are starting to use FPGAs to offload the processing functions traditionally done by an embedded CPU or ASIC.

This was made possible by the advancements in chip manufacturing technology as described by Moore's law[2], through which programmable logic device parameters such as density, processing power, power consumption and cost improved enough for these devices to become a viable alternative to the traditional approaches.

Additionally, a design using programmable logic offers specific advantages over other approaches, mainly the possibility to alter the configuration of the hardware in the field (hence the name), which is a very useful feature considering problems like bug fixes and frequent needs to modify the design after the product is finished.

Of course, this flexibility comes at a premium compared to a dedicated 'hardened' CPU or ASIC, usually both in terms of power consumption and unit price. However, especially for small production series, the flexibility of programmable logic may more than balance the additional cost of the device; the CPU may not be exactly suited to the application and the ASIC development costs may be well out of bounds of the estimated product volumes.

With the gradual transition of video signal representation from analog signals like VGA and SCART to the digital domain, programmable logic started to provide the processing functions where required. With their inherently parallel nature, these devices are well suited for algorithms requiring high bandwidth and the calculation of many operations in parallel on the video data.

2.1 History of Field Programmable Gate Arrays

The history of Field Programmable Gate Arrays (FPGAs) began with the advent of programmable logic arrays (PLAs)[3]. From today's point of view, these devices were relatively simple and were used mainly as 'glue logic', merging several discrete combinational logic ICs into one chip.

Programmable logic devices progressed hand in hand with the advancements in IC manufacturing technology and architecture theory, and in 1985 the first commercially viable Field Programmable Gate Array was developed by Xilinx, Inc[4].


other vendors emerged and the devices started to be used in more market segments than the initial networking and telecommunications areas. For a more in-depth overview of FPGA history, please see [4].

2.2 Present day FPGAs

Nowadays, FPGAs are standard off-the-shelf components that range widely in size and capabilities. Usually, the FPGA is composed of configurable logic, routing resources, embedded memory, multipliers and a range of hardened peripheral interfaces. Not physically present in the FPGA, but from the design standpoint an integral part of the device design flow, is the FPGA development software.

2.2.1 Programmable logic

The programmable logic is composed of LUTs (look up tables, sometimes also called LEs - logic elements), which are SRAM based cells performing a user defined function given by the FPGA configuration bitstream. The exact LUT structure varies by manufacturer and device family; for illustration, the LUT structure of the Altera Cyclone IV device family is shown here.
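The idea of a LUT as a small configurable truth table can be sketched in software. The following Python model is only an illustration, not a description of the Cyclone IV logic element: the function name `make_lut`, the 4-input width and the bit ordering of the configuration mask are assumptions made for this example.

```python
def make_lut(mask, num_inputs=4):
    """Return a function that behaves like a LUT configured with `mask`.

    Bit i of `mask` holds the output for the input combination whose
    binary value is i (input 0 is the least significant bit).
    The 4-input width and this bit ordering are illustrative assumptions.
    """
    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError("expected %d inputs" % num_inputs)
        index = 0
        for bit_position, value in enumerate(inputs):
            index |= (value & 1) << bit_position
        # The "configuration bitstream" is simply looked up, not computed.
        return (mask >> index) & 1
    return lut


# Two different "configurations" of the same hardware structure:
and4 = make_lut(0x8000)  # only input combination 1111 yields 1
xor4 = make_lut(0x6996)  # parity of the four inputs
```

Changing the mask changes the logic function without changing the structure, which is the property the configuration bitstream exploits.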


2.2.2 Routing resources

To enable connections among the logic elements themselves and between the logic elements and any other parts of the chip, the FPGA contains the interconnect.

These are 'hardened' connection paths inside the chip, either general purpose for the user design or with a specific function, e.g. clock distribution networks.

The clock distribution paths are designed in such a way as to provide uniform clock distribution with minimal skew over all parts of the chip. This is an important part of the interconnect fabric, since most FPGA designs are synchronous and the quality of clock distribution directly affects the maximum frequency at which the user design can properly function (this maximum frequency is usually called fmax).

Also, this is the part of an FPGA occupying the most silicon resources of the chip. Some estimates quote up to 90% of the silicon die is dedicated to routing[7].

2.2.3 Embedded memory

Many FPGA designs require some kind of fast memory for temporary storage of intermediate results, data buffers and similar data. For this reason, the chip contains embedded memory blocks. These are hardened SRAM memory units, usually configurable for different memory sizes, data widths or single/dual port access.

2.2.4 Embedded multipliers

Since FPGAs are well suited for digital signal processing (DSP), most device families contain hardened multipliers. This provides the designer with optimized blocks with higher performance (fmax) than soft (in logic) implementations and also frees up logic resources, which would otherwise be needed to implement the multiplier function. The DSP blocks are usually fixed point; newer and high end FPGA families implement hardened floating-point-optimized components[6].

2.2.5 Development software

An important part of FPGA development is the design software. This software package provides the designer with an interface to all FPGA design stages, from design entry to programming of the configuration memory. This software is responsible for mapping the user design onto the selected physical device and its structure while meeting the user requirements for design timing (timing constraints).

Contrary to the software world, where compilation times are relatively short and the iterative development cycle is fast, a larger FPGA design can take several hours to compile. The compiler must analyze the design, convert the algorithms into device-specific blocks and fit the resulting netlist into the selected device fabric. When the design uses a large portion of the device resources or has high requirements for maximum frequency, this is a computing challenge even for modern processors (an Intel Sandy Bridge CPU compiles the design described in this work in 9 minutes, although on a single core; and this is a relatively small design). For this reason, appropriate hardware is necessary for the development.

Figure 2.2: Example FPGA development software[20]

2.3 Future possibilities of FPGAs

Currently, the fastest performing FPGA is probably a member of the Speedster22 device family from Achronix[8]. Since the major performance limiting factor in current FPGAs is the interconnect, the Achronix device avoids this bottleneck by time multiplexing the routing resources. By doing this, the Speedster22i device is capable of providing 1.5 GHz peak processing performance. Since today's high end is tomorrow's low end in the semiconductor industry, we may see a rapid increase in the processing power of even low cost FPGAs in the coming years.

The discovery of the memristor[9] may be an important step towards developing a new generation of FPGAs. HP is currently developing a memristor based FPGA[10]. The standard PC architecture may also include elements of FPGA fabric in the future or be entirely replaced by programmable logic. This is signified by the Intel Stellarton[11] CPU, which includes an Intel Atom processor together with an Altera Arria II FPGA die in a single package. The FPGA is currently used as an H264 encoding accelerator.


2.4 Video processing on an FPGA

Processing a video stream usually involves operations on either the video signal timing or on the raw bitmap data of individual frames or fields. The FPGA architecture is well suited for video processing for the following reasons:

· Video timing generation is relatively straightforward with an FPGA. Even the logic fabric of low cost FPGA families is usually capable of supporting 150+ MHz IP components, therefore allowing generation of HD resolutions.

· Processing the raw frame data can take advantage of the hardened DSP blocks to ease the timing requirements for the logic fabric itself. Together with pipelining the individual algorithm operations, this allows the design of complex video processing paths even with HD resolutions.

· By being "close to metal", the algorithms on an FPGA can be more effective in terms of power than systems using a CPU core to perform the processing functions.

· Due to the FPGA flexibility, the video processing path can be tailored to specific project requirements.

· The flexibility of the FPGA architecture may prove useful for small production series, where the development costs of an ASIC solution may be prohibitive.

For these reasons, the processing functions required for the project described in this work were implemented on an FPGA.
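To see why a logic fabric running at roughly 150 MHz is enough for HD video timing, the pixel clock of a format can be estimated from its total raster size (active picture plus blanking) and frame rate. The short Python sketch below does this; the total raster sizes of 2200 x 1125 for 1080p60 and 1650 x 750 for 720p60 are the commonly used HD timings and are assumptions of this example, not figures quoted in this work.

```python
def pixel_clock_hz(total_width, total_height, frame_rate):
    """Pixel clock needed to scan one full raster (including blanking)
    `frame_rate` times per second."""
    return total_width * total_height * frame_rate


# Total raster sizes below are assumed from the commonly used HD timings.
clock_1080p60 = pixel_clock_hz(2200, 1125, 60)
clock_720p60 = pixel_clock_hz(1650, 750, 60)

print(clock_1080p60 / 1e6, "MHz")  # 148.5 MHz
print(clock_720p60 / 1e6, "MHz")   # 74.25 MHz
```

Under these assumptions, a 1080p60 stream needs a pixel clock just under 150 MHz when one pixel is processed per clock cycle.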


Chapter 3

Broadcast video transport standards

Today, with a few exceptions (e.g. the VGA interface), video signal representation has transitioned from the analog to the digital domain. The most obvious advantage of digital representation over analog is that the video data is not in any way altered by the transmission. With analog representation, this was not possible due to effects like noise and line losses, which in most cases corrupted the transmitted information.

Regardless of the selected video interface standard, video data is divided into discrete images called frames. A frame is a bitmap image, transferred over the transport interface from top to bottom line by line, with each image line being transmitted from left to right. Therefore, the transmission of a frame starts with the top left pixel and ends with the bottom right pixel. The rate at which the video frames are transferred is called the frame rate.


The video format can be either progressive or interlaced. In a progressive video stream, a frame is transferred as a whole, meaning it is a complete representation of the video image at one point in time. In an interlaced stream, frames are divided into halves called fields. Fields can be either odd or even, where the odd field contains the odd lines of the frame and the even field contains the even lines. When the stream is transferred as interlaced video, the motion appears smoother because this format effectively doubles the temporal resolution of the stream (compared to a progressive stream with the same resolution and bandwidth).
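The relation between a frame and its two fields can be illustrated with a few lines of Python. This is a minimal sketch, assuming a frame is represented as a list of lines, that lines are numbered from 1 (so the first line belongs to the odd field) and that the frame has an even number of lines; the function names are hypothetical.

```python
def split_fields(frame):
    """Split a frame (a list of lines) into (odd_field, even_field).

    Lines are counted from 1, so list index 0 is line 1 (odd).
    """
    odd_field = frame[0::2]   # lines 1, 3, 5, ...
    even_field = frame[1::2]  # lines 2, 4, 6, ...
    return odd_field, even_field


def weave_fields(odd_field, even_field):
    """Rebuild a full frame by interleaving the two fields line by line."""
    frame = []
    for odd_line, even_line in zip(odd_field, even_field):
        frame.append(odd_line)
        frame.append(even_line)
    return frame


frame = ["line1", "line2", "line3", "line4"]
odd, even = split_fields(frame)
```

Each field carries half of the lines, which is why an interlaced stream can deliver twice as many images per second as a progressive stream of the same bandwidth.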

Figure 3.2: Interlaced frame structure[12]

The video data represent the scene in some predefined color space. The most commonly used color spaces are RGB and YCbCr. In the RGB color space, a pixel has red, green and blue components to identify its color. The RGB standard is widely used in the PC industry for video data representation and as the graphics card output format. In the YCbCr color space, a pixel has luminance (brightness) and chrominance (color) coordinates to identify the color. Conversion between these color spaces can range from straightforward to fairly complex, depending on the requested conversion quality.
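A straightforward variant of such a conversion is a fixed matrix multiplication per pixel. The Python sketch below uses the full-range BT.601 coefficients as used in JFIF; the choice of this particular matrix and range is an assumption made for illustration, and it is not necessarily the conversion used in the system described in this work.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr.

    Uses full-range BT.601 (JFIF) coefficients, an assumption of this
    sketch; broadcast equipment typically uses limited-range variants.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round to integers and clamp to the 8-bit range.
    return tuple(min(255, max(0, round(value))) for value in (y, cb, cr))


white = rgb_to_ycbcr(255, 255, 255)
black = rgb_to_ycbcr(0, 0, 0)
```

Gray pixels (equal R, G and B) map to the neutral chrominance value of 128 on both Cb and Cr, so only the luminance changes with brightness.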

The horizontal and vertical resolution of the frame, frame rate, color space and progressive/interlaced identifier together form a video format. Video formats are standardized by organizations such as VESA[16] or SMPTE[17].

This chapter gives an overview of video transport standards used for video input and output of the presented video processing system.


Figure 3.3: YCbCr color space at 0.5 luminance[13]

3.1 Parallel digital data

The representation of video data as a parallel clocked bus is most common when connecting different integrated circuits on a printed circuit board. The bus contains a master clock signal, horizontal and vertical synchronization signals, an active picture indicator (data valid signal), a field identifier for interlaced formats and the video data itself. This format with separate horizontal and vertical synchronization is the most commonly used, probably for its universality.

Although embedded synchronization can be used (synchronization signals are not separate wires but are embedded as special sequences directly in the video data), this may cause design complications when using video processing ICs which each expect differing embedded synchronization sequences because of differing standards (e.g. BT656 vs BT1120).

The parallel transmission format requires that the appropriate individual bit wires have their lengths closely matched to each other to ensure that the pixel wavefront is properly aligned at the receiver side. With today’s high resolutions and therefore high pixel clock rates, this data format may also cause problems with signal crosstalk or reflections from impedance differences, therefore it is a good practice to use some kind of termination at both the transmitter and receiver sides.


3.2 Serial digital interface (SDI)

Serial digital interface[18] is a video transport standard used mainly in the broadcast and medical industries. It uses shielded coaxial cable as a medium and allows for transfer rates from 270 Mbit/s to 3 Gbit/s. It can be thought of as a serial encapsulation of parallel digital data. On the transmitting side, the data is serialized to a high speed serial form and on the receiving side the data is deserialized back to the parallel format.

Figure 3.4: An example of SDI connector[14]

SDI uses the NRZI encoding scheme to encode data and a linear feedback shift register to scramble the data to control bit disparity. The video stream can also include CRC (Cyclic Redundancy Check) checksums to verify that the transmission occurred without an error.
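The two line coding steps mentioned above can be modelled at bit level. The following Python sketch is illustrative only: it assumes the self-synchronizing scrambler polynomial x^9 + x^4 + 1 that is commonly cited for SDI, an all-zero initial state and NRZI in which a 1 bit is sent as a level transition. The function names are hypothetical and the code does not model the real serializer hardware.

```python
def scramble(bits):
    """Self-synchronizing scrambler, assumed polynomial x^9 + x^4 + 1."""
    history = [0] * 9  # previous scrambled bits, most recent first
    out = []
    for bit in bits:
        scrambled = bit ^ history[3] ^ history[8]  # delays of 4 and 9 bits
        out.append(scrambled)
        history = [scrambled] + history[:-1]
    return out


def descramble(bits):
    """Inverse of scramble(); depends only on the received bits."""
    history = [0] * 9  # previous received bits, most recent first
    out = []
    for bit in bits:
        out.append(bit ^ history[3] ^ history[8])
        history = [bit] + history[:-1]
    return out


def nrzi_encode(bits, initial_level=0):
    """NRZI: a 1 toggles the line level, a 0 keeps it unchanged."""
    level = initial_level
    levels = []
    for bit in bits:
        level ^= bit
        levels.append(level)
    return levels


def nrzi_decode(levels, initial_level=0):
    """Recover bits by detecting level transitions."""
    previous = initial_level
    bits = []
    for level in levels:
        bits.append(level ^ previous)
        previous = level
    return bits


payload = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
line = nrzi_encode(scramble(payload))
recovered = descramble(nrzi_decode(line))
```

Because the descrambler uses only received bits as its state, it resynchronizes by itself after nine bits, which is one reason this style of scrambler suits a continuous serial stream.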

3.3 Digital Visual Interface (DVI)

DVI is an interface for transferring digital video and is frequently used in the PC industry. The interface uses TMDS (Transition Minimized Differential Signaling) to transfer data over four twisted pairs of wires (three for data and one for clock). Because this interface is frequently used to connect a graphics port of a computer to a display, DVI also includes support data channels to allow the computer to identify the device being connected. This interface is called EDID and is basically a serial EEPROM with information about the display vendor and supported resolutions.

This interface can also be thought of as a serial encapsulation of parallel data, but compared to SDI it uses three serial data channels to transport the data. This reduces the bandwidth requirements for a single serial channel and therefore reduces the quality requirements for the cabling used.


Chapter 4

Project requirements

This chapter describes the various requirements for the processing hardware. The device using the FPGA video processor is to be used in a medical environment for displaying live video from endoscopic cameras during surgeries. The system also has to be able to record the video and store the feed either locally or via network, but these functions are handled by a standard x86 system embedded in the device and as such are not the topic of this work.

4.1 Video deinterlacing

Based on customer requirements, the video processor must handle two input video formats, one progressive and one interlaced. This requirement comes from the fact that the system will usually be delivered with an HD camera which has two settings for output video resolution, 720p and 1080i. Since the customer wants to be able to use a standard monitor (most of which do not handle interlaced video timings very well), the 1080i interlaced video must be internally converted to 1080p. This video format can be displayed on a standard monitor with no timing problems.

4.2 Low latency

The system is to be used for live video display during surgical operations. The device processes the video signal from the endoscope, which is then output on a monitor. The surgeon navigates by the displayed video image and so the processing delay must be as small as possible. If the delay were too large, the surgeon would see the position of the operating tool only after already performing a critical intervention on the patient, which would be hazardous.

4.3 On-screen display generation

When displaying live video from the endoscopic camera, the system also has to mix some additional information into the picture. This information includes the patient name, system settings, buttons for touch controls if the attached monitor has a touch panel and an indicator of free space available for the recorded video.


4.4 Video stream switching

One of the features that the customer requested was the ability to display both the live video feed and an administrative GUI application running on the system on a single monitor. From this stems the requirement to switch between two video streams seamlessly, so that the attached monitor does not have to resynchronize to a new timing, as it would if the transition were made by a simple switch.

4.5 Image capture

The system must be able to take snapshots of the displayed video feed. This could be handled by the embedded x86 system in a similar way as the video recording. However, the customer also requested that the captured image be frozen on the screen for a few seconds so that the surgeon can see what the picture is, and it was therefore decided that this function would be handled by the hardware.


Chapter 5

Device family selection

This chapter discusses the selection of FPGA device family to realize the required functions of the system. After preliminary tests of video processing components on a separate board developed for said testing, it was concluded that even low cost FPGA families from major manufacturers were sufficient to implement Full HD video processing. Based on this conclusion, the family selection was limited to low cost field programmable gate arrays.

FPGA families are usually divided into several generations. Each generation contains devices of varying sizes and features, and each device is manufactured in various packages and speed grades.

5.1 Design requirements

The project requirements described in the previous chapter were translated into design requirements for the FPGA chip performance and the required peripheral functions.

Since the design would most likely require a frame buffer component, some form of large temporary memory was needed to store the incoming video frames. It was decided that the system would use DDR2 memory for its relatively low cost and sufficient performance. Based on the incoming video formats specified by the customer, the required memory bandwidth for the frame buffer was estimated (in bytes per second):

$$1920\,(\text{width}) \times 540\,(\text{height}) \times 60\,(\text{fps}) \times 4\,(\text{Bpp}) \times 2\,(\text{R+W}) = 497\,664\,000\ \text{B/s} \approx 474.6\ \text{MiB/s}$$

Including a margin for read/write bank switching and memory refresh cycles, it was concluded that a single DDR2 x16 chip fulfills this bandwidth requirement, since (in bytes per second):

$$2\,(\text{data width}) \times 2\,(\text{effective per clock}) \times 200\,000\,000\,(\text{frequency}) = 800\,000\,000\ \text{B/s} \approx 763\ \text{MiB/s}$$

Therefore, the target device must be able to instantiate a DDR2 x16 memory controller core to interface to the external DDR2 x16 memory chip.
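The bandwidth estimate can be reproduced with a short calculation. The Python sketch below uses the figures from the text (1920 x 540 pixels per field, 60 fields per second, 4 bytes per pixel, one write and one read of every pixel, a 2-byte wide memory at 200 MHz with two transfers per clock); the function names are hypothetical and the result is a theoretical peak that ignores bank switching and refresh overhead.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def frame_buffer_bandwidth(width, height, rate, bytes_per_pixel, accesses=2):
    """Bytes per second needed to write and read back every pixel."""
    return width * height * rate * bytes_per_pixel * accesses


def ddr_peak_bandwidth(bus_width_bytes, clock_hz, transfers_per_clock=2):
    """Theoretical peak bytes per second of a double data rate memory."""
    return bus_width_bytes * transfers_per_clock * clock_hz


required = frame_buffer_bandwidth(1920, 540, 60, 4)
available = ddr_peak_bandwidth(2, 200_000_000)

print(required, "B/s =", round(required / MIB, 1), "MiB/s")
print(available, "B/s =", round(available / MIB, 1), "MiB/s")
print("utilisation:", round(100 * required / available, 1), "%")
```

With these inputs the frame buffer needs roughly 62 % of the theoretical peak bandwidth, which leaves the margin for overhead mentioned in the text.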

The total number of pins required was estimated to be in the range of 150 to 180. This included two video inputs, the USB link connection, the DDR2 memory interface and support I/O functions of the FPGA.


The maximum frequency required for any part of the design was estimated to be 150 MHz to 180 MHz for the most demanding components, namely the DDR2 memory interface and the deinterlacer module.

The selection of the FPGA device family was based on these requirements together with a preference for wide availability and good online support.

5.2 Altera Cyclone family

Altera manufactures low cost FPGA chips under the Cyclone family name. This family includes devices from 3000 logic elements (LEs) to about 150000 LEs. The FPGA chips of this family also contain up to several megabits of embedded memory blocks and multipliers for DSP processing, and are offered in a range of package sizes and pin counts. The Cyclone family supports the instantiation of DRAM memory controllers.

Figure 5.1: Altera Cyclone IV family logo[5]

The Cyclone family is currently divided into four generations, Cyclone I to Cyclone IV (at the time of writing of this work, the Cyclone V family has been announced by Altera with samples available, but mass production of this family is planned for 2012). These generations differ in power consumption, densities, supported peripheral features and the maximum frequency the logic fabric of the device is able to support for a given HDL design.

The family generation selection was reduced to Cyclone III and Cyclone IV. These families are more advanced than the I/II generations and, due to advances in lithographic processes, are cheaper and have better availability. Also, since Cyclone IV is basically a "shrink" of Cyclone III, the conversion of a given design between these families is a relatively simple task.

5.3 Xilinx Spartan family

The other major manufacturer of Field Programmable Gate Arrays, Xilinx Inc., offers device families with features similar to those of Altera. The Xilinx version is branded under the name Xilinx Spartan.

The Spartan devices are also divided into device generations based on advancements in FPGA design. The device families considered were Spartan-6 and Spartan-3 due to the relatively large community support for designs based on these devices. The FPGA chips from the Spartan-6 device family include hardened memory controller blocks for interfacing an external DRAM memory chip.


5.4 Lattice Semiconductor Corporation

Lattice Semiconductor is the third largest FPGA manufacturer. Although it was also taken into consideration, its devices were not evaluated further because of a perceived lack of good online support.

5.5 Final selection

The device family selected to implement the requested functions of the system was Altera Cyclone III/IV. This decision was influenced by several factors.

The low cost FPGA devices from Altera are on par with the low cost devices offered by Xilinx when comparing features like price, performance, capabilities and package options.

Since the selected manufacturer will probably also be used in future projects requiring some form of FPGA processing, the availability of IP cores was taken into account. Since the company is trying to enter the medical video processing market, it is necessary to have video processing cores available. Although many exist for the Xilinx devices, Altera offers a complete package for video processing, the Altera Video and Image Processing Suite (VIP)[12].

The FPGA development environments of both manufacturers were evaluated, the Altera Quartus II and the Xilinx ISE Design Suite. It was concluded that the Altera Quartus II is the better solution, because it integrates all required functions (design entry, compilation, simulation, programming) into one package. Also taken into account was the wide availability of cores adhering to the Altera Avalon Interconnect Fabric standard, which together with the SOPC Builder software simplifies system design.

To provide a complete and realistic overview of the reasons influencing this decision, it must also be noted that one of the reasons tipping the selection in Altera's favor was the author's familiarity with devices of this manufacturer from lectures at FI MU.


Chapter 6

Evaluation of commercial IP cores from Altera

Before designing the final hardware board to be used in the device, we designed an evaluation platform to test the video processing functions inside the FPGA and the interface chips used to convert the different video transmission standards to and from the FPGA input/output format.

The evaluation board included a Cyclone III FPGA with 40k logic elements in the fastest speed grade available (EP3C40F484C6N). The FPGA had two DDR2 memory channels available, each consisting of two 16-bit DDR2 memory chips. The FPGA was connected to SDI input interface chips, a DVI (TMDS) receiver, an output DVI transmitter, a USB communication bridge to allow for PC connection and other support ICs.

On this board we evaluated the relevant IP cores to be used throughout the project and later developed our own intellectual property components.

6.1 Video and Image Processing Suite (VIP)

The first components to be evaluated were those from the Altera Video and Image Processing Suite. We were mainly interested in the Deinterlacer and Switch IP cores. The VIP cores can be instantiated either standalone or inside an SOPC system. With both approaches, the user is offered a MegaWizard configuration interface to select the required core functionality.

We used the core in the "Bob - Scanline Interpolation" mode. This deinterlacing method adds lines to each field by calculating the missing odd/even lines of the field. We selected this mode because this interpolation method produces a relatively clean image with no visible artifacts from merging two fields (such as the Weave method produces), and because the algorithm buffers only a few lines of the image and therefore adds very little delay.
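The principle of scanline interpolation can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used by the Altera core: it assumes single-component pixels stored as integers, averages the two neighbouring field lines with rounding, and repeats the edge line where no second neighbour exists. The function name and the `top_field` parameter are hypothetical.

```python
def bob_interpolate(field, top_field=True):
    """Build a full-height frame from one field by line interpolation.

    `field` is a list of lines, each a list of integer pixel values.
    Missing lines are the rounded average of the field lines above and
    below; at the picture edge the nearest field line is repeated.
    """
    def average(line_a, line_b):
        return [(a + b + 1) // 2 for a, b in zip(line_a, line_b)]

    frame = []
    for index, line in enumerate(field):
        if top_field:
            # The field holds the upper line of each pair.
            below = field[index + 1] if index + 1 < len(field) else line
            frame.append(list(line))
            frame.append(average(line, below))
        else:
            # The field holds the lower line of each pair.
            above = field[index - 1] if index > 0 else line
            frame.append(average(above, line))
            frame.append(list(line))
    return frame


field = [[0, 0], [100, 200]]
frame_from_top = bob_interpolate(field, top_field=True)
frame_from_bottom = bob_interpolate(field, top_field=False)
```

Only the current and one neighbouring field line are needed at any time, which matches the low buffering and low delay noted above.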

The visual quality of the processed video feed was found acceptable for the project, but the stability of the IP generation was unsatisfactory. For visual quality testing, we used the IP core version integrated in the Quartus II package version 9.0. This version performed with no problems. When we switched to Quartus II version 9.1, we could not compile any design containing the core. While the system was being generated (during the parametrization of the core by the configurator), the Quartus IDE crashed and was not able to recover. Since we had to use version 9.1 (because it included a Switch component which we needed and which was not included in the 9.0 version IP library), we had to abandon the deinterlacer component provided by Altera and develop our own solution.

At the time of writing this thesis, designs containing the deinterlacer core compile without any problems. This incident illustrates that the FPGA development toolchain is rather complex software and should be thoroughly evaluated before it is used in a design.

Figure 6.1: Example of the MegaWizard configuration interface[20]

6.2 DDR2 High Performance Controller II

We needed some form of large temporary storage memory to hold the video data when synchronizing two video streams (frame buffer) and to store the captured image for the still image capture function of the system.

We decided to use DDR2 memory because it is the newest DDRx standard electrically supported on the Cyclone II/IV device family architecture. The evaluation board had two DDR2 memory channels, each consisting of two 16-bit DDR2 chips. For each channel this meant an effective transfer size of 64 bits per clock and a smallest transfer size of 128 bits, considering the minimal DDR2 memory side burst size of 4 beats according to the JEDEC specification.
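The transfer sizes quoted above follow from simple arithmetic for one memory channel. The symbols below are introduced here only for illustration.

```latex
\begin{aligned}
w_{bus}   &= 2 \times 16\ \text{bit} = 32\ \text{bit} \\
w_{clock} &= 2 \times w_{bus} = 64\ \text{bit (two transfers per clock cycle)} \\
w_{burst} &= 4 \times w_{bus} = 128\ \text{bit (burst of 4 beats)}
\end{aligned}
```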


Regarding the memory access pattern, we needed to read and write sequential areas of memory and therefore did not need the short memory side burst lengths of DDR I memory, which could be more appropriate for other algorithms such as realtime image rotation.

We tested the memory controller core by running the "memtest" example included in the Nios II Embedded Design Suite. The tests passed with no problems and we therefore decided to use this core.

6.3 Nios II soft processor

To control the FPGA hardware, a soft-core processor was needed. Altera provides the Nios II 32-bit embedded processor for use on its devices.

The processor core can be configured in three versions: Economy, Standard and Fast. Since we did not need any video processing functions done on the CPU, we could use the Economy version of the core.

The JTAG debugging and communication feature of the Nios II EDS development software proved very handy when debugging the system later in the project.


Chapter 7

Selected system structure

This chapter describes the resulting internal structure of the FPGA video processor. This setup emerged after several design iterations.

The structure took shape after considering the project requirements described above. It was necessary to display both the video feed from the endoscopic camera and the administrative GUI application running on the internal x86 system; therefore the FPGA has two video inputs. The output video must be displayed, so the FPGA has a video output. We needed to communicate with the PC for system control and captured image transfer; for this reason the FPGA is connected to a USB FIFO bridge. To provide storage space for triple buffering and captured image storage, the FPGA has a DDR2 memory attached.

The design was created using the Quartus II IDE. A schematic file was selected as the top-level entity to provide a clear way of showing the system structure inside the Quartus project. Compared to an HDL top-level entity such as a Verilog or VHDL file, the schematic quickly shows the hardware designer how the individual blocks are connected.

The block diagram and individual components of the system are discussed below. The components are covered only briefly; the three components comprising the core of this work are described in detail in a separate chapter.

7.1 Block diagram

The system block diagram in Figure 7.1 shows only the components relevant to the video processing paths. Supporting components such as clock domain crossing, external support signals for the PCIe grabber, video resolution detectors, color space conversion cores etc. are not shown to maintain clarity.

The system takes two video streams as input, processes them and produces a single video feed as output. The video inputs are the camera input and the PC video input. The timing of the PC video input is taken as the reference timing, onto which the camera video feed is synchronized and blended using the frame buffer component. The frame buffer is also connected to the USB link, providing a way to "dump" the contents of a memory location containing a captured image to a PC.

One video processing path is the camera feed, the other is the PC video feed. The camera video has to be synchronized to the PC video signal timing and, if it is in an interlaced format, it must also be deinterlaced. Placing the deinterlacer after the frame buffer component saves memory bandwidth, since only the half fields need to be buffered and the final full frame is calculated by the deinterlacer after the synchronization phase. This also means that the images transferred to the host x86 system are half fields (for the 1080i interlaced input video format) and have to be stretched to the original aspect ratio. This solution was found to be acceptable, since there is no visible reduction in the quality of the captured image.

[Block diagram: Camera video input -> Frame buffer (with USB link) -> Deinterlacer -> Alpha blender -> Video output; PC video input -> Alpha blender]

Figure 7.1: Final video processing system block diagram

7.2 Camera video input

The component providing the video input to the system is designed as an Avalon Interface Fabric compatible component. The parallel input video data from the external SDI receiver chip are converted from the YCbCr color space into the RGB color space using a simple pipelined calculation, and the data are then fed into a dual clock FIFO (a standard component provided by Altera). This effectively converts the data from the exact-timing format used for display into a data stream suitable for internal processing. The remainder of the component is an Avalon Memory Mapped Master externally controlled by the Nios II CPU. The master component can be thought of as a DMA engine which transfers a video frame into the DDR2 storage memory using long Avalon side bursts.
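A pipelined color space conversion of this kind could look roughly like the sketch below. It is a minimal illustration and not the implementation used in the system: the module and signal names are hypothetical, and the coefficients are an assumption (ITU-R BT.709 with limited-range 8-bit input, scaled by 256), since the text does not state which conversion matrix was used.

```verilog
// Hypothetical 3-stage YCbCr -> RGB pipeline (names and coefficients assumed).
// Coefficients: BT.709, limited-range input, fixed point scaled by 256.
module ycbcr_to_rgb_sketch (
    input  wire       clk,
    input  wire [7:0] y,
    input  wire [7:0] cb,
    input  wire [7:0] cr,
    output reg  [7:0] r,
    output reg  [7:0] g,
    output reg  [7:0] b
);
    reg signed [9:0]  y_s, cb_s, cr_s;       // stage 1: offsets removed
    reg signed [20:0] r_acc, g_acc, b_acc;   // stage 2: sums scaled by 256

    // Saturate a value scaled by 256 to the 8-bit range 0..255.
    function [7:0] clamp8;
        input signed [20:0] v;
        begin
            if (v < 0)               clamp8 = 8'd0;
            else if (v > 21'sd65535) clamp8 = 8'd255;
            else                     clamp8 = v[15:8];
        end
    endfunction

    always @(posedge clk) begin
        // Stage 1: subtract the black level (16) and the chroma midpoint (128).
        y_s  <= $signed({2'b00, y})  - 10'sd16;
        cb_s <= $signed({2'b00, cb}) - 10'sd128;
        cr_s <= $signed({2'b00, cr}) - 10'sd128;

        // Stage 2: multiply-accumulate with the scaled coefficients.
        r_acc <= 298 * y_s + 459 * cr_s;
        g_acc <= 298 * y_s -  55 * cb_s - 136 * cr_s;
        b_acc <= 298 * y_s + 541 * cb_s;

        // Stage 3: divide by 256 and saturate.
        r <= clamp8(r_acc);
        g <= clamp8(g_acc);
        b <= clamp8(b_acc);
    end
endmodule
```

Splitting the calculation into three registered stages keeps each combinational path short, at the cost of three clock cycles of latency that the accompanying timing signals would have to be delayed by.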

To relax the frequency requirements for the core logic fabric, the width of the bus from the FIFO to the memory controller IP is set to 64 bits. This effectively halves the frequency at which the bus must run to transfer the data and therefore eases the fitter effort needed to reach timing closure.

This component is displayed as standalone in the block diagram but is effectively part of the frame buffer subsystem.

7.3 Frame buffer

The frame buffer subsystem provides the means to synchronize the camera video feed to that of the PC. This introduces a delay (of at most one half field), which could be avoided by synchronizing the PC video feed to that of the camera instead. However, we assumed that since the PC video feed comes from the host x86 system inside the device, it is more reliable and "under control" than the unknown camera signal from outside the system and is therefore better suited as the reference timing signal.

The frame buffer uses a standard triple buffering scheduling algorithm, where one buffer is always available to save an incoming video frame. This provides the means to synchronize the two video streams, since the frame buffer scheduler can either drop or repeat a field to match the required timing.

Together with writing the raw image data to the DDR2 memory, the scheduler also registers whether the currently transferred field is odd or even. The scheduler has the field signal available from the external SDI receiver chip. This information is later used to properly configure the deinterlacer block at the output of the frame buffer.

The frame buffer then includes an output component which reads a stored field from memory and outputs the data into an input dual clock FIFO of the deinterlacer component. The frame buffer component is described in detail in a separate chapter.

7.4 USB link

The frame buffer subsystem also contains a USB link component on the Avalon Interconnect Fabric. This provides the capability of the system to transfer the stored image data to the PC using a USB FIFO bridge (FT2232H from FTDI[21]). The size of a single half field in 1080i video format is about 4 megabytes and is transferred in under two seconds.
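The quoted field size is consistent with the following estimate. It assumes (this is not stated explicitly) that each pixel occupies a 32-bit memory word, and it uses the 1920 x 540 active pixels of a 1080i field.

```latex
\begin{aligned}
1920 \times 540 &= 1\,036\,800\ \text{pixels per field} \\
1\,036\,800 \times 4\ \text{B} &= 4\,147\,200\ \text{B} \approx 4\ \text{MB} \\
4\,147\,200\ \text{B} \,/\, 2\ \text{s} &\approx 2.1\ \text{MB/s}
\end{aligned}
```

The last line is the minimum sustained throughput implied by a transfer time of under two seconds.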


The USB interface IC has two channels: one is configured for the RS232 standard and is used for FPGA system control, the other is a one-way communication link to the PC for captured still image transfers. The control channel is connected to a UART component of the controlling SOPC system with the Nios II soft-core processor.

7.5 Deinterlacer

The deinterlacer component is fed the video data by the frame buffer component. This video data is deinterlaced (if requested) into a full frame and output to the alpha blender component. The deinterlacer core is described in a separate chapter.

7.6 PC video input

The video data from the host x86 system is fed into the FPGA using an external DVI receiver chip. The data passes into the alpha blender, where it is mixed with the video feed from the camera and output from the FPGA into an external DVI transmitter IC.

7.7 Alpha blender

The alpha blender IP core takes the two video streams and mixes them together using an alpha value provided by the Nios II controller system. The alpha value is controllable from the PC over the USB link and therefore allows for video stream switching.

The alpha blender includes a transparent color definition. When the alpha blender encounters a pixel with this color in the video data of the host x86 system, it displays the original pixel color from the camera feed regardless of the alpha setting. This provides the overlay function known from the PC world. It was implemented to allow the system not only to blend the two streams together, but also to generate an on-screen display (OSD). The transparent color definition allows the original camera video data to be displayed with a non-transparent OSD mixed on top of this feed. The alpha blender component is also described in more detail in a separate chapter.


Chapter 8

Example video processing cores

This chapter describes the IP cores developed to provide the video processing functions of the system, as required by the project requirements. All the cores were written in Verilog HDL. Compared to VHDL, Verilog was perceived as more readable and "developer friendly".

The cores process the video data in a stream format: the input components of the processing chain convert the video data from the exact timing format to a stream format, stripping the video data of the synchronization information and forwarding only the active picture data.

8.1 Deinterlacer

Deinterlacing is used to convert an interlaced video format to a progressive one. In an interlaced video stream, each complete frame is transferred as two half fields, odd and even. The odd field contains the odd picture lines and the even field contains the even picture lines. By splitting the complete frame into two half fields, the temporal resolution of the video feed is doubled and motion appears smoother.

Progressive video format transfers frames as complete units, each frame containing all (odd and even) lines of the picture. Progressive video does not have the same temporal resolution as interlaced video with the same bandwidth; on the other hand, it offers better vertical resolution and therefore a more detailed image.

Interlaced format is commonly used in broadcast applications and the TV industry, whereas progressive format is more common in the PC industry.

8.1.1 Algorithm overview

A system converting an interlaced video signal to progressive is called a deinterlacer. There are several methods of accomplishing the conversion:

· Bob - line duplication
· Bob - line interpolation
· Weave
· Motion adaptive algorithms

"Bob" is a name given to algorithms needing only one half field to produce a complete progressive frame. The individual approaches are described below.

Line duplication

The line duplication algorithm simply takes the input line and produces two lines on output, each the same as the image line on the deinterlacer input. This is the simplest deinterlacing algorithm, but also the one with the lowest output image quality. Since the half field lines are duplicated, the output progressive image appears pixelated in the vertical direction. This is especially visible on sharp, highly contrasting edges in the image.

Figure 8.1: Output of the line duplication algorithm[23]

Line interpolation

Line interpolation algorithm does not replicate the missing lines, but instead it calculates the missing line from the line above and below the missing one. This produces a complete frame from a single half field, with the quality of the output image better than the line duplication algorithm. The most visible improvement is that the sharp contrasting edges appear more smooth thanks to the interpolated lines.
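As a compact restatement, let f_k denote line k of the received half field (N lines in total) and p_j line j of the reconstructed frame. For a field carrying the even-numbered frame lines, line interpolation can be written as below. The indexing and the treatment of the last line (simple repetition) are assumptions made for this illustration.

```latex
\begin{aligned}
p_{2k}   &= f_k,                                       & k &= 0, \dots, N-1 \\
p_{2k+1} &= \tfrac{1}{2}\left(f_k + f_{k+1}\right),    & k &= 0, \dots, N-2 \\
p_{2N-1} &= f_{N-1}
\end{aligned}
```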

Weave algorithm

The weave algorithm uses two half fields to produce a progressive output frame. The method works by merging the odd and even fields directly into one frame.


Figure 8.2: Output of the line interpolation algorithm[23]

Compared to the Bob algorithms, this method needs a storage memory to temporarily store the half field data. This also introduces a half field delay to the processing chain, since the deinterlacer must wait for a complete field before it can produce an output progressive frame. The output quality of this algorithm is compromised by artifacts on edges in the resulting progressive image: since the fields used to produce the output originate at different points in time, edges appear distorted in scenes with fast movement, because each field captures the moving object in a different position.

Motion adaptive algorithm

Motion adaptive algorithms try to predict the areas of the image with movement and compensate for the motion by calculating the final progressive frame from several preceding half fields of the interlaced input. In addition to requiring storage for the preceding half fields, this algorithm also introduces delay to the video processing chain. This delay depends on the specific motion adaptive algorithm used.

8.1.2 Algorithm selection

After testing the above-mentioned algorithms on a development board, we selected the line interpolation algorithm. The quality of the output image was found acceptable, since the edges appear smooth and there are no "saw tooth" artifacts as is the case with the weave algorithm. Also, since this method does not need any preceding half fields to produce an output frame, the latency introduced into the processing chain is very small, typically the duration of a single image line.


Figure 8.3: Output of the weave algorithm[23]

8.1.3 Principle of operation

The deinterlacer core uses the line interpolation algorithm to convert the input interlaced video to the progressive output format. The input data in stream format are fed to the core by the frame buffer component. The output of the core is connected to the alpha blender core, where it is mixed with a second video feed and output to the external DVI transmitter chip.

The core is reset with the beginning of each input field. After the reset signal is deasserted, the core detects whether the current field is odd or even and also registers the video format resolution as detected by the preceding components of the video processing chain. The core uses two counters, x and y to store the actual position within the video frame. The core has three options as to what to do with each processed line:

· A - store the incoming line into the temporary buffer and, at the same time, output the line
· B - store the incoming line into the temporary buffer and, at the same time, output the average (interpolation) of the incoming line and the line already saved in the temporary buffer
· C - do not store anything, just output the line already stored in the temporary buffer

The decision between performing the action A, B or C is made by the core scheduler. This part of the deinterlacer keeps the current position in the video image and performs configuration of the remaining parts of the component at the beginning of each image line.


8.1.4 Implementation

The core is implemented as a schematic file instantiating the subentities designed in the Verilog hardware description language. The core also uses Altera-specific components included in the Quartus IP library.

The core has three main groups of virtual I/O pins exported to the higher level design file - video stream input, video stream output and control signals.

The video stream input pin group is composed of the signals fifo_data_input[63..0], fifo_rdempty_input and fifo_rdreq_output. These signals form an interface to the FIFO data buffer of the frame buffer component.

The video stream output pin group is composed of the pins out_rdreq, out_rdclk and out_data[23..0]. These signals provide the interface to the alpha blender component, which mixes the two streams to form the output video signal.

The remaining signals form the deinterlacer core control signals. The main signals in this group are clock, reset, video resolution and the deinterlacing enable signal, which can disable deinterlacing so that the 720p progressive input camera video format passes through unchanged.

The core scheduler is located in the deinterlacer controller submodule. This module controls the state transitions at the beginning of each image line as described in the previous section. The core scheduler algorithm is the following:

    if (can_advance == 1) begin
        x = x + 1;
        if (x == x_size) begin              // end of line reached
            x = 0;
            y = y + 1;
            if (field == 0) begin
                if (y == 0) master_state = 2;                           // action A
                if ((y >= 1) & (y < (y_size_minus_one)))
                    master_state[1:0] = {1'b0, y[0]};                   // B on odd y, C on even y
                if (y == (y_size_minus_one)) master_state = 0;          // action C
            end else begin
                if (y == 0) master_state = 2;                           // action A
                if (y == 1) master_state = 0;                           // action C
                if ((y >= 2) & (y < y_size))
                    master_state[1:0] = {1'b0, ~y[0]};                  // B on even y, C on odd y
            end
        end
    end


The variables x and y contain the current position within the video image data, field is the even/odd field indicator, master_state[1:0] indicates which of the actions A, B or C the deinterlacer should perform on the current line, and can_advance is a signal indicating that the remaining core components are ready for the next data item.

The deinterlacer_ram_buffer module is an Altera-specific instantiation of an embedded memory block forming a RAM that stores the image line. The address of this embedded RAM block is controlled by the scheduler; the deinterlacer_mem_addr_delay module delays the address signals for line operation B. Operation B means that the deinterlacer must store the incoming line to the RAM buffer and at the same time load data from the very same buffer. Therefore, the data must be read out of the buffer before the new image line data are saved to it.

The deinterlacer_line_switch module provides the switching between operations A, B and C as requested by the scheduler module. Operation A (master_state = 2) means that the data received from the frame buffer component is stored to the RAM buffer and at the same time routed through the deinterlacer_line_switch to the output FIFO. Operation B (master_state = 1) means that the incoming data is stored to the RAM buffer and at the same time the previous line data stored in the RAM buffer are read out and sent to the deinterlacer_line_switch, where the pixel data is averaged (interpolated) with the current line data and sent to the output FIFO. Operation C (master_state = 0) does not read the incoming pixel data but simply outputs the stored line from the RAM buffer to the output FIFO.
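The selection between the three operations can be sketched as a small multiplexer. This is a minimal illustration, not the actual module: the module name, the port names and the single-pixel 24-bit data path are assumptions (the real core handles two pixels per clock), and the averaged pixel is assumed to be computed elsewhere.

```verilog
// Hypothetical sketch of the A/B/C line switch (all names are illustrative).
module line_switch_sketch (
    input  wire [1:0]  master_state,  // 2 = A, 1 = B, 0 = C
    input  wire [23:0] pixel_in,      // pixel arriving from the frame buffer
    input  wire [23:0] pixel_stored,  // pixel read back from the line buffer
    input  wire [23:0] pixel_avg,     // average of pixel_in and pixel_stored
    output reg  [23:0] pixel_out,     // pixel sent to the output FIFO
    output wire        buffer_write   // high when the incoming line is stored
);
    // Operations A and B both store the incoming line in the line buffer.
    assign buffer_write = (master_state == 2'd2) || (master_state == 2'd1);

    always @(*) begin
        case (master_state)
            2'd2:    pixel_out = pixel_in;      // A: pass the line through
            2'd1:    pixel_out = pixel_avg;     // B: interpolated line
            default: pixel_out = pixel_stored;  // C: repeat the stored line
        endcase
    end
endmodule
```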

The remaining components of the deinterlacer core are mainly support functions which align the individual data and control signals to compensate for the latency of the respective communicating components.

To relax the requirements for the maximum frequency of the device logic fabric, the deinterlacer core processes two pixels at a time. This doubles the data bus width, but at the same time allows the operating frequency to be halved while maintaining the required bus bandwidth.
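The trade-off is the usual bandwidth identity: throughput equals clock frequency times the number of pixels handled per clock. As an illustrative figure only, the 148.5 MHz pixel rate of the 1080p format is used below; the actual core clock frequency is not stated here.

```latex
\begin{aligned}
B &= f_{clk} \cdot n_{pixels/clk} \\
148.5\ \text{Mpixel/s} &= 148.5\ \text{MHz} \times 1 = 74.25\ \text{MHz} \times 2
\end{aligned}
```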

The deinterlacer core expects the field data in a standard RGB color space, with every color component having an 8-bit value range (0..255).

The interpolation (vertical averaging) of the neighboring half field image lines is done by adding together the individual red, green and blue components of the pixel colors (the two pixels in the RAM buffer from the previous image line and the two pixels currently being received and stored to the RAM buffer) and then shifting the result right by one bit position, thereby calculating the arithmetic average of the two values.
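The add-and-shift averaging can be sketched for a single pixel pair as follows. The module and port names are hypothetical, and the byte order (red in the lowest byte) is an assumption taken from the blender listings later in this chapter; the actual core applies the same operation to two pixels per clock.

```verilog
// Hypothetical sketch: per-component average of two 24-bit RGB pixels.
module rgb_average_sketch (
    input  wire [23:0] pixel_above,  // pixel read back from the line buffer
    input  wire [23:0] pixel_below,  // pixel currently being received
    output wire [23:0] pixel_avg
);
    // 9-bit sums keep the carry out of the 8-bit additions.
    wire [8:0] sum_r = {1'b0, pixel_above[7:0]}   + {1'b0, pixel_below[7:0]};
    wire [8:0] sum_g = {1'b0, pixel_above[15:8]}  + {1'b0, pixel_below[15:8]};
    wire [8:0] sum_b = {1'b0, pixel_above[23:16]} + {1'b0, pixel_below[23:16]};

    // Dropping the lowest bit is a right shift by one: floor((a + b) / 2).
    assign pixel_avg = {sum_b[8:1], sum_g[8:1], sum_r[8:1]};
endmodule
```

Keeping the ninth bit matters: if the sum were truncated to 8 bits before the shift, two bright pixels would wrap around and average to a dark value.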


8.2 Alpha blender

Alpha blending is an image processing algorithm for mixing two images into one, with the option to select the transparency of individual picture elements. In video stream processing, the input images are formed by the active picture data of the individual video frames. The transparency is selected by the alpha channel, which defines a transparency value for each pixel. The range of the transparency value, 0.0 to 1.0, can be translated to an integer representation; for example, with 8-bit resolution the range is 0..255. The value 0 means that the first image is fully visible with no visual input from the second one, and vice versa.

The value of the final pixel is usually obtained by calculating the individual elements of the pixel color for each coordinate in the pixel's color space. For example, with the RGB color space, the calculation can be described by the following equations:

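\begin{equation}
\begin{aligned}
\mathit{out}_R &= \mathit{layerA}_R \cdot \mathit{layerA}_{alpha} + \mathit{layerB}_R \cdot (1 - \mathit{layerA}_{alpha}) \\
\mathit{out}_G &= \mathit{layerA}_G \cdot \mathit{layerA}_{alpha} + \mathit{layerB}_G \cdot (1 - \mathit{layerA}_{alpha}) \\
\mathit{out}_B &= \mathit{layerA}_B \cdot \mathit{layerA}_{alpha} + \mathit{layerB}_B \cdot (1 - \mathit{layerA}_{alpha})
\end{aligned}
\tag{8.1}
\end{equation}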

The alpha value can be either fixed for the entire image or delivered to the blender core as a separate value for each individual pixel, for example in the unused 8 bits of a 32-bit pixel memory window holding a 24-bit pixel color. In this work, the blender core uses a fixed alpha value for the entire active picture window. Although a per-pixel alpha channel was initially considered as a simple way of generating the OSD menu, the fixed alpha solution was preferred. The main reason for this decision was that the PC video feed is used as the source for the OSD menu and it would be problematic to transmit the alpha channel through the standard 24-bit color DVI interface. With a fixed alpha value, the entire range of the DVI pixel value can be used for the pixel color space coordinates, and the OSD is generated simply by displaying an image on the graphics output of the x86 host system.

This solution also has its drawbacks, most notably the inability to display a non-transparent OSD image on top of the live camera video feed. This was resolved by dedicating a single pixel color from the x86 host system as the transparent color value. When this color is encountered by the blender core, the value of the camera video pixel is assigned to the output, regardless of the alpha value setting. This allows the generation of either a non-transparent or a semitransparent OSD image on top of the live video feed.

8.2.1 Principle of operation

The core processes two input pixel streams and produces a blended pixel stream on the output. One input stream is a directly connected video feed from the x86 host system, which is used as a reference video signal for the output video feed.


This means that the output video feed has the same parameters (pixel clock, timing, resolution) as the video feed from the x86 host system. The live video signal from the camera input is mixed into this video feed using the preceding frame buffer and deinterlacer components. This allows the system to mix the two streams with no interruptions in the output video timing, since the camera feed is passed through the frame buffer component and can therefore be matched to the reference video signal.

The calculation of the output pixel value is divided into separate calculations for each color component of the pixel color. Each of these calculations is further divided into pipelined stages to relax the timing requirements of the design compared to a version with no pipelining. For the calculation of the output values, the blender core uses equations (8.1) translated into the integer domain.

8.2.2 Implementation

The blender component is implemented as a Verilog HDL entity, instantiated in a higher level schematic design file in the Quartus design environment.

The reference input video signal is fed to the core using the pixel_b_in[23..0] bus together with the reference video timing signals de_in, hsync_in and vsync_in. The core is clocked using the reference video signal clock connected to the core clock input clock_in.

Figure 8.5: Schematic symbol for the blender module[20]

The output video signal is formed by the output pixel_out[23..0] together with the timing control signals de_out, hsync_out and vsync_out. The output video feed uses the same clock as the input reference video feed, i.e. clock_in.

The following is a code walkthrough for a single color component (red). The core starts by registering the input information to reduce the length of the input path and thereby improve the maximum operating frequency of the core.

    always @(posedge clock_in)
    begin
        pixel_a <= pixel_a_in;
        pixel_b <= pixel_b_in;
    end

Now the core has the input pixel information available in the internal registers. The alpha value for the current video frame is registered during the vertical blanking interval of the reference video signal. This way, the alpha value is guaranteed to stay constant within each individual video frame. To further improve the maximum frequency of the core, the alpha weights for both video inputs are calculated immediately (the layerA_alpha and (1 - layerA_alpha) expressions from equations (8.1)).

    always @(posedge clock_in) begin
        if (vsync_in == 1) begin
            alpha_a[7:0] = alpha[7:0];
            alpha_b[7:0] = 8'd255 - alpha[7:0];
        end
    end

The blender core then continues by calculating the intermediate values for the expressions listed in (8.1). The core produces intermediate values for the pixel_a and pixel_b color components. Since there were some problems with the integer representation of the equations (uneven mapping of the multiplication results: the component output for layerX_alpha = 255 was 254), the core checks the alpha value and, if it is maximal, simply outputs the respective color components. Otherwise, the core performs an integer multiplication of the pixel color component and the alpha value.
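The off-by-one result mentioned above can be checked by hand. The lines below assume, as the listings in this section suggest, that each 8-bit product is 16 bits wide and that the output component is taken from the upper byte of the result (a division by 256). The symbols c and alpha are introduced here only for illustration.

```latex
\begin{aligned}
\mathit{out} &= \left\lfloor \frac{c \cdot \alpha}{256} \right\rfloor, \qquad c = 255,\ \alpha = 255 \\
255 \cdot 255 &= 65025, \qquad \left\lfloor 65025 / 256 \right\rfloor = 254 \\
255 \cdot 256 &= 65280, \qquad \left\lfloor 65280 / 256 \right\rfloor = 255
\end{aligned}
```

Treating the maximal alpha as a weight of 256 (an 8-bit left shift) instead of 255 therefore restores the full-scale value.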

    always @(posedge clock_in) begin
        if (alpha_a == 255) begin
            red_a[15:0] = {pixel_a[7:0], 8'b00000000};
            red_b = 0;
        end else
            red_a = pixel_a[7:0] * alpha_a;
        if (alpha_b == 255) begin
            red_b[15:0] = {pixel_b[7:0], 8'b00000000};
            red_a = 0;
        end else
            red_b = pixel_b[7:0] * alpha_b;
        red_a_pipe = red_a;
        red_b_pipe = red_b;
    end

The intermediate results are stored in the pipeline registers red_a_pipe and red_b_pipe. The core then continues by producing the final pixel output value. In this step the core also checks for the transparent color as described above and decides whether to output the pixel value based on the previous calculations or to output the camera video feed pixel value directly. The color selected as the transparent color for the video overlay is 0xFF00FF (magenta), which is considered very unlikely to appear in the x86 host system video output under normal conditions.

    always @(posedge clock_in) begin
        if (pc_1 == 24'hFF00FF) begin
            red[15:8]   = cam_1[7:0];
            green[15:8] = cam_1[15:8];
            blue[15:8]  = cam_1[23:16];
        end else begin
            red   = red_a_pipe + red_b_pipe;
            green = green_a_pipe + green_b_pipe;
            blue  = blue_a_pipe + blue_b_pipe;
        end
    end

The cam_1 register stores the camera input pixel value for the pixel currently being processed, and the pc_1 register stores the original reference video feed pixel value. Separate registers are necessary, since at this stage of the processing pipeline the original pixel_a and pixel_b registers already contain newer pixels due to the processing latency. This is the last step of the processing pipeline.

To compensate for the latency introduced by the individual processing stages, it is necessary to also properly align the output video timing signals to match the active picture data. This is done by a simple delay stage inside the blender core.

    always @(posedge clock_in) begin
        delay_3 = delay_2;
        delay_2 = delay_1;
        delay_1 = {de_in, hsync_in, vsync_in};
    end

The individual bits of the delay_3 register are then output as the final timing control signals.

At first, the design of the core did not use any pipelining and the core had very low performance in terms of the maximum allowable frequency of the incoming video signal. After introducing the pipelining stages, the core is capable of handling input video signal pixel clocks of 150+ MHz and therefore supports the required HD resolutions (the pixel clock of the 1080p video format is 148.5 MHz).
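For reference, the pipelined calculation for one color component can be condensed into a single self-contained module. This is a simplified sketch rather than the core itself: the module and signal names are hypothetical, non-blocking assignments are used throughout, the alpha value is registered on every clock instead of only during vertical blanking, and the transparent color check is left out.

```verilog
// Hypothetical sketch: fixed-alpha blend of one 8-bit colour component
// in three pipeline stages (names are illustrative).
module blend_channel_sketch (
    input  wire       clk,
    input  wire [7:0] a,      // component value from layer A
    input  wire [7:0] b,      // component value from layer B
    input  wire [7:0] alpha,  // weight of layer A, 0..255
    output reg  [7:0] out
);
    reg  [7:0]  a_r, b_r, alpha_a, alpha_b;
    reg  [15:0] prod_a, prod_b;
    wire [15:0] sum = prod_a + prod_b;

    always @(posedge clk) begin
        // Stage 1: register the inputs and both weights.
        a_r     <= a;
        b_r     <= b;
        alpha_a <= alpha;
        alpha_b <= 8'd255 - alpha;

        // Stage 2: weighted products; a weight of 255 is treated as 256
        // so that a fully visible layer keeps its full-scale value.
        prod_a  <= (alpha_a == 8'd255) ? {a_r, 8'h00} : a_r * alpha_a;
        prod_b  <= (alpha_b == 8'd255) ? {b_r, 8'h00} : b_r * alpha_b;

        // Stage 3: add and keep the upper byte (division by 256).
        out     <= sum[15:8];
    end
endmodule
```

The result appears three clock cycles after the inputs, which is why the timing signals accompanying the pixel data need a matching delay chain.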


8.3 Frame buffer

Frame buffering is a method for synchronizing a data producer and a data consumer which run at different processing rates. It can be used in situations where the data stream is divided into compact units suitable for processing on a per-unit basis. This condition is fulfilled for video processing, since the video stream is composed of individual video frames and these are usually processed separately. Frame buffering works by allocating a number of memory buffer regions for storing the incoming data segments. The number of buffers used depends on the selected frame buffering method.

[Diagram: before the buffer switch, the data producer writes buffer A while the data consumer reads buffer B; after the buffer switch, the data consumer reads buffer A while the data producer writes buffer B]

Figure 8.6: Principle of operation of a double buffering system

Double buffering uses two buffers, one for the data producer and one for the data consumer. Its use is limited to cases where data production can be controlled so that it works synchronously with data consumption. For example, this is the case with graphics cards in the PC industry. To remove video tearing during the display of rendered scenes, the GPU renders the scene into a different buffer than the one used to send the frame data to the monitor. These buffers are flipped in the vertical blanking period of the monitor output timing signal. This removes video tearing but at the same time introduces inefficiency into the rendering process, since the GPU has to wait for the start of the vertical blanking interval before it can start rendering the next frame (otherwise the GPU has no free buffer to render the scene into).

For situations in which data production cannot be synchronized with data consumption, double buffering is not optimal, since it has to drop a large number of data units. When processing video, this behavior is clearly visible in scenes with moving objects, where the motion appears choppy and unnatural.

Triple buffering removes the problems of double buffering by introducing a third buffer. This ensures that one buffer is always available for either the data producer or the data consumer.

[Diagram: the data producer writes to buffer A, the data consumer reads from buffer B and buffer C is idle. When the data producer has finished storing data and requests a free buffer, it continues writing into buffer C; buffer A becomes idle while the data consumer keeps reading from buffer B.]

Figure 8.7: Principle of operation of a triple buffering system

When triple buffering is used to synchronize two video streams with different frame rates, the method also provides a simple frame rate conversion. For example, if the synchronized stream has a lower frame rate than the stream being synchronized to, the data consumer can reuse the currently processed frame by duplicating it, without any impact on the data producer and the frame storage. The same holds for the opposite scenario, where the data producer stores frames at a higher rate than the data consumer processes them. In this case, frames can be dropped from the output stream, again without any interruption of the data consumption process.
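The duplication and dropping of frames can be shown with a small behavioral model of three buffers. This is a sketch only, not the logic used in the FPGA design: the class, the event-string schedule and all names are assumptions made for illustration, and frames are represented by their sequence numbers.

```python
class TripleBuffer:
    """Three buffers: one owned by the producer, one by the consumer, one spare."""

    def __init__(self):
        self.frames = [None, None, None]
        self.writing = 0          # buffer owned by the producer
        self.reading = 1          # buffer owned by the consumer
        self.idle = 2             # spare buffer
        self.idle_is_new = False  # True if the idle buffer holds an unread frame

    def producer_store(self, frame):
        """Store a complete frame, then take over the idle buffer."""
        self.frames[self.writing] = frame
        # An unread frame in the idle buffer will be overwritten, i.e. dropped
        self.writing, self.idle = self.idle, self.writing
        self.idle_is_new = True

    def consumer_fetch(self):
        """Return the newest complete frame, or repeat the current one."""
        if self.idle_is_new:
            self.reading, self.idle = self.idle, self.reading
            self.idle_is_new = False
        return self.frames[self.reading]


def run(schedule):
    """'P' = producer stores the next frame, 'C' = consumer fetches a frame."""
    tb = TripleBuffer()
    next_frame = 0
    output = []
    for event in schedule:
        if event == "P":
            tb.producer_store(next_frame)
            next_frame += 1
        else:
            output.append(tb.consumer_fetch())
    return output


slow_producer = run("PCCPCCPCC")  # [0, 0, 1, 1, 2, 2]: frames are duplicated
fast_producer = run("PPCPPCPPC")  # [1, 3, 5]: frames 0, 2 and 4 are dropped
```

In both schedules neither side ever waits for the other, which is the property that distinguishes triple buffering from double buffering.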

8.3.1 Principle of operation

The frame buffer was implemented using an SOPC Builder generated component. The module uses several subcomponents connected together by the Avalon Interface developed by Altera[25]. The main parts of the frame buffer module are the frame writer component, the frame reader component, the USB reader component and the necessary DDR2 memory controller core. By using the Avalon interface, the system could take advantage of the Altera provided High Performance DDR2 Memory Controller Core.

The frame writer was based on an example template of an Avalon Interface Memory Mapped Master with burst writes[26]. This is a template for a master component located in an Avalon interface SOPC system. The component can be thought of as a DMA (Direct Memory Access) engine whose source data is not located in the target address space. It provides a bridging function between the Avalon subsystem and the user logic design. The original component includes a single clock FIFO that carries the input from the user design into the Avalon subsystem. For the purposes of this project, this single clock FIFO was replaced with a dual clock version, which provides the clock crossing function required to transfer the video data from the camera video input clock domain to the internal Avalon subsystem clock domain. There were also some other modifications to the template, mainly regarding the start of the transfer. The bursting capability enables the DMA engine to transfer large portions of the video data at once, which greatly improves the efficiency of DDR2 memory access. The bursting capability is discussed further in the implementation section.
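The idea of draining a FIFO in bursts can be sketched with a simple software model. The model below is not the Altera template logic: the burst length of 16 words, the word-addressed memory, the base address and the function name are all assumptions chosen for illustration, and the clock domain crossing of the dual clock FIFO is not modeled.

```python
from collections import deque


def burst_write(fifo, memory, base_address, burst_len):
    """Drain full bursts from the FIFO into memory; return the number of bursts issued."""
    bursts = 0
    address = base_address
    while len(fifo) >= burst_len:
        # One address phase, followed by burst_len consecutive data words
        for offset in range(burst_len):
            memory[address + offset] = fifo.popleft()
        address += burst_len
        bursts += 1
    return bursts


fifo = deque(range(64))  # 64 pixel words waiting in the FIFO (hypothetical amount)
memory = {}              # word-addressed memory model
bursts = burst_write(fifo, memory, base_address=0x1000, burst_len=16)
```

With these assumed values the 64 words are written in 4 bursts, so the memory controller sees 4 address phases instead of the 64 that single-word writes would need.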

Figure 8.8: Structure of the read master template component[26]

The frame reader is also based on an Altera template, namely on an Avalon Interface Memory Mapped Master with burst reads. This component works in the same way as the frame writer, but in the opposite direction: it reads data from the DDR2 memory address space and transfers it through a dual clock FIFO into the user design. The user design is in this case the deinterlacer module.

The USB reader component is based on the same component as the frame reader, but this time the output is connected to the external USB to FIFO bridge (FT2232H[21]) from FTDI. The component was modified to adapt to the timing required by the external USB bridge chip.

The last part of the frame buffer component is the Altera DDR2 High Performance Memory Controller core. This component provides the means to access the external DDR2 memory chip by mapping it into the Avalon subsystem memory space.

This entire component is controlled by a second SOPC Builder subsystem located inside the FPGA. This subsystem provides the control functions for the entire FPGA design and the peripheral devices, including the communication link between the device and the PC over the second data channel of the FT2232H bridge, control of the timing of frame reading and writing in the frame buffer components, input video format detection and other support functions of the FPGA system. This subsystem is not described here for the reasons given in the Preface to this work.
