Bachelor Thesis. Optimization of OpenGL streaming in distributed embedded systems. Felix Mues June 2020



Reviewers:

Prof. Dr. Jian-Jia Chen Dr. Robert Budde

Technische Universität Dortmund Fakultät für Informatik

Lehrstuhl 12 embedded systems

http://ls12-www.cs.tu-dortmund.de

In cooperation with: Leopold Kostal GmbH & Co KG AEN 4.1 advanced development


Chapter 1 Introduction

1.1 Motivation and background

The number of displays in modern cars is growing constantly. These can be additional displays for passengers or digital displays which replace analogue instruments like speedometers. For example, the cockpit of the new Porsche Taycan shown in Figure 1.1 includes four separate displays.

Nowadays production cars come with a variety of customizations, including different display resolutions or additional displays. All these displays have to be supplied with video streams of information.

Figure 1.1: The cockpit of the new Porsche Taycan¹

¹ © Christoph Bauer, https://christophorus.porsche.com/image/image_1080x624/b14271ed-61c6-4c3a-b16a-29394b914f73.jpg


All live information presented to the user needs to be rendered into a visual representation. Rendering is the process of creating a picture out of meta information, for example displaying the current speed as a digital speedometer. Usually, standardized graphics APIs are used for this process.

To keep a homogeneous look and feel across all those displays, the following architecture is used: a central controller executes the application, while rendering and displaying are distributed to the individual display controllers. This concept was already evaluated in the student project "Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über IP-Netzwerke" (provision of graphical user interfaces in distributed HMI systems over IP networks) [6] and proven feasible. One result of that evaluation was the software "PipeGL", which serves as middleware for streaming OpenGL instructions. The current version of PipeGL is in a prototype state. This thesis takes a deeper look at the requirements for an OpenGL streaming middleware in the automotive context and applies the results to PipeGL. The techniques explained in the following are relevant for this objective.

1.1.1 OpenGL

OpenGL is a standardized set of instructions for rendering images, also called a graphics API. Its primary target is desktop computers. For embedded systems, the standard was revised under the name OpenGL ES. It provides the same functionality as OpenGL, but removes redundancy to keep the footprint as small as possible. It is maintained by the Khronos Group, an open industry consortium of over 150 companies. [10] The Khronos Group provides all specifications free of charge, which made OpenGL the de facto standard for graphics APIs on embedded systems.

The currently available version is OpenGL ES 3.2, but libraries which are critical for this project require version 2.0, so in this thesis "OpenGL" will always refer to OpenGL ES 2.0. The main objective of OpenGL is to give developers low-level access to rendering operations while being hardware independent. Therefore it can run on a range of graphics platforms with varying capabilities and performance. The appearance of the rendered scene is independent of the rendering hardware used.

A scene in OpenGL consists of any number of objects which are placed in a 3-dimensional coordinate system. This is called the world view in the following. OpenGL then acts like a camera which is placed in this coordinate system. The view of this camera represents the user's view on the scene, called the user's view.


Two types of objects are relevant for the rendering process: vertices and fragments. Both can be seen as points in a 3-dimensional space with some extra information, such as the colour the point should have. Vertices are used for the logical representation of complex objects like a cube and use the world view as reference, while a fragment can be seen as a discrete point in the user's view which might later become a pixel on the screen. A basic OpenGL program consists of the following steps:

1. Creating buffers for vertex data

    // generate a buffer and save its ID in vertexBuffer
    GLuint vertexBuffer;
    glGenBuffers(1, &vertexBuffer);

2. Writing vertex data into the buffers

    GLfloat g_vertex_buffer_data[] = {<array of positions>};
    // bind the buffer so that glBufferData writes into it
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    // pass vertex data to OpenGL
    glBufferData(GL_ARRAY_BUFFER, sizeof(g_vertex_buffer_data), g_vertex_buffer_data, GL_STATIC_DRAW);

3. Specifying how to draw the data inside of the buffers

    // activate the buffer
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    // attribute 0: three GL_FLOAT components per vertex, tightly packed
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
    // draw a triangle
    glDrawArrays(GL_TRIANGLES, 0, 3); // starting from vertex 0; 3 vertices total -> 1 triangle

4. Clearing buffers

    // the first argument is the number of buffers to delete, not a size in bytes
    glDeleteBuffers(1, &vertexBuffer);

Steps 1 and 2 are executed once at the start of the program. Step 3 is usually executed in a loop which generates one frame per iteration. Step 4 is executed once at termination of the program. OpenGL uses the concept of a rendering pipeline to create an image from the given information. The rendering pipeline for OpenGL is displayed in Figure 1.2.

In the first step, all vertices are processed by a Vertex Shader. This is a program which transforms the position of each vertex from the world view into the user's view.

The output of the Vertex Shader is then used to construct fragments, which are passed to the Fragment Shader. This shader computes the colour of each fragment and can apply additional modifications, such as textures.

After this, OpenGL further constructs the image by discarding fragments that are not visible because they lie behind other fragments² or that fail the stencil test specified by the developer.

Figure 1.2: OpenGL ES 2.0 Programmable Pipeline³

Colour buffer blend and dithering are colour optimizations which are almost completely handled by the OpenGL implementation. The resulting fragments are written as pixels into the frame buffer.

OpenGL is designed as a client-server architecture. Usually, these two components reside on the same computer, but they do not have to. In this context, the OpenGL application is considered the client and the implementation of OpenGL the server. The specification does not define how OpenGL commands are transmitted over a network. The client-server architecture also affects error handling. If any call in OpenGL results in an error, the call is ignored. Specified error returns are defined for calls with return values. Errors are queried via the call enum GetError( void ); (glGetError in the C binding). If an error was detected in any previous call, the error code is returned and the flag is reset to NO_ERROR.
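The behaviour of the error flag can be modelled in a few lines. The following is a minimal sketch in plain C++, not OpenGL code: the type ErrorFlag and its members are hypothetical names, and the sketch assumes a single flag that records only the first error until it is read.

```cpp
#include <cstdint>

// Hypothetical model of the OpenGL error flag (not the real GL API).
enum class GlError : std::uint32_t {
    NoError = 0,            // value of GL_NO_ERROR
    InvalidEnum = 0x0500,   // value of GL_INVALID_ENUM
    InvalidValue = 0x0501,  // value of GL_INVALID_VALUE
};

class ErrorFlag {
public:
    // Called when a command fails: only the first error is recorded,
    // later errors do not overwrite it.
    void record(GlError e) {
        if (flag_ == GlError::NoError) flag_ = e;
    }

    // Mirrors GetError: return the recorded code and reset the flag.
    GlError get() {
        GlError e = flag_;
        flag_ = GlError::NoError;
        return e;
    }

private:
    GlError flag_ = GlError::NoError;
};
```

Because the flag is only read on request, an application that is streamed over a network pays for a full round trip each time it queries the error state.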

1.1.2 PipeGL

The objective of the thesis "Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über IP-Netzwerke" [6] was to evaluate the feasibility of streaming procedure calls of a graphics API via IP networks. The scope of that thesis includes the development of an OpenGL application and a middleware that handles streaming over IP networks, called 'PipeGL'.

² This is a simplification, as effects like transparency would usually be evaluated.

3

PipeGL consists of two independent software components which run on two different devices. The design follows the client-server architecture: the device which runs the OpenGL application is called the application server and the device which displays the OpenGL scene is called the rendering client. As displayed in Figure 1.3, the PipeGL library runs on the application server. It accepts all OpenGL instructions and transfers them via Ethernet to the rendering client. There, the PipeGL renderer accepts the transmitted data from the PipeGL library, transforms it back into OpenGL calls and executes those calls on its own device.

Figure 1.3: General communication of PipeGL

PipeGL handles the connection between application server and rendering client. It serializes all data that needs to be sent to the client and deserializes it on the client again. Both the OpenGL application and the OpenGL library are implemented without knowledge of the middleware.

A definition of PipeGL in terms of the OSI model can be seen in Figure 1.4. Direct communication between the application and OpenGL is not possible in a distributed context, so PipeGL is added to handle the communication via Ethernet. The transport layer and the layers below are based on the TCP/IP reference model.


Figure 1.4: PipeGL in the OSI model

1.1.3 Automotive Ethernet

OABR (OPEN Alliance BroadR-Reach) was originally developed by Broadcom. It is a physical transmission technology which was standardized as IEEE 100BASE-T1, also called Automotive Ethernet.

It is used as the physical layer of the OSI model, so that the full TCP/IP stack can be utilized in the automotive context. Communication is realized in full-duplex transmission mode, while the physical connection is realized with a single unshielded twisted pair of wires.

The relevant properties of this technology for the current context are bandwidth and latency. Bandwidth is limited by specification to 100 Mbit/s for each node-to-node connection. Established in-vehicle transmission technologies are FlexRay with a bandwidth of 10 Mbit/s [7] and MOST, which has a maximum bandwidth of 150 Mbit/s but is designed as a ring topology, so the 150 Mbit/s are shared between all connected nodes.

A standard for gigabit Ethernet (IEEE 1000BASE-T1) is defined as well, but to keep the hardware demands low, this thesis keeps 100BASE-T1 as the requirement.

For automotive purposes timing is essential, but common Ethernet does not provide any support for real-time guarantees. To fulfil this need, the "Stream Reservation Protocol" (SRP) was introduced. SRP is part of the Time Sensitive Networking standards. The intention of this group of standards is to provide quality of service on Layer 2 of the OSI model. It defines four classes, sorted by priority [2, p.14]:

• CDT: guaranteed worst-case latency of 100 µs over 5 hops and maximum transmission period of 0.5 ms

• Class A: guaranteed worst-case latency of 2 ms and maximum transmission period of 125 µs

• Class B: guaranteed worst-case latency of 50 ms and maximum transmission period of 250 µs

• Control traffic: used by the protocol for adding and removing streams

Each stream is registered at compatible Ethernet bridges, and its transmission is guaranteed. To ensure compatibility with standard Ethernet, traffic which does not need any real-time guarantees can be transferred in parallel to registered streams.

1.2 Scope

The current implementation of PipeGL is a limited test application. This thesis includes the definition of a use case for the embedded automotive context, further evaluation of PipeGL, and the design, implementation and evaluation of an optimization in this context.

1.3 Structure

At first, the current development status will be documented as a reference. Then requirements for the PipeGL project will be defined. Derived from these requirements, analyses will be defined, implemented and evaluated.

From this evaluation, a possible optimization will be defined, implemented and compared with the reference.


Chapter 2 Given experimental setup

The test setup is reconstructed from the underlying project "Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über IP-Netzwerke" [6, section 5.2]. Differences are mentioned.

2.1 Hardware

The rendering client runs on a system on a chip¹ which comes with an i.MX8DualXPlus processor and 1 GB of RAM. The processor has a 64-bit ARM architecture and the included GPU supports OpenGL ES 3.1 and below. [14]

This board comes with the peripherals for connecting a display and Automotive Ethernet. The connected display has a resolution of 1280x800 pixels, a colour depth of 32 bit and a refresh rate of 60 Hz. The board was changed because the current hardware is closer to a real automotive setup than the i.MX8 QXP MEK used in the previous thesis.

The application server is a Linux PC containing an Intel(R) Core(TM) i5-6500 CPU with an x86_64 architecture and 8 GB of RAM. The operating system is Fedora 30.

Figure 2.1 presents the current test setup. The Ethernet switch connects application server and rendering client. It handles bridging from standard to automotive Ethernet.

The Stream Reservation Protocol has to be supported by the network interface controller, which is not the case with the current hardware. A guaranteed amount of bandwidth would not bring any advantage here, because there is no other traffic on the network used. PipeGL can therefore always use the full available bandwidth.

¹ Further referenced as 'board'.

Figure 2.1: Test setup as described in [6] (application server, Ethernet switch, Automotive Ethernet link, rendering client, display)

2.2 Operating system

The operating system is an embedded Linux-based distribution developed and maintained by Kostal. It is based on the Yocto project².

Yocto provides tools to create and support embedded Linux distributions which need to run on multiple different hardware platforms. It consists of three main elements: the build tool BitBake, a reference embedded distribution called Poky³ and tools for automated building, testing and deployment.

Yocto has a layered architecture. Each layer bundles related functionalities. A distribution consists of a selection of layers. Layers are ordered by priority; a layer with higher priority can override settings from layers with lower priority. This allows, for example, upgrading required software to a version different from the one in the base layers. One advantage of this architecture is the reduced complexity of embedded projects, because features are only added if they are required. It also improves reproducibility, because the software comes with a full reference distribution and does not need to be adjusted to any existing OS.

As the hardware is changed for the current project, PipeGL and the required libraries need to be ported to the new system. In this process PipeGL and the required libraries are added as a separate layer to the Kostal-Linux project. This will ease the process of further hardware changes.

The operating system has a vertical synchronization mechanism. Vertical synchronization prevents changes to the frame buffer while the display controller transfers the picture to the screen. This limits the maximum reachable frame rate to 60 FPS. Lower frame rates tend to be integer fractions of 60 FPS, such as 30, 20 or 15 FPS.
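The stepwise drop in frame rate can be illustrated with a small calculation. The sketch below assumes a simplified model that is not taken from the measurements in this thesis: every frame takes the same time to produce and is shown at the next display refresh. The function name vsyncFrameRate is hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Effective frame rate under vertical synchronization, assuming a constant
// frame time and a buffer swap that waits for the next display refresh.
double vsyncFrameRate(double frameTimeMs, double refreshHz = 60.0) {
    const double refreshPeriodMs = 1000.0 / refreshHz;
    // Number of refresh intervals one frame occupies (at least one).
    const double intervals =
        std::max(1.0, std::ceil(frameTimeMs / refreshPeriodMs));
    return refreshHz / intervals;
}
```

Under this model a frame time of 10 ms still yields 60 FPS, while 20 ms already halves the rate to 30 FPS, because the finished frame has to wait for the second refresh.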

² www.yoctoproject.org

3


2.3 Software

The software consists of three parts: PipeGL library, PipeGL renderer and OpenGL application. All components are written in C++ and CMake is used as build tool.

The library and the renderer are based on the pipes-and-filters pattern. Pipes-and-filters splits a complex transformation of an input stream into multiple atomic transformations. The atomic transformations are called filters and are connected by pipes. Pipes can range from buffers that are passed between different functions to complex streams like TCP connections. Using this pattern reduces complexity and improves interchangeability of each component.
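The pattern can be sketched in a few lines of C++. The names Filter and Pipeline are illustrative only and unrelated to the API of any library used by PipeGL; the pipe is reduced to a string that is handed from one filter to the next.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A filter is one atomic transformation of the data.
using Filter = std::function<std::string(const std::string&)>;

class Pipeline {
public:
    // Appends a filter to the end of the pipeline.
    Pipeline& add(Filter f) {
        filters_.push_back(std::move(f));
        return *this;
    }

    // Applies all filters in the order in which they were added.
    std::string run(std::string data) const {
        for (const auto& f : filters_) data = f(data);
        return data;
    }

private:
    std::vector<Filter> filters_;
};
```

Since every filter only depends on its input, a single stage can be replaced or removed without touching the others.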

PipeGL uses the software library Wangle⁴, which provides an implementation of pipes-and-filters. Wangle pipelines extend the pipes-and-filters pattern to a bidirectional pipeline and add the possibility of using different filters depending on the direction. This allows the implementation of one pipeline for transporting calls from the application to OpenGL and one pipeline for returning values together in a single Wangle pipeline.

The asynchronous calls used in PipeGL are based on the model of promises and futures. Whenever an asynchronous call is made, it returns a Future<T> which acts as a placeholder while the real return value is computed in parallel. It is possible to wait for the fulfilment of a future, which synchronizes the operations.
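The model can be illustrated with the promise and future types of the C++ standard library. This is only a sketch of the concept: PipeGL may use a different future implementation, and PendingCall, issueCall and onReplyReceived are hypothetical names.

```cpp
#include <future>

// Hypothetical pending call: the caller keeps the future, the transport side
// keeps the promise and fulfils it when the reply arrives.
struct PendingCall {
    std::promise<int> reply;
};

// Returns the placeholder for a result that is not available yet.
std::future<int> issueCall(PendingCall& call) {
    return call.reply.get_future();
}

// Fulfils the promise; anyone waiting on the future is unblocked.
void onReplyReceived(PendingCall& call, int value) {
    call.reply.set_value(value);
}
```

Calling get() on the returned future blocks until onReplyReceived has been called, which is the synchronization point mentioned above.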

PipeGL includes two selectable optimizations: compression and framing. Framing is the process of collecting OpenGL calls until the end of a frame or the moment synchronization is required. Then all collected calls are transferred as one frame. The end of a frame cannot be detected by OpenGL itself, but it is assumed that every frame starts with a call of glClear(). If compression is enabled, the serialized data is compressed before it is sent over the network and decompressed after it is received on the client side.
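The framing rule can be sketched as follows. This is a simplified illustration under the assumptions stated above (a frame starts with the clear call, and a synchronous call forces a flush); the types Call and Frame and the function buildFrames are hypothetical and do not reflect the actual PipeGL classes.

```cpp
#include <string>
#include <vector>

struct Call {
    std::string name;
    bool synchronous;  // true if the application waits for a return value
};

using Frame = std::vector<Call>;

// Groups calls into frames. A frame is flushed when the next glClear starts
// a new frame, or directly after a synchronous call.
std::vector<Frame> buildFrames(const std::vector<Call>& calls) {
    std::vector<Frame> frames;
    Frame current;
    for (const Call& c : calls) {
        if (c.name == "glClear" && !current.empty()) {
            frames.push_back(current);
            current.clear();
        }
        current.push_back(c);
        if (c.synchronous) {
            frames.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) frames.push_back(current);
    return frames;
}
```

The sketch shows why synchronous calls work against framing: every one of them cuts the current frame short and causes an additional transmission.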

The OpenGL application is independent of PipeGL. As mentioned in the introduction, PipeGL should act as middleware for OpenGL calls, so the objective of this application is to test the current implementation of PipeGL. Currently PipeGL does not support every OpenGL instruction, but the instruction set is extended in the scope of this thesis. All OpenGL instructions supported by PipeGL at the time the thesis was finished are listed in Appendix B.

4


2.3.1 Wangle Pipeline

The Wangle pipeline is located half on the application server and half on the PipeGL renderer. On the library side it receives OpenGL calls, on the renderer side it accepts return values. Each OpenGL call is stored in an Instruction class and the possible return value in a Reply class.

Figure 2.2 shows the individual components of the PipeGL library and renderer. Both are highlighted in the picture. The pipeline is physically located on two different machines, but logically it is one unit.

Figure 2.2: The PipeGL pipeline⁵

The individual components are described from the ClientCacheAdapter to the Executor. The order in which the components are described represents the control flow of the pipeline. The direction used for an OpenGL command is called the forward direction; the direction used for a return value is called the backward direction.

⁵ Source: "Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über IP-Netzwerke" [6, p. 16]; the author added frames around the library and renderer and added the labels PipeGL library and PipeGL renderer.

The ClientCacheAdapter caches the results of synchronous calls. If a synchronous call is issued for the first time, it is passed down the pipeline. The return value is cached when it arrives at the adapter. When the same call is issued again, this adapter returns the cached value directly. This reduces the number of OpenGL calls transmitted over the network.

If framing is enabled, the OutboundFrameAdapter collects all instructions until synchronization is required or a synchronous call is made. The resulting frame is then passed to the FrameSerAdapter. The OutboundFrameAdapter is only used in the forward direction, because return values are not framed.
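The caching behaviour of the ClientCacheAdapter can be sketched as a lookup table in front of the remote call. The class CallCache below is a hypothetical illustration, not the PipeGL implementation; it assumes that a call is identified by a string key and that its result does not change between calls, which does not hold for every OpenGL query.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache for synchronous calls, keyed by the call signature.
class CallCache {
public:
    explicit CallCache(std::function<int(const std::string&)> remote)
        : remote_(std::move(remote)) {}

    int call(const std::string& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // answered locally
        const int value = remote_(key);             // round trip over the network
        cache_.emplace(key, value);
        return value;
    }

private:
    std::function<int(const std::string&)> remote_;
    std::map<std::string, int> cache_;
};
```

Only the first occurrence of a call pays the network round trip; every repetition is answered from the local table.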

The InstructionSerAdapter and the FrameSerAdapter have the same purpose: serializing the incoming data. Serializing is the process of translating data into a format that can be sent over a network. The reverse process is called deserialization. This functionality is provided by the Flatbuffers⁶ library. The InstructionSerAdapter serializes incoming instructions and the FrameSerAdapter serializes incoming frames.
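To make the idea of serialization concrete, the following sketch encodes an instruction into a byte buffer and decodes it again. It deliberately does not use Flatbuffers: the wire format (opcode, argument count, arguments) and the struct layout are assumptions made for illustration, host byte order is used for brevity, and the decoder expects well-formed input.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical instruction: an opcode plus 32-bit integer arguments.
struct Instruction {
    std::uint32_t opcode;
    std::vector<std::int32_t> args;
};

// Layout: 4-byte opcode, 4-byte argument count, then 4 bytes per argument.
std::vector<std::uint8_t> serialize(const Instruction& in) {
    const std::uint32_t count = static_cast<std::uint32_t>(in.args.size());
    std::vector<std::uint8_t> out(8 + 4 * in.args.size());
    std::memcpy(out.data(), &in.opcode, 4);
    std::memcpy(out.data() + 4, &count, 4);
    if (count > 0) std::memcpy(out.data() + 8, in.args.data(), 4 * in.args.size());
    return out;
}

// Reverse of serialize(); performs no bounds checks on the input.
Instruction deserialize(const std::vector<std::uint8_t>& bytes) {
    Instruction in{};
    std::uint32_t count = 0;
    std::memcpy(&in.opcode, bytes.data(), 4);
    std::memcpy(&count, bytes.data() + 4, 4);
    in.args.resize(count);
    if (count > 0) std::memcpy(in.args.data(), bytes.data() + 8, 4 * count);
    return in;
}
```

A real middleware additionally has to handle differing byte orders and pointer arguments such as vertex arrays, which is one reason to rely on a serialization library.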

If compression is activated, the CompressionAdapter compresses the serialized data with the Snappy⁷ compression algorithm.

The TCPClientHandler handles the TCP socket of the application server. The incoming data is sent to the PipeGL renderer.

The TCPServerHandler handles the TCP socket of the PipeGL renderer. The incoming data from the application server is passed on to the next handler. If compression is activated, the data is decompressed again by the CompressionAdapter.

The deserialization adapter then deserializes the incoming data into the original format. The FrameDeserAdapter is used if framing is activated and the InstructionDeserAdapter is used if not.

If framing is activated, the InboundFrameAdapter splits the incoming frame up into single commands. This adapter is also only used in the forward direction.

At last, the Executor collects all executable OpenGL instructions in a list. This list is further processed by the PipeGL renderer.

⁶ See https://google.github.io/flatbuffers/

7


If an OpenGL method produces a return value, this value is wrapped in a Reply class and passed back to the OpenGL application. Framing is deactivated for return values. The Wangle pipeline is symmetric in most parts. Exceptions are the Executor and the ClientCacheAdapter.

This means the InstructionDeSerAdapter acts in the backward direction as the InstructionSerAdapter acts in the forward direction, the CompressionAdapter compresses on the PipeGL renderer side and decompresses on the PipeGL library side, and the TCPServerHandler acts like the TCPClientHandler and vice versa.

In the backward direction the Executor just passes the Reply to the InstructionDeSerAdapter. The ClientCacheAdapter extracts the return value and adds it to the cache, if possible.

2.3.2 PipeGL library

To keep compatibility with any OpenGL application, the application server side of PipeGL is implemented as a shared library. It can be linked instead of the OpenGL library at runtime. The dynamic linker on Linux systems loads libraries that are specified with the LD_PRELOAD environment variable before all others. This makes it possible to run PipeGL with any application which is dynamically linked to OpenGL without any changes to the application code itself.

The port and IP address of the client have to be set via environment variables; only one target can be specified. Instead of executing the OpenGL calls locally, they are collected by PipeGL and transmitted to the remote renderer. If OpenGL does not expect a return value, transmission is done asynchronously.
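Reading such a target from the environment could look like the sketch below. The names of the environment variables are not given in this thesis, so they are passed in as parameters; Target and readTarget are hypothetical names as well.

```cpp
#include <cstdlib>
#include <string>

struct Target {
    std::string host;
    int port;
};

// Reads host and port from the two given environment variables.
// Returns false if a variable is missing or the port is not in 1..65535.
bool readTarget(const char* hostVar, const char* portVar, Target& out) {
    const char* host = std::getenv(hostVar);
    const char* port = std::getenv(portVar);
    if (host == nullptr || port == nullptr) return false;

    char* end = nullptr;
    const long value = std::strtol(port, &end, 10);
    if (end == port || *end != '\0' || value < 1 || value > 65535) return false;

    out.host = host;
    out.port = static_cast<int>(value);
    return true;
}
```

Validating the values once at start-up keeps configuration errors out of the code path that later opens the connection.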

For instructions that expect a return value from OpenGL, the application server returns a Future. When the value behind this Future is accessed, execution is suspended until the client provides a return value. Then the OpenGL application can access the value.

Figure 2.3 displays all components of the PipeGL library. The glIntercept class implements all OpenGL methods, so it is the starting point of the control flow. glIntercept puts all required information for a call into an Instruction class and passes it to the ClientApp. The ClientApp holds the Wangle pipeline. When the first call arrives, the PipeGL library part of the pipeline is initialized. After that, the ClientApp only forwards incoming instructions to the pipeline.

Figure 2.3: Block diagram of PipeGL library⁸

With the initialization of the Pipeline a connection to the rendering client is established and kept open until the termination of the program.

Return values of the rendering client are handled in Reply classes. They are passed from the BoostTCPHandler via the Pipeline to the ClientApp, which passes them back to glIntercept. These are used to fill the given promises with the corresponding value.

2.3.3 PipeGL renderer

The PipeGL renderer is linked against OpenGL and runs on the rendering client. It opens a TCP socket and waits for the application server to connect. Once an application is connected, the renderer extracts the OpenGL commands from the incoming packets and executes them locally. If an OpenGL call generates a return value, the renderer sends it back to the application server.

Figure 2.4 visualizes the internal setup of the PipeGL renderer. The receiver thread handles the forward direction of the pipeline. After the incoming data is processed by the Executor, the next packet can be received.

The OpenGL thread executes the collected OpenGL instructions. If a return value is computed, it is written back into the pipeline. This is an asynchronous process, so it does not block the OpenGL thread. A newly created background thread moves the return value through the pipeline in the backward direction.

⁸ Source: "Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über

Figure 2.4: Block diagram of PipeGL renderer

2.3.4 OpenGL application

The OpenGL application is used for testing PipeGL. It uses scenery, a simple API for creating OpenGL scenes. The advantage of scenery is that it only uses a subset of OpenGL commands. This allows testing of PipeGL without implementing the full API. The OpenGL scene consists of a camera facing a variable number of coloured rotating octahedrons (see Figure 2.5).

Chapter 3 Analysis

Before analysing the current application the basic conditions for the use of PipeGL are going to be discussed. First, the extended use case for PipeGL is presented. The new use case is then analysed for requirements and in the next step these requirements are applied to PipeGL and the OpenGL application.

3.1 Use case

PipeGL is used as middleware in infotainment systems to distribute the graphic surface of any used software on various displays. This is achieved by having one central controller acting as server which distributes the process of rendering and displaying to multiple clients. The server should be aware of the connected clients and the displayed picture should be fitted for every display. The display serves as Human-Machine Interface between the car and persons inside.

3.2 Metrics for the analysis

The criteria for measuring the performance of PipeGL derive from the fields of client-server software, infotainment systems and embedded devices. Evaluating the demands in these fields is important to obtain meaningful metrics which can be applied to PipeGL.

3.2.1 General requirements

Client-Server software

PipeGL consists of two parts: The application server side receives all issued OpenGL calls, wraps them into IP packets and sends them via TCP. The rendering client resides on another machine and accepts the packets, extracts the OpenGL calls and executes them locally to render a video frame.


Data transmission is often a bottleneck for Client-Server software which makes the number of non-blocking and blocking calls relevant for the overall performance. Critical parameters for data transmission are latency and bandwidth.

In the current case a 100 Mbit/s Automotive Ethernet network is taken as a precondition. The bandwidth might be reduced by other applications using the same network route or parts of it. Therefore the amount of transmitted data should be kept low. If real-time constraints are required, the TSN standards can be utilized to provide real-time guarantees for latency.

Infotainment systems

Infotainment systems serve as assistants for the user's interaction with the car's internal control elements and often provide additional features for a better user experience. When displaying information to the user, it is important to deliver a satisfying experience of interaction with this data. This project does not have any influence on the hardware requirements for user experience, like size and quality of the display. Further, it cannot influence the non-functional requirements of displaying information, like having a well-structured and informative GUI. Nevertheless, PipeGL has influence on the functional requirements frame rate and input latency.

Current high-definition TVs use frame rates of 25 or 50 FPS [1]. Displays designed for gaming support frame rates from 60 to 144 FPS. Higher frame rates have been shown to improve the user's ability to select both dynamic and static targets [9]. This is a significant factor for infotainment systems: when drivers interact with the display, they should be able to do so with the least possible distraction. A high frame rate is therefore a critical demand for PipeGL. If a user is interacting with the system, another metric is important: the input latency. Input latency describes the time between an action of the user and the visual response of the system. For example, if the user pushes a button, input latency describes the time until the GUI displays a colour change of this button. An input latency of 100 ms or less is perceived as instant; higher input latency distracts the user. [12, Topic 1. Response to control activation]

Scalability

Administrative scalability describes the ability of an application to serve an increasing number of individuals. Well-scaling applications require only few extra resources (memory, bandwidth, etc.) for each newly connected individual.

PipeGL is used to deliver OpenGL instructions from the application server to an undefined number of renderers. As the number of connected renderers grows the additionally required resources of PipeGL per new renderer should be kept as low as possible.

3.2.2 Applied to PipeGL

OpenGL test application

The test application consists of asynchronous OpenGL commands, synchronous OpenGL commands and commands not related to OpenGL. PipeGL only has an impact on OpenGL commands, so other commands will not be further analysed in the following tests.

The fact that synchronous OpenGL calls block the further execution of the application makes PipeGL dependent on low latency and high bandwidth. A high latency in particular increases the execution time of the application.

Unlike synchronous calls, asynchronous calls do not block the execution of the OpenGL application. Asynchronous calls are therefore still affected by a low bandwidth, but less affected by a high latency. With reference to the OpenGL example program given in the introduction, most instructions in step 3 are asynchronous. This is relevant because step 3 specifies how to draw the provided data and therefore creates a visible frame. This suggests that OpenGL programs mostly consist of asynchronous instructions. Long sequences of asynchronous calls are desirable, as they can profit from optimization methods.

The optimisation techniques framing and compression affect the execution time of the application. To minimise this effect, all tests should be independent of execution time. This is realized by referring to the generated frames instead of a time interval. The above-mentioned factors are covered by three metrics: asynchronous commands per frame, synchronous commands per frame and blocks of continuous asynchronous commands per frame.
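The three metrics can be computed from the sequence of calls belonging to one frame. The sketch below assumes that each call has already been classified as asynchronous or synchronous; the names CallKind, FrameMetrics and measureFrame are hypothetical.

```cpp
#include <vector>

enum class CallKind { Async, Sync };

struct FrameMetrics {
    int asyncCalls = 0;
    int syncCalls = 0;
    int asyncBlocks = 0;  // maximal runs of consecutive asynchronous calls
};

// Counts the calls of one frame and the number of uninterrupted
// asynchronous blocks between synchronous calls.
FrameMetrics measureFrame(const std::vector<CallKind>& frame) {
    FrameMetrics m;
    bool inBlock = false;
    for (CallKind k : frame) {
        if (k == CallKind::Async) {
            ++m.asyncCalls;
            if (!inBlock) ++m.asyncBlocks;  // a new block starts here
            inBlock = true;
        } else {
            ++m.syncCalls;
            inBlock = false;                // a synchronous call ends the block
        }
    }
    return m;
}
```

A frame with many asynchronous calls but few blocks consists of long uninterrupted sequences, which is the favourable case for framing.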

PipeGL

User satisfaction is the main objective for infotainment systems. This is achieved by keeping the frame rate high and the input latency low.

In this setup input latency is composed of the following parts:

Assuming the user uses a touch device, the touch needs to be detected by the touch controller. This touch signal needs to be transferred from the rendering client to the application server. Detection and processing of the touch is assumed to take 12 ms. The transmission via Ethernet takes another 5 ms. Then the touch signal needs to be processed by the OpenGL application, which leads to a change in the OpenGL scene. The processing step is assumed to take 45 ms. The OpenGL calls to change this scene need to be transmitted to the rendering client by PipeGL. This step is assumed to take 25 ms. Then the OpenGL implementation renders the new scene, which is assumed to take 5 ms. At last the display controller transfers the new image to the screen. The refresh rate is 60 Hz, so this step takes up to 17 ms.

The complete process is displayed in Figure 3.1. The steps mentioned above are listed from left to right, split up by the layer they belong to. Streaming lasts from the moment the first OpenGL call is issued by the OpenGL application until the last OpenGL call is executed on the rendering client, so it overlaps with the input processing and the rendering step.

[Timeline, t in ms from 10 to 100; rows: OS layer, client application layer, transport layer, server application layer; stages: touch detection, transmitting, processing, streaming, rendering, displaying]

Figure 3.1: Input latency
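As a rough cross-check of the estimates above, the following Python sketch adds up the stage durations named in the text. The dictionary keys and function names are chosen for illustration only. The plain sum is an upper bound, because streaming overlaps with input processing and rendering.

```python
# Stage estimates in milliseconds, as stated in the text above.
STAGES_MS = {
    "touch detection": 12,
    "transmitting": 5,
    "processing": 45,
    "streaming": 25,
    "rendering": 5,
    "displaying": 17,
}


def sequential_latency_ms(stages=STAGES_MS):
    """Upper bound: every stage runs strictly after the previous one."""
    return sum(stages.values())


def frame_period_ms(fps):
    """Time available for one frame at the given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    print("sequential upper bound:", sequential_latency_ms(), "ms")
    print("frame period at 60 FPS: %.2f ms" % frame_period_ms(60))
```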

PipeGL can only influence the processes of streaming and computing the reaction. Touch detection and displaying are worst-case assumptions. The transmission time from the client to the server is network dependent, and the time required by the OpenGL application to process the data can only be estimated as a universal average value. The current estimation grants PipeGL a time slot of 28 ms for streaming and rendering the scene. This takes into account that the computation of the results can possibly run in parallel to the streaming process. Short transmission times for sensor signals are assumed, which could be achieved by the transmission class A or CDT mentioned in section 1.1.3.


PipeGL has no functional differences between streaming a static or an interactive GUI. Therefore the reaction-time budget of 28 ms is less strict than the 16.66 ms per frame required for streaming a scene at 60 FPS.

As a client-server software, PipeGL is limited by the maximum bandwidth and latency of the network. The bandwidth is limited to 100 Mbit/s, which needs to be shared with other applications in the network, so the amount of transmitted data should be kept low. Nonetheless, PipeGL can profit from the TSN standards as it has soft real-time constraints. TSN defines classes of priority, and we can assume PipeGL belongs to class B, which guarantees a maximum latency of 50 ms. Latency testing is therefore only done up to 50 ms.

Scalability testing is limited to the available hardware resources which are currently three boards with one display each. All of them will be connected to one switch.

In the underlying project PipeGL was already tested with regard to frame rate. To include the new requirements, the test case is slightly changed. To simulate the 1 to n streaming process, PipeGL should stream the same scene from one application server to at least one and at most three renderers.

To cover all factors mentioned above, the following measurements should be executed: FPS is the determining measure of quality. It should be measured once while changing the bandwidth and once while changing the latency. Both the latency test and the bandwidth test should be executed with at least one and at most three rendering clients.

3.3 Evaluation of the OpenGL application

The OpenGL application is used as a testing tool for PipeGL, but to obtain meaningful results we need further information, more precisely how many synchronous and asynchronous calls are issued for every frame. This information can later be used to classify results from the tests of PipeGL and to find fields for optimization.

3.3.1 Definition of tests

The first test should generate some frames and analyse which OpenGL calls are issued by the application. These calls should be logged in the correct order. The test will first generate a single frame and, in a second run, 60 frames, which corresponds to approximately one second of video.


Logging should be done by writing to a file. This requires accesses to the hard drive which could affect the performance of PipeGL. For this test the performance should be irrelevant. This means the amount of time required for every OpenGL call should not have any effect on the number or order of OpenGL calls. The logging functionality should be selectable at compile time to prevent negative impacts on later tests.

One of PipeGL’s features is the caching of synchronous calls. Cached calls behave differently from synchronous and asynchronous ones, as they are neither transferred via the network nor executed on the target machine. That is why they should be recorded separately. The results should display the following numbers:

• synchronous calls
• asynchronous calls
• cached calls
• number of consecutive synchronous/asynchronous calls (blocks)
• total number of blocks

3.3.2 Implementation of tests

All OpenGL calls are caught by glIntercept. To make sure all calls are logged and the correct order is kept, logging will be implemented in every OpenGL method of glIntercept.

Logging is implemented with the already included ”Google Logging Library”, short GLOG.

The performance is not relevant for this test, so an intuitive implementation for logging can be used. At the beginning of every OpenGL command call the text ”method: $$$<name of the command>$$$ called” is written into a logfile. The dollar signs are used to separate commands from other log entries.

To keep this functionality optional, it will be surrounded with the preprocessor statement #ifdef LOG_METHODS, so the feature can be included at compile time.

An example implementation for the command glViewport is given in Listing 3.1.

    GL_APICALL void GL_APIENTRY glViewport(..) {
    #ifdef LOG_METHODS
        LOG(INFO) << "method: $$$glViewport$$$ called";
    #endif


The ClientCacheAdapter mentioned in chapter 3 handles caching of synchronous calls. This functionality is extended to write ”value of <cached method name> got $$$cached$$$” into the log file when the LOG_METHODS flag is set.

The current OpenGL test application runs for a specified amount of time, creating a specified number of octahedrons per frame. This time dependency should not exist for this test, so it is replaced by running for a specified number of frames. The number of octahedrons is set to 25. As seen in Figure 3.2, PipeGL still achieves 60 FPS for this number.

Figure 3.2: Amount of Octahedrons rendered by PipeGL1

The application will then be evaluated by a Python script.

The Python script holds two lists: one containing all calls PipeGL handles synchronously and one for asynchronous calls. When the test script is started, it first calls the test application with the logging version of PipeGL and then evaluates the logfile. The OpenGL calls in the logfile are separated from other logged information, using the ’$$$’ markers as separators. Every OpenGL call is then classified as a synchronous or asynchronous call using the lists, and the total numbers of sync/async calls are counted. Any call which is followed by a ’cached’ statement is listed as a cached call.

1 Source: ”Bereitstellung von graphischen Benutzeroberflächen in verteilten HMI-Systemen über


Furthermore, the number of consecutive sync/async calls is evaluated as blocks of calls. The script returns all block lengths for sync and async calls and also creates a CSV file containing them. Cached calls do not interrupt a block of calls.
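The evaluation described above can be sketched as follows. This is a minimal illustration and not the script used in the thesis: the contents of `SYNC_CALLS`, the function names and the exact log line layout are assumptions, and only the ’$$$’ separators and the ’cached’ marker are taken from the text.

```python
from itertools import groupby

# Hypothetical list of calls that PipeGL handles synchronously.
SYNC_CALLS = {"glClear", "glGetUniformLocation"}


def parse_log(lines):
    """Extract (call name, cached) pairs from log lines in their logged order."""
    calls = []
    for line in lines:
        parts = line.split("$$$")
        if len(parts) < 3:
            continue  # not an OpenGL log entry
        token = parts[1].strip()
        if token == "cached":
            if calls:
                calls[-1][1] = True  # marks the call logged directly before
        else:
            calls.append([token, False])
    return [(name, cached) for name, cached in calls]


def summarize(calls):
    """Count sync/async/cached calls and measure block lengths."""
    counts = {"sync": 0, "async": 0, "cached": 0}
    uncached_kinds = []
    for name, cached in calls:
        kind = "sync" if name in SYNC_CALLS else "async"
        counts[kind] += 1
        if cached:
            counts["cached"] += 1  # cached calls do not interrupt a block
        else:
            uncached_kinds.append(kind)
    blocks = {"sync": [], "async": []}
    for kind, run in groupby(uncached_kinds):
        blocks[kind].append(len(list(run)))
    return counts, blocks
```

Writing the block lengths to a CSV file could then be done with the standard `csv` module, one row per call type.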

glClear() is defined as a synchronous call as it is used to synchronize between the server and renderer at the end of a frame.

3.3.3 Test results

When rendering one frame the octahedrons application issues 322 OpenGL calls in total. 59 of them are synchronous, 263 are asynchronous. 72 calls can be cached. The asynchronous calls are split into 18 blocks.

The blocks of consecutive asynchronous calls could be relevant for further optimizations, therefore they are presented in a box diagram in Figure 3.3. Blocks of synchronous calls are not analysed further at the moment.

[Box plot; axes: async blocks, commands per block]

Figure 3.3: Block size for asynchronous calls, 1 frame

When rendering sixty frames, the octahedrons application issues 15190 OpenGL calls in total. 118 of them are synchronous, 15072 are asynchronous. 4497 calls can be cached. The asynchronous calls are split into 136 blocks.

The sizes of the asynchronous blocks when generating sixty frames are visualized in a box diagram in Figure 3.4.


[Box plot; axes: async blocks, commands per block]

Figure 3.4: Block size for asynchronous calls, 60 frames

3.3.4 Evaluation

The test results confirm that the OpenGL application, further called Octahedrons, is suitable for being streamed via Ethernet.

That is because most of the synchronous calls are cached. If we subtract the synchronous calls for the first frame, which includes setting up and deconstructing the scene, from the synchronous calls for sixty frames, 59 calls are left. So every additional frame requires only a single synchronous call, which keeps blocking effects low. The test results of the Octahedrons application cannot be compared directly to those of an OpenGL application running in a real environment, because the metric of synchronous/asynchronous calls is not based on existing literature. This metric is also limited to evaluating network transmission. The complexity of the scene and the resulting demand for rendering performance are not evaluated; determining rendering performance involves several different factors. Some, but not all, of them are the number of issued vertices, the size of the output picture, the complexity of the Shaders and the number of Shaders used.
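The subtraction used at the start of this evaluation can be written down as a small calculation. The helper name is chosen for illustration; the input numbers are the call counts reported in the test results for one and for sixty frames.

```python
def calls_per_additional_frame(first_frame, total, frames):
    """Average number of calls added by each frame after the first one."""
    return (total - first_frame) / (frames - 1)


# Call counts reported for the Octahedrons application (1 frame vs. 60 frames).
sync_per_frame = calls_per_additional_frame(59, 118, 60)
async_per_frame = calls_per_additional_frame(263, 15072, 60)

if __name__ == "__main__":
    print("synchronous calls per additional frame:", sync_per_frame)
    print("asynchronous calls per additional frame:", async_per_frame)
```

With these inputs the calculation gives one synchronous and 251 asynchronous calls per additional frame, which is in line with the large block sizes visible in Figure 3.4.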

Developing a metric which includes all relevant factors would go beyond the scope of this thesis. That is why the solutions used by comparable projects for this issue are presented. Comparable projects are ”Chromium”[11], ”ClusterGL”[13] and ”BroadcastGL”[8].

All mentioned applications implement streaming software for OpenGL instead of OpenGL ES. These applications are designed for the distribution of rendering scenes which are too complex for a single rendering hardware. These projects render multiple complex scenes as benchmark test and measure the achieved FPS.

ClusterGL renders a number of static cubes, which is comparable to the octahedrons application, but the authors state: ”The cube test is a simple stress test. While it is easy to analyse it does not necessarily represent real applications.”[13, p. 7] This leads to the assumption that the octahedrons application alone is not sufficiently representative for our use case.

To take a further look at PipeGL, a second reference application should be implemented. This new application should show whether or not the octahedrons application is representative of a real OpenGL application. If the octahedrons application is not representative, the reference application should serve as benchmark test.

3.4 Improvement of the OpenGL application

Using a real OpenGL ES application leads to two problems: First, PipeGL currently implements only a limited subset of OpenGL calls, and second, OpenGL ES applications for Linux systems are not as widespread as common OpenGL applications.

A representative test would be Q3lite, which is an OpenGL ES port of Quake. The Quake clone OpenArena is used as a benchmark test by ClusterGL [13, p. 8], but extending the implemented instruction set of PipeGL to the required number of instructions is out of the scope of this thesis.

A common model in computer graphics is the Utah Teapot. This model is more complex than the Octahedrons, and because it is quite popular there are free OpenGL implementations of the model data available. An example implementation of the Utah Teapot for OpenGL ES 2.0 is publicly available in the Android Native Development Kit samples [3]. This sample is usable with a reasonable extension of PipeGL’s instruction set.

Therefore, the Utah Teapot is used as reference for a real OpenGL implementation.

3.4.1 Analysis

Rendering a rotating teapot does not stress PipeGL as a computer game would do, but it has the following advantages over the octahedrons application:


The Utah Teapot is rendered as follows: All vertex data of the teapot is loaded into a vertex buffer object. Vertex buffer objects are stored on the OpenGL server side, which is usually the graphics card. The vertex data is transformed by multiplying it with a transformation matrix. Because the vertex data is transferred only once and afterwards only transformation matrices are transferred, the workload on the renderer increases while network traffic is reduced.

Moving the object back into the correct view space is done inside the vertex Shader. Vertex and fragment Shader are supplied as OpenGL Shading Language (GLSL) files which need to be compiled by the OpenGL implementation. This is done once at the start of the program. The transfer of Shaders is done by issuing a single instruction with a large buffer. This is a relevant corner case for PipeGL.

The Teapot sample is hosted by the Android NDK team, so it is unrelated to PipeGL. This matches the demand of having a universal OpenGL middleware.

3.4.2 Implementation

The code of the teapot sample is written in C++, but it depends strongly on the Android-API. The dependencies are bundled in a class called NDKHelper.

Computing the required OpenGL calls for rendering the teapot is done in the TeapotRenderer class. That process is supported by two components: a camera class which handles the user’s point of view and an OpenGL-specific mathematics header which provides a simple interface to interact with matrices and vectors.

The OpenGL Mathematics Library, short GLM, is a library designed to be close to GLSL. It provides the data structures that should be used for OpenGL, such as matrices and vectors. GLM is also used in Octahedrons. The mathematics header of the NDK has the same functionality, but it is not compatible with GLM. To keep the software maintainable and consistent, the mathematics header is replaced by GLM.

The TapCamera class handles the perspective on the OpenGL scene. It acts like a virtual camera that is placed in the OpenGL space and can be moved around freely. It contains some convenience functions like rotating around a fixed point or looking at a specified point. These features are implemented by changing transformation matrices that are passed to the vertex Shader. This is far more than needed by our test application. To reduce the complexity of this test application, the TapCamera is replaced by a static method that computes the current camera position and view angle relative to the object.

The camera should move on a circle with a fixed size around the Teapot, always facing towards it. The GLM function lookAt(vec3 eye, vec3 center, vec3 up) takes three vectors: the position of the camera, the position of the point the camera should look at and the upwards direction of the camera. This function is used to replace the TapCamera class. The teapot is placed in a fixed position and the camera moves in a circle around it. The scene of the OpenGL application is presented in Figure 3.5.

Figure 3.5: Scene of the new OpenGL application
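The circular camera movement can be illustrated with a small, dependency-free Python sketch. It is not the thesis code, which uses GLM in C++. The helper names, the horizontal circle and the row-major matrix layout are assumptions made for this example, and `look_at` follows the usual right-handed construction of a view matrix from eye, center and up.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def camera_position(angle, radius, center=(0.0, 0.0, 0.0)):
    """Point on a horizontal circle of the given radius around center."""
    return (center[0] + radius * math.cos(angle),
            center[1],
            center[2] + radius * math.sin(angle))


def look_at(eye, center, up=(0.0, 1.0, 0.0)):
    """Right-handed view matrix as a list of rows."""
    f = normalize(tuple(c - e for c, e in zip(center, eye)))  # forward
    s = normalize(cross(f, up))                                # side
    u = cross(s, f)                                            # corrected up
    return [
        [s[0], s[1], s[2], -dot(s, eye)],
        [u[0], u[1], u[2], -dot(u, eye)],
        [-f[0], -f[1], -f[2], dot(f, eye)],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform(matrix, point):
    """Apply the view matrix to a 3D point (w = 1)."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3]
                 for row in matrix[:3])
```

For every angle the teapot position ends up on the negative z-axis of the view space at a distance equal to the radius, which is what "always facing towards it" means for the camera.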

The main loop in the NDK does not only render the teapot, but also provides Android-specific functionality like handling touch and sensor data which is incompatible with the current project. As a non-interactive scene is enough for our requirements all this extra functionality is removed.

3.4.3 Test results

To compare the teapot application with the octahedrons application, the same tests are executed for the teapot application. At first the results for rendering one frame are presented: A total of 322 OpenGL calls are issued. 59 of these are synchronous calls, 263 asynchronous. Zero calls are cached. The block size of the asynchronous calls is visualized in a block diagram in Figure 3.6.


[Box plot; axes: async blocks, commands per block]

Figure 3.6: block size for asynchronous calls, 1 frame

When rendering 60 frames, a total of 1120 OpenGL calls are issued. 75 of them are synchronous calls, 1045 are asynchronous. Zero calls are cached. The amount of commands per block is visualized in Figure 3.7.

[Figure 3.7, box plot; axes: async blocks, commands per block]

3.4.4 Evaluation

When comparing the octahedrons application with the teapot application, the distributions of synchronous and asynchronous calls are similar, but the teapot application issues significantly fewer OpenGL calls. This leads to two results: First, the Octahedrons application behaves like a real OpenGL application in terms of streamability, and second, the Octahedrons application stresses the network transmission far more than the teapot application does. For the purposes of testing PipeGL, the teapot application serves as a minimal case as it does not need many OpenGL calls per frame. The Octahedrons application might not be too stressful for the rendering hardware, but it is still demanding in terms of the number of transmitted OpenGL calls. So it serves as a high-utilization task.

Further tests should consider using both applications: the octahedrons application as a high-utilization test and the teapot application as a minimal-utilization test.

3.5 Evaluation of PipeGL

As mentioned in the introduction a few tests were performed on PipeGL already, but this thesis has defined higher requirements on PipeGL which require further testing. This includes explicitly the use of more than one target as client device.

3.5.1 Definition of tests

The new test setup expands the original test described in Chapter 2. One x86 Linux computer is used as application server. It will run the OpenGL application with the PipeGL library. The scene is streamed to a varying number of targets, at most 3. Each of them is running a PipeGL renderer which displays the scene on its own screen. See Figure 3.8 for reference.

As mentioned in section 3.2.2, the most significant metric for PipeGL is frames per second. The targets have a v-sync mechanism that cannot be shut off, which leads to a maximum achievable frame rate of 60 FPS.

PipeGL comes with the two optimizations compression and framing. The previous project recommends using both optimizations or none. For this reason all tests will be executed with either both or no optimizations.


[Diagram: a Linux machine running the application is connected through a switch to Target 1, Target 2 and Target 3, each with its own screen]

Figure 3.8: setup for testing PipeGL

The network properties latency and bandwidth should be taken into account as well. As described in section 3.2.2, data rate and latency affect the performance in different ways. Therefore, testing both factors individually can give precise information about where to optimize PipeGL.

The limit of 50 ms latency is taken as guaranteed by the TSN standard. So latency tests should be performed from 0 ms to 50 ms. Bandwidth testing could be done to a maximum of 100 Mbit/s as this is the limitation of the used Automotive Ethernet. In a production context PipeGL would need to share this bandwidth with other components, so testing PipeGL with bandwidths up to 50 Mbit/s or less is sufficient.

To determine the scaling factor all those tests are executed for 1, 2 and 3 connected rendering clients. For each client the frame rate should be logged. To determine a frame rate for one test with a fixed number of clients the average frame rate of all connected clients should be used.

3.5.2 Implementation of tests

First, FPS measurement has to be implemented. The octahedrons application already includes time measurement using the Chrono library, which provides a high-resolution clock. The resolution of each time step is hardware dependent. For FPS measurement a precision of 10 ms is sufficient, and the provided clock’s precision is below 1 ms. The same measurement is implemented in the teapot application. Measuring is done in the main loop of the OpenGL applications, which generates one frame per iteration. The time needed for one iteration is measured and logged.

All following tests are done with one to three targets. Testing is done for each OpenGL application separately.

At first the Linux tool TC is used for setting the minimal latency and maximal bandwidth for the network interface connected to the PipeGL renderers. TC is a Linux tool for changing network settings at kernel level. This tool is extended by Netem which provides a simple interface for network emulation.
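A sketch of how such settings could be prepared from a Python test script is shown below. The thesis does not list the exact commands, so the helper name, the interface name and the use of the netem options `delay` and `rate` on the root qdisc are assumptions for illustration. The function only builds the argument list; running it requires root privileges.

```python
def netem_command(interface, latency_ms, bandwidth_mbit):
    """Build a tc/netem command that limits latency and bandwidth.

    Assumption: one netem qdisc on the root of the interface is sufficient.
    """
    command = ["tc", "qdisc", "replace", "dev", interface, "root", "netem"]
    if latency_ms > 0:
        command += ["delay", f"{latency_ms}ms"]
    command += ["rate", f"{bandwidth_mbit}mbit"]
    return command


if __name__ == "__main__":
    # "eth0" is a placeholder for the interface connected to the renderers.
    print(" ".join(netem_command("eth0", 10, 40)))
```

The resulting list could be passed to `subprocess.run(..., check=True)` before each test iteration.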

Then an SSH connection to the remote clients is opened, the required environment variables PIPEGL_IP, PIPEGL_PORT, USE_COMPRESSION and USE_FRAMING are set and the PipeGL renderer is started. The connection does not need to be kept open, so it is closed again.

After that the OpenGL application with the preloaded PipeGL library is started. Due to the current implementation of PipeGL there can only be one application connected to one target. Streaming to multiple targets is not implemented, because a TCP connection between client and server is used and TCP does not allow multicast. Parallel streaming to multiple targets is done by starting a new process for each target with its own instance of the application running. Each process streams to one target. The main process starts all those subprocesses and waits for them to be finished.
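The process-per-target approach can be sketched with the standard `subprocess` module. This is an illustration under assumptions: the helper name and the port value are hypothetical, and it is assumed here that the application side also selects its target through the environment variables PIPEGL_IP and PIPEGL_PORT.

```python
import os
import subprocess
import sys


def stream_to_targets(app_command, target_ips, port="5000"):
    """Start one application process per target and wait for all of them.

    Returns the exit codes in the order of target_ips.
    """
    processes = []
    for ip in target_ips:
        # Each process gets its own copy of the environment with its target.
        env = dict(os.environ, PIPEGL_IP=ip, PIPEGL_PORT=port)
        processes.append(subprocess.Popen(app_command, env=env))
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Stand-in for the real application: print the target it was given.
    demo = [sys.executable, "-c",
            "import os; print('streaming to', os.environ['PIPEGL_IP'])"]
    print(stream_to_targets(demo, ["192.0.2.1", "192.0.2.2"]))
```

All processes are started before the first `wait()`, so the applications run in parallel and the main process only returns once the slowest one has finished.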

The logged iteration times for each application are transformed to FPS. The average FPS of all executed applications in this iteration is computed and saved as result. Then a new SSH connection is opened to terminate the remote rendering clients.
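The conversion from logged iteration times to a single result value could look like the following sketch. The function names are illustrative, and it is an assumption that the frame rate of one client is computed as the number of frames divided by the total time rather than as the mean of per-frame rates.

```python
def fps_from_iteration_times(times_s):
    """Frames per second of one client from its per-frame times in seconds."""
    return len(times_s) / sum(times_s)


def average_fps(per_client_times):
    """Mean frame rate over all clients of one test iteration."""
    rates = [fps_from_iteration_times(times) for times in per_client_times]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    one_client = [[1 / 60] * 300]  # 300 frames rendered at a steady pace
    print("average FPS: %.1f" % average_fps(one_client))
```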

3.5.3 Test results

The tests are done with 300 Frames. All results are included in Appendix A. Visualization of the test results is done as follows:

The first test is the bandwidth test. Latency is not explicitly set, ping tests show a latency of approximately 1.5 ms. This test is done with both OpenGL applications.


Bandwidth

For bandwidth testing the latency is not limited. The average round trip time from the application server to each renderer is measured as 1.5 ms. Based on a-priori tests the maximum bandwidth is set to 40 Mbit/s. For the a-priori test the octahedrons application was executed with one connected renderer. In the first iteration the bandwidth was set to 100 Mbit/s, and it was reduced by 10 Mbit/s in every following iteration. The last iteration in which PipeGL performed with approximately 60 FPS was at 30 Mbit/s; another 10 Mbit/s were added to visualize differences when using more targets.

For better visualization of frame rate changes in the low bandwidth area, the tests from 0 to 14 Mbit/s are performed in steps of 2 Mbit/s. Further steps are 20, 30 and 40 Mbit/s.

When comparing PipeGL to the streaming of a video, a stream with 720p at 30 FPS is the closest match for the used display resolution. Compressed with the H.264 compression algorithm, this video stream requires a bandwidth of approximately 8 Mbit/s [4, p. 434]. A vertical line at 8 Mbit/s is added to the bandwidth test results as reference.

First, the results for the OpenGL application Octahedrons are displayed. Figure 3.9 shows the results if optimization is disabled and Figure 3.10 displays the result with the optimizations framing and compression activated.

[Plot; axes: bandwidth in Mbit/s, frames per second; series: 1 Target, 2 Targets, 3 Targets]

Figure 3.9: Bandwidth test Octahedrons

Second, the test results for the new OpenGL teapot application are presented. Figure 3.11 shows the results with disabled optimization and Figure 3.12 presents the results with enabled optimization. The parameters are the same as before.


[Plot; axes: bandwidth in Mbit/s, frames per second; series: 1 Target, 2 Targets, 3 Targets]

Figure 3.10: Bandwidth test Octahedrons with optimizations enabled

Figure 3.11: Bandwidth test Teapot


Latency

A steady bandwidth of 40 Mbit/s is used for latency testing. The greyed-out area represents frame rates beyond 1 frame per latency interval; these frame rates are unachievable if PipeGL has to wait for at least one synchronous call per frame.
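The boundary of the greyed-out area can be expressed as a simple bound. The sketch below assumes exactly one blocking wait of one latency interval per frame and the 60 FPS v-sync limit of the targets; the function name is illustrative.

```python
def max_fps(latency_ms, vsync_fps=60.0):
    """Upper bound on the frame rate with one synchronous call per frame."""
    if latency_ms <= 0:
        return vsync_fps
    return min(vsync_fps, 1000.0 / latency_ms)


if __name__ == "__main__":
    for latency in (0, 10, 20, 30, 40, 50):
        print("%2d ms -> at most %.1f FPS" % (latency, max_fps(latency)))
```

Up to roughly 16.7 ms the v-sync limit dominates; beyond that the bound falls off with 1/latency.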

At first the results for the octahedrons application are presented. Figure 3.13 shows the results if framing and compression are disabled, Figure 3.14 displays the result with framing and compression enabled.

[Plot; axes: latency in ms, frames per second; series: 1 Target, 2 Targets, 3 Targets]

Figure 3.13: Latency test Octahedrons


Figures 3.15 and 3.16 show the results for the new test application teapot. In Figure 3.15 the results with disabled optimizations are displayed and in Figure 3.16 the results with enabled optimizations are presented. The parameters are the same as used for the octahedrons application.

[Plot; axes: latency in ms, frames per second; series: 1 Target, 2 Targets, 3 Targets]

Figure 3.15: Latency test Teapot

Figure 3.16: Latency test Teapot with optimization enabled

3.5.4 Evaluation

The frame rate drops from 60 to 30 FPS in Figures 3.10, 3.14 and 3.16 are described in the underlying project paper as well. They seem to be connected to the target’s v-sync mechanism. With this mechanism the results tend to reside around divisors of 60 FPS, so if 60 FPS are not achievable for PipeGL, the performance drops to 30 FPS. A solution for this issue has not been found. Connecting more targets reduces this effect, because the average FPS of all renderers is determined. If the optimizations of PipeGL are not enabled, this drop is not present.

The mark of 8 Mbit/s in the test results can be used as a reference for how well OpenGL streaming would perform in a direct comparison to video streaming. PipeGL produces more than 30 FPS with the current optimizations, which leads to the conclusion that reducing the currently required bandwidth should be a secondary objective.

Scalability of the used bandwidth is an issue. PipeGL streams the same application to different targets, but because only 1:1 connections are supported, it sends up to three independent streams. The required bandwidth rises accordingly, which might be avoidable.

Latency is an issue regardless of the number of connected renderers. It still affects the frame rate even if most of the calls are asynchronous. Avoiding synchronous calls is a good starting point, but does not solve the problem completely. The frame rate might still be affected by the fact that a TCP connection is used. However, reducing the impact of latency is a challenging task, because a single synchronous call forces the application to wait a whole round trip time. At 60 FPS the time to render a new frame is 16 ms, so with more than 16 ms of latency this frame rate cannot be achieved. The two applications currently in use avoid any synchronous calls other than glClear().

To conclude this evaluation, it can be said that PipeGL performs at a level comparable to video streaming with regard to bandwidth. Latency is an issue, as is scalability. Further optimizations should improve the current results in terms of scalability and FPS at limited latency.


Chapter 4 Optimization of PipeGL

The analysis exposes that latency and scalability require further optimization. These issues are targeted in this chapter.

4.1 Objective of the optimization

Scaling PipeGL to stream to multiple clients is not possible since PipeGL is written as a point to point TCP application. As TCP does not support multicast, UDP is used as network protocol.

Switching from TCP to UDP should improve the quality of the video stream at high latency. This is because TCP is connection-oriented and acknowledges the reception of every packet. UDP is datagram-oriented and makes no guarantees for complete or correctly ordered transmission. On the other hand, PipeGL has to deal with the packet loss introduced by using UDP. This has to be further evaluated.

Making PipeGL a multicast application should not impair the current latency and bandwidth achievements. It should work at least for non-optimized PipeGL and additionally for optimized PipeGL. With this optimization, PipeGL should be able to stream one application to any number of connected clients. The constructed scene should look the same as if executed locally. A change in the order of OpenGL calls should be prevented if it has an effect on the resulting image. PipeGL should reply to OpenGL calls depending on return values with a suitable answer.


4.2 Structure

The optimization will be applied in three steps. At first, the network protocol will be switched from TCP to UDP. Increases in performance are measured and the packet loss problem is evaluated.

In the next step, a multicast protocol for streaming OpenGL calls is developed and implemented. Packet loss should be managed by the new protocol with the result that a reliable OpenGL multicast stream is achieved.

In the third step, PipeGL is fitted to handle multiple rendering clients with one application server. This includes the handling of multiple different return values.

4.3 Replacing TCP with UDP

The pipes and filters pattern reduces the effort to change the network protocol. As seen in Figure 2.2, the TCP connection is implemented in the TCPClientHandler and the TCPServerHandler. Those handlers are each replaced by one UDPSocketHandler. The fact that UDP is not connection-oriented allows using the same implementation for both the server and the client.

The implementation is split up into the UDPSocketHandler, which implements a Wangle handler, and the BoostUDPSocket, which implements the required methods to communicate with the other side.

4.3.1 BoostUDPSocket

The BoostUDPSocket is initialized with the IP addresses and ports of both the local and the remote service. Then it creates a socket on the local machine which listens for packets sent from the remote service. It also initializes the UDPSocketHandler and passes a reference to the created socket.

Listening for UDP packets is handled by the Boost context thread. Incoming packets are received in a loop. The payload of every new UDP datagram is passed to the read() method

4.3.2 UDPSocketHandler

The UDPSocketHandler implements a Wangle handler, so it needs to implement a read() and a write() method. The read() method accepts a buffer object and moves it up the pipeline. This method is invoked by the BoostUDPSocket.

The write() method accepts a buffer object and prefixes it with a four-byte integer containing the buffer size. Then it uses the async_send_to function of the UDP socket to send the prefixed buffer via the Ethernet connection. async_send_to requires a return handler to execute when sending is completed. This handler needs to be finished before the next async_send_to call is issued.
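The size prefix can be illustrated independently of Boost and Wangle. The following sketch uses Python's `struct` module; the function names are hypothetical and the network byte order of the prefix is an assumption, as the text only states that the prefix is a four-byte integer.

```python
import struct

# Four-byte unsigned integer; network byte order is an assumption.
PREFIX = struct.Struct("!I")


def add_length_prefix(payload: bytes) -> bytes:
    """Prefix the payload with its size, as done before sending."""
    return PREFIX.pack(len(payload)) + payload


def strip_length_prefix(datagram: bytes) -> bytes:
    """Check the announced size on the receiving side and return the payload."""
    (size,) = PREFIX.unpack_from(datagram)
    payload = datagram[PREFIX.size:]
    if len(payload) != size:
        raise ValueError("datagram is incomplete or carries extra data")
    return payload
```

With such a prefix the receiver can detect a datagram whose payload does not match the announced size before passing it up the pipeline.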

To ensure that the return handler is finished before the next async_send_to call is issued, mutual exclusion is used. The approach is visualized in Figure 4.1.

If the main thread wants to issue a new async_send_to call, it checks if a send handler is already running. If so, the write method waits until the send handler is finished. A condition variable is used to wake up the write thread from within the sending thread. The write thread wakes up when the send thread is finished and then issues the next async_send_to call.

[Flow chart: write waits for send_handler_running, wakes up, locks send_handler_running and calls async_send_to(); async_send_to() sends the message and releases send_handler_running]

Figure 4.1: workflow of async_send_to()
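The interplay of the flag and the condition variable can be reproduced with Python's `threading` module. The sketch below is an analogy and not the Boost implementation: the class name is hypothetical, and a worker thread per message stands in for the asynchronous send and its return handler.

```python
import threading


class SerializedWriter:
    """Allows only one send (including its completion handler) at a time."""

    def __init__(self, send):
        self._send = send  # blocking stand-in for the asynchronous send
        self._cond = threading.Condition()
        self._send_handler_running = False

    def write(self, payload):
        with self._cond:
            # Wait until the previous completion handler has finished.
            while self._send_handler_running:
                self._cond.wait()
            self._send_handler_running = True
        worker = threading.Thread(target=self._send_and_finish,
                                  args=(payload,))
        worker.start()
        return worker

    def _send_and_finish(self, payload):
        try:
            self._send(payload)
        finally:
            # Completion handler: clear the flag and wake up the writer.
            with self._cond:
                self._send_handler_running = False
                self._cond.notify()
```

Because write() only returns after it has claimed the flag, messages are handed to the network strictly in the order in which they were written.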

The new async_send_to statement invokes the return handler in the context of the network thread, see Listing 4.1. When the return handler finishes, it acquires the mutex and sets send_handler_running to false. After that it notifies the thread waiting on the condition variable and releases the mutex.

    std::unique_lock<std::mutex> lock(send_thread_mutex);  // acquire the mutex
    send_handler_running = false;                          // the send handler has finished
    send_thread_cond.notify_one();                         // wake up the waiting write thread
    lock.unlock();                                         // release the mutex

Listing 4.1: Mutex example network thread
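The interplay between the write thread and the network thread can be sketched as below. The variable names follow the text (send_thread_mutex, send_thread_cond, send_handler_running), but the class SendGate and the function max_concurrent_handlers are hypothetical, and a plain std::thread stands in for the Boost network thread and for async_send_to. It is a sketch under these assumptions, not the PipeGL implementation.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper around the three variables named in the text.
class SendGate {
public:
    // Write thread: wait until no send handler is running, then claim the slot.
    void begin_send() {
        std::unique_lock<std::mutex> lock(send_thread_mutex);
        send_thread_cond.wait(lock, [this] { return !send_handler_running; });
        send_handler_running = true;
    }

    // Network thread: last step of the return handler (same steps as Listing 4.1).
    void end_send() {
        std::unique_lock<std::mutex> lock(send_thread_mutex);
        send_handler_running = false;
        send_thread_cond.notify_one();
        lock.unlock();
    }

private:
    std::mutex send_thread_mutex;
    std::condition_variable send_thread_cond;
    bool send_handler_running = false;
};

// Issues `messages` simulated sends and returns the largest number of
// handlers that were ever active at the same time.
int max_concurrent_handlers(int messages) {
    assert(messages >= 0);
    SendGate gate;
    std::atomic<int> in_flight{0};
    std::atomic<int> max_in_flight{0};
    std::vector<std::thread> network_threads;

    for (int i = 0; i < messages; ++i) {
        gate.begin_send();  // blocks while the previous handler is still running
        network_threads.emplace_back([&] {
            const int now = ++in_flight;  // simulated "send" starts
            int seen = max_in_flight.load();
            while (seen < now && !max_in_flight.compare_exchange_weak(seen, now)) {
            }
            --in_flight;      // simulated "send" is done
            gate.end_send();  // wake up the write thread
        });
    }
    for (auto& thread : network_threads) {
        thread.join();
    }
    return max_in_flight.load();
}
```

Because begin_send() only returns after the previous handler has called end_send(), at most one simulated send handler is active at any time.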

4.3.3 Throttling

When the tests for PipeGL are applied to the new application, the results are approximately the same, as seen in Figures 4.2 and 4.3.

[Figure 4.2: frames per second over bandwidth in Mbit/s for the variants UDP, TCP, UDP + opt and TCP + opt]

[Figure 4.3: frames per second over latency in ms for the variants UDP, TCP, UDP + opt and TCP + opt]

The Octahedrons application issues commands faster than the network can transport them. An analysis with Wireshark shows that all packets are sent from the UDP socket without an error, but the application freezes before finishing. The reason is that PipeGL does not check whether packets get lost. If a synchronous packet is lost, the application server freezes until it receives a return value from the rendering client. The rendering client is not aware that a packet is missing and waits for the next one. This leads to a deadlock. Currently not a single test with the octahedrons application terminates successfully.

This issue is addressed by throttling the network thread on the client side. Between two calls of the send handler one UDP datagram is transmitted, so when the time between two calls of the send handler is less than 50 µs, the thread sleeps for the remaining time. The value of 50 µs was determined through testing: it is the lowest interval at which the octahedrons application finishes successfully in more than about 80% of executions.

Packet loss is addressed in the next section. This test therefore does not aim to provide a meaningful analysis of packet loss, but it allows the UDP implementation to be tested.
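The throttling rule can be illustrated with the following sketch. The class SendThrottle and its method wait_for_slot are hypothetical names and not taken from PipeGL. The sketch assumes that the throttle is called once at the start of every send handler invocation and that std::this_thread::sleep_for is an acceptable way to wait.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical throttle: enforces a minimum interval between two calls.
class SendThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit SendThrottle(std::chrono::microseconds min_interval)
        : min_interval_(min_interval) {
        assert(min_interval.count() >= 0);
    }

    // Call once per send handler invocation. If the previous call was less
    // than min_interval ago, sleep for the rest of the interval.
    void wait_for_slot() {
        if (has_last_call_) {
            const auto elapsed = Clock::now() - last_call_;
            if (elapsed < min_interval_) {
                std::this_thread::sleep_for(min_interval_ - elapsed);
            }
        }
        last_call_ = Clock::now();
        has_last_call_ = true;
    }

private:
    std::chrono::microseconds min_interval_;
    Clock::time_point last_call_{};
    bool has_last_call_ = false;
};
```

A send handler would create one SendThrottle with std::chrono::microseconds(50) and call wait_for_slot() before each datagram. Note that sleep_for only guarantees a lower bound, so on a general-purpose operating system the actual pause may be longer than the requested remainder.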

4.3.4 Evaluation

The UDP variant of PipeGL is tested in the same way as the TCP variant, with the limitation that currently only one target is used. In Figures 4.4 and 4.5 the results of the PipeGL UDP variant are compared to the results of the TCP variant when using the teapot application. Figure 4.4 displays the constrained bandwidth test and Figure 4.5 displays the constrained latency test.

[Figure 4.4: frames per second over bandwidth in Mbit/s for the variants UDP, TCP, UDP + opt and TCP + opt]

[Plot: frames per second over latency in ms for the variants UDP, TCP, UDP + opt and TCP + opt]

Figure 4.5: Latency test Teapot

The teapot application represents a minimal case of an OpenGL application. In this case the UDP optimization of PipeGL performs as well as the TCP variant. The UDP optimization is therefore a success: UDP performs as well as TCP while allowing the implementation of multicast streaming.

Figures 4.2 and 4.3 show that the throttling does not affect PipeGL significantly when streaming the teapot application. Since the teapot application is a minimal case, this might not cover all requirements.

In the following, the current optimization is compared with the TCP variant when using the octahedrons application. Figures 4.6 and 4.7 compare the FPS achieved under limited bandwidth by the UDP variant connected to one target with the TCP variant, which is connected once to one target and once to three targets. Figure 4.6 displays the results of streaming the Octahedrons application while framing and compression are disabled. Figure 4.7 displays the results when framing and compression are enabled.

Figure 4.6 shows that the current UDP variant performs significantly worse than the TCP variant. The maximum achievable frame rate is 40 FPS. This is caused by the throttling method currently in use. The throttling is necessary as long as the UDP optimization does not handle packet loss. As handling packet loss is planned for this optimization as well, the performance decrease caused by throttling is acceptable for now.
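To see why a fixed send interval caps the frame rate, the bound implied by the 50 µs interval can be computed as in the sketch below. The function names are hypothetical. The text does not report how many datagrams one frame of the octahedrons application needs, so the result is only an upper bound under the assumption that the throttle is the sole bottleneck.

```cpp
#include <cassert>
#include <chrono>

// Upper bound on datagrams per second when at least `interval` has to pass
// between two consecutive datagrams.
long long max_datagrams_per_second(std::chrono::microseconds interval) {
    assert(interval.count() > 0);
    const std::chrono::microseconds one_second = std::chrono::seconds(1);
    return one_second.count() / interval.count();
}

// Upper bound on the datagrams that can be sent per frame at frame rate `fps`.
long long max_datagrams_per_frame(std::chrono::microseconds interval, long long fps) {
    assert(fps > 0);
    return max_datagrams_per_second(interval) / fps;
}
```

For an interval of 50 µs this gives at most 20,000 datagrams per second, which corresponds to at most 500 datagrams per frame at 40 FPS. An application that needs more datagrams per frame cannot exceed this frame rate, regardless of the available bandwidth.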

[Plot: frames per second over bandwidth in Mbit/s for UDP, TCP with 1 target and TCP with 3 targets]

Figure 4.6: Bandwidth test Octahedrons

[Plot: frames per second over bandwidth in Mbit/s for UDP, TCP with 1 target and TCP with 3 targets]

Figure 4.7: Bandwidth test Octahedrons with framing

The test results for the UDP variant are higher than the ones for the TCP variant with three connected targets. If using UDP with IP multicast does not decrease the frame rate results significantly, then even the current UDP implementation would be more performant than using TCP with three targets.

Figure 4.8 shows the differences between the optimized and the original PipeGL when limiting latency. The UDP variant without framing suffers from the throttling mechanism as well. It reaches a maximum of 40 FPS and lies 20 FPS behind the TCP variant. This effect is nullified at a latency of 30 ms, where both variants reach only 30 FPS. Beyond this point both curves are approximately the same. If framing is activated, the UDP variant

