Application of 3D human pose estimation for motion capture and character animation

(1)

Anastasiia Borodulina

APPLICATION OF 3D HUMAN POSE

ESTIMATION FOR MOTION CAPTURE AND

CHARACTER ANIMATION

Master’s Thesis

Degree Programme in Computer Science and Engineering

June 2019

(2)

Science and Engineering. Master’s thesis, 57 p.

ABSTRACT

Interest in motion capture (mocap) technology is growing every day, and the num-ber of possible applications is multiplying. But such systems are very expensive and are not affordable for personal use. Based on that, this thesis presents the framework that can produce mocap data from regular RGB video and then use it to animate a 3D character according to the movement of the person in the orig-inal video. To extract the mocap data from the input video, one of the three 3D pose estimation (PE) methods that are available within the scope of the project is used to determine where the joints of the person in each video frame are located in the 3D space. The 3D positions of the joints are used as mocap data and are imported to Blender which contains a simple 3D character. The data is assigned to the corresponding joints of the character to animate it. To test how the cre-ated animation will be working in a different environment, it was imported to the Unity game engine and applied to the native 3D character. The evaluation of the produced animations from Blender and Unity showed that even though the qual-ity of the animation might be not perfect, the test subjects found this approach to animation promising. In addition, during the evaluation, a few issues were discovered and considered for future framework development.

Keywords: Mocap, Computer vision, Deep neural networks, Blender, Unity, YOLO, Video processing

(3)

ABSTRACT TABLE OF CONTENTS FOREWORD ABBREVIATIONS 1. INTRODUCTION 6 2. MOTION CAPTURE 7 2.1. History . . . 7 2.2. Performance capture . . . 8

2.3. Motion capture systems . . . 8

2.3.1. Optical systems . . . 9

2.3.2. Non-optical systems . . . 10

2.4. Motivation . . . 11

3. RELATED WORK ON POSE ESTIMATION 12 3.1. Neural networks . . . 12

3.2. Heatmaps . . . 14

3.3. 2D Methods . . . 14

3.3.1. Image based single person PE . . . 14

3.3.2. Image based multi-person PE . . . 15

3.3.3. Video based single person PE . . . 16

3.3.4. Video based multi-person PE . . . 17

3.4. 3D Methods . . . 17

3.4.1. 3D pose estimation . . . 17

3.4.2. 3D pose and shape estimation . . . 22

3.4.3. Other methods . . . 26 4. ANIMATION 27 4.1. Animation tools . . . 27 4.1.1. Blender . . . 27 4.1.2. Unity . . . 27 4.2. 3D character design . . . 28 4.2.1. Mesh . . . 28 4.2.2. Armature . . . 29 4.2.3. Rigging . . . 29 4.2.4. Skinning . . . 29 4.3. Animation generation . . . 31 4.3.1. Keyframing . . . 31

4.3.2. Using motion capture data for animation . . . 31

(4)

5.2. Mapping and animation in Blender . . . 38

5.2.1. Overlay of 3D data to the rig . . . 38

5.2.2. Constrains and modifiers . . . 39

5.2.3. Animation baking . . . 41

5.3. Blender to Unity transition . . . 42

5.3.1. Ways of importing the animation to Unity . . . 42

5.3.2. Differences of environments . . . 43

6. EXPERIMENTAL EVALUATION 44 6.1. Evaluation of 3D PE methods . . . 44

6.1.1. Dataset and evaluation metrics . . . 44

6.1.2. Evaluation results . . . 44

6.2. Animation evaluation . . . 45

6.2.1. Data collection . . . 45

6.2.2. Experimental evaluation results . . . 48

7. DISCUSSION 50

8. CONCLUSION 52

(5)

The work on this master’s thesis was performed in the Center for Machine Vision and Signal Analysis (CMVS) at the University of Oulu.

I want to express the gratitude to my supervisor Prof. Janne Heikkilä for all of the guidance, patience, and support that he provided during my studies and work on this thesis project.

Also, I would like to thank my family and friends for always supporting and helping me on my way to this Master’s degree.

Oulu, Finland June 21, 2019

(6)

2D Two dimensional space

3D Three dimensional space

AI Artificial intelligence

API Application programming interface

AR Augmented reality

BB Bounding box

BVH Biovision hierarchy data

CG Computer graphics

CGI Computer-generated imagery

CNN Convolutional neural network

CV Computer vision

DLCM Deeply learned compositional model

DNN Deep neural network

DPNet 3D pose network

FFN Feedforward neural network

GT Ground truth

IMU Inertial measurement unit

LED Light-emitting diode

LFTD Lifting from the deep

LSTM Long short-term memory

Mocap Motion capture

MPJPE Mean per joint position error

NBF Neural body fitting

NN Neural network

PAF Part affinity fields

PE Pose estimation

PRCNN Pairwise ranking convolutional neural network

ReLU Rectified linear units

ResNet Residual neural network

RGB Red, green, blue

RNN Recurrent neural network

SMPL Skinned multi-person linear model

USD United States Dollar

VFX Visual effects

VR Virtual reality

(7)

1. INTRODUCTION

Motion capture (Mocap) [1, 2, 3] is a procedure of tracking and capturing the motion of humans or other objects using special equipment and software. The recorded data can be applied to the computer model for further animation or research purposes. During the last decades this technology became widely used in such different areas as military, robotics [4], medical applications [5, 6], game design, animation, and creation of visual effects (VFX), etc.

Unfortunately, the mocap technology is expensive that is why it is used mostly by professionals. Therefore, any person who would like to try mocap for developing their own game, animation or some other product will face the problems of availability of mocap systems. This project is looking for a solution in the 3D human pose estimation (PE) methods and aims to develop a framework that utilizes free open-source tools to produce the mocap data.

Human PE is related to the field of computer vision (CV) and its task is to detect human position, shape, and orientation in different images and videos. It means to localize different keypoints, also called joints, that belong to a specific object or a person. After that obtained joints can be connected in a graph which represents the skeleton of the person at the image.

Thus, the main objective of this thesis is developing a framework that would produce an animated character from the real video of human motion. The workflow of the project is expected to be so that the user would provide the system with a video of a person performing some action. The developed framework will run one of three available 3D PE methods to define the position of the human in the 3D space. The exact PE method is defined by the user during the submission of the input video. Next, the output from the previous step is used as a mocap data which is automatically mapped to the 3D character. As a result, the character should be animated in accordance with the input video that was given to the system. Produced character and its animation can be further imported to other software and used in motion pictures, games, etc.

The structure of the thesis consists of the following parts. Chapter 2 takes a closer look at the mocap technology, its history, and different types of systems that allow capturing the motion of the object. Chapter 3 gives an explanation on the topic of hu-man PE and related concepts. In addition, this part provides a description of different types of 2D and 3D methods. Chapter 4 describes the most important concepts of 3D character animation. Chapter 5 is provided with the overview and detailed descrip-tion of the framework that combines all of the above-described concepts. Chapter 6 consists of two parts. The first one provides the experimental evaluation of the 3D pose estimation methods while the second part explains the subjective evaluation of the animation process and its results. Chapter 7 contain the discussion of the obtained outcome and possible improvements to the developed framework. Finally, Chapter 8 makes conclusions about the thesis project.

(8)

2. MOTION CAPTURE

From the beginning of the 21st century mocap popularity in the entertainment area started to grow exponentially. Nowadays, it is very common that the motion of a professional actor is recorded and applied to 2D or more often to 3D character model, so its movement during animation would entirely repeat the acting. Figure 1 illustrates the mocap recording process for Human3.6M dataset [7] creation. This method allows having a smooth animation that is extremely hard to get using traditional methods. It makes the characters look realistic and alive. A great example of the mocap use is famousGollumfrom the epic adaptation of theThe Lord of the Ringsseries played by Andy Serkis. This character made a revolution in computer-generated imagery (CGI).

Figure 1: Mocap data recording process for Human 3.6 dataset [7].

2.1. History

To most of the people, the term "animation" seems to be quite young. In fact, the story of animation has begun at the beginning of 19thcentury with the creation of the phenakistiscope [8]. This device had a set of images that were moving in a circle. By reaching a certain rotational speed it created an illusion of motion. This effect is based on the persistence of the human vision which means that very fast motion of the object is not perceived by human eyes and is interpreted as a solid image. In general, it is the basis of the cinematography because anything which is shown on the screen is just a sequence of constantly changing images.

In early 20th _{century the traditional animation was born when one of the first}

(9)

technique was used to create the most famous cartoon character -Mickey Mouseback in 1923 by Walt Disney.

In 1914 Max Fleischer made the life of the animators easier by creating the rotoscop-ing technique. In the heart of this method is the video sequence of actors’ performance which is projected to the thin paper and the animator just need to outline the picture in each shot frame by frame. Then all of the new pictures were photographed and transformed into one cartoon video. This helped to simplify and speed up the process. Shortly, rotoscoping became widely used in cartoon production among the animators including Walt Disney. His first work made with this technique was Snow White and the Seven Dwarfs (1937). Since then the method became very popular. Nevertheless, with the advent of the CGI era in the 1990s, the classical rotoscoping became rare although it is still used in some projects.

CGI has adopted the idea of rotoscoping for its own needs - usually, it is used to add different objects in the scene. For example, when it is needed to add a crowd or more decoration to the scene or any other effects after shooting on a green screen. As a result, a lot of motion pictures and even computer games would not exist without rotoscoping.

Back in 1995, the Pixar Animation Studios opened a great world of CGI to the viewer in their Toy Story. The main difference with everything that had been made before was that the characters were no longer flat 2D images, instead, they became 3D models of humans-alike creatures, animals and so on. It was a revolution in the animation, VFX and game industry. Now the studios are competing who will be first to create more realistic characters, more immersing scenes, and fascinating words.

Thus mocap took over the baton from the rotoscoping and occupied one of the most crucial positions in the entertainment industry.

2.2. Performance capture

Mocap name speaks for itself that the main target is to capture the subject’s movement itself and that is how it was till the turn of the millennia. In the early 2000’s the tech-nology advanced enough to catch even small motion features like facial expressions, lips’ and fingers’ movements. Since then, it became more than just recording of a gen-eral body motion but a real performance capture. The actors have to really become their characters and live it. Benedict Cumberbatch, as an example, made a fascinating performance capture work as a Smaug forThe Hobbitseries.

Performance capture made a big step for the CGI. Over the last almost 20 years it has made a significant progress from iconicGollumfromThe Lord of the Ringsseries (2001-2003) andKing Kong(2004) toAvatar(2009) andRise of the Planet of the Apes series (2011-2017).

2.3. Motion capture systems

Usually, the application of mocap technology in digital character animation is compli-cated and involves the use of professional equipment, like cameras, sensors, and other. Altogether, it is a massive and complicated system to set up. There are two main types

(10)

of mocap systems that are based on different principles [1, 2, 3, 6, 9, 10]: optical and non-optical.

2.3.1. Optical systems

The most popular mocap systems are optical. It is called so because the actor has to wear a special costume with a set of markers on it. The markers are located in the places of the most interest for good animation. Usually, it is at the positions of the joints. This allows the system to see them with no problems. To calculate an exact position of the markers multiple high-resolution well-calibrated cameras are used. The number of cameras varies from a minimum of 2 to tens. Each of them provides a 2D location of each reflector. Then the data from all of the cameras is combined by a specific software using triangulation to calculate the location of the markers in a 3D space and thus reproduce the performance.

Optical systems have a very high frame rate which allows recording quick move-ments like dances or martial arts. The frame rate depends on the resolution of the camera and can reach 200 frames per second and even more. At the same time, the costumes and reflectors are light and do not interfere with the natural movements of the actors [2]. The actors can move around big spaces but the technology is sensitive to light conditions and overlaps with other objects. So it can be successfully applied only in specially equipped premises. In addition, the setups that require such a big amount of equipment and sophisticated software are very expensive in use [2]. Its price easily goes over 250,000 USD [10].

Active markers

The optical systems use two types of markers: active and passive. Active markers represent a set of light-emitting diodes (LED) placed on the actor’s costume that emits light. Each marker has own identifier which helps the system to distinguish joints from each other. It is very useful when there are multiple actors on the scene or some body parts disappear due to overlapping and then pop up again. But additional light sources on the scene can create an effect of ghost markers for the system [9].

Passive markers

Passive markers are also called reflective because the markers are covered with retro-reflective material that has to reflect the infrared light that comes from the LEDs lo-cated next to the lens of the tracking camera back to the camera. An example of mocap system with passive markers is shown in the Figure 1. When comparing to active markers, passive ones do not have any identifiers so when the markers are too close to each other the system may confuse them with each other [11]. This and other issues can result in noisy data and sometimes gaps in the data. To avoid these distortions it is good to use multiple cameras. For example, to solve this issue up to 300 cameras can be used while in a regular setup it might be from 6 to 24 cameras [10].

(11)

Optical systems are extremely popular among animators, VFX, and game develop-ment studios because it produces detailed and accurate data with a high frame rate which allows creating high-quality realistic character motion.

2.3.2. Non-optical systems

Aside from optical systems, there are other methods which are combined in a group of non-optical methods. Depending on the types of sensors there are magnetic, mechani-cal, inertial and other, less popular technologies [3]. There is no perfect system, as all of them have pros and cons, so the choice of the system depends on the task.

Mechanical

One of the cheapest mocap systems is mechanical. Its affordability allows this system to be quite popular. To capture the data the actor has to wear so-called, exoskeleton which repeats the subject’s movements. It consists of a set of narrow metal or plastic strips that are fixed around the actor’s joints and back. At each connection point, there is a goniometer which plays a role of the sensor that provides the data about the angles between the corresponding stripes [9].

The sensors might transmit the data using a wireless connection or be connected us-ing cables. Wireless sensors allow the actor to move around big areas and not worryus-ing where the camera is. It gives an advantage over optical and inertial systems. At the same time, systems that use cables can provide a lot of motion limitations and discom-fort to the subject [2, 3]. Collected by goniometers data after processing by kinematic algorithms produces the body position.

This kind of setup creates on its own some complications as sensor displacement during the movement can result in big accuracy error. At the same time, it is hard to place the sensor correctly, in accordance to the joints. Also, the soft tissue volume can affect the measurements. In addition, some joints like shoulders or hips have a few rotational degrees of freedom, and thus angle measurement using just goniometers becomes a non-trivial task. All of these factors have a significant impact on the sys-tem accuracy, so it requires frequent re-calibration [2, 9]. Nevertheless, the pose data obtained in this way still do not look very realistic.

Magnetic

Magnetic systems by their appearance might remind the mechanical ones. However, there is a significant difference - the magnets play the role of sensors. In addition, there is a source of the low-frequency electromagnetic field. The magnets are placed all over the body and together with the source are connected to each other and to the processing block with a set of wires, which create a lot of limitation to the actor’s motion. This unit is responsible for tracking the changes in the electromagnetic field caused by the sensors’ movements. Further, this data is transmitted to the computer where a specific software recalculates the 3D location and rotation of each joint [2, 3, 11].

The use of magnetic fields entails multiple challenges. First of all the system is very sensitive to any influences from outside that can be caused by other electrical

(12)

devices and equipment. Moreover, the metal constructions in the walls, ceiling, other surrounding structures, electrical wires inside the premises can cause data distortion. Nevertheless, the system does not have any problems with body parts overlapping because the human tissue does not affect the magnetic field [9]. But the presence of a few actors on the stage at the same time will again have an impact on the accuracy due to the fact that the sensors of different people will interfere the measurements of each other [3, 11]. At the same time, magnetic field is inversely proportional to the distance from its source. Thus the capturing area is quite small [9].

Despite all the shortcomings, this system can be relatively cheap. If it is set up correctly it produces good quality data with high enough frame frequency which is around 100 frames per second to capture everyday scenarios [10].

Inertial

The core difference of inertial systems with the previous ones is in the sensors them-selves. Each marker includes inertial measurement units (IMUs) like gyroscope, ac-celerometer, and magnetometer, and thus to capture the data there is no need in any cameras or intermediate recording devices as required in optical or magnetic systems, but a system is needed to provide with an absolute position. Considering all of that the system provides the data about the location and rotation of the joints, plus angular velocity and accelerations of the keypoints. Captured data is sent to the computer for further processing and storage.

Due to the specification of the IMUs that are used with time they accumulate the error so the data starts drifting. Usually, it appears within a few minutes [3]. This error can be minimized by a proper combination with complementary sensors [9].

Inertial systems are portable thus it is good for changing environments of different scales with changing amount of light. They have a good frame rate and can provide fairly accurate data. Often this kind of setups can also provide the hands’ motion tracking. In addition, these systems are not very expensive. All of that makes them quite popular in the entertainment industry.

2.4. Motivation

Summarizing everything said above, it is clear that the use of mocap data for creating a good quality animation is not affordable for the beginners and enthusiasts because all of the systems are quite expensive for personal use.

The aim of this work is to make this process more affordable for non-professional users, so that everyone could make their own animation with a regular camera. In this work, 3D human pose estimation methods are utilized to provide the mocap data which is later adopted for the purpose of 3D character animation.

(13)

3. RELATED WORK ON POSE ESTIMATION

Recently, human PE became one of the most interesting and discussed topics in CV but also, it is one of the most complex problems due to the fact that the human body is a complicated mechanism itself, which has a set of limitations. In addition, things like clothing, different types of occlusions, and light conditions are making this issue even more challenging. This nontrivial task can be solved in different ways including various estimation techniques. These methods can be divided into two big categories: 2D methods - those, that produce the location of the keypoints in the 2D space; and 3D methods, which together with the location of the human body keypoints in 2D space also provide the data about their depth which allows to see the geometry of the object in a 3D space. These methods are base on deep learning architectures like Convolutional Neural Network (CNN) and Recurrent Neural Networks (RNN).

3.1. Neural networks

The neural networks is a programming paradigm which tries to imitate the structure of the human brain in order to solve such complicated tasks like image and speech recog-nition. The interesting thing here is that compared to other programming paradigms where the programmer himself defines how the program will solve the task, the neural network has to find it out on its own. In the simplest form, it can be represented as a black box (Figure 2) where some input data is given to the NN, and it is known what kind of output has to be produced but it does not matter how exactly this result will be calculated.

Figure 2: NN as a Black Box.

If to look inside this black box, the NN consist of layers and layers of "neurons". They are also called perceptrons. An example of a simple NN is given in Figure 3. The leftmost layer is called "input layer" and the rightmost one - "output layer". Everything in between got the name "hidden layers" because it is not important to know what are the values in the nodes there. Depending on the complexity of the task some networks might have multiple hidden layers.

Generally, the process goes from small concepts at the input to bigger ones at the output. For example, if the image of a handwritten number is given to the NN, each node in the input layer will contain the value of a specific pixel in the image. The number of nodes in this layer will be equal to the total number of image pixels. While there will be only 10 output nodes assigned to all possible numbers from 0 to 9. These

(14)

nodes will contain the values between 1 and 0. The neuron with the highest value points to the digit that is depicted in the image.

In order to give a correct answer, the system has to go through the training process where the NN is given the input and the correct answers that have to be at the output. In this way, the network has to calculate a set of suitable parameters for each connection and each node in the hidden layers which will allow it to match the input with given output, so that after the training the NN could make a correct decision for the similar input on its own. This process is called learning.

Figure 3: Example of a NN structure [12].

One of the most popular and used NNs in the field of CV for image recognition and classification is CNN whose typical structure is illustrated in Figure 4. In some layers instead of specific parameters CNN applies functions of convolution or pooling. Over-all CNN is a combination of convolutional and pooling layers, activation functions, such as rectified linear units (ReLU) and fully connected layers [13].

Figure 4: Typical CNN structure.

Sometimes, they are also called feedforward neural networks (FFN) because the information from one layer always goes to the next one and it is never passed to the previous one. That is why it is a loop-free network.

(15)

3.2. Heatmaps

The key components of successful PE in the latest research are the heatmaps [14, 15]. In general, a heatmap is a visual depiction of some data. Mostly, it is used for analytic purposes. Within the framework of PE, heatmaps represent the result of confidence maps predictions over the image plane for each joint (Figure 5). As more confident the system is about the location of the joint as more intense will be the color at that coordinate point. As a result, it shows with a heatmark where is the specific joint on the image.

Figure 5: Heatmaps produced by VNect [16].

3.3. 2D Methods

The scientific community achieved a lot of progress in 2D PE and as a result, has provided a big variety of methods for solving this problem. Most of these methods share the same implementation principles. For example, utilizing the hourglass design [17], its modifications and different combinations of those. 2D methods can be divided into 4 categories described below.

3.3.1. Image based single person PE

The overwhelming majority of 2D methods refer to a single person estimation from a single RGB image [18, 19, 20, 21, 22, 23] and they are particularly targeting to improve the state-of-the-art methods and outperform them in efficiency and accuracy by fighting ambiguities, occlusion, and overlaps. These methods in one or the other way rely on the geometrical constraints of the human body. This data can contain any information which will help to have a better understanding of body geometry and its limits. For

(16)

example, human joints like knees and elbows cannot bend in all directions. It means that they have a specific range of available movements which is forming a structural constraint. Also, it can be the hierarchical structure and connections between different body parts [18] or a bone-based description which contain the information about the size, orientation, scale of the bones, etc.

Some methods [18, 20, 23] use those structural constraints as priors which are em-bedded in the process of training of the deep neural network (DNN). As a result, the trained network "understands" which poses are plausible and it leads to accuracy im-provement. The others like Keet. al. [19] encode the skeletal structure of the human body into the loss function, so it would try to balance the loss in each joint individually and the loss of entire structure as a whole. At the same time, Chouet. al. [22] makes it even more interesting by stacking two absolutely similar hourglass models [17] at the training stage. The first one also called the generator is used just for pose estimation, while the second one named a discriminator, thanks to the adversarial training with the geometrical priors, can distinguish if the pose provided by the generator is feasible. In the end, it returns the feedback to the generator about the result that could be improved.

3.3.2. Image based multi-person PE

At the same time, there are methods which can estimate multiple human figures from the single RGB image. In the images with more than one person, humans often overlap each others’ figures. As a result, the pose estimation complexity grows because it is important to assign correctly all of the body joints of the same person to one skeleton. Thus accurate pose estimation is the aim of the majority of the multi-person methods [24, 25, 26, 27, 28].

Pishchulinet. al. in his work [27] presented one of the most accurate approaches for 2D joint extraction called DeepCut which was erelong improved to the DeeperCut together with Insafutdinov et. al. [28] using more powerful residual neural network (ResNet) [29] and improved optimization process. Nevertheless, the main process of solving the above-mentioned problem remained the same and is based on linear in-teger programming principles. First of all, from a single input image with multiple people DeepCut generates the hypotheses about the body parts. It means that the al-gorithm starts from a set of estimated joints, so-called body part candidates and body part classes, e.g., head, back, shoulder. Every element of body part candidates receive a value which allows to calculate how much does the candidate correspond to some specific class. In the next step, which is very similar to the first one, body candidates and classes are taken in pairs. Then these pairs get their own cost values which tell if the candidate from some specific class and the other candidate from the other class belong to the same person. In this way, body part hypotheses are formed. Later, all the body joints that were defined as those that belong to the same person are connected into one cluster based on the image. After that, inside each cluster, the joints are labeled according to their classes which were assigned earlier.

At the same time, there is another popular strategy presented by Cao et. al. [26] often called OpenPose. To be able to distinguish one person from another the method is using part affinity fields (PAF) which are vector fields that contain the location and orientation of the human body parts in the image. To predict the 2D joint locations the

(17)

algorithm uses FFN which calculates a set of confidence maps for each body element, and a set of affinity fields, one for every bone, which is defined by 2 endpoints. These predictions are calculated within multiple stages. Before that, a set of feature maps is formed from the input image. It is used to generate the first set of confidence maps and PAFs. Then in every new stage, the original feature maps are combined with confidence maps and affinity fields from a previous stage in order to get more accurate results. After the joints are detected the PAFs are used to define their belonging to the same limb and then to the same person. This process produces the final skeletons for all people in the input picture.

Meanwhile, to improve the results, some methods use the bounding boxes (BB) to define the areas where each person is. This helps to increase accuracy. But as it is expected, there are problems like incorrect localization of the BB which leads to the estimation problems. Fanget. al. offers a regional multi-person PE method [25]. In the beginning, the spatial transformer network is used together with the single person PE algorithm to provide a highly accurate BB and to generate pose proposals. To optimize this process, at the training stage one more single person PE branch is used. After the pose proposals are done they are led to parametric pose nonmaximum suppression block which cuts out all of the unnecessary data.

Some of the methods which are targeted for multi-person estimation can be also used in cases with one person. For example, the method offered by Nieet. al. [24] extract the parsing information from the input data with an eye to teach the PE model so it could adapt the result according to new parsing data. It helps to fix invalid locations for joints in single and multi-person cases which lead to the formation of more accurate joint candidates and hypotheses about the bodies in cases with multiple people.

3.3.3. Video based single person PE

Altogether there is a big variety of methods that are working only with images. How-ever, there are some methods that can also process the video which is basically a se-quence of images. It might sound simple but this process is more massive and needs more computational resources, so that it could calculate the output for each frame fast enough. In addition, the work with video has its own side effects like flicking or image blurring due to the motion and other problems.

Methods like LSTM Pose Machines [30] offered by Luoet. al. provides a solution for this kind of issues for videos with only one person in it. They transformed the CNN into the RNN by using a specific way of sharing the weights inside the network. RNN is able to do the calculations at each step taking to account the results from the previous ones. In this particular case, it also allows to reduce the number of connections in the transformed network. As a result, the processing time for videos improves a lot. In addition, Luo adds Long Short-Term Memory (LSTM) modules between frames processing which allows to "look back" for the previous calculations even further than the RNN can do without it. LSTM decrease the amount of parameters in the network and improve the result because the output of already computed frames is processed and stored to LSTM and used for next frame calculation adjustment. Overall, the offered schema sees the geometrical connections between the frames and it leads to fast and reliable results.

(18)

3.3.4. Video based multi-person PE

Some methods, like earlier described Openpose [26, 31], can also perform multiple estimations for video sequences. Due to the complexity of the problem in these cir-cumstances Xiao et. al. [31] is looking for as simple as possible but still effective configuration. Their method uses the regular ResNet [29] supplemented with a few de-convolutional layers. This simple solution ended up to be quite effective for obtaining the heatmaps from the feature maps. For better reliability, they also use the BBs and track human movement through the frames using optical flow.

Nevertheless, the output of these methods is a 2D pose of the human skeleton which consists of 2D locations of each joint in the structure. That is why one of the biggest problems of 2D methods is ambiguity. A lot of those 2D poses can be interpreted in many different ways since different 3D poses can have similar projections to the 2D plane. Thus 3D methods are often used to solve the ambiguity issue.

3.4. 3D Methods

Most of the 3D PE methods at some moment rely on the results of 2D PE methods so that first of all they estimate x and y coordinates of the keypoints [16, 32, 33, 34] or use pre-calculated data [35] which is the output of the above-described 2D methods. This stage is very important because the final result is dependent on the accuracy of the 2D joint extraction method. As more precise are the joint coordinates as more trustworthy will be the 3D PE result. After obtaining the information about the 2D joints coordi-nates the methods "think" how to get 3D locations of the keypoints. However, there are also methods that go directly to 3D PE [36] without any intermediate calculations of 2D poses.

3.4.1. 3D pose estimation

For the great number of 3D methods recovering the 3D pose, shown in Figure 6b, is the main goal [16, 33, 34, 36, 37, 38, 39]. It means that they aim to calculate the coordinates of each joint of the person in a 3D space. Sometimes the authors also try to visualize the obtained pose data by fitting it into the kinematic skeletons [16, 36].

(a) Input image (b) Predicted pose (c) Predicted shape

(19)

A good starting point for this category is a method called Lifting from the Deep (LFTD) offered by Tome et. al. [34]. It proposes a new CNN architecture which is a combination of 2D joint location extraction and 3D pose predictions, that are calculated simultaneously. It is done in this way for the better efficiency of the method. The offered framework is trained using data with 2D and 3D annotations. Furthermore, these two datasets can be independent of each other.

LFTD claims that the geometric priors and constraints of the human body are nec-essary for a good approximation of the body shape. In the offered CNN the method is modified in the way that it has an embedded layer of 3D parameters. This allows to make sure that the predicted 3D model will be satisfying all the geometrical parameter and constraints.

The architecture of the framework presented in Figure 7 is complicated and consist of multiple stages which are connected successively. At the same time, all of them re-ceive the input image. At each stage, first of all the input image goes through the CNN to extract features from the input image. These features are then used for joint location prediction. Thus the CNN outputs predicted heatmaps with 2D joint assumptions. At the next step, this information is used together with probabilistic 3D pose model to give a volume to the 2D coordinates and get the 3D pose. After that, the 3D joints of the probabilistic model are projected to the 2D surface aiming to obtain a new 2D joint location heatmaps. These belief maps projected from 3D joints are merged with the ones that were computed earlier in order to calculate the 2D loss. The merged set of heatmaps goes to the next stage as the reference for joint prediction recalculation and creation of a new set of belief maps from the original image and reference data. By re-peating this procedure over and over through multiple stages the framework accuracy for 2D and 3D keypoint location raises progressively. This structure allowed LFTD method to reach state-of-the-art results.

(20)

The other interesting method whose aim is to recover the 3D pose is called DR-Pose3D and was proposed by Wang et. al. [33]. The architecture of this approach is easier than in the previous method. First of all, because this technique is based on the concept of depth, which is so easily perceived by humans, and can be learned by machines using NNs. The overview of the system is given in Figure 8.

Figure 8: Pipeline of the DRPose3D method by Wanget. al. [33].

The input image is run through the 2D PE module which outputs 2 types of data: a set of heatmaps and the 2D keypoint locations. After that, the heatmaps paired with the input image are going to be processed by the newly offered Pairwise Ranking CNN (PRCNN). This network computes a so-called ranking matrix which reflects the loca-tion of the joints relatively to each other in terms of depth. Finally, this depth ranking is given to the 3D Pose Network (DPNet), also offered by authors. This network re-constructs the 3D pose from 2D joint locations given by 2D pose estimator, taking to account the depth data from the ranking matrix. Depth based approach allows DR-Pose3D to avoid pose ambiguities especially around the limbs and get more accurate angles between the body parts.

Another good example of 3D methods is an approach offered by Mehtaet. al. [39] who decided to perform the 3D PE and at the same time create a new dataset called MPI-INF-3DHP for training PE systems.

The pipeline of the PE framework shown in Figure 9 consists of 3 steps. Within the first step, the input data is lead to the fully convolutional NN called 2DPoseNet which outputs the 2D heatmaps. These location maps easily provide the 2D joint locations to extract its maximum value from each of them. Then calculated data is used by the BB detector to locate the human on the image, root-center it and finally crop the picture. Next, all of this data goes to the second step where 3D PE is done by the other CNN, 3DPoseNet, that also uses a new supervision technique developed by authors for accurate regression. As the final step, the 3D and 2D data obtained from previous steps is aligned in order to get a global pose of the human with respect to the camera coordinate system.

The approach offered works well with the data from simple RGB cameras and achieves promising results. To make it possible, good data for training is a critical component. Existing datasets did not satisfy the requirements of the authors so they created their own MPI-INF-3DHP dataset. The data was captured with the state-of-the-art mocap system that contain a ground truth (GT) 3D annotations captured from

(21)

Figure 9: Architecture of method offered by Mehtaet. al. [39].

different angles and in different conditions. The dataset created allowed the presented method to reach state-of-the-art results. MPI-INF-3DHP dataset has gained popularity among researchers [16, 36] and is a common tool to train and test PE systems under development.

One of the most interesting methods for this theses is VNect created by Mehtaet. al. [16]. This approach goes further than most of the others and have 3 main key points: 2D and 3D joint extraction and kinematic skeleton fitting. The last stage allows using the method in the real-time character control applications, like games, virtual reality and so on. The goal of this approach is to create a tool for character animation which would not require the usage of the sensor mocap system and would work in real time using a simple RGB video sequence.

The whole pipeline given in Figure 10 can be divided into 2 parts. The first stage is responsible for the reconstruction of 2D and 3D keypoints. The CNN that performs this operation is fully convolutional mean what it can work without close-ups, so there is no need in using BBs. Nevertheless, for the result improvement, the image is centered on a human and cropped. The 2D coordinate regression is using the aforementioned heatmaps which are calculated for each joint independently. As for 3d reconstruc-tion, the authors found an unusual solution. The heatmaps from 2D reconstruction are broadened to obtain the heatmaps for each of 3D coordinates. These are called the location-maps because the color intensity shows how close or far the joint is from the point of origin where x, y, and z coordinates are equal to zero. Pelvis is considered to be a constant point of origin. Then, the location-maps in conjunction with the 2D heatmaps are used for the prediction of the 3D coordinates by finding the points of a maximum correspondence using the loss function.

As for the last part, which is the kinematic skeleton fitting, the authors optimize, filter and smooth the regressed 2D and 3D data about the joint positions in order to get feasible character poses from frame to frame and as a result smooth and trustworthy animation.

The other interesting method suggested by Dabral et. al. [36] in comparison to all of the previous methods do not make any intermediate calculations of 2D poses and go directly to 3D pose estimation, but it does not mean that the offered method has nothing to do with 2D keypoints at all. For 3D pose reconstruction from a single image, the CNN called Structure-Aware PoseNet was presented. It was trained on data with 2D and 3D annotation. To make sure that the predicted 3D data will fit to the 2D landmarks and stay realistic, authors developed a loss function which is based on the anatomical structure of the human body. It consists of three components. The first and

(22)

Figure 10: The architecture of the VNect [16] method.

the most important is the length of the bones which is the basis of the loss function. The other two joint bending limits and symmetry of the left and right body sides -are supplementary. This structure allowed the method to outperform state-of-the-art methods.

Moreover, the authors did not stop here and offered to extend their approach to the video sequence by creating a temporal network which tracks structural hints from frame to frame to make the prediction more smooth and realistic. In this way, it takes the 3D prediction from the Structure-Aware PoseNet and compares to the predictions from the previous frames. Then, the network can adjust it a bit if there is a need. The output of this whole process can be applied to the kinematic skeleton to create an animation of any kind. Coupled together, the presented networks produce great estimation results which are good enough for character animation.

Some methods do not care about the 2D coordinate recovery and use 2D joint lo-cations already as an input to their system. Their 3D pose restoration is based on the same idea of anthropometrical landmarks that were mention in Section 3.3. A good example of those methods are given by Ramakrishna [37] and Akhter [38].

Ramakrishnaet. al. [37] have suggested to restore 3D pose from 2D landmarks. It is claimed that anthropometrical landmarks, for example in this particular case limb length, are very valuable prior which can help to avoid implausible poses. To make use of it, the overcomplete dictionary which contains the description of the limb length and deformability variations has been created. As a result, it allowed recovering the 3D pose of the human together with the position of the camera relative to the person.

Akhteret. al. [38] used the same idea of anthropometrical landmarks as Ramakr-ishna but instead of limb length they studied joint limits to form a prior for possible configurations in order to avoid unnatural deformation of the joints in the extracted skeleton. The work has started from collecting the data from the motion capture dataset and formation of the geometrical prior by learning the joint limits and how much they influence the pose. Nevertheless, even with the correct joint angles there still might be a few possible pose configurations, thus it was recognized to be important to restore the camera parameters. To solve this task, it was assumed from the learned angles that the torso part of the human body is having fewer variations than the whole body. This finding allowed to make the parameter estimation easy and as a consequence allowed the method to reach great results in 3D PE.

(23)

3.4.2. 3D pose and shape estimation

To make the 3D pose estimation more accomplished and for easier visual perception some methods together with pose aim to restore also the shape of the human body as illustrated in Figure 6c. Shape restoration complicates the whole process because depending on the volume of the body segments some parts can interpenetrate to each other which cannot happen in real life.

A great example in this category is the method called SMPLify [35] offered by Bogo et. al. It is based on the idea that a lot of information about human body shape can be withdrawn from a set of 2D joints. This information applied to the 3D model in combination with an anti-interpenetration technique allows to restore 3D body pose and shape avoiding ambiguities.

The whole method can be divided into two main stages: extracting a set of 2D joints from a single image using above-mentioned 2D PE method for joint extraction called DeepCut [27, 28] and restoring body shape and pose by mapping obtained 2D joints to a skinned multi-person linear 3D model of a human body (SMPL) [40]. This model represents a high-quality 3D-model of an average human body. In addition, SMPL takes to account how the body shape changes depending on the pose.

Figure 11: Visual representation of the objective function minimization performed by SMPLify.

When the 2D data is received from DeepCut, SMPLify goes directly to the mapping stage where the CNN minimizes the objective function. It means that SMPLify tries to make joints of the 3D model, projected to the 2D plane with a perspective camera model, to be as close as possible to the detected set of 2D joints. As an example, in Figure 11 green circles represent 2D joint predictions made by DeepCut and red ones are the projected joints of the SMPL model. In that kind of procedures, it is important to pay attention to the number of joints because it might be different for the output of the 2D joint estimation method and the model, such as SMPL. For instance, DeepCut produces a skeleton of 14 joints while the SMPL contains 23 joints. To solve this problem, 2D keypoints have to be associated with the most relevant joints of the 3D model. Moreover, Bogo et. al. understood that still a lot of poses are not feasible due to the body parts intersection and interpenetration. Therefore, SMPLify defines the shape of the body with a set of so-called capsules to get a proper result. The parameters of the capsule are linearly dependent on the shape coefficients of the SMPL

(24)

model. As a result, this method achieved outstanding performance in comparison with other methods. Consequently, this approach serves as the basis for many other studies. Methods offered by Kanazawaet. al. [41] and Pavlakoset. al. [42] pursue the same target as SMPLify. Usually, for this goal, CNN is trained using the training data with 3D annotations. Unfortunately, this kind of data often has bad quality and there is not enough of it for good training results. Luckily, there is a big amount of the images offered for training that have 2D annotation. Similarly to Dabral et. al. [36] whose method was described earlier, Kanazawa and Pavlakos found a solution in it. They propose an end-to-end framework that can restore not only the 3D pose but also the shape using the SMPL model directly from a 2D image. This concept allows to avoid double training process (first stage for estimation of the joints and second for fitting the SMPL model projections into the output of the previous stage) and compress the whole procedure to a single training stage, that allows to reduce the complexity of the framework.

Both methods adapt the same idea of minimizing the objective function or in other words minimizing the error between the projected keypoints of the 3D model and 2D annotations with an eye on the 3D pose and shape reconstruction with the incorporation of SMPL model.

Nevertheless, these approaches follow a bit different pipelines. Pavlakoset. al. [42] whose architecture is shown in Figure 12 use the CNN named Human2D at the early stage of their framework to produce two types of data from a single input image -silhouette and a set of heatmaps for each joint. This data is used as an input for two other networks. The heatmaps are used to find 72 proper pose parameters (x, y, and z coordinates for each of 23 joints and 3 global rotation parameters) for the SMPL nodes. And the silhouettes are needed for calculating the shape of the person. The obtained results are used together to generate the correspondence to the input image mesh. Finally, the produced mesh is projected to the 2D plane and optimized according to the keypoints. At the same time, this structure has built-in 3D per-vertex loss as a part of the optimization process. It describes the correlation of errors from vertex to a vertex in the mesh and helps in evaluation and improvement of the training process.

Figure 12: Architecture of method offered by Pavlakoset. al. [42].

Kanazawa et. al. [41] send the input through the convolutional encoder to the re-gression module as illustrated in Figure 13. In an iterative manner, it is looking for appropriate parameters for pose, shape, and camera that will minimize the reprojection

(25)

errors. At the same time, they are taking to account that even with this setup there might be ambiguous or even impossible postures. That is why they built in into the 3D PE pipeline the discriminator network whose task is to analyze provided SMPL parameters and determine whether they are corresponding to a plausible human pose or not. The discriminator is able to do that because during the training process it stud-ies the available poses following the graph representation of the joints. Then it goes to each joint separately to learn its limitations. This process results in the geometry aware discriminator. This approach is giving very good results for 3D pose and shape estimation compared to other methods that work with 2D annotation.

Figure 13: Structure of the end-to-end Framework offered by Kanazawaet. al. [41]. In contrast with previous methods, Neural Body Fitting (NBF) proposed by Omran et. al. [32] can be trained using data with 2D and 3D annotations which require weak or full supervision, respectively. Despite the aforementioned shortcomings of 3D annotation data, the authors want to demonstrate that there is no need in a big amount of 3D training data to get an acceptable outcome.

Its end-to-end pipeline is alike with the previous methods and also includes param-eter search, SMPL mesh creation and further projection to the 2D surface. The differ-ence is that for improvement of the performance of the framework the authors add a step for simplification of the image by means of semantic segmentation. The result of this step is illustrated in Figure 14b. The figure of the person is represented as a set of 12 colored areas where each color corresponds to a specific body part while all the irrelevant information is simply withdrawn. Omranet. al.assure that 12 segments are enough for a good restoration outcome. Nevertheless, the quality of the segmentation has a big impact on the final result.

Another difference is that for the optimization of the NBF, Omranet. al.offered and tested four different loss functions and its combinations to define how much the use of data with 2D and 3D annotations influence the result. They found out, that only 2D annotated training input does not give a good outcome because the system cannot fight the 3D ambiguities. A small amount of data with 3D annotations solves this issue and contributes to the achievement of worthy results.

The other approach worth of mentioning got the name BodyNet from its creators Varol et. al. [43] who decided to use four NNs each of which was responsible for

(26)

(a) Input image (b) Semantic segmentation (c) Predicted shape

Figure 14: Output of NBF method by Omranet. al. [32].

one of four tasks: 2D PE, 2D segmentation, 3D PE and 3D shape reconstruction. At first, all of them were trained separately so they could properly solve an assigned task. After that, the networks went through the general training together. In addition, this method uses three different loss functions each of which improves the reconstruction outcome. The BodyNet structure overview is presented in Figure 15. The input image is processed by two networks. One predicts the 2D keypoint locations and output it as heatmaps. At the same time, the other NN segment the body parts (there are 7 body parts, viz. hands, legs, head, torso, and back) from the image and deliver it also as a set of heatmaps, one per each body part. After that both sets of the heatmaps together with the input image are given to the 3D pose prediction NN which lift the 2D data to the 3D space relative to the camera perspective. In general, this procedure sounds similar to other methods, but also there is a big difference of this method from others because at this point it shifts from pixels to the concept of voxels. All of the further work is done in a 3D space defined by the voxel grid.

Figure 15: Overview of the end-to-end BodyNet framework [43].

In all of the previous steps, the loss functions are calculated. All three of them are the components of the multi-task loss function. They are collected together and along with 3D pose prediction and three other inputs go to the 3D shape restoration network. This module contains two more loss function: volumetric and reprojection losses for different planes. The volumetric loss allows to get a corrects alignment of the predicted shape with the segmentation data and get an accurate vision of body shape in space. The reprojection loss is used to improve the restoration of the limbs. Overall, BodyNet

(27)

structure is quite complected and massive but it gives outperforming results over the state-of-the-art approaches.

3.4.3. Other methods

Despite the fact that the majority of the 3D PE methods are aiming to get the 3D pose and sometimes the shape, there are methods which are also looking for a solution to some other problems that appear while the procedure of 3D reconstruction takes place, for example, too large training datasets or quantization errors.

In rare cases, the method aims to improve the learning process of the modern 3D PE by fighting over too massive training sets [44]. It is done, first of all by creating the latent body model without any references to joint coordinates. To do that the NN uses four images of the same person which are taken from different angles so it could learn the representation of the person taking into account its geometry and volume. After that, the obtained model has to be mapped to a 3D pose. It is done by the other shallow NN which needs only a small amount of data to be trained. It works because the latent model already contains the data about the geometry of the person’s body thus there is no need to dub it while mapping process. In this way, the presented system learns to assume what is the 3D pose of the person on the input image if a single view image is given. Such construction allowed this method to outperform the other methods based on DNNs, trained on big sets of annotated data.

At the same time, some authors as Sunet. al.[45] define fighting the post-processing and quantization errors as their main goal. From the above-described methods, it is seen that often they owe their success to heatmaps but its use leads to aforesaid problems. The authors of the current method have found a solution to these issues in simple integral calculations. Thus by changing a simple operation of maximum ex-traction from heatmaps for joint regression to the integration of joint locations based on its probabilities, they create a simple and parameter-free way to overcome men-tioned problems. In addition, this small but effective solution can be integrated into any architecture that uses heatmaps to improve its accuracy.

Finally, DensePose offered by Güler et. al. [46] represent an unusual method of estimation by mapping the RGB image of a human to a 3D model. To do that, first of all, a GT dataset that can be used to train the CNN so that it could map the data properly to the 3D surface model of a human body needed to be created. Thus the system starts with the identification of the body parts at the input image and its segmentation. Then, each segment is filled with points at the equal distance from each other and the system has to spread them in accordance with the shape of the surface. Thanks to this approach a proper point mapping to the 3D model can be performed successfully.

The authors of DensePose went further and proposed to use the method not only for creating the 3D body representations but also for generating synthetic images of a person while having only one image as an input [47]. Together with the input image offered, system gets the pose that has to be synthesized using the input data. To make it happen previously presented surface-based model is used to obtain the pose that corresponds to the input which is further remapped to the target pose. Moreover, this version of DensePose has a few additional modules, a combination of which gives the outperforming results in comparison to other image synthesis systems.

(28)

4. ANIMATION

The whole process of creating a realistic character animation based on real human motion from the point of view of the animation can be split into three parts: creating a character itself, generating the animation from the data produced by the PE methods and appliance of the motion to the 3D character.

4.1. Animation tools

Character creation and animation is a complicated process which involves a lot of skills and knowledge of specific software as Blender [48], ZBrush [49], Autodesk Maya [50], Autodesk 3D Max [51] and others, which are oriented to the professionals in the area. All of these tools have diverse sets of various instrument to work with all types of characters and objects. They are widely used by the industry professionals for creating different kinds of models and pictures, animated movies and VFX, video games and AR/VR applications, etc. Because of the high quality and multifunctionality of these products most of them [49, 50, 51] are expensive. Luckily, there are also free computer graphics (CG) toolsets that are available for everyone, for instance, Blender.

4.1.1. Blender

Blender is a powerful cross-platform, open-source software which can handle the whole creation pipeline - from character design and rendering to movie making and game development. Moreover, its functionality includes different kinds of simulators, various animation tools, instruments for use of mocap data, Python scripting and many others. For working with Python code Blender has a Python application programming interface (API) that provide the opportunity of working with native Blender libraries and tools. Moreover, this software has its own game engine, nonetheless, it will be removed from the upcoming release as it loses oneself in comparison with other, more powerful game engines like Unity [52], Unreal Engine, etc. Currently, Blender whose main screen is shown in Figure 16, is one of the most widely used 3D modeling tools. Considering all of the above, Blender is a perfect tool for performing the implemen-tation of the graphical part the thesis project: the creation of the simple character and its animation which afterwards can be transferred to the game engine.

4.1.2. Unity

Unity is a popular cross-platform game development environment. A great advantage over other game engines is that the users do not have to be good in programming to create their own game. Unity supports graphical development environment depicted in Figure 17 which make the whole process more open for the public. Nevertheless, it supports scripting with C# for event description and processing the users’ actions. More advanced users can utilize scripting for physics description, and creation of vi-sual effects up to implementing complicated artificial intelligence (AI) algorithms.

(29)

Figure 16: Blender splash screen.

Games created with Unity can run on different platforms whether it is a PC, smart-phones, game consoles, etc. In addition, it is free of charge for personal use.

While Blender is getting concentrated on the character development and animation creation tools, Unity is targeted specifically to the game development itself, and thus it does not support any serious functionality to build the character, but offers to import it from external sources. Also it allows to import landscape meshes, textures, material and others.

4.2. 3D character design

One of the most popular techniques for 2D and 3D character creation and animation is called skeletal animation [53, 54]. It implies that every character consists of 2 main parts: the mesh and the skeleton.

4.2.1. Mesh

The mesh illustrated in Figure 18a defines how the character looks like. Sometimes it is also called the skin of the character. The only thing that limits its shape is the imag-ination of the character creator. At the same time, the mesh itself is just a shell, and a set of vertices and its animation for the game or a film might become a complicated process since the character artist has to animate all of the vertices, while some models can contain tens and even hundreds of thousands of vertices. A simple and effective solution for this issue is to use armatures.

(30)

Figure 17: Unity graphical development environment.

4.2.2. Armature

Armature shown in Figure 18b [55 Armatures] or a skeleton in most of the times is a hierarchical tree-like structure of elements which are called bones. In this structure, every next bone is connected to the previous parenting bone so the movement of the parenting bones will lead to the movement of the child ones. Usually, these skeletons of different characters look similar to a regular skeleton.

4.2.3. Rigging

The process of creating a character skeleton is called rigging [56, 55 Rigging]. Except for the bone structures, it might include constraints, object modifiers, and other com-ponents. The resulting armature is a rig that is used to control the mesh. To make it happen, the skin has to be linked to the skeleton of the character within the skinning process the result of which is illustrated in Figure 18c.

4.2.4. Skinning

During skinning [55 Skinning] the groups of vertices have to be associated with the corresponding bones in order to make it really move. Thereby, when a specific bone is moving, the assigned vertexes will follow it creating the motion. Usually, the vertices are assigned to the closest bone. Each of which has a vertex weight that shows how much influence this or another rig element has to each vertex. As bigger is the weight value as more the translation of the vertex will be dependent on the motion of the particular bone. At the joint locations, the vertices might be influenced by multiple bones.

(31)

(a) Mesh (b) Armature (c) Skinned mesh

Figure 18: Elements of skeletal animation.

Automatic weight distribution sometimes can result in robotic and unnatural charac-ter movements. To fix it the skinning might requite manual assignment and distribution of the weights so that except the fully controlled groups of vertices the bones would also slightly affect the neighboring mesh areas around the joints. For instance, in Fig-ure 19 red color means that the selected bone has a strong influence in all of the vertices of that area while for the green zone the bone’s effect is not so strong, and blue vertices are not influenced by the chosen bone. So the joint area between the foot.L and toes.L bones is influenced by both of these bones. This procedure will help to create more authentic mesh deformations according to the skeleton.

In more advanced cases, for creating more realistic movement rigging and skinning there might also include the creation and connection of the muscles. As a result, the above-mentioned problem of multiple vertices, animation becomes as simple as the animation of the bones, which requires only location and rotation data. Thus, skeletal animation also decreases the amount of information that has to be saved since only bone motion data is saved and the location of the vertices in each frame can be calcu-lated from there.

(a) Influence zone of the foot.L bone (b) Influence zone of the toes.L bone

(32)

4.3. Animation generation

4.3.1. Keyframing

Animation of the 3D characters is based on traditional animation principals and tech-niques [57]. One of the simplest and most applicable ways of animation is called keyframing [55 Keyframes] [58]. The idea behind this technique is that the animator creates only the main frames of the character motion which describe its movement within the timeline as illustrated in Figure 20. The keyframe description of the motion can contain coordinates, rotation angles or any other information that can help to de-scribe the transformation of the object with time. The keyframes do not follow each other one after one so the in-between frames are generated and rendered by the soft-ware using interpolation methods. The appliance of this technique makes the character movement smooth and takes less time because there is no need to create a frame by frame animation.

Figure 20: Keyframes for walk animation in Blender.

In the framework of this thesis mocap data for each frame of the video sequence is used as keyframes.

4.3.2. Using motion capture data for animation

In Blender, skeleton bones look like those presented in Figure 21. Each bone has a root or head and the tail. The tail of one bone might be at the same time the root of the next one. For example, in Figure 21a the upper arm bone’s root is the shoulder joint and its tail is in the elbow. At the same time, the upper arm’s tail is the root for the forearm bone which starts with the elbow and ends at the wrist. This structure creates a hierarchy of the rig so that the upper arm bone is a parent to the forearm, while the forearm is a parent to the hand and so on.

As a result, shown in Figure 21b, the motion of the children structures is dependent on the parents. It can be seen, the roots and tails of the rig bones often correspond to the real joints location. Nevertheless, it is hard to apply correctly the mocap data to

(33)

(a) Hierarchical structure of the left arm bones

(b) Parent-child motion dependency

Figure 21: Humanoid rig in Blender.

these joints because they are defined by the location of the bones which are basically the space between the neighboring joints but not the joints themselves. That can create a lot of confusion and incorrect data overlay. To solve this issue the empties can be used.

Empty or an empty object is a specific type of objects that are invisible, they do not have volume or surface or any other property of an object and thus cannot be rendered [55 Modeling]. The only features that describe them are their transformation settings. This makes empties an effective tool whenever it is needed to control and modify an object, texture, rig or any other elements of the project. Available empties are illustrated in Figure 22.

Figure 22: Empties in Blender.

Overall, empties are targeted for performing any kind of transformation and creating influences that have to be invisible in the final render, for example, a force field. One of the most popular ways to use empties is parenting. It means that an empty is used as a reference point for one object or a group of connected elements.

(34)

For this thesis project, empties can be used to parent the joints of the skeleton so they will be able to control this structure. The plain axis empties can be placed inside the corresponding joints as it is shown in Figure 23.

Figure 23: Empties bounded to the skeleton joints.

In this way, the data about the 3D location of each joint that is imported from the PE methods can be assigned to the corresponding empties. Each empty receives a list of x, y and z coordinates for every frame that is processed. The imported data can be seen in the Dope sheet Summary section as illustrated in Figure 24a.

(a) Dope sheet editor (b) Skeleton movement

Figure 24: Skeleton animation with empties.

The Dope Sheet [55 Editors] concept is also originating from the classical animation even though it had changed a bit with the advent of CG. The Dope Sheet contains the information about the motion of all of the objects in the scene: at which moment they appear, what is happening with them and for what period of time. It is using the

(35)

Figure 25: Manual avatar mapping in Unity.

keyframing technique for each separate element that is going to be animated within the timeline. Basically, every mark in Figure 24a is a keyframe for each empty connected to the joint and it contains the data for each action channel - X, Y and Z locations. Figure 24b demonstrates the corresponding animation for the frame number 2 that is marked in Figure 24a.

In this project, Figure 24 shows the ideal case when each joint empty has a keyframe value for every frame. At this rate, the imported movement will be as close as possible to the real human motion. If some of the frames are missing the data about the joint location in some specific time will be interpolated from the previous and the next available keyframes.

Finally, frame-to-frame animation can be saved as native Blender.blendfile format or as an.fbxfile and together with the character itself may be exported for the further use in Unity or other applications.

4.4. Character animation with Unity

As it was said before, Unity require importing the rigged models created by other means. So the character together with the all available set of animation for it can be imported right to the scene.

In Unity, human alike models are called humanoids. Unity has its own understanding of how the humanoid has to look like. Therefore, it has a set of requirements for that kind of characters. So one of the first things to do after the humanoid character

(36)

transferring is to match a hierarchy of the character with the Unity vision on how the humanoid has to be set up. It is also good to check these requirements before the import so that the structure of the imported character would fit the game engine environment. Otherwise, some problems can pop up after the character is transferred to Unity. All of this is needed to make sure that the animation of the model will be correctly overlaid with the character. It can be done automatically or manually using Mapping window illustrated in Figure 25.

Sometimes, it might be needed to disable the animation of the humanoid partially. Avatar masks are used for this purpose. They can define what parts of the character will not follow the animation and stay still.

To manipulate the animations, Unity has a complex Animation controller which provide a user with various functionality. It allows to create own animation schemes, import animations from other sources (Section 5.3.1), build a connection between dif-ferent behavioral animations that allow making smooth translations during the game-play from one action to another one, retarget the motion when the animation which was created for one specific character is applied to the other one, build different animation patterns for separate body parts using the avatar masks and so on.