Basic Container Image Creation Guide

V100R020C20

Issue 01

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.

All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

1 Overview
2 Container Engine Plugin
3 Inference
3.1 Atlas 200 AI Accelerator Module (RC Scenario)
3.1.1 Creating an Inference Container Image
3.1.2 Deploying an Inference Container
3.2 Atlas 500 AI Edge Station
3.3 Other Inference Devices
3.3.1 Instructions
4 Training
4.1 Instructions
4.2 Creating a Training Container Image
4.3 Deploying a Training Container
A Common Operations
A.1 Configuring a System Network Proxy
A.2 Creating a Configuration File
A.3 Changing NPU IP Addresses
B Reference

1 Overview

This document describes how to use a Dockerfile to create a container image. AscendHub at the Ascend community provides some basic container images. You can obtain images from AscendHub based on the site requirements.

This document applies to the following Ascend AI devices:

● Inference
  – Atlas 200 AI accelerator module (RC scenario)
  – Atlas 500 AI edge station
  – Other inference devices
    ▪ Atlas 300I inference card
    ▪ Atlas 500 Pro AI edge server
    ▪ Atlas 800 inference server
● Training
  – Atlas 300T training card
  – Atlas 800 training server
  – Atlas 900 AI cluster

2 Container Engine Plugin

Ascend Docker is a basic component of the CANN. It provides container-based Ascend NPUs (Ascend AI Processors) for all AI training or inference jobs so that AI jobs can smoothly run on Ascend devices as Docker containers, as shown in Figure 2-1. Ascend-docker-runtime is the software package released with Ascend Docker and has been integrated into the toolbox.

Figure 2-1 Ascend Docker

Features of Ascend Docker

● Full decoupling: Ascend Docker is decoupled from Docker, and the Docker code does not need to be modified. The Runtime can evolve independently.

● Backward compatibility: An optional Runtime is provided, which does not affect the use of native Docker.

● Easy adaptation: Ascend Docker adapts smoothly to the customer's existing platform and system without affecting the original Docker command interface.

● Easy deployment: The RPM package is provided for deployment. After the installation, you can use Docker to create a container to which the Ascend NPU is mounted.

Ascend Docker Design

Ascend Docker is essentially a Docker runtime implemented according to the Open Container Initiative (OCI) standard. It requires no changes to the Docker engine and provides the Ascend NPU adaptation function for Docker as a plugin.

As shown in Figure 2-2, Ascend Docker connects to the native Docker through the OCI interface. When the native Docker runc starts a container, prestart-hook is called to configure and manage the container.

Figure 2-2 Docker Adaptation Principles

prestart-hook is a container lifecycle hook defined by OCI, that is, a hook function invoked during the transition from the created state to the running state. At this point, the namespace of the container has been created, but the job of the container has not started. Therefore, you can mount devices to the container and configure the cgroup so that the configuration can be used by the jobs that are started later.

Ascend Docker performs the following operations on the container in the prestart-hook function:

1. Mount the NPU device to the namespace of the container based on ASCEND_VISIBLE_DEVICES.

2. Configure the device cgroup of the container on the host so that the container can use only the specified NPU, which ensures device isolation.

3 Inference

3.1 Atlas 200 AI Accelerator Module (RC Scenario)
3.2 Atlas 500 AI Edge Station
3.3 Other Inference Devices

3.1 Atlas 200 AI Accelerator Module (RC Scenario)

3.1.1 Creating an Inference Container Image

Prerequisites

● In the container scenario, install Docker 18.03 or later.

● The container OS image can be obtained from Docker Hub.

● Obtain the offline inference engine package and service inference program package by referring to Table 3-1.

Table 3-1 Required software

| Software Package | Description | How to Obtain |
|---|---|---|
| Ascend-cann-nnrt_{version}_linux-aarch64.run | Offline inference engine package. {version} indicates the software package version. | Link |
| Dockerfile | Required for creating an image. | Created by users. See Step 2. |
| ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required. |
| version.info | Driver version information file. | Copy the /var/davinci/driver/version.info file from the host. Change the path as required. |
| Service inference program package | Collection of service inference programs. The .tar and .tgz formats are supported. The compressed package format of the service inference program must be supported by the compression program in the container. In addition, the command for decompressing the service inference program package in install.sh must match the actual format. | Prepared by users. |
| install.sh | Installation script of the service inference program. | Prepared by users. |
| run.sh | Running script of the service inference program. | Prepared by users. |

Procedure

Step 1 Upload the following software packages and files to the same directory (for example, /home/test) on the accelerator module.

● Ascend-cann-nnrt_{version}_linux-aarch64.run
● ascend_install.info
● version.info
● Service inference program package
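Before building the image, it can help to confirm that every file the Dockerfile copies is present in the upload directory. The following is a minimal sketch, not part of the product tooling: check_build_context is a hypothetical helper name, and the file names in the usage comment are placeholders for the actual package names.

```shell
# check_build_context DIR FILE...
# Prints "missing: FILE" for every FILE that is not a regular file in DIR
# and returns 1 if at least one file is missing, 0 otherwise.
check_build_context() {
    local dir="$1"
    shift
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (placeholder names; substitute the real package file names):
# check_build_context /home/test Dockerfile install.sh run.sh \
#     ascend_install.info version.info
```

Running the check after the Dockerfile, install.sh, and run.sh have also been prepared gives the most complete result, because docker build fails on the first COPY whose source is absent.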

Step 2 Perform the following steps to create a Dockerfile:

1. Log in to the accelerator module as the root user and run the id HwHiAiUser command to query and record the UID and GID of the HwHiAiUser user on the host.

2. Go to the software package upload directory in Step 1 and run the following command to create a Dockerfile (for example, Dockerfile):

vi Dockerfile

3. Write the following content and run the :wq command to save the content. The Ubuntu ARM OS is used as an example. (The following content is only an example. You can customize it based on the site requirements.)

# OS and version number. Change them based on the site requirements.
FROM ubuntu:18.04

# Set the parameters of the offline inference engine package.
ARG NNRT_PKG

# Set environment variables.
ARG ASCEND_BASE=/usr/local/Ascend
ENV LD_LIBRARY_PATH=\
$LD_LIBRARY_PATH:\
$ASCEND_BASE/nnrt/latest/acllib/lib64:\
/usr/lib64

# Set the directory of the started container.
WORKDIR /root

# Copy the offline inference engine package.
COPY $NNRT_PKG .
COPY ascend_install.info /etc/
RUN mkdir -p /var/davinci/driver
COPY version.info /var/davinci/driver

# Install the offline inference engine package.
RUN umask 0022 && \
    groupadd -g gid HwHiAiUser && \
    useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && \
    usermod -u uid HwHiAiUser && \
    chmod +x ${NNRT_PKG} && \
    ./${NNRT_PKG} --quiet --install && \
    rm ${NNRT_PKG} && \
    rm -rf /var/davinci/driver

# Copy the service inference program package, installation script, and running script.
ARG DIST_PKG
COPY $DIST_PKG .
COPY install.sh .
COPY run.sh /usr/local/bin/

# Run the installation script.
RUN chmod +x /usr/local/bin/run.sh && \
    sh install.sh && \
    rm $DIST_PKG && \
    rm install.sh

# Program that is run by default when the container is started.
CMD run.sh

In the Dockerfile:

The RUN instruction creates a HwHiAiUser user in the container. gid and uid in the file are placeholders for the GID and UID of the HwHiAiUser user on the host. Replace them with the values recorded in Step 2.1. The GID and UID of the HwHiAiUser user in the container must be the same as those on the host.
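The placeholder replacement described above can also be scripted. The sketch below is illustrative only: fill_ids is a hypothetical helper, and it assumes the Dockerfile contains the literal text "groupadd -g gid " and "usermod -u uid " exactly as in the example. It reads Dockerfile text on standard input and writes the filled-in text to standard output.

```shell
# fill_ids UID GID
# Replaces the literal placeholders in "groupadd -g gid" and "usermod -u uid"
# with the numeric IDs passed as arguments. Input: stdin. Output: stdout.
fill_ids() {
    local uid="$1"
    local gid="$2"
    sed -e "s/groupadd -g gid /groupadd -g ${gid} /" \
        -e "s/usermod -u uid /usermod -u ${uid} /"
}

# Usage on the host. "id -u NAME" and "id -g NAME" print the numeric UID and
# GID of the user, so no manual parsing of the id output is needed:
# fill_ids "$(id -u HwHiAiUser)" "$(id -g HwHiAiUser)" < Dockerfile > Dockerfile.filled
```

Writing to a separate output file keeps the template unchanged. If you use that approach, review the generated file before building from it.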

4. After creating the Dockerfile, run the following command to change the permission on the Dockerfile:

chmod 600 Dockerfile

5. The procedure for preparing the install.sh and run.sh scripts is the same as that for preparing the Dockerfile. Compilation Examples shows the file content.

Step 3 Go to the directory where the software packages are stored and run the following command to create a container image:

docker build -t image-name --build-arg NNRT_PKG=nnrt-name --build-arg DIST_PKG=distpackage-name .

Do not omit . at the end of the command. Table 3-2 describes the parameters in the command.

Table 3-2 Command parameter description

| Parameter | Description |
|---|---|
| image-name | Specifies the image name and tag. Set this parameter as required. |
| --build-arg | Specifies parameters in the Dockerfile. |
| NNRT_PKG | nnrt-name: specifies the name of the offline inference engine package. Do not omit the file name extension. Replace it with the actual one. |
| DIST_PKG | distpackage-name: specifies the name of the compressed package of the service inference program. Do not omit the file name extension. Replace it with the actual one. |

If "Successfully built xxx" is displayed, the image is successfully created.

Step 4 After the image is created, run the following command to view the image information:

docker images

Example:

REPOSITORY       TAG    IMAGE ID       CREATED             SIZE
workload-image   v1.0   1372d2961ed2   About an hour ago   249MB

----End

Compilation Examples

Compilation example of install.sh

#!/bin/bash
# Go to the container working directory.
cd /root
# Decompress the service inference program package based on the package format.
tar xf dist.tar
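Because the decompression command in install.sh must match the package format, one option is to derive it from the file name extension. The following sketch is an assumption about how that could be done, not the script shipped with the product. extract_cmd is a hypothetical helper, and it covers only the .tar and .tgz formats named in this guide.

```shell
# extract_cmd PKG
# Prints the tar command that matches the extension of PKG.
# Returns 1 and writes a message to stderr for any other extension.
extract_cmd() {
    case "$1" in
        *.tar) echo "tar xf $1" ;;
        *.tgz) echo "tar xzf $1" ;;
        *)
            echo "unsupported package format: $1" >&2
            return 1
            ;;
    esac
}

# Usage inside install.sh (dist.tgz is a placeholder package name):
# cmd=$(extract_cmd dist.tgz) && $cmd
```

Printing the command instead of running it directly makes the choice easy to inspect during image development.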

Compilation example of run.sh

#!/bin/bash
# Start the slogd daemon process.
mkdir -p /usr/slog
/var/slogd &
# Access the directory where the executable file of the service inference program is located.
cd /root/dist
# Run the executable file.
./main

3.1.2 Deploying an Inference Container

This section describes how to start a container image on a single device. If you need to deploy container images in batches on FusionDirector, see the MindX Edge Application Deployment and Model Update Guide.

Prerequisites

A container image has been created.

Procedure

Step 1 Log in to the accelerator module as the root user.

Step 2 Run the following command to start the container image. Change it based on the site requirements.

docker run -it \
--device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/svm0 \
--device=/dev/log_drv --device=/dev/event_sched --device=/dev/upgrade \
--device=/dev/hi_dvpp --device=/dev/memory_bandwidth --device=/dev/ts_aisle \
-v /var:/var -v /usr/lib64:/usr/lib64 \
-v /etc/hdcBasic.cfg:/etc/hdcBasic.cfg -v /etc/rc.local:/etc/rc.local \
-v /sys:/sys -v /usr/bin/sudo:/usr/bin/sudo \
-v /usr/lib/sudo/:/usr/lib/sudo/ -v /etc/sudoers:/etc/sudoers/ \
workload-image:v1.0

In the preceding command example, the service program is run by default. If you need to directly access the container, add /bin/bash to the end of the command.

Table 3-3 Parameter description

| Parameter | Description |
|---|---|
| --device | Adds a host device to the container. Replace davinci0 with the actual device name. If multiple chips need to be mapped, specify the parameter once for each chip, for example, --device=/dev/davinci0 --device=/dev/davinci1. |
| workload-image:v1.0 | Generated image file. |
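When several chips are mapped, the repeated --device parameters can be generated from a list of chip indices. This is a small sketch under the assumption that the chips appear on the host as /dev/davinciN, as in the example above. device_flags is a hypothetical helper name.

```shell
# device_flags INDEX...
# Prints one --device=/dev/davinciINDEX flag per argument, space-separated.
# Prints an empty line when called without arguments.
device_flags() {
    local flags=""
    local i
    for i in "$@"; do
        flags="${flags:+$flags }--device=/dev/davinci${i}"
    done
    echo "$flags"
}

# Usage (the remaining devices and volumes from the full command still apply):
# docker run -it $(device_flags 0 1) --device=/dev/davinci_manager ... workload-image:v1.0
```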

NOTE

In this version, multiple containers can be mounted to a single processor.

● Up to 16 containers can be mounted to a single processor.

● Containers obtain the computing power of processors in preemption mode. Memory isolation and computing power slicing are not supported.

● Device sharing is disabled by default.

  – Run the following command on the host to enable device sharing:

    npu-smi set -t device-share -i 0 -c 0 -d 1

    Run the following command to query the device sharing status:

    npu-smi info -t device-share -i 0 -c 0

    After the restart or upgrade, the multi-container sharing function is disabled.

  – The application program can enable this function by calling the DaVinci Card Management Interface (DCMI). For details, see the Atlas 200 AI Accelerator Module 1.0.7 or Later DCMI API Reference (Model 3000).

----End

3.2 Atlas 500 AI Edge Station

3.2.1 Creating an Inference Container Image

Prerequisites

● Obtain the offline inference engine package and service inference program package by referring to Table 3-4.

Table 3-4 Required software

| Software Package | Description | How to Obtain |
|---|---|---|
| A500-3000-nnrt_{version}_linux-aarch64.run | Offline inference engine package. {version} indicates the software package version. | Link |
| Service inference program package | Collection of service inference programs. The .tar and .tgz formats are supported. The compressed package format of the service inference program must be supported by the compression program in the container. In addition, the command for decompressing the service inference program package in install.sh must match the actual format. | Prepared by users. |
| install.sh | Installation script of the service inference program. | Prepared by users. |
| run.sh | Running script of the service inference program. | Prepared by users. |

Procedure

Step 1 Upload the following software packages to the same directory (for example, /home/test) on the edge station.

● A500-3000-nnrt_{version}_linux-aarch64.run
● Service inference program package

Step 2 Perform the following steps to create a Dockerfile:

1. Log in to the edge station as the root user and run the id HwHiAiUser command to query and record the UID and GID of the HwHiAiUser user on the host.

2. Go to the software package upload directory in Step 1 and run the following command to create a Dockerfile (for example, Dockerfile):

vi Dockerfile

3. Write the following content and run the :wq command to save the content. The Ubuntu ARM OS is used as an example.

# OS and version number. Change them based on the site requirements.
FROM ubuntu:18.04

# Set the parameters of the offline inference engine package.
ARG NNRT_PKG

# Set environment variables.
ARG ASCEND_BASE=/usr/local/Ascend
ENV LD_LIBRARY_PATH=\
$LD_LIBRARY_PATH:\
$ASCEND_BASE/nnrt/latest/acllib/lib64:\
/home/data/miniD/driver/lib64
ENV ASCEND_AICPU_PATH=\
$ASCEND_BASE/nnrt/latest

# Set the directory of the started container.
WORKDIR /root

# Copy the offline inference engine package.
COPY $NNRT_PKG .

# Install the offline inference engine package.
RUN umask 0022 && \
    chmod +x ${NNRT_PKG} && \
    ./${NNRT_PKG} --quiet --install && \
    rm ${NNRT_PKG}

# Copy the service inference program package, installation script, and running script.
ARG DIST_PKG
COPY $DIST_PKG .
COPY install.sh .
COPY run.sh /usr/local/bin/

# Run the installation script.
RUN chmod +x /usr/local/bin/run.sh && \
    sh install.sh && \
    rm $DIST_PKG && \
    rm install.sh

# Program that is run by default when the container is started.
CMD run.sh

In Dockerfile:

Create a HwHiAiUser user in the container. gid and uid in the file indicate the GID and UID of the HwHiAiUser user on the host. You can replace them as required according to Step 2.1. The GID and UID of the HwHiAiUser user in the container must be the same as those on the host.

4. After creating the Dockerfile, run the following command to change the permission on the Dockerfile:

chmod 600 Dockerfile

5. The procedure for preparing the install.sh and run.sh scripts is the same as that for preparing the Dockerfile. Compilation Examples shows the file content.

Step 3 Go to the directory where the software packages are stored and run the following command to create a container image:

docker build -t image-name --build-arg NNRT_PKG=nnrt-name --build-arg DIST_PKG=distpackage-name .

Table 3-5 Command parameter description

| Parameter | Description |
|---|---|
| image-name | Specifies the image name and tag. Set this parameter as required. |
| DIST_PKG | distpackage-name: specifies the name of the compressed package of the service inference program. Do not omit the file name extension. Replace it with the actual one. |

Step 4 After the image is created, run the following command to view the image information:

docker images

Example:

REPOSITORY       TAG    IMAGE ID       CREATED             SIZE
workload-image   v1.0   1372d2961ed2   About an hour ago   249MB

----End

Compilation Examples

Compilation example of install.sh

#!/bin/bash
# Go to the container working directory.
cd /root
# Decompress the service inference program package based on the package format.
tar xf dist.tar

Compilation example of run.sh

#!/bin/bash
# Access the directory where the executable file of the service inference program is located.
cd /root/dist
# Run the executable file.
./main

3.2.2 Deploying an Inference Container

This section describes how to start a container image on a single device. If you need to deploy container images in batches on FusionDirector, see the MindX Edge Application Deployment and Model Update Guide.

Prerequisites

A container image has been created.

Procedure

Step 1 Log in to the edge station as the root user.

Step 2 Run the following commands to start the container image:

docker run --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/hisi_hdc --device=/dev/devmm_svm \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /home/data/miniD/driver/lib64:/home/data/miniD/driver/lib64 \
-it workload-image:v1.0

In the preceding command example, the service program is run by default. If you need to directly access the container, add /bin/bash to the end of the command.

Table 3-6 Parameter description

| Parameter | Description |
|---|---|
| --device | Adds a host device to the container. Replace davinci0 with the actual device name. If multiple chips need to be mapped, specify the parameter once for each chip, for example, --device=/dev/davinci0 --device=/dev/davinci1. |
| -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi | Mounts the npu-smi tool to the container. Change the value based on the site requirements. |
| -v /home/data/miniD/driver/lib64:/home/data/miniD/driver/lib64 | Mounts the operating environment directory /home/data/miniD/driver/lib64 to the container. Change the directory based on the path of the driver .so file. |

NOTE

In this version, multiple containers can be mounted to a single processor.

● Up to 16 containers can be mounted to a single processor.

● Containers obtain the computing power of processors in preemption mode. Memory isolation and computing power slicing are not supported.

● Device sharing is disabled by default. Run the following command on the host to enable device sharing:

  npu-smi set -t device-share -i 0 -c 0 -d 1

  Run the following command to query the device sharing status:

  npu-smi info -t device-share -i 0 -c 0

  After the restart or upgrade, the multi-container sharing function is disabled.

----End

3.3 Other Inference Devices

3.3.1 Instructions

You can use either of the following methods to prepare a container image:

● (Recommended) Pull the basic inference image from AscendHub.

● Use the Dockerfile to create a container image. You can perform secondary development based on the Dockerfile of a basic image. For details, see 3.3.2 Creating an Inference Container Image.

3.3.2 Creating an Inference Container Image

Introduction

This section describes how to create a container image based on the image tree, which is scalable.

Figure 3-1 shows the inference image tree.

Figure 3-1 Inference image tree

Table 3-7 Description of the Ascend basic image tree

| Image | Description |
|---|---|
| ascend-infer | Install offline inference engine packages, such as nnrt. |

Prerequisites

● The driver and firmware have been installed on the host. For details, see section "Hardware Requirements" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).

● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Inference) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).

● For details about how to configure a network proxy, see A.1 Configuring a System Network Proxy.

Procedure

Step 1 Log in to the server as the root user.

Step 2 Copy the Dockerfile file provided by the toolbox software package to any path (for example, /home/test) on the server.

cp -r /usr/local/Ascend/toolbox/latest/Ascend-Docker-Images /home/test

Step 3 Create the image ascendbase-infer.

1. Go to the directory where the Dockerfile is stored.

cd /home/test/Ascend-Docker-Images/ascendbase-infer/{os}-{arch}

In the preceding command, {os} indicates the container image OS version, and {arch} indicates the architecture. Change them based on the actual situation.

2. Prepare the following files in the current directory:

Table 3-8 Required files

| File | Description | How to Obtain |
|---|---|---|
| Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements. |
| setproxy.sh | Configures the system network proxy and source. | (Optional) You can prepare it based on the actual situation. |
| unsetproxy.sh | Cancels the system network proxy configuration. | (Optional) You can prepare it based on the actual situation. |
| EulerOS.repo | Yum source configuration file. | Required only when the container image OS is EulerOS 2.8. |

3. Run the following command in the current directory to create the image ascendbase-infer:

docker build -t ascendbase-infer:base_TAG .

NOTE

To configure the system network proxy in this step, run the following command:

docker build -t ascendbase-infer:base_TAG --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .

In the preceding command, user indicates the username on the intranet, password indicates the user password, proxyserverip indicates the IP address of the proxy server, and port indicates the port number.

Table 3-9 Command parameter description

| Parameter | Description |
|---|---|
| ascendbase-infer:base_TAG | Image name and tag. You are advised to name base_TAG in the format of Date-Container OS-Architecture (for example, 20210106-ubuntu18.04-arm64). |
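The recommended Date-Container OS-Architecture tag can be assembled by a short script. The following is a sketch only. make_base_tag and normalize_arch are hypothetical helper names, and the mapping from the machine name printed by uname -m (aarch64) to the arm64 suffix used in the example tag is an assumption.

```shell
# normalize_arch ARCH
# Maps a machine name to the architecture suffix used in the example tags.
normalize_arch() {
    case "$1" in
        aarch64|arm64) echo "arm64" ;;
        x86_64) echo "x86_64" ;;
        *)
            echo "unknown architecture: $1" >&2
            return 1
            ;;
    esac
}

# make_base_tag DATE OS ARCH
# Prints DATE-OS-ARCH, for example 20210106-ubuntu18.04-arm64.
make_base_tag() {
    local arch
    arch=$(normalize_arch "$3") || return 1
    echo "$1-$2-$arch"
}

# Usage (ubuntu18.04 is the container image OS, not the host OS):
# docker build -t "ascendbase-infer:$(make_base_tag "$(date +%Y%m%d)" ubuntu18.04 "$(uname -m)")" .
```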

If "Successfully built xxx" is displayed, the image is successfully created.

Step 4 Create the image ascend-infer based on ascendbase-infer.

1. Go to the directory where the Dockerfile is stored.

cd /home/test/Ascend-Docker-Images/ascend-infer

2. Prepare the following software packages and files in the current directory:

Table 3-10 Required software and files

| Software or File | Description | How to Obtain |
|---|---|---|
| Ascend-cann-nnrt_{version}_linux-{arch}.run | Offline inference engine package. {version} indicates the software package version, and {arch} indicates the architecture. | Link |
| Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements. |
| ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required. After copying the file to the current directory, delete the contents of lines UserName and UserGroup from the copied file. |
| version.info | Driver package version information file. | Copy the /usr/local/Ascend/driver/version.info file from the host. Change the path as required. |
| preinstall.sh | Copies files to a specified directory. | Already exists in the current directory. You can delete the code comments. |
| postinstall.sh | Deletes the directories and files that do not need to be retained in the container. | Already exists in the current directory. You can delete the code comments. |

3. Run the following command in the current directory to create the image ascend-infer:

docker build -t ascend-infer:infer_TAG --build-arg NNRT_PKG=nnrt-name --build-arg BASE_VERSION=base_TAG --build-arg ARCH={arch} .

Table 3-11 Command parameter description

| Parameter | Description |
|---|---|
| ascend-infer:infer_TAG | Image name and tag. You are advised to name infer_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64). |
| BASE_VERSION | base_TAG: specifies the image tag set in Step 3.3. |
| ARCH | {arch} indicates the architecture. Replace it with the actual one (arm64 or x86_64). |

Step 5 After the image is created, run the following command to view the image information:

docker images

----End

3.3.3 Deploying an Inference Container

Prerequisites

● A container image has been created.

● The Docker program has been installed in the environment.

● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Inference) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).

Ascend-docker-runtime has been integrated into the toolbox. After installing the toolbox, you need to restart Docker to ensure that Ascend-docker-runtime can function properly. The container engine plugin Ascend-docker-runtime provides Ascend NPU-based containerization support for all AI inference jobs so that AI jobs can run smoothly on Ascend devices as Docker containers.

Starting the Container

Run the following command to run a container based on the new image:

docker run -it -e ASCEND_VISIBLE_DEVICES=xxx image-name /bin/bash

Table 3-12 Parameter description

| Parameter | Description |
|---|---|
| -e ASCEND_VISIBLE_DEVICES=xxx | Uses the ASCEND_VISIBLE_DEVICES environment variable to specify the NPU devices to be mounted to the container. Devices are identified by their sequence numbers. You can specify single devices, device ranges, or a combination of both. Examples: 1. -e ASCEND_VISIBLE_DEVICES=0 indicates that device 0 (/dev/davinci0) is mounted to the container. 2. -e ASCEND_VISIBLE_DEVICES=1,3 indicates that devices 1 and 3 are mounted to the container. 3. -e ASCEND_VISIBLE_DEVICES=0-2 indicates that devices 0 to 2 (including devices 0 and 2) are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2. 4. -e ASCEND_VISIBLE_DEVICES=0-2,4 indicates that devices 0 to 2 and device 4 are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2,4. |
| image-name | Image name and tag. Change them based on the actual situation (for example, ascend-infer:infer_TAG). |
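The single-device and range notation in Table 3-12 can be illustrated with a short expansion routine. The sketch below only mirrors the semantics described in the table and is not the parser used by Ascend-docker-runtime. expand_devices is a hypothetical helper name, and the input is assumed to be well formed (non-negative integers, ranges written as low-high).

```shell
# expand_devices SPEC
# Prints the device sequence numbers selected by SPEC, space-separated.
# SPEC is a comma-separated list of single numbers and low-high ranges.
expand_devices() {
    local spec="$1"
    local out=""
    local part i end
    local IFS=','
    for part in $spec; do
        case "$part" in
            *-*)
                # Range: walk from the lower bound to the upper bound, inclusive.
                i="${part%-*}"
                end="${part#*-}"
                while [ "$i" -le "$end" ]; do
                    out="${out:+$out }$i"
                    i=$((i + 1))
                done
                ;;
            *)
                # Single device number.
                out="${out:+$out }$part"
                ;;
        esac
    done
    echo "$out"
}

# Usage:
# expand_devices 0-2,4    # prints: 0 1 2 4
```

Each number N in the output corresponds to the host device /dev/davinciN, as in the first example of the table.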

If the container ID (3330d6524117 in this example) is displayed, the container is running and you have accessed the container.

NOTE

● For the default contents mounted to Ascend Docker Runtime, see B.1 Default Mounted Contents of Ascend Docker Runtime. If you use a specified path to install the driver on the host, you need to mount required directories and files. Example command (change it based on the actual situation):

  docker run -it -e ASCEND_VISIBLE_DEVICES=xxx -v ${install_path}/driver:${install_path}/driver -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi image-name

  ${install_path} indicates the driver installation path.

● If a large amount of mounting content is required, you can use the ASCEND_RUNTIME_MOUNTS environment variable to read the mounting content in the configuration file. For details about how to create a configuration file, see A.2 Creating a Configuration File. The following uses the hostlog.list configuration file as an example:

  docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_MOUNTS=base,hostlog image-name /bin/bash

  In the preceding command, base indicates the basic configuration file, as shown in B.1 Default Mounted Contents of Ascend Docker Runtime. Change hostlog to the actual configuration file name.

● You can use the ASCEND_RUNTIME_OPTIONS environment variable to adjust the mounted targets. If the value is set to NODRV, non-device directories and files are not mounted. For example:

  docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_OPTIONS=NODRV image-name /bin/bash

  Only NPU devices and management devices (for example, /dev/davinci0, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm) are mounted to the container to support driver installation in the container.

● The methods of viewing logs in the container and setting log levels are the same as those of a host. For details, see CANN Log Reference (Ascend 310).

  – Viewing Logs

    You can view the application logs on the host and device, but the system logs on the device are unavailable.

  – Setting the Log Level

    You can set the level of logs on the host, but the level of logs on the device cannot be set.

4 Training

4.1 Instructions
4.2 Creating a Training Container Image
4.3 Deploying a Training Container

4.1 Instructions

You can use either of the following methods to prepare a container image:

● (Recommended) Pull the basic training image from AscendHub.

● Use the Dockerfile to create a container image. You can perform secondary development based on the Dockerfile of a basic image. For details, see 4.2 Creating a Training Container Image.

4.2 Creating a Training Container Image

Introduction

This section describes how to create a container image based on the image tree, which is scalable.

Figure 4-1 Training image tree

Table 4-1 Description of the Ascend basic image tree

| Image | Description |
|---|---|
| ascendbase-train | Install system components and Python third-party dependencies. |
| ascend-train | Install training software packages, such as nnae. |
| ascend-tensorflow | Install the framework plugin package TFPlugin and TensorFlow framework. |

Prerequisites

● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Training) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).

● For details about how to configure a network proxy, see A.1 Configuring a System Network Proxy.

Procedure

Step 1 Log in to the server as the root user.

Step 2 Copy the Dockerfile file provided by the toolbox software package to any path (for example, /home/test) on the server.

cp -r /usr/local/Ascend/toolbox/latest/Ascend-Docker-Images /home/test

Step 3 Create the image ascendbase-train.

1. Go to the directory where the Dockerfile is stored.

cd /home/test/Ascend-Docker-Images/ascendbase-train/{os}-{arch}

In the preceding command, {os} indicates the container image OS version, and {arch} indicates the architecture. Change them based on the actual situation.

2. Prepare the following files in the current directory:

Table 4-2 Required files

| File | Description | How to Obtain |
|---|---|---|
| Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements. |
| setproxy.sh | Configures the system network proxy and source. | (Optional) You can prepare it based on the actual situation. |
| unsetproxy.sh | Cancels the system network proxy configuration. | (Optional) You can prepare it based on the actual situation. |
| libstdc++.so.6.0.24 | Dynamic library file. The libstdc++.so.6.0.24 file is required only when the container image OS is CentOS. | You can run the find command to query the path of the libstdc++.so.6.0.24 file and copy the file from the host. |
| npy_math_internal.h.src.patch | Resolves the NumPy installation error in the CentOS image. | Already exists in the current directory. This file is required only when the container image OS is CentOS ARM64. |

3. Run the following command in the current directory to create the image ascendbase-train:

docker build -t ascendbase-train:base_TAG .

NOTE

To configure the system network proxy in this step, run the following command:

docker build -t ascendbase-train:base_TAG --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .

Table 4-3 Command parameter description

| Parameter | Description |
|---|---|
| ascendbase-train:base_TAG | Image name and tag. You are advised to name base_TAG in the format of Date-Container OS-Architecture (for example, 20210106-ubuntu18.04-arm64). |

If "Successfully built xxx" is displayed, the image is successfully created.

Step 4 Create the image ascend-train based on ascendbase-train.

1. Go to the directory where the Dockerfile is stored.

cd /home/test/Ascend-Docker-Images/ascend-train

2. Prepare the following software packages and files in the current directory:

Table 4-4 Required software and files

| Software or File | Description | How to Obtain |
|---|---|---|
| Ascend-cann-nnae_{version}_linux-{arch}.run | Deep learning acceleration engine package. {version} indicates the software package version, and {arch} indicates the architecture. | Link |
| Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements. |

3. Run the following command in the current directory to create the image ascend-train:

docker build -t ascend-train:train_TAG --build-arg NNAE_PKG=nnae-name --build-arg BASE_VERSION=base_TAG .


ascend-train:train_TAG Image name and tag. You are advised to name train_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64).

NNAE_PKG nnae-name: specifies the name of the deep learning acceleration engine package. Do not omit the file name extension. Replace it with the actual one.

BASE_VERSION base_TAG: specifies the image tag set in Step 3.3.

If "Successfully built xxx" is displayed, the image is successfully created.

Step 5 Create the image ascend-tensorflow based on ascend-train.

1. Go to the directory where the Dockerfile is stored.

cd /home/test/Ascend-Docker-Images/ascend-tensorflow

2. Prepare the following software packages and files in the current directory:

Table 4-6 Required software and files

| Ascend-cann-tfplugin_{version}_linux-{arch}.run | Framework plugin package. {version} indicates the software package version, and {arch} indicates the architecture. | Link |
| tensorflow-1.15.0-cp37-cp37m-{version}.whl | TensorFlow framework WHL package. | AArch64 architecture: Link. x86_64 architecture: Not needed. |
|  | creating an image. | Already exists in the current directory. You can customize it based on the site requirements. |
| ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required. After copying the file to the current directory, delete the contents of lines UserName and UserGroup from the copied file. |
| version.info | Driver package version information file. | Copy the /usr/local/Ascend/driver/version.info file from the host. Change the path as required. |
| preinstall.sh | Copies files to a specified directory. | Already exists in the current directory. You can delete the code comments. |
| postinstall.sh | Deletes the directories and files that do not need to be retained in the container. |  |

3. Run the following command in the current directory to create the ascend-tensorflow image. (If the x86_64 architecture is used, you do not need to enter the TF_PKG parameter. Delete --build-arg TF_PKG=tensorflow-name in the following command.)

docker build -t ascend-tensorflow:tensorflow_TAG --build-arg TFPLUGIN_PKG=tfplugin-name --build-arg BASE_VERSION=train_TAG --build-arg TF_PKG=tensorflow-name .

NOTE

To configure the system network proxy in this step, run the following command:

docker build -t ascend-tensorflow:tensorflow_TAG --build-arg TFPLUGIN_PKG=tfplugin-name --build-arg BASE_VERSION=train_TAG --build-arg TF_PKG=tensorflow-name --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .


ascend-tensorflow:tensorflow_TAG Image name and tag. You are advised to name tensorflow_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64).

TFPLUGIN_PKG tfplugin-name: specifies the name of the framework plugin package. Do not omit the file name extension. Replace it with the actual one.

BASE_VERSION train_TAG: specifies the image tag set in Step 4.3.

TF_PKG tensorflow-name: specifies the name of the TensorFlow framework WHL package. This parameter is required only when the ARM64 architecture is used.
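Because TF_PKG is passed only on ARM64, the build command differs by architecture. The following Python sketch assembles the argument list under that rule. The function name tensorflow_build_command is a hypothetical helper, and the accepted architecture strings ("arm64", "aarch64", "x86_64") are assumptions chosen for illustration; the build arguments themselves are those listed above.

```python
import shlex
from typing import List, Optional


def tensorflow_build_command(tag: str, tfplugin_pkg: str, base_version: str,
                             arch: str, tf_pkg: Optional[str] = None) -> List[str]:
    """Return the docker build argument list for the ascend-tensorflow image."""
    args = [
        "docker", "build", "-t", f"ascend-tensorflow:{tag}",
        "--build-arg", f"TFPLUGIN_PKG={tfplugin_pkg}",
        "--build-arg", f"BASE_VERSION={base_version}",
    ]
    if arch in ("arm64", "aarch64"):
        # The TensorFlow WHL package is needed only on ARM64.
        if tf_pkg is None:
            raise ValueError("tf_pkg is required on ARM64")
        args += ["--build-arg", f"TF_PKG={tf_pkg}"]
    elif arch != "x86_64":
        raise ValueError(f"unsupported architecture: {arch!r}")
    args.append(".")  # build context: the current directory
    return args


# Placeholder values, to be replaced with the actual tag and package names.
print(shlex.join(tensorflow_build_command(
    "tensorflow_TAG", "tfplugin-name", "train_TAG", "arm64", "tensorflow-name")))
```

Returning a list instead of a single string keeps each argument separate, so the result can be handed to subprocess.run without further quoting.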

information:

docker images

----End

4.3 Deploying a Training Container

Prerequisites

● A container image has been created.

● The Docker program has been installed in the environment.

● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Training) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).

Ascend-docker-runtime has been integrated into the toolbox. After installing the toolbox, you need to restart Docker to ensure that Ascend-docker-runtime can function properly. The container engine plugin Ascend-docker-runtime provides Ascend NPU-based containerization support for all AI inference jobs so that AI jobs can run smoothly on Ascend devices as Docker containers.

● If multiple servers are used for cluster training, ensure that the NPU IP address has been configured on the host before deploying the training container. For details, see A.3 Changing NPU IP Addresses.


Starting the Container

CAUTION

When the host environment is CentOS, the SELinux security module of CentOS is enabled by default. As a result, the local directory mounted to the container does not have the execute permission. Therefore, you need to run the su -c "setenforce 0" command to temporarily disable SELinux. After related services are complete, run the su -c "setenforce 1" command to enable SELinux again.

Run the following command to run a container based on the new image:

docker run -it -e ASCEND_VISIBLE_DEVICES=xxx --pids-limit 409600 image-name /bin/bash

Table 4-8 Parameter description

-e ASCEND_VISIBLE_DEVICES=xxx
Uses the ASCEND_VISIBLE_DEVICES environment variable to specify the NPU devices to be mounted to the container. Devices are identified by their sequence numbers. You can specify single devices, device ranges, or a combination of both. Examples:

1. -e ASCEND_VISIBLE_DEVICES=0 indicates that device 0 (/dev/davinci0) is mounted to the container.

2. -e ASCEND_VISIBLE_DEVICES=1,3 indicates that devices 1 and 3 are mounted to the container.

3. -e ASCEND_VISIBLE_DEVICES=0-2 indicates that devices 0 to 2 (including devices 0 and 2) are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2.

4. -e ASCEND_VISIBLE_DEVICES=0-2,4 indicates that devices 0 to 2 and device 4 are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2,4.

--pids-limit 409600
If the host OS is CentOS or BC-Linux, the maximum number of threads in Docker is 4092, which cannot meet the training requirements. Set this parameter when starting the container to raise the maximum number of Docker threads on CentOS/BC-Linux.

image-name
Image name and tag. Change them based on the actual situation (for example, ascend-tensorflow:tensorflow_TAG).
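The ASCEND_VISIBLE_DEVICES syntax described in Table 4-8 (single devices, ranges, and combinations) can be illustrated with a short Python sketch that expands a value into the list of device sequence numbers. This mirrors the four examples in the table only; it is not the parser used by Ascend Docker Runtime, and the function name expand_visible_devices is hypothetical.

```python
from typing import List


def expand_visible_devices(value: str) -> List[int]:
    """Expand a value such as "0-2,4" into [0, 1, 2, 4]."""
    devices: List[int] = []
    for part in value.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "0-2" includes both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid device range: {part!r}")
            devices.extend(range(start, end + 1))
        else:
            devices.append(int(part))
    return devices


for example in ("0", "1,3", "0-2", "0-2,4"):
    print(example, "->", expand_visible_devices(example))
```

Expanding the value before starting the container is one way to confirm that a range covers exactly the devices you intend to mount.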

If the container ID (3330d6524117 in this example) is displayed, the container is running and you have accessed the container.

root@3330d6524117:/#

NOTE

● For the default contents mounted to Ascend Docker Runtime, see B.1 Default Mounted Contents of Ascend Docker Runtime. If you use a specified path to install the driver on the host, you need to mount required directories and files. Example command (change it based on the actual situation):

docker run -it -e ASCEND_VISIBLE_DEVICES=xxx --pids-limit 409600 -v ${install_path}/driver/lib64:${install_path}/driver/lib64 -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi image-name /bin/bash

${install_path} indicates the installation path of the Ascend AI Processor driver.

● If a large amount of mounting content is required, you can use the ASCEND_RUNTIME_MOUNTS environment variable to read the mounting content in the configuration file. For details about how to create a configuration file, see A.2 Creating a Configuration File. The following uses the hostlog.list configuration file as an example:

docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_MOUNTS=base,hostlog image-name /bin/bash

In the preceding command, base indicates the basic configuration file, as shown in B.1 Default Mounted Contents of Ascend Docker Runtime. Change hostlog to the actual configuration file name.

● You can use the ASCEND_RUNTIME_OPTIONS environment variable to adjust the mounted targets. If the value is set to NODRV, non-device directories and files are not mounted. For example:

docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_OPTIONS=NODRV image-name /bin/bash

Only NPU devices and management devices (for example, /dev/davinci0, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm) are mounted to the container to support driver installation in the container.

● If you need to run related services when deploying a training container, you are advised to use a non-root user.

● The methods of viewing logs in the container and setting log levels are the same as those of a host. For details, see the CANN Log Reference (Ascend 910).

● Viewing Logs

You can view the application logs on the host and device, but the system logs on the device are unavailable.

● Setting the Log Level

You can set the level of logs on the host, but the level of logs on the device cannot be set.
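For the first note item (a driver installed in a specified path), the three -v options follow directly from the installation path. The Python sketch below derives them; the function name driver_mount_args and the sample path /opt/Ascend are hypothetical and used for illustration only, while the mounted paths are the ones given in the example command in the note.

```python
from typing import List


def driver_mount_args(install_path: str) -> List[str]:
    """Return the -v options needed when the driver is installed in install_path."""
    lib64 = f"{install_path.rstrip('/')}/driver/lib64"
    mounts = [
        (lib64, lib64),  # user-mode driver libraries
        ("/usr/local/dcmi", "/usr/local/dcmi"),
        ("/usr/local/bin/npu-smi", "/usr/local/bin/npu-smi"),
    ]
    args: List[str] = []
    for host_path, container_path in mounts:
        args += ["-v", f"{host_path}:{container_path}"]
    return args


# /opt/Ascend is a made-up installation path.
print(" ".join(driver_mount_args("/opt/Ascend")))
```

The returned list can be placed between the --pids-limit option and the image name of the docker run command.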


A Common Operations

A.1 Configuring a System Network Proxy

The following procedure is a general method for configuring a network proxy. It may not be applicable to all network environments. The method of configuring the network proxy depends on the actual network environment.

Prerequisites

● Ensure that the network cable of the server is connected and the proxy server can connect to the external network.

● This procedure assumes that the server is located on an intranet and cannot connect to the external network directly.

Configuring a System Network Proxy

Step 1 Log in to the user environment as the root user.

Step 2 Run the following command to edit the /etc/profile file:

vi /etc/profile

Add the following content to the file, save the file, and exit:

export http_proxy="http://user:password@proxyserverip:port"

export https_proxy="http://user:password@proxyserverip:port"

In the preceding commands, user indicates the username on the intranet, password (special characters need to be converted) indicates the user password, proxyserverip indicates the IP address of the proxy server, and port indicates the port number.

Step 3 Run the following command to make the configuration take effect:

source /etc/profile

Step 4 Run the following command to check whether the external network is connected:

wget www.baidu.com

If the HTML file can be downloaded, the server is connected to the external network successfully.


NOTE

If a certificate error occurs when you use a proxy to connect to the network, you need to install the certificate of the proxy server before downloading third-party components.

----End
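Step 2 states that special characters in the password need to be converted. One common reading, assumed here, is percent-encoding, because characters such as @, : and / otherwise change how the proxy URL is parsed. The following Python sketch builds the proxy URL on that assumption; the function name proxy_url and the sample credentials and address are hypothetical.

```python
from urllib.parse import quote


def proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Build an http proxy URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Made-up credentials and address, for illustration only.
print(proxy_url("user", "p@ss:w/rd", "192.0.2.10", 8080))
```

The resulting string can be used as the value of http_proxy and https_proxy in /etc/profile.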

A.2 Creating a Configuration File

If a large number of files or directories need to be mounted, you can write the files or directories to be mounted to a configuration file. The procedure is as follows:

Step 1 Go to the configuration file directory:

cd /etc/ascend-docker-runtime.d/

The basic configuration file base.list exists in the directory. The content is the default mounting objects of Ascend Docker Runtime. For details, see B.1 Default Mounted Contents of Ascend Docker Runtime. In principle, the base.list file cannot be modified.

Step 2 Create and edit the configuration file. The file name can be customized, for example, hostlog.list.

vi hostlog.list

Step 3 Write the files or directories to be mounted to hostlog.list. The following is an example:

/usr/slog/slog
/var/log/npu/slog

Step 4 Save the configuration and exit.

----End
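In the deployment example, the files base.list and hostlog.list are referenced as ASCEND_RUNTIME_MOUNTS=base,hostlog, which suggests that the variable takes the configuration file names without the .list extension, separated by commas. The Python sketch below derives the value on that assumption; the function name runtime_mounts_value is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable


def runtime_mounts_value(list_files: Iterable[str]) -> str:
    """Turn configuration file names into an ASCEND_RUNTIME_MOUNTS value."""
    names = []
    for file_name in list_files:
        path = PurePosixPath(file_name)
        if path.suffix != ".list":
            raise ValueError(f"not a .list configuration file: {file_name!r}")
        names.append(path.stem)  # file name without directory and extension
    return ",".join(names)


print(runtime_mounts_value(["base.list", "hostlog.list"]))
```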

A.3 Changing NPU IP Addresses

When multiple servers are used for distributed training, you need to use the HCCN tool in the Ascend software to configure the NPU IP addresses (the NIC IP addresses of devices) so that the network model parameters between multiple training servers can be transmitted and synchronized through the optical ports on the NPUs. Ensure that the parameters of the network model can be updated synchronously when each training server performs training. This section describes only the commands for configuring the network by using the HCCN tool. If you need to use other functions of the HCCN tool (for example, checking the link status of a network port), see the Ascend 910 HCCN Tool Interface Reference.

Configuring NIC IP Address of a Device

Atlas 800 training server and Atlas 900 AI cluster

● SMP (symmetric multi-processor) mode

Log in to the AI Servers as the root user and configure the NIC IP address of each device. The configuration requirements are as follows:


– NICs 0 and 4, 1 and 5, 2 and 6, and 3 and 7 of an AI Server must be in the same network segment respectively. NICs 0, 1, 2, and 3 must be in different network segments. NICs 4, 5, 6, and 7 must be in different network segments.

– In the cluster scenario, the devices in the similar positions on AI Servers must be in the same network segment. For example, NIC 0 of AI Server 1 and AI Server 2 must be in the same network segment, and NIC 1 of AI Server 1 and AI Server 2 must be in the same network segment.

hccn_tool -i 0 -ip -s address 192.168.100.101 netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.101.101 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.102.101 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.103.101 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.100.100 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.101.100 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.102.100 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.103.100 netmask 255.255.255.0

● AMP (asymmetric multi-processor) mode

You do not need to restrict network segments. All NICs must be in the same network segment.
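The SMP-mode segment rules (NIC n and NIC n+4 share a segment, while NICs 0 to 3 use different segments) can be captured in a short Python sketch that prints the eight hccn_tool commands for one server. The 192.168 prefix, the segment numbering starting at 100, and the netmask are taken from the example commands above and are for reference only; the function name smp_hccn_commands is hypothetical.

```python
from typing import List


def smp_hccn_commands(host_for_nics_0_3: int, host_for_nics_4_7: int,
                      first_segment: int = 100,
                      netmask: str = "255.255.255.0") -> List[str]:
    """Generate hccn_tool commands for the eight NICs of one AI Server."""
    if host_for_nics_0_3 == host_for_nics_4_7:
        # NIC n and NIC n+4 share a segment, so their host parts must differ.
        raise ValueError("the two host numbers must differ")
    commands = []
    for nic in range(8):
        segment = first_segment + nic % 4  # NIC n and NIC n+4 share a segment
        host = host_for_nics_0_3 if nic < 4 else host_for_nics_4_7
        commands.append(
            f"hccn_tool -i {nic} -ip -s address 192.168.{segment}.{host} netmask {netmask}"
        )
    return commands


for command in smp_hccn_commands(101, 100):
    print(command)
```

In a cluster, calling the function with a different pair of host numbers for each AI Server keeps NICs in the same position on different servers in the same segment without address conflicts.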

Atlas 300T training card

Each server can be configured with one Atlas 300T training card or two. Each card corresponds to one Device OS and needs to be configured with one IP address. Different cards need to be configured with IP addresses in the same network segment.

Log in to the AI Servers as the root user and configure the NIC IP address of each device. The configuration operations are as follows:

Step 1 Run the npu-smi info command to view the IDs of the devices to be configured. For example, in Figure A-1, the NPU IDs are 1 and 4. Use the actual NPU IDs in the query result.

Figure A-1 Checking the device ID

Step 2 Run the following command to configure the NIC IP addresses of the device. The IP addresses used in the following example are for reference only.

hccn_tool -i 1 -ip -s address 192.168.0.2 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.0.3 netmask 255.255.255.0

----End

NOTE

● Ensure that the npu-smi tool has been installed on the server.


B Reference

B.1 Default Mounted Contents of Ascend Docker Runtime

In addition to NPUs and management devices (/dev/davinciX, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm), Ascend Docker Runtime mounts the following directories and files to containers in read-only mode based on the actual situation.

Table B-1 Default mounted directories and files

| Directory | Description |
| /usr/local/Ascend/driver/lib64 | Directory for storing user-mode libraries provided by the driver. |
| /usr/local/Ascend/driver/include | Directory for storing the header file (dsmi_common_interface.h) provided by the driver. |
| /usr/local/dcmi | Directory for storing DCMI header files and libraries. |