V100R020C20
Basic Container Image Creation Guide
Issue 01
Contents
1 Overview
2 Container Engine Plugin
3 Inference
3.1 Atlas 200 AI Accelerator Module (RC Scenario)
3.1.1 Creating an Inference Container Image
3.1.2 Deploying an Inference Container
3.2 Atlas 500 AI Edge Station
3.2.1 Creating an Inference Container Image
3.2.2 Deploying an Inference Container
3.3 Other Inference Devices
3.3.1 Instructions
3.3.2 Creating an Inference Container Image
3.3.3 Deploying an Inference Container
4 Training
4.1 Instructions
4.2 Creating a Training Container Image
4.3 Deploying a Training Container
A Common Operations
A.1 Configuring a System Network Proxy
A.2 Creating a Configuration File
A.3 Changing NPU IP Addresses
B Reference
1 Overview
This document describes how to use a Dockerfile to create a container image. AscendHub at the Ascend community provides some basic container images. You can obtain images from AscendHub based on the site requirements.
This document applies to the following Ascend AI devices:
● Inference
– Atlas 200 AI accelerator module (RC scenario)
– Atlas 500 AI edge station
– Other inference devices
▪ Atlas 300I inference card
▪ Atlas 500 Pro AI edge server
▪ Atlas 800 inference server
● Training
– Atlas 300T training card
– Atlas 800 training server
– Atlas 900 AI cluster
2 Container Engine Plugin
Ascend Docker is a basic component of CANN. It provides Ascend NPU (Ascend AI Processor) containerization support for all AI training and inference jobs so that AI jobs can run smoothly on Ascend devices as Docker containers, as shown in Figure 2-1. Ascend-docker-runtime is the software package released with Ascend Docker and has been integrated into the toolbox.
Figure 2-1 Ascend Docker
Features of Ascend Docker
● Full decoupling: Ascend Docker is decoupled from Docker, so the Docker code does not need to be modified and the runtime can evolve independently.
● Backward compatibility: The runtime is optional and does not affect the use of native Docker.
● Easy adaptation: Ascend Docker adapts to the customer's existing platform and system without affecting the original Docker command interface.
● Easy deployment: An RPM package is provided for deployment. After the installation, you can use Docker to create a container to which the Ascend NPU is mounted.
Ascend Docker Design
Ascend Docker is essentially a Docker runtime implemented based on the Open Container Initiative (OCI) standard. It works as a plugin that provides the Ascend NPU adaptation function for Docker without modifying the Docker engine.
As shown in Figure 2-2, Ascend Docker connects to the native Docker through the OCI interface. When runc of the native Docker starts a container, prestart-hook is called to configure and manage the container.
Figure 2-2 Docker Adaptation Principles
prestart-hook is a hook defined by OCI that runs during the transition from the created state to the running state. At this point, the namespace of the container has been created, but the job of the container has not started. Devices can therefore be mounted to the container and the cgroup can be configured, and the configuration is then available to the jobs that start later.
Ascend Docker performs the following operations on the container in the prestart-hook function:
1. Mounts the NPU device to the namespace of the container based on ASCEND_VISIBLE_DEVICES.
2. Configures the device cgroup of the container on the host so that the container can use only the specified NPU, which ensures device isolation.
3 Inference
3.1 Atlas 200 AI Accelerator Module (RC Scenario)
3.2 Atlas 500 AI Edge Station
3.3 Other Inference Devices
3.1 Atlas 200 AI Accelerator Module (RC Scenario)
3.1.1 Creating an Inference Container Image
Prerequisites
● In the container scenario, install Docker 18.03 or later.
● The container OS image can be obtained from Docker Hub.
● Obtain the offline inference engine package and service inference program package by referring to Table 3-1.
Table 3-1 Required software
Software Package | Description | How to Obtain
Ascend-cann-nnrt_{version}_linux-aarch64.run | Offline inference engine package. {version} indicates the software package version. | Link
Dockerfile | Required for creating an image. | Prepared by users.
ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required.
version.info | Driver version information file. | Copy the /var/davinci/driver/version.info file from the host. Change the path as required.
Service inference program package | Collection of service inference programs. The .tar and .tgz formats are supported. The compressed package format must be supported by the compression program in the container, and the decompression command in install.sh must match the actual format. | Prepared by users.
install.sh | Installation script of the service inference program. | Prepared by users.
run.sh | Running script of the service inference program. | Prepared by users.
Procedure
Step 1 Upload the following software packages to the same directory (for example, /home/test) on the accelerator module.
● Ascend-cann-nnrt_{version}_linux-aarch64.run
● ascend_install.info
● version.info
● Service inference program package
Step 2 Perform the following steps to create a Dockerfile:
1. Log in to the accelerator module as the root user and run the id HwHiAiUser command to query and record the UID and GID of the HwHiAiUser user on the host.
2. Go to the software package upload directory in Step 1 and run the following command to create a Dockerfile (for example, Dockerfile):
vi Dockerfile
3. Write the following content and run the :wq command to save the content. The Ubuntu ARM OS is used as an example. (The following content is only an example. You can perform secondary development based on the site requirements.)
# OS and version number. Change them based on the site requirements.
FROM ubuntu:18.04

# Set the parameters of the offline inference engine package.
ARG NNRT_PKG

# Set environment variables.
ARG ASCEND_BASE=/usr/local/Ascend
ENV LD_LIBRARY_PATH=\
$LD_LIBRARY_PATH:\
$ASCEND_BASE/nnrt/latest/acllib/lib64:\
/usr/lib64

# Set the directory of the started container.
WORKDIR /root

# Copy the offline inference engine package.
COPY $NNRT_PKG .
COPY ascend_install.info /etc/
RUN mkdir -p /var/davinci/driver
COPY version.info /var/davinci/driver

# Install the offline inference engine package.
RUN umask 0022 && \
    groupadd -g gid HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && usermod -u uid HwHiAiUser && \
    chmod +x ${NNRT_PKG} && \
    ./${NNRT_PKG} --quiet --install && \
    rm ${NNRT_PKG} && \
    rm -rf /var/davinci/driver

# Copy the service inference program package, installation script, and running script.
ARG DIST_PKG
COPY $DIST_PKG .
COPY install.sh .
COPY run.sh /usr/local/bin/

# Run the installation script.
RUN chmod +x /usr/local/bin/run.sh && \
    sh install.sh && \
    rm $DIST_PKG && \
    rm install.sh

# Program that is run by default when the container is started.
CMD run.sh
In the Dockerfile, the following command creates the HwHiAiUser user in the container:
groupadd -g gid HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && usermod -u uid HwHiAiUser &&\
gid and uid indicate the GID and UID of the HwHiAiUser user on the host. Replace them with the values recorded in Step 2.1. The GID and UID of the HwHiAiUser user in the container must be the same as those on the host.
4. After creating the Dockerfile, run the following command to change the permission on the Dockerfile:
chmod 600 Dockerfile
5. The procedure for preparing the install.sh and run.sh scripts is the same as that for preparing the Dockerfile. Compilation Examples shows the file content.
Step 3 Go to the directory where the software packages are stored and run the following command to create a container image:
docker build -t image-name --build-arg NNRT_PKG=nnrt-name --build-arg DIST_PKG=distpackage-name .
Do not omit . at the end of the command. Table 3-2 describes the parameters in the command.
Table 3-2 Command parameter description
Parameter | Description
image-name | Specifies the image name and tag. Set this parameter as required.
--build-arg | Specifies parameters in the Dockerfile.
NNRT_PKG | nnrt-name: specifies the name of the offline inference engine package. Do not omit the file name extension. Replace it with the actual one.
DIST_PKG | distpackage-name: specifies the name of the compressed package of the service inference program. Do not omit the file name extension. Replace it with the actual one.
If "Successfully built xxx" is displayed, the image is successfully created.
Step 4 After the image is created, run the following command to view the image information:
docker images
Example:
REPOSITORY TAG IMAGE ID CREATED SIZE
workload-image v1.0 1372d2961ed2 About an hour ago 249MB
----End
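The gid and uid placeholders in the Dockerfile have to be replaced with the values that the id HwHiAiUser command reports on the host. The following is a minimal sketch of that substitution, assuming a hypothetical helper named render_user_cmd (it is not part of any Ascend package). It only prints the resulting command so that it can be checked before being pasted into the Dockerfile.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): prints the user-creation command
# with the gid and uid placeholders replaced by real values.
render_user_cmd() {
    local gid="$1" uid="$2"
    echo "groupadd -g ${gid} HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && usermod -u ${uid} HwHiAiUser"
}

# On the host, the real values would come from the id command, for example:
#   render_user_cmd "$(id -g HwHiAiUser)" "$(id -u HwHiAiUser)"
# The two values of 1000 below are placeholders for illustration only.
render_user_cmd 1000 1000
```

Printing the command instead of editing the Dockerfile in place keeps the sketch free of side effects, so the result can be reviewed first.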
Compilation Examples
Compilation example of install.sh
#!/bin/bash
# Go to the container working directory.
cd /root
# Decompress the service inference program package based on the package format.
tar xf dist.tar
Compilation example of run.sh
#!/bin/bash
# Start the slogd daemon process.
mkdir -p /usr/slog
/var/slogd &
# Access the directory where the executable file of the service inference program is located.
cd /root/dist
# Run the executable file.
./main
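Because the decompression command in install.sh has to match the package format, one option is to choose the command from the file name extension. The following is a sketch under that assumption. The function name unpack_cmd is hypothetical, and only the .tar and .tgz formats named in this guide are handled.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): prints the tar command that
# matches the package format, or fails for an unsupported extension.
unpack_cmd() {
    local pkg="$1"
    case "$pkg" in
        *.tgz) echo "tar xzf $pkg" ;;
        *.tar) echo "tar xf $pkg" ;;
        *)     echo "unsupported package format: $pkg" >&2; return 1 ;;
    esac
}

unpack_cmd dist.tar
unpack_cmd dist.tgz
```

In an actual install.sh, the printed command would be executed instead of echoed, and the tar program in the container image would need to support the chosen format.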
3.1.2 Deploying an Inference Container
This section describes how to start a container image on a single device. If you need to deploy container images in batches on FusionDirector, see the MindX Edge Application Deployment and Model Update Guide.
Prerequisites
A container image has been created.
Procedure
Step 1 Log in to the accelerator module as the root user.
Step 2 Run the following command to start the container image. Change it based on the site requirements.
docker run -it --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/svm0 --device=/dev/log_drv --device=/dev/event_sched --device=/dev/upgrade --device=/dev/hi_dvpp --device=/dev/memory_bandwidth --device=/dev/ts_aisle \
-v /var:/var -v /usr/lib64:/usr/lib64 -v /etc/hdcBasic.cfg:/etc/hdcBasic.cfg -v /etc/rc.local:/etc/rc.local -v /sys:/sys \
-v /usr/bin/sudo:/usr/bin/sudo -v /usr/lib/sudo/:/usr/lib/sudo/ -v /etc/sudoers:/etc/sudoers \
workload-image:v1.0
In the preceding command example, the service program is run by default. If you need to directly access the container, add /bin/bash to the end of the command.
Table 3-3 Parameter description
Parameter | Description
--device | Adds a host device to the container. Replace davinci0 with the actual device name. If multiple chips need to be mapped, configure multiple parameters, for example, --device=/dev/davinci0 --device=/dev/davinci1.
workload-image:v1.0 | Generated image file.
NOTE
In this version, multiple containers can be mounted to a single processor.
● Up to 16 containers can be mounted to a single processor.
● Containers obtain the computing power of processors in preemption mode. Memory isolation and computing power slicing are not supported.
● Device sharing is disabled by default.
– Run the following command on the host to enable device sharing:
npu-smi set -t device-share -i 0 -c 0 -d 1
Run the following command to query the device sharing status:
npu-smi info -t device-share -i 0 -c 0
After a restart or upgrade, the multi-container sharing function is disabled.
– The application program can enable this function by calling the DaVinci Card Management Interface (DCMI). For details, see the Atlas 200 AI Accelerator Module 1.0.7 or Later DCMI API Reference (Model 3000).
----End
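When several chips are mapped, one --device parameter is needed per chip. The following sketch generates those parameters from a list of device numbers. The function name device_flags is hypothetical, and the /dev/davinciN naming is taken from the example command in this section.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): prints one --device parameter
# for each davinci device number passed as an argument.
device_flags() {
    local flags="" id
    for id in "$@"; do
        flags="${flags:+$flags }--device=/dev/davinci${id}"
    done
    echo "$flags"
}

device_flags 0 1
```

The printed parameters can be placed in the docker run command in front of the management devices such as /dev/davinci_manager.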
3.2 Atlas 500 AI Edge Station
3.2.1 Creating an Inference Container Image
Prerequisites
● In the container scenario, install Docker 18.03 or later.
● The container OS image can be obtained from Docker Hub.
● Obtain the offline inference engine package and service inference program package by referring to Table 3-4.
Table 3-4 Required software
Software Package | Description | How to Obtain
A500-3000-nnrt_{version}_linux-aarch64.run | Offline inference engine package. {version} indicates the software package version. | Link
Dockerfile | Required for creating an image. | Prepared by users.
Service inference program package | Collection of service inference programs. The .tar and .tgz formats are supported. The compressed package format must be supported by the compression program in the container, and the decompression command in install.sh must match the actual format. | Prepared by users.
install.sh | Installation script of the service inference program. | Prepared by users.
run.sh | Running script of the service inference program. | Prepared by users.
Procedure
Step 1 Upload the following software packages to the same directory (for example, /home/test) on the edge station.
● A500-3000-nnrt_{version}_linux-aarch64.run
● Service inference program package
Step 2 Perform the following steps to create a Dockerfile:
1. Log in to the edge station as the root user and run the id HwHiAiUser command to query and record the UID and GID of the HwHiAiUser user on the host.
2. Go to the software package upload directory in Step 1 and run the following command to create a Dockerfile (for example, Dockerfile):
vi Dockerfile
3. Write the following content and run the :wq command to save the content. The Ubuntu ARM OS is used as an example.
# OS and version number. Change them based on the site requirements.
FROM ubuntu:18.04

# Set the parameters of the offline inference engine package.
ARG NNRT_PKG

# Set environment variables.
# ASCEND_BASE is assumed to be the same installation path as in the Atlas 200 example. Change it as required.
ARG ASCEND_BASE=/usr/local/Ascend
ENV LD_LIBRARY_PATH=\
$LD_LIBRARY_PATH:\
$ASCEND_BASE/nnrt/latest/acllib/lib64:\
/home/data/miniD/driver/lib64
ENV ASCEND_AICPU_PATH=\
$ASCEND_BASE/nnrt/latest

# Set the directory of the started container.
WORKDIR /root

# Copy the offline inference engine package.
COPY $NNRT_PKG .

# Install the offline inference engine package.
RUN umask 0022 && \
    groupadd -g gid HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && usermod -u uid HwHiAiUser && \
    chmod +x ${NNRT_PKG} && \
    ./${NNRT_PKG} --quiet --install && \
    rm ${NNRT_PKG}

# Copy the service inference program package, installation script, and running script.
ARG DIST_PKG
COPY $DIST_PKG .
COPY install.sh .
COPY run.sh /usr/local/bin/

# Run the installation script.
RUN chmod +x /usr/local/bin/run.sh && \
    sh install.sh && \
    rm $DIST_PKG && \
    rm install.sh

# Program that is run by default when the container is started.
CMD run.sh
In Dockerfile:
groupadd -g gid HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && usermod -u uid HwHiAiUser &&\
Create a HwHiAiUser user in the container. gid and uid in the file indicate the GID and UID of the HwHiAiUser user on the host. You can replace them as required according to Step 2.1. The GID and UID of the HwHiAiUser user in the container must be the same as those on the host.
4. After creating the Dockerfile, run the following command to change the permission on the Dockerfile:
chmod 600 Dockerfile
5. The procedure for preparing the install.sh and run.sh scripts is the same as that for preparing the Dockerfile. Compilation Examples shows the file content.
Step 3 Go to the directory where the software packages are stored and run the following command to create a container image:
docker build -t image-name --build-arg NNRT_PKG=nnrt-name --build-arg DIST_PKG=distpackage-name .
Do not omit . at the end of the command. Table 3-5 describes the parameters in the command.
Table 3-5 Command parameter description
Parameter | Description
image-name | Specifies the image name and tag. Set this parameter as required.
--build-arg | Specifies parameters in the Dockerfile.
NNRT_PKG | nnrt-name: specifies the name of the offline inference engine package. Do not omit the file name extension. Replace it with the actual one.
DIST_PKG | distpackage-name: specifies the name of the compressed package of the service inference program. Do not omit the file name extension. Replace it with the actual one.
If "Successfully built xxx" is displayed, the image is successfully created.
Step 4 After the image is created, run the following command to view the image information:
docker images
Example:
REPOSITORY TAG IMAGE ID CREATED SIZE
workload-image v1.0 1372d2961ed2 About an hour ago 249MB
----End
Compilation Examples
Compilation example of install.sh
#!/bin/bash
# Go to the container working directory.
cd /root
# Decompress the service inference program package based on the package format.
tar xf dist.tar
Compilation example of run.sh
#!/bin/bash
# Access the directory where the executable file of the service inference program is located.
cd /root/dist
# Run the executable file.
./main
3.2.2 Deploying an Inference Container
This section describes how to start a container image on a single device. If you need to deploy container images in batches on FusionDirector, see the MindX Edge Application Deployment and Model Update Guide.
Prerequisites
A container image has been created.
Procedure
Step 1 Log in to the edge station as the root user.
Step 2 Run the following commands to start the container image:
docker run --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/hisi_hdc --device=/dev/devmm_svm \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /home/data/miniD/driver/lib64:/home/data/miniD/driver/lib64 \
-it workload-image:v1.0
In the preceding command example, the service program is run by default. If you need to directly access the container, add /bin/bash to the end of the command.
Table 3-6 Parameter description
Parameter | Description
--device | Adds a host device to the container. Replace davinci0 with the actual device name. If multiple chips need to be mapped, configure multiple parameters, for example, --device=/dev/davinci0 --device=/dev/davinci1.
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi | Mounts the npu-smi tool to the container. Change the value based on the site requirements.
-v /home/data/miniD/driver/lib64:/home/data/miniD/driver/lib64 | Mounts the operating environment directory /home/data/miniD/driver/lib64 to the container. Change the directory based on the path of the driver.so file.
NOTE
In this version, multiple containers can be mounted to a single processor.
● Up to 16 containers can be mounted to a single processor.
● Containers obtain the computing power of processors in preemption mode. Memory isolation and computing power slicing are not supported.
● Device sharing is disabled by default. Run the following command on the host to enable device sharing:
npu-smi set -t device-share -i 0 -c 0 -d 1
Run the following command to query the device sharing status:
npu-smi info -t device-share -i 0 -c 0
After a restart or upgrade, the multi-container sharing function is disabled.
----End
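Because device sharing is disabled again after a restart or upgrade, the enable and query commands may need to be repeated. The following sketch wraps the two npu-smi commands from the note above, unchanged, in a hypothetical function named enable_device_share. If npu-smi is not on the PATH, the function only prints the commands as a dry run.

```shell
#!/bin/bash
# Sketch only: runs the enable command and then the query command exactly as
# given in this guide. Without npu-smi on the PATH it prints them instead.
enable_device_share() {
    local cmd
    for cmd in "npu-smi set -t device-share -i 0 -c 0 -d 1" \
               "npu-smi info -t device-share -i 0 -c 0"; do
        if command -v npu-smi >/dev/null 2>&1; then
            $cmd || return 1
        else
            echo "dry run: $cmd"
        fi
    done
}

enable_device_share
```

How such a function is invoked after a restart (for example, from a startup script) depends on the host OS and is not covered by this guide.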
3.3 Other Inference Devices
3.3.1 Instructions
You can use either of the following methods to prepare a container image:
● (Recommended) Pull the basic inference image from AscendHub.
● Use the Dockerfile to create a container image. You can perform secondary development based on the Dockerfile of a basic image. For details, see 3.3.2 Creating an Inference Container Image.
3.3.2 Creating an Inference Container Image
Introduction
This section describes how to create a container image based on the image tree, which is scalable.
Figure 3-1 shows the inference image tree.
Figure 3-1 Inference image tree
Table 3-7 Description of the Ascend basic image tree
Image | Description
ascend-infer | Installs offline inference engine packages, such as nnrt.
Prerequisites
● In the container scenario, install Docker 18.03 or later.
● The container OS image can be obtained from Docker Hub.
● The driver and firmware have been installed on the host. For details, see section "Hardware Requirements" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Inference) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● For details about how to configure a network proxy, see A.1 Configuring a System Network Proxy.
Procedure
Step 1 Log in to the server as the root user.
Step 2 Copy the Dockerfile file provided by the toolbox software package to any path (for example, /home/test) on the server.
cp -r /usr/local/Ascend/toolbox/latest/Ascend-Docker-Images /home/test
Step 3 Create the image ascendbase-infer.
1. Go to the directory where the Dockerfile is stored.
cd /home/test/Ascend-Docker-Images/ascendbase-infer/{os}-{arch}
In the preceding command, {os} indicates the container image OS version, and {arch} indicates the architecture. Change them based on the actual situation.
2. Prepare the following files in the current directory:
Table 3-8 Required files
File | Description | How to Obtain
Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements.
setproxy.sh | Configures the system network proxy and source. | (Optional) You can prepare it based on the actual situation.
unsetproxy.sh | Cancels the system network proxy configuration. | (Optional) You can prepare it based on the actual situation.
EulerOS.repo | Yum source configuration file. | Required only when the container image OS is EulerOS 2.8.
3. Run the following command in the current directory to create the image ascendbase-infer:
docker build -t ascendbase-infer:base_TAG .
Do not omit . at the end of the command. Table 3-9 describes the parameters in the command.
NOTE
To configure the system network proxy in this step, run the following command:
docker build -t ascendbase-infer:base_TAG --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .
In the preceding command, user indicates the username on the intranet, password indicates the user password, proxyserverip indicates the IP address of the proxy server, and port indicates the port number.
Table 3-9 Command parameter description
Parameter | Description
ascendbase-infer:base_TAG | Image name and tag. You are advised to name base_TAG in the format of Date-Container OS-Architecture (for example, 20210106-ubuntu18.04-arm64).
If "Successfully built xxx" is displayed, the image is successfully created.
Step 4 Create the image ascend-infer based on ascendbase-infer.
1. Go to the directory where the Dockerfile is stored.
cd /home/test/Ascend-Docker-Images/ascend-infer
2. Prepare the following software and files in the current directory:
Table 3-10 Required software and files
Software or File | Description | How to Obtain
Ascend-cann-nnrt_{version}_linux-{arch}.run | Offline inference engine package. {version} indicates the software package version, and {arch} indicates the architecture. | Link
Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements.
ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required. After copying the file to the current directory, delete the UserName and UserGroup lines from the copied file.
version.info | Driver package version information file. | Copy the /usr/local/Ascend/driver/version.info file from the host. Change the path as required.
preinstall.sh | Copies files to a specified directory. | Already exists in the current directory. You can delete the code comments.
postinstall.sh | Deletes the directories and files that do not need to be retained in the container. | Already exists in the current directory. You can delete the code comments.
3. Run the following command in the current directory to create the image ascend-infer:
docker build -t ascend-infer:infer_TAG --build-arg NNRT_PKG=nnrt-name --build-arg BASE_VERSION=base_TAG --build-arg ARCH={arch} .
Do not omit . at the end of the command. Table 3-11 describes the parameters in the command.
Table 3-11 Command parameter description
Parameter | Description
ascend-infer:infer_TAG | Image name and tag. You are advised to name infer_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64).
--build-arg | Specifies parameters in the Dockerfile.
NNRT_PKG | nnrt-name: specifies the name of the offline inference engine package. Do not omit the file name extension. Replace it with the actual one.
BASE_VERSION | base_TAG: specifies the image tag set in Step 3.3.
ARCH | {arch} indicates the architecture. Replace it with the actual one (arm64 or x86_64).
If "Successfully built xxx" is displayed, the image is successfully created.
Step 5 After the image is created, run the following command to view the image information:
docker images
----End
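The recommended tag format for the base image (Date-Container OS-Architecture) can be composed in a script so that the same tag is reused as BASE_VERSION when ascend-infer is built. The following is a minimal sketch. The function name make_base_tag is hypothetical, and the build command is only printed for review, not executed.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): composes a tag in the recommended
# Date-Container OS-Architecture format. The date defaults to today.
make_base_tag() {
    local os="$1" arch="$2" day="${3:-$(date +%Y%m%d)}"
    echo "${day}-${os}-${arch}"
}

# Print the build command for review instead of running it.
echo "docker build -t ascendbase-infer:$(make_base_tag ubuntu18.04 arm64 20210106) ."
```

Passing the date explicitly, as in the example, reproduces the sample tag 20210106-ubuntu18.04-arm64 from Table 3-9.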
3.3.3 Deploying an Inference Container
Prerequisites
● A container image has been created.
● The Docker program has been installed in the environment.
● The driver and firmware have been installed on the host. For details, see section "Hardware Requirements" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Inference) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
Ascend-docker-runtime has been integrated into the toolbox. After installing the toolbox, you need to restart Docker to ensure that Ascend-docker-runtime can function properly. The container engine plugin Ascend-docker-runtime provides Ascend NPU-based containerization support for all AI inference jobs so that AI jobs can run smoothly on Ascend devices as Docker containers.
Starting the Container
Run the following command to run a container based on the new image:
docker run -it -e ASCEND_VISIBLE_DEVICES=xxx image-name /bin/bash
Table 3-12 Parameter description
Parameter | Description
-e ASCEND_VISIBLE_DEVICES=xxx | Uses the ASCEND_VISIBLE_DEVICES environment variable to specify, by device sequence number, the NPU devices to be mounted to the container. You can specify a single device or a device range, and the two forms can be used together. Examples:
1. -e ASCEND_VISIBLE_DEVICES=0 indicates that device 0 (/dev/davinci0) is mounted to the container.
2. -e ASCEND_VISIBLE_DEVICES=1,3 indicates that devices 1 and 3 are mounted to the container.
3. -e ASCEND_VISIBLE_DEVICES=0-2 indicates that devices 0 to 2 (including devices 0 and 2) are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2.
4. -e ASCEND_VISIBLE_DEVICES=0-2,4 indicates that devices 0 to 2 and device 4 are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2,4.
image-name | Image name and tag. Change them based on the actual situation (for example, ascend-infer:infer_TAG).
If the container ID (3330d6524117 in this example) is displayed, the container is running and you have accessed the container.
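The single-device and range forms of ASCEND_VISIBLE_DEVICES described in Table 3-12 can be checked before a container is started. The following sketch expands a value such as 0-2,4 into the device numbers it selects. The function name expand_devices is hypothetical and is not the Ascend Docker Runtime implementation. It only mirrors the documented examples and does not validate malformed input.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): expands a value such as 0-2,4
# into the individual device numbers it selects, separated by spaces.
expand_devices() {
    local spec="$1" part out="" first last n
    local IFS=','
    for part in $spec; do
        case "$part" in
            *-*) first="${part%-*}"; last="${part#*-}" ;;
            *)   first="$part"; last="$part" ;;
        esac
        n="$first"
        while [ "$n" -le "$last" ]; do
            out="${out:+$out }$n"
            n=$((n + 1))
        done
    done
    echo "$out"
}

expand_devices "0-2,4"
```

For the value 0-2,4 the function prints 0 1 2 4, which matches the equivalent form 0,1,2,4 given in the table.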
NOTE
● For the default contents mounted by Ascend Docker Runtime, see B.1 Default Mounted Contents of Ascend Docker Runtime. If the driver is installed in a specified path on the host, you need to mount the required directories and files. Example command (change it based on the actual situation):
docker run -it -e ASCEND_VISIBLE_DEVICES=xxx -v ${install_path}/driver:${install_path}/driver -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi image-name
${install_path} indicates the driver installation path.
● If a large amount of content needs to be mounted, you can use the ASCEND_RUNTIME_MOUNTS environment variable to read the mounting content from configuration files. For details about how to create a configuration file, see A.2 Creating a Configuration File. The following uses the hostlog.list configuration file as an example:
docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_MOUNTS=base,hostlog image-name /bin/bash
In the preceding command, base indicates the basic configuration file, as shown in B.1 Default Mounted Contents of Ascend Docker Runtime. Change hostlog to the actual configuration file name.
● You can use the ASCEND_RUNTIME_OPTIONS environment variable to adjust the mounted targets. If the value is set to NODRV, non-device directories and files are not mounted. For example:
docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_OPTIONS=NODRV image-name /bin/bash
Only NPU devices and management devices (for example, /dev/davinci0, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm) are mounted to the container to support driver installation in the container.
● The methods of viewing logs in the container and setting log levels are the same as those of a host. For details, see the CANN Log Reference (Ascend 310).
● Viewing Logs
You can view the application logs on the host and device, but the system logs on the device are unavailable.
● Setting the Log Level
You can set the level of logs on the host, but the level of logs on the device cannot be set.
4 Training
4.1 Instructions
4.2 Creating a Training Container Image
4.3 Deploying a Training Container
4.1 Instructions
You can use either of the following methods to prepare a container image:
● (Recommended) Pull the basic training image from AscendHub.
● Use the Dockerfile to create a container image. You can perform secondary development based on the Dockerfile of a basic image. For details, see 4.2 Creating a Training Container Image.
4.2 Creating a Training Container Image
Introduction
This section describes how to create a container image based on the image tree, which is scalable.
Figure 4-1 Training image tree
Table 4-1 Description of the Ascend basic image tree
Image | Description
ascendbase-train | Installs system components and Python third-party dependencies.
ascend-train | Installs training software packages, such as nnae.
ascend-tensorflow | Installs the framework plugin package TFPlugin and the TensorFlow framework.
Prerequisites
● In the container scenario, install Docker 18.03 or later.
● The container OS image can be obtained from Docker Hub.
● The driver and firmware have been installed on the host. For details, see section "Hardware Requirements" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Training) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● For details about how to configure a network proxy, see A.1 Configuring a System Network Proxy.
Procedure
Step 1 Log in to the server as the root user.
Step 2 Copy the Dockerfile file provided by the toolbox software package to any path (for example, /home/test) on the server.
Step 3 Create the image ascendbase-train.
1. Go to the directory where the Dockerfile is stored.
cd /home/test/Ascend-Docker-Images/ascendbase-train/{os}-{arch}
In the preceding command, {os} indicates the container image OS version, and {arch} indicates the architecture. Change them based on the actual situation.
2. Prepare the following files in the current directory:
Table 4-2 Required files
File | Description | How to Obtain
Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements.
setproxy.sh | Configures the system network proxy and source. | (Optional) You can prepare it based on the actual situation.
unsetproxy.sh | Cancels the system network proxy configuration. | (Optional) You can prepare it based on the actual situation.
libstdc++.so.6.0.24 | Dynamic library file. | Required only when the container image OS is CentOS. You can run the find command to query the path of the libstdc++.so.6.0.24 file and copy the file from the host.
npy_math_internal.h.src.patch | Resolves the NumPy installation error in the CentOS image. | Already exists in the current directory. This file is required only when the container image OS is CentOS ARM64.
3. Run the following command in the current directory to create the image ascendbase-train:
docker build -t ascendbase-train:base_TAG .
Do not omit . at the end of the command. Table 4-3 describes the parameters in the command.
NOTE
To configure the system network proxy in this step, run the following command:
docker build -t ascendbase-train:base_TAG --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .
In the preceding command, user indicates the username on the intranet, password indicates the user password, proxyserverip indicates the IP address of the proxy server, and port indicates the port number.
Table 4-3 Command parameter description
Parameter | Description
ascendbase-train:base_TAG | Image name and tag. You are advised to name base_TAG in the format of Date-Container OS-Architecture (for example, 20210106-ubuntu18.04-arm64).
If "Successfully built xxx" is displayed, the image is successfully created.
Step 4 Create the image ascend-train based on ascendbase-train.
1. Go to the directory where the Dockerfile is stored.
cd /home/test/Ascend-Docker-Images/ascend-train
2. Prepare the following software packages and files in the current directory:
Table 4-4 Required software and files
Software or File | Description | How to Obtain
Ascend-cann-nnae_{version}_linux-{arch}.run | Deep learning acceleration engine package. {version} indicates the software package version, and {arch} indicates the architecture. | Link
Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements.
3. Run the following command in the current directory to create the image ascend-train:
docker build -t ascend-train:train_TAG --build-arg NNAE_PKG=nnae-name --build-arg BASE_VERSION=base_TAG .
Do not omit . at the end of the command. Table 4-5 describes the parameters in the command.
Table 4-5 Command parameter description
Parameter | Description
ascend-train:train_TAG | Image name and tag. You are advised to name train_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64).
--build-arg | Specifies parameters in the Dockerfile.
NNAE_PKG | nnae-name: specifies the name of the deep learning acceleration engine package. Do not omit the file name extension. Replace it with the actual one.
BASE_VERSION | base_TAG: specifies the image tag set in Step 3.3.
If "Successfully built xxx" is displayed, the image is successfully created.
Step 5 Create the image ascend-tensorflow based on ascend-train.
1. Go to the directory where the Dockerfile is stored.
cd /home/test/Ascend-Docker-Images/ascend-tensorflow
2. Prepare the following software packages and files in the current directory:
Table 4-6 Required software and files
Software or File | Description | How to Obtain
Ascend-cann-tfplugin_{version}_linux-{arch}.run | Framework plugin package. {version} indicates the software package version, and {arch} indicates the architecture. | Link
tensorflow-1.15.0-cp37-cp37m-{version}.whl | TensorFlow framework WHL package. | AArch64 architecture: Link. x86_64 architecture: Not needed.
Dockerfile | Required for creating an image. | Already exists in the current directory. You can customize it based on the site requirements.
ascend_install.info | Software package installation log file. | Copy the /etc/ascend_install.info file from the host. Change the path as required. After copying the file to the current directory, delete the contents of lines UserName and UserGroup from the copied file.
version.info | Driver package version information file. | Copy the /usr/local/Ascend/driver/version.info file from the host. Change the path as required.
preinstall.sh | Copies files to a specified directory. | Already exists in the current directory. You can delete the code comments.
postinstall.sh | Deletes the directories and files that do not need to be retained in the container. | Already exists in the current directory. You can delete the code comments.
3. Run the following command in the current directory to create the ascend-tensorflow image. (If the x86_64 architecture is used, you do not need to enter the TF_PKG parameter. Delete --build-arg TF_PKG=tensorflow-name in the following command.)
docker build -t ascend-tensorflow:tensorflow_TAG --build-arg TFPLUGIN_PKG=tfplugin-name --build-arg BASE_VERSION=train_TAG --build-arg TF_PKG=tensorflow-name .
Do not omit . at the end of the command. Table 4-7 describes the parameters in the command.
NOTE
To configure the system network proxy in this step, run the following command:
docker build -t ascend-tensorflow:tensorflow_TAG --build-arg TFPLUGIN_PKG=tfplugin-name --build-arg BASE_VERSION=train_TAG --build-arg TF_PKG=tensorflow-name --build-arg http_proxy=http://user:password@proxyserverip:port --build-arg https_proxy=http://user:password@proxyserverip:port .
In the preceding command, user indicates the username on the intranet, password indicates the user password, proxyserverip indicates the IP address of the proxy server, and port indicates the port number.
Table 4-7 Command parameter description
Parameter | Description
ascend-tensorflow:tensorflow_TAG | Image name and tag. You are advised to name tensorflow_TAG in the format of Software package version-Container OS-Architecture (for example, 20.2.rc1-ubuntu18.04-arm64).
--build-arg | Specifies parameters in the Dockerfile.
TFPLUGIN_PKG | tfplugin-name: specifies the name of the framework plugin package. Do not omit the file name extension. Replace it with the actual one.
BASE_VERSION | train_TAG: specifies the image tag set in Step 4.3.
TF_PKG | tensorflow-name: specifies the name of the TensorFlow framework WHL package. This parameter is required only when the ARM64 architecture is used.
If "Successfully built xxx" is displayed, the image is successfully created.
Step 6 After the image is created, run the following command to view the image information:
docker images
----End
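The dependency between the three builds in Steps 3 to 5 can be summarized in a small dry-run helper. This is only a sketch: the function name print_build_chain and its argument order are made up for illustration, and it prints the commands instead of running them. Each printed command still has to be run from the directory that holds the corresponding Dockerfile.

```shell
#!/bin/sh
# Dry-run sketch (hypothetical helper, not part of the toolbox):
# prints the three docker build commands in dependency order.
# Usage: print_build_chain ARCH BASE_TAG TRAIN_TAG TF_TAG NNAE_PKG TFPLUGIN_PKG [TF_PKG]
print_build_chain() (
    arch=$1 base_tag=$2 train_tag=$3 tf_tag=$4
    nnae_pkg=$5 tfplugin_pkg=$6 tf_pkg=${7:-}

    # Step 3: base image (system components and Python dependencies).
    echo "docker build -t ascendbase-train:${base_tag} ."

    # Step 4: training packages; BASE_VERSION is the tag from Step 3.
    echo "docker build -t ascend-train:${train_tag} --build-arg NNAE_PKG=${nnae_pkg} --build-arg BASE_VERSION=${base_tag} ."

    # Step 5: TFPlugin and TensorFlow; BASE_VERSION is the tag from Step 4.
    tf_cmd="docker build -t ascend-tensorflow:${tf_tag} --build-arg TFPLUGIN_PKG=${tfplugin_pkg} --build-arg BASE_VERSION=${train_tag}"
    # TF_PKG is passed only on ARM64; x86_64 does not need it.
    if [ "$arch" = "arm64" ] && [ -n "$tf_pkg" ]; then
        tf_cmd="${tf_cmd} --build-arg TF_PKG=${tf_pkg}"
    fi
    echo "${tf_cmd} ."
)
```

For example, print_build_chain arm64 base_TAG train_TAG tensorflow_TAG nnae-name tfplugin-name tensorflow-name prints the three commands with the placeholders used in this section. Printing first makes it easy to check that each BASE_VERSION matches the tag of the previous image before anything is built.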
4.3 Deploying a Training Container
Prerequisites
● A container image has been created.
● The Docker program has been installed in the environment.
● The driver and firmware have been installed on the host. For details, see section "Hardware Requirements" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
● The toolbox has been installed on the host. For details, see "Operating Environment Installation (Training) > Installation in a Container > Installing the Toolbox on the Host" in the CANN Software Installation Guide (Development and Operating Scenarios, CLI).
Ascend-docker-runtime has been integrated into the toolbox. After installing the toolbox, you need to restart Docker to ensure that Ascend-docker-runtime can function properly. The container engine plugin Ascend-docker-runtime provides Ascend NPU-based containerization support for all AI inference jobs so that AI jobs can run smoothly on Ascend devices as Docker containers.
● If multiple servers are used for cluster training, ensure that the NPU IP address has been configured on the host before deploying the training container. For details, see A.3 Changing NPU IP Addresses.
Starting the Container
CAUTION
When the host OS is CentOS, the SELinux security module is enabled by default. As a result, the local directory mounted to the container does not have the execute permission. Run the su -c "setenforce 0" command to temporarily disable SELinux. After the related services are complete, run the su -c "setenforce 1" command to enable SELinux again.
Run the following command to run a container based on the new image:
docker run -it -e ASCEND_VISIBLE_DEVICES=xxx --pids-limit 409600 image-name /bin/bash
Table 4-8 Parameter description
Parameter | Description
-e ASCEND_VISIBLE_DEVICES=xxx | Uses the ASCEND_VISIBLE_DEVICES environment variable to specify the NPU devices to be mounted to the container. Devices are identified by their sequence numbers. You can specify single devices, device ranges, or a combination of both. Examples:
  1. -e ASCEND_VISIBLE_DEVICES=0 indicates that device 0 (/dev/davinci0) is mounted to the container.
  2. -e ASCEND_VISIBLE_DEVICES=1,3 indicates that devices 1 and 3 are mounted to the container.
  3. -e ASCEND_VISIBLE_DEVICES=0-2 indicates that devices 0 to 2 (including devices 0 and 2) are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2.
  4. -e ASCEND_VISIBLE_DEVICES=0-2,4 indicates that devices 0 to 2 and device 4 are mounted to the container. The effect is the same as that of -e ASCEND_VISIBLE_DEVICES=0,1,2,4.
--pids-limit 409600 | If the host OS is CentOS or BC-Linux, the maximum number of threads in Docker is 4092, which cannot meet the training requirements. Set this parameter when starting the container to raise the maximum number of Docker threads in CentOS/BC-Linux.
image-name | Image name and tag. Change them based on the actual situation (for example, ascend-tensorflow:tensorflow_TAG).
If the container ID (3330d6524117 in this example) is displayed, the container is running and you have accessed the container.
root@3330d6524117:/#
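The device-list syntax described in Table 4-8 can be illustrated with a short shell function that expands a value such as 0-2,4 into the explicit list 0,1,2,4. This is a sketch for checking a value before passing it to docker run. The function name expand_devices is hypothetical, and the code is not how Ascend Docker Runtime itself parses the variable.

```shell
#!/bin/sh
# Hypothetical helper: expands an ASCEND_VISIBLE_DEVICES-style value
# (single numbers and START-END ranges, separated by commas).
expand_devices() (
    spec=$1
    result=""
    # Split the value on commas (the subshell keeps the IFS change local).
    IFS=,
    set -- $spec
    for item in "$@"; do
        case $item in
            *-*) start=${item%-*}; end=${item#*-} ;;  # range, for example 0-2
            *)   start=$item; end=$item ;;            # single device
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            result="${result:+$result,}$i"
            i=$((i + 1))
        done
    done
    echo "$result"
)

expand_devices 0-2,4   # prints 0,1,2,4
```

The output matches the equivalences stated in the table: 0-2 expands to 0,1,2 and 0-2,4 expands to 0,1,2,4.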
NOTE
● For the default contents mounted to Ascend Docker Runtime, see B.1 Default Mounted Contents of Ascend Docker Runtime. If you use a specified path to install the driver on the host, you need to mount required directories and files. Example command (change it based on the actual situation):
docker run -it -e ASCEND_VISIBLE_DEVICES=xxx --pids-limit 409600 -v ${install_path}/driver/lib64:${install_path}/driver/lib64 -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi image-name /bin/bash
${install_path} indicates the installation path of the Ascend AI Processor driver.
● If a large amount of mounting content is required, you can use the ASCEND_RUNTIME_MOUNTS environment variable to read the mounting content in the configuration file. For details about how to create a configuration file, see A.2 Creating a Configuration File. The following uses the hostlog.list configuration file as an example:
docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_MOUNTS=base,hostlog image-name /bin/bash
In the preceding command, base indicates the basic configuration file, as shown in B.1 Default Mounted Contents of Ascend Docker Runtime. Change hostlog to the actual configuration file name.
● You can use the ASCEND_RUNTIME_OPTIONS environment variable to adjust the mounted targets. If the value is set to NODRV, non-device directories and files are not mounted. For example:
docker run --rm -it -e ASCEND_VISIBLE_DEVICES=xxx -e ASCEND_RUNTIME_OPTIONS=NODRV image-name /bin/bash
Only NPU devices and management devices (for example, /dev/davinci0, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm) are mounted to the container to support driver installation in the container.
● If you need to run related services when deploying a training container, you are advised to use a non-root user.
● The methods of viewing logs in the container and setting log levels are the same as those of a host. For details, see the CANN Log Reference (Ascend 910).
● Viewing Logs
You can view the application logs on the host and device, but the system logs on the device are unavailable.
● Setting the Log Level
You can set the level of logs on the host, but the level of logs on the device cannot be set.
A
Common Operations
A.1 Configuring a System Network Proxy
The following procedure is a general method for configuring a network proxy. It may not be applicable to all network environments. The method of configuring the network proxy depends on the actual network environment.
Prerequisites
● Ensure that the network cable of the server is connected and the proxy server can connect to the external network.
● The proxy configuration assumes that the server is located on an intranet and cannot be directly connected to the external network.
Configuring a System Network Proxy
Step 1 Log in to the user environment as the root user.
Step 2 Run the following command to edit the /etc/profile file:
vi /etc/profile
Add the following content to the file, save the file, and exit:
export http_proxy="http://user:password@proxyserverip:port"
export https_proxy="http://user:password@proxyserverip:port"
In the preceding commands, user indicates the username on the intranet, password indicates the user password (special characters need to be converted), proxyserverip indicates the IP address of the proxy server, and port indicates the port number.
Step 3 Run the following command to make the configuration take effect:
source /etc/profile
Step 4 Run the following command to check whether the external network is connected:
wget www.baidu.com
If the HTML file can be downloaded, the server is connected to the external network successfully.
NOTE
If a certificate error occurs when you use a proxy to connect to the network, you need to install the certificate of the proxy server before downloading third-party components.
----End
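Step 2 notes that special characters in the password need to be converted. One common reading, assumed here, is percent-encoding as used in URLs, so that characters such as @ and : in the password are not mistaken for URL separators. The following sketch shows that conversion in plain shell. The function names urlencode and proxy_url are hypothetical, the helper handles ASCII input only, and 192.0.2.10 and 8080 in the usage line are placeholder values.

```shell
#!/bin/sh
# Hypothetical helper: percent-encodes every ASCII character except
# letters, digits, "-", ".", "_" and "~".
urlencode() (
    LC_ALL=C
    s=$1
    out=""
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;  # "'c" yields the character code
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Hypothetical helper: builds the proxy URL with the password encoded.
# Usage: proxy_url USER PASSWORD PROXYSERVERIP PORT
proxy_url() {
    printf 'http://%s:%s@%s:%s\n' "$1" "$(urlencode "$2")" "$3" "$4"
}

proxy_url user 'p@ss:w0rd' 192.0.2.10 8080
# prints http://user:p%40ss%3Aw0rd@192.0.2.10:8080
```

The printed value can then be used for http_proxy and https_proxy in /etc/profile, or for the --build-arg options of docker build.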
A.2 Creating a Configuration File
If a large number of files or directories need to be mounted, you can write the files or directories to be mounted to a configuration file. The procedure is as follows:
Step 1 Go to the configuration file directory:
cd /etc/ascend-docker-runtime.d/
The basic configuration file base.list exists in the directory. The content is the default mounting objects of Ascend Docker Runtime. For details, see B.1 Default Mounted Contents of Ascend Docker Runtime. In principle, the base.list file cannot be modified.
Step 2 Create and edit the configuration file. The file name can be customized, for example, hostlog.list.
vi hostlog.list
Step 3 Write the files or directories to be mounted to hostlog.list. The following is an example:
/usr/slog/slog /var/log/npu/slog
Step 4 Save the configuration and exit.
----End
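The same file can also be generated from a script instead of being edited by hand. The following is a minimal sketch under two assumptions that this guide does not confirm: the file lists one path per line, and the function name write_mount_list is made up for illustration. The helper refuses to write base.list because that file should not be modified.

```shell
#!/bin/sh
# Hypothetical helper: writes DIR/NAME.list with one path per line.
# Usage: write_mount_list DIR NAME PATH...
write_mount_list() (
    dir=$1 name=$2
    shift 2
    # base.list holds the default mounts and should stay unmodified.
    if [ "$name" = "base" ]; then
        echo "refusing to overwrite base.list" >&2
        exit 1
    fi
    # At least one path is required.
    [ "$#" -gt 0 ] || exit 1
    printf '%s\n' "$@" > "${dir}/${name}.list"
)

# Example: write to a scratch directory first, then inspect the result.
tmpdir=$(mktemp -d)
write_mount_list "$tmpdir" hostlog /usr/slog/slog /var/log/npu/slog
cat "$tmpdir/hostlog.list"
```

After checking the generated file, copy it to /etc/ascend-docker-runtime.d/ and reference it by name (without the .list extension) in ASCEND_RUNTIME_MOUNTS.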
A.3 Changing NPU IP Addresses
When multiple servers are used for distributed training, you need to use the HCCN tool in the Ascend software to configure the NPU IP addresses (the NIC IP addresses of devices) so that the network model parameters between multiple training servers can be transmitted and synchronized through the optical ports on the NPUs. Ensure that the parameters of the network model can be updated synchronously when each training server performs training.
This section describes only the commands for configuring the network by using the HCCN tool. If you need to use other functions of the HCCN tool (for example, checking the link status of a network port), see the Ascend 910 HCCN Tool Interface Reference.
Configuring NIC IP Address of a Device
Atlas 800 training server and Atlas 900 AI cluster
● SMP (symmetric multi-processor) mode
Log in to the AI Servers as the root user and configure the NIC IP address of each device. The configuration requirements are as follows:
– NICs 0 and 4, 1 and 5, 2 and 6, and 3 and 7 of an AI Server must be in the same network segment respectively. NICs 0, 1, 2, and 3 must be in different network segments. NICs 4, 5, 6, and 7 must be in different network segments.
– In the cluster scenario, the devices in the same positions on different AI Servers must be in the same network segment. For example, NIC 0 of AI Server 1 and AI Server 2 must be in the same network segment, and NIC 1 of AI Server 1 and AI Server 2 must be in the same network segment.
hccn_tool -i 0 -ip -s address 192.168.100.101 netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.101.101 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.102.101 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.103.101 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.100.100 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.101.100 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.102.100 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.103.100 netmask 255.255.255.0
● AMP (asymmetric multi-processor) mode
You do not need to restrict network segments. All NICs must be in the same network segment.
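The network segment rules above can be checked before running hccn_tool. The sketch below tests whether two addresses fall into the same segment under a given netmask by comparing their network parts. The function names ip_to_int and same_segment are hypothetical, and the code assumes a shell with arithmetic of at least 64 bits and octets written without leading zeros.

```shell
#!/bin/sh
# Hypothetical helper: prints the dotted address A.B.C.D as an integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# Hypothetical helper: succeeds if IP1 and IP2 share a network segment.
# Usage: same_segment IP1 IP2 NETMASK
same_segment() (
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    # Two addresses are in the same segment if their masked values match.
    [ $((a & m)) -eq $((b & m)) ]
)

# NICs 0 and 4 from the SMP example must share a segment.
same_segment 192.168.100.101 192.168.100.100 255.255.255.0 && echo "NIC 0 and 4: same segment"
# NICs 0 and 1 must be in different segments.
same_segment 192.168.100.101 192.168.101.101 255.255.255.0 || echo "NIC 0 and 1: different segments"
```

With the example addresses of the SMP mode, both messages are printed, which is the layout that the SMP rules require.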
Atlas 300T training card
Each server can be configured with one Atlas 300T training card or two. Each card corresponds to one Device OS and needs to be configured with one IP address. Different cards need to be configured with IP addresses in the same network segment.
Log in to the AI Servers as the root user and configure the NIC IP address of each device. The configuration operations are as follows:
Step 1 Run the npu-smi info command to view the IDs of the devices to be configured. In Figure A-1, for example, the NPU IDs are 1 and 4. Use the actual NPU IDs in the query result.
Figure A-1 Checking the device ID
Step 2 Run the following command to configure the NIC IP addresses of the device. The IP addresses used in the following example are for reference only.
hccn_tool -i 1 -ip -s address 192.168.0.2 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.0.3 netmask 255.255.255.0
----End
NOTE
● Ensure that the npu-smi tool has been installed on the server.
B
Reference
B.1 Default Mounted Contents of Ascend Docker Runtime
In addition to NPUs and management devices (/dev/davinciX, /dev/davinci_manager, /dev/hisi_hdc, and /dev/devmm_svm), Ascend Docker Runtime mounts the following directories and files to containers in read-only mode based on the actual situation.
Table B-1 Default mounted directories and files
Directory | Description
/usr/local/Ascend/driver/lib64 | Directory for storing user-mode libraries provided by the driver.
/usr/local/Ascend/driver/include | Directory for storing the header file (dsmi_common_interface.h) provided by the driver.
/usr/local/dcmi | Directory for storing DCMI header files and libraries.