Updated on 2023-09-04 GMT+08:00

Installing MLNX_OFED in a Container Image

Scenarios

Mellanox Technologies NICs are configured on ModelArts GPU servers to support Remote Direct Memory Access (RDMA). Installing MLNX_OFED in the container image enables NCCL to use these NICs, which improves the efficiency of cross-node communication.

When the NIC is enabled for NCCL, cross-node communication uses NET/IB. Otherwise, it uses NET/Socket. NET/IB provides lower latency and higher bandwidth than NET/Socket.
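One way to confirm which transport NCCL selected is to run the job with the NCCL_DEBUG=INFO environment variable and inspect the log. The following is a minimal sketch, not an official tool: the function name nccl_transport is hypothetical, and it assumes the log contains a line of the form "Using network <name>", whose wording may differ between NCCL versions.

```shell
# Hypothetical helper: print the network transport that NCCL reports in a log
# captured with NCCL_DEBUG=INFO. Assumes the log contains a line of the form
# "... Using network <name>"; the wording may differ between NCCL versions.
nccl_transport() {
    sed -n 's/.*Using network \([A-Za-z]*\).*/\1/p' "$1" | head -n 1
}
```

For example, save the job output to a file such as nccl.log and run nccl_transport nccl.log. Under the assumption above, a result of IB corresponds to NET/IB and Socket corresponds to NET/Socket.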

Table 1 Mellanox Technologies NIC and MLNX_OFED installation on ModelArts GPU servers

| GPU Model | Mellanox Technologies NIC | Installed MLNX_OFED Version | Recommended MLNX_OFED Version for Container Image |
|---|---|---|---|
| V100 | ConnectX-5 | 4.3-1.0.1.0/4.5-1.0.1.0 | 4.9-6.0.6.0-LTS |
| Ant8/Ant1 | ConnectX-6 Dx | 5.5-1.0.3.2 | 5.8-2.0.3.0-LTS |
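If image builds are scripted, the recommendations in Table 1 can be encoded as a small lookup. This is only an illustrative sketch: the function name recommended_ofed_version is hypothetical, and the values are copied from Table 1.

```shell
# Hypothetical helper: map a GPU model from Table 1 to the MLNX_OFED version
# recommended for the container image. Values are copied from Table 1.
recommended_ofed_version() {
    case "$1" in
        V100)      echo "4.9-6.0.6.0-LTS" ;;
        Ant8|Ant1) echo "5.8-2.0.3.0-LTS" ;;
        *)         echo "unknown GPU model: $1" >&2; return 1 ;;
    esac
}
```

Calling recommended_ofed_version V100 prints 4.9-6.0.6.0-LTS. A model that is not listed in Table 1 produces an error message on standard error and a non-zero exit status.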

Installing MLNX_OFED

The host used to download files and build the container image from the Dockerfile must be able to access the public network.

The following uses an Ubuntu 18.04 container image as an example. The Dockerfile for installing MLNX_OFED 4.9-6.0.6.0-LTS is as follows:

FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04

RUN cp -a /etc/apt/sources.list /etc/apt/sources.list.bak && \
    sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    echo > /etc/apt/apt.conf.d/00skip-verify-peer.conf "Acquire { https::Verify-Peer false }" && \
    apt-get update && \
    apt-get install --no-install-recommends -y lsb-core curl && \
    curl -k -o /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz https://content.mellanox.com/ofed/MLNX_OFED-4.9-6.0.6.0/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    cd /tmp && \
    tar xzf MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    cd MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64 && \
    ./mlnxofedinstall --user-space-only --without-fw-update --without-neohost-backend --force && \
    rm /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    rm -rf /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64 && \
    apt-get clean && \
    mv /etc/apt/sources.list.bak /etc/apt/sources.list && \
    rm /etc/apt/apt.conf.d/00skip-verify-peer.conf
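
To adapt the Dockerfile to another MLNX_OFED version, the download URL can be derived from the version string. The sketch below follows the URL pattern used in the Dockerfile above. The function name mlnx_ofed_url is hypothetical, and it is an assumption that other versions and distributions are published under the same path layout. Note that the archive name in the Dockerfile uses the version without the -LTS suffix.

```shell
# Hypothetical helper: build the MLNX_OFED download URL for a given version
# and distribution tag, following the URL pattern used in the Dockerfile.
# Assumption: other versions are published under the same path layout.
mlnx_ofed_url() {
    version="$1"                 # for example 4.9-6.0.6.0 (no -LTS suffix)
    distro="${2:-ubuntu18.04}"   # distribution tag used in the archive name
    arch="${3:-x86_64}"          # CPU architecture used in the archive name
    echo "https://content.mellanox.com/ofed/MLNX_OFED-${version}/MLNX_OFED_LINUX-${version}-${distro}-${arch}.tgz"
}
```

Calling mlnx_ofed_url 4.9-6.0.6.0 reproduces the URL used in the Dockerfile. Verify that the generated URL exists before relying on it for another version.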

Build the container image. The following command is an example:

docker build -f Dockerfile . -t nvidia/cuda:mlnx-ofed-4.9-11.1.1-runtime-ubuntu18.04

After the container image has been created, run this command to obtain the MLNX_OFED version in the container image:

docker run -ti --rm nvidia/cuda:mlnx-ofed-4.9-11.1.1-runtime-ubuntu18.04 ofed_info | head -n 1

The command output is as follows:

MLNX_OFED_LINUX-4.9-6.0.6.0 (OFED-4.9-6.0.6):
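
In an automated build, this check can be scripted so that the build fails when the installed version is not the expected one. The following is a minimal sketch: the function name check_ofed_version is hypothetical, and it assumes the first line of the ofed_info output has the format shown above.

```shell
# Hypothetical helper: succeed only if the first line printed by ofed_info
# starts with "MLNX_OFED_LINUX-<expected version>" followed by a space.
check_ofed_version() {
    first_line="$1"   # for example "MLNX_OFED_LINUX-4.9-6.0.6.0 (OFED-4.9-6.0.6):"
    expected="$2"     # for example "4.9-6.0.6.0"
    case "$first_line" in
        "MLNX_OFED_LINUX-${expected} "*) return 0 ;;
        *) echo "unexpected MLNX_OFED version: $first_line" >&2; return 1 ;;
    esac
}
```

For example, pass the first line of the ofed_info output obtained from the container as the first argument and 4.9-6.0.6.0 as the second. When capturing the output in a script, run the container without the -ti options so that no terminal is allocated.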