Installing MLNX_OFED in a Container Image
Scenarios
ModelArts GPU servers are equipped with Mellanox Technologies NICs that support Remote Direct Memory Access (RDMA). Installing MLNX_OFED in the container image lets NCCL use these NICs, which makes cross-node communication more efficient.
When the NIC is enabled for NCCL, cross-node communication uses NET/IB. Otherwise, it falls back to NET/Socket. NET/IB provides lower latency and higher bandwidth than NET/Socket.
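To compare the two paths on the same job, one option is to toggle NCCL's InfiniBand transport through environment variables. The sketch below assumes the standard NCCL variables `NCCL_DEBUG` and `NCCL_IB_DISABLE`; the helper name `nccl_env` is hypothetical and only prints the settings.

```shell
# Print environment settings for a comparison run (hypothetical helper).
#   ib     -> allow NCCL to use the RDMA NIC (NET/IB)
#   socket -> disable the IB transport so NCCL falls back to NET/Socket
# NCCL_DEBUG=INFO makes NCCL log which network transport it selected.
nccl_env() {
    case "$1" in
        ib)     echo "NCCL_DEBUG=INFO NCCL_IB_DISABLE=0" ;;
        socket) echo "NCCL_DEBUG=INFO NCCL_IB_DISABLE=1" ;;
        *)      echo "usage: nccl_env ib|socket" >&2; return 1 ;;
    esac
}
```

For example, `env $(nccl_env socket) <training command>` would run the same job over NET/Socket so its throughput can be compared with the NET/IB run.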
GPU Model | Mellanox Technologies NIC | Installed MLNX_OFED Version | Recommended MLNX_OFED Version for Container Image |
---|---|---|---|
Vnt1 | ConnectX-5 | 4.3-1.0.1.0/4.5-1.0.1.0 | 4.9-6.0.6.0-LTS |
Ant8/Ant1 | ConnectX-6 Dx | 5.5-1.0.3.2 | 5.8-2.0.3.0-LTS |
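If image builds are scripted, the table can be encoded as a small lookup. This is a minimal sketch: the function name `recommended_ofed` is hypothetical, and the values are copied from the table above.

```shell
# Map a GPU model to the MLNX_OFED version recommended for the container image.
# Values come from the table above; the function name is illustrative only.
recommended_ofed() {
    case "$1" in
        Vnt1)      echo "4.9-6.0.6.0-LTS" ;;
        Ant8|Ant1) echo "5.8-2.0.3.0-LTS" ;;
        *)         echo "unknown GPU model: $1" >&2; return 1 ;;
    esac
}
```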
Installing MLNX_OFED
The host on which you download files and build the container image from the Dockerfile must be able to access the public network.
Using an Ubuntu 18.04 container image as an example, the following Dockerfile installs MLNX_OFED 4.9-6.0.6.0-LTS:
FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04

# Temporarily switch apt to the Huawei Cloud mirror, install the MLNX_OFED
# user-space packages, then remove the installer and restore the apt settings.
RUN cp -a /etc/apt/sources.list /etc/apt/sources.list.bak && \
    sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    echo > /etc/apt/apt.conf.d/00skip-verify-peer.conf "Acquire { https::Verify-Peer false }" && \
    apt-get update && \
    apt-get install --no-install-recommends -y lsb-core curl && \
    curl -k -o /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz https://content.mellanox.com/ofed/MLNX_OFED-4.9-6.0.6.0/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    cd /tmp && \
    tar xzf MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    cd MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64 && \
    ./mlnxofedinstall --user-space-only --without-fw-update --without-neohost-backend --force && \
    rm /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64.tgz && \
    rm -rf /tmp/MLNX_OFED_LINUX-4.9-6.0.6.0-ubuntu18.04-x86_64 && \
    apt-get clean && \
    mv /etc/apt/sources.list.bak /etc/apt/sources.list && \
    rm /etc/apt/apt.conf.d/00skip-verify-peer.conf
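The version and distribution strings recur throughout the Dockerfile, so a build script might derive the download URL from them. The sketch below assumes that other releases follow the same URL pattern as the one URL shown above, which is not confirmed here; the function name `ofed_url` is hypothetical.

```shell
# Build the MLNX_OFED tarball URL from a version and a distribution tag.
# Assumption: the pattern is inferred from the single URL in the Dockerfile
# above; verify that the file exists before relying on it for other releases.
ofed_url() {
    version=$1   # e.g. 4.9-6.0.6.0 (without the -LTS suffix)
    distro=$2    # e.g. ubuntu18.04
    name="MLNX_OFED_LINUX-${version}-${distro}-x86_64"
    echo "https://content.mellanox.com/ofed/MLNX_OFED-${version}/${name}.tgz"
}
```

Calling `ofed_url 4.9-6.0.6.0 ubuntu18.04` reproduces the URL used in the Dockerfile.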
Build the container image with a command similar to the following:
docker build -f Dockerfile . -t nvidia/cuda:mlnx-ofed-4.9-11.1.1-runtime-ubuntu18.04
After the container image is built, run the following command to check the MLNX_OFED version in the container image:
docker run -ti --rm nvidia/cuda:mlnx-ofed-4.9-11.1.1-runtime-ubuntu18.04 ofed_info | head -n 1
The command output is as follows:
MLNX_OFED_LINUX-4.9-6.0.6.0 (OFED-4.9-6.0.6):
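In an automated build, the same check can fail the pipeline when the version is not the expected one. This is a minimal sketch that assumes the first line of `ofed_info` has the format shown above; the function name `check_ofed_version` is hypothetical.

```shell
# Read ofed_info output on stdin and compare its first line with the
# expected MLNX_OFED version. Returns non-zero on a mismatch.
check_ofed_version() {
    expected=$1
    first_line=$(head -n 1)
    case "$first_line" in
        "MLNX_OFED_LINUX-${expected} "*)
            echo "ok: $first_line" ;;
        *)
            echo "mismatch: expected ${expected}, got: ${first_line}" >&2
            return 1 ;;
    esac
}
```

For example: `docker run --rm nvidia/cuda:mlnx-ofed-4.9-11.1.1-runtime-ubuntu18.04 ofed_info | check_ofed_version 4.9-6.0.6.0`.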