Using TensorBoard Visualization Jobs in JupyterLab

ModelArts supports TensorBoard for visualizing training jobs. TensorBoard is the visualization toolkit of TensorFlow. It provides the visualization functions and tools required for machine learning experiments. TensorBoard displays the computational graph, metric trends, and data used during training. For details about TensorBoard, see the official website.

Currently, TensorBoard can be used only with the PyTorch and TensorFlow engines.

Prerequisites

When you write a training script, add code that collects summary records so that a summary file is generated in the training output.

For details about how to add the code for collecting summary records to a TensorFlow-powered training script, see the TensorFlow official website.

For details about how to add the code for collecting summary records to a PyTorch-powered training script, see the PyTorch official website.

Precautions

A running visualization job is not billed separately. When the target notebook instance is stopped, the billing stops.
If the summary file is stored in OBS, you will be charged for the storage. After a job is complete, stop the notebook instance and clear OBS data to stop billing.

Process of Creating a TensorBoard Visualization Job in a Development Environment

Step 1: Creating a Development Environment and Accessing It Online

Step 2: Generating Summary Data

Step 3: Uploading Summary Data

Step 4: Starting TensorBoard

Step 5: Viewing Visualized Data on the Dashboard

Step 1: Creating a Development Environment and Accessing It Online

Log in to the ModelArts management console. In the navigation pane on the left, choose Development Workspace > Notebook. Create an instance using a TensorFlow or PyTorch image. After the instance is created, locate it in the list and click Open in the Operation column to access it online.

If the data directory to be visualized contains a large amount of data, for example, hundreds of MB, the CPU or memory of an instance with 2 vCPUs and 8 GB memory may be insufficient, causing the notebook instance to stop working. In this case, use a higher-specification instance, for example, 4 vCPUs and 16 GB memory. Set the specifications based on your requirements.
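
To judge whether a data directory is large enough to warrant a higher-specification instance, you can check its total size before starting TensorBoard. The following is a minimal sketch that uses only the Python standard library. The function name dir_size_mb and the example path are illustrative assumptions and are not part of ModelArts.

```python
import os


def dir_size_mb(path):
    """Return the total size of all regular files under path, in MB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip broken symlinks and other entries that are not regular files.
            if os.path.isfile(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical directory; replace it with your own data directory.
    log_dir = "/home/ma-user/work/summary"
    if os.path.isdir(log_dir):
        print(f"{log_dir}: {dir_size_mb(log_dir):.1f} MB")
```

A result in the range of hundreds of MB suggests choosing larger specifications or reducing the amount of data to be visualized.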

Step 2: Generating Summary Data

Summary data is required for using TensorBoard visualization functions in DevEnviron. The following simple linear regression example, written in PyTorch, shows how to generate and record summary data. For details about how to generate summary data, see the PyTorch official documentation.

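import torch
from torch.utils.tensorboard import SummaryWriter

# By default, SummaryWriter writes event files to ./runs/<datetime>_<hostname>.
# Pass log_dir="..." to write to a different directory.
writer = SummaryWriter()

# Synthetic data for y = -5x plus Gaussian noise.
x = torch.arange(-5, 5, 0.1).view(-1, 1)
y = -5 * x + 0.1 * torch.randn(x.size())

model = torch.nn.Linear(1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def train_model(num_epochs):
    for epoch in range(num_epochs):
        y_pred = model(x)
        loss = criterion(y_pred, y)
        # Record the training loss of each epoch as a scalar.
        writer.add_scalar("Loss/train", loss.item(), epoch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


train_model(10)
writer.flush()  # Make sure all pending events are written to disk.
writer.close()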

Step 3: Uploading Summary Data

You can upload local summary data to the notebook instance as follows:

Create a directory on the left of the notebook, and upload the data to the directory. For details, see Uploading Files from a Local Path to JupyterLab.

Figure 1 Data to be visualized in the notebook instance

You are advised to upload the data to be visualized to a separate directory. If you need to use OBS mounting, mount only the OBS path that stores the data to be visualized to the notebook instance, instead of mounting the entire bucket. Mixing summary data with other data can slow down TensorBoard loading or even cause visualization to fail.
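
One way to follow this advice is to copy only the summary files out of a mixed training output directory into a dedicated directory. The sketch below assumes that the summary files use the default events.out.tfevents file name prefix written by the TensorFlow and PyTorch summary writers. The function name collect_event_files and both directory arguments are hypothetical, and dst_dir is assumed to lie outside src_dir.

```python
import os
import shutil

EVENT_FILE_PREFIX = "events.out.tfevents"  # default prefix of summary files


def collect_event_files(src_dir, dst_dir):
    """Copy summary files from src_dir to dst_dir, keeping subdirectories."""
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if not name.startswith(EVENT_FILE_PREFIX):
                continue  # ignore checkpoints, logs, and other files
            rel_root = os.path.relpath(root, src_dir)
            target_dir = os.path.normpath(os.path.join(dst_dir, rel_root))
            os.makedirs(target_dir, exist_ok=True)
            target = os.path.join(target_dir, name)
            shutil.copy2(os.path.join(root, name), target)
            copied.append(target)
    return sorted(copied)
```

The subdirectory layout is preserved because TensorBoard treats each subdirectory that contains summary files as a separate run.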

Step 4: Starting TensorBoard

Open the Launcher page in JupyterLab of the development environment and click TensorBoard.
Figure 2 Opening Launcher

Figure 3 Opening TensorBoard in Launcher
When you open TensorBoard for the first time, a default initialization panel is displayed, on which you can create a TensorBoard instance.
In the Log Dir text box, enter the directory that stores the data to be visualized. The path is relative to /home/ma-user/work. Then click Create TensorBoard on the right, as shown in Figure 4. If you have opened TensorBoard before, the first active TensorBoard instance is displayed.

Figure 4 TensorBoard page

The parameter details are as follows:
- Log Dir: The default value is the directory open in the sidebar when you click TensorBoard. When you enter a directory, specify it as precisely as possible to speed up initialization.
- Multi LogDir: You can enter multiple directories separated by commas (,). This parameter corresponds to the --logdir_spec parameter of native TensorBoard. This function is not fully supported by official TensorBoard and is not recommended.
- Reload Interval: The interval at which TensorBoard rescans the log directory. This function is disabled by default, and manual reload is sufficient for daily use. Once this parameter is configured, the TensorBoard backend continuously scans the directory, which affects the stability of JupyterLab and the file system.
For more details about TensorBoard, see the official JupyterLab TensorBoard website.
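
Because the Log Dir value is a path relative to /home/ma-user/work, an absolute path copied from a terminal must be converted before it is entered. The following is a minimal sketch of that conversion. The helper name to_log_dir and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

# Root directory that Log Dir paths are relative to, as stated above.
WORK_ROOT = PurePosixPath("/home/ma-user/work")


def to_log_dir(path, root=WORK_ROOT):
    """Return path relative to root, suitable for the Log Dir text box."""
    path = PurePosixPath(path)
    if not path.is_absolute():
        return str(path)  # already relative, use it as is
    # relative_to raises ValueError if path is not located under root.
    return str(path.relative_to(root))


print(to_log_dir("/home/ma-user/work/summary/run1"))  # summary/run1
```

A ValueError indicates that the directory is outside /home/ma-user/work and therefore cannot be expressed as a relative path under it.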
Figure 5 shows the created TensorBoard panel.
- Click the button marked with 1 to display the visualization panel in a separate tab.
- The buttons in the box marked with 2 allow you to manage instances, including Reload, Destroy, Duplicate, and New.
- The buttons in the box marked with 3 allow you to manage instances on the Kernel management panel of JupyterLab. You can switch to the corresponding instance and delete the instance.
Figure 5 Visualization page after creation

Do not enable Reload Interval. Otherwise, the notebook instance may freeze or even become unavailable due to frequent background refresh. To view new data, click the refresh button in the upper right corner.

Do not create multiple TensorBoard instances by clicking New. Otherwise, the CPU or memory usage may be too high, causing the notebook instance to freeze or even become unavailable. If you need to visualize a new directory, close the current TensorBoard instance and specify a new directory to create a TensorBoard instance.

Step 5: Viewing Visualized Data on the Dashboard

The dashboard is the main interface for TensorBoard visualization. It supports scalar visualization, image visualization, and computational graph visualization.

For more functions, see Get started with TensorBoard.

Browser restrictions: Due to security settings in browsers such as Chrome, data cannot be downloaded from an iframe. The Download data button is therefore currently unsupported.
No data displayed: Ensure the correct visualization data directory is selected and that the data is complete. Avoid manually adjusting or deleting summary data from training, as this can cause visualization failures.
Data not displaying: If the refresh button keeps spinning, open a Terminal window and run the top command to check CPU usage. High CPU or memory usage due to large data volumes may be the issue. Consider reducing the data volume or using a notebook instance with more resources.
Occasional data loading issues: If data fails to be loaded, wait a moment and click the refresh button again. The data should display properly after a short wait.
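
When no data is displayed, a quick check is to list the summary files that actually exist under the selected directory. The following sketch assumes the default events.out.tfevents file name prefix used by the TensorFlow and PyTorch summary writers. The function name find_event_files is hypothetical.

```python
import os

EVENT_FILE_PREFIX = "events.out.tfevents"  # default prefix of summary files


def find_event_files(log_dir):
    """Return (relative_path, size_in_bytes) for each summary file in log_dir."""
    found = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.startswith(EVENT_FILE_PREFIX):
                full_path = os.path.join(root, name)
                rel_path = os.path.relpath(full_path, log_dir)
                found.append((rel_path, os.path.getsize(full_path)))
    return sorted(found)
```

An empty result suggests that the wrong directory was selected. Very large total sizes may explain a refresh button that keeps spinning.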