Configuring the Connection Between Spark and MemArtsCC

Scenario

MemArtsCC stores hotspot data on the compute cluster, which reduces the bandwidth required on the OBS server. Because hotspot data is read from MemArtsCC local storage instead of being fetched across the network, Spark reads data more efficiently. This topic describes how to integrate MemArtsCC into Spark tasks in a system where storage and compute are decoupled.

Prerequisites

The Guardian service is running properly, and storage-compute decoupling has been enabled. For details, see Interconnecting Guardian with OBS.

Spark has been connected to OBS. For details, see Accessing OBS Using Spark Through Guardian.

Modifying Spark Configurations

Log in to FusionInsight Manager and choose Cluster > Services > Spark. Click Configurations, click All Configurations, and click SparkResource(Role) > OBS.
Set fs.obs.readahead.policy to memArtsCC.
Click Save. In the displayed dialog box, click OK to save the configuration. Click Dashboard and choose More > Service Rolling Restart to restart the Spark service.
Click Instances, select SparkResource, and choose More > Instance Rolling Restart to restart the SparkResource instance.
Download and install the Spark service client again or update the existing client configuration. For details, see Using an MRS Client.
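After the client is updated, it can be useful to confirm that the new value actually reached the client configuration. The following Python sketch parses Hadoop-style XML configuration content and checks the value of fs.obs.readahead.policy. It is only an illustration: the function names are hypothetical, and which client configuration file carries the OBS settings depends on the installation, so the file to inspect is an assumption you need to verify locally.

```python
import xml.etree.ElementTree as ET

# Parameter name and value taken from the procedure above.
POLICY_KEY = "fs.obs.readahead.policy"
EXPECTED_POLICY = "memArtsCC"


def read_hadoop_property(xml_data, key):
    """Return the value of `key` from Hadoop-style XML config data, or None.

    `xml_data` is the file content (str or bytes) in the usual
    <configuration><property><name/><value/></property></configuration> layout.
    """
    root = ET.fromstring(xml_data)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == key:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def memartscc_enabled(xml_data):
    """Return True if the read-ahead policy in the config data is memArtsCC."""
    return read_hadoop_property(xml_data, POLICY_KEY) == EXPECTED_POLICY
```

To use it, read the client configuration file in binary mode and pass its content to memartscc_enabled(). A result of False suggests that the client was not refreshed or that the property is stored in a different file.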

Verifying the Configuration

Log in to FusionInsight Manager and choose Cluster > Services > MemArtsCC > Chart > Capacity.
View and record the number of shards in the cluster.
Log in to the Spark client node, create a table whose Location is an OBS path, and query the table. For details, see Accessing OBS Using Spark Through Guardian.
Repeat the first two steps to view the number of shards again. If the number is greater than the value recorded before the query, the interconnection is successful.
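The verification above can be sketched in Python as follows. The first helper builds Spark SQL statements for a table whose Location is an OBS path, and the second applies the shard-count comparison. The table name and OBS path are hypothetical placeholders, and the INSERT statement is an assumption added so that the query has data to read. The shard counts themselves are read manually from the MemArtsCC Capacity chart.

```python
def verification_statements(table, obs_location):
    """Build Spark SQL statements that create, fill, and query an OBS-backed table.

    `table` and `obs_location` are placeholders supplied by the caller,
    for example "test_memarts" and "obs://example-bucket/tmp/test_memarts".
    """
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (name STRING) LOCATION '{obs_location}'",
        # Assumption: write one row so that the SELECT actually reads data from OBS.
        f"INSERT INTO {table} VALUES ('a')",
        f"SELECT * FROM {table}",
    ]


def interconnection_ok(shards_before, shards_after):
    """Return True if the shard count grew after the query, as described above."""
    return shards_after > shards_before
```

The generated statements can be run one by one in the Spark SQL client on the client node. For example, if the chart showed 10 shards before the query and 12 afterwards, interconnection_ok(10, 12) returns True.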