
Updated on 2024-08-12 GMT+08:00

Kafka-based WordCount Data Flow Statistics Case

Application Scenarios

Use an MRS cluster to run Kafka programs to process data.

Kafka Streams is a lightweight stream processing framework provided by Apache Kafka, where the input and output data are stored in Kafka clusters.

The following uses WordCount as an example.

Solution Architecture

Kafka is a distributed message publish-subscribe system. With features similar to JMS, Kafka processes active streaming data.

Kafka applies to many scenarios, such as message queuing, behavior tracing, O&M data monitoring, log collection, stream processing, event tracing, and log persistence.

Kafka has the following features:

High throughput
Message persistence to disks
Scalable distributed system
High fault tolerance

Procedure

Huawei Cloud MRS provides sample development projects for Kafka in multiple scenarios. The development guideline for the scenario in this practice is as follows:

Create two topics on the Kafka client to serve as the input and output topics.
Develop a Kafka Streams application to implement the word count function. The application reads messages from the input topic, counts the occurrences of each word, and writes the results to the output topic as key-value pairs. You can then consume data from the output topic to view the statistics.

Step 1: Creating an MRS Cluster

Purchase an MRS cluster that contains the Kafka component. For details, see Buying a Custom Cluster.

In this practice, an MRS 3.1.0 cluster, with Hadoop and Kafka installed and with Kerberos authentication disabled, is used as an example.
After the cluster is purchased, install the cluster client on any node of the cluster. For details, see Installing and Using the Cluster Client.

For example, install the client in the /opt/client directory on the active management node.
After the client is installed, create the lib directory on the client to store related JAR files.

Copy the Kafka JAR files in the directory decompressed during client installation to lib.

For example, if the download path of the client software package is /tmp/FusionInsight-Client on the active management node, run the following commands:

mkdir /opt/client/lib

cd /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig

scp Kafka/install_files/kafka/libs/* /opt/client/lib

Step 2: Preparing Applications

Obtain the sample project from Huawei Mirrors.

Download the Maven project source code and configuration files of the sample project, and configure related development tools on the local host. For details, see Obtaining Sample Projects from Huawei Mirrors.

Select a sample project based on the cluster version and download the sample project.

For example, to obtain WordCountDemo, visit https://github.com/huaweicloud/huaweicloud-mrs-example/tree/mrs-3.1.0/src/kafka-examples.

Use IntelliJ IDEA to import the sample project locally and wait for the Maven project to download related dependency packages.

After Maven and SDK parameters are configured on the local host, the sample project automatically loads related dependency packages. For details, see Configuring and Importing a Sample Project.

In the sample program WordCountDemo, Kafka Streams APIs are called to read word records, and the records are grouped by word to count the occurrences of each word. The key code snippets are as follows:

...
    static Properties getStreamsConfig() {
        final Properties props = new Properties();
        KafkaProperties kafkaProc = KafkaProperties.getInstance();
        // Broker address list. Configure this parameter based on site requirements.
        props.put(BOOTSTRAP_SERVERS, kafkaProc.getValues(BOOTSTRAP_SERVERS, "node-group-1kLFk.mrs-rbmq.com:9092"));
        props.put(SASL_KERBEROS_SERVICE_NAME, "kafka");
        props.put(KERBEROS_DOMAIN_NAME, kafkaProc.getValues(KERBEROS_DOMAIN_NAME, "hadoop.hadoop.com"));
        props.put(APPLICATION_ID, kafkaProc.getValues(APPLICATION_ID, "streams-wordcount"));
        // Protocol type. The value can be SASL_PLAINTEXT or PLAINTEXT.
        props.put(SECURITY_PROTOCOL, kafkaProc.getValues(SECURITY_PROTOCOL, "PLAINTEXT"));
        props.put(CACHE_MAX_BYTES_BUFFERING, 0);
        props.put(DEFAULT_KEY_SERDE, Serdes.String().getClass().getName());
        props.put(DEFAULT_VALUE_SERDE, Serdes.String().getClass().getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return props;
    }
    static void createWordCountStream(final StreamsBuilder builder) {
        // Receives input records from the input topic.
        final KStream<String, String> source = builder.stream(INPUT_TOPIC_NAME);
        // Splits each message into lowercase words, groups the records by word, and counts each word.
        final KTable<String, Long> counts = source
                .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(REGEX_STRING)))
                .groupBy((key, value) -> value)
                .count();
        // Writes the resulting key-value pairs (word, count) to the output topic.
        counts.toStream().to(OUTPUT_TOPIC_NAME, Produced.with(Serdes.String(), Serdes.Long()));
    }
...

Set BOOTSTRAP_SERVERS to the host names and port numbers of the Kafka broker nodes based on site requirements. To view the broker instance information, log in to FusionInsight Manager, choose Cluster > Services > Kafka, and click the Instance tab.
SECURITY_PROTOCOL indicates the protocol type for connecting to Kafka. In this example, set this parameter to PLAINTEXT.
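The getValues(key, default) calls in the snippet follow a look-up-with-fallback pattern. KafkaProperties belongs to the sample project and its implementation is not shown here, so the following is only a minimal stdlib sketch of that pattern. The class StreamsConfigSketch, the helper getValue, the literal key strings, and the localhost:9092 default are assumptions made for illustration, not the sample project's code.

```java
import java.util.Properties;

final class StreamsConfigSketch {
    // Hypothetical stand-in for KafkaProperties.getValues(key, defaultValue):
    // returns the site-specific value if present, otherwise the default.
    static String getValue(Properties site, String key, String defaultValue) {
        return site.getProperty(key, defaultValue);
    }

    // Builds the effective settings from site-specific values plus defaults.
    static Properties effectiveConfig(Properties site) {
        Properties props = new Properties();
        // "localhost:9092" is a placeholder default, not a real broker address.
        props.setProperty("bootstrap.servers", getValue(site, "bootstrap.servers", "localhost:9092"));
        props.setProperty("security.protocol", getValue(site, "security.protocol", "PLAINTEXT"));
        props.setProperty("application.id", getValue(site, "application.id", "streams-wordcount"));
        return props;
    }

    public static void main(String[] args) {
        Properties site = new Properties();
        site.setProperty("bootstrap.servers", "192.168.0.13:9092");
        Properties props = effectiveConfig(site);
        System.out.println(props.getProperty("bootstrap.servers")); // site value wins
        System.out.println(props.getProperty("security.protocol")); // falls back to the default
    }
}
```

With this pattern, only the values that differ between environments, typically the broker list, need to be supplied per site.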

After confirming that the parameters in WordCountDemo.java are correct, compile the project and package it to obtain the JAR file.

For details about how to compile a JAR file, see Commissioning an Application in Linux.

For example, the packaged JAR file is kafka-demo.jar.

Step 3: Uploading the JAR File and Source Data

Upload the compiled JAR file to a directory, for example, /opt/client/lib, on the client node.

If you cannot directly connect to the client node to upload files through the local network, upload the JAR file or source data to OBS, import the file to HDFS on the Files tab page of the MRS cluster, and run the hdfs dfs -get command on the HDFS client to download the file to the client node.

Step 4: Running the Job and Viewing the Result

Log in to the node where the cluster client is installed as user root.

cd /opt/client

source bigdata_env
Create an input topic and an output topic. Ensure that the topic names are the same as those specified in the sample code. Set the cleanup policy of the output topic to compact.

kafka-topics.sh --create --zookeeper <IP address of the quorumpeer instance>:<ZooKeeper client connection port>/kafka --replication-factor 1 --partitions 1 --topic <Topic name>

To query the IP address of the quorumpeer instance, log in to FusionInsight Manager of the cluster, choose Cluster > Services > ZooKeeper, and click the Instance tab. Use commas (,) to separate multiple IP addresses. You can query the ZooKeeper client connection port by querying the ZooKeeper service configuration parameter clientPort. The default value is 2181.
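When several quorumpeer instances are listed, the value passed to --zookeeper can be assembled as in the following sketch. It assumes the standard ZooKeeper connection-string layout (comma-separated host:port pairs followed by a single chroot path). The class ZkConnectString and the second IP address are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ZkConnectString {
    // Joins ip:port pairs with commas and appends the chroot path once at the end.
    static String build(List<String> ips, int clientPort, String chroot) {
        String hosts = ips.stream()
                .map(ip -> ip + ":" + clientPort)
                .collect(Collectors.joining(","));
        return hosts + chroot;
    }

    public static void main(String[] args) {
        // 192.168.0.18 is a made-up second instance for the example.
        List<String> ips = Arrays.asList("192.168.0.17", "192.168.0.18");
        System.out.println(build(ips, 2181, "/kafka"));
    }
}
```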

For example, run the following commands:

kafka-topics.sh --create --zookeeper 192.168.0.17:2181/kafka --replication-factor 1 --partitions 1 --topic streams-wordcount-input

kafka-topics.sh --create --zookeeper 192.168.0.17:2181/kafka --replication-factor 1 --partitions 1 --topic streams-wordcount-output --config cleanup.policy=compact
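The compact cleanup policy suits the output topic because each new count for a word supersedes the previous one. The following sketch models the eventual effect of compaction on a small, made-up sequence of (word, count) records: only the latest value per key is retained. It is a simplified in-memory model, not Kafka's log cleaner, which runs asynchronously in the background and does not compact the active log segment. The class CompactionSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CompactionSketch {
    // Returns the records that remain after compaction: the latest value per key,
    // ordered by the position of that latest record in the log.
    static Map<String, Long> compact(String[] keys, long[] values) {
        Map<String, Long> retained = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            retained.remove(keys[i]);          // drop the superseded record, if any
            retained.put(keys[i], values[i]);  // re-insert so the key moves to the end
        }
        return retained;
    }

    public static void main(String[] args) {
        String[] keys = {"kafka", "test", "kafka", "test", "test"};
        long[] values = {1, 1, 2, 2, 3};
        System.out.println(compact(keys, values)); // prints {kafka=2, test=3}
    }
}
```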
After the topics are created, run the following command to run the program:

java -cp .:/opt/client/lib/* com.huawei.bigdata.kafka.example.WordCountDemo
Open a new client connection window and run the following commands to use kafka-console-producer.sh to write messages to the input topic:

cd /opt/client

source bigdata_env

kafka-console-producer.sh --broker-list <Broker instance IP address>:<Kafka connection port> --topic streams-wordcount-input --producer.config /opt/client/Kafka/kafka/config/producer.properties

For example, the value of --broker-list can be 192.168.0.13:9092.
Open a new client connection window and run the following commands to use kafka-console-consumer.sh to consume data from the output topic and view the statistics result:

cd /opt/client

source bigdata_env

kafka-console-consumer.sh --topic streams-wordcount-output --bootstrap-server <Broker instance IP address>:<Kafka connection port> --consumer.config /opt/client/Kafka/kafka/config/consumer.properties --from-beginning --property print.key=true --property print.value=true --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer --formatter kafka.tools.DefaultMessageFormatter

In the producer window, write messages to the input topic:
```
>This is Kafka Streams test 
>test starting 
>now Kafka Streams is running 
>test end 
```
The consumer window displays the following statistics:
```
this    1 
is      1 
kafka   1 
streams 1 
test    1 
test    2 
starting 1 
now     1 
kafka   2 
streams 2 
is      2 
running 1 
test    3 
end     1
```
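A word such as test appears several times in the result (1, 2, 3) because the sample sets CACHE_MAX_BYTES_BUFFERING to 0, which disables record caching in Kafka Streams so that every count update is forwarded to the output topic. The following stdlib-only sketch imitates that behavior without Kafka. The class WordCountChangelog is hypothetical, and the value "\\W+" for REGEX_STRING is an assumption, since the constant is not shown in the snippet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordCountChangelog {
    // Assumed value of REGEX_STRING; the constant is not shown in the sample snippet.
    static final String REGEX_STRING = "\\W+";

    // Emits one "word<TAB>count" record for every count update, in arrival order.
    static List<String> process(List<String> messages) {
        Map<String, Long> counts = new HashMap<>();
        List<String> changelog = new ArrayList<>();
        for (String message : messages) {
            // Locale.ROOT keeps the sketch deterministic; the sample uses Locale.getDefault().
            for (String word : message.toLowerCase(Locale.ROOT).split(REGEX_STRING)) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by leading punctuation
                }
                long updated = counts.merge(word, 1L, Long::sum);
                changelog.add(word + "\t" + updated);
            }
        }
        return changelog;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
                "This is Kafka Streams test",
                "test starting",
                "now Kafka Streams is running",
                "test end");
        process(messages).forEach(System.out::println);
    }
}
```

Run with the four messages above, the sketch emits 14 records in the same order as the consumer output shown earlier.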