
Using Flume to Collect Log Files from a Specified Directory to HDFS

Application Scenarios

Flume is a distributed, reliable, and highly available system for aggregating massive logs. It can efficiently collect, aggregate, and move large amounts of log data from different data sources to a centralized data store. Data senders can be customized in the system to collect data. In addition, Flume can perform simple processing on the data before writing it to customizable data receivers.

Flume consists of a client and a server, both of which are Flume agents. The server corresponds to the FlumeServer instance and is deployed directly in the cluster, whereas the client can be deployed inside or outside the cluster. The client-side and server-side Flume agents work independently and provide the same functions.

The Flume client needs to be installed separately. It can be used to import data directly to components such as HDFS and Kafka of a cluster.

In this practice, the Flume component of a custom MRS cluster is used to automatically collect new files generated in the log directory of a specified node and store the files to HDFS.

Solution Architecture

A Flume-NG consists of agents, and each agent consists of three components: a source, a channel, and a sink. The source receives data, the channel buffers it, and the sink sends it to the next hop or the final destination.

Figure 1 Flume-NG architecture
Table 1 Module description

Source

A source receives data, or generates data by using a special mechanism, and places the data, in batches, into one or more channels. A source can work in data-driven or polling mode.

Typical source types are as follows:

  • Sources that are integrated with the system, such as Syslog and Netcat
  • Sources that automatically generate events, such as Exec and SEQ
  • IPC sources used for communication between agents, such as Avro

A source must be associated with at least one channel.

Channel

A channel buffers data between a source and a sink. It caches the data received from the source and deletes the data only after the sink has sent it to the next-hop channel or the final destination.

Different channels provide different levels of persistence:

  • Memory channel: no persistence
  • File channel: Write-Ahead Logging (WAL)-based persistence
  • JDBC channel: persistence based on an embedded database

Channels support transactions to guarantee simple sequential operations. A channel can work with any number of sources and sinks.

Sink

A sink sends data to the next-hop channel or the final destination and removes the data from the channel once it has been successfully delivered.

Typical sink types are as follows:

  • Sinks that write data to the final destination, such as HDFS and HBase
  • Sinks that consume data automatically, such as the Null Sink
  • IPC sinks used for communication between agents, such as Avro

A sink must be associated with a specific channel.

As shown in Figure 2, a Flume client can have multiple sources, channels, and sinks.

Figure 2 Flume structure
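
To make the wiring between these components concrete, the following is a minimal, hypothetical Flume properties sketch. The agent name a1 and the component names r1, c1, and k1 are illustrative only; this is not the configuration used later in this practice, which is generated by the Flume configuration tool in Step 2.

  # One agent (a1) with one source (r1), one channel (c1), and one sink (k1)
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # Source: receives data and puts it into the channel
  a1.sources.r1.type = netcat
  a1.sources.r1.bind = 0.0.0.0
  a1.sources.r1.port = 44444
  a1.sources.r1.channels = c1

  # Channel: buffers events between the source and the sink
  a1.channels.c1.type = memory

  # Sink: takes events from the channel and sends them to the next hop or final destination
  a1.sinks.k1.type = logger
  a1.sinks.k1.channel = c1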

Step 1: Creating an MRS Cluster

  1. Create and purchase an MRS cluster that contains the Flume and HDFS components. For details, see Buying a Custom Cluster.

    In this practice, an MRS 3.1.0 cluster with Kerberos authentication disabled is used as an example.

  2. After the cluster is purchased, log in to FusionInsight Manager of the cluster, download the cluster client, and decompress it.

    The Flume client needs to be installed separately. You need to download the cluster client installation package to the node where the Flume client is to be installed and decompress the package.

    1. On the Homepage page of FusionInsight Manager, click the icon next to the cluster name and click Download Client to download the cluster client.
    2. On the Download Cluster Client page, enter the cluster client download information.
      Figure 3 Downloading the cluster client
      • Set Select Client Type to Complete Client.
      • Set Select Platform Type to the architecture of the node to install the client. x86_64 is used as an example.
      • Select Save to Path and enter the download path, for example, /tmp/FusionInsight-Client/. Ensure that user omm has the operation permission on the path.
    3. After the client software package is downloaded, log in to the active OMS node of the cluster as user root and copy the installation package to a specified node.

      By default, the client software package is downloaded to the active OMS node of the cluster, which is flagged on the Hosts page of FusionInsight Manager. If you need to install the client software package on another node in the cluster, run the following commands to transfer the software package to the target node.

      cd /tmp/FusionInsight-Client/

      scp -p FusionInsight_Cluster_1_Services_Client.tar IP address of the node where the Flume client is to be installed:/tmp

    4. Log in to the node where the Flume client is to be installed as user root, go to the directory where the client software package is stored, and run the following commands to decompress the software package:

      tar -xvf FusionInsight_Cluster_1_Services_Client.tar

      tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar
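
After decompression, you can optionally confirm that the standalone Flume client package is present before proceeding. This is a quick check only; the paths follow the example download directory /tmp/FusionInsight-Client/ used in the previous steps.

      cd /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig
      # The install.sh script used in Step 3 should be listed in this directory.
      ls Flume/FlumeClient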

Step 2: Generating the Flume Configuration File

  1. Log in to FusionInsight Manager and choose Cluster > Services > Flume. On the page that is displayed, click the Configuration Tool tab.
  2. Configure and export the properties.properties file.

    Set Agent Name to server, select Avro Source, Memory Channel, and HDFS Sink, and connect them.

    Double-click the module icon and set the parameters according to the following table. Retain the default values for the parameters not listed.

    Avro Source
      • Name: Module name, which is customizable. Example value: test_source_1
      • bind: IP address of the node where the Flume role resides. You can choose Cluster > Services > Flume > Instances to view the IP address of any Flume role instance. Example value: 192.168.10.192
      • port: Connection port. The port number starts from 21154. Example value: 21154

    Memory Channel
      • Name: Module name, which is customizable. Example value: test_channel_1

    HDFS Sink
      • Name: Module name, which is customizable. Example value: test_sink_1
      • hdfs.path: HDFS directory to which the log files are written. Example value: hdfs://hacluster/flume/test
      • hdfs.filePrefix: Prefix of the file names written to HDFS. Example value: over_%{basename}

    A sketch of the server-side configuration generated from these values is provided after this procedure.

  3. Click Export to download the properties.properties file to your local PC.
  4. On FusionInsight Manager, choose Cluster > Services > Flume, click the Instance tab, and click the Flume role in the row of the node where the configuration file is to be uploaded. The Instance Configurations tab page is displayed.

  5. Click Upload File and upload the properties.properties file.

    Click Save. Then click OK.

  6. Choose Cluster > Services > Flume. On the page that is displayed, click the Configuration Tool tab.

    Set Agent Name to client, select SpoolDir Source, Memory Channel, and Avro Sink, and connect them.

    Double-click the module icon and set the parameters according to the following table. (Retain the default values for the parameters not listed.)

    SpoolDir Source
      • Name: Module name, which is customizable. Example value: test_source_1
      • spoolDir: Directory from which logs are to be collected. The user running Flume must have read and write permissions on the directory, and the permissions should be verified by storing files in the directory. Example value: /var/log/Bigdata/audit/test

    Memory Channel
      • Name: Module name, which is customizable. Example value: test_channel_1

    Avro Sink
      • Name: Module name, which is customizable. Example value: test_sink_1
      • hostname: IP address of the node where the Flume role to be connected to resides. Example value: 192.168.10.192
      • port: Connection port. The port number starts from 21154. Example value: 21154

    A sketch of the client-side configuration generated from these values is also provided after this procedure.

  7. Click Export to download the properties.properties file to your local PC.
  8. Rename the properties.properties file as client.properties.properties, and upload it to the Flume/FlumeClient/flume/conf directory under the path where the cluster client installation package was decompressed on the node where the Flume client is to be installed.
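
For reference, the two exported files correspond roughly to the Flume properties shown below. This is a sketch based on the example values in the tables above; the files generated by the configuration tool typically contain additional default parameters, and the basenameHeader setting is an assumption added here so that %{basename} in hdfs.filePrefix resolves to the original file name.

Server-side properties.properties (uploaded to the Flume role instance in 5):

  # Agent "server": Avro source -> memory channel -> HDFS sink
  server.sources = test_source_1
  server.channels = test_channel_1
  server.sinks = test_sink_1

  # Avro source: receives events sent by the Flume client
  server.sources.test_source_1.type = avro
  server.sources.test_source_1.bind = 192.168.10.192
  server.sources.test_source_1.port = 21154
  server.sources.test_source_1.channels = test_channel_1

  # Memory channel: buffers events in memory between the source and the sink
  server.channels.test_channel_1.type = memory

  # HDFS sink: writes the collected files to HDFS with the over_ prefix
  server.sinks.test_sink_1.type = hdfs
  server.sinks.test_sink_1.hdfs.path = hdfs://hacluster/flume/test
  server.sinks.test_sink_1.hdfs.filePrefix = over_%{basename}
  server.sinks.test_sink_1.channel = test_channel_1

Client-side client.properties.properties (placed in the Flume client conf directory in 8):

  # Agent "client": SpoolDir source -> memory channel -> Avro sink
  client.sources = test_source_1
  client.channels = test_channel_1
  client.sinks = test_sink_1

  # SpoolDir source: picks up new files from the log directory
  client.sources.test_source_1.type = spooldir
  client.sources.test_source_1.spoolDir = /var/log/Bigdata/audit/test
  client.sources.test_source_1.basenameHeader = true
  client.sources.test_source_1.channels = test_channel_1

  # Memory channel
  client.channels.test_channel_1.type = memory

  # Avro sink: forwards events to the Avro source on the FlumeServer node
  client.sinks.test_sink_1.type = avro
  client.sinks.test_sink_1.hostname = 192.168.10.192
  client.sinks.test_sink_1.port = 21154
  client.sinks.test_sink_1.channel = test_channel_1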

Step 3: Installing the Flume Client

  1. Log in to the node where the Flume client is to be installed as user root.
  2. Go to the path where the client installation package was decompressed. In this example, the package was uploaded to /tmp and decompressed there.
  3. Run the following commands to install the Flume client. In the command, /opt/FlumeClient indicates the custom Flume client installation path.

    cd /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient

    ./install.sh -d /opt/FlumeClient -c flume/conf/client.properties.properties

    If information similar to the following is displayed, the Flume client is installed successfully:

    CST ... [flume-client install]: install flume client successfully.

Step 4: Viewing Log Collection Results

  1. After the Flume client is installed, write new log files to the log collection directory to check whether logs are transmitted.

    For example, create several log files in the /var/log/Bigdata/audit/test directory.

    cd /var/log/Bigdata/audit/test

    vi log1.txt

    Test log file 1!!!

    vi log2.txt

    Test log file 2!!!

  2. After the log files are written, run the ll command to view the file list. If the suffix .COMPLETED is automatically added to the file names, the log files have been collected.

    -rw-------. 1 root root      75 Jun  9 19:59 log1.txt.COMPLETED
    -rw-------. 1 root root      75 Jun  9 19:59 log2.txt.COMPLETED

  3. Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the Dashboard tab page that is displayed, click the NameNode(Node name,Active) link next to NameNode WebUI to access the HDFS web UI.

  4. Choose Utilities > Browse the file system and check whether data is generated in the /flume/test directory in HDFS.

    Log files are generated in the directory, and the prefix over_ is added to the file names.

    Download the log file over_log1.txt and check whether its content is the same as that of the log file log1.txt.

    Test log file 1!!!
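
If a complete cluster client that includes the HDFS component is installed on a node (for example, under /opt/client, an assumed installation path), you can alternatively verify the result from the command line. Because Kerberos authentication is disabled in this practice, no kinit step is required.

    source /opt/client/bigdata_env            # load the client environment variables (path is an assumption)
    hdfs dfs -ls /flume/test                  # the collected files are listed with the over_ prefix
    hdfs dfs -cat /flume/test/over_log1.txt*  # the content should match the local log1.txt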