Using Flume to Collect Log Files from a Specified Directory to HDFS
Application Scenarios
Flume is a distributed, reliable, and highly available system for collecting and aggregating massive amounts of log data. It can efficiently collect, aggregate, and move log data from many different data sources into a centralized data store. Data senders can be customized in the system to collect data, and Flume can perform simple processing on the data before writing it to customizable data receivers.
Flume consists of a client and a server, both of which are Flume agents. The server corresponds to the FlumeServer instance and is deployed directly in the cluster, while the client can be deployed inside or outside the cluster. The client-side and server-side Flume agents work independently and provide the same functions.
The Flume client needs to be installed separately. It can be used to import data directly to cluster components such as HDFS and Kafka.
In this practice, the Flume component of a custom MRS cluster is used to automatically collect new files generated in the log directory of a specified node and store the files to HDFS.
Solution Architecture
Flume-NG consists of agents. Each agent consists of three components: a source, a channel, and a sink. The source receives data, the channel buffers the data in transit, and the sink sends the data to the next hop or the final destination.
| Name | Description |
| --- | --- |
| Source | A source receives data or generates data through a special mechanism, and places the data in batches into one or more channels. It can work in data-driven or polling mode. A source must be associated with at least one channel. |
| Channel | A channel buffers data between a source and a sink. It caches the data delivered by the source and deletes that data only after the sink has sent it to the next-hop channel or the final destination. Different channel types provide different persistence levels. Channels support transactions to guarantee simple sequential semantics, and a channel can work with any number of sources and sinks. |
| Sink | A sink sends data to the next-hop channel or the final destination and removes the data from the channel once the transfer is complete. A sink must be associated with a specific channel. |
As shown in Figure 2, a Flume client can have multiple sources, channels, and sinks.
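An agent's components are wired together in a properties file: each source and each sink is bound to a channel by name. The following minimal sketch uses standard Flume notation; the agent name (a1), component names, and component types are illustrative and are not the ones used later in this practice.
# Minimal single-agent sketch; names and types are illustrative
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# Example source: listens for text lines on a TCP port and writes events into channel c1
a1.sources.s1.type = netcat
a1.sources.s1.bind = 127.0.0.1
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
# In-memory buffer between the source and the sink
a1.channels.c1.type = memory
# Example sink: logs events and drains them from channel c1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1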
Step 1: Creating an MRS Cluster
- Create and purchase an MRS cluster that contains the Flume and HDFS components. For details, see Buying a Custom Cluster.
In this practice, an MRS 3.1.0 cluster with Kerberos authentication disabled is used as an example.
- After the cluster is purchased, log in to FusionInsight Manager of the cluster, download the cluster client, and decompress it.
The Flume client needs to be installed separately. You need to download the cluster client installation package to the node where the Flume client is to be installed and decompress the package.
- On the Homepage page of FusionInsight Manager, click the icon next to the cluster name and click Download Client to download the cluster client.
- On the Download Cluster Client page, enter the cluster client download information.
Figure 3 Downloading the cluster client
- Set Select Client Type to Complete Client.
- Set Select Platform Type to the architecture of the node where the client is to be installed. x86_64 is used as an example.
- Select Save to Path and enter the download path, for example, /tmp/FusionInsight-Client/. Ensure that user omm has the operation permission on the path.
- After the client software package is downloaded, log in to the active OMS node of the cluster as user root and copy the installation package to a specified node.
By default, the client software package is downloaded to the active OMS node of the cluster, which is marked with an icon on the Hosts page of FusionInsight Manager. If you need to install the client software package on another node in the cluster, run the following commands to transfer the package to the target node.
cd /tmp/FusionInsight-Client/
scp -p FusionInsight_Cluster_1_Services_Client.tar <IP address of the node where the Flume client is to be installed>:/tmp
- Log in to the node where the Flume client is to be installed as user root, go to the directory where the client software package is stored, and run the following commands to decompress the software package:
tar -xvf FusionInsight_Cluster_1_Services_Client.tar
tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar
Step 2: Generating the Flume Configuration File
- Log in to FusionInsight Manager and choose Cluster > Services > Flume. On the displayed page, click the Configuration Tool tab.
- Configure and export the properties.properties file.
Set Agent Name to server, select Avro Source, Memory Channel, and HDFS Sink, and connect them.
Double-click the module icon and set the parameters according to the following table. Retain the default values for the parameters not listed.
| Type | Parameter | Description | Example Value |
| --- | --- | --- | --- |
| Avro Source | Name | Module name, which is customizable | test_source_1 |
| Avro Source | bind | IP address of the node where the Flume role resides. You can choose Cluster > Services > Flume > Instances to view the IP address of any Flume role instance. | 192.168.10.192 |
| Avro Source | port | Connection port. The port number starts from 21154. | 21154 |
| Memory Channel | Name | Module name, which is customizable | test_channel_1 |
| HDFS Sink | Name | Module name, which is customizable | test_sink_1 |
| HDFS Sink | hdfs.path | HDFS directory to which log files are written | hdfs://hacluster/flume/test |
| HDFS Sink | hdfs.filePrefix | Prefix of the file name written to HDFS | over_%{basename} |
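For reference, the exported server-side configuration is expected to resemble the following sketch, built from the example values above with standard Flume property names. The file actually generated by the Configuration Tool may contain additional default properties.
server.sources = test_source_1
server.channels = test_channel_1
server.sinks = test_sink_1
# Avro source listening for events sent by the Flume client
server.sources.test_source_1.type = avro
server.sources.test_source_1.bind = 192.168.10.192
server.sources.test_source_1.port = 21154
server.sources.test_source_1.channels = test_channel_1
# In-memory channel buffering events between the source and the sink
server.channels.test_channel_1.type = memory
# HDFS sink writing the collected logs to the target directory
server.sinks.test_sink_1.type = hdfs
server.sinks.test_sink_1.hdfs.path = hdfs://hacluster/flume/test
server.sinks.test_sink_1.hdfs.filePrefix = over_%{basename}
server.sinks.test_sink_1.channel = test_channel_1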
- Click Export to download the properties.properties file to your local PC.
- On FusionInsight Manager, choose Cluster > Services > Flume, click the Instance tab, and click the Flume role in the row of the node where the configuration file is to be uploaded. The Instance Configurations tab page is displayed.
- Click Upload File and upload the properties.properties file.
Click Save. Then click OK.
- Choose Cluster > Services > Flume. On the page that is displayed, click the Configuration Tool tab.
Set Agent Name to client, select SpoolDir Source, Memory Channel, and Avro Sink, and connect them.
Double-click the module icon and set the parameters according to the following table. (Retain the default values for the parameters not listed.)
| Type | Parameter | Description | Example Value |
| --- | --- | --- | --- |
| SpoolDir Source | Name | Module name, which is customizable | test_source_1 |
| SpoolDir Source | spoolDir | Directory from which logs are collected. The Flume running user must have read and write permissions on the directory; verify the permissions by placing a test file in the directory. | /var/log/Bigdata/audit/test |
| Memory Channel | Name | Module name, which is customizable | test_channel_1 |
| Avro Sink | Name | Module name, which is customizable | test_sink_1 |
| Avro Sink | hostname | IP address of the node where the Flume role to be connected resides | 192.168.10.192 |
| Avro Sink | port | Connection port. The port number starts from 21154. | 21154 |
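Similarly, the exported client-side configuration is expected to resemble the sketch below, again using standard Flume property names. The generated file may also set additional SpoolDir options (for example, basenameHeader) so that the %{basename} placeholder used by the server-side HDFS sink can be resolved.
client.sources = test_source_1
client.channels = test_channel_1
client.sinks = test_sink_1
# SpoolDir source monitoring the log collection directory
client.sources.test_source_1.type = spooldir
client.sources.test_source_1.spoolDir = /var/log/Bigdata/audit/test
client.sources.test_source_1.channels = test_channel_1
# In-memory channel buffering events between the source and the sink
client.channels.test_channel_1.type = memory
# Avro sink forwarding events to the server-side Avro source
client.sinks.test_sink_1.type = avro
client.sinks.test_sink_1.hostname = 192.168.10.192
client.sinks.test_sink_1.port = 21154
client.sinks.test_sink_1.channel = test_channel_1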
- Click Export to download the properties.properties file to your local PC.
- Rename the properties.properties file client.properties.properties and upload it to the Flume/FlumeClient/flume/conf directory under the path where the cluster client installation package was decompressed on the Flume client node, as shown in the command sketch below.
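For example, assuming the exported file has already been copied to /tmp on the Flume client node and the client package was decompressed under /tmp/FusionInsight-Client as in Step 1, the rename and copy can be done as follows:
cd /tmp
mv properties.properties client.properties.properties
cp client.properties.properties /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient/flume/conf/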
Step 3: Installing the Flume Client
- Log in to the node where the Flume client is to be installed as user root.
- Go to the path where the client installation package was decompressed. In this example, the package was uploaded to /tmp and decompressed there.
- Run the following commands to install the Flume client. In the command, /opt/FlumeClient indicates the custom Flume client installation path.
cd /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient
./install.sh -d /opt/FlumeClient -c flume/conf/client.properties.properties
CST ... [flume-client install]: install flume client successfully.
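Optionally, you can confirm that the Flume client process is running under the chosen installation path, for example:
ps -ef | grep flume | grep /opt/FlumeClient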
Step 4: Viewing Log Collection Results
- After the Flume client is installed, write new log files to the log collection directory to check whether logs are transmitted.
For example, create several log files in the /var/log/Bigdata/audit/test directory.
cd /var/log/Bigdata/audit/test
vi log1.txt
Test log file 1!!!
vi log2.txt
Test log file 2!!!
- After the log files are written, run the ll command to view the file list. If the suffix .COMPLETED is automatically added to the file names, the log files have been collected.
-rw-------. 1 root root 75 Jun 9 19:59 log1.txt.COMPLETED
-rw-------. 1 root root 75 Jun 9 19:59 log2.txt.COMPLETED
- Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the Dashboard tab page that is displayed, click the NameNode(Node name,Active) link next to NameNode WebUI to access the HDFS web UI.
- Choose Utilities > Browse the file system and check whether data is generated in the /flume/test directory in HDFS.
Log files have been generated in the directory, with the prefix over_ added to the file names.
Download the log file over_log1.txt and check whether its content is the same as that of the log file log1.txt.
Test log file 1!!!
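Alternatively, if a full cluster client is installed on a node (not covered in this practice), you can check the HDFS directory from the command line. The generated file names carry a system-generated suffix in addition to the over_ prefix, so the wildcard below is illustrative:
hdfs dfs -ls /flume/test
hdfs dfs -cat /flume/test/over_log1.txt*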