Using Flume to Collect Log Files from a Specified Directory to HDFS

Updated on 2022-12-09 GMT+08:00

Application Scenarios

Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It efficiently collects, aggregates, and moves log data from different data sources into a centralized data storage system. You can customize data senders in the system to collect data. Flume can also perform simple processing on the data and write it to customizable data receivers.

Flume consists of a client and a server, both of which are FlumeAgents. The server corresponds to the FlumeServer instance and is deployed directly in a cluster. The client can be deployed inside or outside the cluster. The client-side and server-side FlumeAgents work independently and provide the same functions.

The Flume client needs to be installed separately. It can be used to import data directly to components such as HDFS and Kafka of a cluster.

In this practice, the Flume component of a custom MRS cluster is used to automatically collect new files generated in the log directory of a specified node and store the files to HDFS.

Solution Architecture

Flume-NG consists of agents. Each agent consists of three components: a source, a channel, and a sink. The source receives data, the channel transmits it, and the sink sends it to the next hop or the final destination.

Figure 1 Flume-NG architecture
Table 1 Module description

Source

A source receives data or generates data by using a special mechanism, and places the data in batches in one or more channels. The source can work in data-driven or polling mode.

Typical source types are as follows:

  • Sources that are integrated with the system, such as Syslog and Netcat
  • Sources that automatically generate events, such as Exec and SEQ
  • IPC sources that are used for communication between agents, such as Avro

A source must be associated with at least one channel.

Channel

A channel is used to buffer data between a source and a sink. The channel caches data from the source and deletes that data after the sink sends the data to the next-hop channel or final destination.

Different channels provide different persistence levels.

  • Memory channel: no persistence
  • File channel: persistence based on Write-Ahead Logging (WAL)
  • JDBC channel: persistence based on an embedded database

Channels support transactions to ensure simple sequential operations. A channel can work with any number of sources and sinks.

Sink

A sink sends data to the next-hop channel or final destination. Once the transfer is complete, the transmitted data is removed from the channel.

Typical sink types are as follows:

  • Sinks that store data in the final destination, such as HDFS and HBase
  • Sinks that consume data automatically, such as Null Sink
  • IPC sinks that are used for communication between agents, such as Avro

A sink must be associated with a specific channel.

As shown in Figure 2, a Flume client can have multiple sources, channels, and sinks.

Figure 2 Flume structure
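
The association rules above (a source feeds one or more channels, a sink reads from exactly one channel) show up directly in the way an agent is wired in a Flume properties file. The following is a minimal sketch that uses open-source Apache Flume property conventions. The agent name a1, the component names r1, c1, and k1, and the netcat and logger component types are assumptions chosen for illustration and are not taken from the MRS Configuration Tool.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical names for illustration: agent "a1", source "r1", channel "c1", sink "k1".
conf_dir="$(mktemp -d)"
conf_file="$conf_dir/agent-wiring.properties"

cat > "$conf_file" <<'EOF'
# Declare the components that belong to agent a1.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Component types (placeholders chosen for illustration).
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger

# A source lists one or more channels (plural key: "channels").
a1.sources.r1.channels = c1
# A sink reads from exactly one channel (singular key: "channel").
a1.sinks.k1.channel = c1
EOF

# Print only the two lines that wire the source and the sink to the channel.
wiring="$(grep -E '^a1\.(sources|sinks)\.[^.]+\.channels? = ' "$conf_file")"
printf '%s\n' "$wiring"
```

The plural key on the source and the singular key on the sink mirror the rules in Table 1: a source may fan out to several channels, whereas a sink drains a single channel.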

Step 1: Creating an MRS Cluster

  1. Purchase an MRS cluster that contains the Flume and HDFS components. For details, see Buying a Custom Cluster.


    In this practice, an MRS 3.1.0 cluster with Kerberos authentication disabled is used as an example.

  2. After the cluster is purchased, log in to FusionInsight Manager of the cluster, download the cluster client, and decompress it.

    The Flume client needs to be installed separately. You need to download the cluster client installation package to the node where the Flume client is to be installed and decompress the package.

    1. On the Homepage of FusionInsight Manager, click the icon next to the cluster name, and then click Download Client to download the cluster client.
    2. On the Download Cluster Client page, enter the cluster client download information.
      Figure 3 Downloading the cluster client
      • Set Select Client Type to Complete Client.
      • Set Select Platform Type to the architecture of the node where the client is to be installed. x86_64 is used as an example.
      • Select Save to Path and enter the download path, for example, /tmp/FusionInsight-Client/. Ensure that user omm has the operation permission on the path.
    3. After the client software package is downloaded, log in to the active OMS node of the cluster as user root and copy the installation package to a specified node.

      By default, the client software package is downloaded to the active OMS node of the cluster. The active OMS node is marked with an icon on the host page of FusionInsight Manager. If you need to install the client software package on another node in the cluster, run the following commands to transfer the software package to the target node.

      cd /tmp/FusionInsight-Client/

      scp -p FusionInsight_Cluster_1_Services_Client.tar <IP address of the node where the Flume client is to be installed>:/tmp

    4. Log in to the node where the Flume client is to be installed as user root, go to the directory where the client software package is stored, and run the following commands to decompress the software package:

      tar -xvf FusionInsight_Cluster_1_Services_Client.tar

      tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar
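
Before moving on, it can help to confirm that the decompressed package actually contains the Flume client directory used in the later steps. The following is a minimal sketch. The function name has_flume_client is hypothetical, and the demonstration runs against a throwaway directory tree rather than a real cluster node.

```shell
#!/usr/bin/env bash
set -eu

# Return 0 if the decompressed client contains the Flume client directory.
# $1: directory in which the two tar commands were run.
has_flume_client() {
    [ -d "$1/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient" ]
}

# Demonstration against a throwaway directory tree (not a real cluster node).
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient"

if has_flume_client "$demo_dir"; then
    echo "Flume client directory found"
else
    echo "Flume client directory missing; re-check the tar commands" >&2
fi
```

On the node itself, the function would be called with the directory that holds the decompressed package, for example /tmp/FusionInsight-Client.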

Step 2: Generating the Flume Configuration File

  1. Log in to FusionInsight Manager and choose Cluster > Services. On the page that is displayed, choose Flume. On the displayed page, click the Configuration Tool tab.
  2. Configure and export the file.

    Set Agent Name to server, select Avro Source, Memory Channel, and HDFS Sink, and connect them.

    Double-click each module icon and set the parameters as described below. Retain the default values for the parameters not listed.

    Avro Source

      • Module name, which is customizable
      • IP address of the node where the Flume role resides. You can choose Cluster > Services > Flume > Instances to view the IP address of any Flume role instance.
      • Connection port. The port number starts from 21154.

    Memory Channel

      • Module name, which is customizable

    HDFS Sink

      • Module name, which is customizable
      • HDFS directory to which log files are written
      • Prefix of the file name written to HDFS


  3. Click Export to download the file to your local PC.
  4. On FusionInsight Manager, choose Cluster > Services > Flume, click the Instance tab, and click the Flume role in the row of the node where the configuration file is to be uploaded. The Instance Configurations tab page is displayed.

  5. Click Upload File and upload the file.

    Click Save. Then click OK.

  6. Choose Cluster > Services > Flume. On the page that is displayed, click the Configuration Tool tab.

    Set Agent Name to client, select SpoolDir Source, Memory Channel, and Avro Sink, and connect them.

    Double-click each module icon and set the parameters as described below. Retain the default values for the parameters not listed.

    SpoolDir Source

      • Module name, which is customizable
      • Directory from which logs are collected. The Flume running user must have read and write permissions on the directory. Verify the permissions by storing a file in the directory.

    Memory Channel

      • Module name, which is customizable

    Avro Sink

      • Module name, which is customizable
      • IP address of the node where the Flume role to be connected resides
      • Connection port. The port number starts from 21154.



  7. Click Export to download the file to your local PC.
  8. Rename the file as, and upload the file to the <path where the cluster client installation package is decompressed>/Flume/FlumeClient/flume/conf directory on the Flume client node.
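
For orientation, the two agents configured in this step can be pictured as a pair of Flume properties files. The following is a minimal sketch, not the output of the Configuration Tool. It uses open-source Apache Flume property keys (bind, port, hdfs.path, hdfs.filePrefix, spoolDir, hostname). The IP address, the component names, and the two file names are assumptions made for illustration, and the exported files may contain additional or differently named keys.

```shell
#!/usr/bin/env bash
set -eu

# Illustrative assumptions, not values exported by the Configuration Tool:
FLUME_SERVER_IP="192.168.0.10"   # hypothetical IP address of a Flume role instance
FLUME_PORT="21154"               # first port in the range mentioned in the text
out_dir="$(mktemp -d)"           # throwaway directory for the two sketch files

# Server agent: Avro Source -> Memory Channel -> HDFS Sink
cat > "$out_dir/server-sketch.properties" <<EOF
server.sources = avro_src
server.channels = mem_ch
server.sinks = hdfs_sink

server.sources.avro_src.type = avro
server.sources.avro_src.bind = ${FLUME_SERVER_IP}
server.sources.avro_src.port = ${FLUME_PORT}
server.sources.avro_src.channels = mem_ch

server.channels.mem_ch.type = memory

server.sinks.hdfs_sink.type = hdfs
server.sinks.hdfs_sink.hdfs.path = /flume/test
server.sinks.hdfs_sink.hdfs.filePrefix = over_
server.sinks.hdfs_sink.channel = mem_ch
EOF

# Client agent: SpoolDir Source -> Memory Channel -> Avro Sink
cat > "$out_dir/client-sketch.properties" <<EOF
client.sources = spool_src
client.channels = mem_ch
client.sinks = avro_sink

client.sources.spool_src.type = spooldir
client.sources.spool_src.spoolDir = /var/log/Bigdata/audit/test
client.sources.spool_src.channels = mem_ch

client.channels.mem_ch.type = memory

client.sinks.avro_sink.type = avro
client.sinks.avro_sink.hostname = ${FLUME_SERVER_IP}
client.sinks.avro_sink.port = ${FLUME_PORT}
client.sinks.avro_sink.channel = mem_ch
EOF

# Read the value of one "key = value" line from a properties file.
prop() {
    sed -n "s/^$1 = //p" "$2"
}

# The client's Avro Sink must target the address the server's Avro Source listens on.
server_endpoint="$(prop server.sources.avro_src.bind "$out_dir/server-sketch.properties"):$(prop server.sources.avro_src.port "$out_dir/server-sketch.properties")"
client_endpoint="$(prop client.sinks.avro_sink.hostname "$out_dir/client-sketch.properties"):$(prop client.sinks.avro_sink.port "$out_dir/client-sketch.properties")"

if [ "$server_endpoint" = "$client_endpoint" ]; then
    echo "endpoints match: $server_endpoint"
else
    echo "endpoint mismatch: server=$server_endpoint client=$client_endpoint" >&2
    exit 1
fi
```

The final check reflects the one constraint that links the two files: the IP address and port set on the client's Avro Sink have to equal those set on the server's Avro Source, otherwise the client cannot deliver events to the cluster.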

Step 3: Installing the Flume Client

  1. Log in to the node where the Flume client is to be installed as user root.
  2. Go to the path where the client installation package is decompressed. For example, the client installation package has been uploaded to /tmp and then decompressed.
  3. Run the following commands to install the Flume client. In the command, /opt/FlumeClient indicates the custom Flume client installation path.

    cd /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/Flume/FlumeClient

    ./ -d /opt/FlumeClient -c flume/conf/

    CST ... [flume-client install]: install flume client successfully.

Step 4: Viewing Log Collection Results

  1. After the Flume client is installed, write new log files to the log collection directory to check whether logs are transmitted.

    For example, create several log files in the /var/log/Bigdata/audit/test directory.

    cd /var/log/Bigdata/audit/test

    vi log1.txt

    Test log file 1!!!

    vi log2.txt

    Test log file 2!!!

  2. After the log files are written, run the ll command to view the file list. If the suffix .COMPLETED is automatically added to the file names, the log files have been collected.

    -rw-------. 1 root root      75 Jun  9 19:59 log1.txt.COMPLETED
    -rw-------. 1 root root      75 Jun  9 19:59 log2.txt.COMPLETED

  3. Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the Dashboard tab page that is displayed, click the NameNode(Node name,Active) link next to NameNode WebUI to access the HDFS web UI.

  4. Choose Utilities > Browse the file system and check whether data is generated in the /flume/test directory in HDFS.

    Log files are generated in the directory, and the prefix over_ is added to the file names.

    Download the log file over_log1.txt and check whether its content is the same as that of the log file log1.txt.

    Test log file 1!!!
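
The check in step 2 can also be scripted. The following is a minimal sketch that lists the files in a collection directory that do not yet carry the .COMPLETED suffix. The function name list_pending is hypothetical, and the demonstration uses a temporary directory with two sample files instead of /var/log/Bigdata/audit/test.

```shell
#!/usr/bin/env bash
set -eu

# List regular files in a collection directory that do not yet carry the
# .COMPLETED suffix, that is, files not yet marked as collected.
# $1: directory to inspect.
list_pending() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.COMPLETED) ;;          # already collected
            *) basename "$f" ;;      # still waiting to be collected
        esac
    done
}

# Demonstration with sample files in a throwaway directory.
demo_dir="$(mktemp -d)"
printf 'Test log file 1!!!\n' > "$demo_dir/log1.txt.COMPLETED"
printf 'Test log file 2!!!\n' > "$demo_dir/log2.txt"

pending="$(list_pending "$demo_dir")"
echo "pending: ${pending:-none}"
```

On the Flume client node, list_pending /var/log/Bigdata/audit/test would print nothing once every file has been collected. On a node where the HDFS client is available, running hdfs dfs -cat /flume/test/over_log1.txt prints the stored content, which offers a command-line alternative to downloading the file from the web UI.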
