
Connecting Flume to OBS

Overview

Flume is a distributed, reliable, and highly available service for collecting, aggregating, and moving large amounts of log data. For details, see Apache Flume. In big data scenarios, OBS can replace HDFS in the Hadoop system.

Precautions

  • Multiple sinks writing to the same file

    OBS and HDFS differ in their consistency guarantees. HDFS uses a lease mechanism to keep data consistent when the same file is written concurrently, but the HDFS protocol implemented by OBS does not support leases, so the result of concurrent writes to the same file is undefined. To avoid this issue in Flume scenarios, use file naming rules that ensure each sink writes to its own file.

    For example, use hostname-sinkname as the prefix of a sink's file name. If multiple Flume agents are deployed on the same host, make sure no two agents use the same sink name.
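
    A minimal sketch of this naming rule as sink properties (the agent and sink names k1 and k2 are illustrative; %{host} is filled in by the host interceptor used in the example configuration under Procedure):

    # Hypothetical example: two sinks on the same host, each with a unique
    # file prefix, so they never write to the same OBS object concurrently
    agent.sinks.k1.hdfs.filePrefix = %{host}_k1
    agent.sinks.k2.hdfs.filePrefix = %{host}_k2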

  • Flume log configuration

    To reduce log output, add the following configuration to the /opt/apache-flume-1.9.0-bin/conf/log4j.properties file:

    log4j.logger.com.obs=ERROR
  • Configuring the directory for temporary files written by OBSA

    When Flume writes data to OBS, the data is first written to a local disk buffer and then uploaded to OBS. For better write performance, use a high-performance disk as the buffer. Specifically, add the following configuration to the core-site.xml file:

    <property>
      <name>fs.obs.buffer.dir</name>
      <value>xxx</value>
    </property>
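
    For example, a hypothetical configuration pointing the buffer at an SSD-backed directory (the path /mnt/ssd/obs-buffer is illustrative):

    <property>
      <name>fs.obs.buffer.dir</name>
      <!-- Hypothetical path on a fast local disk -->
      <value>/mnt/ssd/obs-buffer</value>
    </property>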
    

Procedure

The following uses Flume 1.9 as an example.

  1. Download apache-flume-1.9.0-bin.tar.gz.
  2. Install Flume.

    Decompress apache-flume-1.9.0-bin.tar.gz to the /opt/apache-flume-1.9.0-bin directory.

    • If Hadoop has been deployed, no additional operation is required. For details about the deployment, see Connecting Hadoop to OBS.
    • If Hadoop is not deployed:
      1. Copy the Hadoop JAR packages, including hadoop-huaweicloud-xxx.jar, to the /opt/apache-flume-1.9.0-bin/lib directory.
      2. Copy the core-site.xml file containing the OBS configurations to the /opt/apache-flume-1.9.0-bin/conf directory.
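
    A minimal sketch of these two copy steps (the staging directory /tmp/obsa is hypothetical; use wherever the downloaded JARs and the prepared core-site.xml reside):

      # Hypothetical staging directory holding the required Hadoop JARs,
      # including hadoop-huaweicloud-xxx.jar, and a core-site.xml with the
      # OBS endpoint and credentials filled in
      cp /tmp/obsa/*.jar /opt/apache-flume-1.9.0-bin/lib/
      cp /tmp/obsa/core-site.xml /opt/apache-flume-1.9.0-bin/conf/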

  3. Check whether the connection is successful.

    Example: the built-in StressSource is used as the source, a file channel is used as the channel, and OBS is used as the sink (through the HDFS sink).

    1. Create a Flume configuration file sink2obs.properties.
      # Name the agent's source, channel, and sink
      agent.sources = r1
      agent.channels = c1
      agent.sinks = k1
      
      # StressSource generates load-test events of a fixed size
      agent.sources.r1.type = org.apache.flume.source.StressSource
      agent.sources.r1.channels = c1
      agent.sources.r1.size = 1024
      agent.sources.r1.maxTotalEvents = 100000
      agent.sources.r1.maxEventsPerSecond = 10000
      agent.sources.r1.batchSize = 1000
      
      # The host interceptor adds the hostname to each event so that
      # %{host} can be used in the sink file prefix
      agent.sources.r1.interceptors = i1
      agent.sources.r1.interceptors.i1.type = host
      agent.sources.r1.interceptors.i1.useIP = false
      
      # The file channel buffers events on the local disk
      agent.channels.c1.type = file
      agent.channels.c1.dataDirs = /data/agent/flume-data
      agent.channels.c1.checkpointDir = /data/agent/flume-checkpoint
      agent.channels.c1.capacity = 500000
      agent.channels.c1.transactionCapacity = 50000
      
      # The HDFS sink writes to OBS through the obs:// path; files roll
      # every 1,000 events (size- and time-based rolling are disabled)
      agent.sinks.k1.channel = c1
      agent.sinks.k1.type = hdfs
      agent.sinks.k1.hdfs.useLocalTimeStamp = true
      agent.sinks.k1.hdfs.filePrefix = %{host}_k1
      agent.sinks.k1.hdfs.path = obs://obs-bucket/flume/create_time=%Y-%m-%d-%H-%M
      agent.sinks.k1.hdfs.fileType = DataStream
      agent.sinks.k1.hdfs.writeFormat = Text
      agent.sinks.k1.hdfs.rollSize = 0
      agent.sinks.k1.hdfs.rollCount = 1000
      agent.sinks.k1.hdfs.rollInterval = 0
      agent.sinks.k1.hdfs.batchSize = 1000
      # Round the timestamp in the path down to 10-minute buckets
      agent.sinks.k1.hdfs.round = true
      agent.sinks.k1.hdfs.roundValue = 10
      agent.sinks.k1.hdfs.roundUnit = minute
    2. Start the Flume agent:

      ./bin/flume-ng agent -n agent -c conf/ -f conf/sink2obs.properties
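
      After the agent has run, you can verify that files were written to OBS, for example with the Hadoop client (a sketch, assuming the client uses the same core-site.xml; obs-bucket matches the sink path in the configuration above):

      hadoop fs -ls obs://obs-bucket/flume/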