
Connecting Flume to OBS

Overview

Flume is a distributed, reliable, and highly available service for collecting, aggregating, and moving large amounts of log data. For details, see Apache Flume. In big data scenarios, OBS can replace HDFS in the Hadoop system.

Precautions

  • Multiple sinks writing to the same file

    OBS and HDFS differ in their consistency guarantees. HDFS uses a lease mechanism to keep data consistent when the same file is written concurrently, but the HDFS protocol implemented by OBS does not support leases, so the result of concurrent writes to the same file is undefined. To avoid this issue in Flume scenarios, use file naming rules that ensure each sink writes to its own file.

    For example, use hostname-sinkname as the prefix of a sink's file name. If multiple Flume agents are deployed on the same host, make sure no two agents use the same sink name.
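
    A minimal sketch of this naming rule as sink properties (the agent and sink names k1 and k2 are illustrative; %{host} is filled in by the host interceptor used in the example configuration under Procedure):

    # Hypothetical example: two sinks on the same host, each with a unique
    # file prefix, so they never write to the same OBS object concurrently
    agent.sinks.k1.hdfs.filePrefix = %{host}_k1
    agent.sinks.k2.hdfs.filePrefix = %{host}_k2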

  • Flume log configuration

    To reduce log output, add the following configuration to the /opt/apache-flume-1.9.0-bin/conf/log4j.properties file:

    log4j.logger.com.obs=ERROR
  • Configuring the directory for temporary files written by OBSA

    When Flume writes data to OBS, the data is first written to a local disk buffer and then uploaded to OBS. For better write performance, use a high-performance disk as the buffer. Specifically, add the following configuration to the core-site.xml file:

    <property>
      <name>fs.obs.buffer.dir</name>
      <value>xxx</value>
    </property>
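
    For example, a hypothetical configuration pointing the buffer at an SSD-backed directory (the path /mnt/ssd/obs-buffer is illustrative):

    <property>
      <name>fs.obs.buffer.dir</name>
      <!-- Hypothetical path on a fast local disk -->
      <value>/mnt/ssd/obs-buffer</value>
    </property>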
    

Procedure

The following uses Flume 1.9 as an example.

  1. Download apache-flume-1.9.0-bin.tar.gz.
  2. Install Flume.

    Decompress apache-flume-1.9.0-bin.tar.gz to the /opt/apache-flume-1.9.0-bin directory.

    • If Hadoop has been deployed, no additional operation is required. For details about the deployment, see Connecting Hadoop to OBS.
    • If Hadoop is not deployed:
      1. Copy the Hadoop JAR packages, including hadoop-huaweicloud-xxx.jar, to the /opt/apache-flume-1.9.0-bin/lib directory.
      2. Copy the core-site.xml file containing the OBS configurations to the /opt/apache-flume-1.9.0-bin/conf directory.
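
    A minimal sketch of these two copy steps (the staging directory /tmp/obsa is hypothetical; use wherever the downloaded JARs and the prepared core-site.xml reside):

      # Hypothetical staging directory holding the required Hadoop JARs,
      # including hadoop-huaweicloud-xxx.jar, and a core-site.xml with the
      # OBS endpoint and credentials filled in
      cp /tmp/obsa/*.jar /opt/apache-flume-1.9.0-bin/lib/
      cp /tmp/obsa/core-site.xml /opt/apache-flume-1.9.0-bin/conf/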

  3. Check whether the connection is successful.

    Example: the built-in StressSource is used as the source, a file channel is used as the channel, and OBS is used as the sink (through the HDFS sink).

    1. Create a Flume configuration file sink2obs.properties.
      # Name the agent's source, channel, and sink
      agent.sources = r1
      agent.channels = c1
      agent.sinks = k1
      
      # StressSource generates load-test events of a fixed size
      agent.sources.r1.type = org.apache.flume.source.StressSource
      agent.sources.r1.channels = c1
      agent.sources.r1.size = 1024
      agent.sources.r1.maxTotalEvents = 100000
      agent.sources.r1.maxEventsPerSecond = 10000
      agent.sources.r1.batchSize = 1000
      
      # The host interceptor adds the hostname to each event so that
      # %{host} can be used in the sink file prefix
      agent.sources.r1.interceptors = i1
      agent.sources.r1.interceptors.i1.type = host
      agent.sources.r1.interceptors.i1.useIP = false
      
      # The file channel buffers events on the local disk
      agent.channels.c1.type = file
      agent.channels.c1.dataDirs = /data/agent/flume-data
      agent.channels.c1.checkpointDir = /data/agent/flume-checkpoint
      agent.channels.c1.capacity = 500000
      agent.channels.c1.transactionCapacity = 50000
      
      # The HDFS sink writes to OBS through the obs:// path; files roll
      # every 1,000 events (size- and time-based rolling are disabled)
      agent.sinks.k1.channel = c1
      agent.sinks.k1.type = hdfs
      agent.sinks.k1.hdfs.useLocalTimeStamp = true
      agent.sinks.k1.hdfs.filePrefix = %{host}_k1
      agent.sinks.k1.hdfs.path = obs://obs-bucket/flume/create_time=%Y-%m-%d-%H-%M
      agent.sinks.k1.hdfs.fileType = DataStream
      agent.sinks.k1.hdfs.writeFormat = Text
      agent.sinks.k1.hdfs.rollSize = 0
      agent.sinks.k1.hdfs.rollCount = 1000
      agent.sinks.k1.hdfs.rollInterval = 0
      agent.sinks.k1.hdfs.batchSize = 1000
      # Round the timestamp in the path down to 10-minute buckets
      agent.sinks.k1.hdfs.round = true
      agent.sinks.k1.hdfs.roundValue = 10
      agent.sinks.k1.hdfs.roundUnit = minute
    2. Start the Flume agent:

      ./bin/flume-ng agent -n agent -c conf/ -f conf/sink2obs.properties
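
      After the agent has run, you can verify that files were written to OBS, for example with the Hadoop client (a sketch, assuming the client uses the same core-site.xml; obs-bucket matches the sink path in the configuration above):

      hadoop fs -ls obs://obs-bucket/flume/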