Connecting Flume to OBS
Overview
Flume is a distributed, reliable, and highly available service for collecting, aggregating, and moving large amounts of log data. For details, see Apache Flume. In big data scenarios, OBS can replace HDFS in the Hadoop system.
Precautions
- Multiple sinks writing to the same file
OBS and HDFS differ in their consistency guarantees. The HDFS lease mechanism keeps data consistent when the same file is written concurrently, but the HDFS protocol implemented by OBS does not support leases, so the result of concurrent writes to the same file is undefined. To avoid this in Flume scenarios, use file naming rules that give each sink a distinct file.
For example, use hostname-sinkname as the prefix of a sink file name. If multiple Flume agents are deployed on a host, each agent must use a different sink name.
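A minimal sketch of this naming scheme, using Flume's standard host interceptor to stamp each event with the agent's hostname and referencing it in the sink file prefix (the sink name k1 is illustrative; the full example in the Procedure section uses the same approach):
agent.sources.r1.interceptors = i1
agent.sources.r1.interceptors.i1.type = host
agent.sources.r1.interceptors.i1.useIP = false
# The %{host} header set by the interceptor, combined with the sink
# name, keeps file prefixes unique across hosts and sinks.
agent.sinks.k1.hdfs.filePrefix = %{host}_k1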
- Flume log configuration
To reduce the log output, add the following setting to the /opt/apache-flume-1.9.0-bin/conf/log4j.properties file:
log4j.logger.com.obs=ERROR
- Directory for the temporary files that OBSA writes
When Flume writes data to OBS, the data is first written to a local disk buffer and then uploaded to OBS. If you need better write performance, place the buffer on a high-performance disk. Specifically, add the following configuration to the core-site.xml file:
<property>
    <name>fs.obs.buffer.dir</name>
    <value>xxx</value>
</property>
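For example, pointing the buffer at a dedicated SSD mount (the path below is hypothetical; replace it with a directory that exists on your hosts):
<property>
    <name>fs.obs.buffer.dir</name>
    <!-- Hypothetical mount point for a high-performance local disk -->
    <value>/mnt/ssd/obs-buffer</value>
</property>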
Procedure
The following uses Flume 1.9 as an example.
- Download apache-flume-1.9.0-bin.tar.gz.
- Install Flume.
Decompress apache-flume-1.9.0-bin.tar.gz to the /opt/apache-flume-1.9.0-bin directory.
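A typical way to do this (assuming the package was downloaded to the current directory):
# Extract the tarball; it unpacks into /opt/apache-flume-1.9.0-bin.
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt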
- If Hadoop has been deployed, no additional operation is required. For details about the deployment, see Connecting Hadoop to OBS.
- If Hadoop is not deployed:
- Copy the Hadoop JAR packages, including hadoop-huaweicloud-xxx.jar, to the /opt/apache-flume-1.9.0-bin/lib directory.
- Copy the core-site.xml file containing the OBS configurations to the /opt/apache-flume-1.9.0-bin/conf directory.
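A sketch of the two copy steps above, assuming a local Hadoop installation under /opt/hadoop (adjust the source paths to where the OBSA JARs and core-site.xml live in your environment):
# Copy the OBSA connector and related Hadoop JARs into Flume's classpath.
cp /opt/hadoop/share/hadoop/tools/lib/hadoop-huaweicloud-*.jar /opt/apache-flume-1.9.0-bin/lib/
# Copy the core-site.xml that carries the OBS endpoint and credentials.
cp /opt/hadoop/etc/hadoop/core-site.xml /opt/apache-flume-1.9.0-bin/conf/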
- Check whether the connection is successful.
Example: The built-in StressSource is used as the source, a file channel is used as the channel, and OBS is used as the sink (through the HDFS sink type).
- Create a Flume configuration file sink2obs.properties.
agent.sources = r1
agent.channels = c1
agent.sinks = k1

agent.sources.r1.type = org.apache.flume.source.StressSource
agent.sources.r1.channels = c1
agent.sources.r1.size = 1024
agent.sources.r1.maxTotalEvents = 100000
agent.sources.r1.maxEventsPerSecond = 10000
agent.sources.r1.batchSize = 1000
agent.sources.r1.interceptors = i1
agent.sources.r1.interceptors.i1.type = host
agent.sources.r1.interceptors.i1.useIP = false

agent.channels.c1.type = file
agent.channels.c1.dataDirs = /data/agent/flume-data
agent.channels.c1.checkpointDir = /data/agent/flume-checkpoint
agent.channels.c1.capacity = 500000
agent.channels.c1.transactionCapacity = 50000

agent.sinks.k1.channel = c1
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.useLocalTimeStamp = true
agent.sinks.k1.hdfs.filePrefix = %{host}_k1
agent.sinks.k1.hdfs.path = obs://obs-bucket/flume/create_time=%Y-%m-%d-%H-%M
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.writeFormat = Text
agent.sinks.k1.hdfs.rollSize = 0
agent.sinks.k1.hdfs.rollCount = 1000
agent.sinks.k1.hdfs.rollInterval = 0
agent.sinks.k1.hdfs.batchSize = 1000
agent.sinks.k1.hdfs.round = true
agent.sinks.k1.hdfs.roundValue = 10
agent.sinks.k1.hdfs.roundUnit = minute
- Start the Flume agent:
./bin/flume-ng agent -n agent -c conf/ -f conf/sink2obs.properties
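After the agent has run for a while, you can check that files are being created in the bucket (assuming the host's Hadoop client is configured for OBS as described in Connecting Hadoop to OBS; the bucket name matches the sink path above):
# List the files Flume has rolled into the target OBS path.
hadoop fs -ls obs://obs-bucket/flume/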