Writing Data to OBS Using Flink Jar
Overview
DLI allows you to use a custom JAR package to run Flink jobs and write data to OBS. This section describes how to write processed Kafka data to OBS. You need to modify the parameters in the example Java code based on site requirements.
Environment Preparations
- Development tools such as IntelliJ IDEA, the JDK, and Maven have been installed and configured.
- For details about how to configure the pom.xml file of the Maven project, see "POM file configurations" in Java Example Code (Flink 1.12).
- Ensure that the local compilation environment can access the public network.
Constraints
- Authorize DLI before writing data to OBS: in the left navigation pane of the DLI console, choose Global Configuration > Service Authorization, select Tenant Administrator (Global service), and click Update.
- The bucket to which data is written must be an OBS bucket created by the main account.
- When using Flink 1.15, you need to configure an agency yourself to avoid potential impacts on job execution. For details, see Customizing DLI Agency Permissions.
Java Example Code (Flink 1.15)
- POM file configurations
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <parent> <groupId>com.huaweicloud</groupId> <artifactId>dli-flink-demo</artifactId> <version>1.0-SNAPSHOT</version> </parent> <groupId>org.example</groupId> <artifactId>flink-1.15-demo</artifactId> <properties> <flink.version>1.15.0</flink.version> </properties> <dependencies> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-statebackend-rocksdb</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-streaming-java</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-table-planner_2.12</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.14.2</version> <scope>provided</scope> </dependency> </dependencies> <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-assembly-plugin</artifactId> <version>3.3.0</version> <executions> <execution> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> <configuration> <archive> <manifest> <mainClass>com.huawei.dli.GetUserConfigFileDemo</mainClass> </manifest> </archive> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> </plugin> </plugins> <resources> <resource> <directory>src/main/resources</directory> <filtering>true</filtering> <includes> <include>**/*.*</include> </includes> </resource> </resources> </build> </project>
- Example code
package com.huawei.dli;

import com.huawei.dli.source.CustomParallelSource;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class GetUserConfigFileDemo {

    private static final Logger LOG = LoggerFactory.getLogger(GetUserConfigFileDemo.class);

    public static void main(String[] args) {
        try {
            ParameterTool params = ParameterTool.fromArgs(args);
            LOG.info("Params: " + params.toString());
            StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();

            // Set the checkpoint path, appending a timestamp suffix.
            String checkpointPath = params.get("checkpoint.path", "obs://bucket/checkpoint/jobId_jobName/");
            LocalDateTime localDateTime = LocalDateTime.ofEpochSecond(System.currentTimeMillis() / 1000,
                0, ZoneOffset.ofHours(8));
            String dt = localDateTime.format(DateTimeFormatter.ofPattern("yyyyMMdd_HH:mm:ss"));
            checkpointPath = checkpointPath + dt;

            streamEnv.setStateBackend(new EmbeddedRocksDBStateBackend());
            streamEnv.getCheckpointConfig().setCheckpointStorage(checkpointPath);
            streamEnv.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
            streamEnv.enableCheckpointing(30 * 1000);

            DataStream<String> stream = streamEnv.addSource(new CustomParallelSource())
                .setParallelism(1)
                .disableChaining();

            String outputPath = params.get("output.path", "obs://bucket/outputPath/jobId_jobName");

            // Read the user-provided dependency configuration file.
            URL url = GetUserConfigFileDemo.class.getClassLoader().getResource("userData/user.config");
            if (url != null) {
                Path filePath = org.apache.flink.util.FileUtils.absolutizePath(new Path(url.getPath()));
                try {
                    String config = org.apache.flink.util.FileUtils.readFileUtf8(new File(filePath.getPath()));
                    LOG.info("config is {}", config);
                    // Do something with the config.
                } catch (IOException e) {
                    LOG.error(e.getMessage(), e);
                }
            }

            // Sink to OBS.
            final StreamingFileSink<String> sinkForRow = StreamingFileSink
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

            stream.addSink(sinkForRow);
            streamEnv.execute("sinkForRow");
        } catch (Throwable e) {
            LOG.error(e.getMessage(), e);
        }
    }
}
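The example above imports com.huawei.dli.source.CustomParallelSource, which is not listed in this section. The following is a minimal, illustrative sketch of such a source; the class body is an assumption and should be replaced with your actual data source (for example, the Kafka source shown in the Flink 1.12 example).

package com.huawei.dli.source;

import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Illustrative placeholder for CustomParallelSource: each parallel subtask emits
// one test string per second until the job is canceled.
public class CustomParallelSource extends RichParallelSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        long count = 0;
        while (running) {
            // Hold the checkpoint lock while emitting so records and checkpoints stay consistent.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("test-record-" + count++);
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}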
Table 1 Parameter description

| Parameter       | Description                               | Example                 |
|-----------------|-------------------------------------------|-------------------------|
| output.path     | OBS path to which data is written         | obs://bucket/output     |
| checkpoint.path | OBS path to which checkpoints are written | obs://bucket/checkpoint |
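Both parameters are read by ParameterTool.fromArgs, so they can be passed as program arguments when submitting the Flink Jar job, for example (the bucket name is a placeholder):

--output.path obs://your-bucket/output/ --checkpoint.path obs://your-bucket/checkpoint/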
Java Example Code (Flink 1.12)
- POM file configurations
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <parent> <artifactId>Flink-demo</artifactId> <groupId>com.huaweicloud</groupId> <version>1.0-SNAPSHOT</version> </parent> <modelVersion>4.0.0</modelVersion> <artifactId>flink-kafka-to-obs</artifactId> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <!--Flink version--> <flink.version>1.12.2</flink.version> <!--JDK version --> <java.version>1.8</java.version> <!--Scala 2.11 --> <scala.binary.version>2.11</scala.binary.version> <slf4j.version>2.13.3</slf4j.version> <log4j.version>2.10.0</log4j.version> <maven.compiler.source>8</maven.compiler.source> <maven.compiler.target>8</maven.compiler.target> </properties> <dependencies> <!-- flink --> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-java</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-streaming-java_${scala.binary.version}</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-statebackend-rocksdb_2.11</artifactId> <version>${flink.version}</version> <scope>provided</scope> </dependency> <!-- kafka --> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-connector-kafka_2.11</artifactId> <version>${flink.version}</version> </dependency> <!-- logging --> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-slf4j-impl</artifactId> <version>${slf4j.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-api</artifactId> <version>${log4j.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-core</artifactId> <version>${log4j.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-jcl</artifactId> <version>${log4j.version}</version> <scope>provided</scope> </dependency> </dependencies> <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-assembly-plugin</artifactId> <version>3.3.0</version> <executions> <execution> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> <configuration> <archive> <manifest> <mainClass>com.huaweicloud.dli.FlinkKafkaToObsExample</mainClass> </manifest> </archive> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> </plugin> </plugins> <resources> <resource> <directory>../main/config</directory> <filtering>true</filtering> <includes> <include>**/*.*</include> </includes> </resource> </resources> </build> </project>
- Example code
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;

/**
 * @author xxx
 * @date 6/26/21
 */
public class FlinkKafkaToObsExample {
    private static final Logger LOG = LoggerFactory.getLogger(FlinkKafkaToObsExample.class);

    public static void main(String[] args) throws Exception {
        LOG.info("Start Kafka2OBS Flink Streaming Source Java Demo.");
        ParameterTool params = ParameterTool.fromArgs(args);
        LOG.info("Params: " + params.toString());

        // Kafka connection address
        String bootstrapServers;
        // Kafka consumer group
        String kafkaGroup;
        // Kafka topic
        String kafkaTopic;
        // Consumption policy. This policy is used only when the partition does not have a checkpoint or the checkpoint expires.
        // If a valid checkpoint exists, consumption continues from that checkpoint.
        // When the policy is set to LATEST, consumption starts from the latest data and ignores the existing data in the stream.
        // When the policy is set to EARLIEST, consumption starts from the earliest data and obtains all valid data in the stream.
        String offsetPolicy;
        // OBS file output path, in the format of obs://bucket/path.
        String outputPath;
        // Checkpoint output path, in the format of obs://bucket/path.
        String checkpointPath;

        bootstrapServers = params.get("bootstrap.servers", "xxxx:9092,xxxx:9092,xxxx:9092");
        kafkaGroup = params.get("group.id", "test-group");
        kafkaTopic = params.get("topic", "test-topic");
        offsetPolicy = params.get("offset.policy", "earliest");
        outputPath = params.get("output.path", "obs://bucket/output");
        checkpointPath = params.get("checkpoint.path", "obs://bucket/checkpoint");

        try {
            // Create an execution environment.
            StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
            streamEnv.setParallelism(4);
            RocksDBStateBackend rocksDbBackend = new RocksDBStateBackend(checkpointPath, true);
            // Equivalent alternative: new RocksDBStateBackend(new FsStateBackend(checkpointPath), true)
            streamEnv.setStateBackend(rocksDbBackend);

            // Enable Flink checkpointing. If enabled, the offset information will be synchronized to Kafka.
            streamEnv.enableCheckpointing(300000);
            // Set the minimum interval between two checkpoints.
            streamEnv.getCheckpointConfig().setMinPauseBetweenCheckpoints(60000);
            // Set the checkpoint timeout duration.
            streamEnv.getCheckpointConfig().setCheckpointTimeout(60000);
            // Set the maximum number of concurrent checkpoints.
            streamEnv.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
            // Retain checkpoints when a job is canceled.
            streamEnv.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Source: connect to the Kafka data source.
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", bootstrapServers);
            properties.setProperty("group.id", kafkaGroup);
            properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, offsetPolicy);
            String topic = kafkaTopic;

            // Create a Kafka consumer.
            FlinkKafkaConsumer<String> kafkaConsumer =
                new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), properties);
            /**
             * Read partitions from the offsets committed by the consumer group (specified by group.id in the consumer properties) in Kafka brokers.
             * If a partition offset cannot be found, set it by using the auto.offset.reset parameter.
             * For details, see https://ci.apache.org/projects/flink/flink-docs-release-1.13/zh/docs/connectors/datastream/kafka/.
             */
            kafkaConsumer.setStartFromGroupOffsets();

            // Add Kafka as the data source.
            DataStream<String> stream = streamEnv.addSource(kafkaConsumer).setParallelism(3).disableChaining();

            // Create a file output stream.
            final StreamingFileSink<String> sink = StreamingFileSink
                // Specify the file output path and row encoding format.
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                // Alternatively, specify the file output path and bulk encoding format to output files in Parquet format.
                //.forBulkFormat(new Path(outputPath), ParquetAvroWriters.forGenericRecord(schema))
                // Specify a custom bucket assigner.
                .withBucketAssigner(new DateTimeBucketAssigner<>())
                // Specify the rolling policy.
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

            // Add the sink for the Kafka data source.
            stream.addSink(sink).disableChaining().name("obs");
            // stream.print();
            streamEnv.execute();
        } catch (Exception e) {
            LOG.error(e.getMessage(), e);
        }
    }
}
Table 2 Parameter description

| Parameter         | Description                               | Example                                                                     |
|-------------------|-------------------------------------------|-----------------------------------------------------------------------------|
| bootstrap.servers | Kafka connection address                  | <Kafka-instance-1-IP>:9092,<Kafka-instance-2-IP>:9092,<Kafka-instance-3-IP>:9092 |
| group.id          | Kafka consumer group                      | test-group                                                                  |
| topic             | Kafka topic to consume                    | test-topic                                                                  |
| offset.policy     | Kafka offset policy                       | earliest                                                                    |
| output.path       | OBS path to which data is written         | obs://bucket/output                                                         |
| checkpoint.path   | OBS path to which checkpoints are written | obs://bucket/checkpoint                                                     |
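These parameters are also read by ParameterTool.fromArgs, so they can be passed as program arguments when submitting the Flink Jar job, for example (broker addresses and bucket name are placeholders):

--bootstrap.servers 192.168.0.1:9092,192.168.0.2:9092,192.168.0.3:9092 --group.id test-group --topic test-topic --offset.policy earliest --output.path obs://your-bucket/output --checkpoint.path obs://your-bucket/checkpoint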
Compiling and Running the Application
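To build the JAR package locally, a minimal sketch assuming the standard Maven layout and the maven-assembly-plugin configuration shown above:

mvn clean package

The assembly plugin writes the job JAR with bundled dependencies to the target directory, for example target/flink-kafka-to-obs-1.0-SNAPSHOT-jar-with-dependencies.jar (the exact name depends on your artifactId and version).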
After the JAR package is built, upload it to DLI by referring to Flink Jar Job Examples and check whether the expected data exists in the OBS output path.