MRS Kafka Source Stream

Function

Create a source stream to obtain data from Kafka as input data for jobs.

Apache Kafka is a fast, scalable, and fault-tolerant distributed publish-subscribe messaging system. It delivers high throughput, supports built-in partitioning, and provides data replication and fault tolerance, which makes it suitable for handling massive volumes of messages. Kafka clusters are deployed and hosted on MRS and are powered by Apache Kafka.

Prerequisites

  • If the Kafka server listens on a port using a hostname, you need to add the mapping between the hostname and IP address of the Kafka Broker node to the DLI queue. Contact the Kafka service deployment personnel to obtain the hostname and IP address of the Kafka Broker node. For details about how to add an IP-domain mapping, see Enhanced Datasource Connection in the Data Lake Insight User Guide.
  • Kafka is an offline cluster, so you need to use an enhanced datasource connection to connect Flink jobs to Kafka. You can also set security group rules as required.

    For details about how to create an enhanced datasource connection, see Enhanced Datasource Connections in the Data Lake Insight User Guide.

    For details about how to configure security group rules, see Security Group in the Virtual Private Cloud User Guide.

Syntax

CREATE SOURCE STREAM stream_id (attr_name attr_type (',' attr_name attr_type)* )
  WITH (
    type = "kafka",
    kafka_bootstrap_servers = "",
    kafka_group_id = "",
    kafka_topic = "",
    encode = "json"
  )
  (TIMESTAMP BY timeindicator (',' timeindicator)?);

timeindicator: PROCTIME '.' PROCTIME | ID '.' ROWTIME

Keyword

Table 1 Keyword description

Parameter

Mandatory

Description

type

Yes

Data source type. The value kafka indicates that the data source is Kafka.

kafka_bootstrap_servers

Yes

Addresses of the Kafka brokers that DLI connects to, in the ip:port format. Separate multiple addresses with commas (,). Use enhanced datasource connections to connect DLI queues with Kafka clusters.

kafka_group_id

No

Kafka consumer group ID.

kafka_topic

Yes

Kafka topic to be read.

encode

Yes

Data encoding format. The value can be csv, json, blob, or user_defined.

  • field_delimiter must be specified if this parameter is set to csv.
  • json_config must be specified if this parameter is set to json.
  • If this parameter is set to blob, the received data is not parsed, only one stream attribute exists, and the stream attribute is of the Array[TINYINT] type.
  • encode_class_name and encode_class_parameter must be specified if this parameter is set to user_defined.

encode_class_name

No

If encode is set to user_defined, you need to set this parameter to the name of the user-defined decoding class (including the complete package path). The class must inherit the DeserializationSchema class.

encode_class_parameter

No

If encode is set to user_defined, you can set this parameter to specify the input parameter of the user-defined decoding class. Only one parameter of the string type is supported.

krb_auth

No

Name of the datasource authentication that was created for the datasource connection. This parameter is mandatory when Kerberos authentication is enabled. If Kerberos authentication is not enabled for the created MRS cluster, ensure that the /etc/hosts information of the master node in the MRS cluster is added to the hosts file of the DLI queue.

json_config

No

If encode is set to json, you can use this parameter to specify the mapping between JSON fields and stream attribute fields. The format is field1=json_field1;field2=json_field2.

field_delimiter

No

If encode is set to csv, you can use this parameter to specify the separator between CSV fields. By default, the comma (,) is used.

quote

No

Quote character used in the data. Attribute delimiters that appear between two quote characters are treated as ordinary characters.

  • If double quotation marks are used as the quote character, set this parameter to "\u005c\u0022" for character conversion.
  • If a single quotation mark is used as the quote character, set this parameter to a single quotation mark (').
NOTE:

After this parameter is specified, ensure that each field either contains no quote characters or contains an even number of them. Otherwise, parsing will fail.

timeindicator

No

Timestamp added in the source stream. The value can be processing time or event time.

NOTE:
  • If this parameter is set to processing time, the format is proctime.proctime.

    In this case, an attribute named proctime is added to the original attributes. For example, if the stream originally has three attributes, four attributes are exported after processing time is set. However, the number of attributes remains unchanged if the rowtime attribute is specified.

  • If this parameter is set to event time, you can select an attribute in the stream as the timestamp. The format is attr_name.rowtime.
  • Processing time and event time can be specified at the same time.

start_time

No

Start time when Kafka data is ingested.

If this parameter is specified, DLI reads data from the specified time onward. The parameter value is in the format of yyyy-MM-dd HH:mm:ss. Ensure that the value of start_time is not later than the current time. Otherwise, no data will be obtained.

kafka_properties

No

Native Kafka configuration properties. The format is key1=value1;key2=value2.

kafka_certificate_name

No

Specifies the name of the datasource authentication information. This parameter is valid only when the datasource authentication type is set to Kafka_SSL.

NOTE:
  • If this parameter is specified, the service loads only the file and password specified in the authentication information and automatically applies them to kafka_properties.
  • Other configuration information required for Kafka SSL authentication needs to be manually configured in the kafka_properties attribute.
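
To illustrate how several of the optional parameters in Table 1 fit together, the following is a minimal sketch of a source stream that reads CSV records. The stream name, attribute names, topic, group ID, delimiter, start time, and the max.poll.records entry are placeholders chosen for illustration and are not taken from this document. max.poll.records is a standard Kafka consumer property, and you should verify that it takes effect in your job.

```sql
-- Sketch only: every name and value below is a placeholder.
-- Reads records such as "A001|25" from the topic "orders",
-- starting from the given start_time.
CREATE SOURCE STREAM kafka_csv_source (
  order_id STRING,
  amount INT
)
  WITH (
    type = "kafka",
    kafka_bootstrap_servers = "ip1:port1,ip2:port2",
    kafka_group_id = "sourcegroup1",
    kafka_topic = "orders",
    encode = "csv",
    field_delimiter = "|",
    start_time = "2023-01-01 00:00:00",
    kafka_properties = "max.poll.records=500"
  );
```

Because encode is set to csv, field_delimiter is specified explicitly. If it were omitted, the comma (,) would be used by default.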

Precautions

The attribute used as the timestamp must be of the long or timestamp type.
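
As a sketch of this constraint, the following statement declares a hypothetical event_ts attribute of the LONG type and uses it as the event time, together with processing time. The stream name and the event_ts attribute are assumptions made for illustration. The TIMESTAMP BY clause follows the timeindicator grammar given in the Syntax section.

```sql
-- Sketch only: kafka_event_source and event_ts are hypothetical names.
-- event_ts is LONG, so it qualifies as the event-time attribute.
CREATE SOURCE STREAM kafka_event_source (
  name STRING,
  age INT,
  event_ts LONG
)
  WITH (
    type = "kafka",
    kafka_bootstrap_servers = "ip1:port1,ip2:port2",
    kafka_group_id = "sourcegroup1",
    kafka_topic = "test",
    encode = "json"
  )
  TIMESTAMP BY proctime.proctime, event_ts.rowtime;
```

According to the timeindicator description, specifying proctime.proctime adds a proctime attribute to the stream, while event_ts.rowtime reuses the existing attribute.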

Example

Read data from the Kafka topic test.

CREATE SOURCE STREAM kafka_source (
  name STRING,
  age INT
)
  WITH (
    type = "kafka",
    kafka_bootstrap_servers = "ip1:port1,ip2:port2",
    kafka_group_id = "sourcegroup1",
    kafka_topic = "test",
    encode = "json"
  );
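
If the JSON field names in the messages differ from the stream attribute names, json_config can map one to the other. The following is a sketch that assumes messages shaped like {"user_name": "Alice", "user_age": 30}. The JSON field names user_name and user_age and the stream name are hypothetical and used only for illustration.

```sql
-- Sketch only: user_name and user_age are assumed JSON field names.
-- json_config uses the documented format field1=json_field1;field2=json_field2.
CREATE SOURCE STREAM kafka_source_mapped (
  name STRING,
  age INT
)
  WITH (
    type = "kafka",
    kafka_bootstrap_servers = "ip1:port1,ip2:port2",
    kafka_group_id = "sourcegroup1",
    kafka_topic = "test",
    encode = "json",
    json_config = "name=user_name;age=user_age"
  );
```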