
Updated on 2024-10-17 GMT+08:00


Connecting Presto to OBS

Overview

Presto is available in two distributions: PrestoSQL (since renamed Trino) and PrestoDB.

Only PrestoSQL (Trino) can connect to OBS. The following example describes how to connect PrestoSQL 333 to OBS. PrestoSQL 332 and later require JDK 11.

Presto in this section refers to PrestoSQL (Trino).

Prerequisites

Hadoop has been installed. For details, see Connecting Hadoop to OBS.

Hive has been installed. For details, see Connecting Hive to OBS.

Installing the Presto Server

Version: PrestoSQL 333

1. Download the Presto client and server.
   - Presto client
   - Presto server
2. Download the hadoop-huaweicloud plug-in.
3. Decompress the Presto server package:

   tar -zxvf presto-server-333.tar.gz

4. Place the following JAR packages in the plugin/hive-hadoop2 directory under the Presto root directory:
   - hadoop-huaweicloud-${hadoop.version}-hw-${version}.jar
   - Apache commons-lang-xxx.jar
   You can download them from the Maven central repository or copy them from the Hadoop directory.
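
As a rough sketch of the JAR placement step, a small shell helper could copy both JARs into the plug-in directory and fail if either one is missing. The function name install_obs_jars and its two arguments (a source directory, for example a Hadoop library directory, and the Presto root directory) are hypothetical, and the file name patterns assume the JARs are named as listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Presto): copy the OBS plug-in JAR and the
# commons-lang JAR from a source directory into <presto root>/plugin/hive-hadoop2.
install_obs_jars() {
    src_dir=$1
    target="$2/plugin/hive-hadoop2"

    if [ ! -d "$target" ]; then
        echo "plugin directory not found: $target" >&2
        return 1
    fi

    for prefix in hadoop-huaweicloud commons-lang; do
        found=0
        for jar in "$src_dir"/"$prefix"-*.jar; do
            # An unmatched pattern stays literal, so skip anything that is not a file.
            [ -f "$jar" ] || continue
            cp "$jar" "$target/" || return 1
            found=1
        done
        if [ "$found" -eq 0 ]; then
            echo "no $prefix JAR found in $src_dir" >&2
            return 1
        fi
    done
    return 0
}
```

For example, install_obs_jars /path/to/jars /path/to/presto-server-333 would copy both JARs, where both paths are placeholders for your own directories.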

Configuring Presto

Create an etc directory inside the installation directory. Under etc, create the following configuration files:

- Node configuration file (etc/node.properties): environment configurations of each node
- JVM configuration file (etc/jvm.config): command line options for Java virtual machines (JVMs)
- Server configuration file (etc/config.properties): configurations of the Presto server
- Catalog configuration file (etc/catalog/hive.properties): configurations of different Presto connectors (data sources)
- Log configuration file (etc/log.properties): Presto log configurations

Node Configuration File

etc/node.properties is the node property file that contains configurations of each node. A node is a Presto instance. This file is typically created when Presto is first installed. The minimum configuration is as follows:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

Explanations:

node.environment: environment name. All nodes in a Presto cluster must have the same environment name.

node.id: the unique identifier of a node. Each node must have its own ID, and the ID must remain unchanged across reboots and upgrades of Presto.

node.data-dir: data directory. It is used by Presto to store logs and other data.

Example:

node.environment=presto_cluster
node.id=bigdata00
# The data directory must be created manually.
node.data-dir=/home/modules/presto-server-0.215/data
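
Because node.id has to be unique per node and stable over time, one possible approach is to generate the file once and never overwrite it. The sketch below is only an illustration: the function name write_node_properties and its arguments are hypothetical, and it assumes that either the uuidgen command or the Linux file /proc/sys/kernel/random/uuid is available.

```shell
#!/bin/sh
# Hypothetical helper: create etc/node.properties with a generated node.id,
# but leave an existing file untouched so that the ID stays stable.
write_node_properties() {
    etc_dir=$1
    environment=$2
    data_dir=$3
    file="$etc_dir/node.properties"

    # Keep the existing file (and its node.id) if there is one.
    if [ -f "$file" ]; then
        return 0
    fi

    # Assumption: uuidgen or the Linux kernel UUID source is available.
    if command -v uuidgen >/dev/null 2>&1; then
        node_id=$(uuidgen)
    else
        node_id=$(cat /proc/sys/kernel/random/uuid)
    fi
    [ -n "$node_id" ] || return 1

    # The data directory is not created by Presto, so create it here.
    mkdir -p "$etc_dir" "$data_dir" || return 1
    {
        echo "node.environment=$environment"
        echo "node.id=$node_id"
        echo "node.data-dir=$data_dir"
    } > "$file"
}
```

Running the helper a second time on the same node keeps the ID written by the first run.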

JVM Configuration File

etc/jvm.config is the JVM configuration file that contains command line options for starting JVMs. Each command line option is on a separate line. This file is not interpreted by the shell, so options containing spaces or special characters must not be quoted.

Reference configurations:

-server
-Xmx16G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+UseGCOverheadLimit
-XX:+HeapDumpOnOutOfMemoryError
-XX:ReservedCodeCacheSize=512M
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000

The parameters above are from the Presto official website. Tune them, in particular the heap size set by -Xmx, for your actual environment.

Server Configuration File

etc/config.properties is the property file that contains the configurations for the Presto server. Every Presto server can serve as both a coordinator and a worker. In large clusters, you are advised to dedicate one machine to coordination only, for better performance.

Configuration file of the coordinator node

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=5050
discovery-server.enabled=true
discovery.uri=http://192.168.XX.XX:5050
query.max-memory=20GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB

Configuration file of the worker node

coordinator=false
http-server.http.port=5050
discovery.uri=http://192.168.XX.XX:5050
query.max-memory=20GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB

Explanations:

coordinator: whether to run the instance as a coordinator, to receive queries from clients and manage query executions.

node-scheduler.include-coordinator: whether the coordinator also serves as a worker. For larger clusters, processing work on the coordinator can impact query performance.

http-server.http.port: HTTP port. Presto uses HTTP for all external and internal communications.

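query.max-memory: the maximum amount of distributed memory that a single query can use across the cluster.

query.max-memory-per-node: the maximum amount of user memory that a single query can use on any one node.

query.max-total-memory-per-node: the maximum amount of user and system memory that a single query can use on any one node. System memory is the memory used during execution by readers, writers, and network buffers.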

discovery-server.enabled: Presto uses the Discovery service to find all nodes in the cluster. The Presto coordinator has a built-in Discovery service, and each Presto instance will be registered with the Discovery service on startup. This way, the deployment can be simplified and no additional service is required.

discovery.uri: URI of the Discovery service. Because the coordinator runs the built-in Discovery service, set this to the host and port of the coordinator (192.168.XX.XX:5050 in the examples above). The URI must not end with a slash, or error 404 will be reported.

Additional properties:

jmx.rmiregistry.port: port of the JMX RMI registry. JMX clients connect to this port.

jmx.rmiserver.port: port of the JMX RMI server. Presto exports metrics over JMX that can be used for monitoring.
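
Since a trailing slash in discovery.uri only shows up later as a 404 error, it can be worth checking the file before starting the server. The following is a minimal sketch, not a Presto tool: the function name check_discovery_uri is hypothetical, and it assumes the key=value layout without spaces that is used in the examples above.

```shell
#!/bin/sh
# Hypothetical pre-start check: fail if discovery.uri is missing from the
# given config.properties file or ends with a slash.
check_discovery_uri() {
    config_file=$1

    # Assumes "discovery.uri=<value>" with no spaces around "=".
    # If the key appears more than once, the last occurrence is checked.
    uri=$(sed -n 's/^discovery\.uri=//p' "$config_file" | tail -n 1)

    case $uri in
        "")
            echo "discovery.uri is not set in $config_file" >&2
            return 1
            ;;
        */)
            echo "discovery.uri must not end with a slash: $uri" >&2
            return 1
            ;;
    esac
    return 0
}
```

A typical use would be check_discovery_uri etc/config.properties && bin/launcher start, run from the Presto root directory.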

Catalog Configuration File (Key)

Configure a Hive connector as follows:

1. Create a catalog directory under etc.
2. In etc/catalog, create the configuration file hive.properties for the Hive connector:

# hive.properties
# Connector name
connector.name=hive-hadoop2
# Hive metastore connection
hive.metastore.uri=thrift://192.168.XX.XX:9083
# Hadoop configuration files. The core-site.xml file listed here is expected to contain the OBS settings from Connecting Hadoop to OBS.
hive.config.resources=/home/modules/hadoop-2.8.3/etc/hadoop/core-site.xml,/home/modules/hadoop-2.8.3/etc/hadoop/hdfs-site.xml,/home/modules/hadoop-2.8.3/etc/hadoop/mapred-site.xml
# Grant the permission to drop tables.
hive.allow-drop-table=true
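
A wrong path in hive.config.resources is easy to overlook, so a quick check that every listed file exists can help. The sketch below is only an illustration under stated assumptions: the function name check_hive_config_resources is hypothetical, and it assumes a comma-separated list with no spaces, as in the example above.

```shell
#!/bin/sh
# Hypothetical check: verify that every file listed in hive.config.resources
# of the given catalog file exists on this node.
check_hive_config_resources() {
    catalog_file=$1

    # Assumes "hive.config.resources=<a>,<b>,..." with no spaces.
    resources=$(sed -n 's/^hive\.config\.resources=//p' "$catalog_file" | tail -n 1)
    if [ -z "$resources" ]; then
        echo "hive.config.resources is not set in $catalog_file" >&2
        return 1
    fi

    missing=0
    old_ifs=$IFS
    IFS=,
    # With IFS set to a comma, the unquoted expansion splits the list into paths.
    for path in $resources; do
        if [ ! -f "$path" ]; then
            echo "missing Hadoop configuration file: $path" >&2
            missing=1
        fi
    done
    IFS=$old_ifs

    [ "$missing" -eq 0 ]
}
```

Run it on each node, for example check_hive_config_resources etc/catalog/hive.properties, because the paths are resolved locally on every node.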

Log Configuration File

1. Create a log.properties file under etc.

2. Add the following line to set the minimum log level for the PrestoSQL loggers: io.prestosql=INFO

There are four log levels: DEBUG, INFO, WARN, and ERROR.

Starting Presto

The procedure is as follows:

1. Run hive --service metastore & to start the Hive metastore.
2. Run bin/launcher start to start the Presto server. To stop the Presto server, run bin/launcher stop.
3. Start the Presto client.
   1. Rename presto-cli-333-executable.jar to presto, place it in the bin directory, and run the chmod +x presto command to make it executable.
   2. Run ./presto --server XX.XX.XX.XX:5050 --catalog hive --schema default to start the client.

Using Presto to Query OBS

Creating a Hive table

     
hive> CREATE TABLE sample01(id int, name string, address string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'obs://obs-east-bkt001/sample01';

insert into sample01 values(1,'xiaoming','cd');
insert into sample01 values(2,'daming','sh');

Using Presto to query the Hive table

./presto --server XX.XX.XX.XX:5050 --catalog hive --schema default

     
presto:default> select * from sample01;
