Updated on 2023-05-06 GMT+08:00

Using Impala from Scratch

Impala is a massively parallel processing (MPP) SQL query engine for processing vast amounts of data stored in Hadoop clusters. It is an open source software written in C++ and Java. It provides high performance and low latency compared with other SQL engines for Hadoop.

Background

Suppose a user develops an application to manage users who use service A in an enterprise. The procedure of operating service A on the Impala client is as follows:

Operations on common tables:

  • Create the user_info table.
  • Add users' educational backgrounds and titles to the table.
  • Query user names and addresses by user ID.
  • Delete the user information table after service A ends.
Table 1 User information

No.

Name

Gender

Age

Address

12005000201

A

Male

19

City A

12005000202

B

Female

23

City B

12005000203

C

Male

26

City C

12005000204

D

Male

18

City D

12005000205

E

Female

21

City E

12005000206

F

Male

32

City F

12005000207

G

Female

29

City G

12005000208

H

Female

30

City H

12005000209

I

Male

26

City I

12005000210

J

Female

25

City J

Prerequisites

The client has been installed. For example, the client is installed in the /opt/hadoopclient directory. The client directory in the following operations is only an example. Change it to the actual installation directory.

Procedure

  1. Log in to the node where the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. Run the Impala client command to implement service A.

    • Operations on internal tables:

      Run the Impala client command impala-shell.

      By default, impala-shell attempts to connect to the Impala daemon on port 21000 of localhost. To connect to another host, use the -i < host:port > option, for example, impala-shell -i xxx.xxx.xxx.xxx:21000. To automatically connect to a specific Impala database, use the -d <database> option. For example, if all your Kudu tables are in the impala_kudu database, -d impala_kudu can use this database. To exit Impala Shell, run the quit command.

      1. Create the user_info user information table according to Table 1 and add data to it.
        create table user_info(id string,name string,gender string,age int,addr string);
        insert into table user_info(id,name,gender,age,addr) values("12005000201", "A", "Male", 19, "City A");

        ... (Other statements are the same.)

      2. Add users' educational backgrounds and titles to the user_info table.

        For example, to add educational background and title information about user 12005000201, run the following commands.

        alter table user_info add columns(education string,technical string);
      3. Query user names and addresses by user ID.

        For example, to query the name and address of user 12005000201, run the following command:

        select name,addr from user_info where id='12005000201';
      4. Delete the user information table:
        drop table user_info;
    • Operations on external partition tables:

      Create an external partition table and import data.

      1. Create a path for storing external table data.

        kinit hive (Run this command only in security mode.)

        The user must have the hive administrator permissions.

        hdfs dfs -mkdir /hive

        hdfs dfs -mkdir /hive/user_info

      2. Create a table.
        impala-shell

        By default, impala-shell attempts to connect to the Impala daemon on port 21000 of localhost. To connect to another host, use the -i < host:port > option, for example, impala-shell -i xxx.xxx.xxx.xxx:21000. To automatically connect to a specific Impala database, use the -d <database> option. For example, if all your Kudu tables are in the impala_kudu database, -d impala_kudu can use this database. To exit Impala Shell, run the quit command.

        create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';
        • fields terminated indicates delimiters, for example, spaces.
        • lines terminated indicates line breaks, for example, \n.
        • /hive/user_info indicates the path of the data file.
      3. Import data.
        1. Execute the insert statement to insert data.
          insert into user_info partition(year="2018") values ("12005000201", "A", "Male", 19, "City A");
        2. Run the load data command to import file data.
          1. Create a file based on the data in Table 1. For example, the file name is txt.log. Fields are separated by space, and the line feed characters are used as the line breaks.
          2. Upload the file to HDFS.

            hdfs dfs -put txt.log /tmp

          3. Load data to the table.

            load data inpath '/tmp/txt.log' into table user_info partition (year='2018');

      4. Query the imported data:
        select * from user_info;
      5. Delete the user information table:
        drop table user_info;