Using an Impala Client

Impala is a massively parallel processing (MPP) SQL query engine for large volumes of data stored in Hadoop clusters. It is open source software written in C++ and Java, and it offers higher performance and lower latency than other SQL engines for Hadoop.

Background

Suppose you develop an application that manages the users of service A in an enterprise. Service A is operated on the Impala client as follows:

Operations on internal tables:

1. Create the user_info table.
2. Add columns for users' educational backgrounds and titles to the table.
3. Query user names and addresses by user ID.
4. Delete the user information table after service A ends.

**Table 1** User information

| No. | Name | Gender | Age | Address |
|---|---|---|---|---|
| 12005000201 | A | Male | 19 | City A |
| 12005000202 | B | Female | 23 | City B |
| 12005000203 | C | Male | 26 | City C |
| 12005000204 | D | Male | 18 | City D |
| 12005000205 | E | Female | 21 | City E |
| 12005000206 | F | Male | 32 | City F |
| 12005000207 | G | Female | 29 | City G |
| 12005000208 | H | Female | 30 | City H |
| 12005000209 | I | Male | 26 | City I |
| 12005000210 | J | Female | 25 | City J |

Prerequisites

The client has been installed. The following operations use /opt/hadoopclient as an example installation directory. Replace it with the actual installation directory.

Procedure

1. Log in to the node where the client is installed as the client installation user.
2. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient
3. Run the following command to configure environment variables:

    source bigdata_env
4. Run the Impala client command to implement service A:

    impala-shell

By default, impala-shell attempts to connect to the Impala daemon on port 21000 of localhost. To connect to another host, use the -i <host:port> option, for example, impala-shell -i xxx.xxx.xxx.xxx:21000. To connect to a specific Impala database automatically, use the -d <database> option. For example, if all your Kudu tables are in the impala_kudu database, specify -d impala_kudu to use that database. To exit impala-shell, run the quit command.
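
As a rough sketch of how these options combine, the following snippet assembles an impala-shell command line from a host, a port, and a database, and prints it instead of running it. The host localhost, the port 21000, and the database impala_kudu are taken from the paragraph above. The helper name build_impala_cmd is hypothetical and is not part of the client.

```shell
# Connection settings; replace them with the values for your cluster.
impalad_host="localhost"
impalad_port="21000"
impala_db="impala_kudu"

# build_impala_cmd is a hypothetical helper: it only formats the command line
# from a host, a port, and a database name.
build_impala_cmd() {
  printf 'impala-shell -i %s:%s -d %s' "$1" "$2" "$3"
}

cmd="$(build_impala_cmd "$impalad_host" "$impalad_port" "$impala_db")"

# Print the command for review; run it yourself once the values are correct.
echo "$cmd"
```

Printing the command first makes it easy to check the host and database before opening a session.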

Operations on internal tables:
1. Create the user_info user information table according to Table 1 and add data to it.
```
create table user_info(id string,name string,gender string,age int,addr string);
insert into table user_info(id,name,gender,age,addr) values("12005000201", "A", "Male", 19, "City A");
```
  ... (Insert the remaining rows of Table 1 in the same way.)
2. Add columns for users' educational backgrounds and titles to the user_info table.
  The following command adds the education and technical (title) columns to the table, so that this information can be stored for users such as user 12005000201:
```
alter table user_info add columns(education string,technical string);
```
3. Query user names and addresses by user ID.
  For example, to query the name and address of user 12005000201, run the following command:
```
select name,addr from user_info where id='12005000201';
```
4. Delete the user information table:
```
drop table user_info;
```
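
Step 1 above shows the insert statement for the first row only. The following sketch, under stated assumptions, generates the create statement and one insert statement for each row of Table 1 and writes them to a file. The file name user_info.sql is hypothetical, and the statements follow the form used in step 1.

```shell
# Hypothetical output file for the generated statements.
sql_file="user_info.sql"

# CREATE TABLE statement copied from step 1.
echo 'create table user_info(id string,name string,gender string,age int,addr string);' > "$sql_file"

# Each input line holds one row of Table 1 as id|name|gender|age|addr.
while IFS='|' read -r id name gender age addr; do
  printf 'insert into table user_info(id,name,gender,age,addr) values("%s", "%s", "%s", %s, "%s");\n' \
    "$id" "$name" "$gender" "$age" "$addr" >> "$sql_file"
done <<'EOF'
12005000201|A|Male|19|City A
12005000202|B|Female|23|City B
12005000203|C|Male|26|City C
12005000204|D|Male|18|City D
12005000205|E|Female|21|City E
12005000206|F|Male|32|City F
12005000207|G|Female|29|City G
12005000208|H|Female|30|City H
12005000209|I|Male|26|City I
12005000210|J|Female|25|City J
EOF

# Show the generated statements for review.
cat "$sql_file"
```

impala-shell accepts a query file through its -f option, so the generated file could then be run with impala-shell -f user_info.sql after the environment variables are configured.
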
Operations on external partitioned tables:

Create an external partitioned table and import data.
1. Create a path for storing external table data.
  - Security mode (Kerberos authentication is enabled for clusters)
    cd /opt/hadoopclient
    
    source bigdata_env
    
    kinit hive
    
    The user must have the hive administrator permissions.
    
    impala-shell
    
    hdfs dfs -mkdir /hive
    
    hdfs dfs -mkdir /hive/user_info
  - Normal mode (Kerberos authentication is disabled for clusters)
    su - omm
    
    cd /opt/hadoopclient
    
    source bigdata_env
    
    impala-shell
    
    hdfs dfs -mkdir /hive
    
    hdfs dfs -mkdir /hive/user_info
2. Create a table.
```
create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';
```
  fields terminated by specifies the field delimiter, which is a space in this example.
  
  lines terminated by specifies the line break character, which is \n in this example.
  
  /hive/user_info is the path where the data files are stored.
3. Import data.
  1. Execute the insert statement to insert data.
```
insert into user_info partition(year="2018") values ("12005000201", "A", "Male", 19, "City A");
```
  2. Run the load data command to import file data.
    1. Create a file based on the data in Table 1, for example, txt.log. Separate fields with a space and end each line with a line feed. Because the space is the field delimiter, a field value must not itself contain a space.
    2. Upload the file to HDFS.
      hdfs dfs -put txt.log /tmp
    3. Load data to the table.
      load data inpath '/tmp/txt.log' into table user_info partition (year='2018');
4. Query the imported data:
```
select * from user_info;
```
5. Delete the user information table:
```
drop table user_info;
```
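
The data file described in step 3 can be sketched as follows. This snippet writes the rows of Table 1 to txt.log with a single space between fields and then checks that every line has five fields, one for each non-partition column of user_info. As an assumption made for this sketch only, each address is written with an underscore (City_A rather than City A), because a space inside a value would be read as a field delimiter.

```shell
# Data file name taken from step 3.
data_file="txt.log"

# One record per line, fields separated by a single space.
# Assumption: addresses use "_" instead of a space so they stay one field.
cat > "$data_file" <<'EOF'
12005000201 A Male 19 City_A
12005000202 B Female 23 City_B
12005000203 C Male 26 City_C
12005000204 D Male 18 City_D
12005000205 E Female 21 City_E
12005000206 F Male 32 City_F
12005000207 G Female 29 City_G
12005000208 H Female 30 City_H
12005000209 I Male 26 City_I
12005000210 J Female 25 City_J
EOF

# Count the lines whose number of whitespace-separated fields is not five.
bad_lines="$(awk 'NF != 5 { n++ } END { print n + 0 }' "$data_file")"
echo "lines with an unexpected field count: $bad_lines"
```

If the count is zero, the file can be uploaded and loaded with the hdfs dfs -put and load data commands shown in step 3.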