Truncating a String
Truncation strings is a key technical capability in multidimensional analysis of data warehouses.
Typical applications:
- Extracting core information: For example, extract the core serial number 0129A from EC2023-BJ-0129A, and use substring() to capture the administrative division code and birth date in a 18-digit ID card code.
- Masking data for privacy protection: For example, use right() to extract the last four digits of a mobile number for data masking
- Extracting information accurately: For example, use regexp_matches to match multiple character strings starting with the letter b.
- Clearing redundant data: For example, clear the redundant spaces or null characters before and after an order No.
DWS provides multiple functions for you to extract strings.
|
Type |
Description |
Function |
Example |
Usage Difference |
||
|---|---|---|---|---|---|---|
|
Remove characters at both ends. |
Remove the specified characters or spaces at both ends of a string. |
|
- |
|||
|
Extract a substring, in bytes. |
Extract a substring from a given string, where the starting position and length are specified in bytes. |
|
These four functions work similarly but differ in two ways: what they extract (either the full remaining parts or just specific middle bytes) and their input parameters (whether they accept text-only or all string types). |
|||
|
Extract a substring from a given string, where the starting position and length are specified in bytes. It and the above function work similarly, but the difference is that the input parameter type of the function is text |
- |
|||||
|
Extract a substring from a given string, where the starting position is specified in bytes. If from is a negative number, the starting position is counted backward from the end of the string. (It means that the last n bytes are extracted.) |
|
|||||
|
Extract a substring from a given string, where the starting position is specified in bytes. They work similarly, but the difference is that the input parameter type of the function is text. |
|
|||||
|
Extract a substring, in characters. |
Extract a substring from a given string, where the starting position and length are specified in characters. |
|
substr(string,from,count) and substring(string [from int] [for int]) work similarly but differ in two ways: what they extract (either the full remaining parts or just specific middle characters) and their SQL syntax (the former is simpler and the latter is standard SQL). |
|||
|
Extract a substring from a given string, where the starting position is specified in characters. |
|
|||||
|
Extract a substring from a given string, where the starting position and length are specified in characters. It and substr(string,from,count) work similarly. |
- |
|||||
|
Extract the first several characters. |
- |
- |
||||
|
Extract the last several characters. |
- |
- |
||||
|
Extract substrings by regular expression pattern. |
Extract a portion of a string that matches a specified regular expression pattern. |
Click the link on the left to view the details. |
regexp_substr and substring are similar. substring is a standard SQL function, while regexp_substr is an extended function with stronger regular expression capabilities. regexp_matches returns text arrays, which is suitable for extracting group information. |
|||
|
Extract a portion of a string that matches a specified regular expression pattern. The escape character can be set. |
||||||
|
It is similar to the preceding function, but has a strong regular expression capability. |
||||||
|
It is similar to the preceding function, but has a strong regular expression capability. |
||||||
|
Extract substrings and return text arrays. It works well when you need to match multiple conditions or extract group data. |
btrim(string text [, characters text])
Description: Removes the specified characters from both ends of a string. Only the characters at both ends are processed. The characters in the middle will not be removed even if they match.
Return type: text
Parameters:
- string: (mandatory) target string.
- characters: (optional) characters to be removed from both ends of the target string. If it is not specified, all whitespace characters are removed by default.
Example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
SELECT btrim('sring' , 'ing'); btrim ------- sr (1 row) SELECT btrim(' sting '); -- If characters are not specified, all spaces are deleted by default. btrim ------- sting (1 row) SELECT btrim('xxdataxxbasexx','x'); -- Only the x at both ends is deleted. The x in the middle is not deleted. btrim ------- dataxxbase (1 row) |
substrb(string,from,count)
Description: Truncates a substring of a specified length from a string. from specifies the position where the extraction starts. count specifies the length of the extracted substring.
It and substr(string,from,count) work similarly, but its calculation unit is byte.
- If from starts at 0, it is typically treated as 1.
- If from is a positive value, extract count bytes starting from from.
- If from is a negative value, extract the last n count bytes in the string, in which n indicates the absolute value of from.
- When count is less than 1, null is returned.
Return type: bytea
Example:
1 2 3 4 5 |
SELECT substrb('ABCDEF',2,2); substrb --------- BC (1 row) |
substrb(text,int,int)
Description: Extracts substrings from a string. The first int indicates the start byte position, and the second int indicates the number of bytes to be extracted.
Return type: text
Example:
1 2 3 4 5 |
SELECT substrb('string',2,3); ---Extracts three bytes starting from the second byte t. substrb --------- tri (1 row) |
substrb(string,from)
Description: Truncates a specified substring from a string. from indicates the starting position. It and substr(string,from) work similarly, but its calculation unit is byte.
- If from starts at 0, it is typically treated as 1.
- If from is a positive value, all bytes from from to the end are extracted.
- If from is a negative value, the last n bytes in the string are extracted, in which n indicates the absolute value of from.
Return type: bytea
Example:
1 2 3 4 5 |
SELECT substrb('ABCDEF',-2); substrb --------- EF (1 row) |
substrb(text,int)
Description: Extracts a substring from a string. int indicates the start byte position of the subtraction.
Return type: text
Example:
1 2 3 4 5 |
SELECT substrb('string',2); substrb --------- tring (1 row) |
substr(string,from,count)
Description: Truncates a substring of a specified length from a string.
from specifies the position where the extraction starts. count specifies the length of the extracted substring.
- If from starts at 0, it is typically treated as 1.
- If from is a positive value, extract count characters starting from from.
- If from is a negative value, extract the last n count characters in the string, in which n indicates the absolute value of from.
- When count is less than 1, null is returned.
Return type: varchar
Example:
When from is 0, extract the first four characters from the string database.
1 2 3 4 5 |
SELECT substr('database',0,4); substr -------- data (1 row) |
When from is a positive number, extract the first four characters from the fifth character of the string database.
1 2 3 4 5 |
SELECT substr('database',5,4); substr -------- base (1 row) |
If from is a negative number, truncate three characters starting from the last fourth character of the string database.
1 2 3 4 5 |
SELECT substr('database',-4,3); substr -------- bas (1 row) |
substr(string,from)
Description: Truncates a specified substring from a string.
from indicates the start position of the substring.
- If from starts at 0, it is typically treated as 1.
- If from is a positive value, all characters from from to the end are extracted.
- If from is a negative value, the last n characters in the string are extracted, in which n indicates the absolute value of from.
Return type: varchar
Example:
When from starts at 0, extract all characters from the first character to the end of the string database.
1 2 3 4 5 |
SELECT substr('database',0); substr ---------- database (1 row) |
When from is a positive value, extract all characters from the fifth character to the end of the string database.
1 2 3 4 5 |
SELECT substr('database',5); substr -------- base (1 row) |
When from is a negative value, extract all characters from the fourth character from the end to the end of the string database.
1 2 3 4 5 |
SELECT substr('database',-4); substr -------- base (1 row) |
substring(string [from int] [for int])
Description: Extracts a substring. from int indicates the start position of the truncation. for int indicates the number of characters truncated. It and substr(string,from,count) work similarly.
Return type: text
Example:
Extract three characters starting from the second character h in the string Thomas, that is, hom.
1 2 3 4 5 |
SELECT substring('Thomas' from 2 for 3); substring ----------- hom (1 row) |
left(str text, n int)
Description: Returns first n characters in a string.
- In the ORA- or TD-compatible mode, all but the last |n| characters are returned if n is a negative value.
- In the MySQL-compatible mode, an empty string is returned if n is a negative value.
Return type: text
Example:
1 2 3 4 5 |
SELECT left('database', 4); left ------ data (1 row) |
right(str text, n int)
Description: Returns the last n characters in a string.
- In the ORA- or TD-compatible mode, all but the first |n| characters are returned if n is a negative value.
- In the MySQL-compatible mode, an empty string is returned if n is a negative value.
Return type: text
Example:
1 2 3 4 5 6 7 8 9 10 11 |
SELECT right('abcde', 2); right ------- de (1 row) SELECT right('abcde', -2); right ------- cde (1 row) |
substring(string from pattern)
Description: Extracts a substring from a string that matches a specified pattern, which is a POSIX regular expression. string indicates the original string, and pattern indicates the pattern to be matched. If no match is found, NULL is returned. If a match is found, the matched part is returned.
Return type: text
Example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
SELECT substring('Thomas' from '...$'); --- Match the last three consecutive characters. substring ----------- mas (1 row) SELECT substring('Thomas' from '^...'); --- Match the first three consecutive characters. substring ----------- Tho (1 row) SELECT substring('Tom' from '....$'); --- Match the last four consecutive characters. Tom has only three characters and cannot match the condition. Null is returned. substring ----------- (1 row) SELECT substring('order_20251120_12345' from '_([0-9]+)_'); --- Obtain the order No. between two underscores (_). substring ----------- 20251120 (1 row) SELECT substring('tel:13812345678,status:valid' from 'tel:([0-9]{11})'); --- Extract the mobile number information. substring ----------- 13812345678 (1 row) SELECT substring('foobar' from 'o(.)b'); -- Match the o(.)b rule. o indicates that the first letter o is matched, (.) indicates that any character is matched and captured, and b indicates that the letter b is matched. In foobar, oob is the matched substring. (.) is the middle character (o) of the substring. substring -------- o (1 row) SELECT substring('foobar' from '(o(.)b)'); -- Use nested capture rules. The outer group captures oob, while the inner group captures the middle o. The function returns only the first captured group, so the result is oob. substring -------- oob (1 row) |
If the POSIX pattern contains any parentheses, the portion of the text that matched the first parenthesized sub-expression (the one whose left parenthesis comes first) is returned. If you want to use parentheses in an expression without causing this exception, you can add a pair of parentheses outside the entire expression.
substring(string from pattern for escape)
Description: Extracts the substring that matches the SQL regular expression. string indicates the original string, pattern indicates the pattern tha the string must match, and escape indicates the escape character, which is specified by for. If no pattern is matched, null is returned.
To identify the specific portion of the pattern that should be returned upon a successful match, the pattern must contain two occurrences of the designated escape character, each followed by a double quotation mark ("). The text in the data string that matches the part of the pattern between these two markers is the substring returned by the function.
Return type: text
Example:
1 2 3 4 5 |
SELECT substring('Thomas' from '%#"o_a#"_' for '#'); substring ----------- oma (1 row) |
Example explanations:
- %#"o_a#"_ is the pattern, and # is the escape character.
- % indicates the wildcard that matches any number of any prefix. In this example, the prefix Tho is matched.
- #" is a marker used to match the text between two #", that is, o_a.
- The last underscore (_) indicates that the suffix character s is matched.
In conclusion, the pattern body %#"o_a#"_ is expanded to % o_a _, which means that the pattern body matches "any prefix" + o + "single any character" + a + "single any character".
If the string Thomas needs to match this pattern, that is, only the text between the two number signs (#) is extracted, that is, o_a, where _ can indicate any character. The result is oma.
Example:
Extract the four digits after FX. (The express delivery number is FX589220251124, which is in the format of FX + 4-digit number + 8-digit timestamp.)
1 2 3 4 5 6 7 8 9 10 11 |
SELECT substring('FX589220251124' from 'FX#"____#"%' for '#'); substring ----------- 5892 (1 row) SELECT substring('FX564320241024' from 'FX#"____#"%' for '#'); substring ----------- 5643 (1 row) |
regexp_substr(text,text)
Description: Extracts substrings from a regular expression. The first text indicates the original string to be processed, and the second text indicates the defined regular expression. It is similar to substr. When a regular expression contains multiple parallel brackets, all of them are processed.
Return type: text
Example:
1 2 3 4 5 |
SELECT regexp_substr('str','[ac]'); --- Match any character in ac from str and return null. regexp_substr --------------- (1 row) |
Example: Obtain the first 10 digits from the express number. The other digits after the first 10 digits are timestamps and can be ignored.
1 2 3 4 5 |
SELECT regexp_substr('FX58920235822025112404', '\d{10}'); --Search for the first occurrence of 10 consecutive digits in the string by scanning it from left to right. regexp_substr --------------- 5892023582 (1 row) |
regexp_substr(source_char, pattern)
Description: Extracts substrings from a regular expression. source_char indicates the original string, and pattern indicates the defined regular expression.
Return type: varchar
Example:
1 2 3 4 5 6 |
-- Find the part of the string that starts with a comma, has at least one non-comma character, and ends with another comma. SELECT regexp_substr('500 Hello World, Redwood Shores, CA', ',[^,]+,') "REGEXPR_SUBSTR"; REGEXPR_SUBSTR ------------------- , Redwood Shores, (1 row) |
regexp_matches(string text, pattern text [, flags text])
Description: Returns all substrings that match the POSIX regular expression. The returned result is a text array. If the pattern does not match, the function returns no rows. This function is flexible in processing complex text parsing, especially when you need to match multiple conditions or extract group data.
- string indicates the original string.
- pattern indicates the pattern the string must match. The pattern can contain multiple sub-expressions in parentheses, for example, (bar)(beque). If the pattern is matched, a text array with multiple elements is returned, for example, {bar,beque}. If the pattern does not contain sub-expressions in parentheses, a text array with a single element is returned, for example, {barbeque}.
- flags is an optional parameter. If flags is set to i, the pattern matching is case insensitive. If flags is set to g, it indicates that the function finds all matches of the pattern within the string, not just the first one.
If the last parameter flags is provided but the parameter value is an empty string ('') and the SQL compatibility mode of the database is set to ORA, the returned result is null. To resolve this problem, you can:
- Change the database SQL compatibility mode to TD.
- Do not provide the last parameter or do not set the last parameter to an empty string.
Return type: multiple lines of text as an array (setof text[]).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 |
-- The rule has two parts: bar and beque. The result is a text array containing two elements. SELECT regexp_matches('foobarbequebaz', '(bar)(beque)'); regexp_matches ---------------- {bar,beque} (1 row) -- The rule does not contain parentheses. The result is a text array that contains only one element. SELECT regexp_matches('foobarbequebaz', 'barbeque'); regexp_matches ---------------- {barbeque} (1 row) -- The rule matches strings starting with 'b' followed by one or more non-'b' characters. It splits the match into two groups: bar and beque. Since g is not included, only the first line matches. SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)'); regexp_matches -------------- {bar,beque} (1 rows) -- Compared with the preceding example, if g is provided, the matching continues and the second line of the text array is displayed. SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g'); regexp_matches -------------- {bar,beque} {bazil,barf} (2 rows) -- The rule extracts all 11-digit mobile numbers from the text and causes the function to find all matches of the pattern within the string. All lines are returned. SELECT regexp_matches('13812345678,13987654321', '\d{11}', 'g'); regexp_matches ---------------- {13812345678} {13987654321} (2 rows) -- The rule adds i to make the pattern matching case-insensitive. SELECT regexp_matches('fooBarBeQuebaz', 'barbeque', 'i'); regexp_matches ---------------- {BarBeQue} (1 row) -- The rule does not contain i, so the pattern matching is case sensitive and null is returned. SELECT regexp_matches('fooBarBeQuebaz', 'barbeque'); regexp_matches ---------------- (0 rows) |
When a subquery is absent, and the regexp_matches function fails to find a match, the table data remains hidden, which is contrary to what is usually intended. To avoid this issue, employ the regexp_substr(text,text) function, which can deliver the expected results even without a subquery.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
SELECT * FROM tab; c1 | c2 -----+----- dws | dws (1 row) SELECT c1, regexp_matches(c2, '(bar)(beque)') FROM tab; c1 | regexp_matches ----+---------------- (0 rows) SELECT c1, c2, (SELECT regexp_matches(c2, '(bar)(beque)')) FROM tab; c1 | c2 | regexp_matches -----+-----+---------------- dws | dws | (1 row) |
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot