Updated on 2024-12-13 GMT+08:00

Regular Expressions

Overview

All regular expression functions use Java-style syntax, except in the following cases:

  • Use the multi-line mode (through (? m) flag enabling), only \n is identified as a line terminator. In addition, it does not support (? d) Flag. Therefore, it cannot be used.
  • In case-sensitive mode (through (? i) flag enabling), the unicode mode is always used. In addition, context-sensitive matching and local sensitive matching are not supported. In addition, it does not support (? u) flag.
  • The Surrogate Pair encoding mode is not supported. For example, \ uD800 \ uDC00 is not considered as U + 10000 and must be specified as \ x {10000}.
  • The boundary character (\b) cannot be handled correctly because it is a non-spaced marker without a base character.
  • \Q and \E are not supported in character classes (such as [A-Z123]). They are processed as text.
  • Unicode characters (\ p {prop}) are supported. The differences are as follows:
    • All underscores in the name must be deleted. For example, use OldItalic instead of Old_Italic.
    • You must specify a script without the prefix Is, script =, or sc =. Example: \p {Hiragana}
    • The In prefix must be used to specify a block. The prefix block = or blk = is not supported. Example: \p{Mongolian}
    • You must specify a category without the prefix Is, general_category =, or gc =. Example: \p{L}
    • The binary attribute must be specified directly, not Is. Example: \p{NoncharacterCodePoint}

Function

  • regexp_count(string, pattern) → bigint

    Description: Returns the number of pattern matches in a string.

    SELECT regexp_count('1a 2b 14m', '\s*[a-z]+\s*'); -- 3
  • regexp_extract_all(string, pattern) -> array(varchar)

    Description: Returns all matched substrings in array format.

    SELECT regexp_extract_all('1a 2b 14m','\d+');-- [1, 2, 14]
  • regexp_extract_all(string, pattern, group) -> array(varchar)

    Description: When the pattern contains multiple groups, group is used to return all substrings that meet the captured group conditions.

    SELECT regexp_extract_all('1a 2b 14m','(\d+)([a-z]+)',2);-- [a, b, m]
  • regexp_extract(string, pattern) → varchar

    Description: Returns the first substring that matches the regular expression pattern in a string.

    SELECT regexp_extract('1a 2b 14m','\d+');-- 1
  • regexp_extract(string, pattern, group) → varchar

    Description: When the pattern contains multiple groups, group is used to specify the first substring that meets captured group.

    SELECT regexp_extract('1a 2b 14m','(\d+)([a-z]+)',2);-- 'a'
  • regexp_like(string, pattern) → boolean

    Description: Checks whether a string contains substrings that meet the regular expression. If yes, true is returned.

    SELECT regexp_like('1a 2b 14m','\d+b');-- true
  • regexp_position(string, pattern) → integer

    Description: Returns the index that matches the pattern for the first time in a string. If no index is matched, returns -1.

    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges', '\b\d+\b'); -- 8
  • regexp_position(string, pattern, start) → integer

    Description: Returns the index of the item that matches the pattern for the first time starting from the start index (included). If no index is matched, returns -1.

    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges', '\b\d+\b', 5); -- 8
    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges', '\b\d+\b', 12); -- 19
  • regexp_position(string, pattern, start, occurrence) → integer

    Description: Returns the index of the item that matches the pattern for the occurrence time starting from the start index (included). If no index is matched, returns -1.

    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges','\b\d+\b',12,1);-- 19
    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges','\b\d+\b',12,2);-- 31
    SELECT regexp_position('I have 23 apples, 5 pears and 13 oranges','\b\d+\b',12,3);-- -1
  • regexp_replace(string, pattern) → varchar

    Description: Removes substrings that meet the regular expression from the target string.

    SELECT regexp_replace('1a 2b 14m','\d+[ab] ');-- '14m'
  • regexp_replace(string, pattern, replacement) → varchar

    Description: Replaces the substring that meets the regular expression in the target string with replacement. If the replacement contains the character $, use \$ to escape the character. During replacement, you can use $g to reference a capture group for a numbered group and ${name} to reference a capture group for a named group.

    SELECT regexp_replace('1a 2b 14m','(\d+)([ab]) ','3c$2 ');-- '3ca 3cb 14m'
  • regexp_replace(string, pattern, function) → varchar

    Description: Replaces each instance of the substring that matches the regular expression pattern in the string with function. For each match, the captured group passed as an array calls the lambda expression function. The capture group ID starts from 1. The entire match is not grouped (brackets enclose the entire expression if necessary).

    SELECT regexp_replace('new york','(\w)(\w*)',x->upper(x[1])||lower(x[2]));--'New York'
  • regexp_split(string, pattern) -> array(varchar)

    Description: Splits a string using the regular expression pattern and returns an array. The following empty character string is reserved:

    SELECT regexp_split('1a 2b 14m','\s*[a-z]+\s*');-- [1, 2, 14, ]