Optimizing Regular Expressions for Faster Log Parsing

Regular expressions are widely used to extract and parse log fields in log ingestion, search, analysis, and alarm reporting. Performance testing reveals significant variations in parsing efficiency across different regular expressions. To improve log parsing efficiency, refine the matching rules, tighten the quantifiers, and narrow the matching range.

Performance Test Data

The tests use the following example log. It has a typical structure: a timestamp [YYYY-MM-DD_HH:MM:SS], a log level [LEVEL], a module name [MODULE], and service data.
```
Test log is: [2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - m9nTW7s4YVSh1ImDMV2y;51+0FSNo5
```

The performance test results are as follows.

Regular Expression	Time for 10 Million Matches	Performance Improvement Ratio	Description
\[.*\]	5.0614793s	Reference value	Greedy mode. It matches [2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] and takes a long time.
^\[.*\]	5.0501595s	0.22%	Adds a start limit.
\[\S*\]	1.8859162s	62.7%	Matches non-whitespace characters, that is, [2025-06-05_10:04:36].
^\[\S*\]	1.8838008s	62.8%	Adds a start limit.
^\[\d*-\d*-\d*_\d*:\d*:\d*\]	1.4906888s	70.6%	Specifies a digit format.
^\[\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\]	0.8516531s	83.2%	Adds a digit length limit.
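
A comparison like the one in the table can be approximated with a small timing harness. The following is a minimal sketch in Python, not the tool that produced the figures above; absolute times will differ by engine and hardware. It assumes the log line starts at the first bracket, and the names `LOG`, `PATTERNS`, `first_match`, and `benchmark` are hypothetical.

```python
import re
import timeit

# Assumption: the log line starts at the first bracket (no leading prefix).
LOG = "[2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - m9nTW7s4YVSh1ImDMV2y;51+0FSNo5"

# Three of the patterns from the table, from loosest to strictest.
PATTERNS = {
    "greedy": r"\[.*\]",
    "non_whitespace": r"\[\S*\]",
    "fixed_length": r"^\[\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\]",
}

def first_match(pattern, line=LOG):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

def benchmark(iterations=100_000):
    """Time `iterations` searches per pattern; returns seconds keyed by name."""
    results = {}
    for name, pattern in PATTERNS.items():
        search = re.compile(pattern).search  # compile once, outside the timed loop
        results[name] = timeit.timeit(lambda: search(LOG), number=iterations)
    return results

if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:15s} {seconds:.4f}s  ->  {first_match(PATTERNS[name])}")
```

Printing the matched text next to each timing makes it easy to confirm that a faster pattern still extracts the intended field.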

Conclusion:
- Replacing .* with \S* improves performance by over 60%, which shows the value of precise character definition.
- Adding the anchor ^ improves performance only marginally (0.22%), but it is critical for multi-line matching.
- Replacing \d* with a fixed-length quantifier such as \d{4} improves performance further (from 70.6% to 83.2% over the baseline), which confirms the effectiveness of precise quantifiers.

Regular Expression Optimization Suggestions

Regular expression performance is critical to the efficiency of a data collection system. Precise character definitions, appropriate quantifiers, systematic debugging, and advanced techniques can significantly reduce the matching work the engine has to do and improve collection efficiency.

Precisely define matching characters to narrow down the search range and reduce backtracking.
- Replace general matching with precise character classes. For example, use [a-zA-Z] for letters, \d or [0-9] for digits, and \D or [^0-9] for non-digits.
- For timestamps, use \d to represent the year, month, day, hour, minute, and second, as they are digits.
- The delimiters - and _ are fixed characters and can be included literally in the pattern.
- Use \S (equivalent to [^\s]) to match non-whitespace characters.
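
The suggestions above can be combined into one pattern for the example log. This is a minimal sketch in Python, assuming the line starts at the first bracket and that fields are separated by single spaces; the group names and the `parse_line` helper are hypothetical.

```python
import re

# Hypothetical group names; each field uses the narrowest class that fits.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\]"  # digits plus literal - _ :
    r" \[(?P<level>[A-Z]+)\]"      # uppercase letters only
    r" \[(?P<module>[^\]\s]+)\]"   # anything except ] and whitespace
    r" - (?P<data>\S+)$"           # non-whitespace service data
)

def parse_line(line):
    """Return the extracted fields as a dict, or None if the line does not fit."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

sample = "[2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - m9nTW7s4YVSh1ImDMV2y;51+0FSNo5"
fields = parse_line(sample)
print(fields)
```

Because every class excludes the character that ends its field, the engine rarely needs to backtrack, and malformed lines are rejected early.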
Properly use quantifiers to balance matching accuracy and efficiency.
- Use + instead of * wherever at least one character is required. Because + cannot match an empty string, it avoids unnecessary checks and rejects empty fields.
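
The difference between the two quantifiers is easiest to see on an empty field. A minimal sketch in Python, with hypothetical names `STAR`, `PLUS`, and `matches`:

```python
import re

STAR = re.compile(r"\[\d*\]")  # zero or more digits: also accepts "[]"
PLUS = re.compile(r"\[\d+\]")  # one or more digits: rejects "[]"

def matches(pattern, text):
    """True if the compiled pattern matches anywhere in text."""
    return pattern.search(text) is not None

print(matches(STAR, "[]"), matches(PLUS, "[]"))      # empty brackets
print(matches(STAR, "[42]"), matches(PLUS, "[42]"))  # brackets with digits
```

Both patterns accept `[42]`, but only the `*` variant accepts the empty `[]`, which is rarely what a log field extractor wants.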
Use anchors (^ and $) to specify exact start and end positions, narrowing the matching range and improving efficiency.
- Avoid combining the anchors ^ and $ with .*, as in ^.*pattern. The engine still has to scan the whole line, so the anchor does not improve performance.
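
The effect of the ^ anchor on multi-line input can be sketched as follows. This Python example assumes each log record starts on its own line; `re.MULTILINE` makes ^ match at the start of every line. The sample text and the `timestamps` helper are hypothetical.

```python
import re

# With re.MULTILINE, ^ matches at the start of each line, not only of the string.
LINE_START_TS = re.compile(
    r"^\[\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\]", re.MULTILINE
)

def timestamps(text):
    """Return the bracketed timestamps that open a line."""
    return LINE_START_TS.findall(text)

text = (
    "[2025-06-05_10:04:36] [WARNING] [MODULE_a] - retry after [2025-06-05_10:04:30]\n"
    "[2025-06-05_10:04:37] [INFO] [MODULE_b] - ok"
)
print(timestamps(text))
```

The timestamp embedded in the first line's payload is skipped because it does not sit at a line start; without the anchor, it would be returned as a third match.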
Use the word boundary \b to extract a specific log level, such as the WARNING in [WARNING].
- Inefficient: \[.*?\]
- Efficient: \bWARNING\b. The word boundaries prevent matches inside longer words such as WARNINGLY.
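
The word-boundary behavior can be checked with a short Python sketch; the `has_warning` helper and the sample lines are hypothetical.

```python
import re

LEVEL = re.compile(r"\bWARNING\b")

def has_warning(line):
    """True if WARNING appears as a whole word in the line."""
    return LEVEL.search(line) is not None

print(has_warning("[2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - data"))
print(has_warning("[2025-06-05_10:04:36] [INFO] [MODULE_ecdio] - WARNINGLY"))
print(has_warning("[2025-06-05_10:04:36] [INFO] [MODULE_ecdio] - WARNING_COUNT=3"))
```

Only the first line matches. Note that the underscore counts as a word character, so `WARNING_COUNT` is also rejected; brackets, spaces, and punctuation do form a boundary.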