Optimizing Regular Expressions for Faster Log Parsing
Regular expressions are widely used to extract and parse log fields in log ingestion, search, analysis, and alarm reporting. Performance testing reveals significant variations in parsing efficiency across different regular expressions. To improve log parsing efficiency, you can optimize regular expressions by refining matching rules, optimizing quantifiers, and narrowing the matching range.
Performance Test Data
- Test the following example log. This log contains a typical structure: timestamp [YYYY-MM-DD_HH:MM:SS], log level [LEVEL], module name [MODULE], and service data.
Test log: [2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - m9nTW7s4YVSh1ImDMV2y;51+0FSNo5
- Analyze the performance test results.
| Regular Expression | Time for 10 Million Matches | Performance Improvement Ratio | Description |
| --- | --- | --- | --- |
| \[.*\] | 5.0614793s | Reference value | Greedy mode. It matches [2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] and takes a long time. |
| ^\[.*\] | 5.0501595s | 0.22% | Adds a start anchor (^). |
| \[\S*\] | 1.8859162s | 62.7% | Matches non-whitespace characters, that is, [2025-06-05_10:04:36]. |
| ^\[\S*\] | 1.8838008s | 62.8% | Adds a start anchor (^). |
| ^\[\d*-\d*-\d*_\d*:\d*:\d*\] | 1.4906888s | 70.6% | Specifies a digit format. |
| ^\[\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\] | 0.8516531s | 83.2% | Adds a digit length limit. |
- Conclusion:
- Replacing .* with \S* improves performance by over 60%, which shows the importance of precise character definition.
- Adding the anchor ^ improves performance only slightly (0.22%), but it is critical for multi-line matching.
- Replacing \d* with \d{Fixed length} further improves performance, verifying the effectiveness of precise quantifiers.
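The comparison above can be reproduced in outline with a small timing script. The following is a minimal sketch in Python, assuming the standard `re` and `timeit` modules; the source does not state which regex engine or language produced the figures in the table, so absolute timings and even the relative gaps may differ. The helper names `first_match` and `time_pattern` are illustrative and do not come from the original test.

```python
import re
import timeit

# Sample line taken from the test log above.
LOG_LINE = (
    "[2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - "
    "m9nTW7s4YVSh1ImDMV2y;51+0FSNo5"
)

# The six patterns from the table, in the same order.
PATTERNS = [
    r"\[.*\]",
    r"^\[.*\]",
    r"\[\S*\]",
    r"^\[\S*\]",
    r"^\[\d*-\d*-\d*_\d*:\d*:\d*\]",
    r"^\[\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\]",
]


def first_match(pattern, line=LOG_LINE):
    """Return the text matched by the pattern, or None if nothing matches."""
    m = re.search(pattern, line)
    return m.group(0) if m else None


def time_pattern(pattern, line=LOG_LINE, runs=20_000):
    """Return the seconds taken to run the precompiled pattern `runs` times."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.search(line), number=runs)


if __name__ == "__main__":
    baseline = time_pattern(PATTERNS[0])
    for pattern in PATTERNS:
        elapsed = time_pattern(pattern)
        # Improvement is relative to the first (greedy) pattern.
        improvement = (baseline - elapsed) / baseline * 100
        print(f"{pattern:45} {elapsed:.4f}s {improvement:6.1f}%  "
              f"-> {first_match(pattern)}")
```

The run count is kept far below 10 million so that the script finishes quickly; raise `runs` for more stable numbers.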
Regular Expression Optimization Suggestions
Regular expression performance optimization is critical for data collection system optimization. Precise character definition, proper use of quantifiers, systematic debugging, and advanced skills can significantly reduce the matching complexity for engines and improve collection efficiency.
- Precisely define matching characters to narrow down the search range and reduce backtracking.
- Replace general matching with precise character groups. For example, use [a-zA-Z] for letters, \d or [0-9] for digits, and \D or [^0-9] for non-digits.
- For timestamps, use \d to represent the year, month, day, hour, minute, and second, as they are digits.
- Delimiter - and _ are fixed characters and can be directly included in patterns.
- Use \S (equivalent to [^\s]) to match non-whitespace characters.
- Properly use quantifiers to balance matching accuracy and efficiency.
- Use + instead of * whenever possible to reduce unnecessary checks, as + requires at least one match.
- Use anchors (^ and $) to specify exact start and end positions, narrowing the matching range and improving efficiency.
- Avoid using anchors ^ and $ with .*, for example, ^.*pattern, as this does not improve performance.
- Use the boundary character \b to extract a specific log level, such as [WARNING].
- Inefficient: \[.*?\]
- Efficient: \bWARNING\b. It uses word boundaries to avoid matching longer words that merely contain the level name, such as WARNINGLY.
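The suggestions above can be combined in a short sketch. The example below is written in Python with the standard `re` module and is only an illustration: the group names (`timestamp`, `level`, `module`), the helper functions `parse_header` and `has_warning`, and the assumption that log levels consist of uppercase letters are not part of the original document.

```python
import re

LOG_LINE = (
    "[2025-06-05_10:04:36] [WARNING] [MODULE_ecdio] - "
    "m9nTW7s4YVSh1ImDMV2y;51+0FSNo5"
)

# Precise character classes, fixed-length quantifiers, literal delimiters,
# and a start anchor. Assumption: levels are uppercase letters only.
HEADER = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\] "
    r"\[(?P<level>[A-Z]+)\] "
    r"\[(?P<module>\w+)\]"
)

# Word boundaries keep WARNING from matching inside a longer word.
WARNING = re.compile(r"\bWARNING\b")


def parse_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    m = HEADER.search(line)
    return m.groupdict() if m else None


def has_warning(line):
    """Return True if the line contains WARNING as a whole word."""
    return WARNING.search(line) is not None


if __name__ == "__main__":
    print(parse_header(LOG_LINE))
    print(has_warning("[WARNING] disk usage high"))  # whole word
    print(has_warning("WARNINGLY"))                  # part of a longer word
    # * accepts zero digits, so it matches the empty string; + does not.
    print(repr(re.search(r"\d*", "abc").group(0)))
    print(re.search(r"\d+", "abc"))
```

Using + for the level and module fields means a line with empty brackets is rejected instead of producing empty fields.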