Regular Expressions for Separating Semi-structured Text
During table/file migration, CDM uses delimiters to separate fields in CSV files. However, delimiters cannot be used in complex semi-structured data because the field values also contain delimiters. In this case, the regular expression can be used to separate the fields.
The regular expression is configured in Source Job Configuration. The migration source must be an object storage or file system, and File Format must be CSV.
Log4J Log
- Log sample:
2018-01-11 08:50:59,001 INFO [org.apache.sqoop.core.SqoopConfiguration.configureClassLoader(SqoopConfiguration.java:251)] Adding jars to current classloader from property: org.apache.sqoop.classpath.extra
- Regular expression:
^(\d.*\d) (\w*) \[(.*)\] (\w.*).*
- Parsing result:
Table 1 Log4J log parsing result Column Number
Example Value
1
2018-01-11 08:50:59,001
2
INFO
3
org.apache.sqoop.core.SqoopConfiguration.configureClassLoader(SqoopConfiguration.java:251)
4
Adding jars to current classloader from property: org.apache.sqoop.classpath.extra
Log4J Audit Log
- Log sample:
2018-01-11 08:51:06,156 INFO [org.apache.sqoop.audit.FileAuditLogger.logAuditEvent(FileAuditLogger.java:61)] user=sqoop.anonymous.user ip=189.xxx.xxx.75 op=show obj=version objId=x
- Regular expression:
^(\d.*\d) (\w*) \[(.*)\] user=(\w.*) ip=(\w.*) op=(\w.*) obj=(\w.*) objId=(.*).*
- Parsing result:
Table 2 Log4J audit log parsing result Column Number
Example Value
1
2018-01-11 08:51:06,156
2
INFO
3
org.apache.sqoop.audit.FileAuditLogger.logAuditEvent(FileAuditLogger.java:61)
4
sqoop.anonymous.user
5
189.xxx.xxx.75
6
show
7
version
8
x
Tomcat Log
- Log sample:
11-Jan-2018 09:00:06.907 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
- Regular expression:
^(\d.*\d) (\w*) \[(.*)\] ([\w\.]*) (\w.*).*
- Parsing result:
Table 3 Tomcat log parsing result Column Number
Example Value
1
11-Jan-2018 09:00:06.907
2
INFO
3
main
4
org.apache.catalina.startup.VersionLoggerListener.log
5
OS Name:Linux
Django Log
- Log sample:
[08/Jan/2018 20:59:07 ] settings INFO Welcome to Hue 3.9.0
- Regular expression:
^\[(.*)\] (\w*) (\w*) (.*).*
- Parsing result:
Table 4 Django log parsing result Column Number
Example Value
1
08/Jan/2018 20:59:07
2
settings
3
INFO
4
Welcome to Hue 3.9.0
Apache Server Log
- Log sample:
[Mon Jan 08 20:43:51.854334 2018] [mpm_event:notice] [pid 36465:tid 140557517657856] AH00489: Apache/2.4.12 (Unix) OpenSSL/1.0.1t configured -- resuming normal operations
- Regular expression:
^\[(.*)\] \[(.*)\] \[(.*)\] (.*).*
- Parsing result:
Table 5 Apache server log parsing result Column Number
Example Value
1
Mon Jan 08 20:43:51.854334 2018
2
mpm_event:notice
3
pid 36465:tid 140557517657856
4
AH00489: Apache/2.4.12 (Unix) OpenSSL/1.0.1t configured -- resuming normal operations
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.