Parsing CSV Logs
This document describes how to parse syslogs or other text logs in special character-separated format.
Parsing Normal CSV Logs
{
"program":"access",
"severity":6,
"priority":14,
"facility":1,
"content":"198.51.100.1|10/Jun/2019:11:32:16 +0800|example.com|GET /zf/11874.html HTTP/1.1|200|0.077|6404|198.51.100.10:8001|200|0.060|https://example.com/s?q=%25%24%23%40%21&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei|-|Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36|-|-"
}
Background and requirements:
- When the value of the program field is access, the content field is parsed once using PSV (pipe-separated value) and then discarded.
- The request: GET /zf/11874.html HTTP/1.1 field is split into request_method, http_version, and request.
- Perform URL decoding on http_referer.
- The time field is formatted.
Solution:
- If the value of the program field is access, use the e_psv function to parse the content field and delete the original field content.
e_if(e_search("program==access"), e_compose(e_psv("content", "remote_addr, time_local,host,request,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True), e_drop_fields("content")))The returned log is as follows:
{ "severity": 6, "remote_addr": "198.51.100.1", "request": "GET /zf/11874.html HTTP/1.1", "upstream_addr": "198.51.100.10:8001", "body_bytes_sent": 6404, "session_id": "-", "program": "access", "priority": 14, "http_user_agent": "Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36", "upstream_status": 200, "request_time": "0.077", "http_referer": "https://example.com/s?q=%3A%2F%3A&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei", "upstream_response_time": "0.060", "host": "example.com", "http_x_forwarded_for": "-", "guid": "-", "facility": 1, "time_local": "10/Jun/2019:11:32:16 +0800", "status": 200 } - Use the e_regex function to parse the request field into request_method, request, and http_version.
e_regex("request",r"^(?P<request_method>\w+) (?P<request>.+) (?P<http_version>\w+/[\d\.]+)$")The returned log is as follows:
"request": "GET /zf/11874.html HTTP/1.1", "request_method": "GET", "http_version": "HTTP/1.1"
- Perform URL decoding on http_referer.
e_set("http",url_decoding(v("http_referer")))The returned log is as follows:
"http": "https://example.com/s?q=:/:&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
- In conclusion, the solution is as follows:
e_if(e_search("program==access"), e_compose(e_psv("content", "remote_addr, time_local,host,request,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True), e_drop_fields("content"))) e_regex("request",r"^(?P<request_method>\w+) (?P<request>.+) (?P<http_version>\w+/[\d\.]+)$") e_set("http",url_decoding(v("http_referer")))
Output log:
{
"severity": 6,
"remote_addr": "198.51.100.1",
"request": "GET /zf/11874.html HTTP/1.1",
"upstream_addr": "198.51.100.10:8001",
"body_bytes_sent": 6404,
"session_id": "-",
"http_version": "HTTP/1.1",
"program": "access",
"request_method": "GET",
"priority": 14,
"http_user_agent": "Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36",
"upstream_status": 200,
"request_time": "0.077",
"http_referer": "https://example.com/s?q=%3A%2F%3A&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
" upstream_response_time": "0.060",
"host": "example.com",
"http_x_forwarded_for": "-",
"guid": "-",
"http": "https://example.com/s?q=:/:&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
"facility": 1,
" time_local": "10/Jun/2019:11:32:16 +0800",
"status": 200
}
Parsing Abnormal CSV Logs
The following log format contains an abnormal log. To parse its content, do as follows:
- Raw log
{ "content":"192.168.0.1|07/Aug/2019:11:10:37 +0800|www.learn.example.com|GET /index/htsw/?ad=5|8|6|11| HTTP/1.1|200|6.729|14559|192.168.0.1:8001|200|6.716|-|-|Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))||" } - Solution
For GET /index/htsw/?ad=5|8|6|11| HTTP/1.1 in the content, if the e_csv function cannot extract the correct fields, extract the GET /index/htsw/?ad=5|8|6|11| HTTP/1.1 first and replace it with an empty string in the content.
e_regex("content", r"[^\|]+\|[^\|]+\|[^\|]+\|(?P<request>(.+)HTTP/\d.\d)") e_set("content", regex_replace(v("content"), r"([^\|]+\|[^\|]+\|[^\|]+)\|((.+)HTTP/\d.\d)\|(.+)",replace= r"\1||\4")) e_psv("content", "remote_addr,time_local,host,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True) - Output log
{ "request": "GET /index/htsw/?ad=5|8|6|11| HTTP/1.1", "remote_addr": "192.168.0.1", "upstream_addr": 14559, "body_bytes_sent": "6.729", "time_local": "07/Aug/2019:11:10:37 +0800", "session_id": "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))", "content": "192.168.0.1|07/Aug/2019:11:10:37 +0800|www.learn.example.com||200|6.729|14559|192.168.0.1:8001|200|6.716|-|-|Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))||", "http_user_agent": "-", "upstream_status": "192.168.0.1:8001", "request_time": 200, "http_referer": "6.716", " upstream_response_time": 200, "host": "www.learn.example.com", "http_x_forwarded_for": "-", "guid": "", "status": "" }
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot