Updated on 2025-12-04 GMT+08:00

Parsing CSV Logs

This document describes how to parse syslogs or other text logs in special character-separated format.

Parsing Normal CSV Logs

Raw log:
{
	"program":"access",
	"severity":6,
	"priority":14,
	"facility":1,
	"content":"198.51.100.1|10/Jun/2019:11:32:16 +0800|example.com|GET /zf/11874.html HTTP/1.1|200|0.077|6404|198.51.100.10:8001|200|0.060|https://example.com/s?q=%25%24%23%40%21&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei|-|Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36|-|-"
}

Background and requirements:

  1. When the value of the program field is access, the content field is parsed once using PSV (pipe-separated value) and then discarded.
  2. The request: GET /zf/11874.html HTTP/1.1 field is split into request_method, http_version, and request.
  3. Perform URL decoding on http_referer.
  4. The time field is formatted.

Solution:

  1. If the value of the program field is access, use the e_psv function to parse the content field and delete the original field content.
    e_if(e_search("program==access"), e_compose(e_psv("content", "remote_addr, time_local,host,request,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True), e_drop_fields("content")))

    The returned log is as follows:

    {
    	"severity": 6,
    	"remote_addr": "198.51.100.1",
    	"request": "GET /zf/11874.html HTTP/1.1",
    	"upstream_addr": "198.51.100.10:8001",
    	"body_bytes_sent": 6404,
    	"session_id": "-",
    	"program": "access",
    	"priority": 14,
    	"http_user_agent": "Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36",
    	"upstream_status": 200,
    	"request_time": "0.077",
    	"http_referer": "https://example.com/s?q=%3A%2F%3A&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
    	"upstream_response_time": "0.060",
    	"host": "example.com",
    	"http_x_forwarded_for": "-",
    	"guid": "-",
    	"facility": 1,
    	"time_local": "10/Jun/2019:11:32:16 +0800",
    	"status": 200
    }
  2. Use the e_regex function to parse the request field into request_method, request, and http_version.
    e_regex("request",r"^(?P<request_method>\w+) (?P<request>.+) (?P<http_version>\w+/[\d\.]+)$")

    The returned log is as follows:

    "request": "GET /zf/11874.html HTTP/1.1",
    "request_method": "GET",
    "http_version": "HTTP/1.1"
  3. Perform URL decoding on http_referer.
    e_set("http",url_decoding(v("http_referer")))

    The returned log is as follows:

    "http": "https://example.com/s?q=:/:&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
  4. In conclusion, the solution is as follows:
    e_if(e_search("program==access"), e_compose(e_psv("content", "remote_addr, time_local,host,request,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True), e_drop_fields("content")))
    e_regex("request",r"^(?P<request_method>\w+) (?P<request>.+) (?P<http_version>\w+/[\d\.]+)$")
    e_set("http",url_decoding(v("http_referer")))

Output log:

{
	"severity": 6,
	"remote_addr": "198.51.100.1",
	"request": "GET /zf/11874.html HTTP/1.1",
	"upstream_addr": "198.51.100.10:8001",
	"body_bytes_sent": 6404,
	"session_id": "-",
	"http_version": "HTTP/1.1",
	"program": "access",
	"request_method": "GET",
	"priority": 14,
	"http_user_agent": "Mozilla/5.0 (Linux; Android 9; HWI-AL00 Build/HUAWEIHWI-A00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36",
	"upstream_status": 200,
	"request_time": "0.077",
	"http_referer": "https://example.com/s?q=%3A%2F%3A&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
	" upstream_response_time": "0.060",
	"host": "example.com",
	"http_x_forwarded_for": "-",
	"guid": "-",
	"http": "https://example.com/s?q=:/:&from=wy878378&uc_param_str=dnntnwvepffrgibijbprsvdsei",
	"facility": 1,
	" time_local": "10/Jun/2019:11:32:16 +0800",
	"status": 200
}

Parsing Abnormal CSV Logs

The following log format contains an abnormal log. To parse its content, do as follows:

  • Raw log
    {
        "content":"192.168.0.1|07/Aug/2019:11:10:37 +0800|www.learn.example.com|GET /index/htsw/?ad=5|8|6|11| HTTP/1.1|200|6.729|14559|192.168.0.1:8001|200|6.716|-|-|Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))||"
    }
  • Solution

    For GET /index/htsw/?ad=5|8|6|11| HTTP/1.1 in the content, if the e_csv function cannot extract the correct fields, extract the GET /index/htsw/?ad=5|8|6|11| HTTP/1.1 first and replace it with an empty string in the content.

    e_regex("content", r"[^\|]+\|[^\|]+\|[^\|]+\|(?P<request>(.+)HTTP/\d.\d)")
    e_set("content", regex_replace(v("content"), r"([^\|]+\|[^\|]+\|[^\|]+)\|((.+)HTTP/\d.\d)\|(.+)",replace= r"\1||\4"))
    e_psv("content", "remote_addr,time_local,host,status,request_time,body_bytes_sent,upstream_addr,upstream_status, upstream_response_time,http_referer,http_x_forwarded_for,http_user_agent,session_id,guid", restrict=True)
  • Output log
    {
    	"request": "GET /index/htsw/?ad=5|8|6|11| HTTP/1.1",
    	"remote_addr": "192.168.0.1",
    	"upstream_addr": 14559,
    	"body_bytes_sent": "6.729",
    	"time_local": "07/Aug/2019:11:10:37 +0800",
    	"session_id": "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))",
    	"content": "192.168.0.1|07/Aug/2019:11:10:37 +0800|www.learn.example.com||200|6.729|14559|192.168.0.1:8001|200|6.716|-|-|Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))||",
    	"http_user_agent": "-",
    	"upstream_status": "192.168.0.1:8001",
    	"request_time": 200,
    	"http_referer": "6.716",
    	" upstream_response_time": 200,
    	"host": "www.learn.example.com",
    	"http_x_forwarded_for": "-",
    	"guid": "",
    	"status": ""
    }