OK, this is the fastest I could get it, cheating with all my might using lparallel and other shit to compete with the Python read lol.
>CODE: https://pastebin.com/Ju4UM5vz
Ran it over ~2.74 GiB of Spark logs: 3,852 files, 33,236,604 lines. Tested these 2 regexes:
(defparameter *simple-re* (ppcre:create-scanner "\\b(INFO|WARN|ERROR)\\b"))
(defparameter *complex-re*
(ppcre:create-scanner "^(\\d{2}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2}) (INFO|WARN|ERROR) ([^:]+): (.+)$"))
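For anyone comparing on the Python side, here is a rough sketch of the same two patterns with the stdlib `re` module. This is not taken from bench4.py (its contents aren't shown here); the names `SIMPLE_RE`, `COMPLEX_RE`, `find_level`, `parse_line` and the sample log line are made up for illustration.

```python
import re

# Same two patterns as the CL-PPCRE scanners above, written as raw strings.
SIMPLE_RE = re.compile(r"\b(INFO|WARN|ERROR)\b")
COMPLEX_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (INFO|WARN|ERROR) ([^:]+): (.+)$"
)

# Hypothetical sample line in the shape the complex pattern expects
# (not copied from the benchmark data).
SAMPLE = "17/06/09 20:10:40 INFO executor.Executor: Running task 0.0 in stage 0.0"


def find_level(line):
    """Return the first INFO/WARN/ERROR token found anywhere in the line, or None."""
    m = SIMPLE_RE.search(line)
    return m.group(1) if m else None


def parse_line(line):
    """Return (timestamp, level, component, message), or None if the line doesn't fit."""
    m = COMPLEX_RE.match(line.rstrip("\n"))
    return m.groups() if m else None


if __name__ == "__main__":
    print(find_level(SAMPLE))
    print(parse_line(SAMPLE))
```

The simple pattern only has to find a level token anywhere in the line, while the complex one is anchored at both ends and captures four groups, so lines like stack-trace continuations hit the simple regex but fail the complex one.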
Python simple read line results:
> uv run bench4.py "~/Downloads/sparklogs/"
Directory: ~/Downloads/sparklogs/
Files: 3,852
Total size: 2,941,224,304 bytes (2804.97 MB)