Pontifications and realizations from the developer's point of view about Super Speedy Syslog Searcher!.
- Background
- The Problem
- The Solution
- Solution Difficulties
- differing sysline formats, differing datetime formats
- Which is the correct datetime stamp?
- awareness of multi-line syslines adds some layer of program "state"
- What year? What timezone?
- Unicode decoding is a surprisingly expensive operation.
- Binary-search in a file that has varying character encodings.
- What defines the boundaries of a sysline?
- Solution Difficulties
- The Solution
Why Super Speedy Syslog Searcher?
I prefer unique, searchable, memorable, and self-explanatory names for all programs.
To explain another way, try an internet search for "syslog finder", or "syslog datetime sort", etc. Many varying results; not a memorable phrase.
Great language! Great ecosystem!
Not perfect but a necessary generational improvement over competitors C and C++.
As a Test Engineer, I'm often investigating problems that affect complex systems. I'm usually only superficially familiar with the underlying subsystems (Test Engineers are often system generalists, not specialists). During an investigation, I may know when a problem occurred but I have little certainty about what system components might be involved, and so I don't what system logs ("syslogs") are relevant to review.
For example, the frobulator failed around minute 2022/01/01 14:33. I know the frobulator emitted an error but my hunch is this error only first appeared in the frobulator but the source error is somewhere else. I know many system components interact with the frobulator but I'm not certain of that knowledge. So I'd like to search all other system components for the preceding five minutes (i.e. search all other syslog files).
I have several-times written scripts that:
- parse varying datetime formats in lines
- sort lines by datetime
- filter lines by some datetime period ("I only want syslogs during this minute")
- display syslog lines ("syslines") by datetime order and prepended file name
Doing this in a scripting language like Python is do-able but tedious. It's also very slow for large amounts of syslogs and/or compressed syslogs. Additional complexities make scripting difficult or imperfect.
- Handling compressed or archived files adds scripting complexity and runtime.
- Storing all the syslines in memory may be impossible.
- Dealing with multi-line syslog lines ("syslines") means typical "line by line" processing is inadequate. e.g. this is one sysline
2022/01/01 14:33:12 Frobulator: failed to initialize the woozybucket
woozybucket: I/O failure /dev/sdc4
I once attempted to create such a script for a multi-host data storage system.
The system would generate tens of Gigabytes of syslogs per day
(it was running in Debug mode).
Those syslogs were periodically offloaded onto a busy central NAS that
resided in a remote office, where they were further compressed
and archived (.tar
files in .tar
files).
I would investigate errors days after they occurred.
Searching a large amount of syslogs for "everything that happened between
datetime A and datetime A plus five minutes" was incredibly slow
(tens of minutes to hours).
I observed that if the Searching Tool was datetime-aware, and did a binary-search instead of a linear search, then searching could be very fast. Additionally, if the tool would release memory for syslines that have been processed then it could handle arbitrarily large syslog files.
Enter Super Speedy Syslog Searcher (s4).
s4 searches syslogs by datetime in a binary-search manner. That is, given datetime A, a syslog (e.g. a file with lines having datetime-stamps; "syslines") a binary-search for the sysline at or nearest to datetime A is possible. Only small blocks of the file need to be read to discover the sysline with datetime A (or nearest to datetime A).
Surprisingly, nothing exists that can do this.
At the same time, I was interested in learning about the new cool thing, the rust programming language. rust seemed ideal for this situation where speed and efficiency was a defining feature of the solution.
Why do no other tools exist for this common need?
When moving from implementing a script for a known system, to implementing a general purpose tool, here are some difficulties I discovered during design and implementation.
In practice there are many reasonably possible sysline formats (datetime format, plus arrangement). Handling them all can be done but adds complexity. Handling them all speedily adds further complexity.
These are a few system log lines ("syslines") I've found:
2022/01/01 12:34:56 There was an error
There was an error January 01 2022 12:34:56PM
[Error] There was an error Sat Jan 1 12:34:56 2022
e.g.
Frobulator 2022/01/01 12:34:50 there was an error 2022/01/01 12:34:56
How do you tell your parser that the second datetime stamp is the real datetime stamp of the sysline?
2022/01/01 12:34:56 Frobulator: failed to initialize the woozybucket
woozybucket: I/O failure /dev/sdc4
Line-by-line parsing won't suffice.
It is common to have syslog datetime formats the do not include a year or a timezone.
For example, your syslog searching script finds these three syslines from three different syslogs, what is the correct chronological order?
Jan. 1 11:00:00 Error
Jan. 1 12:00:00 PST Error
Jan. 1 13:00:00 UTC Error
What year and timezone did the first sysline occur? (Answer: you don't know). What year did the second and third sysline occur? (Answer: you don't know).
There are ways to get a "probably correct" answer. But how to do this very quickly (i.e. without reading the entire file? without spinning up some fancy Learning AI computer?)? It's not a straightforward solution.
In a multi-host system that spans different timezones (which I've worked with) then this problem moves from theory to reality.
See benches/bench_decode_utf.rs
.
What if the "byte cursor" of your binary search lands on the middle byte that is part of three-byte long unicode character?
A sysline is a line or lines that come after a sysline (or the beginning of a file), and before a sysline (or the end of a file).
(I hope you are understanding the implied difficulty here).