LogDenser

The Idea

What do you usually think when you see a giant application log file or when you open a cloud console filled with errors and warnings? “How boring”, “Not again”, “Endless scrolling” — I totally feel you. “I wish there was a better way to glance over all these messages,” I said to myself while studying yet another long stream of log entries. So I decided to make a Java 17 library that lets you do exactly that.

What does a “LogDenser” mean? Oh, it’s simply a combination of “Log” and “Condenser”, nothing more.

Typically, an application or a microservice employs a limited number of logging patterns. For example:

[INFO] Operation ${name} finished successfully in ${duration} ms

or

[WARN] [${typeName}] ${name} is not found in [${types}]

That means there must be a way to group log entries by their patterns, or, as I call it, to “condense” the logs.

Does that mean you need to extract a pattern from each log message and then find all the similar messages? Not really; it’s better to start with something simple and improve it iteratively.

The Two Strings

Let’s forget about logs for a moment and think about strings. Just two strings. How to measure their similarity? Use some kind of string metric, of course!

Edit distance is a natural choice for this task. The most straightforward variants are the Hamming distance, which only counts substitutions, and the Levenshtein distance, which also allows insertions and deletions.

Levenshtein distance seems like a viable candidate. The distance between these two strings is 10.

Operation findProduct finished successfully in 22 ms
Operation listAllProducts finished successfully in 135 ms
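
If you want to reproduce this number, a plain textbook dynamic-programming implementation of the Levenshtein distance is enough; this is generic code, not part of LogDenser:

// Classic two-row dynamic-programming Levenshtein distance.
// Runs in O(a.length() * b.length()) time and O(b.length()) memory.
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
        prev[j] = j; // turning the empty prefix of "a" into a prefix of "b" takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i; // turning a prefix of "a" into the empty string takes i deletions
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        int[] tmp = prev;
        prev = curr;
        curr = tmp;
    }
    return prev[b.length()];
}

// levenshtein("Operation findProduct finished successfully in 22 ms",
//             "Operation listAllProducts finished successfully in 135 ms")  ->  10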

Let’s consider another pair of strings:

[Environment type] TEST is not found in [DEV, PROD]
[Job state] NEW is not found in [PLANNED, IN_PROGRESS, FINISHED, FAILED, SKIPPED]

The Levenshtein distance between them is 55, even though this pair of messages was obviously printed by the same pattern. It’s really difficult to rely on a metric with such a wide range of values. How can we improve it?

Lexical Tokenization

Computing a character-based edit distance doesn’t really work. But what about a word-based, or even better, token-based edit distance? For simplicity, we define two types of tokens:

  1. Word token — a substring bounded by whitespace.
  2. Enclosed words — a substring enclosed by a pair of matching characters such as [ and ].
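
The library’s actual tokenizer is linked at the end of this section; purely to illustrate the two token types above, a naive sketch could look like this (the method name is made up, and the usual java.util imports are assumed):

// Illustrative tokenizer (not the library's actual implementation): splits on whitespace,
// but keeps "[ ... ]" groups together as a single "enclosed words" token.
static List<String> tokenize(String message) {
    List<String> tokens = new ArrayList<>();
    int i = 0;
    while (i < message.length()) {
        char c = message.charAt(i);
        if (Character.isWhitespace(c)) {
            i++;                                   // skip separators between tokens
        } else if (c == '[') {
            int end = message.indexOf(']', i);     // "enclosed words" token
            end = (end == -1) ? message.length() - 1 : end;
            tokens.add(message.substring(i, end + 1));
            i = end + 1;
        } else {
            int end = i;                           // plain word token
            while (end < message.length() && !Character.isWhitespace(message.charAt(end))) {
                end++;
            }
            tokens.add(message.substring(i, end));
            i = end;
        }
    }
    return tokens;
}

// tokenize("[Environment type] TEST is not found in [DEV, PROD]")
//   -> ["[Environment type]", "TEST", "is", "not", "found", "in", "[DEV, PROD]"]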

Let’s tokenize the first example (token boundaries are marked with |).

Operation | findProduct     | finished | successfully | in | 22  | ms
Operation | listAllProducts | finished | successfully | in | 135 | ms

The Levenshtein distance between these two tokenized strings is 2.

And the second pair of strings:

[Environment type] | TEST | is | not | found | in | [DEV, PROD]
[Job state]        | NEW  | is | not | found | in | [PLANNED, IN_PROGRESS, FINISHED, FAILED, SKIPPED]

In this example, the Levenshtein distance is 3.

This looks promising. And there is one interesting detail: the tokenized strings now have the same length in both examples. That means that, if tokenization is done just right, we could even use the Hamming distance to compute the similarity of two strings.

Why use the Hamming distance? The Levenshtein distance calculation algorithm has O(n²) time complexity and requires additional memory, while the Hamming distance can be calculated in linear time without auxiliary memory. That’s why I used the latter in this research.
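
As a sketch (again, not the library’s exact code), a token-level Hamming distance is just a single pass that counts the positions where the tokens differ, assuming each tokenized string is represented as a java.util.List of strings:

// Token-level Hamming distance: the number of positions where the tokens differ.
// Defined only when both sequences contain the same number of tokens.
static int hammingDistance(List<String> a, List<String> b) {
    if (a.size() != b.size()) {
        throw new IllegalArgumentException("token sequences must have the same length");
    }
    int distance = 0;
    for (int i = 0; i < a.size(); i++) {
        if (!a.get(i).equals(b.get(i))) {
            distance++;
        }
    }
    return distance;
}

// hammingDistance(
//     List.of("Operation", "findProduct", "finished", "successfully", "in", "22", "ms"),
//     List.of("Operation", "listAllProducts", "finished", "successfully", "in", "135", "ms"))  ->  2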

You can see the tokenizer source code here.

Condensation

Or, grouping logs by similarity. Let’s define a group: it is a collection of static and dynamic parts that represents a single logging pattern, plus the total number of messages that follow that pattern.

So we will have this group for the first example (dynamic values are stacked under each other, and {2} is the number of messages in the group):

Operation  findProduct      finished  successfully  in  22   ms    {2}
           listAllProducts                              135

And the second example will look like this:

[Environment type]  TEST  is  not  found  in  [DEV, PROD]    {2}
[Job state]         NEW                       [PLANNED, IN_PROGRESS, FINISHED, FAILED, SKIPPED]

The group doesn’t record which dynamic values appeared together in a single message, and that’s intended. This condensed view is enough to quickly glance over the logs and see what’s happening in the application.
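
To make that structure concrete, here is one hypothetical way to model a group in Java 17 (assuming the usual java.util imports); the real SameResults class in the library may be organized quite differently:

// Hypothetical model of a condensed group, not the library's SameResults class:
// for every token position we keep the set of values observed there, plus the message count.
record CondensedGroup(List<Set<String>> tokensByPosition, int messageCount) {

    // a position is dynamic if more than one distinct token was observed there
    boolean isDynamic(int position) {
        return tokensByPosition.get(position).size() > 1;
    }
}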

The main (and only for now) condenser is responsible for grouping tokenized strings. In short, it works like this:

public List<SameResults> condense(List<TokenString> tokenStrings) {
    List<SameResults> results = new ArrayList<>();
    List<TokenString> input = tokenStrings;
    while (!input.isEmpty()) {
        List<TokenString> similar = new ArrayList<>();
        List<TokenString> different = new ArrayList<>();
        // one bit per token position; a set bit marks a dynamic (substituted) position
        // tokenLength is the token count of the strings being condensed (field omitted here)
        BitSet substitutionIndices = new BitSet(tokenLength);

        var first = input.get(0);
        similar.add(first);

        for (int i = 1; i < input.size(); i++) {
            var tokenString = input.get(i);
            var editDistance = editDistanceCalculator.distance(first, tokenString);
            if (condensingMatcher.matches(first, tokenString, editDistance)) {
                similar.add(tokenString);
                // this method extracts static and dynamic message parts
                fillSubstitutionIndices(editDistance, substitutionIndices);
            } else {
                different.add(tokenString);
            }
        }
        // now "similar" contains a list of token strings similar to the first in input
        // and "different" contains everything else

        // pack "similar" into a single instance of SameResults
        results.add(fromList(similar, substitutionIndices));
        // repeat everything, but only for "different" strings
        input = new ArrayList<>(different);
    }
    return results;
}

This method has O(n²) worst-case time complexity. A nested loop without any bounds is a recipe for disaster if n is big enough. And in the case of logs, it can be: sometimes you get hundreds of thousands or even millions of log entries in just 24 hours.

But, as we discussed earlier, those myriads of log entries are not random strings: they follow a limited set of patterns. That means the only way the condense method will degrade to O(n²) is when condensingMatcher.matches returns false for a large share of calls. What does this matcher do? It implements a heuristic that decides whether two strings are similar. A simple comparison like editDistance.distance() < n, where n is an increasing function of the token length, usually works pretty well.
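
For illustration only, such a matcher can be as small as the following; the threshold function here is an arbitrary example, not the one the library ships with:

// Illustrative similarity check: token strings of equal length are considered "similar"
// when their token-level edit distance stays below a threshold that grows with the length.
static boolean similarEnough(int tokenLength, int editDistance) {
    // example threshold: allow up to roughly half of the token positions to differ
    int threshold = tokenLength / 2 + 1;
    return editDistance < threshold;
}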

Using an algorithm with quadratic worst-case time complexity in production is a bad idea, but the first version of Quicksort by Tony Hoare had a similar problem. It didn’t stop people from using it 😄

Conclusion

Writing this library was an interesting adventure. I set up CI and code coverage, and published my first artifact to the Maven Central Repository. But more importantly, I used good old computer science to make a piece of software that solves a real-world problem. That’s what being a software engineer means to me.

If you have suggestions or spotted a bug, please don’t hesitate to open an issue on GitHub.

Thank you for reading and take care!