Why is regular expression efficiency important?
While well-written regular expressions can be very effective, poorly written ones can take a very long time to run and can slow a system down significantly. It is quite possible to write a regular expression that takes hours or days to complete, and it is even possible to write one that will not complete within the lifetime of the universe when run against a moderately sized string.
The discovery system is more resilient to inefficient regular expressions than previous versions were. It now minimizes the regular expression matching required when deciding which TPL patterns to run. It also spreads the work of running TPL patterns across multiple processors, so that if one processor is busy with a long regular expression match, the others can continue working.
Despite these improvements, writing efficient regular expressions is still important to keep system discovery running optimally. If system discovery slows down significantly when scanning the network and the inference or discovery service uses 100% CPU for long periods of time, one possible cause is an inefficient regular expression.
Anatomy of an inefficient regular expression
So, how do you write an inefficient regular expression? One common problem is excessive backtracking, which can happen when a regular expression contains multiple repetition operators. The repetition operators are +, * and {n,m}. If the regular expression makes a partial match but then fails, it must go back and try every other possible partial match in case one of them succeeds.
For example, consider matching the regular expression a.*b*cd against the string abc abc abc. The match can never succeed because there is no d in the string, but the regular expression must still try every possible combination of the letters a, b and c before giving up.
As a rough guide, the number of comparisons a regular expression needs to perform is proportional to the length of the string multiplied by the number of possible intermediate matches.
In this example, using non-greedy operators, as in a.*?b*?cd, has no effect on the number of comparisons, because the regular expression engine still has to try every combination.
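To see this growth in practice, here is a minimal timing sketch. It uses Python's re module purely for illustration (the discovery engine may behave differently in detail) and times the failing match against progressively longer strings:

```python
# Rough timing sketch: Python's re module stands in for the discovery engine,
# which may differ in detail. The match always fails, but the time taken
# grows with the number of partial matches that have to be tried.
import re
import time

pattern = re.compile(r"a.*b*cd")

for repeats in (500, 1000, 2000):
    text = "abc " * repeats          # contains a, b and c, but never d
    start = time.perf_counter()
    assert pattern.search(text) is None
    elapsed = time.perf_counter() - start
    print(f"{len(text):>6} chars: {elapsed * 1000:7.2f} ms")
```

Doubling the string length roughly quadruples the time, as predicted by the "length times intermediate matches" rule of thumb.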
Real examples
Let's look at some examples based on real regular expressions that have caused problems in production systems.
\b.*xx.*foo
This is a regular expression that is compared with the process command line on the host. The match fails when it is run against a half-megabyte string that contains many repetitions of xx but no foo.
Let's analyze what happens when it is matched against that string:
- The engine starts scanning from the beginning of the string.
- It scans forward until it finds a word boundary, \b.
- It scans forward from the word boundary until it finds a match for xx.
- It scans forward from the xx until it either finds foo or reaches the end of the string.
- If it reaches the end of the string without matching foo, it goes back to step 3 and scans forward to the next xx.
- If it has tried all the xx matches and still not found foo, it goes back to step 2 and scans forward to the next word boundary.
Regular expression matching therefore contains nested loops: the total processing time is roughly the length of the string (about 500,000 characters for the command line causing the problem) multiplied by the number of xx substrings (about 500) multiplied by the number of word boundaries (about 80,000). This is roughly equivalent to scanning a 20-trillion-character string, which would take more than a day to complete.
This was solved in two ways.
First, \b.* was removed from the beginning of the regular expression, since it served no purpose other than to slow the whole thing down. This change alone reduces the runtime from days to seconds.
Second, we can use what we know about the data we want to match: in this case, we are only interested in the text between the beginning of the string and the first space. To stop the regular expression from scanning the whole string, we can anchor it to the beginning of the string and use the \S token to match non-whitespace characters. The resulting regular expression, ^\S*xx\S*foo, stops as soon as it reaches a whitespace character. It now takes a few microseconds to run against the same string.
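The effect of the two fixes can be illustrated with a rough sketch. Python's re module and a short synthetic command line stand in for the real engine and the real half-megabyte string, so the absolute numbers are only indicative:

```python
# Timing sketch: Python's re module and a short synthetic command line stand
# in for the real engine and the real half-megabyte string.
import re
import time

# Many words, some containing "xx", never "foo".
command_line = " ".join(["someword", "optxxvalue"] * 100)

patterns = {
    "original":       r"\b.*xx.*foo",
    "\\b.* removed":  r"xx.*foo",
    "anchored \\S*":  r"^\S*xx\S*foo",
}

for name, pat in patterns.items():
    regex = re.compile(pat)
    start = time.perf_counter()
    found = regex.search(command_line)   # always fails: there is no "foo"
    elapsed = time.perf_counter() - start
    print(f"{name:14} matched={found is not None}  {elapsed * 1000:9.3f} ms")
```

Even on this much smaller input, the anchored version is orders of magnitude faster than the original.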
-A(\D+)+-B(\D+)+
This is used as a sensitive data filter. Its purpose is to scan the command line and remove the options that start with -A and -B. However, not only does it fail to do what its author intended, it can also run effectively forever.
Let's break down how it executes:
- The engine scans along the string until it finds -A.
- It then matches all the non-digit characters until it finds -B.
- If it cannot find -B, it tries every possible way of splitting the text between the -A and the next digit (or the end of the string) into groups. For example, if the rest of the string is abcd, the (\D+)+ tries each of (abcd), (abc)(d), (ab)(cd), (ab)(c)(d), (a)(bcd), (a)(bc)(d), (a)(b)(cd) and (a)(b)(c)(d). Each additional character in the remaining string doubles the number of combinations.
Thus, when the command line contains -A that is not followed by -B, matching takes a time proportional to 2^N, where N is the number of characters between the -A and the next digit or the end of the string. To put that in perspective, on a standard PC a 22-character string takes about one second, a 40-character string takes about 3 days, and a 100-character string takes 95,836,965,945,500 years, although this has not been tested.
The regular expression was fixed by removing the repetition of the groups, since it served no purpose: -A(\D+)-B(\D+). The runtime drops from essentially forever to a few microseconds.
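A small sketch, again using Python's re module as a stand-in, makes the exponential behavior visible; N is kept small so that the broken pattern still finishes:

```python
# Exponential backtracking sketch: Python's re module assumed; N is kept
# small so that the broken pattern still finishes in reasonable time.
import re
import time

broken = re.compile(r"-A(\D+)+-B(\D+)+")
fixed = re.compile(r"-A(\D+)-B(\D+)")

for n in (14, 16, 18, 20):
    text = "-A" + "x" * n            # contains -A, but no -B and no digits
    start = time.perf_counter()
    broken.search(text)              # fails after roughly 2**n backtracking steps
    slow = time.perf_counter() - start

    start = time.perf_counter()
    fixed.search(text)               # fails almost immediately
    fast = time.perf_counter() - start

    print(f"n={n:2}  broken: {slow:8.4f} s   fixed: {fast * 1e6:6.1f} us")
```

Each extra character roughly doubles the time taken by the broken pattern, while the fixed pattern stays in the microsecond range.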
A guide to writing efficient regular expressions
This section provides guidelines to help you avoid common problems and write efficient regular expressions.
Consider failure scenarios
As the previous examples show, problems occur when a regular expression does not match, but many partial matches exist. When writing a regular expression, it is not enough to consider what happens when it succeeds; you also need to consider what happens when it fails to match. Regular expressions used in discovery are often matched against a large number of command lines, which can be very long (up to a million characters is not unknown) and can contain text that partially, but not completely, matches the regular expression.
Beware of multiple repeated wildcard tokens
A wildcard token is anything that can match more than one possible character; this includes ., \w, \D, character classes, negated character classes, and so on.
If a regular expression has two or more consecutive repeated wildcards, there is the potential for backtracking, as the timing sketch after this list illustrates. For example, when matching against a target string that starts with a, is N characters long and does not contain x:
- a.*.*x requires about N^2 comparisons.
- a.*x*x also requires about N^2 comparisons, because x* can match zero characters, so it can match at any position in the string.
- a.*y.*x requires about N×M comparisons, where M is the number of occurrences of y.
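The following sketch (Python's re module assumed, synthetic input) shows the roughly quadratic cost of the a.*.*x case:

```python
# Quadratic backtracking sketch for consecutive repeated wildcards.
# Python's re module assumed; the input is synthetic.
import re
import time

pattern = re.compile(r"a.*.*x")

for n in (2000, 4000, 8000):
    text = "a" + "b" * n             # starts with a, contains no x
    start = time.perf_counter()
    pattern.search(text)             # always fails
    elapsed = time.perf_counter() - start
    print(f"N={n:5}: {elapsed * 1000:8.2f} ms")   # roughly 4x per doubling
```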
Really beware of repeated nested groups
As described above, if a group contains a repetition and the group itself is repeated, for example (.*)+, the number of possible matches can grow exponentially.
Do not start regular expressions with wildcard repetitions
This is a special case of the previous point. The regular expression engine searches for a match anywhere in the string, so it tries to match the regular expression starting at the first character, then at the second, and so on until it finds a match. The regular expression .*x is therefore equivalent to ^.*.*x, which suffers from the backtracking problem described above.
Because this is such a common mistake, the discovery system looks for regular expressions that begin with an unnecessary .* and removes it where possible.
Match only what you really need
A regular expression should be designed to stop as soon as it has matched enough, or as soon as it knows it cannot match. For example, consider the regular expression \b\w+XXX.*:
- The initial \b\w+ is redundant and can be replaced with \w, which matches in every case where the original regular expression matches.
- The trailing .* is also redundant, since it has no effect on whether the match succeeds.
- Therefore, the entire regular expression can be replaced with \wXXX.
The exception is when you need to use the matched text, rather than simply testing whether the match succeeds.
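As a quick sanity check, the following sketch (Python's re module assumed; the sample strings are invented) verifies that the trimmed pattern succeeds and fails on exactly the same inputs as the original:

```python
# Sanity check: the trimmed pattern \wXXX succeeds and fails on the same
# inputs as the original \b\w+XXX.* (Python's re module assumed; the sample
# strings are invented).
import re

original = re.compile(r"\b\w+XXX.*")
trimmed = re.compile(r"\wXXX")

samples = [
    "prefixXXXsuffix",                      # matches
    "aXXX",                                 # matches
    "XXX",                                  # no word character before XXX
    "nothing to see here",                  # no XXX at all
    "long_option=valueXXX trailing text",   # matches
]

for s in samples:
    assert bool(original.search(s)) == bool(trimmed.search(s))
    print(f"{s!r:40} matches: {bool(trimmed.search(s))}")
```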
Try to fail fast
If a regular expression reaches a point where it cannot possibly match the intended target, try to make it fail as quickly as possible.
The first real-world example above shows this. The regular expression ^\S*xx\S*foo never scans beyond the first space character, whether it succeeds or not. In the context in which it is used, this is the difference between scanning a string of about 100 characters and scanning a string of hundreds of thousands of characters.
Profile, especially the failure cases
It is important to test a regular expression to make sure it matches what you expect. It is just as important to test its performance against long strings (for example, megabyte-long strings of random characters) that partially match the regular expression.
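One way to do this without risking an endless test run is to grow the input gradually and stop once a time budget is exceeded. The sketch below assumes Python's re module and an invented helper, profile_failure; it only illustrates the approach:

```python
# Profiling sketch: time a pattern against progressively longer strings that
# partially match, stopping once a time budget is exceeded so a pathological
# pattern cannot hang the test. Python's re module is assumed, and
# profile_failure is an invented helper, not part of any product API.
import re
import time

def profile_failure(pattern, make_input, budget=0.05):
    regex = re.compile(pattern)
    length = 500
    while length <= 64_000:
        text = make_input(length)
        start = time.perf_counter()
        regex.search(text)
        elapsed = time.perf_counter() - start
        print(f"{pattern!r:>20} on {len(text):>6} chars: {elapsed:8.4f} s")
        if elapsed > budget:
            print("    -> time budget exceeded; investigate this pattern")
            break
        length *= 2

# Partially matching input: plenty of word boundaries and xx, never foo.
profile_failure(r"\b.*xx.*foo", lambda n: "xx word " * (n // 8))
profile_failure(r"^\S*xx\S*foo", lambda n: "xx word " * (n // 8))
```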
Minimize data extraction from the host
TPL patterns allow you to run commands on the host and retrieve data, such as version information, to be searched with regular expressions.
A common error is to retrieve a large amount of information, for example:
cat file1 file2 file3...
and then run a regular expression over the result to extract a single piece of information.
This can return a huge amount of data, so not only does the regular expression matching take a long time, but transferring the data over the network is slow and it takes up a lot of space in the data store.
A better approach is to retrieve only the information you are interested in by running a command such as grep or head on the host. For example:
grep VERSION file1 file2 file3...
You can then run the regular expression against the much shorter text that is returned.
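The sketch below illustrates the idea on a throwaway local file. It assumes a Unix-like system with grep on the PATH, and the file contents and the VERSION= format are invented for illustration; in a real pattern the command would be run on the target host:

```python
# Sketch of the idea on a throwaway local file. Assumes a Unix-like system
# with grep on the PATH; the file contents and the VERSION= format are
# invented for illustration. In a real pattern the command runs on the host.
import os
import re
import subprocess
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write("irrelevant setting\n" * 10_000)
    f.write("VERSION=4.2.1\n")
    f.write("more irrelevant settings\n" * 10_000)
    path = f.name

# Retrieve only the lines of interest, rather than the whole file,
# then run the regular expression over the short result.
filtered = subprocess.run(
    ["grep", "VERSION", path], capture_output=True, text=True
).stdout

match = re.search(r"VERSION=(\S+)", filtered)
print(f"filtered text: {len(filtered)} characters, version: {match.group(1)}")

os.unlink(path)
```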
Don’t use groups unless absolutely necessary
When parentheses are used to enclose part of a regular expression in a group, the regular expression engine must do extra work to save the text matched by the group in case it is needed later. In some cases this can slow matching down by a factor of four or more.
If parentheses are needed, for example around a repeated part of the regular expression, but the contents of the group are not needed later, use the non-capturing form of parentheses, (?:...). This is usually still slower than no parentheses at all, but faster than capturing parentheses, as the timing sketch below illustrates. For example:
- (abc|def) is slow; use it only if you need the matched text later.
- (?:abc|def) is faster.
- abc|def is fastest.
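The timing sketch below compares the three forms using Python's re module and timeit. The exact ratios depend heavily on the regular expression engine, so treat the numbers as indicative only:

```python
# Timing sketch comparing capturing groups, non-capturing groups, and no
# group at all. Python's re module and timeit are assumed; the exact ratios
# depend on the regular expression engine.
import re
import timeit

text = "xyz def uvw abc " * 2_000

for pattern in (r"(abc|def)", r"(?:abc|def)", r"abc|def"):
    regex = re.compile(pattern)
    seconds = timeit.timeit(
        lambda: sum(1 for _ in regex.finditer(text)), number=50
    )
    print(f"{pattern:12} {seconds:7.3f} s for 50 passes")
```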
Consider your regular expressions
While these guidelines are helpful, there is no substitute for stepping back and re-examining each regular expression carefully, thinking about both its efficiency and its accuracy.
Regular expression optimization for TPL triggers
Regular expression indexing is used to improve trigger performance. The index quickly eliminates regular expressions that clearly cannot match a given string, which greatly reduces the number of regular expressions that must actually be run. As a result, the inference service can often avoid running the pathological regular expressions described in the rest of this document.
When optimizing regular expressions for TPL pattern triggers, it helps to have a basic understanding of how indexes work.
Indexing
Regular expressions are indexed by an attribute called a hint. A hint is a case-insensitive string that is a substring of every string that matches the expression. For example, the expression Hoist the (backstay|mainstay), ye (landlubber|scurvy dog)! has three distinct hints: "Hoist the ", ", ye " and "!".
For simplicity, each expression is indexed by its longest hint only. In the example above, that hint is "Hoist the ".
Some regular expressions have no hints at all. For example, \d+[-/]\d+[-/]\d+ has no substring that must appear in every matching string. The hint calculator is also (currently) quite naive and misses some hints that could be extracted. Expressions without hints have to be handled separately.
When a string is to be matched, the index is queried for the set of regular expressions whose hints appear as substrings of that string. Once built, the index can answer this query very quickly. This set is combined with the set of expressions that have no hints, giving the set of all regular expressions that could possibly match the string. Each expression in the set is then tried in turn, either until one matches or until none are left, meaning that the string does not match.
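The following toy model sketches the lookup logic described above. It is not the real index implementation: the data structures, the pattern set and the hints are all invented for illustration, and a real index would not scan hints linearly:

```python
# Toy model of the lookup logic described above. This is not the real index
# implementation: the data structures, the pattern set and the hints are all
# invented for illustration, and a real index would not scan hints linearly.
import re

# (hint, pattern) pairs; each hint is a lower-case substring of every
# possible match of its pattern.
HINTED = [
    ("hoist the ", re.compile(r"Hoist the (backstay|mainstay)", re.IGNORECASE)),
    ("version=", re.compile(r"VERSION=(\S+)", re.IGNORECASE)),
]
# Patterns for which no hint could be computed must always be tried.
UNHINTED = [re.compile(r"\d+[-/]\d+[-/]\d+")]

def candidate_patterns(text):
    """Return only the patterns whose hint appears in the text."""
    lowered = text.lower()
    return [p for hint, p in HINTED if hint in lowered] + UNHINTED

def first_match(text):
    """Try each candidate in turn until one matches, or report no match."""
    for pattern in candidate_patterns(text):
        found = pattern.search(text)
        if found:
            return found
    return None

print(first_match("released on 2024-05-01"))   # only the unhinted pattern is tried
print(first_match("VERSION=4.2.1"))            # hint "version=" selects a pattern
print(first_match("nothing of interest"))      # no candidate matches -> None
```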
Try to give some hints
A regular expression without a hint must be run against every string presented to the index. It is therefore important to write regular expressions for which hints can be computed.
The index's hint extraction algorithm cannot handle the following kinds of sub-expression; each of them breaks a hint:
- built-in character classes, such as \d, \w and .
- custom character classes, such as [ijkxyz]
- optional expressions, such as ? and *
- repeated groups, such as (foo)+
- alternations, such as foo|bar
Using alternation at the top level of an expression always prevents hints from being extracted. Alternation inside a group does not prevent hints from being extracted from the parts of the expression outside the group.
Try to make hints unique
If a regular expression's only hint is a very common string such as java, it still has to be checked against a very large number of strings. Try to design your expressions so that their longest hint is reasonably rare.