The Java regex engine uses recursive method calls to implement backtracking. Therefore when a repetition inside a regular expression contains
multiple paths (i.e. the body of the repetition contains an alternation (|
), an optional element or another repetition), trying to match
the regular expression can cause a stack overflow on large inputs. This does not happen when using a possessive quantifier (such as *+
instead of *
) or when using a character class inside a repetition (e.g. [ab]*
instead of (a|b)*
).
The size of the input required to overflow the stack depends on various factors, including of course the stack size of the JVM. One thing that
significantly increases the size of the input that can be processed is if each iteration of the repetition goes through a chain of multiple constant
characters because such consecutive characters will be matched by the regex engine without invoking any recursion.
For example, on a JVM with a stack size of 1MB, the regex (?:a|b)*
will overflow the stack after matching around 6000 characters
(actual numbers may differ between JVM versions and even across multiple runs on the same JVM) whereas (?:abc|def)*
can handle around
15000 characters.
Since often times stack growth can’t easily be avoided, this rule will only report issues on regular expressions if they can cause a stack overflow
on realistically sized inputs. You can adjust the maxStackConsumptionFactor
parameter to adjust this.
Noncompliant code example
Pattern.compile("(a|b)*"); // Noncompliant
Pattern.compile("(.|\n)*"); // Noncompliant
Pattern.compile("(ab?)*"); // Noncompliant
Compliant solution
Pattern.compile("[ab]*"); // Character classes don't cause recursion the way that '|' does
Pattern.compile("(?s).*"); // Enabling the (?s) flag makes '.' match line breaks, so '|\n' isn't necessary
Pattern.compile("(ab?)*+"); // Possessive quantifiers don't cause recursion because they disable backtracking