Introduction
Semgrep can match generic patterns in languages that it does not yet support. Use generic pattern matching for languages that do not have a parser, configuration files, or other structured data such as XML. Generic pattern matching can also be helpful in files containing multiple languages, even if the languages are otherwise supported, such as HTML with embedded JavaScript or PHP code. In those cases, you can also consider Extract mode (experimental), but generic patterns may be more straightforward and still effective.
Generic pattern matching has the following properties:
- A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
- ... (ellipsis operator) allows skipping non-matching elements, up to 10 lines down from the last match.
- $X (metavariable) matches any word.
- $...X (ellipsis metavariable) matches a sequence of words, up to 10 lines down from the last match.
- Indentation determines primary nesting in the document.
- Common ASCII braces (), [], and {} introduce secondary nesting but only within single lines. Therefore, misinterpreted or mismatched braces don’t disturb the structure of the rest of the document.
- The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
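As a minimal sketch of how these properties show up in a rule, the following rule flags a debug setting in INI-style files. The rule id, the file glob, and the debug key are hypothetical and chosen only for illustration; the generic entry under languages is what selects generic pattern matching.

```yaml
rules:
  - id: example-generic-debug-setting  # hypothetical rule id
    languages:
      - generic                        # select generic pattern matching
    paths:
      include:
        - "*.ini"                      # assumed target files, adjust as needed
    # $VALUE is a metavariable: it captures a single word such as true or 1.
    pattern: debug = $VALUE
    message: Debug setting found with value $VALUE
    severity: WARNING
```

With this pattern, a line such as debug = true should match, with $VALUE bound to true.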
Caveats and limitations of generic mode
Semgrep can reliably understand the syntax of natively supported languages. Generic mode is intended for unsupported languages and consequently comes with specific limitations. It works with any human-readable text, as long as the text is primarily based on ASCII symbols. Because generic mode does not understand the syntax of the language you are scanning, the quality of the results may differ from language to language or even depend on the specific code. As a consequence, generic mode works well for some languages, but it does not always give consistent results. In general, it is possible, and often easy, to write code in unusual ways that prevent generic mode from matching.
Example: In XML, one can write &#x48;&#x65;&#x6C;&#x6C;&#x6F; instead of Hello. If a rule pattern in generic mode is Hello, Semgrep is unable to match &#x48;&#x65;&#x6C;&#x6C;&#x6F;, unlike if it had full XML support.
With respect to Semgrep operators and features:
- Metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. Metavariables can’t capture sequences of tokens such as hello, world (in this case, there are three tokens: hello, a comma, and world).
- The ellipsis operator is supported and spans, at most, 10 lines.
- Pattern operators such as either/not/inside are supported.
- Inline regular expressions for strings ("=~/word.*/") are not supported.
Troubleshooting
Common pitfall #1: not enough ...
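Rule of thumb: If the pattern commonly matches many lines, use ... ... (20 lines), or ... ... ... (30 lines), to ensure that all lines are matched.
A single ellipsis spans at most 10 lines, so an innocuous-looking pattern that should match a call to a function f() fails when the call’s arguments span more than 10 lines. The solution is to use multiple ... in the pattern.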
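A minimal sketch of this fix, assuming the call to f() can spread its arguments over up to about 20 lines. The rule id and message are hypothetical; only the doubled ellipsis in the pattern is the point of the example.

```yaml
rules:
  - id: example-generic-long-call  # hypothetical rule id
    languages:
      - generic
    # Two ellipses in a row extend the span from 10 lines to about 20.
    pattern: f(... ...)
    message: Call to f() found
    severity: WARNING
```

If calls in your code base can be longer still, a third ellipsis extends the span again, at the cost of looser matching.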
Common pitfall #2: not enough indentation
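Rule of thumb: If the target code is always indented, use indentation in the pattern.
Suppose the goal is to match the system sections containing a name field. A pattern written without indentation can also match the name field in the user section, because nothing ties the name field to the system section. Indenting the body of the pattern restricts the match to the name field in the system section.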
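The indented form might look like the following sketch. The target format is an assumption: it supposes sections introduced by a [system] or [user] header, with indented key = value lines underneath. The rule id and message are hypothetical.

```yaml
rules:
  - id: example-generic-system-name  # hypothetical rule id
    languages:
      - generic
    # The body is indented under [system], so the ellipsis and the
    # name field must stay inside that section of the target.
    pattern: |
      [system]
        ...
        name = ...
    message: The system section contains a name field
    severity: WARNING
```

Because the document must be at least as indented as the pattern, a name line that sits under a later, unindented [user] header should no longer satisfy the indented part of the pattern.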
Handling line-based input
This section explains how to use Semgrep’s generic mode to match single lines of code using an ellipsis metavariable. Many simple configuration formats are collections of key and value pairs delimited by newlines. For example, consider extracting the password value from made-up input in which a password = p@$$w0rd line is followed by a server = example.com line.
A pattern such as password = $PASSWORD does not capture the whole value. In generic mode, a metavariable captures only a single word, so the pattern assigns p to $PASSWORD instead of the full value p@$$w0rd.
To match an arbitrary sequence of items and capture their value in the example:
- Use a named ellipsis by changing the pattern to password = $...PASSWORD. This captures too much: the values assigned to $...PASSWORD are now p@$$w0rd and server = example.com. In generic mode, an ellipsis extends until the end of the current block or up to 10 lines below, whichever comes first. To prevent this behavior, continue with the next step.
- In the Semgrep rule, specify the option that restricts the ellipsis to a single line.
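The two steps above can be combined as in the following sketch. It assumes that the key in question is generic_ellipsis_max_span under options, which sets how many additional lines an ellipsis may span in generic mode. The rule id and message are hypothetical.

```yaml
rules:
  - id: example-generic-password-line  # hypothetical rule id
    languages:
      - generic
    options:
      # Keep every ellipsis on the current line (0 additional lines).
      generic_ellipsis_max_span: 0
    # $...PASSWORD is an ellipsis metavariable: it captures a sequence
    # of tokens instead of a single word.
    pattern: password = $...PASSWORD
    message: "Password found in configuration: $...PASSWORD"
    severity: WARNING
```

With the span set to 0, $...PASSWORD should capture p@$$w0rd only, without running on into the server line.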
Ignoring comments
By default, generic mode does not know about comments or other code that can be ignored. Consider scanning for CSS code that sets the text color to blue when the target code contains C-style comments. Use options.generic_comment_style in the rule to ignore C-style comments in such a case.
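A rule along these lines might look like the following sketch. The rule id and message are hypothetical, and the example assumes a target such as a CSS declaration with a /* ... */ comment between the property name and its value.

```yaml
rules:
  - id: example-generic-css-blue  # hypothetical rule id
    languages:
      - generic
    options:
      # Ignore C-style comments of the form /* ... */ in the target.
      generic_comment_style: c
    pattern: |
      color: blue
    message: Text color is set to blue
    severity: WARNING
```

Without the option, the comment tokens would be treated as ordinary content, and a comment placed between color: and blue would likely prevent the match.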
Command line example
Sample pattern: exec(...)
Sample target file exec.txt contains:
Semgrep Registry rules for generic pattern matching
You can peruse existing generic rules in the Semgrep registry. In general, short patterns on structured data perform the best.
Cheat sheet
The generic tab of the Semgrep cheat sheet shows some examples of what matches and what doesn’t match.