This page covers advanced taint analysis techniques for use when writing rules to catch complex injection bugs. If you are new to writing taint mode rules, begin with Overview.
Taint by side effect
Taint sources by side effect
Consider Python code that calls make_tainted(x) and then sink(x), where make_tainted is a function that makes its argument tainted by side effect. To specify this kind of source, set by-side-effect: true on the source specification.
When by-side-effect: true is enabled and the source specification matches a variable, or more generally, an l-value, exactly, Semgrep assumes that the variable or l-value becomes tainted by side effect at the places where the source specification produces a match.
The matched occurrence is also tainted: the occurrence of x in make_tainted(x) is itself tainted too. If you do not want this to be the case, set by-side-effect: only instead.
NOTE: You must use focus-metavariable: $X to focus the match on the l-value that you want to taint; otherwise, by-side-effect does not work.
If the source does not set by-side-effect, then only the occurrence of x in make_tainted(x) is tainted, not the occurrence of x in sink(x). The source specification matches only the first occurrence, and without by-side-effect: true, Semgrep does not recognize that make_tainted updates the variable x by side effect. Thus, a taint rule using such a specification does not produce any finding.
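The scenario above can be sketched as target code. This is a minimal illustration rather than the original example: the bodies of make_tainted and sink are hypothetical stubs added only so the snippet runs, and the comments restate what the text says a source using by-side-effect: true with focus-metavariable: $X would do.

```python
def make_tainted(values: list) -> None:
    # Hypothetical stub: mutates its argument in place, like a side-effecting source.
    values.append("attacker-controlled")


def sink(values: list) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return ",".join(values)


def handler() -> str:
    x = []
    make_tainted(x)  # source match; with by-side-effect: true, x stays tainted afterwards
    return sink(x)   # the same variable x reaches the sink, so a finding is expected here
```

Without by-side-effect, only the x on the make_tainted line would be tainted, which is why the sink line would go unreported.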
Original implementation for tainting variables by side effect
Before the implementation of by-side-effect, the official workaround to obtain similar behavior was a source specification that used the ... ellipsis operator to match every occurrence of $X after make_tainted($X). Such a definition says that every occurrence of $X after make_tainted($X) must be considered a source. However, this approach has two main limitations:
- It overrides any sanitization that can be performed on the code matched by $X. For example, a call sink(x) that follows make_tainted(x) is reported as tainted even if x has been sanitized in between.
- The ... ellipsis operator has limitations. There is code in which a finding is expected but Semgrep reports none when such a source specification is in use.
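The first limitation can be sketched as follows. The function bodies are hypothetical stubs so that the snippet runs, and sanitize is an assumed helper name; the comments describe the behavior the text attributes to the old workaround.

```python
def make_tainted(values: list) -> None:
    # Hypothetical stub: taints its argument in place.
    values.append("<script>")


def sanitize(values: list) -> list:
    # Hypothetical sanitizer: returns an escaped copy of the list.
    return [v.replace("<", "&lt;").replace(">", "&gt;") for v in values]


def sink(values: list) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return ",".join(values)


def handler() -> str:
    x = []
    make_tainted(x)
    x = sanitize(x)  # x holds sanitized data from this point on
    return sink(x)   # the workaround still treats this x as a source: false positive
```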
Taint sanitizers by side-effect
Consider Python code that calls check_if_safe(x) and then sink(x), where it is guaranteed that, after check_if_safe(x), the value of x must be a safe one. To specify this kind of sanitizer, set by-side-effect: true on the sanitizer specification.
When the sanitizer sets by-side-effect and the sanitizer specification matches a variable, or more generally, an l-value, exactly, Semgrep assumes that the variable or l-value is sanitized by side effect at the places where the sanitizer specification produces a match.
If the sanitizer does not set by-side-effect, then only the occurrence of x in check_if_safe(x) is sanitized, not the occurrence of x in sink(x). The sanitizer specification matches only the first occurrence, and without by-side-effect: true, Semgrep doesn’t know that check_if_safe updates and sanitizes the variable x by side effect. Thus, a taint rule using such a specification does produce a finding for sink(x).
NOTE: Ensure that you use focus-metavariable: $X to focus the match on the l-value that you want to sanitize. Otherwise, by-side-effect does not work as expected.
Original implementation for tainting sanitizers by side effect
Before the implementation of by-side-effect, the official workaround to obtain similar behavior was a sanitizer specification that used the ... ellipsis operator to match every occurrence of $X after check_if_safe($X). Such a specification tells Semgrep that every occurrence of $X after check_if_safe($X) must be considered sanitized. This approach has two main limitations:
- It overrides any further tainting that can be performed on the code matched by $X. For example, a call sink(x) that follows check_if_safe(x) is not reported as tainted even if x has been tainted again in between.
- The ... ellipsis operator has limitations. For example, Semgrep can still return matches even though x has been sanitized in both branches of a conditional.
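Both limitations can be sketched in target code. The stubs below are hypothetical and exist only so the snippet runs; the comments restate the behavior the text attributes to the old workaround, not the output of a verified scan.

```python
def check_if_safe(x: str) -> None:
    # Hypothetical check: raises unless x is purely alphanumeric.
    if not x.isalnum():
        raise ValueError("unsafe value")


def sink(x: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return x


def tainted_again(x: str, user_input: str) -> str:
    check_if_safe(x)
    x = user_input  # x is tainted again after the check
    return sink(x)  # the workaround still treats x as sanitized: missed finding


def both_branches(x: str, strict: bool) -> str:
    if strict:
        check_if_safe(x)
    else:
        check_if_safe(x)
    return sink(x)  # x was checked on every path, yet the workaround can still report it
```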
Taint function arguments
Taint function arguments as sources
To specify that an argument of a function must be considered a taint source, you can write a pattern that matches the argument and focus on it with focus-metavariable: $X. The use of focus-metavariable: $X is essential, and using pattern: $X is not equivalent. With focus-metavariable: $X, Semgrep matches the formal parameter exactly.
Compare this with a source specification that uses pattern: $X instead. The pattern: $X does not match the formal parameter itself, but matches all its uses inside the function definition. Even if x is sanitized via x = sanitize(x), the occurrence of x inside sink(x) is a taint source itself (due to pattern: $X) and so sink(x) is tainted.
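The difference between the two source specifications can be read off a small piece of target code. This is a sketch: foo is a hypothetical function name, and sanitize and sink are stubs added so the snippet runs.

```python
def sanitize(x: str) -> str:
    # Hypothetical sanitizer: strips single quotes.
    return x.replace("'", "")


def sink(x: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return x


def foo(x: str) -> str:  # the formal parameter x is what focus-metavariable: $X selects
    x = sanitize(x)      # sanitization of the parameter
    # focus-metavariable: $X -> only the parameter is a source, so nothing tainted arrives here.
    # pattern: $X -> this use of x is itself a source, so sink(x) is reported.
    return sink(x)
```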
Taint function arguments as sinks
You can specify that only one, or a subset, of the arguments of a function is the actual sink by using focus-metavariable. A rule written this way treats only the focused argument of sink as the sink, rather than the function sink itself. If taint goes into any other parameter of sink, then that is not considered a problem.
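As an illustration, the target code below assumes a rule whose sink focuses on the first argument of sink. That choice of argument, the two-parameter signature, and the stub body are assumptions made for this sketch.

```python
def sink(query: str, label: str) -> tuple:
    # Hypothetical stub: only the first argument is assumed to be dangerous.
    return (query, label)


def handler(user_input: str) -> list:
    return [
        sink(user_input, "audit"),     # taint in the focused argument: finding expected
        sink("SELECT 1", user_input),  # taint in another argument: not considered a problem
    ]
```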
NOTE: If you specify a sink such as sink(...), then any tainted data passed to sink, through any of its arguments, results in a finding.
Custom propagators
To better understand custom propagators, consider Python code where an unsafe user_input is stored in a set data structure. A random element from the set is then passed into a sink function. This random element can be user_input itself, leading to an injection vulnerability.
A rule that specifies only the source and the sink cannot find this vulnerability, because Semgrep is not aware that executing s.add(x) makes x one of the elements in the set data structure s.
To address this, specify a propagator using the pattern-propagators key, with the pattern $S.add($E), from: $E, and to: $S.
With this propagator, Semgrep finds the pattern $S.add($E), and it checks whether the code matched by $E is tainted. If it is tainted, Semgrep propagates that same taint to the code matched by $S. Thus, adding tainted data to a set marks the set itself as tainted.
The propagation is side-effectful: s becomes tainted by side effect after s.add(x). This is due to by-side-effect: true being the default for propagators, and because s is an l-value.
In general, a taint propagator must specify the following requirements:
- A pattern containing two metavariables. These two metavariables specify where taint is propagated from and to.
- The to and from metavariables. These metavariables must match an expression.
  - The from metavariable specifies the entry point of the taint.
  - The to metavariable specifies where the tainted data is propagated to, typically an object or data structure. If the option by-side-effect is enabled (as it is by default) and the to metavariable matches an l-value, the propagation is side-effectful.
In this example, the pattern $S.add($E) includes two metavariables, $S and $E. Given from: $E, to: $S, $E matching x, and $S matching s, when x is tainted, then s becomes tainted by side effect with the same taint as x.
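The set scenario can be sketched as target code. The function names handler and sink and their bodies are hypothetical; the comments map each line to the propagator described above.

```python
def sink(value: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return value


def handler(user_input: str) -> str:
    s = set()
    x = user_input
    s.add(x)               # propagator $S.add($E): taint flows from x ($E) to s ($S)
    element = s.pop()      # an arbitrary element of the set; here it can only be x
    return sink(element)   # per the text, the finding is expected once s is tainted
```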
Taint propagators are also useful in Java for specifying that, when iterating over a tainted collection, the individual elements must also be considered tainted.
Propagate without side-effect
Taint propagators can be used in many different ways, and in some cases, you might not want taint to propagate by side effect. You can avoid this behavior by disabling by-side-effect, which is enabled by default.
For example, suppose that whenever there is an if block whose condition is something($FROM), we want to propagate taint from $FROM to any function that is being called without arguments, $TO().
With by-side-effect disabled, the sink occurrence that is inside the if block is tainted, but this does not affect the sink occurrence outside the if block.
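A sketch of target code for this case is shown below. The stub bodies and the handler function are hypothetical; the comments restate what the text says about the two sink occurrences.

```python
def something(value: str) -> bool:
    # Hypothetical condition over tainted data.
    return bool(value)


def sink() -> str:
    # Hypothetical stub: a function called without arguments, matching $TO().
    return "called"


def handler(tainted: str) -> tuple:
    inside = None
    if something(tainted):
        inside = sink()  # inside the if block: taint from $FROM is propagated to this call
    outside = sink()     # outside the if block: unaffected when by-side-effect is disabled
    return inside, outside
```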
Minimize false positives
The following rule options can be used to minimize false positives:
| Rule option | Default | Description |
|---|---|---|
| taint_assume_safe_booleans | false | Boolean data is never considered tainted (works better with type annotations). |
| taint_assume_safe_numbers | false | Numbers (integers, floats) are never considered tainted (works better with type annotations). |
| taint_assume_safe_indexes | false | A tainted index expression I does not make an access expression E[I] tainted (it is only tainted if E is tainted). |
| taint_assume_safe_functions | false | A function call like F(E) is not considered tainted even if E is tainted. Note: When using Pro’s interprocedural taint analysis, this only applies to functions for which Semgrep cannot find a definition. |
| taint_only_propagate_through_assignments 🧪 | false | Disables all implicit taint propagation except for assignments. |
Restrict taint by type 🧪
If you enable the taint_assume_safe_booleans option, Semgrep automatically sanitizes Boolean expressions when it can infer that the expression resolves to a Boolean.
For example, comparing a tainted string against a constant string isn’t considered a tainted expression.
Similarly, if you enable taint_assume_safe_numbers, Semgrep automatically sanitizes numeric expressions when it can infer that the expression is numeric.
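The two options can be illustrated with target code such as the following. The function names are hypothetical, and whether Semgrep infers the numeric type in the second function is an assumption that depends on the type annotation being picked up.

```python
def sink(value):
    # Hypothetical stub standing in for a dangerous operation.
    return value


def check_role(tainted: str) -> bool:
    is_admin = tainted == "admin"  # comparison against a constant string yields a Boolean
    return sink(is_admin)          # taint_assume_safe_booleans: true -> not treated as tainted


def double(count: int) -> int:
    total = count * 2   # the int annotation is what could let the expression be inferred as numeric
    return sink(total)  # taint_assume_safe_numbers: true -> not treated as tainted if inferred
```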
NOTE: Semgrep Pro’s ability to infer types for expressions varies depending on the language. For example, in Python, type annotations are not always present, and the + operator can also be used to concatenate strings. Semgrep also ignores the types of functions and classes coming from third-party libraries.
Assume tainted indexes are safe
By default, Semgrep assumes that accessing an array-like object with a tainted index (that is, obj[tainted]) is itself a tainted expression, even if the object itself is not tainted. Setting taint_assume_safe_indexes: true makes Semgrep assume that these expressions are safe.
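A minimal target-code sketch of a tainted index into an untainted object follows. COLORS, pick, and sink are hypothetical names used only for illustration.

```python
COLORS = ["red", "green", "blue"]  # not tainted: a constant list


def sink(value: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return value


def pick(tainted_index: int) -> str:
    color = COLORS[tainted_index]  # default: tainted because the index is tainted
    return sink(color)             # taint_assume_safe_indexes: true -> treated as safe
```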
Assume function calls are safe
NOTE: A function call is referred to as opaque when Semgrep doesn’t have access to its definition, which is necessary to examine it and determine its taint behavior. For example, with an opaque function, Semgrep cannot determine whether a function call propagates any taint that comes through its inputs. In Semgrep Community Edition (CE), where taint analysis is intraprocedural, all function calls are opaque. In Semgrep Pro, with interprocedural taint analysis, an opaque function could originate from a third-party library.
By default, when an opaque function such as some_safe_function receives tainted data as input, Semgrep assumes that it also returns tainted data as output. As a result, a finding is produced.
Setting taint_assume_safe_functions: true makes Semgrep assume that opaque function calls are safe and do not propagate any taint. If you want specific functions to propagate taint regardless, you can specify them using custom propagators.
Propagate only through assignments 🧪
Setting taint_only_propagate_through_assignments: true makes Semgrep propagate taint only through trivial assignments of the form <l-value> = <tainted-expression>. Any other kind of taint propagation must be specified explicitly.
For example, with this option, neither unsafe_function(tainted) nor tainted_string + "foo" is considered a tainted expression.
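The effect of this option can be sketched in target code. The stub bodies and the handler function are hypothetical; the comments follow the rule stated above.

```python
def unsafe_function(value: str) -> str:
    # Hypothetical stub for an arbitrary function call.
    return value.upper()


def sink(value: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return value


def handler(tainted: str) -> list:
    copy = tainted                      # trivial assignment: taint still propagates
    wrapped = unsafe_function(tainted)  # function call: not considered tainted with the option
    joined = tainted + "foo"            # concatenation: not considered tainted with the option
    return [sink(copy), sink(wrapped), sink(joined)]
```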
Metavariables, rule messages, and unification
The patterns specified by pattern-sources and pattern-sinks (and pattern-sanitizers) are all independent of each other. If a metavariable used in pattern-sources has the same name as a metavariable used in pattern-sinks, these are considered to be different metavariables.
In the message of a taint-mode rule, you can refer to any metavariable bound by pattern-sinks, as well as to any metavariable bound by pattern-sources that does not conflict with a metavariable bound by pattern-sinks.
Semgrep can also treat metavariables with the same name as the same metavariable; to turn this behavior on, set taint_unify_mvars: true using rule options. Unification enforces the behavior where whatever a metavariable binds to in each of these operators is, syntactically speaking, the same piece of code. For example, if a metavariable binds to a code variable x in the source match, it must bind to the same code variable x in the sink match. In general, unless you know what you are doing, avoid metavariable unification between sources and sinks.
The following example demonstrates the use of source and sink metavariable unification:
Taint mode sensitivity
Field sensitivity
The taint engine provides basic field sensitivity support. It can:
- Track that x.a.b is tainted, but x or x.a is not tainted. If x.a.b is tainted, any extension of x.a.b (such as x.a.b.c) is considered tainted by default.
- Track that x.a is tainted, but remember that x.a.b has been sanitized. Thus, the engine records that x.a.b is not tainted, but x.a or x.a.c are still tainted.
NOTE: The taint engine tracks taint per variable, not per object in memory. The taint engine does not track aliasing.
Index sensitivity 🧪
NOTE: Index sensitivity is a Semgrep Pro feature.
The following applies to index sensitivity:
- This feature is only for access using the built-in a[E] syntax.
- This feature works for statically constant indexes that are integers, such as a[42], or strings, such as a["foo"].
- If an arbitrary index a[i] is sanitized, then every index becomes clean of taint.
Report findings on the source 🧪
NOTE: Reporting findings on the source of taint is a Semgrep Pro feature.
To make Semgrep report findings on the taint source, set the rule option taint_focus_on to source.
Restrict taint to at-exit sinks 🧪
NOTE: At-exit taint sinks is a Semgrep Pro feature.
By setting at-exit: true, you can restrict a sink specification to only match at exit statements, that is, statements after which the control flow exits the function being analyzed.
An at-exit sink can match return statements, which are always exit statements, or function calls occurring as exit statements.
Unlike regular sinks, at-exit sinks trigger a finding if any tainted l-value reaches the location of the sink. For example, an at-exit sink specification that matches return statements triggers a finding at a return 0 statement if some tainted l-value reaches the return, even if return 0 itself is not tainted. The location itself is the sink, rather than the code that is located there.
You can use this behavior, for example, to check that file descriptors are being closed within the same function where they were opened.
In such a check, a print(content) statement that ends a function is reported when the control flow exits the function at that point and the file has not been closed.
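The file descriptor check can be pictured with target code like the following. The function names are hypothetical, and treating the open call as the taint source is an assumption of this sketch; how a given rule models closing the file is not shown here.

```python
import os


def show_file(path: str) -> None:
    fd = open(path)      # assumed source: the opened file
    content = fd.read()
    print(content)       # exit statement: control flow leaves here and fd was never closed


def show_file_closed(path: str) -> None:
    fd = open(path)
    content = fd.read()
    fd.close()           # the file is closed before the exit statement
    print(content)


if __name__ == "__main__":
    show_file_closed(os.devnull)
```

In the first function, the open file is still live when the function exits at print(content), which is the situation the text describes as reported.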
Track control sources 🧪
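NOTE: Control taint sources is a Semgrep Pro feature.
To declare a control taint source, set control: true on the source specification. A control source lets you check reachability, for example, whether a call to foo() could be followed by a call to bar().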
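A sketch of target code for the foo() and bar() case follows. It assumes a rule in which foo() is the control source and bar() is the sink; the stub bodies and the handler function are hypothetical.

```python
def foo() -> str:
    # Hypothetical stub: assumed to be matched by the control source.
    return "foo"


def bar() -> str:
    # Hypothetical stub: assumed to be matched by the sink.
    return "bar"


def handler(flag: bool):
    foo()              # control source: execution has passed through foo()
    if flag:
        return bar()   # bar() can run after foo() on this path, so a finding is expected
    return None
```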
NOTE: Use taint labels to combine both data and control sources in the same rule.
Taint labels 🧪
Taint labels increase the expressiveness of taint analysis by allowing you to specify and track different kinds of tainted data in one rule using labels. This functionality is helpful for more complex use cases, such as when data becomes dangerous in several steps that are hard to specify through a single pair of source and sink.
Attach a label key to the taint source, such as label: TAINTED or label: INPUT. Semgrep accepts any valid Python identifier as a label.
Restrict a taint source to a subset of labels using the requires key, for example, requires: INPUT.
Combine labels using the requires key. To do so, use Python’s Boolean operators, such as requires: LABEL1 and not LABEL2.
For example, suppose that user_input is dangerous, but only when it passes through the evil function. This can be specified with taint labels.
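The user_input and evil scenario can be sketched as target code. The stub bodies and the handler function are hypothetical, and the label names a rule would use for this are not shown; the comments only restate which call the text describes as dangerous.

```python
def evil(value: str) -> str:
    # Hypothetical stub: the step that makes user input dangerous.
    return value + "!"


def sink(value: str) -> str:
    # Hypothetical stub standing in for a dangerous operation.
    return value


def handler(user_input: str) -> tuple:
    direct = sink(user_input)              # user_input alone: not reported
    through_evil = sink(evil(user_input))  # user_input passed through evil: reported
    return direct, through_evil
```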
Multiple requires expressions in taint labels
You can assign an independent requires expression to each metavariable matched by a sink. Given $OBJ.foo($ARG), you can require that $OBJ has label XYZ and $ARG has label TAINTED, and focus the match on $ARG with focus-metavariable: $ARG.