This document is about adding support for a new programming language in Semgrep using the tree-sitter technology. Most languages in semgrep use a tree-sitter parser, though you may also need to update the menhir parser.
Repositories involved directly:
- semgrep: the semgrep command line program.
- ocaml-tree-sitter-semgrep: language-specific setup, generates C/OCaml parsers for semgrep.
- A new repository semgrep-LANG for the language you’re adding: this is a C or OCaml parser generated from ocaml-tree-sitter-semgrep by a Semgrep administrator.
- semgrep-interfaces
Placeholder values
This document uses the placeholder LANG to indicate that you should substitute the name of your language as the value in the given context. For example, if your language is Ruby, and the document’s instructions read:
Create a new file TEST_LANG_<LANG>.txt where LANG is in lowercase.
The name of your file should be TEST_LANG_ruby.txt.
Create a file Pretty_print.EXTENSION with the filename extension of your language:
The name of your file should be Pretty_print.rb.
semgrep repository overview
There are some GitHub repositories involved in porting a language.
Here is the file hierarchy of the semgrep
repository:
In addition to ocaml-tree-sitter-semgrep, you’ll need a new repository semgrep-LANG to host the generated parser code.
Ask someone from the Semgrep team to create one for you. For this, they should use the template
semgrep-lang-template when creating the repository.
The instructions for adding a language start in ocaml-tree-sitter-semgrep, as indicated below. Be careful that you are always in the correct repository!
Set up ocaml-tree-sitter-semgrep
As a model, you can use the existing setup for ruby or javascript. The most complicated setup is for typescript and tsx.
Expedited setup
If you’re lucky, the language you want to add can be added with the script add-simple-lang.
This approach expects a grammar.js file at the root of the project. If this simplified approach fails, use the Manual setup instructions below to understand what’s going on or to set things up manually.
Manual setup
From the ocaml-tree-sitter-semgrep repository, do the following:
- Make a test/ok directory. Inside the directory, create a simple hello-world program for the language you are porting. Name the program hello-world.EXTENSION.
- Now make a file called extensions.txt and input all the language extensions (.rb, .kt, etc.) for your language in the file.
- Create a file called fyi.list with all the information files, such as semgrep-grammars/src/tree-sitter-LANG/LICENSE, semgrep-grammars/src/tree-sitter-LANG/grammar.js, semgrep-grammars/src/semgrep-LANG/grammar.js, etc. to bundle with the final OCaml/C project.
- Create a test corpus. You can do this by:
  - Running most-starred-for-language to gather projects on which to run parsing stats. Run with the following command: ./scripts/most-starred-for-language LANG YOUR_USERNAME API_KEY
  - Using GitHub advanced search to find the most starred or most forked repositories.
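The first three manual setup steps can be sketched as a small script. This is a minimal sketch, not part of ocaml-tree-sitter-semgrep: the helper name scaffold_language, the lang/LANG location for these files, and the one-entry-per-line layout of extensions.txt and fyi.list are all assumptions. Compare the result with an existing language such as ruby before relying on it.

```python
from pathlib import Path


def scaffold_language(repo_root, lang, extensions, hello_source):
    """Create the starter files for a new language.

    Hypothetical helper: the lang/LANG location and the one-entry-per-line
    file layout are assumptions, not confirmed by the project.
    """
    lang_dir = Path(repo_root) / "lang" / lang
    ok_dir = lang_dir / "test" / "ok"
    ok_dir.mkdir(parents=True, exist_ok=True)

    # hello-world.EXTENSION uses the first extension, without its leading dot.
    first_ext = extensions[0].lstrip(".")
    (ok_dir / f"hello-world.{first_ext}").write_text(hello_source)

    # extensions.txt: every filename extension for the language (.rb, .kt, ...).
    (lang_dir / "extensions.txt").write_text("\n".join(extensions) + "\n")

    # fyi.list: informational files to bundle with the generated parser.
    fyi_entries = [
        f"semgrep-grammars/src/tree-sitter-{lang}/LICENSE",
        f"semgrep-grammars/src/tree-sitter-{lang}/grammar.js",
        f"semgrep-grammars/src/semgrep-{lang}/grammar.js",
    ]
    (lang_dir / "fyi.list").write_text("\n".join(fyi_entries) + "\n")
    return lang_dir
```

For Ruby, a call could look like scaffold_language(".", "ruby", [".rb"], 'puts "Hello, world!"\n'). The test corpus step is left out because it depends on which projects you select.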
To test your language in ocaml-tree-sitter-semgrep, you must build the ocaml-tree-sitter-semgrep OCaml code generator, run it to produce a parser, then run some tests for the parser. Full instructions for this are given in updating-a-grammar under “Testing”. The short instructions are:
- For the first time, build everything with ./scripts/rebuild-everything.
- Subsequently, work from the lang/LANG folder and run make and make test.
The fyi.list file
The fyi.list file was created to specify informational files that should accompany the generated files. These files are typically:
- the source grammar, most often a single grammar.js file.
- the licensing conditions, usually specified in a LICENSE file.
The files listed in fyi.list end up in a fyi folder in tree-sitter-lang. For example, see ruby/fyi.
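Since a missing LICENSE or grammar.js entry is easy to overlook, a quick check of fyi.list can help before building. The sketch below is hypothetical: the function name missing_fyi_files is made up, and it assumes one path per line, blank lines ignored, and paths resolved relative to a base directory that you pass in.

```python
from pathlib import Path


def missing_fyi_files(fyi_list_path, base_dir):
    """Return the entries of fyi.list that do not exist under base_dir.

    Assumptions (not confirmed by the project): one path per line, blank
    lines ignored, and paths resolved relative to base_dir.
    """
    base = Path(base_dir)
    missing = []
    for line in Path(fyi_list_path).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue
        if not (base / entry).is_file():
            missing.append(entry)
    return missing
```

An empty result means every listed file was found under the given base directory.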
Extend the original grammar with semgrep syntax
This is best done after everything else is set up. Some constructs such as semgrep metavariables ($FOO) may already be valid constructs
in the language, in which case there’s nothing to do. Some support for
the semgrep ellipsis ... usually needs to be added as well.
You’ll need to learn how to create tree-sitter grammars.
For an example of how to extend a language, you can:
- Look at what was done for the semgrep extensions of other languages in their respective semgrep-* folders.
- Look at how tree-sitter-typescript extends the JavaScript grammar. This is the file common/define-grammar.js in the tree-sitter-typescript repository.
Parsing statistics
From a language’s folder such as lang/csharp, two targets are available to exercise the generated parser:
- make test: runs on test/ok and test/xfail
- make stat: downloads the code specified in projects.txt and parses the files whose extension matches those in extensions.txt, reporting parsing success in the form of a CSV file.
To gather projects for projects.txt, you can use GitHub searches or the script scripts/most-starred-for-language.py. For GitHub searches, filter by programming language and use a constraint to select large projects, such as “> 100 forks”. Collect the repository URLs and put them into projects.txt.
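The GitHub search described above can also be expressed as a query against GitHub's public repository search API. The following is a minimal sketch and not the project's own script: the function names are hypothetical, the one-URL-per-line layout of projects.txt is an assumption, and the actual HTTP request is left to you.

```python
import json
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def search_url(language, min_forks=100, per_page=20):
    """Build a search URL for large projects written in a given language."""
    # "language:" and "forks:>N" are GitHub search qualifiers.
    query = f"language:{language} forks:>{min_forks}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def repo_urls(response_text):
    """Extract repository URLs from the JSON body of a search response."""
    return [item["html_url"] for item in json.loads(response_text)["items"]]


def write_projects(urls, path="projects.txt"):
    """Write one repository URL per line (assumed projects.txt layout)."""
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")
```

Fetch the URL returned by search_url with any HTTP client, pass the response body to repo_urls, and review the list by hand before writing it out with write_projects.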
Publish generated parsers
After you have pushed your ocaml-tree-sitter-semgrep changes to the main branch, do the following:
- Check that the original grammar.js, src/scanner.c/.cc (if applicable) look clean and have minimal external dependencies.
- In the ocaml-tree-sitter/lang directory, run ./release LANG --dry-run. If this looks good, please ask someone from the Semgrep team to publish the code using ./release LANG.
Troubleshooting
Various errors can occur along the way. Compilation errors in C or C++ are usually due to a missing source file scanner.c or scanner.cc, or a grammar with a name that
doesn’t match the name inside the scanner file. JavaScript files may
also be missing, in particular in the case of grammars that extend
existing grammars such as C++ for C or TypeScript for
JavaScript. Check for require() calls in grammar.js and learn how
this NodeJS primitive resolves paths.
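A name mismatch between the grammar and its scanner can be spotted before compiling. Tree-sitter external scanners export functions named tree_sitter_NAME_external_scanner_create, _destroy, _scan, _serialize and _deserialize, where NAME is the grammar's name. The sketch below compares that NAME with the name field in grammar.js. It is a hypothetical helper based on simple regular expressions, so it will miss names built with preprocessor macros or computed in JavaScript.

```python
import re


def grammar_name(grammar_js_text):
    """Return the value of the `name:` field in a grammar.js source, if any."""
    match = re.search(r"""\bname\s*:\s*['"]([A-Za-z0-9_]+)['"]""", grammar_js_text)
    return match.group(1) if match else None


def scanner_names(scanner_text):
    """Return the language names embedded in external scanner symbols."""
    pattern = (
        r"\btree_sitter_([A-Za-z0-9_]+?)_external_scanner_"
        r"(?:create|destroy|scan|serialize|deserialize)\b"
    )
    return set(re.findall(pattern, scanner_text))


def names_match(grammar_js_text, scanner_text):
    """True when scanner symbols were found and all use the grammar's name."""
    name = grammar_name(grammar_js_text)
    return name is not None and scanner_names(scanner_text) == {name}
```

Read grammar.js and src/scanner.c (or scanner.cc) into strings and pass them to names_match. A False result is a hint to inspect both files by hand, not a definitive diagnosis.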
There may also be errors when generating or compiling
OCaml code. These are likely bugs in ocaml-tree-sitter-semgrep and they should
be reported or fixed right away.
Here are some known types of parsing errors:
- A syntax error. The input program is in the wrong syntax or uses a recent feature that’s not supported yet: make test or directly the parse_LANG program will show the tree produced by tree-sitter with one or more ERROR nodes.
- A “reparsing” error. It’s an error generated after the first successful parsing pass by the tree-sitter parser, during the reparsing pass by the OCaml code performed by the generated Parse.ml file. The error message should tell you something like “cannot interpret tree-sitter’s output”, with details on what code failed to match what pattern. This is most likely a bug in ocaml-tree-sitter-semgrep.
- A segmentation fault. This could be due to a bug in the OCaml/tree-sitter C bindings and should be fixed. A simple test case that reproduces the problem would be nice. See https://github.com/semgrep/ocaml-tree-sitter-semgrep/issues/65
Inputs that trigger such errors can be added to the fail/ folder, preferably in the form of the minimal program suitable for a bug report, with a comment describing what was expected and what’s going on.
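When reducing a failing input to a minimal program, it can help to locate the ERROR nodes in a saved tree dump. This sketch assumes only that the word ERROR appears on the lines of the dump that describe error nodes. The exact output format of parse_LANG is not assumed, and the function name is hypothetical.

```python
import re


def error_node_lines(tree_dump):
    """Return 1-based line numbers of the dump that mention an ERROR node."""
    return [
        number
        for number, line in enumerate(tree_dump.splitlines(), start=1)
        if re.search(r"\bERROR\b", line)
    ]
```

Capture the output of the parser in a file, read it into a string, and pass it to error_node_lines to see where parsing went wrong.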
Update the semgrep repository
Now that you have added your new language LANG to tree-sitter, do the following:
- Update generate.py in the semgrep-interfaces repository with your new language.
- In the semgrep repository, go to /src/parsing/Check_pattern.ml, and add LANG to lang_has_no_dollar_ids. If the grammar has no dollar identifiers, add LANG above ‘true’. Otherwise, add it above ‘false’.
- In /src/printing/Pretty_print_AST.ml, add LANG to the appropriate functions:
  - print_bool
  - if_stmt
  - while_stmt
  - do_while
  - for_stmt
  - def_stmt
  - return
  - break
  - continue
  - literal
- In /src/parsing/tests/Test_parsing.ml, add in LANG to dump_tree_sitter_cst_lang.
- Inspect the other languages in /languages as a reference for what code to add. Create a new folder for your language.
- Add the semgrep-LANG repository as a submodule under /languages/LANG/tree-sitter/ (git submodule add ...).
- Create a file /languages/LANG/tree-sitter/Parse_LANG_tree_sitter.ml by copying the generated template Boilerplate.ml that you’ll find in the semgrep-LANG submodule. Add basic functionality to define the function parse and import the module Parse_tree_sitter_helpers. Look at other languages to get a better idea of how to define the parse file function and of what this file should contain.
- Create the missing dune files wherever you have OCaml source files (.ml, .mli) by imitating what was done for other languages.
- Write a basic test case for your language in tests/LANG/hello-world.EXT. This can just be a hello-world function.
Legal concerns
Be thankful for the authors of the original code, keep clearly visible license notices, and make it easy to get back to the original projects:
- Make sure to preserve the LICENSE files. This should be listed in the fyi.list file.
- For sample input in test/, consider Public Domain (“The Unlicense”) files or write your own, for simplicity. GitHub Search allows you to filter projects by license and by programming language.