As with adding a language, most of these instructions happen in ocaml-tree-sitter-semgrep. Let’s assume we are upgrading the grammar for the programming language $PL.
(Consider adding an environment variable to your shell to make copying some of the commands below easier).
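For instance, assuming the language being upgraded is Ruby (the value `ruby` is only an illustration), the variable could be set once per shell session:

```shell
# Example value only: substitute the language you are upgrading.
export PL=ruby

# The paths and repository names used on this page then expand as expected.
echo "submodule: lang/semgrep-grammars/src/tree-sitter-$PL"
echo "published repo: semgrep-$PL"
```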
Summary (ocaml-tree-sitter)
In ocaml-tree-sitter:
In semgrep:
In the end, make sure the generated code used by the main branch of
semgrep can be regenerated from the main branch of ocaml-tree-sitter:
Components
Here are the main components:
- the OCaml code generator ocaml-tree-sitter: generates OCaml parsing code from tree-sitter grammars extended with the ellipsis (...) and such. Publishes code into the git repos of the form semgrep-$PL.
- the original tree-sitter grammar tree-sitter-$PL, e.g., tree-sitter-ruby: the original tree-sitter grammar for the language. This is the git submodule lang/semgrep-grammars/src/tree-sitter-$PL in ocaml-tree-sitter. It is installed at the project’s root in node_modules by invoking npm install.
- syntax extensions to support semgrep patterns, such as ellipses (...) and metavariables ($FOO). This is lang/semgrep-grammars/src/semgrep-$PL. It can be tested from that folder with make && make test.
- an automatically-modified grammar for language $PL in lang/$PL. It is modified so as to accommodate various requirements of the ocaml-tree-sitter code generator. lang/$PL/src and lang/$PL/ocaml-src contain the C/C++/OCaml code that will be published into semgrep-$PL, e.g. semgrep-ruby, and used by semgrep.
- semgrep-$PL: provides generated OCaml/C parsers as a dune project. It is a submodule of semgrep.
- semgrep: uses the parsers provided by semgrep-$PL, which produce a CST. The program’s CST or pattern’s CST is further transformed into an AST suitable for pattern matching.
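The syntax-extension grammar listed above can be exercised on its own. The following is a minimal sketch that wraps the `make && make test` step in a helper; the function name `test_semgrep_grammar` is an illustrative assumption, and the path is the one given in the list.

```shell
# Usage: test_semgrep_grammar <language>
# Must be called from the root of an ocaml-tree-sitter checkout.
test_semgrep_grammar() {
  # Run in a subshell so the caller's working directory is left unchanged.
  # The chain stops at the first failing step (missing folder, build error,
  # or failing test) and returns its exit status.
  ( cd "lang/semgrep-grammars/src/semgrep-$1" && make && make test )
}

# Example call (commented out because it needs a real checkout):
# test_semgrep_grammar ruby
```

A non-zero exit status means the folder was not found, the build failed, or a test failed.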
Before upgrading
Make sure the grammar.js file or equivalent source files
defining the grammar are included in the fyi.list file in
ocaml-tree-sitter/lang/$PL.
Why: It is important for tracking and understanding the changes made at the
source.
How: See How to add support for a new language.
Upgrade the tree-sitter-$PL submodule
Say you want to upgrade (or downgrade) tree-sitter-$PL from some old
commit to commit 602f12b. This uses the standard git submodule workflow,
with nothing unusual. The commands might be something like this:
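A minimal sketch of such a session, using only standard git commands. The function name `upgrade_submodule`, the branch name `upgrade-$PL`, and the `DRY_RUN` switch are illustrative assumptions rather than part of ocaml-tree-sitter; the submodule path is the one listed under Components.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: upgrade_submodule <language> <commit>
# Must be called from the root of an ocaml-tree-sitter checkout.
upgrade_submodule() {
  pl=$1
  commit=$2
  submodule="lang/semgrep-grammars/src/tree-sitter-$pl"

  # 1. make sure the submodules are checked out
  # 2. work on a dedicated branch (hypothetical name)
  # 3. fetch upstream commits inside the submodule
  # 4. move the submodule to the target commit
  # 5. confirm that the submodule now shows up as modified
  run git submodule update --init --recursive &&
  run git checkout -b "upgrade-$pl" &&
  run git -C "$submodule" fetch origin &&
  run git -C "$submodule" checkout "$commit" &&
  run git status --short "$submodule"
}

# Preview the commands without touching anything:
( DRY_RUN=1; upgrade_submodule ruby 602f12b )
```

Dropping `DRY_RUN=1` runs the same commands for real; the chain stops at the first command that fails.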
Testing
First, build and install ocaml-tree-sitter normally, based on the instructions found in the main README. Then, from the lang/ folder, the following
commands will build and test the language:
Blank kind. Eventually, the generated
CST.ml should not have Blank nodes anymore but a token type instead.
Where a Blank node exists, we won’t be able to get a token or its location
at parsing time.
If this works, we’re all set. Commit the new commit for the
tree-sitter-$PL submodule:
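As a sketch, recording the new submodule commit could look like the following. It uses only standard git commands; the function name `commit_submodule`, the commit message, and the `DRY_RUN` switch are illustrative assumptions.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: commit_submodule <language>
# Must be called from the root of an ocaml-tree-sitter checkout.
commit_submodule() {
  pl=$1
  submodule="lang/semgrep-grammars/src/tree-sitter-$pl"

  # Stage only the submodule pointer, record it, then show which
  # commit the submodule points to.
  run git add "$submodule" &&
  run git commit -m "Upgrade tree-sitter-$pl" &&
  run git submodule status "$submodule"
}

# Preview the commands without touching anything:
( DRY_RUN=1; commit_submodule ruby )
```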
semgrep-$PL.
Publishing
Please ask someone at Semgrep, Inc. to run this step. From the lang folder of ocaml-tree-sitter, we’ll perform the
release. This step redoes some of the work that was done earlier and
checks that everything is clean before committing and pushing the
changes to semgrep-$PL.
https://github.com/semgrep/semgrep-$PL e.g.
https://github.com/semgrep/semgrep-javascript.
The fyi/ folder
contains original files from which the code was generated.
fyi/versions
shows the last change for each file, allowing you to check that you
got the correct version of grammar.js or some other source file.
Semgrep integration
From the semgrep repository, point the submodule for semgrep-$PL to the
latest commit from the “Publishing” step. Then rebuild semgrep-core,
which will normally fail if the grammar changed. If the source
grammar.js was included in the fyi folder for semgrep-$PL (as it
should be), git diff HEAD^ should help figure out the changes since the
last version.
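The submodule bump and the diff can be sketched as follows. This page does not give the location of the semgrep-$PL submodule inside the semgrep repository, so the path is passed as an argument; `path/to/semgrep-ruby`, the commit `0123abc`, the function name `bump_parser`, and the `DRY_RUN` switch are all hypothetical placeholders. Restricting the diff to the fyi folder is a convenience added here, not a requirement.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: bump_parser <path-to-semgrep-PL-submodule> <commit>
# Must be called from the root of a semgrep checkout.
bump_parser() {
  submodule_dir=$1
  commit=$2

  # 1. fetch the commits pushed by the "Publishing" step
  # 2. move the submodule to the published commit
  # 3. show what changed in the original grammar sources since the
  #    previous published version
  # 4. stage the new submodule pointer in semgrep
  run git -C "$submodule_dir" fetch origin &&
  run git -C "$submodule_dir" checkout "$commit" &&
  run git -C "$submodule_dir" diff HEAD^ -- fyi &&
  run git add "$submodule_dir"
}

# Preview the commands without touching anything:
( DRY_RUN=1; bump_parser path/to/semgrep-ruby 0123abc )
```

After this, rebuild semgrep-core with the project's usual build command and use the diff output to locate the code that needs to be adapted to the changed CST.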