Boilerplate.ml for your language
foo into Parse_foo_tree_sitter.ml, you can start editing it. The
goal is to replace all the calls like todo env x with the construction
of a node of the AST. The destination AST can be a language-specific
AST or directly the generic AST. If we’re mapping to a
language-specific AST, this language-specific AST needs to be created
first. The advantage of going through a language-specific AST is more
visibility into which constructs are valid for the language, compared
to the generic AST which supports many more constructs.
Besides writing and updating the tree-sitter grammar, this step is
where the most time will be spent to integrate a language in semgrep.
This is a collection of tips to make this tedious task somewhat easier.
Use an editor/IDE with good OCaml support
Make sure to set up your editor with a proper OCaml mode, so that you can see the inferred type of expressions and jump to definitions. Popular editors include Emacs, Vim, and VS Code. They all have their own OCaml extension or plugin, which relies on Merlin.
Editing the boilerplate
Study examples
Parse_foo_tree_sitter.ml is copied from the generated file
Boilerplate.ml. The todo env x calls are typically replaced by the
construction of a node of the AST.
See how it’s done, for example, in Parse_go_tree_sitter.ml.
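As a rough illustration of what replacing a todo looks like, here is a minimal self-contained sketch. The types cst_expr and ast_expr, the env type, and the todo helper below are simplified stand-ins invented for this example; they are not the actual definitions from CST.ml, AST_generic.ml, or the generated boilerplate.

```ocaml
(* Toy stand-ins for the generated CST and the target AST.
   All names in this block are hypothetical. *)
type token = string

(* CST side: polymorphic variants, in the style of a generated CST.ml *)
type cst_expr = [
    `Int of token
  | `Id of token
]

(* AST side: the tree we want to build *)
type ast_expr =
  | Int of int
  | Name of string

type env = unit

(* Placeholder in the spirit of the boilerplate's todo: it type-checks at
   any result type but fails at runtime. *)
let todo (_env : env) _ = failwith "not implemented"

(* Before editing: every branch is a todo. *)
let map_expr_before (env : env) (x : cst_expr) : ast_expr =
  match x with
  | `Int tok -> todo env tok
  | `Id tok -> todo env tok

(* After editing: each todo is replaced by the construction of an AST node. *)
let map_expr (_env : env) (x : cst_expr) : ast_expr =
  match x with
  | `Int tok -> Int (int_of_string tok)
  | `Id tok -> Name tok
```

Both versions compile, which is why the unedited template type-checks; only the second one produces a tree at runtime.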
Learn OCaml basics
CST and AST type definitions make heavy use of algebraic data types to accommodate nodes of different kinds under the same type. In OCaml jargon, those are known as variants (e.g. Expr e) and
polymorphic variants (e.g. `Expr e).
Parametrized types in OCaml are like generics in languages like Java.
The OCaml type for a list of ints is denoted int list, which would
be denoted List<Int> in a Java-like language.
Run utop (opam install utop) and go over this tutorial about OCaml
types at ocaml.org.
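The following short sketch, which can be pasted into utop, shows the three notions mentioned above side by side: an ordinary variant, a polymorphic variant, and a parametrized type. The type and function names are made up for the example and do not come from the Semgrep code base.

```ocaml
(* Ordinary variant: constructors are declared in the type definition. *)
type expr =
  | Int of int
  | Add of expr * expr

(* Polymorphic variant: constructors start with a backtick. *)
type cst_expr = [
    `Int of string
  | `Add of cst_expr * cst_expr
]

(* Parametrized type: a list of ints is written int list. *)
let numbers : int list = [1; 2; 3]

(* Pattern matching over an ordinary variant. *)
let rec eval (e : expr) : int =
  match e with
  | Int n -> n
  | Add (a, b) -> eval a + eval b

(* Pattern matching over a polymorphic variant, building an ordinary one. *)
let rec convert (x : cst_expr) : expr =
  match x with
  | `Int s -> Int (int_of_string s)
  | `Add (a, b) -> Add (convert a, convert b)
```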
Preserve structure, assign useful names
Consider a typical generated function in the Boilerplate.ml file, named
anon_choice_type_id_42c0412. This name was generated from an anonymous
node in the grammar and it’s not meaningful. However, the function is used in
multiple spots, which is why it has its own function definition. We can rename
it to something meaningful such as id_or_nested_id.
The same goes for generated variable names such as v1, v2, etc.: replace them
with meaningful names where possible. It is fine to keep some of them as
v1 and v2 when it’s not very useful to find names for them.
Finally, it is very useful to specify the return type of the function
so as to figure out type errors a lot more easily.
- Replace generated function names with something meaningful.
- Replace v1 in let v1 = with a meaningful name.
- Specify the return type of functions that map CST to AST.
- Preserve the general structure of the generated functions.
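The points above can be sketched on a toy function. Everything here is an assumption made for illustration: the CST type cst_id_or_nested, its `Nested_id case, the name type, and the function bodies are hypothetical and much simpler than real generated code. Only the function names anon_choice_type_id_42c0412 and id_or_nested_id come from the text above.

```ocaml
(* Hypothetical, simplified CST and AST types for illustration only. *)
type token = string

type cst_id_or_nested = [
    `Id of token
  | `Nested_id of (token * token * token)  (* qualifier, ".", identifier *)
]

type name = string list  (* stand-in for an AST name type *)
type env = unit

(* As generated: opaque function name, v1/v3 variables, no return type. *)
let anon_choice_type_id_42c0412 (_env : env) (x : cst_id_or_nested) =
  match x with
  | `Id tok -> [tok]
  | `Nested_id (v1, _v2, v3) -> [v1; v3]

(* After editing: meaningful function and variable names, explicit return
   type, and the same match structure as the generated version. *)
let id_or_nested_id (_env : env) (x : cst_id_or_nested) : name =
  match x with
  | `Id tok -> [tok]
  | `Nested_id (qualifier, _dot, id) -> [qualifier; id]
```

With the return type written as name, a branch that accidentally returns something else is reported at that branch rather than at a distant call site.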
Compile regularly
Compile regularly so as to perform type checking. This is general advice for OCaml development. If you’re working on a single file, you don’t need to recompile the project, though. Merlin will take care of checking types when you save the file. The initial template with all the todos should compile successfully, but of course will fail at runtime. Type errors produced by the compiler can be tricky to understand but it’s good to learn how to interpret them. Sometimes they’re just too long, though.
Keep the boilerplate structure intact
Leave the original structure in place as much as possible. This is important for later when we want to update the grammar and need to compare the new boilerplate with the old/edited one.
Add type annotations
In the generated boilerplate, the return type of each function is left unspecified. Specifying it makes type errors much easier to figure out.
Consult the original grammar.js
The original grammar.js, or sometimes another JavaScript file,
contains the bulk of the original rules for the grammar. This is
usually a better reference than the generated code.
The generated Boilerplate.ml mirrors the type definitions in
CST.ml, which are our interpretation of the original
grammar.js. So, it is useful to consult CST.ml as well.
What tends to work well is to keep 4 windows open:
- Parse_foo_tree_sitter.ml (2 windows)
- grammar.js
- AST_generic.ml
AST_generic.ml contains the type definitions of the
AST we’re mapping to.