- Scanning the components of a monorepo separately.
- Serializing the type of scan performed.
- Increasing the RAM of the job runner for CI jobs.
Determining the size of your monorepo
By default, Semgrep limits the size of each file scanned and the memory allocated. However, Semgrep does not limit the number of files scanned, and scanning a large monorepo can involve thousands of files. To determine how many files are getting scanned:
A sample Semgrep scan output can look like this:
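Separately from Semgrep's own scan summary, a rough picture of repository composition can be sketched by counting files per extension. This is only an approximation and an assumption-laden one: the function name `count_files_by_extension` and the skipped directory names are hypothetical, and the count does not reproduce Semgrep's actual target selection (ignore files, size limits, and so on).

```python
import os
from collections import Counter


def count_files_by_extension(root, skip_dirs=(".git", "node_modules")):
    """Count files under `root`, grouped by lowercase file extension.

    `skip_dirs` is an illustrative default, not Semgrep's ignore logic.
    """
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            counts[ext] += 1
    return counts
```

Calling `count_files_by_extension(".").most_common(10)` from the repository root lists the ten most frequent extensions, which can hint at which languages dominate the scan.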
Scanning components separately
Based on the composition provided by the logs, you may be able to determine if your repository is modular. If so, you can try scanning the components separately.
NOTE: Semgrep Code still performs interfile analysis on each module. If the modules are functionally separate, running separate scans shouldn't result in a reduction in findings.
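As a minimal sketch of per-component scanning, the helper below builds one `semgrep scan` command per component directory. The component paths, the helper names, and the choice of `--config auto` are assumptions for illustration; a CI deployment would more likely use `semgrep ci` with its own path filters, so treat this as a shape to adapt rather than a recommended invocation.

```python
import shlex


def component_scan_commands(components, config="auto"):
    """Return one `semgrep scan` argument list per component directory.

    `components` are hypothetical paths relative to the repository root.
    """
    return [["semgrep", "scan", "--config", config, path] for path in components]


def render(commands):
    """Render argument lists as shell-style strings for logging or review."""
    return [shlex.join(cmd) for cmd in commands]


# To execute the scans one after another (not run here):
#   import subprocess
#   for cmd in component_scan_commands(["services/api", "services/web"]):
#       subprocess.run(cmd, check=False)
```

Building the commands as lists first keeps them easy to log and review before any scan is launched.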
Serializing types of scans
Avoid exhausting resource limits by running Semgrep Code, Supply Chain, and Secrets serially instead of simultaneously. That is, instead of:
Increasing RAM
Lastly, you can also tackle a large scan by increasing the RAM.
Establish RAM baseline and avoid swap memory
First, establish how much memory is required to scan. Determining the total amount of memory required not only helps avoid killed scans but also helps prevent use of swap memory. Semgrep and other SAST tools make heavy use of disk I/O, and swapping in and out with a swap file significantly reduces performance.
- In the early phases of your scan deployment, start with a relatively larger runner or Kubernetes pod that has lots of memory.
- Perform the scan with the -j 1 option (see CLI reference). This sets the number of jobs to 1 (no parallelization of subprocesses).
- Enable a swap monitor for the entire duration of the scan to ensure an accurate assessment of RAM used, for example, running a script that samples the memory frequently:
- Then add roughly 10% more RAM to your final memory tally to account for churn, growth in the codebase, and so on. The exact margin is something you must gauge for your environment.
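The baseline steps above can be sketched as a small sampler. This is not the monitoring script from the Semgrep documentation: it is a hypothetical Linux-only stand-in that reads `/proc/meminfo`, measures system-wide usage rather than Semgrep's processes alone, and assumes the `MemAvailable` field is present. All function names are illustrative, and counting peak swap toward the tally is an assumption about how to size the runner.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values in KiB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def ram_and_swap_used_kib(info):
    """Return (RAM in use, swap in use) in KiB, system-wide."""
    ram_used = info["MemTotal"] - info["MemAvailable"]
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return ram_used, swap_used


def recommended_ram_kib(peak_ram_kib, peak_swap_kib, headroom=0.10):
    """Add headroom to the peak; swap that was used counts toward the tally."""
    return int((peak_ram_kib + peak_swap_kib) * (1 + headroom))


def monitor(duration_s, interval_s=1.0, path="/proc/meminfo"):
    """Sample memory every `interval_s` seconds and return (peak RAM, peak swap) in KiB."""
    peak_ram = peak_swap = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open(path) as f:
            ram, swap = ram_and_swap_used_kib(parse_meminfo(f.read()))
        peak_ram, peak_swap = max(peak_ram, ram), max(peak_swap, swap)
        time.sleep(interval_s)
    return peak_ram, peak_swap
```

Run `monitor` in a separate process for at least as long as the `-j 1` scan, then pass the peaks to `recommended_ram_kib`. A non-zero swap peak suggests the runner was already short on RAM during the measurement.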
Parallelization
Once you have determined the RAM required to scan your large codebase, you can introduce parallelization to speed up the scan. In the previous section, you determined the total memory required for a configuration with no parallelization. Now, you can begin testing different parallelization configurations to improve scan speed, while still monitoring for any swap usage. To increase parallelization, first try the scan with -j 2 for two jobs. For two jobs, memory usage will typically be just less than twice the amount required for one job, and that trend continues as the number of jobs increases.
Furthermore, there is overhead in parallelization: the total RAM required for a -j 2 scan is greater than a -j 1 scan for the same codebase, but you should see a decrease in total scan time.
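The scaling described above suggests a simple way to pick a starting `-j` value to test. The sketch below treats `jobs × baseline` plus a headroom margin as a conservative estimate, which follows from the "just less than twice" observation but is still an assumption rather than a Semgrep guarantee. The function name, the 10% default, and the `max_jobs` cap are all hypothetical.

```python
def max_jobs_within_ram(baseline_mib, available_mib, headroom=0.10, max_jobs=16):
    """Largest -j value whose conservative estimate fits in available RAM.

    Estimate per job: baseline_mib * (1 + headroom).
    Always returns at least 1, even if a single job does not fit the budget.
    """
    budget_per_job = baseline_mib * (1 + headroom)
    jobs = int(available_mib // budget_per_job)
    return max(1, min(jobs, max_jobs))
```

For example, with a measured `-j 1` baseline of 6000 MiB and 32000 MiB available, the function suggests starting at four jobs. The result is only a starting point for testing, so keep monitoring for swap usage at each setting.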