If your applications ever parse XML, whether it comes from user uploads, web services, or configuration files, you could be exposed to attack. For developers, this matters because even if you are not writing low-level parsing code, the libraries you rely on might be insecure by default. The key takeaway is simple: XML is powerful but risky. Unless you configure your parser securely, attackers can exploit features that were designed for flexibility, not safety. This article will give you a solid foundation in XML security. First, we’ll look at the basics of XML and the features that create risks. Next, we’ll walk through the most common XML-based attacks and show you how to recognize them in code. Finally, we’ll cover practical steps you can take to configure parsers securely and reduce your exposure.
What is XML Security?
In 2017, XML security had its own spot on the OWASP Top 10 list of the most critical web application risks, as “XML External Entities (XXE).” In the 2021 update, it is no longer called out as a separate category but instead falls under “Security Misconfiguration.” Even so, XML vulnerabilities remain common: searching the CVE database for XML-related flaws returns thousands of entries, including recent ones that enabled attackers to steal data or execute code remotely.
Fundamentally, as its name indicates, eXtensible Markup Language (XML) is a markup language designed for storing and sending data. XML provides several mechanisms for loading parts of the document from other sources. Two commonly used mechanisms are external Document Type Definitions (DTDs) and external entities.
External Document Type Definition
The XML DTD is defined at the top of an XML document using the DOCTYPE keyword. It defines the legal structure of the document.
A DTD can also be loaded from an external source: use the SYSTEM keyword and provide a URL or filename to the DTD.
External XML Entities
XML entities are used to represent structured data. The XML specification has several entities built in, such as &lt; and &amp;. But as the X in XML implies, it is an extensible language, and custom entities can be defined in the DTD using the ENTITY keyword.
When the value of an entity is loaded from an external source with the SYSTEM keyword, these are called external XML entities.
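To make the DOCTYPE and ENTITY keywords concrete, here is a minimal Java sketch that parses a document with an internal DTD. The document, the entity names (name, greeting), and the class EntityDemo are made up for illustration and do not come from the original article. Note that greeting is defined in terms of name, which is the nesting that entity expansion attacks build on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    // Illustrative document: the internal DTD declares two custom entities.
    // "greeting" refers to "name", so entities can be nested.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ENTITY name \"World\">\n"
        + "  <!ENTITY greeting \"Hello, &name;!\">\n"
        + "]>\n"
        + "<note>&greeting;</note>";

    // An external entity would use SYSTEM plus an identifier instead of a
    // literal value, for example (hypothetical URL):
    //   <!ENTITY ext SYSTEM "https://example.com/fragment.xml">

    // Parses the XML and returns the root element's text with entities expanded.
    static String expand(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }
}
```

Calling EntityDemo.expand(EntityDemo.XML) returns the text with both entity references replaced by their definitions. The factory is left at its defaults here only because the input is a trusted constant.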
Beyond DTDs and entities, the schemaLocation attribute, the xsl:include element, the document() function, and import or include statements can all be used to reference external resources. You can find an overview of these and more in our XML Cheat Sheet.
Common XML Attacks
XML was designed to store and transmit data. When an application receives an XML file, or reads one from the filesystem, it needs to parse the XML data. To achieve this, many libraries are available that implement the XML specifications as described above. However, if the XML data is untrusted, there are several features in the specifications that can be manipulated to achieve malicious behaviour.
XML Injection
The first type of attack does not trick the parser into fetching external content. Instead, the attacker injects additional tags or attributes into the XML itself to alter the logic of the application. Imagine an application that reads user account details, including an admin field, from an XML document that is partly built from user input. If the attacker can inject or override the admin field, the attacker could escalate privileges. This is similar in spirit to SQL Injection but applied to XML-based logic.
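The following Java sketch illustrates the idea under some assumptions: the user/name/admin document layout, the attack string, and the class XmlInjectionDemo are hypothetical, and the "first admin element wins" logic is only one way an application might read the field. It contrasts string concatenation with escaping the untrusted value first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlInjectionDemo {
    // Hypothetical attacker input: closes <name>, injects <admin>, reopens <name>.
    static final String ATTACK = "mallory</name><admin>true</admin><name>x";

    // Vulnerable: untrusted input is concatenated straight into the markup.
    static String buildUnsafe(String name) {
        return "<user><name>" + name + "</name><admin>false</admin></user>";
    }

    // Safer: escape the characters that have meaning in XML before embedding.
    static String buildEscaped(String name) {
        String escaped = name.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;").replace("'", "&apos;");
        return "<user><name>" + escaped + "</name><admin>false</admin></user>";
    }

    // Hypothetical application logic: trusts the first <admin> element it finds.
    static boolean isAdmin(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // No DTD is needed for this format, so reject DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList admins = doc.getElementsByTagName("admin");
            return admins.getLength() > 0
                    && "true".equals(admins.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }
}
```

With the concatenated document, the injected admin element appears before the legitimate one, so the naive lookup sees it first. With escaping, the same input stays inside the name element as plain text. Building the document through a DOM or serializer API instead of string concatenation achieves the same effect.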
Exponential Entity Expansion (XEE)
Exponential Entity Expansion (XEE) happens when the mechanism for recursively defining XML entities with other entities is manipulated into expanding several layers of nested entities. Because every layer multiplies the size of the layer below it, a very small document can exhaust the parser's memory or CPU. This type of attack is also known as an XML bomb or billion laughs attack. As an example, see the XML bomb payload we used in our research project on GitHub.
XML External Entity (XXE) Injection
XML eXternal Entity (XXE) injection happens when one of the 9 mechanisms to include external content is manipulated into reading content from an unintended location. This can lead to the disclosure of confidential data if the identifier supplied by the attacker is something like file:///etc/passwd. In other cases, XXE payloads can be used to upload code files that can later be triggered in remote code execution attacks, like in this CVE where the identifier referenced a Java code archive. In PHP, the right identifier by itself can even cause arbitrary code execution when the expect module is loaded. In that case, a pseudo-URI like expect://cmd will execute cmd and return the output of the command.
Detecting XML Security Vulnerabilities in Your Code
Detecting XML issues can be challenging because risky behavior often comes from default parser configurations rather than obvious coding mistakes. Consider Java code that parses a file named input.xml with a parser left in its default configuration. Nothing in such code looks wrong, but if input.xml is attacker-controlled, the parser could make network requests or expose files.
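As a minimal sketch of this risky pattern, the Java code below parses XML with an unconfigured DocumentBuilderFactory. It is not the article's original example: the class DefaultParserDemo, the payload, and the temporary file that stands in for a sensitive file are all illustrative assumptions. The result also depends on the JDK's defaults, which at the time of writing permit external entities for this factory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DefaultParserDemo {
    // Risky pattern: the factory is used as-is, with no security features set.
    static String parseWithDefaults(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Builds an XXE payload whose external entity points at the given file.
    static String payloadFor(Path target) {
        return "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE data [ <!ENTITY xxe SYSTEM \"" + target.toUri() + "\"> ]>\n"
            + "<data>&xxe;</data>";
    }

    // Writes a harmless stand-in "secret" to a temp file and parses a payload
    // that references it, returning whatever text ends up in the document.
    static String demo() {
        try {
            Path secret = Files.createTempFile("xxe-demo", ".txt");
            try {
                Files.write(secret, "s3cret-token".getBytes(StandardCharsets.UTF_8));
                return parseWithDefaults(payloadFor(secret));
            } finally {
                Files.deleteIfExists(secret);
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not prepare demo file", e);
        }
    }
}
```

The telltale sign is the call chain newInstance(), newDocumentBuilder(), parse() with no setFeature or setAttribute call in between, applied to input that an attacker might control.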
Tools like Semgrep can help detect these patterns. Rules for XXE look for places in code where XML parsing from untrusted sources occurs without security features enabled. For XML Injection, detection is more about how the application uses parsed data. If the code blindly trusts XML input without validating it against a strict schema, that’s a sign of risk.
Recommendations and Mitigations
While the flexibility of XML can lead to security vulnerabilities, you can mitigate these risks by configuring your parser correctly and adopting a secure development mindset. Here are some key recommendations and mitigations to consider.
Consider using alternative formats
Before you even start, ask yourself if you really need to use XML. For many modern applications, simpler and less risky data formats like JSON (JavaScript Object Notation) or YAML (YAML Ain’t Markup Language) are perfectly suitable. Both are lightweight and widely supported, offering similar data-structuring capabilities without the complex and sometimes vulnerable features of XML like DTDs and external entities. If your use case involves web APIs or configuration files, JSON or YAML can be a much safer default choice.
Use up-to-date language and libraries
Staying current is critical. Vulnerabilities are frequently discovered in older language versions and libraries. During some of our research on XML parsers in Java, we ran into a known JDK bug where DOM parsers do not honor setExpandEntityReferences(false) for certain JDK versions. Using up-to-date versions ensures you benefit from the latest security patches and bug fixes. Regularly check for updates and integrate them into your development workflow.
Disabling DTD processing
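One of the most effective ways to prevent both XEE and XXE attacks is to disable DTD processing entirely. Since DTDs are the primary mechanism for declaring and referencing entities, disabling them prevents the parser from expanding entities or fetching the external content they point to. Most XML parsers provide a configuration setting to achieve this. For instance, in Java, DOM and SAX parser factories generally accept the feature http://apache.org/xml/features/disallow-doctype-decl (set to true), and StAX parsers can set XMLInputFactory.SUPPORT_DTD to false. Settings like XMLConstants.FEATURE_SECURE_PROCESSING (set to true) or XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES (set to false) add further restrictions but do not by themselves turn off DTD processing.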
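A minimal Java sketch of this approach for a DOM parser follows. It assumes the JDK's built-in parser, which recognizes the disallow-doctype-decl feature. The class name NoDtdParser and the small nested-entity document are illustrative only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NoDtdParser {
    // Small stand-in for an entity expansion payload (only two layers deep).
    static final String NESTED_ENTITIES =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE lolz [\n"
        + "  <!ENTITY a \"lol\">\n"
        + "  <!ENTITY b \"&a;&a;&a;\">\n"
        + "]>\n"
        + "<lolz>&b;</lolz>";

    // Creates a factory that refuses any document containing a DOCTYPE.
    static DocumentBuilderFactory newFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Extra hardening: processing limits on, XInclude off.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setXIncludeAware(false);
        return factory;
    }

    // Returns true if the document parses, false if the parser rejects it.
    static boolean accepts(String xml) {
        try {
            newFactory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. a DOCTYPE declaration was found
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup failed", e);
        }
    }
}
```

Because the parser fails as soon as it encounters the DOCTYPE declaration, neither nested entities nor external entities are ever evaluated. Documents without a DTD parse as usual.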
Disable external entities
If you can’t disable DTDs completely, the next best thing is to disable external entities. This is a more granular approach that allows internal DTDs to be processed while blocking any references to external resources. This can be configured separately from DTD processing and is a key step in preventing XXE attacks. Some parsers, like the Python defusedxml library, are specifically designed to be resistant to these attacks by default, making them a safer alternative to standard libraries.
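For a Java DOM parser, this could look roughly like the sketch below. The feature URIs are the standard SAX and Xerces ones supported by the JDK's built-in parser; other parser implementations may need different settings. The class NoExternalEntitiesParser, the greet and xxe entities, and the temporary file are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoExternalEntitiesParser {
    // Keeps internal DTDs working but blocks external entities and DTDs.
    static DocumentBuilderFactory newFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setXIncludeAware(false);
        return factory;
    }

    // Parses the XML and returns the root element's text content.
    static String textOf(String xml) {
        try {
            Document doc = newFactory().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Parses a document that mixes an internal entity (greet) with an
    // external one (xxe) pointing at a harmless temp file.
    static String demo() {
        try {
            Path secret = Files.createTempFile("xxe-demo", ".txt");
            try {
                Files.write(secret, "s3cret-token".getBytes(StandardCharsets.UTF_8));
                String xml = "<?xml version=\"1.0\"?>\n"
                    + "<!DOCTYPE data [\n"
                    + "  <!ENTITY greet \"hello\">\n"
                    + "  <!ENTITY xxe SYSTEM \"" + secret.toUri() + "\">\n"
                    + "]>\n"
                    + "<data>&greet;|&xxe;</data>";
                return textOf(xml);
            } finally {
                Files.deleteIfExists(secret);
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not prepare demo file", e);
        }
    }
}
```

In this configuration the internal entity is still expanded, while the external one is skipped and the referenced file is never read.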
Disabling external schema/stylesheet processing
To prevent attacks related to schema and stylesheet processing, you can also configure your parser to disable the processing of external schemas and stylesheets. The specific method depends on the library, but it’s a critical security control to include in your parser’s configuration. In Java, the setAttribute method can be used to set XMLConstants.ACCESS_EXTERNAL_SCHEMA and XMLConstants.ACCESS_EXTERNAL_STYLESHEET to an empty string to disable these features. These settings, however, are not implemented effectively for all parsers, so consult our cheat sheet to ensure you configure your specific parser securely!
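The sketch below shows what these settings might look like for the JDK's built-in DocumentBuilderFactory and TransformerFactory. It is an assumption-laden illustration rather than a complete recipe: the class RestrictedXmlAccess is hypothetical, and third-party parser or XSLT implementations may reject these attributes with an IllegalArgumentException.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RestrictedXmlAccess {
    // DOM parser factory that may not fetch external DTDs or schemas.
    static DocumentBuilderFactory newParserFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // An empty string means "no protocols allowed".
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        return factory;
    }

    // XSLT factory that may not fetch external DTDs or stylesheets
    // (for example via xsl:include, xsl:import, or document()).
    static TransformerFactory newTransformerFactory() throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return factory;
    }

    // Runs an identity transform to show that ordinary input still works.
    static String identity(String xml) {
        try {
            Transformer transformer = newTransformerFactory().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }
}
```

Restricting access at the factory level means every parser or transformer created from it inherits the restriction, which is easier to audit than per-call configuration.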