Java: Add RegexMatch concept and recognise @Pattern annotation as sanitizer#21310
Java: Add RegexMatch concept and recognise @Pattern annotation as sanitizer#21310owen-mc wants to merge 19 commits intogithub:mainfrom
@Pattern annotation as sanitizer#21310Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a shared “RegexExecution” concept for Java CodeQL libraries and migrates existing regex-related models/queries to use it, aiming to unify regex execution modeling across languages and reduce false positives.
Changes:
- Added
semmle.code.java.ConceptswithRegexExecution/RegexExecutionExprconcepts and wired it intojava.qll. - Extended existing Java regex API models (
String.matches,Pattern.compile,Pattern.matches,Matcher.matches) to implement the new concept. - Updated experimental and security libraries to use the new concept types instead of bespoke regex helpers.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| java/ql/src/experimental/Security/CWE/CWE-625/Regex.qll | Removes the (deprecated) local regex helper module. |
| java/ql/src/experimental/Security/CWE/CWE-625/PermissiveDotRegexQuery.qll | Switches to framework regex models/concepts for sinks and matching logic. |
| java/ql/lib/semmle/code/java/security/regexp/RegexInjection.qll | Updates sanitizer modeling to use the new PatternCompileCall abstraction. |
| java/ql/lib/semmle/code/java/security/Sanitizers.qll | Reworks regexp-guard matching to use RegexExecutionExpr::Range. |
| java/ql/lib/semmle/code/java/security/PathSanitizer.qll | Updates directory-character matching guard logic to use the new regex execution concept. |
| java/ql/lib/semmle/code/java/frameworks/Regex.qll | Adds call/model classes (e.g., PatternCompileCall, PatternMatchesCall, MatcherMatchesCall) implementing RegexExecutionExpr::Range. |
| java/ql/lib/semmle/code/java/JDK.qll | Updates StringMatchesCall to implement RegexExecutionExpr::Range. |
| java/ql/lib/semmle/code/java/Concepts.qll | New Java Concepts library including RegexExecution and related modeling modules. |
| java/ql/lib/java.qll | Exports the new Concepts library by importing it. |
Comments suppressed due to low confidence (1)
java/ql/lib/semmle/code/java/frameworks/Regex.qll:96
- Doc comment punctuation: this comment is missing a trailing period, unlike most other doc comments in this file.
/** A call to the `matches` method of `java.util.regex.Matcher` */
class MatcherMatchesCall extends MethodCall, RegexExecutionExpr::Range {
@Pattern annotation as sanitizer
e571aef to
205a058
Compare
205a058 to
5bdf550
Compare
aschackmull
left a comment
There was a problem hiding this comment.
Some comments. At a high-level I think we should stick to defining the concept in terms of expressions and not data flow nodes.
Also, I think it would be good to clarify the relationship between the "regex execution" concept and the "regex match-or-not execution" sub-concept and its implicit reliance on returning boolean true to signify match-success. Is there prior art for distinctions like this in other languages?
java/ql/src/experimental/Security/CWE/CWE-625/PermissiveDotRegexQuery.qll
Outdated
Show resolved
Hide resolved
|
@aschackmull It seems like having one concept which uses |
f6e034e to
8f8f4c2
Compare
@Pattern annotation as sanitizer@Pattern annotation as sanitizer
|
The QA results were fine except for one thing: I saw several repos where we had gained SSRF results, which was not what I expected. On investigation I found that more care was needed around the following kind of code: If I have added a fix for this, and a test to confirm that it works. I've also started a DCA run with all the repos that showed any change in SSRF alerts to see if this fixes the unexpectedly gained alerts. |
I expect some reduction in FPs due to the final commit expanded the scope of some sanitizers to include more regex execution methods. I plan to run QA to quantify that.
I welcome discussion about how to best deal with the fact that I want the concept to be the same as other languages, which use data flow nodes, but actually java mostly works at the levels of expressions. I've kind of done both for now. I think as we spread the use of shared libraries this difference in approach will eventually go away, or at least become easier to fix.