Reduce memory footprint #2314

Open
0intro wants to merge 4 commits into OpenSCAP:main from 0intro:djc/xml-memory

0intro commented Feb 18, 2026

Avoid keeping full libxml2 DOM trees in memory when only the parsed object model is needed. The Source DataStream XML (typically 20+ MB) was previously held as a DOM for the entire evaluation lifetime. Now it is parsed via a streaming xmlTextReader and freed early.

It includes the following changes:

  • Add oscap_source_get_streaming_xmlTextReader that parses directly from file or memory without building a persistent DOM.
  • Use the streaming reader in all OVAL model importers.
  • Serialize extracted SDS components to memory buffers instead of cloning DOM subtrees, so the full SDS DOM can be freed after extraction.
  • Free XCCDF and OVAL source DOMs immediately after their object models are built.

Add oscap_source_get_streaming_xmlTextReader() that creates an
xmlTextReader directly from file contents or memory buffer without
loading the full XML DOM first. For file-based sources, the file
is read into a memory buffer and parsed with xmlReaderForMemory.
For memory-based sources, the buffer is parsed directly. BZ2-
compressed sources fall back to the existing DOM-based path.
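The file-to-buffer step described above can be sketched with plain POSIX calls. This is a minimal illustration, not the PR's code: `read_file_to_buffer` is a hypothetical helper name, and the `INT_MAX` check is an assumption motivated by `xmlReaderForMemory` taking its size as an `int`.

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the PR): read a whole regular file into a
 * malloc'd buffer. Returns 0 on success, -1 on failure. On success the
 * caller owns *out and must free() it. */
static int read_file_to_buffer(const char *path, char **out, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode) ||
        st.st_size < 0 || st.st_size > INT_MAX) {
        /* Reject non-regular files and sizes that do not fit in an int. */
        close(fd);
        return -1;
    }

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);   /* +1 keeps the allocation non-empty */
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, buf + done, size - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            free(buf);
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                  /* file shrank while reading */
        done += (size_t)n;
    }
    close(fd);

    buf[done] = '\0';               /* convenience terminator, not counted */
    *out = buf;
    *out_len = done;
    return 0;
}
```

A caller could then hand `buf` and `(int)len` to `xmlReaderForMemory()`. It is safest to keep the buffer alive until the reader has been freed.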

Also switch oscap_source_get_scap_type() and
oscap_source_get_schema_version() to use the streaming reader,
avoiding unnecessary DOM construction for document type detection
and schema version extraction.

Switch oval_definition_model, oval_syschar_model, oval_variable_model,
oval_directives_model, and oval_results_model import functions to use
oscap_source_get_streaming_xmlTextReader() instead of
oscap_source_get_xmlTextReader(). This avoids loading the full XML DOM
into memory when importing OVAL documents, since the OVAL parsers only
use streaming-compatible xmlTextReader API calls.

Instead of keeping cloned DOM trees for extracted DataStream components,
serialize them to compact XML text buffers via xmlDocDumpMemory() and
immediately free the cloned DOM. The component oscap_source is then
created from the memory buffer using oscap_source_new_take_memory().

This reduces peak memory during SDS decomposition because serialized XML
text is typically 3-5x smaller than its libxml2 DOM representation. The
streaming xmlTextReader can also parse directly from these buffers
without constructing an intermediate DOM.

Release the xmlDoc held by OVAL and XCCDF sources as soon as the
corresponding object models have been built from them. In
xccdf_session_load_oval(), call oscap_source_free_xmlDoc() on each
OVAL source right after oval_definition_model_import_source(). In
_xccdf_session_load_xccdf_benchmark(), free the XCCDF source DOM
right after xccdf_benchmark_import_source().

This eliminates the window where both the XML DOM and the parsed
object model coexist in memory during the loading phase.
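The ownership pattern behind this change can be shown with stand-in types. Everything in the sketch below is hypothetical (`toy_source`, `toy_doc`, `toy_model` and their functions are not OpenSCAP APIs). It only illustrates the ordering: build the model, then drop the cached document immediately, on the assumption that the model keeps no pointers into it.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in types, NOT OpenSCAP's: a source that lazily caches a parsed
 * document, and a model that copies what it needs out of that document. */
struct toy_doc { char *text; };
struct toy_source { const char *raw; struct toy_doc *doc; };
struct toy_model { size_t length; };

/* Build the cached document on first use. */
static struct toy_doc *toy_source_get_doc(struct toy_source *src)
{
    if (src->doc == NULL) {
        struct toy_doc *doc = malloc(sizeof(*doc));
        if (doc == NULL)
            return NULL;
        size_t n = strlen(src->raw) + 1;
        doc->text = malloc(n);
        if (doc->text == NULL) {
            free(doc);
            return NULL;
        }
        memcpy(doc->text, src->raw, n);
        src->doc = doc;
    }
    return src->doc;
}

/* Drop the cached document; it can be rebuilt later if needed. */
static void toy_source_free_doc(struct toy_source *src)
{
    if (src->doc != NULL) {
        free(src->doc->text);
        free(src->doc);
        src->doc = NULL;
    }
}

/* The model stores derived data only, never pointers into the document. */
static struct toy_model *toy_model_import(struct toy_source *src)
{
    struct toy_doc *doc = toy_source_get_doc(src);
    if (doc == NULL)
        return NULL;
    struct toy_model *model = malloc(sizeof(*model));
    if (model != NULL)
        model->length = strlen(doc->text);
    return model;
}

/* Import, then release the document right away so that the document and
 * the model do not stay in memory together after loading. */
static struct toy_model *load_model(struct toy_source *src)
{
    struct toy_model *model = toy_model_import(src);
    toy_source_free_doc(src);
    return model;
}
```

The pattern is only safe when the importer copies everything it needs. A model that kept references into the document would be left with dangling pointers.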

}

if (source->origin.filepath != NULL) {
int fd = open(source->origin.filepath, O_RDONLY);

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This argument to a file access function is derived from
user input (a command-line argument)
and then passed to open(__path).