Add new CTS test to validate if unique telemetry is reported by sysman#327
Add new CTS test to validate if unique telemetry is reported by sysman#327kalyanalle wants to merge 1 commit intooneapi-src:masterfrom
Conversation
| SYSMAN_MULTI_DEVICE_TEST, | ||
| GivenMultipleDevicesWhenQueryingAllSysmanAPIsThenEachDeviceReturnsUniqueValues) { | ||
|
|
||
| LOG_INFO << "=== Multi-Device PMT Interface Validation Test ==="; |
| ASSERT_EQ(ZE_RESULT_SUCCESS, | ||
| zesDeviceGet(lzt::get_default_zes_driver(), &deviceCount, nullptr)); | ||
|
|
||
| if (deviceCount < 2) { |
There was a problem hiding this comment.
| if (deviceCount < 2) { | |
| if (deviceCount > 1) { |
This seems more intuitive.
There was a problem hiding this comment.
The code checks if there are fewer than 2 devices available. If true, it skips the test rather than failing it.
Without this check, the test would likely crash or fail when trying to access a second GPU that doesn't exist.
There was a problem hiding this comment.
I did not question the need for this check. Just want to convey that the suggested change is more intuitive for a reader.
| GTEST_SKIP() | ||
| << "Multi-device PMT validation requires at least 2 Intel GPUs. " | ||
| << "Found: " << deviceCount << " device(s). " | ||
| << "This test validates PMT interface isolation across multiple GPUs."; | ||
| } |
| LOG_INFO << " Multi-device environment validated: " << deviceCount | ||
| << " devices found"; | ||
| LOG_INFO << "Testing " << devices.size() | ||
| << " devices for PMT interface isolation"; |
There was a problem hiding this comment.
here as well. Please search for this word and remove it.
conformance_tests/sysman/test_sysman_device/src/test_sysman_multi_device.cpp
Show resolved
Hide resolved
| for (auto tempHandle : tempHandles) { | ||
| double temperature = 0.0; | ||
| if (zesTemperatureGetState(tempHandle, &temperature) == ZE_RESULT_SUCCESS) { | ||
| deviceData.temperatures.push_back(temperature); |
| // Helper function to validate temperature data isolation between devices | ||
| void validateTemperatureDataIsolation( | ||
| const std::vector<MultiDeviceData> &deviceData) { | ||
| for (size_t i = 0; i < deviceData.size(); i++) { | ||
| for (size_t j = i + 1; j < deviceData.size(); j++) { | ||
|
|
||
| // Compare temperature readings | ||
| for (const auto &tempA : deviceData[i].temperatures) { | ||
| for (const auto &tempB : deviceData[j].temperatures) { | ||
|
|
||
| // Validate temperature ranges are realistic | ||
| EXPECT_GT(tempA, 0.0) << "Device " << deviceData[i].deviceIdentifier | ||
| << " has invalid temperature reading"; | ||
| EXPECT_LT(tempA, 150.0) << "Device " << deviceData[i].deviceIdentifier | ||
| << " temperature seems unrealistic (>150°C)"; | ||
|
|
||
| EXPECT_GT(tempB, 0.0) << "Device " << deviceData[j].deviceIdentifier | ||
| << " has invalid temperature reading"; | ||
| EXPECT_LT(tempB, 150.0) << "Device " << deviceData[j].deviceIdentifier | ||
| << " temperature seems unrealistic (>150°C)"; | ||
|
|
||
| // Note: Temperatures could be identical if both GPUs are idle, | ||
| // so we don't enforce uniqueness here, just realistic values | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
granted they can come similar maybe this could be made as a warning if it same instead of a failure?
I dont think we need to do the valid range check, since its already done in the temperature specific checks.
| LOG_DEBUG << "[DEBUG] Validating PCI statistics isolation..."; | ||
| validatePciDataIsolation(deviceData); | ||
| LOG_DEBUG << "[DEBUG] PCI statistics isolation validation complete"; | ||
|
|
||
| LOG_DEBUG << "[DEBUG] Step 3 complete: All isolation validations finished"; | ||
|
|
||
| // Step 4: Final validation summary | ||
| LOG_DEBUG << "[DEBUG] Step 4: Generating final validation summary"; |
There was a problem hiding this comment.
These seem like not necessary comments, please remove
There was a problem hiding this comment.
removed all the debug prints in the main test
| LOG_DEBUG << "[DEBUG] Step 1 complete: All device data collected"; | ||
|
|
||
| // Step 2: Validate PCI BDF uniqueness (PMT interface mapping validation) | ||
| LOG_DEBUG << "[DEBUG] Step 2: Validating PCI BDF uniqueness..."; | ||
| ASSERT_TRUE(validateUniquePciBdf(deviceData)) | ||
| << "PMT interface mapping validation failed - duplicate PCI BDF detected"; | ||
| LOG_DEBUG << "[DEBUG] Step 2 complete: PCI BDF uniqueness validated"; |
There was a problem hiding this comment.
debug is not needed to be mentioned since we are already using LOG_DEBUG calls.
Also remove usage of the term PMT
| for (const auto &bandwidthA : deviceData[i].memoryBandwidth) { | ||
| for (const auto &bandwidthB : deviceData[j].memoryBandwidth) { |
There was a problem hiding this comment.
I see you are avoiding the ordering issue using brute force here. This is fine, but could be optimized if you can generate the MultiDeviceData structure in an ordered manner. We can use module properties to understand the type of handle for each API and sort it based on that. All cards would return he same number of handles with similar handle types, so you always expect to have a 1:1 mapping.
There was a problem hiding this comment.
done , taken care
There was a problem hiding this comment.
Removed nested loops in memory state validation - now uses 1:1 comparison with sorted data
Added 1:1 comparison for power and temperature validation
Added count verification - ensures devices have same number of modules/domains/sensors
Fixed incomplete main test - added missing validation calls
Better error messages - include module/domain/sensor index numbers
Consistent logging - added success indicators
5944178 to
8835856
Compare
a674d35 to
4d5f5e3
Compare
on multi card systems for each device(GPU) Related-To: VLCLJ-2646 Signed-off-by: Alle, Kalyan <kalyan.alle@intel.com>
on multi card systems for each device(GPU)
Related-To: VLCLJ-2646
Note: left the debug prints for testing purpose, will remove in the final code.
Brief Function Logic Explanations
Data Collection Functions
collectMemoryData()
Purpose: Collects memory telemetry from a device
Enumerates memory modules using zesDeviceEnumMemoryModules()
For each module: gets bandwidth counters (read/write) and memory state (free/used)
Stores in deviceData.memoryBandwidth and deviceData.memoryStates
Returns gracefully if no memory modules found
collectPowerData()
Purpose: Collects power consumption telemetry from a device
Enumerates power domains using zesDeviceEnumPowerDomains()
For each domain: gets energy counters using zesPowerGetEnergyCounter()
Stores in deviceData.powerEnergy
Returns gracefully if no power domains found
collectTemperatureData()
Purpose: Collects temperature readings from a device
Enumerates temperature sensors using zesDeviceEnumTemperatureSensors()
For each sensor: gets temperature value using zesTemperatureGetState()
Stores in deviceData.temperatures
Returns gracefully if no temperature sensors found
collectPciData()
Purpose: CRITICAL - Collects PCI info and creates unique device ID
Gets PCI properties using zesDevicePciGetProperties()
Creates BDF string: "bus:device:function" (e.g., "3:0:0")
Gets PCI traffic stats using zesDevicePciGetStats()
Returns false if PCI properties fail (test-critical failure)
Validation Functions
validateUniquePciBdf()
Purpose: CORE PMT VALIDATION - Ensures no duplicate PCI addresses
Uses std::set to detect duplicate BDF identifiers
Returns false if duplicate found → PMT mapping error detected
Most critical validation - proves each device has unique address
validateMemoryDataIsolation()
Purpose: Ensures memory counters differ between all device pairs
Double loop: Compares every device pair (i vs j where j > i)
Checks memory bandwidth: read/write counters must differ
Checks memory state: free memory should differ between devices
EXPECT_FALSE on identical data → detects PMT cross-contamination
validatePowerDataIsolation()
Purpose: Ensures power readings differ between all device pairs
Double loop: Compares every device pair
Checks energy counters: power consumption values must differ
EXPECT_FALSE on identical energy → detects shared power data
validateTemperatureDataIsolation()
Purpose: Validates temperature readings are realistic per device
Double loop: Validates each device's temperature range
Range check: 0°C < temperature < 150°C per device
No uniqueness requirement (idle GPUs may have similar temps)
Ensures PMT thermal interface is accessible
validatePciDataIsolation()
Purpose: CRITICAL - Validates PCI bus uniqueness and traffic isolation
EXPECT_NE on PCI bus numbers → different devices must be on different buses
Compares PCI traffic stats: RX/TX/packet counters must differ
EXPECT_FALSE on identical stats → detects PMT interface sharing
Core PMT mapping validation - validates the commit's fix