Summary
DefaultFeaturesExtractor.extract_raw_features() crashes with KeyError: 'appId' when processing GPU eventlogs that do not produce an appId column during feature extraction. The error is an unhandled KeyError with no context, making it difficult for callers to diagnose.
Steps to Reproduce
- Obtain a GPU eventlog where the Spark application does not emit data source information (e.g., the data_source_information CSV is missing or empty in the profiling output)
- Call predict_from_profiles() with that eventlog's profiling output:
from spark_rapids_tools.tools.qualx.predict import predict_from_profiles

results = predict_from_profiles(
    model_type="xgboost",
    model_path="<path_to_model>",
    profile_output_dirs=["<path_to_profiling_output>"],
)
Observed Behavior
Raw KeyError with no actionable context:
Traceback (most recent call last):
  File ".../spark_rapids_tools/tools/qualx/predict.py", line ..., in predict_from_profiles
    ...
  File ".../spark_rapids_tools/tools/qualx/featurizers/default.py", line 438, in extract_raw_features
    ...
KeyError: 'appId'
The error originates in extract_raw_features(), which calls groupby(['appId', 'sqlID']) on a DataFrame that lacks the appId column.
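The failure mode can be shown in isolation, without an eventlog. The snippet below is a minimal sketch, not the project's code: the `features` DataFrame and its `duration` column are hypothetical stand-ins for the extracted features, built so that `sqlID` is present and `appId` is absent.

```python
import pandas as pd

# Hypothetical stand-in for the extracted features: 'sqlID' exists, 'appId' does not.
features = pd.DataFrame({"sqlID": [0, 0, 1], "duration": [10, 20, 30]})

missing_key = None
try:
    # Same grouping keys as the call named in the traceback above.
    features.groupby(["appId", "sqlID"]).sum()
except KeyError as err:
    # pandas reports only the missing key name, with no application context.
    missing_key = err.args[0]

print(f"groupby raised KeyError for: {missing_key!r}")
```

The exception carries only the column name, which is why the caller cannot tell which application or input file was responsible.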
Expected Behavior
Either:
- Raise a descriptive error, e.g.,
ValueError("Feature extraction failed: 'appId' column not found in extracted features for <app_id>. The eventlog may not contain sufficient profiling data.")
- Return an empty result / skip the app gracefully, so callers can handle it
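The first option could look roughly like the following. This is a sketch under stated assumptions, not a patch against the actual extractor: `missing_columns` and `require_columns` are hypothetical helper names, and the check operates on a plain iterable of column names (for example `df.columns`) so that it does not depend on how the DataFrame is built.

```python
from typing import Iterable, List


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in the order requested."""
    present = set(columns)
    return [name for name in required if name not in present]


def require_columns(columns: Iterable[str], required: Iterable[str], app_id: str) -> None:
    """Hypothetical guard: raise a descriptive ValueError before groupby is reached."""
    missing = missing_columns(columns, required)
    if missing:
        raise ValueError(
            f"Feature extraction failed: column(s) {missing} not found in "
            f"extracted features for {app_id}. "
            "The eventlog may not contain sufficient profiling data."
        )


# Usage sketch: this would run just before the groupby(['appId', 'sqlID']) call.
try:
    require_columns(["sqlID", "duration"], ["appId", "sqlID"], app_id="<app_id>")
except ValueError as err:
    print(err)
```

A caller that prefers the second option can catch this ValueError and skip the application, so a single guard supports both behaviors.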