Extended Metrics Beyond Dice and IoU in Summary Reporting [HD95, RVE, ASSD]#3007
toufiqmusah wants to merge 2 commits into MIC-DKFZ:master
Conversation
toufiqmusah
commented
Mar 26, 2026
- Extend evaluation metrics to include ASSD, HD95, and signed RVE alongside Dice/IoU.
- Compute spacing-aware surface distances and respect ignore labels.
- Print new metrics in final validation summary output.
- Metrics flow into `summary.json` and foreground/class means automatically.
- Final validation now prints Dice, ASSD, HD95, and RVE.
Hey Toufiq, thanks for the PR! We are currently testing Claude for automated PR review, and I would like to use your PR to test it. Just a quick warning: things may go wrong, haha.

@claude what do we think?

@claude let's try this again
🔍 nnU-Net Code Review
Summary
This PR extends the evaluation pipeline to compute ASSD, HD95, and signed Relative Volume Error (RVE) alongside the existing Dice and IoU metrics. New metrics are computed via scipy's `distance_transform_edt` and `binary_erosion`, and are logged during final validation in `nnUNetTrainer`.
Contribution Guidelines
- General applicability: HD95, ASSD, and RVE are widely used metrics in medical image segmentation, so these additions are broadly useful. This aligns with the "generally valid across datasets" principle.
- Performance claims / benchmarks: The PR does not discuss the performance impact of adding mandatory EDT-based surface distance computation for every evaluation. Surface distance computation on 3D volumes can be orders of magnitude slower than Dice/IoU (which only require sums). Benchmarks showing the impact on evaluation time are missing and should be provided, especially since this is not optional.
- Prior discussion: There is no linked GitHub issue. Given that this changes the behavior of a core pipeline component (evaluation) for all users, prior discussion would have been appropriate per CONTRIBUTING.md.
Key Findings
- `np.squeeze` is unsafe (correctness risk): `_get_surface` and `_compute_assd_hd95` use `np.squeeze(mask)` to remove extra dimensions. `np.squeeze` removes all size-1 dimensions, not just the channel dimension. If a spatial dimension happens to be size 1 (e.g., a single-slice volume, or a region touching only one slice), that spatial dimension is also removed, producing incorrect surface/distance results. A safer approach is to explicitly remove the known channel dimension, e.g., `mask[0]` (since segmentations always have shape `(1, *spatial_dims)`). (`evaluate_predictions.py`, `_get_surface` and `_compute_assd_hd95`)
- No opt-out for expensive metrics: EDT-based surface distance is significantly more expensive than the voxel-counting metrics (Dice/IoU). Making it mandatory in `compute_metrics` means every evaluation (including the multiprocessing pool in `compute_metrics_on_folder`) pays this cost with no way to disable it. For large 3D datasets with many labels, this could increase evaluation time substantially. Consider making the surface metrics optional (e.g., via a parameter flag) or computing them only when explicitly requested.
- HD95 definition differs from common convention: the implementation computes HD95 as the 95th percentile of the concatenated bidirectional distances. The more common definition in the medical imaging literature (e.g., as used in the Medical Segmentation Decathlon) is `max(P95(d_ref_to_pred), P95(d_pred_to_ref))`. The concatenated approach can mask asymmetric errors. If this is intentional, it should be documented; otherwise, consider aligning with the standard definition.
- Redundant ignore mask application: the PR adds explicit ignore-mask filtering of `mask_ref` and `mask_pred` before passing them to `compute_tp_fp_fn_tn`, which already applies the ignore mask internally. While numerically equivalent (the operations are idempotent), this is confusing. The new masking is needed for the surface distance functions, but consider applying it only to the copies passed to `_compute_assd_hd95` rather than modifying the masks used by the existing Dice/IoU path. (`evaluate_predictions.py`, lines in `compute_metrics` around the new `if ignore_mask is not None` block)
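The squeeze hazard above is easy to demonstrate. A minimal numpy sketch (the shapes are illustrative, following the `(1, *spatial_dims)` convention the review describes):

```python
import numpy as np

# A segmentation with shape (1, *spatial), where one spatial dim happens to
# be 1 as well (e.g. a single-slice 3D volume)
mask = np.zeros((1, 1, 64, 64), dtype=bool)

# np.squeeze drops ALL size-1 axes, collapsing the spatial dim too:
print(np.squeeze(mask).shape)  # (64, 64) -- the 3D volume silently became 2D

# Indexing away only the known channel axis preserves spatial dimensionality:
print(mask[0].shape)  # (1, 64, 64)
```

Any downstream surface extraction would then run on the wrong dimensionality in the first case, with spacing axes misaligned as well.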
Domain-Specific Notes
- Spacing axis alignment: the spacing from `seg_ref_dict['spacing']` follows nnUNet's internal convention (axes match the numpy array order). The `_normalize_spacing` function handles dimension mismatches by truncating or padding. For 2D data where the first spacing element is the artificial `999` placeholder, `_normalize_spacing` correctly drops it after squeezing. However, this relies on `np.squeeze` producing the correct dimensionality, which reinforces that the squeeze issue above is critical.
- Surface extraction on empty masks after ignore masking: if `ignore_mask` zeroes out an entire class, `mask_ref.sum() == 0` correctly returns `(nan, nan)` for ASSD/HD95. Good.
- RVE when `vol_ref == 0`: the edge case returns `0.0` if both volumes are zero, `np.inf` otherwise. Returning `np.inf` may cause issues downstream when computing means (in `compute_metrics_on_folder`, `np.nanmean` is used, but `inf` is not `nan`: it will propagate as `inf` into mean values). Consider returning `np.nan` instead of `np.inf` for consistency with the other metrics' empty-case handling.
Minor Suggestions
- The `_normalize_spacing` helper constructs a list and then converts it to a tuple on the padding path (`tuple([1.0] * (ndim - len(spacing)) + list(spacing))`). Minor, but `(1.0,) * (ndim - len(spacing)) + spacing` is cleaner.
- Consider adding a brief docstring to `_compute_assd_hd95` describing which HD95 variant is being computed (combined vs. directed), so future readers understand the choice.
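For reference, a minimal sketch of the conventional (max-of-directed-percentiles) HD95 using the same scipy primitives the PR relies on. Function and variable names here are illustrative, not the PR's actual code, and edge cases (empty masks) are omitted:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def _surface(mask: np.ndarray) -> np.ndarray:
    # Boundary voxels: the mask minus its erosion (simplified sketch)
    return mask & ~binary_erosion(mask)


def hd95_conventional(mask_ref: np.ndarray, mask_pred: np.ndarray, spacing) -> float:
    """Max of the two directed 95th-percentile surface distances (spacing-aware)."""
    surf_ref, surf_pred = _surface(mask_ref), _surface(mask_pred)
    # The EDT of the complement gives, at every voxel, the (spacing-aware)
    # distance to the nearest surface voxel of the respective mask
    dist_to_pred = distance_transform_edt(~surf_pred, sampling=spacing)
    dist_to_ref = distance_transform_edt(~surf_ref, sampling=spacing)
    d_ref_to_pred = dist_to_pred[surf_ref]   # directed: ref surface -> pred surface
    d_pred_to_ref = dist_to_ref[surf_pred]   # directed: pred surface -> ref surface
    return max(np.percentile(d_ref_to_pred, 95), np.percentile(d_pred_to_ref, 95))
```

Unlike the concatenated variant, each direction is reduced to its own 95th percentile before taking the maximum, so a large error in only one direction cannot be diluted by the other.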
Awesome, that went better than expected. I like all of Claude's suggestions. In particular, the correctness of the HD95 implementation should be checked, and I would really like these advanced metrics to be opt-in due to their high computational overhead. Best,
Hello @FabianIsensee, thanks for taking the time to look at this. Interesting use of Claude too, given the amount of thankless maintenance you do on the repo. Isn't it expensive, though? I agree with the findings overall: this would be better as an opt-in due to the computational overhead, and defaulting to `np.inf` may cause aggregation issues. How about I work on addressing these points over the weekend and explore the suggested metrics (including NSD, AVE)? My current implementation borrows heavily from DeepMind's surface-distance implementation, by the way. Best,
Hey Toufiq,