
Extended Metrics Beyond Dice and IoU in Summary Reporting [HD95, RVE, ASSD]#3007

Open
toufiqmusah wants to merge 2 commits into MIC-DKFZ:master from CALADAN-AREPO:extended-metrics

Conversation

@toufiqmusah

  • Extend evaluation metrics to include ASSD, HD95, and signed RVE alongside Dice/IoU.
  • Compute spacing-aware surface distances and respect ignore labels.
  • Print new metrics in final validation summary output.
  • Metrics flow into summary.json and foreground/class means automatically.
  • Final validation now prints Dice, ASSD, HD95, and RVE.

@FabianIsensee FabianIsensee self-assigned this Mar 26, 2026
@FabianIsensee
Member

Hey Toufiq, thanks for the PR! We are currently testing claude for automated PR review and I would like to use your PR to test this. Just as a quick warning. Things may go wrong haha

@FabianIsensee
Member

@claude what do we think?

@FabianIsensee
Member

@claude let's try this again

Contributor

@github-actions github-actions bot left a comment


🔍 nnU-Net Code Review

Summary

This PR extends the evaluation pipeline to compute ASSD, HD95, and signed Relative Volume Error (RVE) alongside the existing Dice and IoU metrics. New metrics are computed via scipy's distance_transform_edt and binary_erosion, and are logged during final validation in nnUNetTrainer.

Contribution Guidelines

  • General applicability: HD95, ASSD, and RVE are widely used metrics in medical image segmentation, so these additions are broadly useful. This aligns with the "generally valid across datasets" principle.
  • Performance claims / benchmarks: The PR does not discuss the performance impact of adding mandatory EDT-based surface distance computation for every evaluation. Surface distance computation on 3D volumes can be orders of magnitude slower than Dice/IoU (which only require sums). Benchmarks showing the impact on evaluation time are missing and should be provided, especially since this is not optional.
  • Prior discussion: There is no linked GitHub issue. Given that this changes the behavior of a core pipeline component (evaluation) for all users, prior discussion would have been appropriate per CONTRIBUTING.md.

Key Findings

  • np.squeeze is unsafe (correctness risk): _get_surface and _compute_assd_hd95 use np.squeeze(mask) to remove extra dimensions. np.squeeze removes all size-1 dimensions, not just the channel dimension. If a spatial dimension happens to be size 1 (e.g., a single-slice volume, or a region touching only one slice), that spatial dimension is also removed, producing incorrect surface/distance results. A safer approach is to explicitly remove the known channel dimension, e.g., mask[0] (since segmentations always have shape (1, *spatial_dims)).

    (evaluate_predictions.py, _get_surface and _compute_assd_hd95)
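    The difference can be demonstrated with a toy single-slice volume (a minimal standalone sketch, not code from the PR):

    ```python
    import numpy as np

    # nnU-Net segmentations have shape (1, *spatial_dims); the channel axis
    # and a size-1 spatial axis are indistinguishable to np.squeeze.
    mask = np.zeros((1, 1, 64, 64), dtype=bool)  # single-slice 3D volume

    squeezed = np.squeeze(mask)   # removes BOTH size-1 axes -> (64, 64)
    channel_removed = mask[0]     # removes only the channel -> (1, 64, 64)

    print(squeezed.shape)         # (64, 64): a spatial axis was silently lost
    print(channel_removed.shape)  # (1, 64, 64): spatial structure preserved
    ```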

  • No opt-out for expensive metrics — EDT-based surface distance is significantly more expensive than the voxel-counting metrics (Dice/IoU). Making it mandatory in compute_metrics means every evaluation (including the multiprocessing pool in compute_metrics_on_folder) pays this cost with no way to disable it. For large 3D datasets with many labels, this could increase evaluation time substantially. Consider making the surface metrics optional (e.g., via a parameter flag) or computing them only when explicitly requested.
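    One possible shape for such an opt-in, sketched here with a hypothetical `compute_surface_metrics` flag and a stubbed-out surface helper (both names are illustrative assumptions, not the actual nnU-Net API):

    ```python
    import numpy as np

    def _surface_metrics_stub(mask_ref, mask_pred, spacing):
        # stand-in for the expensive EDT-based ASSD/HD95 computation
        return np.nan, np.nan

    def compute_metrics(mask_ref, mask_pred, spacing,
                        compute_surface_metrics: bool = False) -> dict:
        # cheap voxel-counting metrics are always computed
        tp = np.count_nonzero(mask_ref & mask_pred)
        fp = np.count_nonzero(~mask_ref & mask_pred)
        fn = np.count_nonzero(mask_ref & ~mask_pred)
        denom = 2 * tp + fp + fn
        results = {'Dice': 2 * tp / denom if denom > 0 else np.nan}
        if compute_surface_metrics:  # pay the EDT cost only when asked
            results['ASSD'], results['HD95'] = _surface_metrics_stub(
                mask_ref, mask_pred, spacing)
        return results

    ref = np.zeros((4, 4), dtype=bool)
    ref[1:3, 1:3] = True
    pred = ref.copy()
    print(compute_metrics(ref, pred, (1.0, 1.0)))  # {'Dice': 1.0}, no surface cost
    ```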

  • HD95 definition differs from common convention — The implementation computes HD95 as the 95th percentile of the concatenated bidirectional distances. The more common definition in the medical imaging literature (e.g., as used in the Medical Segmentation Decathlon) is max(P95(d_ref_to_pred), P95(d_pred_to_ref)). The concatenated approach can mask asymmetric errors. If this is intentional, it should be documented; otherwise, consider aligning with the standard definition.
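    A small synthetic example (assumed toy distances, not real surface data) shows how the concatenated variant can dilute a one-sided error:

    ```python
    import numpy as np

    # toy directed surface distances: the reference-to-prediction direction
    # contains one large asymmetric error
    d_ref_to_pred = np.array([0.5, 0.5, 0.5, 10.0])
    d_pred_to_ref = np.array([0.5, 0.5, 0.5, 0.5])

    # variant in the PR: 95th percentile over the concatenated distances
    hd95_combined = np.percentile(
        np.concatenate([d_ref_to_pred, d_pred_to_ref]), 95)

    # common convention (e.g. Medical Segmentation Decathlon):
    # max of the two directed 95th percentiles
    hd95_directed = max(np.percentile(d_ref_to_pred, 95),
                        np.percentile(d_pred_to_ref, 95))

    # combined ~6.68 vs directed ~8.58: the directed variant exposes the asymmetry
    print(hd95_combined, hd95_directed)
    ```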

  • Redundant ignore mask application — The PR adds explicit ignore-mask filtering of mask_ref and mask_pred before passing them to compute_tp_fp_fn_tn, which already applies the ignore mask internally. While numerically equivalent (the operations are idempotent), this is confusing. The new masking is needed for the surface distance functions, but consider applying it only to the copies passed to _compute_assd_hd95 rather than modifying the masks used by the existing Dice/IoU path.

    (evaluate_predictions.py, lines in compute_metrics around the new if ignore_mask is not None block)

Domain-Specific Notes

  • Spacing axis alignment: The spacing from seg_ref_dict['spacing'] follows nnUNet's internal convention (axes match the numpy array order). The _normalize_spacing function handles dimension mismatches by truncating or padding. For 2D data where the first spacing element is the artificial 999 placeholder, _normalize_spacing correctly drops it after squeezing. However, this relies on np.squeeze producing the correct dimensionality — reinforcing that the squeeze issue above is critical.

  • Surface extraction on empty masks after ignore masking: If ignore_mask zeroes out an entire class, mask_ref.sum() == 0 correctly returns (nan, nan) for ASSD/HD95. Good.

  • RVE when vol_ref == 0: The edge case returns 0.0 if both are zero, np.inf otherwise. Returning np.inf may cause issues downstream when computing means (in compute_metrics_on_folder, np.nanmean is used, but inf is not nan — it will propagate as inf into mean values). Consider returning np.nan instead of np.inf for consistency with the other metrics' empty-case handling.
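    A minimal sketch of the suggested handling (hypothetical helper; the np.nan return for a false-positive prediction on an empty reference is the proposed change, not the PR's current behavior):

    ```python
    import numpy as np

    def relative_volume_error(vol_pred: float, vol_ref: float) -> float:
        if vol_ref == 0:
            # np.nan instead of np.inf so that nanmean aggregation stays finite
            return 0.0 if vol_pred == 0 else np.nan
        return (vol_pred - vol_ref) / vol_ref

    print(relative_volume_error(110.0, 100.0))  # 0.1 (10% over-segmentation)
    print(relative_volume_error(0.0, 0.0))      # 0.0
    # the nan case is simply skipped by nanmean:
    print(np.nanmean([0.1, relative_volume_error(5.0, 0.0)]))  # 0.1
    ```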

Minor Suggestions

  • The _normalize_spacing helper constructs a list then converts to tuple on the padding path (tuple([1.0] * (ndim - len(spacing)) + list(spacing))). Minor, but (1.0,) * (ndim - len(spacing)) + spacing is cleaner.

  • Consider adding a brief docstring to _compute_assd_hd95 describing which HD95 variant is being computed (combined vs. directed), so future readers understand the choice.

@FabianIsensee
Member

Awesome, that went better than expected. I like all the suggestions by claude. In particular, the correctness of the HD95 implementation should be checked, and I would really want these advanced metrics to be opt-in due to their high computational overhead.
For RVE we should discuss what the intended behavior for missing classes should be (np.nan vs np.inf) and what potential effects this may have on aggregation. What are your thoughts on this @toufiqmusah ?
If we report RVE, why not also include AVE (absolute volume error) as well? It might be useful for some cases and is easy enough to add.
In the same spirit: NSD@1mm might also be a sensible addition, given that the Metrics Reloaded project recommends it in conjunction with Dice. The NSD implementation by Deepmind also uses surface distances, so there is some potential for synergies.
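For reference, an unweighted toy sketch of surface Dice at a tolerance (the DeepMind implementation additionally weights surface elements by area; the function name and the distance values here are purely illustrative):

```python
import numpy as np

def nsd(d_ref_to_pred: np.ndarray, d_pred_to_ref: np.ndarray,
        tol: float = 1.0) -> float:
    # fraction of surface points whose closest distance to the other
    # surface lies within the tolerance (unweighted simplification)
    overlap = np.sum(d_ref_to_pred <= tol) + np.sum(d_pred_to_ref <= tol)
    return overlap / (len(d_ref_to_pred) + len(d_pred_to_ref))

# toy directed surface distances in mm: 5 of 6 points are within 1 mm
print(nsd(np.array([0.2, 0.8, 3.0]), np.array([0.5, 0.5, 0.5])))
```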

Best,
Fabian

@toufiqmusah
Author

Hello @FabianIsensee,

thanks for taking the time to look at this. Interesting use of claude too, given the amount of thankless maintenance you do on the repo. Isn't it expensive though? I agree with the findings overall. This would be better as an opt-in due to computational overheads, and defaulting to np.inf may cause some aggregation issues. How about np.nanmean instead of np.nan?

I will work on addressing the challenges over the weekend, and explore the suggested metrics (including NSD, AVE). My current implementation heavily borrows from the deepmind surface-distance implementations, by the way.

Best,
Toufiq

@FabianIsensee
Member

Hey Toufiq,
we are just trying claude to see how much it helps us. If it saves a lot of time, the cost is worth it. It takes about 0.25 EUR per response in the current setup. If you compare that to what one of us maintainers costs, it can be a good deal. But like I said, we are trying it out and will evaluate once we have some data :-)
If np.inf values are present in RVE then the aggregation is unclear. np.inf is a valid output (FP or FN prediction of a class) and it should be up to the user to decide how they handle it. They still get per-case metrics in the summary.json, so they can figure this out themselves. np.nanmean would be the correct aggregation scheme here. The user would, in the presence of np.inf, get np.inf as the aggregated result, and that should trigger them to think about how they want to handle it.
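The aggregation behaviour in question, with toy per-case values:

```python
import numpy as np

# np.nanmean skips nan but propagates inf, so an FP/FN case
# surfaces in the aggregated result instead of being hidden
per_case_rve = [0.05, -0.10, np.nan, np.inf]

print(np.nanmean(per_case_rve))            # inf: the FP prediction dominates
print(np.nanmean([0.05, -0.10, np.nan]))   # ~ -0.025 once the nan is skipped
```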
Honestly, I personally don't like volume metrics. They can only ever be used as supplementary information and never standalone because they do not check whether the predicted volume is actually what we wanted to segment, so using something like the Dice is mandatory.
BTW the same aggregation issue also arises for FP/FN predictions with ASSD and HD95. It is unclear how they should behave in these cases. Aggregation becomes meaningless because you can't determine a distance to something that does not exist. HD95/ASSD only really make sense for structures that you would expect to be present and predicted in each image, like organs. I would always pick NSD over these for pathologies because NSD handles missing pred/gt gracefully.
Best,
Fabian
