Evaluation


Below we describe the evaluation metrics used to assess methods for each of the tasks. The evaluation code is available on GitHub (see the section Evaluation code online below). More information on the evaluation can also be found in the challenge design document.

Important notes

  • For the segmentation predictions, the expected output is a 3D image with an intensity range between 0 and 1. If this is not the case, the output will not be considered in the evaluation and default worst results will be assigned to that case.
  • To quantify the evaluation measures, the segmentation outputs will be binarised for all metrics except the volume-difference measures, for which a clipped version of the output will be used for assessment (see the preprocessing sketch after this list).
  • The uncertainty measures (Task 3) will not be evaluated during the validation phase.
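
The sketch below is a minimal, unofficial illustration of these notes in Python/NumPy. It assumes the submitted 3D image has already been loaded into a NumPy array `pred`; the helper name `prepare_prediction` and the default 0.5 threshold are ours, not part of the evaluation code.

```python
import numpy as np

def prepare_prediction(pred, threshold=0.5):
    """Range-check, clip, and binarise a predicted 3D segmentation.

    Illustrative helper, not the official evaluation code.
    """
    pred = np.asarray(pred, dtype=np.float32)
    # Predictions outside the [0, 1] range are not evaluated and receive the
    # default worst results for that case.
    if pred.min() < 0.0 or pred.max() > 1.0:
        return None
    clipped = np.clip(pred, 0.0, 1.0)                 # used for the volume-difference measures
    binary = (clipped >= threshold).astype(np.uint8)  # used for all other metrics
    return binary, clipped
```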

Evaluation code online

The evaluation code for all tasks can be found in the Where is Valdo GitHub repository (the code will soon be updated to also include the uncertainty metrics).


Description of evaluation measures per task


Task #1 Enlarged perivascular spaces detection and segmentation
For the detection and segmentation of enlarged perivascular spaces, the following metrics will be used:

  • Dice Similarity Coefficient (DSC) - Volumetry
  • Absolute Volume Difference - Volumetry
  • Detection F1 - Detection
  • Absolute Count Difference - Detection


  • We require submitted methods to take MRI scans as input and to output, for each image, a corresponding predicted segmentation mask. We will threshold the predicted segmentation masks at 0.5; a sketch of how the volumetry metrics could be computed is shown below.
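
As an unofficial illustration of the two volumetry metrics, the sketch below assumes `pred` (the raw prediction in [0, 1]) and `ref` (the binary reference mask) are NumPy arrays of the same shape; the `voxel_volume` parameter and the handling of empty masks are our assumptions, not the official implementation.

```python
import numpy as np

def dice_similarity_coefficient(pred, ref, threshold=0.5):
    """DSC between a thresholded prediction and a binary reference mask."""
    pred_bin = np.asarray(pred) >= threshold   # predictions are thresholded at 0.5
    ref_bin = np.asarray(ref).astype(bool)
    intersection = np.logical_and(pred_bin, ref_bin).sum()
    denom = pred_bin.sum() + ref_bin.sum()
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def absolute_volume_difference(pred, ref, voxel_volume=1.0):
    """Absolute difference between predicted and reference volume."""
    # Volume differences use a clipped (non-binarised) version of the prediction.
    pred_volume = np.clip(pred, 0.0, 1.0).sum() * voxel_volume
    ref_volume = np.asarray(ref).astype(bool).sum() * voxel_volume
    return abs(pred_volume - ref_volume)
```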

Task #2 Cerebral microbleeds detection and segmentation
For the detection and segmentation of cerebral microbleeds, the following metrics will be used:

  • Dice Similarity Coefficient (DSC) - Volumetry
  • Absolute Volume Difference - Volumetry
  • Detection F1 - Detection
  • Absolute Count Difference - Detection


  • We require submitted methods to take MRI scans as input and to output, for each image, a corresponding predicted segmentation mask. We will threshold the predicted segmentation masks at 0.5; a sketch of how the detection metrics could be computed is shown below.
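
The detection metrics count lesions rather than voxels. The unofficial sketch below assumes that a lesion corresponds to a connected component and that any voxel overlap between a predicted and a reference component counts as a detection; the actual matching criterion used by the evaluation code may differ.

```python
import numpy as np
from scipy import ndimage

def detection_metrics(pred_bin, ref_bin):
    """Return (detection F1, absolute count difference) for binary 3D masks.

    Illustrative sketch: lesions are connected components, and any voxel
    overlap between a predicted and a reference lesion counts as a hit.
    """
    pred_labels, n_pred = ndimage.label(pred_bin)
    ref_labels, n_ref = ndimage.label(ref_bin)

    overlap = (pred_labels > 0) & (ref_labels > 0)
    tp = np.unique(ref_labels[overlap]).size             # reference lesions detected
    fn = n_ref - tp                                       # reference lesions missed
    fp = n_pred - np.unique(pred_labels[overlap]).size    # predictions hitting nothing

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return f1, abs(n_pred - n_ref)
```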

Task #3 Lacunes detection, segmentation and uncertainty
For the detection and segmentation of lacunes, the following evaluation metrics will be used:

  • Dice Similarity Coefficient (DSC) - Volumetry
  • Absolute Volume Difference - Volumetry
  • Detection F1 - Detection
  • Uncertainty Dice Similarity Coefficient (DSC) - Uncertainty
  • Uncertainty Detection F1 - Uncertainty

  • We will quantify epistemic uncertainty and aleatoric uncertainty. Only the epistemic uncertainty metrics will be included in the ranking, but both uncertainty metrics will be presented at MICCAI, mentioned on the leaderboard and included in the analysis in the challenge paper.

    We define the following uncertainty evaluation terms for computing the DSC and F1 for epistemic uncertainty:
    - TP_uncertainty: incorrect and uncertain
    - TN_uncertainty: correct and certain
    - FP_uncertainty: correct and uncertain
    - FN_uncertainty: incorrect and certain

    And the following terms for computing the DSC and F1 for aleatoric uncertainty:
    - TP_uncertainty: uncertain and raters disagree
    - TN_uncertainty: certain and raters agree
    - FP_uncertainty: uncertain and raters agree
    - FN_uncertainty: certain and raters disagree

    We require submitted methods to take MRI scans as input and to output, for each image, a corresponding predicted segmentation mask (values ranging from 0 = background to 1 = lacune) and an uncertainty map (0 = certain, 1 = uncertain). We will threshold the predicted segmentation masks at 0.5, as well as the uncertainty maps; a sketch of the uncertainty terms defined above is shown below.
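
The unofficial sketch below illustrates how the epistemic uncertainty terms defined above could be turned into an Uncertainty DSC. It assumes voxel-wise definitions of "correct" and "uncertain" and binarised (0.5-thresholded) prediction, reference and uncertainty maps of the same shape; the function names and the voxel-wise granularity are our assumptions, not the official evaluation code.

```python
import numpy as np

def epistemic_uncertainty_terms(pred_bin, ref_bin, uncertainty_bin):
    """Voxel-wise uncertainty terms for epistemic uncertainty (illustrative).

    For the aleatoric terms, `incorrect` would be replaced by a voxel-wise
    mask of rater disagreement.
    """
    incorrect = np.asarray(pred_bin).astype(bool) ^ np.asarray(ref_bin).astype(bool)
    uncertain = np.asarray(uncertainty_bin).astype(bool)   # uncertainty map >= 0.5

    tp_u = ( incorrect &  uncertain).sum()   # incorrect and uncertain
    tn_u = (~incorrect & ~uncertain).sum()   # correct and certain
    fp_u = (~incorrect &  uncertain).sum()   # correct and uncertain
    fn_u = ( incorrect & ~uncertain).sum()   # incorrect and certain
    return tp_u, tn_u, fp_u, fn_u

def uncertainty_dsc(tp_u, fp_u, fn_u):
    """2*TP / (2*TP + FP + FN) from the uncertainty terms.

    The Uncertainty Detection F1 follows the same formula but, presumably,
    with lesion-level rather than voxel-level counts.
    """
    denom = 2 * tp_u + fp_u + fn_u
    return 1.0 if denom == 0 else 2.0 * tp_u / denom
```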