Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective

Shengjia Chen [1,2] (shengjia.chen@icahn.mssm.edu), Gabriele Campanella, Abdulkadir Elmas, Aryeh Stock, Jennifer Zeng, Alexandros D. Polydorides, Adam J. Schoenfeld, Kuan-lin Huang, Jane Houldsworth, Chad Vanderbilt, Thomas J. Fuchs

[1] Windreich Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
[2] Hasso Plattner Institute at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
[3] Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
[4] Department of Pathology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
[5] Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States
[6] Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States

Abstract

Recent advances in artificial intelligence (AI), in particular self-supervised learning of foundation models (FMs), are revolutionizing medical imaging and computational pathology (CPath). A constant challenge in the analysis of digital Whole Slide Images (WSIs) is the problem of aggregating tens of thousands of tile-level image embeddings into a slide-level representation. Because method development relies heavily on datasets created for genomic research, such as TCGA, the performance of these techniques on diagnostic slides from clinical practice has been inadequately explored. This study conducts a thorough benchmarking analysis of ten slide-level aggregation techniques across nine clinically relevant tasks, including diagnostic assessment, biomarker classification, and outcome prediction. The results yield the following key insights: (1) Embeddings derived from domain-specific (histological images) FMs outperform those from generic ImageNet-based models across aggregation methods. (2) Spatially-aware aggregators enhance performance significantly when using ImageNet pre-trained models but not when using FMs. (3) No single model excels in all tasks, and spatially-aware models do not show the general superiority that might be expected. These findings underscore the need for more adaptable and universally applicable aggregation techniques, guiding future research towards tools that better meet the evolving needs of clinical AI in pathology. The code used in this work is available at https://github.com/fuchs-lab-public/CPath_SABenchmark.

keywords:
Computational Pathology, Histopathological Image Analysis, Embedding Aggregation, Benchmark Analysis

1 Introduction

Advancements in deep learning have revolutionized the field of computational pathology (CPath), particularly in the analysis of whole slide images (WSIs) [1, 2]. Due to their gigapixel resolution, WSIs are usually divided into small tiles for analysis, and weakly supervised learning is a popular training strategy to leverage slide-level supervision without the need for pixel-level annotations [3, 4]. Most applications of weakly supervised learning in pathology focus on training slide-level aggregators and using a pre-trained vision model to encode tiles into feature vectors [3, 4, 5, 6]. A notable trend enhancing this capability is the adoption of self-supervised learning (SSL) techniques to train foundation models (FMs) on large-scale, domain-specific datasets [7, 8, 9]. These models, pre-trained on extensive datasets, provide a robust foundation for task-specific fine-tuning with transfer learning. However, a critical limitation in the field is its reliance on public datasets for downstream task performance evaluation, which may not generalize well to a clinical setting. Datasets like TCGA, while invaluable for genomic research, may not be ideal for histological analysis. This limitation is not only due to potential biases from high tumor prevalence but also stems from the use of legacy scanning techniques that result in poor cell-level resolution and from the reliance on frozen section tissues. These factors collectively limit the accuracy and generalizability of histological studies built on such datasets.

Related Work

Recent reviews and benchmark analyses have extensively evaluated the performance and limitations of AI algorithms in CPath using WSI datasets. Laleh et al. [6] applied six weakly supervised algorithms to six clinically significant tasks, driven by a systematic literature review for unbiased selection. Yet, their emphasis on end-to-end pipelines overshadowed the slide-level aggregation phase, and they did not explore the benefits of foundation models. Bilal et al. [2] proposed a comprehensive CPath workflow, assessing seven methods on a single TCGA dataset, which might limit the applicability of the findings. Their study aimed at a fair comparison across various aggregation techniques but was constrained by dataset specificity. Saunders et al. [10] compared five multiple instance learning (MIL) algorithms on three public datasets, highlighting the advantage of ensemble methods for accuracy. However, their study did not engage with clinically relevant datasets. These studies reveal the need for benchmarking slide-level aggregation methods with a focus on clinically relevant tasks while incorporating embeddings from FMs. They also highlight a gap in discussing the interplay among various method groups and their evolution with artificial intelligence (AI) and computer vision (CV) techniques, which is essential for enhancing clinical applications.

Our Contributions

In this comprehensive benchmarking analysis, we evaluated ten slide aggregation techniques across nine clinically relevant tasks, including diagnostic assessment, biomarker classification, and outcome prediction. We selected these methods through a structured literature search, prioritizing techniques that advance embedding aggregation technology. Our objectives are summarized as follows:
1. Benchmarking Widely Used Aggregation Methods: We prioritized aggregation methods commonly used for comparison in recent work.
2. Assessment of Embeddings from FMs: To understand the impact of the embedding source on the performance of aggregation methods, we employed embeddings derived from four FMs: three pretrained on domain-specific histological images and one on the ImageNet dataset.
3. Providing Insights and Guidelines: We aim to provide recommendations for effectively using slide aggregation methods. We introduce an evolutionary tree of relevant methods highlighting the relationships among them.

2 Methods

Figure 1: Evolution of Slide Aggregation Methods in CPath (2017 - 2023). We track the progression of aggregation and embedding techniques, categorized by Key Instance, Attention, Cluster, Self-Attention, and Graph-based methods. Models benchmarked in this study are marked with a black outline. Colors and gradient colors denote method categories and their combinations, respectively; vertical placement shows chronological order, and horizontal lines indicate whether spatial information is integrated or not.

Problem Formulation

Let a WSI be denoted by $X$, representing a 'bag' in the MIL framework. The bag $X$ comprises a set of instances $\{x_1, x_2, \ldots, x_N\}$, where each instance $x_i$ is a tile extracted from $X$. An encoder function $f(\cdot)$ transforms each instance $x_i$ into a low-dimensional embedding, resulting in a set of embeddings $\{h_1, h_2, \ldots, h_N\}$, where $h_i = f(x_i)$ represents the feature vector for the $i$-th tile. Optionally, spatial information $s_i$ can be associated with each $x_i$, capturing its position within $X$. The aggregation function $g(\cdot)$ aggregates the set of all embeddings and optionally their spatial information to output a single vector $H$ that serves as the bag-level representation for the entire WSI $X$.
This can be expressed as: $H = g(\{(h_1, s_1), (h_2, s_2), \ldots, (h_N, s_N)\})$. The final prediction $Y$ for the bag $X$ is then obtained by applying a suitable classifier $c(\cdot)$ to $H$: $Y = c(H)$.
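
As a minimal sketch of this formulation, the snippet below chains an encoder $f$, an aggregator $g$, and a classifier $c$. It is illustrative only: `toy_encoder`, `mean_pool`, `logistic_classifier`, and `predict_slide` are hypothetical names, mean pooling is just one simple choice of $g$ (it is not one of the benchmarked aggregators), and the spatial information $s_i$ is ignored.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def toy_encoder(tile: Sequence[float]) -> Vector:
    """Hypothetical stand-in for f: maps a tile (a few pixel values) to [mean, max]."""
    return [sum(tile) / len(tile), max(tile)]


def mean_pool(embeddings: Sequence[Vector]) -> Vector:
    """A simple choice of g: average the tile embeddings h_1..h_N into one vector H."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(h[d] for h in embeddings) / n for d in range(dim)]


def logistic_classifier(H: Vector, weights: Vector, bias: float) -> float:
    """A simple choice of c: logistic regression on the slide-level vector H."""
    logit = sum(w * x for w, x in zip(weights, H)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def predict_slide(
    tiles: Sequence[Sequence[float]],
    f: Callable[[Sequence[float]], Vector],
    g: Callable[[Sequence[Vector]], Vector],
    c: Callable[[Vector], float],
) -> float:
    """Y = c(g({f(x_1), ..., f(x_N)})); spatial information s_i is not used here."""
    embeddings = [f(x) for x in tiles]
    H = g(embeddings)
    return c(H)


# Three toy "tiles" forming one bag X (values are made up for illustration).
tiles = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
H = mean_pool([toy_encoder(t) for t in tiles])
Y = predict_slide(
    tiles,
    toy_encoder,
    mean_pool,
    lambda H_: logistic_classifier(H_, weights=[1.0, -1.0], bias=0.0),
)
```

In practice $f$ is a frozen pretrained network and only $g$ and $c$ are trained on slide-level labels, which is the setting benchmarked here.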

Selection of Aggregation Methods:

To select aggregation methods for our study, we conducted a literature search on Google Scholar with the query “((deep learning) AND ((computational histopathology) OR (whole slide images)) AND (classification))” for publications between 2017 and 2023. While recognizing the potential of hierarchical models to fuse local and global features, we focused on embedding aggregation technologies that do not require instance-level labels, drawing from studies that use embeddings at a uniform magnification for consistency. Our analysis centers on methods with public, implementable code and those frequently used in recent scholarly work for benchmarking purposes. Details on the selected aggregation methods and their model sizes are provided in Appendix A, Table 2.

Figure 1 provides an overview of aggregation methods from 2017 to 2023. In the "Key Instance" category, Campanella et al. [4] process a selected subset of the most suspicious tiles sequentially, but do not fully consider the spatial relationships between tiles across the entire slide. Ilse et al. [11] proposed an attention-based aggregation with two fully trainable layers, considering the contributions of all instances through attention weights. VarMIL [7, 12] extended attention-based MIL with a variance module for tissue heterogeneity analysis. DS-MIL [13] employed a dual-stream approach, coupling max-pooling with attention scoring for instance evaluation. Cluster-based approaches like DeepMISL [14] and DeepAttnMISL [15] integrate phenotype-level information, with CLAM [5] addressing multi-class classification via attention for pseudo-label generation. Remix [16] reduces the number of instances in a WSI bag using patch cluster centroids and applies latent space augmentations. These methods focus on instance significance without considering the spatial distribution of patches in the WSI. AB-MIL [11] and CLAM [5] focus on the highest scoring instances in binary and multiclass settings, respectively, whereas DS-MIL [13] overlooks instance correlations.

In contrast, TransMIL [17] uses self-attention mechanisms within a transformer architecture to analyze spatial information, employing pyramid position encoding. However, its approach to position encoding lacks absolute spatial consistency across WSIs. DT-MIL [18] addresses this by incorporating absolute position features alongside a deformable transformer to boost spatial awareness efficiently. Similarly, SET-MIL [19] adopts a token-to-token vision transformer for extracting multi-scale context from WSIs, employing absolute position encoding for precise spatial representation. KAT [20] further advances this concept by matching tokens with positional kernels.

Spatially-aware graph methods enhance image analysis by representing patches as nodes. Patch-GCN [21], for instance, constructs graphs from adjacent patches and uses CNNs for effective local-to-global information aggregation, outperforming methods like DeepGraphConv [22] that depend on embedding space similarities. The incorporation of attention mechanisms, introduced by Velickovic et al. [23], has further advanced this domain. AttPool, applied in CPath, exemplifies this by selecting discriminative nodes for a hierarchical graph and employing attention-weighted pooling for graph representation. GTP [24] and EGT [25] extend these concepts by integrating adjacency matrices for graph construction and by applying self-attention for embedding aggregation on selected top-k critical tokens, respectively. Heterogeneous graph approaches, such as H2-MIL [26], offer innovative strategies for advanced graph representation. Methods such as DTFD-MIL [27] and Prompt-MIL [28], categorized under "Feature Enhancement", were not included in this study or in the evolutionary tree because their primary focus is on enhancing feature extraction rather than the aggregation stage. MHIM-MIL [29] and WENO [30] were excluded for similar reasons.
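
To make the attention-based aggregation of [11] concrete, the sketch below computes attention weights with two layers, $a_i = \mathrm{softmax}_i(w^\top \tanh(V h_i))$, and returns the weighted sum $H = \sum_i a_i h_i$. It is a plain-Python illustration under stated assumptions, not the benchmarked implementation: the function name `attention_pool` is hypothetical, and `V` and `w` are untrained, hand-picked values rather than learned parameters.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def attention_pool(
    embeddings: Sequence[Vector],
    V: Sequence[Vector],  # first layer: one row per hidden unit
    w: Vector,            # second layer: one weight per hidden unit
) -> Tuple[Vector, List[float]]:
    """Attention pooling: score each tile, softmax the scores, average the tiles."""
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(w_j * u_j for w_j, u_j in zip(w, hidden)))

    # Numerically stable softmax over the N tiles of the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(embeddings[0])
    H = [sum(a * h[d] for a, h in zip(weights, embeddings)) for d in range(dim)]
    return H, weights


# Toy bag of three 2-dimensional tile embeddings (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical, untrained parameters
w = [1.0, 1.0]
H, weights = attention_pool(embeddings, V, w)
```

Because the weights depend only on each tile's own embedding, this aggregator is permutation invariant and uses no tile coordinates, which is the property the spatially-aware transformer and graph methods above try to move beyond.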

Large-scale datasets and clinically relevant tasks:

To benchmark aggregation performance on clinical tasks, we collected nine datasets from two institutions, Mount Sinai Health System (MSHS) and Memorial Sloan Kettering Cancer Center (MSKCC). The MSHS slides were scanned on Philips Ultrafast scanners, while the slides from MSKCC were scanned on Leica Aperio AT2 scanners. The cohorts included are described below and summarized in Table 1. A histogram of the number of tiles per slide can be found in Appendix Figure 3.

3 Summary of Datasets

Table 1: Summary of benchmark datasets in this study. Slides lists positive (+) and negative (-) slide counts; Tiles lists the [min, max] number of tiles per slide. BCa: Breast Cancer, IBD: Inflammatory Bowel Disease, ER: Estrogen Receptor, EGFR: Epidermal Growth Factor Receptor, LUAD: Lung Adenocarcinoma, IO: Immunotherapy Response, NSCLC: Non-small cell lung cancer.

| Code | Origin | Task | Disease | Slides | Tiles [min, max] |
| --- | --- | --- | --- | --- | --- |
| BCa | MSHS | Disease Detection | Breast Cancer | +999, -999 | [22, 40086] |
| IBD | MSHS | Disease Detection | IBD | +717, -724 | [159, 16009] |
| BCa ER | MSHS | Replicative Biomarker | Breast Cancer | +1000, -1000 | [291, 30564] |
| BCa HER2 | MSHS | Replicative Biomarker | Breast Cancer | +1258, -760 | [291, 34946] |
| BCa PR | MSHS | Replicative Biomarker | Breast Cancer | +1033, -953 | [291, 30564] |
| BIOME BR HRD | MSHS | Replicative Biomarker | Breast Cancer | +375, -188 | [69, 37849] |
| MS LUAD EGFR | MSHS | Replicative Biomarker | LUAD | +103, -191 | [61, 45339] |
| MSK LUAD EGFR | MSKCC | Replicative Biomarker | LUAD | +307, -693 | [18, 44128] |
| NSCLC IO | MSKCC | Outcome Prediction | Lung Cancer | +86, -368 | [13, 44128] |

1. BCa: Breast cancer (BCa) detection cohort. Breast cancer blocks and normal breast blocks were obtained from the pathology laboratory information system. A total of 1998 slides were sampled, with 999 positive and 999 negative. The positive slides were selected from blocks that received the routine biomarker panel for cancer cases (estrogen receptor: ER, progesterone receptor: PR, human epidermal growth factor receptor 2: HER2, and Ki67), while negative slides were selected from breast cases that did not have an order for the routine panel. Additionally, negative cases were selected only if they were not mastectomy cases, did not have a synoptic report associated with the case, and had no mention of cancer or carcinoma in the report.
2. IBD: Inflammatory Bowel Disease (IBD) detection cohort. Normal mucosa samples were obtained from patients undergoing screening and routine surveillance lower endoscopy from 2018 to 2022. IBD cases, including first diagnoses and follow-ups, were included. A total of 1441 slides were sampled, 717 with active inflammation and 724 with normal mucosa.
3. BCa ER, BCa PR, BCa HER2: BCa biomarker prediction cohorts. Breast cancer cases are routinely assessed for ER, PR, and HER2 status using immunohistochemistry (IHC) and Fluorescence In Situ Hybridization (FISH). Results for each biomarker were automatically extracted from the pathology report.
4. BioMe BR HRD: Breast (BR) Homologous Repair Deficiency (HRD) prediction cohort. Mount Sinai BioMe is a whole-exome sequencing cohort of 30k individuals, where carriers of pathogenic and protein-truncating variants affecting HRD genes (BRCA1, BRCA2, BRIP1, PALB2, RAD51, RAD51C, RAD51D, ATM, ATR, CHEK1, CHEK2) were included as positives. A subset of BioMe patients with available breast pathology slides was included. Slides containing solely normal breast tissue and slides with breast cancer were both included.
5. LUAD EGFR: EGFR (Epidermal Growth Factor Receptor) mutation status prediction in Lung Adenocarcinoma (LUAD). Two datasets were collected, one from MSHS and one from MSKCC. For the MSHS dataset, a total of 294 slides were obtained from the MSHS clinical slide database, 103 positive and 191 negative. The cohort was built following the guidelines described in previous work [31] to map mutations to a binary target. The MSKCC dataset consists of 1,000 slides, with 307 positive and 693 negative, and is a random subset of the dataset described in [3]. See [3] for additional details.
6. NSCLC IO: Lung cancer immunotherapy response prediction. Non-small cell lung cancer (NSCLC) patients who received PD-L1 blockade-based immunotherapy were considered. Cytology specimens were excluded. The objective overall response was determined by RECIST [32], with the assessment performed by a blinded thoracic radiologist. A total of 454 slides were obtained, 86 positive and 368 negative.

Model Implementation Details:

We used the same experimental setup for all datasets and tasks. The embeddings used as input to the aggregation module were generated by four pretrained FMs:
1. Truncated ResNet50 (tres50_imagenet, dim: 1024), pretrained on ImageNet [5].
2. CTransPath (dim: 768), which integrates a CNN and a Swin Transformer, pretrained on 5.6 million tiles from the TCGA and PAIP datasets [33].
3. DINO-ViT small (dinosmall, dim: 384), pretrained on 1.6 billion histological images from over 420,000 clinical slides [8].
4. UNI (dim: 1024), which uses a ViT-L/16 trained via DINOv2, pretrained on over 100 million images from 100,000 WSIs from public and in-house datasets (Mass-100K) [9].

Each dataset was split using a Monte Carlo Cross-Validation (MCCV) strategy, where for each MCCV split 80% of the samples were used for training and the rest for validation. We generated 21 MCCV folds for each task; one was used exclusively for hyperparameter tuning, and the subsequent 20 folds were each run twice using fixed random seeds (0 and 2024) to assess model variance and robustness. All models were trained on a single A100 GPU for 40 epochs using the AdamW [34] optimizer. A cosine decay schedule with a 10-epoch warm-up was used for the learning rate and weight decay hyperparameters. The implementation of each method follows its official source, with a consistent training strategy across all experiments. The GPU memory usage of each method as a function of the number of tiles per slide is summarized in Appendix Figure 3.
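
The splitting and scheduling described above can be sketched as follows. This is an illustration under stated assumptions rather than the released training code: the function names `mccv_splits` and `warmup_cosine` are hypothetical, the splits are drawn without stratification, the warm-up is assumed to be linear, the first split is arbitrarily taken as the tuning split, and the base value `1e-4` is a placeholder, not the learning rate used in the study.

```python
import math
import random
from typing import List, Tuple


def mccv_splits(
    n_samples: int, n_splits: int = 21, train_frac: float = 0.8, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Monte Carlo cross-validation: independent random train/validation splits."""
    rng = random.Random(seed)
    n_train = round(train_frac * n_samples)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # each split is an independent reshuffle
        splits.append((sorted(indices[:n_train]), sorted(indices[n_train:])))
    return splits


def warmup_cosine(
    epoch: int,
    total_epochs: int = 40,
    warmup_epochs: int = 10,
    base_value: float = 1e-4,  # placeholder value (assumption)
    final_value: float = 0.0,
) -> float:
    """Linear warm-up for `warmup_epochs`, then cosine decay to `final_value`."""
    if epoch < warmup_epochs:
        return base_value * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_value + 0.5 * (base_value - final_value) * (1.0 + math.cos(math.pi * progress))


# Example with the size of the BCa cohort (1998 slides).
splits = mccv_splits(n_samples=1998)
tuning_split, evaluation_splits = splits[0], splits[1:]  # 1 for tuning, 20 for evaluation
schedule = [warmup_cosine(epoch) for epoch in range(40)]
```

Unlike k-fold cross-validation, MCCV validation sets may overlap between splits, so the 20 evaluation runs are not fully independent samples.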

Figure 2: Boxplots of AUC scores for the benchmarked aggregation methods versus the AB-MIL baseline across nine datasets, using two embedding groups. Scores are from 20 Monte Carlo cross-validation runs, averaged over two random seeds. One-sided t-tests compared each method against AB-MIL, with symbols indicating significant differences. The dotted orange line shows the AB-MIL average for reference. Methods follow the category order and colors of Figure 1.

4 Results

We assessed aggregation methods using the area under the receiver operating characteristic curve (AUC) averaged over 20 MCCV runs. The AUC is reported at the end of training (40 epochs). Boxplots were used to compare AUC distributions for embeddings generated by the four FMs, presented in the same subfigures of Figure 2. Overall, domain-specific embeddings (CTransPath, dinosmall, UNI) outperform tres50_imagenet across most tasks, regardless of the aggregator used. Additionally, using UNI generally yields better performance on the lung datasets. We observe that no method is consistently superior to the others across all datasets and embeddings. To compare different methods within the same embedding, one-sided t-tests were conducted to determine whether each method is significantly better than the AB-MIL baseline. In BCa and BCa PR detection, no consistent advantage of advanced methods over baseline approaches is observed, except that TransMIL and Patch-GCN showed a statistically significant improvement over the baseline ($p < 0.05$) with CTransPath input. In the BCa ER and IBD detection tasks, several methods are superior to the baseline AB-MIL for all four embeddings. Patch-GCN and AB-MIL_FC exhibit strong results in the MSK LUAD EGFR task. In the MS LUAD EGFR task, all methods perform on par with or worse than the baseline, with the dataset's smaller size (294 slides) potentially influencing these outcomes. In the BCa HER2 and BIOME HRD tasks, no significant differences are observed across embeddings and aggregators. However, domain-specific embeddings exhibit less variance in the box plots. In outcome prediction for NSCLC IO, VarMIL and TransMIL show superiority with dinosmall input, while AB-MIL_FC_big performs better with UNI; however, the improvements are marginal and exhibit high variability. To show the speed of convergence, an example curve of validation AUC during training is given in Appendix Figure 3.
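
The evaluation protocol can be illustrated with a short sketch: AUC computed as the probability that a positive slide is scored above a negative one, per-fold scores averaged over seeds, and a t statistic for the comparison against the baseline. Several points are assumptions: the text does not state whether the t-test was paired, so the paired form over matching MCCV folds is used here for illustration; the function names are hypothetical; and all numbers are synthetic, not results from this study.

```python
import math
import statistics
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_over_seeds(per_seed_aucs: Sequence[Sequence[float]]) -> List[float]:
    """Average the AUC of each fold over seeds (one inner list per seed)."""
    return [statistics.mean(fold) for fold in zip(*per_seed_aucs)]


def paired_t_statistic(method_aucs: Sequence[float], baseline_aucs: Sequence[float]) -> float:
    """t statistic of per-fold differences (assumes the differences are not all equal)."""
    diffs = [m - b for m, b in zip(method_aucs, baseline_aucs)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Synthetic examples only.
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])  # 3 of 4 pairs ordered correctly
fold_aucs = average_over_seeds([[0.8, 0.9], [0.9, 0.7]])
t_stat = paired_t_statistic([0.81, 0.84, 0.86], [0.80, 0.82, 0.83])
```

For a one-sided test at the 5% level with 20 folds (19 degrees of freedom), the method would be judged better than the baseline when the statistic exceeds roughly 1.73. A statistics library such as SciPy (`scipy.stats.ttest_rel` with `alternative="greater"`) returns the p-value directly.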

The performance of the selected methods on public datasets is available in their respective papers, and we aggregated a comprehensive array of these results in Supplementary Table 3. For illustration, we present the AUC scores reported on three widely used public datasets: CAMELYON16, TCGA-Lung, and TCGA-NSCLC. The reported AUC values vary considerably across sources, even for the same method on the same dataset. For instance, on CAMELYON16 the DS-MIL paper reports an AUC of 0.894 for DS-MIL, notably higher than the 0.806 it reports for MIL-RNN. In contrast, the TransMIL paper reports an AUC of 0.931 for TransMIL and lower AUCs of 0.818 and 0.889 for DS-MIL and MIL-RNN, respectively, which reverses the ranking of these two methods.
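
The spread can be quantified directly from Table 3. The sketch below collects the CAMELYON16 AUCs reported for DS-MIL and AB-MIL by each source paper and computes the gap between the highest and lowest reported value. The grouping of rows by source follows our reading of Table 3; the variable and function names are illustrative.

```python
from typing import Dict

# CAMELYON16 AUCs as listed in Table 3, keyed by method and then by source paper.
reported_camelyon16: Dict[str, Dict[str, float]] = {
    "DS-MIL": {
        "DS-MIL": 0.894,
        "TransMIL": 0.818,
        "DTFD-MIL": 0.899,
        "CWC-transformer": 0.894,
        "MHIM-MIL": 0.946,
        "WENO": 0.840,
    },
    "AB-MIL": {
        "DS-MIL": 0.865,
        "TransMIL": 0.876,
        "DTFD-MIL": 0.854,
        "CWC-transformer": 0.876,
        "MHIM-MIL": 0.940,
        "WENO": 0.838,
    },
}


def reported_spread(by_source: Dict[str, float]) -> float:
    """Difference between the highest and lowest AUC reported for one method."""
    return max(by_source.values()) - min(by_source.values())


spreads = {
    method: round(reported_spread(by_source), 3)
    for method, by_source in reported_camelyon16.items()
}
lowest_source_ds_mil = min(reported_camelyon16["DS-MIL"], key=reported_camelyon16["DS-MIL"].get)
highest_source_ds_mil = max(reported_camelyon16["DS-MIL"], key=reported_camelyon16["DS-MIL"].get)
```

A gap of more than 0.1 AUC for the same method on the same dataset is of the same order as the differences between methods within any single source, which is why the benchmark fixes embeddings, splits, and training strategy across all aggregators.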

5 Discussion

In this study, we employed ten aggregation methods, using embeddings from ImageNet and several domain-specific FMs, to benchmark performance across nine clinically relevant datasets. Our empirical evaluation reveals that: (1) for general disease detection tasks, attention mechanisms (AB-MIL) are effective, although additional spatial information can be incorporated at a computational cost; (2) transferring an FM pretrained on natural images directly to histological images without domain-specific tuning may degrade results, yet proficient aggregation methods can narrow this performance gap; (3) for more challenging tasks such as outcome prediction or replicative biomarker prediction, the inclusion of spatial information using current methods contributes only marginally to performance. Based on these findings, while it is clear that pathology FMs provide superior performance, it is not possible to recommend any particular aggregation method. We suggest using AB-MIL as a strong baseline and validating other methods on a case-by-case basis.

Despite the growing number of published aggregation algorithms, there is no clear evidence of a method that is consistently superior to AB-MIL. The inclusion of spatial information, while theoretically sound, has yet to yield the expected gains. It is possible that spatial information is not relevant for certain tasks, but this contradicts pathologists' intuition. More likely, current methods fall short, and better ways to incorporate spatial information across slides are needed. Future research should focus on developing methods that can better leverage spatial context and hierarchical structures designed for WSIs. In future work, we will expand our datasets to include a wider range of clinically relevant tasks, including survival analysis. We are developing infrastructure for secure benchmarking of external models on our clinical cohorts, which we plan to share with the community. We will explore the most recent FMs for more nuanced representations and investigate class activation map visualizations to enhance the interpretability and effectiveness of aggregation methods.

Acknowledgements

This work is supported in part through the use of the research platform AI-Ready Mount Sinai (AIR.MS) and the expertise provided by the team at the Hasso Plattner Institute for Digital Health at Mount Sinai (HPI.MS). This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.

Appendix A Summary of Selected Methods and Model Parameters

Table 2: A comparison of selected aggregation methods in terms of model size and number of parameters.

| Abbreviation | Authors | Group | Size (MB) | # Params (M) |
| --- | --- | --- | --- | --- |
| AB-MIL | Ilse et al. 2018 | Attention | 4.51 | 1.182 |
| DeepGraphConv | Li et al. 2018 | Graph | 3.01 | 0.789 |
| DS-MIL | Li et al. 2021 | Attention | 0.59 | 0.153 |
| TransMIL | Shao et al. 2021 | Self-Attention | 10.19 | 2.671 |
| CLAM | Lu et al. 2021 | Attention | 3.02 | 0.790 |
| Patch-GCN | Chen et al. 2021 | Graph | 5.27 | 1.38 |
| VarMIL | Schirris et al. 2022; Carmichael et al. 2022 | Attention | 2.02 | 0.529 |

Figure 3: A: Computational resources vs. tiles per slide; B: Histogram of the number of tiles per slide in each dataset; C: Validation AUC during training for BCa ER. The line is the mean validation AUC and the error bars show the standard error over 20 MCCV runs.

Appendix B Summary of Model Performance from Source Paper

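Table 3: Summary of model performance (AUC) as reported in several source papers. Each row gives the AUC that the source paper reports for the compared method.

| Source | Comparison | CAMELYON16 | TCGA-Lung | TCGA-NSCLC |
| --- | --- | --- | --- | --- |
| DS-MIL | AB-MIL | 0.865 | 0.977 | N/A |
| DS-MIL | MIL-RNN | 0.806 | 0.964 | N/A |
| DS-MIL | DS-MIL | 0.894 | 0.981 | N/A |
| TransMIL | AB-MIL | 0.876 | N/A | 0.866 |
| TransMIL | MIL-RNN | 0.889 | N/A | 0.912 |
| TransMIL | DS-MIL | 0.818 | N/A | 0.893 |
| TransMIL | CLAM-SB | 0.881 | N/A | 0.882 |
| TransMIL | CLAM-MB | 0.868 | N/A | 0.938 |
| TransMIL | TransMIL | 0.931 | N/A | 0.960 |
| DTFD-MIL | AB-MIL | 0.854 | 0.941 | N/A |
| DTFD-MIL | MIL-RNN | 0.875 | 0.894 | N/A |
| DTFD-MIL | DS-MIL | 0.899 | 0.939 | N/A |
| DTFD-MIL | CLAM-SB | 0.871 | 0.944 | N/A |
| DTFD-MIL | CLAM-MB | 0.878 | 0.949 | N/A |
| DTFD-MIL | TransMIL | 0.906 | 0.949 | N/A |
| DTFD-MIL | DTFD-MIL | 0.946 | 0.951 | N/A |
| CWC-transformer | AB-MIL | 0.876 | 0.866 | N/A |
| CWC-transformer | CLAM-SB | 0.881 | 0.882 | N/A |
| CWC-transformer | CLAM-MB | 0.868 | 0.916 | N/A |
| CWC-transformer | DS-MIL | 0.894 | 0.960 | N/A |
| CWC-transformer | TransMIL | 0.931 | 0.936 | N/A |
| CWC-transformer | CWC-transformer | 0.939 | 0.949 | N/A |
| MHIM-MIL | AB-MIL | 0.940 | 0.914 | N/A |
| MHIM-MIL | DS-MIL | 0.946 | 0.937 | N/A |
| MHIM-MIL | CLAM-SB | 0.947 | 0.937 | N/A |
| MHIM-MIL | CLAM-MB | 0.947 | 0.937 | N/A |
| MHIM-MIL | TransMIL | 0.935 | 0.925 | N/A |
| MHIM-MIL | DTFD-MIL | 0.952 | 0.938 | N/A |
| WENO | AB-MIL | 0.838 | 0.949 | N/A |
| WENO | MIL-RNN | N/A | 0.911 | N/A |
| WENO | DS-MIL | 0.840 | 0.963 | N/A |
| CAMIL | CLAM-SB | 0.933 | N/A | 0.972 |
| CAMIL | CLAM-MB | 0.938 | N/A | 0.973 |
| CAMIL | TransMIL | 0.950 | N/A | 0.974 |
| CAMIL | DTFD-MIL | 0.941 | N/A | 0.964 |
| CAMIL | GTP | 0.921 | N/A | 0.973 |
| CAMIL | CAMIL | 0.959 | N/A | 0.975 |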

References

  • Song et al. [2023] Song, A.H., Jaume, G., Williamson, D.F., Lu, M.Y., Vaidya, A., Miller, T.R., Mahmood, F.: Artificial intelligence for digital and computational pathology. Nature Reviews Bioengineering 1(12), 930–949 (2023)
  • Bilal et al. [2023] Bilal, M., Jewsbury, R., Wang, R., AlGhamdi, H.M., Asif, A., Eastwood, M., Rajpoot, N.: An aggregation of aggregation methods in computational pathology. Medical Image Analysis, 102885 (2023)
  • Campanella et al. [2018] Campanella, G., Silva, V.W.K., Fuchs, T.J.: Terabyte-scale deep multiple instance learning for classification and localization in pathology. arXiv preprint arXiv:1805.06983 (2018)
  • Campanella et al. [2019] Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25(8), 1301–1309 (2019)
  • Lu et al. [2021] Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M., Mahmood, F.: Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering 5(6), 555–570 (2021)
  • Laleh et al. [2022] Laleh, N.G., Muti, H.S., Loeffler, C.M.L., Echle, A., Saldanha, O.L., Mahmood, F., Lu, M.Y., Trautwein, C., Langer, R., Dislich, B., et al.: Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology. Medical image analysis 79, 102474 (2022)
  • Schirris et al. [2022] Schirris, Y., Gavves, E., Nederlof, I., Horlings, H.M., Teuwen, J.: Deepsmile: contrastive self-supervised pre-training benefits msi and hrd classification directly from h&e whole-slide images in colorectal and breast cancer. Medical Image Analysis 79, 102464 (2022)
  • Campanella et al. [2023] Campanella, G., Kwan, R., Fluder, E., Zeng, J., Stock, A., Veremis, B., Polydorides, A.D., Hedvat, C., Schoenfeld, A., Vanderbilt, C., et al.: Computational pathology at health system scale–self-supervised foundation models from three billion images. arXiv preprint arXiv:2310.07033 (2023)
  • Chen et al. [2023] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Chen, B., Zhang, A., Shao, D., Song, A.H., Shaban, M., et al.: A general-purpose self-supervised model for computational pathology. arXiv preprint arXiv:2308.15474 (2023)
  • Saunders et al. [2023] Saunders, A., Dash, S., Tsaris, A., Yoon, H.-J.: A comparison of histopathology imaging comprehension algorithms based on multiple instance learning. In: Medical Imaging 2023: Digital and Computational Pathology, vol. 12471, pp. 424–432 (2023). SPIE
  • Ilse et al. [2018] Ilse, M., Tomczak, J., Welling, M.: Attention-based deep multiple instance learning. In: International Conference on Machine Learning, pp. 2127–2136 (2018). PMLR
  • Carmichael et al. [2022] Carmichael, I., Song, A.H., Chen, R.J., Williamson, D.F., Chen, T.Y., Mahmood, F.: Incorporating intratumoral heterogeneity into weakly-supervised deep learning models via variance pooling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 387–397 (2022). Springer
  • Li et al. [2021] Li, B., Li, Y., Eliceiri, K.W.: Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328 (2021)
  • Yao et al. [2019] Yao, J., Zhu, X., Huang, J.: Deep multi-instance learning for survival prediction from whole slide images. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22, pp. 496–504 (2019). Springer
  • Yao et al. [2020] Yao, J., Zhu, X., Jonnagaddala, J., Hawkins, N., Huang, J.: Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Medical Image Analysis 65, 101789 (2020)
  • Yang et al. [2022] Yang, J., Chen, H., Zhao, Y., Yang, F., Zhang, Y., He, L., Yao, J.: Remix: A general and efficient framework for multiple instance learning based whole slide image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 35–45 (2022). Springer
  • Shao et al. [2021] Shao, Z., Bian, H., Chen, Y., Wang, Y., Zhang, J., Ji, X., et al.: TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems 34, 2136–2147 (2021)
  • Li et al. [2021] Li, H., Yang, F., Zhao, Y., Xing, X., Zhang, J., Gao, M., Huang, J., Wang, L., Yao, J.: DT-MIL: deformable transformer for multi-instance learning on histopathological image. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24, pp. 206–216 (2021). Springer
  • Zhao et al. [2022] Zhao, Y., Lin, Z., Sun, K., Zhang, Y., Huang, J., Wang, L., Yao, J.: SETMIL: spatial encoding transformer-based multiple instance learning for pathological image analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 66–76 (2022). Springer
  • Zheng et al. [2022] Zheng, Y., Li, J., Shi, J., Xie, F., Jiang, Z.: Kernel attention transformer (KAT) for histopathology whole slide image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 283–292 (2022). Springer
  • Chen et al. [2021] Chen, R.J., Lu, M.Y., Shaban, M., Chen, C., Chen, T.Y., Williamson, D.F., Mahmood, F.: Whole slide images are 2D point clouds: Context-aware survival prediction using patch-based graph convolutional networks. arXiv preprint arXiv:2107.13048 (2021)
  • Li et al. [2018] Li, R., Yao, J., Zhu, X., Li, Y., Huang, J.: Graph CNN for survival analysis on whole slide pathological images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 174–182 (2018). Springer
  • Velickovic et al. [2017] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
  • Zheng et al. [2022] Zheng, Y., Gindra, R.H., Green, E.J., Burks, E.J., Betke, M., Beane, J.E., Kolachalama, V.B.: A graph-transformer for whole slide image classification. IEEE Transactions on Medical Imaging 41(11), 3003–3015 (2022)
  • Ding et al. [2023] Ding, S., Li, J., Wang, J., Ying, S., Shi, J.: Multi-scale efficient graph-transformer for whole slide image classification. arXiv preprint arXiv:2305.15773 (2023)
  • Hou et al. [2022] Hou, W., Yu, L., Lin, C., Huang, H., Yu, R., Qin, J., Wang, L.: H^2-MIL: Exploring hierarchical representation with heterogeneous multiple instance learning for whole slide image analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 933–941 (2022)
  • Zhang et al. [2022] Zhang, H., Meng, Y., Zhao, Y., Qiao, Y., Yang, X., Coupland, S.E., Zheng, Y.: DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18802–18812 (2022)
  • Zhang et al. [2023] Zhang, J., Kapse, S., Ma, K., Prasanna, P., Saltz, J., Vakalopoulou, M., Samaras, D.: Prompt-MIL: Boosting multi-instance learning schemes via task-specific prompt tuning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 624–634 (2023). Springer
  • Tang et al. [2023] Tang, W., Huang, S., Zhang, X., Zhou, F., Zhang, Y., Liu, B.: Multiple instance learning framework with masked hard instance mining for whole slide image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4078–4087 (2023)
  • Qu et al. [2022] Qu, L., Wang, M., Song, Z., et al.: Bi-directional weakly supervised knowledge distillation for whole slide image classification. Advances in Neural Information Processing Systems 35, 15368–15381 (2022)
  • Campanella et al. [2022] Campanella, G., Ho, D., Häggström, I., Becker, A.S., Chang, J., Vanderbilt, C., Fuchs, T.J.: H&E-based computational biomarker enables universal EGFR screening for lung adenocarcinoma. arXiv preprint arXiv:2206.10573 (2022)
  • Eisenhauer et al. [2009] Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R., Dancey, J., Arbuck, S., Gwyther, S., Mooney, M., et al.: New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European Journal of Cancer 45(2), 228–247 (2009)
  • Wang et al. [2022] Wang, X., Yang, S., Zhang, J., Wang, M., Zhang, J., Yang, W., Huang, J., Han, X.: Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis 81, 102559 (2022)
  • Loshchilov and Hutter [2017] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)