Full length article|Articles in Press, 100233

# Detecting Glaucoma from Fundus Photographs Using Deep Learning without Convolutions: Transformer for Improved Generalization

Open Access. Published: October 18, 2022

## Abstract

### Purpose

To compare the diagnostic accuracy and explainability of a new Vision Transformer deep learning technique, Data-efficient image Transformer (DeiT), and ResNet-50, trained on fundus photographs from the Ocular Hypertension Treatment Study (OHTS) to detect primary open-angle glaucoma (POAG), and to identify the salient areas of the photographs most important for each model's decision-making process.

### Study Design

Evaluation of a diagnostic technology

### Subjects, Participants, and/or Controls

66,715 photographs from 1,636 OHTS participants and an additional five external datasets of 16,137 photographs of healthy and glaucoma eyes.

### Methods, Intervention, or Testing

DeiT models were trained to detect five ground truth OHTS POAG classifications: OHTS Endpoint Committee POAG determinations due to disc changes (Model 1), visual field changes (Model 2), or either disc or visual field changes (Model 3) and reading center determinations based on disc (Model 4) and visual fields (Model 5). The best-performing DeiT models were compared to ResNet-50 on OHTS and five external datasets.

### Main Outcome Measures

Diagnostic performance was compared using areas under the receiver operating characteristic curve (AUROC) and sensitivities at fixed specificities. The explainability of the DeiT and ResNet-50 models was compared by evaluating the attention maps derived directly from DeiT against 3 gradient-weighted class activation map generation strategies.

### Results

Compared to our best-performing ResNet-50 models, the DeiT models demonstrated similar performance on the OHTS test sets for all five ground truth POAG labels; AUROC ranged from 0.82 (Model 5) to 0.91 (Model 1). However, the AUROC of DeiT was consistently higher than that of ResNet-50 on the five external datasets. For example, AUROC for the main OHTS endpoint (Model 3) was between 0.08 and 0.20 higher in the DeiT compared to ResNet-50 models. The saliency maps from the DeiT highlight localized areas of the neuroretinal rim, suggesting the use of important clinical features for classification, while the same maps in the ResNet-50 models show a more diffuse, generalized distribution around the optic disc.

### Conclusions

Vision Transformers have the potential to improve the generalizability and explainability of deep learning models for the detection of eye disease and possibly other medical conditions that rely on imaging modalities for clinical diagnosis and management.

## Introduction

Primary open-angle glaucoma (POAG) is a blinding but treatable disease in which damage to the optic nerve can result in progressive and irreversible vision loss. In 2010, there were an estimated 1.4 billion people worldwide with myopia, and the prevalence is rapidly rising to an estimated 4.75 billion by 2050
• Holden B.A.
• Fricke T.R.
• Wilson D.A.
• et al.
Global prevalence of myopia and high myopia and temporal trends from 2000 through 2050.
,
• Weinreb R.N.
• Aung T.
• Medeiros F.A.
The pathophysiology and treatment of glaucoma: a review.
. As the population ages, the number of people with POAG will increase worldwide to an estimated 111.8 million by 2040
• Thompson A.C.
• Jammal A.A.
• Medeiros F.A.
A review of deep learning for screening, diagnosis, and detection of glaucoma progression.
. As its symptoms often only occur when the disease is severe, early detection and treatment play an important role in preventing visual impairment and blindness from the disease.
Digital fundus photography is an important modality for detecting glaucoma. Benefiting from the recent advances in artificial intelligence (AI), deep learning (DL) models using fundus photographs have achieved compelling results for glaucoma detection
• Thompson A.C.
• Jammal A.A.
• Medeiros F.A.
A review of deep learning for screening, diagnosis, and detection of glaucoma progression.
• Wu J.-H.
• Nishida T.
• Weinreb R.N.
• Lin J.-W.
Performances of machine learning in detecting glaucoma using fundus and retinal optical coherence tomography images: A meta-analysis.
• Christopher M.
• Belghith A.
• Bowd C.
• et al.
Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs.
• Liao W.
• Zou B.
• Zhao R.
• Chen Y.
• He Z.
• Zhou M.
Clinical interpretable deep learning model for glaucoma diagnosis.
• Yu S.
• Xiao D.
• Frost S.
• Kanagasingam Y.
Robust optic disc and cup segmentation with deep learning for glaucoma detection.
• Christopher M.
• Nakahara K.
• Bowd C.
• et al.
Effects of study population, labeling and training on glaucoma detection using deep learning algorithms.
.
Convolutional neural networks (CNNs), the most common type of DL model, have dominated this research area. Although CNNs have achieved high accuracy in glaucoma detection, a pressing problem remains: CNNs often do not generalize well to unseen fundus data. This is likely due, in part, to differences in glaucoma ground truth and study populations across datasets and/or to different cameras, illuminations, and photographer skills/preferences. There is considerable evidence that the assessment of optic nerve head (ONH) photographs for glaucoma determination is highly variable, even among glaucoma experts
• Thompson A.C.
• Jammal A.A.
• Medeiros F.A.
A review of deep learning for screening, diagnosis, and detection of glaucoma progression.
. The development of glaucoma is also related to factors such as race and age, which can vary significantly across datasets. Therefore, there is a need to develop techniques to enhance a DL model's ability to generalize across fundus datasets.
From the algorithm aspect, the state-of-the-art (SoTA) CNNs typically employ convolutional layers to represent fundus photographs with one-dimensional (1-D) visual features

Fan, R., Bowd, C., Brye, N. et al. [Preprint] one-vote veto: Semi-supervised learning for low-shot glaucoma diagnosis. arXiv preprint arXiv:2012.04841 (2020).

. A fully connected layer is simultaneously fine-tuned to classify these visual features as either glaucoma or healthy
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
. Such visual features are learned without encoding the connection between pixels

Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

, and therefore, CNNs can be easily misled when the visual feature of an unseen image is similar to one of the learned examples, even if they have significantly different spatial structures.
Transformers were initially introduced for machine translation

Vaswani, A., Shazeer, N., Parmar, N. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017).

, and they have since become the SoTA method in many natural language processing (NLP) tasks

Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

. The explanation of their success in NLP tasks lies in the adopted self-attention mechanism

Vaswani, A., Shazeer, N., Parmar, N. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017).

, which differentially weighs the significance of each part of the sequential input data. Unlike recurrent neural networks (RNNs), Transformers do not necessarily process the data in order. In contrast, the self-attention mechanism provides context for any position in the input sequence, showing the capability to understand the connection between inputs (such as the words in a sentence when a Transformer is applied for NLP)

Vaswani, A., Shazeer, N., Parmar, N. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017).

. However, when using Transformers to evaluate images, applying self-attention between pixels is often challenging.
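For reference, the scaled dot-product self-attention introduced by Vaswani et al. can be written as below. Here $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linear projections of the $n$ input tokens, and $d_k$ is the key dimension; these symbols follow the original Transformer paper and are not notation used elsewhere in this article.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Because $Q K^{\top}$ is an $n \times n$ matrix, memory and compute grow quadratically with the number of tokens. This is one reason why treating every pixel as a token is costly: an image of $224 \times 224$ pixels would already yield $n = 50{,}176$ tokens.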
Vision Transformer (ViT)

Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

was recently introduced to tackle this problem. It divides an image into a grid of square patches. Each patch is flattened into a single vector by concatenating the channels of all pixels in a patch and then linearly projecting it to the desired input dimension. Since ViT is agnostic to the structure of the input elements, learnable position embeddings were added to each patch to enable the model to learn about the structure of the images. A ViT does not know about the relative location of patches in the image or even that the image has a 2-D structure — it learns relevant information from the training data and encodes structural information in the position embeddings. As opposed to convolution layers whose receptive field is a small neighborhood grid, the self-attention layer's receptive field is always the full image. Therefore, ViT can learn many more global visual features

Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

.
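The patch-splitting step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the image is a nested list of shape height x width x channels, the function names `patchify` and `token_counts` are hypothetical, and the linear projection and position embeddings that follow in a real ViT are omitted.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    # Concatenate the channels of every pixel in the patch.
                    vector.extend(image[row][col])
            patches.append(vector)
    return patches


def token_counts(image_size, patch):
    """Return (number of patch tokens, number of pixel tokens) for a square image."""
    per_side = image_size // patch
    return per_side * per_side, image_size * image_size


# Toy 4 x 4 image with 3 channels; each pixel stores its (row, col) position.
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)  # 4 patches, each of length 2 * 2 * 3 = 12
```

With the sizes used later in this study (a $224 \times 224$ input and $16 \times 16$ patches), `token_counts(224, 16)` gives 196 patch tokens instead of 50,176 pixel tokens, which is what makes self-attention over the whole image tractable.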
However, ViT models performed slightly worse than ResNet models of comparable sizes when the training set sample size was small

Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

. This is likely due, at least in part, to the fact that ViT models lack some of the inductive biases inherent to CNNs (such as translation equivariance and locality) and therefore do not generalize well when trained on insufficient amounts of data. In contrast, when the training data is sufficient, ViT models can compensate for the missing inductive biases and can outperform the SoTA CNNs significantly. For this reason, ViT is typically pre-trained with hundreds of millions of images, which limits its adoption. Data-efficient image Transformer (DeiT)

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357 (PMLR, 2021).

was recently proposed to overcome this limitation. DeiT introduces a teacher-student strategy specific to Transformers. It relies on a distillation token to ensure that the student model learns from the teacher model through attention. Compared to ViT, DeiT achieves better performance in terms of both model accuracy and training efficiency without the large sample size requirement

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357 (PMLR, 2021).

. The main objective of this study is to compare the accuracy of the widely used ResNet-50 and the new state-of-the-art DeiT models for the detection of glaucoma from expert-graded fundus photographs from one of the most extensive clinical trials for glaucoma prevention, the Ocular Hypertension Treatment Study (OHTS).
In addition to high accuracy, it is also essential that AI algorithms provide mechanisms to explain their decisions. In vision-based AI, explanation techniques such as attention maps and saliency maps have become very popular in recent years

Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, 818–833 (Springer, 2014).

Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B. & Darrell, T. Generating visual explanations. In European Conference on Computer Vision, 3–19 (Springer, 2016).

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. & Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).

. Saliency maps can highlight parts of the input that are most influential on AI's outcome; however, these techniques have shown limitations in the interpretability of AI, especially in medical applications

Saporta, A., Gui, X., Agrawal, A. et al. [Preprint] deep learning saliency maps do not accurately highlight diagnostically relevant regions for medical image interpretation. medRxiv (2021).

. In contrast, attention maps in AI algorithms are usually generated during inference and can more directly identify the parts of the input considered important. This paper also objectively compares saliency maps and attention maps from the DeiT algorithm to the saliency maps from the ResNet-50 classifier. The goal of this analysis is to provide information to developers and validators on the regions considered by the deep learner to ensure that it is not focusing on irrelevant areas that will lead to poor performance.

## Methods

### Description of study populations and datasets

This study compares the accuracy, generalizability, and explainability of the SoTA Transformer DeiT and CNN ResNet-50 models for detecting glaucoma from OHTS fundus photographs and five external independent datasets. DIGS/ADAGES and OHTS studies adhered to the tenets of the Declaration of Helsinki and were approved by the IRBs of UC San Diego and all other study sites. Other data were assembled from publicly available datasets.
The OHTS
• Gordon M.O.
• Kass M.A.
Ocular Hypertension Treatment Study Group et al. The ocular hypertension treatment study: design and baseline description of the participants.
,
• Kass M.A.
• Heuer D.K.
• Higginbotham E.J.
• et al.
The ocular hypertension treatment study: a randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma.
, a large randomized clinical trial of 1,636 subjects with ocular hypertension, was designed to determine the safety and efficacy of topical ocular hypotensive medication in delaying or preventing the onset of POAG in ocular hypertensive eyes. The OHTS was unique in that the primary POAG endpoint, reproducible and clinically significant POAG optic disc changes or a reproducible glaucomatous visual field (VF) defect, was decided by a three-member masked Endpoint Committee of glaucoma experts who reviewed clinical information and both the photographs and VFs to determine whether observed changes were due to POAG or another disease. The Endpoint Committee reviewed cases only after changes in the optic disc and/or VF from baseline were determined by masked readers at the independent Optic Disc Reading Center (ODRC) and VF Reading Center (VFRC). We trained DL models on the OHTS fundus photographs to detect the following five outcomes from the Endpoint Committee and ODRC and VFRC POAG determinations:
• Endpoint Committee Determination:
Model 1: Optic disc changes attributable to POAG.
Model 2: VF changes attributable to POAG.
Model 3: Optic disc or VF changes attributable to POAG.
• Reading Center Determination:
Model 4: Optic disc changes attributable to POAG by ODRC.
Model 5: VF changes attributable to POAG by VFRC.
The DL models were then evaluated on the following five independent external test sets: (1) the University of California, San Diego (UCSD)-based Diagnostic Innovations in Glaucoma Study (DIGS, United States) dataset and the multi-center African Descent and Glaucoma Evaluation Study (ADAGES, San Diego, CA, Birmingham, AL and New York City, NY, United States) dataset
• Sample P.A.
• Girkin C.A.
• Zangwill L.M.
• et al.
The African Descent and Glaucoma Evaluation Study (ADAGES): Design and baseline data.
, (2) the ACRIMA (Spain)
• Diaz-Pinto A.
• Morales S.
• Naranjo V.
• Köhler T.
• Mossi J.M.
• Navea A.
CNNs for automatic glaucoma assessment using fundus images: an extensive validation.
dataset, (3) the Large-Scale Attention-based Glaucoma (LAG, China)

Li, L., Xu, M., Wang, X., Jiang, L. & Liu, H. Attention based glaucoma detection: A large-scale database and CNN model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10571–10580 (2019).

dataset, (4) the Retinal IMage database for Optic Nerve Evaluation (RIM-ONE, Spain)

Fumero, F., Alayón, S., Sanchez, J. L., Sigut, J. & Gonzalez-Hernandez, M. Rim-one: An open retinal image database for optic nerve evaluation. In 2011 24th international symposium on computer-based medical systems (CBMS), 1–6 (IEEE, 2011).

dataset, and (5) the Online Retinal Fundus Image Dataset for Glaucoma Analysis and Research (ORIGA, Singapore)

Zhang, Z., Yin, F. S., Liu, J. et al. ORIGA-light: An online retinal fundus image database for glaucoma analysis and research. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 3065–3068 (IEEE, 2010).

dataset.

### Dataset Preparation and Augmentation

We performed the same preprocessing and data augmentation strategies to prepare the fundus photograph dataset for DL model training as we used in our previous work
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
. In brief, a region centered on the ONH was first extracted from each raw fundus photograph using a semantic segmentation network. A square region surrounding the extracted ONH was then automatically cropped from each image and resized to $224 \times 224$ pixels for input in the DL model. During the DL model training, several data augmentation strategies, such as random rotation, translation, and horizontal flipping, were applied to increase the amount and type of variation of the training set.
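As a rough illustration of the cropping step, the sketch below computes a square crop box around an ONH bounding box. It is not the authors' pipeline: the segmentation network is not reproduced, the function name `square_crop_box` is hypothetical, and the `margin` value is an assumed parameter that the article does not report.

```python
def square_crop_box(onh_box, image_size, margin=0.5):
    """Return a square (left, top, right, bottom) box centered on the ONH.

    onh_box: (left, top, right, bottom) bounding box of the segmented ONH.
    image_size: (width, height) of the raw photograph in pixels.
    margin: assumed fraction of the ONH size added on every side.
    """
    left, top, right, bottom = onh_box
    width, height = image_size
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2
    side = max(right - left, bottom - top) * (1 + 2 * margin)
    side = min(side, width, height)  # the crop cannot exceed the photograph
    half = side / 2
    # Shift the box back inside the image if it would overhang an edge.
    center_x = min(max(center_x, half), width - half)
    center_y = min(max(center_y, half), height - half)
    return (round(center_x - half), round(center_y - half),
            round(center_x + half), round(center_y + half))
```

For example, `square_crop_box((400, 300, 600, 500), (1000, 800))` returns `(300, 200, 700, 600)`, a 400-pixel square that would then be resized to the model's input resolution.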

### Deep Learning Model

In our experiments, a DeiT

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357 (PMLR, 2021).

, pre-trained on the ImageNet database

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009).

, was trained to detect the three Endpoint Committee and the two Reading Center POAG determinations. We modified the last layer of the pre-trained DeiT to produce two scalars indicating the probability distribution over the healthy and POAG classes, respectively.
DeiT

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357 (PMLR, 2021).

is a convolution-free Transformer. A fundus image is split into a collection of $16 \times 16$ pixel patches, which are then embedded linearly (with position embeddings included). A distillation token interacts with the class and the patch tokens through the self-attention layers. This distillation token is used similarly to the class token, which minimizes a cross-entropy loss $L_{CE}$, except that, on the output of the network, its objective is to reproduce the hard label predicted by the teacher (by minimizing another cross-entropy loss $L_{teacher}$) instead of the true label. Both the class and distillation tokens input to the Transformer are learned by back-propagation (Supplemental Figure 1, available at https://www.ophthalmologyscience.org/).
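The two objectives can be combined as in the hard-label distillation formulation of the DeiT paper, sketched below. The symbols $Z_{cls}$ and $Z_{dist}$ (the logits read from the class and distillation tokens), $Z_t$ (the teacher logits), and $\psi$ (softmax) are notation assumed here for illustration. The equal weighting comes from the DeiT paper; this article does not state the weighting used in the present study.

$$
L = \tfrac{1}{2}\, L_{CE}\big(\psi(Z_{cls}),\, y\big) + \tfrac{1}{2}\, L_{teacher},
\qquad
L_{teacher} = L_{CE}\big(\psi(Z_{dist}),\, y_t\big),
\qquad
y_t = \arg\max_{c} Z_t(c)
$$

Here $y$ is the true label (healthy or POAG) and $y_t$ is the hard label predicted by the teacher.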

### Model Training and Selection

The demographic and clinical characteristics of the OHTS participants in training/validation and test sets are outlined in Table 1. We conducted our experiments with the identical training setup as our previous work
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
, including (a) the same OHTS training, validation, and test sets, split at random by participant, (b) five POAG determinations: 3 Endpoint Committee determinations (Models 1-3) and 2 Reading Center determinations (Models 4-5), (c) the same strategy to reduce the class imbalance problem, and (d) the same metric (F-score) to select the best-performing models from the validation set. The models were trained on an NVIDIA GeForce RTX 2080Ti GPU, which has 11 GB of GDDR6 memory and 4352 CUDA cores.

Table 1. Characteristics of the Ocular Hypertension Treatment Study (OHTS) training, validation, and test sets (n (%) or mean (95% CI)).
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.

| Measurement | Train / Validation | Test | p-value |
| --- | --- | --- | --- |
| Age (years) | 56.8 (56.3, 57.4) | 57.2 (56.1, 58.2) | 0.924 |
| Number of participants (eyes) | 1314 (2628) | 322 (644) | |
| Number of eye visits | 29644 | 7088 | |
| Self-reported race | | | |
| European Descent | 991 (75.4%) | 238 (73.9%) | 0.566 |
| African Descent | 323 (24.6%) | 84 (26.1%) | |
| Sex | | | |
| Female | 758 (57.7%) | 173 (53.7%) | 0.209 |
| Male | 556 (42.3%) | 149 (46.3%) | |
| Baseline Visual Field Mean Deviation (MD) (dB) | -0.03 (-0.11, 0.05) | -0.12 (-0.27, 0.04) | 0.239 |
| Baseline Photograph Based Vertical Cup Disc Ratio | 0.39 (0.38, 0.40) | 0.39 (0.37, 0.41) | 0.735 |
| No glaucoma | | | |
| Number of participants (eyes*) | 1093 (2344) | 250 (240) | |
| Number of eye visits | 27966 | 6675 | |
| Visual field MD (dB) | -0.18 (-0.23, -0.12) | -0.16 (-0.28, -0.05) | 0.871 |
| Developed a POAG endpoint by visual field or photograph | | | |
| Number of participants (eyes) | 221 (284) | 72 (96) | |
| Number of eye visits | 1678 | 96 | |

*Eyes without glaucoma are included from patients with and without glaucoma.

The mini-batch sizes for DeiT and ResNet-50 training were set to 40 and 110, respectively, to optimize GPU memory use. The maximum number of epochs was set to 200. We also implemented an early-stopping mechanism to reduce over-fitting, in which the training of DeiT or ResNet-50 is terminated if the F-score has not increased for 10 epochs. We used the stochastic gradient descent (SGD) optimization algorithm to minimize the training loss. The initial learning rate was set to 0.001 and decays gradually after the 100th epoch. Hyperparameters were chosen based on commonly used values and empirical testing on the training data. The PyTorch source code for our training algorithm is available upon request at https://github.com/visres-ucsd/vision-transformer.
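The early-stopping rule can be sketched as follows. This is a minimal illustration and not the authors' training code: the function names are hypothetical, the F-score is assumed to be the F1 score computed on the validation set, and the sequence of per-epoch scores stands in for actual model training.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 score from confusion counts (assumed here to be the study's F-score)."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def run_with_early_stopping(f_scores, max_epochs=200, patience=10):
    """Return (stop_epoch, best_epoch, best_score).

    f_scores yields one validation F-score per epoch. Training stops when
    the score has not increased for `patience` consecutive epochs, or when
    `max_epochs` is reached.
    """
    best_score = float("-inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, score in enumerate(f_scores, start=1):
        if epoch > max_epochs:
            break
        stop_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch  # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return stop_epoch, best_epoch, best_score
```

The defaults of 200 epochs and a patience of 10 mirror the values reported above. The learning-rate decay after the 100th epoch is not sketched because its schedule is not specified in the text.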

### Performance Evaluation and Statistical Analysis

The performance of the models for the five different POAG ground truths was evaluated based on the area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals (CI) calculated using clustered bootstrapping techniques. Sensitivity (recall) was calculated at 80%, 85%, 90%, and 95% fixed specificity values. We also evaluated the performance in eyes with early glaucoma (eyes with a VF MD ≥ -6 dB) on the OHTS test set. AUROC scores for classifying healthy eyes versus all glaucoma eyes and early glaucoma eyes were computed for each model. The best-performing DeiT models were evaluated on the OHTS test set and five external datasets.
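The metrics described above can be sketched in plain Python as follows. This is an illustrative sketch and not the authors' analysis code: the function names are hypothetical, the bootstrap is assumed to resample whole participants (the article does not name the clustering unit), and a simple percentile interval is assumed.

```python
import random


def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney); ties count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


def sensitivity_at_specificity(labels, scores, specificity):
    """Highest sensitivity among thresholds reaching the target specificity."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    for threshold in sorted(set(scores)):
        # An eye is called glaucoma when its score exceeds the threshold.
        spec = sum(n <= threshold for n in negatives) / len(negatives)
        if spec >= specificity:
            # Thresholds are ascending, so the first hit has the best sensitivity.
            return sum(p > threshold for p in positives) / len(positives)
    return 0.0


def clustered_bootstrap_ci(labels, scores, clusters, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling whole clusters with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    by_cluster = {}
    for y, s, c in zip(labels, scores, clusters):
        by_cluster.setdefault(c, []).append((y, s))
    ids = list(by_cluster)
    stats = []
    while len(stats) < n_boot:
        sample = [pair for c in rng.choices(ids, k=len(ids)) for pair in by_cluster[c]]
        ys = [y for y, _ in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples that lost one class
            stats.append(auroc(ys, [s for _, s in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling participants rather than individual photographs keeps all visits and both eyes of a person together, so the interval reflects the correlation among repeated photographs of the same participant.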
To evaluate the explainability of models, we compared the differences in the explanatory feature maps from DeiT and ResNet-50. This comparison was made based on the attention maps derived directly from the DeiT and different gradient-weighted class activation maps from ResNet-50, including GradCAM

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. & Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).

, GradCAM++

Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV), 839–847 (IEEE, 2018).

, and ScoreCAM

Wang, H., Wang, Z., Du, M. et al. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 24–25 (2020).

. In gradient-based saliency maps, we highlight the pixels with the highest impact on the decision-making process based on activation maps in the last layer of the transformer. For DeiT, we can visualize both the attention from the transformer and the saliency maps for the last layer of the transformer, while for the ResNet-50 model, only the saliency maps are available. Additional details of these explainability approaches are provided in the Supplemental Methods (available at https://www.ophthalmologyscience.org/).
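For orientation, the Grad-CAM map for a class $c$ is defined in the original Grad-CAM paper as below, where $y^c$ is the class score before the softmax, $A^k$ is the $k$-th feature map of the chosen layer, and $Z$ is the number of spatial positions in that map. These symbols follow the Grad-CAM paper and are not notation used elsewhere in this article.

$$
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
$$

GradCAM++ and ScoreCAM keep the same weighted-sum form and differ mainly in how the weights $\alpha_k^c$ are obtained: GradCAM++ uses a pixel-wise weighting of the positive gradients, while ScoreCAM derives the weights from the class scores of forward passes on inputs masked by each activation map.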

### Role of funding source

The funding sources had no involvement in the study design, collection, analysis, or interpretation of data or in the writing of this manuscript.

## Results

Table 2, Table 3, and Figure 1 show the performance of our trained DeiT models on the OHTS test set in terms of (a) classifying healthy versus all glaucoma eyes and (b) classifying healthy versus early glaucoma eyes. Compared to our best-performing ResNet-50 models (Table 2), the DeiT models demonstrated similar performance on the OHTS test sets for all five types of glaucoma determinations (see Figure 1a). For both DeiT and ResNet-50, the AUROC was 0.91 for Model 1 and 0.88 for Model 3, and differences in AUROC between the 2 DL strategies ranged from 0.01 to 0.03 for Model 2, Model 4, and Model 5. It is worth noting that the number of eyes incorrectly classified as POAG (false positives) was much higher than the number of endpoints missed (false negatives). Specifically, the ratio of false positives to false negatives at 90% specificity ranged from 7.0 for Model 1 to 3.75 for Model 4.
Table 2. Diagnostic accuracy of DeiT and ResNet-50
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
in identifying primary open-angle glaucoma (POAG) by the OHTS endpoint committee and optic disc and visual field reading centers.

AUROC values are given with 95% CI. Early glaucoma: VF MD ≥ -6 dB. Severe glaucoma: VF MD < -6 dB.

| Ground truth determined by | POAG detection modality | POAG (n): subjects (eyes)/visits | DeiT: all eyes | DeiT: early glaucoma | DeiT: severe glaucoma | ResNet-50: all eyes | ResNet-50: early glaucoma | ResNet-50: severe glaucoma |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Endpoint Committee | Optic Disc Photograph and/or Visual Field | 52 (71)/352 | 0.88 (0.82, 0.92) | 0.87 (0.80, 0.91) | 0.82 (0.65, 0.94) | 0.88 (0.82, 0.92) | 0.86 (0.79, 0.91) | 0.83 (0.63, 0.95) |
| Endpoint Committee | Optic Disc Photograph | 41 (56)/262 | 0.91 (0.87, 0.93) | 0.90 (0.87, 0.93) | 0.77 (0.51, 0.93) | 0.91 (0.88, 0.94) | 0.90 (0.87, 0.94) | 0.81 (0.60, 0.95) |
| Endpoint Committee | Visual Field | 35 (41)/195 | 0.84 (0.75, 0.90) | 0.82 (0.73, 0.89) | 0.74 (0.54, 0.89) | 0.86 (0.76, 0.93) | 0.83 (0.70, 0.91) | 0.81 (0.56, 0.93) |
| Reading Centers | Optic Disc Photograph | 60 (77)/318 | 0.86 (0.83, 0.89) | 0.86 (0.82, 0.89) | 0.70 (0.49, 0.89) | 0.89 (0.85, 0.92) | 0.87 (0.83, 0.91) | 0.81 (0.57, 0.96) |
| Reading Centers | Visual Field | 61 (78)/242 | 0.82 (0.76, 0.87) | 0.80 (0.73, 0.86) | 0.66 (0.43, 0.82) | 0.83 (0.76, 0.88) | 0.80 (0.72, 0.86) | 0.68 (0.49, 0.87) |

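Table 3. Diagnostic accuracy of DeiT and ResNet-50 in the test datasets. Sensitivity (Sens.) is reported at fixed specificities of 80%, 85%, 90%, and 95%.

| Test set | Healthy photos (n) | Glaucoma photos (n) | DeiT AUROC (95% CI) | DeiT Sens. at 80% | DeiT Sens. at 85% | DeiT Sens. at 90% | DeiT Sens. at 95% | ResNet-50 AUROC (95% CI) | ResNet-50 Sens. at 80% | ResNet-50 Sens. at 85% | ResNet-50 Sens. at 90% | ResNet-50 Sens. at 95% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OHTS | 6675 | 413 | 0.91 (0.87, 0.93) | 0.83 | 0.79 | 0.70 | 0.56 | 0.91 (0.88, 0.94) | 0.86 | 0.81 | 0.73 | 0.56 |
| DIGS | 5184 | 4289 | 0.77 (0.71, 0.82) | 0.62 | 0.57 | 0.48 | 0.34 | 0.74 (0.69, 0.79) | 0.59 | 0.52 | 0.43 | 0.30 |
| ACRIMA | 309 | 396 | 0.74 (0.70, 0.77) | 0.57 | 0.46 | 0.38 | 0.31 | 0.74 (0.70, 0.77) | 0.55 | 0.46 | 0.38 | 0.29 |
| LAG | 3143 | 1711 | 0.88 (0.87, 0.89) | 0.81 | 0.77 | 0.70 | 0.59 | 0.79 (0.78, 0.81) | 0.66 | 0.59 | 0.53 | 0.42 |
| RIM-ONE | 255 | 200 | 0.91 (0.88, 0.94) | 0.85 | 0.83 | 0.81 | 0.73 | 0.90 (0.87, 0.93) | 0.86 | 0.83 | 0.78 | 0.68 |
| ORIGA | 482 | 168 | 0.73 (0.68, 0.77) | 0.48 | 0.40 | 0.35 | 0.21 | 0.55 (0.48, 0.61) | 0.35 | 0.33 | 0.26 | 0.18 |
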
We also compared the generalizability of DeiT and ResNet-50 models for detecting the five POAG endpoints on five external independent fundus photograph test sets (Figure 1b-f). These results suggest that DeiT significantly outperforms ResNet-50 in almost all cases. Specifically, when evaluated on the LAG test set, the AUROC of DeiT for the five POAG endpoints was between 0.08 (Model 2) and 0.16 (Model 4) higher than that of ResNet-50. DeiT achieves the best generalizability on the LAG test set (AUROC (95% CI): 0.91 (0.90, 0.91)) for the Reading Center's determination of change based on optic disc photographs, compared to 0.74 (0.73, 0.76) by ResNet-50. Additionally, ResNet-50 performs best on Model 1, but its AUROC (95% CI) of 0.79 (0.78, 0.81) is significantly lower than that achieved by DeiT (AUROC (95% CI): 0.88 (0.87, 0.89)). Moreover, neither DeiT nor ResNet-50 generalizes well on the ORIGA test set. However, DeiT still significantly outperforms ResNet-50 in terms of the diagnostic AUROC (95% CI) of the POAG attribution in all cases. The Endpoint Committee's POAG attribution by optic disc photographs on the RIM-ONE test set is the only case in which ResNet-50 performs slightly better than DeiT (0.87 and 0.83, respectively, Figure 1e). In addition, confusion matrices show that the number of false positive classifications is higher than the number of false negative ones in the external datasets (Supplemental Figure 2, available at https://www.ophthalmologyscience.org/).
Model 1 attention and saliency maps for selected healthy and glaucoma examples are shown in Figure 2, while a comparison of the average attention and saliency maps is shown in Figure 3. In comparing attention and saliency maps between the two AI algorithms utilized in this study, ResNet-50 shows a general sensitivity to the central part of the image, while DeiT is more focused on localized features around the borders of the optic disc and neuroretinal rim. Average attention and saliency maps for other models are shown in Supplemental Figures 3 and 4 (available at https://www.ophthalmologyscience.org/).

## Discussion

We found that the convolution-free Vision Transformer DeiT trained and validated on the OHTS dataset performs similarly to ResNet-50 in detecting POAG on the OHTS test set, regardless of which of the five ground-truth POAG determinations was used. Furthermore, DeiT outperforms ResNet-50 across the different POAG determination strategies on almost all the external independent test sets included in this study. Specifically, DeiT generalizes better than ResNet-50 in the diverse external test sets of fundus photographs, representing individuals of Chinese, Japanese, Spanish, African, and European descent, each with its own criteria for ground truth determination of glaucoma. These results suggest that DeiT may be particularly useful to improve the generalizability of AI models in clinical applications as it outperforms the SoTA CNN ResNet-50 developed based on convolutions. Most importantly, biases in training sets (ground truth, study population, types of photographs, cameras, etc.) have been shown to lead to inaccurate detection results
• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
. DeiT, with its improved generalizability, may serve as a critical tool to limit the effect of possible biases in training sets used for the detection of eye disease.
It is unclear why DeiT outperforms ResNet-50 when applied to external datasets. It is possible that CNNs such as ResNet-50, which rely on the relationship between pixels, may be sensitive to pixel-level noise in specific datasets, which reduces diagnostic accuracy when the model is applied to external datasets. Vision Transformers, which rely more on the whole image, may therefore provide more generalizable results.
In general, diagnostic accuracy using artificial intelligence for the detection of glaucoma from fundus photographs and optical coherence tomography images is worse in external datasets than in test sets from the original data source (Wu et al., 2021). Others have reported better diagnostic accuracy on some of the external fundus photograph test sets used in the current study (Liao et al., 2019; Yu et al., 2019; Fu et al., 2019; Nawaz et al., 2022; Islam et al., 2021; Xu et al., 2021). However, these reports trained and tested on the same datasets, so their diagnostic accuracy is expected to be higher than when the fundus photographs come from independent external test sets, as in the current study. Differences in glaucoma definitions among the external datasets (e.g., visual field vs. expert photo review, cup-to-disc ratio, etc.) may also explain the differences in performance. It should be noted that the OHTS was designed to maximize specificity in its endpoint committee determinations of POAG (Gordon et al., 2019). For this reason, it is not surprising that there were more false positive than false negative errors in the external datasets whose ground truth determinations were more likely than the OHTS to classify an eye as glaucoma (i.e., more sensitive) (Supplemental Figure 2).
We evaluated the alternative approaches on their ability to identify the most important regions of an image. This type of analysis explains, in general, where the deep learning algorithm is focusing its attention. It is designed to give developers and validators confidence that the system is not finding spurious correlations in less relevant parts of the image that will not generalize well to novel examples. Analyzing deep learning algorithms based on their saliency and attention maps provides a new perspective on explainable glaucoma detection compared to similar recent work, which mostly relies on post-hoc occlusion (Christopher et al., 2020), cropping (Hemelings et al., 2021), or adversarial examples (Chang et al., 2021) to explain the inference process. While these post-hoc editing methods can help expose the causal roots of a prediction, they are prone to noise introduced by editing artifacts. In contrast, attention map explanations attempt to reflect the same information while preserving the original input and avoiding any unnatural artifacts in the inference process. Because they are derived directly from the Vision Transformer modeling process, these attention maps may provide more direct information on the model's decision-making process. Regardless of the specific method, it is difficult to evaluate these explainability techniques with respect to specific cases in the absence of detailed expert annotation. In this work, we focused on the general performance of these methods using attention and saliency maps averaged over many cases. In future work, we plan to incorporate detailed expert annotations of specific cases to further explore these techniques.
A limited number of recent studies investigated the general role of transformers in glaucoma detection and reported results comparable to ResNet (Yu et al., 2021; Song et al., 2021). To our knowledge, we are the first to use a data-efficient version of transformers, DeiT, to detect and explain glaucoma in fundus images. With DeiT, we showed better generalizability than ResNet and better explainability than saliency maps.
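
To make concrete what "derived directly from the Vision Transformer" can mean for the attention maps discussed above, the sketch below shows one common way to turn a single layer's attention weights into a patch-level map: take the class token's attention to every image patch and average it over heads. This is an illustrative assumption, not necessarily the procedure used in this study. The function name `cls_attention_map` and the parameter `n_prefix` are hypothetical.

```python
from typing import List


def cls_attention_map(attn: List[List[List[float]]], grid: int,
                      n_prefix: int = 1) -> List[List[float]]:
    """Turn one layer's attention weights into a grid x grid patch map.

    attn[h][q][k] is the softmax attention of head h from query token q to
    key token k. Token 0 is assumed to be the class token, and the first
    n_prefix tokens are assumed to be non-patch tokens (a distilled DeiT
    also carries a distillation token, in which case n_prefix would be 2).
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    if n_tokens - n_prefix != grid * grid:
        raise ValueError("token count does not match the patch grid")
    # Average, over heads, the class token's attention to each patch token.
    patch_scores = [
        sum(attn[h][0][k] for h in range(n_heads)) / n_heads
        for k in range(n_prefix, n_tokens)
    ]
    # Reshape the flat patch list into rows of the image grid.
    return [patch_scores[r * grid:(r + 1) * grid] for r in range(grid)]
```

Unlike gradient-based saliency maps, a map built this way needs no backward pass and no edited input; it only reads values the model already computed during inference.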
There are also several possible limitations to this study. First, state-of-the-art Vision Transformers typically require a large amount of data to train; nevertheless, we achieved good results with a training set of 1636 patients (3272 eyes). Second, there were fewer eyes with POAG than without it, resulting in an imbalanced dataset, which we addressed by incorporating class weights into the model. Third, we cropped all photographs, which may have reduced model performance if informative features are located in the peripheral retina.
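
The class weighting mentioned among the limitations above can be illustrated with a minimal sketch. It assumes the common inverse-frequency heuristic, in which each class is weighted by n_samples / (n_classes * class_count); the exact weighting scheme used in this study is not stated here. The function name `balanced_class_weights` and the 90/10 label split in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and common classes weights below 1,
    so each class contributes equally to a weighted loss in aggregate.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}
```

For example, with a hypothetical set of 90 non-POAG labels (0) and 10 POAG labels (1), `balanced_class_weights([0] * 90 + [1] * 10)` assigns the POAG class a weight of 5.0 and the non-POAG class a weight of about 0.56.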

## Conclusion

In summary, this study comprehensively explored the generalizability and explainability of a cutting-edge AI technique, the Vision Transformer, for detecting glaucoma from fundus photographs. The extensive experimental results suggest that the Vision Transformer generalizes well to the eyes of individuals of Chinese, Japanese, Spanish, African, and European descent represented in the external test sets of fundus photographs included in this study. Furthermore, the Vision Transformer focused on localized features of the neuroretinal rim, which are often used in the clinical management of glaucoma. The Vision Transformer has the potential to improve the scalability of deep learning solutions for the detection of not only eye disease but possibly also other conditions that require various imaging modalities for clinical diagnosis and management.

## References

• Holden B.A.
• Fricke T.R.
• Wilson D.A.
• et al.
Global prevalence of myopia and high myopia and temporal trends from 2000 through 2050.
Ophthalmology. 2016; 123: 1036-1042
• Weinreb R.N.
• Aung T.
• Medeiros F.A.
The pathophysiology and treatment of glaucoma: a review.
Jama. 2014; 311: 1901-1911
• Thompson A.C.
• Jammal A.A.
• Medeiros F.A.
A review of deep learning for screening, diagnosis, and detection of glaucoma progression.
Transl. Vis. Sci. & Technol. 2020; 9 (42–42)
• Wu J.-H.
• Nishida T.
• Weinreb R.N.
• Lin J.-W.
Performances of machine learning in detecting glaucoma using fundus and retinal optical coherence tomography images: A meta-analysis.
Am. J. Ophthalmol. 2021;
• Christopher M.
• Belghith A.
• Bowd C.
• et al.
Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs.
Sci. reports. 2018; 8: 1-13
• Liao W.
• Zou B.
• Zhao R.
• Chen Y.
• He Z.
• Zhou M.
Clinical interpretable deep learning model for glaucoma diagnosis.
IEEE journal biomedical health informatics. 2019; 24: 1405-1412
• Yu S.
• Xiao D.
• Frost S.
• Kanagasingam Y.
Robust optic disc and cup segmentation with deep learning for glaucoma detection.
Comput. Med. Imaging Graph. 2019; 74: 61-71
• Christopher M.
• Nakahara K.
• Bowd C.
• et al.
Effects of study population, labeling and training on glaucoma detection using deep learning algorithms.
Transl. vision science & technology. 2020; 9 (27–27)
1. Fan, R., Bowd, C., Brye, N. et al. [Preprint] one-vote veto: Semi-supervised learning for low-shot glaucoma diagnosis. arXiv preprint arXiv:2012.04841 (2020).

• Fan R.
• Bowd C.
• Christopher M.
• et al.
[Preprint] Detecting Glaucoma in the Ocular Hypertension Treatment Study Using Deep Learning: Implications for clinical trial endpoints.
TechRxiv. 2022; https://doi.org/10.36227/techrxiv.14959947.v2
2. Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).

3. Vaswani, A., Shazeer, N., Parmar, N. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017).

4. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357 (PMLR, 2021).

5. Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, 818–833 (Springer, 2014).

6. Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B. & Darrell, T. Generating visual explanations. In European Conference on Computer Vision, 3–19 (Springer, 2016).

7. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. & Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).

8. Saporta, A., Gui, X., Agrawal, A. et al. [Preprint] deep learning saliency maps do not accurately highlight diagnostically relevant regions for medical image interpretation. medRxiv (2021).

• Gordon M.O.
• Kass M.A.
Ocular Hypertension Treatment Study Group et al. The ocular hypertension treatment study: design and baseline description of the participants.
Arch. Ophthalmol. 1999; 117: 573-583
• Kass M.A.
• Heuer D.K.
• Higginbotham E.J.
• et al.
The ocular hypertension treatment study: a randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma.
Arch. ophthalmology. 2002; 120: 701-713
• Sample P.A.
• Girkin C.A.
• Zangwill L.M.
• et al.
The African Descent and Glaucoma Evaluation Study (ADAGES): Design and baseline data.
Arch. ophthalmology. 2009; 127: 1136-1145
• Diaz-Pinto A.
• Morales S.
• Naranjo V.
• Köhler T.
• Mossi J.M.
• Navea A.
CNNs for automatic glaucoma assessment using fundus images: an extensive validation.
Biomed. engineering online. 2019; 18: 1-19
9. Li, L., Xu, M., Wang, X., Jiang, L. & Liu, H. Attention based glaucoma detection: A large-scale database and CNN model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10571–10580 (2019).

10. Fumero, F., Alayón, S., Sanchez, J. L., Sigut, J. & Gonzalez-Hernandez, M. Rim-one: An open retinal image database for optic nerve evaluation. In 2011 24th international symposium on computer-based medical systems (CBMS), 1–6 (IEEE, 2011).

11. Zhang, Z., Yin, F. S., Liu, J. et al. ORIGA-light: An online retinal fundus image database for glaucoma analysis and research. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 3065–3068 (IEEE, 2010).

12. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009).

13. Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV), 839–847 (IEEE, 2018).

14. Wang, H., Wang, Z., Du, M. et al. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 24–25 (2020).

• Christopher M.
• Bowd C.
• Belghith A.
• et al.
Deep learning approaches predict glaucomatous visual field damage from oct optic nerve head en face images and retinal nerve fiber layer thickness maps.
Ophthalmology. 2020; 127: 346-356
• Hemelings R.
• Elen B.
• Barbosa-Breda J.
• Blaschko M.B.
• De Boever P.
• Stalmans I.
Deep learning on fundus images detects glaucoma beyond the optic disc.
Sci. Reports. 2021; 11: 1-12
• Chang J.
• Lee J.
• Ha A.
• et al.
Explaining the rationale of deep learning glaucoma decisions with adversarial examples.
Ophthalmology. 2021; 128: 78-88
15. Fu, H., Cheng, J., Xu, Y. & Liu, J. Glaucoma detection based on deep learning network in fundus image. In Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics, 119–137 (Springer, 2019).

• Nawaz M.
• Nazir T.
• Javed A.
• et al.
An efficient deep learning approach to automatic glaucoma detection using optic disc and optic cup localization.
Sensors. 2022; 22: 434
• Islam M.T.
• Mashfu S.T.
• Faisal A.
• Siam S.C.
• Naheen I.T.
• Khan R.
Deep learning-based glaucoma detection with cropped optic cup and disc and blood vessel segmentation.
IEEE Access. 2021; 10: 2828-2841
• Xu X.
• Guan Y.
• Li J.
• Ma Z.
• Zhang L.
• Li L.
Automatic glaucoma detection based on transfer induced attention network.
BioMedical Eng. OnLine. 2021; 20: 1-19
16. Yu, S., Ma, K., Bi, Q. et al. Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 45–54 (Springer, 2021).

• Song D.
• Fu B.
• Li F.
• et al.
Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function.
IEEE Transactions on Med. Imaging. 2021;
17. Gordon, M. O. et al. Assessment of the impact of an endpoint committee in the Ocular Hypertension Treatment Study. American Journal of Ophthalmology, 199, 193–199 (2019).