Quantitative Healthcare Analysis (QurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering and Physics, Amsterdam UMC, Amsterdam, The Netherlands
Significant visual impairment due to glaucoma is largely caused by the disease being detected too late.
Objective
To build a labeled dataset for training AI algorithms for glaucoma screening by fundus photography, to assess the accuracy of the graders and to characterize the features of all eyes with referable glaucoma.
Design
Cross-sectional study.
Subjects
Color fundus photographs (CFPs) of 113,893 eyes of 60,357 individuals were obtained from EyePACS, California, USA, from a population screening program for diabetic retinopathy.
Methods
Carefully selected graders (ophthalmologists and optometrists) graded the images. To qualify, they had to pass the EODAT optic disc assessment with at least 85% accuracy and 92% specificity. Of 90 candidates, 30 passed. Each image of the EyePACS set was then scored by varying random pairs of graders as ‘Referable glaucoma’ (RG), ‘No referable glaucoma’ (NRG) or ‘Ungradable’ (UG). In case of disagreement, a glaucoma specialist made the final grading. RG was scored if visual field damage was expected. In case of RG, graders were instructed to mark up to 10 relevant glaucomatous features.
Main outcome measures
Qualitative features in eyes with RG.
Results
The performance of each grader was monitored; if sensitivity dropped below 80% and/or specificity below 95% (the final grade served as reference), the grader exited the study and their gradings were redone by other graders. In all, 20 graders qualified; their mean sensitivity and specificity (SD) were 85.6 (5.7)% and 96.1 (2.8)%, respectively. The two graders agreed in 92.45% of the images (Gwet’s AC2, expressing the inter-rater reliability, was 0.917). Of all gradings, the sensitivity and specificity (95% CI) were 86.0 (85.2–86.7)% and 96.4 (96.3–96.5)%, respectively. Of all gradable eyes (n = 111,183; 97.62%), the prevalence of RG was 4.38%. The most common features of RG were the appearance of the neuroretinal rim inferiorly and superiorly.
Conclusions
A large dataset of color fundus photographs of sufficient quality to develop AI screening solutions for glaucoma was put together. The most common features of referable glaucoma were the appearance of the neuroretinal rim inferiorly and superiorly. Disc hemorrhages were a rare feature of RG.
Color fundus photography may be the best option for population-based screening programs for those at an increased risk of glaucoma, because it is the simplest, least-expensive and most widely used way of optic disc imaging. However, the workload to manually grade all these images makes it very costly and the differences between graders may yield a questionable accuracy.
Several glaucoma screening programs have been shown not to be cost-effective. However, when the need for expert graders is eliminated by better technologies that allow inexpensive imaging of the eye fundus, glaucoma screening has been shown to be cost-effective in India and China.
To reduce the cost of screening, automated detection of glaucoma may play an important role, since artificial intelligence (AI) based on supervised learning and classification of labeled fundus photographs alone has yielded promising results, often reaching the level of experienced clinicians.
However, when validated on other datasets, the performance of the AI models often drops, at times severely limiting the utility of AI under screening conditions, an issue that the current study aims to address. Why may the external validation of AI be so disappointing?
The various studies show significant differences in their design and in the utilized datasets; they differ, e.g., in the size of the training, test and validation sets. In addition, the populations may differ significantly between studies in ethnicity and geographical location. Often, different camera systems were used. Differences also occur in whether the fundus photographs had been carefully selected in clinics (with ‘supernormal’ [i.e., without any other significant disease or risk factors] and ‘superglaucomatous’ [i.e., without any other eye disease] eyes) or whether they were taken from a real-world screening scenario. In addition, in some studies the ungradable images were left out of the datasets, whereas in real-world situations, perhaps most notably in screening situations, ungradable images are likely to occur. It is often unclear whether the fundus photographs came from populations that were representative of the target population. Perhaps more importantly, the studies differ in their definition of glaucoma (from unconfirmed and unclearly defined ‘referable glaucoma’ to glaucoma confirmed by one or more additional tests, such as (subjective grading of) visual fields or (subjective grading of) ocular coherence tomography (OCT) scans). In addition, studies differ in the number of graders, the experience of the graders and the amount of agreement between graders. In some studies, the datasets were ‘enriched’ with fundus photographs of eyes with suspected glaucoma, although this category was not clearly defined. The grading accuracy of the graders (in terms of sensitivity and specificity, for instance) has also not been reported in all studies. In general, agreement between graders is poor.
Studies also differed significantly in the prevalence of disease per dataset, from a low prevalence to be expected under screening circumstances to up to approximately 50%, which will have a significant effect on the pre-test probability for the classification of disease. Taken together, the performance dropped when the AI was externally validated on different labeled datasets, perhaps not surprisingly given the described differences between studies and datasets. In addition, the prediction performance of the developed AI dropped in the presence of coexisting ocular disease.
The purpose of the current study was to put together a large, labeled dataset of color fundus photographs to be used for the training and validation of AI that could detect referable glaucoma, even in the presence of coexisting disease, under real world screening conditions in many parts of the world. To that end, we aimed for a multi-ethnic dataset, obtained with a wide array of fundus cameras by multiple operators in various locations, without the exclusion of ungradable images that occur under real world screening conditions. The photographs were to be classified by trained, qualified and experienced graders, whose performance was continuously monitored. The scope of the current manuscript is to describe the methods used to obtain the labeled dataset, as well as to describe the features of those fundus photographs that were labeled as referable glaucoma. These features per se could be of clinical interest, especially for the design of glaucoma screening programs, because they offer the opportunity to see the relative importance of each of the features in referable glaucoma. In addition, the results of this study may also be used to develop, improve, and refine (explainable) AI algorithms.
Methods
Dataset: Color fundus photographs (CFPs) of 113,893 eyes of 60,357 individuals were obtained from EyePACS, California, USA, from a population screening program for diabetic retinopathy. They provided a large dataset that was labeled as ‘random’ and a small set labeled as ‘glaucoma suspects’. The CFPs had been taken in approximately 500 screening centers across the USA on a large variety of cameras. Per eye, 3 images were taken, to reduce the risk that an eye could not be properly judged because the image was poorly aligned, over- or underexposed, or otherwise ungradable because of a closed eyelid or media opacities. The dataset was multi-ethnic and included people of African descent (6%), Caucasians (8%), Asians (4%), Latin Americans (52%), Native Americans (1%), people from the Indian subcontinent (3%) and people of mixed ethnicity (1%), as well as people of unspecified ethnicity (25%). The participants’ mean age (SD) was 57.1 (10.4) years. All the CFPs were anonymized. The entire project was approved by the Institutional Review Board of the Rotterdam Eye Hospital.
Graders; training and selection: The basis for training and selection of the graders was the European Optic Disc Assessment Trial (EODAT), which was a trial in which several hundreds of European ophthalmologists from 11 countries were asked to grade 110 stereoscopic optic nerve photographs (slides) as either normal or glaucomatous.
All glaucomatous eyes showed reproducible visual field defects on standard automated perimetry. The results showed a large variation between observers, with an average accuracy of approximately 80%. Glaucoma specialists showed a slightly better accuracy of about 86%. Shortly after the EODAT had been completed, the top 3 European graders were asked to annotate every slide with as many features as they judged to be most typically glaucomatous. The annotated slides, after consensus had been reached between the 3 graders, served as the basis for a stereoscopic Optic Nerve Evaluation (ONE) 3-hour teaching program, which one of the current authors (HL) has been giving regularly ever since. Some key points of the course are highlighted here to illustrate the choice of glaucomatous feature options in the current study. During the course, special attention is given to the importance of disc size (and how it affects the size of the cup, the width of the neuroretinal rim and the cup/disc ratio). In addition, the course points out another limitation of the cup/disc ratio, i.e., that the diameter of the cup is difficult to determine with a flat, sloping neuroretinal rim, whereas it is much easier with an, albeit rarer, ‘punched out’, steep-sloping rim. The clinician should therefore pay attention to the appearance of the neuroretinal rim rather than to the diameter of the cup or the cup/disc ratio. Relevant features of the rim are color (e.g., the presence of pallor), focal notching, more generalized thinning or narrowing of the rim, and bayonetting of the vessels, the latter being a sign of underlying neuroretinal rim tissue loss.
For the present study, all candidate graders (90 experienced ophthalmologists and optometrists) had to take the EODAT test, at least 3 months after following the ONE teaching program. To qualify, they had to pass the EODAT test with at least 85% overall accuracy and 92% specificity. Of all 90 candidates, 30 passed.
Grading tool: To facilitate the grading of all CFPs, Deepdee (Nijmegen, The Netherlands) was commissioned to put together a web-based grading tool that was quick and easy to use and that allowed the efficient grading of photo batches of 200 eyes at a time by randomly paired qualified graders (Figure 1). To avoid grading fatigue, each grader was allowed to grade a maximum of 2 batches (i.e., 400 eyes) per day. Each grader was blinded to the identity of the paired fellow grader. Per presented CFP, fitted to fill the screen, there were 3 available options, one of which the grader could select by clicking a button: ‘Referable glaucoma’, ‘No referable glaucoma’ or ‘Ungradable’; if ‘Ungradable’ was selected, another photo of the same eye was presented immediately, up to a maximum of 2 additional photos. The grader could not return to a previous photo of the same eye. If ‘Referable glaucoma’ was selected, 10 additional buttons appeared so that the grader could select up to 10 of the typically glaucomatous features of that eye. These features were: ‘appearance neuroretinal rim superiorly’, ‘appearance neuroretinal rim inferiorly’, ‘baring of the circumlinear vessel superiorly’, ‘baring of the circumlinear vessel inferiorly’, ‘disc hemorrhage(s)’, ‘retinal nerve fibre layer defect superiorly’, ‘retinal nerve fibre layer defect inferiorly’, ‘nasalisation (nasal displacement) of the vessel trunk’, ‘laminar dots’ and ‘large cup’. The (vertical) cup-to-disc ratio was not an option, because it was never annotated by the 3 top graders of the EODAT. In addition, it strongly depends on disc size, and the borders of the cup are difficult to determine with sloping neuroretinal rims. To allow comparison with the fellow eye, its photo was presented at the click of a button. Several of these features will be further discussed below.
Figure 1. Screenshot of the grading tool. The grading tool provided a large color fundus photograph, together with several displays and buttons for navigation purposes. To grade the image, there were 3 main buttons on the right, marked ‘Referable glaucoma’, ‘No referable glaucoma’ and ‘Ungradable’. If the button ‘Referable glaucoma’ was selected, 10 additional buttons were presented for glaucomatous feature selection (up to 10 allowed). In this example, 4 features have been selected (highlighted).
Grading procedure: Each grader could log into the grading tool at any time and from wherever they wished. Before the first grading took place, the graders were encouraged to familiarize themselves with the grading tool in a sandbox environment. The actual grading procedure started with a new batch of 200 CFPs or with whichever photos remained from a previous, unfinished batch. Per presented CFP, the grader could select one of three options: ‘Referable glaucoma’ (RG), ‘No referable glaucoma’ (NRG) or ‘Ungradable’ (UG). ‘Referable glaucoma’ was to be selected if the grader expected the glaucomatous signs to be associated with glaucomatous visual field defects on standard automated perimetry. If no glaucomatous visual field defects were expected (in case of, e.g., normal eyes or preperimetric glaucoma), the grader had been instructed to select ‘No referable glaucoma’. Any signs of coexisting eye disease, e.g., diabetic retinopathy, were to be ignored. If the 2 graders scored identically on any of the 3 main categories (RG, NRG or UG), their classification became the final label of the CFP. In case they disagreed, the photo was judged by a third grader, i.e., one of two glaucoma specialists (who had each passed the EODAT test with a minimum accuracy of 95%). The classification of the third grader then became the final label. The glaucomatous features that were provided by the graders in case they graded an eye as ‘Referable glaucoma’ were recorded without adjudication.
The graders’ performance was evaluated in terms of sensitivity and specificity, determined by comparing their score with the final label. To discourage graders from giving up too easily in case of poor image quality, we applied a penalty if they classified a CFP as UG whilst the final label was different: in case of a final label of NRG, their specificity was adversely affected, whereas in case of a final label of RG, their sensitivity went down. In case the final label was UG, no penalties were given, regardless of the classification by any of the graders.
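The scoring logic above can be sketched as follows (a minimal illustration under our stated rules; the data layout and function name are assumptions, not the study's actual code):

```python
def grader_performance(gradings):
    """Sensitivity and specificity of one grader against the final label,
    applying the study's penalty rules for 'Ungradable' (UG) scores.
    gradings: list of (grader_score, final_label) pairs, each value
    being 'RG', 'NRG' or 'UG'."""
    tp = fn = tn = fp = 0
    for score, final in gradings:
        if final == 'UG':
            continue  # final label UG: no penalty, whatever the grader scored
        if final == 'RG':
            # scoring UG (or NRG) against a final RG counts as a missed case
            tp += score == 'RG'
            fn += score != 'RG'
        else:  # final == 'NRG'
            # scoring UG (or RG) against a final NRG counts as a false positive
            tn += score == 'NRG'
            fp += score != 'NRG'
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec
```

Note that an eye whose final label is UG is simply excluded, so a cautious grader is never penalized when the image truly was ungradable.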
Grader monitoring and guidance:
The accuracy of each grader was periodically monitored. If sensitivity dropped below 80% or specificity below 95% (the final label served as reference), the grader exited the study and all their gradings were redone by other, randomly selected graders. In all, 20 graders qualified. All graders were encouraged to ask questions (by email) throughout the grading process. Their questions were often related either to purely technical issues or to more medical ones. The one-to-one emails were answered as quickly as possible; technical issues were dealt with by an employee of Deepdee, whereas medical questions were answered by one of the authors (HL). Graders who were not proficient in English were encouraged to ask questions in their native tongue. In case specific questions were raised repeatedly, virtual meetings were called to discuss matters with all concerned. The graded dataset, i.e., the images together with their annotations, is referred to as REGAIS – Rotterdam EyePACS Glaucoma AI Screening data set.
Statistics
To determine the inter-rater reliability, or intergrader agreement, several statistics may be produced. The most straightforward approach is to determine the percentage of agreement, which simply expresses the percentage of samples on which the graders agree. Because agreement may happen by chance, Cohen’s kappa was developed to correct for chance agreement. However, Cohen’s kappa can be low even when agreement is high.
This so-called paradox may occur in unbalanced datasets such as the present one. Gwet’s AC addresses this problem and provides an improved inter-rater reliability metric for such cases.
Gwet’s AC1 is limited to nominal data; AC2 may be used for ordinal and interval measurements as well and was calculated in this paper to express the inter-rater reliability. Gwet's AC2 calculations were performed based on the agreement of the classification of each image between the two graders. In addition, to aid the comparison with other reports, pooled results are produced for the agreement between a grading and the final label. These results will be displayed in a table and are further summarized by percentage of agreement and by Cohen’s kappa.
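The kappa paradox and Gwet's correction can be illustrated on synthetic binary labels (a sketch only, not the study's actual computation; the unweighted AC1 shown here is the nominal special case of AC2):

```python
from collections import Counter

def agreement_stats(a, b):
    """Percent agreement, Cohen's kappa and Gwet's AC1 for two raters
    assigning nominal labels to the same items."""
    n = len(a)
    cats = sorted(set(a) | set(b))
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Cohen: chance agreement from the product of the raters' marginals
    pe_k = sum((ca[c] / n) * (cb[c] / n) for c in cats)
    kappa = (po - pe_k) / (1 - pe_k)
    # Gwet: chance agreement from the mean category prevalence pi_c
    q = len(cats)
    pi = {c: (ca[c] + cb[c]) / (2 * n) for c in cats}
    pe_g = sum(pi[c] * (1 - pi[c]) for c in cats) / (q - 1)
    ac1 = (po - pe_g) / (1 - pe_g)
    return po, kappa, ac1
```

On a heavily unbalanced sample (e.g., 97% ‘NRG’ for both raters, as in a screening population) the observed agreement can be 96% while Cohen's kappa stays near 0.3; Gwet's coefficient remains close to the observed agreement, which is why it was preferred here.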
Results
Graders: Of the 30 graders who had passed the EODAT entry exam, 10 failed the periodic monitoring process that took place throughout the grading procedure. Of the remaining 20 graders, the mean sensitivity and specificity (SD) were 85.6 (5.7)% and 96.1 (2.8)%, respectively (the final label served as reference). The two graders agreed in 92.45% of the images (Gwet’s AC2 was 0.917). The third graders graded approximately 11,250 CFPs. Of all gradings, the sensitivity and specificity (95% CI) were 86.0 (85.2–86.7)% and 96.4 (96.3–96.5)%, respectively. The (pooled) agreement between graders and the final label is shown in Table 1. Cohen’s kappa was 0.709. The percentage of agreement was 96.0%.
Ungradable images: Of all 113,897 eyes, the CFPs of 2,714 (2.38 %) were classified as ‘Ungradable’. Although we did not systematically score the reason for their ungradability, it was noticed that common causes were media opacities (including cataract and/or vitreous hemorrhage or asteroid hyalosis), overexposure (image too bright), underexposure (image too dark), region of interest (optic nerve head and peripapillary region) outside the image, and a closed eyelid.
Referable glaucoma: Within the subset of gradable eyes (n = 111,183), the prevalence of the final label RG was 4.38%. Figure 2 shows the prevalence of glaucomatous features in images graded as RG.
The appearance of the neuroretinal rim inferiorly was the most common glaucomatous feature in the RG eyes, followed by the appearance of the neuroretinal rim superiorly. A large cup was observed in approximately half of all glaucomatous eyes and a nasal displacement of the vessel trunk in about a third of eyes. Baring of the circumlinear vessels was more frequently observed than retinal nerve fiber layer defects. Disk hemorrhages were a rare finding (3% of eyes).
Some features were often observed together with other features (Fig. 3). If, for instance, baring of the circumlinear vessel(s) was scored (either superiorly or inferiorly) as glaucomatous, chances were 90% that the matching neuroretinal rim was flagged as glaucomatous as well. Conversely, when the appearance of the neuroretinal rim was flagged, baring of the circumlinear vessel(s) was scored in the corresponding region in 28% of cases. In case of an RNFL defect inferiorly, the probability of a glaucomatous appearance of the neuroretinal rim inferiorly was, perhaps not surprisingly, 91%. If the appearance of the neuroretinal rim superiorly was considered glaucomatous, chances were 70% that the inferior neuroretinal rim was also considered glaucomatous. Disc hemorrhages occurred quite rarely, but if they were present, they were often (72%) associated with a glaucomatous appearance of the inferior neuroretinal rim.
Figure 2. Prevalence of glaucomatous features in images graded as referable glaucoma. ANRI: appearance neuroretinal rim inferiorly; ANRS: appearance neuroretinal rim superiorly; LC: large cup; LD: laminar dots; NVT: nasalization (nasal displacement) of the vessel trunk; BCLVI: baring of the circumlinear vessel(s) inferiorly; BCLVS: baring of the circumlinear vessel(s) superiorly; RNFLDI: retinal nerve fiber layer defect inferiorly; RNFLDS: retinal nerve fiber layer defect superiorly; DH: disc hemorrhage(s).
Figure 3. Conditional probabilities of glaucomatous features in RG eyes, expressed as the likelihood of an associated feature given a selected feature; for an explanation of the feature abbreviations, see Fig. 2.
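Conditional probabilities of this kind are straightforward to derive from per-eye feature selections; a minimal sketch (the data layout and feature codes are illustrative assumptions, not the study's actual pipeline):

```python
def conditional_prob(records, given, target):
    """P(target feature flagged | given feature flagged) across RG eyes.
    records: one set of feature codes per eye, e.g. 'DH' for disc
    hemorrhage(s) and 'ANRI' for appearance neuroretinal rim inferiorly."""
    with_given = [r for r in records if given in r]
    if not with_given:
        return None  # the conditioning feature was never flagged
    return sum(target in r for r in with_given) / len(with_given)
```

Run over the full REGAIS feature table, this kind of tally yields the asymmetric associations reported above, e.g., that the probability of an abnormal inferior rim given a disc hemorrhage is far higher than the reverse.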
One promising application of AI is large-scale, population-based screening for glaucoma, based on relatively inexpensive CFPs, since AI algorithms have been shown to be more capable of accurately classifying glaucoma than human graders.
This might in turn reduce the enormous burden of glaucomatous visual impairment and blindness across the world. However, several studies have shown that the performance of promising AI dropped when externally validated on entirely different datasets,
sometimes to unacceptably low levels for screening, thereby probably discouraging initiatives to apply the AI for screening on a large scale. Although a poor sensitivity would obviously lead to the undesirable situation of missed cases, a poor specificity would overburden the health care system because of too many false-positive referrals and the costs they would incur, especially with a relatively rare disease like glaucoma. In addition, too many false positive referrals would soon lead to a general loss of trust, and interest, in such screening programs. It is therefore better for population-based screening to sacrifice a little sensitivity in exchange for a high specificity, as long as significantly more cases are detected than with the current design of the health care system. The missed cases would probably be detected in a next screening round, when the disease was likely to have progressed, hopefully still within an asymptomatic range.
We would argue that the disappointing validations of promising AI algorithms might, in part, be explained by the differences in size and composition between the used datasets, caused by, e.g., different definitions of referable glaucoma, differences in ethnicities, the source of the CFP sets (e.g., clinic based or population based; a form of selection bias), other forms of data selection bias, differences in the prevalence of referable glaucoma, any presence of comorbidity and differences in the grading process. A strong point of the current study is that we acquired a large, labeled set of CFPs that is multi-ethnic, that was obtained on multiple cameras by multiple operators, designed for developing AI for glaucoma screening programs under real-world conditions. It is currently unclear to what extent the performance of AI algorithms, trained on our dataset, may drop when validated on other datasets, e.g., in the presence of other coexisting disease.
Another strong point of the current study is the rigorous grader qualification process that we used. This process entailed initial training, followed by testing and selection by means of an exam. In addition, the performance of the graders was monitored throughout the grading process. In many glaucoma AI studies, the ‘ground truth’ labeling and annotation was performed by only a small number of graders, often with limited clinical experience, and without validation of their gradings.
A limitation of our study was that we did not have visual fields available to support or refute the label of ‘Referable glaucoma’, which was defined as glaucoma with expected glaucomatous visual field defects (i.e., perimetric glaucoma). One reason for defining RG in such a way was that we wanted to exclude early, notably preperimetric glaucoma, because that would yield an undesirably low specificity. More importantly, we wanted to use the labeled dataset to develop AI that would detect those individuals at a high risk of becoming seriously visually impaired. Undetected, probably early glaucoma cases, we reasoned, would likely be detected in follow-up testing. In addition, we think that careful clinical examination, in which the assessment of the ONH is only part of several tests (such as OCT imaging, visual field testing, tonometry, pachymetry, slit lamp examination of the anterior and posterior segment, including gonioscopy) is required to make a diagnosis. Optic disc assessment by AI of CFPs would only serve as a first, but important and relatively inexpensive screen for glaucoma detection. In addition, all our graders had to pass the EODAT test, in which all glaucomatous eyes showed reproducible visual field damage.
Another limitation of our study was that we had no OCT imaging of the fundus available for each eye. OCT imaging plays an ever-greater role in the clinical management of glaucoma, both for making a diagnosis and for monitoring progression. OCT imaging might therefore have supported the classification of our CFPs. In addition, the availability of such a combined dataset might have served the development of AI for OCT images to screen the population at large for glaucoma. Our aim, however, was to provide a high quality labeled dataset for the development of AI for the low-cost screening for glaucoma. OCT-based AI has certainly shown promise for glaucoma detection,
but screening would arguably be too costly for many countries. We also think that there are too many differences between devices in scan protocols, preprocessing, processing, segmentation, normative data, etc. to allow the development of a universal AI, applicable across all (major) OCT devices, for population-based screening programs. Importantly, OCT images appear not to yield higher performance for detecting glaucoma than CFPs.
Taken together, we think that OCT-based AI is not yet ready for widespread glaucoma screening, as opposed to its unquestionably important role in the clinic.
A potential limitation of the study was that our CFPs were obtained from a screening program for diabetic retinopathy and therefore did not fully represent the population at large. We would argue, however, that this is an advantage since AI screening for glaucoma is likely to be used quite soon in diabetic retinopathy screening programs because these are already in place and require the same equipment, i.e., (non-mydriatic) fundus cameras. The use of AI for the detection of glaucoma could then be seen as a relatively simple add-on, thereby keeping costs low.
Perhaps another limitation of our study was that we graded single-field (monoscopic) CFPs instead of stereoscopic images. In clinical practice, stereoscopic viewing of the optic nerve head is generally considered as superior to monoscopic assessment. However, when expert graders were put to the test, no difference in grading accuracy was found between their monoscopic and stereoscopic assessment of fundus photographs in identifying glaucoma.
This study provides, as far as we know for the first time, an overview of the typical features of CFPs of eyes with ‘Referable glaucoma’ in a purely population-based screening program. A somewhat similar overview was obtained from a mixed dataset (both population-based and clinic-based).
Overviews such as these may serve the design of future glaucoma screening programs, because they offer the opportunity to see which of them matter most in referable glaucoma. In addition, the results of this study may also be used to develop, improve and refine explainable AI algorithms, especially because we have made our labeled dataset publicly available. In a recent challenge
the winning solution, which was trained on approximately 101,000 images from our REGAIS dataset and tested on a separate, withheld test set of approximately 11,000 images (also from the REGAIS dataset), showed a sensitivity for detecting RG of approximately 85% at a specificity of 95%.
A limitation of our study, which applies to many studies, is that human gradings of the optic nerve head have limited reproducibility and poor inter-rater agreement.
We tried to address that problem by having each CFP graded by two independent, experienced, trained, qualified and periodically monitored graders, and, in case of disagreement between the two in their final classification, have the CFP graded by a 3rd grader who had passed the EODAT entry exam with an exceedingly high score of at least 95% accuracy. Nevertheless, there is a real risk, albeit small, that the first two graders both classified a CFP identically, but incorrectly. We suspect that these errors occurred rarely, but we cannot rule them out entirely.
Yet another limitation of our current study was that, although agreement between graders was required for the 3 main classifications (‘Referable glaucoma’ (RG), ‘No referable glaucoma’ (NRG) or ‘Ungradable’ (UG)), or else the final classification was determined by the third grader, no such agreement was required for feature selection. When we designed the study, requiring such agreement was considered to make the grading process too complicated and time-consuming. What’s more, the main purpose of the study was to provide high-quality labels of the 3 main classifications (viz., RG, NRG and UG) for the development of AI for population-based screening. In addition, one of the authors (HL), who graded approximately 10,000 CFPs as a third grader, got the impression that the RG features were carefully and correctly selected in the vast majority of cases. Tighter study designs, calling for stronger agreement between graders on the typical features of RG CFPs, may be required in the future.
In case of ‘Referable glaucoma’ the graders could select up to 10 glaucomatous features per CFP. The choice of these features was based on the consensus reached on the most relevant features by the 3 top scorers of the EODAT trial.
These features were arbitrary and might be open to discussion. A number of evaluation methods of the optic nerve head have been proposed, the commonest perhaps being the cup-to-disc ratio (CDR), although it is limited by large measurement variability between and within graders.
Other schemes and classification methods have been proposed, some limiting the number of features, others expanding their number, sometimes focusing on several old features and adding new ones.
Focusing on the neuroretinal rim rather than the CDR is meaningful, especially when a relatively small cup (with a small CDR) is associated with typically glaucomatous damage to the NRR, e.g., in case of a notch. Therefore, the features we offered to our graders did not contain the CDR as an option. Bayonetting of vessels over the NRR was considered an abnormal NRR and was therefore not explicitly presented as an option. The quite popular so-called ISNT rule (which relates to the width of the neuroretinal rim and assumes that, in healthy eyes, the inferior neuroretinal rim is broader than the superior rim, followed in width by the nasal and temporal rim, respectively) has been shown to be insufficiently accurate for discriminating between healthy and glaucomatous eyes with high sensitivity and specificity.
The 3 top scorers of the EODAT trial never used the ISNT rule for classifying the RG eyes. Peripapillary atrophy (PPA) was not presented as an option either, because its presence per se is not typical of glaucoma, although glaucomatous eyes tend to have larger areas of PPA (both alpha and beta zones) than normal eyes.
Not surprisingly, the top 3 graders of the EODAT trial attributed little weight to PPA in discriminating between healthy and glaucomatous eyes.
It turned out that disc hemorrhages, which often receive a lot of attention in training programs, were observed quite rarely in our RG eyes, confirming the earlier observation of low sensitivity, with high specificity, for detecting glaucoma.
We have put together a large, labeled dataset of color fundus photographs for the development of an AI solution for low-cost, population-based screening for glaucoma in multiple ethnic settings. Approximately 90% of the entire set is publicly available and may be used for training under the Creative Commons BY-NC-ND 4.0 license. The entire set was acquired on a wide array of fundus cameras in many locations by multiple operators. The grading was done by a pool of experienced, trained, certified and periodically monitored graders. In case of ‘referable glaucoma’, the characteristic features were noted. It turned out that the appearance of the neuroretinal rim (notably notching, thinning or narrowing) was the most predominant feature. RNFL defects were observed in approximately 13% of eyes and disc hemorrhages were a rare feature of referable glaucoma. Other common features included visible laminar dots, baring of the circumlinear vessels and nasal displacement of the vessel trunk. Our results and data may be of use for developing and refining future screening programs. The estimated sensitivity and specificity were above our targets of 80% and 95%, respectively, and the annotated dataset should therefore be of sufficient quality to develop AI screening solutions.
Table 1. Pooled agreement of each grader compared to the final label. NRG = no referable glaucoma, RG = referable glaucoma, U = ungradable.
Meeting presentation: this study was presented during the 18th meeting of the Imaging and Morphometry Association for Glaucoma in Europe (IMAGE) from May 18- 20, 2022 in Porto, Portugal.
Financial support: The currently described project was supported by unrestricted grants from Stichting Blindenbelangen, Glaucoomfonds, Stichting Wetenschappelijk Onderzoek Oogziekenhuis – Prof. Dr. H.J. Flieringa, Hoornvlies Stichting Nederland, Stichting Blindenhulp, Stichting Ooglijders and by (unrestricted) donations from Aeon Astron, Théa Pharma, Laméris Ootech, Allergan, Carl Zeiss Vision, Revoir Recruitment and Novartis as well as from many private donors (through crowd funding).
Conflict of interest: No conflicting relationship exists for any author.
Précis: A large, multiethnic labeled dataset for the training of artificial intelligence algorithms for the screening for glaucoma on color fundus photographs was put together. The typical features of eyes with referable glaucoma were described.