Evaluation of deep learning-based multiparametric MRI oropharyngeal primary tumor auto-segmentation and investigation of input channel effects: Results from a prospective imaging registry

Highlights
• mpMRI coupled to deep learning can generate reasonable OPC tumor segmentations.
• Using multiple input channels may positively impact segmentation performance.
• Deep learning segmentations were non-inferior to ground truth as per Turing test.


Introduction
Oropharyngeal cancer (OPC), a type of head and neck squamous cell carcinoma (HNSCC), is among the most common malignancies globally [1]. Treatment for OPC often includes radiotherapy because of its high cure rate [2]. Segmentation (also termed contouring) of the primary gross tumor volume (GTVp) on radiologic imaging is necessary for the OPC radiotherapy workflow. The GTVp, with a clinical and planning safety margin, acts as a target volume to deliver the radiotherapy dose. Consequently, inadequate GTVp definition may cause under-dosage of the tumor or over-dosage of surrounding normal tissues [3,4]. However, the current clinical standard is manual segmentation by physician experts, which is labor-intensive and subject to high inter-observer variation [5][6][7]. Therefore, an auto-segmentation tool would be a promising alternative to the current manual standard in OPC radiotherapy workflows.
Deep learning (DL) has found wide success in auto-segmentation [8,9], with many HNSCC auto-segmentation studies applying DL to CT imaging [10][11][12]. Although CT is the most commonly used imaging modality in OPC radiotherapy planning, MRI has been increasingly recognized as essential for tumor segmentation because of its exceptional soft-tissue contrast [13,14]. Additionally, the emergence of MR-Linac technology, an image-guided adaptive radiotherapy approach [15], has further incentivized the incorporation of MRI in OPC radiotherapy planning. Importantly, we recently demonstrated the utility of DL for HNSCC organ-at-risk auto-segmentation using MRI, with improvements in performance, execution time, and dosimetric differences compared to other auto-segmentation methods [16]. While several DL tumor auto-segmentation studies for nasopharyngeal cancer using MRI have been published [17][18][19][20][21][22][23][24][25][26], to our knowledge, only one study has been published for OPC [27]. Since HNSCC tumors at different anatomical sites have distinct anatomic boundaries and characteristics [28,29], it is crucial that tumor segmentation models are developed for each site accordingly. Thus, there exists an unmet need for OPC DL tumor segmentation tools using MRI.
In this pilot study, we evaluated the effects of anatomical and functional multiparametric MRI (mpMRI) inputs on OPC GTVp segmentation performance. Using open-source DL frameworks with standardized clinical trial data, we trained and evaluated DL models based on variable mpMRI input channels. We then compared the models qualitatively and quantitatively to determine which channel combinations led to the best segmentation results. Finally, we characterized the clinical acceptability of the best-performing model using physician expert observers.

Imaging data
We acquired pre-radiotherapy T2-weighted (T2), contrast-enhanced T1-weighted Dixon fat-suppressed (T1), dynamic contrast-enhanced (DCE), and diffusion-weighted imaging (DWI) MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 124 HNSCC patients from a prospective clinical trial investigating longitudinal mpMRI (NCT03145077). Images were collected from August 2018 to August 2019 under a HIPAA-compliant protocol (PA16-0302) approved by The University of Texas MD Anderson Cancer Center's institutional review board. All patients provided study-specific informed consent. We curated 30 OPC patient data sets with a visible GTVp based on the complete availability of T2, T1, DCE, and DWI image sets (Appendix C, Fig. C1). Demographic characteristics of the patients are shown in Appendix C Table C1. Imaging was performed on a Siemens Aera scanner with a magnetic field strength of 1.5 T and standardized acquisition parameters (Appendix C Table C2). All patients were immobilized with a thermoplastic mask. Apparent diffusion coefficient (ADC) parametric maps were derived from DWI sequences through a proprietary Siemens algorithm (Munich, Germany) using a monoexponential model. The Tofts model was used to generate parametric maps from DCE sequences for the volume transfer constant (Ktrans) and the extravascular extracellular volume fraction (Ve) [40] (additional details can be found in Appendix A). GTVp ground truth structures were manually segmented in the DICOM-RT Structure format by a physician observer (a radiologist with > 5 years of expertise in HNSCC) in Velocity AI v.3.0.1 (Atlanta, GA, USA). GTVp ground truth structures were segmented on the T2 MRI with all other co-registered images made available to the observer during the segmentation process. An example of the mpMRI images used in this study and the overlaid GTVp segmentation for one patient is shown in Fig. 1A.
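The proprietary Siemens ADC fitting is not public; purely for illustration, a minimal Python sketch of a voxel-wise monoexponential ADC fit is shown below. The b-values, signal values, and function name are assumptions, not the vendor implementation, and the Tofts pharmacokinetic fitting used for Ktrans and Ve is more involved and not shown.

```python
import numpy as np

# Synthetic two-b-value DWI stack; all values are assumed for illustration only.
b_values = np.array([0.0, 800.0])                                   # s/mm^2
true_adc = 1.0e-3                                                   # mm^2/s
dwi_stack = 1000.0 * np.exp(-b_values * true_adc)[:, None, None] * np.ones((2, 4, 4))

def fit_adc_monoexponential(signals, b_values):
    """Voxel-wise log-linear fit of the monoexponential model S(b) = S0 * exp(-b * ADC)."""
    signals = np.clip(np.asarray(signals, dtype=float), 1e-6, None)  # avoid log(0)
    log_s = np.log(signals).reshape(len(b_values), -1)
    # Linear model: log S = log S0 - b * ADC, solved per voxel by least squares.
    design = np.stack([np.ones_like(b_values), -np.asarray(b_values, dtype=float)], axis=1)
    coeffs, *_ = np.linalg.lstsq(design, log_s, rcond=None)
    return coeffs[1].reshape(signals.shape[1:])                      # ADC map in mm^2/s

adc_map = fit_adc_monoexponential(dwi_stack, b_values)               # ~1.0e-3 everywhere
```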

Image processing
To ensure adequate MRI comparability between patients [41], we performed intensity standardization for all images. Anatomical sequences (T2, T1) were standardized using a Z-score (mean = 0, standard deviation = 1), while functional parametric maps (ADC, Ktrans, Ve) were truncated to the 10th and 90th percentiles for all patients to remove potential outliers and rescaled to [-1, 1] as per a previous study [38]. All images were cropped to the smallest field of view (Ktrans, Ve) and resampled to the T2 resolution. An example of the image processing workflow is shown in Fig. 1B.
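For illustration, a minimal Python sketch of the two intensity standardization schemes described above is given below; the function names and the example array are our own and not taken from the study code.

```python
import numpy as np

def zscore_standardize(image):
    """Z-score standardization for anatomical sequences (T2, T1): zero mean, unit SD."""
    return (image - image.mean()) / image.std()

def truncate_and_rescale(image, low_pct=10.0, high_pct=90.0):
    """Truncate a functional parametric map (ADC, Ktrans, Ve) to the 10th-90th
    percentiles to suppress outliers, then rescale linearly to [-1, 1]."""
    lo, hi = np.percentile(image, [low_pct, high_pct])
    clipped = np.clip(image, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0

# Example on a random volume standing in for a parametric map.
example_map = truncate_and_rescale(np.random.rand(32, 32, 16))
```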

Segmentation model architecture and implementation
A DL convolutional neural network based on the 3D Residual U-Net architecture [42,43] was implemented in the Medical Open Network for Artificial Intelligence (MONAI) software package [44] (Fig. 1C). The GTVp mask was used as the ground truth target to train the segmentation model. The MRI images acted as variable-channel inputs to the models. We investigated the following channel combinations as separate models: T2, T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, and all five input channels (ALL). The T2 model acted as the baseline of comparison for all other models. We implemented an Adam optimizer with a Sørensen-Dice similarity coefficient (DSC) loss function. The models were trained for 700 iterations with a learning rate of 2 × 10⁻⁴ for the first 550 iterations and 1 × 10⁻⁴ for the remaining 150 iterations, based on empirical observations in previous studies [35]. We implemented data augmentation to mitigate overfitting, which included random horizontal flips with 50% probability and random affine transformations with an axial rotation range of 12 degrees and a scale range of 10%. Additional details on the DL architecture and implementation are found in Appendix A.
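As a hedged sketch of how such a model could be configured in MONAI, the snippet below instantiates a 3D residual U-Net with a Dice loss, Adam optimizer, and the stated augmentations. The channel/stride configuration and the affine-transform probability are assumptions for illustration, not the exact settings reported in this study.

```python
import numpy as np
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss
from monai.transforms import Compose, RandFlipd, RandAffined

# Illustrative 3D residual U-Net; the channel/stride configuration is assumed.
model = UNet(
    spatial_dims=3,
    in_channels=2,              # e.g., T2 + T1 input channels
    out_channels=1,             # binary GTVp mask
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    num_res_units=2,            # residual units, per the 3D Residual U-Net design
)

loss_fn = DiceLoss(sigmoid=True)                        # Dice (DSC) loss
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Learning rate dropped to 1e-4 after iteration 550 (of 700), per the training schedule.

# Augmentation: 50% random horizontal flips plus random affine transforms with a
# 12-degree axial rotation range and 10% scale range (affine probability assumed).
augment = Compose([
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    RandAffined(keys=["image", "label"], prob=0.5,
                rotate_range=(0.0, 0.0, np.deg2rad(12)),
                scale_range=(0.1, 0.1, 0.1)),
])
```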

Model evaluation
Model performance was primarily assessed using DSC. We also implemented additional spatial similarity metrics, including Hausdorff distance (HD), false-negative DSC (FND), false-positive DSC (FPD), sensitivity, positive predictive value (PPV), surface DSC, 95% HD, and mean surface distance (MSD). For surface DSC, a tolerance of 3.0 mm was selected as suitable from previous inter-observer variability studies on T2 MRI of OPC GTVp [45]. Surface distance metrics were calculated using the surface-distance Python package [46], while all other metrics were calculated in Elekta ADMIRE v.2.9 (Stockholm, Sweden). Each model was trained and evaluated using leave-one-out cross-validation (LOOCV) (Fig. 1D).
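A minimal example of computing the surface-based metrics with the surface-distance Python package is shown below; the toy masks and voxel spacing are assumed values used only to make the snippet self-contained.

```python
import numpy as np
import surface_distance  # https://github.com/deepmind/surface-distance

# Toy binary masks; in practice these would be the ground truth and predicted GTVp.
mask_gt = np.zeros((64, 64, 64), dtype=bool)
mask_gt[20:40, 20:40, 20:40] = True
mask_pred = np.zeros_like(mask_gt)
mask_pred[22:42, 20:40, 20:40] = True
spacing_mm = (1.0, 1.0, 1.0)  # voxel spacing (assumed example value)

dsc = surface_distance.compute_dice_coefficient(mask_gt, mask_pred)
sd = surface_distance.compute_surface_distances(mask_gt, mask_pred, spacing_mm)
surface_dsc = surface_distance.compute_surface_dice_at_tolerance(sd, tolerance_mm=3.0)
hd95 = surface_distance.compute_robust_hausdorff(sd, percent=95)
msd = np.mean(surface_distance.compute_average_surface_distance(sd))
```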

Clinical evaluation
For our best-performing model, we assigned three physician expert observers (the radiologist who generated the ground truth segmentations, evaluating more than 1 year after the original segmentation, and two radiation oncologists) to evaluate the ground truth and corresponding DL-generated segmentations using subjective scoring criteria based on a 4-point Likert scale. The score categories were: 1 = requires corrections, large errors; 2 = requires corrections, minor errors; 3 = clinically acceptable, errors not clinically significant; 4 = clinically acceptable, highly accurate. Additionally, we asked observers to predict the source of each segmentation as either human (ground truth) or DL-generated through a modified Turing test [47]. Ground truth and DL-generated segmentations for all 30 patients were anonymized and randomly presented to experts for clinical evaluation. Experts were blinded to the segmentation source.

Statistical analysis
After performing a Shapiro-Wilk test, we found that our data were not normally distributed (p < 0.05); therefore, we utilized nonparametric statistical tests. We used one-sided Wilcoxon signed-rank tests (alternative hypothesis of "greater" for DSC, sensitivity, surface DSC, and PPV; alternative hypothesis of "less" for HD, FND, FPD, 95% HD, and MSD) to evaluate differences between our baseline T2 model and models with additional channels. Given the small number of a priori planned comparisons and the exploratory nature of this work, corrections for multiple hypotheses were not applied. We used Mann-Whitney U tests to detect differences in model performance based on tumor subsite (base of tongue vs tonsil). Additionally, to assess correlations of tumor size with model performance, we calculated Pearson correlation coefficients with corresponding p-values of ground truth volume against DSC, HD, and surface DSC for every model. Finally, to compare the clinical evaluation of ground truth against DL-generated segmentations, for each observer we implemented a two-sided Wilcoxon signed-rank test for scores and a McNemar test for source predictions. For all statistical analyses, p-values < 0.05 were considered significant. Analyses were performed in Python v.3.7.9. Code notebooks can be found on GitHub (https://github.com/kwahid/mpMRI_OPC_GTVp_segmentation).
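For illustration, the nonparametric tests described above can be run in Python roughly as follows; all arrays and counts below are placeholders, not study data, and the exact calls used in the study code may differ.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Placeholder paired per-patient DSC values for the baseline (T2) and a comparison
# (T2 + T1) model; the real values come from the LOOCV predictions.
dsc_t2 = np.array([0.70, 0.65, 0.72, 0.68, 0.75, 0.71])
dsc_t2_t1 = np.array([0.73, 0.66, 0.74, 0.70, 0.76, 0.72])

# One-sided Wilcoxon signed-rank test (alternative "greater" for DSC-like metrics,
# "less" for distance/error metrics such as HD, FND, FPD, 95% HD, and MSD).
_, p_wilcoxon = stats.wilcoxon(dsc_t2_t1, dsc_t2, alternative="greater")

# Mann-Whitney U test for subsite differences (base of tongue vs tonsil groups).
_, p_subsite = stats.mannwhitneyu(dsc_t2_t1[:3], dsc_t2_t1[3:])

# Pearson correlation of ground truth tumor volume against DSC.
volumes = np.array([12.0, 5.5, 20.3, 8.1, 15.7, 9.9])   # cm^3, placeholder values
r, p_corr = stats.pearsonr(volumes, dsc_t2_t1)

# McNemar test on Turing-test source predictions (2x2 table of correct/incorrect
# calls for GT vs DL segmentations; counts are illustrative only).
table = np.array([[10, 5],
                  [4, 11]])
p_mcnemar = mcnemar(table, exact=True).pvalue
```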

Model performance
T2 + T1 was the best performing model overall, with the best mean scores in DSC, HD, sensitivity, surface DSC, and 95% HD, while ALL was the worst performing model overall, with the worst mean scores in DSC, FPD, and PPV (Fig. 2). To further quantify the impact of specific channel combinations relative to the baseline T2 model, significance testing was performed between different channel combinations and the baseline model. Of the significant relationships, T2 + T1 had better performance (p < 0.05) than the baseline T2 model for HD, FND, sensitivity, surface DSC, and 95% HD, while T2 + Ve and ALL had better performance (p < 0.05) than the baseline T2 model for FND. These combined results suggest that the T2 + T1 model is the optimal combination for this segmentation task; it was therefore further analyzed in the Clinical evaluation section. A complete heatmap of p-values comparing channel combinations to the baseline T2 model can be found in Appendix C Fig. C2. Subsequent subgroup analysis revealed no significant differences in model performance for any combination of models and metrics based on OPC subsite (base of tongue vs tonsil), as all p-values were > 0.05 (Appendix C, Table C3). We further inspected predicted segmentation results of the various models for a few select cases based on DSC scores across all models (Fig. 3). For the high-performance case, the T2 model demonstrates a DSC of 0.88, with the incorporation of additional channels leading to DSC scores of 0.87-0.90. For the medium-

Clinical evaluation
The mean segmentation quality evaluation scores for ground truth vs DL-generated segmentations for the selected model (T2 + T1) were 3.0 vs 2.5, 2.5 vs 2.7, and 3.0 vs 3.0 for observers 1, 2, and 3, respectively. Significance testing revealed no observer could differentiate between the scores (p > 0.05) or source (p > 0.05) of the ground truth segmentations compared to the DL-generated segmentations (Table 1). Sub-analysis of clinical evaluation results for select cases can be found in Appendix B.

Discussion
In this pilot study, we determined the impact of mpMRI input channel combinations (T2, T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, ALL) on DL model segmentation performance. Recent work has suggested that the average agreement between physicians measured in DSC for OPC tumor segmentation is exceptionally low [45]. Notably, compared to previous fully-automated primary tumor segmentation studies of HNSCC patients, we achieved promising average DSC performance (Table 2). While it is difficult to directly compare DSCs between studies due to different datasets and model training, our models seemingly improve upon the only other fully-automated OPC tumor segmentation study to our knowledge, which exclusively investigated anatomical MRI [27].
The best average DSC performance was achieved by the T2 + T1 model (DSC = 0.73), which was higher than that of the baseline T2 model (DSC = 0.72), although the difference was not statistically significant. Moreover, average DSC decreased when combining all input channels (DSC = 0.71), though non-significantly. However, a previous similar study by Bielak et al. investigating HNSCC tumors with segmentations derived from T2 MRI demonstrated an increased DSC after the inclusion of all available mpMRI channels [38], which is in direct opposition to our results. Importantly, the authors used a smaller number of patients (n = 18) than our study and implemented repeat imaging at different time-points, which could confound their results. Additionally, their results may be more relevant for a specific HNSCC tumor site, but no analysis was performed to verify this. Notably, auto-segmentation studies in prostate cancer have also reported conflicting results on the additive effects of additional mpMRI input channels for DSC when using ground truth annotations derived from T2 MRI [48][49][50]. Therefore, further investigations are likely needed to verify if a significant positive DSC effect exists for mpMRI input channel combinations in OPC tumor auto-segmentation. While most auto-segmentation studies have focused on DSC as an evaluation metric, it has been argued that other metrics should also be taken into consideration, depending on the use-case of the auto-segmentation tool [51,52]. Therefore, to increase the robustness of our analysis, we have included complementary metrics (HD, FND, FPD, sensitivity, PPV, surface DSC, 95% HD, and MSD) to evaluate our models. Like DSC, most metrics show high performance across various models, with some models demonstrating significantly better values than the baseline T2 model. Interestingly, we demonstrated that in certain edge cases (low-performance example), the inclusion of additional channels could circumvent spurious voxel predictions derived from the baseline T2 model (a possible byproduct of model overfitting), which may increase model robustness (see further discussion in Appendix B, Case 2). These results indicate that the additional channels may contain underlying additive information to improve performance for aspects other than traditional DSC-based evaluation. Notably, the specific anatomic subsite of the tumor (base of tongue or tonsil) had no significant effect on performance for any model or evaluation metric, indicating that the models were robust to the spatial location of the OPC. Moreover, while the majority of the cases in our study cohort were HPV-positive, our best model (T2 + T1) was minimally affected in the evaluation of an HPV-negative case, though other models were negatively affected (Appendix B, Case 1). Generally, cases that were not well-represented in model training, i.e., irregular tumor presentations, led to suboptimal auto-segmentation performance (Appendix B, Cases 2 and 3).
Previous studies [17,38] have suggested small tumors may be more difficult for DL models to segment, which would hinder the incorporation of models into radiotherapy workflows. Importantly, there were no significant correlations between tumor size and DSC for any of our models. However, it should be noted that surface distance metrics, such as the HD and surface DSC, demonstrate some size dependence, with larger and smaller tumors being easier for our models to segment, respectively. Interestingly, the surface distance metrics do not demonstrate a significant size dependence for some models that utilize additional channels, particularly those that correspond to functional parametric maps. Therefore, the inclusion of additional channels may strengthen the robustness of models to tumor size for surface distance metric performance, but further confirmatory work is needed.

Table 1. Clinical evaluation and Turing test results for three physician expert observers. Each observer was asked to score blinded ground truth (GT) or deep learning (DL)-generated segmentations on a 4-point Likert scale (1 = requires corrections, large errors; 2 = requires corrections, minor errors; 3 = clinically acceptable, errors not clinically significant; 4 = clinically acceptable, highly accurate) and asked to identify the source of the segmentation (GT or DL). DL-generated segmentations corresponded to the best DL model tested (T2-weighted + T1-weighted).
The acceptability of segmentations used in a radiotherapy workflow is ultimately determined by physician judgment, with physician rating scales considered the gold standard for clinically relevant segmentation quality [52]. While subjective evaluation through rating scales is common in auto-segmentation studies, the established variability of OPC tumor segmentation between observers [45] highlights the difficulty in the interpretation of multi-observer segmentation quality analysis. Therefore, we implemented a comparative approach for each observer to determine if significant clinical differences were present between the ground truth segmentations and the corresponding segmentations of the best DL model (T2 + T1). We demonstrated that experts were on average unable to determine differences between the ground truth and the DL-generated segmentations or identify the source of the segmentations. Of note, the radiologist who provided the original ground truth segmentations came closest among the observers to correctly discriminating the segmentation sources, but this did not reach statistical significance. Additionally, for the radiation oncologist observers, the mean clinical acceptability score of the DL-generated segmentations was equal to or higher than that of the ground truth segmentations, which may indicate a slight preference towards DL-generated OPC tumor segmentations for radiotherapy end users. Importantly, while these results are encouraging, they should not be conflated with a necessarily positive indication for the desired endpoint of radiation therapy planning, which would require further dosimetric studies [52]. Moreover, we demonstrate several DL segmentation failure cases confirmed by clinical evaluation (Appendix B, Cases 2 and 3), so further work is needed to ensure the generation of clinically useful segmentations using these approaches.
One limitation of our study is the use of a small cohort of predominantly HPV-positive tumors with standardized acquisition parameters; thus, the generalizability of the tested models, as well as their robustness to different patient groups and acquisition settings, may be restricted. However, we have taken steps to optimally utilize our data by implementing a LOOCV approach and investigating various evaluation metrics. Moreover, we plan to include additional prospectively acquired data for model training and use external heterogeneous validation sets in future studies to increase model generalizability. Another limitation of our study is that we constrained our analysis of input image channels to those investigated in previous literature [38]. However, mpMRI input channels can be further investigated through additional quantitative parametric maps (e.g., extended Tofts model [53], advanced DWI fitting models [54], etc.). Moreover, additional data sources such as CT and PET could also be integrated into existing models to determine their impact on auto-segmentation performance. Additionally, we have limited our analysis to primary tumor volumes, so the utility of mpMRI DL for metastatic cervical lymph node tumor volume segmentation in OPC remains unknown. We plan to include additional input channels and incorporate lymph node segmentation in future analyses. A final limitation of our study is the lack of explicit image registration. Our images were acquired from a standardized clinical trial with patient immobilization; therefore, implicit co-registration was deemed adequate for tumor overlap. However, small amounts of motion artifacts may cause the segmentation mask to overlap improperly on mpMRI image channels, impacting auto-segmentation quality. Furthermore, though no geometric distortion was observed on any parametric maps, distortions were not explicitly quantified. Future studies should investigate the role of additional OPC-specific registration algorithms and geometric distortion correction in combination with mpMRI DL auto-segmentation algorithms.

Conclusions
In summary, using mpMRI inputs, we built OPC primary tumor DL auto-segmentation models that demonstrated reasonable performance across multiple evaluation metrics. While most channels did not significantly impact model performance as long as the T2 MRI was included, we find that adding T1 MRI significantly improved HD, FND, sensitivity, surface DSC, and 95% HD, and that adding Ve or using all input channels simultaneously significantly improved FND. Additionally, certain favorable aspects of model construction, including decreased spurious voxel predictions and robustness to tumor size when considering surface distance metric performance, are apparent for models that leverage additional input channels. Finally, blinded physician experts could not, on average, differentiate ground truth from DL-generated segmentations, demonstrating that our model shows promise in a Turing test scenario; however, it should be further investigated in dosimetric impact studies. Our results should be verified in large independent external datasets. Moreover, future studies should investigate whether the incorporation of additional images germane to the radiation oncology workflow, such as CT and PET, can further improve model performance. Overall, our pilot study is an important step towards fully automated MR-guided OPC radiotherapy workflows.

Table 2. Survey of relevant DL auto-segmentation literature for comparison with our study. Only studies ≤ 3 years old and with sample sizes ≥ 30 were selected for comparison. DL = deep learning, N = number of image sets used in study, GTVp = primary gross tumor volume, DSC = Dice similarity coefficient, OPC = oropharyngeal cancer, NPC = nasopharyngeal cancer, HNSCC = head and neck squamous cell carcinoma, T2 = T2-weighted MRI, T1 = T1-weighted MRI, DCE = dynamic contrast-enhanced MRI, DWI = diffusion-weighted imaging MRI, LOOCV = leave-one-out cross-validation, CV = cross-validation, CNN = convolutional neural network.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.