Inter-clinician delineation variation for a new highly-conformal flank target volume in children with renal tumors: A SIOP-Renal Tumor Study Group international multicenter exercise

Highlights
• Recently, highly-conformal target volumes for flank delineation were defined.
• Delineation variation of this target volume was assessed in an international setting.
• Ten radiation oncologists delineated the GTV and CTV of six individual cases.
• 'Unacceptable' delineation variation was found in a large number of participants.
• This indicates the need for central target volume review before radiotherapy onset.


Introduction
Most children with renal tumors who are treated according to the Renal Tumor Study Group (RTSG) protocols of the International Society for Pediatric Oncology (SIOP) receive upfront chemotherapy followed by nephrectomy. Data from the recent SIOP-2001 trial show that 20-25% of these patients require postoperative flank irradiation as part of their first-line treatment [1,2]. For flank irradiation, two conventional opposing Anterior-Posterior/Posterior-Anterior (AP/PA) photon beams have been considered the gold standard since the SIOP-1 trial (1971-1974) [3]. However, renal tumors arise from the retroperitoneal area and displace the organs anterior to the tumor. During surgery, the tumor is removed with limited risk of (intraperitoneal) tumor spill or macroscopic residual disease, and the surrounding organs shift into the surgical cavity [4]. Consequently, the volume irradiated by AP/PA photon beams includes a large amount of normal tissue.
Nowadays, advanced Image-Guided Radiotherapy (IGRT) techniques allow us to treat complex target volumes with high conformity. To exploit these favorable dose-volume characteristics, radiation oncologists affiliated with the SIOP-RTSG developed a consensus statement on highly-conformal flank target volume delineation for pediatric renal tumors [5]. As a result, the consequences of inter-clinician variation become more substantial: underestimation of the target volume risks increasing locoregional failures, whereas overestimation of the target volume limits the ability of modern IGRT techniques to spare healthy tissue. To assess the locoregional control of the new flank target volumes combined with highly-conformal radiotherapy (RT) techniques, the SIOP-RTSG intends to launch a prospective multicenter study [5]. It is expected that during this study, prospective RT quality assurance by centralized review of target volumes and dosimetry will be compulsory to tackle the issue of inter-clinician variation, given earlier experiences with conventional flank delineation and in line with other recently launched pediatric cancer trials [6-8]. However, the inter-clinician delineation variation and, subsequently, the need for centralized review of the new flank target volume have not been determined. Therefore, the development of the consensus on highly-conformal flank delineation was accompanied by a multicenter delineation exercise, during which the consensus guideline was continuously optimized based on the experiences of each delineation phase.
The aim of this study was to evaluate the inter-clinician variation of the new highly-conformal flank target volume delineation approach in an international multicenter setting using geometrical analyses and reviewing criteria in order to explore the necessity of centralized pre-treatment quality assurance.

Materials and methods
This exercise was reported according to the Guidelines for Reporting Reliability and Agreement Studies [9].

Patient selection
Six unique cases with a pediatric renal tumor eligible for flank irradiation based on the criteria defined in the SIOP-RTSG UMBRELLA 2016 protocol were selected for this delineation exercise (institutional review board approval number: 17-729/C) [1]. For each case, after preoperative chemotherapy, T1-weighted Magnetic Resonance Imaging (MRI) scans (Achieva 1.5T, Philips Medical Systems, Best, The Netherlands; slice thickness: 1.5 mm) with and without gadolinium contrast agent were acquired, together with a postoperative planning Computed Tomography (CT) scan in RT treatment position (Brilliance, Philips Medical Systems, Best, The Netherlands; slice thickness: 2.0 mm). Essential clinical data to determine the extent of the area at risk were extracted from the radiology, surgery and pathology reports (Supplementary Table 1). Clinical data and imaging in Digital Imaging and Communications in Medicine (DICOM) format were anonymized and transferred from the coordinating center (University Medical Center Utrecht) to the participating centers using encrypted data exchange.

Preparation phase
Between May 2016 and May 2017, expert pediatric radiation oncologists of the SIOP-RTSG board ('coordinators') translated the conventional flank target volumes described in the ongoing UMBRELLA SIOP-RTSG-2016 protocol into a 'preliminary' highly-conformal flank delineation guideline during three live meetings [1]. Afterwards, radiation oncologists ('participants') from ten different centers in seven countries across Europe were invited to participate in a delineation exercise. Participants were asked to delineate the pre- and postoperative Gross Tumor Volume (GTV pre/post ), as well as the Clinical Target Volume of the tumor bed and, when indicated, of the involved lymph node area (CTV-T and CTV-N, respectively) for all preselected cases using the treatment contouring systems available at their institute. For each case, the contralateral kidney, spleen, liver, heart, lungs and vertebrae were delineated by a coordinating pediatric radiation oncologist (GJ) in order to reduce the total delineation time for the participants. The pancreas and intestine were delineated by the participants, since these organs are closely related to the construction of the target volumes.
The delineation exercise was divided into three phases: two test phases and a quality assurance phase (Fig. 1).

Test phases
During the first test phase (January 2018-April 2018), participants delineated the target volumes of cases 1 and 2. The preoperative and postoperative scans of these cases had been co-registered in advance at the coordinating center. However, after all delineations had been collected by the coordinating center, it emerged that the rigid co-registration had been overruled by the delineation software at the participants' departments. For this reason, detailed instructions on co-registration were added to the 'preliminary' delineation guideline. Hence, in the second test phase (May 2018-May 2019), participants performed the co-registration themselves and delineated the target volumes of cases 3 and 4. At the end of each test phase, a video meeting was organized between participants and coordinators to discuss inconsistencies between participants and to evaluate the need for refinement of the 'preliminary' delineation guideline.

Quality assurance phase
At the beginning of the quality assurance phase (April 2019-July 2019), the 'preliminary' delineation guideline was refined by adding new recommendations and detailed illustrations of the delineation approach (Supplementary Table 2) [5]. In this phase, participants performed co-registration and delineated cases 5 and 6 using the refined delineation guideline. The purpose of the quality assurance phase was to determine the inter-clinician variation using a standardized procedure to review the target volumes, in addition to the geometrical analysis of the volumes.

Geometrical data analysis
Data analysis was limited to cases 3-6 due to the co-registration mismatch in cases 1 and 2. Before each phase, a reference target volume (TV ref ), consisting of the GTV pre , GTV post and CTV, was established for each case by one of the coordinators (GJ), and subsequently validated by the other coordinators (PM, CR). The TV ref was based on the 'preliminary' delineation guideline for cases 3 and 4 and on the 'refined' delineation guideline for cases 5 and 6. Afterwards, the volume of the contours, the Dice Similarity Coefficient (DSC) and the percentage of the TV ref not delineated by participants were calculated using an in-house developed software tool [10].
The volumes of the contours (in mL) were calculated per participant, per case and per target volume. The DSC was used to determine the variation between two volumes and was calculated as the intersect target volume (TV intersect ) times two, divided by the sum of the two target volumes (TV 1 , TV 2 ):

$$\mathrm{DSC} = \frac{2 \times TV_{\mathrm{intersect}}}{TV_{1} + TV_{2}}$$

The DSC ranges from 0 (no overlap between volumes) to 1 (perfect agreement between volumes).
DSCs were calculated in a pairwise fashion between each participant and the reference (DSC ref/part ), as well as between the participants only (DSC part/part ) for each target volume per case. The percentage of TV ref not delineated by a participant was calculated for each target volume per case to reflect the amount of underestimated treatment volume, with TV intersect denoting the overlap between the TV ref and the corresponding target volume of the participant. Zero percent indicated that all of the TV ref was included in the corresponding target volume of the participant, while 100% indicated that no part of the TV ref was delineated by the participant:

$$\text{Reference volume not delineated (\%)} = \frac{TV_{\mathrm{ref}} - TV_{\mathrm{intersect}}}{TV_{\mathrm{ref}}} \times 100\%$$
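As an illustration of the two geometrical metrics described above, the following minimal Python sketch computes the DSC and the percentage of the reference volume not delineated on toy voxel sets. It is not the in-house software tool used in this study [10]; the function names (`dice`, `percent_reference_not_delineated`, `box`) and the box-shaped volumes are hypothetical, and each volume is assumed to be represented as a set of voxel coordinates on a common grid.

```python
from itertools import combinations


def dice(volume_a, volume_b):
    """Dice Similarity Coefficient: 2 * |A and B| / (|A| + |B|)."""
    total = len(volume_a) + len(volume_b)
    if total == 0:
        raise ValueError("DSC is undefined for two empty volumes")
    return 2 * len(volume_a & volume_b) / total


def percent_reference_not_delineated(reference, participant):
    """Share of the reference volume that lies outside the participant volume."""
    if not reference:
        raise ValueError("reference volume is empty")
    return 100 * len(reference - participant) / len(reference)


def box(x0, x1, y0, y1, z0, z1):
    """Hypothetical helper: all voxels in an axis-aligned box (upper bounds exclusive)."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}


# Toy data (not study data): a 10x10x10 reference and two participant volumes.
reference = box(0, 10, 0, 10, 0, 10)
participants = {
    "A": box(0, 10, 0, 10, 0, 5),   # half of the reference, nothing outside it
    "B": box(5, 15, 0, 10, 0, 10),  # same size as the reference, shifted by 5 voxels
}

# DSC ref/part: each participant against the reference.
dsc_ref_part = {name: dice(reference, vol) for name, vol in participants.items()}

# DSC part/part: every unordered pair of participants.
dsc_part_part = {
    (a, b): dice(participants[a], participants[b])
    for a, b in combinations(sorted(participants), 2)
}

# Percentage of the reference volume missed by each participant.
missed = {name: percent_reference_not_delineated(reference, vol)
          for name, vol in participants.items()}

for name in sorted(participants):
    print(f"{name}: DSC ref/part = {dsc_ref_part[name]:.2f}, "
          f"reference not delineated = {missed[name]:.0f}%")
```

In this toy example, participants A and B both miss 50% of the reference volume, yet their DSC values differ (0.67 versus 0.50) because B additionally includes tissue outside the reference. This shows why the two metrics are reported side by side: the DSC penalizes both under- and overestimation, whereas the percentage not delineated isolates underestimation.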

Target volume review
Target volume review according to the 'refined' delineation guideline was performed for case 5 and 6 only using a maximum of 18 standardized criteria depending on the clinical situation. These criteria cover the five major steps in the delineation process: one for co-registration, one for GTV pre , seven for GTV post , six for CTV-T and three for CTV-N [5] (Supplementary Table 2).
For the first part of the review, delineations per case per participant were graded by two independent reviewers (BH, PvR) and one reviewer with prior involvement in the delineation exercise (JM). Since a deviation occurring in one delineation step may cause a systematic error in the succeeding steps, each delineation step was reviewed separately. Subsequently, every deviation was attributed to the violation of a specific criterion. Deviations from the criteria were measured in the axial view using a point-to-point distance tool and categorized as either per protocol (0-4 mm), minor deviation (5-9 mm) or major deviation (≥10 mm). Deviations were only graded as minor or major when present in 3 or more consecutive axial slices. Major deviations were subdivided into deviations leading to a potential underestimation or overestimation of the target volume. Discrepancies between reviewers were resolved collectively.
For the second part of the review, a reference pediatric radiation oncologist (GJ) and two independent reviewers (BH, PvR) graded deviations from the CTV ref by each participant in six directions of the CTV (anterior, posterior, medial, lateral, cranial and caudal) using automated expansions of the CTV ref . A major deviation in one direction of the CTV resulting in underestimation with potential increased risk of locoregional failure was regarded as an unacceptable variation. All minor deviations and major deviations leading to an overestimation were considered acceptable.
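The grading rules described above can be sketched in a few lines of Python. This is an illustrative reading of the criteria, not the procedure used by the reviewers, who measured and graded deviations manually. All names (`Grade`, `grade_deviation`, `is_unacceptable`) are hypothetical. Two assumptions are made: distances between the published integer bins are graded by the thresholds 5 mm and 10 mm, and a run of at least three consecutive slices at or above 5 mm that does not contain three consecutive slices at or above 10 mm counts as a minor deviation.

```python
from enum import Enum


class Grade(Enum):
    PER_PROTOCOL = "per protocol"
    MINOR = "minor deviation"
    MAJOR = "major deviation"


MINOR_MM = 5.0              # lower bound of a minor deviation (5-9 mm)
MAJOR_MM = 10.0             # lower bound of a major deviation (>= 10 mm)
MIN_CONSECUTIVE_SLICES = 3  # deviation must persist over this many axial slices


def longest_run(distances_mm, threshold_mm):
    """Length of the longest run of consecutive slices at or above the threshold."""
    best = current = 0
    for distance in distances_mm:
        current = current + 1 if distance >= threshold_mm else 0
        best = max(best, current)
    return best


def grade_deviation(distances_mm):
    """Grade one deviation from its per-slice distances (mm), in axial order."""
    if longest_run(distances_mm, MAJOR_MM) >= MIN_CONSECUTIVE_SLICES:
        return Grade.MAJOR
    if longest_run(distances_mm, MINOR_MM) >= MIN_CONSECUTIVE_SLICES:
        return Grade.MINOR
    return Grade.PER_PROTOCOL


def is_unacceptable(grade, underestimates_target):
    """Only a major deviation that underestimates the CTV is unacceptable."""
    return grade is Grade.MAJOR and underestimates_target


# Toy measurements (not study data), one distance per consecutive axial slice.
sustained_major = grade_deviation([2, 6, 11, 12, 10, 3])
brief_major = grade_deviation([2, 6, 11, 12, 4, 3])
isolated_peaks = grade_deviation([12, 1, 12, 1, 12])

print(sustained_major.value, "|", brief_major.value, "|", isolated_peaks.value)
```

The first series is graded as a major deviation, the second as a minor deviation (only two consecutive slices reach 10 mm, but three consecutive slices reach 5 mm), and the third as per protocol because no deviation persists over three consecutive slices. Under the acceptability rule, `is_unacceptable(sustained_major, True)` holds, whereas the same major deviation leading to overestimation would be considered acceptable.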

Statistical analysis
The medians of the volumes, the DSC ref/part , the DSC part/part and the TV ref not delineated by participants were generated. The One-Sample Wilcoxon signed-rank test was used to test the difference between the size of the CTV part and the CTV ref for each case. The Wilcoxon signed-rank test was used to test whether a significant increase of the DSC ref/part was obtained between the mean of cases 3 and 4 ('second test phase') and the mean of cases 5 and 6 ('quality assurance phase'). The Related-Samples Friedman's Two-Way Analysis of Variance by Ranks, with the Wilcoxon signed-rank test as post-hoc analysis, was used to test the difference in CTV ref not delineated by the participants between cases, and to test the difference between the mean DSC ref/part of the GTV pre versus the GTV post versus the CTV of all cases combined. Additionally, the difference between the DSC ref/part and the DSC part/part was tested using the Mann-Whitney U test. A p-value of <0.05 was chosen to indicate statistical significance. Data were analyzed using statistical software SPSS

Results

Data collection
At the end of the quality assurance phase, a total of 57/60 delineation sets had been collected by the coordinating center. One participating center was unable to delineate cases 1, 5 and 6 within the given timeframes. Table 1 shows the absolute volume of the GTV pre , GTV post and CTV of each participant compared to the reference target volumes for cases 3-6. For all cases, the CTV part was not significantly different from the CTV ref . Considering each individual participant, the maximum difference in size of the CTV part compared to the CTV ref ranged from -68 mL to +234 mL.

Geometrical data analysis
The boxplots in Fig. 2 illustrate the variation in DSC between the reference and the participants (DSC ref/part ), the variation between the participants only (DSC part/part ) for each target volume per case, and the percentage of TV ref not delineated by a participant. For all cases combined, the DSC ref/part was better for the GTV pre (median = 0.87) compared to the GTV post (median = 0.39, p = 0.03) and the CTV (median = 0.55, p = 0.02). No significant difference in DSC ref/part for the CTV was observed between the 'test phase' and the 'quality assurance phase' (cases 3/4 vs. cases 5/6: p = 0.15, standard error = 8.43). For the CTV of each case, the DSC ref/part and the DSC part/part were not significantly different (case 3: p = 0.84; case 4: p = 0.59; case 5: p = 0.84; case 6: p = 0.32). The percentage of CTV ref not delineated by the participants' CTVs for cases 3-6 ranged between 11% and 73% (median = 35%) and did not significantly differ between cases (p = 0.17) (Fig. 3; Supplementary Table 3).

Target volume review
Firstly, cases 5 and 6 were reviewed by grading each step in the delineation process separately. One or more major deviations were found in 2/18, 5/18, 12/17, 18/18 and 4/9 participants for co-registration, GTV pre , GTV post , CTV-T and CTV-N, respectively (Supplementary Table 4). The criteria with the highest number of major deviations were CTV-T criterion 3 ('healthy-appearing kidney', n = 14/18), CTV-T criterion 2 ('organs at risk', n = 9/18) and GTV post criterion 3 ('healthy-appearing kidney', n = 8/17) (Table 2; Supplementary Table 2). Twenty-nine of the 44 observed major deviations were the result of an overestimation, while 15 of the 44 observed major deviations were caused by an underestimation.
For the second part of the review, each CTV part was graded by its deviation from the CTV ref . An unacceptable variation from the CTV ref was found in 7/9 participants for case 5 and 6/9 participants for case 6 (Fig. 4).

Discussion
In the current study, ten radiation oncologists from seven European countries delineated the pre- and postoperative GTV, as well as the CTV, of six unique renal tumor cases in order to evaluate the inter-clinician variation of a new flank target volume delineation approach [5]. The median DSC expressing the overlap between the CTV of participants and the reference CTV was 0.55, while the median underestimation of the reference CTV by the participants ranged between 29% and 47%. Additionally, standardized review of the delineations showed that an unacceptable underestimation of the reference CTV was present in 7/9 participants for case 5 and 6/9 participants for case 6.
Volume and measures of overlap, such as the DSC and the generalized conformity index, are commonly used metrics to determine inter-clinician delineation variation [11-13]. In a nationwide French study, the CTV agreement for conventional flank irradiation, as defined in the SIOP-2001 protocol, ranged from 0.50 to 0.64 between five RT teams [6]. In our study, the median DSC ref/part for the CTV ranged from 0.53 to 0.62. Despite two consensus meetings with the participants and refinement of the preliminary delineation guideline, no significant improvement of inter-clinician variation was observed during this study. This lack of improvement might be caused by the complexity of postoperative tumor reconstruction, the diversity of clinical presentations and the rarity of pediatric (renal) tumors in general [14]. Moreover, the participants did not receive any training prior to this study and no feedback was provided regarding their individual performance during the study. Such training and feedback might have reduced the number of errors, as demonstrated in other clinical settings [15-17]. Finally, it is also important to consider that the DSC is generally more severely affected by variation in the case of small, concave volumes like the postoperative tumor bed than in the case of larger, spherical volumes like the GTV pre , as also demonstrated in this study [11,18].
In order to evaluate the effect of variation on the potential clinical outcome of patients, a standardized review of all delineations was performed using objective criteria that reflect the recommendations from the refined delineation guideline [5]. When each step in the delineation process was graded independently during the first review, major deviations predominantly occurred for the margins towards the healthy-appearing kidney tissue, the removal of uninvolved OARs from the CTV-T, and the cranial margin of the CTV-N. This indicates where the delineation guideline could be improved or where additional attention during the reviewing process is appropriate.
The second review showed an unacceptable deviation from the CTV ref (i.e. leading to significant underestimation) in the majority of the participants. In adult cancer, it is known that RT protocol violation may increase the risk of treatment failure [19,20]. The Radiation Therapy Oncology Group (RTOG) revealed that failure to adhere to RT guidelines was associated with an increased risk of locoregional failure during a phase III study for pancreatic cancer, as well as during a multi-institutional trial for early-stage gastrointestinal cancer using Intensity-Modulated Radiotherapy (IMRT) [21,22]. Also, a large phase III trial of advanced head and neck cancers using prospective quality assurance in 81 Australian centers found a statistically significant difference in 2-year locoregional control: 54% versus 78% for patients with and without major deviations, respectively [23]. Less is known about the negative impact of RT protocol violation on treatment outcome for pediatric cancers. Carrie et al. reviewed the treatment plans of 174 medulloblastoma patients and demonstrated a strong correlation between the number of major target volume deviations and the risk of tumor relapse [24]. While the rate of protocol deviations found in our study is based upon a carefully established reference target volume, the true effect of underestimation can only be determined by comparing clinical outcome with target volume review. However, given the low number of locoregional failures for Wilms tumor (WT) compared to medulloblastoma patients, it will be more challenging to demonstrate the impact of major deviations on outcome when this new RT technique is introduced on a larger scale [2,25-28]. Overestimation of the target volume was not regarded as an unacceptable variation in our analysis. However, the degree of overestimation should also be evaluated within central target volume review in order to prevent unnecessary violation of normal tissue constraints for organs like the spleen, tail of the pancreas or the heart.
The design of this study was chosen to mimic daily clinical practice with cases representing a wide range of clinical situations. Also, ten radiation oncologists from seven different countries in Europe participated in this study, reflecting the inter-clinician variability in an international multicenter setting. Furthermore, this study implemented a review approach similar to modern quality assurance initiatives [7,8]. However, the use of multiple review criteria and the establishment of reference target volumes might not be preferable for real-time pre-treatment quality assurance, since this approach is complex and time consuming for reviewers and RT for renal cancers has to start shortly after surgery [29]. Nevertheless, the criteria generated for this delineation exercise could be a good frame of reference since they reflect all recommendations from the consensus statement on the new flank target volume definition [5]. Since this study aimed to evaluate inter-clinician delineation variation only, dosimetric analyses were not included, although these are normally part of RT quality assurance.
In conclusion, this international multicenter delineation exercise demonstrates that this new approach for flank target volume delineation leads to geometrical variation among clinicians. Standardized review using a reference CTV shows that major deviations leading to an underestimation of the reference CTV occurred in the majority of the participants. These findings strongly suggest the need for additional training and centralized pre-treatment review when this highly-conformal target volume delineation approach is implemented during a SIOP-RTSG endorsed prospective multicenter study.

Data availability statement
Research data are available upon request to the corresponding author.

Table 2. Review of cases 5 and 6: total number of deviations per criterion.