Intra- and inter-observer variability in measurement of target lesions: implications for response evaluation according to RECIST 1.1
Background. The assessment of cancer treatment in oncological clinical trials is usually based on serial measurements of tumour size according to the Response Evaluation Criteria in Solid Tumours (RECIST) guidelines. The aim of our study was to evaluate the variability of target-lesion measurements between readers and its impact on response evaluation, workflow and reporting.
Patients and Methods. Twenty oncologic patients were included in the study, with CT examinations from thorax to pelvis performed on a 64-slice CT scanner. Four readers independently defined and measured the size of target lesions at baseline and follow-up with PACS (Picture Archiving and Communication System) and LMS (Lesion Management Solutions, Median Technologies, Valbonne Sophia Antipolis, France), according to the RECIST 1.1 criteria. Variability in measurements obtained with the PACS or LMS software was assessed with the Bland-Altman approach. Inter- and intra-observer variability was calculated for identical lesions, and the overall response per case was determined. In addition, the time required for evaluation and reporting of each case was recorded.
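The Bland-Altman agreement analysis used here can be sketched as follows; the function name and the paired diameters are illustrative placeholders, not data from the study:

```python
import statistics

def bland_altman_limits(a, b):
    """Bland-Altman agreement statistics for paired measurements.

    a, b: paired measurements (e.g. longest diameters in mm of the same
    target lesions from two readers or two reading sessions).  Returns
    the mean difference (bias) and the 95% limits of agreement,
    bias +/- 1.96 * SD of the differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical diameters (mm) of five target lesions measured twice
reader1 = [23.0, 41.5, 15.2, 30.8, 52.1]
reader2 = [24.1, 40.2, 15.9, 29.5, 53.0]
bias, (lo, hi) = bland_altman_limits(reader1, reader2)
```

Differences falling outside the limits of agreement would flag lesion measurements whose inter- or intra-observer disagreement exceeds what chance variation explains.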
Results. For single lesions, the median intra-observer variability ranged from 4.9% to 9.6% (mean 5.9%) and the median inter-observer variability from 4.3% to 11.4% (mean 7.1%), across the different evaluation time points, imaging systems and observers. Nevertheless, the variability of the change in sum of longest diameters (Δ sum LD), which determines the classification of the overall response, was 24%. Compared with the mean results of multiple observers, the overall response assessed by a single observer was discrepant in 6.3% of cases, and that assessed by different observers in 12% of cases. The mean case evaluation time was 286 s vs. 228 s at baseline and 267 s vs. 196 s at follow-up for PACS and LMS, respectively.
Conclusions. Uni-dimensional measurements of target lesions show low intra- and inter-observer variability, but the high variability of the change in sum of longest diameters (Δ sum LD) indicates a potential for misclassification of the overall response according to the RECIST 1.1 guidelines. Nevertheless, the reproducibility of RECIST reporting can be improved by having each case assessed by a single observer or by averaging the results of multiple observers. Case-based evaluation time was shortened by up to 27% using dedicated software.