Is visual evaluation of aneurysm coiling a reliable study end point? Systematic review and meta-analysis
Authors:
Ernst, Marielle; Yoo, Albert J.; Kriston, Levente; Schonfeld, Michael H.; Vettorazzi, Eik; Fiehler, Jens
Abstract:
BACKGROUND AND PURPOSE: Angiographic occlusion is commonly used in clinical trials as a surrogate marker of satisfactory aneurysm treatment, although some pitfalls have to be considered. To investigate the inter-rater reliability of visual rating of aneurysm occlusion as a study end point, we performed a systematic review and meta-analysis.
METHODS: Electronic databases (MEDLINE, EMBASE, PubMed, and the Cochrane Library) were searched up to June 2014. Assessment of risk of bias was based on the Quality Appraisal Tool for Studies of Diagnostic Reliability and the Guidelines for Reporting Reliability and Agreement Studies. Inter-rater reliability estimates were pooled across studies using meta-analysis, and the influence of several factors (eg, imaging methods, grading scales, and occlusion rate) was tested with meta-regression.
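The pooling step can be illustrated with a minimal sketch. The abstract does not state which meta-analytic model was used, so the DerSimonian-Laird random-effects estimator below is an assumption, as are the function name `pool_random_effects` and the input values, which are hypothetical and not taken from the included studies.

```python
import math

def pool_random_effects(estimates, std_errors):
    """Pool per-study reliability estimates (e.g. kappa values).

    Assumption: DerSimonian-Laird random-effects model; the source
    abstract does not name the estimator it used.
    Returns (pooled estimate, 95% confidence interval, tau^2).
    """
    k = len(estimates)
    # Inverse-variance (fixed-effect) weights.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q measures between-study heterogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci, tau2

# Hypothetical kappa estimates and standard errors, for illustration only.
pooled, ci, tau2 = pool_random_effects([0.5, 0.7], [0.05, 0.05])
print(f"pooled kappa={pooled:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}, tau^2={tau2:.4f}")
```

A random-effects model is a natural choice when heterogeneity between studies is expected, because the between-study variance widens the confidence interval accordingly.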
RESULTS: From 1193 titles, 644 abstracts and 87 full-text versions were reviewed. Twenty-six articles met the inclusion criteria and provided 77 reliability estimates. Twenty-one different rating scales were used, and statistical analysis varied. Mean inter-rater agreement of the pooled studies was substantial (kappa=0.65; 95% confidence interval, 0.60-0.69). Reliability varied significantly as a function of imaging methods, grading scales, occlusion rates, and their interaction. Observer agreement substantially increased with increasing occlusion rate in digital subtraction angiography but not in MR angiography. Reliability was higher in studies using 2- or 3-value grading scales than in studies with 4-value grading scales.
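As an illustration of the statistic reported here, the following sketch computes Cohen's kappa for two raters and maps it to a verbal label. The label thresholds follow the widely used Landis and Koch convention, under which 0.61-0.80 is "substantial"; the abstract does not say which convention it applied, so that mapping is an assumption. The function names and the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(rater_a)
    # Observed proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return (observed - expected) / (1.0 - expected)

def landis_koch_label(kappa):
    """Verbal label for kappa (assumed Landis and Koch thresholds)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical ratings on a 3-value occlusion scale, for illustration only.
a = ["complete", "complete", "neck remnant", "residual aneurysm", "complete", "neck remnant"]
b = ["complete", "neck remnant", "neck remnant", "residual aneurysm", "complete", "complete"]
print(cohens_kappa(a, b), landis_koch_label(0.65))
```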
CONCLUSIONS: There is significant heterogeneity between studies evaluating the reliability of visual evaluation of aneurysm coiling. On the basis of our analysis, the combination of magnetic resonance angiography, a 3-value grading scale, and 2 trained raters seems most promising for use as a surrogate study end point.