Vegas et al. (2016) reported that crossover designs are a popular design for software engineering experiments. In their review they identified 82 papers, of which 33 (i.e., 40.2%) used crossover designs. Furthermore, those 82 papers reported 124 experiments, of which 68 (i.e., 54.8%) used crossover designs. However, they reported that “crossover designs are often not properly designed and/or analysed, limiting the validity of the results”. They also warned against the use of meta-analysis in the context of crossover-style experiments.

As a result of that study, two of us undertook a detailed study of parametric effect sizes from AB/BA crossover studies (see Madeyski and Kitchenham 2018a, b and Kitchenham et al. 2018). We identified the need to consider two mean difference effect sizes and reported the small sample effect size variances and their normal approximations.

As we were undertaking this systematic review, we found that Santos et al. (2018) had already performed a mapping study of families of experiments. They reported that although the most favoured means of aggregating results was Narrative synthesis (used by 18 papers), Aggregated Data meta-analysis (by which they mean aggregation of experiment effect sizes) was used by 15 studies.

Using Vegas et al. (2016), Madeyski and Kitchenham (2018b) and Santos et al. (2018) as a starting point, we decided to investigate the validity and reproducibility of effect size meta-analysis for families of experiments (Madeyski and Kitchenham 2017). Our objectives were to: identify the effect sizes used and how they were calculated and aggregated; use the descriptive statistics reported in each study to attempt to reproduce the reported results; and, where we were unable to reproduce the results, investigate the underlying reason for the lack of reproducibility. We concentrated on families of experiments as our form of primary studies.
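As a rough illustration of what aggregation of experiment effect sizes involves in the simplest case, the sketch below uses base R to compute, for each experiment in a hypothetical family, a standardized mean difference (Hedges' g) with a small-sample correction and a normal-approximation variance from reported descriptive statistics, and then pools them with a fixed-effect, inverse-variance weighted average. All numbers are invented, and the sketch deliberately sidesteps the crossover-specific effect sizes and variances discussed by Madeyski and Kitchenham (2018a), which also depend on the correlation between repeated measures on the same participants.

```r
# All numbers below are invented purely for illustration.
experiments <- data.frame(
  m_t  = c(7.2, 6.8, 7.9),   # treatment group means
  m_c  = c(6.1, 6.3, 6.5),   # control group means
  sd_t = c(1.4, 1.6, 1.2),   # treatment group standard deviations
  sd_c = c(1.5, 1.3, 1.4),   # control group standard deviations
  n_t  = c(12, 10, 15),      # treatment group sizes
  n_c  = c(12, 11, 14)       # control group sizes
)

res <- with(experiments, {
  # Standardized mean difference (Cohen's d) using the pooled standard deviation
  sp <- sqrt(((n_t - 1) * sd_t^2 + (n_c - 1) * sd_c^2) / (n_t + n_c - 2))
  d  <- (m_t - m_c) / sp
  # Hedges' small-sample correction factor (a common approximation)
  g  <- (1 - 3 / (4 * (n_t + n_c - 2) - 1)) * d
  # A common normal (large-sample) approximation of the effect size variance
  v  <- (n_t + n_c) / (n_t * n_c) + g^2 / (2 * (n_t + n_c))
  data.frame(g = g, v = v)
})

# Fixed-effect (inverse-variance weighted) aggregation across the family
w        <- 1 / res$v
g_pooled <- sum(w * res$g) / sum(w)
se       <- sqrt(1 / sum(w))
round(c(pooled_g = g_pooled,
        lower95  = g_pooled - 1.96 * se,
        upper95  = g_pooled + 1.96 * se), 3)
```

A random-effects model, or the crossover-specific variances, would change the weights but not this basic inverse-variance structure.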
Previous studies have raised concerns about the analysis and meta-analysis of crossover experiments, and we were aware of several families of experiments that used crossover designs and meta-analysis. Our objectives were to identify families of experiments that used meta-analysis, to investigate their methods for effect size construction and aggregation, and to assess the reproducibility and validity of their results. We performed a systematic review (SR) of papers, published in high-quality software engineering journals, that reported families of experiments and attempted to apply meta-analysis. We attempted to reproduce the reported meta-analysis results using the descriptive statistics and also investigated the validity of the meta-analysis process.

Out of 13 identified primary studies, we reproduced only five. One study that was correctly analyzed could not be reproduced due to rounding errors. Where we were unable to reproduce the reported results, we provide revised meta-analysis results. Meta-analysis is not well understood by software engineering researchers. To support novice researchers, we present recommendations for reporting and meta-analyzing families of experiments and a detailed example of how to analyze a family of 4-group crossover experiments. To support the reproducibility of the analyses presented in this paper, it is complemented by the reproducer R package.
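One way to make the reproducibility judgement concrete is a tolerance check of the following kind. This is a hypothetical sketch of such a check, not the exact criterion used in the review, and the numbers are invented.

```r
# Invented values for illustration: a pooled effect size re-computed from the
# descriptive statistics versus the value printed in a hypothetical paper.
reported_g   <- 0.57      # reported to two decimal places
recomputed_g <- 0.5651    # obtained from the reported descriptive statistics

# Treat half a unit of the last reported digit as the rounding tolerance.
tolerance <- 0.01 / 2

if (abs(recomputed_g - reported_g) <= tolerance) {
  message("Consistent with the reported value (within rounding)")
} else {
  message("Not reproduced: difference = ",
          signif(abs(recomputed_g - reported_g), 3))
}
```

In practice, rounding in the reported descriptive statistics can propagate into the re-computed effect sizes, which is one way an otherwise correctly analyzed study may fail to reproduce exactly.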