A Practical Workflow for Correcting Kit-Specific Effects in Whole-Exome Sequencing Data

Large-scale, multi-center projects have become common in the era of rapid technological development, but protocol standardization remains challenging. In whole-exome sequencing (WES), exome enrichment kits differ in capture efficiency across genomic regions, producing systematic, non-biological batch effects that are much stronger than those caused by other technical factors. We propose a workflow to minimize the effect of WES capture inconsistencies in single-nucleotide variant (SNV) data. The pipeline consists of quality control, mapping to the genome, variant calling, joint genotyping, and genotype imputation using reference haplotypes. Variants are then aggregated into gene-level features measuring the burden of deleterious mutations. Finally, a gene-level imputation is performed using a customized algorithm. Specifically, if the detection rate of a gene is low in samples enriched with a given capture kit but high in samples enriched with other kits, the missing values in the former group are imputed, as such differences are unlikely to reflect true biology. As a benchmark, we conducted a study on over a thousand breast cancer cases from 11 cohorts, sequenced with 8 exome capture kits. We demonstrated that the proposed pipeline considerably decreases the batch effect signal, potentially increasing the likelihood of finding true biological signals. The workflow is publicly available at https://github.com/ZAEDPolSl/WESworkflow.
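
The gene-level imputation rule described above can be illustrated with a minimal Python sketch. It is not the implementation from the WESworkflow repository: the function name, the dictionary layout, the detection-rate thresholds (`low`, `high`), and the choice of the median of other-kit values as the fill value are all assumptions made for illustration only.

```python
from statistics import median


def impute_kit_specific_gaps(burden, kit_of, low=0.2, high=0.8):
    """Fill gene-level gaps that look kit-specific (illustrative sketch).

    burden: dict sample -> dict gene -> burden score, or None if the gene
            was not detected in that sample (hypothetical layout).
    kit_of: dict sample -> capture kit name.
    low, high: assumed detection-rate thresholds, not values from the paper.
    Returns a new dict; the input is left unchanged.
    """
    kits = sorted(set(kit_of.values()))
    genes = sorted({g for row in burden.values() for g in row})
    result = {s: dict(row) for s, row in burden.items()}

    for gene in genes:
        for kit in kits:
            inside = [s for s in burden if kit_of[s] == kit]
            outside = [s for s in burden if kit_of[s] != kit]
            if not inside or not outside:
                continue

            # Detection rates are always computed on the original data,
            # so earlier imputations cannot influence later decisions.
            seen_in = [s for s in inside if burden[s].get(gene) is not None]
            seen_out = [burden[s][gene] for s in outside
                        if burden[s].get(gene) is not None]
            rate_in = len(seen_in) / len(inside)
            rate_out = len(seen_out) / len(outside)

            # Low detection with this kit but high detection elsewhere:
            # treat the gap as technical and impute it.
            if rate_in < low and rate_out > high:
                fill = median(seen_out)  # assumed fill value
                for s in inside:
                    if result[s].get(gene) is None:
                        result[s][gene] = fill
    return result
```

With two kits, a gene missing in every sample of one kit but detected in every sample of the other would be filled in, while a gene that is sparsely detected across all kits would be left untouched, since its absence may be biological.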
