ISSN: 0974-276X
Eman A Alzaid, Achraf El Allali and Hatim Aboalsamh
Background: Accurate detection of genomic structural variations (SVs) is important for understanding phenotypic diversity and complex diseases. Few studies have used classification to detect SVs from next-generation sequencing data. Moreover, existing algorithms rely mainly on analyzing the alignment signatures of paired-end reads to predict different types of variation. Here, candidate SV regions and their features are computed from single reads only, and classification is used to predict the variation type of each region.
Results: Our approach uses reads with multi-part alignments to define a set of candidate SV regions. To annotate these regions, we extract novel features based on the reads at the breakpoints. We then build three random forest classifiers to identify regions with deletions, inversions, or tandem duplications.
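As an illustration only, the following minimal sketch shows how a read with a two-part alignment might be turned into a candidate region with a few simple descriptors. It is not the MPRClassify implementation: the `AlignedPart` structure, the `candidate_region` function, the coordinate conventions, and the three descriptors are assumptions made for this example, and the novel breakpoint features used in the paper are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AlignedPart:
    """One aligned segment of a single read (hypothetical structure)."""
    chrom: str
    ref_start: int   # 0-based, inclusive reference coordinate
    ref_end: int     # 0-based, exclusive reference coordinate
    read_start: int  # offset of this segment in the original read (assumed convention)
    strand: str      # '+' or '-'


def candidate_region(
    parts: List[AlignedPart],
) -> Optional[Tuple[str, int, int, Dict[str, object]]]:
    """Derive a candidate region from a read aligned in exactly two parts.

    Returns (chrom, start, end, descriptors), or None when the read does not
    have two parts on the same chromosome. This is an assumed simplification.
    """
    if len(parts) != 2:
        return None
    # Order the two segments by their position within the read.
    first, second = sorted(parts, key=lambda p: p.read_start)
    if first.chrom != second.chrom:
        return None

    # Breakpoint of the first segment: where it stops, in read direction.
    bp1 = first.ref_end if first.strand == "+" else first.ref_start
    # Breakpoint of the second segment: where it begins, in read direction.
    bp2 = second.ref_start if second.strand == "+" else second.ref_end

    start, end = min(bp1, bp2), max(bp1, bp2)
    descriptors = {
        "same_strand": first.strand == second.strand,  # False hints at an inversion
        "forward_order": bp1 <= bp2,                   # False hints at a duplication
        "span": end - start,                           # candidate region length
    }
    return first.chrom, start, end, descriptors


# A read whose two parts flank a gap on the reference (deletion-like signature).
read = [
    AlignedPart("chr1", 100, 150, 0, "+"),
    AlignedPart("chr1", 300, 350, 50, "+"),
]
print(candidate_region(read))
```

Descriptors of this kind, computed per candidate region, could then be passed as a feature vector to one classifier per variation type; the classifier itself is omitted from this sketch.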
Conclusions: This paper proposes MPRClassify, a random forest-based classification approach that addresses the problem of finding SVs using single reads only. The single reads are used both to define candidate regions and to extract their features. Experimental results show that single reads are sufficient to find SVs without the need for paired-end read signatures. Our proposed approach outperforms existing approaches and serves as a basis for future studies on finding SVs using single reads.