Ensemble Deep Learning Models on Raw DNA Sequences for Viral Genome Identification in Human Samples

Detecting highly divergent or previously unknown viruses remains a major challenge with important clinical implications. In human sequencing datasets, alignment-based approaches frequently classify many assembled contigs as “unknown” because they lack detectable similarity to reference genomes. In this work, we introduce an ensemble of deep neural networks designed to identify viral sequences across diverse human biospecimens. The proposed framework operates on metagenomic sequence data from next-generation sequencing platforms, in which biochemical nucleotide incorporation events are transduced into fluorescence-based signals, captured by optical sensor arrays, and then processed into digital sequence information.
The proposed ensemble integrates complementary convolutional architectures that capture both local patterns and broader compositional signals directly from raw metagenomic contigs. The training data consist of sequences from 19 metagenomic experiments. Using 300 base pair (bp) contigs, the model achieves an area under the ROC curve (AUC) of 0.939, outperforming previously reported state-of-the-art machine learning and deep learning methods for viral genome classification evaluated on the same dataset under an identical testing protocol. Combining our system with other recent methods for representing a DNA sequence as a feature vector yields a marginally higher AUC of approximately 0.94.
All code and the dataset will be made publicly available to ensure full reproducibility of our results.
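
As a rough illustration of the pipeline summarized above, the following sketch shows how a raw contig might be one-hot encoded for a convolutional network, how scores from several models could be combined, and how the area under the ROC curve can be computed. It is not the authors' implementation: the function names, the treatment of ambiguous bases, and the simple score averaging are all assumptions made for illustration, and the abstract does not specify the actual fusion rule.

```python
# Illustrative sketch only. Function names, the handling of ambiguous bases,
# and the averaging rule are assumptions, not the method used in the paper.
BASES = "ACGT"


def one_hot(contig):
    """Encode a DNA string as rows of 4 values; ambiguous bases (e.g. N) become all zeros."""
    rows = []
    for base in contig.upper():
        row = [0.0] * 4
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def ensemble_scores(per_model_scores):
    """Average the viral-probability scores of several models, contig by contig."""
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice, each model in the ensemble would consume the encoded contig and emit one score per contig; `ensemble_scores` then takes one score list per model, and `roc_auc` compares the combined scores against viral (1) and non-viral (0) labels.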
