Transposable elements (TEs), including LINEs, SINEs, and long terminal repeat elements (LTRs), constitute a substantial portion of eukaryotic genomes. TEs can function as cis-regulatory elements, serving as alternative promoters that regulate tissue-specific gene expression. However, comprehensive characterisation of TE-driven transcripts remains a major challenge because these elements are repetitive and complex, especially when short-read sequencing technologies are used. To overcome these limitations, we developed RepeatSeeker, a Python-based tool designed to identify and annotate TE-driven transcripts from long-read sequencing data. RepeatSeeker integrates aligned reads, gene annotations, and repeat element annotations to detect repeat-exon overlaps at single-transcript resolution. To explore the biology of TE-driven transcripts further, we also developed downstream analytical pipelines that (i) assign epigenetic states such as DNA methylation to individual TE-driven transcripts, and (ii) perform coding sequence (CDS) analysis to assess the coding potential of TE-driven transcripts. As a benchmark, we first analysed published PacBio mouse oocyte data [1]. RepeatSeeker confirmed that the majority of TE-driven transcription in mouse oocytes originates from MaLR-ERVL-type LTRs. By integrating DNA methylation data, we further characterised their methylation profiles at single-transcript resolution and showed that these LTRs are predominantly unmethylated in mouse oocytes. This demonstrates the accuracy and utility of RepeatSeeker in analysing TE-driven transcripts. We then hypothesised that TE-driven transcripts contribute to both transcriptomic and proteomic diversity. To test this hypothesis, we applied the CDS analysis pipeline to our Oxford Nanopore datasets of the mouse placental transcriptome.
The results indicated that most LTR-derived transcripts in mouse placenta encode proteins whose peptide sequences differ from those of their associated (host) genes. This suggests that TE-driven transcription not only increases transcriptomic diversity but may also expand proteome diversity.
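The repeat-exon overlap step described above can be illustrated with a minimal Python sketch. This is not RepeatSeeker's actual implementation. The class and function names (Repeat, Transcript, repeat_exon_overlaps, driving_repeat), the 0-based half-open coordinates, the toy coordinates, and the working definition of "TE-driven" as a transcription start site lying inside an annotated repeat are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Repeat:
    """One repeat annotation (hypothetical structure; 0-based, half-open)."""
    chrom: str
    start: int
    end: int
    name: str
    family: str


@dataclass(frozen=True)
class Transcript:
    """One aligned long read reduced to its exon blocks (hypothetical structure)."""
    read_id: str
    chrom: str
    strand: str  # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]  # sorted by genomic start, half-open


def exons_in_transcription_order(tx: Transcript) -> List[Tuple[int, int]]:
    # On the minus strand the 5'-most exon has the highest genomic coordinates.
    return list(tx.exons) if tx.strand == "+" else list(reversed(tx.exons))


def repeat_exon_overlaps(
    tx: Transcript, repeats: List[Repeat]
) -> List[Tuple[int, Repeat, int]]:
    """Return (exon_number, repeat, overlap_bp) for every exon-repeat overlap."""
    hits = []
    for idx, (ex_start, ex_end) in enumerate(exons_in_transcription_order(tx), start=1):
        for rep in repeats:
            if rep.chrom != tx.chrom:
                continue
            overlap = min(ex_end, rep.end) - max(ex_start, rep.start)
            if overlap > 0:
                hits.append((idx, rep, overlap))
    return hits


def driving_repeat(tx: Transcript, repeats: List[Repeat]) -> Optional[Repeat]:
    """Assumed definition: the repeat that contains the transcription start site."""
    first_start, first_end = exons_in_transcription_order(tx)[0]
    tss = first_start if tx.strand == "+" else first_end - 1
    for rep in repeats:
        if rep.chrom == tx.chrom and rep.start <= tss < rep.end:
            return rep
    return None


# Toy annotations and reads; the coordinates are invented for illustration only.
repeats = [
    Repeat("chr1", 1000, 1400, "MTA_Mm", "ERVL-MaLR"),
    Repeat("chr1", 5000, 5200, "B1_Mus1", "Alu"),
]
read1 = Transcript("read1", "chr1", "+", ((1200, 1500), (3000, 3200)))
read2 = Transcript("read2", "chr1", "-", ((4000, 4300), (5100, 5400)))

for tx in (read1, read2):
    rep = driving_repeat(tx, repeats)
    overlaps = [(i, r.name, bp) for i, r, bp in repeat_exon_overlaps(tx, repeats)]
    print(tx.read_id, "driven by:", rep.name if rep else None, "overlaps:", overlaps)
```

In this toy example, read1 starts inside the LTR and would be called TE-driven, whereas the first exon of read2 overlaps a SINE but its start site lies outside it. The nested loop is quadratic. A real tool working at genome scale would more likely use an interval index, which is omitted here for brevity.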
[1] Qiao, Y. et al. High-resolution annotation of the mouse preimplantation embryo transcriptome using long-read sequencing. Nat. Commun. 11, 2653 (2020).