JOBIM 2026 Mini-Symposium "Is structural gene annotation in eukaryotic genomes still a challenge?"

Organizers: Sophie Lemoine (1), Erwan Corre (2), Véronique Brunaud (3)

1- GenomiqueENS - IBENS, CNRS, INSERM, Université PSL, slemoine@bio.ens.psl.eu

2- ABIMS - Station Biologique de Roscoff - CNRS-SU, erwan.corre@sb-roscoff.fr

3- IPS2 - INRAE, CNRS, Univ. Paris-Saclay, veronique.brunaud@inrae.fr

Keywords: genome annotation, gene structure, lessons learned

Eukaryotic genome sequencing is growing exponentially, as illustrated by the NCBI database, which lists about 60,000 eukaryotic genomes from 25,000 distinct species (www.ncbi.nlm.nih.gov/datasets/genome/). Nevertheless, structural gene annotation, on which the functional interpretation of these genomes rests, remains a challenging step. Many genomes have only partial annotations or none at all (only 24% of genomes have an annotation). In addition, the quality of gene models varies widely depending on the species and the tools used (Scalzitti et al. 2020). Despite the contribution of transcriptome sequencing (RNA-seq), these annotations remain biased when it comes to defining exon boundaries, UTRs and alternative splice sites. Moreover, the identification of all isoforms of coding and non-coding genes remains incomplete.

The aim of this mini-symposium is to present lessons learned on the contribution of long-read RNA-seq to the annotation of coding genes, non-coding genes and isoforms. It will also explore new annotation tools based on artificial intelligence (AI), such as Helixer (Stiehler et al. 2021), which combine deep neural networks with HMM-type models (Hidden Markov Models). Finally, approaches based on detecting breakpoints in RNA-seq expression profiles could offer a new way of approaching genome annotation.

Programme: Thursday 2 July, 15h-18h

15h - Jean-Marc Aury (CEA Genoscope) : "Structural Annotation of Marine Genomes from ATLASea"

15h30 - Audrey Onfroy (IBENS): "When AI meets structural genome annotation: lessons from Helixer in Lepidoptera"

15h55 - Véronique Brunaud (IPS2): "Arabidopsis genome annotations, differences between official and Helixer"

16h20 - 16h40 Break

16h40 - Thomas Derrien (IGDR) / Fabien Degalez (Institut Agro): "Genome annotation of lncRNAs using long-read transcriptomics"

17h05 - Fabrice Legeai (IGEPP): "Gene annotation of insect genome with EGAPx and Helixer"

17h30 - Arnaud Liehrmann (IPBS-SU): "Detecting transcriptional regulation despite incomplete annotation"

18h - End of the mini-symposium

This mini-symposium is co-organized by two professional networks of the bioinformatics community: PEPI IBIS (https://pepi-ibis.inrae.fr) at INRAE and the MERIT network (https://merit.cnrs.fr) at CNRS. We thank the SFBI (https://www.sfbi.fr/) for its support.

Abstracts

Jean-Marc Aury: Large-scale gene annotation across marine eukaryotic diversity within the ATLASea programme

ATLASea aims to generate annotated reference genomes for thousands of marine eukaryotic species spanning a broad range of taxonomic lineages. Such phylogenetic diversity poses significant challenges for genome annotation, as gene structures, intron-exon organizations, repeat content, and available resources vary considerably across taxa. Developing a generic and scalable annotation framework capable of producing consistent results across this diversity therefore represents a major objective of the project.
To address this challenge, we developed an evidence-driven annotation workflow combining multiple complementary sources of information. Protein evidence is selected from phylogenetically relevant species, while transcript evidence is obtained from both publicly available RNA-Seq datasets and RNA-Seq data generated within ATLASea. These heterogeneous sources are integrated through a reconciliation strategy that produces a consensus set of gene models while maximizing biological support from independent evidence types. In parallel, we systematically run several state-of-the-art annotation approaches on all genomes processed. Although these predictions are not currently incorporated into the final annotation release, they provide an invaluable benchmarking framework for evaluating the robustness of our production workflow and monitoring the performance of emerging annotation methodologies across a large and taxonomically diverse dataset.
Here, we present the ATLASea annotation framework, its deployment at scale, and the lessons learned from annotating genomes representing a wide diversity of marine eukaryotic lineages.

Audrey Onfroy: When AI meets structural genome annotation: lessons from Helixer in Lepidoptera

Accurate structural genome annotation is essential for reliable transcript quantification in transcriptomic analyses. However, existing annotation pipelines can be difficult to interpret and optimize. They may also require experimental data that are not yet available. Recently, deep learning tools such as Helixer have been developed to predict gene structures directly from genome assemblies. We applied Helixer to annotate the genome of Morpho helenor, a Lepidoptera species. We generated annotations using both default pre-trained models and fine-tuned models developed for this project. We compared the resulting annotations with a reference annotation and with annotations generated from long-read RNA sequencing data. We also evaluated short-read RNA-seq read assignment using these annotations. Our results show that Helixer produces reliable annotations. However, it is not sufficient as a standalone solution. It can serve as a first annotation step when no annotation is available and can also complement existing annotations. Finally, our results highlight the importance of model selection, as the prediction model used by Helixer affects annotation quality. Future work will focus on building a consensus annotation and performing functional genome annotation.

Véronique Brunaud: Arabidopsis genome annotations, differences between official and Helixer

This presentation gives a brief overview of the annotations proposed by Helixer for several plant genomes in comparison with their official annotations. It then focuses on the latest annotation version of the Arabidopsis thaliana genome (TAIR12), emphasizing the differences in structural annotation proposed by Helixer. The two annotations agree for over 85% of coding genes, and Helixer proposes 1,720 new genes. Finally, the characteristics of these new proteins are examined and functional analyses are performed by comparing them with the UniProt database, Pfam domains, and ortholog groups.

Thomas Derrien / Fabien Degalez: Genome annotation of lncRNAs using long-read transcriptomics

Long-read RNA sequencing has become a major driver of genome annotation efforts, particularly for long non-coding RNAs (lncRNAs), whose low expression levels, strong tissue specificity and complex transcript structures have long limited accurate annotation using short-read sequencing alone. 

In this presentation, we will share lessons learned from our participation in recent large-scale annotation projects leveraging long-read transcriptomics. Through the GENCODE Capture Long-Read Sequencing initiative, we will show how targeted full-length transcript sequencing enabled the discovery and annotation of tens of thousands of previously unannotated lncRNA genes and transcript isoforms, substantially expanding reference annotations in both human and mouse. These efforts also highlighted the importance of combining long-read data with stringent annotation workflows to balance transcript discovery and annotation quality. We will also present results from a population-scale long-read transcriptomics study in genetically diverse human individuals. This work revealed that current reference annotations incompletely represent transcript diversity across human populations and that thousands of transcripts, including population-specific isoforms, remain absent from standard annotations. These findings illustrate how annotation completeness is influenced not only by sequencing technology but also by the diversity of sampled individuals. 

Finally, we will briefly discuss how improved transcript models can support downstream comparative genomics analyses, including the identification of conserved lncRNAs across vertebrates through synteny- and sequence-based approaches. Such analyses provide an additional framework for prioritizing candidate functional lncRNAs among the rapidly growing catalogs generated by long-read sequencing. Overall, these projects illustrate how long-read transcriptomics is transforming lncRNA annotation, moving the field from transcript discovery toward the construction of more complete, biologically meaningful and evolutionarily informed gene catalogs.

Arnaud Liehrmann: Detecting transcriptional regulation despite incomplete annotation

While many RNA-Seq-based tools have been developed to analyze the transcriptome, most only consider the abundance of sequencing reads over annotated features such as genes. Because these annotations are typically incomplete, important regulatory events go undetected in differential expression analysis. To address this, we developed DiffSegR, an R package that discovers transcriptome-wide expression differences between two biological conditions from RNA-Seq data. DiffSegR requires no prior annotation: it uses a multiple changepoint detection algorithm to delineate the boundaries of differentially expressed regions directly from the per-base log2 fold change. In a few minutes of computation, DiffSegR correctly recovered the role of the chloroplast ribonuclease Mini-III in rRNA maturation, and of PNPase in the 3′/5′ degradation of rRNA, mRNA and tRNA precursors as well as in intron accumulation. Moreover, recent results suggest that DiffSegR could scale to nuclear genomes while retaining its ability to uncover novel regulatory events. We believe DiffSegR will benefit biologists working on transcriptomics, as it allows access to a layer of the transcriptome overlooked by the classical differential expression analysis pipelines widely used today.
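
To make the idea of annotation-free segmentation concrete, here is a minimal Python sketch of changepoint detection on a per-base log2 fold change. It is an illustration under stated assumptions, not DiffSegR's implementation: DiffSegR is an R package, and the abstract does not say which changepoint algorithm it uses. The sketch swaps in a generic penalized least-squares optimal partitioning, and the names `log2_fold_change`, `segment`, `pseudocount` and `penalty` are hypothetical.

```python
import numpy as np


def log2_fold_change(cov_a, cov_b, pseudocount=1.0):
    """Per-base log2 fold change of condition B over condition A.

    The pseudocount (an assumption of this sketch) avoids division by zero
    at positions with no coverage.
    """
    a = np.asarray(cov_a, dtype=float)
    b = np.asarray(cov_b, dtype=float)
    return np.log2((b + pseudocount) / (a + pseudocount))


def segment(signal, penalty):
    """Split a signal into segments of roughly constant mean.

    Generic optimal partitioning: minimizes the within-segment sum of squared
    errors plus `penalty` per changepoint. Runs in O(n^2); pruned variants such
    as PELT or FPOP are used in practice for long sequences.
    Returns a list of half-open (start, end) intervals.
    """
    y = np.asarray(signal, dtype=float)
    n = len(y)
    # Prefix sums give the cost of any segment in constant time.
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared errors of y[i:j] around its mean, clamped at zero
        # to absorb floating-point rounding.
        total = s1[j] - s1[i]
        return max((s2[j] - s2[i]) - total * total / (j - i), 0.0)

    best = np.full(n + 1, np.inf)  # best[j]: optimal penalized cost of y[:j]
    best[0] = -penalty             # so that the first segment is not penalized
    last = np.zeros(n + 1, dtype=int)  # start of the last segment ending at j
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                last[j] = i

    # Backtrack from the end to recover the segment boundaries.
    bounds = []
    j = n
    while j > 0:
        i = int(last[j])
        bounds.append((i, j))
        j = i
    return bounds[::-1]


if __name__ == "__main__":
    # Toy coverage: 60 bases, with positions 20-39 up-regulated in condition B.
    cov_a = [10] * 60
    cov_b = [10] * 20 + [80] * 20 + [10] * 20
    lfc = log2_fold_change(cov_a, cov_b)
    for start, end in segment(lfc, penalty=1.0):
        print(start, end, round(float(lfc[start:end].mean()), 2))
```

The segment boundaries come from the signal alone, so a differentially expressed region can be delineated even where no gene model exists. The `penalty` value sets the trade-off between sensitivity and over-segmentation and would need tuning on real, noisy coverage data.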