DeepCRE - deep learning applications for identification and functional annotation of cis-regulatory elements in crops
Gene expression produces the molecular phenotype from the genotype. It starts with transcription from DNA to RNA, which is regulated by protein transcription factors (TFs). TFs possess a DNA-binding domain (DBD) that recognizes distinct DNA motifs via physicochemical interaction. Such transcription factor binding motifs operate as cis-regulatory elements (CREs) and are arranged in cis-regulatory modules (CRMs), which localise immediately upstream or downstream of the transcription start site (TSS) or the transcription termination site (TTS). However, inferring the regulatory code encoded in DNA by CREs and their cognate TFs is highly complex. For example, in the model plant Arabidopsis thaliana alone, 2,296 TFs from 58 different protein families have been characterised by their encoded DBD. With 27,655 target genes, an estimated 5 to 20 million potential TF-CRM interactions are possible in the relatively small eukaryotic genome of A. thaliana.
The regulatory network of crop plants with larger genomes can be unravelled by combining data from transcriptomics with, e.g., chromatin immunoprecipitation sequencing (ChIP-seq). In this way, the binding patterns of proteins such as TFs during a gene's transcription, together with the corresponding expression profiles, can be linked to CREs situated in the gene's flanking regions using machine learning (ML). Furthermore, ML models are trained on, and combined across, different model species such as Zea mays, Arabidopsis thaliana, Solanum lycopersicum and Sorghum bicolor. Consequently, not only species-specific but also evolutionarily conserved CRE motifs can be identified. The resulting ML models and identified CRE motifs enable the annotation and prediction of cis-regulatory sequences, and the functional linkage to their effects, in new genomes and gene models of crop plants that are not yet genetically characterised.
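As an illustration of how flanking regions might be prepared as ML input, the following is a minimal sketch and not the project's actual pipeline. It assumes a forward-strand gene with 0-based TSS and TTS coordinates; the function names, window sizes and the one-hot encoding scheme are hypothetical choices made for this example.

```python
# Minimal sketch: extract the flanking windows around a gene's TSS and TTS
# and one-hot encode them as input for a sequence-based ML model.
# All names and default window sizes are illustrative assumptions.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T] indicators.

    Bases other than A, C, G, T (e.g. N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def flanking_windows(chrom_seq, tss, tts, upstream=1000, downstream=500):
    """Return the sequence windows around the TSS and the TTS.

    Coordinates are 0-based and the gene is assumed to lie on the forward
    strand; windows are clipped at the chromosome boundaries.
    """
    def window(center):
        start = max(0, center - upstream)
        end = min(len(chrom_seq), center + downstream)
        return chrom_seq[start:end]

    return window(tss), window(tts)


def encode_gene(chrom_seq, tss, tts, upstream=1000, downstream=500):
    """Concatenate the one-hot encoded TSS and TTS windows of one gene."""
    tss_win, tts_win = flanking_windows(chrom_seq, tss, tts, upstream, downstream)
    return one_hot(tss_win) + one_hot(tts_win)


if __name__ == "__main__":
    toy_chromosome = "ACGTNACGTA"
    features = encode_gene(toy_chromosome, tss=3, tts=7, upstream=2, downstream=2)
    print(len(features), "encoded positions")
```

In such a setup, each encoded gene would be paired with a label derived from its expression profile, and genes on the reverse strand would additionally require reverse-complementing the windows, which this sketch omits.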