Helixer is a deep learning-based tool for base-wise gene annotation of eukaryotic genomes that requires only the genomic sequence as input: it assigns each base to one of four classes, namely intergenic, untranslated region (UTR), coding sequence (CDS), or intron. However, while Helixer’s overall performance continues to improve, its prediction of UTR positions has remained relatively stagnant. One possible contributing factor is the quality of the current reference genomes, which are often not ideal in this respect: a 3' bias is not uncommon in RNA-seq data, which makes accurate determination of the 5' UTR considerably more difficult (Mortazavi et al., 2008).
In my PhD thesis, the goal is to explore approaches on the deep learning side that yield better models for UTRs despite errors in the references.
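
To make the base-wise classification described above more concrete, the following minimal sketch shows how a toy annotation could be turned into one label per base and then into one-hot target vectors. It is an illustration only: the class order in `CLASSES`, the helper names `label_bases` and `one_hot`, and the toy coordinates are assumptions made for this example and are not taken from Helixer's code base.

```python
# Illustrative sketch only; the class order below is an assumption,
# not necessarily the encoding Helixer uses internally.
CLASSES = ("intergenic", "utr", "cds", "intron")


def label_bases(length, features):
    """Return one class name per base of a sequence of the given length.

    features: iterable of (start, end, class_name) with 0-based,
    end-exclusive coordinates. Bases not covered by any feature
    are labelled as intergenic.
    """
    labels = ["intergenic"] * length
    for start, end, name in features:
        if name not in CLASSES:
            raise ValueError(f"unknown class: {name}")
        if not 0 <= start <= end <= length:
            raise ValueError(f"feature out of range: {(start, end)}")
        labels[start:end] = [name] * (end - start)
    return labels


def one_hot(labels):
    """Encode each base label as a four-element one-hot vector."""
    return [[1 if name == cls else 0 for cls in CLASSES] for name in labels]


# Toy gene on a 17-base sequence (hypothetical coordinates):
# 5' UTR, exon, intron, exon, 3' UTR, flanked by intergenic bases.
features = [
    (2, 4, "utr"),
    (4, 7, "cds"),
    (7, 10, "intron"),
    (10, 13, "cds"),
    (13, 15, "utr"),
]
labels = label_bases(17, features)
encoded = one_hot(labels)
```

In this representation an error in the reference, for example a 5' UTR that is annotated too short, shows up as a handful of bases whose target label is intergenic although they should be UTR, which is the kind of label noise the models would have to tolerate.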