Novelty HGNC gene symbol Chr CDS start CDS end Strand Top rank List Model status GENCODEv22 biotype Human GENCODEv28 gene ID Human GENCODEv28 transcript ID CDS size (aa) No. coding exons Conservation Mouse GENCODE M18 annotation Non-murine representative model or genome coordinates (START-STOP) Human short-read data expression Mouse short read-data expression Contemporary RefSeq annotation Bicistronic? Peptide support sorfs.org support Comments Edge case? CDS sequence New discovery C1orf232 1 26168499 26164161 -1 118 1000 gene added absent ENSG00000282872 ENST00000634842 186 4 avians; potential ortholog in fish without synteny, e.g. carp XP_018954235.1 previously undescribed coding gene ENSMUSG00000108398 chicken LOC419602 no CAGE; no HPA; Intropolis finds expression in brain and eye experiments (e.g. SRR548613). CAGE supports inner ear cell expression; no ENCODE RNAseq absent Uncharacterised protein. Previously absent due to complete lack of cDNA or EST transcript evidence in human or mouse. no MNQAFWKTYKSKVLQTLSGESEEDLAEERENPALVGSETAEPTEETFNPMSQLARRVQGVGVKGWLTMSSLFNKEDEDKLLPSEPCADHPLAARPPSQAAAAAEARGPGFWDAFASRWQQQQAAAASMLRGTEPTPEPDPEPADEAAEEAAERPESQEAEPVAGFKWGFLTHKLAEMRVKAAPKGD New discovery AC119676.1 1 41628816 41585306 -1 419 1000 existing transcript 5' UTR of HIVEP3 ENSG00000284895 ENST00000646142 26 2 mammals previously undescribed coding gene ENSMUSG00000028634 / all sources support general expression all sources support general expression absent cotranscribed with HIVEP3 gawron_2016:38097 Uncharacterised protein within the 5' UTR of HIVEP3. Current evidence indicates it is co-transcriptional with HIVEP3, based on a single shared TSS for both loci. 
no MNAGFQREQRFSFGHRKWCLQHRRRA New discovery CTXND2 1 150912315 150912482 1 937 1000 gene added absent ENSG00000283324 ENST00000636087 55 1 mammals previously undescribed coding gene ENSMUSG00000105734 / all sources support testis / placenta expression all sources support testis / placenta expression ncRNA XR_158744.3 no Cortexin family member. no MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS New discovery SPRR5 1 152948555 152948881 1 233 1000 gene added absent ENSG00000283227 ENST00000636302 108 1 mammals known coding gene ENSMUSG00000102308 / CAGE supports salivary acinar, skin and tongue expression; HPA supports high skin expression; Intropolis is dominated by psoriasis experiments (e.g. SRR1146240) CAGE supports tongue, vagina and skin expression; ENCODE RNAseq supports low expression in various tissues absent Small proline-rich protein family member. Located within a cluster of previously known paralogs. no MSQQKQKQCAPPQQCCPPPQQRCPPPQQCCPPPQQCCPPPQQCCPPPQQCCPPPQQCCPPPQQCCPPPQQYCPPPQQTKQPCQPPPKCQEPCAPKCPPPQQCQTSKQK New discovery MYOCOS 1 171623884 171626601 1 2157 unannotated gene added absent ENSG00000283683 ENST00000637642 80 2 mammals known coding gene ENSMUSG00000091060 / all sources support testis expression; HPA also finds female reproductive tissue expression all sources support testis expression ncRNA XR_922279.1 no Uncharacterised protein. In rodents the CDS is longer, lacking the human initiation and termination codons. However, the human CDS is seen to be ancestral based on comparison with other mammals, e.g. dog PLAR model linc|3P|XLOC_054809|TCONS_00133026:6.09827|109AA|76AA. 
no MAQKSLANNSINLPYKDLTSEVTRRRVTMITRKEIITQKSDEAKEMLSHLDLEQAPPPHRTYLTVPPAPPPSPAEDPTVS New discovery CCDC195 2 224716365 224703764 -1 583 1000 gene added absent ENSG00000283428 ENST00000638102 201 3 avians previously undescribed coding gene ENSMUSG00000110100 lizard XP_008104485.1 no CAGE; no HPA; Intropolis is dominated by cancer experiments, the top ranked normal tissue is B-cells (SRR315115); BLUEPRINT finds expression in blood cancers especially No CAGE; no ENCODE RNAseq absent Member of the coiled-coil domain protein family. no MEADIQLMRLIQEMRAEIHKLEKENQALRMKLTASSQRASGSGRESGDEREEEAPGQSPATLQGAVSTDAAPAVQEHQGNVMIVRRYSISSSVCSSAVNDPWKSGKSHPKSGILEGQRTLKSLACSPIKKQDMEEKVFATDSLTSNRTSQRASPEHVCGCRDKTKAVSFLLPMDMSSYSKNSSSLKHSPNQATNQLSIIAE New discovery SCYGR1 2 227387949 227387683 -1 8412 paralog gene added absent ENSG00000284629 ENST00000641359 88 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence 12 novel protein-coding genes have been added to mouse in a location with overall synteny:ENSMUSG00000114299; ENSMUSG00000113880; ENSMUSG00000113973; ENSMUSG00000113846; ENSMUSG00000113084; chr1:82946339-82946764 (a late addition, not yet in GENCODE); ENSMUSG00000113925; ENSMUSG00000114011; ENSMUSG00000113097; ENSMUSG00000100190; ENSMUSG00000113267; ENSMUSG00000104423. This cluster also contains a single known gene, Krtap28-13. / no data / absent no Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). 
no MGCCGCGGCGGCGGCGGGCGGGCGRCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCSSCGYSCGKGCCQQKCCCQKQCCC New discovery SCYGR2 2 227598893 227599255 1 853 1000 gene added absent ENSG00000284643 ENST00000641394 120 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGGCGGCGGRCGGCGGGCGGGCGGGCGGGCGGGCGGGCGGGCGGGCGGSCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCHSCGCGCGKGCCQQKCCCQKQCCC New discovery SCYGR10 2 227608784 227609101 1 853 1000 gene added absent ENSG00000284622 ENST00000641246 105 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent no Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. This member is annotated as a polymorphic pseudogene on the reference genome due to nonsense SNP rs563988735, although based on frequency data the gene may be coding in over 10% of the population. Frequency data for adjacent SNP rs575523495 within the same codon indicates that [TGC] is the more common codon in individuals lacking the nonsense mutation, so this translation is provided here. no MGCCGCGGCGGRCSGGCGGGCGGGCGGGCGGGCGGCGGGCGSYTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCRSCGCGCGKSCCQQKCCCQKQCCC New discovery SCYGR3 2 227614840 227614538 -1 853 1000 gene added absent ENSG00000284704 ENST00000642029 100 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. 
The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGSCGGCGGGCGGCGGGCGGGCGGGCGGVCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCRSCGCGCRKGCCQQKCCCQKQCCC New discovery SCYGR4 2 227617498 227617815 1 853 1000 gene added absent ENSG00000284631 ENST00000641801 105 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGSCGGCGGRCGGGCGGGCSGGCGGGCGGGCGGGCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCGSCGCGYGKGCCQQKCCCQKQCCC New discovery SCYGR5 2 227666804 227667061 1 19673 paralog gene added absent ENSG00000284667 ENST00000641976 85 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent no Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGGCGGCGGGCGGGCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCCSCGCGCGKGCCQQKGCCQKQCCC New discovery SCYGR6 2 227724757 227724440 -1 1636 unannotated gene added absent ENSG00000284725 ENST00000641918 105 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent neXtProt Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). 
no MGCCGCGGCGGGCGGCGGGCSGGCGGGCGGGCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCHSCGCGCGCGKGCCQQKGCCQQKGCCKKQCCC New discovery SCYGR7 2 227728335 227728625 1 2286 paralog gene added absent ENSG00000284718 ENST00000641700 96 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent no Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGSCGGCGGGCGGCGGGCGGGCGGGCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCGSCGCGCGKGCCQQKGCCQKQCCC New discovery SCYGR8 2 227745894 227746220 1 608 1000 gene added absent ENSG00000284635 ENST00000641981 108 1 there is a cluster in mammals with conserved synteny, though 1:1 orthologs cannot be deduced with confidence see above / no data / absent neXtProt Small cysteine and glycine repeat containing family member, with prospective similarity to keratin associated proteins. The accuracy of the underlying genomic alignments may be suspect, leading to spurious PhyloCSF signals (see main text). no MGCCGCGGCGGGCGGCSGGCGGGCGGGCGGGGCGGGCGSCTTCRCYRVGCCSSCCPCCRGCCGGCCSTPVICCCRRTCSSCGCGYGKGCCQQKGCCQQKCCCQKQCCC New discovery FAM240C 2 241897331 241894213 -1 22742 lincRNA existing transcript lincRNA; protein-coding in v19 ENSG00000216921 ENST00000401641 90 2 mammals; potential ortholog in coelacanth without synteny, linc|3P|XLOC_096740|TCONS_00126143:5.57183|93AA|92AA| absent; apparently lost in rodents tupaia TREES_T100012573 CAGE is highly specific to tongue and diaphragm; HPA and Intropolis suggests more general expression / ncRNA NR_135766.1, though previously found as coding LOC285095 no Protein has homology to FAM240A and FAM240B, both of which are on this sheet. Actually a 'rediscovery', as both GENCODE and RefSeq had the locus annotated as protein-coding in earlier releases. 
FAM240C is notably more divergent with respect to FAM240A and FAM240B than those two CDS are to each other. no MSKSLTLKNPGRVAYDSGGIKMFWEKKIEHHARHLQNEDIRVRRSALNKLRVGWAEQLEGRNKMLQGPGRCPDRVPEATESLHTKDKKAA New discovery MDFIC2 3 70311973 70196926 -1 692 1000 existing transcript antisense lncRNA ENSG00000242120 ENST00000567252 189 3 avians; potential ortholog in fish without synteny, e.g. zebrafish OTTDARG00000038399 known coding gene ENSMUSG00000090667 chicken ENSGALT00000012550 no CAGE; no HPA; weak Intropolis support dominated by cell lines and cancer cells CAGE supports mesenchymal stem cell expression; ENCODE RNAseq supports certain brain experiments and limb absent MyoD family inhibitor domain-containing protein. The human model was initially built using mouse cDNA evidence. no MSETELEKIKVRTAEHLENDKNNISWLKEDTQLTNAKHADEKPINAIVINSVSDFNITDGPAKENPNEKKLSESSTSLSSLEECQTTFSYLQTDTSVHHRDTDEECASLILACLFCQFWDCLLMLPGTCETVCTKMCCPSRRYHHTSDENHSRNDCSCNCDMDCSLFESCHETSECLELAMEISEICYR New discovery C3orf85 3 109136848 109149894 1 2594 lincRNA transcript extended lincRNA ENSG00000241224 ENST00000622536 90 3 avians; potential ortholog in coelacanth without synteny, LOC106704520 previously undescribed coding gene ENSMUSG00000110573 chicken XP_416629.1 CAGE supports specific gastrointestinal expression; HPA and Intropolis are dominated by gastrointestinal expression, the former also finding support for liver and kidney expression CAGE supports specific gastrointestinal expression; ENCODE RNAseq supports gastrointestinal expression ncRNA NR_033977.1 neXtProt no Uncharacterised protein. 
no MAYKMLQVVLCSTLLIGALGAPFLLEDPANQFLRLKRHVNLQDYWDPDHSSDVWVNTLAKQARETWIALKTTAQYYLDMNTFTFDMSTAQ New discovery TMEM271 4 576062 574905 -1 703 1000 transcript extended lincRNA ENSG00000273238 ENST00000610212 385 1 vertebrates previously undescribed coding gene ENSMUSG00000105867 zebrafish OTTDARG00000041905 all sources support brain / CNS expression all sources support brain / CNS expression absent Transmembrane family protein. no MKWSVRGACAALSSCLLLACALSAAAVGLKCFSLGSELRGEPFRLGAAAGAFYSGLLLAAGLSLLGAALLCCGPRDAPLAGSEPGPGLGVPAAPAGAPEATPGESGAAAGAPGPVSSQNLLLLGVLVFMLGVLSAFAGAVIDGDTVSLVERKYSHYCLPPRAPGSSPGSAPGSTPGSAPGSAPGSAPGSAPGAPRARSTLDSATSAKCRQLKDYQRGLVLSTVFNSLECLLGLLSLLLVKNYKSSQARRGRRGRRRGGRALARPRGGSGLRAQPPASRARRGRRGRRGRRLQQRPSEASILSPEESDLAAPGDCAGFAAHHAVSYINVGVLHALDEAGAEVRCGGHPSVELPGYAPSDPDLNASYPYCCRPPCETPRPWETHRAC New discovery AC092442.1 4 6070162 6064977 -1 426 1000 gene added absent ENSG00000284684 ENST00000636216 45 2 vertebrates previously undescribed coding gene ENSMUSG00000113373 zebrafish OTTDARG00000044229 no CAGE; no HPA; minimal Intropolis support, not indicative of a particular expression pattern CAGE supports weak olfactory brain expression; no ENCODE RNAseq; conventional support limited to cDNA clone I220006B15 from olfactory sensory neurons absent cotranscribed with JAKMIP1 no Uncharacterised protein.The CDS is co-transcribed with JAKMIP1, being translated on an alternative first exon within a JAKMIP1 intron and sharing an out of frame CDS overlap in coding exon 2. Transcriptional support in zebrafish is limited to ESTs from olfactory tissues, i.e. matching the mouse RNA. JAKMIP1 has a well established role in brain development, and has a clear brain expression profile in all species examined. Of further interest, this work established that C4orf50 immediately 3' of JAKMIP1 is also co-transcribed in brain as part of the locus, and this pattern of transcription is conserved across to avians at least. 
Although co-transcribed, the C4orf50 and JAKMIP1 CDS are separated by a large non-coding exon that is conserved across this evolutionary distance (human chr4:6022704-6025013). Thus, this locus apparently produces three distinct, non-overlapping translations. no MEYVLFVLYFSFFLCLCALVCLYFSGCQEMTYKHEEACCGDIFWI New discovery AC093323.1 4 6674213 6674572 1 17969 re-ranking existing transcript lincRNA ENSG00000170846 ENST00000635031 119 1 human absent / CAGE is dominated by blood / immune cell experiments; HPA supports general expression; Intropolis finds extremely high expression in primary foot fibroblasts (SRR201164) / pseudogene NR_015433 neXtProt Presumably a duplicant of Morf4 family associated protein 1 (MRFAP1), found 5' adjacent. Additional duplicants were found in the vicinity not based on PhyloCSF; these are clearly pseudogenic due to CDS disruptions. The novel protein duplication is primate specific, and the PhyloCSF signal is potentially based on improper genome alignments. no MRPVDADEAREPREEPGSPLSPAPRAGRENLASLERERARAHWRARRKLLEIQSLLDAIKSEVEAEERGARAPAPRPRAEAEERVARLCAEAERKAAEAARMGRRIVELHQRIAGCECC New discovery EXOC1L 4 55820027 55837351 1 126 1000 existing transcript lincRNA ENSG00000250821 ENST00000636125 172 3 avians previously undescribed coding gene ENSMUSG00000091204 chicken ENSGALT00000042649 weak pineal gland support in CAGE; testis expression in HPA; Intropolis top rank expression is in embryo (SRR490995) no CAGE; ENCODE RNAseq supports weak brain / CNS expression; cDNA is from testes pseudogene NR_003935.2 Partial duplication of the EXOC1 locus located 3' adjacent. Human annotation was originally based on cDNA BC171876 from pooled tissues, while we also find PacBio capture-seq support in brain and K562 experiments. 
no MSSLVKEDLEKKLFKPLSQNLYEFIEIEFSVQDRYYLCVSVTKKEEVKIVMVKHYRIGLDEKYEVTKKWSLNDLQMIDGKEADTDNPFFDLHFKKVYSLEAYSCASKYAFARTVNKLNHAYLKKDLQIVNFDSTYINDDSIWSSNNKDCLVLMRICFYAFNLVCLSLCPLPL New discovery AC079341.3 4 121765105 121764914 -1 472 1000 existing transcript 5' UTR of TMEM155 ENSG00000164112 ENST00000643802 63 1 mammals; potential ortholog in avians without synteny or transcript evidence, e.g. allMis1/American alligator JH733013:223,996-224,181 previously undescribed coding gene ENSMUSG00000085007 / CAGE supports brain expression; HPA is dominated by brain expression, alongside evidence of weaker more general expression; the highest ranked Intropolis introns are in brain CAGE supports brain expression; ENCODE RNAseq is dominated by brain absent cotranscribed with TMEM155 gonzalez_2014:723790 Uncharacterised protein. Initially found within the 5' UTR of TMEM155 (ENSG00000284849). However, the 130aa TMEM155 CDS was found to be dubious, lacking conservation beyond higher primates and being partially composed of transposable element sequence. Its supposed transmembrane domain prediction could not be recapitulated. The 130aa TMEM155 CDS will thus be removed from GENCODE, and Ensembl ID of this locus has changed from ENSG00000284849 to ENSG00000164112. no MEWELNLLLYLALFFFLLFLLFLLLFVVIKQLKNSVANTAGALQPGRLSVHREPWGFSREQAV New discovery REELD1 4 146216953 146230513 1 309 1000 transcript extended lincRNA ENSG00000250673 ENST00000636502 526 6 vertebrates previously undescribed coding gene ENSMUSG00000112190 coelacanth ENSLACT00000006652, 3' truncated by assembly gap; fugu AUGUSTUS g21894.t3 has a conserved N-t region no CAGE; HPA finds weak general expression; Intropolis finds weak expression without an obvious pattern / 'unknown' XR_939298.1 Reeler domain-containing protein. The original human lincRNA annotation was based on EST AV698927 only, which is cancer derived and does not cover the entire CDS. 
no MRMQAALVGWACTTLCLASCSSAFSHGASTVACDDMQPKHIQAQPQHQDSHHITIHTHRTSYAPGDKIPVTVRSSRDFMGFLLQARRVSDHQIAGTFVLIPPHSKLMTCFQEADAVTHSDKSLKRNLSFVWKAPAQPVGDIKFLLSVVQSYFVYWARIESSVVSQQTHSSAHSDDRMEPRLLMPNLHQRLGDVEGAAPAPRTPITLPQQHTHVFAVALPGAAEEDNLDPVPASIWVTKFPGDAETLSQPSSHTATEGSINQQPSGDSNPTLEPSLEVHRLERLVALKRVSSESFASSLSTHHRTQDDPSFDSLETCLSSDGGEQDKTKASNRTVTQPPLSTVQLTYPQCLWSSETFTGNGVRASNPIPVLQTSGTSGLPAAGDQSEASRASASFLPQSKHKELRAGKGNGEGGVGYPRQTNPRPDIGLEGAQAPLGIQLRTPQLGILLCLSATLGMALAAGLRYLHTQYCHQQTEVSFSEPASDAVARSNSGETVHVRKIGENSFVLVQAEYNWITPSVGSKKTVL New discovery SMIM31 4 164770444 164801194 1 285 1000 existing transcript lincRNA ENSG00000248771 ENST00000507311 71 2 coelacanth previously undescribed coding gene ENSMUSG00000074300 coelacanth PLAR model EnsShortNonCoding|3P|XLOC_155298|TCONS_00200728|0:0.366284|93AA|72AA| (72aa CDS) CAGE supports gastrointestinal expression; HPA and Intropolis support gastrointestinal expression alongside weaker expression in certain other organs CAGE supports gastrointestinal expression; ENCODE RNAseq supports gastrointestinal expression ncRNA NR_038834.1 (LINC01207) no Member of the small integral membrane protein family. no MELPYTNLEMAFILLAFVIFSLFTLASIYTTPDDSNEEEEHEKKGREKKRKKSEKKKNCSEEEHRIEAVEL New discovery SMIM32 5 136192524 136192213 -1 256 1000 existing transcript lincRNA ENSG00000271824 ENST00000607574 103 1 vertebrates previously undescribed coding gene ENSMUSG00000110086 zebrafish OTTDARG00000044227 CAGE supports expression in kidney and gastrointestinal tissues, and also brain; HPA finds highest expression in gastrointestinal tissues but also expression in brain, kidney, prostate; no introns for Intropolis CAGE finds highest expression in pancreas, though is otherwise dominated by brain experiments; ENCODE RNAseq is dominated by brain expression, alongside certain gastrointestinal tissues and limb ncRNA NR_024418.1 (LOC389332) Member of the small integral membrane protein family. 
no MYGDIFNATGGPEAAVGSALAPGATVKAEGALPLELATARGMRDGAATKPDLPTYLLLFFLLLLSVALVVLFIGCQLRHSAFAALPHDRSLRDARAPWKTRPV New discovery SMIM33 5 139471122 139472921 1 56 1000 gene added absent ENSG00000283288 ENST00000637503 132 2 avians known coding gene ENSMUSG00000073598 chicken XP_004944821 no CAGE; HPA and Intropolis are dominated by gastrointestinal tissues, with evidence of weaker expression in other organs CAGE and ENCODE RNAseq are dominated by gastrointestinal expression absent Member of the small integral membrane protein family. no MHQAGHYSWPSPAVNSSSEQEPQRQLPEVLSGTWEQPRVDGLPVVTVIVAVFVLLAVCIIVAVHFGPRLHQGHATLPTEPPTPKPDGGIYLIHWRVLGPQDSPEEAPPGPLVPGSCPAPDGPRPSIDEVTCL New discovery SMIM40 6 33329265 33329026 -1 635 1000 existing transcript 5' UTR of DAXX ENSG00000285064 ENST00000494082 79 1 mammals previously undescribed coding gene ENSMUSG00000092349 / eye specific in CAGE; no HPA; Intropolis is dominated by cancer-derived experiments no CAGE; general ENCODE RNAseq expression absent no Uncharacterised protein. Identified within the 5' UTR of DAXX. However, the novel CDS locus has polyA features in human and mouse, while DAXX has a distinct promoter downstream. Also, transcriptional overlap between the novel CDS locus and DAXX is limited to human, and is weakly supported. It is thus apparent that these two loci are separate genes. no MAEEGDVDEADVFLAFAQGPSPPRGPVRRALDKAFFIFLALFLTLLMLEAAYKLLWLLLWAKLGDWLLGTPQKEEELEL New discovery SMIM29 6 34247791 34246803 -1 204 1000 existing transcript protein-coding gene with a CDS in wrong frame ENSG00000186577 ENST00000476320 102 4 avians known coding gene ENSMUSG00000062753 alligator XP_006025159 all sources support general expression all sources support general expression spurious NM_001008703.2 (C6orf1) neXtProt Member of the small integral membrane protein family. Previously known as C6orf1, which had a spurious CDS annotated in the wrong frame in RefSeq and GENCODE. 
The spurious CDS is actually longer than the novel CDS described here at 159aa. no MSNTTVPNAPQANSDSMVGYVLGPFFLITLVGVVVAVVMYVQKKKRVDRLRHHLLPMYSYDPAEELHEAEQELLSDMGDPKVVHGWQSGYQHKRMPLLDVKT New discovery SMIM28 6 138378073 138382847 1 212 1000 existing transcript lincRNA; protein-coding in v19 ENSG00000262543 ENST00000573100 152 2 coelacanth; potential ortholog in fish without synteny, e.g. tetraodon CAF95351 absent coelacanth AUGUSTUS g11564.t1 no CAGE; HPA and Intropolis are dominated by gastrointestinal tissues / absent Member of the small integral membrane protein family. The CDS was annotated in GENCODEv19 and subsequently removed. no MRGLLGSSWKKFGHAGRGTYEWLTSEPGLPLLETQLQGTQGVSSTQEDVEPFLCILLPATILLFLAFLLLFLYRRCKSPPPQGQVFSIDLPEHPPAGEVTDLLPGLAWSSEDFPYSPLPPEATLPSQCLPPSYEEATRNPPGEEAQGCSPSV New discovery AC187653.1 7 290170 291488 1 506 1000 transcript extended lincRNA and adjacent pseudogene ENSG00000248767 ENST00000506382 233 3 vertebrates known coding gene ENSMUSG00000094504 zebrafish OTTDARG00000040484 no CAGE; no HPA; Intropolis finds weak expression in developmental cells and cancer experiments no CAGE; ENCODE RNAseq supports testis and brain / CNS expression ncRNA WI2-2373I1.2 Forkhead box L1-like protein. Previously absent due to a complete lack of cDNA or EST transcriptional support in human or mouse. no MFDSSQYPYNCFNYDADDYPAGSSDEDKRLTRPAYSYIALIAMAIQQSPAGRVTLSGIYDFIMRKFPYYRANQRAWQNSIRHNLSLNSCFVKVPRSEGHEKGKGNYWTFAGGCESLLDLFENGNYRRRRRRRGPKREGPRGPRAGGAQGPSGPSEPPAAQGRLAPDSAGEGAPGREPPASPAPPGKEHPRDLKFSIDYILSSPDPFPGLKPPCLAQEGRYPRLENVGLHFWTM New discovery CCDC201 7 45873007 45863085 -1 944 1000 gene added absent ENSG00000283247 ENST00000636578 187 3 avians previously undescribed coding gene ENSMUSG00000087512 chicken ENSGALT00000020330 no CAGE; HPA supports placenta and testis expression; the top ranked introns in Intropolis are from a series of early embryo experiments (e.g. 
SRR490970) weak ovary CAGE support; ENCODE RNAseq supports placenta and ovary expression absent Uncharacterised protein. Previously absent due to a complete lack of cDNA or EST evidence in human or mouse. Genomes in the rodent / squirrel clade have lost the ancestral STOP (found in other mammals and birds), giving a C-t extended CDS. no MEPGVQDLGLSSSEDESPSLAIRSPTLRKPLKHSTPEEAALGWSPRPSGGASYLSGSPMPAHFSQDLASHPAGVSPPATVRKRRLSTLWASKESSLDLSAPGEEPPTSASLTQRQRQRQQQQQQQESLRAKSWAQNPGLPGILNTTGRKRRDPKKRAAAMERVRQWEIYVLQNIEEATQHELTIEDD New discovery FAM237B 7 90319748 90319329 -1 428 1000 gene added absent ENSG00000283267 ENST00000637645 139 1 avians previously undescribed coding gene ENSMUSG00000073234 ground tit XP_005518842.1 no CAGE; HPA supports testis expression; Intropolis shows strong expression in a series of experiments on preovulatory cumulus and mural granulosa cells, which are associated with developing oocytes (e.g. SRR836179) weak CAGE; ENCODE RNAseq shows high brain / CNS expression, and also placenta absent Uncharacterised protein. It has clear homology to human FAM237A (LOC200726) on chr2, which is also on this list. no MCFATRRWFYLHLGCMMLINLVNADFEFQKGVLASISPGITKDIDLQCWKACSLTLIDLKELKIEHNVDAFWNFMLFLQKSQRPGHYNVFLNIAQDFWDMYVDCLLSRSHGMGRRQVMPPKYNFPQKITGGNLNVYLRE New discovery TCAF2C 7 143647646 143639670 -1 81 1000 gene added absent ENSG00000283528 ENST00000636941 711 6 potential primate specific duplication mouse has a cluster of three TCAF loci in a syntenic location, though 1:1 relationships are not discernible / no expression data / absent TRPM8 channel-associated factor protein, adjacent to known protein-coding family member TCAF2. It is a gene fragment leading into a 5' sequence gap, which is likely why it was not annotated already. 
no YGEDVRQDQQQLLEGISELDIRTGGVPSQLLVHGALAFPLGLDASLNCFLAAAHYGRGRVVLAAHECLLCAPKMGPFLLNAVRWLARGQTGKVGVNTNLKDLCPLLSEHGLQCSLEPHLNSDLCVYCCKAYSDKEAKQLQEFVAEGGGLLIGGQAWWWASQNPGHCPLAGFPGNIILNCFGLSILPQTLKAGCFPVPTPEMRSYHFRKALSQFQAILNHENGNLEKSCLAKLRVDGAAFLQIPAEGVPAYISLHRLLRKMLRGSGLPAVSRENPVASDSYEAAVLSLATGLAHSGTDCSQLAQGLGTWTCSSSLYPSKHPITVEINGINPGNNDCWVSTGLYLLEGQNAEVSLSEAAASAGLRVQIGCHTDDLTKARKLSRAPVVTHQCWMDRTERSVSCLWGGLLYVIVPKGSQLGPVPVTIRGAVPAPYYKLGKTSLEEWKRQMQENLAPWGELATDNIILTVPTTNLQALKDPEPVLRLWDEMMQAVARLAAEPFPFRRPERIVADVQISAGWMHSGYPIMCHLESVKEIINEMDMRSRGVWGPIHELGHNQQWHGWEFPPHTTEATCNLWSVYVHETVLGIPRAQAHEALSPPERERRIKAHLGKGAPLCDWNVWTALETYLQLQQAFGWEPFTQLFAEYQTLSHLPKDNTGRMNLWVKKFSEKVKKNLVPFFEAWGWPIQKEVADSLASLPEWQENPMQVYLRARK New discovery H2BE 7 151210318 151207871 -1 60 1000 spliceform added alternative CDS of ABCF2 in different reading frame ENSG00000285480 ENST00000644661 122 2 avians absent; apparently lost in rodents chicken ENSGALG00000013346 no CAGE; no HPA; Intropolis finds expression specific to embryonic cells (e.g. SRR499827) / absent Histone protein, member of the H2B family. The second exon was previously incorporated in ABCF2 upstream, and this cDNA can now be seen as a misleading 'read-through' transcript event; these are clearly separate genes. no MSAEYGQRQQPGGRGGRSSGNKKSKKRCRRKESYSMYIYKVLKQVHPDIGISAKAMSIMNSFVNDVFEQLACEAARLAQYSGRTTLTSREVQTAVRLLLPGELAKHAVSEGTKAVTKYTSSK New discovery PRSS51 8 10517954 10496385 -1 31 1000 spliceform added antisense lncRNA ENSG00000253649 ENST00000636217 220 6 mammals known coding gene ENSMUSG00000052099 / very weak CAGE; HPA supports testis expression; the top ranked Intropolis introns are from cell lines, but there is also early embryo expression (e.g. SRR893067). all sources support testis expression absent Protease, serine family member. Human has a premature termination codon in the final exon, giving a shorter protein compared to other mammals including apes (the truncation is 23aa versus the mouse CDS). 
However, this termination is found downstream of the trypsin domain, and additional confidence in the coding potential of the human locus comes from mass spectrometry support. no MFQLLIPLLLALKGHAQDNPENVQCGHRPAFPNSSWLPFHERLQVQNGECPWQVSIQMSRKHLCGGSILHWWWVLTAAHCFRRTLLDMAVVNVTVVMGTRTFSNIHSERKQVQKEEERTWDWCWMAQWVTTNGYDQYDDLNMHLEKLRVVQISRKECAKRINQLSRNMICAWNEPGTNGIFKVLTPAQPPPPSRETVGHLWFVLFMEPRDSSKWVSSVGA New discovery AC138647.1 8 141514697 141518715 1 102 1000 existing transcript lincRNA; protein-coding in v19 ENSG00000226490 ENST00000427937 188 2 mammals known coding gene ENSMUSG00000086361 / all sources support testis expression all sources support testis expression absent Uncharacterised protein. This was a protein-coding gene in GENCODEv19 before being switched to lncRNA. While the first coding exon, the splice junction, and the bulk of the second exon CDS are conserved in mammals, the STOP region is highly divergent. It has not been possible to establish an ancestral state for the STOP: human, mouse, dog, opossum use different C-ts. However, these translations are similar in size. Avian genomes have a region of prospective homology to the CDS of exon 2. However, this is not syntenically conserved and the first exon of the mammalian gene is not seen. The provenance of this locus is clearly as protein-coding sequence, although the possibility that it is pseudogenic in human and / or mouse cannot be ruled out. yes, potential unitary pseudogene in human and / or mouse. 
MASSCPGTPSPAGLPPPSVATPGPAAPPEPAFPDIYGGDAQLWEAHFRGIGRAYRALGKQDDFAIRVLTENFTLPFPFAWPPGSDPACGPLFYDPRDRADFDFLLRGPGASPPALLRPLHATAQAAMRKRRLERLALSCARARGPGPASSCCCPAPPPPSRSPRPALPATAPPGWPRPRRCPESEQNK New discovery SMIM27 9 32552435 32552923 1 324 1000 existing transcript antisense lncRNA ENSG00000235453 ENST00000453396 55 2 coelacanth previously undescribed coding gene ENSMUSG00000028407 coelacanth linc|3P|XLOC_149553|TCONS_00193392:1.03778|96AA|53AA| all sources support general expression all sources support general expression ncRNA NR_033991.1 no Small integral membrane protein, previously known as TOPORS-AS1. no MKPVSRRTLDWIYSVLLLAIVLISWGCIIYASMVSARRQLRKKYPDKIFGTNENL New discovery FAM240B 9 38703999 38694776 -1 214 1000 gene added absent ENSG00000283329 ENST00000637493 78 2 mammals; potential ortholog in avians without synteny, e.g. lizard XP_016846473 known coding gene ENSMUSG00000096537 / CAGE supports eye expression; no HPA; Intropolis supports eye expression (e.g. SRR548613), but also skeletal muscle (ERR030899) no expression data ncRNA XR_242549.3 no Protein with homology to FAM240A and FAM240C, both of which are on this sheet. no MNNQYIRREVFCCGTCHELKSFWEKEISKQTFYRELEEDRQERSALKKLREEWRQRLERRLRMLDNPVEKEKPAHTAD New discovery AL353572.3 9 87956267 87956500 1 67 1000 gene added absent ENSG00000283205 ENST00000636536 77 1 mammals, although with a primate-specific genomic rearrangement; synteny is maintained between the dog, cow and mouse loci previously undescribed coding gene ENSMUSG00000114559 / all sources support testis expression all sources support testis expression absent no Uncharacterised protein. no MGFVTNKSAFKAGDSLYLRRAFVNNLGEERRTRIQIQSIQKALDIQIREIDREKAALKRFLVKLHKTTGYFPQKPLW New discovery B3GNT10 9 120796604 120799678 1 1441 re-ranking transcript extended pseudogene ENSG00000214654 ENST00000464488 369 2 mammals; potential ortholog seen in avians: e.g. 
chicken B3GNTL2, with partial synteny (PSDM5 flank) previously undescribed coding gene ENSMUSG00000107167 / CAGE supports weak expression in cell lines and cancer experiments; however, alternative transcript ENST00000437707 skips coding exon 1 with high general expression in HPA; Intropolis finds weak support for coding intron 1, limited to cell lines and cancer experiments No CAGE; ENCODE RNAseq is limited to liver (support was also found in liver PacBio capture-seq experiments) NR_027442.1 (LOC100288842) Galactosyltransferase superfamily member. The human cDNA and EST evidence - and the vast bulk of RNAseq evidence - supports an alternative first exon that does not contain the initiation codon. This may be why the locus was previously annotated as a pseudogene fragment. Just two codons are present on this first coding exon; there is no PhyloCSF signal here, although the [ATG] is found in all therian mammals. This locus could be pseudogenised in human, or potentially it could be translated with a truncated N-terminus, i.e. using an [ATG] in coding exon 2. Nonetheless, it was annotated as a protein-coding locus in human because there is evidence for the transcription of coding exon 1, albeit limited. 
yes, potential unitary pseudogene in human MQVTFCRLRTHQWCFILFNVILFHALLFGTDFVEEYFLHSLPYIDVKVLEIKNKARKLNIEPLRSNLSKYYVLSQSEICKGKNIFLLSLIFSSPGNGTRRDLIRKTWGNVTSVQGHPILTLFALGMPVSVTTQKEINKESCKNNDIIEGIFLDSSENQTLKIIAMIQWAVAFCPNALFILKVDEETFVNLPSLVDYLLNLKEHLEDIYVGRVLHQVTPNRDPQNRDFVPLSEYPEKYYPDYCSGEAFIMSQDVARMMYVVFKEVPMMVPADVFVGICAKFIGLIPIHSSRFSGKRHIRYNRCCYKFIFTSSEIADPEMPLAWKEINDGKECTLFETSYELISCKLLTYLDSFKRFHMGTIKNNLMYFAD New discovery BX255925.3 9 137217997 137218602 1 549 1000 gene added absent ENSG00000284976 ENST00000645271 201 1 avians previously undescribed coding gene ENSMUSG00000115018 chicken AUGUSTUS g10876.t1 CAGE supports esophagus, tonsil, tongue, testis expression; HPA shows general expression, though likely includes readthrough transcription from NDOR1 upstream so hard to interpret; no introns for Intropolis CAGE supports stomach, skin, ileum, vesicular gland, gastrointestinal tissues expression; ENCODE RNAseq likely includes readthrough transcription from Ndor1, although transcription in stomach notably increases at the CAGE region absent RING finger protein. The initiation codon has not been defined with confidence, i.e. based on conservation. This could suggest the PhyloCSF signal actually finds an alternative final exon of NDOR1, found a short distance upstream. However, several lines of evidence argue against this: (1) there is no support for splicing between the two loci in human, and while mouse does contain connections these are weakly supported, only incorporate part of the PhyloCSF signal and are poorly conserved; (2) the CAGE region of the novel locus is conserved between human and mouse, and in human it colocalises with rich TF-binding data; (3) there is no expectation that NDOR1 – an oxidoreductase metabolic enzyme – would utilise a RING finger domain; (4) there are multiple RING finger loci in the region, which presumably represent a series of ancient duplications. 
The possibility that this locus is a pseudogene cannot be ruled out, although its age argues against this. The 5' of the PhyloCSF signal corresponds to the edge of a sharp signal in multiz, which corresponds to an in-frame [GAG]. The human model was constructed with an [ATG] a short distance upstream, which is listed here. Nonetheless, it is possible the locus uses a non-ATG initiation codon. no MEGAWALPTWKEEGREQAAGQGEEEECPICTEPYGPRERRLALLNCSHGLCVGCLHRLLGSASSADLGRVRCPLCRQKTPVLEWEICRLQEELLQADGPSRQPRREAPASYHRNPGPWGSLEHRYQLRFLAGPVGGRGCLPFLPCPPCLGARLWTLRERGPCARRLALLSLLALELLGLLLVFTPLLLLGLLFVLLDRSGR New discovery HSPA14 10 14842285 14843894 1 43 1000 / ms transcript extended non-coding transcript within HSPA14 ENSG00000284024 ENST00000640019 361 2 mammals previously undescribed coding gene ENSMUSG00000051396 / all sources support general expression all sources support general expression absent cotranscribed with HSPA14 Kim et al Myb/SANT-like DNA-binding domain containing protein. Found within UTR exons of HSPA14. All evidence indicates that transcription occurs from the shared HSPA14 promoter in human and mouse, with no evidence of differential expression between the two loci. It is thus not obvious how translation of the two CDS is distinguished. This CDS was initially suggested by a GENCODE reanalysis of the Kim et al and Wilhelm et al mass spectrometry datasets, although the supporting evidence was not considered strong enough to support annotation at that time. HSPA14 is found in all vertebrates, which suggests the novel CDS arose at a later time. 
no MASANSSAGIRWSRQETRTLLSILGEAEYIQRLQTVHHNADVYQAVSKRMQQEGFRRTERQCRSKFKVLKALYLKAYVAHATSMGEPPHCPFYDTLDQLLRNQIVTDPDNLMEDAAWAKHCDQNLVASDAPGEEGTGILKSKRTQAADHQPILKTVKASDEDCQLRISDRIRETSDLEDSWDESSGAGCSQGTPSYSSSHSLFRGAVAPCQSSPMARLGVSGEPSPCTSTNRSTPGVASTPQTPVSSSRAGFVSGGDRPLTSEPPPRWARRRRRSVARTIAAELAENRRLARELSKREEEKLDRLIAIGEEASAQQDTANELRRDAVIAVRRLATAVEEATGAFQLGLEKLLQRLISNTKS New discovery C10orf143 10 130110772 130064354 -1 1601 lincRNA spliceform added lincRNA ENSG00000237489 ENST00000637128 108 4 xenopus; potential ortholog in fish without synteny, e.g. salmon XP_020333744 known coding gene ENSMUSG00000040139 xenopus OCT69938 all sources support general expression all sources support general expression ncRNA NR_034125.1 (LINC00959) Uncharacterised protein. Previously not annotated in human as the cDNA evidence supports non-coding transcripts within the locus. no MDSLALGRWRQRRAEDLQVPGDVKRVCRRLEASGHERGCHQVNACALASWGPEDRELPSRGCLPAPRPESGQGRLSTGISQNGGRSSAQPCPRCIAGESGHFSHTKNH New discovery C10orf95 10 102451093 102450452 -1 493 1000 existing transcript lincRNA ENSG00000120055 ENST00000625129 213 1 xenopus previously undescribed coding gene ENSMUSG00000099655; CDS previously annotated in the wrong frame xenopus LOC105948016 CAGE supports expression in various epithelial cells; HPA supports expression in lung and colon; top ranking Intropolis introns are from trophoblasts (e.g. SRR486238) CAGE supports expression in tracheal epithelial cells; ENCODE RNAseq supports expression in lung and gastrointestinal tissues spurious NM_024886.2 (C10orf95) Uncharacterised protein. While RefSeq and UniProt have long recognised this locus as protein-coding, they represent a spurious 257aa ORF in the wrong frame. GENCODE represented this false translation up to v19, after which the gene became a lincRNA. 
no MYVYSWPPPKQGVWPPPPQLLTCTYLAAPLLLPPVQAHSFRSRPGSLHAGEWAAPREYHRFYGPAAPPEAAPPWWACPPAYATTLRRPCAAAGISGLSLQAPAAVAESWAPWPEGGSLQTELRWGRVERARGPPLQLPDFVRRELRRAYGTYPRADVRVTQRRGQFLLQATPRVLEPDHRVEWRVRRRPDSGDSSPAREAAERGRPRKSKGLS New discovery TEX54 11 62832750 62832376 -1 402 1000 / ms gene added absent ENSG00000283268 ENST00000636508 124 1 mammals known coding gene ENSMUSG00000090840 / CAGE supports macrophage expression in a small number of experiments; HPA supports testis and bone marrow expression, although BLUEPRINT finds only weak expression in blood cells; no introns for Intropolis all sources support testis expression absent neXtProt Uncharacterised CDS. The locus apparently lost its single intron in primates. It is found within a small genomic region between WDR74 and STX5 on the same strand, and although it is clearly distinct from the latter it does have limited readthrough transcription evidence from the former. Nonetheless it seems most likely to be a separate coding gene in human and mouse due to CAGE support and its high specificity to testis, which is not seen for WDR74. Evidence for translation in testis has been observed in the HPP mass spectrometry datasets, which included PhyloCSF in the search space. This data had not been previously considered strong enough to support CDS annotation in isolation. no MGCCQDKDFEMSDEQSKEEESEDGREDETTDTQRGPRECERGLPEGRGELRGLVVPSGAEDIDLNSPDHPNHKSNESLLITVLWRRLSTFGRRGSSRPSKRQPDQIRKQESPIREGNQEEPEKG New discovery SMIM38 11 69157847 69158002 1 219 1000 gene added absent ENSG00000284713 ENST00000641568 51 1 mammals previously undescribed coding gene ENSMUSG00000109305 / no CAGE; HPA supports stomach expression; Intropolis has highest support in stomach (e.g. SRR980483) all sources support stomach expression absent no This CDS was reported by Saghatelian based on mass spec peptide [LILWSCLGTYIDYR] and conservation [PMID:27010111]. 
It is orthologous to rat protein-coding gene Rmt1, previously reported to be highly expressed in rat mammary tumours [PMID:11675151]. no MTSWPGGSFGPDPLLALLVVILLARLILWSCLGTYIDYRLAQRRPQKPKQD New discovery SMIM35 11 118086757 118013781 -1 391 1000 transcript extended antisense lncRNA ENSG00000255274 ENST00000636151 85 4 mammals known coding gene ENSMUSG00000091996 / no CAGE in normal tissues; HPA shows weak expression in kidney, lung, placenta; Intropolis support is weak in non-cancer experiments, although expression is apparent in most BLUEPRINT experiments no CAGE; ENCODE RNAseq supports testis and thymus expression, alongside more general, weaker transcription ncRNA NR_038318.1 no Uncharacterised protein. no MTGEDSISTLGLILGVGLLLLLVSILGYSLAKWYQRGYCWEGPNFVFNLYQIRNLKDLEMGPPFTISGHISSTDGGYMKFSNGLV New discovery TEX49 12 48727481 48765655 1 184 1000 existing transcript lincRNA ENSG00000257987 ENST00000548380 131 4 vertebrates known coding gene ENSMUSG00000022993 zebrafish OTTDARG00000041067 all sources support testis expression all sources support testis expression ncRNA NR_029448.1 (LINC00935) Uncharacterised protein. no MAFFNLYLLGYQNSFQNKKRNTTEETNQKEPEPTRLPPIISKDGNYSVHQNSHTRYHEAVRKVLLKTFPNQVFRIPLTDAQNFSFWWSHDPGVRPEETMPWIRSPRHCLIKSAMTRFMDHSILNDRTFSLY New discovery DIABLO 12 122226564 122226430 -1 279 1000 existing transcript 5' UTR of DIABLO ENSG00000284934 ENST00000475784 44 1 avians, within the DIABLO 5' UTR; a syntenic ORF exists in fish without transcript evidence, e.g. fr3/Fugu HE591722:58,952-59,158. 
previously undescribed coding gene ENSMUSG00000114278 chicken mRNA CR353660, ICGSC Gallus_gallus-4.0/galGal4 [chr15:5,798,711-5,798,845] CAGE expression is limited to neuroblastoma, except a weak signal in fetal rectum; no HPA; minimal Intropolis, almost entirely from neuroblastoma experiments CAGE supports embryo and neonate expression, especially in neurons and adrenal gland; ENCODE RNAseq finds transcription in most experiments absent cotranscribed with DIABLO no Uncharacterised protein. Found within a 5' UTR exon of DIABLO. Current evidence suggests this novel CDS is co-transcriptional with DIABLO in human and mouse, i.e. it is found on a transcript that also contains the DIABLO CDS. This also appears to be the case at least across to avians, i.e. chicken. However, DIABLO also has a strong, conserved TSS downstream of the novel CDS with general expression; while the novel CDS may always be transcribed in conjunction with DIABLO, it seems DIABLO is typically transcribed independently of the novel CDS. Although the human novel CDS has weak expression data, annotation was originally supported by cDNA BC046209 and 7 ESTs. All but one of these is from neuroblastoma. no MPASSTVHVLQLLRELLAFVLLSYTVLIGALLLAGWTTYFLVLK New discovery C12orf81 12 51814893 51813940 -1 173 1000 gene added absent ENSG00000284730 ENST00000642069 317 1 avians previously undescribed coding gene ENSMUSG00000113558 lizard: Broad AnoCar2.0/anoCar2 [chr2:92,335,927-92,335,957 (-)] no CAGE; HPA coverage is weak, and cannot be deconvoluted from a prospective long-form of the SCN8A 3' UTR on the opposite strand; no introns for Intropolis all sources support testis expression absent Uncharacterised protein. This single exon locus appears to be homologous to a multiexon locus found in multiple fish species and amphibians, e.g. zebrafish OTTDARG00000002022, although without preserved synteny. An equivalent multi-exon locus cannot be found in mammals or avians. 
The most parsimonious explanation would seem to be that the single exon locus represents an ancient retroinsertion near the base of the mammalian / avian clade, accompanied by the loss of the parent. no MAARTLASALVLTLWVWALAPAGAVDAMGPHAAVRLAELLTPEECGHFRSLLEAPEPDVEAELSRLSEDRLARPEPLNTTSGSPSRRRRREAAEDPAGRVAGPGEVSDGCREALAAWLAPQAASLSWDRLARALRRSGRPDVARELGKNLHQQATLQLRKFGQRFLPRPGAAARVPFAPAPRPRRAAVPAPDWDALQLIVERLPQPLYERSPMGWAGPLALGLLTGFVGALGTGALVVLLTLWITGGDGDRASPGSPGPLATVQGWWETKLLLPKERRAPPGAWAADGPDSPSPHSALALSCKMGAQSWGSGALDGL New discovery SMIM41 12 52079780 52080061 1 562 1000 existing transcript non-coding transcript within OR7E47P ENSG00000284791 ENST00000546390 93 1 avians previously undescribed coding gene ENSMUSG00000075408 alligator AUGUSTUS g4669.t1 CAGE supports expression in smooth muscle cells in colon and trachea, and also lung, urethra, testis, prostate; HPA supports expression in lung and colon and prostate, and weaker expression in spleen, stomach and adrenal gland; Intropolis is dominated by prostate tumours CAGE supports strong expression in cardiac myocytes, and also fetal testis and fetal lung; ENCODE RNAseq supports high expression in lung and testis, alongside weak general expression ncRNA NR_120437 no Uncharacterised protein. no MNGSQAGAAAQAAWLSSCCNQSASPPEPPEGPRAVQAVVLGVLSLLVLCGVLFLGGGLLLRAQGLTALLTREQRASREPEPGSASGEDGDDDS New discovery C13orf42 13 51111209 51084151 -1 136 1000 transcript extended lincRNA ENSG00000226792 ENST00000563710 325 4 coelacanth; potential ortholog in fish without synteny, e.g. tetraodon CAG05779 previously undescribed coding gene ENSMUSG00000100486 coelacanth ENSLACT00000010662 No CAGE; no HPA; Intropolis supports expression in a variety of developmental tissues or cell lines (e.g. SRR488136) CAGE has a weak inner ear signal; no ENCODE RNAseq support ncRNA NR_102432 (LINC00371) Uncharacterised protein. 
no MFRKIHSIFNSSPQRKTAAESPFYEGASPAVKLIRSSSMYVVGDHGEKFSESLKKYKSTSSMDTSLYYLRQEEDRAWMYSRTQDCLQYLQELLALRKKYLSSFSDLKPHRTQGISSTSSKSSKGGKKTPVRSTPKEIKKATPKKYSQFSADVAEAIAFFDSIIAELDTERRPRAAEASLPNEDVDFDVATSSREHSLHSNWILRAPRRHSEDIAAHTVHTVDGQFRRSTEHRTVGTQRRLERHPIYLPKAVEGAFNTWKFKPKACKKDLGSSRQILFNFSGEDMEWDAELFALEPQLSPGEDYYETENPKGQWLLRERLWERTVP New discovery TMEM272 13 51838530 51816751 -1 10679 lincRNA existing transcript lincRNA ENSG00000281106 ENST00000629372 187 4 avians; potential ortholog in fish without synteny, e.g. zebrafish OTTDARG00000042329 absent; apparently lost in rodents chicken LongAUGORFlinc|3PO|XLOC_009333|TCONS_00025932:1.79557|220AA|187AA||Pfam_1_doms|RNACode|HSS_689_1.0E-16 CAGE supports expression in brain, although almost entirely limited to caudate nucleus and putamen; HPA supports expression in most tissues, especially duodenum, small intestine and appendix; Intropolis is dominated by blood expression (e.g. SRR976769), and Blueprint supports expression in innate immune cells especially / ncRNA NR_027047 (LINC00282) Transmembrane domain-containing protein. no MPGGLEKTCHQCISKIASNACFVVVLCAFLALPLSMTFIGMKFLEDCPIQPLIPLYLLVGGIVGTLKVSLLLYDSTRMRRLLSKAVVIDDDDDDEYPWRQNAHRYYIHLLLSLFLFLWFILGNYWVFSVYLPDFLPPFQQPQDYCDKTLYLFAVGVLALSHTVLVLLLLCSGCVYLCSRWRLAADED New discovery LMLN2 14 23104920 23099282 -1 156 1000 gene added absent ENSG00000283654 ENST00000644000 730 14 vertebrates previously undescribed coding gene ENSMUSG00000114865 zebrafish OTTDARG00000043717 CAGE support limited to teratocarcinoma; no HPA; Intropolis has low scores, with the highest score in retina (SRR548611) alongside evidence of embryo and brain expression no CAGE; ENCODE RNAseq has weak support in brain / CNS, liver and testis 'unknown' XR_944274.1 Peptidase family member. The RefSeq non-coding model is based on a cDNA with a retained intron (BC153822). 
no MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKSRSSSLTLPSSRDPQPLRIQSCYLGDHISDGAWDPEGEGMRGGSRALAAVREATQRIQAVLAVQGPLLLSRDPAQYCHAVWGDPDSPNYHRCSLLNPGYKGESCLGAKIPDTHLRGYALWPEQGPPQLVQPDGPGVQNTDFLLYVRVAHTSKCHQEPSVIAYAACCQLDSEDRPLAGTIVYCAQHLTSPSLSHSDIVMATLHELLHALGFSGQLFKKWRDCPSGFSVRENCSTRQLVTRQDEWGQLLLTTPAVSLSLAKHLGVSGASLGVPLEEEEGLLSSHWEARLLQGSLMTATFDGAQRTRLDPITLAAFKDSGWYQVNHSAAEELLWGQGSGPEFGLVTTCGTGSSDFFCTGSGLGCHYLHLDKGSCSSDPMLEGCRMYKPLANGSECWKKENGFPAGVDNPHGEIYHPQSRCFFANLTSQLLPGDKPRHPSLTPHLKEAELMGRCYLHQCTGRGAYKVQVEGSPWVPCLPGKVIQIPGYYGLLFCPRGRLCQTNEDINAVTSPPVSLSTPDPLFQLSLELAGPPGHSLGKEQQEGLAEAVLEALASKGGTGRCYFHGPSITTSLVFTVHMWKSPGCQGPSVATLHKALTLTLQKKPLEVYHGGANFTTQPSKLLVTSDHNPSMTHLRLSMGLCLMLLILVGVMGTTAYQKRATLPVRPSASYHSPELHSTRVPVRGIREV New discovery ALDOA 16 30053494 30063842 1 226 1000 existing transcript 5' UTR of ALDOA ENSG00000285043 ENST00000338110 139 4 vertebrates previously undescribed coding gene ENSMUSG00000114515 zebrafish OTTDARG00000037560 CAGE supports weak expression in cell lines; HPA supports cerebral cortex expression, alongside weaker transcription in other organs; Intropolis scores are low, dominated by brain CAGE supports weak expression in testis; ENCODE RNAseq supports brain / CNS expression, alongside weaker general transcription absent cotranscribed with ALDOA Uncharacterised protein. Identified within 5' UTR exons of ALDOA. While ALDOA has its own downstream promoter in human and mouse, there is no evidence that the novel CDS is transcribed without also reading into ALDOA exons. In zebrafish the novel CDS is clearly a separate transcription unit from aldoaa. 
no MDASSSPWNPTPAPVSSPPLLLPIPAIVFIAVGIYLLLLGLVLLTRNCLLAQGCCADGSSPCRKQGSSGPPDCCWTCAEACNFPLPSPAHFLDACCPQPTRADWAPRCPRCCPLCDCACTCQLPDCQSLNCLCFEIKLR New discovery AC233723.1 17 4802319 4801289 -1 651 1000 spliceform added lincRNA ENSG00000262165 ENST00000635921 79 2 coelacanth previously undescribed coding gene ENSMUSG00000109833 coelacanth AUGUSTUS g10900.t1 all sources support only weak expression in cancer experiments and cell lines no CAGE: ENCODE RNAseq supports very low expression in brain and limb absent no Uncharacterised protein. While there is very little evidence for the expression of the first coding exon in human, there is an alternative first exon that is divergently transcribed from the PLD2 promoter region at appreciable levels. This alternative transcript is likely not protein-coding. The paucity of transcript evidence in human and mouse may call the functionality of the gene into question. However, given the depth of conservation across vertebrates, and the fact that there is at least some support for the splice junction in both species, it was decided to annotate the gene in both species as protein-coding. yes, potential unitary pseudogene in human and / or mouse. MGLKGAWCFPWCGCRRQRGTERGAGLSPAAPPDPSPAIAPTMAEGGVPSPGPGAYFSRKARLSFRHQLHDIASANDSTI New discovery LINC00854 17 43228554 43221571 -1 228 1000 existing transcript antisense lncRNA ENSG00000236383 ENST00000636331 168 4 zebrafish previously undescribed coding gene ENSMUSG00000085486 zebrafish OTTDARG00000040503 all sources support testis expression all sources support testis expression ncRNA NR_047479.2 (LINC00854) Uncharacterised protein. PMID:22196729 had previously reported this locus as a deeply conserved lncRNA, first identified in zebrafish. 
no MGSAYHWEARRRQMALDRRRWLMAQQQQELQQKEQELKNHQEEEQQSEEKLQPHKKLNVPQPPVAKLWTSQEQPQPSQQQPSVQPPSQPPPQPSTLPQAQVWPGPQPPQPQPPPQPTQPSAQARCTQHTSKCNLQDSQRPGLMNPCQSSPIRNTGYSQLKSTNYIQQW New discovery ANKRD40CL 17 50766913 50761363 -1 17729 re-ranking spliceform added lincRNA ENSG00000167117 ENST00000643007 142 4 avians previously undescribed coding gene ENSMUSG00000094091 chicken ENSGALT00000045234 (5' truncated, though ESTs can extend it to full length) all sources support expression in gastrointestinal tissues all sources support expression in gastrointestinal tissues ncRNA NR_073199.1 (LINC00483) Member of the ankyrin repeat domain-containing protein family. no MAEPEQDIGEKPAGEYKQEGTNPHLPTTDPSDNKKPSDTCLVRIQNPKENDFIEIELKRQELSYQNLLNVSCCELGIKPERVEKIRKLPNTLLRKDKDIRRLRDFQEVELILMKNGSSRLTEYVPSLTERPCYDSKAAKMTY New discovery AC015802.6 17 76569648 76568797 -1 14 1000 existing transcript retained intron transcript of ST6GALNAC2 ENSG00000284526 ENST00000640006 283 1 mammals previously undescribed coding gene ENSMUSG00000110170 / all sources support testis expression all sources support testis expression absent cotranscribed with ST6GALNAC2 Kim et al Uncharacterised protein. Identified within a non-coding transcript of the ST6GALNAC2 locus. The termination codon of the novel CDS overlaps with a conserved splice acceptor site of ST6GALNAC2. A 'conceptual translation' of the novel mouse locus was supplied to GenBank as ACL12360 as part of a 2009 investigation into sperm proteins [PMID: 19186949]. The human CDS was initially noted by GENCODE as part of the analysis into the Kuster et al and Wilhelm et al proteomics datasets, although without enough statistical confidence to support annotation in isolation. 
no MSSPSHTPNPSPTQSPSQSRSNNIVTPPGQQPAPSPSPTRGSSSRPPSHPPTPNASQPSSRTSSLNPSPQVQHVPRGAAQTPTPPTSKSPSQSGFKSLSRNPSLTPAVPPKSSFYSPATSSSYIGPIQNIPSYITPYVPRFLKEPPFFQPPTAPLPQNPCFACPSPCPARKPPPPPDSLYLPLLPPPPHHPQFNCPFPTPPGLFMPPSSLSYTPPVEVLVSGKPHVVPNVLPATFYTPFSRYYSQPRSYRSGYRGYSGALTLPSISPLQYDGSGRSVHFYHGS New discovery MBD3L2B 19 7021397 7019161 -1 7145 lincRNA existing transcript lincRNA ENSG00000196589 ENST00000636986 204 2 potential human-specific duplication absent; the cluster is apparently lost in rodents cow XP_015327519.1 no CAGE; no HPA; Intropolis supports expression in early embryo (e.g. SRR490988) and other developmental cells / absent Methyl-CpG binding domain protein 3-like family member. One of 5 members of a series of local duplications. It is not possible to say which locus is ancestral – other apes and monkeys have just two copies with conserved synteny - and the PhyloCSF signal may be based on improper genome alignments. no MGEPAFTSFPSLPVLGKLKRNMMPWALQKKREIHMAKAHRRRAARSALPMRLTSCIFRRPVTRIRSHPDNQVRRRKGDEHLEKPQQLCAYRRLQALQPCSSQGEGSSPLHLESVLSILAPGTAGESLDRAGAERVRSPLEPTPGRFPAVAGGPTPGMGCQLPPPLSGQLVTPADIRRQARRVKKARERLAKALQADRLARRAEM New discovery AC008397.1 19 18255307 18250130 -1 1071 re-ranking spliceform added 5' UTR of PDE4C ENSG00000284797 ENST00000643046 234 4 vertebrates known coding gene ENSMUSG00000095026 zebrafish OTTDARG00000012535 CAGE supports expression in lung and gastrointestinal tissues; HPA supports expression in gastrointestinal tissues alongside weaker expression in other organs; aside from cancer experiments, Intropolis is dominated by gastrointestinal experiments (e.g. SRR364829) all sources support expression in gastrointestinal tissues ncRNA NR_036575.1 (LOC729966) Uncharacterised CDS. Initially found within the 5' UTR of PDE4C. A reappraisal of the transcript evidence indicates that the two are separate loci, and that the existing model unifying the two (ENST00000355502) is a read-through event based on cDNA AK095384. 
Also, these genes do not appear to be co-transcribed in mouse or other species. The novel CDS was previously recognised in cow as 'cattle intestine-specific transcript 1' (CIST1) based on an in silico genome survey [PMID:16554549], i.e. with an expression profile that matches the human and mouse inferences provided here. no MACPQLPPLLLLVLVVLLKAGVNYNTPFTDIVTSENSMETSPVSSLISSPFAHSTHSSGEPPKSYSSTMSLETDSITHLSPSSSGATPTIQPSPSSTDSRMIPSSPQPETITHPSSGSPSAELTPSSHSTLPSSESLTPHWSPTSHSPGTEPLTSTDQTLEPPGPAPGDTGPRELHRNPSVVVVVCLLVSLLLIGSVVMAVRFCHRNESKFENLDEVSMGSVNDRLSFAHHLQE New discovery PNMA8C 19 46428359 46427745 -1 191 1000 gene added absent ENSG00000277531 ENST00000617053 204 1 mammals previously undescribed coding gene ENSMUSG00000108348 / all sources support brain expression all sources support brain expression absent Paraneoplastic antigen-like protein family member, derived from an LTR / Gypsy element in common with other family members. no MLFGVKDIALLEHGCKALEVDSYKSLMILGIPEDCNHEEFEEIIRLPLKPLGKFEVAGKAYLEEDKSKAAIIQLTEDINYAVVPREIKGKGGVWRVVYMPRKQDIEFLTKLNLFLQSEGRTVEDMARVLRQELCPPATGPRELPARKCSVPGLGEKPEAGATVQMDVVPPLDSSEKESKAGVGKRGKRKNKKNRRRHHASDKKL New discovery AC008687.4 19 49020515 49017968 -1 34 1000 spliceform added lincRNA ENSG00000268655 ENST00000637680 334 5 coelacanth previously undescribed coding gene, first appeared in GENCODE M19 coelacanth XP_005991057 no CAGE; HPA supports testis expression; Intropolis supports testis expression CAGE supports testis expression, with weaker though consistent expression in tracheal epithelial cells; ENCODE RNAseq supports expression in testis, lung and ovary ncRNA NR_073552.1 (LOC101059948); partial match Uncharacterised protein-coding gene. Located within the LHB region, which contains numerous genes for pregnancy-related hormones. However, this novel CDS does not share obvious homology with these loci. 
A human unprocessed pseudogene was also added 40kb downstream not based on PhyloCSF (ENSG00000283251), an event that apparently occurred in the primate lineage. no MASGTLAPGCRISATEVPGSRPNCHLTSSYFSHRIIPPIPFTPPTVQSTVADPLPQVAKQDSHNWAFDEVLSRWETTSGSAYVPKTHGGPCAQPRAPEPADPTRTVGIKDLGEKLRHRGWRLPLTTKYQSSETRAQYTGSPSGDPRAPEYFGPQPPQLADHHRGGPSQALIAWTKNPELSGRPFTVSDRGVLDRRQLYLTTSARDFRFYPKTELSGYPRKDSLTYWSFEETPQVWSHGPQRPPCPRSSRPPRPPRVRVPRVSPVTSAMPHRGALSLAQESYSPLLHPLRRLDRFCPLEAPWGGPHWKPLRGIYSVPKAYSTENSSYGSLKPALV New discovery C19orf85 19 55464554 55463472 -1 337 1000 gene added absent ENSG00000283567 ENST00000635964 222 2 mammals; potential ortholog in fish without synteny, e.g. tetraodon CAF97887 previously undescribed coding gene ENSMUSG00000110221 / CAGE supports testicular germ cell embryonal cancer and embryonic stem cell expression; no HPA; Intropolis supports stem cell expression, especially embryonic (e.g. SRR488685) no CAGE; weak ENCODE RNAseq support for brain expression absent Uncharacterised protein. no MHPGVPEGPGVSEPGPRELCAFVSGAAAHMLRALQPRRTRPPKRRPNHRRFLHNQICRQFTKIEAATQRLALSILSQEAPPQRPSLQKPPPPPPSPFLGVACAVAPTEAPHASASLSLAALDTSTLDLFDNIALTPECASMPWDPSSGSDAPLPAPGLSHRDLGQLDLRQVPHFCGPLPLPQHALGEEADLVAPDWGWVDCWEVPRAWDSQGIPEGWGTSSP New discovery EDDM13 19 56272835 56310148 1 609 1000 spliceform added antisense lncRNA ENSG00000267710 ENST00000637802 161 15 mammals known coding gene ENSMUSG00000053367 / CAGE supports epididymis expression; no HPA; Intropolis supports expression in early embryogenesis (e.g. SRR893051) CAGE supports epididymis expression; no ENCODE RNAseq XR_001754018.1 (LOC100506374); partial match Uncharacterised protein. Eleven of the coding exons are micro-exons, i.e. under 30bp in size; PhyloCSF produced a signal across 4 of them. This transcript was included in GENCODEv27, mistakenly removed in v28 and reinstated in v29 with the new ID ENST00000649256. 
no MHRSEPFLKMSLLILLFLGLAEACTPREVATKEKINLLKGIIGLMSRLSPDGLRHNITSLKMPPLVSPQDRTEEEIKKILGLLSLQVLHEETSGCKEEVKPFSGTTPSRKPLPKRKNTWNFLKCAYMVMTYLFVSYNKGDWCYCHYCNLELDIRDDPCCSF New discovery PMIS2 19 35586260 35587146 1 1763 unannotated gene added absent ENSG00000283758 ENST00000646476 135 2 mammals known coding gene ENSMUSG00000049761 / all sources support testis expression all sources support testis expression absent Already recognised as 'Pmis2, sperm specific protein' in mouse, although it could potentially be classed as a member of the CD255 protein superfamily due to the presence of an interferon-induced transmembrane protein domain. no MALKPPSATQPAPNAPATPDAPPTTGDPGASAAPGSPTTTGGPGAPAEVPQEPQEPTQTPEELAFYAPNYLCLTIFAILLFPPFGLAALYFSYETMKANQNSEWEEAYINSGRTGWFGAFVVMIGLGIIYGLVLY New discovery AC005551.1 19 3483403 3482804 -1 30 1000 gene added absent ENSG00000284638 ENST00000641816 149 2 avians; potential ortholog in vertebrates without synteny, e.g. zebrafish OTTDARG00000037948 previously undescribed coding gene ENSMUSG00000078440 / CAGE supports retina expression; weak HPA; Intropolis also supports retina expression (SRR548613), although the highest scores are in fetal intestine (SRR643742) CAGE supports eyeball; while ENCODE RNAseq suggests general expression, this is likely confounded by readthrough transcription from Dohh found 5' adjacent as these read coverages highlight a splice donor site only utilised in the readthrough model absent Uncharacterised protein. While the CDS is incorporated into transcripts that are shared with Dohh in mouse, the novel CDS also has its own TSS. A series of mouse ESTs begin at this TSS, all of which are from eye experiments e.g. CK628151. 
no MPGLAAEGEAEGWSPSPPLYEEYRPPPLDSIRLPRYVLYLLLAALVVVAVAYAIVGHLIKDLAHDLADWAFGPKPDQEAAPRELRPSLTGEDLEGLDLQLALAWQGEEDAGGGGEGAPSEPPPPPEPRRPSIAFKDPPSRSSFWKLMAT New discovery C20orf204 20 64036324 64038759 1 199 1000 spliceform added lincRNA ENSG00000196421 ENST00000636176 189 4 avians previously undescribed coding gene ENSMUSG00000108976 chicken AUGUSTUS g5189.t1 CAGE supports cancer expression only; HPA supports colon expression; Intropolis is dominated by cancer expression and cell lines no CAGE; weak general expression in ENCODE RNAseq ncRNA NR_027686.1 (LINC00176) neXtProt Uncharacterised protein. The mouse gene was initially annotated based on EST DV645834 from oocyte. no MVPPKPALWALLLALLGTAPSRAYSPACSVPDVLRHYRAIIFEDLQAAVKWGGAGAEKTRPGSRHFHFIQKNLTRPGSSGRRGRPRASCGAQKEHSILLSISSLGRTLRGAVAGGRRGALERAAWTVAVRTEAVMRRHCRTLRQRSRRPKMRPARRRGGRRQLLLRALDAVATCWEKLFALRAPASRDS New discovery ETDC X 135309480 135309659 1 no signal paralog gene added absent ENSG00000283644 ENST00000635820 59 1 potential primate-specific duplication absent / no evidence for expression / absent no ETDA, ETDB and ETDC are a cluster of three 59aa single exon CDS on chrX listed here, 2 of which are identical. ENSG00000229015 is an additional pseudogene, identified as part of this work based on regional analysis. The loci have clear homology to mouse Etd on mouse chrX, although gene synteny has been lost. The duplication events occurred in the primate lineage. In contrast to the other two, ETDC does not have transcriptional support at the present time and on that basis is potentially pseudogenic. 
yes, potential human pseudogene MDKELPKASPSEPALNIKKSGKSFKCKKPTKNVQVFLINRQLGRNRSDTDLSKWLWMLP XP/XM match CSNKA2IP 3 88465358 88467562 1 585 1000 gene added absent ENSG00000283434 ENST00000637986 734 1 mammals known coding gene ENSMUSG00000068167, previously incomplete CDS / all sources support testis expression all sources support testis expression coding XP_011532645 neXtProt Ortholog of mouse casein kinase 2, alpha prime interacting protein (Csnka2ip), which has itself now been extended from 276 to 720aa by a 5' extension based on comparative annotation. no MVPLAYYGQHFVPLDYFYQLSSANTLTHQHTGEKLNQFNNQPMAKVQSHSNHFAVPPLGSNKKVQRCSVLPSPKSQDKISQSFCDRFLNSPLFHAKHQNTPSIGLHWRSSLWPAQRALNSHLLHSKAQTTSSSDLNMTSSLELNQAALSLQLPFCKPQTTSSSLDVCWRSLSLKSHQRVSSSSLFRLQNQEIPSINIIWTSSSLGPKRKALSSTLLQSKPQKTSSLDYLWTSSLQRNQRSLSSPSLNTKLQTSDLFWTSPSFKPNQIALTSPLLDSRLQKTPILNSNPTIGGLPVSHSKARQSASSYFVHPSENLPLFQLNSQSMFMLDCNFQTTNSPVCHSKFQNTTSPNGKHRVTHLPSPHPKTNISGQLLSSSKHCTRNTAASTLGFRLQSKSSFQFSPKTESNKEIPWTLKYSQPCIVKGGTVPDDVVNKIVNSISNTRIQRDLCRQILFRRMRGRPNPHPGPRLSSNYVVCLACASCLKSPCNHLRGKKNPHCATLSVIPTPEANSEGKIEVKLVLILSLPETFSSCLPFPMKENQPNEVPEDNLEGVEKIQQFFPTSERDIQGLNMKQIWWAVAPENKVIGQQPQAIDWLFYVKKNNSQPQSLLPSTSSSTSSSSTTSSSSSVASASSDSSSSSSSSSSFSISSSSSPSKEFMTLTLSRPVFRKVLSYHRLPAGVSWLEFIYSKDYQLHPRKPNRSQSSSLKTKPVRNNNTVKWRKGANTLFKFFRTK XP/XM match MARCOL 5 148238598 148243254 1 1343 unannotated transcript extended lincRNA ENSG00000248109 ENST00000638089 285 2 mammals previously undescribed coding gene ENSMUSG00000110628 / CAGE supports fingernail expression alongside cancer cells and cell lines; no HPA; Intropolis is dominated by cell line and cancer experiments CAGE supports forelimb embryo expression; no ENCODE RNAseq coding XP_011536019.1 neXtProt A paralog of macrophage receptor with collagenous structure (MARCO). 
no MRAFIFFLFMLLAMFSASSTQISNTSVFKLEENPKPALILEEKNEANHLGGQRDSNKQGGSYTQGNPGTFRLQGQPGYFNKLEKPRHFKQGRAGVLNQPGILKNSGKSNQKGNPESSNKQENSGSSSQLGRPGISTQQGNPGSSDQQEKPGSFSQKVMVGSSSQQGKPGSSSQHGNLGSSTQKGNLGSSSLQGHLGLSSHQGKPESSGQQGKPGSSSQQGNLGTSGQQEKPGSSSQQGKPGLSSHQGKPGSSSQQGNLHLSSQQGNQGPSSKQRKPGSSSRQGNL XP/XM match AL354761.1 9 135614240 135618952 1 1311 lincRNA transcript extended antisense lncRNA ENSG00000236543 ENST00000430816 181 6 mammals known coding gene ENSMUSG00000085484 / no CAGE support; HPA supports skin expression; Intropolis supports skin expression (e.g. SRR836000) No CAGE; no ENCODE RNAseq coding XP_011517581.1 (LOC102723971) Lipocalin family member, found within a cluster of these loci. no MALEKGPLLLLALGLGLAGAQKALEEVPVQPGFNAQKVEGRWLTLQLAANHADLVSPADPLRLALHSIRTRDGGDVDFVLFWKGEGVCKETNITVHPTQLQGQYQGSFEGGSMHVCFVSTDYSNLILYVRFEDDEITNLWVLLARRMLEDPKWLGRYLEYVEKFHLQKAPVFNIDGPCPPP XP/XM match C13orf46 13 113973997 113956773 -1 257 1000 gene added absent ENSG00000283199 ENST00000636427 212 7 mammals known coding gene ENSMUSG00000031452 / all sources support testis expression all sources support testis expression coding XP_011533161 (LOC100507747) neXtProt Uncharacterised protein. Rodent genes splice out before the STOP conserved in other mammals, adding another 2 putative coding exons. All but one exon was missing in a sequence gap in GRCh37. 
no MEKDTGTTHRRHRPGLRALPSGVALGHLKAASEASELQRSRSLGGLQPEGDPPSRPRKPHKELESEDQGKDPSSNAEDASCQKNLAQDKKESFSTLGKLGHESGKQDPEREKSDLEASMQEVQEGEHADGGLQEAKEQEAESIKLNDLQEEEKASVFVEIDLGDHAEEVVTDAKKEEKPSQMDVEDLSEDEMQTSWVCCIPYSTRKRAKEST XP/XM match CFAP97D2 13 114179331 114222507 1 673 1000 gene added absent ENSG00000283361 ENST00000635901 99 4 xenopus known coding gene ENSMUSG00000090336 xenopus AUGUSTUS g3716.t1 CAGE supports brain expression; HPA supports lung and brain expression; Intropolis finds cancer expression with lung as the top scoring normal tissue (ERR030879) CAGE supports tracheal epithelial cells, microglia and lung expression; ENCODE RNAseq finds lung and brain expression coding XP_006720062.1 (C17orf105-like) no KIAA1430-domain containing CDS, which is a domain of unknown function. The avian and amphibian loci have a 5 exon CDS structure in common, although this final exon has apparently been lost in mammals. Furthermore, exon 4 has a frameshift in all primates, which could suggest pseudogenisation. However, there is substantial transcriptional support for the skipping of this exon, potentially making a CDS with a different C-terminus as listed here. yes, potential human pseudogene MHGAPRLTFPCASEYLWHAREKAYQDHRRKVQSAQPLVDTRAPLTFRHLHLKLKRLKLEEERLSVIERDNRLLLEKVASVMRTRGQTDSKNNSKHRSRK XP/XM match PERCC1 16 1432594 1433397 1 320 1000 gene added absent ENSG00000284395 ENST00000640283 267 1 coelacanth; potential ortholog in fish without synteny, e.g. 
zebrafish OTTDARG00000044228 previously undescribed coding gene ENSMUSG00000114245 coelancanth ENSLACT00000014522 (3' truncated) weak CAGE support for cancer expression; HPA supports stomach, small intestine and prostate expression; Intropolis supports cancer expression CAGE supports weak tracheal epithelial cell expression; ENCODE RNAseq supports weak lung and stomach expression coding XP_011521082 (LOC105371045) Recently experimentally characterised as a protein-coding gene critical for intestinal function in mice (PMID:31217582). no MAAGVIRPLCDFQLPLLRHHPFLPSDPEPPETSEEEEEEEEEEEEEEGEGEGLGGCGRILPSSGRAEATEEAAPEGPGSPETPLQLLRFSELISDDIRRYFGRKDKGQDPDACDVYADSRPPRSTARELYYADLVRLARGGSLEDEDTPEPRVPQGQVCRPGLSGDRAQPLGPLAELFDYGLQQYWGSRAAAGWSLTLERKYGHITPMAQRKLPPSFWKEPTPSPLGLLHPGTPDFSDLLASWSTEACPELPGRGTPALEGARPAEA XP/XM match C17orf113 17 42043376 42039705 -1 246 1000 existing transcript sense intronic lncRNA ENSG00000267221 ENST00000587304 675 2 coelacanth novel unitary pseudogene ENSMUSG00000109112 coelacanth XP_006006434.1 CAGE supports weak expression, especially in brain, cancer cells and cell lines; HPA supports weak general expression; Intropolis top scoring experiments are leukaemia (SRR948491) and CD34+ cells (SRR534325), and an examination of BluePrint RNAseq read coverage indicates general expression in blood cells no evidence for expression coding XP_011522778.1 Uncharacterised protein. Found within an intron of a ZNF gene, although no evidence that the two loci are contranscribed. 
no MVPPGKKPAGEASNSNKKCKRYFNEHWKEEFTWLDFDYERKLMFCLECRQALVRNKHGKAENAFTVGTDNFQRHALLRHVTSGAHRQALAVNQGQPPFEGQAEGGGACPGLATTPASRGVKVELDPAKVAVLTTVYCMAKEDVPNDRCSALLELQRFNLCQALLGTEHGDYYSPRRVRDMQVAIASVLHTEACQRLKASPYVGLVLDETRDWPESHSLALFATSVSPCDGQPATTFLGSVELQEGEATAGQLLDILQAFGVSAPKLAWLSSSLPSERLGSVGPQLRATCPLLAELHCLPGRTDPEPPAYLGQYESILDALFRLHGGPSSHLVPELRAALDLAAIDLAGPRPVPWASLLPVVEAVAEAWPGLVPTLEAAALASPVAGSLALALRQFTFVAFTHLLLDALPSVQKLSLVLQAEEPDLALLQPLVMAAAASLQAQRGSGGARLQGFLQELASMDPDASSGRCTYRGVELLGYSEAAVRGLEWLRGSFLDSMRKGLQDSYPGPSLDAVAAFAAIFDPRRYPQAPEELGTHGEGALRVLLRGFAPAVVRQRALGDFALFKRVVFGLGRLGPRALCTQLACAHSELHELFPDFAALAALALALPAGAGLLDKVGRSRELRWWGQSGAGEGRGGHMVKIAVDGPPLHEFDFGLAVEFLESGWGEGFLGSQLT XP/XM match HSFX4 X 149929645 149931105 1 1263 lincRNA transcript extended antisense lncRNA ENSG00000283463 ENST00000457775 333 2 Mammals have a single copy of this gene pair novel pseudogene ENSMUSG00000114356 is a presumptive ortholog to either ENSG00000283463 or ENSG00000283697 Rat XP_017443635.1 is a presumptive ortholog to one duplicant, based on sequence homology as opposed to synteny all sources support testis expression no evidence for expression coding XP_005274830.1 (LOC101927685) neXtProt Heat-shock factor family member. A duplicant of ENSG00000283697 elsewhere on this spreadsheet, ~380kb distant. The translations differ in a single amino acid, and both have PhyloCSF signals despite the fact that the duplication apparently occurred in the human lineage. 
no MASQNTEQEYEAKLAPSVGGEPTSGGPSGSSPDPNPDSSEVLDRHEDQAMSQDPGSQDNSPPEDRNQRVVNVEDNHNLFRLSFPRKLWTIVEEDTFKSVSWNDDGDAVIIDKDLFQREVLQRKGAERIFKTDNLTSFIRQLNLYGFCKTRPSNSPGNKKMMIYCNSNFQRDKPRLLENIQRKDALRNTAQQATRVPTPKRKNLVATRRSLRIYHINARKEAIKMCQQGAPSVQGPSGTQSFRRSGMWSKKSATRHPLGNGPPQEPNGPSWEGTSGNVTFTSSATTWMEGTGILSSLVYSDNGSVMSLYNICYYALLASLSVMSPNEPSDDEEE XP/XM match HSFX3 X 149549852 149548392 -1 44 1000 existing transcript antisense lncRNA ENSG00000283697 ENST00000431993 333 2 Mammals have a single copy of this gene pair novel pseudogene ENSMUSG00000114356 is a presumptive ortholog to either ENSG00000283463 or ENSG00000283698 Rat XP_017443635.1 is a presumptive ortholog to one duplicant, based on sequence homology as opposed to synteny all sources support testis expression no evidence for expression coding XP_005262408.1 neXtProt Heat-shock factor family member. A recent duplicant of ENSG00000283463 elsewhere on this spreadsheet. no MASQNTEQEYEAKLAPSVGGEPTSGGPSGSSPDPNPDSSEVLDRHEDQAMSQDPGSQDNSPPEDRNQRVVNVEDNHNLFRLSFPRKLWTIVEEDTFKSVSWNDDGDAVIIDKDLFQREVLQRKGAERIFKTDSLTSFIRQLNLYGFCKTRPSNSPGNKKMMIYCNSNFQRDKPRLLENIQRKDALRNTAQQATRVPTPKRKNLVATRRSLRIYHINARKEAIKMCQQGAPSVQGPSGTQSFRRSGMWSKKSATRHPLGNGPPQEPNGPSWEGTSGNVTFTSSATTWMEGTGILSSLVYSDNGSVMSLYNICYYALLASLSVMSPNEPSDDEEE XP/XM match UPK3BL2 7 102543636 102538417 -1 no signal re-ranking existing transcript 3' UTR of POLR2J3 ENSG00000284981 ENST00000644544 263 6 potential human-specific duplication absent / all sources support general expression, although perhaps confounded by paralogy issues / coding XP_016868385.1 Uncharacterised protein. Identified within the 3' end of POLR2J3 after additional coding exons were found by PhyloCSF. It can now be seem that these are separate loci, previously merged into one due to the misinterpretation of readthrough cDNA evidence. This is a tandem duplicant of the adjacent POLR2J2 / UPK3BL locus, i.e. the novel CDS is a paralog of UPK3BL which is itself a paralog of UPK3B elsewhere on chromosome 7. 
In non-primate mammals, UPK3B and UPK3BL are adjacent to one another. This rearrangement thus seems to have occurred in the primate lineage, towards the base of the gibbon / ape clade. However, the local duplication between this novel CDS and UPK3BL is human-specific, and the PhyloCSF signal is likely based on genomic misalignments. no MDNSWRLGPAIGLSAGQSQLLVSLLLLLTRVQPGTDVAAPEHISYVPQLSNDTLAGRLTLSTFTLEQPLGQFSSHNISDLDTIWLVVALSNATQSFTAPRTNQDIPAPANFSQRGYYLTLRANRVLYQTRGQLHVLRVGNDTHCQPTKIGCNHPLPGPGPYRVKFLVMNDEGPVAETKWSSDTRLQQAQALRAVPGPQSPGTVVIIAILSILLAVLLTVLLAVLIYTCFNSCRSTSLSGPEEAGSVRRYTTHLAFSTPAEGAS NP/NM match TEX50 1 173635522 173637036 1 1550 lincRNA transcript extended antisense lncRNA ENSG00000232113 ENST00000417563 177 2 mammals previously undescribed coding gene ENSMUSG00000049160 / all sources support testis expression all sources support testis expression coding NM_001195190.1 Kim et al; neXtProt Uncharacterised protein. no MSNQRLPLIFSLLFICFFGESFCICDGTVWTKVGWEILPEEVHYWKVKGSPSHCLPYLLDKLCCDFANMDIFQGCLYLIYNLLQAVFFVLFVLSVHYLWKKWKKHQKKLKKQASLEKPGNDLESPLINNIDQTLHRVATTASVIYKIWEHRSHHPSSKKIKHCKLKKKSKEEGARRY NP/NM match FAM237A 2 206644237 206648794 1 527 1000 existing transcript lincRNA ENSG00000235118 ENST00000441223 181 2 avians previously undescribed coding gene ENSMUSG00000115378 chicken ENSGALT00000033404 CAGE supports weak brain expression; HPA supports weak brain and heart expression; Intropolis supports weak cancer expression no CAGE; ENCODE RNAseq supports weak brain expression coding NM_001102659.1 (LOC200726) Uncharacterised protein. It has clear homology to FAM237B, also on this list. The chicken ortholog was recently found to encode 'a novel small secretory protein, neurosecretory protein GL, in the chicken hypothalamic infundibulum' [PMID:24582750]. 
no MADPGNRGGIHRPLSFTCSLLIVGMCCVSPFFCHSQTDLLALSQADPQCWESSSVLLLEMWKPRVSNTVSGFWDFMIYLKSSENLKHGALFWDLAQLFWDIYVDCVLSRNHGLGRRQLVGEEEKISAAQPQHTRSKQGTYSQLLRTSFLKKKELIEDLISMHVRRSGSSVIGKVNLEIKRK NP/NM match GIMD1 4 106367435 106358183 -1 338 1000 transcript extended lincRNA ENSG00000250298 ENST00000507153 217 2 coelacanth known coding gene ENSMUSG00000091721 coelacanth ENSLACT00000012949 all sources support gastrointestinal expression, including gall bladder all sources support gastrointestinal expression, with ENCODE RNAseq also supporting lung and liver expression coding NM_001195138.1 neXtProt Already known as GIMAP family P-loop NTPase domain containing 1 (GIMD1) no MTDPNKMIINLALFGMTQSGKSSAGNILLGSTDFHSSFAPCSVTTCCSLGRSCHLHSFMRRGGLEVALQVQVLDTPGYPHSRLSKKYVKQEVKEALAHHFGQGGLHLALLVQRADVPFCGQEVTDPVQMIQELLGHAWMNYTAILFTHAEKIEEAGLTEDKYLHEASDTLKTLLNSIQHKYVFQYKKGKSLNEQRMKILERIMEFIKENCYQVLTFK NP/NM match UMAD1 7 7673372 7877538 1 1087 lincRNA existing transcript antisense lncRNA ENSG00000219545 ENST00000636849 137 3 mammals known coding gene ENSMUSG00000089862 / all sources support general expression all sources support general expression coding NM_001302350.1 neXtProt Already known as UBAP1-MVB12-associated (UMA) domain containing 1 (UMAD1) no MFHFFRKPPESKKPSVPETEADGFVLLGDTTDEQRMTARGKTSDIEANQPLETNKENSSSVTVSDPEMENKAGQTLENSSLMAELLSDVPFTLAPHVLAVQGTITDLPDHLLSYDGSENLSRFWYDFTLENSVLCDS NP/NM match AC025594.2 7 128866410 128868775 1 206 1000 transcript extended lincRNA ENSG00000272899 ENST00000609480 176 2 coelacanth known coding gene ENSMUSG00000090685 coelacanth ENSLACT00000003968 all sources support testis expression all sources support testis expression coding NM_001195150.1 neXtProt Uncharacterised protein no MSRQLNIDALRQNFWKEEYLREKMLRCEWYRKYGSMVKAKQKAKAAARLPLKLPTLHPKAPLSPPPAPKSAPSKVPSPVPEAPFQSEMYPVPPITRALLYEGISHDFQGRYRYLNTRKLDMPETRYLFPITTSFTYGWQLGPPVKQELVSCKMCRIESFFRKNGAFALLDPRDLAL NP/NM match NPY4R2 10 47921988 47923115 1 121 1000 existing transcript lincRNA ENSG00000264717 
ENST00000576178 375 1 potential ape-specific duplication absent / weak CAGE; HPA supports gastrointestinal expression, as well as lung, skin, salivary gland; Intropolis provides no support / coding NM_001278795.1 An identical duplicant of NPY4R, found several Mb downstream. The duplication is ape-specific, so the PhyloCSF signal may be based on incorrect genome alignments. no MNTSHLLALLLPKSPQGENRSKPLGTPYNFSEHCQDSVDVMVFIVTSYSIETVVGVLGNLCLMCVTVRQKEKANVTNLLIANLAFSDFLMCLLCQPLTAVYTIMDYWIFGETLCKMSAFIQCMSVTVSILSLVLVALERHQLIINPTGWKPSISQAYLGIVLIWVIACVLSLPFLANSILENVFHKNHSKALEFLADKVVCTESWPLAHHRTIYTTFLLLFQYCLPLGFILVCYARIYRRLQRQGRVFHKGTYSLRAGHMKQVNVVLVVMVVAFAVLWLPLHVFNSLEDWHHEAIPICHGNLIFLVCHLLAMASTCVNPFIYGFLNTNFKKEIKALVLTCQQSAPLEESEHLPLSTVHTEVSKGSLRLSGRSNPI NP/NM match C11orf97 11 94512529 94531900 1 158 1000 existing transcript lincRNA ENSG00000257057 ENST00000542198 126 4 mammals known coding gene ENSMUSG00000031927 / CAGE finds lung, trachea, testis and brain tissues; HPA finds testis, kidney and endometrium; Intropolis finds lung as top non-cancer experiment (e.g. SRR192333) CAGE supports testis expression; ENCODE RNAseq supports testis expression alongside weaker general expression coding NM_001190462.1 neXtProt Already known as chromosome 11 open reading frame 97 (C11orf97). This CDS had previously been removed from GENCODE. no MTGEEAVVVTAVVAPKAGREEEQPPPPAGLGCGARGEPGRGPLEHGQQWKKFLYCEPHKRIKEVLEEERHIKRDECHIKNPAAVALEGIWSIKRNLPVGGLKPGLPSRNSLLPQAKYYSRHGGLRR NP/NM match ZNF888 19 52917873 52906165 -1 12307 lincRNA spliceform added lincRNA ENSG00000213793 ENST00000638862 718 3 potential ape-specific duplication absent / all sources support general expression / coding NM_001310127.1 neXtProt Already known as zinc finger protein 888 (ZNF888). 
no MALPQGLLTFRDVAIEFSQEEWKCLDPAQRTLYRDVMLENYRNLVSLDISSKCMMEFSSIGKGNTEVIHTGTLQRLASHHIGECCFQEIEKDIHDFVFQWQEDETNGHEAPMTEIKELTGSTDQYDQRHAGNKPIKYQLGSSFHSHLPELHIFQPEGKIGNQLEKSINNASSVSTSQRISCRPKTHISNNYGNNFFHSSLLTQKQDVHRKEKSFQFNESGKSFNCSSLFKKHQIIHLGEKQYKCDVCGKDFNQKRYLAHHRRCHTGEKPYMCNKCGKVFNKKAYLARHYRRHTGEKPYKCNECGKTFSDKSALLVHKTIHTGEKPYKCNECGKVFNQQSNLARHHRVHTGEKPYQCKECDKVFSRKSYLERHRRIHTGEKPYKCKVCDKAFRHDSHLAQHIVIHTREKPYKCNECGKTFGENSALLVHKTIHTGEKPYKCNECGKVFNQQSNLARHHRLHTGEKPYKCKECDKVFSRKSHLERHRRIHTGEKPYKCKVCDKAFRRDSHLAQHTVIHTGEKPYKCNECGKTFVQNSSLVMHKVIHTGEKRYKCNECGKSFNHKSSLAYHHRLHTGEKPYKCNECGKVFRTQSQLACHHRLHTGEKPYKCEECDKVFNIKSHLEIHRRVHTGEKPYKCRVCDKAFGRDSYLAQHQRVHTGEKPYKCKVCDKAFKCYSHLAQHTRIHTGEKPFKCSECGKAFRAQSTLIHHQAIHGVGKLD NP/NM match TMEM191C 22 21467319 21469651 1 1328 lincRNA existing transcript lincRNA ENSG00000206140 ENST00000432134 284 10 mammals known coding gene ENSMUSG00000055692 / no CAGE; HPA supports testis expression alongside weaker general expression; Intropolis finds weak expression; BluePrint and CAGE suggest a 5' truncated model is transcribed in blood cells generally all sources support testis expression; ENCODE RNAseq also finds weaker general expression coding NM_001207052.1 (TMEM191C) neXtProt Already known as ransmembrane protein 191C (TMEM191C) no MCRATLGLPLPPIVIQPARRSLPPIVTPASRRLGPRGGRHLGSVSTAMAATQELLLQLQKDNRDGRQRKQELEKLMRGLEAESESLNQRLQDLSERERSLLRRRSQAAQPLQGEAREAARERAERVRRRLEEAERHKEYLEQHSRQLQEQWEELSSQLFYYGGELQSQKSTEQQLAAQLVTLQNELELAETKCALQEEKLQQDALQTAEAWAIFQEQTVVLQEVQVKVMEAAEELDAWQSGRELCDGQLRGVQYSTESLMEEMARADRETRLFGGPRALAIRCC NP/NM match DPEP2NB 16 68015770 68014108 -1 1080 lincRNA existing transcript antisense lncRNA ENSG00000263201 ENST00000574912 123 2 mammals previously undescribed coding gene ENSMUSG00000084782 / weak CAGE supports testis and placenta expression; HPA supports testis and placenta expression, with weaker expression in bone marrow; Intropolis finds additional placetal experiments (e.g. 
SRR638937) all sources support testis expression coding NM_001282442.1 (LOC100131303) neXtProt Uncharacterised protein. no MTDRILYIVSNMSSVPWEGSAAAAVPATSPPTPGHYHVLYRGCGETQVGWHGETYCLVGGYRVHGDAPLATPTKAEAEKPAPRRAPKRRQATIESDKDLGCSSPKIRRLEHRGRRLTPQKLAG NP/NM match SPACA6 19 51693527 51705123 1 36 1000 existing transcript lincRNA ENSG00000182310 ENST00000637797 324 9 mammals known coding gene ENSMUSG00000080316 / all sources support general expression all sources support general expression coding NM_001316972.1 neXtProt Already known as sperm acrosome associated 6 (SPACA6) no MALLALASAVPSALLALAVFRVPAWACLLCFTTYSERLRICQMFVGMRSPKLEECEEAFTAAFQGLSDTEINYDERSHLHDTFTQMTHALQELAAAQGSFEVAFPDAAEKMKKVITQLKEAQACIPPCGLQEFARRFLCSGCYSRVCDLPLDCPVQDVTVTRGDQAMFSCIVNFQLPKEEITYSWKFAGGGLRTQDLSYFRDMPRAEGYLARIRPAQLTHRGTFSCVIKQDQRPLARLYFFLNVTGPPPRAETELQASFREVLRWAPRDAELIEPWRPSLGELLARPEALTPSNLFLLAVLGALASASATVLAWMFFRWYCSGN NP/NM match C14orf132 14 96039501 96086735 1 787 1000 existing transcript lincRNA ENSG00000227051 ENST00000555004 83 2 vertebrates known coding gene ENSMUSG00000094910 zebrafish OTTDARG00000044232 all sources support general expression CAGE is dominated by brain experiments though also supports weak general expression; ENCODE RNAseq supports general expression, highest in brain / CNS coding NM_001252507.2 no Already known as chromosome 14 open reading frame 132 (C14orf132). The locus was previously annotated as protein-coding in GENCODEv19 with a spurious CDS. no MDLSFMAAQLPMMGGAFMDSPNEDFSTEYSLFNSSANVHAAANGQGQPEDPPRSSNDAVLLWIAIIATLGNIVVVGVVYAFTF NP/NM match RAB34 17 28717708 28716053 -1 248 1000 existing transcript 5' UTR of RAB34 ENSG00000109113 ENST00000636154 198 3 mammals known coding gene ENSMUSG00000002059 / all sources support general expression all sources support general expression coding NM_001256281 cotranscribed with RAB34 neXtProt Transcribed from an alternative TSS within the RAB34 locus, having a substantial overlap with the RAB34 CDS though in an alternative frame. 
Identified and confirmed as functionally distinct in 2011, where the protein was unofficially named NARR (PMID:21586586). no MVGQPQPRDDVGSPRPRVIVGTIRPRVIVGTIRPRVIVGSARARPPPDGTPRPQLAAEESPRPRVIFGTPRARVILGSPRPRVIVSSPWPAVVVASPRPRTPVGSPWPRVVVGTPRPRVIVGSPRARVADADPASAPSQGALQGRRQDEHSGTRAEGSRPGGAAPVPEEGGRFARAQRLPPPRHLRLPGAPDRHRGQI NP/NM match TMEM225B 7 99604389 99610565 1 1102 lincRNA existing transcript readthrough transcript from GS1-259H13.1 with NMD annotated in alternative reading frame ENSG00000244219 ENST00000431679 221 4 mammals absent dog ENSCAFT00000050123.2 all sources support testis expression / coding NM_001195541 neXtProt Already known as transmembrane protein 225B (TMEM225B). no MLTLEDKDMKGFSWAIVPALTSLGYLIILVVSIFPFWVRLTNEESHEVFFSGLFENCFNAKCWKPRPLSIYIILGRVFLLSAVFLAFVTTFIMMPFASEFFPRTWKQNFVLACISFFTGACAFLALVLHALEIKALRMKLGPLQFSVLWPYYVLGFGIFLFIVAGTICLIQEMVCPCWHLLSTSQSMEEDHGSLYLDNLESLGGEPSSVQKETQVTAETVI UniProt match SMIM30 7 113117578 113117399 -1 726 1000 existing transcript lincRNA ENSG00000214194 ENST00000397764 59 1 xenopus known coding gene ENSMUSG00000052419 xenopus OCT87389 all sources support general expression all sources support general expression ncRNA NR_024412.1 (LINC00998) no Putative transmembrane protein according to UniProt. The CDS was present in GENCODEv19 although subsequently removed. There are at least 8 duplications of this sequence in human; each of the others are likely pseudogenes. no MTSVSTQLSLVLMSLLLVLPVVEAVEAGDAIALLLGVVLSITGICACLGVYARKRNGQM UniProt match TEX52 12 2857026 2849231 -1 6 1000 gene added absent; protein-coding in v19 ENSG00000283297 ENST00000637658 305 3 mammals known coding gene ENSMUSG00000079304 / all sources support testis expression all sources support testis expression coding XP_005253874.2 Uncharacterised protein. Had been made coding in v19 but subsequently removed. 
no MASNRQRSLRGPSHPSHMEEPFLQMVQASESLPPSQTWAQREFFLPSESWEFPGFTRQAYHQLALKLPPCTDMKSKVRQRLIHPWKGGAQHTWGFHTWLDVCRLPATFPTQPDRPYDSNVWRWLTDSNAHRCPPTEHPIPPPSWMGQNSFLTFIHCYPTFVDMKRKKQVIFRTVKELKEVEKLKLRSEARAPPLDAQGNIQPPASFKKYRHISAGGRFEPQGLQLMPNPFPNNFARSWPCPNPLPHYQEKVLKLALLPSAPLSQDLIRDFQTLIKDRTALPLHHLSKAQASKSPARKRKRRPGHF UniProt match SMIM34B 21 7793827 7789206 -1 526 1000 existing transcript lincRNA ENSG00000278961 ENST00000624951 139 2 potential human specific duplication absent / no expression data; locus was absent on GRC37 / absent Uncharacterised protein. It is a non-local duplicant of SMIM34A on chr21, within the large duplication on the p-arm of the chromosome. The duplication event is human specific, so the PhyloCSF signal is presumably based on paralogous alignments. SMIM34A has highly specific kidney expression. no MEWAKWTPHEASNQTQASTLLGLLLGDHTEGRNDTNSTRALKVPDGTSAAWYILTIIGIYAVIFVFRLASNILRKNDKSLEDVYYSNLTSELKMTGLQGKVAKCSTLSISNRAVLQPCQAHLGAKGGSSGPQTATPETP UniProt match ETDA X 135253277 135253456 1 41242 lincRNA existing transcript lincRNA ENSG00000238210 ENST00000427686 59 1 potential primate-specific duplication absent / all sources support testis expression / ncRNA XR_938601.1 (LOC101928677) no ETDA, ETDB and ETDC are a cluster of three 59aa single exon CDS on chrX listed here, 2 of which are identical. ENSG00000229015 is an additional pseudogene, identified as part of this work based on regional analysis. The loci have clear homology to mouse Etd on mouse chrX, although gene synteny has been lost. The duplication events occurred in the primate lineage. no MDKEVPKGSPREPALNIKKSDKSFKRKKPTENVLIFLINRQLGRHRSDIDLSRWVWMLS UniProt match ETDB X 135119261 135119082 -1 27432 lincRNA existing transcript lincRNA ENSG00000224107 ENST00000423661 59 1 potential primate-specific duplication absent / all sources support testis expression / ncRNA NR_033941.1 (LINC00633) no ETDA, ETDB and ETDC are a cluster of three 59aa single exon CDS on chrX listed here, 2 of which are identical. 
ENSG00000229015 is an additional pseudogene, identified as part of this work based on regional analysis. The loci have clear homology to mouse Etd on mouse chrX, although gene synteny has been lost. The duplication events occurred in the primate lineage. no MDKEVPKGSPREPALNIKKSDKSFKRKKPTENVLIFLINRQLGRHRSDIDLSRWVWMLS Mackowiak match CTXND1 15 80201949 80201770 -1 617 1000 existing transcript lincRNA ENSG00000259417 ENST00000560778 59 1 vertebrates previously undescribed coding gene ENSMUSG00000097789 zebrafish OTTDARG00000044230 CAGE is dominated by brain experiments; HPA supports expression in liver, kidney, brain, spleen and heart; Intropolis shows no obvious expression pattern, with the top ranking normal experiment from fetal intestine (SRR643742) CAGE is dominated by various developmental cells, especially mesenchymal; ENCODE RNAseq supports general expression, highest in adrenal gland ncRNA NR_120317.1 (LINC01314) no Homology to cortexin family members. no MEEPTPEPVYVDVDKGLTLACFVFLCLFLVVMIIRCAKVIMDPYSAIPTSTWEEQHLDD Mackowiak match BRD3OS 9 134026748 134027002 1 592 1000 existing transcript antisense lncRNA ENSG00000235106 ENST00000603928 84 1 vertebrates previously undescribed coding gene ENSMUSG00000109946 zebrafish OTTDARG00000043595 all sources support general expression all sources support general expression ncRNA NR_015427.2 (LINC00094) neXtProt no Uncharacterised protein. Previously reported as an ultraconserved lncRNA by Ulitsky et al [PMID:22196729], identified in human, mouse and zebrafish. 
no MSGRVPLAEKALSEGYARLRYRDTSLLIWQQQQQKLESVPPGTYLSRSRSMWYSQYGNEAILVRDKNKLEVSRDTGQSKFCTIM Mackowiak match TUNAR 14 95922820 95922966 1 392 1000 existing transcript lincRNA ENSG00000250366 ENST00000503525 48 2 vertebrates previously undescribed coding gene ENSMUSG00000097929 zebrafish OTTDARG00000044226 all sources support brain expression all sources support brain expression ncRNA NR_038861.1 (TUNAR) no This gene had previously been published as functional lncRNA Tunar in mouse, and it may play a role in maintaining pluripotency and the neural differentiation of embryonic stem cells [PMID:24530304]. The human ortholog has since been named TUNAR. It was originally reported as a conserved lncRNA by the Ulitsky group in zebrafish, mouse and human [PMID:22196729]. In fact, the experimental work of the Ulitsky group focused on a paralogous locus OTTDARG00000038681 (Megamind), which is also conserved in human (BIRC6-AS2; within the intron of BIRC6), though apparently lost in rodents / lagomorphs. This locus lacks PhyloCSF support and so was not found in the present analysis. Ulitsky noted the presence of a plausible CDS in Megamind - which we can see is highly similar to that now annotated in zebrafish Tunar - although insertional mutagenesis led them to rule out its functionality. Furthermore, the putative Megamind CDS is disrupted in human, being ~50% truncated at the 5' end compared to TUNAR. Given these observations, we believe the most likely scenario is that Megamind / BIRC6-AS2 is an ancient duplication of TUNAR, and that it evolved from an earlier pseudogenic state into a potentially functional lncRNA in zebrafish. In human, given the absence of functional support for the transcript and the severe CDS truncation, BIRC6-AS2 has been converted into a pseudogene of TUNAR in GENCODE (ENSG00000279897). Nonetheless, the human locus is transcribed, and could thus also turn out to be a functional lncRNA. 
no MVITSENDEDRGGQEKESKEESVLAMLGIIGTILNLIVIIFVYIYTTL Mackowiak match TMEM238L 17 10803963 10803724 -1 593 1000 existing transcript lincRNA ENSG00000263429 ENST00000581851 79 1 mammals; potential ortholog in fish without synteny, e.g. zebrafish AUGUSTUS g16563.t1 previously undescribed coding gene ENSMUSG00000085683 / CAGE supports gastrointestinal expression; HPA supports gastrointestinal expression alongside weaker expression in certain other organs; the highest scoring Intropolis study is on skin (SRR835999), although gastrointestinal expression is also supported (e.g. SRR364829) CAGE supports gastrointestinal expression; ENCODE RNAseq supports gastrointestinal expression alongside weaker expression in other organs ncRNA NR_036581.1 (LINC00675) no Putative transmembrane protein. The transcript has a second non-coding exon that may suggest it is a nonsense mediated decay target, although this transcriptional scenario is apparently conserved at least as far as chicken. no MLLGSLWGRCHPGRCALFLILALLLDAVGLVLLLLGILAPLSSWDFFIYTGALILALSLLLWIIWYSLNIEVSPEKLDL Mackowiak match TINCR 19 5567924 5562208 -1 2076 lincRNA existing transcript lincRNA ENSG00000223573 ENST00000448587 87 2 avians previously undescribed coding gene ENSMUSG00000110218 alligator KYO42567.1 CAGE supports expression in placenta and amniotic membrane, with lower scores for esophagus, tongue, skin; Intropolis has a top rank in skin (SRR836000), although there is appreciable data from numerous other tissues CAGE supports expression in skin, tongue, tracheal epithelial cells, vagina, eye; ENCODE RNAseq supports general expression, highest in bladder ncRNA NR_027064.2 (TINCR) no This gene had previously been functionally characterised as lncRNA TINCR in human, having an apparent role in the control of epidermal differentiation [PMID:23201690]. There are other transcripts in the locus without an obvious CDS, and it could theoretically be a dual function protein-coding gene / lncRNA. 
no MEGLRRGLSRWKRYHIKVHLADEALLLPLTVRPRDTLSDLRAQLVGQGVSSWKRAFYYNARRLDDHQTVRDARLQDGSVLLLVSDPR Mackowiak match AC010325.1 19 50786104 50785812 -1 2506 lincRNA existing transcript lincRNA ENSG00000261341 ENST00000562076 28 3 mammals previously undescribed coding gene ENSMUSG00000087376 / all sources support testis expression all sources support testis expression ncRNA NR_134883.1 (LOC105372440) no Uncharacterised protein. no MNFQENVTLAMALFTILTSIYFFNKAQQ Mackowiak match SERTM2 X 111518861 111519130 1 452 1000 existing transcript lincRNA ENSG00000260802 ENST00000569275 89 1 xenopus previously undescribed coding gene ENSMUSG00000085139 xenopus OCA25043.1 CAGE supports expression in epididymis, with weaker signals in hepatocytes, seminal vesicles, cervix, uterus; HPA supports high endometrium expression, with weaker expression in other tissues e.g. prostate; Intropolis is dominated by prostate cancer experiments, with mesenchymal cells the top ranked normal experiment (SRR486239) CAGE supports pituitary gland expression; ENCODE RNAseq supports weak general expression ncRNA NR_033974.1 (LINC00890) no Homology to serine-rich transmembrane proteins. no MEAHFKYHGNLTGRAHFPTLATEVDTSSDKYSNLYMYVGLFLSLLAILLILLFTMLLRLKHVISPINSDSTESVPQFTDVEMQSRIPTP Mackowiak match STRIT1 3 155293632 155290944 -1 553 1000 existing transcript lincRNA ENSG00000240045 ENST00000489090 35 2 avians previously undescribed coding gene ENSMUSG00000103476 chicken PLAR linc|3P|XLOC_114812|TCONS_00227609:186.408|90AA|35AA| CAGE supports expression in skeletal muscle, artery, spinal fluid, throat, diaphragm, tongue; HPA supports expression in esophagus and heart; Intropolis is dominated by very high expression in muscle (e.g. SRR1398546) CAGE supports weak heart expression; ENCODE RNAseq supports heart expression ncRNA NR_037902.1 no Subsequently reported as protein-coding based on experimental data, and unofficially named DWORF [PMID:26816378]. 
It enhances SERCA activity by displacing the SERCA inhibitors phospholamban, sarcolipin, and myoregulin. no MAEKAGSTFSHLLVPILLLIGWIVGCIIMIYVVFS Mackowiak match LINC00672 17 38925694 38925771 1 699 1000 existing transcript lincRNA ENSG00000263874 ENST00000583195 25 1 vertebrates previously undescribed coding gene ENSMUSG00000050538 tetraodon CAF94638.1 all sources support brain expression all sources support brain expression ncRNA NR_038847 gawron_2016:193602 / gonzalez_2014:386349 / werner_2015:621349 Uncharacterised protein. An existing CDS at the mouse locus was found to be completely spurious (CCDS36298). no MLDIFILMFFAIIGLVILSYIIYLL Mackowiak match SMIM36 17 55511334 55511053 -1 855 1000 gene added absent ENSG00000261873 ENST00000636752 93 1 vertebrates previously undescribed coding gene ENSMUSG00000110344 zebrafish ENSDART00000173549 CAGE supports pineal gland and eye expression; HPA supports testis expression; top ranked Intropolis experiment is retina (SRR060734) CAGE supports eye expression; ENCODE RNAseq supports weak brain expression ncRNA XR_243721.3 (LOC101927367); partial match Kim et al; neXtProt no Uncharacterised protein. Originally noted during an analysis of the Kim et al and Kuster et al proteomics datasets, although the experimental support was not considered strong enough to support annotation at that time. 
no MEFYLEIDPVTLNLIILVASYVILLLVFLISCVLYDCRGKDPSKEYAPEATLEAQPSIRLVVMHPSVAGPHWPKGPGLSLGDPAPLGKKSTMV Mackowiak match MYMX 6 44217472 44217726 1 864 1000 existing transcript lincRNA ENSG00000262179 ENST00000573382 84 1 avian previously undescribed coding gene ENSMUSG00000079471 alligator KYO43399 CAGE supports expression in a variety of muscle-linked experiments, especially skeletal muscle and myoblasts, but also adipocytes; HPA supports adipose expression alongside weaker general expression; Intropolis supports muscle expression, with the top ranked experiment from skeletal muscle (SRR1398564) CAGE supports embryonic stem cell and embryo expression; ENCODE RNAseq supports limb expression alongside weaker general expression uncharacterized XR_242025.3 (LOC101929726) no Subsequently reported as functional protein-coding gene MYMX / Mymx based on experimental data [PMID:28386024]. It is involved in the formation of skeletal muscle during embryogenesis. no MPTPLLPLLLRLLLSCLLLPAARLARQYLLPLLRRLARRLGSQDMREALLGCLLFILSQRHSPDAGEASRVDRLERRERLGPQK Mackowiak match RNF227 17 7916195 7915522 -1 436 1000 transcript extended antisense lncRNA ENSG00000179859 ENST00000324348 190 2 avians previously undescribed coding gene ENSMUSG00000043419 lizard XP_008121180 CAGE supports two separate TSS linked to a long and short form of the CDS, the upstream showing highest expression in macrophages and the downstream in eye, skin and brain tissues; HPA supports general expression from the downstream TSS; the top ranked Intropolis experiment is skin (SRR835999) CAGE supports two TSS as seen in human, with expression of the upstream in skin, forelimb and tongue and expression of the downstream in neurons and visual cortex; ENCODE RNAseq supports brain expression alongside weak general expression, largely from the downstream TSS ncRNA NR_024349.1 (LOC284023) Protein containing RING Ubox and DUF4632 domains. 
While the CDS from Mackowiak is linked to a strong CAGE tag, PhyloCSF extends upstream in the same first exon to a weaker CAGE to give the CDS listed here. It seems likely both proteins exist, thus both have been annotated in GENCODE human and mouse. The shorter form lacks the RING Ubox domain. no MQLLVRVPSLPERGELDCNICYRPFNLGCRAPRRLPGTARARCGHTICTACLRELAARGDGGGAAARVVRLRRVVTCPFCRAPSQLPRGGLTEMALDSDLWSRLEEKARAKCERDEAGNPAKESSDADGEAEEEGESEKGAGPRSAGWRALRRLWDRVLGPARRWRRPLPSNVLYCAEIKDIGHLTRCTL Mackowiak match SMIM39 2 131035092 131035262 1 62 1000 / ms existing transcript non-coding transcript within ARHGEF4 ENSG00000284479 ENST00000635976 56 1 vertebrates previously undescribed coding gene, annotated within ENSMUSG00000037509 (Arhgef4) as ENSMUST00000211073 in GENCODE M18; will be separated as an independent gene in a future release zebrafish OTTDARG00000044231 CAGE supports brain expression; HPA supports brain expression but also adipose, and weaker general expression; Intropolis has top ranking experiments from brain (e.g. SRR1047869) CAGE supports brain / CNS expression; ENCODE RNAseq supports brain expression alongside weaker general expression absent cotranscribed with ARHGEF4 Kim et al no Uncharacterised protein. Identified within an alternative first non-coding exon of ARHGEF4. There is no evidence that the locus is transcribed independently of the downstream ARHGEF4 exons in human, mouse or zebrafish. Mass spectrometry support for this CDS was previously observed in our reanalysis of the Kim et al and Kuster et al proteomics datasets, although statistical confidence was not high enough to support annotation at that time. 
no MARAPQPRRGPAAPGNALRALLRCNLPPGAQRVVVSAVLALLVLINVVLIFLLAFR Published AL034430.1 20 10413887 10413735 -1 296 1000 existing transcript 5' UTR of MKKS ENSG00000285508 ENST00000609375 50 1 coelacanth previously undescribed coding gene ENSMUSG00000027274 coelacanth EnsCodingFull|3P|XLOC_155214|TCONS_00200630|0:0.347849 all sources support general expression all sources support general expression no model, although recognised as a uORF in the MKKS gene description cotranscribed within MKKS fritsch_2012:48204 / gawron_2016:282740 / liu_2013:602716 / park_2016:1017024 / rubio_2014:1214914 Found within the 5' UTR of MKKS, previously reported and experimentally confirmed in 2013 [PMID:23671934]. The novel protein is found in the mitochondria, whereas MKKS is a skeletal protein. The loci apparently share a promoter, and current data indicate that MKKS is always translated from transcripts that also contain the novel CDS. no MKNTSWIRKNWLLVAGISFIGVHLGTYFLQRSAKQSVKFQSQSKQKSIEE Published AL022312.1 22 39504231 39504443 1 860 1000 existing transcript 5' UTR of MIEF1 ENSG00000285025 ENST00000637854 70 1 vertebrates previously undescribed coding gene ENSMUSG00000115798 Tetraodon CAG02533.1 all sources support general expression all sources support general expression absent cotranscribed within MIEF1 neXtProt gonzalez_2014:634134 / loayza_puch_2013:926373 / rutkowski_2015:699643 Found within the 5' UTR of MIEF1, previously reported based on ribosome profiling and phylogenetics [PMID:25621764]. no MAPWSREAVLSLYRALLRQGRQLRYTDRDFYFASIRREFRKNQKLEDAEARERQLEKGLVFLNGKLGRII Mackowiak partial match CCDC196 14 66486503 66498472 1 330 1000 existing transcript lincRNA ENSG00000196553 ENST00000636229 297 10 mammals previously undescribed coding gene ENSMUSG00000099418 / all sources support testis expression all sources support testis expression ncRNA NR_024338.2 (LINC00238); partial match neXtProt Uncharacterised protein. 
The locus was for a time recognised as protein-coding gene C14orf53 by RefSeq, prior to becoming reannotated as non-coding gene LINC00238. no MTSGANSSGSYLPSEIRSSKIDDNYLKELNEDLKLRKQELLEMLKPLEDKNNLLFQKLMSNLEEKQRSLQIMRQIMAGKGCEESSVMELLKEAEEMKQNLERKNKMLRKEMEMLWNKTFEAEELSDQQKAPQTKNKADLQDGKAPKSPSSPRKTESELEKSFAEKVKEIRKEKQQRKMEWVKYQEQNNILQNDFHGKVIELRIEALKNYQKANDLKLSLYLQQNFEPMQAFLNLPGSQGTMGITTMDRVTTGRNEHHVRILGTKIYTEQQGTKGSQLDNTGGRLFFLRSLPDEALKN Mackowiak partial match PVALEF 17 81181227 81183011 1 7315 lincRNA spliceform added antisense lncRNA ENSG00000225180 ENST00000637878 134 4 coelacanth; potential ortholog in fish without synteny, e.g. herring XP_012675455.1 novel unitary pseudogene ENSMUSG00000113114 coelacanth XP_005988106.1 no CAGE support; no HPA support; Intropolis finds weak support in a series of adipose experiments (e.g. SRR833729) / ncRNA NR_027255.1 (AATK-AS1); partial match Novel EF-hand domain-containing protein. Apparently pseudogenised in the rodent / lagomorph clade. The CDS reported by Mackowiak is incorrect as it has been extrapolated from RNA evidence that retains an intron. no MEEDFSSQMKKMALAMGTSLSDKDIELLPTDMRHHGSFNYLKFFKHIRKLHASGQLDDAIHTAFQSLDKDKSGFIEWNEIKYILSIIPSSGPTTPLTDEEAEAMIQAADTHGDGRINYEEFSELIKKEKIPKKK Mackowiak partial match C2orf92 2 97669789 97702801 1 818 1000 existing transcript antisense lncRNA ENSG00000228486 ENST00000627399 265 8 mammals previously undescribed coding gene ENSMUSG00000102416 / CAGE supports testis expression; HPA finds testis expression alongside weaker general expression; Intropolis supports general expression all sources support testis expression ncRNA NR_038386.1 (LINC01125); partial match Uncharacterised protein. There is strong PhyloCSF across the final two coding exons, and the provenance of the locus is clearly as protein-coding. However, it has not been possible to find a consistent CDS among mammals across the 5' end of the locus, and we cannot deduce an ancestral form. 
Most exons of the human 265aa translation have appreciable conservation, although the model reported here contains exonic duplications not seen in mouse. It seems that the first half of the locus has evolved independently in different lineages during the mammalian radiation, which may undermine confidence in its coding potential. However, there is good ribosome profiling coverage across the human and mouse translations, and Mackowiak found mass spectrometry support for the human translation with peptide [PCGQLLHFLQR]. yes, potentially a unitary pseudogene in human and / or mouse MSRAMALFFVLCWIQDEIVLQVFSKVPYDPSFDETRTAVRSITKRDTQKSYSQQKSLNNAAFASGSNEREEHLAKIFDEILLQVFPKFPYDPSFNEATAVRSITKTDMRKGTSIAWNSPKPEYFLGSVDKIPDKDHLSEEKNFKESCLFDRDLREQLTTIDKETLQGAAKPDAHFRTMPCGQLLHFLQRNTIIAAVSGVAILMAIVLLLLGLASYIRKKQPSSPLANTTYNIFIMDGKTWWHNSEEKNFTKLAKKQKQLKSSSCV NP/NM partial match GNG14 19 12688049 12688372 1 306 1000 transcript extended 5' UTR of FBXW9 ENSG00000283980 ENST00000640117 69 2 mammals known coding gene ENSMUSG00000095845 / no CAGE; HPA coverage cannot be separated from FBXW9 3' UTR on the opposite strand CAGE supports weak expression in neurons and diencephalon; ENCODE RNAseq supports very weak general expression coding NM_001316692.1 (LOC105372280); partial match no Paralog of GNG12. The 2 exon structure has been taken by comparison to that locus and to the mouse ortholog (Gm5741). RefSeq have annotated the locus as a single-exon model, and although the CDS can be theoretically translated across the intron without disruption, this intronic portion of CDS is poorly conserved. 
yes, potential unitary pseudogene in human MSSKVAINSDIGQALWAVEQLQMEAGIDQVKMAADLLKFCTEQAKNDPFLVGIPAATNSFKEKKPYAIL NP/NM partial match C5orf58 5 170234977 170246113 1 1010 lincRNA existing transcript antisense lncRNA; protein-coding in v19 ENSG00000234511 ENST00000593851 81 2 mammals known coding gene ENSMUSG00000085684 / all sources support testis expression all sources support testis expression coding NM_001102609.1; partial match neXtProt no Already known as chromosome 5 open reading frame 58 (C5orf58). RefSeq have extended the CDS to a poorly conserved ATG. It was protein-coding in an earlier GENCODE release. no MGKKRVTDHKLNVDKVIKNINTISSELKKIKELSQLLLCDLILHFNHPIKTENLAEAERNNPLFEESKISDVSLVSNSFSI NP/NM partial match TSTD3 6 99521075 99531720 1 1091 lincRNA existing transcript antisense lncRNA ENSG00000228439 ENST00000636394 111 3 vertebrates known coding gene ENSMUSG00000028251 / all sources support general expression all sources support general expression coding NM_001195131.1; partial match Already known as thiosulfate sulfurtransferase (rhodanese)-like domain containing 3 (TSTD3), previously missed by GENCODE because the only human cDNA evidence incorporates a 'poison' exon (i.e. an exon that introduces a frameshift). Recognised by RefSeq as protein-coding for many years, although their model uses a downstream ATG within the poison exon, i.e. misses the conserved ATG. no MVLPWLLLETARRAVLGSAEAALCGLTSIKGNCHNFYTAISKDVTYKELKNLLNSKNIMLIDVREIWEILEYQKIPESINVPLDEVGEALQMNPRDFKEKYNEVKPSKSDS NP/NM partial match FAM240A 3 46617186 46625218 1 79 1000 existing transcript non-coding transcript within TDGF1 ENSG00000283473 ENST00000640551 77 2 avians; potential ortholog in coelacanth without synteny, e.g. 
linc|3P|XLOC_096740|TCONS_00126143:5.57183|93AA|92AA| (94aa ORF) known coding gene ENSMUSG00000096393 lizard LOC107982372 no CAGE; HPA supports salivary gland, cerebral cortex and thyroid expression; highest Intropolis score is from spermatozoa (SRR650364) CAGE supports weak T cell expression; ENCODE RNAseq supports thymus expression coding NM_001195442.1; partial match no Homology to FAM240B and FAM240C, both of which are on this sheet. RefSeq use an upstream ATG with poor conservation on the same transcript structure. no MNNQYTRREVFCRNTCHDLKHFWEREIGKQTYYRESEERRLGRSALRKLREEWKQRLETKLRLRNNPEDTEKRTNVG NP/NM partial match TMEM269 1 42789894 42798225 1 135 1000 existing transcript lincRNA ENSG00000274386 ENST00000637012 203 5 avians; potential ortholog in zebrafish without synteny, NM_001195246 known coding gene ENSMUSG00000028642 lizard ENSACAT00000012596 all sources support testis expression all sources support testis expression coding NM_001242750; partial match Already known as transmembrane protein 269 (TMEM269). RefSeq use a first exon with poor conservation and transcriptional support, whereas the correct structure became apparent on integrating capture-seq and CAGE data. 
no MVLGLFSIIFSFSRKCHYASRMLLVSFLLDMAVRAMTSHINICSKLGAELNDFAVFTTFGLASALLLGVDGLLSGILAIIYVSAASFHLCFYSPGVPSTYKGLPCPYASCILASTSLLTKGNRFILCCMASLMILFMMDQSYYPYDKILESENWKKLVYIGGVIMLFFSPLSLSAFYCLMWSLSYIFFPDALWGKAACLSPQH XP/NM partial match CLEC20A 1 178496939 178479535 -1 820 1000 transcript extended lincRNA and adjacent antisense lncRNA ENSG00000188585 ENST00000623247 400 8 mammals novel unitary pseudogene ENSMUSG00000109913 / no CAGE for full length transcript; HPA supports testis expression of the full-length transcript reported here, with evidence of a short form based on an alternative first exon specific to lymph and bone marrow (ENST00000646925); BluePrint supports the existence of this truncated model, which also has CAGE in B cells; Intropolis is dominated by cancer experiments, but also B cells (e.g. DRR013790) / coding XP_016858575.1 (LINC00083); partial match C-type lectin domain-containing protein. Evidence suggests a shorter isoform exists in B cells. RefSeq uses an upstream ATG with poor conservation. no MLPRALLLSFCAAALQLVSSKRDLVLVKEALSWYDAQQHCRLHYTDLADLQPSGLWKLYSLMTSTPAWIGLFFDASTSGLRWSSGSTFTALEWGQKLPEFGVGFCATLYTWLKLPSIGAASCTAQKPFLCYCDPDVGHLISTKPSLSLTTSPKPAVVQISGQTFMRFDQVMTWSSALLYCRSHHTDLADLQMVTDETGKEALRSIMSETEAWIGLYLNANSGSLSWSSDLGASIPSWLQVPMMVRGLCTALGIYMTYSPKVYSVNCSSLLPFFCFYDSSTGHRASAELPPLFHTSPTEMTEETTPRPGRAVASVGSGTDRRDTAAATEAQHLSSESKEKTSAQKSGHPFGILKADFTISTLMDPEEMKDQFLRQIQEVLKLTLGHEQFRLKWVSFEVNKK XP/NM partial match TEX51 2 126898919 126901402 1 15 1000 spliceform added lincRNA ENSG00000237524 ENST00000450035 166 6 mammals absent in mouse and rat cow EST DT838655 contains an intact CDS all sources support testis expression coding XP_011510574.1 (LOC101929926); partial match Uncharacterised protein. Alternative splicing allows the potential usage of 4 distinct STOP codons in human, none of which have strong conservation. 
The first 6 coding exons are conserved in mammals with an open translation, and it is not obvious if the translation after this point has diverged dramatically in mammals, or if there have been pseudogenisation events in certain lineages. yes, potential unitary pseudogene in human MLPLLIICLLPAIEGKNCLRCWPELSALIDYDLQILWVTPGPPTELSQNRDHLEEETAKFFTQVHQAIKTLRDDKTVLLEEIYTHKNLFTERLNKISDGLKEKDIQSTLKVTSCADCRTHFLSCNDPTFCPARNRRTSLWAVSLSSALLLAIAGDVSFTGKGRRRQ XP/NM partial match MINDY4B 3 150905439 150871045 -1 469 1000 spliceform added pseudogene; protein-coding in v19 ENSG00000214237 ENST00000465419 460 12 xenopus; potential ortholog in fish without synteny, e.g. salmon XP_014017429 known coding gene ENSMUSG00000101860 (Fam188b2) xenopus XP_002937138 no CAGE; no HPA; Intropolis finds weak expression in a variety of fetal tissues, e.g. adrenal gland (SRR980484) CAGE supports expression in inner cells and organ of Corti; no ENCODE RNAseq coding XP_016863110.1; partial match MINDY family member, previously known as FAM188B2. It was a protein-coding gene in GENCODEv19, but switched to pseudogene as the only transcript evidence at that time supported a truncated CDS. PMID:20717163 previously demonstrated that certain of these exons are transcribed in retina as part of the CLRN1 locus immediately upstream (their coding potential was not examined). In context, these look like readthrough transcription events; both the canonical STOP of CLRN1 and the ATG used for MINDY4B are deeply conserved, and there is no evidence for Clrn1 readthrough transcription in other species. Although the human locus has minimal RNAseq / CAGE support, the annotation of most introns is supported by ESTs. The RefSeq model remains 5' truncated with an incorrect ATG. 
no MDMEVLGQEQSSEQLDLEEISRKISFLDKWREIFSYHRLGTNNSTPQNHEGNHTSADENEDGTGLSQPKGQGHLPSSGLCSIPNPSIISSKLGGFPISLAMATKLRQILFGNTVHVFSYNWKKAYFRFHDPSSELAFTLEVGKGGARSIQMAVQGSIIKYLLFTRKGKDCNLGNLCEISKKEQEQALAAALAGILWAAGAAQKATICLVTEDIYVASTPDYSVDNFTERLQLFEFLEKEAAEKFIYDHLLCFRGEGSHGVILFLYSLIFSRTFERLQMDLDVTTTQLLQPNAGGFLCRQAVLNMILTGRASPNVFNGCEEGKSQETLHGVLTRSDVGYLQWGKDASEDDRLSQVGSMLKTPKLPIWLCNINGNYSILFCTNRQLLSDWKMERLFDLYFYSGQPSQKKLVRLTIDTHSHHWERDQQEEKHGPRRRFSPVEMAIRTKWSEATINWNGTVPFF XP/NM partial match IQCM 4 149742691 149351951 -1 169 1000 transcript extended lincRNA ENSG00000234828 ENST00000636793 501 12 mammals known coding gene ENSMUSG00000031620 / no CAGE; HPA support testis expression with weaker colon expression; Intropolis finds additional support for colon expression (SRR1213821) all sources support testis expression 14 XPs have partial matches IQ motif-containing protein. no MTTEEAMPEKAKCPTLEITKQDFFQEAKTLIAQHYEKINENKVQGTSINVFRKKHQKPKSGKYIPLEIDKKVTRDVVQEHRAALRRICFPKELSKSEHLQEPPQRISFKEPHIFSRRERCRPIDLITKGQVKLDKIMTIIEPVSKKMETAKQQHFEESRNRMLELLYPFPVHLYLQPGTSNLELLKEPDKAFYDWRGFVLTRSFRLACDSRRVSFSQSSSIFRDYYSKTFKTLIKKERQPIKPEPKSQPRIKGTPNKTDKLDSKVKRIGPHIEIFQVFRERKKFMITPKLIRMVTVMQAHVRGWLERKRLQRVMTKALDHGPDMKAVINMYGRLIHRVRYRRGLWRTRQILNLAELEEWMDRKKFYEIMFAKREDWPKIERNELPNFFSDCGHFPTQKQVDDTWDLVHQDGKEKYSELIKKSKAIEMLFTLYPPEGAHVPDSTLLKSTWLRPIVNGEEGYRYIVNGHPALKRANIRVVGKLVARSIRERKMRQHYKSCKVE XP/NM partial match PRRT1B 9 131545616 131558202 1 95 1000 gene added absent ENSG00000283526 ENST00000636672 263 4 coelacanth; potential ortholog in fish without synteny, e.g. 
zebrafish OTTDARG00000036869 previously undescribed coding gene ENSMUSG00000079497 coelacanth XP_014340850 CAGE supports cancer expression only; HPA supports expression in internal organs, lung especially; Intropolis is dominated by cancer and cell lines experiments CAGE supports pancreas, stomach, kidney and liver expression; ENCODE RNAseq supports expression in these and other organs coding XP_011517576.1; partial match neXtProt Proline-rich transmembrane protein. no MEAGAGGAGSDTKGGGSPATPEDPRSPAKPAAPEDPQMPAQPALPQLPRRPRTLDEDGAPSEDGAAGGSEPAPEDAPAQAAGEAGPVSKAAAGGAPHIGFVGEPPPYAPPDPKAAPLLYPPFPQVPVVLQPAPSALFPPPAQLYPAAPTPPALFSPPAGAAFPFPVYNGPMAGVPGPATVEHRPLPKDYMMESVLVTLFCCLLTGLIAIVYSHEARAALGRGDLAQAEEASRKARSLVLFSLLFGVFVSTSWVIYVVVALYLP XP/NM partial match VSIG10L2 11 125946056 125955914 1 249 1000 transcript extended antisense lncRNA ENSG00000283703 ENST00000638636 767 10 xenopus previously undescribed coding gene ENSMUSG00000098590 xenopus OCT70428 no CAGE; no HPA; Intropolis finds expression in skin (e.g. SRR836000) no expression data coding XP_006719013.2; partial match V-set and immunoglobulin domain containing protein. Mouse annotation is partly supported by ESTs from inner ear, e.g. 
BB850581 no MVGQRAQHSPVSLLLLIHLCLLHLRASGQPHPTPEAPVEEVVSVQGVRGGSVELACGSGPAPLLVLWSFTPLGSLVPRPVAVTDGAMSKVEAIASALGVVSLRNSSLVLGELHEGARGHFLCQVLHVAGGQLHAAYSHLTLAVLVPVSKPQVRLSNPSPVEGASVVATCAVREGTEPVTFAWQHRAPRGLGEALVGVTEPLFQLDPVNRTHLGWYMCSASNSVNRLSSDGAFLDVIYGPDKPVITMEPLGLTEEGFWASEREEVTLSCLAASNPPSHYVWLRDHTQVHTGPTYVIARAGRVHTGLYTCLARNSYLDTRTQTTVQLTIYYPPEGQPSCAVHPSPEAVTLLCAWPGGLPPAQLQWEGPQGPGPTAPSNVTWSHAAAQLPSGSVFTCTGQHPALAPPALCTVMLWEPLGRPTCWSTATMGDQFIMLSCEWPGGEPPATLGWLDEQQQPLGGSSSSMAVHLLQAQEDLAGREFTCRGTHLLRTPDPHCHLQLEAPQLDVAEPRVSVLEGGEAWLECSLRGGTPPAQLLWLGPQQQKVDPGTSGFMLHPEGAQLRLGIYDADPAHHRGTYQCVARNAVGNSSQSVLLEVLRYPAPPNVTISRLTYGRHRREVQLQWAILGPGNLTGFLVQRKASALGPGAGAWETAASDIEPESRGRRLGGLDPGVLYAFRILALNHHTAGHPSEVKIPADPPFSAYPAVLGAAGTGMVVATVASLLVFQYAARHPETFPRLETPTTTPGLDPAQETTDSPVNVTITVTATP XP/NM partial match AC022167.5 16 8884356 8885295 1 1466 unannotated gene added absent ENSG00000283516 ENST00000636296 72 2 avians previously undescribed coding gene ENSMUSG00000107252 lizard chrUn_GL343263:1,930,831-1,931,738 [Broad AnoCar2.0/anoCar2]; partial alignment to AUGUSTUS g12243, which is 5' extended all sources support testis expression all sources support testis expression XP_011521085.1; XP_011521084.1; XP_011521086.1 (LOC101929989); all partial matches no A truncated homolog of lipopolysaccharide-induced tumor necrosis factor-alpha factor (LITAF). The first coding exon of LITAF has been lost, with the novel CDS beginning instead within coding exon 2 of LITAF using an alternative ATG. This prospective human initiation codon is not strongly conserved in mammals, although the mouse translation uses a species-specific ATG 6 codons downstream. Nonetheless, this locus is potentially pseudogenic in any mammalian lineage. 
yes, potentially a unitary pseudogene in human and / or mouse MPVQAVCPYCGNRIITVTTFVPGALTWLLCTTLFLFGYVLGCCFLAFCIRSLMDVKHSCPVCQRELFYYHRL XP/NM partial match CCDC194 19 17394158 17390509 -1 112 1000 spliceform added lincRNA ENSG00000269720 ENST00000636079 234 4 mammals previously undescribed coding gene ENSMUSG00000108900 / CAGE supports cancer expression only; no HPA support; Intropolis is dominated by cancer experiments CAGE supports weak expression in multiple neonate cell types including bone and epididymis XP_011507149.1 (LOC105372343); partial match Uncharacterised protein. The RefSeq model uses an additional exon that is poorly conserved. no MAEPGPEPGRAWRVLALCGVAVFLAAAAAGGALVAWNLAASAARGPRCPEPGANATAPPGDPPPGVDDLRRRLAEAAEREEALARQLDQAESIRHELEKALKACEGRQSRLQTQLTTLKIEMDEAKAQGTQMGAENGALTEALARWEAAATESTRRLDEALRRAGVAEAEGEACAAREAALRERLNVLEAEMSPQRRVPRPRPRSGSRPRPSPRSRSRSGPSGGCRRPARRARG XP/NM partial match TLE7 16 71433324 71430262 -1 1 1000 transcript extended pseudogene ENSG00000260734 ENST00000561754 441 9 mammals previously undescribed coding gene ENSMUSG00000095941 / weak CAGE support for testis and testicular cancer expression; HPA supports testis expression; Intropolis supports embryo expression in multiple experiments (e.g. SRR499827) weak CAGE support for embryonic stem cells and testis expression; ENCODE RNAseq supports testis expression XP_016879408.1; partial match neXtProt Member of the transducin like enhancer proteins family. However, the first 3 coding exons are apparently unique to this locus and only seen in mammals. The RefSeq model uses a spurious intronic ATG. 
no MSGEKEEASLRMFGAYGEPEERRDVLESSGVSSQPEPQVQQQLGSLLGVPWQPPGPPIQHSPADQETSTVTQQQWHLQGLGRSELQAAGLPDAQPGEAAESSPSFLLGSEVGQPYSSSSPSEEVLSLLRAIPPIPDEVVVRQKRAPQGSWKVGTLFHGKRVYAVAISGSTHHVYTCGSGYIRVWDESALHAGEKAPRAQLDLQHPQDRVVTCKLFPDERSLITGGASQAVTLWDLAPTPQVRAQLTSTGPTCYSLAVSSDAHICLACFHGFVEIWDLQNQILIRKHEVPVYGSRCVDITGNIFWTGGEDTILYSWDLRSYQRLHQHNLQNEILSITHDPGEEWVLAGLRTSDIVFLHTRRNEQFKALMKKYTRHHSLKFASCGSYFVTAIDTRLSGLEAPSLQKLFQIEESSGILCCDVSSDNQYLVMGSSSSATIYQLLY UniProt partial match IQANK1 8 143735854 143790608 1 85 1000 spliceform added three adjacent lincRNAs ENSG00000203499 ENST00000527139 560 13 coelacanth previously undescribed coding gene ENSMUSG00000102018 coelacanth ENSLACT00000013317 (5' truncated) all sources support testis expression CAGE supports testis expression; ENCODE RNAseq supports general expression ncRNA NR_033849.1 (FAM83H-AS1) IQ motif and ankyrin repeat domain-containing protein. Incorporates exonic GWAS SNP rs4875053 linked to menarche, which can now be reinterpreted as mis-sense [ACG]-[AGG]. UniProt A8MXQ7 uses a spurious first coding exon and misses the first two coding exons presented here. 
no MDSKKGRPKAAAGKWQTLHPGPKTRAAAGKPGENRPPQRKAGWQAREPASAESPQAPTGPAEDRAARAIQGAFRQLRARRELARRREERREYLEQMETPQKEAYLAPVRREQEAARRLREQEEAAQRERREELQRRRRLLDAAFDGDVGEIRAVLKEVEQLLTREGVGHDEAGEARRLQRRVALAECEDSYGNTPLSEAAAGGQPLAIQLRAELGASPNSKGAFGPTPLYRAAFGGHLAAVEVLLKLGADPRVYAEDGSTPERVASLDTVVSVLRSWDLSLTEAMLQNMEAEQQRRAQEAQRHKEAEAERCGSMTLKVQQLTREQQQCHKELQQAYCELSRRISEHDQCEWRCMDKTKLTLQAIKDTEAQVDRLRQEAQKAEEALAMARLELREQTQEGEEEAPGLKCQVTELHDVLMKDVGNRIRADGRWPLVIDPLGQAATFLRYQDTNYVDTVNPEPLRPETMWLALLGALRYGKPLVFDLREEDLFPVVQRQLEAVQERYLSLLRPTDGPEYSPTQFQEQRLEHFRLFFVTKVQWPPAEQLQVLLPVRVQLPGTGL UniProt partial match CCDC197 14 93998132 94008812 1 8715 lincRNA existing transcript lincRNA ENSG00000175699 ENST00000636493 272 6 mammals novel unitary pseudogene ENSMUSG00000110677 / CAGE supports testis expression; HPA finds testis expression alongside heart expression for a specific model with an alternative multi-exon 5' UTR; Intropolis is dominated by testis experiments / ncRNA NR_024183 (LINC00521) neXtProt Coiled-coil domain-containing protein. UniProt Q8NCU1.1 and Q8NCU1-2.1 are 3' truncated. no MAAMDTGQRADPSNPGDKEGDLQGLWQELYQLQAKQKKLKREVEKHKLFEDYLIKVLEKIPEGCTGWEEPEEVLVEATVKHYGKLFTASQDTQKRLEAFCQMIQAVHRSLESLEEDHRALMLSLKIRLCQLQKKCYRKQEQWWQLKHSITYQKDIDFDTHTSSSYNDQLLGYMQMTITNMARQCCPSAHGVPKSMDLFSKLDLIKEFMLDKMETVRLIALLTEPKVCWSWDSFGDQWLRRHPKPFRKCPRRRVSTPRTPFPSPHASECSGLY C-HPP match SPEM3 17 7428903 7432762 1 1096 unannotated gene added absent ENSG00000283439 ENST00000636696 1196 3 mammals previously undescribed coding gene ENSMUSG00000109737 / all sources support testis expression all sources support testis expression coding XP_011522098.1 Weisser et al; neXtProt Putative transmembrane protein, potentially an ancient duplication of adjacent C17orf74. 
no MGERAYHGAQVCSGTNPRKCQDLGDSILLLLGSFILLNVWINVVTLLWKHLKSSLRILFRHFFPKDKQPSGSHPICICSSVDPKNLCSKVSSRVHPRPGFLLRRVNHLDSWIPDTNDEKVSACCCVPPKCGHAGVPRESARGLYKAGMMGGGEAPQVTASKAQASLLSRPETSSQFPKMSKLDTGPCHLPQESKTKTPDCAPAEAPAQAQVHSPTHTPVCTPTHPWTRSTDHTAVHTPAHSWTHSKARTPEGTHSQAQDTSAQAQAHTSAPTPAQTPAHIQAHTPAPTPAKASAHTKAHTSAQAQTHSPPHTPEYTHSQAHSPEHTSAHSPAQAPMPVPAHPQAHAPEYTSAHAPAYIPDHSHLVRSSVPVPTSAPAPPGTLAPATTPVLAPTPAPVPASAPSPAPALVMALTTTPVPDPVPATTPAPIITPIPSTPPAFSHDLSTGHVVYDARREKQNFFHMSSPQNPEYSRKDLATLFRPQEGQDLVSSGISEQTKQCSGDSAKLPAGSILGYLELRNMEWKNSDDAKDKFPQTKTSPYCSFHPCSSEKNTDSQAPFYPKFLAYSRDTACAKTCFHSATTAQSSVCTLPPPFTLSLPLVPPRSFVPPQPTNHQRPSTLIQTPTVLPTSKSPQSILTSQFPIPSLFATISQPLIQPQCPECHESLGLTQDSGLQRTPGPSKDSRVPRNLDLAQNPDLYKNPGLTQDPGLHENPGLAPNQGLHEFPGLPQDSYLCQNPSPSQDFGLHKNSGITQDSHPQKNTGLTQEAGILRSPCLTQSPGLHKKTPFTQTSDLQRSSGFTQDSGIYRNLEPNQETVIYKNQDLSQATDHQKNLGSSKDSGGHKNTGNVQDPGVCSTAGLTEDSGSQKGPYVPQDSEVNKSSGVIQESFLHKSPGLVQTSGLPKCSGLTQNSGDYKNPGLIQDCGGHKVKGLTQDSNLPSLTQATKVERRFSLPQDVGVYRSSEHSQDSNLHKCPGINQDPGPHKDPALVQDSGLPKISGLTQESGPYKSSCLIPDPSLYKNPSPALGSDFVQLLSLLQTPKSTLSLMKSSVPEKAAQKEDAQRHVLWARVQLNENSCPSKAQVVSNDLQTFSEVPVLIELQSSSWRAGSQHGAYRPVDTVPSGYQNYRQMSMPTHINWKSHCPGPGTQAGHVVFDARQRRLAVGKDKCEALSPRRLHQEAPSNSGKPSRSGDIRM C-HPP match AC007906.2 16 53044158 53044000 -1 2335 lincRNA existing transcript antisense lncRNA ENSG00000277639 ENST00000619363 52 1 xenopus previously undescribed coding gene ENSMUSG00000110332 xenopus EST cluster gives a CDS at [KB021653:23,707,665-23,707,817](-) JGI 7.0/xenTro7 all sources support general expression all sources support general expression ncRNA NR_136518.1 (LOC105371267; p53-regulated lncRNA 1) Weisser et al no Uncharacterised protein. Previously described as a lncRNA p53 target [PMID:25524025]. 
no MGCHSSKSTTVAAESQKLEEEREGREPGLETGTQAADCKDAPLKDGTPEPKS C-HPP match TEX53 9 114657774 114656529 -1 2004 lincRNA existing transcript lincRNA ENSG00000230054 ENST00000423632 70 2 mammals previously undescribed coding gene ENSMUSG00000084782 / all sources support testis expression all sources support testis expression ncRNA XR_930261.1 Weisser et al no Uncharacterised CDS. Apparently a partial duplicant of TEX48 immediately upstream. no MGSKIFCCCRKTSEGSSTTVGFHNPRMFEQHHPRSFNLNTNSLHSAVPKRHPRLPYDNRMMLKACILRRP C-HPP match AL451007.3 1 244730923 244730375 -1 1827 unannotated gene added absent ENSG00000284188 ENST00000640271 182 1 mammals previously undescribed coding gene ENSMUSG00000070489; currently annotated as non-coding due to a known error in the genome assembly all sources support testis expression all sources support testis expression absent Weisser et al Uncharacterised protein. The ATG region is well conserved in mammals although the CDS diverges notably towards its middle portion. The STOP has limited conservation. Peptide evidence was therefore necessary to support confident protein-coding annotation. no MAAPRFGGPRRGGAQHELLEKAARLERGPPPRGDPEAVGRRAVAGDGGSCSGCWCWRRLFRGPRRKKLRQAHARAGKEAPERGLWGPSSLQRLLQRLATWRRRYLRRKERPDRLEEIPLLVLDRAQGGHEAAAGPQSSVPGRPAQAAPARQPRRRSATRSRSPVAPPVHAQDCFFLFGQQKQ C-HPP match SPAAR 9 35910464 35910691 1 389 1000 existing transcript lincRNA ENSG00000235387 ENST00000443779 75 1 mammals known coding gene ENSMUSG00000028475 / all sources support general expression all sources support general expression ncRNA NR_024283 (LINC00961) Weisser et al; neXtProt no Subsequently identified and confirmed experimentally by PMID:28024296. The protein negatively regulates mTORC1 activation. However, the CDS reported here is 15aa shorter due to the usage of an alternative initiation codon with stronger conservation. 
no METAVIGVVVVLFVVTVAITCVLCCFSCDSRAQDPQGGPGRSFTVATFRQEASLFTGPVRHAQPVPSAQDFWTFM C-HPP match CCDC192 5 127703389 127941468 1 127 1000 existing transcript lincRNA ENSG00000230561 ENST00000514853 292 7 mammals known coding gene ENSMUSG00000058925 / all sources support testis expression all sources support testis expression ncRNA XR_159059.3 (LINC01183) Weisser et al; neXtProt Coiled-coil domain-containing protein. no MMPVDVCPRDRGSQWVWLEMGQCYSKKSVVPESDTSERSSMTSGSSESDIPQENKVSKASLDTGQMAFTLAQLESLEICLKEAEEKAKALSEQLSVSEGTKSKLLEQVSRLEEKLEAVDHKEASGGPYEKMVLVKDQCIQKLQAEVKASQEQLIAQKLKHEKKVKKLQTDLATANAITVLELNEKIKTLYEGKPAPREDSLLEGFCGGLPPVEEGDRKISLIMELSTQVSLQTERITQLKEVLEEKERKIQQLEAERSPHPPQEVKDPPGCLPEAPVFSTHDIPPVVSDENL Wright et al match LBHD2 14 103086013 103089797 1 307 1000 / ms gene added absent ENSG00000283071 ENST00000634353 108 3 xenopus known coding gene ENSMUSG00000087075 xenopus XP_018086063 weak though consistent CAGE support for adipocytes; HPA supports adipose and testis expression; Intropolis finds highest expression in cancer cells, though there is consistent support in adipose (e.g. SRR833729) weak CAGE support for ovary and brain expression; ENCODE RNAseq supports weak general expression absent Wright et al; neXtProt Homology to limb bud and heart development (LBH). no MSTPRPAPPQPGAAEGAGGPEGKAVAGAWEKGPRLGQRLPSIVVEPSEADPVESGELRWPLESAQRGPSQSRAAAAPSPSLPGEPGKAADNAGSECACSEDPAAPARG Wright et al match AC073111.5 7 150407612 150408223 1 488 1000 / ms existing transcript ZNF775 3' UTR ENSG00000284691 ENST00000642087 203 1 mammals known coding gene ENSMUSG00000053297 / all sources support general expression all sources support general expression pseudogene NR_027237.1 Wright et al Zinc finger protein. 
no MTELASSGGGSPAGDGEEGLGDERGLVIHHPAEEQPYRCPLCGQTFSQQPSLVRHQKAHAGAGRAAAFVCPECGKAFSVKHNLEVHQRTHTGERPFPCPECGRCFSLKQNLLTHQRIHSGEKPHQCAQCGRCFREPRFLLNHQRTHARMPAPHPRRPGVFGERRPYFCPRCGKSFAREGSLKTHQRSHGHGPEGQAAHLGRVL Wright et al match MMP24OS 20 35277929 35277714 -1 1109 lincRNA existing transcript antisense lncRNA ENSG00000126005 ENST00000424358 71 1 avians known coding gene ENSMUSG00000074649 chicken PLAR EnsASCoding|3P|XLOC_103732|TCONS_00201937|MTG1:2.69588 all sources support general expression all sources support general expression ncRNA NR_102705.1 (MMP24-AS1) Wright et al; neXtProt fritsch_2012:46839 / gonzalez_2014:598039 / loayza_puch_2013:872810 / rubio_2014:1237497 / rutkowski_2015:659335 / stern_ginossar_2012:22799 Uncharacterised protein. no MGAQLSGGRGAPEPAQTQPQPQPQPAAPEGPEQPRHPPQPQPQPQPQPQPEPSPWGPLDDVRFLIACTSWY Wright et al match SPTY2D1OS 11 18599929 18603179 1 1296 lincRNA existing transcript antisense lncRNA ENSG00000247595 ENST00000635674 59 3 mammals known coding gene ENSMUSG00000056509 / all sources support testis expression all sources support testis expression ncRNA NR_038360.1 (SPTY2D1-AS1) Wright et al; neXtProt no Uncharacterised protein. This locus remains difficult to resolve, as the peptide data provided support for a CDS that represents a different isoform in a different frame. Thus, PhyloCSF provides support for coding potential outside of the peptide-supported CDS. no MIVLGWMFFVGLVCYMGTFPELMPPTLKWQERWPVQESKTQLRRRALGEDLLQNHVEGI Wright et al match TMEM275 1 46533527 46532994 -1 991 1000 / ms gene added absent ENSG00000282881 ENST00000634804 177 1 vertebrates known coding gene ENSMUSG00000034185 zebrafish NM_001128377 no CAGE; HPA supports testis and brain expression; Intropolis supports brain expression (e.g. SRR921936) alongside cancer expression no CAGE; ENCODE RNAseq supports brain / CNS expression absent Wright et al Uncharacterised protein. 
no MPPAEKSEGPPVPAPAERARGRVPGLPSPALCCACGLCALLAGVNVTLAGAFASFLPEHNALLVVGLALLVLALGFFAACCVCSRRGLAPRGRSAAAAGPGQGGGRAGPVALEMESSEPTAQDTTAVQLSPAVSAASSGCSSPGPSPLALEAPAPAAVCALRSEGVQLNPPRARAAP