
(A) Comparison of publicly available gene sets. The protein-coding content of five major publicly available gene sets (GENCODE, AceView, consensus coding sequence [CCDS], RefSeq, and UCSC) was compared at the level of total gene number, total transcript number, and mean transcripts per locus. (Blue) GENCODE; (orange) AceView; (yellow) CCDS; (green) RefSeq; (red) UCSC. The lncRNA content of three of these gene sets (GENCODE, RefSeq, and UCSC) was also compared at the same three levels. Again, GENCODE data are shown in blue, RefSeq in green, and UCSC in red. (B) Overlap between GENCODE, RefSeq, and UCSC at the transcript and CDS levels. Both protein-coding and lncRNA transcripts of all data sets were compared at the transcript level. Two multi-exonic transcripts were considered to match if all their exon junction coordinates were identical; two mono-exonic transcripts were considered to match if their transcript coordinates were the same. Similarly, the CDSs of two protein-coding transcripts matched when the CDS boundaries and the encompassed exon junctions were identical. Numbers in the intersections involving GENCODE are specific to this data set; otherwise, they correspond to any of the other data sets.
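
The matching rules in (B) can be illustrated with a minimal Python sketch. This is not the pipeline used to produce the figure: the `Transcript` record, the function names, and the choice to include chromosome and strand in each comparison key are all assumptions made for illustration. Two transcripts match when their keys are equal.

```python
from typing import NamedTuple, Optional, Tuple

Interval = Tuple[int, int]


class Transcript(NamedTuple):
    """Hypothetical record; field names are illustrative, not from any gene set."""
    chrom: str
    strand: str
    exons: Tuple[Interval, ...]     # (start, end) of each exon
    cds: Optional[Interval] = None  # (cds_start, cds_end), None for lncRNAs


def junctions(exons):
    """Return exon junctions as (end of exon i, start of exon i+1) pairs."""
    ex = sorted(exons)
    return tuple((ex[i][1], ex[i + 1][0]) for i in range(len(ex) - 1))


def transcript_key(t):
    """Key for transcript-level matching.

    Multi-exonic: all exon junction coordinates (outer ends are ignored).
    Mono-exonic: the transcript start and end coordinates.
    Chromosome and strand are included as an assumption.
    """
    ex = sorted(t.exons)
    if len(ex) == 1:
        return (t.chrom, t.strand, "mono", ex[0][0], ex[0][1])
    return (t.chrom, t.strand, "multi", junctions(ex))


def cds_key(t):
    """Key for CDS-level matching: CDS boundaries plus the junctions inside them."""
    if t.cds is None:
        return None
    cds_start, cds_end = t.cds
    inner = tuple(
        j for j in junctions(t.exons)
        if cds_start <= j[0] and j[1] <= cds_end
    )
    return (t.chrom, t.strand, cds_start, cds_end, inner)


# Made-up coordinates: same junctions and CDS, different transcript ends.
tx_a = Transcript("chr1", "+", ((100, 200), (300, 400), (500, 600)), cds=(150, 550))
tx_b = Transcript("chr1", "+", ((50, 200), (300, 400), (500, 700)), cds=(150, 550))
# Mono-exonic transcripts that differ only in their end coordinate.
mono_a = Transcript("chr1", "+", ((100, 200),))
mono_b = Transcript("chr1", "+", ((100, 250),))
```

Under these assumptions, `tx_a` and `tx_b` match at both the transcript and CDS levels even though their outer ends differ, whereas `mono_a` and `mono_b` do not match because mono-exonic transcripts are compared by their full coordinates.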











