1.2. Classification of ncORF
Technological advances have enabled the discovery of many previously overlooked ncORFs, which are classified into categories such as long non-coding ORFs (lncORFs), circular ORFs (circORFs), intronic ORFs, primary microRNA-derived ORFs (pri-miORFs), and small ORFs (smORFs) (
Figure 1) [
1,
3]. This section reviews ncORFs based on each category.
ncORFs located within the 5’ and 3’ untranslated regions (UTRs) of an annotated CDS, in a reading frame alternate to that of the CDS, are termed upstream ORFs (uORFs) and downstream ORFs (dORFs), respectively (
Figure 1) [
3]. uORFs regulate translation of the main CDS through interactions with the 43S ribosome pre-initiation complex in the “leaky scanning” mechanism. As the 43S pre-initiation complex scans the mRNA and encounters a uORF, which often has a near-cognate non-AUG start codon, it has three options: (1) initiate translation at the uORF and disassemble upon reaching the stop codon, resulting in no translation of the downstream CDS; (2) fail to recognize the uORF and continue scanning until it reaches the downstream CDS, resulting in CDS translation; or (3) initiate translation at the uORF, remain bound to the mRNA, and resume scanning, resulting in CDS translation [
3,
5]. uORFs can encode proteins that interact or functionally cooperate with the downstream CDS protein [
6]. For example, an ncORF positioned upstream of a protein kinase C (PKC) isoform encodes uORF2, which significantly impairs the viability of breast cancer and leukemia cells by inhibiting the PKC family [
7]. Compared to uORFs, the function of dORFs is less well-defined, but it has been hypothesized that translation of dORFs can enhance main CDS translation by recruiting ribosomes or translation initiation factors to the main CDS [
8]. Unlike uORFs and dORFs, alternative ORFs (altORFs) overlap with the main CDS, but in a shifted reading frame [
3]. altFUS is a prime example of an altORF: it overlaps its main CDS, FUS, in a reading frame that is shifted by a single nucleotide [
9]. ncORFs can also be in-frame with the main CDS, expressing extended or truncated isoforms of the annotated proteins based on the locations and presence of the ncORF start and stop codons [
3,
10]. For example, the MYC gene has an alternate CUG start codon upstream of the main AUG start codon, from which N-terminally extended variants of the MYC protein can be translated [
11].
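The positional categories described above can be illustrated with a minimal sketch. It assumes transcript coordinates that are 0-based and end-exclusive, with each ORF running from the first nucleotide of its start codon to one past its stop codon. The function name `classify_orf` and the returned labels are hypothetical simplifications for illustration, not an established annotation scheme; in particular, any out-of-frame ORF that overlaps the CDS is simply labeled as an altORF here.

```python
def classify_orf(orf_start, orf_end, cds_start, cds_end):
    """Classify an ORF by its position relative to the annotated CDS.

    All coordinates are 0-based, end-exclusive transcript positions
    (assumption made for this sketch).
    """
    # Entirely within the 5' UTR: upstream ORF.
    if orf_end <= cds_start:
        return "uORF"
    # Entirely within the 3' UTR: downstream ORF.
    if orf_start >= cds_end:
        return "dORF"
    # From here on the ORF overlaps the CDS; compare reading frames.
    in_frame = (orf_start - cds_start) % 3 == 0
    if not in_frame:
        return "altORF"
    # In-frame ORFs share the CDS stop codon and differ only at the start.
    if orf_start < cds_start:
        return "in-frame N-terminal extension"
    if orf_start > cds_start:
        return "in-frame N-terminal truncation"
    return "annotated CDS"


if __name__ == "__main__":
    cds = (100, 400)
    print(classify_orf(10, 40, *cds))    # ORF confined to the 5' UTR
    print(classify_orf(101, 200, *cds))  # overlapping, shifted by one nucleotide
    print(classify_orf(55, 400, *cds))   # upstream in-frame start, as for MYC
```

A frame shift of a single nucleotide, as in the altFUS case, gives a non-zero remainder modulo 3 and is therefore labeled as an altORF, whereas an upstream in-frame start such as the MYC CUG codon is labeled as an extension.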
lncORFs are encoded within long non-coding RNAs (lncRNAs), which are RNA transcripts longer than 200 nucleotides that are annotated as non-coding (
Figure 1)[
3,
12]. lncRNAs are pervasively expressed and play important roles in gene regulation as untranslated RNA molecules [
13]. Many RNA sequences that were previously considered lncRNAs have now been found to contain ORFs with protein-coding potential [
14,
15,
16,
17,
18,
19,
20]. An example of a lncORF is found in the steroid receptor RNA activator (SRA) gene. In its RNA form, SRA functions as a nuclear receptor coactivator [
21,
22], while it also encodes a protein, SRAP, associated with breast cancer cell motility [
21,
22]. lncRNAs can originate from pseudogenes, which are DNA regions that contain presumably non-functional, untranslatable copies of annotated genes derived from retrotransposition of processed mRNAs and segmental duplication [
12,
23]. Increasing evidence has revealed peptides translated from pseudogenic ncORFs [
24]. lncRNAs located between genes, with no overlap with the CDS, are classified as intergenic, and ncORFs derived from these regions are called long intergenic non-coding ORFs (lincORFs) or intergenic ORFs [
25]. Though rare, intronic ORFs with translational capabilities have also been discovered in the introns of pre-spliced mRNAs [
26]. Another type of non-coding RNA (ncRNA) is the microRNA (miRNA), which is 18-24 nucleotides long. miRNAs originate as primary transcripts (pri-miRNAs) and become mature miRNAs through specific cleavage and processing [
27,
28]. ncORFs derived from these unprocessed transcripts of miRNAs are referred to as pri-miORFs (
Figure 1).
Circular RNAs (circRNAs) are single-stranded RNA molecules whose 5’ and 3’ ends are covalently linked by back-splicing. They display translation potential, which has led to the discovery of a new class of ncORFs, termed circORFs [
3,
29]. circORFs can regulate gene expression through their interactions with miRNAs and circRNA binding proteins (cRBPs) [
30]. Because they lack the 5’ cap found on linear messenger transcripts, circRNAs were historically considered to be a type of lncRNA [
12]. However, through cap-independent initiation mechanisms, circRNAs have been demonstrated to encode functionally significant circORF proteins (
Figure 1)[
31]. An example of a circORF is circMAPK1, encoding MAPK1–109aa, a microprotein with tumor suppressive functions in gastric cancer cells through interactions with MEK1 of the MAPK signaling pathway [
32].
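Because a circRNA has no ends, an ORF on it can read across the back-splice junction, and can in principle continue indefinitely if no in-frame stop codon is encountered. The following minimal sketch illustrates how such ORFs could be enumerated. It assumes AUG-initiated ORFs only and takes the circRNA as a DNA-alphabet string whose last nucleotide is joined to its first. The names `codon_at` and `find_circ_orfs` are hypothetical helpers written for this illustration, not functions from any published tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_at(seq, pos):
    """Return the codon starting at pos, wrapping around the circle."""
    n = len(seq)
    return "".join(seq[(pos + i) % n] for i in range(3))


def find_circ_orfs(seq, min_codons=2):
    """List AUG-initiated ORFs on a circular sequence.

    Each ORF is reported as a dict with its start index, the number of
    sense codons (stop excluded), whether the ORF including its stop codon
    reads across the back-splice junction, and whether it is endless.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    orfs = []
    for start in range(n):
        if codon_at(seq, start) != "ATG":
            continue
        pos, codons = start, 0
        # After n codons (three laps) the reading frame is back at the
        # start codon, so an ORF with no stop by then never terminates.
        while codons < n and codon_at(seq, pos) not in STOP_CODONS:
            codons += 1
            pos += 3
        if codons >= min_codons:
            orfs.append({
                "start": start,
                "codons": codons,
                "spans_junction": start + 3 * codons + 3 > n,
                "endless": codons == n,
            })
    return orfs


if __name__ == "__main__":
    # The start codon sits at the 3' end of the linear representation,
    # so the ORF must cross the junction to reach its stop codon.
    for orf in find_circ_orfs("AAATAACCCATG"):
        print(orf)
```

A linear ORF finder applied to the same string would miss this ORF entirely, since the start codon sits at the very end of the linear representation.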
Finally, smORFs are protein-coding sequences that encode proteins of 100 amino acids or fewer [
3]. This arbitrary size limit stems from the historical assumption that proteins smaller than 100 amino acids are statistically unlikely to be functional, and from the difficulty of detecting coding sequences shorter than 300 nucleotides before the emergence of modern omics-based technologies such as ribosome sequencing [
3]. Regardless, ncORFs of any other class are also classified as smORFs if they meet the size cut-off. Outside of the nuclear genome, eight smORFs have been identified in mitochondrial DNA; these encode mitochondrial-derived peptides (MDPs) [
33].
1.3. Identification of ncORF
Computational analyses have been the most common method for annotating canonical genes, which are defined as the longest evolutionarily conserved AUG-containing ORF in an mRNA [
34]. However, ncORFs/smORFs are harder to predict and produce more noise; thus, alternative methods are required for their annotation [
35]. The detection of ncORFs in the early stages was challenging because existing gene identification algorithms were conservative and were not designed for ncORFs [
36]. sORF finder was one of the first successful ncORF prediction tools, resulting in the identification of over 2,000 intergenic ORFs in the plant Arabidopsis thaliana [
37]. Technological improvements led to the establishment of various ncORF/smORF prediction tools and databases, such as PhyloCSF, MiPepid, and uPEPperoni [
36,
38,
39,
40,
41]. While bioinformatic tools offer lower cost and greater convenience than experimental validation of ncORFs, they cannot by themselves confirm the translation of novel ncORFs [
36]. The investigation and characterization of these translatable ncORFs therefore require further study.
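To make concrete why ncORF prediction is noisier than canonical gene annotation, the sketch below scans the three forward reading frames of a transcript for candidate smORFs. It is a deliberately naive illustration and is not the algorithm used by sORF finder, PhyloCSF, MiPepid, or uPEPperoni. The start codon set (AUG plus three near-cognate codons), the 10-codon minimum, and the function name `find_orfs` are assumptions chosen for this example; the 100-codon maximum follows the smORF size cut-off described earlier.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# AUG plus an illustrative subset of near-cognate start codons (assumption).
START_CODONS = frozenset({"ATG", "CTG", "GTG", "TTG"})


def find_orfs(seq, starts=START_CODONS, min_aa=10, max_aa=100):
    """Return candidate ORFs as (start, end, frame, n_codons) tuples.

    start/end are 0-based, end-exclusive and include the stop codon;
    n_codons counts the start codon but not the stop codon. Only the
    most upstream start codon of each open stretch is reported, and
    ORFs without an in-frame stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # most upstream start codon since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon in starts:
                start = pos
            elif codon in STOP_CODONS:
                if start is not None:
                    n_codons = (pos - start) // 3
                    if min_aa <= n_codons <= max_aa:
                        orfs.append((start, pos + 3, frame, n_codons))
                start = None
    return orfs


if __name__ == "__main__":
    transcript = "CTGGCTGCTTGA"
    print(find_orfs(transcript, min_aa=3))                  # near-cognate starts allowed
    print(find_orfs(transcript, starts={"ATG"}, min_aa=3))  # AUG only
```

Allowing near-cognate start codons and short lengths greatly increases the number of candidates, which is one reason purely sequence-based predictions need supporting evidence of translation.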
During translation, ribosomes enable protein synthesis by reading mRNA transcripts. As the ribosome reads the mRNA, it protects 30-31 nucleotides of the mRNA from nuclease degradation, creating a ribosome footprint. In 2009, Ingolia et al. exploited this phenomenon to develop the technique known as Ribo-Seq [
42]. By converting these footprints to DNA and subjecting them to deep sequencing, they were able to map the precise positions of ribosomes and quantify translational activity. Actively translated ORFs are characterized by continuous 3-nucleotide periodicity, which results from the 80S ribosomes reading the mRNA template one codon at a time [
34]. Furthermore, ribosome footprint density can be used to deduce the rate of translation for a particular polypeptide, with higher ribosome density meaning slower elongation, and vice versa [
43]. Another advantage of Ribo-Seq is its ability to identify ncORFs regardless of their start codon, which is invaluable since a large portion of ncORFs do not initiate translation with an AUG start codon [
44,
45]. Ribo-Seq data can be analyzed using computational methods, such as RibORF, which calculates the overall probability of translation [
24,
35]. However, validation of ncORFs at the protein level is required, as ribosome occupancy alone cannot distinguish between coding and non-coding RNA transcripts, limiting the reliability of Ribo-Seq [
36,
46,
47]. To reduce the number of false positives detected, a method called polysome profiling or Poly-Ribo-Seq was developed [
48]. Scanning of the 40S ribosome subunit alone can result in ribosome footprints and the false detection of translation. Poly-Ribo-Seq takes advantage of the fact that polysomes, which consist of multiple ribosomes, bind to mRNAs collectively during active translation. By isolating polysomal fractions, actively translated regions can be differentiated from false positives derived from single ribosomes or ribosome subunits. Yet, this is not a perfect solution, as it leaves a blind spot for particularly short smORFs, which are not long enough to bind multiple ribosomes [
48].
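The 3-nucleotide periodicity described above can be quantified with a simple frame count. The sketch below assumes that each footprint has already been reduced to a single inferred P-site position in transcript coordinates, and that positions in frame with the ORF start codon fall into frame 0. The function names and the 0.6 threshold are hypothetical choices for illustration; this is not the scoring model used by RibORF.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start, orf_end):
    """Fraction of P-site positions in each reading frame of an ORF.

    Coordinates are 0-based and end-exclusive; frame 0 is the frame of
    the ORF start codon. Returns (f0, f1, f2), or zeros if the ORF has
    no footprints.
    """
    counts = Counter()
    for p in psite_positions:
        if orf_start <= p < orf_end:
            counts[(p - orf_start) % 3] += 1
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[frame] / total for frame in range(3))


def looks_periodic(fractions, threshold=0.6):
    """Crude periodicity call: most footprints lie in frame 0."""
    return fractions[0] >= threshold


if __name__ == "__main__":
    translated = frame_fractions([100, 103, 106, 101], 100, 130)
    background = frame_fractions([100, 101, 102], 100, 130)
    print(translated, looks_periodic(translated))
    print(background, looks_periodic(background))
```

Footprints from scanning subunits or background noise are expected to spread roughly evenly across the three frames, so a strong frame-0 bias supports active translation. For very short ORFs, however, the number of footprints may be too small for the fractions to be meaningful.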
MS-based proteomic techniques are often used to complement Ribo-Seq in the direct identification and quantification of ncORF-encoded proteins [
36,
49]. MS-based approaches can also aid in determining post-translational modifications, furthering our understanding of the protein’s functional mechanisms [
50].
Finally, the biological functions of ncORF-encoded proteins can be validated using CRISPR/Cas9-based approaches [
36,
51,
52]. Knockout (loss-of-function) or overexpression (gain-of-function) assays can be performed to elucidate the functions of specific ncORF-encoded proteins [
36,
51,
52]. The CRISPR/Cas9 system can also be used to observe the expression and localization of ncORF-encoded proteins by knocking in epitope tags at the DNA sequences of ncORFs, which can then be detected with the corresponding antibodies [
53]. Genome-wide CRISPR-based mutagenesis screens have successfully identified high-priority ncORFs that may encode functionally significant proteins, such as GREP1 and ASNSD1-uORF (ASDURF) [
51,
52,
54]. CRISPR screening can also be combined with RNA sequencing (RNA-Seq) to confirm the biological function of ncORFs. By perturbing the expression of candidate ncORFs using CRISPR/Cas9 and observing the resulting changes in RNA-Seq profiles within single cells, researchers can elucidate the molecular mechanisms of the ncORF-encoded peptides [
51].