1. Introduction
The papain family (peptidase C1A family in InterPro (IPR013128), subfamily C1A peptidases in the MEROPS database (Db)) is the largest and best characterised group of cysteine peptidases, named after the first archetype, the plant cysteine protease papain. Members of this family are widely distributed in Archaea, Bacteria, Eukaryota and some viruses [
1,
2]. Papain-like peptidases are involved in numerous physiological and pathological processes, parasitic infections, and host defence. In parasitic protozoa, C1A peptidases participate in diverse processes, such as host cell and tissue invasion, encystation/excystation, catabolism of host proteins, and both stimulation and evasion of host immune responses [
3]. In plants, C1A peptidases are involved in stress response, mobilisation of storage proteins during seed germination, induction of defence reactions, senescence, and regulation of cell death [
4,
5,
6]. They are central hubs in plant immunity and are required for plant resistance to various pathogens. At the same time, C1A peptidases are targeted by secreted pathogen effectors to suppress immune responses. Consequently, they are subject to a coevolutionary host-pathogen arms race [
5]. The most studied mammalian C1A peptidases are human lysosomal cysteine cathepsins, which play key roles in antigen processing [
7], ageing, neurodegeneration [
8], cancer [
9,
10], cardiovascular diseases [
11], signalling [
12,
13], cell death [
14], and inherited diseases [
15]. Their activity can be regulated by gene expression, post-translational modifications, zymogen activation, accessibility of the peptide bonds to be cleaved, compartmentalisation, metal binding, and endogenous or exogenous inhibitors [
16]. Dysregulation of C1A peptidase expression, localisation, or proteolytic activity can disrupt cellular homeostasis.
The first crystal structure determined for a member of the papain family was that of papain from papaya (
Carica papaya) [
17]. The crystal structures of diverse human lysosomal cysteine cathepsins were determined during the 1990s, beginning with cathepsin B [
18], followed by cathepsins K [
19], L [
20,
21], H [
22], X [
23], V [
24], C [
25], S [
26], and F [
27]. To date, more than 40 crystal structures of diverse papain family representatives have been determined [
1]. The papain fold is composed of two domains: the left L-domain, which contains three α-helices, and the right R-domain, which contains a twisted β-sheet and two helices. The interface between the two domains forms a deep active-site cleft that acts as a substrate-binding groove, in which Cys25 is positioned at the N-terminus of the central α-helix of the L-domain and His159 is positioned in the R-domain. Together, these two residues form the catalytic thiolate-imidazolium ion pair. The main substrate-binding sites of the enzyme are the S2, S1, and S1’ sites [
28]. All cathepsins are monomers of approximately 30 kDa, with the exception of tetrameric cathepsin C [
25,
29] and dimeric cathepsin X [
30]. Cathepsins differ in their specificity and tissue distribution [
31]. Most cathepsins exhibit endopeptidase activity, whereas cathepsins B, H, C, and X are the only known exopeptidases. Of these, only cathepsins C and X are strict exopeptidases. Lysosomal cathepsins are synthesised as inactive zymogens. Each zymogen carries a propeptide that unfolds at acidic pH, thereby opening the active site of the enzyme [
32]. Cathepsins are activated by autocatalytic processing [
33,
34,
35] or by other proteases such as cathepsin C [
36]. Equally important are the cystatins, the most investigated endogenous inhibitors of cathepsins. Cystatins are divided into three families: stefins, cystatins, and kininogens [
37,
38]. Kininogens contain two inhibitory cystatin-like domains and are divided into low molecular weight (LK) and high molecular weight (HK) kininogens. Both LK and HK bind two cathepsin molecules with high affinity, which is unique among cystatins [
39,
40]. More information on cystatins can be found in [
41,
42].
Evolutionary analyses of the papain family began in the early 1990s, in the pre-genomic era [
43,
44,
45] and were based on a small sample of organisms and on the limited diversity of papain family sequences available at that time. Since then, the number of known representatives of the papain family has increased significantly, largely due to the accumulation of eukaryotic and prokaryotic transcriptomic and genomic sequences. Here, we aimed to obtain comprehensive insight into the distribution, origin, and early diversification of the papain family in Eukaryota, Bacteria, and Archaea. Such an analysis could not have been performed previously because of the limited number of available eukaryotic and prokaryotic genomes. Compared to previous studies, we used significantly expanded taxon sampling in which, for the first time, all major eukaryotic and prokaryotic lineages were represented [
46,
47]. In particular, we included data from previously unsampled eukaryotic lineages to represent all eukaryotic supergroups [
46]. We traced the birth and expansion of the papain family through phylogenomic analysis, using publicly available data from numerous prokaryotic and eukaryotic proteomes, transcriptomes, and genomes. We found that the papain family expanded greatly during eukaryogenesis through massive gene innovation and diversification, which resulted in eight ancestral C1A lineages in the ancestor of eukaryotes. The papain family expanded further during eukaryotic evolution, especially through extensive gene duplications of the ancestral cathepsin L and B lineages. Taken together, our results demonstrate that the diversification of the papain family predates the origin of eukaryotes, and that a burst of innovation during eukaryogenesis led to a eukaryotic ancestor with a complex set of ancestral C1A lineages.