1. Introduction
Runs of homozygosity (ROHs) are continuous segments of the genome identical in both copies of a chromosome pair (alleles) ranging from tens of kilobases to megabases [
1]. Homozygosity refers to having two identical alleles of a gene inherited from each parent [
2], and due to common ancestry between the parents or due to identity-by-descent (IBD), it is called autozygosity [
3]. ROHs can arise from consanguineous marriages (estimated to affect ~10% of people worldwide) [
4], inbreeding [
5,
6], or founder effect [
7], increasing the risk of recessive diseases in the offspring. ROHs may also be “runs of hemizygosity”, when there is a deletion in one copy of a chromosome, leading to a loss of heterozygosity [
8].
ROH patterns reflect the level of kinship and autozygosity, both reduced by people’s mobility and globalization [
9]. Short ROH are characteristic of admixed populations resembling ancient parental relatedness, whereas longer ROH reflect higher consanguinity levels and recent parental relatedness [
10,
11,
12,
13]. ROH patterns reflect population and demographic history [
14,
15], including differences in consanguinity and number of ROHs between ethnic subgroups [
10,
16,
17,
18,
19,
20,
21]. Understanding these patterns in diverse populations is essential for assessing disease risk and identifying disease-causing genetic variants, particularly in admixture isolates [
11,
22,
23,
24].
As consanguinity increases, so does the number and size of ROHs, raising the risk of autosomal recessive (AR) diseases. ROH analysis increases the diagnostic rate of recessive diseases, especially in consanguineous families, for finding candidate genes [
25,
26,
27,
28,
29], disease-causing homozygous variants [
30], and corroborating the historical context of communities [
31]. Furthermore, it is crucial for identifying candidate genes for specific recessive diseases [
32,
33,
34,
35,
36], even in non-consanguineous families [
37]. Biodemographic and genetic studies provide insights into population structure and its link to diseases, by exploring the human genome's significance in population history and consanguinity practices [
14,
38].
Homozygosity mapping (ROH detection) aids gene discovery by assuming individuals with AR diseases likely have homozygous markers surrounding the disease
locus, searching for and identifying regions harboring the affected gene. If other relatives also have the disease, the strategy includes identifying ROHs exclusive to affected individuals within the family [
39]. It was first applied in 1987 by Lander and Botstein in consanguineous families affected by a recessive disease using restriction fragment length polymorphisms (RFLPs) [
40]. Homozygosity mapping evolved to utilize Single Nucleotide Polymorphism (SNP) array data, and with the advent of next-generation sequencing (NGS), software tools were designed to accommodate this sequencing data as input [
39].
NGS introduction enables simultaneous homozygosity mapping and variant detection, generating vast data volumes surpassing previous technologies in speed and cost-effectiveness. Since 454 sequencing by Roche, NGS has evolved through second-generation (short-read) and third/fourth-generation (long-read) technologies [
41]. Second-generation sequencing generates short DNA fragments (100-600 bp), with Illumina being widely used for genetic testing [
41,
42]. Third/fourth-generation sequencing achieves reads of over 10 kb, effectively detecting genome-wide repeats and structural variants, suitable for diagnostic and clinical applications [
41,
43]. The two main technologies are provided by Pacific Biosciences (PacBio) [
44], and Oxford Nanopore (ONT) [
45].
NGS applications include single-gene, targeted multigene panel, whole-exome sequencing (WES), whole-genome sequencing (WGS), and transcriptome (RNA sequencing), all effective for genetic testing [
46]. WES, which targets protein-coding exons, where ~85% of the known Mendelian disease variants occur, has become a mainstream approach due to its cost-effectiveness and simplified data management [
47,
48,
49]. WES can be done individually or in trio (enhanced variant identification) [
50]. Its limitations include sensitivity to GC-rich regions, reliance on Sanger sequencing to confirm low-quality variants, challenges with variants of uncertain significance (VUS), shared homology between genomic regions (segmental duplications/ pseudogenes and failure to genotype highly repetitive regions completely and especially when large repeats expansions [
2,
51,
52].
Re-analyzing genomic data enhances diagnostic rates by uncovering novel gene-disease associations, improving bioinformatics techniques for CNV detection and variant calling, incorporating consanguinity assessment (ROH filter) to narrow down the list of candidate variants, and integrating the Human Phenotype Ontology (HPO) [
51] terms. HPO terms describe human phenotypic information in a standardized way (used for supporting clinical diagnostics and genetic research). Estimating consanguinity through ROH analysis allows for an unbiased determination of parental or ancestral consanguinity, overcoming the limitations of self-reports or inferences based on family context [
52].
Homozygosity mapping tools, both adapted and new, emerged [
39], enhancing diagnostic rates by combining WES data and ROH analysis [
53,
54]. The software can be based on sliding-window or hidden Markov model (HMM) algorithms [
39]. Sliding-window algorithms, originally designed for SNP array data analysis, move a fixed-size window along the chromosome to find stretches of consecutive homozygous SNPs [
39]. PLINK [
39] is widely used on its own [
16,
55,
56,
57,
58,
59,
60,
61,
62,
63,
64,
65], as a complementary analysis [
66], or integrated into other algorithms [
67]. Other software followed, such as Obelisc [
68], GERMLINE [
39], EX-HOM (EXome-HOMozygosity) [
69], and HomozygosityMapper (HM) [
70]. PLINK, GERMLINE and HomozygosityMapper (HM) were subsequently adapted for WES data [
39,
71]. Other software created includes HOMWES [
72], GARLIC [
73], HomSI [
39,
74], and Automap [
75].
Hidden Markov models (HMM) represent observed data as outputs generated by hidden states, modelled as a Markov chain [
76]. In ROH detection, HMMs estimate the likelihood of a genotype (observation) being homozygous or heterozygous (hidden states) [
77]. The software tools available are H3M2 [
77,
78]; IBDSeq and GIBDLD [
78]; BEAGLE [
79]; ROHMM and BCFtools/RoH [
77,
80]; and Python packages FILTUS and hapROH [
81,
82,
83].
The accuracy of these tools can be influenced by many factors, such as the choice of algorithm used, sample sequencing depth and coverage, SNPs density and sequence quality, the need for phased data, loss of short and medium-sized ROHs, and false positives [
39,
82]. These factors should be considered when selecting the appropriate software for a project [
39,
83].
This work presents new bioinformatics approaches to address the creation of personalized multigene panels based on WES data using ROH and/or HPO terms, integrated into a Django Web application. Its impact on diagnostics is illustrated by the genetic characterization of two siblings affected by a recessive disease. Analysis of ROH at a genomic scale in a representative sample of 3,941 patients advances ROH analysis using WES data, highlighting its diagnostic potential and significance in population genetics.
2. Materials and Methods
The dataset used in this work consisted of WES samples from patients who performed genetics tests at the Center for Predictive and Preventive Genetics (CGPP), Portugal.
2.1. Creation of Personalized Multigene Panels Based on ROH
Multigene panels based on the patient’s ROH focus on the analysis of regions of the genome more likely to contain recessive disease-causing variants. By targeting genes within these ROH, the panels are more likely to identify relevant genetic variants, particularly in consanguinity context or shared ancestry.
The samples used for the creation of these panels were analyzed using two Homozygosity Mapping algorithms, HomozygosityMapper (HM), which uses a sliding-window algorithm, and ROHMMCLI, which uses a hidden Markov model (HMM) algorithm. Each patient has a pseudo-anonymized ID without any personal information.
Both algorithms output data in different formats: HM outputs a raw data text file with chromosome, position and score, while ROHMMCLI outputs a BED file. To generate the Uniform Resource Locator (URL), a connection to the HM database is initiated and the project number (project_no) is retrieved using the patient ID. With the URL generated, the data is collected and saved in a BED File. Then, the HM and ROHMMCLI BED files are merged using a shell script. This script takes the patient ID and the current date as inputs and is divided into four Linux commands, as follows:
Clean up the ROHMMCLI BED file to contain only chromosome, start and end positions.
Merge the HM and the cleaned ROHMMCLI BED files, using bedtools merge with option -d of 1000000 bp, the maximum distance between ROHs to be merged.
Use bedtools intersect to find overlaps between the merged BED file and the coding sequence coordinates BED file, producing another BED file with the list of gene coordinates found within ROHs.
Create a text file with a list of gene Entrez IDs present in the identified ROHs.
The process of obtaining the gene list is outlined in
Figure 1.
The final step to generate multigene panels consists of comparing the genes’ list with the coverage for the representative transcript of each gene, described in
Figure 3. Genes are divided into three lists based on the percentage of horizontal coverage at 20x: white (≥0.9), grey (0.1-0.9), and black (≤0.1). Only the white and grey genes are included in the multigene panel. Another list, containing the genes that were not assorted to any of the previous lists, is generated.
Copy Number Variations (CNVs), more specifically heterozygous deletions, can mimic ROH, given that the single nucleotide variants (SNVs) encompassed by the deletion cannot be heterozygous (and are in fact hemizygous). To incorporate such a possible impact in the multigene panel creation, the following steps were implemented:
Find the CNV results for the sample in analysis and filter by CNVs with a span above 500,000 bp and that are ‘Heterozygous Deletion’, resulting in a BED file with CNV’s genomic coordinates.
Filter by non-empty files, meaning files that contain CNVs.
The shell script uses bedtools jaccard tool to calculate the Jaccard index for each CNV that intersects an ROH, using the merged ROH results and CNV BED files.
The Jaccard index (equation 1) is a single statistic that reflects the similarity of the two BED files based on the intersections between them, where a value of 0.0 indicates no overlap and 1.0 represents complete overlap.
Where the intersection is the difference between the end of the ROH and the start of the CNV, the ROH length is the difference between the end and start coordinates of the ROH and the CNV length is the difference between the end and start coordinates of the CNV.
2.2. Creation of Personalized Multigene Panels Based on HPO Terms
Multigene panels based on HPO terms ensure that gene selection is targeted for the patient’s specific phenotype. These panels are designed to ensure that only the genes possibly associated with the phenotype are analyzed, and therefore the most likely to harbor disease-causing variants responsible for the patient’s phenotype, increasing the accuracy and relevance of genetic testing.
A Python script that takes as input an HPO term was created to establish the connection to the HPO API and retrieve the list of genes’ Entrez IDs associated with the HPO identifier (
https://hpo.jax.org/api/hpo/term/{hpoId}/genes, version 1.7.13, accessed May 2023). By parsing the JSON file retrieved by the HPO API, we were able to get the gene entrez ID and gene symbol. Then the white, grey and black lists for the gene panels were created, as previously described in
Figure 3.
2.3. Creation of Personalized Multigene Panels Based on ROH and HPO Terms
The integration of ROH and HPO term analysis may offer an even more personalized approach. This method focuses on the individual's ROH and examines whether the genes associated with the patient's specific HPO terms are located within these regions.
A new script was generated based on the previously described script for creating multigene panels. To obtain a BED file with the coordinates of the genes from the HPO terms’ gene list, the command-line tool tsv2bed.py was used. The merged BED file results containing the ROHs and the HPO terms genes’ BED file are merged to get a final list of the genes from the HPO terms within the ROHs. The corresponding flowchart is presented in
Figure 4.
From the list of genes obtained, the process of generating the white, grey and black lists for the gene panels is the same as previously described in
Figure 3.
2.4. Django Web Application Development
The process of personalized multigene panel creation based on ROH, HPO and both, simultaneously, was made using Python 3 and shell scripting. To deploy a more user-friendly interface, a Django web application was developed. The design/style of all HTML pages for this application (app) was made using Cascading Style Sheets (CSS) for visual consistency.
The home page of this app contains a title (“Personalized multigene panels”), a small description of the app, and three buttons: “HPO term-based panels”, “ROH based panels” and “HPO term and ROH-based panels”. Each button is linked to a different HTML file, with different HTML forms to submit the input data.
For the “HPO term-based panels” button, the form handles multiple HPO terms at the same time. For the “ROH-based panels” button, the form contains two different types of input, a text area to fit the multiple input lines, and a drop-down list of the possible HM threshold options.
For the “HPO term and ROH-based panels” button, the form is a combination of the ones from the previously described HTML files. The only difference is that the text area only allows the analysis of one sample at a time with multiple HPO terms.
2.5. Establishing the First Portuguese ROH Characterization on a Genomic Scale
To establish the first Portuguese ROH characterization on a genomic scale, the dataset initially consisted of over 12,000 WES samples. Since there were municipalities that were over-represented, normalization and down-sampling processes were automated. The final number of samples was 3,941 WES samples (detailed process in Supplementary File S1).
The process of establishing the first Portuguese ROH characterization started with the assessment of the ROH levels through the genome-wide autozygosity measure from ROH (
), calculated using equation 2.
Here, is the total length of all of an individual’s ROHs above a specified minimum length and is the length of the autosomal genome covered by WES, after removing the centromeres (which are excluded to prevent overestimating autozygosity).
For each sample, three values were calculated, using three ROH minimum size thresholds to calculate : 0.5, 1.5 and 5 Mb.
The calculation of
involved using the Integrative Genomics Viewer (IGV) to determine the genomic coordinates of the first and last genes on the p and q arms of each chromosome. These coordinates were used to calculate the size of each chromosomal arm using Equation 3:
The sum of both chromosomal arms’ sizes corresponds to the size of the autosome without the centromeres. value is the total size of all autosomes, calculated as 2638.813981 Mb.
The patients’ address information was obtained from the internal database, consisting of patients’ ID, postcode, municipality and district names. All patients’ VCF files were previously processed by HM and ROHMMCLI and both results were merged. Homozygosity mapping data was organized into standardized CSV (*.csv) files containing the detailed patient ROH profile (chromosome, ROH’s start and end position, ROH length (bp), ROH length (Mb), and ROH length/chromosome length), and the general patient’s profile (Number of ROH, Number of ROH > 1 Mb and Sum ROH (Mb)).
To calculate the FROH, the information needed was combined into separate CSV (*.csv) files:
One containing ROH > 0.5 Mb;
One containing ROH > 1.5 Mb;
One containing ROH > 5 Mb.
Then, the was calculated per patient, grouping the ROHs, from each CSV (*.csv) file and summing its total per patient.
The value of FROH was calculated for each patient using the corresponding value, for each minimum ROH size, resulting in an CSV (*.csv) file containing the patients’ ID and FROH. Portugal comprises 18 districts and 2 Autonomous Regions (Açores and Madeira), divided into a total of 308 municipalities. With this information, and the address information, patients were grouped by municipality, and the mean FROH was calculated (per municipality) and used to create Maps of the Portugal Mailand and Autonomous Regions (Açores and Madeira) per municipality.
There were two types of maps, the classical and the interactive ones, created for each FROH mean calculated using minimum ROH’s size of 0.5, 1.5 and 5 Mb.
To create the maps several data were necessary. The geographical data, at the municipality level (shapefile) was obtained from dados.gov (
https://dados.gov.pt/en/datasets/concelhos-de-portugal/, accessed June 6, 2023), the Portuguese Public Administration's open data portal. Then we established the connection to the internal SQL database to get the municipality and respective districts’ association, the CSV (*.csv) files per municipality and a CSV (*.csv) file containing the number of people per municipality and respective ratio, so that only the municipalities with representativity were used for the maps. For the creation of the maps, since we were dealing with geospatial data, we used the GeoPandas package. The maps created were stored as PNG (*.png) files.
The interactive maps were created using the explore method on a Geodata Frame and were saved as HTML files.
2.6. Consanguinity Classification Approach
The set of samples used to build the clustering model was meticulously chosen from the internal database based on the patients’ information concerning consanguinity. A total of 9,160 WES samples were collected for the analysis. This included 9,020 (98,5%) individuals with “unknown” consanguinity, 84 (0.92%) known non-consanguineous samples, and 56 (0.61%) known consanguineous samples. Of the 56 consanguineous samples, 34 (60.7%) were stringent (i.e. parents were first-degree cousins). Each sample was submitted to HM to find homozygous blocks of the exome and the raw data was provided in a text file containing the chromosome, position and their corresponding homozygosity scores from HM.
2.6.1. Feature Extraction
For each sample we generated features pertaining to the descriptive statistics of the ROH embedded within each chromosome. For the purposes of this analysis, we considered ROH to be at least 2 consecutive chromosome positions with homozygosity scores greater than or equal to 64 (i.e., 80% of the highest observed score, 80, as defined by [
71]). Specifically, the following features were generated for each sample with respect to each chromosome
x:
Count_x: the number of ROH in chromosome x.
Sum_x: the sum of ROH sizes in chromosome x.
Min_x: the minimum ROH size in chromosome x.
Max_x: the maximum of ROH size in chromosome x.
Mean_x: the mean of ROH in chromosome x.
STD_x: the standard deviation of ROH size in chromosome x.
To make these features more concrete, we provide an illustrative example. Suppose an individual possesses three ROHs within chromosome 1 (x = 1), with sizes 3, 7, and 4 Mb. The following features are extracted from chromosome 1: Count_1 = 3; Sum_1 = 14; Min_1 = 3; Max_1 = 7; Mean_1 = 4.67; and STD_1 = 1.6997. This feature extraction process is then repeated for the individual’s remaining chromosomes.
Following feature extraction, to test the predictive quality of various feature sets in our experiments, we created three separate representations of the data, dictated by the following sets of features:
Tier 0: includes "Count_x" and "Sum_x" features only;
Tier 1: includes "Count_x," "Sum_x," "Min_x," and "Max_x" features only;
Tier 2: includes "Count_x," "Sum_x," "Min_x," "Max_x," "Mean_x," and "STD_x" features.
2.6.2. Outlier Detection
We formulate the task of consanguinity classification as an outlier detection problem. For our experiments, we randomly selected 50% of all labeled data points (70 total data points) to be reserved for testing. Using the remaining 50% of labeled data points for validation and 100% of unlabeled data points for training, we proceeded to establish the semi-supervised outlier detection pipeline. First, we projected the data into a low-dimensional (2-D) space using classical multidimensional scaling (MDS) [
84], which is a manifold learning approach that aims to preserve pairwise Euclidean distances between points from the high-dimensional representation in the low-dimensional data representation. Following dimensionality reduction, we then fit an elliptic envelope [
85] to the data with “unknown” consanguinity labels, validating the optimal contamination hyperparameter (i.e., the proportion of the data estimated to be outliers) using the remaining 50% of the labeled samples. Given the imbalanced class distribution, the F1-score was used to both optimize the contamination hyperparameter and evaluate the model on the reserved test set.
4. Discussion
Clinical diagnostics is evolving towards more personalized approaches, demanding the development of new genetic tests and the adaptation of existing ones. There are platforms created to support virtual gene panel curation, such as Genomics England PanelApp [
90]. It is a database for storing virtual gene panel information while gathering community feedback, helping to build a consensus on the evidence needed to establish a gene-disease association.
With this work, we automated the process of creating personalized multigene panels based on different scenarios. Generically, all panels described in this work are less time-consuming to create, releasing professionals to other tasks to increase the number of tests carried out, and help narrow down the number of genes being analyzed. Since all the described tests are based on WES data, physicians can request a reanalysis of patient data without the need for additional sample sequencing, thereby conserving resources. In terms of storage, one sequencing per patient and a more personalized approach means less data being analyzed and consequently less allocated space.
Multigene panels based only on the specific ROH identified in a patient narrow the analysis to genes located within the identified ROH, thereby increasing the likelihood of detecting recessive disease-causing variants in those genes. Since the CNV assessment was also included in these panels, the diagnostic technician analyzing the case knows where the CNVs overlap with ROH, eliminating this confounding factor from ROH analysis.
The multigene panels based only on HPO terms take into consideration the possible phenotype or phenotypes that the patient presents. Resorting to a publicly available database, HPO, the panel is built only with the genes that are within the specified terms.
The use of HPO terms and ROH overlap results, simultaneously, is an even more personalized approach, by narrowing-down the list of genes to create personalized multigene panels. In this case, only the genes associated with the HPO term(s) in analysis are taken into account and their presence is checked in the patient’s ROHs. This allows a higher level of personalization, proven to be useful by the clinical case of the two sisters previously presented.
After having the gene list from the two sisters’ case, the common genes between the two were analyzed and visualized on IGV. The results were interpreted using Online Mendelian Inheritance in Man (OMIM) database to find the phenotype associated with the genes. According to OMIM, the CSTB gene is associated with the phenotype “Epilepsy, progressive myoclonic 1A (Unverricht and Lundborg)” (gene MIM number 601145 and phenotype MIM number 254800) and the mode of inheritance is AR. This result is in accordance with the analysis done, since the ROHs are used to target diseases with AR modes of inheritance. The gene SIK1, on the other hand, is associated with the phenotype “Developmental and epileptic encephalopathy 30”, with an autosomal dominant (AD) mode of inheritance. The gene SLC32A1 does not currently have an associated phenotype, but the gene encodes an amino acid transporter that loads the Gamma-aminobutyric acid (GABA) and glycine to synaptic vesicles. Even though there were no variants present in these genes, the diagnosis of Epilepsy was only possible through this multigene panel approach, through the visualization of the biallelic expansion on the CSTB gene.
Knowing our cohort background in terms of ROH is also important. In our study we showed the ROH distribution in Portugal (
Figure 10), where most ROHs (9,358 ROHs) are within the size range of 0.5 to 1.0 Mb, followed by a decreasing tendency as the length of the ROHs increases until 4.0 Mb. Then, there is a small peak of 708 ROHs in the interval from 4.0 to 5.0 Mb, followed by a decreasing number of ROHs until the interval from 10.0 and 15.0 Mb, and another decreasing tendency in the next intervals. The minimum value is 0.5 Mb, the maximum value is 72.42 Mb and the mean value is 2.29 Mb. This is typical of a more ancient parental relatedness of the overall population.
Using the same sample with the 3,941 exomes, and by calculating the FROH per individual using the thresholds 0.5, 1.5 and 5 Mb, it was possible to build three maps using the mean value for each municipality. Consequently, we found that the top five districts exhibiting higher FROH were Portalegre, Viseu, Bragança, Madeira, and Vila Real. Furthermore, another notable finding from this study was the striking similarity between the patterns of admixed and consanguinity demographics observed in the Portuguese population when examining the Number of ROHs versus Sum of ROHs, available at Supplementary File S4.
The overall mean value of FROH for the thresholds 0.5, 1.5 and 5 Mb decreased as the minimum threshold increased. There are less samples with ROHs above a certain minimum length, with a smaller number of ROHs but with bigger sizes, per individual. The municipality of Alter do Chão from Portalegre district is the municipality with the highest value of mean FROH for all the presented minimum ROH size thresholds (0.5, 1.5 and 5 Mb), which might indicate more consanguinity.
The disparities observed in the data can derive from various factors, one significant contributor being the sample sizes utilized. In our study, we analyzed a sample of 3,941 individuals, with 3,800 exhibiting a FROH distinct from zero. In contrast, the comparative study only included 49 individuals. Notably, we compared populations from diverse geographic regions: an insular population from the Orkney Isles in northern Scotland[
87] with a comprehensive Portuguese population encompassing individuals from Portugal Mainland as well as the Autonomous Regions of Madeira and the Açores. Another differing factor was the way Lauto was calculated, since the reference study referred using the length of the autosomal genome covered by SNPs in an array, excluding the centromeres, and in our study, we used the length of the autosomal genome covered by WES, excluding the centromeres.
According to the results from the study shown in
Figure 14 [
88], the Autonomous Region of Madeira exhibits the highest number of consanguineous marriages, closely followed by the Autonomous Region of the Açores. This observation can be attributed to the isolation of island populations, due to limited population mobility during the 1980s. However, improved transportation infrastructure has since increased population mobility to and from the islands. In Portugal Mainland, the district with the highest incidence of consanguineous marriages is Bragança [
88]. Furthermore, the top five districts (as shown in
Table 6) with the highest number of consanguineous marriages, listed in descending order, are Madeira, the Açores, Bragança, Viseu, and Vila Real.
Our findings, revealed that the top five districts with the highest FROH mean remain consistent, considering thresholds of 0.5, 1.5, and 5 Mb. The ranking from highest to lowest FROH mean value for 0.5 and 1.5 Mb thresholds is the following: Portalegre, Viseu, Bragança, Madeira, and Vila Real. Meanwhile, the ranking for the 5 Mb threshold is Portalegre, Bragança, Viseu, Madeira, and Vila Real.
Portalegre stands out with the highest FROH mean values across all three thresholds in our data, despite having fewer consanguineous marriages compared to other districts. This might be due to our sample including fewer individuals from Portalegre, which may suggest that those from this region who were referred for genetic testing were more likely to have been screened due to consanguinity. Surprisingly, the results show low FROH values across all three thresholds for the Autonomous Region of the Açores, which are not in accordance with the reference data on consanguineous marriages. This is possibly due to insufficient sample localization. Porto, the district with the lowest number of consanguineous marriages, also shows the lowest FROH mean across all thresholds. Additionally, we were unable to acquire information regarding the country of origin or birth of individuals included in the sample, this parameter was not used as an exclusion criterion. Moving forward, we should take this into consideration because certain countries present higher levels of consanguinity due to religious and cultural practices. We must also acknowledge the potential presence of samples from immigrants residing in our country, which could introduce biases into the data.
The presence of an admixed pattern denotes our country’s history, whilst the consanguineous pattern is a result of the marriages between cousins, leading to an increase in the sum of ROHs.
Having a model to predict patients consanguinity based on ROH features is useful in clinical centers. They can be used for the genetic test decision process and for assessing the risk of recessive diseases, knowing that the presence of consanguinity increases the risk of having recessive genetic diseases. Tier 0 (count and sum of ROH) of the model presented provided the majority of the predictive power for consanguinity classification.
Transitioning from WES to WGS might open some doors in terms of genetic testing, by adding insights into the non-coding regions of the genome. This will be of great interest particularly for undiagnosed patients and accelerate the diagnosis.
With this work, we demonstrated the applicability and utility of the newly developed resources and their impact on diagnostics, by solving the genetic etiology of a rare recessive disease. The representative sample of 3,941 WES used in this work, allowed us to provide the extensive analysis of ROH on a genomic scale for the first time ever in the Portuguese population. In summary, this research advances ROH analysis using WES data, highlighting its diagnostic potential and significance in population genetics’ characterization.
Figure 1.
Flowchart representing the automation of the creation of multigene panels based on ROH (DB - Database, DF - Dataframe).
Figure 1.
Flowchart representing the automation of the creation of multigene panels based on ROH (DB - Database, DF - Dataframe).
Figure 2.
Flowchart to obtain the reference BED file.
Figure 2.
Flowchart to obtain the reference BED file.
Figure 3.
The flowchart of the multigene panels lists: white, grey and black.
Figure 3.
The flowchart of the multigene panels lists: white, grey and black.
Figure 4.
Flowchart of the ROH and HPO multigene panels automation.
Figure 4.
Flowchart of the ROH and HPO multigene panels automation.
Figure 5.
Overview of the results regarding processes of generating the multigene panel application in a case study, the first Portuguese ROH characterization, and the clustering model.
Figure 5.
Overview of the results regarding processes of generating the multigene panel application in a case study, the first Portuguese ROH characterization, and the clustering model.
Figure 6.
Pedigree depicting two affected sisters, daughters of a consanguineous couple.
Figure 6.
Pedigree depicting two affected sisters, daughters of a consanguineous couple.
Figure 7.
Example of an input for the personalized multigene panels based on HPO term and ROH.
Figure 7.
Example of an input for the personalized multigene panels based on HPO term and ROH.
Figure 8.
IGV visualization of the reads mapped to the CSTB gene in both sisters (II:1 and II:2).
Figure 8.
IGV visualization of the reads mapped to the CSTB gene in both sisters (II:1 and II:2).
Figure 9.
BAM visualization depicting the region of the dodecamer repeat expansion in a control I, and in both sisters (II:1 and II:2). No reads are aligned in this region in both patients, suggesting that a possible expansion is biallelic (present in both CSTB alleles).
Figure 9.
BAM visualization depicting the region of the dodecamer repeat expansion in a control I, and in both sisters (II:1 and II:2). No reads are aligned in this region in both patients, suggesting that a possible expansion is biallelic (present in both CSTB alleles).
Figure 10.
Histogram depicting the distribution of ROH length above 0.5 Mb in a Portuguese cohort of 3,941 samples.
Figure 10.
Histogram depicting the distribution of ROH length above 0.5 Mb in a Portuguese cohort of 3,941 samples.
Figure 11.
Geographical distribution per municipality of FROH > 0.5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 11.
Geographical distribution per municipality of FROH > 0.5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 12.
Geographical distribution per municipality of FROH > 1.5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 12.
Geographical distribution per municipality of FROH > 1.5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 13.
Geographical distribution per municipality of FROH > 5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 13.
Geographical distribution per municipality of FROH > 5 Mb in Portugal Mainland, Autonomous Region of Açores and Autonomous Region of Madeira.
Figure 14.
Map of Portugal representing the consanguineous marriages between 1980 and 1986 (/100000) adapted from [
88] (upper left) and the Portugal Mainland maps for the FROH calculated for ROHs of size above 0.5 Mb (upper right), 1.5 Mb (lower left) and 5 Mb (lower right).
Figure 14.
Map of Portugal representing the consanguineous marriages between 1980 and 1986 (/100000) adapted from [
88] (upper left) and the Portugal Mainland maps for the FROH calculated for ROHs of size above 0.5 Mb (upper right), 1.5 Mb (lower left) and 5 Mb (lower right).
Figure 15.
Low-dimensional MDS representations of each “tier” dataset, where Tier 0: training and validation results (A) and testing results (B), Tier 1: training and validation results (C) and testing results (D); Tier 2: training and validation results (E) and testing results (F). Data points are colored according to their consanguinity labels: White “unknown” points do not possess a ground truth label; green “NCON” points represent non-consanguineous samples; red “CON” points represent consanguineous samples; and purple “CON_ST” represent stringent consanguineous points. The red dashed circles represent the elliptic envelope’s outlier decision boundary (i.e., points falling outside of the envelope are predicted to be consanguineous, either stringent or non-stringent).
Figure 15.
Low-dimensional MDS representations of each “tier” dataset, where Tier 0: training and validation results (A) and testing results (B), Tier 1: training and validation results (C) and testing results (D); Tier 2: training and validation results (E) and testing results (F). Data points are colored according to their consanguinity labels: White “unknown” points do not possess a ground truth label; green “NCON” points represent non-consanguineous samples; red “CON” points represent consanguineous samples; and purple “CON_ST” represent stringent consanguineous points. The red dashed circles represent the elliptic envelope’s outlier decision boundary (i.e., points falling outside of the envelope are predicted to be consanguineous, either stringent or non-stringent).
Table 1.
Resulting list of genes for each sister (II:1 and II:2).
Table 1.
Resulting list of genes for each sister (II:1 and II:2).
II:1 |
II:2 |
DHDDS |
SIK1 |
HMGCL |
CSTB |
MERC |
SLC32A1 |
SDHA |
|
SIK1 |
|
CSTB |
|
PIGV |
|
SLC25A19 |
|
SLC32A1 |
|
TERT |
|
TSEN54 |
|
Table 2.
Number of people with FROH > 0.5 Mb within each interval.
Table 2.
Number of people with FROH > 0.5 Mb within each interval.
FROH > 0.5 Mb Intervals |
Number of samples |
]0.000, 0.004] |
3,104 |
]0.004, 0.008] |
210 |
]0.008, 0.010] |
156 |
]0.010, 0.018] |
107 |
]0.018, 0.034] |
124 |
]0.034, 0.088] |
99 |
Table 3.
Number of people with FROH > 1.5 Mb within each interval.
Table 3.
Number of people with FROH > 1.5 Mb within each interval.
FROH > 1.5 Mb Intervals |
Number of samples |
]0.000, 0.003] |
1,430 |
]0.003, 0.005] |
196 |
]0.005, 0.009] |
162 |
]0.009, 0.017] |
107 |
]0.017, 0.033] |
126 |
]0.033, 0.085] |
89 |
Table 4.
Number of people with FROH > 5 Mb within each interval.
Table 4.
Number of people with FROH > 5 Mb within each interval.
FROH > 5 Mb Intervals |
Number of samples |
]0.000, 0.002] |
38 |
]0.002, 0.004] |
318 |
]0.004, 0.008] |
146 |
]0.008, 0.016] |
113 |
]0.016, 0.032] |
91 |
]0.032, 0.074] |
52 |
Table 5.
FROH mean, FROH mean of means per municipality and comparative values for the different ROH size thresholds (0.5, 1.5 and 5 Mb).
Table 5.
FROH mean, FROH mean of means per municipality and comparative values for the different ROH size thresholds (0.5, 1.5 and 5 Mb).
|
Mean FROH
|
Mean FROH of means per municipality |
FROH comparative values [87] |
FROH > 0.5 Mb |
0.0042 |
0.0057 |
0.0315 |
FROH > 1.5 Mb |
0.0033 |
0.0049 |
0.0021 |
FROH > 5 Mb |
0.0020 |
0.0039 |
0.0001 |
Table 6.
FROH calculated for the 0.5, 1.5 and 5 Mb thresholds per district, and data from a study estimating the number of consanguineous marriages per district [
88] .
Table 6.
FROH calculated for the 0.5, 1.5 and 5 Mb thresholds per district, and data from a study estimating the number of consanguineous marriages per district [
88] .
District |
FROH > 0.5 Mb |
FROH > 1.5 Mb |
FROH > 5.0 Mb |
Number of consanguineous marriages (10 000) [88] |
Açores |
0.0046 |
0.0035 |
0.0023 |
78.7 |
Aveiro |
0.0052 |
0.0041 |
0.0024 |
22.1 |
Beja |
0.0048 |
0.0039 |
0.0022 |
22.8 |
Braga |
0.0034 |
0.0025 |
0.0013 |
19.2 |
Bragança |
0.0102 |
0.0090 |
0.0060 |
52.7 |
Castelo Branco |
0.0066 |
0.0054 |
0.0034 |
19.9 |
Coimbra |
0.0063 |
0.0053 |
0.0036 |
38.2 |
Évora |
0.0039 |
0.0028 |
0.0016 |
34.5 |
Faro |
0.0029 |
0.0019 |
0.0010 |
27.2 |
Guarda |
0.0048 |
0.0038 |
0.0024 |
35.3 |
Leiria |
0.0058 |
0.0047 |
0.0030 |
35.1 |
Lisboa |
0.0039 |
0.0030 |
0.0017 |
20.2 |
Madeira |
0.0077 |
0.0068 |
0.0041 |
133.6 |
Portalegre |
0.0106 |
0.0092 |
0.0070 |
24.8 |
Porto |
0.0026 |
0.0018 |
0.0010 |
14.4 |
Santarém |
0.0056 |
0.0045 |
0.0032 |
27.6 |
Setúbal |
0.0038 |
0.0029 |
0.0019 |
30.1 |
Viana do Castelo |
0.0030 |
0.0021 |
0.0010 |
17.8 |
Vila Real |
0.0070 |
0.0060 |
0.0037 |
38.3 |
Viseu |
0.0105 |
0.0091 |
0.0059 |
38.7 |
Table 7.
Outlier detection F1-score results, separated by feature set tier.
Table 7.
Outlier detection F1-score results, separated by feature set tier.
Dataset Tier (Feature Set) |
Best Contamination Hyperparameter |
Validation F1-Score |
Test F1-Score |
Tier 0 (Count_x, Sum_x) |
0.0786 |
0.9310 |
0.9412 |
Tier 1 (Count_x, Sum_x, Min_x, Max_x) |
0.1190 |
0.9655 |
0.9434 |
Tier 2 (Count_x, Sum_x, Min_x, Max_x, Mean_x, STD_x) |
0.1061 |
0.9474 |
0.9615 |