1. Introduction
Since coronavirus disease (COVID-19) cases were first reported in Wuhan, Hubei Province, China, in December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) rapidly spread worldwide, and the World Health Organization (WHO) officially declared COVID-19 a pandemic with the highest alert level on March 11, 2020 [
1]. The first confirmed COVID-19 case in South Korea was identified as a Chinese entry from Wuhan on January 20, 2020. Starting with the first reported case of a Korean patient with COVID-19 patient who returned from Thailand and was diagnosed on February 3, 2020, there have been many cases of COVID-19 infection through imported cases and local outbreaks in Gwangju, South Korea [
2,
3].
+All viruses, including SARS-CoV-2 (the causative agent of COVID-19), have constantly evolved. Mutations resulting from viral genetic changes have little effect on viral characteristics. However, some mutations can affect transmission, pathogenicity, and vaccine and therapeutic efficacy [
4]. The WHO monitors the occurrence of significant changes in COVID-19 variants through amino acid substitution and classifies them into variants of concern (VOCs) and variants of interest, updates them regularly, and recommends public health measures [
5].
Viral mutations can be identified through genetic analysis. Whole-genome sequencing has been widely used recently, and genetic information associated with COVID-19 variants has been shared worldwide through the Global Initiative on Sharing All Influenza Data (GISAID) and Phylogenetic Assignment of Named Global Outbreak Lineages (PANGOLIN) [
6]. Accordingly, we established a whole-genome analysis in May 2021 to trace the source of infection and confirm genetic mutations in COVID-19 cases.
As the lineage transitions of SARS-CoV-2 increase, several institutions have constructed naming systems to establish standardized nomenclature. GISAID, NextStrain, and PANGOLIN are the most common naming conventions used by scientific communities around the world [
7,
8,
9].
The GISAID system is named based on a large-scale clade defined by the marker variant of the reference genome WIV04 (Genbank: MN996528.1), which is a simple system, but it is inconsistent and classified only by a few mutations. The SARS-CoV-2 sequence in GISAID can be easily analyzed using NextStrain, a system that uses phylogenetic analysis to identify evolutionarily stable lineages and sublineages [
10]. Once identified, they are named based on their year of appearance and consecutive characters. Specific sublineages are identified as additional information; however, there are no systematic rules. To overcome this problem, the dynamic PANGOLIN system based on evolutionary relations also considers the mechanical relevance of lineages. According to this system, each lineage name consists of an alphabetical prefix and a numeric suffix separated by a period or point. The PANGOLIN system provides detailed and informative outbreak cluster information [
6].
In this study, we identified COVID-19 variants, including VOCs, and performed mutational and phylogenetic analyses of the dominant strain using whole-genome sequencing of COVID-19 confirmed cases in Gwangju, Korea, from February 2020 to December 2022. SARS-CoV-2 can be continuously monitored to prevent the spread of infectious diseases by tracking new variants that affect pathogenicity, transmissibility, and vaccine efficacy. In addition, a survey of the evolutionary history of mutant viruses will help prepare countermeasures against newly emerging variants of concern [
11].
3. Results
3.1. SARS-CoV-2 cases during pandemic waves in Gwangju, South Korea
By the end of December 2022, a total of 850,707 cases have been reported in Gwangju since the first COVID-19 infection in February 2020. When classified by epidemic period, there were 36 cases in the first wave, including the first imported case of COVID-19 and occurrence in metropolitan and Seoul, Daegu, and Gyeongbuk (February 2020 to May 2020). In the second wave, 725 cases were confirmed in a rapid spread throughout the Seoul Metropolitan Area, including outbreaks in assemblies and religious organizations (June 2020 to November 2020). During the third wave, when a new COVID-19 variant appeared and spread nationwide (December 2020 to June 2021), 2285 patients were diagnosed with COVID-19. In addition, 17,123 cases occurred in the fourth wave, spreading the Delta variant (July 2021 to January 2022). In the fifth wave, the number of confirmed cases increased explosively nationwide to 518,055 patients due to the Omicron variant, and 312,483 cases were confirmed during the period of continued emergence of new Omicron variants (July 2022 to December 2022) [
12]. We randomly selected 2399 COVID-19 cases by classifying patients with or without a confirmed infection route. Then, whole-genome sequencing was performed to assign lineages. The sequences identified 147 lineages based on the PANGOLIN criteria (v.4.3) (
Table 1 and
Table 2). According to the NextClade classification [
13], 16 different clades were detected in Gwangju: 19A, 20A, 20H, 20I, 21A, 21I, 21J, 21K, 21L, 22A, 22B, 22C, 22D, 22E, 22F, and the recombinant strain.
Figure 1 shows the phylogenetic tree constructed using the NextClade online tool (accessed on 30th June 2023).
3.2. Dominant SARS-CoV-2 lineages during waves
Sequences from Gwangju were included along with the global sequences to determine where they lie in the tree.
Figure 2 shows a phylogenetic tree based on the PANGOLIN classification (
Figure 2a) and indicates the epidemic period classified by sample collection date and variants relevant to that period (
Figure 2b).
During the first wave, two cases were identified as B.41, and 35 cases were identified as B.1.497 during the second wave. In the third epidemic period, we revealed 50.9% B.1.619.1 (n = 84), 37.6% B.1.497 (n = 62), 9.1% B.1.1.7 (n = 15), 1.8% B.1.617.2 (n = 3), and 0.6% B.1.351 (n = 1). In the fourth wave, the prevalence rate was as follows: 53.8% 1.617.2 (n = 168) > 30.8% BA.1 (n = 96) > 11.2% B.1.619.1 (n = 35) > 3.2% BA.2 (n = 10), with Delta becoming the dominant SARS-CoV-2 variant. Omicron variants were as follows: 66.6% BA.2 (n = 504), 23.0% BA.1 (n = 174), 8.6% BA.5 (n = 65), and 1.5% BA.4 (n = 11) during the fifth wave. In the sixth epidemic, when Omicron sublineages continued to appear, 82.6% BA.5 (n = 932) > 16.2% BA.2 (n = 183) > 0.7% recombinant variant (n = 8) > 0.5% BA.4 (n = 6) were identified.
3.3. B.1.619.1 genome sequences
In the early days of the COVID-19 outbreak, B.1.497 with a D614G mutation in the Spike protein and B.1.619.1 with additional mutations, such as N440K and E484K, were prevalent. We focused on the B.1.619.1 variant, which was dominant in Gwangju until the fourth wave when the B.1.617.2 (Delta variant) spread. Among the sequences corresponding to B.1.619.1, the FASTA files of 86 sequences, excluding those with low-quality data, were used for analysis. We observed 125 nucleotide mutations and 61 amino acid mutations after analyzing the mutation sites and numbers of 86, B.1.619 mutant strains in this study and comparing them with the phylogenetic analysis reference strain (Wuhan-Hu 1, NC045512) provided by NextClade. Regarding the distribution of amino acid mutations in individual genes, ORF1ab had the highest number of mutations (30), followed by 12 mutations in S, eight in N, five in ORF7a, three in ORF3a, two in ORF8, and one in M.
Among a total of 61 amino acid mutations, 20 mutations were identified in all 86 sequences: eight mutations in ORF1ab (A2123V, E2607L, S3675del, G3676del, F3677del, M3752I, K3929R, and P4715L), seven mutations in S (I210T, N440K, E484K, D614G, D936N, S939F, and T1027I), three mutations in N (P13L, S201I, T205I), I82T in M (I82T), and E22D in ORF7a (Supplementary Table S2).
To analyze the genetic relationship among the 86, B.1.619.1 cases identified in this study, we performed phylogenetic analysis using MEGA 11software (
Figure 3a). The sequences were grouped into three clusters. Group A comprised a cluster of 24 cases in May and June 2021, including 18 where the route of infection was confirmed. Group B consisted of nine patients who had been in contact with confirmed cases in different regions of Korea. Group C comprised 33 cases detected in July and August 2021, 31 of which had an additional F548S amino acid change in ORF1ab.
3.4. B.1.617.2 genome sequences
B.1.619.1 gradually disappeared as B.1.617.2 was introduced in the second half of 2021, and became the dominant variant from July to December 2021 after its confirmation in June 2021. In particular, the sublineages of B.1.617.2, namely, AY.69 and AY.122, accounted for most of the cases (
Figure 4a). To analyze the genetic relationships of B.1.617.2 identified in this study, we created a phylogenetic tree with 108 sequences of B.1.617.2 using MEGA-11(ver.11.0.13) software (
Figure 3b). Samples collected from COVID-19 patients in Gwangju were classified into three clades: 21A, 21I, and 21J.
Clade 21A included five samples related to the inflow of the metropolitan area, clade 21J consisted of 27 samples related to imported cases and enumeration of foreigners, and clade 21I included 76 cases of various infection groups that occurred sporadically in Gwangju. Thus, they were grouped based on the common route of infection and classified into sublineages.
3.5. Omicron variant spread in Gwangju in 2022
By the end of December 2021, Omicron had spread rapidly against Delta, becoming the dominant variant within several weeks. This result appeared two weeks after the first report of Omicron in Gwangju, and its spread was faster than that of the previous variants. As the prevalence of Omicron prolonged, its sublineages continued to appear.
Figure 4b shows the distribution of the COVID-19 Omicron variants in 2022. BA.1 and BA.2 were identified as the dominant variants from January to May 2022. BA.1 was the dominant variant for 10 weeks starting from the 1
st week of 2022, and the Omicron sublineages BA.1.1 and BA.1.1.5 accounted for a large proportion. BA.2 was the dominant variant in the 12
th and 13
th weeks of 2022, and the proportion of sublineage BA.2.3 was also high. Subsequently, both BA.4 and BA.5 were detected, and BA.5 became the dominant variant in June 2022, indicating a tendency for various sublineages to occur simultaneously. BF.7, BQ.1, BQ.1.1, XBB, and XBB.1 were detected since November 2022, and sublineages of BA.2.75, BN.1, and CH.1.1 were confirmed.
4. Discussion
Viruses change constantly, and new variants continuously emerge during SARS-CoV-2 proliferation and propagation [
14]. New virus variants may affect the infectivity, virulence, and immune evasion [
15,
16]. SARS-CoV-2 is an enveloped positive-sense single-stranded RNA virus, which is more pathogenic than SARS-CoV (2002) and Middle East respiratory syndrome coronavirus (2013) [
17]. The SARS-CoV-2 genome is approximately 30 kb long (GenBank number MN908947), and it encodes 9860 amino acids [
18,
19]. Gene fragments encode structural and nonstructural proteins involved in viral replication. The nucleocapsid (
N), spike (
S), membrane (
M), and envelope (
E) genes encode structural proteins, whereas non-structural proteins, such as 3-chymotrypsin-like protease, papain-like protease, and RNA-dependent RNA polymerase are encoded by ORF regions [
20,
21]. Among them, the most notable mutations are those in the
S gene, which is involved in viral entry into cells [
22]. It has been widely reported that SARS-CoV-2 has undergone several genetic variants, leading to various mutations such as substitutions, reversions, and deletions throughout the genome [
23]. Therefore, analyzing its impact on public health by monitoring variants is of utmost importance, which is performed through genetic composition change analysis, also known as genome sequencing.
We used whole-genome sequencing to monitor variants that could be new inflows into Gwangju, South Korea, or propagate within the region. Genome analysis of SARS-CoV-2 in Gwangju showed that B.1.497 was prevalent in the early days of the COVID-19 outbreak. Since then, B.1.1.7 (Alpha variant), first reported in the U.K in September 2020, was partially confirmed through infection in the region, but did not spread significantly. In addition, all cases of B.1.351 (Beta variant), first reported in South Africa in May 2020, were imported, and there were few cases of local transmission. P.1 (Gamma variant), which was first identified in Brazil in November 2020, was not identified, and mainly B.1.619.1 occurred during this time [
24]. B.1.617.2 (Delta variant) was identified as the dominant strain from the 2nd week of July 2021, and the sublineages AY.69 and AY.122 were predominantly observed. Since the Delta variant was first identified in India in October 2020, it rapidly increased since April 2021 and became prevalent worldwide since July 2021. It accounts for more than 90% of the data registered in GISAID (analysis update, 10/01/2021) [
25]. After the end of December 2021, the Delta variant decreased owing to the appearance of the omicron variant, and as the prevalence of the omicron variant continued for a prolonged period, various Omicron sublineages, such as BA.1, BA.2, BA.4, and BA.5 were detected. Omicron is now the dominant strain worldwide, accounting for more than 98% of viruses in GISAID beyond 2022, and its transmission continues [
26]. Therefore, in order to track the public health risks that may arise from variants in the various Omicron sublineages, WHO has defined “Omicron subvariants under monitoring” as a new category to monitor variants [
5].
In this study, we noted the existence of the B.1.619.1 variant, which is not classified as a VOC by the WHO, but was identified as the dominant strain in Gwangju, South Korea until the introduction of the Delta variant. Our results match those of previous studies on the distribution of variant strains in South Korea; B.1.497 predominated from March 2020 to January 2021, and the prevalence of B.1.619 and B.1.620 increased rapidly from March 2021 [
27]. According to WHO data, the B.1.619 variant was first reported in May 2020, managed as a variant under monitoring on July 14, 2021, and reclassified as a formerly monitored variant on November 9, 2021. In addition, it is difficult to find a reference for the B.1.619 variant, but we confirmed that the mutant virus, which was widespread in Central Africa, spread to Europe using GISAID data [
28].
The B.1.619 variant in South Korea was first confirmed in immigrants from Cameroon in February 2021, but was later confirmed to be distinct from those of European countries based on sequence phylogenetic analysis of the B.1.619 variant isolated in Korea. It has been reported that
ORF1ab is reclassified as sub-lineage B.1.619.1 due to the additional presence of the K3929R mutation [
27]. Most variants are subdivided according to the processes of propagation and spread to new countries. In this study, as previously reported in South Korea, the B.1.619.1 variant was reclassified as a sub-lineage of the B.1.619 variant.
We focused on the B.1.619.1 variant because of its mutations, specifically E484K and N440K, which affect the antigenicity of the spike protein. The E484K mutation was identified in both the B.1.351 (Beta variant) and P.1 (Gamma variant) strains, making it an escape mutation that weakens the binding affinity between the neutralizing antibody and the RBD. This reduces the antibody effectiveness and is a significant factor in the ability to evade vaccine-induced antibodies [
29,
30]. The N440K mutation increases the affinity for angiotensin-converting enzyme 2 and confers resistance to monoclonal antibodies [
31]. The emergence of variants associated with contagiousness and immune evasion can lead to more cases of reinfection and negatively affect vaccines and therapeutic effects [
32,
33]. Notably, common mutations and deletions were observed in the B.1.619.1 and B.1.351 variants in this study, namely S3675-3677del and P4715L in
ORF1ab, T205I in
N, E484K and D614G in
S, and S84L in
ORF8.
Like the B.1.619.1 variant, AY.69 and AY.122 were identified as the B.1.617.2 (Delta) sublineage that occurred in large numbers in South Korea, including Gwangju. According to previously reported, AY.69 was specifically distributed in South Korea, and AY.122 was reported to have spread widely in Russia as well [
34,
35]. As new mutations occur through intra-regional transmission or new variants are introduced from other regions, conducting surveillance in each region along with national genome surveillance will increase the efficiency of infectious disease prevention and management.
During the early stages of the COVID-19 pandemic, large outbreaks were mostly associated with specific groups. However, the spread of the virus was also closely related to everyday activities, small gatherings, and contact with infected individuals. Accordingly, measures to strengthen quarantine management, such as an order to ban gatherings, were implemented, and an epidemiological investigation was systematically carried out, which we used to identify the route of infection [
36]. For patients whose route of infection was unclear, an epidemiological link was inferred through whole-genome sequencing. Phylogenetic tree analysis of the sequences corresponding to B.1.619.1 and B.1.617.2 (Delta) showed that cases with unidentified infection routes formed a cluster with other cases, and those with confirmed contact with other regions formed a single cluster. COVID-19 cases in May–June 2021 and July–August 2021 formed different clusters in terms of time. In addition, groups sharing a common route of infection could be classified into the same lineage. These results will provide useful data for tracking the source of infection and blocking the spread of infectious diseases; it is considered necessary to examine the relevance of various mutations to changes in countermeasures against infectious diseases, including vaccination [
37,
38].
In the absence of vaccines and treatments in the early stages of the pandemic, strengthened quarantine policies, such as social distancing, were implemented to prevent a surge in the number of COVID-19 patients responding to SARS-CoV-2 [
39]. Subsequently, the national vaccination rate reached 70%, and gradual daily recovery increased; however, with mitigated controls and the emergence of the highly contagious Omicron variant, the number of COVID-19 patients has soared, stopping daily recovery [
40]. Due to the rapid increase in confirmed Omicron cases, epidemiological investigation to trace the source of infection was halted, making it virtually meaningless. The Omicron variant was first reported in South Africa in November 2021, and it is known for its lower fatality rate than the Delta variant, but also for its very high propagation power [
41,
42]. As the Omicron variant has spread since the fifth wave, the number of infected people has soared nationwide, regardless of the region or route of infection. Thus, owing to the various Omicron sublineages, real-time reverse transcription PCR targeting some genes could not predict gene mutations and deletion patterns. Therefore, the difficulty in identifying various sublineages was addressed using whole-genome sequencing [
43].
After Omicron was first detected in Gwangju in December 2021, it spread rapidly and became the dominant strain in a short period of time, and BA.1, BA.1.1, BA.1.1.5, BA.2, BA.2.3. dominated in the first half of 2022. In the second half of 2022, BA.5 with mutations such as R346X, K444X, and N460X in the
S gene was prevalent, and sublineages of BA.5 occurred simultaneously, typically including BF.7, BQ.1, and others [
44,
45]. Moreover, we identified BA.2.75, CH.1.1, and BN.1, which have additional mutations in BA.2. A recombinant variant of BJ.1 and BM.1.1.1 known as XBB was detected recently [
46].
This study has limitations. Owing to resource constraints, only selectively analyzed samples were included, thus limiting the representation of the entire case cohort. However, despite this limitation, the analysis enabled the determination of variant distribution and trends within the analyzed subset, offering valuable insights within the scope of the study’s conditions.
We monitored mutations in Gwangju that may be newly introduced or propagating in the area via whole-genome sequencing analysis. We estimated the epidemiological link between patients with COVID-19 and unknown sources of infection using phylogenetic analysis. Continuous surveillance of SARS-CoV-2 variants tracks the influence of new variants on the pathogenicity, propagation power, and vaccine efficacy; ultimately, it can be used as evidence to prevent the spread of infectious diseases. Additional research on the evolutionary history of variants will help to prepare countermeasures against new variants in the future [
47,
48].