2. Analysing Medical Service Reviews as a Natural Language Text Classification Task
Online reviews and online ratings make up the so-called electronic word of mouth (eWOM): informal communications targeting consumers, channelled through Internet technologies (online reviews and online opinions) and relating to the usage experiences or features of specific products or services or their vendors [1]. The rise of eWOM has made online reviews the most influential source of information for consumer decision-making [2,3,4].
The predominance of positive feedback is an incentive for businesses to raise prices for their goods or services to maximise profits [5]. This prompts unscrupulous businesses to manipulate customer reviews and ratings through fake reviews, rigged feedback scores, etc. [6,7,8]. Some researchers have highlighted a separate threat posed by artificial intelligence, which can generate reviews that are all but indistinguishable from those posted by real users [9,10]. That said, artificial intelligence also serves as a tool for detecting fake reviews [9].
eWOM has also become a widespread phenomenon in the healthcare sector: many physician rating websites (PRWs) are now available. Physicians themselves are among the first to use this opportunity, actively registering on such websites and filling out their profiles. In Germany, for instance, according to 2018 data, more than 52% of physicians had a personal page on an online physician rating website [11].
Online portals provide ratings not only of physicians but also of larger entities, such as hospitals [12,13]. In many countries, however, a greater number of reviews concern hospitals or overall experiences rather than individual physicians and their actions [14,15].
Creating reviews containing reliable information enhances the efficiency of the healthcare sector by providing the patient, among other things, with trustworthy data on the quality of medical services. Despite the obvious benefits of PRWs, they also have certain drawbacks, such as:
Poor understanding and knowledge of healthcare on the part of service consumers, which casts doubt on the accuracy of their assessments of the physician and the medical services provided [16,17]. Patients often base their arguments on indirect indicators unrelated to the quality of medical services (for example, their interpersonal experience with the physician [18,19]).
Lack of clear criteria by which to assess a physician or medical service [18].
Researchers have found that online reviews of physicians often do not actually reflect the outcomes of medical services [20,21,22]. Consequently, reviews and online ratings in the healthcare industry are less useful and effective than in other industries [23,24]. However, some studies, on the contrary, have revealed a direct correlation between online user ratings and clinical outcomes [25,26,27,28].
In general, the healthcare industry shows a high concentration of positive reviews and physician ratings [29,30,31,32,33,34,35]. At the beginning of the COVID-19 pandemic, however, negative reviews prevailed on online forums [36].
The main factors behind a higher likelihood of a physician receiving a positive review are the physician's friendliness and communication behaviour [37]. Shah et al. divide the factors that increase the likelihood of a positive review into those depending on the medical facility (hospital environment, location, car park availability, medical protocol, etc.) and those relating to the physician's actions (knowledge, competence, attitude, etc.) [38].
Some researchers have noted that patients mostly rely on scoring alone, rarely using descriptive feedback when assessing physicians [39], owing to the reduced time cost of completing such feedback [40]. At the same time, consumers note the importance of receiving descriptive feedback, as it is more informative than numerical scores [41,42].
Apart from objective factors, a physician's personal characteristics, such as gender, age, and specialty, can also influence the patient's assessment and feedback [43,44,45,46]. For example, according to studies based on German and US physician assessment data, higher evaluations prevail among female physicians [43,44], obstetrician-gynaecologists [47], and younger physicians [47].
Patient characteristics also affect the distribution of scores. For example, according to a study by Emmert and Meier based on online physician review data from Germany, older patients tend to score physicians higher than younger patients do [43]. However, according to their estimates, physicians' scores do not depend on the respondent's gender [43]. Having an insurance policy has a significant influence on feedback sentiment [43,46]. Individual studies have demonstrated that negative feedback is prevalent among patients from rural areas [48]. Some studies have focused on the characteristics of online service users, noting that certain characteristics are indicative of PRW use frequency [49,50]. Accordingly, users with different key characteristics will differ significantly in how important they rate online physician reviews [51].
A number of studies have used both rating scores and commentary texts as data [52]. In particular, the study [52] identified factors influencing more positive ratings, related both to the physician's characteristics and to factors beyond the physician's control.
A number of studies use arrays of physician review texts as their data basis [53,54,55,56,57]. Researchers have found that physician assessment services can complement the information provided by more conventional patient experience surveys while helping patients better understand the quality of medical services provided by a physician or healthcare facility [58,59,60].
Social media analysis can be viewed as a multi-stage activity involving collection, comprehension, and presentation:
Supervised or unsupervised methods can be used to effectively analyse sentiments in social media data; [63] gives an overview of these methods. The main approaches classify the polarity of the analysed texts at the word, sentence, or paragraph level.
In [64], various text mining techniques for detecting different text patterns in a social network were studied. Text mining using classification based on various machine learning and ontology algorithms was considered, along with a hybrid approach. The experiments described in that paper showed that no single algorithm performs best across all data types.
In [65], different classifier types were analysed for text classification, along with their respective pros and cons. Six algorithms were considered:
Bayesian classifier;
decision tree;
k-nearest neighbour algorithm (k-NN);
support vector machine (SVM);
artificial neural network based on multilayer perceptron;
Rocchio algorithm.
Limited performance is a common drawback of all the above algorithms. Some are easy to implement but perform poorly; others perform well but require extra time for training and parameter setting.
Lee et al. [66] classified trending topics on Twitter (now X) using two approaches to topic classification: the well-known bag-of-words method for text classification and a network classification. They identified 18 classes and assigned trending topics to the respective categories. Ultimately, the network classifier significantly outperformed the text classifier.
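The bag-of-words representation mentioned above can be sketched as follows (a minimal illustration with made-up example texts, not the data from [66]):

```python
from collections import Counter

def bag_of_words(texts):
    """Build a shared vocabulary and term-count vectors for a list of texts."""
    tokenised = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenised for w in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        # one count per vocabulary word, in a fixed order
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["good doctor good clinic", "bad clinic"])
```

Each review becomes a fixed-length count vector over the shared vocabulary, which any standard classifier can consume; word order is deliberately discarded.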
[67] discusses methods addressing the challenges of classifying short texts in streaming social network data.
[68] compared six feature weighting methods in a Thai document categorisation system, finding that SVM score thresholding with ltc weighting yielded the best results for Thai document categorisation.
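In SMART notation, ltc denotes logarithmic term frequency (l), idf weighting (t), and cosine normalisation (c). A minimal sketch of this weighting with hypothetical counts (the exact variant used in [68] may differ):

```python
import math

def ltc_weights(doc_tf, n_docs, df):
    """SMART ltc weighting: (1 + log tf) * log(N / df), cosine-normalised."""
    w = {term: (1 + math.log(tf)) * math.log(n_docs / df[term])
         for term, tf in doc_tf.items() if tf > 0}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {term: v / norm for term, v in w.items()} if norm else w

# hypothetical term frequencies in one document and document
# frequencies over a 10-document collection
weights = ltc_weights({"clinic": 3, "doctor": 1},
                      n_docs=10, df={"clinic": 5, "doctor": 2})
```

Note how the rarer term ("doctor") outweighs the more frequent one ("clinic") once idf is applied, and the resulting vector has unit length.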
[69] proposed a multidimensional text document classification framework, reporting that classifying text documents based on a multidimensional category model using multidimensional and hierarchical classifications was superior to flat classification.
[70] compared four mechanisms for supporting divergent thinking using associative information obtained from Wikipedia. The authors used the Word Article Matrix (WAM) to compute the association function, a useful and effective method for supporting divergent thinking.
[71] proposed a new method to fine-tune a model trained on known documents with richer contextual information. The authors used the WAM to classify text and track keywords from social media in order to comprehend social events; the WAM with cosine similarity is an effective method of text classification.
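The cosine similarity used with WAM vectors can be sketched as:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # zero vectors have no direction; treat them as dissimilar
    return dot / (na * nb) if na and nb else 0.0
```

Because the measure normalises by vector length, it compares the direction of word-association profiles rather than their raw magnitudes, which makes it robust to differing document lengths.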
As this review of the current state of automatic processing of unstructured social media data makes clear, there is at present no single approach that achieves effective classification of text resources. Classification results depend on the domain, the representativeness of the training sample, and other factors. It is therefore important to develop and apply intelligent methods for analysing reviews of medical services.
3. Classification Models for Text Reviews of the Quality of Medical Services in Social Media
In this study, we have developed a hybrid method for classifying text reviews from social media by their sentiment and target.
Initially, the task was to break the set of reviews down into four classes. To solve it, we started by testing machine learning methods using various neural network architectures.
Mathematically, a neuron is a weighted adder whose single output is defined by its inputs and weight vector as follows:

u = ∑_{i=0}^{n} w_i x_i, y = f(u),

where x_i and w_i are the neuron input signals and the input weights, respectively; u is the induced local field; and f(u) is the transfer function. The signals at the neuron inputs are assumed to lie in the interval [0,1]. The additional input x_0 and its corresponding weight w_0 are used for neuron initialisation, i.e., for shifting the neuron's activation function along the horizontal axis.
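The weighted adder described above can be sketched directly in code (the logistic transfer function is an illustrative choice; the text does not fix a specific f):

```python
import math

def neuron_output(x, w, f=lambda u: 1 / (1 + math.exp(-u))):
    """Weighted adder: u = sum_i w_i * x_i, y = f(u).
    x[0] is the constant initialisation input, w[0] its weight (bias)."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return f(u)

# x0 = 1.0 together with w0 shifts the activation function horizontally
y = neuron_output([1.0, 0.4, 0.7], [-0.5, 0.8, 0.3])
```

With the logistic choice of f, the output always lies in (0, 1), matching the assumed input interval when neurons are chained into layers.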
There are a large number of algorithms available for context-sensitive and context-insensitive text classification. In this study, we propose three neural network architectures that have shown the best performance in non-binary text classification tasks. We compared the effectiveness of the proposed algorithms with the results of text classification using the models applied in our previous studies, which achieved good results in binary classification (BERT and SVM) [
72,
73].
3.1. LSTM Network
Figure 1 shows the general architecture of the LSTM network.
The proposed LSTM network architecture consists of the following layers:
Embedding, the input layer of the neural network, whose size is determined by the following parameters:
Size(D) — dictionary size of the text data;
the dimension of the vector space into which the words are embedded;
the length of the input sequences, equal to the maximum size of the vector generated during word pre-processing.
LSTM Layer — recurrent layer of the neural network; includes 32 blocks.
Dense Layer — output layer consisting of four neurons. Each neuron is responsible for an output class. The activation function is "Softmax".
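In Keras terms, the three layers above can be sketched as follows (vocabulary size, embedding dimension, and sequence length are illustrative placeholders, not the values used in the experiments; the GRU variant of Section 3.2 would replace the LSTM layer with layers.GRU(16)):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000  # Size(D): dictionary size (placeholder)
EMBED_DIM = 100      # dimension of the word vector space (placeholder)
MAX_LEN = 90         # length of the input sequences (placeholder)

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # Embedding input layer
    layers.LSTM(32),                          # recurrent layer with 32 blocks
    layers.Dense(4, activation="softmax"),    # one output neuron per class
])
```

The softmax output yields a probability distribution over the four review classes, so the predicted class is simply the argmax of the final layer.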
3.2. A Recurrent Neural Network
Figure 2 shows the general architecture of a recurrent neural network.
The proposed recurrent neural network architecture consists of the following layers:
Embedding — input layer of the neural network.
GRU — recurrent layer of the neural network; includes 16 blocks.
Dense — output layer consisting of four neurons. The activation function is "Softmax".
3.3. A Convolutional Neural Network
Figure 3 shows the general architecture of a convolutional neural network (CNN).
The proposed convolutional neural network architecture consists of the following layers:
Embedding — input layer of the neural network.
Conv1D — one-dimensional convolutional layer. In our experiments, this layer improved the accuracy of text message classification by 5-7%. The activation function is "ReLU".
MaxPooling1D — layer that reduces the dimensionality of the generated feature maps. The pooling size is 2.
Dense — first output layer consisting of 128 neurons. The activation function is "ReLU".
Dense — final output layer consisting of four neurons. The activation function is "Softmax".
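A corresponding Keras sketch of this architecture (the filter count, kernel width, and the Flatten layer bridging the pooling and dense layers are our assumptions, as the text does not specify them):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000  # placeholder dictionary size
EMBED_DIM = 100      # placeholder embedding dimension
MAX_LEN = 90         # placeholder input sequence length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(64, 5, activation="relu"),  # 64 filters of width 5 (assumed)
    layers.MaxPooling1D(pool_size=2),         # halves the feature-map length
    layers.Flatten(),                         # assumed bridge to dense layers
    layers.Dense(128, activation="relu"),
    layers.Dense(4, activation="softmax"),    # one output neuron per class
])
```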
3.4. Using Linguistic Algorithms
We observed that some text reviews contained elements from different classes. To account for this, we added two new review classes: mixed positive and mixed negative.
To improve the classification quality, we used hybridisation of the most effective machine learning methods and linguistic methods that account for the speech and grammatical features of the text language.
Figure 4 shows the general algorithm of the hybrid method.
A set of methods of pre-processing, validation and detection of named entities representing the physicians’ names in the clinic was the linguistic component of the hybrid method we developed.
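A minimal gazetteer-based sketch of this linguistic step (the class labels, roster, and reassignment rule are illustrative; the actual pipeline may use a full named entity recognition model):

```python
def refine_class(review_text, predicted_class, physician_names):
    """If a review classified as a clinic review also names a physician,
    reassign it to the corresponding mixed class."""
    text = review_text.lower()
    mentions_physician = any(name.lower() in text for name in physician_names)
    if mentions_physician and predicted_class == "positive_clinic":
        return "mixed_positive"
    if mentions_physician and predicted_class == "negative_clinic":
        return "mixed_negative"
    return predicted_class

# hypothetical clinic roster and review
label = refine_class("The clinic is clean and Dr. Ivanova was attentive",
                     "positive_clinic", ["Ivanova", "Petrov"])
```

Running the check only on reviews the neural network already labelled as clinic reviews keeps the linguistic step cheap while recovering the mixed classes.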
5. Experimental Results of Text Review Classification
5.1. Dataset Used
A number of experiments were conducted on classifying Russian-language text reviews of medical services provided by clinics or individual physicians with a view to evaluating the effectiveness of the proposed approaches.
All extracted data to be analysed had the following list of variables:
city — city where the review was posted;
text — feedback text;
author_name — name of the feedback author;
date — feedback date;
day — feedback day;
month — feedback month;
year — feedback year;
doctor_or_clinic — a binary variable (the review is of a physician OR a clinic);
spec — medical specialty (for feedback on physicians);
gender — feedback author’s gender;
id — feedback identification number.
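A single extracted record with these variables might look like the following (all values are hypothetical, including the 0/1 coding of the binary variable):

```python
review = {
    "city": "Moscow",
    "text": "...",               # feedback text
    "author_name": "...",        # feedback author's name
    "date": "2023-08-01",
    "day": 1,
    "month": 8,
    "year": 2023,
    "doctor_or_clinic": 1,       # binary: review of a physician or a clinic
    "spec": "cardiologist",      # only for feedback on physicians
    "gender": "f",
    "id": 1,
}
```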
The experiments imposed a 90-word limit on the feedback length.
5.2. Experimental Results on Classifying Text Reviews by Sentiment
In the first experiment, a database of 5,037 reviews from the website prodoctorov.ru, with initial markup by sentiment and target, was built to test the sentiment analysis algorithms.
The RuBERT language model was used as a text data vectorisation algorithm. The Transformer model was used for binary classification of text into positive / negative categories. The training and test samples were split in an 80:20 ratio.
The results of the classifier on the test sample were as follows: Precision = 0.9857, Recall = 0.8909, F1-score = 0.9359.
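The reported F1-score follows from precision and recall as their harmonic mean:

```python
precision, recall = 0.9857, 0.8909
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
# rounds to the reported F1-score of 0.9359
```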
The classifier performance quality metrics confirm the feasibility of using this binary text sentiment classifier architecture for alternative sources of medical feedback data.
The LSTM classifier, whose architecture is described in Section 3.1, was also tested on this sample. The reviews were to be classified into four classes:
Positive review of a physician;
Positive review of a clinic;
Negative review of a physician;
Negative review of a clinic.
The split between the training and test samples was also 80/20.
Figure 5 gives the classification results.
5.3. A Text Feedback Classification Experiment Using Various Machine Learning Models
We used data from the online aggregator infodoctor.ru to classify Russian-language text feedback into four (later, six) classes using the machine learning models described in Section 3. This aggregator has a comparative advantage over other large Russian websites (prodoctorov.ru, docdoc.ru) in that it groups reviews by their ratings on a scale from one to five stars, broken down by Russian cities, which greatly simplifies data collection.
We collected samples from Moscow, St. Petersburg, and 14 other Russian cities with million-plus populations, for which we could obtain minimally representative samples by city (at least 1,000 observations per city). The samples spanned the period from July 2012 to August 2023.
We retrieved a total of 58,246 reviews. These reviews either contained only a positive or negative experience with a physician or a clinic, or had mixed sentiments and targets.
Table 1 summarises some selected feedback examples.
All the algorithms had an 80:20 split between the training and test samples.
Figure 6 gives the graphs showing the classification results on the training and test datasets for LSTM, GRU, and CNN architectures.
Table 2 compares the performance indicators of text feedback classification using the above approaches.
where:
Accuracy — training accuracy;
Val_accuracy — validation accuracy;
Loss — training loss;
Val_loss — validation loss.
To compare the proposed models with other methods, we performed experiments using the support vector machine (SVM) and RuBERT algorithms on the same dataset. As shown in Table 2, these algorithms performed slightly worse than our models.
As mentioned earlier, one of the difficulties with the text reviews analysed was that they could contain elements of different classes within the same review. For instance, a short text message could include both a review of a physician and a review of a clinic.
Hence, we introduced two additional classes, mixed positive and mixed negative, while applying the linguistic method of named entity recognition (as described in
Section 3.4) to enhance the quality of classification of each feedback class.
This approach improved the classification performance for all the three artificial neural network architectures.
Figure 7 illustrates the classification results obtained using the hybrid algorithms.
We applied the linguistic method to the reviews that were classified as “clinic review” by the neural network at the first stage, regardless of their sentiment. The named entity recognition method improved the classification performance when it was used after the partitioning of text messages. However, some reviews were still misclassified even after applying the linguistic method. Those reviews were long text messages that could belong to different classes semantically. The reasons for the misclassification were as follows:
Some reviews were of both a clinic and a physician without mentioning the latter's name. This prevented the named entity recognition tool from assigning the reviews to the mixed class. This problem could be solved by further parsing of the sentences to identify a semantically significant object not specified by a full name.
Some reviews expressed contrasting opinions about the clinic, related to different aspects of its operation. The opinions often differed on the organisational support versus the level of medical services provided by the clinic.
For example, the review “This doctor’s always absent sick, she’s always late for appointments, she’s always chewing. Cynical and unresponsive” expresses the patient’s dissatisfaction with the quality of organisation of medical appointments, which is a negative review of organisational support. The review “I had meniscus surgery. He fooled me. He didn’t remove anything, except damaging my blood vessel. I ended up with a year of treatment at the Rheumatology Institute. ####### tried to hide his unprofessional attitude to the client. If you care about your health, do not go to ########” conveys the patient’s indignation at the quality of treatment, which belongs to the medical service class. A finer classification of clinic reviews will enhance the meta-level classification quality.
6. Conclusions
In this paper, we propose a hybrid method for classifying Russian-language text reviews of medical facilities extracted from social media.
The method consists of two steps: first, we use one of the artificial neural network architectures (LSTM, CNN, or GRU) to classify the reviews into four main classes based on their sentiment and target (positive or negative, physician or clinic); second, we apply a linguistic approach to extract named entities from the reviews.
We evaluated the performance of our method on a dataset of more than 60,000 Russian-language text reviews of medical services provided by clinics or physicians, collected from the websites prodoctorov.ru and infodoctor.ru. The main results of our experiments are as follows:
The neural network classifiers achieve high accuracy in classifying Russian-language reviews from social media by sentiment (positive or negative) and target (clinic or physician) using the LSTM, CNN, and GRU architectures, with the GRU-based architecture performing best (val_accuracy = 0.9271).
The named entity recognition method improves the classification performance for each of the neural network classifiers when applied to the segmented text reviews.
To further improve the classification accuracy, semantic segmentation of the reviews by target and sentiment is required, as well as a separate analysis of the resulting fragments.
As future work, we intend to develop a text classification algorithm that can distinguish between reviews of medical services and reviews of organisational support for clinics. This would enhance the classification quality at the meta-level, as it is more important for managers to separate reviews of the medical services and diagnosis from those of organisational support factors (such as waiting time, cleanliness, politeness, location, speed of getting results, etc.).
Moreover, in a broader context, refined classification of social media users' statements on review platforms or social networks would enable creating a system of standardised management responses to changes in demographic and social behaviours and attitudes towards socially significant services and programmes [73].