1. Introduction
E-commerce platforms are characterized by openness and transparency, fierce competition within categories, product diversification, and review functions. As a result, consumers' perception and evaluation of goods have shifted from in-store staff to the public online view: users can share their opinions on products and merchants, and express their feelings, at any time and from anywhere. Once a transaction is concluded, users' subjective product reviews convey a certain sentiment inclination. These reviews serve a dual purpose as both "information providers" and "emotional influencers," significantly affecting merchants: they influence not only store reputation but also the long-term sales of products. To help merchants understand the sentiment tendencies within reviews, it is necessary to categorize and analyze product review information and discern whether the feedback leans positive or negative. Merchants can then refine their stores and products based on the emotional orientation of the review data, ultimately improving user satisfaction, product sales, and store ratings.
The evaluation of review information is not solely influenced by subjective factors such as consumers' personal preferences, emotions, and personalities; it also correlates strongly with the objective factor of product quality itself. To conduct sentiment analysis on review data, the first step is to extract text features. Feature selection aims to extract pertinent information from raw attribute sets while reducing data dimensionality. These attribute sets can be derived from dictionaries or from commonly employed statistical methods [22,24].
Following feature extraction, the next step in sentiment analysis is sentiment categorization. This can be accomplished through various approaches, including machine learning, vocabulary-based methods, and deep learning techniques [26]. Sentiment analysis models employing polar vocabularies are frequently used for sentiment classification, but the availability of sentiment vocabularies remains limited. Machine learning methods hold distinct advantages over sentiment dictionaries for nonlinear, high-dimensional pattern recognition problems.
Given that comments contain a wealth of content, traditional neural network models may struggle to capture the complete context of a sentence or comment. Newer architectures therefore pair different types of word embeddings with RNN variants such as Bidirectional LSTM (BiLSTM) and Bidirectional GRU (BiGRU), which prove beneficial for sentiment analysis tasks.
Traditional text data mining methodologies tend to overlook information about the online products themselves, particularly descriptive textual details regarding product quality and potential disparities between depicted images and the actual, unadorned goods. In contemporary reality, individuals exist within a multimodal, interconnected environment in which various information modalities converge, including text, speech, images, and videos. Enriching linguistic expression through multiple information modalities allows computers to better comprehend and interpret input data, leading to more precise and comprehensive output results [27].
Hence, when analyzing product reviews, it becomes essential to consider multimodal information in order to fully grasp consumers' emotions. This approach not only enhances users' genuine experiences and satisfaction but also improves merchants' sales performance. By delving into review information, merchants can tailor their products to align with consumers' current needs, thereby increasing relevance and appeal.
To enable AI to attain a deeper comprehension of the world, it is imperative to endow it with the capacity to learn, comprehend, and reason about multimodal information. Multimodal learning entails constructing models that enable machines to acquire knowledge from multiple modalities, allowing for effective communication and information transformation across each modality. Throughout the course of this research, feature representations were created using multimodal deep learning techniques, encompassing both image and text data.
In comparison to purely statistical models, machine learning models offer greater diversity and exhibit enhanced capability to capture the diverse features present in multimodal data. These models excel at approximating various nonlinear relationships, while showcasing superior adaptability.
2. Relevant Research
Research on the content of online review information falls roughly into two categories: text mining of online reviews, and multi-level analysis of review information, within which the image information contained in online reviews has attracted considerable academic attention [16,17]. Traditional aspect-level sentiment categorization methods follow two main routes: sentiment-dictionary-based and machine-learning-based. Sentiment-lexicon-based methods usually calculate sentiment polarity on a whole-sentence or sub-sentence basis and apply it to all aspects involved in a sentence, but they cannot handle the complex mapping between sentences and aspects well [40,41]. The core of this approach is the sentiment lexicon; however, the sentiment tendency of a given word differs across domains, making it difficult to generalize a domain-specific sentiment lexicon to other domains [42,43].
D. Lahat et al. define multimodality by contrast with unimodality, as a property of data described through several modalities [34]. Zhao Liang considers multimodal data to be data obtained through different domains or perspectives for the same described object, and calls each such domain or perspective a modality [35]. The main objective of multimodal learning is to correlate and process multimodal data by building models.
According to surveys by related scholars, current multimodal research can be summarized at five levels: multimodal data representation, data mapping, data alignment, data fusion, and co-learning [36]. Most of it revolves around cross-modal mapping between data. As the fields of computer vision and natural language processing continue to evolve, and as more large-scale datasets become available for research, multimodal data mapping methods continue to mature [37]. The mainstream approach to multimodal data mapping is as follows: based on existing mapping relationships, the available multimodal data are first symbolized or vectorized and used as input to a neural network, then, combined with the known correspondences, mapped to another modality. After continuous training on massive datasets, a cross-modal data mapping model with universal applicability is obtained. The framework of this approach is shown in Figure 1. One of the most common and widely used scenarios is image semantic recognition [38,39], which maps image-modality data to text-modality data.
Sentiment analysis of online product review information faces two major challenges: dimension mapping and sentiment word disambiguation. The dimension mapping problem concerns the correct use of dimensions to map online review texts, while sentiment word disambiguation refers to the situation in which two or more dimensions exist for a single sentiment word. Sentiment analysis of online reviews is therefore treated as a multidimensional classification process [18,19,20].
Constructing a sentiment lexicon requires substantial human intervention, and its completeness and accuracy strongly affect sentiment classification results. Machine-learning-based methods, on the other hand, usually treat aspect-level sentiment polarity recognition as a sequence labeling problem, which requires manually designing and labeling features and then training classifiers on them. Common sequence labeling methods include conditional random fields, maximum entropy, naive Bayes, and support vector machines [44]. Despite the achievements of machine learning methods in aspect-level sentiment classification, feature engineering is time-consuming and labor-intensive, and classification results depend heavily on feature quality.
Pang (2002) first applied the N-gram machine learning method to sentiment analysis, and the experimental results showed that N-grams achieved a highest classification accuracy of 81.9% [28]. Since feature selection affects the performance of machine learning methods, Abinash Tripathy (2016) analyzed online reviews using an N-gram model combined with machine learning methods; the experiments showed that SVM combined with unigram, bigram, and trigram features obtained the best classification results [23].
Zheng Fei et al. showed that a combination of the LDA model and Word2Vec word vectors can model review-text word vectors for sentiment classification despite problems such as varying review lengths and non-uniform unit schemas [1]. Kim and Hovy (2004) applied the synonyms, antonyms, and hierarchical structure of the WordNet dictionary to analyze the sentiment tendencies of word vectors [29]. Zhu Xiaoliang and colleagues addressed essay classification by using the TextRank model to filter key sentence words and combining it with a word embedding model to model documents [2]. Aiming at the problem that the Word2Vec model cannot identify the importance of special words and scene words in Chinese text, Zhang Qian et al. proposed introducing the TF-IDF model to weight the output word-vector matrix, yielding a weighted text vectorization model for classification [3]. Yuting Yang et al. encoded text into high-dimensional vectors with contextual semantic, sequential, and sentiment information through the Doc2Vec model, verifying the effectiveness of the distributed document representation approach [4].
Marco Guerini (2013) used naive Bayes, K-nearest neighbors, maximum entropy, and support vector machines to analyze the sentiment propensity of reviews, with support vector machines significantly outperforming the other methods at up to 83% accuracy on larger training sets [30]. Yunfei Shao et al. used LDA and TF-IDF to expand the input text features, then aggregated the features with a CNN to form classification basis vectors, improving classification on news headline data [5,15]. Qu and Wang (2018) proposed a sentiment analysis model based on hierarchical attention networks with a 5% improvement in accuracy over recurrent neural networks [31]. Tao Zhiyong et al. fused the bidirectional features of the BiLSTM model and used them for attention weight computation, achieving improved classification results on various benchmark datasets [6].
In particular, Duan Dandan et al. used a short-text classification model based on BERT, which leverages BERT's own sentence-vector training to achieve automatic text classification [7,21]. Du Lin et al. obtained word vectors by feeding text into the BERT model and then passed them, in chronological order, into a BiLSTM with self-attention to extract and automatically classify the text of Chinese medical records [8].
Trofimovich (2016) used LSTM (Long Short-Term Memory) for phrase-level sentiment classification incorporating linguistic rules such as negation, intensity, and polarity, training a BiLSTM on labeled text for syntactic and semantic processing [32]. In the field of traditional machine learning classifiers, some of the literature uses the Doc2Vec and LDA models to obtain multi-channel text feature matrices and feeds the modeled text into SVM and LR classifiers; by obtaining the final classification through a voting mechanism among multiple classifiers, the model achieves excellent results on short-text classification [33]. Ge et al. represented text with a bag-of-words model, extracted features with a CNN, and used an SVM classifier to classify adverse nursing events [9].
In the field of deep learning, H. Wang et al. obtained word-embedding representations of documents and fed them into a two-channel classification model. The first channel is a three-layer CNN that extracts local features; the model also fuses the input vectors from the word embedding model with the output vectors of each CNN layer to reuse the original features. The second channel is an LSTM that captures the context-associated semantics of the text. Finally, the vectors of the two channels are fused through a fully connected network to realize feature fusion. The model achieves better results than previous traditional models on the Sina news classification problem [10,11,12].
It is worth noting that this model uses a unidirectional LSTM, whereas Bi-LSTM is generally considered superior to unidirectional LSTM. LSTM models have been used for phrase-level sentiment classification centered on regularization, incorporating linguistic features such as negation, intensity, and polarity [13], so BiLSTM should perform better for sentiment classification. Multi-channel TextCNN can obtain more adequate keyword aggregation than multi-layer TextCNN. Cho (2014) proposed the Gated Recurrent Unit (GRU) to analyze dependency contexts, which showed significant improvements in various tasks. These considerations matter for review text data, which is multimodal, sparsely informative, highly unstructured, and rife with word polysemy [14,25].
In this paper, we propose the Bert BiGRU Softmax deep learning model, which combines hybrid masking, comment extraction, and attention mechanisms with multimodality. To improve the correctness of the results, we applied several models to image content recognition and found that SqueezeNet performs better in terms of execution efficiency and accuracy, so SqueezeNet was ultimately adopted for image content recognition. The Bert BiGRU Softmax model extracts multidimensional product features from online reviews using the BERT model as the input layer; a bidirectional GRU serves as the hidden layer to obtain semantic codes and compute the sentiment weights of the comments; finally, Softmax with the attention mechanism serves as the output layer to classify sentiment as positive or negative.
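To make the described architecture concrete, the following is a minimal sketch of such a model in PyTorch with the HuggingFace transformers library. The checkpoint name bert-base-chinese, the hidden size, and the attention-pooling scheme are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of a Bert + BiGRU + attention + Softmax classifier.
# Assumes PyTorch and HuggingFace transformers; `bert-base-chinese` and
# all hyperparameters are illustrative, not the paper's exact settings.
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiGRUSoftmax(nn.Module):
    def __init__(self, num_classes=3, hidden_size=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.bigru = nn.GRU(
            input_size=self.bert.config.hidden_size,
            hidden_size=hidden_size,
            bidirectional=True,   # forward + backward pass over the sequence
            batch_first=True,
        )
        self.attn = nn.Linear(2 * hidden_size, 1)   # attention scores per step
        self.out = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Input layer: BERT produces contextual token embeddings
        tokens = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # Hidden layer: BiGRU encodes forward and backward context
        enc, _ = self.bigru(tokens)                        # (B, T, 2H)
        # Attention: weight each time step, then pool
        scores = self.attn(torch.tanh(enc)).squeeze(-1)    # (B, T)
        scores = scores.masked_fill(attention_mask == 0, -1e9)
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)
        pooled = (alpha * enc).sum(dim=1)                  # (B, 2H)
        # Output layer: Softmax over sentiment classes
        return torch.softmax(self.out(pooled), dim=-1)
```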
3. Collection and Analysis of Data Sets
In this paper, review information for the food category on an online platform was selected as the data source. With sweeping changes in the food business model, people's consumption habits are also quietly shifting. By tapping favorite food items in a mobile app, consumers have those items delivered on time and accurately to a designated area. People also habitually check a restaurant's online review information before deciding which store to visit and spend money in.
However, with the rapid development of such platforms, the food safety hazards reflected in review information cannot be ignored. Food safety incidents cause serious harm to consumers, takeaway platforms, food merchants, and society at large. This paper therefore aims to help the relevant authorities strengthen food safety regulation of stores through the analysis of store reviews.
In reviews, users explicitly or implicitly rate the attributes of multiple items, including environment, price, food, and service. In this paper, four preprocessing steps were performed on the collected data to ensure the ethics, quality, and reliability of the reviews, as sketched below: (1) user information (e.g., user ID, user name, avatar, and posting time) is deleted for privacy; (2) short comments with fewer than 50 Chinese characters and long comments with more than 1,000 Chinese characters are filtered out; (3) a comment is discarded if more than 70% of its characters are non-Chinese; and (4) the data are further preprocessed via cleansing, Chinese word segmentation, de-duplication of words, and so on.
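A minimal sketch of these four filtering rules in Python; the record field names and the CJK character range are illustrative assumptions.

```python
# A sketch of the four preprocessing rules described above; field names
# and the regular expression for Chinese characters are assumptions.
import re

CHINESE = re.compile(r"[\u4e00-\u9fff]")

def preprocess(reviews):
    kept = []
    for r in reviews:
        # (1) Keep only the text; user-identifying fields are dropped
        text = r.get("text", "")
        # (2) Length filter on Chinese characters
        n_chinese = len(CHINESE.findall(text))
        if n_chinese < 50 or n_chinese > 1000:
            continue
        # (3) Discard reviews that are more than 70% non-Chinese
        if len(text) > 0 and 1 - n_chinese / len(text) > 0.70:
            continue
        # (4) Basic cleansing; word segmentation and de-duplication
        # would follow here (e.g., with a tokenizer such as jieba)
        kept.append({"text": text.strip()})
    return kept
```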
In this paper, we use the Bert BiGRU Softmax deep learning model to perform sentiment analysis of online product quality reviews along multiple dimensions, such as service, taste, price, hygiene, packaging, and delivery time. Polarity is categorized as positive, neutral, or negative.
4. Introduction to Multimodal Learning and Modeling
Multimodality refers to the different forms in which things are presented or experienced. Modalities can be grounded in the human senses, including the visual, auditory, and tactile, each representing a channel of human perception; the combination of multiple modal perceptions yields complete human perception. At the same time, "multimodal" can denote different forms of data, or the same form in different formats, generally expressed as text, images, audio, video, or mixed data.
Image content recognition has long been an important research problem, and with the continuous development of learning methods, recognition accuracy keeps improving. A raw image consists of a pixel matrix; traditional methods such as edge recognition can only partition it block by block, with poor results. With convolutional layers, the pixel matrix is transformed by convolution into a high-dimensional feature matrix, turning simple pixel information into composite feature information. Through such convolution operations, the computer can perform not only basic edge detection but also shape recognition (circles, rectangles, and so on); with repeated convolution, it ultimately recognizes the object.
SqueezeNet is a lightweight architecture that draws on Inception-style design. In this paper, we use ImageAI, an open-source Python tool trained on the ImageNet dataset, which integrates four mainstream convolutional deep learning models for image recognition (ResNet50, DenseNet121, InceptionV3, and SqueezeNet) and also supports customized model training. To improve the correctness of the results, we applied all four models to image content recognition and found that SqueezeNet performs better in efficiency and accuracy, so SqueezeNet was ultimately adopted for recognition, as illustrated below.
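For illustration, here is a sketch of image content recognition using ImageAI's 2.x prediction API, under which SqueezeNet was one of the bundled models; class and method names and the weight file vary across ImageAI versions (newer releases renamed the module and replaced SqueezeNet), so treat this as an assumption-laden example.

```python
# A sketch based on ImageAI's 2.x prediction API; method names and the
# weight file differ across versions, so this is illustrative only.
from imageai.Prediction import ImagePrediction

predictor = ImagePrediction()
predictor.setModelTypeAsSqueezeNet()
predictor.setModelPath("squeezenet_weights_tf_dim_ordering_tf_kernels.h5")
predictor.loadModel()

# Returns the top-5 ImageNet labels with confidence scores for an image
labels, probs = predictor.predictImage("review_photo.jpg", result_count=5)
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.1f}%")
```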
Mathematical modeling of text remains an essential aspect of natural language processing, and how text is modeled significantly affects downstream feature extraction and classification models. Most common text modeling approaches are designed for English corpora, whether question-answer or review corpora. Beyond the sparse information and highly unstructured character that English review texts also exhibit, Chinese review texts suffer from word polysemy and a non-uniform smallest unit of expression. These problems usually limit the effectiveness of traditional classification models on Chinese text.
At the same time, review content includes not only textual information but also image information. Modeling multimodal data expressed through these two kinds of information involves not only the problems of word polysemy and textual representation granularity, but also the difficulty of acquiring image information and analyzing its sentiment tendency.
The BERT model, proposed in 2018 ("Pre-training of Deep Bidirectional Transformers for Language Understanding"), is a milestone in the field of pre-training, achieving the best results at the time on several NLP tasks. BERT is a model pre-trained on deep bidirectional Transformers for language understanding, where the Transformer is a network structure for processing sequential data. BERT learns the semantic information of text and, through vector-form outputs, applies it to tasks such as categorization and semantic similarity. As a pre-trained language model it has already been trained unsupervised on a large-scale corpus, so in using it we need only fine-tune and update its parameters.
Unlike other language models, BERT is trained with unsupervised objectives in which information to the left and right of the text is considered in every layer. BERT's input representation consists of three components: word embeddings, segmentation embeddings, and position embeddings. The final embedding vector is the direct sum of these three vectors, as sketched below.
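As a sketch of this input representation, the three embeddings are simply added elementwise; the sizes are borrowed from bert-base-chinese and the token ids are illustrative assumptions.

```python
# Illustrates how BERT's input representation sums token, segment, and
# position embeddings (cf. Figure 2); values are illustrative.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 21128, 512, 768   # bert-base-chinese sizes
tok_emb = nn.Embedding(vocab_size, d_model)
seg_emb = nn.Embedding(2, d_model)        # sentence A / sentence B
pos_emb = nn.Embedding(max_len, d_model)  # learned positions in BERT

input_ids = torch.tensor([[101, 2769, 1599, 3614, 102]])  # [CLS] ... [SEP]
segments = torch.zeros_like(input_ids)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# Final embedding: direct sum of the three vectors, as described above
embeddings = tok_emb(input_ids) + seg_emb(segments) + pos_emb(positions)
```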
Figure 2. Structure of the BERT input layer.
Figure 3. BERT model diagram.
In summary, for text classification tasks, the BERT model inserts a [CLS] symbol in front of the text and uses the output vector corresponding to this symbol as a semantic representation of the whole text for text classification. This notation has no obvious semantic information, so it can more "fairly" incorporate the semantic information of individual words or phrases in the text. In addition, we can add additional structures such as fully connected layers after the BERT model to perform fine-tune operations for specific tasks, such as linguistic reasoning tasks like Q&A.
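A minimal sketch of extracting the [CLS] vector with the HuggingFace transformers library follows; the bert-base-chinese checkpoint and the untrained classification head are illustrative assumptions.

```python
# Sketch: the output vector at position 0 corresponds to [CLS] and
# serves as the whole-text representation for a downstream classifier.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("外卖很快，味道也不错", return_tensors="pt")  # adds [CLS]/[SEP]
with torch.no_grad():
    outputs = model(**inputs)

cls_vec = outputs.last_hidden_state[:, 0]      # (1, 768) [CLS] vector
logits = torch.nn.Linear(768, 3)(cls_vec)      # fine-tune head (untrained here)
```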
By learning the distribution over the input text vectors, the sentiment BERT model can efficiently learn feature extraction over variable-length sequences. Given a review sentence $S$, we can directly obtain its category and the set of dimensions of that category $D_c$. For each word $w_i$ ($w_i \in S$) and dimension $d_j$ in a review sentence, we assign a probability score that describes the probability that word $w_i$ belongs to class $d_j$ in an online product quality review, as in Equation (1).
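Equation (1) itself did not survive extraction; a plausible form consistent with the surrounding description, offered as an assumption rather than the paper's verbatim formula, is a softmax over the dimension set $D_c$ given BERT's hidden vector $h_i$ for word $w_i$:

$$p(d_j \mid w_i) = \frac{\exp(W_j h_i + b_j)}{\sum_{d_k \in D_c} \exp(W_k h_i + b_k)}$$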
The Transformer adds order information to the sequence via word position embeddings (PE), Equations (2) and (3):

$$PE_{(pos,\,2i)} = \sin\!\bigl(pos/10000^{2i/d}\bigr) \qquad (2)$$

$$PE_{(pos,\,2i+1)} = \cos\!\bigl(pos/10000^{2i/d}\bigr) \qquad (3)$$

where $d = 64$, the text sequence is represented as 512 characters, $2i$ indexes the even positions of the input embedding vector, and $2i+1$ indexes the odd positions. When the Transformer extracts the features $C$ and $T_i$ from the two special tokens [CLS] and [SEP] in the sequence $S$, the BERT loss function considers only the prediction of the masked tokens and thus ignores the prediction of the non-masked tokens, as shown in Equations (4) and (5).
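As a quick numeric check of Equations (2) and (3), assuming the standard sinusoidal formulation, the following computes the position embeddings for $d = 64$ and 512 positions.

```python
# Sinusoidal position embeddings per Equations (2) and (3).
import numpy as np

def position_embedding(max_len=512, d=64):
    pos = np.arange(max_len)[:, None]           # positions 0..511
    i = np.arange(d // 2)[None, :]              # embedding-dimension index
    angle = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                 # even indices, Equation (2)
    pe[:, 1::2] = np.cos(angle)                 # odd indices, Equation (3)
    return pe

pe = position_embedding()
print(pe.shape)  # (512, 64)
```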
GRU is a recurrent neural network model that performs machine learning tasks involving memory and clustering through connections across a series of nodes, allowing information to be passed over multiple time steps to influence later ones. GRU can be considered a variant of LSTM: both are similar in design and produce comparably good results, and both use gated recurrent units to tune the network's input weights and mitigate the vanishing gradient problem. As a refinement of the recurrent neural network, GRU has an update gate $z_t$ and a reset gate $r_t$. Using an input vector $x_t$ and an output vector $h_t$, the model refines the information flow by controlling how much of the previous state $h_{t-1}$ enters $h_t$. Like other recurrent network models, a GRU with gated recurrent units can retain information over a period of time, which is why such techniques are most easily described as "memory-centric" neural networks; by contrast, networks without gated recurrent units typically cannot retain information. The structure of the GRU is shown in Figure 4.
Figure 4. Structure of GRU.
BiGRU refers to the Bidirectional Gated Recurrent Unit, i.e., a GRU with an additional reverse layer. BiGRU processes both the forward and reverse information of the input sequence, capturing features more comprehensively and improving model performance. It can be used for a variety of tasks such as speech recognition, person-name recognition, and lexical annotation.
BiGRU has advantages over GRU, including bidirectionality, better performance, better handling of long sequences, and finer-grained feature representation, and has therefore become a very effective model in various sequence learning tasks. Because a unidirectional GRU retains less state information, it is prone to vanishing or exploding gradients, and its handling of long sequences may not match BiGRU's. BiGRU introduces more state information through the reverse layer, which improves the handling of long sequences and reduces the risk of vanishing or exploding gradients. To combine BiGRU and Softmax, the input text sequence is encoded with BiGRU, the encoded result is passed to a fully connected layer, and classification is performed with the Softmax activation function: the BiGRU output feeds the fully connected layer, whose output is used for label prediction via Softmax. This combination can effectively improve text categorization performance, especially on complex text datasets.
The BiGRU model operates on a given sequence of input vectors $x = (x_1, \ldots, x_T)$, where $x_t$ denotes a concatenation of input features, and computes the corresponding hidden activations while generating a sequence of output vectors from the input data. At time $t$, the current hidden state is determined by three components: the input vector $x_t$, the forward hidden state $\overrightarrow{h_t}$, and the backward hidden state $\overleftarrow{h_t}$. The reset gate $r_t$ controls how much previous state information is ignored; the smaller the value of $r_t$, the more previous state information is ignored. The update gate $z_t$ controls the extent to which the unit state receives new input information. The symbol $\odot$ denotes elementwise multiplication, $\sigma$ denotes the sigmoid function, and $\tanh$ denotes the hyperbolic tangent function. The hidden state $h_t$, update gate $z_t$, and reset gate $r_t$ of the BiGRU are computed by Equations (6)–(9):

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \qquad (6)$$

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \qquad (7)$$

$$\tilde{h}_t = \tanh\!\bigl(W_h x_t + U_h (r_t \odot h_{t-1})\bigr) \qquad (8)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (9)$$
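A single-step NumPy sketch of Equations (6)–(9) follows; the weight matrices here are random placeholders purely for illustration.

```python
# One GRU step implementing Equations (6)-(9); weights are random
# placeholders, not trained parameters.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate, Eq. (6)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate, Eq. (7)
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))   # candidate, Eq. (8)
    return (1 - z_t) * h_prev + z_t * h_tilde           # new state, Eq. (9)

d_in, d_h = 8, 4
rng = np.random.default_rng(0)
W = [rng.normal(size=(d_h, d_in)) for _ in range(3)]
U = [rng.normal(size=(d_h, d_h)) for _ in range(3)]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h),
             W[0], U[0], W[1], U[1], W[2], U[2])
```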
Softmax functions are widely used in tasks such as text categorization, sentiment analysis, and machine translation. Often, we need to represent a piece of text as a vector to facilitate subsequent computation and analysis. A common approach is to use word embedding to map each word to a low-dimensional real-valued vector and then transform the entire text into a fixed-length vector through aggregation or transformation operations. In the attention computation that follows, $W$ represents the weight matrix of the attention function, $\tanh$ refers to the hyperbolic tangent function, $y$ represents the sentiment analysis result, and $b$ represents the corresponding bias of the output layer.
After completing the text vector representation, we also need to perform tasks such as classification and labeling. At this point, the Softmax function can be used to map the text vectors to different classes of probability distributions. Specifically, in natural language processing tasks, a neural network model is usually used as a classifier, with text vectors as inputs, and after several layers of fully connected layers and nonlinear activation functions, the outputs are finally mapped to individual categories using the Softmax function, and the probability values of each category are calculated. Ultimately, we can consider the category with the largest probability value as the category to which the text belongs.
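A worked example of this final step, assuming three illustrative logits for the positive/neutral/negative classes of Section 3:

```python
# Softmax maps fully connected outputs to class probabilities; the
# largest probability decides the predicted sentiment class.
import numpy as np

logits = np.array([1.2, 0.3, -0.8])            # fully connected layer output
probs = np.exp(logits) / np.exp(logits).sum()  # softmax
labels = ["positive", "neutral", "negative"]
print(dict(zip(labels, probs.round(3))))
# {'positive': 0.649, 'neutral': 0.264, 'negative': 0.088}
print("predicted:", labels[int(probs.argmax())])   # predicted: positive
```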
This paper investigates the Bert BiGRU Softmax model for sentiment analysis of online product quality reviews. The sentiment BERT model is used as the input layer for feature extraction in the preprocessing stage. The bidirectional GRU hidden layer performs dimension-oriented sentiment classification, using bidirectional memory and gated recurrent units to maintain the long-term dependencies inherent in the text regardless of its length and the number of occurrences.
The Softmax output layer calculates sentiment polarity by merging down to smaller weighted dimensions according to the attention mechanism.
Figure 6 shows the structure of the Bert BiGRU Softmax model as follows.
The objective of sentiment analysis is to uncover the subjective emotional inclinations expressed by users towards products, as conveyed through online information. By leveraging deep learning techniques, sentiment analysis aims to establish connections between various features such as syntax, semantics, and emoticons, and sentiments. It involves categorizing user-generated content into positive, negative, or neutral sentiments, thereby enabling a better understanding of users' opinions and attitudes towards goods.
6. Discussion and Conclusion
Online reviews are consumer evaluations and feedback on products sold on online platforms. These reviews record multimodal information such as user experience, product characteristics and service satisfaction, and are important references for both manufacturers and platform operators. By analyzing customer feedback from reviews, vendors and platform operators can learn how the products they produce and sell are performing in the marketplace, and at the same time be able to continually improve their product designs and optimize their service processes based on user needs and expectations, thereby improving product quality and maintaining customer relationships.
The proposed model comprises several integral components: foremost, the utilization of the Bert model as a feature extractor facilitates the extraction of semantic representations from the input layer's textual comments, thereby capturing nuanced semantic relations between words and contextual information. Subsequently, the integration of the BiGRU model, augmented with an attention mechanism, assumes the role of a hidden layer, facilitating the acquisition of high-dimensional semantic coding encompassing attention probabilities at the input layer and textual sequences of contextual information. Finally, the Softmax classifier is employed to categorize all comments into sentiment polarity prediction and trend classification tasks, thus enabling effective recognition and analysis of users' affective tendencies and emotional attitudes.
Comprehensive experiments were conducted on a substantial review dataset, comparing the performance of the proposed Bert BiGRU Softmax model against other advanced models such as CNN BiGRU, BiGRU, and BiGRU Attention. The experimental findings unequivocally demonstrate that the Bert BiGRU Softmax model exhibits superior performance and accuracy, thereby effectively augmenting the precision of sentiment analysis pertaining to online product quality reviews.
Furthermore, we delve into the integration of domain expertise and human experiential insights into the training process of the model, aiming to cater more effectively to diverse industries and contextual nuances. Looking ahead, there is ample room for further refinement and optimization of the model. This can be achieved by amalgamating multimodal data and leveraging a plethora of information sources to conduct extensive and profound investigations into sentiment analysis. Such endeavors hold promise for uncovering novel breakthroughs and innovations, ultimately enhancing the performance of the model in real-world scenarios.