3.3. Single-method similarity
Semantic relatedness with transformers: Transformers, as a remarkable development in the field of neural networks, have revolutionized the domain of natural language processing [18]. In contrast to conventional approaches that heavily depend on manually engineered features and statistical models, transformers employ a distinct mechanism known as self-attention [19]. This mechanism enables the model to dynamically allocate attention to various parts of the input, facilitating the capture of long-term dependencies within the language data.
Among the wide variety of pre-trained language models, BERT (Bidirectional Encoder Representations from Transformers) stands out as one of the most influential and widely adopted models [10]. As one of the earliest pre-trained models, BERT has significantly shaped the field by providing a powerful representation-learning framework. By training on large-scale corpora, BERT learns contextualized word embeddings that encapsulate the semantic information of words based on their surrounding context. Its availability has advanced many downstream natural language processing tasks, such as sentiment analysis, named entity recognition, and machine translation, which have seen notable improvements in accuracy and performance. Other models based on the transformer architecture, such as GPT, RoBERTa, and MiniLM-L6, use similar techniques to capture contextualized word representations and enable a wide range of natural language processing applications. Each model offers enhancements or modifications tailored to specific challenges and requirements in NLP tasks, contributing to the continuous progress of natural language processing and expanding its capabilities across various fields.
These models are built upon a transformer-based neural network architecture. They are trained on vast amounts of unlabeled text data, enabling them to develop a deep understanding of natural language patterns and structures. BERT and many of its counterparts developed by Microsoft, Facebook, OpenAI, or HuggingFace are characterized by bidirectional language modeling: they analyze the context of a word by considering both the preceding and the succeeding words in a sentence. This bidirectional approach allows for a more nuanced understanding of word relationships and linguistic subtleties. In this paper, we conduct experiments evaluating several of these pre-trained models.
The process starts with pretrained models that represent individual words as vectors. These vectors encapsulate the semantic meaning of the respective words. However, when dealing with n-grams or phrases, additional techniques are required to combine the vectors associated with each component into a single vector that represents the entire n-gram or phrase. This paper explores three distinct methods employed for this purpose, which are described below. In the experimental section, we present and analyze the results obtained with these methods.
One approach is to use the embedding of the [CLS] token, which is added to the beginning of each sentence in a batch during the pre-processing stage. The [CLS] token is designed to represent the entire sentence, and its embedding captures the meaning of the sentence. One way to compute the similarity between two sentences (in this case, n-grams) with a pretrained language model is to take the cosine similarity of their [CLS] token embeddings, which yields a score between -1 and 1, where 1 indicates that the sentences are semantically identical and -1 that they are completely dissimilar. Alternatively, the dot product of the embeddings can be used; it coincides with the cosine similarity when the embeddings are normalized to unit length.
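As an illustration, the sketch below computes the [CLS]-based similarity between two n-grams with the HuggingFace transformers library. It is only a minimal example: the bert-base-uncased checkpoint and the cosine-similarity variant are assumptions made for illustration, not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch: [CLS]-based similarity between two short phrases.
# The checkpoint name is only an example.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embedding(phrase: str) -> torch.Tensor:
    """Return the hidden state of the [CLS] token (first position)."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]

def cls_similarity(a: str, b: str) -> float:
    """Cosine similarity between the [CLS] embeddings of two n-grams."""
    return torch.nn.functional.cosine_similarity(
        cls_embedding(a), cls_embedding(b), dim=0
    ).item()

print(cls_similarity("neural network", "artificial neural network"))
```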
Another approach to computing sentence similarity with BERT-like language models is to take the average of the token embeddings for each sentence. This is known as the mean-pooling approach and, from now on, this method will be referred to as [AVG]. To compute sentence similarity with mean pooling, we first run the pretrained model on the input sentence and obtain the hidden states for each token. Then, we take the average of the hidden states for each sentence. Finally, we can compute the similarity between two sentences by taking the dot product or cosine similarity of their averaged token embeddings. This approach can be useful in use-cases where we want to capture the overall similarity between two sentences rather than comparing only the [CLS] token embeddings. However, it should be noted that this approach might not work as well as the [CLS] token approach in all cases, as it may not capture the entire meaning of the sentence. In our case, this is not a problem, as we are dealing with very short phrases, typically consisting of at most 3 or 4 words.
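A sketch of the [AVG] variant follows; it reuses the tokenizer, model, and torch import from the [CLS] sketch above and averages the token hidden states, with the attention mask used to exclude padding positions.

```python
# Sketch of the [AVG] (mean-pooling) similarity, reusing tokenizer/model above.
def avg_embedding(phrase: str) -> torch.Tensor:
    """Average the token hidden states (padding excluded via the attention mask)."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # (seq_len, dim)
    mask = inputs["attention_mask"][0].unsqueeze(-1)     # (seq_len, 1)
    return (hidden * mask).sum(dim=0) / mask.sum()

def avg_similarity(a: str, b: str) -> float:
    return torch.nn.functional.cosine_similarity(
        avg_embedding(a), avg_embedding(b), dim=0
    ).item()
```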
Another approach to compute sentence similarity using BERT is to take the maximum of the token embeddings for each sentence, also known as the max-over-time pooling approach and referred to as [MAX].
To compute sentence similarity using the max-over-time pooling approach, we first run the model on the input sentence and obtain the hidden states for each token, as for the [AVG] method. Then, we take the maximum of the hidden states for each sentence by taking the maximum value over the sequence dimension. Finally, we can compute the similarity between two sentences by taking the dot product or cosine similarity of their maximum token embeddings. This approach can be useful in use-cases where we want to identify the dominant meaning or features of a sentence. As with mean pooling, it should be noted that max-over-time pooling may not capture the entire meaning of the sentence, and it may also be sensitive to outliers.
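A corresponding sketch of the [MAX] variant is shown below, again reusing the tokenizer and model loaded earlier; the element-wise maximum is taken over the sequence (token) dimension.

```python
# Sketch of the [MAX] (max-over-time pooling) similarity.
def max_embedding(phrase: str) -> torch.Tensor:
    """Element-wise maximum of the token hidden states over the sequence."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # (seq_len, dim)
    return hidden.max(dim=0).values

def max_similarity(a: str, b: str) -> float:
    return torch.nn.functional.cosine_similarity(
        max_embedding(a), max_embedding(b), dim=0
    ).item()
```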
WordNet-based similarity: WordNet is a lexical database and semantic network that organizes words and their meanings into a hierarchical structure [20,21]. It provides a comprehensive and structured resource for understanding the relationships between words, synonyms, antonyms, and the hierarchical structure of concepts. In WordNet, words are grouped into synsets (synonym sets), each representing a set of words that are closely related in meaning. Each synset represents a distinct concept or meaning. Synsets are connected through semantic relations, such as hyponyms (subordinate concepts), hypernyms (superordinate concepts), meronyms (part-whole relationships), and holonyms (whole-part relationships).
WordNet's primary purpose is to facilitate the exploration of semantic relationships between words and to measure their semantic similarity. Again, three measures were used to assess the similarity between n-grams:
The shortest path length measure computes the length of the shortest path between two synsets in the WordNet graph, representing the minimum number of hypernym links required to connect the synsets. This measure assigns a higher similarity score to word pairs with a shorter path length, indicating a closer semantic relationship. It will be referred to as path.
Wu-Palmer Similarity: The Wu-Palmer similarity measure utilizes the depth of the LCS (Lowest Common Subsumer - the most specific common ancestor of two synsets in WordNet's hierarchy) and the shortest path length to assess the relatedness between synsets. By considering the depth of the LCS in relation to the depths of the synsets being compared, this measure aims to capture the conceptual similarity based on the position of the common ancestor in the WordNet hierarchy. It will be referred to as wu.
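Both the path and wu measures are available in NLTK's WordNet interface, as in the sketch below. For brevity, only the first synset of each word is compared; mapping the words of an n-gram to synsets, and handling words without synsets, is assumed to be done elsewhere.

```python
# Sketch of the path and wu measures with NLTK's WordNet interface.
# nltk.download("wordnet") may be required on first use.
from nltk.corpus import wordnet as wn

def path_sim(word1: str, word2: str) -> float:
    """Shortest-path-based similarity ("path"): NLTK returns
    1 / (1 + shortest_path_length), so shorter paths score higher."""
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    return s1.path_similarity(s2)

def wu_sim(word1: str, word2: str) -> float:
    """Wu-Palmer similarity ("wu"), based on the depth of the LCS."""
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    return s1.wup_similarity(s2)

print(path_sim("car", "automobile"), wu_sim("car", "bicycle"))
```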
Measure based on distance: analogously to the shortest path length, this measure is also based on the minimum distance between two synsets. This measure, hand-crafted by the authors, reflects the principle that the shorter the distance, the greater the similarity: when the distance is greater than 0, the similarity decreases as the distance grows, while when the distance is 0 (i.e., the synsets coincide), min_dist = 1.
This measure considers only the distance between synsets, in a depth-independent way, and was tuned through several tests and evaluations, so that substantial weight is given not only to synonyms (distance 0), but also to hypernyms or hyponyms (distance 1) and siblings (distance 2). This measure will be referred to as min_dist.
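Since the exact equation is not reproduced here, the sketch below should be read only as an illustration of the idea: the distance is obtained from NLTK, the boundary case min_dist = 1 for distance 0 follows the text, and the decreasing function used for distance > 0 (1 / (2 * distance)) is an assumed stand-in rather than the authors' actual formula.

```python
# Illustrative sketch of a min_dist-style measure; the decay function for
# distance > 0 is an assumption, not the formula used in the paper.
from nltk.corpus import wordnet as wn

def min_dist_sim(word1: str, word2: str) -> float:
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    distance = s1.shortest_path_distance(s2)   # minimum number of edges
    if distance is None:                       # no path between the synsets
        return 0.0
    if distance == 0:                          # same synset: synonyms
        return 1.0
    return 1.0 / (2.0 * distance)              # assumed decay with distance
```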
String comparison algorithms: Jaro, Jaro-Winkler, Levenshtein, and similar measures are string comparison algorithms that quantify the similarity (or distance) between two strings based on their characters and their order. These measures are useful in various applications, including record linkage, data deduplication, and fuzzy string matching. Here, the Jaro-Winkler similarity has been used, which is an extension of the Jaro similarity measure: it incorporates a prefix scale that rewards strings sharing a common prefix. The Jaro-Winkler similarity score ranges from 0 to 1, with 1 indicating high similarity and a closer alignment of the prefixes.
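As a self-contained illustration, the sketch below implements the Jaro and Jaro-Winkler similarities with the standard prefix weight p = 0.1 and a maximum prefix length of 4; in practice, an off-the-shelf implementation can be used instead.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]: 1 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and within half the longer length (minus one).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s2_used = [False] * len(s2)
    matches_s1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_used[j] and s2[j] == c:
                s2_used[j] = True
                matches_s1.append(c)
                break
    if not matches_s1:
        return 0.0
    matches_s2 = [s2[j] for j, used in enumerate(s2_used) if used]
    # Half the number of matched characters that appear in a different order.
    transpositions = sum(a != b for a, b in zip(matches_s1, matches_s2)) // 2
    m = len(matches_s1)
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3.0

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler similarity: boosts the Jaro score for a shared prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

print(jaro_winkler("MARTHA", "MARHTA"))   # ~0.961
```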