1. Introduction
Informally, Information Retrieval (IR) can be defined as the set of methods for processing information to construct collections of documents. Probabilistic latent semantic analysis (PLSA) was first formulated as an unsupervised IR technique. This method, also known as Probabilistic latent semantic indexing (PLSI), was introduced in conference proceedings [1,2]. The classical reference is Unsupervised learning by probabilistic latent semantic analysis by Hofmann [3]. PLSA is based on the ideas of Latent semantic analysis (LSA) [4] and is in fact a probabilistic remake of it. LSA crosses the terms and documents of a corpus to obtain a table of co-occurrence counts. Then, arranging the frequencies in a matrix, the space spanned by the Singular value decomposition (SVD) is considered a set of latent variables and interpreted as the aspect model [5]. PLSA instead decomposes the frequencies as mixtures, or aggregate Markov models [3], adjusted with the Expectation-maximization (EM) algorithm.
Although PLSA was formulated as an IR technique, it has been used for diverse purposes. PLSA’s versatility, clarity of results, and solid statistical properties have enabled a wide range of applications in which the concepts of words and documents are assimilated into other discrete entities, thus justifying the hypotheses on which PLSA relies. However, PLSA has several problems: the nature of the data and the underlying hypotheses lead to a rigid model; the iterative fitting, based on the EM algorithm, converges very slowly; and the latent variables lack a probabilistic interpretation. These problems translate into uneven growth, partly determined by algorithmic and computational advances. These limitations have prompted several reformulations and a myriad of algorithms, the development of related techniques such as Latent Dirichlet allocation (LDA), and other studies focused on the relationship between PLSA and Non-negative matrix factorization (NMF).
Many surveys and review articles include PLSA as a technique for IR, such as [6], with a classical perspective. Other works compile this technique as an alternative for classifying opinions from Twitter [7] or as a method for detecting fake news [7]. However, few reviews have focused exclusively on PLSA. One such review, by Tian [8], focuses on semantic image analysis.
In this manuscript, we pay special attention to what has been written on PLSA, to the extension of this method to data structures less restrictive than co-occurrence or contingency tables, to the results obtained by modifying the underlying hypotheses, and to the relationship with other techniques. These results make PLSA a fundamental method, providing a probabilistic interpretation of the SVD.
This article is structured to reflect this point of view. Section 3 is dedicated to the solutions of the two PLSA formulations and to the extent of their use as a supervised and semi-supervised technique. The criticism introduced by LDA is the starting point of several contributions, examined in Section 4. The mixtures into which PLSA decomposes the data relate it in a natural way to the NMF; from this viewpoint, extensions are introduced in Section 5 and are considered to represent a qualitative leap. Extensions that are conceptually relevant according to the exposed criteria are compiled in Section 6 and summarized in Table 1. Works dedicated to efficiency improvement are discussed in Section 7; although none of them provides definitive results, they illustrate the efforts to construct fast and reliable computational solutions. In addition, this article attempts to use a consistent notation throughout.
2. Literature Review
According to our bibliographic searches (Web of Science, Scopus, arXiv, and Google Scholar), there is a relatively large number of articles based on PLSA. It has successfully spanned many research areas, as shown in Table 2.
The widest field of application is Engineering, which we consider separately from Information Engineering or Computer Science. Applications of PLSA in this field rely on its ability to handle discrete entities as words. One example is [19], which seeks communication with machines, proposes to use the method to obtain a distribution, and claims advantages over supervised methods.
Applications in Information Engineering include the study of syntactic structure, briefly examined in [20]; collaborative filtering, in which user ratings are used to construct suitable matrices on which to run PLSA algorithms [21]; speech recognition, introducing a score concatenation matrix [22,23]; cybersecurity [24]; and the analysis of keywords from websites related to certain topics, together with sentiment analysis (involving a system of definitions on which the users’ opinions and other instances are analyzed as co-occurrences) [25,26].
We adopt Tian’s criteria for PLSA image applications. Tian classifies contributions into three types: image annotation, image retrieval, and image classification. Image annotation involves generating textual words to describe the content of an image [27], with diverse applications including internet image filtering [28]. Image retrieval consists of ranking the images in a database according to their posterior probabilities of being relevant, in order to infer which visual patterns describe each object [8]; see the pioneering work [29] and uses in clinical image diagnosis [30,31] or facial expression semantics [32]. Image classification [33] has also enabled pain recognition [34] and autonomous driving [35]. Several variants exist, such as co-regularized PLSA for the recognition of images observed from different perspectives [36], among many others. For applications and developments in semantic image analysis, we refer readers to the Tian (2018) review and its selected references [8].
Examples in the Life Sciences can be found in bioinformatics [37,38], identifying genomic sequences with documents and certain classes of genotype characteristics with words; in the study of neurodegenerative diseases, identifying common and non-common symptoms [39]; and in biology, where [40] is devoted to the prediction of the nuclear localization of proteins.
Applications in the fundamental sciences include geophysics [41], instrumentation [42], and spectroscopy [43]. A recently published comparative study of PLSA, Latent Dirichlet allocation (LDA), and other techniques in the framework of spectroscopy is [44].
PLSA is a fundamental method that gives probabilistic sense to the SVD. Limitations of the original formulation compromise this assertion and cause problems of various types (described in Section 4). We pay special attention to works involving reformulations and to those that relate PLSA to other techniques. In this manuscript, we have selected works that imply a methodological contribution; we do not consider the number of citations of each work, the impact of the publication, or interpretative orthodoxy. In this way, we treat many publications on an equal footing, using interpretability or computability criteria, especially those that are extensions of the method.
3. The Method: PLSA Formulas
The original formulation of the PLSA, according to [3], provides a probabilistic solution to the problem of extracting a set of latent variables from a data frame obtained by crossing a corpus of documents with a thesaurus of words. The relative frequencies are estimated by the joint probability of documents and words. A key idea in this method is decomposing this probabilistic approximation as a product of conditional distributions over a set of latent variables. After some manipulations, and using Bayes’ rule,

$$P(d,w) = P(d)\sum_{z} P(w\mid z)\,P(z\mid d), \qquad (2)$$

$$P(d,w) = \sum_{z} P(z)\,P(w\mid z)\,P(d\mid z), \qquad (3)$$

where $P(d)$ and $P(z)$ are the probabilities of the document $d$ and the latent variable $z$, respectively. Formulas (2) and (3) were called by Hofmann the asymmetric and symmetric formulations [13], or formulations I and II [45].
The discrete nature of the documents identifies each one with its probabilities over the latent variables and justifies the postulate that the mixtures are k independent identically distributed (iid) multinomials. Because the same occurs for the words, the objective is to determine the parameters defining the conditional probabilities $P(w\mid z)$ and $P(z\mid d)$ for the asymmetric formulation (alternatively, $P(w\mid z)$ and $P(d\mid z)$ for the symmetric case), with no hypothesis regarding the number or distribution of the latent variables $z$, which form a set of dummy variables with no probabilistic sense.
The adjustment of the mixtures given by Formulas (2) and (3) is the other key idea: a reliable probabilistic interpretation is obtained by maximizing the likelihood of the parameters. A method widely used for this purpose is the EM algorithm, which always converges [46]. The use of the EM algorithm is roughly equivalent to fitting the estimated joint probability to the observed relative frequencies, while ensuring a maximum likelihood estimation of the sufficient (not necessarily minimal) parameters.
In fact, the EM algorithm is a consequence of Jensen’s inequality [47]. An auxiliary function Q is built by applying a map M to the complete-data log-likelihood; in statistical usage M is the expectation, usually written E. Maximizing Q at each iteration makes the log-likelihood a monotonically increasing sequence, which reaches its limit at a fixed point. In the PLSA case, the parameters (which the model does not provide in closed form) are the mixtures of relations (2) or (3).
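To make the role of Jensen’s inequality concrete, the following standard derivation (written in generic notation, not the survey’s equation numbering) shows why each EM iteration cannot decrease the log-likelihood:

$$
\log P(x;\theta) \;=\; \log \sum_{z} q(z)\,\frac{P(x,z;\theta)}{q(z)} \;\ge\; \sum_{z} q(z)\,\log\frac{P(x,z;\theta)}{q(z)} \;=\; Q(\theta, q),
$$

with equality when $q(z) = P(z\mid x;\theta)$. Choosing this $q$ in the E-step and maximizing $Q$ over $\theta$ in the M-step therefore produces a monotonically non-decreasing sequence of log-likelihood values.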
The EM algorithm consists of two steps: expectation and maximization. The expectation (E-step) is computed on the log-likelihood and, for parametrization (2) or (3), takes different forms for the asymmetric and symmetric cases. In both cases, the expectation is taken with respect to the posterior of the latent variables and, after several manipulations, the resulting expressions for both formulations are those shown in Table 3.
The calculation of the expectation presents several complications related to the meaning of the primed index appearing in the formulas of Table 3, whose interpretation requires consideration of the expression in Formula (8). For computational purposes, the object supporting this data structure is an array containing the matrices with the posterior estimates, one matrix for each fixed value of the primed index. Each element of the array is thus a matrix (it should be noticed that a vector multiplied by its transpose is a matrix). The vec notation has been used to better identify the scalar products of the vectors of probabilities obtained by varying the latent variable; stacking these matrices yields the entire array.
The maximization (M-step) uses Lagrange multipliers and the corresponding derivatives to obtain the solutions maximizing the probabilities once the multipliers are eliminated. These solutions for each formulation yield the generative models shown in Figure 1.
In both formulations, the adjustment of the probabilities involves selecting a value for k, initializing the distributions appearing in (2) or (3), and computing the E-step and M-step in an iterative process in which the likelihood is recalculated until a stopping condition is met. Hofmann has noted that the iterative process can end when there are no changes in the qualitative results, a condition called early stopping [3]. A detailed, accessible derivation of the PLSA formulas and an introductory discussion of the convergence of the EM algorithm can be found in [45].
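As an illustration of this iterative process, the following minimal NumPy sketch fits the asymmetric formulation (2) by EM. It is a generic sketch in our own notation (the array names n, p_z_d, and p_w_z are illustrative), not Hofmann’s implementation, and it omits refinements such as tempering or early stopping.

```python
import numpy as np

def plsa_em(n, k, n_iter=100, seed=0):
    """Minimal PLSA sketch (asymmetric formulation) fitted with EM.

    n : (D, W) array of co-occurrence counts n(d, w)
    k : number of latent variables (aspects)
    Returns P(z|d) of shape (D, k) and P(w|z) of shape (k, W).
    """
    rng = np.random.default_rng(seed)
    D, W = n.shape
    # Random initialization of the mixtures, normalized to probabilities.
    p_z_d = rng.random((D, k)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) proportional to P(z|d) P(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # (D, k, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate the mixtures from the expected counts n(d,w) P(z|d,w).
        expected = n[:, None, :] * post                        # (D, k, W)
        p_w_z = expected.sum(axis=0)                           # (k, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)                           # (D, k)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z
```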
Another point to consider is what the PLSA solutions are. Table 3 provides the formulas leading to them. In many cases, it is more appropriate to report the words or documents that best identify each aspect or latent variable. The numerical values of the columns of the involved matrices are then ordered, and the corresponding labels are substituted, thus revealing the most relevant items in each latent class. Because the type of solution desired is unambiguous in its context, it is rarely made explicit, yet there is no confusion about which result to provide. As an example, consider two cases related to image analysis: for classification purposes, qualitative solutions are more suitable, whereas numerical solutions are more suitable for spatial co-occurrence analysis on image regions.
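A minimal sketch of this ordering step, assuming a trained matrix of word-given-aspect probabilities (here called p_w_z, one row per aspect) and a vocabulary list; the names are illustrative:

```python
import numpy as np

def top_words(p_w_z, vocabulary, n_top=10):
    """For each latent variable, return the n_top most probable words:
    the probabilities are ordered numerically and then replaced by the
    corresponding word labels, revealing the most relevant items per class."""
    ranked = []
    for z in range(p_w_z.shape[0]):
        order = np.argsort(p_w_z[z])[::-1][:n_top]   # descending probability
        ranked.append([vocabulary[j] for j in order])
    return ranked
```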
Example 1.
An example provided by Hofmann is reproduced below to illustrate the sense of the word ranking for interpretation: "the 4 aspects most likely to generate the word 'segment', derived from a K=128 aspect model of the CLUSTER document collection. The displayed word stems are the most probable words in the class-conditional distributions, from top to bottom in descending order" [3].
| Aspect 1 | Aspect 2 | Aspect 3 | Aspect 4 |
|---|---|---|---|
| imag | video | region | speaker |
| SEGMENT | sequenc | contour | speech |
| color | motion | boundari | recogni |
| tissu | frame | descript | signal |
| Aspect1 | scene | imag | train |
| brain | SEGMENT | SEGMENT | hmm |
| slice | shot | precis | sourc |
| cluster | imag | estim | speakerindepend |
| mri | cluster | pixel | SEGMENT |
| algorithm | visual | paramet | sound |
In addition, we provide an artificial example to illustrate the effect of the selection of K, consisting of a corpus of 5 documents containing letters, which we assimilate to the words of a thesaurus, with co-occurrence data frame N and frequency matrix n. In this example, the objective is to classify the documents by subject (or the specialized words by the corresponding subject), and simple visual inspection indicates that there are 3 subjects. Running the symmetric-case formulas with the same number of iterations for each value of K, the resulting character matrices give the ordination of the most likely words identifying each latent variable (informally, the subjects in our toy example); lines represent probabilities close to zero and are not useful for classification. The effect of selecting K is clear from the comparison of columns 3 and 5, which are equivalent.
3.1. Training and Prediction
The PLSA algorithm can be executed on the entire data set, providing results in the same manner as probabilistic clustering methods [48, Chap. 3]. However, to exploit the predictive power of PLSA, the model must be fitted on the available data (the training phase); predictions for new observations are then made by simply comparing them with the trained data set.
In the prediction phase, probabilities cannot be assigned to documents that were not present in the training phase, because non-zero probabilities are needed. This problem was solved in [49] by splitting the data set into a training group containing the observed documents and a group of new, unobserved documents. By using the corresponding probabilities in (2) and expanding the logarithm, equation (6) can be rewritten as equation (12), in which the contributions of observed and unobserved documents separate into two terms. To avoid a zero probability for the documents unseen in the training phase, Brants introduced an approximation, stating that the log-likelihood can be maximized by taking into account only the second term of (12), which yields equation (13).
Brants has highlighted that equation (13) does not represent the true likelihood, but if the goal is likelihood maximization, the same parameter setting is found as when the true likelihood is maximized [49]. The same article proposes other methods for estimating the likelihood on the basis of marginalization and splitting. Brants has also proposed PLSA folding-in, a more refined derivation of this technique [50]. A further improvement, which is more computationally efficient and is protected by a patent [51], involves estimating the log-likelihood by splitting the data set into the training set and introducing the unknown documents one by one through the second term.
In the symmetric formulation, after training on the documents by using the formulas given in Table 3, new documents can be classified by simply alternating the expressions given in [38]. In this case, binary data can be handled by entering an indicator matrix and substituting it in the equations of Table 3.
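As a rough sketch of this prediction idea, the trained word-given-aspect probabilities can be kept fixed while only the mixture of an unseen document is re-estimated with the same E/M updates; this is a generic folding-in style illustration, not Brants’ exact procedure.

```python
import numpy as np

def fold_in(n_new, p_w_z, n_iter=50, seed=0):
    """Folding-in sketch: P(w|z) is kept fixed from training, and only the
    mixing proportions P(z|d_new) of an unseen document are re-estimated.

    n_new : (W,) count vector of the unseen document
    p_w_z : (k, W) trained word-given-aspect probabilities
    """
    rng = np.random.default_rng(seed)
    k = p_w_z.shape[0]
    p_z_d = rng.random(k); p_z_d /= p_z_d.sum()
    for _ in range(n_iter):
        # E-step restricted to the new document: posterior P(z|d_new, w).
        post = p_z_d[:, None] * p_w_z                  # (k, W)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step updates only the document-specific mixture.
        p_z_d = (n_new[None, :] * post).sum(axis=1)
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d
```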
PLSA can also be used as a semi-supervised learning tool, in a process known as semi-supervised PLSA [52]. Using this mode requires entering labeled and non-labeled data into the EM iterative process and being able to split the data set into a portion in which the labels are assigned and a portion in which they are not; a measure of similarity performs the rest of the task. Another related strategy involves introducing the must-link and cannot-link functions in the training phase [53].
5. NMF Point of View
The algebraic objects that support the probabilities of (2) or (3) are matrices whose entries are restricted to the interval [0,1] and are therefore non-negative. To construct such matrices, the transformation (1) identifies the data with a multivariate matrix, and the matrix containing the probabilities is obtained by dividing each entry by the total count. This is a special case of probabilistic transformation in which the probabilities are understood in Laplace’s sense, as relative frequencies.
For matrices obtained in this way, the standard formulation of the NMF (Equation (25)) [69, p. 131] decomposes the probability matrix into a product of two non-negative factors plus an error matrix, which makes the NMF suitable for use in alternative formulations of the PLSA [70].
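In generic notation (ours, not necessarily the survey’s), this standard formulation can be written as

$$
P \;=\; WH + E, \qquad W \in \mathbb{R}_{\ge 0}^{m\times k},\quad H \in \mathbb{R}_{\ge 0}^{k\times n},
$$

where $P$ is the non-negative data (probability) matrix, $W$ and $H$ are the non-negative factors, and $E$ is the error matrix to be made as small as possible under a chosen objective (a norm or a divergence).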
Many authors attribute the introduction of this technique to the work of Paatero [74], while others attribute it to the publication of Lee and Seung [75]. The two approaches are not equivalent: while Paatero uses the Euclidean norm as the objective function, Lee and Seung use the I-divergence. (Distances d are maps that satisfy, for vectors x, y, and z, the following axioms: symmetry, d(x,y) = d(y,x); identity, d(x,y) = 0 if and only if x = y; and the triangular inequality, d(x,z) ≤ d(x,y) + d(y,z). A divergence D does not satisfy one of these axioms, usually symmetry, which makes it more suitable for measuring how similar densities are.) Furthermore, the work of Lee and Seung is focused on the clustering problem, and this attribution creates conceptual errors in many works, identifying NMF techniques with classification. An earlier, algebraically rigorous and sound formulation of the NMF is due to Chen [76]. A brief introduction to NMF as an optimization problem can be found in [77, Chap. 6]; a more standard introduction is provided in [48, Chap. 7].
On the other hand, the SVD arose from the efforts of several generations of mathematicians, from the nineteenth-century works of Beltrami [78] and, independently, Jordan [79], to more recent contributions regarding inequalities between eigenvalues and matrix norms by Ky Fan [80,81]. Currently, the SVD plays a central role in algebra, constituting a field known as eigenanalysis; it serves as a basis for matrix function theory [82] and is also the basis of many multivariate methods. This research field remains active. Currently, the theorem is formulated as follows [77, p. 275].
Theorem 1.
Let $A \in \mathbb{R}^{m\times n}$ (or $\mathbb{C}^{m\times n}$); then orthogonal (or unitary) matrices $U \in \mathbb{R}^{m\times m}$ (or $\mathbb{C}^{m\times m}$) and $V \in \mathbb{R}^{n\times n}$ (or $\mathbb{C}^{n\times n}$) exist such that
$$A = U\Sigma V^{\top} \quad (\text{or } A = U\Sigma V^{H}), \qquad (26)$$
where $\Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_p)$, $p = \min(m,n)$, with diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_p \ge 0$.
One of the first proofs can be found in [83]. The theorem as given is known as the full-rank SVD. The approximation obtained by truncating the decomposition is known as the low-rank approximation, and assumes an approximation for (26) [84]. In the PLSA context, which is connected with probabilities, only real matrices are used.
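A brief NumPy illustration of the full decomposition and its rank-k truncation (generic variable names, not the survey’s notation):

```python
import numpy as np

# Full-rank SVD of a real matrix and its rank-k (low-rank) truncation.
rng = np.random.default_rng(0)
A = rng.random((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U diag(s) V^T
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation

# The Frobenius error equals the norm of the discarded singular values.
print(np.linalg.norm(A - A_k), np.sqrt((s[k:] ** 2).sum()))
```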
Hofmann related PLSA (in the symmetric formulation) to the SVD in conference proceedings [1] and in [3], writing the formal equivalence
$$U = \big(P(d_i\mid z_k)\big)_{i,k}, \qquad (27a)$$
$$\Sigma = \operatorname{diag}\big(P(z_k)\big)_{k}, \qquad (27b)$$
$$V = \big(P(w_j\mid z_k)\big)_{j,k}, \qquad (27c)$$
where $U$, $\Sigma$, and $V$ are related to the SVD of the matrix of joint probabilities $P = U\Sigma V^{\top}$.
These relationships between the PLSA and the SVD have severe restrictions, because the data are frequencies obtained from counts obeying multinomial laws, whereas the SVD exists for every matrix with real entries. In addition, the conditions for the degree of adjustment of the factorization to the data are unclear, since no approximation bound is defined. The possibility of using smoothing techniques for cases in which the data are not frequencies is also omitted. The relations (27a)-(27c), first written by Hofmann, were considered a mere formal equivalence [3,38].
Several attempts focusing on the equivalence between PLSA and SVD, in light of the NMF, have aimed to build a more rigorous relation. The explicit relationship between PLSA and NMF, stated by Gaussier, minimizes the I-divergence subject to non-negativity constraints, expressed through the Karush-Kuhn-Tucker (KKT) conditions, where ⊙ denotes the Hadamard or element-wise product. The KKT conditions are a widespread optimization device when divergences are used. The solutions are the multiplicative updates of [75], in which the matrix quotient is the element-wise division.
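A compact sketch of these multiplicative updates for the I-divergence, in the style of Lee and Seung (generic NumPy code with illustrative names W and H, not the exact expressions (28) and (29) of the survey):

```python
import numpy as np

def nmf_idiv(X, k, n_iter=200, seed=0):
    """NMF with multiplicative updates for the I-divergence.
    X is a non-negative (m, n) matrix, e.g. relative frequencies."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        # H <- H * (W^T (X / WH)) / (column sums of W)
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + 1e-12)
        WH = W @ H + 1e-12
        # W <- W * ((X / WH) H^T) / (row sums of H)
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + 1e-12)
    return W, H
```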
After adjusting equation (25) in an iterative process, which consists of selecting a value of k and switching between (28) and (29) until a satisfactory degree of approximation is achieved, Gaussier introduced diagonal matrices of suitable dimension, stating that any (local) maximum solution of PLSA is a solution of the NMF with KL-divergence (I-divergence according to the nomenclature used herein) [11].
Further work by Ding [86], with the same divergence, introduced a normalization of the factor matrices such that a column-stochastic matrix and a row-stochastic matrix are obtained, calling these conditions probabilistic normalizations; the diagonal matrices involved contain the column sums of the respective sub-index matrices. Ding arrived at conclusions similar to Gaussier’s and assimilated the latent variables into the space spanned by the matrix factorization [87].
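A small sketch of this probabilistic normalization, under the assumption that a factorization X ≈ WH has already been computed (the names D_W and D_H are illustrative):

```python
import numpy as np

def probabilistic_normalization(W, H):
    """Rescale an NMF factorization X ≈ W H into a column-stochastic factor,
    a diagonal matrix, and a row-stochastic factor, mirroring the symmetric
    PLSA form P(d,w) = sum_z P(d|z) P(z) P(w|z)."""
    D_W = W.sum(axis=0)                    # column sums of W
    D_H = H.sum(axis=1)                    # row sums of H
    W_bar = W / (D_W[None, :] + 1e-12)     # column stochastic
    H_bar = H / (D_H[:, None] + 1e-12)     # row stochastic
    S = np.diag(D_W * D_H)                 # W @ H == W_bar @ S @ H_bar
    return W_bar, S, H_bar
```

If the entries of X sum to one, the diagonal of S plays the role of the latent-variable probabilities in the symmetric formulation.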
Conditions for the reverse result are shown in [18] using the KL divergence: the solutions are obtained after proving the corresponding equivalence, choosing the diagonal matrix appropriately and arranging its entries in decreasing order, with the same permutation applied to the columns of one factor and the rows of the other, and obtaining the respective column- and row-stochastic matrices. In this case, factorization (37) reaches the SVD of the orthonormalization of the data matrix (see [88, p. 24] for the orthonormalization process).
This procedure preserves matrix norms (also row or column norms) [63]. Moreover, minimization of the KL divergence is equivalent to maximization of the likelihood in certain cases (as can easily be seen by expanding the logarithm in the KL divergence as a difference: the first term is a constant, while the second term is the log-likelihood), although this equivalence is not exact. Minimization of this divergence is known as the em algorithm. In many cases, the results obtained with both methods are similar. Amari has shown that, in the general case, the em solutions are asymptotes of the EM algorithm solutions [89].
8. Discussion
Although Hofmann does not indicate when each formulation is applicable, the value of PLSA is clear and relatively simple. For the asymmetric formulation, it is an unsupervised learning method that trains a set of latent variables of unknown number and distribution when the data form co-occurrence or contingency tables; the model can also be extended to continuously evaluated entities. The use of the EM algorithm to adjust the probabilities provides a maximum likelihood estimation of the parameters, and the technique can be extended to semi-supervised cases. The symmetric formulation, however, places PLSA within the fundamentals of algebra and multivariate analysis: it is the probabilistic companion of the SVD. Nevertheless, some gaps remain before this statement makes full sense.
Convergence of the EM iteration is guaranteed; however, it has been noted that the convergence limit does not necessarily occur at a global optimum [141], and that the iteration does not necessarily converge to a point but may converge on a compact set [142], thus providing sub-optimal results [143]. In addition, sparse data structures can cause failures in convergence [144]. These statements stem from the interpretation of the SVD, and therefore of the PCA, as a low-rank approximation. However, they do not take into account the error bound of the low-rank approximation given by Schmidt’s approximation theorem [145]. Establishing such a bound for PLSA with the help of this result is an open question.
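For reference, in standard notation Schmidt’s theorem (also known as the Eckart-Young theorem) bounds the error of the rank-k truncation of the SVD:

$$
\min_{\operatorname{rank}(B)\le k}\; \lVert A - B\rVert_F \;=\; \lVert A - A_k\rVert_F \;=\; \Big(\sum_{i>k}\sigma_i^2\Big)^{1/2},
\qquad A_k = \sum_{i\le k}\sigma_i\,u_i v_i^{\top},
$$

where $\sigma_i$, $u_i$, and $v_i$ are the singular values and vectors of $A$; how this bound transfers to the probabilistic constraints of PLSA is precisely the open question mentioned above.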
The LDA techniques, and those built on them, are hierarchical models whose construction corresponds to particular fields of application. Such constructions are also possible in the symmetric formulation by assuming additional hypotheses on the data, such as distributions, qualitative categories, or hierarchical relations. This type of treatment amounts to a pre-processing of the data, preserving the meaning of the PLSA.
Furthermore, although the concept of probabilistic learning is sound, based on Vaillant’s work [146], symmetric PLSA is especially apt in the context of transfer learning. In this way, the certainty depends on the available data, as suggested by relation (60). Reanalyzing a problem with new (or complementary) data or observational variables can then provide learning sequences.
Although the applications of PLSA are numerous, more surprises are likely to be encountered. However, theoretical studies from the perspective of the relationship between PLSA and the SVD are scarce, and such studies are necessary for broader interpretation. In particular, real-valued data matrices of observed phenomena must be related to probabilities to extend the scope of PLSA applicability.