2.5. How Do Generative Models Work?
Given the training data and a set of parameters, $\theta$, a model can be built to estimate the probability distribution $p_{model}(x; \theta)$. The likelihood is the probability that the model assigns to the training data; for a dataset containing $m$ samples $x^{(i)}$, it is
$$\prod_{i=1}^{m} p_{model}\left(x^{(i)}; \theta\right) \tag{1}$$
Maximum likelihood provides a way to compute the parameters, $\theta^{*}$, that maximize the likelihood of the training data. To simplify the computation, the logarithm is taken in Equation (1) to express the probabilities as a sum rather than a product,
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\left(x^{(i)}; \theta\right) \tag{2}$$
If $p_{data}$ lies within the family of distributions $p_{model}(x; \theta)$, the model can recover $p_{data}$ exactly. In the real world, there is no access to $p_{data}$, and only the training data is available for modeling. The models must define their density function and find the $\theta$
that maximizes the likelihood. Generative models that explicitly represent the probability distribution of the data are called explicit density models [38]. Fully Visible Belief Networks (FVBNs) [39] and nonlinear independent component analysis [40] are examples of explicit density models that can optimize directly on the log-likelihood of the training dataset. However, they impose design restrictions, and their use is limited to simple problems. As the data becomes more complex and its dimensionality grows, finding the maximum likelihood becomes computationally intractable. Approximations to the maximum likelihood are then made, either deterministic, as in variational methods like the Variational AutoEncoder (VAE) [23], or stochastic, as in Monte Carlo methods [41]. The variational autoencoder is one of the popular semi-supervised generative modeling techniques, but it suffers from low-quality samples.
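The maximum likelihood principle described at the start of this section can be illustrated with a minimal sketch. It assumes a univariate Gaussian as the explicit density model (an assumption made only for illustration, since its maximum likelihood estimates have a closed form), and the data-generating values 2.0 and 1.5 are hypothetical.

```python
import math
import random

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log p_model(x; mu, sigma) over the dataset (log of the likelihood)."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - sq / (2 * sigma ** 2)

def fit_gaussian_mle(data):
    """Closed-form maximum likelihood estimates for a univariate Gaussian."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # the MLE divides by n, not n - 1
    return mu, math.sqrt(var)

rng = random.Random(0)
# Hypothetical training data; in practice p_data is unknown and only samples exist
data = [rng.gauss(2.0, 1.5) for _ in range(5000)]

mu_hat, sigma_hat = fit_gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma_hat)

# Any other parameter setting gives a lower (or equal) log-likelihood
for mu, sigma in [(0.0, 1.0), (2.5, 1.5), (2.0, 3.0)]:
    assert gaussian_log_likelihood(data, mu, sigma) <= best

print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}, log-likelihood={best:.1f}")
```

For richer model families no such closed form exists, which is where the deterministic and stochastic approximations mentioned above come in.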
Another family of deep generative nets, called implicit density models [42], do not explicitly represent the probability distribution over the space where the data lies but provide some indirect way to interact with the probability distribution $p_{model}$. Typically, this means they can draw samples from the distribution. One method used by implicit density models is to define a Markov chain [43] that draws samples from $p_{model}$ stochastically, transforming an existing sample to obtain another sample from the same distribution. Another strategy is to generate samples in a single step, directly from the distribution represented by the model. The generative model in GANs is an implicit density model and uses the latter strategy.
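The single-step sampling strategy can be sketched as follows. The one-hidden-layer `generator`, its dimensions, and its random weights are hypothetical stand-ins for a trained network; the point is only that a sample is obtained from one forward pass on latent noise, without ever evaluating a density.

```python
import math
import random

def generator(z, weights):
    """Hypothetical one-hidden-layer generator G(z): latent vector -> data-space sample."""
    w1, w2 = weights
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in w1]
    return [sum(w * hj for w, hj in zip(row, hidden)) for row in w2]

def sample(n, weights, latent_dim, rng):
    """Draw n samples, each in a single step: z ~ N(0, I), then x = G(z)."""
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(generator(z, weights))
    return samples

rng = random.Random(0)
latent_dim, hidden_dim, data_dim = 2, 4, 3
# Random weights stand in for a trained generator (assumption for illustration)
w1 = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(hidden_dim)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(data_dim)]

xs = sample(5, (w1, w2), latent_dim, rng)
for x in xs:
    print([round(v, 3) for v in x])
```

Note that the code never computes the probability of a sample; the distribution is defined only implicitly through the sampling procedure.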
2.6. How Do Generative Models Generate Data?
Any information can be processed if it is represented well. In machine learning tasks, it is critical to represent the information so that the model can perform subsequent learning tasks efficiently [44]. The choice of representation varies with the learning strategy of the model. For instance, a feedforward network trained using supervised learning criteria learns specific properties at every hidden layer. The network’s last layer is usually a softmax layer, which is a linear classifier. The features in the input may not represent linearly separable classes, but they may become separable by the last hidden layer. Also, the choice of the classifier in the output layer affects the properties learned by the last hidden layer. Supervised learning methods do not explicitly impose any condition on the intermediate features that the network should learn. In contrast, when the model aims to estimate a density, the representation should be designed to make density estimation easier. In such a case, it may be appropriate to consider distributed representations, whose elements are independent and can be easily separated from each other.
Representation learning [45] plays an integral role in unsupervised and semi-supervised models, which try to learn from unlabeled data by capturing the shape of the input distribution. A good representation is one that helps the learning algorithm identify the different underlying factors causing variations in the data and separate these factors from each other. As a result, different features or directions in the feature space correspond to different causes disentangled by the representation. In the classic case of supervised learning, the label $y$ presented with each observation $x$ is at least one of the essential factors directly providing variation. In the case of unlabeled data, as in unsupervised and semi-supervised learning [46], the representation needs to use other, indirect hints about these factors. The learning algorithm can be designed to represent these hints in the form of implicit prior beliefs to guide the learner. For a given distribution $p(x)$, let $h$ represent many of the underlying causes of the observed $x$ and let the output $y$ be one of the most salient causes of $x$. Then $p(x)$ and $p(y \mid x)$ should be firmly tied, and a good representation would allow us to compute $p(y \mid x)$. Once it is possible to obtain the underlying explanations, i.e., $h$ for the observed $x$, it is easy to separate the features or directions in feature space corresponding to the different causes and consequently easier to predict $y$ from $h$.
The true generative process would be,
$$p(h, x) = p(x \mid h)\, p(h) \tag{3}$$
and the marginal probability of the data, $x$, can be computed as an expectation over $h$:
$$p(x) = \mathbb{E}_{h}\, p(x \mid h) \tag{4}$$
If the representation makes it possible to recover $h$, then it is easy to predict $y$ from that representation, and by using Bayes’ rule it is possible to find $p(y \mid x)$,
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \tag{5}$$
The marginal probability, $p(x)$, is tied to the conditional probability, $p(y \mid x)$, and knowledge of the structure of $p(x)$ would help us learn $p(y \mid x)$. Here, the latent factors are the underlying causes $h$ of the observed $x$. Latent factors, or latent variables, are variables that are not directly observed but are inferred from other variables that are directly measured. They can capture the dependencies between different observed variables, $x$, help reduce the dimensionality of the data, and provide different ways of representing it, and so they can give a better understanding of the data.
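The two computations discussed above, the marginal probability of the data as an expectation over the latent causes and the use of Bayes' rule to recover a cause from an observation, can be sketched on a toy model. Everything concrete here is an assumption for illustration: a single binary cause $h$ (playing the role of the salient cause $y$), a hypothetical prior, and Gaussian conditionals with hypothetical means.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical generative process: a binary latent cause h and a Gaussian p(x | h)
prior = {0: 0.7, 1: 0.3}     # p(h)
means = {0: -1.0, 1: 2.0}    # mean of p(x | h); the standard deviation is fixed to 1

def likelihood(x, h):
    return normal_pdf(x, means[h], 1.0)

def marginal_exact(x):
    """p(x) = sum_h p(h) p(x | h), i.e. the expectation of p(x | h) over h."""
    return sum(prior[h] * likelihood(x, h) for h in prior)

def marginal_monte_carlo(x, n, rng):
    """Estimate E_h[p(x | h)] by averaging over latent samples h ~ p(h)."""
    total = 0.0
    for _ in range(n):
        h = 1 if rng.random() < prior[1] else 0
        total += likelihood(x, h)
    return total / n

def posterior(x):
    """Bayes' rule: p(h | x) = p(x | h) p(h) / p(x)."""
    px = marginal_exact(x)
    return {h: likelihood(x, h) * prior[h] / px for h in prior}

rng = random.Random(0)
x_obs = 1.5
exact = marginal_exact(x_obs)
estimate = marginal_monte_carlo(x_obs, 20000, rng)
post = posterior(x_obs)
print(f"exact p(x)={exact:.4f}, Monte Carlo p(x)={estimate:.4f}")
print(f"p(h=0|x)={post[0]:.3f}, p(h=1|x)={post[1]:.3f}")
```

With only two latent states the sum is exact; the Monte Carlo variant is included because it is the form that remains usable when the latent space is too large to enumerate.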
Many probabilistic models, like linear factor models, use latent variables and compute the marginal probability of the data, $p(x)$, as described in Equation (4). A linear factor model can be defined as a stochastic linear decoder function that generates $x$ by adding noise to a linear transformation of $h$. It is possible to find some explanatory independent factors $h$, which have a similar joint distribution and are sampled from a given distribution, $h \sim p(h)$, where $p(h)$ is a factorial distribution, with
$$p(h) = \prod_{i} p(h_i) \tag{6}$$
Then the real-valued observable variables can be sampled as,
$$x = Wh + \text{noise} \tag{7}$$
where $W$ is the weight matrix and the noise is Gaussian and diagonal, which means it is independent across dimensions.
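The sampling procedure of a linear factor model can be sketched as below. The sketch assumes a standard Gaussian for each independent factor (one possible factorial distribution, chosen here for illustration), and the weight matrix `W` and the per-dimension noise levels are hypothetical values.

```python
import random

def sample_linear_factor(W, noise_std, rng):
    """Draw one x = W h + noise with h ~ N(0, I) and diagonal Gaussian noise."""
    n_obs, n_latent = len(W), len(W[0])
    h = [rng.gauss(0.0, 1.0) for _ in range(n_latent)]  # independent latent factors
    x = [
        sum(W[i][j] * h[j] for j in range(n_latent)) + rng.gauss(0.0, noise_std[i])
        for i in range(n_obs)
    ]
    return h, x

rng = random.Random(0)
# Hypothetical weights: 3 observed variables driven by 2 latent factors
W = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 2.0]]
noise_std = [0.1, 0.1, 0.1]  # one standard deviation per observed dimension

samples = [sample_linear_factor(W, noise_std, rng)[1] for _ in range(20000)]

# The empirical variance of each observed dimension should approach
# sum_j W[i][j]^2 + noise_std[i]^2, the diagonal of W W^T + diag(noise variance).
for i in range(3):
    mean_i = sum(s[i] for s in samples) / len(samples)
    var_i = sum((s[i] - mean_i) ** 2 for s in samples) / len(samples)
    expected = sum(w * w for w in W[i]) + noise_std[i] ** 2
    print(f"dim {i}: empirical var={var_i:.3f}, expected={expected:.3f}")
```

The variance check is a simple way to confirm that the observed variables inherit their structure from the latent factors through the weight matrix, with the noise contributing only to the diagonal.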
An unsupervised learning algorithm would try to learn a representation that captures all the underlying factors of variation and then try to disentangle them from each other. A brute-force search for all or most of these factors may not be feasible, so a semi-supervised approach can be used to determine the most relevant factors of variation and encode only those salient factors. Autoencoders and generative models can be trained to optimize a fixed criterion, like the mean squared error, to determine which 'causes' or factors should be considered salient. For instance, if a group of pixels follows a highly recognizable or distinct pattern, that pattern could be considered extremely salient. However, models trained on mean squared error have limited performance and fail to reconstruct images completely [47].
Another method to identify the salience of features is to use GANs [48]. In this approach, a generative model is trained to fool a classifier, which is a discriminative model. The classifier should recognize all the samples from the training data as real and the samples from the generative model as fake. Any structured pattern that the discriminator can recognize can be considered salient, which makes generative adversarial networks better at finding which factors should be represented.
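The opposing goals of the two models can be made concrete with a small sketch of the standard adversarial losses. The text above does not spell out a loss function, so the binary cross-entropy form and the non-saturating generator loss used here are assumptions based on the usual GAN formulation, and the discriminator outputs are hypothetical numbers rather than results from a trained model.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples are labeled 1, generated samples 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are classified as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs D(x) in (0, 1): probability that x is real
confident = {"real": [0.9, 0.95, 0.85], "fake": [0.1, 0.05, 0.2]}
fooled = {"real": [0.9, 0.95, 0.85], "fake": [0.8, 0.7, 0.9]}

for name, out in [("discriminator winning", confident), ("generator winning", fooled)]:
    print(name,
          "| D loss:", round(discriminator_loss(out["real"], out["fake"]), 3),
          "| G loss:", round(generator_loss(out["fake"]), 3))
```

When the discriminator separates real from generated samples, its loss is low and the generator's is high; when the generator fools it, the situation reverses, which is the pressure that drives the generator toward patterns the discriminator finds salient.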
In summary, two essential aspects make the generative way of learning powerful. First, generative models try to learn the underlying causal factors from cause-effect relationships via the hidden factors that can explain the data. Second, they use distributed representations to identify these factors, which are independent and can be set separately from each other. Each direction in the distributed representation space can correspond to a different underlying causal factor, helping the system identify the salient features.
The advantage of learning the underlying causal factors [49] is that if the model learns the exact generative process, with $x$ as the effect and $y$ as the cause, then $p(x \mid y)$ is adaptive to changes in $p(y)$. Also, the causal relationships are invariant to changes in the problem domain, the type of task, or non-stationary temporal variations in the dataset. A learning strategy in which generative models attempt to recover the causal factors, $h$ and $p(x \mid h)$, is robust and generalizes across such changes. Various regularization strategies have been suggested in the literature to find the underlying factors of variation [50]. Popular strategies used by different learning algorithms include smoothness, linearity, multiple explanatory factors, depth or hierarchical organization of explanatory factors, shared factors across tasks, manifolds, natural clustering, sparsity, simplicity of factor dependencies, and temporal and spatial coherence. Among these, the use of causal factors [51] is the most advantageous for semi-supervised learning and makes the model more robust to changes in the distribution of underlying causes or to use of the model for a new task [52].
The second advantage is that distributed representations are more powerful than symbolic representations for representing the underlying causal factors. Symbolic or one-hot representations are non-distributed: with $n$ symbols they can represent only $n$ mutually exclusive regions, whereas a distributed representation with a vector of $n$ binary features can represent $2^n$ configurations. Each direction in the representation space can correspond to the value of a different underlying configuration variable.
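The difference in capacity can be checked by enumeration. The following sketch uses a hypothetical size of $n = 4$ units and simply lists every code that each kind of representation can express.

```python
from itertools import product

n = 4  # hypothetical number of units or features

# Non-distributed (one-hot): exactly one of the n units is active
one_hot_codes = [tuple(1 if i == k else 0 for i in range(n)) for k in range(n)]

# Distributed: each of the n binary features can be on or off independently
distributed_codes = list(product([0, 1], repeat=n))

print(len(one_hot_codes))      # n distinct regions
print(len(distributed_codes))  # 2**n distinct configurations
```

With the same four units, the one-hot scheme distinguishes 4 cases and the distributed scheme distinguishes 16, and the gap grows exponentially with $n$.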
Learning algorithms like $k$-means clustering [53], $k$-nearest neighbors [54], decision trees [55], Gaussian mixtures, kernel machines with the Gaussian kernel [56], and language or translation models based on $n$-grams [57] are based on non-distributed representations. These algorithms break the input space into different regions, with a separate set of parameters for each region. If the dataset contains enough examples to represent each region, the learning algorithm can fit the training data well without solving any complicated optimization problem. However, these models suffer as the number of dimensions grows or when there are insufficient examples in the dataset to represent each dimension. They fail when the number of parameters exceeds the number of examples that explain each region. Also, a non-distributed representation needs a different degree of freedom for each region, which prevents it from generalizing to new regions when the target function is not smooth and may increase or decrease several times across many different regions.
On the other hand, distributed representations [58] use shared attributes and introduce the concept of a similarity space, in which inputs that are semantically close are also close in distance. They can compactly represent complicated structures using a small number of parameters and generalize better over shared attributes. For example, a 'truck' and a 'car' have many attributes in common, so many things that are valid for cars generalize to trucks as well.
A distributed representation uses separate directions in the representation space to capture the variations between different underlying factors [59]. These features are discovered automatically by the network and do not need to be fixed beforehand or labeled. Generative models learn from the distributed representation to disentangle the various features, even when the model has never seen a feature before. Each direction or vector represents a feature, and new features can be generated by adding or subtracting these representation vectors. For instance, in the famous example of generating new images using a GAN [60], the distributed representation disentangles the concept of gender from the concept of wearing glasses. Given the representation of a man with glasses, if the representation vector of a man is subtracted and the representation vector of a woman without glasses is added, the result is the vector representation of a woman with glasses, and a generative model can correctly generate the image corresponding to the resulting vector. In this way, the model can generate new, unseen synthetic data.
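The vector arithmetic in this example can be sketched with a toy representation space. The two-dimensional vectors below are hand-picked, hypothetical values in which one axis encodes gender and the other encodes glasses; a real GAN learns a much higher-dimensional latent space and its directions are not labeled in advance.

```python
# Hypothetical 2-D representation space (illustrative values only):
# axis 0 encodes gender (+1 man, -1 woman), axis 1 encodes glasses (1 = wearing)
concepts = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "man with glasses": [1.0, 1.0],
    "woman with glasses": [-1.0, 1.0],
}

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def subtract(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def nearest(vec):
    """Return the concept whose vector is closest to vec (squared Euclidean distance)."""
    return min(
        concepts,
        key=lambda name: sum((c - v) ** 2 for c, v in zip(concepts[name], vec)),
    )

# man with glasses - man + woman
result = add(subtract(concepts["man with glasses"], concepts["man"]), concepts["woman"])
print(result, "->", nearest(result))
```

Because gender and glasses occupy separate directions in this toy space, subtracting one concept and adding another changes only the intended factor, which is the property the disentangled representation is meant to provide.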
Table 1. Comparison between generative and discriminative modeling techniques.
| Generative Models | Discriminative Models |
| --- | --- |
| Learn the underlying data distribution | Learn the decision boundary between different classes of the data |
| Model the joint probability distribution between the input and output data | Model the conditional probability distribution of the output given the input |
| Can generate new data from the learned distribution | Cannot generate new data from the learned decision boundary |
| Used for tasks such as image and audio synthesis, text generation, and anomaly detection | Used for tasks such as classification, regression, and object recognition |
| Make no assumptions about the data | Use prior assumptions about the data |
| Examples include VAE, GAN, and RBM | Examples include Logistic Regression, SVM, and Neural Networks |