1. Introduction
In the 21st century, a wide range of mathematical and data-analytic techniques and algorithms have been adopted in designing new video games and frames, for the purposes of enhancing teaching and learning processes within virtual environments, pushing the innovation and specialty of gameplay mechanics to their furthest extent, and visualizing game scenes in a human-crafted, realistic and dynamic manner [1,2,3]. The subjects concerned include the investigation of 3-dimensional geometric properties of characters within a particular frame [4], the capturing of geometric transformations and motion on a real-time basis [5], and the use of simulation games for analyzing and building up complex systems that better reflect real-world conditions [6]. Today, credited to the increase in computing power and resources, the enhanced capability of data storage, and the massive data volumes available for simultaneous processing [3], advances in machine learning (ML) and artificial intelligence (AI) approaches, including the emergence of generative models, are taking place and being widely adopted in different practical disciplines, especially those related to image processing and computer vision. This new digital era has also promoted the use of these approaches in handling creative and artistic tasks: for example, conditional adversarial neural networks have been applied to generate city maps from a sketch [7]; a Generative Adversarial Network (GAN) model was proposed and established to generate images based on a simple sentence description of an object or a specific scenario [8]; and the Game Design via Creative Machine Learning (GDCML) mechanism was utilized for setting up game interfaces and modules and informing new systems [9]. In view of all these successes, achieving "computational creativity" in video game design has become a hotspot and new focus, and game companies and developers are seeking ways to adopt ML and AI algorithms, so that the overall production cost of a game or related products can be reduced, while brand-new working procedures within the game can be explored in the long run. A research report published by NetEase reported that the incorporation of ML models into game design could reduce development costs by millions of Renminbi (RMB) [10].
In the early days of video game development, most games were relatively simple and "monotonous", and were played in a "third-person shooting" mode with the aid of electronic machines. In 1962, Steve Russell and several student hobbyists at the Massachusetts Institute of Technology (MIT) developed the first ever video game in the world, called Spacewar! [11], and this game was later distributed on Digital Equipment Corporation (DEC) platforms. Within this historical development stage, Spacewar! was considered the first highly influential video game, because it motivated the advancement of computing resources, revealed the difficulties in transferring programs and graphics between computing platforms at different places [11], and stimulated the development of different game genres. In the early 1970s, the first home video game console, the "Magnavox Odyssey", and the first arcade video games, "Computer Space" and "Pong", were released. At these earlier stages, despite the effective integration of technology, creativity and computing resources, there was no uniform standard for classifying game genres in terms of gameplay, but games can be generally categorized as in [12]. Some key examples include (1) Action Games, which emphasize physical challenges, particularly hand-eye coordination; (2) First Person Shooter Games, which involve the use of guns and weapons for competing and fighting against one another from a first-person perspective; (3) Sports and Racing Games, which simulate sports or racing originating from real or fantastical environments; and (4) Simulation Games, a diverse super-category of video games in which real-world activities, for example flight or farming, can be effectively simulated and displayed. With the combination of these genres and capabilities in algorithmic design and data analytics, the importance and popularity of arcades and consoles diminished, and they were gradually replaced by games compatible with personal computers, smartphones and mobile devices. Some mainstream game platforms in the 21st century are shown in Table 1. Nowadays, most released games are not limited to a particular genre: for example, "Need for Speed" is considered to be both a Sports and Racing Game and a Simulation Game [13], while many games are released on multiple platforms: for example, "Genshin Impact" is compatible with PC, mobile devices and PlayStation simultaneously [14].
Apart from being categorized by genre and compatible platform, modern games all consist of three major components, namely (1) the program component; (2) the gameplay component; and (3) the artistic component. Programs form the basis of a video game and determine its basic structure and logic; gameplay decides how players and the surrounding environment interact, covering the design of background settings, battles, balance and stages of the game itself; artistic components lay down what the player sees and hears during gameplay, including the design of characters, environmental settings, background music and animations, as well as the modes of interaction [18]. In particular, when designers attempt to produce artistic materials that add spice to the attractiveness of a video game, they may either take reference from real-world architectural and parametric settings, or create objects and environments that do not exist in reality. This provides possibilities for the utilization and application of image generation techniques within game design processes [19,20]. Designers and scientists have started exploring how ML and combinatorial algorithms could play a systematic role at different levels of game design, for example, data preprocessing, clustering, decoding and encoding, as well as generating attractive and sustainable image outputs for a specific game [21,22,23,24]. Further, a recent concept called "game blending" was adopted by Gow and Corneli to establish a framework that produces new games from multiple existing games [25], while the Long Short-Term Memory (LSTM) technique has been applied to blend computer game levels based on Mario and Kid Icarus, and then combined with the Variational AutoEncoder (VAE) model to generate more controllable game levels [26,27]. In recent years, Generative Adversarial Network (GAN) models have become popular and have been incorporated into frameworks for generating game levels and images under specific conditions and settings [28,29]. These black-box models allow users to design and generate levels automatically; thus, Schrum et al. [30] utilized such unique features to develop a latent model-based game design tool, while Torrado et al. [31] investigated the conditional GAN, established a new GAN-based architecture called "Conditional Embedding Self-Attention GAN", and equipped it with a bootstrapping mechanism for the purpose of generating Role-Playing Game (RPG) levels. All of these have shown certain capabilities in generating game levels within specific set-ups. Nevertheless, it is incredibly hard to obtain a complete understanding of the internal structure of ML-based models, as well as the statistical properties behind the scenes. Thus, it is of utmost importance to develop and explore the use of a mathematical model that can perform the corresponding tasks, i.e., generate new game levels that are applicable in modern game design and future extensions of a game, while users can still acquire a basic understanding of the statistical properties of the model, for example, its time complexity, the amount of loss during model training, and the relationship between time consumption and the size of the input data.
In this study, the effectiveness of the Variational AutoEncoder (VAE) model in image generation within game design was first explored and assessed. The VAE is a deep generative model consisting of an encoder and a decoder, equipped with a prior and a noise distribution. During the model training process, which is usually conducted by an Expectation-Maximization meta-algorithm, the encoding distribution is "regularized" so that the resulting latent space suffices to generate new and meaningful datasets. A detailed mathematical derivation is discussed in Section 3, and readers can refer to [32] for more technical details as well. The VAE model was first proposed by Kingma and Welling [33], and has been widely applied in different disciplines, for example, image generation, data classification and dimensionality reduction [34,35,36]. In particular, Vuyyuru et al. constructed a weather prediction model based on the combination of VAE and Multilayer Perceptron (MLP) models [37], and Lin et al. attempted to detect anomalies in office temperature within a prescribed period via LSTM and VAE models [38]. Furthermore, Bao et al. effectively combined the Convolutional Variational AutoEncoder (CVAE) with a GAN model to generate human photos while controlling the gender of the required figures [39]. All of these demonstrate the systematic and practical usages of the VAE model. We therefore expect that, with a suitable data processing mechanism, fine-tuning of model parameters, and minimization of the loss function during training, selected game functions or level maps can be generated, thereby assisting game developers in the long run in terms of auxiliary development, designing new games, and enhancing the speed and time complexity of image generation.
Section 2 includes a flowchart of how the VAE model was applied within this study, and a description of the datasets used for the later case studies. The mathematical theories and statistical properties of the VAE model are outlined in Section 3, and Section 4 showcases the numerical experiments conducted and their corresponding statistical analyses. Section 5 discusses the drawbacks and limitations of the current study, as well as some potential future research directions, and a short conclusion is provided in Section 6.
3. Methodologies: Steps of the VAE Model
The important steps and statistical measures of the VAE model are provided in this section, which gives readers a crucial reference on how the VAE model was constructed, the ideas behind data preprocessing, and the important parameters that should be controlled (i.e., maximized or minimized) during the machine learning stage.
3.1. Data Preprocessing
First, the raw images were compressed by applying a specific scaling factor, defined as the ratio of the side length of the desired output image to that of the original image. In this study, a scaling factor of less than 1 was adopted to speed up the machine learning and training processes, while at the same time preventing memory overflow.
Afterwards, the compressed images were decolorized using the optimization approach proposed in [46], with the aim of preserving the original color contrasts to the best extent. In principle, the VAE model is applicable to RGB images; however, due to limitations of computer performance, the images obtained from the datasets in Section 2 were converted into grayscale. Nevertheless, the texture, color contrast and pixel properties were preserved as much as possible, so that the effectiveness of the VAE model could be fairly assessed. In this study, an Intel(R) Xeon(R) CPU E5-2670 v3 with 2 processors was adopted, and the system was a 64-bit operating system with 128 GB RAM installed.
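As an illustration of this preprocessing stage, the following minimal sketch (assuming the Pillow library; the file paths and the 0.5 scaling factor are hypothetical choices, and Pillow's luminance conversion is used as a simple stand-in for the contrast-preserving decolorization of [46]) resizes an image by a scaling factor and converts it to grayscale:

```python
from pathlib import Path
from PIL import Image

SCALING_FACTOR = 0.5  # hypothetical value; any factor < 1 compresses the image

def preprocess(src: Path, dst: Path, factor: float = SCALING_FACTOR) -> None:
    """Resize an image by `factor` on each side, then convert to grayscale."""
    img = Image.open(src)
    new_size = (int(img.width * factor), int(img.height * factor))
    img = img.resize(new_size, Image.LANCZOS)  # downsample to save memory
    img = img.convert("L")                     # single-channel grayscale
    img.save(dst)

preprocess(Path("raw/map_001.png"), Path("processed/map_001.png"))
```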
As for the Arknights game maps described in Section 2.2.1, since every game map represents only one class label, and only 180 different images could be obtained from the open data source, each of these 180 images was copied 10 times, so that a total of 1800 images were ingested into the VAE model, with most of them grouped into the 'training set' and a small pile considered the 'testing set'. Further, the 10 versions of each image possess different brightness, contrast and gamma correction factors, so that a total of 1800 class labels could be obtained for statistical analyses.
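A sketch of this augmentation step is given below, again assuming Pillow; the specific jitter ranges are hypothetical, since the exact brightness, contrast and gamma values applied in this study are not restated here:

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image, n_copies: int = 10) -> list[Image.Image]:
    """Create n_copies variants with varied brightness, contrast and gamma."""
    variants = []
    for _ in range(n_copies):
        out = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
        out = ImageEnhance.Contrast(out).enhance(random.uniform(0.8, 1.2))
        gamma = random.uniform(0.8, 1.2)  # gamma correction on 8-bit pixels
        out = out.point(lambda p: int(255 * (p / 255) ** gamma))
        variants.append(out)
    return variants
```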
3.2. Autoencoding, Variational AutoEncoder (VAE) and Decoding Processes
In analyzing large datasets that contain a vast number of features within each observation, Principal Component Analysis (PCA) has been widely adopted to visualize multi-dimensional information by reducing the dimension of the original dataset while keeping the maximum amount of information in the output [47]. However, PCA is only applicable to linear transformations, which is where the concept of "autoencoding" comes in. An autoencoder is capable of handling both linear and non-linear transformations; it is a model that reduces the dimension of complex datasets via neural network approaches [48]. It adopts backpropagation for learning features instantly during the model training and building stages, and is thus more prone to data overfitting when compared with PCA [49]. The structure of an autoencoder is shown in Figure 5; it mainly includes an encoder for handling input datasets, some codes within the encoding process, and a decoder to produce meaningful outputs.
Denote $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ as the set of all samples in the original dataset, where $x^{(i)}$ represents the $i$th sample. The encoder is a function $f$ that encodes the original dataset $X$ to $Z$, i.e., $Z = f(X)$, where the dimension of $Z$ is significantly less than that of $X$. Afterwards, the simplified dataset $Z$ is passed onto the decoder $g$, which decodes $Z$ and outputs $\hat{X}$. Hence, the decoder is mathematically expressed as $\hat{X} = g(Z) = g(f(X))$. The loss function $\|X - \hat{X}\|$ under an arbitrary norm (depending on the type of application) is then used to estimate the closeness between $X$ and $\hat{X}$. If the magnitude of $\|X - \hat{X}\|$ is small, then the model is considered effective. Here, we may assume that the encoded $Z$ will include the most valuable information from $X$, so that $Z$ suffices to represent the original dataset even after dimensionality reduction has been applied during the model training process. For example, let $x$ be an image, whose width and height are the dimensions that store the information of $x$. The overall goal is to train an autoencoder that encodes the image into $z$ (i.e., dimensionality reduction), then apply a decoder that reformulates the image as $\hat{x}$ such that the loss function is minimized. In practice, this model will create not only useful attributes of the image, but also unwanted noise components, because the distribution of $z$, denoted by $p(z)$, has not been modelled. To complement such deficiency, the Variational AutoEncoder (VAE) was adopted to first model the probabilistic distribution of $z$, before all useful attributes of $x$ are extracted to form a sampling space of $z$ and passed into the decoder for image recovery.
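The encoder-decoder pair described above can be sketched as follows; this is a minimal PyTorch illustration under assumed dimensions (flattened grayscale inputs of hypothetical size 4096 and a latent dimension of 32), not the exact architecture used in this study:

```python
import torch
import torch.nn as nn

INPUT_DIM, LATENT_DIM = 4096, 32  # hypothetical sizes (64x64 grayscale image)

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # f: compresses x into a low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(INPUT_DIM, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM),
        )
        # g: reconstructs x_hat from z
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, INPUT_DIM), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
loss_fn = nn.MSELoss()  # estimates || x - x_hat || under the L2 norm
```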
Suppose $z \sim \mathcal{N}(0, I)$, where $I$ represents an identity matrix, which means that $z$ can be regarded as a multi-dimensional random variable that obeys the standard multivariate Gaussian distribution. Denote $X$ and $Z$ as random variables, with the corresponding $i$th samples denoted as $x^{(i)}$ and $z^{(i)}$ respectively. With this set-up, the eventual output is generated by a 2-step stochastic process, with $z$ being the hidden variable: (1) the prior distribution $p(z)$ of $z$ is encoded and sampled to obtain $z^{(i)}$, then (2) based on the conditional distribution $p(x|z)$, a data point or sample $x^{(i)}$ is achieved.
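In symbols, this two-step generative process reads (standard VAE notation, with $\theta$ denoting the decoder parameters introduced below):

$$ z^{(i)} \sim p(z) = \mathcal{N}(0, I), \qquad x^{(i)} \sim p_{\theta}\big(x \mid z = z^{(i)}\big). $$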
As for the decoding process, the samples $z^{(i)}$ obtained from the $p(z)$ distribution are ingested into the decoder; the parametrized decoder then establishes a mapping that outputs the precise distribution of $x$ corresponding to $z^{(i)}$, which is denoted as $p_\theta(x|z)$. To simplify the statistical complexity, we may assume that $x$ obeys an isotropic multivariate Gaussian distribution for any given $z$, i.e., Equation (1) holds. This means that after $z^{(i)}$ is ingested into the decoder, the distribution of $x$ can be obtained after fitting $\mu'$ and $\sigma'$.
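Assuming the standard isotropic Gaussian form used throughout the VAE literature, Equation (1) can be written with the decoder outputs $\mu'$ and $\sigma'$ as:

$$ p_{\theta}(x \mid z) = \mathcal{N}\big(x;\, \mu'(z),\, \sigma'^{2}(z)\, I\big). $$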
By taking into account that $z \sim \mathcal{N}(0, I)$, Equation (2) is obtained, where $\theta$ represents the hyper-parameter within our VAE model.
Then, Maximum Likelihood Estimation (MLE) is applied to estimate $\theta$ based on the observed or inputted dataset $X$. The detailed formulation is shown in Equation (3).
Generally speaking, the dimension of $x$ is very large, and even after dimensionality reduction is conducted, the dimension of $z$ is not extremely small. Thus, a sufficiently large number of samples $z^{(i)}$ has to be considered to achieve an accurate estimate of $\theta$. To cope with this, the posterior distribution $p_\theta(z|x)$ has to be introduced in the encoder. Equation (4) shows how Bayes' formula can be applied to compute $p_\theta(z|x)$. The procedures here are designed and formulated with reference to the ideas of [50].
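In the present notation, the standard form of Bayes' formula underlying Equation (4) reads:

$$ p_{\theta}(z \mid x) = \frac{p_{\theta}(x \mid z)\, p(z)}{p_{\theta}(x)} = \frac{p_{\theta}(x \mid z)\, p(z)}{\int p_{\theta}(x \mid z)\, p(z)\, dz}. $$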
Next, the AutoEncoding Variational Bayesian (AEVB) algorithm is applied to optimize the parametrized encoder and decoder. Denote $q_\phi(z|x)$ as the approximate posterior distribution of the encoder (with parameter $\phi$); if $q_\phi(z|x) \approx p_\theta(z|x)$, then the encoder can be adopted to obtain the probabilistic distribution of $z$ [33]. Since $p(z)$ and $p_\theta(x|z)$ are multivariate Gaussian distributions, so is $q_\phi(z|x)$. As a result, it suffices to acquire the outputs $\mu$ and $\sigma$ from the encoder to outline the posterior of the generative model. For any sample $x^{(i)}$, $z$ should satisfy the distribution shown in Equation (5).
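Under the Gaussian assumptions above, Equation (5) is the diagonal Gaussian posterior; in implementations, a sample is typically drawn from it via the reparameterization trick [33], so that gradients can pass through the sampling step:

$$ z \mid x^{(i)} \sim \mathcal{N}\big(\mu^{(i)},\, (\sigma^{(i)})^{2} I\big), \qquad z^{(i)} = \mu^{(i)} + \sigma^{(i)} \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I). $$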
3.3. Steps of the VAE Model
Based on the methods reviewed and introduced in Section 3.2, the actual steps of the VAE model are outlined as follows (Steps 1-4):
Step 1: The encoder was assigned a data point/sample $x^{(i)}$, and the parameters of the distribution $q_\phi(z|x^{(i)})$ that the latent variable $z$ obeys were obtained through neural network methods. Since this posterior distribution is an isotropic Gaussian distribution, it suffices to find out the parameters $\mu^{(i)}$ and $\sigma^{(i)}$ of the Gaussian distribution that $z$ obeys. As an example, $x^{(i)}$ may represent some images of orange cats.
Step 2: Based on the parameters $\mu^{(i)}$ and $\sigma^{(i)}$, a sample $z^{(i)}$ from the distribution $\mathcal{N}(\mu^{(i)}, (\sigma^{(i)})^{2} I)$ was obtained, which is considered a similar type of sample as $x^{(i)}$. As an example, $z$ represents all cats that are orange in color.
Step 3: Then, the decoder proceeded to fit the likelihood distribution $p_\theta(x|z)$, i.e., when $z^{(i)}$ was ingested into the decoder, the parameters of the distribution that $x$ obeys could be achieved. Since the likelihood also obeys an isotropic Gaussian distribution, we can denote the output parameters as $\mu'^{(i)}$ and $\sigma'^{(i)}$. As an example, $p_\theta(x|z^{(i)})$ represents a distribution of images of orange cats.
Step 4: After the statistical parameters of the distribution $p_\theta(x|z^{(i)})$ were acquired, a sequence of data points $\hat{x}^{(i)}$ was obtained via sampling. Nevertheless, most people use the mean $\mu'^{(i)}$ as an alternative representation of $\hat{x}^{(i)}$. An example here is to sample a new orange cat image from a particular distribution of orange cats.
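Steps 1-4 map directly onto code. The following is a minimal PyTorch sketch under the same hypothetical layer sizes as the autoencoder above; the encoder returns $\mu$ and $\log \sigma^{2}$ for numerical stability, a common implementation choice rather than something prescribed by this study:

```python
import torch
import torch.nn as nn

INPUT_DIM, HIDDEN_DIM, LATENT_DIM = 4096, 512, 32  # hypothetical sizes

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(INPUT_DIM, HIDDEN_DIM), nn.ReLU())
        self.fc_mu = nn.Linear(HIDDEN_DIM, LATENT_DIM)      # Step 1: mu
        self.fc_logvar = nn.Linear(HIDDEN_DIM, LATENT_DIM)  # Step 1: log sigma^2
        self.decoder = nn.Sequential(                       # Step 3: fits p(x|z)
            nn.Linear(LATENT_DIM, HIDDEN_DIM), nn.ReLU(),
            nn.Linear(HIDDEN_DIM, INPUT_DIM), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # Step 2: z = mu + sigma * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        mu_prime = self.decoder(z)  # Step 4: mean used as the generated sample
        return mu_prime, mu, logvar
```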
In addition, it is also widely recognized that $p_\theta(x|z)$ is an isotropic multivariate Gaussian distribution with fixed variance, which can be mathematically expressed as in Equation (6), where $\sigma'$ is considered a hyper-parameter.
The overall graphical structure of the VAE model is shown in Figure 6.
3.4. Evidence Lower Bound (ELBO) of the VAE Model
After fixing the structure of the VAE model for handling the datasets in Section 2, an effective loss function for estimating the information loss during the model construction process has to be established. Following the idea of MLE and applying variational inference, the likelihood function $\log p_\theta(x)$ can be expressed as in Equation (7), and it is bounded below by $\mathcal{L}(\theta, \phi; x)$, which is named the "Evidence Lower Bound (ELBO)".
Here, the first integral of the last expression in Equation (7) is denoted as $\mathrm{ELBO}$, while the second integral is the KL divergence (also known as relative entropy in information theory), denoted by $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$. Since the KL divergence is always non-negative, $\mathrm{ELBO}$ is considered the lower bound of $\log p_\theta(x)$. Rearranging Equation (7) results in Equation (8) below.
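Written out, the decomposition behind Equations (7) and (8) takes the standard variational form:

$$ \log p_{\theta}(x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\big[\log p_{\theta}(x \mid z)\big] - D_{KL}\big(q_{\phi}(z|x)\,\|\,p(z)\big)}_{\mathrm{ELBO}} + D_{KL}\big(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\big) \;\geq\; \mathrm{ELBO}. $$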
That is, maximizing $\log p_\theta(x)$ is equivalent to maximizing $\mathrm{ELBO}$ and minimizing $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$. To minimize this divergence, we further assume that the approximate posterior distribution $q_\phi(z|x)$ converges to the posterior distribution $p_\theta(z|x)$, which is valid because the encoder should only output meaningful distributions for further retrieval and signal recovery.
Expanding $\mathrm{ELBO}$ as shown in Equation (9), we have:
Again, the two terms in the last step of Equation (9) have their own physical meanings and implications: the first integral represents the "latent loss", denoted by $-D_{KL}(q_\phi(z|x)\,\|\,p(z))$, while the second integral is known as the "reconstruction loss", denoted by the expectation $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$.
Based on our assumptions for the VAE model, $q_\phi(z|x)$ and $p(z)$ both follow Gaussian distributions; therefore, the analytical solution of $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ can be obtained as follows:
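For a $k$-dimensional diagonal Gaussian posterior $\mathcal{N}(\mu, \operatorname{diag}(\sigma^{2}))$ and the standard Gaussian prior, this analytical solution is the well-known closed form:

$$ D_{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^{2}))\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{k} \left( \mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1 \right). $$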
Here, $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ represents the relative entropy from $p(z)$ to $q_\phi(z|x)$, two probability distributions defined on the same measurable sample space.
As for the second term, multiple $z^{(i)}$'s sampled from $q_\phi(z|x)$ were used to approximate the term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ via Monte Carlo estimation. Supposing the dimension of every data point $x^{(i)}$ is $d$, we can expand $\log p_\theta(x^{(i)}|z^{(i)})$ as shown in Equation (11) below.
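Assuming the fixed-variance isotropic Gaussian likelihood of Equation (6), this expansion takes the form:

$$ \log p_{\theta}\big(x^{(i)} \mid z^{(i)}\big) = -\frac{1}{2\sigma'^{2}} \big\| x^{(i)} - \mu'^{(i)} \big\|^{2} - \frac{d}{2} \log\big(2\pi \sigma'^{2}\big). $$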
3.5. General Loss Function of the VAE Model
Based on the parameters introduced in Section 3.4, the loss function $\mathcal{L}$ in Equation (12) should be minimized during the machine learning and model training processes:
In the formula, the $z^{(i)}$'s are actually sampled from $q_\phi(z|x^{(i)})$; however, only one such sample is needed empirically. Therefore, we simply consider the case of a single Monte Carlo sample, and Equation (12) can be simplified as Equation (13).
In our study, considering that $p_\theta(x|z)$ is an isotropic multivariate Gaussian distribution with fixed variance, it is reasonable to set $\sigma'$ as a $d$-dimensional vector with all elements being 0.5. With that, the corresponding loss function can be expressed as in Equation (14).
Here, $x^{(i)}$ represents the $i$th sample, which acts as the input of the encoder; $\mu^{(i)}$ and $\sigma^{(i)}$ are the outputs of the encoder, which act as the parameters of the distribution of $z$; $z^{(i)}$ is sampled from $\mathcal{N}(\mu^{(i)}, (\sigma^{(i)})^{2} I)$ and acts as the input of the decoder; and $\mu'^{(i)}$ is the output of the decoder, which precisely represents the eventually generated data point $\hat{x}^{(i)}$.
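A sketch of this loss in PyTorch, continuing the hypothetical `VAE` class from Section 3.3 (with $\sigma' = 0.5$ fixed as stated, so the reconstruction term carries a weight of $1/(2 \times 0.5^{2}) = 2$; constant terms independent of the parameters are dropped):

```python
def vae_loss(x, mu_prime, mu, logvar, sigma_prime: float = 0.5):
    """Negative ELBO: reconstruction loss + latent (KL) loss, summed over a batch."""
    # Reconstruction loss: -log p(x|z) for a fixed-variance isotropic Gaussian,
    # up to an additive constant.
    recon = ((x - mu_prime) ** 2).sum() / (2 * sigma_prime ** 2)
    # Latent loss: closed-form KL( N(mu, sigma^2 I) || N(0, I) ).
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum()
    return recon + kl
```

In a training loop, `mu_prime, mu, logvar = model(x)` followed by `vae_loss(x, mu_prime, mu, logvar).backward()` would perform one optimization step under this objective.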
3.6. Loss Function of the VAE Model in Clustering
As aforementioned, the KL divergence for $q_\phi(z|x)$ and $p(z)$ is defined as $D_{KL}(q_\phi(z|x)\,\|\,p(z))$. Such an expression is only valid under the assumptions that $p(z)$ is a Gaussian distribution, and that both $q_\phi(z|x)$ and $p_\theta(x|z)$ are conditional Gaussian distributions. If all of these hold, the loss of the ordinary VAE model can be obtained by a series of substitutions.
Nevertheless, in the case of data clustering, the hidden variables may not always be continuous variables. Thus, we set the latent variable as the pair $(z, y)$, where $z$ is a continuous variable that represents a coding vector, and $y$ is a discrete variable that represents the category. After updating the latent variable, the resulting KL divergence is as shown in Equation (15), and this expression is applicable for clustering within the VAE model of this study.
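One standard way to write this divergence, assuming the approximate posterior factorizes as $q(z, y|x) = q(y|z)\, q(z|x)$ and the prior as $p(z, y) = p(z|y)\, p(y)$ (a factorization consistent with the encoding and decoding procedure described below, though not necessarily identical to Equation (15)), is:

$$ D_{KL}\big(q(z, y \mid x)\,\|\,p(z, y)\big) = \int \sum_{y} q(y \mid z)\, q(z \mid x) \log \frac{q(y \mid z)\, q(z \mid x)}{p(z \mid y)\, p(y)}\, dz. $$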
Therefore, Equation (15) can be re-written as Equation (17), which essentially yields the specific loss function for data clustering by following the procedures outlined in the preceding subsections.
The formula shown in Equation (17) also describes both the encoding and decoding procedures. First, a data point $x$ is sampled, which represents an image formed by the original data; then the encoder $q(z|x)$ is applied to obtain the encoding characteristic $z$, followed by the classifier $q(y|z)$ that clusters the encoded information or attributes. Next, a category $y$ is selected from the distribution $p(y)$, and a random hidden variable $z$ is selected from the distribution $p(z|y)$. Finally, the decoding process can generate new images accordingly. By these theoretical procedures, images with specific class labels and minimized loss can be generated in a systematic manner.
6. Conclusion
In this study, we illustrated the possibility and statistical feasibility of combining the VAE model with machine learning strategies for modern game design, with the aid of 3 case studies of different natures and applications. The mathematical principles and assumptions of the VAE model, as well as its Evidence Lower Bound (ELBO), loss function during model construction, and loss function in data clustering, were first explored and derived. Then, the VAE model was applied to generate new game maps based on existing maps obtained from Arknights, to create and retrieve anime avatars, and to cluster a group of MNIST datasets consisting of numerals. The output images and datasets could retain and re-combine information from the inputs to a certain extent; however, in the case study of Arknights (Case Study 1), there was room for improvement due to the lack of clarity within the output image, which could essentially represent a new game level in practice.
Some statistical features of the model and relationships between different parameters were also reviewed from these 3 case studies: for example, there is a high possibility that the time complexity of this VAE model is O(n); the loss of the VAE model decreases as the number of epochs applied increases, but the rate of change of such loss also declines in general; and the time consumed in running the VAE model is positively and linearly related to the number of epochs. To prevent memory overflow and save computing resources, an appropriate scaling factor was applied to each input dataset or image at the pre-processing stage, and it was found that the time consumed increases as the scaling factor increases, and that there is a significant possibility that the loss derived from the loss function is positively and linearly related to such a scaling factor.
Despite showing some technical deficiencies in generating new game levels (as reviewed in Case Study 1), the VAE model has demonstrated its capability in data clustering. Further, for image attributes (or data points) with obviously different characteristics or spatial features, the VAE model can successfully distinguish one class from another via the model training process, and then generate images of a specific class. On average, the recognition accuracy under 50 epochs is 85.4%, which is considered satisfactory.
Generally speaking, the VAE model is most effective in generating images with a specific graphical pattern, or in handling and producing images with low resolution requirements, for example, clouds, grass and distant views in nature. It is particularly promising for clustering and for creating new characters within a game.
In view of the technical shortcomings of the current VAE model, future enhancements should lie in increasing the resolution of the generated images, for example, by combining the VAE model with other machine learning mechanisms such as GAN and LSTM, and in ensuring that the training set contains a sufficient amount of information, so that all output images contain more useful information and attributes while consisting of the fewest possible noise components. This may be achieved by revisiting the techniques adopted in the data pre-processing stages. This study has opened a new window for utilizing the strengths of the VAE for future game design missions within the industry, while identifying some potential weaknesses of the VAE and proposing potential remedies for the foreseeable future.