4.3. Results of the Personalized Image Generation Model
We divided our experiments into two parts: quantitative and qualitative. Quantitative results are shown in Table 4. As evaluation metrics, we use the CLIP score, which measures the alignment between text and image, and the MSE loss between latent noises in the denoising process. In Table 4, our model outperforms the baseline model on both metrics, improving by 0.13% and 28.8%, respectively.
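To make the two metrics above concrete, the following is a minimal sketch of how they could be computed, not the evaluation code used in our experiments. It assumes that text and image embeddings have already been extracted by a CLIP encoder and are passed in as plain lists of floats. The function names, the scaling factor of 100, and the relative-improvement helper are illustrative assumptions; in particular, reading the reported percentages as relative changes over the baseline is only one plausible interpretation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(text_emb, image_emb, scale=100.0):
    """Text-image alignment as a scaled, non-negative cosine similarity.

    Assumption: the embeddings come from a CLIP text/image encoder.
    The scale of 100 is a common convention, not taken from the paper.
    """
    return scale * max(cosine_similarity(text_emb, image_emb), 0.0)


def noise_mse(predicted_noise, target_noise):
    """Mean squared error between predicted and target latent noise,
    both given as flattened lists of floats."""
    if len(predicted_noise) != len(target_noise):
        raise ValueError("noise tensors must have the same number of elements")
    squared = ((p - t) ** 2 for p, t in zip(predicted_noise, target_noise))
    return sum(squared) / len(predicted_noise)


def relative_improvement(baseline, ours, higher_is_better):
    """Percentage change relative to the baseline value.

    Hypothetical helper: for CLIP score higher is better, for MSE
    lower is better, so the sign of the difference is flipped.
    """
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * delta / baseline
```

With this convention, a CLIP score would be compared using `higher_is_better=True` and the noise MSE using `higher_is_better=False`, so that a positive percentage indicates an improvement over the baseline in both cases.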
The qualitative experiments are divided into two parts. First, we validate our model’s ability to produce images with fewer anomalies in response to short-text inputs by comparing the images generated by our model with those generated by the baseline introduced in [4]. Second, we show that our model can generate a range of personalized images tailored to individual preferences by comparing the image outputs that align with diverse aesthetic preferences.
As shown in Figure 3, our model generates images with better quality and aesthetics than the baseline model in several respects. In the first three columns, our images have no white boxes indicating discomfort areas, and they appear more vibrant. In the next two columns, our images have more detailed and colorful backgrounds, while the subjects remain of high quality. Lastly, our model successfully generates the required subject, "purple cake," and the background scene looks more artistic and layered.
Figure 4 presents images generated by our model for individuals with openness and conscientiousness personality traits. Overall, each image corresponds to the keywords or main subjects of the short text prompts, with no regions that contradict common sense. Additionally, the images preferred by individuals high in openness differ notably from those preferred by individuals high in conscientiousness.
From the perspective of color and details, the images in the first row feature richer and more varied colors, such as the sheen of the car and the intricate markings of the clock. These detailed elements provide more visual exploration space, catering to the curiosity of individuals with high openness. In contrast, the second row of images displays softer tones, with lower color saturation and weaker contrast between objects. For example, the background details of the vase are simpler and quieter, with minimal visual stimulation. This muted and orderly tone aligns with the preference for structure and clarity typical of conscientious individuals.
In terms of composition, the scenes in the first row exhibit stronger contrasts, such as that between the car and the building; the body of the clock stands out prominently, and the cake decorations appear more unusual. This sense of contrast appeals more to individuals with high openness, who prefer novel and unconventional elements. Meanwhile, the composition in the second row is more balanced, with objects arranged in an orderly manner and no striking or unusual items, reducing visual complexity. This structured and precise arrangement corresponds to the preference of conscientious individuals for orderliness and clarity.
Figure 5 presents images generated by our model for individuals with extraversion and agreeableness personality traits. From a general aesthetic perspective, each image is harmonious and free of regions that violate common sense or cause discomfort. In terms of personalized preferences, the images corresponding to different personality traits exhibit variations in details, colors, and backgrounds.
From the perspective of visual complexity, the images in the first row contain multiple focal points, such as the arrangement of vehicles, the layered decorations on the cake, and the diagonally extending branches of the plant. This creates a rich and diverse visual experience, which strongly appeals to individuals with high extraversion. In contrast, the images in the second row have lower visual complexity. The scenes appear more orderly and avoid excessive visual noise, such as the more neatly arranged vehicles and the more neatly pruned branches. Individuals with high agreeableness tend to prefer such simpler and more harmonious images, which do not induce excessive emotional stimulation.
Figure 6 presents images generated by our model for individuals with conscientiousness and neuroticism personality traits. The images in the second row exhibit softer colors with lower contrast. This low-stimulation color palette, together with the more natural and comfortable postures of the animals, helps reduce tension and creates a more relaxed feeling, which aligns with the need of individuals high in neuroticism for gentle, calming visual experiences.
Furthermore, the images in the second row convey a softer overall style. The colors of the cake are slightly muted, and the postures of the horse and dog are more relaxed. Additionally, the background elements, such as the green trees and outdoor setting, evoke a serene, natural atmosphere. This harmonious scene helps to alleviate the heightened sensitivity to external stimuli often experienced by individuals with high neuroticism.
As shown in Figure 7, the five columns of images, each representing a personalized preference, differ relatively clearly from one another. This indicates that our model can capture aesthetic differences between personality traits. In the first row, the bases of the clocks show different hollow carving designs; the neuroticism clock has a flat bottom, while only the conscientiousness clock features a square outline.
In the second row, the bases of the trees in the extraversion and agreeableness images are relatively larger squares compared to those in other images, and the conscientiousness image contains an additional tree branch compared to the agreeableness image. In the third row, only the openness and neuroticism images include half of an apple; notably, only the openness image has a distinct black-and-white background, whereas the other images feature colored backgrounds.
In the final row, it is evident that each plate contains different types of food, with some plates featuring additional yellow fragments, while others have brighter blue patterns. We observed that while there is a certain degree of similarity between images corresponding to different personality traits, subtle differences in detail are still noticeable. Furthermore, when the text length remains constant, the larger the numbers included in the text, the greater the differences between the generated images.
These results show that our adapted personalized short-text image generation model not only reduces abnormal regions with respect to common-sense aesthetics but also differentiates personalized aesthetics, while maintaining overall consistency between the text and the image.