Applying the scientific methodology of Nunamaker, this section presents our observations on the current state of the art.
2.1. Definitions and Characterizations of Avatars in Literature
According to Miao et al. [29], p.71f, avatars in virtual worlds of the metaverse lack a unified definition and taxonomy. Based on their empirical analysis of the different avatar definitions used in relevant papers, they suggest the following definition: "We define avatars as digital entities with anthropomorphic appearance, controlled by a human or software, that are able to interact" [29]. They further propose a simple two-by-two taxonomy: the first dimension is form realism, the second behavioral realism. Form realism is defined mainly by the level of anthropomorphism, which increases with realistic human appearance, movement, and spatial dimensions. Behavioral realism is determined by interactivity and the controlling entity. For the purposes of this discussion, it is sufficient to note that four simple character types result: the Simplistic Character (low in form and behavior), the Superficial Avatar (high in form but low in behavior), the Intelligent Unrealistic Avatar (low in form but high in behavior), and the Digital Human Avatar (high in form and behavior). In their typology, the form-realism dimension describes attributes that are visible when looking at an avatar; they include the representation as a 2D or 3D model, its static or dynamic graphics, and human characteristics such as gender, race, age, and name. An excerpt is presented in Figure 2 [29], p.72.
Ante et al. [30], p.11 present a similar definition, defining an avatar of a user as “one or more digital representations of themselves in the digital world” [30], p.11. At a high level, they classify avatars into the following types: Customizable, Non-customizable, Self-representational, Non-human, and Abstract. They likewise observe a high level of anthropomorphism, but also include abstract and non-human characters, which may lack these attributes [30], p.16.
In view of the aforementioned classifications, it is evident that humanoid avatars (Human) are a prominent subject of interest, particularly in the context of the metaverse. In the classification of Miao et al. [29], p.56, they are regarded as Superficial Avatar or Digital Human Avatar; in the classification of Ante et al. [30], p.16, they fit the Customizable, Non-customizable, and Self-representational types. These humanoid avatars are most easily recognized by their silhouette, with four limbs, a torso, and a head, but also by a face and clothing. All other detected avatars are broadly subsumed under a residual class (NonHuman).
Anthropomorphism is described as an important feature of an avatar [30], p.5. Using humanoid avatars makes it easier to ascribe human features to avatars and strengthens social interaction and general engagement in a virtual world [31]. This includes simpler self-representation and identification with an avatar; in addition, higher levels of empathy, social connection, and satisfaction are reached [30], p.20f. The Human-Centered Design approach [32], p.1106 is another argument for focusing on humanoid avatars.
In summary, avatars can roughly be characterized as either Human or NonHuman, where NonHuman avatars may still exhibit anthropomorphic features at a lower level but are not required to. This may make them harder to classify as avatars than Human avatars. Furthermore, there is no unified or widely used avatar characterization, and creating or extending an ontology would be helpful for modeling avatars.
The 256-Metaverse Records dataset [33,34] contains video-based MVRs collected in the wild from different metaverse virtual worlds. A sample of different avatars from the dataset is shown in Figure 3. The virtual worlds displayed all contain at least one avatar, namely the self-representation of the user engaging in the virtual world. On the top left, a scene from Second Life [35] is displayed, followed by a snapshot of Roblox [12]. On the bottom left, a gathering in Fortnite [13] is shown, next to a scene in a restaurant in Meta Horizon Worlds [36]. All avatars are close to a humanoid representation, with different levels of abstraction: the Roblox sample displays a blocky, toy-like representation; Second Life is close to photorealistic; Fortnite has a more realistic look with some cartoonish elements; and Horizon Worlds uses an oversimplified but still realistic look, although its avatars are missing their lower body and float freely. Even though this is an avatar-labeled training dataset, it is highly likely that the labels require adaptation or that the videos must be converted, e.g., to images of a specific size or format.
From the authors' observation of the dataset, virtual worlds employ indicators of different forms hovering over the avatars' heads, including text boxes, diamonds, or arrows.
Steinert et al. [37], p.4 investigate the differences between ontologies of common multimedia and those of MVRs. They propose to define an avatar as a tuple of a nametag, the Indicator, and a character, where a character can be a text line, a 2D model, or a 3D model. Although Steinert et al. do not find an existing ontology containing an avatar, they propose to extend the Large-Scale Concept Ontology for Multimedia (LSCOM) [38], p.81f, naming the perceptual agent class as a possible super-class. After reviewing the literature and visually inspecting MVRs with avatar data, classifying avatars might become easier when this characterization is extended. When something of type Human is found, it could be an avatar, but it could also be a simple depiction of a human in a painting; for NonHuman avatars, this distinction can be even harder. The author suggests including a Sign, which refers to the indicator, whenever it is found in relation to a detection of Human or NonHuman, to allow easier identification of avatars. However, based on the observations in the wild, the described nametag is only one form of Indicator, and no method or evaluation is presented.
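For illustration, this extended characterization, a character of one of the three forms plus an optional Indicator that may take forms beyond the nametag, can be modeled as a small data structure. All class and attribute names below are our own illustration, not taken from [37]:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CharacterType(Enum):
    TEXT_LINE = "text line"
    MODEL_2D = "2D model"
    MODEL_3D = "3D model"

class IndicatorForm(Enum):
    # Observed forms of hovering indicators in the MVR dataset.
    NAMETAG = "nametag"
    DIAMOND = "diamond"
    ARROW = "arrow"

@dataclass
class Avatar:
    character: CharacterType
    is_human: bool                      # Human vs. NonHuman residual class
    indicator: Optional[IndicatorForm]  # None if no indicator is visible

# A humanoid 3D character with a hovering nametag is easy to identify ...
alice = Avatar(CharacterType.MODEL_3D, is_human=True,
               indicator=IndicatorForm.NAMETAG)
# ... while a NonHuman character without any indicator stays ambiguous.
blob = Avatar(CharacterType.MODEL_2D, is_human=False, indicator=None)
```

A detector could then treat a present indicator as additional evidence when deciding whether a Human or NonHuman detection is indeed an avatar.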
Upon examination of the MVRs, it becomes evident that a significant proportion of non-human avatars are anthropomorphic animals or animal-like creatures. This observation could influence recognition, yet it is not reflected in existing categorization schemes.
However, extending the existing avatar classification by modeling the indicator property, and using it in avatar detection, remains a challenge.
2.2. Object Detection
For the automatic detection of object instances in images and video, machine learning, and in particular object detection, has proven to be effective [39]. Multiple supervised learning algorithms for classification and localization exist that might be applicable to the task of avatar detection.
Existing object detection models can be used to detect avatars. For example, avatars are similar to humans; hence, a model that successfully detects humans could be reused for avatar detection. Such an approach likely reduces the amount of training data needed [40] by combining active learning and transfer learning. Transfer learning provides a pre-trained model that has been trained on a different dataset [40]. Active learning describes the selective annotation of only those unlabeled training samples with high entropy; e.g., this could be determined by classifying the unlabeled data and checking for data points with low assigned probabilities [40]. As a result, only part of the training data has to be labeled.
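As a minimal sketch of the entropy-based selection step described above (not the exact procedure of [40]; the function names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def select_for_labeling(probs, k):
    """Pick the k unlabeled samples the model is least certain about.

    probs: (n_samples, n_classes) predicted probabilities on unlabeled data.
    Returns indices of the k highest-entropy samples.
    """
    scores = np.array([entropy(p) for p in probs])
    return np.argsort(scores)[::-1][:k].tolist()

# Three unlabeled predictions: confident, uncertain, moderately confident.
probs = np.array([[0.98, 0.02],   # almost certainly class 0
                  [0.51, 0.49],   # maximally uncertain
                  [0.80, 0.20]])
print(select_for_labeling(probs, 1))  # → [1]: only the uncertain sample needs annotation
```

In practice, the selected samples would be sent to human annotators and the model retrained on the grown labeled set, iterating until performance is sufficient.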
A similar approach is used by Ratner et al. [41], p.7, who show three things with their data programming framework in the field of weak supervision, where some data is labeled automatically by simple solutions such as heuristic labeling functions, which are therefore noisy, biased, or otherwise error-prone. First, Ratner et al. show that data programming can generate high-quality training datasets. Second, they demonstrate that LSTM models can be used in conjunction with data programming to automatically generate better training data. Finally, they present empirical evidence that data programming is an intuitive and productive tool for domain experts.
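Data programming itself learns the accuracies and correlations of the labeling functions; as a deliberately naive illustration, heuristic labeling functions for avatar candidates combined by a simple majority vote could look as follows (all heuristics, feature names, and thresholds are hypothetical):

```python
# Each labeling function votes AVATAR (1), NOT_AVATAR (0), or ABSTAIN (-1)
# on a candidate detection described by a feature dictionary.
ABSTAIN, NOT_AVATAR, AVATAR = -1, 0, 1

def lf_has_indicator(d):
    # A hovering indicator (nametag, diamond, arrow) strongly suggests an avatar.
    return AVATAR if d.get("indicator") else ABSTAIN

def lf_humanoid_silhouette(d):
    # Four limbs and a head: likely humanoid, hence possibly an avatar.
    return AVATAR if d.get("limbs", 0) == 4 and d.get("has_head") else ABSTAIN

def lf_static_background(d):
    # Detections that never move are probably scenery, not avatars.
    return NOT_AVATAR if d.get("motion", 1.0) < 0.01 else ABSTAIN

def weak_label(d, lfs):
    """Majority vote over non-abstaining votes; None on ties or no votes."""
    votes = [v for v in (lf(d) for lf in lfs) if v != ABSTAIN]
    if not votes or votes.count(AVATAR) == votes.count(NOT_AVATAR):
        return None
    return AVATAR if votes.count(AVATAR) > votes.count(NOT_AVATAR) else NOT_AVATAR

lfs = [lf_has_indicator, lf_humanoid_silhouette, lf_static_background]
detection = {"indicator": "nametag", "limbs": 4, "has_head": True, "motion": 0.4}
print(weak_label(detection, lfs))  # → 1 (weakly labeled as avatar)
```

The resulting noisy labels would then serve as weakly supervised training data; data programming improves on this by weighting each function by its estimated accuracy rather than counting votes equally.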
These approaches show that acquiring training data is neither a simple nor a solved issue, since not even the labeling can be handled fully automatically. In theory, if the model to be trained works well, it could itself be regarded as an automatic label generator for training data in the sense of weak supervision. Beyond that, these approaches might help to create more labels, but they still require training data to label in the first place.
Other approaches that seem promising, given their current success in image recognition, are based on Convolutional Neural Networks (CNNs) [42]. Basic CNNs have been extended and improved into models such as You Only Look Once (YOLO) [43] or Region-based CNNs (R-CNNs) [44], which have proven useful for object detection. In short comparison, R-CNNs work in multiple steps, which increases accuracy but reduces speed for live object detection, while YOLO performs all these steps at once, which reduces complexity and increases speed but slightly lowers detection accuracy.
YOLO generalizes to unfamiliar depictions of objects considerably better than R-CNNs; e.g., after training on images of real humans, it can still detect abstract depictions of persons in artworks quite well. A high capability for abstraction might be a big advantage. At the same time, it is very fast, allowing for wider use cases or lower resource consumption due to its simpler and more efficient modeling. Furthermore, it takes the entire picture as input, including the background, which might help incorporate contextual information, especially when trying to detect more abstract avatars. Even for amorphic appearances, YOLO has proven able to detect unspecific objects such as potholes [45,46].
YOLO's major shortcoming in accuracy concerns exact detection locations and multiple objects in close proximity. This might limit the model when multiple avatars are close to each other, although newer YOLO versions reduce these issues considerably.
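The proximity issue can be traced to the non-maximum suppression (NMS) step used by such detectors: a genuinely distinct avatar that strongly overlaps a higher-scoring detection can be suppressed just like a duplicate box. A self-contained sketch with illustrative coordinates:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep highest-scoring boxes, drop any box whose
    overlap with an already-kept box exceeds iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

# Two avatars standing close together plus a duplicate detection of the first:
boxes = [(10, 10, 50, 90), (12, 11, 52, 92), (45, 10, 85, 90)]
scores = [0.9, 0.7, 0.8]
print(nms(boxes, scores))  # → [0, 2]: the duplicate is dropped, the neighbor survives
```

With a lower threshold or heavier overlap, the neighboring avatar itself would be suppressed, which is exactly the failure mode described above for avatars in close proximity.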
Our literature search did not find an approach that directly applies object detection to avatars in MVRs, but there are multiple highly capable candidate algorithms at hand. Some, such as a pre-trained YOLO model, might work quite well without further modification, since they generalize well from human images to humanoid representations. Beyond that, such models can be provided with specialized training data. A remaining challenge is the modeling and implementation of an adapted YOLO model, specialized by transfer learning on avatar-class-annotated MVRs.
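For illustration, fine-tuning a pre-trained YOLO model by transfer learning would mainly require a dataset definition mapping annotated MVR frames to the avatar classes discussed in Section 2.1. The fragment below follows the dataset-configuration convention of the Ultralytics YOLO framework; all paths are placeholders:

```yaml
# Hypothetical dataset configuration for fine-tuning a pre-trained YOLO
# model on avatar-annotated MVR frames (paths are placeholders).
path: datasets/avatar-mvr     # dataset root
train: images/train           # extracted, annotated video frames
val: images/val
names:
  0: Human                    # humanoid avatars
  1: NonHuman                 # residual class, cf. Section 2.1
```

Training would then start from weights pre-trained on a general-purpose dataset, so that the human-like visual features already learned can be transferred to avatar detection.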