1. Introduction
Inner Mongolia and its surrounding areas are the primary producers of China's sheep industry [1]. Recent technological advances have driven large-scale developments in this field, with automated sheep identification and tracking becoming high priorities, replacing ear tagging and other manual techniques [2]. The rapid growth of artificial intelligence (AI) and the similarity between human and animal facial features have also led to increased activity in the field of facial recognition [3,4,5]. Specifically, the emergence of animal face datasets has further promoted the development of individual animal detection and identification. For example, Chen et al. [6] applied transfer learning to obtain feature maps from 38,800 samples of 194 yaks as part of classification and facial recognition. Qin et al. [7] implemented a VGG algorithm based on a bilinear CNN to extract and classify 2,110 photos of 200 pigs. Chen et al. [8] used a parallel CNN, based on transfer learning, to recognize 90,000 facial images from 300 yak cows.
The similarity between animal and human facial features has motivated a variety of face recognition algorithms, which have been applied to solve multiple identification problems. Convolutional neural networks (CNNs) remain a fundamental model in computer vision algorithms, and the emergence of transformers has provided new possibilities for visual feature learning. For example, the Vision Transformer (ViT) replaces the CNN backbone with a convolution-free model that accepts image blocks as input. The Swin Transformer constructs layered representations by starting with small patches and gradually merging adjacent regions in deeper transformer layers. The Pyramid Vision Transformer (PVT) inherits the advantages of both CNNs and transformers, providing a unified backbone for various visual tasks. However, these and other techniques have yet to be applied to sheep face recognition, partly due to the lack of the large-scale datasets required for model training.
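To make the patch-based input concrete, the following minimal PyTorch sketch shows the embedding step by which a transformer accepts image blocks rather than whole feature maps. The 16 × 16 patch size and 768-dimensional embedding are typical ViT-Base values assumed for illustration; they are not drawn from this paper.

```python
# Minimal sketch of ViT-style patch embedding (illustrative values only):
# a strided convolution splits the image into non-overlapping 16x16 patches
# and projects each patch to a 768-dimensional token.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)                 # one 224x224 RGB image
tokens = patch_embed(x).flatten(2).transpose(1, 2)
print(tokens.shape)                             # torch.Size([1, 196, 768])
```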
While sheep face datasets have been introduced in previous studies, they exhibit several limitations. For example, Corkery et al. [9] proposed an automated sheep face identification system, but the associated dataset included only 450 samples (nine training images from each of 50 sheep). In addition, subject heads were fixed, and their faces were cleaned prior to image collection. The resulting data were then manually screened, and a cosine distance classifier was implemented as part of independent component analysis. Although this approach achieved a recognition rate of 96%, the data collection process (i.e., cleaning the sheep and restraining them in a fixed posture) is time-consuming and not conducive to large-scale farming. Wei et al. [10] introduced a larger dataset, comprising 3,121 sheep face images, and achieved a recognition accuracy of 91% using the VGGFace neural network. However, this approach was also tedious, as sheep pictures were manually cropped and framed in frontal poses. This type of manual selection and labelling of candidate boxes is not conducive to large-scale breeding. Yang et al. [11] used a cascaded pose regression framework to locate critical landmarks on sheep faces and extract triple interpolation features, producing a dataset of 600 sheep face images. While facial marker points were successfully located in more than 90% of these images, the calibration of critical points was problematic in cases of excessive head posture variability or severe occlusion.
Salama et al. [12] adopted a Bayesian optimization framework to automatically set the hyperparameters of a convolutional neural network, applying the resulting configurations to AlexNet. However, the algorithm succeeded primarily for sheep in favorable positions photographed against dark backgrounds, which occurred in only 7 of 52 batches of sheep images. In addition, the sheep were cleaned prior to filming, as facial dirt and other potential occlusions were removed using a custom tool. Data augmentation was also included, expanding the dataset to 52,000 images for training and verification and reaching a final detection accuracy of 98%. Alam [13] collected 2,000 sheep face images from ImageNet, established a database of sheep face expressions in normal and abnormal states, and analyzed facial expressions using deep learning to classify expression categories and estimate pain levels caused by trauma or infection. Hutson [14] developed the “sheep facial expression pain rating scale” by first manually labeling facial characteristics in 480 sheep face photos. These features were then learned as part of the automated recognition of five facial expressions, used to determine whether the sheep were in pain. Xue et al. [15] developed the open sheep facial recognition network (SheepFaceNet) based on the Euclidean spatial metric, reaching a precision of 89.12%. Zhang et al. [16] proposed a sheep face recognition model based on MobileFaceNet, which integrated spatial information with efficient channel attention mechanisms. The resulting ECCSA-MFC model has also been applied to sheep face detection, reaching an accuracy of 88.06%.
Shang et al. [17] collected 1,300 whole-body images from 26 sheep and applied ResNet18 as a pre-training model. Transfer learning, combined with triplet and cross-entropy loss functions, was then used for parameter adjustment. This whole-body identification algorithm reached an accuracy of 93.077%. Xue [18] used a target detection algorithm to extract sheep facial regions from images, applying key point detection and the face-to-face algorithm to achieve an average accuracy of 92.73% during target detection tests. Yang [19] constructed a recognition model based on a channel mixing module in the ShuffleNetV2 algorithm. The SKNet attention mechanism was then integrated to further enhance the model's ability to extract facial features. The accuracy of this improved recognition model reached 91.52% after tuning an adaptive cosine measurement function with optimal hyperparameters.
While these studies have achieved high recognition accuracy, the associated datasets exhibit several limitations, as described below.
Pose restrictions: sheep are often photographed in fixed postures intended to increase the consistency of facial features.
Obstruction removal: sheep are sometimes cleaned to remove dirt and other materials prior to data collection.
Extensive pre-processing: some techniques require the manual selection or cropping of images to identify facial features.
Limited sample size: the steps listed above can be tedious and time-consuming, which typically limits datasets to a few hundred samples.
These limitations are addressed in the present study, which introduces the first comprehensive benchmark intended solely for the evaluation of sheep face recognition algorithms. Unlike many existing datasets, which have imposed restrictions on the collection of sample images, the photographs used in this study were collected in a natural environment and from multiple angles, as sheep walked unprompted through a gate channel. The resulting dataset is therefore larger and more diverse than most existing sets, including 5,350 images from 107 different subjects. This diversity of samples, variety of viewing angles, lack of pose restrictions, and high sample quantity make Sheepface-107 the most robust sheep face dataset collected to date. For this reason, we suggest it could serve as a benchmark in the evaluation of future sheep face recognition algorithms. The remainder of this paper is organized as follows. Related work is first reviewed in Section 2. A description of the proposed methodology is provided in Section 3. Validation tests and corresponding analysis are described in Section 4. Finally, conclusions are presented in Section 5.
4. Discussion
The performance of three neural network models applied to the Sheepface-107 dataset was evaluated using several different metrics, as shown in Table 2. Specifically, the precision, recall, and F1-score for ResNet50 were 93.67%, 93.22%, and 93.44%, respectively, making ResNet50 the most accurate of the three. In addition, ResNet50 produced the lowest cost value (456 ms), indicating the fastest identification speed, though VGG16 required the fewest parameters. This experimental validation demonstrates that the sheep face dataset constructed in this paper supports high recognition accuracy and short runtimes for effective and efficient image classification. These results are also compared with those of similar studies in Table 3, which provides several insights. For example, the study conducted by Corkery et al. [9] achieved one of the highest recognition rates of any sheep face study to date, though it also involved some of the most restrictive data collection steps. Fixing sheep posture and facial orientation provides highly consistent data that are more easily identifiable, but at the cost of increased data collection complexity. It is also worth noting that increasing the number of samples did not necessarily ensure higher recognition rates. For instance, Xue et al. [18] included 6,559 samples in their study and developed a novel network architecture specifically for this task (SheepFaceNet), yet produced a recognition rate (89%) lower than that of Shang et al. [17] (93%), who included only 1,300 samples. This suggests that the way in which samples are collected and the diversity of their features are more impactful than the total number of samples. While VGG16 produced slightly poorer results than GoogLeNet or ResNet50 in this study, the similarity of these outcomes with those of SheepFaceNet, ResNet18, and VGGFace suggests that a broad range of network architectures are applicable to this problem, given a sufficiently diverse sample set.
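As an aside for readers reproducing Table 2, the sketch below shows one standard way to compute multi-class precision, recall, and F1-score with scikit-learn. The macro-averaging mode and the toy labels are assumptions for illustration; the paper does not state its averaging convention.

```python
# Hedged sketch: computing multi-class precision/recall/F1 for a closed set
# of sheep identities. y_true and y_pred would hold one label per test image
# (0..106 for Sheepface-107); the values below are a toy example.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 2, 2, 2]   # ground-truth sheep IDs
y_pred = [0, 1, 1, 2, 2, 0]   # model predictions

# Macro averaging weights every identity equally (an assumption here).
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```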
The stability of the reported results was also investigated to ensure the networks were not biased or overtrained.
Figure 11 shows accuracy plots as a function of epoch number for the training and test sets processed by VGG16, GoogLeNet, and ResNet50. Each of these curves converges without drastic variability, and the amplitude of oscillations is comparable in both the training and test sets, which suggests the models have not overfit the data. This study included multiple types of data augmentation as a preprocessing step, three different neural networks, and five evaluation metrics. The consistency across these results suggests the networks are not being overtrained on overly similar data. Each of these outcomes supports the proposed Sheepface-107 dataset, with its diverse variety of features, as a potential benchmark for evaluating sheep face recognition algorithms.
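Curves such as those in Figure 11 can be generated by recording training- and test-set accuracy after every epoch. The PyTorch sketch below illustrates this bookkeeping with toy stand-ins for the model and data loaders; the paper's actual training code is not reproduced here, so all names and shapes are illustrative.

```python
# Hedged sketch: per-epoch accuracy tracking for train/test curves.
# The tiny linear model and random tensors are placeholders for one of
# VGG16 / GoogLeNet / ResNet50 and the Sheepface-107 data loaders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 107)).to(device)
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 107, (64,)))
train_loader = DataLoader(data, batch_size=16)
test_loader = DataLoader(data, batch_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # base LR, Table 1
criterion = nn.CrossEntropyLoss()

def evaluate(loader):
    """Top-1 accuracy of `model` over a data loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

history = {"train": [], "test": []}
for epoch in range(200):                # 200 epochs, matching Table 1
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    history["train"].append(evaluate(train_loader))
    history["test"].append(evaluate(test_loader))
```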
5. Conclusions
This study addressed several issues exhibited by current datasets used for automated sheep face recognition, including limited sample size, pose restrictions, and extensive pre-processing requirements. As part of the study, images of sheep were acquired in a non-contact, stress-free, and realistic environment, as sheep walked unprompted through a gate channel. A benchmark dataset, Sheepface-107, was then constructed from these images, which is both robust and easily generalizable. The dataset includes samples from 107 Dupo sheep and a total of 5,350 sheep face images. Three classical CNNs (VGG16, GoogLeNet, and ResNet50) were used for a comprehensive performance evaluation. Compared with existing sheep face datasets, Sheepface-107 is more conducive to welfare-oriented breeding and the automation required by large-scale sheep industries. It could also be highly beneficial for breeding process management and the construction of intelligent traceability systems. This sample set, which is larger and more diverse than many existing sets, provides images of sheep collected in natural environments, without pose restrictions or pre-processing requirements. As such, it represents a new benchmark for research in animal face recognition technology. Future work will focus on the expansion and further study of Sheepface-107, making the dataset more stable and effective for automated sheep face recognition.
Figure 1.
An illustration of posture determination using images from an overhead camera. An ellipse was fitted to the body of each sheep using histogram segmentation and used to determine when the sheep was oriented in a favorable position, at which point either the left or right camera would activate. This approach avoids the need to restrict subject poses, as has often been done in previous studies.
Figure 2.
A still frame collected by the mounted cameras. This perspective (< 30°) was empirically determined to provide favorable viewing angles.
Figure 3.
The structure of the fixed fence channel. Labels include (1) the cameras, (2) the fixed fence aisle, (3) the fence channel entrance, and (4) the fence channel exit.
Figure 4.
An example of images collected from the 23rd Dupo sheep.
Figure 5.
A comparison of images before and after data enhancement, including (a) the original sheep face image and (b) sample results after augmentation.
Figure 6.
The VGG16 network architecture.
Figure 7.
The Inception module.
Figure 8.
The residual structure.
Figure 9.
An example of a characteristic feature map for a sheep face model.
Figure 10.
The output of individual network layers applied to specific facial features. The highlighted regions demonstrate a focus on the eyes, ears, nose, and mouth.
Figure 11.
A comparison of training and test accuracy for three different networks.
Table 1.
A comparison of CNN parameters.
| Model parameter | VGG16 | GoogLeNet | ResNet50 |
|---|---|---|---|
| Input shape | 224 × 224 × 3 | 224 × 224 × 3 | 224 × 224 × 3 |
| Total parameters | 138 M | 4.2 M | 5.3 M |
| Base learning rate | 0.001 | 0.001 | 0.001 |
| Softmax output classes | 107 | 107 | 107 |
| Epochs | 200 | 200 | 200 |
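For illustration, the configurations in Table 1 could be instantiated with torchvision roughly as follows. Training from scratch with a 107-way output layer and plain SGD is our reading of the table; the paper's actual implementation (optimizer choice, weight initialization, pretraining) may differ.

```python
# Hedged sketch: instantiating the three CNNs with the Table 1 settings.
# num_classes=107 gives each network a 107-way softmax output; the SGD
# optimizer is an assumption, as Table 1 only lists the base learning rate.
import torch
from torchvision import models

NUM_CLASSES = 107   # one identity per sheep in Sheepface-107

vgg16 = models.vgg16(num_classes=NUM_CLASSES)
googlenet = models.googlenet(num_classes=NUM_CLASSES,
                             aux_logits=False, init_weights=True)
resnet50 = models.resnet50(num_classes=NUM_CLASSES)

optimizer = torch.optim.SGD(resnet50.parameters(), lr=0.001)  # base LR
EPOCHS = 200                                                  # per Table 1

# All three networks expect 224 x 224 x 3 inputs (Table 1).
resnet50.eval()
dummy = torch.randn(1, 3, 224, 224)
print(resnet50(dummy).shape)        # torch.Size([1, 107])
```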
Table 2.
The results of validation experiments using three neural network models.
| Model | Precision (%) | Recall (%) | F1-score (%) | Parameters | Cost (ms) |
|---|---|---|---|---|---|
| VGG16 | 82.26 | 85.38 | 83.79 | 8.3×10⁶ | 547 |
| GoogLeNet | 88.03 | 90.23 | 89.11 | 15.2×10⁶ | 621 |
| ResNet50 | 93.67 | 93.22 | 93.44 | 9.9×10⁶ | 456 |
Table 3.
A comparison of Sheepface-107 and existing sheep face image datasets.
| Study | Number of Samples | Classifier Model | Recognition Rate |
|---|---|---|---|
| Corkery et al. [9] | 450 | Cosine Distance | 96% |
| Wei et al. [10] | 3,121 | VGGFace | 91% |
| Yang et al. [11] | 600 | Cascaded Regression | 90% |
| Shang et al. [17] | 1,300 | ResNet18 | 93% |
| Xue et al. [18] | 6,559 | SheepFaceNet | 89% |
| This study | 5,350 | ResNet50 | 93% |