4.1. Dataset construction
Acquiring a large and comprehensive dataset is critical for training a model of sufficient complexity to classify calligraphy styles accurately. More training data also helps reduce epistemic uncertainty, which refers to uncertainty in the model itself and is often caused by a lack of training data. Unfortunately, in our previous experiments we were unable to find appropriate data for our research. We therefore recognized the importance of constructing our own dataset so that epistemic uncertainty could be addressed, and we built a new dataset named the Chinese Calligraphy Dataset CQU (CCD-CQU).
Preparing the dataset was a time-consuming process because we targeted specific calligraphers from a particular period, the Tang dynasty (618-907 A.D.). The Chinese calligraphy works used in our study were sourced from a public website (http://www.yac8.com) that hosts an encyclopaedic collection of scanned and digitized pictures of Chinese calligraphy works from different authors and periods. The works were identified according to the targeted authors and sorted chronologically.
For Ou Yangxun (557-641 A.D.), we focused on his famous inscription work, the "Jiu Cheng Gong Li Quan Ming" stele (《九成宫醴泉铭》), which is considered one of the finest representatives of his calligraphy art. It was originally an essay documenting a trip of Emperor Tang Taizong (599-649 A.D.). His calligraphy style is often regarded as strict, neat, and well-organized, making it a popular first example for calligraphy teachers to assign to their students to copy.
For Chu Suiliang (596-658 A.D.), we collected works not only from famous inscription stele sources such as "Yan Ta Sheng Jiao Xu" (《雁塔圣教序》), "Meng Fa Shi Bei Ming" (《孟法师碑铭》), and "Qian Zi Wen Bei" (《千字文碑》), but also from rare collections of his actual writing on paper, such as "Yin Fu Jing" (《阴符经》). These works span his lifetime and are representative of his calligraphy style.
For Yan Zhenqing (709-785 A.D.), we selected the main body of characters from his famous inscription stele "Duo Bao Ta Bei" (《多宝塔碑》) and also included his alleged actual writing on paper, such as "Zhu Shan Lian Ju" (《竹山连句》). Yan left more than 138 legacy works, and "Duo Bao Ta Bei" was written when he was at the peak of his career and writing style.
For Liu Gongquan (778-865 A.D.), we chose the main body of calligraphy characters from the famous "Mysterious Tower Stele" (《玄秘塔碑》), which is considered the masterpiece of his calligraphy art and is often the first choice for followers copying and practicing the regular script style.
We also included miscellaneous examples from other copybooks, which extract individual characters from various historical documents or rubbings of inscriptions for followers to study. We collected over 2000 images of calligraphy characters for each author. We processed them in Photoshop by cropping, resizing, and reducing them to 64×64-pixel black-and-white images (Figure 4(a)-(d)) so that the CNN could be trained and tested faster. Some characters are repeated and differ in value, contrast, and background variation and noise. We worked on the assumption that these nuances were not significant for the learning algorithm and would probably even provide a beneficial challenge for deep learning training.
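As a rough illustration of this preprocessing step, the sketch below uses Pillow rather than the Photoshop workflow actually followed; the file paths and binarization threshold are assumptions.

```python
# Illustrative sketch of the preprocessing described above: convert a scanned
# character image to a 64x64 black-and-white sample. This is not the Photoshop
# workflow used for CCD-CQU; the threshold and paths are assumptions.
from pathlib import Path
from PIL import Image

def preprocess_character(src_path, dst_path, size=(64, 64), threshold=128):
    img = Image.open(src_path).convert("L")                  # grayscale
    img = img.resize(size, Image.LANCZOS)                    # downsample to 64x64
    img = img.point(lambda p: 255 if p > threshold else 0)   # binarize
    img.save(dst_path)

# Hypothetical usage:
# for src in Path("raw/ou").glob("*.jpg"):
#     preprocess_character(src, Path("processed/ou") / (src.stem + ".png"))
```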
4.2. Numerical experiments, results, and discussion
To improve the performance of a neural network model, it is common practice to train the model with more data so that uncertainty can be better captured. Data augmentation is typically used to obtain additional training samples by applying transformations such as flipping, cropping, rotating, scaling, and elastic deformations to the dataset samples. Therefore, we added an equal amount of image data through data augmentation. Of the augmented images, 50% were generated by rotation with angles ranging from 10 to 180 degrees at uniform intervals of 10 degrees. The remaining 50% were generated by adding a random background.
Figure 5a and Figure 5b illustrate samples of the augmented image data. Overall, 16,564 images were processed in the experiment.
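The two augmentation procedures can be sketched as follows; this is an illustrative implementation with Pillow and NumPy rather than the exact script used, and the noise amplitude is an assumption.

```python
# Sketch of the augmentation described above: rotations at 10-degree intervals
# (10-180 degrees) and random background noise. Noise amplitude is illustrative.
import numpy as np
from PIL import Image

ROTATION_ANGLES = range(10, 181, 10)  # 10, 20, ..., 180 degrees

def rotated_variants(img: Image.Image):
    """Yield rotated copies of a grayscale character image."""
    for angle in ROTATION_ANGLES:
        yield img.rotate(angle, fillcolor=255)  # fill corners with white

def add_random_background(img: Image.Image, amplitude=40, seed=None):
    """Return a copy of the image with uniform random background noise added."""
    rng = np.random.default_rng(seed)
    arr = np.asarray(img, dtype=np.int16)
    noise = rng.integers(-amplitude, amplitude + 1, size=arr.shape)
    return Image.fromarray(np.clip(arr + noise, 0, 255).astype(np.uint8))
```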
Figure 5a.
Rotated image data samples
Figure 5b.
Image data samples with added random background noise
The behaviour of CNNs is highly complex. In the studied classification problem, the dataset is the basis on which the built network learns. On the other hand, the performance accuracy of a CNN model for a specific learning task is significantly influenced by its architecture and algorithmic parameters [31]. To achieve the best possible accuracy, we carefully studied and tuned the hyperparameters of our CNN model, resulting in five different architectures for our application. The key parameters of our architecture are summarized in Table 2, whereas Table 3(1)-(5) outlines the detailed configurations for filters of varying sizes, referred to as configuration types 1 to 5 in the following discussion.
As a CNN model is composed of iterative blocks of convolutional and pooling layers, the way convolution and pooling operations are combined can vary significantly. Each specific combination of convolution and pooling operations can be treated as a distinct network architecture in the numerical experiments. For instance, AlexNet [24], a well-known CNN model, employs an 11×11 convolution filter with a stride of 4 as the initial layer, followed by a 5×5 filter, and then uses a 3×3 filter for all other convolutional layers. The VGG model [25], on the other hand, employs two consecutive convolutional layers followed by a pooling layer, repeated twice, and then three consecutive convolutional layers plus a pooling layer, with every convolution using a 3×3 filter.
In our design, configuration types 1, 2, and 4 use convolution filter sizes of 3×3, 5×5, and 7×7, respectively. Configuration type 3 uses a 5×5 filter for the first convolution and 3×3 filters for all others. Configuration type 5 follows a VGG-like style with nine convolutional layers (Table 3(5)). Each configuration starts with a certain number of filters (K), and the number of filters doubles at each subsequent convolution layer. The five configurations have 11, 9, 11, 9, and 15 layers in total, respectively. In addition to the architecture parameters, we have also included the algorithm-related parameters in Table 4.
Table 2.
Critical architecture parameters of our CNN design.
Parameters | Values
Input image size | 64×64×1
Filter size (F×F) | 3×3, 5×5, 7×7
Number of filters (K) | 16, 24, 32, 48
Pooling size (Max Pooling) | 2×2, stride=2
Configuration type (Cn) | C1, C2, C3, C4, C5
Neuron numbers of FC layer (N) | 512, 256, 128
Table 3.
CNN configuration type (1)-(5).
(1) Filter size 3×3 (K=32)
Block | Layer | Layer type | Description (feature map size)
Block 1 | L1 | Conv+ReLU | 32@62×62
        | L2 | Max Pooling | 32@31×31
Block 2 | L3 | Conv+ReLU | 64@29×29
        | L4 | Max Pooling | 64@14×14
Block 3 | L5 | Conv+ReLU | 128@12×12
        | L6 | Max Pooling | 128@6×6
Block 4 | L7 | Conv+ReLU | 256@4×4
        | L8 | Max Pooling | 256@2×2
Block 5 (FC layers) | L9 | FC1 | 512 neurons
        | L10 | FC2 | 256 neurons
        | L11 | FC3 (softmax) | 4 neurons
(2) Filter size 5×5 (K=32)
Block | Layer | Layer type | Description (feature map size)
Block 1 | L1 | Conv+ReLU | 32@60×60
        | L2 | Max Pooling | 32@30×30
Block 2 | L3 | Conv+ReLU | 64@26×26
        | L4 | Max Pooling | 64@13×13
Block 3 | L5 | Conv+ReLU | 128@9×9
        | L6 | Max Pooling | 128@4×4
Block 4 (FC layers) | L7 | FC1 | 512 neurons
        | L8 | FC2 | 256 neurons
        | L9 | FC3 (softmax) | 4 neurons
(3) Filter sizes 5×5, 3×3 (K=32)
Block | Layer | Layer type | Description (feature map size)
Block 1 | L1 | Conv+ReLU | 32@60×60
        | L2 | Max Pooling | 32@30×30
Block 2 | L3 | Conv+ReLU | 64@28×28
        | L4 | Max Pooling | 64@14×14
Block 3 | L5 | Conv+ReLU | 128@12×12
        | L6 | Max Pooling | 128@6×6
Block 4 | L7 | Conv+ReLU | 256@4×4
        | L8 | Max Pooling | 256@2×2
Block 5 (FC layers) | L9 | FC1 | 512 neurons
        | L10 | FC2 | 256 neurons
        | L11 | FC3 (softmax) | 4 neurons
(4) Filter size 7×7 (K=32)
Block | Layer | Layer type | Description (feature map size)
Block 1 | L1 | Conv+ReLU | 32@58×58
        | L2 | Max Pooling | 32@29×29
Block 2 | L3 | Conv+ReLU | 64@23×23
        | L4 | Max Pooling | 64@11×11
Block 3 | L5 | Conv+ReLU | 128@5×5
        | L6 | Max Pooling | 128@2×2
Block 4 (FC layers) | L7 | FC1 | 512 neurons
        | L8 | FC2 | 256 neurons
        | L9 | FC3 (softmax) | 4 neurons
(5) VGG-like style (filter size 3×3, K=32)
Block | Layer | Layer type | Description (feature map size)
Block 1 | L1 | Conv+ReLU | 32@62×62
        | L2 | Conv+ReLU | 32@60×60
        | L3 | Max Pooling | 32@30×30
Block 2 | L4 | Conv+ReLU | 64@28×28
        | L5 | Conv+ReLU | 64@26×26
        | L6 | Conv+ReLU | 64@24×24
        | L7 | Max Pooling | 64@12×12
Block 3 | L8 | Conv+ReLU | 128@10×10
        | L9 | Conv+ReLU | 128@8×8
        | L10 | Conv+ReLU | 128@6×6
        | L11 | Max Pooling | 128@3×3
Block 4 | L12 | Conv+ReLU | 256@1×1
Block 5 (FC layers) | L13 | FC1 | 256 neurons
        | L14 | FC2 | 128 neurons
        | L15 | FC3 (softmax) | 4 neurons
Table 4.
Algorithm-related parameters.
Optimizer | Adam
Learning rate | 1.0e-4
Batch size | 32, 64
Dropout rate | 0.25, 0.5
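For concreteness, configuration type 1 in Table 3(1) (3×3 filters, K=32) could be expressed in Keras roughly as follows. This is a sketch consistent with the layer listing rather than our exact implementation, and the dropout placement after the fully connected layers is an illustrative assumption based on the rates in Table 4.

```python
# Sketch of configuration type 1 (Table 3(1)): four Conv+ReLU / MaxPooling
# blocks with 3x3 filters starting at K=32, followed by FC layers 512-256-4.
# Dropout placement is illustrative; Table 4 lists rates of 0.25 and 0.5.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_config1(input_shape=(64, 64, 1), num_classes=4, k=32, dropout=0.25):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(k, 3, activation="relu"),      # 32@62x62
        layers.MaxPooling2D(2),                      # 32@31x31
        layers.Conv2D(2 * k, 3, activation="relu"),  # 64@29x29
        layers.MaxPooling2D(2),                      # 64@14x14
        layers.Conv2D(4 * k, 3, activation="relu"),  # 128@12x12
        layers.MaxPooling2D(2),                      # 128@6x6
        layers.Conv2D(8 * k, 3, activation="relu"),  # 256@4x4
        layers.MaxPooling2D(2),                      # 256@2x2
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(256, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(num_classes, activation="softmax"),
    ])
```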
In this study, the main objective was to explore feasible ways to maximize image classification accuracy while keeping the application program's running time manageable. Given a set of fixed algorithm-related parameters, the performance accuracy of a CNN model can be defined approximately as a function of the architecture parameters (F, K, C, N) as below,

Accuracy ≈ f(F, K, C, N),

where F is the filter size, K is the number of filters, C represents the configuration type, and N represents the number of neurons in the fully connected layers (which may consist of two or more components).
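Since the filter size F and the FC widths N are fixed by the configuration type in our design, the relation above effectively describes a sweep over (C, K). A hypothetical skeleton of such a sweep is given below; build_model stands for a constructor (such as the sketch after Table 4) returning a compiled model for a given configuration, and is not part of our actual experiment script.

```python
# Hypothetical sweep over configuration type (C) and starting filter count (K).
# build_model is a placeholder returning a compiled Keras model (with an
# accuracy metric) for the given configuration; not the authors' actual script.
CONFIG_TYPES = ["C1", "C2", "C3", "C4", "C5"]
NUM_FILTERS = [16, 24, 32, 48]

def sweep(build_model, x_train, y_train, x_test, y_test, epochs=30):
    results = {}
    for config in CONFIG_TYPES:
        for k in NUM_FILTERS:
            model = build_model(config=config, k=k)
            model.fit(x_train, y_train, epochs=epochs, verbose=0)
            _, test_acc = model.evaluate(x_test, y_test, verbose=0)
            results[(config, k)] = test_acc
    return results
```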
We implemented a Python application program using the TensorFlow library [32] based on the design described above. To evaluate the performance of our model, we conducted numerical experiments on our dataset, which was divided into 80% for training and 20% for testing. The results of a typical training run are shown in Figure 6, where the loss function tends to stabilize after 25 epochs, indicating convergence of the training process. Figure 7 illustrates the performance curves against the epoch, where both the training and testing accuracy curves approach stability after 25 epochs. The running time for each model variant on an ordinary PC with a single GPU ranges between approximately 12 and 45 minutes, depending on the configuration type and the number of filters (K). The VGG-like architecture takes longer to run due to its deeper structure.
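A minimal sketch of this training setup (80/20 split, Adam with learning rate 1e-4) is given below; the .npy file names are hypothetical placeholders for the prepared dataset, and build_config1 refers to the configuration sketch given earlier.

```python
# Sketch of the training setup: 80/20 split, Adam optimizer, learning rate 1e-4.
# The .npy file names are hypothetical; build_config1 is the earlier sketch.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

x = np.load("ccd_cqu_images.npy").reshape(-1, 64, 64, 1) / 255.0  # 64x64 grayscale
y = np.load("ccd_cqu_labels.npy")                                 # labels 0-3

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)

model = build_config1()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                    batch_size=32, epochs=30)
```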
The recorded accuracies for each model variant are presented in Table 5. The accuracy on the training dataset varied from 93.7% to 99.5%, which indicates that the designed networks are well trained. The accuracies on the testing dataset ranged between 89.5% and 96.2%. Figure 8 depicts the test error for different configuration types and starting numbers of filters (K). Configuration type 1 with K=32 yielded the lowest error rate of 3.8%, which is more in line with human recognition ability for this type of image. The graph shows that for configuration type 1, which has a small filter size (3×3), the test error decreases as the number of filters increases but saturates at K=32. Configuration types 1 and 3 have lower error rates, while configuration types 2 and 4 have relatively high error values. For small input images, a smaller filter size achieves better accuracy, while a larger filter size may lose some information in the extracted feature maps. Furthermore, configuration type 5 is a VGG-like configuration with a deeper structure, but it did not produce as impressive an accuracy as expected. This could be because the additional convolutional operations do not necessarily extract more meaningful features for this particular image dataset. Moreover, as illustrated in
Figure 9, the optimal number of filters for most configurations is 24 or 32. The experimental results clearly demonstrate that the built network with carefully tuned architecture parameters can correctly recognize the different font styles, i.e., the categories in our dataset. The outstanding performance accuracy is attributed to the correct extraction of font style features. A Chinese character is typically composed of eight basic stroke components: dots, horizontals, verticals, hooks, lifts (or raises), curves, throws, and presses, each of which has a dynamic corresponding form known as sideways, tighten, bow (or strive), leap, whip, sweep, peck, and hack. Different calligraphers apply these basic writing forms in unique ways, including variations in motion direction, graphic manner, stroke technique, and the force applied to create their works. For example, Ou preferred sharp and bold strokes, whereas Chu often deliberately shaped strokes, lines, and dots with turning points to reflect an aesthetic and abstract value that transcends the characters' physical appearance. These special writing manners result in individual font styles, whose features appear in the image as texture, shape, angle, and pattern. These features are effectively extracted by convolutional operations using various filters in a series of feature maps, which leads to accurate classification. Increasing the size of the training dataset in future work is likely to further improve prediction accuracy.
Table 6 presents selected test accuracies of the individual classes for configuration types 1 and 2 with K=32. The results indicate that the different categories exhibit varying performance accuracy, with class O having the lowest accuracy. This could be attributed to imbalanced sample data, as class O has the fewest images. Imbalanced training data can cause several problems that affect classification accuracy [33,34]. If one class has significantly fewer samples than the others, the network may exhibit behaviour biased against that class and tend to assign its samples to the dominant classes. The underlying cause is that the CNN attempts to minimize the overall error, which may lead it to focus more on the dominant classes and less on the minority classes. Consequently, with fewer samples in the minority classes, the network may not have sufficient information to learn the features belonging to those classes, thus reducing classification accuracy.
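One common mitigation for such imbalance, noted here only as a possible direction rather than something applied in the reported experiments, is to weight classes inversely to their frequency during training:

```python
# Possible mitigation (not used in the reported experiments): weight classes
# inversely to their frequency so minority classes contribute more to the loss.
import numpy as np

def inverse_frequency_weights(labels):
    """Return a Keras-style class_weight dict from integer class labels."""
    counts = np.bincount(labels)
    return {cls: len(labels) / (len(counts) * cnt)
            for cls, cnt in enumerate(counts) if cnt > 0}

# Hypothetical usage with the earlier training sketch:
# model.fit(x_train, y_train, class_weight=inverse_frequency_weights(y_train),
#           batch_size=32, epochs=30)
```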
Figure 8.
Performance errors for various configurations.
Table 5.
The accuracy obtained on our dataset with different architecture parameters.
K | Configuration | Training accuracy | Testing accuracy
16 | C1 | 98.6% | 94.3%
16 | C2 | 96.3% | 92.2%
16 | C3 | 98.7% | 95.5%
16 | C4 | 93.7% | 89.5%
16 | C5 | 96.6% | 93.7%
24 | C1 | 99.2% | 95.2%
24 | C2 | 98.9% | 94.0%
24 | C3 | 99.0% | 95.7%
24 | C4 | 95.7% | 90.4%
24 | C5 | 98.1% | 93.8%
32 | C1 | 99.5% | 96.2%
32 | C2 | 98.2% | 94.1%
32 | C3 | 98.9% | 94.9%
32 | C4 | 95.2% | 92.2%
32 | C5 | 98.7% | 95.1%
48 | C1 | 99.1% | 95.8%
48 | C2 | 97.7% | 93.5%
48 | C3 | 98.3% | 94.5%
48 | C4 | 96.8% | 91.5%
48 | C5 | 98.2% | 93.9%
Figure 9.
Performance errors vs. the number of filters (K).
Table 6.
Individual class accuracy for configuration types 1 and 2 with K=32.
Class | Type 1 (C1) | Type 2 (C2)
O | 93.82% | 92.08%
C | 98.32% | 96.16%
L | 96.96% | 95.22%
Y | 95.71% | 93.03%
In any classification problem, misclassification is a common issue. In this study, we analyzed a few typical misclassification examples. Figures 10-13 depict five test patterns that our CNN misclassified. Below each image, the correct answer (left) and the network prediction (right) are displayed. The errors were mostly caused by the following factors:
Figure 10.
Misclassified example 1.
Firstly, symmetry can make it difficult for calligraphers to write a character in a distinctive manner. For instance, the character "會" (Figure 10) is symmetrical and can easily be written similarly by different calligraphers in terms of strokes and order.
Secondly, stroke width is another factor that can contribute to the difficulty of character recognition. In the examples provided (Figure 12), calligraphers 1 and 3 have similar stroke thickness: whereas calligraphers 0 and 2 write thinner, more elegant strokes, calligraphers 1 and 3 write thicker, bolder strokes. This similarity in stroke width can confuse the CNN when it tries to differentiate between certain characters.
Figure 11.
Misclassified example 2.
Lastly, the simplicity of some characters, such as "三" (Figure 12, right), which has only two to three strokes, can make them challenging to differentiate. Additionally, large stains or noise in the examples (Figure 12, left) can further hinder the accuracy of CNN detection, compounding the aforementioned problems.
Figure 12.
Misclassified example 3.
Overall, these examples provide valuable insights into the challenges faced by CNNs in character recognition and highlight the need for continued improvement in this field.
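For reference, misclassified test samples like those discussed above can be collected and displayed with the true label and the prediction using a short routine such as the sketch below; it assumes a trained Keras model and the class ordering O, C, L, Y used in Table 6.

```python
# Sketch: collect and display misclassified test images as "true / predicted".
# Assumes a trained Keras model and the class ordering O, C, L, Y of Table 6.
import numpy as np
import matplotlib.pyplot as plt

CLASS_NAMES = ["O", "C", "L", "Y"]

def show_misclassified(model, x_test, y_test, max_examples=5):
    preds = np.argmax(model.predict(x_test, verbose=0), axis=1)
    wrong = np.flatnonzero(preds != y_test)[:max_examples]
    if len(wrong) == 0:
        print("No misclassified samples found.")
        return
    fig, axes = plt.subplots(1, len(wrong), figsize=(3 * len(wrong), 3))
    for ax, i in zip(np.atleast_1d(axes), wrong):
        ax.imshow(x_test[i].squeeze(), cmap="gray")
        ax.set_title(f"{CLASS_NAMES[int(y_test[i])]} / {CLASS_NAMES[int(preds[i])]}")
        ax.axis("off")
    plt.show()
```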