4.1. Reagents, Laboratory Equipment and Measurement Procedure
In this work, the colorimetric reaction used to determine ascorbic acid (AA) was based on the reduction of Fe(III) to Fe(II) by AA; the resulting Fe(II) forms a red-colored complex with 1,10-phenanthroline. The following reagents were used: L(+)-ascorbic acid pure p.a., CAS 50-81-7 (POCH, Poland); 1,10-phenanthroline p.a., CAS 66-71-7 (POCH, Poland); and ammonium iron(III) sulphate dodecahydrate pure p.a., CAS 7783-83-7 (POCH, Poland). The 1 M acetate buffer was prepared in our laboratory by mixing 1.0 mol·L⁻¹ CH3COOH with 1.0 mol·L⁻¹ CH3COONa (both reagents purchased from Avantor Performance Materials, Poland) and adjusting to the desired pH with 10 N HCl. All solutions, i.e., AA at a concentration of 2.5 g·L⁻¹, phenanthroline at 4.0 g·L⁻¹, ammonium iron(III) sulphate at 2.5 g·L⁻¹, and the 1 M acetate buffer (pH 4.6), were prepared with double-distilled water. Small laboratory equipment was used in the experiments: a precise analytical balance (RADWAG, model AS 60/220.XS, Poland), a magnetic stirrer (WIGO, Poland), automatic pipettes of various volumes (Eppendorf, Germany), and glassware, i.e., beakers and Petri dishes. The pH of the buffer solution was adjusted using a SevenCompact S210 laboratory pH meter (Mettler Toledo, Switzerland). All experiments were performed at room temperature.
Solutions with different concentrations of vitamin C for direct color assessment were prepared by first adding 9 mL of distilled water to the beakers, then 1 mL of acetate buffer, next the appropriate volume of ascorbic acid, and 0.5 mL of ammonium iron(III) sulfate. After waiting 3 minutes, 0.5 mL of phenanthroline was added. The AA additions were 0, 20, 50, 80, 100, 150, 200, 300, and 400 µL.
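As a sanity check on these dilutions, the final AA concentration in each sample can be computed from the added volume, assuming the AA aliquot itself contributes to the total volume (9 mL water + 1 mL buffer + 0.5 mL Fe(III) + 0.5 mL phenanthroline); for the 150 µL addition this reproduces the 33.63 µg/mL value quoted later in the text. A minimal sketch:

```python
# Dilution arithmetic for the sample series (sketch; assumes the AA
# aliquot adds to the total volume).
STOCK_UG_PER_ML = 2500.0                  # 2.5 g/L ascorbic acid stock
BASE_VOLUME_ML = 9.0 + 1.0 + 0.5 + 0.5    # water + buffer + Fe(III) + phen.

def final_concentration(aa_volume_ul: float) -> float:
    """Final AA concentration (ug/mL) after adding `aa_volume_ul` uL of stock."""
    v_ml = aa_volume_ul / 1000.0
    return v_ml * STOCK_UG_PER_ML / (BASE_VOLUME_ML + v_ml)

for v in (0, 20, 50, 80, 100, 150, 200, 300, 400):
    print(f"{v:3d} uL -> {final_concentration(v):6.2f} ug/mL")
```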
Figure 3 shows photos of samples prepared for further testing. The parameters studied for optimization were the reagent concentrations, volumes, and the time between the individual stages (influence of chemical kinetics). After waiting about 20 minutes, 10 mL of each colored solution was poured into flat-bottomed Petri dishes and photos were taken with a smartphone camera.
4.2. Preparation of a Color Template and Picture Acquisition
The experiment was divided into several steps; the procedure is shown in Figure 4.
The first step was the preparation of the template for color standardization. To determine which colors should be used as a reference, a series of vitamin C solutions covering the concentration range considered in this study was prepared. This was done to estimate the range of RGB values relevant for these calculations. Pictures of the tested solutions were taken and, using dedicated in-house software, 12 configurations of RGB values (colors) were chosen for the template. The averaged RGB values formed the basis for determining the color range of the template dedicated to this task. To slightly expand the color palette, additional upper and lower values were estimated, so that the final template contained 12 colors. The template, in which each color was represented by one square, was printed in several copies on a single printer. The size of each square was 180×180 pixels and the total size of the template was 600×790 pixels.
The process of taking the pictures consisted of placing 10 mL of the solution in a Petri dish on a white background (highlighting the contrast between the background and the Petri dish with the solution), next to the template (which served as a color control set) on a black surface. The pictures were taken under two conditions: natural lighting and light from the smartphone torch. In the second variant, a box completely isolating the photographed objects from the surroundings was placed over the experimental setup to cut off natural light. Differences in the colors of the solutions and the photographed color charts were observed under these two illuminations (Figure 5). The process was repeated 11 times for each sample with a different concentration of vitamin C. All of the pictures were taken with a Xiaomi smartphone (Xiaomi Redmi Note 9) with a 48 MP camera with f/1.79 aperture; the resolution of the pictures was 3984×1840 px.
To histogram-match the templates used in standardization, additional pictures of the template were taken under both conditions. The templates were cut from the images and histogram matching was performed between the obtained templates and the original one. The process and its results can be seen in Figure 6. Each template is presented as an RGB image together with its histograms for each of the channels R, G, and B. This figure clearly illustrates the operation of our algorithm. The first column shows the ideal histogram of the digital color template designed for this task. The color histograms of the template photos are expected to match these ideal characteristics. However, in real conditions, there are significant differences (second and fourth columns) between the histograms of the photos and the digital version. Nonetheless, after applying the algorithm proposed in this work, the expected shape of the color histogram for all three components R, G, and B can be reproduced: the graphs in the third and fifth columns match those in the first column.
The cumulative histograms of the matched images are identical to the cumulative histogram of the digital template (which serves as the reference) for each channel. Each spike on a given histogram corresponds to one component of a certain color on the template. The less discrete appearance of the histograms of the original template images is caused by variance due to printing and by variables that make the camera perceive the object with certain changes (e.g., shadows and highlights).
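The matching step itself is standard histogram specification: source pixel values are remapped so that the cumulative histogram of the output follows that of the reference template. A minimal single-channel NumPy sketch (not the authors' implementation; in this work the operation is applied per R, G, and B channel):

```python
import numpy as np

def match_histogram(source, reference):
    """Remap `source` values so its cumulative histogram matches `reference`."""
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    # Empirical CDFs (quantiles) of both images.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source quantile, look up the reference value at that quantile.
    matched_vals = np.interp(src_cdf, ref_cdf, ref_vals)
    return matched_vals[src_idx].reshape(source.shape)
```

The mapping is monotone, so pixel rank order is preserved, and matching an image against itself returns it unchanged.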
The data set for the calculations was prepared as follows: for 9 concentrations of vitamin C and 2 different lighting conditions, 198 pictures were taken, i.e., 11 pictures of each solution under each condition. Each image underwent the standardization process described in Section 3.1. An exemplary effect can be seen in Figure 7. Figure 8 presents the effect of applying the algorithm to various images of the vitamin C solution with a concentration of 33.63 µg/mL. This resulted in the generation of datasets for the two lighting conditions; combining both sets produced a mixed-condition set containing the images from both.
The prepared image datasets served as input data for training a regression neural network, yielding a model for determining the AA concentration from the taken photos.
4.3. Network Architecture
A multivariate regression model was defined using deep neural networks with an unconventional architecture, specifically dedicated to this task. The network design considered essential information such as color-based modeling, very similar colors corresponding to successive values of the target variable, and the approximately uniform color of each photo.
The neural network’s input layer accepts images of dimensions 50×50 with three color channels (RGB), randomly cropped from each photo. Subsequently, Conv2D convolutional layers with small 1×1 filters, L2 regularization, and the GELU (Gaussian Error Linear Unit) activation function after each convolutional layer were applied. A normalization layer was also used, operating along the feature axis of each sample, to stabilize and improve the learning process. The architecture also includes pooling layers that reduce the spatial size of the feature map, namely AveragePooling2D and GlobalMaxPooling2D, the latter reducing the spatial dimension to one value per feature map. The combination of global pooling and flattening is a standard technique for dimensionality reduction before fully connected layers. In the next stage, the data is flattened to a 1D vector, which is passed to fully connected dense layers with GELU activation and L2 regularization; an added Dropout layer prevents overfitting. The output is a single value corresponding to the predicted concentration of the solution in the image. A detailed description of the network architecture is provided in Table 1. The total number of parameters is 23,879, and training in a single cycle takes a few to several seconds (depending on the computation variant).
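Following this description, the architecture can be sketched in Keras; the filter counts, dense-layer width, and dropout rate below are illustrative assumptions, not the exact values of Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(l2: float = 1e-4) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(50, 50, 3))      # random 50x50 RGB crops
    x = inputs
    for filters in (16, 32):                        # filter counts assumed
        # 1x1 (pointwise) convolutions mix only the colour channels.
        x = layers.Conv2D(filters, kernel_size=1,
                          kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.Activation("gelu")(x)
        x = layers.LayerNormalization(axis=-1)(x)   # per-sample feature norm
    x = layers.AveragePooling2D(pool_size=2)(x)
    x = layers.GlobalMaxPooling2D()(x)              # one value per feature map
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="gelu",         # width assumed
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(0.2)(x)                      # rate assumed
    outputs = layers.Dense(1)(x)                    # predicted concentration
    return tf.keras.Model(inputs, outputs)
```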
In a convolutional layer such as Conv2D, the applied L2 regularization works on the convolutional filters (kernels). These filters are responsible for extracting features from the input images. Adding L2 regularization means that during the optimization process, the loss function the model tries to minimize includes an additional term that penalizes large weight values of the filters. Adding L2 regularization to the convolutional layer helps prevent overfitting, stabilizes the optimization process, and promotes more uniform and stable solutions, leading to better model generalization.
The 1×1 kernel size in a convolutional layer has specific and useful properties in neural networks. 1×1 filters are typically used to change the number of feature channels, or to introduce nonlinearity, without any spatial aggregation; this configuration is often called “pointwise convolution”. The 1×1 kernel changes the number of output channels (depth) without changing the width and height of the input image, allowing the information contained in different input channels to be combined without integrating spatial information. Each output point is a linear combination of the input channel values at a given spatial point. For example, if at a given input point (x, y) we have the channel values (w1, w2, w3), the 1×1 kernel applies different weights to these values and combines them, creating new output channels. Such a kernel has fewer parameters than larger convolutional kernels, resulting in fewer computational operations, which is efficient in terms of memory and computation time. In modeling based on RGB components, using a 1×1 kernel means we examine the relationships between the individual color components rather than the neighborhood of each pixel. In our task, there is no need to detect small and then larger details, as is the case in shape recognition.
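The per-pixel channel mixing described above can be demonstrated numerically; a small NumPy sketch with arbitrarily chosen weights for illustration:

```python
import numpy as np

# A 2x2 "image" with 3 input channels (think R, G, B).
x = np.arange(12, dtype=float).reshape(2, 2, 3)

# A 1x1 convolution is a per-pixel linear map across channels:
# a weight matrix of shape (in_channels, out_channels).
w = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0, -1.0]])

# Apply the pointwise convolution at every spatial location at once.
y = np.einsum("hwc,co->hwo", x, w)

# Spatial size is unchanged; only the channel depth changes (3 -> 2).
# Pixel (0, 0) has channels (0, 1, 2) -> outputs (0 + 2, 1 - 2) = (2, -1).
```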
The GELU activation function [35], applied as an alternative to ReLU (Rectified Linear Unit) and other activation functions in neural networks, ensures better gradient behavior and more efficient learning in deep neural networks (Figure 9). Its continuity, differentiability, and ability to preserve nonlinear data characteristics make it a valuable tool in deep learning model design. Unlike ReLU, which can suffer from gradient flow issues during training (known as the vanishing gradient problem), GELU provides better gradient throughput due to its structure. GELU preserves nonlinear features that are crucial for learning data representations, helping models better capture complex dependencies in training data.
In the LayerNormalization layer, normalization is performed along the feature axis for each sample, i.e., independently of the other samples in the batch. Each feature vector (e.g., the channels of an image) is normalized separately, considering only the values within that vector, not the entire batch. Unlike BatchNormalization, whose behavior depends on the batch size, LayerNormalization remains stable even with small batches or a batch size of 1. Normalization can help prevent issues such as vanishing or exploding gradients and accelerate convergence. Normalizing the activations keeps their values within a reasonable range, stabilizing the learning process; when activations are normalized, the optimizer finds it easier to adjust the network weights, speeding up convergence to optimal weight values.
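The per-sample statistics can be written out directly; a NumPy sketch of what LayerNormalization computes over the last (feature) axis, omitting the learnable scale and shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample over its feature (last) axis,
    independently of the other samples in the batch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0,  2.0,  3.0],
                  [10.0, 20.0, 30.0]])
out = layer_norm(batch)
# Both rows normalize to (nearly) the same vector: the second row is a
# scaled version of the first, and the statistics are computed per sample.
```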
GlobalMaxPool2D is a layer used in neural networks to reduce spatial data dimensions after convolutional layers or other layers processing spatial data, such as pooling layers [36,37]. It operates by selecting the maximum value from each feature channel across the entire spatial area (height and width) of the input feature tensor. It is thus an effective tool for spatial dimensionality reduction, retaining only the most significant feature (highest value) from each channel; this reduces the number of model parameters, which can help in reducing overfitting and computational complexity. GlobalMaxPool2D itself has no trainable parameters. Its result is a tensor with reduced dimensions, containing only the maximum value for each feature channel, which can then be passed to fully connected layers or other layers that process one-dimensional data vectors.
In this study, the model was compiled using the Adam optimizer with customized beta parameters and a defined loss function, which was the sum of MSE and MAE. When the loss function is a combination of MSE and MAE, it affects how the model learns to minimize errors. MSE penalizes large errors more severely (due to the squares of differences), whereas MAE treats all errors equally. Combining these loss functions means the model will aim to reduce both large and small errors. The gradient of MSE is steeper for larger errors, causing the model to learn faster from significant errors. The gradient of MAE is constant, leading to a more consistent learning rate across all errors. MAE is more resistant to outliers than MSE because it does not involve squared differences. A loss function that combines MSE and MAE can lead to faster convergence in some cases by leveraging the strengths of both methods. MSE can accelerate the reduction of large errors, while MAE can provide stability and a consistent learning pace.
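The combined objective can be written directly; a NumPy sketch of the MSE + MAE loss described above (the framework version operates on tensors, but the arithmetic is identical):

```python
import numpy as np

def combined_loss(y_true, y_pred):
    """Sum of mean squared error and mean absolute error."""
    err = y_true - y_pred
    return np.mean(err ** 2) + np.mean(np.abs(err))

# Example: errors (0, -2) give MSE = 2.0 and MAE = 1.0, so the loss is 3.0.
loss = combined_loss(np.array([1.0, 2.0]), np.array([1.0, 4.0]))
```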
During training, the learning rate was reduced after val_loss remained stable for five epochs. The initial value was 0.001. Details regarding the learning parameters are provided in Table 2.
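This schedule corresponds to reduce-on-plateau logic; a toy pure-Python sketch, where the initial rate and patience follow the description, while the reduction factor of 0.1 and the 10⁻⁵ floor are assumptions:

```python
def reduce_lr_on_plateau(val_losses, lr0=1e-3, patience=5,
                         factor=0.1, min_lr=1e-5):
    """Multiply the learning rate by `factor` whenever val_loss has not
    improved for `patience` epochs (factor and floor are assumed values)."""
    lr, best, wait = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:          # improvement: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience: # plateau reached: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history
```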
The input dataset consisted of images with dimensions of 400×400 and was divided into three parts: training, validation, and test sets in a 60/20/20 ratio. During training, 50×50 squares were dynamically cropped from the images, and these squares served as the direct input to the first layer of the network.
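The dynamic cropping step can be sketched as follows (NumPy; the seeded generator is for reproducibility only):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size=50):
    """Return a random size x size patch of `image` (an H x W x C array)."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

img = np.zeros((400, 400, 3))   # a standardized 400x400 input image
patch = random_crop(img)        # direct input to the network's first layer
```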
Six variants were tested in deep learning to train a regression algorithm for determining vitamin C from images: original images for artificial lighting, matched images for artificial lighting, original images for natural lighting, matched images for natural lighting, original images for the mixed dataset, and matched images for the mixed dataset. For each variant, the model was trained (Section 3.2) and predictions were made. Training and validation loss for the artificial-light data, for both original and histogram-matched images, are presented in Figure 10. The range is similar in both cases; however, the fluctuations of the values are smaller for the matched images, and in that condition the validation and training losses are closer to each other.
Figure 11 presents the prediction results of the presented neural network. There are three pairs of outcomes: artificial lighting, natural lighting, and mixed data, each repeated for both original and processed images. An improvement for higher concentration values is visible for each of the conditions (Table 3). The standard deviation of the predictions is lower when the proposed standardization method is applied. The largest difference is seen in the case of artificial light, which could be caused by the uniformization of the light reflections visible in the original images.
Values of numerical model evaluation and fitting factors are presented in Table 3. The R² score was very high (above 0.99) in all cases; however, in every instance the R² score was higher when standardization was performed. Likewise, the root mean squared error values were lower for matched images than for the original ones. Depending on the lighting conditions, the learning rate changed from the initial value of 10⁻³ down to 10⁻⁵.