A clear perception of the road, followed by processing and correlating the acquired knowledge with the existing frictional state of the road, is necessary for RCE. This state knowledge is then passed on to the controller that plans the applicable behavioural reaction. Various parameters in this workflow determine how effectively the RCE functions. Sensory perception of the road can be achieved in contact form through traditional mechanical methods or using principles of capacitance, magnetostriction, piezoelectric resonance, fibre optics or flat film resonance, but the necessity to reduce maintenance costs and achieve real-time RCE has led to the emergence of contactless sensors [
17]. Contact sensors are directly installed on the road [
32], but primarily for the purpose of winter road maintenance. In the case of an AV, the tires are the only parts in contact with the road. While friction estimation using tire slip has been explored, it requires the AV to first make contact with the affected area [
33], which undermines the whole purpose of alerting the AV before it reaches that area. This makes the development of contactless RCE pivotal. Furthermore, the ramifications of failing to detect the affected road part amplify the necessity of a prior warning. This prior detection allows for timely behavioural adjustments, reducing the probability of hazards and accidents and enhancing the overall safety of the AV [
34,
35].
The choice of sensory perception medium decides the trigger point (
Section 2). In the case of contactless RCE, the perceptual power of the sensor is crucial and also determines the effective range at which RCE functions. A balance must be maintained between maximizing detection distance without losing critical features. Sensors and sensing principles work hand in hand, as outlined by the authors in [
17], who categorized contactless RCE methods into infrared spectroscopy, optical polarisation, radar and computer vision. Near-infrared LEDs have been employed to differentiate between dry, snowy, and icy surfaces, utilizing the varying light scattering behaviour upon incidence on those surfaces [
36,
37,
38]. In 2012, the difference in permittivity properties of ice and water at lower and higher frequencies was used as the principle for capacitor-based ice layer detection [
39]. Camera-based road perception has gained prominence in recent years, although its roots trace back to 1998 when researchers from Sweden utilized images to differentiate winter road conditions [
40]. Over the years, resolution of the images has improved, accompanied by advancements in feature extraction algorithms and classical ML techniques, making cameras increasingly common in RCE applications [
35,
41,
As the standard vision sensor, the camera is an integral component of an ADS, and its affordability and availability at a reasonable price further enhance its practicality [
43]. Cameras are adaptable for use not only within AVs but also alongside roads to minimize occlusion issues [
15]. Specifically for RCE, cameras have also occasionally been combined with other vision [
44] and non-vision sensors [
14] to improve the detection accuracy. The advent of CNN-based feature extraction techniques has now positioned camera-based RCE among significant topics of research [
15].
Deriving the exact friction coefficient of the road directly from camera images is desirable but not feasible. As alternatives, in the early years, researchers explored various strategies to address this challenge. Depending on the road material and the type of weather-induced condition, a rough estimate of the friction coefficient of a certain road patch can be envisaged, and this knowledge can be used to group road patch images into different classes, effectively framing RCE as an image classification problem [
31,
45]. An example of the process of road patch acquisition can be seen in
Figure 4, where the real road image on the right-hand side is taken from the dataset of [
41]. The camera fitted on the ego vehicle has a clear view of the road. A road patch extraction algorithm sends only the relevant part of the road to the RCE algorithm, thus reducing the required computation. Efforts have also been made to use pixel-level features for segmenting winter-induced conditions on the roads in the captured images. Practical approaches such as detecting drivable areas by leveraging identifiable features like trodden snow on the road have also been explored. In recent years, the research has been slowly shifting towards directly calculating the friction coefficient from image features. In
Table 2, the most commonly used evaluation metrics are listed to help the reader better understand the different methods.
3.1.1. Image Classification
A significant breakthrough in RCE happened back in 1998 [
40], when researchers worked with a small dataset of 69 color images to differentiate between the classes dry, wet, tracks, snow and ice. As features, they selected different statistical parameters of the gray levels, along with the standard deviations of the ratios of red to blue and of red to gray level. The features were then used to train a simple feed-forward neural network, and the proposed technique achieved an accuracy of 52%. Distinguishing between wet and icy roads proved the most challenging, while distinguishing between dry and snowy roads was the easiest. During dataset preparation, the most significant challenge was the manual annotation of the images, particularly in correctly differentiating between wet and snowy conditions, as slushy conditions exhibited characteristics of both.
A purely mathematical solution was proposed in [
34], which aimed to capture the vertical polarization intensity of different winter surfaces using block filters. The results were then enhanced using a graininess analysis, in which the contrast between the original image and a blurred version was analyzed. This approach did not consider the road material and focused on conditions such as snowy, icy, wet and normal dry asphalt states. The polarization step succeeded in differentiating snow from icy and wet conditions, but the graininess analysis failed to differentiate among all three conditions.
An innovative classical ML approach was introduced in [
46], which identified color as the most relevant feature for differentiating snow from dry roads. Using a set of 516 images, RGB histograms were extracted for small sub-images and concatenated into feature vectors for differentiating between snow and bare roads, while edge detectors were employed to separate tracks from the other classes, leveraging the straight lines that tracks exhibit. The features were then used to train a Support Vector Machine (SVM), achieving a maximum accuracy of 89% on images recorded from a drive recorder.
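The histogram-feature construction described above can be sketched as follows; the grid and bin sizes are illustrative assumptions, and the resulting vectors would then feed a classifier such as scikit-learn's SVC:

```python
import numpy as np

def rgb_histogram_features(image, grid=(4, 4), bins=8):
    """Split an RGB image into a grid of sub-images, compute a per-channel
    histogram for each sub-image, and concatenate everything into one
    feature vector. Grid and bin sizes are illustrative, not the paper's."""
    h, w, _ = image.shape
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            sub = image[i * h // gh:(i + 1) * h // gh,
                        j * w // gw:(j + 1) * w // gw]
            for c in range(3):  # R, G, B channels
                hist, _ = np.histogram(sub[..., c], bins=bins, range=(0, 256))
                feats.append(hist / hist.sum())  # normalise each histogram
    return np.concatenate(feats)
```

With a 4x4 grid, 3 channels and 8 bins, each image yields a 384-dimensional vector regardless of image size, which keeps the SVM input fixed-length.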
Existing public datasets were utilized effectively by authors in [
35], in which they assembled 19,000 road patches and used them to fine-tune ResNet50 [
47] and InceptionV3 [
48] models to differentiate between four road material types – asphalt, dirt, grass and cobblestone – and two road condition types – wet asphalt and snow. Color features again proved crucial for distinguishing between the classes, although the ambiguous textures of the defined classes led to several misclassifications. Among the models, ResNet50 performed the best, achieving an average accuracy of 92%.
The authors in [
49] modified the ReLU activation function to prevent the loss of neurons during training, addressing the issue of fluctuations in weights. The leakage from the negative axis of the ReLU is taken into account during training, and a 13-layer CNN was trained on 1600 images to classify between dry, wet, snowy, muddy and other roads, achieving a test accuracy of 94.89% on 400 images.
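The leaky variant of ReLU that this line of work builds on can be stated in a few lines; the slope alpha = 0.01 is the common default, not necessarily the value used in the paper:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: passes positive inputs unchanged but lets a small,
    scaled signal 'leak' through for negative inputs, so neurons keep a
    non-zero gradient and cannot die during training."""
    return np.where(x > 0, x, alpha * x)
```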
The proof of ResNet50’s robustness in RCE applications was further strengthened in [
50], where 28,000 images from in-vehicle cameras were used to fine-tune four SOTA classification CNNs (VGG16, ResNet50, InceptionV3 and Xception) to predict bare, partly snow-covered, fully snow-covered, and unrecognizable conditions, achieving a maximum accuracy of 99.41% [
51,
52]. A further 4728 images from fixed roadside Road Weather Information System (RWIS) cameras were used to test the robustness of these models against external factors such as camera angle, illumination, distance, road topology and geometry.
In [
53], the authors proclaimed surface texture to be the decisive factor in camera-based RCE. They employed circular local binary patterns to obtain the minimum grey-level texture value from the images, and also used the grey-level histogram as a significant first-level feature, on a dataset of 1000 images. These features were used to train a random forest (RF) model and a custom VGG-based CNN, TLDKNet, and their performance was compared. The CNN achieved 80% accuracy in distinguishing between road patches of high, medium and low resistance, outperforming the classical RF model by 20%.
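A minimal square-neighbourhood LBP conveys the texture-coding idea behind this feature; the paper uses the circular variant, which instead interpolates neighbours on a circle of fixed radius:

```python
import numpy as np

def local_binary_pattern(gray):
    """Minimal 8-neighbour LBP over interior pixels: each neighbour that is
    >= the centre pixel contributes one bit to an 8-bit texture code.
    A sketch of the idea only; not the circular LBP used in the paper."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]  # centre pixels (borders skipped)
    # neighbours in clockwise order starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    return code
```

A histogram of these codes over a road patch then serves as a rotation-sensitive texture descriptor, which is what the RF and CNN models consume.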
An RCE method using video data and a CNN-based feature extractor was developed in [
54] and tested on a vehicle model to control the braking behaviour. Although the proposed CNN was not the best fit for all types of braking systems, it was found to be effective for simple brake systems. The proposed method achieved an accuracy of 92% on a dataset of 1200 images with dry and wet conditions.
A new custom 33-layer CNN called RCNet was introduced in [
55], and its performance was evaluated using seven different optimizers on a dataset of 20,757 images to classify roads as curvy, dry, icy, rough, or wet. The Adam optimizer was found to be the most suitable, and the computational efficiency of RCNet was reported to be superior to that of the model presented in [
49].
Surface condition detection using images from RWIS and fixed roadside webcams was proposed in [
56]. In a comparative analysis between the SOTA pre-trained CNNs AlexNet [
57], GoogLeNet [
57] and ResNet18, ResNet18 was found to perform best, classifying a dataset of 15,000 images into dry, snowy and wet/slushy with an average accuracy of 99.1% on a test set. The ResNet architecture triumphed again in [
58], but this time on a dataset of 18,835 images from an in-vehicle camera. The classical ML models SVM and K-Means Clustering were compared with the CNN architectures ResNet50 and MobileNet, and ResNet50 demonstrated superior performance, achieving an accuracy of 98.1% for classifying dry, wet and icy roads.
Challenges in RCE at nighttime include the difficulty of reliably capturing road patches due to the absence of sunlight. The available illumination sources are primarily the headlights of the ego vehicle and, in some instances, ambient light from the headlights of other vehicles and from street lights. The presence of ambient illumination cannot be ensured at all times. Dry and snowy surfaces, due to their rough texture, reflect light in all directions. In contrast, wet roads have smoother textures, so they are captured reliably only under ambient lighting, while reflections from the ego vehicle's headlights are often scattered away from the camera. This dilemma is analyzed in [
59] where 45,200 images collected at night time with and without ambient illumination are trained on SOTA classification architectures, SqueezeNet [
60], VGG16/19, ResNet50 and DenseNet121 [
61], and three custom CNNs. DenseNet121 outperformed ResNet50 by nearly 2%, making it the most effective classifier of road patches under nighttime conditions.
A highly imbalanced dataset of 3790 images with broad labels for different weather and road conditions, including wet, snowy, dry and icy, was analyzed for RCE classification using SOTA CNNs and notably the Vision Transformers (ViTs) [
62] in [
63]. ViTs have gained prominence for image processing in the last few years. Unlike conventional CNNs, which slide convolutional filters across every pixel of an image, ViTs split the image into small patches and process these as tokens, reducing the overall computation time. In this specific work, the ViT-B/32 model achieved a training accuracy of 98.66%, while the ViT-B/16 model achieved a slightly higher validation accuracy of 90.95%.
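The patch-token view of a ViT can be illustrated with plain array operations; patch size 16 corresponds to ViT-B/16 and 32 to ViT-B/32:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping patch x patch tokens, the
    first step of a ViT (each flattened token is then linearly embedded
    and fed to the transformer encoder)."""
    h, w, ch = image.shape
    assert h % patch == 0 and w % patch == 0
    # group rows and columns into patch blocks, then flatten each block
    x = image.reshape(h // patch, patch, w // patch, patch, ch)
    x = x.transpose(0, 2, 1, 3, 4)  # (row-block, col-block, patch, patch, ch)
    return x.reshape(-1, patch * patch * ch)
```

For a 224x224x3 input, patch size 16 yields 196 tokens of 768 values each, while patch size 32 yields only 49 tokens, which is why ViT-B/32 is the cheaper of the two models.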
In a significant milestone for RCE, a wetness dataset (RoadSAW) was introduced in 2022 [
41], accompanied by a baseline analysis using the MobileNetV2 architecture [
64]. The choice of MobileNetV2 was driven by the need to balance CNN complexity with computational efficiency. This dataset, one of the first large-scale datasets for RCE, comprises 720,000 road patches categorized into 12 classes combining 3 road material types and 4 wetness conditions. The baseline attained an F1-score of 64.23%, setting the stage for further research opportunities and advancements in the field. In 2023, the same authors complemented the RoadSAW dataset with a snowy road patches dataset (RoadSC) [
65]. This new dataset features 90,759 manually annotated images categorized into freshly fallen snow, fully packed snow, and partially covered snow. When combined with the RoadSAW dataset, the overall F1-score improved to 70.92%.
Another large scale RCE dataset Road Surface Classification Dataset (RSCD) was introduced in 2023 [
30], featuring 27 class combinations and a baseline analysis using the architecture of EfficientNet-B0 [
66]. To enhance the robustness of the algorithm, a fusion technique based on Dempster-Shafer (DS) evidence theory was also proposed. The baseline analysis yielded an accuracy of 89.02%, while ablation experiments increased the training accuracy by 3% to reach 92.05%. On a test dataset of 1200 image pairs, the fusion approach reached an impressive accuracy of 97.50%.
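Dempster's rule of combination, the core of DS evidence theory, can be sketched for two mass functions; the focal elements and mass values below are illustrative, not taken from the paper:

```python
def ds_combine(m1, m2):
    """Dempster's rule of combination for two mass functions whose focal
    elements are frozensets of hypotheses. Mass assigned to conflicting
    (empty-intersection) pairs is discarded and the rest renormalised."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # conflicting evidence
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}
```

Fusing two sources that both lean towards "wet" concentrates the combined mass on that hypothesis, which is how the fusion step sharpens uncertain single-source predictions.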
To address feature redundancy and class imbalance issue in RSCD, the authors in [
67] explored an innovative approach by modifying the RexNet network, a MobileNet-based CNN. Separate feature extraction convolutions were introduced for low-dimensional and high-dimensional features, which are then fused at a later stage, leading to a custom CNN classifier, Attention-RexNet. For the long-tail issue in the dataset, a balanced softmax cross entropy was proposed. The proposed algorithm was unable to surpass the baseline accuracy, yielding only 87.67%.
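The balanced softmax idea can be sketched in a few lines: the log class prior is added to the logits during training, so the logits themselves need not absorb the class imbalance and tail-class predictions improve at inference. This is a generic sketch of the technique, not the authors' exact formulation:

```python
import numpy as np

def balanced_softmax_ce(logits, label, class_counts):
    """Balanced softmax cross entropy: shift each logit by the log of its
    class frequency before the softmax. With equal class counts this
    reduces exactly to ordinary cross entropy."""
    z = logits + np.log(class_counts)
    z = z - z.max()  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]
```

For a tail-class sample the shifted softmax assigns it lower probability, so its loss (and gradient) grows relative to the standard loss, counteracting the long tail.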
The authors in [
68] combined the lightweight ViTs TinyViT and MobileViT [
69,
70], to leverage their ability to preserve spatial information until a late-fusion stage that integrates the local and global characteristics of the images. Using RSCD, they developed a late-fusion module that concatenates the feature maps and feeds them into a simple classifier block. The choice of ViTs was motivated by their computational efficiency and suitability for real-time RCE. In a baseline comparison with previous works, their model EdgeFusionViT surpassed the baseline results from [
30] by achieving an accuracy of 89.76%. The generalization capability of the RSCD and RoadSAW/RoadSC datasets is discussed in detail in the next subsection.
Two examples where the weather data was considered to enhance the classification accuracy are detailed next. In [
72], the authors combined five weather variables – air temperature, relative humidity, pressure, wind speed, and dew point – with 14,000 images to classify road conditions as bare, partly snow-covered, and fully snow-covered. They first compared seven state-of-the-art CNN models on the images alone, then integrated the weather data using three classical ML models. After the fusion, the Naive Bayes Classifier marginally achieved the highest accuracy. Another instance of fusing the weather parameters was done in [
14], where 600 images were taken from an asphalt pavement in Northeast Forestry, and the image features were combined with meteorological and temperature data. An average precision of 95.3% was achieved in classifying the classes dry, fresh snow, transparent ice, granular snow and mixed ice.
3.1.4. Friction Coefficient Estimation
One of the first approaches to correlate the image features with a friction coefficient was presented in [
33], where the authors designed a two-stage classifier using a dataset of 5300 images from a vehicle's front camera. In the first stage, images are classified into dry asphalt, wet/water, slush or snow/ice using the CNN SqueezeNet and a feature-based model. The SqueezeNet-based model achieved the highest accuracy of 97.36% by also learning features from the sky and the surroundings. In the second stage, the classified road patch is divided into 5x3 sub-patches, and each sub-patch is regressed to a value between 0 and 1, with 0 corresponding to dry and 1 to snow, yielding a probabilistic matrix. Finally, a rule-based model classifies the patches into low, medium or high levels of friction. The overall method achieves an accuracy of 89.5%. Although exact friction coefficients were not estimated, this work uniquely divided images into smaller patches for an enhanced correlation between image features and the frictional state of the road.
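The reduction of the 5x3 probabilistic matrix to a friction level can be illustrated with a simple rule; the averaging and the threshold values are assumptions for illustration, not the paper's actual rules:

```python
import numpy as np

def friction_level(prob_matrix, low_thr=0.33, high_thr=0.66):
    """Rule-based reduction of a per-sub-patch snow-probability matrix
    (0 = dry, 1 = snow) to a coarse friction level. The mean-then-threshold
    rule and the thresholds are illustrative placeholders."""
    p = float(np.mean(prob_matrix))
    if p < low_thr:
        return "high friction"
    if p < high_thr:
        return "medium friction"
    return "low friction"
```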
A consequential contribution for RCE was done in [
18], where the authors also introduced an RCE dataset, Winter Road States. They addressed the open-source unavailability of most datasets previously used for RCE. 5061 images were collected from well-known AV datasets and annotated as dry, wet, partly snow, melted snow, fully packed snow, and slush. A mapping is also established from road states to approximate friction coefficients. In a subset of 2007 images, the drivable area is annotated pixel-wise. The original dataset is trained on six of the previously mentioned SOTA CNNs. An auxiliary network performs the segmentation task on the images to identify the drivable area, concentrating mainly on the road features, and this segmented map is transferred to the classification results. Once again, ResNet50 as the base structure achieves the highest accuracy of 86.53%.
Vehicle parameters are synchronized with the results from the RCE part in [
85] to have a unified effect on the control module and acquire the tire-road friction. The RCE part consists of two segments: semantic segmentation to reduce image detail, followed by a ShuffleNetV2-based road condition estimator [
86]. The idea used here is similar to [
18]. The semantic segmentation part is trained using 500 images from Cityscapes. For the classification part, 8000 original images were collected for 8 different classes: dry asphalt, wet asphalt, dry cement, wet cement, brick, loose snow, compacted snow and icy, and an accuracy of 97.9% was achieved. In parallel, an unscented Kalman filter estimates the tire-road friction coefficient from the vehicle dynamics parameters, and the confidence values from the visual and dynamic sections are fused to yield a unified tire-road friction coefficient.
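The final fusion step can be illustrated as a confidence-weighted combination of the two friction estimates; the convex-combination rule below is a simplifying assumption, as the paper's exact fusion of confidence values may differ:

```python
def fuse_friction(mu_vis, conf_vis, mu_dyn, conf_dyn):
    """Confidence-weighted fusion of the vision-based friction estimate and
    the dynamics-based (unscented-Kalman-filter) estimate into one unified
    tire-road friction coefficient. A sketch; not the paper's exact rule."""
    total = conf_vis + conf_dyn
    return (conf_vis * mu_vis + conf_dyn * mu_dyn) / total
```

When the dynamics branch has no confidence (e.g. before the tires excite any slip), the fused value simply falls back to the visual estimate.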
The creators of RSCD further extended their analysis on their dataset by attempting to estimate the exact friction coefficient of the road patch [
45]. In the first stage, the images are classified into 5 different classes – dry, wet, water, snow and ice – using an EfficientNet-B0 network, achieving a top-1 accuracy of 94.84%. In the second stage, the predicted class is mapped to a friction coefficient range using Gaussian kernel functions. Misclassifications are handled in this second stage with a filter that converges faulty friction values to the desired range.
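The class-to-friction mapping via Gaussian kernels can be sketched as a kernel mixture; the per-class friction means and widths below are placeholder values, not the paper's calibrated ones:

```python
import numpy as np

# Illustrative class-to-friction priors (mu, sigma); the calibrated values
# from the paper are not reproduced here, so these are placeholders.
PRIORS = {"dry": (0.85, 0.05), "wet": (0.60, 0.08), "water": (0.45, 0.08),
          "snow": (0.25, 0.06), "ice": (0.10, 0.04)}

def friction_from_probs(class_probs):
    """Mix Gaussian kernels centred on each class's nominal friction value,
    weighted by the classifier's softmax probabilities, and return the mode
    of the resulting mixture as the friction estimate."""
    grid = np.linspace(0.0, 1.0, 500)
    density = np.zeros_like(grid)
    for cls, p in class_probs.items():
        mu, sigma = PRIORS[cls]
        density += p * np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return grid[np.argmax(density)]
```

Because the estimate comes from a mixture rather than a hard class label, a borderline wet/water prediction lands between the two class centres instead of jumping between ranges.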
Another significant work in estimating the friction coefficient from images is presented in [
15], where the authors propose a regression approach to estimate a scalar friction value from image features. The dataset used included 48,791 images, captured from roadside cameras, with corresponding
grip factor values. Feature extraction happens in parallel through a DINOv2 backbone and a custom CNN, with the features concatenated at the output. These are then passed to a fully connected NN ending in a sigmoid function to produce a scalar value. The evaluation is compared with other SOTA CNNs, and their custom architecture was found to be the best performing, with an MAE of 15%.