1. Introduction
A crash results from a chain of malfunctions among the components of the driving system. Among all crash causation factors, driver misbehavior is the most frequent, present in more than 90% of all crashes (Dingus, et al., 2016). In terms of crash genesis, the psychological precursors of an abnormal status already exist before the crash scene (Reason, 1990). Therefore, it is possible to identify abnormal driving status before the crash occurs. Anomaly detection (AD) is a data mining technique that identifies events that deviate from the majority and do not conform to a pre-defined normal behavior (Chandola, Banerjee, & Kumar, 2009). Driving anomaly detection (DAD) is the operation of identifying driving anomalies (DAs), and DAD methods can be classified by how they define a DA.
Although no definition of DA has yet been formally established, there are three major approaches: common sense, the statistics of the majority of drivers, and the individual driving pattern. From common sense, a DA can be defined as a status or behavior that would likely cause a crash, such as aggressiveness, drowsiness, and impaired driving (driving under the influence and driving with distraction) (Miyaji, Danno, & Oguri, 2008). Most drivers normally comply with traffic control rules; therefore, an individual driving maneuver that conforms to the majority is considered normal. Hence, a DA can be defined as a deviation from the statistical majority. A crash is a rare event for a driver, which means a driver can drive for years without a crash. Every driver has their own driving pattern, such as the way of pressing the gas and brake pedals, the way of steering, and the distance kept when following a vehicle. Driving is a complicated behavior controlled by both conscious and subconscious processes, and each driver has a familiar way to drive safely. In this regard, not complying with one’s own driving pattern can also be considered a DA (Igarashi, et al., 2004) (Fancher, 1998).
Under the first definition of DA, DAD is conducted by monitoring the driver’s body, for example, exhalation using in-vehicle alcohol sensors, and facial and body movements using cameras and image processing technology. These methods are straightforward but have many drawbacks, such as the high cost of devices, technical limitations, and privacy issues (Jafari & others, 2017). Under the second definition of DA, many DAD methods use socioeconomic (SE) data, such as age, gender, and income level. SE factors were assumed to influence driving behavior psychologically (Boyle & Lampkin, 2007) and were found to be statistically correlated with the occurrence of crashes. SE methods have been widely used by automobile manufacturers and insurance companies to identify risky drivers because the measurements are easily and economically available (Ayuso, Guillen, & Nielsen, 2019). Also under the second definition of DA, trajectory data, the footprints of driving maneuvers, have been used. For example, highway patrol officers observe vehicle trajectories to catch traffic violations. “Aggressive driving” is a term used by the National Highway Traffic Safety Administration to classify “driving actions that markedly exceed the norms of safe driving behavior”. However, reaching consensus on how to define the “norms” theoretically was reported to be challenging (Richard, Magee, Bacon-Abdelmoteleb, Brown, & others, 2018). One reason is that individual driving patterns differ significantly: a norm that fits a slow driver might not fit a fast driver. In non-administrative safety research, the term “aggressive driving” has been replaced by “driving volatility”, a more objective and measurable term that describes instantaneous driving decisions (Lajunen, Karola, & Summala, 1997).
Speed, acceleration (Lajunen, Karola, & Summala, 1997), and jerk (Ericsson, 2000), all embedded in the trajectories, were studied and selected as key performance indicators (KPIs) to measure driving volatility. Using speed directly as a KPI for DAD was considered naïve because speed depends on the context of speed limits (Ellison & Greaves, 2010). A simple alternative was to use maximum speed, since higher maximum speeds were associated with drivers who had more accident records (Lajunen, Karola, & Summala, 1997). Acceleration was also found to be associated with risky drivers. The cut-off values for abnormal acceleration were chosen as 1.47 m/s² for aggressive acceleration and 2.28 m/s² for extremely aggressive acceleration (Kim & Choi, 2013), whereas (De Vlieger, De Keukeleere, & Kretzschmar, 2000) set the range of 0.85 to 1.10 m/s² as aggressive acceleration. No consensus has been reached because the thresholds for normal acceleration are context-sensitive (Wang, Khattak, Liu, Masghati-Amoli, & Son, 2015). To include both speed and acceleration in classifying driving behavior, the changes of acceleration with respect to speed (Langari & Won, 2005) and with respect to time (Murphey, Milton, & Kiliaris, 2009) were studied. Accelerations were found to vary with speed, and accelerations in different directions were found not to vary together. Multivariate KPIs for longitudinal and lateral accelerations in various speed bins were therefore introduced (Liu & Khattak, Delivering improved alerts, warnings, and control assistance using basic safety messages transmitted between connected vehicles, 2016). The rule-based method has the advantages of simplicity and efficiency (Martinez, Heucke, Wang, Gao, & Cao, 2017), while its disadvantage is that it cannot address the different driving patterns of individuals. Thus, a DAD on the individual level, following the third definition of DA, is needed.
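The rule-based approach described above can be illustrated with a minimal Python sketch. It uses only the two cut-off values quoted from (Kim & Choi, 2013); the function names, the label strings, and the choice to treat a value exactly at a cut-off as belonging to the higher class are assumptions made for illustration and are not taken from the cited work.

```python
# Cut-off values quoted in the text from (Kim & Choi, 2013), in m/s^2.
AGGRESSIVE_MS2 = 1.47
EXTREME_MS2 = 2.28


def classify_acceleration(accel_ms2: float) -> str:
    """Label one longitudinal acceleration sample with a fixed-threshold rule.

    Assumption: a value equal to a cut-off falls in the higher class.
    """
    if accel_ms2 >= EXTREME_MS2:
        return "extremely aggressive"
    if accel_ms2 >= AGGRESSIVE_MS2:
        return "aggressive"
    return "normal"


def label_trace(accels_ms2):
    """Apply the same rule to every sample of an acceleration trace."""
    return [classify_acceleration(a) for a in accels_ms2]


if __name__ == "__main__":
    print(label_trace([0.5, 1.6, 2.5]))
```

Because the same two numbers are applied to every driver and every context, a sketch like this also makes the stated limitation visible: the rule has no way to adapt to an individual driving pattern.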
DAD on the individual level is expected to be more accurate than DAD on the aggregate level. If a swift driver were forced to drive slowly, the driver might become overly relaxed and pay less attention to driving than safety requires. A driver is more skillful, or safer, when using her or his own driving pattern. The aggregate level uses the average of all drivers and may even out individual characteristics. The advantage of the individual level over the aggregate level also lies in the fact that it can be tailored and fine-tuned to fit a particular driver. Safety measures at the individual level can reveal more clearly how critical a safety situation is for a specific driver.
Despite these prominent advantages, DAD on the individual level was not found in the literature. One reason might be that it required massive computation power, which was not available. More probably, driving behavior was considered formidably complex. Driving is a series of activities directed by spontaneous decisions of the human brain, reacting to a series of instantaneous changes in the surrounding circumstances, such as adjacent vehicles, roadway geometry, and weather conditions (Wang, Khattak, Liu, Masghati-Amoli, & Son, 2015). Driving requires four pairs of brain lobes (occipital, temporal, parietal, and frontal) to activate and combine both conscious and unconscious brain activities (Halim & Rehan, 2020). A complete study of DAD would be multidisciplinary, involving not only transportation and computer science but also neurology and cognitive science (Lees, 2010). However, if we jump directly to the consequences, anomalies in any of these factors would all result in an abnormal trajectory. Therefore, examining the vehicle trajectory might be a shortcut before a complete study is launched. Such a DAD would be a shift from highway patrol officers eyeballing vehicle trajectories to catch traffic violations to an in-vehicle computer running computational models to catch driving anomalies.
In essence, a DAD is a model for anomaly detection, also called outlier detection (OD). In data science, outliers are the data points that deviate markedly from the majority. In a machine learning (ML) program, OD is often an initial step of data cleaning; however, OD has also developed ML algorithms of its own. OD is typically unsupervised ML because the data often lack labels, as outliers are usually rare (Boukerche, 2020). This lack of labels poses difficulty in defining statistical and mathematical measures of deviation. It has triggered significant research, and numerous OD algorithms have been developed in various programming languages. The basic categories of unsupervised OD algorithms include Angle-Based Outlier Detection (ABOD) (Kriegel, et al., 2008), Cluster-Based Local Outlier Factor (CBLOF) (Duan, et al., 2009), Histogram-Based Outlier Score (HBOS) (Putrada & Abdurohman, 2021), Isolation Forest (Xu, et al., 2017), and K Nearest Neighbors (KNN) (Larose & Larose, 2014). The various OD algorithms measure deviation in different ways, while datasets differ in dimensions and features and users differ in their interests. Therefore, it is challenging to determine a universally best OD algorithm, and consensus is hard to reach. The selection of an algorithm thus becomes an important part of OD processing. For the users’ convenience, packages have been developed that bundle OD algorithms together. For example, the PyOD package collects more than forty OD algorithms and has been used in many academic and industrial projects, with over 10 million downloads (Zhao, et al., 2019).
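To make the idea of an unsupervised, label-free deviation measure concrete, the following is a minimal standard-library Python sketch of a KNN-style outlier score, where the score of a point is its distance to its k-th nearest neighbor. It is written from scratch for illustration and is not the PyOD implementation or API; the function names, the `k` and `contamination` parameters, and the toy data are all assumptions.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    A larger score means the point lies farther from its neighbors.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(points, k=2, contamination=0.1):
    """Flag roughly the top `contamination` share of scores (at least one point)."""
    scores = knn_outlier_scores(points, k)
    n_out = max(1, round(contamination * len(points)))
    cutoff = sorted(scores, reverse=True)[n_out - 1]
    return [s >= cutoff for s in scores]


if __name__ == "__main__":
    # Hypothetical 2-D samples: four clustered points and one distant point.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    print(flag_outliers(data, k=2, contamination=0.2))
```

No labels are used: the only inputs are the data, a neighborhood size, and an assumed share of outliers, which illustrates why the choice of deviation measure and its parameters matters so much in OD.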
The application domains of AD include, but are not limited to, finance for credit card fraud detection, healthcare for malignant tumor detection (Wilson, Johnston, Macleod, & Barker, 1934), astronomy for spacecraft damage detection, cybersecurity for intrusion detection (Chandola, Banerjee, & Kumar, 2009), and the connected vehicle (CV) environment for signal intrusion detection (Richard, Magee, Bacon-Abdelmoteleb, Brown, & others, 2018).
Our research, Automatic Safety Diagnosis in the CV Environment, was initiated to develop a near crash warning system on the individual level using basic safety messages (BSMs) only. Diverse as the existing OD algorithms are, none was found to fit our need. The closest research used vehicle trajectories to identify abnormal driving behavior at the aggregate level (Liu & Khattak, Delivering improved alerts, warnings, and control assistance using basic safety messages transmitted between connected vehicles, 2016) (Liu, Wang, & Khattak, Generating real-time driving volatility information, 2014), but no application of DAD using BSMs on the individual level was found. We define a near crash as a traffic situation that fulfills two conditions: (a) a conflict is identified, and (b) at least one of the drivers involved in the conflict exhibits abnormal driving status. As shown in the conceptual architecture of our research (Figure 1), the cloud maintains a flag list of all abnormal CVs in its region and keeps broadcasting the list to all CVs. In each CV, the in-vehicle computer (IVC) collects the vehicle’s own BSMs. The DAD uses the historical BSMs to generate the thresholds that differentiate normal from abnormal driving status; then, with the real-time BSMs, the DAD determines whether the status of the ego vehicle is abnormal. If an anomaly event is detected, a flag is sent to the cloud. When a CV is running on the road, it also keeps receiving BSMs from nearby CVs. The CV runs the conflict identification model (CIM) (Wu, Zhang, Whalin, & Tu, 2022) to check for a conflict if either CV of a pair is on the flag list. This paper summarizes our DAD model, and the rest of the paper is organized as follows:
Section 2 introduces the methods of the DAD;
Section 3 presents the model evaluation;
Section 4 gives the results and discussion; and
Section 5 gives the conclusions.
Figure 1.
The Conceptual Architecture of the Near Crash Warning System.
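The decision flow of the architecture described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the introduction does not specify how the thresholds are derived from historical BSMs, so the sketch substitutes a simple mean plus or minus n standard deviations rule on a single hypothetical BSM field (longitudinal acceleration). The function names, the `n_sigma` parameter, and the sample values are all hypothetical and do not represent the DAD model of this paper.

```python
import statistics


def fit_thresholds(historical_values, n_sigma=3.0):
    """Derive per-driver lower and upper bounds from that driver's own history.

    Assumption: mean +/- n_sigma * standard deviation stands in for the
    threshold generation step, which the introduction does not detail.
    """
    mean = statistics.fmean(historical_values)
    std = statistics.pstdev(historical_values)
    return mean - n_sigma * std, mean + n_sigma * std


def is_abnormal(real_time_value, thresholds):
    """Return True when a real-time value falls outside the driver's own bounds."""
    low, high = thresholds
    return real_time_value < low or real_time_value > high


def is_near_crash(conflict_identified, abnormal_flags):
    """Near crash as defined in the text: (a) a conflict is identified, and
    (b) at least one driver involved exhibits abnormal driving status."""
    return conflict_identified and any(abnormal_flags)


if __name__ == "__main__":
    # Hypothetical historical accelerations (m/s^2) of the ego vehicle.
    history = [0.0, 0.2, -0.2, 0.1, -0.1]
    bounds = fit_thresholds(history)
    ego_flag = is_abnormal(2.5, bounds)  # real-time sample from the ego vehicle
    print(is_near_crash(True, [ego_flag, False]))
```

The two conditions are kept in separate functions on purpose, mirroring the split between the in-vehicle DAD, which sets the flag, and the conflict check, which is only combined with the flags at the end.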