Accordingly, a subset of crash data including multiple-vehicle crashes that were not affected by intersections and driveways was prepared. To simplify the estimation and avoid applying Crash Modification Factors (CMF) associated with base conditions, SPFs are constructed for multiple-vehicle non-driveway crashes for divided two-way (2 lanes in each directions) 4-lane urban and suburban arterial segments in the study regions under the base condition (i.e., absence of automated speed enforcement, prohibited on-street parking, no lighting, no roadside fixed object density). According to HSM’s definition, U4D segments have four lanes with continuous cross-sectional areas of two lanes in each direction of travel, where the lanes are physically separated by distance or barrier [
2]. The FDOT GIS database provides a shapefile containing a spatial attribute representing the number of lanes [
31]. HSM recommends the following Negative Binomial regression model for predicting multiple-vehicle non-driveway crashes:
where AADT is the annual average daily traffic, L is the segment length (miles),
, and
are the regression coefficients. HSM develops the regression coefficients for various crash types in terms of the highest level of injury [
2]. Hence, we categorize the crash data based on the KABCO scale and utilize the appropriate regression coefficients recommended by HSM to predict crash frequency.
Table 2 presents the values of the coefficients
, and
reported in HSM to be used in applying Equation (
1) for U4D segments.
Since our data have accurate locations of the observed crashes, the Site-Specific EB Method is applicable. Therefore, Equation (
2) has been also adopted to apply the empirical Bayes method on the basis of the recognition that the safety of a site is best estimated by considering both the number of observed crashes at the site and the number of crashes at sites with similar characteristics, as predicted by the SPF. For District 4, the average crash frequency is predicted based on the crashes that occurred between 2015 and 2017 and validated using the observed crashes in 2018.
where w is a weight factor defined as a function of the SPFs overdispersion parameter (see
Table 2), k, to combine the two estimates:
To test the adequacy of the SPF model, we fit a Negative Binomial (NB) regression model to the data using the functional form and the variables specified in the SPF for each crash severity and assess the adequacy of the model using statistical diagnostic tests. In particular, we employ three model adequacy tests: A test for linearity (adequacy of the functional form) of the NB model, a test for the closeness of the estimated overdispersion parameter to the HSM value and a test for excess zeros in the NB model. For linearity, we employ a chi-square goodness of fit (GOF) test, which asks whether the assumed linear model functional form is adequate. Assuming that the model is valid, the deviance of the NB model is distributed with a chi-square distribution of degree of freedom equal to n-p. For a NB regression [
33], deviance is calculated using Equation (
4) the test rejects the hypothesis that the model functional form is adequate if the p-value of the test, found using Equation (
5) is below some level of significance (typically 0.05).
Therefore, for a given geographical region, if
is close to one, the functional form of the HSM-specified NB model is adequate. The second test checks whether the overdispersion parameter estimated from the data agrees well with the overdispersion values provided by HSM published for a given facility (roadway segment or intersection) and crash type (See
Table 2). The overdispersion parameter is estimated based on Equation (
6):
where
n is the number of years of data used to fit the model and p is the number of explanatory variables in the SPF,
and
are observed crash counts and crash counts predicted by SPF, respectively. For the HSM specified NB model to be adequate in terms of overdispersion parameter we want
to be close to
, the HSM published overdispersion parameters for a given facility (See
Table 2). The third test, determines whether there is a larger number of zeros or small counts in the crash data than what the NB regression model can represent. If the data set contains a large number of zeros, the zero-inflation problem, the predictive capability of the NB regression model can be adversely impacted. The likelihood ratio test compares the fit of a zero inflated NB regression model (which contains a hurdle part) to a NB regression for the given data [
33]. The two parts of the zero inflated model, the hurdle (the zero) model and the count (the nonzero) model, can have different sets of explanatory variables. The likelihood ratio test based on Equation (
7) determines whether a hurdle part, that models only the zeros using a binary logit, is needed in the NB regression model.
is the maximized likelihood of a zero inflated NB regression model (model 1), that includes the hurdle part, and
is the maximized likelihood of the NB regression model (model 2) that does not include the hurdle part. Since the two models are hierarchical (model 1 contains model 2) the likelihood ratio test statistic will follow a
distribution with degrees of freedom equal to the difference in number of parameters
if both models 1 and 2 fit equally well (i.e., the hurdle part does not improve the fit of the model). If the p-value of the likelihood ratio test, defined as Equation (
8) is significant (e.g., smaller than 0.05) then we conclude that there are excess zeros or small counts in the data and a zero inflated NB should instead be used to model the crash data.
Utilizing the statistical diagnostic tests, this paper proposes a new spatial scan method in order to determine the best region size to use in estimating an EB adjusted SPF model. The method considers a sequence of overlapping circular subregions and uses all the historical crash counts observed for the segments within the segment to fit an SPF model. For a circular subregion radius, a subregion
i is constructed by pooling in all crash data for 3 years (2015-2017), the AADT and length data for the
segment and all segments contained within a subregion with a radius of
R miles. A Negative Binomial model (both ordinary and zero-inflated) of the form given in Equation (
1) is fitted to the data and the linearity test p-value
is computed with Equation (
5) and the overdispersion parameter
is computed with Equation (
6). The scanning is repeated for n overlapping circular subregions in the study region and the metrics are computed for
. To combine the metrics obtained from all
n subregions in the entire study region with a single number, the following summary metrics are defined.
combines the deviations of the estimated overdispersion parameters
using subregions of radius of
R miles from the ideal HSM overdispersion parameter
.
combines the deviations goodness of fit p-value
using subregions of radius of R miles from the ideal value of 1, which implies the linear SPF model form is a perfect fit. The analysis is repeated and
is calculated for increasing
R values. The goal of the spatial scan analysis is to identify a good subregion size that satisfies the statistical modeling assumptions and identify the subregions within the study region that cannot be adequately represented using a NB regression model. Therefore, a large value of the combined metric
indicates that the subregion radius
R used in modeling could be revised (to a smaller or larger value) so that so that the modeling assumptions are better satisfied. The empirical results from the study area reveal that the subregion radius
R has small impact on the zero-inflation test results. Therefore, in this method, a summary metric for the zero-inflation test is not included.