Dirichlet Process Log Skew-normal Mixture with Missing at Random Covariate in Insurance Claim Analysis

Abstract
In actuarial practice, modeling the total losses tied to a given policy is a non-trivial task. Traditional parametric models for predicting total losses have limitations due to complex distributional features such as extreme skewness, zero inflation, and multi-modality, and due to the lack of an explicit solution for the log-normal convolution. In the recent literature, the Dirichlet process mixture has been proposed for insurance losses to eliminate the risk of model misspecification bias; however, the effect of covariates, as well as of missing covariates, in this modeling framework is rarely studied. In this article, we propose novel connections among the covariate-dependent Dirichlet process mixture, the log-normal convolution, and missing covariate imputation. Assuming that an individual loss is log-normally distributed, we develop a log skew-normal Dirichlet process to approximate the log-normal sum. As a generative approach, our framework models the joint distribution of the outcome and covariates, which allows us to impute missing covariates under the assumption of missingness at random. The performance is assessed by applying our model to several insurance datasets, and the empirical results demonstrate the benefit of our model compared to existing actuarial models such as the Tweedie-based generalized linear model, the generalized additive model, and the multivariate adaptive regression spline.

1. Introduction

In short-term insurance contracts, predicting aggregate claims accurately is essential for major actuarial decisions such as pricing or reserving. However, it is often not easy to model the aggregate loss due to its complex distributional features such as high skewness, zero inflation, hump shapes, and multi-modality. With the advance of the modern Bayesian paradigm and computing power, the development of the full distribution of aggregate claims has been studied and applied in actuarial practice. In particular, because of its considerable flexibility, the Bayesian nonparametric (BNP) approach has gradually gained recognition as a way to solve distributional conundrums in an insurance context. For instance, Hong and Martin (2018) [1] recently developed a Dirichlet process model as a BNP approach that maximizes the fitting flexibility of the full distribution for insurance losses, which obviates the chance of model misspecification bias. In this article, as an extension of their work, we aim to go beyond the search for maximal fitting flexibility, focusing on the issues that arise from the presence of covariates and an aggregate outcome (total losses). The implication is that the predictive distribution for the expected aggregate claims developed under Hong and Martin's Dirichlet process framework can be biased once covariate effects and the log-normal convolution are incorporated. For example, as covariates add new information that differentiates the data points of the outcome variable, a new structure can be introduced into the data space, and this increases the within-cluster heterogeneity [2]. Moreover, the incorporation of missing covariates may exacerbate the existing heterogeneity. Additionally, when the outcome variable describes aggregate losses rather than individual claim amounts, the log-normal convolution is difficult to compute as it has no closed-form solution. In this regard, our study extends their work by addressing the following research questions:
  • RQ1. If additional unobservable heterogeneity is introduced by the inclusion of covariates, what is the best method, among several conventional approaches, to capture the within-cluster heterogeneity in modeling the total losses?
  • RQ2. If an additional estimation bias results from the use of incomplete covariates under Missing At Random (MAR), what is the best way, among several conventional approaches, to increase the imputation efficiency?
  • RQ3. If individual losses follow log-normal densities, what is the best way, among several conventional approaches, to approximate the sum of log-normal outcome variables?

2. Discussion on Research Questions and Related Work

Let $Y_i$, $i = 1, 2, \ldots, N$ be the independent claim amount (reported by each policyholder for a single policy) random variables, defined on a common probability space $(\Omega, \mathcal{F}, P)$ with a certain loss distribution such as the log-normal. Let $X$ be a vector of covariates, and $N(t)$ be the total claim count denoting the number of individual claims for a single policy up to time $t$ (the policy period). The aggregate claim $S_h(t)$ for a single policy $h$ up to time $t$ can be expressed as a convolution: $S_h(t) = \sum_{i=1}^{N(t)} Y_i = Y_1 + Y_2 + \cdots + Y_{N(t)}$. At the end of the policy period $t$, let $\tilde{S}(t)$ be the total aggregate claim amount over all policies received by an insurer; then $\tilde{S}(t) = \sum_{h=1}^{H} S_h(t) = S_1(t) + S_2(t) + \cdots + S_H(t)$, in which $H$ is the total number of policies on the contracts. Note that both convolutions described so far are built upon the assumption that the summands - $Y_i$, $i = 1, 2, \ldots, N(t)$ and $S_h$, $h = 1, 2, \ldots, H$ - are mutually independent and identically distributed with log-normal densities (to maintain homogeneity of each loss).
However, the involvement of covariates and the lack of a closed-form solution for the log-normal sum bring about several challenges that violate the assumptions needed for an accurate estimation of the total aggregate loss $\tilde{S}(t)$. To begin with, the use of covariates gives rise to additional within-cluster heterogeneity. Kaas et al. (2008) [3] describe a standard aggregate loss modeling principle stating that the expected aggregate claim $E[S_h]$ is obtained as the product of the mean claim count and the mean severity: $E[S_h] = E[N]\,E[Y]$. With the inclusion of covariates $X$, a new unknown structure or heterogeneity is introduced into the data space of $Y_i$, which means that $Y_1|X_1, Y_2|X_2, \ldots, Y_N|X_N$ within a single policy can still be independent but are no longer identically distributed. Therefore, $E[S_h|X] \neq E[N|X]\,E[Y|X]$, and the total aggregate loss $\tilde{S}(t)$ becomes difficult to compute with the conventional collective risk modeling approach. In addition, assuming that the severity $Y_i$ follows a log-normal distribution, the computation of $\tilde{S}(t)$ becomes quite difficult as its convolution $S_h$ is not known to have a closed form [4]. Another challenge is the missing covariates in $S_h|X$. As shown by Ungolo et al. (2020) [5], missing covariates under the missingness at random (MAR) assumption lead to biased parameter estimates, because the uncertainty in the estimates of the parameters describing the outcome $Y$ is heavily affected by the quality of the covariates $X$. Again, in this case, $\tilde{S}(t)$ cannot be computed properly.
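Because no closed form exists, the distribution of a log-normal sum is most easily appreciated by simulation. The following minimal R sketch (our illustration only; the claim count and severity parameters are arbitrary assumptions, not values from the paper) shows that the sum remains right-skewed yet follows no standard parametric family:

```r
## Illustration only: aggregate claims as sums of i.i.d. log-normal
## severities; all parameter values below are arbitrary assumptions.
set.seed(1)
n_claims   <- 5        # hypothetical N(t): claims per policy
n_policies <- 10000    # number of simulated policies
S <- replicate(n_policies, sum(rlnorm(n_claims, meanlog = 7, sdlog = 1.2)))
## The sum is right-skewed, but neither log-normal nor available
## in closed form:
c(mean = mean(S), sd = sd(S), skew = mean(((S - mean(S)) / sd(S))^3))
```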
To address all of these issues, we propose the Dirichlet process log skew-normal mixture to model $S_h|X$. We consider the Dirichlet process framework to cope with the within-cluster heterogeneity, as suggested by Hong and Martin (2018) and Braun et al. (2016) [1,6], while employing the log skew-normal approximation studied by Li (2008) [7] to compute each $S_h|X$, the sum of log-normal random variables $\sum_{i=1}^{N(t)} Y_i | X$. When it comes to the problem of missing covariates, we exploit the generative capability of the Dirichlet process to capture the latent structure of the data, which allows for a rigorous statistical treatment of MAR covariates.

2.1. Can the Dirichlet Process Capture the Heterogeneity and Bias?: RQ1, RQ2

In Figure 1, $Y_i$ refers to an individual claim amount, and each $S_h$ represents a total claim amount defined by a unique policy (cluster $h$) as a homogeneous distribution. Although an insurer can collect the aggregate loss data $S_h$ for each policy cluster given the policy period $t$, individual policyholders (in different risk classes) can raise more than one claim (i.e., random $N_h(t)$) at any time over the fixed horizon $t$, and the corresponding claim amounts (i.e., random $Y_h^*$) are not known in advance. Hence, the unsettled liability information in $Y_h^*$ from certain policyholders always renders $S_h$ incomplete, which is often framed as the challenge of their inherent stochasticity. In addition, the new claims $Y_h^*$ raised from unknown risk classes can trigger inherent heterogeneity across the unique clusters as well. To make matters worse, if covariates $X$ are introduced to better understand the different risk classes, an additional source of heterogeneity enters the scene, which prevents each cluster from being identically distributed.
With respect to this, Hong and Martin (2018) [1] propose the concept of a loss distribution mixture for each cluster based on the Dirichlet process framework. The main idea behind the Dirichlet process mixture (DPM) is to produce a single master distribution to model the stochasticity in $S_h$ with the help of an infinite-dimensional parametric structure and probabilistic simulations of clustering scenarios. Braun et al. (2006) [6] articulate how the DPM automatically captures unobservable heterogeneity, such as the intra-correlation between claim amounts $Y_i$ in the different risk classes, without specifying the number of classes upfront. In short, no matter how complex the distribution of the data is, the DPM is capable of accommodating any distributional properties - multi-modality, skewness, heavy tails, etc. - resulting from unobservable heterogeneity, and therefore dramatically reduces model misspecification biases.
With the inclusion of covariates, the DPM offers a useful bedrock for a MAR treatment. As a generative modeling approach, the DPM models the outcomes $S_h$ and the covariates $X$ jointly to produce cluster memberships, which serve as key knowledge for identifying the latent structure of the data. For example, in medical research, Roy et al. (2018) [8] develop a novel imputation strategy for MAR covariates, using the latent structure unraveled by the DPM together with the other available covariate knowledge. A further survey of imputation methods based on the nonparametric Bayesian framework can be found in Si and Reiter (2013) [9] and the references therein.

2.2. Can the Log Skew-Normal Mixture Approximate the Log-Normal Convolution?: RQ3

The log-normal distribution has been considered a suitable distribution for the claim amount $Y_i$ due to its non-negative support, right-skewed shape, and moderately heavy tail that accommodates some outliers. However, once the individual claim amount $Y_i$ is modeled with a log-normal distribution, the convolution computation for $S_h$ fails because no exact closed form for the log-normal sum is known.
Furman et al. (2020) [10] present several existing methods for log-normal sum approximation that have been studied in the literature, including moment matching approaches such as the Minimax approximation, Least squares approximation, Log shifted gamma approximation, and Log skew-normal approximation. The distance minimization approaches - the Minimax and Least squares approximations - described by Beaulieu and Xie (2003) and Zhao and Ding (2007) [4,11] are conceptually simple, but they require fitting the entire cumulative distribution to the sum of claim amounts, which can be computationally expensive and prone to failure as the number of summands $Y_i$ increases. The Log shifted gamma approximation suggested by Lam and Le-Ngoc (2007) [12] has less strict distributional assumptions, but it is not very accurate in the lower region of the distribution. In our study, special attention is paid to the Log skew-normal approximation method for its simplicity. The skew-normal distribution, an extension of the normal distribution, has a third parameter that naturally captures skewness in addition to the location and spread parameters. Li (2008) [7] points out that one can exploit this third parameter to capture the different skewness levels of each summand. By modeling $\log S_h$ with a skew-normal density, we can approximate $S_h$, the sum of the log-normal $Y_i$. Using the log skew-normal as the underlying distribution for $S_h$ in the DPM framework, one can eliminate the need to compute the cumulative density curve, and its closed-form density and the optimal distribution parameters for $S_h$ can easily be obtained by the moment matching technique. For further details, see Li (2008) [7] and the references contained therein.
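To make the moment matching concrete, the sketch below fits a skew-normal to $\log S_h$ by matching the sample mean, variance, and skewness. This is a generic method-of-moments illustration in R (our construction, not Li's exact algorithm; the function name fit_logsn_mom is ours), and it requires the sample skewness of $\log S$ to stay below the skew-normal maximum of about 0.995:

```r
## Generic method-of-moments sketch: approximate S as log skew-normal
## by matching the first three moments of log(S).
fit_logsn_mom <- function(S) {
  z  <- log(S[S > 0])
  m  <- mean(z); v <- var(z)
  g1 <- mean((z - m)^3) / v^1.5            # sample skewness of log(S)
  ## Solve g1 = ((4 - pi)/2) * t^3 / (1 - t^2)^(3/2), t = sqrt(2/pi)*delta
  r     <- sign(g1) * (2 * abs(g1) / (4 - pi))^(1/3)
  t     <- r / sqrt(1 + r^2)               # needs |g1| < 0.995 (SN limit)
  omega <- sqrt(v / (1 - t^2))             # scale
  xi0   <- m - omega * t                   # location
  slant <- t * sqrt(pi / 2) / sqrt(1 - t^2 * pi / 2)  # skewness parameter
  c(location = xi0, scale = omega, slant = slant)
}
## Example with the simulated aggregate claims S from above:
## fit_logsn_mom(S)
```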

2.3. Our Contribution and Paper Outline

The contribution of this study is as follows. First, using the Bayesian nonparametric framework, we propose solutions to the two major challenges of the aggregate claim $S_h$ computation - 1) heterogeneity in the log-normal random variables $Y_i$, and 2) the lack of a closed form for the sum of log-normal random variables $Y_i$ - in a unified fashion. Second, we introduce covariates $X$ into the aggregate claim modeling framework, taking into account the adverse impact triggered by the covariates $X$; this includes the added heterogeneity across $Y_i$ and the missing information in MAR covariates $X$. To our knowledge, there have been no previous attempts either to estimate a log skew-normal mixture within the DPM framework or to use the DPM to handle MAR covariates in insurance loss modeling.
The rest of the paper is structured as follows. In Section 3, we describe the proposed modeling framework for $S_h$, assuming log-normally distributed $Y_i$ and the inclusion of both continuous and discrete covariates $X$. This section also presents our novel imputation approach for the MAR covariate within the DPM framework. Section 4 clarifies the final forms of the posterior and predictive densities. Section 5 presents our empirical results and validates our approach by fitting the model to two datasets with different sample sizes, drawn from the R package CASdatasets and the Wisconsin Local Government Property Insurance Fund (LGPIF). This is followed by a discussion in Section 6.

3. Model: DP Log Skew-Normal Mixture for $S_h|X$

3.1. Background

Suppose there are multiple unknown risk classes (clusters) underlying the claim information $Y_i$ within each policy; then the individual aggregate claims $S_h$ for policy $h$ have diverse characteristics that cannot be explained by fitting a single log skew-normal distribution. In order to approximate a distribution that captures such diverse characteristics in $S_h$, we investigate diverse clustering scenarios. To this end, as suggested by Hong and Martin (2018) [1], we exploit an infinite mixture of log skew-normal clusters and their complex dependencies by employing a Dirichlet process, which produces a distribution over clustering scenarios (with their clustering parameters):
$$\{\theta_j, \omega_j\} \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0)$$
where $G$ denotes a clustering scenario, and the important components of $G$ are
  • $\theta_j$: the parameters of the outcome variable for cluster $j$;
  • $\omega_j$: the cluster weight parameter for cluster $j$.
$G$, as a single realization of the joint cluster probability vector $\{G(A_1), G(A_2), \ldots\}$ sampled from the DPM model, is defined over independent partitions $A_1, A_2, \ldots$ of the sample space $\bigcup_{k=1}^{\infty} A_k = A$ of the support of $G_0$. By sufficiently many simulations of $G$, the Dirichlet process investigates all possible clustering scenarios rather than relying on a single best guess. The overall production of $G$ is controlled by two parameters - a precision $\alpha$ and a base measure $G_0$. The precision $\alpha$ controls the variance of sampling $G$, in the sense that a larger $\alpha$ generates new clusters more often to account for the unknown risk classes. The base measure $G_0$, as the mean of $\mathrm{DP}(\alpha, G_0)$, is a DP prior over the joint space of all parameters for the outcome model, the covariate model, and the precision $\alpha$, as shown in Ghosal (2010) [13].
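For intuition, the weights of a draw $G \sim \mathrm{DP}(\alpha, G_0)$ can be visualized with the generic truncated stick-breaking construction sketched below in R. This is an illustration only: our own sampler uses the Polya urn representation instead (Section 3.3), and our cluster weights additionally depend on covariates (Section 3.2).

```r
## Truncated stick-breaking illustration of the weights of a DP draw:
## larger alpha spreads mass over more clusters.
rdp_weights <- function(alpha, J_trunc = 50) {
  v <- rbeta(J_trunc, 1, alpha)          # stick-breaking fractions
  v * cumprod(c(1, 1 - v[-J_trunc]))     # omega_j, summing to about 1
}
set.seed(2)
round(head(rdp_weights(alpha = 2), 5), 3)
```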
Note that the original DPM research by Hong and Martin (2018) [1] mainly focuses on random cluster weights $\omega_j$ that are independent of the covariates $X$. In our model, by contrast, the covariate effects are incorporated into the development of the cluster weights $\omega_j$. All calculations for the development of the DPM modeling components in this paper are based on the principles introduced by Ferguson (1973), Antoniak (1974), and Sethuraman (1994) [14,15,16].

3.2. Model Formulation with Discrete and Continuous Clusters

Let the outcome be $\mathbf{S} = \{S_1, S_2, \ldots, S_H\}$, denoting the $H$ different aggregate claims (incurred by the $H$ different policies). We assume that the covariate $x_1$ is binary and $x_2$ is Gaussian; our baseline DPM model can then be expressed as:
$$\begin{aligned} S_h \mid x_1, x_2, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j &\sim \delta(X^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left[1 - \delta(X^T\tilde{\beta}_j)\right] \mathrm{LogSN}\!\left(X^T\beta_j,\, \sigma_j^2,\, \xi_j\right) \\ x_1 \mid \pi_j &\sim \mathrm{Bern}(\pi_j) \\ x_2 \mid \mu_j, \tau_j^2 &\sim \mathrm{N}(\mu_j, \tau_j^2) \\ \{\theta_j, w_j\} \mid G &\sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0) \end{aligned} \tag{1}$$
where $j$ is the risk class index; $\theta_j = \{\beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j\}$ describes the outcome model, while $w_j = \{\pi_j, \mu_j, \tau_j^2\}$ explains the covariate model. $S_h$ is modeled as a mixture of a point mass at 0 and positive values distributed with a log skew-normal density, to address the complication of zero inflation in the loss data. $\delta(X^T\tilde{\beta}_j)$ models the probability of the outcome being zero via a multivariate logistic regression. The Variable Definitions section (Section 7) gives a brief description of all parameters used in this study.
In a Dirichlet process log skew-normal mixture housing the multiple unknown risk classes in $S_h$, it is necessary to differentiate the forms of the mixture components depending on the type of cluster - discrete or continuous. While keeping the inference of the cluster parameters data-dominated, the DPM first develops discrete clusters based on the given claim information, and then extrapolates unobservable clusters of claims by examining the heterogeneity (or hidden risk classes) of each cluster. In this process, the DPM additionally develops new continuous clusters and assesses them with probabilistic decision-making algorithms, rendering the parameter estimation computationally efficient and asymptotically consistent [17].
The discrete mixture components (clusters) in the DPM framework have the standard form that is useful for accounting for observed classes such as policy information for the aggregate loss $S_h$ [18]. In calculating the discrete cluster probabilities, we assume that the non-zero outcome and the covariates are distributed with the following densities:
$$f_{LSN}\!\left(S_h \mid X_h^T\beta_j, \sigma_j^2, \xi_j\right) = \frac{2}{S_h\,\sigma_j}\,\phi\!\left(\frac{\log S_h - X_h^T\beta_j}{\sigma_j}\right)\cdot\Phi\!\left(\xi_j \cdot \frac{\log S_h - X_h^T\beta_j}{\sigma_j}\right)$$
$$f_{Bern}(x_1 \mid \pi_j) = \pi_j^{x_1}\,(1 - \pi_j)^{1 - x_1}$$
$$f_{N}(x_2 \mid \mu_j, \tau_j^2) = \frac{1}{\sqrt{2\pi\tau_j^2}}\,\exp\!\left\{-\frac{1}{2\tau_j^2}\,(x_2 - \mu_j)^2\right\} \tag{2}$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal probability density and cumulative distribution functions appearing in the log skew-normal density. To model the outcome data $S_h|X_h$ for policy $h$, the DPM takes the general mixture form
$$f(S_h \mid X_h, \theta) = \sum_{j=1}^{J} \omega_j \left[\delta(X_h^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left(1 - \delta(X_h^T\tilde{\beta}_j)\right) f_{LSN}(S_h \mid X_h, \theta_j)\right] \tag{3}$$
where $J$ is the total number of mixture components (risk classes); $\theta_j = \{\beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j\}$ and $w_j = \{\pi_j, \mu_j, \tau_j^2\}$ are the outcome and covariate parameters explaining the risk clusters; and the $\omega_j$, functions of the covariates, $\omega_j(X_h \mid w_j)$, are the cluster component weights (mixing coefficients) satisfying $\sum_{j=1}^{J} \omega_j = 1$.
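For reference, the three kernels in Equation (2) are straightforward to evaluate numerically; the sketch below implements them in R (our helper names; mu stands for the linear predictor $X_h^T\beta_j$):

```r
## The three cluster densities of Equation (2) as R functions.
dlogskewnorm <- function(s, mu, sigma, xi) {
  z <- (log(s) - mu) / sigma
  (2 / (s * sigma)) * dnorm(z) * pnorm(xi * z)   # valid for s > 0
}
dbern <- function(x1, p)        p^x1 * (1 - p)^(1 - x1)
dgaus <- function(x2, mu, tau2) dnorm(x2, mean = mu, sd = sqrt(tau2))
```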
However, when the DPM is extended as $J \to \infty$, new continuous clusters are introduced by $G_0$ (with its infinite-dimensional parametric structure) in order to address the additional unknown risk classes. This assesses the within-class heterogeneity in $S_h$ by confronting the current discrete clustering result and investigating the homogeneity more closely. As the new clusters are considered countably infinite, the corresponding forms of the outcome and covariate models used to obtain a continuous cluster are given by
$$f_0(S_h \mid X_h) = \int f(S_h \mid X_h, \theta)\, dG_0(\theta) \tag{4a}$$
$$f_0(x_1) = \int f_{Bern}(x_1 \mid w)\, dG_0(w) \tag{4b}$$
$$f_0(x_2) = \int f_{N}(x_2 \mid w)\, dG_0(w) \tag{4c}$$
These are also known as the “parameter-free outcome model” and “parameter-free covariate models”, respectively, for developing the new continuous cluster mixture. Given a collection of outcome-covariate data pairs $\mathcal{D} = \{S_h, X_h\}_{h=1}^{H}$, the DPM puts together the current discrete clusters and the new continuous clusters to update the mixture form in Equation (3), with the help of Markov chain Monte Carlo (using sufficiently many simulated samples of the major parameters $\theta_j, w_j$). Consequently, the sample $G$ described in Equation (1) becomes $G = f(S_h \mid X_h, \mathcal{D}) = \sum_{j=1}^{\infty} \omega_j \cdot \delta_{z_j}$, where $\delta_{z_j}$ denotes both discrete and continuous cluster densities as point mass distributions at random locations sampled from $G_0$. Aligned with such flexible cluster development, the form of the predictive distribution can be molded based on the knowledge extracted from $G$, as follows:
$$f(S_h \mid X_h, \theta, w, \alpha) = \frac{\omega_{J+1}^*}{\omega_{J+1}^* + \sum_{j=1}^{J}\omega_j^*} \cdot f_0(S_h \mid X_h) + \frac{\sum_{j=1}^{J}\omega_j^* \cdot f(S_h \mid X_h, \theta_j)}{\omega_{J+1}^* + \sum_{j=1}^{J}\omega_j^*} \tag{5}$$
The cluster weights in Equation (5) are obtained by computing the two sub-models below, for the continuous and discrete cluster weights respectively, which reflect the properties of the clusters and the relevant covariates:
$$\omega_{J+1}^* = \frac{\alpha}{\alpha + H} \cdot f_0(x_1, x_2), \qquad \omega_j^* = \frac{n_j}{\alpha + H} \cdot f\!\left(x_1, x_2 \mid w_j = (\pi_j, \mu_j, \tau_j^2)\right) \tag{6}$$
where $\alpha$ is the precision parameter controlling the acceptance chance of new clusters, $n_j$ is the number of observations in cluster $j$, $f_0(X)$ is the parameter-free covariate model in Equations (4b, 4c) supporting the new continuous clusters, and $f(X \mid w_j)$ is the covariate model supporting the current discrete clusters. Note that, instead of the popular stick-breaking scheme used by Hong and Martin (2018) [1], the cluster weights are obtained from the covariate models of $x_1, x_2$ that explain the outcome $S_h$.
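A small R sketch of Equation (6), reusing the density helpers above (our construction; the argument f0_x stands for an evaluator of the parameter-free covariate model $f_0(x_1, x_2)$ from Equations (4b, 4c)):

```r
## Normalized cluster weights for one observation (x1, x2). n_j, pi_j,
## mu_j and tau2_j are vectors over the J existing discrete clusters.
cluster_weights <- function(x1, x2, n_j, pi_j, mu_j, tau2_j, alpha, H, f0_x) {
  w_disc <- (n_j / (alpha + H)) * dbern(x1, pi_j) * dgaus(x2, mu_j, tau2_j)
  w_cont <- (alpha / (alpha + H)) * f0_x(x1, x2)   # new cluster J + 1
  w <- c(w_disc, w_cont)
  w / sum(w)
}
```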
The simulated outcome model $f(S_h \mid X_h, \mathcal{D}) = \sum_{j=1}^{\infty} \omega_j \cdot \delta_{z_j}$ and its predictive model in Equation (5) show that although the DPM framework allows infinite-dimensional modeling, the dimension of the sampling output $G$ is adaptive, as it is a mixture with at most finitely many components determined by the data itself (e.g., its dimension cannot exceed the total sample size $H$). This gives the model flexibility, and through such flexibility $G$ can become a comprehensive mixture distribution for $S_h$, accommodating all distributional properties of the given claims as well as additional unknown claims.

3.3. Modelling $S_h|X_h$ with Complete-Case Covariates

The joint posterior update for the outcome and covariate parameters $\theta_j, w_j$ in Equations (5) and (6) can be made through a Gibbs sampler. Using the conditional distributions of the unobservable variables given the observed data, the Gibbs sampler can obtain draws from the analytically intractable posterior distribution of the parameters [20]. Let $s_h$ denote the cluster index $j = 1, 2, \ldots, J$ assigned to observation $h$. The parameter inference steps to ensure convergence are described below.
Step.1
Initialize the cluster membership and the main parameters:
(a)
First, the cluster memberships $j = 1, \ldots, J$ are initialized by a standard clustering method such as hierarchical clustering or k-means. This step provides an initial clustering of the data $(S_h, X_h)$ as well as the initial number of clusters.
(b)
Next, after all observations have been assigned to a particular cluster $j = 1, 2, \ldots, J$, we update the parameters $\alpha$ and $(\theta_j, w_j)$ for each cluster. This is done using the posterior densities $p(\alpha \mid J)$, $p(\theta \mid S_h, X_h)$, and $p(w \mid X_h)$, in which $(S_h, X_h)$ represents all observations in cluster $j$.
Step.2
Loop through the Gibbs sampler and new continuous cluster selection:
Once the cluster memberships and parameters are initialized, we loop through the Gibbs sampler many times (e.g., $M = 100{,}000$ iterations), with the algorithm alternating between updating the cluster membership for each observation and updating the parameters given the cluster partitioning. Each iteration may give a slightly different selection of new clusters based on the Polya urn scheme [20], but the log-likelihood calculated at the end of each iteration helps keep track of the convergence of the selections. A detailed description of each iteration is given in Algorithm (A2) in Appendix B. The term $p(s_h \mid s_{-h})$ on lines 6 and 9 of Algorithm (A2) is the Chinese Restaurant process [19] posterior value given by
$$p(s_h \mid s_{-h}) = \begin{cases} c \cdot \dfrac{n_{j,-h}}{\alpha + H - 1}, & \text{for record } h \text{ entering an existing cluster: } s_h = j, \\[1ex] c \cdot \dfrac{\alpha}{\alpha + H - 1}, & \text{for record } h \text{ entering a new cluster: } s_h = J + 1. \end{cases} \tag{7}$$
where $c$ is a scaling constant ensuring that the probabilities sum to 1, and $s_{-h}$ is the collection of cluster indices $(s_1, s_2, \ldots, s_{h-1}, s_{h+1}, \ldots, s_H)$ assigned to every observation except observation $h$. As shown in Equation (7), a larger $\alpha$ results in a higher chance of developing a new continuous cluster and adding it to the collection of existing discrete clusters. The forms of the prior and posterior densities used to simulate the main parameters $(\theta_j^*, \alpha^*, w_j^*)$ on lines 16 to 23 of Algorithm (A2) are presented in Appendix A.
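The CRP term itself is only a few lines of R (a sketch with our helper name):

```r
## Chinese Restaurant process prior of Equation (7); s_minus_h is the
## vector of cluster labels with observation h removed.
crp_prior <- function(s_minus_h, alpha) {
  n_j <- table(s_minus_h)              # n_{j,-h} for each existing cluster
  H   <- length(s_minus_h) + 1         # total sample size
  p   <- c(n_j, new = alpha) / (alpha + H - 1)
  p / sum(p)                           # normalization plays the role of c
}
```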
There are a couple of points to note. The Gibbs sampler for the DPM described here is characterized by the use of infinitely many clusters and of covariates. Due to the infinite mixture capacity, the resulting clusters can be kept as homogeneous as possible. In this process, the within-class heterogeneity can be captured through the dependencies between parameters across observations, and the DPM utilizes such dependencies within existing clusters to determine the rationale for developing new clusters. The DPM harnesses the power of the covariates as well: for example, it associates individual policies with the unobserved claims (in new clusters) and the observed claims (in old clusters) by matching on the covariate information. The investigation of the infinite clusters, the covariates, and the continuous cluster selection process in the DPM is briefly illustrated in the diagram in Figure 2. As a result, the unobserved claim problem mentioned in Figure 1 can be addressed by the introduction of new clusters, which leads to a better approximation of $S_h$.

3.4. Modelling $S_h|X_h$ with a MAR Covariate

The DPM model for complete-case data $(S_h, X_h)$ has been discussed in Section 3.3. In this section, we present our novel imputation strategy for a MAR covariate in the DPM framework, in which the missing values are explained by the observed data. We focus on missingness in the binary covariate. In addition, we specify different prior distributions and the corresponding posterior distributions constructed for the Gibbs sampler, taking the MAR covariate into account. With the model definition in Equation (1), suppose the binary covariate $x_1$ has missingness within it. To handle this MAR covariate, we consider the following modifications in the DPM Gibbs sampler:
a)
Imputation: The missing covariate affects the updates of the parameters $\theta$ and $w$. For $w_j$, only the observations $S_h$ without the missing covariate are used in the update; if a cluster has no observations with complete data for that covariate, a draw from the prior distribution is used instead. For $\theta_j$, however, we must first impute values for the missing covariate $x_{1h}$ for all observations $S_h$ within the cluster. Since we have already defined a full joint model, $f(S_h \mid X_h, \theta_j) \cdot f(X_h \mid w_j)$, in Section 3.2, we can obtain draws for the MAR covariate $x_{1h}$ from the imputation model $f_{Bern}(x_{1h} \mid S_h, \theta_j, w_j) \propto f(S_h \mid X_h, \beta_j, \sigma_j^2, \xi_j) \cdot f_{Bern}(x_{1h} \mid \pi_j)$ at each iteration of the Gibbs sampler. The imputation process is briefly illustrated in Figure 3. Once all missing data in all covariates have been imputed, we can sample from the posterior for $\theta$, and the parameters $\beta_j, \sigma_j^2$ of each cluster are re-calculated. After this cycle is complete, the imputed data are discarded, and the same imputation steps are repeated in every iteration.
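A single imputation draw under this model reduces to normalizing the joint density over $x_{1h} \in \{0, 1\}$. The R sketch below (our helper; it reuses dlogskewnorm from Section 3.2 and covers the positive-outcome part only, omitting the zero-mass component for brevity) illustrates the idea:

```r
## One Gibbs-step draw for a missing binary x1, given a positive S_h.
## beta_j = (intercept, coefficient of x1, coefficient of x2).
impute_x1 <- function(S_h, x2_h, beta_j, sigma_j, xi_j, pi_j) {
  mu1 <- beta_j[1] + beta_j[2] + beta_j[3] * x2_h    # linear predictor, x1 = 1
  mu0 <- beta_j[1] + beta_j[3] * x2_h                # linear predictor, x1 = 0
  p1  <- dlogskewnorm(S_h, mu1, sigma_j, xi_j) * pi_j
  p0  <- dlogskewnorm(S_h, mu0, sigma_j, xi_j) * (1 - pi_j)
  rbinom(1, 1, p1 / (p1 + p0))   # draw x1 from its conditional posterior
}
```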
b)
Re-clustering: To determine each cluster probability after the imputations, the algorithm re-defines the two main components of the cluster probability calculation - 1) the covariate model and 2) the outcome model. For the covariate model $f(X_h \mid w_j)$, we set it equal to the density functions of only those covariates with complete data for observation $h$. Assuming that $X_h = \{x_{1h}, x_{2h}\}$ and the covariate $x_1$ is missing for observation $h$, we drop $x_{1h}$ and use only $x_{2h}$ in the covariate model,
$$f(X_h \mid w_j) = f_N(x_{2h} \mid w_{2j}) \tag{8}$$
This is the refined covariate model for cluster $j$ and observation $h$ when the data in $x_1$ are not available. For the outcome model $f(S_h \mid X_h, \theta_j)$, the algorithm takes the imputation model for each cluster and integrates out the covariates with missing data. This reduces the variance introduced by the imputations. In our case, as covariate $x_1$ is missing for observation $h$, this missing covariate can be removed from the $X_h$ term being conditioned on. Therefore, the refined outcome model is
$$f(S_h \mid x_{2h}, \theta_j) = \int f(S_h \mid X_h, \theta_j) \cdot f_{Bern}(x_{1h} \mid w_{1j})\, dx_{1h} \tag{9}$$
A similar process is conducted for each observation with missing data and for each combination of missing covariates. Hence, using Equations (8) and (9), the cluster probabilities and the predictive distribution can be obtained as illustrated in Step III of Figure 4.
c)
Parameter update: The cluster probability computation is followed by the parameter re-estimation for each cluster, illustrated in the diagram in Figure 5. This is the same idea as the update of the parameters $\theta, w$ discussed for Figure 2.

4. Bayesian Inference for $S_h|X_h$ with a MAR Covariate

Efficient simulation of the model parameters - $\theta: \{\beta, \sigma^2, \xi, \tilde{\beta}\}$, $w: \{\pi, \mu, \tau^2\}$, and $\alpha$ - requires proper parameterization of the parameter models: the prior parameter model and the posterior parameter model. Accurate estimation of the cluster probabilities relies on the legitimate development of the data models - the outcome model and the covariate model - and on the simulation results for the model parameters that govern the data model behaviors. This section is centered on the development of the parameter and data models, providing the details of the DPM implementation integrated with the MAR imputation strategy.

4.1. Parameter Models and MAR Covariate

Our study is based on a three-level hierarchical structure: the first level consists of the data models, such as the log skew-normal outcome model and the Bernoulli and Gaussian covariate models; the second level involves the parameter models, such as $p(\theta \mid S_h, X_h)$ and $p(w \mid X_h)$, to explain the data; and the third level is developed from the generalized regression to explain the parameters or the related hyperparameters, such as $a_0, b_0, \nu_0, c_0, d_0, \mu_0, \tau_0^2, e_0, \gamma_0, g_0$, and $h_0$, which set probabilistic distributions on the parameter vectors $\theta = \{\beta, \sigma^2, \xi, \tilde{\beta}\}$ and $w = \{\pi, \mu, \tau^2\}$. See the Variable Definitions section for further information on the variables. Given the model definition in Equation (1), we consider a set of conjugate parameter models for their computational advantages [21]. For $S_h \sim \delta(X_h^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + [1 - \delta(X_h^T\tilde{\beta}_j)]\,\mathrm{LogSN}(X_h^T\beta_j, \sigma_j^2, \xi_j)$, $x_1 \sim \mathrm{Bern}(\pi_j)$, and $x_2 \sim \mathrm{N}(\mu_j, \tau_j^2)$, the prior models are
$$\begin{aligned} p_0(\sigma_j^2 \mid a_0, b_0) &: \mathrm{InvGa}(a_0, b_0), & p_0(\beta_j \mid \beta_0, \Sigma_0) &: \mathrm{MVN}(\beta_0, \sigma_j^2\Sigma_0), & p_0(\xi_j \mid \nu_0) &: \mathrm{T}(\nu_0), \\ p_0(\tilde{\beta}_j \mid \tilde{\beta}_0, \tilde{\Sigma}_0) &: \mathrm{MVN}(\tilde{\beta}_0, \tilde{\Sigma}_0), & p_0(\pi_j \mid c_0, d_0) &: \mathrm{Beta}(c_0, d_0), & p_0(\mu_j \mid \mu_0, \tau_0^2) &: \mathrm{N}(\mu_0, \tau_j^2), \\ p_0(\tau_j^2 \mid e_0, \gamma_0) &: \mathrm{InvGa}(e_0, \gamma_0), & p_0(\alpha \mid g_0, h_0) &: \mathrm{Ga}(g_0, h_0) \end{aligned}$$
and their corresponding kernels chosen in this study are listed in Appendix A.1. Accordingly, the Dirichlet process prior (probability measure) $G_0$ in our case can be defined as $G_0 = \mathrm{MVN}(\beta_0, \Sigma_0) \times \mathrm{InvGa}(a_0, b_0) \times \mathrm{T}(\nu_0) \times \mathrm{MVN}(\tilde{\beta}_0, \tilde{\Sigma}_0) \times \mathrm{Beta}(c_0, d_0) \times \mathrm{N}(\mu_0, \tau_j^2) \times \mathrm{InvGa}(e_0, \gamma_0) \times \mathrm{Ga}(g_0, h_0)$. Fed with the observed data inputs $(S_h, x_{1h}, x_{2h})$, the prior models for each cluster $j$ described above are updated analytically into the following posterior models, except for $\theta_j = \{\beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j\}$:
$$\begin{aligned} p(\pi_j \mid c_0, d_0, \mathbf{S}, x_1) &: \mathrm{Beta}(c_{new}, d_{new}), \\ p(\mu_j \mid \mu_0, \tau_0^2, \mathbf{S}, x_2) &: \mathrm{N}(\mu_{new}, \tau_{new}^2), \\ p(\tau_j^2 \mid e_0, \gamma_0, \mathbf{S}, x_2) &: \mathrm{InvGa}(e_{new}, \gamma_{new}), \\ p(\alpha \mid g_0, h_0, h, J, \eta, \pi_\eta) &: \pi_\eta\, \mathrm{Ga}\!\left(g_0 + J,\; h_0 - \log(\eta)\right) + (1 - \pi_\eta)\, \mathrm{Ga}\!\left(g_0 + J - 1,\; h_0 - \log(\eta)\right) \end{aligned} \tag{10}$$
and their corresponding parameterizations are elaborated in Appendix A.2. Note that the value of the precision parameter $\alpha$ relies on the total number of clusters $J$ and thus does not vary with the cluster membership $j$, and the derivation of its posterior parameterization is not subject to Bayesian conjugacy. Hence, we instead adapt the form of the posterior density for $\alpha$ suggested by Escobar and West (1995) [22]; its derivation is shown in Appendix C.1. As for $\theta_j = \{\beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j\}$, no conjugate priors are available for the log skew-normal likelihood, but posterior samples can be secured by the conventional Metropolis-Hastings step described in Algorithm (A2) in Appendix A.
When $x_1$ has missing data, the parameterizations of the posterior densities for the covariate parameter model $w$ and the precision $\alpha$ listed in Equation (10) are not affected, but any outcome data $S_h$ with missingness should be dropped; therefore, $n_j$ and $x_1$ are defined using only the observations in cluster $j$ that are not missing. An imputation example is provided in Appendix C.2. For the outcome parameter model $\theta_j$, the missing covariate $x_1$ must be imputed before its posterior computation, as shown in Algorithm (A2). Once the parameters are updated with the imputation, the data models can be constructed as described in Equations (8) and (9).

4.2. Data Models and MAR Covariate

The data models are the main components of the cluster probability computations depicted in Figure 2. As with the development of the parameter models, the covariate data model for $X$ ignores the observations with missingness, while the outcome data model for $S_h$ requires the covariates to be completed beforehand. However, the formulation of their densities can be more complex due to the marginalization over the missing covariate. In addition, as discussed in Section 3.2, the data model development is bound by the cluster type: discrete clusters use $f(S_h \mid X_h, \theta_j)$ and $f(X_h \mid w_j)$, while continuous clusters use $f_0(S_h \mid X_h)$ and $f_0(X_h)$.
a)
Covariate model for the discrete cluster: $f(X_h \mid w_j)$
Focusing on the scenario in which $x_1$ is binary, $x_2$ is Gaussian, and the only covariate with missingness is $x_{1h}$, we simply drop $x_{1h}$ to develop the covariate model for the discrete cluster. For instance, when computing the covariate probability term for the $h$th observation in cluster $j$, the covariate model $f(x_{1h}, x_{2h} \mid \pi_j, \mu_j, \tau_j^2)$ simply becomes $f(x_{2h} \mid \mu_j, \tau_j^2)$ due to the missingness of $x_{1h}$. As $x_2$ is assumed to be normally distributed, as defined in Equation (1), its probability term is
$$f(x_{2h} \mid \mu_j, \tau_j^2) = \frac{1}{\sqrt{2\pi\tau_j^2}}\,\exp\!\left\{-\frac{(x_{2h} - \mu_j)^2}{2\tau_j^2}\right\} \tag{11}$$
instead of
$$f(x_{1h}, x_{2h} \mid \pi_j, \mu_j, \tau_j^2) = \pi_j^{x_{1h}}\,(1 - \pi_j)^{1 - x_{1h}} \cdot \frac{1}{\sqrt{2\pi\tau_j^2}}\,\exp\!\left\{-\frac{(x_{2h} - \mu_j)^2}{2\tau_j^2}\right\}$$
b)
Covariate model for the continuous cluster: $f_0(X_h)$
If the binary covariate $x_{1h}$ is missing, by the same logic we drop $x_{1h}$ for the continuous cluster; however, using Equation (4), the covariate model for the continuous cluster integrates out the relevant parameters simulated from the Dirichlet process prior $G_0$, as follows:
$$f_0(x_{2h}) = \int f(x_{2h} \mid \mu, \tau^2)\, dG_0(\mu, \tau^2) = \iint f(x_{2h} \mid \mu, \tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)\, d\mu\, d\tau^2 = \frac{\gamma_0^{e_0}\,\Gamma(e_0 + 1/2)}{2\sqrt{\pi}\,\Gamma(e_0)\left[\gamma_0 + \frac{(x_{2h} - \mu_0)^2}{4}\right]^{e_0 + 1/2}} \tag{12}$$
instead of
$$f_0(x_{1h}, x_{2h}) = \iiint f(x_{1h}, x_{2h} \mid \pi, \mu, \tau^2)\, p(\pi)\, p(\mu \mid \tau^2)\, p(\tau^2)\, d\pi\, d\mu\, d\tau^2 = \frac{B(x_{1h} + c_0,\, 1 - x_{1h} + d_0)}{B(c_0, d_0)} \cdot \frac{\gamma_0^{e_0}\,\Gamma(e_0 + 1/2)}{2\sqrt{\pi}\,\Gamma(e_0)\left[\gamma_0 + \frac{(x_{2h} - \mu_0)^2}{4}\right]^{e_0 + 1/2}}$$
The derivation of the distributions above is provided in Appendix C.3.
c)
Outcome model for the discrete cluster: $f(S_h \mid X_h, \theta_j)$
In developing the outcome model, as in the parameter model case discussed in Section 4.1 and Appendix C.2, it should be ensured that the covariate is complete beforehand. With all missing data in $x_{1h}$ imputed, the outcome model for the discrete cluster is obtained by marginalizing the joint $f(S_h, x_{1h} \mid x_{2h}, \theta_j, \pi_j)$ over the MAR covariate $x_{1h}$, which yields the following log skew-normal mixture:
$$\begin{aligned} f(S_h \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j) &= \sum_{x_{1h}=0}^{1} f(S_h \mid x_{1h}, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j) \cdot f(x_{1h} \mid \pi_j) \\ &= f(S_h, x_{1h} = 1 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j, \pi_j) + f(S_h, x_{1h} = 0 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j, \pi_j) \\ &= \left[\delta(X_h^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left(1 - \delta(X_h^T\tilde{\beta}_j)\right) \frac{2}{\sigma_j S_h}\, \phi\!\left(\frac{\log S_h - (\beta_{j0} + \beta_{j1} + \beta_{j2}x_{2h})}{\sigma_j}\right) \Phi\!\left(\xi_j\,\frac{\log S_h - (\beta_{j0} + \beta_{j1} + \beta_{j2}x_{2h})}{\sigma_j}\right)\right] \pi_j \\ &\quad + \left[\delta(X_h^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left(1 - \delta(X_h^T\tilde{\beta}_j)\right) \frac{2}{\sigma_j S_h}\, \phi\!\left(\frac{\log S_h - (\beta_{j0} + \beta_{j2}x_{2h})}{\sigma_j}\right) \Phi\!\left(\xi_j\,\frac{\log S_h - (\beta_{j0} + \beta_{j2}x_{2h})}{\sigma_j}\right)\right] (1 - \pi_j) \end{aligned} \tag{13}$$
instead of
$$f(S_h \mid x_{1h}, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j) = \delta(X_h^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left(1 - \delta(X_h^T\tilde{\beta}_j)\right) \frac{2}{\sigma_j S_h}\, \phi\!\left(\frac{\log S_h - (\beta_{j0} + \beta_{j1}x_{1h} + \beta_{j2}x_{2h})}{\sigma_j}\right) \Phi\!\left(\xi_j\,\frac{\log S_h - (\beta_{j0} + \beta_{j1}x_{1h} + \beta_{j2}x_{2h})}{\sigma_j}\right)$$
d)
Outcome model for the continuous cluster: $f_0(S_h \mid X_h)$
Once the missing covariate $x_1$ is fully imputed and the outcome model is marginalized over the MAR covariate $x_{1h}$, the outcome model $f_0(S_h \mid x_{2h})$ for the continuous cluster can be computed by integrating out the relevant parameters, using Equation (4):
$$f_0(S_h \mid x_{2h}) = \int f(S_h \mid x_{2h}, \beta, \sigma^2, \xi, \tilde{\beta})\, p(\beta)\, p(\sigma^2)\, p(\xi)\, p(\tilde{\beta})\, d\beta\, d\sigma^2\, d\xi\, d\tilde{\beta} \tag{14}$$
However, this form is too complicated to compute analytically. Instead, we can integrate out the parameters using Monte Carlo integration: for each $h = 1, \ldots, H$, we proceed as follows (see the sketch after this list).
(i)
Sample $\beta, \sigma^2, \xi, \tilde{\beta}$ from the DP prior densities $G_0$ specified previously.
(ii)
Evaluate the integrand $f(S_h \mid x_{2h}, \beta, \sigma^2, \xi, \tilde{\beta})$ at these sampled parameter values.
(iii)
Repeat the above steps many times, recording each output.
(iv)
Divide the sum of all recorded values by the number of Monte Carlo samples; this average is the approximate integral.
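A compact R sketch of this Monte Carlo integration for a positive $S_h$ (our construction; the prior draws standing in for $G_0$ use arbitrary hyperparameter values, not the paper's calibrated choices, and the zero-mass component is omitted for brevity; dlogskewnorm is the helper from Section 3.2):

```r
## Monte Carlo approximation of f0(S_h | x_2h) in Equation (14).
f0_outcome_mc <- function(S_h, x2_h, M = 5000) {
  vals <- replicate(M, {
    sigma2 <- 1 / rgamma(1, shape = 2, rate = 1)   # sigma^2 ~ InvGa(a0, b0)
    beta   <- rnorm(3, 0, sqrt(sigma2))            # beta ~ MVN(beta0, sigma^2 Sigma0)
    xi     <- rt(1, df = 4)                        # xi ~ T(nu0)
    x1     <- rbinom(1, 1, rbeta(1, 1, 1))         # marginalize the missing x1
    mu     <- beta[1] + beta[2] * x1 + beta[3] * x2_h
    dlogskewnorm(S_h, mu, sqrt(sigma2), xi)        # evaluate the integrand
  })
  mean(vals)   # average of the draws approximates the integral
}
```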

4.3. Gibbs Sampler Modification for MAR Covariate

We have examined the parameter models and data models used to update the DPM parameters based on probabilistically imputed values of the MAR covariate. We now set out some modifications of the DPM so that the Gibbs sampler in Algorithm (A2) in Appendix B addresses the MAR covariate $x_1$. The Gibbs sampler alternates between imputing the missing data and drawing the parameters until it reaches the stationary distribution of the parameters. We elaborate below on the modifications that fit into Algorithm (A2) to properly update the clustering scenarios and the posterior cluster parameters.
a)
In line 6, in the presence of the missing covariate $x_{1h}$, the cluster probability for an observation $(S_h, x_{1h}, x_{2h})$ belonging to a discrete cluster $j$ is modified as follows:
$$P(s_h = j) = p(s_h \mid s_{-h}) \cdot f(x_{2h} \mid \mu_j, \tau_j^2) \cdot f(S_h \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j)$$
where $f(x_{2h} \mid \mu_j, \tau_j^2)$ is from Equation (11) and $f(S_h \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j)$ is from Equation (13).
b)
In line 9, in the presence of the missing covariate $x_{1h}$, the cluster probability for an observation $(S_h, x_{1h}, x_{2h})$ belonging to the continuous cluster $J + 1$ is modified as follows:
$$P(s_h = J + 1) = p(s_h \mid s_{-h}) \cdot f_0(x_{2h}) \cdot f_0(S_h \mid x_{2h})$$
where $f_0(x_{2h})$ is from Equation (12) and $f_0(S_h \mid x_{2h})$ is from Equation (14).
c)
In line 22, in the presence of the missing covariate $x_{1h}$, the imputation should be made before simulating the parameter $\theta_j^*$, by drawing
$$x_{1h} \sim f_{Bern}(x_{1h} \mid S_h, \theta_j, w_j) \propto f(S_h \mid X_h, \beta_j, \sigma_j^2, \xi_j) \cdot f_{Bern}(x_{1h} \mid \pi_j)$$
The formulation of this imputation model was discussed in Section 3.4.
Again, these modifications allow us to draw missing covariate values from the conditional posterior density at each iteration, using a random-walk Metropolis-Hastings step.

5. Empirical Study

5.1. Data

The performance of our DPM framework is assessed based on two insurance datasets. They highlight data difficulties such as unobservable heterogeneity in an outcome variable and MAR covariates. For simplicity, in each dataset, we only consider two covariates - one binary and one continuous - to explain its loss information (outcome variable). In this study, all computations on these two datasets are performed in the same data format:
[Table: a common layout pairing each policy's outcome with one binary covariate $x_1$ and one continuous covariate $x_2$.]
The first dataset is PnCdemand, which concerns the international property and liability insurance demand of 22 countries over the 7 years from 1987 to 1993. Second, we use a dataset drawn from the Wisconsin Local Government Property Insurance Fund (LGPIF), with information about the insurance coverage of government building units in Wisconsin for the years 2006 to 2010. The first dataset, PnCdemand, can be obtained from the R package CASdatasets. It is relatively small, with $H = 240$ cases, and its outcome variable GenLiab is the individual loss amount under general insurance policies for each case. As covariates, we consider one indicator variable for the statutory law system (LegalSyst: 1 or 0) and one continuous variable measuring the risk aversion rate (RiskAversion) for each area. For additional background on this dataset, see Browne et al. (2000) [23]. The LGPIF dataset provides insurance coverage samples for government properties from $H = 5660$ policies. The outcome variable is the sum of all types of losses (Total Losses) for each policy. Only the covariates LnCoverage and Fire5 are considered in our study: Fire5 is a binary covariate indicating the fire-protection level, while LnCoverage is a continuous covariate giving the total coverage amount on a logarithmic scale. For further details, see Quan et al. [24].
Histograms of the losses in the two datasets are exhibited in Figure 6. Due to the significant skewness, the loss data are log-transformed to bring them closer to Gaussianity. As shown in the histograms, each distribution displays different characteristics in terms of skewness, modality, excess zeros, etc. Note that the zero-inflated outcome variable in the LGPIF data (b1, b2 in Figure 6) requires a two-part modeling technique that distinguishes the probabilities of the outcome being zero and positive.

5.2. Three Competitor Models and Evaluation

Our DPM framework is compared to other actuarial models commonly used in practice. We employ three predictive models as benchmarks - namely, the generalized linear model (GLM), multivariate adaptive regression spline (MARS), and generalized additive model (GAM). For each dataset, we assume a different distribution for the outcome variable, and thus the three benchmark models are built upon different outcome data models. For example, the PnCdemand dataset (a1, a2 in Figure 6) has a high frequency of small losses without zero values, so it is safe to use a gamma mixture to explain the outcome data. For the LGPIF data (b1, b2 in Figure 6), we consider an outcome data model based on a Tweedie distribution to accommodate the zero-inflated loss data. The benchmark models are implemented in R with the mgcv, splines, and mice packages.
All four models are trained, and investigations are performed in terms of model fit, prediction accuracy, and the conditional tail expectation (CTE) of the predictive distribution. Note that the goodness-of-fit value for a DPM is not available in Tables 1 and 2. Teh (2010) [25] argues that a goodness-of-fit evaluation for a DPM is unnecessary, as underfitting is mitigated by the unbounded complexity of a DPM while overfitting is alleviated by the approximation of posterior densities over each parameter. Gelman et al. (2007) [26] point out that the posterior predictive check, which compares data simulated under the fitted DPM to the observed data, can be useful for studying model adequacy, but it cannot be used for model comparison. Therefore, goodness of fit is only compared between the rival models. For the evaluation of prediction performance, the sum of squared prediction errors (SSPE) and the sum of absolute prediction errors (SAPE) are used.
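For concreteness, the evaluation criteria can be written as small R functions (a sketch based on our reading of the acronyms; the paper gives no formal definitions):

```r
## Prediction-error and tail criteria used in Tables 1-2.
sspe <- function(y, yhat) sum((y - yhat)^2)    # sum of squared prediction errors
sape <- function(y, yhat) sum(abs(y - yhat))   # sum of absolute prediction errors
## Empirical CTE at level p: mean loss beyond the p-th quantile.
cte  <- function(y, p = 0.95) mean(y[y > quantile(y, p)])
```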

5.3. Result 01. International General Insurance Liability Data

For this dataset, a training set of response-covariate pairs $(Y, X)$ with $n = 160$ records and a test set with $m = 80$ records are constructed. We implement the following DPM:
$$\begin{aligned} Y_h \mid x_1, x_2, \beta_j, \sigma_j^2 &\sim \mathrm{LogN}\!\left(X^T\beta_j,\, \sigma_j^2\right) \\ x_1 \mid \pi_j &\sim \mathrm{Bern}(\pi_j) \\ x_2 \mid \mu_j, \tau_j^2 &\sim \mathrm{N}(\mu_j, \tau_j^2) \\ \{\theta_j, w_j\} \mid G &\sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0) \end{aligned}$$
A log-normal likelihood is chosen to accommodate the individual loss $Y_h$ (GenLiab) for a policy $h$. The covariate $x_2$ (RiskAversion) is subject to missingness, and the missingness is found to depend on $Y_h$ (a MAR case). This is addressed by the internalized imputation process discussed in Figure 3. The posterior parameters of $\theta_j, w_j$ are estimated with our DPM Gibbs sampler presented in Algorithm (A2). The algorithm runs 10,000 iterations until convergence, and the resulting clustering mixture scenarios are shown in Figure 7. The plot overlays the predictive densities on the log scale from the last 100 iterations, after convergence.
Figure 8 shows the classical data imputation process - Multivariate Imputation by Chained Equations (MICE) - and the predictive densities produced by our rival models, GLM, GAM, and MARS. MICE runs multiple imputation chains and selects the imputation values from the final iteration; this process results in multiple candidate datasets. The trace plots (a1, a2) monitor the imputation mean and variance for the missing values in the dataset. In the covariate distribution plot (a3), the density of the observed covariate, shown in blue, is compared with those of the imputed covariate for each imputed dataset, shown in red. The parameter inferences for the rival models are performed on the imputed datasets tied to convergence [27]. The gamma distribution is chosen to fit the rival models, as $Y_h$ is continuous and positively skewed with a constant coefficient of variation. The gamma-based predictive density plots (b1, b2, b3) estimated with the GLM, GAM, and MARS look similar, showing unusual bumps near the right tail.
In Figure 9, a histogram of the outcome data in the test set is displayed, with the posterior mean densities for the out-of-sample predictions produced by our DPM and the rival models' density estimates overlaid on it. Judging from the plot, our DPM model generates the best approximation. While the rival models generate smooth, mounded curves for their predictions, our DPM captures all possible peaks and bumps, which is closer to the actual situation. According to Table 1, the rival models produce slightly higher SAPEs but lower SSPEs compared to our proposed DPM. As SAPE weights all individual differences equally, we can surmise that the rival models tend to focus too heavily on the most probable data points and miss some outliers, mainly due to the insufficient sample size. Our DPM, by contrast, performs well under small sample sizes when sufficient prior knowledge is available. From the perspective of CTE, Table 1 shows that our DPM proposes a heavier tail than the rival models, reflecting that it captures more uncertainty given the small sample size.

5.4. Result 02. LGPIF Data

For this dataset, a training set of response-covariate pairs $(S, X)$ with $n = 4529$ records and a test set with $m = 1110$ records are constructed. We implement the following DPM:
$$\begin{aligned} S_h \mid x_1, x_2, \beta_j, \sigma_j^2, \xi_j, \tilde{\beta}_j &\sim \delta(X^T\tilde{\beta}_j)\,\mathbb{1}(S_h = 0) + \left[1 - \delta(X^T\tilde{\beta}_j)\right] \mathrm{LogSN}\!\left(X^T\beta_j,\, \sigma_j^2,\, \xi_j\right) \\ x_1 \mid \pi_j &\sim \mathrm{Bern}(\pi_j) \\ x_2 \mid \mu_j, \tau_j^2 &\sim \mathrm{N}(\mu_j, \tau_j^2) \\ \{\theta_j, w_j\} \mid G &\sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0) \end{aligned}$$
As the outcome $S_h$ (Total Losses) for a policy $h$ in this dataset is considered to be a sum of log-normally distributed losses, a log skew-normal likelihood is chosen to approximate this convolution [7]. The covariate $x_1$ (Fire5) is subject to missingness under MAR, and the internalized imputation process illustrated in Figure 3 resolves this issue without creating multiple imputed datasets. As the outcome $S_h$ exhibits zero inflation, we employ a two-part model using a sigmoid and an indicator function. Our DPM Gibbs sampler described in Algorithm (A2) produces the posterior parameters of $\theta_j, w_j$ with 10,000 iterations until convergence. Figure 10 shows the resulting clustering mixture scenarios: the plot contains 100 predictive densities suggested by our DPM, each reflecting the converged estimation results.
The MICE output and the resulting predictive densities from the rival models are displayed in Figure 11. The rival models are built upon a Tweedie distribution because of its ability to account for a large number of zero losses and its flexibility in capturing the unique loss patterns of different classes of policyholders. According to the plot, all three rival models reasonably capture the zero inflation, but the GAM tends to suggest more bumps, indicating a need for further assessment of the prediction uncertainty.
The overall out-of-sample prediction comparison is made in Figure 12, a histogram overlaid with the predictive density curves generated by the four models. From the plot, it is apparent that the posterior predictive density proposed by our DPM best explains the new samples, while the rival models keep producing multiple peaks. The prediction performance of our DPM is confirmed by the smallest SSPE and SAPE in Table 2. In terms of CTE, all three rival models suggest a similar level of tailedness, reflecting the knowledge obtained from the observed data. Our DPM, however, goes beyond this and proposes a much heavier tail. This is because our DPM accommodates the presence of outliers and shapes the tail behavior based on the combined knowledge of the prior parameters and the available observations.

6. Discussion

This paper proposes a novel DPM framework for actuarial practice to model total losses with the incorporation of MAR covariates. Both the log-normal and log skew-normal DPMs present good overall empirical performance in capturing the shape of the distribution, in out-of-sample prediction, and in estimating tail behavior. This suggests that our DPM framework is worth considering in order to avoid various model risks and biases in insurance claim analysis.

6.1. Research Questions

Regarding RQ1, we propose a DPM framework to address the within-cluster heterogeneity emerging from the inclusion of covariates. By allowing for an infinite number of clustering scenarios determined by the observations as well as by prior knowledge, our DPM outperforms the rival methods in drawing the lines of cluster membership. This can be assessed by examining the homogeneity of the resulting clusters. In our case, we fit cluster-wise GLMs (based on the Gamma and Tweedie distributions) to the data points within each resulting cluster to compare goodness of fit, and the consistent AICs across all clusters endorse the benefits of the DPM. Rival methods such as the GAM or MARS can likewise capture heterogeneity by using customized smooth functions across different subsets of the data, but we observe some statistically insignificant smooth terms, indicating remaining heterogeneity within the clusters.
In terms of RQ2, we suggest incorporating the imputation steps into the parameter and cluster membership update process of the DPM Gibbs sampler, by leveraging the joint distribution of the observed outcomes and the missing covariates. This approach keeps the imputed values consistent with the observed data, preserving the correlation structure within the dataset. To compare our approach with an existing alternative, we additionally employ a chained equation technique. The multiple sets of imputed values simulated from both approaches are investigated, and the results show that our DPM Gibbs sampler does not represent a significant improvement over the chained equations, as the average estimates of the imputed values from the two approaches are close to each other. However, we believe this result is mainly due to the relatively low dimensionality and simple structure of the datasets we use; specific characteristics or dependencies in the data and more complex missingness patterns could give different results in practice.
As for RQ3, we fit a log skew-normal density to the aggregate loss outcomes. To assess its performance, one can consider the Minimax approximation, Least squares approximation, Log shifted gamma approximation, etc. as competitors. Li (2008) [7] provides a useful comparison between these competitors by overlaying the cumulative density curves for each technique, but those experiments are grounded on simulated log-normal data with pre-defined parameters and assumptions, which cannot be easily controlled in real-world scenarios. Therefore, we believe the choice of the best approximation technique should be based on identifying the specific characteristics of the dataset. In our case, the summands in our datasets differ significantly from each other in magnitude (so the Minimax is inappropriate), and the LGPIF data contain a large volume of values smaller than 5 (so the Log shifted gamma is inappropriate); we therefore choose the log skew-normal density, which is relatively simple while giving an accurate approximation in the lower region of the distribution.

6.2. Future Work

There are several concerns with our log skew-normal DPM framework.
(a)
Dimensionality: First, in our analysis we only use two covariates (one binary and one continuous) for simplicity, so more complex data should be considered. As the number of covariates grows, the number of likelihood components (covariate models) describing the covariates grows, which results in shrinking cluster weights. Using more covariates might therefore enhance the sensitivity and accuracy of the creation of cluster memberships, but it can also introduce more noise or hidden structure that renders the resulting predictive distributions unstable. In this sense, further research on the problem of high-dimensional covariates in the DPM framework would be worthwhile.
(b)
Measurement error: Second, although our focus in this article is the MAR covariate, mismeasured covariates are an equally significant challenge impairing proper model development in insurance practice. For example, Aggarwal et al. (2016) [28] point out that “model risk” mainly arises from missingness and measurement error in variables, leading to flawed risk assessments and decision-making. Thus, further investigation is necessary to explore a specialized construction of the DPM Gibbs sampler for mismeasured covariates, aiming to prevent the issue of model risk.
(c)
Sum of log skew-normal: Third, as an extension of the approximation of the total losses $S_h$ (the sum of individual losses) for a policy, we recommend research into ways of approximating the sum of total losses $\tilde{S}$ across entire policies. In other words, we pose the question of how to approximate the sum of log skew-normal random variables. From the perspective of an executive or an entrepreneur concerned with the total cash flow of the firm, nothing may be more important than an accurate estimate of the sum of total losses for identifying insolvency risk or making important business decisions.
(d)
Scalability: Lastly, we suggest investigating the scalability of the posterior simulation by our DPM Gibbs sampler. As shown in our empirical study on the PnCdemand dataset, our DPM framework produces reliable estimates with relatively small sample sizes ($n \le 160$). This is because our framework actively utilizes substantial prior knowledge in the posterior inference rather than relying heavily on the features of the data. In the results from the LGPIF dataset, our DPM also exhibits stable performance at sample size $n = 4529$. However, sample sizes above 10,000 are not explored in this paper. With increasing amounts of data, our DPM framework raises questions of computational efficiency due to the growing demand for computational resources or degradation in performance [29]. This is an important consideration, especially in scenarios where the insurance loss information is expected to grow over time.

7. Variable Definitions

    The following variables and functions are used in this manuscript:
$i = 1, \dots, N_h$: individual claim index $i$ in policy $h$.
$h = 1, \dots, H$: policy index $h$, with sample (policy) size $H$.
$j = 1, \dots, J$: cluster index for $J$ clusters.
$s_h$: cluster index $j \in \{1, \dots, J\}$ assigned to observation $h$.
$n_j$: number of observations in cluster $j$.
$n_{j,-h}$: number of observations in cluster $j$ with observation $h$ removed.
$Y_{ih}$: individual loss $i$ in policy observation $h$.
$S_h$: outcome variable, the sum $\sum_i Y_{ih}$ in policy observation $h$.
$\tilde{S}$: outcome variable, the sum $\sum_h S_h$ across entire policies.
$X_h$: vector of covariates (including $x_1$, $x_2$) for observation $h$.
$\mathbf{x}_1$: covariate vector (Fire5).
$\mathbf{x}_2$: covariate vector (Ln(coverage)).
$x_1$: individual value of the covariate Fire5.
$x_2$: individual value of the covariate Ln(coverage).
$p_0(\cdot)$: parameter model (prior).
$p(\cdot)$: parameter model (posterior).
$f_0(\cdot)$: data model (for a continuous cluster).
$f(\cdot)$: data model (for a discrete cluster).
$\delta(\cdot)$: logistic sigmoid function, $\mathrm{expit}(\cdot)$, allowing a positive probability of a zero outcome.
$\theta_j$: set of parameters ($\beta$, $\sigma^2$, $\xi$) associated with $f(\Sigma Y \mid X)$ for cluster $j$.
$w_j$: set of parameters ($\pi$, $\mu$, $\tau$) associated with $f(X)$ for cluster $j$.
$\omega_j$: cluster weight (mixing coefficient) for cluster $j$.
$\beta_0, \Sigma_0$: vector of initial regression coefficients and variance-covariance matrix, $\Sigma_0 = \hat\sigma^2 (X^{\mathsf{T}}X)^{-1}$ with $\hat\sigma^2 = (\Sigma Y - \widehat{\Sigma Y})^{\mathsf{T}}(\Sigma Y - \widehat{\Sigma Y})/(n-p)$, obtained from the baseline multivariate Gamma regression on $\Sigma Y > 0$.
$\beta_j$: regression coefficient vector for the mean outcome estimation.
$\sigma_j^2$: cluster-wise variance of the outcome.
$\xi_j$: skewness parameter of the log skew-normal outcome.
$\tilde\beta_0, \tilde\Sigma_0$: vector of initial regression coefficients and variance-covariance matrix obtained from the baseline multivariate logistic regression on $\mathbf{1}(\Sigma Y = 0)$.
$\tilde\beta_j$: regression coefficient vector for the logistic function handling zero outcomes.
$\pi_j$: proportion parameter of the Bernoulli covariate.
$\mu_j, \tau_j$: location and spread parameters of the Gaussian covariate.
$\alpha$: precision parameter controlling the variance of the clustering simulation; a larger $\alpha$ encourages the creation of more clusters.
$G_0$: prior joint distribution over all DPM parameters ($\beta$, $\sigma^2$, $\xi$, $\pi$, $\mu$, $\tau$, and $\alpha$). It supports all continuous, integrable distributions while retaining theoretical properties and computational tractability (asymptotic consistency, efficient posterior estimation, etc.).
$a_0, b_0$: hyperparameters of the Inverse Gamma density of $\sigma_j^2$.
$c_0, d_0$: hyperparameters of the Beta density of $\pi_j$.
$\nu_0$: hyperparameter of the Student's t density of $\xi_j$.
$\mu_0, \tau_0^2$: hyperparameters of the Gaussian density of $\mu_j$.
$e_0, \gamma_0$: hyperparameters of the Inverse Gamma density of $\tau_j^2$.
$g_0, h_0$: hyperparameters of the Gamma density of $\alpha$.
$\eta$: auxiliary probability variable in the Gamma-mixture posterior of $\alpha$.
$\pi_\eta$: mixing coefficient of the Gamma-mixture posterior of $\alpha$.

Author Contributions

All authors contributed substantially to this work: Conceptualization, Kim.M.; Methodology, Kim.M., Lindberg.D., Bezbradica.M. and Crane.M.; Software, Kim.M., Lindberg.D.; Validation, Kim.M., Bezbradica.M. and Crane.M.; Formal analysis, Kim.M.; Investigation, Kim.M., Bezbradica.M. and Crane.M.; Resources, Kim.M.; Data curation, Kim.M.; Writing—original draft preparation, Kim.M.; Writing—review and editing, Kim.M., Bezbradica.M. and Crane.M.; Visualization, Kim.M.; Supervision, Bezbradica.M. and Crane.M.; Project administration, Bezbradica.M. and Crane.M.; Funding acquisition, Crane.M. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Data and implementation details are available at https://github.com/mainkoon81/Paper2-Nonparametric-Bayesian-Approach01 (accessed on 25 May 2023).

Acknowledgments

For this research, the authors wish to acknowledge the support from the Science Foundation Ireland under Grant Agreement No.13/RC/2106 P2 at the ADAPT SFI Research Centre at DCU. ADAPT, the SFI Research Centre for AI-Driven Digital Content Technology, is funded by the Science Foundation Ireland through the SFI Research Centres Programme.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A Parameter Knowledge

Appendix A.1. Prior Kernel for Distributions of Outcome, Covariates, and Precision

$$
\begin{aligned}
p_0(\beta_j \mid \beta_0, \Sigma_0) &: \mathrm{MVN}(\beta_0, \sigma_j^2\Sigma_0)^{*} \propto \exp\!\Big\{-\tfrac{1}{2\sigma_j^2}(\beta_j-\beta_0)^{\mathsf{T}}\Sigma_0^{-1}(\beta_j-\beta_0)\Big\}, \\
p_0(\sigma_j^2 \mid a_0, b_0) &: \mathrm{InvGa}(a_0, b_0) \propto (\sigma_j^2)^{-(a_0+1)}\,e^{-b_0/\sigma_j^2}, \\
p_0(\xi_j \mid \nu_0) &: \mathrm{T}(\nu_0) \propto \Big(1 + \tfrac{\xi_j^2}{\nu_0}\Big)^{-(\nu_0+1)/2}, \\
p_0(\tilde\beta_j \mid \tilde\beta_0, \tilde\Sigma_0) &: \mathrm{MVN}(\tilde\beta_0, \tilde\Sigma_0)^{*} \propto \exp\!\Big\{-\tfrac{1}{2}(\tilde\beta_j-\tilde\beta_0)^{\mathsf{T}}\tilde\Sigma_0^{-1}(\tilde\beta_j-\tilde\beta_0)\Big\}, \\
p_0(\pi_j \mid c_0, d_0) &: \mathrm{Beta}(c_0, d_0) \propto \pi_j^{c_0-1}(1-\pi_j)^{d_0-1}, \\
p_0(\mu_j \mid \mu_0, \tau_0^2) &: \mathrm{N}(\mu_0, \tau_0^2) \propto e^{-\frac{1}{2}(\mu_j-\mu_0)^2/\tau_0^2}, \\
p_0(\tau_j^2 \mid e_0, \gamma_0) &: \mathrm{InvGa}(e_0, \gamma_0) \propto (\tau_j^2)^{-(e_0+1)}\,e^{-\gamma_0/\tau_j^2}, \\
p_0(\alpha \mid g_0, h_0) &: \mathrm{Ga}(g_0, h_0) \propto \alpha^{g_0-1}\,e^{-h_0\alpha}.
\end{aligned}
$$
* $\beta_0, \Sigma_0$ are obtained from the baseline Gamma regression; $\tilde\beta_0, \tilde\Sigma_0$ from the baseline logistic regression.
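For concreteness, the following minimal R sketch (not part of our released implementation; the hyperparameter values and variable names are illustrative assumptions) draws one set of cluster-$j$ parameters from the prior kernels above, which is exactly the initialization required by Algorithm A1 in the next subsection.

```r
## Minimal R sketch: one draw of the cluster-j parameters from the priors.
## All hyperparameter values here are illustrative, not the paper's choices.
library(MASS)                       # for mvrnorm()

p      <- 3                         # intercept + two covariates (x1, x2)
beta0  <- rep(0, p); Sigma0 <- diag(p)   # stand-ins for the Gamma-regression baseline
a0 <- 2; b0 <- 1                    # InvGa hyperparameters for sigma_j^2
nu0 <- 4                            # Student-t df for the skewness xi_j
c0 <- 1; d0 <- 1                    # Beta hyperparameters for pi_j
mu0 <- 0; tau0_2 <- 1               # Gaussian hyperparameters for mu_j
e0 <- 2; gamma0 <- 1                # InvGa hyperparameters for tau_j^2
g0 <- 1; h0 <- 1                    # Gamma hyperparameters for alpha

sigma2_j <- 1 / rgamma(1, shape = a0, rate = b0)   # InvGa(a0, b0) draw
beta_j   <- mvrnorm(1, beta0, sigma2_j * Sigma0)   # MVN(beta0, sigma_j^2 * Sigma0)
xi_j     <- rt(1, df = nu0)                        # T(nu0)
beta_t_j <- mvrnorm(1, beta0, Sigma0)              # MVN for the logistic part
pi_j     <- rbeta(1, c0, d0)                       # Beta(c0, d0)
tau2_j   <- 1 / rgamma(1, shape = e0, rate = gamma0)
mu_j     <- rnorm(1, mu0, sqrt(tau0_2))            # N(mu0, tau0^2)
alpha    <- rgamma(1, shape = g0, rate = h0)       # Ga(g0, h0)
```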

Appendix A.2. Posterior Inference for Outcome, Covariates, and Precision

Algorithm A1: Posterior inference for $\theta_j^* = \{\beta_j^*, \sigma_j^{2*}, \xi_j^*, \tilde\beta_j^*\}$
Require: initialize $\theta_j^{(old)}$: $\beta_j \sim \mathrm{MVN}(\beta_0, \sigma_j^2\Sigma_0)$, $\sigma_j^2 \sim \mathrm{InvGa}(a_0, b_0)$, $\xi_j \sim \mathrm{T}(\nu_0)$, $\tilde\beta_j \sim \mathrm{MVN}(\tilde\beta_0, \tilde\Sigma_0)$
1: repeat
2:   for $j = 1, \dots, J$ do ▹ Assume $J$ cluster memberships.
3:     Sample $\theta^{(new)}$ from the proposal densities $q$: ▹ Choose the priors as $q$.
         $\beta_j^{(new)} \sim q_\beta$, $\sigma_j^{2(new)} \sim q_{\sigma^2}$, $\xi_j^{(new)} \sim q_\xi$, $\tilde\beta_j^{(new)} \sim q_{\tilde\beta}$
4:     for $\theta_j^{(new)} = \{\beta_j^{(new)}, \sigma_j^{2(new)}, \xi_j^{(new)}, \tilde\beta_j^{(new)}\}$ do
5:       Compute the transition ratio, using the outcome model defined in footnote 1:
           $\mathrm{Ratio}_\theta = \dfrac{\prod_{h=1}^{H} f(S_h \mid X, \theta_j^{(new)}) \cdot p_0(\theta_j^{(new)}) \cdot q_\theta(\theta_j^{(old)})}{\prod_{h=1}^{H} f(S_h \mid X, \theta_j^{(old)}) \cdot p_0(\theta_j^{(old)}) \cdot q_\theta(\theta_j^{(new)})}$
         Sample $U \sim \mathrm{Unif}(0,1)$
6:       if $U < \mathrm{Ratio}_\theta$ then $\theta_j^* = \theta_j^{(new)}$, otherwise $\theta_j^* = \theta_j^{(old)}$
7:       end if
8:     end for
9:     Record $\theta_j^*$
10:   end for
11: until $M$ posterior samples $(\theta_{j=1,\dots,J}^*)$ obtained ▹ $M$ is a sufficient sample size
1 The outcome density is defined as $f(S_h \mid X, \theta_j) = \delta(X_h^{\mathsf{T}}\tilde\beta_j)\,\mathbf{1}(S_h = 0) + \big[1 - \delta(X_h^{\mathsf{T}}\tilde\beta_j)\big]\, f_{LSN}(S_h \mid X_h, \theta_j)$.
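To make footnote 1 concrete, a minimal R sketch of this zero-inflated log skew-normal outcome density is given below; the function and argument names are ours, and the parameter values in the usage line are purely illustrative.

```r
## Minimal R sketch (our reading of footnote 1, not the released code):
## zero-inflated log skew-normal outcome density for a single observation.
expit <- function(z) 1 / (1 + exp(-z))       # delta(.) in the text

f_outcome <- function(S_h, X_h, beta_j, sigma_j, xi_j, beta_t_j) {
  p0 <- expit(sum(X_h * beta_t_j))           # P(S_h = 0 | X_h)
  if (S_h == 0) return(p0)
  z <- (log(S_h) - sum(X_h * beta_j)) / sigma_j
  f_lsn <- (2 / (sigma_j * S_h)) * dnorm(z) * pnorm(xi_j * z)
  (1 - p0) * f_lsn                           # continuous part of the mixture
}

## Illustrative usage: density of a 1000-unit loss with X_h = (1, x1, x2)
f_outcome(S_h = 1000, X_h = c(1, 1, 7.2),
          beta_j = c(5, 0.3, 0.2), sigma_j = 1.1,
          xi_j = 2, beta_t_j = c(-1, 0.5, -0.1))
```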
The remaining (conjugate) posteriors for the covariate parameters and the precision $\alpha$ are:
$$
\begin{aligned}
p(\pi_j \mid c_0, d_0, S, \mathbf{x}_1) &: \mathrm{Beta}(c_{new}, d_{new}), & c_{new} &= c_0 + \textstyle\sum_{h=1}^{n_j} x_{1h}, \quad d_{new} = d_0 + n_j - \textstyle\sum_{h=1}^{n_j} x_{1h}, \\
p(\mu_j \mid \mu_0, \tau_0^2, S, \mathbf{x}_2) &: \mathrm{N}(\mu_{new}, \tau^2_{new}), & \mu_{new} &= \frac{n_j \bar{x}_2 + \mu_0}{n_j + 1}, \quad \tau^2_{new} = \frac{\tau_j^2}{n_j + 1}, \\
p(\tau_j^2 \mid e_0, \gamma_0, S, \mathbf{x}_2) &: \mathrm{InvGa}(e_{new}, \gamma_{new}), & e_{new} &= e_0 + \frac{n_j}{2}, \quad \gamma_{new} = \gamma_0 + \frac{1}{2}\Big\{\frac{n_j}{n_j+1}(\bar{x}_2 - \mu_0)^2 + \sum_{h=1}^{n_j}(x_{2h} - \bar{x}_2)^2\Big\}, \\
p(\alpha \mid g_0, h_0, H, J, \eta, \pi_\eta) &: \pi_\eta\,\mathrm{Ga}(g_0 + J,\, h_0 - \log\eta) + (1 - \pi_\eta)\,\mathrm{Ga}(g_0 + J - 1,\, h_0 - \log\eta), \\
& \quad \eta \mid \alpha, H \sim \mathrm{Beta}(\alpha + 1, H), \quad \pi_\eta = \frac{g_0 + J - 1}{g_0 + J - 1 + H(h_0 - \log\eta)},
\end{aligned}
$$
where $H$ denotes the sample (policy) size.
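The conjugate updates above translate directly into code. The following minimal R sketch (our own function name, illustrative hyperparameter defaults) draws $\pi_j$, $\mu_j$, and $\tau_j^2$ given the members of cluster $j$.

```r
## Minimal R sketch of the conjugate covariate-parameter updates above,
## for the members of cluster j (x1_j: binary values, x2_j: continuous).
update_w_j <- function(x1_j, x2_j, tau2_j,
                       c0 = 1, d0 = 1, mu0 = 0, e0 = 2, gamma0 = 1) {
  n_j  <- length(x1_j)
  xbar <- mean(x2_j)
  pi_j <- rbeta(1, c0 + sum(x1_j), d0 + n_j - sum(x1_j))        # Beta(c_new, d_new)
  mu_j <- rnorm(1, (n_j * xbar + mu0) / (n_j + 1),              # N(mu_new, tau2_new)
                sqrt(tau2_j / (n_j + 1)))
  gam_new <- gamma0 + 0.5 * (n_j / (n_j + 1) * (xbar - mu0)^2 +
                             sum((x2_j - xbar)^2))
  tau2_j  <- 1 / rgamma(1, shape = e0 + n_j / 2, rate = gam_new) # InvGa(e_new, gamma_new)
  list(pi_j = pi_j, mu_j = mu_j, tau2_j = tau2_j)
}
```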

Appendix B Baseline Inference Algorithm for DPM

Once we obtain decent parameter samples from the posterior distributions, the posterior predictive density can be computed via the DPM Gibbs sampling. The basic inference algorithm is described below; the modification details for the missing data imputation are provided in Section 4.3. In every iteration, the algorithm updates the cluster memberships based on the parameter samples and observed data at hand, which leads to the re-calculation of the cluster parameters. In the sampler, the state is the collection of membership indices $(s_1, \dots, s_H)$ and parameters $\{\alpha^*, (\theta_1^*, \dots, \theta_J^*), (w_1^*, \dots, w_J^*)\}$, where $\theta_j^*$ refers to the parameters associated with cluster $j$.
Algorithm A2: DPM Gibbs sampling for new cluster development
Require: starting state $(s_1, \dots, s_H)$, $\alpha$, $(\theta_1, \dots, \theta_J)$, $(w_1, \dots, w_J)$
1: repeat
2:   for $h = 1, \dots, H$ do
3:     (1) Update cluster memberships: ▹ Take $s_h$ and compute the cluster probabilities using the joint model.
4:     if $s_h = j$ then
5:       for $j = 1, \dots, J$ do
6:         $P(s_h = j) \propto p(s_h \mid s_{-h})\, f(x_{1h}, x_{2h} \mid w_j) \cdot f(S_h \mid x_{1h}, x_{2h}, \theta_j)$ ▹ for observation $h$ entering an existing discrete cluster.
7:       end for
8:     else if $s_h = J+1$ then
9:       $P(s_h = J+1) \propto p(s_h \mid s_{-h})\, f_0(x_{1h}, x_{2h}) \cdot f_0(S_h \mid x_{1h}, x_{2h})$ ▹ for observation $h$ entering a new continuous cluster.
10:     end if
11:     Draw a cluster index from a multinomial over $\{1, 2, \dots, J+1\}$ ▹ with probabilities $P(s_h = 1), \dots, P(s_h = J+1)$: Polya urn.
12:     if the drawn index $= J+1$ then
13:       Record $(\theta_1, \dots, \theta_{J+1})$, $(w_1, \dots, w_{J+1})$
14:     end if
15:     (2) Update parameters: $(\theta_j, \alpha, w_j)$ for each cluster based on the posterior densities.
16:     for $j = 1, \dots, J+1$ do
17:       Sample $w_j^*$ from the posterior $p(w \mid X_h)$.
18:     end for
19:     Sample $\alpha^*$ from the posterior $p(\alpha \mid J+1)$.
20:     for $j = 1, \dots, J+1$ do
21:       Sample $\theta_j^*$ from the posterior $p(\theta \mid S_j, X_h)$.
22:     end for
23:     Record $(\theta_1^*, \dots, \theta_{J+1}^*)$, $(w_1^*, \dots, w_{J+1}^*)$
24:   end for
25:   Record $\alpha^*$
26:   (3) Compute the log-likelihood: $\sum_{h=1}^{H} \log\big[f(X_h \mid w_{s_h}^*)\, f(S_h \mid X_h, \theta_{s_h}^*)\big]$ ▹ this quantity should eventually stabilize after a large number of iterations.
27: until $M$ posterior samples $(\theta_j^*, \alpha^*, w_j^*)$ obtained ▹ $M$ is a sufficient sample size
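As an illustration of Step (1), the following minimal R sketch computes the membership probabilities for a single observation $h$ and draws the Polya urn multinomial. It assumes the standard Polya urn weights $n_{j,-h}/(\alpha + H - 1)$ and $\alpha/(\alpha + H - 1)$ for $p(s_h \mid s_{-h})$, takes the cluster-specific and parameter-free densities as user-supplied functions, and assumes cluster labels are the contiguous integers $1, \dots, J$; all names are illustrative, not the released code.

```r
## Minimal R sketch of the membership update for one observation h.
## f_x, f_S: cluster-specific covariate and outcome densities;
## f0_x, f0_S: parameter-free densities for a new (continuous) cluster.
update_membership <- function(h, s, alpha, theta, w, S, x1, x2,
                              f_x, f_S, f0_x, f0_S) {
  H <- length(s)
  s_minus <- s[-h]
  J <- length(unique(s_minus))                # occupied clusters without h
  probs <- numeric(J + 1)
  for (j in seq_len(J)) {                     # existing (discrete) clusters
    n_jh <- sum(s_minus == j)                 # n_{j,-h}
    probs[j] <- n_jh / (alpha + H - 1) *
      f_x(x1[h], x2[h], w[[j]]) * f_S(S[h], x1[h], x2[h], theta[[j]])
  }
  probs[J + 1] <- alpha / (alpha + H - 1) *   # proposed new cluster
    f0_x(x1[h], x2[h]) * f0_S(S[h], x1[h], x2[h])
  sample.int(J + 1, 1, prob = probs)          # Polya-urn multinomial draw
}
```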

Appendix C Development of the Distributional Components for DPM

Appendix C.1. Derivation of the Distribution of Precision α

In Section 4.1, the parameter model (posterior) of the precision term α is defined as
$$p(\alpha \mid J) \propto p_0(\alpha) \cdot \alpha^{J-1}(\alpha + n)\,\mathrm{Beta}(\alpha+1, n), \qquad p(\alpha \mid J, \eta, g_0, h_0) \propto \pi_\eta\,\mathrm{Ga}(g_0+J,\, h_0 - \log\eta) + (1-\pi_\eta)\,\mathrm{Ga}(g_0+J-1,\, h_0 - \log\eta)$$
To derive this, we first derive the distribution of the number of clusters given the precision parameter, $p(J \mid \alpha)$. Consider a trivial example in which we want to determine the number of clusters that $n = 5$ observations fall into. One possible arrangement is that observations 1, 2, and 5 form new clusters, while observations 3 and 4 join existing clusters (note that the order matters):
  • observation 1 forms a new cluster, probability $= \alpha/\alpha$
  • observation 2 forms a new cluster, probability $= \alpha/(\alpha+1)$
  • observation 3 enters an existing cluster, probability $= 2/(\alpha+2)$
  • observation 4 enters an existing cluster, probability $= 3/(\alpha+3)$
  • observation 5 forms a new cluster, probability $= \alpha/(\alpha+4)$
In this example we have $J = 3$ clusters, and the probability of this arrangement is
$$\frac{\alpha}{\alpha} \cdot \frac{\alpha}{\alpha+1} \cdot \frac{2}{\alpha+2} \cdot \frac{3}{\alpha+3} \cdot \frac{\alpha}{\alpha+4} \;\propto\; \frac{\alpha^3}{\alpha(\alpha+1)(\alpha+2)(\alpha+3)(\alpha+4)} = \frac{\alpha^3\,\Gamma(\alpha)}{\Gamma(\alpha+5)}$$
Hence the probability of observing J clusters amongst a sample size of n is given by
$$p(J \mid \alpha) \propto \frac{\alpha^J\,\Gamma(\alpha)}{\Gamma(\alpha + n)}$$
This is also considered the likelihood function.
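This formula is easy to sanity-check by simulation. The short R sketch below (ours, purely illustrative) simulates the Chinese restaurant seating scheme for $n = 5$ and compares the empirical frequency of $J = 3$ clusters with the analytic value; the combinatorial factor 35 is the unsigned Stirling number of the first kind $|s(5,3)|$, which counts the seating orders that produce three clusters.

```r
## Quick R check of p(J | alpha): simulate the Chinese restaurant process
## for n = 5, alpha = 2, and compare the empirical frequency of J = 3 with
## |s(5,3)| * alpha^3 * Gamma(alpha) / Gamma(alpha + 5), where |s(5,3)| = 35.
set.seed(1)
crp_J <- function(n, alpha) {
  sizes <- c()                                  # current cluster sizes
  for (i in seq_len(n)) {
    p <- c(sizes, alpha) / (alpha + i - 1)      # join each table vs. open a new one
    k <- sample.int(length(p), 1, prob = p)
    if (k > length(sizes)) sizes <- c(sizes, 1) else sizes[k] <- sizes[k] + 1
  }
  length(sizes)                                 # number of clusters J
}
alpha <- 2; n <- 5
sim <- replicate(2e4, crp_J(n, alpha))
mean(sim == 3)                                  # empirical P(J = 3)
35 * alpha^3 * gamma(alpha) / gamma(alpha + n)  # analytic value, ~0.389
```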
The posterior on $\alpha$ is proportional to the likelihood times the prior, $p_0(\alpha)$:
$$p(\alpha \mid J) \propto p(J \mid \alpha)\, p_0(\alpha) \propto \alpha^J\,\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}\, p_0(\alpha)$$
The beta function $\mathrm{Beta}(x, y)$ is defined as
$$\mathrm{Beta}(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}$$
We can evaluate the beta function at $\alpha + 1$ and $n$ as follows:
$$\mathrm{Beta}(\alpha+1, n) = \frac{\Gamma(\alpha+1)\,\Gamma(n)}{\Gamma(\alpha+1+n)} = \frac{\alpha\,\Gamma(\alpha)\,\Gamma(n)}{(\alpha+n)\,\Gamma(\alpha+n)} \quad\Longrightarrow\quad \frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} \propto \mathrm{Beta}(\alpha+1, n)\,\frac{\alpha+n}{\alpha}$$
Thus the posterior simplifies to
$$p(\alpha \mid J) \propto \alpha^J \cdot \mathrm{Beta}(\alpha+1, n) \cdot \frac{\alpha+n}{\alpha} \cdot p_0(\alpha) \propto p_0(\alpha)\,\alpha^{J-1}(\alpha+n)\,\mathrm{Beta}(\alpha+1, n)$$
Now, under the $\mathrm{Ga}(g_0, h_0)$ prior for $\alpha$, we substitute $p_0(\alpha)$ with the $\mathrm{Ga}(g_0, h_0)$ kernel and introduce the auxiliary variable $\eta \mid \alpha, n \sim \mathrm{Beta}(\alpha+1, n)$; conditioning on $\eta$ then absorbs the Beta factor, leaving
$$p(\alpha \mid J, \eta, g_0, h_0) \propto \alpha^{g_0+J-2}\,(\alpha + n)\,e^{-\alpha(h_0 - \log\eta)} \propto \pi_\eta\,\mathrm{Ga}(g_0+J,\, h_0 - \log\eta) + (1-\pi_\eta)\,\mathrm{Ga}(g_0+J-1,\, h_0 - \log\eta)$$
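In code, this mixture representation yields the familiar auxiliary-variable update of Escobar and West [22]. A minimal R sketch follows, writing $n$ for the sample size ($H$ in our notation); the function name and default hyperparameters are illustrative.

```r
## Minimal R sketch of the auxiliary-variable update for alpha implied by
## the Gamma mixture above (after Escobar and West [22]).
update_alpha <- function(alpha, J, n, g0 = 1, h0 = 1) {
  eta    <- rbeta(1, alpha + 1, n)                   # eta | alpha, n
  odds   <- (g0 + J - 1) / (n * (h0 - log(eta)))     # pi_eta / (1 - pi_eta)
  pi_eta <- odds / (1 + odds)
  shape  <- if (runif(1) < pi_eta) g0 + J else g0 + J - 1
  rgamma(1, shape = shape, rate = h0 - log(eta))     # new alpha draw
}
```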

Appendix C.2. Outcome Data Model of S h Development with MAR Covariate x 1 for the Discrete Clusters

Prior to the outcome parameter estimation, the missing covariates must be imputed so that the covariate model is complete. In this study, when the binary covariate $x_{1h}$ is the only covariate with missingness, we impute it via the steps below and then update $\beta, \sigma^2, \xi, \tilde\beta$ using the posterior sampling detailed in Algorithm A1 in Appendix A.2. The imputation model for $x_{1h}$ is approximated by the joint:
$$f(x_{1h} \mid S_h, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) \propto f(S_h, x_{1h} \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j)$$
where
$$
\begin{aligned}
f(S_h, x_{1h} \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) &= f(S_h \mid x_{1h}, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j) \cdot f_{Bern}(x_{1h} \mid \pi_j) \\
&= \bigg[\delta(X_h^{\mathsf{T}}\tilde\beta_j)\,\mathbf{1}(S_h = 0) + \big(1 - \delta(X_h^{\mathsf{T}}\tilde\beta_j)\big)\,\frac{2}{\sigma_j S_h}\,\phi\Big(\frac{\log S_h - X_h^{\mathsf{T}}\beta_j}{\sigma_j}\Big)\,\Phi\Big(\xi_j\,\frac{\log S_h - X_h^{\mathsf{T}}\beta_j}{\sigma_j}\Big)\bigg] \cdot \pi_j^{x_{1h}}(1-\pi_j)^{1-x_{1h}}
\end{aligned}
$$
which serves as the joint density that we can use to sample the imputation values. For example,
$$
\begin{aligned}
f_{Bern}(x_{1h}=1 \mid S_h, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) &\propto f(S_h, x_{1h}=1 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) \\
&= \bigg[\delta(\tilde\beta_{j0} + \tilde\beta_{j1} + \tilde\beta_{j2}x_{2h})\,\mathbf{1}(S_h=0) + \big(1 - \delta(\tilde\beta_{j0} + \tilde\beta_{j1} + \tilde\beta_{j2}x_{2h})\big)\,\frac{2}{\sigma_j S_h}\,\phi\Big(\frac{\log S_h - (\beta_{j0}+\beta_{j1}+\beta_{j2}x_{2h})}{\sigma_j}\Big)\,\Phi\Big(\xi_j\,\frac{\log S_h - (\beta_{j0}+\beta_{j1}+\beta_{j2}x_{2h})}{\sigma_j}\Big)\bigg]\,\pi_j \\
f_{Bern}(x_{1h}=0 \mid S_h, x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) &\propto f(S_h, x_{1h}=0 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) \\
&= \bigg[\delta(\tilde\beta_{j0} + \tilde\beta_{j2}x_{2h})\,\mathbf{1}(S_h=0) + \big(1 - \delta(\tilde\beta_{j0} + \tilde\beta_{j2}x_{2h})\big)\,\frac{2}{\sigma_j S_h}\,\phi\Big(\frac{\log S_h - (\beta_{j0}+\beta_{j2}x_{2h})}{\sigma_j}\Big)\,\Phi\Big(\xi_j\,\frac{\log S_h - (\beta_{j0}+\beta_{j2}x_{2h})}{\sigma_j}\Big)\bigg]\,(1-\pi_j)
\end{aligned}
$$
Then, we can impute x 1 h with the values sampled from B e r n ( π x 1 * ) where
$$\pi^*_{x_1} = \frac{f(S_h, x_{1h}=1 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j)}{f(S_h, x_{1h}=1 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) + f(S_h, x_{1h}=0 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j)}$$
Note that, in R, this computation can fail when the numerator and denominator become too small (numerical underflow). We suggest the following trick:
$$p_1 = f(S_h, x_{1h}=1 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j), \qquad p_0 = f(S_h, x_{1h}=0 \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j),$$
$$\pi^*_{x_1} = \frac{e^{\log p_1}}{e^{\log p_1} + e^{\log p_0}} \cdot \frac{e^{-\log p_1}}{e^{-\log p_1}} = \frac{1}{1 + e^{\log p_0 - \log p_1}}$$
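A minimal R sketch of this trick is given below; working entirely on the log scale keeps $\pi^*_{x_1}$ well defined even when $p_1$ and $p_0$ underflow to zero in double precision. The function name is ours.

```r
## Minimal R sketch of the numerical trick above: compute pi* on the log
## scale so that underflow in p1, p0 does not produce 0/0.
impute_x1 <- function(log_p1, log_p0) {
  pi_star <- 1 / (1 + exp(log_p0 - log_p1))   # = p1 / (p1 + p0)
  rbinom(1, size = 1, prob = pi_star)         # draw the imputed x_{1h}
}

## Illustration with densities far below .Machine$double.xmin:
impute_x1(log_p1 = -1200, log_p0 = -1203)     # pi* ~ 0.953 despite underflow
```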
Finally, the outcome model required to compute the parameters $\theta = \{\beta_j, \sigma_j^2, \xi_j, \tilde\beta_j\}$ in the Metropolis-Hastings step of Algorithm A1 is obtained by summing the joint density of $S_h$ and $x_{1h}$ over the MAR covariate $x_{1h}$ (i.e., marginalizing it out), as shown in Equation (9):
$$f(S_h \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) = \sum_{x_{1h}=0}^{1} f(S_h, x_{1h} \mid x_{2h}, \beta_j, \sigma_j^2, \xi_j, \tilde\beta_j, \pi_j) = f(S_h, x_{1h}=1 \mid x_{2h}, \cdot) + f(S_h, x_{1h}=0 \mid x_{2h}, \cdot)$$

Appendix C.3. Covariate Data Model of x 2 Development with MAR Covariate x 1 for the Continuous Clusters

The parameter-free distributions $f_0(y \mid x)$ and $f_0(x)$, the data models for continuous clusters, are needed to calculate the cluster-membership probabilities and for the post-processing calculations for prediction in the DPM. However, when MAR covariates are present, specifying the distributions from which the parameters are integrated out becomes more complex. Recall the integrals we are attempting to find:
$$f_0(x_i) = \int f(x_i \mid w)\, dG_0(w) = \int f(x_i \mid w)\, p(w)\, dw$$
If the binary covariate $x_1$ is missing, then we need to replace the distribution $f(x \mid w)$ with the continuous (Gaussian) distribution of $x_2$, namely $f(x_2 \mid \mu_j, \tau_j^2)$. The derivations of the parameter-free distributions $f_0(x_1)$ and $f_0(x_2)$ for the continuous cluster are as follows.
$$
\begin{aligned}
f_0(x_1) &= \int f(x_1 \mid \pi)\, p(\pi)\, d\pi = \int \pi^{x_1}(1-\pi)^{1-x_1}\,\frac{1}{\mathrm{Beta}(c_0, d_0)}\,\pi^{c_0-1}(1-\pi)^{d_0-1}\, d\pi \\
&= \frac{1}{\mathrm{Beta}(c_0, d_0)} \int \pi^{x_1+c_0-1}(1-\pi)^{1-x_1+d_0-1}\, d\pi \\
&= \frac{\mathrm{Beta}(x_1+c_0,\, 1-x_1+d_0)}{\mathrm{Beta}(c_0, d_0)} \underbrace{\int \frac{\pi^{x_1+c_0-1}(1-\pi)^{1-x_1+d_0-1}}{\mathrm{Beta}(x_1+c_0,\, 1-x_1+d_0)}\, d\pi}_{=\,1,\ \text{beta distribution}}
\end{aligned}
$$
$$
\begin{aligned}
f_0(x_2) &= \iint f(x_2 \mid \mu, \tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)\, d\mu\, d\tau^2 \\
&= \iint \frac{1}{\sqrt{2\pi\tau^2}}\exp\Big\{-\frac{(x_2-\mu)^2}{2\tau^2}\Big\} \times \frac{1}{\sqrt{2\pi\tau^2}}\exp\Big\{-\frac{(\mu-\mu_0)^2}{2\tau^2}\Big\} \times \frac{\gamma_0^{e_0}}{\Gamma(e_0)}(\tau^2)^{-e_0-1}\,e^{-\gamma_0/\tau^2}\, d\mu\, d\tau^2 \\
&= \frac{\gamma_0^{e_0}}{2\pi\,\Gamma(e_0)} \iint (\tau^2)^{-e_0-2}\exp\Big\{-\frac{(x_2-\mu)^2}{2\tau^2} - \frac{(\mu-\mu_0)^2}{2\tau^2} - \frac{\gamma_0}{\tau^2}\Big\}\, d\mu\, d\tau^2
\end{aligned}
$$
The first step is to integrate with respect to $\mu$; to that end, we simplify the exponent.
$$
\begin{aligned}
-\frac{(x_2-\mu)^2}{2\tau^2} - \frac{(\mu-\mu_0)^2}{2\tau^2} - \frac{\gamma_0}{\tau^2} &= -\frac{1}{2\tau^2}\big(x_2^2 - 2x_2\mu + \mu^2 + \mu^2 - 2\mu_0\mu + \mu_0^2\big) - \frac{\gamma_0}{\tau^2} \\
&= -\frac{1}{2\tau^2}\big(2\mu^2 - 2(x_2+\mu_0)\mu\big) - \frac{x_2^2 + \mu_0^2}{2\tau^2} - \frac{\gamma_0}{\tau^2} \\
&= -\frac{2}{2\tau^2}\Big(\mu^2 - (x_2+\mu_0)\mu + \frac{(x_2+\mu_0)^2}{4}\Big) + \frac{1}{\tau^2}\,\frac{(x_2+\mu_0)^2}{4} - \frac{x_2^2+\mu_0^2}{2\tau^2} - \frac{\gamma_0}{\tau^2} \\
&= -\frac{1}{2(\tau^2/2)}\Big(\mu - \frac{x_2+\mu_0}{2}\Big)^2 + \frac{(x_2+\mu_0)^2}{4\tau^2} - \frac{x_2^2+\mu_0^2}{2\tau^2} - \frac{\gamma_0}{\tau^2}
\end{aligned}
$$
The integrand thus has the kernel of a normal distribution for $\mu$ with mean $\frac{x_2+\mu_0}{2}$ and variance $\frac{\tau^2}{2}$.
$$
\begin{aligned}
f_0(x_2) &= \frac{\gamma_0^{e_0}}{2\pi\,\Gamma(e_0)} \int \underbrace{\sqrt{2\pi(\tau^2/2)}}_{\text{term from the } \mu \text{ integral}} (\tau^2)^{-e_0-2}\, \exp\Big\{\frac{(x_2+\mu_0)^2}{4\tau^2} - \frac{x_2^2+\mu_0^2}{2\tau^2} - \frac{\gamma_0}{\tau^2}\Big\}\, d\tau^2 \\
&= \frac{\gamma_0^{e_0}}{2\sqrt{\pi}\,\Gamma(e_0)} \int (\tau^2)^{-e_0-3/2}\, \exp\Big\{-\frac{1}{\tau^2}\Big(-\frac{x_2^2 + 2x_2\mu_0 + \mu_0^2}{4} + \frac{x_2^2+\mu_0^2}{2} + \gamma_0\Big)\Big\}\, d\tau^2 \\
&= \frac{\gamma_0^{e_0}}{2\sqrt{\pi}\,\Gamma(e_0)} \int (\tau^2)^{-(e_0+1/2)-1}\, \exp\Big\{-\frac{1}{\tau^2}\Big(\frac{(x_2-\mu_0)^2}{4} + \gamma_0\Big)\Big\}\, d\tau^2
\end{aligned}
$$
The integrand is the kernel of an inverse gamma distribution with shape parameter $e_0 + \frac{1}{2}$ and scale parameter $\frac{(x_2-\mu_0)^2}{4} + \gamma_0$.
$$f_0(x_2) = \frac{\gamma_0^{e_0}}{2\sqrt{\pi}\,\Gamma(e_0)} \times \frac{\Gamma(e_0 + 1/2)}{\Big(\frac{(x_2-\mu_0)^2}{4} + \gamma_0\Big)^{e_0+1/2}}$$
As shown above, a closed-form expression can be determined here, but this is not always the case: the integrals can become intractable. In such situations, we may instead resort to Monte Carlo integration.
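A minimal R sketch of this Monte Carlo alternative is shown below, together with the closed form derived above for comparison; the hyperparameter values and function names are illustrative.

```r
## Minimal R sketch: approximate f0(x2) = E_{G0}[ f(x2 | mu, tau2) ] by
## averaging the Gaussian density over prior draws of (mu, tau2), and
## compare with the closed form derived above.
f0_x2_mc <- function(x2, mu0 = 0, e0 = 2, gamma0 = 1, M = 1e5) {
  tau2 <- 1 / rgamma(M, shape = e0, rate = gamma0)   # tau2 ~ InvGa(e0, gamma0)
  mu   <- rnorm(M, mu0, sqrt(tau2))                  # mu | tau2 ~ N(mu0, tau2)
  mean(dnorm(x2, mean = mu, sd = sqrt(tau2)))        # (1/M) sum f(x2 | mu, tau2)
}

f0_x2_exact <- function(x2, mu0 = 0, e0 = 2, gamma0 = 1) {
  gamma0^e0 / (2 * sqrt(pi) * gamma(e0)) * gamma(e0 + 1/2) /
    ((x2 - mu0)^2 / 4 + gamma0)^(e0 + 1/2)
}

set.seed(1)
f0_x2_mc(1.5); f0_x2_exact(1.5)   # the two values should agree closely
```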

References

  1. Hong, L.; Martin, R. Dirichlet process mixture models for insurance loss data. Scandinavian Actuarial Journal 2018, 2018, 545–554.
  2. Neuhaus, J.M.; McCulloch, C.E. Separating between- and within-cluster covariate effects by using conditional and partitioning methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006, 68, 859–872.
  3. Kaas, R.; Goovaerts, M.; Dhaene, J.; Denuit, M. Modern Actuarial Risk Theory: Using R; Vol. 128, Springer Science & Business Media, 2008.
  4. Beaulieu, N.C.; Xie, Q. Minimax approximation to lognormal sum distributions. In Proceedings of the 57th IEEE Semiannual Vehicular Technology Conference (VTC 2003-Spring); IEEE, 2003; Vol. 2, pp. 1061–1065.
  5. Ungolo, F.; Kleinow, T.; Macdonald, A.S. A hierarchical model for the joint mortality analysis of pension scheme data with missing covariates. Insurance: Mathematics and Economics 2020, 91, 68–84.
  6. Braun, M.; Fader, P.S.; Bradlow, E.T.; Kunreuther, H. Modeling the "pseudodeductible" in insurance claims decisions. Management Science 2006, 52, 1258–1272.
  7. Li, X. A Novel Accurate Approximation Method of Lognormal Sum Random Variables. PhD thesis, Wright State University, 2008.
  8. Roy, J.; Lum, K.J.; Zeldow, B.; Dworkin, J.D.; Re III, V.L.; Daniels, M.J. Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics 2018, 74, 1193–1202.
  9. Si, Y.; Reiter, J.P. Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics 2013, 38, 499–521.
  10. Furman, E.; Hackmann, D.; Kuznetsov, A. On log-normal convolutions: An analytical–numerical method with applications to economic capital determination. Insurance: Mathematics and Economics 2020, 90, 120–134.
  11. Zhao, L.; Ding, J. Least squares approximations to lognormal sum distributions. IEEE Transactions on Vehicular Technology 2007, 56, 991–997.
  12. Lam, C.L.J.; Le-Ngoc, T. Log-shifted gamma approximation to lognormal sum distributions. IEEE Transactions on Vehicular Technology 2007, 56, 2121–2129.
  13. Ghosal, S. The Dirichlet process, related priors and posterior asymptotics. In Bayesian Nonparametrics; Hjort, N.L., Holmes, C., Müller, P., Walker, S.G., Eds.; Cambridge University Press, 2010; p. 35.
  14. Ferguson, T.S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1973, 1, 209–230.
  15. Antoniak, C.E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 1974, 2, 1152–1174.
  16. Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica 1994, 4, 639–650.
  17. Hong, L.; Martin, R. A flexible Bayesian nonparametric model for predicting future insurance claims. North American Actuarial Journal 2017, 21, 228–241.
  18. Diebolt, J.; Robert, C.P. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological) 1994, 56, 363–375.
  19. Blei, D.M.; Frazier, P.I. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 2011, 12, 2461–2488.
  20. Gershman, S.J.; Blei, D.M. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology 2012, 56, 1–12.
  21. Cairns, A.J.; Blake, D.; Dowd, K.; Coughlan, G.D.; Khalaf-Allah, M. Bayesian stochastic mortality modelling for two populations. ASTIN Bulletin: The Journal of the IAA 2011, 41, 29–59.
  22. Escobar, M.D.; West, M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 1995, 90, 577–588.
  23. Browne, M.J.; Chung, J.; Frees, E.W. International property-liability insurance consumption. The Journal of Risk and Insurance 2000, 67, 73–90.
  24. Quan, Z.; Valdez, E.A. Predictive analytics of insurance claims using multivariate decision trees. Dependence Modeling 2018, 6, 377–407.
  25. Teh, Y.W. Dirichlet process. In Encyclopedia of Machine Learning; Springer, 2010.
  26. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press, 2007.
  27. Shah, A.D.; Bartlett, J.W.; Carpenter, J.; Nicholas, O.; Hemingway, H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014, 179, 764–774.
  28. Aggarwal, A.; Beck, M.B.; Cann, M.; Ford, T.; Georgescu, D.; Morjaria, N.; Smith, A.; Taylor, Y.; Tsanakas, A.; Witts, L.; et al. Model risk: daring to open up the black box. British Actuarial Journal 2016, 21, 229–296.
  29. Ni, Y.; Ji, Y.; Müller, P. Consensus Monte Carlo for random subsets using shared anchors. Journal of Computational and Graphical Statistics 2020, 29, 703–714.
Figure 1. Independent and identically distributed aggregate losses $S_h$ (left) and a Dirichlet process mixture (DPM) to model the $S_h$ in every possible way (right). Given an unobserved loss $Y_h^*$ incurred by the next policyholder (and added to a certain policy group), the main concerns are by how much the aggregate loss will grow (subject to stochasticity) and to which policy the loss will belong (subject to heterogeneity). A DPM addresses these concerns via the simulation of $S_h$.
Figure 2. An example of looping through the Gibbs sampler with complete data. In Step I, the algorithm requires the initial cluster memberships and parameters. In Step II.(A), based on the Chinese Restaurant scheme [19] with the DPM prior ($G_0$), the probabilities of the selected observation $h$ being in each current and the proposed new cluster are computed, which updates the cluster memberships. In Step II.(B), the new continuous cluster membership is determined by a multinomial distribution with a set of the resulting cluster probabilities from Step II.(A) randomly assigned based on the Polya Urn scheme. Once all observations have been assigned to clusters at a given iteration in the Gibbs sampler, then the parameters are updated, given cluster membership.
Figure 3. An example of a re-clustering process with MAR imputation in the DPM Gibbs sampler: Steps I and II. The imputations are made cluster membership-wise. Each imputation model, as a joint distribution, is the product of the outcome model and the covariate model that has missing data.
Figure 4. An example of a re-clustering process with MAR imputation in the DPM Gibbs sampler: Step III. The DPM refines the outcome models for all possible configurations based on the types of missingness prior to running the Gibbs sampler. Using these outcome models, each cluster probability and the predictive density are updated.
Figure 5. Parameter re-estimation after the re-clustering with imputation in the Gibbs sampler. This diagram articulates the flows of the parameter updates, using the acyclic graphical representation. The process cycles until achieving convergence.
Figure 6. Histograms of the outcomes and log-transformed outcomes for the two datasets: (a) PnCdemand, (b) LGPIF.
Figure 7. LogN-DPM: The last 100 in-sample predictive densities (scenarios) overlaid together.
Figure 8. MICE trace plots and in-sample predictive densities produced from GLM, GAM, MARS.
Figure 9. A histogram of the observed loss $Y_h$ on the log scale and the out-of-sample predictive densities for the typical class of a policy.
Figure 10. LogSN-DPM: The last 100 in-sample predictive densities (scenarios) overlaid together.
Figure 11. MICE trace plots and in-sample predictive densities produced from GLM, GAM, MARS.
Figure 12. A histogram of the observed total loss $S_h$ on the log scale and the out-of-sample predictive densities for the typical class of a policy.
Table 1. The comparison of out-of-sample modeling results based on the PnCdemand dataset.

Model      AIC      SSPE    SAPE    10% CTE  50% CTE  90% CTE  95% CTE
Ga-GLM     830.56   268.6   139.8   6.5      13.8     54.5     78.0
Ga-MARS    830.58   267.2   138.2   6.1      13.0     57.2     71.1
Ga-GAM     845.94   266.7   136.1   6.2      13.3     58.1     72.2
LogN-DPM   -        272.0   134.7   6.4      13.8     59.3     79.3
Table 2. The comparison of out-of-sample modeling results based on the LGPIF dataset.

Model          AIC       SSPE       SAPE       10% CTE  50% CTE  90% CTE   95% CTE
Tweedie-GLM    26270.3   2.04e+14   89380707   955.9    12977.2  133374.4  340713.1
Tweedie-MARS   24721.4   1.99e+14   88594850   961.7    10391.0  129409.2  355112.6
Tweedie-GAM    21948.9   1.95e+14   88213987   989.4    13026.2  140199.5  398263.1
LogSN-DPM      -         1.98e+14   83864890   975.3    13695.1  147486.6  425682.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
