1. Introduction
The risk management problem of an insurance company has been studied extensively in the literature. It dates back to the Cramér-Lundberg (C-L) model of Lundberg (1903), which describes the surplus process of the insurance company in terms of two cash flows: premiums received and claims paid. Consider an insurance company with claims arriving at Poisson rate $\lambda > 0$, i.e., the total number of claims $N(t)$ up to time $t$ is Poisson distributed with parameter $\lambda t$. Denote by $U_i$ the size of the $i$-th claim, where the $U_i$'s are independently and identically distributed with $\mathbb{E}[U_i]=\mu_1$ and $\mathbb{E}[U_i^2]=\mu_2$ for some constants $\mu_1,\mu_2>0$. Let $X(t)$ denote the surplus process of the insurance company. Then
$$X(t) = x + ct - \sum_{i=1}^{N(t)} U_i,$$
where $x \ge 0$ is the initial surplus level, and $c > 0$ is the premium rate, i.e., the amount of premium received by the insurance company per unit of time.
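To fix ideas, the following minimal sketch simulates one path of the C-L surplus process; the parameter values and the exponential claim-size distribution are illustrative assumptions rather than the calibration used later in the paper.

```python
import numpy as np

def simulate_cl_surplus(x0=10.0, c=1.5, lam=1.0, claim_mean=1.0, T=100.0, seed=0):
    """Simulate one Cramer-Lundberg surplus path, recorded at claim epochs.

    Claims arrive as a Poisson process with rate `lam`; claim sizes are
    (illustratively) exponential with mean `claim_mean`.
    """
    rng = np.random.default_rng(seed)
    times, levels = [0.0], [x0]
    t = 0.0
    while True:
        t += rng.exponential(1.0 / lam)              # next claim arrival time
        if t > T:
            break
        surplus = levels[-1] + c * (t - times[-1])   # premiums accrued since last claim
        surplus -= rng.exponential(claim_mean)       # pay the incoming claim
        times.append(t)
        levels.append(surplus)
        if surplus < 0:                              # ruin: surplus drops below zero
            break
    return np.array(times), np.array(levels)

times, levels = simulate_cl_surplus()
print(f"surplus after {len(times) - 1} claims: {levels[-1]:.2f}")
```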
De Finetti (1957) first proposed the dividend optimization problem: an insurance company maximizes the expectation of cumulative discounted dividends until the ruin time by choosing dividend strategies, that is, when and how much of the surplus should be distributed as dividends to the shareholders. De Finetti (1957) showed that the optimal dividend policy under a simple discrete random walk model is a barrier strategy. Gerber (1969) then generalized the dividend problem from a discrete-time model to the classical C-L model and showed that the optimal dividend strategy is a band strategy, which degenerates to a barrier strategy for exponentially distributed claim sizes.
With the development of technical tools such as dynamic programming, the dividend optimization problem has been analyzed under the stochastic control framework. In particular, the surplus process $X(t)$ in the C-L model can be approximated by a diffusion process $\hat{X}(t)$ that evolves according to
$$d\hat{X}(t) = \mu\,dt + \sigma\,dW(t),$$
where $\mu = c - \lambda\mu_1$, $\sigma^2 = \lambda\mu_2$, and $W(t)$ is a standard Brownian motion; see, e.g., Schmidli (2007). It is worth noting that the diffusion approximation for the surplus process works well for large insurance portfolios, where an individual claim is relatively small compared to the size of the surplus. Under the drifted Brownian motion model the optimal dividend strategy is a barrier strategy, and if the dividend rate is further bounded from above, the optimal dividend strategy is of threshold type; see, e.g., Jeanblanc-Picqué and Shiryaev (1995) and Asmussen and Taksar (1997). Other extensions of the dividend optimization problem include Højgaard and Taksar (1999), Asmussen et al. (2000), Azcue and Muler (2005), Azcue and Muler (2010), Gaier et al. (2003), Kulenko and Schmidli (2008), Yang and Zhang (2005), Choulli et al. (2003), Gerber and Shiu (2006), Avram et al. (2007), and Yin and Wen (2013), among others.
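For completeness, here is a similar sketch of the diffusion approximation, using the standard moment matching for the drift and volatility stated above; the Euler step size and parameter values are again illustrative.

```python
import numpy as np

def simulate_diffusion_surplus(x0=10.0, c=1.5, lam=1.0, m1=1.0, m2=2.0,
                               T=100.0, dt=0.01, seed=0):
    """Euler scheme for dX = mu dt + sigma dW with mu = c - lam*m1, sigma^2 = lam*m2."""
    rng = np.random.default_rng(seed)
    mu, sigma = c - lam * m1, np.sqrt(lam * m2)
    n = int(T / dt)
    increments = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    return np.concatenate(([x0], x0 + np.cumsum(increments)))

path = simulate_diffusion_surplus()
print(f"terminal value of the approximating diffusion: {path[-1]:.2f}")
```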
The previous literature studied the dividend optimization problem under complete information about the environment, i.e., all the model parameter values are known. This assumption is no longer valid if the environment is a black box or the model parameter values are unknown. One way to handle this issue is to use past information to estimate the model parameters and then solve the problem with the estimated parameters. However, the optimal strategy in the classical dividend optimization problem is of barrier or threshold type, which is extremely sensitive to the model parameter values; a slight change in the model parameters can lead to a totally different strategy.
In contrast to the traditional approach that separates estimation and optimization, reinforcement learning aims to learn the optimal strategy through trial-and-error interactions with the unknown environment without estimating the model parameters. In particular, one takes different actions in the unknown territory and receives feedback to learn the optimal action, which is then used to further interact with the environment. In recent years, reinforcement learning has had successful applications in many fields such as health care, autonomous control, natural language processing, and video games; see, e.g., Zhao et al. (2009), Komorowski et al. (2018), Mirowski et al. (2016), Zhu et al. (2017), Radford et al. (2017), Paulus et al. (2017), Mnih et al. (2015), Jaderberg et al. (2019), Silver et al. (2016), and Silver et al. (2017). Reinforcement learning has become one of the most popular and fastest-growing fields today.
Exploration and exploitation are the key concepts in reinforcement learning, and they proceed simultaneously. On one hand, exploitation utilizes the information known so far to derive the current optimal strategy, which might not be optimal from a long-term view. On the other hand, exploration emphasizes learning from trial-and-error interactions with the environment to improve one's knowledge for the sake of long-term benefit. While the optimal strategy of the classical dividend optimization problem is deterministic when the model parameter values are fully known, randomized strategies are considered to encourage exploration of other actions in the unknown environment. Although exploration incurs a cost in the short term, it helps to learn the optimal (or near-optimal) strategy and brings benefits from the long-term point of view.
Obviously, how to balance the trade-off between exploitation and exploration is an important issue. The $\varepsilon$-greedy strategy is a frequently used randomized strategy in reinforcement learning. It balances exploration and exploitation by prescribing that the agent sticks to the current optimal policy most of the time, while occasionally taking other, non-optimal actions at random to explore the environment; see, e.g., Auer et al. (2002). Boltzmann exploration is another randomized strategy extensively studied in the RL literature. Instead of assigning constant probabilities to different actions based on current information, Boltzmann exploration uses the Boltzmann distribution to allocate probability to different actions, where the probability of each action is positively related to its reward. In other words, the agent chooses actions with higher expected rewards with higher probability; see, e.g., Cesa-Bianchi et al. (2017).
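As a concrete illustration of these two randomized strategies, the sketch below implements $\varepsilon$-greedy and Boltzmann (softmax) action selection over a vector of estimated action values; the estimates `q_hat`, the exploration probability `eps`, and the temperature `tau` are placeholders, not quantities from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_hat, eps=0.1):
    """With probability 1 - eps exploit the current best action; otherwise explore uniformly."""
    if rng.random() < eps:
        return int(rng.integers(len(q_hat)))
    return int(np.argmax(q_hat))

def boltzmann(q_hat, tau=1.0):
    """Select an action with probability proportional to exp(q / tau):
    actions with higher estimated reward are chosen with higher probability."""
    logits = np.asarray(q_hat, dtype=float) / tau
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_hat), p=probs))

q_hat = [1.0, 1.5, 0.2]                     # illustrative action-value estimates
print(epsilon_greedy(q_hat), boltzmann(q_hat, tau=0.5))
```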
Another way to introduce a randomized strategy is to intentionally include a regularization term that encourages exploration. Entropy is a frequently used criterion in the RL family that measures the level of exploration. The entropy regularization framework directly incorporates entropy as a regularization term into the original objective function to encourage exploration; see, e.g., Todorov (2006), Ziebart et al. (2008), and Nachum et al. (2017). In the entropy regularization framework, the weight of exploration is determined by the coefficient imposed on the entropy, called the temperature parameter. The larger the temperature parameter, the greater the weight of exploration. A temperature parameter that is too large may result in too much focus on exploring the environment and little effort in exploiting the current information; conversely, if the temperature parameter is too small, one may stick to the current optimal strategy without the opportunity to explore better solutions. Therefore, careful selection of the temperature parameter is important for designing reinforcement learning algorithms.
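Schematically, and in generic discrete-time notation that is not this paper's (reward $r$, policy $\pi$, discount $\gamma$, temperature $\lambda$), the entropy-regularized objective takes the form
$$
\max_{\pi}\ \mathbb{E}\left[\sum_{t\ge 0}\gamma^{t}\Big(r(s_t,a_t)+\lambda\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot\mid s)\big)=-\int \pi(a\mid s)\ln\pi(a\mid s)\,da,
$$
so that a larger temperature parameter places more weight on exploration.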
While most of the existing literature on reinforcement learning focuses on Markov decision processes, Wang et al. (2020) recently extended the entropy regularization framework to the continuous-time setting. The authors showed that the optimal distributional control in the linear-quadratic stochastic control problem is Gaussian. In the same line of work, Wang and Zhou (2020) studied the continuous-time mean-variance portfolio selection problem under the entropy-regularized RL framework and showed that the precommitted strategies are Gaussian distributions with time-decaying variance. Dai et al. (2023) considered the equilibrium mean-variance problem with a log return target and showed that the optimal control is a Gaussian distribution whose variance does not necessarily decay in time.
This paper studies the dividend optimization problem in the entropy regularization framework to encourage exploration in the unknown environment. We follow the same setting as Wang et al. (2020), which uses Shannon's differential entropy. The key idea is to use a distribution as the control to solve the entropy-regularized dividend optimization problem. Consequently, the optimal dividend policy is a randomization over the possible dividend paying rates. We derive the so-called exploratory HJB equation and establish theoretical results that guarantee the existence of a solution. We show that the optimal exploratory dividend policy is a truncated exponential distribution whose parameter depends on the surplus level and the temperature parameter. We show that, for suitable choices of the maximal dividend paying rate and the temperature parameter, the value function of the exploratory dividend optimization problem can differ significantly from the value function of the traditional problem. In particular, we classify the value function of the exploratory dividend optimization problem into three cases based on its monotonicity.
Recently, Bai et al. (2023) also studied the optimal dividend problem under a continuous-time diffusion model. The authors use a policy improvement argument along with policy evaluation devices to construct approximating sequences for the optimal strategy. One difference is that the feasible controls in their paper are open-loop, while we consider feedback controls only. Moreover, we show that the value function is decreasing when the maximal dividend paying rate is relatively small compared to the temperature parameter, whereas in their paper the maximal dividend paying rate is assumed to be larger than one and thus the value function is always increasing.
The rest of the paper is organized as follows. In Section 2, we introduce the formulation of the entropy-regularized dividend optimization problem. In Section 3, we present the exploratory HJB equation and the theoretical results needed to solve the exploratory dividend problem. We then discuss the three cases of the value function for the exploratory dividend problem in Section 4. Numerical examples illustrating the impact of the parameters on the optimal dividend policy and the value function are presented in Section 5. Section 6 concludes.
4. Discussion
In view of Theorem 3, the value function can be classified into three cases according to its monotonicity: (1) the value function is non-increasing; (2) the value function is non-decreasing; (3) the value function is identically zero. The following proposition is useful in analyzing the properties of the value function.
Proposition 3.
(a) The auxiliary function defined in the proposition is increasing; this monotonicity yields the limiting comparisons used in the case analysis below.
Case 1: the value function is non-increasing.

The value function in this case is non-increasing and thus non-positive, in sharp contrast to the results of the classical dividend problem. The reason is twofold. On one hand, due to Proposition 1, the entropy term is negative in this case. On the other hand, a temperature parameter that is large relative to the maximal dividend paying rate M means that the negative entropy carries a large weight in the total objective value, dominating the total expected dividends and leading to a negative value function.
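To make the first point concrete, recall the standard fact (stated here in generic notation, not necessarily the form used in Proposition 1) that the uniform distribution maximizes differential entropy on a bounded interval, so for any dividend distribution $\pi$ supported on $[0,M]$,
$$
\mathcal{H}(\pi)=-\int_0^{M}\pi(a)\ln\pi(a)\,da\ \le\ \ln M,
$$
which is strictly negative whenever $M<1$.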
Case 2: the value function is non-decreasing.

In this case the value function is non-decreasing, which is closer to the increasing value function of the classical dividend optimization problem than the value function in Case 1. This is because a temperature parameter that is relatively small compared with M decreases the weight of the entropy term in the total objective value. Note that in the classical dividend optimization the value function converges, as the surplus grows, to the value of paying dividends at the maximal rate forever, while in the current exploratory dividend optimization the limit of the value function is given in (20). Two sub-cases therefore arise: (i) if the limit in (20) does not exceed the classical limit, the exploratory value function is asymptotically no larger than the classical value function; (ii) otherwise, the exploratory value function asymptotically achieves a higher value than that of the classical dividend optimization.
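For intuition on the classical limit invoked above: with dividend rates capped at $M$ and a discount rate denoted here by $\delta$ (the paper's symbol may differ), paying dividends at the maximal rate forever is worth
$$
\int_0^{\infty} e^{-\delta t}\,M\,dt=\frac{M}{\delta},
$$
and since the discount factor at the ruin time vanishes as the initial surplus grows, the classical value function approaches this bound.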
As shown in Proposition 3, when the maximal dividend paying rate M is sufficiently large, the problem always belongs to Case 2, regardless of the temperature parameter. On the other hand, when either M or the temperature parameter is sufficiently small, Case 2 (ii) cannot occur, so whenever the problem falls into Case 2 it must be Case 2 (i). Note that a zero temperature parameter corresponds to the classical dividend optimization, in which the entropy term vanishes by definition. Since this holds for any positive constant M, the classical dividend optimization can be viewed as a special instance of Case 2 (i). This implies that the exploratory dividend optimization is a generalization of the classical dividend optimization.
Case 3: the value function is identically zero.

As shown in Theorem 3, the value function in this case is constantly zero. This is because the temperature parameter, compared with M, happens to strike a balance between exploitation and exploration such that the total expected dividends are exactly offset by the entropy term.
Figure 2 depicts the different cases of the value function for different combinations of M and the temperature parameter. When the temperature parameter is large relative to M, the value function falls into the Case 1 region. When the temperature parameter is small relative to M, the value function corresponds to Case 2, which can be further classified into two sub-cases based on the comparison of M with the relevant threshold, i.e., whether the value function asymptotically achieves a higher value than that of the classical problem. On the boundary between the two regions, the value function is of Case 3 type.
5. Numerical Examples
In this section, we present numerical examples of the optimal exploratory policy and the corresponding value function, which solves the exploratory HJB equation (19), based on the theoretical results obtained in the previous sections.
To see clearly the relative weights of the cumulative dividends and the entropy in the total objective value, we further decompose the value function into two parts: the expected total discounted dividends under the optimal exploratory dividend policy and the expected total weighted discounted entropy under the optimal exploratory dividend policy, where the entropy of the optimal policy is obtained by substituting the optimal distribution (15) into the definition of entropy (10). Hence, the value function is the sum of these two components. We show examples of the three cases, respectively, with commonly used values for the drift, volatility, and discount rate.
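As a small numerical aid (not the paper's code), the sketch below evaluates the mean and differential entropy of a truncated exponential distribution on $[0,M]$ with rate parameter `theta`, the family to which the optimal exploratory policy belongs; the symbol `theta` and the test values are placeholders.

```python
import numpy as np

def trunc_exp_stats(theta, M):
    """Normalizer, mean, and differential entropy of the density
    f(d) proportional to exp(-theta * d) on [0, M], for theta != 0."""
    norm = (1.0 - np.exp(-theta * M)) / theta              # integral of exp(-theta*d) over [0, M]
    mean = 1.0 / theta - M * np.exp(-theta * M) / (1.0 - np.exp(-theta * M))
    entropy = np.log(norm) + theta * mean                  # H = ln(norm) + theta * E[D]
    return norm, mean, entropy

# Illustrative values only: a positive rate tilts mass toward low dividend rates,
# a negative rate toward high ones; in either case the entropy is at most ln(M).
M = 0.8
for theta in (2.0, -2.0):
    _, mean, ent = trunc_exp_stats(theta, M)
    print(f"theta={theta:+.1f}: mean={mean:.3f}, entropy={ent:.3f}, ln(M)={np.log(M):.3f}")
```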
First, we choose a combination in which the maximal dividend paying rate M is small relative to the temperature parameter, so that it belongs to Case 1. Note that the value function in this case is decreasing and non-positive, in sharp contrast to the results of the classical dividend problem. The figure on the top row, left column of Figure 3 plots the corresponding value function and its two components. The figure on the middle row, left column of Figure 3 plots the mean of the optimal distribution, which is decreasing in x. The figure on the bottom row, left column of Figure 3 shows the density function of the optimal distribution for different surplus levels x. Because the value function is decreasing here, the optimal truncated exponential distribution has an increasing density on its support for any surplus level; therefore, it is more likely to pay a high dividend rate. Furthermore, as the surplus x increases, the density function becomes flatter, because the derivative of the value function increases toward 0 and the magnitude of the exponential rate decreases in x.
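To illustrate how such a state-dependent randomized policy would be implemented, the sketch below draws dividend rates from a truncated exponential distribution on $[0,M]$ by inverse-transform sampling; the mapping `illustrative_rate` from the surplus x to the rate parameter is a made-up placeholder chosen only so that the density flattens as x grows, not the formula derived in the paper.

```python
import numpy as np

def sample_trunc_exp(theta, M, size, rng):
    """Inverse-transform sampling from f(d) proportional to exp(-theta*d) on [0, M]."""
    u = rng.random(size)
    return -np.log(1.0 - u * (1.0 - np.exp(-theta * M))) / theta

def illustrative_rate(x):
    """Hypothetical rate parameter: negative (mass tilted toward high dividend
    rates, as in Case 1) and shrinking in magnitude as the surplus x grows."""
    return -2.0 / (1.0 + x)

rng = np.random.default_rng(0)
M = 0.8
for x in (0.5, 2.0, 10.0):
    draws = sample_trunc_exp(illustrative_rate(x), M, size=100_000, rng=rng)
    print(f"x = {x:>4}: average sampled dividend rate = {draws.mean():.3f}")
```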
Second, we increase M relative to the temperature parameter so that it belongs to Case 2 (i). The figure on the top row, middle column of Figure 3 shows the corresponding value function and its two components. In contrast to Case 1, the entropy component in this case becomes positive since M is sufficiently large, making the value function positive. The figure on the middle row, middle column of Figure 3 plots the mean of the optimal distribution, which is increasing in x. The figure on the bottom row, middle column of Figure 3 shows the density function of the optimal distribution for different surplus levels x. When x is small, it is more likely to choose a low dividend paying rate, because paying too high a dividend rate would likely drive the insurance company to bankruptcy and harm the shareholders' benefit in the long run. When x becomes larger, it is more likely to pay a high dividend rate.
Third, we choose sufficiently large values of M and the temperature parameter so that it belongs to Case 2 (ii). The figure on the top row, right column of Figure 3 shows the corresponding value function and its two components. In this case, the limit of the exploratory value function is higher than that of the classical value function. Note that the expected total discounted dividends under the exploratory policy do not exceed those of the classical policy, because the classical optimal dividend policy fully exploits the known environment. For sufficiently large M and temperature parameter, however, the entropy component is large enough to make the exploratory value function larger than the classical one. The figure on the middle row, right column of Figure 3 plots the mean of the optimal distribution, and the figure on the bottom row, right column of Figure 3 plots the density function of the optimal distribution; both are similar to those of Case 2 (i).
Finally, when M and the temperature parameter satisfy the balance condition of Case 3, the value function is constantly zero.
We also vary the value of the temperature parameter while keeping the other parameter values unchanged. Figure 4 shows the value function under different values of the temperature parameter for a smaller and a larger value of M, respectively. Note that when the temperature parameter is zero, the exploratory value function degenerates to the classical value function. For the smaller M (left panel), the problem is Case 2 (i) when the temperature parameter is small and then becomes Case 3 and Case 1 as the temperature parameter gets larger. As mentioned above, it cannot be Case 2 (ii) since M is too small. Indeed, the left panel of Figure 4 shows that the value function cannot exceed the classical one even as the temperature parameter gets smaller. On the other hand, for the larger M (right panel), the problem can only be Case 2, and even Case 2 (ii) if the temperature parameter is large enough. The right panel of Figure 4 shows that the value function is always increasing in x for different values of the temperature parameter and that it can exceed the classical value function for a sufficiently large temperature parameter.