Exploratory Dividend Optimization with Entropy Regularization

Abstract
This paper studies the dividend optimization problem in the entropy regularization framework, following the continuous-time reinforcement learning setting of Wang et al. (2020). The exploratory HJB equation is established, and the optimal exploratory dividend policy is a truncated exponential distribution. We show that, for suitable choices of the maximal dividend paying rate and the temperature parameter, the value function of the exploratory dividend optimization problem can differ significantly from the value function of the classical dividend optimization problem. In particular, the value function of the exploratory problem can be classified into three cases based on its monotonicity. Numerical examples are also presented to show the impact of the temperature parameter on the solution.
Keywords: 
Subject: Business, Economics and Management - Finance

1. Introduction

The risk management problem of an insurance company has been studied extensively in the literature. It dates back to the Cramér-Lundberg (C-L) model of Lundberg (1903), which described the surplus process of the insurance company in terms of two cash flows: premiums received and claims paid. Consider an insurance company with claims arriving at Poisson rate $\nu$, i.e., the total number of claims $N_t$ up to time $t$ is Poisson distributed with parameter $\nu t$. Denote by $\xi_i$ the size of the $i$-th claim, where the $\xi_i$'s are independently and identically distributed with $\mathbb{E}[\xi_i]=\mu_1$ and $\mathbb{E}[\xi_i^2]=\mu_2$ for some constants $\mu_1,\mu_2>0$. Let $\tilde X_t$ denote the surplus process of the insurance company. Then
$\tilde X_t = x_0 + \zeta t - \sum_{i=1}^{N_t}\xi_i,$
where $x_0$ is the initial surplus level and $\zeta$ is the premium rate, i.e., the amount of premium received by the insurance company per unit of time.
De Finetti (1957) first proposed the dividend optimization problem: an insurance company maximizes the expectation of cumulative discounted dividends until the ruin time by choosing a dividend strategy, that is, when and how much of the surplus should be distributed as dividends to the shareholders. De Finetti (1957) derived that the optimal dividend policy under a simple discrete random walk model is a barrier strategy. Gerber (1969) then generalized the dividend problem from a discrete-time model to the classical C-L model and showed that the optimal dividend strategy is a band strategy, which degenerates to a barrier strategy for exponentially distributed claim sizes.
With the development of technical tools such as dynamic programming, the dividend optimization problem has been analyzed under the stochastic control framework. In particular, $\tilde X_t$ in the C-L model can be approximated by a diffusion process $X_t$ that evolves according to
$dX_t = \mu\,dt + \sigma\,dW_t,$
where $\mu := \zeta - \nu\mu_1$, $\sigma := \sqrt{\nu\mu_2}$, and $\{W_t\}$ is a standard Brownian motion; see, e.g., Schmidli (2007). It is worth noting that the diffusion approximation for the surplus process works well for large insurance portfolios, where an individual claim is relatively small compared to the size of the surplus. Under the drifted Brownian motion model the optimal dividend strategy is a barrier strategy, and if the dividend rate is further upper bounded, the optimal dividend strategy is of threshold type; see, e.g., Jeanblanc-Picqué and Shiryaev (1995) and Asmussen and Taksar (1997). Other extensions of the dividend optimization problem include Højgaard and Taksar (1999), Asmussen et al. (2000), Azcue and Muler (2005), Azcue and Muler (2010), Gaier et al. (2003), Kulenko and Schmidli (2008), Yang and Zhang (2005), Choulli et al. (2003), Gerber and Shiu (2006), Avram et al. (2007) and Yin and Wen (2013), among others.
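As a purely illustrative aside (not part of the paper), the following Python sketch simulates one path of the C-L surplus with exponential claim sizes next to the drifted Brownian motion approximation built from $\mu=\zeta-\nu\mu_1$ and $\sigma=\sqrt{\nu\mu_2}$; all parameter values below are hypothetical.

# A minimal sketch comparing one path of the Cramer-Lundberg surplus with its
# diffusion approximation, assuming exponential claim sizes (hypothetical values).
import numpy as np

rng = np.random.default_rng(0)
x0, zeta, nu, mean_claim = 5.0, 1.2, 1.0, 1.0    # hypothetical parameter values
mu1, mu2 = mean_claim, 2 * mean_claim**2         # E[xi], E[xi^2] for Exp(mean_claim)
mu, sigma = zeta - nu * mu1, np.sqrt(nu * mu2)   # drift and volatility of the approximation

T, dt = 50.0, 0.01
n = int(T / dt)
t = np.linspace(0.0, T, n + 1)

# Cramer-Lundberg path: premiums in, compound Poisson claims out
claims = rng.poisson(nu * dt, size=n)            # number of claims per time step
claim_sizes = np.array([rng.exponential(mean_claim, k).sum() for k in claims])
X_cl = x0 + zeta * t[1:] - np.cumsum(claim_sizes)

# Diffusion approximation: dX = mu dt + sigma dW
dW = rng.normal(0.0, np.sqrt(dt), size=n)
X_diff = x0 + np.cumsum(mu * dt + sigma * dW)

print("terminal surplus, C-L model:      ", X_cl[-1])
print("terminal surplus, diffusion model:", X_diff[-1])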
Previous literature studied the dividend optimization problem based on the complete information of the environment, i.e., all the model parameter values are known. This assumption is no longer valid if the environment is a black-box or the model parameter values are unknown. One way to handle this issue is to use the past information to estimate the model parameters and then solve the problem with the estimated parameters. However, the optimal strategy in the classical dividend optimization problem is a barrier-type or threshold-type, which is extremely sensitive to the model parameter values; a slight change in the model parameters would lead to a totally different strategy.1
In contrast to the traditional approach that separates estimation and optimization, reinforcement learning aims to learn the optimal strategy through trial-and-error interactions with the unknown environment without estimating the model parameters. In particular, one takes different actions in the unknown territory and receives feedback to learn the optimal action, which is then used to further interact with the environment. In recent years, reinforcement learning has had successful applications in many fields such as health care, autonomous control, natural language processing, and video games; see, e.g., Zhao et al. (2009), Komorowski et al. (2018), Mirowski et al. (2016), Zhu et al. (2017), Radford et al. (2017), Paulus et al. (2017), Mnih et al. (2015), Jaderberg et al. (2019), Silver et al. (2016), Silver et al. (2017). There is no doubt that reinforcement learning has become one of the most popular and fastest-growing fields today.
Exploration and exploitation are key concepts in reinforcement learning, and they proceed simultaneously. On the one hand, exploitation utilizes the information known so far to derive the currently optimal strategy, which might not be optimal from a long-term view. On the other hand, exploration emphasizes learning from trial-and-error interactions with the environment to improve one's knowledge for the sake of long-term benefit. While the optimal strategy of the classical dividend optimization problem is deterministic when the model parameter values are fully known, randomized strategies are considered to encourage exploration of other actions in the unknown environment. Although exploration incurs a cost in the short term, it helps to learn the optimal (or near-optimal) strategy and brings benefit from a long-term point of view.
Obviously, how to balance the trade-off between exploitation and exploration is an important issue. The $\varepsilon$-greedy strategy is a frequently used randomized strategy in reinforcement learning. It balances exploration and exploitation by prescribing that the agent stick to the currently optimal policy most of the time, while occasionally taking other, non-optimal actions at random to explore the environment; see, e.g., Auer et al. (2002). Boltzmann exploration is another randomized strategy extensively studied in the RL literature. Instead of assigning constant probabilities to different actions based on current information, Boltzmann exploration uses the Boltzmann distribution to allocate probability to different actions, where the probability of each action is positively related to its reward. In other words, the agent chooses actions with higher expected rewards with higher probability; see, e.g., Cesa-Bianchi et al. (2017).
Another way to introduce a randomized strategy is to intentionally include a regularization term to encourage exploration. Entropy is a frequently used criterion in the RL family that measures the level of exploration. The entropy regularization framework directly incorporates entropy as a regularization term into the original objective function to encourage exploration; see, e.g., Todorov (2006), Ziebart et al. (2008), Nachum et al. (2017). In the entropy regularization framework, the weight of exploration is determined by the coefficient imposed on the entropy, which is called the temperature parameter. The larger the temperature parameter, the greater the weight of exploration. A temperature parameter that is too large may result in too much focus on exploring the environment and little effort in exploiting the current information; conversely, if the temperature parameter is too small, one may stick to the current optimal strategy without the opportunity to explore better solutions. Therefore, careful selection of the temperature parameter is important for designing reinforcement learning algorithms.
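As a toy illustration of the role of the temperature parameter (not part of the paper's model), the snippet below computes Boltzmann exploration probabilities for a few hypothetical action-value estimates: a small temperature concentrates mass on the greedy action, while a large temperature spreads it almost uniformly.

# Boltzmann exploration toy example: action probabilities proportional to
# exp(value / temperature); the values below are hypothetical.
import numpy as np

def boltzmann_probs(values, temperature):
    z = np.asarray(values, dtype=float) / temperature
    z -= z.max()                      # numerical stabilization
    w = np.exp(z)
    return w / w.sum()

estimated_rewards = [1.0, 1.5, 0.2]   # hypothetical action-value estimates
for temp in (0.1, 1.0, 10.0):
    print(temp, boltzmann_probs(estimated_rewards, temp))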
While most existing literature in reinforcement learning focuses on Markov decision processes, Wang et al. (2020) recently extended the entropy regularization framework to the continuous-time setting. The authors showed that the optimal distributional control is a Gaussian distribution in the linear-quadratic stochastic control problem. In follow-up work, Wang and Zhou (2020) studied the continuous-time mean-variance portfolio selection problem under the entropy-regularized RL framework and showed that the precommitted strategies are Gaussian distributions with time-decaying variance. Dai et al. (2023) considered the equilibrium mean-variance problem with a log return target and showed that the optimal control is a Gaussian distribution whose variance is not necessarily decaying in time.
This paper studies the dividend optimization problem in the entropy regularization framework to encourage exploration in the unknown environment. We follow the same setting as in Wang et al. (2020), which uses Shannon's differential entropy. The key idea is to use a distribution over dividend rates as the control in the entropy-regularized dividend optimization problem. Consequently, the optimal dividend policy is a randomization over the possible dividend paying rates. We derive the so-called exploratory HJB equation and establish the theoretical results that guarantee the existence of the solution. We obtain that the optimal exploratory dividend policy is a truncated exponential distribution whose parameter depends on the surplus level and the temperature parameter. We show that, for suitable choices of the maximal dividend paying rate and the temperature parameter, the value function of the exploratory dividend optimization problem can be significantly different from the value function in the traditional problem. In particular, we classify the value function of the exploratory dividend optimization problem into three cases based on its monotonicity.
Recently, Bai et al. (2023) also studied the optimal dividend problem under the continuous-time diffusion model. The authors use a policy improvement argument along with policy evaluation devices to construct approximating sequences of the optimal strategy. The difference is that in their paper the feasible controls are open-loop, while we consider feedback controls only. We show that the value function is decreasing when the maximal dividend paying rate is relatively small compared to the temperature parameter, whereas in their paper the maximal dividend paying rate is assumed to be larger than one and thus the value function is always increasing.
The rest of the paper is organized as follows. In Section 2, we introduce the formulation of the entropy-regularized dividend optimization problem. In Section 3, we present the exploratory HJB and the theoretical results to solve the exploratory dividend problem. We then discuss the three cases of the value function for the exploratory dividend problem in Section 4. Some numerical examples to show the impact of parameters on the optimal dividend policy and the value function are presented in Section 5. Section 6 concludes.

2. Problem

2.1. The Model

Suppose an insurance company has surplus $X_t$ at time $t$, with
$dX_t = \mu\,dt + \sigma\,dW_t,\quad X_0 = x,$ (1)
where $\mu>0$, $\sigma>0$, and $\{W_t\}_{t\ge0}$ is a standard Brownian motion defined on the filtered probability space $(\Omega,\mathcal F,\{\mathcal F_t\}_{t\ge0},\mathbb P)$. As remarked by Asmussen and Taksar (1997), such a surplus process (1) can be viewed either as a direct model with drifted Brownian motion or as an approximation to the classical compound Poisson model.
A dividend strategy or policy is defined as $a=\{a_t\}_{t\ge0}$, where $a_t$ is the dividend paying rate at time $t$, i.e., the cumulative amount of dividends paid from time $t_1$ to time $t_2$ is given by $\int_{t_1}^{t_2}a_t\,dt$. We consider herein Markov feedback controls, i.e., $a_t=a(X_t)$, where $a(\cdot)$ is a function of the surplus level $X_t$. Note that $a_t$ is nonnegative for any $t$. Further, we assume that $a_t$ is upper bounded by a positive constant $M$, which is consistent with the assumption made in the literature. We give the formal definition of an admissible dividend policy below.
Definition 1. 
A dividend policy $a$ is said to be admissible if $\{a_t\}_{t\ge0}$ is $\mathcal F_t$-adapted and $a_t\in[0,M]$ for all $t\ge0$.
Denote by $\mathcal A$ the set of admissible dividend policies. For an insurance company whose surplus process evolves according to (1) and which pays dividends according to a policy $a=\{a_t\}_{t\ge0}\in\mathcal A$, the controlled surplus process is
$dX_t^a = (\mu - a_t)\,dt + \sigma\,dW_t,\quad X_0^a = x.$ (2)
Define the ruin time to be the first time that the surplus level hits zero, i.e.,
$\tau_x^a := \inf\{t\ge0: X_t^a\le 0 \mid X_0^a = x\}.$
For an insurance company starting with initial surplus $x\in[0,\infty)$, the problem is to find the optimal dividend policy that maximizes the expected value of exponentially discounted dividends accumulated until the ruin time, that is,
$J_{cl}(x,a) := \mathbb E\left[\int_0^{\tau_x^a} e^{-\rho t}\,a(X_t^a)\,dt\right],$
where ρ > 0 is the discounting rate. Then the optimal dividend problem is
$\sup_{a\in\mathcal A} J_{cl}(x,a).$ (3)

2.2. Classical Optimal Dividend Problem

First, we briefly review the results of solving the dividend optimization problem (3) classically. Let $V_{cl}(x)$ be the value function of the dividend optimization problem:
$V_{cl}(x) := \sup_{a\in\mathcal A} J_{cl}(x,a).$
Assume that the value function $V_{cl}(x)$ is twice continuously differentiable. The standard dynamic programming approach leads to the following Hamilton-Jacobi-Bellman equation,
$\rho V_{cl}(x) = \sup_{a\in[0,M]}\left\{a + (\mu-a)V_{cl}'(x) + \tfrac12\sigma^2 V_{cl}''(x)\right\},$ (4)
with boundary condition $V_{cl}(0)=0$. It can easily be seen that the optimal dividend paying rate at surplus level $x$ is
$a^*(x) = \begin{cases}0, & \text{if } V_{cl}'(x) > 1,\\ M, & \text{if } V_{cl}'(x)\le 1.\end{cases}$ (5)
Assume that $V_{cl}(x)$ is a concave function. Then there exists a nonnegative constant $x_b$ such that $V_{cl}'(x)\le1$ when $x\ge x_b$ and $V_{cl}'(x)>1$ when $0\le x<x_b$. Substituting (5) into (4) turns it into the following ODEs:
$\begin{cases}\tfrac12\sigma^2 V_{cl}''(x) + \mu V_{cl}'(x) - \rho V_{cl}(x) = 0, & \text{if } 0\le x < x_b,\\ \tfrac12\sigma^2 V_{cl}''(x) + (\mu-M)V_{cl}'(x) - \rho V_{cl}(x) + M = 0, & \text{if } x\ge x_b.\end{cases}$ (6)
Combining with the boundary condition, one can derive that
$V_{cl}(x) = \begin{cases}C_1\big(e^{\theta_1 x} - e^{-\theta_2 x}\big), & \text{if } 0\le x < x_b,\\ \frac M\rho - C_2 e^{-\theta_3 x}, & \text{if } x\ge x_b,\end{cases}$ (7)
where
$\theta_1 = \frac{-\mu + \sqrt{\mu^2 + 2\rho\sigma^2}}{\sigma^2},\quad \theta_2 = \frac{\mu + \sqrt{\mu^2 + 2\rho\sigma^2}}{\sigma^2},\quad \theta_3 = \frac{(\mu - M) + \sqrt{(\mu-M)^2 + 2\rho\sigma^2}}{\sigma^2}.$
$C_1$, $C_2$ and $x_b$ are determined by the smooth-pasting conditions, i.e.,
$C_1\big(e^{\theta_1 x_b} - e^{-\theta_2 x_b}\big) = \frac M\rho - C_2 e^{-\theta_3 x_b},\qquad C_1\big(\theta_1 e^{\theta_1 x_b} + \theta_2 e^{-\theta_2 x_b}\big) = 1,\qquad C_2\theta_3 e^{-\theta_3 x_b} = 1.$ (8)
If $\frac M\rho - \frac1{\theta_3} > 0$, there exists a unique solution to (8). In this case, $V_{cl}(x)$ is given by (7), where $C_1$, $C_2$, $x_b$ are determined uniquely through (8). Consequently, the optimal dividend policy is to pay at the maximal rate $M$ when the surplus level $x$ exceeds the threshold $x_b$ and to pay nothing otherwise. If $\frac M\rho - \frac1{\theta_3}\le 0$, then $V_{cl}(x) = \frac M\rho\big(1 - e^{-\theta_3 x}\big)$. In this case, the optimal dividend policy is always to pay at the maximal rate $M$. Detailed proofs can be found in Asmussen and Taksar (1997). It is also straightforward to check that the optimal value function $V_{cl}(x)$ is concave in $x$ and always smaller than $M/\rho$, which is the limit of $V_{cl}(x)$ as $x$ goes to infinity. Figure 1 below illustrates the value function and the corresponding optimal dividend policy under the following parameter values: $\mu=1$, $\sigma=1$, $\rho=0.3$, and $M=0.6$ (left panels), $M=1.2$ (middle panels), $M=1.8$ (right panels), respectively.
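For illustration, the following Python sketch evaluates the classical solution numerically under the reconstruction of (7)-(8) above and the parameter values used in Figure 1; the initial guess passed to the root finder is arbitrary and may need adjustment for other parameters.

# Solve the smooth-pasting system (8) for C1, C2, x_b and evaluate V_cl
# (illustrative sketch with the Figure 1 parameters).
import numpy as np
from scipy.optimize import fsolve

mu, sigma, rho, M = 1.0, 1.0, 0.3, 1.2

theta1 = (-mu + np.sqrt(mu**2 + 2 * rho * sigma**2)) / sigma**2
theta2 = ( mu + np.sqrt(mu**2 + 2 * rho * sigma**2)) / sigma**2
theta3 = ((mu - M) + np.sqrt((mu - M)**2 + 2 * rho * sigma**2)) / sigma**2

def smooth_pasting(z):
    C1, C2, xb = z
    return [
        C1 * (np.exp(theta1 * xb) - np.exp(-theta2 * xb)) - (M / rho - C2 * np.exp(-theta3 * xb)),
        C1 * (theta1 * np.exp(theta1 * xb) + theta2 * np.exp(-theta2 * xb)) - 1.0,
        C2 * theta3 * np.exp(-theta3 * xb) - 1.0,
    ]

if M / rho - 1.0 / theta3 > 0:
    C1, C2, xb = fsolve(smooth_pasting, x0=[1.0, 1.0, 1.0])
    V_cl = lambda x: np.where(x < xb,
                              C1 * (np.exp(theta1 * x) - np.exp(-theta2 * x)),
                              M / rho - C2 * np.exp(-theta3 * x))
else:
    xb = 0.0
    V_cl = lambda x: (M / rho) * (1.0 - np.exp(-theta3 * x))

print("threshold x_b =", xb, "; V_cl(2) =", float(V_cl(2.0)))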

2.3. Exploratory Formulation

The above optimal dividend policy (5) is implemented based on the complete information, i.e., the model parameters μ and σ are known. In reality, however, it is difficult to know exactly the values of μ and σ , due to the uncertainty in premium rate, claim arrival process and claim size. Therefore, we consider a technique named reinforcement learning to learn the optimal (or near-optimal) dividend paying strategy through trial-and-error interactions with the unknown territory.
Whereas the majority of the work in reinforcement learning considers Markov decision processes in discrete time, we follow the pioneering work of Wang et al. (2020), who model reinforcement learning in continuous time as a relaxed stochastic control problem. At time $t$ with surplus level $X_t$, the dividend paying rate $a$ is randomly sampled according to a distribution $\pi_t := \pi(\cdot\,;X_t)$, where $\pi(\cdot\,;\cdot):[0,M]\times[0,\infty)\to[0,\infty)$ satisfies $\int_0^M\pi(a;x)\,da = 1$ for any $x\in[0,\infty)$. We call $\pi := \{\pi_t\}_{t\ge0}$ the distributional dividend policy. Following the same procedure as in Wang et al. (2020), we derive the exploratory dynamics of the surplus process under $\pi$ to be
$dX_t^\pi = \left(\mu - \int_0^M a\,\pi(a;X_t^\pi)\,da\right)dt + \sigma\,dW_t,\quad X_0^\pi = x,$ (9)
and the expected value of total discounted dividends under exploration to be
$\mathbb E\left[\int_0^{\tau_x^\pi} e^{-\rho t}\int_0^M a\,\pi(a;X_t^\pi)\,da\,dt\right],$
where the ruin time is
$\tau_x^\pi := \inf\{t\ge0: X_t^\pi \le 0 \mid X_0^\pi = x\}.$
In addition to the expected value of total discounted dividends under exploration, Shannon’s differential entropy is introduced into the objective to encourage exploration. For a given distribution π , the entropy is defined as
$\mathcal H(\pi) := -\int_0^M \pi(a)\ln\pi(a)\,da.$ (10)
Thus, the objective of entropy-regularized exploratory dividend problem is
$J(x,\pi) := \mathbb E\left[\int_0^{\tau_x^\pi} e^{-\rho t}\left(\int_0^M a\,\pi(a;X_t^\pi)\,da + \lambda\,\mathcal H(\pi_t)\right)dt\right] = \mathbb E\left[\int_0^{\tau_x^\pi} e^{-\rho t}\int_0^M\big(a - \lambda\ln\pi(a;X_t^\pi)\big)\pi(a;X_t^\pi)\,da\,dt\right],$ (11)
where $\lambda>0$ is the so-called temperature parameter. Note that $\lambda$ controls the weight put on exploration and is exogenously given. If $\lambda=0$, the distribution degenerates to the Dirac measure, which is the solution to the classical optimal dividend problem without exploration. The entropy-regularized exploratory dividend problem is
$\sup_{\pi\in\Pi} J(x,\pi),$ (12)
where Π is the set of admissible exploratory dividend policies. We give the formal definition of admissible exploratory dividend policy π below.
Definition 2. 
An exploratory dividend policy π is admissible if the following conditions are satisfied:
(i) 
$\pi(\cdot\,;x)\in\Pi[0,M]$ for any $x\in[0,\infty)$, where $\Pi[0,M]$ is the set of probability density functions with support $[0,M]$;
(ii) 
the stochastic differential equation (9) has a unique solution $\{X_t^\pi\}_{t\ge0}$ under $\pi$;
(iii) 
$\mathbb E\left[\int_0^{\tau_x^\pi} e^{-\rho t}\int_0^M\big(a - \lambda\ln\pi(a;X_t^\pi)\big)\pi(a;X_t^\pi)\,da\,dt\right] < \infty.$
The following proposition will be used later.
Proposition 1. 
For any distribution $\pi$ with support $[0,M]$, the entropy satisfies $\mathcal H(\pi)\le\ln M$.
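As a quick numerical sanity check of Proposition 1 (illustrative only, not part of the proof), the snippet below integrates $-\pi\ln\pi$ for two densities on $[0,M]$: the uniform density attains the bound $\ln M$, while a truncated exponential density stays below it.

# Numerical check: differential entropy of densities on [0, M] is at most ln(M).
import numpy as np

M = 1.2
a = np.linspace(1e-6, M - 1e-6, 200001)
da = a[1] - a[0]

def entropy(pdf_vals):
    return -np.sum(pdf_vals * np.log(pdf_vals)) * da

uniform = np.full_like(a, 1.0 / M)
trunc_exp = 2.0 * np.exp(2.0 * a) / (np.exp(2.0 * M) - 1.0)   # density proportional to exp(2a)

print("ln(M)              =", np.log(M))
print("entropy, uniform   =", entropy(uniform))
print("entropy, trunc exp =", entropy(trunc_exp))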

3. Exploratory HJB Equation

To solve the exploratory optimal dividend problem (12), we first derive the corresponding HJB equation, the so-called exploratory HJB; see Wang et al. (2020), Tang et al. (2022), etc.
Let V ( x ) be the value function of entropy-regularized exploratory dividend problem, that is,
$V(x) := \sup_{\pi\in\Pi} J(x,\pi).$
Assume that the value function V ( x ) is twice-continuously differentiable. Following the standard arguments in dynamic programming, we derive the exploratory HJB equation below:
$\rho V(x) = \sup_{\pi\in\Pi[0,M]}\left\{\int_0^M\big(a(1-V'(x)) - \lambda\ln\pi(a;x)\big)\pi(a;x)\,da\right\} + \mu V'(x) + \frac12\sigma^2 V''(x),$ (13)
with boundary condition
$V(0) = 0.$ (14)

3.1. Exploratory Dividend Policy

To solve the supremum in (13) together with the constraint that $\int_0^M\pi(a;x)\,da = 1$, we introduce the Lagrange multiplier $\eta$:
$\sup_{\pi\in\Pi[0,M]}\left\{\int_0^M\big(a(1-V'(x)) - \lambda\ln\pi(a;x) - \eta\big)\pi(a;x)\,da\right\} + \eta.$
Maximizing the integrand above pointwise and using the first-order condition leads to the solution
$\pi^*(a;x) = \exp\left(a\,\frac{1-V'(x)}{\lambda} - 1 - \frac\eta\lambda\right),\quad a\in[0,M].$
Because $\int_0^M\pi^*(a;x)\,da = 1$, we solve that
$\pi^*(a;x) = \frac{1}{Z_M\big((1-V'(x))/\lambda\big)}\exp\left(a\,\frac{1-V'(x)}{\lambda}\right),\quad a\in[0,M],$ (15)
where
$Z_M(y) := \begin{cases}\dfrac{e^{My}-1}{y}, & y\neq 0,\\ M, & y = 0.\end{cases}$ (16)
Recall that the classical optimal dividend policy given in (5) is a threshold strategy with two extreme actions: it pays nothing, $a^*(x)=0$, if $V_{cl}'(x)>1$, or pays at the maximal rate, $a^*(x)=M$, if $V_{cl}'(x)\le1$. In contrast, the exploratory dividend policy is not restricted to these two extreme actions only but assigns a probability to every feasible action. This result is very similar to Gao et al. (2022), in which the authors study the temperature control problem for Langevin diffusions by randomizing the temperature control and regularizing its entropy. The classical optimal control of that problem is of bang-bang type, whereas the exploratory control is a state-dependent, truncated exponential distribution. Likewise, the optimal distribution $\pi^*(a;x)$ given in (15) is a continuous version of the Boltzmann distribution or Gibbs measure, which is widely used in discrete reinforcement learning.
When $V'(x)>1$, $\pi^*(a;x)$ is decreasing in $a$, so it is more likely to take a small dividend paying rate close to 0; when $V'(x)<1$, $\pi^*(a;x)$ is increasing in $a$, so it is more likely to take a large dividend paying rate close to $M$; when $V'(x)=1$, it degenerates to a uniform distribution on $[0,M]$. In other words, the optimal exploratory dividend policy is an "exploration" of the classical dividend pay-out policy: it searches around the current optimal dividend rate given by the classical solution, 0 or $M$, with the probability of taking a certain rate decreasing as that rate moves away from the classical solution.
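The following short sketch (with illustrative parameter values) evaluates the optimal exploratory density (15)-(16) on a grid for three hypothetical values of $V'(x)$, reproducing the three regimes described above.

# Optimal exploratory density pi*(a; x) on [0, M] for given values of V'(x).
import numpy as np

def Z_M(y, M):
    return (np.exp(M * y) - 1.0) / y if y != 0.0 else M

def pi_star(a, v_prime, lam, M):
    y = (1.0 - v_prime) / lam
    return np.exp(a * y) / Z_M(y, M)

M, lam = 1.2, 1.5
a = np.linspace(0.0, M, 5)
for v_prime in (2.0, 1.0, 0.0):          # V'(x) > 1, = 1, < 1
    dens = pi_star(a, v_prime, lam, M)
    print(f"V'(x) = {v_prime}: density on grid =", np.round(dens, 3))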
The exploratory surplus process under the optimal policy is well-posed. Note that the optimal distributional policy is $\pi^* = \{\pi_t^*\}_{t\ge0}$, where
$\pi_t^* := \pi^*(a;X_t^{\pi^*}) = \frac{1}{Z_M\big((1-V'(X_t^{\pi^*}))/\lambda\big)}\exp\left(a\,\frac{1-V'(X_t^{\pi^*})}{\lambda}\right).$ (17)
Applying the optimal distributional policy (17) to the exploratory surplus process (9), we obtain
$dX_t^{\pi^*} = \left(\mu - \int_0^M a\,\pi^*(a;X_t^{\pi^*})\,da\right)dt + \sigma\,dW_t = \left(\mu - \left[M - \frac{\lambda}{1-V'(X_t^{\pi^*})} + \frac{M}{e^{M(1-V'(X_t^{\pi^*}))/\lambda}-1}\right]\mathbf 1_{\{V'(X_t^{\pi^*})\neq1\}} - \frac M2\,\mathbf 1_{\{V'(X_t^{\pi^*})=1\}}\right)dt + \sigma\,dW_t.$ (18)
Since $\int_0^M a\,\pi^*(a;X_t^{\pi^*})\,da\in[0,M]$, the SDE (18) has bounded drift and constant volatility. As a result, there exists a unique solution $\{X_t^{\pi^*}\}$ to (18).
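For intuition, the sketch below simulates the exploratory surplus dynamics (18) by the Euler-Maruyama scheme; since the true $V'(x)$ is not available in closed form, a hypothetical placeholder function is used in its place, so the output is only meant to illustrate the structure of the drift.

# Euler-Maruyama simulation of (18), assuming some approximation of V'(x).
import numpy as np

def policy_mean(v_prime, lam, M):
    y = (1.0 - v_prime) / lam
    if y == 0.0:
        return M / 2.0
    return M - 1.0 / y + M / np.expm1(M * y)      # mean of density prop. to exp(a*y) on [0, M]

def simulate(x0, v_prime_fn, mu=1.0, sigma=1.0, lam=1.5, M=1.2, dt=1e-3, T=10.0, seed=0):
    rng = np.random.default_rng(seed)
    x, path = x0, [x0]
    for _ in range(int(T / dt)):
        drift = mu - policy_mean(v_prime_fn(x), lam, M)
        x += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        path.append(x)
        if x <= 0.0:                              # ruin: stop the path
            break
    return np.array(path)

v_prime_guess = lambda x: np.exp(-x)              # hypothetical placeholder for V'(x)
print("simulated path length:", simulate(2.0, v_prime_guess).size)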

3.2. Verification Theorem

Substituting the optimal distribution $\pi^*(a;x)$ given in (15) into the HJB equation (13), we have the following equation for $V(x)$:
$\rho V(x) = \mu V'(x) + \frac12\sigma^2 V''(x) + \lambda\ln Z_M\big((1-V'(x))/\lambda\big),$ (19)
or equivalently,
$\rho V(x) = \mu V'(x) + \frac12\sigma^2 V''(x) + \lambda\ln\left(\frac{\lambda}{1-V'(x)}\Big(e^{M(1-V'(x))/\lambda}-1\Big)\mathbf 1_{\{V'(x)\neq1\}} + M\,\mathbf 1_{\{V'(x)=1\}}\right).$
The following verification theorem shows that V ( x ) that solves (19) is indeed the value function of the exploratory dividend problem (12).
Theorem 1. 
Assume there exists a twice continuously differentiable function $V$ that solves (19) with boundary condition (14), and that $|V|$ and $|V'|$ are bounded. Then $V$ is the value function of the entropy-regularized exploratory dividend problem (12) under exponential discounting.
Theorem 1 shows that a solution to the exploratory HJB equation (19) could be the value function of the exploratory dividend problem (12). On the other hand, a similar argument shows that the value function must also satisfy (19), while the optimal exploratory dividend strategy is given by (17). To establish a rigorous statement, we need the following result. The next proposition shows that the value function $V(x)$ converges as $x$ goes to infinity.
Proposition 2. 
Let $V$ be the value function of (12) and suppose the optimal exploratory dividend strategy is (17). Then, as $x$ goes to infinity, $V(x)$ converges to a constant, i.e.,
$\lim_{x\to\infty} V(x) = \frac{\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)}{\rho}.$ (20)

3.3. Solution to Exploratory HJB

Compared with the differential equation (6), which solves the classical value function, the exploratory HJB equation (19) has a nonlinear term $\ln Z_M\big((1-V'(x))/\lambda\big)$, which makes it difficult to solve explicitly. The theorem below guarantees the existence and uniqueness of the solution $V(x)$.
Theorem 2. 
There exists a unique twice continuously differentiable function $V(x)$ that solves (19) with boundary conditions (14) and (20). Moreover, $\lim_{\lambda\to0}|V(x)-V_{cl}(x)| = 0$ for all $x\in[0,\infty)$, where $V_{cl}(x)$ is the value function of the classical dividend problem.
Theorem 2 follows from the results in Tang et al. (2022, Theorems 3.9 and 3.10) and in Strulovici and Szydlowski (2015, Proposition 1). It is straightforward to check that the conditions guaranteeing the existence and uniqueness of the solution to (19) and its twice continuous differentiability are satisfied.
Theorem 2 also states that as $\lambda$ becomes smaller, the exploratory value function converges to the classical value function. Indeed, a stronger convergence is established by Tang et al. (2022): $V$ converges to $V_{cl}$ locally uniformly as $\lambda$ goes to 0. Note that the parameter $\lambda$ is the weight put on exploration in contrast to exploitation. If it is closer to 0, the entropy term has a smaller effect on the total objective value and the optimal exploratory distribution $\pi^*(a;x)$ in (15) is more concentrated and closer to the Dirac distribution, the optimal solution to the classical dividend optimization problem. Then, not surprisingly, the exploratory value function $V(x)$ also converges to the classical value function $V_{cl}(x)$ as $\lambda$ goes to 0.
Now, thanks to Theorem 2, we have a $V(x)$ that solves the exploratory HJB equation (19). On the other hand, it is straightforward to show from (20) that if $M < \lambda\ln(1/\lambda+1)$, the limit of $V(x)$ is negative; if $M > \lambda\ln(1/\lambda+1)$, the limit of $V(x)$ is positive; and if $M = \lambda\ln(1/\lambda+1)$, the limit of $V(x)$ is zero. The next theorem shows that, indeed, $V(x)$ can be classified into three cases based on its monotonicity.
Theorem 3. 
Let $V(x)$ be the solution to (19) with boundary conditions (14) and (20). Then $V(x)$ is monotone. To be more specific,
(i) 
if $M < \lambda\ln(1/\lambda+1)$, $V(x)$ is non-increasing;
(ii) 
if $M > \lambda\ln(1/\lambda+1)$, $V(x)$ is non-decreasing;
(iii) 
if $M = \lambda\ln(1/\lambda+1)$, $V(x)\equiv 0$.
The following corollary is a direct result from above theorem.
Corollary 1. 
Let $V(x)$ be the solution to (19) with boundary conditions (14) and (20). Then $|V(x)|$ and $|V'(x)|$ are bounded.
Note that in Theorem 1 we need $|V|$ and $|V'|$ to be bounded so that $V$, the solution to (19), is indeed the value function of the exploratory optimal dividend problem. Corollary 1 verifies that the boundedness conditions are satisfied. In other words, the solution to the exploratory HJB equation (19) is the value function of the exploratory dividend problem.

4. Discussion

In view of Theorem 3, value functions can be classified into three cases according to monotonicity: (1) $M < \lambda\ln(1/\lambda+1)$; (2) $M > \lambda\ln(1/\lambda+1)$; (3) $M = \lambda\ln(1/\lambda+1)$. The following proposition is useful in analyzing the properties of the value functions.
Proposition 3. 
(a) Define
$d_1(\lambda) := \lambda\ln\left(\frac1\lambda+1\right)\mathbf 1_{\{\lambda>0\}} + 0\cdot\mathbf 1_{\{\lambda=0\}},\quad \lambda\in[0,\infty).$
Then $d_1(\lambda)$ is increasing. Therefore, $\lim_{\lambda\to0}d_1(\lambda) = d_1(0) = 0$ and $\lim_{\lambda\to\infty}d_1(\lambda) = 1$;
(b) Define
$d_2(\lambda) := \lambda\ln\left(\frac{\lambda}{\lambda-1}\right)\mathbf 1_{\{\lambda>1\}} + \infty\cdot\mathbf 1_{\{\lambda\in[0,1]\}},\quad \lambda\in[0,\infty).$
Then $d_2(\lambda) > d_1(\lambda)$, and $d_2(\lambda)$ is decreasing on $\lambda>1$. Therefore, $\lim_{\lambda\to1^+}d_2(\lambda) = \infty$ and $\lim_{\lambda\to\infty}d_2(\lambda) = 1$.
Case 1: M < d 1 ( λ ) .
The value function in this case is non-increasing and thus non-positive, as a sharp contrast to the results of classical dividend problem. To see the reason, on one hand, note that for λ > 0 ,
$\ln M < \ln d_1(\lambda) = \ln\left(\lambda\ln\Big(\frac1\lambda+1\Big)\right) \le \ln\left(\lambda\cdot\frac1\lambda\right) = 0.$
Then, due to Proposition 1, the entropy term is negative, that is, $\mathcal H(\pi)\le\ln M<0$. On the other hand, when $d_1(\lambda)$ is large, it implies that the exploration parameter $\lambda$ is relatively large compared with the maximal dividend paying rate $M$. Then the negative entropy has a large weight in the total objective value, dominating the total expected dividends and leading to a negative value function.
Case 2: M > d 1 ( λ ) .
When $M > d_1(\lambda)$, the value function is non-decreasing, which is closer to the increasing value function of the classical dividend optimization problem than in Case 1. This is because a relatively small $\lambda$ compared with $M$ decreases the weight of the entropy term in the total objective value. Note that in classical dividend optimization the limit of the value function is $M/\rho$, while in the present exploratory dividend optimization the limit of the value function is given in (20). Therefore, if (i) $d_1(\lambda) < M \le d_2(\lambda)$, the limit of $V(x)$ is no larger than that of $V_{cl}(x)$; if (ii) $d_2(\lambda) < M$, then $V(x)$ asymptotically achieves a higher value than that of the classical dividend optimization. In other words, if $\lambda>1$ and $M > \lambda\ln\lambda - \lambda\ln(\lambda-1) = d_2(\lambda)$, the limit of $V(x)$ is larger than that of $V_{cl}(x)$.
As shown in Proposition 3, for any $\lambda\ge0$, $d_1(\lambda) < \lim_{\lambda\to\infty}d_1(\lambda) = 1$. Therefore, when $M\ge1$, the problem always belongs to Case 2 for any $\lambda\ge0$. On the other hand, for any $\lambda\ge0$, $d_2(\lambda) > \lim_{\lambda\to\infty}d_2(\lambda) = 1$. Therefore, when $M\le1$ or $\lambda\le1$, since $d_2(\lambda)>M$, it cannot be Case 2 (ii); when $\lambda\le1$ and $M\ge1$, it is always Case 2 (i). Note that $\lambda=0$ corresponds to the classical dividend optimization and $d_1(0)=0$, $d_2(0)=\infty$ by definition. Since $d_1(0)<M<d_2(0)$ for any positive constant $M$, classical dividend optimization can be viewed as a special instance of Case 2 (i). This implies that exploratory dividend optimization is a generalization of the classical dividend optimization.
Case 3: M = d 1 ( λ ) .
As shown in Theorem 3, the value function in this case is constantly zero. This is because $\lambda$, compared with $M$, happens to strike a balance between exploitation and exploration such that the total expected dividends are exactly offset by the entropy term.
Figure 2 depicts the different cases of value functions given different combinations of M and λ . When M < d 1 ( λ ) , the value function falls into Case 1 area. When M > d 1 ( λ ) , the value function corresponds to Case 2, which can be further classified into two cases based on the comparison of M and d 2 ( λ ) , i.e., whether the value function asymptotically achieves a higher value than that of the classical problem. When M = d 1 ( λ ) , the value function should be Case 3 type.
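The classification behind Figure 2 can be summarized in a few lines of code. The sketch below (illustrative only) implements $d_1$ and $d_2$ from Proposition 3 and maps a pair $(M,\lambda)$ to its case; a loose tolerance is used for the knife-edge Case 3.

# Case classification of the exploratory value function via d1 and d2.
import numpy as np

def d1(lam):
    return lam * np.log(1.0 / lam + 1.0) if lam > 0 else 0.0

def d2(lam):
    return lam * np.log(lam / (lam - 1.0)) if lam > 1 else np.inf

def classify(M, lam, tol=1e-3):
    if abs(M - d1(lam)) < tol:                    # knife-edge Case 3
        return "Case 3: V identically zero"
    if M < d1(lam):
        return "Case 1: V non-increasing"
    return "Case 2 (ii): limit above M/rho" if M > d2(lam) else "Case 2 (i): limit at most M/rho"

for M, lam in [(0.6, 1.5), (1.2, 1.5), (1.8, 1.5), (0.7662, 1.5)]:
    print(f"M = {M}, lambda = {lam} -> {classify(M, lam)}")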

5. Numerical Examples

In this section, we present numerical examples of the optimal exploratory policy and the corresponding value function solving the exploratory HJB equation (19), based on the theoretical results obtained in the previous sections.2 To gain a clear view of the weight of cumulative dividends and that of entropy in the total objective value, we further decompose $V(x)$ into two parts: the expected total discounted dividends under the optimal exploratory dividend policy,
$Dv(x) := \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\int_0^M a\,\pi^*(a;X_t^{\pi^*})\,da\,dt\right];$
and the expected total weighted discounted entropy under the optimal exploratory dividend policy,
$Entr(x) := \lambda\,\mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\,\mathcal H(\pi_t^*)\,dt\right],$
where the entropy of $\pi^*$ is derived by substituting the optimal distribution (15) into the definition of entropy (10), i.e.,
$\mathcal H(\pi^*) = \ln\Big(Z_M\big((1-V'(x))/\lambda\big)\Big) - \frac{M e^{M(1-V'(x))/\lambda}}{Z_M\big((1-V'(x))/\lambda\big)} + 1.$
Hence, V ( x ) = D v ( x ) + E n t r ( x ) . We show examples of three cases, respectively, with commonly used parameters: μ = 1 , σ = 1 , ρ = 0.3 .
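Footnote 2 describes the numerical scheme only briefly, so the following Python sketch of the shooting method is our own reconstruction under the stated parameter values: the exploratory HJB (19) is integrated as an initial value problem in $(V,V')$ starting from $V(0)=0$, and the unknown slope $V'(0)$ is found by bisection so that $V$ approaches the limit (20). The bracket for the slope and the truncation horizon are assumptions and may need tuning.

# Shooting method for the exploratory HJB (19), illustrative reconstruction.
import numpy as np
from scipy.integrate import solve_ivp

mu, sigma, rho, lam, M = 1.0, 1.0, 0.3, 1.5, 1.2
x_max = 15.0
v_limit = (lam * np.log(lam) + lam * np.log(np.expm1(M / lam))) / rho   # equation (20)

def log_Z(y):
    # log of Z_M(y) in (16), computed in an overflow-safe way
    if abs(y) < 1e-10:
        return np.log(M)
    if M * y > 30.0:
        return M * y - np.log(y)
    return np.log(np.expm1(M * y) / y)

def hjb_rhs(x, u):
    V, Vp = u
    y = (1.0 - Vp) / lam
    Vpp = (2.0 / sigma**2) * (rho * V - mu * Vp - lam * log_Z(y))
    return [Vp, Vpp]

def terminal_value(slope):
    sol = solve_ivp(hjb_rhs, (0.0, x_max), [0.0, slope], rtol=1e-6, atol=1e-8)
    return sol.y[0, -1]

lo, hi = 0.0, 5.0              # assumed bracket for V'(0)
for _ in range(40):            # bisection: too small a slope diverges down, too large diverges up
    mid = 0.5 * (lo + hi)
    if terminal_value(mid) < v_limit:
        lo = mid
    else:
        hi = mid

print("estimated V'(0) =", 0.5 * (lo + hi))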
First, let $\lambda=1.5$, $M=0.6$. Then $M<d_1(\lambda)$ and it belongs to Case 1. Note that $V(x)$ in this case is decreasing and non-positive, in sharp contrast to the results of the classical dividend problem. The figure in the top row, left column of Figure 3 plots the corresponding value function and its two components $Dv(x)$ and $Entr(x)$.3 The figure in the middle row, left column of Figure 3 plots the mean of the optimal distribution $\pi^*(\cdot\,;x)$, which is decreasing in $x$. The figure in the bottom row, left column of Figure 3 shows the density function of the optimal distribution for different surplus levels $x$. Because $V'(x)\le0$, the optimal distribution is a truncated exponential distribution whose density is proportional to $\exp\big(a(1-V'(x))/\lambda\big)$ with exponent $(1-V'(x))/\lambda>0$ for any $x\ge0$. Therefore, it is more likely to pay a high dividend rate. Furthermore, as the surplus $x$ increases, the density function becomes flatter, because $V'(x)$ increases toward 0 and the exponent $(1-V'(x))/\lambda$ is decreasing in $x$.
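Sampling dividend rates from the truncated exponential policy (15) is straightforward by inverting its CDF; the sketch below (illustrative values of $V'(x)$, not computed from the model) shows how the sampled mean rate shifts toward $M$ as $V'(x)$ decreases.

# Inverse-CDF sampling from the truncated exponential policy (15).
import numpy as np

def sample_pi_star(v_prime, lam, M, size, rng):
    y = (1.0 - v_prime) / lam
    u = rng.random(size)
    if abs(y) < 1e-12:                      # V'(x) = 1: uniform on [0, M]
        return M * u
    return np.log1p(u * np.expm1(M * y)) / y

rng = np.random.default_rng(1)
lam, M = 1.5, 0.6
for v_prime in (-0.5, 0.0, 0.5):            # Case 1 has V'(x) <= 0, so mass tilts toward M
    draws = sample_pi_star(v_prime, lam, M, 100000, rng)
    print(f"V'(x) = {v_prime}: mean sampled dividend rate = {draws.mean():.3f}")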
Second, let $\lambda=1.5$, $M=1.2$. Then $d_1(\lambda)<M<d_2(\lambda)$ and it belongs to Case 2 (i). The figure in the top row, middle column of Figure 3 shows the corresponding value function, $Dv(x)$ and $Entr(x)$. In contrast to Case 1, $Entr(x)$ in this case becomes positive since $M$ is sufficiently large, making the value function $V(x)$ positive. The figure in the middle row, middle column of Figure 3 plots the mean of the optimal distribution $\pi^*(\cdot\,;x)$, which is increasing in $x$. The figure in the bottom row, middle column of Figure 3 shows the density function of the optimal distribution for different surplus levels $x$. When $x$ is small, it is more likely to choose a low dividend paying rate, because paying too high a dividend rate would probably cause the insurance company to go bankrupt and harm the shareholders' benefit in the long run. When $x$ becomes larger, it is more likely to pay a high dividend rate.
Third, let $\lambda=1.5$, $M=1.8$. Then $M>d_2(\lambda)$ and it belongs to Case 2 (ii). The figure in the top row, right column of Figure 3 shows the corresponding value function, $Dv(x)$ and $Entr(x)$. In this case, the limit of $V(x)$ is higher than that of the classical value function $V_{cl}(x)$, which is $M/\rho=6$. Note that the expected total discounted dividends under the exploratory policy, $Dv(x)$, do not exceed those under the classical policy, $V_{cl}(x)$, because the classical optimal dividend policy fully exploits the known environment. For sufficiently large $M$ and $\lambda$, $Entr(x)$ is large enough to make $V(x)$ larger than $V_{cl}(x)$. The figure in the middle row, right column of Figure 3 plots the mean of the optimal distribution $\pi^*(\cdot\,;x)$ and the figure in the bottom row, right column of Figure 3 plots the density function of the optimal distribution, which are similar to those of Case 2 (i).
When λ = 1.5 , M = 0.7662 , it belongs to Case 3 and the value function in this case should be constantly zero.
We also vary the value of $\lambda$ while keeping the other parameter values unchanged. Figure 4 shows the value function under different values of $\lambda$ with $M=0.6$ and $M=1.2$, respectively. Note that when $\lambda=0$, $V(x)$ degenerates to the classical value function $V_{cl}(x)$. For $M=0.6$, it is Case 2 (i) when $\lambda$ is small and then becomes Case 3 and Case 1 as $\lambda$ gets larger. As mentioned above, it cannot be Case 2 (ii) since $M<1$. Indeed, the left panel of Figure 4 shows that the value function cannot exceed the classical one even as $\lambda$ gets smaller. On the other hand, for $M=1.2$, it can only be Case 2, and even Case 2 (ii) if $\lambda$ is large enough. The right panel of Figure 4 shows that the value function is always increasing in $x$ for different values of $\lambda$ and that it can exceed the classical value function for sufficiently large $\lambda$.

6. Conclusion

This paper studies the dividend optimization problem in the entropy regularization framework. In an unknown environment, the entropy is incorporated into the objective function to encourage exploration, and an exploratory dividend policy is introduced. We establish the exploratory HJB equation and find that the optimal distributional control is a truncated exponential distribution. Compared to the classical value function, the value function of the exploratory dividend problem is classified into three cases. The monotonicity of the value function is determined by the maximal dividend paying rate and the temperature parameter, which controls the weight of exploration.
One future research direction is to consider the exploratory dividend policy under non-exponential discounting, which makes the problem time-inconsistent. Furthermore, reinsurance could also be considered as part of the insurance company's strategy in addition to the dividend policy, which is more technically challenging under the entropy regularization framework. Finally, one could adopt other definitions of entropy, instead of Shannon's differential entropy, as a measure of the level of exploration in reinforcement learning.

Appendix A. Proof

Proof of Proposition 1 
By definition (10),
$\mathcal H(\pi) = -\int_0^M\pi(a)\ln\pi(a)\,da = \int_0^M\pi(a)\ln\frac{1}{\pi(a)}\,da \le \ln\left(\int_0^M\pi(a)\cdot\frac{1}{\pi(a)}\,da\right) = \ln M,$
where the inequality is due to Jensen’s inequality. Q.E.D.
Proof of Theorem 1 
Let $\tilde\pi\in\Pi$ be an exploratory dividend policy. Because $V$ solves (13), for any $x\in[0,\infty)$,
$0 = \sup_{\pi\in\Pi[0,M]}\left\{\int_0^M\big(a - \lambda\ln\pi(a;x) - aV'(x)\big)\pi(a;x)\,da\right\} + \mu V'(x) + \frac12\sigma^2 V''(x) - \rho V(x) \ge \int_0^M\big(a - \lambda\ln\tilde\pi(a;x) - aV'(x)\big)\tilde\pi(a;x)\,da + \mu V'(x) + \frac12\sigma^2 V''(x) - \rho V(x),$
which shows that
$-\rho V(x) + \left(\mu - \int_0^M a\,\tilde\pi(a;x)\,da\right)V'(x) + \frac12\sigma^2 V''(x) \le -\int_0^M\big(a - \lambda\ln\tilde\pi(a;x)\big)\tilde\pi(a;x)\,da.$ (A1)
Applying Itô's Lemma to $e^{-\rho t}V(X_t^{\tilde\pi})$,
$V(x) = e^{-\rho(T\wedge\tau_x^{\tilde\pi})}V\big(X_{T\wedge\tau_x^{\tilde\pi}}^{\tilde\pi}\big) - \int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\left(-\rho V(X_t^{\tilde\pi}) + \Big(\mu - \int_0^M a\,\tilde\pi(a;X_t^{\tilde\pi})\,da\Big)V'(X_t^{\tilde\pi}) + \frac12\sigma^2 V''(X_t^{\tilde\pi})\right)dt - \int_0^{T\wedge\tau_x^{\tilde\pi}}\sigma e^{-\rho t}V'(X_t^{\tilde\pi})\,dW_t \ge e^{-\rho(T\wedge\tau_x^{\tilde\pi})}V\big(X_{T\wedge\tau_x^{\tilde\pi}}^{\tilde\pi}\big) + \int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\tilde\pi(a;X_t^{\tilde\pi})\big)\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt - \int_0^{T\wedge\tau_x^{\tilde\pi}}\sigma e^{-\rho t}V'(X_t^{\tilde\pi})\,dW_t,$
where the inequality is due to (A1). Taking expectations on both sides,
$V(x) \ge \mathbb E\left[e^{-\rho(T\wedge\tau_x^{\tilde\pi})}V\big(X_{T\wedge\tau_x^{\tilde\pi}}^{\tilde\pi}\big)\right] + \mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\tilde\pi(a;X_t^{\tilde\pi})\big)\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right] - \mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}}\sigma e^{-\rho t}V'(X_t^{\tilde\pi})\,dW_t\right].$ (A2)
For the first term on the right-hand side of (A2), noting that $|V|$ is bounded, the bounded convergence theorem gives
$\lim_{T\to\infty}\mathbb E\left[e^{-\rho(T\wedge\tau_x^{\tilde\pi})}V\big(X_{T\wedge\tau_x^{\tilde\pi}}^{\tilde\pi}\big)\right] = \mathbb E\left[e^{-\rho\tau_x^{\tilde\pi}}V\big(X_{\tau_x^{\tilde\pi}}^{\tilde\pi}\big)\right] = 0.$
For the second term on the right-hand side of (A2), since $\tilde\pi$ is admissible and satisfies Definition 2 (iii),
$\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\tilde\pi(a;X_t^{\tilde\pi})\big)\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right] = \mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M a\,\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right] - \lambda\,\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\Big(\ln\tilde\pi(a;X_t^{\tilde\pi})\,\tilde\pi(a;X_t^{\tilde\pi}) + 1\Big)da\,dt\right] + \lambda\,\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}M\,dt\right].$
Because $\int_0^M a\,\tilde\pi(a;X_t^{\tilde\pi})\,da$ is non-negative, by the monotone convergence theorem,
$\lim_{T\to\infty}\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M a\,\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M a\,\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right].$
Noting that $y\ln y + 1\ge y > 0$ for any $y\in(0,\infty)$, by the monotone convergence theorem,
$\lim_{T\to\infty}\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\Big(\ln\tilde\pi(a;X_t^{\tilde\pi})\,\tilde\pi(a;X_t^{\tilde\pi}) + 1\Big)da\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\Big(\ln\tilde\pi(a;X_t^{\tilde\pi})\,\tilde\pi(a;X_t^{\tilde\pi}) + 1\Big)da\,dt\right],$
and $\lim_{T\to\infty}\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}} e^{-\rho t}M\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\tilde\pi}} e^{-\rho t}M\,dt\right].$
For the third term on the right-hand side of (A2), noting that $|V'|$ is bounded, the stochastic integral $\big\{\int_0^s\sigma e^{-\rho t}V'(X_t^{\tilde\pi})\,dW_t\big\}_{s\ge0}$ is a martingale; then by the optional sampling theorem,
$\mathbb E\left[\int_0^{T\wedge\tau_x^{\tilde\pi}}\sigma e^{-\rho t}V'(X_t^{\tilde\pi})\,dW_t\right] = 0.$
Thus, letting $T\to\infty$ on both sides of (A2),
$V(x) \ge \mathbb E\left[\int_0^{\tau_x^{\tilde\pi}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\tilde\pi(a;X_t^{\tilde\pi})\big)\tilde\pi(a;X_t^{\tilde\pi})\,da\,dt\right] = J(x,\tilde\pi).$
Since π ˜ is arbitrarily chosen, V ( x ) becomes an upper bound of the optimal value of J ( x ; · ) .
On the other hand, the above inequality becomes an equality if the supremum in (13) is achieved, that is, π ˜ = π * , where π * is given by (15). Thus, V ( x ) is the value function. Q.E.D.
Define the function $G_{\lambda,M}$ by
$G_{\lambda,M}(y) := \left(\left[M - \frac1y + \frac{M}{e^{My}-1}\right]\mathbf 1_{\{y\neq0\}} + \frac M2\,\mathbf 1_{\{y=0\}}\right)(1-\lambda y) + \lambda\ln Z_M(y),$ (A3)
where function Z M is given in (16).
Lemma A1. 
The function $G_{\lambda,M}(y)$ defined in (A3) is maximized at $y=1/\lambda$, and
$G_{\lambda,M}(1/\lambda) = \lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big).$
Moreover, $G_{\lambda,M}(1/\lambda) < 0$ when $M < \lambda\ln(1/\lambda+1)$, $G_{\lambda,M}(1/\lambda) > 0$ when $M > \lambda\ln(1/\lambda+1)$, and $G_{\lambda,M}(1/\lambda) = 0$ when $M = \lambda\ln(1/\lambda+1)$.
Proof. 
Take the first-order derivative of the function $G_{\lambda,M}$:
$G_{\lambda,M}'(y) = \left(\frac{1}{y^2} - \frac{M^2 e^{My}}{(e^{My}-1)^2}\right)(1-\lambda y) - \lambda\left(M - \frac1y + \frac{M}{e^{My}-1}\right) + \lambda\left(\frac{M e^{My}}{e^{My}-1} - \frac1y\right) = (1-\lambda y)\,\frac{e^{2My} - (2+M^2y^2)e^{My} + 1}{y^2(e^{My}-1)^2} = \frac{(1-\lambda y)\,f_1(y)}{y^2(e^{My}-1)^2},\quad y\neq0,$
where $f_1(y) := e^{2My} - (2+M^2y^2)e^{My} + 1$, $y\neq0$. Take the first-order derivative of $f_1$:
$f_1'(y) = 2M e^{2My} - 2M e^{My} - M^3 y^2 e^{My} - 2M^2 y e^{My} = M e^{My} f_2(y),\quad y\neq0,$
where $f_2(y) := 2e^{My} - M^2y^2 - 2My - 2$, $y\neq0$. Take the first-order derivative of $f_2$:
$f_2'(y) = 2M e^{My} - 2M^2 y - 2M = 2M f_3(y),\quad y\neq0,$
where $f_3(y) := e^{My} - My - 1$, $y\neq0$. Take the first-order derivative of $f_3$:
$f_3'(y) = M e^{My} - M,\quad y\neq0.$
Note that $f_3'(y) > 0$ for $y>0$ and $f_3'(y) < 0$ for $y<0$. Hence, $f_3(y)$ is increasing on $y>0$ and decreasing on $y<0$, and $f_3(y) > 0$ for $y\neq0$. Then $f_2'(y) > 0$, which means that $f_2(y)$ is increasing. As a result, $f_2(y) > 0$ for $y>0$ and $f_2(y) < 0$ for $y<0$. Hence, $f_1'(y) > 0$ for $y>0$ and $f_1'(y) < 0$ for $y<0$, which means that $f_1(y)$ is increasing on $y>0$ and decreasing on $y<0$. As a result, $f_1(y) > 0$ for $y\neq0$.
The above analysis shows that $G_{\lambda,M}'(y)$ is positive when $1-\lambda y > 0$, i.e., $y < 1/\lambda$, and negative when $1-\lambda y < 0$, i.e., $y > 1/\lambda$. Thus the maximum is attained at $y = 1/\lambda$:
$\max_y G_{\lambda,M}(y) = G_{\lambda,M}(1/\lambda) = \lambda\ln Z_M(1/\lambda) = \lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big).$
Moreover, when $M < \lambda\ln(1/\lambda+1)$, $G_{\lambda,M}(1/\lambda) < \lambda\ln\lambda + \lambda\ln\big(e^{\ln(1/\lambda+1)}-1\big) = 0$; when $M > \lambda\ln(1/\lambda+1)$, $G_{\lambda,M}(1/\lambda) > 0$; and when $M = \lambda\ln(1/\lambda+1)$, $G_{\lambda,M}(1/\lambda) = 0$. □
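As an illustrative numerical check of Lemma A1 (not part of the proof), the snippet below evaluates $G_{\lambda,M}$ on a grid and confirms that the maximum sits at $y=1/\lambda$ with the sign determined by the comparison of $M$ and $\lambda\ln(1/\lambda+1)$.

# Grid check of Lemma A1: maximizer and sign of G_{lambda,M}.
import numpy as np

def G(y, lam, M):
    if abs(y) < 1e-8:
        mean, log_Z = M / 2.0, np.log(M)
    else:
        mean = M - 1.0 / y + M / np.expm1(M * y)
        log_Z = np.log(np.expm1(M * y) / y)
    return mean * (1.0 - lam * y) + lam * log_Z

lam = 1.5
ys = np.linspace(-3.0, 3.0, 6001)
for M in (0.6, 1.2):
    vals = np.array([G(y, lam, M) for y in ys])
    print(f"M = {M}: argmax at y = {ys[np.argmax(vals)]:.3f} (1/lambda = {1/lam:.3f}), "
          f"max = {vals.max():.4f}, threshold lambda*ln(1/lambda+1) = {lam*np.log(1/lam+1):.4f}")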
Proof of Proposition 2 
With the optimal distributional policy given in (17), substituting (17) into the objective (11) leads to
$V(x) = J(x,\pi^*) = \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\pi^*(a;X_t^{\pi^*})\big)\pi^*(a;X_t^{\pi^*})\,da\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\int_0^M\left(a - \lambda\left(\frac{a\big(1-V'(X_t^{\pi^*})\big)}{\lambda} - \ln Z_M\Big(\frac{1-V'(X_t^{\pi^*})}{\lambda}\Big)\right)\right)\pi^*(a;X_t^{\pi^*})\,da\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\left\{\left(\left[M - \frac{\lambda}{1-V'(X_t^{\pi^*})} + \frac{M}{e^{M(1-V'(X_t^{\pi^*}))/\lambda}-1}\right]\mathbf 1_{\{V'(X_t^{\pi^*})\neq1\}} + \frac M2\,\mathbf 1_{\{V'(X_t^{\pi^*})=1\}}\right)\left(1 - \lambda\cdot\frac{1-V'(X_t^{\pi^*})}{\lambda}\right) + \lambda\ln Z_M\Big(\frac{1-V'(X_t^{\pi^*})}{\lambda}\Big)\right\}dt\right] = \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\,G_{\lambda,M}\Big(\frac{1-V'(X_t^{\pi^*})}{\lambda}\Big)\,dt\right],$
where G λ , M is defined in (A3).
On the one hand,
$V(x) = \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\,G_{\lambda,M}\Big(\frac{1-V'(X_t^{\pi^*})}{\lambda}\Big)\,dt\right] \le \mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\Big(\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)\Big)\,dt\right],$
where the inequality follows from Lemma A1. Letting $x\to\infty$ and using the dominated convergence theorem,
$\lim_{x\to\infty} V(x) \le \lim_{x\to\infty}\mathbb E\left[\int_0^{\tau_x^{\pi^*}} e^{-\rho t}\Big(\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)\Big)\,dt\right] = \mathbb E\left[\int_0^{\infty} e^{-\rho t}\Big(\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)\Big)\,dt\right] = \frac{\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)}{\rho}.$
On the other hand, consider the exploratory policy $\hat\pi = \{\hat\pi_t\}_{t\ge0}$, where
$\hat\pi_t = \hat\pi(a;X_t^{\hat\pi}) = \frac{e^{a/\lambda}}{\lambda\big(e^{M/\lambda}-1\big)},\quad a\in[0,M].$
Then
$V(x) \ge J(x,\hat\pi) = \mathbb E\left[\int_0^{\tau_x^{\hat\pi}} e^{-\rho t}\int_0^M\big(a - \lambda\ln\hat\pi(a;X_t^{\hat\pi})\big)\hat\pi(a;X_t^{\hat\pi})\,da\,dt\right] = \mathbb E\left[\int_0^{\tau_x^{\hat\pi}} e^{-\rho t}\Big(\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)\Big)\,dt\right].$
Letting $x\to\infty$ and using the dominated convergence theorem,
$\lim_{x\to\infty} V(x) \ge \frac{\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)}{\rho},$
which then together with the previous inequality leads to (20). Q.E.D.
Define a function $h$ by
$h(x) = \begin{cases}\ln\dfrac{e^{k(1-x)}-1}{1-x}, & x\neq1,\\ \ln k, & x=1,\end{cases}$ (A4)
where k > 0 is given.
Lemma A2. 
The function $h$ defined in (A4) satisfies the following properties:
(i)
$h(x)$ is continuous and decreasing in $x$;
(ii)
there exists a unique $x_0\in\mathbb R$ such that $h(x_0)=0$;
(iii)
$|h(x)| < k|x| + c$ for some constant $c\in\mathbb R$ which depends on $k$ only;
(iv)
$|h(x_1) - h(x_2)| < k|x_1 - x_2|$ for all $x_1,x_2\in\mathbb R$.
Proof. 
We first show that the function $h(x)$ is continuous at $x=1$. By L'Hôpital's rule, $\lim_{x\to1}\frac{e^{k(1-x)}-1}{1-x} = k$. Hence, $\lim_{x\to1}h(x) = \ln k = h(1)$.
Taking the first-order derivative of $h$, for $x\neq1$,
$h'(x) = \frac{1-x}{e^{k(1-x)}-1}\cdot\frac{-k(1-x)e^{k(1-x)} + e^{k(1-x)} - 1}{(1-x)^2} = \frac{h_1(x)}{(1-x)\big(e^{k(1-x)}-1\big)},$
where $h_1(x) = e^{k(1-x)} - 1 - k(1-x)e^{k(1-x)}$. Then
$h_1'(x) = -k e^{k(1-x)} + k e^{k(1-x)} + k^2(1-x)e^{k(1-x)} = k^2(1-x)e^{k(1-x)},$
which is positive when $x<1$ and negative when $x>1$. Therefore, $h_1(x)$ is increasing on $x<1$ and decreasing on $x>1$, and $h_1(x) < \lim_{x\to1}h_1(x) = 0$. Combining with the fact that $(1-x)\big(e^{k(1-x)}-1\big) > 0$ for $x\neq1$, we obtain $h'(x) < 0$ for $x\neq1$. This completes the proof of (i), i.e., $h(x)$ is decreasing in $x$. To show (ii), note that $\lim_{x\to-\infty}h(x) > 0$ and $\lim_{x\to\infty}h(x) < 0$. By the continuity and monotonicity of $h(x)$, there must exist a unique $x_0\in\mathbb R$ such that $h(x_0) = 0$. In particular, when $k=1$, $x_0 = 1$.
Note that for $x\neq1$, $e^{k(1-x)} - 1 > k(1-x)$, which implies
$e^{k(1-x)} - 1 - k(1-x)e^{k(1-x)} > k(1-x) - k(1-x)e^{k(1-x)} = -k(1-x)\big(e^{k(1-x)}-1\big).$
Combining with the fact that $(1-x)\big(e^{k(1-x)}-1\big) > 0$ for $x\neq1$,
$h'(x) = \frac{e^{k(1-x)} - 1 - k(1-x)e^{k(1-x)}}{(1-x)\big(e^{k(1-x)}-1\big)} > -k.$
Based on the previous results, for $x < x_0$,
$|h(x)| = h(x) = h(x_0) - \int_x^{x_0} h'(y)\,dy = -\int_x^{x_0} h'(y)\,dy < \int_x^{x_0} k\,dy = k(x_0 - x);$
similarly, for $x\ge x_0$,
$|h(x)| = -h(x) = -h(x_0) - \int_{x_0}^{x} h'(y)\,dy = -\int_{x_0}^{x} h'(y)\,dy < \int_{x_0}^{x} k\,dy = k(x - x_0).$
To show (iii),
$|h(x)| < k|x - x_0| \le k|x| + k|x_0|,\quad x\in\mathbb R.$
It remains to prove (iv). Without loss of generality, we assume $x_1\ge x_2$. Then
$|h(x_1) - h(x_2)| = h(x_2) - h(x_1) = -\int_{x_2}^{x_1} h'(y)\,dy < \int_{x_2}^{x_1} k\,dy = k|x_1 - x_2|.$ □
Proof of Theorem 2 
It is straightforward to show that Assumption 3.8 in Tang et al. (2022) holds for our exploratory dividend problem. The well-posedness of the SDE (18) for the optimal exploratory surplus process has also been established. Then, by applying the results of Tang et al. (2022, Theorems 3.9 and 3.10), the existence and uniqueness of the solution to (19) and the convergence of $V$ to $V_{cl}$ are established.
To show the twice continuous differentiability of $V(x)$, we apply the results in Strulovici and Szydlowski (2015, Proposition 1) (with the infinite domain). We rewrite the HJB equation (19) in the following form:
$V''(x) + H\big(V(x), V'(x)\big) = 0,$
where
$H(p,q) := \frac{2}{\sigma^2}\left(-\rho p + \mu q + \lambda\ln\left[\frac{\lambda}{1-q}\Big(e^{M(1-q)/\lambda}-1\Big)\mathbf 1_{\{q\neq1\}} + M\,\mathbf 1_{\{q=1\}}\right]\right) = \frac{2}{\sigma^2}\big(-\rho p + \mu q + \lambda\ln\lambda + \lambda h(q)\big),$
and h is defined in (A4) with k = M / λ . According to Proposition 1 in Strulovici and Szydlowski (2015), if H satisfies Condition 1-3, then there exists a twice-continuously differentiable solution to the HJB equation.
To check Condition 1 in Strulovici and Szydlowski (2015, Proposition 1), note that for $p,q\in\mathbb R$,
$|H(p,q)| \le \frac{2}{\sigma^2}\Big(\rho|p| + \mu|q| + \lambda|\ln\lambda| + \lambda|h(q)|\Big) < \frac{2}{\sigma^2}\Big(\rho|p| + \mu|q| + \lambda|\ln\lambda| + M|q| + \lambda c\Big),$
where the second inequality comes from Lemma A2 (iii), and $c\in\mathbb R$ is a constant. Taking $L_1 := \frac{2}{\sigma^2}\max\big(\lambda|\ln\lambda| + \lambda c,\ \rho,\ \mu + M\big)$, we have
$|H(p,q)| \le L_1\big(1 + |p| + |q|\big).$
Secondly, for $p,\tilde p,q,\tilde q\in\mathbb R$,
$|H(p,q) - H(\tilde p,\tilde q)| \le \frac{2}{\sigma^2}\Big(\rho|p-\tilde p| + \mu|q-\tilde q| + \lambda|h(q) - h(\tilde q)|\Big) < \frac{2}{\sigma^2}\Big(\rho|p-\tilde p| + \mu|q-\tilde q| + M|q-\tilde q|\Big),$
where the second inequality comes from Lemma A2 (iv). Taking $L_2 := \frac{2}{\sigma^2}\max(\rho,\ \mu+M)$, we have
$|H(p,q) - H(\tilde p,\tilde q)| \le L_2\big(|p-\tilde p| + |q-\tilde q|\big).$
To check Condition 2, note that for all q R , H ( · , q ) is nonincreasing in p.
It remains to check Condition 3. For each $\bar K > 0$, choose $K_1, K_2 > \bar K$ such that
$K_1 \ge \max\left(\frac{(M+\mu)K_2 + \lambda\ln\lambda + \lambda c}{\rho},\ \frac{(M+\mu)K_2 - \lambda\ln\lambda + \lambda c}{\rho}\right),$ (A5)
where $c$ is a constant satisfying Lemma A2 (iii). Then for all $p\in\mathbb R$ and $\epsilon\in\{-1,1\}$,
$H\big(K_1 + K_2|p|,\ \epsilon K_2\big) = \frac{2}{\sigma^2}\Big(-\rho K_1 - \rho K_2|p| + \mu\epsilon K_2 + \lambda\ln\lambda + \lambda h(\epsilon K_2)\Big) \le \frac{2}{\sigma^2}\Big(-\rho K_1 + \mu K_2 + \lambda\ln\lambda + \lambda h(\epsilon K_2)\Big) < \frac{2}{\sigma^2}\Big(-\rho K_1 + \mu K_2 + \lambda\ln\lambda + M K_2 + \lambda c\Big) < 0,$
where the second-to-last inequality is due to Lemma A2 (iii) and the last inequality is due to (A5). Secondly,
$H\big(-K_1 - K_2|p|,\ \epsilon K_2\big) = \frac{2}{\sigma^2}\Big(\rho K_1 + \rho K_2|p| + \mu\epsilon K_2 + \lambda\ln\lambda + \lambda h(\epsilon K_2)\Big) \ge \frac{2}{\sigma^2}\Big(\rho K_1 - \mu K_2 + \lambda\ln\lambda + \lambda h(\epsilon K_2)\Big) > \frac{2}{\sigma^2}\Big(\rho K_1 - \mu K_2 + \lambda\ln\lambda - M K_2 - \lambda c\Big) > 0,$
where, again, the second-to-last inequality is due to Lemma A2 (iii) and the last inequality is due to (A5). Q.E.D.
Proof of Theorem 3 
Note that (19) can be rewritten as
$\rho V(x) = \frac{\sigma^2}{2}V''(x) + \mu V'(x) + \lambda h\big(V'(x)\big) + \lambda\ln\lambda,$ (A6)
where h is defined in (A4) with k = M / λ .
First, suppose $M < \lambda\ln(1/\lambda+1)$. Then $\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big) < 0$. According to (20), $\lim_{x\to\infty}V(x) < 0$. Define $x_0 := \inf\{x\ge0: V'(x+)\neq0\}$. Note that $V(x)$ is not a constant in this case and hence $V'(x)$ does not always equal 0, which implies that $x_0 < \infty$.
Assume that $V'(x_0+) > 0$. Since $V(x_0) = V(0) = 0$ and the limit of $V$ is negative, there must exist some interval on which $V(x)$ is decreasing in order to reach its negative limit, which means that there exists some point at which $V'(x)$ changes sign from positive to negative. Define this point as
$x_1 := \inf\{x > x_0: V'(x) = 0,\ V'(x+) < 0\}.$
Hence, $V''(x_1)\le0$. Then, according to (A6),
$\rho V(x_1) = \frac{\sigma^2}{2}V''(x_1) + \mu V'(x_1) + \lambda h\big(V'(x_1)\big) + \lambda\ln\lambda = \frac{\sigma^2}{2}V''(x_1) + \lambda\ln\big(e^{M/\lambda}-1\big) + \lambda\ln\lambda < 0,$
which implies that $V(x_1) < 0$. But this is a contradiction, because $V'(x)$ is non-negative on $[0,x_1]$, which leads to $V(x_1) > 0$.
Next, assume that $V'(x_0+) < 0$ and that there exists some point at which $V'(x) > 0$. Define $x_2$ as
$x_2 := \inf\{x > x_0: V'(x) = 0,\ V'(x+) > 0\}.$
Hence, $V''(x_2)\ge0$. According to (A6),
$\rho V(x_2) = \frac{\sigma^2}{2}V''(x_2) + \mu V'(x_2) + \lambda h\big(V'(x_2)\big) + \lambda\ln\lambda = \frac{\sigma^2}{2}V''(x_2) + \lambda\ln\big(e^{M/\lambda}-1\big) + \lambda\ln\lambda \ge \lambda\ln\big(e^{M/\lambda}-1\big) + \lambda\ln\lambda.$
Therefore,
$V(x_2) \ge \frac{\lambda\ln\lambda + \lambda\ln\big(e^{M/\lambda}-1\big)}{\rho} = \lim_{x\to\infty}V(x).$
Since $V'(x_2+) > 0$, $V(x)$ is strictly increasing in a local neighborhood after $x_2$. Then, after the point $x_2$, there should exist some interval on which $V(x)$ is strictly decreasing in order to achieve the limit. Define $x_3$ as
$x_3 := \inf\{x > x_2: V'(x) = 0,\ V'(x+) < 0\}.$
Hence, $V''(x_3)\le0$. Note that $V'(x)$ is strictly positive in a local neighborhood after $x_2$ and non-negative on $[x_2,x_3]$; thus $V(x_3) > V(x_2)$. Then, according to (A6),
$V''(x_2) = \frac{2}{\sigma^2}\Big(\rho V(x_2) - \mu V'(x_2) - \lambda h\big(V'(x_2)\big) - \lambda\ln\lambda\Big) < \frac{2}{\sigma^2}\Big(\rho V(x_3) - \mu V'(x_3) - \lambda h\big(V'(x_3)\big) - \lambda\ln\lambda\Big) = V''(x_3),$
which is a contradiction, since $V''(x_2)\ge0\ge V''(x_3)$. Therefore, $V'(x)\le0$ for all $x$, and $V(x)$ is non-increasing.
For the other two cases, the proof is similar. Q.E.D.
Proof of Corollary 1 
Because, according to Theorem 3, $V(x)$ is monotone and its limit (20) is finite, it is straightforward that $|V(x)|$ and $|V'(x)|$ are bounded. Q.E.D.
Proof of Proposition 3 
(a) Taking the first-order derivative of $d_1$, for $\lambda > 0$,
$d_1'(\lambda) = \ln\left(\frac1\lambda+1\right) + \lambda\cdot\frac{-\lambda^{-2}}{1/\lambda+1} = -\ln\frac{\lambda}{\lambda+1} + \frac{\lambda}{\lambda+1} - 1 = \omega\!\left(\frac{\lambda}{\lambda+1}\right),$
where $\omega(x) := -\ln x + (x-1)$, $x>0$. Since $\omega'(x) = -1/x + 1 < 0$ for $x\in(0,1)$, $\omega(x)$ is decreasing on $x\in(0,1)$. Therefore, $\omega(x) > \omega(1) = 0$ for $x\in(0,1)$, which shows that $d_1(\lambda)$ is increasing. By L'Hôpital's rule,
$\lim_{\lambda\to0}d_1(\lambda) = \lim_{\lambda\to0}\frac{-\lambda^{-2}/(1/\lambda+1)}{-\lambda^{-2}} = \lim_{\lambda\to0}\frac{1}{1/\lambda+1} = 0,\qquad \lim_{\lambda\to\infty}d_1(\lambda) = \lim_{\lambda\to\infty}\frac{1}{1/\lambda+1} = 1.$
(b) Note that for $\lambda>1$, $(\lambda+1)/\lambda < \lambda/(\lambda-1)$, and hence $d_1(\lambda) < d_2(\lambda)$; for $\lambda\in[0,1]$, $d_2(\lambda) = \infty > d_1(\lambda)$.
Taking the first-order derivative of $d_2$, for $\lambda > 1$,
$d_2'(\lambda) = \ln\frac{\lambda}{\lambda-1} + \lambda\cdot\frac{\lambda-1}{\lambda}\cdot\frac{(\lambda-1)-\lambda}{(\lambda-1)^2} = \ln\frac{\lambda}{\lambda-1} - \left(\frac{\lambda}{\lambda-1} - 1\right) = -\omega\!\left(\frac{\lambda}{\lambda-1}\right).$
Since $\omega'(x) = -1/x + 1 > 0$ for $x>1$, $\omega(x)$ is increasing on $x>1$. Therefore, $\omega(x) > \omega(1) = 0$ for $x>1$, which shows that $d_2(\lambda)$ is decreasing on $\lambda>1$. By L'Hôpital's rule,
$\lim_{\lambda\to1^+}d_2(\lambda) = \lim_{\lambda\to1^+}\frac{\frac{\lambda-1}{\lambda}\cdot\frac{-1}{(\lambda-1)^2}}{-1/\lambda^2} = \lim_{\lambda\to1^+}\frac{\lambda}{\lambda-1} = \infty,\qquad \lim_{\lambda\to\infty}d_2(\lambda) = \lim_{\lambda\to\infty}\frac{\lambda}{\lambda-1} = 1.$
Q.E.D.

References

  1. Wang, H.; Zariphopoulou, T.; Zhou, X.Y. Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach. Journal of Machine Learning Research 2020, 21, 1–34. [Google Scholar]
  2. Lundberg, F. Approximerad framställning af sannolikhetsfunktionen. Återförsäkring af kollektivrisker. Akademisk afhandling; Almqvist & Wiksells, 1903.
  3. De Finetti, B. Su un’impostazione alternativa della teoria collettiva del rischio. Transactions of the XVth international congress of Actuaries. New York, 1957, Vol. 2, pp. 433–443.
  4. Gerber, H.U. Entscheidungskriterien für den zusammengesetzten Poisson-Prozess. PhD thesis, ETH Zurich, 1969.
  5. Schmidli, H. Stochastic control in insurance; Springer Science & Business Media, 2007.
  6. Jeanblanc-Picqué, M.; Shiryaev, A.N. Optimization of the flow of dividends. Uspekhi Matematicheskikh Nauk 1995, 50, 25–46. [Google Scholar] [CrossRef]
  7. Asmussen, S.; Taksar, M. Controlled diffusion models for optimal dividend pay-out. Insurance: Mathematics and Economics 1997, 20, 1–15. [Google Scholar] [CrossRef]
  8. Højgaard, B.; Taksar, M. Controlling risk exposure and dividends payout schemes: insurance company example. Mathematical Finance 1999, 9, 153–182. [Google Scholar] [CrossRef]
  9. Asmussen, S.; Højgaard, B.; Taksar, M. Optimal risk control and dividend distribution policies. Example of excess-of loss reinsurance for an insurance corporation. Finance and Stochastics 2000, 4, 299–324. [Google Scholar] [CrossRef]
  10. Azcue, P.; Muler, N. Optimal reinsurance and dividend distribution policies in the Cramér-Lundberg model. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics 2005, 15, 261–308. [Google Scholar] [CrossRef]
  11. Azcue, P.; Muler, N. Optimal investment policy and dividend payment strategy in an insurance company. The Annals of Applied Probability 2010, 20, 1253–1302. [Google Scholar] [CrossRef]
  12. Gaier, J.; Grandits, P.; Schachermayer, W. Asymptotic ruin probabilities and optimal investment. The Annals of Applied Probability 2003, 13, 1054–1076. [Google Scholar] [CrossRef]
  13. Kulenko, N.; Schmidli, H. Optimal dividend strategies in a Cramér–Lundberg model with capital injections. Insurance: Mathematics and Economics 2008, 43, 270–278. [Google Scholar] [CrossRef]
  14. Yang, H.; Zhang, L. Optimal investment for insurer with jump-diffusion risk process. Insurance: Mathematics and Economics 2005, 37, 615–634. [Google Scholar] [CrossRef]
  15. Choulli, T.; Taksar, M.; Zhou, X.Y. A diffusion model for optimal dividend distribution for a company with constraints on risk control. SIAM Journal on Control and Optimization 2003, 41, 1946–1979. [Google Scholar] [CrossRef]
  16. Gerber, H.U.; Shiu, E.S. On optimal dividend strategies in the compound Poisson model. North American Actuarial Journal 2006, 10, 76–93. [Google Scholar] [CrossRef]
  17. Avram, F.; Palmowski, Z.; Pistorius, M.R. On the optimal dividend problem for a spectrally negative Lévy process. The Annals of Applied Probability 2007, 17, 156–180. [Google Scholar] [CrossRef]
  18. Yin, C.; Wen, Y. Optimal dividend problem with a terminal value for spectrally positive Levy processes. Insurance: Mathematics and Economics 2013, 53, 769–773. [Google Scholar] [CrossRef]
  19. Zhao, Y.; Kosorok, M.R.; Zeng, D. Reinforcement learning design for cancer clinical trials. Statistics in medicine 2009, 28, 3294–3315. [Google Scholar] [CrossRef] [PubMed]
  20. Komorowski, M.; Celi, L.A.; Badawi, O.; Gordon, A.C.; Faisal, A.A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine 2018, 24, 1716–1720. [Google Scholar] [CrossRef]
  21. Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A.J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.; Kavukcuoglu, K. ; others. Learning to navigate in complex environments. arXiv:1611.03673. [CrossRef]
  22. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364. [CrossRef]
  23. Radford, A.; Jozefowicz, R.; Sutskever, I. Learning to generate reviews and discovering sentiment. arXiv:1704.01444. [CrossRef]
  24. Paulus, R.; Xiong, C.; Socher, R. A deep reinforced model for abstractive summarization. arXiv:1705.04304. [CrossRef]
  25. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; others. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  26. Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; others. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865. [Google Scholar] [CrossRef] [PubMed]
  27. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; others. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  28. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; others. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  29. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning 2002, 47, 235–256. [Google Scholar] [CrossRef]
  30. Cesa-Bianchi, N.; Gentile, C.; Lugosi, G.; Neu, G. Boltzmann exploration done right. Advances in neural information processing systems 2017, 30. [Google Scholar]
  31. Todorov, E. Linearly-solvable Markov decision problems. Advances in neural information processing systems 2006, 19. [Google Scholar]
  32. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. ; others. Maximum entropy inverse reinforcement learning. Aaai. Chicago, IL, USA, 2008, Vol. 8, pp. 1433–1438.
  33. Nachum, O.; Norouzi, M.; Xu, K.; Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. Advances in neural information processing systems 2017, 30. [Google Scholar]
  34. Wang, H.; Zhou, X.Y. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance 2020, 30, 1273–1308. [Google Scholar] [CrossRef]
  35. Dai, M.; Dong, Y.; Jia, Y. Learning equilibrium mean-variance strategy. Mathematical Finance 2023, 33, 1166–1212. [Google Scholar] [CrossRef]
  36. Bai, L.; Gamage, T.; Ma, J.; Xie, P. Reinforcement Learning for optimal dividend problem under diffusion model. arXiv:math/2309.10242. [CrossRef]
  37. Tang, W.; Zhang, Y.P.; Zhou, X.Y. Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization 2022, 60, 3191–3216. [Google Scholar] [CrossRef]
  38. Gao, X.; Xu, Z.Q.; Zhou, X.Y. State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization 2022, 60, 1250–1268. [Google Scholar] [CrossRef]
  39. Strulovici, B.; Szydlowski, M. On the smoothness of value functions and the existence of optimal strategies in diffusion models. Journal of Economic Theory 2015, 159, 1016–1055. [Google Scholar] [CrossRef]
1. For example, the dividend paying rate under the threshold strategy is the maximal rate if the surplus exceeds the threshold; otherwise, it pays nothing. Since the threshold is determined by the model parameters, a change in the estimated parameters may dramatically change the dividend paying rate from zero to the maximal rate, or conversely.
2. We apply the shooting method, which adjusts the initial value of the first-order derivative so that the boundary conditions (14) and (20) are satisfied, and use the "ode45" function in Matlab to find the numerical solution to (19).
3. For each initial surplus $x$, we discretize continuous time into small steps ($\Delta t = 0.0005$) and sample 2000 independent surplus processes $X_t^{\pi^*}$ to simulate $Dv(x)$ and $Entr(x)$.
Figure 1. The classical value functions (top) and the optimal dividend-paying rate (bottom) for $\mu=1$, $\sigma=1$, $\rho=0.3$, and $M=0.6$ (left panels), $M=1.2$ (middle panels), $M=1.8$ (right panels), respectively.
Figure 2. Cases of value functions given $M$ and $\lambda$.
Figure 3. Let $\mu=1$, $\sigma=1$, $\rho=0.3$, $\lambda=1.5$, and $M=0.6$ (left column), $M=1.2$ (middle column), $M=1.8$ (right column), respectively. The figures in the top row show the value function $V(x)$, the expected total discounted dividends $Dv(x)$ and the expected total weighted discounted entropy $Entr(x)$. The figures in the middle row show the mean of the optimal distribution $\pi^*(\cdot\,;x)$. The figures in the bottom row show the density function of the optimal distribution for different surplus levels $x$.
Figure 4. The value function $V(x)$ given different values of $\lambda$, with $M=0.6$ (left) and $M=1.2$ (right).