3.2. Fusion in the Deep Learning-Based Recommendation System
We then integrate our Non-stationary Transformer architecture into the deep learning-based recommendation system framework, mainly focusing on refining the BST [7] algorithm. This integration involves replacing conventional transformer layers with our advanced Non-stationary Transformer modules to better capture temporal dynamics and distributional shifts in user behaviour sequences (see Figure 1).
Embedding Layer: The embedding layer initiates the adaptation process by transforming the multifaceted input data into compact, low-dimensional vector representations. The input data is categorised into three principal segments: (1) the core component comprises the user behaviour sequence, encapsulating the dynamic interplay between users and items over time; (2) auxiliary features encompass a broad spectrum of attributes, including user demographics, product specifications, and contextual information, enriching the model’s understanding beyond mere interaction patterns; and (3) the target item features primarily describe the characteristics of new or prospective items that are the subjects of prediction. Each of these segments undergoes a distinct embedding process, resulting in specialised embeddings that collectively form a comprehensive representation of the multifaceted input data within our model. This embedding strategy is crucial for capturing the nuanced relationships and attributes inherent in user behaviour sequences, auxiliary features, and target items. To preserve the sequential essence of user interactions, we employ positional features that assign temporal values based on the chronological distance between item interactions and the moment of recommendation.
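To make this embedding strategy concrete, the following PyTorch sketch shows one plausible way the three input segments and the positional (recency) feature could be embedded and concatenated. The module name BehaviourEmbedding, the vocabulary sizes, the embedding dimension, and the single linear projection for auxiliary features are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class BehaviourEmbedding(nn.Module):
    """Illustrative embedding layer: behaviour sequence, auxiliary features,
    target item, and a positional (recency) feature mapped to dense vectors."""

    def __init__(self, n_items=10_000, n_positions=50, d_model=64, n_aux=8):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_model)      # items in the behaviour sequence / target item
        self.pos_emb = nn.Embedding(n_positions, d_model)   # bucketised time gap to the recommendation moment
        self.aux_proj = nn.Linear(n_aux, d_model)            # dense auxiliary features (user/context/product)

    def forward(self, seq_items, seq_positions, target_item, aux_feats):
        # seq_items, seq_positions: (batch, seq_len) integer ids
        # target_item: (batch,) integer ids; aux_feats: (batch, n_aux) floats
        seq = self.item_emb(seq_items) + self.pos_emb(seq_positions)  # (batch, seq_len, d_model)
        target = self.item_emb(target_item).unsqueeze(1)               # (batch, 1, d_model)
        aux = self.aux_proj(aux_feats).unsqueeze(1)                    # (batch, 1, d_model)
        # Concatenate along the sequence axis so the transformer sees all segments.
        return torch.cat([seq, target, aux], dim=1)

# Toy usage with random ids/features.
emb = BehaviourEmbedding()
out = emb(torch.randint(0, 10_000, (2, 20)),
          torch.randint(0, 50, (2, 20)),
          torch.randint(0, 10_000, (2,)),
          torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 22, 64])
```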
Non-stationary Transformer Layer: We replace the BST algorithm’s standard transformer layers with our Non-stationary Transformer layers. This substitution improves the model’s ability to adapt to temporal variations and data distribution shifts, thereby enabling a deeper understanding of complex inter-item relationships and user interaction patterns within a dynamically changing context.
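The sketch below conveys the core idea of such a layer under simplifying assumptions: the input sequence is stationarised per instance, a small network (here tau_net) predicts a de-stationary scaling factor from the removed statistics, and the output is de-normalised so the original distribution can be recovered. In the full architecture the learned factors modulate the attention scores themselves; here, for brevity, the factor simply rescales the attention output, so this is an illustration rather than the exact layer we deploy.

```python
import torch
import torch.nn as nn

class NonStationaryLayer(nn.Module):
    """Simplified sketch of a non-stationary transformer block: per-instance
    stationarisation, attention re-weighted by a factor predicted from the
    removed statistics, and de-normalisation at the output."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Small MLP mapping the removed mean/std to a positive rescaling factor (tau).
        self.tau_net = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, 1))

    def forward(self, x):
        # Per-instance stationarisation over the sequence dimension.
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mu) / sigma

        # De-stationary factor learned from the statistics that were removed.
        tau = torch.exp(self.tau_net(torch.cat([mu, sigma], dim=-1)))  # (batch, 1, 1)

        h, _ = self.attn(x_norm, x_norm, x_norm)
        h = self.norm1(x_norm + tau * h)          # re-weight the attention output by tau
        h = self.norm2(h + self.ff(h))

        # De-normalisation restores the original scale and shift.
        return h * sigma + mu

layer = NonStationaryLayer()
print(layer(torch.randn(2, 22, 64)).shape)  # torch.Size([2, 22, 64])
```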
Multi-layer Perceptron Layers: The final part of our architecture comprises a series of MLP layers coupled with a customised loss function designed for the binary classification task of predicting user clicks or the multi-class classification task of predicting product scores. This final ensemble leverages the enriched feature set processed through the Non-stationary Transformer layers, facilitating precise and context-aware recommendations.
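A minimal sketch of such a prediction head is given below, assuming a flattened transformer output, illustrative hidden sizes, and the standard PyTorch losses (BCEWithLogitsLoss for clicks, CrossEntropyLoss for scores); the class name PredictionHead and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative MLP head: binary click prediction or multi-class score prediction."""

    def __init__(self, d_model=64, seq_len=22, n_classes=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),                                   # (batch, seq_len * d_model)
            nn.Linear(seq_len * d_model, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )
        # Binary cross-entropy for clicks (n_classes=1); cross-entropy for scores (n_classes>1).
        self.loss_fn = nn.BCEWithLogitsLoss() if n_classes == 1 else nn.CrossEntropyLoss()

    def forward(self, h, labels=None):
        logits = self.mlp(h)
        if labels is None:
            return logits
        if isinstance(self.loss_fn, nn.BCEWithLogitsLoss):
            return self.loss_fn(logits.squeeze(-1), labels.float())
        return self.loss_fn(logits, labels)

head = PredictionHead()
loss = head(torch.randn(2, 22, 64), torch.tensor([1, 0]))
print(loss.item())
```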
By integrating the Non-stationary Transformer into the structure of the BST algorithm, our approach retains the original model’s capability to process user behaviour sequences while significantly enhancing its adaptability and the accuracy of user-interaction prediction. This novel integration represents a significant improvement in deep learning-based recommendation systems, promising superior performance in navigating the complexities of dynamic user behaviour patterns.
3.3. Fusion in the Reinforcement Learning-Based Recommendation System
We embed our Non-stationary Transformer architecture into the core of reinforcement learning-based recommendation systems, specifically choosing the DDQN, DDPG, and SAC frameworks for integration. These frameworks are classic models within the field of reinforcement learning and represent different branches of the discipline, which showcases the versatility of our Non-stationary Transformer. By leveraging this architecture, we aim to enhance the models’ predictive capabilities and robustness, particularly through its superior handling of non-stationary data. With their distinct mechanisms and strengths, DDQN, DDPG, and SAC provide a broad and comprehensive testing ground for demonstrating our approach’s enhanced adaptability and performance across various reinforcement learning scenarios.
Integration with DDQN: Integrating the Non-stationary Transformer within the DDQN framework substantially augments the model’s precision in value estimation and policy optimisation (Figure 2). DDQN [9], an extension of DQN [29], introduces a critical improvement by decoupling the selection and evaluation of the action in the Q-value update equation, thereby mitigating overestimation. The standard DQN update equation [45] is given by:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $s_t$ and $a_t$ are the state and action at time $t$, $r_{t+1}$ is the reward received after taking action $a_t$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor. In DDQN, this is modified to:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q'\!\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a)\right) - Q(s_t, a_t) \right]$$

where $Q'$ represents the action-value function estimated by the target Q-network. By embedding the Non-stationary Transformer into both the Q-network and the target Q-network, we introduce a dual mechanism that significantly boosts the model’s ability to process and predict sequential datasets. This is particularly relevant for recommendation systems where the goal is to sequentially recommend products on a page, with each recommendation considered an action. The DDQN, enhanced with our transformer, aims to maximise the overall layout’s utility, striving for the highest possible number of clicks or transactions through strategic product recommendations based on historical user-item interactions, product characteristics, and user profiles.
This enhanced approach allows the DDQN to more accurately anticipate the cumulative rewards associated with different action sequences, optimising the selection of items to present to the consumer at each step. The Non-stationary Transformer’s integration further empowers the DDQN to handle the temporal dynamics and non-stationary nature of recommendation system data, ensuring enhanced performance in environments characterised by rapidly evolving user preferences and interaction patterns.
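The decoupled selection/evaluation step that underlies this update can be sketched as follows; the helper ddqn_targets, the toy feed-forward Q-networks, and the tensor shapes are illustrative stand-ins for the transformer-based networks described above.

```python
import torch
import torch.nn as nn

def ddqn_targets(q_net, target_q_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: the online network selects the next action,
    the target network evaluates it (mitigating overestimation)."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)          # selection
        next_q = target_q_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q

# Toy networks standing in for the transformer-based Q-networks over state sequences.
q_net = nn.Sequential(nn.Flatten(), nn.Linear(20 * 8, 64), nn.ReLU(), nn.Linear(64, 10))
target_q_net = nn.Sequential(nn.Flatten(), nn.Linear(20 * 8, 64), nn.ReLU(), nn.Linear(64, 10))

states = torch.randn(4, 20, 8)           # batch of behaviour sequences (the state)
actions = torch.randint(0, 10, (4, 1))   # recommended item per step
rewards, dones = torch.rand(4), torch.zeros(4)

targets = ddqn_targets(q_net, target_q_net, rewards, states, dones)
q_sa = q_net(states).gather(1, actions).squeeze(1)
loss = nn.functional.mse_loss(q_sa, targets)
print(loss.item())
```

In practice, the target network parameters would be periodically or softly synchronised with the online network, as in standard DDQN training loops.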
Integration with DDPG: The augmentation of the DDPG [11] framework with the Non-stationary Transformer involves strategically embedding this architecture into both the actor-network and the critic-network (Figure 3). This integration significantly enhances the model’s capacity to interpret and respond to the complex, sequential nature of recommendation tasks.
In the Actor Network, the Non-stationary Transformer’s integration facilitates a more nuanced understanding of the current state, enabling the network to propose actions (e.g., product recommendations) that are not only optimal based on current knowledge but also adaptive to evolving user preferences and behaviours. The transformer’s ability to process temporal sequences and adapt to data shifts allows the actor-network to make more informed decisions, especially in scenarios where user interactions with items change dynamically. Most importantly, it recovers the original non-stationary dataset distribution so that it is reflected in the decision-making of the recommendation system. For the Non-stationary Transformer integration within the DDPG framework, the actor-network update is defined by the following equation:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\!\left[\, \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t \mid \theta^{\mu})}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_t} \,\right] \tag{9}$$

where $\nabla_{\theta^{\mu}} J$ represents the gradient of the objective function $J$ with respect to the actor parameters $\theta^{\mu}$. This gradient is estimated as the expected value of the product of the gradient of the action-value function $Q$ with respect to the action $a$, evaluated at the current state $s_t$ and the action proposed by the current policy $\mu(s_t \mid \theta^{\mu})$, and the gradient of the policy with respect to its parameters $\theta^{\mu}$, which are the parameters of the Non-stationary Transformer-integrated DDPG’s actor-network. Equation (9) allows the actor-network to learn optimal policies over time by ascending the gradient of the performance criterion.
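In code, this ascent is typically implemented by minimising the negative critic value of the actor’s proposed action, as in the sketch below; the small feed-forward actor and critic, the dimensions, and the learning rate are placeholders for the transformer-based networks.

```python
import torch
import torch.nn as nn

# Toy actor/critic standing in for the transformer-based networks; dimensions are illustrative.
state_dim, action_dim = 16, 4
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

states = torch.randn(32, state_dim)  # batch sampled from the replay buffer

# Deterministic policy gradient: ascend Q(s, mu(s)) with respect to the actor parameters,
# implemented as minimising the negative critic value.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=-1)).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
print(actor_loss.item())
```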
For the Critic Network, including the Non-stationary Transformer empowers the network to more accurately estimate the future rewards associated with the actions proposed by the actor-network. This is particularly critical in recommendation systems, where the value of an action (e.g., the likelihood of a user clicking on a recommended item) can vary significantly over time. By capturing the temporal dynamics and distributional shifts in user-item interaction data, the critic-network can provide more reliable feedback to the actor-network, leading to continuous policy refinement. In the critic network of the Non-stationary Transformer-integrated DDPG model, the loss function $L$ used for training is defined as follows:

$$L = \mathbb{E}\!\left[\left(r + \gamma\, Q'\!\left(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'}\right) - Q(s, a \mid \theta^{Q})\right)^{2}\right]$$

where $r$ is the reward received after executing action $a$ in state $s$, and $\gamma$ is the discount factor that weighs the importance of future rewards. The term $Q'\!\left(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$ is the action-value predicted for the next state $s'$ by the target policy $\mu'$ and the target critic network, parameterised by $\theta^{\mu'}$ and $\theta^{Q'}$. Meanwhile, $\theta^{Q}$ denotes the parameters of the critic-network.
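A minimal sketch of this critic update, assuming toy feed-forward networks in place of the transformer-based ones and an illustrative replay-buffer batch, is shown below.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 16, 4, 0.99
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
target_critic, target_actor = copy.deepcopy(critic), copy.deepcopy(actor)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A toy replay-buffer batch.
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
r, s_next, done = torch.rand(32, 1), torch.randn(32, state_dim), torch.zeros(32, 1)

# TD target: r + gamma * Q'(s', mu'(s')), computed with the *target* networks.
with torch.no_grad():
    a_next = target_actor(s_next)
    y = r + gamma * (1.0 - done) * target_critic(torch.cat([s_next, a_next], dim=-1))

critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
print(critic_loss.item())
```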
Incorporating the Non-stationary Transformer into DDPG preserves the algorithm’s advantages of smoother policy updates and reduced variance in policy evaluation while significantly enhancing the model’s robustness and adaptability. By leveraging the transformer’s ability to process non-stationary data, our adapted DDPG framework exhibits superior performance in capturing the dynamic user-item interactions and evolving preferences that are crucial for making precise and contextually relevant recommendations.
Integration with SAC: The SAC [39] algorithm, known for its stability and efficiency in continuous action spaces, employs an entropy-augmented reinforcement learning strategy that encourages exploration by maximising a trade-off between expected return and entropy. Integrating the Non-stationary Transformer into SAC involves embedding this advanced architecture into both the actor-networks and the critic-networks (Figure 4). This enhances their capability to process sequential decision-making tasks by capturing the complex dependencies in user-item interactions. The transformer’s ability to handle temporal dynamics and non-stationary data significantly improves the policy’s adaptability and the precision of action selection in dynamic recommendation environments. The core of enhanced SAC consists of two actor-networks $\pi_{\phi}$ and $\pi_{\phi'}$, and four critic networks $Q_{\theta_1}$, $Q_{\theta_2}$, $Q_{\theta'_1}$, and $Q_{\theta'_2}$, where $\phi$ and $\theta$ denote the parameters of the actor and critic networks, respectively. Including the Non-stationary Transformer in the four critic networks enables a more nuanced valuation of the state-action pairs, considering the evolving nature of user preferences and item attributes. The objective function for the actor-network in enhanced SAC with the Non-stationary Transformer is given by:

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[\, \mathbb{E}_{a_t \sim \pi_{\phi}}\!\left[ \alpha \log \pi_{\phi}(a_t \mid s_t) - Q_{\theta}(s_t, a_t) \right] \right]$$

where $\mathcal{D}$ is the experience replay buffer, $\alpha$ is the temperature parameter that determines the importance of the entropy term, and $s_t$ and $a_t$ represent the state and action at time $t$, respectively. This comprehensive understanding of the data’s temporal and non-stationary aspects allows for a more accurate estimation of expected returns, facilitating more effective policy updates. The SAC equipped with the Non-stationary Transformer sets a new benchmark for reinforcement learning-based recommendation systems, particularly in handling the complexities of sequential decision-making and adapting to the dynamic nature of recommendation tasks.
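The following sketch illustrates this entropy-regularised actor objective with a tanh-squashed Gaussian policy and twin critics; the network shapes, the fixed temperature alpha, and the clipping constants are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 16, 4, 0.2

# Toy Gaussian policy and twin critics standing in for the transformer-based networks.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)  # batch from the experience replay buffer D

# Reparameterised sampling from a tanh-squashed Gaussian policy.
mean, log_std = policy(states).chunk(2, dim=-1)
std = log_std.clamp(-5, 2).exp()
dist = torch.distributions.Normal(mean, std)
pre_tanh = dist.rsample()
actions = torch.tanh(pre_tanh)
# Log-probability with the tanh change-of-variables correction.
log_prob = (dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)).sum(dim=-1, keepdim=True)

# Actor objective: maximise the entropy-regularised value, i.e. minimise alpha*log_pi - min(Q1, Q2).
q_min = torch.min(q1(torch.cat([states, actions], dim=-1)),
                  q2(torch.cat([states, actions], dim=-1)))
actor_loss = (alpha * log_prob - q_min).mean()

policy_opt.zero_grad()
actor_loss.backward()
policy_opt.step()
print(actor_loss.item())
```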
Through these strategic fusions, our Non-stationary Transformer improves the predictive accuracy of reinforcement learning models in recommendation systems and achieves a level of adaptability and robustness previously unattainable with conventional transformer architectures. This innovative approach promises to redefine the standards of reinforcement learning-based recommendation systems, accommodating the complex and dynamic nature of real-world user behaviour and preferences.