To comprehensively cover the topics addressed in this manuscript, this section presents a concise introduction to Spiking Neural Networks (SNNs), Long Short-Term Memory (LSTM) networks, Quantum Neural Networks (QNNs), and Deep Reinforcement Learning (DRL), which constitute the principal subjects of this research.
2.1. Spiking Neural Networks
SNNs draw inspiration from the neural communication patterns observed in the brain, resembling the encoding and retention processes of working memory or short-term memory in the prefrontal cortex, along with the application of the Hebbian plasticity principle.
The Hebbian theory is a neuroscientific concept that describes a fundamental mechanism of synaptic plasticity. According to this theory, the strength of a synaptic connection increases when neurons on both sides of the synapse are repeatedly activated simultaneously. Introduced by Donald Hebb in 1949, it is known by various names, including Hebb's rule, Hebbian learning postulate, or Cell Assembly Theory. The theory suggests that the persistence of repetitive activity or a signal tends to induce long-lasting cellular changes that enhance synaptic stability. When two cells or systems of cells are consistently active at the same time, they tend to become associated, facilitating each other's activity. This association leads to the development of synaptic terminals on the axon of the first cell in contact with the soma of the second cell, as depicted in Figure 2 [36].
Despite DNNs being historically inspired by the brain, there exist fundamental differences in their structure, neural processing, and learning mechanisms when compared to biological brains. One of the most significant distinctions lies in how information is transmitted between their units. This observation has led to the emergence of spiking neural networks (SNNs). In the brain, neurons communicate by transmitting sequences of potentials or spike trains to downstream neurons. These individual spikes are temporally sparse, imbuing each spike with significant information content. Consequently, SNNs convey information through spike timing, encompassing both latencies and spike rates. In a biological neuron, a spike occurs when the cumulative changes in membrane potential, induced by pre-synaptic stimuli, surpass a threshold. The rate of spike generation and the temporal pattern of spike trains carry information about external stimuli and ongoing computations.
ANNs communicate using continuous-valued activations; SNNs, by contrast, can be more efficient because of the temporal sparsity of spike events, as elaborated below. Additionally, SNNs possess the advantage of being inherently attuned to the temporal dynamics of information transmission observed in biological neural systems. The precise timing of every spike is highly reliable across various brain regions, indicating a pivotal role in neural encoding, particularly in sensory information-processing areas and neural-motor regions [37,38].
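As a concrete illustration of rate coding, one of the spike-based encodings just mentioned, the following minimal sketch (purely illustrative; the intensity values and window length are arbitrary) converts normalized input intensities into Bernoulli spike trains whose average firing rates track those intensities:

```python
import torch

# Rate coding sketch: each input intensity in [0, 1] is treated as the
# probability of emitting a spike at every time step (Bernoulli trials).
# Higher intensities therefore produce denser spike trains.
torch.manual_seed(0)

num_steps = 25                              # length of the simulation window (arbitrary)
intensity = torch.tensor([0.1, 0.5, 0.9])   # example normalized inputs

# One Bernoulli draw per time step and per input channel.
spike_train = torch.bernoulli(intensity.repeat(num_steps, 1))

print(spike_train.shape)        # (num_steps, num_inputs)
print(spike_train.mean(dim=0))  # empirical rates, close to the chosen intensities
```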
SNNs find utility across various domains of pattern recognition, including visual processing, speech recognition, and medical diagnosis. Deep SNNs represent promising avenues for exploring neural computation and diverse coding strategies within the brain. The training of deep spiking neural networks is still in its nascent stages, and important open questions remain around their training, such as enabling online learning while mitigating catastrophic forgetting [39].
Spiking neurons operate by summing weighted inputs. Instead of passing this result through a sigmoid or ReLU non-linearity, the weighted sum contributes to the membrane potential $U(t)$ of the neuron. If the neuron becomes sufficiently excited by this weighted sum and the membrane potential reaches a threshold $\theta$, it will emit a spike to its downstream connections. However, most neuronal inputs consist of brief bursts of electrical activity known as spikes. It is highly unlikely for all input spikes to arrive at the neuron body simultaneously, which suggests the existence of temporal dynamics that maintain the membrane potential over time.
Louis Lapicque [40] observed that a spiking neuron can be analogously likened to a low-pass filter circuit, comprising a resistor (R) and a capacitor (C). This concept is referred to as the leaky integrate-and-fire (LIF) neuron model, and it remains valid to this day. Physiologically, the neuron's capacitance arises from the insulating lipid bilayer constituting its membrane, while the resistance is a consequence of gated ion channels regulating the flow of charged particles across the membrane (see Figure 1b).
The characteristics of this passive membrane can be elucidated through an RC circuit, in accordance with Ohm's Law. This law asserts that the potential across the membrane, measured between the input and output of the neuron, is proportional to the current passing through the conductor [37].
The behavior of the passive membrane, which is simulated using an RC circuit, can be depicted as follows:

$$\tau \frac{dU(t)}{dt} = -U(t) + I_{\mathrm{in}}(t)R,$$

where $\tau = RC$, representing the time constant of the circuit. Following the Euler method, without taking the limit $\Delta t \rightarrow 0$:

$$\tau \frac{U(t+\Delta t) - U(t)}{\Delta t} = -U(t) + I_{\mathrm{in}}(t)R.$$

Extracting the membrane potential in the subsequent step:

$$U(t+\Delta t) = U(t) + \frac{\Delta t}{\tau}\bigl(-U(t) + I_{\mathrm{in}}(t)R\bigr).$$

To isolate the dynamics of the leaky membrane potential, let's assume there is no input current, $I_{\mathrm{in}}(t) = 0$:

$$U(t+\Delta t) = \Bigl(1 - \frac{\Delta t}{\tau}\Bigr)U(t).$$

The parameter $\beta$ represents the decay rate of the membrane potential, also referred to as the inverse of the time constant. Based on the preceding equation, it follows that $\beta = 1 - \Delta t/\tau$.
Figure 2. The standard structure of a neuron includes a cell body, also known as soma, housing the nucleus and other organelles; dendrites, which are slender, branched extensions that receive synaptic input from neighboring neurons; an axon; and synaptic terminals.
Let's assume that time $t$ is discretised into consecutive time steps, such that $\Delta t = 1$. To further minimize the number of hyperparameters, let's assume $R = 1$. Then

$$U[t+1] = \beta U[t] + (1 - \beta) I_{\mathrm{in}}[t+1].$$

When dealing with a constant current input, the solution to this can be obtained as

$$U(t) = I_{\mathrm{in}}R + \bigl(U_0 - I_{\mathrm{in}}R\bigr)e^{-t/\tau}.$$
This demonstrates the exponential relaxation of $U(t)$ towards a steady-state value following current injection, with $U_0$ representing the initial membrane potential at $t = 0$. If we compute Eq. (8) at discrete intervals of $t, (t+\Delta t), (t+2\Delta t), \ldots$, then we can determine the ratio of the membrane potential between two consecutive steps using:

$$\beta = \frac{U(t+\Delta t)}{U(t)} = \frac{U(t+2\Delta t)}{U(t+\Delta t)} = \cdots = e^{-\Delta t/\tau}.$$

This equation for $\beta$ offers greater precision compared to $\beta = 1 - \Delta t/\tau$, which is accurate only under the condition that $\Delta t \ll \tau$.
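For intuition, a small numerical check (not part of the original derivation; the values of $\tau$ and $\Delta t$ are arbitrary) comparing the exact decay factor $e^{-\Delta t/\tau}$ with the Euler approximation $1 - \Delta t/\tau$:

```python
import math

# Compare the exact decay factor with its forward-Euler approximation.
# The two agree when the time step is much smaller than the time constant.
tau = 10.0  # membrane time constant (arbitrary units)
for dt in (0.1, 1.0, 5.0):
    beta_exact = math.exp(-dt / tau)
    beta_euler = 1.0 - dt / tau
    print(f"dt/tau = {dt / tau:.2f}  exact = {beta_exact:.4f}  Euler = {beta_euler:.4f}")
```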
Another non-physiological assumption is introduced, wherein the effect of $(1-\beta)$ is assimilated into a learnable weight $W$ (in deep learning, the weight assigned to an input is typically a parameter that can be learned):

$$WX[t] = I_{\mathrm{in}}[t],$$

where $X[t]$ represents an input voltage, spike, or unweighted current, which is scaled by the synaptic conductance $W$ to produce a current injection to the neuron. This generates the following outcome:

$$U[t+1] = \beta U[t] + WX[t+1].$$

By decoupling the effects of $W$ and $\beta$, simplicity is prioritized over biological precision. Lastly, a reset function is added, which is triggered each time an output spike occurs:

$$U[t+1] = \beta U[t] + WX[t+1] - S_{\mathrm{out}}[t]\,\theta,$$

where $S_{\mathrm{out}}[t] \in \{0,1\}$ is the output spike: 1 in case of activation and 0 otherwise. In the first scenario, the reset term subtracts the threshold $\theta$ from the membrane potential, whereas in the second scenario, the reset term has no impact.
A spike occurs when the membrane potential exceeds the threshold:

$$S_{\mathrm{out}}[t] = \Theta\bigl(U[t] - \theta\bigr) =
\begin{cases}
1, & \text{if } U[t] > \theta,\\
0, & \text{otherwise,}
\end{cases}$$

where $\Theta(\cdot)$ denotes the Heaviside step function.
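A minimal discrete-time sketch of the resulting leaky integrate-and-fire update, combining decay, weighted input, thresholded spiking, and reset-by-subtraction (the values of $\beta$, $W$, and $\theta$ are illustrative, not taken from the experiments):

```python
import torch

# Discrete-time LIF neuron: U[t+1] = beta * U[t] + W * X[t+1] - S[t] * theta
beta, W, theta = 0.9, 0.5, 1.0  # decay rate, input weight, firing threshold (illustrative)
num_steps = 20

U = torch.zeros(1)         # membrane potential
S = torch.zeros(1)         # output spike of the previous step
X = torch.ones(num_steps)  # constant unweighted input

for t in range(num_steps):
    U = beta * U + W * X[t] - S * theta  # leak, integrate, and reset by subtraction
    S = (U > theta).float()              # Heaviside: spike when the threshold is crossed
    print(f"t={t:2d}  U={U.item():.3f}  spike={int(S.item())}")
```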
Various techniques exist for training SNNs [37], with one of the more commonly utilized approaches being backpropagation using spikes, also known as backpropagation through time (BPTT). Starting from the final output of the network and moving backwards, the gradient propagates from the loss to all preceding layers. The objective is to train the network utilizing the gradient of the loss function with respect to the weights, thus updating the weights to minimize the loss. The backpropagation algorithm accomplishes this by utilizing the chain rule:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial S}\,\frac{\partial S}{\partial U}\,\frac{\partial U}{\partial W}.$$

Nevertheless, the derivative of the Heaviside step function from (13) is the Dirac delta function, which equates to 0 everywhere except at the threshold $\theta$, where it tends to infinity. Consequently, the gradient is almost always nullified to zero (or saturated if $U$ precisely sits at the threshold), rendering learning ineffective. This phenomenon is referred to as the dead neuron problem. The common approach to address the dead neuron problem involves preserving the Heaviside function during the forward pass, but substituting it with a continuous function, $\tilde{S}$, during the backward pass. The derivative of this continuous function, $\partial \tilde{S}/\partial U$, is then employed as a substitute for the Heaviside function's derivative and is termed the surrogate gradient. In this manuscript, we utilize the snntorch library, which defaults to using the arctangent function [37].
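To illustrate the surrogate-gradient idea in isolation (a generic sketch of the technique, not the exact internals of snntorch), the following custom autograd function keeps the Heaviside step in the forward pass and substitutes the derivative of an arctangent-shaped function in the backward pass; the threshold and slope values are arbitrary:

```python
import math
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside step in the forward pass, arctangent surrogate in the backward pass."""

    threshold = 1.0  # firing threshold (illustrative)
    alpha = 2.0      # slope of the surrogate around the threshold (illustrative)

    @staticmethod
    def forward(ctx, mem):
        ctx.save_for_backward(mem)
        return (mem > SurrogateSpike.threshold).float()  # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (mem,) = ctx.saved_tensors
        a = SurrogateSpike.alpha
        # Derivative of (1/pi) * arctan(pi * a * (U - theta)), used in place of the Dirac delta.
        surrogate = a / (1.0 + (math.pi * a * (mem - SurrogateSpike.threshold)) ** 2)
        return grad_output * surrogate

# Gradients now flow through the spike despite the step non-linearity.
mem = torch.tensor([0.8, 1.2], requires_grad=True)
spikes = SurrogateSpike.apply(mem)
spikes.sum().backward()
print(spikes.detach(), mem.grad)
```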
The structure of QSNNs follows a hybrid architecture formed by classical linear layers and a Variational Quantum Circuit (VQC) implementing the QLIF neuron (Quantum Leaky Integrate-and-Fire neuron), trained using a gradient descent method. Figure 3 shows the general pipeline of this model for a classification task, and the detailed architecture is defined in Section 3.
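For orientation only, a generic hybrid layer of this kind, classical linear layers wrapped around a variational quantum circuit, could be sketched with PennyLane and PyTorch as follows; this is not the QLIF architecture of Section 3, and the qubit count, embedding, and entangling templates are arbitrary illustrative choices:

```python
import torch
import pennylane as qml

n_qubits, n_layers = 4, 2  # illustrative circuit size
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def vqc(inputs, weights):
    # Encode classical features as rotation angles, then apply a trainable entangling ansatz.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

# Wrap the circuit as a Torch layer and compose it with classical linear layers.
quantum_layer = qml.qnn.TorchLayer(vqc, weight_shapes={"weights": (n_layers, n_qubits)})
model = torch.nn.Sequential(
    torch.nn.Linear(8, n_qubits),  # classical pre-processing
    quantum_layer,                 # variational quantum circuit
    torch.nn.Linear(n_qubits, 2),  # classical read-out for a 2-class task
)

print(model(torch.rand(5, 8)).shape)  # (5, 2)
```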
Previous works inspired by the emulation of brain functionality include the use of SNNs and Hyperdimensional Computing (HDC) for the MNIST classification problem [41], as well as the decoding and understanding of muscle activity and kinematics from electroencephalography signals [42]. Other works have explored the application of Reinforcement Learning for navigation in dynamic and unfamiliar environments, supporting neuroscience-based theories that consider grid cells as crucial for vector-based navigation [43].
2.2. Long Short-Term Memory
Long Short-Term Memory (LSTM) networks belong to a class of recurrent neural networks (RNNs) that have the ability to learn order dependence in sequence prediction problems. These networks are crafted to address the challenges encountered in training RNNs. Back-propagated gradients often exhibit substantial growth or decay over time due to their dependency not only on the current error but also on past errors; the accumulation of these errors impedes the memorization of long-term dependencies. Consequently, LSTM networks are employed to tackle these issues: they incorporate a series of mechanisms to determine which information should be retained and which should be discarded [44]. Furthermore, standard RNNs have a limited ability to access contextual information in practice. The impact of a specific input on the hidden layer and, consequently, on the network output, diminishes or amplifies exponentially as it circulates through the recurrent connections of the network. This phenomenon is known as the vanishing gradient problem, which represents the second challenge to overcome using LSTM [45,46].
This type of model is essential in complex problem domains such as machine translation, speech recognition, and time-series analysis, among others.
These networks consist of LSTM modules, which are a specialized type of recurrent neural network introduced in 1997 by Hochreiter and Schmidhuber [47]. They contain three internal gates, known as the input, forget, and output gates, detailed in Figure 4.
These gates act as filters, and each of them has its own neural network. At a given moment, the output of an LSTM relies on three factors:
-
Cell state: the network's current long-term memory.
-
Hidden state: the output from the preceding time step.
-
Input data: the input data at the present time step.
The internal gates mentioned above can be described as follows [48]:
-
Forget Gate: This gate decides which information from the cell state is important, considering both the previous hidden state and the new input data. The neural network that implements this gate is built to produce an output closer to 0 when the input data is considered unimportant, and closer to 1 otherwise. To achieve this, we employ the sigmoid activation function. The output values from this gate are then passed upwards and undergo pointwise multiplication with the previous cell state. This pointwise multiplication implies that components of the cell state identified as insignificant by the forget gate network will be multiplied by a value approaching 0, resulting in reduced influence on subsequent steps.
To summarize, the forget gate determines which portions of the long-term memory should be disregarded (given less weight) based on the previous hidden state and the new input data.
-
Input gate: Determines the integration of new information into the network’s long-term memory (cell state), considering the prior hidden state and incoming data. The same inputs are utilized, but now with the introduction of a hyperbolic tangent as the activation function. This hyperbolic tangent has learned to blend the previous hidden state with the incoming data, resulting in a newly updated memory vector. Essentially, this vector encapsulates information from the new input data within the context of the previous hidden state. It informs us about the extent to which each component of the network’s long-term memory (cell state) should be updated based on the new information.
It should be noted that the utilization of the hyperbolic tangent function in this context is deliberate, owing to its output range confined to [-1,1]. The inclusion of negative values is imperative for this methodology, as it facilitates the attenuation of the impact associated with specific components.
-
Output gate: The objective of this gate is to decide the new hidden state by incorporating the newly updated cell state, the prior hidden state, and the new input data. This hidden state has to contain the necessary information while avoiding the inclusion of all learned data. To achieve this, we employ the sigmoid function.
This architecture is replicated for each time step considered in the prediction. The final layer of this model is a linear layer responsible for converting the hidden state into the final prediction.
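The gate behaviour described above corresponds to the standard LSTM cell equations, which can be written out explicitly in a short sketch (a manual single-step implementation rather than torch.nn.LSTM; dimensions are illustrative):

```python
import torch

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: x is the input, h_prev/c_prev the previous hidden and cell states.
    W, U, b hold the stacked parameters of the forget, input, candidate, and output gates."""
    gates = x @ W + h_prev @ U + b
    f, i, g, o = gates.chunk(4, dim=-1)
    f = torch.sigmoid(f)     # forget gate: what to keep from the old cell state
    i = torch.sigmoid(i)     # input gate: how much new information to write
    g = torch.tanh(g)        # candidate memory, in [-1, 1]
    o = torch.sigmoid(o)     # output gate: what part of the cell state to expose
    c = f * c_prev + i * g   # updated long-term memory (cell state)
    h = o * torch.tanh(c)    # updated hidden state
    return h, c

# Illustrative dimensions: 8 input features, 16 hidden units.
n_in, n_hid = 8, 16
W = torch.randn(n_in, 4 * n_hid)
U = torch.randn(n_hid, 4 * n_hid)
b = torch.zeros(4 * n_hid)
h, c = lstm_cell_step(torch.randn(1, n_in), torch.zeros(1, n_hid), torch.zeros(1, n_hid), W, U, b)
print(h.shape, c.shape)  # torch.Size([1, 16]) twice
```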
The quantum counterpart of this neural network is constructed with a VQC model for each gate, as shown in Figure 9.
Finally, we summarise the LSTM implementation steps as follows:
2.3. Deep Reinforcement Learning
Reinforcement learning (RL) [49] is a branch of machine learning inspired by behavioral psychology. In RL, an entity known as the agent adjusts its behavior based on the rewards and penalties it receives from interacting with an unknown environment. RL serves as the foundational framework for elucidating how autonomous intelligent agents acquire the ability to navigate unfamiliar environments and optimize cumulative rewards through decision-making. Deep Reinforcement Learning (DRL) combines traditional RL algorithms with neural networks. The general schema of DRL is illustrated in Figure 5: when an agent interacts with an environment, it has no knowledge of the environment's state except for the observations it receives. At time $t$, the observation of the environment is denoted as $o_t$. The agent then selects an action $a_t$ from the set of available actions and executes it in the environment. Subsequently, the environment transitions to a new state and provides the agent with the new observation $o_{t+1}$ and a reward $r_{t+1}$. The reward indicates the quality of the action taken by the agent and is utilized to improve its performance in subsequent interactions.
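This interaction loop can be made concrete with a few lines using the Gymnasium API (a random policy on a standard benchmark environment, chosen purely for illustration):

```python
import gymnasium as gym

# Minimal agent-environment loop: observe, act, receive a reward and the next observation.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated      # episode ends on either condition

env.close()
print(f"episode return: {total_reward}")
```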
This sequential process is described using a Markov Decision Process (MDP), which consists of the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the sets of states and actions, respectively. $P$ denotes the probability of state transition, defined as $P(s_{t+1} \mid s_t, a_t)$, indicating the likelihood of transitioning from state $s_t$ at time $t$ to state $s_{t+1}$ at time $t+1$ when action $a_t$ is taken at time $t$. Additionally, $R(s_t, a_t, s_{t+1})$ represents the reward function associated with executing action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$.
The agent's goal is to maximize its cumulative reward through a series of interactions with the environment, beginning at time $t = 0$. This cumulative reward, referred to as the return and defined in Equation (21), is influenced by the hyperparameter $\gamma$, which determines the relative significance of recent versus past rewards in the learning process. $\gamma$ is commonly known as the discount factor. To maximize the return, the agent must acquire knowledge about the optimal action $a$ to take in a given state $s$, known as the policy $\pi(a \mid s)$. This policy function characterizes the agent's behavior within the environment, providing the probability of selecting action $a$ in state $s$. In RL we consider two key functions: the value of a state–action pair, $Q(s, a)$ (as defined in Equation (22), representing the expected return obtained from starting at state $s$ and taking action $a$), and the value of a state, $V(s)$ (as defined in Equation (23), representing the expected return obtained from starting at state $s$). Additionally, another relevant concept is the advantage of a state–action pair, $A(s, a)$ (as defined in Equation (24)), which quantifies the benefit of selecting action $a$ in state $s$ compared to the other available actions in the same state.
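For reference, these quantities take the following standard textbook forms [49] (the manuscript's Equations (21)-(24) may differ slightly in notation):

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\bigl[G_t \mid s_t = s,\ a_t = a\bigr],$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\bigl[G_t \mid s_t = s\bigr], \qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).$$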
Deep reinforcement learning aims to use a (deep) artificial neural network to learn the optimal policy $\pi^{*}$. This policy takes the state $s$ as input and outputs either a chosen action $a$ (in the case of a deterministic policy) or the probability distribution of selecting action $a$ in state $s$, $\pi(a \mid s)$ (in the case of a stochastic policy). In recent years, the literature has emphasized two main families of algorithms: deep Q-networks (DQN) [51] and policy gradient methods [52]. The former focuses on training an artificial neural network to approximate the function $Q(s, a)$, while the latter directly approximates $\pi(a \mid s)$. DQN training draws inspiration from the classic Q-learning approach and aims to minimize the loss function described in Equation (26). Here, $Q(s, a)$ represents the output value corresponding to action $a$ generated by the neural network when provided with input $s$. A deep Q-network has as many input neurons as the dimension of the state and as many output neurons as the number of actions available in the action set.
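A minimal sketch of this setup, assuming a small fully connected Q-network and the standard temporal-difference target (illustrative dimensions, a single transition, no replay buffer or target network):

```python
import torch
import torch.nn as nn

# Q-network: one input neuron per state dimension, one output per available action.
state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A single illustrative transition (s, a, r, s', done).
s = torch.rand(1, state_dim)
a = torch.tensor([1])
r = torch.tensor([1.0])
s_next = torch.rand(1, state_dim)
done = torch.tensor([0.0])

# TD target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal states.
with torch.no_grad():
    target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken action
loss = nn.functional.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```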
On the contrary, actor–critic policy gradient models necessitate the utilization of at least two types of neural networks for training: one (the actor) shares a structure resembling that of a DQN, but its $a$-th output aims to yield $\pi(a \mid s)$. The other (the critic) endeavors to approximate $V(s)$, mirroring the actor in terms of number of inputs and featuring a single output value.
Various approaches have been developed to enhance DQN training and policy gradient methods. For additional information, we direct readers to the following references: [51,52]. In this manuscript, we employ the Advantage Actor–Critic (A2C) training algorithm [52], which is characterized by its designed loss function outlined in Equation (25).
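A compact sketch of an A2C-style update under common conventions (separate actor and critic networks, with the advantage weighting the log-probability and serving as the critic's regression error; the coefficients are illustrative and not necessarily the exact form of Equation (25)):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# One illustrative transition (s, a, r, s').
s, s_next = torch.rand(1, state_dim), torch.rand(1, state_dim)
a, r = torch.tensor([0]), torch.tensor([1.0])

dist = torch.distributions.Categorical(logits=actor(s))  # stochastic policy pi(a|s)
value = critic(s).squeeze(1)                              # V(s)
with torch.no_grad():
    td_target = r + gamma * critic(s_next).squeeze(1)     # bootstrapped return estimate
advantage = td_target - value                             # A(s, a) estimate

policy_loss = -(dist.log_prob(a) * advantage.detach()).mean()  # actor: follow the advantage
value_loss = advantage.pow(2).mean()                           # critic: regress onto the target
entropy_bonus = dist.entropy().mean()                          # encourage exploration
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
loss.backward()
print(float(loss))
```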