2.2. Eye Tracking Based on a Spiking Neural Network
The tracking method designed in this study consists of an SNN backbone network and a tracker. The backbone network extracts spatial features with temporal information, while the tracker generates tracking results.
SEW ResNet [30] is used as the backbone network. ResNet has shown excellent performance in various image processing fields, and SEW ResNet improves upon it by modifying the residual structure, connecting the input of the residual block to the output of the spiking layer for identity mapping:

$$o = \mathrm{SN}(F(s)) + s$$

where $s$ represents the spike input from the previous layer, $F$ denotes the convolution and pooling operations within the residual block, and $\mathrm{SN}$ represents the spiking layer. This adaptation makes SEW ResNet well suited to representing complex motion with spiking neurons.
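The residual connection above can be sketched in a few lines of plain Python. This is a minimal toy, not the paper's implementation: `toy_F` stands in for the block's conv/pool stack (its weight is an arbitrary assumption), and the spiking layer is reduced to a stateless threshold so the ADD-style identity mapping $o = \mathrm{SN}(F(s)) + s$ is visible.

```python
def spiking_layer(x, v_th=1.0):
    """Toy stateless spiking nonlinearity: emit a spike (1.0) where the
    input reaches the threshold. Real spiking layers are stateful."""
    return [1.0 if xi >= v_th else 0.0 for xi in x]

def toy_F(x, weight=1.2):
    """Stand-in for the conv/pool operations F inside the residual block
    (hypothetical scalar weight, chosen only for illustration)."""
    return [weight * xi for xi in x]

def sew_block(s):
    """SEW residual: add the block input s to the spiking layer's output,
    o = SN(F(s)) + s, so input spikes are identity-mapped to the output."""
    return [a + b for a, b in zip(spiking_layer(toy_F(s)), s)]

out = sew_block([1.0, 0.0, 1.0, 0.0])
# With ADD, positions where both F's output spikes and the input spiked
# carry the integer count 2.0 rather than a binary value.
```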
A spiking neuron can be described as follows:

$$H[t] = f(V[t-1], X[t])$$
$$S[t] = \Theta(H[t] - V_{th})$$
$$V[t] = H[t](1 - S[t]) + V_{reset} S[t]$$

Here, $X[t]$ is the input, $V[t]$ is the membrane potential, $V_{th}$ is the firing threshold, and $V_{reset}$ is the reset potential. $H[t]$ is the hidden state of the neuron after charging but before firing. $f$ is the charging equation, in which spiking neuron models differ. $\Theta$ is the Heaviside function. The spiking neuron model used here is the Parametric-LIF (PLIF) model [31], whose charging equation $f$ is the same as that of the LIF model [32]:

$$H[t] = V[t-1] + \frac{1}{\tau}\big(X[t] - (V[t-1] - V_{reset})\big)$$

The difference between the PLIF and LIF neurons is that the time constant $\tau$ of PLIF is not a preset hyperparameter but is optimized during training. Each layer of spiking neurons shares one $\tau$, while different layers have different $\tau$ values.
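The three equations above can be simulated directly. The sketch below is a minimal, non-differentiable version (training a PLIF neuron requires surrogate gradients, omitted here); parameterizing $1/\tau$ as a sigmoid of a raw weight `w` is one common choice and is an assumption, not necessarily the paper's exact form.

```python
import math

def simulate_plif(inputs, w=0.0, v_th=1.0, v_reset=0.0):
    """Discrete PLIF dynamics with hard reset:
       H[t] = V[t-1] + (1/tau) * (X[t] - (V[t-1] - v_reset))
       S[t] = Theta(H[t] - v_th)
       V[t] = v_reset if S[t] else H[t]
    1/tau = sigmoid(w), so tau > 1; w would be learned during training."""
    inv_tau = 1.0 / (1.0 + math.exp(-w))  # w=0 gives 1/tau = 0.5
    v, spikes = v_reset, []
    for x in inputs:
        h = v + inv_tau * (x - (v - v_reset))  # charging (LIF form)
        s = 1.0 if h >= v_th else 0.0          # Heaviside firing
        v = v_reset if s else h                # hard reset on spike
        spikes.append(s)
    return spikes

# Constant input below threshold integrates over several steps, fires,
# resets, and repeats, giving a regular spike train.
spikes = simulate_plif([1.0] * 8, w=0.0, v_th=0.9)
# → [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```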
To obtain the firing frequency of spikes over a period, SEW ResNet and other existing SNN structures need to run for several time steps on a PC to simulate spiking neurons continuously receiving input and charging; the neurons' spike output frequency over these time steps is then used as the final result. As shown in Figure 1, a commonality with ANNs is that the SNN still treats inputs as "frames", with each output of the SNN being independent of the others. Compared to an ANN, each output of such a spiking neural network incurs several times the operational cost, which clearly contradicts the initial goal of low power consumption.
Observing the state equations of spiking neurons, one finds that the membrane potential is continuously stored inside the neuron and changes with each time step's input. This property gives the SNN the ability to remember its current state. When an SNN continuously receives inputs, its neurons emit spikes according to the charging equation and the firing threshold, and the spike output at each time step depends on the state of the previous time step. Therefore, when the SNN continuously receives event inputs, we consider recording the spike tensor output in real time. However, the information contained in the output of a single time step is limited, while accumulating spikes over all time steps to obtain the spike firing frequency would make the output contain information from too far back, making it difficult to localize the tracking target. To address this, we set a record length $L$: the network continuously receives event inputs and records the spike output (a spike tensor) of the backbone network at each time step. The spike tensors of the current and the previous $L-1$ time steps are averaged over the time dimension to obtain the spike rate tensor for the most recent $L$ time steps, which is taken as the feature of the current time step.
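The sliding average over the last $L$ spike tensors can be kept with a bounded buffer. This is a schematic helper, not the paper's code: the class name and the flattened per-element representation are ours, and a real implementation would average stacked tensors on GPU instead of Python lists.

```python
from collections import deque

class SpikeRateWindow:
    """Keep the most recent L spike tensors (flattened to lists here) and
    return their element-wise mean over time: the spike rate feature for
    the current time step."""

    def __init__(self, length):
        # deque with maxlen evicts the oldest tensor automatically,
        # so only the last `length` time steps ever contribute.
        self.buf = deque(maxlen=length)

    def update(self, spike_tensor):
        self.buf.append(spike_tensor)
        n = len(self.buf)  # < length during the warm-up phase
        return [sum(col) / n for col in zip(*self.buf)]

w = SpikeRateWindow(3)
w.update([1.0, 0.0])   # warm-up: averages over 1 step
w.update([0.0, 0.0])
w.update([1.0, 1.0])
r = w.update([0.0, 1.0])  # oldest step evicted; mean of last 3 steps
```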
Figure 2 depicts the membrane potential changes and output spikes of two different neurons after receiving inputs.
It is evident that the inputs between these two cases differ significantly; however, the differences in the inputs did not result in any change in the output. For both inputs, there were six spikes within 300 time steps, indicating a spike firing rate of 0.02. This suggests that using spike firing rate as a feature loses the relative timing information of each spike emission. For instance, there is a distinct difference in the information contained within sequences where spikes are concentrated in a short period versus those where spikes are uniformly distributed over a longer duration.
To better extract information from spike sequences, a module is needed that can appropriately capture the information carried by different spike sequences: it should output a higher value when continuously receiving input and also respond when input ceases. This requirement aligns well with the characteristics of the LIF neuron model. Consequently, we replace the computation of the spike firing rate with a PLIF neuron, as described above. The time constant $\tau$, which shapes the neuron's charging curve, is treated as a learnable parameter updated synchronously during network training. This neuron accumulates membrane potential without emitting spikes. After continuously receiving inputs for $L$ time steps (the feature extraction window length of our model), the neuron's membrane potential is read out as a floating-point number. The final result therefore varies with the presence or absence of input spikes at different times. For ease of reference, such neurons are named NSPLIF (Non-Spiking PLIF).
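The distinction an NSPLIF draws between spike sequences with the same rate can be checked with a small simulation. This sketch reuses the PLIF charging equation with firing disabled; the sigmoid parameterization of $1/\tau$ is an illustrative assumption.

```python
import math

def nsplif_potential(inputs, w=0.0, v_reset=0.0):
    """NSPLIF sketch: PLIF charging with no threshold and no reset.
    After the input window ends, the membrane potential itself is the
    floating-point feature that is read out."""
    inv_tau = 1.0 / (1.0 + math.exp(-w))  # w=0 gives 1/tau = 0.5
    v = v_reset
    for x in inputs:
        # Charge on input, leak toward v_reset otherwise; never fire.
        v = v + inv_tau * (x - (v - v_reset))
    return v

# Two sequences with the same spike count (identical firing rate 0.5),
# but the timing differs: an early burst has leaked away by readout,
# while recent spikes leave a high membrane potential.
early = nsplif_potential([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # small
late  = nsplif_potential([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # large
```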
Figure 3 demonstrates the output differences exhibited by the NSPLIF under various input conditions. By shifting the input along the time axis, the figure displays two membrane potential curves. When the neuron continuously receives inputs, its membrane potential rises rapidly and then gradually stabilizes; when the input ceases, the membrane potential decays gradually due to leakage until the next input arrives. This process enables the NSPLIF to effectively extract the temporal information hidden at different moments in the input spikes, providing a more precise basis for the final output. It is also worth noting that the NSPLIF model can be derived from the PLIF model with minor modifications, which allows the entire neural network to use a single type of neuron model, ensuring structural uniformity and consistency within the network.
Eye tracking can be viewed as single-object tracking, and since the event camera's input contains no redundant background information, bounding box regression is handled by a fully connected layer. The spike rate feature is passed through a pooling layer and then flattened before being input to the fully connected layer, ultimately yielding the tracking result for the current time step. The whole structure of the algorithm is shown in Figure 4.
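The pool-flatten-regress head can be sketched as below. This is a schematic stand-in, not the trained model: global average pooling replaces whatever pooling the paper uses, the weights are arbitrary, and the four outputs are assumed to be a $(x, y, w, h)$ box.

```python
def bbox_head(rate_tensor, weights, bias):
    """Hypothetical regression head: global-average-pool each channel of
    the spike rate tensor, then apply one fully connected layer mapping
    the pooled vector to four bounding-box values."""
    # rate_tensor: list of channels, each a flat list of spatial values.
    pooled = [sum(ch) / len(ch) for ch in rate_tensor]  # per-channel GAP
    # weights: 4 rows (one per box coordinate) x num_channels columns.
    return [sum(wi * p for wi, p in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

# Tiny example: 2 channels of 2 spatial positions each.
rate = [[0.5, 0.5], [1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # toy parameters
box = bbox_head(rate, W, [0.0, 0.0, 0.0, 0.0])
```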
The network requires a warm-up period initially, as the first result must wait for the network to run for $L$ time steps; from the $L$-th time step onward, the tracking result is updated at every time step, with each regression containing information from the previous $L$ time steps. This enables the network to recover tracking on its own when the target disappears and reappears.
In eye tracking, where successive frames are highly similar, the memory advantage of the SNN becomes evident. Common algorithms for eye tracking fall into two categories. The first is based on object detection and computes independently on each input frame; its advantage is that independent prediction allows recognition to continue when the target reappears after being lost. The second, OPE tracking algorithms [33,36], fundamentally relies on similarity matching between frames, which makes re-tracking lost targets difficult. The SNN-based tracking structure proposed in this study combines the advantages of both: it makes reasonable use of prior information when predicting the target's position, and it enables re-acquisition of the target without external information even after the target is lost.