The protection mechanism based on hardware locking can prevent attackers from using spy processes to tamper with the internal data of the BTB and thereby obtain critical branch information of normal processes. However, the jump direction and jump address of a branch are often stored in plaintext in the BHT and BTB, so attackers can still obtain them by other means. The BPU therefore remains exposed to a residual risk of information leakage.
4.2.1. BTB and BHT Based on Dynamic Isolation
Figure 5 shows the dynamic isolation mechanism proposed in this paper. While the program is running, this mechanism concatenates the upper 16 bits of Branch_Address with the output of the branch history register (BHR_Out) and passes the result through the LHash module to generate a digest, which serves as the index of the corresponding 2-bit counter. The designed confidentiality protection module generates two keys: one for encrypting the branch history information stored in the BHR, and the other for encrypting the digest used as the index. This makes it difficult for attackers to access the BHT and obtain the jump directions of critical branches.
LHash is a hash function that can be used to generate index values for data [33]. Compared with other hash functions, LHash provides the required security with minimal resource overhead, which makes it suitable for lightweight embedded processors [19]. To balance security against area consumption, the LHash module designed in this article uses the 128-bit internal permutation structure and outputs a 96-bit binary number. Because the BHT of the E906 processor is 512 (2^8) × 16 with eight 2-bit counters, the index is 8 bits long: 8 bits are randomly selected from the 96-bit output of LHash as the index of the 2-bit counter. The Key for Index is therefore 8 bits long. Because the BHR stores the lower 16 bits of branch addresses, the Key for Branch_Address is 16 bits long.
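The index derivation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: LHash is not available in the Python standard library, so a 96-bit BLAKE2b digest is swapped in as a stand-in; BHR_Out is assumed to be 16 bits wide; the key is assumed to be applied by XOR; and `bht_index`, `stand_in_hash96`, and the bit positions are hypothetical names and values, not taken from the design.

```python
import hashlib


def stand_in_hash96(data: bytes) -> int:
    """Stand-in for LHash (assumption): any 96-bit digest serves the illustration."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=12).digest(), "big")


def bht_index(branch_address: int, bhr_out: int, key_index: int,
              bit_positions: tuple) -> int:
    """Derive an 8-bit masked BHT index from the branch address and history."""
    upper16 = (branch_address >> 16) & 0xFFFF
    # Concatenate the upper 16 address bits with the (assumed 16-bit) BHR output.
    message = ((upper16 << 16) | (bhr_out & 0xFFFF)).to_bytes(4, "big")
    digest = stand_in_hash96(message)
    # Pick 8 of the 96 digest bits; the positions are a hypothetical choice.
    index = 0
    for pos in bit_positions:
        index = (index << 1) | ((digest >> pos) & 1)
    # Mask the index with the 8-bit Key for Index (XOR is an assumption).
    return index ^ (key_index & 0xFF)


positions = (0, 11, 23, 37, 48, 59, 70, 95)
print(bht_index(0x80001234, 0xBEEF, 0x5A, positions))
```

Without the key, the same address and history map to a different counter, so an observer who does not know the key cannot tell which counter a given branch trains.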
When the internal data of the BTB must be updated, the designed confidentiality protection module generates two keys, one for encrypting the lower 16 bits of Jump_Address and the other for encrypting the lower 16 bits of Branch_Address. Both keys are therefore 16 bits long. The encrypted Jump_Address is stored in the Target field of the BTB, and the encrypted Branch_Address is stored in the Tag field of the BTB, which is used as the index of the BTB entry. When the jump address of a branch must be predicted, the module generates the two keys for that branch again. The Key for Branch_Address is XORed with Branch_Address, and the result is compared with the Tag of each entry in the BTB. Only when an entry matches does the BTB output the Target of that entry. The plaintext of BTB_Out can be obtained only after Target is XORed with the Key for Jump_Address. This makes it difficult for attackers to access BTB entries and obtain the branch information.
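The update and prediction flow can be illustrated with a minimal software sketch. It assumes that the encryption on update is the same XOR masking that the text describes for lookup, and it models the BTB as a plain Python list; `BtbEntry`, `btb_update`, and `btb_predict` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

MASK16 = 0xFFFF


@dataclass
class BtbEntry:
    tag: int     # masked lower 16 bits of Branch_Address
    target: int  # masked lower 16 bits of Jump_Address


def btb_update(btb, branch_address, jump_address, key_branch, key_jump):
    """Store an entry with both fields XOR-masked by their per-branch keys."""
    tag = (branch_address ^ key_branch) & MASK16
    target = (jump_address ^ key_jump) & MASK16
    btb.append(BtbEntry(tag, target))


def btb_predict(btb, branch_address, key_branch, key_jump):
    """Return the plaintext lower 16 target bits on a tag hit, else None."""
    probe = (branch_address ^ key_branch) & MASK16
    for entry in btb:
        if entry.tag == probe:
            return entry.target ^ (key_jump & MASK16)
    return None


btb = []
btb_update(btb, 0x80001234, 0x80005678, key_branch=0xA5A5, key_jump=0x3C3C)
print(hex(btb_predict(btb, 0x80001234, 0xA5A5, 0x3C3C)))
```

A lookup with the wrong Key for Branch_Address misses, and a hit decoded with the wrong Key for Jump_Address yields a wrong target.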
Our team has proposed various security algorithms and corresponding hardware circuits that can be used for information encryption [11,12]. Although these algorithms offer extremely high security and can defend against quantum attacks, they have high resource overhead and long encryption and decryption times, and are therefore not suitable for branch information. Our team has also designed a hardware accelerator based on traditional cryptographic algorithms such as AES-GCM [25]. However, this module is better suited to confidentiality protection for large amounts of data and also has high resource overhead, so it is not suitable for branch information either. Taking security, resources, and performance into account, the confidentiality protection module of this article is based on the designed hybrid PUF circuit and efficiently generates the multiple keys required by the dynamic isolation mechanism.
Because different branches use different keys, attackers cannot restore the correct branch jump information. They therefore cannot correctly perceive whether there is contention with other processes, nor can they reuse previous history information, which fundamentally resists contention-based and reuse-based attacks. The proposed dynamic isolation mechanism does not change the original control logic for updates and queries; it only encrypts the jump information and its index, with low resource and performance overhead. The mechanism minimizes changes to the existing branch predictor architecture and requires only a small amount of logic to achieve dynamic isolation.
4.2.2. The Confidentiality Protection Module
Because of constraints such as hardware resources, cost, and computing power, traditional encryption techniques are difficult to apply widely in lightweight embedded systems. The Physical Unclonable Function (PUF), as a physical security primitive, instead uses the uncontrollable random process variations introduced during chip manufacturing to generate unique secret keys [31]. As shown in Figure 6, an n-bit Challenge (C_1, C_2, ..., C_n) is applied to the PUF circuit, which generates a corresponding m-bit Response; together they are called a challenge-response pair (CRP). According to their ability to generate CRPs, PUFs are divided into weak PUFs and strong PUFs [26]. Strong PUFs can generate an exponential number of CRPs and are suitable for low-cost encryption and decryption; they mainly include the arbiter-based APUF, the ring-oscillator-based RO-PUF, and the current mirror PUF [28]. RO-PUF and current mirror PUF are susceptible to environmental noise and have high hardware resource costs, which makes them unsuitable for lightweight encryption and decryption applications. Therefore, this article adopts a PUF circuit based on the APUF.
The traditional APUF circuit consists of an arbiter and a signal delay path formed by cascading n switch units [27]. As shown in Figure 6, a switch unit consists of two parallel multiplexers (MUX). When the challenge signal C_i is 0, the signal takes the parallel path through the switch unit, and when C_i is 1, it takes the cross path. The transmission path of the signal can therefore be changed through the challenge signal C, which affects the signal delay difference and produces an unpredictable response. The arbiter is a NAND latch. When a pulse signal is applied to the input of the APUF circuit and has propagated through the two symmetrical delay paths (Path 0 and Path 1), the arbiter arbitrates between the output signals of Path 0 and Path 1. If the signal of Path 0 arrives at the arbiter before the signal of Path 1, the output of the PUF circuit is 1; otherwise, it is 0.
The total delay deviation of the signal reaching the arbiter is the accumulation of the delay deviations introduced as it passes through each switch unit, as shown in Equation (2).
c_i is an element of the Challenge vector C, and ω_i is a constant vector containing the delay parameters of each switch unit, as shown in Equation (3).
The formulas for α_i and β_i are shown in Equation (4).
x_i represents the delay parameter of the signal passing in parallel through delay path Path 0 when the challenge signal C_i is 0, and y_i represents the delay parameter of the signal passing in parallel through delay path Path 1 when C_i is 0. u_i represents the delay parameter of the signal crossing from Path 0 to Path 1 when C_i is 1, and v_i represents the delay parameter of the signal crossing from Path 1 to Path 0 when C_i is 1.
The relationship between the response R of the APUF circuit and the total delay deviation ∆ is shown in Equation (5). The value range of the sgn function is {−1, 1}. When ∆ is greater than 0, R is 1, otherwise R is 0.
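As an illustration of how an accumulated delay difference turns into a response bit, the following sketch simulates a single APUF with the textbook additive linear delay model. It is not a transcription of Equations (2) to (5): the Gaussian weights stand in for the per-stage delay parameters, and `make_apuf` and `apuf_response` are hypothetical names.

```python
import random


def make_apuf(n_stages: int, seed: int) -> list:
    """Draw n+1 Gaussian weights standing in for manufacturing variation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_stages + 1)]


def apuf_response(weights: list, challenge: list) -> int:
    """Return 1 if the accumulated delay difference is positive, else 0."""
    n = len(challenge)
    delta = weights[n]  # constant bias term (e.g. arbiter offset)
    parity = 1
    # Walk from the last stage to the first so that parity is the product
    # of (1 - 2*c_j) over all stages j >= i: each crossing flips the sign.
    for i in range(n - 1, -1, -1):
        parity *= 1 - 2 * challenge[i]
        delta += weights[i] * parity
    return 1 if delta > 0 else 0


weights = make_apuf(32, seed=1)
challenge = [random.Random(2).randint(0, 1) for _ in range(32)]
print(apuf_response(weights, challenge))
```

Because the response is the sign of a linear function of the transformed challenge, a linear classifier can learn it from enough CRPs, which is the weakness the following paragraphs address.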
Because of the inherent correlation between the output response and the input challenge of the APUF, it is susceptible to machine learning (ML) attacks [20]. Attackers can collect a certain number of CRPs and use ML algorithms to build a mathematical model that predicts the response to an arbitrary challenge with high accuracy. At present, researchers mainly resist ML attacks by modifying PUF circuits to increase the nonlinearity of the circuit model [29]. Some researchers have proposed the Feed-Forward APUF (FF-APUF), which introduces a feed-forward loop into the APUF and uses the decision result of an intermediate stage as the excitation for a subsequent switch unit [30]. Others have proposed the double APUF (DAPUF) circuit, which configures the signals sent to the arbiter for judgment according to the principle of cross exchange, enhancing the ability of the PUF to resist ML attacks [38]. However, other researchers have pointed out that targeted modeling attacks can predict the responses of these structures with a success rate of over 95% [20]. Therefore, this article combines the advantages of the FF-APUF and DAPUF circuits and proposes a hybrid APUF circuit that deepens the nonlinearity of the circuit structure and improves resistance to ML attacks.
The proposed hybrid PUF circuit is shown in Figure 6. A D flip-flop is added at the beginning of each delay path (Path 0, Path 1, Path 0', and Path 1') to optimize the routing delay of the PUF circuit. These D flip-flops are driven by the same clock. A flip-flop outputs its signal only when the rising edge of the clock arrives, which ensures that the pulse signal enters every delay path at the same time.
To improve randomness, a 32-bit linear feedback shift register (LFSR) is designed as the challenge generator. This circuit expands the input 32-bit Branch_Address or Jump_Address into a sequence of up to 2^32 - 1 pseudo-random 32-bit challenge signals, cycling through 2^32 - 1 different states. In the proposed hybrid APUF circuit, each 32-bit challenge signal generates a 1-bit response. Therefore, to generate the 16-bit Key for Branch_Address, the 32-bit Branch_Address is used as the initial input signal, and sixteen challenge signals are then randomly selected from the output sequence of the LFSR. After 16 clock cycles, the required key is available. Similarly, with the 32-bit Jump_Address as the initial input signal, the 16-bit Key for Jump_Address is obtained after 16 clock cycles. For the 8-bit Key for Index, the initial input signal of the challenge generator is the 32-bit Branch_Address, and eight challenge signals are randomly selected from the pseudo-random sequence of the LFSR.
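The key generation loop can be sketched as follows. Several details are assumptions made for illustration: the feedback taps (32, 22, 2, 1) are a commonly tabulated maximal-length choice for 32-bit LFSRs and are not stated in the text; consecutive LFSR states are used in place of the random selection of challenges; an all-zero seed is replaced by 1 to avoid lock-up; and `toy_response` is a placeholder parity function, not a PUF model.

```python
def lfsr32_step(state: int) -> int:
    """Advance a 32-bit Fibonacci LFSR by one clock (assumed taps 32, 22, 2, 1)."""
    bit = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | bit) & 0xFFFFFFFF


def toy_response(challenge: int) -> int:
    """Placeholder for the hybrid PUF: parity of the challenge bits."""
    return bin(challenge).count("1") & 1


def generate_key(seed_address: int, key_bits: int, puf=toy_response) -> int:
    """Collect one response bit per clock cycle to build a key_bits-wide key."""
    # Assumed handling: an all-zero seed would lock the LFSR, so use 1 instead.
    state = (seed_address & 0xFFFFFFFF) or 1
    key = 0
    for _ in range(key_bits):
        state = lfsr32_step(state)            # next pseudo-random challenge
        key = (key << 1) | (puf(state) & 1)   # append its 1-bit response
    return key


print(hex(generate_key(0x80001234, 16)))  # 16-bit Key for Branch_Address
print(hex(generate_key(0x80001234, 8)))   # 8-bit Key for Index
```

In hardware, the `puf` callable corresponds to the hybrid PUF circuit itself, so the same address reproduces the same key on the same chip but not on another one.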
Because the challenge signals are 32 bits wide, the designed hybrid PUF circuit has 32 switch units. To disrupt the mapping between challenges and responses and reduce the accuracy of ML-based key inference, a mirrored APUF with the same structure as the original APUF is designed, so that the original APUF is parallel to and symmetrical with the mirrored APUF. The outputs of the original APUF and the mirrored APUF are obfuscated by XOR to obtain the 1-bit output response. Because the signals compared by Arbiter 5 and Arbiter 5' both come from the same type of delay path, the asymmetry of the APUF signal delay paths is effectively compensated in the hardware implementation, which improves the ability of the PUF circuit to resist ML attacks.
To expand the selection range of signal delay paths, enhance the randomness of the PUF circuit, and make full use of the switch units, the lower and upper delay paths at the outputs of the symmetrically placed switch units are crossed with each other, achieving mutual inversion of the input excitation. For example, Path 1 no longer runs in parallel from Switch Unit 7 to Switch Unit 8, but crosses to Switch Unit 8'. Similarly, Path 0' runs from Switch Unit 7' to Switch Unit 8. Unlike in traditional APUF circuits, the pulse signals of the hybrid PUF circuit can be transmitted through cross paths onto the mirror-symmetric APUF delay path, which effectively improves the randomness of the output response and the resistance to ML attacks.
The feedforward loop of the PUF circuit is shown in Figure 6, and the number of such loops has a significant impact on the stability and ML attack resistance of the PUF circuit. The relationship between the number of feedforward loops Num_FF (considering only the original APUF) and the number of interval switch units Num_SU is shown in Equation (6).
n is the total number of switch units. The hybrid PUF circuit designed in this article has a total of 32 switch units, so the selectable range of Num_SU is {1, 3, 7, 15} and the selectable range of Num_FF is {2, 4, 8, 16}.
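The listed ranges, together with the pair chosen later (Num_FF = 4, Num_SU = 7), are consistent with the relation Num_FF = n / (Num_SU + 1). The sketch below enumerates the pairs under that assumed relation; it is inferred from the listed values rather than taken from Equation (6), and `feedforward_options` is a hypothetical helper.

```python
def feedforward_options(n_switch_units: int) -> list:
    """List (Num_SU, Num_FF) pairs, assuming Num_FF = n / (Num_SU + 1)."""
    options = []
    num_su = 1
    # Stop before Num_SU + 1 reaches n, i.e. keep at least two loops.
    while num_su + 1 < n_switch_units:
        num_ff = n_switch_units // (num_su + 1)
        options.append((num_su, num_ff))
        num_su = 2 * num_su + 1  # 1, 3, 7, 15, ...
    return options


print(feedforward_options(32))  # [(1, 16), (3, 8), (7, 4), (15, 2)]
```

Under this reading, a larger interval between loops means fewer loops, so the two sets in the text pair up in opposite order.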
Table 2 illustrates the variation of the indicators of hybrid PUF circuits with the number of feedforward loops. Stability is measured by the average intra-Hamming distance (HD), and the calculation formula is shown in Equation (7).
R represents the output response of the hybrid PUF circuit at 30 degrees centigrade and standard voltage, and R_t represents the output response of the t-th measurement under different environmental conditions. T represents the total number of environmental conditions, that is, the total number of times the circuit response is measured. L is the length of the output response. HD(R, R_t) is the Hamming distance between the responses R and R_t. The designed hybrid PUF circuit is implemented on an FPGA, and its stability is tested over the temperature range of 25 to 75 degrees centigrade with a step size of 10 degrees centigrade. Each experiment is repeated 10 times, and the average stability under the multiple environmental conditions is calculated. Based on the test results in Table 2, and taking into account stability, resistance to ML attacks, and hardware resource overhead, the Num_FF and Num_SU selected in this article are 4 and 7, respectively, and the structure of the resulting hybrid PUF circuit is shown in Figure 6.
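The stability metric can be computed as in the following sketch, which assumes the usual definition of the average intra-Hamming distance: the mean of HD(R, R_t) / L over the T re-measurements, expressed as a percentage. The function names and the sample responses are illustrative only and are not measured data.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two responses differ."""
    return bin(a ^ b).count("1")


def intra_hd_percent(reference: int, remeasured: list, length: int) -> float:
    """Average intra-HD in percent over T re-measured L-bit responses."""
    total = sum(hamming_distance(reference, r) for r in remeasured)
    return 100.0 * total / (len(remeasured) * length)


# Illustrative 4-bit responses: one of two re-measurements flips one bit.
print(intra_hd_percent(0b1010, [0b1010, 0b1011], length=4))  # 12.5
```

A value closer to 0% means the circuit reproduces the reference response more reliably across environmental conditions.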