7.1. Software Modelling
Our baseline design computes kNN using the cosine distance on the original dataset, operating on a multi-core CPU server. Because the official Scikit-learn library implements different distance metrics with different levels of optimisation, we design our experiments in accordance with these optimisations to keep the comparison fair. When comparing against the baseline, the CPU time is measured on the official KNeighborsClassifier, which is implemented as a Cython extension that compiles down to C, leveraging OpenMP-based parallelism and allowing multi-core processing on the CPU. When measuring the time taken for the Hamming distance metric on the CPU, we implement our own kNN function for two reasons: 1) the official library implements the Hamming distance with normalisation and returns a double, whereas our experiment only requires an un-normalised integer distance; 2) when the Hamming distance metric is used, little optimisation is applied in the official library, which would make the comparison with the cosine metric unfair. Our implementation first generates a distance list by comparing the query with all the data in the database, then returns the top-k elements with the smallest distance (largest similarity) after a simple sort on the distance list.
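For concreteness, the following is a minimal sketch of such a custom kNN routine (the function name, the 0/1 input format and the use of NumPy are our assumptions, not the paper's exact implementation):

```python
import numpy as np

def knn_hamming(query, database, labels, k):
    """Return labels and distances of the k database entries closest to `query`
    under the un-normalised (integer) Hamming distance. Inputs hold 0/1 values."""
    # distance list: integer count of mismatching positions against every stored vector
    dists = np.count_nonzero(database != query, axis=1)
    # simple sort on the distance list, then keep the top-k smallest distances
    top_k = np.argsort(dists, kind="stable")[:k]
    return labels[top_k], dists[top_k]
```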
The proposed method uses an ensemble of models to recover the accuracy loss introduced by the Hamming-space similarity. All data are first preprocessed, including standardisation and loss modelling for converting floating-point to fixed-point numbers. The processed data are then projected to multiple latent spaces before being converted to their Hamming encodings. These new sets of data are stored in separate models awaiting queries. Each query goes through the same process before being turned into a Hamming vector in the corresponding latent space. Each model in the ensemble returns the top-k most similar elements in the dataset. Based on the aggregation type, each model can return either the label of the majority class or the indices of the top-k elements in its subspace. We discuss the impact of these two aggregation methods on the result later. The software flow is summarised in Figure 9.
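A condensed software sketch of this flow is given below (class and variable names are illustrative; the fixed-point loss-modelling step is omitted and majority-vote aggregation is assumed):

```python
import numpy as np

class HammingEnsemble:
    """Project data into several latent spaces, binarise, and answer queries per model.
    Labels are assumed to be non-negative integer class ids."""

    def __init__(self, projections):              # one random projection matrix per model
        self.projections = projections

    def fit(self, X, y):
        self.mean, self.std = X.mean(axis=0), X.std(axis=0) + 1e-12
        X = (X - self.mean) / self.std             # standardisation
        self.labels = y
        # each model stores its own Hamming encoding of the projected data
        self.encoded = [(X @ P) > 0 for P in self.projections]
        return self

    def query(self, x, k=5):
        x = (x - self.mean) / self.std              # queries follow the same preprocessing
        votes = []
        for P, enc in zip(self.projections, self.encoded):
            q = (x @ P) > 0                         # Hamming vector in this latent space
            top_k = np.argsort(np.count_nonzero(enc != q, axis=1))[:k]
            votes.append(np.bincount(self.labels[top_k]).argmax())  # per-model majority label
        return np.bincount(votes).argmax()          # aggregate across the ensemble
```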
All datasets used are large and high-dimensional, with a mixture of sparse and dense datasets. Due to the lack of dense high-dimensional datasets online, two dense datasets are created. The first is created from a stock price dataset using a window size of 1000, where each window is labelled with the trend of the market for the day after. The second dataset contains image embeddings of cats and dogs created using the OpenAI CLIP model [43], with an embedding length of 512.
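The stock dataset construction could look roughly as follows (a sketch under our assumptions; the exact next-day trend labelling is not specified beyond the description above):

```python
import numpy as np

def make_stock_windows(prices, window=1000):
    """Slice a 1-D price series into dense windows of length `window`; label each
    window with the trend of the following day (1 = up, 0 = down or flat)."""
    X, y = [], []
    for start in range(len(prices) - window):
        end = start + window                           # window covers prices[start:end]
        X.append(prices[start:end])
        y.append(int(prices[end] > prices[end - 1]))   # trend of the day after the window
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.int64)
```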
As for the experiment setup, we run the experiment for all datasets with four types of random projection matrices: Gaussian, GaussianInt, Sparse and SparseInt, which correspond to dense and sparse matrices and their compressed versions (using -1, 0 and 1 to replace positive, zero and negative values, identified by the Int suffix). Note that the dense RP matrices are used for comparison with the DSP-based multiplier and the compressed versions for comparison with the adder tree-based multiplier. This is summarised in Table 4. For the amount of data used, we split each dataset with a fixed train-test split ratio of 0.8. For the ensemble, we further sample a portion of the training dataset to reduce the number of data points to be stored.
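The four matrix types could be generated roughly as follows (a sketch; we assume the Int variants are obtained by keeping only the signs, as described above, and the sparse density of 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

def make_rp_matrix(kind, M, m, density=0.1, seed=0):
    """Generate an M x m random projection matrix of one of the four types."""
    rng = np.random.default_rng(seed)
    dense = rng.standard_normal((M, m))
    if kind == "Gaussian":
        return dense / np.sqrt(m)
    if kind == "GaussianInt":                     # compressed: -1, 0, 1 according to sign
        return np.sign(dense)
    mask = rng.random((M, m)) < density           # sparse support
    if kind == "Sparse":
        return dense * mask / np.sqrt(density * m)
    if kind == "SparseInt":                       # sparse, entries compressed to -1, 0, 1
        return np.sign(dense) * mask
    raise ValueError(f"unknown RP matrix type: {kind}")
```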
7.2. Performance Modelling
The estimation is done using a combination of hardware estimation and software simulation. At a high level, the time for our system can be summarised as the computation time plus the transmission time. We denote the time taken for the baseline as $T_{base}$, which represents kNN using the cosine distance metric operating on a single CPU without processing the dataset. The time taken with our proposed design is denoted as $T_{prop}$.
For the baseline model, the breakdown of the time can be expressed as:

$$T_{base} = T_{fwd} + T_{cos} + T_{sort} \tag{11}$$

where $T_{fwd}$ is the time taken to forward the data from the data server to the worker, $T_{cos}$ is the time taken for the CPU to compute the cosine distance, and $T_{sort}$ is the time for sorting the obtained distance table. Note that we assume no overlap between these operations; theoretically, sorting could start immediately after receiving a single data point if we maintained a priority queue, similar to a streaming top-k algorithm.
For the proposed model, the system time takes one of two expressions depending on the size of the dataset:

$$T_{prop} = T_{fwd} + T^{F}_{RP} + T^{F}_{bin} + T^{F}_{DDR} + T^{F}_{ham} + T^{F}_{sort} \tag{12}$$

if the data size is smaller than the onboard memory; otherwise,

$$T_{prop} = T_{fwd} + T^{F}_{RP} + T^{F}_{bin} + T_{ham} + T_{sort} \tag{13}$$

where $T_{RP}$ is the time taken for random projection, $T_{bin}$ is the time taken for binarising the projected data, $T_{DDR}$ is the time for reading and writing data from the on-board DDR memory, and $T_{ham}$ represents the time taken to compute the Hamming distance between the query and the dataset. Note that parameters with the superscript $F$ represent operations that happen on the FPGA, whereas those without represent operations on the CPU. If the size of the database after compression fits in the onboard memory, the top-k is performed on the FPGA; otherwise, it is performed on the CPU worker. For simplicity, we skip the case in which the data would be stored partially on the FPGA and partially on the workers, as this would require non-trivial algorithmic adaptation.
For worst-case analysis, we assume only two columns of multipliers are implemented on the FPGA, i.e., two output dimensions are generated in the same cycle, which halves the execution time of the multiplier. The per-datum projection time then becomes

$$T^{F}_{RP} = \frac{m \cdot M}{N} \cdot T_{clk},$$

with $M$ being the original dimensionality, $m$ the projected dimensionality and $T_{clk}$ the clock period. Note that $N$ represents the number of multiplication units, which is the same for the adder tree and the DSP multiplier.
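As a quick numeric check of this expression (using the epsilon parameters applied later in Section 7.3; the function name is ours):

```python
def t_rp_seconds(M, m, N, clk=5e-9):
    """Per-datum random projection time: m*M/N cycles at clock period `clk`."""
    return (m * M / N) * clk

# epsilon: M = 2000, m = 1000, two 2000-input multiplier columns (N = 4000)
print(t_rp_seconds(2000, 1000, 4000))   # 2.5e-06 s, i.e. 2.5 us per projected vector
```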
$T_{prop}$ contains two phases, namely the Store and Query phases. They can be calculated as:

$$T_{store} = sp \cdot r \cdot D \cdot T^{F}_{RP} \tag{14}$$

$$T_{query} = (1 - r) \cdot D \cdot (T^{F}_{RP} + T^{*}_{ham} + T^{*}_{sort}) + L^{*} \tag{15}$$

where $D$ is the total data size, $L$ is the initial latency of the sorter pipeline, $r$ is the train-test split ratio and $sp$ is the sample portion. Note that the sample portion only applies to the training data. The superscript $*$ indicates that the time is taken on either a CPU or an FPGA. We elaborate on Equations 14 and 15 with an example dataset in the next section.
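The model in Equations 11, 14 and 15 can be written down directly as a small helper (a sketch using our notation; the per-step latencies are supplied externally):

```python
def t_base(t_fwd, t_cos, t_sort):
    """Equation 11: baseline time on the CPU worker."""
    return t_fwd + t_cos + t_sort

def t_store(D, r, sp, t_rp):
    """Equation 14: project every sampled training point once."""
    return sp * r * D * t_rp

def t_query(D, r, t_rp, t_ham, t_sort_interval, L):
    """Equation 15: per-query projection, Hamming scan and sort, plus sorter start-up."""
    return (1 - r) * D * (t_rp + t_ham + t_sort_interval) + L
```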
The platforms used for our experiments are: 1) a Xilinx Alveo U280 board [42] with a total DDR capacity of 32 GB and 9024 DSP slices; 2) an AMD EPYC 7543 32-core processor running at an average of 2.8 GHz, with 504 GB of total memory. For the performance estimation, we assume the FPGA runs at 200 MHz, which is equivalent to a clock period of 5 ns.
7.3. Dataset-Specific Case Study
The epsilon dataset is a dense dataset with high dimensionality; the original dataset contains 400000 entries with 2000 dimensions. To reduce the simulation time, a portion of the dataset is taken for simulation. In this case study, 7000 data entries are randomly sampled from the dataset for training and testing, and a train-test split ratio of 0.8 is used.
To calculate $T_{base}$: according to Equation 11, the transmission time is simply the time spent forwarding the packet by the switch. This time is determined by the design of the network switch and the length of the packet. For simplicity, we assume that the packet is of fixed size and that the network switch operates the same as the NetFPGA-SUME [40] base design, but to match the state-of-the-art FPGA, we assume it is implemented on an AMD Alveo U280 FPGA. Based on [44], $T_{fwd}$ contributes little to the total time due to the domination of the RP process. Another reason to ignore this parameter is that both the base case and our proposed design contain packet forwarding, reducing its impact on the speed-up, which is calculated as the ratio of the two cases. Therefore, $T_{base}$ is reduced to $T_{cos} + T_{sort}$. This gives us $T_{base}$ = 0.29270 s. Note that $T_{base}$ includes the time taken for the fitting process, to ensure a fair comparison with $T_{prop}$.
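For reference, the baseline measurement could be reproduced along these lines (a sketch: the random stand-in data, k value and parallelism settings are our assumptions):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5600, 2000))     # stand-in for the epsilon training split (0.8 * 7000)
y_train = rng.integers(0, 2, 5600)
X_test = rng.standard_normal((1400, 2000))

clf = KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute", n_jobs=-1)
start = time.perf_counter()
clf.fit(X_train, y_train)                       # fitting time included, as in T_base
clf.predict(X_test)
print(f"T_base ~ {time.perf_counter() - start:.5f} s")
```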
To calculate $T_{prop}$: as stated in Equations 12 and 13, there are two situations for $T_{prop}$, each of which can be further divided into the two phases $T_{store}$ and $T_{query}$, both sharing the $T^{F}_{RP}$ and $T^{F}_{bin}$ processes, as illustrated in Figure 10.
Table 5. Experiment results

| Name | # entry | M | m | sp | Lib Time CPU Cosine/s | Estimated FPGA time/s | FPGA speedup | Total memory reduction | Trigger size/TB |
| epsilon [32] | 7000 | 2000 | 1000 | 0.9 | 0.2927 | 0.095917 | 3.05 | 3.95 | 1.138 |
| catsndogs | 10000 | 512 | 100 | 0.6 | 0.3065 | 0.103337 | 2.97 | 21.01 | 4.369 |
| AD [33] | 3279 | 1558 | 200 | 0.6 | 0.1111 | 0.012718 | 8.74 | 27.70 | 6.647 |
| stock | 10074 | 2000 | 1000 | 0.8 | 0.4916 | 0.151872 | 3.24 | 26.67 | 1.280 |
| realsim [34] | 3614 | 20958 | 8000 | 0.8 | 1.5200 | 0.652816 | 2.33 | 6.99 | 1.677 |
| gisette [35] | 6000 | 5000 | 1000 | 0.8 | 0.5027 | 0.078037 | 6.44 | 22.22 | 3.200 |
| REJAFDA [36] | 1996 | 6824 | 700 | 0.6 | 0.2153 | 0.012023 | 17.91 | 519.92 | 8.318 |
| qsar_oral [37] | 8992 | 1024 | 800 | 0.8 | 0.3953 | 0.111949 | 3.53 | 12.8 | 0.819 |
Situation 1 — when the on-board memory can hold the dataset. In this case, there are no packet transmissions between the FPGA and the workers, removing one level of data transmission. The Store and Query phases in Figure 10 both run on the FPGA-based switch. However, we do need to consider the DDR RAM latency for storing and reading. For worst-case analysis, we assume the read and write latency takes 200 cycles = 1 us. Since this is smaller than $T^{F}_{RP}$, it can be hidden within the 2.5 us window. Therefore $T^{F}_{DDR}$ can be replaced by the initial latency of 1 us, assuming read and write have the same latency.
For the Store phase, with random projection being a time-consuming part of the design and to support different RP matrices, we design our system to choose between two types of matrix multiplication module: a DSP-based multiplier for RP matrices with large fixed-point values, or an adder tree-based multiplier for RP matrices containing only 0, -1 and 1. To avoid over-utilising the resources, we assume that each switch implements only two copies of the RP module.
With the DSP-based multiplier already discussed in [44], we implemented an adder tree with 2000 inputs of 16-bit fixed-point type using Vivado HLS; it uses only 5% of the LUTs on the U280 board. Since the epsilon dataset has an input dimension of 2000, we allocate 4000 DSPs or two 2000-input adder trees for random projection. Binarising a vector on the FPGA can be done simply by extracting the sign bit of each element, and this is included in our adder tree design. The overhead of this operation is trivial, so adding this functionality to the DSP-based multiplier does not affect the estimation outcome. Therefore, $T^{F}_{RP}$ = 0.5 × 1000 × 5 ns = 2.5 us, meaning a projected data point is generated every 2.5 us. The total time to generate the entire projected dataset in the Store phase is $T_{store}$ = 0.9 × 0.8 × 7000 × 2.5 us = 12.60 ms.
In the Query phase, the query data is first projected to the same latent space as the database, which takes 2.5 us. This projected query is then compared with the entire database, computing the Hamming distance along the way. Burst reading from the DRAM can be initiated within this 2.5 us window, so by the time the projection is ready, the entire database is ready to use. To compute the Hamming distance, another adder tree is implemented. This adder tree takes only single bits as input, therefore significantly reducing resource usage. The circuit is much simpler than the adder tree for random projection, and our implementation results show that less than one per cent of the LUTs are used on the target platform.
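A bit-level software analogue of this step is shown below (the FPGA uses XOR followed by a 1-bit adder tree; here we assume bit-packed uint8 vectors and use NumPy's unpackbits as a popcount):

```python
import numpy as np

def hamming_scan(query_bits, db_bits):
    """Un-normalised Hamming distances between one bit-packed query (shape (B,))
    and a bit-packed database (shape (n, B)), where B = ceil(m / 8) bytes."""
    xor = np.bitwise_xor(db_bits, query_bits)          # differing bits, byte by byte
    return np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per database entry

# usage: pack 0/1 vectors into bytes, then scan the whole database for one query
db = np.packbits(np.random.randint(0, 2, (5040, 1000), dtype=np.uint8), axis=1)
q  = np.packbits(np.random.randint(0, 2, 1000, dtype=np.uint8))
print(hamming_scan(q, db).shape)                       # (5040,)
```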
If pipelined, the initial latency can be calculated as $\lceil \log_2 N \rceil + 1$ cycles, where $N$ is the length of the input bit vector. For an input of 1000 bits, the initial latency is around 11 cycles. This initial latency can be amortised over a large input dataset; for simplicity, we ignore it in our estimation as the datasets used in our experiments are relatively large. When reading the data from DRAM with a read port data width of 1024 bits, it is possible to read one encoded vector from the DRAM every cycle. Therefore, the Hamming distance of a single vector can be computed every cycle, and the total time taken is $sp \cdot r \cdot D$ cycles. For the epsilon dataset, this is $T^{F}_{ham}$ = 0.9 × 0.8 × 7000 × 5 ns = 25.2 us.
Finally, for the sorting time, we need to sort an integer vector whose length equals the number of entries stored in the database. We implemented an iterative merge sorter with 6300 inputs of short type (signed 16-bit) using Vivado HLS; the latency results are shown in Table 6. Since the value of each distance is bounded by the length of the Hamming vector, the input data width of the design can be further optimised. We use this simple design as a proof of concept, since designing a high-performance sorter is not the main objective of this paper. We are aware that better designs for retrieving the top-k exist; this design serves as an upper bound on the time for taking the top-k elements. The implementation results show little resource usage, and the design is able to run at 200 MHz. In fact, it is designed to return the result in log2(N) stages, so the output latency is deterministic given the running frequency.
In practice, our pipelined implementation with minimal optimisation (array partitioning) in Vivado HLS has an initial latency of 0.410 ms and an interval latency of 6304 cycles = 31.52 us, as shown in Table 6. In theory, the interval latency ($N$ cycles) is the latency of a single stage, and the initial latency ($N \lceil \log_2 N \rceil$ cycles) is the total length of the pipeline; the implementation summary shows that the actual latency corresponds to the theoretical value. The resource usage is shown in Table 7. Putting everything together, the total time taken by the test data is $T_{query}$ = 7000 × (1 - 0.8) × (2.5 us + 25.2 us + 31.52 us) + 0.410 ms = 83.318 ms. Adding the time for the two phases gives $T_{prop}$ = 12.6 ms + 83.318 ms = 95.917 ms.
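The case-study numbers above can be reproduced with a few lines (self-contained; all constants are the ones derived in this subsection):

```python
D, r, sp, clk = 7000, 0.8, 0.9, 5e-9          # epsilon entries, split ratio, sample portion, 200 MHz
t_rp   = (1000 * 2000 / 4000) * clk           # 2.5 us per vector (m*M/N cycles)
t_ham  = sp * r * D * clk                     # 25.2 us: one stored vector scanned per cycle
t_sort = 6304 * clk                           # 31.52 us sorter interval latency
L      = 0.410e-3                             # sorter initial latency from Table 6
t_store = sp * r * D * t_rp                   # Eq. 14 -> 12.6 ms
t_query = (1 - r) * D * (t_rp + t_ham + t_sort) + L   # Eq. 15 -> 83.318 ms
print(round((t_store + t_query) * 1e3, 3))    # ~95.918 ms (0.095917 s in Table 5, up to rounding)
```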
Situation 2 — when all data are stored in the workers. Under this condition, the speed-up is dominated by the CPU computation time. KNeighborsClassifier uses SciPy's Hamming distance, which returns a normalised distance of type double. This does not fit our design purpose, which requires only a small non-normalised integer distance as the output type. As a result, we implemented our own version of kNN that can take our own Hamming distance function as input. Re-running the experiments on our implementation gives a speed-up of 7.5 under this condition. Although this speed-up appears higher than in Situation 1, it does not reflect the real behaviour of our design, because our own implementation does not leverage parallel computing on the CPU. The problem then becomes memory bound, meaning the time difference is dominated by the memory retrieval time of a full-precision dataset versus a Hamming-encoded one, which outweighs the benefit of computing RP on the FPGA. Obtaining a comparison equivalent to Situation 1 would require the exact same software optimisations as the scikit-learn library, which is beyond the scope of this study. Fortunately, the dataset has to be massive to trigger Situation 2; the trigger size is presented in the last column of Table 5. The computation detail is shown in Section 8.2.2.
With respect to classification accuracy, our simulation results show that by configuring the experiment with [projection dimension = 1000, sample portion = 0.9, number of splits = 18], we could recover the accuracy loss caused by the RP and binarisation process. In terms of memory usage, depending on the data type of the original dataset, the Hamming space gives a fixed compression ratio $W$, i.e., the ratio of the bit width of the original data type to 1; for example, $W$ = 32 for the epsilon dataset, which uses floating-point numbers. The total memory reduction (the ratio of the original memory size to the compressed memory size) can be calculated as:

$$R = \frac{W \cdot c}{sp \cdot E}$$

where $c$ is the ratio between the original dimension and the projected dimension, $sp$ is the sample portion, and $E$ is the size of the ensemble needed to recover the accuracy.
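As a worked check, using the epsilon configuration above ($W = 32$, $c = 2000/1000 = 2$, $sp = 0.9$, $E = 18$) and assuming the reduction formula as reconstructed:

$$R = \frac{32 \times 2}{0.9 \times 18} \approx 3.95,$$

which matches the total memory reduction reported for epsilon in Table 5.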
7.3.1. Quantifying the Gap between Forwarding Latency and Computation Latency
Based on the official NetFPGA-SUME wiki [14], the average port-to-port latency of the reference switch is 823 ns. This corresponds to $T_{fwd}$ in Equations 12 and 13. For the epsilon dataset, $T^{F}_{RP}$ takes more than 3 times longer than $T_{fwd}$. As long as the total transmission latency is shorter than $T^{F}_{RP}$, it can be hidden under the random projection latency; therefore, we could keep increasing the number of levels in the hierarchy until $T_{fwd}$ saturates $T^{F}_{RP}$. In addition, if a certain level of buffering is enabled in the hierarchy, the packet arrival interval can be controlled by sending packets back-to-back with a controlled burst interval. As long as the burst interval is smaller than $T^{F}_{RP}$, it can also be hidden under $T^{F}_{RP}$. Therefore, we only need to add the initial latency to the whole model, which is 823 ns times the number of levels in the hierarchy. Compared to $T_{prop}$, which is on the millisecond scale, this is still small.
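As a rough check of how deep the hierarchy can go before forwarding saturates the projection window (assuming the per-level forwarding latencies simply accumulate):

```python
t_rp_ns  = 2500      # random projection latency per vector for epsilon (2.5 us)
t_fwd_ns = 823       # NetFPGA-SUME reference switch port-to-port latency
print(t_rp_ns // t_fwd_ns)   # 3 levels of hierarchy fit under a single T_RP window
```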