3.2.1. Principles and Structural Analysis of the ω-k Algorithm
The imaging geometry of the UAV SAR is illustrated in Figure 2, where $R_b$ is the minimum slant range, $R_s$ is the distance from the antenna phase center (APC) to the scene center, $\alpha$ is the view angle, $\theta_s$ is the squint angle, and $\beta$ is the beamwidth. Considering a target located at $P(x_n, r_n)$, we can derive its instantaneous range $R(t_a; r_n)$ as follows:

$$R(t_a; r_n) = \sqrt{r_n^2 + (v t_a - x_n)^2} \tag{1}$$

where $x_n$ represents the ground distance from the target to the scene center, $r_n$ is the minimum slant range of the target, $t_a$ is the azimuth slow time, and $v$ is the forward velocity of the UAV.
The UAV SAR continuously transmits a linear frequency-modulated signal and receives the returned signal in dechirp mode. The transmitted signal can be expressed as:

$$s_t(\hat{t}) = \mathrm{rect}\!\left(\frac{\hat{t}}{T_p}\right)\exp\!\left\{j2\pi\!\left(f_c\hat{t} + \frac{1}{2}\gamma\hat{t}^2\right)\right\} \tag{2}$$

where $\gamma$ represents the chirp rate, $f_c$ represents the central frequency of the transmitted chirp signal, $T_p$ is the pulse width, and $\hat{t}$ represents the fast time. For simplicity, the amplitude is disregarded. After dechirp reception of the transmitted signal, the dechirped signal from a point target $P$ can be represented as:

$$s(\hat{t}, t_a) = \exp\!\left\{-j\frac{4\pi\gamma}{c}\!\left(\hat{t} - \frac{2R_s}{c}\right)R_\Delta - j\frac{4\pi f_c}{c}R_\Delta + j\frac{4\pi\gamma}{c^2}R_\Delta^2\right\} \tag{3}$$

where $R_\Delta = R(t_a; r_n) - R_s$, $\tau = 2R(t_a; r_n)/c$ represents the time delay of the target signal, and $c$ is the speed of light. The final term is the residual video phase (RVP) introduced by the dechirp operation, which can be eliminated by multiplying by an RVP correction factor. After removing the RVP, the signal becomes:

$$s(\hat{t}, t_a) = \exp\!\left\{-j\frac{4\pi}{c}\!\left[f_c + \gamma\!\left(\hat{t} - \frac{2R_s}{c}\right)\right]R_\Delta\right\} \tag{4}$$
Define $K_r = \frac{4\pi}{c}\left[f_c + \gamma\left(\hat{t} - \frac{2R_s}{c}\right)\right]$ as the range wavenumber, $K_x$ as the azimuth wavenumber, and $x = v t_a$ as the azimuth position. Then Equation (4) takes the following form in the wavenumber domain:

$$s(K_r, x) = \exp\left\{-jK_r\left[R(x; r_n) - R_s\right]\right\} \tag{5}$$
By performing the azimuth Fast Fourier Transform (FFT) and applying the principle of stationary phase (POSP), the signal in the range-azimuth two-dimensional wavenumber domain is obtained:

$$s(K_r, K_x) = \exp\left\{-j\sqrt{K_r^2 - K_x^2}\,r_n - jK_x x_n + jK_r R_s - jK_x v\hat{t}\right\} \tag{6}$$
The last term denotes the Doppler frequency shift induced by intra-pulse motion, which can be eliminated by multiplying by the coefficient:

$$H_1 = \exp\left\{jK_x v\hat{t}\right\} \tag{7}$$

After compensating for the Doppler frequency shift, the signal can be expressed as follows:

$$s(K_r, K_x) = \exp\left\{-j\sqrt{K_r^2 - K_x^2}\,r_n - jK_x x_n + jK_r R_s\right\} \tag{8}$$
Taking the distance from the APC to the scene center as the reference distance, $R_{\mathrm{ref}} = R_s$, the reference function can be constructed as:

$$H_{\mathrm{ref}}(K_r, K_x) = \exp\left\{j\sqrt{K_r^2 - K_x^2}\,R_s - jK_r R_s\right\} \tag{9}$$

Equation (8) is multiplied by $H_{\mathrm{ref}}$ to accomplish residual range cell migration correction (RCMC) and azimuth compression, yielding signals focused at the reference distance in the 2D wavenumber domain, as shown in Equation (10):

$$s(K_r, K_x) = \exp\left\{-j\sqrt{K_r^2 - K_x^2}\,(r_n - R_s) - jK_x x_n\right\} \tag{10}$$
For cells located at other ranges, Stolt interpolation is applied to Equation (10) to achieve fully focused processing, as shown below:

$$K_y = \sqrt{K_r^2 - K_x^2}, \qquad s(K_y, K_x) = \exp\left\{-jK_y(r_n - R_s) - jK_x x_n\right\} \tag{11}$$

where $K_y$ represents the vertical component of $K_r$ in the wavenumber domain. After applying the 2-D inverse FFT to Equation (11), the fully focused SAR image in the time domain is produced. The ω-k algorithm flowchart for real-time processing is depicted in Figure 3.
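To make the Stolt remapping of Equation (11) concrete, the short C sketch below evaluates $K_r$ over a few fast-time samples and maps it to $K_y$ for a single azimuth wavenumber bin. The carrier frequency, chirp rate, and wavenumber values are illustrative assumptions, not parameters from this system.

```c
#include <math.h>
#include <stdio.h>

#define NR 8        /* range samples shown in this sketch */
#define C0 3.0e8    /* speed of light, m/s */

int main(void)
{
    double fc    = 9.6e9;  /* assumed X-band carrier, Hz        */
    double gamma = 1.0e12; /* assumed chirp rate, Hz/s          */
    double kx    = 50.0;   /* one azimuth wavenumber bin, rad/m */

    for (int n = 0; n < NR; n++) {
        double that = n * 1.0e-6;                          /* fast-time offset, s */
        double kr = 4.0 * M_PI / C0 * (fc + gamma * that); /* range wavenumber    */
        double ky = sqrt(kr * kr - kx * kx);               /* Stolt-mapped Ky     */
        printf("kr = %.3f rad/m -> ky = %.3f rad/m\n", kr, ky);
    }
    return 0;
}
```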
3.2.2. SAR Real-Time Processing on MPSoC
The real-time processing of UAV-SAR is implemented on the Xilinx Zynq UltraScale+ ZU9EG MPSoC. As illustrated in Figure 6, the MPSoC mainly consists of an ARM processing system (PS) and programmable logic (PL). The PS contains four Cortex-A53 cores operating at up to 1.5 GHz and two real-time Cortex-R5 cores operating at up to 600 MHz; this configuration is especially well-suited to high-accuracy, complex tasks. The PS supports general peripheral connectivity, including UART, SPI, and I2C, through the Multiplexed Input/Output (MIO) interface. High-speed serial connectivity is provided by the Serial Input/Output Unit (SIOU) block, which supports various interface protocols, including PCIe, USB 3.0, DisplayPort, SATA, and Ethernet. The PL consists mainly of FPGA logic resources, which are better suited to high-speed, repetitive tasks. Integrating the PS and PL leverages both the high precision, flexibility, and configurability that the PS offers for complex tasks and the high throughput and parallel processing capabilities of customized hardware in the PL. The PS and PL are interconnected by multiple groups of interconnects that comply with the Advanced eXtensible Interface (AXI) specification of the ARM Advanced Microcontroller Bus Architecture (AMBA) standard. The AXI interconnects between the PS and PL support the AXI4, AXI4-Lite, and AXI4-Stream protocols, with a maximum bit width of 128 bits and a maximum clock rate of 250 MHz, providing a total transmission bandwidth of more than 3.2 GB/s.
1) Algorithm Partitioning. To efficiently implement the ω-k algorithm illustrated in Figure 4, it is essential to partition the algorithm into components suited to hardware and to software. Computations with low computational density but complex decision-making are better executed on the PS, where no specialized hardware design is required. Conversely, computations with high computational density that are amenable to parallelization benefit greatly from hardware acceleration.
Figure 5 illustrates the partitioning strategy of the ω-k algorithm on the MPSoC. To account for the non-uniform linear motion of the flight platform, a motion compensation module has been integrated into the processing workflow. The entire calculation process is divided into modules according to their mathematical operation types and functionalities. Low-density, high-complexity computations, i.e., small-batch computing such as workflow control, motion parameter calculation, vector generation for complex multiplications, interpolation kernel generation, raw data and GPS data reception, and imaging result output, are assigned to the PS side to fully leverage its flexibility and reduce debugging time. Meanwhile, operations with high computational density, i.e., large-batch computing such as large-scale complex multiplication, interpolation, FFT/IFFT, and corner-turning, are implemented as hardware accelerators in the PL to achieve pipelined parallel computing. This split is summarized in the sketch below.
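The following C fragment restates the partition as a simple dispatch table. The `task_t` structure and the task names are illustrative only; the implementation does not define such a structure.

```c
#include <stdio.h>

/* Illustrative dispatch table restating the PS/PL partition.
 * The structure and names are hypothetical, not from the design. */
typedef enum { EXEC_PS, EXEC_PL } exec_target_t;

typedef struct {
    const char   *name;    /* processing step            */
    exec_target_t target;  /* where the step is executed */
} task_t;

static const task_t omega_k_tasks[] = {
    { "workflow control",              EXEC_PS },
    { "motion parameter calculation",  EXEC_PS },
    { "vector/kernel generation",      EXEC_PS },
    { "large-scale complex multiply",  EXEC_PL },
    { "FFT/IFFT",                      EXEC_PL },
    { "Stolt interpolation",           EXEC_PL },
    { "corner-turning",                EXEC_PL },
};

int main(void)
{
    for (size_t i = 0; i < sizeof omega_k_tasks / sizeof omega_k_tasks[0]; i++)
        printf("%-32s -> %s\n", omega_k_tasks[i].name,
               omega_k_tasks[i].target == EXEC_PS ? "PS" : "PL");
    return 0;
}
```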
2) Algorithm Implementation on MPSoC. According to the algorithm partitioning strategy outlined above, a real-time processing implementation of the Mini-SAR based on the MPSoC is proposed, as illustrated in Figure 6. The PS side operates in a bare-metal environment and adopts the Asymmetric Multiprocessing (AMP) approach to enable multi-core operation. Each core is equipped with dedicated memory, while shared memory is used for inter-core communication.
In our design, Cortex-A53 core 0 is designated as the primary core, responsible for system initialization and workflow management. The main core is equipped with two Gigabit Ethernet interfaces and one UART interface. One Ethernet port is designated for receiving raw data, while the other is used for outputting imaging results. The Universal Asynchronous Receiver-Transmitter (UART) is used for receiving GNSS data. Cortex-A53 cores 1 and 2 are configured as slave cores: core 1 manages small-batch computing and initiates DMA transfers, while core 2 is responsible for DMA reception.
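As a minimal sketch of this shared-memory communication, the C fragment below implements a polling mailbox between the master and slave cores. The base address, register layout, and command encoding are assumptions for illustration; cache maintenance on the shared region is omitted.

```c
#include <stdint.h>

/* Hypothetical shared-memory mailbox for AMP inter-core signaling.
 * The base address and layout are illustrative, not from the design. */
#define SHM_BASE 0x7FF00000UL  /* assumed memory region visible to all cores */

typedef struct {
    volatile uint32_t cmd;      /* written by core 0 (master)     */
    volatile uint32_t status;   /* written by core 1/2 (slaves)   */
    volatile uint32_t buf_addr; /* physical address of data block */
    volatile uint32_t buf_len;  /* block length in bytes          */
} mailbox_t;

#define MBOX ((mailbox_t *)SHM_BASE)

/* Master (core 0): post a processing request to a slave core. */
static void mbox_post(uint32_t cmd, uint32_t addr, uint32_t len)
{
    MBOX->buf_addr = addr;
    MBOX->buf_len  = len;
    __asm__ volatile("dmb ish" ::: "memory"); /* order stores before cmd */
    MBOX->cmd = cmd;
}

/* Slave (core 1/2): spin until a command arrives. */
static uint32_t mbox_wait(void)
{
    while (MBOX->cmd == 0)
        ;
    return MBOX->cmd;
}
```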
On the PL side, the algorithm's large-batch computing units, such as FFT, complex multiplication, Stolt interpolation, and corner-turning, are designed as hardware accelerators. The accelerators use a fully pipelined design that is parameterized, configurable, and capable of supporting single-precision floating-point operations. During real-time processing, the raw data and intermediate results are stored in the DRAM on the PS side. To configure and invoke the accelerators, one master port and two slave ports are selected. Additionally, two sets of AXI-DMA, along with several MUX modules, are instantiated to facilitate data exchange between the PS and PL. Core 1 in the PS controls the reading and writing of registers in the MUX modules and accelerators through AXI-Lite via the master port; it handles the switching and selection of DMA data paths, along with the configuration of accelerator parameters. The two slave ports are used for transmitting and receiving parameter vectors, kernel vectors, and imaging data. When transferring data to the target accelerators, the two sets of AXI-DMA operate in simple mode and Scatter/Gather (SG) mode, respectively. They use the AXI4-Stream interface, which allows burst transfers of unlimited size. The SG mode is specifically designed for corner-turning, as it supports the efficient transfer of 2-D memory access patterns over an AXI4-Stream channel. In addition, a shared RAM block is designed to cache parameters and kernel vectors; it is shared among the accelerators to improve the utilization of on-chip RAM resources. A First-In-First-Out (FIFO) buffer is implemented on both the sending and receiving paths of the AXI4-Stream to enable high-speed data caching and protocol conversion.
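A minimal PS-side sketch of such a DMA invocation, assuming the Xilinx standalone XAxiDma driver in simple mode, is shown below. The device ID, buffer, and transfer length are placeholders; interrupt handling, the MUX configuration, and error recovery are omitted.

```c
#include "xaxidma.h"
#include "xparameters.h"
#include "xil_cache.h"

static XAxiDma AxiDma;

/* Initialize one AXI-DMA instance (simple mode, polled). */
int dma_init(void)
{
    XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);
    if (cfg == NULL)
        return XST_FAILURE;
    return XAxiDma_CfgInitialize(&AxiDma, cfg);
}

/* Stream one data block from PS DRAM to an accelerator and wait. */
int dma_send_block(void *buf, u32 len_bytes)
{
    /* Flush the cache so the DMA sees the data written by the PS. */
    Xil_DCacheFlushRange((UINTPTR)buf, len_bytes);

    int status = XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)buf,
                                        len_bytes, XAXIDMA_DMA_TO_DEVICE);
    if (status != XST_SUCCESS)
        return status;

    while (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))
        ;   /* poll until the transfer to the PL completes */
    return XST_SUCCESS;
}
```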
In this implementation, the accelerators function as callable entities within the PS. As the PL algorithm flow illustrated in Figure 8 shows, the FFT/IFFT, complex multiplication, interpolation, and corner-turning operations are executed 25 times out of the 28 operations in the entire process, accounting for nearly 90% of the PL processing time. The design of these key accelerators is therefore described below:
a) FFT/IFFT Accelerator. The algorithm requires five FFT calculations and four IFFT calculations, all of which can be performed by a single accelerator. In this design, the FFT/IFFT accelerator is implemented using the FFT IP core from Xilinx, which offers runtime reconfigurability for both forward and inverse transforms and supports transform lengths of up to 65,536 points. The FFT IP core is configured to operate in block floating-point mode with adaptive scaling, managed by the PS through AXI-Lite, and has a default transform length of 4,096 points for both FFT and IFFT. To achieve faster computing speeds and higher throughput, a pipelined computing architecture is used, in which raw data is continuously streamed into the data input port while calculated results are received at the data output port. The schematic diagram of the FFT/IFFT accelerator is depicted in Figure 7. Because the input and output ports of the FFT IP require AXI-Stream data flow, protocol-conversion wrappers implemented with the AXI4-Stream Data FIFO IP are added at the input and output of the FFT IP to convert between the native and AXI-Stream protocols.
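The runtime reconfiguration can be pictured as a pair of AXI-Lite register writes from the PS. The base address and register offsets below are hypothetical stand-ins for the wrapper's actual register map, which is not specified here.

```c
#include <stdint.h>

/* Hypothetical AXI-Lite register map for the FFT wrapper; the base
 * address and offsets are illustrative, not from the design. */
#define FFT_CTRL_BASE 0xA0000000UL /* assumed PL address       */
#define REG_DIR       0x00         /* 1 = forward, 0 = inverse */
#define REG_LOG2N     0x04         /* transform length = 2^N   */
#define REG_START     0x08

static inline void reg_wr(uintptr_t off, uint32_t v)
{
    *(volatile uint32_t *)(FFT_CTRL_BASE + off) = v;
}

/* Reconfigure the shared accelerator at run time: forward FFT or IFFT. */
void fft_configure(int forward, unsigned log2n)
{
    reg_wr(REG_DIR, forward ? 1u : 0u);
    reg_wr(REG_LOG2N, log2n);   /* e.g. 12 for the 4,096-point default */
    reg_wr(REG_START, 1u);
}
```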
b) Complex Multiplication Accelerator. The algorithm involves a total of nine complex multiplications: two windowed complex multiplications in the motion compensation step and seven complex multiplications in the other steps. The schematic diagram of the complex multiplication accelerator is illustrated in Figure 10. For the two windowed complex multiplications, the parameter vector is calculated by the PS and transmitted to the common RAM block; the original data and window parameters are then continuously read on the PL side to perform pipelined, windowed complex multiplication. The remaining seven complex multiplications can be uniformly represented as:

$$s_{\mathrm{out}} = s_{\mathrm{in}} \cdot e^{j\varphi} \tag{12}$$

where $j$ is the imaginary unit, $\varphi$ represents a coefficient to be solved, and $s_{\mathrm{in}}$ represents the original data or intermediate data during processing. In these complex multiplications, only the process of generating the $\varphi$ values differs, while all other calculations remain unchanged. Therefore, the architecture of the complex multiplication accelerator is designed as depicted in Figure 8, with $e^{j\varphi}$ evaluation and complex multiplication (the CommonMUL module) as reusable computing resources and the seven $\varphi$-generation modules as independent computing resources, achieving efficient utilization of computing resources.
When executing each of the seven calculations, the corresponding generation module continuously reads the parameter vectors from the shared RAM block and performs pipelined calculations over a total of 4k × 4k points. The generated $\varphi$ values are routed through the MUX to the trigonometric function unit, and the results are stored in a synchronous FIFO. The values in this FIFO and the data in the DMA_TX_FIFO are then read simultaneously and passed to the CommonMUL module to complete the corresponding complex multiplication in the algorithm flow.
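In C terms, the CommonMUL datapath reduces to the loop below: evaluate $e^{j\varphi}$ and multiply it into the data stream, per Equation (12). The function name and the software loop are illustrative; in the PL this is a fully pipelined hardware path fed by FIFOs.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

typedef float complex cf32;

/* Illustrative software model of the CommonMUL path: for each sample,
 * form exp(j*phi) (the trigonometric unit) and multiply it into the
 * data stream, implementing Eq. (12). */
void common_mul(cf32 *data, const float *phi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        cf32 w = cosf(phi[i]) + sinf(phi[i]) * I;  /* e^{j*phi}        */
        data[i] *= w;                              /* s_out = s_in * w */
    }
}
```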
c) Stolt Interpolation Accelerator. The interpolation is implemented with a sinc interpolation kernel of 16,384 × 8 points in single-precision floating-point format. The sinc interpolation kernel is precalculated on the PS side, and a Kaiser window is applied to mitigate the ringing caused by the Gibbs effect; the kernel is then transmitted to the shared RAM block through DMA. On the PL side, the sinc interpolation coefficients are obtained by table lookup. This method eliminates the need to calculate the sinc function, Kaiser window, and normalization factor for each interpolation point, thereby significantly reducing the consumption of hardware computing resources. The structural diagram of the interpolation module is presented in Figure 9. To facilitate pipelined computing, the raw data from the DMA_RX_FIFO is stored in origRAM frame by frame using a ping-pong scheme. After a complete frame has been cached, the interpolation calculation for that frame is activated. The StoltInt module continuously transfers parameter-vector addresses to the common RAM block to retrieve the parameter vectors needed for the calculations. Using these vectors, it continuously generates the original point coordinates and the row addresses of the sinc interpolation kernel: the original point addresses are sent to origRAM to fetch eight consecutive raw data points, while the kernel row addresses are sent to the common RAM block to fetch the corresponding eight sinc interpolation coefficients. Finally, the eight raw data points and the eight coefficients are multiplied and summed to produce the interpolation result.
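A PS-side precalculation of one kernel row might look like the C sketch below. The table dimensions follow the text (16,384 fractional positions × 8 taps); the Kaiser β value and the tap-centering convention are assumptions.

```c
#include <math.h>
#include <stdio.h>

#define N_FRAC 16384  /* fractional offsets tabulated           */
#define N_TAP  8      /* taps per interpolation point           */
#define BETA   5.0    /* assumed Kaiser window shape parameter  */

static double bessel_i0(double x)  /* series expansion of I0(x) */
{
    double sum = 1.0, term = 1.0;
    for (int k = 1; k < 25; k++) {
        term *= (x / (2.0 * k)) * (x / (2.0 * k));
        sum += term;
    }
    return sum;
}

static double sinc(double x)
{
    return (x == 0.0) ? 1.0 : sin(M_PI * x) / (M_PI * x);
}

/* Fill one kernel row: 8 windowed, normalized sinc taps for a
 * fractional offset mu in [0, 1). */
static void kernel_row(float row[N_TAP], double mu)
{
    double norm = 0.0;
    for (int t = 0; t < N_TAP; t++) {
        double x = t - (N_TAP / 2 - 1) - mu;  /* tap position  */
        double w = bessel_i0(BETA * sqrt(1.0 - pow(2.0 * x / N_TAP, 2.0)))
                 / bessel_i0(BETA);           /* Kaiser window */
        row[t] = (float)(sinc(x) * w);
        norm += row[t];
    }
    for (int t = 0; t < N_TAP; t++)           /* normalization */
        row[t] = (float)(row[t] / norm);
}

int main(void)
{
    float row[N_TAP];
    kernel_row(row, 5.0 / N_FRAC);  /* one example fractional offset */
    for (int t = 0; t < N_TAP; t++)
        printf("tap %d: %f\n", t, row[t]);
    return 0;
}
```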
d) Corner-Turning Accelerator. The corner-turning accelerator primarily utilizes the AXI-DMA operating in scatter-gather (SG) mode, which enables efficient 2-D memory access, together with an SRAM block. In the design, the 4k × 4k data matrix is divided into 4,096 smaller matrices, each containing 64 × 64 points. Each small matrix is written into the cache SRAM row by row and read out column by column under logical control in the PL. After iterating over all blocks, the entire 4k × 4k matrix has been corner-turned.
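Functionally, the scheme is a blocked matrix transpose. The C sketch below models it, with a local 64 × 64 array standing in for the PL cache SRAM. The function name and in-memory layout are illustrative; in hardware the blocks are moved by the SG-mode AXI-DMA rather than by memcpy.

```c
#include <complex.h>
#include <stddef.h>
#include <string.h>

#define N   4096  /* full matrix dimension (4k x 4k)   */
#define BLK 64    /* small-matrix dimension (64 x 64)  */

typedef float complex cf32;

/* Blocked corner-turning: write each 64x64 block into the buffer row
 * by row, read it out column by column into the transposed position. */
void corner_turn(const cf32 *src, cf32 *dst)
{
    static cf32 sram[BLK][BLK];  /* stands in for the PL block SRAM */

    for (size_t bi = 0; bi < N; bi += BLK) {
        for (size_t bj = 0; bj < N; bj += BLK) {
            /* write the small matrix into the buffer row by row */
            for (size_t r = 0; r < BLK; r++)
                memcpy(sram[r], &src[(bi + r) * N + bj], BLK * sizeof(cf32));
            /* read it out column by column into the transposed block */
            for (size_t c = 0; c < BLK; c++)
                for (size_t r = 0; r < BLK; r++)
                    dst[(bj + c) * N + (bi + r)] = sram[r][c];
        }
    }
}
```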