Over the past few decades, a series of rendering and texture synthesis algorithms proposed in computer-aided design research [22,23,24] have predominantly focused on image stylization. The involvement of artificial intelligence in art has introduced more creative applications [25]. The confluence of these fields can be traced back to the advent of Generative Adversarial Networks (GAN) [26], in which two neural networks, a generator and a discriminator, compete to produce images resembling the real data distribution.
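For reference, this adversarial game is typically formulated as the minimax objective
\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],
\]
where the generator $G$ maps noise $z$ to images and the discriminator $D$ estimates the probability that an image is real.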
DeepDream [27] is the first to explore neural networks' potential to inspire artistic creation: using convolutional neural networks (CNNs), it transforms input images into highly interpretable, dream-like visualizations. Another CNN-based study, A Neural Algorithm of Artistic Style [28], separates and recombines the content and style of natural images, pioneering Neural Style Transfer (NST).
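Concretely, NST synthesizes an image $\vec{x}$ by jointly minimizing a content loss and a style loss,
\[
\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_{\mathrm{content}}(\vec{p}, \vec{x}) + \beta\,\mathcal{L}_{\mathrm{style}}(\vec{a}, \vec{x}),
\]
where $\vec{p}$ is the content image, $\vec{a}$ the style image, and the style loss compares Gram matrices of CNN feature maps across layers.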
Although NST has significantly influenced the treatment of artistic style, it typically learns stylistic traits from existing images rather than facilitating original artistic creation. Elgammal et al. introduce Creative Adversarial Networks (CAN) [29], which extend GAN to maximize divergence from established styles while staying within the distribution of art, thereby fostering creativity.
The emergence of the Transformer [30,31] has propelled the application of multimodal neural networks across many tasks [32,33].
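At the core of these models is scaled dot-product attention,
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\]
where the queries $Q$, keys $K$, and values $V$ may come from the same modality (self-attention) or from different ones (cross-attention), the latter being the mechanism exploited by the adapter methods discussed below.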
For example, Hu et al. [34] propose utilizing transformers for roof extraction and height estimation. Similarly, transformers have demonstrated tremendous potential in remote sensing detection [35,36] and re-identification [37,38,39]. Recently, text-guided image generation with generative artificial intelligence, such as GLIDE [40], CogView [41], Imagen [42], Make-A-Scene [43], eDiff-I [44], and RAPHAEL [45], has gained widespread use, driven largely by advances in large-scale diffusion models [46]. To achieve favorable results on downstream tasks, past practice has been to spend substantial computational resources fine-tuning the model. Many recent studies instead build upon Stable Diffusion (SD), incorporating adapters for guided generation that require minimal additional training while keeping the original model parameters frozen.
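For context, both full fine-tuning and adapter training optimize the standard noise-prediction objective of latent diffusion (the notation here is the usual one, not specific to any single method):
\[
\mathcal{L} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2\right],
\]
where $z_t$ is the noised latent at timestep $t$, $c$ the conditioning (e.g., text), and $\epsilon_\theta$ the denoising U-Net; adapters inject additional conditions into $\epsilon_\theta$ while its original weights stay frozen.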
ControlNet [9] pioneered this approach, learning task-specific input conditions such as depth maps [47] and Canny edges [48]. Its Reference-only mode efficiently transfers the style and subject of a reference image while conforming to textual descriptions, eliminating the need for additional training.
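Schematically, ControlNet attaches a trainable copy $\mathcal{F}(\cdot;\Theta_c)$ of each frozen SD encoder block $\mathcal{F}(\cdot;\Theta)$ and connects the two through zero-initialized convolutions $\mathcal{Z}$:
\[
y = \mathcal{F}(x; \Theta) + \mathcal{Z}\big(\mathcal{F}\big(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c\big); \Theta_{z2}\big),
\]
so that at the start of training the added branch contributes nothing and the frozen model's behavior is preserved.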
In the T2I-Adapter [10] approach, the reference image is fed into a style adapter that extracts stylistic features and integrates them with text features. Uni-ControlNet [49] employs two adapters, one for global and one for local control, enabling diverse conditions to be combined. IP-Adapter [11] supports the fusion of image features with text features via the cross-attention layers, preserving both the main subject of the image and its stylistic attributes.
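In essence, IP-Adapter adds a decoupled cross-attention stream for the image condition,
\[
Z^{\mathrm{new}} = \mathrm{Attention}(Q, K_t, V_t) + \lambda\,\mathrm{Attention}(Q, K_i, V_i),
\]
where $(K_t, V_t)$ are projected from text features, $(K_i, V_i)$ from image features, and $\lambda$ balances the two conditions.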
Additionally, PCDMs [50] fine-tune the entire SD model with an inpainting formulation to achieve pose-guided human generation; however, training with all parameters unfrozen is not economical in practice.
While these adapters effectively generate similar styles from image guidance, the generated results tend to closely resemble the specific reference image, which limits their value as references in design practice. Artistic creation often requires diverse images to draw inspiration from when exploring various themes; generated images must therefore exhibit diversity while maintaining coherence between the visuals and the accompanying text descriptions. Our study centers on woodcut-style design, employing dual cross-attention mechanisms to harmonize visual and textual features. Moreover, our approach facilitates the transfer of a specific artist's style across different thematic contexts. A minimal sketch of such a dual cross-attention block is given below.
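As a rough illustration only (not the actual architecture of this paper), a dual cross-attention block in the spirit of the decoupled formulation above can be sketched in PyTorch; all module names, dimensions, and the weighting factor lam are assumptions for exposition:

    import torch
    import torch.nn as nn

    class DualCrossAttention(nn.Module):
        """Illustrative sketch: latent tokens attend separately to text
        features and image (style) features; the two attention outputs
        are combined with a weighting factor lam."""
        def __init__(self, dim, txt_dim, img_dim, heads=8):
            super().__init__()
            self.attn_txt = nn.MultiheadAttention(
                dim, heads, kdim=txt_dim, vdim=txt_dim, batch_first=True)
            self.attn_img = nn.MultiheadAttention(
                dim, heads, kdim=img_dim, vdim=img_dim, batch_first=True)

        def forward(self, z, txt, img, lam=0.5):
            # z:   (B, N, dim)     latent tokens (queries)
            # txt: (B, T, txt_dim) text-encoder features
            # img: (B, M, img_dim) image-encoder features
            out_txt, _ = self.attn_txt(z, txt, txt)  # text cross-attention
            out_img, _ = self.attn_img(z, img, img)  # image cross-attention
            return out_txt + lam * out_img           # weighted fusion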