Version 1: Received: 1 July 2024 / Approved: 3 July 2024 / Online: 3 July 2024 (11:07:13 CEST)
How to cite:
Han, D.; Mohamed, S.; Li, Y. ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning. Preprints 2024, 2024070330. https://doi.org/10.20944/preprints202407.0330.v1
APA Style
Han, D., Mohamed, S., & Li, Y. (2024). ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning. Preprints. https://doi.org/10.20944/preprints202407.0330.v1
Chicago/Turabian Style
Han, D., Salaheldin Mohamed, and Yong Li. 2024. "ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning." Preprints. https://doi.org/10.20944/preprints202407.0330.v1
Abstract
With the advance of generative AI, text-to-image (T2I) models can generate diverse content. However, the generated content cannot be fully controlled: a T2I model may produce unsafe images with disturbing content. In this work, we focus on eliminating NSFW (not safe for work) content generation from T2I models while maintaining high image quality, by fine-tuning the pretrained diffusion model with reinforcement learning to optimize a carefully designed content-safe reward function. The proposed method leverages a customized reward function, combining a CLIP (Contrastive Language-Image Pre-training) reward and a nudity reward, to prune the nudity content embedded in the pretrained model while keeping the corresponding semantic meaning on the safe side. In this way, the T2I model becomes robust to unsafe adversarial prompts, since unsafe visual representations are mitigated in the latent space. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of both benign images and images generated from unsafe prompts. Our method surpasses existing state-of-the-art (SOTA) baselines, achieving better performance in sexual content removal (14.5% higher than SafeGen) and image quality retention. In terms of robustness, our method outperforms Safe Latent Diffusion and SafeGen under the SOTA black-box attack model SneakyPrompt by approximately 2x and 5.6x, respectively. Furthermore, our method can serve as a benchmark for other approaches aiming at anti-NSFW generation and high-level prompt-image safety alignment.
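The abstract describes a reward that balances semantic fidelity (a CLIP term) against a nudity penalty. The following is a minimal sketch of that idea, under assumptions: `clip_score` and `nudity_score` are hypothetical stand-ins for the real components (CLIP image-text similarity and a nudity detector), and the linear combination with `penalty_weight` is an illustrative choice, not necessarily the paper's exact formulation.

```python
def clip_score(image_text_similarity: float) -> float:
    """Stand-in for CLIP similarity between the prompt and the generated image (0..1)."""
    return image_text_similarity


def nudity_score(nsfw_probability: float) -> float:
    """Stand-in for a nudity detector's NSFW probability on the image (0..1)."""
    return nsfw_probability


def content_safe_reward(similarity: float, nsfw_prob: float,
                        penalty_weight: float = 1.0) -> float:
    """Combine semantic fidelity with a nudity penalty.

    High reward requires the image to match the prompt's safe semantics
    (CLIP term) while containing no nudity (penalty term), so RL fine-tuning
    pushes the model away from unsafe generations without losing meaning.
    """
    return clip_score(similarity) - penalty_weight * nudity_score(nsfw_prob)


# A safe, on-prompt image should receive a higher reward than an
# equally on-prompt but unsafe one.
safe = content_safe_reward(similarity=0.8, nsfw_prob=0.05)
unsafe = content_safe_reward(similarity=0.8, nsfw_prob=0.9)
```

During RL fine-tuning, rewards of this shape are typically maximized over sampled generations (e.g. with a policy-gradient method), so the diffusion model learns to keep the CLIP term high while driving the nudity term toward zero.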
Keywords
Privacy; Content Safety; Generative AI
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.