Watermarking Degrades Alignment in Language Models (ICLR GenAI Workshop 2025)

We analyzed how two popular watermarking methods, KGW and Gumbel, affect language model alignment, and found tradeoffs that impact truthfulness, safety, and helpfulness. We propose "Alignment Resampling," a simple method to mitigate this alignment degradation, backed by theoretical insights and empirical results.

Paper: https://huggingface.co/papers/2506.04462

Feedback appreciated!