
By dapurv5, 14 hours ago

URL: huggingface.co

1 comment

By dapurv5, 14 hours ago

We analyzed how two popular watermarking methods (KGW and Gumbel) affect language model alignment and found critical tradeoffs in truthfulness, safety, and helpfulness. We propose "Alignment Resampling," a simple method that mitigates this alignment degradation, backed by theoretical insights and empirical results.

Paper: https://huggingface.co/papers/2506.04462

Feedback appreciated!
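
For readers wondering what a resampling step could look like in practice, here is a minimal sketch. The post itself does not describe the mechanism, so this assumes a best-of-n reading: draw several watermarked generations and keep the one that an external reward model scores highest. The names `alignment_resample`, `toy_generate`, and `toy_reward` are hypothetical stand-ins, not the paper's code, and the fixed scores are made up for illustration only.

```python
from itertools import cycle
from typing import Callable


def alignment_resample(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Return the highest-reward candidate among n generations.

    Assumption: `generate` already applies the watermark, so every
    candidate (including the one returned) remains watermarked.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: reward(prompt, text))


# Toy stand-ins so the sketch runs without a model (hypothetical values).
_POOL = cycle(["draft A", "draft B", "draft C"])
_SCORES = {"draft A": 0.2, "draft B": 0.9, "draft C": 0.5}


def toy_generate(prompt: str) -> str:
    # Stands in for a watermarked sampler; cycles through fixed drafts.
    return next(_POOL)


def toy_reward(prompt: str, text: str) -> float:
    # Stands in for an external reward model.
    return _SCORES[text]


best = alignment_resample("Is the sky green?", toy_generate, toy_reward, n=3)
print(best)
```

In a real setup, `generate` would wrap a watermarked decoder and `reward` would call a trained reward model; the selection logic stays the same.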

Watermarking Degrades Alignment in Language Models (ICLR GenAI Workshop 2025)