Offline Reinforcement Learning for LLM Multi-Step Reasoning

URL: arxiv.org

Well I guess we finally got the mythical 'Q*'. Or at least some variant of it using energy functions (I think that's what they mean by 'soft' Q-learning?). The extra boost from using the value function at test time is interesting as well.
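To be concrete about what "using the value function at test time" could mean (just a generic sketch, not necessarily what the paper actually does): sample several candidate completions and keep the one a learned value model scores highest. The names generate_candidates and value_model below are hypothetical stand-ins.

    def value_guided_decode(prompt, generate_candidates, value_model, k=8):
        # Sample k candidate completions, score each with the learned value
        # model, and return the highest-scoring one.
        candidates = generate_candidates(prompt, n=k)
        scores = [value_model(prompt, c) for c in candidates]
        return max(zip(scores, candidates), key=lambda sc: sc[0])[1]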

Gee I can't even understand the abstract.

Can someone explain in plain English how RL is even doable here, let alone desirable?

Multi-step reasoning means that the LLM is given a question (maths here) and generates an answer consisting of many intermediate tokens before returning the solution. Here, we don't want to tell the LLM how to solve the problem word-by-word. We want to tell it only at the end, "correct" or "incorrect", and have the model learn on its own to generate the intermediate steps that reach the solution.

That's typically a setup where RL is desirable (even necessary): we have sparse rewards (only at the end) and give the model no details on how to reach the solution. It's similar to training models to play chess against a specific opponent.
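For intuition, here is a toy sketch (mine, not the paper's method) of what "reward only at the end" looks like in code: a stand-in policy emits a sequence of tokens, nothing is checked until the last one, and the whole trajectory's log-probability gets scaled by that single terminal reward (plain REINFORCE, PyTorch).

    import torch
    import torch.nn.functional as F

    vocab_size, hidden = 32, 64
    cell = torch.nn.GRUCell(vocab_size, hidden)   # toy stand-in for the LLM
    head = torch.nn.Linear(hidden, vocab_size)
    opt = torch.optim.Adam(list(cell.parameters()) + list(head.parameters()), lr=1e-3)

    def rollout(max_steps=10):
        # Generate a "reasoning chain" one token at a time, keeping log-probs.
        h = torch.zeros(1, hidden)
        tok = torch.zeros(1, vocab_size)
        log_probs, tokens = [], []
        for _ in range(max_steps):
            h = cell(tok, h)
            dist = torch.distributions.Categorical(logits=head(h))
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            tokens.append(a.item())
            tok = F.one_hot(a, vocab_size).float()
        return tokens, torch.stack(log_probs)

    def terminal_reward(tokens, target=7):
        # Sparse reward: only the final "answer" token is checked.
        return 1.0 if tokens[-1] == target else 0.0

    for _ in range(200):
        tokens, log_probs = rollout()
        loss = -terminal_reward(tokens) * log_probs.sum()   # REINFORCE
        opt.zero_grad(); loss.backward(); opt.step()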

What is an ELI5 explanation of KL-regularization and entropy maximization for selecting the policy?

Edited: I found this to be useful for explaining maximum entropy https://awjuliani.medium.com/maximum-entropy-policies-in-rei...

The way I think of it in chess: taking a piece increases your material value, but you also have to consider your position, i.e. how your pieces can move, and that's something like the entropy. So maximum entropy means taking pieces while still valuing strategic position (the policy keeps its options open). But there must be some "confluence" term: how much having many reachable states (new positions) is itself a good thing. I don't know how to relate that confluence term to entropy mathematically. From a computational point of view, a huge number of states makes computing the best move intractable, but at the same time it can make the achievable optimum larger, so it comes down to how well the algorithm, given its compute budget, can approximate a maximum that grows with the number of states. There must be a trade-off here, which is what I called confluence.

Also thanks for all explanations.

About KL-regularization, think of it like training wheels for the robot's brain. It helps the robot's learning process by preventing it from making drastic changes to its strategy too quickly.

It's like saying, "Hey robot, remember what you learned last time? Don't forget it completely, but feel free to adjust a bit."

You will have a hyperparameter that weights the KL divergence (between the updated policy distribution and the current/reference policy distribution); that lets you tune how aggressive each training update is. Entropy maximization is common in offline RL specifically because it ensures the policy retains at least some non-determinism and isn't bound so closely to the data you collected that it becomes essentially deterministic. This is also tunable with a weight.
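Roughly, the two knobs being described look something like this in code (a generic PyTorch sketch, not this paper's exact objective; beta and alpha are the tunable weights mentioned above):

    import torch
    import torch.nn.functional as F

    def regularized_policy_loss(new_logits, ref_logits, actions, advantages,
                                beta=0.1, alpha=0.01):
        new_logp = F.log_softmax(new_logits, dim=-1)
        ref_logp = F.log_softmax(ref_logits, dim=-1)

        # Policy-gradient term: raise log-prob of actions with positive advantage.
        chosen = new_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        pg_loss = -(advantages * chosen).mean()

        # KL(new || ref): the "training wheels" keeping each update close to the
        # current / reference policy.  Weighted by beta.
        kl = (new_logp.exp() * (new_logp - ref_logp)).sum(-1).mean()

        # Entropy of the new policy: subtracting it (weighted by alpha) rewards
        # keeping some randomness instead of collapsing to a deterministic policy.
        entropy = -(new_logp.exp() * new_logp).sum(-1).mean()

        return pg_loss + beta * kl - alpha * entropy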

RL works when you have some kind of verifier or ground truth; e.g. for math (and to some extent, coding, if you have tests and/or a type checker). You can also do it for simulations. This paper focuses on math and "embodied agent control" (i.e. simulation).
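Concretely, the "verifier" for math can be as simple as an exact-match check on the final answer (sketch below; the "Answer:" convention and helper name are my own assumptions, not from the paper):

    import re

    def extract_final_answer(completion):
        # Assumes the model was prompted to finish with "Answer: <number>".
        m = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
        return m.group(1) if m else None

    def math_reward(completion, ground_truth):
        # 1.0 if the final answer matches the ground truth, else 0.0;
        # the intermediate reasoning is never checked directly.
        pred = extract_final_answer(completion)
        return 1.0 if pred is not None and float(pred) == float(ground_truth) else 0.0

    # math_reward("... so 12 + 7 = 19. Answer: 19", "19")  ->  1.0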