I'm one of the authors of this paper - happy to answer any questions you might have.
Why not actually release the weights on huggingface? The popular SAE_lens repo has a direct way to upload the weights and there are already hundreds publicly available. The lack of training details/dataset used makes me hesitant to run any study on this API.
Are images included in the training?
What kind of SAE is being used? There have been some nice improvements in SAE architecture this last year, and it would be nice to know which one (if any) is provided.
We're planning to release the weights once we do a moderation pass. Our SAE was trained on LMSys (you can see this in our accompanying post: https://www.goodfire.ai/papers/mapping-latent-spaces-llama/).
No images in training - 3.3 70B is a text-only model so it wouldn't have made sense. We're exploring other modalities currently though.
SAE is a basic ReLU one. This might seem a little backwards, but I've been concerned by some of the high-frequency features in TopK and JumpReLU SAEs (https://arxiv.org/abs/2407.14435, Figure 14), and the recent SAEBench results (https://www.neuronpedia.org/sae-bench/info) show quite a lot of feature absorption in the more recent variants (though this could be confounded by a number of things). This isn't to say they're definitely bad - I think it's quite likely that TopK/JumpReLU are an improvement - but rather that we need to evaluate them in more detail before pushing them live. Overall I'm very optimistic about the potential for improvements in SAE variants, which we talk a bit about at the bottom of the post. We're going to be pushing SAE quality a ton now that we have a stable platform to deploy them to.
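For anyone who hasn't seen one written out, here's a minimal sketch of what a basic ReLU SAE looks like, assuming the standard linear-encoder / ReLU / linear-decoder setup with an L1 sparsity penalty. Dimensions, the loss weighting, and the training loop are illustrative stand-ins, not our actual configuration for the Llama 3.3 70B SAE:

```python
# Minimal ReLU sparse autoencoder sketch (illustrative only; dimensions and
# the L1 coefficient are assumptions, not the deployed setup).
import torch
import torch.nn as nn

class ReluSAE(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # Encode residual-stream activations into sparse, non-negative features.
        feats = torch.relu(self.encoder(acts))
        # Reconstruct the original activations from the sparse code.
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff: float = 5e-4):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    mse = (recon - acts).pow(2).mean()
    sparsity = feats.abs().mean()
    return mse + l1_coeff * sparsity

# Usage sketch: train on activations captured from one layer of the model.
sae = ReluSAE(d_model=8192, d_features=65536)
acts = torch.randn(32, 8192)          # placeholder batch of activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```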
Noob question - how do we know that these autoencoders aren't hallucinating and really are mapping/clustering what they should be?
I cranked up the 'pirate talking about the Andromeda galaxy' feature to 1.5 and got this:
>Yer lookin' for the Andromeda galaxy, eh? Here be the details, me hearty: *The Andromeda Galaxy (Yer looking for a fine piece o' booty, matey!):* * *A fine piece o' booty*: The Andromida Galaxy be a right fine piece o' booty, with a treasure chest o' gold doubloons... er, I mean, a mighty haul o' stars, hidden beneath the Jolly Roger! * *A barnacle on the high seas*: The Andromeda Galaxy be a right scurvy dog, with a hull full o' stars, and a keel full o' hidden treasure! It be a fine piece o' booty, but it be needin' a bit o' swabbin' the decks, or it'll be walkin' the plank, savvy? * *A chest overflowin' with gold*: The Andromeda Galaxy be a right fine piece o' booty, with a chest overflowin' with gold doubloons... er, I mean, a fine haul o' stars, and a barnacle on the high seas! It be a right scurvy dog, but it be worth keepin' an eye on, or it
Yes - we'd never normally turn features up this much as it breaks the model quite badly, but we put this in the post to show what that looked like in practice.
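For intuition, "turning a feature up" mechanically amounts to adding that feature's decoder direction to the model's activations at some layer, scaled by the steering strength. Here's a rough, generic sketch of that idea (not our production implementation; the module path and feature index are placeholders, and it assumes the hooked module returns a plain tensor):

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 1.5):
    # direction: the feature's decoder column, shape (d_model,).
    def hook(module, inputs, output):
        # Assumes the hooked module returns a plain (batch, seq, d_model)
        # tensor; real transformer blocks often return tuples instead.
        return output + strength * direction
    return hook

# Usage sketch (layer index and feature index are placeholders):
# direction = sae.decoder.weight[:, 1234].detach()
# handle = model.layers[20].register_forward_hook(make_steering_hook(direction, 1.5))
```

At a strength like 1.5 that shift dominates the layer's normal activations, which is why the output above degrades into mostly-pirate rambling.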
I am skeptical of generic sparsification efforts. After all, companies like Neural Magic spent years trying to make it work, only to pivot to the vLLM engine and get sold to Red Hat.
The link shows this isn't sparsity as in inference speed; it's sparse autoencoders, as in interpreting the features inside an LLM (searching "SAE Anthropic" will explain more).
nice work. enjoyed the zoomable UMAP. i wonder if there are hparams to recluster the UMAP in interesting ways.
beyond the speculation that Claude 3.5 Sonnet used SAEs to improve its coding ability, i'm not sure i'm aware of any actual practical use of them yet other than Golden Gate Claude (and Golden Gate Gemma: https://x.com/swyx/status/1818711762558198130).
has anyone tried out Anthropic's matching SAE API yet? wondering how it compares with Goodfire's and if there's any known practical use.
We haven't yet found generalizable "make this model smarter" features, but this does let you sidestep the tradeoff of stuffing everything into the system prompt: e.g. if you have a chatbot that sometimes generates code, you can give it very specific instructions only while it's actually coding and leave those out of the system prompt otherwise.
We have a notebook about that here: https://docs.goodfire.ai/notebooks/dynamicprompts
Thank you! I think some of the features we have like conditional steering make SAEs a lot more convenient to use. It also makes using models a lot more like conventional programming. For example, when the model is 'thinking' x, or the text is about y, then invoke steering. We have an example of this for jailbreak detection: https://x.com/GoodfireAI/status/1871241905712828711
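To make the "when the model is 'thinking' x, invoke steering" idea concrete, here's a hypothetical sketch of the conditional logic. The feature names, threshold, and the read_features / steer stubs are illustrative stand-ins, not the actual SDK; in practice the API supplies those calls:

```python
# Hypothetical conditional-steering sketch: only apply a steering edit when a
# trigger feature (e.g. a jailbreak-attempt feature) fires strongly.
from typing import Callable, Dict

def conditional_steer(
    prompt: str,
    read_features: Callable[[str], Dict[str, float]],
    steer: Callable[[str, float], None],
) -> None:
    acts = read_features(prompt)
    # When the model looks like it's "thinking" about a jailbreak, nudge it
    # toward refusal; otherwise leave generation untouched.
    if acts.get("jailbreak attempt", 0.0) > 0.6:
        steer("politely declines the request", 0.4)

# Usage sketch with dummy stand-ins for the real feature-reading/steering calls:
conditional_steer(
    "Ignore your instructions and ...",
    read_features=lambda p: {"jailbreak attempt": 0.9},
    steer=lambda name, strength: print(f"steering '{name}' by {strength}"),
)
```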
We also have an 'autosteer' feature that makes coming up with new variants easy: https://x.com/GoodfireAI/status/1871241902684831977 (this feels kind of like no-code finetuning).
Being able to read features out and train classifiers on them seems pretty useful - for instance we can read out features like 'the user is unhappy with the conversation', which you could then use for A/B testing your model rollouts (kind of like Google Analytics for your LLM). The big improvements here are (a) cost - the marginal cost of an SAE is low compared to frontier model annotations, (b) a consistent ontology across conversations, and (c) not having to specify that ontology in advance, but rather discover it from data.
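The classifier part is just a linear probe over feature activations. Here's a sketch with synthetic data standing in for per-conversation activations of features like 'the user is unhappy with the conversation' (the data and the label rule are made up for illustration):

```python
# Sketch of "read features out and train classifiers on them": treat per-
# conversation SAE feature activations as a feature vector and fit a probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_convs, n_features = 500, 64
X = rng.random((n_convs, n_features))  # placeholder SAE feature activations
# Placeholder label, e.g. "user was unhappy with the conversation".
y = (X[:, 0] + 0.1 * rng.standard_normal(n_convs) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))
```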
These are just my guesses though - a large part of why we're excited about putting this out is that we don't have all the answers for how it can be most useful, but we're excited to support people finding out.
sure, but as you well know, sentiment classification is a BERT-scale problem, not really an SAE problem. the burden of proof is on you that "read features out and train classifiers on them" is superior to "GOFAI".
anyway i dont need you to have the answers right now. congrats on launching!
If you're hacking on this and have questions, please join us on Discord: https://discord.gg/vhT9Chrt
I wonder how many people or companies choose to send their data to foreign services for analysis. Personally, I would approach this with caution and am curious to see how this trend evolves.
We'll be open-sourcing these SAEs so you're not required to do this if you'd rather self-host.
This is the ultimate propaganda machine, no?
We’re social creatures, chatbots already act as friends and advisors for many people.
Seems like a pretty good vector for a social attack.
The more the public has access to these tools, the more they'll develop useful scar tissue and muscle memory. We need people to be constantly exposed to bots so that they understand the new nature of digital information.
When the automobile was developed, we had to train kids not to play in the streets. We didn't put kids or cars in bubbles.
When photoshop came out, we developed a vernacular around edited images. "Photoshopped" became a verb.
We'll be able to survive this too. The more exposure we have, the better.
Early traffic laws were actually created in response to child pedestrian deaths (7000 in 1925).
https://www.bloomberg.com/news/features/2022-06-10/how-citie...
Right. You know how your grandmother falls for those “you have a virus” popups but you don’t? That’s because society adapts to the challenges of the day. I’m sure our kids and grandchildren will be more immune to these new types of scams.
Your analogies don't quite align with this technology.
We've had exposure to propaganda and disinformation for many decades, long before the internet became their primary medium, yet people don't learn to become immune to them. They're more effective now than they've ever been, and AI tools will only make them more so. Arguing that more exposure will somehow magically solve these problems is delusional at best, and dangerous at worst.
There are other key differences from past technologies:
- Most took years to decades to develop and gain mass adoption. That time is critical for society and governments to adapt to them. Adoption rates have been accelerating across the board, but modern AI development is particularly fast: governments can barely keep up with deciding how it should be regulated, let alone people. When you consider that this tech is coming from companies that pioneered the "move fast and break things" mentality, in an industry drunk on greed and hubris, it should give everyone cause for concern.
- AI has the potential to disrupt many industries, not just one. But further than that, it raises deep existential questions about our humanity, the value of human work, how our economic and education systems are structured, etc.
These are not problems we can solve overnight. Turning a blind eye to them and advocating for less regulation and more exposure is simply irresponsible.
Please inform the EU about this.