JEP 483: Ahead-of-Time Class Loading and Linking


What does this mean for Clojure? At least loading the Clojure runtime should benefit, but what about app code loading?

I feel like for the Clojure applications where you need really fast startup, like tiny CLI utilities that don't do a lot of work, the improvements would be so marginal as to not matter much. The example they use in the JEP seems to have gone from a ~4 second startup to ~2 seconds, which, for a tiny CLI, would still feel pretty slow. You're better off using Babashka, ClojureScript, or any of the other solutions that already give fast startup.

And for bigger applications (web services and the like), you don't really care whether it takes 5 seconds or 10 seconds to start; you only restart the server during deployment anyway, so why would startup time matter so much?

Big apps where startup time matters are desktop/mobile GUI apps. These aren't heavily emphasized in the Clojure community (excluding ClojureScript), but they are feasible to build - and I do build some of them. If startup time is reduced by 40%, the end user will definitely notice it.

IMHO, while optimizations in the JVM are always welcome, they primarily address surface-level issues and don't tackle Clojure's core limitation: the lack of a proper tree shaker that understands Clojure semantics. GraalVM offers help here by doing whole-program optimization at the bytecode level, but a Clojure-specific tree shaker could take things further: it could eliminate unused vars before/during Clojure AOT, thereby reducing both program size and startup time. These improvements would happen before the JVM optimizations kick in, making everything that follows a nice extra bonus.

Clojure and the JVM are so dynamic that it's hard to infer which namespaces/vars/classes might be needed at runtime, and that makes static analysis like tree-shaking difficult. Who's to say some strings won't be concatenated together at runtime and used to load a namespace that was tree-shaken out? The only way to really know is to run the program.
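To make that concrete, here's a minimal Java sketch of the problem (the Clojure version would be requiring a namespace whose name is built from a runtime string; the com.example names are made up):

  // Why tree-shaking is hard on the JVM: the class name is computed at
  // runtime, so no static analysis of this file can prove which classes
  // must be kept.
  public class DynamicLoad {
      public static void main(String[] args) throws Exception {
          String kind = args.length > 0 ? args[0] : "Csv";
          String className = "com.example." + kind + "Handler";
          try {
              Class<?> c = Class.forName(className);
              System.out.println("loaded " + c.getName());
          } catch (ClassNotFoundException e) {
              // A tree shaker that dropped CsvHandler turns a working
              // program into this failure, and only at runtime.
              System.out.println("not on classpath: " + className);
          }
      }
  }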

IMHO, it doesn't need to be overly complicated, but I might be wrong. The optimizer doesn't have to handle all the Clojure code out there, at least to start. It could adopt a GraalVM-like approach, explicitly stating which constructs the optimizer can handle and encouraging developers to write their code accordingly. Alternatively, it could go with a CL declare-like construct [1], allowing developers to explicitly insert hints for the optimizer.

[1] https://www.lispworks.com/documentation/HyperSpec/Body/s_dec...

Interesting thought, I wonder if there's a way to reason about the magnitude of effect this would have.

Load-balanced web services on e.g. K8s may need to start and stop quite a lot if load varies. Any speedup will be welcome.

Also, I guess Java-based desktop applications like IntelliJ and DBeaver will benefit.

Very much so - I have a Java application running in Elastic Container Service that takes 35s-40s to start (it's not that heavy but it's running on limited hardware for cost reasons). Any improvement without needing to throw more hardware at it would be very welcome.

The 4 second application is a web server. They also give a basic example starting in 0.031s, fine for a CLI.

One of the use cases for startup time is AWS lambda and similar.

> The 4 second application is a web server. They also give a basic example starting in 0.031s, fine for a CLI.

Sure, my comment was more about the relative improvement. In the case of the 0.031s example (which is the number without the improvement), it gets down to 0.018s with this new AOT class loading. What value do you get from something starting in 0.018s instead of 0.031s? The difference is so marginal for that particular use case.

> One of the use cases for startup time is AWS lambda and similar.

I suppose that's one use case where it does make sense to really focus on startup times. But again, I'd rather use something where fast startup already exists (Babashka, ClojureScript) instead of having to add yet another build step to the process.

There are plenty of CLI applications that need to be low-overhead. E.g. Postgres can invoke a WAL archive command for backup purposes, and I specifically remember work being done to reduce the startup overhead of backup tools like pgBackRest / WAL-E.

If you're building e.g. a PS1 prompt replacement, you'll want to start, gather data, output the prompt, and exit in under 0.016s. Any slower and the user will see a visible delay.

If you're on a higher-FPS monitor, the budget shrinks accordingly: one frame is 1000ms divided by the refresh rate, so at 60fps you'll have ~16.7ms, and at 480fps only ~2.1ms.

The same applies for any app that should feel like it starts instantly.

Prebuilding a cache through a training run will be difficult between lambda invocations though, and SnapStart [1] already "solves" a lot of the issues a class cache might address.

[1] https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html

Of course, I wouldn't be surprised if the boffins at Lambda add some integration between SnapStart and class caching once their leadership can get it funded.

There are all kinds of spot solutions for this (Babashka, VM snapshots like SnapStart, GraalVM, using ClojureScript instead of Clojure), but it would still be very nice to just get a speedup running vanilla JVM Clojure out of the box, without fiddlery.

E.g. SnapStart only works on AWS, not on other FaaS platforms, and even there it needs extra cloud-infra fiddlery and costs extra.

> One of the use cases for startup time is AWS lambda and similar.

Agreed - I've got some JVM Lambdas that are quite slow to start (it doesn't take many libraries to make a heavy Lambda).

Someone else mentioned SnapStart, which I think came out this year, but there are enough caveats that I'm reluctant to try it in anger (big inherited code base that has shoved way too much into Lambda).

I don't know how this JEP affects Clojure, but if you want to use Clojure for fast-loading CLI apps, a good thing to look at is babashka (bb). I wrote about it here:

"Learning about babashka (bb), a minimalist Clojure for building CLI tools"

https://amontalenti.com/2020/07/11/babashka

It should benefit if namespaces are AOT-compiled by Clojure.

Note that OpenJ9 and Azul already do similar optimizations.

And Android, while not Java/JVM proper, more of a cousin, also has a similar JIT-cache concept as an intermediate step before doing AOT compilation of selected code.

Naturally, it's also welcome on the OpenJDK distributions.

The concern that jumps out at me is: what about flags that affect code generation? Some are tied to the subarch (e.g. "does this amd64 have AVX2?" - relevant if the cache is backed up and restored to a slightly different machine, or sometimes even if it reboots with a different kernel config), and others to Java's own flags (do compressed pointers affect codegen? what about disabling intrinsics?).

I don’t see any mention that code is actually going to be stored in a JITted form, so possibly it’s just architecture-independent loading and linking data being cached?

From a related JEP (on AOT): https://openjdk.org/jeps/8335368

  As another possible mismatch, suppose an AOT code asset is compiled to use a specific level of ISA, such as Intel’s AVX-512, but the production run takes place on a machine that does not support that ISA level. In that case the AOT code asset must not be adopted. Just as with the previous case of a devirtualized method, the presence of AVX-512 is a dependency attached to the AOT asset which prevents it from being adopted into the running VM.

  Compare this with the parallel case with static compilers: A miscompiled method would probably lead to a crash. But with Java, there is absolutely no change to program execution as a result of the mismatch in ISA level in the CDS archive. Future improvements are possible, where the training run may generate more than one AOT code asset, for a method that is vectorized, so as to cover various possibilities of ISA level support in production.
Also: https://openjdk.org/projects/leyden/

My impression from reading this was that it's about knowing which classes reference which other classes, when, and which jars everything is in.

So I think you’re right.

So it's a bit more linker-style optimization than compiler-related caching.

The JEP explains what this does:

"The AOT cache builds upon CDS by not only reading and parsing class files ahead-of-time but also loading and linking them."

While CDS (which has been available for years now) only caches a parsed form of the class files that got loaded by the application, the AOT cache will also "load and link" the classes.
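For reference, the JEP makes the "training run" explicit with a three-step workflow (app.jar and com.example.App here are placeholders for your own application):

  # 1. Training run: record which classes get loaded and linked
  java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf -cp app.jar com.example.App

  # 2. Create the AOT cache from the recorded configuration
  java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf -XX:AOTCache=app.aot -cp app.jar

  # 3. Production run: start with the cache
  java -XX:AOTCache=app.aot -cp app.jar com.example.App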

The ClassLoader.loadClass method docs explain what loading means: https://docs.oracle.com/en/java/javase/21/docs/api/java.base...

1. find the class (usually by looking at the file-index of the jar, which is just a zip archive, but ClassLoaders can implement this in many ways).

2. link the class, which is done by the resolveClass method: https://docs.oracle.com/en/java/javase/21/docs/api/java.base... and explained in the Java Language Specification: https://docs.oracle.com/javase/specs/jls/se21/html/jls-12.ht...

"Three different activities are involved in linking: verification, preparation, and resolution of symbolic references."

Hence, I assume the AOT cache will somehow keep even symbolic references between classes, which is quite interesting.
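Here's a small self-contained Java sketch of the load-vs-initialize distinction (the VM is free to do the linking lazily, anywhere between the two steps):

  // Class.forName with initialize=false loads the class without running
  // its static initializer; linking (verification, preparation,
  // resolution of symbolic references) must happen before first use.
  public class LinkDemo {
      static class Plugin {
          static { System.out.println("Plugin <clinit> ran"); }
      }

      public static void main(String[] args) throws Exception {
          // Load only: "Plugin <clinit> ran" is NOT printed yet.
          Class<?> c = Class.forName("LinkDemo$Plugin", false,
                                     LinkDemo.class.getClassLoader());
          System.out.println("loaded " + c.getName());

          // First active use triggers initialization (and any deferred
          // linking), so the static initializer runs here.
          c.getDeclaredConstructor().newInstance();
      }
  }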

Nice, any time optimization is welcome :)

Sweet!

I'm a dunce

Read the article; this doesn't reduce JIT capabilities at all.

I'm curious whether any of this was inspired by AWS Lambda SnapStart.

Maybe read the History section.

> [example hello world] program runs in 0.031 seconds on JDK 23. After doing the small amount of additional work required to create an AOT cache it runs in 0.018 seconds on JDK NN — an improvement of 42%. The AOT cache occupies 11.4 megabytes.

That’s not immediately convincing that it will be worth it. It’s a start, I guess.

How so?

RAM is almost free if you’re not on embedded, and while embedded could run Java, it isn’t common.

That’s not an in-memory cache either. AIUI it’s storing those artefacts to disk

Container sizes may be affected though.

So you're now weighing the increased container pull time (due to size) vs the class load time you're saving through the cache.

It's nice to at least have the option of making that tradeoff

(And I suspect for plenty of applications, the class cache will be worth more time than (an also probably cached) image pull)

If you’re deploying Java applications, container size isn’t exactly your first priority anyhow, and this is O(n) additional space.

If image size is a concern, I imagine a native binary using GraalVM would’ve been a better way out anyhow, and you’ll bypass this cache entirely.

RAM might be inexpensive, but this hasn't stopped cloud providers from being stingy with RAM and price gouging.

At current RAM prices you'd expect the smallest instances to have 2GB, yet they still charge $4/month for 512MB, which isn't enough to run the average JVM web server.

That is a pretty ridiculous complaint. Your problem is that they allow configuring instances smaller than your arbitrary baseline? Especially as AWS lets you pick 2/4/8 GB per vCPU for general-purpose instances, and the smallest of those (c7g.medium) is 2GB/1vCPU. The 0.5GB t4g.nano actually has a more generous ratio, because it also has only 0.1 vCPU, putting it at 5GB/vCPU.

I'd assume they are very aware of demand levels for the different instance types and would adjust the configurations if needed.