Project Loom (JEP 425) is probably one of the most anticipated additions to Java ever. Its implementation of virtual threads (or “green threads”) promises that developers can create highly concurrent applications, such as those with hundreds of thousands of open HTTP connections, while sticking to the well-known thread-per-request programming model, without resorting to less familiar and often more complex reactive approaches.
Only recently, after several years of effort, has Loom been merged into the mainline of OpenJDK and made available as a preview feature in the latest Java 19 early access builds. So now is the perfect time to get your hands on virtual threads and explore the new feature. In this post, I’ll share an interesting aspect I learned about thread scheduling fairness for CPU-bound workloads running on Loom.
First, some background. The problem with the classic thread-per-request model is that it only scales up to a certain point. Threads managed by the operating system are an expensive resource, which means you can typically have up to a few thousand of them, but not hundreds of thousands or even millions. Now, if for example a web application makes a blocking request to a database, the thread that made that request is blocked. Of course, other threads can be scheduled on the CPU in the meantime, but there can’t be more concurrent requests than there are threads available.
The reactive programming model addresses this limitation by releasing threads during blocking operations such as file or network IO, allowing other requests to be processed in the meantime. Once the blocking call completes, the request in question continues, using a thread again. This model makes more efficient use of thread resources for IO-bound workloads, but unfortunately comes at the cost of a more complex programming model that many developers are not familiar with. In addition, as described in the Loom JEP, aspects such as debuggability or observability can be more challenging with the reactive model.
This explains the great excitement and anticipation in the Java community for Project Loom, which introduces the concept of virtual threads: threads that are scheduled by the JVM onto carrier threads at the operating system level. If application code hits a blocking method, Loom unmounts the virtual thread from its current carrier, making room for other virtual threads. Virtual threads are cheap and managed by the JVM, meaning that you can have many of them, even millions. The beauty of this model is that developers can stick to the familiar thread-per-request programming model without running into scaling problems due to a limited number of available threads. I highly recommend you read Project Loom’s JEP, which is well written and provides more details and background.
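To make the idea concrete, here is a minimal sketch of many blocking tasks running on virtual threads. It assumes JDK 21 or later (or JDK 19/20 with --enable-preview), where Executors.newVirtualThreadPerTaskExecutor() is available; the class name, the method name, the task count, and the sleep duration are illustrative assumptions, not taken from the JEP or from this post.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; requires JDK 21+ (or JDK 19/20 with --enable-preview).
class ManyBlockingTasks {

    // Runs `tasks` jobs that each block for `sleepMillis`, one virtual thread
    // per job, and returns how many of them completed.
    static int runBlockingTasks(int tasks, long sleepMillis) {
        AtomicInteger completed = new AtomicInteger();
        // try-with-resources calls close(), which waits for all submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(sleepMillis); // blocking call: the carrier is released
                        completed.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // 10,000 concurrently blocked tasks; with OS threads this would need 10,000 threads.
        System.out.println(runBlockingTasks(10_000, 1_000L) + " tasks completed");
    }
}
```

Because every task spends its time in a blocking JDK method, the carrier threads stay free and the whole batch should finish in roughly the time of a single sleep rather than the sum of all sleeps.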
Now how does Loom’s scheduler know that a method is blocking? As it turns out, it doesn’t. I learned from Ron Pressler, lead author of Project Loom, that it is the other way around: the blocking methods in the JDK have been adapted for Loom so that they release the OS-level carrier thread when called by a virtual thread.
Ron’s response led to a very interesting discussion with Tim Fox (of Vert.x fame, among other things): what happens if the code is not IO-bound but CPU-bound? That is, if the code in a virtual thread runs some heavy computation without ever calling any of the JDK’s blocking methods, will the virtual thread ever be unmounted?
Perhaps surprisingly, the current answer is: no. This means that CPU-bound code actually behaves very differently in a virtual thread than it does in an OS-level thread. So let’s take a closer look at the phenomenon with the following sample program.
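A minimal sketch of what such a program might look like is given here. The class name FairnessDemo, the helper run and its parameters are hypothetical; the 64 tasks, the count target of 100M, the BigInteger arithmetic, and the cached thread pool follow the description in this post.

```java
import java.math.BigInteger;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; names and structure are assumptions.
class FairnessDemo {

    // Submits `tasks` CPU-bound jobs at once and returns, per job, the
    // milliseconds elapsed between the common start and the job's completion.
    static List<Long> run(ExecutorService executor, int tasks, long countTo) {
        Instant start = Instant.now();
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            futures.add(executor.submit(() -> {
                BigInteger res = BigInteger.ZERO;
                for (long j = 0; j < countTo; j++) {
                    res = res.add(BigInteger.ONE); // pure computation, no blocking call
                }
                return Duration.between(start, Instant.now()).toMillis();
            }));
        }
        List<Long> durations = new ArrayList<>();
        try {
            for (Future<Long> f : futures) {
                durations.add(f.get()); // wait for each job in submission order
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
        return durations;
    }

    public static void main(String[] args) {
        // Cached thread pool: one OS-level thread per submitted task.
        List<Long> durations = run(Executors.newCachedThreadPool(), 64, 100_000_000L);
        for (int i = 0; i < durations.size(); i++) {
            System.out.println(i + ": " + durations.get(i) + " ms");
        }
    }
}
```

Passing the executor in as a parameter keeps the measurement code identical regardless of which kind of thread ends up running the tasks.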
Using a traditional cached thread pool, i.e. OS-level threads, 64 threads are started at the same time. Each thread counts to 100M (BigInteger is used to make it more CPU-intensive) and then prints out the time it took from scheduling the thread to completing it. Here are the results on my Mac Mini M1.
In wall clock time, it takes about 16 seconds to complete all 64 threads. The threads are scheduled evenly across the available cores of my machine. That is, we’re looking at a fair scheduling scheme. Now here are the results when using virtual threads instead, with the executor swapped for a virtual-thread one.
The graph looks very different. The first 8 threads took about 2 seconds of wall clock time, the next 8 threads took about 4 seconds, and so on. Since the executed code never hits any of the JDK’s blocking methods, the virtual threads never yield and thus monopolize their carrier threads until they have run to completion. This is an unfair scheduling scheme. Although all threads are started at the same time, only 8 of them actually run during the first two seconds, then the next 8 run, and so on.
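A sketch of how the virtual-thread run might be set up, assuming the executor is obtained via Executors.newVirtualThreadPerTaskExecutor() (JDK 21 or later, or JDK 19/20 with --enable-preview). The class and method names are hypothetical, and the exact timings will differ from machine to machine.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; requires JDK 21+ (or JDK 19/20 with --enable-preview).
class VirtualFairnessDemo {

    // Submits `tasks` CPU-bound jobs, one virtual thread each, and returns the
    // milliseconds from the common start to each job's completion.
    static List<Long> run(int tasks, long countTo) {
        long start = System.nanoTime();
        List<Future<Long>> futures = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    BigInteger res = BigInteger.ZERO;
                    for (long j = 0; j < countTo; j++) {
                        res = res.add(BigInteger.ONE); // no blocking call in this loop
                    }
                    return (System.nanoTime() - start) / 1_000_000; // elapsed ms
                }));
            }
        } // close() waits until every task has finished
        List<Long> durations = new ArrayList<>();
        for (Future<Long> f : futures) {
            durations.add(f.resultNow()); // safe: all tasks are done at this point
        }
        return durations;
    }

    public static void main(String[] args) {
        List<Long> durations = run(64, 100_000_000L);
        for (int i = 0; i < durations.size(); i++) {
            System.out.println(i + ": " + durations.get(i) + " ms");
        }
    }
}
```

If the behavior described in this post holds on your JDK, the printed durations should cluster into steps, one step per batch of tasks that fits onto the carrier threads.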
Loom’s scheduler uses as many carrier threads as there are CPU cores available by default; my M1 has eight cores, so processing takes place in blocks of eight virtual threads at a time. Using the jdk.virtualThreadScheduler.parallelism system property, you can adjust the number of carrier threads, for example to 16.
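The arithmetic behind those blocks can be sketched as follows. The class and method names are hypothetical, and reading the property this way only approximates what the scheduler does internally: it assumes the default equals the number of available processors, as described above.

```java
// Hypothetical sketch of the "blocks of N virtual threads" arithmetic.
class CarrierParallelism {

    // Carrier-thread count the scheduler would presumably use: the value of the
    // system property if set, otherwise the number of available processors.
    // Set the property at JVM startup, e.g.
    //   java -Djdk.virtualThreadScheduler.parallelism=16 ...
    static int effectiveParallelism() {
        return Integer.getInteger("jdk.virtualThreadScheduler.parallelism",
                Runtime.getRuntime().availableProcessors());
    }

    // Number of sequential batches needed when every task occupies its carrier
    // until it has finished (ceiling division).
    static int waves(int tasks, int carriers) {
        return (tasks + carriers - 1) / carriers;
    }

    public static void main(String[] args) {
        int carriers = effectiveParallelism();
        System.out.println("carriers: " + carriers
                + ", batches for 64 tasks: " + waves(64, carriers));
    }
}
```

With 8 carriers, 64 never-yielding tasks need 8 batches; at about 2 seconds per batch that matches the roughly 16 seconds of total wall clock time. With 16 carriers the same work would be split into 4 batches, although on an 8-core machine each batch would presumably take longer.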
Just for fun, let’s add a call to Thread::sleep() (a blocking method) in the processing loop and see what happens.
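A sketch of such a modified loop is shown here. The class name, the method name, the sleep duration of one millisecond, and the interval of one sleep per million iterations are assumptions chosen for illustration.

```java
import java.math.BigInteger;

// Hypothetical sketch of the counting loop with a periodic blocking call.
class CountWithSleep {

    // Counts to `countTo`, calling Thread.sleep once every `sleepEvery` iterations.
    // On a virtual thread, the sleep is a point where the carrier can be released.
    static BigInteger count(long countTo, long sleepEvery) {
        BigInteger res = BigInteger.ZERO;
        for (long j = 0; j < countTo; j++) {
            res = res.add(BigInteger.ONE);
            if (j % sleepEvery == 0) {
                try {
                    Thread.sleep(1L); // blocking JDK method
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return res; // stop early if interrupted
                }
            }
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(count(100_000_000L, 1_000_000L));
    }
}
```

The interval is a trade-off: sleeping more often gives the scheduler more chances to switch between virtual threads, but adds idle time to every task.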
Sure enough, we’re back to fair scheduling, with all threads completing after roughly the same wall clock time:
It is worth noting that the actual durations look more uniform compared to our original results with 64 OS-level threads. It seems that the Loom scheduler does a better job of distributing the available resources among the virtual threads. Surprisingly, calling Thread::yield() does not have the same effect. While the scheduler is free to ignore this hint according to the method’s JavaDoc, Sundararajan Athijegannathan says it will be honored by Loom. It would certainly be interesting to know why this is not the case here.
Seeing these results, the big question is of course whether this unfair scheduling of CPU-bound threads in Loom is problematic in practice. Ron and Tim have debated this point extensively, and I suggest you check out their discussion yourself to form an opinion. According to Ron, support for yielding at points in program execution other than blocking methods has already been implemented in Loom, but it was not merged into the mainline with the initial drop of Loom. If the current behavior proves to be problematic, it should be easy to bring it back.
Now, for CPU-bound code it doesn’t make much sense anyway to overcommit to more threads than are physically supported by a given CPU (nor to use virtual threads to begin with). But in any case, it is worth pointing out that CPU-bound code may behave differently on virtual threads than on classic OS-level threads. This may come as a surprise to Java developers, especially if the author of such code is not the one who chooses the thread executor/scheduler actually used by the application.
Time will tell if we need yield support for CPU-bound code, either through explicit calls to Thread::yield() (which I think should be supported at the very least) or through more implicit means, e.g. by yielding when a safepoint is reached. As far as I know, Go’s goroutines have supported yielding in similar scenarios since version 1.14, so I wouldn’t be surprised to see Java and Loom eventually go the same route.