Java Virtual Threads
Virtual threads are something I’m really excited about. They are a long-awaited feature, and there have been a few attempts to implement them in the past, but now they have finally been merged into Java 19. This means they will ship as a preview feature, and we will be able to use them as soon as JDK 19 is released.
Most of the content on virtual (green) threads is written by non-Java developers. Why is that? Green threads are new to the Java world: most Java applications use regular (platform) threads and have had little exposure to the concept. Java is one of the most popular languages, yet it is the only one among them that does not support any form of async/await without the help of third-party libraries (which we will cover soon).
In the early days of Java, when designing the multithreading API, Sun Microsystems faced a dilemma: should Java threads be user-mode threads, or map one-to-one onto OS threads? Benchmarks at the time showed that user-mode threads underperformed, increasing memory consumption without giving much in return. But those benchmarks are 20 years old, from a time when things were completely different: load requirements were nowhere near as high, and the Java language wasn’t very mature. Things are different now, and there have been a few attempts to introduce user-mode threads into the language since. Take fibers, for example.
Unfortunately, because fibers were implemented as a separate class, migrating an entire code base to them was difficult, and they eventually faded away without ever being merged into the language.
Project Loom
There are several OpenJDK projects, each with a very specific mission. Examples include Valhalla, Panama, Amber, and of course Loom. Loom’s goal is to overhaul the language’s concurrency model: it aims to bring virtual threads, structured concurrency, and a few other smaller things (for now).
A few words about the Java concurrency model
The way threads are implemented in the JVM is considered one of the best implementations around, even by non-Java developers. We have excellent thread debugging: you can get thread dumps, set breakpoints, inspect memory, and more. You can even use the JFR API to define custom metrics for your threads.
The Thread class is Java’s way of giving you access to the OS thread API, and most of its operations make system calls. In production we rarely use threads directly; we use the Java concurrency package with its thread pools, locks, and other goodies. Java has excellent built-in tools for multithreading.
Concurrency and parallelism
Before we move on to the fun stuff, we have to get this out of the way. Concurrency and parallelism are two things that are often confused.
Parallelism means executing two or more tasks at the same time. This is only possible if the CPU supports it: we need multiple cores to achieve parallelism. Modern CPUs are virtually always multi-core; single-core CPUs are mostly obsolete because their performance is significantly worse than that of multi-core CPUs. Modern applications are designed to take advantage of multiple cores, because they always need to do several things at once.
Concurrency means managing multiple tasks at the same time, not necessarily executing them at the same time. JavaScript, for example, is a single-threaded language: a single thread manages all the tasks generated by the code, and everything that must happen “simultaneously” is interleaved on that one thread. JS uses async/await to achieve this, and we’ll discuss other ways to implement concurrency later.
From an operating system perspective, the CPU must handle threads from multiple processes. The number of threads is always higher than the number of cores, which means the CPU must perform context switching. Briefly: each thread has a priority and can be idle, working, or waiting for a CPU cycle. The CPU must iterate over all non-idle threads and divide its limited resources according to priority. It must also make sure that all threads of the same priority get a fair share of CPU time, otherwise some applications may freeze. Every time a core is reassigned to a different thread, the currently running thread must be frozen and its register state preserved. On top of that, the OS must keep track of idle threads that are waiting to be woken up. As you can see, context switching is a rather complex and expensive operation, and we as developers should try to minimize the number of threads we use. Ideally, the number of threads stays close to the number of CPU cores, so that context switches are kept to a minimum.
Modern Java Server Concurrency Issues
The cloud space is getting bigger and bigger, and with it the load and resource requirements. Most enterprise servers (the most heavily loaded ones) are written in Java, so Java is on the front line of the load problem. So far it has done a good job, judging by the fact that it is still the most popular server-side language, but that doesn’t mean it is perfect.
The usual way we handle requests is to dedicate a platform thread to each of them; this is the “thread-per-request” model. When a client requests something and we fetch or process the data, that thread is occupied and cannot be used by anyone else. The server starts and allocates a predefined number of threads (e.g. 200 for Tomcat). They are placed in a thread pool and wait for requests. Their initial state is “parked”, and in this state they do not consume CPU resources.
This is easy to write, understand, and debug, but what if the client requests something that makes a blocking call? A blocking call is an operation that waits for a third party to complete, such as a SQL query, a request to a different service, or a simple IO operation against the operating system. When a blocking call occurs, the thread must wait. While it waits, the thread is unavailable, yet the CPU must still manage it because it is not idle. This increases context switching. The server limits the number of threads for a reason: more threads may increase throughput, but they slow down individual request processing considerably. This is a balance we must keep in mind and manage. People often ask, “why not just spawn 10k threads and process 10k requests at the same time?” The operating system won’t stop you; with the right configuration you can even spawn a million threads. But there are benchmarks showing that on popular CPUs, with only 3-4k threads, context switching alone can consume around 80% of the CPU, and remember that the OS also needs the CPU to run and manage other processes.
To address our scalability issues, we typically just scale out and spawn multiple nodes of the server. This works, and we can handle any number of requests if we pay the cloud provider enough, but one of the main drivers for using cloud technology is reducing operational costs. Sometimes we can’t afford the extra expense and end up with a slow and nearly unusable system.
Concurrency Model
Callbacks
Callbacks are a simple but powerful concept. A callback is an object passed as an argument to another function or procedure. The parent function hands the callback to the child function, which can then use it to notify the parent of some event, such as “I have completed my task”. This is one way to achieve concurrency on a single thread. Callbacks show up in the stack trace, which can make debugging easy when the nesting is one or two layers deep, but things quickly get out of hand when you need to build longer callback chains. Today they are mainly used as building blocks for other concurrency models and are considered bad practice and a legacy pattern.
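To make the idea concrete, here is a minimal Java sketch (the names are made up for illustration): the parent passes its logic to fetchData, which invokes the callback when it is done.
import java.util.function.Consumer;

public class CallbackDemo {
    // The child function accepts a callback and invokes it once the work is done
    static void fetchData(Consumer<String> onComplete) {
        String result = "some data"; // pretend this took a while to compute
        onComplete.accept(result);   // notify the parent: "I have completed my task"
    }

    public static void main(String[] args) {
        // The parent hands its logic over as the callback
        fetchData(data -> System.out.println("Received: " + data));
    }
}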
Async/await and Promises
As the name implies, this model is based on promises. A Promise represents the eventual result (or failure) of a computation. A function can return a Promise, for example the result of an HTTP request, and the caller can then chain its own logic onto it. This is how concurrency is implemented in most popular languages. Java has Promises too, but they are called Futures, and only CompletableFuture offers the full Promise-style API. Most operations in Java are blocking, though, so Futures end up occupying threads anyway.
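As a small illustration of the Java side (the values are made up), here is a CompletableFuture chain; each step subscribes its logic to the eventual result of the previous one:
import java.util.concurrent.CompletableFuture;

public class FutureDemo {
    public static void main(String[] args) {
        CompletableFuture.supplyAsync(() -> "21")          // the eventual computation
                .thenApply(Integer::parseInt)              // chain logic onto the result
                .thenApply(n -> n * 2)
                .thenAccept(n -> System.out.println("Result: " + n))
                .join();                                   // block only at the very end
    }
}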
Async/await is syntactic sugar over Promises. It saves you from writing the cumbersome boilerplate for chaining, subscribing to, and managing Promises. Typically you mark a function as async, and its result is transparently wrapped in a Promise.
One of the big problems you can run into with async/await is the infamous colored function problem. Marking a function async essentially makes it non-blocking, but regular blocking functions (without the async prefix) can’t call it unless they themselves become async and use await. You might ask, “So what? I’ll just make everything asynchronous and never use blocking functions”… Sure, until a third-party library makes a blocking call and boom, everything goes to hell. There may also be things in the language that are inherently blocking, so sooner or later you will be forced to deal with function colors. It is worth noting that in some languages, like C#, this is not the case and async/await has no function colors.
Coroutines (continuations + routines)
When we talk about coroutines here, we don’t mean Kotlin’s coroutines; Kotlin just borrowed the word.
A continuation is a special kind of function call. If function A calls function B, and that call is the last thing A does, then we can say that B is a continuation of A.
Routines (a.k.a. subroutines) are reusable pieces of code that are usually called multiple times during execution. Think of them as a set of immutable instructions with inputs and outputs that you can invoke at any time.
Combining these two terms, we get coroutines. They are essentially suspendable tasks managed by the runtime, and they form a tree structure of chained calls.
Coroutines have several key properties:
- They can be suspended and resumed at any time
- They are a data structure that remembers its state and stack trace
- They can yield control to other coroutines (subroutines)
- They must have isDone(), yield(), and run() functions
Here is an example of a coroutine in JS; apologies to all hardcore Java readers :(
JavaScript has a yield mechanism that lets you create so-called generators. To support it, the language has pretty much implemented coroutines.
function *getNumbersGen() {
let temp = 5;
console.log("1");
yield 1;
console.log("2");
yield 2;
console.log("3");
yield 3;
console.log("Temp " + temp);
}
This is our simple generator. The ‘*’ marks the function as a generator, which allows it to use yield. The yield keyword is what pauses and resumes the generator function (i.e. suspend/resume).
Now, if we execute the following code, it will get the numbers from the generator until it stops generating them.
for (let n of getNumbersGen()) {
console.log("Num " + n);
}
We get
"1"
"Num 1"
"2"
"Num 2"
"3"
"Num 3"
"Temp 5"
Notice how control bounces between the two blocks of code: first the generator prints a number, then the loop prints it. Also note the temp variable we define at the beginning of the generator: even as control passes back and forth, its value is preserved and printed correctly at the end. This generator implements everything needed to be called a coroutine: it can pause, it can resume, and it maintains its state, all handled by the JS interpreter. Best of all, we implemented concurrency without introducing any special keywords like async and await, we don’t have colored functions, and we still run on a single thread.
Virtual threads
Loom’s developers had a lot to consider and multiple ways to implement virtual threads, and I’m glad they chose the approach they did: the Thread class stays the same and keeps the same API. This makes migration seamless; switching to green threads is little more than a flag. However, it comes at a huge cost: they had to go through every blocking API in the language, such as sockets and I/O, and make it non-blocking when it runs in a virtual thread. This is a huge change that touches core JDK APIs, and it had to be backwards compatible and not break existing logic. No wonder it took more than 5 years to complete.
To switch to virtual threads, we don’t have to learn anything new; we just have to unlearn a few things.
- Never pool virtual threads; they are cheap, so pooling them is pointless
- Stop using thread-local variables. They will still work, but if you spawn millions of threads you will run into memory problems (see the sketch below). According to Ron Pressler: “Thread-local variables should not be exposed to end users and should be kept as internal implementation details”.
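Here is a contrived sketch, with made-up sizes, of why thread-locals hurt at virtual-thread scale:
// Every thread that touches this thread-local gets its own 1 MB copy.
static final ThreadLocal<byte[]> BUFFER =
        ThreadLocal.withInitial(() -> new byte[1024 * 1024]);

static void handleRequest() {
    byte[] buf = BUFFER.get(); // allocates the per-thread copy on first access
    // ... use buf ...
    // With a pool of 200 platform threads this costs ~200 MB at worst.
    // With a million virtual threads it can pin ~1 TB of heap.
}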
An almost exhaustive list of advantages of virtual threads over platform threads
- Context switching becomes practically free, because it is performed by the JVM rather than the OS.
- Tail-call optimization. The JEP mentions that tail-call optimization is performed on virtual thread stacks. This can save a lot of stack memory, but it is still a work in progress.
- Cheap start/stop. Stopping an OS thread requires a system call to terminate it and then freeing the memory it occupies; starting one requires another system call. Starting a green thread is just allocating an object, and killing it is just letting the GC collect it.
- No hard cap. As mentioned before, the OS can only handle so many threads; even with hardware improvements, we can’t keep up with demand. With virtual threads you can currently spawn tens of millions (which should be enough for most cases).
- Workload-specific scheduling. A thread that processes transactions behaves very differently from a thread that processes video, and this is easy to overlook. The OS and CPU must be optimized for the general case: they have to handle whatever tasks applications throw at them, so they cannot be optimized for specific use cases. The JVM, on the other hand, can optimize its threads for the specific task of processing requests.
- Resizable stacks. Virtual threads live in RAM, stacks and metadata included. Platform threads must be allocated a fixed stack size (1MB by default in Java) that cannot be resized: exceed it and you get a StackOverflowError; leave it unused and you waste memory. By contrast, the minimum memory needed to bootstrap a virtual thread is about 200-300 bytes.
Using Virtual Threads
Consider the following
for (int i = 0; i < 1_000_000; i++) {
    new Thread(() -> {
        try {
            Thread.sleep(1000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }).start();
}
Here we try to create 1 million regular threads; all each thread does is sleep for 1 second and then die. As expected, this code throws an OutOfMemoryError: on my machine I was able to create about 40k threads before running out of memory.
Now let’s try spawning virtual ones. To create a new virtual thread, we use Thread.startVirtualThread(runnable).
for (int i = 0; i < 1_000_000; i++) {
    Thread.startVirtualThread(() -> {
        try {
            Thread.sleep(1000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
}
This code worked fine, and I was able to spawn over 20 million such threads on my machine. This is to be expected, since a user-mode thread is little more than an object in memory managed by the JVM.
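Besides Thread.startVirtualThread, the JDK 19 preview API also offers a builder; here is a small sketch (the thread names are arbitrary, and the preview API may still change):
// Builder form: name the threads and optionally get a ThreadFactory for reuse
Thread vt = Thread.ofVirtual()
        .name("my-virtual-", 0) // threads are named my-virtual-0, my-virtual-1, ...
        .start(() -> System.out.println(Thread.currentThread()));

// A factory is handy when another API wants to create the threads itself
ThreadFactory factory = Thread.ofVirtual().factory();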
Okay, let’s dig a little deeper. Usually we use threads together with thread pools, so let’s define some blocking code to execute:
static void someWork(String taskName) {
try {
System.out.println(Thread.currentThread() + " executing " + taskName);
new URL("https://httpstat.us/200?sleep=2000").getContent();
System.out.println(Thread.currentThread() + " completed " + taskName);
} catch (Exception e) {
e.printStackTrace();
}
}
We print the platform (carrier) thread the code is running on, then make an HTTP call that takes 2 seconds, then print the carrier thread again. Simple enough.
Now let’s run it in parallel
try (ExecutorService executor = Executors.newFixedThreadPool(5)) {
for (int i = 1; i <= 10; i++) {
String taskName = "Task" + i;
executor.execute(() -> someWork(taskName));
}
}
We create a fixed-size thread pool with 5 threads and submit 10 tasks, each of which just runs the someWork method described earlier. Did you notice anything else? The thread pool is in a try-with-resources block! As of Java 19, ExecutorService is AutoCloseable: the try-with-resources waits for all submitted tasks to complete, so we no longer need to call shutdown and awaitTermination when using thread pools. Anyway, the code above gives us the following output
Thread[#25,pool-1-thread-5,5,main] executing Task5
Thread[#24,pool-1-thread-4,5,main] executing Task4
Thread[#21,pool-1-thread-1,5,main] executing Task1
Thread[#22,pool-1-thread-2,5,main] executing Task2
Thread[#23,pool-1-thread-3,5,main] executing Task3
Thread[#21,pool-1-thread-1,5,main] completed Task1
Thread[#22,pool-1-thread-2,5,main] completed Task2
Thread[#21,pool-1-thread-1,5,main] executing Task6
Thread[#22,pool-1-thread-2,5,main] executing Task7
Thread[#25,pool-1-thread-5,5,main] completed Task5
Thread[#23,pool-1-thread-3,5,main] completed Task3
Thread[#24,pool-1-thread-4,5,main] completed Task4
Thread[#25,pool-1-thread-5,5,main] executing Task8
Thread[#23,pool-1-thread-3,5,main] executing Task9
Thread[#24,pool-1-thread-4,5,main] executing Task10
Thread[#22,pool-1-thread-2,5,main] completed Task7
Thread[#21,pool-1-thread-1,5,main] completed Task6
Thread[#25,pool-1-thread-5,5,main] completed Task8
Thread[#24,pool-1-thread-4,5,main] completed Task10
Thread[#23,pool-1-thread-3,5,main] completed Task9
Note how each task is started and completed by the same thread (e.g. Task4 was both started and completed by Thread[#24,pool-1-thread-4,5,main]).
This shows that while the blocking call is in progress, the thread just waits, resuming only after the 2 seconds have passed.
Now let’s convert this to user-mode threads. The code is the same; we just use Executors.newVirtualThreadPerTaskExecutor(), which creates a new green thread for every submitted task.
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
for (int i = 1; i <= 10; i++) {
String taskName = "Task" + i;
executor.execute(() -> someWork(taskName));
}
}
This time we get the following output
VirtualThread[#25]/runnable@ForkJoinPool-1-worker-4 executing Task4
VirtualThread[#26]/runnable@ForkJoinPool-1-worker-5 executing Task5
VirtualThread[#28]/runnable@ForkJoinPool-1-worker-7 executing Task7
VirtualThread[#23]/runnable@ForkJoinPool-1-worker-2 executing Task2
VirtualThread[#21]/runnable@ForkJoinPool-1-worker-1 executing Task1
VirtualThread[#24]/runnable@ForkJoinPool-1-worker-3 executing Task3
VirtualThread[#27]/runnable@ForkJoinPool-1-worker-6 executing Task6
VirtualThread[#29]/runnable@ForkJoinPool-1-worker-8 executing Task8
VirtualThread[#32]/runnable@ForkJoinPool-1-worker-3 executing Task10
VirtualThread[#31]/runnable@ForkJoinPool-1-worker-8 executing Task9
VirtualThread[#26]/runnable@ForkJoinPool-1-worker-1 completed Task5
VirtualThread[#21]/runnable@ForkJoinPool-1-worker-1 completed Task1
VirtualThread[#25]/runnable@ForkJoinPool-1-worker-9 completed Task4
VirtualThread[#24]/runnable@ForkJoinPool-1-worker-2 completed Task3
VirtualThread[#27]/runnable@ForkJoinPool-1-worker-2 completed Task6
VirtualThread[#32]/runnable@ForkJoinPool-1-worker-9 completed Task10
VirtualThread[#28]/runnable@ForkJoinPool-1-worker-1 completed Task7
VirtualThread[#23]/runnable@ForkJoinPool-1-worker-7 completed Task2
VirtualThread[#31]/runnable@ForkJoinPool-1-worker-6 completed Task9
VirtualThread[#29]/runnable@ForkJoinPool-1-worker-1 completed Task8
Notice how each task is now executed by two threads: the first runs the code before the blocking call, the second the code after it. For example, Task5 starts on ForkJoinPool-1-worker-5 and finishes on ForkJoinPool-1-worker-1. This shows that we are not blocking the carrier threads. Note also that we are now using a fork-join pool: this carrier pool is sized to the number of cores and managed by the JVM, and it is also used for things like parallel streams.
This is very similar to the JavaScript example we saw earlier: threads yield control to each other, state is saved and then restored, true cooperative multitasking. The best part is that it reads exactly like regular blocking code.
Server-Sent Events
I wanted to show this as a cool use case for user-mode threads. If you are not familiar with SSE, this is a good and more detailed explanation. Essentially, we open an HTTP connection and never close it, and the server can then keep pushing data to the client. It is very lightweight and much cheaper than WebSockets. The catch is that to implement it we need a thread that keeps running and sending events to the stream; if the thread dies, the client is disconnected. You can already see that using platform threads here is not a great idea. With virtual threads, we can do this in Spring Boot:
@GetMapping("/sse")
public SseEmitter streamSSE() {
SseEmitter emitter = new SseEmitter();
ExecutorService sseMvcExecutor = Executors.newVirtualThreadPerTaskExecutor();
sseMvcExecutor.execute(() -> {
try {
while (true) {
SseEmitter.SseEventBuilder event = SseEmitter.event()
.data("SSE time -> " + LocalTime.now().toString())
.id(UUID.randomUUID().toString())
.name("Custom SSE");
emitter.send(event);
Thread.sleep(500);
}
} catch (Exception ex) {
emitter.completeWithError(ex);
}
});
return emitter;
}
SseEmitter is a special class in Spring MVC that implements the SSE protocol. What we do is create an infinite loop (obviously not production code) and send new data to the client every 500 milliseconds. Any number of clients can then subscribe to it.
curl localhost:8080/api/v1/test/sse
You’ll keep getting events like this
data:SSE time -> 13:41:59.878294
id:f2059d56-d27e-461b-8f7c-1aa75b3aab64
event:Custom SSE
data:SSE time -> 13:42:00.383703
id:fdaac3bb-47aa-45b0-9382-549ff7549f08
event:Custom SSE
The full source code can be found here.
When not to use virtual threads
Managing virtual threads has its own overhead, which can degrade performance. For lightly loaded applications, platform threads may be preferable simply because context switching is cheap when there are few active clients. Likewise, if your application is CPU-intensive, for example doing a lot of math, green threads make little sense, since they occupy OS threads for the duration of the computation anyway. In general, Loom will not make your application faster; it will only increase its throughput. If throughput is not an issue, stick with platform threads. This blog does a good job of analyzing how Loom schedules threads, but if you read it, keep in mind that the author bashes Loom for things it was never meant to do in the first place (like fair thread scheduling).
Find your bottleneck. If you use Postgres with a connection pool of 50, spawning more than 50 threads (platform or virtual) won’t make any difference.
As a reference, you can check out Little’s law and this great article on choosing the optimal number of threads.
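As a quick worked example of Little’s law, with made-up numbers: L = λ × W, where λ is the arrival rate and W is the time each request spends in the system. At λ = 1,000 requests per second and W = 0.2 seconds per request, L = 1,000 × 0.2 = 200 requests are in flight on average, so a pool much larger than roughly 200 threads cannot be kept busy by this workload.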
Structured Concurrency
Structured concurrency is about how we handle the life cycle of threads. Currently we can’t simply stop a thread whose result has become obsolete and that no longer needs to run; we can only send an interrupt signal, which will eventually be consumed and the thread stopped. In the meantime it wastes RAM and CPU cycles.
Let’s consider the following cases:
- All tasks must succeed; if one fails, there is no point in continuing the others
- At least one task must succeed; once one succeeds, there is no point in waiting for the rest
- Deadline: if execution takes longer than a given time, we want to terminate everything
In each of these cases, some threads need to be stopped immediately once a certain state is reached. With Loom we will be able to do this; today, apart from the special thread pools Loom introduces, we have no choice but to stop such threads manually, but this JEP promises more utilities for managing them (a sketch of the incubating API follows below).
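As a sketch of the first case above (“all tasks must succeed”), here is the incubating StructuredTaskScope API from JEP 428. Keep in mind this API is still incubating and may change, and fetchUser/fetchOrder are hypothetical blocking calls:
// requires --enable-preview --add-modules jdk.incubator.concurrent
// import jdk.incubator.concurrent.StructuredTaskScope;
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    Future<String> user  = scope.fork(() -> fetchUser());   // hypothetical blocking call
    Future<String> order = scope.fork(() -> fetchOrder());  // hypothetical blocking call
    scope.join();           // wait for both forks to finish
    scope.throwIfFailed();  // if either failed, the other is cancelled and the error rethrown
    System.out.println(user.resultNow() + " / " + order.resultNow());
}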
These may seem like small optimizations, and for small applications or low-load servers they are indeed trivial. But when you are dealing with millions of requests per day, these things really matter; they can be game changers and in some cases dramatically increase your throughput.
If you are using a reactive framework, should you consider switching to Loom?
Reactive frameworks are very good at addressing application throughput. Essentially, they create abstract tasks (similar to coroutines) and wrap everything in them, and the reactive runtime then manages these tasks. That sounds very similar to virtual threads, but there are a few major problems:
- The language itself doesn’t support it, which leads to very complex code (Flux/Mono)
- The excellent Java thread debugging we talked about earlier is completely bypassed, replaced by a centralized error handler that provides almost zero information. You end up relying heavily on logging.
- Once you adopt the reactive style it’s hard to go back, and you may have to rewrite everything from scratch. Brian Goetz says Loom will kill WebFlux. I’m not saying you should believe him blindly, but at least listen to what he has to say.
I personally don’t like reactive frameworks (I don’t like the actor model either, but at least it behaves better and is easier to understand). I really like blocking code and the thread-per-request model: they are readable and take full advantage of the Java language. Reactive frameworks throw that away, and after Loom you will need a good reason to use them.
Final Words
I know this is a big topic with a lot to digest. I hope it’s been useful. I will also link to all the resources and further reading below so you can do your own research if needed. Loom is still in pre-release and things may change, but one thing is certain: we will get virtual threads in JDK 19, whose official release date, as I write this, is September 2022. Unfortunately, Java 19 is not an LTS release, so if you work for a company that only uses LTS versions, you will have to wait for Java 21, which should be released in September 2023.