# 1 Preface

Some time ago, I was involved in tracking down a Tomcat problem that centered on the sendfile system call. It reminded me of my earlier Java experience: a lot of open source software in the Java ecosystem applies zero-copy to improve performance, and that is what prompted this article. Looking at these familiar projects, have you ever dug into the technical points behind them?

• Tomcat: uses sendfile to write large static files to sockets, improving the performance of static file transfer
• Netty: a unified ByteBuf mechanism that wraps off-heap DirectBuffer memory for socket reads and writes; it also supports sendfile to send data from a file buffer to a target Channel
• RocketMQ: uses mmap to memory-map the CommitLog file for reads and writes, and writes the content to the target socket when a client consumes messages

# 2 What is Zero Copy

Zero copy means that the computer performs an operation without the CPU first copying data from one area of memory to another. The technique is often used when transferring files over a network, to save CPU cycles and memory bandwidth. Freeing the CPU from bulk data-copying tasks lets it focus on other work, so system resources are used more efficiently.

# 3 OS layer

Before figuring out zero-copy, let’s put Java aside and understand a few core points of knowledge in the I/O architecture in Linux.

• Kernel space vs. user space: kernel space is where the Linux kernel runs, and it is isolated from user space for system security, so that even if a user program crashes, the kernel is unaffected. Memory management, operating privileges, and so on are divided between the two spaces. Kernel space can access all system resources and exposes system interfaces to user-space programs; user-space programs cannot access system resources directly, but can only request them indirectly from the kernel through those interfaces, which is known as a system call.
• Disk/NIC: compared to memory, the disk and NIC are slow I/O devices, and there are two main ways to move data between them and memory:
  • PIO, through the CPU: data exchanged between the disk/NIC and memory is forwarded through the CPU
  • DMA, bypassing the CPU: data is exchanged directly between the disk/NIC and memory; the CPU only issues a command to the DMA controller, which transfers the data over the system bus and notifies the CPU when the transfer is complete
• CPU context switch: the CPU context (registers and program counter) of the previous task is saved first, then the context of the new task is loaded into the registers and program counter, and finally the new task runs from the location the program counter points to.

## 3.1 Traditional I/O

With the traditional way of doing I/O, data is copied back and forth between user space and kernel space, while kernel-space data is read from or written to hardware such as the disk or NIC through the OS-level hardware I/O interface.

The code involves 2 system calls. Both files and sockets are abstracted as files in Linux, so the first argument of each of the following functions is actually a file descriptor.

• read(file, tmp_buf, len);
• write(socket, tmp_buf, len);

There are 2 system calls and 4 context switches between user mode and kernel mode. The context switches bring the following problems.

• Scheduling: each system call has to switch from user mode to kernel mode first, and then back from kernel mode to user mode after the kernel finishes its task
• Latency: each switch takes from tens of nanoseconds to several microseconds, and CPU scheduling adds its own delay. Under high concurrency this time is easily accumulated and amplified, hurting overall system performance

Data is exchanged along user space <-> kernel space <-> disk/NIC, and 4 data copies occur: 2 are DMA copies and the other 2 go through the CPU.

• Copy 1 (DMA copy): data from the disk is copied into the OS kernel buffer; this copy is performed by DMA
• Copy 2 (CPU copy): the data is copied from the kernel buffer to the user buffer so the application can use it; this copy is performed by the CPU
• Copy 3 (CPU copy): the data is copied from the user buffer to the kernel's socket buffer; this is again performed by the CPU
• Copy 4 (DMA copy): the data is copied from the kernel's socket buffer to the NIC's buffer; this is again performed by DMA
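The four copies above correspond to the classic read-then-write loop. A minimal Java sketch of this traditional path (class and file names are illustrative, not from any of the projects discussed):

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TraditionalCopy {
    // Copies a file through a user-space heap buffer: each read() pulls data
    // kernel buffer -> user buffer (CPU copy 2), each write() pushes it back
    // user buffer -> kernel buffer (CPU copy 3).
    static long copy(File src, OutputStream out) throws IOException {
        long total = 0;
        try (InputStream in = new FileInputStream(src)) {
            byte[] tmpBuf = new byte[8192];       // the user-space tmp_buf
            int n;
            while ((n = in.read(tmpBuf)) != -1) { // copy 2: kernel -> user
                out.write(tmpBuf, 0, n);          // copy 3: user -> kernel
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        File src = File.createTempFile("demo", ".bin");
        try (FileOutputStream fos = new FileOutputStream(src)) {
            fos.write("hello zero copy".getBytes());
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(copy(src, sink)); // 15
        src.delete();
    }
}
```

Writing to a socket's OutputStream instead of a ByteArrayOutputStream gives exactly the file-to-network flow described above.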

## 3.2 Zero-copy technique

### 3.2.1 mmap

Linux uses virtual memory management, where multiple virtual addresses can point to the same physical address. Using this feature, virtual addresses in kernel space and user space can be mapped to the same physical address, so data does not need to be copied back and forth during I/O operations. By mapping the kernel's read buffer into a user-space buffer, the read side of the I/O is completed entirely within the kernel.

mmap is a system call provided by the kernel, and its function prototype is as follows.

```c
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
```

mmap+write takes advantage of this virtual memory feature for zero copy; the flow is as follows.

The above flow has 1 fewer CPU copy, which improves I/O speed. However, there are still 4 context switches, because the application still has to issue the write system call. mmap maps the kernel read buffer's address into the user buffer's address, so the kernel buffer is shared with the application buffer, saving one CPU copy. And since the user process's memory is virtual, merely mapped onto the kernel read buffer, application-level memory usage is also cut in half.
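In Java, mmap is exposed through FileChannel.map, discussed again in the Java section below. A minimal sketch (class and temp-file names are illustrative):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Writes msg into the file through an mmap'ed buffer and reads it back,
    // without any read()/write() copy through a user-space array.
    static String roundTrip(Path path, String msg) throws IOException {
        byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map the file region into this process's address space; reads
            // and writes on the buffer operate directly on the page cache.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
            buf.put(bytes);
            buf.flip();
            byte[] back = new byte[bytes.length];
            buf.get(back);
            return new String(back, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mmap-demo", ".bin");
        System.out.println(roundTrip(path, "hello mmap")); // hello mmap
        Files.deleteIfExists(path);
    }
}
```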

### 3.2.2 sendfile

sendfile is another system call provided by the kernel, and its function prototype is as follows.

```c
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
```

sendfile can transfer data between two file descriptors. It operates inside the OS kernel, avoiding the copy between the kernel buffer and the user buffer, so it can be used to achieve zero copy. The flow is as follows.

The sendfile method has 3 data copies, including 2 DMA copies and 1 CPU copy, and 2 context switches.
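Java exposes sendfile through FileChannel.transferTo (covered again in the Java section). A minimal sketch transferring one file into another channel (class and temp-file names are illustrative):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendfileDemo {
    // transferTo asks the kernel to move the bytes directly between the two
    // descriptors; on Linux it maps to sendfile(2) where possible, so the
    // data never enters a user-space buffer.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            return in.transferTo(0, in.size(), out);
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("sendfile-src", ".bin");
        Path dst = Files.createTempFile("sendfile-dst", ".bin");
        Files.write(src, "transfer me".getBytes());
        System.out.println(transfer(src, dst));                  // 11
        System.out.println(new String(Files.readAllBytes(dst))); // transfer me
        Files.deleteIfExists(src);
        Files.deleteIfExists(dst);
    }
}
```

The target can be any WritableByteChannel, including a SocketChannel, which gives the file-to-network case from the flow above.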

### 3.2.3 sendfile+DMA scatter/gather

The Linux 2.4 kernel optimizes sendfile by providing a scatter/gather variant, which removes the last CPU copy. Instead of copying data from the Read Buffer to the Socket Buffer in kernel space, only the memory address and offset of the Read Buffer are recorded in the Socket Buffer, so no copy is needed. In essence this is the same idea as the virtual memory solution: recording a memory address instead of copying. The flow is as follows.

The scatter/gather sendfile has only 2 DMA copies and 2 context switches; the CPU copy is gone entirely. However, this feature requires hardware and driver support.

### 3.2.4 splice

splice was introduced in Linux 2.6.17, with the following function prototype.

```c
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);
```

The splice call is very similar to sendfile, but it is not limited to sendfile's functionality; that is, splice is a superset of sendfile. The kernel also provides a tee call, specifically for moving data between two pipes, again with zero copies.

• Similarity: both splice and sendfile require two open file descriptors, one for input and one for output
• Difference: splice can connect any two file descriptors (as long as one end is a pipe), not just a file and a socket, and splice does not require hardware support

Since Linux 2.6.23, sendfile no longer has its own mechanism-level implementation; the API and its functionality remain, but they are implemented on top of the splice mechanism.

## 3.3 Summary

So-called zero copy comes down to reducing CPU copies and reducing context switches, as summarized below.

| Technique | System calls | CPU copies | DMA copies | Context switches |
|---|---|---|---|---|
| mmap | mmap + write | 1 | 2 | 4 |
| sendfile | sendfile | 1 | 2 | 2 |
| sendfile + DMA gather | sendfile | 0 | 2 | 2 |
| splice | splice | 0 | 2 | 2 |

Even with zero copy, the 2 DMA copies are indispensable, as both are performed by hardware.

# 4 Java layer

## 4.1 Zero-copy technology

Compared to the OS layer, Java has its own heap memory management due to the JVM's GC. Zero copy in Java needs to consider two issues.

• Where the copies happen: when writing data to disk, for example, the contents of the user buffer (JVM on-heap memory) must be copied to the kernel buffer (outside the heap), and the OS then writes the kernel buffer's contents to disk
• How zero copy optimizes this: the application requests off-heap memory directly in user space and writes the data to be persisted there, removing the use of on-heap memory

Thus, the core of Java’s zero-copy technology starts with the ability to use off-heap memory.

### 4.1.1 Java NIO

The JVM's GC mechanism may move data around in the memory it manages, which affects efficiency. For high-performance scenarios that need to use the OS's native memory directly, DirectByteBuffer is available to request and release off-heap memory directly; this memory is not managed by the JVM and is unaffected by GC. By using off-heap memory, DirectByteBuffer avoids the copies caused by JVM GC.

The Java NIO API provides a zero-copy API wrapper for the OS layer, which is also very convenient for upper layer applications to use.

• mmap: MappedByteBuffer class, whose underlying is the mmap system call
• sendfile: FileChannel provides two methods (transferTo/transferFrom), the underlying is the sendfile system call
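A minimal sketch of requesting off-heap memory with DirectByteBuffer (class name illustrative):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect requests memory outside the JVM heap; the GC does
        // not move it, so the OS can read/write it in place during I/O.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        System.out.println(direct.isDirect()); // true

        // A regular heap buffer lives inside the JVM heap and is subject
        // to GC management.
        ByteBuffer heap = ByteBuffer.allocate(1024);
        System.out.println(heap.isDirect()); // false

        // The API is identical either way.
        direct.putInt(42);
        direct.flip();
        System.out.println(direct.getInt()); // 42
    }
}
```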

### 4.1.2 Netty

Netty reduces data copying on three levels compared to Java’s built-in NIO.

• **Avoiding data flowing through user space**: Netty's FileRegion wraps FileChannel.transferTo to implement writing file data directly to the target Channel
• **Avoiding copies between the JVM heap and OS native memory**: Java provides DirectByteBuffer, and Netty wraps this off-heap memory behind its unified ByteBuf interface
• **Avoiding multiple copies of data inside user space**: Netty's ByteBuf abstraction supports reference counting and pooling, and provides the CompositeByteBuf combined view to reduce copies
  • ByteBuf: data can be shared, with reference counting managed via retain/release
  • ByteBufHolder: duplicate makes a shallow copy of a ByteBuf that shares the same data region but not the read/write indexes
  • CompositeByteBuf: combines several buffers and presents them to the outside world as one, so they can be operated on logically as a single ByteBuf, eliminating the overhead of allocating new space and copying the data again
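CompositeByteBuf is Netty-specific, but standard NIO's gathering write illustrates the same idea: several buffers are handed to the kernel as one logical unit without first being copied into a combined buffer. A sketch (class and names illustrative, plain JDK only):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class GatherDemo {
    // Gathering write: the channel consumes the buffers in order as one
    // logical unit, with no user-space copy into a single combined buffer.
    static long writeGathered(Path out, ByteBuffer... parts) throws IOException {
        try (FileChannel ch = FileChannel.open(out, StandardOpenOption.WRITE)) {
            return ch.write(parts);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer header = ByteBuffer.wrap("HEAD".getBytes()); // e.g. a protocol header
        ByteBuffer body = ByteBuffer.wrap("BODY".getBytes());   // e.g. the payload
        Path out = Files.createTempFile("gather", ".bin");
        System.out.println(writeGathered(out, header, body));    // 8
        System.out.println(new String(Files.readAllBytes(out))); // HEADBODY
        Files.deleteIfExists(out);
    }
}
```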

## 4.2 Open source analysis

### 4.2.1 Tomcat

Tomcat is a web server. Web applications usually contain static resource files, and since these files need no extra processing by the application, Tomcat can send them directly to the client without going through user space, so the zero-copy techniques described earlier can be applied.

To improve performance and save bandwidth, Tomcat has a built-in mechanism to compress static resource files; compression saves bandwidth but costs CPU. Tomcat also provides a sendfile feature: by default, static files larger than 48 KB are transmitted directly with sendfile, and compression is not applied to them.
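As a sketch of the related knobs, based on the Tomcat 9 documentation (the port value is illustrative; verify attribute names against your Tomcat version): the Connector's useSendfile attribute toggles the feature, and the DefaultServlet's sendfileSize init-param sets the size threshold in KB.

```xml
<!-- server.xml: useSendfile on the NIO connector (true by default) -->
<Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
           useSendfile="true" />

<!-- conf/web.xml, DefaultServlet: sendfile threshold in KB (default 48) -->
<init-param>
    <param-name>sendfileSize</param-name>
    <param-value>48</param-value>
</init-param>
```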

The related configuration is described in the Tomcat docs: Advanced IO and Tomcat, and the Default Servlet Reference (.org/tomcat-9.0-doc/default-servlet.html).

Tomcat provides three types of IO.

• BIO: blocking I/O, which should rarely be used nowadays
• NIO: non-blocking I/O built on the Java NIO API; Tomcat provides two implementations, NIO and NIO2
• APR: JNI-based calls into operating-system APIs, with better performance than NIO, but the APR native libraries must be downloaded separately

Tomcat defines three Endpoint types, corresponding to the three I/O modes above. In the NIO implementation, NioEndpoint.java, you can find the following code.

```java
if (sd.fchannel == null) {
    // Setup the file channel
    File f = new File(sd.fileName);
    @SuppressWarnings("resource") // Closed when channel is closed
    FileInputStream fis = new FileInputStream(f);
    sd.fchannel = fis.getChannel(); // FileChannel for the static file
}
// Configure output channel
sc = socketWrapper.getSocket();
// TLS/SSL channel is slightly different
WritableByteChannel wc = ((sc instanceof SecureNioChannel) ? sc : sc.getIOChannel());
// We still have data in the buffer
if (sc.getOutboundRemaining() > 0) {
    ...
} else {
    long written = sd.fchannel.transferTo(sd.pos, sd.length, wc); // underlying sendfile system call
    if (written > 0) {
```

### 4.2.2 RocketMQ

RocketMQ supports message persistence: after messages are received, they are all appended sequentially to the CommitLog, a file stored on disk. Messages in the CommitLog file are indexed by the ConsumeQueue. When a consumer obtains a message's real physical offset through the ConsumeQueue, it goes to the CommitLog to fetch the corresponding message.

RocketMQ uses mmap+write to send the contents of the CommitLog file, which avoids copying through the JVM heap and also saves one CPU copy; it does not use sendfile.

The file-mapping initialization can be found in MappedFile.java; since message consumption and deletion need to support random reads and writes, the choice of mmap is easy to understand:

```java
try {
    this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
    this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
    TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
    TOTAL_MAPPED_FILES.incrementAndGet();
    ok = true;
} catch (FileNotFoundException e) {
    log.error("Failed to create file " + this.fileName, e);
    throw e;
} catch (IOException e) {
```

When a client pulls messages, the following code in PullMessageProcessor.java is involved.

```java
if (this.brokerController.getBrokerConfig().isTransferMsgByHeap()) {
    final byte[] r = this.readGetMessageResult(getMessageResult,
            requestHeader.getConsumerGroup(), requestHeader.getTopic(), requestHeader.getQueueId());
    this.brokerController.getBrokerStatsManager().incGroupGetLatency(
            requestHeader.getConsumerGroup(), requestHeader.getTopic(), requestHeader.getQueueId(),
            (int) (this.brokerController.getMessageStore().now() - beginTimeMills));
    response.setBody(r);
} else {
    try {
        FileRegion fileRegion = new ManyMessageTransfer(
                response.encodeHeader(getMessageResult.getBufferTotalSize()), getMessageResult);
        channel.writeAndFlush(fileRegion).addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                getMessageResult.release();
                if (!future.isSuccess()) {
                    log.error("transfer many messages by pagecache failed, {}",
                            channel.remoteAddress(), future.cause());
                }
            }
        });
    } catch (Throwable e) {
        log.error("transfer many message by pagecache exception", e);
        getMessageResult.release();
    }
}
```
• If transferMsgByHeap is enabled, the messages are read into heap memory and then written into the response body
• If not, a ManyMessageTransfer object is created; it implements Netty's FileRegion interface, and its transferTo method writes the content to the target channel

# 5 Conclusion

This article has collected and organized some knowledge about zero copy and taken a quick look at the source code of two open source projects. As you can see, zero copy is not a complex or advanced technique, and at the Java layer it is very simple to use. I hope this article gives you some inspiration for applying zero copy in your future network programming.