Zero-copy technology and its application in Java

1 Preface

Some time ago I was involved in tracking down a Tomcat problem that revolved around the sendfile system call. It reminded me of my earlier experience with Java: a lot of open source software in the Java ecosystem applies zero-copy to improve performance, which is what prompted this article. Looking at the familiar software below, have you ever dug into the techniques behind it?

  • Tomcat: uses sendfile to write large files to sockets, improving the performance of static file transfer
  • Netty: a unified ByteBuf mechanism that wraps off-heap DirectBuffer memory for socket reads and writes; it also supports sending data from a file buffer to a target Channel via sendfile
  • RocketMQ: uses mmap memory-mapped files to read and write the CommitLog, and writes the content to the target socket when clients consume messages

2 What is Zero Copy

Zero-copy means that the computer performs an operation without the CPU having to copy data from one area of memory to another first. The technique is often used when transferring files over a network to save CPU cycles and memory bandwidth. Sparing the CPU from bulk data copying frees it for other tasks and lets system resources be used more efficiently.

3 OS layer

3.1 Traditional I/O

Before digging into zero-copy, let's put Java aside and go over a few core concepts of the I/O architecture in Linux.

  • Kernel space: the space where the Linux kernel runs. In kernel space, all system resources can be accessed, and a system interface is exposed to user-space programs.
  • User space: isolated from kernel space for system security, so that even if a user program crashes, the kernel is not affected. Memory management, privileges, and so on are divided between the two spaces. User-space programs cannot access system resources directly; they can only request them indirectly through the system interface, which is what we call a system call.
  • Disk/NIC: compared with memory, the disk and NIC are slow I/O devices, and there are two main ways to transfer data between them and memory.
    • PIO, through the CPU: data exchanged between the disk/NIC and memory has to be relayed through the CPU
    • DMA, bypassing the CPU: the disk/NIC exchanges data with memory directly; the CPU only issues commands to the DMA controller, which transfers the data over the system bus and notifies the CPU when the transfer is complete
  • CPU context switch: the CPU context (registers and program counter) of the previous task is saved, the context of the new task is loaded into those registers and the program counter, and execution resumes by jumping to the location the program counter points to

In the traditional I/O model, data is copied back and forth between user space and kernel space, while kernel space reads data from and writes data to hardware such as the disk or NIC through the OS hardware I/O interface.

[Figure linux_io: traditional I/O data flow]

The code involves 2 system calls. Both files and sockets are abstracted as files in Linux, so the first argument of each of the following functions is actually a file descriptor.

  • read(file, tmp_buf, len);
  • write(socket, tmp_buf, len);

There are 2 system calls and 4 context switches between user mode and kernel mode. The context switches bring the following problems.

  • Scheduling: each system call has to switch from user mode to kernel mode, then switch back from kernel mode to user mode after the kernel finishes its work
  • Latency: each switch takes tens of nanoseconds to a few microseconds, and CPU scheduling adds its own delay. Under high concurrency this time easily accumulates and is amplified, hurting overall system performance

Data is exchanged along the path user space <-> kernel space <-> disk/NIC, and 4 data copies occur: 2 are DMA copies and the other 2 go through the CPU (a Java sketch of this pattern follows the list).

  • Copy 1 (DMA copy): data from the disk is copied into the OS kernel buffer; this copy is performed by DMA
  • Copy 2 (CPU copy): data is copied from the kernel buffer into the user buffer so that our application can use it; this copy is performed by the CPU
  • Copy 3 (CPU copy): data is copied from the user buffer into the kernel's socket buffer; this is still done by the CPU
  • Copy 4 (DMA copy): data is copied from the kernel's socket buffer into the NIC buffer; this is again performed by DMA
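
For reference, this is what the traditional pattern looks like in Java; a minimal sketch, where the file name, host, and port are made up for illustration. Each read() pulls data from the kernel into a user-space buffer (copy 2), and each write() pushes it back into the kernel's socket buffer (copy 3).

import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class TraditionalCopy {
    public static void main(String[] args) throws IOException {
        byte[] tmpBuf = new byte[8192]; // the user-space tmp_buf of read/write
        try (FileInputStream in = new FileInputStream("big-file.bin");
             Socket socket = new Socket("localhost", 9000);
             OutputStream out = socket.getOutputStream()) {
            int n;
            while ((n = in.read(tmpBuf)) != -1) { // kernel buffer -> user buffer (CPU copy)
                out.write(tmpBuf, 0, n);          // user buffer -> socket buffer (CPU copy)
            }
        }
    }
}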

3.2 Zero-copy technique

3.2.1 mmap

Linux uses virtual memory management, where multiple virtual addresses can point to the same physical address. Using this feature, virtual addresses in kernel space and user space can be mapped to the same physical address, so that data does not need to be copied back and forth during I/O operations. By mapping the kernel's read buffer into a user-space buffer, the I/O itself stays entirely in the kernel.

mmap is a system call provided by the kernel, and its function prototype is as follows.

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

mmap+write takes advantage of this virtual memory feature to reduce copies; the flow is as follows.

[Figure linux_io_mmap: mmap+write data flow]

The above flow has 1 less CPU copy, which improves I/O speed. However, there are still 4 context switches, because the application still has to issue a write call. Since mmap maps the kernel read buffer into the user buffer's address space, the buffer is shared between kernel and application, which is what saves the CPU copy. And because the user process's memory is virtual, merely mapped onto the kernel read buffer, application-level memory usage is also roughly halved.
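
In Java, the counterpart of mmap is FileChannel.map (covered in section 4.1.1). A minimal mmap+write sketch, with a made-up file name and address; note that a single mapping is limited to 2 GB:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapWrite {
    public static void main(String[] args) throws IOException {
        try (FileChannel fc = FileChannel.open(Path.of("big-file.bin"), StandardOpenOption.READ);
             SocketChannel sock = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            // map() is backed by mmap: the file content is used in place,
            // with no CPU copy into a user-space array
            MappedByteBuffer mapped = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            while (mapped.hasRemaining()) {
                sock.write(mapped); // write() still copies into the kernel's socket buffer
            }
        }
    }
}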

3.2.2 sendfile

sendfile is another system call provided by the kernel, and its function prototype is as follows.

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

sendfile transfers data between two file descriptors. Because it operates entirely inside the OS kernel, it avoids copying data between the kernel buffer and the user buffer, so it can be used to achieve zero-copy. The flow is as follows.

[Figure linux_io_sendfile: sendfile data flow]

The sendfile approach involves 3 data copies (2 DMA copies and 1 CPU copy) and 2 context switches.
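
Java exposes sendfile through FileChannel.transferTo (see section 4.1.1). A minimal sketch, again with a made-up file name and address; transferTo may move fewer bytes than requested, so it is called in a loop:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendfileTransfer {
    public static void main(String[] args) throws IOException {
        try (FileChannel fc = FileChannel.open(Path.of("big-file.bin"), StandardOpenOption.READ);
             SocketChannel sock = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long pos = 0;
            long size = fc.size();
            while (pos < size) {
                // the kernel moves the bytes directly; no user-space buffer involved
                pos += fc.transferTo(pos, size - pos, sock);
            }
        }
    }
}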

3.2.3 sendfile+DMA scatter/gather

The Linux 2.4 kernel optimized sendfile by adding a scatter/gather variant, which removes the last CPU copy. The principle: instead of copying data from the kernel's Read Buffer to the Socket Buffer, only the memory address and offset of the Read Buffer are recorded in the Socket Buffer, so no copy is needed. In essence this is the same idea as the virtual memory solution: record the memory address rather than moving the data. The flow is as follows.

[Figure linux_io_sendfile: sendfile with DMA scatter/gather data flow]

The scatter/gather sendfile has only 2 DMA copies and 2 context switches; the CPU copy is gone entirely. However, this feature requires support from the hardware and its driver.

3.2.4 splice

The splice was introduced in Linux version 2.6.17 with the following function prototype.

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

The splice call is very similar to sendfile, but it is not limited to sendfile's functionality; that is, splice is a superset of sendfile. The kernel also has a tee function specifically for moving data between two pipes, again with zero copies.

  • Similarity: both splice and sendfile require two open file descriptors, one for input and one for output
  • Difference: splice can connect any two file descriptors (as long as at least one end is a pipe), not just a file and a socket, and splice does not require hardware support

Since Linux 2.6.23, sendfile no longer has its own mechanism-level implementation; the API and its behavior remain, but the functionality is implemented on top of the splice mechanism.

3.3 Summary

So-called zero-copy boils down to reducing CPU copies and context switches, as summarized below.

                 | system call | CPU copies | DMA copies | context switches
traditional I/O  | read+write  | 2          | 2          | 4
mmap             | mmap+write  | 1          | 2          | 4
sendfile         | sendfile    | 1          | 2          | 2
sendfile+gather  | sendfile    | 0          | 2          | 2
splice           | splice      | 0          | 2          | 2

Even after introducing zero-copy, the 2 DMA copies remain indispensable, since both are performed by the hardware.

4 Java layer

4.1 Zero-copy technology

Compared with the OS layer, Java adds its own heap memory management because of the JVM's GC. Zero-copy in Java therefore needs to consider two questions.

  • Where the copies happen: when user space writes data to disk, for example, the contents of the user buffer (on-heap memory) must be copied to the kernel buffer (off-heap memory), and the OS then writes the kernel buffer's contents to the disk
  • How zero-copy optimizes this: user space requests off-heap memory directly and writes the data destined for disk into it, eliminating the on-heap copy

Thus, the starting point of Java's zero-copy techniques is the ability to use off-heap memory.

4.1.1 Java NIO

The JVM's GC may move (copy) objects within the heap it manages, which hurts I/O efficiency. For high-performance scenarios that need the OS's native memory directly, DirectByteBuffer is available to request and release off-heap memory directly. This memory is not managed by the JVM and is not affected by GC, so using DirectByteBuffer avoids the extra copy that GC-managed heap memory would incur.
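
A minimal sketch of writing through off-heap memory with a direct buffer (the file name and payload are illustrative). With a heap buffer, the JVM would first have to copy the bytes into native memory before the OS-level write; a direct buffer skips that step:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferWrite {
    public static void main(String[] args) throws IOException {
        // allocateDirect returns off-heap memory that GC will not move,
        // so its address can be handed straight to the native write
        ByteBuffer direct = ByteBuffer.allocateDirect(8192);
        direct.put("hello zero-copy".getBytes());
        direct.flip(); // switch the buffer from writing to reading
        try (FileChannel fc = FileChannel.open(Path.of("out.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            fc.write(direct); // no extra on-heap -> off-heap copy needed
        }
    }
}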

Java NIO also wraps the OS-level zero-copy APIs, making them convenient for upper-layer applications to use.

  • mmap: the MappedByteBuffer class, backed by the mmap system call (see the sketch in section 3.2.1)
  • sendfile: FileChannel provides two methods, transferTo/transferFrom, backed by the sendfile system call (see the sketch in section 3.2.2)

4.1.2 Netty

Netty reduces data copying on three levels compared to Java’s built-in NIO.

  • Avoid data flowing through user space: Netty's FileRegion wraps FileChannel.transferTo, which defines how data is written to the target Channel; mmap+write can also be used
  • Avoid copies between the JVM heap and the OS user heap: Java provides DirectByteBuffer, and Netty encapsulates this off-heap memory behind its unified ByteBuf interface
  • Avoid multiple copies of data within user space: Netty's ByteBuf abstraction supports reference counting and pooling, and the CompositeByteBuf combined view reduces copies (see the sketch after this list)
    • ByteBuf: data can be shared, with reference counting managed via retain/release
    • ByteBufHolder: its duplicate() makes a shallow copy of a ByteBuf that shares the same data region but not the read/write indexes
    • CompositeByteBuf: combines several buffers and presents them to the outside world as one, so they can be operated on logically as a single complete ByteBuf, eliminating the overhead of allocating new space and copying the data again
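
A minimal sketch of the CompositeByteBuf technique, assuming Netty 4.1 on the classpath; the buffer contents are illustrative:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

public class CompositeDemo {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("HEADER", CharsetUtil.US_ASCII);
        ByteBuf body = Unpooled.copiedBuffer("BODY", CharsetUtil.US_ASCII);

        // addComponents(true, ...) advances the write index; the two buffers
        // are presented as one logical ByteBuf without copying any bytes
        CompositeByteBuf message = Unpooled.compositeBuffer();
        message.addComponents(true, header, body);

        System.out.println(message.toString(CharsetUtil.US_ASCII)); // HEADERBODY
        message.release(); // releasing the composite also releases its components
    }
}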

4.2 Open source analysis

4.2.1 Tomcat

Tomcat is a web server. Web applications usually contain static resource files, and since these files need no extra processing by the application, Tomcat can send them straight to the client without going through user space, which is exactly where the zero-copy techniques above apply.

To improve performance and save bandwidth, Tomcat has a built-in mechanism for compressing static resource files, but compression saves bandwidth at the cost of CPU. Tomcat therefore also provides a sendfile feature: by default, static files larger than 48 KB are transmitted directly via sendfile, with compression disabled. A servlet can also hand a file over to this mechanism itself, as sketched below.
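
A hedged sketch of how a servlet can request a sendfile transfer, based on the request attributes described in Tomcat's Advanced IO documentation (the file path is illustrative):

import java.io.File;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SendfileServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        File f = new File("/data/static/big-file.bin"); // illustrative path
        // the connector sets this attribute when sendfile is available
        if (Boolean.TRUE.equals(req.getAttribute("org.apache.tomcat.sendfile.support"))) {
            // hand the file to the connector; Tomcat performs the transfer itself
            req.setAttribute("org.apache.tomcat.sendfile.filename", f.getCanonicalPath());
            req.setAttribute("org.apache.tomcat.sendfile.start", 0L);
            req.setAttribute("org.apache.tomcat.sendfile.end", f.length());
            resp.setContentLengthLong(f.length());
        }
    }
}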

The related configuration can be found in the Tomcat documentation: Advanced IO and the Default Servlet Reference (tomcat.apache.org/tomcat-9.0-doc/default-servlet.html).

Tomcat provides three types of IO.

  • BIO: Blocking IO, which should be rarely used nowadays
  • NIO: non-blocking IO built on the NIO API provided by Java; Tomcat ships two implementations, NIO and NIO2
  • APR: JNI-based calls into operating-system APIs, with better performance than NIO, but the APR native libraries must be installed separately

Tomcat defines three types of Endpoint corresponding to the three IO modes above. In the NIO implementation, NioEndpoint.java, you can find the following code.

    if (sd.fchannel == null) {
        // Setup the file channel
        File f = new File(sd.fileName);
        @SuppressWarnings("resource") // Closed when channel is closed
        FileInputStream fis = new FileInputStream(f);
        sd.fchannel = fis.getChannel(); // Generate a FileChannel for a static file
    }

    // Configure output channel
    sc = socketWrapper.getSocket();
    // TLS/SSL channel is slightly different
    WritableByteChannel wc = ((sc instanceof SecureNioChannel) ? sc : sc.getIOChannel());

    // We still have data in the buffer
    if (sc.getOutboundRemaining() > 0) {
        ...
    } else {
        long written = sd.fchannel.transferTo(sd.pos, sd.length, wc); // underlying sendfile system call
        if (written > 0) {
            ...
        }
    }
4.2.2 RocketMQ

RocketMQ supports message persistence: after messages are received, they are sequentially appended to the CommitLog, a file stored on disk. Messages in the CommitLog are indexed by the ConsumeQueue; when a consumer obtains a message's real physical offset from the ConsumeQueue, it reads the corresponding message out of the CommitLog.

RocketMQ uses mmap+write to send the contents of the CommitLog file, which avoids copies through the JVM heap and also removes one CPU copy; it does not use sendfile.

The file-mapping initialization can be found in MappedFile.java. It is also easy to see why mmap was chosen over sendfile: message consumption and deletion require random read/write access.

try {
    this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
    this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
    TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
    TOTAL_MAPPED_FILES.incrementAndGet();
    ok = true;
} catch (FileNotFoundException e) {
    log.error("Failed to create file " + this.fileName, e);
    throw e;
} catch (IOException e) {
    ...
}

When a client pulls messages, the following code can be found in PullMessageProcessor.java.

if (this.brokerController.getBrokerConfig().isTransferMsgByHeap()) {
    final byte[] r = this.readGetMessageResult(getMessageResult, requestHeader.getConsumerGroup(), requestHeader.getTopic(), requestHeader.getQueueId());
    this.brokerController.getBrokerStatsManager().incGroupGetLatency(requestHeader.getConsumerGroup(),
        requestHeader.getTopic(), requestHeader.getQueueId(),
        (int) (this.brokerController.getMessageStore().now() - beginTimeMills));
    response.setBody(r);
} else {
    try {
        FileRegion fileRegion =
            new ManyMessageTransfer(response.encodeHeader(getMessageResult.getBufferTotalSize()), getMessageResult);
        channel.writeAndFlush(fileRegion).addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                getMessageResult.release();
                if (!future.isSuccess()) {
                    log.error("transfer many messages by pagecache failed, {}", channel.remoteAddress(), future.cause());
                }
            }
        });
    } catch (Throwable e) {
        log.error("transfer many message by pagecache exception", e);
        getMessageResult.release();
    }
}

  • If transferMsgByHeap is enabled, the message is read into heap memory and then written into the response body
  • Otherwise, a ManyMessageTransfer object is created; it implements Netty's FileRegion interface, and its transferTo method writes the content to the target channel (see the sketch after this list)
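
Outside RocketMQ, the same technique is available to any Netty application through DefaultFileRegion, Netty's stock FileRegion implementation. A minimal sketch; the channel and file path are assumed to exist, and this only works on channels without TLS:

import io.netty.channel.Channel;
import io.netty.channel.DefaultFileRegion;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileRegionSend {
    // Writes a whole file to the channel; Netty uses FileChannel.transferTo
    // (and thus sendfile) underneath
    static void sendFile(Channel channel, String path) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        DefaultFileRegion region = new DefaultFileRegion(raf.getChannel(), 0, raf.length());
        channel.writeAndFlush(region); // the region closes the file once released
    }
}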

5 Conclusion

This article has collected and organized some key points about zero-copy and taken a brief look at the source code of two open source projects. As you can see, zero-copy is not a complex or esoteric technique, and it is easy to apply at the Java layer. I hope this article gives you some inspiration for using zero-copy in your own network programming.