Changes to the String object in Java 9

Overview

In Java 9, the internal storage of String was changed from char[] to byte[]. The immediate benefit is better memory efficiency, hence the name Compact Strings. In Java a char takes 2 bytes and a byte takes 1 byte, but a Unicode character does not necessarily need 2 bytes: ASCII characters, for example, need only 1. So if your string consists entirely of ASCII characters, half the space is wasted when it is stored as char[]. More precisely, Java 9 uses one byte per character whenever every character in the string fits in Latin-1 (ISO-8859-1), which includes all of ASCII.
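As a rough illustration of the saving, the sketch below estimates the size of a string's character data under both layouts. The names CompactEstimate, fitsLatin1 and payloadBytes are made up for this example, and the check only mimics the kind of test the JDK is understood to perform internally; it is not the JDK's own code.

```java
class CompactEstimate {
    // True when every UTF-16 code unit fits in one byte (0x00-0xFF),
    // i.e. the text could be stored with one byte per character.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Estimated size of the character data only (object headers are ignored).
    // compact = false models the old char[] layout, compact = true the new one.
    static int payloadBytes(String s, boolean compact) {
        int bytesPerChar = (compact && fitsLatin1(s)) ? 1 : 2;
        return s.length() * bytesPerChar;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        System.out.println(payloadBytes(ascii, false)); // 10
        System.out.println(payloadBytes(ascii, true));  // 5
    }
}
```

A string containing even one character above 0xFF, such as π, falls back to 2 bytes per char in this model.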

A brief Unicode description

To see why, it helps to understand Unicode Code Points and Code Units. Think of Unicode as a giant table in which each character has a unique number.

A Code Point is the number assigned to a Unicode character in that table. For example, in a JavaScript console:

> 'A'.codePointAt(0).toString(16)
'41'
> 'π'.codePointAt(0).toString(16)
'3c0'
> '🙂'.codePointAt(0).toString(16)
'1f642'
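
The same lookup can be sketched in Java, since String has a codePointAt method too. The class name CodePointHex and the helper hex are only for illustration, and the non-ASCII characters are written as Unicode escapes so the source file encoding does not matter.

```java
class CodePointHex {
    // Hex form of the code point that starts at index 0 of the string.
    static String hex(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(hex("A"));            // 41
        System.out.println(hex("\u03C0"));       // 3c0   (π)
        System.out.println(hex("\uD83D\uDE42")); // 1f642 (🙂, a surrogate pair)
    }
}
```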

A Code Unit is the fixed-size unit used to store and transmit Unicode characters, and each Code Point is encoded as one or more Code Units. In UTF-8, for example:

Character   Code point   Code units
A           0x0041       01000001
π           0x03C0       11001111, 10000000
🙂          0x1F642      11110000, 10011111, 10011001, 10000010
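
The rows of this table can be reproduced in Java by encoding each character as UTF-8 and printing every byte in binary. This is a minimal sketch; Utf8Units and utf8Binary are names invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

class Utf8Units {
    // Returns the UTF-8 code units of s as 8-bit binary strings, comma separated.
    static String utf8Binary(String s) {
        StringJoiner out = new StringJoiner(", ");
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0-255 so negative bytes do not print as 32-bit values.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full 8 bits.
            out.add(String.format("%8s", bits).replace(' ', '0'));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Binary("A"));            // 01000001
        System.out.println(utf8Binary("\u03C0"));       // 11001111, 10000000
        System.out.println(utf8Binary("\uD83D\uDE42")); // 11110000, 10011111, 10011001, 10000010
    }
}
```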

A UTF-8 Code Unit occupies 8 bits (1 byte), a UTF-16 Code Unit 16 bits (2 bytes), and a UTF-32 Code Unit 32 bits (4 bytes). UTF-32 is a fixed-length encoding, while UTF-8 and UTF-16 are variable-length encodings: when a Code Point is converted to Code Units, one or more Code Units are used depending on the character (as the table above shows for UTF-8).
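
To make the difference concrete, the following sketch counts how many Code Units the same character needs in each encoding. It relies on two facts about Java: String.length() counts UTF-16 code units, and a UTF-32 code unit always holds exactly one code point, so codePointCount gives the UTF-32 figure. The class and method names are assumptions for this example.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    // UTF-8: one code unit per encoded byte.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16: a Java char is one UTF-16 code unit.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32: fixed length, one code unit per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE42"; // 🙂, U+1F642
        System.out.println(utf8Units(smiley));  // 4
        System.out.println(utf16Units(smiley)); // 2
        System.out.println(utf32Units(smiley)); // 1
    }
}
```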

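So when Java 9 changed String's internal storage from char[] to byte[], it did not switch to a fixed-length encoding such as UTF-32. The bytes hold Latin-1 (1 byte per character) when every character fits, and UTF-16 (2 bytes per Code Unit) otherwise, which keeps the memory footprint compact. Because UTF-16 is variable-length, a character outside the Basic Multilingual Plane, such as 🙂, takes two chars. That’s why String's Code Point methods, such as codePointAt and codePoints, matter when you need to work with whole characters.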
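
A short sketch of the difference between walking a string by char and by Code Point, using the chars() and codePoints() streams on String. The class and helper names are invented for this example.

```java
import java.util.stream.Collectors;

class CodePointIteration {
    // Hex value of every UTF-16 code unit (char) in the string.
    static String charsHex(String s) {
        return s.chars()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "));
    }

    // Hex value of every code point; surrogate pairs are combined.
    static String codePointsHex(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String text = "A\uD83D\uDE42"; // "A" followed by 🙂
        System.out.println(charsHex(text));      // 41 d83d de42
        System.out.println(codePointsHex(text)); // 41 1f642
    }
}
```

Iterating by char exposes the two surrogate halves of 🙂 separately, while iterating by Code Point yields the single value from the Unicode table.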