Changes to the String object in Java 9
Overview
In Java 9, the internal storage of String was changed from char[] to byte[]. The immediate benefit is lower memory use, hence the name of the feature: Compact Strings (JEP 254). In Java a char takes up 2 bytes and a byte takes up 1 byte, but a Unicode character does not necessarily need 2 bytes to be represented: ASCII characters (and, more generally, all Latin-1 characters) need only 1 byte. That is, if your string consists entirely of ASCII characters, half the space is wasted when it is stored as char[].
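The saving can be sketched with a small estimate. The snippet below is only an illustration, not the JDK's internal code: payloadBytes is a hypothetical helper that assumes the Java 9 rule that a string is stored with one byte per char when every char fits in Latin-1 (which includes ASCII), and with two bytes per char otherwise. It ignores object headers and padding.

```java
public class CompactStringEstimate {
    // Hypothetical helper (not a JDK method): estimates the bytes of character
    // data a Java 9+ String needs, assuming compact strings are enabled.
    static int payloadBytes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                // At least one char does not fit in Latin-1: two bytes per char.
                return 2 * s.length();
            }
        }
        // Every char fits in Latin-1: one byte per char.
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));         // 5
        System.out.println(payloadBytes("h\u00E9llo"));    // 5, é is a Latin-1 character
        System.out.println(payloadBytes("\u03C0 = 3.14")); // 16, π forces two bytes per char
    }
}
```

With char[] storage, all three strings would need two bytes per char (10, 10 and 16 bytes), so only the first two benefit from the change.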
A brief Unicode description
To follow the change, it helps to know what Unicode Code Points and Code Units are. Think of Unicode as a giant table where each character has a unique corresponding number.
A Code Point is the number assigned to a character in that table. For example, in a JavaScript console:
> 'A'.codePointAt(0).toString(16)
'41'
> 'π'.codePointAt(0).toString(16)
'3c0'
> '🙂'.codePointAt(0).toString(16)
'1f642'
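The same lookups can be done in Java, which has an equivalent codePointAt method on String. This is a minimal sketch; the class name CodePointDemo and the helper hexCodePoint are made up for illustration, and the emoji is built from its code point so the example does not depend on the source file's encoding.

```java
public class CodePointDemo {
    // Hypothetical helper: hex form of the first code point in s.
    static String hexCodePoint(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Build the emoji from its code point U+1F642.
        String smile = new String(Character.toChars(0x1F642));

        System.out.println(hexCodePoint("A"));      // 41
        System.out.println(hexCodePoint("\u03C0")); // 3c0 (π)
        System.out.println(hexCodePoint(smile));    // 1f642
    }
}
```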
A Code Unit is the fixed-size unit that an encoding uses to store and transmit Unicode characters; each Code Point is represented by one or more Code Units. In UTF-8, for example:
Character | Code point | Code units |
A | 0x0041 | 01000001 |
π | 0x03C0 | 11001111, 10000000 |
🙂 | 0x1F642 | 11110000, 10011111, 10011001, 10000010 |
A UTF-8 Code Unit occupies 8 bits (1 byte), a UTF-16 Code Unit occupies 16 bits (2 bytes), and a UTF-32 Code Unit occupies 32 bits (4 bytes). UTF-32 is a fixed-length encoding, while UTF-8 and UTF-16 are variable-length encodings: when a Code Point is converted to Code Units, one or more Code Units are used depending on the Code Point (as the table above shows for UTF-8).
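The UTF-8 column of the table can be reproduced in Java, and the UTF-16 code unit count compared alongside it. This is a sketch for illustration; CodeUnitDemo and utf8Binary are hypothetical names, while getBytes and StandardCharsets are standard JDK APIs.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // Renders each UTF-8 code unit (byte) of s as an 8-bit binary string.
    static String utf8Binary(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full 8 bits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = new String(Character.toChars(0x1F642)); // U+1F642
        for (String s : new String[] {"A", "\u03C0", smile}) {
            int utf8Units = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16Units = s.length(); // length() counts UTF-16 code units
            System.out.println(utf8Binary(s)
                    + " | UTF-8 units: " + utf8Units
                    + " | UTF-16 units: " + utf16Units);
        }
    }
}
```

For the three characters in the table this gives 1, 2 and 4 UTF-8 code units, and 1, 1 and 2 UTF-16 code units.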
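So, when Java 9 changed the storage of String from char[] to byte[] under the hood, it did not switch to a new variable-length encoding. Instead, each String records which of two encodings its bytes use: Latin-1, with one byte per character, when every character fits, and UTF-16, with two bytes per char, otherwise. That choice is what makes the memory footprint more compact. The public API still works in UTF-16 Code Units (char), so a character outside the Basic Multilingual Plane, such as 🙂, occupies two chars. That is why the Code Point methods on String (codePointAt, codePointCount, codePoints and so on) matter when you need to work with whole characters.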
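The difference between counting Code Units and counting Code Points shows up directly in the String API. The sketch below uses only standard String methods; the class name CodePointIteration and the helper codePointLength are made up for illustration.

```java
import java.util.stream.Collectors;

public class CodePointIteration {
    // Hypothetical helper: number of Unicode code points in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: lists the code points of s as "U+XXXX" strings.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> "U+" + Integer.toHexString(cp).toUpperCase())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String smile = new String(Character.toChars(0x1F642)); // U+1F642
        String s = "A\u03C0" + smile;                         // A, π, 🙂

        System.out.println(s.length());         // 4, UTF-16 code units
        System.out.println(codePointLength(s)); // 3, code points
        System.out.println(describe(s));        // U+41 U+3C0 U+1F642
    }
}
```

Iterating with charAt would split the emoji into two surrogate chars, whereas codePoints() yields it as a single value.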