In Java 9, the underlying storage of the String object was changed from char[] to byte[], which has the immediate benefit of being more memory-efficient, hence the name Compact Strings. In Java, a char takes up 2 bytes and a byte takes up 1 byte, but a Unicode character does not necessarily need 2 bytes to be represented: ASCII characters, at least, need only 1. That is, if your string consists entirely of ASCII characters, half the space is wasted when it is stored as char.
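To make the "half the space" claim concrete, here is a minimal sketch that compares how many bytes the same ASCII text needs at one byte per character versus two. It does not inspect the internals of String; it only compares the sizes arithmetically, and the class and method names (CompactSizeDemo, latin1Size, utf16Size) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CompactSizeDemo {
    // Bytes needed when every character is stored as a single byte (Latin-1).
    static int latin1Size(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed when every char is stored as a 2-byte UTF-16 code unit.
    static int utf16Size(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        System.out.println(latin1Size(ascii)); // 5
        System.out.println(utf16Size(ascii));  // 10
    }
}
```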
A brief Unicode description
Before talking about Code Points and Code Units, think of Unicode as a giant table where each character has a unique corresponding number.
A Code Point is the number assigned to a Unicode character in that table. For example, in a JavaScript console:

    > 'A'.codePointAt(0).toString(16)
    '41'
    > 'π'.codePointAt(0).toString(16)
    '3c0'
    > '🙂'.codePointAt(0).toString(16)
    '1f642'
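The same lookup can be sketched in Java using String.codePointAt and Integer.toHexString. The class name CodePointDemo and the helper hexCodePoint are hypothetical, and the non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
public class CodePointDemo {
    // Hex form of the code point that starts at index 0 of the string.
    static String hexCodePoint(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(hexCodePoint("A"));            // 41
        System.out.println(hexCodePoint("\u03C0"));       // 3c0   (the letter pi)
        System.out.println(hexCodePoint("\uD83D\uDE42")); // 1f642 (the smiling face)
    }
}
```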
A Code Point is stored and transmitted as one or more Code Units. In UTF-8 encoding, for example:
|Character|Code point|Code units|
|---|---|---|
|🙂|0x1F642|11110000, 10011111, 10011001, 10000010|
A UTF-8 Code Unit occupies 8 bits (1 byte), a UTF-16 Code Unit occupies 16 bits (2 bytes), and a UTF-32 Code Unit occupies 32 bits (4 bytes). Unlike UTF-32, UTF-8 and UTF-16 are variable-length encodings, which means that converting a Code Point into Code Units takes one or more Code Units depending on the character (as shown in the table above).
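As a rough illustration of variable-length encoding, the sketch below counts how many Code Units one character needs in each encoding. The names CodeUnitDemo, utf8Units, utf16Units and utf32Units are hypothetical. The UTF-32 count is derived from the number of Code Points rather than from an actual UTF-32 encoder, on the assumption that each Code Point takes exactly one UTF-32 Code Unit.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // UTF-8 code units are bytes, so the encoded length is the unit count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A Java char is one UTF-16 code unit, so length() is the unit count.
    static int utf16Units(String s) {
        return s.length();
    }

    // Assumption: UTF-32 uses exactly one code unit per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE42"; // U+1F642
        System.out.println(utf8Units(smile));  // 4
        System.out.println(utf16Units(smile)); // 2
        System.out.println(utf32Units(smile)); // 1
    }
}
```

For "A" all three counts are 1, which is the case where a one-byte Code Unit is enough.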
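So, when the Java 9 String switched from char[] to byte[] as its underlying storage, the memory footprint became more compact: unless a fixed-length encoding such as UTF-32 is used, a character takes only as many Code Units as it needs. In the JDK's implementation, the byte[] holds Latin-1 (one byte per character) when every character fits, and UTF-16 otherwise. Since a single character may still span more than one char, the String API offers Code Point methods such as codePointAt, codePointCount and codePoints alongside the char-based ones.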
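A small sketch of why the Code Point methods matter: iterating a string by char splits a character outside the Basic Multilingual Plane into two surrogates, while iterating by Code Point keeps it whole. The class name CodePointIterationDemo and its helpers are hypothetical; chars(), codePoints(), length() and codePointCount() are standard String methods.

```java
public class CodePointIterationDemo {
    // One entry per UTF-16 code unit; surrogates appear as separate values.
    static int[] codeUnits(String s) {
        return s.chars().toArray();
    }

    // One entry per code point; a surrogate pair is combined into one value.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE42"; // "A" followed by U+1F642
        System.out.println(s.length());                      // 3
        System.out.println(s.codePointCount(0, s.length())); // 2
        for (int cp : codePoints(s)) {
            System.out.println(Integer.toHexString(cp));     // 41, then 1f642
        }
    }
}
```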