JavaScript – working with Emoji or the café problem

Working with non-English languages or emoji and finding that strings get garbled or just don’t add up? It’s a long story, but the point is that you can’t use plain string indexing with complex characters and emoji. Let me explain…

I wrote some JS code to lay out text. Each letter in the string could have different font characteristics, so I had to measure each character individually. To do that I had a simple for-loop iterating over the length of the input string and using string.charAt(index) to pull out each character in turn, which I then measured so that I could place it in the output.

The problem with this is that not all the characters we see displayed in a browser can be stored in a single element of a JavaScript string (a UTF-16 code unit, which is what charAt() and length work with). Take for example the word café. Part of my use case is menus, so that word might pop up from time to time!

Café looks like a string of 4 letters. However, in code it can be defined as

cafe\u0301

which means the base character e followed by the combining mark U+0301 COMBINING ACUTE ACCENT (rendered as ◌́ ).

So – answer this trick question. What is the length of the variable str?

console.log('cafe\u0301'); // => 'café'
console.log('café');       // => 'café'

const str = "cafe\u0301";
console.log(str.length);   // logs 5 (huh!)

Yup, a word we see as 4 chars on screen is in fact 5 as far as length is concerned.
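To make the café case concrete, here is a minimal sketch comparing the decomposed spelling above with the precomposed one (a single U+00E9), using the built-in String.prototype.normalize(). The variable names are mine, for illustration only. Note that normalizing only helps where Unicode defines a precomposed character, so it is not a general fix.

```javascript
// Decomposed form: plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = 'cafe\u0301';
// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const precomposed = 'caf\u00E9';

console.log(decomposed.length);           // 5
console.log(precomposed.length);          // 4
console.log(decomposed === precomposed);  // false, although both display as café

// NFC composes base + combining mark into one code point where one exists.
console.log(decomposed.normalize('NFC').length);           // 4
console.log(decomposed.normalize('NFC') === precomposed);  // true

// NFD goes the other way and splits é back into e + U+0301.
console.log(precomposed.normalize('NFD').length);          // 5
```

So two strings that look identical on screen can differ in both length and equality, depending on how they were typed or produced.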

An example with emoji is the raised hand ✋ when we add a skin tone (even WordPress can’t handle the newer emoji; medium-dark skin tone is code point U+1F3FE). Emoji are defined in the Unicode set, so shouldn’t we be able to treat them like any other character? Well, yes and no. Some emoji are single characters, some are combinations like this example.

The raised hand + medium-dark skin tone is actually made up of the code points U+270B and U+1F3FE combined. In a literal JavaScript string this is written as \u270B\u{1F3FE} (note the braces around the 5 hex digits; a bare \u escape takes exactly 4 hex digits).

const str = "\u270B\u{1F3FE}";
console.log(str);
console.log(str.length);   // logs 3 (huh!): 1 code unit for U+270B, 2 for U+1F3FE

// try logging each char the 'old' way
for (let i = 0; i < str.length; i++) {
  console.log('char ' + i + ' = ' + str.charAt(i));
}

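In the console, char 0 is the hand on its own, while chars 1 and 2 show up as two broken characters: they are the two halves of the surrogate pair that encodes the skin tone, and neither half means anything by itself.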
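A first step up from charAt() is to iterate by code point, which is what the built-in string iterator does (for...of, spread, Array.from). The sketch below uses a helper I've named codePointLabels purely for illustration. It keeps the skin-tone modifier in one piece, but it still does not glue the modifier to the hand, or the accent to the e.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePointLabels(text) {
  // Array.from() uses the string iterator, which yields whole code points
  // rather than individual UTF-16 code units.
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const hand = '\u270B\u{1F3FE}';

console.log(hand.length);              // 3 code units
console.log(Array.from(hand).length);  // 2 code points
console.log(codePointLabels(hand));    // [ 'U+270B', 'U+1F3FE' ]

// Combining marks are still separate code points, so café is still 5.
console.log(Array.from('cafe\u0301').length);  // 5
```

So code-point iteration fixes the split surrogate pair, but not the bigger problem of several code points forming one on-screen character.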

This topic needs way more explanation than I am qualified to provide. Luckily, there are some great blog posts on this topic.

For my own project I selected Graphemer as a way forward. What it does is well described on the project’s readme page. In a nutshell, it works out which characters are single and which are multi-character Unicode combos, and returns an array. You can then iterate over the array instead of using plain charAt(). It seems to be working for me.
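To show the idea of splitting a string into an array of on-screen characters without pulling in a dependency, here is a rough sketch using the built-in Intl.Segmenter rather than Graphemer itself. The helper name splitIntoGraphemes is my own, and Intl.Segmenter is only available in reasonably recent runtimes, so check support for your targets. Graphemer's readme describes a splitGraphemes method that plays the same role.

```javascript
// Hypothetical helper: split text into user-perceived characters
// (grapheme clusters) with the built-in Intl.Segmenter.
// This is a stand-in for a library such as Graphemer, not its API.
function splitIntoGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const cafeParts = splitIntoGraphemes('cafe\u0301');
console.log(cafeParts.length);  // 4: c, a, f, and e + accent as one item

const handParts = splitIntoGraphemes('\u270B\u{1F3FE}');
console.log(handParts.length);  // 1: hand and skin tone stay together

// Iterate the array instead of indexing the string with charAt().
for (const [i, grapheme] of cafeParts.entries()) {
  console.log('char ' + i + ' = ' + grapheme);
}
```

Each array item can then be measured and placed as one unit, which is what the layout loop needed in the first place.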

Summary

We’ve sounded the alarm about multi-character Unicode combos and how plain JavaScript string indexing can’t handle them without intervention. I’ve shown a couple of examples, given some links to much better material than you are reading, and told you what’s working for me. The rest is up to you!

Thanks for reading.

VW. March 2023

Image credit to Dan Cristian Pădureț
