[Flutter] The split('') Trap — Android Crash from Emoji and Some CJK Characters

2026-05-27

A summary of the cause and fix for the "string is not well-formed UTF-16" Android crash that occurs when a word containing emoji or certain CJK characters is split character by character.

flutter

Outline

One day, the following Crashlytics report came in for a Japanese vocabulary learning app:

Fatal Exception: java.lang.RuntimeException
... string is not well-formed UTF-16 ...

The word that triggered the crash was 𩸽 (ほっけ, hokke — a fish commonly grilled at home in Japan). The moment we fed this single character into a widget that splits and displays text character by character, the app died on Android devices.

The cause is that Dart’s String.split('') cuts strings by UTF-16 code unit. Characters like 𩸽 (as well as most emoji and some other CJK characters) belong to Unicode’s supplementary planes (U+10000 and above) and are represented in UTF-16 as a surrogate pair (two code units), so split('') breaks the pair into two pieces, producing lone surrogates. Lone surrogates are not valid UTF-16, so the moment they’re passed to Android through the platform channel, they get rejected with “string is not well-formed UTF-16”.

This post covers:

  • Why String.split('') does NOT actually split “character by character”
  • A brief background on Unicode supplementary planes and UTF-16 surrogate pairs
  • A safe runes-based split helper and how to apply it
  • How to write tests that pin down the trap

Why split('') is the Problem

Dart’s String is actually an ordered series of UTF-16 code units, not an ordered series of characters / graphemes. You can verify this with these two lines:

print('𩸽'.length);            // 2  (2 code units)
print('𩸽'.split('').length);   // 2  (a lone surrogate each)

To human intuition, “𩸽 is one character”, but from Dart’s perspective, it’s two code units [0xD867, 0xDE3D]. split('') separates these two units into two length-1 strings. The two resulting pieces are:

  • First piece: a high surrogate (U+D867) standing alone
  • Second piece: a low surrogate (U+DE3D) standing alone
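
The two pieces can be inspected directly. The following is a minimal sketch that prints each code unit in hexadecimal; the helper name hex is made up for this illustration only.

```dart
// Hypothetical helper for this illustration: formats a UTF-16 code unit as 0xXXXX.
String hex(int unit) =>
    '0x${unit.toRadixString(16).toUpperCase().padLeft(4, '0')}';

void main() {
  const word = '𩸽'; // U+29E3D, stored as a surrogate pair

  // The whole string holds two code units.
  print(word.codeUnits.map(hex).toList()); // [0xD867, 0xDE3D]

  // split('') yields two strings, each holding one half of the pair.
  final pieces = word.split('');
  print(pieces.map((p) => hex(p.codeUnitAt(0))).toList()); // [0xD867, 0xDE3D]

  // Neither piece is a complete character on its own.
  print(pieces.every((p) => p.length == 1)); // true
}
```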

In UTF-16, high/low surrogates are valid only when they appear strictly as a pair. So if you feed this split result directly into a Flutter widget like Text, the platform channel’s encoding step judges it as invalid UTF-16, and crashes occur in cases such as:

  • When Android’s TextView etc. validates a string received via the platform channel
  • When StandardMessageCodec serializes the String
  • When some native text measurement / shaping APIs require well-formed UTF-16

Even if iOS doesn’t crash on the same input, that’s just a difference in how lenient the OS / text engine is — creating lone surrogates is wrong in itself. Silent regressions like text measurements being slightly off or font fallbacks failing can happen in places you don’t see.
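
Because a lenient platform hides the bug, it can help to detect lone surrogates explicitly, for example in a debug-only check before text reaches a widget. The sketch below is one way to do that; isWellFormedUtf16 is a name assumed for this example, not an existing Dart API.

```dart
/// Returns true when [text] contains no lone (unpaired) surrogate.
/// A sketch: the name isWellFormedUtf16 is an assumption, not a Dart API.
bool isWellFormedUtf16(String text) {
  for (var i = 0; i < text.length; i++) {
    final unit = text.codeUnitAt(i);
    final isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    final isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh) {
      // A high surrogate must be followed immediately by a low surrogate.
      if (i + 1 >= text.length) return false;
      final next = text.codeUnitAt(i + 1);
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // Skip the low half of the valid pair.
    } else if (isLow) {
      // A low surrogate that was not consumed as part of a pair is alone.
      return false;
    }
  }
  return true;
}

void main() {
  print(isWellFormedUtf16('𩸽')); // true
  print('𩸽'.split('').map(isWellFormedUtf16).toList()); // [false, false]
  print(isWellFormedUtf16('漢字')); // true
}
```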

Unicode Supplementary Plane and Surrogate Pair

A short background:

  • Unicode manages code points by dividing them into 17 “planes”.
  • The BMP (Basic Multilingual Plane, U+0000 ~ U+FFFF) is plane 0, which contains the most commonly used characters. ASCII, Hangul, most CJK ideographs (漢字), katakana / hiragana — all in BMP.
  • The supplementary planes (U+10000 ~ U+10FFFF) are planes 1 through 16, containing additional characters that didn’t fit in the BMP.

UTF-16 encoding represents BMP characters with 1 code unit and supplementary plane characters with 2 code units (a surrogate pair).

'A'    (U+0041)   → [0x0041]
'漢'   (U+6F22)   → [0x6F22]
'😀'   (U+1F600)  → [0xD83D, 0xDE00]  ← surrogate pair
'𩸽'   (U+29E3D)  → [0xD867, 0xDE3D]  ← surrogate pair
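
The pairs in the table follow from simple arithmetic: subtract 0x10000 from the code point, then put the top 10 bits of the result into the high surrogate and the bottom 10 bits into the low surrogate. A minimal sketch of that calculation follows; toSurrogatePair is a hypothetical helper written for this illustration, since Dart does this for you internally.

```dart
/// Encodes a supplementary-plane code point (U+10000..U+10FFFF) as a
/// UTF-16 surrogate pair [high, low]. Hypothetical helper for illustration.
List<int> toSurrogatePair(int codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw RangeError.range(codePoint, 0x10000, 0x10FFFF, 'codePoint');
  }
  final offset = codePoint - 0x10000; // 20-bit value
  final high = 0xD800 + (offset >> 10); // top 10 bits
  final low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

void main() {
  final pair = toSurrogatePair(0x29E3D); // 𩸽
  print(pair.map((u) => u.toRadixString(16)).toList()); // [d867, de3d]

  // The result matches what Dart stores for the same character.
  final units = '𩸽'.codeUnits;
  print(units[0] == pair[0] && units[1] == pair[1]); // true
}
```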

Examples of supplementary plane characters a Japanese vocabulary app commonly runs into:

  • 𩸽 (U+29E3D, hokke — fish)
  • 𠮷 (U+20BB7, Yoshi — a variant of the surname “吉”, commonly seen on restaurant signs)
  • 𠀋 (U+2000B, Jō — a kanji used in names)
  • And most emoji (😀, 🎉, 🍣, etc.)

All CJK ideographs from CJK Unified Ideographs Extension B (U+20000~U+2A6DF) onwards, and most emoji, belong to the supplementary planes. Whether you’re doing kanji learning or emoji processing, you can’t avoid them.

The Fix — A runes-Based Split Helper

Dart provides String.runes for iterating by code points. runes automatically combines surrogate pairs and treats them as a single code point.

print('𩸽'.runes.length); // 1

The split helper built on this is simple:

/// Split a string into Unicode code points.
///
/// `String.split('')` splits by UTF-16 code unit, breaking the surrogate pair
/// of supplementary plane characters (e.g. 𩸽 U+29E3D, emoji) and causing
/// "string is not well-formed UTF-16" crashes.
/// This function uses `runes` to split by code point, preserving surrogate
/// pairs as a single character.
List<String> splitByCodePoint(String text) {
  return text.runes.map((r) => String.fromCharCode(r)).toList();
}

The core is a single line (text.runes.map((r) => String.fromCharCode(r)).toList()). When the argument is a supplementary plane code point, String.fromCharCode(int) automatically produces a length-2 String correctly composed as a surrogate pair. So each element of the result list matches what humans perceive as “one character”.

splitByCodePoint('𩸽');       // ['𩸽']
splitByCodePoint('𩸽の魚');   // ['𩸽', 'の', '魚']
splitByCodePoint('a😀b');     // ['a', '😀', 'b']
splitByCodePoint('');         // []

Note: runes is not a silver bullet either. Combining characters (e.g. e + ´ = é) or ZWJ emoji sequences (👨‍👩‍👧) consist of multiple code points forming a single grapheme. To split precisely by grapheme, you need String.characters from the characters package. However, since the goal of this fix is “don’t break surrogate pairs”, runes is sufficient. The required precision depends on your domain.
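
To make the difference concrete, here is a small sketch comparing code units, code points, and graphemes. It assumes the characters package is listed as a dependency of the project; the variable names are only for illustration.

```dart
// Assumes the `characters` package is available as a dependency.
import 'package:characters/characters.dart';

void main() {
  const combined = 'e\u0301'; // "e" followed by COMBINING ACUTE ACCENT
  print(combined.runes.length); // 2 code points
  print(combined.characters.length); // 1 grapheme cluster

  // Man + ZWJ + woman + ZWJ + girl, written with escapes for clarity.
  const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
  print(family.length); // 8 code units
  print(family.runes.length); // 5 code points (3 emoji + 2 ZWJ)
  print(family.characters.length); // 1 grapheme cluster

  // Grapheme-aware split, for UIs that need user-perceived characters.
  print('a${family}b'.characters.toList().length); // 3
}
```

If the text you split can contain ZWJ sequences or combining marks and each piece is shown to the user separately, the grapheme-based version is probably the safer default.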

Applying — Replacing All Call Sites

The essence of the problem is that the same trap exists wherever .split('') is used. Fixing only one call site means the same crash will resurface in another widget. Use grep to find all .split('') calls and replace them in one pass.

# Find .split('') calls across the project
grep -rn "\.split('')" lib/
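
The grep above only matches the single-quoted form. As a rough alternative, a small Dart script can also catch .split("") and calls with extra whitespace. This is a sketch under the assumption that your sources live in lib/; findEmptySplits and emptySplit are hypothetical names.

```dart
import 'dart:io';

// Matches .split('') and .split("") with optional whitespace inside the parens.
final emptySplit = RegExp(r'''\.split\(\s*(?:''|"")\s*\)''');

/// Returns the 1-based line numbers in [source] that call split with an
/// empty pattern. Hypothetical helper for illustration.
List<int> findEmptySplits(String source) {
  final lines = source.split('\n');
  return [
    for (var i = 0; i < lines.length; i++)
      if (emptySplit.hasMatch(lines[i])) i + 1,
  ];
}

void main() {
  final dir = Directory('lib'); // Assumed source directory.
  if (!dir.existsSync()) return;

  final files = dir
      .listSync(recursive: true)
      .whereType<File>()
      .where((f) => f.path.endsWith('.dart'));
  for (final file in files) {
    for (final line in findEmptySplits(file.readAsStringSync())) {
      print('${file.path}:$line');
    }
  }
}
```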

In the project I’m developing, I fixed them with the following common pattern:

// Before
word.split('').map((char) { ... });

// After
splitByCodePoint(word).map((char) { ... });

In the speech-to-text (STT) comparison logic, the same pattern was used in the path that tokenizes CJK text character by character:

// Before — tokenize with split('')
return text
    .replaceAll(' ', '')
    .split('')
    .where((c) => c.isNotEmpty)
    .toList();

// After — split by code point to preserve supplementary plane CJK chars (e.g. U+29E3D 𩸽)
return splitByCodePoint(text.replaceAll(' ', ''))
    .where((c) => c.isNotEmpty)
    .toList();

The change itself is low-risk. Behavior is completely identical when only BMP characters are present; the behavior only changes “in the right direction” when supplementary plane characters appear.
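
That claim is easy to check. The sketch below compares both approaches on a few BMP-only words and on 𩸽; sameList is a hypothetical helper, and splitByCodePoint is repeated so the example stands alone.

```dart
List<String> splitByCodePoint(String text) =>
    text.runes.map((r) => String.fromCharCode(r)).toList();

/// Compares two lists element by element. Hypothetical helper for illustration.
bool sameList(List<String> a, List<String> b) {
  if (a.length != b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] != b[i]) return false;
  }
  return true;
}

void main() {
  // BMP-only input: both approaches return the same pieces.
  for (final word in ['abc', '漢字', 'ご飯']) {
    print(sameList(word.split(''), splitByCodePoint(word))); // true
  }

  // Supplementary plane input: only the helper keeps the pair intact.
  print(sameList('𩸽'.split(''), splitByCodePoint('𩸽'))); // false
}
```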

Tests — Explicitly Contrasting with split('')

The most educational part of this fix is the tests. We don’t just verify the helper’s correct behavior; we also pin down, as a test, the fact that split('') and splitByCodePoint behave differently. With this in place, if someone later reimplements the helper on top of split(''), the test fails clearly. Reverting an individual call site is caught by the widget tests further down.

group('`splitByCodePoint`', () {
  test('Splits BMP characters by code point', () {
    expect(splitByCodePoint('漢字'), ['漢', '字']);
    expect(splitByCodePoint('ご飯'), ['ご', '飯']);
    expect(splitByCodePoint('abc'), ['a', 'b', 'c']);
  });

  test('Empty string returns empty list', () {
    expect(splitByCodePoint(''), <String>[]);
  });

  test('Preserves supplementary plane (SIP) CJK characters as a single character', () {
    // 𩸽 (U+29E3D, ほっけ): represented as a surrogate pair in UTF-16
    expect(splitByCodePoint('𩸽'), ['𩸽']);
    expect(splitByCodePoint('𩸽の魚'), ['𩸽', 'の', '魚']);
  });

  test('Preserves emoji as a single character', () {
    // 😀 (U+1F600): represented as a surrogate pair in UTF-16
    expect(splitByCodePoint('😀'), ['😀']);
    expect(splitByCodePoint('a😀b'), ['a', '😀', 'b']);
  });

  test('Unlike `split`, does not break surrogate pairs', () {
    // String.split('') splits by UTF-16 code unit, breaking surrogate pairs
    expect('𩸽'.split('').length, 2);
    // splitByCodePoint splits by code point, keeping it as a single character
    expect(splitByCodePoint('𩸽').length, 1);
  });
});

The two lines of the last test are the key:

expect('𩸽'.split('').length, 2);          // Documents the trap's existence
expect(splitByCodePoint('𩸽').length, 1);  // Documents the helper's promise

The tests document the essence of the bug. Someone reading the code for the first time can immediately understand “why this helper is needed” from those two lines. It’s a pattern that delivers value beyond mere behavior verification.

Also add a test on the widget side that renders a word containing a supplementary plane character. For every widget that uses the helper, pin down a one-liner test in the same form.

testWidgets('Renders a word containing a supplementary plane CJK character (𩸽) without crashing', (tester) async {
  await tester.pumpWidget(MaterialApp(home: WordText(word: '𩸽')));
  expect(tester.takeException(), isNull);
});

Asserting that tester.takeException() is null checks that no exception was thrown while the widget was built and rendered in the test.

Wrap-Up

A summary of the .split('') trap and its resolution:

  • Dart’s String is an ordered series of UTF-16 code units. .length, .split(''), and [] indexing all operate by code unit.
  • Supplementary plane characters (U+10000 and above) are represented as a surrogate pair (2 code units) in UTF-16. split('') breaks the pair into two pieces, producing lone surrogates.
  • Lone surrogates are not valid UTF-16. They lead to the “string is not well-formed UTF-16” crash at points where Android’s platform channel / TextView requires well-formed UTF-16.
  • A single runes-based splitByCodePoint helper is enough as a fix. Find all .split('') calls with grep and replace them in one pass.
  • Document the trap’s existence through tests. Putting '𩸽'.split('').length == 2 and splitByCodePoint('𩸽').length == 1 in the same test prevents future regressions, and the test itself becomes documentation.
  • Japanese / CJK vocabulary apps and any app that handles emoji are all in the potential blast radius. If 𩸽, 𠮷, or 😀 can come in through user input, a database, or STT results, the same trap is lurking.

.split('') looks like it processes each character, but it actually processes code units. I recommend running grep -rn "\.split('')" lib/ once in your Dart/Flutter project to check for lurking supplementary plane time bombs.
