ENSIP-15: Name Normalization
Authors: raffy.eth
Created: April 3, 2023
Status: final
Abstract
This ENSIP standardizes the Ethereum Name Service (ENS) name normalization process outlined in ENSIP-1 § Name Syntax.
Motivation
- Since ENSIP-1 (originally EIP-137) was finalized in 2016, Unicode has evolved from version 8.0.0 to 15.0.0 and incorporated many new characters, including complex emoji sequences.
- ENSIP-1 does not state the version of Unicode.
- ENSIP-1 implies but does not state an explicit flavor of IDNA processing.
- UTS-46 is insufficient to normalize emoji sequences. Correct emoji processing is only possible with UTS-51.
- Validation tests are needed to ensure implementation compliance.
- The success of ENS has encouraged spoofing via the following techniques:
  - Insertion of zero-width characters.
  - Using names which normalize differently between algorithms.
  - Using names which appear differently between applications and devices.
  - Substitution of confusable (look-alike) characters.
  - Mixing incompatible scripts.
Specification
- Unicode version 16.0.0
  - Normalization is a living specification and should use the latest stable version of Unicode.
- spec.json contains all necessary data for normalization.
- nf.json contains all necessary data for Unicode Normalization Forms NFC and NFD.
Definitions
- Terms in bold throughout this document correspond with components of spec.json.
- A string is a sequence of Unicode codepoints.
  - Example: "abc" is 61 62 63
- A Unicode emoji is a single entity composed of one or more codepoints:
  - An Emoji Sequence is the preferred form of an emoji, resulting from input that is tokenized into an Emoji token.
    - Example: 💩︎︎ [1F4A9] → Emoji[1F4A9 FE0F]
      - 1F4A9 FE0F is the Emoji Sequence.
  - spec.json contains the complete list of valid Emoji Sequences.
    - Derivation defines which emoji are normalizable.
    - Not all Unicode emoji are valid.
      - ‼ [203C] double exclamation mark → error: Disallowed character
      - 🈁 [1F201] Japanese “here” button → Text["ココ"]
  - An Emoji Sequence may contain characters that are disallowed:
    - 👩❤️👨 [1F469 200D 2764 FE0F 200D 1F468] couple with heart: woman, man — contains ZWJ
    - #️⃣ [23 FE0F 20E3] keycap: # — contains 23 (#)
    - 🏴 [1F3F4 E0067 E0062 E0065 E006E E0067 E007F] — contains E00XX
  - An Emoji Sequence may contain other emoji:
    - Example: ❤️ [2764 FE0F] red heart is a substring of ❤️🔥 [2764 FE0F 200D 1F525] heart on fire
  - Single-codepoint emoji may have various presentation styles on input:
    - Default: ❤ [2764]
    - Text: ❤︎ [2764 FE0E]
    - Emoji: ❤️ [2764 FE0F]
  - However, these all tokenize to the same Emoji Sequence.
  - All Emoji Sequences have explicit emoji-presentation.
  - The convention of ignoring presentation is difficult to change because:
    - Presentation characters (FE0F and FE0E) are Ignored
    - ENSIP-1 did not treat emoji differently from text
    - Registration hashes are immutable
  - Beautification can be used to restore emoji-presentation in normalized names.
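The codepoint notation and the three presentation styles above can be illustrated with a minimal Python sketch. The helper names `to_hex` and `strip_presentation` are hypothetical and not part of this ENSIP. The sketch only shows that the three heart inputs differ solely in FE0E/FE0F; it is not how an implementation tokenizes emoji.

```python
def to_hex(s: str) -> str:
    """Format a string as space-separated uppercase hex codepoints."""
    return " ".join(f"{ord(ch):X}" for ch in s)


def strip_presentation(s: str) -> str:
    """Drop the variation selectors FE0E (text) and FE0F (emoji)."""
    return "".join(ch for ch in s if ch not in "\uFE0E\uFE0F")


print(to_hex("abc"))  # 61 62 63

# Default, text, and emoji presentation of the same heart.
forms = ["\u2764", "\u2764\uFE0E", "\u2764\uFE0F"]
print({to_hex(strip_presentation(f)) for f in forms})  # {'2764'}
```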
Algorithm
- Normalization is the process of canonicalizing a name before hashing.
- It is idempotent: applying normalization multiple times produces the same result.
- For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
- No string transformations (like case-folding) should be applied.
Normalize
- Tokenize — transform the label into Text and Emoji tokens.
  - If there are no tokens, the label cannot be normalized.
- Apply NFC to each Text token.
  - Example: Text["à"] → [61 300] → [E0] → Text["à"]
- Strip FE0F from each Emoji token.
- Validate — check if the tokens are valid and obtain the Label Type.
  - The Label Type and Restricted state may be presented to the user for additional security.
- Concatenate the tokens together.
- Return the normalized label.

Examples:
- "_$A" [5F 24 41] → "_$a" [5F 24 61] — ASCII
- "E︎̃" [45 FE0E 303] → "ẽ" [1EBD] — Latin
- "𓆏🐸" [1318F 1F438] → "𓆏🐸" [1318F 1F438] — Restricted: Egyp
- "nı̇ck" [6E 131 307 63 6B] → error: Disallowed character
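The steps that follow tokenization (NFC on text, stripping FE0F from emoji, concatenation) can be sketched as below. This is a rough illustration under assumptions: tokens are modeled as `(kind, codepoints)` tuples, the name `finish_label` is hypothetical, validation is omitted, and Python's `unicodedata` stands in for nf.json even though its bundled Unicode version may differ.

```python
import unicodedata

FE0F = 0xFE0F


def finish_label(tokens):
    """Apply NFC to text tokens, strip FE0F from emoji tokens, concatenate.

    Validation is intentionally left out of this sketch.
    """
    if not tokens:
        raise ValueError("empty label cannot be normalized")
    out = []
    for kind, cps in tokens:
        if kind == "text":
            composed = unicodedata.normalize("NFC", "".join(map(chr, cps)))
            out.extend(ord(ch) for ch in composed)
        else:
            out.extend(cp for cp in cps if cp != FE0F)
    return "".join(map(chr, out))


# Text[61 300] + Emoji[1F4A9 FE0F] becomes E0 1F4A9
label = finish_label([("text", [0x61, 0x300]), ("emoji", [0x1F4A9, FE0F])])
```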
Tokenize
Convert a label into a list of Text and Emoji tokens, each with a payload of codepoints.  The complete list of character types and emoji sequences can be found in spec.json.
- Allocate an empty codepoint buffer.
- Find the longest Emoji Sequence that matches the remaining input.
  - Example: 👨🏻💻 [1F468 1F3FB 200D 1F4BB]
    - Match (1): 👨️ [1F468] man
    - Match (2): 👨🏻 [1F468 1F3FB] man: light skin tone
    - Match (4): 👨🏻💻 [1F468 1F3FB 200D 1F4BB] man technologist: light skin tone — longest match!
  - FE0F is optional in the input during matching.
    - Example: 👨❤️👨 [1F468 200D 2764 FE0F 200D 1F468]
      - Match: 1F468 200D 2764 FE0F 200D 1F468 — fully-qualified
      - Match: 1F468 200D 2764 200D 1F468 — missing FE0F
      - No match: 1F468 FE0F 200D 2764 FE0F 200D 1F468 — extra FE0F
      - No match: 1F468 200D 2764 FE0F FE0F 200D 1F468 — has (2) FE0F
    - This is equivalent to /^(emoji1|emoji2|...)/ where \uFE0F is replaced with \uFE0F? and * is replaced with \x2A.
- If an Emoji Sequence is found:
  - If the buffer is nonempty, emit a Text token, and clear the buffer.
  - Emit an Emoji token with the fully-qualified matching sequence.
  - Remove the matched sequence from the input.
- Otherwise:
  - Remove the leading codepoint from the input.
  - Determine the character type:
    - If Valid, append the codepoint to the buffer.
      - This set can be precomputed from the union of characters in all groups and their NFD decompositions.
    - If Mapped, append the corresponding mapped codepoint(s) to the buffer.
    - If Ignored, do nothing.
    - Otherwise, the label cannot be normalized.
- Repeat until all the input is consumed.
- If the buffer is nonempty, emit a final Text token with its contents.
- Return the list of emitted tokens.

Examples:
- "xyz👨🏻" [78 79 7A 1F468 1F3FB] → Text["xyz"] + Emoji["👨🏻"]
- "A💩︎︎b" [41 FE0E 1F4A9 FE0E FE0E 62] → Text["a"] + Emoji["💩️"] + Text["b"]
- "a™️" [61 2122 FE0F] → Text["atm"]
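The tokenization loop can be sketched in Python as follows. Everything here is illustrative: `match_len` and `tokenize` are hypothetical names, the tables are tiny stand-ins for the data in spec.json, and the linear scan over emoji is chosen for brevity where a real implementation would more likely use a trie.

```python
FE0F = 0xFE0F


def match_len(seq, cps, pos):
    """Count input codepoints consumed if seq matches at pos, else 0.

    FE0F in seq may be absent from the input; any other mismatch fails.
    """
    i = pos
    for cp in seq:
        if i < len(cps) and cps[i] == cp:
            i += 1
        elif cp != FE0F:
            return 0
    return i - pos


def tokenize(cps, emoji, valid, mapped, ignored):
    """Split codepoints into ("text", [...]) and ("emoji", [...]) tokens."""
    tokens, buf, pos = [], [], 0
    while pos < len(cps):
        # Pick the emoji sequence that consumes the most input at pos.
        best = max(emoji, key=lambda seq: match_len(seq, cps, pos), default=None)
        n = match_len(best, cps, pos) if best else 0
        if n:
            if buf:
                tokens.append(("text", buf))
                buf = []
            tokens.append(("emoji", list(best)))  # fully-qualified form
            pos += n
            continue
        cp = cps[pos]
        pos += 1
        if cp in valid:
            buf.append(cp)
        elif cp in mapped:
            buf.extend(mapped[cp])
        elif cp in ignored:
            pass
        else:
            raise ValueError(f"disallowed character {cp:X}")
    if buf:
        tokens.append(("text", buf))
    return tokens


# Toy tables (hypothetical, far smaller than spec.json).
EMOJI = [(0x1F4A9, FE0F), (0x1F468, FE0F), (0x1F468, 0x1F3FB)]
VALID = set(range(0x61, 0x7B))  # a-z
MAPPED = {0x41: [0x61], 0x2122: [0x74, 0x6D]}  # A -> a, TM sign -> tm
IGNORED = {0xFE0E, FE0F}
```

With these tables, `tokenize([0x41, 0xFE0E, 0x1F4A9, 0xFE0E, 0xFE0E, 0x62], EMOJI, VALID, MAPPED, IGNORED)` reproduces the second example above: a text token, a fully-qualified emoji token, and a second text token.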
Validate
Given a list of Emoji and Text tokens, determine if the label is valid and return the Label Type.  If any assertion fails, the name cannot be normalized.
- If only Emoji tokens:
  - Return "Emoji"
- If a single Text token and every character is ASCII (00..7F):
  - 5F (_) LOW LINE can only occur at the start.
    - Must match /^_*[^_]*$/
    - Examples: "___" and "__abc" are valid, "abc__" and "_abc_" are invalid.
  - The 3rd and 4th characters must not both be 2D (-) HYPHEN-MINUS.
    - Must not match /^..--/
    - Examples: "ab-c" and "---a" are valid, "xn--" and "----" are invalid.
  - Return "ASCII"
    - The label is free of Fenced and Combining Mark characters, and not confusable.
- Concatenate all the tokens together.
  - 5F (_) LOW LINE can only occur at the start.
  - The first and last characters cannot be Fenced.
    - Examples: "a’s" and "a・a" are valid, "’85" and "joneses’" and "・a・" are invalid.
  - Fenced characters cannot be contiguous.
    - Examples: "a・a’s" is valid, "6’0’’" and "a・・a" are invalid.
- The first character of every Text token must not be a Combining Mark.
- Concatenate the Text tokens together.
- Find the first Group that contains every text character:
  - If no group is found, the label cannot be normalized.
- If the group is not CM Whitelisted:
  - Apply NFD to the concatenated text characters.
  - For every contiguous sequence of NSM characters:
    - Each character must be unique.
      - Example: "x̀̀" [78 300 300] has (2) grave accents.
    - The number of NSM characters cannot exceed Maximum NSM (4).
      - Example: "إؐؑؒؓؔ" [625 610 611 612 613 614] has (6) NSM.
- Wholes — check if text characters form a confusable.
- The label is valid.
  - Return the name of the group as the Label Type.
 
Examples:
- Emoji["💩️"] + Emoji["💩️"] → "Emoji"
- Text["abc$123"] → "ASCII"
- Emoji["🚀️"] + Text["à"] → "Latin"
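Several of the Validate rules are simple string predicates. The sketch below shows one possible way to express the underscore, hyphen, Fenced, and NSM rules in Python. The function names and the small `FENCED` and `NSM` sets are hypothetical placeholders for the spec.json data, and `unicodedata` is used for NFD on the assumption that its bundled Unicode version is close enough for illustration.

```python
import re
import unicodedata


def underscore_ok(label):
    """5F (_) may only appear as a leading run."""
    return re.fullmatch(r"_*[^_]*", label) is not None


def hyphen_ok(label):
    """The 3rd and 4th characters must not both be 2D (-)."""
    return re.match(r"..--", label) is None


def fenced_ok(label, fenced):
    """Fenced characters cannot be first, last, or adjacent."""
    if not label:
        return True
    if label[0] in fenced or label[-1] in fenced:
        return False
    return not any(a in fenced and b in fenced for a, b in zip(label, label[1:]))


def nsm_ok(text, nsm, nsm_max=4):
    """After NFD, each NSM run has no repeats and at most nsm_max marks."""
    run = []
    for ch in unicodedata.normalize("NFD", text):
        if ch not in nsm:
            run = []
            continue
        if ch in run or len(run) == nsm_max:
            return False
        run.append(ch)
    return True


# Toy sets (hypothetical subsets of the Fenced and NSM data).
FENCED = {"\u2019", "\u30FB"}
NSM = {chr(cp) for cp in (0x300, 0x301, 0x610, 0x611, 0x612, 0x613, 0x614, 0x655)}
```

For the Arabic example above, NFD expands 625 into 627 655, which is why the run is counted as six marks rather than five.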
Wholes
A label is whole-script confusable if a similar-looking valid label can be constructed using one alternative character from a different group. The complete list of Whole Confusables can be found in spec.json. Each Whole Confusable has a set of non-confusing characters ("valid") and a set of confusing characters ("confused"), where each character may be a member of one or more groups.
Example: Whole Confusable for "g"
| Type | Code | Form | Character | Latn | Hani | Japn | Kore | Armn | Cher | Lisu |
|---|---|---|---|---|---|---|---|---|---|---|
| valid | 67 | g | LATIN SMALL LETTER G | A | A | A | A | | | |
| confused | 581 | ց | ARMENIAN SMALL LETTER CO | | | | | B | | |
| confused | 13C0 | Ꮐ | CHEROKEE LETTER NAH | | | | | | C | |
| confused | 13F3 | Ᏻ | CHEROKEE LETTER YU | | | | | | C | |
| confused | A4D6 | ꓖ | LISU LETTER GA | | | | | | | D |
- Allocate an empty character buffer.
- Start with the set of ALL groups.
- For each unique character in the label:
  - If the character is Confused (a member of a Whole Confusable):
    - Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.
    - If no groups remain, the label is not confusable.
    - The Confusable Extent is the fully-connected graph formed from different groups with the same confusable and different confusables of the same group.
      - The mapping from Confused to Confusable Extent can be precomputed.
    - In the table above, Whole Confusable for "g", the rectangle formed by each capital letter is a Confusable Extent:
      - A is [g] ⊗ [Latin, Han, Japanese, Korean]
      - B is [ց] ⊗ [Armn]
      - C is [Ꮐ, Ᏻ] ⊗ [Cher]
      - D is [ꓖ] ⊗ [Lisu]
    - A Confusable Extent can span multiple characters and multiple groups. Consider the (incomplete) Whole Confusable for "o":
      - 6F (o) LATIN SMALL LETTER O → Latin, Han, Japanese, and Korean
      - 3007 (〇) IDEOGRAPHIC NUMBER ZERO → Han, Japanese, Korean, and Bopomofo
      - Confusable Extent is [o, 〇] ⊗ [Latin, Han, Japanese, Korean, Bopomofo]
  - If the character is Unique, the label is not confusable.
    - This set can be precomputed from characters that appear in exactly one group and are not Confused.
  - Otherwise:
    - Append the character to the buffer.
- If any Confused characters were found:
  - If there are no buffered characters, the label is confusable.
  - If any of the remaining groups contain all of the buffered characters, the label is confusable.
  - Example: "0х" [30 445]
    - 30 (0) DIGIT ZERO
      - Not Confused or Unique, add to buffer.
    - 445 (х) CYRILLIC SMALL LETTER HA
      - Confusable Extent is [х, 4B3 (ҳ) CYRILLIC SMALL LETTER HA WITH DESCENDER] ⊗ [Cyrillic]
      - Whole Confusable excluding the extent is [78 (x) LATIN SMALL LETTER X, ...] → [Latin, ...]
      - Remaining groups: ALL ∩ [Latin, ...] → [Latin, ...]
    - There was (1) buffered character:
      - Latin also contains 30 → "0x" [30 78]
    - The label is confusable.
- The label is not confusable.
A label composed of confusable characters isn't necessarily confusable.
- Example: "тӕ" [442 4D5]
  - 442 (т) CYRILLIC SMALL LETTER TE
    - Confusable Extent is [т] ⊗ [Cyrillic]
    - Whole Confusable excluding the extent is [3C4 (τ) GREEK SMALL LETTER TAU] → [Greek]
    - Remaining groups: ALL ∩ [Greek] → [Greek]
  - 4D5 (ӕ) CYRILLIC SMALL LIGATURE A IE
    - Confusable Extent is [ӕ] ⊗ [Cyrillic]
    - Whole Confusable excluding the extent is [E6 (æ) LATIN SMALL LETTER AE] → [Latin]
    - Remaining groups: [Greek] ∩ [Latin] → ∅
  - No groups remain so the label is not confusable.
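The Wholes procedure can be sketched in Python with precomputed lookups. This is only an illustration under stated assumptions: `is_whole_confusable` is a hypothetical name, `complement` stands for a precomputed map from each Confused character to the groups that hold its look-alikes outside its Confusable Extent, and the toy data covers just the two worked examples.

```python
def is_whole_confusable(label, groups, complement, unique):
    """Return True if the label is whole-script confusable.

    groups:     group name -> set of characters in that group
    complement: Confused char -> group names left after excluding its extent
    unique:     characters in exactly one group that are not Confused
    """
    remaining = set(groups)  # start with ALL groups
    buffer = []
    found_confused = False
    for ch in dict.fromkeys(label):  # unique characters, original order
        if ch in complement:
            found_confused = True
            remaining &= complement[ch]
            if not remaining:
                return False
        elif ch in unique:
            return False
        else:
            buffer.append(ch)
    if found_confused:
        # An empty buffer makes all(...) true, so the label is confusable.
        return any(all(ch in groups[g] for ch in buffer) for g in remaining)
    return False


# Toy data (hypothetical) for the two worked examples.
GROUPS = {
    "Latin": set("0x\u00E6"),               # 0, x, ae
    "Cyrillic": set("0\u0445\u0442\u04D5"),  # 0, ha, te, ligature a ie
    "Greek": set("0\u03C4"),                # 0, tau
}
COMPLEMENT = {"\u0445": {"Latin"}, "\u0442": {"Greek"}, "\u04D5": {"Latin"}}
UNIQUE = set()
```

With this data, `is_whole_confusable("0\u0445", GROUPS, COMPLEMENT, UNIQUE)` is true because Latin also contains the buffered "0", while `"\u0442\u04D5"` is not confusable because no groups remain.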
 
Split
- Partition a name into labels, separated by 2E (.) FULL STOP, and return the resulting array.
  - Example: "abc.123.eth" → ["abc", "123", "eth"]
- The empty string is 0-labels: "" → []
Join
- Assemble an array of labels into a name, inserting 2E (.) FULL STOP between each label, and return the resulting string.
  - Example: ["abc", "123", "eth"] → "abc.123.eth"
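Split and Join map closely onto built-in string operations. A minimal Python sketch follows, with hypothetical function names; the only subtlety is the empty string, which Python's `str.split` would otherwise turn into one empty label.

```python
def split(name: str) -> list[str]:
    """Partition a name into labels on 2E (.) FULL STOP."""
    # "".split(".") returns [""], so the empty name is handled explicitly.
    return name.split(".") if name else []


def join(labels: list[str]) -> str:
    """Assemble labels into a name separated by 2E (.) FULL STOP."""
    return ".".join(labels)
```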
Description of spec.json
- Groups ("groups") — groups of characters that can constitute a label
  - "name" — ASCII name of the group (or abbreviation if Restricted)
    - Examples: Latin, Japanese, Egyp
  - Restricted ("restricted") — true if Excluded or Limited-Use script
    - Examples: Latin → false, Egyp → true
  - "primary" — subset of characters that define the group
    - Examples: "a" → Latin, "あ" → Japanese, "𓀀" → Egyp
  - "secondary" — subset of characters included with the group
    - Example: "0" → Common but mixable with Latin
  - CM Whitelist(ed) ("cm") — (optional) set of allowed compound sequences in NFC
    - Each compound sequence is a character followed by one or more Combining Marks.
      - Example: à̀̀ → E0 300 300
    - Currently, every group that is CM Whitelisted has zero compound sequences.
    - CM Whitelisted is effectively true if "cm" is [], otherwise false.
- Ignored ("ignored") — characters that are ignored during normalization
  - Example: 34F (�) COMBINING GRAPHEME JOINER
- Mapped ("mapped") — characters that are mapped to a sequence of valid characters
  - Example: 41 (A) LATIN CAPITAL LETTER A → [61 (a) LATIN SMALL LETTER A]
  - Example: 2165 (Ⅵ) ROMAN NUMERAL SIX → [76 (v) LATIN SMALL LETTER V, 69 (i) LATIN SMALL LETTER I]
- Whole Confusable ("wholes") — groups of characters that look similar
  - "valid" — subset of confusable characters that are allowed
    - Example: 34 (4) DIGIT FOUR
  - Confused ("confused") — subset of confusable characters that confuse
    - Example: 13CE (Ꮞ) CHEROKEE LETTER SE
- Fenced ("fenced") — characters that cannot be first, last, or contiguous
  - Example: 2044 (⁄) FRACTION SLASH
- Emoji Sequence(s) ("emoji") — valid emoji sequences
  - Example: 👨💻 [1F468 200D 1F4BB] man technologist
- Combining Marks / CM ("cm") — characters that are Combining Marks
- Non-spacing Marks / NSM ("nsm") — valid subset of CM with general category ("Mn"or"Me")
- Maximum NSM ("nsm_max") — maximum sequence length of unique NSM
- Should Escape ("escape") — characters that shouldn't be printed
- NFC Check ("nfc_check") — valid subset of characters that may require NFC
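As an illustration of how these components might be consumed, the sketch below derives the Valid set described in Tokenize (the union of all group characters plus their NFD decompositions). The dictionary is a hand-written stand-in: the key names come from the list above, but storing characters as integer codepoints in flat lists is an assumption about the file layout, and `valid_set` is a hypothetical helper.

```python
import unicodedata

# Hypothetical miniature of spec.json; the real file is far larger and
# its exact value encoding is assumed here, not quoted.
SPEC = {
    "groups": [
        {"name": "Latin", "restricted": False,
         "primary": [0x61, 0xE0], "secondary": [0x30]},
        {"name": "Egyp", "restricted": True,
         "primary": [0x13000], "secondary": []},
    ],
    "nsm_max": 4,
}


def valid_set(spec):
    """Union of every group's characters and their NFD decompositions."""
    chars = set()
    for group in spec["groups"]:
        chars.update(group["primary"])
        chars.update(group["secondary"])
    decomposed = {
        ord(c) for cp in chars for c in unicodedata.normalize("NFD", chr(cp))
    }
    return chars | decomposed


VALID = valid_set(SPEC)
```

Here E0 contributes 61 and 300 through its decomposition, so a decomposed "à" passes the character-type check before NFC recomposes it.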
Description of nf.json
- "decomp" — mapping from a composed character to a sequence of (partially)-decomposed characters
  - UnicodeData.txt where Decomposition_Mapping exists and does not have a formatting tag
- "exclusions" — set of characters for which the "decomp" mapping is not applied when forming a composition
- "ranks" — sets of characters with increasing Canonical_Combining_Class
  - UnicodeData.txt grouped by Canonical_Combining_Class
  - Class 0 is not included
- "qc" — set of characters with property NFC_QC of value N or M
  - DerivedNormalizationProps.txt
  - NFC Check (from spec.json) is a subset of this set
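The roles of "decomp", "exclusions", and "ranks" can be observed with any conforming NFC/NFD implementation. The Python sketch below uses the standard `unicodedata` module rather than nf.json, so it reflects whatever Unicode version Python bundles (see `unicodedata.unidata_version`), which is an assumption relative to this ENSIP. The helper names are hypothetical.

```python
import unicodedata


def nfc(cps):
    """Compose a list of codepoints (NFC)."""
    return [ord(c) for c in unicodedata.normalize("NFC", "".join(map(chr, cps)))]


def nfd(cps):
    """Decompose a list of codepoints (NFD)."""
    return [ord(c) for c in unicodedata.normalize("NFD", "".join(map(chr, cps)))]


# Decomposition and recomposition of "à".
a_grave_nfd = nfd([0xE0])        # [0x61, 0x300]
a_grave_nfc = nfc([0x61, 0x300])  # [0xE0]

# Composition exclusion: 958 decomposes but is never recomposed.
qa_nfc = nfc([0x958])            # [0x915, 0x93C]

# Canonical ordering: 323 (class 220) sorts before 301 (class 230).
reordered = nfd([0x61, 0x301, 0x323])  # [0x61, 0x323, 0x301]
```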
 
Derivation
- IDNA 2003
  - UseSTD3ASCIIRules is true
  - VerifyDnsLength is false
  - Transitional_Processing is false
    - The following deviations are valid:
      - DF (ß) LATIN SMALL LETTER SHARP S
      - 3C2 (ς) GREEK SMALL LETTER FINAL SIGMA
  - CheckHyphens is false (WHATWG URL Spec § 3.3)
  - CheckBidi is false
  - ContextJ:
    - 200C (�) ZERO WIDTH NON-JOINER (ZWNJ) is disallowed everywhere.
    - 200D (�) ZERO WIDTH JOINER (ZWJ) is only allowed in emoji sequences.
  - ContextO:
    - B7 (·) MIDDLE DOT is disallowed.
    - 375 (͵) GREEK LOWER NUMERAL SIGN is disallowed.
    - 5F3 (׳) HEBREW PUNCTUATION GERESH and 5F4 (״) HEBREW PUNCTUATION GERSHAYIM are Greek.
    - 30FB (・) KATAKANA MIDDLE DOT is Fenced and Han, Japanese, Korean, and Bopomofo.
    - Some Extended Arabic Numerals are mapped:
      - 6F0 (۰) → 660 (٠) ARABIC-INDIC DIGIT ZERO
      - 6F1 (۱) → 661 (١) ARABIC-INDIC DIGIT ONE
      - 6F2 (۲) → 662 (٢) ARABIC-INDIC DIGIT TWO
      - 6F3 (۳) → 663 (٣) ARABIC-INDIC DIGIT THREE
      - 6F7 (۷) → 667 (٧) ARABIC-INDIC DIGIT SEVEN
      - 6F8 (۸) → 668 (٨) ARABIC-INDIC DIGIT EIGHT
      - 6F9 (۹) → 669 (٩) ARABIC-INDIC DIGIT NINE
- Punycode is not decoded.
- The following ASCII characters are valid:
  - 24 ($) DOLLAR SIGN
  - 5F (_) LOW LINE with restrictions
- The only label separator is 2E (.) FULL STOP
  - No character maps to this character.
    - This simplifies name detection in unstructured text.
  - The following alternatives are disallowed:
    - 3002 (。) IDEOGRAPHIC FULL STOP
    - FF0E (．) FULLWIDTH FULL STOP
    - FF61 (｡) HALFWIDTH IDEOGRAPHIC FULL STOP
- Many characters are disallowed for various reasons:
  - Nearly all punctuation is disallowed.
    - Example: 589 (։) ARMENIAN FULL STOP
  - All parentheses and brackets are disallowed.
    - Example: 2997 (⦗) LEFT BLACK TORTOISE SHELL BRACKET
  - Nearly all vocalization annotations are disallowed.
    - Example: 294 (ʔ) LATIN LETTER GLOTTAL STOP
  - Obsolete, deprecated, and ancient characters are disallowed.
    - Example: 463 (ѣ) CYRILLIC SMALL LETTER YAT
  - Combining, modifying, reversed, flipped, turned, and partial variations are disallowed.
    - Example: 218A (↊) TURNED DIGIT TWO
  - When multiple weights of the same character exist, the variant closest to "heavy" is selected and the rest disallowed.
    - Example: 🞡🞢🞣🞤✚🞥🞦🞧 → 271A (✚) HEAVY GREEK CROSS
    - This occasionally selects an emoji.
      - Example: ✔️ or 2714 (✔︎) HEAVY CHECK MARK is selected instead of 2713 (✓) CHECK MARK
  - Many visually confusable characters are disallowed.
    - Example: 131 (ı) LATIN SMALL LETTER DOTLESS I
  - Many ligatures, n-graphs, and n-grams are disallowed.
    - Example: A74F (ꝏ) LATIN SMALL LETTER OO
  - Many esoteric characters are disallowed.
    - Example: 2376 (⍶) APL FUNCTIONAL SYMBOL ALPHA UNDERBAR
- Many hyphen-like characters are mapped to 2D (-) HYPHEN-MINUS:
  - 2010 (‐) HYPHEN
  - 2011 (‑) NON-BREAKING HYPHEN
  - 2012 (‒) FIGURE DASH
  - 2013 (–) EN DASH
  - 2014 (—) EM DASH
  - 2015 (―) HORIZONTAL BAR
  - 2043 (⁃) HYPHEN BULLET
  - 2212 (−) MINUS SIGN
  - 23AF (⎯) HORIZONTAL LINE EXTENSION
  - 23E4 (⏤) STRAIGHTNESS
  - FE58 (﹘) SMALL EM DASH
  - 2E3A (⸺) TWO-EM DASH → "--"
  - 2E3B (⸻) THREE-EM DASH → "---"
- Characters are assigned to Groups according to Unicode Script_Extensions.
- Groups may contain multiple scripts:
  - Only Latin, Greek, Cyrillic, Han, Japanese, and Korean have access to Common characters.
  - Latin, Greek, Cyrillic, Han, Japanese, Korean, and Bopomofo only permit specific Combining Mark sequences.
  - Han, Japanese, and Korean have access to a-z.
  - Restricted groups are always single-script.
  - Unicode augmented script sets
- Scripts Braille, Linear A, Linear B, and Signwriting are disallowed.
- 27 (') APOSTROPHE is mapped to 2019 (’) RIGHT SINGLE QUOTATION MARK for convenience.
- Ethereum symbol (39E (Ξ) GREEK CAPITAL LETTER XI) is case-folded and Common.
- Emoji:
  - All emoji are fully-qualified.
  - Digits (0-9) are not emoji.
  - Emoji mapped to non-emoji by IDNA cannot be used as emoji.
  - Emoji disallowed by IDNA with default text-presentation are disabled:
    - 203C (‼️) double exclamation mark
    - 2049 (⁉️) exclamation question mark
  - Remaining emoji characters are marked as disallowed (for text processing).
  - All RGI_Emoji_ZWJ_Sequence are enabled.
  - All Emoji_Keycap_Sequence are enabled.
  - All RGI_Emoji_Tag_Sequence are enabled.
  - All RGI_Emoji_Modifier_Sequence are enabled.
  - All RGI_Emoji_Flag_Sequence are enabled.
  - Basic_Emoji of the form [X FE0F] are enabled.
  - Emoji with default emoji-presentation are enabled as [X FE0F].
  - Remaining single-character emoji are enabled as [X FE0F] (explicit emoji-presentation).
  - All singular Skin-color Modifiers are disabled.
  - All singular Regional Indicators are disabled.
  - Blacklisted emoji are disabled.
  - Whitelisted emoji are enabled.
- Confusables:
  - Nearly all Unicode Confusables
  - Emoji are not confusable.
  - ASCII confusables are case-folded.
    - Example: 61 (a) LATIN SMALL LETTER A confuses with 13AA (Ꭺ) CHEROKEE LETTER GO
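The hyphen, apostrophe, and Extended Arabic Numeral mappings listed in this section can be pictured as entries in a single lookup table. The Python sketch below builds such a table from the values given above; the names `MAPPED` and `apply_mapped` are hypothetical, and the pass-through for unlisted codepoints is a simplification (a real implementation would also check Valid and Ignored).

```python
# Hyphen-like characters that map to a single 2D (-) HYPHEN-MINUS.
HYPHEN_LIKE = [0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015,
               0x2043, 0x2212, 0x23AF, 0x23E4, 0xFE58]

MAPPED = {cp: [0x2D] for cp in HYPHEN_LIKE}
MAPPED[0x2E3A] = [0x2D] * 2   # TWO-EM DASH -> "--"
MAPPED[0x2E3B] = [0x2D] * 3   # THREE-EM DASH -> "---"
MAPPED[0x27] = [0x2019]       # APOSTROPHE -> RIGHT SINGLE QUOTATION MARK

# Extended Arabic Numerals that are mapped (4, 5, and 6 are not listed).
for digit in (0, 1, 2, 3, 7, 8, 9):
    MAPPED[0x6F0 + digit] = [0x660 + digit]


def apply_mapped(cps):
    """Replace each mapped codepoint; unlisted codepoints pass through."""
    out = []
    for cp in cps:
        out.extend(MAPPED.get(cp, [cp]))
    return out
```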
 
Backwards Compatibility
- 99% of names are still valid.
- Preserves as much Unicode IDNA and WHATWG URL compatibility as possible.
- Only valid emoji sequences are permitted.
Security Considerations
- Unicode presentation may vary between applications and devices.
  - Unicode text is ultimately subject to font-styling and display context.
  - Unsupported characters (�) may appear unremarkable.
  - Normalized single-character emoji sequences do not retain their explicit emoji-presentation and may display with text or emoji presentation styling.
    - ❤︎ — text-presentation and default-color
    - ❤︎ — text-presentation and green-color
    - ❤️ — emoji-presentation and green-color
  - Unsupported emoji sequences with ZWJ may appear indistinguishable from those without ZWJ.
    - 💩💩 [1F4A9 1F4A9]
    - 💩💩 [1F4A9 200D 1F4A9] → error: Disallowed character
- Names composed of labels with varying bidi properties may appear differently depending on context.
  - Normalization does not enforce single-directional names.
  - Names may be composed of labels of different directions but normalized labels are never bidirectional.
    - [LTR].[RTL] bahrain.مصر
    - [LTR+RTL] bahrainمصر → error: Illegal mixture: Latin + Arabic
- Not all normalized names are visually unambiguous.
- This ENSIP only addresses single-character confusables.
  - There exist confusable multi-character sequences:
    - "ஶ்ரீ" [BB6 BCD BB0 BC0]
    - "ஸ்ரீ" [BB8 BCD BB0 BC0]
  - There exist confusable emoji sequences:
    - 🚴 [1F6B4] and 🚴🏻 [1F6B4 1F3FB]
    - 🇺🇸 [1F1FA 1F1F8] and 🇺🇲 [1F1FA 1F1F2]
    - ♥ [2665] BLACK HEART SUIT and ❤ [2764] HEAVY BLACK HEART
Copyright
Copyright and related rights waived via CC0.
Appendix: Reference Specifications
- EIP-137: Ethereum Domain Name Service
- ENSIP-1: ENS
- UAX-15: Normalization Forms
- UAX-24: Script Property
- UAX-29: Text Segmentation
- UAX-31: Identifier and Pattern Syntax
- UTS-39: Security Mechanisms
- UAX-44: Character Database
- UTS-46: IDNA Compatibility Processing
- UTS-51: Emoji
- RFC-3492: Punycode
- RFC-5891: IDNA: Protocol
- RFC-5892: The Unicode Code Points and IDNA
- Unicode CLDR
- WHATWG URL: IDNA
Appendix: Additional Resources
- Supported Groups
- Supported Emoji
- Additional Disallowed Characters
- Ignored Characters
- Should Escape Characters
- Combining Marks
- Non-spacing Marks
- Fenced Characters
- NFC Quick Check
Appendix: Validation Tests
A list of validation tests is provided with the following interpretation:
- Already Normalized: {name: "a"} → normalize("a") is "a"
- Need Normalization: {name: "A", norm: "a"} → normalize("A") is "a"
- Expect Error: {name: "@", error: true} → normalize("@") throws
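A test harness following this interpretation could look like the Python sketch below. The names `run_tests` and `toy_normalize` are hypothetical, and `toy_normalize` is a deliberately trivial stand-in (ASCII letters and digits only) used so the harness can be exercised; it is not an implementation of this ENSIP.

```python
def run_tests(tests, normalize):
    """Return the test records that the given normalize function fails."""
    failures = []
    for test in tests:
        name = test["name"]
        expected = test.get("norm", name)  # absent "norm" means unchanged
        try:
            actual = normalize(name)
        except Exception:
            if not test.get("error"):
                failures.append(test)
            continue
        if test.get("error") or actual != expected:
            failures.append(test)
    return failures


def toy_normalize(name):
    """Stand-in normalizer: lowercase ASCII alphanumerics, reject the rest."""
    if not all(ch.isascii() and ch.isalnum() for ch in name):
        raise ValueError("disallowed character")
    return name.lower()


TESTS = [
    {"name": "a"},
    {"name": "A", "norm": "a"},
    {"name": "@", "error": True},
]
failures = run_tests(TESTS, toy_normalize)
```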
Annex: Beautification
Follow the normalization algorithm, except:
- Do not strip FE0F from Emoji tokens.
- Replace 3BE (ξ) GREEK SMALL LETTER XI with 39E (Ξ) GREEK CAPITAL LETTER XI if the label isn't Greek.
- Example: normalize("‐Ξ1️⃣") [2010 39E 31 FE0F 20E3] is "-ξ1⃣" [2D 3BE 31 20E3]
- Example: beautify("-ξ1⃣") [2D 3BE 31 20E3] is "-Ξ1️⃣" [2D 39E 31 FE0F 20E3]
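The two differences from normalization can be sketched on already-tokenized input. In the Python below, `beautify_tokens` is a hypothetical helper that assumes tokens are `(kind, codepoints)` tuples with text already in NFC and emoji in fully-qualified form, and that the Label Type has been obtained from validation.

```python
XI_SMALL = 0x3BE    # GREEK SMALL LETTER XI
XI_CAPITAL = 0x39E  # GREEK CAPITAL LETTER XI


def beautify_tokens(tokens, label_type):
    """Concatenate tokens, keeping FE0F and restoring capital Xi."""
    out = []
    for kind, cps in tokens:
        if kind == "text" and label_type != "Greek":
            cps = [XI_CAPITAL if cp == XI_SMALL else cp for cp in cps]
        out.extend(cps)  # emoji tokens keep their FE0F
    return "".join(map(chr, out))


# Tokens for "-ξ1⃣": Text[2D 3BE] + Emoji[31 FE0F 20E3]; the label type
# "Latin" is assumed here for illustration.
tokens = [("text", [0x2D, 0x3BE]), ("emoji", [0x31, 0xFE0F, 0x20E3])]
pretty = beautify_tokens(tokens, "Latin")
```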