ENSIP-15: Normalization Standard
Author | Andrew Raffensperger <[email protected]> |
Status | Draft |
Created | 2023-04-03 |
This ENSIP standardizes the Ethereum Name Service (ENS) name normalization process outlined in ENSIP-1 § Name Syntax.
- ENSIP-1 does not state the version of Unicode.
- ENSIP-1 implies but does not state an explicit flavor of IDNA processing.
- Validation tests are needed to ensure implementation compliance.
- The success of ENS has encouraged spoofing via the following techniques:
  - 1. Insertion of zero-width characters.
  - 2. Using names which normalize differently between algorithms.
  - 3. Using names which appear differently between applications and devices.
  - 4. Substitution of confusable (look-alike) characters.
  - 5. Mixing incompatible scripts.
- Unicode version 15.0.0
- Normalization is a living specification and should use the latest stable version of Unicode.
- A string is a sequence of Unicode codepoints.
  - Example: "abc" is 61 62 63
- An Emoji Sequence is the preferred form of an emoji, resulting from input that was tokenized into an Emoji token.
  - Example: 💩︎︎ [1F4A9] → Emoji[1F4A9 FE0F]
  - 1F4A9 FE0F is the Emoji Sequence.
- Not all Unicode emoji are valid.
  - ‼ [203C] double exclamation mark → error: Disallowed character
  - 🈁 [1F201] Japanese “here” button → Text["ココ"]
- An Emoji Sequence may contain characters that are disallowed:
  - 👩❤️👨 [1F469 200D 2764 FE0F 200D 1F468] couple with heart: woman, man (contains ZWJ)
  - #️⃣ [23 FE0F 20E3] keycap: # (contains 23 (#))
  - 🏴 [1F3F4 E0067 E0062 E0065 E006E E0067 E007F] (contains E00XX)
- An Emoji Sequence may contain other emoji:
  - Example: ❤️ [2764 FE0F] red heart is a substring of ❤️🔥 [2764 FE0F 200D 1F525] heart on fire
- Default: ❤ [2764]
- Text: ❤︎ [2764 FE0E]
- Emoji: ❤️ [2764 FE0F]
- All Emoji Sequences have explicit emoji-presentation.
- The convention of ignoring presentation is difficult to change because:
  - Presentation characters (FE0F and FE0E) are Ignored
  - Registration hashes are immutable
- It is idempotent: applying normalization multiple times produces the same result.
- For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
- No string transformations (like case-folding) should be applied.
- 1. Tokenize the label into Text and Emoji tokens.
  - If there are no tokens, the label cannot be normalized.
- 2. Apply NFC to each Text token.
  - Example: Text["à"] → [61 300] → [E0] → Text["à"]
- 3. Strip FE0F from each Emoji token.
- 4. Validate the tokens and determine the Label Type.
  - The Label Type and Restricted state may be presented to the user for additional security.
- 5. Concatenate the tokens together.
  - Return the normalized label.
Examples:
- 1. "_$A" [5F 24 41] → "_$a" [5F 24 61] (ASCII)
- 2. "E︎̃" [45 FE0E 303] → "ẽ" [1EBD] (Latin)
- 3. "𓆏🐸" [1318F 1F438] → "𓆏🐸" [1318F 1F438] (Restricted: Egyp)
- 4. "nı̇ck" [6E 131 307 63 6B] → error: Disallowed character
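The NFC, FE0F-stripping, and concatenation steps above can be sketched in Python with the standard `unicodedata` module. The `(kind, codepoints)` token representation and the name `assemble_label` are assumptions made for this illustration. Tokenization and validation are deliberately left out, so this is not a complete normalizer.

```python
import unicodedata

FE0F = "\uFE0F"  # emoji presentation selector


def assemble_label(tokens):
    """Apply NFC to Text tokens, strip FE0F from Emoji tokens, and concatenate.

    `tokens` is a list of (kind, codepoints) pairs where kind is "text" or
    "emoji". This representation is hypothetical, not taken from the standard.
    """
    if not tokens:
        raise ValueError("empty label: no tokens")
    parts = []
    for kind, cps in tokens:
        s = "".join(chr(cp) for cp in cps)
        if kind == "text":
            parts.append(unicodedata.normalize("NFC", s))
        else:
            parts.append(s.replace(FE0F, ""))
    return "".join(parts)


# Text[61 300] composes to E0, and Emoji[1F4A9 FE0F] loses its FE0F.
label = assemble_label([("text", [0x61, 0x300]), ("emoji", [0x1F4A9, 0xFE0F])])
print([f"{ord(ch):X}" for ch in label])
```

Python's `unicodedata` tables follow the Unicode version bundled with the interpreter, which may differ from the version this standard targets.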
Convert a label into a list of Text and Emoji tokens, each with a payload of codepoints. The complete list of character types and emoji sequences can be found in spec.json.
- 1. Allocate an empty codepoint buffer.
- 2. Find the longest Emoji Sequence that matches the remaining input.
  - Example: 👨🏻💻 [1F468 1F3FB 200D 1F4BB]
    - Match (1): 👨️ [1F468] man
    - Match (2): 👨🏻 [1F468 1F3FB] man: light skin tone
    - Match (4): 👨🏻💻 [1F468 1F3FB 200D 1F4BB] man technologist: light skin tone (longest match)
  - FE0F is optional from the input during matching.
    - Example: 👨❤️👨 [1F468 200D 2764 FE0F 200D 1F468]
      - Match: 1F468 200D 2764 FE0F 200D 1F468 (fully-qualified)
      - Match: 1F468 200D 2764 200D 1F468 (missing FE0F)
      - No match: 1F468 200D 2764 FE0F FE0F 200D 1F468 (has (2) FE0F)
  - This is equivalent to /^(emoji1|emoji2|...)/ where \uFE0F is replaced with \uFE0F? and * is replaced with \x2A.
- 3. If an Emoji Sequence is found:
  - If the buffer is nonempty, emit a Text token, and clear the buffer.
  - Emit an Emoji token with the fully-qualified matching sequence.
  - Remove the matched sequence from the input.
- 4. Otherwise:
  - 1. Remove the leading codepoint from the input.
  - 2. Determine the character type:
    - If Valid, append the codepoint to the buffer.
      - This set can be precomputed from the union of characters in all groups and their NFD decompositions.
    - If Mapped, append the corresponding mapped codepoint(s) to the buffer.
    - If Ignored, do nothing.
    - Otherwise, the label cannot be normalized.
- 5. Repeat until all the input is consumed.
- 6. If the buffer is nonempty, emit a final Text token with its contents.
  - Return the list of emitted tokens.
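The FE0F-optional matching in step 2 can be sketched with Python's `re` module. The sequence list below is a small illustrative subset (the real list lives in spec.json), and `compile_emoji`, `longest_emoji_match`, and `cps` are hypothetical helper names. Instead of relying on the order of alternatives in one large regex, this sketch tries every sequence and keeps the longest match.

```python
import re

# Illustrative subset of fully-qualified emoji sequences (an assumption,
# not the contents of spec.json).
EMOJI_SEQUENCES = [
    [0x1F468, 0xFE0F],
    [0x1F468, 0x1F3FB],
    [0x1F468, 0x1F3FB, 0x200D, 0x1F4BB],
    [0x1F468, 0x200D, 0x2764, 0xFE0F, 0x200D, 0x1F468],
    [0x2A, 0xFE0F, 0x20E3],  # keycap: * (must be escaped inside a regex)
]


def compile_emoji(seq):
    """Build a pattern for one sequence in which every FE0F is optional."""
    parts = []
    for cp in seq:
        parts.append(re.escape(chr(cp)) + ("?" if cp == 0xFE0F else ""))
    return re.compile("".join(parts))


PATTERNS = [(seq, compile_emoji(seq)) for seq in EMOJI_SEQUENCES]


def longest_emoji_match(text, pos=0):
    """Return (fully_qualified_sequence, consumed_codepoints) or None."""
    best = None
    for seq, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() - pos > best[1]):
            best = (seq, m.end() - pos)
    return best


def cps(*values):
    """Build a string from codepoints, for readable test input."""
    return "".join(chr(v) for v in values)


print(longest_emoji_match(cps(0x1F468, 0x1F3FB, 0x200D, 0x1F4BB)))
```

Note that an input with a doubled FE0F does not match the couple sequence, but the single-character sequence still matches its first codepoint, so the tokenizer would go on to treat the rest as text or fail on ZWJ.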
Examples:
- 1. "xyz👨🏻" [78 79 7A 1F468 1F3FB] → Text["xyz"] + Emoji["👨🏻"]
- 2. "A💩︎︎b" [41 FE0E 1F4A9 FE0E FE0E 62] → Text["a"] + Emoji["💩️"] + Text["b"]
- 3. "a™️" [61 2122 FE0F] → Text["atm"]
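The tokenization steps and the three examples above can be reproduced with a minimal Python sketch. The tables `VALID`, `MAPPED`, `IGNORED`, and `EMOJI` are tiny hand-written stand-ins chosen only to cover these examples (the real data comes from spec.json), and the function names are hypothetical.

```python
# Toy character tables (assumptions for illustration only).
VALID = set(map(ord, "abmtxyz"))
MAPPED = {0x41: [0x61], 0x2122: [0x74, 0x6D]}  # A -> a, TM sign -> "tm"
IGNORED = {0xFE0E, 0xFE0F}
EMOJI = [  # fully-qualified sequences
    [0x1F4A9, 0xFE0F],
    [0x1F468, 0x1F3FB],
]


def match_emoji(cps, pos):
    """Longest emoji sequence at pos, treating FE0F as optional in the input."""
    best = None
    for seq in EMOJI:
        i = pos
        matched = True
        for cp in seq:
            if i < len(cps) and cps[i] == cp:
                i += 1
            elif cp != 0xFE0F:  # a missing FE0F is tolerated, anything else is not
                matched = False
                break
        if matched and i > pos and (best is None or i - pos > best[1]):
            best = (seq, i - pos)
    return best


def tokenize(label):
    cps = [ord(ch) for ch in label]
    tokens, buf, pos = [], [], 0
    while pos < len(cps):
        m = match_emoji(cps, pos)
        if m:
            if buf:
                tokens.append(("text", buf))
                buf = []
            tokens.append(("emoji", list(m[0])))
            pos += m[1]
            continue
        cp = cps[pos]
        pos += 1
        if cp in VALID:
            buf.append(cp)
        elif cp in MAPPED:
            buf.extend(MAPPED[cp])
        elif cp in IGNORED:
            pass
        else:
            raise ValueError(f"disallowed character: {cp:X}")
    if buf:
        tokens.append(("text", buf))
    return tokens


print(tokenize("A\uFE0E\U0001F4A9\uFE0E\uFE0Eb"))
```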
Given a list of Emoji and Text tokens, determine if the label is valid and return the Label Type. If any assertion fails, the name cannot be normalized.
- 1. If only Emoji tokens:
  - Return "Emoji"
- 2. If a single Text token and every character is ASCII (00..7F):
  - 5F (_) LOW LINE can only occur at the start.
    - Must match /^_*[^_]*$/
    - Examples: "___" and "__abc" are valid, "abc__" and "_abc_" are invalid.
  - The 3rd and 4th characters must not both be 2D (-) HYPHEN-MINUS.
    - Must not match /^..--/
    - Examples: "ab-c" and "---a" are valid, "xn--" and "----" are invalid.
  - Return "ASCII"
    - The label is free of Fenced and Combining Mark characters, and not confusable.
- 3. Concatenate all the tokens together.
  - 5F (_) LOW LINE can only occur at the start.
  - The first and last characters cannot be Fenced.
    - Examples: "a’s" and "a・a" are valid, "’85" and "joneses’" and "・a・" are invalid.
  - Fenced characters cannot be contiguous.
    - Examples: "a・a’s" is valid, "6’0’’" and "a・・a" are invalid.
- 4. The first character of every Text token must not be a Combining Mark.
- 5. Concatenate the Text tokens together.
- 6. Find the first Group that contains every text character:
  - If no group is found, the label cannot be normalized.
- 7. If the group is not CM Whitelisted:
  - Apply NFD to the concatenated text characters.
  - For every contiguous sequence of NSM characters:
    - Each character must be unique.
      - Example: "x̀̀" [78 300 300] has (2) grave accents.
    - The number of NSM characters cannot exceed Maximum NSM (4).
      - Example: "إؐؑؒؓؔ" [625 610 611 612 613 614] has (6) NSM.
- 8. Check that the text characters do not form a Whole Confusable.
- 9. The label is valid.
  - Return the name of the group as the Label Type.
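The two ASCII-only rules from step 2 translate directly into regular expressions. The sketch below checks only the underscore and hyphen rules. It does not check whether each character is otherwise valid, and the function name `passes_ascii_rules` is hypothetical.

```python
import re

UNDERSCORE_RULE = re.compile(r"_*[^_]*")  # used with fullmatch: /^_*[^_]*$/
HYPHEN_RULE = re.compile(r"..--")         # used with match:     /^..--/


def passes_ascii_rules(label):
    """Return True if an ASCII label satisfies the underscore and hyphen rules."""
    if not label.isascii():
        raise ValueError("this sketch only handles ASCII labels")
    if UNDERSCORE_RULE.fullmatch(label) is None:
        return False  # underscore found after the leading run
    if HYPHEN_RULE.match(label) is not None:
        return False  # 3rd and 4th characters are both hyphens
    return True


for candidate in ["___", "__abc", "abc__", "_abc_", "ab-c", "---a", "xn--", "----"]:
    print(candidate, passes_ascii_rules(candidate))
```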
Examples:
- 1. Emoji["💩️"] + Emoji["💩️"] → "Emoji"
- 2. Text["abc$123"] → "ASCII"
- 3. Emoji["🚀️"] + Text["à"] → "Latin"
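The non-spacing mark checks from step 7 can be sketched as follows. As a simplifying assumption, this treats every character whose Unicode general category is Mn or Me as NSM, whereas the standard uses the explicit "nsm" set from spec.json. The name `check_nsm` is hypothetical.

```python
import unicodedata

NSM_MAX = 4  # Maximum NSM, as stated in step 7


def check_nsm(text, nsm_max=NSM_MAX):
    """Raise ValueError if a run of NSM repeats a mark or is longer than nsm_max."""
    run = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in ("Mn", "Me"):
            if ch in run:
                raise ValueError(f"duplicate non-spacing mark: {ord(ch):X}")
            run.append(ch)
            if len(run) > nsm_max:
                raise ValueError(f"more than {nsm_max} non-spacing marks")
        else:
            run = []  # a base character ends the current run


check_nsm("e\u0301")  # one acute accent: passes silently
for bad in ["x\u0300\u0300", "\u0625\u0610\u0611\u0612\u0613\u0614"]:
    try:
        check_nsm(bad)
    except ValueError as err:
        print(err)
```

The second failing input shows why NFD is applied first: 625 decomposes to 627 655, which adds one more mark to the five that follow it.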
A label is whole-script confusable if a similarly-looking valid label can be constructed using one alternative character from a different group. The complete list of Whole Confusables can be found in spec.json. Each Whole Confusable has a set of non-confusing characters ("valid") and a set of confusing characters ("confused") where each character may be the member of one or more groups.
Example: Whole Confusable for "g"
Type | Code | Form | Character | Latn | Hani | Japn | Kore | Armn | Cher | Lisu |
---|---|---|---|---|---|---|---|---|---|---|
valid | 67 | g | LATIN SMALL LETTER G | A | A | A | A | | | |
confused | 581 | ց | ARMENIAN SMALL LETTER CO | | | | | B | | |
confused | 13C0 | Ꮐ | CHEROKEE LETTER NAH | | | | | | C | |
confused | 13F3 | Ᏻ | CHEROKEE LETTER YU | | | | | | C | |
confused | A4D6 | ꓖ | LISU LETTER GA | | | | | | | D |
- 1. Allocate an empty character buffer.
- 2. Start with the set of ALL groups.
- 3. For each unique character in the label:
  - If the character is Confused (a member of a Whole Confusable):
    - Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.
    - If no groups remain, the label is not confusable.
    - The Confusable Extent is the fully-connected graph formed from different groups with the same confusable and different confusables of the same group.
      - The mapping from Confused to Confusable Extent can be precomputed.
    - In the table above, Whole Confusable for "g", the rectangle formed by each capital letter is a Confusable Extent:
      - A is [g] ⊗ [Latin, Han, Japanese, Korean]
      - B is [ց] ⊗ [Armn]
      - C is [Ꮐ, Ᏻ] ⊗ [Cher]
      - D is [ꓖ] ⊗ [Lisu]
    - A Confusable Extent can span multiple characters and multiple groups. Consider the (incomplete) Whole Confusable for "o":
      - 6F (o) LATIN SMALL LETTER O → Latin, Han, Japanese, and Korean
      - 3007 (〇) IDEOGRAPHIC NUMBER ZERO → Han, Japanese, Korean, and Bopomofo
      - Confusable Extent is [o, 〇] ⊗ [Latin, Han, Japanese, Korean, Bopomofo]
  - If the character is Unique, the label is not confusable.
    - This set can be precomputed from characters that appear in exactly one group and are not Confused.
  - Otherwise:
    - Append the character to the buffer.
- 4. If any Confused characters were found:
  - Assert none of the remaining groups contain any of the buffered characters.
    - Example: "0х" [30 445]
      - 1. 30 (0) DIGIT ZERO
        - Not Confused or Unique, add to buffer.
      - 2. 445 (х) CYRILLIC SMALL LETTER HA
        - Confusable Extent is [х, 4B3 (ҳ) CYRILLIC SMALL LETTER HA WITH DESCENDER] ⊗ [Cyrillic]
        - Whole Confusable excluding the extent is [78 (x) LATIN SMALL LETTER X, ...] → [Latin, ...]
        - Remaining groups: ALL ∩ [Latin, ...] → [Latin, ...]
      - 3. There was (1) buffered character:
        - Latin also contains 30 → "0x" [30 78]
      - 4. The label is confusable.
- 5. The label is not confusable.
A label composed of confusable characters isn't necessarily confusable.
- Example: "тӕ" [442 4D5]
  - 1. 442 (т) CYRILLIC SMALL LETTER TE
    - Confusable Extent is [т] ⊗ [Cyrillic]
    - Whole Confusable excluding the extent is [3C4 (τ) GREEK SMALL LETTER TAU] → [Greek]
    - Remaining groups: ALL ∩ [Greek] → [Greek]
  - 2. 4D5 (ӕ) CYRILLIC SMALL LIGATURE A IE
    - Confusable Extent is [ӕ] ⊗ [Cyrillic]
    - Whole Confusable excluding the extent is [E6 (æ) LATIN SMALL LETTER AE] → [Latin]
    - Remaining groups: [Greek] ∩ [Latin] → ∅
  - 3. No groups remain so the label is not confusable.
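The whole-script confusable check can be sketched in Python against a toy data set that covers only the two worked examples. All tables below are assumptions for illustration (the real ones are derived from spec.json), and `is_whole_confusable` is a hypothetical name. The sketch reads the final assertion as failing when some remaining group contains every buffered character, which is one interpretation of that step.

```python
from collections import Counter

# Toy groups (assumption): each group maps to the characters it contains.
GROUPS = {
    "Latin": set("0x\u00E6"),               # 0, x, ae ligature
    "Greek": set("0\u03C4"),                # 0, tau
    "Cyrillic": set("0\u0445\u0442\u04D5"),  # 0, ha, te, a-ie ligature
}

# For each Confused character: the groups holding look-alikes that lie
# outside its own Confusable Extent (precomputed by hand here).
CONFUSED = {
    "\u0445": {"Latin"},  # 445 looks like 78 (x)
    "\u0442": {"Greek"},  # 442 looks like 3C4 (tau)
    "\u04D5": {"Latin"},  # 4D5 looks like E6 (ae)
}

# Unique: characters in exactly one group that are not Confused.
_counts = Counter(ch for chars in GROUPS.values() for ch in chars)
UNIQUE = {ch for ch, n in _counts.items() if n == 1 and ch not in CONFUSED}


def is_whole_confusable(label):
    buffer = []
    remaining = set(GROUPS)  # start with ALL groups
    found_confused = False
    for ch in dict.fromkeys(label):  # unique characters, in first-seen order
        if ch in CONFUSED:
            found_confused = True
            remaining &= CONFUSED[ch]
            if not remaining:
                return False  # no single group can imitate the label
        elif ch in UNIQUE:
            return False
        else:
            buffer.append(ch)
    if found_confused:
        for group in remaining:
            if all(ch in GROUPS[group] for ch in buffer):
                return True  # the whole label can be imitated in this group
    return False


print(is_whole_confusable("0\u0445"))       # the "0х" example
print(is_whole_confusable("\u0442\u04D5"))  # the "тӕ" example
```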
- Partition a name into labels, separated by 2E (.) FULL STOP, and return the resulting array.
  - Example: "abc.123.eth" → ["abc", "123", "eth"]
- Assemble an array of labels into a name, inserting 2E (.) FULL STOP between each label, and return the resulting string.
  - Example: ["abc", "123", "eth"] → "abc.123.eth"
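Splitting and joining are plain string operations on the single separator 2E (.) FULL STOP. A minimal sketch, with hypothetical function names:

```python
STOP = "\u002E"  # 2E (.) FULL STOP, the only label separator


def split_name(name):
    """Partition a name into its labels."""
    return name.split(STOP)


def join_labels(labels):
    """Assemble labels back into a name."""
    return STOP.join(labels)


labels = split_name("abc.123.eth")
print(labels)
print(join_labels(labels))
```

Because no other character maps to the separator, the two functions are exact inverses of each other.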
"name"
— ASCII name of the group (or abbreviation if Restricted)- Examples: Latin, Japanese, Egyp
- Examples: Latin →
false
, Egyp →true
"primary"
— subset of characters that define the group- Examples:
"a"
→ Latin,"あ"
→ Japanese,"𓀀"
→ Egyp
"secondary"
— subset of characters included with the group- Example:
"0"
→ Common but mixable with Latin
- CM Whitelist(ed) (
"cm"
) — (optional) set of allowed compound sequences in NFC- Each compound sequence is a character followed by one or more Combining Marks.
- Example:
à̀̀
→E0 300 300
- Currently, every group that is CM Whitelist has zero compound sequences.
- CM Whitelisted is effectively
true
if[]
otherwisefalse
- Ignored (
"ignored"
) — characters that are ignored during normalization- Example:
34F (�) COMBINING GRAPHEME JOINER
- Mapped (
"mapped"
) — characters that are mapped to a sequence of valid characters- Example:
41 (A) LATIN CAPITAL LETTER A
→[61 (a) LATIN SMALL LETTER A]
- Example:
2165 (Ⅵ) ROMAN NUMERAL SIX
→[76 (v) LATIN SMALL LETTER V, 69 (i) LATIN SMALL LETTER I]
- Whole Confusable (
"wholes"
) — groups of characters that look similar"valid"
— subset of confusable characters that are allowed- Example:
34 (4) DIGIT FOUR
- Confused (
"confused"
) — subset of confusable characters that confuse- Example:
13CE (Ꮞ) CHEROKEE LETTER SE
- Fenced (
"fenced"
) — set of characters that cannot be first, last, or contiguous- Example:
2044 (⁄) FRACTION SLASH
- Example:
👨💻 [1F468 200D 1F4BB] man technologist
- Non-spacing Marks / NSM (
"nsm"
) — valid subset of CM with general category ("Mn"
or"Me"
) - Maximum NSM (
"nsm_max"
) — maximum sequence length of unique NSM - Should Escape (
"escape"
) — characters that shouldn't be printed
- UseSTD3ASCIIRules is true
- VerifyDnsLength is false
- Transitional_Processing is false
  - DF (ß) LATIN SMALL LETTER SHARP S
  - 3C2 (ς) GREEK SMALL LETTER FINAL SIGMA
- CheckBidi is false
- 200C (�) ZERO WIDTH NON-JOINER (ZWNJ) is disallowed everywhere.
- 200D (�) ZERO WIDTH JOINER (ZWJ) is only allowed in emoji sequences.
- B7 (·) MIDDLE DOT is disallowed.
- 375 (͵) GREEK LOWER NUMERAL SIGN is disallowed.
- 5F3 (׳) HEBREW PUNCTUATION GERESH and 5F4 (״) HEBREW PUNCTUATION GERSHAYIM are Hebrew.
- 30FB (・) KATAKANA MIDDLE DOT is Fenced and Han, Japanese, Korean, and Bopomofo.
- 6F0 (۰) → 660 (٠) ARABIC-INDIC DIGIT ZERO
- 6F1 (۱) → 661 (١) ARABIC-INDIC DIGIT ONE
- 6F2 (۲) → 662 (٢) ARABIC-INDIC DIGIT TWO
- 6F3 (۳) → 663 (٣) ARABIC-INDIC DIGIT THREE
- 6F7 (۷) → 667 (٧) ARABIC-INDIC DIGIT SEVEN
- 6F8 (۸) → 668 (٨) ARABIC-INDIC DIGIT EIGHT
- 6F9 (۹) → 669 (٩) ARABIC-INDIC DIGIT NINE
- The following ASCII characters are valid:
  - 24 ($) DOLLAR SIGN
- The only label separator is 2E (.) FULL STOP
  - No character maps to this character.
  - This simplifies name detection in unstructured text.
  - The following alternatives are disallowed:
    - 3002 (。) IDEOGRAPHIC FULL STOP
    - FF0E (.) FULLWIDTH FULL STOP
    - FF61 (。) HALFWIDTH IDEOGRAPHIC FULL STOP
- Nearly all punctuation is disallowed.
  - Example: 589 (։) ARMENIAN FULL STOP
- All parentheses and brackets are disallowed.
  - Example: 2997 (⦗) LEFT BLACK TORTOISE SHELL BRACKET
- Nearly all vocalization annotations are disallowed.
  - Example: 294 (ʔ) LATIN LETTER GLOTTAL STOP
- Obsolete, deprecated, and ancient characters are disallowed.
  - Example: 463 (ѣ) CYRILLIC SMALL LETTER YAT
- Combining, modifying, reversed, flipped, turned, and partial variations are disallowed.
  - Example: 218A (↊) TURNED DIGIT TWO
- When multiple weights of the same character exist, the variant closest to "heavy" is selected and the rest disallowed.
  - Example: 🞡🞢🞣🞤✚🞥🞦🞧 → 271A (✚) HEAVY GREEK CROSS
  - This occasionally selects an emoji.
    - Example: ✔️ or 2714 (✔︎) HEAVY CHECK MARK is selected instead of 2713 (✓) CHECK MARK
- Many visually confusable characters are disallowed.
  - Example: 131 (ı) LATIN SMALL LETTER DOTLESS I
- Many ligatures, n-graphs, and n-grams are disallowed.
  - Example: A74F (ꝏ) LATIN SMALL LETTER OO
- Many esoteric characters are disallowed.
  - Example: 2376 (⍶) APL FUNCTIONAL SYMBOL ALPHA UNDERBAR
- Many hyphen-like characters are mapped to 2D (-) HYPHEN-MINUS:
  - 2010 (‐) HYPHEN
  - 2011 (‑) NON-BREAKING HYPHEN
  - 2012 (‒) FIGURE DASH
  - 2013 (–) EN DASH
  - 2014 (—) EM DASH
  - 2015 (―) HORIZONTAL BAR
  - 2043 (⁃) HYPHEN BULLET
  - 2212 (−) MINUS SIGN
  - 23AF (⎯) HORIZONTAL LINE EXTENSION
  - 23E4 (⏤) STRAIGHTNESS
  - FE58 (﹘) SMALL EM DASH
  - 2E3A (⸺) TWO-EM DASH → "--"
  - 2E3B (⸻) THREE-EM DASH → "---"
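The hyphen-like mappings listed above fit naturally into a translation table. The sketch below uses Python's built-in `str.translate`. The table name and function name are hypothetical, and in a full implementation these entries would simply be part of the Mapped set.

```python
# Codepoints from the list above that map to a single 2D (-) HYPHEN-MINUS.
SINGLE_HYPHEN = [
    0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015,
    0x2043, 0x2212, 0x23AF, 0x23E4, 0xFE58,
]

HYPHEN_LIKE = {cp: "-" for cp in SINGLE_HYPHEN}
HYPHEN_LIKE[0x2E3A] = "--"   # TWO-EM DASH
HYPHEN_LIKE[0x2E3B] = "---"  # THREE-EM DASH


def map_hyphen_like(text):
    """Replace every hyphen-like character with its HYPHEN-MINUS mapping."""
    return text.translate(HYPHEN_LIKE)


print(map_hyphen_like("a\u2013b"))  # EN DASH becomes a plain hyphen
print(map_hyphen_like("\u2E3B"))    # THREE-EM DASH becomes three hyphens
```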
- Only Latin, Greek, Cyrillic, Han, Japanese, and Korean have access to Common characters.
- Latin, Greek, Cyrillic, Han, Japanese, Korean, and Bopomofo only permit specific Combining Mark sequences.
- Han, Japanese, and Korean have access to a-z.
- Restricted groups are always single-script.
- Scripts Braille, Linear A, Linear B, and Signwriting are disallowed.
- 27 (') APOSTROPHE is mapped to 2019 (’) RIGHT SINGLE QUOTATION MARK for convenience.
- Ethereum symbol (39E (Ξ) GREEK CAPITAL LETTER XI) is case-folded and Common.
- Emoji:
  - 203C (‼️) double exclamation mark
  - 2049 (⁉️) exclamation question mark
  - Remaining emoji characters are marked as disallowed (for text processing).
  - All RGI_Emoji_ZWJ_Sequence are enabled.
  - All Emoji_Keycap_Sequence are enabled.
  - All RGI_Emoji_Tag_Sequence are enabled.
  - All RGI_Emoji_Modifier_Sequence are enabled.
  - All RGI_Emoji_Flag_Sequence are enabled.
  - Basic_Emoji of the form [X FE0F] are enabled.
  - Emoji with default emoji-presentation are enabled as [X FE0F].
  - Remaining single-character emoji are enabled as [X FE0F] (explicit emoji-presentation).
  - All singular Skin-color Modifiers are disabled.
  - All singular Regional Indicators are disabled.
  - Blacklisted emoji are disabled.
  - Whitelisted emoji are enabled.
- Confusables:
  - Emoji are not confusable.
  - ASCII confusables are case-folded.
    - Example: 61 (a) LATIN SMALL LETTER A confuses with 13AA (Ꭺ) CHEROKEE LETTER GO
- 99% of names are still valid.
- Unicode presentation may vary between applications and devices.
- Unicode text is ultimately subject to font-styling and display context.
- Unsupported characters (�) may appear unremarkable.
- Normalized single-character emoji sequences do not retain their explicit emoji-presentation and may display with text or emoji presentation styling.
  - ❤︎ (text-presentation and default-color)
  - ❤︎ (text-presentation and green-color)
  - ❤️ (emoji-presentation and green-color)
- Unsupported emoji sequences with ZWJ may appear indistinguishable from those without ZWJ.
  - 💩💩 [1F4A9 1F4A9]
  - 💩💩 [1F4A9 200D 1F4A9] → error: Disallowed character
- Normalization does not enforce single-directional names.
- Names may be composed of labels of different directions but normalized labels are never bidirectional.
  - [LTR].[RTL] bahrain.مصر
  - [LTR+RTL] bahrainمصر → error: Illegal mixture: Latin + Arabic
- Not all normalized names are visually unambiguous.
- There exist confusable multi-character sequences:
  - "ஶ்ரீ" [BB6 BCD BB0 BC0]
  - "ஸ்ரீ" [BB8 BCD BB0 BC0]
- There exist confusable emoji sequences:
  - 🚴 [1F6B4] and 🚴🏻 [1F6B4 1F3FB]
  - 🇺🇸 [1F1FA 1F1F8] and 🇺🇲 [1F1FA 1F1F2]
  - ♥ [2665] BLACK HEART SUIT and ❤ [2764] HEAVY BLACK HEART
- Already Normalized: {name: "a"} → normalize("a") is "a"
- Need Normalization: {name: "A", norm: "a"} → normalize("A") is "a"
- Expect Error: {name: "@", error: true} → normalize("@") throws
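The three record shapes above suggest a simple test harness. The sketch below is a hypothetical runner: `run_validation_tests` and `toy_normalize` are invented names, and `toy_normalize` is only a stand-in that lowercases ASCII alphanumerics so the example is self-contained. A real run would pass in an actual ENSIP-15 implementation and the published test records.

```python
def run_validation_tests(tests, normalize):
    """Return a list of (name, reason) pairs for every failing record."""
    failures = []
    for record in tests:
        name = record["name"]
        try:
            result = normalize(name)
        except Exception:
            if not record.get("error"):
                failures.append((name, "unexpected error"))
            continue
        if record.get("error"):
            failures.append((name, "expected an error"))
        elif result != record.get("norm", name):  # "norm" defaults to the name itself
            failures.append((name, f"got {result!r}"))
    return failures


def toy_normalize(name):
    """Stand-in normalizer (assumption): ASCII alphanumerics only, lowercased."""
    if not all(ch.isascii() and ch.isalnum() for ch in name):
        raise ValueError("disallowed character")
    return name.lower()


TESTS = [
    {"name": "a"},
    {"name": "A", "norm": "a"},
    {"name": "@", "error": True},
]

print(run_validation_tests(TESTS, toy_normalize))
```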
- Do not strip FE0F from Emoji tokens.
- Replace 3BE (ξ) GREEK SMALL LETTER XI with 39E (Ξ) GREEK CAPITAL LETTER XI if the label isn't Greek.
- Example: normalize("‐Ξ1️⃣") [2010 39E 31 FE0F 20E3] is "-ξ1⃣" [2D 3BE 31 20E3]
- Example: beautify("-ξ1⃣") [2D 3BE 31 20E3] is "-Ξ1️⃣" [2D 39E 31 FE0F 20E3]
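The two beautification rules can be sketched on top of an already tokenized and validated label. The `(kind, codepoints)` token shape, the `label_type` argument, and the name `beautify_label` are assumptions for this illustration. Emoji tokens are expected to carry their fully-qualified sequence, and NFC for Text tokens is omitted for brevity.

```python
XI_SMALL = "\u03BE"    # 3BE GREEK SMALL LETTER XI
XI_CAPITAL = "\u039E"  # 39E GREEK CAPITAL LETTER XI


def beautify_label(tokens, label_type):
    """Concatenate tokens, keeping FE0F in emoji and capitalizing xi outside Greek."""
    parts = []
    for kind, cps in tokens:
        s = "".join(chr(cp) for cp in cps)
        if kind == "text" and label_type != "Greek":
            s = s.replace(XI_SMALL, XI_CAPITAL)
        parts.append(s)  # emoji tokens are appended unchanged, FE0F included
    return "".join(parts)


# Tokens for the example label, assuming validation reported a non-Greek type.
tokens = [("text", [0x2D, 0x3BE]), ("emoji", [0x31, 0xFE0F, 0x20E3])]
print([f"{ord(ch):X}" for ch in beautify_label(tokens, "Latin")])
```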