CharacterNormalize

CharacterNormalize["text",form]

converts the characters in text to the specified normalization form.

Details

  • CharacterNormalize supports the following Unicode normalization forms:
      "NFD"             canonical decomposition (Form D)
      "NFC"             canonical decomposition, followed by canonical composition (Form C)
      "NFKD"            compatibility decomposition (Form KD)
      "NFKC"            compatibility decomposition, followed by canonical composition (Form KC)
      "NFKCCaseFold"    compatibility decomposition, followed by canonical composition, with case folding (Form KCCaseFold)
  • In CharacterNormalize[text,form], text can be a string or a list of strings.
  • In "NFD" and "NFC", canonical decomposition refers to these four types of operations:
      Å → A + U+030A                 decompose marks
      Ȱ → O + U+0307 + U+0304        decompose and order marks
      한 → ᄒ + ᅡ + ᆫ                 decompose Hangul characters and conjoining marks
      Ω (Ohm) → Ω (Omega)            map character to its canonical Unicode equivalent
  • In "NFKD" and "NFKC", compatibility decomposition refers to operations such as:
      ℍ → H, ℌ → H                   normalize font variants
      (NBSP) → (Space)               normalize linebreaking differences
      ﻉ → ع, ﻊ → ع                   normalize positional variants
      ① → 1                          normalize circled variants
      ｶ → カ                          normalize width variants
      ︷ → {, ︸ → }                   normalize rotated variants
      i⁹ → i9, i₉ → i9               normalize subscripts/superscripts
      ㌀ → アパート                    decompose squared characters
      ¼ → 1⁄4                        normalize fractions
      ǆ → dž                         other normalizations
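
The forms listed above can be compared side by side. The following is a minimal sketch, not taken from the original page: the names forms and input are illustrative, and the input character is written with a \: escape so that the example does not depend on file encoding.

```wolfram
(* Apply every documented form to one input and list the resulting code points. *)
forms = {"NFD", "NFC", "NFKD", "NFKC", "NFKCCaseFold"};
input = "\:00c5";  (* U+00C5, Latin capital letter A with ring above *)
AssociationMap[ToCharacterCode[CharacterNormalize[input, #]] &, forms]
```

Under the Unicode definitions, the decomposing forms should yield two code points (the base letter followed by U+030A), the composing forms a single code point, and the case-folding form the lowercase letter.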

Examples

Basic Examples  (6)

Normalize string characters using canonical decomposition:
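
A possible input for this case, as a sketch (the variable names word and decomposed are illustrative and not part of the original page):

```wolfram
word = "caf\:00e9";  (* "café" written with the precomposed character U+00E9 *)
decomposed = CharacterNormalize[word, "NFD"];

(* Compare lengths before and after, then inspect the code points. *)
{StringLength[word], StringLength[decomposed]}
ToCharacterCode[decomposed]
```

Because U+00E9 canonically decomposes to e followed by the combining acute accent U+0301, the decomposed string should be one character longer than the input.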

Normalize string characters using compatibility decomposition:
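
One way to try this, assuming a ligature as the input (the name ligature is illustrative):

```wolfram
ligature = "\:fb01";  (* U+FB01, Latin small ligature fi *)

(* Compatibility decomposition should split the ligature into separate letters. *)
ToCharacterCode[CharacterNormalize[ligature, "NFKD"]]
```

By the Unicode compatibility mapping for U+FB01, the result should be the code points of the two plain letters f and i.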

Normalize string characters using compatibility decomposition followed by canonical composition:
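
A short sketch with a circled digit as an assumed input (the name circled is illustrative):

```wolfram
circled = "\:2460";  (* U+2460, circled digit one *)

(* Check whether the circled variant normalizes to the plain digit. *)
CharacterNormalize[circled, "NFKC"] === "1"
```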

Normalize string characters using canonical decomposition followed by canonical composition:
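
A sketch that starts from an already decomposed sequence, so that the composition step is visible (the names sequence and composed are illustrative):

```wolfram
sequence = "e\:0301";  (* e followed by the combining acute accent U+0301 *)
composed = CharacterNormalize[sequence, "NFC"];

(* Inspect the code points after composition. *)
ToCharacterCode[composed]
```

Under the Unicode composition rules, the pair should combine into the single precomposed character U+00E9 (decimal 233).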

Normalize string characters using compatibility decomposition followed by canonical composition with case folding:
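
Case folding is mainly useful for caseless comparison. The sketch below uses two assumed spellings of the same word; the inputs are illustrative:

```wolfram
(* Two spellings that differ in case and in the use of sharp s (U+00DF). *)
CharacterNormalize["Stra\:00dfe", "NFKCCaseFold"]
CharacterNormalize["STRASSE", "NFKCCaseFold"]
```

If the form follows Unicode full case folding, both inputs should fold to the same lowercase string, which makes them comparable with SameQ.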

Normalize the characters in the string using compatibility decomposition:

Characters with diacritics have been decomposed:
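
The decomposition can be made visible by splitting the result into characters and picking out the combining marks. This is a sketch with an assumed input; the range 768 to 879 is the Combining Diacritical Marks block (U+0300 to U+036F):

```wolfram
text = "\:00c5ngstr\:00f6m";  (* "Ångström" written with precomposed letters *)
normalized = CharacterNormalize[text, "NFKD"];

(* Individual characters, with the marks now separate from their base letters. *)
Characters[normalized]

(* Keep only the code points that fall in the combining marks block. *)
Select[ToCharacterCode[normalized], 768 <= # <= 879 &]
```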

Scope  (2)

Decompose a composite character into its constituents:

Ordering of the mark and the character has changed after normalization:
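
Reordering can be observed with two marks of different combining classes typed in non-canonical order. A sketch, with an illustrative input name:

```wolfram
(* s followed by dot above (U+0307, class 230) and then dot below (U+0323, class 220). *)
outOfOrder = "s\:0307\:0323";

(* Canonical ordering sorts marks by combining class, so the dot below should move first. *)
ToCharacterCode[CharacterNormalize[outOfOrder, "NFD"]]
```

Under the Unicode canonical ordering algorithm, the expected code points are 115, 803, 775, that is, the base letter, then U+0323, then U+0307.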

Obtain the "Ohm" character from its code:

NFD maps characters to their canonically equivalent Unicode characters. Normalize the character using NFD:

Convert the output (omega) to its character code:
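
The three steps above can be sketched as follows; the names ohm and omega are illustrative:

```wolfram
ohm = FromCharacterCode[8486];            (* U+2126, Ohm sign *)
omega = CharacterNormalize[ohm, "NFD"];   (* canonical equivalent *)

(* Compare the code points before and after normalization. *)
{ToCharacterCode[ohm], ToCharacterCode[omega]}
```

Since the Ohm sign is canonically equivalent to the Greek capital letter omega (U+03A9), the second code point should be 937.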

Generalizations & Extensions  (1)

CharacterNormalize threads elementwise over lists:
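
A sketch that checks list input against mapping the function by hand (the names inputs and results are illustrative):

```wolfram
inputs = {"\:00c5", "caf\:00e9", "plain"};
results = CharacterNormalize[inputs, "NFD"];

(* The list form should agree with applying the function to each element. *)
results === (CharacterNormalize[#, "NFD"] & /@ inputs)
```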

CharacterNormalize works on strings of different scripts and letters:
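
A sketch with one assumed sample character per script; the association keys are labels only:

```wolfram
samples = <|
  "Hangul" -> "\:d55c",  (* U+D55C, a precomposed Hangul syllable *)
  "Greek" -> "\:03ac",   (* U+03AC, alpha with tonos *)
  "Latin" -> "\:00f1"    (* U+00F1, n with tilde *)
|>;

(* Decompose each sample and show the resulting code points per script. *)
ToCharacterCode[CharacterNormalize[#, "NFD"]] & /@ samples
```

By the Unicode decomposition rules, the Hangul syllable should split into three conjoining jamo, and the Greek and Latin letters into a base letter plus one combining mark each.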

Possible Issues  (1)

Compatibility equivalence may convert different forms of a character to a canonical form:
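
A sketch of this effect with font variants of the letter H as assumed inputs:

```wolfram
(* Double-struck H (U+210D), black-letter H (U+210C), and plain H. *)
variants = {"\:210d", "\:210c", "H"};

(* After compatibility normalization the three should no longer be distinguishable. *)
CharacterNormalize[variants, "NFKC"]
```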

Compatibility equivalence may remove formatting distinctions that canonical equivalence preserves:
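
A sketch contrasting the two kinds of equivalence on a superscript, with an illustrative input:

```wolfram
expr = "x\:00b2";  (* x followed by superscript two, U+00B2 *)

(* NFC should leave the superscript alone; NFKC should replace it with a plain digit. *)
{CharacterNormalize[expr, "NFC"] === expr, CharacterNormalize[expr, "NFKC"]}
```

If the result of the second form is "x2", the exponent can no longer be told apart from an ordinary digit, so compatibility forms are best avoided where such formatting carries meaning.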

Introduced in 2020 (12.1)