CharacterNormalize

CharacterNormalize["text",form]

converts the characters in text to the specified normalization form.

Details

  • CharacterNormalize supports the following Unicode normalization forms:
  • "NFD"     canonical decomposition (Form D)
    "NFC"     canonical decomposition, followed by canonical composition (Form C)
    "NFKD"    compatibility decomposition (Form KD)
    "NFKC"    compatibility decomposition, followed by canonical composition (Form KC)
  • In CharacterNormalize[text,form], text can be a string or a list of strings.
  • In "NFD" and "NFC", canonical decomposition refers to these four types of operations:
  • Å → Å    decompose marks
    Ȱ → Ȱ    decompose and order marks
    한 → 한    decompose Hangul and conjoining Jamo
    Ω (Ohm) → Ω (Omega)    map character to its canonical Unicode equivalent
  • In "NFKD" and "NFKC", compatibility decomposition refers to operations such as:
  • ℌ → H, ℍ → H    normalize font variants
    (NBSP) → (Space)    normalize linebreaking differences
    ﻉ → ع, ﻊ → ع    normalize positional variants
    ① → 1    normalize circled variants
    ｶ → カ    normalize width variants
    ︷ → {, ︸ → }    normalize rotated variants
    i⁹ → i9, i₉ → i9    normalize subscripts/superscripts
    ㌀ → アパート    decompose squared characters
    ¼ → 1⁄4    normalize fractions
    ǆ → dž    other normalizations

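The four forms can be compared side by side on a single string. The following is a minimal sketch, assuming an illustrative two-character input: Å (U+00C5), which has a canonical decomposition, and ① (U+2460), which has only a compatibility decomposition. The variable names are hypothetical, and character codes are used because composed and decomposed text usually renders identically.

```wolfram
(* Illustrative input, built from character codes: U+00C5 followed by U+2460 *)
s = FromCharacterCode[{197, 9312}];
forms = {"NFD", "NFC", "NFKD", "NFKC"};

(* Character codes of the result for each normalization form *)
byForm = AssociationMap[ToCharacterCode[CharacterNormalize[s, #]] &, forms]
(* <|"NFD" -> {65, 778, 9312}, "NFC" -> {197, 9312},
     "NFKD" -> {65, 778, 49}, "NFKC" -> {197, 49}|> *)
```

Only the two compatibility forms replace the circled digit with a plain 1; only the two decomposition forms split Å into A and the combining ring U+030A.
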
Examples


Basic Examples  (5)

Normalize string characters using canonical decomposition:
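A minimal sketch of this case, assuming the precomposed character é (U+00E9) as input; the input and variable names are illustrative and are not the page's original example:

```wolfram
(* Illustrative input: precomposed é, U+00E9 *)
eAcute = FromCharacterCode[233];
nfd = CharacterNormalize[eAcute, "NFD"];

(* Inspect the character codes, since both forms look the same on screen *)
ToCharacterCode[nfd]
(* {101, 769}: e followed by the combining acute accent U+0301 *)
```
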

Normalize string characters using compatibility decomposition:
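One way to see this, assuming the ligature ﬁ (U+FB01) as an illustrative input that has a compatibility decomposition but no canonical one:

```wolfram
(* Illustrative input: the single-character ligature U+FB01 *)
ligature = FromCharacterCode[64257];
nfkd = CharacterNormalize[ligature, "NFKD"]
(* "fi": two separate ASCII letters *)
```
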

Normalize string characters using compatibility decomposition followed by canonical composition:
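A small sketch under the assumption of a three-character input: e, the combining acute accent (U+0301), and the circled digit ① (U+2460). The input is illustrative only:

```wolfram
(* Illustrative input: e + U+0301 + U+2460 *)
mixed = FromCharacterCode[{101, 769, 9312}];
nfkc = CharacterNormalize[mixed, "NFKC"];
ToCharacterCode[nfkc]
(* {233, 49}: e and its accent are composed into é; the circled digit becomes 1 *)
```
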

Normalize string characters using canonical decomposition followed by canonical composition:
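As a sketch, assume the input is the letter A followed by the combining ring above (U+030A), an illustrative choice; NFC should combine the pair into the single character Å (U+00C5):

```wolfram
(* Illustrative input: A + combining ring above *)
decomposedARing = FromCharacterCode[{65, 778}];
nfc = CharacterNormalize[decomposedARing, "NFC"];
ToCharacterCode[nfc]
(* {197} *)
```
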

Normalize the characters in the string using compatibility decomposition:

Characters with diacritics have been decomposed:
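The two steps above might look as follows, assuming the word "résumé" with precomposed accents as a stand-in for the original input (the word and variable names are illustrative):

```wolfram
(* Illustrative input "résumé", with é given as the precomposed U+00E9 *)
word = FromCharacterCode[{114, 233, 115, 117, 109, 233}];
decomposed = CharacterNormalize[word, "NFKD"];

(* Each é becomes two characters, so the string grows from 6 to 8 *)
lengths = {StringLength[word], StringLength[decomposed]}
(* {6, 8} *)

ToCharacterCode[decomposed]
(* {114, 101, 769, 115, 117, 109, 101, 769} *)
```
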

Scope  (2)

Decompose a composite character into its constituents:
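A sketch of such a decomposition, assuming Ȱ (U+0230, O with dot above and macron) as the composite character; the choice of character is an assumption:

```wolfram
(* Illustrative input: U+0230 *)
composite = FromCharacterCode[560];
parts = ToCharacterCode[CharacterNormalize[composite, "NFD"]]
(* {79, 775, 772}: O, combining dot above U+0307, combining macron U+0304 *)
```
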

Ordering of the mark and the character has changed after normalization:
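The original input for this example is not available. One hedged illustration of reordering uses a base letter with two combining marks entered out of canonical order: the acute accent above (U+0301) followed by the dot below (U+0323). Canonical ordering places marks with a lower combining class first, so the dot below moves ahead of the acute accent:

```wolfram
(* Illustrative input: a + U+0301 (above) + U+0323 (below) *)
unordered = FromCharacterCode[{97, 769, 803}];
ordered = ToCharacterCode[CharacterNormalize[unordered, "NFD"]]
(* {97, 803, 769} *)
```
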

Obtain the "Ohm" character from its code:

NFD maps a character to its canonical Unicode equivalent. Normalize the character using NFD:

Convert the output (omega) to its character code:
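The three steps above can be sketched together. The code points used here are U+2126 (OHM SIGN, decimal 8486) and U+03A9 (GREEK CAPITAL LETTER OMEGA, decimal 937); the variable names are hypothetical:

```wolfram
(* Obtain the Ohm sign from its character code *)
ohm = FromCharacterCode[8486];

(* Canonical decomposition maps it to the Greek capital omega *)
omega = CharacterNormalize[ohm, "NFD"];

(* Convert the result back to a character code *)
ToCharacterCode[omega]
(* {937} *)
```
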

Generalizations & Extensions  (1)

CharacterNormalize threads itself elementwise over lists:
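For instance, assuming a list of two decomposed strings (e + U+0301 and A + U+030A, both illustrative), each element is normalized separately:

```wolfram
(* FromCharacterCode on a list of lists returns a list of strings *)
inputs = FromCharacterCode[{{101, 769}, {65, 778}}];
composed = ToCharacterCode[CharacterNormalize[inputs, "NFC"]]
(* {{233}, {197}} *)
```
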

CharacterNormalize works on strings of different scripts and letters:
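A sketch with one character from each of three scripts, chosen here for illustration: Hangul 한 (U+D55C), Greek ά (U+03AC), and Cyrillic й (U+0439):

```wolfram
(* Illustrative inputs: one precomposed character per script *)
samples = FromCharacterCode[{{54620}, {940}, {1081}}];
decomposedSamples = ToCharacterCode[CharacterNormalize[samples, "NFD"]]
(* {{4370, 4449, 4523}, {945, 769}, {1080, 774}} *)
```

The Hangul syllable splits into three conjoining Jamo, while the Greek and Cyrillic letters each split into a base letter and one combining mark.
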

Possible Issues  (1)

Compatibility equivalence may convert different forms of a character to a canonical form:
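As an illustration, assume the two font variants ℌ (U+210C) and ℍ (U+210D) as inputs; after compatibility normalization both become the plain letter H and can no longer be told apart:

```wolfram
(* Illustrative inputs: black-letter H and double-struck H *)
variants = FromCharacterCode[{{8460}, {8461}}];
plain = CharacterNormalize[variants, "NFKC"]
(* {"H", "H"} *)
```
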

Compatibility equivalence may remove formatting distinctions that canonical equivalence preserves:
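A sketch of the difference, assuming the string x² (x followed by SUPERSCRIPT TWO, U+00B2) as an illustrative input:

```wolfram
(* Illustrative input: x + superscript two *)
squared = FromCharacterCode[{120, 178}];
canonicalVsCompat = ToCharacterCode[CharacterNormalize[squared, #]] & /@ {"NFC", "NFKC"}
(* {{120, 178}, {120, 50}} *)
```

"NFC" leaves the superscript intact, while "NFKC" replaces it with an ordinary 2, so the result reads as x2.
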

Wolfram Research (2020), CharacterNormalize, Wolfram Language function, https://reference.wolfram.com/language/ref/CharacterNormalize.html.
