CharacterNormalize

CharacterNormalize["text", form]
converts the characters in text to the specified normalization form.

Details

CharacterNormalize supports the following Unicode normalization forms:

	"NFD"	canonical decomposition (Form D)
	"NFC"	canonical decomposition, followed by canonical composition (Form C)
	"NFKD"	compatibility decomposition (Form KD)
	"NFKC"	compatibility decomposition, followed by canonical composition (Form KC)

In CharacterNormalize[text,…], text can be a string or a list of strings.
In "NFD" and "NFC", canonical decomposition refers to these four types of operations:

	Å → Å, …	decompose marks
	Ȱ → Ȱ, …	decompose and order marks
	한 → 한, …	decompose Hangul and conjoining Jamo
	Ω (Ohm) → Ω (Omega), …	map character to its canonical Unicode equivalent

In "NFKD" and "NFKC", compatibility decomposition refers to operations such as:

	ℌH ,ℍH,…	normalize font variants
	(NBSP)(Space), …	normalize linebreaking differences
	ﻉ ع,ﻊ ع, …	normalize positional variants
	①1, …	normalize circled variants
	ｶカ, …	normalize width variants
	︷{ ,︸} , …	normalize rotated variants
	i⁹ i9,i₉ i9, …	normalize subscripts/superscripts
	㌀アパート, …	decompose squared characters
	¼ 1/4 , …	normalize fractions
	ǆ→dž, …	other normalizations

Examples

Basic Examples (5)

Normalize string characters using canonical decomposition:
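
A minimal sketch of this case, assuming an illustrative input (the string "café" and the variable name nfd are not from the original example). ToCharacterCode is applied so that the separated combining mark shows up as its own code point:

```wolfram
(* "caf\[EAcute]" ends in the precomposed character U+00E9 *)
nfd = CharacterNormalize["caf\[EAcute]", "NFD"];

(* Inspect the code points of the normalized string *)
ToCharacterCode[nfd]
(* {99, 97, 102, 101, 769}: the final "e" is now followed by U+0301, the combining acute accent *)
```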

Normalize string characters using compatibility decomposition:
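
As a hedged illustration, the sketch below assumes the ligature U+FB01 as input (chosen here for clarity, not taken from the original example). The character is built from its code so that the input is unambiguous:

```wolfram
(* 64257 is U+FB01 LATIN SMALL LIGATURE FI, a compatibility character *)
ligature = FromCharacterCode[64257];

(* Compatibility decomposition replaces it with the two plain letters "f" and "i" *)
ToCharacterCode[CharacterNormalize[ligature, "NFKD"]]
(* {102, 105} *)
```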

Normalize string characters using compatibility decomposition followed by canonical composition:
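
A small sketch of both phases at once, assuming an illustrative input made of the ligature U+FB01 followed by "e" and a combining acute accent (this input is an assumption, not the original example):

```wolfram
(* U+FB01 (fi ligature), "e" (101), U+0301 combining acute accent (769) *)
input = FromCharacterCode[{64257, 101, 769}];

(* The ligature is decomposed to "f" and "i"; "e" plus the accent is then composed to U+00E9 *)
ToCharacterCode[CharacterNormalize[input, "NFKC"]]
(* {102, 105, 233} *)
```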

Normalize string characters using canonical decomposition followed by canonical composition:
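
One way to see the composition step, assuming an illustrative input of "e" followed by a combining acute accent (the input and variable names are assumptions for this sketch):

```wolfram
(* "e" (101) followed by U+0301 COMBINING ACUTE ACCENT (769) *)
decomposed = FromCharacterCode[{101, 769}];
composed = CharacterNormalize[decomposed, "NFC"];

(* The two code points are combined into the single precomposed character U+00E9 *)
ToCharacterCode[composed]
(* {233} *)

(* The string length drops from 2 to 1 *)
{StringLength[decomposed], StringLength[composed]}
(* {2, 1} *)
```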

Normalize the characters in the string using compatibility decomposition:

Characters with diacritics have been decomposed:
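
The two steps above can be sketched as follows, assuming the word "Ångström" as an illustrative input (not the original example string):

```wolfram
(* \[CapitalARing] is U+00C5 and \[ODoubleDot] is U+00F6, both precomposed *)
result = CharacterNormalize["\[CapitalARing]ngstr\[ODoubleDot]m", "NFKD"];

(* Each accented letter is now a base letter followed by a combining mark *)
ToCharacterCode[result]
(* {65, 778, 110, 103, 115, 116, 114, 111, 776, 109} *)
```

Here 778 is U+030A (combining ring above) and 776 is U+0308 (combining diaeresis).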

Scope (2)

Decompose a composite character into its constituents:

Ordering of the mark and the character has changed after normalization:
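
A hedged sketch of both effects, with inputs chosen here for illustration: the composite character U+0230 (Ȱ), and a sequence whose combining marks are deliberately given out of canonical order:

```wolfram
(* 560 is U+0230, LATIN CAPITAL LETTER O WITH DOT ABOVE AND MACRON *)
ToCharacterCode[CharacterNormalize[FromCharacterCode[560], "NFD"]]
(* {79, 775, 772}: "O", combining dot above, combining macron *)

(* "a" (97), then acute accent above (769), then dot below (803) *)
unordered = FromCharacterCode[{97, 769, 803}];

(* Canonical ordering places the mark below (U+0323) before the mark above (U+0301) *)
ToCharacterCode[CharacterNormalize[unordered, "NFD"]]
(* {97, 803, 769} *)
```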

Obtain the "Ohm" character from its code:

NFD maps characters to their canonical Unicode equivalents. Normalize the character using NFD:

Convert the output (omega) to its character code:
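
The three steps above can be sketched as follows. The code points come from the Unicode standard (U+2126 OHM SIGN and U+03A9 GREEK CAPITAL LETTER OMEGA); the variable names are illustrative:

```wolfram
(* Step 1: build the Ohm sign from its code, 8486 (U+2126) *)
ohm = FromCharacterCode[8486];

(* Step 2: normalize with NFD, which maps it to its canonical equivalent *)
omega = CharacterNormalize[ohm, "NFD"];

(* Step 3: the result is the Greek capital omega, 937 (U+03A9) *)
ToCharacterCode[omega]
(* {937} *)
```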

Generalizations & Extensions (1)

CharacterNormalize threads elementwise over lists:
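
A minimal sketch, assuming two illustrative decomposed strings as input. Each element of the list is normalized separately:

```wolfram
(* Two decomposed inputs: "e" + U+0301 and "A" + U+030A *)
inputs = {FromCharacterCode[{101, 769}], FromCharacterCode[{65, 778}]};

(* A list of strings gives a list of normalized strings *)
ToCharacterCode /@ CharacterNormalize[inputs, "NFC"]
(* {{233}, {197}} *)
```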

CharacterNormalize works on strings of different scripts and letters:
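
As an illustration, the sketch below assumes one sample character each from Hangul, Greek and Cyrillic (the choice of characters is an assumption, not the original example) and decomposes each with NFD:

```wolfram
(* U+D55C (Hangul syllable), U+03AC (Greek alpha with tonos), U+0439 (Cyrillic short i) *)
samples = FromCharacterCode /@ {{54620}, {940}, {1081}};

(* Normalize each sample and show the resulting code points *)
ToCharacterCode[CharacterNormalize[#, "NFD"]] & /@ samples
(* {{4370, 4449, 4523}, {945, 769}, {1080, 774}} *)
```

The Hangul syllable splits into three conjoining Jamo, while the Greek and Cyrillic letters each split into a base letter and a combining mark.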

Possible Issues (1)

Compatibility normalization may convert different forms of a character to a single common form:
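
A hedged sketch of this effect, assuming the two font variants of "H" from the Details table as input:

```wolfram
(* 8460 is U+210C (black-letter capital H); 8461 is U+210D (double-struck capital H) *)
variants = FromCharacterCode /@ {8460, 8461};

(* Both variants collapse to the plain letter "H", so the distinction is lost *)
CharacterNormalize[#, "NFKD"] & /@ variants
(* {"H", "H"} *)
```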

Compatibility normalization may remove formatting distinctions that canonical normalization preserves:
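
One way to compare the two behaviors, assuming "x" followed by a superscript two as an illustrative input:

```wolfram
(* "x" (120) followed by U+00B2 SUPERSCRIPT TWO (178) *)
squared = FromCharacterCode[{120, 178}];

(* Canonical normalization leaves the superscript intact *)
ToCharacterCode[CharacterNormalize[squared, "NFC"]]
(* {120, 178} *)

(* Compatibility normalization replaces it with the plain digit "2" *)
ToCharacterCode[CharacterNormalize[squared, "NFKC"]]
(* {120, 50} *)
```

After "NFKC" the string reads "x2", so the exponent can no longer be told apart from an ordinary digit.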
