ANSEL and Diacritics

ANSEL is a character encoding commonly used for GEDCOM data. It is a very old standard that has since been superseded by Unicode (which is what Java uses internally for strings - technically, UTF-16). Even so, many genealogy programs still use ANSEL as their primary encoding when writing GEDCOM files, although this appears to be shifting slowly toward UTF-8.

gedcom4j supports the ASCII, ANSEL, Unicode, and UTF-8 encodings, which are the encodings permitted by the GEDCOM 5.5 and 5.5.1 standards (5.5.1 added UTF-8 support). gedcom4j can therefore read and write files in any of these encodings, and can be used to convert between them. Internally, however, all string data is held as Java strings, which are UTF-16.

This poses some interesting problems when ANSEL is used to hold extended characters, particularly those with diacritics, which are common in languages other than English. In ANSEL, diacritics are always in what is called "combining" form - that is, a letter with diacritics is written as a compound character of 2 or 3 bytes. The last byte of the compound character is the glyph (aka letter) itself, and it is preceded by one or two combining diacritics. These diacritics take up no horizontal space when rendered, which is why ANSEL refers to them as "non-spacing" characters. The horizontal spacing comes from the final byte, which is the glyph itself. Thus, when rendered, the diacritics and the glyph are overlaid on top of each other.
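To make the byte ordering concrete, here is a minimal sketch of turning ANSEL bytes into a Java string in which each combining mark follows the letter it modifies. It is not gedcom4j's actual decoder: the class name `AnselCombiningSketch`, the `decode` method, and the deliberately partial mapping table are assumptions made for illustration, and only ASCII base letters are handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of gedcom4j.
final class AnselCombiningSketch {

    // Partial table of ANSEL non-spacing bytes and their Unicode combining marks.
    // A real decoder needs the complete ANSEL table.
    private static final Map<Integer, Character> COMBINING = new HashMap<>();
    static {
        COMBINING.put(0xE1, '\u0300'); // grave accent
        COMBINING.put(0xE2, '\u0301'); // acute accent
        COMBINING.put(0xE3, '\u0302'); // circumflex
        COMBINING.put(0xE4, '\u0303'); // tilde
        COMBINING.put(0xE8, '\u0308'); // umlaut / diaeresis
    }

    // Decodes ANSEL bytes, moving each run of diacritics from before the
    // base letter (ANSEL order) to after it (Unicode order).
    public static String decode(byte[] ansel) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // diacritics awaiting a base letter
        for (byte raw : ansel) {
            int b = raw & 0xFF; // treat the byte as unsigned
            Character mark = COMBINING.get(b);
            if (mark != null) {
                pending.append(mark);
            } else if (b < 0x80) {
                out.append((char) b);  // ASCII base letter
                out.append(pending);   // diacritics now follow the letter
                pending.setLength(0);
            } else {
                // Simplification: other extended ANSEL bytes are out of scope here.
                throw new IllegalArgumentException(
                        "Byte not covered by this sketch: 0x" + Integer.toHexString(b));
            }
        }
        // Diacritics with no base letter after them are silently dropped in this sketch.
        return out.toString();
    }

    public static void main(String[] args) {
        // ANSEL bytes for "e with acute": the diacritic byte comes first.
        byte[] eAcute = {(byte) 0xE2, (byte) 'e'};
        String decoded = decode(eAcute);
        // Print the code points so the reordering is visible.
        decoded.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

The result is still in decomposed form; composing it into a single pre-combined character is a separate normalization step.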

In Unicode (and by extension, Java), diacritics are handled a little differently. First, combining diacritic characters exist in Unicode, but they follow the glyph being modified rather than preceding it, as in ANSEL. More importantly, many (but not all) of the diacritic+letter combinations have pre-combined forms in Unicode as a single glyph, and these are considered canonically equivalent to the decomposed form. gedcom4j uses these pre-combined glyphs, also known as the NFC ("Normalization Form Canonical Composition") form, whenever possible in the string data it reads from GEDCOM files. When writing ANSEL data back out, however, the combined glyphs are first decomposed to the NFD ("Normalization Form Canonical Decomposition") form and then written as ANSEL bytes, as the encoding scheme requires.
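The two normalization forms can be tried out with the JDK's standard `java.text.Normalizer` class. The sketch below only illustrates what NFC and NFD do to a string; it does not claim to show how gedcom4j performs the conversion internally, and the class name `NormalizationSketch` and its helper methods are made up for this example.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not part of gedcom4j.
final class NormalizationSketch {

    // Composes letter + combining mark sequences into pre-combined characters where they exist.
    public static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Splits pre-combined characters back into letter + combining mark sequences.
    public static String toDecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";            // 'e' followed by a combining acute accent
        String composed = toComposed(decomposed); // single pre-combined character U+00E9

        System.out.println(composed.length());               // 1
        System.out.println(toDecomposed(composed).length()); // 2

        // Not every combination has a pre-combined form: 'x' with an acute accent
        // stays as two characters even after NFC normalization.
        System.out.println(toComposed("x\u0301").length());  // 2
    }
}
```

Because the composed and decomposed forms are canonically equivalent, converting a string to NFC and back to NFD is lossless for text like this, which is what makes a read-as-NFC, write-as-NFD approach workable.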

Acknowledgement: Heiner Eichmann's site on ANSEL to Unicode conversion was very helpful here. My thanks to him for documenting what he did and some of the challenges involved.