GEDCOM Versions

Introduction

The last officially approved standard for GEDCOM is version 5.5, approved in 1995/96. In 1999, an unofficial draft standard, version 5.5.1, was released. Version 5.5.1 was never officially approved, but many tools use it.

This situation causes some confusion. Many tools write 5.5.1 structures but tell the user that they are using version 5.5. In addition, 5.5.1 is an unapproved draft that has nonetheless become a de facto standard, so it is ubiquitous without being formal.

In Nov 2012, with release 1.1.0, gedcom4j began supporting the 5.5.1 standard in addition to the 5.5 standard, because of several attractive features of the 5.5.1 standard:

  • UTF-8 support. This is a big plus; the lack of it was the source of most problems people had with earlier versions of gedcom4j.
  • Support for email addresses, fax numbers, and web URLs
  • Map coordinate support
  • Multi-line copyright notices
  • Status indicators on family/child relationships
  • Restriction notices on events for privacy
  • Support for religious affiliation on events
  • Support for romanized and phonetic variants on names of places and people
  • The new generic FACT tag

The 5.5.1 standard also made significant changes to the way multimedia files are referenced and stored. Most drastic is the removal of embedded/encoded multimedia files under 5.5.1, and the reorganization of the tag hierarchy for multimedia files.

Starting with release 1.1.0, gedcom4j supports both 5.5-compliant and 5.5.1-compliant GEDCOM files.

How gedcom4j supports both 5.5 and 5.5.1

gedcom4j has three major components - the data model, the parser, and the writer.

Data Model

The data model contains all the newer 5.5.1 components, but also retains all of the deprecated 5.5 components (in practice, only the embedded multimedia elements).

Some of the items in 5.5 only allow a single value, where 5.5.1 allows multiples. In these cases, multiples are always allowed in the data model, and a single value is just a degenerate case of multiple values.
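The "single value as a degenerate case" idea can be sketched as follows. This is a minimal illustration, not gedcom4j's actual model code: the class MultiValueSketch, its fileReferences field, and the fitsSingleValueSpec method are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model class; the names are illustrative, not gedcom4j's. */
class MultiValueSketch {

    /**
     * Always stored as a list. Data read from a 5.5 file fills at most one
     * entry; data read from a 5.5.1 file may fill several.
     */
    final List<String> fileReferences = new ArrayList<>();

    /** True when the data would also fit a spec that allows only one value. */
    boolean fitsSingleValueSpec() {
        return fileReferences.size() <= 1;
    }
}
```

Storing a list in every case means code that reads the model does not need to branch on the GEDCOM version; only a version check such as fitsSingleValueSpec has to care.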

Parser

The parser can load UTF-8 files, as well as files in the other encodings supported by the 5.5 and 5.5.1 specs.

The parser takes a "forgiving" approach where it tries to load as much data as possible, including 5.5.1 data in a file that says it's in 5.5 format, and vice-versa. However, when it finds inconsistencies, it will add messages to the warnings and errors collections. Most of these messages indicate that the data was loaded, even though it was incorrect, and the data will need to be corrected before it can be written.

The parser assumes that if the GEDCOM version is explicitly specified in the file header, the rest of the data in the file should conform to that spec. For example, if the file header says the file is in 5.5 format (i.e., has a VERS tag with the value 5.5), the parser generates warnings if the new 5.5.1 tags (e.g., EMAIL) are encountered elsewhere, but loads the data anyway. If no version is specified, the 5.5.1 format is assumed as a default.

This approach was selected based on the presumption that most of the uses of gedcom4j will be to read GEDCOM files rather than to write them, so this provides that use case with the lowest friction.
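The version handling described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not gedcom4j's real parser: the class VersionCheckSketch, its methods, and the short tag list are assumptions made for the example. It reads the version from the GEDC block of the header, falls back to 5.5.1 when none is declared, and collects warnings (rather than failing) when 5.5.1-only tags appear in a file that declares 5.5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of "forgiving" version checking; not gedcom4j code. */
class VersionCheckSketch {

    /** A few tags introduced in 5.5.1 (illustrative subset, not the full list). */
    static final Set<String> TAGS_NEW_IN_551 = new HashSet<>(
            Arrays.asList("EMAIL", "FAX", "WWW", "MAP", "LATI", "LONG", "FACT"));

    /** Returns the version declared under HEAD/GEDC, or "5.5.1" if none is declared. */
    static String detectVersion(List<String> lines) {
        boolean inGedc = false;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 2) {
                continue;
            }
            if (parts[0].equals("1")) {
                // Track whether we are inside the GEDC block; SOUR also has a VERS child
                inGedc = parts[1].equals("GEDC");
            } else if (parts[0].equals("0") && !parts[1].equals("HEAD")) {
                break; // the header has ended
            } else if (inGedc && parts[0].equals("2") && parts[1].equals("VERS")
                    && parts.length == 3) {
                return parts[2];
            }
        }
        return "5.5.1"; // default when the header does not say
    }

    /** Collects warnings instead of rejecting the file; the data is still "loaded". */
    static List<String> collectWarnings(List<String> lines) {
        List<String> warnings = new ArrayList<>();
        if (!detectVersion(lines).equals("5.5")) {
            return warnings;
        }
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+", 3);
            if (parts.length >= 2 && TAGS_NEW_IN_551.contains(parts[1])) {
                warnings.add("Line " + (i + 1) + ": " + parts[1]
                        + " is a 5.5.1 tag but the header declares 5.5; data loaded anyway");
            }
        }
        return warnings;
    }
}
```

A caller would typically load the file first and then inspect the collected warnings, deciding for itself whether they matter for its use case.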

Writer

The writer is able to write in UTF-8 encoding as well as the others defined in the 5.5 and 5.5.1 standards.

The writer is far less forgiving than the parser. Where the parser would add a string to its warnings or errors collection and proceed in the case of inconsistent data, the writer throws an exception (most commonly a GedcomWriterVersionDataMismatchException, which is a subclass of GedcomWriterException) indicating that the data in the model does not conform to the spec. The writer uses the value specified in Gedcom.header.gedcomVersion to determine which spec to check against.

This approach was selected based on the idea that it is better to get an exception than to write a malformed file that does not conform to spec.
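The fail-fast behavior of the writer can be illustrated with a small self-contained sketch. None of the classes below are gedcom4j classes: SketchWriterException and SketchVersionMismatchException are stand-ins that only mirror the exception hierarchy described above, and SketchModel is a hypothetical, heavily reduced model.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a general writer exception (hypothetical, not the gedcom4j class). */
class SketchWriterException extends Exception {
    SketchWriterException(String message) {
        super(message);
    }
}

/** Stand-in for the more specific version/data mismatch exception. */
class SketchVersionMismatchException extends SketchWriterException {
    SketchVersionMismatchException(String message) {
        super(message);
    }
}

/** Heavily reduced, hypothetical model: a declared version plus some 5.5.1-only data. */
class SketchModel {
    String gedcomVersion = "5.5.1";
    final List<String> emails = new ArrayList<>();
}

class StrictWriterSketch {

    /** Checks the model against its declared version before any output is produced. */
    static void validate(SketchModel model) throws SketchWriterException {
        if ("5.5".equals(model.gedcomVersion) && !model.emails.isEmpty()) {
            throw new SketchVersionMismatchException(
                    "Model declares GEDCOM 5.5 but contains email addresses, "
                            + "which are only allowed in 5.5.1");
        }
    }
}
```

Calling code that rewrites files from other systems would catch the general exception type, then either correct the data or change the declared version before trying again.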

Multimedia

The differences in the multimedia specs between the two versions were among the most drastic and most difficult to deal with. Not only did version 5.5.1 do away with embedded multimedia support (i.e., the BLOB tag), it also changed cardinalities (multiple file references per MULTIMEDIA_RECORD in 5.5.1, where 5.5 only allowed one), and moved tags to become children of other tags (e.g., the FORM tag is now a child of the new FILE tag in 5.5.1).

Users who plan to read files produced by other systems and rewrite them with gedcom4j should pay special attention to the multimedia section, ensure that the data in the model is compliant with the version of GEDCOM being used, and make adjustments as needed.
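A pre-write check of multimedia data might look something like the sketch below. It is only an illustration of the two differences named above (embedded data and the number of file references); MultimediaSketch and MultimediaCheckSketch are hypothetical classes, not part of gedcom4j, and a real check would cover more cases, such as the position of the FORM tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, reduced multimedia object; not the gedcom4j model class. */
class MultimediaSketch {
    /** Embedded (BLOB) data lines, which only 5.5 supports. */
    final List<String> blobLines = new ArrayList<>();
    /** File references; 5.5.1 allows several, 5.5 only one. */
    final List<String> fileReferences = new ArrayList<>();
}

class MultimediaCheckSketch {

    /** Lists the reasons the object does not fit the given GEDCOM version. */
    static List<String> problems(MultimediaSketch m, String version) {
        List<String> problems = new ArrayList<>();
        if ("5.5.1".equals(version) && !m.blobLines.isEmpty()) {
            problems.add("Embedded (BLOB) data is not allowed in 5.5.1");
        }
        if ("5.5".equals(version) && m.fileReferences.size() > 1) {
            problems.add("5.5 allows only one file reference");
        }
        return problems;
    }
}
```

An empty result means the object passed these two checks for the chosen version; anything else is a list of adjustments to make before writing.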