I have long felt that Linked Data and its underlying RDF technologies are a good match for the complexities of publishing genealogical data and research on the Internet. A quick Google search suggests I'm not the only person to have thought this. And, indeed, this is much the approach taken in the GedcomX data model currently being developed by the Mormon-backed familysearch.org site, and, even though they have chosen not to explicitly refer to it as RDF, there's no escaping the fact that their data model is simply an RDF vocabulary. But, as the saying goes, the proof of the pudding is in the eating, and to that end I have decided to carry out an experiment. The experiment is simple enough. I intend to start my family history afresh with my paternal grandfather and trace his ancestry back a few generations, giving sources and documenting any non-trivial reasoning needed — basically doing what I'd consider to be good, thorough research.
The starting point is my grandfather, Lionel Vane Smith, and, for a nice easy start, I simply want to state that there is (or was) a person with that name. Here's how I've done it:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Lionel Vane Smith</foaf:name>
  </foaf:Person>
</rdf:RDF>
The first line is optional: it identifies the document as XML and says that I'm using the UTF-8 encoding of the Unicode character set. The next line indicates that the document contains RDF serialised in the RDF/XML format, and the third line says that I'm using the FOAF vocabulary. FOAF, which stands for “friend of a friend”, was originally designed as a way of encoding basic social information about people, but has become the standard base set of terminology for describing people. The next two lines contain the real content, which loosely translates as:
- There is a person called “Lionel Vane Smith”.
In fact, when broken down to the level of RDF statements, this is two separate statements:
- There is an entity which is a foaf:Person;
- that entity has the foaf:name “Lionel Vane Smith”.
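To make the two statements concrete, here is a minimal sketch in Python that reads the document above and lists them as subject, predicate, object triples. It is not a real RDF/XML parser: it assumes the document contains only typed nodes with literal-valued properties, and the blank-node label `_:b0` and the helper names are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Lionel Vane Smith</foaf:name>
  </foaf:Person>
</rdf:RDF>"""


def expand(tag):
    """Turn ElementTree's '{namespace}local' form into a full URI."""
    return tag[1:].replace("}", "", 1)


def triples(root):
    """Yield one rdf:type triple per node, plus one triple per child property.

    Assumption: nodes have no rdf:ID or rdf:about, so each becomes a blank node.
    """
    for n, node in enumerate(root):
        subject = f"_:b{n}"
        yield (subject, RDF_NS + "type", expand(node.tag))
        for prop in node:
            yield (subject, expand(prop.tag), prop.text)


root = ET.fromstring(DOC.encode("utf-8"))
statements = list(triples(root))
for statement in statements:
    print(statement)
```

The first triple printed is the "is a foaf:Person" statement and the second is the name; both share the same anonymous subject.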
We might question whether this is accurate. My grandfather died decades ago: is it correct to say that he is a person? Fortunately, this is dealt with in the documentation of the foaf:Person class, which says “We don't nitpic about whether they're alive, dead, real, or imaginary”. So far, so good. We might also question whether it's sensible to rely on the third-party FOAF vocabulary instead of defining our own person class. My view is that when a standard vocabulary exists we should use it, as doing so makes our data more readily usable with other existing data and tools. In this particular case, the FOAF vocabulary is one of the most widely used vocabularies there is. But in any case, this is not an issue I want to consider too deeply at the moment. None of the following will change substantively if I use my own genmine:Person class instead of the FOAF one.
Before we get into any real genealogy, I want to flesh out this example a bit more.
Referencing people & versioning
First of all, I want to introduce some means of identifying the person in that document so that I can reference him, either elsewhere in the same document or in subsequent documents. The subject of a statement in RDF can either be anonymous (a so-called blank node) or it can be identified by a URL. One common convention is to use URLs that include fragment identifiers (that is, a # followed by something) when naming abstract or physical entities, in order to avoid confusion between the document that defines them and the entities themselves. I shall adopt that convention here:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xml:base="http://richard.genmine.com/rdf-intro/intro">
The URL I have chosen to represent my grandfather is http://richard.genmine.com/rdf-intro/intro#LVS, though that URL does not appear explicitly anywhere in the document: instead it is formed by resolving the rdf:ID attribute against the xml:base attribute. (I could have written the URL explicitly and placed it in an rdf:about attribute; however, I have found it useful to stick to the convention that I use rdf:ID for the primary definition of something and rdf:about for subsequent references. This, however, is simply my convention. So far as the resulting RDF statements are concerned, there is no difference.)
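The way the full URL is assembled can be sketched with Python's standard URL resolution. This is only an illustration of the rule, not part of my tooling; the function names are hypothetical.

```python
from urllib.parse import urljoin

# The xml:base value from the rdf:RDF element.
BASE = "http://richard.genmine.com/rdf-intro/intro"


def resolve_id(base, rdf_id):
    """An rdf:ID value names a fragment of the base URL."""
    return urljoin(base, "#" + rdf_id)


def resolve_ref(base, reference):
    """rdf:about and rdf:resource values are ordinary relative URL references."""
    return urljoin(base, reference)


print(resolve_id(BASE, "LVS"))
print(resolve_ref(BASE, "#LVS"))
```

Both calls give the same URL, which is why rdf:ID="LVS" and rdf:about="#LVS" identify the same entity.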
Although RDF uses URLs to identify entities, there is no requirement that the URLs resolve to anything; nevertheless, making them resolve is considered good practice. I have therefore arranged for the http://richard.genmine.com/rdf-intro/intro URL to redirect with an HTTP 303 “See Other” to the latest version of this RDF document. As this raises the issue of versioning and I'm already on my second version of the document, I will add some basic information about the current version of the document:
<rdf:Description xmlns:dcterms="http://purl.org/dc/terms/"
                 rdf:about="intro-02.rdf">
  <dcterms:issued>2012-09-03</dcterms:issued>
  <dcterms:isVersionOf rdf:resource="intro" />
  <dcterms:replaces rdf:resource="intro-01.rdf" />
</rdf:Description>
This says three things about intro-02.rdf:
- that it was written (or rather, issued) on 3rd Sept 2012;
- that it is a version (though not necessarily the most recent version) of the document found at the unversioned intro URL;
- and that it replaces the earlier intro-01.rdf that I developed in the previous section.
The next task is to add some more basic information about my grandfather. I shall say that his foaf:familyName is Smith and that his foaf:gender is “male”. I would also like to give his date of birth and year of death. FOAF doesn't have a way of doing this. (FOAF has a foaf:birthday property, but that is just a day and month, and is complemented by a foaf:age property. But it seems utterly misleading to add a statement saying that he is 109, the age he would be at the time of writing were he alive.) Instead, I shall use the BIO vocabulary. This adds a layer of indirection. Instead of saying he was born on 9 Aug 1903, I say that he had a birth, and that event occurred on 9 Aug 1903. By adding the concept of an event, I could also add the place of his birth to the event without requiring a separate property for the place of the birth. This also ties in well with the established GEDCOM concept of events.
<foaf:Person rdf:ID="LVS">
  <foaf:name>Lionel Vane Smith</foaf:name>
  <foaf:familyName>Smith</foaf:familyName>
  <foaf:gender>male</foaf:gender>
  <bio:birth rdf:parseType="Resource">
    <bio:date>1903-08-09</bio:date>
  </bio:birth>
  <bio:death rdf:parseType="Resource">
    <bio:date>1980</bio:date>
  </bio:death>
</foaf:Person>
It's interesting to compare this to the corresponding data encoded in GEDCOM: other than the need to close elements in XML, there is a one-to-one correspondence between lines in the RDF and in the GEDCOM.
0 @LVS@ INDI
1 NAME Lionel Vane /Smith/
2 SURN Smith
1 SEX M
1 BIRT
2 DATE 9 AUG 1903
1 DEAT
2 DATE 1980
The rdf:parseType="Resource" attributes are worth mentioning in passing as they mark an abbreviation in the RDF. Written in full, the bio:birth property would read as follows.
<foaf:Person rdf:ID="LVS">
  <bio:birth>
    <bio:Birth>
      <bio:date>1903-08-09</bio:date>
    </bio:Birth>
  </bio:birth>
</foaf:Person>
The outer bio:birth (with the lower-case “b”) is a property associating a person with a birth event, and the inner bio:Birth denotes the event itself. The fact that the event implied by rdf:parseType="Resource" is specifically a bio:Birth event can be inferred from the machine-readable definition of the bio:birth property. There are further techniques available for abbreviating the RDF/XML, for example by putting foaf:gender="male" as an attribute on the foaf:Person. Such things are purely cosmetic and have no bearing on the underlying data being conveyed.
The dates are all encoded in the W3C profile of ISO 8601, with the year first, then optionally the month, and then optionally the day of the month. This has the advantage that sorting the textual representations of the dates also sorts them chronologically. It also avoids any ambiguity as to whether 9/8/1903 should be interpreted in the European way as the 9th of August, or the American way as September the 8th. Only a year is given for the date of death, but this is also valid in the W3C ISO 8601 profile.
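As a rough illustration of both points, the following Python sketch converts simple GEDCOM dates into this format and then sorts the results as plain strings. It is an assumption-laden sketch: the function name is hypothetical, and it handles only the plain "day month year", "month year" and "year" forms, not GEDCOM qualifiers such as ABT or BET.

```python
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def gedcom_to_iso(value):
    """Convert '9 AUG 1903', 'AUG 1903' or '1980' to YYYY[-MM[-DD]]."""
    parts = value.split()
    year = int(parts[-1])
    if len(parts) == 1:
        return f"{year:04d}"
    month = MONTHS[parts[-2].upper()]
    if len(parts) == 2:
        return f"{year:04d}-{month:02d}"
    day = int(parts[0])
    return f"{year:04d}-{month:02d}-{day:02d}"


birth = gedcom_to_iso("9 AUG 1903")
death = gedcom_to_iso("1980")

# A plain string sort puts the dates in chronological order,
# even though they are given to different precisions.
print(sorted([death, birth]))
```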
Genealogical data is virtually worthless if there is no indication of where it came from, and at present my document contains no such information. I shall start by saying who wrote the document, and as I anticipate being the author of several files, I shall encapsulate the information about me in a new file, me-01.rdf. The important part of this file is the following:
<foaf:Person rdf:ID="RAS" foaf:name="Richard Smith" foaf:mbox_sha1sum="dfb0ce37eb9695fc8053faa6fa59d5c7bbb91a91" />
This says that I am a foaf:Person, and I assign myself the URL http://richard.genmine.com/rdf-intro/me#RAS. I state that my name is “Richard Smith”, which I've done using the short-hand attribute syntax. As mentioned earlier, this is entirely equivalent to a <foaf:name>Richard Smith</foaf:name> element.
The other RDF statement above is perhaps more interesting. It gives the SHA-1 hash of my email address. Why would I do that? The foaf:mbox_sha1sum property is declared to be an owl:InverseFunctionalProperty. This means that if two people have the same mailbox hash, then they can be inferred to be the same person. So if I write another document and assign myself a different URL there, a computer program can still tell that they are (or claim to be) written by the same person. I could have used my email address for this, but putting my email address in plain text on the Internet is inviting spam.
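For the curious, the hash can be computed along the following lines. FOAF defines foaf:mbox_sha1sum as the SHA-1 of the mailbox URI, including the mailto: prefix. The address below is a placeholder rather than my real one, the second URL is made up, and the `same_person` helper is only a hypothetical illustration of the inverse-functional inference.

```python
import hashlib


def mbox_sha1sum(email):
    """SHA-1 hex digest of the mailbox URI, i.e. 'mailto:' plus the address."""
    return hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()


def same_person(a, b):
    """Two descriptions sharing a mailbox hash can be taken to describe one person."""
    return a["mbox_sha1sum"] == b["mbox_sha1sum"]


# Placeholder address, assumed for illustration only.
digest = mbox_sha1sum("someone@example.org")

here = {"url": "http://richard.genmine.com/rdf-intro/me#RAS", "mbox_sha1sum": digest}
elsewhere = {"url": "http://example.org/other#me", "mbox_sha1sum": digest}

print(same_person(here, elsewhere))
```

The two descriptions use different URLs, but the shared hash is enough for a program to treat them as the same person.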
Now that I have a brief description of myself, I can update the intro document to reference myself as the author. I shall also add an RDF statement giving a short description in English of the document's contents. This is where I explain that the information on my grandfather was told to me by my father.
<rdf:Description rdf:about="">
  <dcterms:creator rdf:resource="me#RAS" />
  <dcterms:description xml:lang="en">
    Information about my grandfather, as related verbally by my father.
  </dcterms:description>
  <dcterms:source>
    <foaf:Person foaf:gender="male">
      <rel:parentOf rdf:resource="me#RAS" />
      <rel:childOf rdf:resource="#LVS" />
    </foaf:Person>
  </dcterms:source>
</rdf:Description>
But this is an RDF document, and I would like my statements to be machine readable so far as possible. The dcterms:source property is an attempt to convert my textual description into something machine readable. Frequently the dcterms:source of a document will be another document, but that's not a requirement and here I have chosen to make the source a person. I have chosen not to name that person or assign him an identifier, but I have used the Relationship vocabulary to state that the source is my father (or rather, a male parent; the vocabulary considers “father” and “mother” to be redundant terms, and just provides a single rel:parentOf), and that he is the child of my grandfather. This finally establishes a link, albeit an indirect one, between my foaf:Person and my grandfather's.
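To show how a program might follow that indirect link, here is a small Python sketch over the two relationship statements. The triples are written by hand with abbreviated names, `_:father` is simply a label I have chosen for the anonymous person, and the function is hypothetical rather than part of any RDF library.

```python
# The two relationship statements about the anonymous source, as triples.
TRIPLES = [
    ("_:father", "rel:parentOf", "me#RAS"),
    ("_:father", "rel:childOf", "intro#LVS"),
]


def grandparents_of(person, triples):
    """Follow parentOf backwards, then childOf forwards, through any intermediary."""
    parents = {s for s, p, o in triples if p == "rel:parentOf" and o == person}
    return {o for s, p, o in triples if p == "rel:childOf" and s in parents}


print(grandparents_of("me#RAS", TRIPLES))
```

Note that the intermediary never needs a URL of his own: the blank node is enough to connect the two statements.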
Although I have stated that I wrote these RDF documents, a reader must take this on trust. But what if this is not the case? What if someone altered my document, perhaps with the best of intentions, or perhaps with malicious intent? I don't wish to labour this point at this stage, but it is easy to apply the techniques of public-key cryptography to provide a cryptographic signature for every document, which makes it easy to detect such changes. I do this here by adding a link to an external GPG signature:
<rdf:Description rdf:about="intro-04.rdf">
  <wot:assurance rdf:resource="intro-04.rdf.asc" />
</rdf:Description>
The signature file itself is a detached, ASCII-armoured signature, generated with:
gpg --armour --detach-sign intro-04.rdf
The process of updating a signed file, such as intro.rdf, is fairly tedious. Not only do I have to regenerate the GPG signature, but before doing so, I have to update the wot:assurance link, the various references to the versioned file name (e.g. intro-04.rdf), and the dcterms:replaces link to the previous version. This whole process can be automated, and any genealogical application that used RDF as its publication format would do this internally. To that end, I have written a script that adds the following block to the RDF, and uploads it to my web server.
<rdf:Description rdf:about="intro-2012101401.rdf">
  <dcterms:issued>2012-10-14</dcterms:issued>
  <dcterms:isVersionOf rdf:resource="intro"/>
  <dcterms:replaces rdf:resource="intro-04.rdf"/>
  <dcterms:creator rdf:resource="me#RAS"/>
  <wot:assurance xmlns:wot="http://xmlns.com/wot/0.1/"
                 rdf:resource="intro-2012101401.rdf.asc"/>
</rdf:Description>
As can be seen, I have also taken the opportunity to change the naming scheme so that my latest version of intro.rdf is called intro-2012101401.rdf. The first eight digits are the date, 2012-10-14, and the final two digits indicate that it's the first version of the day. The script also ensures that the unversioned intro URL redirects to the newest version. I use Apache as my web server, and I generate the redirect by adding the following rewrite rule to an Apache configuration file:
RewriteRule ^intro$ intro-2012101401.rdf [R=303]
This, however, is an implementation detail of how I've configured my web server.
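The naming scheme and the redirect are simple enough to sketch in a few lines of Python. This is not my actual script: the function names are hypothetical, and the sketch assumes fewer than a hundred versions per day and that the names of previously published files are passed in as a list.

```python
import re
from datetime import date


def next_version_name(stem, day, existing):
    """Return e.g. 'intro-2012101401.rdf': stem, date, then a two-digit counter."""
    prefix = f"{stem}-{day:%Y%m%d}"
    pattern = re.compile(re.escape(prefix) + r"(\d{2})\.rdf$")
    used = []
    for name in existing:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    number = max(used, default=0) + 1
    return f"{prefix}{number:02d}.rdf"


def rewrite_rule(stem, versioned):
    """Build the Apache rule that sends the unversioned URL to the newest file."""
    return f"RewriteRule ^{stem}$ {versioned} [R=303]"


newest = next_version_name("intro", date(2012, 10, 14), ["intro-04.rdf"])
print(newest)
print(rewrite_rule("intro", newest))
```

Older files that use the previous numbering, such as intro-04.rdf, are ignored when working out the counter, so the first file published under the new scheme on a given day always ends in 01.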