Using RDF
I have long felt that Linked Data and its underlying RDF technologies are a good match for the complexities of publishing genealogical data and research on the Internet. A quick Google search suggests I'm not the only person to have thought this. And, indeed, this is much the approach taken in the GedcomX data model currently being developed by the Mormon-backed familysearch.org site, and, even though they have chosen not to explicitly refer to it as RDF, there's no escaping the fact that their data model is simply an RDF vocabulary. But, as the saying goes, the proof of the pudding is in the eating, and to that end I have decided to carry out an experiment. The experiment is simple enough. I intend to start my family history afresh with my paternal grandfather and trace his ancestry back a few generations, giving sources and documenting any non-trivial reasoning needed — basically doing what I'd consider to be good, thorough research.
Starting out
The starting point is my grandfather, Lionel Vane Smith, and, for a nice easy start, I simply want to state that there is (or was) a person with that name. Here's how I've done it:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Lionel Vane Smith</foaf:name>
  </foaf:Person>
</rdf:RDF>
The first line is optional: it identifies the document as XML and says that I'm using the UTF-8 encoding of the Unicode character set. The next line identifies the document as containing RDF serialised in the RDF/XML format, and the third line says that I'm using the FOAF vocabulary. FOAF, which stands for “friend of a friend”, was originally designed as a way of encoding basic social information about people, but has become the standard base set of terminology for describing people. The next two lines contain the real content, which loosely translates as:
- There is a person called “Lionel Vane Smith”.
In fact, when broken down to the level of RDF statements, this is two separate statements:
- There is an entity which is a foaf:Person; and
- that entity has the foaf:name “Lionel Vane Smith”.
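To make the two-statement reading concrete, here is a minimal sketch in Python that walks the RDF/XML with the standard library's XML parser and prints one tuple per statement. It is not a real RDF parser: it assumes every top-level element is a typed node whose children are plain literal properties, and the blank-node label `_:b0` is an arbitrary name of my own choosing. The optional XML declaration is left out.

```python
import xml.etree.ElementTree as ET

# The document from above, minus the optional XML declaration.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Lionel Vane Smith</foaf:name>
  </foaf:Person>
</rdf:RDF>"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand(tag):
    """Turn ElementTree's '{namespace}local' form into a plain URL."""
    return tag[1:].replace("}", "", 1)


def statements(rdf_xml):
    """Yield (subject, predicate, object) tuples.

    Simplification: only typed nodes with literal-valued child
    properties are handled, which is all this document contains.
    """
    root = ET.fromstring(rdf_xml)
    for n, node in enumerate(root):
        subject = f"_:b{n}"  # anonymous subject, i.e. a blank node
        yield (subject, RDF_TYPE, expand(node.tag))
        for prop in node:
            yield (subject, expand(prop.tag), prop.text)


triples = list(statements(RDF_XML))
for triple in triples:
    print(triple)
```

The first tuple says that the entity is a foaf:Person, and the second gives its foaf:name.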
We might question whether this is accurate. My grandfather died decades ago: is it correct to say that he is a person? Fortunately, this is dealt with in the documentation of the foaf:Person class, which says “We don't nitpic about whether they're alive, dead, real, or imaginary”. So far, so good. We might also question whether it's sensible to rely on the third-party FOAF vocabulary instead of defining our own person class. My view is that when a standard vocabulary exists we should use it, as doing so makes our data more readily usable with other existing data and tools. In this particular case, the FOAF vocabulary is one of the most widely used vocabularies there is. But in any case, this is not an issue I want to consider too deeply at the moment. None of the following will change substantively if I use my own genmine:Person class instead of the FOAF one.
Before we get into any real genealogy, I want to flesh out this example a bit more.
Referencing people & versioning
First of all, I want to introduce some means of identifying the person in that document so that I can reference him, either elsewhere in the same document or in subsequent documents. The subject of a statement in RDF can either be anonymous (a so-called blank node) or it can be identified by a URL. One common convention is to use URLs that include fragment identifiers (that is, a # followed by something) when naming abstract or physical entities, in order to avoid confusion between the document that defines them and the entities themselves. I shall adopt that convention here:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xml:base="http://richard.genmine.com/rdf-intro/intro">
The URL I have chosen to represent my grandfather is http://richard.genmine.com/rdf-intro/intro#LVS, though that URL does not appear explicitly anywhere in the document: instead it is formed from the xml:base and rdf:ID attributes. (I could have written the URL explicitly and placed it in an rdf:about attribute; however, I have found it useful to stick to the convention that I use rdf:ID for the primary definition of something and rdf:about for subsequent references. This, however, is simply my convention. So far as the RDF is concerned, there is no difference.)
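The way the full URL is assembled can be sketched with ordinary URL resolution. The snippet below is an illustration and not part of any RDF toolkit: the helper names `from_rdf_id` and `from_rdf_about` are my own. An rdf:ID behaves like a fragment appended to the base, while a relative rdf:about or rdf:resource value is resolved against the base like any relative link.

```python
from urllib.parse import urljoin

BASE = "http://richard.genmine.com/rdf-intro/intro"  # the xml:base value


def from_rdf_id(base, rdf_id):
    """An rdf:ID is a fragment identifier relative to the base URL."""
    return urljoin(base, "#" + rdf_id)


def from_rdf_about(base, reference):
    """rdf:about and rdf:resource hold URL references, possibly relative."""
    return urljoin(base, reference)


print(from_rdf_id(BASE, "LVS"))
print(from_rdf_about(BASE, "#LVS"))          # same entity, referenced rather than defined
print(from_rdf_about(BASE, "intro-02.rdf"))  # a sibling document
```

The first two calls give the same URL, which is why the choice between rdf:ID and rdf:about makes no difference to the data.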
Although RDF uses URLs to identify entities, there is no requirement that the URLs resolve to anything; nevertheless, making them resolve is considered good practice. I have therefore arranged for the http://richard.genmine.com/rdf-intro/intro URL to redirect with an HTTP 303 “See Other” to the latest version of this RDF document. As this raises the issue of versioning and I'm already on my second version of the document, I will add some basic information about the current version of the document:
<rdf:Description xmlns:dcterms="http://purl.org/dc/terms/"
                 rdf:about="intro-02.rdf">
  <dcterms:issued>2012-09-03</dcterms:issued>
  <dcterms:isVersionOf rdf:resource="intro" />
  <dcterms:replaces rdf:resource="intro-01.rdf" />
</rdf:Description>
This uses Dublin Core metadata terms to make three statements about intro-02.rdf (which is a URL relative to xml:base):
- that it was written (or rather, issued) on 3rd Sept 2012;
- that it is a version (though not necessarily the most recent version) of the document found at http://richard.genmine.com/rdf-intro/intro; and
- that it replaces the earlier intro-01.rdf that I developed in the previous section.
Biographical details
The next task is to add some more basic information about my grandfather. I shall say that his foaf:familyName is Smith and that his foaf:gender is “male”. I would also like to give his date of birth and year of death. FOAF doesn't have a way of doing this. (FOAF has a foaf:birthday property, but that is just a day and month, and is complemented by a foaf:age property. But it seems utterly misleading to add a statement saying that he is 109, the age he would be at the time of writing were he alive.) Instead, I shall use the BIO vocabulary. This adds a layer of indirection. Instead of saying he was born on 9 Aug 1903, I say that he had a birth, and that event occurred on 9 Aug 1903. By adding the concept of an event, I could also add the place of his birth to the event without requiring a separate property for the place of the birth. This also ties in well with the established GEDCOM concept of events.
<foaf:Person rdf:ID="LVS">
  <foaf:name>Lionel Vane Smith</foaf:name>
  <foaf:familyName>Smith</foaf:familyName>
  <foaf:gender>male</foaf:gender>
  <bio:birth rdf:parseType="Resource">
    <bio:date>1903-08-09</bio:date>
  </bio:birth>
  <bio:death rdf:parseType="Resource">
    <bio:date>1980</bio:date>
  </bio:death>
</foaf:Person>
It's interesting to compare this with the corresponding data encoded in GEDCOM: other than the need to close elements in XML, there is a one-to-one correspondence between lines in the RDF and in the GEDCOM.
0 @LVS@ INDI
1 NAME Lionel Vane /Smith/
2 SURN Smith
1 SEX M
1 BIRT
2 DATE 9 AUG 1903
1 DEAT
2 DATE 1980
The two rdf:parseType="Resource" attributes are worth mentioning in passing as they mark an abbreviation in the RDF. Written in full, the bio:birth property would read as follows.
<foaf:Person rdf:ID="LVS">
  <bio:birth>
    <bio:Birth>
      <bio:date>1903-08-09</bio:date>
    </bio:Birth>
  </bio:birth>
</foaf:Person>
The outer bio:birth (with the lower-case “b”) is a property associating a person with a birth event, and the inner bio:Birth denotes the event itself. The fact that the event implied by rdf:parseType="Resource" is specifically a bio:Birth event can be inferred from the machine-readable definition of the bio:birth property. There are further techniques available for abbreviating the RDF/XML, for example by putting foaf:gender="male" as an attribute on the foaf:Person element. Such things are purely cosmetic and have no bearing on the underlying data being conveyed.
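As a rough check that the abbreviated and full forms carry the same birth date, the following Python sketch reads the date out of both. It works at the XML level only and does no RDF inference, so it does not show the bio:Birth type being inferred. The BIO namespace URL is an assumption on my part, since the excerpts above do not show the xmlns:bio declaration, and `birth_date` is a helper name of my own.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "bio": "http://purl.org/vocab/bio/0.1/",  # assumed BIO namespace URL
}
HEADER = "<rdf:RDF " + " ".join(f'xmlns:{p}="{u}"' for p, u in NS.items()) + ">"

ABBREVIATED = HEADER + """
  <foaf:Person rdf:ID="LVS">
    <bio:birth rdf:parseType="Resource">
      <bio:date>1903-08-09</bio:date>
    </bio:birth>
  </foaf:Person>
</rdf:RDF>"""

FULL = HEADER + """
  <foaf:Person rdf:ID="LVS">
    <bio:birth>
      <bio:Birth>
        <bio:date>1903-08-09</bio:date>
      </bio:Birth>
    </bio:birth>
  </foaf:Person>
</rdf:RDF>"""


def birth_date(rdf_xml):
    """Return the bio:date of the first person's birth event."""
    root = ET.fromstring(rdf_xml)
    birth = root.find("foaf:Person/bio:birth", NS)
    # With parseType="Resource" the date is a direct child of the property;
    # in the full form it sits one level down, inside the event node.
    date = birth.find("bio:date", NS)
    if date is None:
        date = birth.find("*/bio:date", NS)
    return date.text


print(birth_date(ABBREVIATED), birth_date(FULL))
```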
The dates are all encoded in the W3C profile of ISO 8601, with the year first, then optionally the month, and then optionally the day of the month. This has the advantage that sorting the textual representations of the dates also sorts them chronologically. It also avoids any ambiguity as to whether 9/8/1903 should be interpreted in the European way as the 9th of August, or the American way as September the 8th. Only a year is given for the date of death, but this is also valid in the W3C ISO 8601 profile.
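The relationship between GEDCOM dates and the W3C profile, and the sorting property, can be sketched in a few lines of Python. This is a minimal illustration that only handles the plain “D MON YYYY”, “MON YYYY” and “YYYY” forms. GEDCOM qualifiers such as ABT or BEF are not handled, and the “SEP 1903” value is a made-up example, not a date from my research.

```python
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def gedcom_to_w3c(date):
    """Convert 'D MON YYYY', 'MON YYYY' or 'YYYY' to the W3C ISO 8601 profile."""
    parts = date.split()
    year = int(parts[-1])
    if len(parts) == 1:
        return f"{year:04d}"
    month = MONTHS[parts[-2].upper()]
    if len(parts) == 2:
        return f"{year:04d}-{month:02d}"
    day = int(parts[0])
    return f"{year:04d}-{month:02d}-{day:02d}"


# "SEP 1903" is a hypothetical value, included to show month precision.
dates = [gedcom_to_w3c(d) for d in ["1980", "9 AUG 1903", "SEP 1903"]]
print(sorted(dates))  # plain string sorting gives chronological order
```

Because the fields are zero-padded and ordered from most to least significant, ordinary string comparison is enough; no date parsing is needed to sort.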
Provenance
Genealogical data is virtually worthless if there is no indication of where it came from, and at present my document contains no such information. I shall start by saying who wrote the document, and as I anticipate being the author of several files, I shall encapsulate the information about me in a new file, me-01.rdf. The important part of this file is the foaf:Person definition:
<foaf:Person rdf:ID="RAS" foaf:name="Richard Smith" foaf:mbox_sha1sum="dfb0ce37eb9695fc8053faa6fa59d5c7bbb91a91" />
This says that I am a foaf:Person, and I assign myself the URL http://richard.genmine.com/rdf-intro/me#RAS. I state that my name is “Richard Smith”, which I've done using the short-hand attribute syntax. As mentioned earlier, this is entirely equivalent to a <foaf:name>Richard Smith</foaf:name> element.
The other RDF statement above is perhaps more interesting. It gives the SHA-1 hash of my email address. Why would I do that? The foaf:mbox_sha1sum property is declared to be an owl:InverseFunctionalProperty. This means that if two people have the same mailbox hash, then they can be inferred to be the same person. So if I write another document and assign myself a different URL there, a computer program can still tell that they are (or claim to be) written by the same person. I could have used my email address for this, but putting my email address in plain text on the Internet is inviting spam.
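Here is a small Python sketch of how such a hash can be computed and used to match two descriptions. FOAF defines the value as the SHA-1 of the mailbox URI, including the mailto: prefix. The address someone@example.org and the second URL are placeholders, not my real details, and `same_person` is a deliberately naive stand-in for what an OWL reasoner would do.

```python
import hashlib


def mbox_sha1sum(email):
    """SHA-1 of the mailbox's mailto: URI, as a 40-character hex string."""
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


def same_person(a, b):
    """Inverse-functional matching: equal hashes imply the same person."""
    return a["mbox_sha1sum"] == b["mbox_sha1sum"]


# Two descriptions under different URLs; the address is a placeholder.
first = {
    "url": "http://richard.genmine.com/rdf-intro/me#RAS",
    "mbox_sha1sum": mbox_sha1sum("someone@example.org"),
}
second = {
    "url": "http://example.org/elsewhere#me",  # hypothetical second URL
    "mbox_sha1sum": mbox_sha1sum("someone@example.org"),
}

print(same_person(first, second))
```

A hash only hides the address from casual harvesting: anyone who already knows or guesses the address can compute the same hash and confirm it.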
Now that I have a brief description of myself, I can update the intro document to reference myself as the author. I shall also add an RDF statement giving a short description in English of the document's contents. This is where I explain that the information on my grandfather was told to me by my father.
<rdf:Description rdf:about="">
  <dcterms:creator rdf:resource="me#RAS" />
  <dcterms:description xml:lang="en">
    Information about my grandfather, as related verbally by my father.
  </dcterms:description>
  <dcterms:source>
    <foaf:Person foaf:gender="male">
      <rel:parentOf rdf:resource="me#RAS" />
      <rel:childOf rdf:resource="#LVS" />
    </foaf:Person>
  </dcterms:source>
</rdf:Description>
But this is an RDF document, and I would like my statements to be machine readable so far as possible. The dcterms:source property is an attempt to convert my textual description into something machine readable. Frequently the dcterms:source of a document will be another document, but that's not a requirement and here I have chosen to make the source a person. I have chosen not to name that person or assign him an identifier, but I have used the Relationship vocabulary to state that the source is my father (or rather, a male parent: the vocabulary considers “father” and “mother” to be redundant terms, and just provides a single rel:parentOf), and that he is the child of my grandfather. This finally establishes a link, albeit an indirect one, between my foaf:Person and my grandfather's.
Although I have stated that I wrote these RDF documents, a reader must take this on trust. But what if this is not the case? What if someone altered my document, perhaps with the best of intentions, or perhaps with malicious intent? I don't wish to labour this point at this stage, but it is easy to apply the techniques of public-key cryptography to provide a cryptographic signature for every document, which makes it easy to detect such changes. I do this here by adding a link to an external GPG signature:
<rdf:Description rdf:about="intro-04.rdf">
  <wot:assurance rdf:resource="intro-04.rdf.asc" />
</rdf:Description>
I have also produced an updated me-02.rdf which has been similarly signed, and that links to a copy of my public key. The signatures can be generated trivially with the command:
gpg --armour --detach-sign intro-04.rdf
Automating updates
The process of updating a signed file, such as intro.rdf, is fairly tedious. Not only do I have to regenerate the GPG signature, but before doing so, I have to update the wot:assurance link, the various references to the versioned file name (e.g. intro-04.rdf), and the dcterms:replaces link to the previous version. This whole process can be automated, and any genealogical application that used RDF as its publication format would do this internally. To that end, I have written a script that adds the following block to the RDF, and uploads it to my web server.
<rdf:Description rdf:about="intro-2012101401.rdf">
  <dcterms:issued>2012-10-14</dcterms:issued>
  <dcterms:isVersionOf rdf:resource="intro"/>
  <dcterms:replaces rdf:resource="intro-04.rdf"/>
  <dcterms:creator rdf:resource="me#RAS"/>
  <wot:assurance xmlns:wot="http://xmlns.com/wot/0.1/"
                 rdf:resource="intro-2012101401.rdf.asc"/>
</rdf:Description>
As can be seen, I have also taken the opportunity to change the naming scheme so that my latest version of intro.rdf is called intro-2012101401.rdf. The first eight digits are the date, 2012-10-14, and the next two digits indicate that it's the first version of the day. The script also ensures that the unversioned URL redirects to the newest version. I use Apache as my web server, and I generate the redirect by adding the following rewrite rule to an .htaccess file.
RewriteRule ^intro$ intro-2012101401.rdf [R=303]
This, however, is an implementation detail of how I've configured my web server.
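For illustration, the naming scheme and the matching rewrite rule could be generated along the following lines. This is a sketch of the idea and not the script I actually use: the function names are hypothetical, it does not sign or upload anything, and it assumes no more than 99 versions are issued on any one day.

```python
import re
from datetime import date


def next_version(stem, existing, today):
    """Return the next 'stem-YYYYMMDDNN.rdf' name, given names already published."""
    prefix = f"{stem}-{today:%Y%m%d}"
    pattern = re.compile(re.escape(prefix) + r"(\d\d)\.rdf")
    # Collect the two-digit serial numbers already used today.
    used = [int(m.group(1)) for name in existing if (m := pattern.fullmatch(name))]
    return f"{prefix}{max(used, default=0) + 1:02d}.rdf"


def rewrite_rule(stem, versioned):
    """Apache rule redirecting the unversioned URL to the newest version."""
    return f"RewriteRule ^{stem}$ {versioned} [R=303]"


published = ["intro-01.rdf", "intro-02.rdf", "intro-04.rdf"]
newest = next_version("intro", published, date(2012, 10, 14))
print(newest)
print(rewrite_rule("intro", newest))
```

Older names such as intro-04.rdf do not match the dated pattern, so they are simply ignored when choosing the serial number.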