This document collects some emerging patterns for data modeling. If you are developing your own data model, you may benefit from reading the different solutions to the use cases and requirements discussed below.

Be semantic

A term should communicate its meaning; its usage should be clear from its name alone. For a membership, for example, the generic terms start_date and end_date make sense, because a membership starts and ends. In most cases, however, generic terms are not recommended, because they are either inappropriate or ambiguous.

For example, people and organizations do not start and end, so the generic terms are inappropriate; the terms birth_date, death_date, founding_date and dissolution_date are preferred.

For a political division, the term start_date may refer to either the date on which the division was created or the date on which the division came into force, and is therefore ambiguous; the terms creation_date and coming_into_force_date are unambiguous.

After choosing the properties of a class, you may realize that the term for the class is inappropriate. For example, legislative texts are available in multiple formats, like HTML and PDF. If you only need a url property for each formatted document, the term links may seem appropriate to describe the relation between the text and its formatted documents, e.g.:

{
  "title": "A bill to amend the Income Tax Act",
  "text": "Lorem ipsum dolor sit amet ...",
  "links": [
    {
      "url": "http://example.com/legislation/123.html"
    },
    {
      "url": "http://example.com/legislation/123.pdf"
    }
  ]
}

In response to new use cases, you add a content_type property for each formatted document, whose value is text/html or application/pdf. However, a link does not have a content type; only the linked resource has a content type. Therefore, a term like distributions would be more appropriate than the term links to describe the relation, e.g.:

{
  "title": "A bill to amend the Income Tax Act",
  "text": "Lorem ipsum dolor sit amet ...",
  "distributions": [
    {
      "url": "http://example.com/legislation/123.html",
      "content_type": "text/html"
    },
    {
      "url": "http://example.com/legislation/123.pdf",
      "content_type": "application/pdf"
    }
  ]
}

Be concise

Use fewer terms where possible. For example, an event may be in one of several states: tentative, confirmed or cancelled.[1]

If the states are disjoint – i.e. if it’s impossible to be in two states at once – you only need a single status property whose possible values are tentative, confirmed or cancelled:

{
  "summary": "Open Data Day",
  "start_date": "2013-02-23",
  "end_date": "2013-02-24",
  "status": "confirmed"
}

It would be verbose to instead have a boolean property for each state, i.e. to have three properties tentative, confirmed and cancelled whose possible values are true or false:

{
  "summary": "Open Data Day",
  "start_date": "2013-02-23",
  "end_date": "2013-02-24",
  "tentative": false,
  "confirmed": true,
  "cancelled": false
}

Furthermore, if the states are disjoint, using a single status property ensures that an event is never both tentative and confirmed. With multiple boolean properties, an event can be invalid because more than one property is set to true.
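The same point can be sketched in Python. This is only an illustration: the names EventStatus, Event and booleans_are_valid are hypothetical and do not come from the iCalendar specification. A single enumerated status makes contradictory states unrepresentable, whereas the boolean form needs an explicit validity check.

```python
from dataclasses import dataclass
from enum import Enum


class EventStatus(Enum):
    """The three disjoint states used in the example above."""
    TENTATIVE = "tentative"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


@dataclass
class Event:
    summary: str
    status: EventStatus  # exactly one state at a time, by construction


def booleans_are_valid(record: dict) -> bool:
    """Check the verbose form: exactly one of the three flags may be true."""
    names = ("tentative", "confirmed", "cancelled")
    return sum(bool(record.get(name, False)) for name in names) == 1


event = Event(summary="Open Data Day", status=EventStatus("confirmed"))

# The boolean form can express a contradiction that the single property cannot.
contradictory = {"tentative": True, "confirmed": True, "cancelled": False}
```

Here booleans_are_valid(contradictory) is false, while no value of EventStatus can describe an event that is both tentative and confirmed.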

1. These event states are from the iCalendar specification.

Limit the number of classes

When creating an ontology, resist the urge to name all the things. For example, the MUNI Ontology used by Citizen DAN has 2669 classes and 121 properties. It includes 23 classes for musical styles, including Chinese and Japanese music. It would be easy to add another 200-plus classes to represent the musical style of each country in the world, at the expense of significantly increasing the complexity of the ontology and the time it takes to learn and adopt it. An approach that favors classes, like this one, obliges an ontology to continuously add new classes in order to have, for example, a comprehensive list of musical styles.

An alternative approach, which drastically simplifies an ontology, is to have a single property musical_style instead of 23 or more classes as above. The Organization ontology discusses each approach with respect to classifying organizations. It proposes two strategies: either create subclasses of Organization, or use the classification property. Its guidance is:

If the classification is not intrinsic to the organization but simply some way to group organizations, for example as part of a directory, then org:classification should be used. If the classification is a reflection of the intrinsic nature of the organization and affects other properties then the sub-class approach should be used. For example, only charities have charity numbers so it would be better to represent a charity as a sub-class of org:FormalOrganization rather than via a taxonomic labelling.

In other words, subclasses should only be used if the benefits – for example, being able to add class-specific properties – outweigh the complexity. In the organization classification example, it is of no use to create the classes Partnership, LimitedCompany, UnlimitedCompany, etc. if all these classes behave the same way in your use case. It is simpler to use the classification property in that case.
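As a rough Python analogy (not part of the Organization ontology itself), the two strategies might look as follows. The class names, the sample organizations and the placeholder charity number are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organization:
    name: str
    # A grouping label; adding a new kind of organization needs no new class.
    classification: Optional[str] = None


@dataclass
class Charity(Organization):
    # Only charities have a charity number, which is what justifies a subclass.
    charity_number: Optional[str] = None


partnership = Organization(name="Example Partners", classification="partnership")
limited = Organization(name="Example Ltd", classification="limited company")
charity = Charity(name="Example Trust", charity_number="0000000")  # placeholder number
```

Partnerships and limited companies behave identically here, so a label is enough; the subclass exists only because it carries a class-specific property.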

It is very important to identify a reasonable set of use cases and requirements before creating an ontology. Without use cases and requirements to guide its development, an ontology risks becoming a catalog of all things that exist: a list of nouns (classes) that occur within the field of interest. An ontology that focuses on use cases is more likely to have a small number of terms, which include only those necessary to fulfill its use cases.

A property with an unknown value or no value

It is sometimes useful to distinguish between the following cases for a given resource and property. In the context of the resource:

  1. The property is known to be applicable, but its value is unknown, e.g. Herodotus’ birth date.

  2. The property is known to be inapplicable, and no value is appropriate, e.g. a living person’s death date.

In each case, the value is absent, but for different reasons. Two approaches exist to disambiguate the two: one using sentinel values (or markers), like NULL, another addressing the problem at the schema level. In most use cases, however, it’s not important to distinguish between the two.

In general, if, for a given resource, a property is applicable but its value is unknown, do not make any statements about its value. This approach is consistent with the open world assumption used by RDF, according to which the absence of a particular statement implies nothing about the world; its absence implies only that the truth value of that statement is unknown, which is the desired interpretation.
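In a JSON-like record, this amounts to omitting the key rather than storing a placeholder. The following Python sketch assumes a hypothetical helper, birth_date_statement, and a simple dictionary representation; it is an illustration of the idea, not an RDF implementation.

```python
import json

# Herodotus' birth date is applicable but unknown, so no statement is made about it.
herodotus = {"name": "Herodotus"}


def birth_date_statement(record: dict) -> str:
    """Describe what the record says about birth_date without assuming a closed world."""
    if "birth_date" in record:
        return f"born on {record['birth_date']}"
    # Absence implies nothing: a value may exist, but it is not stated here.
    return "birth date not stated"


serialized = json.dumps(herodotus)
```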

Marker strategy

SQL implements a three-valued logic, in which the truth values are true, false or unknown. In SQL, NULL means “unknown”; it is not a value, but rather a marker indicating the absence of value. NULL does not indicate the reason for this absence of value, however; as such, NULL cannot disambiguate between the two cases above. In an attempt to address this problem, E. F. Codd, the creator of the relational database model, proposed two new markers to stand for “applicable but unknown” and “inapplicable”, effectively requiring a four-valued logic.

Like SQL, most programming languages and data models implement at most three-valued logic. Instead of relying on a native implementation of four-valued logic, strings, classes and blank nodes can be used as markers in practice.

Using strings

NASA’s Planetary Data System (PDS) uses the strings “N/A”, “UNK” and “NULL” as markers to stand for “not applicable”, “permanently unknown” and “temporarily unknown”. You may choose your own strings to indicate as many reasons as you need for an absence of value.

Unfortunately, using strings as markers is problematic. Markers must receive special handling to ensure expected behavior. In SQL, for example, the marker NULL is not equal to NULL; one unknown is not equal to another unknown. Take for example Beowulf, a book by an unknown author, whose author would be set to NULL in a SQL table. Querying for books by the same author would return zero results:

SELECT * FROM books WHERE author = NULL

If you want to find all books by unknown authors, which is a different question, the query would be:

SELECT * FROM books WHERE author IS NULL

Unlike NULL, the custom string marker “UNK” is equal to itself. Performing the first query above against a table that uses “UNK” instead of NULL would return all books by unknown authors, instead of returning zero results:

SELECT * FROM books WHERE author = 'UNK'

This result is incorrect; it isn’t true that the author of Beowulf is the author of all books by unknown authors! This is only one of many cases in which markers like NULL receive special handling. To handle string markers appropriately, you would have to implement considerable additional logic.
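The difference can be reproduced with Python's built-in sqlite3 module. The table layout and the sample rows are assumptions made for this sketch; other SQL engines should behave the same way for these comparisons, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Beowulf", None),                 # None is stored as NULL
        ("Another anonymous work", None),  # illustrative row
        ("Histories", "Herodotus"),
    ],
)

# Comparing with NULL yields unknown, so no row matches.
equals_null = conn.execute("SELECT title FROM books WHERE author = NULL").fetchall()

# IS NULL asks a different question: which books have no recorded author?
is_null = conn.execute("SELECT title FROM books WHERE author IS NULL").fetchall()

# Replace NULL with a string marker: the marker is equal to itself.
conn.execute("UPDATE books SET author = 'UNK' WHERE author IS NULL")
equals_unk = conn.execute("SELECT title FROM books WHERE author = 'UNK'").fetchall()
```

The first query matches nothing, while the same comparison against the string marker matches both anonymous works as if they shared an author.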

Using blank nodes

In RDF, blank nodes indicate the existence of a thing, without using, or saying anything about, the name of that thing. Therefore, you can write:

ex:beowulf dcterms:creator [] .
{
  "@id": "ex:beowulf",
  "dcterms:creator": {}
}

which states that Beowulf has an author, while saying nothing about the author. Unlike a string, a blank node is not equal to another blank node, preserving the logic of NULL described above, which is that one unknown is not equal to another unknown.

Note that when using a blank node, it’s possible to say something about an author, without naming the author. For example, the author of Beowulf was, in all likelihood, a person. Therefore, you can write:

ex:beowulf dcterms:creator [ a foaf:Person ] .
{
  "@id": "ex:beowulf",
  "dcterms:creator": { "@type": "foaf:Person" }
}
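The equality behavior described above can be imitated in Python with objects that compare by identity. The BlankNode class below is a hypothetical stand-in written for this sketch; it simplifies how RDF libraries actually treat blank nodes.

```python
class BlankNode:
    """A stand-in for an RDF blank node: it exists, but it has no name.

    Equality is left as Python's default identity comparison, so two
    separately created blank nodes are never equal to each other.
    """

    def __init__(self, rdf_type=None):
        self.rdf_type = rdf_type  # optional statement about the unnamed thing


beowulf_author = BlankNode(rdf_type="foaf:Person")
other_unknown_author = BlankNode(rdf_type="foaf:Person")

# A string marker, by contrast, is equal to every other occurrence of itself.
string_markers_match = ("UNK" == "UNK")
blank_nodes_match = (beowulf_author == other_unknown_author)
```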

Using classes

To describe the relationship between a politician and a political party, the Organization ontology offers a memberOf property, whose range is Organization. In other words, the property maps a person to an organization, for example:

ex:john org:memberOf ex:xyz-party .
{
  "@id": "ex:john",
  "org:memberOf": "ex:xyz-party"
}

In many democracies, it is common to elect a candidate who belongs to no political party. To disambiguate a candidate with no party affiliation from a candidate whose party affiliation is unknown, change the range of the memberOf property to be the union of the Organization class and a new marker class that stands for “no party”, using OWL:

org:memberOf rdfs:range [ a owl:Class ; owl:unionOf (org:Organization ex:Independent) ] .

You may then publish data on independent candidates, like:

ex:john org:memberOf ex:independent .
{
  "@id": "ex:john",
  "org:memberOf": "ex:independent"
}

As with string markers, querying for all candidates who belong to the same party as an independent candidate (ex:independent), using SPARQL for example, would return all independent candidates, which is incorrect:

SELECT ?person WHERE {
  ?person org:memberOf ex:independent .
}

To correct this problem, you may change the query template to limit the results to candidates belonging to an organization, which in this case will return zero results, because ex:independent is not an organization, but a marker for “no party”:

SELECT ?person WHERE {
  ?person org:memberOf ?organization .
  ?organization a org:Organization .
  FILTER (?organization = ex:independent)
}

This approach is effective but inefficient, because it requires creating a marker for most classes. It is also not always feasible, because it requires changing the range of properties that may be defined by third-party ontologies.
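The two queries can be imitated over an in-memory set of triples. This Python sketch is not a SPARQL engine; the candidates ex:jane and ex:mary and both function names are assumptions introduced for illustration.

```python
# Triples as (subject, predicate, object) tuples; all names are illustrative.
triples = {
    ("ex:john", "org:memberOf", "ex:independent"),
    ("ex:jane", "org:memberOf", "ex:independent"),
    ("ex:mary", "org:memberOf", "ex:xyz-party"),
    ("ex:xyz-party", "rdf:type", "org:Organization"),
    # ex:independent is a marker, so it is not typed as org:Organization.
}


def members_of(party: str) -> set:
    """Naive query: everyone whose memberOf value equals the given resource."""
    return {s for (s, p, o) in triples if p == "org:memberOf" and o == party}


def members_of_organization(party: str) -> set:
    """Corrected query: the value must also be typed as an organization."""
    if (party, "rdf:type", "org:Organization") not in triples:
        return set()
    return members_of(party)
```

The naive function groups all independents together as if they shared a party; the corrected one returns an empty set for the marker and still works for real parties.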

Schema strategy

Instead of introducing a new truth value every time there is a new reason for an absence of value, you should avoid NULL and markers altogether. You should not make a statement like “Herodotus was born on <unknown>” when the value of a property is unknown, or “John Doe died on <not applicable>” when a property is inapplicable, as you would if using NULL. Instead, state the reason for which the property is absent. For example, if John is alive, and a death_date property is therefore inapplicable, you may first either state that John belongs to the class Alive:

ex:john rdf:type ex:Alive .
{
  "@id": "ex:john",
  "@type": "ex:Alive"
}

Or state that the property alive has the value true for John:

ex:john ex:alive true .
{
  "@id": "ex:john",
  "ex:alive": true
}

Second, using OWL for example, you may state that instances of the class Alive cannot have a value for the property death_date – in other words, that live people cannot have a death date:

# Live people are the class of people whose "alive" property is set to true.
:Alive owl:equivalentClass [
  a owl:Class ;
  owl:intersectionOf (
    foaf:Person
    [
      a owl:Restriction ;
      owl:onProperty :alive ;
      owl:hasValue true
    ]
  )
] .

# Dead people are the class of people whose "alive" property is set to false.
:Dead owl:equivalentClass [
  a owl:Class ;
  owl:intersectionOf (
    foaf:Person
    [
      a owl:Restriction ;
      owl:onProperty :alive ;
      owl:hasValue false
    ]
  )
] .

# A person cannot be both alive and dead.
:Alive owl:disjointWith :Dead .

# The death_date property can only be set for a dead person.
:death_date rdfs:domain :Dead .

In this way, it is possible to use a semantic reasoner to infer the reason for which a value is absent. For a live person, if the death_date property is absent, it is because live people cannot have a death date. For a dead person, if the death_date property is absent, it is because the death date is unknown. This approach is effective but inefficient, as it requires writing rules for each property.
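The inference itself can be written out by hand. The Python sketch below is not an OWL reasoner: the function name and the dictionary representation are assumptions, and the rules are hard-coded to mirror the ones stated above.

```python
from typing import Optional


def reason_for_missing_death_date(person: dict) -> Optional[str]:
    """Infer why death_date is absent, mirroring the rules above.

    Returns None when a death_date is present or when nothing can be inferred.
    """
    if "death_date" in person:
        if person.get("alive") is True:
            # Mirrors the domain and disjointness rules: only dead people have a death date.
            raise ValueError("a live person cannot have a death date")
        return None
    if person.get("alive") is True:
        return "inapplicable"
    if person.get("alive") is False:
        return "unknown"
    return None  # no statement about "alive", so no inference is possible


john = {"name": "John", "alive": True}
herodotus = {"name": "Herodotus", "alive": False}
```

Writing one such rule per property is exactly the inefficiency noted above; a reasoner moves the rules into the schema but does not remove the need to state them.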

Revisiting the independent candidate example

We can use the schema strategy in the independent candidate example above to disambiguate a candidate with no party affiliation from a candidate whose party affiliation is unknown. First, create a property independent whose possible values are true or false, which can be set on either the candidate or their candidacy. Then, using OWL, state that instances of the class Nonpartisan cannot have a value for the property memberOf that is a Party – in other words, that independent candidates cannot be members of parties:

# Partisanship is belonging to a party.
_:partisanship a owl:Restriction ;
  owl:onProperty org:memberOf ;
  owl:someValuesFrom :Party .

# Nonpartisanship is not belonging to any party.
_:nonpartisanship a owl:Class ;
  owl:complementOf _:partisanship .

# Partisan people have a membership in a party, and have an "independent" property set to false.
:Partisan
  owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
      foaf:Person
      _:partisanship
    )
  ] , [
    a owl:Class ;
    owl:intersectionOf (
      foaf:Person
      [
        a owl:Restriction ;
        owl:onProperty :independent ;
        owl:hasValue false
      ]
    )
  ] .

# Nonpartisan people have no memberships in any party, and have an "independent" property set to true.
:Nonpartisan
  owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
      foaf:Person
      _:nonpartisanship
    )
  ] , [
    a owl:Class ;
    owl:intersectionOf (
      foaf:Person
      [
        a owl:Restriction ;
        owl:onProperty :independent ;
        owl:hasValue true
      ]
    )
  ] .

# A person cannot be both nonpartisan and partisan.
:Nonpartisan owl:disjointWith :Partisan .

It is now possible, as before, to use a semantic reasoner to infer the reason for which a value is absent. For a partisan candidate, if no memberOf property maps the candidate to a party, it is because the party membership is unknown. For an independent candidate, if no memberOf property maps the candidate to a party, it is because independent candidates cannot be members of parties. In your data, you would state that a candidate belongs to the class Nonpartisan:

ex:john rdf:type ex:Nonpartisan .
{
  "@id": "ex:john",
  "@type": "ex:Nonpartisan"
}

Or state that the property independent has the value true for a candidate:

ex:john ex:independent true .
{
  "@id": "ex:john",
  "ex:independent": true
}