As web inventor and W3C Director Sir Tim Berners-Lee notes in his Design Issues for Linked Data,
“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.” (Linked Data – Design Issues)
This straightforward realisation is expounded in a set of four deceptively simple ‘rules’ or (as Berners-Lee prefers) ‘expectations of behaviour.’ Ultimately, these lie behind everything that might be described as Linked Data, whether out on the open web for all to see, or locked away in a computer science laboratory or behind the firewall of a pharmaceutical company or bank.
- Use URIs as names for things
- Use HTTP URIs so that people can look up those names
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
- Include links to other URIs, so that they can discover more things.
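The four rules can be illustrated with a minimal sketch in which HTTP URIs name things and a simple lookup returns a description containing further links. Everything here is hypothetical: the `example.org` URIs, the property names, and the dictionary standing in for the web itself.

```python
# A minimal sketch of Berners-Lee's four rules. All URIs and property
# names are illustrative; a dictionary stands in for the web of data.

WEB_OF_DATA = {
    # Rules 1 and 2: things are named with HTTP URIs that can be looked up.
    "http://example.org/id/university-of-york": {
        "label": "University of York",
        # Rule 4: the description links to other URIs.
        "locatedIn": "http://example.org/id/york",
    },
    "http://example.org/id/york": {
        "label": "York",
        "country": "England",
    },
}

def look_up(uri):
    """Rule 3: looking up a URI provides useful information."""
    return WEB_OF_DATA.get(uri, {})

# A person or machine can explore the web of data by following links:
uni = look_up("http://example.org/id/university-of-york")
city = look_up(uni["locatedIn"])
print(uni["label"], "is in", city["label"])  # University of York is in York
```

The point is not the trivial data structure but the behaviour: given one URI, an agent can discover more simply by following the links it finds.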
Whilst the exact wording of these statements has changed slightly since they were first expressed in 2006, and there remains some question as to the strength of the requirement for specific standards, the acronyms mask a simple yet powerful set of behaviours:
- Name objects and resources, unambiguously;
- Make use of the structure of the web;
- Make it easy to discover information about the named object or resource;
- If you know about related objects or resources, link to them too.
There is a widely held presumption amongst many of Linked Data’s most persuasive advocates that the standards (such as RDF for modelling and syntax or SPARQL for querying) referred to by Berners-Lee are prerequisites for sharing or consuming Linked Data. Whilst the power of these standards delivers the richest set of capabilities today – with every indication that tool development and the ongoing standardisation process will increase this still further – there is also value in a more permissive reading of Berners-Lee’s rules. There is much to gain in embracing the philosophy behind these rules, separately from adopting the standards and specifications required to realise their full potential. Unambiguous identification of resources across the web, easily parsable descriptive information, shared terminologies comprising web-addressable terms, and unambiguous links to related resources deliver real value, as do microformats, RDFa markup in web pages and other ‘simpler’ approaches. Whether this leads towards ‘Linked Data’ in a formal sense perhaps remains unclear, yet may ultimately prove to be unimportant.
As a simple illustration of these principles, let us consider that there are two universities in the English city of York. There is also a York University in Canada. This example is simplistic and there are any number of ways in which people and machines might disambiguate a statement in order to clarify which institution is being referred to, but even so, knowing that the institution in question is 133913 in the Department for Children, Schools and Families’ EduBase database or 10007167 in the UK Register of Learning Providers makes for less ambiguity in line with Berners-Lee’s first rule.
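The disambiguation described above can be sketched as a record holding several identifiers for the one institution, each drawn from a different scheme. The EduBase and UKRLP values come from the text; the record structure and function are illustrative only.

```python
# One institution, several unambiguous identifiers (Berners-Lee's first
# rule). Identifier values are from the text; the structure is a sketch.

university_of_york = {
    "name": "University of York",
    "identifiers": {
        "edubase": "133913",     # DCSF EduBase database
        "ukrlp": "10007167",     # UK Register of Learning Providers
        "dbpedia": "http://dbpedia.org/resource/University_of_York",
    },
}

def refers_to_same(record, scheme, value):
    """True if `value` under `scheme` names the institution in `record`."""
    return record["identifiers"].get(scheme) == value

print(refers_to_same(university_of_york, "edubase", "133913"))  # True
```

A statement qualified with any one of these identifiers cannot be confused with a statement about the other university in York, or the one in Canada.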
EduBase also exposes URIs to the web in line with Berners-Lee’s second rule, and www.edubase.gov.uk/establishment/summary.xhtml?urn=133913 refers unambiguously to details of the same institution. The presence of ‘summary.xhtml?’ in the address may raise persistence issues as and when the Department changes its software solution, and points to a set of naming questions with implications far beyond the creation of Linked Data.
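One common response to this persistence problem is to publish a stable, implementation-neutral URI and map it internally onto whatever the current system serves, so the published name survives software changes. The sketch below assumes such a scheme; the `example.gov.uk` identifier is invented for illustration.

```python
# A sketch of shielding published names from software changes: a stable,
# implementation-neutral URI (invented here) maps onto the current
# 'summary.xhtml?' style address, which may change over time.

PERSISTENT_TO_CURRENT = {
    "http://example.gov.uk/id/establishment/133913":
        "http://www.edubase.gov.uk/establishment/summary.xhtml?urn=133913",
}

def resolve(persistent_uri):
    """Return the current location behind a stable identifier."""
    return PERSISTENT_TO_CURRENT[persistent_uri]

print(resolve("http://example.gov.uk/id/establishment/133913"))
```

When the underlying software changes, only the mapping is updated; every published link to the stable URI continues to work.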
Closely associated with the Linking Open Data Community Project discussed on p18, DBpedia also exposes persistent URIs for the structured information stored in Wikipedia. It is increasingly seen as a reliable means of identifying a wide range of concepts, including the institution in our example; dbpedia.org/resource/University_of_York. DBpedia has emerged as something of a hub amongst the Linked Data projects, and this trend seems likely to continue.
The value of naming and identification is not, of course, new, although this recent integration with the architecture of the web makes it feasible to consider scalable and sustainable methods of proceeding that encompass formal naming schemes managed by some responsible authority (ISBNs, DOIs, Learning Provider IDs, etc), more ad hoc community efforts such as DBpedia, and even the task- or application-specific generation of completely new identifiers as a last resort. There is no requirement to ‘boil the ocean,’ with massive over-arching schemes that seek to categorise and label everything from the outset. Rather, it is increasingly practical to contemplate operating in an environment in which numerous identifiers are assigned to a single resource of interest, and for specific application contexts to rely upon those best suited to their purposes.

A particular application might place greatest trust in the institutional identifier assigned by EduBase, but could also include identifiers from DBpedia or the UK RLP; both to support the third and fourth rules, and to meet the needs of third-party applications reliant upon one of these identifiers. There is no need for every application to record – and maintain – every identifier. Reliance upon the web means that interested applications can traverse the links from one store of data to another, rapidly discovering that ‘a’ in data store ‘1’ is the same as a resource referred to as both ‘a’ and ‘b’ in data store ‘2,’ and therefore also the same as ‘b’ in data store ‘3.’ Until a resource is actually of interest to an application, it may very well go unidentified. By devolving labelling and categorisation to the place and time of need, the larger task becomes more manageable, and the individual identifiers perhaps become more relevant, more timely, and more liable to be actively maintained.
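The traversal described above – concluding that store 1’s ‘a’ and store 3’s ‘b’ name the same thing because store 2 uses both names – can be sketched as a walk over symmetric ‘same as’ links. The store names and identifiers are illustrative.

```python
# A sketch of traversing 'same as' links between data stores. Store 2
# records that its 'a' and 'b' name one resource, letting an application
# conclude that store 1's 'a' and store 3's 'b' are also the same thing.

SAME_AS = [
    ("store1:a", "store2:a"),  # store 1's 'a' is store 2's 'a'
    ("store2:a", "store2:b"),  # store 2 uses both 'a' and 'b'
    ("store2:b", "store3:b"),  # store 3's 'b' matches store 2's 'b'
]

def same_resource(x, y, links=SAME_AS):
    """Breadth-first traversal of the (symmetric) sameAs links."""
    seen, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for p, q in links:
            for a, b in ((p, q), (q, p)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return y in seen

print(same_resource("store1:a", "store3:b"))  # True
```

No single store holds every identifier; the equivalence emerges only when an interested application follows the links.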
Berners-Lee’s third rule calls upon applications to ‘provide useful information’ when a URI is accessed. EduBase certainly does this, but in a manner that is only really useful for human consumption. To provide useful information in a manner that may be interpreted and acted upon by software, W3C recommendations such as SPARQL and RDF come to the fore, and work has already been done within the UK Government’s data.hmg.gov.uk activity to convert EduBase to RDF and to make a SPARQL endpoint available for developers to query. To fully meet Berners-Lee’s exhortation to ‘provide useful information,’ there will be a need to employ content negotiation techniques in order to present different responses to different tools. An undergraduate searching for the University of York is unlikely to welcome being presented with a SPARQL endpoint or an RDF document, and a SPARQL-aware application gathering information about a number of universities will not want human-readable content of the form already delivered from the EduBase web database. Maintaining wholly separate and unconnected services for humans and for machines makes little sense in the longer term.
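The content negotiation mentioned above might be sketched as follows. This is assumed server-side logic, not the actual EduBase or data.hmg.gov.uk implementation, and for simplicity it honours the listed order of the Accept header rather than its q-values.

```python
# A sketch of content negotiation: the same URI answers a browser with
# HTML and a Linked Data client with RDF, based on the Accept header.
# Simplified: takes media types in listed order, ignoring q-values.

def negotiate(accept_header):
    """Pick a representation for the requested resource."""
    preferences = [part.split(";")[0].strip()
                   for part in accept_header.split(",")]
    for media_type in preferences:
        if media_type in ("application/rdf+xml", "text/turtle"):
            return media_type          # machine-readable description
        if media_type in ("text/html", "application/xhtml+xml"):
            return "text/html"         # human-readable page
    return "text/html"                 # sensible default for browsers

print(negotiate("text/turtle, text/html;q=0.5"))  # text/turtle
print(negotiate("text/html"))                     # text/html
```

The undergraduate’s browser and the SPARQL-aware harvester each ask for the same name, and each receives the representation it can use; one service, two audiences.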
Berners-Lee’s fourth rule, that resource descriptions should include links to related resources, is well demonstrated by DBpedia.
In this screenshot we can see links to a number of individuals and organisations elsewhere in DBpedia that have declared some relationship to the University. By making it easy – and useful – to declare those links, the web of possible connections grows richer. I might declare myself an alum of the University of York, and there may be value in doing so for myself, the University, and third parties interested in either or both of us. Whilst I gain individual value from the relationship, and have an incentive to describe it, the University is more likely to gain value in the aggregate (all these people graduated here), and has very little incentive to track an individual such as myself in sufficient detail to declare and maintain the association from their side.
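The asymmetry of incentives described above – individuals declare the link, the institution benefits in aggregate – can be sketched as a set of individually asserted statements that anyone can pool. The people, their URIs, and the ‘alumOf’ property are all invented for illustration.

```python
# A sketch of individually declared links aggregating into value for an
# institution: each person asserts their own relationship, and the
# university derives its alumni list without tracking anyone itself.
# People, URIs, and the 'alumOf' property are illustrative.

declarations = [
    ("http://example.org/people/alice", "alumOf",
     "http://dbpedia.org/resource/University_of_York"),
    ("http://example.org/people/bob", "alumOf",
     "http://dbpedia.org/resource/University_of_York"),
    ("http://example.org/people/carol", "alumOf",
     "http://dbpedia.org/resource/York_University"),  # the Canadian one
]

def alumni_of(university_uri, triples=declarations):
    """Aggregate the individually declared relationships."""
    return [s for s, p, o in triples
            if p == "alumOf" and o == university_uri]

print(len(alumni_of("http://dbpedia.org/resource/University_of_York")))  # 2
```

Because each statement names its target unambiguously, the two similarly named universities are never conflated in the aggregate.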