data types and interpretation in RDF

After following a link from one of Nad’s tweets, I read Jeni Tennison’s “SPARQL & Visualisation Frustrations: RDF Datatyping”.  Jeni had been having problems processing RDF of MPs’ expense claims, because the amounts were plain RDF strings rather than typed numbers.  She suggests some best-practice rules for data types in RDF, based on the underlying philosophy of RDF that it should be self-describing:

  • if the literal is XML, it should be an XML literal
  • if the literal is in a particular language (such as a description or a name), it should be a plain literal with that language
  • otherwise it should be given an appropriate datatype

These seem pretty sensible for simple data types.
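As a concrete illustration of the second and third rules, here is a minimal Python sketch (the function and names are mine, purely illustrative) of what the choice between a plain, language-tagged and typed literal looks like in the serialised triples:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

def turtle_literal(value, datatype=None, lang=None):
    """Serialise a Python value as an RDF literal in Turtle syntax."""
    escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
    if lang:                      # plain literal with a language tag
        return f'"{escaped}"@{lang}'
    if datatype:                  # typed literal
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'         # bare string -- the case Jeni warns about

# an expense amount as a typed number rather than a bare string:
print(turtle_literal("12345.67", datatype=XSD + "decimal"))
# a human-readable description in a particular language:
print(turtle_literal("Cost of travel", lang="en"))
```

With the datatype in place a SPARQL engine can sum or compare the amounts; with the bare string it cannot.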

In work on the TIM project with colleagues in Athens and Rome, we too had issues with representing data types in ontologies, but more to do with the status of a data type.  Is a date a single thing, “2009-08-03T10:23+01:00”, or is it a compound [[date year="2009" month="8" …]]?

I just took a quick peek at how Dublin Core handles dates and saw that the closest to standard references1 still include dates as ‘bare’ strings with implied semantics only, although one of the most recent docs does say:

“It is recommended that RDF applications use explicit rdf:type triples …”

and David McComb’s “An OWL version of the Dublin Core” gives an alternative OWL ontology for DC that does include an explicit type for dc:date:

<owl:DatatypeProperty rdf:about="#date">
  <rdfs:domain rdf:resource="#Document"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
</owl:DatatypeProperty>

Our solution to the compound types has been to have “value classes” which do not represent ‘things’ in the world, similar to the way the RDF for vCard represents complex elements, such as names, using blank nodes:

<vCard:N rdf:parseType="Resource">
  <vCard:Family> Crystal </vCard:Family>
  <vCard:Given> Corky </vCard:Given>
  ...
</vCard:N>

From2

This is fine, and we can have rules for parsing and formatting dates as compound objects to and from, say, W3C datetime strings.  However, this conflicts with the desire for self-describing RDF, as these formatting and parsing rules have to be available to any application, or be present as reasoning rules in RDF stores.  If Jeni had been trying to use RDF data coded like this, she would be cursing us!
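To make the point concrete, here is a sketch in Python of the sort of formatting/parsing rule we mean (the field names are illustrative, not from our actual ontology); note that both the producer and every consumer of the RDF must share this code, or an equivalent:

```python
from datetime import datetime

W3C_FORMAT = "%Y-%m-%dT%H:%M:%S"

def compound_to_w3c(parts):
    # a compound value class like [[date year=... month=...]] flattened
    # to a single W3C datetime string; missing fields take defaults
    dt = datetime(parts["year"], parts.get("month", 1), parts.get("day", 1),
                  parts.get("hour", 0), parts.get("minute", 0))
    return dt.strftime(W3C_FORMAT)

def w3c_to_compound(s):
    # the inverse rule, which any consumer of the data must also know
    dt = datetime.strptime(s, W3C_FORMAT)
    return {"year": dt.year, "month": dt.month, "day": dt.day,
            "hour": dt.hour, "minute": dt.minute}

print(compound_to_w3c({"year": 2009, "month": 8, "day": 3,
                       "hour": 10, "minute": 23}))
```

The defaulting of missing fields is itself a semantic decision (is “2009” the whole year, or midnight on 1st January?), and none of that decision is visible in the RDF itself.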

This tension between representations of things (dates, names) and more semantic descriptions is also evident in other areas.  Looking again at Dublin Core, the metamodel allows a property such as “subject” to have a complex object with a URI and possibly several string values.

Very semantic, but hardly mashes well with sources that just say <dc:subject>Biology</dc:subject>.  Again a reasoning store could infer one from the other, but we still have issues about where the knowledge for such transformations resides.

Part of the problem is that the ‘self-describing’ nature of RDF is a bit illusory.  In (Peircian) semiotics the interpretant of a sign is crucial: representations are interpreted by an agent in a particular context, assuming a particular language, etc.  We do not expect human language to be ‘self-describing’ in the sense of being totally acontextual.  Similarly, in philosophy, words and ideas are treated as intentional, in the (not standard English) sense that they refer out to something else; however, the binding of the idea to the thing it refers to is not part of the word, but separate from it.  Effectively, the desire to be self-describing runs the risk of ignoring this distinction3.

Leigh Dodds commented on Jeni’s post to explain that the reason the expense amounts were not numbers was that some were published in non-standard ways, such as “12345 (2004)”.  This example captures succinctly the perpetual problem between representation and abstracted meaning.  If a journal article was printed in the “Autumn 2007” issue of a quarterly magazine, do we express this as <dc:date>2007</dc:date> or <dc:date>2007-10-01</dc:date>, attempting to give an approximation or inference from the actual represented date?

This makes one wonder whether what is really needed here is a meta-description of the RDF source (not simply the OWL, as one wants to talk about the use of dc:date or whatever in a particular context) that can say things like “mainly numbers, but also occasionally non-standard forms”, or “amounts sometimes refer to different years”.  Of course, to be machine mashable there would need to be an ontology for such annotation …

  1. see “Expressing Simple Dublin Core in RDF/XML”, “Expressing Dublin Core metadata using HTML/XHTML meta and link elements” and Stanford DC OWL[back]
  2. Renato Iannella, Representing vCard Objects in RDF/XML, W3C Note, 22 February 2001.[back]
  3. Doing a quick web seek, these issues are discussed in several places, for example: Glaser, H., Lewy, T., Millard, I. and Dowling, B. (2007) On Coreference and the Semantic Web, (Technical Report, Electronics & Computer Science, University of Southampton) and Legg, C. (2007). Peirce, meaning and the semantic web (Paper presented at Applying Peirce Conference, University of Helsinki, Finland, June 2007). [back]

Birthday

It was my birthday last week.  First thanks to everyone who sent greetings through Facebook etc.  Got some new books to read as well as two new mugs: one that says “exterminate” and one that is becoming my wee dram beaker.

This evening going for a belated birthday dinner at Cèabhar (booked up until tonight!), a lovely restaurant overlooking the Atlantic sunset.

The books …

The Kerracher Man, Eric MacLeod — Just reading this now.  A family who go off to live in a remote Scottish croft.

Pilgrims in the Mist: The Stories of Scotland’s Travelling People, Sheila Stewart — Tales once told beneath a bender.

Nella Last’s Peace: The Post-War Diaries Of Housewife 49 — This is the follow-on to Nella Last’s War, which was one of the books on my Rome bookshelf.

Calum’s Road, Roger Hutchinson — A couple of years ago we spent Easter on Skye and visited the little island of Raasay.  At the north end a precipitous little road leads round headlands to a small beach.  We had heard that the road had been created over many years by the labours of a single man … Calum.

Welsh Pictures. Drawn with Pen and Pencil, Richard Lovett (editor), London: The Religious Tract Society, 1892 — A beautiful antiquarian book of images and text.

On the edge: universities bureaucratised to death?

Just took a quick peek at the new JISC report “Edgeless University: why higher education must embrace technology” prompted by a blog about it by Sarah Bartlett at Talis.

The report is set in the context both of an increasing number of overseas students, attracted by the UK’s educational reputation, and of the desire to widen access to universities.  I am not convinced that technology is necessarily the way to go for either of these goals, as it is just so much harder and more expensive to produce good-quality learning materials without massive economies of scale (as the OU has).  Also, the report seems to mix up open access to research outputs and open access to learning.

However, it was not these issues that caught my eye, but a quote by Terence Kealey, vice-chancellor of the University of Buckingham, the UK’s only private university.  For three years Buckingham has come top of UK student satisfaction surveys, and Kealey says:

This is the third year that we’ve come top because we are the only university in Britain that focuses on the student rather than on government or regulatory targets. (Edgeless University, p. 21)

Of course, those in the relevant departments of government would say that the regulations and targets are intended to deliver educational quality, but, as so often, this centralising of control (started paradoxically in the UK during the Thatcher years) serves instead to constrain the real quality that comes from people, not rules.

In 1992 we saw the merging of the polytechnic and university sectors in the UK.  As well as differences in level of education, the former were traditionally under the auspices of local government, whereas the latter were independent educational institutions.  Those in the ex-polytechnic sector hoped to emulate the levels of attainment and ethos of the older universities.  Instead, in recent years the whole sector seems to have been dragged down into a bureaucratic mire where paper trails take precedence over students and scholarship.

Obviously private institutions, as Kealey suggests, can escape this, but I hope that current and future governments have the foresight and humility to let go some of this centralised control, or risk destroying the very system they wish to grow.

the more things change …

I’ve been reading Jeni (Tennison)’s Musings, about techie web stuff: XML, RDF, etc.  Two articles particularly caught my eye.  One was Versioning URIs, about URIs for real-world and conceptual objects (schools, towns), and in particular how to deal with the fact that these change over time.  The other was Working With Fragmented Overlapping Markup, all about managing multiple hierarchies of structure for the same underlying data.

In the past I’ve studied issues both of versioning and of multiple structures on the same data1, and Jeni lays out the issues for both really clearly.  However, both topics gave a sense of déjà vu, not just because of my own work, but because they reminded me of similar issues that go way back before the web was even thought of.

Versioning URIs and unique identifiers2

In my very first computing job (COBOL programming for Cumbria County Council) many many years ago, I read an article in Computer Weekly about choice of keys (I think for ISAM not even relational DBs). The article argued that keys should NEVER contain anything informational as it is bound to change. The author gave an example of standard maritime identifiers for a ship’s journey (rather like a flight number) that were based on destination port and supposed to never change … except when the ship maybe moved to a different route. There is always an ‘except’, so, the author argued, keys should be non-informational.

Just a short while after reading this I was working on a personnel system for the Education Dept. and was told emphatically that every teacher had a DES code, given to them by government, and that this code never changed.  I believed them … they were my clients.  However, sure enough, after several rounds of testing and demoing, when they were happy with everything, I tried a first mass import from the council’s main payroll file.  Validations failed on a number of the DES numbers.  It turned out that every teacher had a DES number except for new teachers, for whom the Education Dept. issued a sort of ‘pretend’ one … and of course the DES number never changed, except when the real number came through.  Of course, the uniqueness of the key was core to lots of the system … major rewrite :-/

The same issues occurred in many relational DBs where the spirit (rather like RDF triples) was that the record was defined by values, not by identity … but look at most SQL DBs today and everywhere you see unique but arbitrary identifying ids. DOIs, ISBNs, the BBC programme ids – we relearn the old lessons.
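The lesson can be sketched in a few lines of Python (a hypothetical, minimal example; the class and field names are my own): references hold an opaque surrogate key, so the ‘informational’ fields, including the supposedly immutable code, are free to change.

```python
import itertools

class Registry:
    """Allocate opaque surrogate keys so that informational fields
    (names, codes, routes) can change without breaking references."""
    def __init__(self):
        self._next = itertools.count(1)
        self._records = {}

    def add(self, **fields):
        key = next(self._next)        # arbitrary, non-informational key
        self._records[key] = dict(fields)
        return key

    def update(self, key, **fields):  # the DES code can change; the key cannot
        self._records[key].update(fields)

    def get(self, key):
        return self._records[key]

teachers = Registry()
k = teachers.add(name="A. Teacher", des_code="TEMP-123")  # provisional code
teachers.update(k, des_code="987654")  # real code arrives; key k is unchanged
```

Had the personnel system above been built this way, the late-arriving real DES numbers would have been an update, not a rewrite.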

Unfortunately, once one leaves the engineered world of databases or SemWeb, neither arbitrary ids nor versioned ones entirely solve things as many real world entities tend to evolve rather than metamorphose, so for many purposes http://persons.org/2009/AlanDix is the same as http://persons.org/1969/AlanDix, but for others different: ‘nearly same as’ only has limited transitivity!

  1. e.g. Modelling Versions in Collaborative Work and Collaboration on different document processing platforms; quite a few years ago now![back]
  2. edited version of comments I left on Jeni’s post[back]

fix for WordPress shortcode bug

I’m starting to use shortcodes heavily in WordPress1, as we are using it internally on the DEPtH project to coordinate our new TouchIT book.  There was a minor bug which meant that HTML tags came out unbalanced (e.g. “<p></div></p>”).

I’ve just been fixing it and posting a patch2.  Interestingly, the bug was partly due to the fact that back-references in regular expressions count from the beginning of the regular expression, making it impossible to use them if the expression may be ‘glued’ into a larger one … lack of referential transparency!
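The underlying regex issue is easy to reproduce outside PHP.  Here is a quick Python illustration (WordPress itself uses PCRE, but numbered back-references count from the start of the whole pattern in just the same way):

```python
import re

# a fragment that matches a doubled word, using back-reference \1:
fragment = r"(\w+) \1"
assert re.search(fragment, "hello hello")

# glue it after another capturing group and \1 silently changes meaning:
glued = r"(\d+): " + fragment
print(re.search(glued, "42: hello hello"))  # None -- \1 now means the (\d+) group

# named groups survive being glued: referential transparency restored
fixed = r"(\d+): " + r"(?P<word>\w+) (?P=word)"
assert re.search(fixed, "42: hello hello")
```

So a fragment that works on its own breaks as soon as it is embedded, unless names (or renumbering) are used.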

For anyone having similar problems, full details and patch below (all WP and PHP techie stuff).

Continue reading

  1. see section “using dynamic binding” in What’s wrong with dynamic binding?[back]
  2. TRAC ticket #10490[back]

What’s wrong with dynamic binding?

Dynamic scoping/binding of variables has a bad name, rather like GOTO and other remnants of the Bad Old Days before Structured Programming saved us all1.  But there are times when dynamic binding is useful, and, looking around, it is very common in web scripting languages, event propagation, meta-level programming, and document styles.

So is it really so bad?

Continue reading

  1. Strangely also the days when major advances in substance seemed to be more important than minor advances in nomenclature[back]

spam going up

I noticed the size of my spam folder seemed to be going up: 16.5 Mb in the first 14 days of July.  I checked back to see what it was previously (I am truly sad and archive my Trash folders, so can see what it was!).  The 9 months October–June came to 154 Mb, so with July at 33 Mb in one month that looks like a near doubling in rate.  Actually, looking back, the previous 19-month period was 88 Mb, so again a doubling.

I checked Eudora’s record of the numbers of emails arriving (right).  This incoming mail is dominated by spam, and shows it going up from about 3000 a week in January to over 5000 a week now, again consonant with a doubling about every 9 months.  I’m not sure if this is because the server-side spam exclusion is letting more through or because the total volume is increasing.  If the latter, the trend is worrying, not just for those of us personally trying to cope with the volume, but also for mail servers.

Moore’s law for disk capacity is about a doubling every 18 months, and personally I’ve noticed that my actual document sizes tend to double about every 3 years (more media etc.), so basically disk space keeps ahead of disk need.  However, if the spam volume is doubling every 9 months, it is growing faster than disk size, so mail servers may find themselves struggling with the throughput, even if they can filter and remove it.
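The back-of-envelope arithmetic behind that worry, as a quick sketch (the figures are the doubling periods above):

```python
# compare two exponential growth rates: spam volume doubling every
# 9 months vs disk capacity doubling every 18 months
def growth_factor(months, doubling_period_months):
    return 2 ** (months / doubling_period_months)

horizon = 36  # look three years ahead
spam_factor = growth_factor(horizon, 9)
disk_factor = growth_factor(horizon, 18)
print(f"in {horizon // 12} years: spam x{spam_factor:.0f}, disk only x{disk_factor:.0f}")
# after 3 years spam has grown 16x while disk capacity has grown only 4x
```

The gap itself grows exponentially: every 18 months spam gains another factor of two over the disks that have to hold it.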

Is it just me or are other people seeing a similar pattern?

grammer aint wot it used two be

Fiona @ lovefibre and I have often discussed the worrying decline of language used in many comments and postings on the web.  Sometimes people are using compressed txtng language or even leetspeak; both of these are reasonable alternative codes to ‘proper’ English, and potentially part of the natural growth of the language.  However, it is often clear that the cause is ignorance, not choice.  One of the reasons may be that many more people are getting a voice on the Internet; it is not just the journalists, academics and professional classes.  If so, this could be a positive social sign, indicating that a public voice is no longer restricted to university graduates, who, of course, know their grammar perfectly …

Earlier today I was using Google to look up the author of a book I was reading and one of the top links was a listing on ratemyprofessors.com.  For interest I clicked through and saw:

“He sucks.. hes mean and way to demanding if u wanan work your ass off for a C+ take his class”1

Hmm I wonder what this student’s course assignment looked like?

Continue reading

  1. In case you think I’m a complete pedant, personally, I am happy with both the slang ‘sucks’ and ‘ass’ (instead of ‘arse’!), and the compressed speech ‘u’. These could be well-considered choices in language. The mistyped ‘wanna’ is also just a slip. It is the slightly more proper “hes mean and way to demanding” that seems to show a general lack of understanding.  Happily, the other comments were not as bad as this one, but I did find the student who wanted a “descent grade” amusing 🙂 [back]