Shared File Metadata Specification Madness

From the Shared File Metadata Specification on freedesktop.org:

The only requirement for metadata names is that they are unique and do not overload or cause confusion with each other. To make this possible, all metadata is namespaced by an appropriate class based on the type of the file or the application name (if the metadata is application specific).

What isn't confusing about having all of the following metadata types:

Another stroke of genius is File.Accessed (and so on): Last access date in format "YYYY-MM-DD hh:mm:ss". What timezone is this in? EXIF made this mistake, and it hurts.

Why this specification didn't soak up the years of work done by people on RDF, Dublin Core, EXIF-in-RDF and so on, I don't know.

NP: Sounds From The Verve Hifi, Thievery Corporation

10:30 Thursday, 09 Mar 2006 [#] [computers] (20 comments)

Posted by Wouter Bolsterlee at Thu Mar 9 11:24:06 2006:
Aaaargh. Why can't we just use DECENT namespaces without duplication? This sort of stuff is really blocking good integration of metadata and searching/categorization capabilities to the desktop...
Posted by Wouter Bolsterlee at Thu Mar 9 11:28:01 2006:
Quoting the specification:

Name: File.Permissions
Type: string
Writable: No
Description: Permission string in unix format eg "-rw-r--r--"

Short-sighted crap. What about ACLs? Public web resources? Other OSes?
Posted by Jamie McCracken at Thu Mar 9 11:50:57 2006:
Allow me to justify why it is.

Firstly, we need to distinguish between the various comments fields. Why? Because let's say I want to search Doc.Comments but am not interested in all the other comments (audio, image etc.), so what am I supposed to do?

How else can I specify that I want Document Comments and nothing else in Tracker?

So how can it be madness to differentiate?

And yes, I have used Dublin Core where appropriate, but Dublin Core is very generic and so you can't use it to nail down more specific metadata types.

WRT dates, we don't use timezones because they are not relevant to a user's metadata. It's only someone else's metadata where timezone might be important, and that is out of scope of the spec.

As for the names, the spec uses all the names commonly found in office software, images (EXIF) etc., so if you don't like it, complain to them. It would be far worse to rename everything and make it inconsistent with everything else IMO.

If you have any constructive criticism please forward it to me - that spec is not set in stone.
Posted by Ross at Thu Mar 9 12:09:10 2006:
Obviously I'm foolish as I've only written an RDF-based metadata-heavy web gallery, but using RDF and Dublin Core I'd solve the problem like this:

mime-type == "image/*" AND dc:description ~= "something"

That would search for all images where the description contains something.  Then the same search for audio files becomes:

mime-type == "audio/*" AND dc:description ~= "something"

And searching for "something" in the description of every file becomes:

dc:description ~= "something"
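
A minimal Python sketch of the three searches above, assuming each file's metadata is a plain dictionary. The `search` helper, the key names and the sample records are hypothetical illustrations; they are not part of Tracker or of the specification.

```python
from fnmatch import fnmatch

# Hypothetical sample records; the keys mirror the query terms used above.
FILES = [
    {"name": "beach.jpg", "mime-type": "image/jpeg", "dc:description": "something sunny"},
    {"name": "song.ogg", "mime-type": "audio/ogg", "dc:description": "something loud"},
    {"name": "notes.txt", "mime-type": "text/plain", "dc:description": "nothing here"},
]

def search(files, text, mime_pattern="*"):
    """Return names of files whose description contains `text` and whose
    mime type matches the glob `mime_pattern` (default: any type)."""
    return [
        f["name"]
        for f in files
        if fnmatch(f.get("mime-type", ""), mime_pattern)
        and text in f.get("dc:description", "")
    ]

print(search(FILES, "something", "image/*"))  # images only
print(search(FILES, "something", "audio/*"))  # audio only
print(search(FILES, "something"))             # every file type
```

The same generic property serves all three queries; only the mime-type filter changes.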
Posted by Jamie McCracken at Thu Mar 9 12:22:00 2006:
That's not a bad idea, Ross, and I had considered it before, but there are a few issues with it, namely that searching that way is inefficient, especially with documents (as there is no easy mime type association in that case).

We use RDF Query in Tracker for searching metadata, and it's awkward and more cumbersome having to list out all the possible mime types. The search is also far, far quicker when using a more precise metadata type.

There is also a case of overloading, as File.Description and Image.Description would overlap without the class names (because File.* applies to all files), and "Description" is a Dublin Core type, so it's a catch-22 situation!
Posted by Ross at Thu Mar 9 12:34:38 2006:
Why bother with File.Description and Image.Description?  In what situation can they have different values?
Posted by Jamie McCracken at Thu Mar 9 12:37:53 2006:
File.Description = Nautilus Notes on a file (Notes tab)

Image.Description = Exif description
Posted by glandium at Thu Mar 9 12:53:31 2006:
WRT dates, we dont use timezones because they are not relevant to a user's metadata.

They are not relevant to the metadata of a user who doesn't move. What about the user's pictures from his trips to a country in a different time zone than his own?
Posted by Ross at Thu Mar 9 12:59:52 2006:
Which was my problem exactly: I've a pile of photos that were taken in India, which is +0530. EXIF stores the times in local time, which is useless to everyone.
Posted by Jamie McCracken at Thu Mar 9 13:08:17 2006:
When EXIF and friends support timezone then I will happily add support for it in my spec. Until that day comes there is nothing anyone can do about it!
Posted by Berend at Thu Mar 9 13:14:51 2006:
Shouldn't the point of this spec be to provide a consistent interface for data access, not uniform data? So if the data is defined using DC, EXIF, ID3, etc., then why try to abstract this away by creating new classes with their own properties like these Image or File? I know it would be neat to have uniform data, but wouldn't it be a nightmare to try to map all existing ways of storing metadata into this interface (read: a lot of hardcoding to try to integrate all these)? Also, I think it unnecessarily hides from the user where the data actually is.

I think this metadata layer should deal with what kind of metadata a file may have, which is based on the file's mime-type and which is defined by the schemas that are available for the type. Some plugin modules, one for each schema (EXIF, ID3, etc.), can handle the retrieval and storage of the data. The core 'libmetadata' might store only DC internally, or perhaps use a triplestore to be able to store anything that isn't handled by a plugin. Plugins for EXIF or ID3 would use the file as store.

I hope this adds something. I've fiddled a bit with RDF and always thought mime-types would be a good way to decide which metadata a resource may have, though I'm not sure it covers all the use cases.

And I couldn't agree more on the dates: metadata should be unambiguous no matter where it goes, so why not directly store it as such instead of relying on some export handling?
Oh, and I think the fact that EXIF doesn't support timezones illustrates the fact that trying to fit all data into a uniform format would not be a good approach.
Posted by Ross at Thu Mar 9 13:19:20 2006:
So what about File.Accessed?  Why doesn't that have a timezone?
Posted by Jamie McCracken at Thu Mar 9 13:40:39 2006:
Why should it?

The spec relates to a user's metadata and how it's specified in a local metadata framework (like Tracker or KAT) - it is not intended to be used for sharing metadata globally around the world where privacy concerns come into play (as metadata like that would be stored in a local DB in the user's home directory, it also can't be globalised).

I don't have a problem adding timezone info as such, but I would ask the question "why is it useful in that particular case?"

Is it because it might be useful in some other context?
Posted by Ross at Thu Mar 9 13:55:54 2006:
$ stat .
  File: `.'
...
Access: 2006-03-09 13:54:08.000000000 +0000
Modify: 2006-03-09 13:43:08.000000000 +0000
Change: 2006-03-09 13:43:08.000000000 +0000

You are actually arguing that the timezone is useless information?
Posted by Ross at Thu Mar 9 13:58:21 2006:
I suppose I should spell it out: Daylight Saving Time.
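
A minimal Python sketch of the Daylight Saving Time problem, using the zone-less "YYYY-MM-DD hh:mm:ss" format quoted from the spec. The date is an illustrative assumption: on 29 October 2006 UK clocks went back an hour, so 01:30 local time happened twice. The fixed offsets below stand in for a real timezone database.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"  # the zone-less format quoted from the spec

# Two different instants on the night UK clocks went back in 2006:
# 01:30 British Summer Time (+0100), then 01:30 GMT (+0000) an hour later.
before = datetime(2006, 10, 29, 1, 30, tzinfo=timezone(timedelta(hours=1)))
after = datetime(2006, 10, 29, 1, 30, tzinfo=timezone.utc)

# Without an offset, both instants are written as the same string.
print(before.strftime(FMT) == after.strftime(FMT))

# Yet they are a full hour apart.
print(after - before)

# Including the offset, as stat does, keeps them distinct.
print(before.isoformat(), after.isoformat())
```

Once the offset is dropped there is no way to tell the two access times apart, or even to sort them correctly.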
Posted by Brian Ewins at Thu Mar 9 14:12:40 2006:
"let's say I want to search Doc.Comments but am not interested in all the other comments [...] so what am I supposed to do?"

Ross mentions filtering by mime type, but that's just a refinement of DC.Format. Query on that.

"Dublin Core is very generic and so you can't use it to nail down more specific metadata types."

To some extent that's the point. It's meant to nail 'up' specific metadata types to the generic DC elements, e.g. if I search the metadata field DC.Format, it's supposed to return results for not just that specific name but any names which are /refinements/ of DC.Format, like mime types.

"WRT dates, we don't use timezones because they are not relevant to a user's metadata. It's only someone else's metadata where timezone might be important and that is out of scope of the spec."

This sounds totally wrong to me. Down the line your metadata becomes someone else's metadata, when you publish your photos/blog/whatever. Even your own timezone changes when you travel. Bizarrely, Doc.Created must add or remove information to whatever DC.Date you're deriving it from, since that isn't one of the date formats DC adopted from http://www.w3.org/TR/NOTE-datetime

"It would be far worse to rename everything to make it inconsistent to everything else IMO."

But that's exactly what's been done here? The Doc terms aren't DC, the image terms aren't EXIF. They're derived or renamed in some unspecified way from other stuff; and this is the BIG gap in the spec - where does this stuff come from?

If instead you said: We'll use these names, but treat (eg) EXIF.Height and SVG.Height as refinements of Image.Height, it would make more sense. That way you can query Image.Height and get back heights for EXIFs and SVG, right?
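
A minimal Python sketch of that refinement idea, assuming a single level of refinement kept in a lookup table. The `REFINES` table, the `query` helper and the sample values are hypothetical; the spec defines no such mechanism.

```python
# Hypothetical refinement table: specific name -> the generic name it refines.
REFINES = {
    "EXIF.Height": "Image.Height",
    "SVG.Height": "Image.Height",
}

def query(metadata, name):
    """Return the entries stored under `name` or under any name refining it."""
    return {
        key: value
        for key, value in metadata.items()
        if key == name or REFINES.get(key) == name
    }

photo = {"EXIF.Height": 1200, "EXIF.Width": 1600}
drawing = {"SVG.Height": 300}

print(query(photo, "Image.Height"))    # the generic query finds the EXIF value
print(query(drawing, "Image.Height"))  # ...and the SVG value
print(query(photo, "EXIF.Height"))     # the specific query still works
```

The source-specific names stay intact, so nothing is lost about where a value came from.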

'course, if the tools don't understand metadata refinement the game's a bogey.
Posted by Jamie McCracken at Thu Mar 9 14:18:29 2006:
Yeah, you may have a point. I expect strftime() to format the date according to timezone anyhow, but it's probably better to include the timezone info itself just to be on the safe side, I suppose.
Posted by Jamie McCracken at Thu Mar 9 14:29:17 2006:
Brian:

The point of the spec was to make use of DC where appropriate but give priority to more commonly and more visibly used metadata names already in use in applications (like office software, music players, image viewers etc.). So you end up with a mix of popular metadata names and DC (it's a compromise, basically).

Nailing up is not practical because of all the overlap, as I said before. We need to store and select and search all metadata, and you can't do that for hundreds of possible metadata using DC's 13 types.

The Doc terms were taken from what OpenOffice and MSOffice show in their properties dialog. Likewise with Audio and Image.
Posted by Joe Geldart at Thu Mar 9 19:59:18 2006:
Hm. Why should the user need to see these names anyway? All that matters to the computer is that the names be symbols with a notion of equality. Namespaces are there to solve a social problem -- trampling. Now, whilst you may claim that there is no problem with that here (there being a specification and all), unfortunately no one can predict all use cases. Namespaces were introduced with good reason, and were used very well by RDF. It seems senseless to me to pursue a course that is known to be problematic when the 'correct' approach is just as easy to implement.

In case you are desperate for meaningful names (something I'm not entirely sure I agree with, because the names themselves should carry no meaning in my view) then fortunately RDF again provides an answer in the form of rdfs:label. This is used by Haystack (for one) to construct very usable interfaces, so I fail to see why it isn't suitable here.

Now, I'm not an RDF zealot (in fact my PhD thesis will probably argue that its model is unsuitable for the semantic web) but having developed Frege (available from my website) I'm sure it is up to the task of a shared information system like this. RDF's model (whilst icky and model-theoretic ;) has a nice mapping to and from most OO models as well as other paradigms. Unlike most, I believe that given the choice between a hack and a well-developed solution one should go for the latter. This counts double-good for this situation where both approaches will require the same effort.
Posted by boohoo at Fri Mar 10 07:50:24 2006:
Well I like LDAP's idea of object classes, I think it would apply here quite nicely - just slap classes of metadata on a file and if several classes have the same attribute it just collapses to one instance of that attribute. I haven't read the spec so please don't flame when you explain why I'm wrong :]
