Shared File Metadata Specification Madness
From the Shared File Metadata Specification on freedesktop.org:
The only requirement for metadata names is that they are unique and do not overload or cause confusion with each other. To make this possible, all metadata is namespaced by an appropriate class based on the type of the file or the application name (if the metadata is application specific).
What isn't confusing about having all of the following metadata types:
- File.Description
- Audio.Comment
- Doc.Comments
- Image.Description
- Image.Comments
Another flaw is File.Accessed (and so on): Last access date in format "YYYY-MM-DD hh:mm:ss". What timezone is this in? EXIF made this mistake, and it hurts.
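To make the ambiguity concrete, here is a minimal Python sketch. The timestamp value and the two candidate offsets (UTC and UTC+09:00) are assumptions chosen for illustration; nothing in the spec says which offset applies, which is exactly the problem.

```python
from datetime import datetime, timezone, timedelta

# Example value in the spec's "YYYY-MM-DD hh:mm:ss" format (made up for illustration).
STAMP = "2006-03-09 13:54:08"

# Parsing gives a "naive" datetime: no offset is attached to it.
naive = datetime.strptime(STAMP, "%Y-%m-%d %H:%M:%S")

# The same string read as UTC and as UTC+09:00 names two different instants.
as_utc = naive.replace(tzinfo=timezone.utc)
as_plus9 = naive.replace(tzinfo=timezone(timedelta(hours=9)))

# Distance between the two possible readings of one and the same string.
gap = as_utc - as_plus9

# Writing the offset into the value removes the ambiguity.
unambiguous = as_utc.isoformat()
```

The two readings are hours apart, and only the last form carries enough information to be interpreted without knowing where it was written.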
Why this specification didn't soak up the years of work done by people on RDF, Dublin Core, EXIF-in-RDF and so on, I don't know.
NP: Sounds From The Verve Hifi, Thievery Corporation
Name: File.Permissions
Type: string
Writable: No
Description: Permission string in unix format eg "-rw-r--r--"
Short-sighted crap. What about ACLs? Public web resources? Other OSes?
Firstly, we need to distinguish between the various comments fields. Why? Because let's say I want to search Doc.Comments but am not interested in all the other comments (audio, image, etc.). What am I supposed to do?
How else can I specify that I want Document Comments and nothing else in Tracker?
So how can it be madness to differentiate?
And yes, I have used Dublin Core where appropriate, but Dublin Core is very generic and so you can't use it to nail down more specific metadata types.
WRT dates, we don't use timezones because they are not relevant to a user's metadata. It's only someone else's metadata where timezone might be important and that is out of scope of the spec.
As for the names, the spec uses all the names commonly found in office software, images (EXIF), etc., so if you don't like them, complain to them. It would be far worse to rename everything to make it inconsistent to everything else IMO.
If you have any constructive criticism please forward it to me - that spec is not set in stone.
mime-type == "image/*" AND dc:description ~= "something"
That would search for all images where the description contains something. Then the same search for audio files becomes:
mime-type == "audio/*" AND dc:description ~= "something"
And searching for "something" in the description of every file becomes:
dc:description ~= "something"
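As a rough sketch of what those three queries mean, here is a small in-memory version in Python. The records, the `search` helper and its parameters are all hypothetical and only mirror the query syntax above; this is not how Tracker or any real query engine evaluates them.

```python
from fnmatch import fnmatch

# Hypothetical records; file names and values are made up for illustration.
FILES = [
    {"name": "beach.jpg", "mime-type": "image/jpeg",
     "dc:description": "something at the beach"},
    {"name": "talk.ogg", "mime-type": "audio/ogg",
     "dc:description": "something recorded"},
    {"name": "notes.txt", "mime-type": "text/plain",
     "dc:description": "shopping list"},
]

def search(records, text, mime_pattern="*"):
    """Return names whose dc:description contains text and whose
    mime type matches the glob pattern (default: any type)."""
    return [
        r["name"]
        for r in records
        if fnmatch(r["mime-type"], mime_pattern)
        and text in r.get("dc:description", "")
    ]

images = search(FILES, "something", "image/*")
audio = search(FILES, "something", "audio/*")
everything = search(FILES, "something")
```

One generic description field plus a type filter covers all three cases, with no need for a separate description name per file class.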
We use RDF Query in Tracker for searching metadata, and it's awkward and cumbersome to have to list out all the possible mime types. The search is also far quicker when using a more precise metadata type.
There is also a case of overloading: File.Description and Image.Description would overlap without the class names (because File.* applies to all files), and "Description" is a Dublin Core type, so it's a catch-22 situation!
They are not relevant to the metadata of a user who doesn't move. What about the user's pictures from his trips to a country in a different time zone from his own?
I think this metadata layer should deal with what kind of metadata a file may have, which is based on the file's mime-type and defined by the schemas that are available for the type. Some plugin modules, one for each schema (EXIF, ID3, etc.), can handle the retrieval and storage of the data. The core 'libmetadata' might store only DC internally, or perhaps use a triplestore to be able to store anything that isn't handled by a plugin. Plugins for EXIF or ID3 would use the file as store.
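A minimal sketch of that layering in Python, to show the dispatch idea only. Every class name, method and key here (`MetadataCore`, `FilePlugin`, `TripleStore`, `DateTimeOriginal`, `mood`) is hypothetical; no 'libmetadata' with this API exists, and the plugins here keep values in a dict where a real one would write into the file.

```python
from fnmatch import fnmatch

class TripleStore:
    """Fallback store for metadata no plugin handles (in-memory sketch)."""
    def __init__(self):
        self.triples = set()

    def write(self, path, key, value):
        self.triples.add((path, key, value))
        return "triplestore"

class FilePlugin:
    """Stand-in for a schema plugin (EXIF, ID3, ...) that would keep
    its metadata inside the file itself."""
    def __init__(self, schema, mime_pattern, keys):
        self.schema = schema
        self.mime_pattern = mime_pattern
        self.keys = set(keys)
        self.written = {}  # a real plugin would write into the file

    def handles(self, mime_type, key):
        return fnmatch(mime_type, self.mime_pattern) and key in self.keys

    def write(self, path, key, value):
        self.written[(path, key)] = value
        return self.schema

class MetadataCore:
    """Picks a plugin by mime type and key, else uses the triplestore."""
    def __init__(self, plugins):
        self.plugins = plugins
        self.fallback = TripleStore()

    def set(self, path, mime_type, key, value):
        for plugin in self.plugins:
            if plugin.handles(mime_type, key):
                return plugin.write(path, key, value)
        return self.fallback.write(path, key, value)

core = MetadataCore([FilePlugin("EXIF", "image/jpeg", ["DateTimeOriginal"])])
stored_in_file = core.set("a.jpg", "image/jpeg", "DateTimeOriginal", "2006:03:09 13:54:08")
stored_elsewhere = core.set("a.jpg", "image/jpeg", "mood", "happy")
```

The mime type decides which schemas are candidates, and anything outside them still has somewhere to go.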
I hope this adds something. I've fiddled a bit with RDF and always thought mime-types would be a good way to decide which metadata a resource may have, though I'm not sure it covers all the use cases.
And I couldn't agree more on the dates: metadata should be unambiguous no matter where it goes, so why not store it as such directly instead of relying on some export handling?
Oh, and I think the fact that EXIF doesn't support timezones illustrates the fact that trying to fit all data into a uniform format would not be a good approach.
The spec relates to a user's metadata and how it's specified in a local metadata framework (like Tracker or KAT). It is not intended to be used for sharing metadata globally around the world, where privacy concerns come into play (as metadata like that would be stored in a local DB in the user's home directory, it also can't be globalised).
I don't have a problem adding timezone info as such, but I would ask the question "why is it useful in that particular case?"
Is it because it might be useful in some other context?
File: `.'
...
Access: 2006-03-09 13:54:08.000000000 +0000
Modify: 2006-03-09 13:43:08.000000000 +0000
Change: 2006-03-09 13:43:08.000000000 +0000
You are actually arguing that the timezone is useless information?
Ross mentions filtering by mime type, but that's just a refinement of DC.Format. Query on that.
"Dublin Core is very generic and so you can't use it to nail down more specific metadata types."
To some extent that's the point. It's meant to nail 'up' specific metadata types to the generic DC elements. E.g. if I search the metadata field DC.Format, it's supposed to return results for not just that specific name but any names which are /refinements/ of DC.Format, like mime types.
"WRT dates, we don't use timezones because they are not relevant to a user's metadata. It's only someone else's metadata where timezone might be important and that is out of scope of the spec."
This sounds totally wrong to me. Down the line your metadata becomes someone else's metadata, when you publish your photos/blog/whatever. Even your own timezone changes when you travel. Bizarrely, Doc.Created must add or remove information relative to whatever DC.Date you're deriving it from, since that isn't one of the date formats DC adopted from http://www.w3.org/TR/NOTE-datetime
"It would be far worse to rename everything to make it inconsistent to everything else IMO."
But that's exactly what's been done here? The Doc terms aren't DC, the image terms aren't EXIF. They're derived or renamed in some unspecified way from other stuff; and this is the BIG gap in the spec: where does this stuff come from?
If instead you said: We'll use these names, but treat (eg) EXIF.Height and SVG.Height as refinements of Image.Height, it would make more sense. That way you can query Image.Height and get back heights for EXIFs and SVG, right?
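A small Python sketch of that refinement idea, under loud assumptions: the `REFINES` table, the helper names and the sample records are all hypothetical, and EXIF.Width is included only to show a name that does not match.

```python
# Hypothetical refinement table: each specific name points at the term it refines.
REFINES = {
    "EXIF.Height": "Image.Height",
    "SVG.Height": "Image.Height",
}

def is_refinement_of(name, target):
    """True if name equals target or reaches it by following REFINES."""
    while name is not None:
        if name == target:
            return True
        name = REFINES.get(name)
    return False

def query(records, target):
    """Return (file, value) pairs for target and all of its refinements."""
    return [(f, v) for f, n, v in records if is_refinement_of(n, target)]

# Made-up records as (file, metadata name, value).
RECORDS = [
    ("photo.jpg", "EXIF.Height", 1200),
    ("logo.svg", "SVG.Height", 64),
    ("photo.jpg", "EXIF.Width", 1600),
]

all_heights = query(RECORDS, "Image.Height")
exif_only = query(RECORDS, "EXIF.Height")
```

Querying the general name returns every refinement, while querying a specific name stays narrow, so both kinds of search are possible from one set of stored values.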
'course, if the tools don't understand metadata refinement the game's a bogey.
The point of the spec was to make use of DC where appropriate but give priority to more commonly and more visibly used metadata names already in use in applications (like office software, music players, image viewers, etc.). So you end up with a mix of popular metadata names and DC (it's a compromise, basically).
Nailing up is not practical because of all the overlap, as I said before. We need to store, select and search all metadata, and you can't do that for hundreds of possible metadata using DC's 13 types.
The Doc terms were taken from what OpenOffice and MSOffice show in their properties dialog. Likewise with Audio and Image.
In case you are desperate for meaningful names (something I'm not entirely sure I agree with, because the names themselves should carry no meaning in my view), then fortunately RDF again provides an answer in the form of rdfs:label. This is used by Haystack (for one) to construct very usable interfaces, so I fail to see why it isn't suitable here.
Now, I'm not an RDF zealot (in fact my PhD thesis will probably argue that its model is unsuitable for the semantic web), but having developed Frege (available from my website) I'm sure it is up to the task of a shared information system like this. RDF's model (whilst icky and model-theoretic ;) has a nice mapping to and from most OO models as well as other paradigms. Unlike most, I believe that given the choice between a hack and a well-developed solution one should go for the latter. This counts double-good for this situation, where both approaches will require the same effort.