::scr Ramblings of a Classic Refugee or How I Learned To Stop Worrying and Love OS X
Alaric B Snell
scr@thegestalt.org
Tue, 5 Feb 2002 16:34:41 +0000
> > 1. performance -
> > FS access would be *slow*. But memory is cheap and I've got clock cycles
> > to burn so we'll let Mr Moore take care of that for us.
>
> I don't know much about BeOS, but I think I've read something that
> claimed they'd fixed this problem? Maybe. I can't remember.
There's a myth about things 'having to be slow'. If you took a current
filesystem and bolted extra stuff on willy-nilly the result might well be
'slow'. If you do it properly, it won't be. The hierarchical filesystem wasn't
designed to 'be fast'; it was designed because it seemed like a logical way to
organise files at the time.
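(A minimal sketch in Python of why it needn't be slow, assuming a toy
in-memory store; FileStore, add() and query() are made-up names, not any real
filesystem's API. The point is only that if the store keeps an index over the
attributes it cares about, a metadata query doesn't have to scan every file:)

    from collections import defaultdict

    class FileStore:
        """Hypothetical attribute-indexed store, for illustration only."""
        def __init__(self):
            self.files = {}                  # path -> dict of attributes
            self.index = defaultdict(set)    # (attr, value) -> set of paths

        def add(self, path, **attrs):
            self.files[path] = attrs
            for key, value in attrs.items():
                self.index[(key, value)].add(path)

        def query(self, **attrs):
            # Intersect the posting sets; the cost tracks the size of the
            # candidate sets, not the total number of files in the store.
            sets = [self.index[(k, v)] for k, v in attrs.items()]
            return set.intersection(*sets) if sets else set()

    store = FileStore()
    store.add("/mail/1", sender="al", read=False)
    store.add("/mail/2", sender="bob", read=False)
    print(store.query(sender="al", read=False))   # {'/mail/1'}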
> > Effectively DB file systems and also individual files would have the
> > equivalent of DTDs (the file that describes an XML file for the buzzword
> > protected) but for lovely shiny binary files. Al Snell would be proud.
Hello.
If the abstraction layer to the data store is as strings of bits, then yes,
you'll need something basically isomorphic to grammars to define allowable
bit strings if you want to perform data validation. Attributing that grammar
with semantic declarations ('This is a pornographic image in JPEG format',
'this is an integer', etc) will also be useful, since those semantic types
can be looked up in a registry to automatically find pre-written grammars
for validity checking, and also to link to code that can
view/edit/validate in deep semantic ways (making sure that the porn image
contains something that an image recognition algorithm classifies as a
cum-soaked farmyard babe).
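(A rough Python sketch of what such a registry might look like; the type
names and the checks here are purely illustrative, not any real registry's
contents:)

    # Hypothetical registry: semantic type names map to validators that
    # inspect the raw bit string.
    def is_jpeg(data: bytes) -> bool:
        return data[:2] == b"\xff\xd8"     # JPEG streams start with SOI, 0xFFD8

    def is_int32(data: bytes) -> bool:
        return len(data) == 4              # four bytes, any value

    REGISTRY = {
        "image/jpeg": is_jpeg,
        "example/int32": is_int32,
    }

    def validate(type_name: str, data: bytes) -> bool:
        checker = REGISTRY.get(type_name)
        return True if checker is None else checker(data)

    print(validate("image/jpeg", bytes.fromhex("ffd8ffe0")))    # True
    print(validate("example/int32", (42).to_bytes(4, "big")))   # True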
Performing data validation can be a very, very good thing, but it's important
that it doesn't become a pain in the ass. Whether this is best solved by
allowing unstructured types that are not validated against a schema - as the
X.500 / LDAP folks have done recently - or by munging the minds of programmers
until they feel that the time spent writing schemas for everything they do is
far, far outweighed by the costs of debugging things... remains to be seen.
> A binary file with a separate definition document gets rid of the
> primary advantage of XML, namely that it can be edited in a standard
> text editor.
This is not the primary advantage of XML. People used to say things like that
on XML-DEV, but then the Collapse came, and now they're cursing textual
encodings.
The primary advantage of XML is that it's a well supported standard, full
stop; the biggest pains in the ass of it are the actual textual encoding rules
and the lack of semantic declarations.
1) There are problems with character encodings. XML is based upon Unicode,
which is still a developing standard. There have been problems with
definitions of whitespace, and XML 1.1 is set to not be backwards compatible
with XML 1.0 because of this; XML 1.0 documents may not necessarily be valid
XML 1.1 documents...
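(A concrete illustration of the whitespace point, hedged: the XML 1.1 drafts
add NEL, U+0085, to the line-end characters, whereas XML 1.0's whitespace
production is only #x20, #x9, #xD and #xA, so the same string reads
differently under the two versions:)

    # U+0085 (NEL) is plain character data to XML 1.0 but a line end
    # under the XML 1.1 drafts.
    XML_10_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}

    def xml10_whitespace_only(s: str) -> bool:
        return all(ch in XML_10_WHITESPACE for ch in s)

    text = "line one\u0085line two"
    print(xml10_whitespace_only("\u0085"))   # False: not whitespace in 1.0
    print("\u0085" in text)                  # True: a line break under 1.1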
2) There are problems with the fact that only a subset of interesting data
map well to text. Images don't. Encrypted data don't. The two solutions to
this are BASE64 and storing the 'binary' data in a separate data store with
links in between that need to be kept up to date.
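(A minimal Python sketch of the BASE64 route, just to show what it costs:
roughly a third more bytes plus an encode/decode step on every access. The
element and attribute names are invented:)

    import base64
    import xml.etree.ElementTree as ET

    binary = bytes(range(256))              # stand-in for an image, ciphertext, ...

    elem = ET.Element("thumbnail", encoding="base64")
    elem.text = base64.b64encode(binary).decode("ascii")
    xml_text = ET.tostring(elem, encoding="unicode")

    # Round-trip back out of the document.
    decoded = base64.b64decode(ET.fromstring(xml_text).text)
    assert decoded == binary
    print(len(binary), len(elem.text))      # 256 bytes become 344 characters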
3) XML isn't that readable. Really simple stuff is, but the verbosity and so
on quickly make it next to impossible to make sense of in a text editor; this
is why there are specialist XML editors. XML configuration files are a right
pain :-( This particularly arises when the data model in use isn't all that
tree structured, and there are IDREF links between elements - the text editor
user will have big problems editing *that*.
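(A small Python sketch of the IDREF case - the document and names are
invented - showing the link-chasing that a program does in one pass but a
human in a text editor has to do by eye, while keeping every ref consistent:)

    import xml.etree.ElementTree as ET

    doc = """
    <orders>
      <customer id="c1"><name>Alice</name></customer>
      <order ref="c1"><total>10.00</total></order>
      <order ref="c1"><total>4.50</total></order>
    </orders>
    """

    root = ET.fromstring(doc)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

    for order in root.iter("order"):
        customer = by_id[order.get("ref")]          # follow the IDREF
        print(customer.findtext("name"), order.findtext("total"))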
And as for the lack of semantic declarations: the XML designers took a path
of 'We define a syntax alone, and let applications worry about semantics'.
This has caused problems because things like namespaces and namespace URIs
and ID attributes and processing instructions and schemas and DTDs have
multiple de facto standards for their semantics, and when they clash it can
get ugly, both because there are now undefined cases and because developers
must be familiar with multiple competing 'mental models' of what XML
constructs *mean*, and must switch between them in their heads.
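(For instance - a hedged Python sketch, with a made-up namespace URI - a
URI-based model says these two documents contain the same element, while a
prefix-aware model such as the DOM's nodeName reports 'a:item' versus
'b:item'; code written against one mental model can disagree with code
written against the other about whether the documents 'match':)

    import xml.etree.ElementTree as ET

    doc_a = '<a:item xmlns:a="http://example.org/ns">42</a:item>'
    doc_b = '<b:item xmlns:b="http://example.org/ns">42</b:item>'

    elem_a = ET.fromstring(doc_a)
    elem_b = ET.fromstring(doc_b)

    # ElementTree expands names to {uri}local, so the prefixes vanish...
    print(elem_a.tag)                 # {http://example.org/ns}item
    print(elem_a.tag == elem_b.tag)   # True

    # ...whereas a prefix-preserving view of the same documents would show
    # 'a:item' and 'b:item', which compare as different strings.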
The lack of semantics of base XML 1.0 is a problem: there are multiple
available semantics.
1) The DOM model, in which text nodes may be arbitrarily broken at various
places
2) The Infoset, which is like the DOM but different in some ways
3) The post schema validation Infoset, which is like the Infoset but
different in some ways.
4) The XPath tree model, which is like the DOM but different in some ways
I can't remember all the differences, but they're mainly to do with things
like whether a parsed entity reference is presented to the application as
such or magically parsed at parse time (which may only be possible if there's
an Internet connection available at parse time).
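(To make the 'text nodes may be arbitrarily broken' point in the list above
concrete: a minimal SAX sketch in Python - the document is invented - in
which one logical run of text is allowed to arrive as several characters()
calls, typically split around the entity reference, so the application has
to buffer and join the pieces itself:)

    import xml.sax

    class TextCollector(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.chunks = []

        def characters(self, content):
            # The parser may call this several times for one text node.
            self.chunks.append(content)

    handler = TextCollector()
    xml.sax.parseString(b"<p>fish &amp; chips</p>", handler)

    print(handler.chunks)             # possibly ['fish ', '&', ' chips']
    print("".join(handler.chunks))    # the logical text: 'fish & chips'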
Also, XML is *complex*. XML 1.0 itself is simple. Add namespaces and it gets
a bit more complex. Add XML schemas and the Infoset and the PSVI and it gets
more complex still.
> In that case, why is a binary file like that better than
> files being objects? If they had specified accessor methods, they could
> pretend to be any file type they needed too.
Indeed! Sod all this low level crap, people just wrap it in a library and
call it 'serialisation' anyway... we don't want to have to be worrying about
bit formats. We don't worry about bit layouts when we declare a struct full
of integers in C or an object in Java or a table in SQL, do we?
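(The same point in Python terms, with pickle standing in for whatever
serialisation library you like; the record is made up:)

    import pickle

    # We say what the data *is*; the library worries about the bit layout.
    record = {"name": "Alaric", "uid": 1001, "groups": ["dev", "xml"]}

    blob = pickle.dumps(record)       # an opaque sequence of bytes
    restored = pickle.loads(blob)

    assert restored == record         # round-trips without us ever having
                                      # to specify a byte format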
ABS
--
Alaric B. Snell, Developer
abs@frontwire.com