Friday, March 27, 2009

Thoughts on storage, and use of metadata, in SharePoint


I've been working with Windows SharePoint Services 3.0 and Microsoft Office SharePoint Server 2007 for about two years now. Granted, it hasn't been all fun and games, but SharePoint as a product has grown on me significantly since the previous versions.

My working background these last few years (2004-2008) has been related to online graphical production systems, and handling of the resources from those. As such, my initial concern, back in 2005, was that the SharePoint model of storing items in databases would be a bad idea for us - with *millions* of files, ranging from fairly small, to relatively large (multiple GBs). Having now had a few years to consider the consequences, as well as alternative solutions, I've come to realize that the model isn't as obscure as I initially thought.

Storage in SharePoint

When considering where to store data in SharePoint, my current policy is somewhat split, and relies less on size than on intent. Large data can be stored in SQL, with the "only" consequence being storage and IO overhead. If you can live with those two, *that's* not the deciding factor.

In the wild, deployed SharePoint portals obviously come in all shapes and colors. Some stick with near-default portals, with a few lists and document libraries, and no major customizations. Others tear it all apart, keeping storage and authentication, but provide their own glue and UI. Yet others use SharePoint for UI, but write all data sources themselves. All are sound choices, if they result from considering the business scenario.

So when it comes to deciding whether to keep data within or without, I believe it boils down to defining what your product is all about, and what the intent of the portal - and data - is. What you *do* want to avoid is inconsistency, no matter how the system works.

If you're writing an intranet, and the company you're targeting already has a large document base, you'll have the options of
  • keeping everything outside SharePoint;
  • moving everything inside;
  • or keeping what's outside, but putting everything new inside.
Out of the three, the last will cause the most horrendous inconsistency unless you make an effort to change the mentality of the users - a much harder task than writing a mere portal.

Generally speaking, I'm of the belief that data should be put where it belongs, or as close as possible. This may sound like an oversimplification, but it's just a guideline, and we'll cover the odd cases soon enough. Unlike many others, on the other hand, I'm never going to argue that you stick all possible data in the same place. The world of software (and business) is too complex for that to be practical.

Example storage dos and don'ts

In the case of the company I worked for, data should have been stored in SharePoint if the SharePoint portal held all the processing logic - something it obviously didn't for most of it. After all, we had a multitude of systems, accepting templates, images, PDFs and whatnot, producing an equal multitude of output. The only piece of data suitable for SharePoint storage in that case would be the PDF output (which was basically our end-of-the-line product). The task of the PDFs, so to speak, would be to be searched, looked up, previewed and downloaded. All of those are tasks tightly connected to SharePoint and its attached systems (like various index servers).

Images were a second important asset in the business case. Users would upload these, they'd be incorporated in templates through online UI apps, then passed through to various systems in the backend, and finally rendered into the PDFs. You could argue that the images would require as much user interaction as the PDFs, such as browsing, searching (though not by content), downloading and re-uploading. The main difference from the PDFs, however, is that they would exist in the portal as a mere by-product of their actual intent - to be processed in the backend. To make matters worse, when dealing with third party applications, you can seldom expect them to accept anything other than filesystem paths as input. Keeping them in the SharePoint databases as content would therefore require indirections, proxies or double storage:
  • you could keep one version in the database, for navigation and download, and one on the filesystem for backend usage;
  • you could keep a single version in the database, then write filesystem hooks and placeholders, to serve content from web services into the backend systems (tricking them into believing they were accessing the filesystem);
  • or you could make use of something a lot simpler, at the cost of not actually storing data in SharePoint.
The first option would obviously cause quite a lot of storage overhead, and the second is just absurd - if you're anything close to cost-benefit oriented.

Third time's the charm

The last option may sound like an inconsistency-provoking blur of madness, but it's actually quite simple.

Stepping back to intent: our intent for the images was to make them navigable, downloadable, sortable and whatnot. Unlike the PDFs, they would not be content indexed (although their properties very well might be). As such, the data of the images would belong to the backend systems, whereas the properties would belong to the web portal. You could write whatever frontend tools you want to administer the file as a unit (properties + data) from SharePoint, and as long as you're rigorous about doing it from there, hosting the data somewhere else wouldn't pose a problem.
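To make the "administer the file as a unit" idea a bit more concrete, here is a minimal sketch in Python rather than against the SharePoint object model. Everything in it is an assumption for illustration: the ImageStore class, its method names, and the plain dictionary standing in for the portal's list of properties. The only point it tries to show is that every change goes through one place, so properties and data can't drift apart.

```python
import os
from typing import Dict


class ImageStore:
    """Hypothetical facade: properties live in the portal, bytes on disk."""

    def __init__(self, data_dir: str) -> None:
        self.data_dir = data_dir
        # Stand-in for a portal list: item name -> its properties.
        self.properties: Dict[str, Dict[str, str]] = {}

    def _path(self, name: str) -> str:
        return os.path.join(self.data_dir, name)

    def add(self, name: str, data: bytes, **props: str) -> None:
        # Write the data where the backend expects it, then record the
        # properties (including the path) on the portal side.
        with open(self._path(name), "wb") as handle:
            handle.write(data)
        self.properties[name] = dict(props, path=self._path(name))

    def rename(self, old: str, new: str) -> None:
        # Move the file and its properties together, never one alone.
        os.rename(self._path(old), self._path(new))
        props = self.properties.pop(old)
        props["path"] = self._path(new)
        self.properties[new] = props

    def delete(self, name: str) -> None:
        os.remove(self._path(name))
        del self.properties[name]
```

Backend systems would read the files straight from data_dir, while the portal only ever browses and sorts on the properties. The rigor lies in never touching either side except through the facade.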

Solving detached storage by using extended metadata

I'm trying to cover all of this as briefly as possible, and obviously not touching all caveats and alternative paths; you'll have to excuse that. What I'm presenting is *one* approach to deciding how to deal with storage, how to make it possible, and finally how all of this leads us to alternative ways of viewing metadata.

By metadata, I'm now talking about what's hidden behind the items in SharePoint, as opposed to columns in a SharePoint list. If you're developing with the object model, or the SharePoint services, you can access these for quite a few of the SharePoint primitive types.

What people use the metadata for is no one's business but their own, but I've begun hosting more than just information there. What I've done lately is assign functionality to items as well.

A document in one of my document libraries may be stored completely (properties and data) in SharePoint, while other documents in the same library have special classes serialized, with type and parameters, into the hidden metadata. One of these is a class implementing a custom interface of mine, IResourceProvider. When such a document is opened, code attached to my portal or lists will instantiate an available IResourceProvider - determining its actual type, and deserializing its data - then use the IResourceProvider to retrieve the content. If there's no resource provider, SharePoint is left to supply whatever it believes the data to be (i.e. from the database).
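The flow can be sketched roughly as follows. This is Python, not the actual SharePoint code, and apart from the IResourceProvider idea itself every name is an assumption made up for illustration: the "resource_provider" metadata key, the JSON format, the FileSystemProvider class and the registry of known types. The item's hidden metadata is modelled as a plain dictionary.

```python
import json
from typing import Callable, Dict, Optional, Protocol


class ResourceProvider(Protocol):
    """Python stand-in for the IResourceProvider interface."""

    def get_content(self) -> bytes: ...


class FileSystemProvider:
    """Hypothetical provider that reads the data from a filesystem path."""

    def __init__(self, path: str) -> None:
        self.path = path

    def get_content(self) -> bytes:
        with open(self.path, "rb") as handle:
            return handle.read()


# Known provider types, keyed by the type name stored in the metadata.
PROVIDER_TYPES: Dict[str, Callable[..., ResourceProvider]] = {
    "FileSystemProvider": FileSystemProvider,
}

METADATA_KEY = "resource_provider"  # hypothetical hidden-property name


def serialize_provider(type_name: str, **params: str) -> str:
    """Serialize a provider as type + parameters for the hidden metadata."""
    return json.dumps({"type": type_name, "params": params})


def load_provider(metadata: Dict[str, str]) -> Optional[ResourceProvider]:
    """Instantiate the provider described in the metadata, if there is one."""
    raw = metadata.get(METADATA_KEY)
    if raw is None:
        return None
    spec = json.loads(raw)
    factory = PROVIDER_TYPES[spec["type"]]  # determine the actual type
    return factory(**spec["params"])        # deserialize its data


def open_document(metadata: Dict[str, str], database_content: bytes) -> bytes:
    """Use the provider when present; otherwise fall back to the database."""
    provider = load_provider(metadata)
    if provider is None:
        return database_content
    return provider.get_content()
```

An item whose metadata holds serialize_provider("FileSystemProvider", path=...) gets its content from that path, while an item with no such entry is served from the database as usual. Looking the type up in a registry, rather than instantiating whatever name the metadata contains, keeps the set of loadable classes under the portal's control.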

This is but one use I have for the rich metadata; the others will have to wait for a post of their own. I'm sure many will disagree with the points above, so I'll let that discussion play out first :)