Monday, May 12, 2008

optimization and expense

Neil Macehiter comments on the last post:

But the issue is not with the language you use to perform the query: it's where the data is located. If you have data in separate physical databases then it's necessary to pull the data from the separate sources and join them locally. So, in Kim's example, if you have 5000 employees and have sold 10000 computers then you need to pull down the 15000 records over the network and perform the join locally (unless you have an incredibly smart distributed query optimiser which works across heterogeneous data stores). This is going to be more expensive than if the computer order and employee data are colocated.

The "expense" is there no matter how you do it. Putting all of your potentially useful data in one RDBMS is incredibly wasteful of storage space and comes at the cost of slowing down all queries. It also means that synchronizations need to run almost constantly for the most up-to-date data to be available, a network "expense". But the search can be optimized before any data is pulled. For example, query the HR database for the lowest employee number issued after the first date you're interested in (assuming that employee numbers are issued sequentially). Then query the orders database for PC purchases by that employee number or higher. Yes, it's two steps, but it's also faster than pulling down all the records to do a local join. And, I hold, less "expensive" than maintaining a huge silo of all potentially useful data.
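The two-step strategy can be sketched with two independent SQLite databases standing in for physically separate HR and orders systems. All table and column names here are illustrative assumptions, not anything from Kim's example:

```python
import sqlite3

# Two separate in-memory databases simulate physically separate sources.
hr = sqlite3.connect(":memory:")
orders = sqlite3.connect(":memory:")

hr.execute("CREATE TABLE employees (emp_no INTEGER, hired TEXT)")
hr.executemany("INSERT INTO employees VALUES (?, ?)",
               [(100, "2006-01-15"), (101, "2007-03-02"), (102, "2007-09-20")])

orders.execute("CREATE TABLE purchases (emp_no INTEGER, item TEXT)")
orders.executemany("INSERT INTO purchases VALUES (?, ?)",
                   [(100, "PC"), (101, "PC"), (102, "monitor")])

# Step 1: ask HR only for the lowest employee number issued after the
# cutoff date (assumes employee numbers are issued sequentially).
(cutoff_emp,) = hr.execute(
    "SELECT MIN(emp_no) FROM employees WHERE hired >= ?",
    ("2007-01-01",)).fetchone()

# Step 2: ask the orders source only for PC purchases at or above that
# number, instead of pulling every record down for a local join.
pc_buyers = [row[0] for row in orders.execute(
    "SELECT emp_no FROM purchases WHERE item = 'PC' AND emp_no >= ?",
    (cutoff_emp,))]

print(pc_buyers)  # → [101]
```

Only the single cutoff value crosses the "network" between sources; the bulk of both tables stays put.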


I am not suggesting (and I don't think Kim is either) that ALL data needs to exist in one database. Rather, that there is a need to create application-specific databases for performance, and potentially for other reasons, e.g. compliance.

The proposed search optimisation might be preferable to pulling down all the data (but will still not beat a local join) but what happens as the "where" clause of the query becomes more complex e.g. employees in department A who sold computers in 2007. Also, what happens when an additional table is introduced into the query?

Also, won't the network expense of synchronisation only be incurred once, whereas the distributed query expense would scale with the number of queries?
Putting aside all the cool features that virtual directory vendors have come up with to solve this problem... it would be a good time for me to point you to an incredibly simple qualifier for search filter terms in CARML that solves this problem - "IsPrimaryKey".

This gives developers the ability to let the infrastructure know that a search term refers to a primary key (just like asking for a database column to be a primary key or to be indexed). The infrastructure can now optimize the query across multiple source boundaries without having to load tonnes of data to resolve a filter that spans multiple sources.
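To illustrate what such a hint buys the infrastructure (this is a hypothetical sketch in Python, not actual CARML syntax; the source names and fields are invented for the example): a planner that knows emp_no is a primary key in the directory source can resolve each order with one keyed lookup instead of loading both sources in full and joining locally.

```python
# Hypothetical sketch. Two "sources" modelled as in-process data; the
# point is the access pattern, not the transport.

directory = {  # source 1: keyed by the primary key emp_no
    100: {"name": "Alice", "dept": "A"},
    101: {"name": "Bob", "dept": "B"},
}

orders = [  # source 2: computer orders referencing emp_no
    {"emp_no": 100, "item": "PC"},
    {"emp_no": 101, "item": "PC"},
]

def pc_sellers_in_dept(dept):
    # Filter orders first, then dereference each emp_no directly.
    # Without the IsPrimaryKey hint, the planner would have to pull the
    # whole directory and join it against the whole order list.
    return [directory[o["emp_no"]]["name"]
            for o in orders
            if o["item"] == "PC"
            and directory[o["emp_no"]]["dept"] == dept]

print(pc_sellers_in_dept("A"))  # → ['Alice']
```

This is also a plausible answer to the "employees in department A who sold computers" objection above: the per-order lookups stay cheap because they are keyed, however complex the rest of the filter gets.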

© 2003-2006 The Virtual Quill, All Rights Reserved

