Starting a new thread in response to Ax's question in another thread... This is a good summary of what went on in our data modeling sessions: https://www.dbta.com/Columns/D...d-Conquer-83689.aspx

What I participated in was, in theory, high-level logical data modeling. Others were supposed to take the logical design to a physical implementation. But the logical modeling group consisted of a bunch of former programmers and DBAs, so our discussions often drifted into "how would you implement this?"

We were working on a Product model for financial instruments, using a model from Seer Technology as a starting point. As I recall, Seer's products were spun off from work originally done at Credit Suisse; Jon might be familiar with the details. Seer's model was moderately lumped and quite elegant. It had been developed looking at the information from the perspective of a firm, though, and we were an exchange, so we were tweaking it.

The lumpers among us liked the elegance and future flexibility of a lumped model. The splitters sided with the people who had to program and implement the actual systems: a bunch of COBOL programmers who didn't necessarily have the broad and deep understanding of the business needed to see why the lumped model made so much sense.

In one meeting, one of our lumpers went to the whiteboard and put up the ultimate lumpers' data model. It had two entities, Object and Object Type, in a many-to-many relationship. I thought it was perfect, but that's not the model we ultimately ended up with.
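For the curious, the whiteboard model really was that small. Here's a rough sketch in Python (the entity and field names are my own illustration, not the Seer model or the one we actually built): an Object can carry many Object Types, and an Object Type can apply to many Objects.

    # Hypothetical sketch of the "ultimate lumper" model: two entities in a
    # many-to-many relationship. Names and fields are illustrative only.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class ObjectType:
        name: str                                # e.g. "Equity", "Option", "IndexMember"

    @dataclass
    class Object:
        identifier: str                          # e.g. a ticker or instrument id
        types: set = field(default_factory=set)  # an Object may have many ObjectTypes

    # One instrument can carry several types, and a type spans many instruments.
    equity = ObjectType("Equity")
    index_member = ObjectType("IndexMember")
    ibm = Object("IBM", {equity, index_member})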
Pinta & the Santa Maria
The urge to lump or split seems to exist in many situations. One that pops immediately to mind for me is classification statistics, such as cluster analysis. The goal of a cluster analysis is to look at multiple measures (data fields) for a group of observations and see whether you can determine which observations appear to be similar or dissimilar, based on the measures you've analyzed. There are two broad-stroke ways to approach this.

The lumper approach puts all the observations into a single cluster, then breaks that cluster into two, then three, and so on. The choice of how many clusters to keep is based on the variance explained at each stage: at some point, the improvement in explained variance is so small that it isn't worth adding another cluster, because it really doesn't provide any additional insight. (FYI, if the number of clusters equals the number of observations, there is 100% explained variance.)

The splitters would start with a large number of clusters and work pretty much backward from the lumper approach: reduce the clusters by one, check the change in explained variance, rinse, repeat. This is a very broad-stroke explanation, and there are also countless hybrids.
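Here's a minimal sketch of the explained-variance criterion in Python. It uses scikit-learn's KMeans purely as a stand-in for the top-down or bottom-up procedures described above, and the data and cluster counts are made up; the point is just to show how explained variance grows with the number of clusters until the gains stop being worth it.

    # Assumed illustration: watch explained variance as the number of clusters grows.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # 150 observations measured on two fields, falling into three loose groups.
    X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares

    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ is the within-cluster sum of squares; the remainder is "explained".
        explained = 1.0 - km.inertia_ / total_ss
        print(f"k={k}: explained variance = {explained:.3f}")
    # Lumpers read this table top-down and stop when the gain gets small;
    # splitters would read it bottom-up. With k equal to the number of
    # observations, explained variance would reach 100%.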