lesson #02: forward ever, backward never
Ok, I do Data Warehousing for financial institutions for a while, both in-line and as a consultant. I know that there are certain requirements outside, in the real world, that forces such companies to store histories of there data, and to be able to simulate the Truth as Was perspective. Regulatory requirements. They exist. Yes.
But that does not mean to build everything time dependent, to set up every master data with a capture of the slightest change within its attributes. No, in my past there was no, exactly zero, requirement for the simulation of Truth as Was. The business departments I used to work with were always interested in the as Is situation, and derive a proper forecast from this.
I don’t talk about time lines in your OLAP cube (which is a type of date attribute in your fact table). I talk about questions like: Show me the distribution of new contracts within the old field service structure! Who cares about that? You have a new structure, you have new revenues or applications. And if your KPI moves toward green things are right, and if not, one can not blame the new structure.
There is no need for the campaign management to know the ZIP code of last month. There is some sort of need to send the letter to the new postal address and use this micro demographic data.
Or am I totally wrong?
lesson #01: trust the file
I used to work with Informatica for several years, and the systems we usually received our data from where old HOST ones. So delivery took place in flat file form. Later one I moved “forward” to SAP BW in a SAP R/3 environment. Not 100% but most of the connected OLTP systems where R/3. And with R/3 comes the DataSource, both delivered by SAP as part of the Business Content (BC) and self created so called Generic Extractors. But there are several other options for getting data into a SAP based data warehouse:
- Connect to a database
- Receive SOAP messages
- Use RFC function calls
- Connect flat files (both server and client) .
So, we use a lot of the R/3 extractors, which, at the end, are real tight connections of OLTP sources and the data warehouse. What options do you have now? One could choose between a Delta of Full delivery of data. For Delta mode some continuously growing attribute must be specified. This is easy: One could start with a artificial ID, but will later move to some sort of changed-at time-stamp. The question now: Where are my deleted records? As it is not the aim of a OLTP system to store everything for ever things are deleted. The solution is a additional data source delivering information about deleted records, which are somehow “saved” during the deletion process. In Delta mode. One will not face this issue in case of a full extraction of data. You can do this ever, and ever again. No issue, as long as your operation is not blocked. But what kind of data you receive now? While extracting contract directly from OLTP, do I have all the contract holders and more important the corresponding version in time? No guarantee for this during direct access. Again, a solution is in place: Create buffer tables, which are filled at a certain moment in time, during a regular day end processing. Fill the buffer, ping the warehouse and load a full and consistent stock of data. You could do this again the same period, maybe day. The next day the buffer is overwritten. So, the chance to fix things up is gone. Another solution: Add technical driven tracking fields like status to the buffer table. Implement a handshake between OLTP and DWH. The new process looks like:
- End of day processing in OLTP
- Fill buffer tables, set status to NEW
- Ping the Warehouse
- Load all NEW data
- Perform checks
- Update buffer tables, set status to RECEIVED
Now, in case of error, status RECEIVED could be reset to NEW, the process starts again. For normal processing all records marked as received could be deleted after a certain period of time. And then there are gone. For ever.
It my seems anachronistic, but is there anything against the good old-fashioned flat file (Ok, nested are allowed too, you Cobol guys)? Have a day end processing, dump the necessary data into flat files, move them to the warehouse server, and load it, and reload it, and reload it a year after. The advantages I see are:
- Consistent and congruent set of data from each source
- DWH processing does in no way affect the operation of OLTP system
- Backup and recovery is system inherent, just store your files on cheap tape
- Real delta information (Deleted records!) is accessible within two full stocks of data
I fully understand that deltas delivered by the sources, in a direct way are more easy to be handled in the beginning. But from my perspective I see the advantages in the long run of a interface. A flat file is there, and will not be changed until processed. It does not matter when this is done, there is a fair chance to reload all the data after a 3 day breakdown of your Data Warehouse. This is hard, and expensive to be implemented using direct connections. So: Trust the file! And give it a chance.
lessons learned: there is a chance in every new beginning
Starting next year I’ll move to a new job position. This wouldn’t be worth any mentioning if along with this does not comes me leaving IT departments after nearly 13 years of working with, in, and for Data Warehouses of different sizes and complexities across Europe. 7,5 years in line, and additional 5 years as BI consultant prior formed my way of handling both big and small, and both IT and social problems. I’m really looking forward the new challanges, but most of all the chance to shape a company at this level.
Before I dedicate myself to the implementation of a group wide Data Quality Management, fenced by a Metadata Management initiative, I would like to share some of the essentials I’ve learned the recent years. We’ll see what comes next, here at DWH Pages… Thanks in advance!
The end of relational database management systems
This short, and high-speed-to-read article at ACM offers a good list of arguments against the common relational DBMS world most of us are used to live in.
For the world of data warehousing the predicted solution is column based databases. At least databases, but not relational any more. But how do such column based solutions work? What is the linking element? IMHO there always needs to be a identifying key, and the associated value within the very column. All this optimized, compressed and stored in a bitmap like index. Additional you need information about how to join column information together. Do it with the original keys? Do it with sort of row ID pointing to the right line?Any other solutions? Ask google!
For me it all seems like old wine in new skins. At the end one need a high performing machine which does good hash calculation to build up and resolve indeces. In memory technology could spice up the query life. Seting up a data vault model is something in between, and I suppose a good way to have both a good and queryable database on one and good ad-hoc peformance on the other side.
A sugested further reading on column based databases is: Michael Stonebraker et al., “C-Store: A Column-oriented DBMS,” Proc 2005 VLDB Conference, Trondheim, Norway, Sept. 2005 (pdf).
query definition without BEx
Via twitter.com/sapbwtweet and SDN I came across a not yet valued but evaluated tool to get a faster insight into your query definition without utilizing BEx and facing long waiting times.
- Enter transaction se38
- Execute programm RSRQ_QUERYDEFINITION
- Enter a query name (known selection dialog from BEx)
- Select / de-select some option
- Execute
- Analyze your query
In my opinion the structure of a query and all its elements are displayed in a transparent way without missing details. This transaction yould be used to analyze a query regarding filter values and the usage of variables much quicker than by see the real definition in BEx front end.
ny times on statistics
Thanks to twitter I came across a interesting article at the online edition of the New York Times (here). It is about the problem of analyzing huge amounts of data and detect patterns within. The big companys a hiring more and more statisticians to develop numer crunching methods. The acquisition of SPSS by IBM outlines this tendency. Go for analysis, but do not forget reporting!
farewell scrum, welcome … what?
Weeks ago we (which is my team and I) stopped what I labeled scrum before. The old postings about this experiment could be found here. For today I would like to reflect the latest stage and what comes next.

Figure 1: Last Scrum Board impression
Figure 1 shows the last impression of our board. Most things are done with the release, somethings are left. In the upper left corner one could get an idea of the new usage of this expensive white board. I started nearly a year ago with the idea of utilizing different aspects of scrum, which were:
- Daily stand-up meeting
- Smaller iterations compared to our locale release cycle of 2 months
- Tracking “things to be done” on a Board
- Use a priorized Product Backlog owned by the Business Department (and me)
- Review the release
My motivation behind was a) building up a team with a real spirit of cooperation and implementing solutions together and b) speeding up the development, short: Do more in less time. Things are working fine. IMHO we all performed much better in this time, we implemented more than originally planned per iteration, the dialogue between Business Department (BD) and me was as good as never before. The backlog was used by both parties to negotiate sprint content etc. But we failed. Why?
First of all I was not able to split “requests” from BD into such small pieces that a visual tracking was usefull. We commited us internally to a 3 week sprint duration. The common duration of a task was 5 days. Additional there is the need for solving incidents and thinks like. So, each developer is only able to implement 2 tasks per sprint. Given a team of 5 developers we had 10 story cards on board which remained for a long time in the “In development” slot and for a relativly short time in the “Test” slot. We met every day to talk about the one card. This was getting boring by the time. Then we experimented with a planing round. Which was a total flop. We only did once. What else? Nothing. What will change? A lot.
The board will still be used as so called information radiator, but with a different aproach: It is now there to sketch ideas of solutions, to visualize thougts, outlining architectures, data models, relations etc. I’ll move back to a weekly status meeting. This is than dedicated to talk about major issues, but also solutions, or topics from the other teams. Barriers are solved directly within the team. The Product Backlog will remain, because it is working too good. I’ll also try to have something new in production within every 3 weeks. Just to offer good service to BD. But I’ll cut the limit and will move towards a more continuous enhancement of the productive environment. I’ll do the release planing, which is necessary in the department, and also the estimation of the single entries in the backlog on my own. Assignment of tasks to single developers is also done directly by myself. So this is what changes most: The controling.
Main lesson learned: It is about the people! In this case about the team members, the developers. There is a need for a specific charateristic, some self motivation, the drive, to pick the next task on your own, to have enough confidence into your own decissions, to put commitments into your statements etc. Agile methods need specific people. Without them all the tools are nice, with them the tools produce added value.
TPS
Thanks to a posting at codemonkeyism.com I bought Taiichi Ohnos book about the Toyota Production System. Originally published in 1978 it explains the way he changed the production system at Toyota into what is now known as The Toyota Way. Concepts like Just-In-Time (JIT), or kanban are explained.
With this book I try to clarify for myself whether Data Warehousing is production or product development (TPDS is the Toyota acronym for this). Am I part of the software development guild or not?