About OR
OR Topics - Data Warehousing & Business Intelligence
BUILDING A DATA WAREHOUSE
Design issues

The design of the central data warehouse and any associated data marts is absolutely critical to success. A good design must combine simplicity, flexibility and performance. It will also have to balance the business’ stated requirements with what is feasible given the quality of available data and the capabilities of current technology. Some compromise is inevitable.

The design is normally divided into two stages, known as logical and physical design.

The logical design focuses on the data, and is independent of the platform on which they are held. The physical design centres on optimising the performance of the chosen database engine.

The most important design considerations are these:

  1. Choosing the right database schema
  2. Deciding the level of detail
  3. Choosing the dimensions and measures
  4. How to avoid temporal inconsistency
  5. Deciding what summary data to store
  6. Where to calculate derived data
  7. Deciding the update frequency
  8. Agreeing a policy for archiving
  9. Selecting a database platform
  10. Database optimisation

Choosing the right database schema

Transactional databases are usually normalised to optimise on line transaction processing. There are two main reasons for structuring query databases differently in a warehouse environment:

  • Normalised data models are almost invariably too complex for users to understand, or for query tools to navigate – they need to be simplified.
  • With a normalised database, many standard reports and analyses require large, multi-table joins. Also, ad hoc queries tend to require full table scans, because the right indexes are not in place. In both cases, the effect on performance is often disastrous.

A star schema database can overcome both of these problems, and is now widely accepted as the best design for individual data marts. The central warehouse may comprise a superset of all stars, or a set of master tables from which these can be easily derived, with a partially normalised schema.

In a star schema database, each subject area or business process has a single fact table linked to a number of smaller dimension tables. The name thus reflects the shape of the resulting entity relationship diagram.

The fact table contains relevant measures and a composite key defining what these measures describe. This key is chosen to represent the lowest level of detail in each of the dimensions required for analysis. Thus each fact table record represents a unique combination of dimensions, equivalent to a cell in the corresponding cube (see OLAP concepts).

The dimension tables contain attributes of the associated dimensions, but no measures. Specifically, they define the hierarchical relationships used for drill down and aggregation. By joining a fact table to several dimension tables through foreign keys, any subset or summary of the data can be very efficiently derived.

Most dimensions are used in more than one data mart, so it is common to use the same dimension table with several fact tables. Building conformed dimensions is highly desirable because it helps to ensure consistency (see pitfalls).

In the past, vendors have tended to fuel debate on the relative merits of relational and multi-dimensional databases (i.e. ROLAP versus MOLAP). With a star schema design, the two are logically equivalent, so the issue becomes marginal, and can be addressed as part of tool selection (see database platform).

ROLAP solutions tend to be more flexible at the expense of performance. MOLAP solutions may not cope so well with very large data volumes. Many vendors now advocate hybrid solutions (HOLAP), offering the best of both worlds.

Deciding the level of detail

The more detailed the data included in the warehouse, the greater the range of queries that can be answered, but the longer they will take to run. Storing more detail also increases the cost and effort of building and maintaining the warehouse, along with the time taken to load updates, re-build aggregates and prepare summary reports.

Users typically ask for more detail than they ever use, but the trend is towards ever more detailed analysis.

A sensible compromise is to design data marts so that they can handle 80% of queries, but to keep the most detailed available data in the main warehouse, where they can be accessed if needed – more slowly, but less often. Some business intelligence tools have facilities to drill through from a single cell in a cube to the underlying transaction records.

One of the key design decisions is thus to specify the grain (or granularity) of each fact table, that is the combination of dimension keys that will identify each record e.g. {store * product * month}. The trick is to retain enough detail to support precise cuts of the data, but not necessarily enough to access individual source system records.

Choosing the dimensions and measures

Inexperienced warehouse developers sometimes have difficulty deciding which attributes in the source data should be treated as dimensions and which as measures.

As a general guide, dimensions are:

  • discrete valued;
  • textual (or at least non-additive, like part numbers);
  • used in constraints and filters;
  • used to classify and compare.

Common dimensions include time, product, customer, supplier, location, department, channel, account type and age group.

Measures, on the other hand, are:

  • continuously valued;
  • numeric;
  • (ideally) additive across all dimensions.

Examples measures include sales, costs, headcount, account balances, success ratios, market share and growth rates.

A common design dilemma is how to handle repeating groups, that is groups of similar measures, e.g. telephone sales, web sales, trade sales and retail sales. Often these are better represented as a single measure (sales) and a dimension (sales category).

Currency, on the other hand, is rarely a useful dimension, because no measures can be added across it.

Dimensions can also be derived from measures by grouping them into bands, e.g. large, medium and small turnover companies or age bands 1-18, 19-25, 25-40, 40-65, over 65 etc.

How to avoid temporal inconsistency

The data warehouse will need to deal with a number of time-related issues, avoiding possible inconsistencies:

Flickering

Operational data can change minute by minute, leaving transactional databases in a constant state of ‘flickering’. These changes need to be summarised to give a stable, consistent overview, otherwise different users running the same report at different times may not get the same answers. Flickering is usually prevented by taking snapshots at regular intervals (e.g. month end stock levels) or by calculating the net change since the previous snapshot (e.g. total sales in month).

Slowly changing dimensions

In many cases, the hierarchical relationships or groupings within a dimension will change gradually over time, e.g. when employees change department, or product groups are revised. This makes it difficult to compare summary level data across time periods.

Slowly changing dimensions invariably complicate the design process, but broadly speaking there are three ways to tackle this issue:

  • Overwriting existing dimension records will re-state history;
  • Adding dimension records with different time spans will segment history;
  • Creating current or previous attributes within the dimension will preserve both old and new views of history, but makes queries more complex.

See Kimball for an in-depth discussion of this topic.

Alternative timeframes

There is often a requirement to produce reports based on different views of the time dimension, e.g. calendar years and fiscal years.

It is also common to select data for different reports using different dates e.g. transaction date or effective date.

Both situations can usually be handled by creating logical views or aliases of the master (physical) time dimension table, one for each required role.

Deciding what summary data to store

Summary level data maintained within the data warehouse are sometimes referred to as aggregates. They have a number of advantages:

  • Common consolidations are performed only once;
  • Summary level queries run much faster;
  • Groupings and consolidation rules can be tightly defined and centrally maintained.

On the other hand, aggregates can substantially increase the volume of data stored, but it is difficult to predict by how much. Typically, aggregates will occupy as much space as the base data, but because there are so many combinations of dimensions and levels, the additional space may actually exceed that for the granular data. The uncertainty arises because in practise the data tend to cluster in pockets, so that not every combination is needed. A useful rule of thumb to balance performance and space is to avoid storing aggregates for levels where each dimension member has fewer than 10 children at the level beneath it.

Furthermore, summary data must be recalculated whenever underlying data, groupings or consolidation rules change. This can increase the time taken to refresh the warehouse, which may be constrained by a tight overnight time window.

Aggregates offer the single most effective way to boost query performance, but the price is increased overheads. Also, as usage patterns change, different aggregates may be needed. This is therefore as much an ongoing maintenance issue as it is a design issue.

Where to calculate derived data

Most of the tools used to create data warehouse components can perform at least simple calculations. In deciding where to calculate any derived data needed, the design team needs to consider the following factors:

  • whether to store or calculate on the fly;
  • whether to define calculation formulae centrally, or let end users do so;
  • the complexity of any formulae and calculation rules;
  • the sequence in which calculations need to take place.

Ratios such as % growth provide a good example of the need to control calculation precedence:

If three stores in the UK experienced sales growth of 40%, 50% and 60% respectively, a simple SQL query would calculate the sales growth for the UK as a whole as 150%. The ratio should, of course, be calculated after aggregating, not before.

Most OLAP tools can handle situations like this, but end users often do not understand the importance of calculation precedence, and may obtain invalid results if not trained properly. The best way to avoid precedence issues is to perform all calculations in the same place, rather than distributing them across different warehouse components.

Power users are the most likely to think of new measures, and may require a reporting and analysis tool that allows them to enter their own calculations. Unfortunately, most of the available tools provide a very limited range of scientific and logical functions. It can be difficult, for example, to trap for division by zero. In most cases, therefore, users will want to export data to Excel.

It is impossible to anticipate and pre-define every measure users will ever need, but the most frequently used are best included in the overall system design. To ensure consistency, it is generally better to define formulae once, centrally, rather than repeatedly, in individual reports.

Storing derived data may take up substantial extra space, but is worth considering if it leads to a significant performance improvement, or reduces the risk of error.

Deciding the update frequency

Some parts of the warehouse could be updated more often than others. Most will be refreshed on a daily, weekly or monthly basis, depending on when updated data become available.

Deciding the update frequency involves striking a balance between three factors:

  • freshness- the value users place on gaining access to the latest available data as soon possible;
  • quality- the importance users attach to making sure the data is complete, correct and stable before it is published;
  • availability- the proportion of working time for which the reporting system is not taken off-line to permit uploads.

Agreeing a policy for archiving

The design team will need to decide how much historical data to keep in the warehouse. This will impact the volume of disk storage required, and the scope of the initial data loading exercise.

Users typically ask to keep more than they really need. The following approaches are often used to control the associated costs:

  • keeping summary level data for longer than the detail;
  • planning to archive older, little used data to off-line media;
  • monitoring usage in order to identify dormant data, so that it can be deleted or archived.

Before committing to load the previous 5 years’ data into the warehouse, the design team should consider any changes made to the source systems:

  • recently introduced data items may not be available this far back;
  • data structures may have been modified;
  • individual fields may have been used differently, or had different validation rules imposed;
  • dimension hierarchies may have changed, leading to temporal inconsistencies in the data.

These factors could make the initial data cleansing and transformation exercise much more difficult. One way to avoid this is to load less data initially, and gradually accumulate a longer history as new data become available. Another is to add historic data in suitable batches, working backwards in reverse chronological order.

Selecting a database platform

Perhaps the most important consideration here is scaleability: a successful data warehouse is bound to grow as new data sources are added. If the organisation itself is growing, the volumes of transaction data that need to be loaded will also increase, as may the number of users. If information is eventually made available to partners and customers, the load on the query database could easily increase by one or two orders of magnitude.

Multi-dimensional databases such as Oracle Express, Hyperion Essbase or Cognos PowerPlay give excellent performance with small and medium data sets, but may not be able to cope with the volumes of data required in a large data warehouse. They are most often used for functional data marts (see components).

Many of the leading relational database vendors (notably Microsoft and Oracle) now include effective support for star schema designs (see choosing the schema above), and so are able to deliver satisfactory performance in a query and analysis environment, though they are not always so functionally rich.

For these reasons, industrial strength relational database engines such as Oracle, DB2 and Terradata are normally chosen for the central data warehouse. Microsoft SQL Server has also become an attractive option for small, medium and sometimes larger organisations, but tools such as Microsoft Access are unlikely to cope.

Database optimisation

The physical design of a database has a major impact on performance, and may need to be tuned as the data warehouse evolves and usage patterns change. This normally requires a specialist database administrator, and a full discussion of this topic is beyond the scope of this article.

The options available are specific to the database engine, but usually include:

  • partitioning the database e.g. into separate tables for each year of data;
  • choosing different types of index (e.g. bitmap for small dimension tables, or b-tree for larger fact tables).

In addition, most database tools require table sizes to be declared before data are loaded. It is therefore important to prepare reasonably accurate estimates.

Click Here Section Map Click Here RELATIONSHIP WITH OR Click Here MANAGING
Click Here OVERVIEW Click Here MAIN Click Here RESOURCES

© 2002 The OR Society

Top of Page