|
The design of the central data warehouse and any associated
data marts is absolutely critical to success. A good design
must combine simplicity, flexibility and performance. It will
also have to balance the business’ stated requirements with
what is feasible given the quality of available data and the
capabilities of current technology. Some compromise is inevitable.
The design is normally divided into two stages, known as
logical and physical design.
The logical design focuses on the data, and is independent
of the platform on which they are held. The physical design
centres on optimising the performance of the chosen database
engine.
The most important design considerations are these:
- Choosing the
right database schema
- Deciding the level
of detail
- Choosing the
dimensions and measures
- How to
avoid temporal inconsistency
- Deciding
what summary data to store
- Where to calculate
derived data
- Deciding
the update frequency
- Agreeing a policy
for archiving
- Selecting a
database platform
- Database
optimisation
Choosing
the right database schema
Transactional databases are usually normalised
to optimise on line
transaction processing. There are two main reasons for
structuring query databases differently in a warehouse environment:
- Normalised data models are almost invariably too complex
for users to understand, or for query tools to navigate
– they need to be simplified.
- With a normalised database, many standard reports and
analyses require large, multi-table joins. Also, ad hoc
queries tend to require full table scans, because the right
indexes are not in place. In both cases, the effect on performance
is often disastrous.
A star schema
database can overcome both of these problems, and is now widely
accepted as the best design for individual data marts. The
central warehouse may comprise a superset of all stars, or
a set of master tables from which these can be easily derived,
with a partially normalised schema.
In a star schema database, each subject area or business
process has a single fact
table linked to a number of smaller dimension
tables. The name thus reflects the shape of the resulting
entity relationship diagram.

The fact table contains relevant measures
and a composite key defining what these measures describe.
This key is chosen to represent the lowest level of detail
in each of the dimensions
required for analysis. Thus each fact table record represents
a unique combination of dimensions, equivalent to a cell in
the corresponding cube (see OLAP
concepts).
The dimension tables contain attributes of the associated
dimensions, but no measures. Specifically, they define the
hierarchical relationships used for drill down and
aggregation. By joining a fact table to several dimension
tables through foreign keys, any subset or summary of the
data can be very efficiently derived.
Most dimensions are used in more than one data mart, so it
is common to use the same dimension table with several fact
tables. Building conformed
dimensions is highly desirable because it helps to ensure
consistency (see pitfalls).
In the past, vendors have tended to fuel debate on the relative
merits of relational and multi-dimensional databases (i.e.
ROLAP versus
MOLAP). With
a star schema
design, the two are logically equivalent, so the issue becomes
marginal, and can be addressed as part of tool selection (see
database platform).
ROLAP solutions tend to be more flexible at the expense of
performance. MOLAP solutions may not cope so well with very
large data volumes. Many vendors now advocate hybrid solutions
(HOLAP), offering the best of both worlds.
Deciding
the level of detail
The more detailed the data included in the warehouse, the
greater the range of queries that can be answered, but the
longer they will take to run. Storing more detail also increases
the cost and effort of building and maintaining the warehouse,
along with the time taken to load updates, re-build aggregates
and prepare summary reports.
Users typically ask for more detail than they ever use, but
the trend is towards ever more detailed analysis.
A sensible compromise is to design data marts so that they
can handle 80% of queries, but to keep the most detailed available
data in the main warehouse, where they can be accessed if
needed – more slowly, but less often. Some business intelligence
tools have facilities to drill
through from a single cell in a cube
to the underlying transaction records.
One of the key design decisions is thus to specify the grain
(or granularity) of each fact
table, that is the combination of dimension keys that
will identify each record e.g. {store * product * month}.
The trick is to retain enough detail to support precise cuts
of the data, but not necessarily enough to access individual
source system records.
Choosing
the dimensions and measures
Inexperienced warehouse developers sometimes have difficulty
deciding which attributes in the source data should be treated
as dimensions and which as measures.
As a general guide, dimensions are:
- discrete valued;
- textual (or at least non-additive, like part numbers);
- used in constraints and filters;
- used to classify and compare.
Common dimensions include time, product, customer, supplier,
location, department, channel, account type and age group.
Measures, on the other hand, are:
- continuously valued;
- numeric;
- (ideally) additive across all dimensions.
Examples measures include sales, costs, headcount, account
balances, success ratios, market share and growth rates.
A common design dilemma is how to handle repeating groups,
that is groups of similar measures, e.g. telephone sales,
web sales, trade sales and retail sales. Often these are better
represented as a single measure (sales) and a dimension (sales
category).
Currency, on the other hand, is rarely a useful dimension,
because no measures can be added across it.
Dimensions can also be derived from measures by grouping
them into bands, e.g. large, medium and small turnover companies
or age bands 1-18, 19-25, 25-40, 40-65, over 65 etc.
How
to avoid temporal inconsistency
The data warehouse will need to deal with a number of time-related
issues, avoiding possible inconsistencies:
Flickering
Operational data can change minute by minute, leaving transactional
databases in a constant state of ‘flickering’. These changes
need to be summarised to give a stable, consistent overview,
otherwise different users running the same report at different
times may not get the same answers. Flickering is usually
prevented by taking snapshots at regular intervals (e.g.
month end stock levels) or by calculating the net change
since the previous snapshot (e.g. total sales in month).
Slowly changing dimensions
In many cases, the hierarchical relationships or groupings
within a dimension will change gradually over time, e.g.
when employees change department, or product groups are
revised. This makes it difficult to compare summary level
data across time periods.
Slowly changing dimensions invariably complicate the design
process, but broadly speaking there are three ways to tackle
this issue:
- Overwriting existing dimension records will re-state history;
- Adding dimension records with different time spans will
segment history;
- Creating current or previous attributes within the dimension
will preserve both old and new views of history, but makes
queries more complex.
See Kimball
for an in-depth discussion of this topic.
Alternative timeframes
There is often a requirement to produce reports based on
different views of the time dimension, e.g. calendar years
and fiscal years.
It is also common to select data for different reports
using different dates e.g. transaction date or effective
date.
Both situations can usually be handled by creating logical
views or aliases of the master (physical) time dimension
table, one for each required role.
Deciding
what summary data to store
Summary level data maintained within the data warehouse are
sometimes referred to as aggregates.
They have a number of advantages:
- Common consolidations are performed only once;
- Summary level queries run much faster;
- Groupings and consolidation rules can be tightly defined
and centrally maintained.
On the other hand, aggregates can substantially increase
the volume of data stored, but it is difficult to predict
by how much. Typically, aggregates will occupy as much space
as the base data, but because there are so many combinations
of dimensions and levels, the additional space may actually
exceed that for the granular data. The uncertainty arises
because in practise the data tend to cluster in pockets, so
that not every combination is needed. A useful rule of thumb
to balance performance and space is to avoid storing aggregates
for levels where each dimension member has fewer than 10 children
at the level beneath it.
Furthermore, summary data must be recalculated whenever underlying
data, groupings or consolidation rules change. This can increase
the time taken to refresh the warehouse, which may be constrained
by a tight overnight time window.
Aggregates offer the single most effective way to boost query
performance, but the price is increased overheads. Also, as
usage patterns change, different aggregates may be needed.
This is therefore as much an ongoing maintenance issue as
it is a design issue.
Where
to calculate derived data
Most of the tools used to create data warehouse components
can perform at least simple calculations. In deciding where
to calculate any derived data needed, the design team needs
to consider the following factors:
- whether to store or calculate on the fly;
- whether to define calculation formulae centrally, or let
end users do so;
- the complexity of any formulae and calculation rules;
- the sequence in which calculations need to take place.
Ratios such as % growth provide a good example of the need
to control calculation precedence:
If three stores in the UK experienced sales growth of
40%, 50% and 60% respectively, a simple SQL
query would calculate the sales growth for the UK as a
whole as 150%. The ratio should, of course, be calculated
after aggregating, not before.
Most OLAP
tools can handle situations like this, but end users often
do not understand the importance of calculation precedence,
and may obtain invalid results if not trained properly. The
best way to avoid precedence issues is to perform all calculations
in the same place, rather than distributing them across different
warehouse components.
Power users
are the most likely to think of new measures, and may require
a reporting and analysis tool that allows them to enter their
own calculations. Unfortunately, most of the available tools
provide a very limited range of scientific and logical functions.
It can be difficult, for example, to trap for division by
zero. In most cases, therefore, users will want to export
data to Excel.
It is impossible to anticipate and pre-define every measure
users will ever need, but the most frequently used are best
included in the overall system design. To ensure consistency,
it is generally better to define formulae once, centrally,
rather than repeatedly, in individual reports.
Storing derived data may take up substantial extra space,
but is worth considering if it leads to a significant performance
improvement, or reduces the risk of error.
Deciding
the update frequency
Some parts of the warehouse could be updated more often than
others. Most will be refreshed on a daily, weekly or monthly
basis, depending on when updated data become available.
Deciding the update frequency involves striking a balance
between three factors:
- freshness- the value users place on gaining access to
the latest available data as soon possible;
- quality- the importance users attach to making sure the
data is complete, correct and stable before it is published;
- availability- the proportion of working time for which
the reporting system is not taken off-line to permit uploads.
Agreeing
a policy for archiving
The design team will need to decide how much historical data
to keep in the warehouse. This will impact the volume of disk
storage required, and the scope of the initial data loading
exercise.
Users typically ask to keep more than they really need. The
following approaches are often used to control the associated
costs:
- keeping summary level data for longer than the detail;
- planning to archive older, little used data to off-line
media;
- monitoring usage in order to identify dormant
data, so that it can be deleted or archived.
Before committing to load the previous 5 years’ data into
the warehouse, the design team should consider any changes
made to the source systems:
- recently introduced data items may not be available this
far back;
- data structures may have been modified;
- individual fields may have been used differently, or had
different validation rules imposed;
- dimension hierarchies may have changed, leading to temporal
inconsistencies in the data.
These factors could make the initial data cleansing and transformation
exercise much more difficult. One way to avoid this is to
load less data initially, and gradually accumulate a longer
history as new data become available. Another is to add historic
data in suitable batches, working backwards in reverse chronological
order.
Selecting
a database platform
Perhaps the most important consideration here is scaleability:
a successful data warehouse is bound to grow as new data sources
are added. If the organisation itself is growing, the volumes
of transaction data that need to be loaded will also increase,
as may the number of users. If information is eventually made
available to partners and customers, the load on the query
database could easily increase by one or two orders of magnitude.
Multi-dimensional
databases such as Oracle Express, Hyperion Essbase or
Cognos PowerPlay give excellent performance with small and
medium data sets, but may not be able to cope with the volumes
of data required in a large data warehouse. They are most
often used for functional data marts (see components).
Many of the leading relational
database vendors (notably Microsoft and Oracle) now include
effective support for star schema designs (see choosing
the schema above), and so are able to deliver satisfactory
performance in a query and analysis environment, though they
are not always so functionally rich.
For these reasons, industrial strength relational database
engines such as Oracle, DB2 and Terradata are normally chosen
for the central data warehouse. Microsoft SQL Server has also
become an attractive option for small, medium and sometimes
larger organisations, but tools such as Microsoft Access are
unlikely to cope.
Database optimisation
The physical design of a database has a major impact on performance,
and may need to be tuned as the data warehouse evolves and
usage patterns change. This normally requires a specialist
database administrator, and a full discussion of this topic
is beyond the scope of this article.
The options available are specific to the database engine,
but usually include:
- partitioning the database e.g. into separate tables for
each year of data;
- choosing different types of index (e.g. bitmap for small
dimension tables, or b-tree for larger fact tables).
In addition, most database tools require table sizes to be
declared before data are loaded. It is therefore important
to prepare reasonably accurate estimates.
|