|
Data warehouse
A data warehouse is a central repository containing stable,
accurate, consistent, clearly understood data that are needed
for management information and decision making across the
whole organisation.
The data assembled in a warehouse are likely to have been
drawn from a variety of source systems. Integrating these
disparate sources into a consistent, enterprise-wide framework
can be a major challenge. Customers, for example, may be identified
differently in each source system.
Usually, the source data are re-organised around a particular
subject, restructured specifically to suit reporting and analysis,
and stored in a separate database. Data may also be summarised,
though this needs careful consideration (see design
issues). Each of these changes can have a marked impact
on performance.
Most data warehouses include a copy of the data in the organisation’s
operational systems. Typically, copies will be taken at regular
intervals in order to build up an historical database capable
of revealing patterns and trends over time. Clearly the volume
of data can be substantial, so the level of detail retained
is a key design consideration.
A warehouse may also contain external data, and other information
previously kept by users in personal spreadsheets and databases
e.g. forecasts and competitor comparisons.
The physical form in which these data are held in the warehouse
is another major design consideration, but this has no real
bearing on whether or not something can be considered a data
warehouse. In principle, everything could be stored in flat
files, but in practice, most data warehouses use a relational
database. This is usually more efficient.
Data mart
A data mart
is a similar information store created for a specific purpose
e.g. sales analysis or performance measurement. It is likely
to be tailored to the needs of one or two departments or functional
groups within the organisation.
Different data marts may be stored in different locations
on different platforms using different database products.
Mart
or warehouse?
A data mart need not be small, but is likely to contain a
subset or summary of the detailed information available in
the warehouse. It will be structured to optimise the specific
reports and analyses needed by a clearly defined group of
users, and is much easier to build than a complete data warehouse.
A central data warehouse can feed multiple data marts, with
overlapping content. Each mart then provides a customised
view of the organisation, based on consistent data from the
main warehouse.
The warehouse may be allowed to grow from the first mart
to be implemented, possibly sharing the same hardware platform
and database. This approach can lead to major problems and
rework as the warehouse expands (see pitfalls).
Data staging
The most difficult and time-consuming aspect of building
a data warehouse is taking data from disparate source systems,
converting them into a consistent form that can be loaded
into the warehouse, checking their quality and automating
this process. This is known as data
staging and typically accounts for 70-80% of the effort
in a data warehousing initiative.
Whilst the steps needed to physically move data from one
system to another may be technically complex, the real issues
lie with the structure and interpretation of the data itself:
if all the source systems were consistent, there would be
no need to build the warehouse! The political hurdles are
even more significant, and include:
- establishing who owns what data;
- agreeing standard terminology, definitions and hierarchies;
- deciding which sources to use;
- securing resources to clean up the data.
Meta data
One of the main reasons for building a data warehouse is
to provide information that is clearly understood by the business.
It is therefore essential to capture and store details of
the origin, location, definition, quality and freshness of
the data in the warehouse. This data about the data is called
meta data.
In an ideal world, meta data would be held in a standard
format that could be shared by different components and tools
in the warehouse environment. In practice, this is extremely
difficult to achieve, but an industry group called the Meta
Data Coalition are working towards it.
Business
intelligence
Business Intelligence is a field closely associated with
data warehousing, and is focused on the exploitation of data
for business benefit, i.e. on reporting, analysis and decision
support. Many business intelligence tools have evolved from
those used to build executive information systems.
Most of the analytic techniques available are well known
in the operational research community, but it is instructive
to review the way they are seen in a data warehousing context
and by the various tool vendors. Organisations usually start
with the simplest and work towards the more sophisticated
in 4 stages:
Stage 1 - Query and reporting
The new data warehouse enables many more people to access
the basic information they need to monitor performance regularly
and take routine decisions. The focus is very much on improved
management reporting.
Stage 2 - Multi-dimensional analysis
OLAP tools make it much easier for people to explore the
data, investigate exceptions, and share insights (see OLAP
below). The focus shifts to encouraging more widespread
use of basic quantitative analysis.
Stage 3 - Statistical analysis
Specialists are able to use more and better data from the
warehouse for rigorous analysis, to test theories and establish
which patterns are significant.
Stage 4 - Data mining
Genetic algorithms, neural networks and other mathematical
techniques can be used to search for useful patterns and
relationships that no one previously suspected. This requires
large samples of data, specialist software and a combination
of subject area and technical expertise.
Because these techniques require different tools and levels
of expertise, it is quite normal to provide several different
business intelligence tools to different user groups.
Most vendors will claim to have tools covering all these
techniques, but few are strong in more than one category.
For example, several OLAP vendors have developed CHAID
based modules that can generate simple decision trees, which
they are marketing as data mining tools. The serious players
in this category are SAS
and SPSS.
OLAP
The term ‘on-line analytic processing’ is used to distinguish
the requirements of reporting and analysis systems from those
of transaction processing systems designed to run day-to-day
business operations.
On line transaction processing (OLTP)
focuses on capturing and updating information efficiently.
This works best in a normalised,
relational
database, where every piece of data is stored in only
one place, as part of a single record in a specific table.
Management reporting, on the other hand, usually requires
many records to be summarised, and information from different
parts of the database to be combined, e.g. to derive a useful
ratio. Good performance requires a different data structure,
and the use of aggregates.
OLAP tools represent data as if it were held in one or more
multi-dimensional arrays, known as cubes,
with cells like a spreadsheet. These cubes often have more
than 3 dimensions, so strictly speaking they should be called
hypercubes, but it is much easier to visualise and explain
how OLAP cubes are structured in plain 3-D.
The edges of the cube represent the important dimensions
of the business, such as time, country and product. One edge
usually represents different measures,
but some tools use separate cubes for each measure.

Each cell can be uniquely identified by specifying a member
from each dimension e.g. {1999, Cost of sales, UK}. By selecting
one or more members from each dimension, the user can slice
and dice the cube to view almost any subset of the data
from different perspectives.
Dimension members may be organised into a hierarchy,
with summary level members such as year, region or product
group. The user can then drill
down from one level to the next to see more detailed data,
and then drill back up.
Most OLAP tools also enable the user to switch instantly
between tabular and chart formats, and to save favourite views
of the data as reports for future reference.
By manipulating cubes in this way, it is easy to answer questions
such as these:
- What were our 3 best selling products last month?
- How does the number of customer complaints vary by store?
- Which regions have grown fastest over the last 5 years?
- How has our business mix changed since last year?
OLAP has become popular because it makes it relatively easy
to explore a data warehouse or data mart, discover simple
patterns and trends, and to share the insights gained.
Report types
The most common way to access a data mart or data warehouse
is to run reports. Another very popular approach is to use
OLAP tools. To compare
different types of reporting and analysis interface, it is
useful to classify reports along a spectrum of increasing
flexibility and decreasing ease of use:
Standard reports are
designed and built centrally, then published for general use.
They are often run at regular intervals to show the latest
available data and distributed to those who need or request
them. They can be divided into three sub-types:
- Static reports (also
known as canned reports) are completely fixed, and
require no further input from the user, making them the
fastest and easiest to use.
- Parameterised reports
have a fixed layout, but allow the user to specify which
data are to be included, usually through a series of prompts
(e.g. which country and time period). They are easy to use,
but take longer to initiate and, usually, to run.
- Interactive reports
allow the user to manipulate the structure, layout and content
of a generic report via buttons on the screen. They are
a little harder to use, but once familiar with the basic
interface, users have far greater flexibility, and can work
much faster.
Ad hoc
queries, as the name suggests, are queries
written by (or for) the end user as a one-off exercise. The
only limitations are the capabilities of the reporting tool
and the data available. Ad hoc reporting requires greater
expertise, but need not involve programming, as most modern
reporting tools are able to generate SQL.
OLAP tools can be thought
of as interactive reporting environments: they allow the user
to interact with a cube of data and create views that can
be saved and reused as generic, interactive reports. They
are excellent for exploring summarised data, and some will
allow the user to drill
through from the cube into the underlying database to
view the individual transaction details.
Having built a data warehouse or data mart, most organisations
want to exploit it as quickly as possible. It is tempting
to start by replacing all existing reports, but there is often
considerable scope for rationalisation, as many will have
fallen into disuse. Also, it may be possible to replace hundreds
of static reports with a few dozen interactive reports, and
to design these so that they cover a large proportion of likely
ad hoc queries as well.
Understanding these report types helps to clarify business
users’ requirements and select appropriate software, but it
is also important to understand the needs of different types
of user.
Types of
user
Most warehouse implementation teams find that the user population
can be divided into three broad groups:
- 80% casual users, who
make infrequent use of the warehouse, and prefer static
or parameterised reports.
- 15% active users, who
make frequent use of standard reports, and sometimes require
assistance with ad hoc requests. They are usually comfortable
with interactive reports but still use static and parameterised
reports.
- 5% power
users, who prefer interactive reporting and
frequently create their own ad hoc queries. They are often
expert spreadsheet users, and regularly extract data for
further analysis. Most OR analysts probably fit into this
category.
In addition, the warehouse development team will need one
or more expert users to write standard reports for central
publication and provide training and support for end users.
This role requires both business and technical knowledge,
and is normally fulfilled by a management information specialist.
|