|
The key steps in developing a data warehouse
can be summarised as follows:
- Project initiation
- Requirements
analysis
- Design (architecture,
databases and applications)
- Construction
(selecting and installing tools, developing data feeds and
building reports)
- Deployment (release
& training)
- Maintenance

It is advisable to conduct a pilot exercise before
embarking on a full-scale development effort. This will include
most of the above steps, and provides an opportunity to:
- understand new concepts and processes, and identify potential
problems;
- make more realistic plans and manage expectations;
- evaluate alternative tools;
- demonstrate benefits and gain management commitment.
Testing should be an integral part of construction, not a
separate step in the development process.
Project
initiation
No data warehousing project should commence without:
- a clear statement of business objectives and scope;
- a sound business case, including measurable benefits;
- an outline project plan, including estimated costs, timescales
and resource requirements;
- high level executive backing, including a commitment to
provide the necessary resources;
A small team is usually set up to prepare and present a suitable
project initiation document. This is normally a joint effort
between business and IT managers. If the organisation has
limited data warehousing experience, it is useful to obtain
external advice at this stage. If the project goes ahead,
the project plan and business case should be reviewed at each
stage.
It is widely regarded as good practise to develop a data
warehouse in small, manageable phases (see pitfalls).
Thus the analysis, design, construction and deployment steps
will be repeated in cycles.
It is generally a good tactic to provide something that is
not already available during the first phase, as this will
help to stimulate real interest. This could be new data or
enhanced functionality. It is also better to start with something
relatively easy, which the warehousing team can deliver whilst
still learning the ropes.
See project
management techniques for more information on relevant
methodologies and useful references.
Requirements
analysis
Establishing a broad view of the business’ requirements should
always be the first step. The understanding gained will guide
everything that follows, and the details can be filled in
for each phase in turn.
Collecting requirements typically involves 4 principal activities:
- Interviewing a number of potential users to find out what
they do, the information they need and how they analyse
it in order to make decisions. It is often helpful to analyse
some of the reports they currently use.
- Interviewing information systems specialists to find out
what data are available in potential source systems, and
how they are organised.
- Analysing the requirements to establish those that are
feasible given available data.
- Running facilitated workshops that bring representative
users and IT staff together to build consensus about what
is needed, what is feasible and where to start.
Design
The goal of the design process is to define the warehouse
components that will need to be built. The architecture, data
and application designs are all inter-related, and are normally
produced in parallel.
Architecture design
The warehouse architecture describes all the hardware and
software components
that form the data warehousing environment and explains:
- how the components will work together;
- where they are located (geographically and on what platform);
- who uses them;
- who will build and maintain them.
The architecture needs to be considered at the outset,
as this provides a framework for the selection of tools
and the detailed design of individual components during
the first and subsequent phases of development.
Data design
This step determines the structure of the primary data
stores used in the warehouse environment, based on the outcome
of the requirements
analysis. It is best to produce a broad outline quickly,
and then break the detailed design into phases, each of
which usually progresses from logical to physical:
The logical design determines what data are stored
in the main data warehouse and any associated functional
data marts. There are a number of data modelling
techniques that can be used to help.
Once the logical design is established, the next step is
to define the physical characteristics of individual
data stores (including aggregates)
and any associated indexes required to optimise performance
(see database
optimisation).
The data design is critical to further progress, in that
it defines the target for the data feeds and provides the
source data for all reporting and analysis applications.
Application design
The application design describes the reports and analyses
required by a particular group of users, and usually specifies:
- a number of template report layouts;
- how and when these reports will be delivered to users;
- the functional requirements for the user interface.
There may be one or more applications associated with each
data mart or phase of development.
Construction
Warehouse components are usually developed iteratively and
in parallel. That said, the most efficient sequence to begin
construction is probably as follows:
1) Tool selection & installation
Selecting tools is best carried out as part of a pilot
exercise, using a sample of real data. This allows the development
team to assess how well competing tools handle problems
specific to their organisation, and to test system performance
before committing to purchase.
The most important choices are the:
Clearly these need to be compatible, and it is worth checking
reference sites to make sure they work well together.
It pays to define standards and configure the development,
testing and production environments as soon as tools are
installed, rather than waiting until development is well
underway. Most vendors are willing to provide assistance
with these steps, and this is normally well worth the investment.
2) Data staging system
This comprises the physical warehouse database, data feeds
and any associated data marts and aggregates.
The following steps are typical:
- Create target tables in the central warehouse database;
- Request initial and regular extracts from source
systems;
- Write procedures to transform extract data ready for
loading (optionally creating interim tables in a data
staging area);
- Write procedures to load initial data into the warehouse
(using a bulk loader);
- Create and populate any data
marts;
- Write procedure to load regular updates into the warehouse;
- Develop special procedures for a once-off bulk load
of historic data;
- Write validation/exception handling procedures;
- Write archiving & backup procedures;
- Create a provisional set of aggregates;
- Automate all regular procedures;
- Document the whole process.
However thorough the design process, problems with the
real data are bound to surface at this stage. Substantial
time should be allowed to resolve any issues that arise,
establish appropriate data
cleansing procedures (preferably within the source systems
environment) and to validate all data before they are released
for live use.
3) Application development
This step can begin once a sample or initial extract has
been loaded, but it is usually best to leave the bulk of
application development until the underlying data mart (or
part of the central warehouse) and associated meta-data
(especially object names) are stable.
It is a good idea to involve users in the development of
reports and analytic
applications, preferably through prototyping, but at
least by asking them to carry out acceptance testing. Most
modern business intelligence tools do not require programming,
so it is possible for non-IT staff to build some of their
own reports as well.
Deployment
It is too often assumed that the first version of a data
warehouse can be rolled out in a matter of weeks, simply by
showing all the users how to use the new reporting tools.
In practice, training needs to cover not just the basic use
of the tools, but also the data that have been made available,
and, more significantly perhaps, the new business processes
or different ways of working that are intended. This training
usually works best if delivered on a one-to-one basis.
As well as training, planning for deployment needs to cover:
- Installing and configuring desktop PCs - any hardware
upgrades or amendments to the ‘standard build’ need to be
organised well in advance;
- Implementing appropriate security measures - to control
access to applications and data;
- Setting up a support organisation to deal with questions
about the tools, the applications and the data. However
thoroughly the data were checked and documented prior to
publication, users are likely to spot anomalies requiring
investigation and to need assistance interpreting the results
they obtain from the warehouse and reconciling these with
existing reports;
- Providing more advanced tool training later, when users
are ready, and assisting potential power
users to develop their first few reports.
If the first users find errors and inconsistencies in the
data, don’t feel comfortable with the tool or can’t be bothered
to learn how to use it properly, or won’t accept new procedures
and responsibilities, all the time spent building the warehouse
may ultimately be wasted. The following guidelines will help
to reduce these risks:
- Do not start deployment until the data are ready (available
and validated) and the tools and update procedures have
been tested;
- Use a small, representative group to try out the finished
system before rolling out, including users with a range
of abilities and attitudes;
- Do not grant system access to users until they have been
trained.
Maintenance
A data warehouse is not like an OLTP
system: development is never finished, but follows an iterative
cycle (analyse – build – deploy). Also, once live, a warehousing
environment requires substantial effort to keep running. Thus
the development team should not anticipate handing over and
moving on to other projects, but to spend half of their time
on support and maintenance.
The most important activities are:
- Monitoring the realisation of expected benefits;
- Providing ongoing support to users (see deployment);
- Training new staff;
- Assisting with the identification and cleansing of dirty
data;
- Maintaining both feeds & meta-data
as source systems change over time;
- Tuning the warehouse for maximum performance (this includes
managing indexes and aggregates
according to actual usage);
- Purging dormant
data;
- Recording successes and using these to continuously market
the warehouse.
In addition, mechanisms need to be established to manage
growth, in particular the prioritisation of requested enhancements,
which often require the addition of further data sources.
|