About OR
OR Topics - Data Warehousing & Business Intelligence
OVERVIEW
Data warehousing concepts

Data warehouse

A data warehouse is a central repository containing stable, accurate, consistent, clearly understood data that are needed for management information and decision making across the whole organisation.

The data assembled in a warehouse are likely to have been drawn from a variety of source systems. Integrating these disparate sources into a consistent, enterprise-wide framework can be a major challenge. Customers, for example, may be identified differently in each source system.

Usually, the source data are re-organised around a particular subject, restructured specifically to suit reporting and analysis, and stored in a separate database. Data may also be summarised, though this needs careful consideration (see design issues). Each of these changes can have a marked impact on performance.

Most data warehouses include a copy of the data in the organisation’s operational systems. Typically, copies will be taken at regular intervals in order to build up an historical database capable of revealing patterns and trends over time. Clearly the volume of data can be substantial, so the level of detail retained is a key design consideration.

A warehouse may also contain external data, and other information previously kept by users in personal spreadsheets and databases e.g. forecasts and competitor comparisons.

The physical form in which these data are held in the warehouse is another major design consideration, but this has no real bearing on whether or not something can be considered a data warehouse. In principle, everything could be stored in flat files, but in practice, most data warehouses use a relational database. This is usually more efficient.

Data mart

A data mart is a similar information store created for a specific purpose e.g. sales analysis or performance measurement. It is likely to be tailored to the needs of one or two departments or functional groups within the organisation.

Different data marts may be stored in different locations on different platforms using different database products.

Mart or warehouse?

A data mart need not be small, but is likely to contain a subset or summary of the detailed information available in the warehouse. It will be structured to optimise the specific reports and analyses needed by a clearly defined group of users, and is much easier to build than a complete data warehouse.

A central data warehouse can feed multiple data marts, with overlapping content. Each mart then provides a customised view of the organisation, based on consistent data from the main warehouse.

The warehouse may be allowed to grow from the first mart to be implemented, possibly sharing the same hardware platform and database. This approach can lead to major problems and rework as the warehouse expands (see pitfalls).

Data staging

The most difficult and time-consuming aspect of building a data warehouse is taking data from disparate source systems, converting them into a consistent form that can be loaded into the warehouse, checking their quality and automating this process. This is known as data staging and typically accounts for 70-80% of the effort in a data warehousing initiative.

Whilst the steps needed to physically move data from one system to another may be technically complex, the real issues lie with the structure and interpretation of the data itself: if all the source systems were consistent, there would be no need to build the warehouse! The political hurdles are even more significant, and include:

  • establishing who owns what data;
  • agreeing standard terminology, definitions and hierarchies;
  • deciding which sources to use;
  • securing resources to clean up the data.

Meta data

One of the main reasons for building a data warehouse is to provide information that is clearly understood by the business. It is therefore essential to capture and store details of the origin, location, definition, quality and freshness of the data in the warehouse. This data about the data is called meta data.

In an ideal world, meta data would be held in a standard format that could be shared by different components and tools in the warehouse environment. In practice, this is extremely difficult to achieve, but an industry group called the Meta Data Coalition are working towards it.

Business intelligence

Business Intelligence is a field closely associated with data warehousing, and is focused on the exploitation of data for business benefit, i.e. on reporting, analysis and decision support. Many business intelligence tools have evolved from those used to build executive information systems.

Most of the analytic techniques available are well known in the operational research community, but it is instructive to review the way they are seen in a data warehousing context and by the various tool vendors. Organisations usually start with the simplest and work towards the more sophisticated in 4 stages:

Stage 1 - Query and reporting

The new data warehouse enables many more people to access the basic information they need to monitor performance regularly and take routine decisions. The focus is very much on improved management reporting.

Stage 2 - Multi-dimensional analysis

OLAP tools make it much easier for people to explore the data, investigate exceptions, and share insights (see OLAP below). The focus shifts to encouraging more widespread use of basic quantitative analysis.

Stage 3 - Statistical analysis

Specialists are able to use more and better data from the warehouse for rigorous analysis, to test theories and establish which patterns are significant.

Stage 4 - Data mining

Genetic algorithms, neural networks and other mathematical techniques can be used to search for useful patterns and relationships that no one previously suspected. This requires large samples of data, specialist software and a combination of subject area and technical expertise.

Because these techniques require different tools and levels of expertise, it is quite normal to provide several different business intelligence tools to different user groups.

Most vendors will claim to have tools covering all these techniques, but few are strong in more than one category. For example, several OLAP vendors have developed CHAID based modules that can generate simple decision trees, which they are marketing as data mining tools. The serious players in this category are SAS and SPSS.

OLAP

The term ‘on-line analytic processing’ is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations.

On line transaction processing (OLTP) focuses on capturing and updating information efficiently. This works best in a normalised, relational database, where every piece of data is stored in only one place, as part of a single record in a specific table. Management reporting, on the other hand, usually requires many records to be summarised, and information from different parts of the database to be combined, e.g. to derive a useful ratio. Good performance requires a different data structure, and the use of aggregates.

OLAP tools represent data as if it were held in one or more multi-dimensional arrays, known as cubes, with cells like a spreadsheet. These cubes often have more than 3 dimensions, so strictly speaking they should be called hypercubes, but it is much easier to visualise and explain how OLAP cubes are structured in plain 3-D.

The edges of the cube represent the important dimensions of the business, such as time, country and product. One edge usually represents different measures, but some tools use separate cubes for each measure.

Each cell can be uniquely identified by specifying a member from each dimension e.g. {1999, Cost of sales, UK}. By selecting one or more members from each dimension, the user can slice and dice the cube to view almost any subset of the data from different perspectives.

Dimension members may be organised into a hierarchy, with summary level members such as year, region or product group. The user can then drill down from one level to the next to see more detailed data, and then drill back up.

Most OLAP tools also enable the user to switch instantly between tabular and chart formats, and to save favourite views of the data as reports for future reference.

By manipulating cubes in this way, it is easy to answer questions such as these:

  • What were our 3 best selling products last month?
  • How does the number of customer complaints vary by store?
  • Which regions have grown fastest over the last 5 years?
  • How has our business mix changed since last year?

OLAP has become popular because it makes it relatively easy to explore a data warehouse or data mart, discover simple patterns and trends, and to share the insights gained.

Report types

The most common way to access a data mart or data warehouse is to run reports. Another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use:

Standard reports are designed and built centrally, then published for general use. They are often run at regular intervals to show the latest available data and distributed to those who need or request them. They can be divided into three sub-types:

  • Static reports (also known as canned reports) are completely fixed, and require no further input from the user, making them the fastest and easiest to use.
  • Parameterised reports have a fixed layout, but allow the user to specify which data are to be included, usually through a series of prompts (e.g. which country and time period). They are easy to use, but take longer to initiate and, usually, to run.
  • Interactive reports allow the user to manipulate the structure, layout and content of a generic report via buttons on the screen. They are a little harder to use, but once familiar with the basic interface, users have far greater flexibility, and can work much faster.

Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise, but need not involve programming, as most modern reporting tools are able to generate SQL.

OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic, interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.

Having built a data warehouse or data mart, most organisations want to exploit it as quickly as possible. It is tempting to start by replacing all existing reports, but there is often considerable scope for rationalisation, as many will have fallen into disuse. Also, it may be possible to replace hundreds of static reports with a few dozen interactive reports, and to design these so that they cover a large proportion of likely ad hoc queries as well.

Understanding these report types helps to clarify business users’ requirements and select appropriate software, but it is also important to understand the needs of different types of user.

Types of user

Most warehouse implementation teams find that the user population can be divided into three broad groups:

  • 80% casual users, who make infrequent use of the warehouse, and prefer static or parameterised reports.
  • 15% active users, who make frequent use of standard reports, and sometimes require assistance with ad hoc requests. They are usually comfortable with interactive reports but still use static and parameterised reports.
  • 5% power users, who prefer interactive reporting and frequently create their own ad hoc queries. They are often expert spreadsheet users, and regularly extract data for further analysis. Most OR analysts probably fit into this category.

In addition, the warehouse development team will need one or more expert users to write standard reports for central publication and provide training and support for end users. This role requires both business and technical knowledge, and is normally fulfilled by a management information specialist. 

Click Here Section Map Click Here RELATIONSHIP WITH OR Click Here MANAGING
Click Here MAIN Click Here BUILDING Click Here RESOURCES

© 2002 The OR Society

Top of Page