RESOURCES |
Glossary
of data warehousing terms |
Ad hoc query
A once-off data extraction exercise to answer an unplanned
question. Also see report
types.
Aggregate
Data stored at a summary level to save space and/or improve
query performance.
Analytic applications
Applications which enable the user to take warehouse data
and further manipulate it in order to make better-informed
decisions. Examples could include demand forecasting, pricing,
competitor analysis and customer segmentation.
CHAID (Chi-squared Automated Interaction
Detector)
A market segmentation technique designed to split a population
into two or more categories based on available attributes
e.g. gender, employment status, race, age band etc. The results
are often presented in the form of a tree.
Clickstream analysis
Analysis of data derived from web activity logs to reveal
the sequence of page "hits". Usually needs to be
linked to other customer data to derive useful insights.
Closed loop applications
Applications which enable authorised users to save data they
have derived back into the warehouse for future reference.
Common examples include budgets, forecasts and performance
targets, which can then be compared to actual data when they
become available. Credit scores, customer segments and allocated
costs might also be prepared by analysts and then incorporated
in the warehouse for general use.
Conformed dimensions
Standardised dimensions that can be used across the whole
organisation. A conformed dimension
table used with different fact
tables helps to ensure consistency between data
marts.
Cube
A multi-dimensional representation of data that can be viewed
from different perspectives.
Data cleansing
The process of correcting errors and removing inconsistencies
before importing data into the warehouse.
Data dictionary
The index to all data in the warehouse, and home of meta-data.
Data mart
Similar to a data warehouse, but holding only the data needed
for a specific business function or department e.g. sales
analysis. Also see concepts.
Data staging
The process of extracting, transforming, loading and checking
data on its way from source system to data warehouse. Copies
may be stored at intermediate steps in a data staging area.
Also see ETL and concepts.
Data warehouse
A central repository containing stable, accurate, consistent,
clearly understood data that are needed for management information
and decision making across the whole organisation. Also see
concepts.
Data warehousing
The process of creating a vision, planning, designing, building,
using, maintaining, managing and enhancing a data
warehouse and/or data
mart(s).
Decision processing
A data warehouse supports "decision processing"
when it becomes part of a closed
loop, analytic
application integrated into routine operational procedures.
Dimension
A business perspective useful for analysing data. A dimension
usually contains one or more hierarchies
that can be used to drill up or down to different levels of
detail. Typical dimensions include product, customer, time,
department, location and channel.
Dimensional modelling
The process of identifying the dimensions
required for analysis, defining the hierarchies and levels
they contain, and making sure they conform.
Dimension table
A table containing the data for one dimension
within a star schema.
The primary key is used to link to the fact
table, and each level in the dimension has a corresponding
field in the dimension table.
Dormant data
Warehouse data that is never accessed. Can be archived to
save disk space.
Drill down
A tool feature that allows the user to obtain more detailed
data at the click of a mouse. In OLAP,
users would drill down to the next level in a dimension hierarchy.
Drill
through
Another tool feature that allows the user to obtain more
detailed data. Having drilled
down to the lowest level of detail available in a given
cube or report, users may
need to drill through to another cube or report, e.g. to view
the basic transaction level data related to the selected cell.
Entity
In a relational
database, each table contains data describing the corresponding
business object or "entity", e.g. a customer, order
or order line.
Entity relationship
modelling
The process of identifying relevant business entities,
describing the relationships between them and listing their
associated attributes. The resulting model can be represented
as an entity relationship diagram, and defines the logical
structure for a relational
database.
ETL (Extract, Transformation & Load)
The process of extracting data from source systems, transforming
this into the required structure and loading into the data
warehouse. ETL tools are available to assist with this
process.
Fact table
The central table in a star
schema, containing the basic facts or measures
of interest.
Dimension fields are also included (as foreign keys) to link
to each dimension
table.
Fat client (see also thin
client)
A systems architecture in which fully functional software
is installed on each user’s PC. OLAP
performance depends largely on the processing power and memory
of the PC.
Filter
A selection criterion applied to limit the data to be included
in a report or analysis.
In SQL, filters are expressed as WHERE clauses, e.g. ‘WHERE
COUNTRY = UK’. With OLAP tools, filters are changed automatically
by slicing and dicing,
and by drilling up
and down.
Granularity
(or grain)
The level of detail of the facts (or measures)
stored in the data
warehouse.
Hierarchy
An organisation of data into a logical tree structure defining
parent-child relationships between the levels in a dimension.
Controls data consolidation and drill
down paths.
A typical time dimension would have a hierarchy based on
date, week, month, quarter and year.
Measure
A numeric value stored in a fact
table or cube. Typical
examples include sales value, sales volume, price, stock and
headcount.
Meta data
Data describing the data held in the warehouse. This may
include a description of the tables
and fields in the warehouse, what they mean, similar descriptions
of the source data and a mapping between the two, details
of the transformations made, the reliability of each item,
when it was last updated etc. Also see concepts.
MOLAP (Multi-dimensional OLAP)
On line analytic processing based on data stored in a multi-dimensional
database such as Hyperion’s Essbase or TM/1.
Multi-dimensional
database
A database in which cubes are physically stored in a format
optimised for OLAP. Most
use a proprietary structure that usually can only be accessed
by the vendor’s own OLAP tools.
Normalisation
The process of assigning each attribute to exactly one table
(based on the entity it
describes) so as to eliminate all data duplication. A key
step in the design of most on-line transaction processing
(OLTP) systems.
NUMA (Non-uniform memory architecture)
A relatively recent innovation in parallel
processing, combining elements of SMP
and MPP.
OLAP (On line analytic processing)
An intuitive approach to the analysis of patterns and trends
in data, characterised by the representation of that data
in multi-dimensional cubes,
which normally include summary data at one or more levels.
The cubes may be stored in a multi-dimensional database (see
MOLAP) or as star
schemas in a relational database (see ROLAP).
The term was coined to distinguish the way a data warehouse
is often used to track business performance from the way transactional
(OLTP) systems are used
in the day to day running of the business. It has since become
synonymous with the class of software tools that facilitate
this approach.
Also see concepts
OLTP (On line transaction processing)
Operational or OLTP systems are designed to optimise the
creation and updating of individual records. Typical transactions
might represent the addition of new customers, taking orders
or recording a change of address. Reporting and analysis systems,
by contrast, typically need to summarise large numbers of
individual transaction records, and perform better if summary
data is pre-calculated and stored (see aggregates).
ODS
(Operational data store)
A subject-oriented, integrated, frequently updated store
of detailed data needed to support transactional systems with
integrated data, e.g. the current master customer list shared
by several systems. If available, the ODS is often a good
source for dimension
data.
When the term was first coined, the ODS was expected to support
queries requiring data
at the most detailed level available, which typically had
to be excluded from the warehouse to keep size within reasonable
bounds. Thus the ODS would contain maximum detail and be refreshed
in real time, whereas the warehouse would be lightly summarised
and refreshed at regular intervals.
Nowadays, hardware costs are less of a limiting factor, and
it is more usual to restructure even the most detailed data
for query and analysis and include them in the warehouse,
possibly in near real time. Thus an ODS is less likely to
be queried directly by users, and acts more like a source
system.
Parallel processing
(SMP, MPP)
A computing architecture that can handle many queries simultaneously.
There are two main variations:
Symmetric multi-processing (SMP) involves linking identical
processors in a cluster with shared disk and memory, and then
allocating queries so as to balance the load on each processor.
This approach improves throughput, but not the time taken
to execute a given query once it reaches the front of the
queue. SMP works well for ad
hoc queries running against small to medium data warehouses,
but bottlenecks on shared resources tend to limit scalability.
Massively parallel processing (MPP) involves linking and
co-ordinating many similar processors and then splitting individual
queries into multiple work packages that can be processed
simultaneously and then glued back together. This approach
improves individual query response times, and is truly scaleable,
but requires all software to be specially designed to subdivide
queries into appropriate packages. MPP is better for standard
queries running against large data warehouses
Portal
A gateway providing a single point of access and a common
interface to all information and applications. Usually accessed
through a simple web browser.
Power user
A user with good analysis skills who makes frequent use of
the data warehouse,
and is capable of writing their own ad
hoc queries. Also see types
of user.
Query
A short computer program that extracts data from a database,
usually returning a list of records matching the specified
criteria. Usually written in SQL,
or as a script which is translated into SQL when run.
Query governor
Software that manages the execution of queries.
Allows an administrator to prevent the database server being
overloaded by limiting the amount of data retrieved and/or
the time taken. More experienced users may be granted higher
limits.
Referential integrity
A set of rules imposed in a relational
database to ensure consistency between data held in different
tables. For example, every employee must have one and only
one department. To prevent the creation of ‘orphans’, a department
record cannot be deleted until all corresponding employee
records have been deleted or amended.
Relational database
A database in which data are stored in tables corresponding
to business entities such
as customers or orders. The relationships between these entities
are then used to create access paths to each piece of data
as it is needed (by joining the corresponding tables).
Contrast this to a hierarchical database, in which the access
path is determined by the physical record structure.
ROLAP (Relational OLAP)
On line analytic processing based on data stored in a relational
database such as DB2, Oracle or Microsoft SQL Server,
usually designed with one or more star
schemas.
Semantic layer
Part of the software architecture that uses meta-data
to translate physical database field names into readily understood
business terminology for use in reporting and analysis. (See
also data dictionary)
Slice & dice
A feature of OLAP tools
that enables users to filter
and rotate (or pivot) a cube to look at any cross-section
as if it were a 2 dimensional spreadsheet.
Slowly changing
dimensions
Dimensions such as
customer and employee, in which attributes like address or
job title may change irregularly and infrequently. Also see
temporal
inconsistency.
SQL
(Stuctured Query Language)
An industry standard programming language for extracting
data from any relational
database. Syntax is English like, but is only easy to
understand for relatively simple queries.
Star schema
A simple form of relational
database in which a large, central fact
table is joined to several smaller dimension
tables which act as lookups. For example, an orders table
might be joined to master product, customer, calendar and
department tables.
Table
A two-dimensional array in which each column (or field) represents
an attribute of an entity,
and each row (or record) represents an occurrence of that
entity.
Thin client (see also fat
client)
A system architecture in which most software components run
on a shared server, with minimal use of the client (or user
PC). Performance depends more on network capacity and server
power. Software usually requires less maintenance.
Time-boxing
An approach to project management that sets a series of strict,
short-term deadlines for the completion of deliverables. The
scope or resources input may be flexed, but the essential
concept is to get as much as possible done in the time available.
Touchpoint
Any point of contact with a customer, including telephone
calls, letters, faxes, email, web site visits and face to
face meetings.
The following have also published glossaries on the world
wide web:
|