A data warehouse is a database that is maintained separately from an organization's operational databases. The construction of a data warehouse involves data cleaning, data integration, and data transformation. Data warehousing also forms an essential step in the knowledge discovery process.
[Figure: Data warehouse architecture. Operational databases and external sources feed the data warehouse through Extract, Transform, Load, and Refresh processes; a metadata repository describes the stored data; the warehouse in turn serves OLAP servers and data marts.]

Data Sources
Without the source systems, there would be no data. The data sources for the data warehouse are supplied as follows:
Operational data held in network databases. Departmental data held in file systems. Private data held on workstations and private servers. External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
Operational Data Store (ODS)
An ODS is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse. ODS objectives: to integrate information from day-to-day systems and to allow operational lookups, relieving the day-to-day systems of reporting and current-data analysis demands. An ODS can be a helpful step towards building a data warehouse because it can supply data that has already been extracted from the source systems and cleaned.
Load Manager
Also called the front-end component. Data is extracted from the operational systems directly, or from the operational data store, and is then loaded into the data warehouse. The load manager performs all the operations associated with the extraction and loading of data into the warehouse. These operations include sourcing, acquisition, cleanup, and transformation tools that prepare the data for entry into the warehouse. The functionality includes:
Removing unwanted data from operational databases. Converting to common data names and definitions. Calculating summaries. Establishing defaults for missing data.
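The load-manager functions above can be illustrated with a small sketch. This is not a real tool's API; the field names, the name map, and the defaults are all illustrative assumptions.

```python
# Hypothetical load-manager cleanup sketch: rename source fields to
# common warehouse names and establish defaults for missing data.
# NAME_MAP, DEFAULTS, and the sample record are assumptions.

NAME_MAP = {"cust_nm": "customer_name", "amt": "sale_amount"}
DEFAULTS = {"customer_name": "UNKNOWN", "sale_amount": 0.0}

def clean_record(raw: dict) -> dict:
    """Convert to common data names and fill in missing fields."""
    record = {NAME_MAP.get(k, k): v for k, v in raw.items()}
    for field, default in DEFAULTS.items():
        record.setdefault(field, default)
    return record

# One field renamed, one missing field defaulted.
print(clean_record({"cust_nm": "Acme"}))
```

In a real warehouse the same idea is usually expressed declaratively in an ETL tool's mapping rules rather than hand-written code.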
Warehouse Manager
Performs all the operations associated with the management of the data in the warehouse as follows:
Analysis of data to ensure consistency. Transformation and merging of source data from temporary storage into the data warehouse tables. Creation of indexes and views. Backing up and archiving data.
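A minimal sketch of two of these operations, using SQLite as a stand-in for the warehouse RDBMS; the table names, columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

# Warehouse-manager sketch: merge staged data into a warehouse table,
# then create an index and a summary view. Schema is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_sales (day TEXT, region TEXT, amount REAL)")
con.execute("CREATE TABLE fact_sales (day TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)",
                [("2024-01-01", "east", 10.0), ("2024-01-01", "west", 5.0)])

# Merge source data from temporary storage into the warehouse table.
con.execute("INSERT INTO fact_sales SELECT * FROM staging_sales")
con.execute("DELETE FROM staging_sales")

# Create an index and a view for convenient end-user access.
con.execute("CREATE INDEX idx_sales_region ON fact_sales(region)")
con.execute("""CREATE VIEW sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM fact_sales GROUP BY region""")

print(con.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
```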
Data Warehouse Database
The central repository for information. This database is almost always implemented on relational database management system (RDBMS) technology. Certain data warehouse attributes, such as very large database size, ad hoc query processing, and the need for flexible user-view creation (including aggregates, multi-table joins, and drill-downs), have become drivers for different technological approaches to the data warehouse database. These approaches include:
Parallel relational database designs that require a parallel computing platform, such as symmetric multiprocessors (SMPs) and massively parallel processors (MPPs). Multidimensional databases (MDDBs).
Query Manager
Also called the back-end component. Performs all the operations associated with the management of user queries, including directing queries to the appropriate tables and scheduling the execution of queries.
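Both responsibilities can be sketched in a few lines. This is a toy model, not a real query manager: the table names, the routing rule, and the priority values are assumptions.

```python
import heapq

# Hypothetical query-manager sketch: direct each query to the most
# appropriate table, then schedule execution by priority.

def route(needs_detail: bool) -> str:
    """Direct a query to the detailed fact table or a summary table."""
    return "sales_detail" if needs_detail else "sales_summary"

# (priority, sql) pairs; lower number runs first.
queue = []
heapq.heappush(queue, (2, "SELECT * FROM " + route(True)))
heapq.heappush(queue, (1, "SELECT * FROM " + route(False)))

while queue:                     # execute in priority order
    _, sql = heapq.heappop(queue)
    print(sql)
```

A production query manager would inspect the query itself (which columns and aggregation levels it touches) to choose between detail and summary tables.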
Detailed Data
Stores all the detailed data in the database schema. On a regular basis, detailed data is added to the warehouse to supplement the aggregated data.
Archive/Backup Data
Stores detailed and summarized data for the purposes of archiving and backup. It may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.
Metadata
This area of the warehouse stores all the metadata definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes:
Extraction and loading processes. Warehouse management processes.
Query management, where metadata is used to direct a query to the most appropriate data source. End-user access tools, which use metadata to understand how to build a query.
End-User Access Tools
Data reporting and query tools (e.g., Query by Example in the MS Access DBMS). Application development tools (applications used to access major DBMSs such as Oracle, Sybase, etc.). Executive information system (EIS) tools (for sales, marketing, and finance). Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views drawn from multiple databases). Data mining tools (allow the discovery of new patterns and trends by mining large amounts of data using statistical and mathematical techniques).
Inflow
The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse. Cleansing includes removing inconsistencies, adding missing fields, and cross-checking for data integrity. Transformation includes adding date/time-stamp fields, summarizing detailed data, and deriving new fields to store calculated data. The relevant data is extracted from multiple, heterogeneous, and external sources (commercial tools are used), then mapped and loaded into the warehouse.
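The transformation steps named above can be sketched as follows; the field names and the derived margin calculation are illustrative assumptions, not part of any standard.

```python
from datetime import datetime, timezone

# Inflow transformation sketch: add a load timestamp and derive a
# calculated field. Row fields ("revenue", "cost") are hypothetical.

def transform(row: dict) -> dict:
    """Stamp the row with a load time and derive a calculated field."""
    out = dict(row)
    out["load_ts"] = datetime.now(timezone.utc).isoformat()
    out["margin"] = out["revenue"] - out["cost"]  # derived field
    return out

row = transform({"revenue": 100.0, "cost": 60.0})
print(row["margin"])
```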
Upflow
The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distributing the data. Summarizing works by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users. Packaging involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation. Distribution places the data in appropriate groups to increase its availability and accessibility.
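A toy sketch of the first two upflow activities, summarizing by grouping and then packaging the result as a spreadsheet-style CSV; the column names and sample rows are assumptions.

```python
import csv, io
from collections import defaultdict

# Upflow sketch: summarize detail rows by grouping, then "package"
# the summary as CSV for spreadsheet delivery. Data is hypothetical.

detail = [
    {"region": "east", "amount": 10.0},
    {"region": "east", "amount": 2.5},
    {"region": "west", "amount": 5.0},
]

# Summarize: group detail data into totals convenient for end users.
totals = defaultdict(float)
for row in detail:
    totals[row["region"]] += row["amount"]

# Package: convert the summary into a spreadsheet-friendly format.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["region", "total"])
for region in sorted(totals):
    writer.writerow([region, totals[region]])
print(buf.getvalue())
```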
Downflow
The processes associated with archiving and backing up data in the warehouse. Effective performance maintenance is achieved by transferring older data of limited value to storage archives such as magnetic tapes, optical disks, or other digital storage devices. The downflow of data also includes the processes that ensure the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the re-establishment of the data in the warehouse when required.
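The retention-based archiving described above can be sketched as a simple in-memory move; the cutoff date, rows, and field names are assumptions, and a real system would write to tape or object storage rather than a Python list.

```python
from datetime import date

# Downflow sketch: move rows older than a retention cutoff out of the
# online warehouse into an archive store. All values are hypothetical.

warehouse = [
    {"day": date(2020, 1, 1), "amount": 7.0},
    {"day": date(2024, 6, 1), "amount": 3.0},
]
archive = []

CUTOFF = date(2023, 1, 1)  # retention-period boundary

# Transfer older, limited-value data to the archive.
for row in list(warehouse):
    if row["day"] < CUTOFF:
        warehouse.remove(row)
        archive.append(row)

print(len(warehouse), len(archive))
```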
Outflow
Involves the processes associated with making the data available to the end users. This involves two activities: data accessing and data delivery. Data accessing is concerned with satisfying end users' requests for the data they need; the main problem here is creating an environment in which users can effectively use the query tools to access the most appropriate data source. The delivery activity makes possible the delivery of information to the users' systems and workstations.
Metaflow
Metaflow is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.