
Journal Of RM Submitted: National University of Computer and Emerging science CFD

Comparative Study on Data Quality Management in Data Warehouses

Awais Sultan, Awaissultan88@gmail.com
Student
Department of Computer Science
National University of Computer and Emerging Sciences, CFD

Dr. Gufran Khan, m.gufran@nu.edu.pk
Assistant Professor
Department of Computer Science
National University of Computer and Emerging Sciences, CFD

Abstract

Data warehouses have been popular among developers and researchers for several years. A critical factor
in any data warehouse is maintaining data quality. If data quality is poor, the stakeholders who depend on
the warehouse cannot trust the results generated from its data. Maintaining data quality, however, is a
crucial and complicated task. Poor data quality leads to loss of customer trust and to increased cost.
As the number of data warehouse systems grows, data integration is required whenever data is taken from
different sources. This paper presents a study of techniques for maintaining data quality. It combines the
views of different researchers to provide a brief set of the issues and problems of data warehouse
systems.

Keywords: Data warehouse, data quality, metadata, quality parameters, framework, ETL.

1 Introduction

Many large organizations depend on data nowadays. The massive growth in data creates many problems,
such as maintaining the quality of data. Data quality issues arise when data is transferred from one source
system to another (2003). Data quality is important for business performance because low-quality data
increases cost. Many business organizations use a data warehouse (1994) to process large amounts of
data. A data warehouse helps preserve data quality when data is transferred from one source to another.

Data warehouses fail to meet requirements when data quality is lacking, and many data warehouse
projects fail because of low-quality data. Data comes in different shapes, and each data set is collected
from a different source. The main issue with a data warehouse is therefore data integration, because
different sources create the entity identification problem (1993). The main question is how to integrate
data from different sources and how to remove data redundancy (1967).

Many researchers have worked to improve data quality and have provided many techniques to
integrate data from different sources. Some researchers use a metadata approach to solve the entity
identification problem. Others use covariance and correlation to integrate data (2002). These approaches
are statistical and difficult to use, so a simpler approach is needed.

The main advantages of solving the integration problem are improved data quality and reduced
cost. Quality data helps an organization make strategic plans for its business (1980). Low-quality data
creates many problems for organizations because it cannot give proper information for strategic
planning.

A data warehouse provides a good environment for managing data quality because its ETL (2007)
process maintains the quality of large amounts of data. Data is stored in the data warehouse periodically,
so integration becomes easier at each phase, and the data is continuously improved at each stage.

This paper presents a study of the work of different researchers on improving data quality in a data
warehouse when data is taken from different sources. The method, known as a data quality management
system, is used to solve data integration problems. The paper combines the views of different researchers
and presents a simplified approach on the basis of this study. Fig. 1 describes the quality framework.

The rest of the paper is organized as follows. Section two describes related work. Section three presents
data quality issues. Section four presents the research studies of different researchers. Section five
describes the proposed method and analysis. Section six gives the conclusion and future work.

Fig. 1 Data quality framework (flowchart steps: Start; Form a quality council; Define quality parameters; Define quality metrics; Find the value of parameters; Expected value of parameters)



2 Related Work

Data quality is a complex issue in data warehousing because the volume of data increases day by day.
Managing data quality well is a major achievement for data warehouse researchers, and several of them
have tried to solve quality issues. This section presents the work of some prominent authors in the data
warehouse field.

Prakash, Singh, and Gosain (2004) proposed an informational scenario to present data warehouse
requirements. This scenario is a subclass of the class of scenarios, and the authors use it for decision
making. Its main advantage is efficiency in decision making; its drawback is that reaching a decision
takes some time. The scenario did not provide access to all business directions, so Paim and De Castro
(2003) proposed a technique called DWARF (data warehouse requirements definition). This technique
presents non-functional requirements at the technical level. Its main advantage is that it improves data
access performance in the data warehouse, and it is useful for large data sets.

Data quality is not only a matter of business measurements; a multidimensional view of the data is also
needed. Sapia, Blaschka, Höfling, and Dinter (1998) therefore proposed a goal question metric (GQM)
approach. It relates non-functional requirements to the physical architecture of the data warehouse. Its
advantage is that it measures the exact business dimensions, but it is difficult to use with large data
hierarchies. This technique could not fulfill the requirements of stakeholders, so Schiefer, List, and
Bruckner (2002) proposed the REMOTEDWH method (easy requirements modeling technique for data
warehouses). It presents requirements to stakeholders from different perspectives. Its advantage is that it
gives a complete view of an organization's requirements, which makes it more useful than the previous
techniques. In other work, Soler (2008) and Mazón, Pardillo, and Trujillo (2007) provide different types
of requirements for data warehouses, which are known as quality information of services. Both papers
combine these techniques in the DMA model. These two kinds of requirements make data processing
more efficient in large organizations, but data collection is difficult with this technique.

However, there is no prominent method that combines data quality, business quality, and information
quality in the development of a data warehouse.

3 Data quality issues:

• The source system holds wrong information
• Data arrives day by day and cannot be updated regularly
• Different sources use different data formats
• Some tables are missing information that is sometimes important
• Timing is very important when updating data
• Huge amount of data
• Many changes in data
• Proper data fields are not added
• Metadata is not reliable

3.1 Improvement Process of Data Quality:

To give an enterprise-wide view and to apply metadata, we develop an improvement process. The
improvement process has six stages, and at each stage some techniques and concepts are applied to
improve performance. The first stage is to access the data, which is important because the data comes
from different sources and accessing each source is a complex task. The second stage is to make a plan
for the data. Planning is the most important stage because every later stage follows the plan. Next comes
implementation, in which the plans made in the planning stage are carried out. After that, the results are
evaluated; if the results are correct, the data is adopted.

4 Comparative Study

This section presents the research studies of different researchers.

4.1 A SIMPLIFIED APPROACH FOR QUALITY MANAGEMENT IN DATA WAREHOUSE (V.Kumar et al, 2013)

Data quality management plays an important role in all data warehouse projects. A complete data
warehouse depends on quality because it is used for strategic information, and quality data has social and
economic value. This paper introduced a metadata-based data quality system. Metadata is data about data
and is mostly used to solve the entity identification problem when data is taken from heterogeneous
sources. In this system, data quality is analyzed by comparing the actual values of quality parameters with
their expected values. The information obtained from the comparison is stored using a metadata
framework. If an evaluated quality value does not match the required value, the system reports the errors
and also gives information on how to improve quality.
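The comparison of actual and expected parameter values can be illustrated with a minimal Python sketch. It is not the system of the surveyed paper: the `QualityCheck` record, the `completeness` measure, the field names, and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QualityCheck:
    """One row of quality metadata: a parameter with its expected and actual value."""
    parameter: str
    expected: float
    actual: float

    @property
    def passed(self) -> bool:
        return self.actual >= self.expected


def completeness(records, field):
    """Fraction of records whose `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def evaluate(records, expected):
    """Compare the measured completeness of each field with its expected minimum."""
    return [
        QualityCheck(f"completeness({field})", target, completeness(records, field))
        for field, target in expected.items()
    ]


# Hypothetical customer records and expected values.
customers = [
    {"id": 1, "name": "Ali", "city": "Faisalabad"},
    {"id": 2, "name": "", "city": "Lahore"},
    {"id": 3, "name": "Sara", "city": None},
    {"id": 4, "name": "Omar", "city": "Chiniot"},
]
quality_metadata = evaluate(customers, {"id": 1.0, "name": 0.9, "city": 0.7})
failed = [check.parameter for check in quality_metadata if not check.passed]
print(failed)
```

Here the list `quality_metadata` stands in for the metadata store, and `failed` names the parameters whose actual value falls below the expected one, which is the information such a system would report back for improvement.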

4.2 PROACTIVE DATA QUALITY MANAGEMENT FOR DATA WAREHOUSE SYSTEMS (M.Helfert, 2012)

The approach discussed above works well for numeric data but not for other data types. The author
therefore introduced a new system based on an approach for planning and measuring data quality. The
data quality requirements are defined through rules, and quality is measured with the help of these rules.
The system was introduced at a large Swiss bank to improve the quality of the bank's data. In this system,
end users who know the business requirements state these requirements in a non-formalized way so that
the data can be understood. The approach has also been applied in other fields such as e-commerce and
logistics.
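A rule-based quality measurement of this kind might look like the following Python sketch. The rules, field names, and sample records are hypothetical and only show how requirements written as rules can be turned into a measured score per rule; they are not taken from the surveyed paper.

```python
# Each rule is a (description, predicate) pair. All rules here are assumed examples.
RULES = [
    ("account_id is present", lambda r: bool(r.get("account_id"))),
    ("balance is a number", lambda r: isinstance(r.get("balance"), (int, float))),
    ("currency is a known code", lambda r: r.get("currency") in {"CHF", "EUR", "USD"}),
]


def measure(records, rules=RULES):
    """Return, for every rule, the share of records that satisfy it."""
    total = len(records)
    scores = {}
    for name, predicate in rules:
        satisfied = sum(1 for r in records if predicate(r))
        scores[name] = satisfied / total if total else 0.0
    return scores


# Hypothetical account records.
accounts = [
    {"account_id": "A1", "balance": 120.5, "currency": "CHF"},
    {"account_id": "", "balance": 10, "currency": "EUR"},
    {"account_id": "A3", "balance": "n/a", "currency": "CHF"},
    {"account_id": "A4", "balance": 0, "currency": "USD"},
]
scores = measure(accounts)
for rule, score in scores.items():
    print(f"{rule}: {score:.2f}")
```

Keeping the rules as data rather than code paths means that business users can add or change requirements without touching the measurement function.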


4.3 A new approach for total data quality management in data warehouse (G. Sankaranaraynan, 2011)

Metadata is data about data and is useful for solving the entity identification problem in data integration.
This paper defines the metadata requirements for quality management in a data warehouse. The research
shows that the quality the data attains is related to the older data. This data quality framework helps
organizations achieve total data quality management (TDQM) in a data warehouse. It enables the
stakeholders of the data warehouse to evaluate quality according to timeliness, consistency, and accuracy.
It improves the usability of the data warehouse and helps to identify its problems. It also provides an IP
view, which helps decision makers to evaluate the framework. The IPMAP also conceptualizes access to
metadata.
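The three evaluation dimensions named above can be made concrete with a small Python sketch. The definitions used here (share of recently updated records, agreement with a reference table, agreement with a trusted source) are simple assumptions for illustration, and the field names and data are hypothetical; the surveyed framework may define the dimensions differently.

```python
from datetime import date


def timeliness(records, today, max_age_days):
    """Share of records updated within `max_age_days` of `today`."""
    fresh = sum(1 for r in records if (today - r["updated"]).days <= max_age_days)
    return fresh / len(records)


def consistency(records, city_to_country):
    """Share of records whose country agrees with a reference table for the city."""
    ok = sum(1 for r in records if city_to_country.get(r["city"]) == r["country"])
    return ok / len(records)


def accuracy(records, trusted_amounts):
    """Share of records whose amount equals the value held by a trusted source."""
    ok = sum(1 for r in records if trusted_amounts.get(r["id"]) == r["amount"])
    return ok / len(records)


# Hypothetical warehouse rows.
rows = [
    {"id": 1, "city": "Lahore", "country": "PK", "amount": 100, "updated": date(2024, 1, 10)},
    {"id": 2, "city": "Zurich", "country": "PK", "amount": 250, "updated": date(2024, 1, 2)},
    {"id": 3, "city": "Zurich", "country": "CH", "amount": 75, "updated": date(2023, 11, 1)},
    {"id": 4, "city": "Lahore", "country": "PK", "amount": 40, "updated": date(2024, 1, 14)},
]
report = {
    "timeliness": timeliness(rows, date(2024, 1, 15), max_age_days=7),
    "consistency": consistency(rows, {"Lahore": "PK", "Zurich": "CH"}),
    "accuracy": accuracy(rows, {1: 100, 2: 250, 3: 70, 4: 40}),
}
print(report)
```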

4.4 META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING (R.B.Palepu, 2012)

Extraction, transformation, and loading (ETL) is an important part of a data warehouse. This approach is
used to maintain quality and has three phases. In the first phase, data is extracted from heterogeneous
operational sources; the extracted data has different types and names in different databases. The data is
then transformed into a common schema and finally loaded into the data warehouse. The ETL approach
used in this research overcomes quality problems in the data warehouse at the bottom level. The quality
issues discussed in this paper are planning, classification criteria, and assessment criteria. Different data
quality tools and metadata are also discussed.
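The three ETL phases can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the architecture of the surveyed paper: the two sources, their column names, the mapping table, and the `sales` target table are invented, and an in-memory SQLite database from the standard library stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources that name and type the same attributes differently.
SOURCE_A = [{"cust_id": "1", "full_name": "Ali Khan", "amt": "120.50"}]
SOURCE_B = [{"CustomerID": 2, "Name": "sara malik", "Amount": 80}]

# Per-source mapping from source column names to the common warehouse schema.
MAPPINGS = {
    "A": {"cust_id": "customer_id", "full_name": "name", "amt": "amount"},
    "B": {"CustomerID": "customer_id", "Name": "name", "Amount": "amount"},
}


def extract():
    """Yield (source_name, raw_row) pairs from every source."""
    for row in SOURCE_A:
        yield "A", row
    for row in SOURCE_B:
        yield "B", row


def transform(source, row):
    """Rename columns and coerce types into the common schema."""
    renamed = {MAPPINGS[source][key]: value for key, value in row.items()}
    return (
        int(renamed["customer_id"]),
        str(renamed["name"]).title(),
        float(renamed["amount"]),
    )


def load(conn, rows):
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(source, row) for source, row in extract()])
loaded = conn.execute(
    "SELECT customer_id, name, amount FROM sales ORDER BY customer_id"
).fetchall()
print(loaded)
```

The mapping table is where the differing names and types of the sources are reconciled, which is the step at which most integration-related quality problems surface.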

4.5 THE QUALITY OF DATA AND METADATA IN A DATA WAREHOUSE (C.Raduţ, 2012)

Multidimensional data has many dimensions along with one business metric. This research presents a
process for maintaining multidimensional data in a data warehouse. Analysis is a very important part of
working with multidimensional data and is performed through complex queries, which help to analyze
the information along each dimension. This approach sometimes calculates information that is not useful,
so it sometimes lacks quality. Data quality issues are solved at the planning phase through TDQM. This
is useful for evaluating the structure of the data warehouse, and the evaluation criteria help to control
quality.

4.6 An analysis of quality management in data warehouse (M.A. Jeusfeld, 2010)

The semantic model is another model used in the data warehouse environment to improve data quality. This paper
presents a semantic model approach for data warehouses. The model uses analysis and data design in the data
warehouse and then applies measurements to the data. It implements a concept-based meta database in the data
warehouse. The major contribution of this paper is that it separates quality at the instance level from quality at the
schema level, which earlier work did not do. The meta database is a good approach for achieving quality goals and
measurements. The main drawback of this paper is that it does not define the dependencies between quality and
measurements.


4.7 DATA QUALITY IN DATA WAREHOUSES PROBLEMS AND SOLUTION (R.K.Pandey, 2014)

This paper defines all the phases of a data warehouse, which are extraction, transformation, and loading. For each
phase it describes the deficiencies and the non-availability of data. Data quality can be improved through many
techniques, but in this work it is improved through statistical process control (SPC) and quality engineering. The
paper integrates the research on each phase of the data warehouse and focuses on the phases that are important for
improving data quality. This helps researchers to solve problems at each phase. The research is also important for
data warehouse users who want to improve their business.
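As a rough illustration of statistical process control applied to data quality, the Python sketch below derives control limits (mean plus or minus three standard deviations) from a baseline of past loads and flags new loads that fall outside them. The monitored quantity (percentage of missing values per load), the numbers, and the three-sigma choice are assumptions for illustration and are not taken from the surveyed paper.

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) control limits from baseline measurements."""
    center = mean(baseline)
    sigma = pstdev(baseline)
    # A percentage cannot be negative, so the lower limit is clipped at zero.
    return max(0.0, center - k * sigma), center, center + k * sigma


def out_of_control(values, lower, upper):
    """Indices of measurements that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical percentage of missing values observed in past ETL loads.
baseline_missing_pct = [2.0, 2.5, 1.5, 2.0, 2.5, 1.5, 2.0, 2.0]
lower, center, upper = control_limits(baseline_missing_pct)

# Hypothetical new loads to be checked against the limits.
new_loads = [2.1, 1.8, 6.0, 2.4]
flagged = out_of_control(new_loads, lower, upper)
print(f"limits: {lower:.2f} .. {upper:.2f}, flagged loads: {flagged}")
```

A flagged load signals that the extraction or transformation phase behaved differently from its history and should be inspected before the data is used.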

4.8 A new approach for data quality management in data warehouse (2010)

Data warehouses have attracted the attention of researchers and users for several years because a data warehouse is
an environment rather than a physical implementation. It helps the managers of an organization to monitor the
business easily and make strategic decisions. Maintaining data quality in a data warehouse is a crucial issue,
however, because the data is collected from many sources or daily operations. This paper presents different data
quality models, which give data warehouse based systems a way to improve quality. The main focus of the paper is
to link data quality to quality criteria, and these criteria help to improve data quality in the data warehouse.

4.9 DATA WAREHOUSE DESIGN AND IMPLEMENTATION BASED ON QUALITY REQUIREMENTS (2010)

A data warehouse is an environment, not a physical implementation. It is an efficient way of working with
relational data. The data in a data warehouse is non-volatile: once data enters the warehouse, it does not change
again. Data warehouses are widely used for decision making in organizations. Different models are used in data
warehouses to improve the quality of large data sets. This paper describes the most important of them, the
dimensional model, which gives information from many directions. Data integration is an important phase of the
data warehouse, and quality is maintained during this phase. If quality is not maintained, the data warehouse fails
to meet requirements. The criteria for evaluating quality are described in detail.

4.10 Causes and Quality problems in Data warehouse (E.Rahm, 2013)

The demand for data warehouses increases day by day because of the benefits of decision support systems. A lot of
work is available on improving the quality of data in a warehouse. The most important phase is identifying the
issues that affect data quality. This paper describes the causes of low-quality data and identifies about 50 of them.
These causes are taken into consideration when a data warehouse is built. The identified problems include data
integration and data formatting. The paper combines the issues and problems of quality.

4.11 A survey on quality tools in data warehouse (M.Helfert, 2008)

Building a data warehouse requires a lot of effort because maintaining quality is a very difficult task. Many tools
are available to maintain the quality of data in a data warehouse. This paper describes the quality issues that the
tools identify. Tools are used to enhance quality at the loading phase. Every tool works in a different manner and
serves a different purpose: some detect errors and some compare quality. This helps in choosing the best tool for
quality.


4.12 ANALYSIS OF DATA QUALITY ASPECTS IN DATA WAREHOUSE SYSTEMS (H.Hinrichs et al, 2012)

The main purpose of a data warehouse is to provide quality data. Maintaining the quality of data is an important and
difficult task. If the data cannot fulfill the requirements of stakeholders, it has low value and quality and is invalid
for the organization. Data is cleaned and analyzed before it is used in the organization or for any other purpose.
Poor-quality data costs the organization both customer trust and money. The main purpose of this paper is to focus
on the steps used in a data warehouse to maintain the quality of data. It helps data warehouse users to analyze the
steps that are important for maintaining data quality.

4.13 A system for data integration in data warehouse (2010)

Maintaining data quality is an important factor in a data warehouse, and the analysis of warehouse data makes
improving quality even more important. Data grows day by day, and this growth demands that quality be
maintained during the data integration process. Data integration is performed when data is taken from different
sources into the data warehouse. The main focus of this work is to maintain quality when data is taken from
different sources. The paper provides data integration in a data warehouse through a data quality management
system (DQMS). It also addresses data quality, but there is still no proper method available for data integration in
a data warehouse.

4.14 DATA CLEANING: PROBLEMS AND CURRENT APPROACHES (R.Hegadi et al, 2011)

Several quality problems occur during the management of data in a data warehouse. This paper describes the
problems that affect the quality of data in a data warehouse. When data is taken from different sources, data
cleaning is required, and the paper provides a method for it. The demand for data cleaning increases day by day as
data is taken from heterogeneous sources. Data transformation is also a very important process when data is taken
from different sources, and the paper provides a data transformation process as well.
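A minimal Python sketch of data cleaning for records that arrive from heterogeneous sources is given below. It is only an illustration under assumptions: the field names, the list of accepted date formats, and the choice of the e-mail address as the duplicate key are hypothetical and do not come from the surveyed paper.

```python
from datetime import datetime

# Date formats assumed to occur in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize names, e-mails, and dates, then drop duplicate e-mails."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            "name": " ".join(row["name"].split()).title(),
            "email": row["email"].strip().lower(),
            "joined": parse_date(row["joined"]),
        }
        if record["email"] in seen:
            continue  # same entity already seen under a different spelling
        seen.add(record["email"])
        cleaned.append(record)
    return cleaned


# Hypothetical raw rows from different sources.
raw = [
    {"name": "  ali   khan ", "email": "Ali@Example.com", "joined": "2020-03-01"},
    {"name": "Ali Khan", "email": "ali@example.com ", "joined": "01/03/2020"},
    {"name": "sara malik", "email": "sara@example.com", "joined": "15.01.2021"},
    {"name": "Omar", "email": "omar@example.com", "joined": "unknown"},
]
result = clean(raw)
for record in result:
    print(record)
```

Unparseable dates are kept as `None` rather than guessed, so that the gap stays visible to later quality checks.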

5 ANALYSIS
We have surveyed thirteen papers and used seven parameters to evaluate them. The security parameter is marked
for all papers in the table, which indicates that security is most important for good quality in a data warehouse.
V. Kumar et al. (2013) focus on reliability and provide a mechanism to achieve it. Maintaining data quality is the
key issue in all papers, and each paper tests a factor of quality. Data interpretability is described in every paper
except E. Rahm (2013), because interpretable data is an important factor for quality. System availability is
described by every paper except R. K. Pandey (2014). C. Raduţ (2012) describes an efficient way of implementing
quality and also provides steps for data warehouse management. Every paper describes an important factor of
quality and provides steps for quality management. The papers focus on accuracy, completeness, integration,
timeliness, and interpretability of data in the data warehouse. M. Helfert focuses more on consistency.


Table 1. Parameters for evaluation

No.  Parameter                  Description
1    Data Interpretability      Information is not described fully
2    System Availability        Information is not available in some cases
3    Completeness               All data records are complete
4    Security                   Unauthorized access
5    Maintainability            Long time to test data
6    Accuracy                   Exact value
7    Implementation Efficiency  Efficiently implemented

Table 2. ANALYSIS OF PARAMETERS

Paper    Accuracy    Implementation Efficiency    Data Interpretability    System Availability    Completeness    Security    Maintainability
V.Kumar et al, 2013
M.Helfert, 2012
R.K.Pandey, 2014
G. Sankaranaraynan, 2011
C.Raduţ, 2012
R.B.Palepu, 2012
H.Hinrichs et al, 2012
Atre, 2014
M.Helfert, 2008
E.Rahm, 2013
R.Singh et al, 2010
H. Hinrichs, 2010
R.Hegadi et al, 2011

6 CONCLUSION
Data quality is a very important factor in a data warehouse. Maintaining data quality and clean data in a data
warehouse is a very difficult task, but once quality is achieved it provides many benefits to the users of the data
warehouse. Users always want quality data that has low cost and high benefit. Quality data is important for its
economic and social impact. To ensure quality in a data warehouse, this study described the DQMS (data quality
management system) framework and a system based on metadata. Data warehouses use different tools to maintain
quality and clean data, and some of these tools are also described in this study.

Maintaining the quality of data in a data warehouse will remain a difficult task in the future, so a lot of effort is
required to build new tools for it. There is a need for a cost-effective system that delivers higher quality at low
cost.

References
Abell, Derek F. (1980). Defining the business: The starting point of strategic planning: Prentice-Hall Englewood
Cliffs, NJ.
Inmon, William H, & Hackathorn, Richard D. (1994). Using the data warehouse: Wiley-QED Publishing.
Khan, Nadia, Iqbal, Summaya, & Mahboob, Tahira. (2015). A Comparative Study of Data Quality Management in
Data ware Houses.
Kortman, CM. (1967). Redundancy reduction—a practical method of data compression. Proceedings of the IEEE,
55(3), 253-263.
Lenzerini, Maurizio. (2002). Data integration: A theoretical perspective. Paper presented at the Proceedings of the
twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Lim, Ee-Peng, Srivastava, Jaideep, Prabhakar, Satya, & Richardson, James. (1993). Entity identification in database
integration. Paper presented at the Data Engineering, 1993. Proceedings. Ninth International Conference
on.
Mazón, Jose-Norberto, Pardillo, Jesús, & Trujillo, Juan. (2007). A model-driven goal-oriented requirement
engineering approach for data warehouses Advances in Conceptual Modeling–Foundations and
Applications (pp. 255-264): Springer.
Paim, Fabia Rilston Silva, & De Castro, Jaelson Freire Brelaz. (2003). DWARF: An approach for requirements
definition and management of data warehouse systems. Paper presented at the Requirements Engineering
Conference, 2003. Proceedings. 11th IEEE International.
Prakash, Naveen, Singh, Yogesh, & Gosain, Anjana. (2004). Informational scenarios for data warehouse
requirements elicitation Conceptual Modeling–ER 2004 (pp. 205-216): Springer.
Rahm, Erhard, & Do, Hong Hai. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull.,
23(4), 3-13.
Saeed, AI, Sharov, Vasily, White, Joe, Li, Jerry, Liang, Wei, Bhagabati, Nirmal, . . . Thiagarajan, M. (2003). TM4: a
free, open-source system for microarray data management and analysis. Biotechniques, 34(2), 374.
Sapia, Carsten, Blaschka, Markus, Höfling, Gabriele, & Dinter, Barbara. (1998). Extending the E/R model for the
multidimensional paradigm Advances in Database Technologies (pp. 105-116): Springer.
Schiefer, Josef, List, Beate, & Bruckner, Robert. (2002). A holistic approach for managing requirements of data
warehouse systems. AMCIS 2002 Proceedings, 13.
Skoutas, Dimitrios, & Simitsis, Alkis. (2007). Ontology-based conceptual design of ETL processes for both
structured and semi-structured data. International Journal on Semantic Web and Information Systems
(IJSWIS), 3(4), 1-24.


Soler, E. (2008). Towards Comprehensive Requirement Analysis for Data Warehouses: Considering Security
Requirements (In Spanish: Modelado de Requisitos de Seguridad para Almacenes de Datos). Vienna
University of Technology.
C.Raduţ, THE QUALITY OF DATA AND METADATA IN A DATAWARE HOUSE, Vol.2, No.4, pp. 15-24, January 2012
M.Helfert, PROACTIVE DATA QUALITY MANAGEMENT FOR DATA WAREHOUSE SYSTEMS, 2012
R.B.Palepu, META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING, Vol.2, No.4, August 2012
M.A. Jeusfeld, C.Quix, M.Jarke, DESIGN AND ANALYSIS OF QUALITY INFORMATION FOR DATA WAREHOUSES, December 2010
H. Hinrichs, DATA QUALITY TOOLS FOR DATA WAREHOUSING – A SMALL SAMPLE SURVEY, Vol.3, No.5, December 2010
Manjunath T.N, R.Hegadi, Ravikumar G.K, ANALYSIS OF DATA QUALITY ASPECTS IN DATAWAREHOUSE SYSTEMS, Vol. 2 (1), 2011
