
Key Factors for an Effective Quality Assurance in Data Warehousing

by Ramesh Konda1, Rao R. Nemani2 and Jamuna R. Nemani3

Nova Southeastern University

The Graduate School of Computer and Information Sciences Fort Lauderdale, FL 33314 Konda1991@yahoo.com

Opus College of Business

University of St. Thomas 1000 LaSalle Avenue, TMH 455 Minneapolis, MN 55403 Nema8811@stthomas.edu

Primetherapeutics LLC

1305 Corporate Center Dr, Eagan, MN 55121 Nemani2@Yahoo.com

Silicon Valley American Society for Quality Conference (October 2009)


Written on: 29 July 2009


Abstract

Strategic, data-driven decision making in turbulent environments has been pushing organizations to build business-related Data Warehouse (DW) environments to store and manage vast amounts of data. The main premise of a DW is to provide a single point of truth and coherent data in one place. A DW can be defined as a collection of subject-oriented, integrated, non-volatile data that supports the management decision process. Successful DW implementations have helped businesses store, analyze, and share critical and confidential data online with their business partners and customers. However, inefficiencies in data quality within a given DW have been a main concern of business users and have not been addressed adequately. Unless a defined and planned approach to data quality is followed during the different phases of a DW, organizations may suffer from Data Quality (DQ) issues, and any effort to fix those issues becomes very expensive and time consuming. DQ can be attributed to several factors such as data accuracy, completeness, timeliness, coherency, consistency, conformity, and record duplication. This paper presents a successful approach for implementing key quality factors in a DW during the development and deployment phases. Examples from business are used to demonstrate the practical aspects of the proposed approach, which yield positive results in DW development and deployment.

Keywords: Data Warehousing, Data Quality, Quality Assurance, Quality Assurance Testing, Quality Assurance Planning, Quality Assurance Deployment


1. Introduction

Jean-Pierre (2004) believes that an unclear definition of DQ itself leads to the lack of a solid methodology for dealing with DQ. Quality is a relative notion that varies among individuals according to their perceptions. In simplistic terms, DQ is perceived as data that is true and accurate, which makes it hard to define and measure. To tackle the problem, DQ needs to be understood thoroughly from the organizational point of view; only then can a process be established to deal with DQ within the organization. DQ can be defined as the absence of undesirable characteristics, or the presence of desirable characteristics, in the data.

Buckley and Poston (1984) defined software quality assurance (SQA) as a planned and systematic pattern of all actions necessary to provide adequate confidence that the software conforms to the defined requirements. Chow (1985) argued that failure to pay enough attention to SQA has often resulted in schedule delays, budget overruns, and failure to achieve customer satisfaction. Abdel-Hamid (1988) articulated in his research that SQA not only holds the key to customer satisfaction, but also has a direct impact on the cost and the scheduling of a project.

There has been great progress in the core technology of DW; however, DQ is one of the crucial issues that has not been adequately addressed. In a survey by Friedman, Nelson, and Radcliffe (2004), 75 percent of respondents reported significant problems stemming from defective and fragmented data, over 50 percent had incurred costs for data reconciliation, and 33 percent had IT systems delayed owing to data quality problems. A second survey by Ambler (2006) reported several metrics indicating that data quality is a major issue that requires considerable attention. For example, Figure 1 shows that only 2 percent of the respondents feel good about the data quality in their data warehouse, while the remaining 98 percent indicate some kind of data quality issue that needs to be addressed.

Figure 1. Current State of Data Quality (Ambler, 2006)

One of the major factors influencing DQ is user perception. If user assumptions or perceptions go unchecked, over time they come to be treated as the truth, whether or not they have an objective or factual basis from either a business or a technical perspective (Bryan, 2002). DQ indicates how well enterprise data matches the real world at any given time. There are many sources of 'dirty data': a) poor data entry, which includes misspellings, typographical errors and transpositions, and variations in spelling or naming; b) data missing from database fields; c) lack of company-wide or industry-wide data coding standards; d) multiple databases scattered throughout different departments or organizations, with the data in each structured according to the rules of that particular database; and e) older systems that contain poorly documented or obsolete data (Andrea & Miriam, 2005). Nord (2005) mentions that DQ has become an increasingly critical concern and has been rated as a top concern by data consumers in many organizations. Nord (2005) further states that data quality is gaining importance in research and among consumer organizations.

Ensuring a high level of DQ is one of the most expensive and time-consuming tasks in data warehousing projects. Many data warehouse projects have failed halfway through due to poor DQ, often because DQ problems do not become apparent until the project is underway. Any changes to a DW at the implementation stage are extremely costly and may push project budget limits. If all considerations are examined thoroughly at the strategy and design stage of the DW, plans and controls for DQ can be built into the design, which can decrease operational costs, increase customer satisfaction, and improve both decision-making and employee confidence in using the data (Andrea & Miriam, 2005).

The quality of information systems (IS) is critically important for companies to derive a return on their investments. Therefore, developing good quality in data warehousing that meets user needs is becoming a critical theme for information technology management (Guimaraes, Staples & McKeen, 2007). English (2001) listed several examples that draw attention to the negative impact of DQ issues in DW. They include errors in students' Basic Standards Test scores, pension withholdings, invoicing, and food processing that led to the loss of billions of dollars as well as loss of reputation for the businesses involved.

Section 2 of this paper presents a brief literature review. In Section 3, the authors examine the process and critical factors of DQ. The following section discusses the current practices of DQ in DW and proposes a solution, from the practitioner's point of view, for improving the quality of data. The last section summarizes the paper.


2. Literature Review

Software quality assurance (QA) is one of the critical functions in the development and maintenance of software systems. Because QA is a rigorous function that adds significant effort and cost to total software development cost, the QA process is often compromised during software development. However, this concern has not been adequately addressed in the literature. There are many facets of QA in a DW project; this paper focuses primarily on the QA process and the factors involved in a DW's data quality. Because quality assurance aims to detect systematic risks in order to avoid them, the authors discuss various quality assurance factors in this paper. Because QA aims at systematic coverage from business requirements through system requirements to test plan and test execution, the process ensures that data quality reaches an acceptable level.

Iain & Don (2005) argue that in order to tackle this difficult issue, organizations need both a top-down approach to DQ sponsored by the most senior levels of management and a comprehensive bottom-up analysis of data sourcing, usage, and content, including an assessment of the enterprise's capabilities in terms of data management, relevant tools, and people skills. Xu, Nord, Brown, and Nord (2002) believe that for organizations considering implementing a DW, it is essential that DQ issues be thoroughly understood and that the organizations obtain knowledge of the critical success factors essential to ensuring DQ during the implementation process. The main components of data that determine DQ are completeness, appropriateness, accuracy, grouping accuracy, access, confidence, currency, regulators, legal compliance, and meta-linking. Data interface, data replication, and data migration and movement all share common characteristics such as volume of data, timeliness of movement and processing, and direction of flow between sources and targets (Bryan, 2002).

DQ tools generally fall into one of three categories: auditing, cleansing, and migration. Data auditing tools apply predefined business rules against a source database. These tools enhance the accuracy and correctness of the data at the source. Some data cleansing tools verify the data by comparing it against an independent source, e.g., US postal codes. Data is typically moved from the source to an intermediate staging area where the data cleansing activities are performed. Data migration is an activity in which data is extracted and transported from one source to another; data migration tools perform the extraction, transportation, and mapping of data from one platform to another.

Poor DQ impacts the typical enterprise in many ways, such as customer dissatisfaction, increased cost, and lowered employee job satisfaction. The slightest suspicion of poor DQ often hinders managers from reaching any decision. In order to ensure DQ assessment, Hufford (1996) proposed a model that consists of defining DQ expectations and metrics, identifying and assessing risks, mitigating risks, and monitoring and evaluating results on an ongoing basis.

3. Process and the Critical Factors of DQ

Data warehousing depends on integrating data quality assurance into all warehousing phases: planning, implementation, and maintenance (Ballou and Tayi, 1999). Experts in quality control methodology always recommend addressing the root cause, duly considering the following quality expectations:

1) Accuracy
2) Completeness
3) Timeliness
4) Integrity
5) Consistency
6) Conformity
7) Record Duplication

Nemani and Konda (2009) have presented an extended version of the Data Warehouse Development Life Cycle (DWDLC) Layers, which lists comprehensive phases and links them to the Data Quality factors (Figure 2). The major theme in each of the DWDLC layers can be described as follows:

Figure 2. Data Warehouse Development Life Cycle (DWDLC) Layers, adopted from Nemani and Konda (2009)


1) Planning: Beyond the success of the DQ effort itself, defining and managing the project scope influences the project's overall success. Every DW project requires a careful balance in which data sources, processes, procedures, and other factors are scoped commensurate with the project's size, complexity, and importance.
2) Analysis: In this layer, one should analyze the data from the various available data sources. Data profiling is recommended in this phase.
3) Requirements: In this layer, the DW professional collaborates with the business stakeholders to understand the business problem by defining and documenting the data quality factors required for the DW project.
4) Develop: In this phase, the DW professional develops and tests the DW solution, keeping in mind the DQ factors defined in the requirements phase.
5) Implement: In this phase, the DQ solution is implemented after being duly signed off by the quality assurance team.
6) Measure: In this phase, the data is sampled and the current process capability is measured against the DQ factors defined in the requirements phase. This activity helps minimize data quality problems.

As a basic process of practicing Data Quality, organizations need to understand and define the process of data flow, data transformation, and data storage. The process should consist of the source, staging area, data processing, data transformation, and data storage as follows:


[Figure 3: flow diagram. Stages: Data Source -> Staging Area -> Data Load Process -> Data Transformation -> Data Load to target tables. Annotations shown in the figure:
- Prevention of data corruption should be the focus.
- Ensure all the files are being processed; reprocess the failed ones, and log and discard the corrupt ones.
- Apply the business logic to the data to meet the desired form; repair and reload the data as needed.
- Audit/inspect the data using business rules for data quality; build, execute, and report the DQ rules/metrics.]

Figure 3. Foundational Data Warehouse Load Process Stages

We have identified four DQ assessment classes: Data Source, Data Load Process, Data Transformation, and Data Load to Target Tables. We further defined multiple DQ assessment criteria for each class. These criteria have been linked to quality assurance methods from a practitioner's perspective and are summarized in Table 1 below.


Table 1. Classification of DQ assessment class, criteria and method

DQ Assessment Class          DQ Assessment Criterion        Quality Assurance Method
---------------------------  -----------------------------  --------------------------------
Data Source                  Source Location                Validate Source Location
                             Sequence of Data Files         Validate Sequence of Files
                             Sender's Address               Confirm Sender's Address
                             File Size                      Validate File Size
                             File Receipt Acknowledgement   Confirm File Receipt
                             Record Count                   Reconcile with Source

Data Load Process            Loading Time                   Verify Load Time
                             Notify Load Process            Verify Load Status Notification
                             File Status Tracking           Validate Action
                             Data Load                      Verify Complete Load
                             Track Failed Data Load         Reconcile Failed Data Load

Data Transformation          Business Rules                 Validate Business Rules
                             Track Failed Business Rules    Reconcile Failed Business Rules
                             Target Data Loads              Validate Data Loads

Data Load to Target Tables   Data Types                     Validate Consistent Data Types
                             Business Rules                 Validate Target Data
                             Data Completeness              Validate Data Completeness
                             Data Load                      Verify Complete Load
                             Track Failed Data Load         Reconcile Failed Data Load
                             Notify Load Process            Verify Load Status Notification
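As an illustration of how one row of Table 1 might be automated, the following minimal Python sketch covers the Record Count criterion (method: Reconcile with Source). It uses the record-type codes given as examples in Section 4 (0 for the header, 2 for detail records, 8 for the trailer). The file layout is an assumption made only for this sketch: records are pipe-delimited, the first field is the record type, the second field of the header is a date/time stamp, and the second field of the trailer is the detail-record count. The names `reconcile_with_source` and `ReconcileResult` are hypothetical.

```python
from dataclasses import dataclass

# Record-type codes taken from the examples in Section 4 of the paper.
HEADER, DETAIL, TRAILER = "0", "2", "8"


@dataclass
class ReconcileResult:
    ok: bool
    problems: list[str]


def reconcile_with_source(lines: list[str], sep: str = "|") -> ReconcileResult:
    """Check the header, detail, and trailer records of one received file.

    Assumed (hypothetical) layout: the first field of every record is the
    record type; header field 2 is a timestamp; trailer field 2 is the
    number of detail records, excluding header and trailer.
    """
    problems: list[str] = []
    records = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    if not records:
        return ReconcileResult(False, ["file is empty"])

    # File header validation: first record is a header carrying a timestamp.
    header = records[0]
    if header[0] != HEADER:
        problems.append(f"first record has type {header[0]!r}, expected {HEADER!r}")
    elif len(header) < 2 or not header[1]:
        problems.append("header has no date/time stamp")

    # File trailer validation: last record is a trailer carrying a count.
    trailer = records[-1]
    declared = None
    if trailer[0] != TRAILER:
        problems.append(f"last record has type {trailer[0]!r}, expected {TRAILER!r}")
    elif len(trailer) < 2 or not trailer[1].isdigit():
        problems.append("trailer has no numeric record count")
    else:
        declared = int(trailer[1])

    # File detail validation: records in between are details, and their
    # number equals the count declared in the trailer.
    body = records[1:-1]
    wrong_type = [r[0] for r in body if r[0] != DETAIL]
    if wrong_type:
        problems.append(f"{len(wrong_type)} record(s) in the detail segment are not type {DETAIL!r}")
    if declared is not None and declared != len(body):
        problems.append(f"trailer declares {declared} detail record(s), found {len(body)}")

    return ReconcileResult(not problems, problems)
```

A load job could call `reconcile_with_source(open(path).read().splitlines())` before staging a file and route any file with `ok == False` to the failed-load tracking described in Table 1.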

The source can be defined as the origin of the data. For example, if an organization captures data at several locations, each location becomes a source. In the process of loading the source data into the DW, the data is held in a staging area of the DW. Home-grown or off-the-shelf software can be used to load data from the staging area into the DW storage tables. Typically within the process, the source data is transformed to meet the business logic prior to loading into the target tables of the DW. Once the data is transformed into target tables/storage, an audit/inspection plan must be devised based on business rules that are in turn based on the desirable characteristics of the final data. The desirable characteristics can be verified by building the DQ rules, executing the tests, and reporting any issues with data characteristics so that they can be corrected or repaired and preventive action can be taken in the upstream process.

4. Classification of DQ-Criteria for a QA solution

Several research projects have tackled the problem of assessing scores for information quality (IQ) criteria. Naumann & Rolker (2000) present a classification of IQ criteria designed to help organizations assess the status of their information quality and monitor their IQ improvements. The authors of this paper have extended Naumann & Rolker's (2000) classification of IQ criteria to DQ criteria. Nemani and Konda's (2009) Data Warehouse Development Life Cycle (DWDLC) Layers model was also leveraged to further strengthen the actionable and detailed tasks needed to achieve data quality.

Quality Assurance is, in a broad sense, a framework that encompasses understanding, planning, and execution of test plans during each phase of the DWDLC, before software applications are deployed for their intended use. Typically, the process starts with studying the business requirements and systems requirements documents to understand the overall scope of the application/software as well as to define the scope of the QA. The next step in the process is to devise the test strategy, test plan, and test cases. One of the key tasks in developing test cases is to understand the business and systems requirements and to formulate test cases for each of the requirements. The test case developer must also include information about the test environment and the before and after results from the test. In Table 1 above, we defined the QA assessment criteria and the respective quality assurance methods. Below, we provide a brief definition of each, with the respective details on QA methodology.
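Before turning to the individual methods, the build, execute, and report cycle for DQ rules described at the end of Section 3 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `DQRule`, `execute_rules`, and `report`, the field names `customer_id` and `gender`, and the set of valid gender codes (the paper only states that Female is transformed to 2; the code 1 is invented here). It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]

# Hypothetical required fields and valid codes, chosen only for this sketch.
REQUIRED_FIELDS = ("customer_id", "gender")
VALID_GENDER_CODES = (1, 2)  # 2 = Female per the paper; 1 is an assumption


@dataclass(frozen=True)
class DQRule:
    name: str
    check: Callable[[Row], bool]  # returns True when the row passes


# Build: each rule encodes one desirable characteristic of the target data.
RULES = [
    DQRule("customer_id is numeric",
           lambda r: isinstance(r.get("customer_id"), int)),
    DQRule("gender is a valid numeric code",
           lambda r: r.get("gender") in VALID_GENDER_CODES),
    DQRule("required fields are present",
           lambda r: all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS)),
]


def execute_rules(rows: list[Row], rules: list[DQRule]) -> dict[str, list[int]]:
    """Execute: run every rule on every row; collect failing row indexes."""
    failures: dict[str, list[int]] = {rule.name: [] for rule in rules}
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                failures[rule.name].append(index)
    return failures


def report(rows: list[Row], failures: dict[str, list[int]]) -> list[str]:
    """Report: one summary line per rule, for correction and upstream action."""
    total = len(rows)
    return [
        f"{name}: {total - len(bad)}/{total} passed; failing rows: {bad}"
        for name, bad in failures.items()
    ]


if __name__ == "__main__":
    sample = [
        {"customer_id": 101, "gender": 2},
        {"customer_id": "102", "gender": "Female"},  # untransformed values
        {"customer_id": 103, "gender": None},        # missing value
    ]
    for line in report(sample, execute_rules(sample, RULES)):
        print(line)
```

Keeping each rule as a named, independent check means a failing row can be traced to the specific criterion it violates, which is what the reporting step needs in order to drive repair and preventive action upstream.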


Data Source: The major emphasis during this phase is to check and validate that the files are being received from the pre-identified locations and from the designated sender. Files are also cross-validated to ensure that the file size and the sequence of dimension and fact data are in order. The following detailed QA steps ensure that this objective is met.

Validate Source Location: Identify and validate the source location. For example, files must be received only from the pre-identified locations and in a pre-identified format such as DB2, SQL, AS400, or any other external source file.

Validate Sequence of Files: Validate that the data files arrive and are processed in their logical sequence. For example, the member file needs to be loaded before any claims are processed.

Confirm Sender's Address: Verify the sender's address; it is critical to know the source sender information for tracking and feedback purposes.

Validate File Size: Cross-verify the size of the received files against the source files to ensure that the entire expected file has been received.

Confirm File Receipt: Verify that a receipt acknowledgment is sent to the source for reconciliation and tracking purposes.

Reconcile with Source: Verify that all the records in all files are processed by validating the following three steps:

File header validation: Verify that the header displays the record type (for example, 0 for header) along with a date and time stamp.


File detail validation: Verify that the total number of records given in the trailer count equals the number of records in the detail segment, and that the record type of each detail record is 2.

File trailer validation: Verify the record type (for example, the trailer record displays 8) and that the trailer displays the total count of detail records, excluding the header and trailer records.

Data Load Process: The emphasis during this phase is to ensure that the files received have been processed. The key measures include the loading time, user notification, file status tracking, and reconciliation of failed data loads. The following detailed QA steps ensure that this objective is met.

Verify Load Time: Validate that the actual data load time has not greatly exceeded the estimated load time.

Verify Load Status Notification: Validate that the email notification process functions as expected. Email/status notifications are sent periodically, indicating the status of the load.

Validate Action: File status tracking validation verifies that failed data loads are reprocessed within the stipulated time after the issue has been identified and corrected.

Verify Complete Load: The critical validation is the completeness of the data load. Make sure that all the fields, with the specified size and criteria, have loaded successfully.

Reconcile Failed Data Load: Verify that records that failed during the data load process are being tracked, validated, reviewed, updated, and re-processed if necessary.

Data Transformation: Typically, the source data is transformed in order to meet the business needs as well as to standardize data across the database. In this phase of QA, the emphasis is to verify that the business rules are being applied and to reconcile the failed data loads. The following detailed QA steps ensure that this objective is met.

Validate Business Rules: Validate the business rules applied in the transformation. For example, if the transformation requires changing the data value Female in the source table to the number 2, verify that all Female values are transformed to numeric 2.

Reconcile Failed Business Rules: Verify that any failed or unidentified business rules are captured, revalidated, reconsidered, and changed per business requirements.

Validate Data Loads: Re-validate that the data transformation file is successfully loaded and matches the identified record counts in source and target.

Data Load to Target Tables: This is the critical phase in which one can verify and validate the final data. It includes validating the consistent usage of data types, data completeness, the right data in the right target tables, and reconciliation of failed data loads. The following detailed QA steps ensure that this objective is met.

Validate Consistent Data Types: Verify that the data field types in the target tables are consistent throughout the database. For example, CustomerID has a number data type in every table where CustomerID is used.

Validate Target Data: Verify that the business rules are current and produce the required data in the target tables.

Validate Data Completeness: Verify data completeness, that is, ensure that the right data is loaded into the right target tables.

Verify Complete Load: Validate that the target data load is complete in size and correct in its data field values when compared with the source and transformation files.

Reconcile Failed Data Load: Verify that the failed data load is reconsidered for reprocessing after the required changes have been made to the original data file.

Verify Load Status Notification: Verify that the status of the load process is published periodically during data loading: for example, "File ABC aborted", with the time identified, if the job aborts, or else "Load successfully completed" (including the record count).

5. Conclusions

In this paper, the authors have extended the DWDLC (Nemani & Konda, 2009) approach by combining subjective and objective DQ assessments that are applied in practice. The main objective of any DW is to provide decision makers with a single version of the truth based on high quality data. This enables managers and employees to make informed and better decisions. Data Quality (DQ) can be attributed to several factors such as data accuracy, completeness, timeliness, coherency, consistency, conformity, and record duplication. Low quality data, however, has severe effects on an organization's performance. Unless a defined and planned approach to data quality is followed during the different phases of a DW, organizations may suffer from data quality issues, and any effort to fix the DQ issues becomes very expensive and time consuming. In this paper, we have identified the DQ Assessment Classes, the DQ Assessment Criteria, and the respective Quality Assurance Methods. A detailed explanation is provided for each DQ Assessment Criterion, with the related QA Method and QA test cases that will help achieve the expected quality level in DW development and deployment.

References

Abdel-Hamid, T. (1988). The Economics of Software Quality Assurance: A Simulation-Based Case Study. MIS Quarterly, 12(3), 395-411.
Ambler, S. W. (2009). Data Quality Survey Results. http://www.ambysoft.com/downloads/surveys/DataQuality200609.ppt, accessed on July 25, 2009.
Andrea, R., & Miriam, C. (2005). Invisible Data Quality Issues in a CRM Implementation. Journal of Database Marketing & Customer Strategy Management, Vol. 12, No. 4, pp. 305-314.
Ballou, D., & Tayi, G. (1999). Enhancing Data Quality in Data Warehouse Environments. Communications of the ACM, 42(1), pp. 73-78.
Bryan, F. (2002). Managing The Quality and Completeness of Customer Data. Journal of Database Management, Vol. 10, No. 2, pp. 139-158.
Buckley, F., & Poston, R. (1984). Software Quality Assurance. IEEE Transactions on Software Engineering, pp. 36-41.
Chow, T.S. (ed.) (1985). Software Quality Assurance: A Practical Approach. IEEE Computer Society Press, Silver Spring, MD.
English, L.P. (2001). Information Quality Management: The Next Frontier. Annual Quality Congress Proceedings, American Society for Quality, Milwaukee, WI, pp. 529-33.
Friedman, Nelson, and Radcliffe (2004). CRM Demands Data Cleansing. Gartner Research.
Guimaraes, T., Staples, D.S., & McKeen, J.D. (2007). Assessing the Impact from Information Systems Quality. Quality Management Journal, Vol. 14, No. 1, pp. 30-44.
Hufford, D. (1996). Data Warehouse Quality. Data Management Review, Feb/Mar.
Iain, H., & Don, M. (2005). Prioritizing and Deploying Data Quality Improvement Activity. Journal of Database Marketing & Customer Strategy Management, Vol. 12, No. 2, pp. 113.
Jean-Pierre, D. (2004). Integrating DQ into Your Data Warehouse Architecture. Business Intelligence Journal, 9(2), 18.
Naumann, F., & Rolker, C. (2000). Assessment Methods for Information Quality Criteria. In: Proceedings of the 2000 Conference on Information Quality, Cambridge, MA 1999, pp. 148-162.
Nemani, R. R., & Konda, R. (2009). A Framework for Data Quality in Data Warehousing. Proceedings of the third International United Information Systems Conference, UNISCON, Sydney, Australia, 20(1), pp. 292-297.
Nord, G. D. (2005). An Investigation of the Impact of Organization Size on Data Quality Issues. Journal of Database Management, Vol. 16, No. 3, pp. 58-71.
Xu, H., Nord, J.H., Brown, N., & Nord, G.D. (2002). Data Quality Issues in Implementing an ERP. Industrial Management & Data Systems, Vol. 102, No. 1, pp. 47-60.


Authors Biography

Ramesh Konda has experience in quality assurance, six-sigma, ISO standards, project management, and programming/database administration. He holds B.E., M.S., and M.B.A. degrees. Mr. Konda is a Fellow of the American Society for Quality (ASQ), a certified quality engineer (CQE), and a certified quality auditor (CQA). At present, Mr. Konda is a Ph.D. candidate in Computer Information Systems at Nova Southeastern University, Florida, USA. His research interests include data mining, data warehousing, knowledge management, six-sigma, quality assurance, and project management. (konda1991@yahoo.com).

Rao Nemani is an Adjunct Instructor at the University of St. Thomas, Minneapolis, MN, USA, where he teaches IT management courses for graduate students. Rao received his Master of Information Technology from Swinburne University of Technology, Melbourne, Australia. Rao is a Ph.D. candidate in IT Management at Capella University, Minneapolis, MN, USA. His research interests include knowledge management, strategic management, data warehousing and data mining, and business and IT alignment. (Nema8811@stthomas.edu).

Jamuna Nemani is currently working with Primetherapeutics LLC (a pharmacy benefit management company) as Sr. Quality Assurance Analyst. Jamuna has been working in the information technology profession for more than 15 years as a Project Manager, System Analyst, Application Developer, and Quality Assurance Analyst. (Nemani2@yahoo.com).
