
CHAPTER 8 Data Structures and CAATTs for Data Extraction

Fajilan, Jastine Mae F. Jugan, Ma. Patricia Samantha G. Masangkay, Princess Donalyn L. Nicolas, Lorraine Nicole P. Sapinoso, Chrisiel Faye A. BSA 4-1 January 25, 2013

This chapter examines data structures and the use of CAATTs for data extraction and analysis.

Data Structures

Data structures have two fundamental components:
1. Organization: the way records are physically arranged on the secondary storage device, either sequentially or randomly.
2. Access method: the technique used to locate records and to navigate through the database or file.

Flat-File Structure
End users in this environment own their data files rather than share them with other users. Data processing is thus performed by stand-alone applications rather than integrated systems. The flat-file approach is a single-view model that characterizes legacy systems.

Sequential Structure
A sequential file is simple and easy to process. The application starts at the beginning of the file and processes each record in sequence. The structure does not permit accessing a record directly, so files that require direct access operations need a different data structure. The following data structures address this need.

1. Indexed structure: a separate index contains the numeric value of the physical disk storage location for each record in the associated data file. Records in an indexed random file are dispersed throughout a disk without regard for their proximity to other related records.
2. Virtual storage access method (VSAM) structure: used for very large files that require routine batch processing and a moderate degree of individual record processing. To find a specific record location, it uses a number of indexes that describe in summarized form the contents of each cylinder. Its greatest disadvantage is that it does not perform record insertion operations efficiently. It has three physical components:
   i. Indexes
   ii. Prime data storage area
   iii. Overflow area
3. Hashing structure: employs an algorithm that converts the primary key of a record directly into a storage address. The principal advantage is speed. The significant disadvantages are that the technique does not use storage space efficiently, and that different record keys may generate the same residual, which translates into the same address (a collision).
4. Pointer structure: records in this type of file are spread over the entire disk without concern for their physical proximity to other related records; pointers in each record link it to related records. There are three types of pointers:
   1. Physical address pointer: contains the actual disk storage location needed by the disk controller.
   2. Relative address pointer: contains the relative position of a record in the file.
   3. Logical key pointer: contains the primary key of the related record.
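The hashing structure described above can be illustrated with a short sketch. The text does not specify the algorithm, so this example assumes a simple division-remainder scheme; the divisor, the base address, and the sample keys are hypothetical values chosen only to show how two different keys can produce the same residual and therefore the same address.

```python
# Minimal sketch of a hashing structure, assuming a division-remainder scheme.
# PRIME_DIVISOR and BASE_ADDRESS are hypothetical, illustrative values.
PRIME_DIVISOR = 97
BASE_ADDRESS = 1000


def hash_address(primary_key: int) -> int:
    """Convert a numeric primary key directly into a storage address."""
    residual = primary_key % PRIME_DIVISOR
    return BASE_ADDRESS + residual


def find_collisions(keys):
    """Return the addresses that more than one key hashes to."""
    by_address = {}
    for key in keys:
        by_address.setdefault(hash_address(key), []).append(key)
    return {addr: ks for addr, ks in by_address.items() if len(ks) > 1}


keys = [15943, 16040, 20001]  # hypothetical record keys
addresses = {key: hash_address(key) for key in keys}
collisions = find_collisions(keys)

print(addresses)
print(collisions)
```

The lookup needs no index search, which is the speed advantage; the collision between the first two keys shows the disadvantage noted in the text.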

Hierarchical and Network Database Structures
Early hierarchical and network database models employed many of the preceding flat-file techniques as well as new proprietary database structures. A major difference between the flat-file and database approaches is the degree of process integration and data sharing that can be achieved. Two-dimensional flat files exist as independent data structures that are not linked logically or physically to other files.

Relational Database Structure, Concepts and Terminology

Indexed Sequential File Structure
- Facilitates both direct access to individual records and batch processing of the entire file.
- An inverted list allows even more flexible access to data.

Figure 1 Indexed Sequential File

Relational Database Theory
E. F. Codd originally proposed the principles of the relational model in the late 1960s. The formal model has its foundations in relational algebra and set theory, which provide the theoretical basis for most of the data manipulation operations used. Accordingly, a system is relational if it:
1. Represents data in the form of two-dimensional tables.
2. Supports the relational algebra functions of restrict, project, and join:
   (a) Restrict extracts specified rows from a specified table. This operation creates a virtual table (one that does not physically exist) that is a subset of the original table.
   (b) Project extracts specified attributes (columns) from a table to create a virtual table.
   (c) Join builds a new physical table from two tables, consisting of all concatenated pairs of rows from each table.

Although restrict, project, and join are not the complete set of relational functions, they form a useful subset that satisfies most business information needs.
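As a rough illustration of restrict, project, and join, the sketch below models tables as lists of dictionaries. It is not how a DBMS implements these functions. The join shown pairs rows that share a common key value, which is an assumption about the kind of join intended, and all table contents other than customer 1765 are hypothetical.

```python
def restrict(table, predicate):
    """Extract the rows of a table that satisfy a condition."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Extract the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in table:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(columns, values)))
    return result


def join(left, right, key):
    """Concatenate pairs of rows from two tables that share a key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical sample tables.
customers = [
    {"cust_no": 1765, "name": "ABC Associates", "state": "PA"},
    {"cust_no": 1820, "name": "XYZ Supply", "state": "NJ"},
]
invoices = [
    {"invoice_no": 16459, "cust_no": 1765},
    {"invoice_no": 16460, "cust_no": 1820},
    {"invoice_no": 16461, "cust_no": 1765},
]

pa_customers = restrict(customers, lambda row: row["state"] == "PA")
customer_names = project(customers, ["name"])
invoice_details = join(invoices, customers, "cust_no")

print(pa_customers)
print(customer_names)
print(invoice_details)
```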

Relational Database Concepts
In this section, we review basic concepts, terminology, and techniques common to relational database systems.

Entity, Occurrence, and Attributes
Entity: anything about which the organization wishes to capture data. An entity may be physical (e.g., inventories, customers, or employees) or conceptual (e.g., sales to a customer, accounts receivable, or accounts payable).
Data model: the blueprint for ultimately creating the physical database.
Entity relationship (ER) diagram: the graphical representation used to depict the model.
Occurrence: the term used to describe the number of instances or records that pertain to a specific entity.
Attributes: the data elements that define an entity. They are the logical and relevant characteristics of an entity and are unique to it.

Figure 2 Data Model Using an Entity Relationship Diagram

Association and Cardinality

Association: the labeled line connecting two entities in a data model. It is represented with a verb, such as ships, requests, or receives.
Cardinality: the degree of association between two entities. It describes the number of possible occurrences in one table that are associated with a single occurrence in a related table.

When determining the cardinality values in an entity association, select a single occurrence (record) of one entity and answer the following question: What are the minimum and maximum number of records that may be associated with the single record that has been selected?

Example 1 (1:0,1) shows that for every occurrence (record) in the Employee entity, there is a possibility of zero or one occurrence in the Company Car entity.
Example 2 (1:1) illustrates a situation in which each record in one entity is always associated with one (and only one) record in the associated entity. Two short lines intersecting the connecting line at both ends depict this cardinality.

Example 3 (1:M) presents the relationship between the Customer and Sales Order entities. The crow's foot symbol (which gives this form of notation its name) depicts the many cardinality at the Sales Order end of the association line.
Example 4 (1:M) represents a situation in which each specific item of inventory is supplied by one and only one vendor, and each vendor supplies one or many different inventory items to the company.
Example 5 (M:M) illustrates the many-to-many association.

Examples 4 and 5 demonstrate how cardinality reflects the business rules in place within an organization. The database designer must obtain a thorough understanding of how the client company and specified users conduct business to properly design the data model. If the data model is wrong, the resulting database tables will also be wrong.

Figure 3 Examples of Entity Associations

Alternative Cardinality Notations
An alternative method is to write the cardinality values on each end of the association line connecting the two entities. Some database designers explicitly show both the upper and lower cardinality values. Others choose a shorthand version that notes only the upper cardinality.

The Physical Database Tables
Physical database tables are constructed from the data model, with each entity in the model being transformed into a separate physical table. Across the top of each table are the attributes, which form the columns. Intersecting the columns to form the rows of the table are tuples. A tuple corresponds approximately to a record in a flat-file system.

Properly designed tables possess the following four characteristics:
1. The value of at least one attribute in each occurrence (row) must be unique. This attribute is the primary key. The values of the other (nonkey) attributes in the row need not be unique.
2. All attribute values in any column must be of the same class.
3. Each column in a given table must be uniquely named. However, different tables may contain columns with the same name.
4. Tables must conform to the rules of normalization. This means they must be free from structural dependencies, including repeating groups, partial dependencies, and transitive dependencies.

Linkages between Relational Tables
Logically related tables need to be physically connected to achieve the associations described in the data model. Using foreign keys accomplishes this. The nature of the association between the related tables determines the method used for assigning foreign keys.

Figure 4 Linkages between Relational Tables

User Views
A user view is the set of data that a particular user sees.

Anomalies, Structural Dependencies, and Data Normalization
This section deals with why database tables need to be normalized. In other words, why is it necessary for the organization's database to form an elaborate network of normalized tables linked together? Why, instead, can we not simply consolidate the views of one user (or several) into a single common table from which all data needs may be met?

Database Anomalies
The answer to the above questions is that improperly normalized tables can cause DBMS processing problems that restrict, or even deny, users access to the information they need. Such tables exhibit negative operational symptoms called anomalies. One or more of these anomalies will exist in tables that are not normalized or are normalized at a low level, such as first normal form (1NF) or second normal form (2NF). To be free of anomalies, tables must be normalized to the third normal form (3NF) level.

Update anomaly: results from data redundancy in an unnormalized table.
Insertion anomaly: to demonstrate the effects of the insertion anomaly, assume that a new vendor has entered the marketplace. The organization does not yet purchase from the vendor, but may wish to do so in the future. In the meantime, the organization wants to add the vendor to the database. This is not possible, however, because the primary key for the Inventory table is PART NUM. Because the vendor does not yet supply the organization with any inventory items, there is no PART NUM value to key the row, and the supplier data cannot be added to the table.
Deletion anomaly: involves the unintentional deletion of data from a table.
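The insertion anomaly can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module with a deliberately unnormalized Inventory table; the column names, vendor number, and vendor name are hypothetical, and only the idea that PART NUM is the primary key comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical unnormalized table: vendor data is stored inside Inventory,
# whose primary key is the part number.
conn.execute(
    """
    CREATE TABLE inventory (
        part_num    TEXT NOT NULL PRIMARY KEY,
        description TEXT,
        vendor_num  TEXT,
        vendor_name TEXT
    )
    """
)


def add_vendor_without_parts(connection, vendor_num, vendor_name):
    """Try to store a vendor that supplies no parts yet; report success."""
    try:
        connection.execute(
            "INSERT INTO inventory (part_num, vendor_num, vendor_name) "
            "VALUES (NULL, ?, ?)",
            (vendor_num, vendor_name),
        )
        return True
    except sqlite3.IntegrityError:
        # No part number exists, so the primary key cannot be populated.
        return False


inserted = add_vendor_without_parts(conn, "V-900", "New Vendor Co.")
row_count = conn.execute("SELECT COUNT(*) FROM inventory").fetchone()[0]

print(inserted, row_count)
```

Moving the vendor attributes into their own table, keyed on the vendor number, would let the vendor be added independently of any inventory item.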

Designing Relational Databases
Six phases of database design (view modeling):
1. Identify entities
2. Construct a data model showing entity associations
3. Add primary keys and attributes to the model
4. Normalize the data model and add foreign keys
5. Construct the physical database
6. Prepare the user views

Identify Entities
- View modeling begins by identifying the primary entities of the business function. Entities are things about which the organization wishes to capture data.
- Entities are represented as nouns in a system description.
To pass as valid entities, two conditions need to be met:
Condition 1. An entity must consist of two or more occurrences.
Condition 2. An entity must contribute at least one attribute that is not provided through other entities.

Construct a Data Model Showing Entity Associations
The next step in view modeling is to determine the associations between entities and document them with an ER diagram. Sometimes the association between entities is not apparent because different rules may apply in different organizations. If the database is to function properly, its designers need to understand the organization's business rules as well as the specific needs of individual users.

Add Primary Keys and Attributes to the Model
The analyst should select a primary key that logically defines the nonkey attributes and uniquely identifies each occurrence in the entity. Every attribute in an entity should appear directly or indirectly (as a calculated value) in one or more user views. Entity attributes are, therefore, originally derived and modeled from user views. In other words, if stored data are not used in a document, report, or calculation that is reported in some way, then they serve no purpose and should not be part of the database.

Normalize the Data Model and Add Foreign Keys
The normalization issues that needed resolution are outlined as follows:
- Repeating group data in the purchase order
- Repeating group data in the receiving report
- Transitive dependencies

Construct the Physical Database
This is an involved step that must be carefully planned and executed, and it may take many months in a large installation. Programs will need to be written to transfer organization data currently stored in flat files or legacy databases to the new relational tables. Data currently stored on paper documents may need to be entered into the database tables manually. Once this is done, the physical user views can be produced.

Prepare the User Views
A report program is used to make the view visually attractive and easy to use. Column headings can be added, fields summed, and averages calculated to produce a hard-copy or computer screen report that resembles the original user report. The report program can suppress unnecessary data from the view, such as duplicated fields and the key values in the Inventory/Vendor link table. These keys are necessary to build the view, but are not needed in the actual report.

Global View Integration
A modern company, however, would need hundreds or thousands of views and associated tables. To facilitate the task, modern Enterprise Resource Planning (ERP) systems come equipped with a core schema, normalized tables, and view templates.

View integration is the combining of the data needs of all users into a single schema or enterprise-wide view. This is a daunting undertaking when creating the entire database from scratch. The database designer must produce a set of integrated tables that are free of the update, insertion, and deletion anomalies and sufficiently rich to serve the needs of all users.

Embedded Audit Module
The objective of the embedded audit module (EAM), also known as continuous auditing, is to identify important transactions while they are being processed and extract copies of them in real time. An EAM is a specially programmed module embedded in a host application to capture predetermined transaction types for subsequent analysis.

Disadvantages of EAMs
Operational efficiency: from the user's point of view, EAMs decrease operational performance. The presence of an audit module within the host application may create significant overhead, especially when the amount of testing is extensive. One approach for relieving this burden on the system is to design modules that may be turned on and off by the auditor. Doing so will, of course, reduce the effectiveness of the EAM as an ongoing audit tool.
Verifying EAM integrity: the EAM approach may not be a viable audit technique in environments with a high level of program maintenance. When host applications undergo frequent changes, the EAMs embedded within the hosts will also require frequent modifications. Because the integrity of the EAM directly affects the quality of the audit process, auditors must evaluate the EAM's integrity.

Generalized Audit Software
Generalized audit software (GAS) is the most widely used CAATT for IS auditing. GAS allows auditors to access electronically coded data files and perform various operations on their contents. Some of the more common uses for GAS include:
- Footing and balancing entire files or selected data items
- Selecting and reporting detailed data contained in files
- Selecting stratified statistical samples from data files
- Formatting results of tests into reports
- Printing confirmations in either standardized or special wording
- Screening data and selectively including or excluding items
- Comparing multiple files and identifying any differences
- Recalculating data fields
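A few of the GAS uses listed above (footing a file, recalculating data fields, and screening records) can be sketched in plain Python. This is only an illustration of the kinds of tests involved, not the interface of any actual GAS product; the record layout and all sample values are hypothetical, and the last record is deliberately misstated.

```python
from decimal import Decimal

# Hypothetical flat-file records; the last one contains an extension error.
records = [
    {"prod_no": "R234", "qty": 2, "unit_price": Decimal("42.50"), "ext_price": Decimal("85.00")},
    {"prod_no": "M456", "qty": 1, "unit_price": Decimal("16.50"), "ext_price": Decimal("16.50")},
    {"prod_no": "W62", "qty": 1, "unit_price": Decimal("485.00"), "ext_price": Decimal("485.00")},
    {"prod_no": "X999", "qty": 3, "unit_price": Decimal("10.00"), "ext_price": Decimal("35.00")},
]


def foot(rows, field):
    """Foot (total) a numeric field across the whole file."""
    return sum((row[field] for row in rows), Decimal("0"))


def recalculation_exceptions(rows):
    """Recalculate quantity x unit price and return rows that disagree."""
    return [row for row in rows if row["qty"] * row["unit_price"] != row["ext_price"]]


def screen(rows, predicate):
    """Selectively include only the rows that meet a condition."""
    return [row for row in rows if predicate(row)]


file_total = foot(records, "ext_price")
exceptions = recalculation_exceptions(records)
large_items = screen(records, lambda row: row["ext_price"] > Decimal("100"))

print(file_total)
print([row["prod_no"] for row in exceptions])
print([row["prod_no"] for row in large_items])
```

Decimal is used rather than float so that monetary comparisons are exact.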

Using GAS to Access Simple Structures
The GAS selects the sample records and prepares a report containing the needed information.

Figure: Production inventory file -> GAS -> Transaction list

Using GAS to Access Complex Structures Gaining access to complex structures, such as a hashed file or other form of random file, may pose a problem for the auditor. Not all GAS products on the market may be capable of accessing every type of file structure. Most DBMSs have utility features that can be used to reformat complex structures into flat files suitable for this purpose.

Figure: Database -> DBMS utility program -> Flat file -> GAS -> Transaction file

Audit Issues Pertaining to the Creation of Flat Files
The auditor must rely on computer services personnel to produce a flat file from the complex file structure. There is a risk that data integrity will be compromised by the procedure used to create the flat file.

ACL (Audit Command Language) Software

Data Definition
ACL uses the data definition feature to read the source file. To create a data definition, the auditor needs to know both where the source file physically resides and its field structure layout. Small files can be imported via text files or spreadsheets. Very large files may need to be accessed directly from the mainframe computer. The data definition also allows the auditor to define important characteristics of the source file, including overall record length, the name given to each field, the type of data contained in each field, and the starting point and length of each field in the file.

Customizing a View
A view is simply a way of looking at data in a file. Auditors seldom need to use all the data contained in a file. ACL allows the auditor to customize the original view created during data definition to one that better meets his or her audit needs. The auditor can create and reformat new views without changing or deleting the data in the underlying file. Only the presentation of the data is affected.

Filtering Data
ACL provides powerful options for filtering data that support various audit tests. Filters are expressions that search for records that meet the filter criteria. ACL's expression builder allows the auditor to use logical operators such as AND, OR, NOT, and others to define and test conditions of any complexity and to process only those records that match specific conditions.

Stratifying Data
ACL's stratifying feature allows the auditor to view the distribution of records that fall into specified strata. Data can be stratified on any numeric field such as sales price, unit cost, quantity sold, and so on. The data are summarized and classified by strata, which can be equal in size (called intervals) or vary in size (called free).

Statistical Analysis
ACL offers many sampling methods for statistical analysis. Two of the most frequently used are record sampling and monetary unit sampling (MUS). Each method allows random and interval sampling. The choice of method will depend on the auditor's strategy and the composition of the file being audited. When the records in a file are fairly evenly distributed across strata, the auditor may want an unbiased sample and will thus choose the record sampling approach.

Normalizing Tables in a Relational Database
The database anomalies are symptoms of structural problems within tables called dependencies. Specifically, these are known as repeating groups, partial dependencies, and transitive dependencies.

The normalization process involves systematically identifying and removing these dependencies from the tables under review. The figure below graphically illustrates the unnormalized table's progression toward 3NF as each type of dependency is resolved. Tables in 3NF will be free of anomalies and will meet two conditions:
1. All nonkey attributes will be wholly and uniquely dependent on (defined by) the primary key.
2. None of the nonkey attributes will be dependent on (defined by) other nonkey attributes.

Figure: Steps in the Normalization Process
User View
  -> Represent View with a Single Table
Unnormalized Table
  -> Remove Repeating Groups
Table in First Normal Form (1NF)
  -> Remove Partial Dependencies
Table in Second Normal Form (2NF)
  -> Remove Transitive Dependencies
Table in Third Normal Form (3NF)

As illustrated in the figure, the process begins with a user view such as an output report, a source document or an input screen. At this point, the view is merely a pictorial representation of a set of data the user will eventually have when the project is completed. To demonstrate the normalization process, we will use the customer sales invoice.

User View

SALES INVOICE
Invoice Number: 16459
Order Date: 09/22/2009
Shipped Date: 09/27/2009
Shipped Via: UPS
Customer Number: 1765
Customer Name: ABC Associates
Street Address: 132 Elm St.
City: Bethlehem
State: PA
Telephone Number: 610-555-6721

Prod Num   Description       Quantity   Unit Price   Ex. Price
R234       Bolt cutter       2          $42.50       $85.00
M456       Gear puller       1          $16.50       $16.50
W62        Electric welder   1          $485.00      $485.00
                                        Total Due    $586.50

The next step is to represent the view as a single table that contains all of the attributes. Now that we have a base table to work from, the next few steps in the normalization process involve identifying and, if necessary, eliminating the structural dependencies that exist. If dependencies exist, correcting them will involve splitting the original single-table structure into two or more smaller and independent 3NF tables.

Unnormalized Table Supporting User View

Invoice No.  Order Date  Shipd Date  Shipd Via  Total Due  Cust. No.  Cust Name  Street Address  City       State  Tel No.  Prod No.  Description      Qty.  Unit Price  Ex Price
16459        09/22/2009  09/27/2009  UPS        $586.50    1765       ABC Assoc  132 Elm St.     Bethlehem  PA     610      R234      Bolt cutter      2     $42.50      $85.00
16459        09/22/2009  09/27/2009  UPS        $586.50    1765       ABC Assoc  132 Elm St.     Bethlehem  PA     610      M456      Gear puller      1     $16.50      $16.50
16459        09/22/2009  09/27/2009  UPS        $586.50    1765       ABC Assoc  132 Elm St.     Bethlehem  PA     610      W62       Electric welder  1     $485.00     $485.00

Remove Repeating Group Data The first step in correcting structural dependencies is to determine if the table under review contains repeating groups. Repeating group data is the existence of multiple values for a particular attribute in a specific record. To avoid such data redundancy, the repeating group data need to be removed from the table and placed in a separate table.

Resulting Tables after Removing Repeating Group Data

Single-Table Structure for Sales Invoice:
Invoice No., Order Date, Shipd Date, Shipd Via, Cust. No., Cust Name, Street Address, City, State, Tel No., Prod No., Description, Qty., Unit Price

is split into:

Sales Invoice Table:
Invoice No., Order Date, Shipd Date, Shipd Via, Cust. No., Cust Name, Street Address, City, State, Tel No.

Line Item Table:
Invoice No., Prod No., Description, Qty., Unit Price
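Removing the repeating group can be sketched as a small transformation over the unnormalized rows. The example below is a simplified illustration that carries only a subset of the invoice attributes (the field names and the abbreviated column set are assumptions made for brevity); invoice-level values are stored once per invoice, and the repeating line data moves to a separate table that keeps the invoice number.

```python
INVOICE_FIELDS = ["invoice_no", "order_date", "cust_no", "cust_name"]
LINE_FIELDS = ["invoice_no", "prod_no", "description", "qty", "unit_price"]

# Abbreviated unnormalized rows based on the sample sales invoice.
unnormalized = [
    {"invoice_no": 16459, "order_date": "09/22/2009", "cust_no": 1765, "cust_name": "ABC Assoc",
     "prod_no": "R234", "description": "Bolt cutter", "qty": 2, "unit_price": "42.50"},
    {"invoice_no": 16459, "order_date": "09/22/2009", "cust_no": 1765, "cust_name": "ABC Assoc",
     "prod_no": "M456", "description": "Gear puller", "qty": 1, "unit_price": "16.50"},
    {"invoice_no": 16459, "order_date": "09/22/2009", "cust_no": 1765, "cust_name": "ABC Assoc",
     "prod_no": "W62", "description": "Electric welder", "qty": 1, "unit_price": "485.00"},
]


def remove_repeating_groups(rows):
    """Split rows into an invoice table and a line item table."""
    invoices = {}
    line_items = []
    for row in rows:
        # Invoice-level data is kept once per invoice number.
        invoices[row["invoice_no"]] = {f: row[f] for f in INVOICE_FIELDS}
        # Each repeating line becomes its own row, keyed by invoice and product.
        line_items.append({f: row[f] for f in LINE_FIELDS})
    return list(invoices.values()), line_items


sales_invoice_table, line_item_table = remove_repeating_groups(unnormalized)

print(sales_invoice_table)
print(line_item_table)
```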

Remove Partial Dependencies
A partial dependency occurs when one or more nonkey attributes are dependent on (defined by) only part of the primary key, rather than the whole key. This can occur only in tables that have a composite (two or more attribute) primary key.

Resulting Tables after Removing Partial Dependency

Line Item Table (before):
Invoice No., Prod No., Description, Qty., Unit Price

is split into:

Line Item Table:
Invoice No., Prod No., Qty.

Inventory Table:
Prod No., Description, Unit Price

Remove Transitive Dependencies
The final step in resolving structural dependencies is to remove transitive dependencies. A transitive dependency occurs in a table where nonkey attributes are dependent on another nonkey attribute and independent of the table's primary key. We resolve this transitive dependency by splitting out the customer data and placing them in a new table called Customer.

Resulting Tables after Removing Transitive Dependency

Sales Invoice Table (before):
Invoice No., Order Date, Shipd Date, Shipd Via, Cust. No., Cust Name, Street Address, City, State, Tel No.

is split into:

Sales Invoice Table:
Invoice No., Order Date, Shipd Date, Shipd Via

Customer Table:
Cust. No., Cust Name, Street Address, City, State, Tel No.

Linking the Normalized Tables
At this point, the original single-table structure has been reduced to four normalized but independent tables. To work together, these tables need to be linked via foreign keys. This requires first determining the cardinality (degree of association) between the tables and then assigning the foreign keys.

Linkages Between Normalized Tables

Customer Table
Cust. No.  Cust Name  Street Address  City       State  Tel No.
1765       ABC Assoc  132 Elm St.     Bethlehem  PA     6105556721

Sales Invoice Table
Invoice No.  Order Date  Shipd Date  Shipd Via  Cust. No.
16459        09/22/2009  09/27/2009  UPS        1765

Line Item Table
Invoice No.  Prod No.  Qty.
16459        R234      2
16459        M456      1
16459        W62       1

Inventory Table
Prod No.  Description      Unit Price
R234      Bolt cutter      $42.50
M456      Gear puller      $16.50
W62       Electric welder  $485.00
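The linkage between the four normalized tables can be tried out with Python's built-in sqlite3 module. The sketch below is one possible physical layout, not a prescribed one: the table and column names, the decision to store prices in cents, and the choice of SQLite are all assumptions made for illustration. The foreign keys follow the cardinalities described above, and a join rebuilds the original invoice view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript(
    """
    CREATE TABLE customer (
        cust_no INTEGER PRIMARY KEY, cust_name TEXT, street_address TEXT,
        city TEXT, state TEXT, tel_no TEXT
    );
    CREATE TABLE sales_invoice (
        invoice_no INTEGER PRIMARY KEY, order_date TEXT, shipped_date TEXT,
        shipped_via TEXT,
        cust_no INTEGER NOT NULL REFERENCES customer(cust_no)  -- foreign key (1:M)
    );
    CREATE TABLE inventory (
        prod_no TEXT PRIMARY KEY, description TEXT, unit_price_cents INTEGER
    );
    CREATE TABLE line_item (
        invoice_no INTEGER REFERENCES sales_invoice(invoice_no),
        prod_no TEXT REFERENCES inventory(prod_no),
        qty INTEGER,
        PRIMARY KEY (invoice_no, prod_no)  -- composite key
    );
    """
)

conn.execute("INSERT INTO customer VALUES (1765, 'ABC Assoc', '132 Elm St.', 'Bethlehem', 'PA', '610-555-6721')")
conn.execute("INSERT INTO sales_invoice VALUES (16459, '09/22/2009', '09/27/2009', 'UPS', 1765)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("R234", "Bolt cutter", 4250), ("M456", "Gear puller", 1650), ("W62", "Electric welder", 48500)],
)
conn.executemany(
    "INSERT INTO line_item VALUES (?, ?, ?)",
    [(16459, "R234", 2), (16459, "M456", 1), (16459, "W62", 1)],
)

# Rebuild the invoice view by joining the tables on their keys.
customer_name = conn.execute(
    "SELECT c.cust_name FROM sales_invoice s "
    "JOIN customer c ON c.cust_no = s.cust_no WHERE s.invoice_no = ?",
    (16459,),
).fetchone()[0]

lines = conn.execute(
    "SELECT li.prod_no, inv.description, li.qty, li.qty * inv.unit_price_cents "
    "FROM line_item li JOIN inventory inv ON inv.prod_no = li.prod_no "
    "WHERE li.invoice_no = ? ORDER BY li.prod_no",
    (16459,),
).fetchall()

total_due_cents = sum(line[3] for line in lines)
print(customer_name, lines, total_due_cents)
```

Extended price and total due are calculated in the query rather than stored, which is why those attributes do not appear in any of the normalized tables.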