Friday, November 16, 2012
*******************************************************************************
ETL target-load validation checks:
- Expected data is added to the target system.
- All DB fields and field data are loaded without any truncation.
- Data checksums and record counts match between source and target.
- For rejected data, proper error logs are generated with all details.
- NULL value fields are handled as specified.
- Duplicate data is not loaded.
- Data integrity is maintained.
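As a minimal sketch of two of these checks (record count match and NULL checking), assuming hypothetical source and target tables named SRC_EMPLOYEES and TGT_EMPLOYEES with an EMPLOYEE_ID column:

-- Record count match between source and target
SELECT 'SOURCE' AS SIDE, COUNT(*) AS ROW_COUNT FROM SRC_EMPLOYEES
UNION ALL
SELECT 'TARGET', COUNT(*) FROM TGT_EMPLOYEES;

-- A mandatory column should contain no NULLs in the target
SELECT COUNT(*) AS NULL_ROWS
FROM TGT_EMPLOYEES
WHERE EMPLOYEE_ID IS NULL;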
***************************************************************************
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
OLTP System (Online Transaction Processing, Operational System) vs. OLAP System (Online Analytical Processing, Data Warehouse):

Source of data
- OLTP: Operational data; OLTPs are the original source of the data.
- OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
- OLTP: To control and run fundamental business tasks.
- OLAP: To help with planning, problem solving, and decision support.

What the data reveals
- OLTP: A snapshot of ongoing business processes.
- OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
- OLTP: Short and fast inserts and updates initiated by end users.
- OLAP: Periodic long-running batch jobs refresh the data.

Queries
- OLTP: Relatively standardized and simple queries returning relatively few records.
- OLAP: Often complex queries involving aggregations.

Processing speed
- OLTP: Typically very fast.
- OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
- OLTP: Can be relatively small if historical data is archived.
- OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
- OLTP: Highly normalized with many tables.
- OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
- OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
- OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
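To make the query-style difference concrete, a small sketch using hypothetical ORDERS and SALES_FACT tables (names invented for illustration):

-- Typical OLTP query: a short, indexed lookup touching a few rows
SELECT ORDER_ID, ORDER_STATUS
FROM ORDERS
WHERE ORDER_ID = 1001234;

-- Typical OLAP query: an aggregation scanning history across dimensions
SELECT REGION,
       EXTRACT(YEAR FROM SALE_DATE) AS SALE_YEAR,
       SUM(SALE_AMOUNT) AS TOTAL_SALES
FROM SALES_FACT
GROUP BY 1, 2
ORDER BY 2, 1;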
*******************************************************************************
What is BI used for?
Organizations use Business Intelligence to gain data-driven insights on anything related to business performance. It is used to understand and improve performance, to cut costs, and to identify new business opportunities. This can include, among many other things:
- Analyzing customer behaviors, buying patterns and sales trends
- Measuring, tracking and predicting sales and financial performance
- Budgeting, financial planning and forecasting
- Tracking the performance of marketing campaigns
- Optimizing processes and operational performance
- Improving delivery and supply chain effectiveness
- Web and e-commerce analytics
- Customer relationship management
- Risk analysis
- Strategic value driver analysis
The next component of BI is analysing the data. Here we take the data that has been gathered and inspect, transform or model it in order to gain new insights that will support our business decision making. Data analysis comes in many different formats and approaches, both quantitative and qualitative. Analysis techniques include the use of statistical tools, data mining approaches, visual analytics, and even the analysis of unstructured data such as text or pictures.

Providing Access: In order to support decision making, the decision makers need to have access to the data. Access is needed to perform analysis or to view the results of the analysis. The former is provided by the latest software tools that allow end users to perform data analysis, while the latter is provided through reporting, dashboard and scorecard applications.
*******************************************************************************
What is Metadata?
Metadata is defined as data that describes other data. Metadata can be divided into two main types: structural and descriptive. Structural metadata describes the design of data structures and their specifications; this type of metadata describes the containers of data within a database. Descriptive metadata describes instances of application data; this is the type of metadata that is traditionally spoken of and described as "data about the data". A third type is sometimes identified, called administrative metadata. Administrative metadata provides information that helps to manage other information, such as when and how a resource was created, file types and other technical information.

Metadata makes it easier to retrieve, use, or manage information resources by providing users with information that adds context to the data they are working with. Metadata can describe information at any level of aggregation, including collections, single resources, or component parts of a single resource. Metadata can be embedded into a digital object or can be stored separately. Web pages contain metadata called metatags.

Metadata at the most basic level is simply defined as data about data. An item of metadata describes the specific characteristics of an individual data item. In the database realm, metadata is defined as data about data through which the end-user data are integrated and managed. Metadata in a database typically stores the relationships that link up numerous pieces of data. Metadata names fields, describes the size of the fields, and may put restrictions on what can go in a field (for example, numbers only). Therefore, metadata is information about how data is extracted and how it may be transformed. It is also about indexing and creating pointers into data. Database design is all about defining metadata schemas.

Metadata can be stored either internally, in the same file as the data, or externally, in a separate area. If the metadata is stored internally, it is kept together with the data, making it more easily accessible to view or change; however, this method creates high redundancy. If metadata is stored externally, searches can become more efficient and there is no redundancy, but getting to this metadata may be a little more technical.
All the metadata is stored in a data dictionary or a system catalog. The data dictionary is most typically an external document, often created in a spreadsheet, that stores the conceptual design ideas for the database schema. The data dictionary also describes the general format that the data, and in effect the metadata, should follow. Metadata is an essential aspect of database design: it allows for increased processing power because it helps create pointers and indexes.
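In Teradata, for example, the system catalog lives in the DBC database. A quick way to look at column-level metadata for a table (here assuming the SAMPLES.EMPLOYEES table used later in these notes; DBC.ColumnsV is the newer view name, older releases expose DBC.Columns):

SELECT ColumnName, ColumnType, ColumnLength, Nullable
FROM DBC.ColumnsV
WHERE DatabaseName = 'SAMPLES'
  AND TableName = 'EMPLOYEES'
ORDER BY ColumnId;
*******************************************************************************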
If data exists in a table, the Primary Index, once created in Teradata, cannot be changed. We have to drop the table and recreate it with the column we want as the Primary Index.
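A hedged sketch of the usual workaround, using a hypothetical EMP table whose primary index is to be moved from EMP_ID to DEPT_NO:

CREATE TABLE EMP_NEW AS (SELECT * FROM EMP) WITH DATA
PRIMARY INDEX (DEPT_NO);        -- copy the data into a table that has the desired primary index

DROP TABLE EMP;
RENAME TABLE EMP_NEW TO EMP;    -- the new table takes over the old name
*******************************************************************************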
Explain the Shared Nothing architecture?
To parallel process, everything is to share nothing.
Shared Nothing Architecture (SNA) is a distributed computing architecture consisting of multiple nodes, where each node has its own private memory, disks and input/output devices, independent of any other node in the network. Each node is self-sufficient and shares nothing across the network. Therefore, there are no points of contention across the system and no sharing of data or system resources. This type of architecture is highly scalable and has become quite popular, especially in the context of web development.
For instance, Google has implemented an SNA, which evidently enables it to scale web applications effectively by simply adding nodes in its network of servers without slowing down the system.
*******************************************************************************
Difference between SMP and MPP?
Symmetric Multiprocessing (SMP) is the processing of programs by multiple processors that share a common operating system and memory. SMP is also called "tightly coupled multiprocessing". A single copy of the operating system is in charge of all the processors running in an SMP system, and an SMP configuration typically does not exceed 16 processors. SMP is better than MPP when Online Transaction Processing is done, in which many users access the same database to perform searches with a relatively simple set of common transactions. One main advantage of SMP is its ability to dynamically balance the workload among processors (and as a result serve more users at a faster rate).
Massively Parallel Processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have separate operating systems and memories. These processors communicate with each other through message interfaces. There are cases in which up to 200 processors run for a single application. An interconnect arrangement of data paths allows messages to be sent between the processors working on a single application or product. The setup for MPP is more complicated than SMP: experienced judgment is needed when setting up an MPP system, along with good in-depth knowledge of how to partition the database among the processors and how to assign work to them. An MPP system can also be called a loosely coupled system. An MPP is considered better than an SMP for applications that allow a number of databases to be searched in parallel.
*******************************************************************************
SELECT HASHROW(EMPLOYEE_ID) AS HASH_ROW,
       HASHBUCKET(HASHROW(EMPLOYEE_ID)) AS HASH_BUCKET_NO,
       HASHAMP(HASHBUCKET(HASHROW(EMPLOYEE_ID))) AS HASH_AMP_NO
FROM SAMPLES.EMPLOYEES
ORDER BY HASH_AMP_NO;
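A small companion sketch on the same assumed SAMPLES.EMPLOYEES table: grouping on the hash expressions counts rows per AMP, which is a common way to eyeball primary index skew.

SELECT HASHAMP(HASHBUCKET(HASHROW(EMPLOYEE_ID))) AS HASH_AMP_NO,
       COUNT(*) AS ROW_COUNT
FROM SAMPLES.EMPLOYEES
GROUP BY 1
ORDER BY 2 DESC;
*******************************************************************************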
SELECT tablename,
       SUM(currentperm)/1024/1024 AS MB
FROM dbc.allspace
WHERE databasename = 'SAMPLES'
GROUP BY tablename
ORDER BY 2 DESC;
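A related sketch at the database level, assuming access to the standard DBC.DiskSpaceV view: summing CurrentPerm against MaxPerm shows how much of each database's allocated permanent space is actually used.

SELECT databasename,
       SUM(currentperm)/1024/1024 AS used_MB,
       SUM(maxperm)/1024/1024 AS max_MB
FROM dbc.diskspacev
GROUP BY 1
ORDER BY 2 DESC;
*******************************************************************************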
Types of indexes in Teradata:
1. PRIMARY INDEX
   - UNIQUE PRIMARY INDEX [UPI]
   - NON-UNIQUE PRIMARY INDEX [NUPI]
2. SECONDARY INDEX
   - UNIQUE SECONDARY INDEX [USI]
   - NON-UNIQUE SECONDARY INDEX [NUSI]
3. PARTITIONED PRIMARY INDEX
   - RANGE BASED [RANGE_N]
   - CASE BASED [CASE_N]
4. HASH INDEX
5. JOIN INDEX
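A hedged sketch of the DDL behind most of these index types, using a hypothetical SAMPLES.EMP_DEMO table (table and column names invented for illustration; the CASE_N form of partitioning and CREATE HASH INDEX are not shown):

CREATE TABLE SAMPLES.EMP_DEMO
( EMP_ID    INTEGER NOT NULL,
  DEPT_NO   INTEGER,
  EMP_NAME  VARCHAR(50),
  HIRE_DATE DATE )
PRIMARY INDEX (EMP_ID)                          -- NUPI; on a non-partitioned table, UNIQUE PRIMARY INDEX would give a UPI
PARTITION BY RANGE_N(HIRE_DATE BETWEEN DATE '2000-01-01' AND DATE '2020-12-31'
                     EACH INTERVAL '1' MONTH);  -- range-based partitioned primary index (PPI)

CREATE UNIQUE INDEX (EMP_ID) ON SAMPLES.EMP_DEMO;   -- USI (also enforces uniqueness on this partitioned table)
CREATE INDEX (DEPT_NO) ON SAMPLES.EMP_DEMO;         -- NUSI

CREATE JOIN INDEX SAMPLES.EMP_DEPT_JI AS
SELECT DEPT_NO, EMP_ID, EMP_NAME
FROM SAMPLES.EMP_DEMO
PRIMARY INDEX (DEPT_NO);                            -- single-table join index
*******************************************************************************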
COLUMN LEVEL + ROW LEVEL: the above conditions check for duplication of data at both the column level and the row level.
*******************************************************************************
Permanent and Temporary Tables
Permanent storage of tables is necessary when different sessions and users must share table contents. When tables are required for only a single session, we can request the system to create temporary tables. Using this type of table, we can save query results for use in subsequent queries within the same session, and we can break down complex queries into smaller queries by storing intermediate results in a temporary table. When the session ends, the system automatically drops the temporary table.

Global Temporary Tables
They exist only for the duration of the SQL session in which they are used. The contents of these tables are private to the session, and the system automatically drops the table at the end of that session. The system saves the global temporary table definition permanently in the Data Dictionary. The saved definition may be shared by multiple users and sessions, with each session getting its own instance of the table.

Volatile Tables
If you need a temporary table for a single use only, you can define a volatile table. The definition of a volatile table resides in memory (RAM) but does not survive a system restart. It improves performance even more than a global temporary table because the system does not store the definitions of volatile tables in the Data Dictionary. Access-rights checking is not necessary because only the creator can access the volatile table.

Derived Tables
A special type of temporary table is the derived table. It is specified in an SQL SELECT statement. A derived table is obtained from one or more other tables as the result of a subquery. The scope of a derived table is only visible to the level of the SELECT statement calling the subquery. Using derived tables avoids having to use CREATE and DROP TABLE statements for storing retrieved information and assists in coding more sophisticated, complex queries.
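A brief sketch of the corresponding DDL/usage, with hypothetical table and column names:

-- Global temporary table: definition is permanent, contents are per session
CREATE GLOBAL TEMPORARY TABLE gt_sales_stage
( sale_id INTEGER, sale_amt DECIMAL(12,2) )
ON COMMIT PRESERVE ROWS;

-- Volatile table: definition and contents vanish at the end of the session
CREATE VOLATILE TABLE vt_sales_stage
( sale_id INTEGER, sale_amt DECIMAL(12,2) )
ON COMMIT PRESERVE ROWS;

-- Derived table: defined inline in the FROM clause of a SELECT
SELECT d.sale_id, d.sale_amt
FROM ( SELECT sale_id, sale_amt FROM sales WHERE sale_amt > 1000 ) AS d;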
*******************************************************************************
COLLECT STATISTICS can be applied to global temporary tables. In a single session, up to 2,000 global temporary tables can be materialized.
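A hedged sketch of the syntax, reusing the SAMPLES.EMPLOYEES table from earlier and the hypothetical gt_sales_stage table above:

-- Classic form on a permanent table
COLLECT STATISTICS ON SAMPLES.EMPLOYEES COLUMN (EMPLOYEE_ID);

-- On a materialized global temporary table (instance created in this session)
COLLECT STATISTICS ON TEMPORARY gt_sales_stage COLUMN (sale_id);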
*******************************************************************************
Spool space:
- It is a temporary workspace which is used for processing rows for a given SQL statement.
- Spool space is assigned only to users.
- Once the SQL processing is complete, the spool is freed and given to some other query.
- Unused Perm space is automatically available for Spool.
*******************************************************************************
Types of ETL Bugs
1. User interface bugs / cosmetic bugs: related to the GUI of the application; navigation, spelling mistakes, font style, font size, colors, alignment.
2. BVA-related bugs (boundary value analysis): minimum and maximum values.
3. ECP-related bugs (equivalence class partitioning): valid and invalid types.
4. Input/output bugs: valid values not accepted, invalid values accepted.
5. Calculation bugs: mathematical errors; the final output is wrong.
6. Load condition bugs: does not allow multiple users; does not allow the customer-expected load.
7. Race condition bugs: system crash and hang; system cannot run on client platforms.
8. Version control bugs: no logo matching; no version information available; this usually occurs in regression testing.
9. H/W bugs: a device is not responding to the application.
10. Source bugs: -
*******************************************************************************
1) Constraint Testing: in the constraint testing phase, the test engineer identifies whether the data is mapped from source to target or not. The test engineer checks the following constraints in the ETL testing process: a) NOT NULL b) UNIQUE c) Primary Key d) Foreign Key e) Check f) Default g) NULL
2) Source to Target Count Testing: checks whether the record counts in source and target match. Whether the data is in ascending or descending order does not matter; only the count is required. When time is short, a tester can rely on this type of testing.
3) Source to Target Data Validation Testing: the tester validates each and every point of the source-to-target data. In most financial projects, the tester must also verify decimal factors. (A SQL sketch of the count and data-validation checks follows this list.)
4) Threshold/Data Integrated Testing: checks the ranges of the data. A test engineer typically verifies population calculations and share-market and business-finance analyses (quarterly, half-yearly, yearly). For example:
MIN 4
MAX 10
RANGE 6
5) Field to Field Testing: the test engineer checks how much space each field occupies in the database and that the data fits the table's data types.
NOTE: also check the order of the columns and the mapping of each source column to its target column.
6) Duplicate Check Testing: in this phase of ETL testing, a tester encounters duplicate values very frequently, and because a huge amount of data is present in the source and target tables, the check is done with database queries, for example:
SELECT ENO, ENAME, SAL, COUNT(*)
FROM EMP
GROUP BY ENO, ENAME, SAL
HAVING COUNT(*) > 1;
Note: duplicates may arise when:
1) There are mistakes in the Primary Key, or no Primary Key is allotted.
2) A developer makes mistakes while transferring the data from source to target.
3) Environment mistakes occur (for example, improper plug-ins in the tool).
7) Error/Exception Logical Testing:
1) Delimiter is available in valid tables.
2) Delimiter is not available in invalid tables (exception tables).
8) Incremental and Historical Process Testing: verifies that when incremental data is loaded, the historical data is not corrupted. If the historical data gets corrupted, that is the condition where bugs are raised.
9) Control Columns and Defect Values Testing: this was introduced by IBM.
10) Navigation Testing: navigation testing is testing from the end user's point of view. If an end user cannot follow the application easily, that navigation is called bad or poor navigation. During testing, a tester identifies these navigation scenarios to avoid unnecessary navigation.
11) Initialization Testing: testing the combination of hardware and software installed on the platform.
12) Transformation Testing: while mapping from source table to target table, if a transformation does not match the mapping condition, the test engineer raises bugs.
13) Regression Testing: code is modified to fix a bug or to implement new functionality, and this modification can introduce new errors. These introduced errors are called regressions; checking for regression effects is called regression testing.
14) Retesting: re-executing the failed test cases after fixing the bug.
15) System Integration Testing: after the completion of the programming process, the developer integrates the modules. There are three approaches: a) Top Down b) Bottom Up c) Hybrid
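As referenced in items 2 and 3 above, a minimal sketch of the count and data-validation checks, assuming hypothetical SRC_EMP and TGT_EMP tables with the ENO, ENAME, SAL columns from the duplicate-check example:

-- 2) Source to target count testing
SELECT 'SOURCE' AS SIDE, COUNT(*) AS ROW_COUNT FROM SRC_EMP
UNION ALL
SELECT 'TARGET', COUNT(*) FROM TGT_EMP;

-- 3) Source to target data validation: rows in source that are missing or different in target
SELECT ENO, ENAME, SAL FROM SRC_EMP
MINUS
SELECT ENO, ENAME, SAL FROM TGT_EMP;

-- 4) Threshold testing: confirm the loaded values stay within the expected range
SELECT MIN(SAL) AS MIN_SAL,
       MAX(SAL) AS MAX_SAL,
       MAX(SAL) - MIN(SAL) AS SAL_RANGE
FROM TGT_EMP;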
*******************************************************************************
What is a secondary index? What are its uses?
A secondary index is an alternate path to the data. Secondary indexes are used to improve performance by allowing the user to avoid scanning the entire table during a query. A secondary index is like a primary index in that it allows the user to locate rows; unlike a primary index, it has no influence on the way rows are distributed among AMPs. Secondary indexes are optional and can be created and dropped dynamically. Secondary indexes require separate subtables, which require extra I/O to maintain. Compared to primary indexes, secondary indexes allow access to information in a table by alternate, less frequently used paths. Teradata automatically creates a secondary index subtable. The subtable will contain:
- Secondary index value
- Secondary index row ID
- Primary index row ID
When a user writes an SQL query that has an SI in the WHERE clause, the parsing engine (PE) will hash the secondary index value. The output is the row hash of the SI. The PE creates a request containing the row hash and gives the request to the message passing layer (which includes the BYNET software and network). The message passing layer uses a portion of the row hash to point to a bucket in the hash map. That bucket contains an AMP number to which the PE's request will be sent. The AMP gets the request and accesses the secondary index subtable pertaining to the requested SI information. The AMP checks whether the row hash exists in the subtable and double-checks the subtable row against the actual secondary index value. Then, the AMP creates a request containing the primary index row ID and sends it back to the message passing layer. This request is directed to the AMP with the base table row, and that AMP easily retrieves the data row.
Secondary indexes can be useful for:
- Satisfying complex conditions
- Processing aggregates
- Value comparison
- Matching character combinations
- Joining tables
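As a quick hedged illustration of creating and dropping a secondary index dynamically, and a query that can use it (assuming the SAMPLES.EMPLOYEES table from earlier and a hypothetical DEPT_NO column):

CREATE INDEX (DEPT_NO) ON SAMPLES.EMPLOYEES;     -- NUSI on DEPT_NO

SELECT EMPLOYEE_ID
FROM SAMPLES.EMPLOYEES
WHERE DEPT_NO = 100;                             -- the optimizer may use the NUSI subtable instead of a full-table scan

DROP INDEX (DEPT_NO) ON SAMPLES.EMPLOYEES;
*********************************************************************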