19/02/2010
BI / Teradata
Arockkia Martin Irudayaraja
arockkiamartin.i@tcs.com
Primary Index
Low Confidence: Normally means the STATS are difficult to use precisely, for example when
AND/OR conditions are used in the WHERE clause, or when Table1 is joined on its PI to
Non-PI column(s) of Table2.
High Confidence: Normally means the Optimiser is sure of the results based on the
STATS available.
Joins
The following are the types of joins available that can be used in SQL:
Inner Join
Cross Join
Apart from these, Teradata carries out data retrieval using the following special joins:
Merge Join
Nested Join
Hash Join
Merge Join
The merge join requires that both tables (or spool files) be sorted on the hash code of
the columns being joined (or a subset of the columns being joined). This is already the case
if both have the same primary index; otherwise the optimizer may decide that the cost of
sorting one or both tables (or spools) into this state is not too great.
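The sort-then-match behaviour described above can be sketched in Python. This is only an
illustration under stated assumptions: row_hash and merge_join are hypothetical names, and
Python's built-in hash stands in for Teradata's row hash, which works differently internally.

```python
def row_hash(value):
    # Stand-in for the hash code of the join column (assumption, not Teradata's algorithm).
    return hash(value) & 0xFFFF


def merge_join(left, right, key_left, key_right):
    """Join two lists of dict rows by sorting both on the hash of the join column."""
    left_sorted = sorted(left, key=lambda r: row_hash(r[key_left]))
    right_sorted = sorted(right, key=lambda r: row_hash(r[key_right]))
    out = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        hl = row_hash(left_sorted[i][key_left])
        hr = row_hash(right_sorted[j][key_right])
        if hl < hr:
            i += 1
        elif hl > hr:
            j += 1
        else:
            # Find the run of rows sharing this hash value on each side.
            i_end = i
            while i_end < len(left_sorted) and row_hash(left_sorted[i_end][key_left]) == hl:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and row_hash(right_sorted[j_end][key_right]) == hl:
                j_end += 1
            for l_row in left_sorted[i:i_end]:
                for r_row in right_sorted[j:j_end]:
                    # Equal hashes can be collisions, so compare the real values too.
                    if l_row[key_left] == r_row[key_right]:
                        out.append({**l_row, **r_row})
            i, j = i_end, j_end
    return out


customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}, {"cust_id": 3, "name": "C"}]
orders = [{"order_id": 10, "cust_id": 2}, {"order_id": 11, "cust_id": 1},
          {"order_id": 12, "cust_id": 2}, {"order_id": 13, "cust_id": 9}]
joined = merge_join(orders, customers, "cust_id", "cust_id")
```

Once both inputs are in hash order, each side is read only once. This is why the join is
cheap when the tables already share a primary index and no sort step is needed.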
Nested Join
The nested join requires a condition to be specified in the WHERE clause of the SQL.
Hash Join
The hash join does not require that both tables are sorted. The smaller table/spool
is "hashed" into memory (sometimes using multiple hash partitions). Then, the larger table
is scanned and for each row, it looks up the row from the smaller table in the hashed table
that was created in memory. If the smaller table must be broken into partitions to fit into
memory, the larger table must also be broken into the same partitions prior to the join.
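The build and probe steps above can be sketched in Python as follows. The names partition
and hash_join are hypothetical, and a Python dict plays the role of the in-memory hashed
table; this is a simplified model, not Teradata's implementation.

```python
from collections import defaultdict


def partition(rows, key, n):
    # Split rows into n hash partitions on the join column.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def hash_join(small, large, key_small, key_large, partitions=1):
    """Join two lists of dict rows; neither input has to be sorted."""
    out = []
    small_parts = partition(small, key_small, partitions)
    # The larger table is broken into the same partitions as the smaller one.
    large_parts = partition(large, key_large, partitions)
    for small_part, large_part in zip(small_parts, large_parts):
        # Build phase: hash one partition of the smaller table into memory.
        table = defaultdict(list)
        for row in small_part:
            table[row[key_small]].append(row)
        # Probe phase: scan the larger table and look each row up in the hashed table.
        for row in large_part:
            for match in table.get(row[key_large], []):
                out.append({**match, **row})
    return out


customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}, {"cust_id": 3, "name": "C"}]
orders = [{"order_id": 10, "cust_id": 2}, {"order_id": 11, "cust_id": 1},
          {"order_id": 12, "cust_id": 2}, {"order_id": 13, "cust_id": 9}]
one_pass = hash_join(customers, orders, "cust_id", "cust_id")
partitioned = hash_join(customers, orders, "cust_id", "cust_id", partitions=3)
```

Both calls return the same matches. Partitioning only changes how much of the smaller
table has to be held in memory at one time.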
The main advantage of the hash join is that the big table does not have to be
sorted and the small table can be much larger than for a product join. However, if the
optimizer thinks that the small table is too large, then it will not choose the hash join as it
will not be able to fit the small table in memory, even after breaking into partitions.
In any case, the optimizer will try to determine the best path based on the cost of the
alternatives.
Product Joins
A product join occurs when Teradata compares every row of the first table to every row of
the second table. This process can use huge amounts of CPU and SPOOL. A product join is
likely to happen in one of the following conditions:
When an alias has been used to identify a table, but the alias has not been used
consistently throughout the SQL to identify the table. Teradata then believes
that a reference is being made to another copy of the table, and because there is no join
condition placed on that other copy, the result is a product join.
The same type of problem exists when a join is attempted on part of a column (e.g. when
using Substring). Even if Statistics have been collected for the column, Teradata cannot
know the distribution of values in the substring.
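To give a feel for the cost of the row-by-row comparison described in this section, here is
a minimal Python sketch. The function product_join and the sample row counts are
hypothetical and only model the number of comparisons, not Teradata's execution.

```python
def product_join(left, right, predicate=None):
    """Compare every row of left with every row of right and count the comparisons."""
    comparisons = 0
    out = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            # With no join condition at all, every pair of rows is returned.
            if predicate is None or predicate(l_row, r_row):
                out.append((l_row, r_row))
    return out, comparisons


employees = [{"emp_id": n, "dept_id": n % 4} for n in range(1000)]
departments = [{"dept_id": n} for n in range(4)]

# No join condition, as happens when an alias is not used consistently.
all_pairs, work_without_condition = product_join(employees, departments)

# A join condition reduces the rows returned, but this method still compares every pair.
matched, work_with_condition = product_join(
    employees, departments, lambda e, d: e["dept_id"] == d["dept_id"]
)
```

The work grows with the product of the two row counts, which is why a product join between
two large tables consumes so much CPU and SPOOL.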
Avoid Manipulated Columns in Where clauses
Statistics should exist on columns used in Where clauses to restrict the rows being
returned (or in join conditions). When coding the restrictions or joins, avoid manipulating
the columns wherever possible. The optimiser is unable to utilise the statistics on
manipulated columns.
For example, rather than coding ColumnA - 2 < Date, code ColumnA < Date + 2.
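The two forms of the condition select the same rows, which the Python sketch below checks.
It is an analogy only: the sorted list column_a stands in for what statistics or an index
know about ColumnA, and all names and sample values are assumptions.

```python
from bisect import bisect_left
from datetime import date, timedelta

# Sorted ColumnA values (hypothetical sample data).
column_a = sorted(date(2010, 1, 1) + timedelta(days=n) for n in range(0, 60, 3))
today = date(2010, 2, 19)


def count_manipulated(values, current_date):
    # ColumnA - 2 < Date: the expression is evaluated for every row.
    return sum(1 for v in values if v - timedelta(days=2) < current_date)


def count_bare_column(values, current_date):
    # ColumnA < Date + 2: the constant is computed once, then the sorted
    # column values are searched directly.
    cutoff = current_date + timedelta(days=2)
    return bisect_left(values, cutoff)


manipulated_count = count_manipulated(column_a, today)
bare_count = count_bare_column(column_a, today)
```

Moving the arithmetic to the constant side leaves the column untouched, so whatever is
already known about the column's values can be used directly.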
Use Date Functions where possible
There are a number of Date Functions that are available to help with the
manipulation of columns that are defined as Dates. Use them rather than attempting to
redefine the date as a character column and split it into its component parts. Teradata
performs date manipulation much more efficiently through these functions.
Use Union All instead of just Union
When creating a Union of two sets of rows, the default form of the statement
checks for and removes duplicate rows, which is unnecessary if duplicates are acceptable.
In the majority of situations, either it is known that duplicates cannot possibly exist, or,
if they do exist, it is correct to select them. Therefore, in the majority of cases it is
better to code Union All, which skips the duplicate check and returns every row.
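The difference between the two forms can be seen in a small runnable example. It uses
Python's built-in sqlite3 module rather than Teradata, on the assumption that the standard
UNION and UNION ALL semantics are the same; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (item TEXT, qty INTEGER);
    CREATE TABLE sales_feb (item TEXT, qty INTEGER);
    INSERT INTO sales_jan VALUES ('pen', 5), ('ink', 2);
    INSERT INTO sales_feb VALUES ('pen', 5), ('pad', 1);
""")

# UNION checks for duplicates, so the repeated ('pen', 5) row is returned once.
union_rows = conn.execute(
    "SELECT item, qty FROM sales_jan UNION SELECT item, qty FROM sales_feb"
).fetchall()

# UNION ALL skips the duplicate check and returns every row from both selects.
union_all_rows = conn.execute(
    "SELECT item, qty FROM sales_jan UNION ALL SELECT item, qty FROM sales_feb"
).fetchall()

print(len(union_rows), len(union_all_rows))  # prints: 3 4
```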
Statistics should be collected on the primary index columns of every table, and also on all
known columns used in joins or restrictions in queries.
Statistics should normally be collected after the data has been loaded, or reloaded,
or significantly updated.
If the table changes so frequently that recollecting statistics every time would have
a resource impact, then a threshold after which statistics will be collected must be
identified.
If statistics are not collected or are not current, and Teradata uses the wrong plan as a
result, then many thousands of CPU seconds can be used instead of a few hundred.
The elapsed time of queries is frequently reduced from hours to minutes through
judicious collection of statistics.
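The threshold idea for frequently changing tables could be expressed as a simple check like
the one below. The function name and the 10% default are assumptions chosen for
illustration; the right threshold has to be identified for each table.

```python
def should_recollect(rows_changed, rows_at_last_collect, threshold=0.10):
    """Return True once the share of changed rows reaches the threshold.

    threshold=0.10 is a hypothetical default, not a value taken from this document.
    """
    if rows_at_last_collect == 0:
        # Statistics were collected on an empty table; any new data makes them stale.
        return rows_changed > 0
    return rows_changed / rows_at_last_collect >= threshold


# 50,000 changed rows out of 1,000,000 is 5%, below the assumed 10% threshold.
small_change = should_recollect(50_000, 1_000_000)
# 150,000 changed rows out of 1,000,000 is 15%, so statistics would be recollected.
large_change = should_recollect(150_000, 1_000_000)
```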
Listed below are the 4 different table types and their characteristics:
Set Table
Multiset Table
Volatile Table