Practical No. 1
Data Transformations in a Data Warehouse
Title :
Create a warehouse in MS SQL Server 2000 and import data from external
sources such as Excel, Access, and .txt files by using the DTS tool.
Theory :
Data from various sources and in various formats is stored in a data
warehouse through the process of ETL (Extraction, Transformation, Loading).
Transformations convert data arriving in different formats into a common
format that is compatible with the warehouse database.
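As a rough illustration of the kind of format transformation applied during such
a load, the following T-SQL sketch converts raw text fields imported from a .txt
file into typed warehouse columns (the staging and warehouse table names here
are hypothetical):
-- DTS fills Staging_SalesText from the .txt source; CAST/CONVERT
-- then transform the raw text fields into typed warehouse columns.
INSERT INTO Sales_Warehouse (OrderID, OrderDate, Quantity)
SELECT CAST(OrderID AS int),
CONVERT(datetime, OrderDate, 103), -- dd/mm/yyyy text to datetime
CAST(Quantity AS int)
FROM Staging_SalesText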
Screenshot:
6) Right click on one of the arrows of Transform Data Task and click on
Properties option. Select the Destination Tab option and click on OK in
Create Destination Table form.
Screenshot:
7) Having created the table with the same name as the data file, click on the
Transformations tab in the Transform Data Task Properties form.
Click on Delete All, then Select All, then New.
Next select Copy Column and press OK.
Press OK again in Transformation Options.
Screenshot:
12) Perform steps 5 to 8 for loading the transformed data in the database
dbase2 from database dbase1. Click on Execute Step in right click menu of
Transform Data Task between the two databases.
Practical No. 2
Querying the database
Title:
Create and schedule a DTS package using the Data Transformation Services
(DTS) tool. Fire at least 5 queries on the database.
Click OK.
In the SQL Query Analyzer window, fire the queries given below.
Explanation:
1) sysobjects
The SQL Server sysobjects table contains one row for each object created within
a database: a row for every constraint, default, log, rule, stored procedure,
and so on. This table can therefore be used to retrieve information about the
database.
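For example, a query along the following lines lists all user tables in the
current database ('U' is the sysobjects type code for user tables):
SELECT name, id, crdate
FROM sysobjects
WHERE type = 'U'
ORDER BY name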
2) OBJECT_ID
Returns the database object identification number.
Syntax : OBJECT_ID ( 'object' )
Arguments : 'object'
Is the object to be used. object is either char or nchar. If object is char, it is
implicitly converted to nchar.
Return Types : int
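A small example (the table name Sales_Fact is borrowed from the warehouse
built earlier):
SELECT OBJECT_ID('Sales_Fact')
-- Returns the id of Sales_Fact, or NULL if no such object exists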
3) UNICODE STRING
Unicode strings have a format similar to character strings but are preceded by an
N identifier (N stands for National Language in the SQL-92 standard). The N
prefix must be uppercase. For example, 'Michél' is a character constant while
N'Michél' is a Unicode constant. Unicode constants are interpreted as Unicode
data, and are not evaluated using a code page. Unicode constants do have a
collation, which primarily controls comparisons and case sensitivity. Unicode
constants are assigned the default collation of the current database, unless the
COLLATE clause is used to specify a collation. Unicode data is stored using two
bytes per character, as opposed to one byte per character for character data.
In Microsoft SQL Server, these data types support Unicode data:
nchar
nvarchar
ntext
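A short sketch showing the N prefix and the two-bytes-per-character storage:
DECLARE @name nvarchar(30)
SET @name = N'Michél' -- the N prefix marks a Unicode constant
SELECT @name, DATALENGTH(@name) -- 6 characters stored in 12 bytes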
The OBJECTPROPERTY function exposes object properties such as the following:
Property : IsUserTable
Object type : Table
Description : User-defined table. 1 = True, 0 = False.
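A minimal sketch combining sysobjects and this property (the table name is
borrowed from the earlier practicals):
SELECT OBJECTPROPERTY(OBJECT_ID('Product_Dim'), 'IsUserTable')
-- Returns 1 if Product_Dim is a user-defined table, 0 otherwise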
4) IDENTITY (Property)
Creates an identity column in a table. This property is used with the CREATE
TABLE and ALTER TABLE Transact-SQL statements.
Note: The IDENTITY property is not the same as the SQL-DMO Identity
property, which exposes the row identity property of a column.
Syntax: IDENTITY [ ( seed , increment ) ]
Arguments
seed
Is the value that is used for the very first row loaded into the table.
increment
Is the incremental value that is added to the identity value of the previous
row that was loaded.
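A small sketch (the table name is hypothetical) using a seed of 1 and an
increment of 1:
CREATE TABLE Orders_Demo
(
OrderID int IDENTITY(1, 1) NOT NULL, -- first row gets 1, the next 2, and so on
OrderDate datetime NOT NULL
)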
Screenshot:
Queries:
1. Show the total units sold for each product where the required date is earlier
than today.
SELECT Product_Dim.ProductName, Product_Dim.CategoryName,
Product_Dim.SupplierName,
SUM(Sales_Fact.LineItemQuantity) AS [Total Units Sold],
Sales_Fact.RequiredDate
FROM Sales_Fact
INNER JOIN Product_Dim ON
Sales_Fact.ProductKey = Product_Dim.ProductKey
WHERE Sales_Fact.RequiredDate < GETDATE()
GROUP BY Product_Dim.ProductName,
Product_Dim.CategoryName,
Product_Dim.SupplierName,
Sales_Fact.RequiredDate
2. Show the product name, category name, supplier name and total units sold
for each product where the total units sold is greater than 100.
SELECT Product_Dim.ProductName,
Product_Dim.CategoryName,
Product_Dim.SupplierName,
SUM(Sales_Fact.LineItemQuantity) AS [Total Units Sold],
Sales_Fact.RequiredDate
FROM Sales_Fact INNER JOIN
Product_Dim ON
Sales_Fact.ProductKey = Product_Dim.ProductKey
GROUP BY Product_Dim.ProductName,
Product_Dim.CategoryName,
Product_Dim.SupplierName,
Sales_Fact.RequiredDate
HAVING (SUM(Sales_Fact.LineItemQuantity) > 100)
Screenshot:
3. To view the total units sold of all the products under the category whose
average sale is greater than 50% (i.e. AVG(LineItemQuantity) > 0.5).
SELECT Product_Dim.ProductName,
Product_Dim.CategoryName,
SUM(Sales_Fact.LineItemQuantity) AS [Total Units Sold]
FROM Sales_Fact INNER JOIN
Product_Dim ON
Sales_Fact.ProductKey = Product_Dim.ProductKey
GROUP BY Product_Dim.ProductKey,
Product_Dim.ProductName,
Product_Dim.CategoryName
HAVING (AVG(Sales_Fact.LineItemQuantity) >0.5)
Screenshot:
4. To View Company Name and Total Quantity sold for all the Products
SELECT Customer_Dim.CompanyName, Sum(Sales_Fact.LineItemQuantity) AS
TotalQtySold
FROM Sales_Fact, Customer_Dim
WHERE Sales_Fact.CustomerKey=Customer_Dim.CustomerKey
GROUP BY Customer_Dim.CompanyName
ORDER BY Sum(Sales_Fact.LineItemQuantity) DESC
Screenshot:
Practical No. 3
Single Dimensional OLAP Cube
Title:
Create a database using Analysis Manager (a snap-in of the Microsoft
Management Console, MMC) and create a single dimensional cube by using the
star schema.
Theory:
A cube stores complex business data in a multidimensional structure.
Data sources, a fact table, dimensions and measures are selected for the cube.
The cube is then processed with the selected elements and used for analysis
(drill-down and drill-up techniques). Here the cube uses a single dimension
following the star schema.
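Conceptually, the cube precomputes aggregations that would otherwise require a
star join at query time. A rough T-SQL equivalent of one such aggregation (fact
and dimension table names are borrowed from Practical 2 for illustration):
SELECT d.CategoryName, SUM(f.LineItemQuantity) AS TotalUnits
FROM Sales_Fact f
INNER JOIN Product_Dim d ON f.ProductKey = d.ProductKey
GROUP BY d.CategoryName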
DBMS used: Microsoft SQL Server 2000 along with Analysis Manager.
Steps:
Creating a new database
1) Right click on server which is seen when expanding Analysis Server. Click on
New Database. Supply appropriate Database name.
Creating a new data source
2) Expand the newly created database; right click on Data Sources and select
New Data Source. Select the Provider as Microsoft.Jet.OLEDB.4.0 Provider.
Screenshot:
3) Click on Next >>; select or enter the database name; click on Test Connection
to test the connection with database.
Screenshot:
5) Select a fact table from the data source. A fact table holds the numeric
measurements (facts) of the business process, keyed to the dimension tables.
Screenshot:
6) Click on Next >; now select the appropriate numeric columns that define
necessary measures.
Screenshot:
8) Click Next >; select Star Schema. Select the dimension table.
(Please select the dimension table which has a relationship with the fact table)
Screenshot:
11) Click Next >; again click Next > in specify numeric key columns.
Screenshot:
14) Click Finish; again click Next >. Select Yes in Fact Table Row Count
message box.
Screenshot:
15) Specify appropriate Cube Name and click on Finish. The schema is now
shown.
Screenshot:
17) Click Next > in storage design wizard. Select MOLAP as type of data
storage.
19) Select Process now in storage design wizard to process the cube.
Screenshot:
Practical No. 4
Multidimensional OLAP Cube
Title:
Create a database by using Analysis Manager (a snap-in of the Microsoft
Management Console, MMC) and create a multidimensional OLAP cube by
using the snowflake schema.
Theory:
A cube stores complex business data in a multidimensional structure.
Data sources, a fact table, dimensions and measures are selected for the cube.
The cube is then processed with the selected elements and used for analysis
(drill-down and drill-up techniques). Here the cube uses multiple dimensions
following the snowflake schema.
DBMS used: Microsoft SQL Server 2000 along with Analysis Manager.
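In a snowflake schema a dimension is normalized into several related tables, so
each extra level costs one more join at query time. A hedged T-SQL sketch of
such a query (the table and column names follow the product/product_class
split used in Practical 6 and should be treated as assumptions):
SELECT pc.product_category, SUM(f.unit_sales) AS TotalUnitSales
FROM sales_fact f
INNER JOIN product p ON f.product_id = p.product_id
INNER JOIN product_class pc ON p.product_class_id = pc.product_class_id
GROUP BY pc.product_category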
Steps:
Creating a new database
1) Right click on server which is seen when expanding Analysis Server. Click on
New Database. Supply appropriate Database name.
Screenshot:
3) Click on Next >>; select or enter the database name; click on Test Connection
to test the connection with database.
6) Click on Next >; now select the appropriate numeric columns that define
necessary measures.
Screenshot:
8) Click Next >; select Snowflake Schema. Select the dimension table.
(Please select the dimension table which has a relationship with the fact table)
Screenshot:
10) Next, drag and drop columns to provide relationships between the
dimension tables; in other words, create a join.
Screenshot:
12) Click Next >; again click Next > in specify numeric key columns.
Screenshot:
15) Click Finish; again click Next >. Select Yes in Fact Table Row Count
message box.
Screenshot:
16) Specify appropriate Cube Name and click on Finish. The schema is now
shown.
Screenshot:
18) Click Next > in storage design wizard. Select MOLAP as type of data
storage.
Screenshot:
19) Select Process now in storage design wizard to process the cube.
Screenshot:
20) Select the data tab. Right click on a category of dimension. Select Drill
down.
Screenshot:
Practical No. 5
Mining Model using Relational Data
Title:
Create a mining model on relational data using the Microsoft Decision Trees
algorithm.
Theory:
A mining model is a data structure that represents discovered knowledge based
on analysis of OLAP or relational data. Mining models can be used to make
predictions.
Steps:
Creating a new database
1) Right click on server which is seen when expanding Analysis Server. Click on
New Database. Supply appropriate Database name.
Screenshot:
3) Click on Next >>; select or enter the database name; click on Test Connection
to test the connection with database.
Screenshot:
8) Click Next to create and edit joins. Joins are created automatically if there
is a key relationship between the relational tables.
Screenshot:
10) Click Next to select the input and predictable columns. An input column
contains the base information for analysis. A predictable column holds the
values the mining model learns to predict from the input columns.
Screenshot:
13) Select Content Tab to view the decision tree. Select a particular option of
Prediction Tree combo box to view its appropriate tree.
Screenshot:
Practical No. 6
Mining Model using OLAP data
Title:
Create a mining model using OLAP data.
Theory:
A mining model is a data structure that represents discovered knowledge based
on analysis of OLAP or relational data. In this practical the mining model uses
the OLAP cube built with the snowflake and star schemas.
DBMS used: Microsoft SQL Server 2000 along with MMC snap-in Analysis
Manager.
Steps:
Creating a new database
1) Right click on server which is seen when expanding Analysis Server. Click on
New Database. Supply appropriate Database name.
Screenshot:
3) Click on Next >>; select or enter the database name in Data Link Properties;
click on Test Connection to test the connection with database.
6) Click Next > to select the numeric columns that define the measures.
Select the measures as store cost, store sales, unit sales.
Screenshot:
7) Click Next > to create the dimensions for the cube. (Note: 4 dimensions will
now be created).
Click New Dimension to open the dimension wizard. Click Next > and select
star schema.
Screenshot:
8) Click Next > to select the dimension table. Select time_by_day as the
dimension table and click Next > to select the type of dimension.
10) In the create dimension levels step, select the default options and click
Next >.
15) Click Next > in specify member key column. Again click Next > in
select advanced options. Supply name as customer dimension and click
Finish.
17) Select the following dimension tables: product and product_class and
click Next >.
Screenshot:
18) Click Next > in create and edit joins if joins are already present, else drag
drop appropriate columns to create or edit joins.
Screenshot:
19) Click Next > and select the following levels of dimension:
20) Click Next > in select member key column. Click Next > in select
advanced options. Supply dimension name as product dimension and click
Finish.
22) Select the dimension table as store and click Next >.
Screenshot:
23) Select standard dimension in select dimension type. Click Next > and
select the following levels of dimension: store_country, store_state, store_city,
store_name.
Screenshot:
24) Click Next > in select member key column. Click Next > in select
advanced options. Supply dimension name as store dimension and click
Finish.
25) Click Next > in cube wizard. Click yes in fact table row count message
box. Supply cube name as sales cube and click Finish.
Screenshot:
Screenshot:
30) Click Next >. Select Process now and click Finish.
Performing Drill down of the cube
31) Click Data Tab in cube schema. Select appropriate dimension and perform
drill down by right clicking on + signs.
Screenshot:
Edit a Cube
You can make changes to your existing cube by using Cube Editor.
How to edit your cube in Cube Editor
You can use two methods to get to Cube Editor:
1. In the Analysis Manager tree pane, right-click an existing cube, and then
click Edit.
2. Create a new cube using Cube Editor directly. This method is not
recommended unless you are an advanced user.
In the schema pane of Cube Editor, the fact table (with yellow title bar) and the
joined dimension tables (blue title bars) are seen. In the Cube Editor tree pane,
you can preview the structure of your cube in a hierarchical tree. You can edit the
properties of the cube by clicking the Properties button at the bottom of the left
pane.
7. Under What do you want to do?, select Process now, and then click
Finish.
Note: Processing the aggregations may take some time.
8. In the window that appears, you can watch your cube while it is being
processed. When processing is complete, a message appears confirming
that the processing was completed successfully.
9. Click Close to return to the Analysis Manager tree pane.
Practical No. 7
Implementing the Decision Tree Algorithm
The decision tree approach is most useful in classification problems. With
this technique, a tree is constructed to model the classification process. Once
the tree is built, it is applied to each tuple in the database and yields a
classification for that tuple. There are two basic steps in the technique:
building the tree and applying the tree to the database.
#include <iostream.h>
#include <stdio.h>
#include <string.h>
#include <conio.h>

const int n=10;              // number of persons (taken from the sample output)

struct person
{
char name[20];
char gender;
float height;
char output1[10];            // classification ignoring gender
char output2[10];            // classification using gender-specific ranges
};

void main()
{
clrscr();
person p[n];
cout<<"Enter the data of the form \nName,gender,height\n";
for(int i=0;i<n;i++)
{
cout<<"For person "<<i+1<<". :";
fflush(stdin);               // clear the newline left by the previous cin>>
gets(p[i].name);
cin>>p[i].gender;
cin>>p[i].height;
//For classifying based on output1
if(p[i].height<=1.7)
strcpy(p[i].output1,"Short");
else if(p[i].height<2)
strcpy(p[i].output1,"Medium");
else if(p[i].height>=2)
strcpy(p[i].output1,"Tall");
//For classifying output2
if(p[i].gender=='m' || p[i].gender=='M')
{
if(p[i].height<1.7)
strcpy(p[i].output2,"Short");
else if(p[i].height<2.1)
strcpy(p[i].output2,"Medium");
else if(p[i].height>=2.1)
strcpy(p[i].output2,"Tall");
}
else if(p[i].gender=='f' || p[i].gender=='F')
{
if(p[i].height<1.5)
strcpy(p[i].output2,"Short");
else if(p[i].height<=1.8)
strcpy(p[i].output2,"Medium");
else if(p[i].height>1.8)
strcpy(p[i].output2,"Tall");
}
}
cout<<"\nOutput1\n";
for(i=0;i<n;i++)
cout<<p[i].output1<<"\n";
cout<<"\nOutput2\n";
for(i=0;i<n;i++)
cout<<p[i].output2<<"\n";
getch();
}
/*OUTPUT :
Enter the data of the form
Name,gender,height
For person 1. :Kris
F
1.6
For person 2. :Jim
M
2
For person 3. :Maggie
F
1.9
For person 4. :Martha
F
1.88
For person 5. :Stepy
F
1.7
For person 6. :Bob
M
1.85
For person 7. :Kathy
F
1.6
Output1
Short
Tall
Medium
Medium
Short
Medium
Short
Short
Tall
Tall
Output2
Medium
Medium
Tall
Tall
Medium
Medium
Medium
Medium
Tall
Tall
*/
scanf("%d",&gender[i]);
if(gender[i]==1)
{
male++;
}
else
{
female++;
}
}
}
probms=(float)shrtm/shrt;
probmm=(float)medm/med;
probml=(float)lngm/lng;
probfs=(float)shrtf/shrt;
probfm=(float)medf/med;
probfl=(float)lngf/lng;
probs=(float)shrt/ppl;
probm=(float)med/ppl;
probl=(float)lng/ppl;
printf("\n");
printf("\nProbability
printf("\nProbability
printf("\nProbability
printf("\nProbability
printf("\nProbability
printf("\nProbability
of
of
of
of
of
of
Practical No. 8
Implementing the K Nearest Neighbors Algorithm
K Nearest Neighbors (KNN) is a common classification scheme based on
the use of distance measures. The KNN technique assumes that the entire
training set includes not only the data in the set but also the desired classification
for each item. Thus the training data becomes the model. When a classification is
to be made for a new item, its distance to each item in the training set must be
determined. Only the K closest entries in the training set are considered further.
The new item is then placed in the class that contains the most items from this
set of K closest items.
#include <stdio.h>
#include <math.h>
#define MX 10
int mod_sub (int a, int b)
{
if (a<b)
{
return ((a-b)*(a-b));
}
else
{
return ((b-a)*(b-a));
}
}
int find_dist (int x1,int y1,int x2,int y2)
{
int dd;
dd=(int)(sqrt(mod_sub(y2,y1)+mod_sub(x2,x1)));
return dd;
}
int main()
{
int T[MX+1][2];
int tx,ty;
int k,x,y,i,j,temp;
int dist[MX+1][2];
printf("\nEnter training Data (x,y):-\n" );
for(i=0;i<MX;i++)
{
printf("Enter P%d (x,y) :",i);
scanf("%d %d", &T[i][0],&T[i][1]);
}
printf("\nEnter number of neighbours (k):");
scanf("%d",&k);
printf("\nEnter point t (x,y):");
scanf("%d %d",&tx,&ty);
for(i=0;i<MX;i++)
{
dist[i][0]=i;
dist[i][1]=find_dist(T[i][0],T[i][1],tx,ty);
}
printf("\nDistances of all points from 't' are:");
for(i=0;i<MX;i++)
{
printf("\nPoint %d Distance =%d",dist[i][0],dist[i][1]);
}
// Sort the points by distance (selection sort) and report the k nearest.
for(i=0;i<MX;i++)
{
for(j=i+1;j<MX;j++)
{
if(dist[j][1]<dist[i][1])
{
temp=dist[i][0]; dist[i][0]=dist[j][0]; dist[j][0]=temp;
temp=dist[i][1]; dist[i][1]=dist[j][1]; dist[j][1]=temp;
}
}
}
printf("\n\nThe %d nearest neighbours of 't' are:",k);
for(i=0;i<k;i++)
{
printf("\nPoint %d Distance =%d",dist[i][0],dist[i][1]);
}
return 0;
}
/* OUTPUT :
Enter training Data (x,y):-
Enter P0 (x,y) :20 50
Enter P1 (x,y) :52 44
Enter P2 (x,y) :64 55
Enter P3 (x,y) :75 59
Enter P4 (x,y) :45 76
Enter P5 (x,y) :87 94
Enter P6 (x,y) :65 99
Enter P7 (x,y) :57 94
Enter P8 (x,y) :20 90
Enter P9 (x,y) :94 64
Point 6 Distance =22
Point 7 Distance =17
Point 8 Distance =50
Point 9 Distance =21
*/
Practical No. 9
Implementing the K-Means Clustering Algorithm
K-Means is an iterative clustering algorithm in which items are moved
among sets of clusters until the desired set is reached. A high degree of similarity
among elements in clusters is obtained, while a high degree of dissimilarity
among elements in different clusters is achieved simultaneously. The cluster
mean of $K_i = \{t_{i1}, t_{i2}, \ldots, t_{im}\}$ is defined as
$$m_i = \frac{1}{m} \sum_{j=1}^{m} t_{ij}$$
This algorithm requires that some definition of a cluster mean exists, though it
does not have to be this particular one. The desired number of clusters, k, is
taken as input.
cout<<"K1=";
int i;
for(i=0;i<9;i++)
{
if(k1[i]!=0)
cout<<k1[i]<<"
";
}
cout<<endl<<"K2=";
for(i=0;i<9;i++)
{
if(k2[i]!=0)
cout<<k2[i]<<"
}
";
void main()
{
clrscr();
int num[9]={2,4,10,12,3,20,30,11,25};
int K1[9]={0,0,0,0,0,0,0,0,0};
int K2[9]={0,0,0,0,0,0,0,0,0};
int oldK1[9],oldK2[9];
int noK1=0,noK2=0,m;
double m1,m2,mean,sumK1=0,sumK2=0;
int i,same=0,sameCount;
cout<<"Considering number of clusters required 'k'=2 ";
cout<<endl<<"Set of numbers considered : ";
for(i=0;i<9;i++)
cout<<num[i]<<" ";
m1=num[0];
m2=num[1];
mean=(m1+m2)/2;
int c1=0,c2=0;
for(i=0;i<9;i++)
{
if(num[i]>=m1 && num[i]<=mean)
{
K1[c1]=num[i];
c1++;
}
else
{
K2[c2]=num[i];
c2++;
}
}
PrintClusters(K1,K2);
while(!same)
{
// remember the current clusters
for(i=0;i<9;i++)
{
oldK1[i]=K1[i];
oldK2[i]=K2[i];
}
// recompute the two cluster means
sumK1=0; sumK2=0; noK1=0; noK2=0;
for(i=0;i<9;i++)
{
if(K1[i]!=0) { sumK1+=K1[i]; noK1++; }
if(K2[i]!=0) { sumK2+=K2[i]; noK2++; }
}
m1=sumK1/noK1;
m2=sumK2/noK2;
cout<<"m1= "<<m1<<"  m2= "<<m2<<endl;
// reassign each element to the cluster with the nearer mean
for(i=0;i<9;i++)
{
if(K2[i]!=0)
{
if(abs(m1-K2[i])<abs(m2-K2[i]))
{
//shift into K1
for(int j=0;j<9;j++)
{
if(K1[j]==0)
{
K1[j]=K2[i];
K2[i]=0;
break;           // move the element once, then stop scanning
}
}
}
}
// symmetric shift: move any element of K1 that is closer to m2
if(K1[i]!=0)
{
if(abs(m2-K1[i])<abs(m1-K1[i]))
{
for(int j=0;j<9;j++)
{
if(K2[j]==0)
{
K2[j]=K1[i];
K1[i]=0;
break;
}
}
}
}
}
// count the positions at which the clusters did not change
sameCount=0;
for(i=0;i<9;i++)
{
if(oldK1[i]==K1[i] && oldK2[i]==K2[i])
{
sameCount++;
}
}
if(sameCount==9)
same=1;
PrintClusters(K1,K2);
}
getch();
}
/* OUTPUT :
Enter the number of clusters required : 2
K1=2,3,
K2=4, 10, 12, 20, 30, 11, 25,
m1= 2.5 m2= 16
K1=2,3,4,
K2=10, 12, 20, 30, 11, 25,
m1= 3 m2= 18
K1=2,3,4,10,
K2=12, 20, 30, 11, 25,
m1= 4.75 m2= 19.6
K1=2,3,4,10,11,
K2=12, 20, 30, 25,
m1= 6 m2= 21.75
K1=2,3,4,10,11,12,
K2=20, 30, 25,
m1= 7 m2= 25
K1=2,3,4,10,11,12,
K2=20, 30, 25,
m1= 7 m2= 25
K1=2,3,4,10,11,12,
K2=20, 30, 25,
*/
Practical No. 10
Implementing the Agglomerative Algorithm (Single Link)
Agglomerative algorithms, a type of clustering algorithm, start with each
individual item in its own cluster and iteratively merge clusters until all items
belong to one cluster. Different agglomerative algorithms differ in how the
clusters are merged at each level. The algorithm takes as input a set of
elements and the distances between them, given as an n x n vertex adjacency
matrix A, where A[i,j] = dis(ti,tj). The output of the algorithm is a
dendrogram, DE, which is represented as a set of ordered triples <d,k,K>
where d is the threshold distance, k is the number of clusters, and K is the set
of clusters.
import java.util.*;
import java.io.*;
class Agglomerative
{
static void printAdjacency(char c[],int Ad[][],int n)
{
int i,j;
System.out.print("
");
for(i=0;i<n;i++)
{
System.out.print(c[i]+" ");
}
System.out.println();
for(i=0;i<n;i++)
{
System.out.print(c[i]+" ");
for(j=0;j<n;j++)
{
System.out.print(Ad[i][j]+" ");
}
System.out.println();
}
}
static boolean printClusters(int d,ArrayList clus[],int n)
{
int i; int count=1;
boolean stop=false;
for(i=0;i<n;i++)
{
if(!clus[i].isEmpty())
{
System.out.println("Cluster "+count+" has :
"+clus[i]);
count++;
if(clus[i].size()==n)
stop=true;
}
}
System.out.print("Dendrogram triple entry : <"+d+", "+
(count-1)+", {");
count=0;
for(i=0;i<n;i++)
{
if(!clus[i].isEmpty())
{
System.out.print(clus[i]+",");
}
}
System.out.println("}>");
if(stop)
return true;
else
return false;
//array of ArrayLists
}
}
}
stop=printClusters(d,K,n);
if(stop)
break;
}
}
catch(Exception e)
{
System.out.println("An Exception Occured "+e);
}
}
}
/* OUTPUT :
Enter the number of vertices
5
Enter the names of the vertices :
A
B
C
D
E
Enter the elements of the Adjacency matrix :
Elements of row 1 :
0
1
2
2
3
Elements of row 2 :
1
0
2
4
3
Elements of row 3 :
2
2
0
1
5
Elements of row 4 :
2
4
1
0
3
Elements of row 5 :
3
3
5
3
0
Adjacency Matrix :
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Cluster 1 has : [A]
*/