Data Science Lab Master Record

SETHU INSTITUTE OF TECHNOLOGY
(AN AUTONOMOUS INSTITUTION)

PULLOOR, KARIAPATTI - 626 115
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
MASTER RECORD
Data Science Laboratory
FACULTY INCHARGES
Mrs.S.Sangeetha,
Mr.B.Guruprakash,
Dr.Yesubai Rubavathi,
Mrs.M.Mathinakani,
Dr.V.Vivek,
Mrs.S.Meenakshi.
INSTITUTE VISION
 To promote excellence in technical education and scientific research for the benefit of the
society
INSTITUTE MISSION
 To provide quality technical education to fulfill the aspiration of the student and to meet
the needs of the Industry
 To provide holistic learning ambience
 To impart skills leading to employability and entrepreneurship
 To establish effective linkage with industries
 To promote Research and Development activities
 To offer services for the development of society through education and technology
CORE VALUES
 Quality
 Commitment
 Innovation
 Team work
 Courtesy
QUALITY POLICY
 To provide Quality technical education to the students.
 To produce competent professionals and contributing citizens.
 To contribute for the upliftment of the society.
DEPARTMENT VISION
 To produce globally innovative computer technocrats and promote research to contribute
to the society
DEPARTMENT MISSION
 Transforming the students into Technocrats in Computer Technology.
 Cultivating the skills essential for employability and higher education.
 Offering computer educational programs to fulfill the needs of the society.
 Establishing strong relationship with the industries.
 Promoting research activities among the students and the faculty.
 Encouraging students to do projects with innovative ideas to solve real world problems.
PROGRAMME EDUCATIONAL OBJECTIVES

 Demonstrate expertise in their career in industry/academia with a strong understanding
on the fundamental and advanced concepts of computer science and engineering.
 Develop effective solutions for vital problems with due considerations to socio-economic,
environmental and ethical factors.
 Pursue life-long learning and research with exemplary contributions to the industry and the
society at large.
PROGRAMME OUTCOMES
(a) Possess in-depth knowledge in computer science and engineering with a broader
perspective and enrich the same by continuous learning. (Scholarship of
Knowledge)
(b) Perform critical analysis of complex engineering problems and make judgements
based on strong theoretical and practical background to advance into further
research. (Critical Thinking)
(c) Conceptualize and build optimal engineering systems in specific areas of
expertise addressing public health, safety, environmental and social factors.
(Problem Solving)
(d) Comprehend open problems through extensive review of literature and
experimental analysis to formulate solutions. (Research Skill)
(e) Learn and use appropriate engineering tools for diverse engineering activities with
knowledge on their limitations. (Usage of modern tools)
(f) Contribute effectively as an individual and collaborate in multi-disciplinary
scientific research projects in achieving a common goal. (Collaborative and
Multidisciplinary work)
(g) Apply engineering and management principles as an individual and as a leader
while deploying projects with due considerations to economical and financial
factors. (Project Management and Finance)
(h) Communicate effectively on complex engineering activities with engineering and
research community and the society through demonstration of skills to
comprehend and write technical reports, deliver effective presentations, and give
and receive clear instructions. (Communication)
(i) Exhibit enthusiasm for life-long learning and readiness for independent learning
towards knowledge upgradation and career advancement. (Life-long Learning)
(j) Demonstrate integrity and ethical values in professional practices and research
activities and contribute to the sustainable development of the society. (Ethical
Practices and Social Responsibility)
(k) Observe the behaviour of computational systems in response to inputs under
various conditions, make corrective measures and develop understanding
from mistakes without relying on external opinion. (Independent and
Reflective Learning)
Estd. 1995
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GRAPHICS AND MULTIMEDIA LAB
LAB COURSES CONDUCTED
SEMESTER    COURSE NAME
ODD         15UCS707 - Data Science Laboratory
SYLLABUS : AS PER AUTONOMOUS CURRICULUM REGULATION 2015
15UCS707- DATA SCIENCE LABORATORY
OBJECTIVES:
 To familiarize students with the implementation of programs in R and Hadoop.
LIST OF EXPERIMENTS
Programs in R
1. Basic Data Analytic Methods using R
2. Preparing and training data based on K-means clustering analysis using R
3. Preparing and training data based on Decision Tree Classification analysis using R
4. Preparing and training data based on Naïve Bayes Classification analysis using R
Programs in Hadoop
5. Hadoop Distributed File System Commands
6. Hadoop Word Count Implementation using MapReduce
7. Implementation of Matrix Multiplication using MapReduce
TOTAL: 30 Periods
COURSE OUTCOME ASSESSMENT METHODS: MODEL EXAM, OBSERVATION, VIVA
& RECORD
COURSE OUTCOMES
After the successful completion of this course, the student will be able to
 Work with data analytic methods. (Apply)
 Implement clustering analysis in R. (Apply)
 Manipulate Naïve Bayes classification analysis using R. (Apply)
 Work with Hadoop commands. (Apply)
 Implement MapReduce techniques. (Apply)
HARDWARE AND SOFTWARE REQUIREMENTS
FOR A BATCH OF 30 STUDENTS
Standalone desktops with C compiler 30 Nos.

(or)
Server with C compiler supporting 30 terminals or more
15UCS707- DATA SCIENCE LABORATORY
LIST OF EXPERIMENTS
R PROGRAMMING AND HADOOP
1. Basic Data Analytic Methods using R

a) R Commands
b) Problem Statement:
The diamond dataset contains the prices and other attributes of almost 54,000
diamonds and is included in the ggplot2 package. Use the methods of descriptive
statistics and exploratory analysis, to visualize and analyze the data.
2. Preparing and training data based on K-means clustering analysis using R

Problem Statement:
In the Student Performance dataset, the task is to group 620 high school seniors
based on their grades in three subject areas: English, mathematics, and science. The
grades are averaged over their high school career and assume values from 0 to 100.
Analyze the student data set, using R to perform K-means clustering.
3. Preparing and training data based on Decision Tree Classification analysis using R
Problem Statement:
The bank marketing dataset includes 2,000 instances randomly drawn from the original
dataset, and each instance corresponds to a customer. To make the example simple, the
subset only keeps the following categorical variables: (1) job, (2) marital status, (3)
education level, (4) if the credit is in default, (5) if there is a housing loan, (6) if the
customer currently has a personal loan, (7) contact type, (8) result of the previous
marketing campaign contact (poutcome), and finally (9) if the client actually subscribed to
the term deposit. Attributes (1) through (8) are input variables, and (9) is considered the
outcome. The outcome subscribed is either yes (meaning the customer will subscribe to
the term deposit) or no (meaning the customer won’t subscribe). All the variables listed
earlier are categorical. Analyze the bank marketing dataset and construct a decision tree using
classification techniques.
4. Preparing and training data based on Naïve Bayes Classification analysis using R
Problem Statement:
The Iris data set is a dataset describing key characteristics of Iris flowers. Fisher’s Iris data
base is perhaps the best known database to be found in the pattern recognition literature.
The data set contains 3 classes of 50 instances each, where each class refers to a type of
iris plant. One class is linearly separable from the other two; the latter are not linearly
separable from each other. Analyze the Iris dataset, using Naïve Bayes classification
Techniques.
The data base contains the following attributes:
1). sepal length in cm
2). sepal width in cm
3). petal length in cm
4). petal width in cm
5). class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Programs in Hadoop
5. Hadoop Distributed File System Commands
6. Hadoop Word Count Implementation using MapReduce
7. Implementation of Matrix Multiplication using MapReduce
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
2018-2019 ODD SEMESTER
15UCS707 – Data Science Laboratory
Ex.No.1a Basic Data Analytic Methods using R
Aim:
To study various R commands and their purpose.
Description:
Technically R is an expression language with a very simple syntax. It is case sensitive, as
are most UNIX-based packages, so A and a are different symbols and would refer to different
variables. The set of symbols which can be used in R names depends on the operating system
and country within which R is being run (technically on the locale in use). Normally all
alphanumeric symbols are allowed (and in some countries this includes accented letters) plus
‘.’ and ‘_’, with the restriction that a name must start with ‘.’ or a letter, and if it starts with ‘.’ the
second character must not be a digit. Names are effectively unlimited in length.
Elementary commands consist of either expressions or assignments. If an expression is
given as a command, it is evaluated, printed (unless specifically made invisible), and the value
is lost. An assignment also evaluates an expression and passes the value to a variable but the
result is not automatically printed.
Commands are separated either by a semi-colon (‘;’), or by a newline. Elementary
commands can be grouped together into one compound expression by braces (‘{’ and
‘}’). Comments can be put almost anywhere: starting with a hashmark (‘#’), everything to the
end of the line is a comment. If a command is not complete at the end of a line, R will give a
different prompt, by default + on second and subsequent lines, and continue to read input until
the command is syntactically complete. This prompt may be changed by the user. Command
lines entered at the console are limited to about 4095 bytes (not characters).
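The rules described above can be tried out with a few lines at the console. The following is only a minimal sketch; the variable names (A, a, x, y, z, total, .valid_name) are hypothetical and chosen for illustration.

```r
A <- 10               # assignment: evaluated and stored, but not printed
a <- 20               # a different variable, because names are case sensitive
A + a                 # expression: evaluated and printed, then the value is lost
x <- 1; y <- 2        # two commands on one line, separated by ';'
total <- {            # braces group commands into one compound expression
  z <- x + y
  z * 2               # the value of the last command is the value of the group
}
.valid_name <- TRUE   # a name may start with '.' if the next character is not a digit
total
```

Typing an incomplete line such as `total <- (x +` and pressing Enter shows the continuation prompt `+` until the command is completed.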
Result:
Thus the various commands in R have been executed successfully.
Ex.No.1b Basic Data Analytic Methods using R
Aim:
To analyze the diamond dataset using Exploratory and Descriptive Statistics Methods.
Problem Statement:
The diamond dataset contains the prices and other attributes of almost 54,000 diamonds and is
included in the ggplot2 package. Use the methods of descriptive statistics and exploratory
analysis, to visualize and analyze the data.
Content
price    price in US dollars ($326-$18,823)
carat    weight of the diamond (0.2-5.01)
cut      quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color    diamond colour, from J (worst) to D (best)
clarity  a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1,
         VVS2, VVS1, IF (best))
x        length in mm (0-10.74)
y        width in mm (0-58.9)
z        depth in mm (0-31.8)
depth    total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table    width of top of diamond relative to widest point (43-95)
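The depth formula can be checked by hand for a single diamond. The sketch below uses the x, y and z values of the first row as printed in the Output section of this experiment, and assumes the ratio is multiplied by 100 to express it as a percentage. Because x, y and z are rounded in the dataset, the computed value need not match the recorded depth exactly.

```r
# x, y, z (in mm) of the first diamond, copied from the printed output
x <- 3.95
y <- 3.98
z <- 2.43

# total depth percentage = z / mean(x, y) = 2 * z / (x + y), scaled to percent
depth_pct <- 100 * 2 * z / (x + y)
print(round(depth_pct, 1))
```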
Algorithm:
1. Import the data
2. Read data into R
3. Look at the data
4. Look at the summary of the data
5. Print the entire dataset to the screen
6. Apply visualization techniques to the data
7. Plot the graph
Code
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
data(diamonds)
View(diamonds)
nrow(diamonds)
ncol(diamonds)
levels(diamonds$color)
subset(diamonds, price == max(price))
subset(diamonds, price == min(price))
summary(diamonds$price)
mean(diamonds$price)
median(diamonds$price)
summary(diamonds)
ggplot(data=diamonds) + geom_histogram(binwidth=500, aes(x=price)) +
  ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$") + ylab("Frequency") + theme_minimal()
ggplot(diamonds, aes(factor(cut), price, fill=cut)) + geom_boxplot() +
  ggtitle("Diamond Price according to Cut") + xlab("Type of Cut") + ylab("Diamond Price U$") + coord_cartesian(ylim=c(0,7500))
ggplot(diamonds, aes(factor(color), (price/carat), fill=color)) + geom_boxplot() +
  ggtitle("Diamond Price per Carat according to Color") + xlab("Color") + ylab("Diamond Price per Carat U$")
ggplot(diamonds, aes(x = depth, y = price)) + geom_point(alpha = 1/20)
ggplot(diamonds,aes(x=carat,y=price))+ geom_point(color='blue',fill='blue')+
xlim(0,quantile(diamonds$carat,0.99))+ylim(0,quantile(diamonds$price,0.99))+
ggtitle('Diamond price vs. carat')
Output
## [1] 53940
## [1] 10
## [1] "D" "E" "F" "G" "H" "I" "J"
##       carat     cut color clarity depth table price   x    y    z
## 27750  2.29 Premium     I     VS2  60.8    60 18823 8.5 8.47 5.16
##   carat     cut color clarity depth table price    x    y    z
## 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     950    2400    3930    5320   18800
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800
##
mean(diamonds$price)
## [1] 3932.8
median(diamonds$price)
## [1] 2401
Result:
Thus the various visualization techniques in R have been executed successfully.

Ex.No.2 Preparing and training data based on K-means clustering analysis using R
Aim:
To analyze the student performance dataset by performing K-means clustering analysis.
Problem Statement:
In the Student Performance dataset, the task is to group 620 high school seniors based
on their grades in three subject areas: English, mathematics, and science. The grades are
averaged over their high school career and assume values from 0 to 100. Analyze the student
data set, using R to perform K-means clustering.
Algorithm:
1. Import data into the environment
2. Read data into R
3. The name of the file is marks.csv. This file contains grades and subject areas.
4. Look at the data
5. Print the entire dataset to the screen
6. Apply the K-means algorithm. The dataset is split into 3 clusters and the maximum number of iterations is 10.
7. Next, plot the clusters.
Code
The following R code loads the necessary R libraries and imports the CSV file containing the
grades.
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)
#import the student grades
grade_input = as.data.frame(read.csv("c:/data/grades_km_input.csv"))
R code formats the grades for processing. The data file contains four columns. The first
column holds a student identification (ID) number, and the other three columns are for
the grades in the three subject areas. Because the student ID is not used in the
clustering analysis, it is excluded from the k-means input matrix, kmdata.
kmdata_orig = as.matrix(grade_input[,c("Student","English","Math","Science")])
kmdata <- kmdata_orig[,2:4]
kmdata[1:10,]
Output
English Math Science

[1,] 99 96 97
[2,] 99 96 97
[3,] 98 97 97
[4,] 95 100 95
[5,] 95 96 96
[6,] 96 97 96
[7,] 100 96 97
[8,] 95 98 98
[9,] 98 96 96
[10,] 99 99 95
The following R code loops through several k-means analyses with the number of centroids, k,
varying from 1 to 15. For each k, the option nstart=25 specifies that the k-means algorithm will
be repeated 25 times, each starting with k random initial centroids. The corresponding
value of the within sum of squares (WSS) for each k-means analysis is stored in the wss vector.
wss <- numeric(15)

for (k in 1:15) wss[k] <- sum(kmeans(kmdata, centers=k, nstart=25)$withinss)
km = kmeans(kmdata,3, nstart=25)
km
K-means clustering with 3 clusters of sizes 158, 218, 244

Cluster means:
English Math Science

1 97.21519 93.37342 94.86076
2 73.22018 64.62844 65.84862
3 85.84426 79.68033 81.50820
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1111111111
[41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1111111111
[81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1111111111
[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3333333333
[161] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 3 1 3 3 3 3
3313333333
[201] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3333333333
[241] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3333333333
[281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3333333333
[321] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3333333333
[361] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 3 2 3 3 3
2222332222
[401] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2222222222
[441] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3
2222222232
[481] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2222222222
[521] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2222222222
[561] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2222222222
[601] 3 3 2 2 3 3 3 3 1 1 3 3 3 2 2 3 2 3 3 3
Within cluster sum of squares by cluster:

[1] 6692.589 34806.339 22984.131
(between_SS / total_SS = 76.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
c( wss[3] , sum(km$withinss) )
[1] 64483.06 64483.06
Using the basic R plot function, each WSS is plotted against the respective
number of centroids, 1 through 15.
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within Sum of Squares")
In determining the value of k, the data scientist should visualize the data and assigned
clusters. In the following code, the ggplot2 package is used to visualize the identified
student clusters and centroids.
#prepare the student data and clustering results for plotting
df = as.data.frame(kmdata_orig[,2:4])
df$cluster = factor(km$cluster)
centers = as.data.frame(km$centers)
#one scatter plot per pair of subjects; the large faint points mark the centroids
g1 = ggplot(data=df, aes(x=English, y=Math, color=cluster)) +
  geom_point() + theme(legend.position="right") +
  geom_point(data=centers,
             aes(x=English, y=Math, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)
g2 = ggplot(data=df, aes(x=English, y=Science, color=cluster)) +
  geom_point() +
  geom_point(data=centers,
             aes(x=English, y=Science, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)
g3 = ggplot(data=df, aes(x=Math, y=Science, color=cluster)) +
  geom_point() +
  geom_point(data=centers,
             aes(x=Math, y=Science, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)
#stack the three panels in one column with the legends hidden
grid.arrange(g1 + theme(legend.position="none"),
             g2 + theme(legend.position="none"),
             g3 + theme(legend.position="none"),
             top="High School Student Cluster Analysis",
             ncol=1)
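The grades file used above is not reproduced in this record, so the same workflow can be tried on synthetic data. The following is a minimal sketch under stated assumptions: the helper make_group, the group centres, the seed and the 60 synthetic students are all hypothetical and are not taken from the original dataset. It uses only base R.

```r
set.seed(42)   # arbitrary seed so the sketch is repeatable

# Hypothetical helper: n students whose grades scatter around the given centre
make_group <- function(n, centre) {
  g <- sapply(centre, function(m) rnorm(n, mean = m, sd = 3))
  pmin(pmax(g, 0), 100)   # keep grades within the 0 to 100 range
}

grades <- rbind(make_group(20, c(95, 93, 94)),
                make_group(20, c(85, 80, 82)),
                make_group(20, c(72, 65, 66)))
colnames(grades) <- c("English", "Math", "Science")

# Within sum of squares for k = 1..6, then the 3-cluster solution
wss <- sapply(1:6, function(k) kmeans(grades, centers = k, nstart = 25)$tot.withinss)
km  <- kmeans(grades, centers = 3, nstart = 25)

print(round(wss))
print(km$size)
print(round(km$centers, 1))
```

The WSS values typically drop sharply up to the true number of groups and flatten afterwards, which is the "elbow" used to choose k.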
> View(Marks)
> library(readxl)
Warning message:
package ‘readxl’ was built under R version 3.3.3
> Marks1 <- read_excel("C:/Users/Admin/Desktop/Marks1.xlsx")
> View(Marks1)
> data(Marks1)
Warning message:
In data(Marks1) : data set ‘Marks1’ not found
> str(Marks1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 12 obs. of 5 variables:
$ S.No : num 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Kapoor" "Lav" "Foo" "Xin" ...
$ English: num 56 34 90 100 89 90 76 56 45 78 ...
$ Maths : num 74 45 78 90 81 73 63 78 99 100 ...
$ Science: num 87 87 90 78 46 90 45 100 56 98 ...
> head(Marks1)
# A tibble: 6 x 5
S.No Name English Maths Science
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Kapoor 56 74 87
2 2 Lav 34 45 87
3 3 Foo 90 78 90
4 4 Xin 100 90 78
5 5 Eddv 89 81 46
6 6 Faran 90 73 90
> summary(Marks1)
S.No Name English Maths
Min. : 1.00 Length:12 Min. : 34.00 Min. : 45.00
1st Qu.: 3.75 Class :character 1st Qu.: 56.00 1st Qu.: 64.75
Median : 6.50 Mode :character Median : 78.00 Median : 76.00
Mean : 6.50 Mean : 73.25 Mean : 75.83
3rd Qu.: 9.25 3rd Qu.: 89.25 3rd Qu.: 83.25
Max. :12.00 Max. :100.00 Max. :100.00
Science
Min. : 45.00
1st Qu.: 71.00
Median : 82.50
Mean : 77.58
3rd Qu.: 90.00
Max. :100.00
> plot(Mark1)
Error in plot(Mark1) : object 'Mark1' not found
> plot(Marks1)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
> abc=Marks1
> abc1=abc
> View(abc1)
> abc1$Name<-NULL
> View(abc1)
> x=kmeans(abc1,5)
> table(abc$Name,x$cluster)
1 2 3 4 5
Ahi 1 0 0 0 0
Arun 0 0 0 0 1
Arvind 0 0 0 1 0
Duk 1 0 0 0 0
Eddv 0 0 1 0 0
Faran 1 0 0 0 0
Foo 1 0 0 0 0
Kapoor 0 0 0 0 1
Lav 0 0 0 0 1
Mashoo 0 0 1 0 0
Matew 0 1 0 0 0
Xin 1 0 0 0 0
> plot(abc[c("English","Maths")],col=x$cluster)
> plot(abc[c("English","Maths","Science")],col=x$cluster)
> plot(abc[c("S.No","English","Maths","Science")],col=x$cluster)
> abc1
# A tibble: 12 x 4
S.No English Maths Science
<dbl> <dbl> <dbl> <dbl>
1 1 56 74 87
2 2 34 45 87
3 3 90 78 90
4 4 100 90 78
5 5 89 81 46
6 6 90 73 90
7 7 76 63 45
8 8 56 78 100
9 9 45 99 56
10 10 78 100 98
11 11 78 64 78
12 12 87 65 76
> x
K-means clustering with 5 clusters of sizes 5, 1, 2, 1, 3
Cluster means:
S.No English Maths Science
1 7.200000 89.00000 74.00000 82.40000
2 10.000000 78.00000 100.00000 98.00000
3 6.000000 82.50000 72.00000 45.50000
4 9.000000 45.00000 99.00000 56.00000
5 3.666667 48.66667 65.66667 91.33333
Clustering vector:
[1] 5 5 1 1 3 1 3 5 4 2 1 1
Within cluster sum of squares by cluster:

[1] 964.000 0.000 249.000 0.000 1112.667
(between_SS / total_SS = 79.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"

[6] "betweenss" "size" "iter" "ifault"
>
Result:
Thus the clustering techniques in R have been executed successfully.
Ex.No.3 Preparing and training data based on Decision Tree Classification Analysis using R
Aim:
To analyze the iris dataset and construct a decision tree using classification techniques.
Problem Statement:
The Iris data set is a dataset describing key characteristics of Iris flowers. Fisher’s Iris
data base is perhaps the best known database to be found in the pattern recognition
literature. The data set contains 3 classes of 50 instances each, where each class
refers to a type of iris plant. One class is linearly separable from the other two; the latter
are not linearly separable from each other. Analyze the Iris dataset and construct a decision
tree using classification techniques.
The data base contains the following attributes:
1). sepal length in cm
2). sepal width in cm
3). petal length in cm
4). petal width in cm
5). class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Algorithm:
1. Import the data

2. Clean the dataset
3. Create train/test set
4. Build the model
5. Make prediction
6. Measure performance
7. Tune the hyper-parameters
Code
> install.packages("rpart")
> install.packages("rpart.plot")
> require("rpart")
> require("rpart.plot")
> data(iris)
> str(iris)
> table(iris$Species)
> head(iris)
> set.seed(9850)
> q<-runif(nrow(iris))
> irisr<-iris[order(q),]
> str(irisr)
> m3<-rpart(Species~.,data=irisr[1:100,],method="class")
> m3
> rpart.plot(m3)
> rpart.plot(m3,type=3,extra=101,fallen.leaves = T)
> summary(m3)
> p3<-predict(m3,irisr[101:150,],type="class")
> table(irisr[101:150,5],predicted=p3)
Output
Installing package into ‘C:/Users/Admin/Documents/R/win-library/3.3’

(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/rpart_4.1-13.zip'
Content type 'application/zip' length 925260 bytes (903 KB)
downloaded 903 KB
package ‘rpart’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in

C:\Users\Admin\AppData\Local\Temp\RtmpWeWgPf\downloaded_packages
Installing package into ‘C:/Users/Admin/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
There is a binary version available but the source version is later:

binary source needs_compilation
rpart.plot 2.1.2 3.0.3 FALSE
installing the source package ‘rpart.plot’
trying URL 'https://cran.rstudio.com/src/contrib/rpart.plot_3.0.3.tar.gz'

Content type 'application/x-gzip' length 689937 bytes (673 KB)
downloaded 673 KB
* installing *source* package 'rpart.plot' ...

** package 'rpart.plot' successfully unpacked and MD5 sums checked
** R
** data
** inst
** preparing package for lazy loading
Warning: package 'rpart' was built under R version 3.3.3
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (rpart.plot)
The downloaded source packages are in

C:\Users\Admin\AppData\Local\Temp\RtmpWeWgPf\downloaded_packages’
Loading required package: rpart
Warning message:
package ‘rpart’ was built under R version 3.3.3
Loading required package: rpart.plot
'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
setosa versicolor virginica

50 50 50
Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
n= 100
node), split, n, loss, yval, (yprob)

* denotes terminal node
1) root 100 65 versicolor (0.34000000 0.35000000 0.31000000)

2) Petal.Length< 2.6 34 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.6 66 31 versicolor (0.00000000 0.53030303 0.46969697)
6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *
7) Petal.Width>=1.65 29 0 virginica (0.00000000 0.00000000 1.00000000) *
Warning message:
Bad 'data' field in model 'call'.
To silence this warning:
Call rpart.plot with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
Warning message:
Bad 'data' field in model 'call'.
To silence this warning:
Call rpart.plot with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
Call:
rpart(formula = Species ~ ., data = irisr[1:100, ], method = "class")
n= 100
CP nsplit rel error xerror xstd

1 0.5230769 0 1.00000000 1.20000000 0.06373020
2 0.4461538 1 0.47692308 0.81538462 0.07678449
3 0.0100000 2 0.03076923 0.06153846 0.03014757
Variable importance
Petal.Width Petal.Length Sepal.Length Sepal.Width
34 31 21 14
Node number 1: 100 observations, complexity param=0.5230769

predicted class=versicolor expected loss=0.65 P(node) =1
class counts: 34 35 31
probabilities: 0.340 0.350 0.310
left son=2 (34 obs) right son=3 (66 obs)
Primary splits:
Petal.Length < 2.6 to the left, improve=33.70121, (0 missing)
Petal.Width < 0.8 to the left, improve=33.70121, (0 missing)
Sepal.Length < 5.45 to the left, improve=20.39105, (0 missing)
Sepal.Width < 3.05 to the right, improve=14.88870, (0 missing)
Surrogate splits:
Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split)
Sepal.Length < 5.35 to the left, agree=0.90, adj=0.706, (0 split)
Sepal.Width < 3.35 to the right, agree=0.83, adj=0.500, (0 split)
Node number 2: 34 observations

predicted class=setosa expected loss=0 P(node) =0.34
Node number 3: 66 observations, complexity param=0.4461538

predicted class=versicolor expected loss=0.469697 P(node) =0.66
left son=6 (37 obs) right son=7 (29 obs)
Primary splits:
Petal.Width < 1.65 to the left, improve=29.09500, (0 missing)
Petal.Length < 4.75 to the left, improve=27.18761, (0 missing)
Sepal.Length < 6.65 to the left, improve=11.15657, (0 missing)
Sepal.Width < 2.95 to the left, improve= 4.24345, (0 missing)
Surrogate splits:
Petal.Length < 4.75 to the left, agree=0.924, adj=0.828, (0 split)
Sepal.Length < 6.35 to the left, agree=0.803, adj=0.552, (0 split)
Sepal.Width < 2.95 to the left, agree=0.712, adj=0.345, (0 split)

Node number 6: 37 observations
predicted class=versicolor expected loss=0.05405405 P(node) =0.37
Node number 7: 29 observations
predicted class=virginica expected loss=0 P(node) =0.29
predicted
setosa versicolor virginica
setosa 16 0 0
versicolor 0 13 2
virginica 0 2 17
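The "Measure performance" step can be made concrete from the confusion matrix printed above. The sketch below re-enters those counts by hand (rows are actual classes, columns are predicted classes) and derives accuracy, recall and precision. The object names cm, accuracy, recall and precision are chosen here for illustration.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(16,  0,  0,
                0, 13,  2,
                0,  2, 17),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = species, predicted = species))

accuracy  <- sum(diag(cm)) / sum(cm)   # share of test rows classified correctly
recall    <- diag(cm) / rowSums(cm)    # per class: correct / actual members
precision <- diag(cm) / colSums(cm)    # per class: correct / predicted members

print(accuracy)
print(round(rbind(recall, precision), 3))
```

With 46 of the 50 held-out flowers on the diagonal, the accuracy is 0.92. All four errors are confusions between versicolor and virginica, the two classes that the problem statement describes as not linearly separable.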
Result:
Thus the decision tree using classification techniques in R has been executed successfully.
Ex.No.4 Preparing and training data based on Naïve Bayes Classification Analysis using R
Aim:
To analyze the bank marketing dataset using Naïve Bayes classification
techniques.
Problem Statement:
The bank marketing dataset includes 2,000 instances randomly drawn from the original dataset,
and each instance corresponds to a customer. To make the example simple, the subset only
keeps the following categorical variables: (1) job, (2) marital status, (3) education level, (4) if the
credit is in default, (5) if there is a housing loan, (6) if the customer currently has a personal
loan, (7) contact type, (8) result of the previous marketing campaign contact (poutcome), and
finally (9) if the client actually subscribed to the term deposit. Attributes (1) through (8) are input
variables, and (9) is considered the outcome. The outcome subscribed is either yes (meaning
the customer will subscribe to the term deposit) or no (meaning the customer won’t subscribe).
All the variables listed earlier are categorical. Analyze the bank marketing dataset using Naïve
Bayes classification techniques.
Algorithm:
1. Read data into R
2. Look at the data
3. Print the entire dataset to the screen
4. Construct the Naïve Bayes classifier
5. Make some predictions on the training data such as finding class distribution of the classes
Code
In R, first set up the working directory and initialize the packages.
setwd("c:/")
install.packages("e1071") # install package e1071
library(e1071) # load the library
library(ROCR)
# training set
banktrain <- read.table("bank-sample.csv",header=TRUE,sep=",")
# drop a few columns
drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain [,!(names(banktrain) %in% drops)]
# testing set
banktest <- read.table("bank-sample-test.csv",header=TRUE,sep=",")
banktest <- banktest [,!(names(banktest) %in% drops)]
# build the naïve Bayes classifier
nb_model <- naiveBayes(subscribed~.,
                       data=banktrain)
# perform on the testing set
nb_prediction <- predict(nb_model,
                         # remove column "subscribed"
                         banktest[,-ncol(banktest)],
                         type='raw')
score <- nb_prediction[, c("yes")]
actual_class <- banktest$subscribed == 'yes'
pred <- prediction(score, actual_class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, lwd=2, xlab="False Positive Rate (FPR)",
     ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
The following R code shows that the corresponding AUC score of the ROC curve is
about 0.915.
auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc
auc
[1] 0.9152196
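To see what the classifier computes internally, the naïve Bayes rule can be worked by hand on a toy table. This is only an illustrative sketch: the six rows below are hypothetical and are not taken from bank-sample.csv, the helper nb_posterior is not part of e1071, and no Laplace smoothing is applied.

```r
# Toy training data (hypothetical values, two categorical inputs and one outcome)
train <- data.frame(
  job        = c("admin", "admin", "technician", "technician", "admin", "technician"),
  housing    = c("yes", "no", "yes", "no", "no", "no"),
  subscribed = c("no", "yes", "no", "yes", "yes", "no"),
  stringsAsFactors = TRUE
)

# Posterior P(class | inputs), assuming the inputs are independent given the class
nb_posterior <- function(train, newrow, outcome = "subscribed") {
  classes <- levels(train[[outcome]])
  inputs  <- setdiff(names(train), outcome)
  score <- sapply(classes, function(cl) {
    rows  <- train[train[[outcome]] == cl, ]
    prior <- nrow(rows) / nrow(train)                        # P(class)
    cond  <- sapply(inputs, function(v) mean(rows[[v]] == newrow[[v]]))  # P(input | class)
    prior * prod(cond)
  })
  score / sum(score)   # normalise so the posteriors sum to 1
}

post <- nb_posterior(train, list(job = "admin", housing = "no"))
print(round(post, 3))
```

For this customer the unnormalised scores are 1/2 × 2/3 × 1 = 1/3 for yes and 1/2 × 1/3 × 1/3 = 1/18 for no, so the posterior for yes is 6/7.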
Result:
Thus the naïve Bayes classification techniques in R have been executed successfully.
Ex.No.5 Hadoop Distributed File System Commands
Aim:
To study various Hadoop Distributed File System commands.
Algorithm:
1. Append single src, or multiple srcs from local file system to the destination file system
using appendToFile
2. Copies source paths to stdout using cat
3. Changes the group association of files using chgrp
4. Changes the permissions of files using chmod
5. Changes the owner of files with chown
6. Copy single src, or multiple srcs from local file system to the destination file system
where the source is restricted to a local file reference using copyFromLocal
7. Copy files to the local file system where the destination is restricted to a local file
reference using copyToLocal
8. Count the number of directories, files and bytes under the paths that match the specified
file pattern with count
9. Copy files from source to destination with cp command
10. Displays sizes of files and directories contained in the given directory or the length of a
file in case its just a file using du command
11. Displays a summary of file lengths using dus
12. Empty the Trash with expunge
13. Copy files to the local file system with get
14. Displays the Access Control Lists (ACLs) of files and directories using getfacl
15. Displays the extended attribute names and values (if any) for a file or directory with
getfattr
16. Takes a source directory and a destination file as input and concatenates files in src into
the destination local file with getmerge
17. returns list of its direct children as in Unix using ls
18. Recursive version of ls with lsr
19. Takes path uri's as argument and creates directories using mkdir
20. Move file from local using moveFromLocal
21. Displays a "Not implemented yet" message using moveToLocal
22. Moves files from source to destination using mv
Commands
Result:
Thus the various HDFS commands have been executed successfully.
Ex.No.6 Hadoop Word Count Implementation using MapReduce
Aim:
To write a Hadoop word count program that counts the occurrences of each word in a file
using MapReduce.
Algorithm:
1. Check Points
2. Writing mapper & reducer in R
3. Checking mapper file
4. Checking / Running on command line with separate mappers and reducers
5. Copy any text file to HDFS
6. Running MapReduce scripts on Hadoop
7. View WordCount output
8. Copy output to local filesystem
9. You can view wcOutputfile in any editor
Input Data :
Welcome to Hadoop session
Introduction to Hadoop
Introduction to Hive
Hive Session
Pig Session
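The map and reduce roles can be illustrated without Hadoop. The following is a minimal in-memory sketch in C, not the R mapper and reducer that the exercise runs on the cluster: the "map" part splits the text into words, and the "reduce" part adds one to the count kept for each word. The names WORDCOUNT, add_word and word_count, and the limits MAX_WORDS, MAX_LEN and the 255-character input buffer, are assumptions made for this illustration.

```c
#include <string.h>

#define MAX_WORDS 64   /* assumed upper bound on distinct words */
#define MAX_LEN   32   /* assumed upper bound on word length */

typedef struct {
    char word[MAX_LEN];
    int count;
} WORDCOUNT;

/* "Reduce" role: add one occurrence of word to the table.
 * Returns the new number of distinct words. */
static int add_word(WORDCOUNT table[], int n, const char *word)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(table[i].word, word) == 0) {
            table[i].count++;
            return n;
        }
    }
    if (n == MAX_WORDS)
        return n;                       /* table full: word is dropped */
    strncpy(table[n].word, word, MAX_LEN - 1);
    table[n].word[MAX_LEN - 1] = '\0';
    table[n].count = 1;
    return n + 1;
}

/* "Map" role: split the text on whitespace and pass each word on.
 * Only the first 255 characters of text are examined. */
int word_count(const char *text, WORDCOUNT table[])
{
    char buf[256];
    int n = 0;

    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        n = add_word(table, n, w);
    return n;
}
```

Calling word_count on the input data above with a WORDCOUNT table of MAX_WORDS entries fills the table in order of first appearance; words are compared case-sensitively, so "Session" and "session" are counted separately.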
Result:
Thus the Hadoop word count program using MapReduce has been executed successfully.
Ex.No.7 Implementation of Matrix Multiplication using MapReduce
Aim:
To write a Matrix Multiplication program using Map Reduce.
Algorithm:
1. Download the Hadoop jar files from these links.
Download Hadoop Common Jar file: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop MapReduce Jar file: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar
2. Create the Mapper file for matrix multiplication.
package www.ehadoopinfo.com;
3. Create the Reducer.java file for matrix multiplication.
package www.ehadoopinfo.com;
4. Create the MatrixMultiply.java file.
package www.ehadoopinfo.com;
5. Compile the program in a folder named operation.
6. List the directory contents after compilation.
7. Create the jar file for the matrix multiplication program.
8. Upload the M and N files, which contain the matrix data, to HDFS.
9. Execute the jar file with the hadoop command; the job reads its records from HDFS and
stores its output in HDFS.
10. Read the output from part-r-00000, which is generated when the hadoop command
runs.
Code
import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class TwoStepMatrixMultiplication {
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
outputKey.set(indicesAndValue[2]);
outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
} else {
outputKey.set(indicesAndValue[1]);
outputValue.set("B," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
String[] value;
ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();
ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
for (Text val : values) {

value = val.toString().split(",");
if (value[0].equals("A")) {
listA.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),

Float.parseFloat(value[2])));
} else {
listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),

Float.parseFloat(value[2])));
}
}
String i;
float a_ij;
String k;
float b_jk;
Text outputValue = new Text();
for (Entry<Integer, Float> a : listA) {
i = Integer.toString(a.getKey());
a_ij = a.getValue();
for (Entry<Integer, Float> b : listB) {
k = Integer.toString(b.getKey());
b_jk = b.getValue();
outputValue.set(i + "," + k + "," + Float.toString(a_ij*b_jk));
context.write(null, outputValue);
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "MatrixMatrixMultiplicationTwoSteps");
job.setJarByClass(TwoStepMatrixMultiplication.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/matrixin"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/matrixout"));
job.waitForCompletion(true);
}
}
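The job above is only the first of the two steps: for each shared index j, the reducer pairs every a_ij with every b_jk and emits the partial product for cell (i,k); a second step then has to sum the partial products per cell. A minimal in-memory sketch of the same idea in C follows, assuming small square matrices. The names MM_N and multiply_by_join are hypothetical and are not part of the Hadoop program.

```c
#define MM_N 2   /* assumed matrix dimension for this illustration */

/* The outer loop over j plays the part of the reducer key: for one j,
 * every a[i][j] is paired with every b[j][k]. Adding the product into
 * out[i][k] stands in for the second step, which sums partial products
 * per output cell (i,k). */
void multiply_by_join(float a[MM_N][MM_N], float b[MM_N][MM_N],
                      float out[MM_N][MM_N])
{
    for (int i = 0; i < MM_N; i++)
        for (int k = 0; k < MM_N; k++)
            out[i][k] = 0.0f;

    for (int j = 0; j < MM_N; j++)
        for (int i = 0; i < MM_N; i++)
            for (int k = 0; k < MM_N; k++)
                out[i][k] += a[i][j] * b[j][k];
}
```

Looping over j first is a deliberate choice: it mirrors how the mapper keys A entries by their column index and B entries by their row index, so that matching entries meet in the same reducer call.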
Result:
Thus the matrix multiplication program using Map Reduce has been executed successfully.
Algorithm:
1. Start the program.
2. Declare an integer array and read the values from the user.
3. Define a function rev( ) which accepts one pointer value (pointing to an element in the
array) and performs the rearrangement of elements inside the function.
4. Display the reversed array.
5. Stop the program.
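Step 3 describes a function rev( ), while the program below reverses the array inline in main. One possible shape for such a function is sketched here. The element-count parameter n is an assumption added for this sketch, since a single pointer does not carry the array length.

```c
#include <stddef.h>

/* Reverse n integers in place. Two pointers start at the two ends of the
 * array and swap elements while walking toward each other. */
void rev(int *first, size_t n)
{
    if (n == 0)
        return;
    int *last = first + n - 1;
    while (first < last) {
        int temp = *first;
        *first = *last;
        *last = temp;
        first++;
        last--;
    }
}
```

With the variables of the program below, the call would be rev(num, n) after the numbers have been read.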
Program:
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i, j, n, *num, temp;
printf("Enter how many numbers:");
scanf("%d", &n);
num = (int *)malloc(sizeof(int) * n);
printf("Enter %d numbers:\n", n);
for (i = 0; i < n; i++)
{
scanf("%d",(num + i));
}
printf("\nThe array elements are:\n");

for (i = 0; i < n; i++)
{
printf("%d ", *(num + i));
}
i = 0, j = n - 1;
while (i < j)
{
temp = *(num + i);
*(num + i) = *(num + j);
*(num + j) = temp;
j--, i++;
}
printf("\nReversed array elements are:\n");
for (i = 0; i < n; i++)
{
printf("%d ", *(num + i));
}
printf("\n");
free(num);
return 0;
}
Sample Input and Output:
Enter how many numbers:5
Enter 5 numbers: 10 30 25 65 39
The array elements are: 10 30 25 65 39
The reversed array elements are: 39 65 25 30 10.
Result:
Thus the program to read and reverse an array using pointers has been executed successfully.
Ex.No.6 Programs using Files – File Manipulation
Aim:
To create a file for maintaining employee details and to perform file manipulations (file
read, write, append and copy).
Algorithm:
1. Create a file named "salary.dat" which maintains employee name and salary details for
10 employees.
2. Read "salary.dat" using a file pointer 'fp1' and find the employees getting the highest and lowest
salary. Write this information to another file named "salary1.dat" using file pointer 'fp2'.
3. Append two more employee details to "salary.dat".
4. Copy the contents of "salary2.dat" into "copysalary.dat" using file pointer 'fp4'.
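The copy step in the algorithm is not covered by the program that follows, which stops after the append. A minimal sketch of a file copy in C is given here. The function name copy_file and its return convention are assumptions for illustration, and the file names are passed in by the caller.

```c
#include <stdio.h>

/* Copy the file named src to the file named dst, one character at a time.
 * Returns 0 on success and -1 if either file cannot be opened or the
 * destination cannot be closed cleanly. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "r");
    if (in == NULL)
        return -1;

    FILE *out = fopen(dst, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    int ch;
    while ((ch = fgetc(in)) != EOF)
        fputc(ch, out);

    fclose(in);
    return fclose(out) == 0 ? 0 : -1;
}
```

A call such as copy_file("salary2.dat", "copysalary.dat") would perform the copy named in the algorithm, provided the source file exists.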
Program:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
void main()
{
    char name[20];
    float salary, high, low;
    int i;
    FILE *fptr1, *fptr2;
    clrscr();
    /* Write the details of 10 employees, one "name salary" pair per line. */
    fptr1 = fopen("C:\\salary.dat", "w");
    if (fptr1 == NULL)
    {
        printf("Error!");
        exit(1);
    }
    for (i = 0; i < 10; i++)
    {
        printf("For Employee %d\nEnter name: ", i + 1);
        scanf("%19s", name);
        printf("Enter Salary: ");
        scanf("%f", &salary);
        fprintf(fptr1, "%s %.2f\n", name, salary);
    }
    fclose(fptr1);
    /* Read the file back and find the highest and lowest salary. */
    fptr1 = fopen("C:\\salary.dat", "r");
    if (fptr1 == NULL)
    {
        printf("Error!");
        exit(1);
    }
    fptr2 = fopen("C:\\salary1.dat", "w");
    if (fptr2 == NULL)
    {
        printf("Error!");
        fclose(fptr1);
        exit(1);
    }
    fscanf(fptr1, "%19s %f", name, &salary);
    low = high = salary;
    for (i = 1; i < 10; i++)
    {
        fscanf(fptr1, "%19s %f", name, &salary);
        if (salary > high)
            high = salary;
        if (salary < low)
            low = salary;
    }
    printf("High Salary :%.2f\n", high);
    printf("Low Salary :%.2f\n", low);
    fprintf(fptr2, "\nHighest Salary:%.2f\nLowest Salary:%.2f\n", high, low);
    fclose(fptr2);
    fclose(fptr1);
    /* Append the details of two more employees. */
    fptr1 = fopen("C:\\salary.dat", "a");
    if (fptr1 == NULL)
    {
        printf("Error!");
        exit(1);
    }
    for (i = 0; i < 2; i++)
    {
        printf("For Employee %d\nEnter name: ", i + 1);
        scanf("%19s", name);
        printf("Enter Salary: ");
        scanf("%f", &salary);
        fprintf(fptr1, "%s %.2f\n", name, salary);
    }
    fclose(fptr1);
    getch();
}
Sample Input and Output:

>> edit salary.dat
aaa 21300
bbb 13546
ccc 74545
ddd 9898
eee 28767
fff 12356
ggg 15321
hhh 27643
iii 32451
jjj 29180
>> type salary1.dat
Highest salary : ccc 74545
Lowest salary : ddd 9898
Enter details of 2 more employees

kkk 45362
lll 17632
Details appended
>>type salary2.dat
aaa 21300
bbb 13546
ccc 74545
ddd 9898
eee 28767
fff 12356
ggg 15321
hhh 27643
iii 32451
jjj 29180
kkk 45362
lll 17632
>>type copysalary.dat
aaa 21300
bbb 13546
ccc 74545
ddd 9898
eee 28767
fff 12356
ggg 15321
hhh 27643
iii 32451
jjj 29180
kkk 45362
lll 17632
Result:
Thus the program to illustrate various file operations (read, write, append, copy) has been
executed successfully.
Ex. No: 7 Implementation of Singly Linked List
Aim:
To create a singly Linked list and to perform insertion and deletion operations.
Algorithm:

1. Create a node with two fields, data and link field.
a. Allocate space for the node dynamically.
b. Create links between the created nodes and let the last node have a NULL link.
c. Insert the input data in the data field; enter -1 to stop.
2. Get the choice of operation - either insertion or deletion.
3. For insertion, get the position at which insertion is to be done and the element to be
inserted. Check for the start, middle or end position of insertion. Insert the node and
change its link accordingly.
4. For deletion, get the position at which deletion is to be done. Before deleting, check
whether the list contains any data. Link the preceding node to the node that follows the
deleted one, then delete the node.
5. Using the display option, list the elements of the list.
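The algorithm speaks of inserting and deleting by position, with separate handling for the start, middle and end of the list. A minimal sketch of both operations in C follows. The type NODE and the functions insert_at and delete_at are hypothetical names for this illustration, positions are assumed to be 1-based, and each function returns the head so the caller can handle a change at the start of the list.

```c
#include <stdlib.h>

typedef struct node {
    int num;
    struct node *next;
} NODE;

/* Insert value so that it becomes the node at 1-based position pos.
 * The list is left unchanged if pos is out of range or malloc fails. */
NODE *insert_at(NODE *head, int pos, int value)
{
    if (pos < 1)
        return head;
    NODE *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->num = value;
    if (pos == 1) {                 /* start of the list */
        n->next = head;
        return n;
    }
    NODE *prev = head;              /* walk to the node before pos */
    for (int i = 1; prev != NULL && i < pos - 1; i++)
        prev = prev->next;
    if (prev == NULL) {             /* pos is beyond the end */
        free(n);
        return head;
    }
    n->next = prev->next;           /* middle or end of the list */
    prev->next = n;
    return head;
}

/* Delete the node at 1-based position pos, if there is one. */
NODE *delete_at(NODE *head, int pos)
{
    if (head == NULL || pos < 1)
        return head;
    if (pos == 1) {
        NODE *rest = head->next;
        free(head);
        return rest;
    }
    NODE *prev = head;
    for (int i = 1; prev->next != NULL && i < pos - 1; i++)
        prev = prev->next;
    NODE *victim = prev->next;
    if (victim == NULL)             /* pos is beyond the end */
        return head;
    prev->next = victim->next;
    free(victim);
    return head;
}
```

A typical call is head = insert_at(head, 2, 20); followed later by head = delete_at(head, 2);. Note that the program below takes a different approach for insertion: it inserts after a given element value instead of at a position.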

Program:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
typedef struct list
{
    int num;
    struct list *next;
} LIST;
LIST *head, *node, *temp;
void create();
void insert();
void del();
void display();
void main()
{
    int ch;
clrscr();
head=NULL;
while(1)
{
printf("enter the choice \n");
printf(" 1:create list \n 2:insert an element to list \n 3:delete an element \n 4.display \n 5.exit \n");
scanf("%d",&ch);
switch(ch)
{
case 1:
create();
break;
case 2:
insert();
break;
case 3:
del();
break;
case 4:
display();
break;
case 5:
exit(0);
break;
default:
printf("wrong choice");
break;
}}}
void create()
{
    int k;
    LIST *s1, *p;
    /* p tracks the last node so that new nodes are appended at the end */
    p = head;
    while (p != NULL && p->next != NULL)
        p = p->next;
    printf("\n enter the element (else enter -1 to stop):");
    scanf("%d", &k);
    while (k != -1)
    {
        s1 = (LIST *)malloc(sizeof(LIST));
        s1->num = k;
        s1->next = NULL;
        if (head == NULL)
            head = s1;
        else
            p->next = s1;
        p = s1;
        printf("\n enter the element (else enter -1 to stop):");
        scanf("%d", &k);
    }
}
void display()
{
temp=head;
printf("\n List elements are \n");
while(temp!=NULL)
{
printf("\n %d--->", temp->num);
temp = temp->next;
}
getch();
}
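void del()
{
    int i, pos;
    LIST *victim = NULL;
    printf("Enter the position to delete \n");
    scanf("%d", &pos);
    if (head == NULL || pos < 1)
    {
        printf("nothing to delete");
    }
    else if (pos == 1)
    {
        victim = head;
        head = head->next;
    }
    else
    {
        /* walk to the node just before position pos */
        temp = head;
        for (i = 1; i < pos - 1 && temp->next != NULL; i++)
            temp = temp->next;
        victim = temp->next;
        if (victim != NULL)
            temp->next = victim->next;
        else
            printf("position not found");
    }
    if (victim != NULL)
    {
        free(victim);
        printf("node deleted");
    }
    getch();
}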
void insert()
{
    LIST *p;
    int y;
    printf("Enter the element after which new element to be inserted \n");
    scanf("%d", &y);
    printf("Enter the node value to insert \n");
    p = (LIST *)malloc(sizeof(LIST));
    scanf("%d", &p->num);
    for (temp = head; temp != NULL; temp = temp->next)
    {
        if (temp->num == y)
        {
            p->next = temp->next;
            temp->next = p;
            printf("value inserted \n");
            return;
        }
    }
    /* element y is not in the list, so the new node is not needed */
    free(p);
    printf("element not found \n");
}
Sample Input and Output

1:create list
2:insert an element to list
3:delete an element
4.display
5.exit
Enter choice 1
Enter the element -1 to stop
10
20
30
-1
Enter choice 3
Enter the position to be deleted 1
Enter choice 4
List elements are
20
30
Enter choice 5
RESULT:
Thus the program to perform operations of singly linked list was successfully executed.
Ex. No: 8 Implementation of Doubly Linked List
Aim:
To create a Doubly Linked list and to perform insertion and deletion operations.
Algorithm:

1. Create a node with three fields: data, flink and blink.
a. Allocate space for the node dynamically.
b. Create links between the created nodes and let the last node have a NULL flink
and the first node have a NULL blink.
c. Insert the input data in the data field; enter -1 to stop.
2. Get the choice of operation - either insertion or deletion.
3. For insertion, get the position at which insertion is to be done and the element to be
inserted. Check for the start, middle or end position of insertion. Insert the node and
change its links accordingly.
4. For deletion, get the position at which deletion is to be done. Before deleting, check
whether the list contains any data. Link the neighbours of the deleted node to each
other, then delete the node.
5. Using the display option, list the elements of the list.
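Deletion in a doubly linked list has to repair two links, and either neighbour may be missing when the first or last node is removed. The following C sketch shows one way to handle all three cases with the same code. The names DNODE, dl_append and dl_delete_at are hypothetical, and positions are assumed to be 1-based.

```c
#include <stdlib.h>

typedef struct dnode {
    int num;
    struct dnode *prev;   /* blink */
    struct dnode *next;   /* flink */
} DNODE;

/* Append value at the end of the list and return the head. */
DNODE *dl_append(DNODE *head, int value)
{
    DNODE *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->num = value;
    n->prev = NULL;
    n->next = NULL;
    if (head == NULL)
        return n;
    DNODE *tail = head;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = n;
    n->prev = tail;
    return head;
}

/* Delete the node at 1-based position pos and return the head.
 * The list is left unchanged if pos is out of range. */
DNODE *dl_delete_at(DNODE *head, int pos)
{
    DNODE *cur = head;
    for (int i = 1; cur != NULL && i < pos; i++)
        cur = cur->next;
    if (cur == NULL || pos < 1)
        return head;
    if (cur->prev != NULL)
        cur->prev->next = cur->next;   /* middle or last node */
    else
        head = cur->next;              /* first node */
    if (cur->next != NULL)
        cur->next->prev = cur->prev;   /* first or middle node */
    free(cur);
    return head;
}
```

Checking each neighbour for NULL before touching it is what keeps deletion of the first or last node from dereferencing a missing link.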
Program:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
typedef struct list
{
    int num;
    struct list *prev;
    struct list *next;
} LIST;
LIST *head, *node, *temp;
void create();
void insert();
void del();
void display();
void main()
{
    int ch;
clrscr();
head=NULL;
while(1)
{
printf("enter the choice \n");
printf(" 1:create list \n 2:insert an element to list \n 3:delete an element \n 4.display \n 5.exit \n");
scanf("%d",&ch);
switch(ch)
{
case 1:
create();
break;
case 2:
insert();
break;
case 3:
del();
break;
case 4:
display();
break;
case 5:
exit(0);
break;
default:
printf("wrong choice");
break;
}
}
}
void create()
{
    int k;
    LIST *s1, *p;
    /* p tracks the last node so that new nodes are appended at the end */
    p = head;
    while (p != NULL && p->next != NULL)
        p = p->next;
    printf("\n enter the element (else enter -1 to stop):");
    scanf("%d", &k);
    while (k != -1)
    {
        s1 = (LIST *)malloc(sizeof(LIST));
        s1->num = k;
        s1->next = NULL;
        s1->prev = NULL;
        if (head == NULL)
            head = s1;
        else
        {
            p->next = s1;
            s1->prev = p;
        }
        p = s1;
        printf("\n enter the element (else enter -1 to stop):");
        scanf("%d", &k);
    }
}
void display()
{
temp=head;
printf("\n List elements are \n\n");
while(temp!=NULL)
{
printf("%d--->",temp->num);
temp=temp->next;
}
getch();
}
void del()
{
    int pos, i;
    printf("Enter the position to delete \n");
    scanf("%d", &pos);
    /* walk to the node at position pos */
    temp = head;
    for (i = 1; temp != NULL && i < pos; i++)
        temp = temp->next;
    if (temp == NULL || pos < 1)
    {
        printf("position not found \n");
    }
    else
    {
        if (temp->prev != NULL)
            temp->prev->next = temp->next;
        else
            head = temp->next;
        if (temp->next != NULL)
            temp->next->prev = temp->prev;
        free(temp);
        printf("node deleted \n");
    }
    getch();
}
void insert()
{
    LIST *p;
    int y;
    printf("Enter the element after which new element to be inserted \n");
    scanf("%d", &y);
    printf("Enter the node value to insert \n");
    p = (LIST *)malloc(sizeof(LIST));
    scanf("%d", &p->num);
    for (temp = head; temp != NULL; temp = temp->next)
    {
        if (temp->num == y)
        {
            p->prev = temp;
            p->next = temp->next;
            if (temp->next != NULL)
                temp->next->prev = p;
            temp->next = p;
            printf("value inserted \n");
            return;
        }
    }
    /* element y is not in the list, so the new node is not needed */
    free(p);
    printf("element not found \n");
}
Sample Input and Output

1:create list
2:insert an element to list
3:delete an element
4.display
5.exit
Enter choice 1
Enter the element(Else enter -1 to stop)
10
Enter the element(Else enter -1 to stop)
20
Enter the element (Else enter -1 to stop)
30
Enter the element (Else enter -1 to stop)
-1
Enter choice 3
Enter the element to be deleted 10
Enter choice 4
List elements are
20 30
Enter choice 5
RESULT:
Thus the program to perform operations of doubly linked list was successfully executed.
Ex. No: 9 Implementation of Queue using array
Aim:
To create a queue in array and to perform enqueue and dequeue operations.
Algorithm:
1) Start the program.
2) Get the choice of operation as either enqueue or dequeue.
3) To enqueue an element into the queue do as follows.
a) Check for the maximum array size of the queue; if full then no element can be inserted.
b) If the queue is initially empty, increment both rear and front pointers by one. Otherwise
increment the rear pointer by one.
4) To delete an element from the queue do as follows.
a) Check for queue empty condition; if empty then no element can be deleted.
b) Otherwise delete the element pointed by front by incrementing front pointer by one.
5) Display the elements from positions pointed from front to rear.
6) Stop the program.
Program:
#include <stdio.h>
#include <stdlib.h>
#define MAX 50
int queue_array[MAX];
int rear = -1;
int front = -1;
void insert();
void del();
void display();
int main()
{
    int choice;
    while (1)
    {
        printf("1.Enqueue 2.Dequeue 3.Exit \n");
        printf("Enter your choice : ");
        scanf("%d", &choice);
        switch (choice)
        {
        case 1:
            insert();
            display();
            break;
        case 2:
            del();
            display();
            break;
        case 3:
            exit(0);
        default:
            printf("Wrong choice \n");
        }
    }
}
void insert()
{
    int add_item;
    if (rear == MAX - 1)
        printf("Queue Overflow \n");
    else
    {
        if (front == -1)
            front = 0;
        printf("Enter the item to be inserted :\n ");
        scanf("%d", &add_item);
        rear = rear + 1;
        queue_array[rear] = add_item;
    }
}
void del()
{
    if (front == -1 || front > rear)
    {
        printf("Queue Underflow \n");
        return;
    }
    printf("The deleted element is : %d\n", queue_array[front]);
    front = front + 1;
}
void display()
{
    int i;
    /* front > rear means every inserted element has been removed */
    if (front == -1 || front > rear)
        printf("Queue is empty \n");
    else
    {
        printf("queue elements are : \n");
        for (i = front; i <= rear; i++)
            printf("%d ", queue_array[i]);
        printf("\n");
    }
}
Sample Input and Output :
1-Enqueue,2-Dequeue,3-Exit
enter the choice 1
Enter the item to be inserted 11
queue elements are
11
Enter the item to be inserted 12
queue elements are
11 12
The deleted element is 11
queue elements are
12
1-Enqueue,2-Dequeue,3-Exit
enter the choice 2
The deleted element is 12
queue elements are
Queue Underflow
RESULT:
Thus the program to implement queue using array was successfully executed.
Ex. No: 10 Implementation of Stack using array
Aim:
To create a stack in an array and perform push and pop operations.
Algorithm:
1. Get the choice of operation as either push or pop.
2. To push an element onto the stack do as follows.
a. Check for the maximum array size of the stack; if full then no element can be
inserted.
b. Otherwise increment TOP by one and place the element in the position pointed
to by TOP.
3. To pop an element from the stack do as follows.
a. Check for stack empty condition; if empty then no element can be deleted.
b. Otherwise delete the element and decrement TOP by one.
4. Display the elements from TOP to the starting position of the array.
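The full and empty checks are the part of this algorithm that is easiest to get wrong by one. Because TOP starts at -1 and indexes the last element, the stack is full when TOP equals the array size minus one. A minimal sketch in C follows; STACK, STACK_MAX, stack_push and stack_pop are hypothetical names for this illustration, and each function returns 1 on success and 0 on failure.

```c
#define STACK_MAX 10   /* assumed capacity */

typedef struct {
    int items[STACK_MAX];
    int top;               /* index of the top element, -1 when empty */
} STACK;

/* Push item; fails when the last valid index is already in use. */
int stack_push(STACK *s, int item)
{
    if (s->top == STACK_MAX - 1)
        return 0;                   /* stack is full */
    s->top++;
    s->items[s->top] = item;
    return 1;
}

/* Pop the top element into *out; fails when the stack is empty. */
int stack_pop(STACK *s, int *out)
{
    if (s->top == -1)
        return 0;                   /* stack is empty */
    *out = s->items[s->top];
    s->top--;
    return 1;
}
```

A stack is set up with STACK s; s.top = -1; before the first push.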
Program:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#define MAX 10
int q[MAX], top = -1;
void push();
void pop();
void display();
void main()
{
    int choice;
    clrscr();
    while (1)
    {
        printf("1-Push,2-Pop,3-Exit\n");
        printf("\nenter the choice\n");
        scanf("%d", &choice);
        switch (choice)
        {
        case 1:
            push();
            break;
        case 2:
            pop();
            break;
        case 3:
            exit(0);
        }
    }
}
void push()
{
    int item;
    printf("Enter the item to be pushed\n");
    scanf("%d", &item);
    /* top indexes the last element, so MAX - 1 is the last free slot */
    if (top == MAX - 1)
    {
        printf("stack is full\n");
    }
    else
    {
        top++;
        q[top] = item;
        display();
    }
}
void pop()
{
    int item;
    if (top == -1)
    {
        printf("stack is empty\n");
    }
    else
    {
        item = q[top];
        top--;
        printf("The popped item is %d\n", item);
        display();
    }
}
void display()
{
    int i;
    printf("\nstack items are \n");
    for (i = top; i >= 0; i--)
    {
        printf("%d\n", q[i]);
    }
    printf("\n");
}
OUTPUT:-
1-Push,2-Pop,3-Exit
enter the choice 1
Enter the item to be pushed 13

stack items are
13
1-Push,2-Pop,3-Exit
enter the choice 1
Enter the item to be pushed 12

stack items are
12
13
1-Push,2-Pop,3-Exit
enter the choice 2
The popped item is 12
stack items are

13
RESULT:
Thus the program to implement stack using arrays was successfully executed.
Ex.No:11 Implementation of Circular queue using array
Aim
To implement the Circular Queue using arrays
Algorithm
1. Declare an array Q of size N.
2. Assign F and R to be the front and rear pointers of the queue and assign -1 to F
and R.
3. Get the new element Y to be inserted into the queue.
4. If (R+1) mod N is equal to F, display Queue is Full. Otherwise, if the queue is empty
set F = R = 0, else set R = (R+1) mod N; then insert Y at position R.
5. To delete an element, check whether F is -1; if so display Queue is Empty.
Otherwise delete the element pointed to by F.
6. After a deletion, if F and R are equal then set F = R = -1; otherwise F = (F+1) mod N.
7. Display the queue Q from F to R, wrapping around to position 0 when F is greater than R.
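What makes the queue circular is that the rear and front indices wrap back to 0 using the modulo operator, so slots freed by deletion can be reused. A minimal sketch of the two operations in C follows. CQ, CQ_SIZE, cq_init, cq_add and cq_del are hypothetical names for this illustration, and each operation returns 1 on success and 0 on failure.

```c
#define CQ_SIZE 4   /* assumed capacity */

typedef struct {
    int items[CQ_SIZE];
    int front, rear;    /* both are -1 while the queue is empty */
} CQ;

void cq_init(CQ *q)
{
    q->front = -1;
    q->rear = -1;
}

/* Insert item at the rear; fails when the slot after rear is front. */
int cq_add(CQ *q, int item)
{
    if (q->front == (q->rear + 1) % CQ_SIZE)
        return 0;                               /* queue is full */
    if (q->front == -1)
        q->front = q->rear = 0;                 /* first element */
    else
        q->rear = (q->rear + 1) % CQ_SIZE;      /* wrap around */
    q->items[q->rear] = item;
    return 1;
}

/* Remove the front element into *out; fails when the queue is empty. */
int cq_del(CQ *q, int *out)
{
    if (q->front == -1)
        return 0;                               /* queue is empty */
    *out = q->items[q->front];
    if (q->front == q->rear)
        q->front = q->rear = -1;                /* last element removed */
    else
        q->front = (q->front + 1) % CQ_SIZE;    /* wrap around */
    return 1;
}
```

With a capacity of 4, a fifth insertion fails until one element has been deleted; the next insertion then lands in slot 0.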
Program:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#define MAXSIZE 4
int cq[MAXSIZE];
int front, rear;
void add(int, int);
void del(int);
void display();
void main()
{
    int will = 1, num;
    front = -1;
    rear = -1;
    clrscr();
    printf("\nProgram for Circular Queue demonstration through array");
    while (1)
    {
        printf("\n\nMAIN MENU\n1.INSERTION\n2.DELETION\n3.EXIT");
        printf("\n\nENTER YOUR CHOICE : ");
        scanf("%d", &will);
        switch (will)
        {
        case 1:
            printf("\n\nENTER THE QUEUE ELEMENT : ");
            scanf("%d", &num);
            add(num, MAXSIZE);
            display();
            break;
        case 2:
            del(MAXSIZE);
            display();
            break;
        case 3:
            exit(0);
        default:
            printf("\n\nInvalid Choice . ");
        }
    }
}
void add(int item, int MAX)
{
    if (front == (rear + 1) % MAX)
    {
        printf("\n\nCIRCULAR QUEUE IS OVERFLOW");
    }
    else
    {
        if (front == -1)
            front = rear = 0;
        else
            rear = (rear + 1) % MAX;
        cq[rear] = item;
        printf("\n\nRear = %d Front = %d ", rear, front);
    }
}
void del(int MAX)
{
    int a;
    if (front == -1)
    {
        printf("\n\nCIRCULAR QUEUE IS UNDERFLOW");
    }
    else
    {
        a = cq[front];
        if (front == rear)
            front = rear = -1;
        else
            front = (front + 1) % MAX;
        printf("\n\nDELETED ELEMENT FROM QUEUE IS : %d ", a);
        printf("\n\nRear = %d Front = %d ", rear, front);
    }
}
void display()
{
    int i;
    if (front == -1)
    {
        printf("\n **CIRCULAR QUEUE IS UNDERFLOW**");
        return;
    }
    printf("\n The status of queue ");
    if (front <= rear)
    {
        for (i = front; i <= rear; i++)
            printf("\n%d", cq[i]);
    }
    else
    {
        /* the queue has wrapped: print front..end, then start..rear */
        for (i = front; i < MAXSIZE; i++)
            printf("\n%d", cq[i]);
        for (i = 0; i <= rear; i++)
            printf("\n%d", cq[i]);
    }
}
SAMPLE OUTPUT:
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
ENTER YOUR CHOICE : 1
ENTER THE QUEUE ELEMENT : 10
Rear=0 Front=0
The status of queue 10
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
Rear=1 Front=0
The status of queue 10 20
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
Rear=2 Front=0
The status of queue 10 20 30
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
Rear=3 Front=0
The status of queue 10 20 30 40
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
Rear=4 Front=0
The status of queue 10 20 30 40 50
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
CIRCULAR QUEUE IS OVERFLOW.
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
DELETED ELEMENT FROM QUEUE IS : 10
Rear=4 Front=1
The status of queue 20 30 40 50
MAIN MENU
1.INSERTION
2.DELETION
3.EXIT
DELETED ELEMENT FROM QUEUE IS : 20
Rear=4 Front=2
The status of queue 30 40 50
RESULT:
Thus the program to implement array based circular queue was successfully executed.
Ex. No: 12 Implementation of Double Ended Queue using array
Aim:
To create a Double ended Queue and to perform insertion and deletion operations.
Algorithm:
1. Create an array with two pointers, front and rear.
a. Initialize rear pointer value as -1.
b. Initialize front pointer value as 0.
2. Get the choice of operation - either insertion at rear end or front end and similarly
deletion at rear end or front end.
3. For insertion at the rear end, check whether rear = array size; if so the deque is full at the rear
end and it is not possible to insert at the rear end. Otherwise increment the rear pointer and
insert the new item at the current position of the rear pointer.
4. For insertion at the front end, check whether front > 0; if so decrement the front pointer
and insert the new item at the current position of the front pointer.
5. For deletion at the rear end, check whether rear >= 0; if so delete the item from the current
rear position and decrement the rear pointer, otherwise it is not possible to delete the item
from the rear end.
6. For deletion at the front end, check whether front <= rear; if so delete the item from the
current front position and increment the front pointer, otherwise it is not possible to delete the
item from the front end.
7. Using the display option, list the elements of the deque.
Program
#include <stdio.h>
#include <conio.h>
#define SIZE 20
typedef struct dq
{
    int front, rear; /* items occupy item[front] .. item[rear - 1] */
    int item[SIZE];
} deque;
void create(deque *);
void display(deque *);
void insert_rear(deque *, int);
void insert_front(deque *, int);
int delete_front(deque *);
int delete_rear(deque *);
void main()
{
    int data, ch;
    deque DQ;
    clrscr();
    create(&DQ);
    printf("\n\t\t Program shows working of double ended queue");
    do
    {
        printf("\n\t\t Menu");
        printf("\n\t\t 1: insert at rear end");
        printf("\n\t\t 2: insert at front end");
        printf("\n\t\t 3: delete from front end");
        printf("\n\t\t 4: delete from rear end");
        printf("\n\t\t 5: exit. ");
        printf("\n\t\t Enter choice :");
        scanf("%d", &ch);
        switch (ch)
        {
        case 1:
            if (DQ.rear >= SIZE)
            {
                printf("\n Deque is full at rear end");
            }
            else
            {
                printf("\n Enter element to be added at rear end :");
                scanf("%d", &data);
                insert_rear(&DQ, data);
                printf("\n Elements in a deque are :");
                display(&DQ);
            }
            break;
        case 2:
            if (DQ.front <= 0)
            {
                printf("\n Deque is full at front end");
            }
            else
            {
                printf("\n Enter element to be added at front end :");
                scanf("%d", &data);
                insert_front(&DQ, data);
                printf("\n Elements in a deque are :");
                display(&DQ);
            }
            break;
        case 3:
            delete_front(&DQ);
            if (DQ.front != DQ.rear)
            {
                printf("\n Elements in a deque are :");
                display(&DQ);
            }
            break;
        case 4:
            delete_rear(&DQ);
            if (DQ.front != DQ.rear)
            {
                printf("\n Elements in a deque are :");
                display(&DQ);
            }
            break;
        case 5:
            printf("\n finish");
            break;
        }
    } while (ch != 5);
    getch();
}
void create(deque *DQ)
{
    DQ->front = 0;
    DQ->rear = 0;
}
void insert_rear(deque *DQ, int data)
{
    DQ->item[DQ->rear] = data;
    DQ->rear = DQ->rear + 1;
}
int delete_front(deque *DQ)
{
    int data;
    if (DQ->front == DQ->rear)
    {
        printf("\n Underflow");
        return 0;
    }
    data = DQ->item[DQ->front];
    printf("\n Element %d is deleted from front :", data);
    DQ->front = DQ->front + 1;
    if (DQ->front == DQ->rear)
    {
        /* the deque is now empty, so reuse the array from the start */
        DQ->front = 0;
        DQ->rear = 0;
        printf("\n Deque is empty (front end)");
    }
    return data;
}
void insert_front(deque *DQ, int data)
{
    if (DQ->front > 0)
    {
        DQ->front = DQ->front - 1;
        DQ->item[DQ->front] = data;
    }
}
int delete_rear(deque *DQ)
{
    int data;
    if (DQ->front == DQ->rear)
    {
        printf("\n Underflow");
        return 0;
    }
    DQ->rear = DQ->rear - 1;
    data = DQ->item[DQ->rear];
    printf("\n Element %d is deleted from rear:", data);
    if (DQ->front == DQ->rear)
    {
        /* the deque is now empty, so reuse the array from the start */
        DQ->front = 0;
        DQ->rear = 0;
        printf("\n Deque is empty(rear end)");
    }
    return data;
}
void display(deque *DQ)
{
    int x;
    for (x = DQ->front; x < DQ->rear; x++)
    {
        printf("%d\t", DQ->item[x]);
    }
    printf("\n\n");
}
SAMPLE OUTPUT:
Menu
1: insert at rear end
2: insert at front end
3: delete from front end
4: delete from rear end
5: exit.
Enter choice :1
Enter element to be added at rear end :11
Elements in a deque are :11
Menu
5: exit.
Enter choice :1
Elements in a deque are :11 22
Menu
5: exit.
Enter choice :1
Elements in a deque are :11 22 33
Menu
5: exit.
Enter choice :2
Deque is full at front end
Menu
5: exit.
Enter choice :3
Element 11 is deleted from front :
Elements in a deque are :22 33
Menu
5: exit.
Enter choice :2
Menu
5: exit.
Enter choice :5
RESULT:
Thus the program to implement Double Ended Queue was successfully executed
