Deal/proposal/prospect>RFI>RFP>Win/Loss/Abort,
decision trees algorithm: CART (Classification And Regression Trees)
Inferential statistics-
class-2:
x<-2
5->y
x*y
------------------------------braces, brackets, parentheses()------use a csv (comma-
delimited) file only
example of a data frame in R- one that mixes names, characters, numerics etc.
houseprices<-read.table(file.choose(),sep=",",header=TRUE)
type houseprices and hit enter to see the file in R, as below>
> houseprices
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1 16858 1629 1.0 3 0.76 180 0 NA NA NA NA NA
(the X/X.1/... columns full of NA come from trailing commas in the csv)
>
head(houseprices,4)
tail(houseprices,5)
-------------
mean(houseprices$Price)>mean of the price variable
mean(houseprices$Price)
[1] 163862.1
houseprices$Price
[1] 16858 26049 26130 31113 40932 44674 44873 45004 45904 47630 49211
49564 50483
[14] 50709 51183 52360 54210 55817 55851 57678 59003 59043 60238 60309
62105 62938
[27] 63101 63852 64174 64552 65325 65550 65615 66027 67951
------------------------------
median(houseprices$Price)
[1] 151917
getwd() - get the working directory
[1] "C:/Users/IBM_ADMIN/Documents"
setwd("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes, videos/data science") - set the
working directory
(replace every \ with / via Ctrl+F find-and-replace; backslash is an escape character in R strings)
install packages:-
install.packages("car")
The downloaded binary packages are in
C:\Users\IBM_ADMIN\AppData\Local\Temp\RtmpqQFwT8\downloaded_packages
package minqa successfully unpacked and MD5 sums checked
------------------------------
then, to see all installed packages>
library()
-------------------
to see car package>
library(help=car)
Information on package car
Description:
Package: car
Version: 2.1-5
Date: 2017-06-25
----------------------------------------------
Basics of statistics>
data types>
quantitative(discrete,continuous)/qualitative
Types of measurement scale
:-Nominal(emp id-unique key), Ordinal(rank+id), Interval(Celsius/Fahrenheit), Ratio(absolute zero)
------------------------------------------------------------
class-3
1. Descriptive statistics:-
A>ATTACH FILES:-
houseprices<-read.table(file.choose(),sep = ",", header = TRUE)
names(houseprices)
attach(houseprices)
FIND VALUES:-
table(Fireplace)
Fireplace
0 1
426 621
-----------------------
x<-subset(houseprices,Price>mean(Price))
x
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
611 163986 2039 2.5 4 0.52 18 1 NA NA NA NA NA
612 163992 2310 2.5 4 0.45 33 1 NA NA NA NA NA
613 164419 1785 2.5 3 0.16 1 1 NA NA NA NA NA
------------------------------------------------
Find outliers:- values > 3rd quartile + 1.5*IQR (or < 1st quartile - 1.5*IQR),
alternatively, anything outside [mean - t*SD, mean + t*SD]
x<-subset(houseprices,Price>(quantile(Price,.75)+1.5*(IQR(Price))))
step-1
quantile(Price,0.75)+1.5*IQR(Price)
step-2
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1031 345364 3308 2.5 4 0.66 7 1 NA NA NA NA NA
1032 347761 3820 4.5 4 1.40 18 1 NA NA NA NA NA
or, quantile(Price,0.75)+1.5*IQR(Price)
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))
subset(houseprices,Price<quantile(Price,0.25)-1.5*IQR(Price))
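The same IQR rule can be sketched outside R; a minimal Python illustration on made-up toy prices (not the houseprices data):

```python
# IQR outlier rule: flag values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR.
from statistics import quantiles

prices = [100, 102, 104, 106, 108, 110, 500]  # 500 is an obvious outlier

q1, _, q3 = quantiles(prices, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

outliers = [p for p in prices if p > upper or p < lower]
print(outliers)  # [500]
```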
--------------------------------------------------
2. Normal Distribution:-
Mean =494
SD=100
what is the probability of a score between 300 and 600?
pnorm(600,494,100)-pnorm(300,494,100)
=0.8292379
sd of the population=21
sd of the sample mean=21/sqrt(49)=3
so, pnorm(446,448,3)-pnorm(441,448,3)=0.2426772
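As a cross-check, the two pnorm() results above can be reproduced with Python's standard library (statistics.NormalDist):

```python
# Same computations as the R pnorm() calls above, via the Python stdlib.
from statistics import NormalDist

# P(300 < X < 600) for X ~ N(mean=494, sd=100)
p_score = NormalDist(494, 100).cdf(600) - NormalDist(494, 100).cdf(300)
print(round(p_score, 7))   # 0.8292379, matching pnorm(600,494,100)-pnorm(300,494,100)

# sampling distribution of the mean: sd = 21/sqrt(49) = 3
p_mean = NormalDist(448, 3).cdf(446) - NormalDist(448, 3).cdf(441)
print(round(p_mean, 7))    # 0.2426772
```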
class-4:-
------------
PROB 1. Mercury makes a 2.4 L V-6 engine, the Laser XRi, used in speedboats.
The company's engineers believe that the engine delivers an average power of
220 horsepower and that the standard deviation of power delivered is 15
horsepower. A potential buyer intends to sample 100 engines (each engine to be
run a single time).
What is the probability that the sample mean will be less than 217 horsepower?
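A sketch of the solution: by the central limit theorem the sample mean is approximately N(220, 15/sqrt(100)), so the answer is one normal-cdf call (here in Python's stdlib, mirroring R's pnorm(217, 220, 1.5)):

```python
# P(sample mean < 217) when Xbar ~ N(mu, sigma/sqrt(n)).
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 220, 15, 100
se = sigma / sqrt(n)            # standard error of the mean = 1.5
p = NormalDist(mu, se).cdf(217)
print(round(p, 7))              # 0.0227501 (z = -2)
```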
xbar +/- z*(sigma/sqrt(n)) = interval
prob on computer service company-
Z = (X - MU)/(SIGMA/SQRT(N))
z = qnorm(p,mu,sigma), e.g. qnorm(0.25,0,1) = -0.6744898 (the first quartile)
for a 95% interval take p = 0.05/2:
qnorm(0.025,0,1)
[1] -1.959964
plug in z as below (here xbar = 6.5 and sigma/sqrt(n) = 0.32):-
6.5+qnorm(.025,0,1)*.32
[1] 5.872812
> 6.5-qnorm(.025,0,1)*.32
[1] 7.127188
-----------------
if 99% confidence,
6.5+qnorm(0.01/2,0,1)*.32
6.5-qnorm(0.01/2,0,1)*.32
5.675735 < mu < 7.324265   (qnorm(0.005,0,1) = -2.575829; the wider interval is the
price of the higher confidence level)
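The same intervals can be recomputed in Python's stdlib as a cross-check (inv_cdf plays the role of qnorm); xbar = 6.5 and sigma/sqrt(n) = 0.32 as in the notes:

```python
# z confidence intervals: xbar +/- z * se.
from statistics import NormalDist

xbar, se = 6.5, 0.32

z95 = NormalDist().inv_cdf(0.975)      # 1.959964..., same as -qnorm(0.025)
ci95 = (xbar - z95 * se, xbar + z95 * se)
print(ci95)                            # ~ (5.872812, 7.127188)

z99 = NormalDist().inv_cdf(0.995)      # 2.575829...
ci99 = (xbar - z99 * se, xbar + z99 * se)
print(ci99)                            # ~ (5.675735, 7.324265)
```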
t = (xbar - mu)/(s/sqrt(n)), with df = n-1 degrees of freedom
mu = xbar +/- t*(s/sqrt(n))
A stock market analyst wants to estimate the average return on a certain stock.
A random sample of 15 days yields an average (annualized) return of
Xbar=10.37% and a standard deviation of s=3.5%. Assuming a normal population
of returns, give a 95% confidence
interval for the average return on this stock
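A worked sketch of this problem (the notes leave it unsolved): Python's stdlib has no t-distribution, so the critical value t_{0.025,14} ≈ 2.145 (in R, qt(0.975,14)) is taken from standard tables:

```python
# 95% t confidence interval: xbar +/- t * s/sqrt(n), df = n-1.
from math import sqrt

xbar, s, n = 10.37, 3.5, 15
t_crit = 2.145                      # tabled value of t_{0.025, df=14}
margin = t_crit * s / sqrt(n)
ci = (xbar - margin, xbar + margin)
print(ci)                           # roughly (8.43%, 12.31%)
```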
3. Hypothesis testing:-
------------------------
               H0 TRUE         H0 FALSE
Accept H0      OK              Type II error   >> Type II is the error we risk when we
(status quo)                                      accept the null hypothesis
Reject H0      Type I error    OK              >> Type I is the error we risk when we
(go-ahead decision)                               reject H0
H0: mu <= 3000
Ha: mu > 3000
we get sample mean = 3001, but we can't reject the null as it's not big enough; we need
more evidence from the statistics..
>p value = Type I error probability; 5% = significance level > like the opposite of the confidence level
--------------------------------------
class-5:-
1. An automatic bottling machine fills cola into 2 lt (2000 cm3) bottles.
A consumer advocate wants to test the null hypothesis that the average amount
filled by the machine into a bottle is at least 2,000 cm3. A random sample
of 40 bottles coming out of the machine was selected and the exact contents of
the selected bottles were recorded. The sample mean was 1,999.6 cm3. The
population standard deviation is known from past experience to be 1.30 cm3.
1. Test the null hypothesis at an alpha of 5%.
2. Assume that the population is normally distributed with the same sd of 1.30 cm3.
Assume that the sample size is only 20 but the sample mean is the same 1,999.6 cm3.
Conduct the test once again at an alpha of 5%.
3. If there is a difference in the two test results, explain the reason for the difference.
3> why the difference? as the sample size grows, the standard error sigma/sqrt(n)
shrinks, so the same sample mean is stronger evidence against the null
H0: mu >= 2000, Ha: mu < 2000
with n=40 we reject H0 (p < 0.05); with n=20 we can't reject the null hypothesis, as p > 0.05
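The two tests can be computed side by side; a sketch using Python's stdlib (sigma is known, so this is a left-tailed z-test):

```python
# Bottling problem: z = (xbar - mu0) / (sigma/sqrt(n)), left tail since Ha: mu < 2000.
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma = 2000, 1999.6, 1.30

results = {}
for n in (40, 20):
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = NormalDist().cdf(z)          # one-sided p-value
    results[n] = (round(z, 3), round(p, 4))
    print(n, results[n])
# n=40: z ~ -1.946, p ~ 0.0258 -> reject H0 at alpha = 0.05
# n=20: z ~ -1.376, p ~ 0.0844 -> cannot reject H0
```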
names(houseprices)
t.test(houseprices$Math, mu=54, alternative="two.sided")   (note: this example uses a
different csv, one with a Math column, loaded into the same variable name)
>>
data: houseprices$Math
t = -2.0454, df = 199, p-value = 0.04213
alternative hypothesis: true mean is not equal to 54
95 percent confidence interval:
51.33868 53.95132
sample estimates:
mean of x
52.645
3. two paired variables:-
t.test(hyp$Math,hyp$Science,paired = T)
4. separate (independent) samples:-
t.test(hyp$Math1~hyp$School)
5. ANOVA:-
out<-aov(hyp$Math~as.factor(hyp$Prog))
summary(out)
6.chi square:-
table(houseprices$Race)
chisq.test(houseprices$Race, houseprices$Gender)
>>
Pearson's Chi-squared test
names(houseprices)
cor(houseprices$Gender,houseprices$School)
2. cor(hp$Price,hp$LivingArea)
head(hp)
dta<-hp[c(1,2,5,6)]
cor(dta)
Price = 15875.6 + 81.883*LivingArea,
Residuals:
Min 1Q Median 3Q Max
-250728 -20832 -1748 18542 215701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15875.637 3943.034 4.026 6.08e-05 ***
LivingArea 81.883 2.056 39.823 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
SSR + SSE = SST
R^2 = SSR/SST
Adjusted R^2 = 1 - (SSE/(n-k-1))/(SST/(n-1))
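A toy numeric illustration of these identities (the SSR and SSE values are made up, not taken from the houseprices fit):

```python
# R^2 and adjusted R^2 from the sums of squares.
n, k = 100, 1          # observations, predictors
ssr, sse = 800.0, 200.0
sst = ssr + sse        # SSR + SSE = SST

r2 = ssr / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(r2, round(adj_r2, 4))   # 0.8 0.798
```

Adjusted R^2 is slightly below R^2 because it penalises each extra predictor.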
5.
Call:
lm(formula = Price ~ LivingArea + LotSize + Age, data = houseprices)
Residuals:
Min 1Q Median 3Q Max
-188875 -21980 -3736 16718 227525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33695.248 4321.689 7.797 1.53e-14 ***
LivingArea 76.863 2.106 36.490 < 2e-16 ***
LotSize 1042.954 1675.971 0.622 0.534
Age -332.895 37.936 -8.775 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
1. R prog
input<-read.table(file.choose(),sep=",",header=T)
head(input)
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
d <- dist(normalized_data, method = "euclidean")
fit <- hclust(d, method="complete")
plot(fit)
-----------------
mydata
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
groups<-cutree(fit, k=4)   # (missing in the notes: cutree() produces the cluster
                           # membership used below; k=4 assumed)
membership<-as.matrix(groups)
membership
hclustering<- data.frame(mydata, membership)
hclustering
write.csv(hclustering,"C:/Users/pande_000/Desktop/Simplilearn/cluster2.csv")
fit <- kmeans(normalized_data, 4)
fit$cluster
mydata<- data.frame(mydata, fit$cluster)
mydata
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=mean)
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=max)
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=min)
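What kmeans() does internally can be sketched from scratch; a minimal deterministic toy in Python (1-D data, k=2, fixed starting centers - an illustration of Lloyd's algorithm, not R's implementation):

```python
# Minimal k-means: alternate assignment and center-update steps.
def kmeans_1d(data, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
print(centers)   # converges to roughly [1.0, 9.5]
```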
----------------------------------------------------------
3. Groceries case:-Unstructured data:-association rules
install.packages("arules")
library(arules)
Groc<-read.transactions(file.choose(),sep=",")  # (assumed: the load step is missing from
                                                # the notes; read.transactions is the usual
                                                # arules way to read basket data)
inspect(Groc[1:3])
items
[1] {citrus fruit, semi-finished bread, margarine, ready soups}
[2] {tropical fruit, yogurt, coffee}
[3] {whole milk}
(the runs of commas in the pasted output came from multi-word item names being split)
itemFrequencyPlot(Groc,support=.01) - items with at least 1% support
itemFrequencyPlot(Groc,support=.05)
---------------------------------------------------------------
class-8:
---------
confidence({a,b} => {c}) = support({a,b,c}) / support({a,b})
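A toy illustration of the formula (five made-up transactions, not the Groceries data):

```python
# support = fraction of transactions containing an itemset;
# confidence({A} => {B}) = support(A and B) / support(A); lift = conf / support(B).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "bread"},
]
n = len(transactions)

def support(items):
    return sum(items <= t for t in transactions) / n

# rule {milk} -> {bread}
conf = support({"milk", "bread"}) / support({"milk"})
lift = conf / support({"bread"})
print(support({"milk", "bread"}), round(conf, 4), round(lift, 4))  # 0.6 0.75 0.9375
```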
>
m1<-apriori(Groc,parameter = list(support=.007,confidence=.25,minlen=2))
summary(m1):
inspect(m1)
>>
summary(m1)
set of 24 rules
mining info:
data ntransactions support confidence
Groc 9835 0.007 0.25
>
inspect(sort(m1,by="lift"))
lhs                       rhs                      support       confidence
(this listing was garbled when pasted: multi-word item names such as "long life
bakery product", "canned beer", "whole milk", "citrus fruit", "tropical fruit",
"root vegetables" and "other vegetables" were split on commas, so rules like
[1] {bakery} => {life} are artifacts of the split; e.g. the recoverable rule [7] is
{tropical fruit} => {citrus fruit}, support 0.011591256, confidence 0.5816327)
----------------------------
2. DBSCAN:- core, border and noise points
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
install.packages("fpc")
library(fpc)
install.packages("dbscan")
library("dbscan")
install.packages("factoextra")
library("factoextra")
---------------------------------------------------
3. supervised learning>classification/Data partitioning:-
-------------------------------------------
install.packages("class")
library("class")
credit<-read.table(file.choose(),sep=",",header=TRUE)
head(credit)
credit$Default <- factor(credit$Default)
credit$Default
credit$history = factor(credit$history,levels=c("A30","A31","A32","A33","A34"))
credit$history
levels(credit$history) = c("good","good","poor","poor","terrible")
levels(credit$history)
credit$foreign <- factor(credit$foreign, levels=c("A201","A202"), labels=c("foreign","german"))
credit$rent <- factor(credit$housing=="A151")
credit$rent
credit$purpose <- factor(credit$purpose,levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
levels(credit$purpose) <- c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")
credit <- credit[,c("Default","duration","amount","installment","age","history","purpose","foreign","rent")]
summary(credit)
credit[1:3,]
x=credit[,c(2,3,4)]
x[,1]=(x[,1]-mean(x[,1]))/sd(x[,1])
x[,2]=(x[,2]-mean(x[,2]))/sd(x[,2])
-------------------------------------------
4. Bayes' theorem
P(A|B) = P(A∩B)/P(B)
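A numeric sketch with made-up probabilities (P(A∩B) obtained from the equivalent form P(A∩B) = P(B|A)·P(A)):

```python
# Bayes' theorem: P(A|B) = P(A and B) / P(B) = P(B|A) * P(A) / P(B).
p_a = 0.30          # P(A)
p_b_given_a = 0.80  # P(B|A)
p_b = 0.50          # P(B)

p_a_and_b = p_b_given_a * p_a        # P(A and B) = 0.24
p_a_given_b = p_a_and_b / p_b        # 0.24 / 0.50
print(round(p_a_given_b, 2))         # 0.48
```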