Clodes Class Data Science

class-1>
Deal/proposal/prospect>RFI>RFP>Win/Loss/Abort,
decision trees alogarthm-cart(classification using regression tress)
SALESFORCE.COM=INFO ABOUT SALES N CUST
Inferential statistics-
srs=sample random sampling
class-2:
x<-2
5->y
x*y
------------------------------braces, brackets, parentheseis()------use csv comma
delimitd file only
example of data frame in R- that has names, characters, numeric etc.
houseprices<-read.table(file.choose(),sep=",",header=TRUE)
houseprices-hit enter, see file in R, as below>
> houseprices
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1 16858 1629 1.0 3 0.76 180 0 NA NA NA NA NA
names(houseprices)- see variables names as below

Price" "LivingArea" "Bathrooms" "Bedrooms" "LotSize" "Age"
"Fireplace"
>
head(houseprices,4)
tail(houseprices,5)
-------------
mean(houseprices$Price)>mean of the price variable
mean(houseprices$Price)
[1] 163862.1
houseprices$Price
houseprices$Price
[1] 16858 26049 26130 31113 40932 44674 44873 45004 45904 47630 49211
49564 50483
[14] 50709 51183 52360 54210 55817 55851 57678 59003 59043 60238 60309
62105 62938
[27] 63101 63852 64174 64552 65325 65550 65615 66027 67951
------------------------------
median(houseprices$Price)
median(houseprices$Price)
[1] 151917
getwd()-get directory
C:/Users/IBM_ADMIN/Documents"
setwd("C:\Users\IBM_ADMIN\Desktop\BIG DATA\class notes, videos\data science")-set
directory
replace / by ctrlF,
setwd("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes, videos/data science"

set
------------------------------------------
if the above directory is set, we can write the below code
d()
exp<-read.csv("houseprices.CSV")
head(exp)
head(exp)
1 16858 1629 1 3 0.76 180 0 NA NA NA NA NA
2 26049 1344 2 3 0.92 13 0 NA NA NA NA NA
3 26130
--------------------------------
Getting help in R
?mean
help("mean")
install packages:-
install.packages("car")
The downloaded binary packages are in
C:\Users\IBM_ADMIN\AppData\Local\Temp\RtmpqQFwT8\downloaded_packages
package minqa successfully unpacked and MD5 sums checked
------------------------------
then , to see>all packages
library()
-------------------
to see car package>
library(help=car)
Information on package car
Description:
Package: car
Version: 2.1-5
Date: 2017-06-25
----------------------------------------------
Basics of statistics>
population(all learners)>sample(only success learners)

, variable(no of defective items)>parameter(average income), random variable
data types>
quantitaive(discrete,continuous)/qualitative
Type of interval
:-Nominal(emp id-unique key), Ordinal(rank+id),Interval(c/F),Ratio(abs zero)
------------------------------------------------------------
class-3
1. Descriptive statistics:-
A>ATTACH FILES:-
houseprices<-read.table(file.choose(),sep = ",", header = TRUE)
names(houseprices)
attach(houseprices)
FIND VALUES:-
table statement=freq of the variable:-

table(HP$Fireplace)
table(Fireplace)
table(Fireplace)
Fireplace
0 1
426 621
-----------------------
mean, median, IQR etc>

quantile(Price,0.75)
IQR(Price)
quantile(Price,0.75)-quantile(Price,0.25)
------------------------------------------
Find houses that has values >mean values in price:-
x<-subset(houseprices,Price>mean(Price))
x
611 163986 2039 2.5 4 0.52 18 1 NA NA NA NA NA
612 163992 2310 2.5 4 0.45 33 1 NA NA NA NA NA
613 164419 1785 2.5 3 0.16 1 1 NA NA NA NA NA
------------------------------------------------
Find outliers :-values>3rd qtr+1.5IQR,
outliers:-[mean t * SD, mean + t * SD]
x<-subset(houseprices,Price>(quantile(Price,.75)+1.5*(IQR(Price))))
step-1
quantile(Price,0.75)+1.5*IQR(Price)
step-2
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))
so, code is:-

quantile(Price,0.75)+1.5*IQR(Price)
1031 345364 3308 2.5 4 0.66 7 1 NA NA NA NA NA
1032 347761 3820 4.5 4 1.40 18 1 NA NA NA NA NA
or, quantile(Price,0.75)+1.5*IQR(Price)
subset(houseprices,Price<quantile(Price,0.25)+1.5*IQR(Price))
--------------------------------------------------
install.packages("moments")-this is for calculation of skewness

library("moments")
skewness(Price)>0.8749042=+vely skewed
kurtosis(Price)>3.750459=normal has skewdness=2, so its more than normal skewed
---------------------------
2. Normal Distribution:-
find the value:-
Mean =494
SD=100
what is Score between 300 and 600?
pnorm(600,494,100)-pnorm(300,494,100)
=0.8292379
central limit theorem:-sample mean=population mean,

sd of sample=sd of pop/sqrt sample
---------------------
uppose during any hour in large departmental store, the average number
of shoppers is 448, with the standard deviation of 21 shoppers. What is
the probability that a random sample of 49 different shopping
hours will yield a sample mean between 441 and 446 shoppers.
sd of the population=21
sd of the sample mean=21/sqrt49=3
so, pnorm(446,448,3)-pnorm(441,448,3)=0.2426772
estimation:-/standard normal distribution:-

----------------
z=(x-mu)/sigma; x=sample mean, mu=pop mean, sigma=sigma/sqrt(n)
now, if> mean=0, sd=1,its a standard normal
class-4:-
------------
PROB 1.Mercury makes a 2.4 lt V-6 engine, The Laser XRi, used in speedboats.
The companies engineer believe that the engine delivers an average poer of
220 horsepower and that the standard deviation of power delivered is 15
horsepower. A potential buyer intends to sample 100 engines(each engine to be
run a single time).
What is the probability that the sample mean will be less than 217 horsepower?
pnorm(x,mean,sd/sqrt(n)), n=sample size

pnorm(217,220,1.5) =0.02275013
0.02275013
x+/- z(sigma/sqrt(n)=interval
prob on computer service company-
PROB 2. Comcast, the computer services company, is planning to invest heavily in

online television services. As part of the decision, the company wants to
estimate the average no of online shows a family of four would watch per day.
A random sample of n=100 families is obtained, and in this sample the average
no of shows viewed per day is 6.5 and the population standard deviation is
known to be 3.2. Construct a 95% confidence interval for the average no
of online television shows watched by the entire population of families of four
> its a std normal, so mu=0, and sigma=1
x+/- z(sigma/sqrt(n)=interval
Z=(X-MU)/(SIGMA/SQRTN)
x=6.5, z=?, sigma=3.2, n=100
z=qnorm(p,mu,sigma)=qnorm(0.25,0,1)
qnorm(0.025,0,1)
[1] -1.959964
so, z=-0.6744898
plug in z as below:-
i6.5+qnorm(.025,0,1)*.32
[1] 5.872812
> 6.5-qnorm(.025,0,1)*.32
[1] 7.127188
-----------------
if 99% confidence,
6.5+qnorm(0.01/2,0,1)*.32
6.5-qnorm(0.01/2,0,1)*.32
5.973647<z<7.026353
-99%<x<+99%
2.T DISTRIBUTION:- SAMPLE SIZE NOT A COSTRAINT

We need this as we dont know sigma in normanl distribution
t=(x-mu)/(d/sqrt(n), d=dof=n-1
mu=x=/- t(d/sqrtn)
A stock market analyst wants to estimate the average return on a certain stock.
A random sample of 15 days yields an average (annualized) return of
Xbar=10.37% and a standard deviation of s=3.5%. Assuming a normal population
of returns, give a 95% confidence
interval for the average return on this stock
sd=3.5, x=10.37, df=15-1=14
10.37+qt(.025,14)*3.5/sqrt(15) and 10.37-qt(.025,14)*3.5/sqrt(15)
8.431765 and 12.30824

complete businees statistics by Amir D Aczel**
statistics for mangagement by Levin and rubin**
KEn black**
3. Hypothesis testing:-
------------------------
HO FALSE
AcceptH0 OK TYPEII error>>we mention the error when we
(status co) accept Null Hyp
REJECTH0 TypeI error ok >>we mention type II also when
(no go ahead scene) we reject H0
suppose we have a scehe where
HOi, mu<3000
Ha, mu>3000
we get sample mean=3001 , but we cant reject null as its not big enough, hence
more evidence from statistics..
>p value=TypeI error, 5%=significance level>like opp confidence interval
if p<5% reject null, if p>5% accept null
MECE-Mutually exclusive, collectively exhaustive

null hypothesis>typeI(reject null when true) and typeII error(accept alternate
when false).
>>
p value=TypeI error, 5%=significance level>like opp confidence interval
In a hypothesis test, the test statistic Z=-1.86.
1.Find the p-value if the test is a) left-tailed,

b) right-tailed, and c) two-tailed.
a>pnorm(-1.86,0,1) =0.21-reject null

b>(1 - pnorm(1.86,0,1))-cant reject null
c>pnorm(-1.86,0,1) + (1 - pnorm(1.86,0,1))=0.06288553>>can t reject null
2.In which of these three cases will H0 be rejected at an alpha of 5%

An automatic bottling machine fills cola into 2 lt (2000 cm3) bottles.
A consumer advocate wants to test the null hypothesis that the average amount
filed by the machine into the bottle is at least 2,000 cm3. A random sample
of 40 bottles coming out of the machine was selected and the exact content of
the selected bottles are recorded. The sample mean was 1,999.6 cm3.
The population standard deviation is known from the past experience to be
1.30 cm3.
1.Test the null hypothesis at an alpha of 5%

2.Assume that the population is normally distributed with the same sd
of 1.30 cm3. Assume that the sample size is only 20 but the sample mean
is the same 1,999.6 cm3. Conduct the test once again at an alpha of 5%.
3.If there is a difference in the two test results, explain the reason for
the difference.
H0: mu >= 2000, HA: mu < 2000

N(2000, 1.30/sqrt(40))
pnorm(1999.6,2000,1.3/sqrt(40))=0.02582635<5%, so reject null hypothesis.
--------------------------------------
class-5:-
1.An automatic bottling machine fills cola into 2 lt (2000 cm3) bottles.
A consumer advocate wants to test the null hypothesis that the average amount
filed by the machine into the bottle is at least 2,000 cm3. A random sample
of 40 bottles coming out of the machine was selected and the exact content of
the selected bottles are recorded. The sample mean was 1,999.6 cm3. The
population standard deviation is known from the past experience to be 1.30
cm3.1. Test the null hypothesis at an alpha of 5%2. Assume that the
population is normally distributed with the same sd of 1.30 cm3. Assume that
the sample size is only 20 but the sample mean is the same 1,999.6 cm3.
Conduct the test once again at an alpha of 5%.3. If there is
a difference in the two test results, explain the reason for the difference.
>1.. pnorm(1999.6,2000,1.3/sqrt(40)), p=0.02582635, so reject null.
2.> pnorm(1999.6,2000,1.3/sqrt(20))[1]> p=0.0844043
3> why diff? as the sample size grows, its more obvious to build confidence
mu < 2000
Can't reject Null Hypothesis in this case AS p>0,05
2.houseprices<-read.table(file.choose(),sep = ",",header = TRUE)

head(houseprices)
names(houseprices)
t.test(houseprices$Math, mu=54, alternative="two.sided")
>>
data: houseprices$Math
t = -2.0454, df = 199, p-value = 0.04213
alternative hypothesis: true mean is not equal to 54
95 percent confidence interval:
51.33868 53.95132
sample estimates:
mean of x
52.645
3. 2 variable:-
t.test(hyp$Math,hyp$Science,paired = T)
4. separate samples:-
t.test(hyp$Math1~hyp$School)
data: houseprices$Math by houseprices$School

t = -1.448, df = 45.351, p-value = 0.1545
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.9908645 0.9789598
sample estimates:
mean in group 1 mean in group 2
52.24405 54.75000
5.anova-
out<-aov(hyp$Math~as.factor(hyp$Prog))summary(out)
Df Sum Sq Mean Sq F value Pr(>F)

as.factor(houseprices$Prog) 2 4002 2001.1 29.28 7.36e-12 ***
Residuals 197 13464 68.3
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
p=7.36e-12 ***..reject null
6.chi square:-
the disease case example:-

calculate p and test null:-H0=there is no association of disease and
exposure.
calculated on houseprices file:-
table(houseprices$Race)
chisq.test(houseprices$Race, houseprices$Gender)
>>
Pearson's Chi-squared test
data: houseprices$Race and houseprices$Gender

X-squared = 3.204, df = 3, p-value = 0.3612
since p>0.05, accept null

---------------------------------------------------------
class-6:-
1.Correleation:-
names(houseprices)
"ID" "Gender" "Race" "SEB" "School" "Prog" "Read" "Write"

[9] "Math" "Science" "SST"
cor(houseprices$Gender,houseprices$School)
2.cor(hp$Price,hp$LivingArea)head(hp)dta<-hp[c(1,2,5,6)]cor(dta)
3. fit1 <- lm(Price ~ LivingArea,data=hp)summary(fit1)
so we are testing null hypothesis if betas are coonected or not:-

H0=beta0=beta1=..=0
first F test(Anova) is to be done.for the model feasibility

Only if p-value < 0.05 for F test, then only t test is needed.
----------------------------------------------------------------------
price=15875+81.883L,
Residuals:
Min 1Q Median 3Q Max
-250728 -20832 -1748 18542 215701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15875.637 3943.034 4.026 6.08e-05 ***
LivingArea 81.883 2.056 39.823 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 42660 on 1045 degrees of freedom

Multiple R-squared: 0.6028, Adjusted R-squared: 0.6024
F-statistic: 1586 on 1 and 1045 DF, p-value: < 2.2e-16
since p<0.05 , we reject null..so there is relation between beta values

so there is relation between living area and price.
4. so, regression value of y - mean of y=is the ED value.explidate deviation

ED+UED=Total Deviation.
SSR+SSE=TSE
ssr/sst=(R)^2
ADJR^2=(SSR/df1)/(sst/df2)
5.
fit1 <- lm(Price ~ LivingArea+LotSize+Age,data=houseprices)

vif(fit1)
summary(fit1)
Call:
lm(formula = Price ~ LivingArea + LotSize + Age, data = houseprices)
Residuals:
Min 1Q Median 3Q Max
-188875 -21980 -3736 16718 227525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33695.248 4321.689 7.797 1.53e-14 ***
LivingArea 76.863 2.106 36.490 < 2e-16 ***
LotSize 1042.954 1675.971 0.622 0.534
Age -332.895 37.936 -8.775 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 41200 on 1043 degrees of freedom

Multiple R-squared: 0.6301, Adjusted R-squared: 0.629
F-statistic: 592.2 on 3 and 1043 DF, p-value: < 2.2e-16
>
-----------------------------------------------------------------------
class-7:-
1. R prog
2. Normalization>unsupervised learning:- subtract each by mean and divide

by the sd:
univ .sat file import >scale function
input<-read.table(file.choose(),sep=",",header=T)
head(input)
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
next distance between cluster:-complete linkage method

d <- dist(normalized_data, method = "euclidean")>
input<-read.table(file.choose(),sep=",",header=T)
head(input)mydatamydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7]) d <- dist(normalized_data, method =
"euclidean")fit <- hclust(d, method="complete")
plot(fit)
-----------------
mydata
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
membership<-as.matrix(groups)
membership
hclustering<- data.frame(mydata, membership)
hclustering
write.csv(hclustering,"C:/Users/pande_000/Desktop/Simplilearn/cluster2.csv")
hclustering
write.csv(hclustering,"C:/Users/pande_000/Desktop/Simplilearn/cluster2.csv")fit <-
kmeans(normalized_data, 4) fit$clustermydata<- data.frame(mydata, fit$cluster)
mydataaggregate(mydata[,2:7], by=list(fit$cluster), FUN=mean)
aggregate(mydata[,2:7], by=list(fit$cluster),
FUN=max)aggregate(mydata[,2:7], by=list(fit$cluster), FUN=min)
----------------------------------------------------------
3. Groceries case:-Unstructured data:-association rules
install.packages(arules)
library(arules)
library(arules)
Groc<-read.transactions("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes,

videos/data science/groceries.csv")
summary(Groc)
element (itemset/transaction) length distribution:

sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 23
1380 2733 1774 1257 910 601 415 293 166 95 75 44 39 19 11 9
2 3 3 3 1 2
Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 2.000 3.000 3.635 5.000 23.000
includes extended item information - examples:

labels
1 (appetizer),,,,,,,,,,,,,,,,,,,,,
2 (appetizer),,,,,,,,,,,,,,,,,,,,,,,
3 (appetizer),,,,,
support:-proportion of transaction that an item shows up:-
inspect(Groc[1:3])
items
[1] {bread,margarine,ready,
citrus,
fruit,semi-finished,
soups,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
[2] {fruit,yogurt,coffee,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
tropical}
[3] {milk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
whole}
itemFrequencyPlot(Groc,support=.01)-THE 10%
itemFrequencyPlot(Groc,support=.05)
---------------------------------------------------------------
class-8:
---------
1. Association rules>market basket analysis:-
#Confidence - measure of proportion of transactions where the presence

of an item results
#into a presence of another item or set of items
confidence(a,b>c)= support(a,b>c)/support(a>b)
Total=100 transactions>bread=20, butter=15 times>9 with bread

alongside bread
so, lift= conf/suppport=(9/20)/(15/100)
>
m1<-apriori(Groc,parameter = list(support=.007,confidence=.25,minlen=2))
summary(m1):
inspect(m1)
>>
summary(m1)
set of 24 rules
rule length distribution (lhs + rhs):sizes

2 3
21 3
Min. 1st Qu. Median Mean 3rd Qu. Max.

2.000 2.000 2.000 2.125 2.000 3.000
summary of quality measures:

support confidence lift
Min. :0.007117 Min. :0.2517 Min. : 2.634
1st Qu.:0.007956 1st Qu.:0.3118 1st Qu.: 5.012
Median :0.012201 Median :0.4086 Median : 8.925
Mean :0.014599 Mean :0.4810 Mean :10.282
3rd Qu.:0.015099 3rd Qu.:0.5257 3rd Qu.:12.628
Max. :0.037417 Max. :1.0000 Max. :26.726
mining info:
data ntransactions support confidence
Groc 9835 0.007 0.25
>
so, higher the lift>more chances of bundle as lift=conf/bundle

lhs support confidence lift
{beer, => {canned} 0.026436197 0.6842105 20.833469
{beer,,,,,} => {bottled} 0.012201322 0.3157895 8.924682\\\
inspect(sort(m1,by="lift"))
lhs rhs
support confidence
[1] {bakery} => {life}
0.037417387 1.0000000
[2] {life} => {bakery}
0.037417387 1.0000000
[3] {canned} => {beer,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
0.026436197 0.8049536
[4] {beer,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} => {canned}
0.026436197 0.6842105
[5] {milk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} => {whole}
0.012302999 0.9379845
[6] {citrus} => {fruit,tropical}
0.011591256 0.2516556
[7] {fruit,tropical} => {citrus}
0.011591256 0.5816327
[8] {vegetables,other,vegetables,whole} => {fruit,root}
0.007320793 0.4114286
[9] {fruit,root,vegetables,whole} => {veget
----------------------------
2.debscan:-core, border and noise points
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
install.packages("fpc")
library(fpc)
install.packages("dbscan")
library("dbscan")
install.packages("factoextra")
library("factoextra")
res.fpc <- fpc::dbscan(iris, eps = 0.4, MinPts = 4)
---------------------------------------------------
3. supervised learning>classification/Data partitioning:-
age, income, residence=A,I,R,F(=fraud)
70%-train-AIR,F>Learning alograthm applied: model BUILD

30%-test(validation)-A.I.R,F>apply model on them.
res.fpc <- fpc::dbscan(iris, eps = 0.4, MinPts = 4)
res.fpcfviz_cluster(res.fpc, iris, geom = "point")
-------------------------------------------
install.packages("class") library("class") credit<-

read.table(file.choose(),sep=",",header=TRUE) head(credit) credit$Default <-
factor(credit$Default) credit$Default credit$history =
factor(credit$history,levels=c("A30","A31","A32","A33","A34")) credit$history
levels(credit$history) = c("good","good","poor","poor","terrible")
levels(credit$history) credit$foreign <- factor(credit$foreign,
levels=c("A201","A202"), labels=c("foreign","german")) credit$rent <-
factor(credit$housing=="A151") credit$rent credit$purpose <-
factor(credit$purpose,levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48
","A49","A410")) levels(credit$purpose) <-
c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz") credit <-
credit[,c("Default","duration","amount","installment","age","history",
"purpose","foreign","rent")] summary(credit) credit[1:3,] summary(credit)
x=credit[,c(2,3,4)] x[,1]=(x[,1]-mean(x[,1]))/sd(x[,1]) x[,2]=(x[,2]-
mean(x[,2]))/sd(x[,2])
>install.packages("class")library("class")credit<-
read.table(file.choose(),sep=",",header=TRUE)head(credit)credit$Default <-
factor(credit$Default)credit$Defaultcredit$history =
factor(credit$history,levels=c("A30","A31","A32","A33","A34"))credit$history
levels(credit$history) = c("good","good","poor","poor","terrible")
levels(credit$history)credit$foreign <- factor(credit$foreign,
levels=c("A201","A202"), labels=c("foreign","german"))credit$rent <-
factor(credit$housing=="A151") credit$rentcredit$purpose <-
factor(credit$purpose,levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48
","A49","A410")) levels(credit$purpose) <-
c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")credit <-
credit[,c("Default","duration","amount","installment","age","history",
"purpose","foreign","rent")]summary(credit)credit[1:3,]summary(credit)
x=credit[,c(2,3,4)] x[,1]=(x[,1]-mean(x[,1]))/sd(x[,1])x[,2]=(x[,2]-
mean(x[,2]))/sd(x[,2])
>from Shubham Pandey to All Participants:

train <- sample(1:1000,900) ## this is training set of 900 borrowers
xtrain <- x[train,] # training data set
xnew <- x[-train,] #test data set
ytrain <- credit$Default[train] #factor of true classifications of training set
ynew <- credit$Default[-train] #factor of true classifications of test set
xnew
nearest1 <- knn(train=xtrain, test=xnew, cl=ytrain, k=5)
table(nearest1,credit$Default[-train])
pcorrn1=100*sum(ynew==nearest1)/100
pcorrn1
train <- sample(1:1000,900) ## this is training set of 900 borrowersxtrain <-

x[train,] # training data setxnew <- x[-train,] #test data setytrain <-
credit$Default[train] #factor of true classifications of training setynew <-
credit$Default[-train] #factor of true classifications of test set
-------------------------------------------
4.baye's theorem
p(a/b)=p(aintb)/p(b)

Clodes Class Data Science

Diunggah oleh

Informasi Dokumen

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Clodes Class Data Science

Diunggah oleh

Hak Cipta:

Format Tersedia

class-1>

SALESFORCE.COM=INFO ABOUT SALES N CUST

srs=sample random sampling

names(houseprices)- see variables names as below

setwd("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes, videos/data science"

population(all learners)>sample(only success learners)

table statement=freq of the variable:-

mean, median, IQR etc>

Find houses that has values >mean values in price:-

so, code is:-

install.packages("moments")-this is for calculation of skewness

find the value:-

central limit theorem:-sample mean=population mean,

estimation:-/standard normal distribution:-

pnorm(x,mean,sd/sqrt(n)), n=sample size

PROB 2. Comcast, the computer services company, is planning to invest heavily in

x=6.5, z=?, sigma=3.2, n=100

2.T DISTRIBUTION:- SAMPLE SIZE NOT A COSTRAINT

sd=3.5, x=10.37, df=15-1=14

10.37+qt(.025,14)*3.5/sqrt(15) and 10.37-qt(.025,14)*3.5/sqrt(15)

8.431765 and 12.30824

suppose we have a scehe where

if p<5% reject null, if p>5% accept null

MECE-Mutually exclusive, collectively exhaustive

In a hypothesis test, the test statistic Z=-1.86.

1.Find the p-value if the test is a) left-tailed,

a>pnorm(-1.86,0,1) =0.21-reject null

2.In which of these three cases will H0 be rejected at an alpha of 5%

1.Test the null hypothesis at an alpha of 5%

H0: mu >= 2000, HA: mu < 2000

pnorm(1999.6,2000,1.3/sqrt(40))=0.02582635<5%, so reject null hypothesis.

>1.. pnorm(1999.6,2000,1.3/sqrt(40)), p=0.02582635, so reject null.

2.> pnorm(1999.6,2000,1.3/sqrt(20))[1]> p=0.0844043

2.houseprices<-read.table(file.choose(),sep = ",",header = TRUE)

data: houseprices$Math by houseprices$School

Df Sum Sq Mean Sq F value Pr(>F)

p=7.36e-12 ***..reject null

the disease case example:-

calculated on houseprices file:-

data: houseprices$Race and houseprices$Gender

since p>0.05, accept null

"ID" "Gender" "Race" "SEB" "School" "Prog" "Read" "Write"

3. fit1 <- lm(Price ~ LivingArea,data=hp)summary(fit1)

so we are testing null hypothesis if betas are coonected or not:-

first F test(Anova) is to be done.for the model feasibility

Residual standard error: 42660 on 1045 degrees of freedom

since p<0.05 , we reject null..so there is relation between beta values

4. so, regression value of y - mean of y=is the ED value.explidate deviation

fit1 <- lm(Price ~ LivingArea+LotSize+Age,data=houseprices)

Residual standard error: 41200 on 1043 degrees of freedom

2. Normalization>unsupervised learning:- subtract each by mean and divide

univ .sat file import >scale function

next distance between cluster:-complete linkage method

Groc<-read.transactions("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes,

element (itemset/transaction) length distribution:

Min. 1st Qu. Median Mean 3rd Qu. Max.

includes extended item information - examples:

support:-proportion of transaction that an item shows up:-

1. Association rules>market basket analysis:-

#Confidence - measure of proportion of transactions where the presence

Total=100 transactions>bread=20, butter=15 times>9 with bread

so, lift= conf/suppport=(9/20)/(15/100)

rule length distribution (lhs + rhs):sizes

10.37+qt(.025,14)3.5/sqrt(15) and 10.37-qt(.025,14)3.5/sqrt(15)