
class-1:

Deal/proposal/prospect > RFI > RFP > Win/Loss/Abort
decision trees: algorithm = CART (Classification And Regression Trees)

Salesforce.com = info about sales and customers

Inferential statistics-

SRS = simple random sampling

class-2:

x<-2
5->y
x*y
------------------------------braces {}, brackets [], parentheses ()------use a CSV (comma-
delimited) file only
example of a data frame in R - one that has names (characters), numeric columns, etc.

houseprices<-read.table(file.choose(),sep=",",header=TRUE)
type houseprices and hit Enter to print the data frame in R, as below>

> houseprices
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1 16858 1629 1.0 3 0.76 180 0 NA NA NA NA NA

names(houseprices) - see the variable names, as below


Price" "LivingArea" "Bathrooms" "Bedrooms" "LotSize" "Age"
"Fireplace"


head(houseprices,4)
tail(houseprices,5)

-------------
mean(houseprices$Price)>mean of the Price variable
mean(houseprices$Price)
[1] 163862.1
houseprices$Price
[1] 16858 26049 26130 31113 40932 44674 44873 45004 45904 47630 49211
49564 50483
[14] 50709 51183 52360 54210 55817 55851 57678 59003 59043 60238 60309
62105 62938
[27] 63101 63852 64174 64552 65325 65550 65615 66027 67951
------------------------------
median(houseprices$Price)
[1] 151917

getwd() - get the working directory
[1] "C:/Users/IBM_ADMIN/Documents"
setwd("C:\Users\IBM_ADMIN\Desktop\BIG DATA\class notes, videos\data science") - set the
directory: this fails because \ is an escape character in R,
so replace \ with / (use Ctrl+F find-and-replace):

setwd("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes, videos/data science")
------------------------------------------
if the above directory is set, we can write the code below
exp<-read.csv("houseprices.CSV")
head(exp)
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1 16858 1629 1 3 0.76 180 0 NA NA NA NA NA
2 26049 1344 2 3 0.92 13 0 NA NA NA NA NA
3 26130
--------------------------------
Getting help in R
?mean
help("mean")

install packages:-
install.packages("car")
The downloaded binary packages are in
C:\Users\IBM_ADMIN\AppData\Local\Temp\RtmpqQFwT8\downloaded_packages
package minqa successfully unpacked and MD5 sums checked
------------------------------
then, to see all installed packages>
library()
-------------------
to see the car package's documentation>

library(help=car)
Information on package car

Description:

Package: car
Version: 2.1-5
Date: 2017-06-25

----------------------------------------------

Basics of statistics>

population (all learners) > sample (e.g., only the successful learners)


statistic (computed from a sample, e.g., number of defective items) vs parameter (describes the population, e.g., average income); random variable

data types>
quantitative (discrete, continuous) / qualitative

Types of measurement scales:
Nominal (emp id - unique key), Ordinal (rank + id), Interval (Celsius/Fahrenheit), Ratio (absolute zero)
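
a minimal R sketch of how these scale types map onto R objects (variable names made up for illustration):

gender <- factor(c("M","F","M"))                           # nominal: unordered categories
rank <- factor(c("low","high","med"),
               levels=c("low","med","high"), ordered=TRUE) # ordinal: ordered categories
temp_c <- c(21.5, 23.0, 19.8)    # interval: numeric, no absolute zero
income <- c(52000, 61000, 48000) # ratio: numeric with an absolute zero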

------------------------------------------------------------

class-3

1. Descriptive statistics:-

A>ATTACH FILES:-
houseprices<-read.table(file.choose(),sep = ",", header = TRUE)
names(houseprices)
attach(houseprices)
FIND VALUES:-

the table statement gives the frequency of a variable:-


table(houseprices$Fireplace)
table(Fireplace)  # works after attach(houseprices)

table(Fireplace)
Fireplace
0 1
426 621

-----------------------

mean, median, IQR etc>


quantile(Price,0.75)
quantile(Price,0.50)
quantile(Price,0.25)
IQR(Price)
quantile(Price,0.75)-quantile(Price,0.25)
------------------------------------------

Find houses whose Price is greater than the mean price:-

x<-subset(houseprices,Price>mean(Price))
x
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
611 163986 2039 2.5 4 0.52 18 1 NA NA NA NA NA
612 163992 2310 2.5 4 0.45 33 1 NA NA NA NA NA
613 164419 1785 2.5 3 0.16 1 1 NA NA NA NA NA

------------------------------------------------
Find outliers:- values > Q3 + 1.5*IQR (or < Q1 - 1.5*IQR);
alternative rule: outside [mean - t*SD, mean + t*SD]
x<-subset(houseprices,Price>(quantile(Price,.75)+1.5*(IQR(Price))))

step-1
quantile(Price,0.75)+1.5*IQR(Price)
step-2
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))

so, code is:-


quantile(Price,0.75)+1.5*IQR(Price)
subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))

subset(houseprices,Price>quantile(Price,0.75)+1.5*IQR(Price))
Price LivingArea Bathrooms Bedrooms LotSize Age Fireplace X X.1 X.2 X.3 X.4
1031 345364 3308 2.5 4 0.66 7 1 NA NA NA NA NA
1032 347761 3820 4.5 4 1.40 18 1 NA NA NA NA NA

or, for low outliers (note the minus sign):
subset(houseprices,Price<quantile(Price,0.25)-1.5*IQR(Price))
--------------------------------------------------

install.packages("moments")-this is for calculation of skewness


library("moments")
skewness(Price)>0.8749042 = positively skewed
kurtosis(Price)>3.750459 = a normal distribution has kurtosis = 3, so this is more peaked (heavier-tailed) than normal
---------------------------

2. Normal Distribution:-

find the value:-

Mean =494
SD=100
what is the probability of a score between 300 and 600?

pnorm(600,494,100)-pnorm(300,494,100)
=0.8292379

central limit theorem:- the mean of the sample means = the population mean,

and the SD of the sample mean = SD of the population / sqrt(sample size)
---------------------
Suppose during any hour in a large department store, the average number
of shoppers is 448, with a standard deviation of 21 shoppers. What is
the probability that a random sample of 49 different shopping
hours will yield a sample mean between 441 and 446 shoppers?

sd of the population = 21
sd of the sample mean = 21/sqrt(49) = 3

so, pnorm(446,448,3)-pnorm(441,448,3)=0.2426772
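
a quick simulation check of this answer (a sketch; it assumes a normal population, though the CLT makes the answer hold approximately for any population):

set.seed(42)
means <- replicate(100000, mean(rnorm(49, mean=448, sd=21)))
mean(means > 441 & means < 446)   # ~0.24, close to the pnorm answer above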

estimation / standard normal distribution:-


----------------
z = (xbar - mu)/(sigma/sqrt(n)); xbar = sample mean, mu = population mean, and the denominator is the standard error sigma/sqrt(n)
now, if mean = 0 and sd = 1, it's a standard normal
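
for example, the earlier score problem gives the same answer either way:

pnorm(600,494,100)        # using mean and sd arguments directly
pnorm((600-494)/100)      # identical after standardizing to z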

class-4:-
------------

PROB 1. Mercury makes a 2.4 l V-6 engine, the Laser XRi, used in speedboats.
The company's engineers believe that the engine delivers an average power of
220 horsepower and that the standard deviation of power delivered is 15
horsepower. A potential buyer intends to sample 100 engines (each engine to be
run a single time).
What is the probability that the sample mean will be less than 217 horsepower?

pnorm(x,mean,sd/sqrt(n)), n=sample size


pnorm(217,220,1.5)
[1] 0.02275013

interval = xbar +/- z*(sigma/sqrt(n))
problem on a computer services company-

PROB 2. Comcast, the computer services company, is planning to invest heavily in


online television services. As part of the decision, the company wants to
estimate the average no of online shows a family of four would watch per day.
A random sample of n=100 families is obtained, and in this sample the average
no of shows viewed per day is 6.5 and the population standard deviation is
known to be 3.2. Construct a 95% confidence interval for the average no
of online television shows watched by the entire population of families of four
> its a std normal, so mu=0, and sigma=1

interval = xbar +/- z*(sigma/sqrt(n))

Z = (xbar - MU)/(SIGMA/sqrt(N))

xbar = 6.5, z = ?, sigma = 3.2, n = 100

z = qnorm(p,0,1) = qnorm(0.025,0,1)
qnorm(0.025,0,1)
[1] -1.959964

so, z = -1.959964

plug in z as below:-

6.5+qnorm(.025,0,1)*.32
[1] 5.872812
> 6.5-qnorm(.025,0,1)*.32
[1] 7.127188
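
the same 95% interval can be computed in one line (same numbers as above):

6.5 + qnorm(c(.025,.975),0,1)*3.2/sqrt(100)   # [1] 5.872812 7.127188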
-----------------

if 99% confidence,
6.5+qnorm(0.01/2,0,1)*.32
6.5-qnorm(0.01/2,0,1)*.32

5.675735 < mu < 7.324265
(the 99% interval is wider than the 95% one)

2. T DISTRIBUTION:- SAMPLE SIZE NOT A CONSTRAINT

We need this when we don't know sigma of the normal distribution

t = (xbar - mu)/(s/sqrt(n)), with df = n - 1 degrees of freedom
mu = xbar +/- t*(s/sqrt(n))

A stock market analyst wants to estimate the average return on a certain stock.
A random sample of 15 days yields an average (annualized) return of
Xbar=10.37% and a standard deviation of s=3.5%. Assuming a normal population
of returns, give a 95% confidence
interval for the average return on this stock

sd=3.5, x=10.37, df=15-1=14

10.37+qt(.025,14)*3.5/sqrt(15) and 10.37-qt(.025,14)*3.5/sqrt(15)

8.431765 and 12.30824
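
the same interval in one line (same numbers; qt() is vectorized over probabilities):

10.37 + qt(c(.025,.975), df=14)*3.5/sqrt(15)   # [1] 8.431765 12.308235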


Complete Business Statistics by Amir D. Aczel**
Statistics for Management by Levin and Rubin**
Ken Black**

3. Hypothesis testing:-
------------------------

                    H0 TRUE          H0 FALSE
Accept H0           OK               Type II error
(status quo)
Reject H0           Type I error     OK
(go-ahead scenario)

>> we commit a Type II error when we accept a false null hypothesis,
   and a Type I error when we reject a true one

suppose we have a scenario where

H0: mu <= 3000
Ha: mu > 3000

we get a sample mean = 3001, but we can't reject the null as it's not big enough; we need
more evidence from the statistics.
> p-value = probability of a Type I error; 5% = significance level > like the opposite of the confidence level

if p < 5% reject the null; if p > 5% accept the null
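
a hedged sketch of this test in R (sigma and n are NOT given in the notes; the values below are made up purely for illustration):

xbar <- 3001; mu0 <- 3000
sigma <- 100; n <- 50                 # assumed values, not from the notes
z <- (xbar-mu0)/(sigma/sqrt(n))
1 - pnorm(z)                          # one-sided p-value ~0.47 > 0.05, so cannot reject H0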

MECE-Mutually exclusive, collectively exhaustive


null hypothesis > Type I error (reject the null when it is true) and Type II error (accept the null
when it is false).

In a hypothesis test, the test statistic Z=-1.86.

1.Find the p-value if the test is a) left-tailed,


b) right-tailed, and c) two-tailed.

a> pnorm(-1.86,0,1) = 0.0314 < 0.05 - reject null


b> 1 - pnorm(-1.86,0,1) = 0.9686 - can't reject null
c>pnorm(-1.86,0,1) + (1 - pnorm(1.86,0,1))=0.06288553>>can t reject null

2. In which of these three cases will H0 be rejected at an alpha of 5%? (only the left-tailed case, where p = 0.0314 < 0.05)


An automatic bottling machine fills cola into 2 l (2000 cm3) bottles.
A consumer advocate wants to test the null hypothesis that the average amount
filled by the machine into a bottle is at least 2,000 cm3. A random sample
of 40 bottles coming out of the machine was selected and the exact contents of
the selected bottles were recorded. The sample mean was 1,999.6 cm3.
The population standard deviation is known from past experience to be
1.30 cm3.

1.Test the null hypothesis at an alpha of 5%


2.Assume that the population is normally distributed with the same sd
of 1.30 cm3. Assume that the sample size is only 20 but the sample mean
is the same 1,999.6 cm3. Conduct the test once again at an alpha of 5%.
3.If there is a difference in the two test results, explain the reason for
the difference.

H0: mu >= 2000, HA: mu < 2000


N(2000, 1.30/sqrt(40))

pnorm(1999.6,2000,1.3/sqrt(40))=0.02582635<5%, so reject null hypothesis.

--------------------------------------
class-5:-

1. (This repeats the bottling-machine problem from class-4: test H0 at an alpha of 5% with
n=40; then repeat with n=20 and the same sample mean of 1,999.6 cm3; then explain
any difference between the two results.)

1. pnorm(1999.6,2000,1.3/sqrt(40)) > p = 0.02582635 < 0.05, so reject the null.

2. pnorm(1999.6,2000,1.3/sqrt(20)) > p = 0.0844043

3> why the difference? With the larger sample the standard error is smaller, so the same
sample mean of 1,999.6 is stronger evidence against H0.
With n=20 we can't reject the null hypothesis, as p > 0.05.

2. One-sample t-test:-
houseprices<-read.table(file.choose(),sep = ",",header = TRUE)


head(houseprices)

names(houseprices)
t.test(houseprices$Math, mu=54, alternative="two.sided")

>>
data: houseprices$Math
t = -2.0454, df = 199, p-value = 0.04213
alternative hypothesis: true mean is not equal to 54
95 percent confidence interval:
51.33868 53.95132
sample estimates:
mean of x
52.645

3. Paired t-test (two variables):-

t.test(houseprices$Math,houseprices$Science,paired = T)

4. Separate (independent) samples:-
t.test(houseprices$Math~houseprices$School)

data: houseprices$Math by houseprices$School


t = -1.448, df = 45.351, p-value = 0.1545
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.9908645 0.9789598
sample estimates:
mean in group 1 mean in group 2
52.24405 54.75000
5. ANOVA:-

out<-aov(houseprices$Math~as.factor(houseprices$Prog))
summary(out)

Df Sum Sq Mean Sq F value Pr(>F)


as.factor(houseprices$Prog) 2 4002 2001.1 29.28 7.36e-12 ***
Residuals 197 13464 68.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>

p=7.36e-12 ***..reject null

6. Chi-square:-

the disease case example:-

calculate p and test the null H0: there is no association between disease and
exposure.
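
a minimal sketch of that test (the 2x2 counts below are made up, since the notes do not give them):

tab <- matrix(c(30,70,10,90), nrow=2, byrow=TRUE,
              dimnames=list(exposure=c("exposed","unexposed"),
                            disease=c("yes","no")))
chisq.test(tab)   # a small p-value would mean rejecting H0 (no association)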

calculated on houseprices file:-

table(houseprices$Race)
chisq.test(houseprices$Race, houseprices$Gender)
>>
Pearson's Chi-squared test

data: houseprices$Race and houseprices$Gender


X-squared = 3.204, df = 3, p-value = 0.3612

since p>0.05, accept null


---------------------------------------------------------
class-6:-
1. Correlation:-

names(houseprices)

"ID" "Gender" "Race" "SEB" "School" "Prog" "Read" "Write"


[9] "Math" "Science" "SST"

cor(houseprices$Gender,houseprices$School)

2. cor(hp$Price,hp$LivingArea)
head(hp)
dta<-hp[c(1,2,5,6)]
cor(dta)

3. fit1 <- lm(Price ~ LivingArea,data=hp)
summary(fit1)

so we are testing the null hypothesis that the betas are all zero:-

H0: beta1 = beta2 = ... = 0

first the F test (ANOVA) is done, for overall model feasibility.

Only if the p-value < 0.05 for the F test are the individual t tests needed.
----------------------------------------------------------------------

Price = 15875 + 81.883 * LivingArea

Residuals:
Min 1Q Median 3Q Max
-250728 -20832 -1748 18542 215701

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15875.637 3943.034 4.026 6.08e-05 ***
LivingArea 81.883 2.056 39.823 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42660 on 1045 degrees of freedom


Multiple R-squared: 0.6028, Adjusted R-squared: 0.6024
F-statistic: 1586 on 1 and 1045 DF, p-value: < 2.2e-16

since p < 0.05, we reject the null, so at least one coefficient is nonzero:
there is a relation between living area and price.

4. so, (regression estimate of y) - (mean of y) = the explained deviation (ED);
(actual y) - (regression estimate of y) = the unexplained deviation (UED)

ED + UED = total deviation

SSR + SSE = SST

R^2 = SSR/SST
Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1))
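
a quick check of R^2 = SSR/SST on the fit1 model above (assumes fit1 and houseprices are still in the workspace):

y <- houseprices$Price
yhat <- fitted(fit1)
SST <- sum((y-mean(y))^2)   # total sum of squares
SSE <- sum((y-yhat)^2)      # unexplained (residual) sum of squares
SSR <- SST - SSE            # explained (regression) sum of squares
SSR/SST                     # should match Multiple R-squared = 0.6028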

5.

library(car)   # vif() comes from the car package installed earlier

fit1 <- lm(Price ~ LivingArea+LotSize+Age,data=houseprices)
vif(fit1)
summary(fit1)

Call:
lm(formula = Price ~ LivingArea + LotSize + Age, data = houseprices)

Residuals:
Min 1Q Median 3Q Max
-188875 -21980 -3736 16718 227525

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33695.248 4321.689 7.797 1.53e-14 ***
LivingArea 76.863 2.106 36.490 < 2e-16 ***
LotSize 1042.954 1675.971 0.622 0.534
Age -332.895 37.936 -8.775 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41200 on 1043 degrees of freedom


Multiple R-squared: 0.6301, Adjusted R-squared: 0.629
F-statistic: 592.2 on 3 and 1043 DF, p-value: < 2.2e-16
>
-----------------------------------------------------------------------
class-7:-

1. R prog

2. Normalization > unsupervised learning:- subtract each value by the mean and divide
by the SD:

import the universities file (SAT scores etc.) > use the scale() function

input<-read.table(file.choose(),sep=",",header=T)

head(input)

mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])

next, the distance between clusters:- complete linkage method

d <- dist(normalized_data, method = "euclidean")

input<-read.table(file.choose(),sep=",",header=T)
head(input)
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
d <- dist(normalized_data, method = "euclidean")
fit <- hclust(d, method="complete")
plot(fit)

-----------------
mydata<- input[1:25,1:7]
normalized_data<-scale(mydata[,2:7])
groups <- cutree(fit, k=4)   # assumed step, missing from the notes: cut the dendrogram into 4 clusters
membership<-as.matrix(groups)
membership
hclustering<- data.frame(mydata, membership)
hclustering
write.csv(hclustering,"C:/Users/pande_000/Desktop/Simplilearn/cluster2.csv")

fit <- kmeans(normalized_data, 4)
fit$cluster
mydata<- data.frame(mydata, fit$cluster)
mydata
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=mean)
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=max)
aggregate(mydata[,2:7], by=list(fit$cluster), FUN=min)

----------------------------------------------------------
3. Groceries case:-Unstructured data:-association rules

install.packages("arules")
library(arules)

Groc<-read.transactions("C:/Users/IBM_ADMIN/Desktop/BIG DATA/class notes, videos/data science/groceries.csv")
# note: read.transactions also accepts sep=","; the stray commas in the item labels
# below suggest the file was read without it
summary(Groc)

element (itemset/transaction) length distribution:


sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 23
1380 2733 1774 1257 910 601 415 293 166 95 75 44 39 19 11 9
2 3 3 3 1 2

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.000 2.000 3.000 3.635 5.000 23.000

includes extended item information - examples:


labels
1 (appetizer),,,,,,,,,,,,,,,,,,,,,
2 (appetizer),,,,,,,,,,,,,,,,,,,,,,,
3 (appetizer),,,,,

support:- the proportion of transactions in which an item shows up:-

inspect(Groc[1:3])
items
[1] {bread,margarine,ready,
citrus,
fruit,semi-finished,
soups,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
[2] {fruit,yogurt,coffee,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
tropical}
[3] {milk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
whole}

itemFrequencyPlot(Groc,support=.01)   # plot items appearing in at least 1% of transactions
itemFrequencyPlot(Groc,support=.05)

---------------------------------------------------------------
class-8:
---------

1. Association rules>market basket analysis:-

#Confidence - the proportion of transactions where the presence of an item (or itemset)
#results in the presence of another item or set of items

confidence({a,b} => c) = support({a,b,c}) / support({a,b})

Total = 100 transactions > bread appears 20 times, butter 15 times > 9 transactions have
butter alongside bread

so, confidence(bread => butter) = 9/20 = 0.45
and lift = confidence/support(butter) = (9/20)/(15/100) = 3
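
the same arithmetic in R (toy numbers from above):

support_bread <- 20/100
support_butter <- 15/100
support_both <- 9/100
confidence <- support_both/support_bread   # 0.45
confidence/support_butter                  # lift = 3: bread buyers are 3x likelier to buy butter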

>

m1<-apriori(Groc,parameter = list(support=.007,confidence=.25,minlen=2))
summary(m1)
inspect(m1)
>>
summary(m1)
set of 24 rules

rule length distribution (lhs + rhs):sizes


2 3
21 3

Min. 1st Qu. Median Mean 3rd Qu. Max.


2.000 2.000 2.000 2.125 2.000 3.000

summary of quality measures:


support confidence lift
Min. :0.007117 Min. :0.2517 Min. : 2.634
1st Qu.:0.007956 1st Qu.:0.3118 1st Qu.: 5.012
Median :0.012201 Median :0.4086 Median : 8.925
Mean :0.014599 Mean :0.4810 Mean :10.282
3rd Qu.:0.015099 3rd Qu.:0.5257 3rd Qu.:12.628
Max. :0.037417 Max. :1.0000 Max. :26.726

mining info:
data ntransactions support confidence
Groc 9835 0.007 0.25
>

so, the higher the lift > the better the bundling opportunity, since lift = confidence/support(rhs)


lhs support confidence lift
{beer, => {canned} 0.026436197 0.6842105 20.833469
{beer,,,,,} => {bottled} 0.012201322 0.3157895

inspect(sort(m1,by="lift"))

lhs rhs
support confidence
[1] {bakery} => {life}
0.037417387 1.0000000
[2] {life} => {bakery}
0.037417387 1.0000000
[3] {canned} => {beer,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
0.026436197 0.8049536
[4] {beer,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} => {canned}
0.026436197 0.6842105
[5] {milk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} => {whole}
0.012302999 0.9379845
[6] {citrus} => {fruit,tropical}
0.011591256 0.2516556
[7] {fruit,tropical} => {citrus}
0.011591256 0.5816327
[8] {vegetables,other,vegetables,whole} => {fruit,root}
0.007320793 0.4114286
[9] {fruit,root,vegetables,whole} => {veget

----------------------------
2. DBSCAN:- core, border and noise points
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

install.packages("fpc")
library(fpc)
install.packages("dbscan")
library("dbscan")
install.packages("factoextra")
library("factoextra")

res.fpc <- fpc::dbscan(iris[, -5], eps = 0.4, MinPts = 4)  # drop the Species factor column first

---------------------------------------------------
3. supervised learning>classification/Data partitioning:-

age, income, residence = A, I, R; F (= fraud)

70% - train - A,I,R,F > learning algorithm applied: model BUILT

30% - test (validation) - A,I,R,F > apply the model to them (a generic split sketch
appears below, before the credit example)

res.fpc
fviz_cluster(res.fpc, iris[, -5], geom = "point")
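
a generic 70/30 split sketch (assumes a data frame called dat; the credit example below does the same thing with a 900/100 split):

idx <- sample(seq_len(nrow(dat)), size=0.7*nrow(dat))
train <- dat[idx,]    # 70% used to build the model
test <- dat[-idx,]    # 30% held out for validation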

-------------------------------------------

install.packages("class") library("class") credit<-


read.table(file.choose(),sep=",",header=TRUE) head(credit) credit$Default <-
factor(credit$Default) credit$Default credit$history =
factor(credit$history,levels=c("A30","A31","A32","A33","A34")) credit$history
levels(credit$history) = c("good","good","poor","poor","terrible")
levels(credit$history) credit$foreign <- factor(credit$foreign,
levels=c("A201","A202"), labels=c("foreign","german")) credit$rent <-
factor(credit$housing=="A151") credit$rent credit$purpose <-
factor(credit$purpose,levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48
","A49","A410")) levels(credit$purpose) <-
c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz") credit <-
credit[,c("Default","duration","amount","installment","age","history",
"purpose","foreign","rent")] summary(credit) credit[1:3,] summary(credit)
x=credit[,c(2,3,4)] x[,1]=(x[,1]-mean(x[,1]))/sd(x[,1]) x[,2]=(x[,2]-
mean(x[,2]))/sd(x[,2])




train <- sample(1:1000,900) ## this is training set of 900 borrowers
xtrain <- x[train,] # training data set
xnew <- x[-train,] #test data set
ytrain <- credit$Default[train] #factor of true classifications of training set
ynew <- credit$Default[-train] #factor of true classifications of test set
xnew
nearest1 <- knn(train=xtrain, test=xnew, cl=ytrain, k=5)
table(nearest1,credit$Default[-train])
pcorrn1=100*sum(ynew==nearest1)/length(ynew)   # percent correctly classified (here length(ynew)=100)
pcorrn1


-------------------------------------------

4. Bayes' theorem

P(A|B) = P(A and B) / P(B)
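
a toy numeric check (the probabilities are made up):

p_b <- 0.30          # P(B)
p_a_and_b <- 0.12    # P(A and B)
p_a_and_b/p_b        # P(A|B) = 0.4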

Anda mungkin juga menyukai