When we build a regression model, the intercept coefficient (beta0) is associated with a
variable X0 whose value is always 1.
If dummy variables are created for all categories, without leaving one out as the reference
category, those dummy variables always sum to 1.
Ex: If the variable is a location, each person belongs to one and only one location.
So X0, which is always 1, can always be predicted exactly from the sum of all the dummy
variables. This is perfect multicollinearity, which is why one category is kept as the reference.
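The point above can be checked numerically. The following is a minimal sketch on made-up data (the `location` values and all variable names are hypothetical, not from the insurance data): it builds one dummy per category with `model.matrix()`, shows that each row sums to 1, and shows that adding the X0 column leaves the design matrix one rank short.

```r
# Hypothetical toy data: each person belongs to exactly one location
location <- factor(c("north", "south", "east", "north", "east", "south"))

# One dummy column per category, with no reference category dropped
all_dummies <- model.matrix(~ location - 1)

# Every row sums to 1, which is exactly the intercept column X0
row_sums <- rowSums(all_dummies)

# Adding X0 to the full set of dummies gives a rank-deficient design matrix
design <- cbind(X0 = 1, all_dummies)
design_rank <- qr(design)$rank  # one less than the number of columns

print(row_sums)
print(c(columns = ncol(design), rank = design_rank))
```

When a factor is passed to `lm()` with an intercept, R drops one level as the reference by default, so this situation normally arises only when the dummies are built by hand.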
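# rstud holds the studentized residuals of the earlier model (computed before this step)
rstud <- as.data.frame(rstud)
data_with_outliers <- cbind(insurance, rstud)
# Keep only rows whose studentized residual lies strictly between -2 and 2
data_without_outliers <- data_with_outliers[which(data_with_outliers$rstud < 2 & data_with_outliers$rstud > -2), ]
# Drop column 8 (the rstud column, assuming insurance has 7 columns)
insurance_without_outliers <- data_without_outliers[-8]
# Refit the model on the data without outliers
ins_model_new <- lm(charges ~ age + children + bmi + sex + smoker + region,
                    data = insurance_without_outliers)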
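The snippet above presumes that `rstud` already exists. As a self-contained sketch of the same idea, the code below uses R's built-in `mtcars` data as a stand-in for the insurance data and assumes the residuals come from `rstudent()` on the first fitted model; the formula and object names here are illustrative only.

```r
# Stand-in data: mtcars ships with R; the insurance data is not reproduced here
first_model <- lm(mpg ~ wt + hp, data = mtcars)

# Studentized residuals of the first model (assumed source of rstud)
rstud <- rstudent(first_model)

# Keep rows whose studentized residual lies strictly between -2 and 2
keep <- abs(rstud) < 2
mtcars_without_outliers <- mtcars[keep, ]

# Refit the same model on the filtered data
model_new <- lm(mpg ~ wt + hp, data = mtcars_without_outliers)

print(c(before = nrow(mtcars), after = nrow(mtcars_without_outliers)))
```

Filtering on a logical vector avoids having to bind the residuals to the data and drop the extra column afterwards.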
Missing values
Data: less outliers, large dataset
Numeric: mean, median
Categorical: mode
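A minimal sketch of these imputation choices in R, on a made-up data frame (the column names and values are hypothetical): the numeric column is filled with its median, and the categorical column with its mode.

```r
# Hypothetical toy data with missing values in a numeric and a categorical column
df <- data.frame(
  age    = c(25, NA, 40, 31, NA),
  region = c("east", "west", NA, "east", "east"),
  stringsAsFactors = FALSE
)

# Numeric column: replace NA with the median (mean() can be used the same way)
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# Categorical column: replace NA with the mode, the most frequent value
# (base R's mode() returns the storage mode, so the value is taken from table())
region_mode <- names(which.max(table(df$region)))
df$region[is.na(df$region)] <- region_mode

print(df)
```

The median is less sensitive to outliers than the mean, so it is the safer default when the numeric column has extreme values.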
Type of technique by associated variable (categorical, numeric or both), with remarks and assumptions
Decision tree: needs no assumption
Naive Bayes: assumes independent variables
Logistic regression
K-NN classifier: needs no assumption
Regression model: assumption of normality, homoscedasticity, etc.
Clustering: needs no assumption
Stratified sampling in R
mydata<-read.csv("binary.csv",header=TRUE)
mydata$fraud <- as.factor(mydata$fraud)
##Divide into 2 strata
mydata_fraud<-mydata[which(mydata$fraud==1), ]
mydata_notfraud<-mydata[which(mydata$fraud==0), ]
dim(mydata_fraud)
dim(mydata_notfraud)
##random sampling within each stratum (101 fraud rows, 219 not-fraud rows)
index1<-sample(1:nrow(mydata_fraud),101,replace=FALSE)
index2<-sample(1:nrow(mydata_notfraud),219,replace=FALSE)
##fraud stratum: training and testing rows
fraud_random_sample_training<-mydata_fraud[index1, ]
fraud_random_sample_testing<-mydata_fraud[-index1, ]
##not-fraud stratum: training and testing rows
notfraud_random_sample_training<-mydata_notfraud[index2, ]
notfraud_random_sample_testing<-mydata_notfraud[-index2, ]
##training and testing data by stratified sampling
training_data<-rbind(fraud_random_sample_training,notfraud_random_sample_training)
testing_data<-rbind(fraud_random_sample_testing,notfraud_random_sample_testing)
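The code above uses fixed counts per stratum. A possible variation, sketched here on made-up data because binary.csv is not reproduced in these notes, samples the same fraction from each stratum so the class balance carries over automatically. The `toy` data frame, the 0.7 training fraction and the seed are all assumptions for illustration.

```r
set.seed(42)  # assumed seed, only so the split is reproducible

# Hypothetical stand-in for binary.csv: 100 rows, 30 of them fraud
toy <- data.frame(id = 1:100, fraud = factor(rep(c(1, 0), times = c(30, 70))))

train_fraction <- 0.7  # assumed split; the code above uses fixed counts instead

# Split row indices by stratum, then sample the same fraction from each
rows_by_stratum <- split(seq_len(nrow(toy)), toy$fraud)
train_rows <- unlist(lapply(rows_by_stratum, function(rows) {
  sample(rows, size = round(train_fraction * length(rows)), replace = FALSE)
}))

training_data <- toy[train_rows, ]
testing_data  <- toy[-train_rows, ]

# Both sets keep the original 30/70 class balance
print(prop.table(table(training_data$fraud)))
print(prop.table(table(testing_data$fraud)))
```

Sampling a fraction rather than a fixed count means the split does not need to be edited when the number of rows in a stratum changes.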