
How do you specify the reference level of a dummy variable in logistic regression?

Why is a reference level created for dummy variables?


Removing outliers directly within R
Predict function
Missing values
Stratified sampling
Extracting data from databases in R

Why a reference level is required (Dummy Trap)

When we build a model, the intercept β0 is associated with a variable X0 whose value is always 1.
If every category gets its own dummy variable, with no reference level left out, those dummy variables always sum to 1.
Ex: if the variable is a location, each person belongs to one and only one location.
So X0, which is always 1, can always be predicted from the sum of all the dummy variables.

This is a case of perfect multicollinearity: the coefficients cannot be estimated uniquely, and the intercept estimate may have very high variance. To avoid this, we take one category as the reference.

# building the model (with a particular reference level)
mydata_train <- within(mydata_train, house_locality <- relevel(house_locality, ref = 2))


mylogit <- glm(fraud ~ debt + loan_by_annual_income + house_locality, data = mydata_train,
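The dummy trap can be checked numerically. The sketch below uses simulated data (every variable name and value in it is hypothetical) and assumes `family = binomial`, which is the standard way to request logistic regression from `glm()`. It shows that an intercept plus one dummy per category gives a rank-deficient design matrix, and that choosing a reference level with `relevel()` restores full rank.

```r
# Simulated data; all names and values below are hypothetical.
set.seed(1)
n <- 300
locality <- factor(sample(c("north", "south", "east"), n, replace = TRUE))
debt <- rnorm(n)
fraud <- rbinom(n, 1, plogis(-0.5 + 0.8 * debt + 0.6 * (locality == "south")))

# Intercept column plus one dummy per level (no reference dropped).
all_dummies <- model.matrix(~ locality - 1)
X_trap <- cbind(intercept = 1, all_dummies)

# The dummies sum to 1 in every row, so they reproduce the intercept column.
row_sums <- rowSums(all_dummies)
rank_trap <- qr(X_trap)$rank   # one less than the number of columns

# With a reference level, R drops that level's dummy and the matrix has full rank.
locality <- relevel(locality, ref = "south")
X_ref <- model.matrix(~ locality)
rank_ref <- qr(X_ref)$rank

# Logistic regression; coefficients for the other levels are relative to "south".
fit <- glm(fraud ~ debt + locality, family = binomial)
coef_names <- names(coef(fit))
print(c(columns = ncol(X_trap), rank = rank_trap))
print(coef_names)
```

The reference level has no coefficient of its own; its effect is absorbed into the intercept, and each remaining dummy coefficient is read as a difference from that level.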

Removing outliers directly within R

insurance <- read.csv("insurance.csv", header = TRUE)
str(insurance)
ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance)
summary(ins_model)
reg1.inf <- influence.measures(ins_model)
# studentized residuals, one per observation
rstud <- rstudent(ins_model)
rstud

rstud <- as.data.frame(rstud)
data_with_outliers <- cbind(insurance, rstud)
# keep only rows whose studentized residual lies strictly between -2 and 2
data_without_outliers <- data_with_outliers[which(data_with_outliers$rstud < 2 & data_with_outliers$rstud > -2), ]
# drop the appended rstud column (assumed to be column 8) before refitting
insurance_without_outliers <- data_without_outliers[-8]
ins_model_new <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance_without_outliers)
summary(ins_model_new)

Note the model improvement: R² increases by 10%.
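The same trimming step can be tried without the CSV file. This is a minimal sketch that assumes the built-in `mtcars` data as a stand-in for `insurance.csv` and an arbitrary model (`mpg ~ wt + hp`); only the cutoff of 2 on the studentized residuals comes from the code above.

```r
# Built-in mtcars data stands in for insurance.csv (an assumption for illustration).
fit_all <- lm(mpg ~ wt + hp, data = mtcars)

# Studentized residuals; rows with |value| >= 2 are treated as outliers.
rstud <- rstudent(fit_all)
keep <- abs(rstud) < 2

# Refit on the remaining rows.
mtcars_trimmed <- mtcars[keep, ]
fit_trimmed <- lm(mpg ~ wt + hp, data = mtcars_trimmed)

# Compare fit quality before and after trimming.
r2 <- c(all_rows = summary(fit_all)$r.squared,
        trimmed  = summary(fit_trimmed)$r.squared)
print(r2)
print(sum(!keep))   # number of rows dropped
```

Because the dropped rows are exactly the ones the model fits worst, R² tends to rise almost by construction, so the improvement is worth checking against held-out data before relying on it.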

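Predicting values using our linear model

clock_data <- read.csv("clock_data.csv", header = TRUE)
input <- read.csv("input.csv", header = TRUE)
str(clock_data)
## build a model
model <- lm(price ~ age + bidders, data = clock_data)
summary(model)
## predict for the new rows, with prediction intervals and standard errors
predicted_value <- predict(model, newdata = input, interval = "prediction", se.fit = TRUE)
predicted_value
## predicted_value$fit holds the columns fit, lwr and upr
predicted_values <- as.data.frame(predicted_value$fit)
predicted_values$se.fit <- predicted_value$se.fit
output <- cbind(input, predicted_values)
write.csv(output, "output.csv")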
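To see what `predict()` returns for a linear model, here is a small sketch on simulated data. The data values are invented; the column names `price`, `age` and `bidders` simply mirror the model above. With `interval = "prediction"` and `se.fit = TRUE`, the result is a list whose `fit` element is a matrix with columns `fit`, `lwr` and `upr`.

```r
set.seed(42)
# Hypothetical auction data; values are simulated for illustration only.
clock_data <- data.frame(
  age     = runif(40, 100, 200),
  bidders = sample(5:15, 40, replace = TRUE)
)
clock_data$price <- 10 * clock_data$age + 80 * clock_data$bidders + rnorm(40, sd = 100)

model <- lm(price ~ age + bidders, data = clock_data)

# New rows to score; they need the same predictor columns as the model.
input <- data.frame(age = c(120, 150, 180), bidders = c(6, 9, 12))

pred <- predict(model, newdata = input, interval = "prediction", se.fit = TRUE)

# pred$fit is a matrix (fit, lwr, upr); pred$se.fit is a plain vector.
output <- cbind(input, pred$fit, se.fit = pred$se.fit)
print(output)
```

A prediction interval covers a single new observation and is wider than the confidence interval for the mean response, which `interval = "confidence"` would give instead.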

Missing values

Data type      Imputation
Numeric        Mean (fewer outliers, large dataset) or median
Categorical    Mode
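These three simple imputations can be sketched in a few lines of R. The data frame and the helper `stat_mode()` are hypothetical; base R has no built-in function for the statistical mode, so the helper picks the most frequent non-missing value.

```r
# Hypothetical data with missing values.
df <- data.frame(
  income = c(30, 35, 40, NA, 45, 500),   # contains one extreme value
  age    = c(25, NA, 35, 40, 45, 50),
  region = factor(c("north", "south", NA, "south", "east", "south"))
)

# Helper (hypothetical name): most frequent non-missing value.
stat_mode <- function(x) {
  counts <- table(x)               # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Numeric without extreme values: fill with the mean.
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
# Numeric with an outlier: the median is less affected by it.
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
# Categorical: fill with the mode.
df$region[is.na(df$region)] <- stat_mode(df$region)

print(df)
```

Here the mean of `income` would be pulled to 130 by the single value of 500, while the median stays at 40, which is why the median is the safer choice when outliers are present.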

Missing value through prediction

Missing variable   Associated variable   Type of technique                      Remarks / assumptions
Categorical        Categorical           Decision tree, Naive Bayes             Decision tree needs no assumption; Naive Bayes assumes independent variables
Categorical        Numeric               Logistic regression, K-NN classifier   K-NN classifier needs no assumption; regression assumes normality, homoscedasticity, etc.
Numeric            Numeric               Regression model, clustering           Regression assumes normality, homoscedasticity, etc.; clustering needs no assumption
Numeric            Categorical           Clustering                             No assumption
Categorical        Both                  Decision tree                          No assumption
                                         Regression
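As an illustration of one row of this scheme (numeric missing variable, numeric associated variable, regression model), the sketch below fills missing values with predictions from `lm()`. The data are simulated and the variable names `height` and `weight` are hypothetical.

```r
set.seed(7)
# Hypothetical data: weight depends on height; some weights are missing.
height <- rnorm(100, mean = 170, sd = 10)
weight <- 0.9 * height - 90 + rnorm(100, sd = 4)
people <- data.frame(height, weight)
missing_rows <- c(5, 20, 60)
people$weight[missing_rows] <- NA

# Fit on the complete rows only (lm() drops rows with NA by default).
fit <- lm(weight ~ height, data = people)

# Predict the missing weights from the associated variable and fill them in.
to_fill <- is.na(people$weight)
people$weight[to_fill] <- predict(fit, newdata = people[to_fill, ])

print(people[missing_rows, ])
```

The other rows follow the same pattern with a different model: train on the rows where the variable is observed, then predict it for the rows where it is missing.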

Stratified sampling in R

mydata <- read.csv("binary.csv", header = TRUE)
mydata$fraud <- as.factor(mydata$fraud)
## divide into 2 strata
mydata_fraud <- mydata[which(mydata$fraud == 1), ]
mydata_notfraud <- mydata[which(mydata$fraud == 0), ]
dim(mydata_fraud)
dim(mydata_notfraud)
## random sampling within each stratum (101 fraud rows, 219 non-fraud rows)
index1 <- sample(1:nrow(mydata_fraud), 101, replace = FALSE)
index2 <- sample(1:nrow(mydata_notfraud), 219, replace = FALSE)
## fraud stratum: training and testing rows
fraud_random_sample_training <- mydata_fraud[index1, ]
fraud_random_sample_testing <- mydata_fraud[-index1, ]
## non-fraud stratum: training and testing rows
notfraud_random_sample_training <- mydata_notfraud[index2, ]
notfraud_random_sample_testing <- mydata_notfraud[-index2, ]
## training and testing data by stratified sampling
training_data <- rbind(fraud_random_sample_training, notfraud_random_sample_training)
testing_data <- rbind(fraud_random_sample_testing, notfraud_random_sample_testing)
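The per-stratum sample sizes can also be derived from a fraction instead of being typed in by hand. This is a sketch on simulated data standing in for `binary.csv`; the 70% training fraction, the class counts and the `debt` column are assumptions made for illustration.

```r
set.seed(123)
# Hypothetical data standing in for binary.csv: 300 non-fraud and 100 fraud rows.
mydata <- data.frame(
  fraud = factor(rep(c(0, 1), times = c(300, 100))),
  debt  = rnorm(400)
)

# Draw the same fraction from each stratum (70% is an assumed split).
train_fraction <- 0.7
strata <- split(seq_len(nrow(mydata)), mydata$fraud)
train_index <- unlist(lapply(strata, function(rows) {
  # sample.int() on positions avoids sample()'s special case for length-1 input
  rows[sample.int(length(rows), size = round(train_fraction * length(rows)))]
}))

training_data <- mydata[train_index, ]
testing_data  <- mydata[-train_index, ]

# Class proportions should match in both parts.
print(prop.table(table(training_data$fraud)))
print(prop.table(table(testing_data$fraud)))
```

Calling `set.seed()` first makes the split reproducible, and `split()` extends the same code to more than two strata without change.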

Importing data from SQL

If your data is stored in an ODBC (Open Database Connectivity) SQL (Structured Query Language) database such as Oracle, MySQL, PostgreSQL, Microsoft SQL, or SQLite, you can read it into R with the RODBC package.

install.packages("RODBC")
library(RODBC)
# connect using a data source name (DSN)
mydb <- odbcConnect("my_dsn")
# or, if the DSN requires credentials
mydb <- odbcConnect("my_dsn", uid = "my_username", pwd = "my_password")
patient_query <- "select * from patient_data where alive = 1"
patient_data <- sqlQuery(channel = mydb, query = patient_query)
odbcClose(mydb)

patient_data will contain all the rows for patients who are alive.
