MULTIVARIATE ANALYSIS: A CRITIQUE

By:

Abdullah M. Jaubah

Introduction

Several authors have written statistics books with "Multivariate" in the title, yet the scope of
their coverage is very limited. The author disagrees with their view of the scope of multivariate
analysis or statistics. The books on which this critique is based are presented below.

Univariate, bivariate, and multivariate analysis are distinguished by definition. Univariate
analysis is the analysis of one variable. Bivariate analysis is the analysis of two variables.
Multivariate analysis is the analysis of three or more variables.
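The definition above can be written down as a small rule. The following is a minimal Python sketch; the function name `classify_analysis` is hypothetical and introduced only for illustration.

```python
def classify_analysis(n_variables):
    """Label an analysis by the number of variables it involves,
    following the definition used in this critique."""
    if n_variables < 1:
        raise ValueError("an analysis needs at least one variable")
    if n_variables == 1:
        return "univariate"
    if n_variables == 2:
        return "bivariate"
    return "multivariate"  # three or more variables


for count in (1, 2, 5):
    print(count, classify_analysis(count))
```

Under this rule, any procedure that involves three or more variables counts as multivariate, which is the criterion applied throughout this critique.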

Books on Multivariate Analysis

J. Supranto (2004) wrote Analisis Multivariat : Arti & Interpretasi. The book covers an
introduction, analysis, multiple linear regression, discriminant analysis, factor analysis, cluster
analysis, multidimensional scaling and conjoint analysis, structural equation models and path
analysis, an interdependence model of the dimensions of satisfaction, structural equations with
latent variables, and worked examples of factor analysis.

Imam Ghozali (2006) wrote Aplikasi Analisis Multivariate Dengan SPSS. The book covers
measurement scales and data analysis methods, an introduction to the SPSS program, applications
of descriptive statistics and the t-test for differences, the reliability and validity of a construct or
concept, the t-test for differences, analysis of variance, analysis of covariance, and multivariate
analysis of variance (MANOVA), regression analysis, classical assumption tests, regression with
classical assumption tests, dummy variables and the Chow test, regression models with functional
forms, regression analysis with moderating and intervening variables, discriminant analysis,
logistic regression, canonical correlation, conjoint analysis, factor analysis, and cluster analysis.

Ali Baroroh (2013) wrote Analisis Multivariat dan Time Series dengan SPSS 21. The book covers
linear regression analysis, logistic regression analysis, discriminant analysis, factor analysis,
cluster analysis, and time series analysis.

Singgih Santoso (2014) wrote Statistik Multivariat : Konsep dan Aplikasi dengan SPSS, Edisi
Revisi. The book covers an introduction to multivariate statistics, data testing, factor analysis,
cluster analysis, discriminant analysis, MANOVA, canonical correlation, conjoint analysis,
multidimensional scaling, and correspondence analysis.

Subhash Sharma (1996) wrote Applied Multivariate Techniques. The book covers an introduction,
geometric concepts of data manipulation, fundamentals of data manipulation, principal components
analysis, factor analysis, confirmatory factor analysis, cluster analysis, two-group discriminant
analysis, multiple-group discriminant analysis, logistic regression, multivariate analysis of
variance, assumptions, canonical correlation, and covariance structure models.

The author disagrees with J. Supranto, Singgih Santoso, Ali Baroroh, Imam Ghozali, and Subhash
Sharma about the scope of multivariate analysis or statistics. Their treatment is inconsistent with
the definition of multivariate analysis as the analysis of three or more variables.

The scope of multivariate analysis is far broader than what they cover. Their coverage reflects the
traditional treatment, even though three of the authors use SPSS.

In the author's view, the scope of multivariate analysis includes, among others, the following:

A. Statistics Base
Linear models
Linear regression
Ordinal Regression
2-Stage Least Squares
Partial Least Squares Regression
Nearest Neighbor Analysis
Discriminant Analysis
Factor Analysis
TwoStep Cluster Analysis
Hierarchical Cluster Analysis
K-Means Cluster Analysis
Multiple Response Analysis
Select Predictors
B. Advanced Statistics
Multivariate General Linear Modeling
Variance Components
Linear Mixed Models
Generalized Linear Models
Generalized linear mixed models
Loglinear Modeling
Life Tables
Kaplan-Meier Survival Analysis
Cox Regression
C. Categories
Categorical Regression
Categorical Principal Components Analysis
Nonlinear Canonical Correlation Analysis
Correspondence analysis
Multiple Correspondence Analysis
Multidimensional Scaling
Multidimensional Unfolding
D. Complex Samples
Planning for Complex Samples
Complex Samples Sampling Wizard
Complex Samples Analysis Preparation Wizard
Complex Samples Analysis Procedures: Tabulation
Complex Samples Analysis Procedures: Descriptives
Complex Samples Frequencies
Complex Samples Descriptives
Complex Samples Crosstabs
Complex Samples Ratios
Complex Samples General Linear Model
Complex Samples Logistic Regression
Complex Samples Ordinal Regression
Complex Samples Cox Regression
E. Conjoint
Conjoint Analysis
F. Decision Trees
Data assumptions and requirements
Using Decision Trees to Evaluate Credit Risk
Building a Scoring Model
Missing Values in Tree Models
G. Direct Marketing
RFM Analysis from Transaction Data
Cluster analysis
Prospect profiles
Postal code response rates
Propensity to purchase
Control package test
H. Multiple Imputation
Multiple Imputation
I. Neural Networks
Multilayer Perceptron
Radial Basis Function
J. Regression
Multinomial Logistic Regression
Nonlinear Regression
Probit Analysis
Weight Estimation
Two-Stage Least-Squares Regression
K. Forecasting
Bulk Forecasting with the Expert Modeler
Bulk Reforecasting by Applying Saved Models
Using the Expert Modeler to Determine Significant Predictors
Experimenting with Predictors by Applying Saved Models
Seasonal Decomposition
Spectral Plots
L. IBM SPSS Amos 22
Structural Equation Modeling
Confirmatory Factor Analysis

Subhash Sharma groups data analysis methods into dependence methods, interdependence
methods, and structural models. Dependence methods cover three cases: one dependent variable
with one independent variable; one dependent variable with more than one independent
variable; and more than one dependent variable with one or more independent variables.
Interdependence methods are divided according to whether the data are metric or nonmetric
(1996: 5-6).

Sharma then details the techniques for one dependent variable and for more than one
dependent variable. A single dependent variable may be metric or nonmetric. For a single
metric dependent variable the techniques are regression, the t-test, multiple regression, and
ANOVA. For a single nonmetric dependent variable they are discriminant analysis, logistic
regression, discrete discriminant analysis, and conjoint analysis (MONANOVA). When there
is more than one dependent variable, these may again be metric or nonmetric. For more than
one metric dependent variable the techniques are canonical correlation and MANOVA
(multivariate analysis of variance). For more than one nonmetric dependent variable they are
multiple-group discriminant analysis (MDA) and discrete MDA. Structural models involve
latent or unobservable constructs and consist of measurement models and structural models,
estimated with packages such as LISREL, SAS, or EQS. Sharma finally states that his book
covers only the following multivariate techniques: principal components analysis, factor
analysis, confirmatory factor analysis, cluster analysis, two-group discriminant analysis,
multiple-group discriminant analysis, logistic regression, MANOVA, canonical correlation,
and structural models, because he considers it impossible to cover every multivariate
technique. This means that there are more multivariate techniques than these 10.
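The grouping of dependence techniques summarized above can be held in a small lookup table. The Python sketch below only transcribes the summary given in this critique; it has not been checked against Sharma's book, and the names `DEPENDENCE_TECHNIQUES` and `candidate_techniques` are hypothetical.

```python
# Dependence techniques as summarized in the text, keyed by
# (number of dependent variables, measurement scale of the dependent variable).
DEPENDENCE_TECHNIQUES = {
    ("one", "metric"): [
        "regression", "t-test", "multiple regression", "ANOVA",
    ],
    ("one", "nonmetric"): [
        "discriminant analysis", "logistic regression",
        "discrete discriminant analysis", "conjoint analysis (MONANOVA)",
    ],
    ("more than one", "metric"): [
        "canonical correlation", "MANOVA",
    ],
    ("more than one", "nonmetric"): [
        "multiple-group discriminant analysis (MDA)", "discrete MDA",
    ],
}


def candidate_techniques(n_dependent, scale):
    """Return the techniques listed for a given number of dependent
    variables and a scale of 'metric' or 'nonmetric'."""
    if n_dependent < 1:
        raise ValueError("dependence methods need at least one dependent variable")
    count = "one" if n_dependent == 1 else "more than one"
    return DEPENDENCE_TECHNIQUES[(count, scale)]


print(candidate_techniques(2, "metric"))
```

The table has only four cells, which makes the point of the critique visible: many procedures that use three or more variables have no place in it.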

J. Supranto likewise divides multivariate analysis into dependence methods and
interdependence methods. Dependence methods are split into those with one dependent
variable and those with more than one dependent variable. Interdependence methods are split
into those that focus on variables and those that focus on objects. Dependence methods with
one dependent variable include ANOVA, ANCOVA, multiple regression, discriminant
analysis, and conjoint analysis. Dependence methods with more than one dependent variable
include MANOVA, MANCOVA, and canonical correlation. Interdependence methods that
focus on variables include factor analysis; those that focus on objects include cluster analysis
and multidimensional scaling. J. Supranto also discusses structural equation modeling and
confirmatory factor analysis, but very unclearly, because he does not report goodness-of-fit
measures.

Imam Ghozali (2006: 6-9) also discusses data analysis methods, dividing them into
dependence methods and interdependence methods.

The question that arises for multivariate analysis or statistics is whether grouping data
analysis methods into dependence methods and interdependence methods is needed at all.

Sharma, after discussing data analysis methods, ends by naming only the 10 techniques
covered in his book.

Why is Two-Stage Least Squares not covered in multivariate analysis or statistics?

Why are confirmatory factor analysis and structural equation modeling not treated as
multivariate analysis in the three SPSS books?

The coverage of multivariate analysis in the five books is very limited compared with the
scope of multivariate analysis implied by the procedures available in SPSS. Why do the three
SPSS books above not cover confirmatory factor analysis and structural equation modeling?

SPSS 22 integrates SPSS and Amos, so Amos can be run from within SPSS. Why do Ali
Baroroh, Singgih Santoso, and Imam Ghozali not discuss confirmatory factor analysis and
structural equation modeling? Singgih Santoso has written a book on Amos, and so has Imam
Ghozali, yet neither author includes it in multivariate analysis or statistics.

An example of Two-Stage Least Squares syntax is as follows:

***********************************************
***** Abdullah M. Jaubah
***********************************************

GET
  FILE='D:\ADA\2SLS.sav'.

2SLS DEMAND WITH PRICE, INCOME
  /PRICE WITH DEMAND, RAINFALL, LAGPRICE
  /INSTRUMENTS=INCOME, RAINFALL, LAGPRICE.

Running this syntax produces the following results:

Two-stage Least Squares Analysis

Model Description
                         Type of Variable
Equation 1   demand      dependent
             Price       predictor
             Income      predictor & instrumental
             Rainfall    instrumental
             Lagprice    instrumental
Equation 2   Price       dependent
             demand      predictor
             Rainfall    predictor & instrumental
             Lagprice    predictor & instrumental
             Income      instrumental
MOD_14

Model Summary
Equation 1   Multiple R                    ,778
             R Square                      ,606
             Adjusted R Square             ,579
             Std. Error of the Estimate   2,430
Equation 2   Multiple R                    ,991
             R Square                      ,982
             Adjusted R Square             ,980
             Std. Error of the Estimate    ,478

ANOVA
                          Sum of Squares   df   Mean Square         F   Sig.
Equation 1   Regression          263,425    2       131,712    22,304      0
             Residual            171,252   29         5,905
             Total               434,677   31
Equation 2   Regression          350,021    3       116,674   510,642      0
             Residual              6,398   28         0,228
             Total               356,419   31
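The Model Summary and ANOVA tables are linked by simple arithmetic, which gives a quick consistency check on the output. The Python sketch below recomputes the mean squares, F, R Square, and Adjusted R Square from the printed sums of squares, written with decimal points instead of the decimal commas in the SPSS output. The function name `regression_summary` is hypothetical, and because the inputs are rounded to three decimals, the recomputed values may differ from the printed ones in the last digits.

```python
def regression_summary(ss_regression, df_regression, ss_residual, df_residual):
    """Recompute mean squares, F, R Square and Adjusted R Square
    from the sums of squares and degrees of freedom of an ANOVA table."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    r_square = ss_regression / ss_total
    adjusted = 1 - (1 - r_square) * df_total / df_residual
    return {
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f": ms_regression / ms_residual,
        "r_square": r_square,
        "adjusted_r_square": adjusted,
    }


# Values copied from the ANOVA table above.
equation_1 = regression_summary(263.425, 2, 171.252, 29)
equation_2 = regression_summary(350.021, 3, 6.398, 28)

for name, result in (("Equation 1", equation_1), ("Equation 2", equation_2)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```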

Coefficients
                          Unstandardized Coefficients
                               B   Std. Error    Beta       t   Sig.
Equation 1   (Constant)    7,180        6,529           1,100   ,280
             Price          ,719         ,276    ,653   2,610   ,014
             Income         ,016         ,027    ,148    ,596   ,556
Equation 2   (Constant)     ,226        2,008            ,113   ,911
             demand         ,076         ,139    ,084    ,550   ,587
             Rainfall       ,002         ,041    ,002    ,039   ,969
             Lagprice       ,449         ,064    ,925   7,011   ,000

Coefficient Correlations
Equation 1   Correlations               Price   Income
                            Price           1   -0,882
                            Income     -0,882        1
Equation 2   Correlations              demand   Rainfall   Lagprice
                            demand          1      0,031     -0,912
                            Rainfall    0,031          1     -0,391
                            Lagprice   -0,912     -0,391          1

In the first equation of the example above, demand is the dependent variable, price is a
predictor, income is both a predictor and an instrument, and rainfall and lagprice are
instruments. Five variables are used in this example, so why is an example like this not
covered in multivariate analysis or statistics?
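The two stages behind the 2SLS command can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SPSS implementation: the data are invented, the function names (`solve`, `ols`, `two_stage_least_squares`) are hypothetical, and demand is generated without noise from known coefficients so that the result can be checked exactly. The first stage regresses every predictor on the instruments; the second stage regresses demand on the fitted predictors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients of y on the columns of rows,
    obtained from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def two_stage_least_squares(y, x_rows, z_rows):
    """Stage 1: fit each predictor column on the instruments.
    Stage 2: regress y on the fitted predictor columns."""
    fitted_columns = []
    for j in range(len(x_rows[0])):
        beta = ols(z_rows, [row[j] for row in x_rows])
        fitted_columns.append(
            [sum(b * z for b, z in zip(beta, z_row)) for z_row in z_rows]
        )
    x_hat = [list(row) for row in zip(*fitted_columns)]
    return ols(x_hat, y)


# Invented data; the variable names mirror the SPSS example only for readability.
income = [10, 12, 11, 15, 14, 18, 17, 20]
rainfall = [3, 1, 4, 1, 5, 9, 2, 6]
lagprice = [5, 6, 5, 7, 8, 7, 9, 10]
price = [6, 5, 7, 8, 7, 10, 9, 11]
# Noise-free demand equation with assumed coefficients 7, 0.7 and 0.02.
demand = [7.0 + 0.7 * p + 0.02 * i for p, i in zip(price, income)]

x_rows = [[1.0, p, i] for p, i in zip(price, income)]            # constant, price, income
z_rows = [[1.0, i, r, l] for i, r, l in zip(income, rainfall, lagprice)]  # instruments

coefficients = two_stage_least_squares(demand, x_rows, z_rows)
print([round(c, 6) for c in coefficients])
```

The constant and income appear in both the predictor and instrument lists, which corresponds to the "predictor & instrumental" role in the Model Description table. The sketch returns coefficients only; standard errors taken from a naive second-stage regression would not be correct for 2SLS, so they are not computed here.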

The scope listed above assumes that every analysis uses three or more variables. Measured
against the scope of SPSS, the share of multivariate analysis or statistics covered by the
authors who use SPSS is very small.

An example of using Amos can be given here. Amos models can be programmed in Visual
Basic, which means that once a path diagram has been drawn accurately, the next step is to
generate the corresponding Visual Basic syntax. Two examples from Amos 22 are used here.
Example Ex05-a serves as an example of structural equation modeling in Amos.

Its path diagram consists of three exogenous latent variables with six indicators, and one
endogenous latent variable with two indicators. The corresponding Amos syntax in Visual
Basic is as follows:

#Region "Header"
Imports System
Imports System.Diagnostics
Imports Microsoft.VisualBasic
Imports AmosEngineLib
Imports AmosGraphics
Imports AmosEngineLib.AmosEngine.TMatrixID
Imports PBayes
#End Region
Module MainModule
    Public Sub Main()
        Dim Sem As AmosEngine
        Sem = New AmosEngine
        Sem.Title("Example 5, Model A:" _
            & vbCrLf & "Regression with unobserved variables" _
            & vbCrLf & "" _
            & vbCrLf & "Using data from the Warren, White and" _
            & vbCrLf & "Fuller (1974) study of job performance" _
            & vbCrLf & "of farm managers.")
        Sem.TextOutput
        AnalysisProperties(Sem)
        ModelSpecification(Sem)
        Sem.FitAllModels()
        Sem.Dispose()
    End Sub

    Sub ModelSpecification(Sem As AmosEngine)
        Sem.GenerateDefaultCovariances(False)
        Sem.BeginGroup("C:\AMOS 5\Examples\Warren9v.wk1", "Warren9v")
        Sem.GroupName("Group number 1")

        ' Error terms of the indicators, weights fixed at 1
        Sem.Path("1knowledge", "error3", 1)
        Sem.Path("2knowledge", "error4", 1)
        Sem.Path("1value", "error5", 1)
        Sem.Path("2value", "error6", 1)
        Sem.Path("1satisfaction", "error7", 1)
        Sem.Path("2satisfaction", "error8", 1)

        ' Indicators of the exogenous latent variables
        Sem.Path("2satisfaction", "satisfaction")
        Sem.Path("1satisfaction", "satisfaction", 1)
        Sem.Path("2value", "value")
        Sem.Path("1value", "value", 1)
        Sem.Path("2knowledge", "knowledge")
        Sem.Path("1knowledge", "knowledge", 1)

        ' Structural paths to the endogenous latent variable
        Sem.Path("performance", "knowledge")
        Sem.Path("performance", "satisfaction")
        Sem.Path("performance", "value")
        Sem.Cov("value", "knowledge")
        Sem.Cov("satisfaction", "value")

        ' Indicators and error terms of the endogenous latent variable
        Sem.Path("1performance", "performance", 1)
        Sem.Path("2performance", "performance")
        Sem.Path("1performance", "error1", 1)
        Sem.Path("2performance", "error2", 1)
        Sem.Path("performance", "error9", 1)
        Sem.Cov("satisfaction", "knowledge")

        Sem.Model("Default model", "")
    End Sub

    Sub AnalysisProperties(Sem As AmosEngine)
        Sem.Iterations(50)
        Sem.InputUnbiasedMoments
        Sem.FitMLMoments
        Sem.Standardized
        Sem.Smc
        Sem.Seed(1)
    End Sub
End Module

Running the syntax above produces the following results:

Analysis Summary

Date and Time

Date: 06 Juni 2017


Time: 23:22:14

Title

Example 5, Model A: Regression with unobserved variables Using data from the Warren,
White and Fuller (1974) study of job performance of farm managers.

Notes for Group (Group number 1)

The model is recursive.

Sample size = 98

Variable Summary (Group number 1)

Your model contains the following variables (Group number 1)

Observed, endogenous variables


1knowledge
2knowledge
1value
2value
1satisfaction
2satisfaction
1performance
2performance
Unobserved, endogenous variables
performance
Unobserved, exogenous variables
error3
error4
error5
error6
error7
error8
satisfaction
value
knowledge
error1
error2
error9

Variable counts (Group number 1)

Number of variables in your model: 21


Number of observed variables: 8
Number of unobserved variables: 13
Number of exogenous variables: 12
Number of endogenous variables: 9

Parameter Summary (Group number 1)

Weights Covariances Variances Means Intercepts Total


Fixed 13 0 0 0 0 13
Labeled 0 0 0 0 0 0
Unlabeled 7 3 12 0 0 22
Total 20 3 12 0 0 35

Notes for Model (Default model)

Computation of degrees of freedom (Default model)

Number of distinct sample moments: 36


Number of distinct parameters to be estimated: 22
Degrees of freedom (36 - 22): 14

Result (Default model)

Minimum was achieved


Chi-square = 10,335
Degrees of freedom = 14
Probability level = ,737
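The degrees of freedom reported above follow from counting. With p observed variables and no means or intercepts in the model, there are p(p + 1)/2 distinct sample variances and covariances, and the degrees of freedom are that count minus the number of free parameters. A minimal Python sketch, assuming a single group and a covariance-only model (the function names are hypothetical):

```python
def distinct_sample_moments(n_observed):
    """Number of distinct variances and covariances among the
    observed variables when means are not modeled."""
    return n_observed * (n_observed + 1) // 2


def degrees_of_freedom(n_observed, n_free_parameters):
    """Distinct sample moments minus the parameters to be estimated."""
    return distinct_sample_moments(n_observed) - n_free_parameters


# Example 5, Model A: 8 observed variables, 22 free parameters.
print(distinct_sample_moments(8), degrees_of_freedom(8, 22))
```

For this model the count is 8 x 9 / 2 = 36 moments and 36 - 22 = 14 degrees of freedom, matching the Amos output.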

Estimates (Group number 1 - Default model)

Scalar Estimates (Group number 1 - Default model)

Maximum Likelihood Estimates

Regression Weights: (Group number 1 - Default model)

Estimate S.E. C.R. P Label


performance <--- knowledge ,337 ,125 2,697 ,007
performance <--- satisfaction ,061 ,054 1,127 ,260
performance <--- value ,176 ,079 2,225 ,026
2satisfaction <--- satisfaction ,792 ,438 1,806 ,071
1satisfaction <--- satisfaction 1,000
2value <--- value ,763 ,185 4,128 ***
1value <--- value 1,000
2knowledge <--- knowledge ,683 ,161 4,252 ***
1knowledge <--- knowledge 1,000
1performance <--- performance 1,000
2performance <--- performance ,867 ,116 7,450 ***

Standardized Regression Weights: (Group number 1 - Default model)

Estimate
performance <--- knowledge ,516
performance <--- satisfaction ,130
performance <--- value ,398
2satisfaction <--- satisfaction ,747
1satisfaction <--- satisfaction ,896
2value <--- value ,633
1value <--- value ,745
2knowledge <--- knowledge ,618
1knowledge <--- knowledge ,728
1performance <--- performance ,856
2performance <--- performance ,819

Covariances: (Group number 1 - Default model)

Estimate S.E. C.R. P Label


value <--> knowledge ,037 ,012 3,036 ,002
satisfaction <--> value -,008 ,013 -,610 ,542
satisfaction <--> knowledge ,004 ,009 ,462 ,644

Correlations: (Group number 1 - Default model)

Estimate
value <--> knowledge ,542
satisfaction <--> value -,084
satisfaction <--> knowledge ,064

Variances: (Group number 1 - Default model)

Estimate S.E. C.R. P Label


satisfaction ,090 ,052 1,745 ,081
value ,100 ,032 3,147 ,002
knowledge ,046 ,015 3,138 ,002
error9 ,007 ,003 2,577 ,010
error3 ,041 ,011 3,611 ***
error4 ,035 ,007 5,167 ***
error5 ,080 ,025 3,249 ,001
error6 ,087 ,018 4,891 ***
error7 ,022 ,049 ,451 ,652
error8 ,045 ,032 1,420 ,156
error1 ,007 ,002 3,110 ,002
error2 ,007 ,002 3,871 ***

Squared Multiple Correlations: (Group number 1 - Default model)

Estimate
performance ,663
2performance ,671
1performance ,732
2satisfaction ,558
1satisfaction ,802
2value ,401
1value ,556
2knowledge ,381
1knowledge ,529

Minimization History (Default model)

Iteration        Negative      Condition #   Smallest     Diameter         F   NTries      Ratio
                 eigenvalues                 eigenvalue
0           e    7                           -,190        9999,000   292,665        0   9999,000
1           e*   2                           -,048           1,839    78,769       20       ,806
2           e    1                           -,069            ,628    24,913        5       ,808
3           e    1                           -,005            ,505    13,473        6       ,664
4           e    0             442,497                        ,358    10,619        7       ,952
5           e    0             217,001                        ,150    10,343        1      1,037
6           e    0             301,865                        ,017    10,335        1      1,007
7           e    0             302,254                        ,003    10,335        1       ,997
8           e    0             302,509                        ,000    10,335        1      1,000

Model Fit Summary

CMIN

Model NPAR CMIN DF P CMIN/DF


Default model 22 10,335 14 ,737 ,738
Saturated model 36 ,000 0
Independence model 8 243,768 28 ,000 8,706

RMR, GFI

Model RMR GFI AGFI PGFI


Default model ,003 ,975 ,935 ,379
Saturated model ,000 1,000
Independence model ,023 ,570 ,447 ,443

Baseline Comparisons

Model                NFI      RFI     IFI      TLI     CFI
                     Delta1   rho1    Delta2   rho2
Default model        ,958     ,915    1,016    1,034   1,000
Saturated model      1,000            1,000            1,000
Independence model   ,000     ,000    ,000     ,000    ,000

Parsimony-Adjusted Measures

Model PRATIO PNFI PCFI


Default model ,500 ,479 ,500
Saturated model ,000 ,000 ,000
Independence model 1,000 ,000 ,000

NCP

Model NCP LO 90 HI 90
Default model ,000 ,000 7,102
Saturated model ,000 ,000 ,000
Independence model 215,768 169,584 269,424

FMIN

Model FMIN F0 LO 90 HI 90
Default model ,107 ,000 ,000 ,073
Saturated model ,000 ,000 ,000 ,000
Independence model 2,513 2,224 1,748 2,778

RMSEA

Model RMSEA LO 90 HI 90 PCLOSE


Default model ,000 ,000 ,072 ,877
Independence model ,282 ,250 ,315 ,000

AIC

Model AIC BCC BIC CAIC


Default model 54,335 58,835 111,204 133,204
Saturated model 72,000 79,364 165,059 201,059
Independence model 259,768 261,404 280,447 288,447

ECVI

Model ECVI LO 90 HI 90 MECVI


Default model ,560 ,598 ,671 ,607
Saturated model ,742 ,742 ,742 ,818
Independence model 2,678 2,202 3,231 2,695

HOELTER

Model                HOELTER .05   HOELTER .01
Default model        223           274
Independence model   17            20

Execution time summary

Minimization: ,188
Miscellaneous: 1,138
Bootstrap: ,000
Total: 1,326

Confirmatory Factor Analysis Example

Visual Basic Syntax

#Region "Header"
Imports System
Imports System.Diagnostics
Imports Microsoft.VisualBasic
Imports AmosEngineLib
Imports AmosGraphics
Imports AmosEngineLib.AmosEngine.TMatrixID
Imports PBayes
#End Region
Module MainModule
    Public Sub Main()
        Dim Sem As AmosEngine
        Sem = New AmosEngine
        Sem.Title("Example 8:" _
            & vbCrLf & "Factor analysis" _
            & vbCrLf & "" _
            & vbCrLf & "Holzinger and Swineford (1939) Grant-White sample." _
            & vbCrLf & "Intelligence factor study. Raw data of 73 female" _
            & vbCrLf & "students from the Grant-White high school, Chicago.")
        Sem.TextOutput
        AnalysisProperties(Sem)
        ModelSpecification(Sem)
        Sem.FitAllModels()
        Sem.Dispose()
    End Sub

    Sub ModelSpecification(Sem As AmosEngine)
        Sem.GenerateDefaultCovariances(False)
        Sem.BeginGroup("C:\AMOS 5\Examples\Grnt_fem.sav", "Grnt_fem")
        Sem.GroupName("Group number 1")

        ' Indicators of the two factors, first weight of each fixed at 1
        Sem.Path("visperc", "spatial", 1)
        Sem.Path("cubes", "spatial")
        Sem.Path("lozenges", "spatial")
        Sem.Path("paragrap", "verbal", 1)
        Sem.Path("sentence", "verbal")
        Sem.Path("wordmean", "verbal")

        ' Error terms, weights fixed at 1
        Sem.Path("visperc", "err_v", 1)
        Sem.Path("cubes", "err_c", 1)
        Sem.Path("lozenges", "err_l", 1)
        Sem.Path("paragrap", "err_p", 1)
        Sem.Path("sentence", "err_s", 1)
        Sem.Path("wordmean", "err_w", 1)

        ' Covariance between the two factors
        Sem.Cov("spatial", "verbal")

        Sem.Model("Default model", "")
    End Sub

    Sub AnalysisProperties(Sem As AmosEngine)
        Sem.Iterations(50)
        Sem.InputUnbiasedMoments
        Sem.FitMLMoments
        Sem.Standardized
        Sem.Smc
        Sem.Seed(1)
    End Sub
End Module

Running the Visual Basic syntax produces the following results:

Analysis Summary

Date and Time

Date: 07 Juni 2017


Time: 0:21:18

Title

Example 8: Factor analysis Holzinger and Swineford (1939) Grant-White sample.


Intelligence factor study. Raw data of 73 female students from the Grant-White high school,
Chicago.

Number of variables in your model: 14
Number of observed variables: 6
Number of unobserved variables: 8
Number of exogenous variables: 8
Number of endogenous variables: 6

Weights Covariances Variances Means Intercepts Total


Fixed 8 0 0 0 0 8
Labeled 0 0 0 0 0 0
Unlabeled 4 1 8 0 0 13
Total 12 1 8 0 0 21

Number of distinct sample moments: 21


Number of distinct parameters to be estimated: 13
Degrees of freedom (21 - 13): 8

Estimate S.E. C.R. P Label


visperc <--- spatial 1,000
cubes <--- spatial ,610 ,143 4,250 ***
lozenges <--- spatial 1,198 ,272 4,405 ***
paragrap <--- verbal 1,000
sentence <--- verbal 1,334 ,160 8,322 ***
wordmean <--- verbal 2,234 ,263 8,482 ***

Estimate
visperc <--- spatial ,703
cubes <--- spatial ,654
lozenges <--- spatial ,736
paragrap <--- verbal ,880
sentence <--- verbal ,827
wordmean <--- verbal ,841

Estimate S.E. C.R. P Label


spatial <--> verbal 7,315 2,571 2,846 ,004

Estimate
spatial <--> verbal ,487

Estimate S.E. C.R. P Label


spatial 23,302 8,123 2,868 ,004
verbal 9,682 2,159 4,485 ***
err_v 23,873 5,986 3,988 ***
err_c 11,602 2,584 4,490 ***
err_l 28,275 7,892 3,583 ***
err_p 2,834 ,868 3,263 ,001
err_s 7,967 1,869 4,263 ***
err_w 19,925 4,951 4,024 ***

Estimate
wordmean ,708
sentence ,684
paragrap ,774
lozenges ,542
cubes ,428
visperc ,494
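In a factor model where each indicator loads on a single factor, the squared multiple correlation of an indicator equals the square of its standardized loading. The Python sketch below checks this for the standardized estimates and squared multiple correlations printed above, copied with decimal points instead of decimal commas. The dictionary names are hypothetical, and agreement is only expected up to the three-decimal rounding of the output.

```python
# Standardized loadings and squared multiple correlations copied
# from the Amos output above (decimal commas written as points).
standardized_loadings = {
    "visperc": 0.703, "cubes": 0.654, "lozenges": 0.736,
    "paragrap": 0.880, "sentence": 0.827, "wordmean": 0.841,
}
reported_smc = {
    "visperc": 0.494, "cubes": 0.428, "lozenges": 0.542,
    "paragrap": 0.774, "sentence": 0.684, "wordmean": 0.708,
}

# Squaring each loading gives the share of indicator variance
# explained by its factor.
implied_smc = {name: loading ** 2 for name, loading in standardized_loadings.items()}

for name in standardized_loadings:
    print(name, round(implied_smc[name], 3), reported_smc[name])
```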

Iteration        Negative      Condition #   Smallest     Diameter         F   NTries      Ratio
                 eigenvalues                 eigenvalue
0           e    4                           -,394        9999,000   197,996        0   9999,000
1           e*   1                           -,025           1,788    58,475       20       ,599
2           e    0             89,924                         ,472    23,137        6       ,831
3           e    0             38,120                         ,694    10,311        2       ,000
4           e    0             32,549                         ,381     8,112        1       ,954
5           e    0             40,055                         ,057     7,854        1      1,030
6           e    0             40,264                         ,006     7,853        1      1,005
7           e    0             40,272                         ,000     7,853        1      1,000

Model NPAR CMIN DF P CMIN/DF


Default model 13 7,853 8 ,448 ,982
Saturated model 21 ,000 0
Independence model 6 187,718 15 ,000 12,515

Model RMR GFI AGFI PGFI
Default model 1,677 ,966 ,910 ,368
Saturated model ,000 1,000
Independence model 13,807 ,496 ,294 ,354

Model                NFI      RFI     IFI      TLI     CFI
                     Delta1   rho1    Delta2   rho2
Default model        ,958     ,922    1,001    1,002   1,000
Saturated model      1,000            1,000            1,000
Independence model   ,000     ,000    ,000     ,000    ,000

Model PRATIO PNFI PCFI


Default model ,533 ,511 ,533
Saturated model ,000 ,000 ,000
Independence model 1,000 ,000 ,000

Model NCP LO 90 HI 90
Default model ,000 ,000 10,733
Saturated model ,000 ,000 ,000
Independence model 172,718 132,220 220,668

Model FMIN F0 LO 90 HI 90
Default model ,109 ,000 ,000 ,149
Saturated model ,000 ,000 ,000 ,000
Independence model 2,607 2,399 1,836 3,065

Model RMSEA LO 90 HI 90 PCLOSE


Default model ,000 ,000 ,137 ,577
Independence model ,400 ,350 ,452 ,000

Model AIC BCC BIC CAIC


Default model 33,853 36,653 63,629 76,629
Saturated model 42,000 46,523 90,100 111,100
Independence model 199,718 201,010 213,461 219,461

Model ECVI LO 90 HI 90 MECVI
Default model ,470 ,472 ,621 ,509
Saturated model ,583 ,583 ,583 ,646
Independence model 2,774 2,211 3,440 2,792

Model                HOELTER .05   HOELTER .01
Default model        143           185
Independence model   10            12

Minimization: ,218
Miscellaneous: 1,186
Bootstrap: ,000
Total: 1,404

Examples of structural equation modeling and confirmatory factor analysis have been
presented above without interpretation. Why does Singgih Santoso not include these two
topics in multivariate statistics, when he has written a book on Amos? Why does Imam
Ghozali not include them in multivariate analysis, when he too has written a book on Amos?
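Several of the fit measures in the two Amos outputs can be recomputed from the chi-square values alone, which shows how the tables hang together. The Python sketch below uses the standard single-group formulas for CMIN/DF, NFI, TLI, CFI, RMSEA, and AIC. The function name `fit_indices` is hypothetical, the inputs are copied from the output with decimal points, and the sample sizes (98 and 73) are taken from the output and from the title of Example 8. Results agree with the printed values only up to rounding.

```python
import math


def fit_indices(chi2, df, npar, chi2_base, df_base, sample_size):
    """Fit measures for a single-group model, computed from the
    chi-square of the model and of the independence (baseline) model."""
    excess = max(chi2 - df, 0.0)
    excess_base = max(chi2_base - df_base, 0.0)
    ratio = chi2 / df
    ratio_base = chi2_base / df_base
    largest_excess = max(excess, excess_base)
    return {
        "cmin_df": ratio,
        "nfi": 1 - chi2 / chi2_base,
        "tli": (ratio_base - ratio) / (ratio_base - 1),
        "cfi": 1.0 if largest_excess == 0 else 1 - excess / largest_excess,
        "rmsea": math.sqrt(excess / (df * (sample_size - 1))),
        "aic": chi2 + 2 * npar,
    }


# Example 5, Model A (structural equation model), N = 98.
sem = fit_indices(10.335, 14, 22, 243.768, 28, 98)
# Example 8 (confirmatory factor analysis), N = 73.
cfa = fit_indices(7.853, 8, 13, 187.718, 15, 73)

for name, result in (("SEM", sem), ("CFA", cfa)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

In both examples the chi-square is smaller than its degrees of freedom, which is why CFI is 1 and RMSEA is 0 and why TLI exceeds 1.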

Most of the SPSS procedures listed above, when used with three or more variables, fall
within the scope of multivariate analysis or statistics, and IBM SPSS Amos, when used with
three or more variables, can likewise be included. This shows that the scope covered by these
authors is very narrow and does not match the scope contained in SPSS.

IBM SPSS Statistics does not present multivariate analysis in the way the five authors do. In
SPSS the term "multivariate" appears under General Linear Model, where it refers to
MANOVA.

Conclusion

This critique of five books on multivariate analysis or statistics is offered because the author
holds that multivariate analysis, judged by what SPSS provides, is far broader than the scope
covered in those books. The SPSS-based books also do not make full use of syntax
programming. This has a negative effect on the concepts, meaning, and interpretation of
multivariate analysis or statistics, and that effect will show up in scientific research,
undergraduate theses, master's theses, and dissertations in Indonesia.

References

Ali Baroroh. 2013. Analisis Multivariat dan Time Series dengan SPSS. Jakarta: Penerbit PT Elex
Media Komputindo Kompas Gramedia.

Imam Ghozali. 2006. Aplikasi Analisis Multivariate Dengan SPSS. Semarang: Badan Penerbit
Universitas Diponegoro.

J. Supranto. 2004. Analisis Multivariat : Arti & Interpretasi. Jakarta: Penerbit Rineka Cipta.

Sharma, Subhash. 1996. Applied Multivariate Techniques. New York: John Wiley & Sons, Inc.

Singgih Santoso. 2014. Statistik Multivariat : Konsep dan Aplikasi Dengan SPSS. Edisi Revisi.
Jakarta: Penerbit PT Elex Media Komputindo Kompas Gramedia.

Permata Depok Regency, 6 June 2017.

DIRECT MARKETING IN IBM SPSS STATISTICS 22: A CRITIQUE

By:

Abdullah M. Jaubah

Introduction

In studying the treatments of multivariate analysis by Ali Baroroh, Singgih Santoso, Imam
Ghozali, J. Supranto, and Subhash Sharma, the author was somewhat disappointed because
they include few topics that involve three or more variables. The author found no discussion
of Direct Marketing in any of the five books. A study of Direct Marketing was therefore
carried out using the topics covered in SPSS 22.

This study of Direct Marketing covers RFM Analysis from Transaction Data, Cluster
analysis, Prospect profiles, Postal code response rates, Propensity to purchase, and Control
package test.

The six Direct Marketing topics break down as follows:

RFM Analysis from Transaction Data: Transaction Data, Running the Analysis, Evaluating
the Results, Merging Score Data with Customer Data
Cluster analysis: Running the analysis, Output, Selecting records based on clusters, Creating
a filter in the Cluster Model Viewer, Selecting records based on cluster field values
Prospect profiles: Data considerations, Running the analysis, Output
Postal code response rates: Data considerations, Running the analysis, Output
Propensity to purchase: Data considerations, Building a predictive model, Evaluating the
model, Applying the model
Control package test: Running the analysis, Output
Summary

RFM Analysis from Transaction Data

Arsip data yang dipakai adalah rfm_transactions.sav. Arsip data ini tersedia dalam folder
Sample Files sehingga arsip tersebut tidak perlu disajikan di sini.Tiap baris, dalam suatu arsip

25
data transaksi, mewakili suatu transaksi terpisah, bukan suatu pelanggan terpisah, dan
mungkin saja terdapat baris-baris dari transaksi jamak untuk tiap pelanggan.

Transaction Data

The data file must contain variables that hold the following information:

1. A variable, or combination of variables, that identifies each case or observation
(customer).
2. A variable that contains the date of each transaction.
3. A variable that contains the monetary value of each transaction.

This means that the requirement for multivariate analysis in Direct Marketing is met, since
at least three variables are involved. The form of the RFM transaction data is shown below:

Figure 1. RFM transaction data


Running the Analysis


1. To calculate RFM scores, from the menus choose:

Direct Marketing > Choose Technique

Figure 1. Direct Marketing dialog

2. Select Help identify my best contacts (RFM Analysis) and click Continue.
3. In the Data Format dialog, click Transaction data and then click Continue.

Figure 2. RFM from Transactions, Variables tab

4. Click Reset to clear any previous settings.
5. For Transaction Date, select Purchase Date [Date].
6. For Transaction Amount, select Purchase Amount [Amount].
7. For Summary Method, select Total.
8. For Customer Identifiers, select Customer ID [ID].
9. Then click the Output tab.

Figure 3. RFM for Transactions, Output tab

10. Select (check) Chart of bin counts.

11. Then click OK to run the procedure.


Evaluating the Results


When you compute RFM scores from transaction data, a new dataset is created that includes
the new RFM scores.

Figure 1. RFM from Transactions dataset

By default, the dataset includes the following information for each customer:

Customer ID variable(s)
Date of most recent transaction
Total number of transactions
Summary transaction amount (the default is total)
Recency, Frequency, Monetary, and combined RFM scores

The new dataset contains only one row (record) for each customer. The original transaction
data has been aggregated by values of the customer identifier variables. The identifier
variables are always included in the new dataset; otherwise you would have no way of
matching the RFM scores to the customers.

The combined RFM score for each customer is simply the concatenation of the three
individual scores, computed as: (recency x 100) + (frequency x 10) + monetary.
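
The aggregation and scoring described above can be sketched in Python. This is a minimal illustration, not the SPSS procedure itself: the transaction rows are made up, the helper names rank_bins and rfm_scores are hypothetical, and the simple rank-based binning is an assumption that only approximates the binning options SPSS offers (tied values are split by position). The last expression in rfm_scores applies the combined-score formula from the text.

```python
from collections import defaultdict
from datetime import date


def rank_bins(values, n_bins):
    """Score each value from 1..n_bins by rank (larger value -> higher score)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * n_bins // len(values) + 1
    return scores


def rfm_scores(transactions, n_bins=5):
    """Aggregate (customer_id, date, amount) rows into one record per customer."""
    last_date, count, total = {}, defaultdict(int), defaultdict(float)
    for cust, when, amount in transactions:
        last_date[cust] = max(last_date.get(cust, when), when)
        count[cust] += 1
        total[cust] += amount  # summary method: total
    customers = sorted(last_date)  # ascending customer ID
    r = rank_bins([last_date[c].toordinal() for c in customers], n_bins)
    f = rank_bins([count[c] for c in customers], n_bins)
    m = rank_bins([total[c] for c in customers], n_bins)
    return {
        c: {"recency": r[k], "frequency": f[k], "monetary": m[k],
            "rfm": r[k] * 100 + f[k] * 10 + m[k]}
        for k, c in enumerate(customers)
    }


# Hypothetical transactions: several rows may belong to the same customer.
transactions = [
    (101, date(2006, 1, 5), 20.0),
    (101, date(2006, 9, 1), 35.0),
    (102, date(2005, 3, 10), 15.0),
    (103, date(2006, 6, 30), 80.0),
    (103, date(2006, 7, 15), 40.0),
    (103, date(2006, 8, 1), 10.0),
]
scores = rfm_scores(transactions, n_bins=3)
for customer, record in scores.items():
    print(customer, record)
```

Three bins are used here only because the toy data has three customers; the default of five bins per component gives the 5 x 5 x 5 score grid.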

The chart of bin counts displayed in the Viewer window shows the number of customers in
each RFM category.

Figure 2. Chart of bin counts

Using the default method of five score categories for each of the three RFM components
results in 125 possible RFM score categories. Each bar in the chart represents the number of
customers in each RFM category.

Ideally, you want a relatively even distribution of customers across all RFM score categories.
In reality, there will usually be some amount of variation, such as what you see in this
example. If there are many empty categories, you might want to consider changing the
binning method.

There are a number of strategies for dealing with uneven distributions of RFM scores,
including:

Use nested instead of independent binning.


Reduce the number of possible score categories (bins).
When there are large numbers of tied values, randomly assign cases with the same
scores to different categories.

See the topic RFM Binning for more information.


Merging Score Data with Customer Data


Now that you have a dataset that contains RFM scores, you need to match those scores to the
customers. You could merge the scores back to the transaction data file, but more typically
you want to merge the score data with a data file that, like the RFM score dataset, contains
one row (record) for each customer -- and also contains information such as the customer's
name and address.

Figure 1. RFM score dataset in Variable View

1. Make the dataset that contains the RFM scores the active dataset. (Click anywhere in
the Data Editor window that contains the dataset.)
2. From the menus choose:

Data > Merge Files > Add Variables

Figure 2. Add Variables, select files dialog

3. Select An external data file.


4. Use the Browse button to navigate to the Samples folder and select
customer_information.sav. See the topic Sample Files for more information.
5. Then click Continue.

Figure 3. Add Variables, select variables dialog

6. Select (check) Match cases on key variables in sorted files.


7. Select Both files provide cases.
8. Select ID for the Key Variables list.
9. Click OK.

Note the message that warns you that both files must be sorted in ascending order of
the key variables. In this example, both files are already sorted in ascending order of
the key variable, which is the customer identifier variable we selected when we
computed the RFM scores. When you compute RFM scores from transaction data, the
new dataset is automatically sorted in ascending order of the customer identifier
variable(s). If you change the sort order of the score dataset or the data file with which
you want to merge the score dataset is not sorted in that order, you must first sort both
files in ascending order of the customer identifier variable(s). See the topic Add
Variables for more information.

10. Click OK to merge the two datasets.

The dataset that contains the RFM scores now also contains name, address and other
information for each customer.
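
The key-based merge can be pictured with a small Python sketch. It illustrates matching one-row-per-customer records on an ID key and is not SPSS's own implementation; the function name merge_on_key and the sample records are hypothetical. Keeping every ID that appears in either source is an assumption about how the Both files provide cases option behaves.

```python
def merge_on_key(scores, customers, key="ID"):
    """Combine two lists of one-row-per-customer dicts on a shared key.

    IDs present in only one source are kept; the fields from the other
    source are simply absent from that merged record.
    """
    merged = {}
    for row in scores + customers:
        merged.setdefault(row[key], {}).update(row)
    # Return rows in ascending key order, like the sorted files in the text.
    return [merged[k] for k in sorted(merged)]


# Hypothetical score and customer records.
rfm = [{"ID": 1, "rfm": 543}, {"ID": 2, "rfm": 211}]
info = [{"ID": 1, "name": "A. Example"}, {"ID": 3, "name": "C. Example"}]
result = merge_on_key(rfm, info)
for row in result:
    print(row)
```

Because the sketch indexes rows by key in a dictionary, it does not need sorted input; the sort-order requirement in the text applies to the SPSS dialog.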

Figure 4. Merged datasets

Cluster analysis

Cluster Analysis is an exploratory tool designed to reveal natural groupings (or clusters)
within your data. For example, it can identify different groups of customers based on various
demographic and purchasing characteristics.

For example, the direct marketing division of a company wants to identify demographic
groupings in their customer database to help determine marketing campaign strategies and
develop new product offerings.

This information is collected in dmdata.sav. See the topic Sample Files for more information.

Running the analysis


1. To run a Cluster Analysis, from the menus choose:

Direct Marketing > Choose Technique

Figure 1. Direct Marketing dialog

2. Select Segment my contacts into clusters and click Continue.

In this example file, there are no fields with an unknown measurement level, and all
fields have the correct measurement level; so the measurement level alert should not
appear.

Figure 2. Cluster Analysis, Fields tab

3. Select the following fields to create segments: Age, Income category, Education,
Years at current residence, Gender, Married, and Children.
4. Click Run to run the procedure.

Output
Figure 1. Cluster model summary

The results are displayed in the Cluster Model Viewer.

The model summary indicates that four clusters were found based on the seven input
features (fields) you selected.
The cluster quality chart indicates that the overall model quality is in the middle of the
"Fair" range.

1. Double-click the Cluster Model Viewer output to activate the Model Viewer.

Figure 2. Activated Cluster Model Viewer

2. From the View drop-down list at the bottom of the Cluster Model Viewer window,
select Clusters.

Figure 3. Cluster view

The Cluster view displays information on the attributes of each cluster.

o For continuous (scale) fields, the mean (average) value is displayed.


o For categorical (nominal, ordinal) fields, the mode is displayed. The mode is
the category with the largest number of records. In this example, each record
is a customer.
o By default, fields are displayed in the order of their overall importance to the
model. In this example, Age has the highest overall importance. You can also
sort fields by within-cluster importance or alphabetical order.

If you select (click) any cell in Cluster view, you can see a chart that summarizes the
values of that field for that cluster.

3. For example, select the Age cell for cluster 1.

Figure 4. Age histogram for cluster 1

For continuous fields, a histogram is displayed. The histogram displays both the
distribution of values within that cluster and the overall distribution of values for the
field. The histogram indicates that the customers in cluster 1 tend to be somewhat
older.

4. Select the Age cell for cluster 4 in the Cluster view.

Figure 5. Age histogram for cluster 4

In contrast to cluster 1, the customers in cluster 4 tend to be younger than the overall
average.

5. Select the Income category cell for cluster 1 in the Cluster view.

Figure 6. Income category bar chart for cluster 1

For categorical fields, a bar chart is displayed. The most notable feature of the income
category bar chart for this cluster is the complete absence of any customers in the
lowest income category.

6. Select the Income category cell for cluster 4 in the Cluster view.

Figure 7. Income category bar chart for cluster 4

In contrast to cluster 1, all of the customers in cluster 4 are in the lowest income category.

Using the toolbar at the bottom of the Model Viewer window, you can also change the Cluster
view to display charts in the cells, which makes it easy to compare the distributions of
values between clusters.

Figure 8. Charts displayed in the Cluster

Looking at the Cluster view and the additional information provided in the charts for each
cell, you can see some distinct differences between the clusters:

Customers in cluster 1 tend to be older, married people with children and higher
incomes.
Customers in cluster 2 tend to be somewhat older single mothers with moderate
incomes.
Customers in cluster 3 tend to be younger, single men without children.
Customers in cluster 4 tend to be younger, single women without children and with
lower incomes.

The Description cells in the Cluster view are text fields that you can edit to add descriptions
of each cluster.

Figure 9. Cluster view with cluster descriptions

Selecting records based on clusters
You can select records based on cluster membership in two ways:

Create a filter condition interactively in the Cluster Model Viewer.


Use the values of the cluster field generated by the procedure to specify filter or
selection conditions.

Creating a filter in the Cluster Model Viewer
To create a filter condition that selects records from specific clusters in the Cluster Model
Viewer:

1. Activate (double-click) the Cluster Model Viewer.


2. From the View drop-down list at the bottom of the Cluster Model Viewer window,
select Clusters.
3. Click the cluster number for the cluster you want at the top of the Cluster View. If you
want to select multiple clusters, Ctrl-click on each additional cluster number that you
want.

Figure 1. Clusters selected in Cluster view

4. From the Cluster Model Viewer menus, choose:

Generate > Filter records

Figure 2. Filter Records dialog

5. Enter a name for the filter field and click OK. Names must conform to IBM SPSS
Statistics naming rules. See the topic Variable names for more information.

Figure 3. Filtered records in Data Editor

This creates a new field in the dataset and filters records in the dataset based on the values of
that field.

Records with a value of 1 for the filter field will be included in subsequent analyses,
charts, and reports.
Records with a value of 0 for the filter field will be excluded.
Excluded records are not deleted from the dataset. They are retained with a filter
status indicator, which is displayed as a diagonal slash through the record number in
the Data Editor.

For more information on filtering, see Select cases.

Selecting records based on cluster field values
By default, Cluster Analysis creates a new field that identifies the cluster group for each
record. The default name of this field is ClusterGroupn, where n is an integer that forms a
unique field name.

Figure 1. Cluster field added to dataset

To use the values of the cluster field to select records in specific clusters:

1. From the menus choose:

Data > Select Cases

Figure 2. Select Cases dialog

2. In the Select Cases dialog, select If condition is satisfied and then click If.

Figure 3. Select Cases: If dialog

3. Enter the selection condition.

For example, ClusterGroup1 < 3 will select all records in clusters 1 and 2, and will
exclude records in clusters 3 and higher.

4. Click Continue.

In the Select Cases dialog, there are several options for what to do with selected and
unselected records:

Filter out unselected cases. This creates a new field that specifies a filter condition.
Excluded records are not deleted from the dataset. They are retained with a filter status
indicator, which is displayed as a diagonal slash through the record number in the Data
Editor. This is equivalent to interactively selecting clusters in the Cluster Model Viewer.

Copy selected cases to a new dataset. This creates a new dataset in the current session that
contains only the records that meet the filter condition. The original dataset is unaffected.

Delete unselected cases. Unselected records are deleted from the dataset. Deleted records
can be recovered only by exiting from the file without saving any changes and then reopening
the file. The deletion of cases is permanent if you save the changes to the data file.
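
The difference between the three options can be sketched in Python on a list of records. This is only an illustration: the record layout, the filter_flag field name, and the sample values are hypothetical, while the condition mirrors the ClusterGroup1 < 3 example above.

```python
# Hypothetical records with a cluster membership field.
records = [
    {"id": 1, "ClusterGroup1": 1},
    {"id": 2, "ClusterGroup1": 3},
    {"id": 3, "ClusterGroup1": 2},
    {"id": 4, "ClusterGroup1": 4},
]


def selected(rec):
    """Selection condition: keep clusters 1 and 2."""
    return rec["ClusterGroup1"] < 3


# Filter out unselected cases: every record stays, a 1/0 flag marks inclusion.
filtered = [dict(rec, filter_flag=1 if selected(rec) else 0) for rec in records]

# Copy selected cases to a new dataset: the original list is left untouched.
copied = [dict(rec) for rec in records if selected(rec)]

# Delete unselected cases: the working dataset itself shrinks.
working = list(records)
working[:] = [rec for rec in working if selected(rec)]

print(len(filtered), len(copied), len(working), len(records))
```

Only the first option is reversible without reloading the data, because the excluded records are still present and merely flagged.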

The Select Cases dialog also has an option to use an existing variable as a filter variable
(field). If you create a filter condition interactively in the Cluster Model Viewer and save the
generated filter field with the dataset, you can use that field to filter records in subsequent
sessions.

Summary

Cluster Analysis is a useful exploratory tool that can reveal natural groupings (or clusters)
within your data. You can use the information from these clusters to determine marketing
campaign strategies and develop new product offerings. You can select records based on
cluster membership for further analysis or targeted marketing campaigns.

Prospect profiles

Prospect Profiles uses results from a previous or test campaign to create descriptive profiles.
You can use the profiles to target specific groups of contacts in future campaigns. For
example, based on the results of a test mailing, the direct marketing division of a company
wants to generate profiles of the types of people most likely to respond to a certain type of
offer, based on demographic information. Based on those results, they can then determine the
types of mailing lists they should use for similar offers.

For example, the direct marketing division of a company sends out a test mailing to
approximately 20% of their total customer database. The results of this test mailing are
recorded in a data file that also contains demographic characteristics for each customer,
including age, gender, marital status, and geographic region. The results are recorded in a
simple yes/no fashion, indicating which customers in the test mailing responded (made a
purchase) and which ones did not.

This information is collected in dmdata.sav. See the topic Sample Files for more information.

Data considerations

The response field should be categorical, with one value representing all positive responses.
Any other non-missing value is assumed to be a negative response. If the response field
represents a continuous (scale) value, such as number of purchases or monetary amount of
purchases, you need to create a new field that assigns a single positive response value to all
non-zero response values. See the topic Creating a categorical response field for more
information.

Running the analysis


1. To run a Prospect Profiles analysis, from the menus choose:

Direct Marketing > Choose Technique

Figure 1. Direct Marketing dialog

2. Select Generate profiles of my contacts who responded to an offer and click Continue.

In this example file, there are no fields with an unknown measurement level, and all
fields have the correct measurement level; so the measurement level alert should not
appear.

Figure 2. Prospect Profiles, Fields tab

3. For Response Field, select Responded to test offer.


4. For Positive response value, select Yes from the drop-down list. A value of 1 is
displayed in the text field because "Yes" is actually a value label associated with a
recorded value of 1. (If the positive response value doesn't have a defined value label,
you can just enter the value in the text field.)
5. For Create Profiles with, select Age, Income category, Education, Years at current
residence, Gender, Married, Region, and Children.
6. Click the Settings tab.

Figure 3. Prospect Profiles, Settings tab

7. Select (check) Include minimum response rate threshold information in results.
8. For the target response rate, enter 7.
9. Then click Run to run the procedure.

Output
Figure 1. Response rate table

The response rate table displays information for each profile group identified by the
procedure.

Profiles are displayed in descending order of response rate.
Response rate is the percentage of customers who responded positively (made a
purchase).
Cumulative response rate is the combined response rate for the current and all
preceding profile groups. Since profiles are displayed in descending order of response
rate, that means the cumulative response rate is the combined response rate for the
current profile group plus all profile groups with a higher response rate.
The profile description includes the characteristics for only those fields that provide a
significant contribution to the model. In this example, region, gender, and marital
status are included in the model. The remaining fields -- age, income, education, and
years at current address -- are not included because they did not make a significant
contribution to the model.
The green area of the table represents the set of profiles with a cumulative response
rate equal to or greater than the specified target response rate, which in this example is
7%.
The red area of the table represents the set of profiles with a cumulative response rate
lower than the specified target response rate.
The cumulative response rate in the last row of the table is the overall or average
response rate for all customers included in the test mailing, since it is the response rate
for all profile groups.
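
The cumulative response rate and the green/red split can be reproduced with a few lines of Python. The sketch below uses made-up counts for three hypothetical profile groups (they are not the figures from dmdata.sav), and the function name cumulative_rates is an assumption for illustration.

```python
def cumulative_rates(groups, target_pct):
    """groups: list of (label, responses, contacts).

    Returns one row per group, ordered by descending response rate, with the
    cumulative rate over the current and all preceding groups.
    """
    ordered = sorted(groups, key=lambda g: g[1] / g[2], reverse=True)
    rows, cum_responses, cum_contacts = [], 0, 0
    for label, responses, contacts in ordered:
        cum_responses += responses
        cum_contacts += contacts
        cum_rate = 100.0 * cum_responses / cum_contacts
        rows.append({
            "profile": label,
            "rate": 100.0 * responses / contacts,
            "cum_rate": cum_rate,
            # True corresponds to the green area, False to the red area.
            "meets_target": cum_rate >= target_pct,
        })
    return rows


# Hypothetical profile groups: (label, positive responses, contacts).
groups = [("group B", 50, 1000), ("group A", 90, 1000), ("group C", 60, 2000)]
rows = cumulative_rates(groups, target_pct=7)
for row in rows:
    print(row)
```

In the last row the cumulative rate covers every group, so it equals the overall response rate for the whole test mailing.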

The results displayed in the table suggest that if you target females in the west, south, and
east, you should get a response rate slightly higher than the target response rate.

Note, however, that there is a substantial difference between the response rates for unmarried
females (9.2%) and married females (5.0%) in those regions. Although the cumulative
response rate for both groups is above the target response rate, the response rate for the latter
group alone is, in fact, lower than the target response rate, which suggests that you may want
to look for other characteristics that might improve the model.

Smart output

Figure 2. Smart output

The table is accompanied by "smart output" that provides general information on how to
interpret the table and specific information on the results contained in the table.

Figure 3. Cumulative response rate chart

The cumulative response rate chart is basically a visual representation of the cumulative
response rates displayed in the table. Since profiles are reported in descending order of
response rate, the cumulative response rate line always goes down for each subsequent
profile. Just like the table, the chart shows that the cumulative response rate drops below the
target response rate between profile group 2 and profile group 3.

Summary

For this particular test mailing, four profile groups were identified, and the results indicate
that the only significant demographic characteristics that seem to be related to whether or not
a person responded to the offer are gender, region, and marital status. The group with the
highest response rate consists of unmarried females living in the south, east, and west. After
that, response rates drop off rapidly, although including married females in the same regions
still yields a cumulative response rate higher than the target response rate.

Postal code response rates

This technique uses results from a previous campaign to calculate postal code response rates.
Those rates can be used to target specific postal codes in future campaigns.

For example, based on the results of a previous mailing, the direct marketing division of a
company generates response rates by postal codes. Based on various criteria, such as a
minimum acceptable response rate and/or maximum number of contacts to include in the
mailing, they can then target specific postal codes.

This information is collected in dmdata.sav. See the topic Sample Files for more information.

Data considerations

The response field should be categorical, with one value representing all positive responses.
Any other non-missing value is assumed to be a negative response. If the response field
represents a continuous (scale) value, such as number of purchases or monetary amount of
purchases, you need to create a new field that assigns a single positive response value to all
non-zero response values. See the topic Creating a Categorical Response Field for more
information.

Running the analysis


1. To calculate postal code response rates, from the menus choose:

Direct Marketing > Choose Technique

Figure 1. Direct Marketing dialog

2. Select Identify the top responding postal codes and click Continue.

Figure 2. Postal Code Response Rates, Fields tab

3. For Response Field, select Responded to previous offer.
4. For Positive response value, select Yes from the drop-down list. A value of 1 is
displayed in the text field because "Yes" is actually a value label associated with a
recorded value of 1. (If the positive response value doesn't have a defined value label,
you can just enter the value in the text field.)
5. For Postal Code Field, select Postal Code.
6. Click the Settings tab.

Figure 3. Postal Code Response Rates, Settings tab

7. In the Group Postal Codes Based On group, select First 3 digits or characters. This
will calculate combined response rates for all contacts that have postal codes that start
with the same three digits or characters. For example, the first three digits of a U.S.
zip code represent a common geographic area that is larger than the geographic area
defined by the full 5-digit zip code.
8. In the Output group, select (check) Response rate and capacity analysis.
9. Select Target response rate and enter a value of 5.
10. Select Number of contacts and enter a value of 5000.
11. Then click Run to run the procedure.

Output
Figure 1. New dataset with response rates by postal code

A new dataset is automatically created. This dataset contains a single record (row) for each
postal code. In this example, each row contains summary information for all postal codes that
start with the same first three digits or characters.

In addition to the field that contains the postal code, the new dataset contains the following
fields:

ResponseRate. The percentage of positive responses in each postal code. Records are
automatically sorted in descending order of response rates; so postal codes that have
the highest response rates appear at the top of the dataset.
Responses. The number of positive responses in each postal code.
Contacts. The total number of contacts in each postal code that contain a non-missing
value for the response field.
Index. The "weighted" response based on the formula N x P x (1-P), where N is the
number of contacts, and P is the response rate expressed as a proportion. For two
postal codes with the same response rate, this formula will assign a higher index value
to the postal code with the larger number of contacts.
Rank. Decile rank (top 10%, top 20%, etc.) of the cumulative postal code response
rates in descending order.
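
The fields above can be approximated with a short Python sketch. It groups hypothetical contacts by the first three characters of the postal code and computes ResponseRate, Responses, Contacts, and Index using the N x P x (1-P) formula from the text. The function name postal_code_rates and the sample data are assumptions, and the decile Rank field is deliberately left out.

```python
from collections import defaultdict


def postal_code_rates(contacts, prefix_len=3):
    """contacts: iterable of (postal_code, responded), responded True/False/None.

    Groups codes by their first `prefix_len` characters and returns rows
    sorted by descending response rate.
    """
    responses, totals = defaultdict(int), defaultdict(int)
    for code, responded in contacts:
        if responded is None:  # missing response values are not counted
            continue
        prefix = code[:prefix_len]
        totals[prefix] += 1
        responses[prefix] += 1 if responded else 0
    rows = []
    for prefix, n in totals.items():
        p = responses[prefix] / n  # response rate as a proportion
        rows.append({
            "PostalCode": prefix,
            "ResponseRate": 100.0 * p,
            "Responses": responses[prefix],
            "Contacts": n,
            "Index": n * p * (1 - p),  # N x P x (1-P)
        })
    return sorted(rows, key=lambda r: r["ResponseRate"], reverse=True)


# Hypothetical contacts; None marks a missing response value.
contacts = [
    ("60614", True), ("60610", False), ("60657", False), ("60601", False),
    ("10001", True), ("10002", False), ("10003", None),
]
rows = postal_code_rates(contacts)
for row in rows:
    print(row)
```

With the rate held fixed, Index grows in proportion to the number of contacts, which is why the larger of two postal codes with the same response rate receives the higher value.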

Since we selected Response rate and capacity analysis on the Settings tab of the Postal Code
Response Rates dialog, a summary response rate table and chart are displayed in the Viewer.

Figure 2. Response rate table

The table summarizes results by decile rank in descending order (top 10%, top 20%, etc.).

The cumulative response rate is the combined percentage of positive responses in the
current and all preceding rows. Since results are displayed in descending order of
response rates, this is therefore the combined response rate for the current decile and
all deciles with a higher response rate.
The table is color-coded based on the values you entered for target response rate and
maximum number of contacts. Rows with a cumulative response rate equal to or
greater than 5% and 5,000 or fewer cumulative contacts are colored green. The color-
coding is based on whichever threshold value is reached first. In this example, both
threshold values are reached in the same decile.

Figure 3. Smart output for response rate table

The table is accompanied by text that provides a general description of how to read the table.
If you have specified either a minimum response rate or a maximum number of contacts, it
also includes a section describing how the results relate to the threshold values you specified.

Figure 4. Cumulative response rate chart

The chart of cumulative response rate and cumulative number of contacts in each decile is a
visual representation of the same information displayed in the response rate table. The
threshold for both minimum cumulative response rate and maximum cumulative number of
contacts is reached somewhere between the 40th and 50th percentile.

Since the chart displays cumulative response rates in descending order of decile rank
of response rate, the cumulative response rate line always goes down for each
subsequent decile.
Since the line for number of contacts represents cumulative number of contacts, it
always goes up.

The information in the table and chart tells you that if you want to achieve a response rate
of at least 5% but don't want to include more than 5,000 contacts in the campaign, you should
focus on the postal codes in the top four deciles. Since decile rank is included in the new
dataset, you can easily identify the postal codes that meet the top 40% requirement.

Figure 5. New dataset

Note: Rank is recorded as an integer value from 1 to 10. The field has defined value labels,
where 1= Top 10%, 2=Top 20%, etc. You will see either the actual rank values or the value
labels in Data View of the Data Editor, depending on your View settings.

Summary

The Postal Code Response Rates procedure uses results from a previous campaign to
calculate postal code response rates. Those rates can be used to target specific postal codes in
future campaigns. The procedure creates a new dataset that contains response rates for each
postal code. Based on information in the response rate table and chart and decile rank
information in the new dataset, you can identify the set of postal codes that meet a specified
minimum cumulative response rate and/or cumulative maximum number of contacts.

Propensity to purchase
Propensity to Purchase uses results from a test mailing or previous campaign to generate
propensity scores. The scores indicate which contacts are most likely to respond, based on
various selected characteristics.

This technique uses binary logistic regression to build a predictive model. The process of
building and applying a predictive model has two basic steps:

1. Build the model and save the model file. You build the model using a dataset for
which the outcome of interest (often referred to as the target) is known. For example,
if you want to build a model that will predict who is likely to respond to a direct mail
campaign, you need to start with a dataset that already contains information on who
responded and who did not respond. For example, this might be the results of a test
mailing to a small group of customers or information on responses to a similar
campaign in the past.
2. Apply that model to a different dataset (for which the outcome of interest is not
known) to obtain predicted outcomes.
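
The two steps can be illustrated with a self-contained Python sketch of binary logistic regression. It is a toy stand-in, not the estimation routine SPSS uses: the model is fitted by plain gradient descent, the two features (is_female, is_married) and all data values are hypothetical, and the function names fit_logistic and propensity are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit binary logistic regression by plain gradient descent.

    X: list of feature lists, y: list of 0/1 outcomes.
    Returns [intercept, coef_1, ..., coef_k].
    """
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, target in zip(X, y):
            row = [1.0] + list(features)
            error = sigmoid(sum(w * v for w, v in zip(weights, row))) - target
            for j, v in enumerate(row):
                grad[j] += error * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def propensity(weights, features):
    """Predicted probability of a positive response for one contact."""
    row = [1.0] + list(features)
    return sigmoid(sum(w * v for w, v in zip(weights, row)))


# Step 1: build the model on contacts whose outcome is known (toy data).
# Features: [is_female, is_married]; outcome: 1 = responded, 0 = did not.
train_X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1],
           [0, 0], [0, 0], [0, 1], [0, 1], [1, 0]]
train_y = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
model = fit_logistic(train_X, train_y)

# Step 2: apply the model to contacts whose outcome is not known.
new_contacts = [[1, 0], [0, 1]]
scores = [propensity(model, c) for c in new_contacts]
print(scores)
```

The fitted weights play the role of the saved model file: once they exist, scoring a new dataset needs only the predictor values, not the outcomes.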

This example uses two data files: dmdata2.sav is used to build the model, and then that model
is applied to dmdata3.sav. See the topic Sample Files for more information.

Data considerations

The response field (the target outcome of interest) should be categorical, with one value
representing all positive responses. Any other non-missing value is assumed to be a negative
response. If the response field represents a continuous (scale) value, such as number of
purchases or monetary amount of purchases, you need to create a new field that assigns a
single positive response value to all non-zero response values. See the topic Creating a
categorical response field for more information.

Building a predictive model


1. Open the data file dmdata2.sav.

This file contains various demographic characteristics of the people who received the
test mailing, and it also contains information on whether or not they responded to the
mailing. This information is recorded in the field (variable) Responded. A value of 1
indicates that the contact responded to the mailing, and a value of 0 indicates that the
contact did not respond.

2. From the menus choose:

Direct Marketing > Choose Technique

3. Select Select contacts most likely to purchase and click Continue.

Figure 1. Propensity to Purchase, Fields tab

4. For Response Field, select Responded to test offer.


5. For Positive response value, select Yes from the drop-down list. A value of 1 is
displayed in the text field because "Yes" is actually a value label associated with a
recorded value of 1. (If the positive response value doesn't have a defined value label,
you can just enter the value in the text field.)
6. For Predict Propensity with, select Age, Income category, Education, Years at current
residence, Gender, Married, Region, and Children.
7. Select (check) Export model information to XML file.
8. Click Browse to navigate to where you want to save the file and enter a name for the
file.

9. In the Propensity to Purchase dialog, click the Settings tab.

Figure 2. Propensity to Purchase, Settings tab

10. In the Model Validation Group, select (check) Validate model and Set seed to
replicate results.
11. Use the default training sample partition size of 50% and the default seed value of
2000000.
12. In the Diagnostic Output group, select (check) Overall model quality and
Classification table.
13. For Minimum probability, enter 0.05. As a general rule, you should specify a value
close to your minimum target response rate, expressed as a proportion. A value of
0.05 represents a response rate of 5%.
14. Click Run to run the procedure and generate the model.

Evaluating the model

Propensity to Purchase produces an overall model quality chart and a classification table that
can be used to evaluate the model.

The overall model quality chart provides a quick visual indication of the model quality. As a
general rule, the overall model quality should be above 0.5.

Figure 1. Overall model quality chart

To confirm that the model is adequate for scoring, you should also examine the classification
table.

Figure 2. Classification table

The classification table compares predicted values of the target field to the actual values of
the target field. The overall accuracy rate can provide some indication of how well the model
works, but you may be more interested in the percentage of correct predicted positive
responses, if the goal is to build a model that will identify the group of contacts likely to yield
a positive response rate equal to or greater than the specified minimum positive response rate.

In this example, the classification table is split into a training sample and a testing sample.
The training sample is used to build the model. The model is then applied to the testing
sample to see how well the model works.

The specified minimum response rate was 0.05 or 5%. The classification table shows that the
correct classification rate for positive responses is 7.43% in the training sample and 7.61% in
the testing sample. Since the testing sample response rate is greater than 5%, this model
should be able to identify a group of contacts likely to yield a response rate greater than 5%.
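
The comparison described above can be sketched in Python. This is an illustration under assumptions: the counts are hypothetical (the text reports only the percentages), and the helper positive_hit_rate is not an SPSS function. It treats the correct classification rate for positive responses as the share of predicted-positive contacts who actually responded.

```python
def positive_hit_rate(true_positives, false_positives):
    """Share of predicted-positive contacts that actually responded."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("no contacts were predicted positive")
    return true_positives / predicted_positive

MIN_RESPONSE_RATE = 0.05  # the Minimum probability entered in step 13

# Hypothetical testing-sample counts, chosen only for illustration.
rate = positive_hit_rate(true_positives=38, false_positives=462)
meets_target = rate >= MIN_RESPONSE_RATE
print(f"hit rate = {rate:.2%}, meets target: {meets_target}")
```

The check should be made on the testing sample, because the training sample was used to build the model and tends to give an optimistic rate.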

Applying the model


1. Open the data file dmdata3.sav. This data file contains demographic and other
information for all the contacts that were not included in the test mailing. See the
topic Sample Files for more information.
2. Open the Scoring Wizard. To open the Scoring Wizard, from the menus choose:

Utilities > Scoring Wizard

Figure 1. Scoring Wizard, Select a Scoring Model

3. Click Browse to navigate to the location where you saved the model XML file and
click Select in the Browse dialog.

All files with an .xml or .zip extension are displayed in the Scoring Wizard. If the
selected file is recognized as a valid model file, a description of the model is
displayed.

4. Select the model XML file you created and then click Next.

Figure 2. Scoring Wizard: Match Model Fields

In order to score the active dataset, the dataset must contain fields (variables) that
correspond to all the predictors in the model. If the model also contains split fields,
then the dataset must also contain fields that correspond to all the split fields in the
model.

o By default, any fields in the active dataset that have the same name and type as
fields in the model are automatically matched.
o Use the drop-down list to match dataset fields to model fields. The data type
for each field must be the same in both the model and the dataset in order to
match fields.
o You cannot continue with the wizard or score the active dataset unless all
predictors (and split fields if present) in the model are matched with fields in
the active dataset.

The active dataset does not contain a field named Income. So the cell in the Dataset
Fields column that corresponds to the model field Income is initially blank. You need
to select a field in the active dataset that is equivalent to that model field.

5. From the drop-down list in the Dataset Fields column in the blank cell in the row for
the Income model field, select IncomeCategory.

Note: In addition to field name and type, you should make sure that the actual data
values in the dataset being scored are recorded in the same fashion as the data values
in the dataset used to build the model. For example, if the model was built with an
Income field that has income divided into four categories, and IncomeCategory in the
active dataset has income divided into six categories or four different categories, those
fields don't really match each other and the resulting scores will not be reliable.

Click Next to continue to the next step of the Scoring Wizard.

Figure 3. Scoring Wizard: Select Scoring Functions

The scoring functions are the types of "scores" available for the selected model. The
scoring functions available are dependent on the model. For the binary logistic model
used in this example, the available functions are predicted value, probability of the
predicted value, probability of a selected value, and confidence. See the topic
Selecting scoring functions for more information.

In this example, we are interested in the predicted probability of a positive response to
the mailing; so we want the probability of a selected value.

6. Select (check) Probability of Selected Category.


7. In the Value column, select 1 from the drop-down list. The list of possible values for
the target is defined in the model, based on the target values in the data file used to
build the model.

Note: When you use the Propensity to Purchase feature to build a model, the value
associated with a positive response will always be 1, since Propensity to Purchase
automatically recodes the target to a binary field where 1 represents a positive
response, and 0 represents any other valid value encountered in the data file used to
build the model.

8. Deselect (clear) all the other scoring functions.


9. Optionally, you can assign a more descriptive name to the new field that will contain
the score values in the active dataset. For example, Probability_of_responding. For
information on field (variable) naming rules, see Variable names.
10. Click Finish to apply the model to the active dataset.

The new field that contains the probability of a positive response is appended to the
end of the dataset.

You can then use that field to select the subset of contacts that are likely to yield a
positive response rate at or above a certain level. For example, you could create a new
dataset that contains the subset of cases likely to yield a positive response rate of at
least 5%.

11. From the menus choose:

Data > Select Cases

12. In the Select Cases dialog, select If condition is satisfied and click If.
13. In the Select Cases: If dialog enter the following expression:

Probability_of_responding >=.05

Note: If you used a different name for the field that contains the probability values,
enter that name instead of Probability_of_responding. The default name is
SelectedProbability.

14. Click Continue.


15. In the Select Cases dialog, select Copy selected cases to a new dataset and enter a
name for the new dataset. Dataset names must conform to field (variable) naming
rules. See the topic Variable names for more information.
16. Click OK to create the dataset with the selected contacts.

The new dataset contains only those contacts with a predicted probability of a positive
response of at least 5%.
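
The selection in steps 11 to 16 amounts to a simple filter on the score field. A minimal Python sketch follows, assuming the scored contacts are available as a list of dictionaries; the function name select_likely_responders and the sample records are hypothetical, and the field name Probability_of_responding is the example name suggested in step 9.

```python
def select_likely_responders(contacts, field="Probability_of_responding", cutoff=0.05):
    """Return a new list with the contacts whose score is at or above cutoff."""
    return [c for c in contacts if c.get(field) is not None and c[field] >= cutoff]

# Hypothetical scored contacts; real values would come from the Scoring Wizard.
scored = [
    {"id": 1, "Probability_of_responding": 0.02},
    {"id": 2, "Probability_of_responding": 0.05},
    {"id": 3, "Probability_of_responding": 0.11},
    {"id": 4, "Probability_of_responding": None},  # missing score is excluded
]
selected = select_likely_responders(scored)
print([c["id"] for c in selected])
```

As with Copy selected cases to a new dataset, the original list is left unchanged and the selected contacts are returned as a separate list.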

Summary

Propensity to Purchase uses results from a test mailing or previous campaign to generate
propensity scores. The scores indicate which contacts are most likely to respond, based on
various selected characteristics. This technique builds a predictive model that can then be
applied to a dataset to obtain propensity scores.

Control package test

This technique compares marketing campaigns to see if there is a significant difference in
effectiveness for different packages or offers. Campaign effectiveness is measured by
responses.

For example, the direct marketing division of a company wants to see if a new package
design will generate more positive responses than the existing package. So they send out a
test mailing to determine if the new package generates a significantly higher positive
response rate. The test mailing consists of a control group that receives the existing package
and a test group that receives the new package design. The results for the two groups are then
compared to see if there is a significant difference.

This information is collected in dmdata.sav. See the topic Sample Files for more information.

Running the analysis


1. To obtain a control package test, from the menus choose:

Direct Marketing > Choose Technique

Figure 1. Direct Marketing dialog

2. Select Compare effectiveness of campaigns (Control Package Test) and click
Continue.

Figure 2. Control Package Test, Fields tab

3. For Campaign Field, select Control Package.
4. For Effectiveness Response Field, select Responded to test offer.
5. Select Reply.
6. For Positive response value, select Yes from the drop-down list. A value of 1 is
displayed in the text field because "Yes" is actually a value label associated with a
recorded value of 1. (If the positive response value doesn't have a defined value label,
you can just enter the value in the text field.)

A new field is automatically created, in which 1 represents positive responses and 0
represents negative responses, and the analysis is performed on the new field. You
can override the default name and label and provide your own. For this example, we'll
use the field name already provided.

7. Click Run to run the procedure.
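
The automatic recoding described in step 6 can be sketched in Python. This is an illustration only: the function name recode_response and the raw codes are hypothetical, and the handling of missing values and of values other than the positive one is an assumption rather than documented SPSS behavior.

```python
def recode_response(values, positive_value=1):
    """Map each response to 1 if it equals positive_value, else 0.

    Assumptions: every other non-missing value counts as a negative
    response, and missing values (None) stay missing.
    """
    return [None if v is None else int(v == positive_value) for v in values]

# Hypothetical raw codes (not those of dmdata.sav): 1 = Yes, 2 = No.
raw = [1, 2, 2, 1, None, 2]
effectiveness = recode_response(raw, positive_value=1)
print(effectiveness)
```

The analysis is then run on the recoded 0/1 field rather than on the original response field.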

Output

Figure 1. Control Package Test output

The output from the procedure includes a table that displays counts and percentages of
positive and negative responses for each group defined by the Campaign Field and a table
that indicates if the group response rates differ significantly from each other.

o Effectiveness is the recoded version of the response field, where 1 represents positive
responses and 0 represents negative responses.
o The positive response rate for the control package is 3.8%, while the positive response
rate for the test package is 6.2%.

The simple text description below the table indicates that the response rates of the two groups
differ significantly, which means that the higher response rate for the test package
probably isn't the result of random chance. This text table will contain a comparison for each
possible pair of groups included in the analysis. Since there are only two groups in this
example, there is only one comparison. If there are more than five groups, the text
description table is replaced with the Comparison of Column Proportions table.
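
One common way to test whether two response rates differ is a two-proportion z-test. The Python sketch below is offered as an illustration and is not claimed to be the exact test SPSS runs. The group sizes are hypothetical, since the text reports only the rates of 3.8% and 6.2%, and the function name two_proportion_z_test is an assumption.

```python
import math

def two_proportion_z_test(pos_a, n_a, pos_b, n_b):
    """Two-sided z-test for the difference between two response rates."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)   # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical group sizes; only the 3.8% and 6.2% rates come from the text.
z, p_value = two_proportion_z_test(pos_a=38, n_a=1000, pos_b=62, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

The outcome depends heavily on the group sizes: the same two rates observed in much smaller groups would not necessarily produce a significant difference.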

Summary

The Control Package Test compares marketing campaigns to see if there is a significant
difference in effectiveness for different packages or offers. In this example, the positive
response rate of 6.2% for the test package was significantly higher than the positive response rate
of 3.8% for the control package. This suggests that you should use the new package design
instead of the old one, but there may be other factors that you need to consider, such as any
additional costs associated with the new package design.
