Anda di halaman 1dari 4

Proc Compare to Validate Datasets

Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC


Variables Summary
Number of Variables in Common: 3.

ABSTRACT
Comparison of two datasets is a technique used to know that two
datasets are equal or if they have discrepancies. This method
can be used to validate that a dataset has been created correctly
or that the changes in a dataset are only those that are expected.
I.e. observations were added or removed or corrections were
done.

Proc Compare is a procedure that allows two datasets to be


compared for properties, number of observations, number of
variables, and properties of the datasets.
For a variable, you can get output about differences in:
Type, length, formats, informats and label(s).
For a dataset, you can find differences in:
date of creation, last modification of the datasets was modified,
number of variables and observations of the datasets. You can
also see the labels of the datasets, but differences are not
reported for that.
For observations, you can compare the value of the record for
each variable. Also, you can decide how different the values of
the observations can be.

101.6
94.3
85.7
99.3

1
4

1
4

First Obs
Last Obs

The records in DEMOG and the records in compare have the


same order and an ID statement is not needed. However if the
order of the two datasets is not the same, records might be
compared incorrectly.

WHEN THE DATASETS HAVE A DIFFERENT


ORDER;

dob

proc sort data=compare;


by DOB;
run;

08/19/1949
06/02/1934
09/17/1949
10/17/1931

proc compare base=demog (keep=unique weightkg)


compare=compare (keep=unique weightkg);
run;

proc contents data=demog;


Data Set Name: WORK.DEMOG
Member Type:
DATA

Compare

NOTE: No unequal values were found. All values


compared are exactly equal.

proc print data=demog noobs;

20120
20130
20110
20202

Base

Number of Observations with Some Compared Variables


Unequal: 0.
Number of Observations with All Compared Variables
Equal: 4.

THE DATASETS THAT WILL BE COMPARED.

WEIGHTKG

Observation

Number of Observations in Common: 4.


Total Number of Observations Read from
WORK.DEMOG: 4.
Total Number of Observations Read from
WORK.COMPARE: 4.

INTRODUCTION

Unique

Observation Summary

Observations:
Variables: 3

As in other SAS procedures and data step, it is possible to select


observations and variables in the datasets that are going to be
used.

--Alphabetic List of Variables and Attributes--

Dataset

Variable Type Len Pos Format


Label
----------------------------------------------UNIQUE
Char 200 32
Unique Record
WEIGHTKG Num
8
8 5.1
Weight in kg
dob
Num
8 24 MMDDYY10. Date of Birth

Created

Modified

WORK.DEMOG
20JAN03:13:17
WORK.COMPARE 20JAN03:13:17

NVar

Nobs

2
2

4
4

20JAN03:13:17
20JAN03:13:17

Vars Summary
# of Vars in Common: 2.

proc sql;
create table Compare as select * from demog;
quit;

Observation Summary
Observation

AN EXAMPLE OF A CLEAN COMPARE OUTPUT

Values Comparison Summary

The COMPARE Procedure


Comparison of WORK.DEMOG with WORK.COMPARE
(Method=EXACT)

# of Vars Compared with All Obs Equal: 0.


# of Vars Compared with Some Obs Unequal: 2.
Total # of Values which Compare Unequal: 6.
Maximum Difference: 15.9.

Data Set Summary


Created

Compare

# of Obs with Some Compared Vars Unequal: 3.


# of Obs with All Compared Vars Equal: 1.

proc compare base=here.demog compare=compare;


run;

Dataset

Base

# of Obs in Common: 4.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.

All Vars Compared have Unequal Values

Modified

WORK.DEMOG
14JAN03:16:03 14JAN03:16:06
WORK.COMPARE 14JAN03:16:06 14JAN03:16:06

Nvar Nobs
3
3

4
4

Variable
Unique
WEIGHTKG

Type
CHAR
NUM

Len
200
8

Label
Unique Record
Weight in kg

Value Comparison Results for Vars

Ndif
3
3

MaxDif
15.900

______________________________________________________
|| Unique Record
|| Base Value
Compare Value
Obs || Unique
Unique
_____ || ___________________+ ___________________+
||
1 || 20120
20202
3 || 20110
20120
4 || 20202
20110
_______________________________________________________
_______________________________________________________
|| Weight in kg
||
Base
Compare
Obs ||
WEIGHTKG
WEIGHTKG
Diff.
% Diff
_____ || _________ _________ _________ _________
||
1 ||
101.6
99.3
-2.3000
-2.2638
3 ||
85.7
101.6
15.9000
18.5531
4 ||
99.3
85.7
-13.6000
-13.6959
_______________________________________________________

The values of the variable that was to the right of the BASE or
DATA statement is in the BASE column. The values of dataset
listed to right of the COMPARE dataset is in the COMPARE
column.
It is necessary that the two datasets that are going to be
compared have the same order UNLESS the order of the
datasets is something that is being tested.
What happens when the order of the observations in each of the
datasets is different?

# of Obs with Some Compared Variables Unequal: 0.


# of Obs with All Compared Variables Equal: 3.
NOTE: No unequal values were found. All values compared
are exactly equal.

INFORMATION ABOUT VARIABLES OR


OBSERVATIONS THAT ARE IN DIFFERENT IN
THE DATASETS BEING COMPARED.
The output there are no differences, but it might be easy to miss
the fact that only 3 of the 4 observations were compared. Also,
you dont know which observations were not compared unless
you use the LIST or LISTALL option.
proc compare data=base
compare=compare
list;
id unique;
run;

Observation 2 in WORK.DEMOG not found in WORK.COMPARE:


Unique=20120.
Observation 4 in WORK.COMPARE not found in WORK.DEMOG:
Unique=34343.

proc sort data=compare;


by unique;
run;

The LIST option will list the observations and variables that exist
in one dataset and are missing in the other. There are other
options, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR,
LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, AND
LISTVAR, to limit this report to one or the other dataset and only
variables and / or observations. LISTEQUALVAR will also list
the variables that are not used in the ID variable list for which all
values are equal.

proc compare base=demog(keep = unique weightkg)


compare=compare(keep = unique weightkg);
run;

LIMITING THE PRINTED OUTPUT

proc sql;
update compare set unique='34343' where
unique='20202';
quit;

_______________________________________________________
|| Unique Record
|| Base Value
Compare Value
Obs || Unique
Unique
_______ || ___________________+ ___________________+
||
2 || 20120
20130
3 || 20130
20202
4 || 20202
34343
_______________________________________________________
Etcetera

Even though, there is only one discrepancy (UNIQUE) in the


datasets, several discrepancies are reported. It is necessary to
specify which observations should be compared .
The option that tells proc compare, which variables should be
together, is ID. After it, you should list the number of variables
that define which observation in each of the datasets should be
compared. This is similar to the way datasets are merged using
a by statement.
proc compare base=demog compare=compare;
id unique;
run;
# of Obs in Common: 3.
# of Obs in WORK.DEMOG but not in WORK.COMPARE: 1.
# of Obs in WORK.COMPARE but not in WORK.DEMOG: 1.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.

Sometimes it is not necessary to know all the observations that


have an error, once it is known that an error exists, it might not be
needed to know all the discrepancies.
It might be desirable to limit the size of the output; for that, the
option MAXPRINT is very useful.
First it is necessary to use the original dataset and modify the
variable WEIGHTKG to have something to report.
For example, lets change the value of one of the variables in the
ID statement from 34343 to 20120.
proc sql;
update compare set unique='20120'
where unique='34343';
update compare set weightkg=int(weightkg);
quit;
proc sort data=compare;
by unique;
run;
proc compare data=demog
compare=compare
maxprint=(4,2);
id unique ;
run;
At the most, 4 differences will be printed, 2 for every variable that

has discrepancies.
Variable

Type

WEIGHTKG

NUM

Len
8

Label

Ndif

MaxDif

0.700

Weight in kg

_______________________________________________________
|| Weight in kg
||
Base
Compare
Unique
||
WEIGHTKG
WEIGHTKG
Diff.
% Diff
_______ || _________ _________ _________ _________
||
20110
||
85.7
85.0
-0.7000
-0.8168
20120
||
101.6
101.0
-0.6000
-0.5906
NOTE: The MAXPRINT=2 printing limit has been reached
for the variable WEIGHTKG.
No more values will be printed for this comparison.

PRINTING RELATED INFORMATION


TOGETHER.

proc compare data=demog


compare=compare

transpose
;
id unique ;
run;
Unique=20110:
Variable Base Value
Compare
WEIGHTKG
85.7
85.0
dob 09/17/1949 09/14/1949

Diff.
-0.700000
-3.000000

% Diff
-0.816803
0.079830

Unique=20120:
Variable Base Value
Compare
WEIGHTKG
101.6
101.0
dob 08/19/1949 08/16/1949

Diff.
-0.600000
-3.000000

% Diff
-0.590551
0.079218

Unique=20130:
Variable Base Value
Compare
WEIGHTKG
94.3
94.0
dob 06/02/1934 05/30/1934

Diff.
-0.300000
-3.000000

% Diff
-0.318134
0.032106

Diff.
-0.300000
-3.000000

% Diff
-0.302115
0.029118

COMPARING VARIABLES WITH DIFFERENT


NAMES.
Sometimes, it is necessary to compare datasets when it is known
that the variable names are different yet they should have the
same values.
Proc sql;
alter table compare
add nage num label "Numeric age";
alter table demog
add age num label "New Numeric age";
update demog
set
age=int((date()-dob)/365.25);
update compare set
nage=int((date()-dob)/365.25);
quit;
proc compare data=demog
compare=compare;
id unique;

# of Vars in Common: 3.
# of Vars in WORK.DEMOG but not in WORK.COMPARE:
1.
# of Vars in WORK.COMPARE but not in WORK.DEMOG:
1.
# of ID Vars: 1.
# of VAR Statement Vars: 1.
# of WITH Statement Vars: 1.
Sometimes, there are differences in the datasets that are not
important for the purpose of the comparison. For this scenario, it
is possible to give a value for which only observations that are
bigger than this number will be marked as a difference.
proc sql;
update compare set nage=nage+0.001;
quit;

Sometimes it is desired to see the information with discrepancies


grouped together by the variables in the ID statement.

Unique=20202:
Variable Base Value
Compare
WEIGHTKG
99.3
99.0
dob 10/17/1931 10/14/1931

var age;
with nage;
run;

proc compare data=demog


compare=compare
criterion=.01;
var age;
with nage;
run;

The COMPARE Procedure


Comparison of WORK.DEMOG with WORK.COMPARE
(Method=RELATIVE(2.21E-12), Criterion=0.01)
Variables Summary
Number of Variables in Common: 3.
Number of Variables in WORK.DEMOG but not in
WORK.COMPARE: 1.
Number of Variables in WORK.COMPARE but not in
WORK.DEMOG: 1.
Number of VAR Statement Variables: 1.
Number of WITH Statement Variables: 1.
Number of Observations in Common: 4.
Total Number of Observations Read from WORK.DEMOG: 4.
Total Number of Observations Read from WORK.COMPARE:
4.
Number of Observations with Some Compared Variables
Unequal: 0.
Number of Observations with All Compared Variables
Equal: 4.
Values Comparison Summary
Number of Variables Compared with All Observations
Equal: 1.
Number of Variables Compared with Some Observations
Unequal: 0.
Total Number of Values which Compare Unequal: 0.
Total Number of Values not EXACTLY Equal: 4.
Maximum Difference Criterion Value: 0.000018868.

CREATING AN OUTPUT DATASET


Sometimes, it is desirable to create an output dataset that
contains only the equalities or discrepancies. Here, there is an
example to create a dataset that has the differences.
The outnoequal option, specifies that only the observations with
discrepancies will exist in the dataset toprint.
proc compare data=demog
compare=compare
outnoequal
out=toprint
;

id unique ;
run;

Obs
1
3
5
7

Proc Print data=toprint;


run;

Unique
20110
20120
20130
20202

_OBS_
1
2
3
4

_NAME_
WEIGHTKG
WEIGHTKG
WEIGHTKG
WEIGHTKG

_LABEL_
Weight in
Weight in
Weight in
Weight in

kg
kg
kg
kg

BASE COMPARE
85.7
85
101.6
101
94.3
94
99.3
99

Variables with Unequal Values


Variable

Type

WEIGHTKG

NUM

Len
8

Label

Ndif

MaxDif

0.700

Weight in kg

CONCLUSION

Value Comparison Results for Variables


_______________________________________________________
||Weight in kg
||
Base
Compare
Unique
|| WEIGHTKG
WEIGHTKG
Diff.
% Diff
_________ || ________
________ ________
_________
||
20110
||
85.7
85.0
-0.7000
-0.8168
20120
||
101.6
101.0
-0.6000
-0.5906
20130
||
94.3
94.0
-0.3000
-0.3181
20202
||
99.3
99.0
-0.3000
-0.3021
______________________________________________________

_TYPE_

1
2
3
4

DIF
DIF
DIF
DIF

_OBS_
1
2
3
4

ACKNOWLEDGMENTS
I want to thank, Craig Mauldin, Andy Barnett, George Clark,
Bonnie Duncan and Kim Sturgen for their comments and support.

CONTACT INFORMATION
Your comments and questions are valued and encouraged.
Contact the author at:

OUTPUT OF PROC PRINT


Obs

You are the person who decides what is important to validate,


values of the observations and how different can they be. Is it
important to have the same labels, formats and informats in a
variable. Proc Compare is a tool that is flexible to allow you find
the differences that are important to you.

Unique

WEIGHTKG

20110
20120
20130
20202

dob

-0.7
-0.6
-0.3
-0.3

E
E
E
E

It is possible to use the noprint option, but then it is strongly


recommended to use the options OUTCOMP AND OUTBASE to
differentiate the variables.
proc compare data=demog
compare=compare
outnoequal
out=toprint
noprint
outcomp
outbase;
id unique ;
run;

SAS and all other SAS Institute Inc. product or service names are
registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. indicates USA registration.
Other brand and product names are trademarks of their
respective companies.

proc print data=toprint;


run;
_OBS_ Unique
1
20110

A. Cecilia Casas
PPD Development
3900 Paramount Parkway
Morrisville, NC 27560
Phone: (919) 462-4199
Fax: (919) 379-6151
Email: Cecilia.Casas@rtp.ppdi.com

Obs
1

_TYPE_
BASE

WEIGHTKG
85.7

dob
09/17/1949

COMPARE

BASE

20110

85.0

09/17/1949

20120

101.6

COMPARE

20120

101.0

08/19/1949
08/19/1949

BASE

20130

94.3

06/02/1934

COMPARE

20130

94.0

06/02/1934

BASE

20202

99.3

10/17/1931

COMPARE

20202

99.0

10/17/1931

A USEFUL DATASET TO REPORT


To get only the observations and variables that have
discrepancies, proc transpose can be a big help in reporting the
findings.
proc transpose data=toprint out=transp;
by unique _obs_;
id _type_;
run;
proc print data=transp(where=(base^=compare));
run;

Anda mungkin juga menyukai