Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC

Proc Compare to Validate Datasets

ABSTRACT
Comparing two datasets is a technique for determining whether they
are equal or have discrepancies. It can be used to validate that a
dataset was created correctly, or that the changes to a dataset are
only those that are expected, i.e., that observations were added or
removed or that corrections were made.

INTRODUCTION
Proc Compare is a procedure that compares two datasets: their
properties, their number of observations and variables, and the
attributes and values of their variables.
For a variable, you can get output about differences in type,
length, format, informat and label.
For a dataset, you can find differences in the creation date, the
last modification date, and the number of variables and
observations. You can also see the labels of the datasets, but
differences in them are not reported.
For observations, you can compare the value of each variable in
each record. You can also decide how different the values may be
before they are reported.

THE DATASETS THAT WILL BE COMPARED
proc contents data=demog;
run;

Data Set Name: WORK.DEMOG     Observations: 4
Member Type:   DATA           Variables:    3

--Alphabetic List of Variables and Attributes--
Variable  Type  Len  Pos  Format     Label
-----------------------------------------------------
UNIQUE    Char  200   32             Unique Record
WEIGHTKG  Num     8    8  5.1        Weight in kg
dob       Num     8   24  MMDDYY10.  Date of Birth

proc print data=demog noobs;
run;

Unique  WEIGHTKG  dob
20120      101.6  08/19/1949
20130       94.3  06/02/1934
20110       85.7  09/17/1949
20202       99.3  10/17/1931

/* COMPARE starts out as an exact copy of DEMOG. */
proc sql;
   create table Compare as select * from demog;
quit;

AN EXAMPLE OF A CLEAN COMPARE OUTPUT
proc compare base=demog compare=compare;
run;

The COMPARE Procedure
Comparison of WORK.DEMOG with WORK.COMPARE
(Method=EXACT)

Data Set Summary
Dataset       Created        Modified       NVar  NObs
WORK.DEMOG    14JAN03:16:03  14JAN03:16:06     3     4
WORK.COMPARE  14JAN03:16:06  14JAN03:16:06     3     4

Variables Summary
Number of Variables in Common: 3.

Observation Summary
Observation  Base  Compare
First Obs       1        1
Last Obs        4        4

Number of Observations in Common: 4.
Total Number of Observations Read from WORK.DEMOG: 4.
Total Number of Observations Read from WORK.COMPARE: 4.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 4.

NOTE: No unequal values were found. All values compared are
exactly equal.

The records in DEMOG and the records in COMPARE have the same
order, so an ID statement is not needed. However, if the order of
the two datasets is not the same, records might be compared
incorrectly.

WHEN THE DATASETS HAVE A DIFFERENT ORDER
What happens when the order of the observations in each of the
datasets is different?

proc sort data=compare;
   by DOB;
run;

As in other SAS procedures and the data step, it is possible to
select the observations and variables in the datasets that are
going to be used.

proc compare base=demog (keep=unique weightkg)
             compare=compare (keep=unique weightkg);
run;

Data Set Summary
Dataset       Created        Modified       NVar  NObs
WORK.DEMOG    20JAN03:13:17  20JAN03:13:17     2     4
WORK.COMPARE  20JAN03:13:17  20JAN03:13:17     2     4

Vars Summary
# of Vars in Common: 2.

Observation Summary
Observation  Base  Compare
First Obs       1        1
Last Obs        4        4

# of Obs in Common: 4.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
# of Obs with Some Compared Vars Unequal: 3.
# of Obs with All Compared Vars Equal: 1.

Values Comparison Summary
# of Vars Compared with All Obs Equal: 0.
# of Vars Compared with Some Obs Unequal: 2.
Total # of Values which Compare Unequal: 6.
Maximum Difference: 15.9.

All Vars Compared have Unequal Values
Variable  Type  Len  Label          Ndif  MaxDif
Unique    CHAR  200  Unique Record     3
WEIGHTKG  NUM     8  Weight in kg      3  15.900

Value Comparison Results for Vars
_______________________________________________________
          ||  Unique Record
          ||  Base Value            Compare Value
    Obs   ||  Unique                Unique
 _______  ||  ___________________+  ___________________+
          ||
       1  ||  20120                 20202
       3  ||  20110                 20120
       4  ||  20202                 20110
_______________________________________________________
_______________________________________________________
          ||  Weight in kg
          ||       Base    Compare
    Obs   ||   WEIGHTKG   WEIGHTKG      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
       1  ||      101.6       99.3    -2.3000    -2.2638
       3  ||       85.7      101.6    15.9000    18.5531
       4  ||       99.3       85.7   -13.6000   -13.6959
_______________________________________________________

The values from the dataset named in the BASE= (or DATA=) option
are in the Base column. The values from the dataset named in the
COMPARE= option are in the Compare column.
It is necessary that the two datasets that are going to be
compared have the same order UNLESS the order of the datasets is
itself what is being tested.
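One way to rule out ordering problems is to sort both datasets by the same key before comparing them. The following is only a minimal sketch, assuming that UNIQUE identifies each record in DEMOG and COMPARE; the output dataset names demog_srt and compare_srt are hypothetical and are used so that the original datasets keep their order.

```sas
/* Sort copies of both datasets by the key; the originals are left untouched. */
/* demog_srt and compare_srt are hypothetical names chosen for this sketch.   */
proc sort data=demog out=demog_srt;
   by unique;
run;

proc sort data=compare out=compare_srt;
   by unique;
run;

/* Both copies now share the same order, so records are paired by position. */
proc compare base=demog_srt compare=compare_srt;
run;
```

Sorting only guarantees a correct pairing when both datasets contain the same key values; when keys can be missing from one side, the pairing has to be made explicit.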
To see why the pairing of observations matters, change one value
of UNIQUE in COMPARE and sort the dataset by UNIQUE.

proc sql;
   update compare set unique='34343' where
      unique='20120';
quit;

proc sort data=compare;
   by unique;
run;

proc compare base=demog(keep = unique weightkg)
             compare=compare(keep = unique weightkg);
run;

_______________________________________________________
          ||  Unique Record
          ||  Base Value            Compare Value
    Obs   ||  Unique                Unique
 _______  ||  ___________________+  ___________________+
          ||
       2  ||  20120                 20130
       3  ||  20130                 20202
       4  ||  20202                 34343
_______________________________________________________
Etcetera

Even though there is only one discrepancy (in UNIQUE) between the
datasets, several discrepancies are reported. It is necessary to
specify which observations should be compared with each other.
The statement that tells Proc Compare which observations belong
together is ID. After it, list the variables that define which
observation in each of the datasets should be compared. This is
similar to the way datasets are merged using a BY statement.

proc compare base=demog compare=compare;
   id unique;
run;

# of Obs in Common: 3.
# of Obs in WORK.DEMOG but not in WORK.COMPARE: 1.
# of Obs in WORK.COMPARE but not in WORK.DEMOG: 1.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
# of Obs with Some Compared Variables Unequal: 0.
# of Obs with All Compared Variables Equal: 3.

NOTE: No unequal values were found. All values compared
are exactly equal.

INFORMATION ABOUT VARIABLES OR OBSERVATIONS THAT ARE DIFFERENT IN
THE DATASETS BEING COMPARED
The output shows no differences, but it is easy to miss the fact
that only 3 of the 4 observations were compared. Also, you don't
know which observations were not compared unless you use the LIST
or LISTALL option.

proc compare data=demog
             compare=compare
             list;
   id unique;
run;

Observation 2 in WORK.DEMOG not found in WORK.COMPARE:
Unique=20120.
Observation 4 in WORK.COMPARE not found in WORK.DEMOG:
Unique=34343.

The LIST option lists the observations and variables that exist
in one dataset and are missing from the other. Other options
(LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR, LISTCOMP,
LISTCOMPOBS, LISTCOMPVAR, LISTOBS and LISTVAR) limit this report
to one or the other dataset, and to only variables and/or
observations. LISTEQUALVAR also lists the variables, other than
those in the ID statement, for which all values are equal.

LIMITING THE PRINTED OUTPUT
Sometimes it is not necessary to see every observation that has
an error; once it is known that an error exists, the full list of
discrepancies may not be needed.
To limit the size of the output, the MAXPRINT option is very
useful.
First it is necessary to go back to the original dataset and
modify the variable WEIGHTKG to have something to report.
For example, let's change the value of the variable in the ID
statement from 34343 back to 20120 and truncate WEIGHTKG.

proc sql;
   update compare set unique='20120'
      where unique='34343';
   update compare set weightkg=int(weightkg);
quit;

proc sort data=compare;
   by unique;
run;

proc compare data=demog
             compare=compare
             maxprint=(4,2);
   id unique;
run;

In MAXPRINT=(per-variable, total), the first number limits the
differences printed for each variable and the second limits the
differences printed overall, so here printing stops after 2
differences.

Variable  Type  Len  Label         Ndif  MaxDif
WEIGHTKG  NUM     8  Weight in kg     4   0.700

_______________________________________________________
          ||  Weight in kg
          ||       Base    Compare
 Unique   ||   WEIGHTKG   WEIGHTKG      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
 20110    ||       85.7       85.0    -0.7000    -0.8168
 20120    ||      101.6      101.0    -0.6000    -0.5906

NOTE: The MAXPRINT=2 printing limit has been reached
for the variable WEIGHTKG.
No more values will be printed for this comparison.
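When it is enough to know whether any difference exists, the printed report can be skipped altogether and the procedure's return code checked instead. The sketch below relies on the automatic macro variable SYSINFO, which Proc Compare sets when it finishes; the macro variable name compare_rc is hypothetical, and both datasets are assumed to be sorted by UNIQUE.

```sas
/* Run the comparison without printed output; only the return code is used. */
proc compare base=demog compare=compare noprint;
   id unique;
run;

/* SYSINFO is overwritten by later steps, so copy it immediately.      */
/* compare_rc is a hypothetical name used only in this sketch.         */
%let compare_rc = &sysinfo;

data _null_;
   if &compare_rc = 0 then put 'NOTE: No differences were found.';
   else put "WARNING: PROC COMPARE return code = &compare_rc";
run;
```

A return code of 0 means that no differences of any kind were detected; any other value is a sum of codes that each flag one kind of difference, so a non-zero result is a signal to rerun the comparison with printed output.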
PRINTING RELATED INFORMATION TOGETHER
Sometimes it is desirable to see the discrepancies grouped
together by the variables in the ID statement.

proc compare data=demog
             compare=compare
             transpose
             ;
   id unique ;
run;

Unique=20110:
Variable  Base Value  Compare         Diff.     % Diff
WEIGHTKG  85.7        85.0        -0.700000  -0.816803
dob       09/17/1949  09/14/1949  -3.000000   0.079830

Unique=20120:
Variable  Base Value  Compare         Diff.     % Diff
WEIGHTKG  101.6       101.0       -0.600000  -0.590551
dob       08/19/1949  08/16/1949  -3.000000   0.079218

Unique=20130:
Variable  Base Value  Compare         Diff.     % Diff
WEIGHTKG  94.3        94.0        -0.300000  -0.318134
dob       06/02/1934  05/30/1934  -3.000000   0.032106

Unique=20202:
Variable  Base Value  Compare         Diff.     % Diff
WEIGHTKG  99.3        99.0        -0.300000  -0.302115
dob       10/17/1931  10/14/1931  -3.000000   0.029118

COMPARING VARIABLES WITH DIFFERENT NAMES
Sometimes it is necessary to compare datasets when it is known
that the variable names are different, yet they should have the
same values.

proc sql;
   /* Add the same derived age under a different name in each dataset. */
   alter table compare
      add nage num label "Numeric age";
   alter table demog
      add age num label "New Numeric age";
   update demog
      set age=int((date()-dob)/365.25);
   update compare
      set nage=int((date()-dob)/365.25);
quit;

proc compare data=demog
             compare=compare;
   id unique;
   var age;
   with nage;
run;

# of Vars in Common: 3.
# of Vars in WORK.DEMOG but not in WORK.COMPARE: 1.
# of Vars in WORK.COMPARE but not in WORK.DEMOG: 1.
# of ID Vars: 1.
# of VAR Statement Vars: 1.
# of WITH Statement Vars: 1.

Sometimes there are differences between the datasets that are not
important for the purpose of the comparison. For this scenario,
it is possible to give a criterion value; only values that differ
by more than this criterion are marked as a difference.

proc sql;
   update compare set nage=nage+0.001;
quit;

proc compare data=demog
             compare=compare
             criterion=.01;
   var age;
   with nage;
run;

The COMPARE Procedure
Comparison of WORK.DEMOG with WORK.COMPARE
(Method=RELATIVE(2.21E-12), Criterion=0.01)

Variables Summary
Number of Variables in Common: 3.
Number of Variables in WORK.DEMOG but not in WORK.COMPARE: 1.
Number of Variables in WORK.COMPARE but not in WORK.DEMOG: 1.
Number of VAR Statement Variables: 1.
Number of WITH Statement Variables: 1.

Number of Observations in Common: 4.
Total Number of Observations Read from WORK.DEMOG: 4.
Total Number of Observations Read from WORK.COMPARE: 4.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 4.

Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 0.
Total Number of Values which Compare Unequal: 0.
Total Number of Values not EXACTLY Equal: 4.
Maximum Difference Criterion Value: 0.000018868.
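The header of the output above shows that Proc Compare used Method=RELATIVE when only CRITERION= was given. If the tolerance is meant as a plain difference between the two values, the METHOD=ABSOLUTE option can be stated explicitly. This is a sketch under the assumption that AGE in DEMOG and NAGE in COMPARE exist as created earlier; the tolerance of 0.01 is only an illustration.

```sas
/* Treat AGE and NAGE as equal when they differ by no more than 0.01. */
/* METHOD=ABSOLUTE compares the absolute difference to CRITERION=.    */
proc compare data=demog
             compare=compare
             method=absolute
             criterion=0.01;
   var age;
   with nage;
run;
```

Stating the method explicitly makes the tolerance easier to explain in a validation report, because the reader does not need to know which method the procedure picks by default.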
CREATING AN OUTPUT DATASET
Sometimes it is desirable to create an output dataset that
contains only the equalities or only the discrepancies. Here is
an example that creates a dataset with the differences.
The OUTNOEQUAL option specifies that only the observations with
discrepancies will be written to the dataset TOPRINT.

proc compare data=demog
             compare=compare
             outnoequal
             out=toprint
             ;
   id unique ;
run;

Variables with Unequal Values
Variable  Type  Len  Label         Ndif  MaxDif
WEIGHTKG  NUM     8  Weight in kg     4   0.700

Value Comparison Results for Variables
_______________________________________________________
          ||  Weight in kg
          ||       Base    Compare
 Unique   ||   WEIGHTKG   WEIGHTKG      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
 20110    ||       85.7       85.0    -0.7000    -0.8168
 20120    ||      101.6      101.0    -0.6000    -0.5906
 20130    ||       94.3       94.0    -0.3000    -0.3181
 20202    ||       99.3       99.0    -0.3000    -0.3021
_______________________________________________________

Proc Print data=toprint;
run;

OUTPUT OF PROC PRINT
Obs  _TYPE_  _OBS_  Unique  WEIGHTKG  dob
  1  DIF         1  20110       -0.7    E
  2  DIF         2  20120       -0.6    E
  3  DIF         3  20130       -0.3    E
  4  DIF         4  20202       -0.3    E

It is possible to use the NOPRINT option, but then it is strongly
recommended to use the options OUTCOMP and OUTBASE so that the
base and compare values can be told apart.

proc compare data=demog
             compare=compare
             outnoequal
             out=toprint
             noprint
             outcomp
             outbase;
   id unique ;
run;

proc print data=toprint;
run;

Obs  _TYPE_   _OBS_  Unique  WEIGHTKG  dob
  1  BASE         1  20110       85.7  09/17/1949
  2  COMPARE      1  20110       85.0  09/17/1949
  3  BASE         2  20120      101.6  08/19/1949
  4  COMPARE      2  20120      101.0  08/19/1949
  5  BASE         3  20130       94.3  06/02/1934
  6  COMPARE      3  20130       94.0  06/02/1934
  7  BASE         4  20202       99.3  10/17/1931
  8  COMPARE      4  20202       99.0  10/17/1931

A USEFUL DATASET TO REPORT
To get only the observations and variables that have
discrepancies, Proc Transpose can be a big help in reporting the
findings.

/* One row per variable, with the BASE and COMPARE values side by side. */
proc transpose data=toprint out=transp;
   by unique _obs_;
   id _type_;
run;

/* Keep only the variables whose values differ. */
proc print data=transp(where=(base^=compare));
run;

Obs  Unique  _OBS_  _NAME_    _LABEL_       BASE   COMPARE
  1  20110       1  WEIGHTKG  Weight in kg   85.7       85
  3  20120       2  WEIGHTKG  Weight in kg  101.6      101
  5  20130       3  WEIGHTKG  Weight in kg   94.3       94
  7  20202       4  WEIGHTKG  Weight in kg   99.3       99

CONCLUSION
You are the person who decides what is important to validate:
the values of the observations and how different they can be, and
whether it is important to have the same labels, formats and
informats for a variable. Proc Compare is a flexible tool that
lets you find the differences that are important to you.

ACKNOWLEDGMENTS
I want to thank Craig Mauldin, Andy Barnett, George Clark,
Bonnie Duncan and Kim Sturgen for their comments and support.

CONTACT INFORMATION
Your comments and questions are valued and encouraged.
Contact the author at:
A. Cecilia Casas
PPD Development
3900 Paramount Parkway
Morrisville, NC 27560
Phone: (919) 462-4199
Fax: (919) 379-6151
Email: Cecilia.Casas@rtp.ppdi.com

SAS and all other SAS Institute Inc. product or service names are
registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their
respective companies.
