ABSTRACT
Comparing two datasets is a technique for determining whether
they are equal or contain discrepancies. It can be used to
validate that a dataset was created correctly, or that the
changes to a dataset are only those that were expected, i.e.,
observations were added or removed, or corrections were made.
INTRODUCTION

Observations:
Variables: 3

Unique   WEIGHTKG   dob
20120    101.6      08/19/1949
20130     94.3      06/02/1934
20110     85.7      09/17/1949
20202     99.3      10/17/1931

Observation Summary

Observation   Base   Compare
First Obs        1         1
Last Obs         4         4

Dataset        Created         Modified        NVar   Nobs
WORK.DEMOG     20JAN03:13:17   20JAN03:13:17      2      4
WORK.COMPARE   20JAN03:13:17   20JAN03:13:17      2      4

Vars Summary
# of Vars in Common: 2.
proc sql;
create table Compare as select * from demog;
quit;
Dataset        Created         Modified        NVar   Nobs
WORK.DEMOG     14JAN03:16:03   14JAN03:16:06      3      4
WORK.COMPARE   14JAN03:16:06   14JAN03:16:06      3      4

Observation Summary

# of Obs in Common: 4.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
Variable   Type   Len   Label           Ndif   MaxDif
Unique     CHAR   200   Unique Record      3
WEIGHTKG   NUM      8   Weight in kg       3   15.900
_______________________________________________________
        ||  Unique Record
        ||  Base Value            Compare Value
    Obs ||  Unique                Unique
  _____ ||  ___________________+  ___________________+
        ||
      1 ||  20120                 20202
      3 ||  20110                 20120
      4 ||  20202                 20110
_______________________________________________________
_______________________________________________________
        ||  Weight in kg
        ||       Base    Compare
    Obs ||   WEIGHTKG   WEIGHTKG      Diff.     % Diff
  _____ ||  _________  _________  _________  _________
        ||
      1 ||      101.6       99.3    -2.3000    -2.2638
      3 ||       85.7      101.6    15.9000    18.5531
      4 ||       99.3       85.7   -13.6000   -13.6959
_______________________________________________________
The values from the dataset named to the right of the BASE= (or
DATA=) option appear in the BASE column. The values from the
dataset named to the right of the COMPARE= option appear in the
COMPARE column.
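
As a minimal sketch of the kind of call behind a report like the
one above, the base dataset is named in BASE= and the comparison
dataset in COMPARE=. The dataset names WORK.DEMOG and WORK.COMPARE
are taken from the output shown; the rest is an assumption about
how the report was requested.

```sas
/* Minimal sketch: DEMOG supplies the BASE column,        */
/* COMPARE supplies the COMPARE column of the report.     */
proc compare base=demog compare=compare;
run;
```

With no ID statement, observations are matched by position, which
is why the order of the two datasets matters.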
The two datasets being compared must be in the same order,
UNLESS the order of the datasets is itself what is being
tested.
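
One way to put both datasets in the same order before comparing
them is sketched below. It assumes that the variable UNIQUE
identifies each record in both datasets, as the sample data
suggests; it is an illustration, not necessarily the approach
used for the output in this paper.

```sas
/* Sort both datasets by the assumed key variable UNIQUE. */
proc sort data=demog;
   by unique;
run;

proc sort data=compare;
   by unique;
run;

/* Match observations by key value instead of position.   */
proc compare base=demog compare=compare;
   id unique;
run;
```

With the ID statement, an observation is compared with the
observation that has the same value of UNIQUE in the other
dataset, so a difference in physical order no longer shows up as
a difference in values.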
What happens when the order of the observations in each of the
datasets is different?
The LIST option lists the observations and variables that exist
in one dataset and are missing in the other. Other options
(LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR, LISTCOMP,
LISTCOMPOBS, LISTCOMPVAR, LISTOBS, and LISTVAR) limit this
report to one dataset or the other, and to variables only or
observations only. LISTEQUALVAR additionally lists the variables
that are not in the ID variable list and for which all values
are equal.
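
The sketch below shows where such an option goes in the PROC
COMPARE statement. It assumes both datasets are already sorted
by UNIQUE; LISTALL can be swapped for any of the narrower
options named above.

```sas
/* Report observations and variables found in only one    */
/* of the two datasets. Assumes both are sorted by UNIQUE. */
proc compare base=demog compare=compare listall;
   id unique;
run;
```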
proc sql;
update compare set unique='34343' where
unique='20202';
quit;
_______________________________________________________
          ||  Unique Record
          ||  Base Value            Compare Value
      Obs ||  Unique                Unique
  _______ ||  ___________________+  ___________________+
          ||
        2 ||  20120                 20130
        3 ||  20130                 20202
        4 ||  20202                 34343
_______________________________________________________
Etcetera
has discrepancies.
Variable   Type   Len   Label          Ndif   MaxDif
WEIGHTKG   NUM      8   Weight in kg            0.700
_______________________________________________________
          ||  Weight in kg
          ||       Base    Compare
   Unique ||   WEIGHTKG   WEIGHTKG      Diff.     % Diff
  _______ ||  _________  _________  _________  _________
          ||
    20110 ||       85.7       85.0    -0.7000    -0.8168
    20120 ||      101.6      101.0    -0.6000    -0.5906
NOTE: The MAXPRINT=2 printing limit has been reached
for the variable WEIGHTKG.
No more values will be printed for this comparison.
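
A note like this one appears when the number of differences
printed is capped with MAXPRINT=. The sketch below is one way to
request a cap of two differences per variable; the total of 100
is an arbitrary value chosen for illustration, and both datasets
are assumed to be sorted by UNIQUE.

```sas
/* MAXPRINT=(per-variable, total): print at most 2        */
/* differences per variable and at most 100 overall.      */
/* The value 100 is an arbitrary illustration.            */
proc compare base=demog compare=compare maxprint=(2,100);
   id unique;
run;
```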
transpose;
id unique;
run;
Unique=20110:
Variable   Base Value   Compare      Diff.       % Diff
WEIGHTKG   85.7         85.0         -0.700000   -0.816803
dob        09/17/1949   09/14/1949   -3.000000    0.079830

Unique=20120:
Variable   Base Value   Compare      Diff.       % Diff
WEIGHTKG   101.6        101.0        -0.600000   -0.590551
dob        08/19/1949   08/16/1949   -3.000000    0.079218

Unique=20130:
Variable   Base Value   Compare      Diff.       % Diff
WEIGHTKG   94.3         94.0         -0.300000   -0.318134
dob        06/02/1934   05/30/1934   -3.000000    0.032106
Unique=20202 (Diff. and % Diff columns):
Variable   Diff.       % Diff
WEIGHTKG   -0.300000   -0.302115
dob        -3.000000    0.029118
# of Vars in Common: 3.
# of Vars in WORK.DEMOG but not in WORK.COMPARE: 1.
# of Vars in WORK.COMPARE but not in WORK.DEMOG: 1.
# of ID Vars: 1.
# of VAR Statement Vars: 1.
# of WITH Statement Vars: 1.
Sometimes there are differences between the datasets that are
not important for the purpose of the comparison. In that case,
it is possible to specify a threshold so that only differences
larger than this value are marked as differences.
proc sql;
update compare set nage=nage+0.001;
quit;
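
A threshold of this kind can be set with the CRITERION= option.
The sketch below assumes that AGE exists in WORK.DEMOG and NAGE
in WORK.COMPARE, as the VAR and WITH statements in this paper
suggest, and that both datasets are sorted by UNIQUE. The value
0.01 is only an illustration: it is larger than the 0.001 added
above, so those changes would not be reported.

```sas
/* METHOD=ABSOLUTE: values are judged unequal only when   */
/* abs(base - compare) exceeds CRITERION=.                */
/* 0.01 is an illustrative threshold, not a recommended   */
/* setting.                                               */
proc compare base=demog compare=compare
             method=absolute criterion=0.01;
   id unique;
   var age;      /* variable in the base dataset          */
   with nage;    /* matching variable in the compare one  */
run;
```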
Unique=20202:
Variable   Base Value   Compare
WEIGHTKG   99.3         99.0
dob        10/17/1931   10/14/1931
var age;
with nage;
run;
id unique ;
run;
Obs   Unique   _OBS_   _NAME_     _LABEL_        BASE    COMPARE
  1   20110        1   WEIGHTKG   Weight in kg    85.7        85
  3   20120        2   WEIGHTKG   Weight in kg   101.6       101
  5   20130        3   WEIGHTKG   Weight in kg    94.3        94
  7   20202        4   WEIGHTKG   Weight in kg    99.3        99
Variable   Type   Len   Label          Ndif   MaxDif
WEIGHTKG   NUM      8   Weight in kg            0.700
CONCLUSION
Obs   _TYPE_   _OBS_
  1   DIF          1
  2   DIF          2
  3   DIF          3
  4   DIF          4
ACKNOWLEDGMENTS
I want to thank Craig Mauldin, Andy Barnett, George Clark,
Bonnie Duncan, and Kim Sturgen for their comments and support.
CONTACT INFORMATION
Your comments and questions are valued and encouraged.
Contact the author at:
Unique   WEIGHTKG   dob
20110    -0.7       E
20120    -0.6       E
20130    -0.3       E
20202    -0.3       E
SAS and all other SAS Institute Inc. product or service names are
registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their
respective companies.
A. Cecilia Casas
PPD Development
3900 Paramount Parkway
Morrisville, NC 27560
Phone: (919) 462-4199
Fax: (919) 379-6151
Email: Cecilia.Casas@rtp.ppdi.com
_TYPE_    Unique   WEIGHTKG   dob
BASE      20110     85.7      09/17/1949
COMPARE   20110     85.0      09/17/1949
BASE      20120    101.6      08/19/1949
COMPARE   20120    101.0      08/19/1949
BASE      20130     94.3      06/02/1934
COMPARE   20130     94.0      06/02/1934
BASE      20202     99.3      10/17/1931
COMPARE   20202     99.0      10/17/1931
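
A listing with BASE and COMPARE rows like this one has the shape
of an output dataset written by the OUT= option. The sketch
below is one way to request such a dataset. The name DIFFS is
hypothetical, the particular combination of options is an
assumption, and both datasets are assumed to be sorted by UNIQUE.

```sas
/* Write the comparison to a dataset instead of a report. */
/* OUTBASE / OUTCOMP keep the rows from each dataset,     */
/* OUTDIF adds a row of differences, and OUTNOEQUAL drops */
/* observations whose compared values are all equal.      */
/* DIFFS is a hypothetical output dataset name.           */
proc compare base=demog compare=compare
             out=diffs outbase outcomp outdif outnoequal
             noprint;
   id unique;
run;

proc print data=diffs;
run;
```

The _TYPE_ variable in the output dataset identifies whether a
row came from the base dataset, the comparison dataset, or the
computed differences.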