A COMPARISON
Objective:
Datastep merge:
A datastep merge is one of the most heavily used programming
constructs in SAS. It combines two or more SAS datasets and
outputs a single dataset containing all the variables from the
input datasets unless specified otherwise. I will be using small
datasets to demonstrate the merging process.
Let us first look at the syntax:
Data dset1;
Merge dset2(in=a) dset3(in=b);
By byvar1 byvar2... ;
If a and b;
Run;
data b;
input id new_name $ ;
datalines;
1 b1
2 b2
3 b3
4 b4
;
run;
Here the final output will be all the employees having their
salary information in the salary dataset. If the info for an
employee is found in only one of the datasets then that employee
will not be present in the final table.
1a) Inner Join with datastep:
The same inner join can be performed with a datastep merge.
data company;
merge employee(in=a) salary(rename=(emp_id=empid) in=b);
by empid;
if a and b;
run;
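A MERGE with a BY statement expects each input to be sorted (or indexed) by the BY variable. As a minimal sketch, assuming the employee and salary datasets are not already in key order, the inner join above could be preceded by two PROC SORT steps:

```sas
/* Sort each input by its own key before merging.          */
/* salary is sorted by emp_id, which the merge renames to  */
/* empid so that both inputs share one BY variable.        */
proc sort data=employee;
  by empid;
run;

proc sort data=salary;
  by emp_id;
run;

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a and b;   /* keep only keys present in both inputs */
run;
```

If the inputs are already in key order, the sorts can be skipped.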
Here the final output will be all the employees, whether or not
they have salary information in the salary dataset. If you want
to list all the employees irrespective of whether their salary
has been updated in the salary dataset, use a left outer join.
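As a rough sketch of such a left outer join in PROC SQL, assuming the same employee and salary tables and the key columns empid and emp_id used elsewhere in this paper (the select list is an assumption modeled on the other examples):

```sas
proc sql;
  create table company as
  select a.empid, b.name, b.salary
  from employee as a left join salary as b
    on a.empid = b.emp_id;   /* every employee row is kept */
quit;
```

Employees with no matching salary row appear with missing values in the columns taken from salary.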
2a) Left outer Join with datastep:
The same join can be performed with a datastep merge. We just
need to change the IF condition.
data company;
merge employee(in=a) salary(rename=(emp_id=empid) in=b);
by empid;
if a;
run;
Here the final output will be all the employees who have their
salary information in the salary dataset, including employees
who are not yet present in the employee dataset but whose salary
info was updated. The situation looks silly, but in this case you
will be using a right outer join.
One point to note here is that we are taking empid from the
employee dataset, so for these non-matching employees of the
salary dataset the empid will be missing, which is definitely not
desirable. So as a precaution we generally use the COALESCE
function in right outer joins.
Proc sql;
create table company as
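As a minimal sketch of the COALESCE precaution described above, assuming the same employee and salary tables with key columns empid and emp_id of the same type, a right outer join might look like this (the select list is an assumption modeled on the other examples in this paper):

```sas
proc sql;
  create table company as
  select coalesce(a.empid, b.emp_id) as empid,  /* first non-missing key */
         b.name,
         b.salary
  from employee as a right join salary as b
    on a.empid = b.emp_id;   /* every salary row is kept */
quit;
```

COALESCE returns the first non-missing argument, so rows that exist only in salary still carry a key value.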
4) Full Join
A full join, as you have rightly guessed by now, outputs the
matching rows from both datasets as well as the non-matching
rows from both datasets. Its syntax is:
Proc sql;
create table company as
select coalesce(a.empid,b.emp_id) as empid, b.name, b.salary
from employee as a full join salary as b
on a.empid=b.emp_id;
Quit;
Here the final output will be all the employees who appear in
either the salary dataset or the employee dataset.
Here also we use the COALESCE function, for the same reason as in
the right outer join.
4a) Full Join with datastep:
The full join can also be performed with a datastep merge. We
just need to drop the IF condition or, for better understanding,
keep the condition as IF A OR B;
data company;
merge employee(in=a) salary(rename=(emp_id=empid) in=b);
by empid;
if a or b;
run;
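The datastep variants in 1a, 2a and 4a differ only in how the IN= flags are combined. The sketch below puts them side by side in a single step. The sample rows, the dept column and the output dataset names (inner_j, left_j, right_j, full_j) are hypothetical and only serve as illustration.

```sas
/* Hypothetical sample data, already in key order */
data employee;
  input empid dept $;
  datalines;
1 HR
2 IT
3 OPS
;
run;

data salary;
  input emp_id name $ salary;
  datalines;
2 Bela 5000
3 Chen 6000
4 Dev 7000
;
run;

/* One merge, four outputs: each IF mirrors one join type */
data inner_j left_j right_j full_j;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a and b then output inner_j;  /* inner join       */
  if a then output left_j;         /* left outer join  */
  if b then output right_j;        /* right outer join */
  output full_j;                   /* full join        */
run;
```

With this sample data, inner_j gets 2 rows (empid 2 and 3), left_j gets 3 (1, 2, 3), right_j gets 3 (2, 3, 4) and full_j gets 4. Because empid is the BY variable, it is populated even for rows that come only from salary, so no COALESCE step is needed in the datastep version.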
5) Cartesian Product
The simplest join in PROC SQL, where the final output is each row
of the first dataset combined with each row of the second. Its
syntax is:
Proc sql;
create table company as
select a.empid, b.name, b.salary
from employee as a , salary as b
;
Quit;
Here the final output will be the product of all rows from the
employee dataset with all rows in the salary dataset.
Practically this looks wrong, as everyone's salary will be
everyone else's salary: perfect Socialism.
But surprisingly this Cartesian product is the basis of all
joins. Conceptually, whenever you specify a join, a Cartesian
product is formed and the output rows are then restricted by the
join conditions. So knowing this is necessary.
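To make the link between the Cartesian product and a join concrete, here is a small sketch assuming the same employee and salary tables. The output table names company_where and company_inner are hypothetical. Both queries should return the same rows:

```sas
proc sql;
  /* Cartesian product restricted by a WHERE condition */
  create table company_where as
  select a.empid, b.name, b.salary
  from employee as a, salary as b
  where a.empid = b.emp_id;

  /* The same result written as an explicit inner join */
  create table company_inner as
  select a.empid, b.name, b.salary
  from employee as a inner join salary as b
    on a.empid = b.emp_id;
quit;
```

This describes the logical model only. The optimizer is free to pick a more efficient way of evaluating the join than building the full product first.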
6) Cartesian Product through a datastep
This technique is not very intuitive but is asked in a lot of
interviews so I am including it here:
data every_combination;
/* Set one of your data sets, usually the larger data set */
set one;
do i=1 to n;
/* For every observation in the first data set, */
/* read in each observation in the second data set */
set two point=i nobs=n;
output;
end;
run;
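To try the technique, a self-contained sketch with two tiny datasets may help. The sample rows and the variable names id, name and code are hypothetical. The two datasets deliberately share no variable names, since a shared variable would be overwritten by the second SET.

```sas
/* Hypothetical sample data: 2 rows and 3 rows */
data one;
  input id name $;
  datalines;
1 a1
2 a2
;
run;

data two;
  input code $;
  datalines;
x
y
z
;
run;

data every_combination;
  set one;                     /* outer loop: one row of ONE per iteration */
  do i = 1 to n;               /* n is filled by NOBS= at compile time     */
    set two point=i nobs=n;    /* direct access to row i of TWO            */
    output;                    /* write one combined row                   */
  end;
run;

proc print data=every_combination;
run;
```

With these inputs, every_combination contains 2 x 3 = 6 observations. The step ends when SET ONE runs out of rows, so no STOP statement is needed here.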
JOIN
By default a Cartesian product is produced; it means joining
each row from one table to every row of the other table.
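Sorting is not required (unlike a datastep MERGE with a BY
statement, which needs its inputs sorted or indexed by the BY
variables).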
Conclusion:
This paper gives basic information about datastep merges and SQL
joins and is intended to be used as a starter or a reference on this topic.
I will be back with some more SAS magic. Goodbye till then.