SAS Data Manipulation: Combining and Updating a Dataset
As a data analyst or scientist, working with datasets in SAS can be an efficient way to process and analyze large amounts of information. One common task is to combine and update a dataset to achieve a desired output format. In this article, we will explore how to accomplish this using SAS procedures.
Understanding the Challenge
The original question presents a dataset, dimension, whose columns hold keys (rk, parent_rk) and names (nm) for up to three levels of a hierarchy. The goal is to simplify the data by merging and updating certain columns to produce a new table. We’ll break down the problem step by step and explain each part.
Dataset Structure
Let’s examine the structure of the dimension dataset:
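rk | 1_nm | 2_rk | 2_nm  | 2_parent_rk | 3_rk | 3_nm  | 3_parent_rk
1  | one  | -    | -     | -           | -    | -     | -
2  | two  | -    | -     | -           | -    | -     | -
3  | -    | 3    | three | 1           | -    | -     | -
4  | -    | 4    | four  | 1           | -    | -     | -
5  | -    | 5    | five  | 2           | -    | -     | -
6  | -    | 6    | six   | 2           | -    | -     | -
7  | -    | -    | -     | -           | 7    | seven | 3
8  | -    | -    | -     | -           | 8    | eight | 3
9  | -    | -    | -     | -           | 9    | nine  | 5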
The dataset has several columns, including 1_nm, 2_rk, 2_nm, and 3_rk; a dash marks a missing value. We’ll need to restructure this table to produce the new output.
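To follow along, the table above can be recreated with a short data step. This is a minimal sketch, not the original poster's code: it assumes the dashes are missing values, that the name columns are character with a length of 8, and that VALIDVARNAME=ANY is in effect so that names starting with a digit can be written as name literals ('1_nm'n).

```sas
/* Assumption: names that start with a digit require VALIDVARNAME=ANY
   and the name-literal syntax 'name'n. */
options validvarname=any;

data dimension;
  infile datalines dsd dlm=',' truncover;
  /* Assumed types: keys are numeric, names are character of length 8 */
  length rk 8 '1_nm'n $8
         '2_rk'n 8 '2_nm'n $8 '2_parent_rk'n 8
         '3_rk'n 8 '3_nm'n $8 '3_parent_rk'n 8;
  input rk '1_nm'n
        '2_rk'n '2_nm'n '2_parent_rk'n
        '3_rk'n '3_nm'n '3_parent_rk'n;
  /* Empty fields between commas are read as missing values (DSD) */
  datalines;
1,one,,,,,,
2,two,,,,,,
3,,3,three,1,,,
4,,4,four,1,,,
5,,5,five,2,,,
6,,6,six,2,,,
7,,,,,7,seven,3
8,,,,,8,eight,3
9,,,,,9,nine,5
;
run;
```

Running proc print on dimension afterwards is a quick way to confirm that the nine rows match the table.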
SAS Procedures: Creating a New Table
To produce the desired structure, we can use proc sql. The select statement lets us specify which columns to include in the result.
/* Names that start with a digit need VALIDVARNAME=ANY and 'name'n literals */
options validvarname=any;
proc sql;
select d.rk, d.'1_nm'n,
       (case when d2.'2_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n end) as '2_nm'n,
       (case when d3.'3_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n || ' ' || d3.'3_nm'n end) as '3_nm'n
from dimension d left join
     dimension d2
     on d2.'2_parent_rk'n = d.rk left join
     dimension d3
     on d3.'3_parent_rk'n = d2.rk;
quit;
In this select statement, we’re joining the dimension dataset to itself twice, under the aliases d2 and d3, based on the parent-key relationships between its rows.
Joining Datasets
The key to this solution is understanding how to join datasets in SAS. A left join keeps every row from the left table and attaches the rows from the right table that satisfy the on condition (here, the 2_parent_rk of d2 equals the rk of d). If no matching row exists in the right table, the right table’s columns are returned as missing (null) for that row.
In our example, we’re joining d2 to d on the 2_parent_rk column, and then joining d3 to d2 on the 3_parent_rk column. This creates a hierarchical structure in which each row of d is linked to its child rows in d2 and, through them, to its grandchild rows in d3.
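To see what a single level of this self-join produces, here is a small sketch that pairs each top-level row with its level-2 children. It assumes the dimension dataset exists with the column names shown earlier; the where clause and the output names parent_rk, parent_nm, child_rk, and child_nm are illustrative additions, not part of the original query.

```sas
options validvarname=any;

proc sql;
  /* Each level-1 row paired with its level-2 children. A parent with
     no children would still appear, with missing child columns. */
  select d.rk        as parent_rk,
         d.'1_nm'n   as parent_nm,
         d2.rk       as child_rk,
         d2.'2_nm'n  as child_nm
  from dimension d
       left join dimension d2
         on d2.'2_parent_rk'n = d.rk
  where d.'1_nm'n is not null      /* assumption: start from level-1 rows only */
  order by parent_rk, child_rk;
quit;
```

With the sample rows, this returns four pairs: one/three, one/four, two/five, and two/six. Without the where clause, rows 3 through 9 would also appear as "parents", each with missing child columns, because no 2_parent_rk value points at them.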
Updating Column Values
The real work happens when we build the new 2_nm value, which depends on whether a related 2_nm value exists in d2. We use a case expression to achieve this.
(case when d2.'2_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n end) as '2_nm'n,
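This expression checks whether the 2_nm column of d2 has a non-null value. If it does, the 1_nm value from d and the 2_nm value from d2 are concatenated, separated by a space, to create the new value for 2_nm. Because there is no else branch, the result is missing otherwise.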
Similarly, we build the value of 3_nm by concatenating 1_nm from d, 2_nm from d2, and 3_nm from d3.
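One practical wrinkle with || in SAS is that character columns are fixed-length, so trailing blanks are kept and a value such as "one" stored with a length of 8 concatenates as "one      three". The sketch below is a variant, not the original answer: it swaps in the catx function, which trims each argument and inserts the delimiter. It assumes the dimension dataset described earlier; the output table name dimension_paths, the column names nm_1, nm_2, and nm_3, and the where clause are hypothetical choices made for illustration.

```sas
options validvarname=any;

proc sql;
  create table dimension_paths as
  select d.rk,
         d.'1_nm'n as nm_1,
         /* catx trims each piece and joins the pieces with a single space */
         case when d2.'2_nm'n is not null
              then catx(' ', d.'1_nm'n, d2.'2_nm'n)
         end as nm_2,
         case when d3.'3_nm'n is not null
              then catx(' ', d.'1_nm'n, d2.'2_nm'n, d3.'3_nm'n)
         end as nm_3
  from dimension d
       left join dimension d2 on d2.'2_parent_rk'n = d.rk
       left join dimension d3 on d3.'3_parent_rk'n = d2.rk
  where d.'1_nm'n is not null;     /* assumption: keep only level-1 starting rows */
quit;
```

With the sample rows, dimension_paths holds five rows, and nm_3 is populated for the three complete paths (for example "one three seven"). The case expressions are kept so that nm_2 and nm_3 stay missing when the deeper level is absent; catx on its own would fall back to the shorter path instead.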
Last modified on 2024-02-10