SAS Data Manipulation: Combining and Updating a Dataset

SAS gives data analysts and scientists an efficient way to process and analyze large datasets. One common task is to combine and update the columns of a dataset to reach a desired output format. In this article, we will explore how to accomplish this with PROC SQL.

Understanding the Challenge

The original question presents a dataset named dimension whose columns hold keys (rk), names (nm), and parent keys (parent_rk) for three levels of a hierarchy. The goal is to simplify the data by merging and updating the name columns to produce a new table. We’ll break down the problem step by step and explain each part.

Dataset Structure

Let’s examine the structure of the dimension dataset:

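rk | 1_nm | 2_rk | 2_nm  | 2_parent_rk | 3_rk | 3_nm  | 3_parent_rk
1  | one  | -    | -     | -           | -    | -     | -
2  | two  | -    | -     | -           | -    | -     | -
3  | -    | 3    | three | 1           | -    | -     | -
4  | -    | 4    | four  | 1           | -    | -     | -
5  | -    | 5    | five  | 2           | -    | -     | -
6  | -    | 6    | six   | 2           | -    | -     | -
7  | -    | -    | -     | -           | 7    | seven | 3
8  | -    | -    | -     | -           | 8    | eight | 3
9  | -    | -    | -     | -           | 9    | nine  | 5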

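Rows 1 and 2 are level-1 entries with only 1_nm filled in. Rows 3 to 6 are level-2 entries whose 2_parent_rk points to the rk of a level-1 row, and rows 7 to 9 are level-3 entries whose 3_parent_rk points to the rk of a level-2 row. We’ll need to restructure this table so that each level's name is combined with the names of its parents.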
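To experiment with the queries in this article, the table above can be recreated with a short data step. This is only a sketch under two assumptions that the original question does not confirm: the dashes stand for missing values, and the name columns are character variables of length 8. Because the column names start with a digit, the sketch sets VALIDVARNAME=ANY and writes them as name literals.

```sas
/* Allow variable names that start with a digit */
options validvarname=any;

data dimension;
    /* dsd: comma-delimited, two consecutive commas mean a missing value */
    infile datalines dsd truncover;
    /* Assumed types and lengths; adjust to match the real data */
    length rk 8 '1_nm'n $ 8 '2_rk'n 8 '2_nm'n $ 8 '2_parent_rk'n 8
           '3_rk'n 8 '3_nm'n $ 8 '3_parent_rk'n 8;
    input rk '1_nm'n '2_rk'n '2_nm'n '2_parent_rk'n
          '3_rk'n '3_nm'n '3_parent_rk'n;
    datalines;
1,one,,,,,,
2,two,,,,,,
3,,3,three,1,,,
4,,4,four,1,,,
5,,5,five,2,,,
6,,6,six,2,,,
7,,,,,7,seven,3
8,,,,,8,eight,3
9,,,,,9,nine,5
;
run;
```

Running proc print data=dimension; run; afterwards is a quick way to confirm that the nine rows were read as intended.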

SAS Procedures: Creating a New Table

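To produce the desired structure, we can use PROC SQL. The select statement lets us specify which columns appear in the result. Because the column names in this dataset start with a digit, SAS only accepts them when VALIDVARNAME=ANY is set and they are written as name literals such as '1_nm'n.

options validvarname=any;  /* allow names that start with a digit */

proc sql;
    select d.rk, d.'1_nm'n,
           /* level-2 label: level-1 name plus level-2 name */
           (case when d2.'2_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n end) as '2_nm'n,
           /* level-3 label: all three names */
           (case when d3.'3_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n || ' ' || d3.'3_nm'n end) as '3_nm'n
    from dimension d left join
         dimension d2
         on d2.'2_parent_rk'n = d.rk left join
         dimension d3
         on d3.'3_parent_rk'n = d2.rk;
quit;

In this select statement, we’re joining the dimension dataset to itself twice. The aliases d, d2, and d3 all refer to the same table and stand for the level-1, level-2, and level-3 rows, linked through the parent key columns. The rk column is qualified as d.rk because it exists in all three aliases and would otherwise be ambiguous.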

Joining Datasets

The key to this solution is understanding how joins work in PROC SQL. A left join keeps every row of the left table and attaches the rows of the right table that satisfy the on condition (here, d2.2_parent_rk = d.rk). If no row in the right table matches, the columns of the right table are returned as null (missing) for that row.

In our example, we’re joining d2 with d on the 2_parent_rk column, and then joining d3 with d2 on the 3_parent_rk column. This creates a hierarchical structure where each row in the original dataset is linked to related rows in d2 and d3.
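It can help to look at the first join on its own before adding the second one. The sketch below assumes that the dimension dataset exists with the columns shown earlier, that dashes are stored as missing values, and that VALIDVARNAME=ANY is in effect. The where clause and the aliases child_rk and child_nm are illustrative additions, not part of the original query.

```sas
options validvarname=any;

proc sql;
    select d.rk,
           d.'1_nm'n,
           d2.rk       as child_rk,   /* hypothetical alias */
           d2.'2_nm'n  as child_nm    /* hypothetical alias */
    from dimension d
         left join dimension d2
         on d2.'2_parent_rk'n = d.rk
    /* Illustrative filter: keep only level-1 rows on the left side */
    where d.'1_nm'n is not null
    order by d.rk, child_rk;
quit;
```

With the sample data, rk 1 should pair with the level-2 rows three and four, and rk 2 with five and six. Without the where clause, the level-2 and level-3 rows would also appear on the left side, each with missing values in the d2 columns, which is how the full query in this article behaves.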

Updating Column Values

The key step is computing the new 2_nm value from 1_nm, depending on whether the join found a related value in 2_nm. We use a case expression to do this.

(case when d2.'2_nm'n is not null then d.'1_nm'n || ' ' || d2.'2_nm'n end) as '2_nm'n,

This expression checks whether d2.2_nm has a non-null value. If it does, the values of d.1_nm and d2.2_nm are concatenated, separated by a space, to create the new value for 2_nm. Because there is no else branch, the result is missing when no level-2 row matched.

Similarly, we’re updating the value of 3_nm by concatenating values from d.1_nm, d2.2_nm, and d3.3_nm.
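One detail to watch for: SAS character variables are padded with trailing blanks, and the || operator keeps that padding, so the concatenated names can end up with extra spaces between the words. The variant below is a sketch, not the original answer. It swaps in the CATX function, which strips leading and trailing blanks and inserts the separator, and it stores the result with create table. The output name dimension_flat is a hypothetical choice.

```sas
options validvarname=any;

proc sql;
    /* dimension_flat is a hypothetical name for the new table */
    create table dimension_flat as
    select d.rk,
           d.'1_nm'n,
           /* catx trims each piece and joins the pieces with a single space */
           case when d2.'2_nm'n is not null
                then catx(' ', d.'1_nm'n, d2.'2_nm'n)
           end as '2_nm'n,
           case when d3.'3_nm'n is not null
                then catx(' ', d.'1_nm'n, d2.'2_nm'n, d3.'3_nm'n)
           end as '3_nm'n
    from dimension d
         left join dimension d2 on d2.'2_parent_rk'n = d.rk
         left join dimension d3 on d3.'3_parent_rk'n = d2.rk;
quit;
```

The case expressions are kept because CATX skips missing arguments. Without them, a row with no level-2 match would get the bare level-1 name in 2_nm instead of a missing value.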


Last modified on 2024-02-10