SAS Library

7/31/2019 SAS Library


Data Transformations and Data Manipulation in SAS

This page was adapted from a page created by Professor Oliver Schabenberger. We thank Professor Schabenberger for permission to adapt and distribute this page via our web site.

1. Making a copy of a SAS data set
2. Creating new variables
   2.1. Transformations
   2.2. Operators
   2.3. Algebra with logical expressions
3. Dropping variables from a data set
4. Dropping observations from a data set (Subsetting data)
5. Setting and merging multiple data sets
6. Sorting your data
   6.1. By-processing in PROC steps
   6.2. By-merging of sorted data sets (sorted matching)
7. Formatting
   7.1. Labels
   7.2. Formats
   7.3. SAS Dates
8. Advanced data manipulation
   8.1. Renaming of variables
   8.2. Retaining of variables
   8.3. DO Blocks and DO Loops
   8.4. IF .. THEN .. ELSE statements
   8.5. SELECT case distinction
   8.6. Arrays
   8.7. Lagged variables
   8.8. Generating multiple data sets in a DATA step
   8.9. Converting character variables to numeric variables

1. Making a copy of a SAS data set

Manipulating data is possible in the initial DATA step when you read in or access data for the first time, or in subsequent DATA steps when you modify or replace an existing data set. Assume that data set MYDATA exists and you wish to create new variables, drop variables, subset the data set, or perform some other manipulation of it. This requires a new DATA step in which you have to make the information stored in MYDATA available to SAS. The easiest way to do this is with the SET command. There are two forms:

data mydata;
  set mydata;
  <more statements>
run;

and

data newdata;
  set mydata;
run;

The SET command adds the contents of a data set to a DATA step. In the first example the data set being created (MYDATA) and the data set being read are the same. This works because SAS does not assign the name to the new data set until the DATA step has completed successfully. In the interim, a temporary data set name is used. The statement data mydata is interpreted as "I should create a new data set; if the DATA step executes without errors, I will call it mydata". Until the DATA step completes, the data set mydata exists in its unmodified, original form. Using the name of an existing data set in both the DATA and SET statements is a convenient way to modify an existing data set. If you wish to modify mydata but store the results in a new data set, use the syntax of the second example, where the names of the data sets in the DATA and SET statements differ. Upon completion of the DATA step a new data set named newdata is created. Mydata remains unchanged.

2. Creating new variables

2.1. Transformations

SAS' power makes it unnecessary to perform most data manipulations outside of The SAS System. Creation of new variables, calculations with existing variables, subsetting of data, sorting, etc. are best done within SAS. Transformations in the narrow sense use the built-in mathematical functions available in The SAS System. An example DATA step would be:

data survey; set survey;
  x     = ranuni(123);  /* A uniform(0,1) random number    */
  lny   = log(inc);     /* The natural logarithm (base e)  */
  logy  = log10(inc);   /* The log to base 10              */
  rooty = sqrt(inc);    /* The square root                 */
  expy  = exp(inc/10);  /* The exponential function        */
  cos   = cos(x);       /* The cosine function             */
  sin   = sin(x);       /* The sine function               */
  tan   = tan(x);       /* The tangent function            */
  cosz  = cos(z);
  z     = ranuni(0);
run;

This DATA step calculates various transformations of the income variable. Watch out for the subtle difference between the log() and log10() functions. In mathematics, the natural logarithm (logarithm with base e) is usually abbreviated ln, while log is reserved for the logarithm with base 10. SAS does not have a ln() function. The natural log is calculated by the log() function. The RANUNI() function generates a random number from a Uniform(0,1) distribution. The number in parentheses is the seed of the random number generator. If you set the seed to a positive number, the same random numbers are generated every time you run the program. A seed of zero initializes the generator from the computer clock, so the numbers differ from run to run.

The expressions (function calls) on the right hand side of the '=' sign must involve existing variables. These are either variables already in the data set being SET, or variables created earlier in the DATA step. For example, the statement tan = tan(x); will function properly, since x has been defined prior to the call to the tan() function. The variable cosz, however, will contain missing values only, since its argument, the variable z, is not defined until the following statement. Here is a printout of the data set survey after this DATA step.



OBS  ID  SEX  AGE  INC  R1  R2  R3     X        LNY      LOGY     ROOTY

  1   1   F    35   17   7   2   2  0.75040  2.83321  1.23045  4.12311
  2  17   M    50   14   5   5   3  0.17839  2.63906  1.14613  3.74166
  3  33   F    45    6   7   2   7  0.35712  1.79176  0.77815  2.44949
  4  49   M    24   14   7   5   7  0.78644  2.63906  1.14613  3.74166
  5  65   F    52    9   4   7   7  0.12467  2.19722  0.95424  3.00000
  6  81   M    44   11   7   7   7  0.77618  2.39790  1.04139  3.31662
  7   2   F    34   17   6   5   3  0.96750  2.83321  1.23045  4.12311
  8  18   M    40   14   7   5   2  0.71393  2.63906  1.14613  3.74166
  9  34   F    47    6   6   5   6  0.53125  1.79176  0.77815  2.44949
 10  50   M    35   17   5   7   5  0.14208  2.83321  1.23045  4.12311

OBS   EXPY      COS      SIN      TAN    COSZ     Z

  1  5.47395  0.73142  0.68193  0.93234    .   0.32091
  2  4.05520  0.98413  0.17745  0.18031    .   0.90603
  3  1.82212  0.93691  0.34957  0.37312    .   0.22111
  4  4.05520  0.70637  0.70784  1.00208    .   0.39808
  5  2.45960  0.99224  0.12434  0.12532    .   0.18769
  6  3.00417  0.71359  0.70056  0.98174    .   0.43607
  7  5.47395  0.56736  0.82347  1.45141    .   0.26370
  8  4.05520  0.75579  0.65481  0.86639    .   0.55486
  9  1.82212  0.86217  0.50661  0.58760    .   0.86134
 10  5.47395  0.98992  0.14160  0.14304    .   0.86042

For more information about mathematical, trigonometric, and other functions see the Help Files (go to Help-Extended Help, then select SAS System Help, select SAS Language, select SAS Functions, select Function Categories).
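The point about statement order can be tried in isolation. The following is only a sketch: the data set name demo and the variable names cosz_early and cosz_late are made up for illustration, and the three income values are taken from the survey printout above.

```sas
/* Illustrative sketch: demo, cosz_early and cosz_late are hypothetical names. */
data demo;
  input inc;
  cosz_early = cos(z);   /* z has no value yet in this iteration: result is missing */
  z          = ranuni(123);
  cosz_late  = cos(z);   /* z is defined now: result is a valid cosine              */
  datalines;
17
14
6
;
run;

proc print data=demo; run;
```

Because variables created by assignment are reset to missing at the start of every DATA step iteration, cosz_early should be missing in every observation, and the SAS log should note that missing values were generated.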

2.2. Operators

The most important operators are listed in the following table. The smaller the group number, the higher the precedence of the operator. For example, in the expression y = 3 * x + 4; multiplication is carried out before addition, since the group number of the multiplication operator is less than that of the addition operator.

Group  Type of   Operator  Description                         DATA Step Example
       Operator

0                ()        expression in parentheses           y = 3*(x+1);
                           is evaluated first
1      Math      **        raises argument to a power          y = x**2;
2      Math      +, -      to indicate a positive or           y = -x;
                           negative number
3      Math      *         multiplication                      y = x * z;
4      Math      +         addition                            y = x + 3;
                 -         subtraction                         z = y - 3*x;
5      String    ||        string concatenation                name = firstname || lastname;
6      Set       in        whether value is contained          y = x in (1,2,3,4);
                           in a set                            gender in ('F','M');
7      Logical   =, eq     equals                              if x = 12;
       Logical   ^=, ne    does not equal                      if x ne 5;
       Logical   >, gt     greater than                        if sin(x) > 0.4;
       Logical   <, lt     less than
       Logical   >=, ge    greater than or equal
       Logical   <=, le    less than or equal
       Logical   and       logical and                         if (a=b) and (sin(x) > 0.3);
       Logical   or        logical or                          if (a=b) or (sin(x) < 0.3);
       Logical   not       logical not                         if not (a=b);
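The precedence rules in the table can be checked with a few assignments. This is a minimal sketch; the data set name ops and all variable names are chosen here purely for illustration.

```sas
/* Illustrative sketch: ops and its variables are hypothetical names. */
data ops;
  x = 2;
  y1 = 3 * x + 4;             /* multiplication before addition: 10  */
  y2 = 3 * (x + 4);           /* parentheses are evaluated first: 18 */
  y3 = -x ** 2;               /* power before the sign: -4           */
  inset = x in (1, 2, 3, 4);  /* x is in the set, so the result is 1 */
  length name $ 20;
  name = 'Ann' || ' ' || 'Lee';  /* string concatenation */
run;

proc print data=ops; run;
```

When in doubt about precedence, adding parentheses as in y2 is the safer choice.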

2.3. Algebra with logical expressions

Algebra with logical expressions is a nifty trick. As in many other computing packages, logical comparisons in SAS return the numeric value 1 if true and 0 otherwise. This feature can be used elegantly in DATA steps. Imagine you need to create a new variable agegr grouping ages in the survey examples. The first group comprises ages between 0 and 25 years, the second group ages between 26 and 40 years, and the third all individuals age 41 and older.

data survey; set survey;
  agegr = (age <= 25) + 2*((age > 25) and (age < 41)) + 3*(age >= 41);
run;

For individuals less than or equal to 25 years old, only the first logical comparison is true, and the resulting algebraic expression is agegr = 1 + 0 + 0;. For those between 26 and 40 years old, the second expression is true and the expression yields agegr = 0 + 2*1 + 0;. Finally, for those above 40 years old, you get agegr = 0 + 0 + 3*1;. Using algebra with logical expressions is sometimes easier and more compact than using if..then..else constructs. The if..then syntax that accomplishes the same as the one-liner above is

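data survey; set survey;
  if age <= 25 then agegr = 1;
  else if age < 41 then agegr = 2;
  else agegr = 3;
run;

3. Dropping variables from a data set

Variables are removed from a data set with the DROP statement or the DROP= data set option. To remove the variables r1, r2, and r3 from the data set survey, use

data survey; set survey;
  drop r1 r2 r3;
run;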

if you use the DROP statement and

data survey; set survey(drop=r1 r2 r3); run;

if you use the DROP= option.

The end result of the two versions is the same. The variables r1, r2, and r3 are no longer part of the data set survey when the DATA step completes. There is a subtle difference between the two. When you use the DROP statement inside the DATA step, all variables in survey are initially copied into the new data set being created. The variables being dropped are available in the DATA step itself. Dropping takes place only at completion of the DATA step. When you list variables in a DROP= option as in the second example, the variables are not copied (SET) into the data set. This version is slightly faster, since the interim data set being manipulated is smaller. But it precludes you from using the variables r1, r2, r3 anywhere in the DATA step. For example, if you want to calculate a new variable, the sum of r1, r2, r3, before dropping the variables, you have to use

data survey; set survey;
  total = r1 + r2 + r3;
  drop r1 r2 r3;
run;

If you used

data survey; set survey(drop=r1 r2 r3);
  total = r1 + r2 + r3;
run;

the variable total would contain missing values, since r1, r2, r3 are not known after survey has been copied into the new data set.

If many variables are to be listed that form a numbered list, such as r1, r2, r3, etc., you can use a shortcut to describe the elements in the list:

data survey; set survey(drop=r1-r3); run;

The complementary DATA step command and data set option to DROP (DROP=) are the KEEP statement and KEEP= option. Instead of dropping the variables listed, only the variables listed after KEEP (KEEP=) are kept in the data set. All others are eliminated. If you use the KEEP= data set option, variables not listed are not copied into the new data set. The next line of statements eliminates all variables except age and inc.

data survey; set survey(keep=age inc); run;
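A third placement, not shown above, is worth knowing: the DROP= option can also be attached to the data set named in the DATA statement. In that position the variables are read in and can be used in calculations, and they are left out only when the result is written. The sketch below assumes a small stand-in for the survey data; the output name survey2 is made up for illustration.

```sas
/* Illustrative sketch: a three-row stand-in for survey; survey2 is a hypothetical name. */
data survey;
  input id r1 r2 r3;
  datalines;
1 7 2 2
17 5 5 3
33 7 2 7
;
run;

/* DROP= on the output data set: r1-r3 are available during the step */
data survey2(drop=r1-r3);
  set survey;
  total = r1 + r2 + r3;   /* 11, 13 and 16 for the three rows */
run;

proc print data=survey2; run;
```

The same placement works for KEEP=.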

4. Dropping observations from a data set (Subsetting data)

Dropping observations (subsetting data) means retaining only those observations that satisfy a certain condition. This is accomplished with IF and WHERE statements as well as the WHERE= data set option. For example, to keep observations of individuals more than 35 years old use



data survey; set survey;
  if age > 35;
run;

SAS evaluates the logical condition for each observation and, upon successful completion of the DATA step, deletes those observations for which the condition is not true. An alternative syntax construction is IF condition THEN DELETE;:

data survey; set survey;
  if age <= 35 then delete;  /* or: if not (age > 35) then delete; */
run;

If you use this construction, the condition has to be reversed of course. The WHERE statement selects the same observations as the first IF syntax example:

data survey; set survey;
  where age > 35;
run;

The advantage of the WHERE construction is that it can be used as a data set option:

data survey; set survey(where=(age > 35)); run;

Only those observations for which the expression in parentheses is true are copied into the new data set. If a lot of observations must be deleted, this can be much faster than using a subsetting IF statement inside the DATA step, because the IF statement tests each observation only after it has been read into the DATA step. The other advantage of subsetting data with the WHERE= data set option is that it can be combined with any procedure. For example, if you want to print the data set for only those age 35 and above you can use

proc print data=survey(where=(age >= 35)); run;

without having to create a new data set containing the survey participants aged 35 and above first. To calculate sample means, standard deviations, etc. for 1994 yield data from a data set containing multiple years:

proc means data=yielddat(where=(year = 1994)); run;
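To see that the subsetting IF and the WHERE= option select the same observations, the two results can be compared directly. This is a sketch with a four-row stand-in for the survey data; the names over35_if and over35_where are made up for illustration.

```sas
/* Illustrative sketch: over35_if and over35_where are hypothetical names. */
data survey;
  input id age;
  datalines;
1 35
17 50
33 45
49 24
;
run;

data over35_if;
  set survey;
  if age > 35;                      /* tested after the observation is read */
run;

data over35_where;
  set survey(where=(age > 35));     /* tested while the observation is read */
run;

/* Both data sets should hold the observations with id 17 and 33 */
proc compare base=over35_if compare=over35_where; run;
```

One practical difference: a WHERE condition can refer only to variables that already exist in the input data set, whereas a subsetting IF can also test variables created earlier in the same DATA step.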

5. Setting and merging multiple data sets

Setting data sets means concatenating their contents vertically. Merging data means combining two or more data sets horizontally. Imagine two SAS data sets. The first contains n1 observations and v1 variables, the second n2 observations and v2 variables. When you set the two data sets the new data set will contain n1+n2 observations and one variable for each distinct variable name in the two data sets (max(v1,v2) variables if one variable list contains the other). Variables that are not in both data sets receive missing values for observations from the data set where the variable is not present. An example will make this clearer.

data growth1;
  input block trtmnt growth @@;
  year = 1997;
  datalines;
1 1 7.84 2 1 8.69 3 1 8.11 4 1 7.74 5 1 8.35
1 2 6.78 2 2 6.69 3 2 6.95 4 2 6.41 5 2 6.64
1 3 6.79 2 3 6.79 3 3 6.79 4 3 6.43 5 3 6.61
;
run;

proc print data=growth1(obs=10); run;

Data set growth1 contains 15 observations and four variables (BLOCK, TRTMNT, GROWTH, YEAR). Only the first 10 observations are displayed (OBS=10 data set option).

OBS  BLOCK  TRTMNT  GROWTH  YEAR

  1    1      1      7.84   1997
  2    2      1      8.69   1997
  3    3      1      8.11   1997
  4    4      1      7.74   1997
  5    5      1      8.35   1997
  6    1      2      6.78   1997
  7    2      2      6.69   1997
  8    3      2      6.95   1997
  9    4      2      6.41   1997
 10    5      2      6.64   1997

The next data set (growth2) contains 10 observations. The variable YEAR is not part of the data set growth2.

data growth2;
  input block trtmnt growth @@;
  datalines;
1 4 6.64 2 4 6.57 3 4 6.78 4 4 6.54 5 4 6.48
1 5 7.31 2 5 7.65 3 5 7.26 4 5 6.98 5 5 7.39
;
run;

To combine the data sets vertically, use the SET statement and list the data sets you wish to combine. The data sets are placed in the new data set in the order in which they appear in the SET statement. In this example, the observations in growth1 go first, followed by the observations in growth2.

data growth; set growth1 growth2; run;
proc print data=growth; run;

The combined data has four variables and 25 observations. The variable year contains missing values for all observations from growth2, since year was not present in this data set.


OBS  BLOCK  TRTMNT  GROWTH  YEAR

  1    1      1      7.84   1997
  2    2      1      8.69   1997
  3    3      1      8.11   1997
  4    4      1      7.74   1997
  5    5      1      8.35   1997
  6    1      2      6.78   1997
  7    2      2      6.69   1997
  8    3      2      6.95   1997
  9    4      2      6.41   1997
 10    5      2      6.64   1997
 11    1      3      6.79   1997
 12    2      3      6.79   1997
 13    3      3      6.79   1997
 14    4      3      6.43   1997
 15    5      3      6.61   1997
 16    1      4      6.64      .
 17    2      4      6.57      .
 18    3      4      6.78      .
 19    4      4      6.54      .
 20    5      4      6.48      .
 21    1      5      7.31      .
 22    2      5      7.65      .
 23    3      5      7.26      .
 24    4      5      6.98      .
 25    5      5      7.39      .
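How the variable lists combine when neither data set contains all variables of the other can be seen with two tiny data sets. This is only a sketch; the names a, b, ab and the variables x, y, z are made up for illustration.

```sas
/* Illustrative sketch: a, b, ab, x, y and z are hypothetical names. */
data a;
  input x y;
  datalines;
1 10
2 20
;
run;

data b;
  input x z;
  datalines;
3 300
;
run;

/* ab has 2 + 1 = 3 observations and the three variables x, y, z:  */
/* z is missing for the rows from a, y is missing for the row from b */
data ab;
  set a b;
run;

proc print data=ab; run;
```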

Merging data sets is usually done when the data sets contain the same observations but different variables, while setting is reasonable when data sets contain different observations but the same variables. Consider the survey example and assume that the baseline information (id, sex, age, inc) is in one data set, while the subjects' ratings of product preference (r1, r2, r3) are contained in a second data set.

DATA baseline;
  INPUT id sex $ age inc;
  DATALINES;
1 F 35 17
17 M 50 14
33 F 45 6
49 M 24 14
65 F 52 9
81 M 44 11
2 F 34 17
18 M 40 14
34 F 47 6
50 M 35 17
;
DATA rating;
  INPUT r1 r2 r3;
  DATALINES;
7 2 2
5 5 3
7 2 7
7 5 7
4 7 7
7 7 7
6 5 3
7 5 2
6 5 6
5 7 5
;
run;

Merging the two data sets in a DATA step combines the variables and observations horizontally. If the first data set has n1 observations and v1 variables and the second data set has n2 observations and v2 variables, the merged data set will have max(n1,n2) observations. Observations not present in the smaller data set are patched with missing values. The number of variables in the combined data set depends on whether the two data sets share some variables. If variables are present in both data sets, their values are taken from the data set in the merge list that contains the variable last. If the rating data set above contained a variable ID, the value of the ID variable in the merged data set would come from the rating data set.

data survey; merge baseline rating; run;

proc print data=survey; run;

OBS  ID  SEX  AGE  INC  R1  R2  R3

  1   1   F    35   17   7   2   2
  2  17   M    50   14   5   5   3
  3  33   F    45    6   7   2   7
  4  49   M    24   14   7   5   7
  5  65   F    52    9   4   7   7
  6  81   M    44   11   7   7   7
  7   2   F    34   17   6   5   3
  8  18   M    40   14   7   5   2
  9  34   F    47    6   6   5   6
 10  50   M    35   17   5   7   5
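The survey example merges two data sets of equal length. The patching with missing values described above shows up only when the lengths differ, as in the following sketch. The data set names ids, scores, and both are made up for illustration.

```sas
/* Illustrative sketch: ids, scores and both are hypothetical names. */
data ids;
  input id;
  datalines;
1
2
3
;
run;

data scores;
  input score;
  datalines;
10
20
;
run;

/* One-to-one merge without BY: max(3, 2) = 3 observations; */
/* score is missing in the third observation                */
data both;
  merge ids scores;
run;

proc print data=both; run;
```

Because this kind of merge pairs observations purely by position, it is only safe when both data sets are known to list the same subjects in the same order.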

6. Sorting your data

If your data is entered or read in some "ordered" fashion, one could consider it ordered. For example, the data set growth in the example above appears sorted first by the variable TRTMNT and, for each value of TRTMNT, by BLOCK. As far as SAS is concerned, the data are simply ordered by these variables, but not sorted. A data set is not sorted unless you process it with the SORT procedure. The basic syntax of PROC SORT is

proc sort data=yourdata;
  by <variable list>;
run;

<variable list> is the list of variables by which to sort the data set. If this list contains more than one variable, SAS sorts the data set by the variable listed first. Then, for each value of this variable, it sorts the data set by the second variable. For example

by block trtmnt;

will cause SAS to sort by BLOCK first. All observations with the same value of variable BLOCK are then sorted by variable TRTMNT.

By default, variables are sorted in ascending order. To reverse the sort order, add the keyword DESCENDING before the name of the variable you want to be arranged in descending order. For example

by descending block trtmnt;

will sort the data in descending order of BLOCK and all observations of the same block in ascending order of TRTMNT.

Why is sorting so important, and how does it differ from arranging (reading) data in an ordered sequence to begin with? Once data have been sorted with PROC SORT, SAS can process them by groups and makes hidden variables available for each variable in the BY statement. For example the code

proc sort data=growth; by block trtmnt; run;

sorts data set growth by BLOCK and TRTMNT. The hidden variables associated with these BY variables are

first.block
last.block
first.trtmnt
last.trtmnt

You are not able to see these variables or print them out. However, they can be accessed in DATA steps. By default these are logical variables containing the values 0 and 1 only. In a group of observations with the same value of BLOCK, first.block takes on the value 1 for the first observation in the group and last.block takes on the value 0. For the last observation in the group last.block takes on the value 1. With some trickery, I made first.block and last.block visible in the printout of the sorted data set growth:

                      first.  last.
OBS  BLOCK  TRTMNT    block   block   GROWTH

  1    1      1         1       0      7.84
  2    1      2         0       0      6.78
  3    1      3         0       0      6.79
  4    1      4         0       0      6.64
  5    1      5         0       1      7.31
  6    2      1         1       0      8.69
  7    2      2         0       0      6.69
  8    2      3         0       0      6.79
  9    2      4         0       0      6.57
 10    2      5         0       1      7.65
 11    3      1         1       0      8.11
 12    3      2         0       0      6.95
 13    3      3         0       0      6.79
 14    3      4         0       0      6.78
 15    3      5         0       1      7.26
 16    4      1         1       0      7.74
 17    4      2         0       0      6.41
 18    4      3         0       0      6.43
 19    4      4         0       0      6.54
 20    4      5         0       1      6.98
 21    5      1         1       0      8.35
 22    5      2         0       0      6.64
 23    5      3         0       0      6.61
 24    5      4         0       0      6.48
 25    5      5         0       1      7.39

A convenient way to process data in procedures is to use BY-processing. This relies on the same grouping that the first.whatever and last.whatever variables describe, so before you can use BY-processing the data must be sorted accordingly.

The next code example shows how the first.whatever and last.whatever variables can be used in DATA steps. Here only the first observation for each block is output to the data set. This trick allows counting the number of unique values of BLOCK in the data set:

data blocks;
  set growth; by block;
  if first.block;
run;
proc print data=blocks; run;


OBS  BLOCK  TRTMNT  GROWTH  YEAR

  1    1      1      7.84   1997
  2    2      1      8.69   1997
  3    3      1      8.11   1997
  4    4      1      7.74   1997
  5    5      1      8.35   1997

Notice that the SET statement is followed by the BY statement. It is the presence of the BY statement in the DATA step that causes SAS to create the first.block and last.block variables for the sorted data set. When the BY statement is omitted, first.block cannot be accessed.

The SORT procedure has many nifty aspects. As for any other procedure, you can access help and explore the syntax by entering

help procedurename

into the little white box in the top left corner of the SAS application workspace. Then click on the checkmark to the left of the box. Here are some other uses of PROC SORT:

1. By default the data set being sorted is replaced with the sorted version at completion of PROC SORT. To prevent this, use the OUT= option.

proc sort data=growth out=newdata; by descending block; run;

sorts the data set growth by descending levels of BLOCK but leaves the original data untouched. The sorted data is written instead to a data set called newdata.

2. Sometimes you want to identify how many combinations of the levels of certain variables are in your data set. One technique to determine this is shown above, a DATA step using the first.xxx and/or last.xxx variables. More conveniently, you can use the NODUPKEY option. If there are multiple observations for a certain combination of the sort variables, only the first one is retained:

proc sort data=growth out=blocks nodupkey; by block; run;
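The first. and last. variables are also handy for accumulating a summary within each BY group. The sketch below counts the observations in each block and averages growth. It uses a shortened version of the growth data; the names blockstats, n, total, and avg are made up for illustration.

```sas
/* Illustrative sketch: blockstats, n, total and avg are hypothetical names. */
data growth;
  input block trtmnt growth @@;
  datalines;
1 1 7.84 2 1 8.69 1 2 6.78 2 2 6.69 1 3 6.79
;
run;

proc sort data=growth; by block; run;

data blockstats;
  set growth;
  by block;
  if first.block then do;   /* start of a new block: reset the accumulators */
    n = 0;
    total = 0;
  end;
  n + 1;                    /* sum statements keep their values across rows */
  total + growth;
  if last.block then do;    /* end of the block: write one summary row      */
    avg = total / n;
    output;
  end;
  keep block n avg;
run;

proc print data=blockstats; run;
```

With these five data points the result should have one row per block, with n = 3 for block 1 and n = 2 for block 2.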

6.1. By-processing in PROC steps

By-processing in general refers to the use of the BY statement in PROC or DATA steps. If a BY statement

by <variable list>;

appears in a PROC or DATA step, the data set(s) being processed must be sorted as indicated by the variable list. However, if you sort data with

by block tx a b ;

for example, BY-processing is possible with any of the following:

by block;
by block tx;
by block tx a;
by block tx a b;

since a data set sorted by BLOCK and TX is also sorted by BLOCK, and so forth. You cannot, however, change the order in which the variables appear in the list or omit a variable ranked higher in the sort order. BY statements such as

by b;
by tx a;
by a block;

and so forth will create errors at execution.

A BY statement in a procedure causes SAS to execute the procedure separately for all combinations of the BY variables. For example,

proc means data=growth;
  var growth;
run;

will calculate mean, standard deviation, etc. for the variable growth across all observations in the data set. The statements

proc means data=growth;
  var growth;
  by block;
run;

will calculate these descriptive statistics separately for each level of the BLOCK variable in the data set.
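Put together, a complete run needs the sort first. The sketch below uses a shortened version of the growth data. It also shows the CLASS statement of PROC MEANS, which is not discussed above: it produces statistics by group as well, but does not require the data to be sorted.

```sas
/* Illustrative sketch with a shortened version of the growth data. */
data growth;
  input block trtmnt growth @@;
  datalines;
2 1 8.69 1 1 7.84 2 2 6.69 1 2 6.78 1 3 6.79
;
run;

/* BY-processing: the data must be sorted by block first */
proc sort data=growth; by block; run;

proc means data=growth;
  var growth;
  by block;
run;

/* CLASS: grouped statistics without a prior sort */
proc means data=growth;
  class block;
  var growth;
run;
```

The two calls differ mainly in layout: BY produces a separate table for each block, while CLASS collects the groups in a single table.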

6.2. By-merging of sorted data sets (sorted matching)

By-merging is a powerful tool to merge data according to variables contained in both data sets. The easiest case is one-to-one merging, where each data set in the merge contributes at most one observation for each value of the variables according to which the data sets are to be matched. Consider the following example: In a spacing study two levels of P fertilization (0 and 25 lbs/acre) and three levels of row spacing (40, 80, 120 cm) are applied. Yield data are collected in 1996 and 1997. The data sets for the two years are shown below. Each data set contains 17 observations. To calculate the yield difference between the two years the data sets need to be merged. Upon closer inspection one sees, however, that the observations do not have the same P and SPACE variable arrangement. There are only two observations for P=0, SPACE=40 in the 1996 data set, whereas there are three observations for this combination in the 1997 data set. Conversely, replicate 2 for P=25, SPACE=120 in 1996 is not represented in the 1997 data set.

/* 1996 yield data */
data spacing1;
  input P space rep yield96;
  datalines;
0 40 1 57
0 40 2 58
0 80 1 57
0 80 2 58
0 80 3 56
0 120 1 49
0 120 2 54
0 120 3 53
25 40 1 53
25 40 2 45
25 40 3 46
25 80 1 54
25 80 2 50
25 80 3 48
25 120 1 63
25 120 2 57
25 120 3 53
;
run;

    /* 1997 yield data */
    data spacing2;
        input P space rep yield97;
        datalines;
    0 40 1 35
    0 40 2 28
    0 40 3 29
    0 80 1 38
    0 80 2 29
    0 80 3 27
    0 120 1 10
    0 120 2 25
    0 120 3 34
    25 40 1 24
    25 40 2 24
    25 40 3 17
    25 80 1 25
    25 80 2 31
    25 80 3 29
    25 120 1 44
    25 120 3 28
    ;
    run;

The following DATA step merges the data incorrectly. It matches observation by observation, and since both data sets contain the variables P, SPACE, and REP, the values of these variables are taken from spacing2, the last data set in the list.

    data spacing;
        merge spacing1 spacing2;
    run;

    proc print data=spacing; run;

    OBS     P    SPACE    REP    YIELD96    YIELD97

      1     0       40      1         57         35
      2     0       40      2         58         28
      3     0       40      3         57         29
      4     0       80      1         58         38
      5     0       80      2         56         29
      6     0       80      3         49         27
      7     0      120      1         54         10
      8     0      120      2         53         25
      9     0      120      3         53         34
     10    25       40      1         45         24
     11    25       40      2         46         24
     12    25       40      3         54         17
     13    25       80      1         50         25
     14    25       80      2         48         31
     15    25       80      3         63         29
     16    25      120      1         57         44
     17    25      120      3         53         28

Yield measurements in the two years are matched up correctly for the first two observations, but incorrectly for the third and all following observations. Since the P, SPACE, and REP variables contained in data set spacing1 were overwritten by the variables in spacing2, the problem is not at all obvious at first glance.

To merge the two data sets correctly, we sort them both such that each observation is properly identified:

    proc sort data=spacing1; by p space rep; run;
    proc sort data=spacing2; by p space rep; run;

Then merge them using BY-processing:

    data spacing;
        merge spacing1 spacing2;
        by p space rep;
    run;
    proc print data=spacing; run;

This produces

    OBS     P    SPACE    REP    YIELD96    YIELD97

      1     0       40      1         57         35
      2     0       40      2         58         28
      3     0       40      3          .         29
      4     0       80      1         57         38
      5     0       80      2         58         29
      6     0       80      3         56         27
      7     0      120      1         49         10
      8     0      120      2         54         25
      9     0      120      3         53         34
     10    25       40      1         53         24
     11    25       40      2         45         24
     12    25       40      3         46         17
     13    25       80      1         54         25
     14    25       80      2         50         31
     15    25       80      3         48         29
     16    25      120      1         63         44
     17    25      120      2         57          .
     18    25      120      3         53         28

The observations are now matched up properly.
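With the observations aligned, the yield difference that motivated the merge can be computed in the same DATA step. A minimal sketch, assuming both data sets have already been sorted as above; the variable name diff is only an illustration:

```sas
/* Merge by the identifying variables, then compute the year-to-year change */
data spacing;
    merge spacing1 spacing2;
    by p space rep;
    diff = yield96 - yield97;   /* missing whenever either yield is missing */
run;

proc print data=spacing; run;
```

For the two combinations present in only one year, diff is missing, and SAS writes a note about missing values to the log.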



By-merging is also useful to match many-to-one relationships in data sets. Assume that one data set contains the change in ear temperature at treatment and 4 days after treatment for five rabbits in each of three treatment groups:

    data rabbits;
        input Treat Rabbit day0 day4;
        datalines;
    1 1 -0.3 -0.2
    1 2 -0.5 2.2
    1 3 -1.1 2.4
    1 4 1.0 1.7
    1 5 -0.3 0.8
    2 1 -1.1 -2.2
    2 2 -1.4 -0.2
    2 3 -0.1 -0.1
    2 4 -0.2 0.1
    2 5 -0.1 -0.2
    3 1 -1.8 0.2
    3 2 -0.5 0.0
    3 3 -1.0 -0.3
    3 4 0.4 0.4
    3 5 -0.5 0.9
    ;
    run;

A second data set contains the treatment means averaged across the five rabbits for each treatment group:

    data means;
        input treat day0mn day4mn;
        datalines;
    1 -0.24 1.38
    2 -0.58 -0.52
    3 -0.68 0.24
    ;
    run;

You want to calculate the deviation between each observation and the respective treatment mean. That means one observation in data set means must be matched with five observations in data set rabbits. This is done easily with By-merging. Sort both data sets by TREAT and merge them by TREAT.

    proc sort data=rabbits; by treat; run;
    proc sort data=means; by treat; run;
    data deviate;
        merge rabbits means;
        by treat;
        dev0 = day0 - day0mn;
        dev4 = day4 - day4mn;
    run;
    proc print data=deviate; run;

which produces:

    OBS   TREAT   RABBIT    DAY0    DAY4    DAY0MN    DAY4MN     DEV0     DEV4

      1       1        1    -0.3    -0.2     -0.24      1.38    -0.06    -1.58
      2       1        2    -0.5     2.2     -0.24      1.38    -0.26     0.82
      3       1        3    -1.1     2.4     -0.24      1.38    -0.86     1.02
      4       1        4     1.0     1.7     -0.24      1.38     1.24     0.32
      5       1        5    -0.3     0.8     -0.24      1.38    -0.06    -0.58
      6       2        1    -1.1    -2.2     -0.58     -0.52    -0.52    -1.68
      7       2        2    -1.4    -0.2     -0.58     -0.52    -0.82     0.32
      8       2        3    -0.1    -0.1     -0.58     -0.52     0.48     0.42
      9       2        4    -0.2     0.1     -0.58     -0.52     0.38     0.62
     10       2        5    -0.1    -0.2     -0.58     -0.52     0.48     0.32
     11       3        1    -1.8     0.2     -0.68      0.24    -1.12    -0.04
     12       3        2    -0.5     0.0     -0.68      0.24     0.18    -0.24
     13       3        3    -1.0    -0.3     -0.68      0.24    -0.32    -0.54
     14       3        4     0.4     0.4     -0.68      0.24     1.08     0.16
     15       3        5    -0.5     0.9     -0.68      0.24     0.18     0.66

This type of merge also works if the two data sets do not have the same levels of the BY variables. Assume, for example, that the third observation is missing from data set means:

    data means;
        input treat day0mn day4mn;
        datalines;
    1 -0.24 1.38
    2 -0.58 -0.52
    ;
    run;

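To keep only those observations of the merged data set for which a mean value exists, you can use the IN= data set option. It creates a temporary variable (here y) that is 1 whenever the data set contributes to the current observation and 0 otherwise: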

    proc sort data=rabbits; by treat; run;
    proc sort data=means; by treat; run;
    data deviate;
        merge rabbits means(in=y);
        by treat;
        if y;
        dev0 = day0 - day0mn;
        dev4 = day4 - day4mn;
    run;
    proc print data=deviate; run;



which produces

    OBS   TREAT   RABBIT    DAY0    DAY4    DAY0MN    DAY4MN     DEV0     DEV4

      1       1        1    -0.3    -0.2     -0.24      1.38    -0.06    -1.58
      2       1        2    -0.5     2.2     -0.24      1.38    -0.26     0.82
      3       1        3    -1.1     2.4     -0.24      1.38    -0.86     1.02
      4       1        4     1.0     1.7     -0.24      1.38     1.24     0.32
      5       1        5    -0.3     0.8     -0.24      1.38    -0.06    -0.58
      6       2        1    -1.1    -2.2     -0.58     -0.52    -0.52    -1.68
      7       2        2    -1.4    -0.2     -0.58     -0.52    -0.82     0.32
      8       2        3    -0.1    -0.1     -0.58     -0.52     0.48     0.42
      9       2        4    -0.2     0.1     -0.58     -0.52     0.38     0.62
     10       2        5    -0.1    -0.2     -0.58     -0.52     0.48     0.32
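IN= flags can be set on every data set in the merge, which makes it possible to route unmatched observations elsewhere instead of silently dropping them. A sketch under stated assumptions: the flag names inr and inm and the data set name nomean are hypothetical, and both inputs are already sorted by TREAT:

```sas
/* Flag the source of each merged observation (inr and inm are temporary) */
data deviate nomean;
    merge rabbits(in=inr) means(in=inm);
    by treat;
    if inr and inm then do;
        dev0 = day0 - day0mn;
        dev4 = day4 - day4mn;
        output deviate;
    end;
    else if inr then output nomean;   /* rabbits without a treatment mean */
run;
```

With the two-row means data set, the five rabbits of treatment 3 would be written to nomean.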

7. Formatting

By formatting we mean the display of variable names and observations on SAS printouts in a user-defined form. By default, if a data set is printed, SAS labels the columns with the variable names and displays the values of the variables with their actual content. For readability, it is often advisable to replace the short mnemonic variable names with more descriptive captions and to display observations differently. In the preceding example, the variable TREAT refers to a treatment applied in the experiment. Without explicitly knowing what treat=1, treat=2, etc. refer to, printouts may be hard to read. Rather than re-entering the data and changing the contents of treat to a descriptive string, you can instruct SAS to display the contents of the variable differently, without changing its contents.

7.1. Labels

Labels are descriptive strings associated with a variable. Labels are assigned in DATA steps with the LABEL statement. The following data set is from an experiment concerned with the accumulation of thatch in creeping bentgrass turf under various nitrogen and thatch management protocols.

    data chloro;
        label block  = 'Experimental replicate'
              nitro  = 'Nitrogen Source'
              thatch = 'Thatch management system'
              chloro = 'Amount chlorophyll in leaves (mg/g)';
        input block nitro thatch chloro;
        datalines;
    1 1 1 3.8
    1 1 2 5.3
    1 1 3 5.9
    ;
    run;
    proc print data=chloro label; run;

produces

                                                   Amount
                                      Thatch       chlorophyll
           Experimental   Nitrogen    management   in leaves
    OBS    replicate      Source      system       (mg/g)

      1         1            1            1           3.8
      2         1            1            2           5.3
      3         1            1            3           5.9

etc.

The LABEL option of PROC PRINT instructs SAS to display labels instead of variable names if these exist. It is not necessary to assign labels to all the variables in the data set.
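Labels can also be assigned for the duration of a single procedure. A minimal sketch, assuming the chloro data set from above; the label text here is only an illustration:

```sas
/* A LABEL statement inside a PROC step applies to that step only */
proc print data=chloro label;
    label chloro = 'Chlorophyll (mg/g)';
run;
```

The label stored in the data set itself is left unchanged.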

7.2. Formats

While labels replace variable names in displays, formats substitute for the values of a variable. Many formats are predefined in The SAS System (check the SAS Language or SAS User's Guide manuals or the online help files). We are concerned here with user-defined formats. These are created in PROC FORMAT. The next example creates two different formats, NI and YEA. The definition of a format commences with the VALUE keyword followed by the format name. Then follows a list of the form value = 'some string'. If a variable is associated with a format and SAS encounters a given value, it displays the string instead. For example, values of 2 are displayed as Amm. sulf. for the variable associated with the format NI.

    proc format;
        value Ni  1='Urea' 2='Amm. sulf.' 3='IBDU' 4='Urea(SC)';
        value Yea 1='2 years' 2='5 years' 3='8 years';
    run;

The association between variable and format is made in a DATA step (it is also possible to create this association in certain procedures such as PROC FREQ, PROC PRINT, etc.) with the FORMAT statement.

    data chloro;
        label block  = 'Experimental replicate'
              nitro  = 'Nitrogen Source'
              thatch = 'Thatch management system'
              chloro = 'Amount chlorophyll in leaves (mg/g)';
        format nitro ni. thatch yea.;
        input block nitro thatch chloro;
        datalines;
    1 1 1 3.8
    1 1 2 5.3
    1 1 3 5.9
    ;
    run;

The name of the format follows the name of the variable to be formatted. The format name must be followed by a period. Otherwise SAS will assume that the format name is a variable name.
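As mentioned, the association can also be made inside a procedure, where it lasts only for that step. A minimal sketch, assuming the formats NI and YEA have been defined with PROC FORMAT in the current session:

```sas
/* Attach the formats only while printing; the data set itself is not modified */
proc print data=chloro label;
    format nitro ni. thatch yea.;
run;
```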

Printing this data set after assigning labels and formats produces a much more pleasing printout.

                                                    Amount
                                       Thatch       chlorophyll
           Experimental   Nitrogen     management   in leaves
    OBS    replicate      Source       system       (mg/g)

      1         1         Urea         2 years         3.8
      2         1         Urea         5 years         5.3
      3         1         Urea         8 years         5.9
      4         1         Amm. sulf.   2 years         5.2
      5         1         Amm. sulf.   5 years         5.6
      6         1         Amm. sulf.   8 years         5.4
      7         1         IBDU         2 years         6.0
      8         1         IBDU         5 years         5.6
      9         1         IBDU         8 years         7.8
     10         1         Urea(SC)     2 years         6.8
     11         1         Urea(SC)     5 years         8.6

Once labels and formats are associated with variables, procedures will take advantage of them. The following statements calculate descriptive statistics for the variable CHLORO with PROC MEANS.

    proc means data=chloro;
        class nitro thatch;
        var chloro;
    run;

which produces

    Analysis Variable : CHLORO Amount chlorophyll in leaves (mg/g)

    NITRO        THATCH     N Obs    N         Mean      Std Dev      Minimum
    ------------------------------------------------------------------------
    Urea         2 years        2    2    3.8500000    0.0707107    3.8000000
                 5 years        2    2    5.3500000    0.0707107    5.3000000
                 8 years        2    2    5.1000000    1.1313708    4.3000000
    Amm. sulf.   2 years        2    2    5.6000000    0.5656854    5.2000000
                 5 years        2    2    5.8500000    0.3535534    5.6000000
                 8 years        2    2    5.8000000    0.5656854    5.4000000
    IBDU         2 years        2    2    6.5000000    0.7071068    6.0000000

It is important to note that formatting a variable does not replace the values of the variable. It is simply a device for displaying the values of the variable. In DATA steps, for example, observations are accessed using the original, not the formatted, values. In the thatch experiment, assume you want to subset the observations that received the Urea treatment. Because Urea is stored as nitro = 1, the subsetting IF must test the stored value:

    data urea;
        set chloro;
        if nitro = 1;
    run;
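If a comparison against the displayed text is preferred, the format can be applied explicitly with the PUT function. A minimal sketch, assuming the NI format defined earlier is available in the session:

```sas
/* Compare against the displayed text by applying the format explicitly */
data urea;
    set chloro;
    if put(nitro, ni.) = 'Urea';
run;
```

The stored value of nitro is still 1; PUT only produces the formatted text for the comparison.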

7.3. SAS Dates

Dates are closely linked to formats in SAS. For example, date variables imported from an Excel spreadsheet or dBase file are automatically formatted and displayed in month/day/year or some other form. Often one has to make selections and data manipulations based on the value of a date variable. Since the original value, not the formatted value, determines how to access a variable, it is important to know how dates are represented in SAS.

In general, dates are measured as the number of days since January 1, 1960. May 16, 1998, for example, corresponds to a SAS date of 14015. To select all observations from a data set whose date variable falls between September 12, 1997 and May 16, 1998, one needs syntax such as

    data one; set one;
        if (date >= 13769) and (date <= 14015);
    run;
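Looking up day counts by hand is error-prone. SAS date literals of the form 'ddMONyyyy'd evaluate to the same numeric values, so the selection can be sketched as follows, assuming a data set one with a numeric date variable named date:

```sas
/* Same selection with date literals:                  */
/* '12SEP1997'd equals 13769, '16MAY1998'd equals 14015 */
data one;
    set one;
    if '12SEP1997'd <= date <= '16MAY1998'd;
run;
```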



8. Advanced data manipulation

8.1. Renaming of variables

Renaming variables in SAS is not really an advanced data manipulation, but it has a quirk that frequently causes novice users trouble. Variables are renamed with either the RENAME= data set option or the RENAME statement. The data set option form is shown here:

data two; set one(rename=(y=x1 u=z1)); run;

The statement form is part of the DATA step:

    data two; set one;
        rename y=x1 u=z1;
    run;

The difference between the two forms is that the RENAME= data set option is executed when data set one is read into the new data set, whereas the RENAME statement takes effect only when observations are written out at the end of the DATA step. Consequently, if you wish to access the variables being renamed inside the DATA step, you have to access them under their new names if you chose the RENAME= data set option, and under their old names if you chose the RENAME statement. The following code examples are both correct and compute the same values:

    data two; set one(rename=(y=x1 u=z1));
        logx = log(x1);
        /* logy = log(y);  This will produce an error, since variable Y */
        /*                 is unknown inside the DATA step.             */
        /*                 It has been renamed to X1.                   */
    run;

    data two; set one;
        /* logx = log(x1); This will not work, since x1 is unknown.     */
        /*                 The RENAME statement is not executed         */
        /*                 until the end of the DATA step.              */
        logy = log(y);
        rename y=x1 u=z1;
    run;

8.2. Retaining of variables

To understand the concept of variable retaining, it is helpful to look at DATA steps in some more detail.

    data one;
        input x;
        datalines;
    0.1
    0.5
    1.0
    2.0
    ;
    data one; set one;
        y = log(x);
        if y < 0 then z = 'negative';
        if y >= 0 then u = sqrt(y);
        keep u x y z;
        rename z = isneg;
    run;



The second DATA step consists of five statements. The KEEP and RENAME statements take effect only at the very end of the DATA step. They are, in a sense, "set aside" while the observations are processed. The first three statements are executed in sequence for every observation in the data set. SAS picks the first observation and executes the three statements. Then it moves to the second observation and executes the statements again. At the beginning of each cycle, variables that are not in the input data set (y, z, and u) are initialized with missing values in the case of numeric variables, or blanks in the case of character variables. If a statement generates a value for the variable, the missing value is overwritten. If a value is not generated, the missing value code remains. In the example above, u is assigned a value only if y is not negative. Since y is the natural logarithm of x, this is the case for the third and fourth observations. The variable z is assigned a value only if y is negative. The printout of this data set is as follows:

    OBS     X           Y    ISNEG             U

      1   0.1    -2.30259    negative          .
      2   0.5    -0.69315    negative          .
      3   1.0     0.00000                0.00000
      4   2.0     0.69315                0.83255

U contains missing values for the first two observations; ISNEG is set to an empty string for observations three and four.

Since SAS sets new variables to missing for each observation, how can one, for example, accumulate the values of a variable across observations? In the DATA step

    data one;
        input x;
        datalines;
    0.1
    0.5
    1.0
    2.0
    ;
    data one; set one;
        c = c + x;
    run;

the variable c will contain missing values, since c has no value assigned to it at the beginning of the data cycle, and adding x to a missing value yields a missing value. This can be remedied with the RETAIN statement. It causes SAS to do two things. The variables being retained are not initialized with missing values at the top of the cycle, but retain their previous values. Also, a starting value is assigned to the retained variable. The DATA step

    data one; set one;
        retain c 0;
        c = c + x;
    run;
    proc print; run;

initially sets variable c to 0. Since c is part of a RETAIN statement, it keeps the value from the previous data cycle. In successive cycles, c on the right-hand side of the "=" sign contains the previous sum of variable x, and the current value of x is added to it. The data set one at the end of the DATA step is

    OBS     X      C

      1   0.1    0.1
      2   0.5    0.6
      3   1.0    1.6
      4   2.0    3.6

A short-cut for this syntax construction can be used:

    data one; set one;
        c+x;
    run;

The syntax c+x (a sum statement) implies that c is automatically retained, starting at 0, and that the values of variable x are successively added to it.

Retaining variables is also very convenient if you work with sorted data sets. In the next example

    data spike;
        input rating $ tx;
        datalines;
    apparent 1
    discrete 2
    apparent 3
    discrete 4
    discrete 5
    discrete 6
    none 7
    discrete 8
    discrete 9
    apparent 10
    ;
    run;

data set spike contains a character variable (rating). We want to find out how many observations are in the data set for each unique value of variable rating and how many unique values there are. First, sort the data by rating. Then define two new variables, both of which are initialized with 0 and retained. For each new value of rating, the variable cnt is reset to 1. Variable howmany is incremented only when a new value of rating is found.

    proc sort data=spike; by rating; run;

    data countem; set spike; by rating;
        retain cnt 0 howmany 0;
        if first.rating then cnt=1; else cnt+1;
        if first.rating then howmany+1;
        if last.rating then output;
    run;
    proc print data=countem; run;

Here is the printout of data set countem:

    OBS    RATING      TX    CNT    HOWMANY

      1    apparent    10      3          1
      2    discrete     9      6          2
      3    none         7      1          3


There are three observations with rating='apparent', six observations with rating='discrete', and one with rating='none', for a total of three unique values of rating.
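The same combination of BY-processing and retained variables gives running totals that restart in each group. A minimal sketch, assuming the rabbits data set from section 6.2; the names runsum and total4 are hypothetical:

```sas
proc sort data=rabbits; by treat; run;

data runsum;
    set rabbits;
    by treat;
    if first.treat then total4 = 0;   /* restart the sum in each treatment group */
    total4 + day4;                    /* sum statement: total4 is retained automatically */
run;
```

Here every observation is written out, so total4 shows the cumulative sum of day4 within each treatment.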

8.3. DO Blocks and DO Loops

A DO block begins with the reserved word DO and ends with the reserved word END. The statements enclosed inside DO..END are called a block. The DO..END construct is an important device to group statements inside a DATA step.

    data two; set one;
        if x < 0 then do;
            y = .;
            z = sqrt(-x) + rannor(123);
            Tx = 'Control';
        end;
    run;

If the condition in the IF statement is true, SAS executes the statements in the DO..END block. Otherwise the statements are ignored. Without the DO..END block, the DATA step would require three IF statements. The DO statement is also part of looping constructs (iterative DO). These can be written in various ways. Here are examples:

    do i = 1 to 4;         /* an index loop, runs from 1 to 4 in increments of 1 */
        < SAS statements >
    end;

    do i = 1 to 10 by 2;   /* index loop, runs from 1 to 10 in increments of 2 */
    end;

    /* index loop over x=10, 20, 30, 50, 55, 60, 65, ... , 100 */
    do x = 10, 20, 30, 50 to 100 by 5;
    end;

    do month = 'FEB', 'MAR', 'APR';
    end;

    /* the statements inside the loop are executed only while the */
    /* expression in parentheses is true                          */
    do k = 1 to 12 while(month='APR');
    end;

The next example generates 100 observations from a Gaussian distribution with mean 12 and variance 3. For each observation, it calculates its right-tail probability:

    data Gauss;
        do i = 1 to 100;
            z = rannor(8923);
            p = 1 - Probnorm(z);
            x = z*Sqrt(3) + 12;
            output;
        end;
    run;

Notice that a DO i = 1 to 100; .. END; construct is executed once in every cycle of the DATA step, that is, for each observation in the data set being read. Since no observations are input or transferred from another SAS data set here, the DATA step runs through a single cycle, and you need the OUTPUT statement inside the DO loop to instruct SAS to write an observation to the data set at each pass through the loop.
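For contrast, a DATA step that does read an existing data set runs the loop once per input observation. A minimal sketch, assuming a data set named one already exists (two is a hypothetical output name):

```sas
data two;
    set one;
    do i = 1 to 4;
        output;        /* writes the current observation at every pass */
    end;
run;
```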

A loop such as do i = 1 to 4; output; end; in a DATA step that reads an existing data set is executed for each observation and simply produces four copies of each observation. The OUTPUT statement should be the last statement inside the loop.

Other forms of DO loops are the DO..WHILE() and DO..UNTIL() constructs. A logical expression inside the parentheses is evaluated for each repetition of the loop. The loop continues as long as the expression is true for the DO..WHILE() loop, or until the expression becomes true for the DO..UNTIL() loop. DO..WHILE and DO..UNTIL loops are dangerous. For example, if the logical expression in the WHILE() condition is not true, the loop will never execute. If it is true, there must be a mechanism inside the loop that eventually makes the expression false, otherwise the loop will continue infinitely. It is important to remember that the loop is executed for each observation in the data set. Care must be exercised not to write infinite loops with DO..WHILE. For example, the following loop

n=0;do while(n



            z = sqrt(-x);
        end; else z = sqrt(x);
    run;

IF .. THEN .. ELSE statements can be nested:

    data two; set one;
        length rating $ 10;   /* long enough for the longest string assigned */
        if score < 4 then rating = 'below ave';
        else if score < 6 then rating = 'average';
        else if score < 8 then rating = 'above ave.';
        else rating = 'superior';
    run;

If many cases are to be distinguished, or multiple statements are to be executed for one or all conditions, this construct is hard to read and debug. A simpler method uses the SELECT case distinction.

8.5. SELECT case distinction

SELECT expressions are more convenient to program and easier to read than a series of nested (and convoluted) IF..THEN expressions. Recall the survey data set

    DATA survey;
        INPUT id sex $ age inc r1 r2 r3;
        DATALINES;
    1 F 35 17 7 2 2
    17 M 50 14 5 5 3
    33 F 45 6 7 2 7
    49 M 24 14 7 5 7
    65 F 52 9 4 7 7
    81 M 44 11 7 7 7
    2 F 34 17 6 5 3
    18 M 40 14 7 5 2
    34 F 47 6 6 5 6
    50 M 35 17 5 7 5
    ;

We wish to recode variable r3 and assign character strings. This could be accomplished with a series of IF..THEN statements:

    data survey; set survey;
        length rating $ 10;   /* long enough for the longest string assigned */
        if r3 < 3 then rating = 'below ave';
        else if r3 < 5 then rating = 'average';
        else if r3 < 6 then rating = 'above ave.';
        else rating = 'superior';
    run;

The SELECT construct would be

    data two; set survey;
        select;
            when ( r3 < 3 ) rating = 'below ave.';
            when ( r3 < 5 ) rating = 'average';
            when ( r3 < 6 ) rating = 'above ave.';
            otherwise rating = 'superior';
        end;
    run;

SAS evaluates a logical expression for each of the WHEN expressions. For the first expression that returns true, the statement is executed. If no WHEN expression is true, SAS executes the statement following OTHERWISE. The OTHERWISE statement is optional, but should be used as a safeguard. If none of the WHEN expressions is true and the OTHERWISE clause is missing, SAS stops the DATA step with an error message. An alternative method of writing the SELECT..END construct in this example is

    data two; set survey;
        select ( r3 );
            when ( 1,2 ) rating = 'below ave.';
            when ( 3,4 ) rating = 'average';
            when ( 5 ) rating = 'above ave.';
            otherwise rating = 'superior';
        end;
    run;

The variable which follows the SELECT keyword in parentheses is compared against the values in the WHEN expressions. The first WHEN expression that is true will be executed, all others will be ignored.
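When several statements must run for one case, a WHEN or OTHERWISE clause can hold a DO..END block. A minimal sketch, assuming the survey data set from above; the indicator variable low is hypothetical:

```sas
/* WHEN clauses can run several statements by wrapping them in DO..END */
data two;
    set survey;
    length rating $ 10;
    low = 0;
    select (r3);
        when (1, 2) do;
            rating = 'below ave.';
            low = 1;               /* flag the lowest category */
        end;
        when (3, 4) rating = 'average';
        when (5)    rating = 'above ave.';
        otherwise   rating = 'superior';
    end;
run;
```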

8.6. Arrays

Arrays in SAS are different from arrays in other programming languages. A SAS array is simply a grouping of variables in a data set for ease of processing. Array processing in SAS consists of three steps:

1. Define the array, which groups variables
2. Process the array by repeating an action
3. Select an individual element of the array at each repetition of the action

This probably sounds more complicated than it is. Here is an example. Three variables (X, Y, Z) are read into a data set. Initially, missing values were coded as 9999. You want to replace them with a missing value (.) to prevent accidental involvement in calculations. You could of course do this with three IF statements:

    data two; set one(keep=X Y Z);
        if x = 9999 then x = .;
        if y = 9999 then y = .;
        if z = 9999 then z = .;
    run;

Notice that the structure of the IF statements is very similar. Each variable is compared against 9999, and if it contains that value, it is replaced with a missing value. Processing the same problem with arrays could proceed as follows:

    data two; set one(keep = X Y Z);
        array vars{3} x y z;   /* define the array, i.e., group the variables x y z into array vars */
        do i = 1 to 3;         /* repeat an action for each element of the array */
            if vars{i} = 9999 then vars{i} = .;   /* select individual elements of the array */
        end;
    run;

The ARRAY statement tells SAS that you are about to define an array. It is followed by the name of the array and its dimension (the number of variables being grouped) in curly braces. Then you follow it with the names of the variables being grouped. The number of variable names should match the number in curly braces. If the variables being listed are not part of the data set, they will be created. You can omit the number of array elements in curly braces and replace it with an asterisk. SAS will then determine the number of array elements from the number of variables in the list:

array vars{*} x y z;

Alternatively, you can omit the variable names and SAS will create variables named ArrayName1, ArrayName2, .... For example

    array vars{4};

will create variables VARS1, VARS2, VARS3, VARS4.

A particular element of the array is accessed inside the DATA step by following the array name with a number in curly braces. The second element of the vars array, the variable y, is accessed as vars{2}, for example. DO i=1... loops are very common to process the elements of an array in turn.

If the variables in the list of the ARRAY statement are not part of the data, they are added to the data set. Sometimes you need an array only during the DATA step. In that case, put the reserved word _TEMPORARY_ in place of the variable list; the elements of a temporary array are not variables and are not written to the data set:

    array temparr{10} _temporary_;
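The asterisk form combines well with the DIM function, which returns the number of elements in an array, so the loop bound does not have to be hard-coded. A minimal sketch, assuming that every numeric variable in data set one uses 9999 as its missing-value code:

```sas
/* Recode 9999 to missing in every numeric variable of data set one */
data two;
    set one;
    array nums{*} _numeric_;        /* all numeric variables defined so far */
    do i = 1 to dim(nums);          /* DIM returns the number of elements */
        if nums{i} = 9999 then nums{i} = .;
    end;
    drop i;
run;
```

Because the loop index i is created after the ARRAY statement, it is not itself part of the array.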

A typical example where array processing is helpful is when switching a multivariate data set into a univariate data set. The next data set stems from a repeated measures study in which a temperature difference was measured at four time points for 15 subjects. Five subjects each received one of three treatments. The time points are spaced 15 days apart.

    /* The data as a multivariate data set */
    data repmeas;
        input Treat subject time1 time2 time3 time4;
        datalines;
    1 1 -0.3 -0.2 1.2 3.1
    1 2 -0.5 2.2 3.3 3.7
    1 3 -1.1 2.4 2.2 2.7
    1 4 1.0 1.7 2.1 2.5
    1 5 -0.3 0.8 0.6 0.9
    2 1 -1.1 -2.2 0.2 0.3
    2 2 -1.4 -0.2 -0.5 -0.1
    2 3 -0.1 -0.1 -0.5 -0.3
    2 4 -0.2 0.1 -0.2 0.4
    2 5 -0.1 -0.2 0.7 -0.3
    3 1 -1.8 0.2 0.1 0.6
    3 2 -0.5 0.0 1.0 0.5
    3 3 -1.0 -0.3 -2.1 0.6
    3 4 0.4 0.4 -0.7 -0.3
    3 5 -0.5 0.9 -0.4 -0.3
    ;
    run;

This data set is set up appropriately for a repeated measures analysis with PROC GLM, but not for a repeated measures analysis with PROC MIXED. PROC MIXED requires that the four measurements appear as separate observations indexed by a variable measuring passage of time. The next DATA step creates this univariate structure with the help of an array:

    /* create the univariate data set */
    data univ; set repmeas;
        array t{4} time1-time4;   /* group variables time1 through time4 into an array */
        do i = 1 to 4;            /* loop through the array */
            time = 15*(i-1);      /* calculate the variable measuring passage of time */
            temp = t{i};          /* record the temperature as the ith element of the array */
            output;               /* write the observation to the new data set */
        end;
        drop time1-time4 i;       /* no need for those anymore */
    run;
    proc print data=univ(obs=20); run;

The DO i = 1 to 4 loop creates four observations from each observation in the multivariate data set. The TREAT and SUBJECT variables remain unchanged, but a variable for time and one for the actual measured temperature must be created. Since each temperature is stored in a different variable (time1 .. time4) in the multivariate data set, the array is perfectly suited to deal with the problem. Here are the first twenty observations of the univariate data set:

    OBS   TREAT   SUBJECT   TIME   TEMP

      1       1         1      0   -0.3
      2       1         1     15   -0.2
      3       1         1     30    1.2
      4       1         1     45    3.1
      5       1         2      0   -0.5
      6       1         2     15    2.2
      7       1         2     30    3.3
      8       1         2     45    3.7
      9       1         3      0   -1.1
     10       1         3     15    2.4
     11       1         3     30    2.2
     12       1         3     45    2.7
     13       1         4      0    1.0
     14       1         4     15    1.7
     15       1         4     30    2.1
     16       1         4     45    2.5
     17       1         5      0   -0.3
     18       1         5     15    0.8
     19       1         5     30    0.6
     20       1         5     45    0.9

8.7. Lagged variables

At any cycle of the DATA step, only the values of the variables for the current observation are available for processing. Values for previous or upcoming observations are not accessible. You can make the values of previously processed observations available through the LAG functions:

    LAG(variable name) returns the previous value
    LAG2(variable name) returns the value of the observation processed second to last
    LAG5(variable name) returns the value of the observation processed 5 cycles ago

and so forth. Lagging temperatures in the previous example:

    data univ; set univ;
        lagt = lag(temp);
        lag2t = lag2(temp);
    run;
    proc print data=univ(obs=20); run;

produces

    OBS   TREAT   SUBJECT   TIME   TEMP   LAGT   LAG2T

      1       1         1      0   -0.3      .       .
      2       1         1     15   -0.2   -0.3       .
      3       1         1     30    1.2   -0.2    -0.3
      4       1         1     45    3.1    1.2    -0.2
      5       1         2      0   -0.5    3.1     1.2
      6       1         2     15    2.2   -0.5     3.1
      7       1         2     30    3.3    2.2    -0.5
      8       1         2     45    3.7    3.3     2.2
      9       1         3      0   -1.1    3.7     3.3
     10       1         3     15    2.4   -1.1     3.7
     11       1         3     30    2.2    2.4    -1.1
     12       1         3     45    2.7    2.2     2.4
     13       1         4      0    1.0    2.7     2.2
     14       1         4     15    1.7    1.0     2.7
     15       1         4     30    2.1    1.7     1.0
     16       1         4     45    2.5    2.1     1.7
     17       1         5      0   -0.3    2.5     2.1
     18       1         5     15    0.8   -0.3     2.5
     19       1         5     30    0.6    0.8    -0.3
     20       1         5     45    0.9    0.6     0.8
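In the printout, the lagged values carry over from one subject to the next (observation 5 shows the last temperature of subject 1). If lags should stay within a subject, one possible approach combines LAG with BY-processing. A sketch under stated assumptions: univ is in TREAT and SUBJECT order, and the names univlag and prevtemp are hypothetical:

```sas
/* Keep lags from crossing subject boundaries */
data univlag;
    set univ;
    by treat subject;
    prevtemp = lag(temp);                /* call LAG on every observation */
    if first.subject then prevtemp = .;  /* no previous value within a new subject */
run;
```

LAG is called unconditionally here because it returns the value from its own previous call, not from the previous observation; calling it inside an IF condition would skip values.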

8.8. Generating multiple data sets in a DATA step

Sometimes you want to generate multiple data sets in a single DATA step. For example, when reading a large amount of data from a file, you may want to separate it into data sets according to the year of measurement. This can be done by listing multiple data set names after the DATA keyword and using separate OUTPUT statements for each:

    filename inf 'C:\Research\Data\Allyears\The Whole Thing.txt';
    data y1994 y1995 y1996 y1997;
        infile inf firstobs=20;
        input year location block a b absorp transloc;
        select (year);
            when (1994) output y1994;
            when (1995) output y1995;
            when (1996) output y1996;
            otherwise output y1997;
        end;
    run;

Notice that the SELECT..END construction was used here instead of multiple IF..THEN statements.

8.9. Converting character variables to numeric variables

SAS uses several rules to convert character to numeric variables. For example, if a character variable is used with a numeric operator such as addition, multiplication, etc., its value is automatically converted to numeric. If a numeric variable is on the left side and a character variable on the right side of an assignment statement, the character value is converted to numeric format.

    data test;
        input s $;
        x = 0;
        x = s;
        datalines;
    12345
    ;
    run;

In this example, variable S is read as a character variable. The statement x = 0; defines x as a numeric variable. The next statement, x = s;, invokes the character-to-numeric conversion; x will contain the values of s in numeric format. Alternatively, you could have combined the two statements into a single statement

x = s + 0;
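Automatic conversion works, but SAS writes a note to the log each time it happens. An explicit conversion with the INPUT function states the intent directly. A minimal sketch; the informat width 8. is an assumption that matches the default length of s:

```sas
/* Explicit conversion with INPUT instead of relying on automatic conversion */
data test;
    input s $;
    x = input(s, 8.);   /* read the text in s with the numeric informat 8. */
    datalines;
12345
;
run;
```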
