STAT0023 Week 8

Simple data analysis in SAS

Richard Chandler and Ioanna Manolopoulou

Introduction

This week in SAS . . .

Simple descriptive statistics and exploratory graphics

Simple hypothesis tests and confidence intervals for means

SAS statement structure and vocabulary: options, secondary options etc.

Finding things in the help system

Providing and accessing metadata

More on importing and exporting data

Exploratory analysis

Recap: aims of an exploratory analysis

From lecture 1:

1 To gain a preliminary understanding of structure in a dataset

2 To look for possible outliers or data quality problems

3 To suggest some initial assumptions (e.g. normality of residuals,

constant variance) that may be reasonable as a starting point in

subsequent modelling and analysis

Techniques: descriptive statistics, frequency tables, histograms,

boxplots, scatterplots etc.

Descriptive statistics & simple analysis: useful procedures

PROC MEANS, PROC UNIVARIATE: calculate summary statistics for

numeric variables, possibly separately for different groups within a

data set — also confidence intervals for means, 1-sample t-tests etc.

PROC FREQ: calculates frequency / contingency tables — also tests

for association e.g. chi-squared tests

PROC TTEST: two-sample and paired t-tests

PROC NPAR1WAY: nonparametric alternatives to t-tests e.g. for small

non-normal samples
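As a minimal sketch of the last two procedures in the list above, the code below runs a two-sample t-test, a paired t-test and a Wilcoxon rank-sum test. The choice of sashelp.class (a standard SAS sample data set) and of the variables sex and height is an assumption made purely for illustration; the paired test reuses the enrolment variables from sashelp.demographics.

```sas
/* Two-sample t-test: compare mean height between the two sexes.
   sashelp.class is used here only as an illustrative data set. */
PROC TTEST DATA=sashelp.class;
  CLASS sex;
  VAR height;
RUN;

/* Paired t-test: male versus female enrolment within each country. */
PROC TTEST DATA=sashelp.demographics;
  PAIRED MaleSchoolpct*FemaleSchoolpct;
RUN;

/* Nonparametric alternative to the two-sample t-test:
   the WILCOXON option requests the Wilcoxon rank-sum test. */
PROC NPAR1WAY DATA=sashelp.class WILCOXON;
  CLASS sex;
  VAR height;
RUN;
```

In PROC TTEST the CLASS variable must take exactly two values, which is why a two-level variable such as sex is used in this sketch.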

How best to learn?

Follow some examples

Use the help system

SAS statement structure

Example: calculating summaries using PROC MEANS

SAS code:

PROC MEANS DATA=sashelp.demographics N MEAN STD MIN MAX

MAXDEC=3 FW=8;

CLASS region;

VAR MaleSchoolPct FemaleSchoolPct;

RUN;

PROC MEANS: calculates summary statistics for numeric variables

CLASS statement: calculate statistics separately for different groups

of observations defined by the region variable

VAR statement: defines variables of interest

DATA=sashelp.demographics, MAXDEC=3 and FW=8 are options

controlling overall behaviour of MEANS procedure (which data set to

use, how to format the output).

N MEAN STD MIN MAX are statistic keywords — special options

controlling which statistics are produced.
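Other statistic keywords give the confidence intervals and one-sample t-tests mentioned earlier. The following is a sketch only: CLM requests two-sided confidence limits for the mean, while T and PROBT give the t statistic and p-value for the null hypothesis that the mean is zero (not a scientifically interesting hypothesis for these variables, but it shows the mechanics). The ALPHA=0.05 setting is an illustrative choice.

```sas
/* Confidence limits for the mean (CLM) and a one-sample t-test
   of H0: mean = 0 (T and PROBT), at the level set by ALPHA=. */
PROC MEANS DATA=sashelp.demographics N MEAN CLM T PROBT
           ALPHA=0.05 MAXDEC=3;
  VAR MaleSchoolPct FemaleSchoolPct;
RUN;
```

Adding a CLASS region; statement would produce the same statistics separately for each region.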

Another example: frequency tables using PROC FREQ

SAS code:

PROC FREQ DATA=sashelp.demographics;

TABLES cont*region / NOPERCENT NOROW NOCOL;

RUN;

PROC FREQ: calculates frequency tables

TABLES statement: defines structure of required tables (here a

two-way table counting how many times each combination of cont

and region occurs)

DATA=sashelp.demographics is an option controlling overall

behaviour of FREQ procedure (which data set to use).

NOPERCENT NOROW NOCOL are options controlling behaviour of

TABLES statement (suppress row, column and overall percentages in

output).

Note two ways of specifying options:

In PROC statement: given directly following procedure name.

In subsequent statements: following ‘/’ symbol.
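The same '/' mechanism is how tests of association are requested in PROC FREQ. A minimal sketch, in which the choice of the CHISQ and EXPECTED options is ours rather than part of the original example:

```sas
/* CHISQ requests chi-squared tests of association for the table;
   EXPECTED adds expected cell counts under independence.
   Both are options of the TABLES statement, so they follow '/'. */
PROC FREQ DATA=sashelp.demographics;
  TABLES cont*region / CHISQ EXPECTED NOPERCENT NOROW NOCOL;
RUN;
```

With a sparse table such as this one, SAS may warn that the chi-squared approximation is unreliable because some expected counts are small.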

Yet another example: graphics using PROC UNIVARIATE

SAS code:

PROC UNIVARIATE DATA=sashelp.demographics NOPRINT;

VAR FemaleSchoolpct;

HISTOGRAM / NOBARS KERNEL (LOWER=0 UPPER=1 C=SJPI W=3);

INSET MEAN (5.2) STD="Std Dev" (5.2) Q1 (5.2) MEDIAN (5.2) Q3 (5.2);

RUN;

PROC UNIVARIATE: calculates summary statistics, histograms and

kernel density estimates (see later)

VAR statement: selects variable of interest

HISTOGRAM statement: requests graphical output

INSET statement: produces table of summary statistics on graphics

output

Note options and statistic keywords.

Note also secondary options:

Allow finer control of primary option and statistic keyword behaviour

Enclosed in brackets ‘()’
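To see secondary options attached to more than one primary option, the sketch below overlays a fitted normal density as well as a kernel density estimate. The NORMAL option and its MU=EST and SIGMA=EST secondary options (estimate the mean and standard deviation from the data) are our choice of illustration, not part of the original example.

```sas
PROC UNIVARIATE DATA=sashelp.demographics NOPRINT;
  VAR FemaleSchoolpct;
  /* NORMAL and KERNEL are primary options of HISTOGRAM;
     the bracketed settings are their secondary options. */
  HISTOGRAM / NORMAL(MU=EST SIGMA=EST) KERNEL(C=SJPI);
  /* The bracketed formats are secondary options of the
     statistic keywords in the INSET statement. */
  INSET N (3.0) MEAN (5.2) STD="Std Dev" (5.2);
RUN;
```

Comparing the two curves gives a quick visual check of whether a normality assumption looks reasonable.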

PROC UNIVARIATE example: the result

[Figure: output of the PROC UNIVARIATE code above, showing the kernel density estimate for FemaleSchoolpct with an inset table of summary statistics]

The help system

Navigating the help system

Easy way to find help on any SAS procedure: click procedure name in

program and press F1 key.

Most help pages have several tabs including:

Syntax: summary of options and subsequent statements that can be

used;

Overview: description of what the procedure can do;

Concepts: key ideas that you need to understand more advanced use

of the procedure;

Examples: often the most useful part!

Some pages also have other tabs, giving details of how calculations

are done etc.

Example: look at help for PROC UNIVARIATE

Histograms and density estimates

Graphical displays: histograms and density estimates

Previous PROC UNIVARIATE produced kernel density estimate of

underlying probability density function

Also seen in Week 5 using density() command in R

Kernel density estimates are modern alternative to histograms …

Kernel density estimation: preliminaries

Revision: how to draw a relative frequency histogram

1 Divide range of data into nonoverlapping intervals

2 Calculate rectangle heights as

(# of observations in interval) / (n × interval width)

3 Draw rectangles

[Figure: histogram of a sample of n = 50 observations; horizontal axis y (-3 to 3), vertical axis Density (0.00 to 0.30)]

Notes

Scaling in step 2 ensures total area is 1 ⇒ histogram is a (crude)

probability density function (PDF)

Results can be sensitive to number and positioning of intervals in step 1
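The claim that the scaling in step 2 gives total area 1 can be checked directly. In the sketch below the symbols are our own notation: c_j is the number of observations in interval j, w_j is its width and h_j is the rectangle height. The numbers in the second line are hypothetical, chosen only to show the arithmetic.

```latex
% Total area of the rectangles: each has height h_j = c_j / (n w_j) and width w_j
\sum_j h_j w_j \;=\; \sum_j \frac{c_j}{n\,w_j}\,w_j \;=\; \frac{1}{n}\sum_j c_j \;=\; 1

% Hypothetical example: 8 of n = 50 observations fall in an interval of width 0.5
h \;=\; \frac{8}{50 \times 0.5} \;=\; 0.32
```

The last equality in the first line holds because every observation falls in exactly one interval, so the counts sum to n.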

Histograms → kernel density estimates

Idea: use moving window instead of fixed set of intervals . . .

. . . and use smooth ‘kernel’ to give more weight to observations near

centre of window

[Figure, two panels, each with horizontal axis y (-3 to 3) and vertical axis Density (0.0 to 0.4). Panel 1: moving window histogram overlaid on the original histogram. Panel 2: kernel density estimate overlaid on the original histogram.]
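The idea can be written as a single formula. This is the standard definition of a kernel density estimate; the notation (observations y_1, ..., y_n, bandwidth h, kernel K) is ours rather than taken from the slides.

```latex
% Kernel density estimate at the point y
\hat{f}(y) \;=\; \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{y - y_i}{h}\right),
\qquad \int_{-\infty}^{\infty} K(u)\,du = 1

% Moving window histogram: uniform (box) kernel
K(u) = \tfrac{1}{2} \ \text{for } |u| \le 1, \qquad K(u) = 0 \ \text{otherwise}

% A smooth alternative: Gaussian kernel
K(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^{2}/2}
```

With the uniform kernel, the estimate at y is the number of observations within h of y, divided by n × 2h: the histogram height formula applied to a window of width 2h centred at y.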

Notes on kernel density estimates

Kernel density estimate is estimate of underlying PDF

Total area under a PDF is 1 (recall relative frequency histogram) —

area under kernel function must also be 1 to ensure this

Smoothness of kernel translates to smooth kernel density estimate

Exact choice of kernel unimportant (within reason!)

Width of kernel (‘bandwidth’) is important:

Too narrow ⇒ density estimate very wiggly

Too big ⇒ important details obscured

Various ‘automatic’ choices available (fine for exploratory analysis)

e.g. C=SJPI option in HISTOGRAM statement for PROC UNIVARIATE

Refinements available if distribution has known upper / lower

endpoint (note LOWER and UPPER options in HISTOGRAM statement)
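One way to see the effect of bandwidth is to request several kernel density estimates on the same plot. In the sketch below the C= secondary option is given a list of standardised bandwidths instead of an automatic rule; the particular values 0.1, 0.5 and 2 are arbitrary illustrative choices.

```sas
PROC UNIVARIATE DATA=sashelp.demographics NOPRINT;
  VAR FemaleSchoolpct;
  /* One density estimate per value of C:
     small C = narrow kernel (wiggly), large C = wide kernel (smooth).
     LOWER= and UPPER= are kept as in the earlier example. */
  HISTOGRAM / KERNEL(C=0.1 0.5 2 LOWER=0 UPPER=1);
RUN;
```

Replacing the list with C=SJPI, as in the earlier example, lets SAS choose a single bandwidth automatically.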

Other graphical displays

PROC GCHART: bar charts to show frequency distributions of discrete

variables, or compare relative values of different quantities

PROC BOXPLOT: boxplots

PROC GPLOT: scatterplots
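A minimal sketch of the first two procedures, using the same data set as the earlier examples. It assumes that region is a character variable, and the name work.demog_sorted for the sorted copy is our own.

```sas
/* Bar chart: number of countries in each region */
PROC GCHART DATA=sashelp.demographics;
  VBAR region;
RUN;
QUIT;

/* PROC BOXPLOT expects the data to be sorted by the grouping
   variable, so sort a copy of the data set first. */
PROC SORT DATA=sashelp.demographics OUT=work.demog_sorted;
  BY region;
RUN;

/* Side-by-side boxplots of female enrolment, one per region */
PROC BOXPLOT DATA=work.demog_sorted;
  PLOT FemaleSchoolpct*region;
RUN;
```

PROC GCHART supports run-group processing, so QUIT; is needed to end the procedure.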

Customising graphics output

Axis labels, plotting symbols, line types etc. defined using global

statements e.g. AXISn, SYMBOLn.

Graphics statements then use options to specify which axis / symbol

definition etc. to use.

Example: code for customising graphics output

GOPTIONS colors=(blue gold red);

AXIS1 LABEL=("Male enrolment") WIDTH=2;

AXIS2 LABEL=("Female enrolment") WIDTH=2;

SYMBOL1 VALUE=squarefilled;

SYMBOL2 VALUE=trianglefilled;

PROC GPLOT DATA=sashelp.demographics;

PLOT FemaleSchoolPct * MaleSchoolPct =region /

HAXIS=axis1 VAXIS=axis2;

RUN;

QUIT;

Global statements set up axis definitions and plotting symbols

HAXIS and VAXIS options to PLOT statement use AXIS1 definition for

horizontal axis and AXIS2 definition for vertical axis

PLOT statement automatically cycles through defined symbols and

colours as required

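Example: result of previous code

[Figure: scatterplot of FemaleSchoolPct (vertical axis, "Female enrolment") against MaleSchoolPct (horizontal axis, "Male enrolment"), with plotting symbols and colours distinguishing the regions]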

More on data management and manipulation

In this week’s self-study materials and workshop

Defining metadata i.e. ‘data about the data’:

Dataset description and provenance: summary of what the data set

represents, where the data came from etc.

Detailed information about each variable: description for informative

labelling of output, units of measurement etc.

More on reading data e.g. from files with no spaces between

variables, or with more than one record per line

Exporting SAS data sets and analysis results to other file formats

e.g. Excel, CSV etc.
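The topics above might look as follows in code. This is a sketch under stated assumptions, not the workshop material itself: the data set names (work.demog, work.scores, work.packed), the label text, the data values and the output file name are all made up for illustration.

```sas
/* Metadata: a data set label plus variable labels */
DATA work.demog (LABEL="Copy of sashelp.demographics with added labels");
  SET sashelp.demographics;
  LABEL MaleSchoolpct   = "Male enrolment"
        FemaleSchoolpct = "Female enrolment";
RUN;

/* Accessing the metadata: lists the data set and variable labels */
PROC CONTENTS DATA=work.demog;
RUN;

/* More than one record per line: the trailing @@ holds the
   input line so that further records can be read from it. */
DATA work.scores;
  INPUT id score @@;
  DATALINES;
1 12.5 2 13.1 3 9.8
4 10.2 5 11.7
;
RUN;

/* No spaces between variables: column input reads each
   variable from fixed column positions. */
DATA work.packed;
  INPUT id 1-2 score 3-6;
  DATALINES;
0112.5
0213.1
;
RUN;

/* Export to CSV; the file name is a placeholder */
PROC EXPORT DATA=work.scores OUTFILE="scores.csv" DBMS=CSV REPLACE;
RUN;
```

Variable labels defined in this way are used automatically in the output of most procedures, which is the main practical reason for providing them.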
