STAT0023 Computing for Practical Statistics
In-course assessment 2, take-home component (2020–21 session)
Table of Contents
Rubric
Background and overview
Detailed instructions
Marking criteria
Hints on tackling the assessment
Appendix: the UKCovidWave1.csv dataset
  Data sources and pre-processing
  Description of variables
    Overall information about each MSOA and its population
    Household information for each MSOA
    Age profile for each MSOA: variables Age0-4, Age5-7, …, Age90+
    Ethnicity and immigration
    Unpaid carers
    Household accommodation
    People living in communal establishments
    Employment / occupation
    Social grade: variables GradeAB, GradeC1, GradeC2 and GradeDE
    Public transport use: variables MetroUsers, TrainUsers and BusUsers
    Education and qualifications
Rubric
• Your solutions should be your own work and are to be submitted electronically to the
course Moodle page by 12 noon on MONDAY, 26TH APRIL 2021.
• You can work either alone or in pairs for this assessment. It is up to you to form your own
pairs. You MUST register your choices on Moodle by 12 noon on MONDAY, 29TH MARCH
2021, even if you choose to work alone.
• If you choose to work in a pair, you will be jointly responsible for the work that is submitted
and you will be awarded the same mark.
• Ensure that you electronically ‘sign’ the plagiarism declaration on the Moodle page when
submitting your work. If you choose to work in a pair, both of you should check what has
been submitted before signing this declaration: if any plagiarism or collusion is identified
with anyone outside your pair, you will share responsibility for it.
• Late submission will incur a penalty unless there are extenuating circumstances
(e.g. medical) supported by appropriate documentation and notified within one week of
the deadline above. Penalties, and the procedure in case of extenuating circumstances, are
set out in the latest editions of the Statistical Science Department student handbooks
which are available from the departmental web pages.
• Failure to submit this in-course assessment will mean that your overall examination mark is
recorded as “non-complete”, i.e. you will not obtain a pass for the course.
• Submitted work that exceeds the specified word count will be penalised. The penalties are
described in the detailed instructions below.
• Your solutions should be your own work. When uploading your scripts, you will be required
to electronically sign a statement confirming this, and that you have read the Statistical
Science department’s guidelines on plagiarism and collusion (see below).
• Any plagiarism or collusion can lead to serious penalties for all students involved, and may
also mean that your overall examination mark is recorded as non-complete. Guidelines as
to what constitutes plagiarism may be found in the departmental student handbooks: the
relevant extract is provided on the ‘In-course assessment 2’ tab on the STAT0023 Moodle
page. The Turn-It-In plagiarism detection system may be used to scan your submission for
evidence of plagiarism and collusion.
• You will receive feedback on your work via Moodle, and you will receive a provisional
grade. Grades are provisional until confirmed by the Statistics Examiners’ Meeting in June
2021.
Background and overview
When the Covid-19 pandemic was first recognised in early 2020, it quickly became apparent that
age was the main risk factor for becoming seriously ill or dying from the disease. Researchers
have also identified other risk factors including gender, social deprivation, pre-existing health
conditions and ethnicity.¹ Understanding these risk factors can potentially help to develop
strategies for reducing deaths, for example by targeting appropriate healthcare resources in
areas that need them the most.²
In the UK, the Office for National Statistics (ONS) publishes a variety of information on Covid. An
ONS report from August 2020³ produced a simple analysis of Covid death rates across England
and Wales, between March and July 2020. In this assessment we will examine more closely the
data used in that report and try to understand why some areas have more deaths than others,
by linking to UK Census data on the socio-economic characteristics of the different areas.
We will use data consisting of the total numbers of reported deaths in the period March–July
2020, where Covid-19 was given as the cause of death, for each of 7201 “Middle Layer Super
Output Areas” (MSOAs) in England and Wales. According to the ONS report cited above, Super
Output Areas are “small-area statistical geographies covering England and Wales”, each of which
has a similarly sized population and remains stable over time. These data are from the ONS web
site.⁴ They have been combined with demographic and socioeconomic data from the most
recent UK Census in 2011, obtained by querying datasets at the Nomis Labour Market Statistics
service; and also with some geographic information from the UK’s Open Geography Portal.
The data are provided in the file UKCovidWave1.csv, available from the ‘In-course assessment
2’ tab of the STAT0023 Moodle page. This contains an anonymised version of the original data.
Full details, including the anonymisation procedure (which includes rounding of most variables)
can be found in the Appendix to these instructions. The first 5 401 rows are complete, i.e.,
contain all values of the death count and covariates. The last 1 800 rows contain all values of the
covariates, but -1 for the death counts.
Your task in this assessment is to use the data from the first 5 401 records, to build a statistical
model that will help you to:
• Understand the social, demographic and economic factors associated with variation
between MSOAs in numbers of Covid deaths during the period March–July 2020; and
• Estimate the numbers of deaths for each of the 1 800 records where you don’t have this
information.
1 See, for example, Williamson et al. (2020): “Factors associated with COVID-19-related death using
OpenSAFELY” (Nature 584, pp. 430–436).
2 For a more general overview of the key role that statistics has to play in responding to crises, see the
Royal Statistical Society’s Ten recommendations on better use of stats and data in a pandemic,
released on 8th March 2021.
3 ONS Statistical Bulletin “Deaths involving COVID-19 by local area and socioeconomic deprivation:
deaths occurring between 1 March and 31 July 2020”, published August 2020.
4 Here and elsewhere, clicking on the blue text will take you to the relevant web site.
Detailed instructions
You may use either R or SAS for this assessment.
1. Read the data into your chosen software package and carry out any necessary recoding
(e.g. to deal with the fact that -1 represents a missing value).
2. Carry out an exploratory analysis that will help you to start building a sensible statistical
model to understand and predict the numbers of Covid deaths in each MSOA. This analysis
should aim to identify an appropriate set of candidate variables to take into the subsequent
modelling exercise, as well as to identify any important features of the data that may have
some implications for the modelling. You will need to consider the context of the problem
to guide your choice of exploratory analysis. See the ‘Hints’ below for some ideas.
3. Using your exploratory analysis as a starting point, develop a statistical model that enables
you to predict the number of Covid deaths for each MSOA based on (a subset of) the other
variables in the dataset, and also to understand the variation in deaths between different
MSOAs. To be convincing, you will need to consider a range of models and to use an
appropriate suite of diagnostics to assess them. Ultimately however, you are required to
recommend a single model that is suitable for interpretation, and to justify your
recommendation. Your chosen model should be either a linear model, a generalized linear
model or a generalized additive model.
4. Use your chosen model to predict the number of Covid deaths for each MSOA where this
information is missing, and also to estimate the standard deviation of your prediction
errors.
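Steps 1, 3 and 4 can be sketched in R as follows. This is a minimal illustration on simulated data, not a recommended analysis. The column names Deaths, Pop and PropOver70 are hypothetical (only ID is named in these instructions), the Poisson model with a population offset is just one candidate among the permitted model classes, and the prediction-error standard deviation uses the approximation Var(Y - mu_hat) ≈ mu_hat + se(mu_hat)^2, which assumes Poisson variation and ignores any overdispersion.

```r
# Sketch only: Deaths, Pop and PropOver70 are hypothetical column names;
# the simulated data frame stands in for read.csv("UKCovidWave1.csv").
set.seed(1)
n <- 200
covid <- data.frame(
  ID = seq_len(n),
  Pop = round(runif(n, 5000, 12000)),
  PropOver70 = runif(n, 0.05, 0.25)
)
covid$Deaths <- rpois(n, lambda = covid$Pop * exp(-8 + 6 * covid$PropOver70))
covid$Deaths[151:200] <- -1          # mimic the -1 missing-value code

# Step 1: recode -1 as NA, then separate modelling and prediction records
covid$Deaths[covid$Deaths == -1] <- NA
train <- covid[!is.na(covid$Deaths), ]
test  <- covid[is.na(covid$Deaths), ]

# Step 3 (one candidate model only): Poisson GLM with a log-population offset
fit <- glm(Deaths ~ PropOver70 + offset(log(Pop)),
           family = poisson, data = train)

# Step 4: predicted counts and an approximate prediction-error SD.
# Error variance = variance of a new Poisson count + variance of the fitted mean.
pred <- predict(fit, newdata = test, type = "response", se.fit = TRUE)
pred_sd <- sqrt(pred$fit + pred$se.fit^2)
```

In a real script the simulation would be replaced by reading UKCovidWave1.csv from the current working directory, and the choice of covariates and model family would need to be justified by the exploratory analysis and diagnostics.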
Submission for this assessment is electronic, via the STAT0023 Moodle page. You are required to
submit three files, as follows:
• A report on your analysis, not exceeding 2 500 words of text plus two pages of graphs and
/ or tables. The word count includes titles, footnotes, appendices, references etc. — in fact it
includes everything except the two pages of graphs / tables and, if present, the separate
page describing the contribution of each pair member (see below). Your report should be
in three sections, as follows:
Section I: Describe briefly what aspects of the problem context you considered at the
outset, how you used these to start your exploratory analysis, and what were the important
points to emerge from this exploratory analysis.
Section II: Describe briefly (without too many technical details) what models you
considered in step (3) above, and why you chose the model that you did.
Section III: State your final model clearly, summarise what your model tells you about the
factors associated with variation of death counts in each MSOA, and discuss any potential
limitations of the model.
Your report should not include any computer code. It should include some graphs and / or
tables, but only those that support your main points. Graphs and tables must appear on
separate pages, or they will be included in the word count.
In addition to your data analysis, if you are working as a pair then you must include an
additional page at the end of your report where each pair member briefly describes
their contribution to the project. You will need to agree this in your pairs before
submitting the report. If both pair members agree that they contributed equally then it is
sufficient to write a single sentence to that effect, or alternatively you are very welcome to
describe your own personal contribution to the project. Note that this page will not be
marked and does not contribute to the word count; nor will different marks be allocated to
different pair members based on this. The purpose is to encourage you all to be mindful
about contributing to this piece of group-work.
Your report should be submitted as a PDF file named as ########_rpt.pdf, where
######## is your group ID, with any spaces replaced by underscores (IMPORTANT!!!).
For example, if your group ID is ‘ICA2Group C’, your report should be named
ICA2Group_C_rpt.pdf.
• An R script or SAS program corresponding to your analysis and predictions. Your
script/program should run without user intervention on any computer with R or SAS
installed, providing the file UKCovidWave1.csv is present in the current working directory /
current folder. When run, it should produce any results that are mentioned in your report,
together with the predictions and the associated standard deviations. The script /
program should be named ########.r or ########.sas as appropriate, where
######## is your group ID with underscores instead of spaces. For example, if your
group ID is ‘ICA2Group C’ and you use R, your script should be named ICA2Group_C.r.
You may not create any additional input files that can be referenced by your script; nor
should you write any code that requires access to the internet in order to run it. If you use R
however, you may use the following additional libraries if you wish (together with other
libraries that are loaded automatically by these): mgcv, ggplot2, grDevices,
RColorBrewer, lattice and MASS. You may not use any other add-on libraries: for present
purposes, an “add-on library” is one that requires a library() or require() command or
equivalent (e.g. the package::command syntax) before it can be used, if your R system is
installed using default settings.
• A text file containing your predictions for the 1 800 observations with missing counts. This
file should be named ########_pred.dat, where ######## is your group ID with
underscores instead of spaces. The file should contain three columns, separated by
spaces and with no header. The first column should be the record identifier (corresponding
to variable ID in file UKCovidWave1.csv); the second should be the corresponding count
prediction, and the third should be the standard deviation of your prediction error.
• NOTE: if you work in pairs, both members of a pair must confirm their submission on
Moodle before the submission deadline.
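The file-naming rule and the layout of the predictions file can be sketched as below. This is an illustration under stated assumptions: the group ID is the example used above, the three rows of predictions are made up, and the file is written to a temporary directory here purely so that the example leaves no files behind (a submitted script would write to the current working directory).

```r
# Hypothetical group ID; spaces must become underscores in all file names
group_id  <- "ICA2Group C"
file_stem <- gsub(" ", "_", group_id)
pred_file <- file.path(tempdir(), paste0(file_stem, "_pred.dat"))

# Made-up values for illustration: record ID, predicted count, prediction-error SD
preds <- data.frame(ID   = c(1, 2, 3),
                    pred = c(3.2, 7.9, 1.4),
                    sd   = c(1.9, 2.9, 1.3))

# Three space-separated columns, no header, no row names
write.table(preds, file = pred_file, sep = " ",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back to confirm its layout
check <- read.table(pred_file, header = FALSE)
```

Reading the file back in this way is a cheap check that the column order and the absence of a header match the required format before submission.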
Marking criteria
There are 75 marks for this exercise. These are broken down as follows:
• Report: 40 marks. The marks here are for: displaying awareness of the context for the
problem and using this to inform the statistical analysis; good judgement in the choice of
exploratory analysis and in the model-building process; a clear and well-justified argument;
clear conclusions that are supported by the analysis; and appropriate choice and
presentation of graphs and / or tables. The mark breakdown is as follows:
– Awareness of context: 5 marks.
– Exploratory analysis: 10 marks. These marks are for (a) tackling the problem in a
sensible way that is justified by the context (b) carrying out analyses that are
designed to inform the subsequent modelling.
– Model-building: 10 marks. The marks are for (a) starting in a sensible place that is
justified from the exploratory analysis (b) appropriate use of model output and
diagnostics to identify potential areas for improvement (c) awareness of different
modelling options and their advantages and disadvantages (d) consideration of the
social, economic and demographic context during the model-building process.
– Quality of argument: 5 marks. The marks are for assembling a coherent ‘narrative’,
for example by drawing together the results of the exploratory analysis so as to
provide a clear starting point for model development, presenting the model-building
exercise in a structured and systematic way and, at each stage, linking the
development to what has gone before.
– Clarity and validity of conclusions: 5 marks. These marks are for stating clearly what
you have learned about how and why the numbers of deaths vary between MSOAs,
and for ensuring that this is supported by your analysis and modelling.
– Graphs and / or tables: 5 marks. Graphs and / or tables need to be relevant, clear
and well presented (for example, with appropriate choices of symbols, line types,
captions, axis labels and so forth). There is a one-slide guide to ‘Using graphics
effectively’ in the slides / handouts for the Week 1 videos for the course. Note that
you will only receive credit for the graphs in your report if your submitted script /
program generates and automatically saves all of these graphs when it is run.
Note that you will be penalised if your report exceeds EITHER the specified 2 500-word
limit or the number of pages of graphs and / or tables. Following UCL guidelines, the
maximum penalty is 7 marks, and no penalty will be imposed that takes the final mark
below 30/75 if it was originally higher. Subject to these conditions, penalties are as follows:
– More than two pages of graphs and / or tables: zero marks for graphs and / or
tables, in the marking scheme given above.
– Exceeding the word count by 10% or less: mark reduced by 4.
– Exceeding the word count by more than 10%: mark reduced by 7.
In the event of disagreement between reported word counts on different software systems,
the count used will be that from the examiner’s system. The examiners will use an R
function called PDFcount to obtain the word count in your PDF report: this function is
available from the Moodle page in file PDFcount.r.
• Coding: 15 marks. There are 3 marks here for reading the data, preprocessing and setting
up variable names correctly and efficiently; 7 marks for effective use of your chosen
software in the exploratory analysis and modelling (e.g. programming efficiently and
correctly); and 5 marks for clarity of your code — commenting, layout, choice of variable /
object names and so forth.
• Prediction quality: 20 marks. The remaining 20 marks are for the quality of your
predictions. Note, however, that you will only receive credit for your predictions if your
submitted ########_pred.dat file is identical to that produced by your script / program
when it is run: if this is not the case, your predictions will earn zero marks.
For these marks, you are competing against each other. Your predictions will be assessed
using the following score: