Applications of Data Science and Statistical Modelling Assignment
This assignment consists of four sections. Section A requires you to work in Python, while Sections B and C have R-based exercises. Section D is a narrated PowerPoint presentation on Section C, Question 4.
This is an individual coursework and you must not collaborate with others when working on this assessment. Please read the College guidelines on collaboration and plagiarism: http://intranet.exeter.ac.uk/emps/studentinfo/subjects/mathematics/assessment/academicmisconduct/.
A. Python [50 marks]
This part of the assignment requires you to show Python code, report numerical results, and display figures. A useful environment for working on these questions and submitting your answers is a Jupyter notebook, which was covered in the lectures. If you find that you cannot use Jupyter notebooks to submit this part of the coursework, please email Stefan Siegert at s.siegert@exeter.ac.uk within 1 week of the release of this coursework to work out a suitable alternative.
Always show all the source code you wrote to answer each question. Where numerical answers are required, answer the question as a complete sentence after displaying the code and any output. Do not expect the marker to extract the answer from your python code output. Indicate very clearly which question you answer in each part of the notebook, for example by using markdown headings.
You may not use any programming language other than Python to answer the questions in this part.
1. Data import and cleaning (8 marks)
- i) Read the data in the provided csv file into a pandas data frame called wx. How many rows does wx have, and what are the column names?
- ii) Permanently remove the columns STATION and NAME from wx.
- iii) Transform the DATE column to type datetime. In the function call, specify the date format explicitly.
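As a generic illustration of the explicit-format idea in iii) (a sketch only, not the coursework answer: the toy data frame and the format string "%Y-%m-%d" are assumptions, so check the raw file and adapt the format to match):

```python
# Sketch only: parsing a date column with an explicit format string.
# The column name DATE is taken from the question; the sample values and
# the format "%Y-%m-%d" are assumptions to make the sketch self-contained.
import pandas as pd

wx = pd.DataFrame({"DATE": ["2015-01-01", "2015-01-02"]})  # toy stand-in for the csv
wx["DATE"] = pd.to_datetime(wx["DATE"], format="%Y-%m-%d")
print(wx["DATE"].dtype)  # -> datetime64[ns]
```

Passing format explicitly avoids pandas silently guessing day-first vs month-first ordering.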
2. Exploratory Data Analysis (14 marks)
- i) What is the highest value of wind speed (AWND) observed in the data set?
- ii) On what date was the lowest minimum temperature (TMIN) observed?
- On how many days was the wind speed (AWND) smaller than 5 mph, and what was the average maximum temperature (TMAX) over those days?
- Create and display a new pandas data frame named prcp, with two columns called year and rain. The rain column should contain the total (summed) precipitation amount (PRCP) per year. Then visualise the total precipitation amount per year with a bar chart.
3. Regression modelling (9 marks)
- Create a scatter plot of maximum temperature (TMAX) versus minimum temperature (TMIN). Using the appropriate function from Scikit-Learn, fit a linear regression model to maximum temperature, using minimum temperature as the covariate. Add the fitted regression line to the scatter plot.
- The slope coefficient of the regression line is larger than one. What does this tell us about the behaviour of daily temperature ranges in cold vs. warm periods?
4. Model selection (9 marks)
The adjusted R^2, denoted R_a^2, is a widely used measure of goodness-of-fit to decide which of several regression models is "best". A higher value of R_a^2 indicates a "better" model fit. To calculate R_a^2, define the following quantities: y_i is the i-th target value and ŷ_i is the i-th fitted value of the regression model. The index i runs from 1 to n, where n is the sample size. Then define the residual sum of squares SS_res as

    SS_res = Σ_{i=1}^{n} (y_i − ŷ_i)²

and the total sum of squares SS_tot as

    SS_tot = Σ_{i=1}^{n} (y_i − ȳ)²

where ȳ = (1/n) Σ_{i=1}^{n} y_i is the sample mean of the target. The adjusted R^2 is then calculated as

    R_a^2 = 1 − [(n − 1) SS_res] / [(n − p − 1) SS_tot]

where p is the number of covariates in the regression model (not including the intercept).
Fit two regression models to maximum temperature (TMAX): one model that only uses minimum temperature (TMIN) as the covariate, and one that uses minimum temperature (TMIN) and wind speed (AWND) as two covariates. Calculate R_a^2 for both models by implementing the equations shown above and report your results. You may use functions from sklearn.linear_model to calculate the fitted ŷ values, but you may otherwise only use basic Python, pandas, and numpy functions to calculate R_a^2. Finally, based on your R_a^2 calculations, state which of the two fitted regression models you consider better.
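To make the formulas concrete, here is a minimal sketch of the calculation, assuming y holds the target values and yhat the fitted values as numpy arrays (the toy data below are invented, not the weather data):

```python
# Sketch only: adjusted R^2 implemented directly from the equations above.
import numpy as np

def adjusted_r2(y, yhat, p):
    """Adjusted R^2 for a regression with p covariates (intercept not counted)."""
    n = len(y)
    ss_res = np.sum((y - yhat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - ((n - 1) * ss_res) / ((n - p - 1) * ss_tot)

# Toy check: a perfect fit gives an adjusted R^2 of exactly 1
y = np.array([1.0, 2.0, 3.0, 4.0])
print(adjusted_r2(y, y, p=1))  # -> 1.0
```

Note the (n − 1)/(n − p − 1) factor penalises extra covariates, which is what makes R_a^2 usable for comparing models of different sizes.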
5. Graphical summary (10 marks)
Climographs are widely used tools to summarise and visualise the climatic conditions at a location based on summary statistics of rain and temperature. Write Python code to reproduce as closely as possible the climograph for Seattle for the year 2015 shown in Figure 1 below. The precipitation amounts shown are sum totals, and the temperature values shown are averages. Use the provided weather data, and otherwise only functions from the pandas, numpy, and matplotlib packages. (Adding the second y-axis requires the matplotlib command twinx().)
Figure 1: Seattle 2015 Climograph
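As a hint on the twinx() mechanics only (the monthly values below are invented placeholders, not the Seattle data, and the axis labels are assumptions):

```python
# Sketch only: a bar chart with a second y-axis via twinx(); placeholder data.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 13)
rain = np.random.default_rng(0).uniform(20, 150, size=12)  # invented totals
temp = 10 + 8 * np.sin((months - 4) * np.pi / 6)           # invented averages

fig, ax1 = plt.subplots()
ax1.bar(months, rain, color="steelblue")
ax1.set_xlabel("Month")
ax1.set_ylabel("Precipitation")
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(months, temp, color="firebrick")
ax2.set_ylabel("Temperature")
fig.savefig("climograph_sketch.png")
```

The two axes objects are styled independently, so units and tick ranges for rain and temperature do not interfere with each other.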
THE REST OF THE QUESTIONS SHOULD BE SOLVED USING R
B. Health Impact Analysis [30 marks]
Lip cancer is a type of oral cancer. It develops from abnormal cells that grow out of control and form lesions or tumours on the lips, in the thin, flat cells (called squamous cells) that line the lips, mouth, tongue, cheeks, sinuses and throat. Certain lifestyle choices can increase your risk of developing lip cancer, including smoking cigarettes, heavy alcohol use, excessive sun exposure and tanning.
The aim of this assignment is to investigate the incidence of lip cancer in 56 local authorities in Scotland between 1975 and 1986. You are given the number of observed cases as well as the corresponding expected number of cases, calculated using indirect standardisation by applying the age-sex specific rates for the whole of Scotland to the age-sex population profile of each local authority. In addition, you are given the percentage of the workforce employed in agriculture, fishing and forestry within each local authority. The data can be found in Scotland_data.csv and the dataset has the following variables:
- Name – Name of the local authority,
- Observed – Observed number of lip cancer cases in Scotland between 1975 and 1986,
- Expected – Expected number of lip cancer cases in Scotland between 1975 and 1986,
- PCAFF – Percentage of the workforce employed in agriculture, fishing and forestry within the local authority.

You also have shapefiles for Scotland, split by local authorities, provided so that you can produce maps for this analysis (Scotland.shp, Scotland.dbf, Scotland.shx and Scotland.prj).
1. [2 marks] Provide a summary (or a plot) of the number of cases of lip cancer in Scotland.
2. [5 marks] Estimate raw standardised morbidity ratios (SMRs) for lip cancer risk in Scotland between 1975 and 1986. You should plot the results on a map and comment on any spatial patterns and/or extreme values, and consider any potential reasons for them. If necessary, use plots to support your answers.
3. [8 marks] Estimate smoothed SMRs for lip cancer risk in Scotland between 1975 and 1986. You should plot the results on a map. Again, comment on any spatial patterns and/or extreme values and consider any potential reasons for them. Furthermore, comment on differences between the raw and the smoothed estimates. If there are big changes, are they appropriate? If necessary, use plots to support your answers.
4. [5 marks] For each of the 56 local authorities in Scotland, we have the percentage of the workforce employed in agriculture, fishing and forestry. Adapt your code to estimate the relative risk of working in agriculture, fishing and forestry on developing lip cancer in Scotland between 1975 and 1986. Is there an increased risk? Is the effect significant?
5. [10 marks] A minister from the Scottish Government wants to increase resources to hospitals in Scotland to cope with lip cancer. Write a report recommending key areas/local authorities/populations in Scotland where these resources should be allocated. In your report, you should explain the key steps of your analysis, commenting on what conclusions you can draw and any associated limitations. You may use the analyses from Q1-4 to support the report. You do not need to include every step of these analyses; include only results, figures and tables that support your findings. The report should have no more than one page of text (figures, tables and the code appendix are not included in this limit) and should have the following sections:
- Introduction – Explain the problem and what questions you aim to answer
- Initial Data Analysis – Introduce the data you have used, providing graphical and numerical summaries.
- Methods – Describe the methods used in your main analysis, highlighting why your choices are appropriate.
- Results – Present the results of your analyses together with a clear narrative that answers the questions posed in the introduction.
- Summary – Summarise the key findings of your analysis.
- Appendix – R Code.
Ensure any plots and/or tables included are clearly labelled and accompanied by appropriate captions. Do not include R code in your answer; instead, please include it in the appendix.
C. Clustering [80 marks]
Aim
The aim of this part of the assignment is to perform clustering on power data recorded at substations, in order to see whether there are groups with similar demand profiles and whether there are differences between years.
Data
There are two types of data: (i) variable – the measurements from the monitors; and (ii) fixed – characteristics of the substations, which include information that may be useful when trying to understand, and name, your clusters.
There are 5 datasets containing the variable data, each relating to a different season and provided as an .RData file (e.g. Autumn_2012.RData). Yes, five seasons. Haha. This is Britain!
You can load each dataset using the load function:

    # Loading dataset for Autumn 2012
    load("Autumn_2012.RData")
Within each dataset, each row contains a station ID, a date (in Julian format), 144 numbers that are the scaled power for each ten-minute period, and another 144 numbers that are the actual (non-scaled) power values for that substation.
The scaled measurements are the actual measurements divided by the daily maximum (you could try to calculate these yourself if you wanted). The idea is that if you perform clustering on the raw data, clusters may be chosen based only on magnitude rather than on the patterns within days (see later).
A word about the dates: Julian dates are the number of days since an origin, in this case the 1st of January 1970. You can convert these to dates that you may recognise using the dates function in the chron package:
    # Loading package
    require(chron)

    # convert the Julian date in the first row to a date
    dates(15586, origin = c(month = 1, day = 1, year = 1970))
    # [1] 09/03/12
    dates(Autumn_2012[1, 2], origin = c(month = 1, day = 1, year = 1970))
    # [1] 09/03/12

    # and the whole lot!
    converted_dates <- dates(Autumn_2012[, 2], origin = c(month = 1, day = 1, year = 1970))
    converted_dates[1:10]
    # [1] 09/03/12 09/04/12 09/05/12 09/06/12 09/07/12 09/08/12 09/09/12 09/10/12
    # [9] 09/11/12 09/12/12
    # noting that the format of the dates is month, day, year
The fixed data is in the Characteristics.csv file.

    Characteristics <- read.csv("Characteristics.csv", stringsAsFactors = FALSE)
    head(Characteristics)
    #   SUBSTATION_NUMBER         TRANSFORMER_TYPE TOTAL_CUSTOMERS Transformer_RATING
    # 1            511016 Grd Mtd Dist. Substation             206                750
    # 2            511017 Grd Mtd Dist. Substation               0                500
    # 3            511028 Grd Mtd Dist. Substation             280                500
    # 4            511029 Grd Mtd Dist. Substation             268                500
    # 5            511030 Grd Mtd Dist. Substation             299                500
    # 6            511032 Grd Mtd Dist. Substation             108                800
    #   Percentage_IC LV_FEEDER_COUNT GRID_REFERENCE
    # 1    0.70308406               5 ST187800775900
    # 2    0.09264679               0 ST181000782000
    # 3    0.24804607               5 ST188400769800
    # 4    0.16029786               3 ST188200771500
    # 5    0.28333084               5 ST187300772600
    # 6    0.89802973               3 ST191800779200
This contains the following information:
- SUBSTATION_NUMBER – so you can link with the measured data
- TRANSFORMER_TYPE – ground or pole mounted (indicating urban or rural)
- TOTAL_CUSTOMERS – the number of customers receiving their electricity from this substation
- Transformer_RATING – indicating the size of the total power being delivered by the substation
- Percentage_IC – the percentage of industrial and commercial (not domestic) customers
- LV_FEEDER_COUNT – the number of feeders coming from the substation
- GRID_REFERENCE – the Ordnance Survey grid reference for the location (see https://getoutside.ordnancesurvey.co.uk/guides/beginners-guide-to-grid-references/)

Dataset NewSubstations.csv contains raw measurements for five new substations (for each ten-minute period).
- [10 marks] Initial data analysis tasks
- Summarise the data in the Characteristics.csv dataset, and plot the distributions of the percentage of industrial and commercial customers, the transformer ratings, and pole- versus ground-mounted substations.
- Using this and other analyses you think appropriate, describe the relationships between the different substation characteristics (transformer type, number of customers, rating, percentage of I&C customers and number of feeders).
- [20 marks] Initial clustering tasks

Using the scaled daily measurements from the January_2013 dataset, perform hierarchical clustering on the daily average demand (that is, each 10-minute period for a given substation should be averaged over all days where measurements are available):

- Using your preferred choice of dissimilarity function, create a distance matrix for these data and produce a dendrogram.
- Choose an appropriate number of clusters and label each substation according to its cluster membership.
- For each of your clusters, plot the daily average demand for 1) all days, 2) weekdays, 3) Saturdays and 4) Sundays.
- Produce summaries of the variables in Characteristics.csv for each of your clusters.
- Describe your clusters based on the information in Characteristics.csv and choose names for them. Describe the patterns of power demand for each cluster.
- [20 marks] Allocating new substations

The dataset NewSubstations.csv contains information for five new substations.

- For each new substation, plot the daily average demand for 1) all days, 2) weekdays, 3) Saturdays and 4) Sundays on the same plot (one plot per new substation).
- Using k-means (or a variant, e.g. based on medians), allocate these new substations to one of your clusters.
- Based on your summaries and plots, is the cluster allocation as you expected?
- [30 marks] Exploring differences between years

The power company wants to know whether there are any differences in power demand between seasons. They are particularly interested in whether the groupings/clusters of substations change between seasons. Perform suitable analyses of the power demands by season and explore whether cluster membership changes between seasons. You should write a report to the power distribution company detailing your analyses and results and presenting a conclusion. Your report should include plots/tables where appropriate and should be a maximum of 2 pages. Plots and tables are not included in this limit.
D. Presentation [40 marks]
The presentation is based on Part C, Q4 only. You should submit a narrated PowerPoint presentation that is 7 minutes long, and you should aim for 7 slides in total.
In this you should explain what the problem is, how you approached it, and what your findings are.
You should pay attention to the clarity, pace and coherency of the delivery, the style and information balance of the slides, a clear description of the methodology, and time management.
The deadline for submission is Noon (12pm), 25th June.
Note that exercises in Part A should be solved using Python, while exercises in Parts B and C should be solved using R. Part B Q4 and Part C Q4 are short reports; when answering these, do not include R code in your answer, instead include it as an appendix. It is up to you whether you submit a separate pdf document for each part or one file that contains the answers to all the questions.
Note that late submissions will be penalised.