Programming for Data Analysis and Visualization
Submit a single R file containing the solutions to the following questions, named: Firstname.Surname.R
The dataset “Power Plant” records variables which the company’s engineers believe are
important factors in the operation of the plant. The company is interested in maximising net
hourly electrical energy output (recorded as PE in the dataset). For each hour of energy
output recorded, the variable “Temperature” (AT), in the range 1.81°C to 37.11°C, is also recorded.
1. Run a linear regression model for PE on AT. Record the value of the slope
and take it as the actual population parameter.
2. For 1000 iterations:
a. Take a random sample of 50 observations from the dataset. Run the regression model and,
using the expression for the CI for the slope that we found in the lecture, find a 95%
CI for the slope.
b. Find what percentage of the CIs generated in step 2 contain the
slope that you got in step 1.
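The simulation described in steps 1 and 2 can be sketched as follows. This is only an illustration of the logic, written in Python for readability; the submission itself must be an R file. It assumes that "the expression for the CI" is the usual t-interval, slope ± t × SE(slope), with n − 2 = 48 degrees of freedom, and that each iteration draws 50 rows without replacement. The data are synthetic with made-up coefficients, and the names fit_slope and ci_coverage are hypothetical.

```python
import math
import random

# Two-sided 95% t critical value for 50 - 2 = 48 degrees of freedom (table value).
T_CRIT_48 = 2.0106


def fit_slope(xs, ys):
    """Return the OLS slope and its standard error for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_err


def ci_coverage(xs, ys, n_iter=1000, sample_size=50, seed=1):
    """Percentage of 95% CIs (one per sample) that contain the full-data slope."""
    rng = random.Random(seed)
    true_slope, _ = fit_slope(xs, ys)  # step 1: treated as the population value
    hits = 0
    for _ in range(n_iter):  # step 2: repeated sampling
        idx = rng.sample(range(len(xs)), sample_size)
        slope, std_err = fit_slope([xs[i] for i in idx], [ys[i] for i in idx])
        lower = slope - T_CRIT_48 * std_err
        upper = slope + T_CRIT_48 * std_err
        if lower <= true_slope <= upper:
            hits += 1
    return 100.0 * hits / n_iter


# Synthetic stand-in for the Power Plant data (made-up coefficients, NOT the real dataset).
data_rng = random.Random(0)
at = [data_rng.uniform(1.81, 37.11) for _ in range(5000)]
pe = [500.0 - 2.0 * t + data_rng.gauss(0.0, 5.0) for t in at]

coverage_pct = ci_coverage(at, pe)
print(f"{coverage_pct:.1f}% of the intervals contain the full-data slope")
```

If the interval is built correctly, the reported percentage should sit close to the nominal 95%.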
If and are independent random samples from the Uniform distribution U(0,1), by
generating random samples find | −
are independent random samples from the Beta distribution (1, 1 + ), by
generating random samples for 3 different values for find
∑ ln (1 −
∑ ln (1 −
and show that the result is independent from .
b. Using the distribution in Q2, show that the result is even independent from the
Assessment Requirements / Tasks (include all guidance notes)
This assignment will use employment data of Wales from the StatsWales data
source. This dataset provides workplace employment estimates, or estimates of
total jobs, for Wales and its NUTS2 areas, along with comparable UK data
disaggregated by industry section.
For this assignment students will undertake a data analysis and machine learning
approach to reveal the workplace employment landscape of Wales.
1. Data processing
1.1. Download the dataset for the period 2009 – 2018 and create a dataframe that
concatenates Wales (total) employment value only.
1.2. Check for null values and outliers. If any are found, replace them with the mean value.
1.3. Change the names of the industries as below.
The final dataframe should look like the following:
2012 2013 2014 2015 2016 2017 2018
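Step 1.2 leaves the outlier rule and the choice of mean open. A minimal sketch, assuming the 1.5 × IQR rule for outliers and assuming the replacement is the mean of that industry's remaining years; both are assumptions rather than stated requirements. The function name replace_with_mean and the sample figures are hypothetical.

```python
import numpy as np


def replace_with_mean(values, k=1.5):
    """Replace NaNs and IQR outliers with the mean of the remaining values.

    Returns the cleaned array and a boolean mask of the replaced positions.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(arr, [25, 75])  # quartiles, ignoring NaNs
    spread = k * (q3 - q1)
    with np.errstate(invalid="ignore"):  # comparisons against NaN are simply False
        bad = np.isnan(arr) | (arr < q1 - spread) | (arr > q3 + spread)
    cleaned = arr.copy()
    cleaned[bad] = arr[~bad].mean()
    return cleaned, bad


# One hypothetical industry row, 2009 to 2018, with a missing value and a typo-like spike.
row = [62.1, 61.8, float("nan"), 63.0, 62.5, 640.0, 63.4, 62.9, 63.8, 64.1]
cleaned, flagged = replace_with_mean(row)
print(cleaned.round(2))
print(flagged)
```

The same function could be applied row by row to a dataframe of industries by years; which axis the mean is taken over should be stated in the write-up.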
2. Data analysis
For each question, provide a graph/chart along with your own interpretation (~50 words).
2.1. Which industries employed the highest and lowest numbers of workers over the period?
2.2. Which industries had the highest and lowest overall growth over the period?
2.3. Which years were the best and worst performing in terms of total
employment (highest and lowest employment)?
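The three questions above reduce to a few aggregations. A minimal sketch on made-up figures, assuming "highest and lowest workers" means the mean over the period and "overall growth" means the percentage change from the first to the last year; the industry names and values are hypothetical placeholders for the cleaned dataframe.

```python
years = list(range(2009, 2019))

# Hypothetical employment figures (thousands), one list per industry, 2009 to 2018.
employment = {
    "Industry A": [100, 98, 99, 101, 103, 104, 106, 107, 109, 110],
    "Industry B": [50, 49, 48, 48, 47, 47, 46, 46, 45, 45],
    "Industry C": [20, 20, 21, 22, 23, 24, 25, 26, 28, 30],
}


def overall_growth(series):
    """Percentage change from the first to the last year."""
    return 100.0 * (series[-1] - series[0]) / series[0]


# 2.1: average workforce per industry over the period.
mean_workers = {name: sum(v) / len(v) for name, v in employment.items()}
largest = max(mean_workers, key=mean_workers.get)
smallest = min(mean_workers, key=mean_workers.get)

# 2.2: overall growth per industry.
growth = {name: overall_growth(v) for name, v in employment.items()}
fastest = max(growth, key=growth.get)
slowest = min(growth, key=growth.get)

# 2.3: total employment per year across all industries.
yearly_total = {
    year: sum(v[i] for v in employment.values()) for i, year in enumerate(years)
}
best_year = max(yearly_total, key=yearly_total.get)
worst_year = min(yearly_total, key=yearly_total.get)

print(largest, smallest, fastest, slowest, best_year, worst_year)
```

Each of these dictionaries maps directly onto a bar or line chart for the required graphs.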
3. Visual analysis
Create a dynamic scatter/bubble plot showing the change in workforce numbers over
the period using Plotly Express.
4.1. Undertake a PCA (PC=2; the columns should be PC1, PC2, Industry) and
produce a scatter plot. Write your interpretation of the plot, relating it
to the analysis in sections 2 and 3 (for example, which industries are correlated
over the years as well as in the PCA).
4.2. Compute a year-wise correlation for each industry. Are the aforementioned
industries also correlated over the years? Explain your answer.
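Tasks 4.1 and 4.2 can be sketched together. The example below assumes industries are rows and years are columns, standardises each year column before the PCA (an assumption; the brief does not say whether to scale), and computes the PCA by a plain NumPy SVD rather than a library PCA class. The figures and the name pca_2d are hypothetical.

```python
import numpy as np

industries = ["Industry A", "Industry B", "Industry C", "Industry D"]

# Hypothetical employment figures (thousands); rows = industries, columns = 2009 to 2018.
data = np.array([
    [100, 98, 99, 101, 103, 104, 106, 107, 109, 110],
    [50, 49, 48, 48, 47, 47, 46, 46, 45, 45],
    [20, 20, 21, 22, 23, 24, 25, 26, 28, 30],
    [80, 79, 80, 82, 83, 85, 86, 88, 89, 91],
], dtype=float)


def pca_2d(matrix):
    """Project the rows onto the first two principal components via SVD."""
    x = (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)  # standardise columns
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :2] * s[:2]                  # PC1 and PC2 coordinates per row
    explained = (s ** 2) / np.sum(s ** 2)      # share of variance per component
    return scores, explained[:2]


scores, explained = pca_2d(data)
for name, (pc1, pc2) in zip(industries, scores):
    print(f"{name}: PC1={pc1:.3f}, PC2={pc2:.3f}")

# 4.2: correlation between industries across the years (rows are the variables).
corr = np.corrcoef(data)
print(corr.round(2))
```

Industries whose yearly series move together have correlations near +1 in corr; comparing those pairs with their positions in the PC1/PC2 scatter plot is one way to approach the interpretation.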
5. Clustering (k-means and hierarchical)
5.1. Using the employment data from the best and worst performing year columns (2.3),
undertake a k-means clustering analysis (K=2 and 3) and identify which industries
cluster together. Write your own interpretation (~100 words).
5.2. Using the same dataset (best and worst performing years), create a hierarchical
clustering. Compare the clusters with the k-means clusters.
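The k-means part of 5.1 can be sketched with a hand-written Lloyd's algorithm. This stands in for a library routine purely to show the mechanics; it uses a single random initialisation, so for K=3 it may settle in a local optimum. The two-column input (worst year, best year) uses made-up figures, and the function name kmeans is hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, then nearest-centroid labels.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


industries = ["A1", "A2", "B1", "B2", "C1", "C2"]
# Hypothetical employment (thousands) in the worst and best years for six industries.
worst_best = [[98, 110], [100, 112], [49, 45], [47, 44], [20, 30], [22, 31]]

labels2, centroids2 = kmeans(worst_best, k=2)
labels3, centroids3 = kmeans(worst_best, k=3)
for name, l2, l3 in zip(industries, labels2, labels3):
    print(f"{name}: K=2 cluster {l2}, K=3 cluster {l3}")
```

The cluster numbers themselves are arbitrary; what matters for the interpretation is which industries share a label, and whether the same groupings reappear in the hierarchical clustering of 5.2.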
Provide a brief discussion (~300 words) on the employment landscape of Wales based
on the employment data analysis results.
1.1 Data preparation 05
1.2 Data preparation 05
1.3 Data preparation 05
2.1 Data analysis 05
2.2 Data analysis 05
2.3 Data analysis 05
3 Visual analysis 20
4.1 PCA 10
4.2 Correlation 10
5.1 Clustering 10
5.2 Clustering 10
6 Discussion 10
Please see Moodle for confirmation of the Assessment submission date.
The presentation will be at 4:00 PM on the submission date.
Any assessments submitted after the deadline will not be marked and will be
recorded as a Non-Attempt.
The assessment must be submitted as a zip file / pdf / word document through the
Turnitin submission point in Moodle.
Your assessment should be titled with your Student ID Number, module code and
assessment id, e.g. st12345678 CIS4000 WRIT1
Feedback for the assessment will be provided electronically via Moodle, and will
normally be available 4 working weeks after initial submission. The feedback return
date will be confirmed on Moodle.
Feedback will be provided in the form of a rubric, supported with comments on
your strengths and the areas in which you can improve.
All marks are preliminary and are subject to quality assurance processes and
confirmation at the Examination Board.
Further information on the Academic and Feedback Policy is available in the
Academic Handbook (Vol 1, Section 4.0).
70 – 100%
(1st)
Addressed all sections and provided correct answers with elegant
presentation of results. Applied correct data analysis approaches
and provided excellent interpretation on each section.
Addressed all sections and provided correct answers with good
presentation of results. Applied mostly correct data analysis
approaches and provided very good interpretation on each section.
Addressed most of the sections and provided mostly correct answers
with average presentation of results. Applied some correct data
analysis approaches and provided an average interpretation on each
(3rd)
Addressed few sections with few correct answers with/out any
presentation of results. Applied mostly incorrect data analysis
approaches and provided