P.I. Chountas, N. Yerashenia 7BUIS008W 2021/22
University of Westminster
School of Computer Science and Engineering
7BUIS008W Data Mining & Machine Learning
Module leader | Panagiotis Chountas |
Unit | Coursework 1 / Prepared by Natalia Yerashenia |
Weighting: | 50% |
Qualifying mark | 40% |
Description | Students are expected to critically justify the use of effective and novel data mining and machine learning techniques for a specific problem domain, and to critically reflect on how different data mining and machine learning algorithms operate in terms of their underlying design assumptions and biases for a given problem domain. Students are also expected to methodically analyse the output of data mining and machine learning algorithms, drawing technically appropriate and sound conclusions from their application to the given problem. |
Learning Outcomes Covered in this Assignment: |
This assignment contributes towards the following Learning Outcomes (LOs):
• LO1 critically justify the use of effective and novel data mining and machine learning techniques for Data Science applications;
• LO3 critically reflect on the knowledge of how different data mining and machine learning algorithms operate and their underlying design assumptions and biases in order to select and apply appropriate algorithms to solve a given problem;
• LO5 critically analyse the output of data mining and machine learning algorithms by drawing technically appropriate and justifiable conclusions resulting from the application of data mining and machine learning algorithms to real-world data sets |
Handed Out: | 17th February 2022 |
Due Date | 23rd March 2022 Submission by 13:00 hours |
Expected deliverables | Submit on Blackboard a zip file containing the required documentation (either in docx or pdf format). All implemented codes should be included in your documentation together with the results/analysis. |
Method of Submission: | Electronic submission on BB via a provided link close to the submission time. |
Type of Feedback and Due Date: |
Feedback will be provided on BB on 14th April 2022 |
MSc CRITERIA MET IN THIS ASSIGNMENT |
• a systematic and methodical approach to Data Analytics/Data Mining issues;
• development of problem-solving skills and knowledge of various techniques/tools/methods;
• ability to model and deploy appropriate software tools that satisfy specified requirements, and test their use in a target domain;
• independent in-depth analysis of a chosen topic making use of information resources outside a teaching environment;
• studying the context within which the design of systems for Data Science and Analytics takes place;
• identifying the security and legal implications of Business Intelligence, Data Science and Analytics applications |
Assessment regulations
Refer to section 4 of the “How you study” guide for undergraduate students for a clarification of how you are
assessed, penalties and late submissions, what constitutes plagiarism etc.
Penalty for Late Submission
If you submit your coursework late but within 24 hours or one working day of the specified deadline, 10 marks will
be deducted from the final mark, as a penalty for late submission, except for work that obtains a mark in the range
of 50 – 59%, in which case the mark will be capped at the pass mark (50%). If you submit your coursework more
than 24 hours or more than one working day after the specified deadline you will be given a mark of zero for the
work in question unless a claim of Mitigating Circumstances has been submitted and accepted as valid.
It is recognised that on occasion, illness or a personal crisis can mean that you fail to submit a piece of work on
time. In such cases you must inform the Campus Office in writing on a mitigating circumstances form, giving the
reason for your late or non-submission. You must provide relevant documentary evidence with the form. This
information will be reported to the relevant Assessment Board that will decide whether the mark of zero shall stand.
For more detailed information regarding University Assessment Regulations, please refer to the following
website: http://www.westminster.ac.uk/study/current-students/resources/academic-regulations
Coursework Description
I. Data Description – COVID-19 Global Spread Data
The COVID-19 pandemic has affected us all over the past three years.
Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2
virus.
Most people infected with the virus will experience mild to moderate respiratory
illness and recover without requiring special treatment. However, some will become
seriously ill and require medical attention. Older people and those with underlying
medical conditions like cardiovascular disease, diabetes, chronic respiratory disease,
or cancer are more likely to develop serious illness. Anyone can get sick with
COVID-19 and become seriously ill or die at any age.
The best way to prevent and slow down transmission is to be well informed about the
disease and how the virus spreads. Students are asked to investigate the COVID-19
global data by means of cluster analysis.
The dataset snapshot was taken on February 10 at 10:00 am and has been uploaded to BB.
The full and most recent dataset is available here.
World Health Organisation Coronavirus disease (COVID-19) Weekly Epidemiological
Updates are available here.
Stay Informed and Take Care!
Note: the first row of the dataset is the world data. It is provided for your
information only; do not include this row in your DM analysis.
Tasks:
Identify the main groups of countries and their characteristics using
1. K-Means clustering after performing dimensionality reduction (PCA)
[16 Marks]
2. Agglomerative hierarchical clustering after performing dimensionality
reduction (PCA)
[9 Marks]
3. Critically summarise the business value of the clustering analysis with reference
to the given domain.
[5 Marks]
[30 Marks]
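As an illustration of the overall flow of task 1, the sketch below standardises a small synthetic feature matrix, reduces it with PCA and then runs K-Means on the component scores. It is a minimal NumPy-only sketch, not a prescribed solution: the synthetic data, the median imputation and the helper names `standardise`, `pca` and `kmeans` are assumptions made for this example, and a library implementation (for instance scikit-learn) would normally be used in the report.

```python
import numpy as np

def standardise(X):
    """Fill NaNs with the column median (an assumed choice), then z-score each column."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def pca(Z, n_components):
    """PCA via SVD of an already centred matrix Z."""
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)      # explained variance ratio
    scores = Z @ Vt[:n_components].T           # coordinates on the components
    return scores, Vt[:n_components], explained[:n_components]

def _nearest(X, centres):
    """Index of the closest centre for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centres drawn from X."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = _nearest(X, centres)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = _nearest(X, centres)
    inertia = float(((X - centres[labels]) ** 2).sum())  # within-cluster SSE
    return labels, centres, inertia

# Synthetic stand-in for the country data: two groups, five features
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(30, 5))
X = np.vstack([group_a, group_b])
X[3, 2] = np.nan  # one missing value to exercise the imputation step

Z = standardise(X)
scores, components, explained = pca(Z, n_components=2)
labels, centres, inertia = kmeans(scores, k=2)
print("explained variance ratio:", explained)
print("cluster sizes:", np.bincount(labels))
```

Running `kmeans` for several values of `k` and plotting the returned `inertia` gives an elbow curve, which is one of the possible ways to support the choice of the number of clusters.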
II. Dataset Information – Radio Listeners Data: To understand what exactly a
listener prefers listening to on the radio, every detail is recorded online. This
recorded information is used for recommending music that the listener is likely to
enjoy and to devise a “focused” marketing strategy that sends out
advertisements for music that a listener may wish to buy. Without such
targeting, scarce advertising money is wasted.
Suppose that you are provided with data from a music community site, giving you
details of each user. This will be further enhanced by getting access to a log of every
artist that listed users have downloaded on their computer. With this data, you will
also get information on the demographics of the listed users (such as age, sex,
location, occupation, and interests).
The objective of providing this data lies in building a system that recommends new
music to the users in this listed community. From the available information, it is
usually not difficult to determine the support for various individual artists (that is,
the frequencies of a specific music genre/artist or song that a user is listening to)
as well as the joint support for pairs (or larger groupings) of artists.
To obtain the support of an artist, count the number of incidences across all
network members and divide it by the number of members.
The data set contains close to 300,000 records of song (or artist) selections
made by 15,000 users. Each row of the data set contains the name of an artist
that the user has listened to. The first user is a German lady who has listened
to 16 artists, which accounts for the first 16 rows of the data matrix.
First, you need to transform the data given here into an incidence matrix, where
each listener is represented by a row, with 0s and 1s across the columns. This
indicates if a listener has chosen a certain artist or not.
Then, calculate the support for each of the listed 1004 artists and display
all artists whose support is greater than 0.08.
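The transformation and the support calculation described above can be sketched as follows. This is a minimal pure-Python illustration on made-up data: the `records` list, the user and artist names, the helper names and the 0.6 threshold are all hypothetical and chosen only so the toy example is easy to check by hand.

```python
# Hypothetical (user, artist) records in the long format described above:
# one row per artist that a user has listened to.
records = [
    ("u1", "radiohead"), ("u1", "coldplay"), ("u1", "muse"),
    ("u2", "radiohead"), ("u2", "coldplay"),
    ("u3", "coldplay"),
    ("u4", "radiohead"), ("u4", "muse"),
]

def incidence_matrix(records):
    """Build a 0/1 matrix with one row per listener and one column per artist."""
    users = sorted({u for u, _ in records})
    artists = sorted({a for _, a in records})
    row = {u: i for i, u in enumerate(users)}
    col = {a: j for j, a in enumerate(artists)}
    matrix = [[0] * len(artists) for _ in users]
    for u, a in records:
        matrix[row[u]][col[a]] = 1
    return users, artists, matrix

def support(matrix, artists):
    """Support of an artist = listeners who chose the artist / all listeners."""
    n = len(matrix)
    return {a: sum(r[j] for r in matrix) / n for j, a in enumerate(artists)}

users, artists, matrix = incidence_matrix(records)
supports = support(matrix, artists)

threshold = 0.6  # illustrative only; the brief asks for 0.08 on the real data
popular = {a: s for a, s in supports.items() if s > threshold}
print(popular)
```

On the real data the same idea applies with 15,000 rows and 1004 columns, where a sparse or data-frame representation would be more practical than nested lists.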
The full dataset is available here.
Tasks
1. Perform Market Basket analysis for the top 3 baskets (listeners): Split the data
according to the listeners (creation of a basket) and apply the Apriori algorithm
for each basket.
[15 Marks]
2. Critically summarise the business value of each basket analysis with reference
to the given domain
[5 Marks]
[20 Marks]
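To make the Apriori step concrete, the sketch below mines frequent itemsets and association rules from a handful of hypothetical listener baskets. It is a small pure-Python illustration of the level-wise Apriori idea (join, prune, count), not the required implementation: the baskets, the thresholds and the function names `apriori` and `rules` are assumptions, and a dedicated library would normally be used on the full data.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def sup(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    items = {i for b in baskets for i in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if sup(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: sup(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent

def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with their support, confidence and lift."""
    out = []
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    lift = confidence / frequent[rhs]
                    out.append((set(lhs), set(rhs), s, confidence, lift))
    return out

# Hypothetical baskets: one set of artists per listener
baskets = [
    {"radiohead", "coldplay", "muse"},
    {"radiohead", "coldplay"},
    {"coldplay"},
    {"radiohead", "muse"},
    {"radiohead", "coldplay", "muse"},
]

frequent = apriori(baskets, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
for lhs, rhs, s, confidence, lift in found:
    print(lhs, "->", rhs, "support", s, "confidence", confidence, "lift", lift)
```

A lift above 1 indicates that the two sides of a rule occur together more often than would be expected if they were independent, which is the kind of evidence the business-value discussion can build on.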
Guidelines:
You are required to deliver a report (max 15 pages including all figures) describing the
methods adopted and the discussion of achieved results with reference to the tasks
listed below. Assume that the report is targeted at a marketing strategist who is
interested in learning the business insights inferred from your analysis and in
receiving suggestions on how to take appropriate action.
Marking Scheme
Due to the nature of the assessment, candidates may come up with more than one
equally good solution. Marks will therefore be allocated as follows:
I. COVID-19 global spread Data
Tasks:
Identify the main groups of countries and their characteristics using
1. K-Means clustering after performing dimensionality reduction (PCA);
[16 Marks]
o Identification and treatment of any Missing Values;
[1 Mark]
o Identification and treatment of any Outliers;
[1 Mark]
o Data Normalisation;
[2 Marks]
o Dimensionality Reduction Using PCA;
[3 Marks]
o Find the ideal number of clusters – justify it by showing two
different methods (via manual or automated tools);
[3 Marks]
o Perform K-Means on the PCA-reduced data set;
[3 Marks]
o Project each original feature on the principal component axis, to
represent the level of importance of each feature in the multidimensional
scaling.
[3 Marks]
2. Agglomerative hierarchical clustering after performing
dimensionality reduction (PCA);
[9 Marks]
o Plot the dendrogram – and justify the obtained number of clusters;
[3 Marks]
o Justify the selected linkage method;
[3 Marks]
o Project each original feature on the principal component axis, to
represent the level of importance of each feature in the multidimensional
scaling.
[3 Marks]
3. Critically summarise the value of the clustering analysis with reference to the
given domain;
[5 Marks]
o Discuss the meaning of the obtained clusters with reference to the
COVID global situation, give comments about your home country;
[2 Marks]
o Justify your findings using the results of tasks 1 and 2.
[3 Marks]
[30 Marks]
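One possible reading of "project each original feature on the principal component axis" is to compute the PCA loadings, that is, the correlation between each standardised feature and each component. The sketch below does this with NumPy on synthetic data; the feature names, the data and the helper `pca_loadings` are hypothetical, and other presentations (for example a biplot) are equally valid.

```python
import numpy as np

def pca_loadings(X, feature_names, n_components=2):
    """Loadings = eigenvectors of the correlation matrix scaled by sqrt(eigenvalue).

    For standardised data each loading equals the correlation between a
    feature and a principal component, so its magnitude indicates how
    strongly the feature contributes to that component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)
    explained = eigvals / len(feature_names)           # trace of corr = p
    by_feature = {name: loadings[i] for i, name in enumerate(feature_names)}
    return by_feature, explained

# Synthetic stand-in: two strongly related features and one independent one
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t + 0.1 * rng.normal(size=200),
    t + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
names = ["cases_per_million", "deaths_per_million", "tests_per_million"]  # hypothetical

loadings, explained = pca_loadings(X, names)
for name, vec in loadings.items():
    print(f"{name:>20}: PC1={vec[0]:+.2f}  PC2={vec[1]:+.2f}")
print("explained variance ratio:", explained)
```

In this toy example the two related features load heavily on the first component while the independent one dominates the second, which is the pattern a loading plot is meant to reveal.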
II. Data Set Information – Radio Listeners Data
Tasks:
1. Data Understanding: Preliminary steps to capture basic data properties.
Distribution analysis, statistical exploration, correlation analysis, the suitable
transformation of variables and elimination of redundant variables,
management of missing values.
[4 Marks]
2. Apriori Implementation:
[9 Marks]
o Split the data according to the listener (creation of a basket)
[4 Marks]
o Apply the Apriori algorithm in each basket.
[5 Marks]
3. Critically summarise the business value of each basket analysis with reference
to the given domain
[7 Marks]
a. Discuss the meaning of the important rules for each basket;
[3.5 Marks]
b. Justify your findings using the results of tasks 1 and 2.
[3.5 Marks]
[20 Marks]
