CN7030 – Machine Learning on Big Data
Coursework: 2021-22 Academic Year
This coursework (CRWK) must be attempted in groups of two students. It is divided into two sections: (1) Spark Machine Learning on a real case study and (2) Spark Streaming for a streaming-based application.
All group members must attend the presentation in week 12. The presentation will be held online through Microsoft Teams. If you fail to attend the presentation, your mark will be zero.
The overall mark for the CRWK comes from two main activities:
1- Big Data report (around 3,000 words, with a tolerance of ±10%) in HTML format (60%)
2- Presentation (40%)
Assessment for resit (2nd attempt):
The resit consists of the Big Data report only (100%), in HTML format. Students must develop new solutions for the same tasks. If students copy their solutions from the main sit, it will be considered self-plagiarism and the mark will be zero. The marking scheme is the same as for the main sit.
Marking Scheme
Topic                        | Total mark | Remarks (breakdown of marks for each sub-task)
Machine Learning on Big Data | 60         | (20) Design one binary classifier, and explain its configurations and parameters
                             |            | (25) Design one multi-class classifier incorporating ensemble techniques, and explain its configurations and parameters
                             |            | (15) Performance and accuracy measurements on both classifiers.
Data Streaming Application   | 30         | (5) Configure and initiate the streaming environment in Spark.
                             |            | (25) Manipulate and process the real-time data and visualise it in (near) real time.
Documentation                | 10         | (10) Write a well-organized report for a programming and analytics project.
Total:                       | 100        |
Good Luck!
Big Data Processing using PySpark
Understanding the Dataset: UNSW-NB15 [1]
The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm
tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to
generate a hybrid of real modern normal activities and synthetic contemporary attack
behaviours. The tcpdump tool was used to capture 100 GB of the raw traffic (i.e., Pcap files). This
dataset contains nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits,
Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools are used,
and twelve algorithms are developed to generate a total of 49 features with the class label.
a) The features are described here.
b) The number of attacks and their sub-categories is described here.
c) We use a total of 10 million records stored in a CSV file (download). The total size is about 600 MB. We use this file for the machine learning task.
Task 1: Design and Build Classifiers using PySpark [60 marks]
1. Design one binary classifier to categorise the attack and the normal traffic data. Explain
your algorithm and its configuration. Follow the complete machine-learning process,
involving feature selection, preprocessing, handling class imbalance, etc. [20 marks]
2. Design one multi-class classifier incorporating ensemble techniques, i.e., bagging and
boosting, and briefly explain every parameter, configuration and processing step. [25 marks]
3. Measure and compare the performance of both classifiers. Visualise your results and
findings using Python libraries. [15 marks]
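For the class-imbalance step in part 1, one common approach (a sketch, not the only acceptable method) is to attach a per-class weight column to the training data; Spark's `LogisticRegression` accepts such a column via its `weightCol` parameter. The inverse-frequency weights themselves are simple to compute, as this plain-Python illustration shows:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: w_c = n_total / (n_classes * n_c).
    The majority class gets a weight below 1, the minority above 1."""
    counts = Counter(labels)
    total = len(labels)
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

# Example: 8 "normal" records (label 0) vs. 2 "attack" records (label 1).
weights = class_weights([0] * 8 + [1] * 2)
# weights[0] == 0.625, weights[1] == 2.5
```

In PySpark the same mapping would typically be applied with a `when`/`otherwise` column expression (or a join against a small weights DataFrame) before fitting the classifier.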
Note: only a working solution free of system/logical errors is considered for full marks.
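For the performance comparison in part 3, accuracy alone can be misleading on imbalanced traffic data, so precision, recall and F1 are worth reporting as well. In Spark these are available through `pyspark.ml.evaluation.MulticlassClassificationEvaluator`, but the underlying arithmetic is just confusion counts, as this minimal plain-Python sketch shows:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from raw predictions
    (positive class = 1, i.e. 'attack')."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp=2, tn=1, fp=1, fn=1 -> accuracy 0.6, precision and recall both 2/3
```

Computing the four confusion counts once and deriving all metrics from them also makes the comparison table between the two classifiers easy to assemble.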
Task 2: Data Streaming [30 marks]
Spark offers three different data-streaming methods: Discretized Streams
(DStreams), window-based computations, and Structured Streaming. You may apply any one
of these methods to complete this task, as follows:
• Configure Spark environment based on your method.
• The incoming data/traffic should arrive in paragraph format (several lines).
• The first task is to count the number of words with a length of 5 or more in the odd-numbered lines.
• The second task is to count only digits.
• At the end, visualise (using Python UI/plot libraries) and/or print out the results in
predefined time slots.

[1] https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/
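Whichever streaming method you choose, the per-batch counting logic can be prototyped in plain Python before wiring it into, for example, a DStream `foreachRDD` or a Structured Streaming `foreachBatch` callback. The sketch below assumes "odd-numbered" means 1-based line numbering and that "count only digits" means counting digit characters; both readings are interpretation choices your group should state in the report:

```python
def batch_counts(paragraph):
    """Per-batch logic for the two streaming subtasks:
    - words of length >= 5 on odd-numbered lines (1-based),
    - digit characters across the whole paragraph (assumed reading)."""
    lines = paragraph.splitlines()
    long_words = sum(
        1
        for i, line in enumerate(lines, start=1)
        if i % 2 == 1                      # odd-numbered lines only
        for word in line.split()
        if len(word) >= 5
    )
    digits = sum(ch.isdigit() for ch in paragraph)
    return long_words, digits

# Lines 1 and 3 are odd: 'hello', 'brief', 'stream', 'data7' qualify.
result = batch_counts("hello brief\nskip these words\nstream 42 data7")
```

The counts returned per batch can then be accumulated per time slot and handed to a plotting library for the (near) real-time visualisation.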
Task 3: Documentation [10 marks]
Your final report must follow the "The format of final submission" section. Your work must
demonstrate an appropriate understanding of building a user-friendly, efficient and
comprehensive analytics report for a big data project, helping users (readers) navigate
to the relevant content.
