1. Introduction
The idea of this final project is to give students an opportunity to work on some “real-world” problems. For the final project, instead of providing pre-processed data, we will provide the raw datasets. Students can pick one of the datasets, which is equivalent to picking a task, as shown below.
The theme of this project is “Machine Learning for Social Good”.
2. Project Options
Option 1: Detecting Bias in Social Media Posts
- Data type: texts (tweets)
- Task type: binary classification
- Data link
Option 2: Brain Tumor Classification (MRI)
- Data type: images
- Task type: 4-class classification
- Data link
Option 3: Detecting Malicious and Benign Websites
- Data type: tabular
- Task type: binary classification
- Data link
On the Kaggle website, each dataset comes with several demo code examples, which may not be exactly aligned with our project requirements. Students can use them as references, but please use them with caution. Note that directly copying from the example code will be considered plagiarism.
3. Project Group Signup
We will use Canvas to sign up for groups. Currently, Canvas allows students to do self-signup, which means students can create their own groups.
- Students can create their own group on the `Groups` page in the `People` panel
- The suggested group name is `Final-Project-XX`, with `XX` replaced by a 2-digit number
- Each group can only have 3 – 4 students. Because of the size of this class, smaller groups (1 – 2 students) are not allowed without the instructor’s permission.
- Students are recommended to use the discussion board (e.g., Campuswire) to form groups if necessary.
You can also sign up using the Google form if you have trouble using Canvas.
4. Project Report Guideline
Please create an IPython notebook, either on your local machine or on Google Colab, for this project. Your report will be that notebook file. Please keep all the code and required outputs for grading.
Please use the following section titles to organize your report and implementation.
Section 1: Data Preprocessing
Although all three options are classification tasks, each option has a different type of data. Therefore, each option has different feature-engineering requirements.
Please keep the pre-processed data, as it will be used for the tasks in the following sections.
Option 1: Detecting Bias in Social Media Posts
- Input and output: the original data has 21 columns. For this project, we will only use the following two columns:
  - the column `text` as the input, and
  - the column `bias` as the output
- Task: the task is to build a classifier to predict whether a tweet is biased based on the text.
TODO
- Convert the input texts into numeric vectors for classification
- You can use the `CountVectorizer` function from `sklearn` (a minimal sketch follows this list)
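Here is a minimal sketch of this step, assuming the raw data is a CSV file (the file name `tweets.csv` is a placeholder for whatever file you downloaded):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the raw data; the file name is a placeholder, adjust it to your download
df = pd.read_csv("tweets.csv")

# Keep only the two columns used in this project
texts = df["text"].fillna("")   # input
labels = df["bias"]             # output

# Convert the input texts into a sparse matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
y = labels.values
print(X.shape)  # (number of tweets, vocabulary size)
```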
Option 2: Brain Tumor Classification (MRI)
- Input and Output: the original data are organized by folders, and each folder corresponds to one label
  - Use each image as the input
  - Use the corresponding folder name as the label
- Task: the task is to build a classifier to predict the type of tumor based on the image
TODO
- Merge the data from all four folders into one collection, and meanwhile keep track of the label of each image
- Load the images and convert them to grayscale
  - There are many functions that can load images from one folder; here are some options
  - The original images are in the RGB format, which means there are three channels. Converting them to grayscale reduces each image to one channel
- Reduce the dimensionality of each image to 128 using PCA (a minimal sketch follows this list)
  - The original dimensionality of each image is 360 x 380, which means each image will have about 100K dimensions, even after converting to grayscale
  - To reduce the dimensionality, please set the parameter `n_components` to 128
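A minimal sketch of this pipeline, assuming one subfolder per tumor class under a root folder (`brain_tumor_data` is a placeholder) and using Pillow to load the images:

```python
import os
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

data_root = "brain_tumor_data"  # placeholder for the merged data folder

images, labels = [], []
for class_name in sorted(os.listdir(data_root)):
    class_dir = os.path.join(data_root, class_name)
    if not os.path.isdir(class_dir):
        continue
    for fname in os.listdir(class_dir):
        img = Image.open(os.path.join(class_dir, fname)).convert("L")  # "L" = grayscale
        img = img.resize((128, 128))  # resize so all images share a shape (an assumption)
        images.append(np.asarray(img, dtype=np.float32).ravel())  # flatten to a 1-D vector
        labels.append(class_name)  # the folder name is the label

X = np.stack(images)
y = np.array(labels)

# Reduce each flattened image to 128 dimensions, as required
pca = PCA(n_components=128)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (number of images, 128)
```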
Option 3: Detecting Malicious and Benign Websites
- Input and Output: this tabular data has 21 columns
  - Use the `Type` column as the label, and
  - Use the rest of the columns as input
- Task: the task is to build a classifier to predict whether a website is malicious or benign
TODO
- Load the dataset
- Normalize the data using one of the normalization functions listed on this page
- Please explain why you chose this normalization method and why it should work (a minimal sketch follows this list)
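A minimal sketch, assuming the raw data is a CSV file (`websites.csv` is a placeholder) and using standardization as one possible normalization choice; justify whichever method you actually pick:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# File name is a placeholder; adjust it to match the downloaded dataset
df = pd.read_csv("websites.csv")

y = df["Type"]                 # the label column
X = df.drop(columns=["Type"])  # the remaining columns are the input

# For this sketch, keep only the numeric columns; categorical columns
# would need encoding first (an assumption about the raw data)
X = X.select_dtypes(include="number").fillna(0)

# One possible choice: standardization (zero mean, unit variance)
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)
```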
Section 2: Data Splits
TODO
- Create a training/validation/test split using 80%, 10%, and 10% of the whole dataset
- You can use the `train_test_split` function, but note that it can only split a dataset into two subsets at a time (a sketch of applying it twice follows this list)
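One way to get the 80/10/10 split is to call `train_test_split` twice; a sketch, assuming `X` and `y` are the pre-processed features and labels from Section 1:

```python
from sklearn.model_selection import train_test_split

# First split off 80% for training; the remaining 20% is then split in half
# to give 10% validation and 10% test (random_state fixed for reproducibility)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(y_train), len(y_val), len(y_test))
```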
Comment:
- A common practice is to create the data split first, then do feature engineering. For simplicity, we switch the order in this project.
Section 3: Build classifiers
TODO
- Please choose two of the four classification families we discussed in class for this project: Perceptron, logistic regression, SVM, and feed-forward neural networks.
- Please report the prediction accuracies of the two classifiers with their default parameters (a minimal sketch follows the comments below).
Comments:
- Please don’t use classifiers beyond the scope of our lectures. I understand some of you may want to use more advanced classifiers, such as BERT or GPT. However, there are two reasons why this is not expected: (1) students may not have enough computing resources to run many experiments; and (2) a good understanding of how simple classifiers work is more important than making an advanced classifier work.
- For large datasets, if you want to try linear SVM, the LinearSVC function is often more efficient.
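A minimal sketch with two example choices from the allowed families (an assumption; your team may pick any two of them):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Fit each classifier with its default parameters and report test accuracy
for name, clf in [("Logistic regression", LogisticRegression()),
                  ("Linear SVM", LinearSVC())]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```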
Section 4: Hyper-parameter tuning
For each classifier:
TODO
- Please choose at least three hyper-parameters and, for each hyper-parameter, a set of pre-defined candidate values. Report these hyper-parameters and their pre-defined values.
- Please report the prediction accuracy on the validation set for each combination of the hyper-parameters.
- Please report the prediction accuracy on the test set with the best combination of the hyper-parameters (a sketch follows this list).
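A sketch of a simple grid search over three hypothetical hyper-parameters of logistic regression; adapt the grid to the classifiers your team actually picked:

```python
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical grid: three hyper-parameters with pre-defined candidate values
grid = {"C": [0.01, 0.1, 1.0, 10.0],
        "solver": ["lbfgs", "liblinear"],
        "class_weight": [None, "balanced"]}

best_acc, best_params = -1.0, None
for C, solver, class_weight in product(grid["C"], grid["solver"], grid["class_weight"]):
    clf = LogisticRegression(C=C, solver=solver, class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"C={C}, solver={solver}, class_weight={class_weight}: val acc = {val_acc:.3f}")
    if val_acc > best_acc:
        best_acc, best_params = val_acc, (C, solver, class_weight)

# Retrain with the best combination and report the test accuracy once
C, solver, class_weight = best_params
best_clf = LogisticRegression(C=C, solver=solver, class_weight=class_weight, max_iter=1000)
best_clf.fit(X_train, y_train)
print("test acc =", accuracy_score(y_test, best_clf.predict(X_test)))
```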
Section 5: Analysis
Please answer the following questions
TODO
- Based on the results above, which classifier is better, and why?
- For further improvement of the classification accuracy, what strategies can you use, and why do you think they will be helpful?
  - List at least two strategies in your analysis
  - You don’t have to implement these strategies; therefore, your justification should not be “we tried them and they worked”
  - If your strategy involves using large language models, you should show a good understanding of how to use them, in addition to the justification of why they could help.
- (Only for option 3) Which input feature is the most informative one, and what is your empirical evidence? (One possible way to gather such evidence is sketched below.)
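For the option-3 question, one possible approach (an assumption, not a required method) is permutation importance; `best_clf` and `feature_names` below are placeholders for your tuned classifier and the list of input column names from Section 1:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and measure the accuracy drop;
# a larger drop suggests a more informative feature
result = permutation_importance(best_clf, X_val, y_val,
                                scoring="accuracy", n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(feature_names[idx], round(result.importances_mean[idx], 4))
```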
5. Submission Guideline
For the final project report
- The submission deadline is May 5, 11:59 PM, without a late penalty
- Make sure the names of all team members are included at the beginning of the report
- Each team only needs to submit one report; it can be uploaded by any member of the team, and the same grade will be given to all team members
For the final project presentation
- The submission deadline is May 10, 11:59 PM. Note that late submissions will NOT be accepted, in order to give the TAs and the instructor enough time for grading.
- Each team should prepare a slide deck for presentation, and all team members should be listed on the first page. The requirements are in the next section.
- The presentation should be recorded via Zoom and uploaded in mp4 format (the default format of Zoom recordings).
- Similar to the report, each team only needs to submit one copy of the video. It can be uploaded by any team member, and the same grade will be given to all team members.
6. Project Presentation
The project presentation should be 6 – 8 minutes long. Each team should prepare a slide deck for the presentation, and the number of slides should be around seven, including the title page.
The expected content from the presentation includes
- The justification (2 points)
- Why did your team pick this project?
- The justification should be aligned with the theme of the final project, and cannot be something like “we randomly picked a project”
- The technical challenges (2 points)
- As this project is an end-to-end implementation, the potential technical challenges could include, but are not limited to: (1) data pre-processing: how to process the data before building classifiers; (2) performance validation: how do we know the performance is valid or good enough; and (3) hyper-parameter tuning: which hyper-parameters you decided to adjust and how you picked their values (this question may not apply to some hyper-parameters, as their possible values are pre-defined and very limited).
- Addressing the technical challenges (2 points)
- How did you address the previous technical challenges?
- Final results (2 points)
- To what extent did addressing the challenges impact the final results?
- In other words, what was the prediction performance before and after addressing the challenges?
Note that
- You can use any one of the three classifiers to prepare the presentation.
- All team members need to participate in the presentation; whether each member speaks for one sentence or one minute does not matter.
- As you may realize, this presentation is complementary to the final report, and most of its content will already be there once you finish the project.