1. Introduction
The idea of this final project is to give students an opportunity to work on some “real-world” problems. For the final project, instead of providing pre-processed data, we will provide the raw datasets. Students can pick one of the datasets, which is equivalent to picking a task, as shown below.
The theme of this project is “Machine Learning for Social Good”.
2. Project Options
Option 1: Detecting Bias in Social Media Posts
- Data type: texts (tweets)
- Task type: binary classification
- Data link
Option 2: Brain Tumor Classification (MRI)
- Data type: images
- Task type: 4-class classification
- Data link
Option 3: Detecting Malicious and Benign Websites
- Data type: tabular
- Task type: binary classification
- Data link
On the Kaggle website, each dataset comes with several demo code examples, which may not be exactly aligned with our project requirements. Students can use them as references, but please use them with caution. Note that directly copying from the example code will be considered plagiarism.
3. Project Group Signup
We will use Canvas to sign up for groups. Currently, Canvas allows students to do self-signup, which means students can create their own groups.
- Students can create their own group on the `Groups` page in the `People` panel
- The suggested group name is `Final-Project-XX`, with `XX` replaced by a 2-digit number
- Each group can only have 3 – 4 students. Because of the size of this class, smaller groups (1 – 2 students) are not allowed without the instructor’s permission.
- Students are recommended to use the discussion board (e.g., Campuswire) to form groups if necessary.
You can also sign up using the Google form if you have trouble using Canvas.
4. Project Report Guideline
Please create an IPython notebook, either on your local machine or on Google Colab, for this project. Your report will be that notebook file. Please keep all the code and required outputs for grading.
Please use the following section titles to organize your report and implementation.
Section 1: Data Preprocessing
Although all three options are classification tasks, each option has a different type of data. Therefore, each option has different feature-engineering requirements.
Please keep the pre-processed data, as it will be used for the tasks in the following sections.
Option 1: Detecting Bias in Social Media Posts
- Input and output: the original data has 21 columns. For this project, we will only use the following two columns:
  - the column `text` as the input, and
  - the column `bias` as the output
- Task: the task is to build a classifier to predict whether a tweet is biased based on the text.
TODO
- Convert the input texts into numeric vectors for classification
- You can use the `CountVectorizer` function from `sklearn` (a minimal sketch follows this list)
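Here is a minimal sketch of this step, assuming the raw data is a CSV file (the file name `tweets.csv` is a placeholder for whatever file you downloaded):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the raw data; the file name is a placeholder, adjust it to your download
df = pd.read_csv("tweets.csv")

# Keep only the two columns used in this project
texts = df["text"].fillna("")   # input
labels = df["bias"]             # output

# Convert the input texts into a sparse matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
y = labels.values
print(X.shape)  # (number of tweets, vocabulary size)
```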
Option 2: Brain Tumor Classification (MRI)
- Input and Output: the original data are organized by folders, and each folder corresponds to one label
  - Use each image as the input
  - Use the corresponding folder name as the label
- Task: the task is to build a classifier to predict the type of tumor based on the image
TODO
- Merge the data from all four folders into one collection, and meanwhile keep track of the label of each image
- Load the images and convert them to grayscale
  - There are many functions that can load images from one folder; here are some options
  - The original images are in the RGB format, which means there are three channels. Converting them to grayscale reduces each image to one channel
- Reduce the dimensionality of each image to 128 using PCA (a minimal sketch follows this list)
  - The original dimensionality of each image is 360 x 380, which means each image will have about 100K dimensions, even after converting to grayscale
  - To reduce the dimensionality, please set the parameter `n_components` to 128
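A minimal sketch of this pipeline, assuming one subfolder per tumor class under a root folder (`brain_tumor_data` is a placeholder) and using Pillow to load the images:

```python
import os
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

data_root = "brain_tumor_data"  # placeholder for the merged data folder

images, labels = [], []
for class_name in sorted(os.listdir(data_root)):
    class_dir = os.path.join(data_root, class_name)
    if not os.path.isdir(class_dir):
        continue
    for fname in os.listdir(class_dir):
        img = Image.open(os.path.join(class_dir, fname)).convert("L")  # "L" = grayscale
        img = img.resize((128, 128))  # resize so all images share a shape (an assumption)
        images.append(np.asarray(img, dtype=np.float32).ravel())  # flatten to a 1-D vector
        labels.append(class_name)  # the folder name is the label

X = np.stack(images)
y = np.array(labels)

# Reduce each flattened image to 128 dimensions, as required
pca = PCA(n_components=128)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (number of images, 128)
```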
Option 3: Detecting Malicious and Benign Websites
- Input and Output: this tabular data has 21 columns
  - Use the `Type` column as the label, and
  - Use the rest of the columns as input
- Task: the task is to build a classifier to predict whether a website is malicious or benign
TODO
- Load the dataset
- Normalize the data using one of the normalization functions listed on this page
- Please explain why you chose this normalization method and why it should work (a minimal sketch follows this list)
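A minimal sketch, assuming the raw data is a CSV file (`websites.csv` is a placeholder) and using standardization as one possible normalization choice; justify whichever method you actually pick:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# File name is a placeholder; adjust it to match the downloaded dataset
df = pd.read_csv("websites.csv")

y = df["Type"]                 # the label column
X = df.drop(columns=["Type"])  # the remaining columns are the input

# For this sketch, keep only the numeric columns; categorical columns
# would need encoding first (an assumption about the raw data)
X = X.select_dtypes(include="number").fillna(0)

# One possible choice: standardization (zero mean, unit variance)
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)
```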
Section 2: Data Splits
TODO
- Create a training/validation/test split using 80%, 10%, and 10% of the whole dataset
- You can use the `train_test_split` function, but note that it can only split a dataset into two subsets at a time (a sketch of applying it twice follows this list)
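One way to get the 80/10/10 split is to call `train_test_split` twice; a sketch, assuming `X` and `y` are the pre-processed features and labels from Section 1:

```python
from sklearn.model_selection import train_test_split

# First split off 80% for training; the remaining 20% is then split in half
# to give 10% validation and 10% test (random_state fixed for reproducibility)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(y_train), len(y_val), len(y_test))
```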
Comment:
- A common practice is to create the data split first, then do feature engineering. For simplicity, we switch the order in this project.
Section 3: Build classifiers
TODO
- Please choose two of the four classification families we discussed in class for this project: Perceptron, logistic regression, SVM, and feed-forward neural networks.
- Please report the prediction accuracies of the two classifiers with their default parameters (a minimal sketch follows the comments below).
Comments:
- Please don’t use classifiers beyond the scope of our lectures. I understand some of you may want to use more advanced classifiers, such as BERT or GPT. However, there are two reasons why this is not expected: (1) students may not have enough computing resources to run many experiments; and (2) a good understanding of how simple classifiers work is more important than making an advanced classifier work.
- For large datasets, if you want to try linear SVM, the LinearSVC function is often more efficient.
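A minimal sketch with two example choices from the allowed families (an assumption; your team may pick any two of them):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Fit each classifier with its default parameters and report test accuracy
for name, clf in [("Logistic regression", LogisticRegression()),
                  ("Linear SVM", LinearSVC())]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```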
Section 4: Hyper-parameter tuning
For each classifier:
TODO
- Please choose at least three hyper-parameters and, for each hyper-parameter, a set of pre-defined candidate values. Report these hyper-parameters and their pre-defined values.
- Please report the prediction accuracy on the validation set for each combination of the hyper-parameters.
- Please report the prediction accuracy on the test set with the best combination of the hyper-parameters (a sketch follows this list).
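A sketch of a simple grid search over three hypothetical hyper-parameters of logistic regression; adapt the grid to the classifiers your team actually picked:

```python
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical grid: three hyper-parameters with pre-defined candidate values
grid = {"C": [0.01, 0.1, 1.0, 10.0],
        "solver": ["lbfgs", "liblinear"],
        "class_weight": [None, "balanced"]}

best_acc, best_params = -1.0, None
for C, solver, class_weight in product(grid["C"], grid["solver"], grid["class_weight"]):
    clf = LogisticRegression(C=C, solver=solver, class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"C={C}, solver={solver}, class_weight={class_weight}: val acc = {val_acc:.3f}")
    if val_acc > best_acc:
        best_acc, best_params = val_acc, (C, solver, class_weight)

# Retrain with the best combination and report the test accuracy once
C, solver, class_weight = best_params
best_clf = LogisticRegression(C=C, solver=solver, class_weight=class_weight, max_iter=1000)
best_clf.fit(X_train, y_train)
print("test acc =", accuracy_score(y_test, best_clf.predict(X_test)))
```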
Section 5: Analysis
Please answer the following questions
TODO
- Based on the results above, which classifier is better, and why?
- For further improvement of the classification accuracy, what strategies can you use, and why do you think they will be helpful?
  - List at least two strategies in your analysis
  - You don’t have to implement these strategies; therefore, your justification should not be “we tried them and they worked”
  - If your strategy involves using large language models, you should show a good understanding of how to use them, in addition to the justification of why they could help.
- (Only for option 3) Which input feature is the most informative one, and what is your empirical evidence? (One possible way to gather such evidence is sketched below.)
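For the option-3 question, one possible approach (an assumption, not a required method) is permutation importance; `best_clf` and `feature_names` below are placeholders for your tuned classifier and the list of input column names from Section 1:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and measure the accuracy drop;
# a larger drop suggests a more informative feature
result = permutation_importance(best_clf, X_val, y_val,
                                scoring="accuracy", n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(feature_names[idx], round(result.importances_mean[idx], 4))
```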
5. Submission Guideline
For the final project report
- The submission deadline is May 5, 11:59 PM, without a late penalty
- Make sure the names of all team members are included at the beginning of the report
- Each team only needs to submit one report; it can be uploaded by any member of the team, and the same grade will be given to all team members
For the final project presentation
- The submission deadline is May 10, 11:59 PM. Note that late submissions will NOT be accepted, in order to give the TAs and the instructor enough time for grading.
- Each team should prepare a slide deck for presentation, and all team members should be listed on the first page. The requirements are in the next section.
- The presentation should be recorded via Zoom and uploaded in mp4 format (the default format of Zoom recordings).
- Similar to the report, each team only needs to submit one copy of the video. It can be uploaded by any team member, and the same grade will be given to all team members.
6. Project Presentation
The project presentation should be 6 – 8 minutes long. Each team should prepare a slide deck for the presentation, and the number of slides should be around seven, including the title page.
The expected content from the presentation includes
- The justification (2 points)
- Why did your team pick this project?
- The justification should be aligned with the theme of the final project, and cannot be something like “we randomly picked a project”
- The technical challenges (2 points)
- As this project is an end-to-end implementation, the potential technical challenges could include, but are not limited to: (1) data pre-processing: how to process the data before building classifiers; (2) performance validation: how do we know the performance is valid or good enough; and (3) hyper-parameter tuning: which hyper-parameters you decided to adjust and how you picked their values (this question may not apply to some hyper-parameters, as their possible values are pre-defined and very limited).
- Addressing the technical challenges (2 points)
- How did you address the previous technical challenges?
- Final results (2 points)
- To what extent did addressing the challenges impact the final results?
- In other words, what was the prediction performance before and after addressing the challenges?
Note that
- You can use any one of the three classifiers to prepare the presentation.
- All team members need to participate in the presentation; whether each member speaks for one sentence or one minute does not matter.
- As you may realize, this presentation is complementary to the final report, and most of its content will already be there once you finish the project.