GitHub - k-katarzyna/20_newsgroups: A student project from a machine learning class, multiclass classification on a popular dataset.

The content

The progress, results, and comments, along with a summary and conclusions, are documented in the 20newsgroups/20newsgroups.ipynb notebook. Utilities written for this project are in the 20newsgroups/utilities.py file.

Use nbviewer to see the fully rendered notebook, including the gradient-styled result dataframes. LINK

The dataset

The project uses the "20 Newsgroups" dataset, which consists of over 18,000 posts across 20 different topics. The training and test sets are split by post date, separating posts published before a specific cutoff date from those published after it.
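As a minimal sketch of how the predefined date-based split can be obtained, the snippet below assumes the data is loaded with scikit-learn's `fetch_20newsgroups`; this README does not say which loader the notebook uses. The helper name `load_splits` is hypothetical.

```python
from sklearn.datasets import fetch_20newsgroups


def load_splits(data_home=None):
    """Return the predefined (train, test) splits of 20 Newsgroups.

    Assumption: scikit-learn's loader is used. The first call downloads
    the dataset, so nothing is fetched when this module is imported.
    """
    train = fetch_20newsgroups(subset="train", data_home=data_home)
    test = fetch_20newsgroups(subset="test", data_home=data_home)
    return train, test


# Example usage (downloads the data on first run):
# train, test = load_splits()
# print(len(train.data), len(test.data), len(train.target_names))
```

Each returned object exposes the raw posts in `.data`, the integer labels in `.target`, and the category names in `.target_names`.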

The goal

The aim of the project is to train a text classification model covering all 20 categories, optimized for the macro-averaged F1 score.
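To make the metric concrete, here is a small illustrative sketch of macro-averaged F1 in plain Python: the F1 score is computed per class and then averaged with equal weight, so small categories count as much as large ones. The function name `macro_f1` is hypothetical; in practice scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class
        # never occurs in either the labels or the predictions.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)  # per-class F1: 1.0, 2/3, 0.8
```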

Methods

To achieve the goal, we used:

Classifiers: SVM (Support Vector Machine) and Multinomial Naive Bayes.
Two approaches to data preparation:
- Text cleaning and lemmatization.
- Unchanged text, preprocessed only by TfidfVectorizer.
Two feature selection approaches:
- Chi-square test.
- Limiting the number of features through vectorization parameters.
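The combinations listed above could be wired together roughly as follows. This is a sketch under assumptions, not the code from utilities.py: `build_pipeline`, its parameters, the use of `LinearSVC` as the SVM, `SelectKBest` for the chi-square test, and `max_features` as the vectorization parameter are all illustrative choices. The toy corpus only demonstrates that the pipeline runs, and text cleaning and lemmatization are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(classifier, selection="chi2", n_features=1000):
    """Hypothetical helper: TF-IDF, one feature selection variant, classifier."""
    if selection == "chi2":
        # Keep the n_features terms with the highest chi-square scores.
        steps = [
            ("tfidf", TfidfVectorizer()),
            ("select", SelectKBest(chi2, k=n_features)),
        ]
    elif selection == "max_features":
        # Let the vectorizer keep only the most frequent terms.
        steps = [("tfidf", TfidfVectorizer(max_features=n_features))]
    else:
        raise ValueError(f"unknown selection method: {selection!r}")
    steps.append(("clf", classifier))
    return Pipeline(steps)


# Toy data, invented for illustration only.
texts = [
    "the rocket launch reached orbit",
    "nasa plans a new space shuttle mission",
    "the goalie stopped every shot in the hockey game",
    "the team won the hockey playoff game",
]
labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

svm_chi2 = build_pipeline(LinearSVC(), selection="chi2", n_features=5)
svm_chi2.fit(texts, labels)

nb_limited = build_pipeline(MultinomialNB(), selection="max_features", n_features=5)
nb_limited.fit(texts, labels)
```

Keeping feature selection inside the pipeline means it is fitted on training data only, which avoids leaking test-set information during cross-validation.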
