This project applies Natural Language Processing (NLP) techniques to classify IMDb movie reviews as positive or negative. It follows a two-phase approach:
- 🔍 Phase 1: Text preprocessing and exploratory analysis using R and tidyverse libraries.
- 🤖 Phase 2 (in progress): Sentiment classification models in Python using Scikit-learn and PyTorch.
- IMDb Reviews Dataset (Kaggle)
- 50,000 reviews labeled as positive or negative.
- Text cleaning: lowercasing, stopword removal, HTML cleanup
- Tokenization and lemmatization
- POS tagging with `udpipe`
- Visualization: word clouds, bar charts, n-gram analysis
- 📈 Full EDA Notebook on RPubs
- Data exported as `IMDB-cleaned.csv`
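The preprocessing in this project is done in R. For readers who will work on the Python side in Phase 2, the cleaning and n-gram steps listed above can be sketched roughly as follows. This is a minimal standard-library sketch, not the project's code: the stopword list, the regular expressions, and the function names `clean_review` and `ngram_counts` are all illustrative assumptions, and lemmatization and POS tagging are left out.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to", "in", "i"}

TAG_RE = re.compile(r"<[^>]+>")     # matches HTML tags such as <br />
TOKEN_RE = re.compile(r"[a-z']+")   # lowercase words, apostrophes allowed


def clean_review(text):
    """Strip HTML, lowercase, tokenize, and drop stopwords (hypothetical helper)."""
    text = TAG_RE.sub(" ", text)    # HTML cleanup: remove tags
    text = html.unescape(text)      # HTML cleanup: decode entities such as &amp;
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def ngram_counts(tokens, n=2):
    """Count n-grams (as tuples) over a token list (hypothetical helper)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = clean_review("This movie was GREAT!<br /><br />Great acting &amp; a great plot.")
bigrams = ngram_counts(tokens, n=2)
print(tokens)
print(bigrams.most_common(3))
```

The same `ngram_counts` helper gives unigram frequencies with `n=1`, which is the kind of count a word cloud or bar chart would be built from.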
- Model candidates:
  - Logistic Regression
  - Naive Bayes
  - Support Vector Machines
  - PyTorch-based classifier
- Metrics: Accuracy, F1-score, ROC-AUC
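Since Phase 2 is still in progress, the following is only a sketch of how the scikit-learn candidates and metrics above might be compared, not the project's implementation. The TF-IDF features, the hyperparameters, and the toy `texts` and `labels` are assumptions for illustration; the PyTorch classifier is not covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned reviews (hypothetical data, not from IMDB-cleaned.csv).
texts = [
    "great movie wonderful acting",
    "loved it brilliant plot",
    "fantastic film great cast",
    "wonderful story loved every minute",
    "terrible movie awful acting",
    "hated it boring plot",
    "dreadful film bad cast",
    "awful story boring every minute",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    preds = model.predict(texts)
    # ROC-AUC needs a continuous score: class probabilities where the
    # classifier provides them, otherwise the SVM decision margin.
    if hasattr(clf, "predict_proba"):
        scores = model.predict_proba(texts)[:, 1]
    else:
        scores = model.decision_function(texts)
    # Scored on the training data purely to show the API;
    # a real comparison needs a held-out test split.
    results[name] = {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "roc_auc": roc_auc_score(labels, scores),
    }

for name, metrics in results.items():
    print(name, metrics)
```

On the full 50,000-review dataset, `sklearn.model_selection.train_test_split` or cross-validation would replace the in-sample scoring used here.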
- `sentiment-analysis.Rmd`: Full EDA notebook in R
- `IMDB-cleaned.csv`: Preprocessed dataset
- `model_sentiment.py`: Sentiment classification model (planned)
- Streamlit / Flask deployment (planned)
- Build an interactive dashboard (Streamlit)
- Deploy a REST API using FastAPI
- (Optional) Real-time batch sentiment processing with Apache Spark
- R: `tidyverse`, `tidytext`, `udpipe`, `ggplot2`, `SnowballC`
- Python: `scikit-learn`, `NLTK`, `PyTorch` (planned)
- EDA & Reporting: R Markdown, RPubs
Manuel Alejandro Matías Astorga
Data Scientist | Physicist | Machine Learning Enthusiast
📄 Portfolio Website · LinkedIn
✨ Feel free to fork, contribute or reach out if you're working on similar projects!