CS-109a - Intro to Data Science Team Project

Applying Data Science Best Practices to User-based Movie Ratings

Team members: Timur Zambalayev and Joshua Coffie

Welcome to our Project.

We began our journey by exploring a dataset of movie reviews from MovieLens, and decided to build a model that takes the reviews and ratings of future users and recommends the next movie they should check out!

The Dataset

The dataset we used to build our model contains just over 100,000 ratings of 9,100 movies from 671 unique reviewers, and it comes in several parts. If you've ever wondered what the raw data behind movie reviews looks like, take a look at our dataset page. A big thanks to MovieLens for providing the data for our model!

Learn more about our dataset.

Cleaning up the Data

Importing, merging, and cleaning the dataset is a step everyone should go through before building a model. In this section, we provide example code and demonstrate how to structure the dataset before you build a model.
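As a rough illustration of the import, merge, and clean steps described above, here is a minimal sketch using only the Python standard library. It is not the project's own code: the tiny inline CSV tables, the column names (`userId`, `movieId`, `rating`, `title`), and the helper names `load_rows` and `merge_and_clean` are all assumptions made for the example, and it targets Python 3 rather than the Python 2 used in the project.

```python
import csv
import io

# Hypothetical stand-ins for two parts of the dataset. The column names
# and values are assumptions made up for this sketch.
RATINGS_CSV = """userId,movieId,rating
1,10,4.0
1,20,
2,10,3.5
2,99,5.0
"""

MOVIES_CSV = """movieId,title
10,Toy Story (1995)
20,Heat (1995)
"""


def load_rows(text):
    """Import: parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_and_clean(ratings, movies):
    """Merge ratings with movie titles and drop rows that cannot be used."""
    titles = {movie["movieId"]: movie["title"] for movie in movies}
    merged = []
    for row in ratings:
        # Clean: skip ratings with a missing value or an unknown movie.
        if not row["rating"] or row["movieId"] not in titles:
            continue
        # Merge: attach the title and convert strings to proper types.
        merged.append({
            "userId": int(row["userId"]),
            "movieId": int(row["movieId"]),
            "title": titles[row["movieId"]],
            "rating": float(row["rating"]),
        })
    return merged


cleaned = merge_and_clean(load_rows(RATINGS_CSV), load_rows(MOVIES_CSV))
for row in cleaned:
    print(row)
```

Of the four sample ratings, one has no rating value and one points at a movie that is not in the movie table, so only two rows survive the cleanup. On a dataset of real size, a dataframe library would likely be a more convenient tool for the same join.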

Learn more about our data cleanup.

Setting a Baseline

Before spending the time and effort to build a complex model, make sure you create baselines to measure your model's performance against. If your model doesn't perform better than a simple baseline such as predicting the mean rating, chances are the model will be a poor investment.
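To make the idea of a baseline concrete, here is a small sketch that scores two simple predictors with root mean squared error (RMSE): one that always predicts the global mean rating, and one that predicts each movie's mean rating. The ratings, the train/test split, and the function names are invented for illustration, and the choice of RMSE as the metric is an assumption rather than something stated on this page.

```python
import math

# Hypothetical (user, movie, rating) triples, made up for illustration.
train = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0), (3, 20, 3.0)]
test = [(2, 20, 2.0), (3, 10, 4.0)]


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))


def global_mean(ratings):
    """Mean of every rating in the training data."""
    return sum(r for _, _, r in ratings) / len(ratings)


def movie_means(ratings):
    """Mean rating per movie, as a dict keyed by movie id."""
    totals = {}
    for _, movie, r in ratings:
        running_sum, count = totals.get(movie, (0.0, 0))
        totals[movie] = (running_sum + r, count + 1)
    return {movie: s / n for movie, (s, n) in totals.items()}


mean = global_mean(train)
per_movie = movie_means(train)
actuals = [r for _, _, r in test]

# Baseline 1: predict the global mean for every test rating.
global_rmse = rmse([mean] * len(test), actuals)

# Baseline 2: predict the movie's mean, falling back to the global mean
# for movies that never appear in the training data.
movie_rmse = rmse([per_movie.get(m, mean) for _, m, _ in test], actuals)

print("global mean RMSE:", global_rmse)
print("movie mean RMSE:", movie_rmse)
```

Any more complex model would then be judged on the same held-out ratings: if its RMSE is not lower than these numbers, the extra complexity is not paying for itself.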

Learn more about our baselines.

The Approach

In this section, we discuss our approach to building a model on the MovieLens dataset. If you've ever wondered what it takes to build a recommendation engine, this part is for you. And just because you visited our page, here is a flying husky (who's a good boy??).

Learn more about our approach.

Our Models

Beyond a doubt, this is the most intensive part of the project. Models can often take minutes to hours (or longer!) to run, and we can't tell how well a model will do until we measure its performance and error. Lucky for you, you get to see the results without the wait!

Learn more about our models.

Authors and Contributors

In 2016, Joshua Coffie (@starbuck10) and Timur Zambalayev (@timzam) built this project for Harvard University's CS-109a, Intro to Data Science course, Fall 2016. All code used for modeling within this assignment is written in Python 2.

If you're interested in getting in touch, feel free to send a note to us! Email Joshua or Timur with any questions you might have.

Support

If you're having trouble accessing this repo, please check out this link for more information.