This project aims to build a machine learning model that predicts whether a movie can be considered a blockbuster, in terms of both audience reception and financial success.
A blockbuster movie is generally defined as a film that achieves both high profitability and high public acclaim. To capture this, we define two measurable targets:
- IMDb Score: Proxy for audience satisfaction and critical reception.
- ROI (Return on Investment): Proxy for commercial success.
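As a minimal sketch of the ROI target, the snippet below uses the common definition ROI = (gross - budget) / budget. That formula, the `add_roi` helper, and the `budget` and `gross` column names are assumptions for illustration; the project may define ROI differently. The rows are toy values, not real movie data.

```python
import pandas as pd


def add_roi(df: pd.DataFrame, budget_col: str = "budget", gross_col: str = "gross") -> pd.DataFrame:
    """Return a copy of df with an 'roi' column equal to (gross - budget) / budget."""
    out = df.copy()
    # Treat missing or non-positive budgets as unknown so we never divide by zero.
    budget = out[budget_col].where(out[budget_col] > 0)
    out["roi"] = (out[gross_col] - budget) / budget
    return out


# Toy rows (made-up numbers) to illustrate the calculation.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100.0, 50.0, 0.0],
    "gross": [300.0, 25.0, 10.0],
})
movies = add_roi(movies)
print(movies[["title", "roi"]])
```

Rows with a zero or missing budget get a missing ROI rather than an infinite one, so they can be dropped or imputed explicitly later.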
- Perform Exploratory Data Analysis (EDA) to understand patterns behind blockbuster movies.
- Build predictive models for:
  - IMDb Score
  - ROI
- Optionally, create a combined model to classify blockbuster likelihood based on both.
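One way the objectives above could fit together is sketched below: two separate regressors, one per target, whose predictions are then combined into a single blockbuster flag. Everything here is an assumption for illustration: the data is synthetic, `RandomForestRegressor` is just one possible model choice, and `SCORE_THRESHOLD` and `ROI_THRESHOLD` are hypothetical cut-offs rather than values from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for three numeric features (hypothetical, not project data).
X = rng.normal(size=(200, 3))
y_score = 6.5 + 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)  # fake IMDb score
y_roi = 1.0 + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)    # fake ROI

# Split features and both targets consistently.
X_train, X_test, s_train, s_test, r_train, r_test = train_test_split(
    X, y_score, y_roi, test_size=0.25, random_state=0
)

# One regressor per target.
score_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, s_train)
roi_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, r_train)

# Hypothetical thresholds: a movie counts as a blockbuster only if both are met.
SCORE_THRESHOLD = 7.0
ROI_THRESHOLD = 2.0

pred_score = score_model.predict(X_test)
pred_roi = roi_model.predict(X_test)
is_blockbuster = (pred_score >= SCORE_THRESHOLD) & (pred_roi >= ROI_THRESHOLD)
print(f"{int(is_blockbuster.sum())} of {len(is_blockbuster)} test movies flagged")
```

Keeping the two regressors separate lets each target be evaluated on its own before any combined rule is applied.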
Our goal is to build a comprehensive dataset of blockbuster movies and find a model that makes the best use of the information it contains. To do so, we combine information from multiple sources. The following datasets align with our project requirements:
1. Movie Data Analysis Dataset
   - Details about 7,668 movies, including:
     - Titles, ratings, genres, release years
     - IMDb scores, votes
     - Directors, writers, main stars
     - Production countries, budgets, gross earnings
     - Production companies, runtimes
   - Source: GitHub Repository
2. Global Movie Franchise Revenue and Budget Data
   - Comprehensive data on movie franchises worldwide between 2000–2020:
     - Lifetime gross, budget, rating
     - Runtime, release date, vote count/average
   - Source: Kaggle Dataset
3. TMDB 5000 Movies Dataset
   - Information on over 5,000 movies:
     - Budget, cast, director
     - Keywords, runtime, genres
     - Production companies, release dates
   - Source: Hugging Face Dataset
4. Complete Movie Metadata Dataset
   - Data on over 722,000 movies, including:
     - ID, title, genres, budget, revenue
   - Suitable for analyzing trends in movie popularity, production companies, budgets, and revenues.
   - Source: Gigasheet Dataset
5. Movie Revenue Analysis Dataset
   - Approx. 1,800 movies released between 1915 and 2020:
     - Domestic and worldwide gross revenues
     - Production budgets, release dates
   - Source: GitHub Repository
These datasets are merged and cleaned into three tables:
- `imdb_score_features`
- `roi_features`
- `master_table`
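The merge step might look roughly like the sketch below, which joins two toy sources on a normalized title plus release year. The join keys, the `make_key` helper, the column names, and the way `imdb_score_features` and `roi_features` are derived from `master_table` are all assumptions for illustration; the real sources will need their own column mapping and deduplication rules.

```python
import pandas as pd


def make_key(df: pd.DataFrame) -> pd.DataFrame:
    """Add a normalized 'title_key' column so titles match across sources."""
    out = df.copy()
    out["title_key"] = (
        out["title"]
        .str.lower()
        .str.replace(r"[^a-z0-9]+", " ", regex=True)  # collapse punctuation and spaces
        .str.strip()
    )
    return out


# Toy sources with made-up rows (not real movie data).
ratings = pd.DataFrame({
    "title": ["Movie A", "Movie: B!"],
    "year": [2001, 2002],
    "imdb_score": [7.1, 6.4],
})
finance = pd.DataFrame({
    "title": ["movie a", "Movie B", "Movie C"],
    "year": [2001, 2002, 2003],
    "budget": [10.0, 20.0, 30.0],
    "gross": [40.0, 15.0, 90.0],
})

# The outer join keeps movies found in only one source; '_merge' records where each row came from.
master_table = make_key(ratings).merge(
    make_key(finance).drop(columns="title"),
    on=["title_key", "year"],
    how="outer",
    indicator=True,
)

# Each modelling table keeps only the rows where its own target can be computed.
imdb_score_features = master_table.dropna(subset=["imdb_score"])
roi_features = master_table.dropna(subset=["budget", "gross"])
print(master_table[["title_key", "year", "_merge"]])
```

The `indicator=True` column makes it easy to audit how many movies matched across sources before any rows are dropped.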
- Python (Pandas, Scikit-learn)
- Feature engineering based on domain knowledge and EDA
- ML models (regression, classification)
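The tools listed above could be combined into a single scikit-learn pipeline, as in the sketch below. The feature columns (`genre`, `budget`, `runtime`), the toy rows, and the choice of `Ridge` as the regressor are assumptions for illustration, not the project's confirmed setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows with made-up values.
train = pd.DataFrame({
    "genre": ["action", "drama", "comedy", "action", "drama", "comedy"],
    "budget": [100.0, 20.0, 35.0, 150.0, 15.0, 40.0],
    "runtime": [130, 110, 95, 142, 105, 100],
    "imdb_score": [7.2, 7.8, 6.1, 6.9, 8.0, 6.4],
})
feature_cols = ["genre", "budget", "runtime"]

# One-hot encode the categorical column and scale the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
    ("num", StandardScaler(), ["budget", "runtime"]),
])

# Chain preprocessing and the regressor so both are fitted together.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(train[feature_cols], train["imdb_score"])

# A genre unseen during training is encoded as all zeros instead of raising an error.
new_movies = pd.DataFrame({"genre": ["horror"], "budget": [30.0], "runtime": [98]})
predicted = model.predict(new_movies)
print(predicted)
```

Wrapping preprocessing and model in one `Pipeline` keeps the same transformations applied at training and prediction time, which matters once cross-validation is introduced.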
- IMDb Score reflects whether people liked the movie, which is crucial for sustained popularity and brand value.
- ROI captures whether the film was a financial hit, which is crucial for studios and investors.
Combining these helps us define and detect "blockbusters" more holistically.
- Incorporate marketing or release strategy data (e.g., release date, streaming vs. theatrical release)
- Refine the model into a binary classifier: "Blockbuster" vs. "Not Blockbuster"
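As a rough sketch of these two directions, the snippet below derives simple release-date features and a binary training label. The column names, the May to August "summer" window, and the 7.0 and 2.0 thresholds are hypothetical choices for illustration only, and the rows are made-up values.

```python
import pandas as pd

# Toy rows with made-up values; the third date is deliberately invalid.
movies = pd.DataFrame({
    "release_date": ["2019-07-05", "2018-02-16", "not a date"],
    "imdb_score": [7.6, 6.2, 8.1],
    "roi": [3.5, 0.4, 2.2],
})

# Unparseable dates become NaT instead of raising an error.
dates = pd.to_datetime(movies["release_date"], errors="coerce")
movies["release_month"] = dates.dt.month
# Hypothetical "summer release" window: May through August.
movies["is_summer_release"] = dates.dt.month.isin([5, 6, 7, 8])

# Binary target: 1 only when both hypothetical thresholds are met.
movies["is_blockbuster"] = (
    (movies["imdb_score"] >= 7.0) & (movies["roi"] >= 2.0)
).astype(int)
print(movies)
```

A label built this way could serve as the target for a classifier, with the release features added alongside the existing ones.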