This repository provides the dataset and artifacts associated with the paper "GHALogs: A Large-Scale Dataset of GitHub Actions Runs", presented at the Mining Software Repositories 2025 (MSR'25) conference.
The GHALogs dataset contains over 116,000 CI/CD workflows executed via GitHub Actions (GHA), collected from more than 25,000 public repositories across 20 programming languages. In total, the dataset includes 513,000 workflow runs and approximately 2.3 million individual steps, offering a rich foundation for research on CI/CD practices, failure analysis, software engineering automation, and more.
See the examples folder for a quick overview of the data format.
If you use this dataset in your research, please cite our paper:
@inproceedings{msr25_ghalogs,
author = {Moriconi, Florent and Durieux, Thomas and Falleri, Jean-Rémi and Francillon, Aurélien and Troncy, Raphael},
title = {GHALogs: Large-Scale Dataset of GitHub Actions Runs},
year = {2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
location = {Ottawa, Canada},
series = {MSR '25}
}
The paper is available here.
The dataset is not hosted in the Git repository because of its large size (~140GB).
It is hosted on Zenodo: https://doi.org/10.5281/zenodo.10154920
repositories.json.gz and runs.json.gz are ~1GB in total, but github_run_logs.zip is about 142GB!
The logs archive is not required to run the Jupyter notebooks and explore run metadata!
However, the logs are required to rerun the log parsing (e.g., to extract the commands used in runs).
ℹ️ An example of the data available in the dataset, in the same format (repository info, run info, run logs), can be found in the examples folder within this repository.
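Because the logs archive is large, it can be useful to look inside it before extracting anything. The following is a minimal sketch using only the Python standard library. The helper names `list_archive_members` and `read_member_text` are hypothetical, and the internal layout of github_run_logs.zip is not described here, so the sketch makes no assumption about member names. Note that opening the archive reads its central directory, which may take some time for an archive of this size.

```python
import zipfile
from itertools import islice
from typing import List, Tuple


def list_archive_members(path: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return (name, uncompressed size in bytes) for the first `limit` entries."""
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.file_size)
            for info in islice(archive.infolist(), limit)
        ]


def read_member_text(path: str, member: str, max_bytes: int = 2000) -> str:
    """Read the beginning of one member without extracting the whole archive."""
    with zipfile.ZipFile(path) as archive:
        with archive.open(member) as handle:
            # Undecodable bytes are replaced rather than raising an error.
            return handle.read(max_bytes).decode("utf-8", errors="replace")


# Example usage (assumes the archive was downloaded to the current directory):
# for name, size in list_archive_members("github_run_logs.zip"):
#     print(name, size)
```

Reading members one at a time this way avoids having to unpack the full archive to disk.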
Jupyter notebooks are stored in the jupyter folder of this repository.
They expect a running MongoDB database.
- MongoDB database
If you don't have a MongoDB database, you can use Docker to quickly start one.
Below is a command to start a container based on the mongo image.
Note: All data will be lost when the container exits!
docker run -it --rm -p 127.0.0.1:27017:27017 --name gha-mongodb mongo:7
See the documentation for more information on deploying MongoDB for production (e.g., persistence).
- Import data into MongoDB.
The data is stored as gzip-compressed JSON Lines files, so that users can also access it without MongoDB.
However, the Jupyter notebooks expect a MongoDB database to run queries.
a. Download repositories.json.gz and runs.json.gz (see the Dataset section above)
b. Run the cells in the notebook 0 - Load data.
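Since the metadata files are gzip-compressed JSON Lines, they can also be inspected without MongoDB. The following is a minimal sketch using only the Python standard library. The helper names `iter_json_lines` and `peek` are hypothetical, and the sketch assumes one JSON document per line encoded as UTF-8. It does not assume any particular field names.

```python
import gzip
import json
from itertools import islice
from typing import Any, Dict, Iterator, List


def iter_json_lines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON document per non-empty line of a gzip file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def peek(path: str, n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n documents without reading the whole file."""
    return list(islice(iter_json_lines(path), n))


# Example usage (assumes runs.json.gz was downloaded to the current directory):
# for doc in peek("runs.json.gz"):
#     print(sorted(doc))  # show which top-level keys each document has
```

Streaming line by line keeps memory usage low, which matters for files of this size.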
Notebook 1 - Dataset metrics allows you to reproduce the results presented in the paper.
Each step of the notebooks is documented inside the notebook.
Enjoy your discovery!
The code used to retrieve GitHub Actions runs is stored in this repository. More info to come!