
Is the code data used in the training data? #112


Open
xuhu0115 opened this issue Apr 8, 2025 · 4 comments


xuhu0115 commented Apr 8, 2025

Hello, may I ask whether code data was used in the training data? You mention using the LiveCodeBench and USACO code datasets, but neither label appears among the source_type values of the 59k dataset. How can this be explained?

Contributor

Muennighoff commented Apr 8, 2025 via email

Author

xuhu0115 commented Apr 9, 2025

Thank you for your reply! I have another question. The 59k dataset includes the GPQA dataset, and I noticed in the paper that the GPQA Diamond subset was used for evaluation. Was the Diamond subset also used during training, or were only the other GPQA subsets used?

@Muennighoff
Contributor

It's not the Diamond subset that is in the training data but the other parts of GPQA (which were additionally decontaminated against Diamond).
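
For readers unfamiliar with the term, "decontaminated against Diamond" means dropping training questions that overlap with the evaluation questions. The thread does not say how this was done, so the following is only a minimal sketch under assumptions: it uses whitespace tokenization and n-gram overlap, and the names `ngrams` and `decontaminate` and the default `n=8` are hypothetical, not taken from the project.

```python
def ngrams(text, n=8):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_questions, eval_questions, n=8):
    """Keep only training questions sharing no n-gram with any eval question.

    Assumption: n-gram overlap is the contamination criterion; the
    project's real pipeline may differ.
    """
    eval_grams = set()
    for question in eval_questions:
        eval_grams |= ngrams(question, n)
    # Questions shorter than n tokens yield no n-grams and are kept.
    return [q for q in train_questions if not (ngrams(q, n) & eval_grams)]


if __name__ == "__main__":
    train = [
        "what is the boiling point of water at sea level",
        "name the largest planet in the solar system",
    ]
    held_out = ["At sea level, what is the boiling point of water in kelvin"]
    print(decontaminate(train, held_out, n=4))
```

A smaller `n` removes more aggressively (more false positives), while a larger `n` only catches near-verbatim copies.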

Author

xuhu0115 commented Apr 9, 2025

Okay, thank you for your answer!
