Is the code data used in the training data? #112

xuhu0115 · 2025-04-08T13:09:36Z

Hello, may I ask if code data was used in the training data? I noticed that you mentioned using the LiveCodeBench and USACO code datasets, but these two labels do not appear in the 59k source_type. How can this be explained?

Muennighoff · 2025-04-08T15:46:57Z

I don't think there is code data; the two may have been filtered out because they did not have labels.
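One way to check this directly is to tally the `source_type` field and see which expected sources are absent. The following is a minimal sketch, not the project's own tooling: the helper names `count_source_types` and `missing_sources` are hypothetical, the rows are toy stand-ins for the real 59k records, and the case-insensitive substring match is an assumption about how labels might be spelled.

```python
from collections import Counter


def count_source_types(records, field="source_type"):
    """Tally records per source label; a missing or empty label is counted under None."""
    return Counter(rec.get(field) for rec in records)


def missing_sources(records, expected, field="source_type"):
    """Return the expected names that match no label (case-insensitive substring match)."""
    seen = [
        str(label).lower()
        for label in count_source_types(records, field)
        if label is not None
    ]
    return [name for name in expected if not any(name.lower() in label for label in seen)]


# Toy rows standing in for the real dataset (hypothetical values).
rows = [
    {"question": "q1", "source_type": "math"},
    {"question": "q2", "source_type": "gpqa"},
    {"question": "q3", "source_type": None},  # unlabeled row
]

print(count_source_types(rows))
print(missing_sources(rows, ["LiveCodeBench", "USACO", "GPQA"]))
```

A non-zero count under `None` would be consistent with rows being dropped for lacking labels, though it does not by itself show that this is what happened.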


xuhu0115 · 2025-04-09T04:04:19Z

Thank you for your reply! I have another question. The 59k dataset includes the GPQA dataset, and I noticed in the paper that the GPQA Diamond subset was used for evaluation. Was the Diamond subset also used during training, or were only GPQA subsets other than Diamond used?

Muennighoff · 2025-04-09T05:31:48Z

It's not the Diamond subset that is in the training data but the other parts of GPQA (which were also decontaminated against Diamond).
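The thread does not say how the decontamination was done. As an illustration only, a common approach is to drop any training question that shares a word n-gram with an evaluation question. In the sketch below, the function names, the whitespace tokenization, and the default `n=8` are all assumptions rather than the project's actual procedure.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_questions, eval_questions, n=8):
    """Keep only training questions that share no word n-gram with any evaluation question."""
    eval_grams = set()
    for question in eval_questions:
        eval_grams |= ngrams(question, n)
    return [q for q in train_questions if ngrams(q, n).isdisjoint(eval_grams)]


# Toy example with a short n so the overlap is visible (hypothetical questions).
eval_set = ["what is the boiling point of water at sea level"]
train_set = [
    "Explain: what is the boiling point of ethanol",
    "name the largest planet in the solar system",
]
kept = decontaminate(train_set, eval_set, n=3)
print(kept)
```

Note that with this scheme a question shorter than `n` words produces no n-grams and is therefore always kept.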

xuhu0115 · 2025-04-09T05:35:36Z

Okay, thank you for your answer!
