Reproducibility in train_test_apart_stratify() #41

stephengmatthews · 2024-08-16T11:34:25Z

train_test_apart_stratify() produces different results for the same input data, even when setting random_state=0.

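To reproduce this, I've adapted the example from the function's docstring so that it contains only strings (i.e., the values for a are now str instead of int). Run it several times, each in a fresh Python process, to see different results. The switch to str matters because Python randomizes string hashes per process by default, so the iteration order of a set of strings can change from one run to the next.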

import pandas
from pandas_streaming.df import train_test_apart_stratify

df = pandas.DataFrame([dict(a="1", b="e"),
                       dict(a="1", b="f"),
                       dict(a="2", b="e"),
                       dict(a="2", b="f")])

train, test = train_test_apart_stratify(
    df, group="a", stratify="b", test_size=0.5, random_state=0)
print(train)
print('-----------')
print(test)

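The cause seems to be that the sets created at connex_split.py#L530 are iterated over at connex_split.py#L543, but a set is an unordered collection, so the iteration order (and therefore the split) can differ between runs even with a fixed random_state. Replacing ids[k] with sorted(ids[k]) on L543 seems to fix this.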
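
To illustrate why sorting helps, here is a minimal sketch of the idea. It is not the code from connex_split.py: the helper pick_test_groups and its parameters are hypothetical names made up for this example. It assumes the split shuffles the group ids with a seeded generator, which only gives a reproducible result if the list being shuffled starts in a fixed order.

```python
import random


def pick_test_groups(group_ids, test_size, seed, sort_first):
    """Choose which group ids go to the test split (hypothetical helper)."""
    # A set has no guaranteed iteration order. For str elements the order
    # can change between interpreter runs because of hash randomization,
    # so list(group_ids) is not a stable starting point for a shuffle.
    candidates = sorted(group_ids) if sort_first else list(group_ids)
    rng = random.Random(seed)
    rng.shuffle(candidates)
    n_test = int(len(candidates) * test_size)
    return candidates[:n_test]


ids_a = {"1", "2", "3", "4"}
ids_b = {"4", "3", "2", "1"}  # same elements, written in a different order

# With sort_first=True the seeded shuffle always starts from the same list,
# so the result depends only on the elements and the seed.
stable_a = pick_test_groups(ids_a, test_size=0.5, seed=0, sort_first=True)
stable_b = pick_test_groups(ids_b, test_size=0.5, seed=0, sort_first=True)
print(stable_a == stable_b)  # always True
```

With sort_first=False the same call can return different groups in different Python processes unless PYTHONHASHSEED is fixed, which matches the behaviour described above.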


stephengmatthews · 2024-09-17T04:33:35Z

Resolved in PR #43. There's now a sorted_indices flag.

xadupre mentioned this issue Sep 7, 2024

Use sorted indices #43

Merged

stephengmatthews closed this as completed Sep 17, 2024
