Reading and Writing of long String Variables from SPSS #119

Ov-ille · 2021-03-31T09:09:51Z

When reading or writing SPSS files that contain long string variables, each affected variable is split into several variables.

Reproducing writing issue:

import pandas as pd
import pyreadstat as sav

a = pd.DataFrame()
a["LongString1"] = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."]
a["LongString2"] = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
]

sav.write_sav(
    a,
    r"C:\Users\XX\test_out1.sav"
)

When this file is opened in SPSS, it contains 5 variables instead of 2 ("LongString2" is followed by "V2_A1", "V2_A2", "V2_A3").
When the file is read back into Python with pyreadstat, only the 2 created variables are shown.

Strangely, when only "LongString2" is created and written, or when its variable name is shorter ("LongStr"), the splitting does not occur.

Reproducing Reading Issue
Unfortunately, I can't offer a file to reproduce the reading issue. The one that causes the problem for me can't be shared due to data protection.
And I didn't succeed in creating a sample file that produces the same problem.

Setup Information:

pyreadstat was installed with pip
a virtual environment created with venv
Python3.8 (plain)
Windows10, 64bit


ofajardo · 2021-04-02T13:11:48Z

Very similar to #118. I have reported it to ReadStat for them to take a look.

Ov-ille · 2021-12-14T16:37:51Z

@ofajardo Is there any news regarding this bug? I just stumbled across this problem again when reading SPSS data with long strings. Some standard code suddenly stopped working, and it took me ages to realise that it was down to this problem again (columns being split without any warning).

ofajardo · 2021-12-14T16:39:44Z

no news, sorry

ofajardo · 2022-02-23T13:53:14Z

the issue can be replicated in pure C: WizardMac/ReadStat#260

Ov-ille · 2022-08-12T09:53:34Z

@ofajardo Since I keep encountering this issue, I spent some time creating data that reproduces it, in case that helps with finding the bug (sadly I don't have the skills to actually help solve it).

There are a lot of variations in how the error shows up when the file is opened in SPSS; I tried to find a few examples.
It seems to depend on the number of columns in the dataset, the number of characters in the strings, and also the format of the variable name.

import pandas as pd
import numpy as np
import pyreadstat as sav

### create dataframe
error_file = pd.DataFrame()
columnNames = ['so3_10_9_1', 'so3_10_10_1', 'so3_10_11_1', 'so3_10_12_1',
       'so3_10_13_1', 'so3_10_14_1', 'so3_10_15_1', 'so3_10_16_1',
       'so3_10_17_1', 'so3_10_18_1', 'so3_10_19_1', 'so3_10_20_1',
       'so3_10_96opn', 'so3_10_97opn', 'so3_10_98opn']
error_file[columnNames] = np.nan
# 504 characters or more to produce error
error_file.loc[0,"so3_10_98opn"] = "a"*505
# 504 characters will produce an error
error_file.loc[0,"so3_10_97opn"] = "a"*504
# 503 characters or less to work
error_file.loc[0,"so3_10_96opn"] = "a"*503
# bug example: variable is split with pattern V{number of variable}A{1/2/3/...}
sav.write_sav(error_file, "error_file.sav")

### keep only problem-column with 504 characters
# bug example: variable is split (different pattern!)
sav.write_sav(error_file[["so3_10_97opn"]], "error2_file.sav")

### keep only problem-column with 505 characters
# bug example: variable is exported CORRECTLY!
sav.write_sav(error_file[["so3_10_98opn"]], "error3_file.sav")

### keep only problem-column and create different variable names
error4_file = error_file[["so3_10_97opn"]].copy()
## variable name without underscores (two characters shorter)
error4_file["so31097opn"] = error4_file["so3_10_97opn"]
## variable name with only one underscore
error4_file["so3_1097opn"] = error4_file["so3_10_97opn"]
## variable name with no underscores but same amount of characters
error4_file["so3x10x97opn"] = error4_file["so3_10_97opn"]
# bug example: variables are split with different naming patterns!
sav.write_sav(error4_file, "error4_file.sav")

ofajardo · 2023-02-22T15:13:09Z

Hi @Ov-ille, I tested the code from your initial report and it seems to be fixed in version 1.2.1, which I just released. Could you please check whether it is fully solved now?
@mtr

ofajardo · 2023-02-23T14:14:39Z

I also tried the other examples and all of them seem good now. Closing this.

Ov-ille · 2023-02-23T16:10:08Z

Hi @ofajardo and thanks for testing! I just installed the newest version (1.2.1) but the problems from this issue haven't changed. Did you open the file in spss or how did you check whether it worked? When reading the same files back into python after writing them with pyreadstat the split columns don't appear. But when opened in spss they are being split.

When creating the file directly in spss and then reading with pyreadstat, the variables were kept the way they should be.

ofajardo · 2023-02-23T16:16:55Z

I see, I was checking by reading them with pyreadstat only. I re-open this issue then. Now I realize the issue was always that pyreadstat was reading it correctly but SPSS was not.

KevinCrossDCL · 2023-07-12T15:20:21Z

I'm also experiencing a similar issue:

variable_format = { 'VariableName': 'A1000' }
....
pyreadstat.write_sav(df, "SPSS.sav", column_labels=variable_labels, variable_format=variable_format, variable_value_labels=variable_value_labels)

In the SPSS file that's created the length is set at 255. If I set it as 255 or less it will work, but anything higher than that and it will default to 255.
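A possible pre-write check, sketched under the assumption that strings wider than 255 bytes are the ones at risk (the 255 figure is taken from the behaviour described above). The helper name `find_long_string_columns` and its dict-of-lists input are hypothetical and not part of pyreadstat:

```python
# Assumption: 255 bytes is the widest string that is written as a single variable.
SHORT_STRING_LIMIT = 255


def find_long_string_columns(columns, limit=SHORT_STRING_LIMIT, encoding="utf-8"):
    """Return {column name: widest encoded length} for columns exceeding `limit`.

    `columns` maps column names to lists of values; non-string values
    (numbers, None, NaN) are ignored.
    """
    at_risk = {}
    for name, values in columns.items():
        widest = max(
            (len(v.encode(encoding)) for v in values if isinstance(v, str)),
            default=0,
        )
        if widest > limit:
            at_risk[name] = widest
    return at_risk


example = {"short": ["abc", None], "long": ["a" * 504], "numeric": [1.0, 2.0]}
print(find_long_string_columns(example))
```

With a pandas DataFrame the same helper could be called as `find_long_string_columns(df.to_dict("list"))` before `write_sav`, so that affected columns are at least known in advance. It only flags columns; it does not prevent the splitting.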

ME-researchgroup · 2023-11-27T14:22:40Z

I am running into the same issue.
I see you have already opened an issue at WizardMac/ReadStat#260 that is still open.

The names of the variables that are being created seem quite unpredictable, which makes writing a hacky quick fix difficult. Hopefully our friends at ReadStat can look into it!

gulchitai · 2024-01-19T00:59:41Z

I'm facing the same problem too. I'm using the haven 2.5.3 library from R.

pepcmarques · 2025-02-03T20:43:41Z

Same issue here. Using pyreadstat==1.2.5.

I found that the file readstat_sav_write.c under /src/spss/ defines MAX_STRING_SIZE as 255 (see below):

#define MAX_STRING_SIZE             255

I would try changing it to a higher value, but I have no idea how to compile it. The documentation says that it is straightforward, but I have no clue.

I know it might not be the only place to change, but it is a start.

I hope we have this issue fixed soon.

gulchitai · 2025-02-05T03:25:02Z

I don't think simply changing the number 255 to a larger value would work. Since the library is written in C, there's a fundamental limitation with the char type which has a maximum of 255 characters.

pepcmarques · 2025-02-05T04:45:03Z

Thank you for your answer @gulchitai

However, I couldn't understand what you wrote; probably because of my lack of knowledge in C.

A char type is 1 byte limited, isn't it? I believe it is possible to create a variable like char str[1024]; in C. Am I wrong?

I tried to understand readstat_sav_write.c, and it segments the string if user_width is greater than MAX_STRING_SIZE.

I still have a hunch that it would be a good start.
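The point about `char` versus `char str[1024];` can be illustrated from Python with the standard-library `ctypes` module. This is only a sketch of the C type sizes and says nothing about how ReadStat itself stores strings:

```python
import ctypes

# A single C char occupies one byte ...
print(ctypes.sizeof(ctypes.c_char))

# ... but an array of chars, the equivalent of `char str[1024];` in C,
# can hold a string far longer than 255 bytes.
buf = ctypes.create_string_buffer(1024)
buf.value = b"a" * 1000
print(ctypes.sizeof(buf), len(buf.value))
```

So the 255 limit would not come from the `char` type as such; a fixed-size buffer or a format constraint is the more likely source.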

stspec · 2025-02-07T08:47:44Z

@pepcmarques

I think this subject has been referenced in related threads here or in https://github.com/WizardMac/ReadStat, but you might find the PSPP documentation on long strings relevant: https://www.gnu.org/software/pspp/pspp-dev/pspp-dev.html#Very-Long-String-Record.

Apparently, long strings are broken up into 255 length segments and stored as separate variables internally. I haven't looked at the code you've explored, but I believe that's what this length is referring to.
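The segmentation can be sketched as simple arithmetic. The numbers below are an assumption based on my reading of the PSPP developer documentation linked above (252 bytes of content per segment, each segment except the last declared with width 255) and have not been checked against ReadStat's source. The function name `segment_widths` is made up for illustration:

```python
def segment_widths(width):
    """Declared widths of the segment variables for a string of width `width`.

    Assumption: widths up to 255 are stored as one variable. Wider strings
    are cut into n = ceil(width / 252) segments; the first n - 1 segments
    are declared with width 255 and the last one gets the remainder.
    """
    if width <= 255:
        return [width]
    n = (width + 251) // 252
    return [255] * (n - 1) + [width - (n - 1) * 252]


for w in (255, 503, 504, 505, 1000):
    print(w, segment_widths(w))
```

Under this scheme 504 is exactly 2 × 252, so the 503/504/505 boundary in the reproduction script earlier in this thread sits right at a segment boundary. Whether that is actually connected to the bug is only a guess.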

ofajardo · 2025-05-26T12:16:16Z

Unfortunately, the latest updates to the ReadStat source as of today do not solve the issue (the file is still read wrongly in SPSS).

ofajardo mentioned this issue Apr 2, 2021

long string variable split when reading in SPSS WizardMac/ReadStat#236

Open

ofajardo mentioned this issue May 5, 2021

Long string handling #118

Closed

ofajardo added the labels "bug" (Something isn't working) and "requires changes in Readstat" (waiting for changes in the C library Readstat) May 5, 2021

Ov-ille mentioned this issue Feb 25, 2022

Variable name not imported correctly #165

Open

ofajardo closed this as completed Feb 23, 2023

ofajardo reopened this Feb 23, 2023
