Parquet compression #20

Open
chapmanjacobd opened this issue Oct 15, 2022 · 3 comments
Assignees
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

@chapmanjacobd

It would be nice to have options for compression. Looks like there is no compression by default?

parq RS_2008-04.parquet 

 # Metadata 
 <pyarrow._parquet.FileMetaData object at 0x7f5d6f635490>
  created_by: parquet-cpp-arrow version 7.0.0
  num_columns: 124
  num_rows: 167472
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 53334
ivbeg self-assigned this Oct 16, 2022
ivbeg added the bug and enhancement labels Oct 16, 2022
ivbeg moved this to 🏗 In progress in undatum backlog Oct 16, 2022
@ivbeg
Contributor

ivbeg commented Oct 16, 2022

Actually, it already uses snappy compression by default: the conversion goes through a pandas DataFrame, and pandas writes Parquet with snappy compression unless told otherwise. I don't know why the parq tool doesn't show it. I will add a compression option too.

@ivbeg
Contributor

ivbeg commented Oct 17, 2022

@chapmanjacobd I've added a compression option to the latest code in the main branch.
Example usage:

  • undatum convert -c brotli data.csv data.parquet
  • undatum convert -c snappy data.csv data.parquet

It supports the following compression codecs: brotli, snappy, lzo, gzip, None

@capricornusx

Please add zstd as well; pandas supports it.

Projects
Status: 🏗 In progress

3 participants