pure-numpy interface to parquet #931

Draft
wants to merge 27 commits into
base: main

martindurant
Member

Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to, and compatible with, awkward/arrow buffers, but will not require those packages.

@yohplala

Hi @martindurant
I have seen your comment in #935:

> Output will be an iterator over row-groups, and dictionaries giving the positions in the schema or light structured wrapper, something like:
>
> {0: {
>   'foo.with.strings-data': array([0, 1, -1], dtype=int8),
>   'foo.with.strings-cats': ["hey", "there"],
>   'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
>   'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
>   'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
>   'foo.with.lists.list.element-cats': [0]}
> }

'foo.with.strings-data' appears to be a column name, right?
But what is the 0 key? The ID of the row group? (The arrays do not all have the same length, so I am not sure what it is.)

I also am curious to know what will be the input for the general write() function?
A similar dictionary providing per column the corresponding data?

Thank you for your feedback!

@martindurant
Member Author

> 'foo.with.strings-data' appears to be a column name

These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them.

  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],

becomes ["hey", "there", None] as a list
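
As a rough illustration of that mapping, here is a minimal sketch in plain numpy. The helper name `decode_categorical` is hypothetical and not part of fastparquet; it only assumes what the example above shows, namely that the `-data` array holds integer codes into the `-cats` list and that a code of -1 stands for a missing value.

```python
import numpy as np

def decode_categorical(codes, cats):
    """Look up each integer code in cats; negative codes are treated as null."""
    # Assumption taken from the example in this thread: -1 means "no value".
    return [cats[i] if i >= 0 else None for i in codes.tolist()]

codes = np.array([0, 1, -1], dtype=np.int8)   # 'foo.with.strings-data'
cats = ["hey", "there"]                       # 'foo.with.strings-cats'
values = decode_categorical(codes, cats)
print(values)
```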

  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}

becomes [[0], [0], [0]] as a list.
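
The same idea extends to the list column: the offsets mark where each sub-list starts and stops in the flat element array. The sketch below is only an illustration under that assumption; `decode_list_column` is a hypothetical helper, not the wrapper class from the linked `wrappers.py`.

```python
import numpy as np

def decode_list_column(offsets, codes, cats):
    """Rebuild nested Python lists from offsets plus categorical elements."""
    # Flat element values: each code indexes into the category list.
    elements = [cats[i] for i in codes.tolist()]
    bounds = offsets.tolist()
    # Sub-list k spans elements[bounds[k]:bounds[k + 1]].
    return [elements[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

offsets = np.array([0, 1, 2, 3])              # 'foo.with.lists.list-offsets'
codes = np.array([0, 0, 0], dtype=np.uint8)   # 'foo.with.lists.list.element-data'
cats = [0]                                    # 'foo.with.lists.list.element-cats'
nested = decode_list_column(offsets, codes, cats)
print(nested)
```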

Yes, 0 is the row-group index. It could perhaps also include the filename. Ways to combine arrays from multiple row-groups can be provided, but I think iterating over them will be the more common pattern.
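
To give a sense of what combining row-groups might involve, here is a sketch and not the branch's actual API. The column keys `x-data` and `x-offsets` and the helper `concat_offsets` are made up for illustration, and it assumes every offsets array starts at 0. Flat data arrays can simply be concatenated, while offsets have to be shifted by the number of elements already seen. Categorical codes are left out because each row-group may carry its own category list.

```python
import numpy as np

def concat_offsets(per_group_offsets):
    """Join offsets arrays from consecutive row-groups into one offsets array."""
    first = np.asarray(per_group_offsets[0])
    parts = [first]
    shift = first[-1]                 # elements consumed so far
    for off in per_group_offsets[1:]:
        off = np.asarray(off)
        parts.append(off[1:] + shift) # drop the leading 0, then shift
        shift = shift + off[-1]
    return np.concatenate(parts)

# Hypothetical output of a reader, keyed by row-group index.
row_groups = {
    0: {"x-offsets": np.array([0, 1, 2, 3]), "x-data": np.array([10, 11, 12])},
    1: {"x-offsets": np.array([0, 2, 5]), "x-data": np.array([20, 21, 22, 23, 24])},
}

combined_data = np.concatenate([rg["x-data"] for rg in row_groups.values()])
combined_offsets = concat_offsets([rg["x-offsets"] for rg in row_groups.values()])
print(combined_offsets.tolist())
```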

@yohplala

yohplala commented Sep 11, 2024

Thanks a lot for your quick feedback!
Please, can you also share your thoughts about the 2nd question?

> I also am curious to know what will be the input for the general write() function?
> A similar dictionary providing per column the corresponding data?

@martindurant
Member Author

> I also am curious to know what will be the input for the general write() function?
> A similar dictionary providing per column the corresponding data?

Yes, I think so. So in the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway: dict(df) => {col: values}. For structured data, we can provide ways to ingest lists/dicts, but the best path would be for the caller to provide offsets and such directly, or use the same wrapper classes I referenced above. Reading will be ready well before writing, though!
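
For the simple tabular case, the input could look roughly like the dictionary below. This is only a sketch of the shape being discussed: writing is not implemented on this branch, and `check_tabular` is a hypothetical helper showing the one invariant a flat `{col: values}` mapping needs, which is that all columns have the same length.

```python
import numpy as np

def check_tabular(columns):
    """Return the row count of a {name: array} mapping, or raise if lengths differ."""
    lengths = {name: len(np.asarray(values)) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # An empty mapping is treated as zero rows.
    return next(iter(lengths.values()), 0)

# Flat, non-nested data: one numpy array per column name.
columns = {
    "a": np.array([1, 2, 3], dtype=np.uint8),
    "b": np.array([0.5, 1.5, 2.5]),
}
n_rows = check_tabular(columns)
print(n_rows)
```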


2 participants