pure-numpy interface to parquet #931

martindurant · 2024-08-22T13:58:25Z

Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to and compatible with awkward/arrow buffers, but will not require those packages.

yohplala · 2024-09-11T12:51:32Z

Hi @martindurant
I have seen your comment in #935:

Output will be an iterator over row-groups, and dictionaries giving the positions in the schema or light structured wrapper, something like:

{0: {
  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],
  'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}
}

'foo.with.strings-data' appears to be a column name, right?
But what is the 0 key? The ID of the row group? (The arrays do not all have the same length, so I am not sure what it is.)

I also am curious to know what will be the input for the general write() function?
A similar dictionary providing per column the corresponding data?

Thank you for your feedback!

martindurant · 2024-09-11T13:12:03Z

'foo.with.strings-data' appears to be a column name

These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them.

  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],

becomes ["hey", "there", None] as a list
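
A minimal sketch of that decoding step, using only the two arrays shown above. The helper name `decode_categorical` is hypothetical and not part of fastparquet; the treatment of -1 as a null marker is read off the example output.

```python
import numpy as np


def decode_categorical(codes, cats):
    """Map integer codes to category values; a negative code marks a null."""
    return [cats[i] if i >= 0 else None for i in codes.tolist()]


codes = np.array([0, 1, -1], dtype=np.int8)   # 'foo.with.strings-data'
cats = ["hey", "there"]                       # 'foo.with.strings-cats'

decoded = decode_categorical(codes, cats)
print(decoded)  # ['hey', 'there', None]
```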

  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}

becomes [[0], [0], [0]] as a list.
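
The same example can be unpacked by hand. This is only a sketch under the assumption that consecutive offsets delimit each inner list and that the element values are themselves category codes; `decode_lists` is a hypothetical helper, not one of the wrappers mentioned above.

```python
import numpy as np


def decode_lists(offsets, values):
    """Slice a flat sequence of values into lists using consecutive offsets."""
    bounds = offsets.tolist()
    return [list(values[start:stop]) for start, stop in zip(bounds[:-1], bounds[1:])]


offsets = np.array([0, 1, 2, 3])                    # 'foo.with.lists.list-offsets'
element_data = np.array([0, 0, 0], dtype=np.uint8)  # 'foo.with.lists.list.element-data'
element_cats = [0]                                  # 'foo.with.lists.list.element-cats'

# First resolve the category codes, then cut the flat values at the offsets.
elements = [element_cats[i] for i in element_data.tolist()]
nested = decode_lists(offsets, elements)
print(nested)  # [[0], [0], [0]]
```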

Yes, 0 is the row-group index. It could also include the filename maybe. Ways to combine arrays from multiple row-groups can be provided, but I am thinking that iterating over them will be more common.
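
To illustrate what combining row-groups involves for a list column, here is a rough sketch. The column names and the `combine_offsets` helper are made up for the example, and it assumes each row-group's offsets start at 0, as in the output above. Flat data arrays can simply be concatenated, but offsets have to be shifted by the number of elements already seen. Categorical columns are left out, since their categories may differ between row-groups.

```python
import numpy as np


def combine_offsets(offset_arrays):
    """Join per-row-group offsets into one array, shifting by the running element count."""
    parts = [np.asarray(offset_arrays[0])]
    shift = int(parts[0][-1])
    for off in offset_arrays[1:]:
        off = np.asarray(off)
        parts.append(off[1:] + shift)  # drop the leading 0, move into global positions
        shift += int(off[-1])
    return np.concatenate(parts)


# Hypothetical output for two row-groups, keyed by row-group index.
row_groups = {
    0: {"x.list-offsets": np.array([0, 1, 3]),
        "x.list.element-data": np.array([10, 20, 30])},
    1: {"x.list-offsets": np.array([0, 2]),
        "x.list.element-data": np.array([40, 50])},
}

offsets = combine_offsets([rg["x.list-offsets"] for rg in row_groups.values()])
data = np.concatenate([rg["x.list.element-data"] for rg in row_groups.values()])
print(offsets.tolist(), data.tolist())  # [0, 1, 3, 5] [10, 20, 30, 40, 50]
```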

yohplala · 2024-09-11T14:08:41Z

Thanks a lot for your quick feedback!
Please, can you also share your thoughts about the 2nd question?

I also am curious to know what will be the input for the general write() function?
A similar dictionary providing per column the corresponding data?

martindurant · 2024-09-11T14:27:34Z

I also am curious to know what will be the input for the general write() function?
A similar dictionary providing per column the corresponding data?

Yes, I think so. So in the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway: dict(df) => {col: values}. For structured data, we can provide ways to ingest lists/dicts, but the best path would be for the caller to provide offsets and such directly, or use the same wrapper classes I referenced above. Reading will be ready well before writing, though!
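
As a sketch of what "providing offsets directly" could mean on the caller's side: the write API does not exist yet, so the dictionary layout and key names below are assumptions modelled on the read output earlier in this thread, and `lists_to_buffers` is a hypothetical helper.

```python
import numpy as np


def lists_to_buffers(rows):
    """Flatten a list of lists into an offsets array and a flat data array."""
    lengths = [len(r) for r in rows]
    offsets = np.zeros(len(rows) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(lengths)
    flat = np.array([v for r in rows for v in r])
    return offsets, flat


offsets, flat = lists_to_buffers([[1], [2, 3], []])

# Simple (non-nested) columns are plain {name: array} entries, much like dict(df).
columns = {
    "ints": np.array([1, 2, 3], dtype=np.uint8),
    "lists.list-offsets": offsets,
    "lists.list.element-data": flat,
}
print(offsets.tolist(), flat.tolist())  # [0, 1, 3, 3] [1, 2, 3]
```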

martindurant added 19 commits March 1, 2024 14:32

Scrap lots of pandas stuff

64f3c3b

Kick out pandas, start numpy

440bfe5

partial

6d0e8a9

prototype

64be035

general algo and specialised variants

f046b2b

various opts

393b8e4

micro

2261d63

opts

59dd36e

choose your parallelism

4aa8934

stop

b60eddd

stop

6cde130

TO simpler

9359f04

stop

a140540

Merge branch 'main' into faster

d6caf89

See what happens if we don't track thrift i32

0d8c2ca

one more

9ac836e

small wins

40ea4c0

Merge branch 'ignore_i32' into faster

29d4ac8

parallelism

b6834e1

martindurant mentioned this pull request Aug 22, 2024

Support upcoming default pandas string dtype (pandas >= 3) #930

Open

martindurant added 7 commits August 22, 2024 10:32

Merge branch 'main' into faster

c5b39ec

remove pandas CI

18d01b2

light work on wrappers

07776f7

steps

944837a

stop

0345cb3

alt

2a6d6a5

improve schema

71683ca

latest

a9d3f30

martindurant mentioned this pull request Sep 26, 2024

Feature requests kylebarron/arro3#195

Open
