Initial Implementation #2

dougbrn · 2024-04-19T01:13:22Z

Resolves #3. Resolves #8

Change Description

My PR includes a link to the issue that I am addressing

This PR lays the foundation for the nested-dask package (currently named dask-nested, but this will change). It implements a Dask API for the v0.1 Nested-Pandas high-level API and provides a limited "nest" accessor object. Most of the functionality consists of lean map_partitions wrappers around nested-pandas. This is by design: we intend to focus development on the nested-pandas side where applicable and use Dask mainly to handle partitioning. There are some exceptions, such as to_parquet, and there will likely be more in the future.
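For readers unfamiliar with the pattern, here is a minimal sketch of partition-wise delegation. It deliberately avoids Dask and nested-pandas: a plain list of pandas DataFrames stands in for the partitions, `map_partitions` here is a toy helper rather than Dask's method, and `count_per_id` is a hypothetical in-memory operation standing in for a nested-pandas call.

```python
import pandas as pd


def count_per_id(partition: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical in-memory operation (stand-in for a nested-pandas call)."""
    return partition.groupby(level=0).size().to_frame("n_rows")


def map_partitions(partitions, func):
    """Toy stand-in: apply func to each partition independently."""
    return [func(part) for part in partitions]


# An index-sorted table split so that no id spans two partitions.
flat = pd.DataFrame(
    {"flux": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.Index([0, 0, 1, 2, 2], name="id"),
)
partitions = [flat.loc[0:1], flat.loc[2:2]]

# The "lean wrapper": no new logic, only per-partition delegation.
result = pd.concat(map_partitions(partitions, count_per_id))
print(result)
```

Because each id lives in exactly one partition, the partition-wise result matches what the same function returns on the whole table. Operations that need to see data across partitions would not fit this pattern as-is.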

This PR also establishes a basic unit test and benchmarking suite. Documentation is the notable omission: this PR is already far too large, so documentation will come in a later PR.

Solution Description

Code Quality

I have read the Contribution Guide
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

My fix includes a new test that breaks as a result of the bug (if possible)
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

I have added or updated the docstrings associated with my feature using the NumPy docstring format
I have updated the tutorial to highlight my new feature (if appropriate)
I have added unit/End-to-End (E2E) test cases to cover my new feature
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Any updated docstrings use the NumPy docstring format

Build/CI Change Checklist

If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

Any new or updated docstrings use the NumPy docstring format.
I have updated the tutorial to highlight my new feature (if appropriate)
I have added unit/End-to-End (E2E) test cases to cover any changes
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

github-actions · 2024-04-19T01:15:08Z

Before [`506c831`]	After [`5b3f875`]	Ratio	Benchmark (Parameter)
failed	151M	n/a	benchmarks.NestedFrameAddNested.peakmem_run
failed	235±2ms	n/a	benchmarks.NestedFrameAddNested.time_run
failed	152M	n/a	benchmarks.NestedFrameQuery.peakmem_run
failed	489±2ms	n/a	benchmarks.NestedFrameQuery.time_run
failed	150M	n/a	benchmarks.NestedFrameReduce.peakmem_run
failed	377±1ms	n/a	benchmarks.NestedFrameReduce.time_run

codecov · 2024-05-17T22:44:26Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

wilsonbb

This looks good overall!

Some nits about documentation/comments that were copied over from Tape, and a few small questions.

src/dask_nested/backends.py

src/dask_nested/core.py

wilsonbb · 2024-05-20T20:13:26Z

src/dask_nested/core.py

+        -------
+        `dask_nested.NestedFrame`
+        """
+        nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)


What happens if we try adding a flat dask dataframe? How would pack_flat handle it?

It seems to fail if given a Dask DataFrame instead of a nested-dask NestedFrame: `ValueError: max() arg is an empty sequence`. I can open a ticket to support this.

Yeah, just a ticket should be fine for now.

But this seems like a blocker for out-of-memory datasets at the moment, which is unfortunate.

I think out-of-memory datasets are still feasible. It seems to work fine with a flat Nested-Dask NestedFrame object, so the main hiccup is just that we will need to convert any input Dask DataFrame to a NestedFrame before calling add_nested.

We should think about a shortcut for index-sorted datasets; it should work for out-of-memory datasets.
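For context on the error quoted earlier in this thread: that message is what Python's built-in `max` raises when it receives an empty iterable (the exact wording differs between Python versions). The thread does not say where in the pack_flat path this is triggered, so the sketch below only reproduces the generic failure mode and the `default=` escape hatch. It is not a proposed fix.

```python
# Built-in max() raises ValueError on an empty iterable.
try:
    max([])
    raised = False
except ValueError:
    raised = True

# Supplying default= avoids the exception for empty input.
safe = max([], default=None)

print(raised, safe)
```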

src/dask_nested/core.py

wilsonbb · 2024-05-20T20:22:19Z

tests/dask_nested/conftest.py

+    }
+    layer_nf = npd.NestedFrame(data=layer_data).set_index("index").sort_index()
+
+    base_dn = dn.NestedFrame.from_nested_pandas(base_nf, npartitions=5)


Nice that we're using a diversity of npartitions here!

hombit

Looks great, thank you!!!

src/dask_nested/accessor.py

src/dask_nested/backends.py

src/dask_nested/core.py

hombit · 2024-05-21T13:39:46Z

src/dask_nested/core.py

+        `dask_nested.NestedFrame`
+        """
+        nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
+        return self.join(nested, how="outer")


Why is it outer here? Should we make it configurable?

outer is here mainly so that no data from either table is rejected. However, I think making this configurable is a good idea. Is there a better or more intuitive default value for it?

Made this a kwarg, but still defaulting to outer. We could default to left to follow Dask, but I'm not sure if it's the most sensible default.
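To make the trade-off between the two defaults concrete, here is a small sketch using plain pandas `DataFrame.join` on toy tables. The column names and values are made up for illustration, and the frames are ordinary pandas objects, not NestedFrames.

```python
import pandas as pd

# Base table has ids 0 and 1; the table being joined has ids 1 and 2.
base = pd.DataFrame({"ra": [10.0, 20.0]}, index=pd.Index([0, 1], name="id"))
packed = pd.DataFrame({"nested": ["a", "b"]}, index=pd.Index([1, 2], name="id"))

# how="outer" keeps every id from either table, filling gaps with NaN.
outer = base.join(packed, how="outer")

# how="left" keeps only ids present in the base table, so id 2 is dropped.
left = base.join(packed, how="left")

print(outer)
print(left)
```

With outer, rows that exist only in the joined table survive but have missing base columns. With left, the base table's index is preserved and unmatched rows from the joined table are silently discarded.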

hombit · 2024-05-21T13:41:05Z

src/dask_nested/core.py

+        -------
+        `dask_nested.NestedFrame`
+        """
+        nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)


We should think about a shortcut for index-sorted datasets; it should work for out-of-memory datasets.

WIP: initial implementation start

a51086d

dougbrn added 19 commits April 24, 2024 12:25

add fixed array_nonempty

5d40646

adding functions

68ff5f7

further implementation

7f82621

more work, starting tests

18fc5d2

fill out unit test suite

a2a27a8

typing and pre-commit fixes

fe39cd4

add accessor to_lists and to_flat + tests

f43551a

add a read_parquet test

4b98df8

typing and pre-commit fixes

9cdba35

ruff fixes

775bbe0

add benchmarks and dataset generation

cc6d300

from_nestedpandas -> from_nested_pandas

1b02e98

add to_parquet

7a12a96

add to_parquet

3068845

add docstring note on by_layer=False issues

a68af56

add dependencies

583eadd

add dask_expr

5993c7e

add annotations for python 3.9

687d844

add annotations for python 3.9

5ba9904

dougbrn changed the title ~~WIP: initial implementation~~ Initial Implementation May 17, 2024

remove pdb

28020c9

dougbrn requested review from wilsonbb and hombit May 17, 2024 22:49

wilsonbb approved these changes May 20, 2024

address review comments

8192366

hombit approved these changes May 21, 2024

address review comments

f6a21e3

add meta to reduce benchmark

414d15d

dougbrn merged commit 954ab73 into main May 21, 2024
9 checks passed

dougbrn mentioned this pull request May 23, 2024

Accessor Implementation #6

Open

dougbrn deleted the initial_implementation branch May 23, 2024 20:44
