Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Implementation #2

Merged
merged 24 commits into from
May 21, 2024
Merged

Initial Implementation #2

merged 24 commits into from
May 21, 2024

Conversation

dougbrn
Copy link
Collaborator

@dougbrn dougbrn commented Apr 19, 2024

Resolves #3. Resolves #8

Change Description

  • My PR includes a link to the issue that I am addressing

This PR lays down the foundation for the nested-dask package (currently dask-nested but we will change this). It implements a Dask API for the v0.1 Nested-Pandas high-level API, and provides a limited "nest" accessor object. The vast majority of functionality is just lean map_partitions wrappings of nested-pandas, this is by design as we intend to mainly focus development on the nested-pandas side where applicable and mainly use Dask just to handle partitioning. There are (and will be in the future likely) some exceptions to this, such as to_parquet.

This PR also establishes a basic unit test and benchmarking suite. Documentation is the notable exception, but this PR is already way too large so it will come in a later PR.

Solution Description

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Build/CI Change Checklist

  • If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
  • If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

  • Any new or updated docstrings use the NumPy docstring format.
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover any changes
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Copy link

github-actions bot commented Apr 19, 2024

Before [506c831] After [5b3f875] Ratio Benchmark (Parameter)
failed 151M n/a benchmarks.NestedFrameAddNested.peakmem_run
failed 235±2ms n/a benchmarks.NestedFrameAddNested.time_run
failed 152M n/a benchmarks.NestedFrameQuery.peakmem_run
failed 489±2ms n/a benchmarks.NestedFrameQuery.time_run
failed 150M n/a benchmarks.NestedFrameReduce.peakmem_run
failed 377±1ms n/a benchmarks.NestedFrameReduce.time_run

Click here to view all benchmarks.

@dougbrn dougbrn changed the title WIP: initial implementation Initial Implementation May 17, 2024
Copy link

codecov bot commented May 17, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@dougbrn dougbrn requested review from wilsonbb and hombit May 17, 2024 22:49
Copy link
Collaborator

@wilsonbb wilsonbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall!

Some nits about documentation/comments that were copied over from Tape, and a few small questions.

src/dask_nested/backends.py Show resolved Hide resolved
src/dask_nested/backends.py Outdated Show resolved Hide resolved
src/dask_nested/core.py Outdated Show resolved Hide resolved
src/dask_nested/core.py Show resolved Hide resolved
src/dask_nested/core.py Outdated Show resolved Hide resolved
src/dask_nested/core.py Show resolved Hide resolved
-------
`dask_nested.NestedFrame`
"""
nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we try adding a flat dask dataframe? How would pack_flat handle it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to fail if given a dask dataframe instead of a nested-dask nestedframe: *** ValueError: max() arg is an empty sequence. I can open a ticket to support this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah just a ticket should be fine for now.

But this seems like a blocker for out-of-memory datasets at the moment which is unfortunate

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think out-of-memory datasets are still feasible. It seems to work fine with a flat Nested-Dask NestedFrame object so the main hiccup is just that we will need to convert any input dask dataframe to a NestedFrame before calling add_nested

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should think about a short-cut for index-sorted datasets, it should work for out-of-memory datasets

src/dask_nested/core.py Show resolved Hide resolved
src/dask_nested/core.py Outdated Show resolved Hide resolved
}
layer_nf = npd.NestedFrame(data=layer_data).set_index("index").sort_index()

base_dn = dn.NestedFrame.from_nested_pandas(base_nf, npartitions=5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that we're using a diversity of npartitions here!

Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!!!

src/dask_nested/accessor.py Outdated Show resolved Hide resolved
src/dask_nested/backends.py Outdated Show resolved Hide resolved
src/dask_nested/core.py Outdated Show resolved Hide resolved
`dask_nested.NestedFrame`
"""
nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
return self.join(nested, how="outer")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it outer here? Should we make it configurable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outer is here mainly to just not reject any data from either table, however I think making this configurable is a good idea. Is there a better/more intuitive default value for it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this a kwarg, but still defaulting to outer. We could default to left to follow Dask, but I'm not sure if it's the most sensible default

-------
`dask_nested.NestedFrame`
"""
nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should think about a short-cut for index-sorted datasets, it should work for out-of-memory datasets

@dougbrn dougbrn merged commit 954ab73 into main May 21, 2024
9 checks passed
@dougbrn dougbrn mentioned this pull request May 23, 2024
@dougbrn dougbrn deleted the initial_implementation branch May 23, 2024 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a few ASV benchmarks Implement Nested-pandas MVP functionality
3 participants