feat: named axis for ak.Array #3238

Draft
wants to merge 24 commits into
base: main

pfackeldey
Collaborator

@pfackeldey pfackeldey commented Sep 12, 2024

Proposal for named axis

This PR addresses #2596.

References for other named axis implementations:

Motivation

As argued at PyHEP.dev 2023 and by the Harvard NLP group in their "Tensor Considered Harmful" write-up, named axes can be a powerful tool for making code more readable and less error-prone.

Design

ak.Array with named axis

Named axes are implemented through a mapping from named axes to positional axes.
A named axis can be any hashable (in most cases a string) except an integer, because integers are reserved for positional axes.

import typing

AxisName: typing.TypeAlias = typing.Hashable

By default an ak.Array uses positional axes, but named axes can be added to the array in the following ways:

import awkward as ak

# tuple:
#   positional axis: (0, 1)
#   named axis: ("events", "jets")
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# dict:
#   positional axis: (0, 1)
#   named axis: ("events", "jets")
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"events": 0, "jets": 1})

# the dict interface also allows naming a single axis, including via a negative positional axis
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"jets": -1})

# attach axis naming to an existing array
array = ak.Array([[1, 2], [3], [], [4, 5, 6]])
array = ak.with_named_axis(array, ("events", "jets"))
# or
array = ak.with_named_axis(array, {"events": 0, "jets": 1})

The named_axis argument of the ak.Array constructor is either a tuple of AxisName or a dict mapping AxisName to integers.
It is stored in the .attrs attribute of the array under the reserved key "__named_axis__" with type dict[AxisName, int].
Both kinds of axes can be accessed through the named_axis and positional_axis properties (always represented as tuples):

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))
array.named_axis
>>> ("events", "jets")
array.positional_axis
>>> (0, 1)

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"jets": -1})
array.named_axis
>>> (None, "jets")
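
The mapping between the two accepted forms (tuple or dict) and the tuple representation can be sketched in plain Python. This is only an illustration of the behaviour described above, not the code in this PR: normalize_named_axis and named_axis_tuple are hypothetical helper names, and rejecting duplicate names is an assumption.

```python
from typing import Hashable

AxisName = Hashable


def normalize_named_axis(spec):
    """Turn a tuple or dict spec into a dict[AxisName, int] (hypothetical helper)."""
    if isinstance(spec, tuple):
        names = [name for name in spec if name is not None]
        mapping = {name: pos for pos, name in enumerate(spec) if name is not None}
        if len(mapping) != len(names):
            # assumption: the same name may not label two different axes
            raise ValueError("duplicate axis names are not allowed")
    else:
        mapping = dict(spec)
    for name in mapping:
        # integers are reserved for positional axes
        if isinstance(name, int):
            raise TypeError(f"axis name must not be an integer, got {name!r}")
    return mapping


def named_axis_tuple(mapping, ndim):
    """Tuple view of the mapping; unnamed axes are represented by None."""
    out = [None] * ndim
    for name, pos in mapping.items():
        if not -ndim <= pos < ndim:
            raise ValueError(f"axis {pos} is out of range for {ndim} dimensions")
        out[pos % ndim] = name  # the modulo resolves negative positions
    return tuple(out)


print(normalize_named_axis(("events", "jets")))
print(named_axis_tuple(normalize_named_axis({"jets": -1}), ndim=2))
```

Keeping the dict as the stored form and deriving the tuple on demand means a negative position such as -1 only has to be resolved once the number of dimensions is known.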

Named axis in high-level functions

Named axes can be used in all high-level functions, e.g. ak.sum and ak.max:

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# sum over the "jets" axis
sum_jets = ak.sum(array, axis="jets")
>>> ak.Array([3, 3, 0, 15])
sum_jets.named_axis
>>> ("events",)

# the `keepdims=True` argument keeps the named axis
sum_jets = ak.sum(array, axis="jets", keepdims=True)
>>> ak.Array([[3], [3], [0], [15]])
sum_jets.named_axis
>>> ("events", "jets")
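
How a reduction could translate axis="jets" into a position and then propagate the names can be sketched on the tuple representation alone. The helper names resolve_axis and propagate_reduction are hypothetical and not taken from this PR; axis=None (reduce over everything) is deliberately left out of the sketch.

```python
def resolve_axis(axis, named_axis):
    """Map a named or positional axis to a non-negative position (hypothetical helper)."""
    ndim = len(named_axis)
    if isinstance(axis, int):
        if not -ndim <= axis < ndim:
            raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
        return axis % ndim
    # None marks an unnamed axis in the tuple, so it is never a valid name
    if axis is None or axis not in named_axis:
        raise ValueError(f"unknown named axis {axis!r}")
    return named_axis.index(axis)


def propagate_reduction(named_axis, axis, keepdims=False):
    """Named axes of the result of reducing over a single axis."""
    pos = resolve_axis(axis, named_axis)
    if keepdims:
        # the reduced dimension survives with length 1, so its name is kept
        return named_axis
    # the reduced dimension disappears, and its name with it
    return named_axis[:pos] + named_axis[pos + 1:]


print(propagate_reduction(("events", "jets"), "jets"))
print(propagate_reduction(("events", "jets"), "jets", keepdims=True))
```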

There are several scenarios for how named axes are propagated to the resulting array:

  1. Nothing changes: the named axes are kept in the resulting array, e.g. ak.sum(array, axis="jets", keepdims=True) or array ** 2.
  2. Named axes are removed: the reduced named axis is removed from the resulting array, e.g. ak.sum(array, axis="jets").
  3. Named axes are unified in binary operations between two ak.Array objects:
import awkward as ak

array1 = ak.Array([[1, 2], [3, 4]], named_axis=("In", None))
array2 = ak.Array([[5, 6], [7, 8]], named_axis=(None, "Out"))

(array1 + array2).named_axis
>>> ("In", "Out")

Here, checks for matching named axes are possible. The rules are:

ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=("foo",))    # OK
ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=(None,))     # OK
ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=("bar",))    # raise Exception
  4. Named axes are collapsed into a new one:
import awkward as ak

array = ak.ones((1, 2, 3), named_axis=("x", "y", "z"))

# does this even make sense/exist?
ak.flatten(array, axis=("y", "z")).named_axis
>>> ("x", None)

ak.flatten(array, axis=None).named_axis
>>> (None,)
  5. Named axes are permuted (no use case exists currently / not possible): the named axes of the resulting array are permuted, e.g.:
import awkward as ak

array = ak.Array([[1], [2]], named_axis=("x", "y"))
array.named_axis
>>> ("x", "y")

array.T.named_axis
>>> ("y", "x")
  6. Named axes are contracted away (no use case exists currently / not possible):
import awkward as ak

array1 = ak.Array([[1, 2], [3, 4]], named_axis=("In", "Foo"))
array2 = ak.Array([[5, 6], [7, 8]], named_axis=("Foo", "Out"))

(array1 @ array2).named_axis
>>> ("In", "Out")

Named axis in indexing

In addition, named axes can be used to select data:

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# select the first event
first_event = array[{"events": 0, "jets": slice(None)}]
>>> ak.Array([1, 2], named_axis=("jets",))

# select the first jet of each event
first_jet = array[{"events": slice(None), "jets": slice(0, 1)}]
>>> ak.Array([[1], [3], [], [4]], named_axis=("events", "jets"))

As syntactic sugar, ak.slice is added:

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# select the first jet of each event
first_jet = array[{"events": ak.slice[...], "jets": ak.slice[0:1]}]
>>> ak.Array([[1], [3], [], [4]], named_axis=("events", "jets"))

# or mixed with positional axis
first_jet = array[..., {"jets": ak.slice[0:1]}]
>>> ak.Array([[1], [3], [], [4]], named_axis=("events", "jets"))
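
One way to think about dict-based indexing is as a translation into an ordinary positional index tuple. The sketch below is an assumption about how this could work, not the implementation in this PR: named_slice is a hypothetical stand-in for the proposed ak.slice, to_positional_index is a hypothetical helper, and treating `...` inside the dict as "everything along this axis" is inferred from the example above.

```python
class _Slicer:
    """Returns whatever is put in the square brackets, e.g. named_slice[0:1] -> slice(0, 1)."""

    def __getitem__(self, item):
        return item


named_slice = _Slicer()  # hypothetical stand-in for the proposed ak.slice


def to_positional_index(index, named_axis):
    """Translate {axis: where} into a positional index tuple (hypothetical helper)."""
    ndim = len(named_axis)
    out = [slice(None)] * ndim  # axes that are not mentioned are taken whole
    seen = set()
    for key, where in index.items():
        if isinstance(key, int):
            # integer keys are positional axes, which allows mixed dicts
            if not -ndim <= key < ndim:
                raise KeyError(f"axis {key} is out of range for {ndim} dimensions")
            pos = key % ndim
        elif key is not None and key in named_axis:
            pos = named_axis.index(key)
        else:
            raise KeyError(f"unknown axis {key!r}")
        if pos in seen:
            raise ValueError(f"axis {pos} is indexed more than once")
        seen.add(pos)
        # assumption: `...` for a single axis means "select everything"
        out[pos] = slice(None) if where is Ellipsis else where
    return tuple(out)


axes = ("events", "jets")
print(to_positional_index({"events": 0, "jets": slice(None)}, axes))
print(to_positional_index({"jets": named_slice[0:1]}, axes))
```

The resulting tuple has the same shape as a conventional positional index, so the existing positional indexing code path could in principle be reused unchanged.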

This PR has to touch a lot of code and needs to add custom named axis propagation to each high-level operation. Thus, this PR is currently in draft mode.

Looking forward to ideas, thoughts, feedback on this effort!

@pfackeldey changed the title from "Feat: named axis for ak.Array" to "feat: named axis for ak.Array" on Sep 12, 2024
@pfackeldey
Collaborator Author

pfackeldey commented Sep 13, 2024

Progress

general

  • documentation for named axis

slicing

  • positional axis only (e.g. array[0])
  • named axis only (e.g. array[{"events": 0}])
  • mixed positional and named axis (e.g. array[{0: 0, "jets": 0}])

Unary and binary operations

  • unary operations (e.g. array ** 2)
  • binary operations (e.g. array1 + array2)

high-level functions

New:

  • ak.with_named_axis
  • ak.without_named_axis

Can be used with named axis:

  • ak.all
  • ak.almost_equal
  • ak.angle
  • ak.any
  • ak.argcartesian
  • ak.argcombinations
  • ak.argmax
  • ak.argmin
  • ak.argsort
  • ak.array_equal
  • ak.backend
  • ak.broadcast_arrays
  • ak.broadcast_fields
  • ak.cartesian
  • ak.categories
  • ak.combinations
  • ak.concatenate
  • ak.copy
  • ak.corr
  • ak.count
  • ak.count_nonzero
  • ak.covar
  • ak.drop_none
  • ak.enforce_type
  • ak.fill_none
  • ak.firsts
  • ak.flatten
  • ak.imag
  • ak.is_none
  • ak.local_index
  • ak.mask
  • ak.max
  • ak.mean
  • ak.min
  • ak.moment
  • ak.nan_to_none
  • ak.nan_to_num
  • ak.num
  • ak.ones_like
  • ak.pad_none
  • ak.prod
  • ak.ptp
  • ak.ravel
  • ak.real
  • ak.round
  • ak.run_lengths
  • ak.singletons
  • ak.softmax
  • ak.sort
  • ak.std
  • ak.strings_astype
  • ak.sum
  • ak.to_packed
  • ak.unflatten
  • ak.values_astype
  • ak.var
  • ak.where
  • ak.with_field
  • ak.with_name
  • ak.with_parameter
  • ak.without_parameters
  • ak.zeros_like
  • ak.zip

Independent of named axes, improvements and bugs found along the way that are fixed by this PR as well:

  • various typos in doc strings
  • Indexing could contain multiple ... in certain cases, which is prohibited in NumPy (a705981)
  • keepdims argument in ak.corr and ak.covar was wrong for the mean calculation (00669a3)
  • avoid touching shape unnecessarily often when accessing .purelist_depth, .minmax_depth, and .branch_depth (typetracers) through self.inner_shape property of Numpy{Meta|Array} (1af4376)

@jpivarski
Member

And all the data types that can be passed into square brackets with __getitem__.


codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 91.18497% with 61 lines in your changes missing coverage. Please review.

Project coverage is 82.26%. Comparing base (b749e49) to head (3bb8efa).
Report is 162 commits behind head on main.

Files with missing lines | Patch % | Lines
src/awkward/_namedaxis.py | 85.40% | 20 Missing ⚠️
src/awkward/_operators.py | 76.66% | 7 Missing ⚠️
src/awkward/operations/ak_pad_none.py | 75.00% | 2 Missing ⚠️
src/awkward/operations/ak_with_named_axis.py | 92.00% | 2 Missing ⚠️
src/awkward/operations/ak_without_named_axis.py | 86.66% | 2 Missing ⚠️
src/awkward/_layout.py | 92.30% | 1 Missing ⚠️
src/awkward/_typing.py | 50.00% | 1 Missing ⚠️
src/awkward/contents/content.py | 97.77% | 1 Missing ⚠️
src/awkward/operations/ak_all.py | 92.85% | 1 Missing ⚠️
src/awkward/operations/ak_any.py | 92.85% | 1 Missing ⚠️
... and 23 more
Additional details and impacted files
Files with missing lines | Coverage | Δ
src/awkward/_nplikes/array_like.py | 97.14% <ø> | (+27.75%) ⬆️
src/awkward/_nplikes/typetracer.py | 75.05% <ø> | (+0.19%) ⬆️
src/awkward/_regularize.py | 87.87% <100.00%> | (+0.37%) ⬆️
src/awkward/contents/numpyarray.py | 90.50% <100.00%> | (-1.01%) ⬇️
src/awkward/highlevel.py | 77.16% <100.00%> | (+0.49%) ⬆️
src/awkward/operations/__init__.py | 100.00% <100.00%> | (ø)
src/awkward/operations/ak_almost_equal.py | 93.75% <100.00%> | (+0.56%) ⬆️
src/awkward/operations/ak_argcombinations.py | 88.00% <ø> | (ø)
src/awkward/operations/ak_array_equal.py | 100.00% <ø> | (ø)
src/awkward/operations/ak_cartesian.py | 91.89% <100.00%> | (+0.89%) ⬆️
... and 53 more

... and 85 files with indirect coverage changes
