ManifestGroup::TaskContext should cache partition spec #11235

lirui-apache · 2024-09-29T09:27:34Z

Feature Request / Improvement

When we create a TaskContext, the PartitionSpec and Schema are serialized to JSON strings, and those strings are passed to each BaseContentScanTask. When we later want to inspect the schema and spec of the tasks, the strings have to be deserialized again. This can be very expensive when the schema has many columns. In our use case, the table has over 2k columns.

So I think we should store the parsed schema/spec in TaskContext and pass them to each BaseContentScanTask, so that the tasks can reuse them instead of deserializing the JSON again.
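To make the cost difference concrete, here is a minimal, self-contained sketch of the idea. It does not use Iceberg's actual classes: `ParsedSpec`, `StringOnlyTask`, `CachingContext`, `SharedSpecTask`, and the `parse` function are hypothetical stand-ins, and the assumption that each task parses its own copy of the JSON is taken from the description above rather than from the code. It also ignores how tasks are serialized for distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SpecCacheSketch {
  // Counts how often the stand-in parser runs.
  static final AtomicInteger PARSE_COUNT = new AtomicInteger();

  // Hypothetical stand-in for a parsed PartitionSpec or Schema.
  static final class ParsedSpec {
    final String json;
    ParsedSpec(String json) { this.json = json; }
  }

  // Stand-in for the expensive JSON deserialization.
  static ParsedSpec parse(String json) {
    PARSE_COUNT.incrementAndGet();
    return new ParsedSpec(json);
  }

  // Assumed current behavior: a task only receives the JSON string
  // and parses it the first time the spec is requested.
  static final class StringOnlyTask {
    private final String specString;
    private ParsedSpec spec;
    StringOnlyTask(String specString) { this.specString = specString; }
    ParsedSpec spec() {
      if (spec == null) {
        spec = parse(specString);
      }
      return spec;
    }
  }

  // Proposed behavior: the context parses once and keeps the result.
  static final class CachingContext {
    private final ParsedSpec spec;
    CachingContext(String specString) { this.spec = parse(specString); }
    ParsedSpec spec() { return spec; }
  }

  // A task that reuses the object held by the context.
  static final class SharedSpecTask {
    private final ParsedSpec spec;
    SharedSpecTask(CachingContext context) { this.spec = context.spec(); }
    ParsedSpec spec() { return spec; }
  }

  // Number of parses when every task deserializes its own copy.
  static int parsesWithoutCache(int taskCount) {
    PARSE_COUNT.set(0);
    List<StringOnlyTask> tasks = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      tasks.add(new StringOnlyTask("{\"spec-id\":0}"));
    }
    for (StringOnlyTask task : tasks) {
      task.spec();
    }
    return PARSE_COUNT.get();
  }

  // Number of parses when tasks share the context's parsed object.
  static int parsesWithCache(int taskCount) {
    PARSE_COUNT.set(0);
    CachingContext context = new CachingContext("{\"spec-id\":0}");
    List<SharedSpecTask> tasks = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      tasks.add(new SharedSpecTask(context));
    }
    for (SharedSpecTask task : tasks) {
      task.spec();
    }
    return PARSE_COUNT.get();
  }

  public static void main(String[] args) {
    System.out.println("without cache: " + parsesWithoutCache(1000));
    System.out.println("with cache: " + parsesWithCache(1000));
  }
}
```

In this sketch the number of parses grows with the number of tasks in the first variant and stays at one in the second, which is the saving the proposal is after when each parse involves thousands of columns.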

Query engine

None

Willingness to contribute

I can contribute this improvement/feature independently
I would be willing to contribute this improvement/feature with guidance from the Iceberg community
I cannot contribute this improvement/feature at this time


lirui-apache added the "improvement" label (PR that improves existing functionality) Sep 29, 2024
