Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data: Add partition stats writer and reader #11216

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Sep 26, 2024

Introduce APIs to write the partition stats into files in table default format using Iceberg generic writers and readers.

PartitionStatisticsFile partitionStatisticsFile =
        PartitionStatsHandler.computeAndWriteStatsFile(testTable, "b1");

testTable.updatePartitionStatistics().setPartitionStatistics(partitionStatisticsFile).commit();


@Override
@SuppressWarnings("checkstyle:CyclomaticComplexity")
public boolean equals(Object other) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLikeMap was previously handling this implicitly. But when PartitionStatsRecord wraps PartitionStats now for the writers, it needs to override equals and hashcode

StructLike coercedPartition =
PartitionUtil.coercePartition(partitionType, spec, file.partition());
StructLike key = keyTemplate.copyFor(coercedPartition);
Record key = coercedPartitionRecord(file, spec, partitionType);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need Record instead of PartitionData for the writers.

Cannot keep this conversion in the data module as it just to wraps the same PartitionStats object.


/** Wraps the {@link PartitionStats} as {@link Record}. Used by generic writers and readers. */
public class PartitionStatsRecord implements Record, StructLike {
private static final LoadingCache<StructType, Map<String, Integer>> NAME_MAP_CACHE =
Copy link
Member Author

@ajantha-bhat ajantha-bhat Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class is similar to GenericRecord but for a specific partition stats schema.

partitionData.set(1, c3);
return partitionData;
private static StructLike partitionData(Types.StructType partitionType, String c2, String c3) {
GenericRecord record = GenericRecord.create(partitionType);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stores the Record instead of PartitionData in the core logic. So, testcase is updated to expect Record.

Schema schema,
PartitionSpec spec,
int formatVersion,
Map<String, String> properties) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no option to pass the table properties before.
Needed to pass different file format for paramterized test.

@ajantha-bhat ajantha-bhat added this to the Iceberg 1.7.0 milestone Sep 27, 2024
@ajantha-bhat
Copy link
Member Author

@aokolnychyi: This PR is ready. But as we discussed previously, this PR wraps the PartitionStats into a Record as the writers cannot work with Iceberg internal objects yet.

I will explore adding the internal writers for Parquet and Orc. Similar to #11108.
If we fail to have it ready by 1.7.0, I think it makes sense to merge this PR and introduce the optimized writer in the next version by deprecating this writer.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review September 27, 2024 02:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant