Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading plain encoded INT96 timestamp from Parquet file #10850

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Aug 27, 2024

Follow-up for: #4680.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 27, 2024
Copy link

netlify bot commented Aug 27, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b8ef8f2
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66f5313b89e42900080e4ce1

@rui-mo
Copy link
Collaborator Author

rui-mo commented Aug 27, 2024

Hi @mskapilks, since Gluten customer has been asking for this feature, I created this PR to add support for reading plain encoded INT96 timestamp with your commit kept. If you would like to continue this work in your PR, please let us know. Thanks!

@mskapilks
Copy link

mskapilks commented Aug 27, 2024

Hi @mskapilks, since Gluten customer has been asking for this feature, I created this PR to add support for reading plain encoded INT96 timestamp with your commit kept. If you would like to continue this work in your PR, please let us know. Thanks!

Thanks for the PR
You can continue. I am still looking into INT64 pr

@rui-mo rui-mo force-pushed the wip_ts branch 3 times, most recently from a93770b to 07bfce3 Compare September 23, 2024 06:59
unsigned char ch;

// Read 8 unsigned bytes.
uint64_t nanos = 0;
Copy link
Contributor

@Yuhta Yuhta Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just do this? This class should not have any knowledge about timestamp

int128_t result = 0;
for (int i = 0; i < 12; ++i) {
  auto ch = readByte();
  result |= static_cast<uint128_t>(ch & BASE_256_MASK) << (i * 8);
}
return result;

Then inside TimestampColumnReader we convert these int128_t into Timestamp, possibly in-place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated following this suggestion. Now we convert int128_t to Timestamp in getValues inside TimestampColumnReader. Please note that we also need to convert the int128_t value as Timestamp in TimestampRange with this change. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntDecoder and TimestampColumnReader looks good now, the change in TimestampRange is not ideal though, that is parquet specific logic we need to find a way to move this into parquet as well.

@rui-mo rui-mo force-pushed the wip_ts branch 2 times, most recently from 709e15e to b8ef8f2 Compare September 26, 2024 10:02
@@ -70,7 +76,7 @@ class TimestampColumnReader : public IntegerColumnReader {
const RowSet& rows,
const uint64_t* /*incomingNulls*/) override {
auto& data = formatData_->as<ParquetData>();
// Use int128_t as a workaroud. Timestamp in Velox is of 16-byte length.
// Use int128_t as a workaround. Timestamp in Velox is of 16-byte length.
prepareRead<int128_t>(offset, rows, nullptr);
readCommon<IntegerColumnReader, true>(rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the clean way to do it is to create our own ColumnVisitor instance with a subclass of TimestampRange that is dedicated for Parquet Int96, instead of reusing the readCommon for integers. Almost all of the filter type and valueSize_ dispatches on integers do not apply to timestamp type anyway.

@Yuhta Yuhta linked an issue Oct 3, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet Int96 timestamp support
4 participants