Skip to content

feat: nan_value_counts support #907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 25, 2025

Conversation

feniljain
Copy link
Contributor

@feniljain feniljain commented Jan 22, 2025

Issue

Fixes #417

Description

  • We compute upper and lower bounds by relying on parquet statistics, but those statistics don't provide nan_value_count, so we have to implement it in library itself when arrow record batches are received.
  • We keep track of it at ParquetWriter level cause write can be called multiple times .
  • Added couple of new tests for different types for nan_val_count.

@feniljain feniljain force-pushed the feat-nan-value-counts branch from 281daa1 to 7e76ade Compare January 22, 2025 20:17
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @feniljain for this pr! Left some comments to resolve.

let dt = col.data_type();

let nan_val_cnt: u64 = match dt {
DataType::Float32 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to add float64?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is incorrect, it ignored nested primitive type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are nice catches, lemme implement this and re-request review, thanks! 🙇🏻

@feniljain feniljain force-pushed the feat-nan-value-counts branch 2 times, most recently from 4801e40 to ee3b30d Compare January 27, 2025 19:29
@feniljain feniljain force-pushed the feat-nan-value-counts branch from ee3b30d to 7fad2e3 Compare March 8, 2025 12:12
@feniljain
Copy link
Contributor Author

Hey @liurenjie1024 👋🏻

Thanks for being patient, I think this time I have covered all the cases properly, there's a small discussion around LargeList and FixedSizeList which I have posted about on slack here, but rest should be fine :)

@feniljain feniljain requested a review from liurenjie1024 March 8, 2025 13:08
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @feniljain for this great pr, generall LGTM! Would you mind to move the logic to arrow module and use SchemaWithPartnerVisitor to do that? Here is a good example.

@@ -509,6 +584,89 @@ impl ParquetWriter {

Ok(builder)
}

fn transverse_batch(&mut self, col: &ArrayRef, field: &NestedFieldRef) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only arrow related, I think it's better to move it to arrow module. Also we have a good example of visiting arrow array along with iceberg schema, see this part

@feniljain feniljain force-pushed the feat-nan-value-counts branch from 6d7bbe0 to 3ccba94 Compare March 13, 2025 18:02
@feniljain feniljain force-pushed the feat-nan-value-counts branch from ab301c5 to 736c5ad Compare March 13, 2025 18:06
@feniljain feniljain force-pushed the feat-nan-value-counts branch from 98070a5 to 644567d Compare March 13, 2025 19:24
@feniljain feniljain force-pushed the feat-nan-value-counts branch from 644567d to 3b1d33d Compare March 13, 2025 19:29
@feniljain
Copy link
Contributor Author

Heyo! 👋🏻

Thanks for pointing out SchemaWithPartnerVisitor , I had somehow missed it when initially checking for transversing mechanisms. I have shifted implementation to use this :)

Also, I have one small doubt, is this clone okay?

RecordBatch has three fields:

  • schema which is SchemaRef i.e. behind Arc, this is a cheap clone
  • columns which is Vec<Arc<dyn Array>>, here vector will be cloned, but internally buffers seem to be behind Arc, so even that should cheap to clone
  • row_count is a primitive usize type, not a problem

This was my thinking when using clone, but do lemme know if I am misunderstanding something here and if there's a better way, thanks! 🙇🏻

@feniljain feniljain requested a review from liurenjie1024 March 13, 2025 19:47
@feniljain
Copy link
Contributor Author

Also just FYI, I have fixed few other unrelated tests in this commit. They started failing now cause this implementation relies on field ID defined in metadata of arrow schema, and these tests did not have it defined leading to failures.

@feniljain
Copy link
Contributor Author

Hadn't realized it got conflicts due to other PR merges, fixed them too :)

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @feniljain for this pr, generally LGTM! Just one minor hints.

@@ -19,7 +19,12 @@

mod schema;
pub use schema::*;

mod nan_val_cnt_visitor;
pub use nan_val_cnt_visitor::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub use nan_val_cnt_visitor::*;
pub(crate) use nan_val_cnt_visitor::*;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw the review, thanks for making the change yourself 🙇🏻

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @feniljain for this pr, LGTM!

@liurenjie1024 liurenjie1024 merged commit 314af4c into apache:main Mar 25, 2025
18 checks passed
@feniljain feniljain deleted the feat-nan-value-counts branch March 25, 2025 13:33
xxchan added a commit to xxchan/iceberg-rust that referenced this pull request Mar 25, 2025
xxchan added a commit to risingwavelabs/iceberg-rust that referenced this pull request Mar 25, 2025
fix tests failure, caused by apache#907

Signed-off-by: xxchan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement nan_value_counts && distinct_counts metrics in parquet writer
2 participants