feat: nan_value_counts support #907
Thanks @feniljain for this pr! Left some comments to resolve.
let dt = col.data_type();

let nan_val_cnt: u64 = match dt {
    DataType::Float32 => {
I think we also need to add float64?
Also, this is incorrect: it ignores primitive types nested inside other types.
Those are nice catches, lemme implement this and re-request review, thanks! 🙇🏻
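To make the two review points concrete (handle both float widths, and descend into nested types), here is a minimal sketch. The `Column` enum and `nan_count` function are hypothetical stand-ins written for illustration; they are not the arrow-rs or iceberg-rust types, and a real implementation would record counts per field ID rather than one total.

```rust
/// Hypothetical, simplified stand-in for an Arrow array (not the arrow-rs API).
#[derive(Debug)]
enum Column {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    /// Any non-float primitive column; it can never contain NaN.
    Other,
    /// A struct column holds one child column per field.
    Struct(Vec<Column>),
    /// A list column holds a single child column with the flattened values.
    List(Box<Column>),
}

/// Counts NaN values in a column, descending into nested children.
/// Nulls (`None`) are skipped and never counted as NaN.
fn nan_count(col: &Column) -> u64 {
    match col {
        Column::Float32(values) => {
            values.iter().flatten().filter(|v| v.is_nan()).count() as u64
        }
        Column::Float64(values) => {
            values.iter().flatten().filter(|v| v.is_nan()).count() as u64
        }
        Column::Other => 0,
        Column::Struct(children) => children.iter().map(nan_count).sum(),
        Column::List(child) => nan_count(child),
    }
}
```

Matching only on `Float32` at the top level would miss both the `Float64` arm and the two recursive arms shown here.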
Hey @liurenjie1024 👋🏻 Thanks for being patient. I think this time I have covered all the cases properly; there's a small discussion around
Thanks @feniljain for this great PR, generally LGTM! Would you mind moving the logic to the arrow module and using SchemaWithPartnerVisitor to do that? Here is a good example.
@@ -509,6 +584,89 @@ impl ParquetWriter {

        Ok(builder)
    }

    fn transverse_batch(&mut self, col: &ArrayRef, field: &NestedFieldRef) {
This is only Arrow-related, so I think it's better to move it to the arrow module. We also have a good example of visiting an Arrow array along with an Iceberg schema; see this part.
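For readers unfamiliar with the "visit the schema together with its partner data" idea, a rough sketch follows. Every name in it (`Field`, `Array`, `PartnerVisitor`, `visit`, `NanCounter`) is hypothetical and invented for illustration; the actual SchemaWithPartnerVisitor trait in the crate has a different, richer interface.

```rust
use std::collections::HashMap;

/// Hypothetical schema node carrying a field ID (not the iceberg-rust types).
enum Field {
    Primitive { id: i32 },
    Struct { children: Vec<Field> },
}

/// Hypothetical partner: the data that lines up with a schema node.
enum Array {
    Float64(Vec<f64>),
    Other,
    Struct(Vec<Array>),
}

/// Visitor invoked once per primitive field together with its partner array.
trait PartnerVisitor {
    fn primitive(&mut self, id: i32, partner: &Array);
}

/// Walks schema and data in lockstep, pairing each field with its array.
fn visit<V: PartnerVisitor>(field: &Field, partner: &Array, visitor: &mut V) {
    match (field, partner) {
        (Field::Struct { children }, Array::Struct(arrays)) => {
            for (child, array) in children.iter().zip(arrays) {
                visit(child, array, visitor);
            }
        }
        (Field::Primitive { id }, array) => visitor.primitive(*id, array),
        // Shape mismatch: a real implementation would return an error here.
        _ => {}
    }
}

/// Collects NaN counts keyed by field ID.
struct NanCounter {
    counts: HashMap<i32, u64>,
}

impl PartnerVisitor for NanCounter {
    fn primitive(&mut self, id: i32, partner: &Array) {
        // Only float arrays can hold NaN; other arrays get no entry.
        if let Array::Float64(values) = partner {
            let nans = values.iter().filter(|v| v.is_nan()).count() as u64;
            *self.counts.entry(id).or_insert(0) += nans;
        }
    }
}
```

The point of the pattern is that the traversal logic lives in one place (`visit`), while the NaN counting is just one visitor among many that could reuse it.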
Heyo! 👋🏻 Thanks for pointing out Also, I have one small doubt, is this
This was my thinking when using
Also, just FYI, I have fixed a few other unrelated tests in this commit. They started failing because this implementation relies on the field ID defined in the metadata of the Arrow schema, and these tests did not define it.
Hadn't realized it had conflicts due to other PR merges; fixed them too :)
Thanks @feniljain for this PR, generally LGTM! Just one minor hint.
crates/iceberg/src/arrow/mod.rs
@@ -19,7 +19,12 @@

mod schema;
pub use schema::*;

mod nan_val_cnt_visitor;
pub use nan_val_cnt_visitor::*;
- pub use nan_val_cnt_visitor::*;
+ pub(crate) use nan_val_cnt_visitor::*;
Just saw the review, thanks for making the change yourself 🙇🏻
Thanks @feniljain for this pr, LGTM!
fix tests failure, caused by apache#907 Signed-off-by: xxchan <[email protected]>
Issue
Fixes #417
Description
nan_value_count, so we have to implement it in the library itself when Arrow record batches are received.
ParquetWriter level, because write can be called multiple times.
nan_val_count.
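Since write can be called multiple times, the counts have to be accumulated on the writer and only reported once the file is finished. A minimal sketch of that idea, assuming a hypothetical `SketchWriter` type and a simplified batch shape (field ID paired with `f64` values); this is not the actual ParquetWriter code:

```rust
use std::collections::HashMap;

/// Hypothetical writer that keeps per-field NaN counts for the whole file.
#[derive(Default)]
struct SketchWriter {
    nan_value_counts: HashMap<i32, u64>,
}

impl SketchWriter {
    /// Records one batch: each entry pairs a field ID with that field's values.
    fn write(&mut self, batch: &[(i32, Vec<f64>)]) {
        for (field_id, values) in batch {
            let nans = values.iter().filter(|v| v.is_nan()).count() as u64;
            // Add to the running total instead of overwriting it.
            *self.nan_value_counts.entry(*field_id).or_insert(0) += nans;
        }
    }

    /// Consumes the writer and returns the totals across every `write` call.
    fn close(self) -> HashMap<i32, u64> {
        self.nan_value_counts
    }
}
```

Keeping the map on the writer rather than computing it per call is what lets the final data file metadata reflect all batches.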