feat(datafusion): Support insert_into in IcebergTableProvider #1511
base: main
Conversation
@@ -440,10 +440,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor {
        Ok(schema_partner)
    }

    // todo generate field_pos in datafusion instead of passing to here
I found it tricky to handle this case: the input from DataFusion won't have field IDs, and we will need to assign them manually. Maybe there is a way to do name mapping here?
Could you help me understand why we need to change this?
This is a temporary hack for an issue that I don't yet know how to fix properly: the RecordBatch from DataFusion won't have PARQUET_FIELD_ID_META_KEY in its schema's metadata, which causes the schema visiting to fail here. I'm thinking we could maybe bind the schema in DataFusion via name mapping, but I haven't had the chance to explore that further.
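Roughly what I have in mind as a workaround, as a hypothetical sketch (the name-to-field-id map, the helper name, and the key constant are assumptions for illustration; nested fields are ignored): rebuild the Arrow schema coming out of DataFusion so each top-level field carries the field-id metadata the visitor expects.

```rust
use std::collections::HashMap;

use arrow_schema::{Field, Schema};

// Mirrors iceberg-rust's PARQUET_FIELD_ID_META_KEY; assumed here for illustration.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

// Hypothetical helper: attach field ids to a DataFusion-produced Arrow schema by
// matching top-level field names against the Iceberg table schema.
fn with_field_ids(schema: &Schema, field_ids_by_name: &HashMap<String, i32>) -> Schema {
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .map(|f| {
            let mut metadata = f.metadata().clone();
            if let Some(id) = field_ids_by_name.get(f.name()) {
                metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
            }
            f.as_ref().clone().with_metadata(metadata)
        })
        .collect();
    Schema::new(fields)
}
```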
Why do we need to convert the RecordBatch's schema to an Iceberg schema?
The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema.
This method is used when ParquetWriter writes a RecordBatch. When counting NaN values, it needs to walk through both the RecordBatch's schema and the Iceberg schema in a partner fashion:

.compute(self.schema.clone(), batch_c)?;

Basically the call stack is NanValueCountVisitor::compute -> visit_struct_with_partner -> ArrowArrayAccessor::field_partner -> get_field_id.
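For context, a simplified stand-in for what the field-id lookup needs (the constant mirrors iceberg-rust's PARQUET_FIELD_ID_META_KEY; the function body is illustrative, not the crate's exact code): every Arrow field must carry a parseable field-id metadata entry, which DataFusion's RecordBatch schema does not provide out of the box.

```rust
use arrow_schema::Field;

// Assumed constant, matching the key iceberg-rust uses for Parquet field ids.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

// Illustrative stand-in for the lookup that fails when the metadata is missing.
fn get_field_id(field: &Field) -> Option<i32> {
    field
        .metadata()
        .get(PARQUET_FIELD_ID_META_KEY)
        .and_then(|v| v.parse::<i32>().ok())
}
```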
Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with unpartitioned tables first.
@@ -432,3 +433,69 @@ async fn test_metadata_table() -> Result<()> {

    Ok(())
}

#[tokio::test]
async fn test_insert_into() -> Result<()> {
I'm not a big fan of adding this kind of integration test. How about adding sqllogictests instead?
// Create data file writer builder
let data_file_writer_builder = DataFileWriterBuilder::new(
    ParquetWriterBuilder::new(
This should be RollingFileWriter
Good point! Added rolling_writer.rs
// // Apply the action and commit the transaction
// let updated_table = action
//     .apply(tx)
//     .map_err(to_datafusion_error)?
//     .commit(catalog.as_ref())
//     .await
//     .map_err(to_datafusion_error)?;
Why is this commented out?
This is for testing only and will be uncommented in the future. I commented it out for now because the memory catalog doesn't support updating tables yet, and that would fail my local test.
println!("----StructArray from record stream: {:?}", struct_arr); | ||
println!("----Schema.as_struct from table: {:?}", schema.as_struct()); |
We should use log here.
This is for testing only, and I'm planning to remove these log lines.
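If the diagnostics do stay, a minimal sketch of the log-based variant (a hypothetical helper; the arguments correspond to the values printed above):

```rust
use arrow_array::StructArray;
use iceberg::spec::StructType;
use log::debug;

// Hypothetical helper: same diagnostics as the println! calls above, but routed
// through the `log` facade so embedding applications can filter or disable them.
fn log_write_schemas(struct_arr: &StructArray, schema_struct: &StructType) {
    debug!("StructArray from record stream: {:?}", struct_arr);
    debug!("Schema.as_struct from table: {:?}", schema_struct);
}
```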
Co-authored-by: Renjie Liu <[email protected]>
Co-authored-by: Renjie Liu <[email protected]>
Thanks @CTTY, the direction looks good to me!
// Verify each serialized file contains expected data
for json in &serialized_files {
    assert!(json.contains("path/to/file"));
nit: Why not assert the full JSON output? We could use a snapshot test to make that easier; see https://docs.rs/expect-test/latest/expect_test/
I think a snapshot test makes more sense.
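For example, a minimal expect-test sketch (the JSON literal is a made-up stand-in for one entry of serialized_files; the point is the inline expect![[...]] snapshot, which can be regenerated with UPDATE_EXPECT=1):

```rust
use expect_test::expect;

#[test]
fn serialized_data_file_snapshot() {
    // Made-up JSON standing in for one serialized data file from the test above.
    let json = r#"{"content":"Data","file_path":"path/to/file.parquet"}"#;

    // Inline snapshot; run the test with UPDATE_EXPECT=1 to refresh it automatically.
    expect![[r#"{"content":"Data","file_path":"path/to/file.parquet"}"#]].assert_eq(json);
}
```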
/// This trait extends `IcebergWriter` with the ability to determine when to start
/// writing to a new file based on the size of incoming data.
#[async_trait]
pub trait RollingFileWriter: IcebergWriter {
Sorry, I forgot that currently we don't have RollingFileWriter. It's fine to leave it in this PR, but it would be better to add it in a separate PR. RollingFileWriter should be a FileWriter rather than an IcebergWriter.
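For illustration, a simplified sketch of that layering with hypothetical, stripped-down traits (not the crate's real FileWriter/IcebergWriter signatures, which are async): the rolling behavior lives at the file-writer level and rotates on size, so the Iceberg-level writer never has to know about file boundaries.

```rust
// Simplified, hypothetical trait for illustration only.
trait FileWriter {
    fn write(&mut self, bytes: &[u8]);
    fn close(self) -> Vec<String>; // paths of the data files produced
}

// A rolling writer is itself a FileWriter: it decides when to start a new file
// based on how many bytes have been written, and reports every file it closed.
struct RollingFileWriter<W: FileWriter> {
    target_file_size: usize,
    written: usize,
    current: Option<W>,
    new_writer: fn() -> W,
    closed_files: Vec<String>,
}

impl<W: FileWriter> FileWriter for RollingFileWriter<W> {
    fn write(&mut self, bytes: &[u8]) {
        // Roll over to a fresh file once the size target has been reached.
        if self.written >= self.target_file_size {
            if let Some(w) = self.current.take() {
                self.closed_files.extend(w.close());
            }
            self.written = 0;
        }
        let new_writer = self.new_writer;
        let w = self.current.get_or_insert_with(new_writer);
        w.write(bytes);
        self.written += bytes.len();
    }

    fn close(mut self) -> Vec<String> {
        if let Some(w) = self.current.take() {
            self.closed_files.extend(w.close());
        }
        self.closed_files
    }
}
```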
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?