feat(datafusion): Support insert_into in IcebergTableProvider #1511

Draft

CTTY wants to merge 36 commits into apache:main from CTTY:ctty/df-insert

Contributor

CTTY commented Jul 15, 2025

Which issue does this PR close?

A part of [EPIC] Support for appending data to iceberg table. #1382

What changes are included in this PR?

Are these changes tested?

CTTY added 5 commits

July 15, 2025 15:35


          Support Datafusion insert_into

a5593b4


          cleanup

558b402


          minor

847a2bb


          minor

b067656


          clippy ftw

f52a698

CTTY commented

crates/iceberg/src/arrow/value.rs

@@ @@ -440,10 +440,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor { @@
                       Ok(schema_partner)
                   }
+                  // todo generate field_pos in datafusion instead of passing to here

Contributor Author

CTTY Jul 16, 2025

I found this case tricky to handle: the input from DataFusion won't have field IDs, so we will need to assign them manually. Maybe there is a way to do name mapping here?

Contributor

liurenjie1024 Jul 21, 2025

Could you help me to understand why we need to change this?

Contributor Author

CTTY Jul 21, 2025

This is a temporary hack for an issue that I don't yet know how to fix properly: the RecordBatch from DataFusion won't have PARQUET_FIELD_ID_META_KEY in its schema's metadata, causing the schema visiting to fail here.

I'm thinking maybe we can bind the schema in DataFusion via name mapping, but I have not had the chance to explore this further.
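
The name-mapping idea above can be sketched without any crate dependencies. This is only an illustration under stated assumptions: `SketchField` and `assign_field_ids` are hypothetical names, not types from arrow-rs or iceberg-rust, and the metadata key string is assumed to be the value behind `PARQUET_FIELD_ID_META_KEY`.

```rust
use std::collections::HashMap;

/// Metadata key for Parquet field IDs on Arrow fields.
/// Assumption: this is the string behind `PARQUET_FIELD_ID_META_KEY`.
const FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Hypothetical stand-in for an Arrow field: a name plus string metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchField {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Attach a field ID to every field that lacks one, looking it up by column name.
/// Fields that already carry an ID are left untouched.
/// Returns an error naming the first column the mapping does not cover.
pub fn assign_field_ids(
    fields: &[SketchField],
    name_to_id: &HashMap<String, i32>,
) -> Result<Vec<SketchField>, String> {
    fields
        .iter()
        .map(|field| -> Result<SketchField, String> {
            let mut field = field.clone();
            if !field.metadata.contains_key(FIELD_ID_KEY) {
                let id = name_to_id
                    .get(&field.name)
                    .ok_or_else(|| format!("no field id mapped for column `{}`", field.name))?;
                field
                    .metadata
                    .insert(FIELD_ID_KEY.to_string(), id.to_string());
            }
            Ok(field)
        })
        .collect()
}
```

In a real integration the mapping would presumably be derived from the table's Iceberg schema, and the rewritten fields would be used to rebuild the batch schema before it reaches the writer.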

Contributor

liurenjie1024 Jul 22, 2025

Why do we need to convert the RecordBatch's schema to an Iceberg schema?

Contributor

liurenjie1024 Jul 22, 2025

The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema.

Contributor Author

CTTY Jul 22, 2025

This method is used when ParquetWriter writes a RecordBatch. To count NaN values, it needs to walk through both the RecordBatch's schema and the Iceberg schema in a partner fashion:

iceberg-rust/crates/iceberg/src/writer/file_writer/parquet_writer.rs

Line 528 in 9787140

.compute(self.schema.clone(), batch_c)?;

Basically the call stack is NanValueCountVisitor::compute -> visit_struct_with_partner -> ArrowArrayAccessor::field_partner -> get_field_id
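
To make the partner-style walk concrete, here is a dependency-free sketch that pairs schema fields with batch columns by position and counts NaN values per field ID. It does not reproduce `NanValueCountVisitor`; `SchemaField`, `Column`, and `count_nans` are hypothetical names. Matching by position rather than by field ID is an assumption that follows the `field_pos` idea in the TODO above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Iceberg schema field.
pub struct SchemaField {
    pub id: i32,
    pub name: &'static str,
}

/// Hypothetical stand-in for an Arrow column in a record batch.
pub enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
}

/// Walk schema fields and batch columns pairwise (by position) and count
/// NaN values per field ID. Only float columns can hold NaN, so other
/// column types produce no entry in the result.
pub fn count_nans(
    schema: &[SchemaField],
    columns: &[Column],
) -> Result<HashMap<i32, u64>, String> {
    if schema.len() != columns.len() {
        return Err(format!(
            "schema has {} fields but batch has {} columns",
            schema.len(),
            columns.len()
        ));
    }
    let mut counts = HashMap::new();
    for (field, column) in schema.iter().zip(columns) {
        if let Column::Float(values) = column {
            let nans = values.iter().filter(|v| v.is_nan()).count() as u64;
            counts.insert(field.id, nans);
        }
    }
    Ok(counts)
}
```

Because the pairing is positional, the columns need no field-ID metadata; the trade-off is that the batch must list its columns in the same order as the schema.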

Contributor

liurenjie1024 Jul 23, 2025

Thanks for the explanation, that makes sense to me. We need a separate issue to solve this.


          minor

d367a7c

CTTY commented

crates/iceberg/src/spec/manifest/_serde.rs (outdated, resolved)

CTTY added 2 commits

July 15, 2025 18:13


          minor

99af430


          i luv cleaning up

2f9efa8

CTTY force-pushed the ctty/df-insert branch from 7843b0d to 2f9efa8 Compare

July 16, 2025 03:37


          fmt not working?

9d7c1c3

liurenjie1024 reviewed

Contributor

liurenjie1024 left a comment

Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with unpartitioned tables first.

crates/integrations/datafusion/src/table/mod.rs (resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/iceberg/src/spec/manifest/mod.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/tests/integration_datafusion_test.rs

@@ @@ -432,3 +433,69 @@ async fn test_metadata_table() -> Result<()> { @@
                   Ok(())
               }
+              #[tokio::test]
+              async fn test_insert_into() -> Result<()> {

Contributor

liurenjie1024 Jul 16, 2025

I'm not a big fan of adding this kind of integration test. How about adding sqllogictests?

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

CTTY added 6 commits

July 16, 2025 09:34


          Merge branch 'main' into ctty/df-insert

41a75bd


          do not expose serde

e25f888


          cut it down

b554701


          Use stricter wrapper data file wrapper

77b349b


          fix partitioning, and fmt ofc

88afe82


          minor

295e9b6

liurenjie1024 reviewed

crates/integrations/datafusion/src/table/mod.rs (resolved)

crates/integrations/datafusion/src/table/mod.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

CTTY added 4 commits

July 17, 2025 14:31


          partitioned shall not pass

92588f5


          implement children and with_new_children for write node, fix fmt

7db9432


          Merge branch 'main' into ctty/df-insert

6bd624c


          get row counts from data files directly

8c78046

liurenjie1024 reviewed

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/write.rs Outdated

+                      // Create data file writer builder
+                      let data_file_writer_builder = DataFileWriterBuilder::new(
+                          ParquetWriterBuilder::new(

Contributor

liurenjie1024 Jul 21, 2025

This should be RollingFileWriter

Contributor Author

CTTY Jul 22, 2025

Good point! Added rolling_writer.rs

crates/integrations/datafusion/src/physical_plan/write.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/commit.rs (outdated, resolved)

crates/integrations/datafusion/src/physical_plan/commit.rs

Comment on lines +252 to +258

+                          // // Apply the action and commit the transaction
+                          // let updated_table = action
+                          //     .apply(tx)
+                          //     .map_err(to_datafusion_error)?
+                          //     .commit(catalog.as_ref())
+                          //     .await
+                          //     .map_err(to_datafusion_error)?;

Contributor

liurenjie1024 Jul 21, 2025

Why is this commented out?

Contributor Author

CTTY Jul 21, 2025

This is for testing only and will be uncommented in the future. I commented it out for now because the memory catalog doesn't yet support updating tables, so this would fail my local test.

crates/iceberg/src/spec/manifest/mod.rs (outdated, resolved)

crates/iceberg/src/arrow/nan_val_cnt_visitor.rs

Comment on lines +162 to +163

		println!("----StructArray from record stream: {:?}", struct_arr);
		println!("----Schema.as_struct from table: {:?}", schema.as_struct());

Contributor

liurenjie1024 Jul 21, 2025

We should use log here.

Contributor Author

CTTY Jul 21, 2025

This is for testing only, and I'm planning to remove these log lines

CTTY and others added 5 commits

July 21, 2025 10:02


          Update crates/integrations/datafusion/src/physical_plan/write.rs

724ec7d

Co-authored-by: Renjie Liu <[email protected]>


          Update crates/integrations/datafusion/src/physical_plan/commit.rs

2f56169

Co-authored-by: Renjie Liu <[email protected]>


          Merge branch 'main' into ctty/df-insert

273a164


          fix fmt, input boundedness

53b8b82


          make data_files constant

d2168f2

CTTY added 10 commits

July 21, 2025 12:05


          use format version when serde datafiles

59a3428


          use try_new instead

3b4dc9d


          minor

2b1c3df


          coalesce partitions

db20df1


          minor

e56ab4e

fmt

04a44b3


          rolling

c5b1c38


          rolling in the deep

0f9bce0


          rolls the unit tests

d8f05cf


          could have it all for tests

1ea4a0f

liurenjie1024 reviewed

Contributor

liurenjie1024 left a comment

Thanks @CTTY, the direction looks good to me!

crates/iceberg/src/spec/manifest/mod.rs

+                      // Verify each serialized file contains expected data
+                      for json in &serialized_files {
+                          assert!(json.contains("path/to/file"));

Contributor

liurenjie1024 Jul 22, 2025

nit: Why not assert the full JSON output? We could use a snapshot test to make it easier, see https://docs.rs/expect-test/latest/expect_test/

Contributor Author

CTTY Jul 22, 2025

I think a snapshot test makes more sense
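
As a rough illustration of the difference between a substring check and asserting the whole output, the sketch below serializes a hypothetical `SketchDataFile` by hand so that a test can compare the complete JSON string. `SketchDataFile` and `to_json` are stand-ins, not the crate's `DataFile` or its serde code. A snapshot-testing crate such as the expect-test crate linked above would manage the expected string instead of hard-coding it.

```rust
/// Hypothetical stand-in for a data file entry; not the crate's `DataFile`.
pub struct SketchDataFile {
    pub file_path: String,
    pub record_count: u64,
}

/// Hand-rolled serializer, used only to keep the sketch dependency-free.
/// It does not escape special characters in `file_path`.
pub fn to_json(file: &SketchDataFile) -> String {
    format!(
        "{{\"file_path\":\"{}\",\"record_count\":{}}}",
        file.file_path, file.record_count
    )
}

/// Compare the complete serialized output against the expected string,
/// so that a change to any field (not only the path) fails the check.
pub fn matches_expected(file: &SketchDataFile, expected: &str) -> bool {
    to_json(file) == expected
}
```

A `contains("path/to/file")` check would still pass if `record_count` were serialized incorrectly, while the full comparison would not.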

crates/iceberg/src/writer/base_writer/rolling_writer.rs

+              /// This trait extends `IcebergWriter` with the ability to determine when to start
+              /// writing to a new file based on the size of incoming data.
+              #[async_trait]
+              pub trait RollingFileWriter: IcebergWriter {

Contributor

liurenjie1024 Jul 22, 2025

Sorry, I forgot that we don't currently have RollingFileWriter. It's fine to leave it in this PR, but it would be better to add it in a separate PR. RollingFileWriter should be a FileWriter rather than an IcebergWriter.
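
The rolling behavior under discussion (start a new file once the current one would grow past a target size) can be sketched with an in-memory model. This is not the `rolling_writer.rs` from this PR and does not implement the `FileWriter` or `IcebergWriter` traits. `SketchFile`, `RollingSketchWriter`, and the byte-count threshold are all assumptions made for illustration.

```rust
/// Hypothetical in-memory "file": just the bytes written to it.
#[derive(Debug, Default)]
pub struct SketchFile {
    pub bytes: Vec<u8>,
}

/// Rolls over to a new file once the current one would exceed `target_size`.
pub struct RollingSketchWriter {
    target_size: usize,
    current: SketchFile,
    closed: Vec<SketchFile>,
}

impl RollingSketchWriter {
    pub fn new(target_size: usize) -> Self {
        Self {
            target_size,
            current: SketchFile::default(),
            closed: Vec::new(),
        }
    }

    /// True when appending `incoming` bytes would push a non-empty file
    /// past the target size. An empty file never rolls, so a single
    /// oversized chunk still lands in a file of its own.
    fn should_roll(&self, incoming: usize) -> bool {
        !self.current.bytes.is_empty()
            && self.current.bytes.len() + incoming > self.target_size
    }

    /// Append a chunk, closing the current file first if it would overflow.
    pub fn write(&mut self, chunk: &[u8]) {
        if self.should_roll(chunk.len()) {
            let finished = std::mem::take(&mut self.current);
            self.closed.push(finished);
        }
        self.current.bytes.extend_from_slice(chunk);
    }

    /// Finish writing and return every non-empty file in write order.
    pub fn close(mut self) -> Vec<SketchFile> {
        if !self.current.bytes.is_empty() {
            self.closed.push(self.current);
        }
        self.closed
    }
}
```

With a target size of 10 bytes, writing chunks of 6, 6, and 3 bytes yields two files of 6 and 9 bytes: the second chunk triggers a roll, and the third still fits in the second file.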


          Merge branch 'main' into ctty/df-insert

5001e07

CTTY mentioned this pull request

Implement insert_into for IcebergTableProvider #1540

Open

6 tasks


          Merge branch 'main' into ctty/df-insert

078f458
