Skip to content

chore: Add assertion for empty data files for append action #1301

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 10, 2025

Conversation

dentiny
Copy link
Contributor

@dentiny dentiny commented May 8, 2025

What changes are included in this PR?

The general usage to append data files is

let mut action = txn.fast_append(/*commit_uuid=*/ None, /*key_metadata=*/ vec![])?;
action.add_data_files(new_data_files)?;
let txn = action.apply().await?;
let table = txn.commit(catalog).await?

I found if new_data_files is empty by accident, a new manifest file will still be created and tracked by manifest list file, which doesn't make sense to me (let me if team thinks it's a valid and ideal use case).

In this PR, I did a few changes:

  • Add a new invalid argument error code;
  • Add assertion to fail empty data files append, and empty action commit use cases.

Are these changes tested?

Existing unit tests pass, and add a validation test for invalid data file append.

@dentiny dentiny changed the title chore: Add assertion for empty data files for append action [WIP] chore: Add assertion for empty data files for append action May 8, 2025
@dentiny dentiny force-pushed the hjiang/add-assertion-for-empty-data-files branch from e3ea33a to c52a997 Compare May 8, 2025 08:51
@dentiny dentiny force-pushed the hjiang/add-assertion-for-empty-data-files branch from c52a997 to 3c21958 Compare May 8, 2025 17:13
@dentiny dentiny marked this pull request as draft May 8, 2025 17:13
@dentiny dentiny changed the title [WIP] chore: Add assertion for empty data files for append action chore: Add assertion for empty data files for append action May 8, 2025
@dentiny dentiny marked this pull request as ready for review May 8, 2025 17:44
@@ -129,6 +129,13 @@ impl<'a> SnapshotProduceAction<'a> {
data_files: impl IntoIterator<Item = DataFile>,
) -> Result<&mut Self> {
let data_files: Vec<DataFile> = data_files.into_iter().collect();
if data_files.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning a DataInvalid error is good enough here, I don't know if invalid argument is needed. cc @Xuanwo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error description reads to me as "data corruption".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense, InvalidArgument or maybe InvalidInput is more intuitive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems safe to just return Ok(()) while input iterator is empty. Returning error here seems no contribute for users' experrence. Users can't take actions on this error.

They have to add something like:

if input_data.is_empty() {
    return Ok(())
}
xxx.add_data_files(input_data).await?;

So how about we just check this and early return it?

Copy link
Contributor Author

@dentiny dentiny May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the given iterator is empty, we will end up with a manifest file with no manifest entries, with no warning / error happening (it's exactly the same behavior you described), this is something I would like to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, I think having an empty manifest file doesn't make too much sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the given iterator is empty, we will end up with a manifest file with no manifest entries, with no warning / error happening (it's exactly the same behavior you described), this is something I would like to avoid.

This is valid and I do agree that we should not generate an empty manifest file.

However, the add_data_files function itself should be able to handle empty input safely, as users might use it in loops or pipelines with filters, where encountering an empty iterator is possible.

Therefore, I prefer to perform this check when committing the manifest file, rather than within add_data_files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I want to note that add_data_files can be called multiple times and is not directly mapped to a manifest file.

@dentiny dentiny requested review from jonathanc-n and Xuanwo May 9, 2025 23:12
@dentiny dentiny requested a review from Xuanwo May 10, 2025 01:40
@dentiny dentiny force-pushed the hjiang/add-assertion-for-empty-data-files branch from f3ccbe1 to 3bf711d Compare May 10, 2025 02:05
@dentiny dentiny force-pushed the hjiang/add-assertion-for-empty-data-files branch from 3bf711d to 5b9aaf9 Compare May 10, 2025 02:06
@@ -28,6 +28,9 @@ pub type Result<T> = std::result::Result<T, Error>;
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[non_exhaustive]
pub enum ErrorKind {
/// The operation was rejected because the system is not in a state required for the operation’s execution.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description from error status is borrowed from https://grpc.io/docs/guides/status-codes/#the-full-list-of-status-codes

@dentiny
Copy link
Contributor Author

dentiny commented May 10, 2025

The failed CI test doesn't seem to be related to my change:

ailures:

---- test_drop_table stdout ----
Error: Unexpected => Failure in doing io operation

Source: ConfigInvalid (persistent) at write, context: { uri: http://172.18.0.3:9000/warehouse/hive/metadata/00000-b7e8f769-acc7-4051-9af9-ea3c2eb05962.metadata.json, response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "content-length": "470", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-amz-id-2": "dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8", "x-amz-request-id": "183E1001819BE226", "x-content-type-options": "nosniff", "x-xss-protection": "1; mode=block", "date": "Sat, 10 May 2025 04:28:24 GMT"} }, service: s3, path: hive/metadata/00000-b7e8f769-acc7-4051-9af9-ea3c2eb05962.metadata.json, written: 533 } => S3Error { code: "NoSuchBucket", message: "The specified bucket does not exist", resource: "/warehouse/hive/metadata/00000-b7e8f769-acc7-4051-9af9-ea3c2eb05962.metadata.json", request_id: "183E1001819BE226" }


---- test_load_table stdout ----
Error: Unexpected => Failure in doing io operation

Source: ConfigInvalid (persistent) at write, context: { uri: http://172.18.0.3:9000/warehouse/hive/metadata/00000-852999b5-f3a5-4f69-b338-d3c817900f0b.metadata.json, response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "content-length": "470", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-amz-id-2": "dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8", "x-amz-request-id": "183E1001861B3ED1", "x-content-type-options": "nosniff", "x-xss-protection": "1; mode=block", "date": "Sat, 10 May 2025 04:28:24 GMT"} }, service: s3, path: hive/metadata/00000-852999b5-f3a5-4f69-b338-d3c817900f0b.metadata.json, written: 533 } => S3Error { code: "NoSuchBucket", message: "The specified bucket does not exist", resource: "/warehouse/hive/metadata/00000-852999b5-f3a5-4f69-b338-d3c817900f0b.metadata.json", request_id: "183E1001861B3ED1" }


---- test_create_table stdout ----
Error: Unexpected => Failure in doing io operation

Source: ConfigInvalid (persistent) at write, context: { uri: http://172.18.0.3:9000/warehouse/hive/metadata/00000-7c0aecbb-50e3-4173-af7c-4cac0cba4113.metadata.json, response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "content-length": "470", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-amz-id-2": "dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8", "x-amz-request-id": "183E1001819BE232", "x-content-type-options": "nosniff", "x-xss-protection": "1; mode=block", "date": "Sat, 10 May 2025 04:28:24 GMT"} }, service: s3, path: hive/metadata/00000-7c0aecbb-50e3-4173-af7c-4cac0cba4113.metadata.json, written: 533 } => S3Error { code: "NoSuchBucket", message: "The specified bucket does not exist", resource: "/warehouse/hive/metadata/00000-7c0aecbb-50e3-4173-af7c-4cac0cba4113.metadata.json", request_id: "183E1001819BE232" }


---- test_rename_table stdout ----
Error: Unexpected => Failure in doing io operation

Source: ConfigInvalid (persistent) at write, context: { uri: http://172.18.0.3:9000/warehouse/hive/metadata/00000-8aacea18-eeaf-4461-9787-cb5fa133a7b0.metadata.json, response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "content-length": "470", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-amz-id-2": "dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8", "x-amz-request-id": "183E10018BC5FA17", "x-content-type-options": "nosniff", "x-xss-protection": "1; mode=block", "date": "Sat, 10 May 2025 04:28:24 GMT"} }, service: s3, path: hive/metadata/00000-8aacea18-eeaf-4461-9787-cb5fa133a7b0.metadata.json, written: 533 } => S3Error { code: "NoSuchBucket", message: "The specified bucket does not exist", resource: "/warehouse/hive/metadata/00000-8aacea18-eeaf-4461-9787-cb5fa133a7b0.metadata.json", request_id: "183E10018BC5FA17" }


---- test_table_exists stdout ----
Error: Unexpected => Failure in doing io operation

Source: ConfigInvalid (persistent) at write, context: { uri: http://172.18.0.3:9000/warehouse/hive/metadata/00000-c25734ce-2f8e-4bdd-8578-4942bf85027d.metadata.json, response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "content-length": "470", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-amz-id-2": "dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8", "x-amz-request-id": "183E10018CA7299B", "x-content-type-options": "nosniff", "x-xss-protection": "1; mode=block", "date": "Sat, 10 May 2025 04:28:24 GMT"} }, service: s3, path: hive/metadata/00000-c25734ce-2f8e-4bdd-8578-4942bf85027d.metadata.json, written: 533 } => S3Error { code: "NoSuchBucket", message: "The specified bucket does not exist", resource: "/warehouse/hive/metadata/00000-c25734ce-2f8e-4bdd-8578-4942bf85027d.metadata.json", request_id: "183E10018CA7299B" }

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dentiny for working on this and thank you @jonathanc-n for the review.

@Xuanwo Xuanwo merged commit 47d990e into apache:main May 10, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants