-
Notifications
You must be signed in to change notification settings - Fork 255
chore: Add assertion for empty data files for append action #1301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Add assertion for empty data files for append action #1301
Conversation
e3ea33a
to
c52a997
Compare
c52a997
to
3c21958
Compare
@@ -129,6 +129,13 @@ impl<'a> SnapshotProduceAction<'a> { | |||
data_files: impl IntoIterator<Item = DataFile>, | |||
) -> Result<&mut Self> { | |||
let data_files: Vec<DataFile> = data_files.into_iter().collect(); | |||
if data_files.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning a DataInvalid
error is good enough here, I don't know if invalid argument is needed. cc @Xuanwo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error description reads to me as "data corruption".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense, InvalidArgument
or maybe InvalidInput
is more intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems safe to just return Ok(())
while input iterator is empty. Returning error here seems no contribute for users' experrence. Users can't take actions on this error.
They have to add something like:
if input_data.is_empty() {
return Ok(())
}
xxx.add_data_files(input_data).await?;
So how about we just check this and early return it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the given iterator is empty, we will end up with a manifest file with no manifest entries, with no warning / error happening (it's exactly the same behavior you described), this is something I would like to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, I think having an empty manifest file doesn't make too much sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the given iterator is empty, we will end up with a manifest file with no manifest entries, with no warning / error happening (it's exactly the same behavior you described), this is something I would like to avoid.
This is valid and I do agree that we should not generate an empty manifest file.
However, the add_data_files
function itself should be able to handle empty input safely, as users might use it in loops or pipelines with filters, where encountering an empty iterator is possible.
Therefore, I prefer to perform this check when committing the manifest file, rather than within add_data_files
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I want to note that add_data_files
can be called multiple times and is not directly mapped to a manifest file.
f3ccbe1
to
3bf711d
Compare
3bf711d
to
5b9aaf9
Compare
@@ -28,6 +28,9 @@ pub type Result<T> = std::result::Result<T, Error>; | |||
#[derive(Clone, Copy, Debug, PartialEq, Eq)] | |||
#[non_exhaustive] | |||
pub enum ErrorKind { | |||
/// The operation was rejected because the system is not in a state required for the operation’s execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description from error status is borrowed from https://grpc.io/docs/guides/status-codes/#the-full-list-of-status-codes
The failed CI test doesn't seem to be related to my change:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dentiny for working on this and thank you @jonathanc-n for the review.
What changes are included in this PR?
The general usage to append data files is
I found if
new_data_files
is empty by accident, a new manifest file will still be created and tracked by manifest list file, which doesn't make sense to me (let me if team thinks it's a valid and ideal use case).In this PR, I did a few changes:
Are these changes tested?
Existing unit tests pass, and add a validation test for invalid data file append.