[Bug]: beam.io.WriteToText deletes existing file even if skip_if_empty=True #27926

Closed
2 of 15 tasks
efung opened this issue Aug 9, 2023 · 3 comments · Fixed by #30409

efung commented Aug 9, 2023

What happened?

I am trying to write a status value to a file via beam.io.WriteToText. If the input PCollection is empty, I don't want the file to be overwritten. I've set the argument skip_if_empty=True but the file gets deleted.

I initially encountered this bug when writing to a file in GCS, but have also reproduced using a file on my local computer.

I'm using macOS 13.5, Python 3.9.10, Apache Beam 2.46.0

Steps to reproduce

  1. Run the attached Python script, skip_if_empty.txt, like this:
    python skip_if_empty.txt --output-file test.txt --project <some_gcp_project>
  2. Note that a timestamp value is written into test.txt
  3. Now, edit the script and comment out the string in the list passed to beam.Create, so that the collection is empty.
  4. Run the script again as above.
  5. Observe these warnings printed to the console:
    WARNING:apache_beam.io.filebasedsink:Deleting 1 existing files in target path matching:
    WARNING:apache_beam.io.filebasedsink:No shards found to finalize. num_shards: 0, skipped: 0
  6. Observe that test.txt has now been deleted.
  7. Repeat the above using gs://some_gcp_project/path/to/test.txt as the output file (if you have access to a GCP project and GCS)

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@damccorm
Contributor

I'm not sure if this is actually a bug. skip_if_empty is a parameter controlling whether we write files or not (from the pydoc "Don’t write any shards if the PCollection is empty."). In general, this transform assumes that the destination is empty, or it will clear it to be empty.

It's not actually clear to me that skip_if_empty should impact our deletion behavior though; I think if I'm writing an empty PCollection, I would expect the result in destination to be either no file or an empty file depending on the parameter. If the end contents are the same, it (to me) indicates that an identical PCollection was received and rewritten to the file. Basically, deleting the file is the only way we can be certain that the PCollection was empty.

Because of this, I'm hesitant to change the meaning of this parameter (note that this is also mildly breaking).
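
The semantics described in this comment can be sketched as a toy model in plain Python. This is not Beam's implementation; the function name `finalize_write` and its parameters are hypothetical and only mirror the two steps described: the destination is cleared first, and `skip_if_empty` then decides only whether an empty output file is created.

```python
import os
import tempfile


def finalize_write(target_path, records, skip_if_empty):
    """Toy model of the described behavior (not Beam's actual code)."""
    # Step 1: the sink assumes an empty destination, so any existing
    # file at the target path is removed, regardless of skip_if_empty.
    if os.path.exists(target_path):
        os.remove(target_path)

    # Step 2: skip_if_empty only controls whether an empty collection
    # still produces an (empty) output file.
    if not records and skip_if_empty:
        return

    with open(target_path, "w") as handle:
        for record in records:
            handle.write(record + "\n")


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "test.txt")

    finalize_write(target, ["status"], skip_if_empty=True)
    print("after non-empty write:", os.path.exists(target))

    finalize_write(target, [], skip_if_empty=True)
    print("after empty write, skip_if_empty=True:", os.path.exists(target))

    finalize_write(target, [], skip_if_empty=False)
    print("after empty write, skip_if_empty=False:", os.path.exists(target))
```

In this model an empty input leaves either no file or an empty file, never the stale contents of a previous run, which is the property the comment argues for.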

@damccorm
Contributor

I believe this is also consistent across our Java/Python implementations.

@riteshghorse
Contributor

Oh yes, that makes sense. I didn't think of that. This could confuse users. Added a doc comment instead.

@github-actions github-actions bot added this to the 2.55.0 Release milestone Feb 27, 2024