I am trying to write a status value to a file via beam.io.WriteToText. If the input PCollection is empty, I don't want the file to be overwritten. I've set the argument skip_if_empty=True but the file gets deleted.
I initially encountered this bug when writing to a file in GCS, but have also reproduced using a file on my local computer.
I'm using macOS 13.5, Python 3.9.10, Apache Beam 2.46.0
Steps to reproduce
Run the attached Python script, skip_if_empty.txt, like this: python skip_if_empty.txt --output-file test.txt --project <some_gcp_project>
Note that a timestamp value is written into test.txt
Now, edit the script and comment out the string in the list passed to beam.Create, so that the collection is empty.
Run the script again as above.
Observe these warnings printed to the console:
WARNING:apache_beam.io.filebasedsink:Deleting 1 existing files in target path matching:
WARNING:apache_beam.io.filebasedsink:No shards found to finalize. num_shards: 0, skipped: 0
Observe that test.txt has now been deleted.
Repeat the above using gs://some_gcp_project/path/to/test.txt as the output file (if you have access to a GCP project and GCS)
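The attached script is not reproduced in this thread, so the following is only a minimal sketch of what such a reproduction might look like. The `--output-file` flag, the timestamp element, and `skip_if_empty=True` come from the report; the function names `parse_args` and `run`, the `empty` switch, and `shard_name_template=""` (so that output goes to exactly the named file) are assumptions made for illustration.

```python
import argparse
import time


def parse_args(argv=None):
    """Split --output-file from the remaining pipeline flags (e.g. --project)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-file", required=True)
    return parser.parse_known_args(argv)


def run(argv=None, empty=False):
    # Imported here so that argument parsing can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    args, pipeline_args = parse_args(argv)
    # First run: one timestamp string. Second run: pass empty=True, which
    # stands in for commenting out the string in the list given to beam.Create.
    elements = [] if empty else [str(time.time())]

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
        (
            pipeline
            | beam.Create(elements)
            | beam.io.WriteToText(
                args.output_file,
                shard_name_template="",  # assumption: write to the exact path
                skip_if_empty=True,
            )
        )
```

Calling `run()` once with the default arguments and once with `empty=True`, for example from an `if __name__ == "__main__":` guard, should correspond to the two runs described in the steps above.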
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
I'm not sure if this is actually a bug. skip_if_empty is a parameter controlling whether we write files or not (from the pydoc: "Don’t write any shards if the PCollection is empty."). In general, this transform assumes that the destination is empty; if it is not, the transform clears it.
It's not actually clear to me that skip_if_empty should impact our deletion behavior though; if I'm writing an empty PCollection, I would expect the result at the destination to be either no file or an empty file, depending on the parameter. If the end contents are the same as before, that (to me) indicates that an identical PCollection was received and rewritten to the file. Basically, deleting the file is the only way we can be certain that the PCollection was empty.
Because of this, I'm hesitant to change the meaning of this parameter (note that this is also mildly breaking).
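To make the semantics described in this comment concrete, here is a toy model in plain Python. It is not Beam's implementation; `write_like_sink` is a hypothetical helper that only mimics the described behavior: clear whatever is at the destination first, then write either nothing or an empty file depending on `skip_if_empty`.

```python
import tempfile
from pathlib import Path


def write_like_sink(dest: Path, records, skip_if_empty: bool = False) -> None:
    """Toy model: the destination is always cleared before anything is written."""
    if dest.exists():
        dest.unlink()  # pre-existing output is removed regardless of the flag
    if not records and skip_if_empty:
        return  # empty input: no file is produced
    dest.write_text("".join(f"{record}\n" for record in records))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test.txt"

    write_like_sink(target, ["status"], skip_if_empty=True)
    exists_after_first_write = target.exists()

    write_like_sink(target, [], skip_if_empty=True)
    exists_after_empty_skip = target.exists()

    write_like_sink(target, [], skip_if_empty=False)
    empty_file_written = target.exists() and target.read_text() == ""
```

Under this model, the flag only selects between "no file" and "empty file" for empty input; it never preserves an earlier file, which matches the outcome reported in the issue.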