sync with open source how #118

Draft · wants to merge 5,604 commits into base: li_trunk

Conversation

lesterhaynes

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

johnjcasey and others added 21 commits November 12, 2024 15:27
…3089)

* Change dead partition detection to only look at the current topic, instead of looking at all topics

* spotless

* fix test, simplify existence check
* adding disableAutoCommit flag to ReadFn

---------

Co-authored-by: Chris Ashcraft <[email protected]>
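The `disableAutoCommit` flag mentioned above can be illustrated with a small sketch. This is not the actual KafkaIO API; `build_consumer_config` and its parameters are illustrative names for the underlying idea: a reader that checkpoints its own offsets should turn off Kafka's broker-side auto-commit so the two never drift apart.

```python
# Hypothetical sketch, not the real ReadFn: build consumer properties for a
# reader that tracks offsets in its own checkpoint state.
def build_consumer_config(bootstrap_servers, group_id, disable_auto_commit=False):
    """Return Kafka consumer properties for the reader."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Kafka auto-commits offsets periodically by default; a reader that
        # checkpoints its own offsets disables this so broker-side committed
        # offsets cannot diverge from the checkpoint.
        "enable.auto.commit": not disable_auto_commit,
    }
```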
)

Bumps [cloud.google.com/go/bigquery](https://github.com/googleapis/google-cloud-go) from 1.63.1 to 1.64.0.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md)
- [Commits](googleapis/google-cloud-go@bigquery/v1.63.1...spanner/v1.64.0)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/bigquery
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Lineage support for JdbcIO

* Report table the pipeline read from and write to

* add logs and documentation
…eadFromKafkaDoFn (#32921)

* Revert "Set backlog in gauge metric (#31137)"

* Revert "Add Backlog Metrics to  Kafka Splittable DoFn Implementation (#31281)"

This reverts commit fd4368f.

* Call reportBacklog in nextBatch to report split metrics more often

* Report SDF metrics for the active split/partition after processing a record batch

* Use KafkaSourceDescriptor as cache key and log entry
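The point of keying the cache by the source descriptor can be sketched generically. The `KafkaSourceDescriptor` below is a simplified stand-in for Beam's Java class of the same name, and `ConsumerCache` is an illustrative name: a frozen value object identifying the split is both a stable cache key and a readable log entry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen -> hashable, so it can be a dict key
class KafkaSourceDescriptor:
    topic: str
    partition: int

class ConsumerCache:
    """Illustrative per-split cache keyed by the descriptor value."""
    def __init__(self):
        self._cache = {}

    def get_or_create(self, descriptor, factory):
        # One cached consumer per (topic, partition); two equal descriptors
        # hit the same entry even if they are distinct objects.
        if descriptor not in self._cache:
            self._cache[descriptor] = factory()
        return self._cache[descriptor]
```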
* This is a follow-up PR to #31953, and part of the issue #31905.

This PR adds the actual writer functionality, and some additional
testing, including integration testing.

This should be the final PR for the SolaceIO write connector to be
complete.

* Use static imports for Preconditions

* Remove unused method

* Logging has builtin formatting support

* Use TypeDescriptors to check the type used as input

* Fix parameter name

* Use interface + utils class for MessageProducer

* Use null instead of optional

* Avoid using ByteString just to create an empty byte array.

* Fix documentation, we are not using ByteString now.

* Not needed anymore, we are not using ByteString

* Defer transforming latency from nanos to millis.

The transform into millis is done at the presentation moment, when
the metric is reported to Beam.
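The deferred conversion described above can be sketched as follows. `LatencyTracker` and its method names are illustrative, not the connector's real API; the point is that accumulation stays in nanoseconds and the millisecond conversion happens only at the presentation moment.

```python
# Illustrative sketch: keep latency in nanos internally, convert on report.
class LatencyTracker:
    def __init__(self):
        self._total_nanos = 0

    def record(self, start_nanos, end_nanos):
        # Accumulate in nanoseconds: no precision is lost before reporting.
        self._total_nanos += end_nanos - start_nanos

    def report_millis(self):
        # Convert only when the metric is handed to the metrics system.
        return self._total_nanos // 1_000_000
```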

* Avoid using top level classes with a single inner class.

A couple of DoFns are moved to their own files too, as the
abstract class for the UnboundedSolaceWriter was in practice a
"package".

This commit addresses a few comments about the structure of
UnboundedSolaceWriter and some base classes of that abstract
class.

* Remove using a state variable, there is already a timer.

This DoFn is a stateful DoFn to force a shuffling with a given
input key set cardinality.
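The keying idea behind that stateful DoFn can be sketched briefly. `shard_key_for` and `num_shards` are illustrative names: because a stateful step groups by key, assigning each element a key from a fixed set of `num_shards` values forces a shuffle with exactly that cardinality.

```python
# Sketch: deterministic shard key in [0, num_shards). Elements with equal
# keys are shuffled to the same worker before the stateful DoFn runs, so the
# key set cardinality bounds the parallelism of the following step.
def shard_key_for(element, num_shards):
    return hash(element) % num_shards
```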

* Properties must always be set.

The warnings are only shown if the user decided to set the
properties that are overridden by the connector.

This was changed in one of the previous commits, but that was
actually a bug. I am reverting that change and switching to a
switch block, to make it clearer that the properties must always
be set by the connector.

* Add a new custom mode so no JCSMP property is overridden.

This lets the user fully control all the properties used by the connector,
instead of the connector making sensible choices on their behalf.

This also adds some logging to be more explicit about what the connector is
doing. It does not add much logging pressure, as it only logs at producer
creation.

* Add some more documentation about the new custom submission mode.

* Fix bug introduced with the refactoring of code for this PR.

I forgot to pass the submission mode when the write session is created, and I
called the wrong method in the base class because it was defined as public.

This makes sure that the submission mode is passed to the session when the
session is created for writing messages.

* Remove unnecessary Serializable annotation.

* Make the PublishResult class for handling callbacks non-static to handle pipelines with multiple write transforms.

* Rename maxNumOfUsedWorkers to numShards

* Use RoundRobin assignment of producers to process bundles.
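The round-robin assignment described in that commit can be sketched with `itertools.cycle`. `ProducerPool` is an illustrative name, not the connector's real class: each successive bundle is handed the next producer in a fixed pool, wrapping around when the pool is exhausted.

```python
import itertools

class ProducerPool:
    """Sketch: hand out producers from a fixed pool in round-robin order."""
    def __init__(self, producers):
        self._cycle = itertools.cycle(producers)

    def next_producer(self):
        # Called once per bundle; cycles back to the first producer after
        # the last one has been handed out.
        return next(self._cycle)
```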

* Output results in a GlobalWindow

* Add ErrorHandler

* Fix docs

* Remove PublishResultHandler class that was just a wrapper around a Queue

* small refactors

* Revert CsvIO docs fix

* Add withErrorHandler docs

* fix var scope

---------

Co-authored-by: Bartosz Zablocki <[email protected]>
* managed bigqueryio

* spotless

* move managed dependency to test only

* cleanup after merging snake_case PR

* choose write method based on boundedness and pipeline options

* rename bigquery write config class

* spotless

* change read output tag to 'output'

* spotless

* revert logic that depends on DataflowServiceOptions. switching BQ methods can instead be done on the Dataflow service side

* spotless

* fix typo

* separate BQ write config to a new class

* fix doc

* resolve after syncing to HEAD

* spotless

* fork on batch/streaming

* cleanup

* spotless

* portable bigquery destinations

* move forking logic to BQ schematransform side

* add file loads translation and tests; add test checks that the correct transform is chosen

* set top-level wrapper to be the underlying managed BQ transform urn; change tests to verify underlying transform name

* move unit tests to respective schematransform test classes

* expose to Python SDK as well

* cleanup

* address comment

* set enable_streaming_engine option; add to CHANGES
Beam Yaml's error handling framework returns per-record errors as a
schema'd PCollection with associated error metadata (e.g. error messages,
tracebacks). Currently there is no way to "unnest" the nested records
(except field by field) back to the top level if one wants to
re-process these records (or otherwise ignore the metadata).  Even if
there were a way to do this "up-one-level" unnesting, it's not clear that
it would be obvious for users to find.  Worse, various forms of error
handling are not consistent in what the "bad records" schema is, or
even where the original record is found (though we do have a caveat in
the docs that this is still not set in stone).

This adds a simple, easy-to-identify transform that abstracts all of
these complexities away for the basic use case.
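The "unnest" idea can be sketched as below. The field names (`element`, `error_message`) are assumptions for illustration only; as the description notes, the actual bad-records schema varies by transform and is not yet set in stone.

```python
# Hedged sketch: given an error record whose original element is nested
# alongside error metadata, return just the original element.
def strip_error_metadata(error_record, element_field="element"):
    """Return the nested original record, discarding the error metadata."""
    return error_record[element_field]
```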
* set enable_streaming_engine option

* trigger test

* trigger test

* revert test trigger
Update dataframes to PEP 585 typing
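The PEP 585 update is mechanical, as this before/after sketch shows: the deprecated `typing.List`, `typing.Dict`, etc. are replaced by the builtin generics available from Python 3.9.

```python
from typing import List  # pre-PEP 585 spelling

def head_old(xs: List[int]) -> int:  # old style
    return xs[0]

def head_new(xs: list[int]) -> int:  # PEP 585 style, Python 3.9+
    return xs[0]
```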
* Make AvroUtils compatible with older versions of Avro

* Create beam_PostCommit_Java_Avro_Versions.json

* Update AvroUtils.java

* Fix nullness
* fix JDBC providers

Signed-off-by: Jeffrey Kinard <[email protected]>

* fix test failures

Signed-off-by: Jeffrey Kinard <[email protected]>

* fix typo

Signed-off-by: Jeffrey Kinard <[email protected]>

---------

Signed-off-by: Jeffrey Kinard <[email protected]>
Update container image to pick up recent changes.
robertwb and others added 30 commits December 17, 2024 10:44
* Add TF MNIST classification cost benchmark

* linting

* Generalize to single workflow file for cost benchmarks

* fix incorrect UTC time in comment

* move wordcount to same workflow

* update workflow job name
…33402)

Committing consumer offsets to Kafka is not critical for KafkaIO because it relies on the offsets stored in KafkaCheckpointMark, but throwing an exception makes Dataflow retry the same work item unnecessarily.
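The behavior described above can be sketched generically. `commit_offsets_best_effort` is an illustrative name, not KafkaIO's real code: since the checkpoint (KafkaCheckpointMark) is the source of truth for offsets, a failed broker-side commit is logged rather than raised, avoiding an unnecessary work-item retry.

```python
import logging

log = logging.getLogger(__name__)

def commit_offsets_best_effort(consumer, offsets):
    """Commit offsets to Kafka, treating failure as non-fatal."""
    try:
        consumer.commit(offsets)
        return True
    except Exception as e:
        # Raising here would make the runner retry the whole work item;
        # logging suffices because restored state comes from the checkpoint.
        log.warning("Offset commit failed (non-fatal): %s", e)
        return False
```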
* Add missing to_type_hint to WindowedValueCoder

* Add type ignore to make mypy happy.
- Fix custom coder not being used in Reshuffle (global window) (#33339)
- Fix custom coders not being used in Reshuffle (non global window) #33363
- Add missing to_type_hint to WindowedValueCoder #33403
* [YAML] Better docs for Filter and MapToFields.

* Remove redundant optional indicators.

* Update sdks/python/apache_beam/yaml/yaml_mapping.py

Co-authored-by: Jeff Kinard <[email protected]>

---------

Co-authored-by: Jeff Kinard <[email protected]>
* Fix env variable loading in Cost Benchmark workflow

* fix output file for tf mnist

* add load test requirements file arg

* update mnist args

* revert how args are passed

* assign result correctly
Revert three commits related to supporting custom coder in reshuffle
* improve python multi-lang examples

* minor adjustments
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 27.3.1+incompatible to 27.4.1+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](moby/moby@v27.3.1...v27.4.1)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.212.0 to 0.214.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](googleapis/google-api-go-client@v0.212.0...v0.214.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>