sync with open source how #118

Draft · wants to merge 5,604 commits into base: li_trunk

Conversation

lesterhaynes

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

johnjcasey and others added 21 commits November 12, 2024 15:27
…3089)

* Change dead partition detection to only look at the current topic, instead of looking at all topics

* spotless

* fix test, simplify existence check
* adding disableAutoCommit flag to ReadFn

---------

Co-authored-by: Chris Ashcraft <[email protected]>
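The `disableAutoCommit` flag mentioned above can be illustrated with a small sketch. This is not the actual KafkaIO API; `build_consumer_config` and its parameters are illustrative names for the underlying idea: a reader that checkpoints its own offsets should turn off Kafka's broker-side auto-commit so the two never drift apart.

```python
# Hypothetical sketch, not the real ReadFn: build consumer properties for a
# reader that tracks offsets in its own checkpoint state.
def build_consumer_config(bootstrap_servers, group_id, disable_auto_commit=False):
    """Return Kafka consumer properties for the reader."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Kafka auto-commits offsets periodically by default; a reader that
        # checkpoints its own offsets disables this so broker-side committed
        # offsets cannot diverge from the checkpoint.
        "enable.auto.commit": not disable_auto_commit,
    }
```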
)

Bumps [cloud.google.com/go/bigquery](https://github.com/googleapis/google-cloud-go) from 1.63.1 to 1.64.0.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md)
- [Commits](googleapis/google-cloud-go@bigquery/v1.63.1...spanner/v1.64.0)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/bigquery
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Lineage support for JdbcIO

* Report table the pipeline read from and write to

* add logs and documentation
…eadFromKafkaDoFn (#32921)

* Revert "Set backlog in gauge metric (#31137)"

* Revert "Add Backlog Metrics to  Kafka Splittable DoFn Implementation (#31281)"

This reverts commit fd4368f.

* Call reportBacklog in nextBatch to report split metrics more often

* Report SDF metrics for the active split/partition after processing a record batch

* Use KafkaSourceDescriptor as cache key and log entry
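The point of keying the cache by the source descriptor can be sketched generically. The `KafkaSourceDescriptor` below is a simplified stand-in for Beam's Java class of the same name, and `ConsumerCache` is an illustrative name: a frozen value object identifying the split is both a stable cache key and a readable log entry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen -> hashable, so it can be a dict key
class KafkaSourceDescriptor:
    topic: str
    partition: int

class ConsumerCache:
    """Illustrative per-split cache keyed by the descriptor value."""
    def __init__(self):
        self._cache = {}

    def get_or_create(self, descriptor, factory):
        # One cached consumer per (topic, partition); two equal descriptors
        # hit the same entry even if they are distinct objects.
        if descriptor not in self._cache:
            self._cache[descriptor] = factory()
        return self._cache[descriptor]
```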
* This is a follow-up PR to #31953, and part of the issue #31905.

This PR adds the actual writer functionality, and some additional
testing, including integration testing.

This should be the final PR for the SolaceIO write connector to be
complete.

* Use static imports for Preconditions

* Remove unused method

* Logging has builtin formatting support

* Use TypeDescriptors to check the type used as input

* Fix parameter name

* Use interface + utils class for MessageProducer

* Use null instead of optional

* Avoid using ByteString just to create an empty byte array.

* Fix documentation, we are not using ByteString now.

* Not needed anymore, we are not using ByteString

* Defer transforming latency from nanos to millis.

The transform into millis is done at the presentation moment, when
the metric is reported to Beam.
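The deferred conversion described above can be sketched as follows. `LatencyTracker` and its method names are illustrative, not the connector's real API; the point is that accumulation stays in nanoseconds and the millisecond conversion happens only at the presentation moment.

```python
# Illustrative sketch: keep latency in nanos internally, convert on report.
class LatencyTracker:
    def __init__(self):
        self._total_nanos = 0

    def record(self, start_nanos, end_nanos):
        # Accumulate in nanoseconds: no precision is lost before reporting.
        self._total_nanos += end_nanos - start_nanos

    def report_millis(self):
        # Convert only when the metric is handed to the metrics system.
        return self._total_nanos // 1_000_000
```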

* Avoid using top level classes with a single inner class.

A couple of DoFns are moved to their own files too, as the
abstract class for the UnboundedSolaceWriter was in practice a
"package".

This commit addresses a few comments about the structure of
UnboundedSolaceWriter and some base classes of that abstract
class.

* Remove using a state variable, there is already a timer.

This DoFn is a stateful DoFn to force a shuffling with a given
input key set cardinality.
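The keying idea behind that stateful DoFn can be sketched briefly. `shard_key_for` and `num_shards` are illustrative names: because a stateful step groups by key, assigning each element a key from a fixed set of `num_shards` values forces a shuffle with exactly that cardinality.

```python
# Sketch: deterministic shard key in [0, num_shards). Elements with equal
# keys are shuffled to the same worker before the stateful DoFn runs, so the
# key set cardinality bounds the parallelism of the following step.
def shard_key_for(element, num_shards):
    return hash(element) % num_shards
```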

* Properties must always be set.

The warnings are only shown if the user decided to set the
properties that are overridden by the connector.

This was changed in one of the previous commits, but that was
actually a bug. I am reverting that change and switching to a
switch block, to make it clearer that the properties must always
be set by the connector.

* Add a new custom mode so no JCSMP property is overridden.

This lets the user fully control all the properties used by the connector,
instead of the connector making sensible choices on their behalf.

This also adds some logging to be more explicit about what the connector is
doing. It does not add much logging pressure, as it only logs at producer
creation.

* Add some more documentation about the new custom submission mode.

* Fix bug introduced with the refactoring of code for this PR.

I forgot to pass the submission mode when the write session is created, and I
called the wrong method in the base class because it was defined as public.

This makes sure that the submission mode is passed to the session when the
session is created for writing messages.

* Remove unnecessary Serializable annotation.

* Make the PublishResult class for handling callbacks non-static to handle pipelines with multiple write transforms.

* Rename maxNumOfUsedWorkers to numShards

* Use RoundRobin assignment of producers to process bundles.
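The round-robin assignment described in that commit can be sketched with `itertools.cycle`. `ProducerPool` is an illustrative name, not the connector's real class: each successive bundle is handed the next producer in a fixed pool, wrapping around when the pool is exhausted.

```python
import itertools

class ProducerPool:
    """Sketch: hand out producers from a fixed pool in round-robin order."""
    def __init__(self, producers):
        self._cycle = itertools.cycle(producers)

    def next_producer(self):
        # Called once per bundle; cycles back to the first producer after
        # the last one has been handed out.
        return next(self._cycle)
```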

* Output results in a GlobalWindow

* Add ErrorHandler

* Fix docs

* Remove PublishResultHandler class that was just a wrapper around a Queue

* small refactors

* Revert CsvIO docs fix

* Add withErrorHandler docs

* fix var scope

---------

Co-authored-by: Bartosz Zablocki <[email protected]>
* managed bigqueryio

* spotless

* move managed dependency to test only

* cleanup after merging snake_case PR

* choose write method based on boundedness and pipeline options

* rename bigquery write config class

* spotless

* change read output tag to 'output'

* spotless

* revert logic that depends on DataflowServiceOptions. switching BQ methods can instead be done on the Dataflow service side

* spotless

* fix typo

* separate BQ write config to a new class

* fix doc

* resolve after syncing to HEAD

* spotless

* fork on batch/streaming

* cleanup

* spotless

* portable bigquery destinations

* move forking logic to BQ schematransform side

* add file loads translation and tests; add test checks that the correct transform is chosen

* set top-level wrapper to be the underlying managed BQ transform urn; change tests to verify underlying transform name

* move unit tests to respective schematransform test classes

* expose to Python SDK as well

* cleanup

* address comment

* set enable_streaming_engine option; add to CHANGES
Beam Yaml's error handling framework returns per-record errors as a
schema'd PCollection with associated error metadata (e.g. error messages,
tracebacks). Currently there is no way to "unnest" the nested records
(except field by field) back to the top level if one wants to
re-process these records (or otherwise ignore the metadata).  Even if
there were a way to do this "up-one-level" unnesting, it's not clear that
it would be obvious for users to find.  Worse, various forms of error
handling are not consistent in what the "bad records" schema is, or
even where the original record is found (though we do have a caveat in
the docs that this is still not set in stone).

This adds a simple, easy-to-identify transform that abstracts all of
these complexities away for the basic use case.
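The "unnest" idea can be sketched as below. The field names (`element`, `error_message`) are assumptions for illustration only; as the description notes, the actual bad-records schema varies by transform and is not yet set in stone.

```python
# Hedged sketch: given an error record whose original element is nested
# alongside error metadata, return just the original element.
def strip_error_metadata(error_record, element_field="element"):
    """Return the nested original record, discarding the error metadata."""
    return error_record[element_field]
```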
* set enable_streaming_engine option

* trigger test

* trigger test

* revert test trigger
Update dataframes to PEP 585 typing
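The PEP 585 update is mechanical, as this before/after sketch shows: the deprecated `typing.List`, `typing.Dict`, etc. are replaced by the builtin generics available from Python 3.9.

```python
from typing import List  # pre-PEP 585 spelling

def head_old(xs: List[int]) -> int:  # old style
    return xs[0]

def head_new(xs: list[int]) -> int:  # PEP 585 style, Python 3.9+
    return xs[0]
```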
* Make AvroUtils compatible with older versions of Avro

* Create beam_PostCommit_Java_Avro_Versions.json

* Update AvroUtils.java

* Fix nullness
* fix JDBC providers

Signed-off-by: Jeffrey Kinard <[email protected]>

* fix test failures

Signed-off-by: Jeffrey Kinard <[email protected]>

* fix typo

Signed-off-by: Jeffrey Kinard <[email protected]>

---------

Signed-off-by: Jeffrey Kinard <[email protected]>
Update container image to pick up recent changes.
robertwb and others added 30 commits December 17, 2024 10:44
* Add TF MNIST classification cost benchmark

* linting

* Generalize to single workflow file for cost benchmarks

* fix incorrect UTC time in comment

* move wordcount to same workflow

* update workflow job name
…33402)

Committing consumer offsets to Kafka is not critical for KafkaIO because it relies on the offsets stored in KafkaCheckpointMark, but throwing an exception makes Dataflow retry the same work item unnecessarily.
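The behavior described above can be sketched generically. `commit_offsets_best_effort` is an illustrative name, not KafkaIO's real code: since the checkpoint (KafkaCheckpointMark) is the source of truth for offsets, a failed broker-side commit is logged rather than raised, avoiding an unnecessary work-item retry.

```python
import logging

log = logging.getLogger(__name__)

def commit_offsets_best_effort(consumer, offsets):
    """Commit offsets to Kafka, treating failure as non-fatal."""
    try:
        consumer.commit(offsets)
        return True
    except Exception as e:
        # Raising here would make the runner retry the whole work item;
        # logging suffices because restored state comes from the checkpoint.
        log.warning("Offset commit failed (non-fatal): %s", e)
        return False
```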
* Add missing to_type_hint to WindowedValueCoder

* Add type ignore to make mypy happy.
- Fix custom coder not being used in Reshuffle (global window) (#33339)
- Fix custom coders not being used in Reshuffle (non global window) #33363
- Add missing to_type_hint to WindowedValueCoder #33403
* [YAML] Better docs for Filter and MapToFields.

* Remove redundant optional indicators.

* Update sdks/python/apache_beam/yaml/yaml_mapping.py

Co-authored-by: Jeff Kinard <[email protected]>

---------

Co-authored-by: Jeff Kinard <[email protected]>
* Fix env variable loading in Cost Benchmark workflow

* fix output file for tf mnist

* add load test requirements file arg

* update mnist args

* revert how args are passed

* assign result correctly
Revert three commits related to supporting custom coder in reshuffle
* improve python multi-lang examples

* minor adjustments
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 27.3.1+incompatible to 27.4.1+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](moby/moby@v27.3.1...v27.4.1)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.212.0 to 0.214.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](googleapis/google-api-go-client@v0.212.0...v0.214.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>