Skip to content

Commit

Permalink
Duet AI Prompts - Documentation Lookup Without Links (apache#30873)
Browse files Browse the repository at this point in the history
* Create 53_dataflow_runner.md

* Create 54_spark_runner.md

* Create 55_direct_runner.md

* Create 29_advanced_side_inputs.md

* Create 30_advanced_additional_outputs.md

* Create 31_advanced_batched_dofn.md

* Update 31_advanced_batched_dofn.md

* Update 30_advanced_additional_outputs.md

* Create 01_basic_learning_apache_beam.md

* Create 02_basic_pipelines.md

* Create 03_basic_configuring_pipelines.md

* Update 04_basic_pcollections.md

* Create 04_basic_pcollections.md

* Update 05_basic_ptransforms.md

* Create 05_basic_ptransforms.md

* Update 06_basic_schema.md

* Create 06_basic_schema.md

* Update 07_basic_runners.md

* Update 07_basic_runners.md

* Create 07_basic_runners.md

* Update 07_basic_runners.md

* Update 08_basic_windowing.md

* Create 08_basic_windowing.md

* Update 09_basic_triggers.md

* Create 09_basic_triggers.md

* Update 10_basic_metrics.md

* Update 10_basic_metrics.md

* Create 10_basic_metrics.md

* Update 11_basic_state.md

* Create 11_basic_state.md

* Update 12_basic_timers.md

* Update 12_basic_timers.md

* Create 12_basic_timers.md

* Update 13_advanced_splittable_dofn.md

* Update 13_advanced_splittable_dofn.md

* Create 13_advanced_splittable_dofn.md

* Update 13_advanced_splittable_dofn.md

* Update 14_advanced_pipeline_patterns.md

* Update 14_advanced_pipeline_patterns.md

* Create 14_advanced_pipeline_patterns.md

* Update 15_advanced_xlang.md

* Create 15_advanced_xlang.md

* Update 16_advanced_pipeline_lifecycle.md

* Update 16_advanced_pipeline_lifecycle.md

* Create 16_advanced_pipeline_lifecycle.md

* Update 17_advanced_ai_ml.md

* Update 17_advanced_ai_ml.md

* Update 17_advanced_ai_ml.md

* Create 17_advanced_ai_ml.md

* Update 18_io_kafka.md

* Create 18_io_kafka.md

* Update 19_io_pubsub.md

* Create 19_io_pubsub.md

* Update 20_io_biguery.md

* Create 20_io_biguery.md

* Update 21_io_bigtable.md

* Create 21_io_bigtable.md

* Update 22_io_spanner.md

* Create 22_io_spanner.md

* Update 23_io_gcs.md

* Create 23_io_gcs.md

* Update 24_io_json.md

* Create 24_io_json.md

* Update 25_io_csv.md

* Create 25_io_csv.md

* Update 26_io_avro.md

* Create 26_io_avro.md

* Update 27_io_parquet.md

* Create 27_io_parquet.md

* Update 28_io_jdbc.md

* Create 28_io_jdbc.md

* Update learning/prompts/documentation-lookup-nolinks/08_basic_windowing.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/09_basic_triggers.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/18_io_kafka.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/19_io_pubsub.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/20_io_biguery.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/20_io_biguery.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/21_io_bigtable.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/22_io_spanner.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/24_io_json.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/26_io_avro.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/27_io_parquet.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/25_io_csv.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/28_io_jdbc.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/30_advanced_additional_outputs.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/29_advanced_side_inputs.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/30_advanced_additional_outputs.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/30_advanced_additional_outputs.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/30_advanced_additional_outputs.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/53_dataflow_runner.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update learning/prompts/documentation-lookup-nolinks/54_spark_runner.md

Co-authored-by: Andrey Devyatkin <[email protected]>

* Update 53_dataflow_runner.md

* Update 07_basic_runners.md

* Update 53_dataflow_runner.md

---------

Co-authored-by: Andrey Devyatkin <[email protected]>
  • Loading branch information
dariabezkorovaina and andreydevyatkin authored Apr 8, 2024
1 parent 58e4f83 commit 1c962a5
Show file tree
Hide file tree
Showing 61 changed files with 1,314 additions and 142 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
Prompt:
What is Apache Beam?

Response:
Apache Beam is an open-source unified programming model for data processing pipelines. It provides a set of language-specific SDKs that allow you to define and execute data processing workflows.
Apache Beam allows you to write pipelines in one of the supported languages (Java, Python, Go, and Scala). The pipelines can be executed on one of the supported runners (Apache Flink, Apache Spark, and Google Cloud Dataflow).

Learning Apache Beam is a valuable skill for data processing and analysis. Here are some resources to get you started:
* The 'Get Started' section of the Apache Beam website is a perfect place to start. It provides an overview of Apache Beam and its key concepts.
* For a hands-on introduction, you can start by exploring the WordCount example provided in the 'Get Started' section, which demonstrates the basic concepts of Apache Beam.
* Tour of Beam is an interactive tour that teaches you core Beam concepts. It provides a sandbox environment to write and run pipelines while exploring different topics.
* Beam Playground offers an interactive space to experiment with the Apache Beam transforms and examples without installing anything on your system.

After you have a basic understanding of Apache Beam, you can start to build your own pipelines. The Apache Beam website has a number of resources that can help you get started, including Apache Beam Documentation and Apache Beam Community. For those interested in the source code, you can explore the Apache Beam repository on GitHub.

By leveraging these resources, you can build a strong foundation in Apache Beam and start constructing your own data processing pipelines. Additionally, Apache Beam is an open-source project, and contributions are welcome. Whether you're fixing bugs, improving documentation, or adding new features, your work can benefit the entire Beam community. To learn more about how you can contribute, visit the 'Contribute' section of the Apache Beam website.
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
Prompt:
What is a pipeline in Apache Beam?

Response:
In Apache Beam, a pipeline encapsulates the entirety of a data processing task, including all the data and each step of the process. Essentially, it's a Directed Acyclic Graph of transformations (known as `PTransform`) applied to data collections (`PCollection`).

The simplest pipelines in Apache Beam follow a linear flow of operations, typically adhering to a read-process-write pattern. However, pipelines can also be significantly more complex, featuring multiple input sources, output sinks, and operations (known as a `PTransform`) that can both read from and output to multiple `PCollection` objects. For more information about pipeline design and best practices, see the 'Common Pipeline Patterns' section in the Apache Beam documentation.

To use Apache Beam, your driver program must first create an instance of the Beam SDK class `Pipeline` (typically in the `main()` function).

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
pass # Build your pipeline here.
```

When you create your pipeline, you need to set configuration options. You can set your pipeline’s configuration options programmatically. You can also set the options ahead of time, or read them from the command line, and then pass them to the `Pipeline` object when you create the object. This option is often easier than setting the pipeline configuration options programmatically.
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
Prompt:
How do I configure pipeline in Apache Beam?

Response:
Configuring pipeline options in Apache Beam is crucial for specifying the execution environment, managing resources, and tailoring the pipeline's behavior to meet specific data processing needs and performance requirements.

You can set pipeline options programmatically or pass them in using the command line. These options include runner-specific settings, job name, project ID (for cloud runners), machine types, number of workers, and more.

Apache Beam offers a variety of standard pipeline options that allow you to customize and optimize your data processing pipelines.

Beam SDKs include a command-line parser that you can use to set pipeline options. Use command-line arguments in the format `--<option>=<value>`. For example, the following command sets the `--runner` option `DirectRunner` and the `--project` option `my-project-id`:

```bash
python my-pipeline.py --runner=DirectRunner --project=my-project-id
```

To set the pipeline options programmatically, use the `PipelineOptions` class. For example, the following code sets the `--runner` option to `DirectRunner` and the `--project` option to `my-project-id`:

```python
from apache_beam import Pipeline
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
project='my-project-id',
runner='DirectRunner'
)
```

In addition to the standard pipeline options, you can add custom pipeline options. For a common pattern for configuring pipeline options, see the 'Pipeline option patterns' section in the Apache Beam documentation.

The WordCount example pipeline in the 'Get Started' section of the Apache Beam documentation demonstrates how to set pipeline options at runtime by using command-line options.
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Prompt:
What is a `PCollection` in Apache Beam?

Response:
A `PCollection` in Apache Beam is an unordered bag of elements. Each `PCollection` is a potentially distributed, homogeneous data set or data stream, and is owned by the specific `Pipeline` object for which it is created. It’s the primary data structure used in Apache Beam pipelines to handle large-scale data processing in batch and streaming modes.

The following example shows how to create a `PCollection` using the `Create` transform:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
pcollection = pipeline | beam.Create([...]) # Create a PCollection
```

A `PCollection` can either be bounded or unbounded, making it versatile for different data source types. Bounded `PCollection` objects represent a finite data set, such as files or databases, ideal for batch processing. Unbounded `PCollection` objects, on the other hand, represent data streams that continuously grow over time, such as real-time event logs, suitable for stream processing.

Beam’s computational patterns and transforms are focused on situations where distributed data-parallel computation is required. Therefore, a `PCollection` has the following key characteristics:
- All elements must be of the same type (with support of structured types).
- Every `PCollection` has a coder, which is a specification of the binary format of the elements.
- Elements are immutable. They can't be altered after creation.
- Random access to individual elements of the collection isn't allowed.
- Beam encodes each element for distributed processing.

For more information about `PCollection` objects in Apache Beam and usage patterns, see the 'Beam Programming Guide: PCollections' section in the Apache Beam documentation.
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
Prompt:
What is a `PTransform` in Apache Beam?

Response:
A `PTransform` (or transform) represents a data processing operation, or a step, in an Apache Beam pipeline. A transform is applied to zero or more `PCollection` objects and produces zero or more `PCollection` objects.

Transforms have the following key characteristics:
1. Versatility: Able to execute a diverse range of operations on `PCollection` objects.
2. Composability: Can be combined to form elaborate data processing pipelines.
3. Parallel execution: Designed for distributed processing, allowing simultaneous execution across multiple workers.
4. Scalability: Able to handle extensive data and suitable for both batch and streaming data.

The Beam SDKs contain different transforms that you can apply to your pipeline’s `PCollection` objects. The following list includes common transform types:
- Source transforms such as `TextIO.Read` and `Create`. A source transform conceptually has no input.
- Processing and conversion operations such as `ParDo`, `GroupByKey`, `CoGroupByKey`, `Combine`, and `Count`.
- Outputting transforms such as `TextIO.Write`.
- User-defined, application-specific composite transforms.

Transform processing logic is provided in the form of a function object, colloquially referred to as 'user code'. This code is applied to each element of the input `PCollection` (or more than one `PCollection`). The `PCollection` objects can be linked together to create complex data processing sequences.
User code for transforms must satisfy the requirements of the Beam model.

The following example shows how to apply custom user code to a `PCollection` using the `ParDo` transform:

```python
import apache_beam as beam

def SomeUserCode(element):
# Do something with an element
return element

with beam.Pipeline() as pipeline:
input_collection = pipeline | beam.Create([...])
output_collection = input_collection | beam.ParDo(SomeUserCode())
```
25 changes: 25 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/06_basic_schema.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Prompt:
What are schemas in Apache Beam?

Response:
A Schema in Apache Beam is a language-independent type definition for a `PCollection`. The schema defines elements of that `PCollection` as an ordered list of named fields.

In many cases, the element type in a `PCollection` has a structure that can be introspected. Some examples are JSON, Protocol Buffer, Avro, and database row objects. All of these formats can be converted to Beam Schemas.

To take advantage of schemas, your `PCollection` objects must have a schema attached. Often, the source itself attaches a schema to the `PCollection`.

Although schemas are language-independent, they're designed to embed naturally into the Beam SDK programming languages. This feature allows you to continue using native types while leveraging the advantage of having Beam understand your element schemas.

Apache Beam provides a collection of schema transforms that operate on schemas. For an example, see `SqlTransform` in the 'Beam SQL overview' section of the Apache Beam documentation.

The following list includes key capabilities provided by schema transforms:
* field selection
* grouping and aggregation
* join operations
* filtering data
* adding and removing fields
* renaming fields
* type conversion
* enhanced ParDo functionality

For more information about schemas and the Apache Beam best practices when working with structured data, see the 'Schema Usage Patterns' section in the Apache Beam documentation.
25 changes: 25 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/07_basic_runners.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Prompt:
What is a runner in Apache Beam?

Response:
Apache Beam runners are the execution engines that run the pipelines. They translate or adapt the pipeline into a form that can be executed on a massively parallel big data processing system, such as Apache Flink, Apache Spark, Google Cloud Dataflow, and more.

Choosing a runner is an important step in the pipeline development process. The runner you choose determines where and how your pipeline runs. For more information about available runners and their capabilities, see the Beam Capability Matrix.

To specify a runner, when you execute the pipeline, use the `--runner` flag. For example, to run the WordCount pipeline on Google Cloud Dataflow, you would run the following command:

```python
python -m apache_beam.examples.wordcount \
--region DATAFLOW_REGION \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://STORAGE_BUCKET/results/outputs \
--runner DataflowRunner \
--project PROJECT_ID \
--temp_location gs://STORAGE_BUCKET/tmp/
```

For more details about the WordCount example, see the WordCount source code in the Apache Beam GitHub repository.

The Apache Beam Direct Runner executes the pipeline on your local machine. It is helpful for testing and debugging purposes.

For instructions about how to configure your development environment and runner, see the Apache Beam quickstarts for Java, Python, and Go.
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
Prompt:
What is windowing in Apache Beam?

Response:
In stream processing, windowing allows you to divide data streams into logical units for efficient and correct parallel processing.

With an unbounded data set, collecting all of the elements is impossible, because new elements are constantly being added. In the Apache Beam model, any `PCollection` (including unbounded `PCollection` objects) can be subdivided into logical windows. Grouping transforms then consider the elements of each `PCollection` on a per-window basis.

Because Apache Beam's default windowing strategy is to assign each element to a single, global window, you must explicitly specify a windowing function for your pipeline.

The following code snippet shows how to divide a `PCollection` into 60-second windows:

```python
from apache_beam import beam
from apache_beam import window

fixed_windowed_items = items | 'window' >> beam.WindowInto(window.FixedWindows(60))
```

Beam provides a number of built-in windowing functions that you can use to subdivide your data into different types of windows:
- Fixed Time Windows
- Sliding Time Windows
- Per-Session Windows
- Single Global Window
- Calendar-based Windows (not supported by the Beam SDK for Python or Go)

You can also create your own custom windowing function `WindowFn`.

You need to specify a triggering strategy to determine when to emit the results of your pipeline’s windowed computations.

You can adjust the windowing strategy to allow for late data, which is data that arrives after the watermark has passed the end of the window. You can also specify how to handle late data, such as discarding or adding it to the next window.
31 changes: 31 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/09_basic_triggers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
Prompt:
What is a trigger in Apache Beam?

Response:
Apache Beam uses triggers to determine when to emit the aggregated results of each window, referred to as a pane.

Triggers provide two additional capabilities compared to outputting at the end of a window:
- Allowing early results to be output before the end of the window.
- Allowing late data to be handled after the end of the window.

This feature lets you control the flow of your data and balance between completeness, latency, and cost.

You set the triggers for a `PCollection` by setting the `trigger` parameter of the `WindowInto` transform.

```python
pcollection | WindowInto(
FixedWindows(1 * 60),
trigger=AfterProcessingTime(1 * 60),
accumulation_mode=AccumulationMode.DISCARDING,
)
```

When a trigger fires, it emits the current contents of the window as a pane. Because a trigger can fire multiple times, the accumulation mode determines whether the system accumulates the window panes as the trigger fires, or discards them. This behavior is controlled by the window accumulation mode parameter of the `WindowInto` transform.

Beam provides several built-in triggers that you can use to determine when to emit the results of your pipeline's windowed computations:
- Event time triggers
- Processing time triggers
- Data-driven triggers
- Composite triggers

One of the most useful trigger patterns is the `AfterWatermark` trigger, which fires a single time when Apache Beam estimates that all the data has arrived, such as when the watermark passes the end of the window.
25 changes: 25 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/10_basic_metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Prompt:
What are the metrics in Apache Beam?

Response:
In the Apache Beam model, metrics provide insights into the current state of your pipeline, including during pipeline execution.

Metrics are named and scoped to a specific step in the pipeline. They can be created dynamically during pipeline execution. If a runner doesn't support some part of reporting metrics, the fallback behavior is to drop the metric updates rather than fail the pipeline.

Apache Beam provides a number of built-in metric types:
* Counters
* Distributions
* Gauges

To declare a metric, use the `beam.metrics.Metrics` class. For example:

```python
self.words_counter = Metrics.counter(self.__class__, 'words')
self.word_lengths_counter = Metrics.counter(self.__class__, 'word_lengths')
self.word_lengths_dist = Metrics.distribution(self.__class__, 'word_len_dist')
self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')
```

For implementation details, see the WordCount example with metrics in the Apache Beam GitHub repository.

You can export metrics to external sinks. Spark and Flink runners support REST HTTP and Graphite.
19 changes: 19 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/11_basic_state.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
Prompt:
What is a state in Apache Beam?

Response:
State in Apache Beam is an API that allows for managing per-key state in element-wise operations like `ParDo` or `Map`. It enables fine-grained control over aggregations, particularly in situations where windowing and triggers are insufficient.

State is scoped to the current processing key within a window, meaning each key has its own state. State is particularly useful for complex aggregations across multiple elements.

Apache Beam provides several types of state that can be used for different use cases, including scalar and collection values:
- `ValueState`
- `BagState`
- `SetState`
- `MapState`
- `OrderedListState`
- `CombiningState`

The per-key state needs to be garbage-collected because the increasing size of the state can negatively impact performance. For more information about common garbage collection strategies, see the 'Garbage collecting state' section in the Beam Programming Guide.

For more information and additional examples of using state in Apache Beam, read the community blog post 'Stateful processing with Apache Beam' and refer to the 'State and Timers examples' section in the Beam Programming Guide.
11 changes: 11 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/12_basic_timers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
Prompt:
What is a timer in Apache Beam?

Response:
In Apache Beam, a timer is a per-key timer callback API that enables delayed processing of data stored using the State API.

Apache Beam provides two types of timers: processing time timers and event time timers. Processing time timers are based on the system clock. Event time timers are based on the timestamps of the data elements.

Apache Beam also supports dynamically setting a timer tag using `TimerMap` in the Java SDK. This feature lets you set multiple different timers in a `DoFn`. You can set the timer tags to be chosen dynamically, for example, based on data in the input elements.

For more information and additional examples of using timers in Apache Beam, read the community blog post 'Timely (and Stateful) Processing with Apache Beam' and see the 'State and Timers examples' section in the Apache Beam documentation.
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
Prompt:
What is a splittable `DoFn` in Apache Beam?

Response:
A splittable `DoFn` (SDF) is a generalization of `DoFn` that lets you process elements in a non-monolithic way. A splittable `DoFn` makes it easier to create complex, modular I/O connectors in Apache Beam.
When you apply a splittable `DoFn` to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism.

At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element.

Executing a splittable `DoFn` uses the following steps:
1. Each element is paired with a restriction. For example, filename is paired with offset range representing the whole file.
2. Each element and restriction pair is split. For example, offset ranges are broken up into smaller pieces.
3. The runner redistributes the element and restriction pairs to several workers.
4. Element and restriction pairs are processed in parallel. For example, the file is read. Within this last step, the element and restriction pair can pause its own processing or be split into further element and restriction pairs.

For an example, see the 'Splittable `DoFn`' module in the Tour of Beam.

For more information, see the community blog post 'Splittable `DoFn` in Apache Beam is Ready to Use'.
Loading

0 comments on commit 1c962a5

Please sign in to comment.