Docs: Grammar files 20-40 (#1847)
* files 20-40

* Update docs/website/docs/general-usage/http/rest-client.md

Co-authored-by: Violetta Mishechkina <[email protected]>

* Update docs/website/docs/general-usage/http/rest-client.md

Co-authored-by: Violetta Mishechkina <[email protected]>

* Update docs/website/docs/general-usage/http/rest-client.md

Co-authored-by: Violetta Mishechkina <[email protected]>

* fix snippet

---------

Co-authored-by: Violetta Mishechkina <[email protected]>
sh-rp and VioletM authored Sep 20, 2024
1 parent 9a9bdf7 commit 875bf29
Showing 20 changed files with 293 additions and 310 deletions.
26 changes: 13 additions & 13 deletions docs/website/docs/dlt-ecosystem/staging.md
@@ -16,9 +16,9 @@ Such a staging dataset has the same name as the dataset passed to `dlt.pipeline`
[destination.postgres]
staging_dataset_name_layout="staging_%s"
```
The entry above switches the pattern to `staging_` prefix and for example, for a dataset with the name **github_data**, `dlt` will create **staging_github_data**.
The entry above switches the pattern to a `staging_` prefix and, for example, for a dataset with the name **github_data**, `dlt` will create **staging_github_data**.

To configure a static staging dataset name, you can do the following (we use the destination factory)
To configure a static staging dataset name, you can do the following (we use the destination factory):
```py
import dlt

@@ -41,21 +41,21 @@ truncate_staging_dataset=true
Currently, only one destination, the [filesystem](destinations/filesystem.md), can be used as staging. The following destinations can copy remote files:

1. [Azure Synapse](destinations/synapse#staging-support)
1. [Athena](destinations/athena#staging-support)
1. [Bigquery](destinations/bigquery.md#staging-support)
1. [Dremio](destinations/dremio#staging-support)
1. [Redshift](destinations/redshift.md#staging-support)
1. [Snowflake](destinations/snowflake.md#staging-support)
2. [Athena](destinations/athena#staging-support)
3. [Bigquery](destinations/bigquery.md#staging-support)
4. [Dremio](destinations/dremio#staging-support)
5. [Redshift](destinations/redshift.md#staging-support)
6. [Snowflake](destinations/snowflake.md#staging-support)

### How to use
In essence, you need to set up two destinations and then pass them to `dlt.pipeline`. Below we'll use `filesystem` staging with `parquet` files to load into the `Redshift` destination.
In essence, you need to set up two destinations and then pass them to `dlt.pipeline`. Below, we'll use `filesystem` staging with `parquet` files to load into the `Redshift` destination.

1. **Set up the S3 bucket and filesystem staging.**

Please follow our guide in the [filesystem destination documentation](destinations/filesystem.md). Test the staging as a standalone destination to make sure that files go where you want them. In your `secrets.toml`, you should now have a working `filesystem` configuration:
```toml
[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name

[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
@@ -88,7 +88,7 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel
dataset_name='player_data'
)
```
`dlt` will automatically select an appropriate loader file format for the staging files. Below we explicitly specify the `parquet` file format (just to demonstrate how to do it):
`dlt` will automatically select an appropriate loader file format for the staging files. Below, we explicitly specify the `parquet` file format (just to demonstrate how to do it):
```py
info = pipeline.run(chess(), loader_file_format="parquet")
```
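Putting the steps together, here is a minimal, hedged sketch of the setup described above. The `chess` import is a placeholder for however the source is defined in your project, and it assumes the `filesystem` and `redshift` credentials are already in `secrets.toml`:
```py
import dlt
from chess import chess  # placeholder import; use your own source definition

# staging='filesystem' routes load packages through the configured S3 bucket,
# from which Redshift copies the files
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='redshift',
    staging='filesystem',
    dataset_name='player_data'
)

# write the staging files as parquet and load them into Redshift
info = pipeline.run(chess(), loader_file_format="parquet")
print(info)
```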
@@ -103,15 +103,15 @@ Please note that `dlt` does not delete loaded files from the staging storage aft

### How to prevent staging files truncation

Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent it and keep the whole history
of loaded files, you can use the following parameter:
Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent this and keep the whole history of loaded files, you can use the following parameter:

```toml
[destination.redshift]
truncate_table_before_load_on_staging_destination=false
```

:::caution
The [Athena](destinations/athena#staging-support) destination only truncates not iceberg tables with `replace` merge_disposition.
The [Athena](destinations/athena#staging-support) destination only truncates non-iceberg tables with `replace` merge_disposition.
Therefore, the parameter `truncate_table_before_load_on_staging_destination` only controls the truncation of corresponding files for these tables.
:::

5 changes: 3 additions & 2 deletions docs/website/docs/dlt-ecosystem/table-formats/delta.md
@@ -6,8 +6,9 @@ keywords: [delta, table formats]

# Delta table format

[Delta](https://delta.io/) is an open source table format. `dlt` can store data as Delta tables.
[Delta](https://delta.io/) is an open-source table format. `dlt` can store data as Delta tables.

## Supported Destinations
## Supported destinations

Supported by: **Databricks**, **filesystem**
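As a quick, hedged illustration of storing data as a Delta table on the filesystem destination (the `table_format` argument on the resource and the bucket configuration are assumptions taken from the filesystem destination docs, not something shown on this page):
```py
import dlt

@dlt.resource(table_format="delta")  # materialize this resource as a Delta table
def events():
    yield [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# the filesystem destination writes the Delta table to the bucket_url
# configured in secrets.toml (or a local path); writing Delta tables may
# require extra dependencies, e.g. the deltalake package
pipeline = dlt.pipeline("delta_demo", destination="filesystem")
print(pipeline.run(events()))
```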

5 changes: 3 additions & 2 deletions docs/website/docs/dlt-ecosystem/table-formats/iceberg.md
@@ -6,8 +6,9 @@ keywords: [iceberg, table formats]

# Iceberg table format

[Iceberg](https://iceberg.apache.org/) is an open source table format. `dlt` can store data as Iceberg tables.
[Iceberg](https://iceberg.apache.org/) is an open-source table format. `dlt` can store data as Iceberg tables.

## Supported Destinations
## Supported destinations

Supported by: **Athena**

45 changes: 21 additions & 24 deletions docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md
@@ -6,8 +6,7 @@ keywords: [transform, dbt, runner]

# Transform the data with dbt

[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of
using dbt include:
[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of using dbt include:

- End-to-end cross-db compatibility for dlt→dbt pipelines.
- Ease of use by SQL analysts, with a low learning curve.
@@ -20,21 +19,19 @@ You can run dbt with `dlt` by using the dbt runner.

The dbt runner:

- Can create a virtual env for dbt on the fly;
- Can create a virtual environment for dbt on the fly;
- Can run a dbt package from online sources (e.g., GitHub) or from local files;
- Passes configuration and credentials to dbt, so you do not need to handle them separately from
`dlt`, enabling dbt to configure on the fly.
- Passes configuration and credentials to dbt, so you do not need to handle them separately from `dlt`, enabling dbt to configure on the fly.

## How to use the dbt runner

For an example of how to use the dbt runner, see the
[jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py).
For an example of how to use the dbt runner, see the [jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py).
Included below is another example where we run a `dlt` pipeline and then a dbt package via `dlt`:

> 💡 Docstrings are available to read in your IDE.
```py
# load all pipedrive endpoints to pipedrive_raw dataset
# Load all Pipedrive endpoints to the pipedrive_raw dataset
pipeline = dlt.pipeline(
pipeline_name='pipedrive',
destination='bigquery',
@@ -45,38 +42,38 @@ load_info = pipeline.run(pipedrive_source())
print(load_info)

# Create a transformation on a new dataset called 'pipedrive_dbt'
# we created a local dbt package
# We created a local dbt package
# and added pipedrive_raw to its sources.yml
# the destination for the transformation is passed in the pipeline
# The destination for the transformation is passed in the pipeline
pipeline = dlt.pipeline(
pipeline_name='pipedrive',
destination='bigquery',
dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
# NOTE: if you have dbt installed in your current environment, just skip this line
# Make or restore venv for dbt, using the latest dbt version
# NOTE: If you have dbt installed in your current environment, just skip this line
# and the `venv` argument to dlt.dbt.package()
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
# Get runner, optionally pass the venv
dbt = dlt.dbt.package(
pipeline,
"pipedrive/dbt_pipedrive/pipedrive",
venv=venv
)

# run the models and collect any info
# If running fails, the error will be raised with full stack trace
# Run the models and collect any info
# If running fails, the error will be raised with a full stack trace
models = dbt.run_all()

# on success print outcome
# On success, print the outcome
for m in models:
print(
f"Model {m.model_name} materialized" +
f"in {m.time}" +
f"with status {m.status}" +
f"and message {m.message}"
f" in {m.time}" +
f" with status {m.status}" +
f" and message {m.message}"
)
```

@@ -86,18 +83,18 @@ It assumes that dbt is installed in the current Python environment and the `prof
<!--@@@DLT_SNIPPET ./dbt-snippets.py::run_dbt_standalone-->


Here's an example **duckdb** profile
Here's an example **duckdb** profile:
```yaml
config:
# do not track usage, do not create .user.yml
# Do not track usage, do not create .user.yml
send_anonymous_usage_stats: False

duckdb_dlt_dbt_test:
target: analytics
outputs:
analytics:
type: duckdb
# schema: "{{ var('destination_dataset_name', var('source_dataset_name')) }}"
# Schema: "{{ var('destination_dataset_name', var('source_dataset_name')) }}"
path: "duckdb_dlt_dbt_test.duckdb"
extensions:
- httpfs
@@ -108,8 +105,8 @@ You can run the example with dbt debug log: `RUNTIME__LOG_LEVEL=DEBUG python dbt

## Other transforming tools

If you want to transform the data before loading, you can use Python. If you want to transform the
data after loading, you can use dbt or one of the following:
If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use dbt or one of the following:

1. [`dlt` SQL client.](../sql.md)
2. [Pandas.](../pandas.md)

17 changes: 9 additions & 8 deletions docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md
@@ -4,11 +4,11 @@ description: Transforming the data loaded by a dlt pipeline with dbt Cloud
keywords: [transform, sql]
---

# DBT Cloud Client and Helper Functions
# dbt Cloud client and helper functions

## API Client
## API client

The DBT Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2).
The dbt Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2).
It provides methods to perform various operations on dbt Cloud, such as triggering job runs and retrieving job run statuses.

```py
@@ -26,7 +26,7 @@ run_status = client.get_run_status(run_id=job_run_id)
print(f"Job run status: {run_status['status_humanized']}")
```

## Helper Functions
## Helper functions

These Python functions provide an interface to interact with the dbt Cloud API.
They simplify the process of triggering and monitoring job runs in dbt Cloud.
@@ -65,11 +65,11 @@ from dlt.helpers.dbt_cloud import get_dbt_cloud_run_status
status = get_dbt_cloud_run_status(run_id=1234, wait_for_outcome=True)
```
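The credentials section below also mentions a `run_dbt_cloud_job` helper for triggering runs. A hedged sketch of how a call might look; the exact keyword arguments and return value are assumptions:
```py
from dlt.helpers.dbt_cloud import run_dbt_cloud_job

# Trigger the job configured in .dlt/secrets.toml (or pass job_id explicitly)
# and wait for the run to finish; argument names are illustrative
status = run_dbt_cloud_job(job_id=1234, wait_for_outcome=True)
print(status)
```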

## Set Credentials
## Set credentials

### secrets.toml

When using a dlt locally, we recommend using the `.dlt/secrets.toml` method to set credentials.
When using dlt locally, we recommend using the `.dlt/secrets.toml` method to set credentials.

If you used the `dlt init` command, then the `.dlt` folder has already been created.
Otherwise, create a `.dlt` folder in your working directory and a `secrets.toml` file inside it.
@@ -86,9 +86,9 @@ job_id = "set me up!" # optional only for the run_dbt_cloud_job function (you ca
run_id = "set me up!" # optional for the get_dbt_cloud_run_status function (you can pass this explicitly as an argument to the function)
```

### Environment Variables
### Environment variables

`dlt` supports reading credentials from the environment.
dlt supports reading credentials from the environment.

If dlt tries to read this from environment variables, it will use a different naming convention.

@@ -103,3 +103,4 @@ DBT_CLOUD__JOB_ID
```
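For example, the `job_id` entry from `secrets.toml` becomes `DBT_CLOUD__JOB_ID` when set in the environment. A small sketch with placeholder values (the `[dbt_cloud]` section name is inferred from the variable prefix):
```py
import os

# equivalent to the [dbt_cloud] entries in secrets.toml, using dlt's
# double-underscore naming convention for environment variables
os.environ["DBT_CLOUD__JOB_ID"] = "set me up!"
os.environ["DBT_CLOUD__RUN_ID"] = "set me up!"
```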

For more information, read the [Credentials](../../../general-usage/credentials) documentation.

7 changes: 4 additions & 3 deletions docs/website/docs/dlt-ecosystem/transformations/pandas.md
@@ -4,7 +4,7 @@ description: Transform the data loaded by a dlt pipeline with Pandas
keywords: [transform, pandas]
---

# Transform the Data with Pandas
# Transform the data with Pandas

You can fetch the results of any SQL query as a dataframe. If the destination supports that
natively (i.e., BigQuery and DuckDB), `dlt` uses the native method. Thanks to this, reading
@@ -22,7 +22,7 @@ with pipeline.sql_client() as client:
with client.execute_query(
'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues'
) as table:
# calling `df` on a cursor, returns the data as a data frame
# calling `df` on a cursor returns the data as a data frame
reactions = table.df()
counts = reactions.sum(0).sort_values(0, ascending=False)
```
@@ -32,10 +32,11 @@ chunks by passing the `chunk_size` argument to the `df` method.

Once your data is in a Pandas dataframe, you can transform it as needed.
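For larger result sets, here is a hedged sketch of reading in chunks. It reuses the `pipeline` from the example above and assumes that repeated `df(chunk_size=...)` calls return `None` once the result set is exhausted; check the behavior for your `dlt` version:
```py
import pandas as pd

frames = []
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM issues") as cursor:
        # pull the result set in chunks of 10,000 rows
        while (chunk := cursor.df(chunk_size=10_000)) is not None:
            frames.append(chunk)

issues = pd.concat(frames) if frames else pd.DataFrame()
```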

## Other Transforming Tools
## Other transforming tools

If you want to transform the data before loading, you can use Python. If you want to transform the
data after loading, you can use Pandas or one of the following:

1. [dbt.](dbt/dbt.md) (recommended)
2. [`dlt` SQL client.](sql.md)

5 changes: 3 additions & 2 deletions docs/website/docs/dlt-ecosystem/transformations/sql.md
@@ -36,7 +36,7 @@ try:
"SELECT id, name, email FROM customers WHERE id = %s",
10
)
# prints column values of the first row
# Prints column values of the first row
print(res[0])
except Exception:
...
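
# Beyond SELECTs, execute_sql can also run DML or DDL statements.
# A hedged sketch, reusing the pipeline and customers table from the example above:
with pipeline.sql_client() as client:
    client.execute_sql(
        "UPDATE customers SET email = lower(email) WHERE email IS NOT NULL"
    )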
@@ -48,4 +48,5 @@ If you want to transform the data before loading, you can use Python. If you wan
data after loading, you can use SQL or one of the following:

1. [dbt](dbt/dbt.md) (recommended).
2. [Pandas.](pandas.md)
2. [Pandas](pandas.md).

21 changes: 11 additions & 10 deletions docs/website/docs/general-usage/credentials/advanced.md
@@ -26,7 +26,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen
```
`dlt` allows the user to specify the argument `pipedrive_api_key` explicitly if, for some reason, they do not want to use [out-of-the-box options](setup) for credentials management.

1. Required arguments (without default values) **are never injected** and must be specified when calling. For example, for the source:
2. Required arguments (without default values) **are never injected** and must be specified when calling. For example, for the source:

```py
@dlt.source
Expand All @@ -35,7 +35,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen
```
The argument `channels_list` would not be injected and will output an error if it is not specified explicitly.

1. Arguments with default values are injected if present in config providers. Otherwise, defaults from the function signature are used. For example, for the source:
3. Arguments with default values are injected if present in config providers. Otherwise, defaults from the function signature are used. For example, for the source:

```py
@dlt.source
Expand All @@ -48,7 +48,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen
```
`dlt` firstly searches for all three arguments: `page_size`, `access_token`, and `start_date` in config providers in a [specific order](setup). If it cannot find them, it will use the default values.

1. Arguments with the special default value `dlt.secrets.value` and `dlt.config.value` **must be injected**
4. Arguments with the special default value `dlt.secrets.value` and `dlt.config.value` **must be injected**
(or explicitly passed). If they are not found by the config providers, the code raises an
exception. The code in the functions always receives those arguments.
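
   To make point 4 concrete, here is a minimal sketch; the source and argument names mirror the pipedrive example at the top of this page:

   ```py
   import dlt

   @dlt.source
   def pipedrive_source(pipedrive_api_key: str = dlt.secrets.value):
       ...
   ```

   If no config provider supplies `pipedrive_api_key` and it is not passed explicitly, `dlt` raises a configuration exception rather than running the source with a missing value.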

@@ -58,12 +58,12 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen

We highly recommend adding types to your function signatures.
The effort is very low, and it gives `dlt` much more
information on what source/resource expects.
information on what the source or resource expects.

Doing so provides several benefits:

1. You'll never receive the invalid data types in your code.
1. `dlt` will automatically parse and coerce types for you, so you don't need to parse it yourself.
1. You'll never receive invalid data types in your code.
1. `dlt` will automatically parse and coerce types for you, so you don't need to parse them yourself.
1. `dlt` can generate sample config and secret files for your source automatically.
1. You can request [built-in and custom credentials](complex_types) (i.e., connection strings, AWS / GCP / Azure credentials).
1. You can specify a set of possible types via `Union`, i.e., OAuth or API Key authorization.
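
Here is a hedged sketch of a typed signature that uses these benefits, a typed variant of the `google_sheets` example shown later on this page (the credential classes are dlt built-ins; the `dlt.sources.credentials` import path is an assumption):
```py
from typing import Union

import dlt
from dlt.sources.credentials import GcpOAuthCredentials, GcpServiceAccountCredentials

@dlt.source
def google_sheets(
    spreadsheet_id: str = dlt.config.value,
    credentials: Union[GcpServiceAccountCredentials, GcpOAuthCredentials] = dlt.secrets.value,
    only_strings: bool = False,
):
    # dlt parses and coerces the injected values to the annotated types,
    # so credentials arrive as a ready-to-use credentials object
    ...
```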
@@ -94,7 +94,7 @@ Now,
## Toml files structure

`dlt` arranges the sections of [toml files](setup/#secretstoml-and-configtoml) into a **default layout** that is expected by the [injection mechanism](#injection-mechanism).
This layout makes it easy to configure simple cases but also provides a room for more explicit sections and complex cases, i.e., having several sources with different credentials
This layout makes it easy to configure simple cases but also provides room for more explicit sections and complex cases, i.e., having several sources with different credentials
or even hosting several pipelines in the same project sharing the same config and credentials.

```text
@@ -158,7 +158,7 @@ dlt.config["sheet_id"] = "23029402349032049"
dlt.secrets["destination.postgres.credentials"] = BaseHook.get_connection('postgres_dsn').extra
```

Will mock the `toml` provider to desired values.
This will mock the `toml` provider to desired values.

## Example

@@ -173,7 +173,7 @@ def google_sheets(
credentials=dlt.secrets.value,
only_strings=False
):
# Allow both a dictionary and a string passed as credentials
# Allow both a dictionary and a string to be passed as credentials
if isinstance(credentials, str):
credentials = json.loads(credentials)
# Allow both a list and a comma-delimited string to be passed as tabs
@@ -200,4 +200,5 @@ In the example above:
:::tip
`dlt.resource` behaves in the same way, so if you have a [standalone resource](../resource.md#declare-a-standalone-resource) (one that is not an inner function
of a **source**)
:::
:::
