From 875bf29a78685a25ebbf976c22a39bac19d3bf18 Mon Sep 17 00:00:00 2001 From: David Scharf Date: Fri, 20 Sep 2024 18:44:22 +0200 Subject: [PATCH] Docs: Grammar files 20-40 (#1847) * files 20-40 * Update docs/website/docs/general-usage/http/rest-client.md Co-authored-by: Violetta Mishechkina * Update docs/website/docs/general-usage/http/rest-client.md Co-authored-by: Violetta Mishechkina * Update docs/website/docs/general-usage/http/rest-client.md Co-authored-by: Violetta Mishechkina * fix snippet --------- Co-authored-by: Violetta Mishechkina --- docs/website/docs/dlt-ecosystem/staging.md | 26 ++--- .../docs/dlt-ecosystem/table-formats/delta.md | 5 +- .../dlt-ecosystem/table-formats/iceberg.md | 5 +- .../dlt-ecosystem/transformations/dbt/dbt.md | 45 ++++---- .../transformations/dbt/dbt_cloud.md | 17 +-- .../dlt-ecosystem/transformations/pandas.md | 7 +- .../docs/dlt-ecosystem/transformations/sql.md | 5 +- .../general-usage/credentials/advanced.md | 21 ++-- .../credentials/complex_types.md | 63 +++++------ .../docs/general-usage/credentials/index.md | 7 +- .../docs/general-usage/credentials/setup.md | 102 +++++++++--------- .../pseudonymizing_columns.md | 10 +- .../customising-pipelines/removing_columns.md | 16 ++- .../customising-pipelines/renaming_columns.md | 5 +- .../docs/general-usage/http/requests.md | 15 +-- .../docs/general-usage/http/rest-client.md | 83 +++++++------- docs/website/docs/tutorial/filesystem.md | 36 +++---- .../docs/tutorial/load-data-from-an-api.md | 88 ++++++++------- docs/website/docs/tutorial/rest-api.md | 23 ++-- docs/website/docs/tutorial/sql-database.md | 24 ++--- 20 files changed, 293 insertions(+), 310 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/staging.md b/docs/website/docs/dlt-ecosystem/staging.md index 789189b7dd..fe0be75630 100644 --- a/docs/website/docs/dlt-ecosystem/staging.md +++ b/docs/website/docs/dlt-ecosystem/staging.md @@ -16,9 +16,9 @@ Such a staging dataset has the same name as the dataset passed to `dlt.pipeline` [destination.postgres] staging_dataset_name_layout="staging_%s" ``` -The entry above switches the pattern to `staging_` prefix and for example, for a dataset with the name **github_data**, `dlt` will create **staging_github_data**. +The entry above switches the pattern to a `staging_` prefix and, for example, for a dataset with the name **github_data**, `dlt` will create **staging_github_data**. -To configure a static staging dataset name, you can do the following (we use the destination factory) +To configure a static staging dataset name, you can do the following (we use the destination factory): ```py import dlt @@ -41,21 +41,21 @@ truncate_staging_dataset=true Currently, only one destination, the [filesystem](destinations/filesystem.md), can be used as staging. The following destinations can copy remote files: 1. [Azure Synapse](destinations/synapse#staging-support) -1. [Athena](destinations/athena#staging-support) -1. [Bigquery](destinations/bigquery.md#staging-support) -1. [Dremio](destinations/dremio#staging-support) -1. [Redshift](destinations/redshift.md#staging-support) -1. [Snowflake](destinations/snowflake.md#staging-support) +2. [Athena](destinations/athena#staging-support) +3. [Bigquery](destinations/bigquery.md#staging-support) +4. [Dremio](destinations/dremio#staging-support) +5. [Redshift](destinations/redshift.md#staging-support) +6. [Snowflake](destinations/snowflake.md#staging-support) ### How to use -In essence, you need to set up two destinations and then pass them to `dlt.pipeline`. 
Below we'll use `filesystem` staging with `parquet` files to load into the `Redshift` destination. +In essence, you need to set up two destinations and then pass them to `dlt.pipeline`. Below, we'll use `filesystem` staging with `parquet` files to load into the `Redshift` destination. 1. **Set up the S3 bucket and filesystem staging.** Please follow our guide in the [filesystem destination documentation](destinations/filesystem.md). Test the staging as a standalone destination to make sure that files go where you want them. In your `secrets.toml`, you should now have a working `filesystem` configuration: ```toml [destination.filesystem] - bucket_url = "s3://[your_bucket_name]" # replace with your bucket name, + bucket_url = "s3://[your_bucket_name]" # replace with your bucket name [destination.filesystem.credentials] aws_access_key_id = "please set me up!" # copy the access key here @@ -88,7 +88,7 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel dataset_name='player_data' ) ``` - `dlt` will automatically select an appropriate loader file format for the staging files. Below we explicitly specify the `parquet` file format (just to demonstrate how to do it): + `dlt` will automatically select an appropriate loader file format for the staging files. Below, we explicitly specify the `parquet` file format (just to demonstrate how to do it): ```py info = pipeline.run(chess(), loader_file_format="parquet") ``` @@ -103,8 +103,7 @@ Please note that `dlt` does not delete loaded files from the staging storage aft ### How to prevent staging files truncation -Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent it and keep the whole history -of loaded files, you can use the following parameter: +Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent this and keep the whole history of loaded files, you can use the following parameter: ```toml [destination.redshift] @@ -112,6 +111,7 @@ truncate_table_before_load_on_staging_destination=false ``` :::caution -The [Athena](destinations/athena#staging-support) destination only truncates not iceberg tables with `replace` merge_disposition. +The [Athena](destinations/athena#staging-support) destination only truncates non-iceberg tables with `replace` merge_disposition. Therefore, the parameter `truncate_table_before_load_on_staging_destination` only controls the truncation of corresponding files for these tables. ::: + diff --git a/docs/website/docs/dlt-ecosystem/table-formats/delta.md b/docs/website/docs/dlt-ecosystem/table-formats/delta.md index 7840f40d11..d8dd87b750 100644 --- a/docs/website/docs/dlt-ecosystem/table-formats/delta.md +++ b/docs/website/docs/dlt-ecosystem/table-formats/delta.md @@ -6,8 +6,9 @@ keywords: [delta, table formats] # Delta table format -[Delta](https://delta.io/) is an open source table format. `dlt` can store data as Delta tables. +[Delta](https://delta.io/) is an open-source table format. `dlt` can store data as Delta tables. 
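A minimal sketch of how a resource could be stored as a Delta table via the `filesystem` destination (this assumes a dlt version that supports `table_format="delta"`, an installed `deltalake` package, and a configured `bucket_url`; the resource and pipeline names are illustrative):

```py
import dlt

# Illustrative resource; table_format="delta" requests Delta storage
# (assumes your dlt version supports this argument).
@dlt.resource(table_format="delta")
def events():
    yield [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# The filesystem destination picks up bucket_url and credentials from config/secrets.
pipeline = dlt.pipeline(pipeline_name="delta_demo", destination="filesystem")
print(pipeline.run(events()))
```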
-## Supported Destinations +## Supported destinations Supported by: **Databricks**, **filesystem** + diff --git a/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md b/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md index a34bab9a0c..233ae0ce21 100644 --- a/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md +++ b/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md @@ -6,8 +6,9 @@ keywords: [iceberg, table formats] # Iceberg table format -[Iceberg](https://iceberg.apache.org/) is an open source table format. `dlt` can store data as Iceberg tables. +[Iceberg](https://iceberg.apache.org/) is an open-source table format. `dlt` can store data as Iceberg tables. -## Supported Destinations +## Supported destinations Supported by: **Athena** + diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md index 526e62e44b..449f8b8bde 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md @@ -6,8 +6,7 @@ keywords: [transform, dbt, runner] # Transform the data with dbt -[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of -using dbt include: +[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of using dbt include: - End-to-end cross-db compatibility for dlt→dbt pipelines. - Ease of use by SQL analysts, with a low learning curve. @@ -20,21 +19,19 @@ You can run dbt with `dlt` by using the dbt runner. The dbt runner: -- Can create a virtual env for dbt on the fly; +- Can create a virtual environment for dbt on the fly; - Can run a dbt package from online sources (e.g., GitHub) or from local files; -- Passes configuration and credentials to dbt, so you do not need to handle them separately from - `dlt`, enabling dbt to configure on the fly. +- Passes configuration and credentials to dbt, so you do not need to handle them separately from `dlt`, enabling dbt to configure on the fly. ## How to use the dbt runner -For an example of how to use the dbt runner, see the -[jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py). +For an example of how to use the dbt runner, see the [jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py). Included below is another example where we run a `dlt` pipeline and then a dbt package via `dlt`: > 💡 Docstrings are available to read in your IDE. 
```py -# load all pipedrive endpoints to pipedrive_raw dataset +# Load all Pipedrive endpoints to the pipedrive_raw dataset pipeline = dlt.pipeline( pipeline_name='pipedrive', destination='bigquery', @@ -45,38 +42,38 @@ load_info = pipeline.run(pipedrive_source()) print(load_info) # Create a transformation on a new dataset called 'pipedrive_dbt' -# we created a local dbt package +# We created a local dbt package # and added pipedrive_raw to its sources.yml -# the destination for the transformation is passed in the pipeline +# The destination for the transformation is passed in the pipeline pipeline = dlt.pipeline( pipeline_name='pipedrive', destination='bigquery', dataset_name='pipedrive_dbt' ) -# make or restore venv for dbt, using latest dbt version -# NOTE: if you have dbt installed in your current environment, just skip this line +# Make or restore venv for dbt, using the latest dbt version +# NOTE: If you have dbt installed in your current environment, just skip this line # and the `venv` argument to dlt.dbt.package() venv = dlt.dbt.get_venv(pipeline) -# get runner, optionally pass the venv +# Get runner, optionally pass the venv dbt = dlt.dbt.package( pipeline, "pipedrive/dbt_pipedrive/pipedrive", venv=venv ) -# run the models and collect any info -# If running fails, the error will be raised with full stack trace +# Run the models and collect any info +# If running fails, the error will be raised with a full stack trace models = dbt.run_all() -# on success print outcome +# On success, print the outcome for m in models: print( f"Model {m.model_name} materialized" + - f"in {m.time}" + - f"with status {m.status}" + - f"and message {m.message}" + f" in {m.time}" + + f" with status {m.status}" + + f" and message {m.message}" ) ``` @@ -86,10 +83,10 @@ It assumes that dbt is installed in the current Python environment and the `prof -Here's an example **duckdb** profile +Here's an example **duckdb** profile: ```yaml config: - # do not track usage, do not create .user.yml + # Do not track usage, do not create .user.yml send_anonymous_usage_stats: False duckdb_dlt_dbt_test: @@ -97,7 +94,7 @@ duckdb_dlt_dbt_test: outputs: analytics: type: duckdb - # schema: "{{ var('destination_dataset_name', var('source_dataset_name')) }}" + # Schema: "{{ var('destination_dataset_name', var('source_dataset_name')) }}" path: "duckdb_dlt_dbt_test.duckdb" extensions: - httpfs @@ -108,8 +105,8 @@ You can run the example with dbt debug log: `RUNTIME__LOG_LEVEL=DEBUG python dbt ## Other transforming tools -If you want to transform the data before loading, you can use Python. If you want to transform the -data after loading, you can use dbt or one of the following: +If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use dbt or one of the following: 1. [`dlt` SQL client.](../sql.md) 2. 
[Pandas.](../pandas.md) + diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md index 2ff65537be..58bc489459 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md @@ -4,11 +4,11 @@ description: Transforming the data loaded by a dlt pipeline with dbt Cloud keywords: [transform, sql] --- -# DBT Cloud Client and Helper Functions +# dbt Cloud client and helper functions -## API Client +## API client -The DBT Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2). +The dbt Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2). It provides methods to perform various operations on dbt Cloud, such as triggering job runs and retrieving job run statuses. ```py @@ -26,7 +26,7 @@ run_status = client.get_run_status(run_id=job_run_id) print(f"Job run status: {run_status['status_humanized']}") ``` -## Helper Functions +## Helper functions These Python functions provide an interface to interact with the dbt Cloud API. They simplify the process of triggering and monitoring job runs in dbt Cloud. @@ -65,11 +65,11 @@ from dlt.helpers.dbt_cloud import get_dbt_cloud_run_status status = get_dbt_cloud_run_status(run_id=1234, wait_for_outcome=True) ``` -## Set Credentials +## Set credentials ### secrets.toml -When using a dlt locally, we recommend using the `.dlt/secrets.toml` method to set credentials. +When using dlt locally, we recommend using the `.dlt/secrets.toml` method to set credentials. If you used the `dlt init` command, then the `.dlt` folder has already been created. Otherwise, create a `.dlt` folder in your working directory and a `secrets.toml` file inside it. @@ -86,9 +86,9 @@ job_id = "set me up!" # optional only for the run_dbt_cloud_job function (you ca run_id = "set me up!" # optional for the get_dbt_cloud_run_status function (you can pass this explicitly as an argument to the function) ``` -### Environment Variables +### Environment variables -`dlt` supports reading credentials from the environment. +dlt supports reading credentials from the environment. If dlt tries to read this from environment variables, it will use a different naming convention. @@ -103,3 +103,4 @@ DBT_CLOUD__JOB_ID ``` For more information, read the [Credentials](../../../general-usage/credentials) documentation. + diff --git a/docs/website/docs/dlt-ecosystem/transformations/pandas.md b/docs/website/docs/dlt-ecosystem/transformations/pandas.md index 0e08666eaf..4125e4e114 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/pandas.md +++ b/docs/website/docs/dlt-ecosystem/transformations/pandas.md @@ -4,7 +4,7 @@ description: Transform the data loaded by a dlt pipeline with Pandas keywords: [transform, pandas] --- -# Transform the Data with Pandas +# Transform the data with Pandas You can fetch the results of any SQL query as a dataframe. If the destination supports that natively (i.e., BigQuery and DuckDB), `dlt` uses the native method. 
Thanks to this, reading @@ -22,7 +22,7 @@ with pipeline.sql_client() as client: with client.execute_query( 'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues' ) as table: - # calling `df` on a cursor, returns the data as a data frame + # calling `df` on a cursor returns the data as a data frame reactions = table.df() counts = reactions.sum(0).sort_values(0, ascending=False) ``` @@ -32,10 +32,11 @@ chunks by passing the `chunk_size` argument to the `df` method. Once your data is in a Pandas dataframe, you can transform it as needed. -## Other Transforming Tools +## Other transforming tools If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use Pandas or one of the following: 1. [dbt.](dbt/dbt.md) (recommended) 2. [`dlt` SQL client.](sql.md) + diff --git a/docs/website/docs/dlt-ecosystem/transformations/sql.md b/docs/website/docs/dlt-ecosystem/transformations/sql.md index b358e97b4c..ffd348d1a0 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/sql.md +++ b/docs/website/docs/dlt-ecosystem/transformations/sql.md @@ -36,7 +36,7 @@ try: "SELECT id, name, email FROM customers WHERE id = %s", 10 ) - # prints column values of the first row + # Prints column values of the first row print(res[0]) except Exception: ... @@ -48,4 +48,5 @@ If you want to transform the data before loading, you can use Python. If you wan data after loading, you can use SQL or one of the following: 1. [dbt](dbt/dbt.md) (recommended). -2. [Pandas.](pandas.md) +2. [Pandas](pandas.md). + diff --git a/docs/website/docs/general-usage/credentials/advanced.md b/docs/website/docs/general-usage/credentials/advanced.md index 793f5c2a55..c25030a154 100644 --- a/docs/website/docs/general-usage/credentials/advanced.md +++ b/docs/website/docs/general-usage/credentials/advanced.md @@ -26,7 +26,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen ``` `dlt` allows the user to specify the argument `pipedrive_api_key` explicitly if, for some reason, they do not want to use [out-of-the-box options](setup) for credentials management. -1. Required arguments (without default values) **are never injected** and must be specified when calling. For example, for the source: +2. Required arguments (without default values) **are never injected** and must be specified when calling. For example, for the source: ```py @dlt.source @@ -35,7 +35,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen ``` The argument `channels_list` would not be injected and will output an error if it is not specified explicitly. -1. Arguments with default values are injected if present in config providers. Otherwise, defaults from the function signature are used. For example, for the source: +3. Arguments with default values are injected if present in config providers. Otherwise, defaults from the function signature are used. For example, for the source: ```py @dlt.source @@ -48,7 +48,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen ``` `dlt` firstly searches for all three arguments: `page_size`, `access_token`, and `start_date` in config providers in a [specific order](setup). If it cannot find them, it will use the default values. -1. Arguments with the special default value `dlt.secrets.value` and `dlt.config.value` **must be injected** +4. 
Arguments with the special default value `dlt.secrets.value` and `dlt.config.value` **must be injected** (or explicitly passed). If they are not found by the config providers, the code raises an exception. The code in the functions always receives those arguments. @@ -58,12 +58,12 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen We highly recommend adding types to your function signatures. The effort is very low, and it gives `dlt` much more -information on what source/resource expects. +information on what the source or resource expects. Doing so provides several benefits: -1. You'll never receive the invalid data types in your code. -1. `dlt` will automatically parse and coerce types for you, so you don't need to parse it yourself. +1. You'll never receive invalid data types in your code. +1. `dlt` will automatically parse and coerce types for you, so you don't need to parse them yourself. 1. `dlt` can generate sample config and secret files for your source automatically. 1. You can request [built-in and custom credentials](complex_types) (i.e., connection strings, AWS / GCP / Azure credentials). 1. You can specify a set of possible types via `Union`, i.e., OAuth or API Key authorization. @@ -94,7 +94,7 @@ Now, ## Toml files structure `dlt` arranges the sections of [toml files](setup/#secretstoml-and-configtoml) into a **default layout** that is expected by the [injection mechanism](#injection-mechanism). -This layout makes it easy to configure simple cases but also provides a room for more explicit sections and complex cases, i.e., having several sources with different credentials +This layout makes it easy to configure simple cases but also provides room for more explicit sections and complex cases, i.e., having several sources with different credentials or even hosting several pipelines in the same project sharing the same config and credentials. ```text @@ -158,7 +158,7 @@ dlt.config["sheet_id"] = "23029402349032049" dlt.secrets["destination.postgres.credentials"] = BaseHook.get_connection('postgres_dsn').extra ``` -Will mock the `toml` provider to desired values. +This will mock the `toml` provider to desired values. ## Example @@ -173,7 +173,7 @@ def google_sheets( credentials=dlt.secrets.value, only_strings=False ): - # Allow both a dictionary and a string passed as credentials + # Allow both a dictionary and a string to be passed as credentials if isinstance(credentials, str): credentials = json.loads(credentials) # Allow both a list and a comma-delimited string to be passed as tabs @@ -200,4 +200,5 @@ In the example above: :::tip `dlt.resource` behaves in the same way, so if you have a [standalone resource](../resource.md#declare-a-standalone-resource) (one that is not an inner function of a **source**) -::: \ No newline at end of file +::: + diff --git a/docs/website/docs/general-usage/credentials/complex_types.md b/docs/website/docs/general-usage/credentials/complex_types.md index 24915c1b2e..d14e031097 100644 --- a/docs/website/docs/general-usage/credentials/complex_types.md +++ b/docs/website/docs/general-usage/credentials/complex_types.md @@ -49,7 +49,7 @@ dsn="postgres://loader:loader@localhost:5432/dlt_data" ### Mixed form -If all credentials, but the password provided explicitly in the code, `dlt` will look for the password in `secrets.toml`. +If all credentials, except the password, are provided explicitly in the code, `dlt` will look for the password in `secrets.toml`. 
```toml dsn.password="loader" @@ -125,10 +125,10 @@ credentials.add_scopes(["scope3", "scope4"]) `OAuth2Credentials` is a base class to implement actual OAuth; for example, it is a base class for [GcpOAuthCredentials](#gcpoauthcredentials). -### GCP Credentials +### GCP credentials #### Examples -* [Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py): the example of how to use GCP Credentials. +* [Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py): an example of how to use GCP Credentials. * [Google Analytics example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/setup_script_gcp_oauth.py): how you can get the refresh token using `dlt.secrets.value`. #### Types @@ -192,6 +192,7 @@ property_id = "213025502" The `GcpOAuthCredentials` class is responsible for handling OAuth2 credentials for desktop applications in Google Cloud Platform (GCP). It can parse native values either as `GoogleOAuth2Credentials` or as serialized OAuth client secrets JSON. This class provides methods for authentication and obtaining access tokens. ##### Usage + ```py oauth_credentials = GcpOAuthCredentials() @@ -201,7 +202,7 @@ oauth_credentials = GcpOAuthCredentials() native_value_oauth = {"client_secret": ...} oauth_credentials.parse_native_representation(native_value_oauth) ``` -or more preferred use: +Or more preferred use: ```py import dlt from dlt.sources.credentials import GcpOAuthCredentials @@ -215,7 +216,7 @@ def google_analytics( credentials.auth(scopes=["scope1", "scope2"]) # Retrieve native credentials for Google clients - # For example, build the service object for Google Analytics PI. + # For example, build the service object for Google Analytics API. client = BetaAnalyticsDataClient(credentials=credentials.to_native_credentials()) # Get a string representation of the credentials @@ -223,7 +224,7 @@ def google_analytics( credentials_str = str(credentials) ... ``` -while `secrets.toml` looks as follows: +While `secrets.toml` looks as follows: ```toml [sources.google_analytics.credentials] client_id = "client_id" # please set me up! @@ -231,7 +232,7 @@ client_secret = "client_secret" # please set me up! refresh_token = "refresh_token" # please set me up! project_id = "project_id" # please set me up! ``` -and `config.toml`: +And `config.toml`: ```toml [sources.google_analytics] property_id = "213025502" @@ -239,11 +240,9 @@ property_id = "213025502" In order for the `auth()` method to succeed: -- You must provide valid `client_id`, `client_secret`, `refresh_token`, and `project_id` to get a current **access token** and authenticate with OAuth. Keep in mind that the `refresh_token` must contain all the scopes that is required for your access. +- You must provide valid `client_id`, `client_secret`, `refresh_token`, and `project_id` to get a current **access token** and authenticate with OAuth. Keep in mind that the `refresh_token` must contain all the scopes that are required for your access. - If the `refresh_token` is not provided, and you run the pipeline from a console or a notebook, `dlt` will use InstalledAppFlow to run the desktop authentication flow. - - #### Defaults If configuration values are missing, `dlt` will use the default Google credentials (from `default()`) if available. 
Read more about [Google defaults.](https://googleapis.dev/python/google-auth/latest/user-guide.html#application-default-credentials) @@ -264,7 +263,7 @@ credentials.region_name = "us-east-1" ``` or ```py -# Imports an external boto3 session and sets the credentials properties accordingly. +# Imports an external botocore session and sets the credentials properties accordingly. import botocore.session credentials = AwsCredentials() @@ -306,7 +305,7 @@ bucket_url = "bucket_url" If configuration is not provided, `dlt` uses the default AWS credentials (from `.aws/credentials`) as present on the machine: -- It works by creating an instance of botocore Session. +- It works by creating an instance of a botocore Session. - If `profile_name` is specified, the credentials for that profile are used. If not, the default profile is used. ### AzureCredentials @@ -364,7 +363,7 @@ Example: ```py @dlt.source def zen_source(credentials: Union[ZenApiKeyCredentials, ZenEmailCredentials, str] = dlt.secrets.value, some_option: bool = False): - # Depending on what the user provides in config, ZenApiKeyCredentials or ZenEmailCredentials will be injected in the `credentials` argument. Both classes implement `auth` so you can always call it. + # Depending on what the user provides in config, ZenApiKeyCredentials or ZenEmailCredentials will be injected into the `credentials` argument. Both classes implement `auth` so you can always call it. credentials.auth() return dlt.resource([credentials], name="credentials") @@ -374,7 +373,7 @@ assert list(zen_source())[0].email == "mx" # Pass explicit native value assert list(zen_source("secret:🔑:secret"))[0].api_secret == "secret" -# pass explicit dict +# Pass explicit dict assert list(zen_source(credentials={"email": "emx", "password": "pass"}))[0].email == "emx" ``` @@ -383,26 +382,23 @@ This applies not only to credentials but to [all specs](#writing-custom-specs). ::: :::tip -Check out the [complete example](https://github.com/dlt-hub/dlt/blob/devel/tests/common/configuration/test_spec_union.py), to learn how to create unions -of credentials that derive from the common class, so you can handle it seamlessly in your code. +Check out the [complete example](https://github.com/dlt-hub/dlt/blob/devel/tests/common/configuration/test_spec_union.py), to learn how to create unions of credentials that derive from the common class, so you can handle it seamlessly in your code. ::: ## Writing custom specs -**Custom specifications** let you take full control over the function arguments. You can +**Custom specifications** let you take full control over the function arguments. You can: - Control which values should be injected, the types, default values. - Specify optional and final fields. - Form hierarchical configurations (specs in specs). -- Provide own handlers for `on_partial` (called before failing on missing config key) or `on_resolved`. -- Provide own native value parsers. -- Provide own default credentials logic. -- Utilise Python dataclass functionality. -- Utilise Python `dict` functionality (`specs` instances can be created from dicts and serialized - from dicts). +- Provide your own handlers for `on_partial` (called before failing on missing config key) or `on_resolved`. +- Provide your own native value parsers. +- Provide your own default credentials logic. +- Utilize Python dataclass functionality. +- Utilize Python `dict` functionality (`specs` instances can be created from dicts and serialized from dicts). 
-In fact, `dlt` synthesizes a unique spec for each decorated function. For example, in the case of `google_sheets`, the following -class is created: +In fact, `dlt` synthesizes a unique spec for each decorated function. For example, in the case of `google_sheets`, the following class is created: ```py from dlt.sources.config import configspec, with_config @@ -417,24 +413,19 @@ class GoogleSheetsConfiguration(BaseConfiguration): ### All specs derive from [BaseConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L170) This class serves as a foundation for creating configuration objects with specific characteristics: -- It provides methods to parse and represent the configuration - in native form (`parse_native_representation` and `to_native_representation`). +- It provides methods to parse and represent the configuration in native form (`parse_native_representation` and `to_native_representation`). - It defines methods for accessing and manipulating configuration fields. -- It implements a dictionary-compatible interface on top of the dataclass. -This allows instances of this class to be treated like dictionaries. +- It implements a dictionary-compatible interface on top of the dataclass. This allows instances of this class to be treated like dictionaries. -- It defines helper functions for checking if a certain attribute is present, -if a field is valid, and for calling methods in the method resolution order (MRO). +- It defines helper functions for checking if a certain attribute is present, if a field is valid, and for calling methods in the method resolution order (MRO). More information about this class can be found in the class docstrings. ### All credentials derive from [CredentialsConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L307) -This class is a subclass of `BaseConfiguration` -and is meant to serve as a base class for handling various types of credentials. -It defines methods for initializing credentials, converting them to native representations, -and generating string representations while ensuring sensitive information is appropriately handled. +This class is a subclass of `BaseConfiguration` and is meant to serve as a base class for handling various types of credentials. It defines methods for initializing credentials, converting them to native representations, and generating string representations while ensuring sensitive information is appropriately handled. + +More information about this class can be found in the class docstrings. -More information about this class can be found in the class docstrings. \ No newline at end of file diff --git a/docs/website/docs/general-usage/credentials/index.md b/docs/website/docs/general-usage/credentials/index.md index c9cbe6707c..95e0ec36ac 100644 --- a/docs/website/docs/general-usage/credentials/index.md +++ b/docs/website/docs/general-usage/credentials/index.md @@ -9,10 +9,11 @@ import DocCardList from '@theme/DocCardList'; 1. Environment variables 2. Configuration files (`secrets.toml` and `config.toml`) -3. Key managers and Vaults +3. Key managers and vaults `dlt` automatically extracts configuration settings and secrets based on flexible [naming conventions](setup/#naming-convention). It then [injects](advanced/#injection-mechanism) these values where needed in code. 
-# Learn Details About +# Learn details about + + - \ No newline at end of file diff --git a/docs/website/docs/general-usage/credentials/setup.md b/docs/website/docs/general-usage/credentials/setup.md index 7933bab183..5f05e68b6d 100644 --- a/docs/website/docs/general-usage/credentials/setup.md +++ b/docs/website/docs/general-usage/credentials/setup.md @@ -30,12 +30,12 @@ A custom config provider is helpful if you want to use your own configuration fi 1. [Default Argument Values](advanced#ingestion-mechanism): These are the values specified in the function's signature. :::tip -Please make sure your pipeline name contains no whitespace or any other punctuation characters except `"-"` and `"_"`. This way you will ensure your code is working with any configuration option. +Please make sure your pipeline name contains no whitespace or any other punctuation characters except `"-"` and `"_"`. This way, you will ensure your code is working with any configuration option. ::: ## Naming convention -`dlt` uses a specific naming hierarchy to search for the secrets and configs values. This makes configurations and secrets easy to manage. +`dlt` uses a specific naming hierarchy to search for the secrets and config values. This makes configurations and secrets easy to manage. To keep the naming convention flexible, `dlt` looks for a lot of possible combinations of key names, starting from the most specific possible path. Then, if the value is not found, it removes the right-most section and tries again. @@ -85,7 +85,7 @@ The most specific possible path for **destinations** looks like: ```sh -[.destination..credentials] +[.destination..credentials] ="some_value" ``` @@ -120,12 +120,12 @@ def deals(api_key: str = dlt.secrets.value): `dlt` will search for the following names in this order: 1. `sources.pipedrive.deals.api_key` -1. `sources.pipedrive.api_key` -1. `sources.api_key` -1. `api_key` +2. `sources.pipedrive.api_key` +3. `sources.api_key` +4. `api_key` :::tip -You can use your pipeline name to have separate configurations for each pipeline in your project. All config values will be looked with the pipeline name first and then again without it. +You can use your pipeline name to have separate configurations for each pipeline in your project. All config values will be looked at with the pipeline name first and then again without it. ```toml [pipeline_name_1.sources.google_sheets.credentials] @@ -156,10 +156,10 @@ or set up all parameters of connection separately: drivername="snowflake" username="user" password="password" -database = "database" -host = "service-account" -warehouse = "warehouse_name" -role = "role" +database="database" +host="service-account" +warehouse="warehouse_name" +role="role" ``` `dlt` can work with both ways and convert one to another. To learn more about which credential types are supported, visit the [complex credential types](./complex_types) page. @@ -177,7 +177,7 @@ export SOURCES__FACEBOOK_ADS__ACCESS_TOKEN="" Check out the [example](#examples) of setting up credentials through environment variables. :::tip -To organize development and securely manage environment variables for credentials storage, you can use the [python-dotenv](https://pypi.org/project/python-dotenv/) to automatically load variables from an `.env` file. +To organize development and securely manage environment variables for credentials storage, you can use [python-dotenv](https://pypi.org/project/python-dotenv/) to automatically load variables from an `.env` file. 
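A minimal sketch of that pattern (the variable and pipeline name are illustrative; it assumes a `.env` file in the working directory):

```py
import dlt
from dotenv import load_dotenv  # provided by the python-dotenv package

# Load variables such as SOURCES__FACEBOOK_ADS__ACCESS_TOKEN from .env
# into the process environment so dlt's environment provider can resolve them.
load_dotenv()

pipeline = dlt.pipeline(pipeline_name="facebook_ads", destination="duckdb")
```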
::: ## Vaults @@ -187,7 +187,7 @@ For other vault integrations, you are welcome to [contact sales](https://dlthub. ## secrets.toml and config.toml -The TOML config provider in dlt utilizes two TOML files: +The TOML config provider in `dlt` utilizes two TOML files: `config.toml`: @@ -239,7 +239,7 @@ The TOML provider also has the capability to read files from `~/.dlt/` (located `dlt` organizes sections in TOML files in a specific structure required by the [injection mechanism](advanced/#injection-mechanism). Understanding this structure gives you more flexibility in setting credentials. For more details, see [Toml files structure](advanced/#toml-files-structure). -## Custom Providers +## Custom providers You can use the `CustomLoaderDocProvider` classes to supply a custom dictionary to `dlt` for use as a supplier of `config` and `secret` values. The code below demonstrates how to use a config stored in `config.json`. @@ -255,14 +255,14 @@ def load_config(): config_dict = json.load(f) # create the custom provider -provider = CustomLoaderDocProvider("my_json_provider",load_config) +provider = CustomLoaderDocProvider("my_json_provider", load_config) # register provider dlt.config.register_provider(provider) ``` :::tip -Check our an [example](../../examples/custom_config_provider) for a `yaml` based config provider that supports switchable profiles. +Check out an [example](../../examples/custom_config_provider) for a `yaml` based config provider that supports switchable profiles. ::: ## Examples @@ -324,8 +324,8 @@ export RUNTIME__LOG_LEVEL="INFO" export DESTINATION__FILESYSTEM__BUCKET_URL="s3://[your_bucket_name]" export NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION="true" export SOURCE__NOTION__API_KEY="api_key" -export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="api_key" -export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="api_key" +export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="ABCDEFGHIJKLMNOPQRST" +export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="1234567890_access_key" ``` @@ -335,6 +335,8 @@ export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="api_key" ```py import os import dlt +import botocore.session +from dlt.common.credentials import AwsCredentials # you can freely set up configuration directly in the code @@ -345,7 +347,7 @@ os.environ["NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION"] = "true" # or even directly to the dlt.config dlt.config["runtime.log_level"] = "INFO" -dlt.config["destination.filesystem.bucket_url"] = "INFO" +dlt.config["destination.filesystem.bucket_url"] = "s3://[your_bucket_name]" dlt.config["normalize.data_writer.disable_compression"] = "true" # but please, do not set up the secrets in the code! @@ -353,8 +355,6 @@ dlt.config["normalize.data_writer.disable_compression"] = "true" os.environ["SOURCE__NOTION__API_KEY"] = os.environ.get("NOTION_KEY") # or use a third-party credentials supplier -import botocore.session - credentials = AwsCredentials() session = botocore.session.get_session() credentials.parse_native_representation(session) @@ -365,6 +365,7 @@ dlt.secrets["destination.filesystem.credentials"] = credentials + ### Google credentials for both source and destination Let's assume we use the `bigquery` destination and the `google_sheets` source. They both use Google credentials and expect them to be configured under the `credentials` key. @@ -406,8 +407,8 @@ export CREDENTIALS__PROJECT_ID="" ```py import os -# do not set up the secrets directly in the code! 
-# what you can do is reassign env variables +# Do not set up the secrets directly in the code! +# What you can do is reassign env variables. os.environ["CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("GOOGLE_CLIENT_EMAIL") os.environ["CREDENTIALS__PRIVATE_KEY"] = os.environ.get("GOOGLE_PRIVATE_KEY") os.environ["CREDENTIALS__PROJECT_ID"] = os.environ.get("GOOGLE_PROJECT_ID") @@ -431,13 +432,13 @@ os.environ["CREDENTIALS__PROJECT_ID"] = os.environ.get("GOOGLE_PROJECT_ID") ```toml -# google sheet credentials +# Google Sheet credentials [sources.credentials] client_email = "" private_key = "" project_id = "" -# bigquery credentials +# BigQuery credentials [destination.credentials] client_email = "" private_key = "" @@ -449,12 +450,12 @@ project_id = "" ```sh -# google sheet credentials +# Google Sheet credentials export SOURCES__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__CREDENTIALS__PRIVATE_KEY="" export SOURCES__CREDENTIALS__PROJECT_ID="" -# bigquery credentials +# BigQuery credentials export DESTINATION__CREDENTIALS__CLIENT_EMAIL="" export DESTINATION__CREDENTIALS__PRIVATE_KEY="" export DESTINATION__CREDENTIALS__PROJECT_ID="" @@ -468,13 +469,13 @@ export DESTINATION__CREDENTIALS__PROJECT_ID="" import dlt import os -# do not set up the secrets directly in the code! -# what you can do is reassign env variables +# Do not set up the secrets directly in the code! +# What you can do is reassign env variables. os.environ["DESTINATION__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("BIGQUERY_CLIENT_EMAIL") os.environ["DESTINATION__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("BIGQUERY_PRIVATE_KEY") os.environ["DESTINATION__CREDENTIALS__PROJECT_ID"] = os.environ.get("BIGQUERY_PROJECT_ID") -# or set them to the dlt.secrets +# Or set them to the dlt.secrets. dlt.secrets["sources.credentials.client_email"] = os.environ.get("SHEETS_CLIENT_EMAIL") dlt.secrets["sources.credentials.private_key"] = os.environ.get("SHEETS_PRIVATE_KEY") dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_ID") @@ -513,23 +514,23 @@ Let's assume we have several different Google sources and destinations. We can u ```toml -# google sheet credentials +# Google Sheet credentials [sources.google_sheets.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" -# google analytics credentials +# Google Analytics credentials [sources.google_analytics.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" -# bigquery credentials +# BigQuery credentials [destination.bigquery.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" ``` @@ -537,17 +538,17 @@ project_id = "" ```sh -# google sheet credentials +# Google Sheet credentials export SOURCES__GOOGLE_SHEETS__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__GOOGLE_SHEETS__CREDENTIALS__PRIVATE_KEY="" export SOURCES__GOOGLE_SHEETS__CREDENTIALS__PROJECT_ID="" -# google analytics credentials +# Google Analytics credentials export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PRIVATE_KEY="" export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PROJECT_ID="" -# bigquery credentials +# BigQuery credentials export DESTINATION__BIGQUERY__CREDENTIALS__CLIENT_EMAIL="" export DESTINATION__BIGQUERY__CREDENTIALS__PRIVATE_KEY="" export DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID="" @@ -561,8 +562,8 @@ export DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID="" import os import dlt -# do not set up the secrets directly in the code! 
-# what you can do is reassign env variables +# Do not set up the secrets directly in the code! +# What you can do is reassign env variables os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("SHEETS_CLIENT_EMAIL") os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("ANALYTICS_PRIVATE_KEY") os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PROJECT_ID"] = os.environ.get("ANALYTICS_PROJECT_ID") @@ -571,7 +572,7 @@ os.environ["DESTINATION__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("BIGQUERY_ os.environ["DESTINATION__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("BIGQUERY_PRIVATE_KEY") os.environ["DESTINATION__CREDENTIALS__PROJECT_ID"] = os.environ.get("BIGQUERY_PROJECT_ID") -# or set them to the dlt.secrets +# Or set them to the dlt.secrets dlt.secrets["sources.credentials.client_email"] = os.environ.get("SHEETS_CLIENT_EMAIL") dlt.secrets["sources.credentials.private_key"] = os.environ.get("SHEETS_PRIVATE_KEY") dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_ID") @@ -583,7 +584,7 @@ dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_I ### Credentials for several sources of the same type -Let's assume we have several sources of the same type, how can we separate them in the `secrets.toml`? The recommended solution is to use different pipeline names for each source: +Let's assume we have several sources of the same type. How can we separate them in the `secrets.toml`? The recommended solution is to use different pipeline names for each source: None`: This method is called before making the first API call in the `RESTClient.paginate` method. You can use this method to set up the initial request query parameters, headers, etc. For example, you can set the initial page number or cursor value. - `update_state(response: Response, data: Optional[List[Any]]) -> None`: This method updates the paginator's state based on the response of the API call. Typically, you extract pagination details (like the next page reference) from the response and store them in the paginator instance. -- `update_request(request: Request) -> None`: Before making the next API call in `RESTClient.paginate` method, `update_request` is used to modify the request with the necessary parameters to fetch the next page (based on the current state of the paginator). For example, you can add query parameters to the request, or modify the URL. +- `update_request(request: Request) -> None`: Before making the next API call in the `RESTClient.paginate` method, `update_request` is used to modify the request with the necessary parameters to fetch the next page (based on the current state of the paginator). For example, you can add query parameters to the request or modify the URL. -#### Example 1: creating a query parameter paginator +#### Example 1: Creating a query parameter paginator -Suppose an API uses query parameters for pagination, incrementing an page parameter for each subsequent page, without providing direct links to next pages in its responses. E.g. `https://api.example.com/posts?page=1`, `https://api.example.com/posts?page=2`, etc. Here's how you could implement a paginator for this scheme: +Suppose an API uses query parameters for pagination, incrementing a page parameter for each subsequent page, without providing direct links to the next pages in its responses. E.g., `https://api.example.com/posts?page=1`, `https://api.example.com/posts?page=2`, etc. 
Here's how you could implement a paginator for this scheme: ```py from typing import Any, List, Optional @@ -359,7 +353,7 @@ class QueryParamPaginator(BasePaginator): self.page = initial_page def init_request(self, request: Request) -> None: - # This will set the initial page number (e.g. page=1) + # This will set the initial page number (e.g., page=1) self.update_request(request) def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None: @@ -395,9 +389,9 @@ def get_data(): [`PageNumberPaginator`](#pagenumberpaginator) that ships with dlt does the same thing, but with more flexibility and error handling. This example is meant to demonstrate how to implement a custom paginator. For most use cases, you should use the [built-in paginators](#paginators). ::: -#### Example 2: creating a paginator for POST requests +#### Example 2: Creating a paginator for POST requests -Some APIs use POST requests for pagination, where the next page is fetched by sending a POST request with a cursor or other parameters in the request body. This is frequently used in "search" API endpoints or other endpoints with big payloads. Here's how you could implement a paginator for a case like this: +Some APIs use POST requests for pagination, where the next page is fetched by sending a POST request with a cursor or other parameters in the request body. This is frequently used in "search" API endpoints or other endpoints with large payloads. Here's how you could implement a paginator for a case like this: ```py from typing import Any, List, Optional @@ -447,12 +441,12 @@ The available authentication methods are defined in the `dlt.sources.helpers.res - [OAuth2ClientCredentials](#oauth20-authorization) For specific use cases, you can [implement custom authentication](#implementing-custom-authentication) by subclassing the `AuthBase` class from the Requests library. -For specific flavors of OAuth 2.0 you can [implement custom OAuth 2.0](#oauth2-authorization) +For specific flavors of OAuth 2.0, you can [implement custom OAuth 2.0](#oauth2-authorization) by subclassing `OAuth2ClientCredentials`. ### Bearer token authentication -Bearer Token Authentication (`BearerTokenAuth`) is an auth method where the client sends a token in the request's Authorization header (e.g. `Authorization: Bearer `). The server validates this token and grants access if the token is valid. +Bearer Token Authentication (`BearerTokenAuth`) is an auth method where the client sends a token in the request's Authorization header (e.g., `Authorization: Bearer `). The server validates this token and grants access if the token is valid. **Parameters:** @@ -475,7 +469,7 @@ for page in client.paginate("/protected/resource"): ### API key authentication -API Key Authentication (`ApiKeyAuth`) is an auth method where the client sends an API key in a custom header (e.g. `X-API-Key: `, or as a query parameter). +API Key Authentication (`ApiKeyAuth`) is an auth method where the client sends an API key in a custom header (e.g., `X-API-Key: `, or as a query parameter). **Parameters:** @@ -521,15 +515,15 @@ response = client.get("/protected/resource") ### OAuth 2.0 authorization OAuth 2.0 is a common protocol for authorization. We have implemented two-legged authorization employed for server-to-server authorization because the end user (resource owner) does not need to grant approval. -The REST client acts as the OAuth client which obtains a temporary access token from the authorization server. 
This access token is then sent to the resource server to access protected content. If the access token is expired, the OAuth client automatically refreshes it. +The REST client acts as the OAuth client, which obtains a temporary access token from the authorization server. This access token is then sent to the resource server to access protected content. If the access token is expired, the OAuth client automatically refreshes it. -Unfortunately, most OAuth 2.0 implementations vary and thus you might need to subclass `OAuth2ClientCredentials` and implement `build_access_token_request()` to suite the requirements of the specific authorization server you want to interact with. +Unfortunately, most OAuth 2.0 implementations vary, and thus you might need to subclass `OAuth2ClientCredentials` and implement `build_access_token_request()` to suit the requirements of the specific authorization server you want to interact with. **Parameters:** -- `access_token_url`: The url to obtain the temporary access token. +- `access_token_url`: The URL to obtain the temporary access token. - `client_id`: Client credential to obtain authorization. Usually issued via a developer portal. - `client_secret`: Client credential to obtain authorization. Usually issued via a developer portal. -- `access_token_request_data`: A dictionary with data required by the autorization server apart from the `client_id`, `client_secret`, and `"grant_type": "client_credentials"`. Defaults to `None`. +- `access_token_request_data`: A dictionary with data required by the authorization server apart from the `client_id`, `client_secret`, and `"grant_type": "client_credentials"`. Defaults to `None`. - `default_token_expiration`: The time in seconds after which the temporary access token expires. Defaults to 3600. **Example:** @@ -540,7 +534,7 @@ from dlt.sources.helpers.rest_client import RESTClient from dlt.sources.helpers.rest_client.auth import OAuth2ClientCredentials class OAuth2ClientCredentialsHTTPBasic(OAuth2ClientCredentials): - """Used e.g. by Zoom Zoom Video Communications, Inc.""" + """Used e.g. by Zoom Video Communications, Inc.""" def build_access_token_request(self) -> Dict[str, Any]: authentication: str = b64encode( f"{self.client_id}:{self.client_secret}".encode() @@ -597,7 +591,7 @@ client = RESTClient( ## Advanced usage -`RESTClient.paginate()` allows to specify a [custom hook function](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks) that can be used to modify the response objects. For example, to handle specific HTTP status codes gracefully: +`RESTClient.paginate()` allows you to specify a [custom hook function](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks) that can be used to modify the response objects. For example, to handle specific HTTP status codes gracefully: ```py def custom_response_handler(response): @@ -608,7 +602,7 @@ def custom_response_handler(response): client.paginate("/posts", hooks={"response": [custom_response_handler]}) ``` -The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the enpoints that return a 404 status code when there are no items to paginate. +The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for endpoints that return a 404 status code when there are no items to paginate. 
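A sketch of such a hook (the import path for `IgnoreResponseException` is an assumption here — check where your installed dlt version exposes it):

```py
# Assumed import path; verify it for your installed dlt version.
from dlt.sources.helpers.rest_client.exceptions import IgnoreResponseException

def skip_missing(response, *args, **kwargs):
    # Quietly stop paginating when the endpoint reports there is nothing to fetch.
    if response.status_code == 404:
        raise IgnoreResponseException

client.paginate("/posts/1/comments", hooks={"response": [skip_missing]})
```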
## Shortcut for paginating API responses @@ -621,7 +615,6 @@ for page in paginate("https://api.example.com/posts"): print(page) ``` - ## Retry You can customize how the RESTClient retries failed requests by editing your `config.toml`. @@ -641,8 +634,7 @@ request_max_retry_delay = 30 # Cap exponential delay to 30 seconds ### `RESTClient.get()` and `RESTClient.post()` methods -These methods work similarly to the [get()](https://docs.python-requests.org/en/latest/api/#requests.get) and [post()](https://docs.python-requests.org/en/latest/api/#requests.post) functions -from the Requests library. They return a [Response](https://docs.python-requests.org/en/latest/api/#requests.Response) object that contains the response data. +These methods work similarly to the [get()](https://docs.python-requests.org/en/latest/api/#requests.get) and [post()](https://docs.python-requests.org/en/latest/api/#requests.post) functions from the Requests library. They return a [Response](https://docs.python-requests.org/en/latest/api/#requests.Response) object that contains the response data. You can inspect the `Response` object to get the `response.status_code`, `response.headers`, and `response.content`. For example: ```py @@ -659,7 +651,7 @@ print(response.content) ### `RESTClient.paginate()` -Debugging `paginate()` is trickier because it's a generator function that yields [`PageData`](#pagedata) objects. Here's several ways to debug the `paginate()` method: +Debugging `paginate()` is trickier because it's a generator function that yields [`PageData`](#pagedata) objects. Here are several ways to debug the `paginate()` method: 1. Enable [logging](../../running-in-production/running.md#set-the-log-level-and-format) to see detailed information about the HTTP requests: @@ -702,3 +694,4 @@ for page in client.paginate( ): print(page) ``` + diff --git a/docs/website/docs/tutorial/filesystem.md b/docs/website/docs/tutorial/filesystem.md index b748f794d5..6d30eed3e6 100644 --- a/docs/website/docs/tutorial/filesystem.md +++ b/docs/website/docs/tutorial/filesystem.md @@ -4,7 +4,7 @@ description: Learn how to load data files like JSON, JSONL, CSV, and Parquet fro keywords: [dlt, tutorial, filesystem, cloud storage, file system, python, data pipeline, incremental loading, json, jsonl, csv, parquet, duckdb] --- -This tutorial is for you if you need to load data files like JSONL, CSV, and Parquet from either Cloud Storage (ex. AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system. +This tutorial is for you if you need to load data files like JSONL, CSV, and Parquet from either Cloud Storage (e.g., AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system. ## What you will learn @@ -48,24 +48,22 @@ Here’s what each file does: - `config.toml`: This file contains the configuration settings for your dlt project. :::note -When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you to use the environment variables to store secrets and configs instead. Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt. +When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you use environment variables to store secrets and configs instead. 
Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt. ::: ## 2. Creating the pipeline The filesystem source provides users with building blocks for loading data from any type of files. You can break down the data extraction into two steps: -1. Listing the files in the bucket / directory. +1. Listing the files in the bucket/directory. 2. Reading the files and yielding records. dlt's filesystem source includes several resources: -- the `filesystem` resource lists files in the directory or bucket -- several readers resources (`read_csv`, `read_parquet`, `read_jsonl`) read files and yield the records. These resources have a -special type, they called [transformers](../general-usage/resource#process-resources-with-dlttransformer). Transformers expect items from another resource. -In this particular case transformers expect `FileItem` object and transform it into multiple records. +- The `filesystem` resource lists files in the directory or bucket. +- Several readers resources (`read_csv`, `read_parquet`, `read_jsonl`) read files and yield the records. These resources have a special type; they are called [transformers](../general-usage/resource#process-resources-with-dlttransformer). Transformers expect items from another resource. In this particular case, transformers expect a `FileItem` object and transform it into multiple records. -Let's initialize a source and create a pipeline for loading CSV files from Google Cloud Storage to DuckDB. You can replace code from `filesystem_pipeline.py` with the following: +Let's initialize a source and create a pipeline for loading CSV files from Google Cloud Storage to DuckDB. You can replace the code from `filesystem_pipeline.py` with the following: ```py import dlt @@ -81,26 +79,25 @@ print(info) What's happening in the snippet above? -1. We import the `filesystem` resource and initialize it with a bucket URL (`gs://filesystem-tutorial`) and the `file_glob` parameter. dlt uses `file_glob` to filter files names in the bucket. `filesystem` returns a generator object. -2. We pipe the files names yielded by the filesystem resource to the transformer resource `read_csv` to read each file and iterate over records from the file. We name this transformer resource `"encounters"` using the `with_name()`. dlt will use the resource name `"encounters"` as a table name when loading the data. +1. We import the `filesystem` resource and initialize it with a bucket URL (`gs://filesystem-tutorial`) and the `file_glob` parameter. dlt uses `file_glob` to filter file names in the bucket. `filesystem` returns a generator object. +2. We pipe the file names yielded by the filesystem resource to the transformer resource `read_csv` to read each file and iterate over records from the file. We name this transformer resource `"encounters"` using the `with_name()` method. dlt will use the resource name `"encounters"` as a table name when loading the data. :::note A [transformer](../general-usage/resource#process-resources-with-dlttransformer) in dlt is a special type of resource that processes each record from another resource. This lets you chain multiple resources together. ::: -3. We create the dlt pipeline configuring with the name `hospital_data_pipeline` and DuckDB destination. +3. We create the dlt pipeline, configuring it with the name `hospital_data_pipeline` and DuckDB as the destination. 4. We call `pipeline.run()`. 
This is where the underlying generators are iterated: - dlt retrieves remote data, - normalizes data, - creates or updates the table in the destination, - loads the extracted data into the destination. - 5. `print(info)` outputs pipeline running stats we get from `pipeline.run()` +5. `print(info)` outputs the pipeline running stats we get from `pipeline.run()`. ## 3. Configuring the filesystem source :::note -In this tutorial we will work with publicly accessed dataset [Hospital Patient Records](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Hospital%20Patient%20Records) -synthetic electronic health care records. You can use the exact credentials from this tutorial to load this dataset from GCP. +In this tutorial, we will work with the publicly accessed dataset [Hospital Patient Records](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Hospital%20Patient%20Records), which contains synthetic electronic health care records. You can use the exact credentials from this tutorial to load this dataset from GCP.
Citation Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079 @@ -170,7 +167,7 @@ files = filesystem( As you can see, all parameters of `filesystem` can be specified directly in the code or taken from the configuration. :::tip -dlt supports more ways of authorizing with the cloud storages, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../general-usage/credentials/complex_types#aws-credentials). +dlt supports more ways of authorizing with cloud storages, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../general-usage/credentials/complex_types#aws-credentials). ::: ## 4. Running the pipeline @@ -257,9 +254,9 @@ info = pipeline.run(reader, write_disposition="merge") print(info) ``` -Notice that we used `apply_hints` on the `files` resource, not on `reader`. Why did we do that? As mentioned before, the `filesystem` resource lists all files in the storage based on the `file_glob` parameter. So at this point, we can also specify additional conditions to filter out files. In this case, we only want to load files that have been modified since the last load. dlt will automatically keep the state of incremental load and manage the correct filtering. +Notice that we used `apply_hints` on the `files` resource, not on `reader`. As mentioned before, the `filesystem` resource lists all files in the storage based on the `file_glob` parameter. So at this point, we can also specify additional conditions to filter out files. In this case, we only want to load files that have been modified since the last load. dlt will automatically keep the state of the incremental load and manage the correct filtering. -But what if we not only want to process modified files, but we also want to load only new records? In the `encounters` table, we can see the column named `STOP` indicating the timestamp of the end of the encounter. Let's modify our code to load only those records whose `STOP` timestamp was updated since our last load. +But what if we not only want to process modified files but also want to load only new records? In the `encounters` table, we can see the column named `STOP` indicating the timestamp of the end of the encounter. Let's modify our code to load only those records whose `STOP` timestamp was updated since our last load. ```py import dlt @@ -302,7 +299,7 @@ from dlt.sources.filesystem import filesystem def read_csv_custom(items: Iterator[FileItemDict], chunksize: int = 10000, **pandas_kwargs: Any) -> Iterator[TDataItems]: import pandas as pd - # apply defaults to pandas kwargs + # Apply defaults to pandas kwargs kwargs = {**{"header": "infer", "chunksize": chunksize}, **pandas_kwargs} for file_obj in items: @@ -340,7 +337,7 @@ from dlt.common.storages.fsspec_filesystem import FileItemDict from dlt.common.typing import TDataItems from dlt.sources.filesystem import filesystem -# Define a standalone transformer to read data from a json file. 
+# Define a standalone transformer to read data from a JSON file. @dlt.transformer(standalone=True) def read_json(items: Iterator[FileItemDict]) -> Iterator[TDataItems]: for file_obj in items: @@ -367,3 +364,4 @@ Interested in learning more about dlt? Here are some suggestions: - Learn more about the filesystem source configuration in [filesystem source](../dlt-ecosystem/verified-sources/filesystem) - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials) - Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial + diff --git a/docs/website/docs/tutorial/load-data-from-an-api.md b/docs/website/docs/tutorial/load-data-from-an-api.md index 5b1d63373c..3640f0e8d7 100644 --- a/docs/website/docs/tutorial/load-data-from-an-api.md +++ b/docs/website/docs/tutorial/load-data-from-an-api.md @@ -9,7 +9,7 @@ This tutorial introduces you to foundational dlt concepts, demonstrating how to ## What you will learn - Loading data from a list of Python dictionaries into DuckDB. -- Low level API usage with built-in HTTP client. +- Low-level API usage with a built-in HTTP client. - Understand and manage data loading behaviors. - Incrementally load new data and deduplicate existing data. - Dynamic resource creation and reducing code redundancy. @@ -74,13 +74,13 @@ Load package 1692364844.460054 is LOADED and contains no failed jobs ### Explore the data -To allow sneak peek and basic discovery you can take advantage of [built-in integration with Strealmit](../reference/command-line-interface#show-tables-and-data-in-the-destination): +To allow a sneak peek and basic discovery, you can take advantage of [built-in integration with Streamlit](../reference/command-line-interface#show-tables-and-data-in-the-destination): ```sh dlt pipeline quick_start show ``` -**quick_start** is the name of the pipeline from the script above. If you do not have Streamlit installed yet do: +**quick_start** is the name of the pipeline from the script above. If you do not have Streamlit installed yet, do: ```sh pip install streamlit @@ -94,13 +94,13 @@ Streamlit Explore data. Schema and data for a test pipeline “quick_start”. :::tip `dlt` works in Jupyter Notebook and Google Colab! See our [Quickstart Colab Demo.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) -Looking for source code of all the snippets? You can find and run them [from this repository](https://github.com/dlt-hub/dlt/blob/devel/docs/website/docs/getting-started-snippets.py). +Looking for the source code of all the snippets? You can find and run them [from this repository](https://github.com/dlt-hub/dlt/blob/devel/docs/website/docs/getting-started-snippets.py). ::: -Now that you have a basic understanding of how to get started with dlt, you might be eager to dive deeper. For that we need to switch to a more advanced data source - the GitHub API. We will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. +Now that you have a basic understanding of how to get started with dlt, you might be eager to dive deeper. For that, we need to switch to a more advanced data source - the GitHub API. We will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. :::note -This tutorial uses GitHub REST API for demonstration purposes only. If you need to read data from a REST API, consider using the dlt's REST API source. 
Check out the [REST API source tutorial](./rest-api) for quick start or [REST API source reference](../dlt-ecosystem/verified-sources/rest_api) for more details. +This tutorial uses the GitHub REST API for demonstration purposes only. If you need to read data from a REST API, consider using dlt's REST API source. Check out the [REST API source tutorial](./rest-api) for a quick start or the [REST API source reference](../dlt-ecosystem/verified-sources/rest_api) for more details. ::: ## Create a pipeline @@ -112,7 +112,7 @@ First, we need to create a [pipeline](../general-usage/pipeline). Pipelines are Here's what the code above does: 1. It makes a request to the GitHub API endpoint and checks if the response is successful. -2. Then it creates a dlt pipeline with the name `github_issues` and specifies that the data should be loaded to the `duckdb` destination and the `github_data` dataset. Nothing gets loaded yet. +2. Then, it creates a dlt pipeline with the name `github_issues` and specifies that the data should be loaded to the `duckdb` destination and the `github_data` dataset. Nothing gets loaded yet. 3. Finally, it runs the pipeline with the data from the API response (`response.json()`) and specifies that the data should be loaded to the `issues` table. The `run` method returns a `LoadInfo` object that contains information about the loaded data. ## Run the pipeline @@ -134,7 +134,7 @@ dlt pipeline github_issues show Try running the pipeline again with `python github_issues.py`. You will notice that the **issues** table contains two copies of the same data. This happens because the default load mode is `append`. It is very useful, for example, when you have daily data updates and you want to ingest them. To get the latest data, we'd need to run the script again. But how to do that without duplicating the data? -One option is to tell `dlt` to replace the data in existing tables in the destination by using `replace` write disposition. Change the `github_issues.py` script to the following: +One option is to tell `dlt` to replace the data in existing tables in the destination by using the `replace` write disposition. Change the `github_issues.py` script to the following: ```py import dlt @@ -161,7 +161,7 @@ load_info = pipeline.run( print(load_info) ``` -Run this script twice to see that **issues** table still contains only one copy of the data. +Run this script twice to see that the **issues** table still contains only one copy of the data. :::tip What if the API has changed and new fields get added to the response? @@ -172,18 +172,18 @@ See the `replace` mode and table schema migration in action in our [Schema evolu Learn more: - [Full load - how to replace your data](../general-usage/full-loading). -- [Append, replace and merge your tables](../general-usage/incremental-loading). +- [Append, replace, and merge your tables](../general-usage/incremental-loading). ## Declare loading behavior -So far we have been passing the data to the `run` method directly. This is a quick way to get started. However, frequently, you receive data in chunks, and you want to load it as it arrives. For example, you might want to load data from an API endpoint with pagination or a large file that does not fit in memory. In such cases, you can use Python generators as a data source. +So far, we have been passing the data to the `run` method directly. This is a quick way to get started. However, frequently, you receive data in chunks, and you want to load it as it arrives. 
For example, you might want to load data from an API endpoint with pagination or a large file that does not fit in memory. In such cases, you can use Python generators as a data source. You can pass a generator to the `run` method directly or use the `@dlt.resource` decorator to turn the generator into a [dlt resource](../general-usage/resource). The decorator allows you to specify the loading behavior and relevant resource parameters. ### Load only new data (incremental loading) -Let's improve our GitHub API example and get only issues that were created since last load. -Instead of using `replace` write disposition and downloading all issues each time the pipeline is run, we do the following: +Let's improve our GitHub API example and get only issues that were created since the last load. +Instead of using the `replace` write disposition and downloading all issues each time the pipeline is run, we do the following: @@ -192,17 +192,17 @@ Let's take a closer look at the code above. We use the `@dlt.resource` decorator to declare the table name into which data will be loaded and specify the `append` write disposition. -We request issues for dlt-hub/dlt repository ordered by **created_at** field (descending) and yield them page by page in `get_issues` generator function. +We request issues for the dlt-hub/dlt repository ordered by the **created_at** field (descending) and yield them page by page in the `get_issues` generator function. -We also use `dlt.sources.incremental` to track `created_at` field present in each issue to filter in the newly created. +We also use `dlt.sources.incremental` to track the `created_at` field present in each issue to filter in the newly created ones. Now run the script. It loads all the issues from our repo to `duckdb`. Run it again, and you can see that no issues got added (if no issues were created in the meantime). -Now you can run this script on a daily schedule and each day you’ll load only issues created after the time of the previous pipeline run. +Now you can run this script on a daily schedule, and each day you’ll load only issues created after the time of the previous pipeline run. :::tip -Between pipeline runs, `dlt` keeps the state in the same database it loaded data to. -Peek into that state, the tables loaded and get other information with: +Between pipeline runs, `dlt` keeps the state in the same database it loaded data into. +Peek into that state, the tables loaded, and get other information with: ```sh dlt pipeline -v github_issues_incremental info @@ -219,25 +219,25 @@ Learn more: ### Update and deduplicate your data The script above finds **new** issues and adds them to the database. -It will ignore any updates to **existing** issue text, emoji reactions etc. -To get always fresh content of all the issues you combine incremental load with `merge` write disposition, +It will ignore any updates to **existing** issue text, emoji reactions, etc. +To always get fresh content of all the issues, combine incremental load with the `merge` write disposition, like in the script below. -Above we add `primary_key` argument to the `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates which content it will merge. +Above, we add the `primary_key` argument to the `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates whose content it will merge. 
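+
+A minimal sketch of such a resource is shown below. The pipeline and dataset names, and the manual `Link`-header pagination, are illustrative assumptions; only the `merge` write disposition, `primary_key`, and the `updated_at` incremental cursor follow the text above:
+
+```py
+import dlt
+from dlt.sources.helpers import requests
+
+@dlt.resource(table_name="issues", write_disposition="merge", primary_key="id")
+def get_issues(
+    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")
+):
+    # Ask GitHub only for issues updated after the last stored "updated_at" value
+    url = (
+        "https://api.github.com/repos/dlt-hub/dlt/issues"
+        f"?since={updated_at.last_value}&per_page=100&sort=updated&direction=desc&state=open"
+    )
+    while True:
+        response = requests.get(url)
+        response.raise_for_status()
+        yield response.json()
+        # Follow GitHub's Link-header pagination until there is no next page
+        if "next" not in response.links:
+            break
+        url = response.links["next"]["url"]
+
+pipeline = dlt.pipeline(
+    pipeline_name="github_issues_merge",  # assumed name for illustration
+    destination="duckdb",
+    dataset_name="github_data_merge",  # assumed name for illustration
+)
+load_info = pipeline.run(get_issues)
+print(load_info)
+```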
Note that we now track the `updated_at` field — so we filter in all issues **updated** since the last pipeline run (which also includes those newly created). -Pay attention how we use **since** parameter from [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) +Pay attention to how we use the **since** parameter from the [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) and `updated_at.last_value` to tell GitHub to return issues updated only **after** the date we pass. `updated_at.last_value` holds the last `updated_at` value from the previous run. [Learn more about merge write disposition](../general-usage/incremental-loading#merge-incremental_loading). ## Using pagination helper -In the previous examples, we used the `requests` library to make HTTP requests to the GitHub API and handled pagination manually. `dlt` has the built-in [REST client](../general-usage/http/rest-client.md) that simplifies API requests. We'll pick the `paginate()` helper from it for the next example. The `paginate` function takes a URL and optional parameters (quite similar to `requests`) and returns a generator that yields pages of data. +In the previous examples, we used the `requests` library to make HTTP requests to the GitHub API and handled pagination manually. `dlt` has a built-in [REST client](../general-usage/http/rest-client.md) that simplifies API requests. We'll use the `paginate()` helper from it for the next example. The `paginate` function takes a URL and optional parameters (quite similar to `requests`) and returns a generator that yields pages of data. Here's how the updated script looks: @@ -282,10 +282,10 @@ Let's zoom in on the changes: 1. The `while` loop that handled pagination is replaced with reading pages from the `paginate()` generator. 2. `paginate()` takes the URL of the API endpoint and optional parameters. In this case, we pass the `since` parameter to get only issues updated after the last pipeline run. -3. We're not explicitly setting up pagination, `paginate()` handles it for us. Magic! Under the hood, `paginate()` analyzes the response and detects the pagination method used by the API. Read more about pagination in the [REST client documentation](../general-usage/http/rest-client.md#paginating-api-responses). +3. We're not explicitly setting up pagination; `paginate()` handles it for us. Magic! Under the hood, `paginate()` analyzes the response and detects the pagination method used by the API. Read more about pagination in the [REST client documentation](../general-usage/http/rest-client.md#paginating-api-responses). If you want to take full advantage of the `dlt` library, then we strongly suggest that you build your sources out of existing building blocks: -To make most of `dlt`, consider the following: +To make the most of `dlt`, consider the following: ## Use source decorator @@ -301,7 +301,7 @@ from dlt.sources.helpers.rest_client import paginate primary_key="id", ) def get_comments( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") + updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") ): for page in paginate( "https://api.github.com/repos/dlt-hub/dlt/comments", @@ -310,7 +310,7 @@ def get_comments( yield page ``` -We can load this resource separately from the issues resource, however loading both issues and comments in one go is more efficient. 
To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: +We can load this resource separately from the issues resource; however, loading both issues and comments in one go is more efficient. To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: ```py @dlt.source @@ -330,7 +330,7 @@ from dlt.sources.helpers.rest_client import paginate primary_key="id", ) def get_issues( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") + updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") ): for page in paginate( "https://api.github.com/repos/dlt-hub/dlt/issues", @@ -338,7 +338,7 @@ def get_issues( "since": updated_at.last_value, "per_page": 100, "sort": "updated", - "directions": "desc", + "direction": "desc", "state": "open", } ): @@ -351,7 +351,7 @@ def get_issues( primary_key="id", ) def get_comments( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") + updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") ): for page in paginate( "https://api.github.com/repos/dlt-hub/dlt/comments", @@ -380,7 +380,7 @@ print(load_info) ### Dynamic resources -You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and use it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. Here's the refactored code: +You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and using it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. Here's the refactored code: ```py import dlt @@ -414,9 +414,9 @@ row_counts = pipeline.last_trace.last_normalize_info ## Handle secrets -For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). +For the next step, we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). -Let's handle this by changing our `fetch_github_data()` first: +Let's handle this by changing our `fetch_github_data()` function first: ```py from dlt.sources.helpers.rest_client.auth import BearerTokenAuth @@ -444,13 +444,13 @@ def github_source(access_token): ... ``` -Here, we added `access_token` parameter and now we can use it to pass the access token to the request: +Here, we added an `access_token` parameter and now we can use it to pass the access token to the request: ```py load_info = pipeline.run(github_source(access_token="ghp_XXXXX")) ``` -It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. 
One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. +It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()`, and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. To use it, change the `github_source()` function to: @@ -467,7 +467,7 @@ When you add `dlt.secrets.value` as a default value for an argument, `dlt` will 1. Special environment variables. 2. `secrets.toml` file. -The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). +The `secrets.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). Let's add the token to the `~/.dlt/secrets.toml` file: @@ -505,7 +505,7 @@ load_info = pipeline.run(github_source()) ## Configurable sources -The next step is to make our dlt GitHub source reusable so it can load data from any GitHub repo. We'll do that by changing both `github_source()` and `fetch_github_data()` functions to accept the repo name as a parameter: +The next step is to make our dlt GitHub source reusable so it can load data from any GitHub repo. We'll do that by changing both the `github_source()` and `fetch_github_data()` functions to accept the repo name as a parameter: ```py import dlt @@ -515,7 +515,7 @@ BASE_GITHUB_URL = "https://api.github.com/repos/{repo_name}" def fetch_github_data(repo_name, endpoint, params={}, access_token=None): - """Fetch data from GitHub API based on repo_name, endpoint, and params.""" + """Fetch data from the GitHub API based on repo_name, endpoint, and params.""" url = BASE_GITHUB_URL.format(repo_name=repo_name) + f"/{endpoint}" return paginate( url, @@ -564,18 +564,16 @@ Interested in learning more? Here are some suggestions: 1. You've been running your pipelines locally. Learn how to [deploy and run them in the cloud](../walkthroughs/deploy-a-pipeline/). 2. Dive deeper into how dlt works by reading the [Using dlt](../general-usage) section. Some highlights: - [Set up "last value" incremental loading](../general-usage/incremental-loading#incremental_loading-with-last-value). - - Learn about data loading strategies: [append, replace and merge](../general-usage/incremental-loading). + - Learn about data loading strategies: [append, replace, and merge](../general-usage/incremental-loading). - [Connect the transformers to the resources](../general-usage/resource#feeding-data-from-one-resource-into-another) to load additional data or enrich it. - [Customize your data schema—set primary and merge keys, define column nullability, and specify data types](../general-usage/resource#define-schema). - [Create your resources dynamically from data](../general-usage/source#create-resources-dynamically). - [Transform your data before loading](../general-usage/resource#customize-resources) and see some [examples of customizations like column renames and anonymization](../general-usage/customising-pipelines/renaming_columns). 
- Employ data transformations using [SQL](../dlt-ecosystem/transformations/sql) or [Pandas](../dlt-ecosystem/transformations/sql). - [Pass config and credentials into your sources and resources](../general-usage/credentials). - - [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running). - - [Run resources in parallel, optimize buffers and local storage](../reference/performance.md) + - [Run in production: inspecting, tracing, retry policies, and cleaning up](../running-in-production/running). + - [Run resources in parallel, optimize buffers, and local storage](../reference/performance.md) - [Use REST API client helpers](../general-usage/http/rest-client.md) to simplify working with REST APIs. -3. Explore [destinations](../dlt-ecosystem/destinations/) and [sources](../dlt-ecosystem/verified-sources/) provided by us and community. -4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios - - +3. Explore [destinations](../dlt-ecosystem/destinations/) and [sources](../dlt-ecosystem/verified-sources/) provided by us and the community. +4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios. diff --git a/docs/website/docs/tutorial/rest-api.md b/docs/website/docs/tutorial/rest-api.md index 0ae50695b4..e1c4d63daa 100644 --- a/docs/website/docs/tutorial/rest-api.md +++ b/docs/website/docs/tutorial/rest-api.md @@ -36,7 +36,7 @@ If you see the version number (such as "dlt 0.5.3"), you're ready to proceed. ## Setting up a new project -Initialize a new dlt project with REST API source and DuckDB destination: +Initialize a new dlt project with a REST API source and DuckDB destination: ```sh dlt init rest_api duckdb @@ -76,7 +76,7 @@ Let's verify that the pipeline is working as expected. Run the following command python rest_api_pipeline.py ``` -You should see the output of the pipeline execution in the terminal. The output will also diplay the location of the DuckDB database file where the data is stored: +You should see the output of the pipeline execution in the terminal. The output will also display the location of the DuckDB database file where the data is stored: ```sh Pipeline rest_api_pokemon load step completed in 1.08 seconds @@ -100,7 +100,7 @@ dlt pipeline rest_api_pokemon show ``` The command opens a new browser window with the data browser application. `rest_api_pokemon` is the name of the pipeline defined in the `rest_api_pipeline.py` file. -You can explore the loaded data, run queries and see some pipeline execution details: +You can explore the loaded data, run queries, and see some pipeline execution details: ![Explore rest_api data in Streamlit App](https://dlt-static.s3.eu-central-1.amazonaws.com/images/docs-rest-api-tutorial-streamlit-screenshot.png) @@ -145,9 +145,9 @@ def load_pokemon() -> None: print(load_info) ``` -Here what's happening in the code: +Here's what's happening in the code: -1. With `dlt.pipeline()` we define a new pipeline named `rest_api_pokemon` with DuckDB as the destination and `rest_api_data` as the dataset name. +1. With `dlt.pipeline()`, we define a new pipeline named `rest_api_pokemon` with DuckDB as the destination and `rest_api_data` as the dataset name. 2. The `rest_api_source()` function creates a new REST API source object. 3. We pass this source object to the `pipeline.run()` method to start the pipeline execution. Inside the `run()` method, dlt will fetch data from the API and load it into the DuckDB database. 4. 
The `print(load_info)` outputs the pipeline execution details to the console. @@ -169,7 +169,7 @@ config: RESTAPIConfig = { ``` - The `client` configuration is used to connect to the web server and authenticate if necessary. For our simple example, we only need to specify the `base_url` of the API: `https://pokeapi.co/api/v2/`. -- The `resource_defaults` configuration allows you to set default parameters for all resources. Normally you would set common parameters here, such as pagination limits. In our Pokemon API example, we set the `limit` parameter to 1000 for all resources to retrieve more data in a single request and reduce the number of HTTP API calls. +- The `resource_defaults` configuration allows you to set default parameters for all resources. Normally, you would set common parameters here, such as pagination limits. In our Pokemon API example, we set the `limit` parameter to 1000 for all resources to retrieve more data in a single request and reduce the number of HTTP API calls. - The `resources` list contains the names of the resources you want to load from the API. REST API will use some conventions to determine the endpoint URL based on the resource name. For example, the resource name `pokemon` will be translated to the endpoint URL `https://pokeapi.co/api/v2/pokemon`. :::note @@ -179,7 +179,7 @@ You may have noticed that we didn't specify any pagination configuration in the ## Appending, replacing, and merging loaded data -Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that all the tables have data duplicated. This happens because by default, dlt appends the data to the destination table. In dlt you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are: +Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that all the tables have duplicated data. This happens because, by default, dlt appends the data to the destination table. In dlt, you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are: - `append`: Appends the data to the destination table. This is the default. - `replace`: Replaces the data in the destination table with the new data. - `merge`: Merges the new data with the existing data in the destination table based on the primary key. @@ -237,7 +237,7 @@ pokemon_source = rest_api_source( }, }, # For the `berry` and `location` resources, we keep - # the`replace` write disposition + # the `replace` write disposition "write_disposition": "replace", }, "resources": [ @@ -312,7 +312,7 @@ load_info = pipeline.run(github_source) print(load_info) ``` -In this configuration, the `since` parameter is defined as a special incremental parameter. The `cursor_path` field specifies the JSON path to the field that will be used to fetch the updated data and we use the `initial_value` for the initial value for the incremental parameter. This value will be used in the first request to fetch the data. +In this configuration, the `since` parameter is defined as a special incremental parameter. The `cursor_path` field specifies the JSON path to the field that will be used to fetch the updated data, and we use the `initial_value` for the initial value for the incremental parameter. This value will be used in the first request to fetch the data. 
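+
+For reference, an incremental parameter of this kind sits directly in the endpoint's `params` block. The sketch below shows the general shape; the import path, base URL, resource name, and initial value are assumptions for illustration:
+
+```py
+# Assumed import path; in older versions the source may live in a local rest_api package
+from dlt.sources.rest_api import rest_api_source
+
+github_source = rest_api_source({
+    "client": {"base_url": "https://api.github.com/repos/dlt-hub/dlt/"},
+    "resources": [
+        {
+            "name": "issues",
+            "endpoint": {
+                "path": "issues",
+                "params": {
+                    # "since" is sent to the API; dlt tracks its value via cursor_path
+                    "since": {
+                        "type": "incremental",
+                        "cursor_path": "updated_at",
+                        "initial_value": "2024-01-25T00:00:00Z",
+                    },
+                },
+            },
+        },
+    ],
+})
+```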
When the pipeline runs, dlt will automatically update the `since` parameter with the latest value from the response data. This way, you can fetch only the new or updated data from the API. @@ -324,5 +324,6 @@ Congratulations on completing the tutorial! You've learned how to set up a REST Interested in learning more about dlt? Here are some suggestions: -- Learn more about the REST API source configuration in [REST API source documentation](../dlt-ecosystem/verified-sources/rest_api/) -- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial \ No newline at end of file +- Learn more about the REST API source configuration in the [REST API source documentation](../dlt-ecosystem/verified-sources/rest_api/) +- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial. + diff --git a/docs/website/docs/tutorial/sql-database.md b/docs/website/docs/tutorial/sql-database.md index cc4edddd14..abaec53ce2 100644 --- a/docs/website/docs/tutorial/sql-database.md +++ b/docs/website/docs/tutorial/sql-database.md @@ -42,7 +42,7 @@ After running this command, your project will have the following structure: Here’s what each file does: -- `sql_database_pipeline.py`: This is the main script where you'll define your data pipeline. It contains several different examples for how you can configure your SQL Database pipeline. +- `sql_database_pipeline.py`: This is the main script where you'll define your data pipeline. It contains several different examples of how you can configure your SQL Database pipeline. - `requirements.txt`: This file lists all the Python dependencies required for your project. - `.dlt/`: This directory contains the [configuration files](../general-usage/credentials/) for your project: - `secrets.toml`: This file stores your credentials, API keys, tokens, and other sensitive information. @@ -69,14 +69,14 @@ from dlt.sources.sql_database import sql_database def load_tables_family_and_genome(): - # create a dlt source that will load tables "family" and "genome" + # Create a dlt source that will load tables "family" and "genome" source = sql_database().with_resources("family", "genome") # Create a dlt pipeline object pipeline = dlt.pipeline( - pipeline_name="sql_to_duckdb_pipeline", # custom name for the pipeline + pipeline_name="sql_to_duckdb_pipeline", # Custom name for the pipeline destination="duckdb", # dlt destination to which the data will be loaded - dataset_name="sql_to_duckdb_pipeline_data" # custom name for the dataset created in the destination + dataset_name="sql_to_duckdb_pipeline_data" # Custom name for the dataset created in the destination ) # Run the pipeline @@ -99,7 +99,7 @@ Explanation: ## 3. Add credentials -To sucessfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. +To successfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. 
Simply paste the [connection details](https://docs.rfam.org/en/latest/database.html) inside `secrets.toml` as follows: ```toml @@ -117,7 +117,7 @@ Alternatively, you can also paste the credentials as a connection string: sources.sql_database.credentials="mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" ``` -For more details on the credentials format and other connection methods read the section on [configuring connection to the SQL Database](../dlt-ecosystem/verified-sources/sql_database#credentials-format). +For more details on the credentials format and other connection methods, read the section on [configuring connection to the SQL Database](../dlt-ecosystem/verified-sources/sql_database#credentials-format). ## 4. Install dependencies @@ -141,7 +141,7 @@ After performing steps 1-4, you should now be able to successfully run the pipel ```sh python sql_database_pipeline.py ``` -This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory which contains the loaded data. +This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory, which contains the loaded data. ## 6. Explore the data @@ -157,14 +157,13 @@ Next, run the following command to launch the data browser app: dlt pipeline sql_to_duckdb_pipeline show ``` -You can explore the loaded data, run queries and see some pipeline execution details. +You can explore the loaded data, run queries, and see some pipeline execution details. ![streamlit-screenshot](https://storage.googleapis.com/dlt-blog-images/docs-sql-database-tutorial-streamlit-screenshot.png) ## 7. Append, replace, or merge loaded data -Try running the pipeline again with `python sql_database_pipeline.py`. You will notice that -all the tables have the data duplicated. This happens as dlt, by default, appends data to the destination tables in every load. This behavior can be adjusted by setting the `write_disposition` parameter inside the `pipeline.run()` method. The possible settings are: +Try running the pipeline again with `python sql_database_pipeline.py`. You will notice that all the tables have the data duplicated. This happens as dlt, by default, appends data to the destination tables in every load. This behavior can be adjusted by setting the `write_disposition` parameter inside the `pipeline.run()` method. The possible settings are: - `append`: Appends the data to the destination table. This is the default. - `replace`: Replaces the data in the destination table with the new data. @@ -203,7 +202,7 @@ Run the pipeline again with `sql_database_pipeline.py`. This time, the data will When you want to update the existing data as new data is loaded, you can use the `merge` write disposition. This requires specifying a primary key for the table. The primary key is used to match the new data with the existing data in the destination table. -In the previous example, we set `write_disposition="replace"` inside `pipeline.run()` which caused all the tables to be loaded with `replace`. However, it's also possible to define the `write_disposition` strategy separately for each tables using the `apply_hints` method. In the example below, we use `apply_hints` on each table to specify different primary keys for merge: +In the previous example, we set `write_disposition="replace"` inside `pipeline.run()` which caused all the tables to be loaded with `replace`. However, it's also possible to define the `write_disposition` strategy separately for each table using the `apply_hints` method. 
In the example below, we use `apply_hints` on each table to specify different primary keys for merge: ```py import dlt @@ -233,7 +232,7 @@ if __name__ == '__main__': ## 8. Load data incrementally -Often you don't want to load the whole data in each load, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading). +Often, you don't want to load the entire dataset in each load, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading). In the example below, we configure the table `"family"` to load incrementally based on the column `"updated"`: @@ -274,3 +273,4 @@ Interested in learning more about dlt? Here are some suggestions: - Learn more about the SQL Database source configuration in [the SQL Database source reference](../dlt-ecosystem/verified-sources/sql_database) - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials) - Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial +