Exporter: performance improvements for big workspaces (#3167)
* Exporter: performance improvements for big workspaces

The first part of the performance improvements: parallel generation of resources.

For each identified resource, its body is generated separately from other resources by a
pool of `EXPORTER_RESOURCE_HANDLERS` goroutines (default: 50) - this helps with
processing references to other resources (we still have some performance problems for
complex resources, such as jobs, but that will be improved in the next commits). Generated
bodies are sent to dedicated channels that are responsible for writing the code into
files.
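The generator/writer split described above can be sketched as a classic worker pool feeding a single output channel. This is a minimal illustration under stated assumptions: `generateAll`, `generated`, and the in-memory "writer" are hypothetical stand-ins, not the exporter's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// generated pairs a resource name with its rendered body.
type generated struct {
	name string
	body string
}

// generateAll renders each resource on one of `workers` goroutines and
// streams the results to a single channel consumed by the "writer" side.
func generateAll(resources []string, workers int, render func(string) string) []generated {
	jobs := make(chan string)
	out := make(chan generated)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				out <- generated{name: r, body: render(r)}
			}
		}()
	}
	go func() {
		for _, r := range resources {
			jobs <- r
		}
		close(jobs)
		wg.Wait() // all workers done: no more sends to out
		close(out)
	}()

	// The "writer" side: here we just collect instead of writing files.
	var results []generated
	for g := range out {
		results = append(results, g)
	}
	return results
}

func main() {
	got := generateAll([]string{"a", "b", "c"}, 2, func(r string) string {
		return "resource " + r
	})
	fmt.Println(len(got))
}
```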

Next steps:

- reimplement the `resourceApproximation` structure to avoid linear search.
- optimize the reference search.
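The usual way to remove a linear search like the first item describes is a map-backed set. A minimal sketch, assuming a key of resource type plus ID; `approxSet` and its methods are hypothetical stand-ins for the real `resourceApproximation` code:

```go
package main

import "fmt"

// approxSet replaces a linear Has/Append scan with a map keyed by
// resource type and ID, making membership checks O(1).
type approxSet struct {
	byKey map[string]struct{}
}

func newApproxSet() *approxSet {
	return &approxSet{byKey: map[string]struct{}{}}
}

// key joins type and ID with a separator that cannot occur in either.
func key(typ, id string) string { return typ + "\x00" + id }

func (s *approxSet) Has(typ, id string) bool {
	_, ok := s.byKey[key(typ, id)]
	return ok
}

func (s *approxSet) Append(typ, id string) {
	s.byKey[key(typ, id)] = struct{}{}
}

func main() {
	s := newApproxSet()
	s.Append("databricks_notebook", "/Users/someone/nb")
	fmt.Println(s.Has("databricks_notebook", "/Users/someone/nb"))
}
```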

* add error code handling

* Don't use os.Stat, just count resources

* More aggressive caching of some checks, such as users & their directories

* Rewrite resource approximation has/append

* Use dedicated channels only for some resources, everything else goes to a shared one
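The routing decision above - a dedicated channel where one exists, the shared one otherwise - amounts to a single map lookup. A sketch under stated assumptions; `channelFor` is a hypothetical name, not the exporter's actual function:

```go
package main

import "fmt"

// channelFor routes a resource to its dedicated channel when one exists,
// falling back to the shared "default" channel otherwise.
func channelFor(dedicated map[string]chan string, def chan string, resource string) chan string {
	if ch, ok := dedicated[resource]; ok {
		return ch
	}
	return def
}

func main() {
	def := make(chan string, 1)
	dedicated := map[string]chan string{
		"databricks_user": make(chan string, 1),
	}
	// databricks_job has no dedicated channel, so it goes to the shared one.
	ch := channelFor(dedicated, def, "databricks_job")
	fmt.Println(ch == def)
}
```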

* Rewrote references lookup to avoid iteration when there is a direct lookup

Prefix lookups & case-insensitive lookups are still done by iterating over resources

* Reorder reference lookups so that the most expensive ones aren't executed before the direct lookup

Also, for case-insensitive matching, first try a direct lookup before iterating
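The ordering described in the last two items can be sketched as: exact map hit first, then a lowercased direct hit, and only then the expensive prefix scan. A minimal illustration; `lookupRef` and the two-map layout are assumptions, not the exporter's actual data structures:

```go
package main

import (
	"fmt"
	"strings"
)

// lookupRef tries the cheapest matches first: an exact map lookup, then a
// case-insensitive direct lookup via a pre-lowercased index, and only as a
// last resort a linear prefix scan over all known keys.
func lookupRef(index, lowered map[string]string, value string) (string, bool) {
	if r, ok := index[value]; ok { // cheap direct hit
		return r, true
	}
	if r, ok := lowered[strings.ToLower(value)]; ok { // case-insensitive direct hit
		return r, true
	}
	for k, r := range index { // most expensive: iterate for prefix matches
		if strings.HasPrefix(value, k) {
			return r, true
		}
	}
	return "", false
}

func main() {
	index := map[string]string{"/Repos/x": "repo_x"}
	lowered := map[string]string{"/repos/x": "repo_x"}
	r, _ := lookupRef(index, lowered, "/REPOS/X")
	fmt.Println(r)
}
```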

* Fix test of cluster import

* Start to emit notebooks & workspace files during the listing, without waiting for it to finish

* Further optimize notebooks/workspace files emits

* Introduce lightweight check for user existence
* Emit workspace objects from separate goroutines to avoid the workspace listing getting stuck on
  user lookups

* Reorganize the order of checks in `Emit` function

The function now checks whether a service is enabled before performing other checks - this
should decrease the number of lookups in the state approximation
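The effect of that reordering is simple short-circuiting: the cheap "is this service enabled?" test runs first, so disabled services never touch the larger already-seen state at all. A sketch with illustrative names, not the exporter's actual `Emit` implementation:

```go
package main

import "fmt"

// emit returns true only when the resource should actually be processed.
// The cheapest predicate runs first; the state lookup runs only for
// enabled services.
func emit(enabled, seen map[string]bool, service, id string) bool {
	if !enabled[service] { // cheapest check: skip disabled services outright
		return false
	}
	key := service + ":" + id
	if seen[key] { // only now consult the (larger) state approximation
		return false
	}
	seen[key] = true
	return true
}

func main() {
	enabled := map[string]bool{"notebooks": true}
	seen := map[string]bool{}
	fmt.Println(emit(enabled, seen, "notebooks", "/a"))
	fmt.Println(emit(enabled, seen, "notebooks", "/a"))
	fmt.Println(emit(enabled, seen, "jobs", "1"))
}
```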

* Move parallel workspace listing to the exporter implementation

* Fix test

* Another fix

* Another attempt to fix tests

* Small adjustments

* Control the submissions to the default channel to avoid deadlock
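One way to bound submissions to a shared channel is a counting semaphore: producers acquire a slot before sending, so a full channel throttles them instead of wedging every handler goroutine at once. This is a sketch of the idea only, with hypothetical names - not the exporter's actual mechanism:

```go
package main

import "fmt"

// submitter caps how many items may be queued but not yet drained on the
// shared "default" channel.
type submitter struct {
	sem chan struct{} // counting semaphore: one slot per in-flight item
	out chan string   // the shared channel itself
}

func newSubmitter(limit, buffer int) *submitter {
	return &submitter{
		sem: make(chan struct{}, limit),
		out: make(chan string, buffer),
	}
}

// submit blocks once `limit` items are in flight.
func (s *submitter) submit(item string) {
	s.sem <- struct{}{}
	s.out <- item
}

// drain consumes one item and releases its semaphore slot.
func (s *submitter) drain() string {
	item := <-s.out
	<-s.sem
	return item
}

func main() {
	s := newSubmitter(2, 2)
	s.submit("databricks_job.a")
	fmt.Println(s.drain())
}
```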
alexott authored Jan 31, 2024
1 parent 8fca518 commit 510fd60
Showing 12 changed files with 999 additions and 581 deletions.
6 changes: 5 additions & 1 deletion docs/guides/experimental-exporter.md
@@ -50,6 +50,8 @@ All arguments are optional, and they tune what code is being generated.
 * `-updated-since` - timestamp (in ISO8601 format supported by Go language) for exporting of resources modified since a given timestamp. I.e., `2023-07-24T00:00:00Z`. If not specified, the exporter will try to load the last run timestamp from the `exporter-run-stats.json` file generated during the export and use it.
 * `-notebooksFormat` - optional format for exporting of notebooks. Supported values are `SOURCE` (default), `DBC`, `JUPYTER`. This option could be used to export notebooks with embedded dashboards.
 * `-noformat` - optionally turn off the execution of `terraform fmt` on the exported files (enabled by default).
+* `-debug` - turn on debug output.
+* `-trace` - turn on trace output (includes debug level as well).

## Services

@@ -92,7 +94,9 @@ To speed up export, Terraform Exporter performs many operations, such as listing
 
 * `EXPORTER_WS_LIST_PARALLELISM` (default: `5`) controls how many Goroutines are used to perform parallel listing of Databricks Workspace objects (notebooks, directories, workspace files, ...).
 * `EXPORTER_DIRECTORIES_CHANNEL_SIZE` (default: `100000`) controls the channel's capacity when listing workspace objects. Please ensure that this value is big enough (greater than the number of directories in the workspace; default value should be ok for most cases); otherwise, there is a chance of deadlock.
-* `EXPORTER_PARALLELISM_NNN` - number of Goroutines used to process resources of a specific type (replace `NNN` with the exact resource name, for example, `EXPORTER_PARALLELISM_databricks_notebook=10` sets the number of Goroutines for `databricks_notebook` resource to `10`). Defaults for some resources are defined by the `goroutinesNumber` map in `exporter/context.go` or equal to `2` if there is no value. *Don't increase default values too much to avoid REST API throttling!*
+* `EXPORTER_DEDICATED_RESOUSE_CHANNELS` - by default, only specific resources (`databricks_user`, `databricks_service_principal`, `databricks_group`) have dedicated channels - the rest are handled by the shared channel. This is done to prevent throttling by specific APIs. You can override this by providing a comma-separated list of resources as this environment variable.
+* `EXPORTER_PARALLELISM_NNN` - number of Goroutines used to process resources of a specific type (replace `NNN` with the exact resource name, for example, `EXPORTER_PARALLELISM_databricks_notebook=10` sets the number of Goroutines for `databricks_notebook` resource to `10`). There is a shared channel (with name `default`) for handling of resources for which there are no dedicated channels - use `EXPORTER_PARALLELISM_default` to increase its size (default size is `15`). Defaults for some resources are defined by the `goroutinesNumber` map in `exporter/context.go` or equal to `2` if there is no value. *Don't increase default values too much to avoid REST API throttling!*
+* `EXPORTER_DEFAULT_HANDLER_CHANNEL_SIZE` - the size of the shared channel (default: `200000`) - you may need to increase it if you have a huge workspace.


## Support Matrix
11 changes: 7 additions & 4 deletions exporter/command.go
@@ -98,7 +98,7 @@ func Run(args ...string) error {
 	if err != nil {
 		return err
 	}
-	var skipInteractive bool
+	var skipInteractive, trace, debug bool
 	flags.BoolVar(&skipInteractive, "skip-interactive", false, "Skip interactive mode")
 	flags.BoolVar(&ic.includeUserDomains, "includeUserDomains", false, "Include domain portion in `databricks_user` resource name")
 	flags.BoolVar(&ic.importAllUsers, "importAllUsers", false,
@@ -113,7 +113,8 @@ func Run(args ...string) error {
 	flags.BoolVar(&ic.noFormat, "noformat", false, "Don't run `terraform fmt` on exported files")
 	flags.StringVar(&ic.updatedSinceStr, "updated-since", "",
 		"Include only resources updated since a given timestamp (in ISO8601 format, i.e. 2023-07-01T00:00:00Z)")
-	flags.BoolVar(&ic.debug, "debug", false, "Print extra debug information.")
+	flags.BoolVar(&debug, "debug", false, "Print extra debug information.")
+	flags.BoolVar(&trace, "trace", false, "Print full debug information.")
 	flags.BoolVar(&ic.mounts, "mounts", false, "List DBFS mount points.")
 	flags.BoolVar(&ic.generateDeclaration, "generateProviderDeclaration", true,
 		"Generate Databricks provider declaration.")
@@ -146,9 +147,11 @@ func Run(args ...string) error {
 	if len(prefix) > 0 {
 		ic.prefix = prefix + "_"
 	}
-	if ic.debug {
+	if trace {
+		logLevel = append(logLevel, "[DEBUG]", "[TRACE]")
+	} else if debug {
 		logLevel = append(logLevel, "[DEBUG]")
 	}
 	ic.services = strings.Split(configuredServices, ",")
 	ic.enableServices(configuredServices)
 	return ic.Run()
 }
