Exporter: performance improvements for big workspaces (#3167)
* Exporter: performance improvements for big workspaces

The first part of the performance improvements: parallel generation of resources.

For each identified resource, its body is generated separately from other resources by a
pool of `EXPORTER_RESOURCE_HANDLERS` goroutines (default: 50) - this helps with
processing references to other resources (we still have some performance problems for
complex resources, such as jobs, but that will be improved in the next commits). Generated
bodies are sent to dedicated channels that are responsible for writing the code into
files.
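The generator/writer split described above can be sketched as a classic worker pool feeding a single output channel. This is a minimal illustration under stated assumptions: `generateAll`, `generated`, and the in-memory "writer" are hypothetical stand-ins, not the exporter's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// generated pairs a resource name with its rendered body.
type generated struct {
	name string
	body string
}

// generateAll renders each resource on one of `workers` goroutines and
// streams the results to a single channel consumed by the "writer" side.
func generateAll(resources []string, workers int, render func(string) string) []generated {
	jobs := make(chan string)
	out := make(chan generated)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				out <- generated{name: r, body: render(r)}
			}
		}()
	}
	go func() {
		for _, r := range resources {
			jobs <- r
		}
		close(jobs)
		wg.Wait() // all workers done: no more sends to out
		close(out)
	}()

	// The "writer" side: here we just collect instead of writing files.
	var results []generated
	for g := range out {
		results = append(results, g)
	}
	return results
}

func main() {
	got := generateAll([]string{"a", "b", "c"}, 2, func(r string) string {
		return "resource " + r
	})
	fmt.Println(len(got))
}
```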

Next steps:

- reimplement the `resourceApproximation` structure to avoid linear search.
- optimize the reference search.
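The usual way to remove a linear search like the first item describes is a map-backed set. A minimal sketch, assuming a key of resource type plus ID; `approxSet` and its methods are hypothetical stand-ins for the real `resourceApproximation` code:

```go
package main

import "fmt"

// approxSet replaces a linear Has/Append scan with a map keyed by
// resource type and ID, making membership checks O(1).
type approxSet struct {
	byKey map[string]struct{}
}

func newApproxSet() *approxSet {
	return &approxSet{byKey: map[string]struct{}{}}
}

// key joins type and ID with a separator that cannot occur in either.
func key(typ, id string) string { return typ + "\x00" + id }

func (s *approxSet) Has(typ, id string) bool {
	_, ok := s.byKey[key(typ, id)]
	return ok
}

func (s *approxSet) Append(typ, id string) {
	s.byKey[key(typ, id)] = struct{}{}
}

func main() {
	s := newApproxSet()
	s.Append("databricks_notebook", "/Users/someone/nb")
	fmt.Println(s.Has("databricks_notebook", "/Users/someone/nb"))
}
```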

* add error code handling

* Don't use os.Stat, just count resources

* More aggressive caching of some checks, such as users & their directories

* Rewrite resource approximation has/append

* Use dedicated channels only for some resources, everything else goes to a shared one
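The routing decision above - a dedicated channel where one exists, the shared one otherwise - amounts to a single map lookup. A sketch under stated assumptions; `channelFor` is a hypothetical name, not the exporter's actual function:

```go
package main

import "fmt"

// channelFor routes a resource to its dedicated channel when one exists,
// falling back to the shared "default" channel otherwise.
func channelFor(dedicated map[string]chan string, def chan string, resource string) chan string {
	if ch, ok := dedicated[resource]; ok {
		return ch
	}
	return def
}

func main() {
	def := make(chan string, 1)
	dedicated := map[string]chan string{
		"databricks_user": make(chan string, 1),
	}
	// databricks_job has no dedicated channel, so it goes to the shared one.
	ch := channelFor(dedicated, def, "databricks_job")
	fmt.Println(ch == def)
}
```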

* Rewrote references lookup to avoid iteration when there is a direct lookup

Prefix lookups & case-insensitive lookups are still done by iterating over resources

* Reorder reference lookups so that the most expensive ones aren't executed before the direct lookup

Also, for case-insensitive matching, first try a direct lookup before iterating
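The ordering described in the last two items can be sketched as: exact map hit first, then a lowercased direct hit, and only then the expensive prefix scan. A minimal illustration; `lookupRef` and the two-map layout are assumptions, not the exporter's actual data structures:

```go
package main

import (
	"fmt"
	"strings"
)

// lookupRef tries the cheapest matches first: an exact map lookup, then a
// case-insensitive direct lookup via a pre-lowercased index, and only as a
// last resort a linear prefix scan over all known keys.
func lookupRef(index, lowered map[string]string, value string) (string, bool) {
	if r, ok := index[value]; ok { // cheap direct hit
		return r, true
	}
	if r, ok := lowered[strings.ToLower(value)]; ok { // case-insensitive direct hit
		return r, true
	}
	for k, r := range index { // most expensive: iterate for prefix matches
		if strings.HasPrefix(value, k) {
			return r, true
		}
	}
	return "", false
}

func main() {
	index := map[string]string{"/Repos/x": "repo_x"}
	lowered := map[string]string{"/repos/x": "repo_x"}
	r, _ := lookupRef(index, lowered, "/REPOS/X")
	fmt.Println(r)
}
```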

* Fix test of cluster import

* Start to emit notebooks & workspace files during the listing, without waiting for it to finish

* Further optimize notebooks/workspace files emits

* Introduce lightweight check for user existence
* Emit workspace objects from separate goroutines to avoid the workspace listing getting stuck on
  user lookups

* Reorganize the order of checks in `Emit` function

The function now checks whether a service is enabled before performing other checks - this
should decrease the number of lookups in the state approximation
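The effect of that reordering is simple short-circuiting: the cheap "is this service enabled?" test runs first, so disabled services never touch the larger already-seen state at all. A sketch with illustrative names, not the exporter's actual `Emit` implementation:

```go
package main

import "fmt"

// emit returns true only when the resource should actually be processed.
// The cheapest predicate runs first; the state lookup runs only for
// enabled services.
func emit(enabled, seen map[string]bool, service, id string) bool {
	if !enabled[service] { // cheapest check: skip disabled services outright
		return false
	}
	key := service + ":" + id
	if seen[key] { // only now consult the (larger) state approximation
		return false
	}
	seen[key] = true
	return true
}

func main() {
	enabled := map[string]bool{"notebooks": true}
	seen := map[string]bool{}
	fmt.Println(emit(enabled, seen, "notebooks", "/a"))
	fmt.Println(emit(enabled, seen, "notebooks", "/a"))
	fmt.Println(emit(enabled, seen, "jobs", "1"))
}
```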

* Move parallel workspace listing to the exporter implementation

* Fix test

* Another fix

* Another attempt to fix tests

* Small adjustments

* Control the submissions to the default channel to avoid deadlock
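One way to bound submissions to a shared channel is a counting semaphore: producers acquire a slot before sending, so a full channel throttles them instead of wedging every handler goroutine at once. This is a sketch of the idea only, with hypothetical names - not the exporter's actual mechanism:

```go
package main

import "fmt"

// submitter caps how many items may be queued but not yet drained on the
// shared "default" channel.
type submitter struct {
	sem chan struct{} // counting semaphore: one slot per in-flight item
	out chan string   // the shared channel itself
}

func newSubmitter(limit, buffer int) *submitter {
	return &submitter{
		sem: make(chan struct{}, limit),
		out: make(chan string, buffer),
	}
}

// submit blocks once `limit` items are in flight.
func (s *submitter) submit(item string) {
	s.sem <- struct{}{}
	s.out <- item
}

// drain consumes one item and releases its semaphore slot.
func (s *submitter) drain() string {
	item := <-s.out
	<-s.sem
	return item
}

func main() {
	s := newSubmitter(2, 2)
	s.submit("databricks_job.a")
	fmt.Println(s.drain())
}
```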
alexott authored Jan 31, 2024
1 parent 8fca518 commit 510fd60
Showing 12 changed files with 999 additions and 581 deletions.
6 changes: 5 additions & 1 deletion docs/guides/experimental-exporter.md
@@ -50,6 +50,8 @@ All arguments are optional, and they tune what code is being generated.
 * `-updated-since` - timestamp (in ISO8601 format supported by Go language) for exporting of resources modified since a given timestamp. I.e., `2023-07-24T00:00:00Z`. If not specified, the exporter will try to load the last run timestamp from the `exporter-run-stats.json` file generated during the export and use it.
 * `-notebooksFormat` - optional format for exporting of notebooks. Supported values are `SOURCE` (default), `DBC`, `JUPYTER`. This option could be used to export notebooks with embedded dashboards.
 * `-noformat` - optionally turn off the execution of `terraform fmt` on the exported files (enabled by default).
+* `-debug` - turn on debug output.
+* `-trace` - turn on trace output (includes debug level as well).

## Services

@@ -92,7 +94,9 @@ To speed up export, Terraform Exporter performs many operations, such as listing
 
 * `EXPORTER_WS_LIST_PARALLELISM` (default: `5`) controls how many Goroutines are used to perform parallel listing of Databricks Workspace objects (notebooks, directories, workspace files, ...).
 * `EXPORTER_DIRECTORIES_CHANNEL_SIZE` (default: `100000`) controls the channel's capacity when listing workspace objects. Please ensure that this value is big enough (greater than the number of directories in the workspace; default value should be ok for most cases); otherwise, there is a chance of deadlock.
-* `EXPORTER_PARALLELISM_NNN` - number of Goroutines used to process resources of a specific type (replace `NNN` with the exact resource name, for example, `EXPORTER_PARALLELISM_databricks_notebook=10` sets the number of Goroutines for `databricks_notebook` resource to `10`). Defaults for some resources are defined by the `goroutinesNumber` map in `exporter/context.go` or equal to `2` if there is no value. *Don't increase default values too much to avoid REST API throttling!*
+* `EXPORTER_DEDICATED_RESOUSE_CHANNELS` - by default, only specific resources (`databricks_user`, `databricks_service_principal`, `databricks_group`) have dedicated channels - the rest are handled by the shared channel. This is done to prevent throttling by specific APIs. You can override this by providing a comma-separated list of resources as this environment variable.
+* `EXPORTER_PARALLELISM_NNN` - number of Goroutines used to process resources of a specific type (replace `NNN` with the exact resource name, for example, `EXPORTER_PARALLELISM_databricks_notebook=10` sets the number of Goroutines for `databricks_notebook` resource to `10`). There is a shared channel (with name `default`) for handling of resources for which there are no dedicated channels - use `EXPORTER_PARALLELISM_default` to increase its size (default size is `15`). Defaults for some resources are defined by the `goroutinesNumber` map in `exporter/context.go` or equal to `2` if there is no value. *Don't increase default values too much to avoid REST API throttling!*
+* `EXPORTER_DEFAULT_HANDLER_CHANNEL_SIZE` - the size of the shared channel (default: `200000`) - you may need to increase it if you have a huge workspace.


## Support Matrix
11 changes: 7 additions & 4 deletions exporter/command.go
@@ -98,7 +98,7 @@ func Run(args ...string) error {
 	if err != nil {
 		return err
 	}
-	var skipInteractive bool
+	var skipInteractive, trace, debug bool
 	flags.BoolVar(&skipInteractive, "skip-interactive", false, "Skip interactive mode")
 	flags.BoolVar(&ic.includeUserDomains, "includeUserDomains", false, "Include domain portion in `databricks_user` resource name")
 	flags.BoolVar(&ic.importAllUsers, "importAllUsers", false,
@@ -113,7 +113,8 @@ func Run(args ...string) error {
 	flags.BoolVar(&ic.noFormat, "noformat", false, "Don't run `terraform fmt` on exported files")
 	flags.StringVar(&ic.updatedSinceStr, "updated-since", "",
 		"Include only resources updated since a given timestamp (in ISO8601 format, i.e. 2023-07-01T00:00:00Z)")
-	flags.BoolVar(&ic.debug, "debug", false, "Print extra debug information.")
+	flags.BoolVar(&debug, "debug", false, "Print extra debug information.")
+	flags.BoolVar(&trace, "trace", false, "Print full debug information.")
 	flags.BoolVar(&ic.mounts, "mounts", false, "List DBFS mount points.")
 	flags.BoolVar(&ic.generateDeclaration, "generateProviderDeclaration", true,
 		"Generate Databricks provider declaration.")
@@ -146,9 +147,11 @@ func Run(args ...string) error {
 	if len(prefix) > 0 {
 		ic.prefix = prefix + "_"
 	}
-	if ic.debug {
+	if trace {
+		logLevel = append(logLevel, "[DEBUG]", "[TRACE]")
+	} else if debug {
 		logLevel = append(logLevel, "[DEBUG]")
 	}
 	ic.services = strings.Split(configuredServices, ",")
 	ic.enableServices(configuredServices)
 	return ic.Run()
 }
