Cluster node initialization scripts
An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM starts.
Databricks recommends managing all init scripts as cluster-scoped init scripts stored in workspace files.
Some examples of tasks performed by init scripts include:
Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example: /databricks/python/bin/pip install <package-name>.
Modify the JVM system classpath in special cases.
Set system properties and environment variables used by the JVM.
Modify Spark configuration parameters, as sketched in the example after this list.
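A minimal sketch of the last two tasks: an init script that sets a JVM system property and modifies a Spark configuration parameter by writing a driver configuration file. The file name, the [driver] block layout, and the values shown are assumptions based on a common pattern, not required names:

#!/bin/bash
# Minimal sketch, not a definitive recipe. The file name and the values
# below are illustrative.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
  # Modify a Spark configuration parameter (illustrative value).
  "spark.sql.shuffle.partitions" = "64"
  # Set a JVM system property through extra driver JVM options. Note that
  # this overrides, rather than appends to, any existing value.
  "spark.driver.extraJavaOptions" = "-Dmy.app.flag=true"
}
EOF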
Init script types
Databricks supports two kinds of init scripts: cluster-scoped and global.
Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init script.
Global: run on every cluster in the workspace. They can help you to enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts. Only admin users can create global init scripts.
Note
To manage global init scripts in the current release, you must use the Global Init Scripts API 2.0.
Whenever you change any type of init script, you must restart all clusters affected by the script.
Important
Databricks recommends using only cluster-scoped init scripts.
Environment variables
Cluster-scoped and global init scripts support the following environment variables:
DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API 2.0.
DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script is run inside this container. See SparkNode.
DB_IS_DRIVER: whether the script is running on a driver node.
DB_DRIVER_IP: the IP address of the driver node.
DB_INSTANCE_TYPE: the instance type of the host VM.
DB_CLUSTER_NAME: the name of the cluster the script is executing on.
DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.
For example, if you want to run part of a script only on a driver node, you could write a script like:
echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "Running on the driver node"
else
  echo "Running on a worker node"
fi
You can also configure custom environment variables for a cluster and reference those variables in init scripts.
Use secrets in environment variables
You can use any valid variable name when you reference a secret. Access to secrets referenced in environment variables is determined by the permissions of the user who configured the cluster. Secrets stored in environment variables are accessible to all users of the cluster, but are redacted from plaintext display in the same way as secrets referenced elsewhere.
For more details, see Reference a secret in an environment variable.
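For example, a cluster's environment variable configuration might contain an entry like the following (the variable name, scope, and key are placeholders), which an init script can then read like any ordinary environment variable:

# In the cluster's environment variable settings (illustrative names):
#   MY_SERVICE_TOKEN={{secrets/my-scope/my-key}}
#
# In an init script, reference the secret like any other variable.
# The URL below is a placeholder.
curl --fail -H "Authorization: Bearer $MY_SERVICE_TOKEN" "https://example.com/api/health"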
Cluster-scoped init scripts
Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs.
You can configure cluster-scoped init scripts using the UI, the CLI, or the Clusters API. This section focuses on performing these tasks using the UI. For the other methods, see Databricks CLI and Clusters API 2.0.
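For reference, here is a hedged sketch of attaching a workspace-file init script through the Clusters API using the Databricks CLI. The cluster ID and script path are placeholders, the rest of the required cluster specification is omitted, and the exact flags may differ between CLI versions:

# Sketch only: edit a cluster so it runs a workspace-file init script.
# Replace <cluster-id> and the path; other required cluster fields
# (spark_version, node_type_id, and so on) are omitted for brevity.
databricks clusters edit --json '{
  "cluster_id": "<cluster-id>",
  "init_scripts": [
    { "workspace": { "destination": "/Users/user1@m.eheci.com/my-init.sh" } }
  ]
}'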
You can add any number of scripts, and the scripts are executed sequentially in the order provided.
If a cluster-scoped init script returns a non-zero exit code, the cluster launch fails. You can troubleshoot cluster-scoped init scripts by configuring cluster log delivery and examining the init script log.
Configure a cluster-scoped init script using the UI
This section contains instructions for configuring a cluster to run an init script using the Databricks UI.
Databricks recommends storing all cluster-scoped init scripts in workspace files. See Store init scripts in workspace files.
Warning
The DBFS option in the UI exists to support legacy workloads and is not recommended. All init scripts stored in DBFS should be migrated to workspace files.
Important
The script must exist at the configured location. If the script doesn't exist, the cluster will fail to start or to scale up.
The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure message will appear in the cluster log.
To use the UI to configure a cluster to run an init script, complete the following steps:
On the cluster configuration page, click the Advanced Options toggle.
At the bottom of the page, click the Init Scripts tab.
In the Destination drop-down, select the Workspace destination type.
Specify a path to the init script.
Click Add.
Note
Each user has a home directory configured under the /Users directory in the workspace. If a user with the name user1@m.eheci.com stored an init script called my-init.sh in their home directory, the configured path would be /Users/user1@m.eheci.com/my-init.sh.
To remove a script from the cluster configuration, click the delete icon at the right of the script. When you confirm the delete, you will be prompted to restart the cluster. Optionally, you can delete the script file from the location to which you uploaded it.
Example: Use conda to install Python libraries
In Databricks Runtime 9.0 and above, you cannot use conda to install Python libraries. For instructions on how to install Python packages on a cluster, see Libraries.
Important
Anaconda Inc. updated their terms of service for anaconda.org channels in September 2020. Based on the new terms of service, you may require a commercial license if you rely on Anaconda's packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
As a result of this change, Databricks has removed the default channel configuration for the Conda package manager. This is a breaking change. You must update the usage of conda commands in init scripts to specify a channel using -c. If you do not specify a channel, conda commands will fail with PackagesNotFoundError.
In Databricks Runtime 8.4 ML and below, you use the Conda package manager to install Python packages. To install a Python library at cluster initialization, you can use a script like the following:
#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -c conda-forge -y astropy
Global init scripts
A global init script runs on every cluster created in your workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security screens. Only admins can create global init scripts. You can create them using either the UI or the REST API.
Important
Databricks does not recommend using global init scripts. If you choose to use global init scripts, consider potential impacts such as the following:
It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.
You can troubleshoot global init scripts by configuring cluster log delivery and examining the init script log.
Add a global init script using the UI
To configure global init scripts using the admin settings:
Go to the admin settings and click the Global Init Scripts tab.
Click + Add.
Name the script and enter it by typing, pasting, or dragging a text file into the Script field.
Note
The init script cannot be larger than 64KB. If a script exceeds that size, an error message appears when you try to save.
If you have more than one global init script configured for your workspace, set the order in which the new script will run.
If you want the script to be enabled for all new and restarted clusters after you save, toggle Enabled.
Important
When you add a global init script or make changes to the name, run order, or enablement of init scripts, those changes do not take effect until clusters are restarted.
Click Add.
Add a global init script using Terraform
You can add a global init script by using the Databricks Terraform provider and databricks_global_init_script.
Edit a global init script using the UI
Go to the admin settings and click the Global Init Scripts tab.
Click a script.
Edit the script.
Click Confirm.
Configure a global init script using the API
Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the Global Init Scripts API 2.0.
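As a sketch, assuming the workspace URL and a personal access token are available in $DATABRICKS_HOST and $DATABRICKS_TOKEN, an admin could create a global init script like this (the name, position, and script file are placeholders):

# Sketch only: create a global init script with the Global Init Scripts
# API 2.0. The script body must be base64-encoded ('-w0' is GNU base64;
# omit it on macOS).
curl -X POST "$DATABRICKS_HOST/api/2.0/global-init-scripts" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "name": "my-global-init",
    "script": "'"$(base64 -w0 my-init.sh)"'",
    "enabled": false,
    "position": 0
  }'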
Init script logging
Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global init script create, edit, and delete events are also captured in account-level audit logs.
Init script events
Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED, indicating which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also captures execution duration.
Cluster-scoped init scripts are indicated by the key "cluster".
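To retrieve these events programmatically, one option is the events endpoint of the Clusters API; a hedged sketch, with the cluster ID as a placeholder:

# Sketch only: fetch init script events for a cluster through the
# Clusters API 2.0 events endpoint. <cluster-id> is a placeholder.
curl -X POST "$DATABRICKS_HOST/api/2.0/clusters/events" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "cluster_id": "<cluster-id>",
    "event_types": ["INIT_SCRIPTS_STARTED", "INIT_SCRIPTS_FINISHED"]
  }'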
Note
Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.
Init script logs
If cluster log delivery is configured for a cluster, the init script logs are written to /<cluster-log-path>/<cluster-id>/init_scripts. Logs for each container in the cluster are written to a subdirectory called init_scripts/<cluster_id>_<container_ip>. For example, if cluster-log-path is set to cluster-logs, the path to the logs for a specific container would be: dbfs:/cluster-logs/<cluster-id>/init_scripts/<cluster_id>_<container_ip>.
If the cluster is configured to write logs to DBFS, you can view the logs using the File system utility (dbutils.fs).
Every time a cluster launches, it writes a log to the init script log folder.
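Besides dbutils.fs in a notebook, one way to browse these logs is the Databricks CLI; a sketch with placeholder path segments:

# Sketch only: list init script logs delivered to DBFS. The angle-bracket
# segments are placeholders for your cluster's values.
databricks fs ls "dbfs:/cluster-logs/<cluster-id>/init_scripts/"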
Important
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.