
Configure Databricks as a target

note

Databricks is currently available as a preview feature and under development. If you use Databricks as a target metastore with Data Migrator and have feedback to share, contact WANdisco. The feature is automatically enabled. See Preview features.

Configure a Databricks metadata agent#

Use Data Migrator to integrate with Databricks and migrate structured data from Hadoop to Databricks tables, including automatic conversion from source Hive formats to the Delta Lake format used in Databricks.

Configure a Databricks metadata agent in Data Migrator using the UI or CLI, and connect it to your Databricks cluster.

Prerequisites#

To ensure a successful migration to Databricks, the source tables must be in one of the following formats:

  • CSV
  • JSON
  • AVRO
  • ORC
  • PARQUET
  • Text

Ensure you have the following before you start:

  • A filesystem mounted onto DBFS

    The filesystem that contains the data you want to migrate must be mounted onto the Databricks File System (DBFS). For example, use a script like the following to mount an ADLS Gen2 storage account.

Example: Script to mount ADLS Gen2 storage account onto DBFS folder#

%python
storageAccountName = "accountName"
storageAccountAccessKey = "accessKey"
blobContainerName = "containerName"
mountPoint = "/path/to/mount/onto"

configs = {
    "fs.azure.account.auth.type." + storageAccountName + ".blob.core.windows.net": "SharedKey",
    "fs.azure.account.key." + storageAccountName + ".blob.core.windows.net": storageAccountAccessKey
}

# Create the mount point directory and mount the container onto DBFS
dbutils.fs.mkdirs(mountPoint)
dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = mountPoint,
    extra_configs = configs
)
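To confirm the mount succeeded, you can list the new mount point from the same notebook (a minimal check using the standard dbutils API):

%python
# Assumption: run in the same notebook where mountPoint is defined
display(dbutils.fs.ls(mountPoint))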
  • JDBC connection

    To enable the JDBC connection:

    1. Download the Databricks JDBC driver.

      note

      If you download version 2.6.25 of the Databricks JDBC driver, you'll get an error message stating that you haven't downloaded a driver. This is because version 2.6.25 isn't yet compatible with Hive Migrator. Download version 2.6.22 to create your agent.

    2. Unzip the package and upload the SparkJDBC42.jar file to the Data Migrator host machine (see the example command sequence after these steps).

    3. Move the SparkJDBC42.jar file to the Data Migrator directory:

      /opt/wandisco/hivemigrator/agent/databricks
    4. Change ownership of the jar file to the Hive Migrator system user and group:

      chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/SparkJDBC42.jar
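
For reference, steps 2 and 3 might look like the following on the Data Migrator host. This is a sketch: the archive name is an assumption and varies by driver version.

# Archive name is an assumption; substitute the file you downloaded
unzip SimbaSparkJDBC42-*.zip
mv SparkJDBC42.jar /opt/wandisco/hivemigrator/agent/databricks/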

Recommended JDBC property settings#

Occasionally, migrations may fail due to brief network disruption or heavy data processing. To reduce the chance of failures, set the following timeout properties in the JDBC connection string:

networkTimeout
  Description: The number of milliseconds to wait for a response when interacting with the Databricks service before returning an error. 0 (zero) specifies that no network timeout is set.
  Default value: 0
  Recommended value: 600000
  Reason: A 10-minute (600000 ms) network timeout ensures that brief network disruption doesn't cause migrations to fail, and that migrations don't get stuck indefinitely.

queryTimeout
  Description: The number of seconds to wait for a query to complete before returning an error. 0 (zero) specifies that the driver should wait indefinitely.
  Default value: 0
  Recommended value: 0
  Reason: Keeping the default value of zero (0) prevents long-running queries from timing out.

Example: JDBC connection string#

Driver=<path-to-driver>;Host=<server-hostname>;Port=443;HTTPPath=<http-path>;ThriftTransport=2;SSL=1;AuthMech=3;UID=token;PWD=<personal-access-token>
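
For example, the same connection string with the recommended timeout values applied (assuming both properties are appended as additional key=value pairs, like the other settings in the string):

Driver=<path-to-driver>;Host=<server-hostname>;Port=443;HTTPPath=<http-path>;ThriftTransport=2;SSL=1;AuthMech=3;UID=token;PWD=<personal-access-token>;networkTimeout=600000;queryTimeout=0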

Configure Databricks as a target with the UI#

To add Databricks from your Dashboard:

  1. Select Connect To Metastore.

  2. Select the filesystem.

  3. Select Databricks as the Metastore Type.

  4. Enter a Display Name.

  5. Enter the JDBC Server Hostname, Port, and HTTP Path.

  6. Enter the Databricks Access Token.

  7. Select Convert to Delta Lake if you want to convert your tables.

  8. Enter the Filesystem Mount Point.
    The filesystem that contains the data you want to migrate must be mounted onto your DBFS.
    Enter the mounted container's path on the DBFS.

  9. (Optional) Enter another path for Default Filesystem Override.

    1. If you select Convert to Delta Lake, enter the location on the DBFS to store the tables converted to Delta Lake. To store Delta Lake tables on cloud storage, enter the path to the mount point and the path on the cloud storage.

      Example: Location on the DBFS to store tables converted to Delta Lake
      dbfs:<location>
      Example: Cloud storage location
      dbfs:/mnt/adls2/storage_account/
    2. If you don't select Convert to Delta Lake, enter the mount point.

      Example: Filesystem mount point
      dbfs:<value of Fs mount point>
  10. Select Save.
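
If you prefer the CLI, you can create the same configuration with the hive agent add databricks command. Treat the flag names below as assumptions modeled on the UI fields above, and confirm them against the CLI help in your Data Migrator version before running the command.

# Flag names are assumptions based on the UI fields; verify in your CLI
hive agent add databricks --name databricksAgent --jdbc-server-hostname <server-hostname> --jdbc-port 443 --jdbc-http-path <http-path> --access-token <personal-access-token> --fs-mount-point <filesystem-mount-point> --convert-to-delta --default-fs-override dbfs:<location>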

Next steps#

  1. Create a metadata migration using the Databricks agent you just configured.

  2. Monitor the following from the Dashboard:

    • The progress of the migration.
    • The status of the migration.
    • The health of your agent connection. To view the connection status:
      • Select Status from the left side navigation bar and select View agent.
      • Go to the Overview page under Metastore Agents.
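
After a migration completes, you can verify the results from a Databricks notebook. A minimal sketch, assuming a target database named default and a migrated table named example_table (substitute your own names):

%python
# List the tables migrated into the target database
spark.sql("SHOW TABLES IN default").show()

# For a table converted to Delta Lake, inspect its format and location
spark.sql("DESCRIBE DETAIL default.example_table").show(truncate=False)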