WANDISCO FUSION® PLUGIN FOR DATABRICKS DELTA LAKE

1. Welcome

1.1. Product overview

The Fusion Plugin for Databricks Delta Lake is used with WANdisco Fusion to provide continuous replication from on-premises Hadoop analytics to Spark-based cloud analytics with zero downtime and zero data loss. Take advantage of this together with the Fusion Plugin for Live Hive to implement a LiveAnalytics solution.

Figure 1. LiveAnalytics

Use LiveAnalytics to migrate at scale from continuously operating on-premises Hadoop environments to Databricks, or to operate a hybrid analytics solution. As changes occur on-premises, the Fusion Plugin for Databricks Delta Lake keeps your cloud data and metadata consistent with those changes, automating the critical data replication needed for you to adopt a modern, cloud-based analytics platform.

Create Hive tables in Hadoop to make replicas of those tables available in Databricks. Ingest data to Hive tables and access the same information as Delta Lake content in a Databricks environment. Modify Hive schemas and see the same structure reflected in changes to matching Delta Lake tables.

The Fusion Plugin for Databricks Delta Lake works in conjunction with WANdisco Fusion and the Fusion Plugin for Live Hive to deliver WANdisco’s LiveAnalytics solution. LiveAnalytics lets you:

  • Automate analytics data migration without disrupting your on-premises applications, and without losing data. Replication is continuous, controlled by rules that you define.

  • Ensure the reliability and consistency of your data in the cloud by supporting unified analytics that spans your on-premises and cloud environments.

  • Replicate changes made to Hive content and metadata on-premises immediately as equivalent changes in your cloud environment.

1.2. Documentation guide

This guide explains how to install, configure and operate the Fusion Plugin for Databricks Delta Lake, and contains the following sections:

Welcome

The welcome chapter introduces this user guide and provides help with how to use it.

Release Notes

Details the latest software release, covering new features, fixes and known issues to be aware of.

Concepts

Explains how the Plugin for Databricks Delta Lake uses WANdisco’s LiveData platform, and how to use it with Hadoop, Hive, cloud storage systems and Databricks.

Installation

Covers the steps required to install and set up the Plugin for Databricks Delta Lake into a WANdisco Fusion deployment.

Operation

How to operate, configure and troubleshoot the Plugin for Databricks Delta Lake.

Reference

Additional Fusion Plugin for Databricks Delta Lake documentation, including documentation for the available REST API.

1.2.1. Symbols in the documentation

In this guide we highlight specific types of information using the following callouts:

The alert symbol highlights important information.
The STOP symbol cautions you against doing something.
Tips are principles or practices that you’ll benefit from knowing or using.
The KB symbol shows where you can find more information, such as in our online Knowledge base.

1.3. How to contact WANdisco

See our online Knowledge base which contains updates and more information.

If you need more help, raise a case on our support website.

If you find an error in this documentation or if you think some information needs improving, raise a case on our support website or email docs@wandisco.com.

2. Release Notes

2.1. Fusion Plugin for Databricks Delta Lake 4.0.1 Build 1565

The WANdisco Fusion Plugin for Databricks Delta Lake completes the product offerings in WANdisco’s LiveAnalytics solution for migration of on-premises Hadoop analytic datasets to the cloud.

Use the Fusion Plugin for Databricks Delta Lake with WANdisco Fusion for continuous replication from on-premises Hadoop analytics to Spark-based cloud analytics with zero downtime and zero data loss. Apply it with the Fusion Plugin for Live Hive and LiveMigrator to implement a LiveAnalytics solution.

For the release notes and information on known issues, please visit the Knowledge base - Fusion Plugin for Databricks Delta Lake 4.0.1 Build 1565.

3. Concepts

3.1. Use Cases

Implement a LiveAnalytics solution using the Fusion Plugin for Databricks Delta Lake. Typical use cases for LiveAnalytics include:

Hadoop and Hive migration to Databricks

Bring your on-premises Apache Hive content to Databricks as Delta Lake tables, so that you can take advantage of the unique capabilities of a unified analytics platform without disrupting business operations.

LiveAnalytics’ continuous, consistent, automated data replication ensures migrated data sets are available for analytical processing immediately. With minimal disruption when migrating between Hadoop and non-Hadoop environments, LiveAnalytics enables faster adoption of Machine Learning and AI capabilities in the cloud. LiveAnalytics keeps your data accurate and consistent across all your business application environments, regardless of geographic location, data platform architecture, or cloud storage provider.

Hybrid Analytics

Take advantage of the unique capabilities that span your on-premises and cloud environments without compromising on data accuracy or availability. Make your on-premises Hive data available for processing at local speed in Databricks, even while it continues to undergo change.

3.2. Terminology

Familiarize yourself with the following concepts for the product and the environment in which it operates; they will help you understand how to use the Fusion Plugin for Databricks Delta Lake.

Apache Hive

Hive is a data warehousing technology for Apache Hadoop. It supports applications that want to use data residing in a Hadoop cluster in a structured manner, allowing ad-hoc querying, summarization and other data analysis tasks to be performed using high-level constructs, including Apache Hive SQL queries.

Databricks

The Databricks Unified Analytics platform accelerates data innovation by unifying data science, engineering and business. It is a cloud-based service available in AWS or Azure, providing a workspace through which users interact with data objects, computational resources, notebooks, libraries, and experiments.

Hive Metadata

The operation of Hive depends on the definition of metadata that describes the structure of data residing in a Hadoop cluster. Hive organizes that metadata into structures of its own, including definitions of Databases, Tables, Partitions, and Buckets.

Delta Lake

Delta Lake is an open source storage layer used by Databricks that provides ACID transactions, scalable metadata handling, and unification of batch and stream data processing.

Apache Hive Type System

Hive defines primitive and complex data types that can be assigned to data as part of the Hive metadata definitions. These are primitive types such as TINYINT, BIGINT, BOOLEAN, STRING, VARCHAR, TIMESTAMP, etc. and complex types like Structs, Maps, and Arrays.

Apache Hive Metastore

The Apache Hive Metastore is a stateless service in a Hadoop cluster that presents an interface for applications to access Hive metadata. It can be deployed in a variety of configurations to suit different requirements. In every case, it provides a common interface for applications to use Hive metadata.

The Hive Metastore is usually deployed as a standalone service, exposing an Apache Thrift interface by which client applications interact with it to create, modify, use and delete Hive metadata in the form of databases, tables, etc. It can also be run in embedded mode, where the Metastore implementation is co-located with the application making use of it.

WANdisco Fusion Live Hive Proxy

The Live Hive Proxy is a WANdisco service that is part of the Fusion Plugin for Live Hive, acting as a proxy for applications that use a standalone Hive Metastore. The service coordinates actions performed against the Metastore with actions in environments (including Databricks Delta Lake) to which associated Hive metadata are replicated.

Hive Client Applications

Client applications that use Apache Hive interact with the Hive Metastore, either directly (using its Thrift interface), or indirectly via another client application such as Beeline or HiveServer2.

HiveServer2

HiveServer2 is a service that exposes a JDBC interface for applications that want to use it for accessing Hive. This could include standard analytic tools and visualization technologies, or the Hive-specific CLI called Beeline.

Hive applications determine how to contact the Hive Metastore using the Hadoop configuration property hive.metastore.uris.

Hive pattern rules

A simple syntax used by Hive for matching database objects.
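
For illustration, Hive patterns use '*' to match any sequence of characters and '|' to separate alternative patterns. A minimal sketch of the syntax, using beeline against a hypothetical HiveServer2 endpoint (hostname is a placeholder; assumes the default HiveServer2 port 10000):

Matching databases with a Hive pattern
# Placeholder host and default port; the pattern matches databases starting with sales or marketing.
beeline -u 'jdbc:hive2://<hiveserver2.hostname>:10000' \
  -e "SHOW DATABASES LIKE 'sales*|marketing*';"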

3.3. Deployment Architecture

Use the Plugin for Databricks Delta Lake along with the Fusion Plugin for Live Hive to replicate changes made in Apache Hive to Delta Lake tables accessible from a Databricks environment.

Figure 2. Deployment Architecture

A deployment will consist of two Zones:

Zone 1

This represents the source environment, where your Apache Hive content and metadata reside. Your table content will reside in the cluster storage (typically HDFS), and your Hive metadata are managed by and maintained in a Hive Metastore. An operational deployment of a LiveAnalytics solution will include:

  • WANdisco Fusion

  • Fusion Plugin for Live Hive

Zone 2

This is the target environment, where your Databricks instance is available. Hive content from Zone 1 will be replicated to cloud storage (e.g. Azure Data Lake Storage Gen2) and transformed to the format used by Delta Lake. Metadata changes made to Hive tables in Zone 1 will be replicated to equivalent changes to Databricks Delta Lake tables. An operational deployment of the solution will include:

  • WANdisco Fusion

  • Fusion Plugin for Databricks Delta Lake

3.4. Supported Functionality

3.4.1. Initial Data Migration

Migrate Hive metadata and data that existed in the source zone before the Plugin for Databricks Delta Lake was introduced, making them available as Databricks Delta Lake tables in the target zone. Take advantage of WANdisco Fusion functionality, including LiveMigrator, to initiate migration and transition to LiveAnalytics for ongoing replication of your analytic information to Databricks Delta Lake.

Figure 3. Using LiveMigrator for Initial Migration

3.4.2. Selective Replication

Choose specific content, databases and tables to replicate from Hive to Databricks using a convenient pattern definition to match databases and tables by name.

Figure 4. Replication Rule Hive Selection

3.4.3. Hive File Formats

Apache Hive allows different file formats to be used for Hive tables. The Fusion Plugin for Databricks Delta Lake supports replication from Hive tables that use a subset of these formats. Supported source table formats for this release are:

Optimized Row Columnar (ORC)

The Hive ORC file format is typically used in HDP clusters. ORC format is specified in Hive using statements such as:

  • CREATE TABLE …​ STORED AS ORC

  • ALTER TABLE …​ [PARTITION partition_spec] SET FILEFORMAT ORC

  • SET hive.default.fileformat=ORC

Parquet

The Hive Parquet file format is an ecosystem-wide columnar file format for Hadoop, normally used in CDH clusters. Parquet format is specified in Hive using statements such as:

  • CREATE TABLE …​ STORED AS PARQUET

  • ALTER TABLE …​ [PARTITION partition_spec] SET FILEFORMAT PARQUET

  • SET hive.default.fileformat=Parquet

Other Formats

Hive file formats that are not yet supported for source tables include:

  • Text File

  • SequenceFile

  • RCFile

  • Avro Files

  • Custom INPUTFORMAT and OUTPUTFORMAT

The Plugin for Databricks Delta Lake automatically determines the format of a source table prior to replication.

3.4.4. WANdisco LiveData Features

WANdisco Fusion provides a LiveData platform that offers continuous replication of changes to Hive content, making it available as Delta Lake tables in Databricks without the need to schedule transfer.

3.4.5. Hadoop and Object Storage

Work across a variety of big data source and target environments, including major Hadoop and object storage technologies.

3.4.6. Broad Hive support

The vast majority of Hive table types can be replicated without change. Take advantage of Hive features such as partitioning, managed tables, optimized table formats, etc. without needing to adjust your migration strategy.

3.4.7. Automation

The WANdisco Fusion platform automatically responds to service and network disruption without the need for administrative interaction.

3.4.8. Selective Replication

Take advantage of powerful tools to select subsets of your Hive data warehouse for replication to Databricks Delta Lake. Define policies that control which data sets are replicated between environments, and selectively exclude data from migration.

3.4.9. Scale

LiveAnalytics operates equally effectively whether you have one terabyte or many petabytes of data. Scale effectively without introducing overhead, and operate your environments as you need to while leveraging the unique capabilities of Databricks against data that was previously locked up in Hadoop.

4. Installation

A full deployment for LiveAnalytics includes a source zone and a target zone, as described in Deployment Architecture. Follow these installation details for the target environment, and refer to the Installation Guide for the Fusion Plugin for Live Hive for details for the source environment.

Installation is a three-step process that includes:

  1. Installing all pre-requisite components

  2. Executing the command-line installer for the Plugin for Databricks Delta Lake

  3. Activating your environment

4.1. Pre-requisites

Ensure that the following pre-requisites are in place before you install and configure the Plugin for Databricks Delta Lake.

4.1.1. Setup

Begin installation by obtaining the installer from customer.wandisco.com. Your download location will be provided by WANdisco after you purchase the necessary license for access to and use of the software.

4.1.2. System Requirements

Along with the standard product requirements for WANdisco Fusion, you need to ensure that your environment meets the following system requirements:

Fusion Server Host

While the Plugin for Databricks Delta Lake imposes minimal overheads on the system requirements for the Fusion server, you should confirm the availability of:

  • An additional 25 GB of disk space on the mount point that holds the Fusion server log file, /var/log/fusion/server/fusion-server.log. The Plugin for Databricks Delta Lake logs additional information that requires extra storage space.

  • An additional 2 GB of RAM for the operation of the plugin. Additional memory is required for the caching performed by the plugin to maintain information about the databases and tables under replication. If you intend to replicate a large number of unique Hive tables (more than 100), please contact WANdisco support for assistance in meeting system requirements.

Databricks environment

The Fusion Plugin for Databricks Delta Lake replicates changes made to matching Hive content and metadata on a continuous basis. The rate at which operations are performed against Hive content governed by a Hive replication rule therefore contributes additional load in the Databricks environment. While the Plugin for Databricks Delta Lake manages that load and accounts for failed Spark job submissions, you should monitor the operation of your Databricks environment to ensure that the available cluster resources are sufficient to accommodate this added load.

4.2. Command-line Installation

Install the Fusion Plugin for Databricks Delta Lake to a Fusion server that is configured in a zone associated with ADLS Gen2 storage. You should have previously deployed and configured this Fusion instance with suitable credentials and information, and confirmed that file system replication results in the successful availability of content in that ADLS Gen2 file system.

Prior to installation, ensure that you have a Databricks cluster available and running.

The install process requires that you obtain the following information before running databricks-installer.sh as root, and that you provide the details to the installer using the associated configuration controls:

Table 1. Installation Configuration Controls
Control Description

set-account

Name of the ADLS Gen2 storage account

set-container

Name of the file system container in the storage account

set-account-key

Storage account access key

set-address

Address of the Databricks service

set-bearer

Bearer token for the Databricks cluster

set-cluster

Databricks cluster ID

set-jdbc-http-path

Unique JDBC HTTP path for the Databricks cluster

An example install is:

Sample installation
./databricks-installer.sh \
  set-account=azurestorageaccountname \
  set-container=adls2filesystemname \
  set-account-key=kG5m4i2x74QOZiMUMA16d4LP5D4zmMPf90H6iJ1Iub \
  set-address=eastus2.azuredatabricks.net \
  set-bearer=dapiecd87ed981997a3d6efda572a7ebb348 \
  set-cluster=0815-207120-pups123 \
  set-jdbc-http-path=sql/protocolv1/o/6971298374654602/0815-207120-pups123


    ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
   :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
  ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
 ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
  ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
   :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
    ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####


You are about to install WANdisco Databricks Delta Lake version <version>

Do you want to continue with the installation? (Y/n)
  databricks-fusion-core-plugin-<version>-xxxx.noarch.rpm ... Done
  databricks-ui-server-<version>-dist.tar.gz ... Done

Uploading datatransformer.jar to the dbfs ...
{}
Installing library ...
{}
Restarting Spark cluster is required as part of plugin activation
Do you wish to restart Spark cluster now? (y/N) y
{}
All requested components installed.

Running additional post install actions...
Restarting fusion-server is required as part of plugin activation
Do you wish to restart fusion-server now? (y/N) y
Stopping WANdisco Fusion Server: fusion-server
Stopped WANdisco Fusion Server process 27356 successfully.
Starting WANdisco Fusion Server: fusion-server
Started WANdisco Fusion Server process successfully.

Complete the installation by following the prompts before you activate the plugin.

4.3. Plugin Activation

The Fusion Plugin for Databricks Delta Lake works when paired with the Fusion Plugin for Live Hive. After installing both plugins, activate your environment for operation by following the activation process for the Fusion Plugin for Live Hive.

4.4. Installer Help

You can provide the help parameter to the installer package for guidance on the options available at install time:

Getting help for the command-line installer
# ./databricks-installer.sh help

Verifying archive integrity... All good.
Uncompressing WANdisco Databricks Delta Lake.............
This usage information describes the options of the embedded installer script.
Further help, if running directly from the installer is available using '--help'.

The following options should be specified without a leading '-' or '--'.
Also note that the component installation control option effects are applied in the order provided.

General options:
  help                             Print this message and exit

Component configuration control:
  set-account=                     Name of the ADLSv2 storage account
  set-container=                   Name of the container in the storage account
  set-account-key=                 Storage account access key
  set-address=                     Address of the Databticks service
  set-bearer=                      Bearer token for the Databricks cluster
  set-cluster=                     Databricks cluster ID
  set-jdbc-http-path=              Unique JDBC HTTP path

Component installation control:
  only-fusion-ui-server-plugin     Only install the plugin's fusion-ui-server component
  only-fusion-server-plugin        Only install the plugin's fusion-server component
  only-upload-etl                  Only upload ETL jar
  skip-fusion-ui-server-plugin     Do not install the plugin's fusion-ui-server component
  skip-fusion-server-plugin        Do not install the plugin's fusion-server component
  skip-upload-etl                  Do not upload ETL jar

Post Install service restart control:
  These options if not set will result in questions in interactive script use.
  restart-fusion-server            Request fusion-server restart
  skip-restart-fusion-server       Skip fusion-server restart

4.5. Configuration

Configure the Plugin for Databricks Delta Lake after installation by using the Databricks Configuration section in the Settings tab. Apply modified configuration properties by adjusting their values and clicking the Update button. Your modifications will only take effect after restarting the Fusion server, which can be performed from the Nodes tab.

PluginSettings1
Figure 5. Plugin Configuration

You can also modify configuration properties by editing them in the /etc/wandisco/fusion/plugins/databricks/databricks.properties file, then restarting the Fusion server; a sample of this file is shown after the table below.

Table 2. Configuration Properties
Property Description

plugin.databricks.azure.storage.account

Name of the ADLS Gen2 storage account

plugin.databricks.azure.storage.container

Name of the container in the storage account used by Fusion replication

plugin.databricks.azure.storage.account.key

Storage account access key

plugin.databricks.azure.address

Address of the Databricks service

plugin.databricks.bearer

Bearer token for the Databricks cluster. In the Databricks UI, go to User Settings and then Generate New Token.

plugin.databricks.cluster

Databricks cluster ID.

plugin.databricks.jdbc.http.path

Unique JDBC HTTP path

plugin.databricks.etl.jobmainclass

com.wandisco.fusion.databricks.etl.DataTransformer

plugin.databricks.etl.location

dbfs:/datatransformer.jar
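
The following sample shows how these properties might appear in databricks.properties. The values are illustrative only, mirroring the installer example earlier in this guide; substitute the details of your own storage account and Databricks cluster.

Sample databricks.properties
# Illustrative values only - replace with your own account, token and cluster details.
plugin.databricks.azure.storage.account=azurestorageaccountname
plugin.databricks.azure.storage.container=adls2filesystemname
plugin.databricks.azure.storage.account.key=kG5m4i2x74QOZiMUMA16d4LP5D4zmMPf90H6iJ1Iub
plugin.databricks.azure.address=eastus2.azuredatabricks.net
plugin.databricks.bearer=dapiecd87ed981997a3d6efda572a7ebb348
plugin.databricks.cluster=0815-207120-pups123
plugin.databricks.jdbc.http.path=sql/protocolv1/o/6971298374654602/0815-207120-pups123
plugin.databricks.etl.jobmainclass=com.wandisco.fusion.databricks.etl.DataTransformer
plugin.databricks.etl.location=dbfs:/datatransformer.jar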

4.6. Validation

Validate that your environment is configured correctly before relying on it for production workloads. You should also review Fusion log information periodically to confirm that external changes do not affect the performance or correctness of WANdisco Fusion or the Fusion Plugin for Databricks Delta Lake.

Validate that your environment is functional by replicating a simple test database, table and content, following these steps in the Hadoop zone:

  1. Create a HCFS replication rule

    Log in to the Fusion user interface for the zone associated with your Hadoop environment, and create a Replication Rule for the /apps/hive/warehouse/delta_lake_test.db location.

  2. Create a Hive replication rule

    Create a Hive Replication Rule for the same database, using:

    Item Value

    Database

    delta_lake_test

    Table name

    *

  3. Create a database

    Log in to the beeline tool as a user with sufficient privileges to create content in the Hive warehouse (e.g. as the hdfs user), connect to Hive, and issue this command:

    create database delta_lake_test;
  4. Create a table

    In the same beeline session, create a test table in the database with this command:

    create table delta_lake_test.test_table (id int, value string);
  5. Create content

    In the same beeline session, generate simple content for the table with this command:

    insert into delta_lake_test.test_table values (1, "Hello, world.");
  6. Validate successful replication

    In your Databricks environment, you should be able to see that the delta_lake_test database exists, containing the test_table Delta Lake table with content from the insert operation. This can be seen from a notebook with the command:

    %sql
    select * from delta_lake_test.test_table

    No replication?
    If your environment does not result in successful replication of Hive content to Databricks and Delta Lake, you may want to look at the Fusion server logs in the Databricks zone for any obvious signs of misconfiguration. Please review the /var/log/fusion/server/fusion-server.log file. Any ERROR conditions (particularly those that include a stack trace) may indicate what configuration changes are required to correct your deployment. If you are unable to determine the cause of your validation issue, please contact WANdisco support.
  7. Remove your test database

    Issue these commands in your beeline session to remove the test database and content from the Hadoop cluster and from the Databricks environment:

    drop table delta_lake_test.test_table;
    drop database delta_lake_test;

    You can then remove the Hive and HCFS replication rules in the Fusion user interface.

5. Operation

Operate the Plugin for Databricks Delta Lake by creating Hive and HCFS replication rules that match the metadata and content of Hive tables and databases to make them available as Delta Lake tables for Databricks. Subsequent changes to Hive content that matches these rules will initiate replication to maintain consistent Databricks databases and tables.

You can then use the full set of Databricks features, with local speed of access, on data that would otherwise be isolated from your cloud environment. Query and analyze your data through collaborative exploration of your largest datasets, building models iteratively, speeding up your data preparation tasks and operating machine learning lifecycles at scale.

Figure 6. Plugin for Databricks Delta Lake Operation

While the majority of Hive functionality is supported by the Plugin for Databricks Delta Lake, some aspects of Hive are not yet supported. These include:
Table formats

The first release of the Plugin for Databricks Delta Lake supports Hive tables in ORC and Parquet format only. Tables using other formats will not replicate to Delta Lake successfully. Additional table formats will be added in future releases of the plugin.

Bucketed tables

The plugin does not yet provide support for replicating the bucketing properties used by Hive tables to their Delta Lake replicas.

5.1. Administration

Administer the Plugin for Databricks Delta Lake by creating and managing replication rules that govern the Hive information that you want to make available as Delta Lake tables in the Databricks runtime. Replicate Hive metadata that defines databases, tables and partitions, changes to that metadata, and associated Hive content.

5.1.1. Creating Replication Rules

Create replication rules in the zone where your Apache Hive system operates. Browse your Hadoop file system to identify content for replication, and specify patterns to match your Hive content.

HCFS Replication

Data for the content of your Hive tables resides in the Hadoop-compatible file system used by your cluster. Create HCFS replication rules to make this content available in the cloud storage accessible to your Databricks runtime. You can create an HCFS replication rule at the level of an individual table (i.e. the location in the HCFS file system where the table content is held), or at some parent directory for this content (e.g. for an entire Hive database, or even the entire Hive warehouse).

Create HCFS replication rules as described in the Create a rule section of the WANdisco Fusion user guide.

Hive Replication

Metadata for Hive constructs resides in the Hive Metastore. Create Hive replication rules to make this metadata available as equivalent, replica Delta Lake tables in your Databricks environment. Each Hive replication rule uses a pattern to match specific Hive tables and databases.

Create Hive replication rules as described in the Create Hive rule section of the Fusion Plugin for Live Hive user guide.

5.1.2. Deleting Replication Rules

You can choose to delete the HCFS and Hive replication rules at any time. Once the rules are removed, further activity against the matching Hive tables will not affect the Databricks environment.

5.2. Initial Hive Table Migration

If you want to migrate existing Hive content from your Hadoop cluster to Databricks as Delta Lake tables, the LiveAnalytics solution provides a comprehensive approach to data and metadata migration. Use your choice of Make Consistent or Live Migrator functionality to populate initial content. WANdisco’s LiveAnalytics solution takes advantage of Live Migrator to ensure that your Hive content is consistent between Hadoop and the cloud storage accessible to Databricks, even if it is undergoing change during migration.

Follow these steps to perform initial Hive table migration to Databricks:

  1. Create replication rules for content

    Create an HCFS replication rule that matches your Hive content, then create a Hive replication rule to match the metadata that you want to make available.

  2. Use Live Migrator

    Use the Live Migrator functionality to migrate Hive table content to cloud storage. You should ensure that your Databricks environment has sufficient permissions to read from the cloud storage endpoint to which Live Migrator is delivering the content.

  3. Migrate Hive Metadata

    Use the Plugin for Databricks Delta Lake REST API to initiate a migration of Hive metadata to Delta Lake. Track the corresponding migration task with the REST API to ensure that it completes successfully; an example of polling the task is shown after these steps. To do this:

    1. On the Live Hive cluster, initiate the migration of the Hive schema for the database:

      curl -X POST -i 'http://<fusion.hostname>:8082/plugins/hive/migrate?ruleId=<Replication Rule ID>&database=<Database Name>'
    2. On the Databricks Fusion node, initiate the migration of the Hive data into Delta Lake:

      curl -X POST -i 'http://<fusion.hostname>:8082/plugins/databricks/migration?dbName=<Database Name>'
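
The migration request returns a URI for a task that tracks the migration. A minimal sketch of polling that task for status, using the standard Fusion task resource described in the Reference Guide (host, port and task ID are placeholders, following the examples above):

Polling the migration task
curl -X GET -i 'http://<fusion.hostname>:8082/fusion/task/<taskid>'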

5.3. Maintenance

The Plugin for Databricks Delta Lake leverages the WANdisco Fusion platform, and inherits its approach to logging, runtime operation and general maintenance requirements. Please consult the WANdisco Fusion user guide for further information.

5.4. Sizing and performance

Observe the performance of your LiveAnalytics deployment over time to ensure that it meets your requirements. General advice on performance characteristics is given below, but you may want to take advantage of WANdisco support if you have a particularly demanding environment.

A properly sized implementation of the Plugin for Databricks Delta Lake will need to take into account two key constraints, each of which is affected by the operational behavior of the environments:

Bandwidth vs Rate of Change

The WANdisco Fusion platform is used by the Plugin for Databricks Delta Lake to replicate file system content, including Hive table content. This typically involves transfer over a network that will have an upper bandwidth capacity (WANdisco Fusion can also enforce additional transfer limits).

If the rate of change to your replicated Hive content exceeds the available bandwidth, your replication performance will be affected. If these conditions persist, it will be impossible to maintain a current replica of Hive content. You will need to either increase available bandwidth, or modify application behavior to limit the rate of change to your Hive content.

In some instances, your environment may benefit from WAN optimization technologies, which can increase the effective bandwidth between zones.

Transformation

Following successful replication of Hive content to cloud storage, the Plugin for Databricks Delta Lake performs a data transformation task to convert the original format of the Hive content into equivalent Delta Lake form. This uses a Databricks cluster (that you configure) to perform transformation.

If you find that there are significant delays between the Hive content becoming available in your cloud storage (as configured in the WANdisco Fusion platform), and the same information being queryable from a Databricks notebook or Spark job, you may need to allocate additional resources to the Databricks cluster to enable it to transform your content more readily. If the Databricks cluster configured for this purpose for the plugin is also used for other jobs, you may want to consider employing a cluster dedicated to this transformation so that it is unaffected by other work.

5.5. Troubleshooting

5.5.1. Check the Release Notes

Updates on known issues, enhancements and new product releases are made available at the WANdisco community site, including product release notes, best practices and other information.

5.5.2. Check Log Files

WANdisco Fusion maintains flexible logging that can be configured to expose minimal or detailed information on product operation. Logging levels can be configured dynamically, allowing you to capture detail when required and minimize overhead when it is not.

Please consult the WANdisco Fusion user guide for details on logging.

6. Reference Guide

6.1. Plugin REST API

Control the Fusion Plugin for Databricks Delta Lake using a REST API that extends the operations available from the Fusion server. Understand the resources and their functionality with the details in this section. Use them to migrate existing Hive content to Delta Lake tables in Databricks and to manage the Spark jobs that the plugin submits during operation.

The Databricks resource is the root resource providing functionality specific to the Plugin for Databricks Delta Lake. Access the following resources under this context, which is at the /plugins/databricks root URI.

6.1.1. Failed Spark Jobs

Resource

/plugins/databricks/failedSparkJobs

GET operation

Gets the request IDs of all failed jobs for the provided database.

  • Query parameter database: The name of the Hive database under consideration

  • Response: A list of the request identifiers associated with any failed jobs for the database

POST operation

Resubmits a previously failed Spark job for execution.

  • Query parameter database: The name of the Hive database under consideration

  • Query parameter requestId: The request ID for a previously failed job
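
For illustration, requests to this resource might look like the following, using the same host and port placeholders as the migration examples earlier in this guide:

Querying and resubmitting failed Spark jobs
# List the request IDs of failed jobs for a database, then resubmit one by its request ID.
curl -X GET -i 'http://<fusion.hostname>:8082/plugins/databricks/failedSparkJobs?database=<Database Name>'
curl -X POST -i 'http://<fusion.hostname>:8082/plugins/databricks/failedSparkJobs?database=<Database Name>&requestId=<Request ID>'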

6.1.2. Failed Spark Job Detail

Resource

/plugins/databricks/failedSparkJobs/detail

GET operation

Gets the details for a failed Spark job.

  • Query parameter database: The name of the Hive database under consideration

  • Query parameter requestId: The request ID for the job

  • Response: Details of the specified job, including:

    • tableName: The name of the table for the job

    • requestId: The original job requestId

    • columns: The columns for the table

    • partitions: The table partitions

    • dataFile: The location of the datafiles associated with the job

    • storageFormat: The Hive table format

    • database: The database for the job
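
For illustration, the details of a failed job might be retrieved as follows (host, port and parameter values are placeholders, as above):

Retrieving the detail of a failed Spark job
curl -X GET -i 'http://<fusion.hostname>:8082/plugins/databricks/failedSparkJobs/detail?database=<Database Name>&requestId=<Request ID>'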

6.1.3. Spark Job Log Management

Resource

/fusionLog

POST operation

Cleans the job and commit records associated with successfully completed Spark jobs.

  • Query parameter database: The name of the Hive database under consideration
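
A sketch of invoking this operation, assuming the resource is addressed on the Fusion server exactly as written above (host and port are placeholders; adjust the path if your deployment serves it under a different prefix):

Cleaning completed Spark job records
curl -X POST -i 'http://<fusion.hostname>:8082/fusionLog?database=<Database Name>'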

6.1.4. Failed Job Submissions

Resource

/plugins/databricks/failedSubmissions

GET operation

Gets a list of details associated with failed job submissions.

  • Response: A list of details for each failed job submission, including:

    • tableName: The name of the table for the job

    • requestId: The original job requestId

    • columns: The columns for the table

    • partitions: The table partitions

    • dataFile: The location of the datafiles associated with the job

    • storageFormat: The Hive table format

    • database: The database for the job

POST operation

Resubmits a previously failed job for execution, removing it from the list of failed submissions if it was submitted successfully.

  • Query parameter requestId: The request ID of the previously failed job submission
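
For illustration (host, port and request ID are placeholders, as above):

Listing and resubmitting failed job submissions
# List all failed submissions, then resubmit one by its request ID.
curl -X GET -i 'http://<fusion.hostname>:8082/plugins/databricks/failedSubmissions'
curl -X POST -i 'http://<fusion.hostname>:8082/plugins/databricks/failedSubmissions?requestId=<Request ID>'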

6.1.5. Migration

Resource

/plugins/databricks/migration

POST operation

Migrates a Hive database, including all its tables, into a Databricks environment as Delta Lake tables. Use this endpoint to bring pre-existing Hive content into the Databricks environment. You must have previously migrated Hive table content to cloud storage using the Make Consistent feature of the WANdisco Fusion platform, or preferably Live Migrator, which will ensure that the data exist in full in your cloud storage even if they continue to be modified in your Hadoop cluster.

  • Query parameter database: The name of the Hive database to migrate

  • Response: A URI that can be queried for the status of the task associated with the migration.

6.1.6. Task Query

This is a standard WANdisco Fusion resource that is useful when querying the status of long-lived operations such as migration.

Resource

/fusion/task/<taskid>

GET operation

Returns details of a specific task.

  • Response: A JSON structure that includes a variety of key/value pairs representing the status of the task.