WANdisco Fusion for parallel operation of ADLS Gen1 and Gen2

25 November 2018

1. Overview

You can use WANdisco Fusion to maintain consistent replicas of the same data across separate ADLS Gen1 and ADLS Gen2 storage. This allows you to follow the “parallel cutover” strategy proposed by Microsoft for upgrade, in which downtime and disruption are minimized and applications can move as required rather than at a single point-in-time cutover. This approach eliminates the need to make ad-hoc copies of data and the risk of divergent data sets. It is particularly useful for large data volumes, or for keeping business-critical applications running while an upgrade is performed.

2. Prerequisites

An Azure subscription.
An Azure Storage account with Data Lake Storage Gen1 enabled.
An Azure Storage account with the Data Lake Storage Gen2 feature enabled.
An Azure HDInsight cluster with Data Lake Storage Gen1 configured as primary storage.
An Azure HDInsight cluster with access to a storage account with Data Lake Storage Gen2 enabled.
Licenses for WANdisco Fusion that accommodate the volume of data that you want to make available to ADLS Gen2.

3. Install WANdisco Fusion for your ADLS Gen1 HDInsight cluster

Follow the instructions provided by WANdisco for the installation and configuration of WANdisco Fusion for your ADLS Gen1 HDInsight cluster. Once installed and configured, applications can access and use ADLS Gen1 storage as before.

4. Install WANdisco Fusion for your ADLS Gen2 HDInsight cluster

4.1. Create ADLS Gen2 Storage Account

Create your storage account that provides Azure Data Lake Storage Gen2 capabilities.

Once created, navigate to the storage account in the Azure portal, open the “Access keys” item under “Settings”, and take note of the Key value for either key1 or key2, as this will be used in later configuration. You will also need the name you used for the storage account itself.
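If you prefer the command line to the portal, the key can also be read with the Azure CLI. The following is a minimal sketch, assuming the Azure CLI (az) is installed and you have already run az login. The function name get_storage_key and the names in the usage comment are hypothetical.

```shell
# Print the first access key returned for a storage account (normally key1).
# Assumes the Azure CLI is installed and authenticated.
# $1 = resource group, $2 = storage account name (both supplied by you).
get_storage_key() {
  az storage account keys list \
    --resource-group "$1" \
    --account-name "$2" \
    --query "[0].value" \
    --output tsv
}

# Example (hypothetical names):
#   STORAGE_KEY=$(get_storage_key my-resource-group mystorageaccount)
```

Treat the key as a secret: avoid echoing it to shared terminals or committing it to scripts.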

4.2. Create HDInsight Cluster

Create an HDInsight cluster with Azure Storage as the primary storage type, referencing your newly-created storage account for ADLS Gen2. You must add a VNet and subnet in the “Virtual Network Settings” step. Select Hadoop 2.7 on Linux (HDI 3.6) as the cluster type on creation.

4.3. Post-install Cluster Configuration

After confirming that your cluster is operational, access the Ambari interface to add the appropriate configuration for use of the ADLS Gen2 storage as the primary storage type:

fs.abfs.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
fs.abfss.impl=org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem
fs.AbstractFileSystem.abfs.impl=org.apache.hadoop.fs.azurebfs.Abfs
fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfss
fs.azure.account.key.{storage account name}.dfs.core.windows.net={storage access key}
fs.azure.check.block.md5=false
fs.azure.store.blob.md5=false
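To avoid typing the account-specific property by hand, the list above can be generated with the placeholders filled in. This is only a convenience sketch: the function name abfs_core_site_properties is hypothetical, and the output is meant to be pasted into Ambari rather than applied automatically.

```shell
# Print the properties listed above with the storage account name and key
# substituted. $1 = storage account name, $2 = storage access key.
abfs_core_site_properties() {
  cat <<EOF
fs.abfs.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
fs.abfss.impl=org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem
fs.AbstractFileSystem.abfs.impl=org.apache.hadoop.fs.azurebfs.Abfs
fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfss
fs.azure.account.key.${1}.dfs.core.windows.net=${2}
fs.azure.check.block.md5=false
fs.azure.store.blob.md5=false
EOF
}

# Example (hypothetical account name; STORAGE_KEY holds the key you noted):
#   abfs_core_site_properties mystorageaccount "$STORAGE_KEY"
```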

Restart all affected cluster services and wait until Ambari shows a healthy cluster state, then validate that you can access ADLS Gen2 as the primary file system, e.g.

hadoop fs -ls abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/

Here, <CONTAINER_NAME> is the container in the storage account that was created for the cluster, and <STORAGE_ACCOUNT_NAME> is the name of your storage account.
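Because the same URI is reused in several later steps, it can help to build it once and catch malformed names early. The sketch below assumes a POSIX shell; the function name abfs_uri is hypothetical, and the name checks are deliberately loose (Azure also forbids, for example, consecutive hyphens in container names).

```shell
# Build the ABFS URI for a container and storage account.
# Loose checks against Azure naming rules: account names are 3-24 lowercase
# letters or digits; container names are 3-63 lowercase letters, digits,
# or hyphens. $1 = container name, $2 = storage account name.
abfs_uri() {
  container=$1
  account=$2
  if ! printf '%s' "$account" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'; then
    echo "invalid storage account name: $account" >&2
    return 1
  fi
  if ! printf '%s' "$container" | LC_ALL=C grep -Eq '^[a-z0-9-]{3,63}$'; then
    echo "invalid container name: $container" >&2
    return 1
  fi
  printf 'abfs://%s@%s.dfs.core.windows.net/\n' "$container" "$account"
}

# Example (hypothetical names):
#   hadoop fs -ls "$(abfs_uri mycontainer mystorageaccount)"
```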

4.4. WANdisco Fusion Installation

Install the HDI 3.6 version of WANdisco Fusion onto a cluster edge or head node as the root user, choosing the “hdi-3.6” option, not the “hdi 3.6 Data Lake Store” option.

Choose the default hdfs/hadoop user and group.

You may need to set up a port redirection using SSH to gain easy access to the Fusion UI to complete installation.
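One way to set up such a redirection is an SSH local port forward. The sketch below is an illustration only: the function name open_fusion_tunnel is hypothetical, and the default port 8083 is an assumption about where the Fusion UI listens in your installation, so check the port reported by your installer.

```shell
# Forward a local port to the Fusion UI through an SSH-reachable cluster node.
# $1 = SSH user, $2 = SSH endpoint of the cluster,
# $3 = host running the Fusion UI as seen from that endpoint,
# $4 = UI port (assumed default 8083; verify for your installation).
open_fusion_tunnel() {
  ssh_user=$1
  ssh_host=$2
  fusion_host=$3
  port=${4:-8083}
  # -N: run no remote command; -L: forward local port to fusion_host:port
  ssh -N -L "${port}:${fusion_host}:${port}" "${ssh_user}@${ssh_host}"
}

# Example (hypothetical names), then browse to http://localhost:8083/ :
#   open_fusion_tunnel sshuser mycluster-ssh.azurehdinsight.net fusion-node
```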

On the Node Details step (5), provide the storage account access key for the “KEY1 Access Key” value, and the value of the cluster’s fs.defaultFS property for the WASB Storage URI, e.g.

wasb://psmblobfs-2018-06-03t20-54-50-610z@psmblobfs.blob.core.windows.net

This should validate correctly.

You can complete server installation without following the clients step for now.

Then modify /etc/wandisco/fusion/server/core-site.xml to refer to abfs rather than wasb as the underlying file system, e.g.:

<property>
  <name>fs.fusion.underlyingFs</name>
  <value>abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net</value>
</property>

Restart the Fusion server after making this change.
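Before relying on the change, you can read the value back from the file to confirm that it no longer points at wasb. This is a rough sketch that assumes the property is laid out as in the example above, with the name and value elements on separate lines. The function name get_underlying_fs is hypothetical, and a real XML parser would be more robust.

```shell
# Print the value of fs.fusion.underlyingFs from a core-site.xml file.
# $1 = path to the file. Assumes <name> and <value> are on separate lines
# inside the same <property> element.
get_underlying_fs() {
  sed -n '/<name>fs\.fusion\.underlyingFs<\/name>/,/<\/property>/s:.*<value>\(.*\)</value>.*:\1:p' "$1"
}

# Example:
#   case "$(get_underlying_fs /etc/wandisco/fusion/server/core-site.xml)" in
#     abfs://*) echo "underlying file system is abfs" ;;
#     *)        echo "underlying file system is not abfs" ;;
#   esac
```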

4.5. Fusion Client Installation

You can deploy the WANdisco Fusion client from the HDInsight Marketplace.

Go to your cluster in the Azure portal and select “Application” under Configuration, then click on “+ Add”, and select the correct WANdisco Fusion HDI App item.

Note that in the field labelled “License key” you actually need to provide the hostname of the instance where the Fusion server is running.

Fusion client application deployment will take some time. You can monitor progress in Ambari and in the Script Actions section of your cluster. After completion, modify the Ambari configuration properties to point the client at the correct underlying file system, changing fs.fusion.underlyingFs from its wasb:// value to the same setting as for the Fusion server, e.g.:

abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net

and add the fs.fusion.underlyingFsClass=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem configuration property to the custom core-site.

Restart all affected cluster services.
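After the restart, the client-side settings can be read back on a cluster node with hdfs getconf, which prints the value a Hadoop client resolves for a given key. The sketch below assumes the hdfs command is on the PATH of the node where you run it; the function name check_fusion_client_conf is hypothetical.

```shell
# Compare the client's Fusion settings with the expected values.
# $1 = expected fs.fusion.underlyingFs value (same as on the Fusion server).
check_fusion_client_conf() {
  expected_fs=$1
  expected_class=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
  actual_fs=$(hdfs getconf -confKey fs.fusion.underlyingFs) || return 1
  actual_class=$(hdfs getconf -confKey fs.fusion.underlyingFsClass) || return 1
  if [ "$actual_fs" != "$expected_fs" ]; then
    echo "fs.fusion.underlyingFs is '$actual_fs', expected '$expected_fs'" >&2
    return 1
  fi
  if [ "$actual_class" != "$expected_class" ]; then
    echo "fs.fusion.underlyingFsClass is '$actual_class'" >&2
    return 1
  fi
  echo "client configuration OK"
}

# Example (hypothetical names):
#   check_fusion_client_conf abfs://mycontainer@mystorageaccount.dfs.core.windows.net
```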

4.6. Verify Operation

At this stage, you should be able to verify the correct operation of each cluster when accessing its default file system (ADLS Gen1 and ADLS Gen2 respectively).

Gen1 Cluster

hadoop fs -ls adl://mytest.azuredatalakestore.net/

Gen2 Cluster

hadoop fs -ls abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/
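If you want to script these checks, the exit status of hadoop fs -ls indicates whether the listing succeeded. The following is a minimal sketch; the function name check_fs_access is hypothetical, and it should be run on the cluster whose file system is being checked.

```shell
# Report whether a file system URI can be listed from this node.
# $1 = file system URI (adl://... on the Gen1 cluster, abfs://... on Gen2).
check_fs_access() {
  if hadoop fs -ls "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "FAILED: $1"
    return 1
  fi
}

# Example, on the Gen1 cluster:
#   check_fs_access adl://mytest.azuredatalakestore.net/
```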

4.7. WANdisco Fusion Configuration

With WANdisco Fusion installed for each cluster, you are now free to establish replication rules that govern which data will be made consistent among the environments.

Configure these as described in the WANdisco Fusion product documentation to create specific directories for which any new or changed data will be replicated between the ADLS Gen1 and ADLS Gen2 storage accounts.

Use the repair feature to perform an initial transfer of pre-existing content from ADLS Gen1 to ADLS Gen2 if required. This allows you to bring data sets to your ADLS Gen2 environment. A key benefit of WANdisco Fusion is that application operation in either the Gen1 or Gen2 clusters can continue while data are replicated, because WANdisco Fusion maintains data consistency.
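As a coarse sanity check after an initial repair, you can compare directory, file, and byte counts for a replicated path on both sides using hadoop fs -count. This is not a substitute for WANdisco Fusion's own consistency checking, and counts can differ briefly while writes are in flight. The function names fs_counts and compare_replicas are hypothetical. The comparison assumes a node that can reach both file systems; otherwise run fs_counts on each cluster and compare the output by hand.

```shell
# Print "dirs files bytes" for a path, taken from `hadoop fs -count`.
fs_counts() {
  hadoop fs -count "$1" 2>/dev/null | awk 'NR == 1 { print $1, $2, $3 }'
}

# Compare the counts for the same replicated directory on Gen1 and Gen2.
# $1 = Gen1 path (adl://...), $2 = Gen2 path (abfs://...).
compare_replicas() {
  gen1=$(fs_counts "$1")
  gen2=$(fs_counts "$2")
  if [ -z "$gen1" ] || [ -z "$gen2" ]; then
    echo "ERROR: could not read counts" >&2
    return 2
  fi
  if [ "$gen1" = "$gen2" ]; then
    echo "MATCH: $gen1"
  else
    echo "DIFFER: gen1=[$gen1] gen2=[$gen2]"
    return 1
  fi
}

# Example (hypothetical paths):
#   compare_replicas adl://mytest.azuredatalakestore.net/data \
#     abfs://mycontainer@mystorageaccount.dfs.core.windows.net/data
```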

4.8. Performance and other considerations

WANdisco Fusion is designed for workloads and data volumes at massive scale, but benefits from configuration tailored to your upgrade. The number of WANdisco Fusion server instances, the replication rule configuration, and other finer-grained options all depend on your specific upgrade needs. Options also exist for the replication of ancillary information such as Hive metadata, Ranger policies, and more. WANdisco can be consulted for assistance and support with a paid subscription to the WANdisco Fusion product.

4.9. Limitations

WANdisco Fusion uses the network to replicate data and maintain consistency between the ADLS Gen1 and ADLS Gen2 storage accounts. If there is insufficient bandwidth to accommodate the rate of change of data, you may need to reduce the rate at which new data are created or existing data are modified.
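A back-of-the-envelope estimate can show whether bandwidth is likely to be a constraint. The sketch below uses decimal units (1 GB = 8,000 megabits) and ignores protocol overhead, compression, and bursts, so treat the results as lower bounds. The function names required_mbps and transfer_hours are hypothetical.

```shell
# Average bandwidth (Mbps) needed to keep up with a daily change volume.
# $1 = data created or modified per day, in GB.
required_mbps() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 8000 / 86400 }'
}

# Hours needed for an initial transfer at a sustained rate.
# $1 = data volume in TB, $2 = usable bandwidth in Mbps.
transfer_hours() {
  awk -v tb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", tb * 8000000 / mbps / 3600 }'
}

# Examples:
#   required_mbps 1000      # 1,000 GB of change per day -> prints 92.59
#   transfer_hours 10 1000  # 10 TB at 1,000 Mbps        -> prints 22.2
```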

Once your environments are consistent, WANdisco Fusion must remain operational if you intend to keep data available on both storage platforms. This allows you to continue to operate on your data in either Gen1 or Gen2, which may need to extend beyond the initial period of upgrade to ADLS Gen2.