Prepare for LiveData Migrator

LiveData Migrator for Azure is an easy way of migrating your on-premises Hadoop data to Azure. You can set it up quickly in your environment on an edge node, or try it out with our HDFS Sandbox solution that mimics a source filesystem.

info

The HDFS Sandbox option is available when you create the LiveData Migrator resource in the Azure Portal.

Before you start, make sure you have all of the prerequisites. LiveData Migrator supports the following operating systems:

Ubuntu 16 and 18
CentOS 6 and 7
Red Hat Enterprise Linux 6 and 7

The edge node on your on-premises cluster needs the following:

Hadoop client libraries installed.
Hadoop client available within the systemd environment.
System memory greater than: (4 x (CPU threads) x 8MB x (pull threads) x 3)
- Example: (4 x (16 CPU threads) x 8MB x (50 pull threads) x 3) = 76,800 MB (76.8 GB) of RAM
Java 1.8+
If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the HDFS superuser must be available on the edge node.
- If you want to migrate Hive metadata from your Hadoop cluster, the edge node must also have a keytab containing a suitable principal for the Hive service.
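The memory requirement above can be worked out with a few lines of shell. This is a minimal sketch, not part of LiveData Migrator: the function name `min_memory_mb` is hypothetical, and the comparison against `/proc/meminfo` assumes a Linux host and is only approximate, because the kernel reports memory in binary units.

```shell
#!/usr/bin/env bash
# min_memory_mb CPU_THREADS PULL_THREADS   (hypothetical helper)
# Prints 4 x CPU threads x 8 MB x pull threads x 3, in MB.
min_memory_mb() {
  local cpu_threads="$1" pull_threads="$2"
  echo $(( 4 * cpu_threads * 8 * pull_threads * 3 ))
}

# The example from the prerequisites: 16 CPU threads and 50 pull threads.
required_mb="$(min_memory_mb 16 50)"
echo "Memory must exceed: ${required_mb} MB"

# Rough comparison with the memory installed on this host (Linux only).
if [ -r /proc/meminfo ]; then
  total_mb="$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)"
  if [ "${total_mb:-0}" -gt "${required_mb}" ]; then
    echo "OK: host reports ${total_mb} MB"
  else
    echo "WARNING: host reports only ${total_mb:-0} MB"
  fi
fi
```

Pass your own CPU thread and pull thread counts as the two arguments to see the requirement for your edge node.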

Create the LiveData Migrator resource

Create the resource in Azure Portal
Create the resource with the Azure CLI

note

If you haven't already done so, register the WANdisco resource provider.

Sign in to the Azure Portal and check that your intended subscription is selected.
Go to the Marketplace, or select Create a resource, then search for "LiveData Migrator for Azure".
Select the LiveData Migrator subscription, then select Subscribe (if you're in the marketplace or creating a new resource), or Create (if you're adding a resource through the resource group).
Complete the details under the Basics section to create the LiveData Migrator resource:
1. Choose to use an existing resource group, or create a new one as part of your setup process.
2. Select a supported region in the Instance details.
3. Enter a name for the migrator resource.
4. Select Yes if you want to use the Hadoop test cluster in the HDFS Sandbox as your source environment for testing, or No if you are using your own Hadoop cluster.
  - If Yes is selected, enter the Cluster Admin Username and Password (with confirmation) that you will use to sign in to the test cluster.
Select Review + create once the details have been provided.
If prompted, select the I agree to the terms of service checkbox to give the WANdisco resource provider access to your subscription. This registers the resource provider.
Select Create after reviewing the summary.

To create the migrator resource with the Azure CLI instead, run the following command:

az livedata migrator create -g <resource_group> --migrator-name <migrator_name> -l <azure_region>

-g: The name of the resource group in which your migrator resource will be created.
--migrator-name: The name to use for your migrator.
-l: The Azure region to create the migrator resource in. See the Supported regions list before entering your chosen Azure region.

If you're having problems creating a migrator resource, you can find solutions in the troubleshooting guide. Run az livedata migrator create --help to learn more about the command's options.

Prepare your source environment for migrations

We recommend you make the following configuration changes to your HDFS cluster environment to prepare for data migrations.

HDFS NameNode properties

You can adjust several properties on the HDFS NameNode to prevent data migrations from stalling due to an excess of notifications, or from operating too slowly.

Configure these properties in the hdfs-site.xml for the cluster. Where you set each property depends on your distribution:

Hortonworks (HDP)
- Custom hdfs-site
  dfs.namenode.inotify.max.events.per.rpc
- Advanced hdfs-site
  dfs.namenode.checkpoint.txns
Cloudera (CDH/CDP)
- Filesystem Checkpoint Transaction Threshold
  dfs.namenode.checkpoint.txns
- NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
  dfs.namenode.inotify.max.events.per.rpc
- HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
  dfs.namenode.checkpoint.txns
  dfs.namenode.inotify.max.events.per.rpc
  The HDFS Client entries are required for the LiveData Migrator host to register and confirm the configuration.

note

After configuring HDFS properties, you must restart all cluster services that rely on HDFS configuration for their function (including the HDFS service) for the changes to apply.

dfs.namenode.inotify.max.events.per.rpc

This value dictates the maximum number of events the NameNode can send to an inotify client in a single RPC response. By default, this is set to 1000, which should consume no more than 1MB of memory.

You may increase this value to allow inotify clients (such as LiveData Migrator) to receive larger batches of event notifications in a single RPC, at the cost of higher memory use.

We recommend setting this value to 100000 for production use. If you increase it, make sure your NameNode can allocate at least an additional 100MB of memory from its maximum heap capacity to deliver these larger batches of events.

dfs.namenode.checkpoint.txns

This value sets the number of namespace transactions that triggers a checkpoint, which updates the filesystem metadata. When this threshold is reached, a checkpoint is triggered even if the dfs.namenode.checkpoint.period has not expired.

The default value for this is 1000000, but we recommend increasing it to 10000000 for production use.
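One way to confirm the recommended values from the LiveData Migrator host is to read them back with `hdfs getconf -confKey`. The sketch below assumes the Hadoop client is on the PATH. The helper `check_min` is hypothetical and not part of any WANdisco or Hadoop tooling. Note that `hdfs getconf` reports the client-side configuration visible on the host where it runs, so it does not prove that the NameNode has been restarted with the new values.

```shell
#!/usr/bin/env bash
# check_min NAME ACTUAL RECOMMENDED   (hypothetical helper)
# Reports whether ACTUAL is at least RECOMMENDED.
check_min() {
  local name="$1" actual="$2" recommended="$3"
  # A non-numeric or empty ACTUAL fails the test and is reported as LOW.
  if [ "${actual}" -ge "${recommended}" ] 2>/dev/null; then
    echo "OK ${name}=${actual}"
  else
    echo "LOW ${name}=${actual} (recommended: ${recommended})"
  fi
}

# Only query the configuration if the Hadoop client is installed.
if command -v hdfs >/dev/null 2>&1; then
  check_min dfs.namenode.inotify.max.events.per.rpc \
    "$(hdfs getconf -confKey dfs.namenode.inotify.max.events.per.rpc)" 100000
  check_min dfs.namenode.checkpoint.txns \
    "$(hdfs getconf -confKey dfs.namenode.checkpoint.txns)" 10000000
fi
```

A LOW result for either property suggests the HDFS Client configuration on that host has not picked up the change yet.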

Next steps

Once your environment is ready and you've created a LiveData Migrator resource, you're ready to download and install LiveData Migrator.
