Prepare for LiveData Migrator

LiveData Migrator for Azure is an easy way of migrating your on-premises Hadoop data to Azure. You can set it up quickly in your environment on an edge node, or try it out with our HDFS Trial Sandbox solution that mimics a source filesystem.

info

The HDFS Trial Sandbox option is available when you create the LiveData Migrator resource in the Azure Portal.

Before you start, make sure you have all the necessary prerequisites. LiveData Migrator supports the following operating systems:

  • Ubuntu 16 and 18
  • CentOS 6 and 7
  • Red Hat Enterprise Linux 6 and 7

The edge node on your on-premises cluster needs the following:

  • Hadoop client libraries installed.
  • Hadoop client available within the systemd environment.
  • System memory greater than: (4 x (CPU threads) x 8MB x (pull threads) x 3)
    • Example: (4 x (16 CPU threads) x 8MB x (50 pull threads) x 3) = 76.8 GB of RAM
  • Java 1.8+
  • If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the HDFS superuser must be available on the edge node.
    • If you want to migrate Hive metadata from your Hadoop cluster, the edge node must also have a keytab containing a suitable principal for the Hive service.
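The memory requirement above can be checked for your own edge node with a few lines of Python. This is a minimal sketch of the stated formula only; the function name and its parameters are illustrative and are not part of LiveData Migrator.

```python
def min_edge_node_memory_mb(cpu_threads: int, pull_threads: int) -> int:
    """Return the memory floor in MB: 4 x CPU threads x 8MB x pull threads x 3."""
    if cpu_threads < 1 or pull_threads < 1:
        raise ValueError("thread counts must be positive")
    return 4 * cpu_threads * 8 * pull_threads * 3


if __name__ == "__main__":
    mb = min_edge_node_memory_mb(cpu_threads=16, pull_threads=50)
    # 1 GB is taken as 1000 MB here, matching the 76.8 GB figure in the example.
    print(f"{mb} MB = {mb / 1000:.1f} GB")
```

With 16 CPU threads and 50 pull threads this reproduces the 76.8 GB figure from the example. The edge node's memory should be greater than the returned value.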

Create the LiveData Migrator resource

note

If you haven't already done so, register the WANdisco resource provider.

  1. Sign in to the Azure Portal and check that the intended subscription is selected.

  2. Go to the Marketplace, or select Add New Resource, and search for "LiveData Migrator for Azure".

  3. Select the LiveData Migrator subscription, then select Subscribe (if in the marketplace), or Create (if adding a resource).

  4. Complete the Basic details to create the LiveData Migrator resource:

    1. Choose to use an existing resource group, or create a new one as part of your setup process.
    2. Select a supported region in the Instance details.
    3. Provide a name for the migrator resource.
    4. Select Yes if you want to use the Hadoop test cluster in the HDFS Trial Sandbox as your source environment for testing, or No if you are using your own Hadoop cluster.
      • If Yes is selected, provide the Cluster Admin Username and Password (with confirmation) that you will use to sign in to the test cluster.
  5. Select Review + create once the details have been provided.

  6. If prompted, select the I agree to the terms of service checkbox to give the WANdisco resource provider access to your subscription. This registers the resource provider.

  7. Select Create after reviewing the summary.

Prepare your source environment for migrations

We recommend you make the following configuration changes to your HDFS cluster environment to prepare for data migrations.

HDFS NameNode properties

You can adjust several properties on the HDFS NameNode to prevent data migrations from stalling due to an excess of notifications, or from operating too slowly.

Configure these properties in the hdfs-site.xml for the cluster. Where you set each property depends on your distribution:

  • Hortonworks (HDP)

    • Custom hdfs-site
      dfs.namenode.inotify.max.events.per.rpc
    • Advanced hdfs-site
      dfs.namenode.checkpoint.txns
  • Cloudera (CDH/CDP)

    • Filesystem Checkpoint Transaction Threshold
      dfs.namenode.checkpoint.txns

    • NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
      dfs.namenode.inotify.max.events.per.rpc

    • HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
      dfs.namenode.checkpoint.txns
      dfs.namenode.inotify.max.events.per.rpc

      The HDFS Client entries are required for the LiveData Migrator host to register and confirm the configuration.

note

After configuring HDFS properties, you must restart all cluster services that rely on HDFS configuration for their function (including the HDFS service) for the changes to apply.

dfs.namenode.inotify.max.events.per.rpc

This value dictates the maximum number of events the NameNode can send to an inotify client in a single RPC response. By default, this is set to 1000, which should consume no more than 1MB of memory.

You can increase this value to allow inotify clients (such as LiveData Migrator) to receive larger batches of event notifications in a single RPC, at the cost of higher memory use.

We recommend setting this value to 100000 for production use. If you increase it, make sure your NameNode can allocate at least an additional 100MB of memory from its maximum heap capacity to deliver these larger batches of events.
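The 100MB figure follows from scaling the default linearly: 1000 events are stated to use at most 1MB, so 100 times as many events need roughly 100 times the memory. The sketch below shows that arithmetic. It assumes linear scaling, which is an extrapolation from the two figures above and not a measured NameNode value, and the function name is illustrative.

```python
DEFAULT_EVENTS_PER_RPC = 1000  # default value of the property
DEFAULT_BATCH_MB = 1           # stated upper bound for a default-sized batch


def estimated_batch_mb(max_events_per_rpc: int) -> float:
    """Scale the 1MB-per-1000-events bound linearly (an assumption)."""
    return max_events_per_rpc / DEFAULT_EVENTS_PER_RPC * DEFAULT_BATCH_MB


if __name__ == "__main__":
    print(estimated_batch_mb(100000))  # 100.0
```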

dfs.namenode.checkpoint.txns

This value sets the number of namespace transactions that triggers a checkpoint, which updates the filesystem metadata. When this threshold is reached, the checkpoint is triggered regardless of whether the dfs.namenode.checkpoint.period has expired.

The default value for this is 1000000, but we recommend increasing it to 10000000 for production use.
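To confirm that a cluster's hdfs-site.xml carries both recommended values, you could run a small check like the one below on the edge node. This is a sketch, not a LiveData Migrator tool: the function names and the sample XML are illustrative, and it assumes the standard Hadoop layout of property elements with name and value children under a configuration root.

```python
import xml.etree.ElementTree as ET

# Recommended production values from this page.
RECOMMENDED = {
    "dfs.namenode.inotify.max.events.per.rpc": 100000,
    "dfs.namenode.checkpoint.txns": 10000000,
}

# Illustrative sample: the first property is tuned, the second is still at its default.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.namenode.inotify.max.events.per.rpc</name>
    <value>100000</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
  </property>
</configuration>"""


def read_properties(xml_text: str) -> dict:
    """Return a name -> value mapping for every property in the XML text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def below_recommended(xml_text: str) -> dict:
    """Map each property that is unset or below the recommendation
    to its current value (None when unset, meaning the default applies)."""
    props = read_properties(xml_text)
    shortfalls = {}
    for name, recommended in RECOMMENDED.items():
        current = props.get(name)
        if current is None or int(current) < recommended:
            shortfalls[name] = current
    return shortfalls


if __name__ == "__main__":
    for name, current in below_recommended(SAMPLE_HDFS_SITE).items():
        print(f"{name}: current={current}, recommended={RECOMMENDED[name]}")
```

To check a real file instead of the sample, read its contents and pass the text to below_recommended. The path to hdfs-site.xml depends on your distribution.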

Next steps

Once your environment is ready and you've created a LiveData Migrator resource, you can download and install LiveData Migrator.