Prepare for LiveData Migrator
LiveData Migrator for Azure is an easy way of migrating your on-premises Hadoop data to Azure. You can set it up quickly in your environment on an edge node, or try it out with our HDFS Sandbox solution that mimics a source filesystem.
The HDFS Sandbox option is available when you create the LiveData Migrator resource in the Azure Portal.
Before you start, make sure you have all of the prerequisites. LiveData Migrator supports the following operating systems:
- Ubuntu 16 and 18
- CentOS 6 and 7
- Red Hat Enterprise Linux 6 and 7
The edge node on your on-premises cluster needs the following:
- Hadoop client libraries installed.
- Hadoop client available within the systemd environment.
- Minimum system memory should exceed: 4 x (CPU threads) x 8 MB x (pull threads) x 3
- Example: 4 x 16 CPU threads x 8 MB x 50 pull threads x 3 = 76,800 MB, or about 76.8 GB of RAM (see the pre-flight sketch after this list)
- Java 1.8+
- If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the HDFS superuser must be available on the edge node.
- If you want to migrate Hive metadata from your Hadoop cluster, the edge node must also have a keytab containing a suitable principal for the Hive service.
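As a quick pre-flight check, the following sketch runs through these requirements on the edge node. The keytab path, principal, and pull-thread count are placeholders, not values from your cluster; substitute your own.

```bash
# Pre-flight checks for the LiveData Migrator edge node.
# The keytab path and principal below are placeholders; replace them with your own.

# Hadoop client libraries and Java 1.8+ must be available.
hadoop version
java -version

# Estimate the minimum system memory: 4 x CPU threads x 8 MB x pull threads x 3.
CPU_THREADS=$(nproc)
PULL_THREADS=50   # example value; match your planned migration settings
echo "Minimum RAM: $((4 * CPU_THREADS * 8 * PULL_THREADS * 3)) MB"

# If Kerberos is enabled, confirm the HDFS superuser keytab is usable.
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.REALM
klist
```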
Create the LiveData Migrator resource
- Create the resource in Azure Portal
- Create the resource with the Azure CLI
If you haven't already done so, register the WANdisco resource provider.
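If you prefer to do this step from the Azure CLI, a minimal sketch is shown below. The provider namespace Wandisco.Fusion is an assumption; confirm the exact namespace with az provider list before registering.

```bash
# Register the WANdisco resource provider (namespace assumed to be Wandisco.Fusion;
# verify it first with: az provider list -o table).
az provider register --namespace Wandisco.Fusion

# Check the registration state; it should eventually report "Registered".
az provider show --namespace Wandisco.Fusion --query registrationState -o tsv
```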
Sign in to the Azure Portal and check your intended subscription name is correct.
Go to the Marketplace, or select Create a resource: search for "LiveData Migrator for Azure".
Select the LiveData Migrator subscription, then select Subscribe (if you're in the marketplace or creating a new resource), or Create (if you're adding a resource through the resource group).
Complete the details under the Basics section to create the LiveData Migrator resource:
- Choose to use an existing resource group, or create a new one as part of your setup process.
- Select a supported region in the Instance details.
- Enter a name for the migrator resource.
- Select Yes if you want to use the Hadoop test cluster in the HDFS Sandbox as your source environment for testing, or No if you are using your own Hadoop cluster.
- If Yes is selected, enter the Cluster Admin Username and Password (with confirmation) that you will use to sign in to the test cluster.
Select Review + create once the details have been provided.
If prompted, select the I agree to the terms of service checkbox to give the WANdisco resource provider access to your subscription. This registers the resource provider.
Select Create after reviewing the summary.
Run the following command to create the migrator resource:
```bash
az livedata migrator create -g <resource_group> --migrator-name <migrator_name> -l <azure_region>
```
-g: The name of the resource group in which your migrator resource will be created.
--migrator-name: The name to use for your migrator.
-l: The Azure region to create the migrator resource in. See the Supported regions list before entering your chosen Azure region.
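For example, using placeholder names (a resource group ldm-rg, a migrator named ldm-demo, and the eastus region, assuming it appears in the Supported regions list), the command might look like this:

```bash
# Placeholder resource group, migrator name, and region; substitute your own.
az livedata migrator create -g ldm-rg --migrator-name ldm-demo -l eastus
```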
If you're having problems creating a migrator resource, you can find solutions in the troubleshooting guide. Run az livedata migrator create --help to learn more about the command's options.
Prepare your source environment for migrations
We recommend you make the following configuration changes to your HDFS cluster environment to prepare for data migrations.
HDFS NameNode properties
You can adjust several properties on the HDFS NameNode to prevent data migrations from stalling due to an excess of notifications, or from operating too slowly.
Configure these properties in the hdfs-site.xml for the cluster. Where you set them varies depending on your distribution:
Hortonworks (HDP)
- Custom hdfs-site: dfs.namenode.inotify.max.events.per.rpc
- Advanced hdfs-site: dfs.namenode.checkpoint.txns

Cloudera (CDH/CDP)
- Filesystem Checkpoint Transaction Threshold: dfs.namenode.checkpoint.txns
- NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml: dfs.namenode.inotify.max.events.per.rpc
- HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml: dfs.namenode.checkpoint.txns and dfs.namenode.inotify.max.events.per.rpc
The HDFS Client entries are required for the LiveData Migrator host to register and confirm the configuration.
After configuring HDFS properties, you must restart all cluster services that rely on HDFS configuration for their function (including the HDFS service) for the changes to apply.
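Once the services are back up, one way to confirm that the LiveData Migrator host sees the new values is to read them from the HDFS client configuration on the edge node:

```bash
# Print the effective values as seen by the HDFS client on the edge node.
hdfs getconf -confKey dfs.namenode.inotify.max.events.per.rpc
hdfs getconf -confKey dfs.namenode.checkpoint.txns
```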
dfs.namenode.inotify.max.events.per.rpc
This value dictates the maximum number of events the NameNode can send to an inotify client in a single RPC response. By default, this is set to 1000, which should consume no more than 1 MB of memory.
You may increase this value to allow inotify clients (such as LiveData Migrator) to receive larger batches of event notifications in a single RPC, at the cost of higher memory use.
We recommend setting this value to 100000 for production use. If you increase it this far, make sure the NameNode can allocate at least an additional 100 MB of its maximum heap to deliver these larger batches of events.
dfs.namenode.checkpoint.txns
This value determines the number of namespace transactions that triggers a checkpoint, which updates the filesystem metadata. When this threshold is reached, the checkpoint is triggered regardless of whether dfs.namenode.checkpoint.period has expired.
The default value is 1000000, but we recommend increasing it to 10000000 for production use.
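Taken together, a minimal hdfs-site.xml fragment with these recommended production values would look like the sketch below. On managed clusters, apply the values through Ambari or Cloudera Manager as described above rather than editing the file by hand.

```xml
<!-- Recommended production values from the sections above. -->
<property>
  <name>dfs.namenode.inotify.max.events.per.rpc</name>
  <value>100000</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>10000000</value>
</property>
```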
Next steps
Once your environment is ready and you've created a LiveData Migrator resource, you're ready to download and install LiveData Migrator.