Version: 1.14.0

On-premises Hadoop to Azure HDInsights

This outline covers the steps needed to prepare your environment for migrating data and metadata.

Time to complete: 1 hour (assuming all prerequisites are met).

Prerequisites#

On-premises Hadoop cluster#

Make sure all prerequisites are met for the source environment. These include:

  • Network connectivity between your edge node and your ADLS Gen2 storage container.
  • If using an Azure SQL Database on your HDInsights cluster, network connectivity between your edge node and this database.
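You can verify both connectivity requirements from the edge node with a simple port check. The hostnames below are placeholders; substitute your own storage account and database server:

```shell
# ADLS Gen2 endpoint (HTTPS) -- replace with your storage account hostname:
nc -vz mystorageaccount.dfs.core.windows.net 443

# Azure SQL Database (default port 1433), only needed for a custom metastore:
nc -vz myserver.database.windows.net 1433
```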

Azure HDInsights cluster#

For your target environment, make sure the following prerequisites are met:

  • Your HDInsights cluster is using ADLS Gen2 as its primary storage type.
  • If using a default metastore, SSH access to an edge node on the HDInsights cluster.
    The edge node requires the following:
    • HDFS and Hive client libraries installed.
    • A chosen port open for outbound connections (for example: 5552) to communicate with the LiveData Migrator service on the on-premises Hadoop edge node.
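A quick way to confirm these prerequisites on the HDInsights edge node is to check that the client tools resolve and that the chosen port can reach the on-premises service. The hostname below is a placeholder:

```shell
# Confirm HDFS and Hive client libraries are installed and on the path:
hdfs version
hive --version

# Confirm the chosen port (5552 here, matching the example above) can reach
# the LiveData Migrator service on the on-premises Hadoop edge node:
nc -vz onprem-edge.example.com 5552
```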

Install LiveData Migrator on your Hadoop edge node#

Download and install LiveData Migrator on your Hadoop edge node.

Configure for data migrations#

Add HDFS as source filesystem#

  1. If Kerberos is enabled on your Hadoop cluster, specify the Kerberos credentials for the HDFS superuser:
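    As a sketch, the Kerberos credentials can be supplied when adding the source filesystem. The flag names and paths below are illustrative assumptions; verify them against the built-in help for `filesystem add hdfs` in your CLI version:

    ```shell
    # Illustrative only -- flag names and keytab path are assumptions:
    filesystem add hdfs --file-system-id onprem-hdfs \
      --kerberos-keytab /etc/security/keytabs/hdfs.headless.keytab \
      --kerberos-principal hdfs@EXAMPLE.REALM \
      --source
    ```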

  2. (CLI only) Check that HDFS on your on-premises Hadoop cluster is set as your source filesystem:

    source fs show

    If the filesystem shown is incorrect, delete it using source del and configure the source manually:

    filesystem add hdfs

    Be sure to include the --source parameter when using the command above.
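    Put together, replacing an incorrect source might look like the following. The filesystem ID and namenode URI are placeholders, and the flag names are assumptions to verify against the CLI help:

    ```shell
    # Remove the incorrect source, then re-add HDFS manually
    # (IDs, URIs, and flag names are examples):
    source del --file-system-id mysource
    filesystem add hdfs --file-system-id onprem-hdfs --source \
      --fs-root hdfs://namenode.example.com:8020/
    ```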

Add ADLS Gen2 storage as target filesystem#

Configure your ADLS Gen2 storage container as your target filesystem. The steps depend on the authentication method you use:
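For example, a service principal (OAuth2) configuration might look like the sketch below. The subcommand and flag names are assumptions; check the CLI help for `filesystem add adls2` in your version, and replace the placeholder account, container, and credential values:

```shell
# OAuth2 / service principal sketch -- names and flags are illustrative:
filesystem add adls2 oauth --file-system-id adls-target \
  --storage-account-name mystorageaccount \
  --container-name mycontainer \
  --oauth2-client-id <client-id> \
  --oauth2-client-secret <client-secret> \
  --oauth2-client-endpoint https://login.microsoftonline.com/<tenant-id>/oauth2/token
```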

Create path mapping for default Hive warehouse directory#

Create a path mapping to ensure that data for managed Hive databases and tables are migrated to the default Hive warehouse directory for HDInsight clusters.

This lets you start using your source data and metadata on your HDInsights cluster immediately after migration, as it will be referenced correctly by your target metastore.
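As a sketch, mapping a typical on-premises warehouse directory to the HDInsight default might look like this. The command shape, mapping ID, and paths are illustrative assumptions; confirm your cluster's actual warehouse directories and the CLI syntax before use:

```shell
# Map the on-premises Hive warehouse to the HDInsight default warehouse
# directory (paths and flag names are examples):
path mapping add --path-mapping-id hive-warehouse \
  --source-path /apps/hive/warehouse \
  --target-file-system adls-target \
  --target-path /hive/warehouse
```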

Configure for metadata migrations#

Add source hive agent#

  1. Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:
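    A sketch of adding a source hive agent on a Kerberized cluster follows. The flag names, keytab path, and config path are assumptions; check the CLI help for `hive agent add hive`:

    ```shell
    # Illustrative source agent configuration -- flags are assumptions:
    hive agent add hive --name hiveAgent \
      --kerberos-keytab /etc/security/keytabs/hive.service.keytab \
      --kerberos-principal hive/_HOST@EXAMPLE.REALM \
      --config-path /etc/hive/conf
    ```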

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name hiveAgent

Add target hive agent#

HDInsights can use either a default metastore or a custom metastore in the form of an Azure SQL Database.

Choose one of the methods below depending on the type of metastore deployed in your HDInsights cluster.

Default metastore#

note

For step 1, deploying a remote agent is only possible through the CLI.

  1. Deploy and configure a remote hive agent:

    hive agent add hive

    Use the automated deployment parameters or follow the steps for manual deployment.

    As mentioned in the prerequisites, you will need to specify a suitable edge node on your HDInsights cluster to deploy the hive agent service.
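    An automated remote deployment might look like the sketch below. The SSH user, key path, hostname, and flag names are illustrative assumptions; consult the CLI help for the automated deployment parameters in your version:

    ```shell
    # Remote agent deployment sketch -- flags and values are assumptions:
    hive agent add hive --name azureAgent \
      --autodeploy \
      --ssh-user sshuser \
      --ssh-key ~/.ssh/id_rsa \
      --host hdi-edge.example.com \
      --port 5552
    ```

    The port given here should match the port opened for outbound connections in the prerequisites (5552 in the earlier example).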

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent

Custom metastore (Azure SQL database)#

  1. Configure a hive agent to connect to an Azure SQL database:
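    A hypothetical example of pointing an agent at an Azure SQL metastore follows. The subcommand, flag names, and connection values are all assumptions; verify them against the CLI help before use:

    ```shell
    # Illustrative Azure SQL metastore agent -- everything here is a placeholder:
    hive agent add azure --name azureAgent \
      --db-server-name myserver.database.windows.net \
      --database-name hivemetastore \
      --database-user hiveadmin \
      --database-password '<password>'
    ```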

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent

Next Steps#

Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.