Version: 2.4.3 (latest)

On-premises Hadoop to Azure HDInsight

This is an outline of the steps needed to prepare your environment for migrating data and metadata.

Time to complete: 1 hour (assuming all prerequisites are met).

Prerequisites

On-premises Hadoop cluster

Make sure all prerequisites for the source environment are met. These include:

  • Network connectivity between your edge node and your Azure Data Lake Storage (ADLS) Gen2 storage container.
  • If using an Azure SQL Database with your HDInsight cluster, network connectivity between your edge node and this database.

Azure HDInsight cluster

For your target environment, make sure the following prerequisites are met:

  • Your HDInsight cluster is using Azure Data Lake Storage (ADLS) Gen2 as its primary storage type.
  • If using a default metastore, SSH access to an edge node on the HDInsight cluster.
    The edge node requires the following:
    • Hadoop Distributed File System (HDFS) and Hive client libraries installed.
    • A chosen port open for outbound connections (for example: 5552) to communicate with the Data Migrator service on the on-premises Hadoop edge node.
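
Before starting the migration steps, it can save time to confirm the outbound-port prerequisite from the HDInsight edge node. The helper below is a generic sketch using bash's /dev/tcp pseudo-device; the hostname and port 5552 are placeholders for your environment:

```shell
#!/usr/bin/env bash
# Sketch: probe a TCP port to confirm the edge node can reach the
# Data Migrator service on the on-premises edge node.

check_port() {
  local host="$1" port="$2"
  # /dev/tcp is a bash pseudo-device; timeout avoids hanging on
  # filtered ports. Returns 0 only if the connection succeeds.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example usage (replace with your on-premises edge node and chosen port):
# if check_port onprem-edge.example.com 5552; then
#   echo "Port reachable"
# else
#   echo "Port unreachable - check firewall and NSG rules"
# fi
```

If the probe fails, review both the on-premises firewall and any Azure network security group rules on the HDInsight virtual network.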

Install Data Migrator on your Hadoop edge node

Install Data Migrator on your Hadoop edge node.

Configuration for data migrations

Add Hadoop Distributed File System (HDFS) as a source filesystem

  1. If Kerberos is enabled on your Hadoop cluster, enter the Kerberos credentials for the HDFS superuser.

  2. (CLI only) Check that HDFS on your on-premises Hadoop cluster is set as your source filesystem:

    source show

    If the filesystem shown is incorrect, delete it using source delete and configure the source manually:

    filesystem add hdfs

    Make sure you include the --source parameter when running the command above.
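
Taken together, the CLI steps above might look like the following session. The filesystem identifier and the --file-system-id flag are illustrative assumptions; check the CLI reference for your version:

```
# Show the current source filesystem
source show

# If the source shown is incorrect, remove it and add HDFS manually
source delete
filesystem add hdfs --file-system-id mysource --source
```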

Add Azure Data Lake Storage (ADLS) Gen2 storage as a target filesystem

Configure your ADLS Gen2 storage container as your target filesystem. The configuration steps depend on the authentication method you choose:
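
As a sketch, adding an ADLS Gen2 target authenticated with a service principal might look like the following. The subcommand, flags, and values here are illustrative assumptions; substitute the parameters for your chosen authentication method and consult the CLI reference for your version:

```
filesystem add adls2 oauth --file-system-id mytarget \
  --storage-account-name examplestorage \
  --container-name examplecontainer \
  --oauth2-client-id <application-id> \
  --oauth2-client-secret <client-secret> \
  --oauth2-client-endpoint https://login.microsoftonline.com/<tenant-id>/oauth2/token
```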

Create path mapping for default Hive warehouse directory

Create a path mapping to ensure that data for managed Hive databases and tables is migrated to the default Hive warehouse directory for HDInsight clusters.

This lets you start using your source data and metadata on your HDInsight cluster immediately after migration, as your target metastore will reference them correctly.
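
In the CLI, such a path mapping might be created along these lines. The command flags are assumptions, and the warehouse paths shown are common defaults (/apps/hive/warehouse on many on-premises distributions, /hive/warehouse on HDInsight); verify yours in hive-site.xml before creating the mapping:

```
path mapping add --path-mapping-id hive-warehouse \
  --source-path /apps/hive/warehouse \
  --target mytarget \
  --target-path /hive/warehouse \
  --description "Map managed Hive warehouse to the HDInsight default"
```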

Configuration for metadata migrations

Add source hive agent

  1. Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster.

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name hiveAgent
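
For the CLI route, configuring the source agent might look like the sketch below. The agent name matches the check in step 2, but the flags (Kerberos principal, keytab path, filesystem ID) are illustrative assumptions for a Kerberized cluster:

```
hive agent add hive --name hiveAgent \
  --kerberos-principal hive/_HOST@EXAMPLE.COM \
  --kerberos-keytab /etc/security/keytabs/hive.service.keytab \
  --file-system-id mysource
```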

Add target hive agent

HDInsight can use either a default metastore or a custom metastore in the form of an Azure SQL Database.

Choose one of the methods below depending on the type of metastore deployed on your HDInsight cluster.

Default metastore

note

For step 1, deploying a remote agent is only possible through the CLI.

  1. Deploy and configure a remote hive agent:

    hive agent add hive

    Use the automated deployment parameters or follow the steps for manual deployment.

    As mentioned in the prerequisites, specify a suitable edge node on your HDInsight cluster on which to deploy the hive agent service.

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent
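
An automated remote deployment might begin along these lines. Only the `hive agent add hive` command appears in this guide; the autodeploy flags and all values below are assumptions, so consult the CLI reference for the exact parameters in your version:

```
hive agent add hive --name azureAgent \
  --autodeploy-server hdi-edge.example.com \
  --autodeploy-ssh-user sshuser \
  --autodeploy-ssh-key /home/user/.ssh/id_rsa \
  --port 5552 \
  --file-system-id mytarget
```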

Custom metastore (Azure SQL Database)

  1. Configure a hive agent to connect to your Azure SQL Database.

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent
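
As a sketch, connecting an agent to a custom Azure SQL Database metastore might look like the following. The subcommand and flags are assumptions; substitute your own server, database, and credentials:

```
hive agent add azure sql --name azureAgent \
  --database-server-name example-sql-server.database.windows.net \
  --database-name hivemetastore \
  --database-user hiveuser \
  --database-password <password> \
  --file-system-id mytarget
```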

Next steps

Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.
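
In the CLI, those next steps might begin along the following lines. All names, patterns, and flags here are illustrative assumptions; see the exclusions, data migration, and metadata migration guides for the exact commands:

```
# Exclude temporary files from data migrations
exclusion add regex --name tmp-files --pattern ".*\.tmp$"

# Start a data migration from a source path to the ADLS Gen2 target
migration add --migration-id warehouse-migration \
  --path /apps/hive/warehouse --target mytarget

# Define a metadata rule and start a metadata migration
hive rule add --name all-databases --database-pattern "*" --table-pattern "*"
hive migration add --name metadata-migration \
  --source hiveAgent --target azureAgent --rule-names all-databases
```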