# On-premises Hadoop to Azure HDInsight
This outline covers the steps needed to prepare your environment for migrating data and metadata.
Time to complete: 1 hour (assuming all prerequisites are met).
## Prerequisites
### On-premises Hadoop cluster

Make sure all prerequisites are met for the source environment. This also includes:
- Network connectivity between your edge node and your Azure Data Lake Storage (ADLS) Gen2 storage container.
- If using an Azure SQL database as the metastore for your HDInsight cluster, network connectivity between your edge node and this database (a quick connectivity check is sketched below).
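For example, outbound connectivity can be verified from the edge node with a quick TCP check; the account and server names below are placeholders:

```
# Placeholders: <account> is your ADLS Gen2 storage account and
# <sqlserver> is your Azure SQL server. 443 and 1433 are the
# standard HTTPS and SQL Server ports.
nc -vz <account>.dfs.core.windows.net 443
nc -vz <sqlserver>.database.windows.net 1433
```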
### Azure HDInsight cluster

For your target environment, make sure the following prerequisites are met:
- Your HDInsight cluster uses Azure Data Lake Storage (ADLS) Gen2 as its primary storage type.
- If using a default metastore, SSH access to an edge node on the HDInsight cluster.

The edge node requires the following:
- Hadoop Distributed File System (HDFS) and Hive client libraries installed.
- A chosen port open for outbound connections (for example: 5552) to communicate with the LiveData Migrator service on the on-premises Hadoop edge node; a reachability check is sketched below.
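To confirm that the chosen port is reachable from the HDInsight edge node, a similar TCP check works; the hostname is a placeholder:

```
# Placeholder: <onprem-edge-node> is the host running the LiveData
# Migrator service. 5552 is the example port from above.
nc -vz <onprem-edge-node> 5552
```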
## Install LiveData Migrator on your Hadoop edge node

Download and install LiveData Migrator on your Hadoop edge node.
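For example, assuming the download provides a `livedata-migrator.sh` installer script (the filename is an assumption; use whatever the download page supplies):

```
# Illustrative: the installer filename may differ by version.
chmod +x livedata-migrator.sh
./livedata-migrator.sh
```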
## Configure for data migrations

### Add Hadoop Distributed File System (HDFS) as a source filesystem

If Kerberos is enabled on your Hadoop cluster, specify the Kerberos credentials for the HDFS superuser on your Hadoop cluster:
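A minimal sketch of supplying those credentials, assuming a `filesystem auto-discover-source hdfs` command that accepts Kerberos parameters; the flag names, principal, and keytab path are placeholders to verify against your LiveData Migrator CLI reference:

```
# Illustrative only: flag names and values are placeholders.
filesystem auto-discover-source hdfs \
  --kerberos-principal hdfs@REALM.COM \
  --kerberos-keytab /etc/security/keytabs/hdfs.headless.keytab
```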
(CLI only) Check that HDFS on your on-premises Hadoop cluster is set as your source filesystem:
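For example, assuming a `source fs show` command (verify the name against your CLI reference):

```
# Illustrative: prints the currently configured source filesystem.
source fs show
```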
If the filesystem shown is incorrect, delete it using `source del` and configure the source manually, making sure to include the `--source` parameter (see the sketch below).
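A sketch of the manual configuration, assuming the `filesystem add hdfs` command; the filesystem ID and NameNode URI are placeholders:

```
# Illustrative only: values are placeholders. The --source parameter
# marks this filesystem as the migration source, as noted above.
filesystem add hdfs \
  --file-system-id mysource \
  --default-fs hdfs://<namenode>:8020 \
  --source
```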
### Add Azure Data Lake Storage (ADLS) Gen2 storage as a target filesystem

Configure your ADLS Gen2 storage container as your target filesystem. The method chosen will depend on the authentication method:
- UI
- CLI - use one of the following commands (full examples are sketched after this list):
  - Using a service principal and OAuth 2 credentials: `filesystem add adls2 oauth`
  - Using access key credentials: `filesystem add adls2 sharedKey`
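A sketch of each CLI option; the flag names (`--file-system-id`, `--storage-account-name`, `--container-name`, and the OAuth 2 and shared-key flags) and all values are placeholders to verify against your LiveData Migrator CLI reference:

```
# Illustrative only: flag names and values are placeholders.
# Service principal and OAuth 2 credentials:
filesystem add adls2 oauth \
  --file-system-id mytarget \
  --storage-account-name <storage-account> \
  --container-name <container> \
  --oauth2-client-id <client-id> \
  --oauth2-client-secret <client-secret> \
  --oauth2-client-endpoint https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token

# Access key credentials:
filesystem add adls2 sharedKey \
  --file-system-id mytarget \
  --storage-account-name <storage-account> \
  --container-name <container> \
  --shared-key <access-key>
```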
### Create path mapping for default Hive warehouse directory

Create a path mapping to ensure that data for managed Hive databases and tables is migrated to the default Hive warehouse directory for HDInsight clusters.
This lets you start using your source data and metadata on your HDInsight cluster immediately after migration, as it will be referenced correctly by your target metastore.
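A sketch, assuming a `path mapping create` command; the mapping ID, flag names, and the HDP-style source path `/apps/hive/warehouse` are placeholders to verify against your CLI reference (the target path `/hive/warehouse` is the HDInsight default Hive warehouse directory):

```
# Illustrative only: command and flag names are placeholders.
path mapping create \
  --path-mapping-id hive-warehouse \
  --source-location /apps/hive/warehouse \
  --target mytarget \
  --target-location /hive/warehouse \
  --description "Default Hive warehouse directory"
```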
## Configure for metadata migrations

### Add source Hive agent

Configure the source Hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:
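A sketch, assuming a `hive agent add hive` command with Kerberos parameters; the agent name matches the check example below, while the flag names, principal, and keytab path are placeholders to verify against your CLI reference:

```
# Illustrative only: flag names and values are placeholders.
hive agent add hive \
  --name hiveAgent \
  --kerberos-principal hive/_HOST@REALM.COM \
  --kerberos-keytab /etc/security/keytabs/hive.service.keytab \
  --file-system-id mysource
```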
Check that the configuration for the Hive agent is correct:
- UI - the agent will show a healthy connection.
- CLI - for example: `hive agent check --name hiveAgent`
### Add target Hive agent

HDInsight can use either a default metastore or a custom metastore in the form of an Azure SQL database.
Choose one of the methods below depending on the type of metastore deployed in your HDInsight cluster.
#### Default metastore

> **Note:** For step 1, deploying a remote agent is only possible through the CLI.

1. Deploy and configure a remote Hive agent (a sketch follows this list). Use the automated deployment parameters or follow the steps for manual deployment. As mentioned in the prerequisites, you will need to specify a suitable edge node on your HDInsight cluster to deploy the Hive agent service.
2. Check that the configuration for the Hive agent is correct:
   - UI - the agent will show a healthy connection.
   - CLI - for example: `hive agent check --name azureAgent`
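A sketch of the automated deployment, assuming remote-deployment parameters such as `--autodeploy`, `--host`, `--port`, `--ssh-user`, and `--ssh-key` on `hive agent add hive`; every flag and value here is a placeholder to confirm against your CLI reference. The port should match the outbound port opened on the edge node in the prerequisites.

```
# Illustrative only: flag names and values are placeholders.
hive agent add hive \
  --name azureAgent \
  --autodeploy \
  --host <hdinsight-edge-node> \
  --port 5552 \
  --ssh-user sshuser \
  --ssh-key /path/to/ssh-private-key \
  --file-system-id mytarget
```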
#### Custom metastore (Azure SQL database)

Configure a Hive agent to connect to an Azure SQL database:
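A sketch, assuming a `hive agent add azure` command with database connection parameters; every flag and value is a placeholder to verify against your CLI reference:

```
# Illustrative only: command and flag names are placeholders.
hive agent add azure \
  --name azureAgent \
  --db-server-name <server>.database.windows.net \
  --database-name <hive-metastore-db> \
  --database-user <user> \
  --database-password <password> \
  --file-system-id mytarget
```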
Check that the configuration for the Hive agent is correct:
- UI - the agent will show a healthy connection.
- CLI - for example: `hive agent check --name azureAgent`
## Next steps

Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.
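For example, assuming `exclusion add regex` and `migration add` commands (names and flags are placeholders; check your CLI reference):

```
# Illustrative only: command and flag names are placeholders.
# Exclude temporary files, then start a migration that applies the exclusion.
exclusion add regex \
  --exclusion-id no-tmp \
  --description "Skip temporary files" \
  --regex '.*\.tmp$'

migration add \
  --path /data \
  --target mytarget \
  --exclusions no-tmp
```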