Connect to source and target metastores
Ready to migrate metadata? Hive Migrator, which comes bundled with Data Migrator, lets you transfer metadata from a source metastore to any number of target metastores. Connect to metastores by creating local or remote metadata agents. To create a remote metadata agent, use the WANdisco® CLI.
Supported metadata sources are: Apache Hive and AWS Glue Data Catalog.
Supported metadata targets are: Apache Hive, Azure SQL DB, AWS Glue Data Catalog, Databricks, Google Dataproc, and Snowflake.
To configure Databricks as a target, see Configure Databricks as a target.
To configure Google Dataproc as a target, see Configure Google Dataproc as a target.
To configure Snowflake as a target, see Configure Snowflake as a target.
Transactional tables may take longer to appear on the target cluster than expected. Hive Migrator uses a cautious approach to ensure data integrity. The following conditions must be met for table data to appear on the target:
- All corresponding data files are migrated.
- The table's transaction `writeId` is updated, confirming that all data files are on the target.
Hive Migrator uses migration gates to ensure data files are in place before meeting the second condition. Upcoming improvements to migration gates will change these conditions so that table migrations can proceed without the data migration needing to be live, reducing migration times.
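If you need to confirm the second condition manually, one option is to query the target's metastore database directly. A minimal sketch, assuming a PostgreSQL-backed Hive 3 metastore; the host, credentials, and table name are placeholders, and `NEXT_WRITE_ID` is the table the Hive 3 transactional metastore schema uses to track write IDs:

```bash
# Inspect the write ID the target metastore has recorded for a table.
# Host, user, database, and table name below are placeholders.
psql -h metastore-db-host -U hive -d metastore \
  -c "SELECT \"NWI_DATABASE\", \"NWI_TABLE\", \"NWI_NEXT\" FROM \"NEXT_WRITE_ID\" WHERE \"NWI_TABLE\" = 'my_table';"
```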
Connect to metastores with the UI
Apache Hive
Review the basic Prerequisites for Apache Hive before you begin.
A remote agent is a service deployed on a remote host that connects to Data Migrator. A remote agent must be deployed on the target cluster if the source and target run different major Hive versions.
When deploying a remote agent in an environment where Hive uses MySQL, the JDBC driver for MySQL must be copied into `/opt/wandisco/hivemigrator` and made executable on the remote server.
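For example (the driver jar name is illustrative; use the version installed on your cluster):

```bash
# Copy the MySQL JDBC driver into the Hive Migrator directory on the
# remote server and make it executable.
cp mysql-connector-java-8.0.28.jar /opt/wandisco/hivemigrator/
chmod +x /opt/wandisco/hivemigrator/mysql-connector-java-8.0.28.jar
```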
If you're connecting to a Hive metastore agent and using CDP 7.1.8, you need to create a symlink to `postgresql-jdbc.jar`. See Missing PostgreSQL driver for more information.
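A minimal sketch of such a symlink; both paths are assumptions, so check the linked article and your cluster's layout for the actual locations:

```bash
# Link the PostgreSQL JDBC driver to where Hive Migrator can load it.
# Source and destination paths are illustrative assumptions.
ln -s /usr/share/java/postgresql-jdbc.jar /opt/wandisco/hivemigrator/postgresql-jdbc.jar
```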
1. From the Dashboard, select a product under Products.
   Info: Data Migrator will attempt to auto-discover Apache Hive and create a metadata agent for your Hadoop source filesystem. Check whether an existing agent is listed under the Agents panel. Auto-discovery will fail if Kerberos is enabled.
2. Select Connect to Metastore.
3. Select the Filesystem in which the data associated with the metadata is held. For Hive agents, this will likely be the Hadoop Distributed File System (HDFS) that contains the data for your tables.
4. Enter a Display Name.
5. (Optional; required when using a local agent for a target filesystem) Enter a value for Configuration Path (see the configuration directory sketch below). The default path is used if this is left blank.
   Note: Leave this empty for a local, source Hive metastore agent. When your local agent is located on a Hive client node, Data Migrator autodetects the Hive configuration in /etc/hive/conf, so the parameter isn't required and shouldn't be configured. For a local agent for a target metastore, or when the Hive configuration isn't located in /etc/hive/conf, supply a path containing the hive-site.xml, core-site.xml, and hdfs-site.xml for that specific cluster.
6. (Optional) Enter Kerberos Configuration. Use the Hive service principal `hive/hostname@REALM` or a principal with similar permissions. The keytab must be readable by the user running the Hive Migrator process and contain the appropriate principal.
7. (Optional) Select Override JDBC Connection Properties to override the JDBC properties used to connect to the Hive metastore database. You'll need to enable this option to migrate transactional, managed tables on Hive 3+ on CDP Hadoop clusters.
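As referenced in the Configuration Path step, a sketch of assembling a configuration directory for a target-cluster agent; the hostname and directory paths are illustrative:

```bash
# Gather the target cluster's Hive and Hadoop client configuration into
# a single directory to use as the agent's Configuration Path.
mkdir -p /etc/wandisco/hivemigrator/target-cluster-conf
scp targetnode:/etc/hive/conf/hive-site.xml \
    targetnode:/etc/hadoop/conf/core-site.xml \
    targetnode:/etc/hadoop/conf/hdfs-site.xml \
    /etc/wandisco/hivemigrator/target-cluster-conf/
```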
8. Enter the following details for both source and target agents (hypothetical example values are shown below):
   - Connection URL: The JDBC URL for the database.
   - Connection Driver Name: The full class name of the JDBC driver. For example, `org.postgresql.Driver`.
   - Connection Username: The username for your metastore database.
   - Connection Password: The password for your metastore database.
   Info: If you're using MariaDB or MySQL, you need to manually add the JDBC driver to the classpath. See Manual JDBC driver configuration for more information.
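For illustration, hypothetical values for a PostgreSQL-backed metastore; the host, port, database name, and credentials are placeholders:

```
Connection URL:         jdbc:postgresql://metastore-db-host:5432/hive
Connection Driver Name: org.postgresql.Driver
Connection Username:    hive
Connection Password:    <your metastore database password>
```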
9. (Optional) Enter Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
10. Select Save.
After creating your agent, select a preferred operation mode to manage how metadata changes are detected. Do this before using the agent in any metadata migrations.
Azure SQL DB
The Azure SQL DB agent integrates directly with the external metastore of an HDInsight cluster. An HDInsight cluster can be spun up before or after the agent is created, and the metadata can be made available to it through the Azure SQL DB acting as its external metastore.
Add the IP address of the Data Migrator host as an Azure SQL Server firewall rule.
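For example, with the Azure CLI; the resource group, server name, and IP address are placeholders:

```bash
# Allow the Data Migrator host's IP through the Azure SQL Server firewall.
az sql server firewall-rule create \
  --resource-group myResourceGroup \
  --server mysqlserver \
  --name datamigrator-host \
  --start-ip-address 203.0.113.10 \
  --end-ip-address 203.0.113.10
```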
1. From the Dashboard, select a product under Products.
2. Select Connect to Metastore.
3. Select the Filesystem in which the data associated with the metadata is held. For Azure SQL agents, this will likely be an ADLS2 Container.
4. Select Azure SQL DB as the Metastore Type.
5. Enter a Display Name.
6. Enter the Azure SQL Server Name.
7. Enter the Azure SQL Database Name.
   Note: Hive Migrator doesn't support Azure SQL database names containing dashes (`-`).
8. Enter the ADLS Gen2 Storage Account Name and Container Name.
9. Select the Authentication Method.
   Note: If you're using the SQL Password authentication method, you'll need to reenter the SQL database password when updating this agent.
10. (Optional) Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
11. Select Save.
AWS Glue Data Catalog
AWS Glue Data Catalog allows a maximum of 100 objects per request.
When you're using it as a target, make the following change to avoid metadata migration failures due to hitting this limit:
- Add the property `hivemigrator.migrationBatchSize=100` to `/etc/wandisco/hivemigrator/application.properties`.
- Restart the Hive Migrator service with the command `service hivemigrator restart`.
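For example, on the Data Migrator host:

```bash
# Cap the metadata batch size at Glue's 100-object limit, then restart.
echo "hivemigrator.migrationBatchSize=100" >> /etc/wandisco/hivemigrator/application.properties
service hivemigrator restart
```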
1. From the Dashboard, select a product under Products.
2. Select Connect to Metastore.
3. Select the Filesystem in which the data associated with the metadata is held. For AWS Glue agents, this will likely be an S3 object store.
4. Select AWS Glue as the Metastore Type.
5. Enter a Display Name.
6. Select the AWS Catalog Credentials Provider.
   Note: If you're using the Access Key and Secret credentials provider, you'll need to reenter the access and secret keys when updating this agent.
7. Enter a Virtual Private Cloud or AWS Glue service endpoint.
8. Enter the AWS Region.
9. (Optional) Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
10. Select Save.
Connect to metastores with the CLI
To deploy remote metadata agents, you must connect to the CLI.
- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
- Filesystem
Apache Hive
Command | Action |
---|---|
hive agent add hive | Add a Hive agent for a local or remote Apache Hive Metastore |
hive agent configure hive | Change the configuration of an existing Hive agent for the Apache Hive Metastore |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
Azure SQL DB
Command | Action |
---|---|
hive agent add azure | Add a Hive agent for an Azure SQL connection |
hive agent configure azure | Change the configuration of an existing Hive agent for the Azure SQL database server |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
AWS Glue Data Catalog
Command | Action |
---|---|
hive agent add glue | Add a Hive agent for an AWS Glue Data Catalog |
hive agent configure glue | Change the configuration of an existing Hive agent for the AWS Glue Data Catalog |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
Filesystem
Command | Action |
---|---|
hive agent add filesystem | Add a Hive agent for a local filesystem |
hive agent configure filesystem | Change the configuration of an existing Hive agent for the local filesystem |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
Connect to metastores and deploy remote agents with the CLI
- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
Apache Hive
Follow these steps to deploy a remote Hive agent for Apache Hive:
1. On your local host, run the `hive agent add hive` command with the following parameters to configure your remote Hive agent:
   - `--host` The host where the remote Hive agent will be deployed.
   - `--port` The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
   - `--no-ssl` (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
   Example for remote Apache Hive deployment (automated):
   `hive agent add hive --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs`
   Example for remote Apache Hive deployment (manual):
   `hive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs`
   Replace `<example directory path>` with the path to a directory containing the `core-site.xml`, `hdfs-site.xml`, and `hive-site.xml` for the remote cluster.
2. Transfer the remote server installer to your remote host. Example of secure transfer from local to remote host:
   `scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~`
3. On your remote host, run the installer as the root (or sudo) user in silent mode:
   `./hivemigrator-remote-server-installer.sh -- --silent --config <example config string here>`
4. On your remote host, start the remote server service:
   `service hivemigrator-remote-server start`
If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).
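Once the remote server is running, you can confirm from the local host that the agent can reach the metastore; a sketch using the CLI's check command, assuming the agent name is passed with `--name`:

```
hive agent check --name targetautoAgent
```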
Azure
Follow these steps to deploy a remote Hive agent for Azure:
1. On your local host, run the `hive agent add azure` command with the following parameters to configure your remote Hive agent:
   - `--host` The host where the remote Hive agent will be deployed.
   - `--port` The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
   - `--no-ssl` (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
   Example for remote Azure SQL deployment with system-assigned managed identity (automated):
   `hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052`
   Example for remote Azure SQL deployment with user-assigned managed identity (manual):
   `hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --client-id b67f67ex-ampl-e2eb-bd6d-client9385id --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --host myRemoteHost.example.com --port 5052`
2. Transfer the remote server installer to your remote host (Azure VM, HDI cluster node). Example of secure transfer from local to remote host:
   `scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~`
3. On your remote host, run the installer as the root (or sudo) user in silent mode:
   `./hivemigrator-remote-server-installer.sh -- --silent`
4. On your remote host, start the remote server service:
   `service hivemigrator-remote-server start`
AWS Glue Data Catalog
Follow these steps to deploy a remote Hive agent for AWS Glue:
1. On your local host, run the `hive agent add glue` command with the following parameters to configure your remote Hive agent:
   - `--host` The host where the remote Hive agent will be deployed.
   - `--port` The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
   - `--no-ssl` (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
   Example for remote AWS Glue agent:
   `hive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 --file-system-id mys3bucket --host myRemoteHost.example.com --port 5052`
2. Transfer the remote server installer to your remote host (Amazon EC2 instance). Example of secure transfer from local to remote host:
   `scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~`
3. On your remote host, run the installer as the root (or sudo) user in silent mode:
   `./hivemigrator-remote-server-installer.sh -- --silent`
4. On your remote host, start the remote server service:
   `service hivemigrator-remote-server start`