Version: 1.22.0

Connect to source and target metastores

Ready to migrate metadata? Hive Migrator, which comes bundled with Data Migrator, lets you transfer metadata from a source metastore to any number of target metastores. Connect to metastores by creating local or remote metadata agents. To create a remote metadata agent, you must use the WANdisco® CLI.

info

You must configure a source or target filesystem before you can connect to a metastore.

Migrating transactional tables

Transactional tables may take longer to appear on the target cluster than expected. Hive Migrator uses a cautious approach to ensure data integrity. The following conditions must be met for table data to appear on the target:

  • All corresponding data files are migrated.
  • The table's transaction writeId is updated, confirming that all data files are on the target.

Hive Migrator uses migration gates to ensure data files are in place before the second condition is met. Upcoming improvements to migration gates will change these conditions so that table migrations can proceed without the data migration needing to be live, reducing migration times.

Connect to metastores with the UI

Apache Hive

Review the basic Prerequisites for Apache Hive before you begin.

Remote agent

A remote agent is a service deployed on a remote host that connects to Data Migrator. A remote agent must be deployed on the target cluster if the source and target run different major Hive versions.

When deploying a remote agent on an environment where Hive uses MySQL, the JDBC Driver for MySQL must be copied into /opt/wandisco/hivemigrator and made executable on the remote server.
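As a sketch of this step, assuming the Connector/J jar name and source location shown (both are placeholders for your environment):

```shell
# Hypothetical example: the jar file name and its source location vary by
# environment. Copy the MySQL JDBC driver into the Hive Migrator directory
# on the remote server and make it executable, as required above.
scp mysql-connector-java-8.0.33.jar myRemoteHost:/opt/wandisco/hivemigrator/
ssh myRemoteHost 'chmod +x /opt/wandisco/hivemigrator/mysql-connector-java-8.0.33.jar'
```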

CDP 7.1.8

If you’re connecting to a Hive metastore agent and using CDP 7.1.8 with PostgreSQL, you need to create a symlink to postgresql-jdbc.jar. See Missing PostgreSQL driver for more information.
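A minimal sketch of the symlink, assuming the driver lives in /usr/share/java (a common but not guaranteed location) and that the link should sit in the Hive Migrator directory:

```shell
# Assumed paths: verify where postgresql-jdbc.jar actually lives on your
# CDP 7.1.8 cluster before creating the link.
ln -s /usr/share/java/postgresql-jdbc.jar \
      /opt/wandisco/hivemigrator/postgresql-jdbc.jar
```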

  1. From the Dashboard, select a product under Products.

    info

    Data Migrator will attempt to auto-discover Apache Hive and create a metadata agent for your Hadoop source filesystem. Check whether an existing agent is listed under the Agents panel.

    Auto-discovery will fail if Kerberos is enabled.

  2. Select Connect To Metastore.

  3. Select the Filesystem in which the data associated with the metadata is held. For Hive agents, this will likely be the Hadoop Distributed File System (HDFS) which contains the data for your tables.

  4. Enter a Display Name.

  5. (Optional; required when using a local agent for a target filesystem) Enter a value for Configuration Path. The default path is used if this is left blank.

    note

    Leave this empty for a local, source Hive metastore agent. When your local agent is located on a Hive client node, Data Migrator autodetects the Hive configuration in /etc/hive/conf, so the parameter isn't required and shouldn't be configured.
    For a local agent for a target metastore, or when the Hive configuration is not located in /etc/hive/conf, supply a path containing the hive-site.xml, core-site.xml, and hdfs-site.xml for that specific cluster.

  6. (Optional) - Enter Kerberos Configuration. Use the Hive service principal hive/hostname@REALM or a principal of similar permission. The keytab must be readable by the user running the Hive Migrator process and contain the appropriate principal.
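To confirm the keytab requirements before saving, you might run something like the following on the Data Migrator host. The service user name and keytab path here are assumptions for illustration:

```shell
# List the principals in the keytab to confirm it contains hive/hostname@REALM
# (keytab path is an assumption; adjust for your cluster).
klist -kt /etc/security/keytabs/hive.service.keytab

# Confirm the file permissions allow the user running the Hive Migrator
# process to read the keytab.
ls -l /etc/security/keytabs/hive.service.keytab
```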

  7. (Optional) - Select Override JDBC Connection Properties to override the JDBC properties used to connect to the Hive metastore database. You'll need to enable this option for migrating transactional, managed tables on Hive 3+ on CDP Hadoop clusters.

    Enter the following details for both source and target agents:

    • Connection URL: The JDBC URL for the database.

    • Connection Driver Name: The full class of the JDBC driver. For example, org.postgresql.Driver.

    • Connection Username: The username for your metastore database.

    • Connection Password: The password for your metastore database.

      info

      If you're using MariaDB or MySQL, you need to manually add the JDBC driver to the classpath. See Manual JDBC driver configuration for more information.

  8. (Optional) - Enter Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  9. Select Save.

Preferred Operation Mode

After creating your agent, select a preferred operation mode to manage how metadata changes are detected. Select your preferred operation mode before using the agent with any metadata migrations.

Azure SQL DB

The Azure SQL DB agent integrates directly with the external metastore of an HDInsight cluster. An HDInsight cluster can be spun up before or after the agent is created; the metadata is made available to the cluster through the Azure SQL DB acting as its external metastore.

  1. Add the IP address of the Data Migrator host as an Azure SQL Server firewall rule.

  2. From the Dashboard, select a product under Products.

  3. Select Connect to Metastore.

  4. Select the Filesystem in which the data associated with the metadata is held. For Azure SQL agents, this will likely be an ADLS2 Container.

  5. Select Azure SQL DB as the Metastore Type.

  6. Enter a Display Name.

  7. Enter the Azure SQL Server Name.

  8. Enter the Azure SQL Database Name.

    note

    Hive Migrator doesn’t support Azure SQL database names containing blank spaces ( ) or hyphens (-).

  9. Enter the ADLS Gen2 Storage Account Name and Container Name.

  10. Select the Authentication Method.

    note

    If you're using the SQL Password authentication method, you’ll need to reenter the SQL database password when updating this agent.

  11. (Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  12. Select Save.

AWS Glue Data Catalog

info

AWS Glue Data Catalog allows a maximum of 100 objects per request.
When you're using it as a target, make the following change to avoid metadata migration failures due to hitting this limit:

  1. Add the property hivemigrator.migrationBatchSize=100 to /etc/wandisco/hivemigrator/application.properties.
  2. Restart the Hive Migrator service using the command: service hivemigrator restart.
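The two configuration steps above can be applied from a shell on the Data Migrator host, for example (run as root; property name, path, and service name are as given above):

```shell
# Append the AWS Glue batch size limit, then restart Hive Migrator so it
# takes effect.
echo 'hivemigrator.migrationBatchSize=100' >> /etc/wandisco/hivemigrator/application.properties
service hivemigrator restart
```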

  1. From the Dashboard, select a product under Products.

  2. Select Connect to Metastore.

  3. Select the Filesystem in which the data associated with the metadata is held. For AWS Glue agents, this will likely be an S3 object store.

  4. Select AWS Glue as the Metastore Type.

  5. Enter a Display Name.

  6. Select the AWS Catalog Credentials Provider.

    note

    If you're using the Access Key and Secret credentials provider, you’ll need to reenter the access and secret keys when updating this agent.

  7. Enter a Virtual Private Cloud or AWS Glue Service endpoint.

  8. Enter the AWS Region.

  9. (Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  10. Select Save.

Connect to metastores with the CLI

To deploy remote metadata agents, you must connect to the CLI.

  • Apache Hive
  • Azure SQL DB
  • AWS Glue Data Catalog
  • Filesystem

Apache Hive

| Command | Action |
| --- | --- |
| `hive agent add hive` | Add a Hive agent for a local or remote Apache Hive Metastore |
| `hive agent configure hive` | Change the configuration of an existing Hive agent for the Apache Hive Metastore |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |

Azure SQL DB

| Command | Action |
| --- | --- |
| `hive agent add azure` | Add a Hive agent for an Azure SQL connection |
| `hive agent configure azure` | Change the configuration of an existing Hive agent for the Azure SQL database server |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |

AWS Glue Data Catalog

| Command | Action |
| --- | --- |
| `hive agent add glue` | Add a Hive agent for an AWS Glue Data Catalog |
| `hive agent configure glue` | Change the configuration of an existing Hive agent for the AWS Glue Data Catalog |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |

Filesystem

| Command | Action |
| --- | --- |
| `hive agent add filesystem` | Add a Hive agent for a local filesystem |
| `hive agent configure filesystem` | Change the configuration of an existing Hive agent for the local filesystem |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
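As an illustration, a typical CLI session might combine these commands to add and then verify a local agent. The agent name, config path, and filesystem ID are placeholders; the `--name` flags on `add` match the deployment examples below, but whether `check` and `show` accept a `--name` flag is an assumption:

```shell
# Hypothetical WANdisco CLI session (all values are placeholders).
hive agent add hive --name sourceHiveAgent --config-path /etc/hive/conf --file-system-id mysourcehdfs
hive agent list
hive agent check --name sourceHiveAgent
hive agent show --name sourceHiveAgent
```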

Connect to metastores and deploy remote agents with the CLI

  • Apache Hive
  • Azure SQL DB
  • AWS Glue Data Catalog

Apache Hive

Follow these steps to deploy a remote Hive agent for Apache Hive:

  1. On your local host, run the hive agent add hive command with the following parameters to configure your remote Hive agent.

    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
    Example for remote Apache Hive deployment - automated
        hive agent add hive --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
    Example for remote Apache Hive deployment - manual
        hive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
  2. Transfer the remote server installer to your remote host:

    Example of secure transfer from local to remote host
         scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as the root user (or with sudo) in silent mode:

         ./hivemigrator-remote-server-installer.sh -- --silent --config <example config string here>
  4. On your remote host, start the remote server service:

         service hivemigrator-remote-server start
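After step 4, you can confirm the remote server is up before starting migrations. A minimal check, assuming port 5052 from the automated example above (exact service tooling depends on the remote OS):

```shell
# Confirm the remote agent service is running and listening.
service hivemigrator-remote-server status
ss -tlnp | grep 5052   # port taken from the example; adjust to your --port value
```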
info

If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).

Azure SQL DB

Follow these steps to deploy a remote Hive agent for Azure SQL DB:

  1. On your local host, run the hive agent add azure command with the following parameters to configure your remote Hive agent.

    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
    Example for remote Azure SQL deployment with System-assigned managed identity - automated
        hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052
    Example for remote Azure SQL deployment with User-assigned managed identity - manual
        hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --client-id b67f67ex-ampl-e2eb-bd6d-client9385id --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --host myRemoteHost.example.com --port 5052
  2. Transfer the remote server installer to your remote host (Azure VM, HDI cluster node):

    Example of secure transfer from local to remote host
         scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as the root user (or with sudo) in silent mode:

         ./hivemigrator-remote-server-installer.sh -- --silent
  4. On your remote host, start the remote server service:

         service hivemigrator-remote-server start
  5. On your local host, run the hive agent add hive command with the following parameters to configure your remote Hive agent.

    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
    Example for remote Apache Hive deployment - automated
     hive agent add hive --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5552 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path <example directory path> --file-system-id mytargethdfs

    Replace <example directory path> with the path to a directory containing the core-site.xml, hdfs-site.xml, and hive-site.xml.

    Example for remote Apache Hive deployment - manual
     hive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5552 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path <example directory path> --file-system-id mytargethdfs

    Replace <example directory path> with the path to a directory containing the core-site.xml, hdfs-site.xml, and hive-site.xml.

  6. Transfer the remote server installer to your remote host:

    Example of secure transfer from local to remote host
     scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  7. On your remote host, run the installer as the root user (or with sudo) in silent mode:

         ./hivemigrator-remote-server-installer.sh -- --silent
  8. On your remote host, start the remote server service:

         service hivemigrator-remote-server start
info

If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).

AWS Glue Data Catalog

Follow these steps to deploy a remote Hive agent for AWS Glue:

  1. On your local host, run the hive agent add glue command with the following parameters to configure your remote Hive agent.

    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
    Example for remote AWS Glue agent
        hive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 --file-system-id mys3bucket --host myRemoteHost.example.com --port 5052
  2. Transfer the remote server installer to your remote host (Amazon EC2 instance):

    Example of secure transfer from local to remote host
         scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as the root user (or with sudo) in silent mode:

         ./hivemigrator-remote-server-installer.sh -- --silent
  4. On your remote host, start the remote server service:

         service hivemigrator-remote-server start