Configure CDP target for Hive metadata migrations

Hive Migrator can migrate metadata to a Hive or other metastore service in Apache Hadoop, operating as part of a Hadoop deployment. Metadata migrations make the metadata from a source environment available in a target environment, where tools like Apache Hive can query the data.

Use the following options to set up appropriate security credentials and connection settings to perform metadata migrations successfully.

Supported metadata types:

  • AWS Glue Data Catalog
  • Azure SQL Database (used as Azure HDInsight's Hive-compatible metastore)
  • Databricks
  • Google Cloud Dataproc
  • Apache Hive metastore
  • Snowflake
note

Support for more metadata types will be added in future releases.

Metadata migration to a Hive metastore target

Data Migrator interacts with a metastore using an "agent". Agents hold the information needed to communicate with metastores and allow metadata migrations to be defined in Data Migrator. The migrations don't need to know anything about the specific type of the underlying metastore or how to communicate with it.

Deploy each agent locally or remotely. A local agent runs on the same host as Hive Migrator. A remote agent, currently available for Hive and Google Cloud Dataproc, runs as a separate service and can be deployed on a separate host that isn't running Data Migrator.

Deployment with local agents

[Diagram: deployment with a local agent]

Deployment with remote agents

[Diagram: deployment with remote agents]

Remote agents let you migrate metadata between different versions of Apache Hive. They also give you complete control over the network communication between source and target environments instead of relying on the network interfaces directly exposed by your metadata target.

For a list of agent types available for each supported platform, see Supported metadata agents.

Configure Hive agents

Configure agents through your preferred interface: UI, Data Migrator CLI, or the REST API.

Provide the following information for each agent:

  • Agent names
  • Path to the Hadoop files that contain Hive and Hadoop configuration properties
  • The file system location that holds content referenced by your metadata, for example, an HDFS instance
  • [Optional] Kerberos credentials in the form of principal and keytab information

The Apache Thrift API and its underlying database allow migration of metadata for Hive ACID transactional tables.

The Hive configuration file hive-site.xml provides the information agents need to communicate with the Hive metastore. In a Cloudera Data Platform deployment, it's found at /etc/hive/conf/hive-site.xml.
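
To confirm the file exposes the connection details an agent needs, you can list the relevant properties. This is a minimal sketch assuming the default CDP path; the properties shown are the standard Hive metastore connection settings:

```
# List the metastore connection settings in hive-site.xml (default CDP path assumed).
grep -A1 -E 'javax.jdo.option.Connection(URL|DriverName|UserName)|hive.metastore.uris' \
  /etc/hive/conf/hive-site.xml
```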

Runtime environment needed by Hive agents

Hive agents need access to the JDBC driver/connector used to communicate with the Hive metastore's underlying database. For example, a Cloudera Data Platform environment typically uses MySQL or Postgres databases.

  1. Download the appropriate JDBC driver/connector JAR file (MySQL or Postgres).

    note

    You only need the JAR file. If you're asked to specify an OS, choose Platform Independent. If the download is provided as an archive, extract the JAR file from the archive.

  2. Copy the JDBC driver JAR to /opt/wandisco/hivemigrator/agent/hive on your metadata agent host machine.

  3. Set the ownership of the file to the Hive Migrator system user and group for your Hive Migrator instance.

    Example
    chown hive:hadoop mysql*
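
If you prefer to script these steps, the sketch below downloads the MySQL Connector/J archive, extracts the JAR, and stages it for a local agent. The download URL, version number, and hive:hadoop ownership are assumptions; substitute the values for your environment:

```
# Sketch: stage the MySQL JDBC driver for a local Hive agent.
# The version, URL, and ownership below are assumptions; adjust for your environment.
cd /tmp
curl -LO https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar -xzf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar /opt/wandisco/hivemigrator/agent/hive/
chown hive:hadoop /opt/wandisco/hivemigrator/agent/hive/mysql-connector-j-8.0.33.jar
```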

Temporarily turn off redaction

If you don't know the password to connect to your Hive metastore, get it from your Hadoop environment. If your Hadoop platform redacts the password, follow the steps below on the node running your cluster manager:

  1. Open /etc/default/cloudera-scm-server in a text editor.

  2. Edit the following line, changing "true" to "false". For example:

    $ export CMF_JAVA_OPTS="$CMF_JAVA_OPTS -Dcom.cloudera.api.redaction=false"
  3. Save the change.

  4. Restart the manager using the command:

    $ sudo service cloudera-scm-server restart

    This may take a few minutes.

  5. The hive_metastore_database_password value is no longer redacted. View the Hive service configuration to confirm. For example:

    ```
    },{
    "name" : "hive_metastore_database_password",
    "value" : "The-true-password-value",
    "sensitive" : true
    }, {
    ```
    info

    Re-enable redaction after you confirm your system's credentials.
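
One way to view the unredacted value is through the Cloudera Manager REST API. The sketch below is illustrative only: the host, port, credentials, cluster name, and API version are placeholders for your own deployment:

```
# Sketch: read the Hive service configuration through the Cloudera Manager API.
# Host, port, credentials, cluster name, and API version are placeholders.
curl -s -u admin:admin \
  'http://cm.example.com:7180/api/v41/clusters/Cluster1/services/hive/config?view=full' \
  | grep -A3 '"name" : "hive_metastore_database_password"'
```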

Configure the CDP target with the UI

To configure a CDP target, when adding Metastore Database Credentials in the Agents tab in the UI, select the Override JDBC Connection Properties checkbox to enable the following fields:

  • Connection URL - JDBC URL for the database
  • Connection Driver Name - Full class name of JDBC driver
  • Connection Username - The username for your metastore database
  • Connection Password - The password for your metastore database
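
For reference, the fields might look like the following for a MySQL-backed metastore. The hostname, database name, and username are assumptions; use the values for your own environment:

```
# Example values (assumptions for a MySQL-backed metastore database named "metastore"):
Connection URL:          jdbc:mysql://db.example.com:3306/metastore
Connection Driver Name:  com.mysql.cj.jdbc.Driver
Connection Username:     hive
Connection Password:     <your metastore password>
```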

Additional steps

Use the JDBC driver and Hadoop configuration information to create an agent instance:

  1. Save the changes to the hive-site.xml file.

  2. Go to the Hive Connection screen of the target agent. Enter the path to the alternative hive-site.xml file in the Override Default Hadoop Configuration Path field (see the sketch after these steps).

  3. Select Save.

  4. Restart the Hive service in Cloudera Manager.
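
If you maintain an alternative hive-site.xml rather than editing the cluster's copy in place, one approach is to keep it in a dedicated directory and point the override path at it. The directory below is an assumption, not a required location:

```
# Sketch: keep an agent-specific copy of hive-site.xml (directory name is an assumption).
mkdir -p /etc/wandisco/hivemigrator/conf
cp /etc/hive/conf/hive-site.xml /etc/wandisco/hivemigrator/conf/
# Edit the copy, then set Override Default Hadoop Configuration Path to
# /etc/wandisco/hivemigrator/conf in the UI.
```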

Deploy the remote agent from source (MySQL)

  1. On the source cluster, open Data Migrator through the CLI.

  2. Create a new remote agent using the command below (the options are summarized after these steps):

    hive agent add hive --autodeploy --file-system-id fsId --host example.host.name --port 5052 --ignore-host-checking --name remoteAgent --ssh-key /root/.ssh/id_rsa --ssh-user root
  3. Navigate to the Hive Migrator directory:

    cd /opt/wandisco/hivemigrator-remote-server
  4. Copy the appropriate JAR files into this directory:

    cp /usr/share/java/mysql* .
  5. Set the ownership of the files to the Hive Migrator system user and group. For example:

    chown hive:hadoop mysql*
  6. After a few seconds, the agent appears as healthy in the UI. Check its status with the following command:

    hive agent check --name remoteAgent
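
For reference, the options in the step 2 deploy command break down as follows. These notes interpret the flags shown above; consult the Data Migrator CLI reference for authoritative descriptions:

```
# Options in the deploy command (explanatory notes only; values are placeholders):
#   --autodeploy            deploy the remote agent to the host over SSH
#   --file-system-id fsId   target filesystem already defined in Data Migrator
#   --host / --port         where the remote agent service runs and listens
#   --ignore-host-checking  skip strict SSH host-key checking during deployment
#   --name remoteAgent      name this agent is registered under in Data Migrator
#   --ssh-key / --ssh-user  SSH credentials used for the automated deployment
```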

Deploy the remote agent from source (Postgres)

  1. On the source cluster, open Data Migrator through the CLI.

  2. Create a new remote agent with the following command:

    hive agent add hive --autodeploy --file-system-id fsId --host example.host.name --port 5052 --ignore-host-checking --name remoteAgent --ssh-key /root/.ssh/id_rsa --ssh-user root
  3. After a few seconds, the agent appears as healthy in the UI. Check its status with the following command:

    hive agent check --name remoteAgent

Create a metadata migration

With your agent configured and healthy, you're ready to create a metadata migration.