Configure CDP target for Hive metadata migrations
Hive Migrator can migrate metadata to a Hive metastore or other metastore service operating as part of an Apache Hadoop deployment. Metadata migrations make the metadata from a source environment available in a target environment, where tools like Apache Hive can query the data.
Use the following options to set up appropriate security credentials and connection settings to perform metadata migrations successfully.
Supported metadata types:
- AWS Glue Data Catalog
- Azure SQL Database (used as Azure HDInsight's Hive-compatible metastore)
- Databricks
- Google Cloud Dataproc
- Apache Hive metastore
- Snowflake
Support for more metadata types will be added in future releases.
Metadata migration to a Hive metastore target
Data Migrator interacts with a metastore using an "agent". Agents hold the information needed to communicate with metastores and allow metadata migrations to be defined in Data Migrator. The migrations don't need to know anything about the specific type of the underlying metastore or how to communicate with it.
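Once agents are defined, you can review them from the Data Migrator CLI. A minimal sketch, assuming the `hive agent list` and `hive agent show` commands from the same `hive agent` command family used later on this page:

```
# List all configured metadata agents (command names assumed; check your CLI's help).
hive agent list
# Show the configuration of a single agent by name ("remoteAgent" is a hypothetical name).
hive agent show --name remoteAgent
```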
Deploy each agent locally or remotely. Deploy a local agent on the same host as Hive Migrator. A remote agent, currently available for Hive and Google Cloud Dataproc, runs as a separate service and can be deployed on a separate host that isn't running Data Migrator.
Deployment with local agents
Deployment with remote agents
Remote agents let you migrate metadata between different versions of Apache Hive. They also give you complete control over the network communication between source and target environments instead of relying on the network interfaces directly exposed by your metadata target.
For a list of agent types available for each supported platform, see Supported metadata agents.
Configure Hive agents
Configure agents through your preferred interface: UI, Data Migrator CLI, or the REST API.
Whichever interface you use, provide the following information:
- Agent names
- Path to the Hadoop files that contain Hive and Hadoop configuration properties
- File system location information that holds content referenced by your metadata, for example, an HDFS instance
- [Optional] Kerberos credentials in the form of principal and keytab information
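For example, a minimal sketch of registering a local Hive agent with the CLI. The agent and filesystem names are hypothetical, and only flags that appear elsewhere on this page are used; Kerberos credentials are supplied through the command's additional options (see `hive agent add hive` in the Command reference):

```
# Register a local Hive agent ("localHiveAgent" and "myHdfs" are hypothetical names).
hive agent add hive --name localHiveAgent --file-system-id myHdfs
```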
The Apache Thrift API and its underlying database allow migration of metadata for Hive ACID transactional tables.
The main Hadoop configuration file, hive-site.xml, provides the information required for agents to communicate with the Hive metastore. In a Cloudera Data Platform deployment, it's found at /etc/hive/conf/hive-site.xml.
Runtime environment needed by Hive agents
Hive agents need access to the JDBC driver/connector used to communicate with the Hive metastore's underlying database. For example, a Cloudera Data Platform environment typically uses MySQL or Postgres databases.
Download the appropriate JDBC driver/connector JAR file for MySQL or PostgreSQL.
Note: You only need the JAR file. If you're asked to specify an OS, choose platform independent. If the download is provided as an archive, extract the JAR file from the archive.
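For example, one way to fetch a driver JAR directly from Maven Central (the version numbers here are illustrative; use a version compatible with your metastore database):

```
# MySQL Connector/J from Maven Central (illustrative version):
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar
# PostgreSQL JDBC driver from Maven Central (illustrative version):
wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar
```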
Copy the JDBC driver JAR to /opt/wandisco/hivemigrator/agent/hive on your metadata agent host machine, then set the ownership of the file to the Hive Migrator system user and group for your Hive Migrator instance. For example:
chown hive:hadoop mysql*
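Taken together, the copy and ownership steps look like this (the mysql-connector-java file name and hive:hadoop ownership follow the examples above; substitute your own driver file and system user):

```
# Place the driver where the agent can load it.
cp mysql-connector-java-*.jar /opt/wandisco/hivemigrator/agent/hive/
# Hand ownership to the Hive Migrator system user and group.
chown hive:hadoop /opt/wandisco/hivemigrator/agent/hive/mysql-connector-java-*.jar
```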
Temporarily turn off redaction
If you don't know the password to connect to your Hive metastore, get it from your Hadoop environment. If your Hadoop platform redacts the password, follow the steps below on the node running your cluster manager:
Open /etc/default/cloudera-scm-server in a text editor. Edit the following line, changing "true" to "false". For example:
$ export CMF_JAVA_OPTS="$CMF_JAVA_OPTS -Dcom.cloudera.api.redaction=false"
Save the change.
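If you prefer to script the edit, a minimal sketch using sed (this assumes the redaction flag is already present in the file and set to true):

```
# Flip the redaction flag from true to false in place.
sudo sed -i 's/-Dcom.cloudera.api.redaction=true/-Dcom.cloudera.api.redaction=false/' /etc/default/cloudera-scm-server
```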
Restart the manager using the command:
$ sudo service cloudera-scm-server restart
This may take a few minutes.
The hive_metastore_database_password will no longer be redacted. Open and view the file to confirm. For example:

```
},{
"name" : "hive_metastore_database_password",
"value" : "The-true-password-value",
"sensitive" : true
}, {
```

Info: Enable redaction again after you confirm your system's credentials.
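To re-enable redaction, reverse the earlier edit and restart the manager again. A sketch, assuming you used the flag value shown above:

```
# Flip the redaction flag back to true and restart Cloudera Manager.
sudo sed -i 's/-Dcom.cloudera.api.redaction=false/-Dcom.cloudera.api.redaction=true/' /etc/default/cloudera-scm-server
sudo service cloudera-scm-server restart
```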
Configure the CDP target with the UI
To configure a CDP target, when adding Metastore Database Credentials in the Agents tab in the UI, select the Override JDBC Connection Properties checkbox to enable the following fields:
- Connection URL - JDBC URL for the database
- Connection Driver Name - Full class name of JDBC driver
- Connection Username - The username for your metastore database
- Connection Password - The password for your metastore database
Configure the CDP target with the CLI
Get the required JDBC configuration using the following API call:
{clusterHost}:{clusterPort}/api/v19/clusters/{clusterName}/services/{serviceName}/config
For example:
abcd01-vm0.domain.name.com:7180/api/v19/clusters/ABCD-01/services/hive1/config
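For example, you can query this endpoint with curl and filter for the metastore password. The admin credentials below are placeholders; the JSON fragment shown earlier on this page is part of this response:

```
# Query the Cloudera Manager API for the Hive service configuration
# (replace admin:admin-password with your Cloudera Manager credentials).
curl -s -u admin:admin-password \
  "http://abcd01-vm0.domain.name.com:7180/api/v19/clusters/ABCD-01/services/hive1/config" \
  | grep -A 2 hive_metastore_database_password
```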
Open the /etc/hive/conf/hive-site.xml file with a text editor and save a copy as /etc/hive/alt-conf/hive-site.xml.
Note: Saving the changes to the existing hive-site.xml file will work, but those changes will be lost if the Hive service is restarted.
Add the following configuration to the new file.
| Parameter | Description |
|---|---|
| --jdbc-url | JDBC URL for the database |
| --jdbc-driver-name | Full class name of the JDBC driver |
| --jdbc-username | Username for connecting to the database |
| --jdbc-password | Password for connecting to the database |
For example:
hive agent add hive --name targetautoAgent5 --host test.wandisco.com --port 5052 --no-ssl --jdbc-url jdbc:postgresql://test.bdauto.wandisco.com:7432/hive --jdbc-driver-name org.postgresql.Driver --jdbc-username admin --jdbc-password *** --file-system-id 'testfs'
The correct formatting/syntax for these values:

```
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:{database type}://{Host}:{Port used by database type}/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>See Guidance for javax.jdo.option.ConnectionDriverName, below</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>{hive_metastore_database_password}</value>
</property>
```
For more information, see Command reference.
Guidance for javax.jdo.option.ConnectionDriverName
If you're using a MySQL database:
- ConnectionDriverName: com.mysql.jdbc.Driver
- Port: 3306
- Database type: mysql

If you're using a Postgres database:
- ConnectionDriverName: org.postgresql.Driver
- Port: 7432
- Database type: postgresql
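Putting the guidance together, a completed example for a MySQL-backed metastore (the host name is a placeholder, and the password value follows the earlier API example):

```
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.example.com:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>The-true-password-value</value>
</property>
```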
Save the changes to the file.
Go to the Hive Connection screen of the target agent. Enter the path to the alternative hive-site.xml file into Override Default Hadoop Configuration Path, then select Save.
Restart the Hive service in Cloudera Manager.
Additional steps
Use the JDBC driver and Hadoop configuration information to create an agent instance:
Deploy the remote agent from source (MySQL)
On the source cluster, open Data Migrator through the CLI.
Create a new remote agent using the command below:
hive agent add hive --autodeploy --file-system-id fsId --host example.host.name --port 5052 --ignore-host-checking --name remoteAgent --ssh-key /root/.ssh/id_rsa --ssh-user root
Navigate to the Hive Migrator directory:
cd /opt/wandisco/hivemigrator-remote-server
Run the following command to copy the appropriate JAR files into this directory:
cp /usr/share/java/mysql* .
Set the appropriate permissions with the following command:
chown {user:group} mysql*
After a few seconds, the agent will appear healthy in the UI. Check its status with the following command:
hive agent check --name remoteAgent
Deploy the remote agent from source (Postgres)
On the source cluster, open Data Migrator through the CLI.
Create a new remote agent with the following command:
hive agent add hive --autodeploy --file-system-id fsId --host example.host.name --port 5052 --ignore-host-checking --name remoteAgent --ssh-key /root/.ssh/id_rsa --ssh-user root
After a few seconds, the agent will appear healthy in the UI. Check its status with the following command:
hive agent check --name remoteAgent