# Connect metastores
Ready to migrate metadata? Hive Migrator, which comes bundled with LiveData Migrator, lets you transfer metadata from a source metastore to target metastores. Connect to metastores by creating local or remote metadata agents.
- Supported metadata sources: Apache Hive and AWS Glue Data Catalog.
- Supported metadata targets: Apache Hive, Azure SQL DB, AWS Glue Data Catalog, Databricks, Google Cloud Dataproc, and Snowflake.
To configure Snowflake as a target, see Configure Snowflake as a target.
## Connect to metastores with the UI

- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
- Databricks
- Google Cloud Dataproc
### Remote agent
A remote agent is a service deployed on a remote host that connects to LiveData Migrator. A remote agent must be deployed on the target cluster if:
- The source and target run different major Hive versions.
- Transactional tables are migrated.
When deploying a remote agent in an environment where Hive uses MySQL, copy the JDBC driver for MySQL into `/opt/wandisco/hivemigrator` on the remote server and make it executable (see the sketch below).
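A minimal sketch of this step; the source path is a placeholder, so use wherever your MySQL JDBC driver jar actually lives:

```bash
# Copy the MySQL JDBC driver into the Hive Migrator directory and make it executable.
cp /usr/share/java/mysql-connector-java.jar /opt/wandisco/hivemigrator/
chmod +x /opt/wandisco/hivemigrator/mysql-connector-java.jar
```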
From the Dashboard, select a product under Products.
info
LiveData Migrator will attempt to auto-discover Apache Hive and create a metadata agent for your Hadoop source filesystem. Check whether an existing agent is listed under the Agents panel.
Auto-discovery will fail if Kerberos is enabled. In that case, select the existing agent to configure it and provide the Kerberos credentials.
Select Connect To Metastore.
Select the Filesystem in which the data associated with the metadata is held. For Hive agents, this will likely be the Hadoop Distributed File System (HDFS), which contains the data for your tables.
note
If using a local Hive agent for a target filesystem, then hive-site.xml must be copied from the target cluster to the local cluster into a location specified by the Override Default Hive Configuration Path. Alternatively, a remote agent can be used for the target filesystem.
Select Hive as the Metastore Type.
Enter a Display Name.
(Optional) - Enter a value for Configuration Path. The default path will be used if left blank.
note
This should be the path to a directory containing the core-site.xml, hdfs-site.xml, and hive-site.xml. Required when using a local agent for a target filesystem.
(Optional) - Enter Kerberos Configuration. Use the Hive service principal (`hive/hostname@REALM`) or a principal with similar permissions. The keytab must be readable by the user running the Hive Migrator process and contain the appropriate principal (see the keytab check after these steps).
(Optional) - Enter DefaultFS Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
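If Kerberos is enabled, it can help to verify the keytab before saving. A minimal check, assuming the MIT Kerberos tools are installed and using the keytab path and `hive` service user from the examples on this page:

```bash
# List the principals in the keytab; the Hive service principal should appear.
klist -kt /etc/security/keytabs/hive.service.keytab

# Confirm the user running the Hive Migrator process can read the keytab.
sudo -u hive test -r /etc/security/keytabs/hive.service.keytab && echo "keytab readable"
```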
The Azure SQL DB agent integrates directly with the external metastore of an HDInsight cluster. An HDI cluster can be spun up before or after the agent is created, and the metadata can be made available to it through the Azure SQL DB acting as its external metastore.
Add the IP address of the LiveData Migrator host as an Azure SQL Server firewall rule, for example with the Azure CLI as sketched below.
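One way to add the rule is with the Azure CLI; in this sketch the resource group, server name, and IP address are placeholders:

```bash
# Allow the LiveData Migrator host (203.0.113.10 here) through the SQL Server firewall.
az sql server firewall-rule create \
  --resource-group myResourceGroup \
  --server mysqlserver \
  --name AllowLiveDataMigrator \
  --start-ip-address 203.0.113.10 \
  --end-ip-address 203.0.113.10
```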
From the Dashboard, select a product under Products.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held. For Azure SQL agents, this will likely be an ADLS2 Container.
Select Azure SQL DB as the Metastore Type.
Enter a Display Name.
Enter the Azure SQL Server Name.
Enter the Azure SQL Database Name.
Enter the ADLS Gen2 Storage Account Name and Container Name.
Select the Authentication Method.
Select the HDI version. If an HDI cluster is already using this Azure SQL DB as an external metastore, choose the version of your HDI cluster. If this is a new Azure SQL DB, or an HDI cluster has not run against this metastore before, choose the version of the HDI cluster that will be used in the future (in most cases this will be the latest version).
(Optional) - Specify DefaultFS Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
From the Dashboard, select a product under Products.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held. For AWS Glue agents, this will likely be an S3 object store.
Select AWS Glue as the Metastore Type.
Enter a Display Name.
Select the AWS Catalog Credentials Provider.
Enter the AWS Glue Service Endpoint.
Enter the AWS Region.
(Optional) - Specify DefaultFS Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
note
Databricks agents are currently available as a preview feature.
tip
LiveData Migrator provides an option to convert tables to Delta Lake format.
- Supported table formats: CSV, JSON, AVRO, ORC, PARQUET, and Text.
Download the Databricks JDBC driver.
note
Version 2.6.25 of the Databricks JDBC driver isn't compatible with Hive Migrator. If you download it, you'll receive an error message stating that you haven't downloaded a driver. Download version 2.6.22 to create your agent.
Unzip the package and upload the `SparkJDBC42.jar` file to the LiveData Migrator host machine.
Move the `SparkJDBC42.jar` file to the following LiveData Migrator directory:

```
/opt/wandisco/hivemigrator/agent/databricks
```
Change ownership of the jar file to the Hive Migrator system user and group. Example for hive:hadoop:

```bash
chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/SparkJDBC42.jar
```
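Taken together, a minimal sketch of the driver installation, assuming a hypothetical archive name and host, and the hive:hadoop user and group from the example above:

```bash
# Hypothetical archive name; use the actual file downloaded from Databricks.
unzip SimbaSparkJDBC-2.6.22.zip

# Copy the jar to the LiveData Migrator host, then move it into place and set ownership.
scp SparkJDBC42.jar myLdmHost:/tmp/
ssh myLdmHost 'sudo mv /tmp/SparkJDBC42.jar /opt/wandisco/hivemigrator/agent/databricks/ \
  && sudo chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/SparkJDBC42.jar'
```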
From the Dashboard, select a product under Products.
Select Connect To Metastore.
Select the Filesystem in which the data associated with the metadata is held.
Select Databricks as the Metastore Type.
Enter a Display Name.
Enter the JDBC Server Hostname, Port and HTTP Path.
Enter the Databricks Access Token.
LiveData Migrator can convert your tables to Delta Lake format. Select the Convert to Delta Lake option if you want this conversion.
Enter the FS Mount Point. The filesystem containing your data must be mounted on your DBFS, and the value of this parameter must be the DBFS path of the mounted container.
(Optional) - Enter another filesystem path for DefaultFs Override to override the default filesystem.
- If Convert to Delta Lake is enabled, choose the location on the DBFS to store the Delta-converted tables. To store Delta tables on cloud storage, provide the path to the mount point and the path on the cloud storage.

  Example: location on the DBFS to store Delta-converted tables

  ```
  dbfs:<location>
  ```

  Example: store Delta tables on cloud storage

  ```
  dbfs:/mnt/adls2/storage_account/
  ```

- If Convert to Delta Lake is disabled, leave this field blank or set it to the value of FS Mount Point.

  Example: FS Mount Point

  ```
  dbfs:<value of FS Mount Point>
  ```
Select Save.
From the Dashboard, select a product under Products.
Select Connect To Metastore.
Select the Filesystem in which the data associated with the metadata is held. For Dataproc agents, this will likely be a Google Cloud Storage Bucket.
Select Google Cloud Dataproc as the Metastore Type.
Enter a Display Name.
Enter the Hostname or IP Address of the cluster edge node.
Enter the Port for use between the Hive Migrator service and the Dataproc server.
note

The port defaults to 5052. To use a different port, update the `port` property in `/etc/wandisco/hivemigrator-remote-server/agent.yaml` and restart the Hive Migrator remote agent:

```bash
service hivemigrator-remote-server restart
```
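For example, a minimal sketch of switching to a hypothetical port 5053, assuming a top-level `port:` entry in agent.yaml:

```bash
# Rewrite the port property in place, then restart the remote agent to apply it.
sudo sed -i 's/^port: .*/port: 5053/' /etc/wandisco/hivemigrator-remote-server/agent.yaml
sudo service hivemigrator-remote-server restart
```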
Choose whether to Use TLS.
(Optional) - Enter Kerberos Configuration. Use the principal assigned to the Dataproc cluster.
(Optional) - Enter DefaultFS Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
Follow the instructions that now appear at the top of the page.
- Download the installer to the Dataproc cluster virtual machine. For example:

  ```bash
  curl -O https://staging-ldmrepo.wandisco.com/hivemigrator-remote-server-installer.sh
  ```
Run the installation command with the custom string provided when you created the agent.
Start the Hive Migrator service:

```bash
service hivemigrator-remote-server start
```
Select Check connection to attempt a connection to the metastore with the details you provided.
If LiveData Migrator can connect to the remote agent, select Save.
## Connect to metastores with the CLI

Connect to the LiveData Migrator CLI.

- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
- Databricks
- Google Cloud Dataproc
- Filesystem
- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
- Databricks
- Google Cloud Dataproc
- Filesystem
| Command | Action |
| --- | --- |
| `hive agent add hive` | Add a Hive agent for a local or remote Apache Hive Metastore |
| `hive agent configure hive` | Change the configuration of an existing Hive agent for the Apache Hive Metastore |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
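For example, a minimal local Hive agent. This is a sketch: the agent name and configuration path are placeholders, the `--config-path` and `--file-system-id` parameters follow the remote examples later on this page, and a local agent omits the `--host` and `--port` parameters used for remote agents.

```bash
# Add a Hive agent for the local Apache Hive Metastore (placeholder name and paths).
hive agent add hive --name sourceHiveAgent --config-path /etc/hadoop/conf --file-system-id mysourcehdfs
```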
| Command | Action |
| --- | --- |
| `hive agent add azure` | Add a Hive agent for an Azure SQL connection |
| `hive agent configure azure` | Change the configuration of an existing Hive agent for the Azure SQL database server |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
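As a sketch, the same pattern for a local Azure SQL DB agent, reusing the parameters from the remote deployment example later on this page (server, database, storage account, and container names are placeholders):

```bash
hive agent add azure --name azureAgent --db-server-name mysqlserver.database.windows.net \
  --database-name mydb1 --auth-method AD_MSI --storage-account myadls2 \
  --container-name mycontainer --hdi-version 4.0 --file-system-id myadls2storage
```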
| Command | Action |
| --- | --- |
| `hive agent add glue` | Add a Hive agent for an AWS Glue Data Catalog |
| `hive agent configure glue` | Change the configuration of an existing Hive agent for the AWS Glue Data Catalog |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
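For instance, reusing the parameters from the remote AWS Glue example later on this page (the keys, endpoint, and region are placeholders):

```bash
hive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY \
  --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY \
  --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 \
  --file-system-id mys3bucket
```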
note
Databricks agents are currently available as a preview feature.
- Supported table formats: CSV, JSON, AVRO, ORC, PARQUET, and Text.
| Command | Action |
| --- | --- |
| `hive agent add databricks` | Add a Hive agent for a Databricks Delta Lake Metastore |
| `hive agent configure databricks` | Change the configuration of an existing Hive agent for the Databricks Delta Lake Metastore |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
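For instance, reusing the parameters from the remote Databricks example later on this page (the hostname, HTTP path, and access token are placeholders):

```bash
hive agent add databricks --name databricksAgent \
  --jdbc-server-hostname mydbcluster.cloud.databricks.com --jdbc-port 443 \
  --jdbc-http-path sql/protocolv1/o/8445611123456789/0234-125567-testy978 \
  --access-token daexamplefg123456789t6f0b57dfdtoken4 --file-system-id mys3bucket \
  --default-fs-override dbfs: --fs-mount-point /mnt/mybucket --convert-to-delta
```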
| Command | Action |
| --- | --- |
| `hive agent add dataproc` | Add a Hive agent for a Google Cloud Dataproc Metastore |
| `hive agent configure dataproc` | Change the configuration of an existing Hive agent for the Google Cloud Dataproc Metastore |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
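For instance, the manual deployment example from later on this page (the hostname and config path are placeholders):

```bash
hive agent add dataproc --name targetmanualAgent --host myRemoteHost.example.com --port 5052 \
  --config-path <example directory path> --file-system-id mytargethdfs
```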
| Command | Action |
| --- | --- |
| `hive agent add filesystem` | Add a Hive agent for a local filesystem |
| `hive agent configure filesystem` | Change the configuration of an existing Hive agent for the local filesystem |
| `hive agent check` | Check whether the Hive agent can connect to the Metastore |
| `hive agent delete` | Delete a Hive agent |
| `hive agent list` | List all configured Hive agents |
| `hive agent show` | Show the configuration for a Hive agent |
| `hive agent types` | List supported Hive agent types |
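The inspection subcommands work the same way for every agent type. A sketch, assuming `hive agent show` and `hive agent check` accept the agent's `--name` in the same way the add commands do (check the CLI help output to confirm):

```bash
hive agent types                      # list supported agent types
hive agent list                       # list all configured agents
hive agent show --name exampleAgent   # assumption: show takes --name
hive agent check --name exampleAgent  # assumption: check takes --name
```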
## Connect to remote metastores with the CLI

- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
- Databricks
- Google Cloud Dataproc
Follow these steps to deploy a remote Hive agent for Apache Hive:
On your local host, run the `hive agent add hive` command with the following parameters to configure your remote Hive agent:

- `--host`: The host where the remote Hive agent will be deployed.
- `--port`: The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local LiveData Migrator server.
- `--no-ssl`: (Optional) TLS encryption and certificate authentication are enabled by default between LiveData Migrator and the remote agent. Use this parameter to disable them.
Example for remote Apache Hive deployment - automated:

```bash
hive agent add hive --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
```

Example for remote Apache Hive deployment - manual:

```bash
hive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --kerberos-keytab /etc/security/keytabs/hive.service.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
```

Replace `<example directory path>` with the path to a directory containing the core-site.xml, hdfs-site.xml, and hive-site.xml.
Transfer the remote server installer to your remote host. Example of secure transfer from local to remote host:

```bash
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
```
On your remote host, run the installer as the root (or sudo) user in silent mode:

```bash
./hivemigrator-remote-server-installer.sh -- --silent
```
On your remote host, start the remote server service:

```bash
service hivemigrator-remote-server start
```
info
If specifying Kerberos and config path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).
Follow these steps to deploy a remote Hive agent for Azure:
On your local host, run the
hive agent add azure
command with the following parameters to configure your remote Hive agent.--host
The host where the remote Hive agent will be deployed.--port
The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local LiveData Migrator server.--no-ssl
(Optional) TLS encryption and certificate authentication is enabled by default between LiveData Migrator and the remote agent. Use this parameter to disable it.
Example for remote Azure SQL deployment with System-assigned managed identity - automated:

```bash
hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --storage-account myadls2 --container-name mycontainer --hdi-version 4.0 --file-system-id myadls2storage --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052
```

Example for remote Azure SQL deployment with User-assigned managed identity - manual:

```bash
hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --client-id b67f67ex-ampl-e2eb-bd6d-client9385id --storage-account myadls2 --container-name mycontainer --hdi-version 4.0 --file-system-id myadls2storage --host myRemoteHost.example.com --port 5052
```
Transfer the remote server installer to your remote host (Azure VM, HDI cluster node). Example of secure transfer from local to remote host:

```bash
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
```
On your remote host, run the installer as the root (or sudo) user in silent mode:

```bash
./hivemigrator-remote-server-installer.sh -- --silent
```
On your remote host, start the remote server service:

```bash
service hivemigrator-remote-server start
```
Follow these steps to deploy a remote Hive agent for AWS Glue:
On your local host, run the `hive agent add glue` command with the following parameters to configure your remote Hive agent:

- `--host`: The host where the remote Hive agent will be deployed.
- `--port`: The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local LiveData Migrator server.
- `--no-ssl`: (Optional) TLS encryption and certificate authentication are enabled by default between LiveData Migrator and the remote agent. Use this parameter to disable them.
Example for remote AWS Glue agent:

```bash
hive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 --file-system-id mys3bucket --host myRemoteHost.example.com --port 5052
```
Transfer the remote server installer to your remote host (Amazon EC2 instance). Example of secure transfer from local to remote host:

```bash
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
```
On your remote host, run the installer as the root (or sudo) user in silent mode:

```bash
./hivemigrator-remote-server-installer.sh -- --silent
```
On your remote host, start the remote server service:

```bash
service hivemigrator-remote-server start
```
note
Databricks agents are currently available as a preview feature.
Follow these steps to deploy a remote Hive agent for Databricks:
On your local host, run the `hive agent add databricks` command with the following parameters to configure your remote Hive agent:

- `--host`: The host where the remote Hive agent will be deployed.
- `--port`: The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local LiveData Migrator server.
- `--no-ssl`: (Optional) TLS encryption and certificate authentication are enabled by default between LiveData Migrator and the remote agent. Use this parameter to disable them.
Example for remote Databricks agent:

```bash
hive agent add databricks --name databricksAgent --jdbc-server-hostname mydbcluster.cloud.databricks.com --jdbc-port 443 --jdbc-http-path sql/protocolv1/o/8445611123456789/0234-125567-testy978 --access-token daexamplefg123456789t6f0b57dfdtoken4 --file-system-id mys3bucket --default-fs-override dbfs: --fs-mount-point /mnt/mybucket --convert-to-delta --host myRemoteHost.example.com --port 5052
```
Transfer the remote server installer to your remote host. Example of secure transfer from local to remote host:

```bash
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
```
On your remote host, run the installer as the root (or sudo) user in silent mode:

```bash
./hivemigrator-remote-server-installer.sh -- --silent
```
On your remote host, start the remote server service:

```bash
service hivemigrator-remote-server start
```
On your local host, run the `hive agent add databricks` command to configure your remote Hive agent. See the example for a remote Databricks agent above for further guidance.
Follow these steps to deploy a remote Hive agent for Dataproc:
On your local host, run the `hive agent add dataproc` command with the following parameters to configure your remote Hive agent:

- `--host`: The host where the remote Hive agent will be deployed.
- `--port`: The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local LiveData Migrator server.
- `--no-ssl`: (Optional) TLS encryption and certificate authentication are enabled by default between LiveData Migrator and the remote agent. Use this parameter to disable them.
Transfer the remote server installer to your remote host. Example of secure transfer from local to remote host:

```bash
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
```
On your remote host, run the installer as the root (or sudo) user in silent mode:

```bash
./hivemigrator-remote-server-installer.sh -- --silent
```
note

The agent port defaults to 5052. To set a custom agent port, run the installer with the `--agent-port` parameter. For example: `./hivemigrator-remote-server-installer.sh -- --silent --agent-port <custom port>`.

On your remote host, start the remote server service:

```bash
service hivemigrator-remote-server start
```
Example for remote Dataproc deployment - automated:

```bash
hive agent add dataproc --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
```

Example for remote Dataproc deployment - manual:

```bash
hive agent add dataproc --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
```
note
If specifying Kerberos and config path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).
## Next steps

Connected to your metastores? Define metadata rules for your metadata migrations.