Version: 1.22.0

Configure Google Dataproc as a target

Configure Google Dataproc as a target metastore using either the UI or the CLI.

Remote agent

A remote agent is a service deployed on a remote host that connects to Data Migrator to handle metadata transfer. A remote agent must be deployed on the Dataproc cluster.

Prerequisites

See the knowledge base article Setting up a Dataproc agent.

Deploy a remote Hive agent for Dataproc with the CLI

  1. On your local host, run the hive agent add dataproc command with the following parameters to configure your remote Hive agent.

    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) Transport Layer Security (TLS) encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
  2. Transfer the remote server installer to your remote host:

    Example of secure transfer from local to remote host
         scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as the root user (or with sudo) in silent mode:

         ./hivemigrator-remote-server-installer.sh -- --silent
    note

    The agent port will default to 5052. To set a custom agent port, run the installer with the --agent-port parameter. For example, ./hivemigrator-remote-server-installer.sh -- --silent --agent-port <custom port>.

  4. On your remote host, start the remote server service:

         service hivemigrator-remote-server start
    Example for remote Dataproc deployment - automated
          hive agent add dataproc --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
    Example for remote Dataproc deployment - manual
          hive agent add dataproc --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
note

If you enter Kerberos and configuration path information for remote agents, ensure the directories and Kerberos principal are correct for your chosen remote host (not your local host).
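Before adding the agent, you can confirm that the agent port is reachable from the Data Migrator host. The sketch below uses bash's built-in /dev/tcp redirection; the hostname and port are the placeholder values from the examples above, not real values.

```shell
#!/usr/bin/env bash
# Check whether the remote Hive agent port is reachable from the
# Data Migrator host. Hostname and port are placeholders taken from
# the examples above; substitute your own values.
agent_reachable() {
  local host="$1" port="$2"
  # /dev/tcp is a bash feature; timeout avoids hanging on filtered ports.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

if agent_reachable myRemoteHost.example.com 5052; then
  echo "agent port reachable"
else
  echo "agent port not reachable" >&2
fi
```

If the port is not reachable, check that the hivemigrator-remote-server service is running on the remote host and that any firewall rules between the two hosts allow the agent port.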

Configure Google Dataproc with the UI

  1. From the Dashboard, select a product under Products.

  2. Select Connect To Metastore.

  3. Select the Filesystem in which the data associated with the metadata is held.
    For Dataproc agents, this is usually a Google Cloud Storage bucket.

  4. Select Google Cloud Dataproc as the Metastore Type.

  5. Download the installer to the Dataproc cluster virtual machine.

  6. Make the installer script executable.

    chmod +x hivemigrator-remote-server-installer.sh
  7. Run the installation command.

    ./hivemigrator-remote-server-installer.sh -- --silent
  8. Start the service.

    service hivemigrator-remote-server start
  9. Enter a Display Name.

  10. Enter the hostname or IP address of the cluster edge node.

  11. Enter the port for communication between the Hive Migrator service and the Dataproc server.

  12. Choose whether to use TLS.

  13. Select Check connection to test the connection to the metastore with the details you entered.

    If Data Migrator can connect to the remote agent successfully, you can continue configuring the agent.

  14. Optional Settings:

    • Configuration path
    • Kerberos Configuration
      • Use the principal assigned to the Dataproc cluster.
    • Default filesystem override: overrides the default filesystem URI. We recommend this for complex use cases only.
  15. Select Save.
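If you enabled TLS in step 12, you can verify that the agent port is actually serving TLS before relying on Check connection. A minimal sketch using openssl (assuming openssl is installed on the host; the hostname and port are the placeholder values used earlier):

```shell
#!/usr/bin/env bash
# Probe the agent port for a TLS handshake. Host and port are
# placeholders; replace with your agent's address.
tls_check() {
  local host="$1" port="$2"
  # s_client exits non-zero if no TLS handshake completes.
  openssl s_client -connect "${host}:${port}" </dev/null >/dev/null 2>&1
}

if tls_check myRemoteHost.example.com 5052; then
  echo "TLS handshake succeeded"
else
  echo "no TLS handshake on port 5052" >&2
fi
```

A failed handshake on a reachable port usually means the agent was installed without TLS (for example, added with --no-ssl), in which case disable the TLS option in step 12 to match.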