WD Fusion Plugin: Hive Metastore

1. Introduction

The Hive Metastore plugin enables WD Fusion to replicate Hive's metastore, allowing WD Fusion to maintain a replicated instance of Hive's metadata and, in future, support Hive deployments that are distributed between data centers.

Release Notes

Check out the 1.0 WD Hive Metastore Plugin Release Notes for the latest information.

2. Installation

Pre-requisites

Along with the default requirements that you can find on the WD Fusion Deployment Checklist, you also need to ensure that the Hive service is already running on your server. Installation will fail if the WD Fusion Plugin can't detect that Hive is already running.

Limitation: Hive must be running at all zones

All zones within a membership must be running Hive in order to support replication. We're aware that this currently prevents the popular use case of replicating between on-premises clusters and S3/cloud storage, where Hive is not running. We intend to remove this limitation in a future release.

Known Issues

Known Issue: bigsql-sync.jar must be included in the wd-hive-metastore classpath or the server will not start properly
There are two options available for meeting this requirement:
  • If the jar is available on the wd-hive-metastore node, create a symlink via
    cd /opt/wandisco/wd-hive-metastore
    ln -s /usr/ibmpacks/current/bigsql/bigsql/lib/java/bigsql-sync.jar bigsql-sync.jar 
    
  • If the jar is not available on the wd-hive-metastore node, copy it from a node that has it to /opt/wandisco/wd-hive-metastore, for example as shown below.
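    A hedged example of copying the jar from another node; the source hostname below is a placeholder, while the jar and destination paths are those given above:
    scp root@bigsql-node.example.com:/usr/ibmpacks/current/bigsql/bigsql/lib/java/bigsql-sync.jar /opt/wandisco/wd-hive-metastore/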
Known Issue: WD Fusion Hive Metastore plugin not installed locally to Hive Metastore
If WD Fusion Hive Metastore is not installed to the same server as the Hive metastore, then you need to complete the following workaround:
  1. Login to kadmin.local or kadmin on the host machine running wd-hive-metastore.
  2. In kadmin use
    addprinc -randkey hive/<WD-hive-metastore-hostname>@<REALM>
    addprinc -randkey HTTP/<WD-hive-metastore-hostname>@<REALM>
    xst -norandkey -k hive.keytab hive/<WD-hive-metastore-hostname>@<REALM> HTTP/<WD-hive-metastore-hostname>@<REALM>
  3. Exit kadmin
  4. Check the keytab has the correct entries by using
    klist -e -k -t hive.keytab
    Move the keytab into place:
    sudo mv hive.keytab /etc/wandisco/hive/
  5. Make sure the keytab is readable by the hive user by using:
    sudo chown hive:hive /etc/wandisco/hive/hive.keytab
    chmod +r /etc/wandisco/hive/hive.keytab
  6. Now restart Fusion server using:
    service fusion-server restart
  7. Now restart the Hive Metastore:
    Restart the wd-hive-metastore service using your Hadoop manager.
  8. Now restart the HiveServer2 service using your Hadoop manager (e.g. Ambari).
  9. Connect to beeline again.
Known Issue: HDP deployments and Hive Metastore port

Currently, you can't run the vanilla Hive metastore service on the same host as the wd-hive-metastore service, because HDP uses the hive.metastore.uris parameter to set the port for the vanilla hive metastore service.

See Hortonworks Documentation about Hive Service Ports

HDP uses the hive.metastore.uris parameter to set the Hive Metastore port. Without the WD Hive Template installed, the HiveServer2 service uses an embedded metastore service rather than the separate Hive Metastore service. For this reason, we can't support running the vanilla Hive Metastore and the wd-hive-metastore on the same host when using an HDP distribution. We recommend that you stop the vanilla Hive Metastore when using WD Hive on HDP. Note that even if the wd-hive-metastore service is deployed onto another host, the vanilla Hive Metastore service port will still be changed by our configuration of hive.metastore.uris.
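For illustration only, the configured property takes a form like the following in hive-site.xml; the hostname is a placeholder and 9084 is the default WD Hive Metastore port referenced later in this guide:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://wd-hive-metastore-host.example.com:9084</value>
</property>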

Known Issue: Cloudera/WD Hive Metastore deployments: manual copy of hive-site.xml required

If WD Hive Metastore is installed onto a Cloudera-based cluster and the WD Hive Metastore is being installed on a host other than the WD Fusion server then after the install is complete the admin must complete these steps:

  1. Copy the config file from
    /etc/wandisco/hive/hive-site.xml
    to the same location on the target host.
  2. Restart the WD-HIVE service using Cloudera Manager.
Known Issue: Failed to install metastore service during fusion installation in IOP4.1 and IOP4.2

Example failure during BigInsights installation: the error is caused by the stack not being available via ambari-server.

Workaround

To fix this, ensure that only a single ambari-server process is running before running service ambari-server restart. To find the ambari-server processes that are running, use:
ps -aux | grep ambari-server
Then kill all the ambari-server processes:
kill -9 [pid of process]
Restart ambari-server:
service ambari-server restart
Rerun the check to confirm that only a single process is running:
ps -aux | grep ambari-server
You can then check in the Ambari UI that the WD Hive Metastore and WD HiveServer2 Template services are available. If they are present, you can proceed with retrying the service installation via the installer.

Installation procedure

  1. Download the installer script fusion-ui-server-hdp-hive_deb_installer.sh, etc., from WANdisco's FD website.
    In this early version of Hive Metastore, the Hive Metastore plugin is provided as a full-blown installer that installs WD Fusion with the Hive Metastore replication plugin already built in.
  2. Navigate to the extracted files.
  3. Run through the installer:
    Saving to: `fusion-ui-server-hdp-hive_deb_installer.sh'
    
    100%[===============================================================================================================>] 1,635,783,053 8.76M/s   in 8m 18s
    
    2016-06-15 10:53:47 (3.13 MB/s) - `fusion-ui-server-hdp-hive_deb_installer.sh' saved [1635783053/1635783053]
    
    root@dc01-vm1:~# bash fusion-ui-server-hdp-hive_deb_installer.sh
  4. The installer will first perform a check for the system's JAVA_HOME variable.
    Installing WD Fusion
    Verifying archive integrity... All good.
    Uncompressing WANdisco Fusion........................
     
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
     
     
     
     
    Welcome to the WANdisco Fusion installation
    
    
    
    You are about to install WANdisco Fusion version 2.8-19
    
    Do you want to continue with the installation? (Y/n)
    Enter "Y" to continue.
  5. The installer checks that both Perl and Java are installed on the system.
      Checking prerequisites:
     
    Checking for perl: OK
    Checking for java: OK
    
    No packages found matching fusion-*.
    
    INFO: Using the following Memory settings for the WANdisco Fusion Admin UI process:
     
    INFO: -Xms128m -Xmx512m
     
    Do you want to use these settings for the installation? (Y/n)
    
    
    Enter "Y" or "N" if you wish to set different Java heap settings.
  6. The installer asks you to confirm which TCP port will be used for accessing the WD Fusion web UI, the default is "8083".
    Which port should the UI Server listen on?  [8083]:
    Please specify the appropriate platform from the list below:
     
    [0] hdp-2.1.0
    [1] hdp-2.2.0
    [2] hdp-2.3.0
    [3] hdp-2.4.0
     
    Which fusion platform do you wish to use? 1
    You chose hdp-2.2.0:2.6.0.2.2.0.0-2041
    Select from the available Hadoop packages.
  7. Next, you set the system user and group for running the application.
    We strongly advise against running Fusion as the root user.
     
    For default CDH setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
     
    Which user should Fusion run as? [hdfs]
    Checking 'hdfs' ...
     ... 'hdfs' found.
     
    Please choose an appropriate group for your system. By default CDH uses the 'hdfs' group.
     
    Which group should Fusion run as? [hdfs]
    Checking 'hdfs' ...
     ... 'hdfs' found.
    You should press enter to go with the default "hdfs".
  8. You will now be shown a summary of the settings that you have provided so far:
    Installing with the following settings:
    
    User and Group:                     hdfs:hadoop
    Hostname:                           dc01-vm1.bdva.wandisco.com
    Fusion Admin UI Listening on:       0.0.0.0:8083
    Fusion Admin UI Minimum Memory:     128
    Fusion Admin UI Maximum memory:     512
    Platform:                           hdp-2.2.0 (2.6.0.2.2.0.0-2041)
    Fusion Server Hostname and Port:    dc01-vm1.bdva.wandisco.com:8082
     
    Do you want to continue with the installation? (Y/n)
    Enter "Y" unless you need to make changes to any of the settings.
  9. The installation will now complete:
    Adding the user hdfs to the hive group if the hive group is present.
    Installing hdp-2.2.0 packages:
      fusion-hcfs-hdp-2.2.0-server_2.8-SNAPSHOT-1854_all.deb ... Done
      fusion-hcfs-hdp-2.2.0-ihc-server_2.8-SNAPSHOT-1854_all.deb ... Done
    Installing plugin packages:
      wd-hive-plugin-hdp-2.2.0_1.0-SNAPSHOT-480_all.deb ... Done
    Installing fusion-ui-server package:
      fusion-ui-server_2.8-19_all.deb ...Selecting previously unselected package fusion-ui-server.
    (Reading database ... 56307 files and directories currently installed.)
    Unpacking fusion-ui-server (from .../fusion-ui-server_2.8-19_all.deb) ...
    Setting up fusion-ui-server (2.8-19) ...
     Done
  10. Once the installation has completed, you need to configure the WD Fusion server using the browser based UI.
    Starting fusion-ui-server:                                 [  OK  ]
    Checking if the GUI is listening on port 8083: .....Done
     
    Please visit http://node.hostname.com:8083/ to complete installation of WANdisco Fusion
     
    If 'your.hostname.server.com' is internal or not available from your browser, replace
    this with an externally available address to access it.
     
    Stopping fusion-ui-server:.                                [  OK  ]
    Starting fusion-ui-server:                                 [  OK  ]
    Open a browser and enter the provided URL, or IP address.
  11. Configure WD Fusion through a browser

  12. Follow this section to complete the installation by configuring WD Fusion using a browser-based graphical user interface.

    Silent Installation
    For large deployments it may be worth using the Silent Installation option.

    Open a web browser and point it at the provided URL, e.g.:
    http://<YOUR-SERVER-ADDRESS>.com:8083/
  13. In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
    Make your selection as follows:
    Adding a new WD Fusion cluster
    Select Create a new Zone.
    Adding additional WD Fusion servers to an existing WD Fusion cluster
    Select Add to an existing Zone.

    High Availability for WD Fusion / IHC Servers

    It's possible to enable High Availability in your WD Fusion cluster by adding additional WD Fusion/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient WD Fusion/IHC servers running to maintain replication.

    Add HA nodes to the cluster using the installer and choosing to Add to an existing Zone, using a new node name.

    Configuration for High Availability
    When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, problems will occur. WD Fusion should never see the same URI (scheme + authority) for two different clusters. An illustrative pair of values is shown below.
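    As a hedged illustration, each zone's core-site.xml should carry its own default file system authority; the values below are placeholders:
    <!-- Zone A core-site.xml -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://zone-a-nameservice</value>
    </property>
    <!-- Zone B core-site.xml -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://zone-b-nameservice</value>
    </property>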

    WD Fusion Deployment

    Welcome.

  14. Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
    WD Fusion Deployment

    Environmental checks.

    On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.

    WD Fusion Deployment

    Example check results.

    Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.

  15. Upload the license file.
    WD Fusion Deployment

    Upload your license file.

  16. The conditions of your license agreement will be presented in the top panel, including License Type, Expiry Date, Name Node Limit and Data Node Limit.
    WD Fusion Deployment

    Verify license and agree to subscription agreement.

    Click I agree to the EULA to continue, then click Next Step.
  17. Enter settings for the WD Fusion server.
    WD Fusion Deployment

    screen 4 - Server settings

    WD Fusion Server

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Fusion server.
    Umask (currently 022)
    Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644, which ensures that the installation will be able to start up and restart. A quick shell demonstration follows this list of settings.
    Latitude
    The north-south coordinate angle for the installation's geographical location.
    Longitude
    The east-west coordinate angle for the installation's geographical location. The latitude and longitude are used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.
    Alternatively, you can click on the global map to locate the node.
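    A quick, hedged shell demonstration of the umask arithmetic described above (run in any scratch directory):
    umask 022                        # mask new files/directories with 022
    mkdir demo_dir && touch demo_file
    ls -ld demo_dir demo_file        # demo_dir -> drwxr-xr-x (755), demo_file -> -rw-r--r-- (644)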

    Advanced options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    Custom UI hostname
    Lets you set a custom hostname for the Fusion UI, distinct from the communication.hostname which is already set as part of the install and used by WD Fusion nodes to connect to the Fusion server.
    Custom UI Port
    Lets you change WD Fusion UI's default port, in case it is assigned elsewhere, e.g. Cloudera's headlamp debug server also uses it.
    Strict Recovery
    See explanation of the Strict Recovery Advanced Options.

    Enable SSL for WD Fusion

    Tick the checkbox to enable SSL.
    WD Fusion Deployment

    KeyStore Path
    System file path to the keystore file.
    e.g. /opt/wandisco/ssl/keystore.ks
    KeyStore Password
    Encrypted password for the KeyStore.
    e.g. ***********
    Key Alias
    The Alias of the private key.
    e.g. WANdisco
    Key Password
    Private key encrypted password.
    e.g. ***********
    TrustStore Path
    System file path to the TrustStore file.
    e.g. /opt/wandisco/ssl/keystore.ks
    TrustStore Password
    Encrypted password for the TrustStore.
    e.g. ***********

    IHC Server

    WD Fusion

    IHC Settings

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
    IHC network interface
    The hostname for the IHC server.

    Advanced Options (optional)

    IHC server binding address
    In the advanced settings you can decide which address the IHC server will bind to. The address is optional; by default the IHC server binds to all interfaces (0.0.0.0), using the port specified in the ihc.server field. In all cases the port should be identical to the port used in the ihc.server address, i.e. in /etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc or /etc/wandisco/fusion/ihc/server/localfs-2.7.0/2.7.0.ihc. A hedged example entry is shown below.
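    A minimal sketch of the ihc.server entry in the .ihc file named above; the hostname and port are placeholders for your environment:
    ihc.server=ihc01.example.com:7000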
    Once all settings have been entered, click Next step.
  18. Next, you will enter the settings for your new Zone.
    WD Fusion Deployment

    New Zone

    Zone Information

    Entry fields for zone properties

    Fully Qualified Domain Name
    The full hostname for the server.
    Node ID
    A unique identifier that will be used by WD Fusion UI to identify the server.
    Location Name (optional)
    A location name that can quickly identify where the server is located.

    Induction failure
    If induction fails, attempting a fresh installation may be the most straightforward cure; however, it is possible to push through an induction manually, using the REST API. See Handling Induction Failure.

    Known issue with Location names
    You must use different Location names/Node IDs for each zone. If you use the same name for multiple zones, you will not be able to complete the induction between those nodes.

    DConE Port
    TCP port used by WD Fusion for replicated traffic.
    Zone Name
    The name used to identify the zone in which the server operates.
    Management Endpoint
    Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager:

    Advanced Options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    URI Selection

    The default behavior for WD Fusion is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs-scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI and can only use the HDFS protocol. Each option is explained below:

    Use HDFS URI with HDFS file system
    URI Option A
    This option is available for deployments where the Hadoop applications support neither the WD Fusion URI nor the HCFS standards. WD Fusion operates entirely within HDFS.

    This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs:// or no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
    Use WD Fusion URI with HCFS file system
    URI Option B
    When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Fusion URI or are not written to the HCFS specification, this option will not work.

    MapR deployments
    Use this URI selection if you are installing into a MapR cluster.

    Use Fusion URI with HDFS file system
    URI option C
    This differs from the default in that while the WD Fusion URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the WD Fusion URI but not the Hadoop Compatible File System.
    Use Fusion URI and HDFS URI with HDFS file system
    URI Option D
    This "mixed mode" supports all the replication schemes (fusion://, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.

    Fusion Server API Port

    This option lets you select the TCP port that is used for WD Fusion's API.

    Strict Recovery

    Two advanced options are provided to change the way that WD Fusion responds to a system shutdown where WD Fusion was not shut down cleanly. Currently the default setting is to not enforce a panic event in the logs if, during startup, we detect that WD Fusion wasn't shut down cleanly. This is suitable for using the product as part of an evaluation effort. However, when operating in a production environment, you may prefer to enforce the panic event, which will stop any attempted restarts to prevent possible corruption of the database.

    DConE panic if dirty (checkbox)
    This option lets you enable the strict recovery option for WANdisco's replication engine, to ensure that any corruption to its prevayler database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shutdown, either due to a system or application problem.
    App Integration panic if dirty (checkbox)
    This option lets you enable the strict recovery option for WD Fusion's database, to ensure that any corruption to its internal database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shutdown, either due to a system or application problem.

    <Hadoop Manager e.g. Ambari> Configuration

    This section configures WD Fusion to interact with the management layer, which could be Ambari or Cloudera Manager, etc.

    Manager Host Name /IP
    The full hostname or IP address for the working server that hosts the Hadoop manager.
    Port
    TCP port on which the Hadoop manager is running.
    Username
    The username of the account that runs the Hadoop manager.
    Password
    The password that corresponds with the above username.
    SSL
    (Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't update from the default http port.

    Authentication without a management layer
    WD Fusion normally uses the authentication built into your cluster's management layer, i.e. the Cloudera Manager username and password are required to login to WD Fusion. However, in Cloud-based deployments, such as Amazon's S3, there is no management layer. In this situation, WD Fusion adds a local user to WD Fusion's ui.properties file, either during the silent installation or through the command-line during an installation.
    Should you forget these credentials, see Reset internally managed password

  19. Enter security details, if applicable to your deployment.
    Kerberos

    Kerberos Configuration

    In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.

    Enabling Kerberos authentication on WD Fusion's REST API
    When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the "--negotiate" and "-u:" options, like so:

    curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"

    See Setting up Kerberos for more information about Kerberos setup.
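    As a hedged complement to the curl example above, a ticket can be obtained first with kinit; the keytab path and principal are placeholders for your environment:
    kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@YOUR.REALM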

  20. Click Validate to confirm that your settings are valid. Once validated, click Next step.
    WD Fusion Deployment

    Zone information.

  21. The remaining panels in step 6 detail all of the installation settings. All your license, WD Fusion server, IHC server and zone settings are shown. If you spot anything that needs to be changed, you can go back through the installer to make changes.
    WD Fusion Deployment

    Summary

    Once you are happy with the settings and all your WD Fusion clients are installed, click Deploy Fusion Server.
  22. WD Fusion Client Installation

  23. In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support the ingestion of data to WD Fusion nodes.
    WD Fusion Deployment

    Client installations.

    The installer supports three different packaging systems for installing clients: regular RPMs, Parcels for Cloudera, and HDP Stacks for Hortonworks/Ambari.

    client package location
    You can find them in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages
    /opt/wandisco/fusion-ui-server/ui/stack_packages
    /opt/wandisco/fusion-ui-server/ui/parcel_packages

    Important! If you are installing on Ambari 1.7 or CDH 5.3.x
    Additionally, due to a bug in Ambari 1.7, and an issue with the classpath in CDH 5.3.x, before you can continue you must log into Ambari/Cloudera Manager and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.

  24. We now handle the configuration of the Hive Metastore Plugin, which is integrated into WD Fusion at this point, rather than in a separate post-installation step.
    WD Fusion Deployment

    Hive Metastore plugin installation - sub-step 1.

    The installer performs some basic validation, checking the following criteria:

    Manager Validation
    Checks that the system is being configured with valid distribution manager support. In this example, "AMBARI" should be detected. Should this validation check fail, you would need to check that you have entered the right Manager details in Step 5.
    Hive Service installed Validation
    The installer will check that Hive is running on the server. Should the validation check fail, you should check that Hive is running.

    Configuration

    During the installation you need to enter the following properties:

    Hive Metastore host
    The hostname for the Hive Metastore service.
    Known Issue: You must provide a hostname, not an IP address. Currently, an IP address is not enough to verify the presence of the service. We will add support for IP addresses once we have identified a workable method for validating it with the manager.
    Hive Metastore port
    The TCP port that will be used by the Hive Metastore service. Default: 9084
  25. In this step you need to copy over and unpack the Hive services to the service directory of your Ambari installation.
    WD Fusion Deployment

    Hive Metastore plugin installation - sub-step 2.

    If you check Ambari, providing the new packages are correctly put in place, you will see them listed. Do not enable them through Ambari; they will be installed later.
    WD Fusion Deployment

    Hive Metastore plugin installation - check the service packages have been picked up.

    Important: You should see that the package for WD Hive Metastore is now listed through Ambari. Do NOT enable the package at this time. WD Hive Metastore needs to be installed through steps that appear later.
  26. At the end of this step, we address a possible problem that you may have in connecting WD Fusion to a remote Hive Metastore database. Please note that the following MySQL query is only applicable to Ambari installations.
    GRANT ALL PRIVILEGES ON *.* TO 'hive'@'<HOSTNAME-FOR-HIVE-METASTORE-SERVICE-NODE>'
    IDENTIFIED BY PASSWORD '<hive database password>' WITH GRANT OPTION;
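    As a hedged connectivity check, you can confirm from the WD Hive Metastore host that the database accepts the grant; the hostname is a placeholder:
    mysql -h <HIVE-METASTORE-DB-HOST> -u hive -p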
  27. On Cloudera installations: The above procedure is HDP/Ambari-based. For Cloudera installations, the following screen will appear:
    WD Fusion Deployment

    Hive Metastore plugin installation - sub-step 2.

    
    You must add the Hive parcels to Cloudera AND ensure that the Hive Metastore database is accessible before you continue.
    Please note: Hive parcels must be present in the Cloudera parcel-repo directory prior to continuing installation.
    1. Download the Parcel packages to the Cloudera parcel repository directory (/opt/cloudera/parcel-repo/ on dc07-vm0.bdva.wandisco.com).
    2. Copy the Custom Service Descriptor .jar file to the Local Descriptor Repository (normally /opt/cloudera/csd) on the Cloudera manager node.
    3. Restart the Cloudera server (run service cloudera-scm-server restart in the terminal) so that Cloudera can see the new parcel and jar.
  28. The next step handles the plugin's installation:
    WD Fusion Deployment

    Hive Metastore plugin installation - sub-step 3.

    When you have confirmed that the stack files are in place, on the installer screen, click Next.

    Summary

    The summary confirms the values of the entries you provided in the first sub-step of the WANdisco Hive Metastore installation section.

    To begin the installation of the Plugin, click Start Install.

    Ambari-based installation

    The following notes explain what is happening during each phase of the installation into an Ambari-based cluster:

    Metastore Service Install
    This step handles the installation of the WD Hive Metastore Service into Ambari.
    Hive Metastore Template Install
    Install the WANdisco Hive Metastore Service Template on Ambari.
    Update Hive Configuration
    Updates the URIs for Hive connections in Ambari.
    Configure Hive Conf Files
    Symlink the Hive configuration file into the Fusion Hive Metastore plugin.
    Restart Hive Service
    Restarts Hive Service in Ambari. Note this process can take several minutes to complete. Please don't make any changes or refresh your installer's browser session.
    Restart WD Hive Metastore Service
    Restarts Hive Metastore Service in Ambari. Note this process can take several minutes to complete.
    Restart WD HiveServer2 Service
    Restarts the HiveServer2 Service in Ambari. Note this process can take several minutes to complete.

    Cloudera-based installation

    The following notes explain what is happening during each phase of the installation into a CDH-based cluster:

    Fusion Hive parcel distribution and activation
    Distribute and activate Fusion Hive parcels.
    Hive-site Setup
    Retrieve and set up hive-site.xml for use with WD Fusion.
    Fusion Hive service descriptor
    Install Fusion Hive service descriptor.
    Known Issue: Cloudera-based deployments only
    When installing the Hive Metastore plugin, you must create the folder /etc/wandisco/hive on the Metastore host you specified above. This folder must be owned by hive:hive and the Hive user must be able to access and write to this location (a minimal command sketch is shown after this list).
    Fusion Hive service setup
    Install Fusion Hive service.
    Cloudera metastore configuration
    Configuring Cloudera to use Fusion Hive metastore.
    The Hive Metastore Consistency Checker only covers checking and repairing the Hive Metastore data and not any inconsistencies in the data within the replicated folder. This will be the responsibility of the Fusion Server's main Consistency Check tool.
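    A minimal command sketch, for Cloudera-based deployments, of creating the folder described in the known issue above (run on the Metastore host):
    sudo mkdir -p /etc/wandisco/hive
    sudo chown hive:hive /etc/wandisco/hive
    sudo chmod 755 /etc/wandisco/hive    # owned by hive, so the hive user can read and write here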

3. Hive Metastore Consistency Checks

The WD Fusion Consistency Check tool is available for each replicated folder, under the Replication tab. For Hive Metastore data, a dedicated Hive Consistency tab is available.

WD Hive Metastore Example

Doing a consistency check

  1. Login to a WD Fusion node that replicates Hive Metastore data. Click on the Replication tab.
  2. Click on the replicated folder that contains the Hive Metastore data. On its Replication Rules screen, click on the Details link.
    WD Hive Metastore Example

    Don't click the Consistency link for Metastore data
    When clicking on a replicated folder that stores Hive Metastore data, don't follow the "consistency" link in the Consistency column as this takes you to the general consistency check tool which is not used for checking Hive Metastore data.

  3. On the Replicated Folder Details screen, click the Hive Consistency tab.
    WD Hive Metastore Example
  4. To perform a consistency check on a database table, click on it within the Choose context panel.
    WD Hive Metastore Example
    Basic details for the chosen element will display under the Detailed view panel. Click the Consistency Check button to begin a check of the consistency between replicated nodes.
  5. The result of the consistency check appears as a status (Consistent, Inconsistent, Unknown).
    WD Hive Metastore Example

    "Unknown" status appears if:

    • the databases have never been checked before (as in a newly deployed cluster).
    • the last consistency check has expired (consistency check results last for only an hour from when the check is completed).
    • an error occurred in the last check.

  6. For contexts that are inconsistent, you can perform a repair through the tool. The first step is to identify which node contains the most up-to-date or correct data. You can view a full consistency report to help you decide.
    WD Hive Metastore Example
    Select which zone contains the correct version, or "Source of truth" for the selected metastore data by clicking on the applicable Set as source button.

    Example Consistency Report

    The following example shows the contents of a simple consistency check report

    <diffPage>
      <pageNum>0</pageNum>
      <pageSize>20</pageSize>
      <totalSize>2</totalSize>
      <dbName>test_011</dbName>
      <objectDiffs>
        <type>table</type>
        <key>test_011_table_001</key>
        <diffKind>SomeMissing</diffKind>
        <subtree>1</subtree>
        <zones>
          <z>1</z>
          <z>0</z>
        </zones>
      </objectDiffs>
      <objectDiffs>
        <type>table</type>
        <key>test_011__test_011_table_001_index_dept__</key>
        <diffKind>SomeMissing</diffKind>
        <subtree>0</subtree>
        <zones>
          <z>1</z>
          <z>0</z>
        </zones>
      </objectDiffs>
    </diffPage>
    
  7. Once you have clicked on the Source of truth, you can begin the repair by clicking on the Repair button.
    WD Hive Metastore Example

    Repair Options

    Recursive
    Repair selected context and all its children.
    Add Missing
    Add missing data to the destination zone that exists at the "source of truth" zone.
    Remove Extra
    Remove data from the destination zone that does not exist at the "source of truth" zone.
    Update Different
    Update existing data at the destination zone to match the "source of truth" zone.
  8. After clicking on the Repair button, the repair process will start.
    WD Hive Metastore Example

Known Issue with failed consistency repairs (FUI-3272)
A known issue may appear if Hive contains two indexes with the same name that apply to columns of the same name, even though the indexes belong to different tables in different databases. In this scenario, repairs will fail. The cause of the issue is related to the ordering of actions. You can work around the issue using the following procedure.

If you need to repair a database that contains Index tables then you will need to repair them in stages as follows:

  1. Repair the database with recursive unset.
  2. Repair the Index tables.
  3. Repair the parent tables with recursive enabled to repair any partitions and indexes.

Installing on a Kerberized cluster

The Installer lets you configure WD Fusion to use your platform's Kerberos implementation. You can find supporting information about how WD Fusion handles Kerberos in the Admin Guide, see Setting up Kerberos.

Configuring Kerberos principals for Hive Metastore plugin

You need to configure Kerberos principals for the wd-hive-metastore service and the Hive Fusion plugin to use. All these steps need to be carried out with reference to the host where the wd-hive-metastore and Fusion services are running.

For reference
See Cloudera's documentation on Create and Deploy the Kerberos Principals and Keytab Files.

Procedure

  1. Replace fusion-server.wandisco.com with the actual FQDN of your wd-hive-metastore host.
  2. Login to kadmin.local or kadmin on the host machine running wd-hive-metastore.
  3. In kadmin use:
    addprinc -randkey hive/fusion-server.wandisco.com@WANDISCO.HADOOP
  4. In kadmin use:
    addprinc -randkey HTTP/fusion-server.wandisco.com@WANDISCO.HADOOP
  5. In kadmin use
    xst -k hive.service.keytab hive/fusion-server.wandisco.com@WANDISCO.HADOOP
    HTTP/fusion-server.wandisco.com@WANDISCO.HADOOP
  6. Exit kadmin
  7. Check the keytab has the correct entries by using
    klist -e -k -t hive.service.keytab
  8. Use
    sudo mv hive.service.keytab /etc/security/keytabs/
  9. Make sure the keytab is readable by the hive user by using
    sudo chown hive:hadoop /etc/security/keytabs/hive.service.keytab
    sudo chmod +r /etc/security/keytabs/hive.service.keytab
  10. Ensure that the Hive configuration points at the keytab in its new location:
    <property>
        <name>hive.metastore.kerberos.keytab.file</name>
        <value>/etc/security/keytabs/hive.service.keytab</value>
    </property>
  11. Now restart Fusion server using service fusion-server restart.
  12. Restart the wd-hive-metastore service via CM.
  13. Restart the hiveServer2 service via CM.
  14. Reconnect to beeline. Remember that you need to perform a kinit before starting beeline, using that node's keytab and hive principal. You may also need to make a change to the hive-site.xml:
    <property>
        <name>hive.metastore.kerberos.keytab.file</name>
        <value>hive.service.keytab</value>
    </property>
    to
    <property>
        <name>hive.metastore.kerberos.keytab.file</name>
        <value>/etc/wandisco/hive/hive.service.keytab</value>
    </property>
    
    All connections using beeline should use the same connection string regardless of the node that is being used - always use your hiveserver2 host's FQDN, e.g.:
    !connect jdbc:hive2://your.server.url:10000/default;principal=hive/principal.server.com@WANDISCO.HADOOP
    
    even if you are connecting on that server itself.

Secure Impersonation

Normally the Hive user has superuser permissions on the hiveserver2 and Hive Metastore nodes. If you are installing onto different nodes, the corresponding proxyuser parameters should also be updated in core-site.xml and kms-site.xml.

  1. Set up a proxy user on the NameNode, adding the following properties to core-site.xml on the applicable NameNode(s).
  2. <property>
            <name>hadoop.proxyuser.$USERNAME.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.$USERNAME.groups</name>
            <value>*</value>
    </property>
    
    hadoop.proxyuser.$USERNAME.hosts
    Defines the hosts from which the superuser is allowed to impersonate other users. $USERNAME, the superuser who wants to act as a proxy for other users, is usually the system user "hdfs". From version 2.6, these values are captured by the installer and applied automatically.
    hadoop.proxyuser.$USERNAME.groups
    A list of groups whose users the superuser is allowed to impersonate. A wildcard (*) means that any user can be proxied. For the superuser to act as a proxy for another user, the proxy actions must be performed from one of the listed hosts, and the user must belong to one of the listed groups. Note that this can be a comma-separated list or the wildcard (*). A concrete sketch follows below.
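    A concrete, hedged sketch, assuming the proxying superuser is "hive" and restricting impersonation to two illustrative hosts (all values are placeholders):
    <property>
        <name>hadoop.proxyuser.hive.hosts</name>
        <value>hiveserver2.example.com,wd-hive-metastore.example.com</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hive.groups</name>
        <value>*</value>
    </property>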

High Availability with Hive

It's possible to set up High Availability by enabling multiple WD Fusion nodes to connect to the Hive Metastore. For a basic setup, use the following procedure:

Install on your first node

  1. Follow the instructions for a regular Fusion-Hive installation provided in the installation guide - 2. Installation

Install on your second node:

  1. Select Add to existing zone (give the address of the first WD Fusion node).
  2. Continue with the installation as you did on your first WD Fusion node until you reach the Hive setup screens.
  3. On the first Hive screen, add the address of the Metastore server associated with your first Fusion node (this will require changing the default), then click Update.
  4. Proceed to the next screen.
  5. Skip the next screen as the Fusion-Hive stacks / parcels are already installed.
  6. Transfer /etc/wandisco/fusion/server/hive-site.xml from your first Fusion node to /etc/wandisco/fusion/server/hive-site.xml on this node (see the example command after this list).
  7. Click the Install button to launch the installation process.
  8. When prompted, proceed to the end of the UI installer.
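A hedged example of the transfer in step 6, run from the second node; the first node's hostname is a placeholder:

  scp root@first-fusion-node.example.com:/etc/wandisco/fusion/server/hive-site.xml /etc/wandisco/fusion/server/hive-site.xml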

4. Technical Glossary

Hive

Hive is a Hadoop-specific data warehouse component. It provides facilities to abstract a structured representation of data in Hadoop's file system. This structure is presented as databases containing tables that are split into partitions. Hive can prescribe structure onto existing Hadoop data, or it can be used to create and manage that data.

It uses an architecture that includes a "metastore", which provides the interface for accessing all metadata for Hive tables and partitions. The metastore is the component that persists the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data and the location of any corresponding Hadoop files where the data is stored.

Hive offers a range of options for the deployment of a metastore:

As a metastore database:

  1. Local/embedded metastore database (Derby)
  2. Remote metastore database

As a metastore server:

  1. Local/embedded metastore server
  2. Remote metastore server

In remote mode, the metastore server is a Thrift service. In embedded mode, the Hive client connects directly to the underlying database using JDBC. Embedded mode supports only a single client session, so it is not normally used for multi-user production environments.

WANdisco's implementation of a replicated Hive metastore supports deployments that use a remote metastore server. As tools exist that use interfaces to the metastore other than the thrift interface, the implementation does not just proxy that interface.

WANdisco Hive Metastore

The WANdisco Hive Metastore can act as a replacement or complement for the standard Hive Metastore, and provides two components:

  1. A plugin for Fusion that allows for the coordination of Hive Metastore activities.
  2. A replacement Hive Metastore implementation that delegates the coordination of activities to the plugin in order that they can be performed in a consistent manner across multiple deployments of the Metastore.

The resulting system ensures that Hive metadata can be made consistent across multiple Hadoop clusters, and by performing that coordination in conjunction with actions performed against the Hadoop file system, also ensures that this consistency applies to Hive-resident metadata and any corresponding files where Hive table/partition data is stored.

The following diagram provides a simplified view of how WANdisco's Hive Metastore plugin interfaces between your Hive deployment and WD Fusion.

WD Hive Metastore Example

The WANdisco Hive Metastore (NsHive Metastore in the diagram above) can replace the standard Hive Metastore from the Hadoop distribution, or run alongside that Metastore. It provides all the functionality of the standard Hive Metastore, but adds interaction with WANdisco Fusion when coordination and replication is required (i.e. for activities that result in writes against the metadata database used by Hive). Different versions of the Hive Metastore are supported.

Hive Metastore replication in a nutshell

  • WANdisco runs its own Metastore server instance that replaces the default server.
  • WANdisco only replicates write operations against the metastore database.
  • The WD Hive Metastore Plugin sends proposals into the WD Fusion core.
  • WD Fusion uses the Hive Metastore plugin to communicate directly with the metastore database.

Overview of Release

The WANdisco Hive Metastore provides functionality for the replication of Hive metadata and underlying table data as a plugin.

This section of the user guide describes the plugin's functionality, behaviour and user experience.

Core Use Case

The functionality will address the core use case of interacting with Hive as a data warehouse technology in environments where active-active replication of Hive information is required, including underlying table data. Changes made to Hive metadata and data are replicated between multiple participating Hadoop clusters.

Restrictions

Table data location

All files that hold the data for a given Hive table need to exist under a single root directory that can be replicated using Fusion. This is because there is a need for a single distributed state machine (DSM) to coordinate the activities of metadata changes with underlying table/partition content.

This limitation may be removed if/when Fusion adds the ability to coordinate multiple otherwise independent DSMs.
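As a hedged illustration of this restriction (names and paths are placeholders, following the pattern used in the Testing section later), keep the database and its tables under one replicated root:

  CREATE DATABASE sales LOCATION '/hive/databases/sales';
  USE sales;
  CREATE TABLE orders (id INT, amount DOUBLE);  -- table data is created under /hive/databases/sales/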

Replacement of Metastore implementation

In order to coordinate all actions across multiple Metastore servers, the current solution replaces each standard Metastore server with a minimally-modified version of it, as provided by WANdisco. If selected portions of Hive metadata require replication, the WANdisco Hive Metastore can operate in addition to the standard Metastore server.

Metastore Version

The Hive Metastore differs in implementation across versions. The WANdisco Hive Metastore provides versions to match Hive 0.13, 0.14, 1.1.0 and 1.2.1.

The Metastore version used by all participating replicated instances must match. Future versions of the Replicated Hive Metastore may allow replication between different versions of Hive.

Functionality Not Addressed

The following functionality does not exist in the 1.0 release of the WANdisco Hive Metastore:

  • Hive transactions are not supported.
  • Hive SQL Standard Based Authorization (which provides column-level access control granularity) is not supported, because this mode requires that HiveServer2 runs an embedded metastore.
  • Replication between different versions of Hive is not supported (some combinations might work, but would need to be specifically tested).
  • Table directories have to be under the database directory (or at least in the same replicated directory as the database).
  • Limitations related to known Hive issues:
    • HIVE-2573: Create Function is not replicated
    • HIVE-10719: Alter rename table does not rename the table

Core Functionality Provided

Metadata Replication

Changes made to Hive metadata are replicated between multiple participating Hadoop clusters, by coordinating all write operations that will affect the metastore database, and ensuring that these operations are performed in a consistent manner across all WANdisco Hive Metastore instances within a Fusion membership.

Consistency Check and Repair

The functionality of consistency check and repair provides the means to:

  • Determine if there are any inconsistencies between the hive metastore data across all zones for any given replicated folder/location that maps to a database or table within the hive metastore.
  • Identify which metastore data is inconsistent, i.e. which tables are missing from a database or which columns are different/missing in a table.
  • Allow the user to select a zone as the source of truth and then repair the metastore data based on that version of the metastore data.

Assumptions made for the operation of this feature include:

  • This feature will only cover checking and repairing the hive metastore data and not any inconsistencies in the data within the replicated folder. This will be the responsibility of the Fusion Server.

Key facilities of this feature are:

  • Provide the ability to request and return the current version of the database metadata for a particular location/replicated folder from each node within the membership of the supplied replicated folder / location.
  • Provide the ability to compare the current database metadata of all the nodes/zones and to create a list of inconsistencies, if any. This list of inconsistencies will need to be displayed to the user in the same way as inconsistencies in the files / sub folders of a replicated folder are currently displayed.
  • Provide the ability to accept a "source of truth" for each inconsistency and the ability to propose to change the data on all other nodes to match this "source of truth". Provide support for "bulk" proposals so that all inconsistencies can be repaired via a single proposal if they share the same "source of truth".

    Provide the ability to manage the CC&R process. This could be replaced by functionality in the Fusion Plugin ADK at a future date.

Testing

Here are some examples for testing basic functionality of the WAN Hive Metastore. They cover connecting, creating a replicated database, populating temporary table data, populating partitions of a table, and creating inconsistent data to test the consistency check and repair functionality.

How to start beeline and connect to Hive

You can use the hdfs user to prevent any permission issues:

  1. As the hdfs user, start beeline on the master node (e.g. vm0):
    beeline
  2. Connect to the Hive Metastore using the following command:
    !connect jdbc:hive2://hiveserver2_host:10000 hdfs
  3. You don't need a password here, so just press Enter.

How to create a replicated database within Hive

  1. Using the hdfs user, create a new HDFS folder on both clusters as a home for your test databases:
    hdfs dfs -mkdir -p /hive/databases
  2. Within the Fusion UI create a new replicated folder for /hive/databases.
  3. Start beeline and connect as above.
  4. To create your new test database, enter the following command:
    CREATE DATABASE test_01 LOCATION '/hive/databases/test_01';
    where test_01 is the database name you want to use.
  5. To check that the database has been created and replicated, connect to beeline on the master node of the other cluster using the instructions above, making sure to use the correct hiveserver2_host for that cluster. Then, on both clusters, run:
    SHOW DATABASES;
    This should display the default database and the new database you just created.

How to create and populate the temporary table

  1. This example assumes that you have a test data file containing a single string per line, placed in
    /usr/local/share/installers/Batting.csv.
  2. Start beeline and connect as above if you are not already connected.
  3. Set the test database you want to use for this test data.
    USE test_01;
  4. Create the temporary table for the batting data
    create table temp_batting (col_value STRING);
  5. Now load the test data into the temp_batting table:
    LOAD DATA LOCAL INPATH '/usr/local/share/installers/Batting.csv' OVERWRITE INTO TABLE temp_batting;
    This should replicate the data to the second cluster for you. Replace the location with the path of your uploaded Batting.csv file.
  6. To see the loaded data you can use
    SELECT * FROM temp_batting LIMIT 100;

How to create a partitioned table and populate it with data from the above temp_batting table

  1. Start beeline and connect as above if you are not already connected.
  2. Set the test database you want to use for this test table using
    USE test_01;
  3. Create the new empty table partitioned by year
    create table batting (player_id STRING,runs INT) PARTITIONED BY(year INT);
  4. Now load the new table with data from the temp_batting table:
    insert overwrite table batting PARTITION(year) SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id, regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year from temp_batting;
  5. The above step may take a little while because it has to create a MapReduce job to process the data.
  6. To see that the table has been populated with data run
    SELECT * FROM batting WHERE year='2000' LIMIT 100;

How to create inconsistent or missing data for testing CC and repair

  1. Create a new folder in HDFS for the location of your database on both clusters.
    hdfs dfs -mkdir /testing
    Warning
    Do not add this folder to Fusion as a replicated folder yet.
  2. On one of the clusters connect to beeline and create your test database.
    CREATE DATABASE testing_01 LOCATION '/testing/testing_01';
  3. Select this new database
    USE testing_01;
  4. Create a table within this database
    create table temp_batting (col_value STRING);
  5. Now load the test data into the temp_batting table
    LOAD DATA LOCAL INPATH '/usr/local/share/installers/Batting.csv' OVERWRITE INTO TABLE temp_batting;
    
    You need to replace the location of the Batting.csv file.
  6. Create the new empty table
    create table batting (player_id STRING,runs INT, year INT);
    
  7. Now load the new table with data from the temp_batting table
    insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id, regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year from temp_batting;
    
  8. Now add the '/testing' folder to Fusion as a replicated folder with the same membership of the two zones you created earlier.
  9. Both the HDFS data and the Hive Metastore data will be inconsistent, so you will be able to test the consistency check and repair functionality.
  10. To create inconsistent data, create the database and tables with whatever differences you want before adding the location of the database to Fusion as a replicated folder.