WD Fusion Plugin: Hive Metastore

1. Introduction

The Hive Metastore plugin enables WD Fusion to replicate Hive's metastore, allowing WD Fusion to maintain a replicated instance of Hive's metadata and, in future, support Hive deployments that are distributed between data centers.

Release Notes

See the 1.0 WD Hive Metastore Plugin Release Notes for the latest information.

2. Installation

Prerequisites

Along with the default requirements that you can find on the WD Fusion Deployment Checklist, you also need to ensure that the Hive service is already running on your server. Installation will fail if the WD Fusion Plugin can't detect that Hive is already running.
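
You can sanity-check this before launching the installer. The commands below are an illustrative way to confirm that a Hive Metastore process is up; 9083 is the stock Hive Metastore port, so adjust it if your deployment uses a different one.

    # Look for a running Hive Metastore process
    ps aux | grep -i '[H]iveMetaStore'

    # Confirm something is listening on the metastore port (9083 is the Hive default)
    netstat -tlnp | grep 9083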

  1. Download the installer script fusion-ui-server-hdp-hive_deb_installer.sh, etc., from WANdisco's FD website.
    In this early version of Hive Metastore, the Hive Metastore plugin is provided as a full-blown installer that installs WD Fusion with the Hive Metastore replication plugin already built in.
  2. Navigate to the extracted files.
  3. Run through the installer:
    Saving to: `fusion-ui-server-hdp-hive_deb_installer.sh'
    
    100%[===============================================================================================================>] 1,635,783,053 8.76M/s   in 8m 18s
    
    2016-06-15 10:53:47 (3.13 MB/s) - `fusion-ui-server-hdp-hive_deb_installer.sh' saved [1635783053/1635783053]
    
    root@dc01-vm1:~# bash fusion-ui-server-hdp-hive_deb_installer.sh
  4. The installer will first perform a health check and confirm that there is sufficient Java heap to support an installation.
    Installing WD Fusion
    Verifying archive integrity... All good.
    Uncompressing WANdisco Fusion........................
     
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
     
     
     
     
    Welcome to the WANdisco Fusion installation
    
    
    
    You are about to install WANdisco Fusion version 2.8-19
    
    Do you want to continue with the installation? (Y/n)
    Enter "Y" to continue.
  5. The installer checks that both Perl and Java are installed on the system.
      Checking prerequisites:
     
    Checking for perl: OK
    Checking for java: OK
    
    No packages found matching fusion-*.
    
    INFO: Using the following Memory settings for the WANdisco Fusion Admin UI process:
     
    INFO: -Xms128m -Xmx512m
     
    Do you want to use these settings for the installation? (Y/n)
    
    
    Enter "Y" or "N" if you wish to set different Java heap settings.
  6. The installer asks you to confirm which TCP port will be used for accessing the Fusion/Big Replicate web UI; the default is "8083".
    Which port should the UI Server listen on?  [8083]:
    Please specify the appropriate platform from the list below:
     
    [0] hdp-2.1.0
    [1] hdp-2.2.0
    [2] hdp-2.3.0
    [3] hdp-2.4.0
     
    Which fusion platform do you wish to use? 1
    You chose hdp-2.2.0:2.6.0.2.2.0.0-2041
    Select from the available Hadoop packages.
  7. Next, you set the system user and group for running the application.
    We strongly advise against running Fusion as the root user.
     
    For default CDH setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
     
    Which user should Fusion run as? [hdfs]
    Checking 'hdfs' ...
     ... 'hdfs' found.
     
    Please choose an appropriate group for your system. By default CDH uses the 'hdfs' group.
     
    Which group should Fusion run as? [hdfs]
    Checking 'hdfs' ...
     ... 'hdfs' found.
    Press Enter to accept the default "hdfs".
  8. You will now be shown a summary of the settings that you have provided so far:
    Installing with the following settings:
    
    User and Group:                     hdfs:hadoop
    Hostname:                           dc01-vm1.bdva.wandisco.com
    Fusion Admin UI Listening on:       0.0.0.0:8083
    Fusion Admin UI Minimum Memory:     128
    Fusion Admin UI Maximum memory:     512
    Platform:                           hdp-2.2.0 (2.6.0.2.2.0.0-2041)
    Fusion Server Hostname and Port:    dc01-vm1.bdva.wandisco.com:8082
     
    Do you want to continue with the installation? (Y/n)
    Enter "Y" unless you need to make changes to any of the settings.
  9. The installation will now complete:
    Adding the user hdfs to the hive group if the hive group is present.
    Installing hdp-2.2.0 packages:
      fusion-hcfs-hdp-2.2.0-server_2.8-SNAPSHOT-1854_all.deb ... Done
      fusion-hcfs-hdp-2.2.0-ihc-server_2.8-SNAPSHOT-1854_all.deb ... Done
    Installing plugin packages:
      wd-hive-plugin-hdp-2.2.0_1.0-SNAPSHOT-480_all.deb ... Done
    Installing fusion-ui-server package:
      fusion-ui-server_2.8-19_all.deb ...Selecting previously unselected package fusion-ui-server.
    (Reading database ... 56307 files and directories currently installed.)
    Unpacking fusion-ui-server (from .../fusion-ui-server_2.8-19_all.deb) ...
    Setting up fusion-ui-server (2.8-19) ...
     Done
  10. Once the installation has completed, you need to configure the WD Fusion server using the browser-based UI.
    Starting fusion-ui-server:                                 [  OK  ]
    Checking if the GUI is listening on port 8083: .....Done
     
    Please visit http://node.hostname.com:8083/ to complete installation of WANdisco Fusion
     
    If 'your.hostname.server.com' is internal or not available from your browser, replace
    this with an externally available address to access it.
     
    Stopping fusion-ui-server:.                                [  OK  ]
    Starting fusion-ui-server:                                 [  OK  ]
    Open a browser and enter the provided URL, or IP address.
  11. Configure WD Fusion through a browser

  12. Follow this section to complete the installation by configuring WD Fusion using a browser-based graphical user interface.

    Silent Installation
    For large deployments it may be worth using the Silent Installation option.

    Open a web browser and point it at the provided URL, e.g.
    http://<YOUR-SERVER-ADDRESS>.com:8083/
  13. In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
    Make your selection as follows:
    Adding a new WD Fusion cluster
    Select Create a new Zone.
    Adding additional WD Fusion servers to an existing WD Fusion cluster
    Select Add to an existing Zone.

    High Availability for WD Fusion / IHC Servers

    It's possible to enable High Availability in your WD Fusion cluster by adding additional WD Fusion/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient WD Fusion/IHC servers running to maintain replication.

    Add HA nodes to the cluster by running the installer and choosing Add to an existing Zone, using a new node name.

    Configuration for High Availability
    When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, problems will occur. WD Fusion should never see the same URI (scheme + authority) for two different clusters.
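
    For example, two zones might use distinct authorities in their respective core-site.xml files; the hostnames below are illustrative:

    <!-- Zone 1 core-site.xml -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://dc01-nameservice</value>
    </property>

    <!-- Zone 2 core-site.xml: note the different authority -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://dc02-nameservice</value>
    </property>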

    [Screenshot: WD Fusion Deployment - Welcome]

  14. Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
    [Screenshot: WD Fusion Deployment - Environmental checks]

    On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.

    [Screenshot: WD Fusion Deployment - Example check results]

    Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation and not for production. When installing for production, however, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating them.

  15. Upload the license file.
    [Screenshot: WD Fusion Deployment - Upload your license file]

  16. The conditions of your license agreement will be presented in the top panel, including License Type, Expiry Date, Name Node Limit and Data Node Limit.
    [Screenshot: WD Fusion Deployment - Verify license and agree to subscription agreement]

    Click I agree to the EULA to continue, then click Next Step.
  17. Enter settings for the WD Fusion server.
    [Screenshot: WD Fusion Deployment - screen 4, Server settings]

    WD Fusion Server

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Fusion server.
    Umask (currently 022)
    Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up and restart. (A worked example follows this list.)
    Latitude
    The north-south coordinate angle for the installation's geographical location.
    Longitude
    The east-west coordinate angle for the installation's geographical location. The latitude and longitude are used to place the WD Fusion server on a global map, to aid coordination in a widely distributed cluster.
    Alternatively, you can click on the global map to locate the node.
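
    As a worked example of the umask arithmetic above, you can reproduce the resulting permissions in a shell; the directory and file names are illustrative:

    umask 022          # subtracted from the creation mode
    mkdir demo_dir     # directories start from 777: 777 - 022 = 755 (drwxr-xr-x)
    touch demo_file    # files start from 666:       666 - 022 = 644 (-rw-r--r--)
    ls -ld demo_dir demo_file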

    Advanced options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    Custom UI hostname
    Lets you set a custom hostname for the Fusion UI, distinct from the communication.hostname, which is already set as part of the install and used by WD Fusion nodes to connect to the Fusion server.
    Custom UI Port
    Lets you change the WD Fusion UI's default port, in case it is already assigned elsewhere, e.g. Cloudera's Headlamp debug server also uses it.
    Strict Recovery
    See explanation of the Strict Recovery Advanced Options.

    Enable SSL for WD Fusion

    Tick the checkbox to enable SSL. [Screenshot: WD Fusion Deployment]

    KeyStore Path
    System file path to the keystore file.
    e.g. /opt/wandisco/ssl/keystore.ks
    KeyStore Password
    Encrypted password for the KeyStore.
    e.g. ***********
    Key Alias
    The Alias of the private key.
    e.g. WANdisco
    Key Password
    Private key encrypted password.
    e.g. ***********
    TrustStore Path
    System file path to the TrustStore file.
    e.g. /opt/wandisco/ssl/keystore.ks
    TrustStore Password
    Encrypted password for the TrustStore.
    e.g. ***********
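
    For reference, a keystore like the one above can be created and inspected with the JDK's keytool; the alias, path and validity below are illustrative only:

    keytool -genkeypair -alias WANdisco -keyalg RSA -keysize 2048 \
        -keystore /opt/wandisco/ssl/keystore.ks -validity 365
    keytool -list -keystore /opt/wandisco/ssl/keystore.ks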

    IHC Server

    [Screenshot: WD Fusion - IHC Settings]

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
    IHC network interface
    The hostname for the IHC server.

    Advanced Options (optional)

    IHC server binding address
    In the advanced settings you can decide which address the IHC server will bind to. The address is optional; by default the IHC server binds to all interfaces (0.0.0.0), using the port specified in the ihc.server field. In all cases the port should be identical to the port used in the ihc.server address, i.e. in /etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc or /etc/wandisco/fusion/ihc/server/localfs-2.7.0/2.7.0.ihc
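
    As a sketch only, an .ihc file of the kind referenced above might contain entries along these lines; the hostname, port and the bind-address property name are assumptions, so check your installed file for the exact keys:

    # Illustrative .ihc entries - verify property names in your own file
    ihc.server=dc01-vm1.bdva.wandisco.com:7000
    # Optional bind address; 0.0.0.0 binds all interfaces. The port must match ihc.server.
    ihc.server.bind=0.0.0.0:7000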
    Once all settings have been entered, click Next step.
  18. Next, you will enter the settings for your new Zone.
    [Screenshot: WD Fusion Deployment - New Zone]

    Zone Information

    Entry fields for zone properties:

    Fully Qualified Domain Name
    The full hostname for the server.
    Node ID
    A unique identifier that will be used by WD Fusion UI to identify the server.
    Location Name (optional)
    A location name that can quickly identify where the server is located.

    Induction failure
    If induction fails, attempting a fresh installation may be the most straightforward cure; however, it is possible to push through an induction manually, using the REST API. See Handling Induction Failure.

    Known issue with Location names
    You must use different Location names/Node IDs for each zone. If you use the same name for multiple zones, you will not be able to complete the induction between those nodes.

    DConE Port
    TCP port used by WD Fusion for replicated traffic.
    Zone Name
    The name used to identify the zone in which the server operates.
    Management Endpoint
    Select the Hadoop manager that you are using, e.g. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager:

    Advanced Options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    URI Selection

    The default behavior for WD Fusion is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI and can only use the HDFS protocol. Each option is explained below:

    Use HDFS URI with HDFS file system
    URI Option A
    This option is available for deployments where the Hadoop applications support neither the WD Fusion URI nor the HCFS standards. WD Fusion operates entirely within HDFS.

    This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs://, or with no scheme, that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
    Use WD Fusion URI with HCFS file system
    URI Option B
    When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Fusion URI or are not written to the HCFS specification, this option will not work.

    MapR deployments
    Use this URI selection if you are installing into a MapR cluster.

    Use Fusion URI with HDFS file system
    URI option C
    This differs from the default in that while the WD Fusion URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the WD Fusion URI but not the Hadoop Compatible File System.
    Use Fusion URI and HDFS URI with HDFS file system
    URI Option D
    This "mixed mode" supports all the replication schemes (fusion://, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.

    Fusion Server API Port

    This option lets you select the TCP port that is used for WD Fusion's API.

    Strict Recovery

    Two advanced options are provided to change the way that WD Fusion responds to a system shutdown where WD Fusion was not shut down cleanly. Currently the default setting is not to enforce a panic event in the logs if, during startup, we detect that WD Fusion wasn't shut down cleanly. This is suitable for using the product as part of an evaluation effort. However, when operating in a production environment, you may prefer to enforce the panic event, which will stop any attempted restarts to prevent possible corruption to the database.

    DConE panic if dirty (checkbox)
    This option lets you enable the strict recovery option for WANdisco's replication engine, to ensure that any corruption to its prevayler database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shutdown, either due to a system or application problem.
    App Integration panic if dirty (checkbox)
    This option lets you enable the strict recovery option for WD Fusion's database, to ensure that any corruption to its internal database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shutdown, either due to a system or application problem.

    <Hadoop Manager e.g. Ambari> Configuration

    This section configures WD Fusion to interact with the management layer, which could be Ambari or Cloudera Manager, etc.

    Manager Host Name /IP
    The full hostname or IP address for the working server that hosts the Hadoop manager.
    Port
    TCP port on which the Hadoop manager is running.
    Username
    The username of the account that runs the Hadoop manager.
    Password
    The password that corresponds with the above username.
    SSL
    (Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't update from the default http port.

    Authentication without a management layer
    WD Fusion normally uses the authentication built into your cluster's management layer, i.e. the Cloudera Manager username and password are required to log in to WD Fusion. However, in cloud-based deployments, such as Amazon's S3, there is no management layer. In this situation, WD Fusion adds a local user to WD Fusion's ui.properties file, either during the silent installation or through the command line during an installation.
    Should you forget these credentials, see Reset internally managed password

  19. Enter security details, if applicable to your deployment.
    Kerberos

    Kerberos Configuration

    In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.

    Enabling Kerberos authentication on WD Fusion's REST API
    When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the "--negotiate" and "-u:" options, like so:

    curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"
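
    Before making the call above, obtain a Kerberos ticket for an appropriate user; the principal below is illustrative:

    kinit hdfs@YOUR.REALM.COM
    klist    # confirm that a valid ticket was granted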

    See Setting up Kerberos for more information about Kerberos setup.

  20. Click Validate to confirm that your settings are valid. Once validated, click Next step.
    [Screenshot: WD Fusion Deployment - Zone information]

  21. The remaining panels in step 6 detail all of the installation settings. All your license, WD Fusion server, IHC server and zone settings are shown. If you spot anything that needs to be changed, you can click to go back.
    [Screenshot: WD Fusion Deployment - Summary]

    Once you are happy with the settings and all your WD Fusion clients are installed, click Deploy Fusion Server.
  22. WD Fusion Client Installation

  23. In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support the ingestion of data to WD Fusion nodes.
    [Screenshot: WD Fusion Deployment - Client installations]

    Known Issue: For IBM Big Replicate/IBM Hive deployments, there is currently a limitation that prevents the deployment of the Ambari stack, through WD Fusion UI. The "Deploy Stack" button will not work and you will see a "Failed to deploy Ambari stack: Stack service does not exist..." warning. The work-around for this is to complete the installation of the Stack through Ambari. FUS-3038

    Currently you may need to manually install the WANdisco Fusion client files through Ambari. Complete the installation of the WD Fusion stack using Ambari's stack screen.
    [Screenshot: WD Fusion Deployment - Ambari manual client installation]

    The installer supports three different packaging systems for installing clients: regular RPMs, Parcels for Cloudera, and HDP Stack for Hortonworks/Ambari.

    Client package location
    You can find the packages in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages
    /opt/wandisco/fusion-ui-server/ui/stack_packages
    /opt/wandisco/fusion-ui-server/ui/parcel_packages
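
    For example, you can list what the installer staged and, where a manual client install is required, install the appropriate package on each node; the package name below is illustrative and will vary by distribution and version:

    ls /opt/wandisco/fusion-ui-server/ui/client_packages

    # Manual RPM client install on a node (illustrative package name)
    rpm -ivh fusion-hcfs-hdp-2.2.0-client-*.rpm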

    Important! If you are installing on Ambari 1.7 or CDH 5.3.x
    Additionally, due to a bug in Ambari 1.7, and an issue with the classpath in CDH 5.3.x, before you can continue you must log into Ambari/Cloudera Manager and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.

  24. We now handle the configuration of the Hive Metastore Plugin, which is integrated into WD Fusion at this point rather than in a separate post-installation step.
    [Screenshot: WD Fusion Deployment - Hive Metastore plugin installation, sub-step 1]

    The installer performs some basic validation, checking the following criteria:

    Manager Validation
    Checks that the system is being configured with valid distribution manager support. In this example, "AMBARI" should be detected. Should this validation check fail, check that you entered the correct Manager details in Step 5.
    Hive Service installed Validation
    The installer will check that Hive is running on the server. Should the validation check fail, you should check that Hive is running.

    Configuration

    During the installation you need to enter the following properties:

    Hive Metastore host
    The hostname for the Hive Metastore service.
    Hive Metastore port
    The TCP port that will be used by the Hive Metastore service. Default: 9084
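
    Once these values are set, a simple reachability check against the configured port can confirm the endpoint; the hostname placeholder and choice of tool are illustrative:

    nc -z -v <HIVE-METASTORE-HOST> 9084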
  25. In this step you need to copy over and unpack the Hive services to the service directory of your Ambari installation, as sketched at the end of this step.
    [Screenshot: WD Fusion Deployment - Hive Metastore plugin installation, sub-step 2]

    If you check Ambari, providing the new packages are correctly put in place, you will see them listed. Do not enable them through Ambari; they will be installed later.
    [Screenshot: WD Fusion Deployment - Hive Metastore plugin installation, check the service packages have been picked up]

    Important: You should see that the package for WD Hive Metastore is now listed in Ambari. Do NOT enable the package at this time. WD Hive Metastore needs to be installed through steps that appear later.
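
    As a sketch under stated assumptions, copying the stack into Ambari's services directory typically looks like the following; the archive name and HDP version path are illustrative and must match your environment:

    # Copy and unpack the WD Hive Metastore service into Ambari's stack directory
    cp /opt/wandisco/fusion-ui-server/ui/stack_packages/wd-hive-metastore-*.tar.gz \
        /var/lib/ambari-server/resources/stacks/HDP/2.2/services/
    cd /var/lib/ambari-server/resources/stacks/HDP/2.2/services/
    tar -xzf wd-hive-metastore-*.tar.gz

    # Restart Ambari server so the new service definition is picked up
    ambari-server restart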
  26. At the end of this step, we address a possible problem that you may have in connecting WD Fusion to a remote Hive Metastore database. Please note that the following MySQL query is only applicable to Ambari/IBM Big Replicate installations.
    GRANT ALL PRIVILEGES ON *.* TO 'hive'@'<HOSTNAME-FOR-HIVE-METASTORE-SERVICE-NODE>'
    IDENTIFIED BY '<hive database password>' WITH GRANT OPTION;
  27. The next step handles the plugin's installation:
    [Screenshot: WD Fusion Deployment - Hive Metastore plugin installation, sub-step 3]

    When you have confirmed that the stack files are in place, on the installer screen, click Next.

    Summary

    The summary confirms the values of the entries you provided in the first sub-step of the WANdisco Hive Metastore installation section.

    To begin the installation of the Plugin, click Start Install.

    Ambari-based installation

    The following notes explain what is happening during each phase of the installation into an Ambari-based cluster:

    Metastore Service Install
    This step handles the installation of the WD Hive Metastore Service into Ambari.
    Hive Metastore Template Install
    Install the WANdisco Hive Metastore Service Template on Ambari.
    Configure Hive Configuration Files
    Symlink the Hive configuration file into the Fusion Hive Metastore plugin.
    Update Hive Configuration
    Updates the URIs for Hive connections in Ambari.
    Restart Hive Service
    Restarts Hive Service in Ambari. Note this process can take several minutes to complete. Please don't make any changes or refresh your installer's browser session.
    Restart WD Hive Metastore Service
    Restarts Hive Metastore Service in Ambari. Note this process can take several minutes to complete.
    Restart WD HiveServer2 Service
    Restart HiveServer2 Service in Ambari. Note this process can take several minutes to complete.

    Cloudera-based installation

    The following notes explain what is happening during each phase of the installation into a CDH-based cluster:

    Fusion Hive parcel distribution and activation
    Distribute and activate Fusion Hive parcels.
    Hive-site Setup
    Retrieve and set up hive-site.xml for use with WD Fusion.
    Fusion Hive service descriptor
    Install Fusion Hive service descriptor.
    Known Issue: Cloudera-based deployments only
    When installing the Hive Metastore plugin, you must ensure that the folder /etc/wandisco/hive is created on the server onto which you are installing the plugin, and that it can be written to by the system account running the Hive Metastore plugin installer (see the sketch after this list).
    Fusion Hive service setup
    Install Fusion Hive service.
    Cloudera metastore configuration
    Configuring Cloudera to use Fusion Hive metastore.
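
    Referring to the Cloudera known issue above, creating the folder ahead of time is a one-liner; the owning account below is an assumption, so substitute the account that actually runs the plugin installer:

    mkdir -p /etc/wandisco/hive
    chown hdfs:hadoop /etc/wandisco/hive    # illustrative account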
    The Hive Metastore Consistency Checker only covers checking and repairing the Hive Metastore data, not any inconsistencies in the data within the replicated folder; those remain the responsibility of the Fusion Server's main Consistency Check tool.

3. Installing on a Kerberized cluster

The Installer lets you configure WD Fusion to use your platform's Kerberos implementation. You can find supporting information about how WD Fusion handles Kerberos in the Admin Guide, see Setting up Kerberos.

4. Technical Glossary

Architecture

The following diagram provides a simplified view of how WANdisco's Hive Metastore plugin sits between your Hive deployment and WD Fusion.

[Diagram: WD Hive Metastore Example]

Hive Metastore replication in a nutshell

  • WANdisco runs its own Metastore server instance that replaces the default server.
  • WANdisco only replicates write operations against the metastore database.
  • The WD Hive Metastore Plugin sends proposals into the WD Fusion core.
  • WD Fusion uses the Hive Metastore plugin to communicate directly with the metastore database.

Known Issue: Hive Metastore database password is encrypted

You may see an error in the WD Fusion log file or in the WD Hive Metastore log:

Error getting metastore password: null

Workaround

    The steps to fix the password in the config files are as follows:
  1. Find out what the password is. The password is a random character string. You can find it from the Hive configuration in Cloudera Manager using the API:
    # curl -u YOUR_CM_USERNAME:YOUR_CM_PASSWORD http://YOUR_CM_HOSTNAME:7180/api/v11/clusters/DC-1/services/hive1/config
    curl -u admin:admin http://cert01-vm0:7180/api/v11/clusters/DC-1/services/hive1/config
    The password property name is hive_metastore_database_password.
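
    To pick the property out of the JSON response, a simple filter is enough; grep is shown for convenience, and any JSON tool would work:

    curl -s -u admin:admin http://cert01-vm0:7180/api/v11/clusters/DC-1/services/hive1/config \
        | grep -A 1 hive_metastore_database_password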
  2. Create the local keystore file for the password
    hadoop credential create javax.jdo.option.ConnectionPassword -provider localjceks://file/etc/wandisco/hive/creds.localjceks
    # By default the file has rights -rwx------, so it has to be owned by the user who runs the Fusion or WD Hive Metastore Server process
    chown hdfs /etc/wandisco/hive/creds.localjceks
    # If Fusion and WD Hive Metastore Server run on the same server but under different users, the file has to be made readable by both
    chmod 644 /etc/wandisco/hive/creds.localjceks
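
    To confirm the alias was stored, you can list the provider's contents with the same hadoop credential tool:

    hadoop credential list -provider localjceks://file/etc/wandisco/hive/creds.localjceks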
  3. Update the config file with the proper password location
    vi /etc/wandisco/hive/hive-site.xml
    Set the property hadoop.security.credential.provider.path to localjceks://file/etc/wandisco/hive/creds.localjceks (the property will already exist in the file).
     
    <property>
        <name>hadoop.security.credential.provider.path</name>
        <value>localjceks://file/etc/wandisco/hive/creds.localjceks</value>
    </property>