logo

WANDISCO FUSION®
LIVE HIVE PLUGIN

1. Welcome

1.1. Product overview


The Fusion Plugin for Live Hive enables WANdisco Fusion to replicate Hive’s metastore, allowing WANdisco Fusion to maintain a replicated instance of Hive’s metadata and, in future, support Hive deployments that are distributed between data centers.

1.2. Documentation guide

This guide contains the following:

Welcome

this chapter introduces this user guide and provides help with how to use it.

Release Notes

details the latest software release, covering new features, fixes and known issues to be aware of.

Concepts

explains how the Fusion Plugin for Live Hive, through WANdisco Fusion, uses WANdisco’s Live Data platform.

Installation

covers the steps required to install and set up Fusion Plugin for Live Hive into a WANdisco Fusion deployment.

Operation

the steps required to run, reconfigure and troubleshoot Fusion Plugin for Live Hive.

Reference

additional Fusion Plugin for Live Hive documentation, including documentation for the available REST API.

1.2.1. Admonitions

In this guide we highlight types of information using the following callouts:

The alert symbol highlights important information.
The STOP symbol cautions you against doing something.
Tips are principles or practices that you’ll benefit from knowing or using.
The KB symbol shows where you can find more information, such as in our online Knowledgebase.

1.3. Contact support

See our online Knowledgebase which contains updates and more information.

If you need more help, raise a case on our support website.

1.4. Give feedback

If you find an error or if you think some information needs improving, raise a case on our support website or email docs@wandisco.com.

2. Release Notes - Live Hive Plugin 1.0 Build 12

16 March 2018

The Fusion Plugin for Live Hive extends WANdisco Fusion by replicating Apache Hive metadata. With it, WANdisco Fusion maintains a Live Data environment including Hive content, such that applications can access, use and modify a consistent view of data everywhere, spanning platforms and locations, even at petabyte scale. WANdisco Fusion ensures the availability and accessibility of critical data everywhere.

The 1.0 release of the Fusion Plugin for Live Hive is the first release of this new capability, taking advantage of the extensible architecture of WANdisco Fusion to provide an alternative to the existing Fusion Hive Metastore Plugin.

WANdisco recommends that you consult our implementation services to plan the effective introduction of this significant new feature to your Live Data environment.

2.1. Available Packages

This release of the Fusion Plugin for Live Hive supports deployment into WANdisco Fusion 2.11.1 or greater for HDP and CDH Hadoop clusters:

  • CDH 5.12.0 - CDH 5.13.0

  • HDP 2.6.0 - HDP 2.6.4

2.2. Installation

The Fusion Plugin for Live Hive supports an integrated installation process that allows it to be added to an existing WANdisco Fusion deployment. Consult the Installation Guide for details.

2.3. What’s New

This is the first release of the Fusion Plugin for Live Hive, intended to provide enhanced functionality over the Fusion Hive Metastore Plugin. The most notable enhancements are:

New Replication Architecture

The Fusion Plugin for Live Hive adopts a proxy-based architecture that allows the existing Hive Metastore to remain in place without any configuration change. Applications using Hive are directed to the plugin’s Thrift proxy via the hive.metastore.uris configuration property.

Table-level selective replication

By coordinating and replicating operations performed against the standard network interface of the Hive Metastore, the Fusion Plugin for Live Hive allows for selective replication of Hive constructs that provides finer control over replicated Hive data and metadata. Take advantage of regular expressions to match specific Hive content that you want to replicate at either table or database level.

Consistency check specific tables

The Fusion Plugin for Live Hive allows you to perform a consistency check of individual tables to focus on data that are relevant when confirming that replication is effective.

Repair specific tables

Repair individual tables instead of an entire Hive database if required.

2.4. Known Issues

  • LHV-238 - The Live Hive Plugin requires common Hadoop distributions and versions to be in place for all replicated zones.

  • LHV-341 - The proxy for the Hive Metastore must be deployed on the same host as the Fusion server.

  • LHV-342 - The Fusion Plugin for Live Hive does not provide for the removal of a Hive Regex rule.

  • LHV-344 - Replication rules for Hive data locations generated by the Fusion Plugin for Live Hive cannot be edited.

  • LHV-343 - Databases are replicated to all zones on creation regardless of Hive Regex rules.

  • LHV-219 - Consistency check and repair cannot be performed at the level of a Hive Regex rule, but must be performed per-table.

  • LHV-480 - It’s currently not possible to replicate metadata for tables created using the CTAS "create table as select" method.

  • LHV-486 - Unable to trigger repairs to metadata on nodes where the database under repair is absent. The workaround is to ensure that you trigger repairs from a node on which the database is present.

2.5. Planned Enhancements

In addition to the known issues listed above, WANdisco maintains work on planned enhancements for the Fusion Plugin for Live Hive. These include:

  • Selective replication of tables and databases by location in addition to name

  • Replication of existing tables and databases

  • Selective replication of partitions and indices by location

  • Selective replication of Hive functions

  • Scale and manageability enhancements to accommodate extremely large Hive environments

  • Exclusion and inclusion pattern matching for selective replication

  • Supported cross-version and distribution compatibility


3. Concepts

3.1. Product concepts

Familiarity with product and environment concepts will help you understand how to use the Fusion Plugin for Live Hive. Learn the following concepts to become proficient with replication.

Apache Hive

Hive is a data warehousing technology for Apache Hadoop. It is designed to offer an abstraction that supports applications that want to use data residing in a Hadoop cluster in a structured manner, allowing ad-hoc querying, summarization and other data analysis tasks to be performed using high-level constructs, including Apache Hive SQL queries.

Hive Metadata

The operation of Hive depends on the definition of metadata that describes the structure of data residing in a Hadoop cluster. Hive organizes its metadata into structures of its own, including definitions of Databases, Tables, Partitions, and Buckets.

Apache Hive Type System

Hive defines primitive and complex data types that can be assigned to data as part of the Hive metadata definitions. These are primitive types such as TINYINT, BIGINT, BOOLEAN, STRING, VARCHAR, TIMESTAMP, etc. and complex types like Structs, Maps, and Arrays.

Apache Hive Metastore

The Apache Hive Metastore is a stateless service in a Hadoop cluster that presents an interface for applications to access Hive metadata. Because it is stateless, the metastore can be deployed in a variety of configurations to suit different requirements. In every case, it provides a common interface for applications to use Hive metadata.

The Hive Metastore is usually deployed as a standalone service, exposing an Apache Thrift interface by which client applications interact with it to create, modify, use and delete Hive Metadata in the form of databases, tables, etc. It can also be run in embedded mode, where the metastore implementation is co-located with the application making use of it.

WANdisco Fusion Live Hive Proxy

The Live Hive Proxy is a WANdisco service that is deployed with Live Hive, acting as a proxy for applications that use a standalone Hive Metastore. The service coordinates actions performed against the metastore with actions within clusters in which associated Hive metadata are replicated.

Hive Client Applications

Client applications that use Apache Hive interact with the Hive Metastore, either directly (using its Thrift interface), or indirectly via another client application such as Beeline or Hiveserver2.

Hiveserver2

A service that exposes a JDBC interface for applications that want to use it for accessing Hive. These could include standard analytic tools and visualization technologies, or the Hive-specific CLI called Beeline.

Hive applications determine how to contact the Hive Metastore using the Hadoop configuration property hive.metastore.uris.
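
As a quick sanity check, you can inspect the value that Hive clients will use. A minimal sketch, assuming the typical client configuration path /etc/hive/conf/hive-site.xml (adjust for your distribution):

# Show the configured metastore URI(s) that Hive clients will contact;
# once the plugin is installed this should point at the Live Hive proxy
grep -A 1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml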

Hiveserver2 Template

A template service that amends the hiveserver2 config so that it no longer uses the embedded metastore, and instead correctly references the hive.metastore.uris parameter that points to our "external" Hive Metastore server.

Regex rules

Regular expressions that are used to describe the Hive metadata to be replicated.

WANdisco Fusion plugin

The Fusion Plugin for Live Hive is a plugin for WANdisco Fusion. Before you can install it you must first complete the installation of the core WANdisco Fusion product. See WANdisco Fusion user guide.

Get additional terms from the Big Data Glossary.

3.2. Product architecture

The native Hive Metastore is not replaced. Instead, the Live Hive Plugin runs as a proxy server that relays commands from connected clients (e.g. Beeline) to the original metastore on the cluster.

The Fusion Plugin for Live Hive proxy passes read commands directly to the local Hive Metastore, while the WANdisco Fusion Live Hive Plugin coordinates write commands, so that the metastores on all clusters perform the write operations, such as table creation. Live Hive will also automatically start to replicate Hive tables when their names match a user-defined rule.

Hive Plugin Architecture
Figure 1. Live Hive Plugin Architecture
1 Write access needs to be co-ordinated by Fusion before executing the command on the metastore.
2 Read Commands are 'passed-through' straight to the metastore as we do not need to co-ordinate via Fusion.
3 Makes connection to the metastore on the cluster.

3.2.1. Limitations

Membership changes

There is currently no support for dynamic membership changes. Once installed on all Fusion nodes, the Live Hive Plugin is activated. See Activate Live Hive Plugin. During activation, the membership for replication is set and cannot be modified later. For this reason, it’s not possible to add new Live Hive Plugin nodes at a later time, including a High Availability node running an existing Live Hive proxy that wasn’t part of your original membership.

Any change to membership in terms of adding, removing or changing existing nodes will require a complete reinstallation of Live Hive.

Hive must be running at all zones

All participating zones must be running Hive in order to support replication. We’re aware that this currently prevents the popular use case of replicating between on-premises clusters and S3/cloud storage, where Hive is not running. We intend to remove this limitation in a future release.


4. Installation

4.1. Pre-requisites

An installation should only proceed if the following prerequisites are met on each Live Hive Plugin node:

  • Hadoop Cluster (CDH or HDP, meeting WANdisco Fusion requirements, see Checklist)

  • Hive installed, configured and running on the cluster

  • WANdisco Fusion 2.11.1 or later

4.2. Installation

4.2.1. Installer Options

The following section provides additional information about running the Live Hive installer.

Installation files

The Client step of the installer provides a list of available parcel/jar files for you to choose from. You need to select the files that correspond with your platform.

  • LIVE_HIVE_PROXY-2.11.0-el6.parcel

  • LIVE_HIVE_PROXY-2.11.0-el6.parcel.sha

  • LIVE_HIVE_PROXY-2.11.0.jar

Obtain the files so that you can distribute them to the appropriate hosts in your WANdisco Fusion deployment. The parcel files need to be saved to the Cloudera parcel repository directory (/opt/cloudera/parcel-repo), while the .jar file must be copied to the Local Descriptor Repository (default path /opt/cloudera/csd).
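
For example, if the files were downloaded to /tmp, placement might look like the following sketch (file names are taken from the list above; paths assume Cloudera defaults):

# Copy the parcel and its checksum into the Cloudera parcel repository
cp /tmp/LIVE_HIVE_PROXY-2.11.0-el6.parcel /opt/cloudera/parcel-repo/
cp /tmp/LIVE_HIVE_PROXY-2.11.0-el6.parcel.sha /opt/cloudera/parcel-repo/
# The Custom Service Descriptor jar goes to the Local Descriptor Repository
cp /tmp/LIVE_HIVE_PROXY-2.11.0.jar /opt/cloudera/csd/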

Installer Help

The bundled installer provides some additional functionality that lets you install selected components, which may be useful if you need to restore or replace a specific file. To review the options, run the installer with the help option, e.g.

[user@gmart01-vm1 ~]# ./live-hive-installer.sh help
Verifying archive integrity... All good.
Uncompressing WANdisco Hive Live.......................

This usage information describes the options of the embedded installer script. Further help, if running directly from the installer is available using '--help'. The following options should be specified without a leading '-' or '--'. Also note that the component installation control option effects are applied in the order provided.

Installation options
General options:
  help                             Print this message and exit

Component installation control:
  only-fusion-ui-client-plugin     Only install the plugin's fusion-ui-client component
  only-fusion-ui-server-plugin     Only install the plugin's fusion-ui-server component
  only-fusion-server-plugin        Only install the plugin's fusion-server component
  only-user-installable-resources  Only install the plugin's additional resources
  skip-fusion-ui-client-plugin     Do not install the plugin's fusion-ui-client component
  skip-fusion-ui-server-plugin     Do not install the plugin's fusion-ui-server component
  skip-fusion-server-plugin        Do not install the plugin's fusion-server component
  skip-user-installable-resources  Do not install the plugin's additional resources
[user@docs01-vm1 tmp]#
Standard help parameters
[user@docs01-vm1 tmp]# ./live-hive-installer.sh --help
Makeself version 2.1.5
 1) Getting help or info about ./live-hive-installer.sh :
  ./live-hive-installer.sh --help   Print this message
  ./live-hive-installer.sh --info   Print embedded info : title, default target directory, embedded script ...
  ./live-hive-installer.sh --lsm    Print embedded lsm entry (or no LSM)
  ./live-hive-installer.sh --list   Print the list of files in the archive
  ./live-hive-installer.sh --check  Checks integrity of the archive

 2) Running ./live-hive-installer.sh :
  ./live-hive-installer.sh [options] [--] [additional arguments to embedded script]
  with following options (in that order)
  --confirm             Ask before running embedded script
  --noexec              Do not run embedded script
  --keep                Do not erase target directory after running the embedded script
  --nox11               Do not spawn an xterm
  --nochown             Do not give the extracted files to the current user
  --target NewDirectory Extract in NewDirectory
  --tar arg1 [arg2 ...] Access the contents of the archive through the tar command
  --                    Following arguments will be passed to the embedded script

 3) Environment:
  LOG_FILE              Installer messages will be logged to the specified file

4.2.2. Cloudera-based steps

Run the installer

Obtain the Live Hive Plugin installer from WANdisco. Open a terminal session on your WANdisco Fusion node and run the installer as follows:

  1. Run the Live Hive Plugin installer on each host required:

    # sudo ./live-hive-installer.sh
  2. The installer will check for and install the components necessary for completing the installation:

    [user@docs-vm tmp]# sudo ./live-hive-installer.sh
    Verifying archive integrity... All good.
    Uncompressing WANdisco Live Hive.......................
    
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    
    You are about to install WANdisco Live Hive version 1.0.0
    
    Do you want to continue with the installation? (Y/n) y
      wd-live-hive-plugin-1.0.0.tar.gz ... Done
      live-hive-fusion-core-plugin-1.0.0-1079.noarch.rpm ... Done
      storing user packages in '/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive' ... Done
      live-hive-ui-server-1.0.0-dist.tar.gz ... Done
    	All requested components installed.
    	Go to your WANDisco Fusion UI Server to complete configuration.
Installer options
View the Installer Options section for details on additional installer functions, including the ability to install selected components.
Configure the Live Hive Plugin
  1. Open a session to your WANdisco Fusion UI. You will see a message confirming that the Live Hive Plugin has been detected. Click the Plugins link to review the Plugins page.

    Live Hive Plugin Architecture
    Figure 2. Live Hive Plugin Architecture
  2. The Fusion Plugin for Live Hive now appears in the list. Click the button labelled Install Now.

    Live Hive Plugin Architecture
    Figure 3. Live Hive Plugin Architecture
  3. The installation process runs through four steps that handle the placement of parcel files onto your Cloudera Manager server.

    Live Hive Plugin Architecture
    Figure 4. Live Hive Plugin Architecture
    Parcels
    Parcels need to be placed in the correct directory to make them available to the manager. To do this:

    Copy the paths for the .parcel and .parcel.sha files for your corresponding platform type,
    e.g. el6 (Enterprise Linux version 6).

    1. Download the Parcel packages to the Cloudera service directory (/opt/cloudera/parcel-repo/) on your node, e.g.

      ssh user@docs-cm.fusion.domain-name.com
      user@docs-cm.fusion.domain-name.com password 
      [user@docs-cm ~]$ sudo -i
      [user@docs-cm ~] cd /opt/cloudera/parcel-repo
      [user@docs-cm ~] wget <your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-1.0.0-SNAPSHOT-el6.parcel
      [user@docs-cm ~] wget <your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-1.0.0-SNAPSHOT-el6.parcel.sha
    2. Change the ownership of the parcel files to match up with Cloudera Manager, e.g.

      chown cloudera-scm:cloudera-scm LIVE_HIVE_PROXY-*
      [user@docs-cm ~]# ls -l
      total 1492884
      -rw-r--r-- 1 cloudera-scm cloudera-scm 1520997979 Jun 16  2017 CDH-5.11.0-1.cdh5.11.0.p0.34-el6.parcel
      -rw-r--r-- 1 cloudera-scm cloudera-scm         41 Aug 24 15:38 CDH-5.11.0-1.cdh5.11.0.p0.34-el6.parcel.sha
      -rw-r----- 1 cloudera-scm cloudera-scm      58207 Feb  7 14:41 CDH-5.11.0-1.cdh5.11.0.p0.34-el6.parcel.torrent
      -rw-r--r-- 1 cloudera-scm cloudera-scm    7087088 Feb  7 14:37 FUSION-2.11.example-cdh5.11.0-el6.parcel
      -rw-r--r-- 1 cloudera-scm cloudera-scm         41 Feb  7 14:37 FUSION-2.11.example-cdh5.11.0-el6.parcel.sha
      -rw-r----- 1 cloudera-scm cloudera-scm        454 Feb  7 14:41 FUSION-2.11.1.2.example-el6.parcel.torrent
      -rw-r--r-- 1 cloudera-scm cloudera-scm     544587 Feb  6 11:51 LIVE_HIVE_PROXY-1.0.0-el6.parcel
      -rw-r--r-- 1 cloudera-scm cloudera-scm         41 Feb  6 11:51 LIVE_HIVE_PROXY-1.0.0-el6.parcel.sha
    3. Copy the Custom Service Descriptor (LIVE_HIVE_PROXY-x.x.x.jar) file to the Local Descriptor Repository (normally /opt/cloudera/csd/) on your node. e.g.

      [user@docs-cm ~]# cd ...
      [user@docs-cm ~]# cd csd
      [user@docs-cm ~] wget http://<your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-1.0.0.jar
      Resolving <your-fusion-node.hostname>... 10.0.0.1
      Connecting to <your-fusion-node.hostname>.com|10.10.0.1|:8083... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 4041 (3.9K) [application/java-archive]
      Saving to: LIVE_HIVE_PROXY-1.0.0.jar
      
      100%[=============================================================================================>] 4,041       --.-K/s   in 0s
      
      2018-02-07 16:23:42 (279 MB/s) - LIVE_HIVE_PROXY-1.0.0.jar saved [4041/4041]
    4. Restart the Cloudera Manager server so that it can see the new parcel and jar, e.g.

      [user@docs-cm ~] service cloudera-scm-server restart
  4. The second installer screen handles Configuration. The first section validates existing configuration to ensure that Hive is set up correctly. Click the Validate button.

    Live Hive Plugin installation
    Figure 5. Live Hive Plugin installation - validation (Screen 2)
    Live Hive Proxy Port

    The port used by the Live Hive proxy. Default: 9090

    Hive Metastore URI

    The metastore(s) which the Live Hive proxy will send requests to.

    Add additional URIs by clicking the + Add URI button and entering additional URI / port information.
    Note: If you add additional URIs, you must complete the necessary information or remove them. You cannot have an incomplete line.

    Click on Next step to continue.

  5. Step 3 of the installation covers security. If you have not enabled Kerberos on your cluster, you will pass through this step without adding any additional configuration.

    Live Hive Plugin installation
    Figure 6. Live Hive Plugin installation - security disabled (Screen 3)

    If you enable Kerberos, you will need to supply your Kerberos credentials.

    Live Hive Plugin installation
    Figure 7. Live Hive Plugin installation - security enabled (Screen 3)
    Keytab file path

    The installer now validates that there is read access to the keytab that you specify here.

    Metastore Service Principal Name

    The installer validates whether there are valid principals in the keytab.

    Metastore Service Hostname

    Enter the hostname of your Hive Metastore service.

  6. The final step is to complete the installation. Click Start Install.

    Live Hive Plugin installation
    Figure 8. Live Hive Plugin installation summary - screen 4

    The following steps are carried out:

    Cloudera parcel distribution and activation

    Distributes and activates the Fusion Hive Plugin parcels in Cloudera Manager

    Update cluster HDFS configuration and redeploy

    Restarts the HDFS service and distributes client configurations for Fusion and Kerberos RPC privacy (if Kerberos is enabled)

    Install Fusion Hive Plugin service descriptor in Cloudera

    Installs the Fusion Hive Plugin service in Cloudera Manager

    Configure Impala (if installed)

    Configures Cloudera Impala to use the Fusion Hive Plugin proxy

    Configure Hive

    Configures Cloudera Hive to use the Fusion Hive Plugin proxy

    Restart Hive service

    Restarts the Hive service in Cloudera Manager to distribute the updated configurations

    Restart Fusion Server

    Completes the plugin installation and restarts the Fusion Server

  7. The installation will complete with a message "Live Hive installation complete!"

    Live Hive Plugin installation
    Figure 9. Live Hive Plugin installation - Completion

    Click Finish to close the Plugin installer screens.

Now advance to the Activation steps.

4.2.3. Ambari-based steps

Run the installer

Obtain the Live Hive Plugin installer from WANdisco. Open a terminal session on your WANdisco Fusion node and run the installer as follows:

  1. Run the Live Hive Plugin installer on each host required:

    # sudo ./live-hive-installer.sh
  2. The installer will check for and install the components necessary for completing the installation:

    [user@docs-vm tmp]# sudo ./live-hive-installer.sh
    Verifying archive integrity... All good.
    Uncompressing WANdisco Live Hive.......................
    
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    
    You are about to install WANdisco Live Hive version 1.0.0
    
    Do you want to continue with the installation? (Y/n) y
      wd-live-hive-plugin-1.0.0.tar.gz ... Done
      live-hive-fusion-core-plugin-1.0.0.noarch.rpm ... Done
      storing user packages in '/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive' ... Done
      live-hive-ui-server-1.0.0-dist.tar.gz ... Done
    All requested components installed.
    Go to your WANDisco Fusion UI Server to complete configuration.
Installer options
View the Installer Options section for details on additional installer functions, including the ability to install selected components.
Configure the Live Hive Plugin
  1. Open a session to your WANdisco Fusion UI. You will see a message confirming that the Live Hive Plugin has been detected. Click the Plugins link to review the Plugins page.

    Live Hive Plugin Architecture
    Figure 10. Live Hive Plugin Architecture
  2. The plugin live-hive-plugin now appears in the list. Click the button labelled Install Now.

    Live Hive Plugin Architecture
    Figure 11. Live Hive Plugin Parcel installation
    Live Hive Plugin installation
    Figure 12. Live Hive Plugin installation - Clients (Screen1)
    Stacks

    Stacks need to be placed in the correct directory to make them available to the manager. To do this:

    1. Download the service from the installer client download panel

    2. The services are .gz files that expand to the directories /LIVE_HIVE_PROXY and /LIVE_HIVESERVER2_TEMPLATE.

    3. For HDP, place the resulting directories in /var/lib/ambari-server/resources/stacks/HDP/<version>/services (see the sketch after these steps).

    4. Restart the Ambari server.
      Note: If using CentOS 6/RHEL 6, we recommend using the following command to restart:

      initctl restart ambari-server
    5. Check on your Ambari manager that the services are present.

      Stacks
      Figure 13. Stacks present
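
    A minimal sketch of steps 2-4, assuming a downloaded archive name and an HDP 2.6 stack path (both are assumptions; use the actual file name from the client download panel and your own stack version):

      # Expand the downloaded service package into the Ambari stacks directory
      # (archive name and HDP version are placeholders)
      tar -xzf /tmp/live-hive-proxy-service.tar.gz \
        -C /var/lib/ambari-server/resources/stacks/HDP/2.6/services/
      # Restart Ambari so it registers the new services
      ambari-server restart
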
  3. The second installer screen handles Configuration.

    Live Hive Plugin installation
    Figure 14. Live Hive Plugin installation - validation (Screen 2)
    Live Hive Proxy Port

    The port used by the Live Hive proxy. Default: 9090

    Hive Metastore URI

    The metastore(s) which the Live Hive proxy will send requests to.
    Add additional URIs by clicking the + Add URI button and entering additional URI / port information. Note: If you add additional URIs, you must complete the necessary information or remove them. You cannot have an incomplete line.

    Click on Next step to continue.

  4. Step 3 of the installation covers security. If you have not enabled Kerberos on your cluster, you will pass through this step without adding any additional configuration.

    Live Hive Plugin installation
    Figure 15. Live Hive Plugin installation - security disabled (Screen 3)

    If you enable Kerberos, you will need to supply your Kerberos credentials.

    Live Hive Plugin installation
    Figure 16. Live Hive Plugin installation - security enabled (Screen 3)
    Hive Proxy Kerberos
    Hive Proxy Keytab file path

    The installer now validates that there is read access to the keytab that you specify here.

    Validate first
    You must validate the keytab file before you choose the principal.
    Hive Proxy Principal

    The installer validates whether there are valid principals in the keytab.

    Metastore Service Kerberos
    Metastore Service Hostname

    The hostname used by the Hive Proxy to connect to the Hive metastore.

    Metastore Service Privileged User Name

    The username of the metastore service user that is privileged to access the KDC server.

    KDC Credentials
    KDC admin principal

    Admin principal of your KDC, required by Ambari in order to deploy keytabs for the Live Hive Proxy.

    Password

    The password associated with the KDC admin principal.

  5. The final step is to complete the installation. Click Start Install.

    Live Hive Plugin installation
    Figure 17. Live Hive Plugin installation summary - screen 4

    The following steps are carried out:

    Live Hive Proxy Service Install

    Install the Live Hive Proxy Service on Ambari.

    Update Hive Configuration

    Updates the URIs for Hive connections in Ambari.

    Hive Metastore Template Install

    Install Live Hive Metastore Service Template on Ambari.

    Restart Hive Service

    Restarts the Hive service in Ambari. Note: this process can take several minutes to complete.

    Restart Live Hive Proxy Service

    Restarts the Live Hive Proxy service in Ambari. Note: this process can take several minutes to complete.

    Restart Fusion Server

    Completes the plugin installation and restarts the Fusion Server.

  6. The installation will complete with a message "Live Hive installation complete"

    Live Hive Plugin installation
    Figure 18. Live Hive Plugin installation - Completion

    Click Finish to close the Plugin installer screens. You must now activate the plugin.

4.2.4. Activate Live Hive Plugin

After completing the installation you will need to activate the Live Hive Plugin before you can use it. Use the following steps to complete the plugin activation.

  1. Log into the WANdisco Fusion UI. On the Settings tab, go to the Live Hive link on the side menu. The Live Hive Plugin Activation screen will appear.

    Live Hive Plugin installation
    Figure 19. Live Hive Plugin activation - Start
    Ensure that your clusters have been inducted before activating.
    The plugin will not work if you activate before completing the induction of all applicable zones.

    Tick the checkboxes that correspond with the zones that you want to replicate Hive metadata between, then click Activate.

  2. A message will appear at the bottom of the screen that confirms that the plugin is active.

    Live Hive Plugin installation
    Figure 20. Live Hive Plugin activation - Completion
    The plugin is now active; its zone membership cannot be modified.

4.3. Validation

Once an installation is completed, you should verify that Hive metadata replication is working as expected before entering a production phase.

4.4. Upgrade

An upgrade of the Fusion Plugin for Live Hive is completed by uninstalling the plugin on all nodes, followed by a re-installation using the standard installation steps for your platform type.

4.5. Uninstallation

Use the following section to guide the removal.

WARNING
Ensure that you contact WANdisco support before running this procedure. The following procedures currently require a lot of manual editing and should not be used without calling upon WANdisco’s support team for assistance.

4.5.1. Service removal

If removing Live Hive from a live cluster (rather than just removing Live Hive from a Fusion server for re-installation / troubleshooting purposes), the following steps should be performed before removing the plugin:

  1. Remove or reset to default the amended hive.metastore.uris parameter in the Hive service config (either in Ambari or Cloudera Manager) that is currently pointing at the Live Hive Proxy.

  2. Restart the cluster to deploy the changed config. Hive clients will no longer connect through the proxy.

  3. Stop and delete the proxy service. On Cloudera, deactivate the LIVE_HIVE_PROXY parcel.

4.5.2. Package removal

Currently there is no programmatic method for removing components, although you can use the following commands to delete the plugin components, one at a time:

  1. Remove any replicated paths related to the plugin (i.e. auto-generated paths for tables), and default/hive. You may need to use the REST API to complete this. See Remove a directory from replication.

  2. Check for outstanding tasks, wait 2 hours, then check again. If the /tasks directory is now empty on ALL nodes, proceed with the following (a REST-based task check is sketched after this list):

  3. Stop Fusion Plugin for Live Hive, e.g.

    [user@docs01-vm1 ~]# service fusion-ui-server stop
    [user@docs01-vm1 ~]# service fusion-server stop
  4. Remove installation components with the following commands,

    yum remove -y live-hive-fusion-core-plugin.noarch
    rm -rf /opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive/
    rm -rf /opt/wandisco/fusion-ui-server/ui-client-platform/plugins/wd-live-hive-plugin/
    rm -rf /opt/wandisco/fusion-ui-server/plugins/live-hive-ui-server-1.0.0-SNAPSHOT/
    sed -i '/LiveHive/d' /opt/wandisco/fusion-ui-server/properties/ui.properties
  5. Now restart, e.g.

    [user@docs01-vm1 ~]# service fusion-server start
    [user@docs01-vm1 ~]# service fusion-ui-server start

    The servers will come back, still inducted and with non-hive replication folders still in place.
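
As referenced in step 2 above, outstanding tasks can be checked through the Fusion REST API. A hedged sketch, using the /tasks resource that appears in the WADL output in the Reference Guide (the hostname is a placeholder):

# List outstanding tasks on this Fusion node; repeat on every node
curl "http://fusion-node.example.com:8082/fusion/tasks"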

4.6. Installation Troubleshooting

This section covers any additional settings or steps that you may need to take in order to successfully complete an installation.

If you encounter problems, make sure that you re-check the known issues and pre-requisites before raising a Support request.

4.6.1. Ensure hadoop.proxyuser.hive.hosts is properly configured

The following Hadoop property needs to be checked when running with the Live Hive plugin. While the settings apply specifically to HDP/Ambari, it may also be necessary to check the property for Cloudera deployments.

Configuration placed in core-site.xml

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>host1.domain.com,host-live-hive-proxy.organisation.com,host2.domain.com</value>
  <description>
     Hostname from where superuser hive can connect. This
     is required only when installing hive on the cluster.
  </description>
</property>
Proxyuser property
name

Hive hostname from which the superuser "hive" can connect.

value

Either a comma-separated list of your nodes or a wildcard. The hostnames should be included for Hiveserver2, Metastore hosts and LiveHive proxy.

Some cluster changes can modify this property

There are a number of changes that can be made to a cluster that might impact configuration, e.g.

  • adding a new Ambari component

  • adding an additional instance of a component

  • adding a new service using the Add Service wizard

These additions can result in unexpected configuration changes, based on installed components, available resources or configuration changes. Common changes might include (but are not limited to) changes to heap settings or changes that impact security parameters, such as the proxyuser values.

System changes to properties such as hadoop.proxyuser.hive.hosts should be made with great care. If the configuration is not present, impersonation will not be allowed and connections will fail.
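
To verify the current value on a cluster node, a minimal sketch (the client configuration path is an assumption; adjust for your distribution):

# Confirm that the proxyuser host list includes the Live Hive proxy host
grep -A 2 'hadoop.proxyuser.hive.hosts' /etc/hadoop/conf/core-site.xml
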
Handling configuration changes

If any of the changes listed in the previous section triggers a system change recommendation, there are two options:

  1. Use the checkbox (selected by default) that tells Ambari to apply the recommendation. You can uncheck it (or use the bulk uncheck at the top) to prevent the change.

    Live Hive Plugin Architecture
    Figure 21. Stopping a system change from altering hadoop.proxyuser.hive.hosts
  2. Manually adjust the recommended value yourself, as you can specify additional properties that Ambari may not be aware of.

The proxyuser property values should include hostnames for Hiveserver2, the Metastore hosts and the Live Hive proxy. Accepting recommendations that do not contain these (or the all-encompassing wildcard *) will more than likely result in service loss for Hive.


5. Operation

5.1. Configuration

The configuration section covers the essential steps that you need to take to start replicating Hive metadata.

5.1.1. Setting up Hive Metadata Replication

This section covers those steps that are required for replicating Hive Metadata between zones.

Live Hive Plugin can only replicate transactional data; it isn’t intended to sync large blocks of existing data.
Create a Regex rule for a Hive database

Before you can set up a regular expression-based replication, you must create a Hive Regex rule.

  1. Go to the Live Hive Plugin UI and click on the Replication tab.

    Live Hive Plugin installation
    Figure 22. Live Hive Plugin Replication
  2. Click on Create.

    Live Hive Plugin installation
    Figure 23. Live Hive Plugin Replication
  3. From the Type dropdown, select Hive Regex.

    Live Hive Plugin installation
    Figure 24. Live Hive Plugin activation - Completion
  4. Enter the criteria for the regular expression pattern that will be used to identify which metadata will be replicated between nodes.

    Live Hive Plugin installation
    Figure 25. Live Hive Plugin activation - Completion

    Regular Expression
    The regular expression type rule sets a pattern that Live Hive Plugin compares against new Hive tables. When a new table matches the pattern set in the regular expression, Live Hive automatically generates a Hive Table rule and begins replicating the table’s metadata.

    Database

    Name of a Hive database.

    Table name

    Name of a table in the above database.

    Description

    A description that will help identify or distinguish the replication rule.

    Zones

    The zones between which the rule will be applied.

    Priority Zone

    Sets the zone which is most important in terms of reliability.

Click Create to apply the rule.

  1. Once created, Hive data that matches the regular expression will automatically have a replication rule created.

    Live Hive Plugin installation
    Figure 26. Automatically generated replication rules
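
As an illustration, with an active rule using the example patterns from the Reference Guide payload (dbNamePattern 'mbdb.*', tableNamePattern 'tabl.*'), creating a matching table would generate a Hive Table rule automatically. A hedged sketch; the JDBC URL and object names are placeholders:

  # Create a table whose database and name match the regex rule; Live Hive
  # then generates the corresponding Hive Table replication rule
  beeline -u "jdbc:hive2://hiveserver2-host.example.com:10000" \
    -e "CREATE DATABASE IF NOT EXISTS mbdb1; CREATE TABLE mbdb1.table001 (id INT, name STRING);"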

5.1.2. Hive table rule

Hive tables are tracked on the Replication Rules screen, along with Hive Regex rules and Hadoop Compatible File System (HCFS) resources.

Live Hive Plugin installation
Figure 27. Live Hive Plugin Hive Table Rule 1

5.1.3. Hive table

Hive table rules are generated automatically when newly created tables match Hive Regex rules.

Hive table rule

Replication Rules

Type

Rule types, either Hive Table or Hive Regex

Resource

The elements that the replication rule will be applied to. Note that a "Generated" label appears for Hive Table rules because the rule was generated by a Live Hive Regex rule when a newly created Hive table matched the Regex. The "Generated" tag links to the Regex rule which generated the Table rule.

Zones

The Zones in which the replication rule will be applied.

Status

Indicates the state of file system consistency of the noted resources between the zones.

Activity

Identifies any current activity in terms of data transfer.

Replicated Hive table data appear on the Replication Rules table. Click on the database or table name to view the details of the Hive element. Click on the Generated label to see the Hive Regex rule that generated the Hive Table rule.

Hive regex rule
View/Edit
Live Hive Plugin installation
Figure 28. Live Hive Plugin Hive Table Rule 2
Type

Hive Regex

Regular Expression

A Hive Regular Expression does not replicate data itself. Instead, when a table is created and matches this expression, a Hive Table rule will be created automatically with the options set in this rule.

Database name

The name of a Hive database, matched against the databases that exist in your Hive deployment.

Table name

The name of a table, stored in the above database, that you intend to replicate.

File system location

The location of the data in the local file system.

File system location is currently fixed

In Live Hive 1.0 the File system location is locked to the wildcard .*

This value ensures that the file system location is always found. In a future release the File system location will be opened up for the user to edit.

Description

A description that you provide during the setup of the regex rule.

Zones

A list of the zones that take part in the replication of this resource.

In Live Hive 1.0, it’s not possible to change the zones of an existing rule.
Generated by

Clicking on the Generated label in the Replicated Rules table will display the replication rules table with a pre-defined filter that only shows the Hive Table rules created by this Hive Regex.

Live Hive Plugin installation
Figure 29. Live Hive Plugin Hive

The resulting table filters those rules that are generated by the regex rule. The top bar provides a drop-down that lets you select Generated by.

5.1.4. Running a consistency check

Live Hive Plugin installation
Figure 30. Live Hive Plugin Consistency Check

Live Hive Plugin provides a tool for checking that replica metadata is consistent between zones. Consistency is checked on a dedicated Hive Metadata Consistency tab.

When to complete a consistency check?
  • After adding new metadata to a replication group

  • Periodically, as part of your platform monitoring

  • As part of a system repair/troubleshooting.

To complete a check:

  1. Click on the Replication tab.

    Live Hive Plugin installation
    Figure 31. Live Hive Plugin Consistency Check 1
  2. Click on the status of the applicable Replication Rule. In this example, an unchecked Hive Table rule.

    Live Hive Plugin installation
    Figure 32. Live Hive Plugin Consistency Check 2
  3. The Hive Metadata Consistency tab will appear. Click on a context to check, in this case table0 (1.), then click on the Consistency Check (2.) button for the context. Alternatively, you can select the Consistency Check All button to start a check on all listed contexts.

    Live Hive Plugin installation
    Figure 33. Live Hive Plugin Consistency Check 1
  4. In the Detailed view panel, the results of the check will appear. In this case, the table data is not present in one of the zones.

    Live Hive Plugin installation
    Figure 34. Live Hive Plugin Consistency Check 1

5.1.5. Running a repair

In the event that metadata is found to be inconsistent, you can use the repair function to fix the issue.

  1. Identify the nature of the inconsistency from the Detailed View panel. Select the zone that contains the correct version of the metadata, then select what actions you want the repair tool to take.

    Live Hive Plugin installation
    Figure 35. Live Hive Plugin Repair 1
    Recursive

    If the checkbox is ticked, the path and all files under it will be made consistent. The default is true, but the option is ignored if the path represents a file.

    Add Missing

    Tick to copy data if missing from a zone.

    Remove Extra

    Should the zone under repair have extra files that are not present on the "Source of truth" zone, then those extra files are removed. Use this option to ensure that zones are in an identical state, rather than simply copying over missing files.

    Update Different

    Updates files if they are different.

    Click Repair

  2. You will get a report "Repair successfully triggered". Click on Close.

    Live Hive Plugin installation
    Figure 36. Live Hive Plugin Repair 2
  3. To check if the repair has been successful, re-run the Consistency Check and review the status.

    Live Hive Plugin installation
    Figure 37. Live Hive Plugin Repair 3

5.2. Troubleshooting

The following tips should help you to understand any issues you might experience with Live Hive Plugin operation:

5.2.1. Check the Release notes

Make sure that you check the latest release notes, which may include references to known issues that could impact Live Hive Plugin.

5.2.2. Check log files

Observe information in the log files generated for the WANdisco Fusion server and the Fusion Plugin for Live Hive to troubleshoot issues at runtime. Exceptions or log entries with a SEVERE label may represent information that can assist in determining the cause of any problem.

As a distributed system, Fusion Plugin for Live Hive will be impacted by the operation of the underlying Hive database with which it communicates. You may also find it useful to review log or other information from these endpoints.


6. Reference Guide

6.1. API

Fusion Plugin for Live Hive offers increased control and flexibility through a RESTful (REpresentational State Transfer) API.

Below are listed some example calls that you can use to guide the construction of your own scripts and API driven interactions.

API documentation is still in development:
Note that this API documentation continues to be developed. Contact our support team if you have any questions about the available implementation.

Note the following:

  • All calls use the base URI:

    http(s)://<server-host>:8082/plugin/hive/hiveRegex/<RESOURCE>
  • The internet media type of the data supported by the web service is application/xml.

  • The API is hypertext driven, using the following HTTP methods:

Type     Action

POST     Create a resource on the server
GET      Retrieve a resource from the server
PUT      Modify the state of a resource
DELETE   Remove a resource
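
For example, a simple GET with curl against the base URI (the hostname is a placeholder; the hiveRegex resource is documented in the examples below):

# List all known Hive regex replication rules
curl "http://fusion-node.example.com:8082/plugin/hive/hiveRegex/"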

6.1.1. Unsupported operations

As part of Fusion’s replication system, we capture and replicate some "write" operations to an underlying DistributedFileSystem/FileSystem API. However, the truncate command is not currently supported. Do not run this command as your Hive metadata will become inconsistent between clusters.

6.1.2. Example WADL output

You can query the API using the following string:

http://example-livehive-node.domain.com:8082/fusion/application.wadl
<application xmlns="http://wadl.dev.java.net/2009/02">
<doc xmlns:jersey="http://jersey.java.net/" jersey:generatedBy="Jersey: 2.25.1 2017-01-19 16:23:50"/>
<doc xmlns:jersey="http://jersey.java.net/" jersey:hint="This is full WADL including extended resources. To get simplified WADL with users resources only do not use the query parameter detail. Link: http://example-livehive-node.domain.com:8082/fusion/application.wadl"/>;
<grammars>
<include href="application.wadl/xsd0.xsd">
<doc title="Generated" xml:lang="en"/>
</include>
</grammars>
<resources base="http://cbark01-vm1.bdauto.wandisco.com:8082/fusion/">
<resource path="/location">
<resource path="{locationIdentity}">...</resource>
<resource path="{locationIdentity}/attributes">...</resource>
<resource path="{locationIdentity}/nodes">...</resource>
<resource path="{locationIdentity}/startIgnoring/{ignoreLocationIdentity}">
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ignoreLocationIdentity" style="template" type="xs:string"/>
<param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="locationIdentity" style="template" type="xs:string"/>
<method id="startIgnoringLocation" name="PUT">...</method>
<method id="apply" name="OPTIONS">...</method>
<method id="apply" name="OPTIONS">...</method>
<method id="apply" name="OPTIONS">...</method>
</resource>
</resource>
<resource path="/zones">...</resource>
<resource path="/replicationGroup">...</resource>
<resource path="/plugins">...</resource>
<resource path="/monitor">...</resource>
<resource path="/configuration">...</resource>
<resource path="/logging">...</resource>
<resource path="/">...</resource>
<resource path="/locations">...</resource>
<resource path="/zone">...</resource>
<resource path="/license">...</resource>
<resource path="/statemachines">...</resource>
<resource path="/nodes">...</resource>
<resource path="/node">...</resource>
<resource path="/task">...</resource>
<resource path="/tasks">...</resource>
<resource path="/replicationGroups">...</resource>
<resource path="/fs">...</resource>
<resource path="/memberships">...</resource>
<resource path="/membership">...</resource>
<resource path="/statemachine">...</resource>
<resource path="/monitors">...</resource>
<resource path="application.wadl">...</resource>
</resources>
</application>

6.1.3. Example REST calls

The following examples illustrate some simple use cases, most are direct calls through a web browser, although for deeper or interactive examples, a curl client may be used.

Get a Hive Replication Rule DTO
GET /hiveRegex/{hiveRegexRuleId} > returns a HiveReplicationRuleDTO
Optional query params: ?dbName= &tableName= &path=
The returned DTO is annotated with @XmlRootElement(name = "hiveRule") and @XmlType(propOrder = ...).
Permitted values:
  • private String dbNamePattern = "";

  • private String tableNamePattern = "";

  • private String tableLocationPattern = "";

  • private String membershipIdentity = "";

  • private String ruleIdentity;

  • private String description = "";

List Hive Replication Rule DTOs
GET /hiveRegex/ > HiveReplicationRulesListDTO
  • (all known rules) list of HiveReplicationRuleDTO (see below for format)

PUT /hiveRegex/addHiveRule/ PAYLOAD HiveReplicationRuleDTO >
{dbNamePattern:'mbdb.*', tableNamePattern:'tabl.*', tableLocationPattern:'.*', membershipIdentity:'824ce758-641c-46d6-9c7d-d2257496734d', ruleIdentity:'6a61c98b-eaea-4275-bf81-0f82b4adaaef', description:'mydbrule'}
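
A hedged curl sketch of the call above. Note that the payload example is JSON-style while the service’s stated media type is application/xml, so confirm the expected Content-Type against your deployment; the hostname is a placeholder:

# Create a new Hive regex rule from the payload shown above
curl -X PUT -H "Content-Type: application/json" \
  -d '{"dbNamePattern":"mbdb.*","tableNamePattern":"tabl.*","tableLocationPattern":".*","membershipIdentity":"824ce758-641c-46d6-9c7d-d2257496734d","ruleIdentity":"6a61c98b-eaea-4275-bf81-0f82b4adaaef","description":"mydbrule"}' \
  "http://fusion-node.example.com:8082/plugin/hive/hiveRegex/addHiveRule/"
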
GET /hiveRegex/active/ >
  • Returns HiveReplicationRulesListDTO of all ACTIVE hiveRegex rules

GET /hiveReplicatedDirectories > HiveReplicatedDirectoresDTOList<HiveReplicatedDirectoryDTO> :
  • Will get all HCFS replicated dirs that were created automatically via a Hive regex rule upon table creation. Returns JSON in the format:

{"hiveReplicatedDirectoryDTOList":[{"rd":"ReplicatedDirectoryDTO","propertiesDTO":{"properties":"Map<String, String>"},"consistencyTaskId":"str","consistencyStatus":"str","lastConsistencyCheck":0,"consistencyPending":true}]}
GET /hiveReplicatedDirectories/{regexRuleId}  >
  • Returns the same as above, but only directories created via the given regex rule Id, supplied as a path parameter.

GET /hiveReplicatedDirectories/path?path=/some/location >
  • Returns the same as above again, but this time filtered by the path of the HCFS location, supplied as a query parameter.
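
A hedged example of the directory queries above (the hostname is a placeholder; the resource path follows the pattern of the other /plugin/hive calls):

# All replicated directories generated by regex rules
curl "http://fusion-node.example.com:8082/plugin/hive/hiveReplicatedDirectories"
# The same list, filtered by HCFS path
curl "http://fusion-node.example.com:8082/plugin/hive/hiveReplicatedDirectories/path?path=/some/location"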

Consistency Checks

Perform a consistency check on the database specified. The response will contain the location of the Platform task that is coordinating the consistency check. This task will exist on all nodes in the membership and at completion each task will be in the same state. The taskIdentity can be used to view the consistency check report using the /plugin/hive/consistencyCheck/{taskIdentity} API.

Start a Consistency Check on a particular Hive Database and Table:
 POST /plugin/hive/consistencyCheck?dbName=test_db&tableName=test_table1&simpleCheck=true
  • Both tableName and simpleCheck are optional query parameters and if omitted will default to tableName="" and simpleCheck=true
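
A hedged curl sketch of the call above (the hostname is a placeholder):

# Start a consistency check; the response references the coordinating task
curl -X POST "http://fusion-node.example.com:8082/plugin/hive/consistencyCheck?dbName=test_db&tableName=test_table1&simpleCheck=true"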

Get the Consistency Check report for a Consistency Check task previously requested by the API above
GET /plugin/hive/consistencyCheck/{taskid}?withDiffs=true
  • The withDiffs query parameter is optional and defaults to false if not supplied.

Get part of the consistency check report depending on the query parameters set
 GET /plugin/hive/consistencyCheck/{taskId}/diffs?pageSize=10&page=0&dbName=test_db&tableName=test_table&partitions=true&indexes=true
  • Returns part of the diff from the consistency check. The hierarchy is: dbName / tableName / one of [partitions=true or indexes=true].

    dbName

    Name of the database to check.

    tableName

    Name of the database table to check; the default of " " will check all tables. If specified, then either partitions or indexes must also be specified.

    pageSize

    Optional. Will default to pageSize = 2,147,483,647

    page

    Optional. Will default to page=0.

Repair
Start to repair the specified database and table.
 PUT /plugin/hive/repair?truthZone=zone1&dbName=test_db&tableName=test_table&partName=testPart&indexName=test_index&recursive=true&addMissing=true&removeExtra=true&updateDifferent=true&simpleCheck=true
truthZone (required)

Zone which is the source of truth.

dbName (required)

Database to repair in. Note: this database has to exist in the zone where this API call is made.

tableName (optional)
partName (optional)
indexName (optional)
recursive (optional)

Defaults to false.

addMissing (optional)

Defaults to true. If true, objects that are missing will be created.

removeExtra (optional)

Defaults to true. If true, objects that don’t exist in the truthZone will be removed.

updateDifferent (optional)

Defaults to true. If true, objects that are different will be fixed.

simpleCheck (optional)

Defaults to true. If true then the repair operation will only involve a simple check and not include any extended parameters of the objects being repaired.
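
A hedged curl sketch of the repair call, relying on the defaults described above (the hostname and names are placeholders):

# Repair test_db/test_table using zone1 as the source of truth
curl -X PUT "http://fusion-node.example.com:8082/plugin/hive/repair?truthZone=zone1&dbName=test_db&tableName=test_table"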