logo

WANDISCO FUSION®
LIVE HIVE PLUGIN

1. Welcome

1.1. Product overview

The Fusion Plugin for Live Hive enables WANdisco Fusion to replicate Hive’s metastore, allowing WANdisco Fusion to maintain a replicated instance of Hive’s metadata and, in future, support Hive deployments that are distributed between data centers.

1.2. Documentation guide

This guide contains the following:

Welcome

This chapter introduces this user guide and provides help with how to use it.

Release Notes

Details the latest software release, covering new features, fixes and known issues to be aware of.

Concepts

Explains how Live Hive uses WANdisco’s LiveData platform.

Installation

Covers the steps required to install and set up Live Hive into a WANdisco Fusion deployment.

Operation

The steps required to run, reconfigure and troubleshoot Live Hive.

Reference

Additional Live Hive documentation, including documentation for the available REST API.

1.2.1. Symbols in the documentation

In this guide, we highlight types of information using the following callouts:

The alert symbol highlights important information.
The STOP symbol cautions you against doing something.
Tips are principles or practices that you’ll benefit from knowing or using.
The KB symbol shows where you can find more information, such as in our online Knowledge base.

1.3. Contact support

See our online Knowledge base which contains updates and more information.

If you need more help, raise a case on our support website.

1.4. Give feedback

If you find an error or if you think some information needs improving, raise a case on our support website or email docs@wandisco.com.

2. Release Notes

2.1. Live Hive 8.2.2 Build 1667

13 May 2021

For the release notes and information on known issues, please visit the Knowledge base - WANdisco Live Hive Plugin 8.2.

2.2. Live Hive 8.0.0 Build 1613

9 December 2020

For the release notes and information on known issues, please visit the Knowledge base - WANdisco Live Hive Plugin 8.0.

3. Concepts

3.1. Product concepts

Familiarity with product and environment concepts will help you understand how to use the Fusion Plugin for Live Hive. Learn the following concepts to become proficient with replication.

Apache Hive

Hive is a data warehousing technology for Apache Hadoop. It is designed to offer an abstraction that supports applications that want to use data residing in a Hadoop cluster in a structured manner, allowing ad-hoc querying, summarization and other data analysis tasks to be performed using high-level constructs, including Apache Hive SQL queries.

Hive Metadata

The operation of Hive depends on the definition of metadata that describes the structure of data residing in a Hadoop cluster. Hive also organizes its metadata into structures, including definitions of Databases, Tables, Partitions, and Buckets.

Apache Hive Type System

Hive defines primitive and complex data types that can be assigned to data as part of the Hive metadata definitions. These are primitive types such as TINYINT, BIGINT, BOOLEAN, STRING, VARCHAR, TIMESTAMP, etc. and complex types like Structs, Maps, and Arrays.

Apache Hive Metastore

The Apache Hive Metastore is a stateless service in a Hadoop cluster that presents an interface for applications to access Hive metadata. Because it is stateless, the Metastore can be deployed in a variety of configurations to suit different requirements. In every case, it provides a common interface for applications to use Hive metadata.

The Hive Metastore is usually deployed as a standalone service, exposing an Apache Thrift interface by which client applications interact with it to create, modify, use and delete Hive metadata in the form of databases, tables, etc. It can also be run in embedded mode, where the Metastore implementation is co-located with the application making use of it.

WANdisco Fusion Live Hive Proxy

The Live Hive Proxy is a WANdisco service that is deployed with Live Hive, acting as a proxy for applications that use a standalone Hive Metastore. The service coordinates actions performed against the Metastore with actions within clusters in which associated Hive metadata are replicated.

Hive Client Applications

Client applications that use Apache Hive interact with the Hive Metastore, either directly (using its Thrift interface), or indirectly via another client application such as Beeline or HiveServer2.

HiveServer2

A service that exposes a JDBC interface for applications that access Hive. These could include standard analytic tools and visualization technologies, or the Hive-specific CLI called Beeline.

Hive applications determine how to contact the Hive Metastore using the Hadoop configuration property hive.metastore.uris.
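
For example, you can confirm the value configured for clients on a cluster node. This is a minimal check only; the configuration path below is typical but may differ per distribution.

# Show the Metastore URI(s) Hive clients use (path is typical; adjust for your distribution)
grep -A1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml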

HiveServer2 Template

A template service that amends the HiveServer2 config so that it no longer uses the embedded metastore, and instead correctly references the hive.metastore.uris parameter that points to our "external" Hive Metastore server.

Hive pattern rules

A simple syntax used by Hive for matching database objects.
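
For illustration (a sketch only; the HiveServer2 hostname is a placeholder), the same pattern syntax can be exercised directly in Hive, where * matches any characters and | separates alternatives:

# List databases whose names start with "sales" or "hr" using Hive pattern syntax
beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -e "SHOW DATABASES LIKE 'sales*|hr*';"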

Fusion Plugin for Live Hive

The Fusion Plugin for Live Hive is a plugin for WANdisco Fusion. Before you can install it you must first complete the installation of the core WANdisco Fusion product. See WANdisco Fusion user guide.

Get additional terms from the Big Data Glossary.

3.2. Product architecture

The native Hive Metastore is not replaced. Instead, Live Hive runs as a proxy server that passes commands issued by connected clients (e.g. Beeline) to the original metastore, which is on the cluster.

The Live Hive proxy passes read commands directly to the local Hive Metastore, while Fusion co-ordinates any write commands, so that all metastores on all clusters perform the write operations, such as table creation. Live Hive will also automatically start to replicate Hive tables when their names match a user-defined rule.

Hive Plugin Architecture
Figure 1. Live Hive Architecture
  1. Write access needs to be co-ordinated by Fusion before executing the command on the metastore.

  2. Read Commands are 'passed-through' straight to the metastore as we do not need to co-ordinate via Fusion.

  3. Makes connection to the metastore on the cluster.

3.2.1. Cross-platform replication

See the release notes for a list of Hadoop distributions that are supported for cross-platform replication.

3.2.2. Cross-version Hive support

From version 7.0, you can replicate between different versions of Hive; however, there are some limitations.

Limitations
  • Both clusters need to have the same value for hive.server2.enable.doAs and hive.server2.enable.impersonation.

  • Any operation which is not fully supported on both the versions of Hive you are using will not replicate. For example, if you have Hive 2 on zone A and Hive 1.2 on zone B then ALTER db SET LOCATION will not replicate as this is a Hive 2 operation with no equivalent in Hive 1.2.

    • Hive 3 has initial support for catalogs but the work is currently incomplete - see HIVE-18685. For cross-compatibility, all databases from Hive 2 and lower will be located in the default hive catalog.

    • Hive 3 dropped support for Indexes. Index operations will still replicate but will not be applied on Hive 3 zones. Consistency checks won’t report the absence of indexes on Hive 3 clusters unless other inconsistencies in indexes are found.

  • Hive 1, 2 and 3 have different prerequisites to enable ACID table transactions. We recommend enabling transactions from the Hive 1/2 side rather than Hive 3.
    For example, if you execute the following ALTER query against Hive 1/2:

    ALTER TABLE tablename0 SET TBLPROPERTIES ("transactional" = "true")

    You may get the error:

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The table must be bucketed and stored using an ACID compliant format (such as ORC)

    You will then need to manually satisfy the prerequisites on the Hive 1/2 side by turning the table into a bucketed table with an ACID-compliant data storage format. Once done, replication will occur correctly on both the Hive 1/2 and Hive 3 sides.
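
    As a sketch only (the connection URL, column layout and bucket count are illustrative assumptions), a table that satisfies these prerequisites on Hive 1/2 could be created as follows:

    # Illustrative: create a bucketed, ORC-backed table so the ACID table property can be applied
    beeline -u "jdbc:hive2://<hiveserver2-host>:10000/<database_name>" -e \
      "CREATE TABLE IF NOT EXISTS tablename0 (id INT, value STRING)
       CLUSTERED BY (id) INTO 4 BUCKETS
       STORED AS ORC
       TBLPROPERTIES ('transactional'='true');"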

See the release notes for any known issues with cross-version Hive support.

See the Validation section for how to validate that your set up is working.

3.2.3. Limitations

Membership changes

There is currently no support for dynamic membership changes. Once installed on all Fusion nodes, the Live Hive plugin is activated. See Activate Live Hive Plugin. During activation, the membership for replication is set and cannot be modified later. For this reason, it’s not possible to add new Live Hive nodes at a later time, including a High Availability node running an existing Live Hive proxy that wasn’t part of your original membership.

Any change to membership in terms of adding, removing or changing existing nodes will require a complete reinstallation of Live Hive.

Where to install Live Hive
  • Install Live Hive on all zones. While it is possible to only install on a subset of your zones, there are two potential problem scenarios:

    • Live Hive installed on all zones but a Hive replicated rule is on a membership spanning a subset of zones.

    • Live Hive not installed on all zones, but a replicated rule is on a membership spanning all zones.

Both situations result in unpredictable behaviour that may end up causing serious problems.

  • HDP/Ambari only: On HDP you cannot co-locate the Live Hive proxy on a node that is running the Hive Metastore. This is because Ambari uses the value from hive.metastore.uris to determine what port the metastore should listen on, which would clash with Live Hive.

  • You must install Live Hive on all Fusion nodes within a zone. Note that while the plugin must be installed on all nodes within a zone, the plugin’s proxy does not need to be installed on every node.

Hive must be running in all zones

All participating zones must be running Hive in order to support replication. We’re aware that this currently prevents the popular use case of replicating between on-premises clusters and S3/cloud storage, where Hive is not running. We intend to remove this limitation in a future release.

3.3. Deployment models

The following deployment models illustrate some of the common use cases for running Live Hive.

Hive Plugin Architecture
Figure 2. Live Hive Deployment Model

3.4. Analytic off-loading

In a typical on-premises Hadoop cluster, data ingest and analytic jobs all run through the same infrastructure, where some activities impose a load on the cluster that can impact other activities. The Fusion Plugin for Live Hive allows you to divide the workflow across separate environments, which lets you isolate the overheads associated with some activities. You can ingest in one environment while using a different environment, with capacity provided there, to run the analytic jobs. You get more control over each environment’s performance.

  • You can ingest data from anywhere and query it at scale within the environment.

  • You can ingest data on premises (or wherever the data is generated) and query it at scale in another optimized environment, such as a cloud environment with elastic scaling that can be spun up only when query jobs are queued. In this model, you may ingest data continuously, but you don’t need to run a large cluster 24 hours per day for query jobs.

3.5. Multi-stage jobs across multiple environments

A typical Hadoop workflow might involve a series of activities: ingesting data, cleaning data and then analyzing the data in a short series of steps. You may be generating intermediate output to be run against end-stage reporting jobs that perform analytical work. Running all of these work streams on a single cluster could require careful coordination of the different types of workloads that make up multi-stage jobs. This is a common chain of query activities for Hive, where you might ingest raw data, refine and augment it with other information, then eventually run analytic jobs against your output on a periodic basis for reporting purposes, or in real time.

In a replicated environment, however, you can control where those job stages are run. You can split this activity across multiple clusters to ensure that the query jobs needed for reporting purposes have access to the capacity necessary to run within SLAs. You can also run different types of clusters to make more efficient use of the overall chain of work that occurs in multi-stage job environments. You could have one cluster that is tweaked and tuned for the most efficient ingest, while running a completely different kind of environment that is tuned for another task, such as the end-stage reporting jobs that run against processed and augmented data. Running with LiveData across multiple environments allows you to run each type of activity in the most efficient way.

3.6. Migration

Live Hive allows you to replicate both the Hive data stored in HCFS and the associated Hive metadata from an on-premises cluster over to cloud-based infrastructure. There’s no need to stop your cluster activity; the migration can happen without impact to your Hadoop operations.

3.7. Disaster Recovery

As data is replicated between nodes on a continuous basis, Live Hive is an ideal solution for protecting your data from loss. If a disaster occurs, the data is already available in a different zone.

4. Installation

4.1. Pre-requisites

An installation should only proceed if the following prerequisites are met on each Live Hive Plugin node:

  • Hadoop Cluster - see the release notes for which platforms are supported.

  • Hive installed, configured and running on the cluster.

    • If using Ambari, the Hive Client must also be installed on the Fusion and Live Hive Proxy nodes (this is covered later in the Hive Client (Ambari) section).

  • Fusion 2.16.x

It’s extremely useful to complete some work before you begin a Live Hive deployment. The following tasks and checks will make installation easier and reduce the chance of an unexpected roadblock causing a deployment to stall or fail.

It’s important to make sure that the following elements meet the requirements that are set in the Pre-requisites.

4.1.1. Server OS

One common requirement that runs through much of the deployment is the need for strict consistency between Fusion nodes running Live Hive. Your nodes should, as a minimum, be running with the same versions of:

  • Hadoop/Manager software

  • Linux

    • Check to see if you are running a niche variant, e.g. Oracle Linux is compiled from Red Hat Enterprise Linux (RHEL), but it is not identical to a RHEL installation.

  • Java 8

    • Ensure you are running the same Java 8 version, on consistent paths.

4.1.2. Hadoop Environment

Before installing, confirm that your Hadoop clusters are fully operational.

  • Review all available Hadoop daemon log files for errors that may impact WANdisco Fusion or WANdisco Live Hive installations.

  • All nodes must have a "fusion" system user account for running Fusion services; as part of the installation of WANdisco Fusion, this system user will have been created.

Folder Permissions

When installing the Live Hive proxy or plugin, the permissions of /etc/wandisco/fusion/plugins/hive/ are set to match the Fusion user (FUSION_SERVER_USER) and group (FUSION_SERVER_GROUP), which are set in the Fusion node installation procedure.

Permissions on the folder are also set such that processes can write new files to that location as long as the user associated with the process is the FUSION_SERVER_USER or is a member of the FUSION_SERVER_GROUP.

No automatic fix for granting authorization

If you change the Fusion user or group, the permissions on these directories are not updated automatically. You need to fix them manually, following the above guidelines.

4.1.3. Hive delegation token store compatibility

Live Hive supports the following delegation token store types:

hive.cluster.delegation.token.store.class
  • org.apache.hadoop.hive.thrift.ZooKeeperTokenStore

  • org.apache.hadoop.hive.thrift.MemoryTokenStore

  • org.apache.hadoop.hive.thrift.DBTokenStore

The DBTokenStore requires manual configuration so that it can work with the Live Hive Proxy. See the sections below for configuration steps depending on your platform.
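
To confirm which token store your cluster uses, check the value of hive.cluster.delegation.token.store.class. A minimal check, assuming a typical configuration path:

# Show the configured delegation token store class (path may differ per distribution)
grep -A1 'hive.cluster.delegation.token.store.class' /etc/hive/conf/hive-site.xml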

HDP: Configuring Live Hive to work with DBTokenStore
  1. Obtain the Ambari password that is used to connect to the Ambari database. If you do not know this, it may be obtained through root access on the Ambari server.

    Example

    # grep 'passwd' /etc/ambari-server/conf/ambari.properties
    
    server.jdbc.user.passwd=/etc/ambari-server/conf/password.dat
    # cat /etc/ambari-server/conf/password.dat
    bigdata

    If you have encrypted your Ambari credentials, the server.jdbc.user.passwd may refer to an alias instead of a path to a file. In this scenario, you may need to reset your Ambari database password if your password cannot be obtained by other means.

  2. Obtain the password for the hive user from the Ambari database.

    1. Log in to the Ambari server and access the Ambari database by using the password obtained in the previous step:

      # psql -U ambari
      Password for user ambari: <enter password from step 1>
    2. Enable expanded display and obtain the cluster configuration for Hive:

      Example

      Ambari => \x on
      Expanded display is on.
      Ambari => select config_id,type_name,config_data from clusterconfig where type_name='hive-site' ORDER BY config_id DESC LIMIT 1;
      
      -[ RECORD 1 ]-------------------------------------------------------------
      config_id   | 120
      type_name   | hive-site
      config_data | {"hive.security.authenticator.manager":"org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator","datanucleus.cache.level2.type":"none","hive.optimize.index.filter":"true","hive.enforce.sorting":"true","javax.jdo.option.ConnectionPassword":"mine","hive.stats.autogather":"true","hive.metastore.uris":"thrift://dn2.example.com:9083","hive.stats.dbclass":"fs","hive.map.aggr.hash.force.flush.memory.threshold":"0.9","hive.server2.transport.mode":"binary","hive.compactor.worker.timeout":"86400L","hive.metastore.pre.event.listeners":"org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener","hive.convert.join.bucket.mapjoin.tez":"false","hive.tez.container.size":"170","javax.jdo.option.ConnectionURL":"jdbc:mysql://dn2.example.com/hive?createDatabaseIfNotExist\u003dtrue","hive.zookeeper.client.port":"2181","hive.compactor.worker.threads":"0","hive.exec.submitviachild":"false"}
    3. The javax.jdo.option.ConnectionPassword value will be the hive user password. In the example above, it is mine.

  3. Access the Hive database and configure all hosts where the Live Hive Proxy will be installed to authenticate with the hive user credentials. By default, HDP will use a MySQL database located on a cluster node.

    Log in to this node and run the following:

    # mysql -u root
    CREATE USER 'hive'@'lhvproxy.node.com' IDENTIFIED BY 'mine';
    GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'lhvproxy.node.com' WITH GRANT OPTION;

    Adjust the commands above to your environment, where:
    mine = password for the hive user obtained in the previous step.
    lhvproxy.node.com = host where the Live Hive Proxy will be installed.

    Repeat this step for all hosts where the Live Hive Proxy will be installed.

  4. Copy the database driver from your Hive metastore host to the hosts where the Live Hive Proxy will be installed.

    Example

    scp /usr/hdp/current/hive-metastore/lib/mysql-connector-java.jar lhv_proxy_node:/usr/hdp/current/hive-metastore/lib/mysql-connector-java.jar
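
After completing these steps, you can optionally confirm from a Live Hive Proxy host that the hive user can authenticate against the metastore database. A sketch only, using the example hostname and password from the steps above:

# Run on the Live Hive Proxy host; hostname and password are the example values used above
mysql -u hive -p'mine' -h dn2.example.com -e 'SELECT 1;'
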
Post-installation task required if using hadoop.security.credential.provider.path

If you are using the hadoop.security.credential.provider.path property in your hive-site.xml, you will need to perform an additional task after installing the Live Hive Proxy service.

  1. Transfer the hive-site.jceks file from a Hive metastore node to all Live Hive Proxy nodes in the zone.

    This is often located in the /etc/hive/conf/conf.server directory by default, but check the value of your hadoop.security.credential.provider.path property to confirm the location.

    Example

    scp /etc/hive/conf/conf.server/hive-site.jceks lhv_proxy_node:/etc/wandisco/fusion/plugins/hive/hive-site.jceks
  2. In Ambari, add an additional property to the live-hive-site.xml:

    Live Hive Proxy → Configs → Custom live-hive-site

    Add the following property and value below:

    hadoop.security.credential.provider.path=jceks://file/etc/wandisco/fusion/plugins/hive/hive-site.jceks
  3. Save the Live Hive Proxy configuration afterward.

  4. Restart the Live Hive Proxy service and any other required services on the cluster.

  5. Restart all Fusion Servers in the zone:

    Example

    service fusion-server restart
CDH/CDP: Configuring Live Hive to work with DBTokenStore

These steps should be performed after the installation of the Live Hive Proxy. This is because changes are required to the Live Hive Proxy configuration in Cloudera, and this configuration will not be available until after installation.

It is recommended to uncheck the Restart services automatically option on the final step of the Live Hive installation, and then carry out the tasks described in this section. Services can then be restarted manually afterward.
  1. Obtain the Hive configuration by using an API call to the cluster. This will require your Cloudera Manager administrator credentials.

    Example

    curl -u admin:admin http://<CM_HOST>:7180/api/v1/clusters/<cluster_name>/services/<hive_service_name>/config

    Adjust the curl command to your environment. For example, if SSL is enabled on Cloudera Manager, adjust the http prefix to https and use the -k option to allow an insecure server connection (if desired).

  2. From the json output, the following properties will be needed for the next step:

    Example output

    {
        "name" : "hive_metastore_database_host",
        "value" : "hive_metastore.example.com"
    }, {
        "name" : "hive_metastore_database_name",
        "value" : "hive"
    }, {
        "name" : "hive_metastore_database_password",
        "value" : "mydbpassword"
    }, {
        "name" : "hive_metastore_database_port",
        "value" : "7432"
    }, {
        "name" : "hive_metastore_database_type",
        "value" : "postgresql"
    }
  3. Ensure that the following database properties are present in the custom live-hive-site.xml file in Cloudera:

    Live Hive Proxy → Configuration → Live Hive Metastore Proxy Advanced Configuration Snippet (Safety Valve) for live-hive-site.xml

    Also, for the client:

    Live Hive Proxy → Configuration → Live Hive Client Advanced Configuration Snippet (Safety Valve) for live-hive-conf/live-hive-site.xml

    Name: javax.jdo.option.ConnectionUserName
    Value: hive

    Name: javax.jdo.option.ConnectionDriverName
    Value for postgresql: org.postgresql.Driver
    Value for MySQL: com.mysql.jdbc.Driver

    Name: javax.jdo.option.ConnectionPassword
    Value: <hive_metastore_database_password>

    Name: javax.jdo.option.ConnectionURL
    Value: jdbc:<hive_metastore_database_type>://<hive_metastore_database_host>:<hive_metastore_database_port>/<hive_metastore_database_name>

    Save the configuration after adding all properties.

  4. Restart the Live Hive Proxy service as well as all other required services.
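
If you need to re-check the database values from step 2 later, the same API call can be filtered. A sketch, assuming the pretty-printed output shown above and the same credentials:

# Filter just the database-related properties from the Hive service configuration
curl -s -u admin:admin "http://<CM_HOST>:7180/api/v1/clusters/<cluster_name>/services/<hive_service_name>/config" | grep -A1 '"hive_metastore_database'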

4.1.4. Live Hive dependencies

Generally, Live Hive relies on distribution artifacts being available for it to load. Proxy init scripts, plugin cp-extra scripts, etc., load the relevant items from the cluster on which Live Hive is installed.

For Cloudera deployments, these functions will work as expected because all managed nodes have the Parcel regardless of the role of the node, so the libraries are available. On Ambari deployments, however, the needed items must be loaded in manually.

Fusion Client libraries

Live Hive requires the Fusion Client to be installed on the Live Hive Proxy and Fusion nodes prior to installation. If the NameNode Proxy was selected instead of the other URI selection options, then a manual installation of the client will be required to ensure compatibility.

Manual install of Fusion Client on Fusion node(s)

You can install the client by running the following command on the Fusion node(s).

RHEL
yum install -y /opt/wandisco/fusion-ui-server/ui-client-platform/downloads/client_packages/fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.noarch.rpm
Debian
apt-get install -y /opt/wandisco/fusion-ui-server/ui-client-platform/downloads/client_packages/fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.deb

If the Live Hive Proxy was co-located on the Fusion node(s), no further action is required, but please read further if the Live Hive Proxy was installed on different node(s).

Manual install of Fusion Client on Live Hive Proxy node(s)

Obtain the client package first by running the following command on the Live Hive Proxy node:

RHEL
scp $FUSION_NODE:/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/client_packages/fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.noarch.rpm .
Debian
scp $FUSION_NODE:/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/client_packages/fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.deb .

The package can then be installed:

RHEL
yum install -y fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.noarch.rpm
Debian
apt-get install -y fusion-hcfs-<distro-version>-client-hdfs-<fusion-version>.deb
Hive Client (Ambari)

Check that Hive Client libraries are available on the Fusion and Live Hive Proxy nodes.

  1. In the Ambari UI, load the Hosts tab.

  2. For each Fusion and Live Hive Proxy node, check that the Hive Client is installed.

  3. If the Hive Client is not installed, use the Add option and select the Hive Client from the list.

    Repeat this step for any Fusion and Live Hive Proxy node that is missing the Hive Client.

4.1.5. Firewalls and Networking

Most of the items below should already be covered as part of the WANdisco Fusion installation; however, they are reiterated here:

  • If iptables or SELinux are running, you must confirm that any rules that are in place will not block Live Hive communication.

  • If any nodes are multi-homed, ensure that you account for this when setting which interfaces will be used during installation.

  • Ensure that you have hostname resolution between clusters; if not, add suitable entries to your hosts files.

  • Check your network performance to make sure there are no unexpected latency issues or packet loss.

4.1.6. Security

This section explains how you can secure your Live Hive deployment.

Kerberos Configuration

As part of the automated installation of Live Hive, KDC admin credentials will be requested. This will allow the Live Hive Proxy keytabs to be generated by the Cluster manager.

If you select or use a manual Kerberos setup, follow the next section below.

Manual Kerberos Setup

Prepare a Kerberos principal for each WANdisco Fusion/Live Hive Proxy node and place this in a keytab on the relevant node(s).

  • The principal value should be taken from the hive-site.xml property hive.metastore.kerberos.principal.

    The keytab/principal that you specify for the Live Hive service must use the same principal that is used by the rest of the Hive stack. Usually it appears in the form hive/_HOST@DOMAIN.COM. Other values are likely to cause proxied requests to fail at the proxy-to-metastore step.
  • The keytab should be exported to the path specified by the hive-site.xml property hive.metastore.kerberos.keytab.file.

  • Please ensure the keytab is owned by the Live Hive user:group and has suitable permissions, such as 640.

  • The Live Hive user and WANdisco Fusion user must also be a part of a common group.
    Example
    Live Hive user = hive
    WANdisco Fusion user = hdfs
    Common group = hadoop

If a non-superuser principal is used, it also needs sufficient permission to impersonate all users. See the Secure Impersonation (proxyuser) section for details on how to set this up.
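
Before running the installer, you can confirm that the keytab on each node contains the expected principal. The keytab path below is an example; use the value of hive.metastore.kerberos.keytab.file:

# List the principals contained in the Hive service keytab (example path)
klist -kt /etc/security/keytabs/hive.service.keytab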

Secure Impersonation (proxyuser)

Normally the Hive user has superuser permissions on the HiveServer2 and Hive Metastore nodes. If you are installing into different nodes, corresponding proxyuser parameters should also be updated in core-site.xml and kms-site.xml.

The Hive user on the Live Hive proxy nodes (e.g. the WANdisco Fusion nodes) must have permission to impersonate users.

core-site.xml
<property>
  <name>hadoop.proxyuser.$USERNAME.hosts</name>
  <value>$HIVE_METASTORE,$HIVESERVER2,$LHV_PROXY01,$LHV_PROXY02</value>
</property>
<property>
  <name>hadoop.proxyuser.$USERNAME.groups</name>
  <value>*</value>
</property>

$USERNAME - the superuser who will act as a proxy for the other users; this is usually set as the system user hive.

hadoop.proxyuser.$USERNAME.hosts

Defines the hosts from which the superuser can impersonate other users. This can be a comma-separated list or a wildcard (*).
During installation, this value will automatically be set to include the Live Hive Proxy nodes. If the value has already been set as a wildcard (*), it will be left untouched.

hadoop.proxyuser.$USERNAME.groups

A list of groups whose members the superuser is allowed to impersonate. A wildcard (*) means that any user can be impersonated. To clarify, for the superuser to act as proxy for another user, the proxy action must be performed from one of the listed hosts, and the user must belong to one of the listed groups. Note that this can be a comma-separated list or the noted wildcard (*).
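
If you later edit these proxyuser properties by hand rather than through the installer, the NameNode must pick up the change. A restart of HDFS applies it; on many distributions the following refresh may also work without a full restart (a sketch; verify against your distribution's documentation):

# Refresh proxyuser (superuser group) settings on the NameNode
hdfs dfsadmin -refreshSuperUserGroupsConfiguration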

kms-site.xml

This is only required if TDE Encryption (e.g. Ranger KMS, Navigator Key Trustee) is being used for HDFS.

<property>
  <name>hadoop.kms.proxyuser.$USERNAME.hosts</name>
  <value>$HIVE_METASTORE,$HIVESERVER2,$LHV_PROXY01,$LHV_PROXY02</value>
</property>
<property>
  <name>hadoop.kms.proxyuser.$USERNAME.groups</name>
  <value>*</value>
</property>
SSL

Enablement of SSL encryption (optional) is covered in the WANdisco Fusion User Guide - SSL section.

To enable SSL encrypted communication between Fusion nodes, Java KeyStore and TrustStore files must be generated or available for all Live Hive nodes.

We don’t recommend using self-signed certificates, except for proof-of-concept/testing.
Connections that can be SSL secured
  • Live Hive server ←→ Live Hive server

  • IHC server ←→ Live Hive server

  • client ←→ Live Hive server

  • browser ←→ UI server

4.1.7. Cloudera Redaction Policy support

Cloudera Redaction Policy is supported from Live Hive 8.2 onwards. If your cluster has this enabled, provide your Hive Metastore Database Password during installation of Live Hive through the UI installer.

If your Hive metastore database is not a PostgreSQL database, then the appropriate database driver jar must be provided to the Live Hive UI server plugin repo directory.

The Live Hive UI server plugin repo directory will not be created until after the initial CLI installation is complete, so this task should be carried out after that.

For example, if the Hive metastore database is a MySQL database, copy the mysql-connector-java.jar to the Live Hive directory shown below:

/opt/wandisco/fusion-ui-server/plugins-repo/live-hive-ui-server/

Example

cp /usr/share/java/mysql-connector-java.jar /opt/wandisco/fusion-ui-server/plugins-repo/live-hive-ui-server/

Once the Jar is in place, restart the Fusion UI:

service fusion-ui-server restart

4.1.8. Server utilization

  • By default, the Live Hive Proxy will be installed on the same node as the WANdisco Fusion Server. Ensure that there are sufficient resources available for the Proxy to run alongside any other applications installed on the WANdisco Fusion Server.

  • Use ulimit -u && ulimit -n to check that the limits on the number of processes and open files are sufficient for the Live Hive user (e.g. hive). Compare them with those set for the nodes that have active HiveServer2 component(s) installed.

  • Use netstat to review the connections being made to the server. Verify that the port required by the Live Hive Proxy is not in use (default: 9090).
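
    For example, to check whether anything is already listening on the default port noted above (adjust the port number if you use a different one):

    # Look for an existing listener on the Live Hive Proxy port
    netstat -tlnp | grep ':9090'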

4.2. Installation

4.2.1. Cloudera-based steps

Run the installer

Obtain the Live Hive installer from WANdisco. Open a terminal session on your Fusion node and run the installer as follows:

  1. Using an account with appropriate permissions, run the Live Hive installer on each host required:

    ./live-hive-installer_<version>.sh

    You will see the following messaging.

    # ./live-hive-installer_<version>.sh
    Verifying archive integrity... All good.
    Uncompressing WANdisco Live Hive.......................
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    
    
    
    You are about to install WANdisco Live Hive version 8.0.0.0
    
    Do you want to continue with the installation? (Y/n)
      wd-live-hive-plugin-8.0.0.0.tar.gz ... Done
      live-hive-fusion-core-plugin-8.0.0.0-1613.noarch.rpm ... Done
      storing user packages in '/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive' ... Done
      live-hive-ui-server-8.0.0.0-dist.tar.gz ... Done
    All requested components installed.
    Go to your WANdisco Fusion UI Server to complete configuration.
Installer options
View the Installer Options section for details on additional installer functions, including the ability to install selected components.
IMPORTANT: Once you run this installer script, do not restart the Fusion node until you have fully completed the installation steps (up to activation) for this node.
Configure Live Hive
  1. Open a session to your Fusion UI. You will see a message confirming that the Live Hive Plugin has been detected.

    Live Hive Install
    Figure 3. Live Hive install - dashboard
  2. Go to the Settings page → Plugins. The Fusion Plugin for Live Hive now appears. Click the button Install Now.

  3. The installation process runs through four steps that handle the placement of parcel files onto your Cloudera Manager server.

    Parcels need to be placed in the correct directory to make them available to the manager. To do this:

    Copy the paths for the .parcel and .parcel.sha files for your corresponding platform type,
    e.g. el6 (Enterprise Linux version 6).

    1. Download the Parcel packages to the Cloudera service directory (/opt/cloudera/parcel-repo/) on your node, e.g.

      cd /opt/cloudera/parcel-repo
      wget <your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-<version>-<os>.parcel
      wget <your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-<version>-<os>.parcel.sha
    2. Change the ownership of the parcel files to match up with Cloudera Manager, e.g. use 'chown cloudera-scm:cloudera-scm LIVE_HIVE_PROXY-*'

    3. Copy the Custom Service Descriptor (LIVE_HIVE_PROXY-x.x.x.jar) file to the Local Descriptor Repository (normally /opt/cloudera/csd/) on your node. e.g.

      cd ../csd
      wget http://<your-fusion-node.hostname>:8083/ui/downloads/core_plugins/live-hive/parcel_packages/LIVE_HIVE_PROXY-<version>.jar
      Resolving <your-fusion-node.hostname>... 10.0.0.1
      Connecting to <your-fusion-node.hostname>.com|10.10.0.1|:8083... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 4041 (3.9K) [application/java-archive]
      Saving to: LIVE_HIVE_PROXY-<version>.jar
      
      100%[=============================================================================================>] 4,041       --.-K/s   in 0s
    4. Restart the Cloudera server so that Cloudera can see the new parcel and jar, e.g.

      service cloudera-scm-server restart
      After restarting the Cloudera Server, the Cloudera Manager Service (CMS) will report a stale config, which requires a restart via Cloudera Manager.
  4. The second installer screen handles Configuration.

    Live Hive installation
    Figure 4. Live Hive installation - validation
    Install a Live Hive Proxy on this host

    The installer lets you choose not to install the Live Hive proxy onto this node. While you must install Live Hive on all nodes, if you don’t wish to use a node to store Hive metadata, you can choose to exclude the Live Hive proxy from the installation. If you do this, the node still plays its part in transaction coordination, without keeping a local copy of the replicated data.

    If you deselect Live Hive proxy on ALL nodes, then replication will not work. You must install at least 1 proxy in each zone. Should you have a cluster that doesn’t have a single Live Hive proxy, you will need to perform the following procedure to enable Hive metadata replication.
    Live Hive Proxy Port

    The Thrift port used by the Live Hive Proxy. Default: 9083

    Hive Metastore URI

    The metastore(s) which the Live Hive proxy will send requests to.
    Add additional URIs by clicking the + Add URI button and entering additional URI / port information.

    If you add additional URIs, you must complete the necessary information or remove them. You cannot have an incomplete line.
    Live Hive installation
    Figure 5. Live Hive installation - Additional URIs
    Hive Metastore Database Password

    If Redaction Policy is enabled on the cluster, provide the Hive Metastore Database Password for the cluster.

    Ensure that you have completed the steps in the Cloudera Redaction Policy support section if the Hive metastore database is not a PostgreSQL database.

    Click on Next step to continue.

  5. Step 3 of the installation covers security. If you have not enabled Kerberos on your cluster, you will pass through this step without adding any additional configuration.

    If Kerberos is enabled on your cluster, supply your Kerberos credentials.

    Live Hive installation
    Figure 6. Live Hive installation - security enabled

    Hive Proxy Security

    User

    System user used for Hive Proxy

    Group

    System group for secure access

    Principal name

    The Kerberos principal name used for access

    Ensure that you use the same principal as is used for the Hive stack. If you use a different principal then Live Hive will not work due to basic security constraints.
    Manual Kerberos setup

    Tick this checkbox to configure Kerberos manually.

    Provide KDC credentials

    Tick the checkbox to configure KDC credentials

    KDC Credentials

    KDC admin principal

    Admin principal of your KDC, required by the Hadoop manager in order to deploy keytabs for the Live Hive Proxy.

    Password

    Password for the KDC admin principal.

    The above credentials are stored using the Hadoop Manager’s temporary credential mechanism, and as such will be destroyed if either the Hadoop manager is restarted or 90 minutes (by default) have passed.
    Keytab file path

    The installer now validates that there is read access to the keytab that you specify here.

    Metastore Service Principal Name

    The installer validates whether there are valid principals in the keytab.

    Metastore Service Hostname

    Enter the hostname of your Hive Metastore service.

  6. The final step is to complete the installation. If you want to restart services automatically, check the box, then click Start Install.

    If you are running a DBTokenStore implementation for your Hive metastore(s), uncheck the box and start the installation. You can then restart services manually after completing the steps in the CDH/CDP: Configuring Live Hive to work with DBTokenStore section.
    Live Hive installation
    Figure 7. Live Hive installation - summary

    The following steps are carried out:

    Cloudera parcel distribution and activation

    Distributes and activates the Live Hive Plugin parcels in Cloudera Manager

    Install Live Hive Plugin service descriptor in Cloudera

    Installs the Live Hive Plugin service descriptor package in Cloudera Manager

    Install Live Hive Plugin service in Cloudera

    Installs the Live Hive Plugin service in Cloudera Manager

    Cloudera metastore configuration

    Configures Cloudera Hive to use the Live Hive Proxy

    Restart HDFS, Hive and Live Hive services

    Restarts the HDFS, Hive and Live Hive services in Cloudera Manager to distribute updated configurations

    Restart Fusion Server

    Complete the plugin installation and restart Fusion Server

    Configure Impala (if installed)

    Configures Cloudera Impala to be compatible with Live Hive

  7. The installation will complete with a message Live Hive installation complete!.
    Click Finish to close the plugin installer screens.

Now advance to the Activation steps.

4.2.2. Ambari-based steps

Important HDP/Ambari requirement
On HDP you cannot co-locate the Live Hive proxy on a node that is running the Hive Metastore. This is because Ambari uses the value from hive.metastore.uris to determine what port the Metastore should listen on, which would clash with Live Hive.
Run the installer

Obtain the Live Hive installer from WANdisco. Open a terminal session on your Fusion node and run the installer as follows:

  1. Using an account with appropriate permissions, run the Live Hive installer on each host required:

    ./live-hive-installer_<version>.sh

    You will see the following messaging.

    # ./live-hive-installer_<version>.sh
    Verifying archive integrity... All good.
    Uncompressing WANdisco Live Hive.......................
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    You are about to install WANdisco Live Hive version 7.1.0.0
    
    Do you want to continue with the installation? (Y/n)
      wd-live-hive-plugin-7.1.0.0.tar.gz ... Done
      live-hive-fusion-core-plugin-7.1.0.0-1459.noarch.rpm ... Done
      storing user packages in '/opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive' ... Done
      live-hive-ui-server-7.1.0.0-dist.tar.gz ... Done
    All requested components installed.
    Go to your WANdisco Fusion UI Server to complete configuration.
Installer options
View the Installer Options section for details on additional installer functions, including the ability to install selected components.
IMPORTANT: Once you run this installer script, do not restart the Fusion node until you have fully completed the installation steps for this node.
Configure Live Hive
  1. Open a session to your Fusion UI. You will see a message that confirms that the Live Hive plugin has been detected.

    Live Hive installation
    Figure 8. Live Hive install - dashboard
  2. Go to the Settings page → Plugins. The Fusion Plugin for Live Hive now appears. Click the button Install Now.

  3. The installation process runs through four steps that handle the placement of stack files onto your Ambari Manager server.

    Live Hive installation
    Figure 9. Live Hive installation - Clients
    Stacks

    Stacks need to be placed in the correct directory to make them available to the manager. To do this:

    1. Download the compressed stacks using the links in the UI download panel.

    2. The compressed stacks will expand to the following directories: LIVE_HIVE_PROXY and LIVE_HIVESERVER2_TEMPLATE.

      Before decompressing them, check the /var/lib/ambari-server/resources/stacks/HDP/<hdp-version>/services directory on the Ambari server, and remove any existing directories named LIVE_HIVE_PROXY and LIVE_HIVESERVER2_TEMPLATE.

    3. Decompress the compressed stacks inside of the /var/lib/ambari-server/resources/stacks/HDP/<hdp-version>/services directory.

    4. Restart the Ambari server.

    5. Remove the compressed stacks from the /var/lib/ambari-server/resources/stacks/HDP/<hdp-version>/services directory afterwards:

      rm -f live-hive-proxy-<lhv-version>.stack.tar.gz live-hiveserver2-template-<lhv-version>.stack.tar.gz

    6. Check on your Ambari manager that the Live Hiveserver 2 Template and Live Hive Proxy services are listed when selecting to Add Service.

  4. The second installer screen handles Configuration.

    Live Hive installation
    Figure 10. Live Hive installation - Configuration
    Install a Live Hive Proxy on this host

    The installer lets you choose not to install the Live Hive proxy onto this node. While you must install Live Hive on all nodes, if you don’t wish to use a node to store Hive metadata, you can choose to exclude the Live Hive proxy from the installation. If you do this, the node still plays its part in transaction coordination, without keeping a local copy of the replicated data.

    If you deselect Live Hive proxy on ALL nodes, then replication will not work. You must install at least 1 proxy in each zone. Should you have a cluster that doesn’t have a single Live Hive proxy, you will need to perform the following procedure to enable Hive metadata replication.
    Live Hive Proxy Port

    The Thrift port used by the Live Hive Proxy. Default: 9083

    Hive Metastore URI

    The metastore(s) which the Live Hive proxy will send requests to.
    Add additional URIs by clicking the + Add URI button and entering additional URI / port information.

    If you add additional URIs, you must complete the necessary information or remove them. You cannot have an incomplete line.
    Live Hive installation
    Figure 11. Live Hive installation - Additional URIs

    Click on Next step to continue.

  5. Step 3 of the installation covers security. If you have not enabled Kerberos on your cluster, you will pass through this step without adding any additional configuration.

    If you enable Kerberos, you will need to supply your Kerberos credentials.

    Live Hive installation
    Figure 12. Live Hive installation - security enabled
    Hive Proxy Security

    Kerberos settings for the Hive Proxy.

    User

    The system user for Hive.

    Group

    The system group for Hive.

    Principal name

    The Principal name for the Hive user.

    Ensure that you use the same principal as is used for the Hive stack. If you use a different principal then Live Hive will not work due to basic security constraints.
    Manual Kerberos setup (checkbox)

    Tick this checkbox to provide the Kerberos details for Hive Proxy Kerberos.

    Hive Proxy Kerberos
    Live Hive installation
    Figure 13. Live Hive installation - Kerberos
    Keytab file path

    The installer now validates that there is read access to the keytab that you specify here.

    Validate first
    You must validate the keytab file before you choose the principal.
    Principal

    Select from the available principals. This is the principal that will be used to connect to the original Hive Metastore. Validation checks that the principal is valid.

    Provide KDC credentials (Checkbox)

    Tick the checkbox to provide details for a KDC’s admin principal and password.

    Live Hive installation
    Figure 14. Live Hive installation - KDC credentials
    KDC Credentials
    If Ambari is managing the cluster’s Kerberos implementation, you must provide the following KDC credentials or the plugin installation will fail.
    KDC admin principal

    Admin principal of your KDC, required by the Hadoop manager in order to deploy keytabs for the Live Hive Proxy.

    Password

    Password for the KDC admin principal.

    The above credentials are stored using the Hadoop Manager’s temporary credential mechanism, and as such will be destroyed if either the Hadoop manager is restarted or 90 minutes (by default) have passed.
  6. The final step is to complete the installation. If you want to restart services automatically, check the box, then click Start Install.

    If you are using the hadoop.security.credential.provider.path property in your hive-site.xml, uncheck the box and start the installation. You can then restart services manually after completing the steps in the Post-installation task required if using hadoop.security.credential.provider.path section.
    Live Hive installation
    Figure 15. Live Hive installation - summary

    The following steps are carried out:

    Hive Metastore Template Install

    Install Live Hive Metastore Service Template on Ambari.

    Live Hive Proxy Service Install

    Install the Live Hive Proxy Service on Ambari.

    Update Hive Configuration

    Updates the URIs for Hive connections in Ambari.

    Restart HDFS, Hive and Live Hive services

    Restarts the HDFS, Hive and Live Hive services in Ambari.

    You will also need to restart any dependent services that are impacted by the installation, such as Big SQL.
    Restart Fusion Server

    Complete the plugin installation and restart Fusion Server.

  7. The installation will complete with a message Live Hive installation complete.
    Click Finish to close the plugin installer screens.

You must now activate the plugin.

4.2.3. Live Hive Proxy configuration dependencies

Whether it is installed on an Ambari or Cloudera cluster, the Live Hive Proxy has dependencies on the HDFS and Hive configurations.

Whenever changes are made to these services, the Live Hive Proxy service configuration will become stale and require a restart alongside any other dependent services in the cluster.

This should be taken into account when planning any future changes on your Hadoop cluster.

4.2.4. Installer Options

The following section provides additional information about running the Live Hive installer.

Installer Help

The bundled installer provides some additional functionality that lets you install selected components, which may be useful if you need to restore or replace a specific file. To review the options, run the installer with the --help option, i.e.

./live-hive-installer.sh --help
Verifying archive integrity... All good.
Uncompressing WANdisco Hive Live.......................

This usage information describes the options of the embedded installer script. Further help, if running directly from the installer is available using '--help'. The following options should be specified without a leading '-' or '--'. Also note that the component installation control option effects are applied in the order provided.

Installation options
General options:
  help                             Print this message and exit

Component installation control:
  only-fusion-ui-client-plugin     Only install the plugin's fusion-ui-client component
  only-fusion-ui-server-plugin     Only install the plugin's fusion-ui-server component
  only-fusion-server-plugin        Only install the plugin's fusion-server component
  only-user-installable-resources  Only install the plugin's additional resources
  skip-fusion-ui-client-plugin     Do not install the plugin's fusion-ui-client component
  skip-fusion-ui-server-plugin     Do not install the plugin's fusion-ui-server component
  skip-fusion-server-plugin        Do not install the plugin's fusion-server component
  skip-user-installable-resources  Do not install the plugin's additional resources
Standard help parameters
./live-hive-installer.sh --help
Makeself version 2.1.5
 1) Getting help or info about ./live-hive-installer.sh :
  ./live-hive-installer.sh --help   Print this message
  ./live-hive-installer.sh --info   Print embedded info : title, default target directory, embedded script ...
  ./live-hive-installer.sh --lsm    Print embedded lsm entry (or no LSM)
  ./live-hive-installer.sh --list   Print the list of files in the archive
  ./live-hive-installer.sh --check  Checks integrity of the archive

 2) Running ./live-hive-installer.sh :
  ./live-hive-installer.sh [options] [--] [additional arguments to embedded script]
  with following options (in that order)
  --confirm             Ask before running embedded script
  --noexec              Do not run embedded script
  --keep                Do not erase target directory after running the embedded script
  --nox11               Do not spawn an xterm
  --nochown             Do not give the extracted files to the current user
  --target NewDirectory Extract in NewDirectory
  --tar arg1 [arg2 ...] Access the contents of the archive through the tar command
  --                    Following arguments will be passed to the embedded script

 3) Environment:
  LOG_FILE              Installer messages will be logged to the specified file

4.2.5. Silent installation for Cloudera and Ambari

Instead of installing through the UI, you can install using the silent (scripted) installer. These steps need to be repeated on each node you want the Live Hive Plugin installed on.

  1. Obtain the Live Hive Plugin installer from WANdisco and open a terminal session on your Fusion node.

  2. Ensure the downloaded file is executable e.g.

    chmod +x live-hive-installer.sh
  3. Run the Live Hive Plugin installer using an account with appropriate permissions e.g.

    ./live-hive-installer.sh
  4. Now place the parcels or stacks in the relevant directory. They can be found in the directory /opt/wandisco/fusion-ui-server/ui-client-platform/downloads/core_plugins/live-hive-<version>. The steps are the same steps as in the UI installer. For more information see Parcels if you are using Cloudera, or Stacks if using Ambari. Ensure that you restart your Cloudera or Ambari server.

  5. Now edit the live_hive_silent_installer.properties file, located in /opt/wandisco/fusion-ui-server/plugins/live-hive-ui-server-<your version>/properties.

    The following fields are required (a minimal example is shown after this list):

    live.hive.proxy.thrift.host - the hostname to which the Live Hive Proxy server will bind.

    live.hive.proxy.thrift.port - the port the Live Hive Proxy will run on.

    live.hive.proxy.remote.thrift.uris - the original Hive Metastore thrift host and port. This must be in the form host:port. If vanilla metastore HA is configured, this should be a comma separated list of all existing metastore host:ports. For kerberized clusters, please see the notes in the properties file specific to your setup.

  6. To start the silent installation, go to /opt/wandisco/fusion-ui-server/plugins/live-hive-ui-server-<version> and run:

    ./scripts/silent_installer_live_hive.sh ./properties/LIVE_HIVE_PROXY-silent-installer.properties
  7. Repeat these steps on each node.

  8. Once the plugin is installed on all nodes that will replicate Hive metadata, activate the plugin.
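
For reference, a minimal set of the required properties described in step 5 might look like the following; the hostnames and ports are illustrative only:

live.hive.proxy.thrift.host=lhvproxy.node.com
live.hive.proxy.thrift.port=9083
live.hive.proxy.remote.thrift.uris=metastore01.example.com:9083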

4.3. Activate Live Hive

After completing the installation on all applicable Fusion nodes, you will need to activate Live Hive before you can use it. Use the following steps to complete the plugin activation.

  1. Log in to the Fusion UI. The Live Hive plugin will be visible on the dashboard, but inactive.

  2. On the Settings tab, go to the Plugin Activation link in the Live Hive section of the side menu. The Live Hive Plugin Activation screen will appear.

    Live Hive installation
    Figure 16. Live Hive activation - activate
    Ensure that your clusters have been inducted before activating.
    The plugin will not work if you activate before completing the induction of all applicable zones.

    Tick the checkboxes that correspond to the zones that you want to replicate Hive metadata between, then click Activate.

  3. A message will appear at the bottom of the screen that confirms that the plugin is active.

    Live Hive installation
    Figure 17. Live Hive activation - activated
    You can’t change membership once activated. See Membership changes.
  4. Reload your browser to use the new plugin.

    Live Hive installation
    Figure 18. Live Hive activation - reload window

    It will now also be visible on the dashboard.

4.4. Validation

Hive replication can use the full potential of WANdisco’s LiveData, meaning that metadata DDL commands such as Create, Alter and Drop table are replicated, in addition to inserts and the data stored in HDFS.

The following section offers guidance in testing live replication of Hive metadata and related HDFS data.

Hive, Fusion and the Live Hive Proxy/Plugin are all assumed to be in a healthy state prior to running through these validation steps.

4.4.1. Live Hive Plugin configuration

In order to enable Hive replication, two rules are needed:

  • HCFS rule that covers the location of Hive data in the Hadoop file system - see Create a rule.

    • This rule must be consistent before creating Hive rules.

  • Hive rule that defines patterns to match the database name and table names - see Create Hive rule.

Once complete, the Replication tab on the UI should display the rules like the example below:

Live Hive replication tab

In this example, the rules will allow for live replication of any HDFS data written to the /apps/hive/warehouse path, and for a named database with any tables contained within (*).

This may not be suitable for all environments (especially production ones), as a more restricted dataset may be desired. Adjust for your own requirements when creating the rules.

4.4.2. Creating and Replicating New Hive Databases and Tables

To start replicating Hive data and metadata, create a database and table(s) in Hive that match the replication rules created in the previous section (Live Hive Plugin configuration).

All links below reference the Apache Hive documentation.

  1. Connect to HiveServer2 via Beeline.
    Beeline example - Guidance on connecting to the HiveServer2 service using a Beeline connection.

  2. Create a Database to be used and ensure that it matches the replication rules created for both the HCFS and Hive data.
    Create/Drop/Alter/Use Database - Documentation for Database commands.

  3. Create table(s) within the database to be used for testing.
    Create/Drop/Truncate Table - Documentation for Table commands.

    Managed vs External Tables - Guidance on tables used within Hive databases.

  4. Verify that the database and tables have been created and exist in Hive on the source zone.
    Show Databases - Documentation for listing databases.

    Show Tables - Documentation for listing tables in a specified database.
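
A minimal combined sketch of steps 1-4, assuming a HiveServer2 listening on the default port 10000 and placeholder database/table names that match your replication rules (adjust the JDBC URL for Kerberos or other security settings):

    beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default"
    -- within the Beeline session:
    CREATE DATABASE IF NOT EXISTS <database_name>;
    USE <database_name>;
    CREATE TABLE IF NOT EXISTS <table_name> (id int, fname string) STORED AS TEXTFILE;
    SHOW DATABASES;
    SHOW TABLES IN <database_name>;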

If the replication rules have been set up correctly to match the database/table(s) created, then it will be possible to perform a Hive consistency check on these rules to ensure Hive metadata replication has occurred on the remote zone(s).

If performing a Hive consistency check at this point, it is also recommended to run a HCFS consistency check on the relevant HCFS rule.

Adding and replicating data to a Hive table

This section will provide guidance in demonstrating live replication of data within a Hive table, followed by a query of the data.

  1. On the source cluster, download a sample data set from a public AWS S3 bucket ("fusion-demo").

    wget https://s3-us-west-1.amazonaws.com/fusion-demo/datastream1.txt
  2. On the source cluster, create a Hive table with the following schema. Note that <database_name> and <table_name> will need to be replaced with names that match a Hive replication rule:

    use <database_name>;
    CREATE TABLE IF NOT EXISTS <table_name> (id int, fname string, lname string, email string, social string, age string, secret string, license string, ip string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  3. On the source cluster, place the data file (datastream1.txt) into HDFS, so that it can be replicated to the remote cluster and queried from Hive in the next step.

    hdfs dfs -put datastream1.txt /path/to/replication_rule/<database_name>.db/<table_name>
  4. On source and remote clusters, open a beeline connection and run a query on the data set.

    use <database_name>;
    select count(*) from <table_name> where id >=0;
    select * from <table_name> where fname='<username>' order by lname;

    The output from the two select queries should match for each cluster/zone defined in the replication rule. As an example, if you use the <username> Helen, 680 rows will be returned in the example dataset.

Alter and Drop Table Examples

These next steps will demonstrate replicating the Alter and Drop commands.

  1. On the source cluster, run an alter table command, replacing <new_table_name> with a different name.

    Note that the new table name must still match up with a Hive replication rule.

    use <database_name>;
    alter table <table_name> rename to <new_table_name>;
  2. On a remote cluster, verify the alter table command has replicated.

    use <database_name>;
    show tables;

    The <new_table_name> should now be shown in the tables list with the original <table_name> no longer present.

  3. On the source cluster, run a drop table command on the newly renamed table.

    use <database_name>;
    drop table <new_table_name>;
  4. On a remote cluster, verify that <new_table_name> no longer exists.

    use <database_name>;
    show tables;

    The renamed table should no longer be listed.

4.4.3. Verify HDFS and Hive Data

To verify the HDFS and Hive data, it is worth running a consistency check for both the HCFS and Hive replication rules.

For details on how to run these, see the Running a Consistency Check and Make consistent sections.

4.5. Upgrade

If you wish to perform a Live Hive upgrade, please contact WANdisco support.

An outline of the steps involved:

  1. Run a script to gather information on existing configurations

  2. Reset environment

  3. Upgrade plugin components

  4. Renew configurations


4.6. Uninstallation

If you wish to uninstall Live Hive, please contact WANdisco support. The uninstall procedure currently requires manual editing and should not be done without calling WANdisco’s support team for assistance. The process involves both service and package removal.

4.7. Installation Troubleshooting

This section covers any additional settings or steps that you may need to take in order to successfully complete an installation.

If you encounter problems, make sure that you re-check the release notes and pre-requisites before raising a Support request.

4.7.1. Ensure hadoop.proxyuser.hive.hosts is properly configured

The following Hadoop property needs to be checked when running with the Live Hive Plugin. While the settings apply specifically to HDP/Ambari, it may also be necessary to check the property for Cloudera deployments.

Configuration placed in core-site.xml

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>host1.domain.com,host-LIVE_HIVE_PROXY.organisation.com,host2.domain.com </value>
  <description>
     Hostname from where superuser hive can connect. This
     is required only when installing hive on the cluster.
  </description>
</property>
Proxyuser property
Name

Hive hostname from which the superuser "hive" can connect.

Value

Either a comma-separated list of your nodes or a wildcard. The hostnames should be included for HiveServer2, Metastore hosts and Live Hive proxy.

Some cluster changes can modify this property
System changes to properties such as hadoop.proxyuser.hive.hosts should be made with great care. If the configuration is not present, impersonation will not be allowed and connections will fail.

There are a number of changes that can be made to a cluster that might impact configuration, e.g.

  • adding a new Ambari component

  • adding an additional instance of a component

  • adding a new service using the Add Service wizard

These additions can result in unexpected configuration changes, based on the installed components and available resources. Common changes might include (but are not limited to) changes to heap settings or changes that impact security parameters, such as the proxyuser values.

Handling configuration changes

If any of the changes listed in the previous section trigger a system change recommendation, there are two options:

  1. A checkbox (selected by default) indicates that Ambari should apply the recommendation. Uncheck it (or use the bulk uncheck at the top) to stop the change being applied.

    Live Hive Architecture
    Figure 19. Stopping a system change from altering hadoop.proxyuser.hive.hosts
  2. Manually adjust the recommended value yourself, as you can specify additional properties that Ambari may not be aware of.

The Proxyuser property values should include hostnames for HiveServer2, Metastore hosts and the Live Hive proxy. Accepting recommendations that do not contain these (or the all-encompassing wildcard *) will more than likely result in service loss for Hive.

5. Operation

This section covers the steps required for setting up and configuring Live Hive after installation has been completed.

5.1. Setting up Hive Metadata Replication

This section covers the essential steps required for replicating Hive metadata between zones.

Live Hive requires that you create two kinds of rules in order to replicate Hive metadata.

HCFS Rule

Create a rule that matches the location of your underlying Hive data on the HDFS file-system. This rule handles the actual data replication; without it, a corresponding Hive rule will not work. See the Create a rule section in the WANdisco Fusion user guide for details.

Must have a consistent HCFS rule before creating a Hive rule
Before creating a Hive rule, the corresponding HCFS rule must be consistent. See the Consistency Check and Make Consistent sections for more detail.
Hive Rule

Create a rule that uses Hive’s pattern syntax to describe Hive databases and tables. This rule applies to any matching HCFS rule, contextualizing Hive metadata. See Create Hive rule.

5.1.1. Replication limitations

  • An alter table which changes the location of the table will not necessarily guarantee replication if the original directory was not replicated. Moving to a replicated directory does not cause replication. You will need to make the tables consistent for replication to occur.

  • Database descriptions will be replicated but they cannot be changed using the make consistent tool.

  • In a Hortonworks setup, Hive does not support changing database locations. This can however be done manually, see Hive: Changing Database Location for more information.

Ignored Hive properties

The following properties (relating to tables, partitions and indexes) will not be considered for consistency check or make consistent operations due to their non-deterministic nature:

transient_lastDdlTime
last_modified_by
last_modified_time
numFilesErasureCoded
numFiles
numPartitions
totalSize
numRows
rawDataSize
hive.stats.tmp.loc
DO_NOT_UPDATE_STATS
COLUMN_STATS_ACCURATE
STATS_GENERATED_VIA_STATS_TASK
bucketing_version

5.1.2. Create Hive rule

To replicate Hive metadata between zones, a Hive pattern-based rule must be created. This rule uses patterns from Hive’s own DDL (data definition language), which describe the schema (structure) of the data and how it resides in Hive.

For more information about the Hive patterns used in creating replication rules, see the Hive LanguageManual.
  1. Go to the Fusion UI. Click on Replication tab and + Create.

    No Hive rule option?
    After Live Hive activation, if you don’t see the option to create Hive-based replication rules, ensure that you have refreshed your browser session.
  2. From the Type dropdown, select Hive. Enter the criteria for the Hive pattern that will be used to identify which metadata will be replicated between nodes.

    Live Hive configuration
    Figure 20. Live Hive - Create a rule

    Hive Pattern
    Replication rules can be created to match Hive tables based on the same simple syntax used by Hive for pattern matching, rather than more complex regular expressions. Wildcards in replication rules can only be * for any character(s) or | for a choice.

Examples: employees, emp*, emp*|*ees all match a table named employees. The asterisk wildcard matches all names beginning with emp, and the pipe "|" matches names that begin with emp or end with ees (see the pattern test sketch at the end of this section).
Database

Name of a Hive database.

Table name

Name of a table in the above database.

Description

A description that will help identify or distinguish the Hive-pattern replication rule.

Click Create to apply the rule.

  3. Once created, Hive metadata that matches the Hive pattern will automatically have a replication rule created.

File system location is currently fixed

The File system location is locked to the wildcard .*

This value ensures that the file system location is always found. In a future release the File system location will be opened up for the user to edit.
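
To preview which tables a given pattern would match before creating a rule, you can test the same simple pattern syntax directly in Hive, for example (the database name and pattern are placeholders):

    use <database_name>;
    show tables 'emp*|*ees';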

5.1.3. Delete Hive Rule

You can delete unwanted Hive rules, through the Fusion web UI, using the following procedure.

  1. Navigate to the Fusion UI. Click on the Replication Tab.

  2. Click on the Rule that you want to delete.

    Live Hive installation
    Figure 21. Live Hive - Delete rule
  3. Click on the Delete Rule button at the bottom of the panel.

    Live Hive installation
    Figure 22. Live Hive - Delete rule
  4. A warning message appears: "Are you sure you want to delete this rule? Metadata that matches this Hive Pattern rule will stop replicating after deletion." Click Confirm only if you are sure you wish to proceed.

    Live Hive installation
    Figure 23. Live Hive - Delete rule
  5. The Replication screen will refresh. You can confirm that the deletion was successful if the rule no longer appears on the screen.

    Live Hive installation
    Figure 24. Live Hive - Delete rule

5.1.4. Review Hive Rule

Review the status of existing Hive rules through the Replication tab of the Fusion UI.

  1. Click on the Hive Rule that you wish to review.

    Live Hive installation
    Figure 25. Live Hive - View rule
  2. The View screen of the selected Hive pattern will appear.

    This Hive Pattern rule will replicate all databases and tables matching the pattern below as long as their location is already replicated in a HCFS rule.
    Live Hive installation
    Figure 26. Live Hive - View rule
    Database name

    The name of the Hive database that is getting its Hive metadata replicated.

    Table name

    A table within the above named Hive database for which Hive metadata is replicated.

    Description

    A description of the rule that you provided during its creation to help identify what it does, later.

    Delete Rule

    This button is used to remove the rule. For more details, see Delete Hive Rule.

  3. Click on the Status tab.

    Live Hive installation
    Figure 27. Live Hive - Rule status

    The status provides an up-to-date view of the status of the metadata being replicated. All databases and tables are listed on the screen, along with their latest consistency check results.

We will only replicate objects which match the pattern and already have their location replicated in a HCFS rule. When a database is checked for consistency, its non-replicating tables are not considered.
Trigger check

This button triggers a rule-wide consistency check.

Database name

The name of a Hive database, matched by the rule, that exists in your Hive deployment.

Table name

The name of a table stored in the above database that you intend to replicate.

File system location

The location of the data in the local file system.

File system location is currently fixed

The File system location is locked to the wildcard .*

This value ensures that the file system location is always found. In a future release the File system location will be opened up for the user to edit.

Description

    A description that you provided during the setup of the rule.

Zones

A list of the zones that take part in the replication of this resource.

It’s not currently possible to change the zones of an existing rule.

5.1.5. Running a Consistency Check

Live Hive provides a tool for checking that replica metadata is consistent between zones.

When to complete a consistency check?
  • After adding new metadata into a replication rule

  • Periodically, as part of your platform monitoring

  • As part of a system repair/troubleshooting.

To complete a check:

  1. Click on the Replication tab.

  2. Click on the applicable Resource of the Replication Rule.

    Live Hive CC
    Figure 28. Live Hive - Select resource
  3. On the Status tab you can trigger a consistency check of the whole rule, or more granularly.

    1. Click Trigger check to run a consistency check on everything.

      Live Hive CC
      Figure 29. Live Hive - Trigger Consistency Check
    2. Click Check at the end of the relevant row to run the consistency check on a specific table, for example.

      Live Hive CC
      Figure 30. Live Hive - Trigger specific Consistency Check
  4. If the result is inconsistent, click on Inconsistent to use the Make Consistent tool. See the next section for details.

HDP 3.1.5 - External type tables

Due to a backport of features from Apache Hive 4.0 to HDP 3.1.5 and above (HIVE-21838), newly created Hive tables that are non-ACID will be registered as External tables. Previously, all tables created within the standard Hive warehouse location (i.e. /apps/hive/warehouse) will have been registered as Managed tables.

Due to this change, when performing a consistency check between HDP 3.1.5 and other distributions, the table type will be ignored on the HDP 3.1.5 zone when it is an external table.

5.1.6. Make consistent

Enable the Parallel Repairs preview feature by following the steps in the Use Parallel Repair option (preview) section. This is a new feature that improves performance of a make consistent operation and will be enabled by default in an upcoming release of Live Hive.

In the event that metadata is found to be inconsistent, you can use this tool to make the data consistent. The associated HCFS rule must be consistent before performing this operation.

  1. Identify the nature of the inconsistency from the Detailed View panel. Select the zone that contains the correct version of the metadata, then select the actions you want the tool to take.

    Live Hive CC
    Figure 31. Live Hive - Make consistent
    Recursive

    If the checkbox is ticked, this option will cause the selected context (i.e. database/table/index/partition) and all the child objects under it to be made consistent. The default is true, but it is ignored if the selected context represents an index or partition.

    Add Missing

    Tick to create any database/table/index/partitions that are missing from a zone depending on the context selected.

    Remove Extra

    Database/Tables/Index/Partitions that exist in the target zone will be removed if they do not exist in the zone selected as your source of truth. Parents of the selected context will never be removed, e.g. if a table is selected then its parent database will not be removed even if it is missing from the source of truth.

    Update Different

    Database/Tables/Index/Partitions that exist on both the source and target zones will be overwritten to match the source zone. Database/Tables/Index/Partitions that already exist on the target zone will not be modified if this option is left unchecked.

Now click Repair.

  2. The status will now show as Unknown. To check if data has successfully been made consistent, re-run the consistency check and review the status.

    Live Hive CC
    Figure 32. Live Hive - status

Note If you are making a table consistent and the source of truth’s partition keys are different, then the table cannot be made consistent with an ALTER TABLE command. The table therefore has to be dropped and re-created from the source of truth. This operation will also drop the table’s partitions and indices, as they are children of the table.

You can also use the Make Consistent tab to access this functionality.

Live Hive CC
Figure 33. Live Hive - Make consistent tab

5.2. Administration

5.2.1. Support for Hive transactions

From Live Hive 8.2, Hive transactions are supported for both migration and active-passive replication scenarios for HDP 3.x and CDP 7.x (and above). An active-passive scenario is when only one zone is configured to use live replication.

Previously, Hive transaction support was only available for HDP 3.x between Live Hive 7.0 and 8.0.

Hive transactions will not be replicated by default and will be passed through to the local Hive Metastore by the Live Hive Proxy. To enable the active-passive support you first need to set the active zone. For active zones, any Hive transaction operations which match a Hive rule are replicated.

If you are using a non-supported version of Hive, Hive transaction support must be disabled on your cluster for Live Hive to replicate successfully.

Compaction on the target zones needs to be disabled for the tables involved in replication. This can be per table or globally for all tables. See the disable Hive compaction section for details.

Hive transaction limitations

The current Hive transaction support has multiple limitations.

  • Hive transactions are only supported for HDP 3.x and CDP 7.x and above. There is currently no support for older versions of HDP or CDH.

    • There is currently no cross-version support; clusters must be of the same version, for example: HDP-3.0.x to HDP-3.0.x.

  • Migration or replication is unidirectional between an active zone and passive target zones.

  • On passive zones, and on any unsupported distributions, Hive transaction operations will not be blocked by default but will simply be passed through to the Hive Metastore and not replicated. On passive zones we recommend that you do not perform any write operations on top of the replicated transactional tables, as this will bring them out of sync.

  • On passive zones, data associated with writes to transactional tables in active zones arrive as bandwidth allows, independently of updates to the metadata that define transactional boundaries. This means that applications consuming content from transactional tables should not expect that data associated with an individual transaction are always available in full.

    Partial transaction content may be returned for a read operation that queries a transactional table in a passive zone, and applications should account for that behavior. There is no change to transactional behavior for the active zone.

Set your Active Zone

Enable active-passive replication of Hive Transactions by setting active.txn.zone property to true on a single zone.

HDP/CDP with Live Hive 8.2 and above

Define a zone as active, with regard to Hive transaction support, by using the following steps:

  1. On the cluster manager, go to the Live Hive Proxy configuration.

  2. Filter for the active.txn.zone property and set it to true (click the checkbox).

  3. Save the configuration and restart the Live Hive service as directed by the manager.

HDP with Live Hive 8.1.1 and below

Define a zone as active, with regard to Hive transaction support, by using the following steps:

  1. On the Ambari UI, go to Live Hive Proxy → Custom live-hive-site.

  2. Add the active.txn.zone property and set it to true.

  3. Save the configuration and restart the Live Hive service as directed by the manager.

CDP with Live Hive 8.1.1 and below
  1. On the Cloudera manager, go to Live Hive Proxy → Live Hive Client Advanced Configuration Snippet (Safety Valve) for live-hive-conf/live-hive-site.xml.

  2. Add the active.txn.zone property and set it to true.

  3. Add the same property and value again to Live Hive Metastore Proxy Advanced Configuration Snippet (Safety Valve) for live-hive-site.xml.

    CDP - active txn zone
    Figure 34. CDP - active.txn.zone
  4. Save the configuration and restart the Live Hive service as directed by the manager.
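
For the safety valve entries above, a minimal sketch of the property as it would appear in live-hive-site.xml is:

    <property>
      <name>active.txn.zone</name>
      <value>true</value>
    </property>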

Disable Hive Compaction

For successful Hive transaction active-passive replication, disable Hive table compaction on the passive zones.

This can be done per table (see the example after the steps below) or globally for all tables by following the steps for your cluster distribution.

HDP
  1. On the Ambari UI, go to Hive → Custom hive-site.

  2. Add the NO_AUTO_COMPACTION property and set it to true.

  3. Save the configuration and restart the Hive services as directed by the manager.

CDP
  1. On the Cloudera manager, go to Hive → Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml.

  2. Add the NO_AUTO_COMPACTION property and set it to true.

  3. Add the same property and value again to HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml.

  4. Also in the Hive configuration, uncheck (disable) the Turn on compactor initiator thread (hive.compactor.initiator.on) option.

    CDP - disable compaction
    Figure 35. CDP - disable compaction
  5. Save the configuration and restart the Hive services as directed by the manager.
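
Alternatively, to disable compaction for an individual replicated table rather than globally, the standard Hive table property can be set from Beeline, for example (database and table names are placeholders):

    ALTER TABLE <database_name>.<table_name> SET TBLPROPERTIES ('NO_AUTO_COMPACTION'='true');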

Disabling Hive transaction support on a Hadoop cluster

To disable Hive Transaction support entirely on your Hadoop cluster, set the following properties in your Hive configuration through your Ambari/Cloudera Manager and restart the Hive services:

hive.support.concurrency=false;
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
hive.exec.dynamic.partition.mode=strict;
hive.compactor.initiator.on=false;
hive.compactor.worker.threads=0;
hive.strict.managed.tables=false;
hive.create.as.insert.only=false;
metastore.create.as.acid=false;

5.2.2. Live Hive and NameNode Proxy compatibility

If WANdisco Fusion is configured to use the NameNode Proxy for data replication, then additional steps to maintain compatibility with Live Hive are required.

Both the Live Hive Proxy and NameNode Proxy can receive and initiate replication of HCFS data; as such, it is necessary to configure Hive so that it ignores the NameNode Proxy’s HCFS write requests. Hive requests can be composed of data and metadata changes, and the Live Hive Proxy will handle both sets of changes.

If the requests are not bypassed by one of these components, then the Live Hive Proxy and NameNode Proxy will attempt to replicate the same sets of HCFS data.

Requirements for compatibility
  • HDFS must be configured with a nameservice, rather than the FQDN of a single namenode. This will be the default in HDFS environments where NameNode HA is enabled.

    The compatibility steps require that the nameservice set for Hive Metastore is adjusted so that the underlying NameNodes are referenced rather than the NameNode Proxy(s).

  • NameNode Proxy must be installed and configured as a nameservice (see HDP or CDH/CDP links for guidance) rather than a single hostname and port (non-HA).

  • The NameNode Proxy nameservice must be the defined value for the fs.defaultFS property in the HDFS config (see HDP or CDH/CDP links for guidance).

    Otherwise, you must ensure all applications will use the NameNode Proxy nameservice in their application specific configurations.

Configuration steps

This example uses the following variables:

NameNode Proxy nameservice = nnproxies
NameNode nameservice = nameservice01
NameNode 1 unique identifier = nn1
NameNode 2 unique identifier = nn2

  1. Navigate to the HDFS configuration to add new entries that will overwrite the hdfs-site.xml (see option a or b depending on platform).

    1. Ambari
      HDFS → Configs → Advanced → Custom hdfs-site.

    2. Cloudera
      HDFS → Configuration → HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.

  2. Create new properties so that the NameNode Proxy nameservice will contain references to the underlying NameNodes through their unique identifiers.

    • dfs.namenode.rpc-address.nnproxies.nn1=<nameservice01.namenode1’s rpc-address>:8020

    • dfs.namenode.rpc-address.nnproxies.nn2=<nameservice01.namenode2’s rpc-address>:8020

      Please note that the default NameNode RPC ports are shown above; adjust these if your environment differs from these values.

  3. Save the configuration after adding the new properties and restart designated services.

  4. Navigate to the Hive Metastore configuration to add a property that will overwrite the core-site.xml (see option a or b depending on platform).

    1. Ambari
      Hive → Configs → Advanced → Custom hivemetastore-site.

    2. Cloudera
      Hive → Configuration → Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for core-site.xml.

  5. Add a property that will point the Hive Metastore(s) to the NameNodes (rather than the NameNode Proxies) by setting dfs.ha.namenodes.<nnproxies-nameservice> to reference the NameNodes' unique identifiers.

    1. Add the dfs.ha.namenodes.nnproxies=nn1,nn2 property.

    2. Ensure that the property has the Final option selected, so that the hdfs-site does not overwrite it.

    3. Save the configuration changes.

  6. Deploy the configuration and restart the Hive service.

The Hive Metastore(s) will now specifically be directed to the underlying NameNodes, whilst the rest of the cluster services will be directed to the NameNode Proxy(s).
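
Using the example variable names above, the resulting property overrides would look roughly like the following (the NameNode hostnames are placeholders):

    <!-- Custom hdfs-site override: map the NameNode Proxy nameservice to the underlying NameNodes -->
    <property>
      <name>dfs.namenode.rpc-address.nnproxies.nn1</name>
      <value>namenode1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.nnproxies.nn2</name>
      <value>namenode2.example.com:8020</value>
    </property>

    <!-- Hive Metastore core-site override, marked Final so that hdfs-site does not overwrite it -->
    <property>
      <name>dfs.ha.namenodes.nnproxies</name>
      <value>nn1,nn2</value>
      <final>true</final>
    </property>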

5.2.3. Live Hive Bypass

Hive replication can be stopped at either a global or rule level. This is similar to the WANdisco Fusion Manual Bypass feature for HCFS rules.

Bypass allows clients to bypass WANdisco Fusion and can be used for maintenance and troubleshooting. When bypass is enabled, consistency check and make consistent can still continue in both directions. Replication can also continue from the remote zone to the zone with the bypass in place.

Live Hive Global Bypass

The Global Bypass setting stops the replication of Hive metadata. To stop both Hive data and metadata, see Manual Bypass in the WANdisco Fusion user guide.

To enable Live Hive Global Bypass, go to the Settings page → Live Hive → Plugin Bypass.

Live Hive Global Bypass
Figure 36. Enable Global Bypass
Suspend Live Replication

To suspend replication for specific rules go to the relevant rule page and click Suspend Live Replication.

Suspend Live Replication
Figure 37. Suspend Live Replication

Once suspended, the rule will display the warning: "This rule is currently not actively replicating. Hive requests matching this rule will not be automatically replicated."

5.2.4. Adding a node

See the WANdisco Fusion User Guide for the detailed steps on how to add and remove nodes.

When using Live Hive, the plugin must be installed on the new node after it is inducted but before rules are added. There are no differences to the standard steps when removing a node.

5.2.5. Tuning

Hive Metastore tuning

There are certain properties in the Hadoop Hive Metastore configuration that can be used for tuning Live Hive performance.

Live Hive matches these values directly, so performance issues may be fixed by altering them.

  • hive.metastore.server.min.threads

    • The minimum number of threads that Live Hive Proxy is able to create to service clients (default = 200).

  • hive.metastore.server.max.threads

    • The maximum number of threads that Live Hive Proxy is able to create to service clients (default = 1000).

  • hive.metastore.server.tcp.keepalive

    • Whether to enable keepalive for TCP connections to clients. Keepalive will prevent accumulation of half-open connections (default = true).

After adjusting these properties, restart any other affected services in your cluster first, then restart the Live Hive Proxy service.

Live Hive Proxy tuning

The following properties can be adjusted in the live-hive-site.xml for the Live Hive Proxy:

  • thrift.client.max.connections

    • This parameter limits the maximum number of connections from Live Hive Proxy to the Hive metastore. This value can be increased or decreased if it is having an impact on Hive client performance (default = 50).

  • token.renew.time.seconds

    • This value defines how often delegation tokens are recreated. It can be adjusted to match your token expiry requirements (default = 7200 seconds, i.e. 2 hours; the Hadoop default expiration time is 1 day).

After adjusting these properties, be sure to restart the Live Hive Proxy service.
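
A sketch of how these two entries might appear in live-hive-site.xml, using the default values quoted above:

    <property>
      <name>thrift.client.max.connections</name>
      <value>50</value>
    </property>
    <property>
      <name>token.renew.time.seconds</name>
      <value>7200</value>
    </property>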

Use Parallel Repair option (preview)

The parallel repair feature improves performance of the Make Consistent function by utilizing an improved algorithm and parallel threads to perform the operation.

This feature is currently in preview, speak to WANdisco support before enabling it in your environment.

Enable this feature by following the steps below.

  1. In your cluster manager, add a new property to the live-hive-site.xml.

    HDP: Live Hive Proxy → Configs → Custom live-hive-site
    CDH/CDP: Live Hive Proxy → Configuration → Live Hive Client Advanced Configuration Snippet (Safety Valve) for live-hive-conf/live-hive-site.xml

    Add the following property below:

    use.parallel.repair=true

    The repair.parallel.threads=N property can also be added where N is the number of execution threads for parallel repair. Adjust this depending on the current CPU load of your Fusion nodes (more threads, more CPU load).
    If not added, the value will match that of the default Live Hive pool of execution threads.

  2. Save the Live Hive Proxy configuration afterward.

  3. Restart the Live Hive Proxy service.

  4. Restart all Fusion Servers in the zone:

    Example

    service fusion-server restart

5.3. Troubleshooting

The following tips should help you to understand any issues you might experience with Live Hive operation:

5.3.1. Check the Release notes

Make sure that you check the latest release notes, which may include references to known issues that could impact Live Hive.

5.3.2. Check log files

Observe information in the log files generated for the WANdisco Fusion server and Live Hive to troubleshoot issues at runtime. Exceptions or log entries with a SEVERE label may represent information that can assist in determining the cause of any problem.

As a distributed system, Live Hive will be impacted by the operation of the underlying Hive database with which it communicates. You may also find it useful to review log or other information from these endpoints.

Table 1. Log Locations

  Log Type          Default Location
  Metastore         /var/log/hive
  Live Hive Node    /var/log/fusion/plugins/live-hive-proxy/
  Hive Server       /var/log/hive

Change the timezone

You can ensure that logging between zones is consistent by updating the logging configuration manually. By default, logs use the UTC timezone, but this can be altered through the log4j configuration.

To alter the timezone the xxx.layout.ConversionPattern property needs to be overwritten.

log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601}{UTC} %p %c - %t:[%m]%n

{UTC} can be replaced with, for example, {GMT} or {GMT+1:30}. If offsetting from a timezone, + or - can be used; hours must be between 0 and 23, and minutes must be between 00 and 59.

This property is located in several properties files. For an example setup these are listed below, but the exact paths may differ for your setup:

  • /etc/wandisco/fusion/plugins/hive/hive-log4j.properties

  • /etc/wandisco/fusion/server/log4j.properties

  • /etc/wandisco/fusion/ihc/server/hdp-2.6.0/log4j.properties

The Fusion UI will also need to be updated if changing the timezone, but the property is different from the one above:

/opt/wandisco/fusion-ui-server/lib/fusion_ui_log4j.xml
<PatternLayout pattern="%d{ISO8601}{UTC} %p %c - %t:[%m]%n" charset="UTF-8"/>

After updating all the relevant files, Live Hive, including the UI and Live Hive Proxy, will need to be restarted for the changes to take effect.

5.3.3. Consistency between zones

We strongly recommend that you replicate between zones running with identical setups, with the same vendors and versions. Even modest differences in configuration may result in replication or consistency problems.

Metastore Schema incompatibility

While making Hive metadata consistent, it’s possible that you will see errors such as:

"INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,"COLUMN_NAME",TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?)"

This may be caused by differences in the Hive Metastore schema between zones. You should investigate and ensure that all zones have exactly the same Hive Metastore schema; if necessary, run with a custom schema.
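
One way to compare Metastore schema versions across zones is Hive’s schematool. A sketch, assuming a MySQL-backed Metastore (substitute the -dbType for your own backing database):

    # Run on a Metastore host in each zone and compare the reported schema versions
    schematool -dbType mysql -info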

Manager configuration

Cloudera

  1. Log in to Cloudera Manager.

  2. In Live Hive Metastore Proxy Logging Advanced Configuration Snippet (Safety Valve), add the layout ConversionPattern where the timezone can be specified, e.g. GMT+1, as in the following example:

log4j.appender.RFA.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601}{GMT+1} %p %c - [%t]: %m%n

5.3.4. Connection issues

Any metastore connection issues will show in the logs, usually caused by an issue with SASL negotiation / delegation tokens. Start your investigation "from the outside, going inwards", i.e. first metastore, then proxy, then Hive server. It’s always worth trying a restart of the proxy/Fusion server before looking elsewhere.

5.3.5. Plugin initialization failure

Various errors are caused by the Live Hive Plugin not starting properly.

The plugin status may appear as unknown on the Plugins screen, under Settings.

A common cause is user/group ownership (combined with permissions) preventing all relevant Fusion services from reading the Live Hive configuration.

Check that the Fusion and Live Hive Proxy system users can list the Live Hive configuration directory and live-hive-site.xml file. For example, in Ambari, the default locations would be:

/etc/wandisco/fusion/plugins/hive

/etc/wandisco/fusion/plugins/hive/live-hive-site.xml
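
A quick way to check is to attempt the listing as each service user. This is a sketch using the example users defined in the Solution section below; adjust the usernames and paths to your deployment:

    # Run as root on the Fusion node
    sudo -u fusionuser ls -l /etc/wandisco/fusion/plugins/hive/live-hive-site.xml
    sudo -u hive ls -l /etc/wandisco/fusion/plugins/hive/live-hive-site.xml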

If any of these users are unable to access this directory and file, then consider the solution below.

Solution

In order for both Fusion and Live Hive to access shared configuration, they need to share a common group.

This can be achieved in one of two ways:

  1. Change the group value of the Live Hive Proxy to one that the Fusion system user is a member of.

  2. Add the Fusion system user to the Live Hive Proxy user’s group.

In the following examples, the assumptions are:

Fusion system user = fusionuser
Fusion system group = fusionuser
Live Hive Proxy user = hive
Live Hive Proxy group = hadoop

Option 1

Change the Live Hive Proxy’s group from hadoop to fusionuser. This should be performed by adjusting the Live Hive Proxy config on the cluster manager.

Option 2

Add the Fusion system user fusionuser to the hadoop group. This should be performed on the Fusion nodes.
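
A minimal sketch of Option 2 using the example names above (run as root on each Fusion node):

    usermod -a -G hadoop fusionuser
    # Confirm the group membership
    id fusionuser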

Restart the Live Hive proxy and fusion-server services after making any changes.

6. Reference Guide

6.1. API

Fusion Plugin for Live Hive offers increased control and flexibility through a RESTful (REpresentational State Transfer) API.

Below are listed some example calls that you can use to guide the construction of your own scripts and API driven interactions.

API documentation is still in development
Note that this API documentation continues to be developed. Contact our support team if you have any questions about the available implementation.

Note the following:

  • All calls use the base URI:

    http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/
  • The internet media type of the data supported by the web service is application/xml.

  • The API is hypertext driven, using the following HTTP methods:

Type     Action

POST     Create a resource on the server
GET      Retrieve a resource from the server
PUT      Modify the state of a resource
DELETE   Remove a resource

6.1.1. Unsupported operations

As part of Fusion’s replication system, we capture and replicate some "write" operations to an underlying DistributedFileSystem/FileSystem API. However, the truncate command is not currently supported. Do not run this command as your Hive metadata will become inconsistent between clusters.

6.1.2. Example WADL output

The application.wadl provides further details on the API usage:

http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/application.wadl

Adding ?detail=true to the above will provide further details.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<application xmlns="http://wadl.dev.java.net/2009/02">
    <doc xmlns:jersey="http://jersey.java.net/" jersey:generatedBy="Jersey: 2.25.1 2017-01-19 16:23:50"/>
    <doc xmlns:jersey="http://jersey.java.net/" jersey:hint="This is simplified WADL with user and core resources only. To get full WADL with extended resources use the query parameter detail. Link: http://livehive-host.com:8082/plugins/hive/application.wadl?detail=true"/>
    <grammars>
        <include href="application.wadl/xsd0.xsd">
            <doc title="Generated" xml:lang="en"/>
        </include>
    </grammars>
    <resources base="http://livehive-host.com:8082/plugins/hive/">
        <resource path="/">
            <resource path="consistencyCheck">
                <method id="startConsistencyCheck" name="POST">
                    <request>
                        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="dbName" style="query" type="xs:string"/>
                        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableName" style="query" type="xs:string" default=""/>
                        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="simpleCheck" style="query" type="xs:boolean" default="true"/>
                    </request>
                    <response>
                        <representation mediaType="/"/>
                    </response>
                </method>
                <resource path="{taskIdentity}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="taskIdentity" style="template" type="xs:string"/>
                    <method id="getConsistencyCheckReport" name="GET">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="withDiffs" style="query" type="xs:boolean" default="false"/>
                        </request>
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{taskIdentity}/diffs">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="taskIdentity" style="template" type="xs:string"/>
                    <method id="getConsistencyReportPage" name="GET">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="pageSize" style="query" type="xs:int" default="2147483647"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="page" style="query" type="xs:int" default="0"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="dbName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="partitions" style="query" type="xs:boolean" default="false"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="indexes" style="query" type="xs:boolean" default="false"/>
                        </request>
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{ruleIdentity}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleIdentity" style="template" type="xs:string"/>
                    <method id="startConsistencyCheckForRule" name="POST">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="dbName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="simpleCheck" style="query" type="xs:boolean" default="true"/>
                        </request>
                        <response>
                            <representation mediaType="/"/>
                        </response>
                    </method>
                </resource>
            </resource>
            <resource path="repair">
                <resource path="{ruleIdentity}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleIdentity" style="template" type="xs:string"/>
                    <method id="repair" name="PUT">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="truthZone" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="dbName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="partName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="indexName" style="query" type="xs:string"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="recursive" style="query" type="xs:boolean" default="true"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="addMissing" style="query" type="xs:boolean" default="true"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="removeExtra" style="query" type="xs:boolean" default="true"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="updateDifferent" style="query" type="xs:boolean" default="true"/>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="simpleCheck" style="query" type="xs:boolean" default="true"/>
                        </request>
                        <response>
                            <representation mediaType="/"/>
                        </response>
                    </method>
                </resource>
            </resource>
            <resource path="hiveRegex">
                <method id="getReplicationRules" name="GET">
                    <response>
                        <representation mediaType="application/xml"/>
                        <representation mediaType="application/json"/>
                    </response>
                </method>
                <resource path="{id}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="id" style="template" type="xs:string"/>
                    <method id="getReplicationRuleById" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                    <method id="deleteReplicationRuleById" name="DELETE">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="addHiveRule">
                    <method id="addReplicationRule" name="PUT">
                        <request>
                            <ns2:representation xmlns:ns2="http://wadl.dev.java.net/2009/02" xmlns="" element="hiveRule" mediaType="application/xml"/>
                            <ns2:representation xmlns:ns2="http://wadl.dev.java.net/2009/02" xmlns="" element="hiveRule" mediaType="application/json"/>
                        </request>
                        <response>
                            <representation mediaType="/"/>
                        </response>
                    </method>
                </resource>
                <resource path="active">
                    <method id="getActiveRules" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
            </resource>
            <resource path="hiveReplicatedDirectories">
                <method id="getReplicationRules" name="GET">
                    <response>
                        <representation mediaType="application/xml"/>
                        <representation mediaType="application/json"/>
                    </response>
                </method>
                <resource path="{regexRuleId}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="regexRuleId" style="template" type="xs:string"/>
                    <method id="getReplicationRuleById" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="path">
                    <method id="getReplicationRuleByPath" name="GET">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="path" style="query" type="xs:string"/>
                        </request>
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
            </resource>
            <resource path="hiveMetastore">
                <resource path="{ruleId}/{databaseName}/tables/{tableName}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="databaseName" style="template" type="xs:string"/>
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleId" style="template" type="xs:string"/>
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableName" style="template" type="xs:string"/>
                    <method id="getTable" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{ruleId}/databases">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleId" style="template" type="xs:string"/>
                    <method id="getDatabases" name="GET">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="databaseFilter" style="query" type="xs:string"/>
                        </request>
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{ruleId}/databases/{databaseName}">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="databaseName" style="template" type="xs:string"/>
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleId" style="template" type="xs:string"/>
                    <method id="getDatabase" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{ruleId}/{databaseName}/tables">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="databaseName" style="template" type="xs:string"/>
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleId" style="template" type="xs:string"/>
                    <method id="getTables" name="GET">
                        <request>
                            <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="tableFilter" style="query" type="xs:string"/>
                        </request>
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
                <resource path="{ruleId}/summary">
                    <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="ruleId" style="template" type="xs:string"/>
                    <method id="getRuleSummary" name="GET">
                        <response>
                            <representation mediaType="application/xml"/>
                            <representation mediaType="application/json"/>
                        </response>
                    </method>
                </resource>
            </resource>
        </resource>
    </resources>
</application>

6.1.3. Example REST calls

The following examples illustrate some simple use cases, most are direct calls through a web browser, although for deeper or interactive examples, a curl client may be used.

Optional query params
?dbName= &tableName= &path=
GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveRegex/{hiveRegexRuleId}  > returns a HiveReplicationRuleDTO >
@XmlRootElement(name = "hiveRule")
@XmlType(propOrder =
Permitted value
  • private String dbNamePattern = "";

  • private String tableNamePattern = "";

  • private String tableLocationPattern = "";

  • private String membershipIdentity = "";

  • private String ruleIdentity;

  • private String description = "";

List Hive Replication Rule DTO
GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveRegex/ > HiveReplicationRulesListDTO
  • (all known rules) list of HiveReplicationRuleDTO (see below for format)

PUT http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveRegex/addHiveRule/ PAYLOAD HiveReplicationRuleDTO >
{dbNamePattern:'mbdb.*', tableNamePattern:'tabl.*', tableLocationPattern:'.*', membershipIdentity:'824ce758-641c-46d6-9c7d-d2257496734d', ruleIdentity:'6a61c98b-eaea-4275-bf81-0f82b4adaaef', description:'mydbrule'}
GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveRegex/active/ >
  • Returns HiveReplicationRulesListDTO of all ACTIVE hiveRegex rules

GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveReplicatedDirectories > HiveReplicatedDirectoresDTOList<HiveReplicatedDirectoryDTO> :
  • Will get all HCFS replicated directories that were created via a Hive pattern rule automatically upon table creation. Returns JSON in format:

{"hiveReplicatedDirectoryDTOList":[{"rd":"ReplicatedDirectoryDTO","propertiesDTO":{"properties":"Map<String, String>"},"consistencyTaskId":"str","consistencyStatus":"str","lastConsistencyCheck":0,"consistencyPending":true}]}
GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveReplicatedDirectories/{regexRuleId}  >
  • Returns the same as above, but only directories created via the given regex rule ID, supplied as a path parameter.

GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveReplicatedDirectories/path?path=/some/location >
  • Returns the same as above, but this time filtered by a query parameter giving the path of the HCFS location.

Consistency Checks

Perform a consistency check on the database specified. The response will contain the location of the Platform task that is coordinating the consistency check. This task will exist on all nodes in the membership and at completion each task will be in the same state. The taskId can be used to view the consistency check report using the /hive/consistencyCheck/{taskId} API.

<rule-ID> is needed in some of the following commands; it can be found by viewing the rule in the Fusion UI, or by using GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/hiveRegex for a list of all Hive rules.

Start a Consistency Check on a particular Hive Database and Table
POST http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/consistencyCheck/<rule-ID>?dbName=<db Name>&tableName=<table Name>
  • tableName, dbName and simpleCheck are optional query parameters and if omitted will default to tableName="", dbName="" and simpleCheck=true

Get the Consistency Check report for a Consistency Check task previously requested by the API above
GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/consistencyCheck/{taskId}?withDiffs=true
  • The withDiffs query parameter is optional and defaults to false if not supplied.
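
As a hedged curl sketch of the two calls above (the hostname, rule ID and the task ID returned by the first call are placeholders):

    # Start a consistency check for a rule; the response contains the location of the coordinating task
    curl -X POST "http://<FUSION-HOSTNAME>:8082/plugins/hive/consistencyCheck/<rule-ID>?dbName=test_db&tableName=test_table"

    # Retrieve the report for that task, including diffs
    curl "http://<FUSION-HOSTNAME>:8082/plugins/hive/consistencyCheck/<taskId>?withDiffs=true"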

Get part of the consistency check report depending on the query parameters set
 GET http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/consistencyCheck/{taskId}/diffs?pageSize=10&page=0&dbName=test_db&tableName=test_table&partitions=true&indexes=true
  • Returns part of the diff from the consistency check. The hierarchy is: dbName / tableName / one of [partitions=true or indexes=true].

    dbname

    Name of the database to check.

    tableName

    Name of the database table to check; the default of " " will check all tables. If specified, then either partitions or indexes must be specified.

    pageSize

    Optional. Will default to pageSize = 2,147,483,647

    page

    Optional. Will default to page=0.

Repair
Start to repair the specified database and table.
PUT "http(s)://<FUSION-HOSTNAME>:8082/plugins/hive/repair/<rule-ID>?truthZone=zone1&dbName=test_db&tableName=test_table&partName=testPart&indexName=test_index&recursive=true&addMissing=true&removeExtra=true&updateDifferent=true&simpleCheck=true"
truthZone (required)

Zone which is the source of truth.

dbName (required)

Database to repair in. Note that this database has to exist in the zone where this API call is made.

tableName (optional)
partName (optional)
indexName (optional)
recursive (optional)

Defaults to true.

addMissing (optional)

Defaults to true. If true, objects which are missing will be created.

removeExtra (optional)

Defaults to true. If true, objects which don’t exist in truthZone will be removed.

updateDifferent (optional)

Defaults to true. If true, the objects which are different will be fixed.

simpleCheck (optional)

Defaults to true. If true then the repair operation will only involve a simple check and not include any extended parameters of the objects being repaired.
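
A curl form of the repair call above, shown as a sketch with placeholder values:

    curl -X PUT "http://<FUSION-HOSTNAME>:8082/plugins/hive/repair/<rule-ID>?truthZone=zone1&dbName=test_db&tableName=test_table&recursive=true&addMissing=true&removeExtra=true&updateDifferent=true&simpleCheck=true"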