SSL encryption
Basics
WD Fusion supports SSL for any or all of the three channels of communication: Fusion Server - Fusion Server, Fusion Server - Fusion Client, and Fusion Server - IHC Server.
keystore
A keystore (containing a private key / certificate chain) is used by an SSL server to encrypt the communication and create digital signatures.
truststore
A truststore is used by an SSL client for validating certificates sent by other servers. It simply contains certificates that are considered "trusted". For convenience you can use the same file as both the keystore and the truststore, you can also use the same file for multiple processes.
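If you don't already have these files, they can be generated with the JDK's keytool. The commands below are a minimal sketch only: the alias, file names and validity period are illustrative, and a self-signed certificate stands in for a CA-signed certificate chain.
# Create a keystore containing a private key and self-signed certificate
keytool -genkeypair -alias wandisco -keyalg RSA -keysize 2048 -validity 365 -keystore /opt/wandisco/ssl/keystore.ks
# Export the certificate, then import it into a truststore for clients to validate against
keytool -exportcert -alias wandisco -keystore /opt/wandisco/ssl/keystore.ks -file fusion.crt
keytool -importcert -alias wandisco -file fusion.crt -keystore /opt/wandisco/ssl/truststore.ks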
Enabling SSL
Follow these steps to enable SSL: Enable HTTPS
License Model
WD Fusion is supplied through a licensing model based on the number of nodes and data transfer volumes. WANdisco generates a license file matched to your agreed usage model. If your usage pattern changes or if your license period ends then you need to renew your license. See License renewals
- Evaluation license
- To simplify the process of pre-deployment testing, WD Fusion is supplied with an evaluation license (also known as a "trial license"). This type of license imposes limits of usage:
Source | Time limit | No. fusion servers | No. of Zones | Replicated Data | Plugins | Specified IPs
Website | 14 days | 1-2 | 1-2 | 5TB | No | No
- Production license
- Customers entering production need a production license file for each node. These license files are tied to the node's IP address. In the event that a node needs to be moved to a new server with a different IP address, customers should contact WANdisco's support team and request that a new license be generated. Production licenses can be set to expire or they can be perpetual.
Source | Time limit | No. fusion servers | No. of Zones | Replicated Data | Plugins | Specified IPs
FD | variable (default: 1 year) | variable (default: 20) | variable (default: 10) | variable (default: 20TB) | Yes | Yes, machine IPs are embedded within the license
- Unlimited license
- For large deployments, Unlimited licenses are available, for which there are no usage limits.
License renewals
- The WD Fusion UI provides a warning message whenever you log in.
- A warning also appears under the Settings tab on the license Settings panel. Follow the link to the website.
- Complete the form to set out your requirements for license renewal.
2.1.3 Supported versions
This table shows the versions of Hadoop and Java that we currently support:
Distribution | Console | JRE
Apache Hadoop 2.5.0 | | Oracle JDK 1.7 / 1.8 or OpenJDK 7
HDP 2.1 / 2.2 / 2.3 / 2.4 | Ambari 1.6.1 / 1.7 / 2.1; support for EMC Isilon 7.2.0.1 and 7.2.0.2 | Oracle JDK 1.7 / 1.8 or OpenJDK 7
CDH 5.2.0 / 5.3.0 / 5.4 / 5.5 / 5.6 / 5.7 | Cloudera Manager 5.3.x, 5.4.x, 5.5.x, 5.6.x and 5.7.x; support for EMC Isilon 7.2.0.1 and 7.2.0.2 | Oracle JDK 1.7 / 1.8 or OpenJDK 7
Pivotal HD 3.0, 3.4 | Ambari 1.6.1 / 1.7 | Oracle JDK 1.7 / 1.8 or OpenJDK 7
MapR 4.0.x, 4.1.0, 5.0.0 | Ambari 1.6.1 / 1.7 | Oracle JDK 1.7 / 1.8 or OpenJDK 7
Amazon S3 | | Oracle JDK 1.7 / 1.8 or OpenJDK 7
IOP (BigInsights) 4.0 / 4.1 | Ambari 1.7 (with IOP 4.0) / 2.1 (with IOP 4.1) | Oracle JDK 1.7 / 1.8 or OpenJDK 7
Supported applications
Supported Big Data applications may be noted here as we complete testing:
Application | Version Supported | Tested with
Syncsort DMX-h | 8.2.4 | See Knowledge base
2.2 Final Preparations
We'll now look at what you should know and do as you begin the installation.
Time requirements
The time required to complete a deployment of WD Fusion will depend in part on its size; larger deployments with more nodes and more complex replication rules will take correspondingly more time to set up. Use the guide below to help you plan for deployments.
- Run through this document and create a checklist of your requirements. (1-2 hours).
- Complete the WD Fusion installation (about 20 minutes per node, or 1 hour for a test deployment).
- Complete client installations and complete basic tests (1-2 hours).
Of course, this is a guideline to help you plan your deployment. You should think ahead and determine if there are additional steps or requirements introduced by your organization's specific needs.
Network requirements
See the deployment checklist for a list of the TCP ports that need to be open for WD Fusion.
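As a quick spot check from a remote machine you can probe the ports with nc. This is a sketch only: the hostname is a placeholder and the ports listed are simply the ones used in examples elsewhere in this guide (8082 Fusion server, 8083 Fusion UI, 7000/9001 IHC); the deployment checklist remains the authoritative list.
# Probe the WD Fusion ports from a remote zone (hostname and port list illustrative)
for port in 8082 8083 7000 9001; do
  nc -z -w5 fusion-node.example.com "$port" && echo "port $port reachable" || echo "port $port blocked"
done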
Running WD Fusion on multihomed servers
The following guide runs through what you need to do to correctly configure a WD Fusion deployment if the nodes are running with multiple network interfaces.
Servers running on multiple networks.
Example:
10.123.456.127 is the public IP Address of the IHC for DC1 and 192.168.10.41 is the private IP address. The public IP address is configured in two places, both in DC1:
- /etc/wandisco/ihc (for the IHC process) in the IHC machine.
Flow
- A file is created in Data Center 1 (DC1). A client writes the data.
- Periodically, after the data is written, a proposal is sent by the WD Fusion Server in Data Center 1 telling the WD Fusion server in Data Center 2 (DC2) to pull the new file.
- Fusion Server in DC2 gets this agreement, connects to 10.123.456.127:7000 and pulls the data.
Getting Connected to the right interface
- Stop all WD Fusion services.
- Reconfigure your IHCs to your preferred address in
/etc/wandisco/ihc/ and /etc/wandisco/fusion.ihc for each IHC node.
- Restart all services
Further discussion
You can read more about setting up on multihomed servers in the Deployment section for Oracle DBA: Operating in a multihomed environment
Kerberos Security
If you are running Kerberos on your cluster you should consider the following requirements:
- Kerberos is already installed and running on your cluster
- Fusion-Server is configured for Kerberos as described in Setting up Kerberos.
Kerberos Configuration before starting the installation
Before running the installer on a platform that is secured by Kerberos, you'll need to run through the following steps:
Setting up Kerberos.
Warning about mixed Kerberized / Non-Kerberized zones
In deployments that mix kerberized and non-kerberized zones it's possible that permission errors will occur because the different zones don't share the same underlying system superusers. In this scenario you would need to ensure that the superuser for each zone is created on the other zones.
For example, if you connect a zone that runs CDH, which has the superuser 'hdfs', with a zone running MapR, which has the superuser 'mapr', you would need to create the user 'hdfs' on the MapR zone and 'mapr' on the CDH zone.
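A minimal sketch of that cross-creation, using the example superusers above (run as root in each zone and adapt to your own account policies):
# On the MapR zone: create the CDH superuser
useradd hdfs
# On the CDH zone: create the MapR superuser
useradd mapr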
Kerberos Relogin Failure with Hadoop 2.6.0 and JDK7u80 or later
Hadoop Kerberos relogin fails silently due to HADOOP-10786. This impacts Hadoop 2.6.0 when JDK7u80 or later is used (including JDK8).
Users should downgrade to JDK7u79 or earlier, or upgrade to Hadoop 2.6.1 or later.
Manual instructions
See the Knowledge Base for instructions on setting up manual Kerberos settings. You only need these in special cases as the steps have been handled by the installer. See Manual Updates for WD Fusion UI Configuration.
See the Knowledge Base for instructions on setting up auth-to-local permissions, mapping a Kerberos principal onto a local system user. See Setting up Auth-to-local.
Clean Environment
Before you start the installation you must ensure that there are no existing WD Fusion installations or WD Fusion components installed on your elected machines. If you are about to upgrade to a new version of WD Fusion you must first make sure that you run through the removal instructions provided in the Appendix - Cleanup WD Fusion.
Ensure HADOOP_HOME is set in the environment
Where the hadoop command isn't in the standard system path, administrators must ensure that the HADOOP_HOME environment variable is set for the root user and the user WD Fusion will run as, typically hdfs.
When set, HADOOP_HOME must be the parent of the bin directory into which the Hadoop scripts are installed.
Example: if the hadoop command is:
/opt/hadoop-2.6.0-cdh5.4.0/bin/hadoop
then HADOOP_HOME must be set to /opt/hadoop-2.6.0-cdh5.4.0 .
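For that example, the corresponding environment setup might look like the following sketch (add it to the shell profile of root and of the user WD Fusion runs as):
# HADOOP_HOME is the parent of the bin directory that contains the hadoop command
export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.4.0
export PATH=$HADOOP_HOME/bin:$PATH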
Installer File
You need to match WANdisco's WD Fusion installer file to each data center's version of Hadoop. Installing the wrong version of WD Fusion will result in the IHC servers being misconfigured.
License File
After completing an evaluation deployment, you will need to contact WANdisco about getting a license file for moving your deployment into production.
2.3 Running the installer
Below is the procedure for getting set up with the installer. Running the installer only takes a few minutes while you enter the necessary settings. If you wish to handle installations without a user having to enter the settings manually, you can use the Silent Installer.
Starting the installation
Use the following steps to complete an installation using the installer file. This requires an administrator to enter details throughout the procedure. Once the initial settings are entered through the terminal session, the installation is completed through a browser or, alternatively, using the Silent Installation option to handle configuration programmatically.
- Open a terminal session on your first installation server. Copy the WD Fusion installer script into a suitable directory.
- Make the script executable, e.g.
chmod +x fusion-ui-server-<version>_rpm_installer.sh
- Execute the file with root permissions, e.g.
sudo ./fusion-ui-server-<version>_rpm_installer.sh
- The installer will now start.
Verifying archive integrity... All good.
Uncompressing WANdisco Fusion..............................
:: :: :: # # ## #### ###### # ##### ##### #####
:::: :::: ::: # # # # ## ## # # # # # # # # #
::::::::::: ::: # # # # # # # # # # # # # #
::::::::::::: ::: # # # # # # # # # # # ##### # # #
::::::::::: ::: # # # # # # # # # # # # # # #
:::: :::: ::: ## ## # ## # # # # # # # # # # #
:: :: :: # # ## # # # ###### # ##### ##### #####
Welcome to the WANdisco Fusion installation
You are about to install WANdisco Fusion version 2.4-206
Do you want to continue with the installation? (Y/n) y
The installer will perform an integrity check, confirm the product version that will be installed, then invite you to continue. Enter "Y" to continue the installation.
- The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the WD Fusion UI.
Checking prerequisites:
Checking for perl: OK
Checking for java: OK
INFO: Using the following Memory settings:
INFO: -Xms128m -Xmx512m
Do you want to use these settings for the installation? (Y/n) y
The installer checks for Perl and Java. See the Installation Checklist for more information about these requirements. Enter "Y" to continue the installation.
- Next, confirm the port that will be used to access WD Fusion through a browser.
Which port should the UI Server listen on? [8083]:
- Select the Hadoop version and type from the list of supported platforms:
Please specify the appropriate backend from the list below:
[0] cdh-5.2.0
[1] cdh-5.3.0
[2] cdh-5.4.0
[3] cdh-5.5.0
[4] hdp-2.1.0
[5] hdp-2.2.0
[6] hdp-2.3.0
Which fusion backend do you wish to use? 5
You chose hdp-2.2.0:2.6.0.2.2.0.0-2041
MapR/Pivotal availability
The MapR/PHD versions of Hadoop have been removed from the trial version of WD Fusion in order to reduce the size of the installer for most prospective customers. These versions are run by a small minority of customers, while their presence nearly doubled the size of the installer package. Contact WANdisco if you need to evaluate WD Fusion running with MapR or PHD.
Additional available packages
[1] mapr-4.0.1
[2] mapr-4.0.2
[3] mapr-4.1.0
[4] mapr-5.0.0
[5] phd-3.0.0
MapR requirements
URI
MapR needs to use WD Fusion's native fusion:/// URI instead of the default hdfs:///. Ensure that during installation you select the Use WD Fusion URI with HCFS file system option.
Superuser
If you install into a MapR cluster, you need to run WD Fusion as the MapR superuser system account/group "mapr" in order to use the fusion:/// URI.
See the requirement for MapR Client Configuration.
See the requirement for MapR impersonation.
When running a TeraSort job on MapR without the simple partitioner configuration, the YARN containers will fail with a Fusion client ClassNotFoundException. The remedy is to set yarn.application.classpath in each node's yarn-site.xml. (FUI-1853)
- The installer now confirms which system user/group will be applied to WD Fusion.
We strongly advise against running Fusion as the root user.
For default HDFS setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
Which user should Fusion run as? [hdfs]
Checking 'hdfs' ...
... 'hdfs' found.
Please choose an appropriate group for your system. By default HDP uses the 'hadoop' group.
Which group should Fusion run as? [hadoop]
Checking 'hadoop' ...
... 'hadoop' found.
The installer does a search for the commonly used account and group, assigning these by default.
- Check the summary to confirm that your chosen settings are appropriate:
Installing with the following settings:
User and Group: hdfs:hadoop
Hostname: node04-example.host.com
Fusion Admin UI Listening on: 0.0.0.0:8083
Fusion Admin UI Minimum Memory: 128
Fusion Admin UI Maximum memory: 512
Platform: hdp-2.3.0 (2.7.1.2.3.0.0-2557)
Manager Type AMBARI
Manager Host and Port: :
Fusion Server Hostname and Port: node04-example.host.com:8082
SSL Enabled: false
Do you want to continue with the installation? (Y/n) y
You are now given a summary of all the settings provided so far. If these settings are correct then enter "Y" to complete the installation of the WD Fusion server.
- The package will now install
Installing hdp-2.1.0 packages:
fusion-hdp-2.1.0-server-2.4_SNAPSHOT-1130.noarch.rpm ...
Done
fusion-hdp-2.1.0-ihc-server-2.4_SNAPSHOT-1130.noarch.rpm ...
Done
Installing fusion-ui-server package
Starting fusion-ui-server:[ OK ]
Checking if the GUI is listening on port 8083: .....Done
- The WD Fusion server will now start up:
Please visit http://<YOUR-SERVER-ADDRESS>.com:8083/ to access the WANdisco Fusion
If 'http://<YOUR-SERVER-ADDRESS>.com' is internal or not available from your browser, replace this with an externally available address to access it.
Installation Complete
[root@node05 opt]#
At this point the WD Fusion server and corresponding IHC server will be installed. The next step is to configure the WD Fusion UI through a browser or using the silent installation script.
Configure WD Fusion through a browser
Follow this section to complete the installation by configuring WD Fusion using a browser-based graphical user interface.
Open a web browser and point it at the provided URL, e.g.
http://<YOUR-SERVER-ADDRESS>.com:8083/
- In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows:
- Adding a new WD Fusion cluster
- Select Add Zone.
- Adding additional WD Fusion servers to an existing WD Fusion cluster
- Select Add to an existing Zone.
High Availability for WD Fusion / IHC Servers
It's possible to enable High Availability in your WD Fusion cluster by adding additional WD Fusion/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient WD Fusion/IHC servers running to maintain replication.
Add HA nodes to the cluster using the installer and choosing to Add to an existing Zone, using a new node name.
Configuration for High Availability
When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, problems will occur. WD Fusion should never see the same URI (scheme + authority) for two different clusters.
Welcome.
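A quick way to confirm this is to compare the configured default filesystem in each zone; a minimal sketch (run on a cluster node in each zone and check that the values differ):
# Print the default filesystem for this zone; the value must differ between zones
hdfs getconf -confKey fs.defaultFS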
- Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
Environmental checks.
On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.
Example check results.
Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
- Upload the license file.
Upload your license file.
- The conditions of your license agreement will be presented in the top panel, including License Type, Expiry date, Name Node Limit and Data Node Limit.
Verify license and agree to subscription agreement.
Click I agree to the EULA to continue, then click Next Step.
- Enter settings for the WD Fusion server.
screen 4 - Server settings
WD Fusion Server
- Maximum Java heap size (GB)
- Enter the maximum Java Heap value for the WD Fusion server.
- Umask (currently 022)
- Set the default permissions applied to newly created files. The value 022 results in default directory permissions of 755 and default file permissions of 644 (see the short example after this list). This ensures that the installation will be able to start up/restart.
- Latitude
- The north-south coordinate angle for the installation's geographical location.
- Longitude
- The east-west coordinate angle for the installation's geographical location. The latitude and longitude is used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.
Alternatively, you can click on the global map to locate the node.
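Regarding the Umask setting above, the following sketch illustrates how the value 022 maps onto the stated directory and file permissions (run in any shell):
# With umask 022, new directories default to 755 and new files to 644
umask 022
mkdir demo_dir && ls -ld demo_dir    # drwxr-xr-x
touch demo_file && ls -l demo_file   # -rw-r--r--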
Advanced options
Only apply these options if you fully understand what they do.
The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.
- Custom UI hostname
- Lets you set a custom hostname for the Fusion UI, distinct from the communication.hostname which is already set as part of the install and used by WD Fusion nodes to connect to the Fusion server.
- Custom UI Port
- Lets you change WD Fusion UI's default port, in case it is already assigned elsewhere, e.g. Cloudera's headlamp debug server also uses it.
- Strict Recovery
- See explanation of the Strict Recovery Advanced Options.
Enable SSL for WD Fusion
Tick the checkbox to enable SSL
- KeyStore Path
- System file path to the keystore file.
e.g. /opt/wandisco/ssl/keystore.ks
- KeyStore Password
- Encrypted password for the KeyStore.
e.g. ***********
- Key Alias
- The Alias of the private key.
e.g. WANdisco
- Key Password
- Private key encrypted password.
e.g. ***********
- TrustStore Path
- System file path to the TrustStore file.
e.g. /opt/wandisco/ssl/keystore.ks
- TrustStore Password
- Encrypted password for the TrustStore.
e.g. ***********
IHC Server
IHC Settings
- Maximum Java heap size (GB)
- Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
- IHC network interface
- The hostname for the IHC server.
Advanced Options (optional)
- IHC server binding address
- In the advanced settings you can decide which address the IHC server will bind to. The address is optional; by default the IHC server binds to all interfaces (0.0.0.0), using the port specified in the ihc.server field. In all cases the port should be identical to the port used in the ihc.server address, which is set in /etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc or /etc/wandisco/fusion/ihc/server/localfs-2.7.0/2.7.0.ihc (the file name reflects your platform and version).
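As a quick check that the ports agree, you can inspect the ihc.server value in the relevant file; a sketch using the cdh-5.4.0 example above (your file name will reflect your own platform and version):
# The port in ihc.server must match the port used in the IHC binding address
grep '^ihc.server' /etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc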
Once all settings have been entered, click Next step.
- Next, you will enter the settings for your new Zone.
New Zone
Zone Information
Entry fields for zone properties
- Fully Qualified Domain Name
- The full hostname for the server.
- Node ID
- A unique identifier that will be used by WD Fusion UI to identify the server.
- Location Name (optional)
- A location name that can quickly identify where the server is located.
Induction failure: If induction fails, attempting a fresh installation may be the most straightforward cure; however, it is possible to push through an induction manually, using the REST API. See Handling Induction Failure.
Known issue with Location names: You must use different Location names/Node IDs for each zone. If you use the same name for multiple zones you will not be able to complete the induction between those nodes.
- DConE Port
- TCP port used by WD Fusion for replicated traffic.
- Zone Name
- The name used to identify the zone in which the server operates.
- Management Endpoint
- Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager:
Advanced Options
Only apply these options if you fully understand what they do.
The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.
URI Selection
The default behavior for WD Fusion is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the fusion:/// URI and can only use the HDFS protocol. Each option is explained below:
- Use HDFS URI with HDFS file system
This option is available for deployments where the Hadoop applications support neither the WD Fusion URI nor the HCFS standard. WD Fusion operates entirely within HDFS.
This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs:// or with no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
- Use WD Fusion URI with HCFS file system
-
When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Fusion URI or are not written to the HCFS specification, this option will not work.
MapR deployments
Use this URI selection if you are installing into a MapR cluster.
- Use Fusion URI with HDFS file system
This differs from the default in that while the WD Fusion URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the WD Fusion URI but not the Hadoop Compatible File System.
- Use Fusion URI and HDFS URI with HDFS file system
This "mixed mode" supports all the replication schemes (fusion:// , hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.
Fusion Server API Port
This option lets you select the TCP port that is used for WD Fusion's API.
Strict Recovery
Two advanced options are provided to change the way that WD Fusion responds to a system shutdown where WD Fusion was not shut down cleanly. Currently the default setting is not to enforce a panic event in the logs if, during startup, we detect that WD Fusion wasn't shut down cleanly. This is suitable for using the product as part of an evaluation effort. However, when operating in a production environment, you may prefer to enforce the panic event, which will stop any attempted restarts to prevent possible corruption to the database.
- DConE panic if dirty (checkbox)
- This option lets you enable the strict recovery option for WANdisco's replication engine, to ensure that any corruption to its prevayler database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shut down, either due to a system or application problem.
- App Integration panic if dirty (checkbox)
- This option lets you enable the strict recovery option for WD Fusion's database, to ensure that any corruption to its internal database doesn't lead to further problems. When the checkbox is ticked, WD Fusion will log a panic message whenever WD Fusion is not properly shut down, either due to a system or application problem.
<Hadoop Management Layer> Configuration
This section configures WD Fusion to interact with the management layer, which could be Ambari or Cloudera Manager, etc.
- Manager Host Name /IP
- The full hostname or IP address for the working server that hosts the Hadoop manager.
- Port
- TCP port on which the Hadoop manager is running.
- Username
- The username of the account that runs the Hadoop manager.
- Password
- The password that corresponds with the above username.
- SSL
- (Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't change it from the default http port.
Authentication without a management layer
WD Fusion normally uses the authentication built into your cluster's management layer, i.e. the Cloudera Manager username and password are required to login to WD Fusion. However, in Cloud-based deployments, such as Amazon's S3, there is no management layer. In this situation, WD Fusion adds a local user to WD Fusion's ui.properties file, either during the silent installation or through the command-line during an installation.
Should you forget these credentials, see Reset internally managed password
- Enter security details, if applicable to your deployment.
Kerberos Configuration
In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.
Enabling Kerberos authentication on WD Fusion's REST API
When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the "--negotiate" and "-u:" options, like so:
curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"
See Setting up Kerberos for more information about Kerberos setup.
- Click Validate to confirm that your settings are valid. Once validated, click Next step.
Zone information.
- The remaining panels in step 6 detail all of the installation settings. All your license, WD Fusion server, IHC server and zone settings are shown. If you spot anything that needs to be changed, you can go back and correct it.
Summary
Once you are happy with the settings and all your WD Fusion clients are installed, click Deploy Fusion Server.
WD Fusion Client Installation
- In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support WD Fusion's data replication across the Hadoop ecosystem.
Client installations.
The installer supports three different packaging systems for installing clients: regular RPMs, Parcels for Cloudera and HDP Stack for Hortonworks/Ambari.
client package location
You can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
/opt/wandisco/fusion-ui-server/ui/stack_packages
/opt/wandisco/fusion-ui-server/ui/parcel_packages
RPM / DEB Packages
client nodes
By client nodes we mean any machine that is interacting with HDFS that you need to form part of WD Fusion's replicated system. If a node is not going to form part of the replicated system then it won't need the WD Fusion client installed. If you are hosting the WD Fusion UI package on a dedicated server, you don't need to install the WD Fusion client on it as the client is built into the WD Fusion UI package. Note that in this case the WD Fusion UI server would not be included in the list of participating client nodes.
Important! If you are installing on Ambari 1.7 or CDH 5.3.x
Additionally, due to a bug in Ambari 1.7, and an issue with the classpath in CDH 5.3.x, before you can continue you must log into Ambari/Cloudera Manager and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.
Example clients list
For more information about doing a manual installation, see Fusion Client installation for regular RPMs. To install with the Cloudera parcel file, see Fusion Client installation with Parcels. For Hortonworks' own proprietary packaging format, see Fusion Client installation with HDP Stack.
- The next step starts WD Fusion up for the first time. You may receive a warning message if your clients have not yet been installed. You can now address any client installations, then click Revalidate Client Install to make the warning go away. If everything is setup correctly you can click Start WD Fusion.
Skip or start.
- If you are installing onto a platform that is running Ambari (HDP or Pivotal HD), once the clients are installed you should login to Ambari and restart any services that are flagged as waiting for a restart. This will apply to MapReduce and YARN, in particular.
restart to refresh config
- If you are running Ambari 1.7, you'll be prompted to confirm this is done.
Confirm that you have completed the restarts
Important! If you are installing on Ambari 1.7 or CDH 5.3.x
Additionally, due to a bug in Ambari 1.7, and an issue with the classpath in CDH 5.3.x, before you can continue you must log into Ambari/Cloudera Manager and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.
First WD Fusion node installation
When installing WD Fusion for the first time, this step is skipped. Click Skip Induction.
Second and subsequent WD Fusion node installations into an existing zone
When adding a node to an existing zone, users will be prompted for zone details at the start of the installer and induction will be handled automatically. Nodes added to a new zone will have the option of being inducted at the end of the install process where the user can add details of the remote node.
Induction failure due to HADOOP-11461
There's a known bug in Jersey 1.9, covered in HADOOP-11461 which can result in the failure of WD Fusion's induction.
Workaround:
- Open the file
/etc/wandisco/fusion/server/log4j.properties in an editor.
- Add the following property:
log4j.logger.com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator=OFF
- Save the file and retry the induction.
Known issue with Location names: You must use different Location names/IDs for each zone. If you use the same name for multiple zones you will not be able to complete the induction between those nodes.
Induction.
- Once the installation is complete, you will get access to the WD Fusion UI after logging in with your Hadoop manager username and password.
WD Fusion UI
2.4 Configuration
Once WD Fusion has been installed in all data centers you can proceed with setting up replication on your HDFS file system. You should plan your requirements ahead of the installation, matching up your replication with your cluster to maximise performance and resilience. The next section takes a brief look at an example configuration and runs through the necessary steps for setting up data replication between two data centers.
Replication Overview
Example WD Fusion deployment in a 3 data center deployment.
In this example, each of the three data centers ingests data from its own dataset: "Weblogs", "phone support" and "Twitter feed". An administrator can choose to replicate any or all of these data sets so that the data is replicated across any of the data centers, where it will be available for compute activities by the whole cluster. The only change required to your Hadoop applications will be the addition of a replication-specific URI, and this will only be a requirement if you are using HCFS rather than the native HDFS protocol.
Setting up Replication
The following steps are used to start replicating hdfs data. The detail of each step will depend on your cluster setup and your specific replication requirements, although the basic steps remain the same.
- Create a membership including all the data centers that will share a particular directory. See Create Membership
- Create and configure a Replicated Folder. See Replicated Folders
- Perform a consistency check on your replicated folder. See Consistency Check
- Configure your Hadoop applications to use WANdisco's protocol. See Configure Hadoop for WANdisco replication
- Run Tests to validate that your replicated folder remains consistent while data is being written to each data center. See Testing replication
You can't move files between replicated directories
Currently you can't perform a straight move operation between two separate replicated directories.
Installing on a Kerberized cluster
The Installer lets you configure WD Fusion to use your platform's Kerberos implementation. You can find supporting information about how WD Fusion handles Kerberos in the Admin Guide, see Setting up Kerberos.
2.5 Deployment
The deployment section covers the final step in setting up a WD Fusion cluster, where supported Hadoop applications are plugged into WD Fusion's synchronized distributed namespace. It won't be possible to cover all the requirements for all the third-party software described here, so we strongly recommend that you get hold of the corresponding documentation for each Hadoop application before you work through these procedures.
2.5.1 Hive
This guide covers integrating WD Fusion with Apache Hive. It aims to accomplish the following goals:
- Replicate Hive table storage.
- Use fusion URIs as store paths.
- Use fusion URIs as load paths.
- Share the Hive metastore between two clusters.
Prerequisites
- Knowledge of Hive architecture.
- Ability to modify Hadoop site configuration.
- WD Fusion installed and operating.
Replicating Hive Storage via fusion:///
The following requirements come into play if you have deployed WD Fusion with its native fusion:/// URI. In order to store a Hive table in WD Fusion you specify a WD Fusion URI when creating a table. For example, consider creating a table called log that will be stored in a replicated directory.
CREATE TABLE log(requestline string) stored as textfile location 'fusion:///repl1/hive/log';
Note: Replicating table storage without sharing the Hive metadata will create a logical discrepancy in the Hive catalog. For example, consider a case where a table is defined on one cluster and replicated on the HCFS to another cluster. A Hive user on the other cluster would need to define the table locally in order to make use of it.
Exceptions
Hive from CDH 5.3/5.4 does not work with WD Fusion (because of HIVE-9991). To get it working with CDH 5.3 and 5.4 you need to modify the default Hive file system setting. In Cloudera Manager, add the following property to hive-site.xml:
<property>
<name>fs.defaultFS</name>
<value>fusion:///</value>
</property>
This property should be added in three areas:
- Service Wide
- Gateway Group
- HiveServer2 Group
Replicated directories as store paths
It's possible to configure Hive to use WD Fusion URIs as output paths for storing data. To do this you must specify a Fusion URI when writing data back to the underlying Hadoop Compatible File System (HCFS). For example, consider writing data out from a table called log to a file stored in a replicated directory:
INSERT OVERWRITE DIRECTORY 'fusion:///repl1/hive-out.csv' SELECT * FROM log;
Replicated directories as load paths
In this section we'll describe how to configure Hive to use fusion URIs as input paths for loading data.
It is not common to load data into a Hive table from a file using the fusion URI. When loading data into Hive from files the core-site.xml setting fs.default.name must also be set to fusion, which may not be desirable. It is much more common to load data from a local file using the LOCAL keyword:
LOAD DATA LOCAL INPATH '/tmp/log.csv' INTO TABLE log;
If you do wish to use a fusion URI as a load path, you must change the fs.defaultFS setting to use WD Fusion, as noted in a previous section. Then you may run:
LOAD DATA INPATH 'fusion:///repl1/log.csv' INTO TABLE log;
Sharing the Hive metastore
Advanced configuration - please contact WANdisco before attempting
In this section we'll describe how to share the Hive metastore between two clusters. Since WANdisco Fusion can replicate the file system that contains the Hive data storage, sharing the metadata presents a single logical view of Hive to users on both clusters.
When sharing the Hive metastore, note that Hive users on all clusters will know about all tables. If a table is not actually replicated, Hive users on other clusters will experience errors if they try to access that table.
There are two options available.
Hive metastore available read-only on other clusters
In this configuration, the Hive metastore is configured normally on one cluster. On other clusters, the metastore process points to a read-only copy of the metastore database. MySQL can be used in master-slave replication mode to provide the metastore.
Hive metastore writable on all clusters
In this configuration, the Hive metastore is writable on all clusters.
- Configure the Hive metastore to support high availability.
- Place the standby Hive metastore in the second data center.
- Configure both Hive services to use the active Hive metastore.
Performance over WAN
Performance of Hive metastore updates may suffer if the writes are routed over the WAN.
Hive metastore replication
There are three strategies for replicating Hive metastore data with WD Fusion:
Standard
For Cloudera CDH: See Hive Metastore High Availability.
For Hortonworks/Ambari: High Availability for Hive Metastore.
Manual Replication
In order to manually replicate metastore data, ensure that the DDLs are applied on both clusters, and perform a partitions rescan.
2.5.2 Impala
Prerequisites
- Knowledge of Impala architecture.
- Ability to modify Hadoop site configuration.
- WD Fusion installed and operating.
Impala Parcel
Also provided in a parcel format is the WANdisco compatible version of Cloudera's Impala tool:
Ready to distribute.
Follow the same steps described for installing the WD Fusion client, downloading the parcel and SHA file, i.e.:
- Have cluster with CDH installed with parcels and Impala.
- Copy the FUSION_IMPALA parcel and SHA file into the local parcel repository on the node where Cloudera Manager Services is installed; this need not be the same location where the Cloudera Manager Server is installed. The default location is /opt/cloudera/parcel-repo, but it is configurable. In Cloudera Manager, go to the Parcels Management Page -> Edit Settings to find the Local Parcel Repository Path. See Parcel Locations.
FUSION_IMPALA should now be available to distribute and activate on the Parcels Management Page; remember to click the Check for New Parcels button.
- Once installed, restart the cluster.
- Impala reads on Fusion files should now be available.
Parcel Locations
By default, local parcels are stored on the Cloudera Manager Server at /opt/cloudera/parcel-repo. To change this location, follow the instructions in Configuring Server Parcel Settings.
The location can be changed by setting the parcel_dir property in the Cloudera Manager Agent's /etc/cloudera-scm-agent/config.ini file and restarting the Cloudera Manager Agent, or by following the instructions in Configuring the Host Parcel Directory.
Don't link to /usr/lib/: The path to the CDH libraries is /opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib. We strongly recommend that you don't link /usr/lib/ elements to parcel-deployed paths, as some scripts distinguish between the two paths.
Setting the CLASSPATH
In order to make Impala compatible with the Fusion HDFS proxy, you need to make a small configuration change to the Impala service through Cloudera Manager: add an environment variable in the Impala Service Environment Advanced Configuration Snippet (Safety Valve) section,
AUX_CLASSPATH='colon-delimited list of all the Fusion client jars'
Classpath configuration for WD Fusion.
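As a sketch of how that colon-delimited value might be assembled, assuming the Fusion client jars were deployed via the FUSION parcel (as in the Solr section later in this guide); adjust the path for an RPM-based install:
# Build a colon-delimited list of the Fusion client jars for AUX_CLASSPATH
FUSION_LIB=/opt/cloudera/parcels/FUSION/lib
AUX_CLASSPATH=$(ls "$FUSION_LIB"/*.jar | tr '\n' ':' | sed 's/:$//')
echo "AUX_CLASSPATH=$AUX_CLASSPATH"
Paste the resulting AUX_CLASSPATH value into the safety valve field described above.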
Query a table stored in a replicated directory
Support from WD Fusion v2.3 - v2.5
Impala does not allow the use of non-HDFS file system URIs for table storage. If you are running WD Fusion 2.5 or earlier, you need to work around this: WANdisco Fusion 2.3 comes with a client program (see Impala Parcel) that supports reading data from a table stored in a replicated directory. From WD Fusion 2.6, it becomes possible to replicate directly over HDFS using the hdfs:/// URI.
2.5.3 Oozie
The Oozie service can function with Fusion, running without problems under Cloudera CDH. Under Hortonworks HDP we saw failures due to the FusionHdfs class not being found. Run the following workaround after completing the WD Fusion installation:
- Go into Oozie lib directory
cd /usr/hdp/current/oozie-server/oozie-server/webapps/oozie/WEB-INF/lib
- Create symlinks for the Fusion client jars
ln -s /usr/hdp/{hdp_version}/hadoop/lib/fusion* .
ln -s /usr/hdp/{hdp_version}/hadoop/lib/netty-all-4* .
ln -s /usr/hdp/{hdp_version}/hadoop/lib/bcprov-jdk15on-1.52.jar .
- Restart the Oozie and Fusion services.
2.5.4 Oracle: Big Data Appliance
Each node in an Oracle:BDA deployment has multiple network interfaces, with at least one used for intra-rack communications and one used for external communications. WD Fusion requires external communications, so it must be configured using the public IP addresses instead of host names.
Prerequisites
- Knowledge of Oracle:BDA architecture and configuration.
- Ability to modify Hadoop site configuration.
Required steps
Operating in a multi-homed environment
Oracle:BDA is built on top of Cloudera's Hadoop and requires some extra steps to support a multi-homed network environment.
Procedure
- Complete a standard installation, following the steps provided in the Installation Guide. Retrieve and use the public interface IP addresses for the nodes that will host the WD Fusion and IHC servers.
- Once the installation is completed you need to set up WD Fusion for a multi-homed environment. First edit WD Fusion's properties file (/opt/wandisco/fusion-server/application.properties): create a backup of the file, then add the following line at the end:
communication.hostname=0.0.0.0
Resave the file.
- Next we need to update the IHC servers so that they will also use the public IP addresses rather than hostnames. The specific number and names of the configuration files that you need to update will depend on the details of your installation. If you run both WD Fusion server and IHCs on the same server you can get a view of the files with the following command:
tree /etc/wandisco
View of the WD Fusion configuration files.
- Edit each of the revealed config files. In the above example there are two instances of
2.5.0-cdh5.3.0.ihc that will need to be edited:
#Fusion Server Properties
#Wed Jun 03 10:14:41 BST 2015
ihc.server=node01.obda.domain.com\:7000
http.server=node01.obda.domain.com\:9001
In each case you should change the addresses so that they use the public IP addresses instead of the hostnames.
Troubleshooting
If you suspect that the multi-homed environment is causing difficulty, verify that you can communicate to the IHC server(s) from other data centers. For example, from a machine in another data center, run:
nc <IHC server IP address> <IHC server port>
If you see errors from that command, you must fix the network configuration.
Running Fusion with Oracle BDA 4.2 / CDH 5.5.1
There's a known issue concerning configuration and the Cloudera Navigator Metadata Server classpath.
Error message:
2016-04-19 08:50:31,434 ERROR com.cloudera.nav.hdfs.extractor.HdfsExtractorShim [CDHExecutor-0-CDHUrlClassLoader@3bd4729d]: Internal Error while extracting
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.wandisco.fs.client.FusionHdfs not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
...
There's no clear way to override the fs.hdfs.impl setting just for the Navigator Metadata server, as is required for running with WD Fusion.
Fix Script
Use the following fix script to overcome the problem:
CLIENT_JARS=$(for i in $(ls -1 /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar | grep -v jsr305 | awk '{print $NF}' ) ; do echo -n $i: ; done)
NAVIGATOR_EXTRA_CLASSPATH=/opt/wandisco/fusion/client/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/jetty-*.jar:$CLIENT_JARS
echo "NAVIGATOR_EXTRA_CLASSPATH=$NAVIGATOR_EXTRA_CLASSPATH" > ~/navigator_env.txt
The environment variables are provided here - navigator_env.txt
You need to put this in the configuration for the Cloudera Management Service under "Navigator Metadata Server Environment Advanced Configuration Snippet (Safety Valve)". This modification currently needs to be applied whenever you upgrade or downgrade WD Fusion.
2.5.5 Apache Tez
Apache Tez is a YARN application framework that supports high performance data processing through DAGs. When set up, Tez uses its own tez.tar.gz containing the dependencies and libraries that it needs to run DAGs. For a DAG to access WD Fusion's fusion:/// URI it needs our client jars:
Configure the tez.lib.uris property with the path to the WD Fusion client jar files.
...
<property>
<name>tez.lib.uris</name>
<!-- Location of the Tez jars and their dependencies. Tez applications download required jar files from this location, so it should be publicly accessible. -->
<value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
</property>
...
Tez with Hive
In order to make Hive work with Tez, you need to append the Fusion jar files to tez.cluster.additional.classpath.prefix under the Advanced tez-site section:
tez.cluster.additional.classpath.prefix = /opt/wandisco/fusion/client/lib/*
e.g.
Tez configuration.
Running Hortonworks Data Platform, the tez.lib.uris parameter defaults to /hdp/apps/${hdp.version}/tez/tez.tar.gz .
So, to add Fusion libs, there are two choices:
Option 1: Delete the above value and replace it with a list that includes both the path where the above .tar.gz unpacks to and the path where the Fusion libraries are located.
or
Option 2: Unpack the above .tar.gz, repack it with the WD Fusion libraries and re-upload it to HDFS (a sketch follows below).
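A minimal sketch of Option 2, assuming the default HDP location shown above, an illustrative ${hdp.version} and the usual Fusion client location; adjust all paths and versions to your cluster:
# Fetch the stock tez.tar.gz, add the Fusion client jars, and re-upload it
HDP_VERSION=2.3.0.0-2557    # illustrative; use your cluster's hdp.version
hdfs dfs -get /hdp/apps/$HDP_VERSION/tez/tez.tar.gz .
mkdir tez-repack && tar -xzf tez.tar.gz -C tez-repack
cp /opt/wandisco/fusion/client/lib/*.jar tez-repack/lib/
tar -czf tez.tar.gz -C tez-repack .
hdfs dfs -put -f tez.tar.gz /hdp/apps/$HDP_VERSION/tez/tez.tar.gz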
Note that both changes are vulnerable to a platform (HDP) upgrade.
2.5.6 Apache Ranger
Apache Ranger is another centralized security console for Hadoop clusters, a preferred solution for Hortonworks HDP (whereas Cloudera prefers Apache Sentry).
While Apache Sentry stores its policy file in HDFS, Ranger uses its own local MySQL database, which introduces concerns over non-replicated security policies.
Ranger also applies its policies to the ecosystem via Java plugins in the ecosystem components (the NameNode, HiveServer, etc.). In testing, the WD Fusion client has not experienced any problems communicating with Apache Ranger-enabled platforms (Ranger+HDFS).
Ensure that the Hadoop system user, typically hdfs, has permission to impersonate other users:
...
<property>
<name>hadoop.proxyuser.hdfs.users</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
...
2.5.7 Solr
Apache Solr is a scalable search engine that can be used with HDFS. In this section we cover what you need to do for Solr to work with a WD Fusion deployment.
Minimal deployment using the default hdfs:// URI
Getting set up with the default URI is simple: Solr just needs to be able to find the Fusion client jar files that contain the FusionHdfs class.
- Copy the Fusion/Netty jars into the classpath. Please follow these steps on all deployed Solr servers. For CDH5.4 with parcels, use these two commands:
cp /opt/cloudera/parcels/FUSION/lib/fusion* /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
cp /opt/cloudera/parcels/FUSION/lib/netty-all-*.Final.jar /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
- Restart all Solr Servers.
-
Solr is now successfully configured to work with WD Fusion.
Minimal deployment using the WANdisco "fusion://" URI
This is a minimal working solution for running Solr on top of Fusion.
Requirements
Solr will use a shared replicated directory.
- Symlink the WD Fusion jars into Solr webapp
cd /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
ln -s /opt/cloudera/parcels/FUSION/lib/fusion* .
ln -s /opt/cloudera/parcels/FUSION/lib/netty-all-4* .
- Restart Solr
- Create instance configuration
$ solrctl instancedir --generate conf1
- Edit conf1/conf/solrconfig.xml and replace solr.hdfs.home in the directoryFactory definition with the actual fusion:/// URI, e.g.
fusion:///repl1/solr
- Create the Solr directory and set solr:solr ownership on it.
$ sudo -u hdfs hdfs dfs -mkdir fusion:///repl1/solr
$ sudo -u hdfs hdfs dfs -chown solr:solr fusion:///repl1/solr
- Upload the configuration to ZooKeeper
$ solrctl instancedir --create conf1 conf1
- Create collection on first cluster
$ solrctl collection --create col1 -c conf1 -s 3
For Cloudera, fusion.impl.disable.cache = true should be set for the Solr servers. (Don't set this option cluster-wide; that would stall the WD Fusion server with an unbounded number of client connections.)
2.5.8 Flume
This set of instructions will set up Flume to ingest data via the fusion:/// URI.
- Edit the configuration: set agent.sources.flumeSource.command to the path of the source data (in this example, the last few lines of the DConE log file).
- Set agent.sinks.flumeHDFS.hdfs.path to the replicated directory of one of the DCs. Make sure it begins with fusion:/// so that files are pushed to Fusion and not plain HDFS (see the configuration sketch below).
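For reference, here is a minimal flume.conf sketch. The agent, source, channel and sink names and the source command path are illustrative assumptions; only the property names and the fusion:///repl1 style target come from this guide, so adjust everything to your own deployment.
# Hypothetical flume.conf sketch
agent.sources = flumeSource
agent.channels = memChannel
agent.sinks = flumeHDFS
# Exec source: tail the chosen source data (path is an assumption)
agent.sources.flumeSource.type = exec
agent.sources.flumeSource.command = tail -F /var/log/fusion/server/dcone.log
agent.sources.flumeSource.channels = memChannel
# Simple in-memory channel
agent.channels.memChannel.type = memory
# HDFS sink writing into a replicated directory via the fusion:/// URI
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = fusion:///repl1/flume_out
agent.sinks.flumeHDFS.channel = memChannel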
Prerequisites
- Create a user in both clusters: useradd -G hadoop <username>
- Create the user's directory in the Hadoop file system: hadoop fs -mkdir /user/<username>
- Create the replication directory in both DCs: hadoop fs -mkdir /fus-repl
- Set ownership on the replication directory: hadoop fs -chown <username>:hadoop /fus-repl
- Install and configure WD Fusion
Flume with HDP
If running Flume on HDP/Ambari, replace /usr/hdp/current/hadoop-hdfs/lib/* with mapreduce.application.classpath.
Setting up Flume through Cloudera Manager
If you want to set up Flume through Cloudera Manager follow these steps:
- Download the client in the form of a parcel and the parcel.sha through the UI.
- Put the parcel and .sha into
/opt/cloudera/parcel-repo on the Cloudera Managed node.
- Go to the UI on the Cloudera Manager node. On the main page, click the small button that looks like a gift wrapped box and the FUSION parcel should appear (if it doesn't, try clicking Check for new parcels and wait a moment)
- Install, distribute, and activate the parcel.
- Repeat steps 1-4 for the second zone.
- Make sure membership and replicated directories are created for sharing between Zones.
- Go onto Cloudera Manager's UI on one of the zones and click Add Service.
- Select the Flume Service. Install the service on any of the nodes.
- Once installed, go to Flume->Configurations.
- Set 'System User' to 'hdfs'
- Set 'Agent Name' to 'agent'
- Set 'Configuration File' to the contents of the flume.conf configuration.
- Restart Flume Service
- Selected data should now be in Zone1 and replicated in Zone2
- To check that data was replicated, open a terminal on one of the DCs, become the hdfs user (e.g. su hdfs), and run:
hadoop fs -ls /repl1/flume_out
- On both Zones, there should be the same FlumeData file with a long number. This file will contain the contents of the source(s) you chose in your configuration file.
2.5.9 Spark
It's possible to deploy WD Fusion with Apache's high-speed data processing engine if you consider the following points:
HDP
For HDP, the Spark configuration property spark.driver.extraClassPath is provided for extra libraries that may be needed with Spark.
-
Go to Spark -> Configs -> Custom spark-defaults and add the following:
Key: spark.driver.extraClassPath
Value: /opt/wandisco/fusion/client/lib/*
Without this, you will hit a ClassNotFoundException from com.wandisco.fs.client.FusionHdfs.
CDH
There is a known issue where Spark does not pick up hive-site.xml; see "Hadoop configuration is not localised when submitting job in yarn-cluster mode" (fixed in version 1.4).
You need to manually add it in by either:
- Copy /etc/hive/conf/hive-site.xml into /etc/spark/conf.
or
- Do one of the following, depending on which deployment mode you are running in:
Client - set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where hive-site.xml is located).
Cluster - add --files=/etc/hive/conf/hive-site.xml (or the path for hive-site.xml) to the spark-submit script.
-
For CDH-5.3 with parcel install we found that we also need to do the following steps:
- Go to Cloudera-Manager -> Advanced Configuration Snippets
- Search for "env.sh"
- Under "Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh" add the following:
HADOOP_CLASSPATH=/opt/cloudera/parcels/FUSION-2.6.1-SNAPSHOT.2.5.0-cdh5.3.0/lib/*:$HADOOP_CLASSPATH:
- Deploy configs and restart services.
Using the FusionUri
The fusion:/// URI has a known issue where it complains about "Wrong fs". For now Spark is only verified with FusionHdfs going through the hdfs:/// URI.
2.5.10 HBase (Cold Back-up mode)
It's possible to run HBase in a cold back-up mode across multiple data centers using WD Fusion, so that in the event of the active HBase node going down, you can bring up the HBase cluster in another data centre. However, there will be unavoidable and considerable inconsistency between the lost node and the awakened replica. The following procedure should make it possible to overcome corruption problems enough to start running HBase again; however, since the damage dealt to the underlying filesystem might be arbitrary, it's impossible to account for all possible corruptions.
Requirements
For HBase to run with WD Fusion, the following directories need to be created and permissioned, as shown below:
platform | path | permission
CDH5.x | /user/hbase | hbase:hbase
HDP2.x | /hbase and /user/hbase | hbase:hbase
Procedure
The steps below provide a method of handling a recovery using a cold back-up. Note that multiple HMaster/RegionServer restarts might be needed for certain steps, since the hbck command generally requires the master to be up, which may in turn require fixing filesystem-level inconsistencies first.
- Delete all recovered.edits folder artifacts from possible log splitting for each table/region. This might not be strictly necessary, but could reduce the number of errors observed during startup.
hdfs dfs -rm /apps/hbase/data/data/default/TestTable/8fdee4924ac36e3f3fa430a68b403889/recovered.edits
- Detect and clean up (quarantine) all corrupted HFiles in all tables (including the system tables hbase:meta and hbase:namespace). The sideline option forces hbck to move corrupted HFiles to a special .corrupted folder, which can be examined/cleaned up by admins:
hbase hbck -checkCorruptHFiles -sidelineCorruptHFiles
- Attempt to rebuild corrupted table descriptors based on filesystem information:
hbase hbck -fixTableOrphans
- General recovery step: try to fix assignments, possible region overlaps and region holes in HDFS, just in case:
hbase hbck -repair
- Clean up ZooKeeper. This is particularly necessary if hbase:meta or hbase:namespace were corrupted (note that the exact name of the ZK znode is set by the cluster admin):
hbase zkcli rmr /hbase-unsecure
- Final step to correct metadata-related errors
hbase hbck -metaonly
hbase hbck -fixMeta
2.5.11 Apache Phoenix
The Phoenix Query Server provides an alternative means for interaction with Phoenix and HBase. When WD Fusion is installed, the Phoenix query server may fail to start. The following workaround will get it running with Fusion.
-
Grab the client jar files as a colon-separated string, like so, and set phoenix_class_path equal to this within the phoenix_utils.py file:
/opt/wandisco/fusion/client/lib/fusion-client-hdfs-${VERSION}.jar:/opt/wandisco/fusion/client/lib/fusion-client-common-${VERSION}.jar:/opt/wandisco/fusion/client/lib/fusion-netty-${VERSION}.jar:/opt/wandisco/fusion/client/lib/netty-all-4.0.23.Final.jar:/opt/wandisco/fusion/client/lib/guava-11.0.2.jar:/opt/wandisco/fusion/client/lib/fusion-common-${VERSION}.jar
- Change the Java construction command to look like the one below, by appending phoenix_class_path to it:
java_cmd = '%(java)s -cp ' + hbase_config_path + os.pathsep + phoenix_utils.phoenix_queryserver_jar + os.pathsep + phoenix_utils.phoenix_class_path + \
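If it helps to build the colon-separated string used in step 1, a shell one-liner along these lines can be used (a sketch only; it assumes the default client lib directory and gathers the jars listed above by wildcard):
CLIENT_LIB=/opt/wandisco/fusion/client/lib
ls ${CLIENT_LIB}/fusion-*.jar ${CLIENT_LIB}/netty-all-*.jar ${CLIENT_LIB}/guava-*.jar | paste -sd: -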
2.5.12 Deploying WD Fusion into a LocalFileSystem
Installer-based LocalFileSystem Deployment
The following procedure covers the installation and setup of WD Fusion deployed over the LocalFileSystem. This requires an administrator to enter details throughout the procedure. Once the initial settings are entered through the terminal session, the deployment to the LocalFileSystem is then completed through a browser.
- Open a terminal session on your first installation server. Copy the WD Fusion installer script into a suitable directory.
- Make the script executable, e.g.
chmod +x fusion-ui-server-<version>_rpm_installer.sh
- Execute the file with root permissions, e.g.
sudo ./fusion-ui-server-<version>_rpm_installer.sh
- The installer will now start. You will be asked if you wish to continue with the installation. Enter Y to continue.
LocalFS figure 1.
- The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the WD Fusion UI.
INFO: Using the following Memory settings for the WANDISCO Fusion Admin UI process:
INFO: -Xms128m -Xmx512m
Do you want to use these settings for the installation? (Y/n) y
The default values should be fine for evaluation, although you should review your system resource requirements for production. Enter Y to continue.
LocalFS figure 2.
- Select the localfs platform and then enter a username and password that you will use to login to the WD Fusion web UI.
Which port should the UI Server listen on [8083]:
Please specify the appropriate platform from the list below:
[0] localfs-2.7.0
Which Fusion platform do you wish to use? 0
You chose localfs-2.7.0:2.7.0
Please provide an admin username for the Fusion web ui: admin
Please provide an admin password for the Fusion web ui: ************
LocalFS figure 3.
- Provide a system user account for running WD Fusion. Following the on-screen instructions, you should set up an account called 'fusion' when running the default LocalFS setup.
We strongly advise against running Fusion as the root user.
For default LOCALFS setups, the user should be set to 'fusion'. However, you should choose a user appropriate for running HDFS commands on your system.
Which user should fusion run as? [fusion] fusion
Press Enter to accept 'fusion' or enter another suitable system account.
- Now choose a suitable group, again 'fusion' is the default.
Please choose an appropriate group for your system. By default LOCALFS uses the 'fusion' group.
Which group should Fusion run as? [fusion] fusion
- You will get a summary of all the configuration that you have entered so far. Check it before you continue.
LocalFS figure 6.
- The installation process will complete. The final configuration steps will now be done over the web UI. Follow the on-screen instructions for where to point your browser, i.e. http://your-server-IP:8083/
LocalFS figure 7.
- In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows:
LocalFS figure 8.
- Adding a new WD Fusion cluster
- Select Add Zone.
- Adding additional WD Fusion servers to an existing WD Fusion cluster
- Select Add to an existing Zone.
- Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
LocalFS figure 9.
- On clicking Validate, any element that fails the check should be addressed before you continue the installation.
LocalFS figure 10.
Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating. Click Next Step to continue.
- Click on Select file and then navigate to the license file provided by WANdisco.
LocalFS figure 11.
- Click on Upload to validate the license file.
LocalFS figure 12.
- Provided the license file is validated successfully, you will see a summary of the features that are provided under the license.
LocalFS figure 13.
Click on the I agree to the EULA to continue, then click Next Step.
- Enter settings for the WD Fusion server.
LocalFS figure 14 - Server settings
WD Fusion Server
- Fusion Server Max Memory (GB)
- Enter the maximum Java Heap value for the WD Fusion server. For production we recommend a maximum heap of at least 16GB.
- Umask (currently 022)
- Set the default permissions applied to newly created files. The value 022 results in default directory permissions of 755 and default file permissions of 644. This ensures that the installation will be able to start up/restart (a short demonstration follows this list).
- Latitude
- The north-south coordinate angle for the installation's geographical location.
- Longitude
- The east-west coordinate angle for the installation's geographical location. The latitude and longitude are used to place the WD Fusion server on a global map, to aid coordination across a widely distributed cluster.
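As a quick, throwaway demonstration of how a 022 umask produces those permissions (run anywhere convenient, e.g. /tmp):
(umask 022; mkdir -p /tmp/umask-demo; touch /tmp/umask-demo/file; ls -ld /tmp/umask-demo /tmp/umask-demo/file)
# directories are created 755 (drwxr-xr-x), files 644 (-rw-r--r--)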
Advanced options
Only apply these options if you fully understand what they do.
The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with WANdisco's support team before enabling them.
- Custom UI hostname
- Lets you set a custom hostname for the Fusion UI, distinct from the communication.hostname which is already set as part of the install and used by WD Fusion nodes to connect to the Fusion server.
- Custom UI Port
- Lets you change the WD Fusion UI's default port, in case it is already assigned elsewhere, e.g. Cloudera's headlamp debug server also uses it.
IHC Server
- Maximum Java heap size (GB)
- Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server. For production we recommend a maximum heap of at least 16GB.
- IHC Network Interface
- The address on which the IHC (Inter-Hadoop Communication) server will be located.
Once all settings have been entered, click Next step.
- Next, you will enter the settings for your new Zone.
LocalFS figure 15.
Zone Information
Entry fields for zone properties
- Fully Qualified Domain Name
- the full hostname for the server.
- Node ID
- A unique identifier that will be used by WD Fusion UI to identify the server.
- DConE Port
- TCP port used by WD Fusion for replicated traffic.
- Zone Name
- The name used to identify the zone in which the server operates.
Add an entry for the EC2 node in your host file
You need to ensure that the hostname of your EC2 machine has been added to the /etc/hosts file of your LocalFS server node. If you don't do this, then you will currently get an error when you start the node:
Could not resolve Kerberos principal name: java.net.UnknownHostException: ip-10-0-100-72: ip-10-0-100-72" exception
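For example, an /etc/hosts entry of the following form (the IP address and internal EC2 hostname here are placeholders; substitute your own values):
10.0.100.72   ip-10-0-100-72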
File System Information
Configuration for the local file system:
- Use Kerberos for file system access:
- Tick this check-box to enable Kerberos authentication on the local filesystem.
- Kerberos Token Directory
-
This defines what the root token directory should be for the Kerberos Token field. It is only set if you are using LocalFileSystem with Kerberos and want the tokens to be created within the NFS directory rather than just on the actual LocalFileSystem. If left unset, it defaults to the original behavior, which is to create tokens in the /user/<username>/ directory.
The installer will validate that the directory given or that is set by default (if you leave the field blank), can be written to by WD Fusion.
- Configuration file path
- System path to the Kerberos configuration file, e.g.
/etc/krb5.conf
- Keytab file path
- System path to your generated keytab file, e.g.
/etc/krb5.keytab
Name and place the keytab where you like
These paths and file names can be anything you like, provided they are consistent with your field entries.
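To confirm that the keytab referenced above is readable and contains the expected principal, a check such as the following can be run (assuming the standard MIT Kerberos client tools are installed):
klist -kt /etc/krb5.keytab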
- Review the summary. Click Validate to continue.
LocalFS figure 16.
- In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support WD Fusion's data replication across the Hadoop ecosystem.
LocalFS figure 17.
In this case, download the client RPM file. Leave your browser session running while you do this; the installation isn't finished yet.
- For LocalFS deployments, download the client RPM manually onto each client system; in the screenshot we use wget to copy the file into place.
LocalFS figure 18.
- Ensure that the client install file has suitable permissions to run. Then use your package manager to install the client.
yum install -y fusion-localfs-2.7.0-client-localfs-2.6.4.1.e16-1510.noarch.rpm
LocalFS figure 19.
- Once the client has successfully installed you will see a verification message.
LocalFS figure 20.
- It's now time to return to the browser session and start up the WD Fusion UI for the first time. Click Start WD Fusion.
LocalFS figure 21.
- Once started, complete the final step of the installer's configuration: Induction.
LocalFS figure 22.
For the first node you will skip this step. For all subsequent node installations you will provide the FQDN or IP address and port of this first node. (In fact, you can complete induction by referring to any node that has itself completed induction.)
"Could not resolved Kerberos principal" error
You need to ensure that the hostname of your EC2 machine has been added to the /etc/hosts file of your LocalFS server.
- Login to WD Fusion UI using the admin username and password that you provided during the installation. See step 6.
LocalFS figure 23.
- The installation of your first node is now complete. You can find more information about working with the WD Fusion UI in the Admin section of this guide.
LocalFS figure 24.
Manual installation
The following procedure covers the hands-on approach to installation and basic setup of a deployment over the LocalFileSystem. For the vast majority of cases you should use the previous Installer-based LocalFileSystem Deployment procedure.
Non-HA Local filesystem setup
- Start with the regular WD Fusion setup. You can go through either the installation manually or using the installer.
- When you select the $user:$group, you should pick a master user account that will have complete access to the local directory that you plan to replicate. You can set this manually by modifying /etc/wandisco/fusion-env.sh, setting FUSION_SERVER_GROUP to $group and FUSION_SERVER_USER to $user.
- Next, you'll need to configure core-site.xml, typically in /etc/hadoop/conf/, and override "fs.file.impl" to "com.wandisco.fs.client.FusionLocalFs", "fs.defaultFS" to "file:///", and "fusion.underlyingFs" to "file:///". (Make sure to add the usual Fusion properties as well, such as "fs.fusion.server".)
- If you are running with the fusion URI (via "fs.fusion.impl"), then you should still set the value to "com.wandisco.fs.client.FusionLocalFs".
- If you are running with Kerberos, then you should also override "fusion.handshakeToken.dir" to point to some directory that will exist within the local directory you plan to replicate to/from. You should also make sure to have "fs.fusion.keytab" and "fs.fusion.principal" defined as usual.
- Ensure that the local directory you plan to replicate to/from already exists. If not, create it and give it 777 permissions, or create a symlink (locally) that points to the local path you plan to replicate to/from.
- For example, if you want to replicate /repl1/ but don't want to create a directory at your root level, you can create a symlink named repl1 at the root level and point it to wherever your replicated directory actually is. When using NFS, the symlink should point to /mnt/nfs/.
- Set up NFS. Be sure to point your replicated directory to your NFS mount, either directly or using a symlink (a sketch follows this list).
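As a sketch of the directory/symlink arrangement described above (the NFS mount point and the replicated path are examples only; use your own):
mkdir -p /mnt/nfs/repl1
chmod 777 /mnt/nfs/repl1
ln -s /mnt/nfs/repl1 /repl1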
HA local file system setup
- Install Fusion UI, Server, IHC, and Client (for LocalFileSystem) on every node you plan to use for HA.
- When you select the $user:$group, you should pick a master user account that will have complete access to the local directory that you plan to replicate. You can set this manually by modifying /etc/wandisco/fusion-env.sh, setting FUSION_SERVER_GROUP to $group and FUSION_SERVER_USER to $user.
- Next, you'll need to configure core-site.xml, typically in /etc/hadoop/conf/, and override "fs.file.impl" to "com.wandisco.fs.client.FusionLocalFs", "fs.defaultFS" to "file:///", and "fusion.underlyingFs" to "file:///". (Make sure to add the usual Fusion properties as well, such as "fs.fusion.server".)
- If you are running with the fusion URI (via "fs.fusion.impl"), then you should still set the value to "com.wandisco.fs.client.FusionLocalFs".
- If you are running with Kerberos, then you should also override "fusion.handshakeToken.dir" to point to some directory that will exist within the local directory you plan to replicate to/from. You should also make sure to have "fs.fusion.keytab" and "fs.fusion.principal" defined as usual.
- Ensure that the local directory you plan to replicate to/from already exists. If not, create it and give it 777 permissions, or create a symlink (locally) that points to the local path you plan to replicate to/from.
- For example, if you want to replicate /repl1/ but don't want to create a directory at your root level, you can create a symlink named repl1 at the root level and point it to wherever your replicated directory actually is. When using NFS, the symlink should point to /mnt/nfs/.
- Now follow a regular HA set up, making sure that you copy over the core-site.xml and fusion-env.sh everywhere so all HA nodes have the same configuration.
- Create the replicated directory (or symlink to it) on every HA node and chmod it to 777.
Notes on user settings
When using LocalFileSystem, you can only support a single user. This means that when you configure the WD Fusion Server's process owner, that process owner should also be the process owner of the IHC server, the Fusion UI server, and the client user that will be used to perform any puts.
Fusion under LocalFileSystem only supports 1 user
Again, Fusion under LocalFileSystem only supports one user (on that side; you don't have to worry about the other data centers). To assist administrators, the LocalFS RPM comes with a Fusion shell and a Hadoop shell, so that it is possible to run suitable commands from either, e.g.
hadoop fs -ls /
fusion fs -ls /
Using the shell is required for replication.
2.6 Appendix
The appendix section contains extra help and procedures that may be required when running through a WD Fusion deployment.
Environmental Checks
During the installation, your system's environment is checked to ensure that it will support WD Fusion. The environment checks are intended to catch basic compatibility issues, especially those that may appear during an early evaluation phase. The checks are not intended to replace carefully running through the Deployment Checklist.
Operating System: |
The WD Fusion installer verifies that you are installing onto a system that is running a compatible operating system. See the Operating system section of the Deployment Checklist; the currently supported Linux distributions are listed here:
Supported Operating Systems
- RHEL 6 x86_64
- RHEL 7 x86_64
- Oracle Linux 6 x86_64
- Oracle Linux 7 x86_64
- CentOS 6 x86_64
- CentOS 7 x86_64
- Ubuntu 12.04LTS
- Ubuntu 14.04LTS
- SLES 11 x86_64
Architecture:
|
Java: |
The WD Fusion installer verifies that the necessary Java components are installed on the system. The installer checks:
- Env variables: JRE_HOME and JAVA_HOME; the installer also runs the which java command.
- Version: 1.7/1.8 recommended. Must be at least 1.7.
- Architecture: JVM must be 64-bit.
- Distribution: Must be from Oracle. See Oracle's Java Download page.
For more information about Java requirements, see the Java section of the Deployment Checklist. |
Kerberos Relogin Failure with Hadoop 2.6.0 and JDK7u80 or later
Hadoop Kerberos relogin fails silently due to HADOOP-10786. This impacts Hadoop 2.6.0 when JDK7u80 or later is used (including JDK8).
Users should downgrade to JDK7u79 or earlier, or upgrade to Hadoop 2.6.1 or later.
ulimit: |
The WD Fusion installer verifies that the system's maximum user processes and maximum open files are set to 64000. For more information about this setting, see the File descriptor/Maximum number of processes limit on the Deployment Checklist.
|
System memory and storage |
WD Fusion's requirements for system resources are split between its component parts: the WD Fusion server, the Inter-Hadoop Communication (IHC) servers and the WD Fusion UI, all of which can, in principle, be either collocated on the same machine or hosted separately.
The installer will warn you if the system on which you are currently installing WD Fusion falls below the requirements. For more details about the RAM and storage requirements, see the Memory and Storage sections of the Deployment Checklist.
|
Compatible Hadoop Flavour |
WD Fusion's installer confirms that a compatible Hadoop platform is installed. Currently, it takes the Cluster Manager detail provided on the Zone screen and polls the Hadoop Manager (CM or Ambari) for details. The installation can only continue if the Hadoop Manager is running a compatible version of Hadoop. See the Deployment Checklist for Supported Versions of Hadoop
|
HDFS service state: |
WD Fusion validates that the HDFS service is running. If it is unable to confirm the HDFS state a warning is given that will tell you to check the UI logs for possible errors. See the Logs section for more information.
|
HDFS service health |
WD Fusion validates the overall health of the HDFS service. If the installer is unable to communicate with the HDFS service then you're told to check the WD Fusion UI logs for any clues. See the Logs section for more information.
|
HDFS maintenance mode. |
WD Fusion looks to see if HDFS is currently in maintenance mode. Both Cloudera Manager and Ambari support this mode for when you need to make changes to your Hadoop configuration or hardware; it suppresses alerts for a host, service, role or, if required, the entire cluster.
|
WD Fusion node running as a client |
We validate that the WD Fusion server is configured as an HDFS client.
|
Fusion Client installation with RPMs
The WD Fusion installer doesn't currently handle the installation of the client to the rest of the nodes in the cluster. You need to go through the following procedure:
- In the Client Installation section of the installer you will see the line "Download a list of your client nodes", along with links to the client RPM packages.
RPM package location
If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
- If you are installing the RPMs, download and install the package on each of the nodes that appear on the list from step 1.
- Installing the client RPM is done in the usual way:
rpm -i <package-name>
Install checks
HDP2.1/Ambari 1.6: Start services after installation
When installing clients via RPM into HDP2.1/Ambari 1.6, ensure that you restart services in Ambari before continuing to the next step.
Fusion Client installation with DEB
Debian not supported
Although Ubuntu uses Debian's packaging system, currently Debian itself is not supported. Note: Hortonworks HDP does not support Debian.
If you are running with an Ubuntu Linux distribution, you need to go through the following procedure for installing the clients using Debian's DEB package:
- In the Client Installation section of the installer you will see the link to the list of nodes here and the link to the client DEB package.
DEB package location
If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
- To install WANdisco Fusion client, download and install the package on each of the nodes that appear on the list from step 1.
- You can install it using
sudo dpkg -i /path/to/deb/file
followed by
sudo apt-get install -f
Alternatively, move the DEB file to /var/cache/apt/archives/ and then run apt-get install <fusion-client-filename.deb> .
Fusion Client installation with Parcels
For deployments into Cloudera clusters, clients can be installed using Cloudera's own packaging format: Parcels.
Installing the parcel
- Open a terminal session to the location of your parcels repository, it may be your Cloudera Manager server, although the location may have been customized. Ensure that you have suitable permissions for handling files.
- Download the appropriate parcel and sha for your deployment.
wget "http://fusion.example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel"
wget "http://node01-example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel.sha"
- Change the ownership of the parcel and .sha files so that they match the system account that runs Cloudera Manager:
chown cloudera-scm:cloudera-scm FUSION-<version>-cdh5.<version>.parcel*
- Move the files into the server's local repository, i.e.
mv FUSION-<version>-cdh5.<version>.parcel* /opt/cloudera/parcel-repo/
- Open Cloudera Manager and navigate to the Parcels screen.
New Parcels check.
- The WD Fusion client package is now ready to distribute.
Ready to distribute.
- Click on the Distribute button to install WANdisco Fusion from the parcel.
Distribute Parcels.
- Click on the Activate button to activate WANdisco Fusion from the parcel.
Distribute Parcels.
- The configuration files need redeploying to ensure the WD Fusion elements are put in place correctly. You will need to check Cloudera Manager to see which processes will need to be restarted in order for the parcel to be deployed. Cloudera Manager provides a visual cue about which processes will need a restart.
Important
To be clear, you must restart the services; it is not sufficient to run the "Deploy client configuration" action.
Restarts.
WD Fusion uses Hadoop configuration files associated with the Yarn Gateway service and not HDFS Gateway. WD Fusion uses config files under /etc/hadoop/conf and CDH deploys the Yarn Gateway files into this directory.
Replacing earlier parcels?
If you are replacing an existing package that was installed using a parcel, once the new package is activated you should remove the old package through Cloudera Manager. Use the Remove From Host button.
Remove from the host.
Installing HttpFS with parcels
HttpFS is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write), and it is interoperable with the webhdfs REST HTTP API.
While HttpFS runs fine with WD Fusion, there is an issue where it may be installed without the correct class paths being put in place, which can result in errors when running Mammoth test scripts.
Example errors
Running An HttpFS Server Test -- accessing hdfs directory info via curl requests
Start running httpfs test
HTTP/1.1 401 Unauthorized
Server: Apache-Coyote/1.1
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 997
Date: Thu, 04 Feb 2016 16:06:52 GMT
HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=oracle&p=oracle/bdatestuser@UATBDAKRB.COM&t=kerberos&e=1454638012050&s=7qupbmrZ5D0hhtBIuop2+pVrtmk="; Path=/; Expires=Fri, 05-Feb-2016 02:06:52 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Date: Thu, 04 Feb 2016 16:06:52 GMT
Connection: close
{"RemoteException":{"message":"java.lang.ClassNotFoundException: Class com.wandisco.fs.client.FusionHdfs not found","exception":"RuntimeException","javaClassName":"java.lang.RuntimeException"}}
Workaround
Once the parcel has been installed and HDFS has been restarted, the HttpFS service must also be restarted. Without this follow-on restart you will get missing class errors. This impacts only the HttpFS service, rather than the whole HDFS subsystem.
Fusion Client installation with HDP Stack / Pivotal HD
For deployments into a Hortonworks HDP/Ambari cluster, version 1.7 or later, clients can be installed using Hortonworks' own packaging format: HDP Stack. This approach also works for Pivotal HD.
Ambari 1.6 and earlier If you are deploying with Ambari 1.6 or earlier, don't use the provided Stacks, instead use the generic RPMs.
Ambari 1.7 If you are deploying with Ambari 1.7, take note of the requirement to perform some necessary restarts on Ambari before completing an installation.
Ambari 2.0 When adding a stack to Ambari 2.0 (any stack, not just WD Fusion client) there is a bug which causes the YARN parameter yarn.nodemanager.resource.memory-mb to reset to a default value for the YARN stack. This may result in the Java heap dropping from a manually-defined value, back to a low default value (2Gb). Note that this issue is fixed from Ambari 2.1.
Upgrading Ambari
When running Ambari prior to 2.0.1, we recommend that you remove and then reinstall the WD Fusion stack if you perform an update of Ambari. Prior to version 2.0.1, an upgraded Ambari refuses to restart the WD Fusion stack because the upgrade may wipe out the added services folder on the stack.
If you perform an Ambari upgrade and the Ambari server fails to restart, the workaround is to copy the WD Fusion service directory from the old to the new directory, so that it is picked up by the new version of Ambari, e.g.:
cp -R /var/lib/ambari-server/resources/stacks_25_08_15_21_06.old/HDP/2.2/services/FUSION /var/lib/ambari-server/resources/stacks/HDP/2.2/services
Again, this issue doesn't occur once Ambari 2.0.1 is installed.
HDP 2.3/Ambari 2.1.1 install
There's currently a problem that can block the installation of the WD Fusion client stack. If the installation of the client service gets stuck at the "Customize Service" step, you may need to use a workaround:
- If possible, restart the sequence again. If that option is not available because the Next button is disabled, or if it doesn't work, try the next workaround.
- Try installing the client RPMs.
- Install the WD Fusion client service manually, using the Ambari API.
e.g.
Install & Start the service via Ambari's API
Make sure the service components are created and the configurations attached by making a GET call, e.g.
http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/<service-name>
1. Add the service
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services -d '{"ServiceInfo":{"service_name":"FUSION"}}'
2. Add the component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION/components/FUSION_CLIENT -X POST
3. Get a list of the hosts
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/hosts/
4. For each of the hosts in the list, add the FUSION_CLIENT component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/hosts/<host-name>/host_components/FUSION_CLIENT -X POST
5. Install the FUSION_CLIENT component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION/components/FUSION_CLIENT -X PUT -d '{"ServiceComponentInfo":{"state": "INSTALLED"}}'
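The GET call mentioned at the start of this procedure can be made with curl in the same style as the steps above, e.g. to confirm the FUSION service and its components are registered:
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION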
Installing the WANdisco service into your HDP Stack
- Download the service from the installer client download panel, or after the installation is complete, from the client packages section on the Settings screen.
- The service is a gz file (e.g. fusion-hdp-2.2.0-2.4_SNAPSHOT.stack.tar.gz) that will expand to a folder called /FUSION.
- For HDP, place this folder in /var/lib/ambari-server/resources/stacks/HDP/<version-of-stack>/services.
For Pivotal HD, place it in one of the following or similar folders: /var/lib/ambari-server/resources/stacks/PHD/<version-of-stack>/services, or /var/lib/ambari-server/resources/stacks/<distribution>/<version-of-stack>/services.
- Restart the ambari-server
service ambari-server restart
- After the server restarts, go to + Add Service.
Add Service.
- Choose Service, scroll to the bottom.
Scroll to the bottom of the list.
- Tick the WANdisco Fusion service checkbox. Click Next.
Tick the WANdisco Fusion service checkbox.
- Datanodes and node managers are automatically selected. You must ensure that all servers are ticked as "Client"; by default only the local node is ticked. Then click Next.
Assign Slaves and Clients. Add all the nodes as "Client"
- Deploy the changes.
Deploy.
- Install, Start and Test.
Install, start and test.
- Review Summary and click Complete.
Review.
Known bug (AMBARI-9022) Installation of Services can remove Kerberos settings
During the installation of services, via stacks, it is possible that Kerberos configuration can be lost. This has been seen to occur on Kerberized HDP2.2 clusters when installing Kafka or Oozie. Kerberos configuration in the core-site.xml file was removed during the installation which resulted in all HDFS / Yarn instances being unable to restart.
You will need to reapply your Kerberos settings in Ambari, etc.
Kerberos re-enabled
For more details, see AMBARI-9022.
MapR Client Configuration
On MapR clusters, you need to copy WD Fusion configuration onto all other nodes in the cluster:
- Open a terminal to your WD Fusion node.
- Navigate to
/opt/mapr/hadoop/<hadoop-version>/etc/hadoop .
- Copy the core-site.xml and yarn-site.xml files to the same location on all other nodes in the cluster (a sketch follows this list).
- Now restart HDFS, and any other service that indicates that a restart is required.
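A minimal sketch of copying the configuration to the other nodes (the hostnames are placeholders and <hadoop-version> is whatever is installed on your cluster):
for host in node02.example.com node03.example.com; do
  scp /opt/mapr/hadoop/<hadoop-version>/etc/hadoop/core-site.xml \
      /opt/mapr/hadoop/<hadoop-version>/etc/hadoop/yarn-site.xml \
      ${host}:/opt/mapr/hadoop/<hadoop-version>/etc/hadoop/
done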
MapR Impersonation
Enable impersonation when cluster security is disabled
Follow these steps on the client to configure impersonation without enabling cluster security.
- Enable impersonation for all relevant components in your ecosystem. See the MapR documentation - Component Requirements for Impersonation.
- Enable impersonation for the MapR core components:
The following steps will ensure that MapR will have the necessary permissions on your Hadoop cluster:
- On each client system on which you need to run impersonation:
Removing WANdisco Service
If you are removing WD Fusion, perhaps as part of a reinstallation, you should remove the client packages as well. Ambari never deletes any services from the stack; it only disables them. If you remove the WD Fusion service from your stack, remember to also delete fusion-client.repo.
[WANdisco-fusion-client]
name=WANdisco Fusion Client repo
baseurl=file:///opt/wandisco/fusion/client/packages
gpgcheck=0
For instructions on cleaning up the stack, see Host Cleanup for Ambari and Stack.
Cleanup WD Fusion HD
The following section is used when preparing to install WD Fusion on a system that already has an earlier version of WD Fusion installed. Before you install an updated version of WD Fusion, you need to ensure that the components and configuration from the earlier installation have been removed. Go through the following steps before installing a new version of WD Fusion:
- On the production cluster, run the following curl command to remove the service:
curl -su <user>:<password> -H "X-Requested-By: ambari" http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster>/services/FUSION -X DELETE
- On ALL nodes, run the corresponding package manager to remove the client package command, e.g.
yum remove fusion-hdp-x.x.x-client
- Remove all remnant Fusion directories from services/. These left-over files can cause problems if you come to reinstall, so it is worth checking places like /var/lib/ambari-agent/ and /opt/wandisco/fusion (a sketch follows this list). Ensure the removal of /etc/yum.repos.d/fusion-client.repo; if it is left in place it will prevent the next installation of WD Fusion.
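For example, a sweep along the following lines can be used to clear the usual remnants (the paths shown are the typical defaults; verify before deleting anything):
rm -f /etc/yum.repos.d/fusion-client.repo
rm -rf /opt/wandisco/fusion
# look for anything left behind under the Ambari agent's working directories
find /var/lib/ambari-agent/ -iname '*fusion*'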
Uninstall WD Fusion
There's currently no uninstall function for our installer, so the system will have to be cleaned up manually.
If you used the unified installer then use the following steps:
To uninstall all of WD Fusion:
- Remove the packages on the WD Fusion node:
yum remove -y "fusion-*"
- Remove the jars, logs, configs:
rm -rf /opt/wandisco/ /etc/wandisco/ /var/run/fusion/ /var/log/fusion/
Cloudera Manager:
- Go to "Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml"
- Delete all Fusion-related content
- Remove WD Fusion parcel
- Restart services
Ambari
- Go to HDFS -> Configs -> Advanced -> Custom core-site
- Delete all WD Fusion-related elements
- Remove the stack (see Removing WANdisco Service).
- Remove the package from all clients, e.g.
yum remove -y fusion*client*.rpm
- Restart services
Properties that you should delete from the core-site
For a complete uninstallation, remove the following properties from the core-site.xml :
- fs.fusion.server (If removing a single node from a zone, remove just that node from the property's value, instead).
- fs.hdfs.impl (its removal ensures that the native Hadoop class is used, e.g. org.apache.hadoop.hdfs.DistributedFileSystem).
- fs.fusion.impl
Reinstalling fusion server only
If you reinstall the fusion-server without also reinstalling the fusion-ui-server, then you should restart the fusion-ui-server service to ensure the correct function of some parts of the UI. If the service is not restarted then you may find that the dashboard graphs stop working properly, along with the UI's Stop/start controls. e.g. run:
[root@redhat6 init.d]# service fusion-ui-server restart
2.7 Silent Installation
The "Silent" installation tools are still under development, although, with a bit of scripting, it should now be possible to automate WD Fusion node installation. The following section looks at the provided tools, in the form of a number of scripts, which automate different parts of the installation process.
Overview
The silent installation process supports two levels:
- Unattended installation handles just the command line steps of the installation, leaving the web UI-based configuration steps in the hands of an administrator. See 2.7.1 unattended installation.
- Fully Automated also includes the steps to handle the configuration without the need for user interaction. See 2.7.2 Fully Automated Installation.
2.7.1 Unattended Installation
Use the following command for an unattended installation where an administrator will complete the configuration steps using the browser UI.
sudo FUSIONUI_USER=x FUSIONUI_GROUP=y FUSIONUI_FUSION_BACKEND_CHOICE=z ./fusion-ui-server_rpm_installer.sh
Set the environment
There are a number of properties that need to be set up before the installer can be run:
- FUSIONUI_USER
- User which will run WD Fusion services. This should match the user who runs the hdfs service.
- FUSIONUI_GROUP
- Group of the user which will run Fusion services. The specified group must be one that FUSIONUI_USER is in.
Check FUSIONUI_USER is in FUSIONUI_GROUP
Perform a check of your chosen user to verify that they are in the group that you select.
> groups hdfs
hdfs : hdfs hadoop
- FUSIONUI_FUSION_BACKEND_CHOICE
- Should be one of the supported package names, as per the following list:
- cdh-5.2.0:2.5.0-cdh5.2.0
- cdh-5.3.0:2.5.0-cdh5.3.0
- cdh-5.4.0:2.6.0-cdh5.4.0
- cdh-5.5.0:2.6.0-cdh5.5.0
- hdp-2.1.0:2.4.0.2.1.5.0-695
- hdp-2.2.0:2.6.0.2.2.0.0-2041
- hdp-2.3.0:2.7.1.2.3.0.0-2557
- mapr-4.0.1:2.4.1-mapr-1408
- mapr-4.0.2:2.5.1-mapr-1501
- mapr-4.1.0:2.5.1-mapr-1503
- mapr-5.0.0:2.7.0-mapr-1506
- phd-3.0.0:2.6.0.3.0.0.0-249
- emr-4.0.0:2.6.0-amzn-0 - Additional restrictions apply to this option, used for Elastic MapReduce on Amazon S3.
- emr-4.1.0:2.6.0-amzn-1
You don't need to enter the full package name.
You no longer need to enter the entire string, only the part up to the colon, e.g. enter "cdh-5.2.0" instead of "cdh-5.2.0:2.5.0-cdh5.2.0".
This mode only automates the initial command-line installation step; the configuration steps still need to be handled manually in the browser.
Example:
sudo FUSIONUI_USER=hdfs FUSIONUI_GROUP=hadoop FUSIONUI_FUSION_BACKEND_CHOICE=hdp-2.3.0 ./fusion-ui-server_rpm_installer.sh
2.7.2 Fully Automated Installation
This mode is closer to a full "Silent" installation as it handles the configuration steps as well as the installation.
Properties that need to be set:
- SILENT_CONFIG_PATH
- Path for the environmental variables used in the command-line driven part of the installation. The paths are added to a file called silent_installer_env.sh.
- SILENT_PROPERTIES_PATH
- Path to the 'silent_installer.properties' file. This is a file that will be parsed during the installation, providing all the remaining parameters that are required for getting set up. The template is annotated with information to guide you through making the changes that you'll need.
Take note that parameters stored in this file will automatically override any default settings in the installer.
- FUSIONUI_USER
- User which will run Fusion services. This should match the user who runs the hdfs service.
- FUSIONUI_GROUP
- Group of the user which will run Fusion services. The specified group must be one that FUSIONUI_USER is in.
- FUSIONUI_FUSION_BACKEND_CHOICE
- Should be one of the supported package names, as per the list given in 2.7.1 Unattended Installation.
- FUSIONUI_UI_HOSTNAME
- The hostname for the WD Fusion server.
- FUSIONUI_UI_PORT
- Specify a fusion-ui-server port (default is 8083)
- FUSIONUI_TARGET_HOSTNAME
- The hostname or IP of the machine hosting the WD Fusion server.
- FUSIONUI_TARGET_PORT
- The fusion-server port (default is 8082)
- FUSIONUI_MEM_LOW
- Starting Java Heap value for the WD Fusion server.
- FUSIONUI_MEM_HIGH
- Maximum Java Heap.
- FUSIONUI_UMASK
- Sets the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
- FUSIONUI_INIT
- Sets whether the server will start automatically when the system boots. Set as "1" for yes or "0" for no
Cluster Manager Variables are deprecated
The cluster manager variables are mostly redundant as they generally get set in different processes though they currently remain in the installer code.
FUSIONUI_MANAGER_TYPE
FUSIONUI_MANAGER_HOSTNAME
FUSIONUI_MANAGER_PORT
- FUSIONUI_MANAGER_TYPE
- "AMBARI", "CLOUDERA", "MAPR" or "UNMANAGED_S3". This setting can still be used but it is generally set at a different point in the installation now.
Editing tips
Follow these points when updating the silent_installer.properties file.
- Avoid excess space characters in settings.
- Induction:
When there is no existing WD Fusion server to induct to, you must set "induction.skip=true".
- When you do have a server to induct to, either leave it commented out or explicitly set "induction.skip=false" and provide both "induction.remote.node" and "induction.remote.port" settings for an existing node. The port in question is the fusion-server port (usually 8082).
- New Zone/Existing Zone and License:
If both existing.zone.domain and existing.zone.port are provided this is considered to be an Existing Zone installation. The port in question here is the fusion-ui-server port (usually 8083). In this case, some settings will be taken from the existing server including the license. Otherwise, this is the New Zone installation mode.
In this mode license.file.path must point to a valid license key file on the server.
- Validation Skipping:
There are three flags that allow for the skipping of validations for situations where this may be appropriate. Set any of the following to false to skip the validation step:
- validation.environment.checks.enabled
- validation.manager.checks.enabled (note: manager validation is currently not available for S3 installs)
- validation.kerberos.checks.enabled (note: Kerberos validation is currently not available for S3 installs)
If this part of the installation fails it is possible to re-run the silent_installer part of the installation by running:
/opt/wandisco/fusion-ui-server/scripts/silent_installer_full_install.sh /path/to/silent_installer.properties
Uninstall WD Fusion UI only
This procedure is useful for UI-only installations:
sudo yum erase -y fusion-ui-server
sudo rm -rf /opt/wandisco/fusion-ui-server /etc/wandisco/fusion/ui
To UNINSTALL Fusion UI, Fusion Server and Fusion IHC Server (leaving any fusion clients installed):
sudo yum erase -y "fusion-*-server"
sudo rm -rf /opt/wandisco/fusion-ui-server /etc/wandisco/fusion/ui
2.8 Installing into Amazon S3/EMRFS
Pre-requisites
Before you begin an installation to an S3 cluster make sure that you have the following directories created and suitably permissioned. Examples:
${hadoop.tmp.dir}/s3
and
/tmp/hadoop-${user.name}
You can deploy to Amazon S3 using either the silent installation procedure or the browser-based installer, both of which are covered below.
Known Issues using S3
Make sure that you read and understand the following known issues, taking action if they impact your deployment requirements
Replicating large files in S3
In the initial release supporting S3 there is a problem transferring very large files that will need to be worked around until the next major release (2.7). The problem only impacts users who are running clusters that include S3, either exclusively or in conjunction with other Hadoop data centers.
Workaround
Use your management layer (Ambari/Cloudera Manager, etc) to update the core-site.xml with the following property:
<property>
<name>dfs.client.read.prefetch.size</name>
<value>9223372036854775807</value>
</property>
<property>
<name>fs.fusion.push.threshold</name>
<value>0</value>
</property>
This change forces the IHC Server on the serving zone to retrieve all blocks at once, rather than in 10 block intervals.
Out of Memory issue in EMR 4.1.0
The WDDOutputStream can cause an out-of-memory error because its ByteArrayOutputStream can go beyond the memory limit.
Workaround
By default, EMR has a configuration in hadoop-env.sh that runs a "kill -9 <pid>" command on OnOutOfMemoryError. WDDOutputStream is supposed to handle this error by flushing its buffer and clearing space for more writing. (This is configurable via HADOOP_CLIENT_OPTS in hadoop-env.sh, which sets the client-side heap and just needs to be commented out.)
Use your management layer (Ambari/Cloudera Manager, etc) to update the core-site.xml with the following property:
<property>
<name>fs.fusion.push.threshold</name>
<value>0</value>
</property>
This change will disable HFLUSH, which is not required given that S3 does not support appends.
S3 Silent Installation
You can complete an Amazon S3/EMRFS installation using the Silent Installation procedure, putting the necessary configuration in the silent_installer.properties as described in the previous section.
S3 specific settings
Environment Variables Required for S3 deployments:
- FUSIONUI_MANAGER_TYPE=UNMANAGED_S3
- FUSIONUI_INTERNALLY_MANAGED_USERNAME
- FUSIONUI_INTERNALLY_MANAGED_PASSWORD
- FUSIONUI_FUSION_BACKEND_CHOICE
- FUSIONUI_USER
- FUSIONUI_GROUP
- SILENT_PROPERTIES_PATH
silent_installer.properties File additional settings or specific required values listed here:
s3.installation.mode=true
s3.bucket.name
kerberos.enabled=false (or unspecified)
Example Installation
As an example (run as root), with the installer moved to /tmp:
# If necessary download the latest installer and make the script executable
chmod +x /tmp/installer.sh
# You can reference an original path to the license directly in the silent properties but note the requirement for being in a location that is (or can be made) readable for the $FUSIONUI_USER
# The following is partly for convenience in the rest of the script
cp /path/to/valid/license.key /tmp/license.key
# Create a file to encapsulate the required environmental variables (example is for emr-4.0.0):
cat <<EOF> /tmp/s3_env.sh
export FUSIONUI_MANAGER_TYPE=UNMANAGED_S3
export FUSIONUI_INTERNALLY_MANAGED_USERNAME=admin
export FUSIONUI_FUSION_BACKEND_CHOICE=emr-4.0.0':'2.6.0-amzn-0
export FUSIONUI_USER=hdfs
export FUSIONUI_GROUP=hdfs
export SILENT_PROPERTIES_PATH=/tmp/s3_silent.properties
export FUSIONUI_INTERNALLY_MANAGED_PASSWORD=admin
EOF
# Create a silent installer properties file - this must be in a location that is (or can be made) readable for the $FUSIONUI_USER :
cat <<EOF > /tmp/s3_silent.properties
existing.zone.domain=
existing.zone.port=
license.file.path=/tmp/license.key
server.java.heap.max=4
ihc.server.java.heap.max=4
server.latitude=54
server.longitude=-1
fusion.domain=my.s3bucket.fusion.host.name
fusion.server.dcone.port=6444
fusion.server.zone.name=twilight
s3.installation.mode=true
s3.bucket.name=mybucket
induction.skip=false
induction.remote.node=my.other.fusion.host.name
induction.remote.port=8082
EOF
# If necessary, (when $FUSIONUI_GROUP is not the same as $FUSIONUI_USER and the group is not already created) create the $FUSIONUI_GROUP (the group that our various servers will be running as):
[[ "$FUSIONUI_GROUP" = "$FUSIONUI_USER" ]] || groupadd hadoop
# If necessary, create the $FUSIONUI_USER (the user that our various servers will be running as):
if [[ "$FUSIONUI_GROUP" = "$FUSIONUI_USER" ]]; then
useradd $FUSIONUI_USER
else
useradd -g $FUSIONUI_GROUP $FUSIONUI_USER
fi
# silent properties and the license key *must* be accessible to the created user as the silent installer is run by that user
chown $FUSIONUI_USER:$FUSIONUI_GROUP /tmp/s3_silent.properties /tmp/license.key
# Source s3_env.sh to populate the environment variables
. /tmp/s3_env.sh
# If you want to make any final checks of the environment variables, the following command can help - sorted to make it easier to find variables!
env | sort
# Run installer:
/tmp/installer.sh
S3 Setup through the installer
You can set up WD Fusion on an S3-based cluster deployment using the installer script. Follow this section to complete the installation by configuring WD Fusion through the browser-based graphical installer.
Open a web browser and point it at the provided URL, e.g.
http://<YOUR-SERVER-ADDRESS>.com:8083/
- In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows:
- Adding a new WD Fusion cluster
- Select Add Zone.
- Adding additional WD Fusion servers to an existing WD Fusion cluster
- Select Add to an existing Zone.
Welcome screen.
- Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
Environmental checks.
On clicking validate the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are not going to be met.
Example check results.
Address any failures before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
- Upload the license file.
Upload your license file.
- The conditions of your license agreement will be presented in the top panel, including License Type, Expiry date, Name Node Limit and Data Node Limit.
Verify license and agree to subscription agreement.
Click on the I agree to the EULA to continue, then click Next Step.
- Enter settings for the WD Fusion server. See WD Fusion Server for more information about what is entered during this step.
Screen 4 - Server settings
- In step 5 the zone information is added.
S3 Install
Zone Information
- Fully Qualified Domain Name
- the full hostname for the server.
- Node ID
- A unique identifier that will be used by WD Fusion UI to identify the server.
- DConE Port
- TCP port used by WD Fusion for replicated traffic.
- Zone Name
- The name used to identify the zone in which the server operates.
S3 Information
- Bucket Name
- The name of the S3 Bucket that will connect to WD Fusion.
- Amazon S3 encryption
- Tick to set your bucket to use AWS's built-in data protection.
- Use access key and secret key
- Additional details are required if the S3 bucket is located in a different region. See Use access key and secret key, below.
- Use KMS with Amazon S3
- Use an established AWS Key Management Service. See Use KMS with Amazon S3, below.
Use access key and secret key
Tick this checkbox if your S3 bucket is located in a different region. This option will reveal additional entry fields:
keys and Bucket
- Access Key
- This is your AWS Access Key ID.
- Secret Key
- This is the secret key that is used in conjunction with your Access Key ID to sign programmatic requests that are sent to AWS.
- AWS Bucket Region
- Select the Amazon region for your S3 bucket. This is required if you need data to move between AWS regions which is blocked by default.
More about AWS Access Key ID and Secret Access Key
If the node you are installing is set up with the correct IAM role, then you won't need to use the Access Key ID and Secret Key, as the EC2 instance will have access to S3. However, if IAM is not correctly set for the instance, or the machine isn't even in AWS, then you need to provide both the Access Key ID and Secret Key.
Entered details are placed in core.site.xml.
Alternatively the AMI instance could be turned off. You could then create a new AMI based on it, then launch a new instance with the IAM based off of that AMI so that the key does not need to be entered.
"fs.s3.awsAccessKeyId"
"fs.s3.awsSecretAccessKey"
Read Amazon's documentation about Getting your Access Key ID and Secret Access Key.
Use KMS with Amazon S3
KMS Key ID
This option must be selected if you are deploying your S3 bucket with AWS Key Management Service. Enter your KMS Key ID. This is a unique identifier of the key. This can be an ARN, an alias, or a globally unique identifier. The ID will be added to the JSON string used in the EMR cluster configuration.
KMS ID Key reference
Core-Site.xml Information
- fs.s3.buffer.dir
- The full path to a directory or multiple directories, separated by commas without spaces, that S3 will use for temporary storage. The installer will check that the directories exist and will accept writes.
- hadoop.tmp.dir
- The full path to one or more directories that Hadoop will use for "housekeeping" data storage. The installer will check that the directories you provide exist and are writable. You can enter multiple directories, separated by commas without spaces.
S3 bucket validation
The following checks are made during installation to confirm that the zone has a working S3 bucket.
S3 Bucket Valid: |
The S3 Bucket is checked to ensure that it is available and that it is in the same Amazon region as the EC2 instance on which WD Fusion will run. If the test fails, ensure that you have the right bucket details and that the bucket is reachable from the installation server (in the same region for a start).
|
S3 Bucket Writable: |
The S3 Bucket is confirmed to be writable. If this is not the case then you should check for a permissions mismatch.
|
The following checks ensure that the cluster zone has the required temporary filespace:
S3 core-site.xml validation
fs.s3.buffer.dir |
Determines where on the local filesystem the S3 filesystem should store files before sending them to S3 (or after retrieving them from S3). If the check fails you will need to make sure that the property is added manually.
|
hadoop.tmp.dir |
Hadoop's base for other temporary directory storage. If the check fails then you will need to add the property to the core-site.xml file and try to validate again.
|
These directories should already be set up on Amazon's (ephemeral) EC2 Instance Store and be correctly permissioned.
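One quick way to confirm the values the cluster is currently picking up from core-site.xml is with hdfs getconf, run on a node with the Hadoop client installed, e.g.:
hdfs getconf -confKey fs.s3.buffer.dir
hdfs getconf -confKey hadoop.tmp.dir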
- The summary screen will now confirm all the details that you have entered so far.
S3 Install details in the summary
Click Next Step if you are sure that everything has been entered correctly.
- In step 7 you need to handle the WD Fusion client installations.
S3 Install
This step first asks you to confirm whether the node that you are installing will participate in Active-Active replication. If, instead, the node will only ingest data to S3, then you don't need to install the WD Fusion client and can click on Next Step.
Client installation not required. Proceed to the next step.
For deployments where data will come back to the node through the EMR cluster then you should select This node will participate in active-active replication.
Client installation for S3
The installer covers two methods:
- Installing on a new Amazon Elastic MapReduce (EMR) cluster
- Installing on an existing Amazon EMR Cluster Not recommended
Installing on a new Amazon Elastic MapReduce (EMR) cluster
These instructions apply during the set up of WD Fusion on a new AWS EMR cluster. This is the recommended approach, even if you already have an EMR cluster set up.
- Login to your EC2 console and select EMR Managed Hadoop Framework.
S3 Install
- Click Create cluster. Enter the properties according to your cluster requirements.
S3 New EMR cluster
- Click Go to advanced options.
S3 Install
- Click on the Edit software settings (optional) dropdown. This opens up a Change settings entry field for entering your own block of configuration, in the form of a JSON string.
Enter the JSON string provided in the installer screen.
Copy the JSON string, provided by the installer. e.g.
JSON
JSON string is stored in the settings screen
You can get the JSON string after the installation has completed by going to the Settings screen.
Example JSON string
classification=core-site,properties=[fusion.underlyingFs=s3://example-s3/,fs.fusion.server=52.8.156.64:8023,fs.fusion.impl=com.wandisco.fs.client.FusionHcfs,fs.AbstractFileSystem.fusion.impl=com.wandisco.fs.client.FusionAbstractFs,dfs.client.read.prefetch.size=9223372036854775807]
The JSON String contains the necessary WD Fusion parameters that the client will need:
- fusion.underlyingFs
- The address of the underlying filesystem. In the case of Elastic MapReduce FS, the fs.defaultFS points to a local HDFS built on the instance storage which is temporary, with persistent data being stored in S3. Example: s3://wandisco
- fs.fusion.server
- The hostname and request port of the Fusion server. Comma-separated list of hostname:port for multiple Fusion servers.
- fs.fusion.impl
- The Abstract FileSystem implementation to be used.
- fs.AbstractFileSystem.fusion.impl
- The abstract filesystem implementation to be used.
- You need to take the client RPM and install script from the WD Fusion installer, make appropriate edits to the install script (more on that below) and place them onto the S3 storage (a staging sketch follows the destination paths below).
Install script location
A copy of the install_emr_client.sh script is stored within the following location within a WD Fusion installation:
/opt/wandisco/fusion-ui-server/ui/user_scripts/install_emr_client.sh
s3://<bucketName>/install_emr_client.sh
s3://<bucketName>/fusion-emr-x.x.x-client-hdfs-2.x.x.noarch.rpm
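Assuming the AWS CLI is configured on the node, staging the two files could look like this (the bucket name and RPM filename are the placeholders used above):
aws s3 cp /opt/wandisco/fusion-ui-server/ui/user_scripts/install_emr_client.sh s3://<bucketName>/
aws s3 cp /path/to/fusion-emr-x.x.x-client-hdfs-2.x.x.noarch.rpm s3://<bucketName>/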
Install script edits
Here's an example of the install script; you'll need to plug in your own system paths:
#!/bin/bash
set -e
function usage() {
cat <<-EOF
usage: $0 S3_URI_OF_FUSION_CLIENT_RPM [ADDITIONAL_AWS_OPTIONS]
EOF
}
if [ "$#" -lt 1 ]; then
usage
exit -1
fi
TMP_DIR="/tmp"
RPM_LOCATION="${1}"
if ! [[ "${RPM_LOCATION}" =~ s3://[^/]+/.+ ]]; then
echo "${RPM_LOCATION} is not a valid S3 file location" >&2
usage
exit -1
fi
shift
ADDITIONAL_OPT="$@"
RPM_NAME=$(basename ${RPM_LOCATION})
CLIENT_INSTALL_DIR="/opt/wandisco/fusion/client/lib"
HADOOP_LIB_DIR="/usr/lib/hadoop/lib"
sudo aws ${ADDITIONAL_OPT} s3 cp "${RPM_LOCATION}" "${TMP_DIR}"
sudo yum install -y "${TMP_DIR}/${RPM_NAME}"
if [ -z $(which hadoop 2>/dev/null) ]; then
sudo mkdir -p "${HADOOP_LIB_DIR}"
sudo ln -s ${CLIENT_INSTALL_DIR}/* ${HADOOP_LIB_DIR}
fi
Key
- ${RPM_LOCATION}
- File path to the WD Fusion package, e.g.
s3://wandisco-s3/fusionInstall/
- ${ADDITIONAL_OPT}
- Any additional options that you need to pass to the aws command. For example, a region, if the bucket is in a different region, e.g.
--region us-west-1 .
- ${TMP_DIR}
- Path to the
fs.s3.buffer.dir ephemeral storage used to stage the downloaded RPM; the script defaults to /tmp.
- ${RPM_NAME}
- The WD Fusion package name, e.g.
fusion-emr-4.1.0-client-hdfs-2.6.3.el6-1466.noarch.rpm
- ${HADOOP_LIB_DIR}
- Path to the Hadoop Lib directory, e.g.
/usr/lib/hadoop/lib/
- ${CLIENT_INSTALL_DIR}
- Path to the WD Fusion client install directory, e.g.
/opt/wandisco/fusion/client/lib/
Note the option reference to "region". This option must be used if the S3 bucket is located in a different region from the EC2 instance on which the command is run. You can also append the region option to the end of the command, e.g.
./install_emr_client.sh s3://wandisco-s3/fusionInstall/fusion-emr-4.1.0-client-hdfs-2.6.3.el6-1466.noarch.rpm --region us-west-1
- In the next step, create a Bootstrap Action that will add the WD Fusion client to cluster creation. Click on the Select a bootstrap action dropdown.
S3 Install
- Choose Custom Action, then click Configure and add.
S3 Install
- Complete the Add Bootstrap Action form.
Add Bootstrap Action
In the JAR location field, enter the S3 path to the EMR install script, i.e. install_emr_client.sh.
In the Optional arguments field, add the S3 path to the WD Fusion client RPM file. Remember to add the region, if necessary.
Click Add to store the action.
- Confirm that the custom Bootstrap Action has been added.
Confirm action
- Finally, click the Create cluster button to complete the AWS set up.
Create cluster
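If you prefer to script the cluster creation rather than use the console, the same software settings and bootstrap action can be supplied through the AWS CLI. The sketch below is illustrative only: the cluster name, release label, instance settings and key pair are assumptions, and fusion-core-site.json is a hypothetical file holding the installer-provided core-site settings in EMR's JSON configuration format.
# Illustrative values only - adjust names, release label, instance settings
# and file locations to match your environment.
aws emr create-cluster \
    --name "fusion-emr-cluster" \
    --release-label emr-4.7.2 \
    --applications Name=Hadoop \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair \
    --configurations file://fusion-core-site.json \
    --bootstrap-actions 'Path=s3://<bucketName>/install_emr_client.sh,Args=[s3://<bucketName>/fusion-emr-x.x.x-client-hdfs-2.x.x.noarch.rpm,--region,us-west-1]'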
- Return to the WD Fusion setup, clicking on Start WD Fusion.
Deploy server
Installing on an existing Amazon Elastic MapReduce (EMR) cluster
We strongly recommend that you terminate your existing cluster and use the previous step for installing into a new cluster.
No autoscaling
This is because installing WD Fusion into an existing cluster will not benefit from AWS's auto-scaling feature. The configuration changes that you make to the core-site.xml file will not be included on automatically generated cluster nodes; as the cluster grows, you would have to follow up by manually distributing the client configuration changes.
Two manual steps
Install the Fusion client (the EMR version) on each node and, after scaling, modify the core-site.xml file with the following:
<property>
<name>fusion.underlyingFs</name>
<value>s3://YOUR-S3-URL/</value>
</property>
<property>
<name>fs.fusion.server</name>
<value>IP-HOSTNAME:8023</value>
</property>
<property>
<name>fs.fusion.impl</name>
<value>com.wandisco.fs.client.FusionHcfs</value>
</property>
<property>
<name>fs.AbstractFileSystem.fusion.impl</name>
<value>com.wandisco.fs.client.FusionAbstractFs</value>
</property>
- fusion.underlyingFs
- The address of the underlying filesystem. In the case of Elastic MapReduce FS, the fs.defaultFS points to a local HDFS built on the instance storage which is temporary, with persistent data being stored in S3. Example: s3://wandisco
- fs.fusion.server
- The hostname and request port of the Fusion server. Comma-separated list of hostname:port for multiple Fusion servers.
- fs.fusion.impl
- The FileSystem implementation to be used.
- fs.AbstractFileSystem.fusion.impl
- The AbstractFileSystem implementation to be used.
Known Issue running with S3
In WD Fusion 2.6.2 and 2.6.3, the first releases supporting S3, there was a problem transferring very large files that needed to be worked around. If you are using one of these releases in conjunction with Amazon's S3 storage then you need to make the following changes:
WD Fusion 2.6.2/2.6.3/AWS S3 Workaround
Use your management layer (Ambari/Cloudera Manager, etc) to update the core-site.xml with the following property:
<property>
<name>dfs.client.read.prefetch.size</name>
<value>9223372036854775807</value>
</property>
<property>
<name>fs.fusion.push.threshold</name>
<value>0</value>
</property>
The second parameter, "fs.fusion.push.threshold", becomes optional from version 2.6.3 onwards. Although optional, we still recommend that you use the "0" setting. This property sets the threshold at which a client sends a push request to the WD Fusion server. As the push feature is not supported for S3 storage, disabling it (setting it to "0") avoids an unnecessary performance cost.
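Once the properties have been pushed out by the management layer, you can confirm that a client node has picked them up, for example with Hadoop's getconf tool (a quick, hedged check run on any client node):
# Run on a client node to confirm the new values are in effect.
hdfs getconf -confKey dfs.client.read.prefetch.size
hdfs getconf -confKey fs.fusion.push.threshold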
S3 AMI Launch
This section covers the launch of WANdisco Fusion for S3, using Amazon's CloudFormation template. This automatically provisions the Amazon cluster and attaches Fusion to an on-premises cluster.
IMPORTANT: Amazon cost considerations.
Please take note of the following costs, when running Fusion from Amazon's cloud platform:
- AWS EC2 instances are charged per hour or annually.
- WD Fusion nodes provide continuous replication to S3 that will translate into 24/7 usage of EC2 and will accumulate charges that are in line with Amazon's EC2 charges (noted above).
- When you stop the Fusion EC2 instances, Fusion data on the EBS storage will remain on the root device and its continued storage will be charged for. However, temporary data in the instance stores will be flushed as they don't need to persist.
- If the WD Fusion servers are turned off then replication to the S3 bucket will stop.
Prerequisites
There are a few things that you need to already have before you start this procedure:
- Amazon AWS account. If you don't have an AWS account, sign up through Amazon's Web Services.
- Amazon Key Pair for security. If you don't have a Key Pair defined. See Create a Key Pair.
- Ensure that you have clicked the Accept Terms button on the CFT's Amazon store page. E.g.
You must accept the terms for your specific version of Fusion
If you try to start a CFT without first clicking the Accept Terms button you will get an error and the CFT will fail. If this happens, go to the Amazon Marketplace, search for the Fusion download screen that corresponds to the version that you are deploying, and run through the screens until you have clicked the Accept Terms button. You can then successfully run the CFT.
Launch procedure
- Login to your AWS account and navigate to the AWS Marketplace. Locate WANdisco's
Fusion products; for the purpose of this guide we'll deploy the WANdisco Fusion S3 Active Migrator - BYOL. Search for WANdisco using the search field.
LocalFS figure 25.
Ensure that you download the appropriate version
The BYOL (Bring Your Own License) version requires that you purchase a license separately from WANdisco. You can get set up immediately with a trial license, but you may prefer to run with one of the versions that come with a built-in, usage-based license; 200TB or 50TB. Each version has its own download page and distinct CloudFormation template, so make sure that you get the one that you need.
- On the Select Template screen, select the option Specify an Amazon S3 template URL, entering the URL for WANdisco's template. For example:
https://s3.amazonaws.com/wandisco-public-files/Fusion-Cluster.template
Amazon CFT Screen 1.
Ensure that you select the right Region (top-right on the Amazon screen). This must be set to the same region that hosts your S3 Bucket. Click Next to move to the next step.
- You now specify the parameters for the cluster.
Amazon CFT Screen 2 - AWS Parameters
Enter the following details:
AWS configuration
- Stack name
- This is a label for your cluster that Amazon will use for reference. Give the cluster a meaningful name, e.g. FusionStack.
- Remote Access CIDR*
- A CIDR address range that is granted remote access to the cluster. If you don't know the range you need, you can use 0.0.0.0/0, which allows access from any address. It's recommended that you edit this later to lock down access.
- VPC Subnet ID *
- Enter the addressing for your virtual private cloud. In this example we want to connect to an existing VPC, going into its settings and capturing its subnet ID. If you already have an on-premises cluster that you are connecting to then you probably already have a subnet to reference.
- S3Bucket *
- Enter the name of your Amazon S3 bucket; permissions are set up so that WD Fusion can only talk to the specified bucket.
- Persistent Storage *
- Use this field to add additional storage for your cluster. In general use you shouldn't need to add any more storage; you can rely on the memory in the node plus the ephemeral storage.
- KeyName *
- Enter the name of the existing EC2 KeyPair within your AWS account; all instances will launch with this KeyPair.
- Cluster Name *
- The WD Fusion CF identifier; in this example, awsfs.
The * at the end of a field name indicates a required field that must be completed.
The next block of configuration is specific to the WD Fusion product:
WD Fusion configuration
Amazon CFT Screen 3 - WD Fusion Parameters
- Cluster Instance Count*
- Enter the number of WD Fusion instances (1-3) that you'll launch, e.g. "2". This value is driven by the needs of the cluster, whether for horizontal scaling, continuous availability of the WD Fusion service, etc. (dropdown)
- Zone Name *
- The logical name that you provide for your zone. e.g. awsfs
- User Name *
- Default username for the WD Fusion UI is "admin".
- Password *
- Default password for the WD Fusion UI is "admin".
- Inductor Node IP
- This is the hostname or IP address for an existing WD Fusion node that will be used to connect the new node into a membership.
How to get the IP address of an existing WD Fusion Node:
- Log into the WD Fusion UI.
- On the Fusion Nodes tab, click on the link to the existing WD Fusion Node.
- Get the IP address from the Node information screen.
- Fusion Version *
- Select the version of WD Fusion that you are running. (Dropdown) e.g. 2.6.4.
- EMR Version *
- Select the version of Elastic MapReduce that you are running. (Dropdown)
- ARN Topic code to publish messages to *
- The ARN of the SNS topic to publish messages to. If you have set up an SNS service you can add an ARN code here to receive a notification when the CFT completes successfully. This could be an email, SMS message or any other message type supported by the AWS SNS service.
- Fusion License
- This is a path to your WD Fusion license file. If you don't specify the path to a license key you will automatically get a trial license.
The * at the end of a field name indicates a required field that must be completed.
S3 Security configuration for WD Fusion
- KMSKey
- ARN for KMS Encryption Key ID. You can leave the field blank to disable KMS encryption.
- S3ServerEncryption
- Enable server-side encryption on S3 with a "Yes", otherwise leave as "No".
Click Next.
- On the next screen you can add options, such as Tags for resources in your stack, or Advanced elements.
We recommend that you disable the setting Rollback on failure. This ensures that if there's a problem when you deploy, the log files that you would need to diagnose the cause of the failure don't get wiped as part of the rollback.
LocalFS figure 35.
Click Next.
- You will now see a review of the template settings, giving you a chance to make sure that all settings are correct for your launch.
At the end, take note of any Capabilities notices and finally tick the checkbox for I acknowledge that this template might cause AWS CloudFormation to create IAM (Identity and Access Management) resources. Click Create, or click Previous to navigate back and make any changes.
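Alternatively, the same stack can be launched from the AWS CLI. The sketch below is indicative only: the stack name matches the earlier example, but the parameter keys are illustrative and should be checked against the actual keys defined in the template before use. The --disable-rollback option matches the recommendation above to preserve log files if creation fails.
# Parameter keys shown here are illustrative - confirm them against the template.
aws cloudformation create-stack \
    --stack-name FusionStack \
    --template-url https://s3.amazonaws.com/wandisco-public-files/Fusion-Cluster.template \
    --capabilities CAPABILITY_IAM \
    --disable-rollback \
    --parameters ParameterKey=KeyName,ParameterValue=my-key-pair \
                 ParameterKey=S3Bucket,ParameterValue=my-fusion-bucket \
                 ParameterKey=ClusterName,ParameterValue=awsfs \
                 ParameterKey=ZoneName,ParameterValue=awsfs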
- The creation process will start. You'll see the Stack creation process running.
- You will soon see the stack creation in progress and can follow the individual creation events.
This template will create your chosen number of WD Fusion servers, pre-installing them to the point where they're ready to be inducted into the on-premises cluster.
- Your WD Fusion servers will now be set up, connecting your on-premises Hadoop to your AWS cloud storage.
Default Username and Password
The WANdisco AMI creates a node for which the login credentials are:
- Username:
- admin
- Password:
- password
IMPORTANT: Reset the password using the following procedure.
Reset internally managed password
WD Fusion normally uses the authentication built into your cluster's management layer, i.e. the Cloudera Manager username and password are required to log in to WD Fusion. However, in cloud-based deployments, such as Amazon's S3, there is no management layer. In this situation, WD Fusion adds a local user to WD Fusion's ui.properties file. If you need to reset this internal password for any reason, follow these instructions:
Password Reset Procedure: in-situ
- Stop the UI server.
- Invoke the reset runner:
JAVA_HOME/bin/java" -cp APPLICATION_ROOT/fusion-ui-server.jar com.wandisco.nonstopui.authn.ResetPasswordRunner -p NEW_PASSWORD -f PATH_TO_UI_PROPERTIES_FILE
- Start the UI server. e.g.
service fusion-ui-server start
If you don't provide these arguments, the reset password runner will prompt you for them.
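For example, on a node where the UI server is installed under /opt/wandisco/fusion-ui-server, the full sequence might look like the following (the jar and ui.properties paths are assumptions; adjust them to match your installation):
service fusion-ui-server stop
# Paths below are assumptions - adjust the application root and the
# ui.properties location to match your installation.
"$JAVA_HOME/bin/java" -cp /opt/wandisco/fusion-ui-server/fusion-ui-server.jar \
    com.wandisco.nonstopui.authn.ResetPasswordRunner \
    -p 'MyNewPassword' -f /opt/wandisco/fusion-ui-server/properties/ui.properties
service fusion-ui-server start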
Password Reset Procedure: AMI
Note that if you reset your password you will also need to update it in your Amazon IAM settings.
Removing the stack
You can remove the WD Fusion deployment simply by removing the stack. See Delete Stack.
IMPORTANT: After deleting the stack you will need to manually remove the associated EC2 instance. Previously this wasn't required, as the instance was attached to an autoscaling group. Instances are no longer placed in an autoscaling group because the replication system doesn't yet support dynamic scaling.
2.9 Installing WD Fusion into Microsoft Azure
This section will run you through an installation of WANdisco's Fusion to enable you to replicate on-premises data over to a Microsoft Azure object (blob) store.
This procedure assumes that you have already created all the key components for a deployment using a Custom Azure template. You will first need to create a virtual network, create a storage account, then start completing a template for a HDInsight cluster:
- Log in to the Azure portal and create a Custom deployment. On the Edit template panel, click Edit parameters.
MS Azure - WD Fusion Parameters 01
Template properties
- Subscription
- This is your MS Azure account plan.
- Resource Group (string)
- Your existing Azure resource group that you are deploying to.
- Location (string)
- The geographical location of your cloud.
- Legal terms
- Review and agree with Microsoft's terms for using Azure.
Parameters
- EXISTINGVNETRESOURCEGROUPNAME (string)
- Enter the name of an existing virtual network resource group.
- EXISTINGSUBNETCLIENTSNAME (string)
- Enter the name of an existing client subnet.
- SSHUSERNAME (string)
- Your SSH username, used to remotely access the cluster and the edge node VM.
- EDGENODEVIRTUALMACHINESIZE (string)
- Select the size of your edge node virtual machine.
- Continue to enter the required field values.
Azure - WD Fusion Parameters 02
- AZURESTORAGECONTAINERNAME
- The name of your storage container.
- AZURESTORAGEBLOBKEY
- The access key for your BLOB storage.
- FUSIONADMIN
- The admin username that you'll use to access WD Fusion's UI.
- FUSIONPASSWORD
- The password for accessing WD Fusion's UI.
- FUSIONVERSION
- The version of WD Fusion that you will be running. E.g. 2.6.
- FUSIONLICENSE
- The WANdisco license key file path.
- ZONENAME
- The name that you give to the Fusion zone.
- SERVERLATITUDE
- The latitude for the WD Fusion server's location.
- SERVERLONGITUDE
- The longitude for the WD Fusion server's location.
- IHCHEAPSIZE
- The IHC Server's allocated Maximum Heap (in GB).
- INDUCTORNODEIP
- The IP Address of the Inductor Node. You'll need to get this from an existing WD Fusion server.
- Confirm that you agree to Microsoft's terms and conditions, then click Create.
MS Azure MS Azure - Ts&Cs
WD Fusion Installation
In this next stage, we'll install WD Fusion.
- Download the installer script to the WD Fusion server. Open a terminal session, navigate to the installer script, make it executable and then run it, i.e.
chmod +x fusion-ui-server_hdi_deb_installer.sh
sudo ./fusion-ui-server_hdi_deb_installer.sh
MS Azure - WD Fusion Installation 01
- Enter "Y" and press return.
MS Azure - WD Fusion Installation 01
- Enter the system user that will run WD Fusion, e.g. "hdfs".
MS Azure - WD Fusion Installation 01
- Enter the group under which WD Fusion will be run. By default HDI uses the "hadoop" group.
MS Azure - WD Fusion Installation 01
- The installer will now check that WD Fusion is running over the default web UI port, 8083.
MS Azure - WD Fusion Installation 01
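If you want to confirm this yourself from the server's shell, a simple check (assuming the default port 8083) is to request the UI page and print the HTTP response code:
# An HTTP response code (e.g. 200 or a redirect) indicates the UI is up on 8083.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8083/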
- Point your browser at the WD Fusion UI.
MS Azure - WD Fusion Installation 01
- You will be taken to the Welcome screen of the WD Fusion installer. For a first installation you select Add Zone.
MS Azure - WD Fusion Installation 01
- Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
MS Azure - WD Fusion Installation 01
On clicking validate the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.
- Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
Click Next Step to continue the installation.
MS Azure - WD Fusion Installation 01
- Click on Select a file, navigate to your WANdisco Fusion license file.
MS Azure - WD Fusion Installation 01
- Click Upload.
MS Azure - WD Fusion Installation 01
- The license properties are presented, along with the WD Fusion End User License Agreement. Click the checkbox to agree, then click Next Step.
MS Azure - WD Fusion Installation 01
- Enter settings for the WD Fusion server.
MS Azure - WD Fusion Installation 01
WD Fusion Server
- Fusion Server Max Memory (GB)
- Enter the maximum Java heap value for the WD Fusion server. For production deployments we recommend at least 16GB.
Recommendation For the purposes of our example installation, we've entered 2GB. We recommend that you allocate 70-80% of the server's available RAM. Read more about Server hardware requirements.
- Umask (currently 022)
- Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
- Latitude
- The north-south coordinate angle for the installation's geographical location.
- Longitude
- The east-west coordinate angle for the installation's geographical location. The latitude and longitude is used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.
IHC Server
IHC Settings
- Maximum Java heap size (GB)
- Enter the maximum Java heap value for the WD Inter-Hadoop Communication server. For production deployments we recommend at least 16GB.
- IHC Network Interface
- The network interface on which the IHC (Inter-Hadoop Communication) server will listen.
Once all settings have been entered, click Next step.
- In this step you enter Fusion's Zone information and some important Microsoft Azure properties:
MS Azure - WD Fusion Installation 01
Zone Information
- Fully Qualified Domain Name
- Full hostname for the server.
- Node ID
- A unique identifier that will be used by WD Fusion UI to identify the server.
- DConE Port
- TCP port used by WD Fusion for replicated traffic.
- Zone Name
- Name used to identify the zone in which the server operates.
MS Azure Information
- Primary Access Key
- When you create a storage account, Azure generates two 512-bit storage access keys, which are used for authentication when the storage account is accessed. By providing two storage access keys, Azure enables you to regenerate the keys with no interruption to your storage service or access to that service. The Primary Access Key is now referred to as Key1 in Microsoft's documentation. You can get the key from the Microsoft Azure storage account.
- WASB storage URI
- This is the native URI used for accessing Azure Blob storage. E.g.
wasb://
- Validate (button)
-
The installer will make the following validation checks:
- WASB storage URI:
- The URI will need to take the form:
wasb[s]://<containername>@<accountname>.blob.core.windows.net
- URI readable
- Confirms that it is possible for WD Fusion to read from the Blob store.
- URI writable
- Confirms that it is possible for WD Fusion to write data to the Blob store.
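If you want to verify access to the Blob store independently of the installer, a hedged manual check can be run from a cluster node, assuming the storage key is already configured via the standard Hadoop property fs.azure.account.key.<accountname>.blob.core.windows.net:
# List the container, then create an empty test object to confirm write access.
hadoop fs -ls wasbs://<containername>@<accountname>.blob.core.windows.net/
hadoop fs -touchz wasbs://<containername>@<accountname>.blob.core.windows.net/fusion-write-test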
- You now get a summary of your installation. Run through and check that everything is correct. Then click Next Step.
MS Azure - WD Fusion Installation 01
- In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support WD Fusion's data replication across the Hadoop ecosystem. Download the client DEB file. Leave your browser session running while you do this; we've not finished with it yet.
MS Azure - WD Fusion Installation 01
- Return to your console session. Download the client package "fusion-hdi-x.x.x-client-hdfs_x.x.x.deb".
MS Azure - WD Fusion Installation 01
- Install the package on each client machine:
MS Azure - WD Fusion Installation 01
e.g.
dpkg -i fusion-hdi-x.x.x-client-hdfs.deb
MS Azure - WD Fusion Installation 01
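If you have several client machines, you could distribute and install the package in one pass with a small loop such as the sketch below. The hostnames are hypothetical, and passwordless SSH with sudo rights on each client is assumed.
# Hostnames are hypothetical; assumes passwordless SSH and sudo on each client.
for host in client-node-01 client-node-02; do
    scp fusion-hdi-x.x.x-client-hdfs.deb "${host}:/tmp/"
    ssh "${host}" "sudo dpkg -i /tmp/fusion-hdi-x.x.x-client-hdfs.deb"
done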
- Return to the WD Fusion UI. Click Next Step, then click Start WD Fusion.
MS Azure - WD Fusion Installation 01
- Once started we now complete the final step of installer's configuration, Induction.
For this first node you will skip this step, choosing Skip Induction. For all following node installations you will provide the FQDN or IP address and port of this first node. (In fact, you can complete induction by referring to any node that has itself completed induction.)
What is induction? Multiple instances of WD Fusion join together to form a replication network or ecosystem. Induction is the process used to connect each new node to an existing set of nodes.
MS Azure - WD Fusion Installation 01
- Once you have completed the installation of a second node in your on-premises zone, you will be able to complete induction so that both zones are aware of each other.
MS Azure - WD Fusion Installation 01
- Once induction has been completed you will see dashboard status for each inducted zone. Click on Membership.
MS Azure - WD Fusion Installation 01
- Click on the Create New tab. The "New Membership" window will open that will display the WD Fusion nodes organized by zone.
MS Azure - WD Fusion Installation 01
In this example we make the on-premises CDH server the Distinguished Node, as we'll be copying data to the cloud, in an Active-Passive configuration. Click Create.
For advice on setting up memberships, see Creating resilient Memberships.
- Next, click on the Replicated Folders tab. Click + Create.
MS Azure - WD Fusion Installation 01
2.10 Installing WD Fusion into Google Cloud Platform
This section will run you through an installation of WANdisco's Fusion to enable you to replicate on-premises data over to Google's Cloud platform.
- Log into the Google Cloud Platform. Under VM instances, click Create instance.
Google Compute - WD Fusion Installation 01
- Set up suitable specification for the VM.
Google Compute - WD Fusion Installation 01
- Machine type
- 2vCPUs recommended for evaluation.
- Boot disk
- Click on the Change button and select CentOS 6.7.
- Firewall
- Enable publicly available HTTP and HTTPS.
- Management, disk, networking, access & security options
-
There are two areas to set up here:
Specify the startup script (you can provide the script inline or via a URL; in this guide we add a startup-script-url key under Metadata in a later step).
Set up networking:
- Click on Network
- Select fusion-gcw (our VPC)
- Under Project Access, tick the checkbox "Allow API access to all Google Cloud services in the same project".
Google Compute - WD Fusion Installation 01
- Click on the Management tab.
Google Compute - WD Fusion Installation 01
- Under Metadata add the following key:
Google Compute - WD Fusion Installation 01
- startup-script-url
- https://storage.googleapis.com/wandisco-public-bucket/installScript.sh (see sample code)
Click Add item
- Click on the Networking tab.
Google Compute - WD Fusion Installation 01
- Network
- Your Google network VPC, e.g. fusion-gce.
- Click Create.
Google Compute - WD Fusion Installation 01
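As an alternative to the console workflow above, an equivalent VM can be created with the gcloud CLI. This is a hedged sketch: the instance name, zone, machine type and image are illustrative, while the network and startup-script-url values are those used in this guide; firewall rules for HTTP/HTTPS access are not covered here.
# Instance name, zone, machine type and image are illustrative values.
gcloud compute instances create fusion-node-1 \
    --zone us-central1-a \
    --machine-type n1-standard-2 \
    --image-family centos-6 --image-project centos-cloud \
    --network fusion-gce \
    --scopes cloud-platform \
    --metadata startup-script-url=https://storage.googleapis.com/wandisco-public-bucket/installScript.sh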
- There will be a brief delay while the instance is set up. You will see the VM instances panel that shows the VM system activity.
Google Compute - WD Fusion Installation 01
- When the instance is complete, click on it.
Google Compute - WD Fusion Installation 01
- You will see the management screen for the instance.
Google Compute - WD Fusion Installation 01
- Make a note of the internal IP address; it should look like
172.25.0.x .
Google Compute - WD Fusion Installation 01
Configuration
WD Fusion is now installed. Next, we'll complete the basic configuration steps using the web UI.
- In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows: Add Zone
Google Compute - WD Fusion Installation 01
- Run through the installer's detailed environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the User Guide's Appendix.
Google Compute - WD Fusion Installation 01
On clicking Validate, any element that fails the check should be addressed before you continue the installation.
- Click on Select file and then navigate to the license file provided by WANdisco.
Google Compute - WD Fusion Installation 01
- Enter settings for the WD Fusion server.
Google Compute - WD Fusion Installation 01
WD Fusion Server
- Fusion Server Max Memory (GB)
- Enter the maximum Java heap value for the WD Fusion server. For production deployments we recommend at least 16GB.
- Umask (currently 022)
- Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
- Latitude
- The north-south coordinate angle for the installation's geographical location.
- Longitude
- The east-west coordinate angle for the installation's geographical location. The latitude and longitude is used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.
IHC Server
- Maximum Java heap size (GB)
- Enter the maximum Java heap value for the WD Inter-Hadoop Communication server. For production deployments we recommend at least 16GB.
- IHC Network Interface
- The network interface on which the IHC (Inter-Hadoop Communication) server will listen.
Once all settings have been entered, click Next step.
- Next, you will enter the settings for your new Zone. You are going to need the name of the Google bucket; you can check this on your Google Cloud Storage screen.
Google Compute - Get the name of your bucket
Zone Information
Entry fields for zone properties
- Fully Qualified Domain Name
- Full hostname for the server.
- Node ID
- A unique identifier that will be used by WD Fusion UI to identify the server.
- DConE Port
- TCP port used by WD Fusion for replicated traffic.
- Zone Name
- Name used to identify the zone in which the server operates.
Google Compute Information
Entry fields for Google's platform information.
- Google Bucket Name
- The name of the Google storage bucket that will be replicated.
- Google Project ID
- The Google Project associated with the deployment.
The following validation is completed against the settings:
- the provided bucket matches with an actual bucket on the platform.
- the provided bucket can be written to by WD Fusion.
- the bucket can be read by WD Fusion.
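You can perform similar checks manually from the Fusion server with the gsutil tool, assuming the instance was created with API access to Cloud Storage (the bucket name is a placeholder):
# Confirm the bucket exists and is readable/writable from this instance.
gsutil ls gs://<google-bucket-name>/
echo "fusion write test" | gsutil cp - gs://<google-bucket-name>/fusion-write-test
gsutil rm gs://<google-bucket-name>/fusion-write-test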
- Review the summary. Click Next Step to continue.
Google Compute - WD Fusion Installation 01
- The next step of the installer can be ignored as it handles the installation of the Fusion client, which is not required for a Google Cloud deployment. Click Next step.
Google Compute - WD Fusion Installation 01
- It's now time to start up the WD Fusion UI for the first time. Click Start WD Fusion.
Google Compute - WD Fusion Installation 01
- Once started we now complete the final step of installer's configuration, Induction.
For this first node you will skip this step: click Skip Induction. For all following node installations you will provide the FQDN or IP address and port of this first node. (In fact, you can complete induction by referring to any node that has itself completed induction.)
Google Compute - WD Fusion Installation 01
- Click Complete Induction.
Google Compute - WD Fusion Installation 01
- You will now see the admin UI's Dashboard. You can immediately see that the induction was successful, as both zones appear in the dashboard.
Google Compute - WD Fusion Installation 01
Demonstration
Setting up data replication
It's now time to demonstrate data replication between the on-premises cluster and the Google bucket storage. First we need to perform a synchronization to ensure that the data stored in both zones is in exactly the same state.
Synchronization
You can synchronize data in both directions:
- Synchronize from on-premises to Google's zone
- Login to the on-premises WD Fusion UI.
- Synchronize from Google Cloud to the on-premises zone
- Login to the WD Fusion UI in the Google Cloud zone.
- Synchronize in both directions (because the data already exists in both locations)
- Login to either Fusion UI.
The remainder of this guide covers replication from on-premises to Google Cloud, although the procedure for synchronizing in the opposite direction is effectively the same.
- Login to the on-premises WD Fusion UI and click on the Replicated Folders tab.
Google Cloud - Fusion installation figure 09.
- Click on the Create button to set up a folder on the local system.
Google Cloud - Fusion installation figure 10.
Navigate the HDFS File Tree (1), on the right-hand side of the New Rule panel, to select your target folder, created in the previous step. The selected folder will appear in the Path entry field. You can, instead, type or copy the full path to the folder into the Path entry field.
Next, select both zones from the Zones list (2). You can leave the default membership in place. This will replicate data between the two zones.
Click Create to continue.
- When you first create the folder you may notice status messages for the folder indicating that the system is preparing the folder for replication. Wait until all pending messages are cleared before moving to the next step.
Google Cloud - Fusion installation figure 11.
- Now that the folder is set up, it is likely that the file replicas in the two zones will be in an inconsistent state, in that you will have files in the local (on-premises) zone that do not yet exist in the Google Cloud storage bucket. Click on the Inconsistent link in the Fusion UI to address these.
Google Cloud - Fusion installation figure 12.
The consistency report will show you the number of inconsistencies that need correction. We will use bulk resolve to do the first replication.
See the Appendix section Running initial repairs in parallel for more information on improving the performance of your first synch, and on resolving individual inconsistencies if you have only a small number of files that conflict between zones.
- Click on the dropdown selector entitled Bulk resolve inconsistencies to display the options that determine the synch direction. Choose the zone that will be used as the source for the files. Tick the checkbox Preserve extraneous file so that files are not deleted if they don't exist in the source zone. The system will begin the file transfer process.
Google Cloud - Fusion installation figure 13.
- We will now verify the file transfers. Login to the WD Fusion UI on the Google Cloud instance. Click on the Replicated Folders tab. In the File Transfers column, click the View link.
Google Cloud - Fusion installation figure 14.
By checking off the boxes for each status type, you can report on files that are:
- In progress
- Incomplete
- Complete
No transfers in progress? You may not see files in progress if they are very small, as they tend to clear before the UI polls for in-flight transfers.
- Congratulations! You have successfully installed, configured, replicated and monitored data transfer with WANdisco Fusion.
Google Cloud - Fusion installation figure 15.