2. Installation Guide

This section runs through the installation of WD Fusion, from the initial steps where we make sure that your existing environment is compatible, through the procedure for installing the necessary components, and finally configuration.

Deployment Overview

We'll start with a quick overview of this installation guide so that you can see what's coming or quickly find the part that you want:

2.1 Deployment Checklist
Important hardware and software requirements, along with considerations that need to be made before starting to install WD Fusion.
2.2 Final Preparations
Things that you need to do immediately before you start the installation.
2.3 Running the installer
Step by step guide to the installation process when using the unified installer. For instructions on completing a fully manual installation see 2.7 Manual Installation.
2.4 Configuration
Runs through the changes you need to make to start WD Fusion working on your platform.
2.5 Deployment
Necessary steps for getting WD Fusion to work with supported Hadoop applications.
2.6 Appendix
Extras that you may need that we didn't want cluttering up the installation guide - Installation Troubleshooting, How to remove an existing WD Fusion installation.

From version 2.2, WD Fusion comes with an installer package
WD Fusion now has a unified installation package that installs all of WD Fusion's components (WD Fusion server, IHC servers and WD Fusion UI). The installer greatly simplifies installation as it handles all the components you need and does a lot of configuration in the background. However, if you need more control over the installation, you can use the orchestration script instead. See the Orchestrated Installation Guide.

Sample Orchestration mydefines.sh file.

2.1 Deployment Checklist

2.1.1 WD Fusion and IHC servers' requirements

This section describes hardware requirements for deploying Hadoop using WD Fusion. These are guidelines that provide a starting point for setting up data replication between your Hadoop clusters.

Glossary
We'll be using terms that relate to the Hadoop ecosystem, WD Fusion and WANdisco's DConE replication technology. If you encounter any unfamiliar terms, check out the Glossary.

WD Fusion Deployment Components

WD Fusion Deployment

Example WD Fusion Data Center Deployment.

WD Fusion Server
The core WD Fusion server. It uses HCFS (Hadoop Compatible File System) to permit the replication of HDFS data between data centers while maintaining strong consistency.
WD Fusion UI
A separate server that provides administrators with a browser-based management console for each WD Fusion server. This can be installed on the same machine as WD Fusion's server or on a different machine within your data center.
IHC Server
Inter-Hadoop Communication servers handle the traffic that runs between zones or data centers that use different versions of Hadoop. IHC servers are matched to the version of Hadoop running locally. It's possible to deploy different numbers of IHC servers at each data center; additional IHC servers can form part of a High Availability mechanism.

WD Fusion servers can be co-located with IHC servers
Provided that a server has sufficient resources, it is possible to co-locate your WD Fusion server with the IHC servers.

WD Fusion Client
Client jar files to be installed on each Hadoop client, such as mappers and reducers that are connected to the cluster. The client is designed to have a minimal memory footprint and impact on CPU utilization.

WD Fusion servers must not be co-located with HDFS servers (DataNodes, etc)
HDFS's default block placement policy dictates that if a client is co-located on a DataNode, then that co-located DataNode will receive one block of whatever file is being put into HDFS from that client. This means that if the WD Fusion Server (through which all transfers go) is co-located on a DataNode, then all incoming transfers will place one block onto that DataNode. In a transfer-heavy cluster that DataNode is likely to consume a lot of disk space, potentially forcing the WD Fusion Server to shut down in order to keep the Prevaylers from getting corrupted.

The following guidelines apply to both the WD Fusion server and to separate IHC servers. We recommend that you deploy on physical hardware rather than on a virtual platform; however, there is no reason why you can't deploy in a virtual environment.

If you plan to locate both the WD Fusion and IHC servers on the same machine then check the Collocated Server requirements:

CPUs: Small WD Fusion server deployment: 8 cores
Large WD Fusion server deployment: 16 cores
Architecture: 64-bit only.
System memory: There are no special memory requirements, except for the need to support a high throughput of data:
Type: Use ECC RAM
Size: Recommended: 64 GB (minimum of 16 GB)
Small WD Fusion server deployment: 32 GB
Large WD Fusion server deployment: 128 GB
System memory requirements are matched to the expected cluster size and should take into account the number of files and the block size. The more RAM you have, the bigger the supported file system, or the smaller the block size.

Collocation of WD Fusion/IHC servers
If you plan to install the WD Fusion server and your IHC servers on the same machine then you should look to increase the memory specification:
Recommended: 64 GB+
Minimum: 48 GB (16 GB for the WD Fusion server, plus 16 GB for each of at least 2 IHC servers.)

Storage space: Type: Hadoop operations are storage-heavy and disk-intensive so we strongly recommend that you use enterprise-class Solid State Drives (SSDs).
Size: Recommended: 1 TB
Minimum: You need at least 500 GB of disk space for a production environment.
Network Connectivity: Minimum 1Gb Ethernet between local nodes.
Small WANdisco Fusion server: 2Gbps
Large WANdisco Fusion server: 4x10 Gbps (cross-rack)
TCP Port Allocation: The following TCP ports are required for a deployment of WD Fusion:
DConE port: (default 8082)
IHC ports: (7000 range for command ports) (8000 range for HTTP)
HTTP interface: (default 50070) is re-used from the stand-alone Hadoop NameNode
Web UI interface: (default 8083)

2.1.2 Software requirements

Operating systems:
  • RHEL 6 x86_64
  • CentOS 6 x86_64
  • Ubuntu 12.04LTS and 14.04LTS
Web browsers:
  • Mozilla Firefox 11 and higher
  • Google Chrome
  • Safari 5 and higher
Java: Hadoop requires Java JRE 1.7. It is built and tested on Oracle's version of Java Runtime Environment.
We have now added support for OpenJDK 7, although we recommend running with Oracle's Java as it has undergone more testing.

Architecture: 64-bit only
Heap size: Set the Java heap size to a minimum of 1 GB, or the maximum available memory on your server.
Use a fixed heap size. Give -Xminf and -Xmaxf the same value, and make this as large as your server can support.
Avoid Java defaults. Ensure that garbage collection will run in an orderly manner: configure NewSize and MaxNewSize to use 1/10 to 1/5 of the maximum heap size for JVMs larger than 4 GB.
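For example, a WD Fusion server given a fixed 16 GB heap might use JVM options along the following lines (an illustrative sketch only; size the heap and young generation for your own cluster):

      # fixed heap, with the young generation at roughly 1/8 of the maximum heap
      -Xms16g -Xmx16g -XX:NewSize=2g -XX:MaxNewSize=2g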

Stay deterministic!
When deploying to a cluster, make sure you have exactly the same version of the Java environment on all nodes.

Where's Java?
Although WD Fusion only requires the Java Runtime Environment (JRE), Cloudera and Hortonworks may install the full Oracle JDK with the high strength encryption package included. This JCE package is a requirement for running Kerberized clusters.
For good measure, remove any JDK 6 that might be present in /usr/java. Make sure that /usr/java/default and /usr/java/latest point to the Java 7 version that your Hadoop manager installs.

Ensure that you set the JAVA_HOME environment variable for the root user on all nodes. Remember that, on some systems, invoking sudo strips environment variables, so you may need to add JAVA_HOME to sudo's list of preserved variables.
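For example, you might set JAVA_HOME in the root user's profile and preserve it for sudo with an env_keep entry (a sketch; adjust the Java path to match your installation):

      # in /root/.bashrc (or an equivalent profile script)
      export JAVA_HOME=/usr/java/default

      # in /etc/sudoers (edit with visudo)
      Defaults env_keep += "JAVA_HOME"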

File descriptor/Maximum number of processes limit: Maximum User Processes and Open Files limits are low by default on some systems. It is possible to check their values with the ulimit or limit command:
      ulimit -u && ulimit -n 
      

-u The maximum number of processes available to a single user.
-n The maximum number of open file descriptors.

For optimal performance, we recommend that both hard and soft limit values be set to 64000 or more:

RHEL6 and later: The file /etc/security/limits.d/90-nproc.conf explicitly overrides the settings in limits.conf, i.e.:

      # Default limit for number of user's processes to prevent
      # accidental fork bombs.
      # See rhbz #432903 for reasoning.
      * soft nproc 1024 <- Increase this limit or ulimit -u will be reset to 1024 
Ambari/Pivotal HD and Cloudera Manager will set various ulimit entries; you must ensure hard and soft limits are set to 64000 or higher. Check with the ulimit or limit command. If the limit is exceeded the JVM will throw an error: java.lang.OutOfMemoryError: unable to create new native thread.
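For example, on RHEL 6 you could raise both limits for all users with entries along the following lines (a sketch; adjust the file name and the scope to suit your environment, and log in again for the new limits to take effect):

      # /etc/security/limits.d/90-nproc.conf (or /etc/security/limits.conf)
      *    soft    nproc     64000
      *    hard    nproc     64000
      *    soft    nofile    64000
      *    hard    nofile    64000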
Additional requirements: passwordless ssh
If you plan to set up the cluster using the supplied WD Fusion orchestration script you must be able to establish secure shell connections without using a passphrase.

KB
Read our Knowledgebase article How to set up passwordless ssh.
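In brief, passwordless ssh is typically set up by generating a key pair without a passphrase and copying the public key to each node (a sketch; the user and node names are examples only):

      ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
      ssh-copy-id root@node01.example.com
      ssh root@node01.example.com    # should log in without prompting for a password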


Security Enhanced (SE) Linux
You need to disable Security-Enhanced Linux (SELinux) to ensure that it doesn't block activity that's necessary for completing the installation. Disable SELinux on all nodes, then reboot them:
sudo vi /etc/sysconfig/selinux
Set SELINUX to the following value:
SELINUX=disabled
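To apply the change to the running system before the reboot, and to confirm the current mode, you can also use the following (a sketch):

      sudo setenforce 0    # switch to permissive mode for the current session
      getenforce           # reports Enforcing, Permissive or Disabled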

iptables
Disable iptables.
$ sudo chkconfig iptables off
Reboot.
When the installation is complete you can re-enable iptables using the corresponding command:
sudo chkconfig iptables on


Comment out requiretty in /etc/sudoers
The installer's use of sudo won't work with some Linux distributions (for example CentOS), where /etc/sudoers enables requiretty, meaning sudo can only be invoked from a logged-in terminal session, not through cron or a bash script. When requiretty is enabled the installer will fail with an error:
execution refused with "sorry, you must have a tty to run sudo" message	
Ensure that requiretty is commented out:
# Defaults	   requiretty
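You can check for the setting and comment it out safely with visudo (a sketch):

      sudo grep requiretty /etc/sudoers
      sudo visudo    # comment out any "Defaults requiretty" line, as shown above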

2.1.3 Supported versions

This table shows the versions of Hadoop and Java that we currently support:

Distribution: Apache Hadoop 2.5.0
JRE: Oracle JDK 1.7_45 64-bit

Distribution: HDP 2.1 / 2.2 / 2.3
Console: Ambari 1.6.1 / 1.7 / 2.1 (support for EMC Isilon 7.2.0.1 and 7.2.0.2)
JRE: Oracle JDK 1.7_45 64-bit

Distribution: CDH 5.2.0 / 5.3.0 / 5.4
Console: Cloudera Manager 5.3.2 (support for EMC Isilon 7.2.0.1 and 7.2.0.2)
JRE: Oracle JDK 1.7_45 64-bit

Distribution: Pivotal HD 3.0
Console: Ambari 1.6.1 / 1.7
JRE: Oracle JDK 1.7_45 64-bit

2.2 Final Preparations

We'll now look at what you should know and do as you begin the installation.

Time requirements

The time required to complete a deployment of WD Fusion will partly depend on its size; larger deployments with more nodes and more complex replication rules will take correspondingly more time to set up. Use the guide below to help you plan your deployments.

  • Run through this document and create a checklist of your requirements. (1-2 hours).
  • Complete the WD Fusion server installations (20 minutes per node, or 1 hour for a test deployment).
  • Install WD Fusion UI (30 minutes).
  • Complete client installations and complete basic tests (1-2 hours).

Of course, this is a guideline to help you plan your deployment. You should think ahead and determine if there are additional steps or requirements introduced by your organization's specific needs.

Network requirements

See the deployment checklist for a list of the TCP ports that need to be open for WD Fusion.

Running WD Fusion on multi-homed servers

The following guide runs through what you need to do to correctly configure a WD Fusion deployment if the nodes are running with multiple network interfaces.

Overview

  1. A file is created in DC1. A client writes the data.
  2. Periodically, after the data is written, a proposal is sent by the WD Fusion server in DC1, telling the WD Fusion server in DC2 to pull the new file. This proposal includes the map of IHC server public IP addresses, in this case listening at <Public-IP>:7000 (the Fusion server in DC1 reads this from
    /etc/wandisco/fusion/server/ihcList).
  3. The Fusion server in DC2 receives this agreement, connects to <Public-IP>:7000 and pulls the data.

Procedure

  1. Stop all WD Fusion services.
  2. Reconfigure your IHCs to your preferred address in /etc/wandisco/ihc/*.ihc on each IHC node.
  3. For the WD Fusion servers, delete all files in /etc/wandisco/fusion/server/ihclist/*.
  4. Copy the zone1 IHCs' /etc/wandisco/ihc/*.ihc files to the zone1 Fusion server's /etc/wandisco/server/ihcList.
  5. Copy the zone2 IHCs' /etc/wandisco/ihc/*.ihc files to the zone2 Fusion server's /etc/wandisco/server/ihcList.
  6. Restart all services. (A shell sketch of steps 3-5 follows below.)
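A minimal shell sketch of steps 3-5, assuming the paths given above and example hostnames (ihc01-zone1, ihc01-zone2); adjust the paths and hostnames to match your installed version:

      # on the zone1 Fusion server
      rm -f /etc/wandisco/fusion/server/ihclist/*
      scp "root@ihc01-zone1:/etc/wandisco/ihc/*.ihc" /etc/wandisco/server/ihcList/

      # on the zone2 Fusion server
      rm -f /etc/wandisco/fusion/server/ihclist/*
      scp "root@ihc01-zone2:/etc/wandisco/ihc/*.ihc" /etc/wandisco/server/ihcList/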

Kerberos Security

If you are running Kerberos on your cluster you should consider the following requirements:

  • Kerberos is already installed and running on your cluster
  • Fusion-Server is configured for Kerberos as described in Setting up Kerberos.
  • We will be using the same keytab and principal we generated for fusion-server. Assume it's in /etc/hadoop/conf/fusion.keytab

Kerberos Configuration before starting the installation

Before running the installer on a platform that is secured by Kerberos, you'll need to run through the following steps: Setting up Kerberos.

Update WD Fusion UI configuration

Manual instructions

The following instructions apply to manual or orchestration script-based installation. If you install WD Fusion using the installer script, Kerberos settings are, from Version 2.5.2, handled in the installer.

Make the following changes to WD Fusion's UI element to enable it to interact with a Kerberized environment:

  1. Add core-site.xml and hdfs-site.xml path to the ui.properties configuration file:
    client.core.site=/etc/hadoop/conf/core-site.xml
    client.hdfs.site=/etc/hadoop/conf/hdfs-site.xml
  2. Enable kerberos in fusion-ui configuration (/opt/wandisco/fusion-ui-server/properties/ui.properties):
    kerberos.enabled=true
    kerberos.generated.config.path=/opt/wandisco/fusion-ui-server/properties/kerberos.cfg
    kerberos.keytab.path=/etc/hadoop/conf/fusion.keytab
    kerberos.principal=/${hostname}@${krb_realm}
  3. kerberos.enabled
    Is used to switch on Kerberos (with a =true) for the WD Fusion node.
    kerberos.generated.config.path
    The path to the Kerberos configuration, used to allow WD Fusion / IHC servers to communicate with a Kerberos-enabled cluster.
    kerberos.keytab.path
    Path to the Kerberos keytab file.
    kerberos.principal
    The Kerberos identity used by the hdfs superuser; the principal is provided in the form primary/instance@realm.
  4. Set up a proxy user on the NameNode, adding the following properties to core-site.xml on the NameNode(s).
  5. <property>
         <name>hadoop.proxyuser.$USERNAME.hosts</name>
         <value>*</value>
     </property>
     <property>
         <name>hadoop.proxyuser.$USERNAME.groups</name>
         <value>*</value>
     </property>
    
    hadoop.proxyuser.$USERNAME.hosts
    Defines the hosts from which the client can be impersonated. $USERNAME, the superuser who wants to act as a proxy for the other users, is usually set to the system user "hdfs". From version 2.6 these values are captured by the installer, which can apply them automatically.
    hadoop.proxyuser.$USERNAME.groups
    A list of groups whose users the superuser is allowed to proxy for. A wildcard (*) means that proxying of any user is allowed. For example, for the superuser to act as proxy for another user, the proxy action must be performed from one of the listed hosts, and the user must be included in the list of groups. Note that this can be a comma-separated list or the noted wildcard (*). After changing these properties, refresh the NameNode so that they take effect (see the sketch after this list).
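A minimal sketch of applying the proxy user change without a full restart, run as the hdfs superuser after updating core-site.xml (depending on your distribution, a NameNode restart through the Hadoop manager may be required instead):

      hdfs dfsadmin -refreshSuperUserGroupsConfiguration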

Clean Environment

Before you start the installation you must ensure that there are no existing WD Fusion installations or WD Fusion components installed on your selected machines. If you are about to upgrade to a new version of WD Fusion you must first make sure that you run through the removal instructions provided in the Appendix - Cleanup WD Fusion.

Installer File

You need to match WANdisco's WD Fusion installer file to each data center's version of Hadoop. Installing the wrong version of WD Fusion will result in the IHC servers being misconfigured.

License File

After completing an evaluation deployment, you will need to contact WANdisco about getting a license file for moving your deployment into production.

2.3 Running the installer

Below is the procedure for getting set up with the installer. Running the installer only takes a few minutes while you enter the necessary settings; however, if you wish to handle installations without a user having to manually enter the settings, you can use the Silent Installer.

Hands on installation

Listed below is the procedure that you should use for completing an installation using the installer file. This requires an administrator to enter details throughout the procedure. Alternatively, see Using the "Silent" Installer option to handle installation programmatically.

  1. Open a terminal session on your first installation server. Copy the WD Fusion installer script into a suitable directory.
  2. Make the script executable, e.g.
    chmod +x fusion-ui-server-<version>_rpm_installer.sh
    	
  3. Execute the file with root permissions, e.g.
    sudo ./fusion-ui-server-<version>_rpm_installer.sh
  4. The installer will now start.
    Verifying archive integrity... All good.
    Uncompressing WANdisco Fusion..............................
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    
    
    
    Welcome to the WANdisco Fusion installation
    
    
    
    You are about to install WANdisco Fusion version 2.4-206
    
    Do you want to continue with the installation? (Y/n) y	
    	
    The installer will perform an integrity check, confirm the product version that will be installed, then invite you to continue. Enter "Y" to continue the installation.
  5. The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the WD Fusion UI.
    Checking prerequisites:
    
    Checking for perl: OK
    Checking for java: OK
    
    INFO: Using the following Memory settings:
    
    INFO: -Xms128m -Xmx512m
    
    Do you want to use these settings for the installation? (Y/n) y
    
    The installer checks for Perl and Java. See the Installation Checklist for more information about these requirements. Enter "Y" to continue the installation.
  6. Next, confirm the port that will be used to access WD Fusion through a browser.
    Which port should the UI Server listen on? [8083]:
    
  7. Select the Hadoop version and type from the list of supported platforms:
    Please specify the appropriate backend from the list below:
    
    [0] cdh-5.2.0
    [1] cdh-5.3.0
    [2] cdh-5.4.0
    [3] hdp-2.1.0
    [4] hdp-2.2.0
    [5] hdp-2.3.0
    
    Which fusion backend do you wish to use? 3
    You chose hdp-2.2.0:2.6.0.2.2.0.0-2041

    MapR/Pivotal availability
    The MapR/PHD versions of Hadoop have been removed from the trial version of WD Fusion in order to reduce the size of the installer for most prospective customers. These versions are run by a small minority of customers, while their presence nearly doubled the size of the installer package. Contact WANdisco if you need to evaluate WD Fusion running with MapR or PHD.

    Additional available packages

    [1] mapr-4.0.1
    [2] mapr-4.0.2
    [3] mapr-4.1.0
    [4] phd-3.0.0

    MapR requirement
    If you install into a MapR cluster then you need to assign the MapR superuser system account/group "mapr" if you need to run WD Fusion using the fusion:/// URI.

    See the requirement for MapR Client Configuration.

  8. The installer now confirms which system user/group will be applied to WD Fusion.
    We strongly advise against running Fusion as the root user.
    For default HDFS setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
    
    Which user should Fusion run as? [hdfs]
    Checking 'hdfs' ...
    ... 'hdfs' found.
    
    Please choose an appropriate group for your system. By default HDP uses the 'hadoop' group.
    Which group should Fusion run as? [hadoop]
    Checking 'hadoop' ...
    ... 'hadoop' found.
    The installer does a search for the commonly used account and group, assigning these by default.
  9. Check the summary to confirm that your chosen settings are appropriate:
    Installing with the following settings:
    
    User and Group:                     hdfs:hadoop
    Hostname:                           node04-example.host.com
    Fusion Admin UI Listening on:       0.0.0.0:8083
    Fusion Admin UI Minimum Memory:     128
    Fusion Admin UI Maximum memory:     512
    Platform:                           hdp-2.3.0 (2.7.1.2.3.0.0-2557)
    Manager Type                        AMBARI
    Manager Host and Port:              :
    Fusion Server Hostname and Port:    node04-example.host.com:8082
    SSL Enabled:                        false
    
    Do you want to continue with the installation? (Y/n) y
    You are now given a summary of all the settings provided so far. If these settings are correct then enter "Y" to complete the installation of the WD Fusion server.
  10. The package will now install
    Installing hdp-2.1.0 packages:
      fusion-hdp-2.1.0-server-2.4_SNAPSHOT-1130.noarch.rpm ...
       Done
      fusion-hdp-2.1.0-ihc-server-2.4_SNAPSHOT-1130.noarch.rpm ...
       Done
    Installing fusion-ui-server package
    Starting fusion-ui-server:[  OK  ]
    Checking if the GUI is listening on port 8083: .....Done	
    	
  11. The WD Fusion server will now start up:
    Please visit http://<YOUR-SERVER-ADDRESS>.com:8083/ to access the WANdisco Fusion
    		
    		If 'http://<YOUR-SERVER-ADDRESS>.com' is internal or not available from your browser, replace this with an externally available address to access it. 
    		
    Installation Complete
    [root@node05 opt]#
    
    At this point the WD Fusion server and corresponding IHC server will be installed. The next step is to configure the WD Fusion UI. Open a web browser and point it at the provided URL. E.g
    http://<YOUR-SERVER-ADDRESS>.com:8083/
  12. In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
    Make your selection as follows:
    Adding a new WD Fusion cluster
    Select Add Zone.
    Adding additional WD Fusion servers to an existing WD Fusion cluster
    Select Add to an existing Zone.

    High Availability for WD Fusion / IHC Servers

    It's possible to enable High Availability in your WD Fusion cluster by adding additional WD Fusion/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient WD Fusion/IHC servers running to maintain replication.

    Add HA nodes to the cluster using the installer and choosing to Add to an existing Zone, using a new node name.

    Configuration for High Availability
    When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, then problems will occur. WD Fusion should never see the same URI (scheme + authority) for two different clusters.

    WD Fusion Deployment

    Welcome.

  13. Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
    WD Fusion Deployment

    Environmental checks.

    On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.

    WD Fusion Deployment

    Example check results.

    Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.

  14. Upload the license file.
    WD Fusion Deployment

    Upload your license file.

  15. The conditions of your license agreement will be presented in the top panel, including License Type, Expiry Date, Name Node Limit and Data Node Limit.
    WD Fusion Deployment

    Verify license and agree to subscription agreement.

  16. Click I agree to the EULA to continue.
    WD Fusion Deployment

    Next step.

  17. Enter settings for the WD Fusion server.
    WD Fusion Deployment

    screen 4 - Server settings

    WD Fusion Server

    Fusion Server Max Memory (GB)
    Enter the maximum Java Heap value for the WD Fusion server.
    Umask (currently 022)
    Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
    Latitude
    The north-south coordinate angle for the installation's geographical location.
    Longitude
    The east-west coordinate angle for the installation's geographical location. The latitude and longitude are used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.

    IHC Server

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
    Once all settings have been entered, click Next step.
  18. Next, you will enter the settings for your new Zone.
    WD Fusion Deployment

    New Zone

    Zone Information

    Entry fields for zone properties

    Fully Qualified Domain Name
    The full hostname for the server.
    Node ID
    A unique identifier that will be used by WD Fusion UI to identify the server.
    Location Name (optional)
    A location name that can quickly identify where the server is located.

    Known issue with Location names
    You must use different Location names /Node IDs for each zone. If you use the same name for multiple zones then you will not be able to complete the induction between those nodes.

    DConE Port
    TCP port used by WD Fusion for replicated traffic.
    Zone Name
    The name used to identify the zone in which the server operates.
    Management Endpoint
    Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager:

    Advanced Options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    URI Selection

    The default behavior for WD Fusion is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI or can only run on HDFS instead of the more lenient HCFS. Each option is explained below:

    Use HDFS URI with HDFS file system
    URI Option A
    This option is available for deployments where the Hadoop applications support neither the WD Fusion URI nor the HCFS standards. WD Fusion operates entirely within HDFS.

    This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs:// or with no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
    Use WD Fusion URI with HCFS file system
    URI Option B
    This is the default option that applies if you don't enable the Advanced Options, and was the only option in WD Fusion prior to version 2.6. When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Fusion URI or are not written to the HCFS specification, this option will not work.
    Use Fusion URI with HDFS file system
    URI option C
    This differs from the default in that while the WD Fusion URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the WD Fusion URI but not the Hadoop Compatible File System.
    Use Fusion URI and HDFS URI with HDFS file system
    URI Option D
    This "mixed mode" supports all the replication schemes (fusion://, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.

    <Hadoop Management Layer> Configuration

    This section configures WD Fusion to interact with the management layer, which could be Ambari or Cloudera Manager, etc.

    Manager Host Name /IP
    The full hostname or IP address for the working server that hosts the Hadoop manager.
    Port
    TCP port on which the Hadoop manager is running.
    Username
    The username of the account that runs the Hadoop manager.
    Password
    The password that corresponds with the above username.
    SSL
    (Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't update from the default http port.

    Kerberos Configuration

    In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.

    WD Fusion Kerberos
    Config file path
    Path to the Kerberos configuration file, e.g. krb5.conf.
    Keytab file path
    Path to the generated keytab, e.g. /etc/krb5.keytab
    Principal
    Principal for the keytab file, e.g. HDFS@<REALM>
    Enable Kerberos authentication for WD Fusion endpoints
    This checkbox tells WD Fusion whether or not to Kerberize all WD Fusion communication. When unchecked, WD Fusion's application traffic is secured, but not through Kerberos. When ticked, Kerberos authentication is introduced across the /fusion/* REST API paths, meaning that, if enabled, all users will require Kerberos credentials in order to access the Web UI.

    Enabling Kerberos authentication on WD Fusion's REST API
    When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the "--negotiate" and "-u:" options, like so:

    curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"
    Keytab file path
    Path to the generated keytab for the HTTP principal, which is used when WD Fusion is Kerberized, while the settings that have already been provided are for when the Hadoop cluster itself is Kerberized.
    Principal
    This is specifically the principal for the HTTP user. This should be HTTP/<hostname>@<REALM>.

    See Setting up Kerberos for more information about Kerberos setup.

  19. Click Validate to confirm that your settings are valid. Once validated, click Next step.
    WD Fusion Deployment

    Zone information.

  20. The remaining panels in Step 6 detail all of the installation settings. All your license, WD Fusion server, IHC server and zone settings are shown. If you spot anything that needs to be changed you can click Go back.
    WD Fusion Deployment

    Summary

    Once you are happy with the settings and all your WD Fusion clients are installed, click Deploy Fusion Server.
  21. WD Fusion Client Installation

  22. In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support WD Fusion's data replication across the Hadoop ecosystem.
    WD Fusion Deployment

    Client installations.

    The installer supports three different packaging systems for installing clients: regular RPMs, Parcels for Cloudera, and HDP Stack for Hortonworks/Ambari.

    Installing into MapR
    If you are installing into a MapR cluster, use the default RPMs, detailed below: Fusion client installation with RPMs.

    RPM / DEB Packages

    client nodes
    By client nodes we mean any machine that is interacting with HDFS that you need to form part of WD Fusion's replicated system. If a node is not going to form part of the replicated system then it won't need the WD Fusion client installed. If you are hosting the WD Fusion UI package on a dedicated server, you don't need to install the WD Fusion client on it as the client is built into the WD Fusion UI package. Note that in this case the WD Fusion UI server would not be included in the list of participating client nodes.

    WD Fusion Deployment

    Example clients list

    For more information about doing a manual installation, see Fusion Client installation for regular RPMs.
    To install with the Cloudera parcel file, see: Fusion Client installation with Parcels.
    For Hortonwork's own proprietary packaging format: Fusion Client installation with HDP Stack.

  23. The next step starts WD Fusion up for the first time. You may receive a warning message if your clients have not yet been installed. You can now address any client installations, then click Revalidate Client Install to make the warning go away. If everything is setup correctly you can click Start WD Fusion.
    WD Fusion Deployment

    Skip or start.

  24. If you are installing onto a platform that is running Ambari (HDP or Pivotal HD), once the clients are installed you should login to Ambari and restart any services that are flagged as waiting for a restart. This will apply to MapReduce and YARN, in particular.
    Restart HDFS

    restart to refresh config

    If you are running Ambari 1.7, you'll be prompted to confirm this is done.
    WD Fusion Deployment

    Confirm that you have completed the restarts

    Important! If you are installing on Ambari 1.7
    Additionally, due to a bug in Ambari 1.7, before you can continue you must log into Ambari and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.

  25. First WD Fusion node installation

    When installing WD Fusion for the first time, this step is skipped. Click Skip Induction.

    Second and subsequent WD Fusion node installations into an existing zone

    For the second and all subsequent WD Fusion nodes entered into a new or existing zone, you must complete the induction step. Enter the fully qualified domain name for the existing node, along with the WD Fusion server port (8082 by default). Click Start Induction.

    Known issue with Location names
    You must use different Location names /IDs for each zone. If you use the same name for multiple zones then you will not be able to complete the induction between those nodes.


    WD Fusion Deployment

    Induction.

  26. Once the installation is complete you will get access to the WD Fusion UI after you log in using your Hadoop manager username and password.
    WD Fusion Deployment

    WD Fusion UI

2.4 Configuration

Once WD Fusion has been installed on all data centers you can proceed with setting up replication on your HDFS file system. You should plan your requirements ahead of the installation, matching up your replication with your cluster to maximise performance and resilience. The next section will take a brief look at an example configuration and run through the necessary steps for setting up data replication between two data centers.

Replication Overview

Example Deployment

Example WD Fusion deployment in a 3 data center deployment.

In this example, each one of three data centers ingests data from its own datasets: "Weblogs", "phone support" and "Twitter feed". An administrator can choose to replicate any or all of these datasets so that the data is replicated across any of the data centers, where it will be available for compute activities by the whole cluster. The only change required to your Hadoop applications will be the addition of a replication-specific URI. You can read more about adapting your Hadoop applications for replication.

Setting up Replication

The following steps are used to start replicating hdfs data. The detail of each step will depend on your cluster setup and your specific replication requirements, although the basic steps remain the same.

  1. Create a membership including all the data centers that will share a particular directory. See Create Membership
  2. Create and configure a Replicated Folder. See Replicated Folders
  3. Perform a consistency check on your replicated folder. See Consistency Check
  4. Configure your Hadoop applications to use WANdisco's protocol. See Configure Hadoop for WANdisco replication
  5. Run Tests to validate that your replicated folder remains consistent while data is being written to each data center. See Testing replication
Deployment with a small number of datanodes
You should consider setting the following configuration if you are planning to run with a small number of datanodes (fewer than 3). This is especially important in cases where a single datanode may be deployed:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.best-effort
Default is "false". Running with the default setting, the client will keep trying until the set HDFS replication policy is satisfied. When set to "true", even if the specified policy requirements can't be met (e.g., there's only one DataNode that succeeds in the pipeline, which is less than the policy requirement), the client will still be allowed to continue to write.
Explanation
Fusion uses hflush and append to efficiently replicate files while they are still being written at the source cluster. As such, it's possible to see these errors on the destination cluster. This is because appends have stricter requirements around creating the desired number of block replicas (the HDFS default is 3) before allowing the write to be marked as complete. As per this Hortonworks article, a resolution may be to set dfs.client.block.write.replace-datanode-on-failure.best-effort to true, allowing the append to continue despite the inability to create the 3 block replicas. Note that this is not a recommended setting for clusters with more than 3 datanodes, as it may result in under-replicated blocks. In this case the root cause of the errors should be identified and addressed - potentially a disk space issue could mean there are not sufficient datanodes with enough space to create the 3 replicas, resulting in the same symptoms.

Installing on a Kerberized cluster

Currently the Installer doesn't work on a platform that is secured by Kerberos. If you run the installer on a platform that is running Kerberos then the WD Fusion and IHC servers will fail to start at the end of the installation. You can overcome this issue by completing the following procedure before you install WD Fusion: Setting up Kerberos.

2.5 Deployment

The deployment section covers the final step in setting up a WD Fusion cluster, where supported Hadoop applications are plugged into WD Fusion's synchronized distributed namespace. It isn't possible to cover all the requirements of all the third-party software mentioned here, so we strongly recommend that you get hold of the corresponding documentation for each Hadoop application before you work through these procedures.

2.5.1 Hive

This guide integrates WD Fusion with Apache Hive. It aims to accomplish the following goals:

  • Replicate Hive table storage.
  • Use fusion URIs as store paths.
  • Use fusion URIs as load paths.
  • Share the Hive metastore between two clusters.

Prerequisites

  • Knowledge of Hive architecture.
  • Ability to modify Hadoop site configuration.
  • WD Fusion installed and operating.

Replicating Hive Storage via fusion:///

The following requirements come into play if you are deploying WD Fusion using its native fusion:/// URI. In order to store a Hive table in WD Fusion you specify a WD Fusion URI when creating the table. For example, consider creating a table called log that will be stored in a replicated directory.

CREATE TABLE log(requestline string) stored as textfile location 'fusion:///repl1/hive/log';

Note: Replicating table storage without sharing the Hive metadata will create a logical discrepancy in the Hive catalog. For example, consider a case where a table is defined on one cluster and replicated on the HCFS to another cluster. A Hive user on the other cluster would need to define the table locally in order to make use of it.
When running Hive with Cloudera, the Hive metastore canary test currently reports having "Bad health". FUS-1140

Exceptions

Hive from CDH 5.3/5.4 does not work with WD Fusion, as a result of HIVE-9991. The issue will be addressed once this fix for Hive is released. This requires that you modify the default Hive file system setting when using CDH 5.3 and 5.4. In Cloudera Manager, add the following property to hive-site.xml:

<property>
    <name>fs.defaultFS</name>
    <value>fusion:///</value>
</property>

This property should be added in 3 areas:

  • Service Wide
  • GateWay Group
  • Hiveserver2 group

Replicated directories as store paths

It's possible to configure Hive to use WD Fusion URIs as output paths for storing data. To do this you must specify a fusion URI when writing data back to the underlying Hadoop-compatible file system (HCFS). For example, consider writing data out from a table called log to a file stored in a replicated directory:

INSERT OVERWRITE DIRECTORY 'fusion:///repl1/hive-out.csv' SELECT * FROM log;

Exceptions

HDP 2.2
When running MapReduce jobs on HDP 2.2, you need to append the following entry to mapreduce.application.classpath in mapred-site.xml:

/usr/hdp/<hdp version>/hadoop-hdfs/lib/*

Replicated directories as load paths

In this section we'll describe how to configure Hive to use fusion URIs as input paths for loading data.

It is not common to load data into a Hive table from a file using the fusion URI. When loading data into Hive from files the core-site.xml setting fs.default.name must also be set to fusion, which may not be desirable. It is much more common to load data from a local file using the LOCAL keyword:

LOAD DATA LOCAL INPATH '/tmp/log.csv' INTO TABLE log;
If you do wish to use a fusion URI as a load path, you must change the fs.defaultFS setting to use WD Fusion, as noted in a previous section. Then you may run:
LOAD DATA INPATH 'fusion:///repl1/log.csv' INTO TABLE log;

Sharing the Hive metastore

Advanced configuration - please contact WANdisco before attempting
In this section we'll describe how to share the Hive metastore between two clusters. Since WANdisco Fusion can replicate the file system that contains the Hive data storage, sharing the metadata presents a single logical view of Hive to users on both clusters.

When sharing the Hive metastore, note that Hive users on all clusters will know about all tables. If a table is not actually replicated, Hive users on other clusters will experience errors if they try to access that table.

There are two options available.

Hive metastore available read-only on other clusters

In this configuration, the Hive metastore is configured normally on one cluster. On other clusters, the metastore process points to a read-only copy of the metastore database. MySQL can be used in master-slave replication mode to provide the metastore.

Hive metastore writable on all clusters

In this configuration, the Hive metastore is writable on all clusters.

  • Configure the Hive metastore to support high availability.
  • Place the standby Hive metastore in the second data center.
  • Configure both Hive services to use the active Hive metastore.
Performance over WAN
Performance of Hive metastore updates may suffer if the writes are routed over the WAN.

Hive metastore replication

There are three strategies for replicating Hive metastore data with WD Fusion:

Standard

For Cloudera CDH: See Hive Metastore High Availability.

For Hortonworks/Ambari: High Availability for Hive Metastore.

Manual Replication

In order to manually replicate metastore data, ensure that the DDLs are applied on both clusters, and perform a partition rescan.
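A minimal sketch of the partition rescan step, using the log table from the earlier example and the Hive CLI (it assumes the table DDL has already been created on the second cluster):

      hive -e "MSCK REPAIR TABLE log;"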

Hive specific configuration for WD Fusion with fusion:/// URI

Required configuration for running WD Fusion on a Hive-equipped cluster.

  1. The recommended way to set up Hive with Fusion is to set the fs.defaultFS property in hive-site.xml to point to the fusion:/// URI, while keeping the scratch directories pointing into local HDFS, as described below. In this setup all tables will be created in Hive with the WD Fusion URI by default; however, replication of particular tables/databases could and should then be configured through the WD Fusion UI.
    <property>
        <name>fs.defaultFS</name>
        <value>fusion:///</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://dc1-cdh54-Cluster/tmp/hive-${user.name}</value>
    </property>
    <property>
        <name>hive.exec.local.scratchdir</name>
        <value>file:///tmp/${user.name}</value>
    </property>
    

2.5.2 Impala

Prerequisites

  • Knowledge of Impala architecture.
  • Ability to modify Hadoop site configuration.
  • WD Fusion installed and operating.

Query a table stored in a replicated directory

Support from WD Fusion v2.3 - v2.5
Impala does not allow the use of non-HDFS file system URIs for table storage. To work around this, WANdisco Fusion 2.3 comes with a client program (see Impala Parcel) that supports reading data from a table stored in a replicated directory. From WD Fusion 2.6, it becomes possible to replicate directly over HDFS using the hdfs:/// URI.

2.5.3 Oracle: Big Data Appliance

Each node in an Oracle:BDA deployment has multiple network interfaces, with at least one used for intra-rack communications and one used for external communications. WD Fusion requires external communications so configuration using the public IP address is required instead of using host names.

Prerequisites

  • Knowledge of Oracle:BDA architecture and configuration.
  • Ability to modify Hadoop site configuration.

Required steps

Operating in a multi-homed environment

Oracle:BDA is built on top of Cloudera's Hadoop and requires some extra steps to support a multi-homed network environment.

Procedure

  1. Complete a standard installation, following the steps provided in the Installation Guide. Retrieve and use the public interface IP addresses for the nodes that will host the WD Fusion and IHC servers.
  2. Once the installation is completed you need to set up WD Fusion for a multi-homed environment. First, edit WD Fusion's properties file (/opt/fusion-server/application.properties). Create a backup of the file, then add the following line at the end:
    communication.hostname=0.0.0.0
    Resave the file
  3. Next we need to update the IHC servers so that they will also use the public IP addresses rather than hostnames. The specific number and names of the configuration files that you need to update will depend on the details of your installation. If you run both WD Fusion server and IHCs on the same server you can get a view of the files with the following command:
    tree /etc/wandisco
    WD Fusion tree

    View of the WD Fusion configuration files.

  4. Edit each of the revealed config files. In the above example there are two instances of 2.5.0-cdh5.3.0.ihc that will need to be edited:
    #Fusion Server Properties
    #Wed Jun 03 10:14:41 BST 2015
    ihc.server=node01.obda.domain.com\:7000
    http.server=node01.obda.domain.com\:9001
    
    In each case you should change the addresses so that they use the public IP addresses instead of the hostnames.
  5. Open a terminal session on the node hosting the WD Fusion UI. Edit the properties file /opt/wandisco/fusion-ui-server/properties/ui.properties Add the property to allow the UI to listen on all interfaces, i.e.
    ui.hostname=0.0.0.0
    This should now ensure that the multi-homed deployment will work with WD Fusion.

Troubleshooting

If you suspect that the multi-homed environment is causing difficulty, verify that you can communicate to the IHC server(s) from other data centers. For example, from a machine in another data center, run:

nc <IHC server IP address> <IHC server port>
If you see errors from that command, you must fix the network configuration.

2.5.4 EMC Isilon

Prerequisites

  • Knowledge of EMC Isilon administration.
  • Ability to modify Hadoop site configuration.

HDP on Isilon

Follow these steps to install WANdisco Fusion on a Hortonworks (HDP) cluster on Isilon storage.

  • Complete a standard installation, following the steps provided in the Installation Guide.
  • Copy /opt/fusion-server/core-site.xml from the WANdisco Fusion server to /opt/fusion/ihc-server/<package-version>/ on the IHC server(s), for example using scp as shown below.
  • Restart IHC services.
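A minimal sketch of the copy step above, run from the WANdisco Fusion server with an example IHC hostname (substitute your own hostname and package version):

      scp /opt/fusion-server/core-site.xml root@ihc01.example.com:/opt/fusion/ihc-server/<package-version>/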

2.5.5 Apache Tez

Apache Tez is a YARN application framework that supports high performance data processing through DAGs. When set up, Tez uses its own tez.tar.gz containing the dependencies and libraries that it needs to run DAGs. For a DAG to access WD Fusion's fusion:/// URI it needs our client jars:

Configure the tez.lib.uris property with the path to the WD Fusion client jar files.

...
<property>
  <name>tez.lib.uris</name>
  <!-- Location of the Tez jars and their dependencies.
       Tez applications download required jar files from this location,
       so it should be publicly accessible. -->
  <value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
</property>
...

Running Hortonworks Data Platform, the tez.lib.uris parameter defaults to /hdp/apps/${hdp.version}/tez/tez.tar.gz. So, to add the fusion libraries, there are two choices:

Option 1: Delete the above value, and instead provide a list that includes the path where the above tarball unpacks to, plus the path where the fusion libraries are.
or
Option 2: Unpack the above tarball, repack it with the WD Fusion libraries and re-upload it to HDFS (a sketch of this option follows below).

Note that both changes are vulnerable to a platform (HDP) upgrade.
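A minimal sketch of option 2, with HDP_VERSION standing in for your HDP version and a hypothetical location for the WD Fusion client jars (use wherever the client package installed them on your nodes):

      # fetch the stock Tez tarball from HDFS and unpack it
      hdfs dfs -get /hdp/apps/${HDP_VERSION}/tez/tez.tar.gz .
      mkdir tez-repack && tar -xzf tez.tar.gz -C tez-repack
      # add the WD Fusion client jars (example path - adjust to your installation)
      cp /opt/wandisco/fusion/client/lib/*.jar tez-repack/lib/
      # repack and re-upload, overwriting the original
      tar -czf tez.tar.gz -C tez-repack .
      hdfs dfs -put -f tez.tar.gz /hdp/apps/${HDP_VERSION}/tez/tez.tar.gz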

2.5.6 Apache Ranger

Apache Ranger is another centralized security console for Hadoop clusters, and the preferred solution for Hortonworks HDP (whereas Cloudera prefers Apache Sentry). While Apache Sentry stores its policy file in HDFS, Ranger uses its own local MySQL database, which introduces concerns over non-replicated security policies. Ranger also applies its policies to the ecosystem via Java plugins into the ecosystem components - the NameNode, HiveServer, etc. In testing, the WD Fusion client has not experienced any problems communicating with Ranger-enabled platforms.

Ensure that the Hadoop system user, typically hdfs, has permission to impersonate other users.

...
<property>
  <name>hadoop.proxyuser.hdfs.users</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfs.groups</name>
  <value>*</value>
</property>
...

2.6 Appendix

The appendix section contains extra help and procedures that may be required when running through a WD Fusion deployment.

Environmental Checks

During the installation, your system's environment is checked to ensure that it will support WANdisco Fusion. The environment checks are intended to catch basic compatibility issues, especially those that may appear during an early evaluation phase. The checks are not intended to replace carefully running through the Deployment Checklist.

Operating System: The WD Fusion installer verifies that you are installing onto a system that is running on a compatible operating system.
See the Operating system section of the Deployment Checklist; the currently supported distributions of Linux are listed here:
Supported Operating Systems
  • RHEL 6 x86_64
  • CentOS 6 x86_64
  • Ubuntu 12.04LTS and 14.04LTS
Architecture:
  • 64-bit only
Java: The WD Fusion installer verifies that the necessary Java components are installed on the system. The installer checks:

  • Env variables: JRE_HOME, JAVA_HOME and runs the which java command.
  • Version: 1.7 recommended. Must be at least 1.7. You can run with later versions, but they are not recommended as we perform all our testing with 1.7.
  • Architecture: JVM must be 64-bit.
  • Distribution: Must be from Oracle. See Oracle's Java Download page.
For more information about Java requirements, see the Java section of the Deployment Checklist.
ulimit: The WD Fusion installer verifies that the system's maximum user processes and maximum open files are set to 64000.
For more information about these settings, see the File descriptor/Maximum number of processes limit section of the Deployment Checklist.

System memory and storage: WD Fusion's requirements for system resources are split between its component parts, namely the WD Fusion server, the Inter-Hadoop Communication (IHC) servers and the WD Fusion UI, all of which can, in principle, be either co-located on the same machine or hosted separately.
The installer will warn you if the system on which you are currently installing WD Fusion falls below the requirements. For more details about the RAM and storage requirements, see the Memory and Storage sections of the Deployment Checklist.
Compatible Hadoop Flavour: WD Fusion's installer confirms that a compatible Hadoop platform is installed. Currently, it takes the Cluster Manager detail provided on the Zone screen and polls the Hadoop Manager (CM or Ambari) for details. The installation can only continue if the Hadoop Manager is running a compatible version of Hadoop.
See the Deployment Checklist for Supported Versions of Hadoop.
HDFS service state: WD Fusion validates that the HDFS service is running. If it is unable to confirm the HDFS state, a warning is given that will tell you to check the UI logs for possible errors.
See the Logs section for more information.
HDFS service health: WD Fusion validates the overall health of the HDFS service. If the installer is unable to communicate with the HDFS service then you're told to check the WD Fusion UI logs for any clues. See the Logs section for more information.
HDFS maintenance mode: WD Fusion looks to see if HDFS is currently in maintenance mode. Both Hortonworks and Ambari support this mode for when you need to make changes to your Hadoop configuration or hardware; it suppresses alerts for a host, service, role or, if required, the entire cluster.
WD Fusion node running as a client: We validate that the WD Fusion server is configured as an HDFS client.

Fusion Client installation with RPMs

The WD Fusion installer doesn't currently handle the installation of the client to the rest of the nodes in the cluster. You need to go through the following procedure:

  1. In the Client Installation section of the installer you will see the link to the list of nodes here and the link to the client RPM package.

    RPM package location
    If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages

  2. If you are installing the RPMs, download and install the package on each of the nodes that appear on the list from step 1.
  3. Installing the client RPM is done in the usual way:
    rpm -i <package-name>

Fusion Client installation with DEB

Debian not supported
Although Ubuntu uses Debian's packaging system, currently Debian itself is not supported. Note: Hortonworks HDP does not support Debian.

If you are running an Ubuntu Linux distribution, you need to go through the following procedure to install the clients using Debian's DEB package format:

  1. In the Client Installation section of the installer you will see a link to the list of nodes and a link to the client DEB package.

    DEB package location
    If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages

  2. To install WANdisco Fusion client, download and install the package on each of the nodes that appear on the list from step 1.
  3. You can install it using
    sudo dpkg -i /path/to/deb/file
    followed by
    sudo apt-get install -f
    Alternatively, move the DEB file to /var/cache/apt/archives/ and then run
    apt-get install <fusion-client-filename.deb>
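As with the RPM clients, the DEB installation can be scripted across the node list; a minimal sketch, assuming passwordless SSH as root and an example nodes.txt file listing the Ubuntu client nodes:

  for host in $(cat nodes.txt); do
    scp <fusion-client-filename.deb> root@${host}:/tmp/
    ssh root@${host} "dpkg -i /tmp/<fusion-client-filename.deb> && apt-get install -f -y"
  done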

Fusion Client installation with Parcels

For deployments into Cloudera clusters, clients can be installed using Cloudera's own packaging format: Parcels.

Installing the parcel

  1. Open a terminal session to your Cloudera Manager server. Ensure that you have suitable permissions for handling files.
  2. Download the appropriate parcel and sha for your deployment.
    wget "http://fusion.example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel"
    wget "http://node01-example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel.sha"
  3. Change the ownership of the parcel and .sha files so that they match the system account under which Cloudera Manager runs:
    chown cloudera-scm:cloudera-scm FUSION-<version>-cdh5.<version>.parcel*
  4. Move the files into the server's local repository, i.e.
    mv FUSION-<version>-cdh5.<version>.parcel* /opt/cloudera/parcel-repo/
  5. Open Cloudera Manager and navigate to the Parcels screen.
    [Screenshot: New Parcels check.]

  6. The WD Fusion client package is now ready to distribute.
    [Screenshot: Ready to distribute.]

  7. Click the Distribute button to distribute the WANdisco Fusion parcel to the nodes in the cluster.
    [Screenshot: Distribute Parcels.]

  8. Click the Activate button to activate the WANdisco Fusion parcel.
    [Screenshot: Activate.]

  9. The configuration files need to be redeployed to ensure that the WD Fusion elements are put in place correctly. Check Cloudera Manager to see which processes need to be restarted for the parcel to be deployed; Cloudera Manager provides a visual cue about which processes require a restart.
    [Screenshot: Restarts.]
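As an alternative to clicking through the Parcels screen, Cloudera Manager's REST API can drive the same distribute and activate steps. The following is only a sketch: the API version, cluster name, credentials and parcel version shown are assumptions, so verify the endpoints against the API documentation for your Cloudera Manager release.

  CM=http://cm.example.host.com:7180/api/v10
  CLUSTER="Cluster%201"
  PARCEL_VERSION=<version>
  # Distribute the parcel (already placed in the local repository) to all hosts
  curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/parcels/products/FUSION/versions/$PARCEL_VERSION/commands/startDistribution"
  # Activate it once distribution has completed
  curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/parcels/products/FUSION/versions/$PARCEL_VERSION/commands/activate"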

Impala Parcel

Also provided in parcel format is the WANdisco-compatible version of Cloudera's Impala tool:

[Screenshot: Ready to distribute.]

Follow the same steps described for installing the WD Fusion client, downloading the parcel and SHA file, i.e.:

  1. Have a cluster with CDH installed via parcels, with Impala.
  2. Copy the FUSION_IMPALA parcel and SHA file into the local parcel repository on the same node as Cloudera Manager. By default this is located at /opt/cloudera/parcel-repo, but it is configurable: in Cloudera Manager, go to the Parcels Management Page -> Edit Settings to find the Local Parcel Repository Path.
    FUSION_IMPALA should then be available to distribute and activate on the Parcels Management Page; remember to click the Check for New Parcels button.
  3. Once installed, restart the cluster.
  4. Impala reads on Fusion files should now be available.

Fusion Client installation with HDP Stack

For deployments into a Hortonworks HDP/Ambari cluster (Ambari version 1.7 or later), clients can be installed using Hortonworks' own packaging format: the HDP Stack.

Ambari 1.6 and earlier
If you are deploying with Ambari 1.6 or earlier, don't use the provided Stacks; instead, use the generic RPMs.

Ambari 1.7
If you are deploying with Ambari 1.7, take note of the requirement to perform some necessary restarts on Ambari before completing an installation.

Ambari 2.0
When adding a stack to Ambari 2.0 (any stack, not just the WD Fusion client) there is a bug that causes the YARN parameter yarn.nodemanager.resource.memory-mb to reset to the default value for the YARN stack. This may result in the Java heap dropping from a manually-defined value back to a low default value (2 GB). Note that this issue is fixed from Ambari 2.1 onwards.

Upgrading Ambari
If you are running Ambari prior to 2.0.1 and perform an update of Ambari, we recommend that you remove and then reinstall the WD Fusion stack. Prior to version 2.0.1, an upgraded Ambari can refuse to restart the WD Fusion stack because the upgrade may wipe out the added services folder in the stack.

If you perform an Ambari upgrade and the Ambari server fails to restart, the workaround is to copy the WD Fusion service directory from the old stacks directory to the new one, so that it is picked up by the new version of Ambari, e.g.:

cp -R /var/lib/ambari-server/resources/stacks_25_08_15_21_06.old/HDP/2.2/services/FUSION /var/lib/ambari-server/resources/stacks/HDP/2.2/services
Again, this issue doesn't occur once Ambari 2.0.1 is installed.

Installing the WANdisco service into your HDP Stack

  1. Download the service from the installer client download panel, or after the installation is complete, from the client packages section on the Settings screen.
  2. The service is a gzipped tar file (e.g. fusion-hdp-2.2.0-2.4_SNAPSHOT.stack.tar.gz) that will expand to a folder called /FUSION.
  3. Place this folder in /var/lib/ambari-server/resources/stacks/HDP/<version-of-stack>/services (a command-line sketch follows this list).
  4. Restart the ambari-server
    service ambari-server restart
  5. After the server restarts, go to + Add Service.
    [Screenshot: Add Service.]

  6. Choose Services, then scroll to the bottom of the list.
    [Screenshot: Scroll to the bottom of the list.]

  7. Tick the WANdisco Fusion service checkbox. Click Next.
    [Screenshot: Tick the WANdisco Fusion service checkbox.]

  8. DataNodes and NodeManagers are automatically selected. Choose any additional nodes that you want to act as clients, then click Next.

    [Screenshot: Assign Slaves and Clients.]

  9. Deploy the changes.
    [Screenshot: Deploy.]

  10. Install, Start and Test.
    [Screenshot: Install, start and test.]

  11. Review Summary and click Complete.
    [Screenshot: Review.]


Installation of Services can remove Kerberos settings
During the installation of services through stacks, it is possible for Kerberos configuration to be lost. This has been seen to occur on Kerberized HDP 2.2 clusters when installing Kafka or Oozie: Kerberos configuration in the core-site.xml file was removed during the installation, which resulted in all HDFS/YARN instances being unable to restart. For more details, see the Ambari JIRA AMBARI-9022.


MapR Client Configuration

On MapR clusters, you need to copy WD Fusion configuration onto all other nodes in the cluster:

  1. Open a terminal to your WD Fusion node.
  2. Navigate to /opt/mapr/hadoop/<hadoop-version>/etc/hadoop.
  3. Copy the core-site.xml file to the same location on all other nodes in the cluster.
  4. The configuration will be picked up automatically; there is no need to restart the nodes. The entire cluster can now communicate with WD Fusion.
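The copy in step 3 can be scripted from the WD Fusion node; a minimal sketch, assuming passwordless SSH as root and an example nodes.txt file listing the other cluster nodes (substitute your actual Hadoop version directory for <hadoop-version>):

  CONF=/opt/mapr/hadoop/<hadoop-version>/etc/hadoop/core-site.xml
  for host in $(cat nodes.txt); do
    scp ${CONF} root@${host}:${CONF}
  done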

Removing WANdisco Service

If you are removing WD Fusion, perhaps as part of a reinstallation, you should remove the client packages as well. Ambari never deletes any services from the stack; it only disables them. If you remove the WD Fusion service from your stack, remember to also delete the fusion-client.repo file:

[WANdisco-fusion-client]
name=WANdisco Fusion Client repo
baseurl=file:///opt/wandisco/fusion/client/packages
gpgcheck=0

For instructions on cleaning up the Stack, see Host Cleanup for Ambari and Stack.
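A minimal removal sketch follows, assuming the FUSION service was added under an HDP 2.2 stack as described above and that the client repo file sits in the standard yum repository directory; both paths are assumptions to verify on your own systems:

  # On the Ambari server: remove the WD Fusion service definition, then restart Ambari
  rm -rf /var/lib/ambari-server/resources/stacks/HDP/2.2/services/FUSION
  service ambari-server restart
  # On each client node: remove the repo definition shown above (path assumed)
  rm -f /etc/yum.repos.d/fusion-client.repo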

Uninstall WD Fusion

There currently isn't an uninstall function for our installer, so the system will have to be cleaned up manually. The best way to remove WD Fusion (assuming it was installed with our unified installer) is:

  • Stop all Fusion processes on the Fusion Server. See Shutting down.
  • Remove the fusion packages on the Fusion Server:
    yum erase 'fusion*'
  • Remove the client packages on any other nodes in the cluster (see the sketch after this list).
  • Remove the folders created by our installation on the Fusion Server:
    rm -r /opt/wandisco /opt/fusion/ /etc/wandisco
  • Remove the extra configuration we added to the core-site.xml on the Manager Server.
  • Restart the managers after reverting the configuration changes.
  • (Optional) Remove the installer file on the Fusion Server.
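The client clean-up on the other nodes can be scripted in the same way as the installation; a minimal sketch, assuming passwordless SSH as root, an example nodes.txt list of client nodes and RPM-based clients (on Ubuntu nodes use apt-get or dpkg instead):

  for host in $(cat nodes.txt); do
    ssh root@${host} "yum erase -y 'fusion*'"
  done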