2. Installation Guide

This section runs through the installation of WD Fusion, from the initial steps where we make sure that your existing environment is compatible, through the procedure for installing the necessary components, and finally configuration.

Deployment Overview

We'll start with a quick overview of this installation guide so that you can see what's coming or quickly find the part that you want:

2.1 Deployment Checklist
Important hardware and software requirements, along with considerations that need to be made before starting to install WD Fusion.
2.2 Final Preparations
Things that you need to do immediately before you start the installation.
2.3 Running the installer
Step by step guide to the installation process when using the unified installer. For instructions on completing a fully manual installation see 2.7 Manual Installation.
2.4 Configuration
Runs through the changes you need to make to start WD Fusion working on your platform.
2.5 Deployment
Necessary steps for getting WD Fusion to work with supported Hadoop applications.
2.6 Appendix
Extras that you may need that we didn't want cluttering up the installation guide - Installation Troubleshooting, How to remove an existing WD Fusion installation.

From version 2.2, WD Fusion comes with an installer package
WD Fusion now has a unified installation package that installs all of WD Fusion's components (WD Fusion server, IHC servers and WD Fusion UI). The installer greatly simplifies installation as it handles all the components you need and does a lot of configuration in the background. However, if you need more control over the installation, you can use the orchestration script instead. See the Orchestrated Installation Guide.

Sample Orchestration mydefines.sh file.

2.1 Deployment Checklist

2.1.1 WD Fusion and IHC servers' requirements

This section describes hardware requirements for deploying Hadoop using WD Fusion. These are guidelines that provide a starting point for setting up data replication between your Hadoop clusters.

Glossary
We'll be using terms that relate to the Hadoop ecosystem, WD Fusion and WANdisco's DConE replication technology. If you encounter any unfamiliar terms, check out the Glossary.

WD Fusion Deployment Components

WD Fusion Deployment

Example WD Fusion Data Center Deployment.

WD Fusion Server
The core WD Fusion server. It uses HCFS (Hadoop Compatible File System) to permit the replication of HDFS data between data centers while maintaining strong consistency.
WD Fusion UI
A separate server that provides administrators with a browser-based management console for each WD Fusion server. This can be installed on the same machine as WD Fusion's server or on a different machine within your data center.
IHC Server
Inter-Hadoop Communication servers handle the traffic that runs between zones or data centers that use different versions of Hadoop. IHC servers are matched to the version of Hadoop running locally. It's possible to deploy different numbers of IHC servers at each data center; additional IHC servers can form part of a High Availability mechanism.

WD Fusion servers can be co-located with IHC servers
Provided that a server has sufficient resources, it is possible to co-locate your WD Fusion server with the IHC servers.

WD Fusion Client
Client jar files to be installed on each Hadoop client, such as mappers and reducers that are connected to the cluster. The client is designed to have a minimal memory footprint and impact on CPU utilization.

WD Fusion servers must not be co-located with HDFS servers (DataNodes, etc)
HDFS's default block placement policy dictates that if a client is co-located on a DataNode, then that co-located DataNode will receive one block of whatever file is being put into HDFS from that client. This means that if the WD Fusion Server (through which all transfers go) is co-located on a DataNode, then all incoming transfers will place one block onto that DataNode. In a transfer-heavy cluster that DataNode is likely to consume a lot of disk space, potentially forcing the WD Fusion Server to shut down in order to keep the Prevaylers from getting corrupted.

The following guidelines apply to both the WD Fusion server and to separate IHC servers. We recommend that you deploy on physical hardware rather than on a virtual platform; however, there is no reason why you can't deploy in a virtual environment.

If you plan to locate both the WD Fusion and IHC servers on the same machine then check the Collocated Server requirements:

CPUs: Small WD Fusion server deployment: 8 cores
Large WD Fusion server deployment: 16 cores
Architecture: 64-bit only.
System memory: There are no special memory requirements, except for the need to support a high throughput of data:
Type: Use ECC RAM
Size: Recommended: 64 GB (minimum of 16 GB)
Small WD Fusion server deployment: 32 GB
Large WD Fusion server deployment: 128 GB
System memory requirements are matched to the expected cluster size and should take into account the number of files and the block size. The more RAM you have, the bigger the supported file system, or the smaller the block size.

Collocation of WD Fusion/IHC servers
If you plan to install the WD Fusion server and your IHC servers on the same machine then you should look to increase the memory specification:
Recommended: 64 GB+
Minimum: 48 GB (16 GB for the WD Fusion server, plus 16 GB for each of at least 2 IHC servers.)

Storage space: Type: Hadoop operations are storage-heavy and disk-intensive so we strongly recommend that you use enterprise-class Solid State Drives (SSDs).
Size: Recommended: 1 TB
Minimum: You need at least 500 GB of disk space for a production environment.
Network Connectivity: Minimum 1Gb Ethernet between local nodes.
Small WANdisco Fusion server: 2Gbps
Large WANdisco Fusion server: 4x10 Gbps (cross-rack)
TCP Port Allocation: The following TCP ports are required for a deployment of WD Fusion:
DConE port: (default 8082)
IHC ports: (7000 range for command ports) (8000 range for HTTP)
HTTP interface: (default 50070) is re-used from the stand-alone Hadoop NameNode
Web UI interface: (default 8083)

2.1.2 Software requirements

Operating systems:
  • RHEL 6 x86_64
  • CentOS 6 x86_64
  • Ubuntu 12.04LTS and 14.04LTS
Web browsers:
  • Mozilla Firefox 11 and higher
  • Google Chrome
  • Safari 5 and higher
Java: Hadoop requires Java JRE 1.7. It is built and tested on Oracle's version of Java Runtime Environment.
We have now added support for OpenJDK 7, although we recommend running with Oracle's Java as it has undergone more testing.

Architecture: 64-bit only
Heap size: Set the Java heap size to a minimum of 1 GB, or the maximum available memory on your server.
Use a fixed heap size. Give -Xminf and -Xmaxf the same value, and make this as large as your server can support.
Avoid Java defaults. Ensure that garbage collection will run in an orderly manner: configure NewSize and MaxNewSize to use 1/10 to 1/5 of the maximum heap size for JVMs larger than 4 GB.
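For example, a WD Fusion server given a fixed 16 GB heap might use JVM options along the following lines (an illustrative sketch only; size the heap and young generation for your own cluster):

      # fixed heap, with the young generation at roughly 1/8 of the maximum heap
      -Xms16g -Xmx16g -XX:NewSize=2g -XX:MaxNewSize=2g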

Stay deterministic!
When deploying to a cluster, make sure you have exactly the same version of the Java environment on all nodes.

Where's Java?
Although WD Fusion only requires the Java Runtime Environment (JRE), Cloudera and Hortonworks may install the full Oracle JDK with the high strength encryption package included. This JCE package is a requirement for running Kerberized clusters.
For good measure, remove any JDK 6 that might be present in /usr/java. Make sure that /usr/java/default and /usr/java/latest point to the Java 7 version that your Hadoop manager installs.

Ensure that you set the JAVA_HOME environment variable for the root user on all nodes. Remember that, on some systems, invoking sudo strips environment variables, so you may need to add JAVA_HOME to sudo's list of preserved variables.
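For example, you might set JAVA_HOME in the root user's profile and preserve it for sudo with an env_keep entry (a sketch; adjust the Java path to match your installation):

      # in /root/.bashrc (or an equivalent profile script)
      export JAVA_HOME=/usr/java/default

      # in /etc/sudoers (edit with visudo)
      Defaults env_keep += "JAVA_HOME"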

File descriptor/Maximum number of processes limit: Maximum User Processes and Open Files limits are low by default on some systems. It is possible to check their values with the ulimit or limit command:
      ulimit -u && ulimit -n 
      

-u The maximum number of processes available to a single user.
-n The maximum number of open file descriptors.

For optimal performance, we recommend that both hard and soft limit values be set to 64000 or more:

RHEL6 and later: The file /etc/security/limits.d/90-nproc.conf explicitly overrides the settings in limits.conf, i.e.:

      # Default limit for number of user's processes to prevent
      # accidental fork bombs.
      # See rhbz #432903 for reasoning.
      * soft nproc 1024 <- Increase this limit or ulimit -u will be reset to 1024 
Ambari/Pivotal HD and Cloudera Manager will set various ulimit entries; you must ensure hard and soft limits are set to 64000 or higher. Check with the ulimit or limit command. If the limit is exceeded the JVM will throw an error: java.lang.OutOfMemoryError: unable to create new native thread.
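For example, on RHEL 6 you could raise both limits for all users with entries along the following lines (a sketch; adjust the file name and the scope to suit your environment, and log in again for the new limits to take effect):

      # /etc/security/limits.d/90-nproc.conf (or /etc/security/limits.conf)
      *    soft    nproc     64000
      *    hard    nproc     64000
      *    soft    nofile    64000
      *    hard    nofile    64000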
Additional requirements: passwordless ssh
If you plan to set up the cluster using the supplied WD Fusion orchestration script you must be able to establish secure shell connections without using a passphrase.

KB
Read our Knowledgebase article How to set up passwordless ssh.
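In brief, passwordless ssh is typically set up by generating a key pair without a passphrase and copying the public key to each node (a sketch; the user and node names are examples only):

      ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
      ssh-copy-id root@node01.example.com
      ssh root@node01.example.com    # should log in without prompting for a password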


Security Enhanced (SE) Linux
You need to disable Security-Enhanced Linux (SELinux) to ensure that it doesn't block activity that's necessary for completing the installation. Disable SELinux on all nodes, then reboot them:
sudo vi /etc/sysconfig/selinux
Set SELINUX to the following value:
SELINUX=disabled
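To apply the change to the running system before the reboot, and to confirm the current mode, you can also use the following (a sketch):

      sudo setenforce 0    # switch to permissive mode for the current session
      getenforce           # reports Enforcing, Permissive or Disabled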

iptables
Disable iptables.
$ sudo chkconfig iptables off
Reboot.
When the installation is complete you can re-enable iptables using the corresponding command:
sudo chkconfig iptables on


Comment out requiretty in /etc/sudoers
The installer's use of sudo won't work with some Linux distributions (for example CentOS), where /etc/sudoers enables requiretty, meaning sudo can only be invoked from a logged-in terminal session, not through cron or a bash script. When requiretty is enabled the installer will fail with an error:
execution refused with "sorry, you must have a tty to run sudo" message	
Ensure that requiretty is commented out:
# Defaults	   requiretty
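You can check for the setting and comment it out safely with visudo (a sketch):

      sudo grep requiretty /etc/sudoers
      sudo visudo    # comment out any "Defaults requiretty" line, as shown above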

2.1.3 Supported versions

This table shows the versions of Hadoop and Java that we currently support:

Distribution: Apache Hadoop 2.5.0
JRE: Oracle JDK 1.7_45 64-bit

Distribution: HDP 2.1 / 2.2 / 2.3
Console: Ambari 1.6.1 / 1.7 / 2.1 (support for EMC Isilon 7.2.0.1 and 7.2.0.2)
JRE: Oracle JDK 1.7_45 64-bit

Distribution: CDH 5.2.0 / 5.3.0 / 5.4
Console: Cloudera Manager 5.3.2 (support for EMC Isilon 7.2.0.1 and 7.2.0.2)
JRE: Oracle JDK 1.7_45 64-bit

Distribution: Pivotal HD 3.0
Console: Ambari 1.6.1 / 1.7
JRE: Oracle JDK 1.7_45 64-bit

2.2 Final Preparations

We'll now look at what you should know and do as you begin the installation.

Time requirements

The time required to complete a deployment of WD Fusion will partly depend on its size; larger deployments with more nodes and more complex replication rules will take correspondingly more time to set up. Use the guide below to help you plan your deployments.

  • Run through this document and create a checklist of your requirements. (1-2 hours).
  • Complete the WD Fusion server installations (20 minutes per node, or 1 hour for a test deployment).
  • Install WD Fusion UI (30 minutes).
  • Complete client installations and complete basic tests (1-2 hours).

Of course, this is a guideline to help you plan your deployment. You should think ahead and determine if there are additional steps or requirements introduced by your organization's specific needs.

Network requirements

See the deployment checklist for a list of the TCP ports that need to be open for WD Fusion.

Running WD Fusion on multi-homed servers

The following guide runs through what you need to do to correctly configure a WD Fusion deployment if the nodes are running with multiple network interfaces.

Overview

  1. A file is created in DC1. A client writes the data.
  2. Periodically, after the data is written, a proposal is sent by the WD Fusion server in DC1, telling the WD Fusion server in DC2 to pull the new file. This proposal includes the map of IHC server public IP addresses, in this case listening at <Public-IP>:7000 (the Fusion server in DC1 reads this from
    /etc/wandisco/fusion/server/ihcList).
  3. The Fusion server in DC2 receives this agreement, connects to <Public-IP>:7000 and pulls the data.

Procedure

  1. Stop all WD Fusion services.
  2. Reconfigure your IHCs to your preferred address in /etc/wandisco/ihc/*.ihc on each IHC node.
  3. For the WD Fusion servers, delete all files in /etc/wandisco/fusion/server/ihclist/*.
  4. Copy the zone1 IHCs' /etc/wandisco/ihc/*.ihc files to the zone1 Fusion server's /etc/wandisco/server/ihcList.
  5. Copy the zone2 IHCs' /etc/wandisco/ihc/*.ihc files to the zone2 Fusion server's /etc/wandisco/server/ihcList.
  6. Restart all services. (A shell sketch of steps 3-5 follows below.)
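A minimal shell sketch of steps 3-5, assuming the paths given above and example hostnames (ihc01-zone1, ihc01-zone2); adjust the paths and hostnames to match your installed version:

      # on the zone1 Fusion server
      rm -f /etc/wandisco/fusion/server/ihclist/*
      scp "root@ihc01-zone1:/etc/wandisco/ihc/*.ihc" /etc/wandisco/server/ihcList/

      # on the zone2 Fusion server
      rm -f /etc/wandisco/fusion/server/ihclist/*
      scp "root@ihc01-zone2:/etc/wandisco/ihc/*.ihc" /etc/wandisco/server/ihcList/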

Kerberos Security

If you are running Kerberos on your cluster you should consider the following requirements:

  • Kerberos is already installed and running on your cluster
  • Fusion-Server is configured for Kerberos as described in Setting up Kerberos.
  • We will be using the same keytab and principal we generated for fusion-server. Assume it's in /etc/hadoop/conf/fusion.keytab

Kerberos Configuration before starting the installation

Before running the installer on a platform that is secured by Kerberos, you'll need to run through the following steps: Setting up Kerberos.

Update WD Fusion UI configuration

Manual instructions

The following instructions apply to manual or orchestration script-based installation. If you install WD Fusion using the installer script, Kerberos settings are, from Version 2.5.2, handled in the installer.

Make the following changes to WD Fusion's UI element to enable it to interact with a Kerberized environment:

  1. Add core-site.xml and hdfs-site.xml path to the ui.properties configuration file:
    client.core.site=/etc/hadoop/conf/core-site.xml
    client.hdfs.site=/etc/hadoop/conf/hdfs-site.xml
  2. Enable kerberos in fusion-ui configuration (/opt/wandisco/fusion-ui-server/properties/ui.properties):
    kerberos.enabled=true
    kerberos.generated.config.path=/opt/wandisco/fusion-ui-server/properties/kerberos.cfg
    kerberos.keytab.path=/etc/hadoop/conf/fusion.keytab
    kerberos.principal=/${hostname}@${krb_realm}
  3. kerberos.enabled
    Is used to switch on Kerberos (with a =true) for the WD Fusion node.
    kerberos.generated.config.path
    The path to the Kerberos configuration, used to allow WD Fusion / IHC servers to communicate with a Kerberos-enabled cluster.
    kerberos.keytab.path
    Path to the Kerberos keytab file.
    kerberos.principal
    The Kerberos identity used by the hdfs superuser; the principal is provided in the form primary/instance@realm.
  4. Set up a proxy user on the NameNode, adding the following properties to core-site.xml on the NameNode(s).
  5. <property>
         <name>hadoop.proxyuser.$USERNAME.hosts</name>
         <value>*</value>
     </property>
     <property>
         <name>hadoop.proxyuser.$USERNAME.groups</name>
         <value>*</value>
     </property>
    
    hadoop.proxyuser.$USERNAME.hosts
    Defines the hosts from which the client can be impersonated. $USERNAME, the superuser who wants to act as a proxy for the other users, is usually set to the system user "hdfs". From version 2.6 these values are captured by the installer, which can apply them automatically.
    hadoop.proxyuser.$USERNAME.groups
    A list of groups whose users the superuser is allowed to proxy for. A wildcard (*) means that proxying of any user is allowed. For example, for the superuser to act as proxy for another user, the proxy action must be performed from one of the listed hosts, and the user must be included in the list of groups. Note that this can be a comma-separated list or the noted wildcard (*). After changing these properties, refresh the NameNode so that they take effect (see the sketch after this list).
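A minimal sketch of applying the proxy user change without a full restart, run as the hdfs superuser after updating core-site.xml (depending on your distribution, a NameNode restart through the Hadoop manager may be required instead):

      hdfs dfsadmin -refreshSuperUserGroupsConfiguration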

Clean Environment

Before you start the installation you must ensure that there are no existing WD Fusion installations or WD Fusion components installed on your selected machines. If you are about to upgrade to a new version of WD Fusion you must first make sure that you run through the removal instructions provided in the Appendix - Cleanup WD Fusion.

Installer File

You need to match WANdisco's WD Fusion installer file to each data center's version of Hadoop. Installing the wrong version of WD Fusion will result in the IHC servers being misconfigured.

License File

After completing an evaluation deployment, you will need to contact WANdisco about getting a license file for moving your deployment into production.

2.3 Running the installer

Below is the procedure for getting set up with the installer. Running the installer only takes a few minutes while you enter the necessary settings; however, if you wish to handle installations without a user having to manually enter the settings, you can use the Silent Installer.

Hands on installation

Listed below is the procedure that you should use for completing an installation using the installer file. This requires an administrator to enter details throughout the procedure. Alternatively, see Using the "Silent" Installer option to handle installation programmatically.

  1. Open a terminal session on your first installation server. Copy the WD Fusion installer script into a suitable directory.
  2. Make the script executable, e.g.
    chmod +x fusion-ui-server-<version>_rpm_installer.sh
    	
  3. Execute the file with root permissions, e.g.
    sudo ./fusion-ui-server-<version>_rpm_installer.sh
  4. The installer will now start.
    Verifying archive integrity... All good.
    Uncompressing WANdisco Fusion..............................
    
        ::   ::  ::     #     #   ##    ####  ######   #   #####   #####   #####
       :::: :::: :::    #     #  #  #  ##  ## #     #  #  #     # #     # #     #
      ::::::::::: :::   #  #  # #    # #    # #     #  #  #       #       #     #
     ::::::::::::: :::  # # # # #    # #    # #     #  #   #####  #       #     #
      ::::::::::: :::   # # # # #    # #    # #     #  #        # #       #     #
       :::: :::: :::    ##   ##  #  ## #    # #     #  #  #     # #     # #     #
        ::   ::  ::     #     #   ## # #    # ######   #   #####   #####   #####
    
    
    
    
    Welcome to the WANdisco Fusion installation
    
    
    
    You are about to install WANdisco Fusion version 2.4-206
    
    Do you want to continue with the installation? (Y/n) y	
    	
    The installer will perform an integrity check, confirm the product version that will be installed, then invite you to continue. Enter "Y" to continue the installation.
  5. The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the WD Fusion UI.
    Checking prerequisites:
    
    Checking for perl: OK
    Checking for java: OK
    
    INFO: Using the following Memory settings:
    
    INFO: -Xms128m -Xmx512m
    
    Do you want to use these settings for the installation? (Y/n) y
    
    The installer checks for Perl and Java. See the Installation Checklist for more information about these requirements. Enter "Y" to continue the installation.
  6. Next, confirm the port that will be used to access WD Fusion through a browser.
    Which port should the UI Server listen on? [8083]:
    
  7. Select the Hadoop version and type from the list of supported platforms:
    Please specify the appropriate backend from the list below:
    
    [0] cdh-5.2.0
    [1] cdh-5.3.0
    [2] cdh-5.4.0
    [3] hdp-2.1.0
    [4] hdp-2.2.0
    [5] hdp-2.3.0
    
    Which fusion backend do you wish to use? 3
    You chose hdp-2.2.0:2.6.0.2.2.0.0-2041

    MapR/Pivotal availability
    The MapR/PHD versions of Hadoop have been removed from the trial version of WD Fusion in order to reduce the size of the installer for most prospective customers. These versions are run by a small minority of customers, while their presence nearly doubled the size of the installer package. Contact WANdisco if you need to evaluate WD Fusion running with MapR or PHD.

    Additional available packages

    [1] mapr-4.0.1
    [2] mapr-4.0.2
    [3] mapr-4.1.0
    [4] phd-3.0.0

    MapR requirement
    If you install into a MapR cluster then you need to assign the MapR superuser system account/group "mapr" if you need to run WD Fusion using the fusion:/// URI.

    See the requirement for MapR Client Configuration.

  8. The installer now confirms which system user/group will be applied to WD Fusion.
    We strongly advise against running Fusion as the root user.
    For default HDFS setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
    
    Which user should Fusion run as? [hdfs]
    Checking 'hdfs' ...
    ... 'hdfs' found.
    
    Please choose an appropriate group for your system. By default HDP uses the 'hadoop' group.
    Which group should Fusion run as? [hadoop]
    Checking 'hadoop' ...
    ... 'hadoop' found.
    The installer does a search for the commonly used account and group, assigning these by default.
  9. Check the summary to confirm that your chosen settings are appropriate:
    Installing with the following settings:
    
    User and Group:                     hdfs:hadoop
    Hostname:                           node04-example.host.com
    Fusion Admin UI Listening on:       0.0.0.0:8083
    Fusion Admin UI Minimum Memory:     128
    Fusion Admin UI Maximum memory:     512
    Platform:                           hdp-2.3.0 (2.7.1.2.3.0.0-2557)
    Manager Type                        AMBARI
    Manager Host and Port:              :
    Fusion Server Hostname and Port:    node04-example.host.com:8082
    SSL Enabled:                        false
    
    Do you want to continue with the installation? (Y/n) y
    You are now given a summary of all the settings provided so far. If these settings are correct then enter "Y" to complete the installation of the WD Fusion server.
  10. The package will now install
    Installing hdp-2.1.0 packages:
      fusion-hdp-2.1.0-server-2.4_SNAPSHOT-1130.noarch.rpm ...
       Done
      fusion-hdp-2.1.0-ihc-server-2.4_SNAPSHOT-1130.noarch.rpm ...
       Done
    Installing fusion-ui-server package
    Starting fusion-ui-server:[  OK  ]
    Checking if the GUI is listening on port 8083: .....Done	
    	
  11. The WD Fusion server will now start up:
    Please visit http://<YOUR-SERVER-ADDRESS>.com:8083/ to access the WANdisco Fusion
    		
    		If 'http://<YOUR-SERVER-ADDRESS>.com' is internal or not available from your browser, replace this with an externally available address to access it. 
    		
    Installation Complete
    [root@node05 opt]#
    
    At this point the WD Fusion server and corresponding IHC server will be installed. The next step is to configure the WD Fusion UI. Open a web browser and point it at the provided URL. E.g
    http://<YOUR-SERVER-ADDRESS>.com:8083/
  12. In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
    Make your selection as follows:
    Adding a new WD Fusion cluster
    Select Add Zone.
    Adding additional WD Fusion servers to an existing WD Fusion cluster
    Select Add to an existing Zone.

    High Availability for WD Fusion / IHC Servers

    It's possible to enable High Availability in your WD Fusion cluster by adding additional WD Fusion/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient WD Fusion/IHC servers running to maintain replication.

    Add HA nodes to the cluster using the installer and choosing to Add to an existing Zone, using a new node name.

    Configuration for High Availability
    When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, then problems will occur. WD Fusion should never see the same URI (scheme + authority) for two different clusters.

    WD Fusion Deployment

    Welcome.

  13. Run through the installer's detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
    WD Fusion Deployment

    Environmental checks.

    On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of WD Fusion's prerequisites are missing.

    WD Fusion Deployment

    Example check results.

    Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should also address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.

  14. Upload the license file.
    WD Fusion Deployment

    Upload your license file.

  15. The conditions of your license agreement will be presented in the top panel, including License Type, Expiry Date, Name Node Limit and Data Node Limit.
    WD Fusion Deployment

    Verify license and agree to subscription agreement.

  16. Click I agree to the EULA to continue.
    WD Fusion Deployment

    Next step.

  17. Enter settings for the WD Fusion server.
    WD Fusion Deployment

    screen 4 - Server settings

    WD Fusion Server

    Fusion Server Max Memory (GB)
    Enter the maximum Java Heap value for the WD Fusion server.
    Umask (currently 022)
    Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
    Latitude
    The north-south coordinate angle for the installation's geographical location.
    Longitude
    The east-west coordinate angle for the installation's geographical location. The latitude and longitude are used to place the WD Fusion server on a global map to aid coordination in a far-flung cluster.

    IHC Server

    Maximum Java heap size (GB)
    Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
    Once all settings have been entered, click Next step.
  18. Next, you will enter the settings for your new Zone.
    WD Fusion Deployment

    New Zone

    Zone Information

    Entry fields for zone properties

    Fully Qualified Domain Name
    The full hostname for the server.
    Node ID
    A unique identifier that will be used by WD Fusion UI to identify the server.
    Location Name (optional)
    A location name that can quickly identify where the server is located.

    Known issue with Location names
    You must use different Location names /Node IDs for each zone. If you use the same name for multiple zones then you will not be able to complete the induction between those nodes.

    DConE Port
    TCP port used by WD Fusion for replicated traffic.
    Zone Name
    The name used to identify the zone in which the server operates.
    Management Endpoint
    Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager:

    Advanced Options

    Only apply these options if you fully understand what they do.
    The following advanced options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so we strongly recommend that you discuss their use with WANdisco's support team before enabling them.

    URI Selection

    The default behavior for WD Fusion is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI or can only run on HDFS instead of the more lenient HCFS. Each option is explained below:

    Use HDFS URI with HDFS file system
    URI Option A
    This option is available for deployments where the Hadoop applications support neither the WD Fusion URI nor the HCFS standards. WD Fusion operates entirely within HDFS.

    This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs:// or with no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
    Use WD Fusion URI with HCFS file system
    URI Option B
    This is the default option that applies if you don't enable the Advanced Options, and was the only option in WD Fusion prior to version 2.6. When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Fusion URI or are not written to the HCFS specification, this option will not work.
    Use Fusion URI with HDFS file system
    URI option C
    This differs from the default in that while the WD Fusion URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the WD Fusion URI but not the Hadoop Compatible File System.
    Use Fusion URI and HDFS URI with HDFS file system
    URI Option D
    This "mixed mode" supports all the replication schemes (fusion://, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.

    <Hadoop Management Layer> Configuration

    This section configures WD Fusion to interact with the management layer, which could be Ambari or Cloudera Manager, etc.

    Manager Host Name /IP
    The full hostname or IP address for the working server that hosts the Hadoop manager.
    Port
    TCP port on which the Hadoop manager is running.
    Username
    The username of the account that runs the Hadoop manager.
    Password
    The password that corresponds with the above username.
    SSL
    (Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't update from the default http port.

    Kerberos Configuration

    In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.

    WD Fusion Kerberos
    Config file path
    Path to the Kerberos configuration file, e.g. krb5.conf.
    Keytab file path
    Path to the generated keytab, e.g. /etc/krb5.keytab
    Principal
    Principal for the keytab file, e.g. HDFS@<REALM>
    Enable Kerberos authentication for WD Fusion endpoints
    This checkbox tells WD Fusion whether or not to Kerberize all WD Fusion communication. When unchecked, WD Fusion's application traffic is secured, but not through Kerberos. When ticked, Kerberos authentication is introduced across the /fusion/* REST API paths, meaning that, if enabled, all users will require Kerberos credentials in order to access the Web UI.

    Enabling Kerberos authentication on WD Fusion's REST API
    When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the "--negotiate" and "-u:" options, like so:

    curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"
    Keytab file path
    Path to the generated keytab for the HTTP principal, which is used when WD Fusion is Kerberized, while the settings that have already been provided are for when the Hadoop cluster itself is Kerberized.
    Principal
    This is specifically the principal for the HTTP user. This should be HTTP/<hostname>@<REALM>.

    See Setting up Kerberos for more information about Kerberos setup.

  19. Click Validate to confirm that your settings are valid. Once validated, click Next step.
    WD Fusion Deployment

    Zone information.

  20. The remaining panels in Step 6 detail all of the installation settings. All your license, WD Fusion server, IHC server and zone settings are shown. If you spot anything that needs to be changed you can click Go back.
    WD Fusion Deployment

    Summary

    Once you are happy with the settings and all your WD Fusion clients are installed, click Deploy Fusion Server.
  21. WD Fusion Client Installation

  22. In the next step you must complete the installation of the WD Fusion client package on all the existing HDFS client machines in the cluster. The WD Fusion client is required to support WD Fusion's data replication across the Hadoop ecosystem.
    WD Fusion Deployment

    Client installations.

    The installer supports three different packaging systems for installing clients: regular RPMs, Parcels for Cloudera, and HDP Stack for Hortonworks/Ambari.

    Installing into MapR
    If you are installing into a MapR cluster, use the default RPMs, detailed below: Fusion client installation with RPMs.

    RPM / DEB Packages

    client nodes
    By client nodes we mean any machine that is interacting with HDFS that you need to form part of WD Fusion's replicated system. If a node is not going to form part of the replicated system then it won't need the WD Fusion client installed. If you are hosting the WD Fusion UI package on a dedicated server, you don't need to install the WD Fusion client on it as the client is built into the WD Fusion UI package. Note that in this case the WD Fusion UI server would not be included in the list of participating client nodes.

    WD Fusion Deployment

    Example clients list

    For more information about doing a manual installation, see Fusion Client installation for regular RPMs.
    To install with the Cloudera parcel file, see: Fusion Client installation with Parcels.
    For Hortonwork's own proprietary packaging format: Fusion Client installation with HDP Stack.

  23. The next step starts WD Fusion up for the first time. You may receive a warning message if your clients have not yet been installed. You can now address any client installations, then click Revalidate Client Install to make the warning go away. If everything is setup correctly you can click Start WD Fusion.
    WD Fusion Deployment

    Skip or start.

  24. If you are installing onto a platform that is running Ambari (HDP or Pivotal HD), once the clients are installed you should login to Ambari and restart any services that are flagged as waiting for a restart. This will apply to MapReduce and YARN, in particular.
    Restart HDFS

    restart to refresh config

    If you are running Ambari 1.7, you'll be prompted to confirm this is done.
    WD Fusion Deployment

    Confirm that you have completed the restarts

    Important! If you are installing on Ambari 1.7
    Additionally, due to a bug in Ambari 1.7, before you can continue you must log into Ambari and complete a restart of HDFS, in order to re-apply WD Fusion's client configuration.

  25. First WD Fusion node installation

    When installing WD Fusion for the first time, this step is skipped. Click Skip Induction.

    Second and subsequent WD Fusion node installations into an existing zone

    For the second and all subsequent WD Fusion nodes entered into a new or existing zone, you must complete the induction step. Enter the fully qualified domain name for the existing node, along with the WD Fusion server port (8082 by default). Click Start Induction.

    Known issue with Location names
    You must use different Location names /IDs for each zone. If you use the same name for multiple zones then you will not be able to complete the induction between those nodes.


    WD Fusion Deployment

    Induction.

  26. Once the installation is complete you will get access to the WD Fusion UI after you log in using your Hadoop manager username and password.
    WD Fusion Deployment

    WD Fusion UI

2.4 Configuration

Once WD Fusion has been installed on all data centers you can proceed with setting up replication on your HDFS file system. You should plan your requirements ahead of the installation, matching up your replication with your cluster to maximise performance and resilience. The next section will take a brief look at an example configuration and run through the necessary steps for setting up data replication between two data centers.

Replication Overview

Example Deployment

Example WD Fusion deployment in a 3 data center deployment.

In this example, each one of three data centers ingests data from its own datasets: "Weblogs", "phone support" and "Twitter feed". An administrator can choose to replicate any or all of these datasets so that the data is replicated across any of the data centers, where it will be available for compute activities by the whole cluster. The only change required to your Hadoop applications will be the addition of a replication-specific URI. You can read more about adapting your Hadoop applications for replication.

Setting up Replication

The following steps are used to start replicating hdfs data. The detail of each step will depend on your cluster setup and your specific replication requirements, although the basic steps remain the same.

  1. Create a membership including all the data centers that will share a particular directory. See Create Membership
  2. Create and configure a Replicated Folder. See Replicated Folders
  3. Perform a consistency check on your replicated folder. See Consistency Check
  4. Configure your Hadoop applications to use WANdisco's protocol. See Configure Hadoop for WANdisco replication
  5. Run Tests to validate that your replicated folder remains consistent while data is being written to each data center. See Testing replication
Deployment with a small number of datanodes
You should consider setting the following configuration if you are planning to run with a small number of datanodes (fewer than 3). This is especially important in cases where a single datanode may be deployed:
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.best-effort
Default is "false". Running with the default setting, the client will keep trying until the set HDFS replication policy is satisfied. When set to "true", even if the specified policy requirements can't be met (e.g., there's only one DataNode that succeeds in the pipeline, which is less than the policy requirement), the client will still be allowed to continue to write.
Explanation
Fusion uses hflush and append to efficiently replicate files while they are still being written at the source cluster. As such, it's possible to see these errors on the destination cluster. This is because appends have stricter requirements around creating the desired number of block replicas (the HDFS default is 3) before allowing the write to be marked as complete. As per this Hortonworks article, a resolution may be to set dfs.client.block.write.replace-datanode-on-failure.best-effort to true, allowing the append to continue despite the inability to create the 3 block replicas. Note that this is not a recommended setting for clusters with more than 3 datanodes, as it may result in under-replicated blocks. In this case the root cause of the errors should be identified and addressed - potentially a disk space issue could mean there are not sufficient datanodes with enough space to create the 3 replicas, resulting in the same symptoms.

Installing on a Kerberized cluster

Currently the Installer doesn't work on a platform that is secured by Kerberos. If you run the installer on a platform that is running Kerberos then the WD Fusion and IHC servers will fail to start at the end of the installation. You can overcome this issue by completing the following procedure before you install WD Fusion: Setting up Kerberos.

2.5 Deployment

The deployment section covers the final step in setting up a WD Fusion cluster, where supported Hadoop applications are plugged into WD Fusion's synchronized distributed namespace. It isn't possible to cover all the requirements of all the third-party software mentioned here, so we strongly recommend that you get hold of the corresponding documentation for each Hadoop application before you work through these procedures.

2.5.1 Hive

This guide integrates WD Fusion with Apache Hive. It aims to accomplish the following goals:

  • Replicate Hive table storage.
  • Use fusion URIs as store paths.
  • Use fusion URIs as load paths.
  • Share the Hive metastore between two clusters.

Prerequisites

  • Knowledge of Hive architecture.
  • Ability to modify Hadoop site configuration.
  • WD Fusion installed and operating.

Replicating Hive Storage via fusion:///

The following requirements come into play if you are deploying WD Fusion using its native fusion:/// URI. In order to store a Hive table in WD Fusion you specify a WD Fusion URI when creating the table. For example, consider creating a table called log that will be stored in a replicated directory.

CREATE TABLE log(requestline string) stored as textfile location 'fusion:///repl1/hive/log';

Note: Replicating table storage without sharing the Hive metadata will create a logical discrepancy in the Hive catalog. For example, consider a case where a table is defined on one cluster and replicated on the HCFS to another cluster. A Hive user on the other cluster would need to define the table locally in order to make use of it.
When running Hive with Cloudera, the Hive metastore canary test currently reports having "Bad health". FUS-1140

Exceptions

Hive from CDH 5.3/5.4 does not work with WD Fusion, as a result of HIVE-9991. The issue will be addressed once this fix for Hive is released. This requires that you modify the default Hive file system setting when using CDH 5.3 and 5.4. In Cloudera Manager, add the following property to hive-site.xml:

<property>
    <name>fs.defaultFS</name>
    <value>fusion:///</value>
</property>

This property should be added in 3 areas:

  • Service Wide
  • GateWay Group
  • Hiveserver2 group

Replicated directories as store paths

It's possible to configure Hive to use WD Fusion URIs as output paths for storing data. To do this you must specify a fusion URI when writing data back to the underlying Hadoop-compatible file system (HCFS). For example, consider writing data out from a table called log to a file stored in a replicated directory:

INSERT OVERWRITE DIRECTORY 'fusion:///repl1/hive-out.csv' SELECT * FROM log;

Exceptions

HDP 2.2
When running MapReduce jobs on HDP 2.2, you need to append the following entry to mapreduce.application.classpath in mapred-site.xml:

/usr/hdp/<hdp version>/hadoop-hdfs/lib/*

Replicated directories as load paths

In this section we'll describe how to configure Hive to use fusion URIs as input paths for loading data.

It is not common to load data into a Hive table from a file using the fusion URI. When loading data into Hive from files the core-site.xml setting fs.default.name must also be set to fusion, which may not be desirable. It is much more common to load data from a local file using the LOCAL keyword:

LOAD DATA LOCAL INPATH '/tmp/log.csv' INTO TABLE log;
If you do wish to use a fusion URI as a load path, you must change the fs.defaultFS setting to use WD Fusion, as noted in a previous section. Then you may run:
LOAD DATA INPATH 'fusion:///repl1/log.csv' INTO TABLE log;

Sharing the Hive metastore

Advanced configuration - please contact WANdisco before attempting
In this section we'll describe how to share the Hive metastore between two clusters. Since WANdisco Fusion can replicate the file system that contains the Hive data storage, sharing the metadata presents a single logical view of Hive to users on both clusters.

When sharing the Hive metastore, note that Hive users on all clusters will know about all tables. If a table is not actually replicated, Hive users on other clusters will experience errors if they try to access that table.

There are two options available.

Hive metastore available read-only on other clusters

In this configuration, the Hive metastore is configured normally on one cluster. On other clusters, the metastore process points to a read-only copy of the metastore database. MySQL can be used in master-slave replication mode to provide the metastore.

Hive metastore writable on all clusters

In this configuration, the Hive metastore is writable on all clusters.

  • Configure the Hive metastore to support high availability.
  • Place the standby Hive metastore in the second data center.
  • Configure both Hive services to use the active Hive metastore.
Performance over WAN
Performance of Hive metastore updates may suffer if the writes are routed over the WAN.

Hive metastore replication

There are three strategies for replicating Hive metastore data with WD Fusion:

Standard

For Cloudera CDH: See Hive Metastore High Availability.

For Hortonworks/Ambari: High Availability for Hive Metastore.

Manual Replication

In order to manually replicate metastore data, ensure that the DDLs are applied on both clusters, and perform a partition rescan.
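A minimal sketch of the partition rescan step, using the log table from the earlier example and the Hive CLI (it assumes the table DDL has already been created on the second cluster):

      hive -e "MSCK REPAIR TABLE log;"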

Hive specific configuration for WD Fusion with fusion:/// URI

Required configuration for running WD Fusion on a Hive-equipped cluster.

  1. The recommended way to set up Hive with Fusion is to set the fs.defaultFS property in hive-site.xml to point to the fusion:/// URI, while keeping the scratch directories pointing into local HDFS, as described below. In this setup all tables will be created in Hive with the WD Fusion URI by default; however, replication of particular tables/databases could and should then be configured through the WD Fusion UI.
    <property>
        <name>fs.defaultFS</name>
        <value>fusion:///</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://dc1-cdh54-Cluster/tmp/hive-${user.name}</value>
    </property>
    <property>
        <name>hive.exec.local.scratchdir</name>
        <value>file:///tmp/${user.name}</value>
    </property>
    

2.5.2 Impala

Prerequisites

  • Knowledge of Impala architecture.
  • Ability to modify Hadoop site configuration.
  • WD Fusion installed and operating.

Query a table stored in a replicated directory

Support from WD Fusion v2.3 - v2.5
Impala does not allow the use of non-HDFS file system URIs for table storage. To work around this, WANdisco Fusion 2.3 comes with a client program (see Impala Parcel) that supports reading data from a table stored in a replicated directory. From WD Fusion 2.6, it becomes possible to replicate directly over HDFS using the hdfs:/// URI.

2.5.3 Oracle: Big Data Appliance

Each node in an Oracle:BDA deployment has multiple network interfaces, with at least one used for intra-rack communications and one used for external communications. WD Fusion requires external communications so configuration using the public IP address is required instead of using host names.

Prerequisites

  • Knowledge of Oracle:BDA architecture and configuration.
  • Ability to modify Hadoop site configuration.

Required steps

Operating in a multi-homed environment

Oracle:BDA is built on top of Cloudera's Hadoop and requires some extra steps to support a multi-homed network environment.

Procedure

  1. Complete a standard installation, following the steps provided in the Installation Guide. Retrieve and use the public interface IP addresses for the nodes that will host the WD Fusion and IHC servers.
  2. Once the installation is completed you need to set up WD Fusion for a multi-homed environment. First, edit WD Fusion's properties file (/opt/fusion-server/application.properties). Create a backup of the file, then add the following line at the end:
    communication.hostname=0.0.0.0
    Resave the file
  3. Next we need to update the IHC servers so that they will also use the public IP addresses rather than hostnames. The specific number and names of the configuration files that you need to update will depend on the details of your installation. If you run both WD Fusion server and IHCs on the same server you can get a view of the files with the following command:
    tree /etc/wandisco
    WD Fusion tree

    View of the WD Fusion configuration files.

  4. Edit each of the revealed config files. In the above example there are two instances of 2.5.0-cdh5.3.0.ihc that will need to be edited:
    #Fusion Server Properties
    #Wed Jun 03 10:14:41 BST 2015
    ihc.server=node01.obda.domain.com\:7000
    http.server=node01.obda.domain.com\:9001
    
    In each case you should change the addresses so that they use the public IP addresses instead of the hostnames.
  5. Open a terminal session on the node hosting the WD Fusion UI. Edit the properties file /opt/wandisco/fusion-ui-server/properties/ui.properties Add the property to allow the UI to listen on all interfaces, i.e.
    ui.hostname=0.0.0.0
    This should now ensure that the multi-homed deployment will work with WD Fusion.

Troubleshooting

If you suspect that the multi-homed environment is causing difficulty, verify that you can communicate to the IHC server(s) from other data centers. For example, from a machine in another data center, run:

nc <IHC server IP address> <IHC server port>
If you see errors from that command, you must fix the network configuration.

2.5.4 EMC Isilon

Prerequisites

  • Knowledge of EMC Isilon administration.
  • Ability to modify Hadoop site configuration.

HDP on Isilon

Follow these steps to install WANdisco Fusion on a Hortonworks (HDP) cluster on Isilon storage.

  • Complete a standard installation, following the steps provided in the Installation Guide.
  • Copy /opt/fusion-server/core-site.xml from the WANdisco Fusion server to /opt/fusion/ihc-server/<package-version>/ on the IHC server(s), for example using scp as shown below.
  • Restart IHC services.
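A minimal sketch of the copy step above, run from the WANdisco Fusion server with an example IHC hostname (substitute your own hostname and package version):

      scp /opt/fusion-server/core-site.xml root@ihc01.example.com:/opt/fusion/ihc-server/<package-version>/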

2.5.5 Apache Tez

Apache Tez is a YARN application framework that supports high performance data processing through DAGs. When set up, Tez uses its own tez.tar.gz containing the dependencies and libraries that it needs to run DAGs. For a DAG to access WD Fusion's fusion:/// URI it needs our client jars:

Configure the tez.lib.uris property with the path to the WD Fusion client jar files.

...
<property>
  <name>tez.lib.uris</name>
  <!-- Location of the Tez jars and their dependencies.
       Tez applications download required jar files from this location,
       so it should be publicly accessible. -->
  <value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
</property>
...

Running Hortonworks Data Platform, the tez.lib.uris parameter defaults to /hdp/apps/${hdp.version}/tez/tez.tar.gz. So, to add the fusion libraries, there are two choices:

Option 1: Delete the above value, and instead provide a list that includes the path where the above tarball unpacks to, plus the path where the fusion libraries are.
or
Option 2: Unpack the above tarball, repack it with the WD Fusion libraries and re-upload it to HDFS (a sketch of this option follows below).

Note that both changes are vulnerable to a platform (HDP) upgrade.
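A minimal sketch of option 2, with HDP_VERSION standing in for your HDP version and a hypothetical location for the WD Fusion client jars (use wherever the client package installed them on your nodes):

      # fetch the stock Tez tarball from HDFS and unpack it
      hdfs dfs -get /hdp/apps/${HDP_VERSION}/tez/tez.tar.gz .
      mkdir tez-repack && tar -xzf tez.tar.gz -C tez-repack
      # add the WD Fusion client jars (example path - adjust to your installation)
      cp /opt/wandisco/fusion/client/lib/*.jar tez-repack/lib/
      # repack and re-upload, overwriting the original
      tar -czf tez.tar.gz -C tez-repack .
      hdfs dfs -put -f tez.tar.gz /hdp/apps/${HDP_VERSION}/tez/tez.tar.gz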

2.5.6 Apache Ranger

Apache Ranger is another centralized security console for Hadoop clusters, and the preferred solution for Hortonworks HDP (whereas Cloudera prefers Apache Sentry). While Apache Sentry stores its policy file in HDFS, Ranger uses its own local MySQL database, which introduces concerns over non-replicated security policies. Ranger also applies its policies to the ecosystem via Java plugins into the ecosystem components - the NameNode, HiveServer, etc. In testing, the WD Fusion client has not experienced any problems communicating with Ranger-enabled platforms.

Ensure that the Hadoop system user, typically hdfs, has permission to impersonate other users.

...
<property>
  <name>hadoop.proxyuser.hdfs.users</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfs.groups</name>
  <value>*</value>
</property>
...

2.6 Appendix

The appendix section contains extra help and procedures that may be required when running through a WD Fusion deployment.

Environmental Checks

During the installation, your system's environment is checked to ensure that it will support WANdisco Fusion. The environment checks are intended to catch basic compatibility issues, especially those that may appear during an early evaluation phase. The checks are not intended to replace carefully running through the Deployment Checklist.

Operating System: The WD Fusion installer verifies that you are installing onto a system that is running on a compatible operating system.
See the Operating system section of the Deployment Checklist; the currently supported distributions of Linux are listed here:
Supported Operating Systems
  • RHEL 6 x86_64
  • CentOS 6 x86_64
  • Ubuntu 12.04LTS and 14.04LTS
Architecture:
  • 64-bit only
Java: The WD Fusion installer verifies that the necessary Java components are installed on the system. The installer checks:

  • Env variables: JRE_HOME, JAVA_HOME and runs the which java command.
  • Version: 1.7 recommended. Must be at least 1.7. You can run with later versions, but they are not recommended as we perform all our testing with 1.7.
  • Architecture: JVM must be 64-bit.
  • Distribution: Must be from Oracle. See Oracle's Java Download page.
For more information about Java requirements, see the Java section of the Deployment Checklist.
ulimit: The WD Fusion installer verifies that the system's maximum user processes and maximum open files are set to 64000.
For more information about these settings, see the File descriptor/Maximum number of processes limit section of the Deployment Checklist.

System memory and storage: WD Fusion's requirements for system resources are split between its component parts, namely the WD Fusion server, the Inter-Hadoop Communication (IHC) servers and the WD Fusion UI, all of which can, in principle, be either co-located on the same machine or hosted separately.
The installer will warn you if the system on which you are currently installing WD Fusion falls below the requirements. For more details about the RAM and storage requirements, see the Memory and Storage sections of the Deployment Checklist.
Compatible Hadoop Flavour: WD Fusion's installer confirms that a compatible Hadoop platform is installed. Currently, it takes the Cluster Manager detail provided on the Zone screen and polls the Hadoop Manager (CM or Ambari) for details. The installation can only continue if the Hadoop Manager is running a compatible version of Hadoop.
See the Deployment Checklist for Supported Versions of Hadoop.
HDFS service state: WD Fusion validates that the HDFS service is running. If it is unable to confirm the HDFS state, a warning is given that will tell you to check the UI logs for possible errors.
See the Logs section for more information.
HDFS service health: WD Fusion validates the overall health of the HDFS service. If the installer is unable to communicate with the HDFS service then you're told to check the WD Fusion UI logs for any clues. See the Logs section for more information.
HDFS maintenance mode: WD Fusion looks to see if HDFS is currently in maintenance mode. Both Hortonworks and Ambari support this mode for when you need to make changes to your Hadoop configuration or hardware; it suppresses alerts for a host, service, role or, if required, the entire cluster.
WD Fusion node running as a client: We validate that the WD Fusion server is configured as an HDFS client.

Fusion Client installation with RPMs

The WD Fusion installer doesn't currently handle the installation of the client to the rest of the nodes in the cluster. You need to go through the following procedure:

  1. In the Client Installation section of the installer you will see the link to the list of nodes here and the link to the client RPM package.

    RPM package location
    If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages

  2. If you are installing the RPMs, download and install the package on each of the nodes that appear on the list from step 1.
  3. Installing the client RPM is done in the usual way:
    rpm -i <package-name>

Fusion Client installation with DEB

Debian not supported
Although Ubuntu uses Debian's packaging system, currently Debian itself is not supported. Note: Hortonworks HDP does not support Debian.

If you are running an Ubuntu Linux distribution, you need to go through the following procedure to install the clients using Debian's DEB package format:

  1. In the Client Installation section of the installer you will see a link to the list of nodes and a link to the client DEB package.

    DEB package location
    If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:

    /opt/wandisco/fusion-ui-server/ui/client_packages

  2. To install WANdisco Fusion client, download and install the package on each of the nodes that appear on the list from step 1.
  3. You can install it using
    sudo dpkg -i /path/to/deb/file
    followed by
    sudo apt-get install -f
    Alternatively, move the DEB file to /var/cache/apt/archives/ and then run
    apt-get install <fusion-client-filename.deb>
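As with the RPM clients, the DEB installation can be scripted across the node list; a minimal sketch, assuming passwordless SSH as root and an example nodes.txt file listing the Ubuntu client nodes:

  for host in $(cat nodes.txt); do
    scp <fusion-client-filename.deb> root@${host}:/tmp/
    ssh root@${host} "dpkg -i /tmp/<fusion-client-filename.deb> && apt-get install -f -y"
  done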

Fusion Client installation with Parcels

For deployments into Cloudera clusters, clients can be installed using Cloudera's own packaging format: Parcels.

Installing the parcel

  1. Open a terminal session to your Cloudera Manager server. Ensure that you have suitable permissions for handling files.
  2. Download the appropriate parcel and sha for your deployment.
    wget "http://fusion.example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel"
    wget "http://node01-example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel.sha"
  3. Change the ownership of the parcel and .sha files so that they match the system account under which Cloudera Manager runs:
    chown cloudera-scm:cloudera-scm FUSION-<version>-cdh5.<version>.parcel*
  4. Move the files into the server's local repository, i.e.
    mv FUSION-<version>-cdh5.<version>.parcel* /opt/cloudera/parcel-repo/
  5. Open Cloudera Manager and navigate to the Parcels screen.
    [Screenshot: New Parcels check.]

  6. The WD Fusion client package is now ready to distribute.
    [Screenshot: Ready to distribute.]

  7. Click the Distribute button to distribute the WANdisco Fusion parcel to the nodes in the cluster.
    [Screenshot: Distribute Parcels.]

  8. Click the Activate button to activate the WANdisco Fusion parcel.
    [Screenshot: Activate.]

  9. The configuration files need to be redeployed to ensure that the WD Fusion elements are put in place correctly. Check Cloudera Manager to see which processes need to be restarted for the parcel to be deployed; Cloudera Manager provides a visual cue about which processes require a restart.
    [Screenshot: Restarts.]
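As an alternative to clicking through the Parcels screen, Cloudera Manager's REST API can drive the same distribute and activate steps. The following is only a sketch: the API version, cluster name, credentials and parcel version shown are assumptions, so verify the endpoints against the API documentation for your Cloudera Manager release.

  CM=http://cm.example.host.com:7180/api/v10
  CLUSTER="Cluster%201"
  PARCEL_VERSION=<version>
  # Distribute the parcel (already placed in the local repository) to all hosts
  curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/parcels/products/FUSION/versions/$PARCEL_VERSION/commands/startDistribution"
  # Activate it once distribution has completed
  curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/parcels/products/FUSION/versions/$PARCEL_VERSION/commands/activate"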

Impala Parcel

Also provided in parcel format is the WANdisco-compatible version of Cloudera's Impala tool:

[Screenshot: Ready to distribute.]

Follow the same steps described for installing the WD Fusion client, downloading the parcel and SHA file, i.e.:

  1. Have a cluster with CDH installed via parcels, with Impala.
  2. Copy the FUSION_IMPALA parcel and SHA file into the local parcel repository on the same node as Cloudera Manager. By default this is located at /opt/cloudera/parcel-repo, but it is configurable: in Cloudera Manager, go to the Parcels Management Page -> Edit Settings to find the Local Parcel Repository Path.
    FUSION_IMPALA should then be available to distribute and activate on the Parcels Management Page; remember to click the Check for New Parcels button.
  3. Once installed, restart the cluster.
  4. Impala reads on Fusion files should now be available.

Fusion Client installation with HDP Stack

For deployments into a Hortonworks HDP/Ambari cluster (Ambari version 1.7 or later), clients can be installed using Hortonworks' own packaging format: the HDP Stack.

Ambari 1.6 and earlier
If you are deploying with Ambari 1.6 or earlier, don't use the provided Stacks; instead, use the generic RPMs.

Ambari 1.7
If you are deploying with Ambari 1.7, take note of the requirement to perform some necessary restarts on Ambari before completing an installation.

Ambari 2.0
When adding a stack to Ambari 2.0 (any stack, not just the WD Fusion client) there is a bug that causes the YARN parameter yarn.nodemanager.resource.memory-mb to reset to the default value for the YARN stack. This may result in the Java heap dropping from a manually-defined value back to a low default value (2 GB). Note that this issue is fixed from Ambari 2.1 onwards.

Upgrading Ambari
If you are running Ambari prior to 2.0.1 and perform an update of Ambari, we recommend that you remove and then reinstall the WD Fusion stack. Prior to version 2.0.1, an upgraded Ambari can refuse to restart the WD Fusion stack because the upgrade may wipe out the added services folder in the stack.

If you perform an Ambari upgrade and the Ambari server fails to restart, the workaround is to copy the WD Fusion service directory from the old stacks directory to the new one, so that it is picked up by the new version of Ambari, e.g.:

cp -R /var/lib/ambari-server/resources/stacks_25_08_15_21_06.old/HDP/2.2/services/FUSION /var/lib/ambari-server/resources/stacks/HDP/2.2/services
Again, this issue doesn't occur once Ambari 2.0.1 is installed.

Installing the WANdisco service into your HDP Stack

  1. Download the service from the installer client download panel, or after the installation is complete, from the client packages section on the Settings screen.
  2. The service is a gzipped tar file (e.g. fusion-hdp-2.2.0-2.4_SNAPSHOT.stack.tar.gz) that will expand to a folder called /FUSION.
  3. Place this folder in /var/lib/ambari-server/resources/stacks/HDP/<version-of-stack>/services (a command-line sketch follows this list).
  4. Restart the ambari-server
    service ambari-server restart
  5. After the server restarts, go to + Add Service.
    [Screenshot: Add Service.]

  6. Choose Services, then scroll to the bottom of the list.
    [Screenshot: Scroll to the bottom of the list.]

  7. Tick the WANdisco Fusion service checkbox. Click Next.
    [Screenshot: Tick the WANdisco Fusion service checkbox.]

  8. DataNodes and NodeManagers are automatically selected. Choose any additional nodes that you want to act as clients, then click Next.

    [Screenshot: Assign Slaves and Clients.]

  9. Deploy the changes.
    [Screenshot: Deploy.]

  10. Install, Start and Test.
    [Screenshot: Install, start and test.]

  11. Review Summary and click Complete.
    [Screenshot: Review.]


Installation of Services can remove Kerberos settings
During the installation of services through stacks, it is possible for Kerberos configuration to be lost. This has been seen to occur on Kerberized HDP 2.2 clusters when installing Kafka or Oozie: Kerberos configuration in the core-site.xml file was removed during the installation, which resulted in all HDFS/YARN instances being unable to restart. For more details, see the Ambari JIRA AMBARI-9022.


MapR Client Configuration

On MapR clusters, you need to copy WD Fusion configuration onto all other nodes in the cluster:

  1. Open a terminal to your WD Fusion node.
  2. Navigate to /opt/mapr/hadoop/<hadoop-version>/etc/hadoop.
  3. Copy the core-site.xml file to the same location on all other nodes in the cluster.
  4. The configuration will be picked up automatically; there is no need to restart the nodes. The entire cluster can now communicate with WD Fusion.
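The copy in step 3 can be scripted from the WD Fusion node; a minimal sketch, assuming passwordless SSH as root and an example nodes.txt file listing the other cluster nodes (substitute your actual Hadoop version directory for <hadoop-version>):

  CONF=/opt/mapr/hadoop/<hadoop-version>/etc/hadoop/core-site.xml
  for host in $(cat nodes.txt); do
    scp ${CONF} root@${host}:${CONF}
  done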

Removing WANdisco Service

If you are removing WD Fusion, perhaps as part of a reinstallation, you should remove the client packages as well. Ambari never deletes any services from the stack; it only disables them. If you remove the WD Fusion service from your stack, remember to also delete the fusion-client.repo file:

[WANdisco-fusion-client]
name=WANdisco Fusion Client repo
baseurl=file:///opt/wandisco/fusion/client/packages
gpgcheck=0

For instructions on cleaning up the Stack, see Host Cleanup for Ambari and Stack.
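A minimal removal sketch follows, assuming the FUSION service was added under an HDP 2.2 stack as described above and that the client repo file sits in the standard yum repository directory; both paths are assumptions to verify on your own systems:

  # On the Ambari server: remove the WD Fusion service definition, then restart Ambari
  rm -rf /var/lib/ambari-server/resources/stacks/HDP/2.2/services/FUSION
  service ambari-server restart
  # On each client node: remove the repo definition shown above (path assumed)
  rm -f /etc/yum.repos.d/fusion-client.repo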

Uninstall WD Fusion

There currently isn't an uninstall function for our installer, so the system will have to be cleaned up manually. The best way to remove WD Fusion (assuming it was installed with our unified installer) is:

  • Stop all Fusion processes on the Fusion Server. See Shutting down.
  • Remove the fusion packages on the Fusion Server:
    yum erase 'fusion*'
  • Remove the client packages on any other nodes in the cluster (see the sketch after this list).
  • Remove the folders created by our installation on the Fusion Server:
    rm -r /opt/wandisco /opt/fusion/ /etc/wandisco
  • Remove the extra configuration we added to the core-site.xml on the Manager Server.
  • Restart the managers after reverting the configuration changes.
  • (Optional) Remove the installer file on the Fusion Server.
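The client clean-up on the other nodes can be scripted in the same way as the installation; a minimal sketch, assuming passwordless SSH as root, an example nodes.txt list of client nodes and RPM-based clients (on Ubuntu nodes use apt-get or dpkg instead):

  for host in $(cat nodes.txt); do
    ssh root@${host} "yum erase -y 'fusion*'"
  done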