WANdisco Glossary


Welcome to the WANdisco Glossary for all our Big Data products.

For more detailed information please visit the WANdisco Fusion user guide.

If you also use our SCM/ALM products, see their glossary here.

A

Acceptor

The recipient node for an incoming Agreement message initiated by a Proposer. It is a Paxos term used in content distribution for a node that can vote on the order in which replicated changes will play out.

Agreement

Each step that the DConE replicated state machine executes is called an agreement.

Agreement Manager

Individual agreement steps of the replicated state machine are executed under the purview of one or more Agreement Managers.

Amazon Web Services (AWS)

AWS is a subsidiary of Amazon.com which provides cloud computing platforms on a subscription basis. WANdisco Fusion can be run on this platform.

Ambari

It is an open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.
Ambari is also a fibre which is quite similar to jute!

AMQP

Advanced Message Queuing Protocol

Apache Kafka

A fast, scalable, fault-tolerant messaging system.
Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication.
Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.

API

Application Programming Interface.

AWS

See Amazon Web Services.

Azure

Azure is Microsoft’s cloud computing platform, a growing collection of integrated services (analytics, computing, database, mobile, networking, storage, and web) for moving faster, achieving more, and saving money.

Azure resource group

Applications are typically made up of many components, for example a web app, database, database server, storage, and 3rd party services. Azure Resource Manager (ARM) enables you to work with the resources in your application as a group, referred to as an Azure Resource Group. You can deploy, update, monitor or delete all of the resources for your application in a single, coordinated operation. You use a template for deployment and that template can work for different environments such as testing, staging and production. You can clarify billing for your organization by viewing the rolled-up costs for the entire group. For more information, see Azure Resource Manager Overview.

Azure Blob storage

Azure Blob storage is a robust, general-purpose storage solution that integrates with HDInsight. Through the WASB driver and the WebWasb (WebHDFS over WASB) interface, the full set of components in HDInsight can operate directly via standard Hadoop DFS tools (command line, File System Java API) on structured or unstructured data in Blob storage.

There are several benefits associated with using Azure Blob Storage as the native file system:

Storing data in Blob storage enables users to safely delete the HDInsight clusters that are used for computation without losing user data.
Data reuse and sharing
Data storage cost

Although there is an implied performance cost of not co-locating compute clusters and storage resources, this is mitigated by the way the compute clusters are created close to the storage account resources inside the Azure datacenter, where the high-speed network makes it very efficient for the compute nodes to access the data inside Azure Blob storage. For more information, see Use Azure Blob storage with Hadoop in HDInsight.
Address files in Blob storage
HDInsight uses Azure Storage Blob through the WASB(S) driver. Azure Blob storage is transparent to users and developers. To access the files on the default storage account, you can use any one of the following forms:
/example/jars/hadoop-mapreduce-examples.jar
wasb:///example/jars/hadoop-mapreduce-examples.jar
wasb://mycontainer@myaccount.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar
If the data is stored outside the default storage account, you must link to the storage account at the creation time. The URI scheme for accessing files in Blob storage from HDInsight is:
wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>
wasb[s]: The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with wasbs).
We recommend using wasbs wherever possible, even when accessing data that lives inside the same datacenter in Azure.
<BlobStorageContainerName>: Identifies the name of the container in Azure Blob storage.
<StorageAccountName>: Identifies the Azure Storage account name. An FQDN is required.
<path>: is the file or directory HDFS path name. Because containers in Azure Blob storage are simply key-value stores, there is no true hierarchical file system. A slash character ( / ) inside a blob key is interpreted as a directory separator. For example, the blob name for hadoop-mapreduce-examples.jar is:
```
example/jars/hadoop-mapreduce-examples.jar
```
When working with blobs outside of HDInsight, most utilities do not recognize the WASB format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
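The relationship between a WASB URI and the plain blob name can be sketched in Python. This is a minimal illustration using only the standard library; the function name `split_wasb_uri` is hypothetical and is not part of any HDInsight or Azure tooling.

```python
from urllib.parse import urlsplit


def split_wasb_uri(uri):
    """Split a wasb[s] URI into (container, account_host, blob_name).

    Hypothetical helper for illustration only. container and account_host
    are None when the URI relies on the default storage account,
    for example wasb:///example/jars/...
    """
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri!r}")
    # The blob name is the path without its leading slash; the remaining
    # slashes are only interpreted as directory separators.
    blob_name = parts.path.lstrip("/")
    return parts.username, parts.hostname, blob_name


print(split_wasb_uri(
    "wasbs://mycontainer@myaccount.blob.core.windows.net"
    "/example/jars/hadoop-mapreduce-examples.jar"
))
```

The third element of the returned tuple is the basic path format that utilities outside HDInsight usually expect.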
Best Practices for using blob storage with HDInsight:
Don’t share a default container between two live clusters. This is not a supported scenario.
Re-use the default container to reuse the same root path on a different cluster.
Use an additional linked storage account for user data.

B

Big data: This term describes large and complex data sets that are too big for traditional data processing software to handle. Software such as WANdisco Fusion, however, can handle them.
BigInsights: A software platform from IBM for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day. Our Fusion product can be run on this system, where it is called Big Replicate.
Big Replicate: The product name for WANdisco Fusion when run on IBM’s BigInsights system.

C

CA: See Certificate Authority.
CentOS: A community-supported, free and open source operating system based on Red Hat Enterprise Linux. It exists to provide a free enterprise class computing platform and strives to maintain 100% binary compatibility with its upstream distribution. CentOS stands for Community ENTerprise Operating System.
Certificate Authority (CA): A body which issues digital certificates.
A certificate is a digitally signed statement from one entity (the issuer), saying that the public key (and some other information) of another entity (the subject) has some specific value. These trusted certificates can then be used to create secure connections to a server.
CDH: See Cloudera Distribution Hadoop.
Cloudera Distribution Hadoop (CDH): It is an open source Apache Hadoop distribution provided by Cloudera Inc.
Clustering: WANdisco’s Clustering products are implemented as a transparent gateway between the clients and the servers, and act as network proxies for the local server. There is a one-to-one relationship between replicator and server. With clustering, a load balancer directs a user to one node in the cluster, but client write activity must be replicated from that node to the others in the cluster.

D

Data Transfer Object (DTO): An object which transfers data between software application subsystems.
DConE: WANdisco’s Distributed Coordination Engine, the software engine underlying replication. Read the Whitepaper on our DConE technology.
Deterministic State Machine (DSM): An object whose principal job is obtaining agreements on the ordering of proposals as part of the DConE engine. Each DSM will have a group of nodes assigned to it, comprising at least two Zones. Each node can have one or more of the Paxos roles (Proposer, Acceptor, Learner).
Distinguished Name (DN): Used for unique identification in LDAP.
Distinguished Node: The Distinguished Node (also known as a Tiebreaker) is used in situations where there is an even number of nodes, a configuration that introduces the risk of a tied vote. Its vote carries more weight, which ensures that a vote cannot become tied.
DN: See Distinguished Name.
DSM: See Deterministic State Machine.
DTO: See Data Transfer Object.

E

Ecosystem: The combined set of nodes that have been inducted together.
Edge Nodes: Edge nodes (AKA gateway nodes) are servers that interface between the Hadoop cluster and systems that are outside the network. Most commonly, edge nodes are used to run client applications and cluster administration tools.
Elastic MapReduce (Amazon EMR): It processes big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). It has dynamic resizing ability, which allows it to alter resource use depending on the demand at any given time.
EMR: See Elastic MapReduce (Amazon EMR).

F

Flume: Flume is an Apache product which we use for efficiently collecting, aggregating and moving large amounts of log data.
FQDN: See Fully Qualified Domain Name.
Fully Qualified Domain Name (FQDN): A domain name that specifies its exact location within the tree hierarchy of the DNS. It is unambiguous.
Fusion Client: Client jar files to be installed on each Hadoop client, such as mappers and reducers that are connected to the cluster. The client is designed to have a minimal memory footprint and impact on CPU utilization.
Fusion: WANdisco Fusion is our software application which allows Hadoop deployments to replicate HDFS data between Hadoop clusters that are running different, even incompatible, versions of Hadoop. It is even possible to replicate between different vendor distributions and versions of Hadoop.
Fusion UI: A separate server that provides administrators with a browser-based management console for each WANdisco Fusion server. This can be installed on the same machine as WANdisco Fusion’s server or on a different machine within your data center.

G

Garbage Collection: A type of automatic memory management. It allows system resources to be freed up by removing objects that are no longer required.
Global Sequence Number (GSN): The ordered number of an agreement associated with a DSM.
Globally Unique Identifier (GUID): Each node is assigned a GUID when added to a Replication Group. The nodes identify each other by their GUIDs.
Growl: Growl messages provide immediate feedback in response to a user’s interactions with the Admin UI. They appear in the top right-hand corner of the screen and persist for a brief period or until the screen is refreshed or changed.
GSN: See Global Sequence Number.
GUID: See Globally Unique Identifier.

H

Hadoop: Hadoop is an open source software framework for storing data and running applications on clusters of commodity hardware. It provides storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop Distributed File System (HDFS): HDFS is a standard distributed file system written in Java. It stores a large amount of data placed across multiple machines, and provides excellent data throughput and access through the MapReduce algorithm, high fault tolerance and native support of large data sets. The HDFS architecture consists of clusters, each of which is accessed through a single Name Node software tool installed on a separate machine to monitor and manage that cluster’s file system and user access mechanism. The other machines each install one instance of Data Node to manage cluster storage.
HCFS: Hadoop Compatible File System.
HDFS: See Hadoop Distributed File System.
HDInsight: HDInsight deploys and creates Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. Hadoop often refers to the entire Hadoop ecosystem of components, which includes Apache Hadoop, Apache Storm and Apache HBase clusters, as well as other technologies under the Hadoop umbrella.
High Availability: Provides continuous hot backup, while making failover and disaster recovery automatic and transparent for both developers and administrators.
Hive: Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

I

IHC Server: Inter Hadoop Communication servers handle the traffic that runs between zones or data centers that use different versions of Hadoop. IHC Servers are matched to the version of Hadoop running locally. It’s possible to deploy different numbers of IHC servers at each data center. Additional IHC Servers can form part of a High Availability mechanism.
Induction: Induction adds a node to the Fusion ecosystem, either as a new Zone or to an existing Zone.

J

JIRA: Browser-based bug, issue, task and defect tracking system, and project management software solution used for open source and enterprise projects. JIRA isn’t an acronym, it’s a contraction of 'Gojira'. When listing product features and bug fixes we’ll usually add a reference code for the corresponding JIRA issue.
Jenkins: An open source automation server.

K

KB: See Knowledgebase.
KDC: See Key Distribution Center.
Kerberos: A network authentication system.
Key Distribution Center (KDC): In cryptography, a KDC is a network service which provides temporary session keys to users and computers. It is designed to reduce the risks associated with exchanging keys.
Key Management Service (KMS): KMS is an encryption keystore, providing network users with keys required to decrypt data-at-rest.
KMS: See Key Management Service.
Knowledgebase (KB): The WANdisco Knowledgebase contains updates and product information.

L

LDAP: See Lightweight Directory Access Protocol.
Learner: It is a Paxos term used in content distribution for a node which delivers and executes proposals, a node with a repository replica. The node will receive replication traffic that will synchronize its data with other nodes.
Lightweight Directory Access Protocol (LDAP): An Internet protocol that enables client programs to access distributed directory services.

M

MapReduce

Hadoop MapReduce is a software framework for writing applications that process large amounts of data in parallel on large clusters. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in parallel. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
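The map, sort and reduce steps described above can be sketched as a word count in plain Python. This is a single-process illustration of the data flow only: it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Group the map outputs by key and sort the keys before reducing."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: combine all values collected for one key."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is independent, so a real framework could map them in parallel.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))


print(word_count(["big data", "Big clusters"]))
```

In a real cluster the chunks would live in a distributed file system and the framework, not the caller, would schedule and retry the tasks.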

Membership

This term is redundant as of Fusion 2.11.

A Membership is a defined group of WANdisco Fusion servers that replicate data between their Zones. You can have as many WANdisco Fusion servers in a Membership as you like, and each WANdisco Fusion server can participate in multiple Memberships. WANdisco Fusion allows you to define multiple Memberships, and WANdisco Fusion servers can fulfil different roles in each Membership they participate in. This allows you to control exactly how your WANdisco Fusion environment operates normally, and how it behaves when faced with network failures, host failures and other types of issues.

Microsoft Azure: See Azure.

N

Node: A server on which a replicator product is installed.

O

Owl: A fan of Sheffield Wednesday Football Club.

P

Proposer: A Paxos term used in content distribution for a node which creates proposals and resolves proposal conflicts. It creates proposals for changes that can be applied to the other nodes.

Q

Quorum: A group of nodes that can reach agreement to determine transaction order. At least 50% of the nodes must be available to achieve Quorum. In the case of an even number of nodes, a distinguished node settles a conflict.
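The quorum rule can be illustrated with a short Python sketch. The rule coded here is an assumption made for the example (a strict majority always has quorum, and an exact 50% split has quorum only on the side that holds the distinguished node); it is not taken from the DConE implementation, and `has_quorum` is a hypothetical name.

```python
def has_quorum(available, all_nodes, distinguished=None):
    """Return True if the available nodes can reach agreement.

    Assumed rule for this sketch: a strict majority always has quorum, and
    an exact half has quorum only if it includes the distinguished node.
    """
    available = set(available) & set(all_nodes)
    if 2 * len(available) > len(all_nodes):
        return True
    if 2 * len(available) == len(all_nodes):
        return distinguished in available
    return False


nodes = ["n1", "n2", "n3", "n4"]
print(has_quorum(["n1", "n2"], nodes, distinguished="n1"))  # True
print(has_quorum(["n3", "n4"], nodes, distinguished="n1"))  # False
```

With four nodes split evenly, only the half containing the distinguished node can continue, so the two halves can never both decide a transaction order.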

R

Replication Rules: File system content is replicated selectively by defining Replication Rules, which specify: the directory in the file system that will be replicated and the Zones that will participate in that replication.

Without any Replication Rules defined, each Zone’s file system operates independently of the others. With the combination of Zones and Replication Rules, WANdisco Fusion gives you complete control over how data is replicated between your file systems and/or object stores.
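As a rough model of the idea, a Replication Rule can be thought of as a directory paired with a set of Zones. The Python sketch below is hypothetical: the class, field and function names are invented for illustration and do not reflect how WANdisco Fusion stores or evaluates its rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationRule:
    path: str         # directory in the file system that will be replicated
    zones: frozenset  # names of the Zones that take part in the replication


def zones_for(file_path, rules):
    """Return the Zones that replicate file_path (empty set if no rule covers it).

    Simplification for this sketch: the Zones of every matching rule are merged.
    """
    matched = set()
    for rule in rules:
        root = rule.path.rstrip("/")
        if file_path == root or file_path.startswith(root + "/"):
            matched |= rule.zones
    return matched


rules = [ReplicationRule("/repl/sales", frozenset({"zone-a", "zone-b"}))]
print(zones_for("/repl/sales/2020/q1.csv", rules))  # both zones
print(zones_for("/tmp/scratch.txt", rules))          # set(): not replicated
```

A path that no rule covers maps to an empty set, which corresponds to each Zone's file system operating independently.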

Replicator: A replicator is synonymous with an installed instance of the product or "node".
RESTful API: Representational State Transfer Application Programming Interface.

S

SHA1: Secure Hash Algorithm 1. It produces a 160-bit hash value, normally rendered as a 40-digit hexadecimal number. Although originally developed for security purposes, SHA1 is used to ensure data hasn’t changed, for example due to accidental corruption.
Sideline: A sidelined repository replica is a replica that will no longer get updates from other replicas in the family. The other replicas will not preserve operations for a sidelined replica thereby preventing them from running out of memory. Replicas become sidelined if they are out of communication for an extended amount of time and the number of outstanding agreements exceeds a tunable maximum. Sidelined replicas can be brought back into normal operation via a repair procedure.
Site: A physical location containing computers where one or more WANdisco replicated products are installed.
SSH: Secure Shell (SSH) is a means of getting secure access to a remote computer. It can be used for authentication.
SSL: Secure Sockets Layer (SSL) is a commonly used encryption protocol.
Synchronized Stop: A special transaction that will prevent further write transactions from happening. A replicator will process transactions normally up to this special transaction and the node will enter a "stopping" state. This causes all nodes to stop after processing up through the exact same GSN. A Synchronized Stop is typically actioned prior to administrative tasks such as a product upgrade. This transaction requires Unanimous Agreement in order to complete. Even if one of the nodes is not available, the other nodes will enter the stopping state and prevent write transactions. Unanimous Agreement means that all nodes need to be available for a synchronized stop to complete.
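To illustrate the SHA1 entry in this section, the following Python sketch uses the standard `hashlib` module to show the 40-digit rendering and a simple integrity check. The sample byte strings are invented for the example.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA1 digest of data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


original = b"replicated block contents"
received = b"replicated block contents"
corrupted = b"replicated block c0ntents"

digest = sha1_hex(original)
print(len(digest))                    # 40 hex digits = 160 bits
print(sha1_hex(received) == digest)   # True: content unchanged
print(sha1_hex(corrupted) == digest)  # False: content differs
```

Comparing digests in this way detects accidental changes without comparing the full content byte by byte.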

T

Talkback: A bash script that is provided in your product installation to collect product meta-data in case you need to report an issue to the WANdisco support team.
TDE: See Transparent Data Encryption.
Transparent Data Encryption (TDE): TDE is a technology which provides encryption at file level. It can therefore be used, for example, to encrypt databases.

U

UI: User interface. The console with which users interact.
Universally unique identifier (UUID): A 128-bit identifier that is created in such a way that no other identifier will ever be its equal.
URI: Uniform Resource Identifier.
UUID: See Universally unique identifier.
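As a small illustration of the UUID entry in this section, Python's standard `uuid` module can generate a random (version 4) identifier and confirm its size. For random UUIDs, uniqueness is a matter of overwhelming probability rather than an absolute guarantee. The variable name is chosen for this example.

```python
import uuid

identifier = uuid.uuid4()         # random (version 4) UUID
print(identifier)                 # a different value on every run
print(len(identifier.bytes) * 8)  # 128 bits
print(len(str(identifier)))       # 36 characters: 32 hex digits plus 4 hyphens
```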

V

VPC: See Virtual Private Cloud.
Virtual Private Cloud (VPC): A VPC is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud. You can launch AWS resources, such as Amazon EC2 instances, into your VPC.

W

WANdisco: Wide Area Network Distributed Computing. WANdisco is a leading provider of distributed software development solutions. By using WANdisco’s unique replication technology, software development occurs anywhere without the constraints associated with far-flung distribution.
WASB: See Azure Blob Storage. It stands for Windows Azure Storage Blob.
Wildcard: A symbol used to replace or represent one or more characters.
Writer: In WANdisco Fusion’s architecture, only one Fusion node/server per zone is allowed to write into a replicated filespace. This node is the "Writer" for that replicated folder. Therefore, if there is one replicated folder and two zones, there will be two writers for the replicated folder, one in each zone. If there are two replicated folders and two zones, there will be four writers, two in each zone.

The writer for a replicated folder does not have to be the same node as the writer for another replicated folder, e.g. Node 1 may be the writer for /dir1/dir2 and /dir1/dir3 and Node 2 may be the writer for /dir1/dir4, which allows for load-balancing across Fusion servers within a zone. If a writer node fails, a new writer for that folder must be selected (set through the process for Writer Selection).

An exception to this is when a Fusion node is started or restarted: it checks whether any replicated folders do not have a writer assigned and, if it finds any, elects itself as writer for them.
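The "one writer per zone per replicated folder" arithmetic can be sketched in Python. The round-robin assignment below is an assumption made purely to show how writers might be spread across the nodes of a zone; in the product, writers are chosen through Writer Selection, and `assign_writers` is a hypothetical name.

```python
def assign_writers(zones, folders):
    """Pick one writer node per (zone, folder) pair.

    zones maps a zone name to the list of Fusion nodes in that zone.
    Round-robin within each zone is an illustrative assumption only.
    """
    writers = {}
    for zone, nodes in zones.items():
        for i, folder in enumerate(folders):
            writers[(zone, folder)] = nodes[i % len(nodes)]
    return writers


zones = {"zone1": ["node1", "node2"], "zone2": ["node3"]}
folders = ["/dir1/dir2", "/dir1/dir3"]
writers = assign_writers(zones, folders)
print(len(writers))  # 4: two replicated folders x two zones
```

Each key of the resulting mapping is a (zone, folder) pair, so the number of writers is always the number of replicated folders multiplied by the number of zones.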

X

XML: Extensible Markup Language. Multiple files in WANdisco products are written in this language.

Y

YARN: Apache Hadoop YARN is a platform responsible for managing computing resources in clusters and using them for scheduling users' applications. It stands for Yet Another Resource Negotiator.
Yeturu Aahlad: Dr. Aahlad is WANdisco’s chief scientist and is recognized as an authority on distributed computing, a field in which he currently holds 3 patents.
Prior to WANdisco, Dr. Aahlad served as the distributed systems architect for iPlanet (Sun/Netscape Alliance) Application Server. At Netscape, Dr. Aahlad joined the elite team in charge of creating a new server platform based on the CORBA distributed object framework.

Z

Zones: A Zone represents the file system used in a standalone Hadoop cluster. Multiple Zones could be from separate clusters in the same data center, or could be from distinct clusters operating in geographically-separate data centers that span the globe. WANdisco Fusion operates as a distributed collection of servers. While each WANdisco Fusion server always belongs to only one Zone, a Zone can have multiple WANdisco Fusion servers (for load balancing and high availability). When you install WANdisco Fusion, you should create a Zone for each cluster’s file system.