Version: 1.18.1

On-premises Hadoop to Amazon S3 and AWS Glue

This is an outline of the steps needed to prepare your environment for migrating data and metadata.

Time to complete: 1 hour (assuming all prerequisites are met).

Recommended technical knowledge#

  • Linux operating system
  • Apache Hadoop administration
    • Hadoop Distributed Filesystem (HDFS)
    • Apache Hive
  • Amazon Web Services (AWS) Service configuration and management
    • Amazon Simple Storage Service (Amazon S3)
    • AWS Glue

Prerequisites#

On-premises Hadoop cluster#

Amazon S3 and AWS Glue#

For your target environment, make sure you have the following:

AWS security#

All AWS services should be secured using best practices. This is a summary of those practices and which services they apply to.

Amazon S3#

All Amazon S3 buckets should adhere to AWS best practices for Amazon S3. These include the following:
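
As one illustration (the bucket name and KMS key alias are placeholders), two commonly recommended controls, S3 Block Public Access and default bucket encryption, can be applied with the AWS CLI:

    # Block all public access on the target bucket (placeholder bucket name)
    aws s3api put-public-access-block \
      --bucket my-migration-target-bucket \
      --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

    # Enable default encryption with an AWS KMS key (placeholder key alias)
    aws s3api put-bucket-encryption \
      --bucket my-migration-target-bucket \
      --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"alias/my-s3-key"}}]}'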

AWS Glue#

All AWS Glue instances should be configured using AWS security practices for Glue. These include the following:
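
For example (a sketch only; the KMS key alias is a placeholder), encryption at rest for the AWS Glue Data Catalog can be enabled with the AWS CLI:

    # Encrypt the Glue Data Catalog with a customer-managed KMS key (placeholder alias)
    aws glue put-data-catalog-encryption-settings \
      --data-catalog-encryption-settings '{"EncryptionAtRest":{"CatalogEncryptionMode":"SSE-KMS","SseAwsKmsKeyId":"alias/my-glue-key"}}'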

AWS deployment#

These are some of the options to consider before creating your Amazon S3 bucket and AWS Glue instance:
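
For instance, if you create the bucket with the AWS CLI, choose a region close to your network route into AWS (the bucket name and region below are placeholders):

    # Create the target bucket in a specific region (placeholder values)
    aws s3api create-bucket \
      --bucket my-migration-target-bucket \
      --region eu-west-1 \
      --create-bucket-configuration LocationConstraint=eu-west-1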

AWS costs and quotas#

The following table lists the required and optional AWS services that apply to this use case:

| Service                      | Required? | Pricing                  | Quotas                  |
| ---------------------------- | --------- | ------------------------ | ----------------------- |
| Amazon S3                    | Yes       | Amazon S3 pricing        | Amazon S3 quotas        |
| AWS Glue                     | Yes       | AWS Glue pricing         | AWS Glue quotas         |
| Site-to-Site VPN             | Optional  | Site-to-Site VPN pricing | Site-to-Site VPN quotas |
| Direct Connect               | Optional  | Direct Connect pricing   | Direct Connect quotas   |
| Key Management Service (KMS) | Optional  | KMS pricing              | KMS quotas              |

See AWS pricing for more general guidance.

Install LiveData Migrator on your Hadoop edge node#

Download and install LiveData Migrator on your Hadoop edge node.

Configure for data migrations#

Add Hadoop Distributed File System (HDFS) as source filesystem#

  1. Configure your on-premises HDFS as the source filesystem:

  2. (CLI only) Validate that your on-premises HDFS is now set as your source filesystem:

    source show

    If the filesystem shown is incorrect, delete it using source del and configure the source manually:

    filesystem add hdfs

    Be sure to include the --source parameter when using the command above; an example follows this list.
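
A minimal sketch of adding the source manually (the filesystem ID and NameNode URI are placeholders, and parameter names can vary between LiveData Migrator versions, so check the CLI help for filesystem add hdfs):

    # Add the on-premises HDFS as the source filesystem (placeholder values)
    filesystem add hdfs --file-system-id sourceHdfs \
      --fs.defaultFS hdfs://namenode.example.com:8020 \
      --source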

Add Amazon S3 bucket as target filesystem#

Configure your Amazon S3 bucket as your target filesystem:
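
A minimal CLI sketch, assuming access-key credentials (all values are placeholders and the exact parameters may differ between versions; see the CLI help for filesystem add s3a):

    # Add the Amazon S3 bucket as the target filesystem (placeholder values)
    filesystem add s3a --file-system-id myAWSBucket \
      --bucket-name my-migration-target-bucket \
      --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
      --access-key ACCESS_KEY \
      --secret-key SECRET_KEY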

You may then test your target filesystem.

Test the S3 bucket target#

LiveData Migrator automatically tests the connection to any target filesystem added to ensure the details provided are valid and a migration can be created and run.

To check that the configuration for the filesystem is correct:

  • UI - the target will show a healthy connection.

  • CLI - the filesystem show command will only list a target that was successfully added:

    Example
    filesystem show --file-system-id myAWSBucket

To test a migration to the S3 bucket, create a migration and run it to transfer data, then check that the data has arrived in its intended destination.
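
As an illustrative sketch only (the path and names are placeholders, and the migration command syntax varies between LiveData Migrator versions, so confirm it against your CLI's help):

    # Create a migration of an HDFS path to the S3 target and start it
    # (placeholder values; flag names are assumptions to verify against your version)
    migration add --path /data/sample --target myAWSBucket --migration-id testMigration
    migration run --migration-id testMigration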

Create path mappings (optional)#

Create path mappings to ensure that data for managed Hive databases and tables is migrated to an appropriate folder location on your Amazon S3 bucket.

This lets you start using your source data and metadata immediately after migration, as they will be referenced correctly by your AWS Glue crawler and/or AWS Glue Studio.
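
As a version-dependent sketch (all names and paths are placeholders; confirm the path mapping command and its parameters against your CLI's help), a mapping that keeps managed Hive warehouse data under a matching prefix on the bucket might look like:

    # Map the on-premises Hive warehouse directory to an equivalent location on the S3 target
    # (placeholder values; parameter names are assumptions)
    path mapping create --path-mapping-id hiveWarehouse \
      --source-location /apps/hive/warehouse \
      --target myAWSBucket --target-location /apps/hive/warehouse \
      --description "Managed Hive warehouse data"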

Configure for metadata migrations#

Add Apache Hive as source hive agent#

  1. Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:
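
A hedged sketch of the CLI step (the agent name is a placeholder; the connection options you need, such as metastore and Kerberos settings, are environment-specific, so review the CLI help for hive agent add hive):

    # Add a hive agent for the on-premises Hive metastore (placeholder name).
    # Additional connection options are environment-specific and omitted here.
    hive agent add hive --name sourceHiveAgent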

Test the Apache Hive source hive agent#

LiveData Migrator automatically tests the connection to any hive agent added to ensure the details provided are valid and a metadata migration can be created and run.

To check that the configuration for the hive agent is correct:

  • UI - the agent will show a healthy connection.

  • CLI - use the hive agent check command to confirm the agent was added successfully:

    Example
    hive agent check --name hiveAgent

To test a metadata migration from the Apache Hive agent, create a metadata migration and run it to transfer metadata, then check that the metadata has arrived in its intended destination.

Add AWS Glue as target hive agent#

Configure a hive agent to connect to AWS Glue:
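
A hedged sketch (all values are placeholders; the credential and region parameters may differ between versions, so check the CLI help for hive agent add glue):

    # Add a hive agent that connects to the AWS Glue Data Catalog
    # (placeholder values; flag names are assumptions to verify)
    hive agent add glue --name glueAgent \
      --access-key ACCESS_KEY \
      --secret-key SECRET_KEY \
      --glue-region eu-west-1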

Test the AWS Glue target hive agent#

LiveData Migrator automatically tests the connection to any hive agent added to ensure the details provided are valid and a metadata migration can be created and run.

To check that the configuration for the hive agent is correct:

  • UI - the agent will show a healthy connection.

  • CLI - use the hive agent check command to confirm the agent was added successfully:

    Example
    hive agent check --name hiveAgent

To test a metadata migration to the AWS Glue target agent, create a metadata migration and run it to transfer metadata, then check that the metadata has arrived in its intended destination.
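
An illustrative sketch of creating and starting a metadata migration (names are placeholders and the exact syntax varies between versions; confirm it against the hive migration commands in your CLI):

    # Create a metadata migration from the source Hive agent to the Glue agent and start it
    # (placeholder values; flag names are assumptions to verify)
    hive migration add --name testHiveMigration \
      --source sourceHiveAgent --target glueAgent --auto-start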

Troubleshooting#

If a filesystem or hive agent cannot be added, LiveData Migrator will in most cases provide error messages to help you identify the issue.

If no data appears to have been transferred in either a migration or a metadata migration, check LiveData Migrator's notifications for errors. In most cases, these will provide you with the information you need to diagnose any problems.

In the event of a problem you cannot diagnose, contact WANdisco support.

Network architecture#

LiveData Migrator Network Architecture

The diagram shows an example LiveData Migrator architecture spanning two environments: on-premises and AWS Cloud.

On-Premises#

  1. All migration activity, both reads and writes, goes through the LiveData Migrator service. Data transfer to AWS is over port 443 (HTTPS). Metadata transfer through the HiveMigrator functionality is over ports 6780/6781 (HTTP/HTTPS).

  2. Interaction with LiveData Migrator is handled either through WANdisco's LiveData UI component (port 8081) or the LiveData Migrator CLI (via the LDM API on port 18080). The LDM CLI does not open any ports itself and acts as a client. A quick check of these local ports follows this list.
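
A quick way to confirm these local ports on the edge node (nc is assumed to be available; the ports are those listed above):

    # Check that the LiveData Migrator API, UI, and HiveMigrator ports are listening
    nc -z localhost 18080 && echo "LDM API reachable"
    nc -z localhost 8081  && echo "LiveData UI reachable"
    nc -z localhost 6780  && echo "HiveMigrator reachable"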

AWS Cloud#

  1. The WAN connection to AWS from the source environment (see AWS Site-to-Site VPN and AWS Direct Connect).

  2. A VPC and Subnet (see Working with VPCs and subnets) that are configured with access to the underlying storage and metastore and necessary external connectivity to the source environment.

  3. The IAM role and associated permissions for access to resources.

  4. The underlying storage (Amazon S3 bucket) and metastore (AWS Glue Data Catalog).

    important

    By default, S3 buckets are set as private to prevent unauthorized access. We strongly recommend reading the following AWS blog post for a good overview of this subject:
    Best practices for securing sensitive data in AWS data stores

  5. The AWS Key Management Service configured to encrypt both the Amazon S3 bucket and AWS Glue instance.

Next Steps#

Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.