Version: 2.4.3 (latest)

Configure an Amazon S3 source

You can migrate data from Amazon Simple Storage Service (Amazon S3) by configuring an S3 bucket as a source filesystem.

Follow these steps to add an Amazon S3 bucket as a source using either the UI or the CLI.

Prerequisites

You need the following:

note

When migrating data with Amazon S3 as a source, data contained in paths with two or more consecutive forward slashes can't be replicated.
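As a rough illustration of the note above, the following Python sketch flags object keys that contain two or more consecutive forward slashes. The function names are hypothetical helpers, not part of Data Migrator, and the key list is made-up sample data.

```python
import re

# Matches a run of two or more forward slashes anywhere in a key.
_CONSECUTIVE_SLASHES = re.compile(r"/{2,}")


def has_consecutive_slashes(key: str) -> bool:
    """Return True if the object key contains '//' or a longer run of slashes."""
    return _CONSECUTIVE_SLASHES.search(key) is not None


def find_unreplicable_keys(keys):
    """Return the keys that fall under the consecutive-slash limitation."""
    return [k for k in keys if has_consecutive_slashes(k)]


if __name__ == "__main__":
    sample_keys = ["data/a.csv", "data//b.csv", "logs///2024/app.log"]
    for key in find_unreplicable_keys(sample_keys):
        print(f"cannot be replicated: {key}")
```

Running a check like this against a bucket listing before you create a migration may help you spot paths that would be skipped.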

caution

When using Amazon S3 as a source, do not include the SQS initialization path (sqs-init-path/) in any migration. Including it causes Data Migrator to prevent subsequent migrations from progressing to a Live status.

Configure Amazon S3 as a source with the UI

  1. From the Dashboard, select an instance under Instances.

  2. In the Filesystems & Agents menu, select Filesystems.

  3. Select Add source filesystem.

  4. Select Amazon S3 from the Filesystem Type dropdown list.

  5. Enter the following details:

    • Display Name - The name you want to give your source filesystem.

    • Bucket Name - The reference name of the Amazon S3 bucket you are using.

    • Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint.

      The Authentication Method options available include:

      • Access Key and Secret org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

        Use this provider to enter credentials as an access key and secret access key with the following entries:

        • Access Key - Enter the AWS access key. For example, RANDOMSTRINGACCESSKEY.

        • Secret Key - Enter the secret key that corresponds with your access key. For example, RANDOMSTRINGPASSWORD.

      • AWS Identity and Access Management com.amazonaws.auth.InstanceProfileCredentialsProvider

        Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.

      • AWS Hierarchical Credential Chain com.amazonaws.auth.DefaultAWSCredentialsProviderChain

        A commonly used credentials provider chain that looks for credentials in this order:

        • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
        • Java System Properties - aws.accessKeyId and aws.secretKey.
        • Web Identity Token credentials from the environment or container.
        • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
        • Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
        • Instance profile credentials delivered through the Amazon EC2 metadata service.
      • Environment Variables com.amazonaws.auth.EnvironmentVariableCredentialsProvider

        Use this provider to enter an access key and a secret access key as either AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.

      • EC2 Instance Metadata Credentials com.amazonaws.auth.InstanceProfileCredentialsProvider

        Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.

      • Profile Credentials Provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider

        Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named credentials in a folder named .aws in your home directory.

        Enter an AWS Named Profile and a Credentials File Path. For example, ~/.aws/credentials.

        For more information, see Using the AWS Credentials File and Credential Profiles.

      • Custom Provider Class

        Use this if you want to enter your own class for the credentials provider.

    • JCEKS Keystore

      This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS). The keystore must contain key/value pairs for the access key fs.s3a.access.key and the secret key fs.s3a.secret.key.

      info

      You must configure HDFS as a target to be able to select JCEKS Keystore. The HDFS resource must exist on the same Data Migrator instance as the Amazon S3 filesystem you're adding.

      • JCEKS HDFS - Select the HDFS filesystem where your JCEKS file is located.

      • JCEKS Keystore Path - Enter the path containing the JCEKS keystore. For example, jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks.

        JCEKS on HDFS with Kerberos - You must add the dfs.namenode.kerberos.principal.pattern configuration property.

        Include the following steps when you add an HDFS source or target with Kerberos:

      1. Under Additional Configuration, select Configuration Property Overrides from the dropdown.

      2. Select + Add Key/Value Pair and add the key dfs.namenode.kerberos.principal.pattern and the value *.

      3. Select Save, then restart Data Migrator.

        note

        If you remove filesystems configured with JCEKS authentication, remove any Amazon S3 filesystems before you remove an HDFS source.

    • S3 Service Endpoint - The endpoint for the source AWS S3 bucket. See --endpoint in the S3A parameters.

    • Simple Queue Service (SQS) Endpoints (Optional)
      Data Migrator listens to the event queue to continually migrate changes from source file paths to the target filesystem(s).
      If you add an S3 source, you have three options regarding the queue:

      • Add the source without a queue. Data Migrator creates a queue automatically.
        If you want Data Migrator to create its own queue, ensure your account has the necessary permissions to create and manage SQS queues and attach them to S3 buckets.

      • Add the source and enter a queue but no endpoint. This allows you to use a queue that exists in a public endpoint.
        If you define your own queue, the queue must be attached to the S3 bucket.
        For more information about adding queues to buckets, see the AWS documentation.

      • Add the source and enter a queue and a service endpoint. The endpoint can be a public or a private endpoint.
        For more information about public endpoints, see the Amazon SQS endpoints documentation.

        • Queue - Enter the name of your SQS queue. This field is mandatory if you enter an SQS endpoint.

        • Endpoint - Enter the URL that you want Data Migrator to use. If you're using a Virtual Private Cloud (VPC), you must enter an endpoint.

      note

      You can set an Amazon Simple Notification Service (Amazon SNS) topic as the delivery target of the S3 event.

      Ensure you enable raw message delivery when you subscribe the SQS queue to the SNS topic.

      For more information, see the Amazon SNS documentation.

      Migration events expire after 14 days

      Data Migrator uses SQS messages to track changes to an S3 source filesystem. The maximum retention time for SQS messages is 14 days, which means events are lost after that time and can't be read by a migration.

      If you haven't used Data Migrator or have paused your S3 migrations for 14 days, we recommend you reset your S3 migrations.

    • S3A Properties (Optional) - Override properties or enter additional properties by adding key/value pairs.
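The 14-day SQS retention window described above can be expressed as a simple date check. This is a minimal sketch: the function name and the idea of tracking a "last active" timestamp are assumptions for illustration, not a Data Migrator feature.

```python
from datetime import datetime, timedelta, timezone

# Maximum SQS message retention, as stated in the section above.
SQS_MAX_RETENTION = timedelta(days=14)


def should_reset_migration(last_active: datetime, now: datetime) -> bool:
    """Return True if the migration has been idle or paused for 14 days or longer."""
    return now - last_active >= SQS_MAX_RETENTION


if __name__ == "__main__":
    paused_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # 9 days idle: events are still within the retention window.
    print(should_reset_migration(paused_at, datetime(2024, 1, 10, tzinfo=timezone.utc)))
    # 14 days idle: events may already have expired, so a reset is advisable.
    print(should_reset_migration(paused_at, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```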

Filesystem Options

  • Live Migration - After existing data is moved, changes made to the source filesystem are migrated in real time using an SQS queue.

  • One-Time Migration - Existing data is moved to the target, after which the migration is complete and no further changes are migrated.

Removed files

If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.

One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
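Setting the Hive partition case aside, the rescan behavior described above can be modeled with two dictionaries that map file names to content versions. This is only a conceptual sketch with made-up data; it does not reflect how Data Migrator is implemented.

```python
def rescan(source: dict, target: dict) -> dict:
    """Model a rescan: add new files and update changed ones, never delete."""
    result = dict(target)   # start from what is already on the target
    result.update(source)   # new and changed source files are added or overwritten
    return result


if __name__ == "__main__":
    # b.txt was removed from the source after the initial scan.
    source = {"a.txt": "v2", "c.txt": "v1"}
    target = {"a.txt": "v1", "b.txt": "v1"}
    print(rescan(source, target))
```

In this model, b.txt stays on the target after the rescan even though it no longer exists on the source.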

Configure Amazon S3 as a source with the CLI

Add an Amazon S3 filesystem

Use the filesystem add s3a command with the following parameters:

Add an S3 filesystem
filesystem add s3a [--file-system-id] string
                   [--bucket-name] string
                   [--endpoint] string
                   [--access-key] string
                   [--secret-key] string
                   [--sqs-queue] string
                   [--sqs-endpoint] string
                   [--credentials-provider] string
                   [--source]
                   [--scan-only]
                   [--properties-files] list
                   [--properties] string
                   [--s3type] string
                   [--bootstrap.servers] string
                   [--topic] string

For guidance about access, permissions, and security when adding an Amazon S3 bucket as a source filesystem, see Security best practices in IAM.

S3A mandatory parameters
  • --file-system-id The ID for the new filesystem resource.

  • --bucket-name The name of your S3 bucket.

  • --credentials-provider The Java class name of a credentials provider for authenticating with the Amazon S3 endpoint. In the UI, this is called Authentication Method. The provider options include:

    • org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

      Use this provider to offer credentials as an access key and secret access key with the --access-key and --secret-key parameters.

    • com.amazonaws.auth.InstanceProfileCredentialsProvider

      Use this provider when running Data Migrator on an Elastic Compute Cloud (EC2) instance that has an IAM role assigned with policies to access the Amazon S3 bucket.

    • com.amazonaws.auth.DefaultAWSCredentialsProviderChain

      A commonly used credentials provider chain that looks for credentials in this sequence:

      • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
      • Java System Properties - aws.accessKeyId and aws.secretKey.
      • Web Identity Token credentials from the environment or container.
      • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
      • Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
      • Instance profile credentials delivered through the Amazon EC2 metadata service.
    • com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider

      This provider supports the use of multiple AWS credentials, which are stored in a credentials file.

      When adding a source filesystem, use the following properties:

      • awsProfile - Name for the AWS profile.

      • awsCredentialsConfigFile - Path to the AWS credentials file. The default path is ~/.aws/credentials.

        For example:

        filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --properties awsProfile=<profile-name>,awsCredentialsConfigFile=</path/to/the/aws/credentials/file>

        In the CLI, you can also use --aws-profile and --aws-config-file.

        For example:

        filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --aws-profile <profile-name> --aws-config-file </path/to/the/aws/credentials/file>

        Learn more about using AWS profiles: Configuration and credential file settings.
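Before passing a profile name and credentials file path to the ExtendedProfileCredentialsProvider, it can be useful to confirm that the profile is actually defined. The sketch below relies on the AWS credentials file being INI-formatted with aws_access_key_id and aws_secret_access_key entries. The function name is a hypothetical helper, and the check is an assumption about what a useful pre-flight test looks like, not something Data Migrator performs.

```python
import configparser
from pathlib import Path


def profile_exists(credentials_file: str, profile: str) -> bool:
    """Return True if the named profile has both key entries in the credentials file."""
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, leaving the parser empty.
    parser.read(Path(credentials_file).expanduser())
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    # The profile name here is a placeholder.
    print(profile_exists("~/.aws/credentials", "default"))
```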

S3A optional parameters
  • --access-key When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the access key with this parameter.
  • --secret-key When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the secret key using this parameter.
  • --endpoint Enter a specific endpoint to access the bucket, such as an AWS PrivateLink endpoint. If you don't enter a value, the filesystem defaults to AWS.
note

Using --endpoint supersedes fs.s3a.endpoint.

  • --sqs-queue [Amazon S3 as a source only] Enter an SQS queue name. This field is required if you enter an SQS endpoint. If you provide a value for this parameter, and create a migration from the Amazon S3 source, the migration is by default a live migration.
  • --sqs-endpoint [Amazon S3 as a source only] Enter an SQS endpoint.
  • --source Enter this parameter to add the filesystem as a source.
  • --scan-only Enter this parameter to create a static source filesystem for one-time migrations. Requires --source.
  • --properties-files Reference a list of existing properties files, each containing Hadoop configuration properties in the format used by core-site.xml or hdfs-site.xml.
  • --properties Enter properties to use in a comma-separated key/value list.
  • --s3type Enter the value aws.

For information on properties that are added by default for new S3A filesystems, see the Command reference s3a default properties.

For information on properties that you can customize for new S3A filesystems, see the Command reference s3a custom properties.
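The parameter rules above can be collected into a small helper that assembles a filesystem add s3a command string and rejects combinations the documentation rules out. This is a sketch under stated assumptions: build_add_s3a_command is a hypothetical function, it covers only some of the parameters, and treating a missing --access-key or --secret-key as an error for SimpleAWSCredentialsProvider is an interpretation of the text rather than confirmed CLI behavior.

```python
SIMPLE_PROVIDER = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"


def build_add_s3a_command(file_system_id, bucket_name, credentials_provider,
                          access_key=None, secret_key=None,
                          sqs_queue=None, sqs_endpoint=None,
                          source=False, scan_only=False, properties=None):
    """Assemble a 'filesystem add s3a' command string, checking the documented rules."""
    if credentials_provider == SIMPLE_PROVIDER and not (access_key and secret_key):
        raise ValueError("SimpleAWSCredentialsProvider needs --access-key and --secret-key")
    if sqs_endpoint and not sqs_queue:
        raise ValueError("--sqs-queue is required when --sqs-endpoint is set")
    if scan_only and not source:
        raise ValueError("--scan-only requires --source")

    parts = [
        "filesystem add s3a",
        f"--file-system-id {file_system_id}",
        f"--bucket-name {bucket_name}",
        f"--credentials-provider {credentials_provider}",
    ]
    if access_key:
        parts.append(f"--access-key {access_key}")
    if secret_key:
        parts.append(f"--secret-key {secret_key}")
    if sqs_queue:
        parts.append(f"--sqs-queue {sqs_queue}")
    if sqs_endpoint:
        parts.append(f"--sqs-endpoint {sqs_endpoint}")
    if source:
        parts.append("--source")
    if scan_only:
        parts.append("--scan-only")
    if properties:
        # --properties takes a comma-separated key/value list.
        pairs = ",".join(f"{key}={value}" for key, value in properties.items())
        parts.append(f"--properties {pairs}")
    return " ".join(parts)


if __name__ == "__main__":
    # Placeholder IDs and bucket names; a one-time (scan-only) source.
    print(build_add_s3a_command(
        "mySource", "my-bucket",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
        source=True, scan_only=True,
    ))
```

Validating the arguments before running the command gives earlier feedback than waiting for the CLI to reject them.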

Update the Amazon S3 filesystem

note

To update existing filesystems, first stop all migrations associated with them. After saving updates, restart the Data Migrator service for your changes to take effect. In most supported Linux distributions, run the command service livedata-migrator restart.

Update the source filesystem with the following commands:

Command        Action
source clear   Delete all sources
source delete  Delete one source
source show    View the source filesystem configuration

Permissions and ownership of migrated items

When migrating files from an S3 source to an HDFS target, the user who writes the files is the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.

S3 object stores don't retain owner (RWX) permissions. Anything migrated from an S3 object store to an HDFS target has rwxrwxrwx permissions.
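For readers less familiar with the notation, rwxrwxrwx corresponds to octal mode 777: read, write, and execute for owner, group, and others. The short Python sketch below uses the standard library stat module to show the mapping. It only illustrates the notation and does not interact with HDFS or Data Migrator.

```python
import stat

# Mode string for a regular file with every permission bit set (octal 777).
mode = stat.filemode(stat.S_IFREG | 0o777)
print(mode)  # -rwxrwxrwx (the leading '-' marks a regular file)
```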

Next steps

Configure a target filesystem to migrate data to. Then create a migration.