Version: 2.2

Configure an Amazon S3 source

You can migrate data from Amazon Simple Storage Service (Amazon S3) by configuring an S3 bucket as a source filesystem.

Follow these steps to add an Amazon S3 bucket as a source using either the WANdisco® UI or the CLI.

Prerequisites

You need the following:

An Amazon S3 bucket. See the Amazon S3 bucket documentation.
Authentication details for your bucket. See below for more information.
If you're configuring your own SQS queue in AWS for live replication with Data Migrator, the queue must be attached to the S3 bucket:
- Ensure the queue has an access policy that allows your bucket to send messages to the queue. See grant destinations permissions to S3.
- Ensure the bucket has an event notification to the destination queue with all event types required for live replication. See enable event notifications.
For live replication, enable the following event types:
Object creation: Select All object create events or select individually Put, Post, Copy, Multipart upload completed.
Object removal: Select All object removal events or select individually Permanently delete, Delete marker created.
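
As a rough illustration of the two queue prerequisites above, the following Python sketch builds the queue access policy and the bucket event notification as plain dictionaries. The function names, bucket name, and queue ARN are placeholders rather than values from this guide, and the policy follows the general AWS pattern for letting S3 send messages to an SQS queue, so check it against the AWS documentation before you rely on it.

```python
import json


def sqs_access_policy(queue_arn: str, bucket_name: str) -> dict:
    """Access policy that lets the named bucket send messages to the queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only accept messages that originate from this bucket.
                "Condition": {
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket_name}"}
                },
            }
        ],
    }


def bucket_notification(queue_arn: str) -> dict:
    """Event notification covering all object create and removal events."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }


# Placeholder values for illustration only.
example_arn = "arn:aws:sqs:us-east-1:123456789012:example-queue"
print(json.dumps(sqs_access_policy(example_arn, "example-bucket"), indent=2))
print(json.dumps(bucket_notification(example_arn), indent=2))
```

The first dictionary is the kind of document you would apply as the queue's access policy, and the second matches the shape that the S3 bucket notification configuration expects. Neither call to AWS is made here.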

note

When migrating data with Amazon S3 as a source, data contained in paths with two or more consecutive forward slashes can't be replicated.

caution

When using Amazon S3 as a source, do not include the SQS initialization path (sqs-init-path/) in any migration. Doing so prevents subsequent migrations from progressing to a Live status.
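
The two restrictions above can be checked before you create a migration. The following Python sketch is one possible pre-flight check, not part of Data Migrator; the function name is hypothetical, and it assumes that a path "includes" the SQS initialization path when any of its segments is named sqs-init-path.

```python
from typing import List

# Segment name taken from the caution above.
SQS_INIT_SEGMENT = "sqs-init-path"


def path_problems(path: str) -> List[str]:
    """Return the reasons a source path is unsuitable for migration."""
    problems = []
    # Two or more consecutive forward slashes can't be replicated.
    if "//" in path:
        problems.append("consecutive forward slashes")
    # Assumption: any segment equal to the init path name counts as inclusion.
    if SQS_INIT_SEGMENT in path.split("/"):
        problems.append("includes the SQS initialization path")
    return problems


for candidate in ["/data/2024", "/data/2024//logs", "/sqs-init-path/"]:
    print(candidate, path_problems(candidate))
```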

Configure Amazon S3 as a source with the UI

From the Dashboard, select a product under Products.
In the Filesystems & Agents menu, select Filesystems.
Select Add source filesystem.
Select Amazon S3 from the Filesystem Type dropdown list.
Enter the following details:
- Display Name - The name you want to give your source filesystem.
- Bucket Name - The reference name of the Amazon S3 bucket you are using.
- Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint.
  The Authentication Method options available include:
  - Access Key and Secret org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
    Use this provider to enter credentials as an access key and secret access key with the following entries:
    - Access Key - Enter the AWS access key. For example, RANDOMSTRINGACCESSKEY.
    - Secret Key - Enter the secret key that corresponds with your access key. For example, RANDOMSTRINGPASSWORD.
  - AWS Identity and Access Management com.amazonaws.auth.InstanceProfileCredentialsProvider
    Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.
  - AWS Hierarchical Credential Chain com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    A commonly used credentials provider chain that looks for credentials in this order:
    - Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
    - Java System Properties - aws.accessKeyId and aws.secretKey.
    - Web Identity Token credentials from the environment or container.
    - Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
    - Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
    - Instance profile credentials delivered through the Amazon EC2 metadata service.
  - Environment Variables com.amazonaws.auth.EnvironmentVariableCredentialsProvider
    Use this provider to enter an access key and a secret access key as either AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
  - EC2 Instance Metadata Credentials com.amazonaws.auth.InstanceProfileCredentialsProvider
    Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.
  - Profile Credentials Provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider
    Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named credentials in a folder named .aws in your home directory.
    Enter an AWS Named Profile and a Credentials File Path. For example, ~/.aws/credentials.
    For more information, see Using the AWS Credentials File and Credential Profiles.
  - Custom Provider Class
    Use this if you want to enter your own class for the credentials provider.
- JCEKS Keystore
  This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS). The keystore must contain key/value pairs for the access key fs.s3a.access.key and the secret key fs.s3a.secret.key.
  info
  You must configure HDFS as a target to be able to select JCEKS Keystore. The HDFS resource must exist on the same Data Migrator instance as the Amazon S3 filesystem you're adding. Due to this dependency, be aware of Backup and Restore limitations before performing a backup with this configuration.
  - JCEKS HDFS - Select the HDFS filesystem where your JCEKS file is located.
  - JCEKS Keystore Path - Enter the path containing the JCEKS keystore. For example, jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks.
    JCEKS on HDFS with Kerberos - You must add the dfs.namenode.kerberos.principal.pattern configuration property.
    Include the following steps when you add an HDFS source or target with Kerberos:
  1. Under Additional Configuration, select Configuration Property Overrides from the dropdown.
  2. Select + Add Key/Value Pair and add the key dfs.namenode.kerberos.principal.pattern and the value *.
  3. Select Save, then restart Data Migrator.
    note
    If you remove filesystems configured with JCEKS authentication, remove any Amazon S3 filesystems before you remove an HDFS source.
- S3 Service Endpoint - The endpoint for the source AWS S3 bucket. See --endpoint in the S3A parameters.
- Simple Queue Service (SQS) Endpoints (Optional)
  Data Migrator listens to the event queue to continually migrate changes from source file paths to target filesystem(s).
  If you add an S3 source, you have 3 options regarding the queue:
  - Add the source without a queue. Data Migrator creates a queue automatically.
    If you want Data Migrator to create its own queue, ensure your account has the necessary permissions to create and manage SQS queues and attach them to S3 buckets.
  - Add the source and enter a queue but no endpoint. This allows you to use a queue that exists in a public endpoint.
    If you define your own queue, the queue must be attached to the S3 bucket.
    For more information about adding queues to buckets, see the AWS documentation.
  - Add the source and enter a queue and a service endpoint. The endpoint can be a public or a private endpoint.
    For more information about public endpoints, see the Amazon SQS endpoints documentation.
    - Queue - Enter the name of your SQS queue. This field is mandatory if you enter an SQS endpoint.
    - Endpoint - Enter the URL that you want Data Migrator to use. If you're using a Virtual Private Cloud (VPC), you must enter an endpoint.
    note
    You can set an Amazon Simple Notification Service (Amazon SNS) topic as the delivery target of the S3 event.
    Ensure you enable raw message delivery when you subscribe the SQS queue to the SNS topic.
    For more information, see the Amazon SNS documentation.
    Migration events expire after 14 days
    Data Migrator uses SQS messages to track changes to an S3 source filesystem. The maximum retention time for SQS messages is 14 days, which means events are lost after that time and can't be read by a migration.
    If you haven't used Data Migrator or have paused your S3 migrations for 14 days, we recommend you reset your S3 migrations.
    Purge your SQS queue
    Your SQS queue starts to capture events as soon as it's created and live. After queue creation, it may capture irrelevant events up to the time you start your first migration. As Data Migrator will need to consume these events, we recommend you purge your SQS queue just prior to first use.
- S3A Properties (Optional) - Override properties or enter additional properties by adding key/value pairs.
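
The queue options and the 14-day retention limit described under Simple Queue Service (SQS) Endpoints can be summarized in a few lines of Python. This is a sketch of the rules as stated in this guide, not Data Migrator code; the function names and the example endpoint URL are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Maximum SQS message retention stated in this guide.
SQS_MAX_RETENTION = timedelta(days=14)


def describe_queue_option(queue: Optional[str] = None,
                          endpoint: Optional[str] = None) -> str:
    """Map the Queue and Endpoint fields to one of the three options."""
    if endpoint and not queue:
        # The Queue field is mandatory if you enter an SQS endpoint.
        raise ValueError("Queue is mandatory when an SQS endpoint is entered")
    if not queue:
        return "no queue: Data Migrator creates one automatically"
    if not endpoint:
        return "queue only: existing queue in a public endpoint"
    return "queue and endpoint: existing queue at a public or private endpoint"


def reset_recommended(last_active: datetime, now: datetime) -> bool:
    """True if S3 migrations were idle long enough for events to expire."""
    return now - last_active >= SQS_MAX_RETENTION


print(describe_queue_option())
print(describe_queue_option(queue="example-queue"))
print(describe_queue_option(queue="example-queue",
                            endpoint="https://sqs.us-east-1.amazonaws.com"))
print(reset_recommended(datetime(2024, 1, 1), datetime(2024, 1, 15)))
```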

Filesystem Options

Live Migration - After existing data is moved, changes made to the source filesystem are migrated in real time using an SQS queue.
One-Time Migration - Existing data is moved to the target, after which the migration is complete and no further changes are migrated.

Removed files

If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.

One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
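
The rescan behavior for one-time migrations described at the start of this section can be pictured with a toy model. The Python sketch below treats each filesystem as a dictionary of path to content version; it only illustrates the add-or-update rule and makes no claim about how Data Migrator implements a rescan.

```python
from typing import Dict


def rescan(source: Dict[str, str], target: Dict[str, str]) -> Dict[str, str]:
    """Return the target after a rescan: add or update files, never remove."""
    result = dict(target)   # everything already on the target stays
    result.update(source)   # new files are added, existing files are updated
    return result


source = {"a.csv": "v2", "c.csv": "v1"}   # b.csv was removed from the source
target = {"a.csv": "v1", "b.csv": "v1"}   # state after the initial scan

print(rescan(source, target))
```

In this model, b.csv remains on the target even though it no longer exists on the source, a.csv is updated, and c.csv is added.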

Configure Amazon S3 as a source with the CLI

Add an Amazon S3 filesystem

Use the filesystem add s3a command with the following parameters:

Add an S3 filesystem
filesystem add s3a          [--file-system-id] string  
                            [--bucket-name] string  
                            [--endpoint] string  
                            [--access-key] string  
                            [--secret-key] string  
                            [--sqs-queue] string  
                            [--sqs-endpoint] string  
                            [--credentials-provider] string  
                            [--source]  
                            [--scan-only]  
                            [--properties-files] list  
                            [--properties] string  
                            [--s3type] string  
                            [--bootstrap.servers] string  
                            [--topic] string

For guidance about access, permissions, and security when adding an Amazon S3 bucket as a source filesystem, see Security best practices in IAM.

S3A mandatory parameters

--file-system-id The ID for the new filesystem resource.
--bucket-name The name of your S3 bucket.
--credentials-provider The Java class name of a credentials provider for authenticating with the Amazon S3 endpoint. In the UI, this is called Authentication Method. The provider options include:
- org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
  Use this provider to offer credentials as an access key and secret access key with the --access-key and --secret-key parameters.
- com.amazonaws.auth.InstanceProfileCredentialsProvider
  Use this provider when running Data Migrator on an Elastic Compute Cloud (EC2) instance that has an IAM role assigned with policies to access the Amazon S3 bucket.
- com.amazonaws.auth.DefaultAWSCredentialsProviderChain
  A commonly used credentials provider chain that looks for credentials in this sequence:
  - Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
  - Java System Properties - aws.accessKeyId and aws.secretKey.
  - Web Identity Token credentials from the environment or container.
  - Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
  - Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
  - Instance profile credentials delivered through the Amazon EC2 metadata service.
- com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider
  This provider supports the use of multiple AWS credentials, which are stored in a credentials file.
  When adding a source filesystem, use the following properties:
  - awsProfile - Name for the AWS profile.
  - awsCredentialsConfigFile - Path to the AWS credentials file. The default path is ~/.aws/credentials.
    For example:
    filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --properties awsProfile=<profile-name>,awsCredentialsConfigFile=</path/to/the/aws/credentials/file>
    In the CLI, you can also use --aws-profile and --aws-config-file.
    For example:
    filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --aws-profile <profile-name> --aws-config-file </path/to/the/aws/credentials/file>
    Learn more about using AWS profiles: Configuration and credential file settings.
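
Before passing a profile name to the ExtendedProfileCredentialsProvider, it can help to confirm that the profile exists in the credentials file. The following Python sketch parses credentials text in the standard AWS INI layout and checks a named profile for both keys. The function name, profile names, and key values are made-up examples; nothing here is part of Data Migrator.

```python
import configparser


def profile_has_keys(credentials_text: str, profile: str) -> bool:
    """Check that a named profile defines an access key and a secret key."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Example credentials file content with placeholder values.
EXAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = EXAMPLEDEFAULTKEY
aws_secret_access_key = exampleDefaultSecret

[profile1]
aws_access_key_id = EXAMPLEPROFILEKEY
aws_secret_access_key = exampleProfileSecret
"""

print(profile_has_keys(EXAMPLE_CREDENTIALS, "profile1"))
print(profile_has_keys(EXAMPLE_CREDENTIALS, "missing-profile"))
```

To check a real file, read its contents (for example, from ~/.aws/credentials) and pass the text to the function.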

S3A optional parameters

--access-key When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the access key with this parameter.
--secret-key When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the secret key using this parameter.
--endpoint Enter a specific endpoint to access the bucket, such as an AWS PrivateLink endpoint. If you don't enter a value, the filesystem defaults to AWS.

note

Using --endpoint supersedes fs.s3a.endpoint.

--sqs-queue [Amazon S3 as a source only] Enter an SQS queue name. This field is required if you enter an SQS endpoint. If you provide a value for this parameter, and create a migration from the Amazon S3 source, the migration is by default a live migration.
--sqs-endpoint [Amazon S3 as a source only] Enter an SQS endpoint.
--source Enter this parameter to add the filesystem as a source.
--scan-only Enter this parameter to create a static source filesystem for one-time migrations. Requires --source.
--properties-files Reference a list of existing properties files, each containing Hadoop configuration properties in the format used by core-site.xml or hdfs-site.xml.
--properties Enter properties to use in a comma-separated key/value list.
--s3type Enter the value aws.

For information on properties that are added by default for new S3A filesystems, see the Command reference s3a default properties.

For information on properties that you can customize for new S3A filesystems, see the Command reference s3a custom properties.
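
If you script the creation of sources, the mandatory and optional parameters above can be assembled into a single command line. The Python sketch below only builds the text of a filesystem add s3a command from the flags documented in this section; the function name and all example values are hypothetical, and it doesn't quote values or run the command.

```python
from typing import Dict, List, Optional


def build_add_s3a_source(
    file_system_id: str,
    bucket_name: str,
    credentials_provider: str,
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    endpoint: Optional[str] = None,
    sqs_queue: Optional[str] = None,
    sqs_endpoint: Optional[str] = None,
    scan_only: bool = False,
    properties: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a 'filesystem add s3a' command for an S3 source."""
    if sqs_endpoint and not sqs_queue:
        # A queue name is required if you enter an SQS endpoint.
        raise ValueError("--sqs-queue is required when --sqs-endpoint is given")
    parts: List[str] = [
        "filesystem add s3a",
        f"--file-system-id {file_system_id}",
        f"--bucket-name {bucket_name}",
        f"--credentials-provider {credentials_provider}",
    ]
    optional = [
        ("--access-key", access_key),
        ("--secret-key", secret_key),
        ("--endpoint", endpoint),
        ("--sqs-queue", sqs_queue),
        ("--sqs-endpoint", sqs_endpoint),
    ]
    parts += [f"{flag} {value}" for flag, value in optional if value]
    parts.append("--source")          # this sketch always adds a source
    if scan_only:
        parts.append("--scan-only")   # requires --source, added above
    if properties:
        pairs = ",".join(f"{key}={value}" for key, value in properties.items())
        parts.append(f"--properties {pairs}")
    return " ".join(parts)


print(build_add_s3a_source(
    "mysource",
    "example-bucket",
    "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    access_key="RANDOMSTRINGACCESSKEY",
    secret_key="RANDOMSTRINGPASSWORD",
    sqs_queue="example-queue",
))
```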

Update the Amazon S3 filesystem

note

To update existing filesystems, first stop all migrations associated with them. After saving updates, restart the Data Migrator service for your changes to take effect. In most supported Linux distributions, run the command service livedata-migrator restart.

Update the source filesystem with the following commands:

Command	Action
`source clear`	Delete all sources
`source delete`	Delete one source
`source show`	View the source filesystem configuration

Permissions and ownership of migrated items

When migrating files from an S3 source to an HDFS target, the user who writes the files is the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.

S3 object stores don't retain owner (RWX) permissions. Anything migrated from an S3 object store to an HDFS target has rwxrwxrwx permissions.
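
For reference, rwxrwxrwx corresponds to the octal mode 777. The short Python sketch below uses the standard library to show that mapping; it only demonstrates the notation and doesn't inspect any filesystem.

```python
import stat


def symbolic_permissions(mode: int) -> str:
    """Return the nine-character rwx string for a permission mode."""
    # stat.filemode prefixes a file-type character, which is dropped here.
    return stat.filemode(stat.S_IFREG | mode)[1:]


print(symbolic_permissions(0o777))
```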

Next steps

Configure a target filesystem to migrate data to. Then create a migration.
