Configure an Amazon S3 source
You can migrate data from Amazon Simple Storage Service (Amazon S3) by configuring an S3 bucket as a source filesystem.
Follow these steps to add an Amazon S3 bucket as a source using either the UI or CLI.
Prerequisites
You need the following:
- An Amazon S3 bucket. See the Amazon S3 bucket documentation. 
- Authentication details for your bucket. See below for more information. 
- If you're configuring your own SQS queue in AWS for live replication with Data Migrator, the queue must be attached to the S3 bucket:
  - Ensure the queue has an Access Policy applied that allows your bucket to send messages to the queue. See grant destinations permissions to S3.
  - Ensure the bucket has an event notification to the destination queue with all event types required for live replication. See enable event notifications.
  - For live replication, enable the following event types:
    - Object creation: Select All object create events, or individually select Put, Post, Copy, and Multipart upload completed.
    - Object removal: Select All object removal events, or individually select Permanently delete and Delete marker created.
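If you're setting up your own queue, the following AWS CLI sketch shows one way to create a queue and attach the required event notifications to the bucket. The bucket name, queue name, region, and account ID are placeholders, and the queue's access policy must still allow the bucket to send messages to it (see the links above):

```bash
# Create the SQS queue and look up its ARN (names and region are examples only).
aws sqs create-queue --queue-name my-migration-queue --region us-east-1
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-migration-queue \
  --attribute-names QueueArn

# Attach an event notification to the bucket covering all object create and remove events,
# which includes Put, Post, Copy, Multipart upload completed, Permanently delete,
# and Delete marker created.
aws s3api put-bucket-notification-configuration \
  --bucket my-source-bucket \
  --notification-configuration '{
    "QueueConfigurations": [
      {
        "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-migration-queue",
        "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
      }
    ]
  }'
```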
When migrating data with Amazon S3 as a source, data contained in paths with two or more consecutive forward slashes can't be replicated.
When using Amazon S3 as a source, don't include the SQS initialization path (sqs-init-path/) in any migration. Doing so causes an issue where Data Migrator prevents subsequent migrations from progressing to a Live status.
Configure Amazon S3 as a source with the UI
- From the Dashboard, select an instance under Instances. 
- In the Filesystems & Agents menu, select Filesystems. 
- Select Add source filesystem. 
- Select Amazon S3 from the Filesystem Type dropdown list. 
- Enter the following details:
  - Display Name - The name you want to give your source filesystem.
  - Bucket Name - The reference name of the Amazon S3 bucket you are using.
  - Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint. The Authentication Method options available include:
    - Access Key and Secret (`org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`) - Use this provider to enter credentials as an access key and secret access key with the following entries:
      - Access Key - Enter the AWS access key. For example, `RANDOMSTRINGACCESSKEY`. If you have configured a Vault for secrets storage, use a reference to the value stored in your secrets store.
      - Secret Key - Enter the secret key that corresponds with your access key. For example, `RANDOMSTRINGPASSWORD`. If you have configured a Vault for secrets storage, use a reference to the value stored in your secrets store.
    - AWS Identity and Access Management (`com.amazonaws.auth.InstanceProfileCredentialsProvider`) - Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.
    - AWS Hierarchical Credential Chain (`com.amazonaws.auth.DefaultAWSCredentialsProviderChain`) - A commonly used credentials provider chain that looks for credentials in this order:
      - Environment variables: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
      - Java system properties: `aws.accessKeyId` and `aws.secretKey`.
      - Web Identity Token credentials from the environment or container.
      - Credential profiles file at the default location (`~/.aws/credentials`) shared by all AWS SDKs and the AWS CLI.
      - Credentials delivered through the Amazon EC2 container service if the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable is set and the security manager has permission to access the variable.
      - Instance profile credentials delivered through the Amazon EC2 metadata service.
    - Environment Variables (`com.amazonaws.auth.EnvironmentVariableCredentialsProvider`) - Use this provider to enter an access key and a secret access key as either `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
    - EC2 Instance Metadata Credentials (`com.amazonaws.auth.InstanceProfileCredentialsProvider`) - Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.
    - Profile Credentials Provider (`com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider`) - Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named `credentials` in a folder named `.aws` in your home directory. Enter an AWS Named Profile and a Credentials File Path. For example, `~/.aws/credentials`. For more information, see Using the AWS Credentials File and Credential Profiles.
    - Custom Provider Class - Use this if you want to enter your own class for the credentials provider.
    - JCEKS Keystore - This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS). The keystore must contain key/value pairs for the access key `fs.s3a.access.key` and the secret key `fs.s3a.secret.key`. See the example keystore commands after this list.

      info: You must configure HDFS as a target to be able to select JCEKS Keystore. The HDFS resource must exist on the same Data Migrator instance as the Amazon S3 filesystem you're adding. Because of this dependency, be aware of Backup and Restore limitations before performing a backup with this configuration.

      - JCEKS HDFS - Select the HDFS filesystem where your JCEKS file is located.
      - JCEKS Keystore Path - Enter the path containing the JCEKS keystore. For example, `jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks`.

      JCEKS on HDFS with Kerberos - You must add the `dfs.namenode.kerberos.principal.pattern` configuration property. Include the following steps when you add an HDFS source or target with Kerberos:

      1. Under Additional Configuration, select Configuration Property Overrides from the dropdown.
      2. Select + Add Key/Value Pair and add the key `dfs.namenode.kerberos.principal.pattern` and the value `*`.
      3. Select Save, then restart Data Migrator.

      note: If you remove filesystems configured with JCEKS authentication, remove any Amazon S3 filesystems before you remove an HDFS source.
  - S3 Service Endpoint - The endpoint for the source AWS S3 bucket. See `--endpoint` in the S3A parameters.
  - Simple Queue Service (SQS) Endpoints (Optional)

    Data Migrator listens to the event queue to continually migrate changes from source file paths to target filesystems. If you add an S3 source, you have three options for the queue:

    - Add the source without a queue. Data Migrator creates a queue automatically. If you want Data Migrator to create its own queue, ensure your account has the necessary permissions to create and manage SQS queues and attach them to S3 buckets.
    - Add the source and enter a queue but no endpoint. This allows you to use a queue that exists in a public endpoint. If you define your own queue, the queue must be attached to the S3 bucket. For more information about adding queues to buckets, see the AWS documentation.
    - Add the source and enter a queue and a service endpoint. The endpoint can be a public or a private endpoint. For more information about public endpoints, see the Amazon SQS endpoints documentation.

    - Queue - Enter the name of your SQS queue. This field is mandatory if you enter an SQS endpoint.
    - Endpoint - Enter the URL that you want Data Migrator to use. If you're using a Virtual Private Cloud (VPC), you must enter an endpoint.

    note: You can set an Amazon Simple Notification Service (Amazon SNS) topic as the delivery target of the S3 event. Ensure you enable raw message delivery when you subscribe the SQS queue to the SNS topic. For more information, see the Amazon SNS documentation.

    Migration events expire after 14 days: Data Migrator uses SQS messages to track changes to an S3 source filesystem. The maximum retention time for SQS messages is 14 days, which means events are lost after that time and can't be read by a migration. If you haven't used Data Migrator or have paused your S3 migrations for 14 days, we recommend you reset your S3 migrations.

    Purge your SQS queue: Your SQS queue starts to capture events as soon as it's created and live. After queue creation, it may capture irrelevant events up to the time you start your first migration. As Data Migrator will need to consume these events, we recommend you purge your SQS queue just before first use.

  - S3A Properties (Optional) - Override properties or enter additional properties by adding key/value pairs.
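If you choose the JCEKS Keystore authentication method above, one way to create the keystore is with Hadoop's credential command. This is a minimal sketch only: the keystore path matches the example above, and the key values are placeholders for your own access key and secret key.

```bash
# Store the S3A access key and secret key in a JCEKS keystore on HDFS.
# Run as a user with write access to the chosen HDFS path.
hadoop credential create fs.s3a.access.key \
  -value RANDOMSTRINGACCESSKEY \
  -provider jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks

hadoop credential create fs.s3a.secret.key \
  -value RANDOMSTRINGPASSWORD \
  -provider jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks
```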
 
Filesystem Options
- Live Migration - After existing data is moved, changes made to the source filesystem are migrated in real time using an SQS queue. 
- One-Time Migration - Existing data is moved to the target, after which the migration is complete and no further changes are migrated. 
If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.
One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
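If you add the source with the CLI (covered in the next section), One-Time Migration behavior corresponds to the `--scan-only` flag. A minimal sketch with placeholder names, using the access key and secret credentials provider:

```
filesystem add s3a --file-system-id mys3static --bucket-name my-source-bucket --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider --access-key <access-key> --secret-key <secret-key> --source --scan-only
```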
Configure Amazon S3 as a source with the CLI
Add an Amazon S3 filesystem
Use the filesystem add s3a command with the following parameters:
filesystem add s3a          [--file-system-id] string  
                            [--bucket-name] string  
                            [--endpoint] string  
                            [--access-key] string  
                            [--secret-key] string  
                            [--sqs-queue] string  
                            [--sqs-endpoint] string  
                            [--credentials-provider] string  
                            [--source]  
                            [--scan-only]  
                            [--properties-files] list  
                            [--properties] string  
                            [--s3type] string  
                            [--bootstrap.servers] string  
                            [--topic] string
For guidance about access, permissions, and security when adding an Amazon S3 bucket as a source filesystem, see Security best practices in IAM.
S3A mandatory parameters
- `--file-system-id` The ID for the new filesystem resource.
- `--bucket-name` The name of your S3 bucket.
- `--credentials-provider` The Java class name of a credentials provider for authenticating with the Amazon S3 endpoint. In the UI, this is called Credentials Provider. The provider options include:
  - `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` - Use this provider to offer credentials as an access key and secret access key with the `--access-key` and `--secret-key` parameters.
  - `com.amazonaws.auth.InstanceProfileCredentialsProvider` - Use this provider when running Data Migrator on an Elastic Compute Cloud (EC2) instance that has an IAM role assigned with policies to access the Amazon S3 bucket.
  - `com.amazonaws.auth.DefaultAWSCredentialsProviderChain` - A commonly used credentials provider chain that looks for credentials in this sequence:
    - Environment variables: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
    - Java system properties: `aws.accessKeyId` and `aws.secretKey`.
    - Web Identity Token credentials from the environment or container.
    - Credential profiles file at the default location (`~/.aws/credentials`) shared by all AWS SDKs and the AWS CLI.
    - Credentials delivered through the Amazon EC2 container service if the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable is set and the security manager has permission to access the variable.
    - Instance profile credentials delivered through the Amazon EC2 metadata service.
  - `com.amazonaws.auth.EnvironmentVariableCredentialsProvider` - Use this provider to enter an access key and a secret access key as the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
  - `com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider` - This provider supports the use of multiple AWS credentials, which are stored in a credentials file. When adding a source filesystem, use the following properties:
    - awsProfile - Name for the AWS profile.
    - awsCredentialsConfigFile - Path to the AWS credentials file. The default path is `~/.aws/credentials`.

    For example:

        filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --properties awsProfile=<profile-name>,awsCredentialsConfigFile=</path/to/the/aws/credentials/file>

    In the CLI, you can also use `--aws-profile` and `--aws-config-file`. For example:

        filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --aws-profile <profile-name> --aws-config-file </path/to/the/aws/credentials/file>

    Learn more about using AWS profiles: Configuration and credential file settings.
 
 
S3A optional parameters
- `--access-key` When using the `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` credentials provider, enter the access key with this parameter.
- `--secret-key` When using the `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` credentials provider, enter the secret key with this parameter.
- `--endpoint` Enter a specific endpoint to access the bucket, such as an AWS PrivateLink endpoint. If you don't enter a value, the filesystem defaults to AWS. Using `--endpoint` supersedes `fs.s3a.endpoint`.
- `--sqs-queue` [Amazon S3 as a source only] Enter an SQS queue name. This field is required if you enter an SQS endpoint. If you provide a value for this parameter and create a migration from the Amazon S3 source, the migration is a live migration by default.
- `--sqs-endpoint` [Amazon S3 as a source only] Enter an SQS endpoint.
- `--source` Enter this parameter to add the filesystem as a source.
- `--scan-only` Enter this parameter to create a static source filesystem for one-time migrations. Requires `--source`.
- `--properties-files` Reference a list of existing properties files, each containing Hadoop configuration properties in the format used by `core-site.xml` or `hdfs-site.xml`.
- `--properties` Enter properties to use in a comma-separated key/value list.
- `--s3type` Enter the value `aws`.
For information on properties that are added by default for new S3A filesystems, see the Command reference s3a default properties.
For information on properties that you can customize for new S3A filesystems, see the Command reference s3a custom properties.
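As a combined illustration, the following command (with placeholder bucket, queue, and credential values) adds a live S3 source that authenticates with an access key and secret and uses an existing SQS queue:

```
filesystem add s3a --file-system-id mys3source --bucket-name my-source-bucket --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider --access-key RANDOMSTRINGACCESSKEY --secret-key RANDOMSTRINGPASSWORD --sqs-queue my-migration-queue --s3type aws --source
```

Omitting `--sqs-queue` and `--sqs-endpoint` lets Data Migrator create and attach its own queue, as described in the UI section above.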
Update the Amazon S3 filesystem
To update existing filesystems, first stop all migrations associated with them. After saving updates, restart the Data Migrator service for your changes to take effect. In most supported Linux distributions, run the command `service livedata-migrator restart`.
Manage the source filesystem with the following commands:
| Command | Action | 
|---|---|
| source clear | Delete all sources | 
| source delete | Delete one source | 
| source show | View the source filesystem configuration | 
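For example, you might confirm the source configuration from the Data Migrator CLI and then restart the service from the OS shell. This is a sketch of the workflow only; adjust it to your environment.

```
# In the Data Migrator CLI: view the current source filesystem configuration.
source show

# From the OS shell: restart Data Migrator so saved changes take effect.
service livedata-migrator restart
```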
When migrating files from an S3 source to an HDFS target, the user who writes the files is the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.
S3 object stores don't retain owner (RWX) permissions. Anything migrated from an S3 object store to an HDFS target has rwxrwxrwx permissions.
Next steps
Configure a target filesystem to migrate data to. Then create a migration.