Configure source filesystems
Configure a source filesystem for each product you want to migrate data from. The number of sources depends on your environment:
- Hadoop Distributed File System (HDFS) - Add one source filesystem only for each product.
- S3 sources (IBM Cloud Object Storage, Amazon S3) - Add one or more source filesystems.
Data Migrator supports the following filesystems as a source:
- Hadoop Distributed File System (HDFS)
- Amazon S3
- S3
- IBM Cloud Object Storage
- A local filesystem
- ADLS Gen2
- Mounted Network-Attached Storage (NAS)
Configure source filesystems with the UI
The Filesystems panel shows the source and target filesystems Data Migrator can use for data migrations.
Use the Filesystems panel to:
- View and configure source and target filesystems.
- Add or remove targets.
Add a source filesystem
To add a source filesystem from your dashboard, select the following:
- The relevant instance from the Products panel.
- Add source filesystem in the Filesystem Configuration page.
For information about configuring filesystem health check notifications and email alerts, see Configure email notifications with the UI.
If you have HDFS in your environment, Data Migrator automatically detects it as your source filesystem. However, if Kerberos is enabled, or if your Hadoop configuration doesn't contain the configuration file information required for Data Migrator to connect to Hadoop, configure an HDFS source with additional Kerberos configuration.
If you want to configure a new source manually, delete any existing source first, and then manually add a new source.
If you deleted the HDFS source that Data Migrator detected automatically, and you want to redetect it, go to the CLI and run the command `filesystem auto-discover-source hdfs`.
- HDFS
- Amazon S3
- S3
- IBM Cloud Object Storage (preview)
- Local Filesystem
- ADLS Gen2 (preview)
Configure HDFS as a source
Data Migrator uses SQS messages to track changes to an S3 source filesystem. The maximum retention time for SQS messages is 14 days, which means events are lost after that time and can't be read by a migration.
If you haven't used Data Migrator or have paused your S3 migrations for 14 days, we recommend you reset your S3 migrations.
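As a rough illustration of the 14-day limit, the check below (a sketch, not part of Data Migrator) decides whether a paused migration's queued events may already have expired:

```python
from datetime import datetime, timedelta

# SQS retains messages for at most 14 days; older events are lost.
SQS_MAX_RETENTION = timedelta(days=14)

def needs_reset(last_consumed: datetime, now: datetime) -> bool:
    """Return True if queued S3 events may have expired, meaning the
    migration should be reset before resuming."""
    return now - last_consumed > SQS_MAX_RETENTION
```

For example, a migration last active on 1 January needs a reset by 20 January, but not on 10 January.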
When using Amazon S3 as a source, do not include the SQS initialization path (`sqs-init-path/`) in any migration. Including it causes an issue that prevents subsequent migrations from progressing to a Live status.
Configure an Amazon S3 bucket as a source
Select your Data Migrator product from the Products panel.
Select Add Source Filesystem.
Select Amazon S3 from the Filesystem Type dropdown list.
Enter the following details:
Display Name - The name you want to give your source filesystem.
Bucket Name - The reference name of the Amazon S3 bucket you are using.
Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint. This isn't a mandatory parameter when you add an IBM Cloud Object Storage bucket with the UI.
The Authentication Method options available include:
Access Key and Secret - `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`
Use this provider to enter credentials as an access key and secret access key with the following entries:
- Access Key - Enter the AWS access key. For example, `RANDOMSTRINGACCESSKEY`.
- Secret Key - Enter the secret key that corresponds with your access key. For example, `RANDOMSTRINGPASSWORD`.
AWS Identity and Access Management - `com.amazonaws.auth.InstanceProfileCredentialsProvider`
Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.
AWS Hierarchical Credential Chain - `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`
A commonly used credentials provider chain that looks for credentials in this order:
- Environment variables - `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
- Java system properties - `aws.accessKeyId` and `aws.secretKey`.
- Web Identity Token credentials from the environment or container.
- Credential profiles file at the default location (`~/.aws/credentials`) shared by all AWS SDKs and the AWS CLI.
- Credentials delivered through the Amazon EC2 container service if the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable is set and the security manager has permission to access the variable.
- Instance profile credentials delivered through the Amazon EC2 metadata service.
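The chain's lookup order can be sketched as follows. This is a simplified illustration only; the real `DefaultAWSCredentialsProviderChain` also covers web identity tokens, the container service, and instance metadata:

```python
import os

def resolve_credentials(env=None, system_props=None, profile_file=None):
    """Simplified sketch of the DefaultAWSCredentialsProviderChain lookup
    order: environment variables, then Java system properties, then the
    shared credentials file. Returns (access_key, secret_key) or None."""
    env = env if env is not None else os.environ
    # 1. Environment variables, primary names first, then legacy names.
    for ak, sk in (("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
                   ("AWS_ACCESS_KEY", "AWS_SECRET_KEY")):
        if ak in env and sk in env:
            return env[ak], env[sk]
    # 2. Java system properties (modelled here as a plain dict).
    props = system_props or {}
    if "aws.accessKeyId" in props and "aws.secretKey" in props:
        return props["aws.accessKeyId"], props["aws.secretKey"]
    # 3. Shared credentials file (~/.aws/credentials), modelled as a
    #    {profile: {key: value}} dict parsed elsewhere.
    profile = (profile_file or {}).get("default", {})
    if "aws_access_key_id" in profile and "aws_secret_access_key" in profile:
        return profile["aws_access_key_id"], profile["aws_secret_access_key"]
    # Later providers (container service, instance metadata) omitted.
    return None
```

The first provider that yields a complete key pair wins; later providers are never consulted.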
Environment Variables - `com.amazonaws.auth.EnvironmentVariableCredentialsProvider`
Use this provider to enter an access key and a secret access key as either `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
EC2 Instance Metadata Credentials - `com.amazonaws.auth.InstanceProfileCredentialsProvider`
Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.
Profile Credentials Provider - `com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider`
Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named `credentials` in a folder named `.aws` in your home directory.
Enter an AWS Named Profile and a Credentials File Path. For example, `~/.aws/credentials`.
For more information, see Using the AWS Credentials File and Credential Profiles.
Custom Provider Class
Use this if you want to enter your own class for the credentials provider.
JCEKS Keystore - `hadoop.security.credential.provider.path`
This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS). The keystore needs to contain values for the access key and the secret key.
The access and secret keys are already in the keystore properties file, so you don't need to enter them once you've saved the path.
info: JCEKS Keystore can't be selected if there is no HDFS target configured. The HDFS resource must exist on the same Data Migrator instance as the Amazon S3 filesystem being added.
- JCEKS HDFS - Select the HDFS filesystem where your JCEKS file is located.
- JCEKS Keystore Path - Enter the path containing the JCEKS keystore. For example, `jceks://hdfs@active-namenode-host:8020/credentials/aws/aws.jceks`.
info: You must provide an endpoint when using JCEKS for an `s3a-vpc` type of S3 bucket.
JCEKS on HDFS with Kerberos - You must add the `dfs.namenode.kerberos.principal.pattern` configuration property. Include the following steps when you add an HDFS source or target with Kerberos:
- Under Additional Configuration, select Configuration Property Overrides from the dropdown.
- Select + Add Key/Value Pair and add the key `dfs.namenode.kerberos.principal.pattern` and the value `*`.
- Select Save, then restart Data Migrator.
The property `dfs.namenode.kerberos.principal.pattern` provides a regular expression wildcard that allows realm authentication. You need to use this if the realms on your source or target filesystems don't have matching truststores or principal patterns.
note: When deleting filesystems with JCEKS authentication configured, delete the Amazon S3 filesystem before the HDFS filesystem.
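A JCEKS provider path like the example above can be broken into its parts as follows. This is an illustrative sketch; the `jceks://fs@host:port/path` layout comes from Hadoop's credential provider API:

```python
from urllib.parse import urlparse

def parse_jceks_path(provider_path: str):
    """Split a Hadoop JCEKS provider path such as
    jceks://hdfs@active-namenode-host:8020/credentials/aws/aws.jceks
    into the backing filesystem, NameNode address, and keystore path."""
    parsed = urlparse(provider_path)
    if parsed.scheme != "jceks":
        raise ValueError("not a jceks:// provider path")
    # urlparse treats the backing scheme ('hdfs') as the user portion
    # of the netloc, so it surfaces as .username.
    backing_fs = parsed.username              # e.g. 'hdfs'
    namenode = f"{parsed.hostname}:{parsed.port}"
    return backing_fs, namenode, parsed.path
```

This makes it easier to check that the NameNode address in the keystore path matches the HDFS filesystem you selected in JCEKS HDFS.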
S3 Service Endpoint - The endpoint for the source AWS S3 bucket. See `--endpoint` in the S3A parameters.
Simple Queue Service (SQS) Endpoints (Optional)
Data Migrator listens to the event queue to continually migrate changes from source file paths to target filesystem(s).
If you add an S3 source, you have three options regarding the queue:
- Add the source without a queue. Data Migrator creates a queue automatically. If you want Data Migrator to create its own queue, ensure your account has the necessary permissions to create and manage SQS queues and attach them to S3 buckets.
- Add the source and enter a queue but no endpoint. This allows you to use a queue that exists in a public endpoint. If you define your own queue, the queue must be attached to the S3 bucket. For more information about adding queues to buckets, see the AWS documentation.
- Add the source and enter a queue and a service endpoint. The endpoint can be a public or a private endpoint. For more information about public endpoints, see the Amazon SQS endpoints documentation.
Queue - Enter the name of your SQS queue. This field is mandatory if you enter an SQS endpoint.
Endpoint - Enter the URL that you want Data Migrator to use.
note: You can set an Amazon Simple Notification Service (Amazon SNS) topic as the delivery target of the S3 event. Ensure you enable raw message delivery when you subscribe the SQS queue to the SNS topic. For more information, see the Amazon SNS documentation.
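The three queue options above imply a simple validation rule: a queue without an endpoint is fine, but an endpoint without a queue is not. As a sketch (the function and return labels are illustrative, not Data Migrator's API):

```python
def classify_sqs_settings(queue=None, endpoint=None):
    """Classify the SQS configuration for an S3 source, mirroring the
    three supported options. The Queue field is mandatory whenever an
    SQS endpoint is entered."""
    if queue is None and endpoint is None:
        return "auto-create queue"
    if queue is not None and endpoint is None:
        return "existing queue, public endpoint"
    if queue is not None and endpoint is not None:
        return "existing queue, custom endpoint"
    raise ValueError("Queue is mandatory when an SQS endpoint is entered")
```

Any combination other than these three should be rejected before saving the source.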
S3A Properties (Optional) - Override properties or enter additional properties by adding key/value pairs.
- Live Migration - Enabled by default, this setting allows Data Migrator to migrate ongoing changes automatically from this source to the target filesystem during a migration. If you deselect the checkbox, or if the source filesystem doesn't allow live migrations to take place, Data Migrator uses one-time migration.
If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.
One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
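The rescan behavior described above can be modelled as a one-way merge that never deletes. This is a simplified sketch that ignores timestamps and partial transfers:

```python
def rescan(source: dict, target: dict) -> dict:
    """Model a one-time-migration rescan: files present on the source
    are added to, or refreshed on, the target, but files deleted from
    the source are NOT removed from the target."""
    merged = dict(target)   # start from the existing target contents
    merged.update(source)   # add new files, overwrite changed ones
    return merged
```

So a file deleted from the source after the initial scan, such as `removed-on-source` below, survives every subsequent rescan on the target.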
Configure an S3 bucket as a source
- Select your Data Migrator product from the Products panel.
- Select Add Source Filesystem.
- Select S3 from the Filesystem Type dropdown list.
- Enter the following details:
- Display Name - The name you want to give your source filesystem.
- Bucket Name - The reference name of the S3 bucket you are using.
- Access Key - Enter the access key. For example, `RANDOMSTRINGACCESSKEY`.
- Secret Key - Enter the secret key that corresponds with your access key. For example, `RANDOMSTRINGPASSWORD`.
- S3A Properties - Data Migrator uses Hadoop's S3A library to connect to S3 filesystems. Enter key/value pairs to apply additional properties.
info: You need to define an S3 endpoint using the `fs.s3a.endpoint` parameter so that Data Migrator can connect to your source. For example, `fs.s3a.endpoint=https://example-s3-endpoint:80`.
- Select Save.
When migrating files from an S3 source to an HDFS target, the user that writes the files will be the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.
Additionally, S3 object stores don't retain RWX permissions. Anything migrated from an S3 object store to an HDFS target will have 'rwxrwxrwx' permissions.
Configure IBM Cloud Object Storage as a source (preview)
Enter the following:
- Filesystem Type - The type of filesystem source. Select IBM Cloud Object Storage.
- Display Name - The name you want to give your IBM Cloud Object Storage.
- Access Key - The access key for your authentication credentials, associated with the fixed authentication credentials provider `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`.
Although IBM Cloud Object Storage can use other providers (for example, InstanceProfileCredentialsProvider or DefaultAWSCredentialsProviderChain), they're only available in the cloud, not on-premises. As on-premises is currently the expected type of source, these other providers haven't been tested and aren't currently selectable.
- Secret Key - The secret key that corresponds with your access key, used for the `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` credentials provider.
- Bucket Name - The name of your IBM Cloud Object Storage bucket.
- Topic - The name of the Kafka topic to which the notifications will be sent.
- Endpoint - An endpoint for a Kafka broker, in a host/port format.
- Bootstrap Servers - A comma-separated list of host and port pairs that are addresses for Kafka brokers on a "bootstrap" Kafka cluster that Kafka clients use to bootstrap themselves.
- Port - The TCP port used for connection to the IBM Cloud Object Storage bucket. Default is 9092.
Migrations from IBM Cloud Object Storage use Amazon S3, along with its filesystem classes. The main difference between IBM Cloud Object Storage and Amazon S3 is in the messaging services: SQS for Amazon S3, and Kafka for IBM Cloud Object Storage.
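The Bootstrap Servers value above is a comma-separated list of host:port pairs. A small sketch of how such a value breaks down:

```python
def parse_bootstrap_servers(value: str):
    """Split a Kafka bootstrap.servers string such as
    'broker1:9092,broker2:9092' into (host, port) tuples."""
    pairs = []
    for entry in value.split(","):
        # rpartition tolerates stray whitespace around each entry.
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs
```

Each pair is one Kafka broker the client can use to bootstrap itself; the default port mentioned above is 9092.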
Configure notifications for migration
Migrating data from IBM Cloud Object Storage requires that filesystem events are fed into a Kafka-based notification service. Whenever an object is written, overwritten, or deleted using the S3 protocol, a notification is created and stored in a Kafka topic - a message category under which Kafka publishes the notifications stream.
Configure Kafka notifications
Enter the following information into the IBM Cloud Object Storage Manager web interface.
- Select the Administration tab.
- In the Notification Service section, select Configure.
- On the Notification Service Configuration page, select Add Configuration.
- In the General section, enter the following:
- Name: A name for the configuration, for example "IBM Cloud Object Storage Notifications".
- Topic: The name of the Kafka topic to which the notifications will be sent.
- Hostnames: A list of Kafka node endpoints in host:port format. Larger clusters may support multiple nodes.
- Type: Type of configuration.
- (Optional) In the Authentication section, select Enable authentication and enter your Kafka username and password.
- (Optional) In the Encryption section, select Enable TLS for Apache Kafka network connections. If the Kafka cluster is encrypted using a self-signed TLS certificate, paste the root CA key for your Kafka configuration in the Certificate PEM field.
- Select Save. A message appears confirming that the notification was created successfully, and the configuration is listed in the Notification Service Configurations table.
- Select the name of the configuration (defined in step 4) to assign vaults.
- In the Assignments section, select Change.
- In the Not Assigned tab, select vaults and select Assign to Configuration. Filter available vaults by selecting or typing a name into the Vault field.
Notification configurations can't be assigned to container vaults, mirrored vaults, vault proxies, or vaults that are migrating data. Once a notification configuration is assigned, an associated vault can't be used in a mirror, with a vault proxy, or for data migration.
Only new operations that occur after a vault is assigned to the configuration will trigger notifications.
- Select Update.
For more information, see the Apache Kafka documentation.
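Once notifications flow into the topic, Data Migrator consumes them to detect changes. As a rough sketch of the consuming side, the helper below extracts the essentials from one message. The field names (`bucket`, `key`, `operation`) are illustrative assumptions; check the actual notification payload your cluster emits before relying on them:

```python
import json

def summarize_notification(raw: bytes):
    """Sketch: pull the bucket, object key, and event type out of a
    JSON notification message from the Kafka topic. Field names here
    are assumptions for illustration, not a documented schema."""
    event = json.loads(raw)
    return event.get("bucket"), event.get("key"), event.get("operation")
```

A write, overwrite, or delete performed through the S3 protocol should each surface as one such message on the assigned topic.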
Configure a local filesystem as a source
Enter the following:
- Filesystem Type - The type of filesystem source. Select Local Filesystem.
- Display Name - Enter a name for your source filesystem.
- Mount Point - The local filesystem directory path to use as the source filesystem. You can migrate any data in the Mount Point directory.
Local filesystems don't provide change notifications, so Live Migration isn't enabled for local filesystem sources.
Configure Azure Data Lake Storage (ADLS) Gen2 as a source (preview)
ADLS Gen2 as a source is currently a preview feature, and is subject to change.
You can use ADLS Gen2 for one-time migrations only - not for live migrations.
Enter the following:
- Filesystem Type - The type of filesystem source. Select Azure Data Lake Storage (ADLS) Gen2.
- Display Name - Enter a name for your source filesystem.
- Data Lake Storage Endpoint - This defaults to `dfs.core.windows.net`.
- Authentication Type - The authentication type to use when connecting to your filesystem. Select either Shared Key or Service Principal (OAuth2). You'll be asked to enter the security details of your Azure storage account; these vary depending on which Authentication Type you select. See below.
- Use Secure Protocol - This checkbox determines whether to use TLS encryption in communication with ADLS Gen2. This is enabled by default.
The Azure storage account details necessary will vary depending on whether you selected Shared Key or Service Principal (OAuth2):
Shared key
- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Access Key - The access key associated with the Microsoft Azure account.
- Container Name - The ADLS Gen2 container you want to migrate data from.
Service principal (OAuth2)
- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Container Name - The ADLS Gen2 container you want to migrate data from.
- Client ID - The client ID (also known as application ID) for your Azure service principal.
- Secret - The client secret (also known as application secret) for the Azure service principal.
- Endpoint - The client endpoint for the Azure service principal. This will often take the form of https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token where {tenant} is the directory ID for the Azure service principal. You can enter a custom URL (such as a proxy endpoint that manually interfaces with Azure Active Directory).
Select Save.
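For the Service Principal (OAuth2) option, authentication is a standard Azure AD client-credentials token request against the endpoint described above. The sketch below builds such a request; all values are placeholders, and the storage scope shown is an assumption about how the token is scoped:

```python
from urllib.parse import urlencode

def build_token_request(tenant, client_id, client_secret):
    """Build the URL and form body for an Azure AD client-credentials
    token request, as used for OAuth2 authentication to Azure Storage.
    All argument values are placeholders."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Request a token scoped to Azure Storage (assumed scope).
        "scope": "https://storage.azure.com/.default",
    })
    return url, body
```

Here `{tenant}` is the directory ID for the Azure service principal; a custom Endpoint value (for example, a proxy) would replace the `login.microsoftonline.com` URL.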
Configure source filesystems with the CLI
Data Migrator migrates data from a single source filesystem. Data Migrator automatically detects the Hadoop Distributed File System (HDFS) it's installed on and configures it as the source filesystem. If it doesn't detect the HDFS source automatically, you can validate the source. You can override auto-discovery of any HDFS source by manually adding a source filesystem.
At this time, Azure Data Lake Storage (ADLS) Gen2 source filesystems can only be used for one-time migrations.
Use the following CLI commands to add source filesystems:
Command | Action |
---|---|
filesystem add adls2 oauth | Add an ADLS Gen 2 filesystem resource using a service principal and oauth credentials |
filesystem add adls2 sharedKey | Add an ADLS Gen 2 filesystem resource using access key credentials |
filesystem add hdfs | Add an HDFS resource |
filesystem add s3a | Add an S3 filesystem resource. You can choose this when using Amazon S3, Oracle, and IBM Cloud Object Storage. If you want to specify a required filesystem type, use the --s3type parameter. See s3a optional parameters. |
filesystem add local | Add a local or mounted NAS filesystem resource. |
Validate your source filesystem
Verify that the correct source filesystem is registered, or delete the existing one (you define a new source in the step Add a source filesystem).
If Kerberos is enabled or your Hadoop configuration doesn't contain the information needed to connect to the Hadoop filesystem, use the `filesystem auto-discover-source hdfs` command to enter your Kerberos credentials and auto-discover your source HDFS configuration.
If Kerberos is disabled, and Hadoop configuration is on the host, Data Migrator will detect the source filesystem automatically on startup.
Manage your source filesystem
Manage the source filesystem with the following commands:
Command | Action |
---|---|
source clear | Delete all sources |
source delete | Delete one source |
source show | View the source filesystem configuration |
filesystem auto-discover-source hdfs | Enter your Kerberos credentials to access your source HDFS configuration |
To update existing filesystems, first stop all migrations associated with them.
After saving updates to your configuration, restart the Data Migrator service for your updates to take effect. In most supported Linux distributions, run the command `service livedata-migrator restart`.