# Configure source filesystems
Configure source filesystems for each product that you want to migrate data from. How many source filesystems you can add depends on your environment:

- Hadoop Distributed File System (HDFS) - Add only one source filesystem for each product.
- S3 sources (IBM Cloud Object Storage, Amazon S3) - Add one or more source filesystems.

Data Migrator supports the following filesystems as a source:

- HDFS
- Amazon S3
- IBM Cloud Object Storage (preview)
- Local Filesystem
- ADLS Gen2 (preview)
## Configure source filesystems with the UI

The Filesystems panel shows the source and target filesystems Data Migrator can use for data migrations.
Use the Filesystems panel to:
- View and configure source and target filesystems.
- Add or remove targets.
### Add a source filesystem

To add a source filesystem from your dashboard, select the following:
- The relevant instance from the Products panel.
- Add source filesystem in the Filesystem Configuration page.
info
If you have HDFS in your environment, Data Migrator automatically detects it as your source filesystem. However, if Kerberos is enabled, or if your Hadoop configuration doesn't contain the configuration file information required for Data Migrator to connect to Hadoop, configure the source filesystem manually with additional Kerberos configuration settings.
If you want to configure a new source manually, delete any existing source first, and then manually add a new source.
note
If you deleted the HDFS source that Data Migrator detected automatically, and you want to redetect it, go to the CLI and run the command `filesystem auto-discover-hdfs`.
### Configure HDFS as a source

### Configure an Amazon S3 bucket as a source

1. Select your Data Migrator product from the Products panel.
2. Select Add Source Filesystem.
3. Select Amazon S3 from the Filesystem Type dropdown list.
4. Enter the following details:

- Display Name - The name you want to give your source filesystem.
- Bucket Name - The reference name of the Amazon S3 bucket you are using.
- Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint. This isn't a mandatory parameter when you add an IBM Cloud Object Storage bucket with the UI.
The Authentication Method options available include:

Access Key and Secret (`org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`)

Use this provider to enter credentials as an access key and secret access key with the following entries:

- Access Key - Enter the AWS access key. For example, `RANDOMSTRINGACCESSKEY`.
- Secret Key - Enter the secret key that corresponds with your Access Key. For example, `RANDOMSTRINGPASSWORD`.
AWS Identity and Access Management (`com.amazonaws.auth.InstanceProfileCredentialsProvider`)

Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.
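If you're not sure whether the instance has a role attached, you can check from the Data Migrator host by querying the EC2 instance metadata service. A minimal sketch (IMDSv2, which requires a session token):

```bash
# Sketch: list any IAM role attached to this EC2 instance (IMDSv2).
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```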
AWS Hierarchical Credential Chain (`com.amazonaws.auth.DefaultAWSCredentialsProviderChain`)

A commonly used credentials provider chain that looks for credentials in this order:

1. Environment variables - `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
2. Java system properties - `aws.accessKeyId` and `aws.secretKey`.
3. Web Identity Token credentials from the environment or container.
4. Credential profiles file at the default location (`~/.aws/credentials`) shared by all AWS SDKs and the AWS CLI.
5. Credentials delivered through the Amazon EC2 container service if the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable is set and the security manager has permission to access the variable.
6. Instance profile credentials delivered through the Amazon EC2 metadata service.
Environment Variables (`com.amazonaws.auth.EnvironmentVariableCredentialsProvider`)

Use this provider to enter an access key and a secret access key as either `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`.
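For example, for this provider the variables need to be present in the environment of the process that runs Data Migrator. A minimal sketch with placeholder values:

```bash
# Placeholder values - substitute your own AWS credentials.
export AWS_ACCESS_KEY_ID=RANDOMSTRINGACCESSKEY
export AWS_SECRET_ACCESS_KEY=RANDOMSTRINGPASSWORD
```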
EC2 Instance Metadata Credentials (`com.amazonaws.auth.InstanceProfileCredentialsProvider`)

Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.
Profile Credentials Provider (`com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider`)

Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named `credentials` in a folder named `.aws` in your home directory. Enter an AWS Named Profile and a Credentials File Path. For example, `~/.aws/credentials`.
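For example, you could create a credentials file containing a named profile as sketched below; the profile name `datamigrator` and the key values are placeholders:

```bash
# Sketch: create a named profile in ~/.aws/credentials.
# "datamigrator" and the key values are placeholders.
mkdir -p ~/.aws
cat >> ~/.aws/credentials <<'EOF'
[datamigrator]
aws_access_key_id = RANDOMSTRINGACCESSKEY
aws_secret_access_key = RANDOMSTRINGPASSWORD
EOF
```

You would then enter `datamigrator` as the AWS Named Profile and `~/.aws/credentials` as the Credentials File Path.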
Custom Provider Class

Use this if you want to enter your own class for the credentials provider.
JCEKS Keystore (`hadoop.security.credential.provider.path`)

This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS).
Enter the path containing the JCEKS keystore. For example, `jceks://hdfs@active-namenode-host:8020/credentials/aws/aws.jceks`.
The keystore needs to contain values for the access key and the secret key. The access key and secret key are already in the keystore, so you don't need to enter them once you've saved the path.
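If you still need to create the keystore, one option is Hadoop's credential command. This is a sketch that assumes the example path above and the standard S3A credential aliases (`fs.s3a.access.key` and `fs.s3a.secret.key`); confirm the alias names your deployment expects before relying on them:

```bash
# Sketch: store the S3 access key and secret key in a JCEKS keystore on HDFS.
# The aliases and the path are assumptions - adjust them for your deployment.
hadoop credential create fs.s3a.access.key \
  -provider jceks://hdfs@active-namenode-host:8020/credentials/aws/aws.jceks
hadoop credential create fs.s3a.secret.key \
  -provider jceks://hdfs@active-namenode-host:8020/credentials/aws/aws.jceks
```

Each command prompts you for the corresponding value.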
- JCEKS on Hadoop Distributed File System (HDFS) - Ensure you add HDFS as a source or target filesystem, as Data Migrator won't be able to find the JCEKS file otherwise.
- JCEKS on HDFS with Kerberos - You must add the `dfs.namenode.kerberos.principal.pattern` configuration property. Include the following steps when you add an HDFS source or target with Kerberos:
  1. Under Additional Configuration, select Configuration Property Overrides from the dropdown.
  2. Select + Add Key/Value Pair and add the key `dfs.namenode.kerberos.principal.pattern` and the value `*`.
  3. Select Save, then restart Data Migrator.

  The `dfs.namenode.kerberos.principal.pattern` property provides a regular expression wildcard that allows realm authentication. You need it if the realms on your source or target filesystems don't have matching truststores or principal patterns.
- Live Migration - Enabled by default, this setting allows Data Migrator to migrate ongoing changes automatically from this source to the target filesystem during a migration. If you deselect the checkbox, or if the source filesystem doesn't allow live migrations to take place, Data Migrator uses one-time migration.
### Configure IBM Cloud Object Storage as a source (preview)

Enter the following:

- Filesystem Type - The type of filesystem source. Select IBM Cloud Object Storage.
- Display Name - The name you want to give your IBM Cloud Object Storage.
- Access Key - The access key for your authentication credentials, associated with the fixed authentication credentials provider `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`.
note
Although IBM Cloud Object Storage can use other providers (for example, InstanceProfileCredentialsProvider or DefaultAWSCredentialsProviderChain), they're only available in the cloud, not on premises. Because an on-premises deployment is currently the expected type of source, these other providers haven't been tested and aren't currently selectable.
- Secret Key - The secret key that corresponds with your access key, used for the `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` credentials provider.
- Bucket Name - The name of your Cloud Object Store bucket.
- Topic - The name of the Kafka topic to which the notifications will be sent.
- Endpoint - An endpoint for a Kafka broker, in host:port format.
- Bootstrap Servers - A comma-separated list of host and port pairs that are the addresses of Kafka brokers on a "bootstrap" Kafka cluster that Kafka clients use to bootstrap themselves.
- Port - The TCP port used for the connection to the IBM Cloud Object Storage bucket. Default is 9092.
note
Migrations from IBM Cloud Object Storage use the Amazon S3 protocol and its filesystem classes. The main difference between IBM Cloud Object Storage and Amazon S3 is the messaging service: an SQS queue for Amazon S3, and Kafka for IBM Cloud Object Storage.
### Configure notifications for migration

Migrating data from IBM Cloud Object Storage requires that filesystem events are fed into a Kafka-based notification service. Whenever an object is written, overwritten, or deleted using the S3 protocol, a notification is created and stored in a Kafka topic - a message category under which Kafka publishes the notifications stream.
### Configure Kafka notifications

Enter the following information into the IBM Cloud Object Storage Manager web interface.
1. Select the Administration tab.
2. In the Notification Service section, select Configure.
3. On the Notification Service Configuration page, select Add Configuration.
4. In the General section, enter the following:
   - Name: A name for the configuration, for example "IBM Cloud Object Storage Notifications".
   - Topic: The name of the Kafka topic to which the notifications will be sent.
   - Hostnames: A list of Kafka node endpoints in host:port format. Larger clusters may support multiple nodes.
   - Type: The type of configuration.
5. (Optional) In the Authentication section, select Enable authentication and enter your Kafka username and password.
6. (Optional) In the Encryption section, select Enable TLS for Apache Kafka network connections. If the Kafka cluster is encrypted using a self-signed TLS certificate, paste the root CA key for your Kafka configuration in the Certificate PEM field.
7. Select Save. A message appears confirming that the notification was created successfully, and the configuration is listed in the Notification Service Configurations table.
8. Select the name of the configuration (defined in step 4) to assign vaults.
9. In the Assignments section, select Change.
10. In the Not Assigned tab, select vaults and select Assign to Configuration. Filter the available vaults by selecting or typing a name into the Vault field.
note
Notification configurations can't be assigned to container vaults, mirrored vaults, vault proxies, or vaults that are migrating data. Once a vault is assigned to a notification configuration, it can't be used in a mirror, with a vault proxy, or for data migration.
Only new operations that occur after a vault is assigned to the configuration will trigger notifications.
11. Select Update.
note
For more information, see the Apache Kafka documentation.
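Optionally, you can confirm that object notifications are reaching the topic by consuming a few messages with Kafka's console consumer. A sketch, where the broker address and topic name are placeholders for the values in your configuration:

```bash
# Sketch: read a few notifications from the configured topic.
# The broker address and topic name are placeholders.
kafka-console-consumer.sh \
  --bootstrap-server kafka-broker.example.com:9092 \
  --topic cos-notifications \
  --from-beginning --max-messages 5
```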
### Configure a local filesystem as a source

Enter the following:
- Filesystem Type - The type of filesystem source. Select Local Filesystem.
- Display Name - Enter a name for your source filesystem.
- Mount Point - The local filesystem directory path to use as the source filesystem. You can migrate any data in the Mount Point directory.
note
Local filesystems don't provide change notifications, so Live Migration isn't enabled for local filesystem sources.
### Configure Azure Data Lake Storage (ADLS) Gen2 as a source (preview)

note
ADLS Gen2 as a source is currently a preview feature, and is subject to change.
You can use ADLS Gen2 for one-time migrations only - not for live migrations.
Enter the following:
- Filesystem Type - The type of filesystem source. Select Azure Data Lake Storage (ADLS) Gen2.
- Display Name - Enter a name for your source filesystem.
- Data Lake Storage Endpoint - This defaults to `dfs.core.windows.net`.
- Authentication Type - The authentication type to use when connecting to your filesystem. Select either Shared Key or Service Principal (OAuth2).
- You'll be asked to enter the security details of your Azure storage account. These will vary depending on which Authentication Type you select. See below.
- Use Secure Protocol - This checkbox determines whether to use TLS encryption in communication with ADLS Gen2. This is enabled by default.
The Azure storage account details necessary will vary depending on whether you selected Shared Key or Service Principal (OAuth2):
#### Shared key

- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Access Key - The access key associated with the Microsoft Azure account.
- Container Name - The ADLS Gen2 container you want to migrate data from.
#### Service principal (OAuth2)

- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Container Name - The ADLS Gen2 container you want to migrate data from.
- Client ID - The client ID (also known as application ID) for your Azure service principal.
- Secret - The client secret (also known as application secret) for the Azure service principal.
- Endpoint - The client endpoint for the Azure service principal. This will often take the form of https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token where {tenant} is the directory ID for the Azure service principal. You can enter a custom URL (such as a proxy endpoint that manually interfaces with Azure Active Directory).
Select Save.
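If you want to sanity-check the service principal before saving, you can request a token from the endpoint directly using the client credentials flow. A sketch with placeholder values; the `https://storage.azure.com/.default` scope is an assumption based on the standard Azure Storage resource:

```bash
# Sketch: request an OAuth2 token for the service principal (placeholder values).
curl -s -X POST "https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=<client-id>" \
  -d "client_secret=<client-secret>" \
  -d "scope=https://storage.azure.com/.default"
```

A successful response contains an access_token field.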
## Configure source filesystems with the CLI

Data Migrator migrates data from a single source filesystem. Data Migrator automatically detects the Hadoop Distributed File System (HDFS) it's installed on and configures it as the source filesystem. If it doesn't detect the HDFS source automatically, you can validate the source. You can override auto-discovery of any HDFS source by manually adding a source filesystem.
note
At this time, Azure Data Lake Storage (ADLS) Gen2 source filesystems can only be used for one-time migrations.
Use the following CLI commands to add source filesystems:
| Command | Action |
|---|---|
| `filesystem add adls2 oauth` | Add an ADLS Gen2 filesystem resource using a service principal and OAuth credentials |
| `filesystem add adls2 sharedKey` | Add an ADLS Gen2 filesystem resource using access key credentials |
| `filesystem add gcs` | Add a Google Cloud Storage filesystem resource |
| `filesystem add hdfs` | Add an HDFS resource |
| `filesystem add s3a` | Add an S3 filesystem resource (choose this when using Amazon S3 or IBM Cloud Object Storage) |
| `filesystem add local` | Add a local filesystem resource |
### Validate your source filesystem

Verify that the correct source filesystem is registered, or delete the existing one (you define a new source in the step Add a source filesystem).

If Kerberos is enabled or your Hadoop configuration doesn't contain the information needed to connect to the Hadoop filesystem, use the `filesystem auto-discover-source hdfs` command to enter your Kerberos credentials and auto-discover your source HDFS configuration.
note
If Kerberos is disabled and the Hadoop configuration is on the host, Data Migrator detects the source filesystem automatically on startup.
### Manage your source filesystem

Manage the source filesystem with the following commands:
| Command | Action |
|---|---|
| `source clear` | Delete all sources |
| `source del` | Delete one source |
| `filesystem auto-discover-hdfs` | Automatically detect an HDFS source |
| `source show` | View the source filesystem configuration |
| `filesystem auto-discover-source hdfs` | Enter your Kerberos credentials to access your source HDFS configuration |
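For example, from the Data Migrator CLI you can review the current source and, if you've removed the auto-detected HDFS source, detect it again. The commands are shown without parameters; depending on your deployment you may be prompted for further details, such as Kerberos credentials:

```
source show
filesystem auto-discover-hdfs
```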
note
To update existing filesystems, first stop all migrations associated with them.
After saving updates to your configuration, you'll need to restart the Data Migrator service for your updates to take effect. In most supported Linux distributions, run the command `service livedata-migrator restart`.
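On systemd-based distributions, the same restart can usually also be issued through systemctl; the unit name below is assumed to match the service name used above:

```bash
# Restart Data Migrator so saved configuration changes take effect.
service livedata-migrator restart
# On systemd-based systems, this usually maps to:
sudo systemctl restart livedata-migrator
```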