Ready to start? Follow these steps to prepare to use Data Migrator.
Read the release notes to get the latest information about the current version of Data Migrator.
Recommended technical knowledge
- Linux operating system installation
- Disk management
- Memory monitoring and management
- Command line administration and manually editing configuration files
- Service configuration and management
- IP address allocation
- TCP/IP ports and firewall setup or server certificates (for TLS)
Cloud storage technologies
You need to be proficient with your intended target storage technologies and metadata technologies, such as:
- For Amazon Web Services, this includes:
  - Knowledge of AWS Marketplace, Amazon Simple Storage Service (Amazon S3), AWS Glue Data Catalog, and the AWS Command Line Interface (AWS CLI).
  - Understanding of storage persistence and related costs.
  - Ability to monitor and troubleshoot AWS services.
  - AWS S3
  - AWS Glue
- For Microsoft Azure, this includes:
  - Azure Data Lake Storage (ADLS) Gen2
  - Azure SQL DB
- Google Cloud Platform
  - Google Cloud Storage
- Hadoop Distributed File System (HDFS)
Activate your data
You need to understand the installation procedures described in this guide for your platform.
If you’re not confident about meeting the requirements, you can discuss a supported installation by contacting WANdisco.
- Linux host
  - See the release notes for a list of supported operating systems.
- Java 1.8 64-bit (Latest LTS version).
- Network connectivity from your Data Migrator host to your target filesystem. For example, ADLS Gen2 endpoint or S3 bucket.
- Port 8081 accessible on your Linux host (to access the UI through a web browser).
- Ensure suitable ulimit settings for your deployment. See Increasing ulimits for Data Migrator and Hive Migrator.
- If migrating from Hadoop Distributed File System (HDFS):
  - Hadoop client libraries must be installed on the Linux host.
  - You must be able to authenticate as the HDFS superuser.
  - If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the HDFS superuser must be available on the Linux host.
- Apache Hive. If you want to migrate metadata to or from Apache Hive:
  - The Hive service must be present on the cluster.
  - You need SSH/CLI access to the cluster.
  - If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the Hive service user must be available. The host for the keytab depends on whether you deploy locally, remotely, or both (see the hive agent add hive section for more information).
  - The keytab must be owned by the same user that runs Data Migrator's metadata migration component.
  - The user that the Hive Migrator principal maps to must be able to proxy from its host. If that user is hive, this is normally already the case. If not, see this Knowledge base article.
  - Ensure that the Hive metastore database (such as MySQL or PostgreSQL) can be accessed from the Data Migrator host through the JDBC connection URL. For more information, see Configure CDP target for Hive metadata migrations.
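Several of the prerequisites above can be checked from the host before you install. The following is a minimal Python sketch, not part of Data Migrator: it tests whether a TCP port accepts connections (for example, the UI port 8081 or your target storage endpoint) and reports the current open-file limit, which is one of the values controlled by ulimit. The function names are hypothetical, and the sketch does not say what limits your deployment needs; see the ulimits guidance referenced above for that.

```python
import resource
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Port 8081 is the UI port named in the prerequisites. Before installation
    # nothing listens on it, so expect False until the UI is running.
    print("UI port reachable:", can_connect("127.0.0.1", 8081))
    soft, hard = open_file_limits()
    print(f"Open files: soft={soft}, hard={hard}")
```

To test a target endpoint instead, pass its hostname and port, for example port 443 for an HTTPS storage endpoint.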
- CPU and memory: 16 CPUs, 48 GB RAM (minimum 4 CPUs, 32 GB RAM)
  - For Hadoop sources, install on an existing edge node. For busy clusters, it may be necessary to dedicate an edge node, use a higher spec machine, or deploy data transfer agents to spread the load across the cluster.
- Storage: 200 GB (minimum 100 GB)
  - SSD-based storage is recommended.
- Network: 2 Gbps minimum network capacity
  - Your network bandwidth must be able to cope with transferring data and ongoing changes from your source filesystem.
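To judge whether a given network capacity can cope with your data, it helps to work through the arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 10^12 bytes), a link used entirely for migration, and no protocol overhead, so real transfers will take longer. The function names and the example figures (100 TB of data, 1 TB of daily change) are hypothetical.

```python
def transfer_hours(data_tb: float, link_gbps: float) -> float:
    """Hours needed to move data_tb terabytes over a link of link_gbps."""
    bits = data_tb * 1e12 * 8           # terabytes -> bits (decimal units)
    seconds = bits / (link_gbps * 1e9)  # gigabits per second -> bits per second
    return seconds / 3600


def change_rate_gbps(change_tb_per_day: float) -> float:
    """Average bandwidth consumed by a daily volume of ongoing changes."""
    bits_per_day = change_tb_per_day * 1e12 * 8
    return bits_per_day / 86_400 / 1e9


if __name__ == "__main__":
    # Hypothetical figures: 100 TB of initial data on the 2 Gbps minimum link.
    print(f"{transfer_hours(100, 2):.1f} hours")  # 111.1 hours
    # Hypothetical figure: 1 TB of changes per day on the source filesystem.
    print(f"{change_rate_gbps(1):.3f} Gbps")      # 0.093 Gbps
```

The ongoing change rate has to stay well below the link capacity, otherwise the initial transfer never catches up with new changes.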
Below are the requirements for hosting the WANdisco® UI on a stand-alone server (optional and requires manual installation of the UI).
UI server specification
Processors/Memory: Minimum 4 CPUs, 8 GB RAM
The default UI memory configuration uses 4 GB RAM. This can be increased in /etc/wandisco/ui/vars.env if there is additional capacity on the server.
Disk: Minimum 100 GB SSD-based storage
The UI database is not large, but over a long period, detailed logging can take up significant disk space.
Logging configuration can be tuned in the configuration file.
Example specification for a deployment of data transfer agents with Data Migrator
For a 10 Gbps network capacity, have 8 CPUs and 16 GB RAM available. You may host more agents on smaller machines, if preferred. If your network capacity is higher than 10 Gbps, you can increase the specifications of your agent host to take advantage of the available bandwidth. Beyond a certain point, increasing network capacity won’t necessarily increase your data throughput, because other bottlenecks may occur.
In general, the RAM and CPU should be enough to saturate your available bandwidth for data migrations. If both source and target are HDFS filesystems that don’t use encryption and have cheap checksums, you can lower these specifications. If the machine has, for example, 25 Gbps network capacity, increase the CPUs and RAM to support it.
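As a rough starting point for sizing, the 10 Gbps example above can be scaled to other network capacities. The sketch below assumes that CPU and RAM scale linearly with bandwidth. That is an assumption made for illustration, not guidance from this page, which only says to increase or lower the specification. The function name is hypothetical.

```python
import math
from typing import Tuple

# Reference point from the example specification: 10 Gbps -> 8 CPUs, 16 GB RAM.
REF_GBPS, REF_CPUS, REF_RAM_GB = 10, 8, 16


def agent_spec(network_gbps: float) -> Tuple[int, int]:
    """Return (cpus, ram_gb) scaled linearly from the 10 Gbps reference.

    Linear scaling is an assumption; treat the result as a first estimate.
    """
    if network_gbps <= 0:
        raise ValueError("network_gbps must be positive")
    factor = network_gbps / REF_GBPS
    return math.ceil(REF_CPUS * factor), math.ceil(REF_RAM_GB * factor)


if __name__ == "__main__":
    print(agent_spec(25))  # (20, 40)
```

Because other bottlenecks appear at higher capacities, and unencrypted HDFS-to-HDFS migrations need less, adjust the estimate after measuring actual throughput.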
If you need more guidance for your specific use case, please contact WANdisco Support.
Production use configuration
We recommend you configure data migration properties on your Hadoop Distributed File System to ensure smooth operation.
Once you have all the prerequisites, set up your network and then install Data Migrator. You can also configure your HDFS NameNode for optimal performance.