Prerequisites
Ready to start? Follow these steps to prepare to use Data Migrator.
Read the release notes to get the latest information about the current version of Data Migrator.
Recommended technical knowledge
System administration
- Linux operating system installation
- Disk management
- Memory monitoring and management
- Command line administration and manually editing configuration files
- Service configuration and management
Networking
- IP address allocation
- TCP/IP ports and firewall setup
- Server certificates (for TLS)
Cloud storage technologies
You need to be proficient with your intended target storage and metadata technologies, such as:
- AWS
  - Amazon S3
  - AWS Glue
  - For Amazon Web Services, this includes:
    - Knowledge of AWS Marketplace, Amazon Simple Storage Service (Amazon S3), AWS Glue Data Catalog, and the AWS Command Line Interface (AWS CLI).
    - Understanding of storage persistence and related costs.
    - Ability to monitor and troubleshoot AWS services.
- Azure
  - Azure Data Lake Storage (ADLS) Gen2
  - Azure SQL DB
  - Databricks
- Google Cloud Platform
  - Google Cloud Storage
  - Dataproc
- Hadoop
  - Hadoop Distributed File System (HDFS)
  - Hive
- Snowflake
Installation
You need to understand the installation procedures described in this guide for your platform.
If you’re not confident about meeting the requirements, you can discuss a supported installation by contacting us.
Prerequisites
- Linux host
- See the Supported Operating Systems for a list of supported operating systems.
- Java 1.8 64-bit (Java 8, latest LTS version) or Java 11.
Oracle identified a security vulnerability in older versions of Java 8 related to the use of TLSv1.0 and TLSv1.1, which are now considered insecure. See the Oracle Java Bug Database: JDK-8202343.
For this reason, Java 8, update 291 (Java 1.8.0_291) or later is strongly recommended. These versions include important security patches and stronger support for modern TLS protocols.
If you must use an update prior to Java 8, update 291 (Java 1.8.0_291), disable TLSv1.0 and TLSv1.1 manually. See Disabling TLS 1.0 and TLS 1.1 on Java.com. A quick way to check the installed Java version and its TLS settings is sketched after this list.
- Network connectivity from your Data Migrator host to your target filesystem. For example, an ADLS Gen2 endpoint or an S3 bucket.
- Port 8081 accessible on your Linux host (to access the UI through a web browser).
- Ensure suitable ulimit settings for your deployment. See Increasing ulimits for Data Migrator and Hive Migrator.
- If migrating from Hadoop Distributed File System (HDFS):
- Hadoop client libraries must be installed on the Linux host.
- Ability to authenticate as the HDFS superuser (for example, `hdfs`).
- If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the HDFS superuser must be available on the Linux host. A keytab-based login check is sketched after this list.
- Apache Hive: if you want to migrate metadata to or from Apache Hive:
- The Hive service must be present on the cluster.
- SSH/CLI access to the cluster.
- If Kerberos is enabled on your Hadoop cluster, a valid keytab containing a suitable principal for the Hive service user must be available. The host for the keytab depends on whether you deploy locally, remotely, or both (see the `hive agent add hive` section for more information).
- The keytab must be owned by the same user that runs Data Migrator's metadata migration component.
- The user that the Hive Migrator principal maps to must be able to proxy from its host. If that user is `hive`, this is normally already the case. If not, see this Knowledge base article.
- Ensure that the Hive metastore database (such as MySQL or PostgreSQL) can be accessed from the Data Migrator host through the JDBC connection URL. For more information, see Configure CDP target for Hive metadata migrations. A basic connectivity check is included in the second sketch after this list.
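As a convenience, the following is a minimal sketch of how you might verify some of the host prerequisites above (Java version, TLS protocol settings, port 8081, and ulimits) from a shell on the Linux host. It assumes Bash, a firewalld-managed firewall, and that `JAVA_HOME` points at the Java installation Data Migrator will use; adapt it to your environment.

```bash
# Sketch: verify host prerequisites before installing Data Migrator.
# Assumes Bash, firewalld, and JAVA_HOME set to the Java 8/11 installation in use.

# 1. Confirm the Java version (Java 8 update 291 or later, or Java 11, is recommended).
java -version 2>&1 | head -n 1

# 2. Check which TLS protocols the JVM disables (TLSv1 and TLSv1.1 should be listed
#    on Java 8u291+; on older updates you may need to add them manually).
#    Java 8 keeps java.security under jre/lib/security; Java 11 uses conf/security.
grep -h "^jdk.tls.disabledAlgorithms" \
  "$JAVA_HOME/jre/lib/security/java.security" \
  "$JAVA_HOME/conf/security/java.security" 2>/dev/null

# 3. Confirm port 8081 will be reachable for the UI (firewalld example).
sudo firewall-cmd --list-ports | grep -q "8081/tcp" \
  || echo "Port 8081/tcp is not open; add it with: sudo firewall-cmd --permanent --add-port=8081/tcp && sudo firewall-cmd --reload"

# 4. Check the current ulimits for the shell user.
ulimit -n   # open file descriptors
ulimit -u   # max user processes
```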
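If your source is a Kerberized HDFS cluster and you plan to migrate Hive metadata, a sketch like the one below can confirm the keytab and metastore connectivity requirements listed above. The keytab path, principal, database host, and port are placeholders rather than values Data Migrator requires; substitute your own.

```bash
# Sketch: confirm Kerberos and Hive metastore prerequisites on the Data Migrator host.
# The keytab path, principal, and database host/port below are placeholders.

KEYTAB=/etc/security/keytabs/hdfs.headless.keytab   # example path; use your own keytab
PRINCIPAL=hdfs@EXAMPLE.COM                          # HDFS superuser principal

# List the principals the keytab contains, then obtain a ticket with it.
klist -kt "$KEYTAB"
kinit -kt "$KEYTAB" "$PRINCIPAL"

# Verify the ticket works by listing the HDFS root as the superuser.
hdfs dfs -ls /

# Check that the Hive metastore database is reachable on the port used in the
# JDBC connection URL (commonly 3306 for MySQL, 5432 for PostgreSQL).
nc -vz metastore-db.example.com 3306
```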
Machine specification
Processors/Memory: 16 CPUs, 48 GB RAM (minimum 4 CPUs, 32 GB RAM)
For Hadoop sources, install on an existing edge node. For busy clusters, it may be necessary to dedicate an edge node, use a higher-spec machine, or deploy data transfer agents to spread the load across the cluster.
Disk: 200 GB (minimum 100 GB)
SSD-based storage is recommended.
Network: 2 Gbps minimum network capacity
Your network bandwidth must be able to cope with transferring data and ongoing changes from your source filesystem. A rough way to check available bandwidth is sketched below.
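To sanity-check the network capacity actually available between the Data Migrator host and your target environment, you could run a simple throughput test such as the sketch below. It assumes iperf3 is installed on both ends and that you can run an iperf3 server on a host in the target network; the hostname is a placeholder.

```bash
# Sketch: rough bandwidth check between the Data Migrator host and the target network.
# Assumes iperf3 is installed on both hosts; target-host.example.com is a placeholder.

# On a host in the target network, start a server:
iperf3 -s

# On the Data Migrator host, run a 30-second test with 4 parallel streams:
iperf3 -c target-host.example.com -t 30 -P 4
```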
Below are the requirements for hosting the UI on a stand-alone server (optional and requires manual installation of the UI).
UI server specification
Processors/Memory: Minimum 4 CPUs, 8 GB RAM
The default UI memory configuration uses 4 GB RAM. This can be increased in `/etc/wandisco/ui/vars.env` if there is additional capacity on the server.
Disk: Minimum 100 GB SSD-based storage
The UI database is not large, but over a long period, detailed logging can take up significant disk space. Logging configuration can be tuned in `/etc/wandisco/ui/logback-spring.xml`.
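Before raising the UI memory allocation or changing logging, it can help to confirm the server has headroom and to review the configuration files mentioned above. The sketch below assumes a Bash shell and sudo access; the exact memory variable inside vars.env varies by release, so check the file's own comments rather than relying on any particular name.

```bash
# Sketch: check server headroom before raising the UI memory allocation.
free -h          # available RAM on the UI host
df -h /          # free disk space for the UI database and logs

# The UI memory setting lives in this file; the variable name varies by release,
# so review the file itself before changing anything.
sudo less /etc/wandisco/ui/vars.env

# Logging verbosity and retention are controlled here (standard Logback XML).
sudo less /etc/wandisco/ui/logback-spring.xml
```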
Example specification for a deployment of data transfer agents with Data Migrator
For a 10 Gbps network capacity, have 8 CPUs and 16 GB RAM available. You may host more agents on smaller machines if preferred. If your network capacity is higher than 10 Gbps, you can increase the specification of your agent hosts to take advantage of the available bandwidth. Beyond a certain point, though, increasing network capacity won't necessarily increase data throughput, because other bottlenecks may occur.
In general, provide enough RAM and CPU to saturate your available bandwidth for data migrations. If both source and target are HDFS filesystems that don't use encryption and have cheap checksums, you can lower these specifications. If a machine has, for example, 25 Gbps network capacity, increase the CPUs and RAM to support it. A quick check of a candidate agent host is sketched below.
If you need more guidance for your specific use case, please contact Support.
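As a quick way to confirm that a candidate agent host meets the 10 Gbps guideline above (8 CPUs, 16 GB RAM), you could run something like the following; the network interface name is a placeholder.

```bash
# Sketch: check whether a candidate data transfer agent host meets the
# 8 CPU / 16 GB RAM guideline for a 10 Gbps network.
nproc                                      # CPU count
free -g | awk '/^Mem:/ {print $2 " GB RAM"}'

# Link speed of the network interface (eth0 is a placeholder; use your interface name).
# Reported in Mb/s, e.g. 10000 for 10 Gbps; may be unavailable on virtual interfaces.
cat /sys/class/net/eth0/speed
```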
Supported Operating Systems
We recommend using one of the following operating systems:
- Ubuntu 18.04, 20.04
- CentOS 7
- Red Hat Enterprise Linux 7, 8, 9
On RHEL 9, Data Migrator automatically falls back to `default_jsse` (the Java Secure Socket Extension package). The OpenSSL `hcfs.ssl.channel.mode` option is not currently supported.
Production use configuration
We recommend you fully review and configure NameNode properties on your Hadoop Distributed File System (HDFS) to ensure optimum operation.
Next steps
Once you have all the prerequisites, set up your network and then install Data Migrator. You can also configure your HDFS NameNode for optimal performance.