Version: 1.22.0

Configure your HDFS cluster

Data Migrator reads events from an HDFS cluster's NameNode to track changes to data on the filesystem. The NameNode properties on this page affect how quickly Data Migrator can process changes and recover from network or storage device failures during a migration.

To optimize migration performance, follow these steps:

  1. Navigate to your cluster manager.
  2. Set the four properties to the values recommended in the table below.
  3. Restart the recommended services.
  4. Restart Data Migrator.
  5. In the Data Migrator UI, remove and re-add the HDFS source.

Select a property below for more information.

| Property | Description | Default value | Recommendation | Impact |
| --- | --- | --- | --- | --- |
| Number of extra edits retained | The number of transactions the NameNode retains. Data Migrator reads these transactions to track filesystem activity. | 1,000,000 | Increase this value to 25,000,000 to minimize the risk of losing necessary edit logs during a migration. | This setting doesn't impact cluster performance, but you need a few gigabytes of extra storage for the edits. |
| Maximum inotify events from each RPC | The maximum number of events the NameNode can send to inotify clients (including Data Migrator) in one Remote Procedure Call (RPC) response. | 1,000 | Increase this value to 100,000 to let migrations process more events with every RPC, increasing your data migrations' maximum data transfer rate. | The increased number of RPC events increases NameNode memory consumption by 1 MB. |
| NameNode checkpoint transactions | The number of transactions after which the NameNode will create a checkpoint, splitting the filesystem load by letting it read multiple, smaller checkpoints of events. | 10,000 | Use the default value. | No change. |
| Maximum extra edits segments retained | The maximum number of edit checkpoints, which contain transactions retained by the number of extra edits, that the NameNode will maintain. You typically won't need to change this. | 1,000,000 | Use the default value. | No change. |
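
If your cluster manager accepts raw hdfs-site.xml entries, the two recommended changes above could be expressed as a configuration fragment. The following is a minimal sketch that generates such a fragment. The key dfs.namenode.num.extra.edits.retained is named later on this page; dfs.namenode.inotify.max.events.per.rpc is assumed to be the Hadoop key behind "Maximum inotify events from each RPC", so confirm both names against your Hadoop version before applying them.

```python
import xml.etree.ElementTree as ET

# Recommended values from this page. The second key name is an assumption:
# it is the Hadoop property believed to match "Maximum inotify events from each RPC".
RECOMMENDED = {
    "dfs.namenode.num.extra.edits.retained": 25_000_000,
    "dfs.namenode.inotify.max.events.per.rpc": 100_000,
}


def build_hdfs_site_fragment(settings):
    """Return an hdfs-site.xml style <configuration> fragment as a string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.indent(root, space="  ")  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


fragment = build_hdfs_site_fragment(RECOMMENDED)
print(fragment)
```

The two properties that stay at their defaults are left out on purpose, so the fragment only overrides what this page recommends changing.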

Configuration properties explained

Number of extra edits retained

This is the number of additional edits (also called "events") on the system that the NameNode records in edit log files on the disk.

The NameNode creates a log of every file edit and creates checkpoints periodically to prevent these records from stacking up indefinitely. After each checkpoint, the NameNode stashes its edits as a checkpoint file and deletes the original edit logs. However, it stores the most recent edits in a log file, up to the number of edits specified in this property.

Data Migrator reads edits to keep track of filesystem activity for replication on the target filesystem during a migration. If Data Migrator can't access the expected edit logs past its current point in a migration, the migration will fail and will return the exception org.apache.hadoop.hdfs.inotify.MissingEventsException on each of its org.apache.hadoop.hdfs.DFSInotifyEventInputStream calls.

If Data Migrator loses access to HDFS for a long time during a migration, the NameNode may delete edits that Data Migrator has not yet read. When Data Migrator reconnects, it tries to resume from those deleted edits and fails.
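
The failure condition can be illustrated with a toy model. This is a simplified sketch, not how the NameNode or Data Migrator is implemented: it assumes edits are numbered with increasing transaction IDs and that exactly the most recent N edits are retained, and it ignores checkpoint and segment boundaries. The function and exception names are hypothetical.

```python
class MissingEvents(Exception):
    """Toy stand-in for the MissingEventsException named above."""


def next_readable_txid(last_read_txid, newest_txid, extra_edits_retained):
    """Return the next transaction ID a reader can resume from.

    Raises MissingEvents if the edit the reader needs has already been
    discarded (simplified model: only the newest N edits are kept).
    """
    oldest_retained = max(1, newest_txid - extra_edits_retained + 1)
    wanted = last_read_txid + 1
    if wanted < oldest_retained:
        raise MissingEvents(
            f"need txid {wanted}, oldest retained is {oldest_retained}"
        )
    return wanted


# A reader that stopped at txid 1,000 while the NameNode advanced to 3,000,000.
for retained in (1_000_000, 25_000_000):
    try:
        txid = next_readable_txid(1_000, 3_000_000, retained)
        print(f"retained={retained:,}: resume from txid {txid:,}")
    except MissingEvents as exc:
        print(f"retained={retained:,}: cannot resume ({exc})")
```

In this model, the reader fails with the default retention because the edit it needs was discarded, and resumes normally with the recommended value.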

The recommended value is suitable for most large-scale use. If you expect extremely high data edit rates or lengthy outages during migrations, increase this property's value further.
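
To judge whether the recommended value is enough for your cluster, you can estimate how long an outage the retained edits would cover. The sketch below is back-of-the-envelope arithmetic only. The edit rate of 500 edits per second is a hypothetical figure, so substitute a rate measured on your own cluster.

```python
def outage_window_hours(extra_edits_retained, edits_per_second):
    """Estimate how many hours of edits the retained logs cover.

    Assumes a constant edit rate, which real workloads rarely have.
    """
    if edits_per_second <= 0:
        raise ValueError("edits_per_second must be positive")
    return extra_edits_retained / edits_per_second / 3600


EDITS_PER_SECOND = 500  # hypothetical sustained edit rate

for retained in (1_000_000, 25_000_000):
    hours = outage_window_hours(retained, EDITS_PER_SECOND)
    print(f"{retained:,} edits retained covers about {hours:.2f} hours")
```

At this assumed rate, the default value covers roughly half an hour of edits and the recommended value covers nearly 14 hours. If your expected outage is longer than the estimate for your own edit rate, increase the property further.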

Maximum inotify events from each RPC

This is the maximum number of events the NameNode can send to Data Migrator and other inotify clients in a single Remote Procedure Call (RPC) response.

Data Migrator sends RPCs to read events on the filesystem, which it uses to detect data changes that need to be migrated. Each RPC response contains at most the number of events set by this property. On filesystems with lots of activity, the default maximum of 1,000 means the NameNode sends events more slowly than they occur on the filesystem, which causes migrations to progress much more slowly than filesystem changes.
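
The effect of the per-RPC limit can be sketched with simple arithmetic: the event rate a client can receive is capped at the per-RPC maximum multiplied by the number of RPCs completed each second. The RPC rate and change rate below are hypothetical figures chosen for illustration, not measurements of Data Migrator.

```python
def max_events_per_second(max_events_per_rpc, rpcs_per_second):
    """Upper bound on events a client can receive each second."""
    return max_events_per_rpc * rpcs_per_second


def backlog_after(seconds, changes_per_second, max_events_per_rpc, rpcs_per_second):
    """Unread events accumulated after `seconds` at constant rates."""
    drain_rate = max_events_per_second(max_events_per_rpc, rpcs_per_second)
    return max(0, (changes_per_second - drain_rate) * seconds)


RPCS_PER_SECOND = 10         # hypothetical RPC rate
CHANGES_PER_SECOND = 50_000  # hypothetical filesystem change rate

for limit in (1_000, 100_000):
    backlog = backlog_after(60, CHANGES_PER_SECOND, limit, RPCS_PER_SECOND)
    print(f"limit={limit:,}: backlog after 60 s is {backlog:,} events")
```

Under these assumptions, the default limit leaves a backlog of 2,400,000 unread events after one minute, while the recommended limit keeps up with the change rate.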

Maximum extra edits segments retained

This is the number of files containing logged edits that the NameNode will retain on the filesystem at any given time.

These edit log files hold the extra edits retained under dfs.namenode.num.extra.edits.retained. To preserve more edit logs for Data Migrator, increase the number of edits retained instead, and keep this property at its default value.

NameNode checkpoint transactions

This is the number of transactions (events) after which the NameNode will create a checkpoint, splitting the filesystem load by letting it read multiple, smaller checkpoints of events instead of a single, oversized checkpoint which could harm performance. In most cases, no modification is necessary.

Learn more

See the Hadoop documentation for more information about each of these NameNode configuration properties. See this Knowledge base article for additional information on events, actions and queues in Data Migrator.