Transient failure handling
Transient failures are temporary errors that occur when migrating data. Transient failures can have a variety of causes:
- Network interruptions
- System maintenance
- Expired access keys
- Directory quotas filled on the target filesystem
You can configure the retry duration so that you don't receive unnecessary error notifications during system maintenance or planned downtime.
Failed paths and retries
Failed paths are files and directories that Data Migrator tried to migrate from the source filesystem to the target filesystem unsuccessfully, and which Data Migrator won't retry.
To retry a failed path, you can do one or both of the following:
- Add the path to be rescanned.
- Reset the migration.
Retries are files and directories that Data Migrator tried to migrate unsuccessfully, and which Data Migrator continues to retry for a specified time period.
You can configure how long Data Migrator keeps retrying these paths, so you don't need to add paths to be rescanned or reset migrations within that period. This gives you time to renew expired access keys, increase directory quotas, and resolve similar issues.
When the configured retry duration ends for an event, the event becomes a failed path and Data Migrator stops retrying it.
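The lifecycle above can be sketched as a simple check: a path stays in the retry state while it is inside the retry window, and becomes a failed path once the window has elapsed. This is a minimal illustration, not Data Migrator's implementation. It assumes the window is measured from the first failed attempt and that a path reaching exactly the limit counts as failed; the function name path_state is hypothetical. The 43,200-second default comes from the CLI section of this guide.

```python
from datetime import datetime, timedelta, timezone

# Documented default retry duration: 43,200 seconds (12 hours).
DEFAULT_RETRY_LIMIT = timedelta(seconds=43_200)


def path_state(first_failure: datetime, now: datetime,
               retry_limit: timedelta = DEFAULT_RETRY_LIMIT) -> str:
    """Hypothetical helper: classify a path as "retrying" or "failed".

    Assumption: the retry window starts at the first failed attempt.
    Once the window has elapsed, the path is a failed path and is
    no longer retried.
    """
    if now - first_failure < retry_limit:
        return "retrying"
    return "failed"


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(path_state(start, start + timedelta(hours=3)))   # still inside the window
print(path_state(start, start + timedelta(hours=13)))  # window has ended
```

Shortening retry_limit models a smaller configured retry duration: the same failure turns into a failed path sooner.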
Prerequisites
Plan your memory usage
To keep events queued for an extended period and to avoid out-of-memory (OOM) situations, plan for enough memory to hold the number of events your migrations generate.
Each Data Migrator event consumes roughly 0.77 KB of memory. As more events are consumed and retried, memory usage grows accordingly. If your migrations are very busy, make sure a buffer of memory is available for retries.
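As a rough planning aid, the per-event figure can be turned into a memory estimate for a given number of queued events. This sketch assumes 1 KB = 1024 bytes and adds a hypothetical 25% headroom for busy migrations; both values are illustrative choices, not Data Migrator settings. The helper name estimate_retry_memory_mib is also hypothetical.

```python
# Approximate per-event footprint from this guide (about 0.77 KB).
# Assumption: 1 KB = 1024 bytes.
BYTES_PER_EVENT = 0.77 * 1024


def estimate_retry_memory_mib(queued_events: int, headroom: float = 0.25) -> float:
    """Estimate memory in MiB for queued events plus a safety buffer.

    headroom is a hypothetical extra fraction (0.25 = 25%) reserved
    for bursts of retries on busy migrations.
    """
    if queued_events < 0:
        raise ValueError("queued_events must be non-negative")
    raw_bytes = queued_events * BYTES_PER_EVENT
    return raw_bytes * (1 + headroom) / (1024 * 1024)


# One million queued events, with and without the buffer.
print(f"{estimate_retry_memory_mib(1_000_000, headroom=0):.0f} MiB without buffer")
print(f"{estimate_retry_memory_mib(1_000_000):.0f} MiB with a 25% buffer")
```

Under these assumptions, one million queued events need roughly 750 MiB before any buffer, so very busy migrations can require a significant amount of memory.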
Set up notifications
- To set up email notifications for failed file transfers, go to Configuration and select Email Notifications.
- Select the Retry occurred checkbox.
For more information, see Configure email notifications.
View details of failed files and retries
Data Migrator provides details on failed files and retries to help you identify the causes for failures. These causes can include missing permissions on a specific target file path or file sizes that exceed a threshold value.
View the following details on the Retries page:
| Metric | Description |
|---|---|
| Total Active Retries | The total number of events or operations that Data Migrator tried more than once. |
| Total paths being retried | The total number of events that this Data Migrator instance is retrying within the configured retry period for one migration. |
| Path | The path of the directory or file in the source filesystem on which Data Migrator tried an operation more than once. |
| Reason | The last reason why the operation on the path didn’t succeed. |
| Times retried | The total number of times that Data Migrator tried to migrate this path. |
| Timestamp | The time, in Coordinated Universal Time (UTC), when Data Migrator last tried to migrate this path. |
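To make the relationship between these columns concrete, the sketch below derives per-path values from a list of unsuccessful attempts: Times retried counts each recorded attempt, while Reason and Timestamp come from the most recent one. The RetryAttempt record, the summarize function, and the sample paths and reasons are all hypothetical illustrations; Data Migrator does not expose this structure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class RetryAttempt:
    # Hypothetical record of one unsuccessful attempt on a source path.
    path: str
    reason: str
    at: datetime  # UTC time of the attempt


def summarize(attempts: List[RetryAttempt]) -> Dict[str, Tuple[int, str, datetime]]:
    """Map each path to (times retried, last reason, last timestamp)."""
    summary: Dict[str, Tuple[int, str, datetime]] = {}
    # Process attempts in time order so the last write per path is the latest.
    for attempt in sorted(attempts, key=lambda a: a.at):
        count = summary[attempt.path][0] if attempt.path in summary else 0
        summary[attempt.path] = (count + 1, attempt.reason, attempt.at)
    return summary


attempts = [
    RetryAttempt("/data/sales/a.csv", "Permission denied", datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)),
    RetryAttempt("/data/sales/a.csv", "Quota exceeded", datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)),
    RetryAttempt("/data/logs", "Network timeout", datetime(2024, 1, 1, 1, 30, tzinfo=timezone.utc)),
]
per_path = summarize(attempts)
print("Total paths being retried:", len(per_path))
for path, (times, reason, last) in per_path.items():
    print(path, times, reason, last.isoformat())
```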
To view the number of retries:
- Select a migration from the Migrations panel and go to the Migration Status page. You can see the number of active retry paths on this page.
- Select View details to view more information on the Retries page.
Or
- Go to Notifications to view whether a Data Migrator instance has tried to transfer a failed file more than once.
- Select View details to see which migrations and files have failed and retried.
Configure the retry duration
You can configure the maximum time Data Migrator tries to migrate failed files. This configuration gives you more time to identify patterns of failures and the root cause, and resolve issues such as network problems, permissions on file paths, and file size thresholds.
On the Retries page, enter the number of hours you want Data Migrator to keep trying to migrate failed files. The maximum configurable duration for retries is 12 hours.
View failed files and retries in the CLI
To view the current number of actively requeuing operations across all migrations, run the diagnostics summary.
The value of the totalActionsRequeuing property is the total number of actions being retried. To run the diagnostics summary, see the Diagnostics section of this guide.
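If you save the diagnostics summary output for scripted monitoring, you can extract totalActionsRequeuing programmatically. This sketch assumes the saved output is valid JSON; its exact layout is not described here, so the code searches recursively for the key instead of assuming where it is nested. The helper names find_key and actions_requeuing are hypothetical.

```python
import json


def find_key(node, key):
    """Recursively search parsed JSON for the first value stored under key."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def actions_requeuing(summary_json: str) -> int:
    """Return totalActionsRequeuing from a saved summary (assumed to be JSON)."""
    value = find_key(json.loads(summary_json), "totalActionsRequeuing")
    if value is None:
        raise KeyError("totalActionsRequeuing not found in diagnostics summary")
    return int(value)


# Hypothetical sample; the real summary layout may differ.
sample = '{"someSection": {"totalActionsRequeuing": 17}}'
print(actions_requeuing(sample))
```

The same find_key helper can read other properties, such as numberPathRequeued, from saved output.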
To view the total number of actions being retried for one migration, run the following command:
The value of the numberPathRequeued property is the total number of paths being retried.
Configure the retry duration for failed file retries in the CLI
The default value for the retry duration is 43,200 seconds (12 hours). To configure the duration in the CLI, run the following commands:
Retrieve the current configuration
configuration get --key requeue.time.limit.seconds
configuration get requeue.time.limit.seconds
Apply new configuration
configuration set --key requeue.time.limit.seconds --value <number_of_seconds>
configuration set requeue.time.limit.seconds --value <number_of_seconds>
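The Retries page takes the retry duration in hours, while requeue.time.limit.seconds takes seconds. The sketch below converts hours to a value for the documented configuration set command. It assumes the 12-hour maximum from the Retries page also applies to the CLI value, and it rejects zero or negative durations; both limits are assumptions. The helper name requeue_limit_seconds is hypothetical.

```python
SECONDS_PER_HOUR = 3600
# Assumption: the 12-hour maximum documented for the Retries page
# also applies to requeue.time.limit.seconds.
MAX_RETRY_HOURS = 12


def requeue_limit_seconds(hours: float) -> int:
    """Convert a retry duration in hours to seconds for requeue.time.limit.seconds."""
    # Assumption: a duration of zero or less is treated as invalid here.
    if not 0 < hours <= MAX_RETRY_HOURS:
        raise ValueError(
            f"retry duration must be greater than 0 and at most {MAX_RETRY_HOURS} hours"
        )
    return int(hours * SECONDS_PER_HOUR)


for h in (1, 6, 12):
    print(f"configuration set --key requeue.time.limit.seconds --value {requeue_limit_seconds(h)}")
```

For example, 12 hours converts to 43,200 seconds, which matches the documented default.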
Learn more
To understand why paths and files fail to migrate, see Failed paths.
You can also monitor and troubleshoot Data Migrator, including its failed operations, to identify the problems that cause failures.