Version: 2.2

Auto source cleanup

note

Auto source cleanup is a feature that removes data from the source filesystem after the data is migrated successfully to a target.

To use this feature in production, please contact WANdisco Support.

info

Don't enable auto source cleanup for migrations with migration verifications.

Using auto source cleanup and migration verification on the same migration will cause the verification report to list files intentionally deleted during auto source cleanup as discrepancies.

Prerequisites

  • You have the Admin or Migration Manager role assigned.

    Learn more about user roles.

  • You have an ingest process set up to copy folders you want to migrate and clean up to a temporary space on your source filesystem.

    info

    To protect production data from being deleted, copy your data to a temporary folder.

    Add the temporary paths to your migration when you're creating it.

    Don't create a migration with auto source cleanup enabled on a production dataset path (for example, /data1). Instead, create a new folder (for example, /mig1) and make periodic copies of the data from /data1 to /mig1. When the data arrives in /mig1, Data Migrator transfers it and then removes it. This reduces the risk of inadvertently removing production data.

  • For local and NAS filesystems, Data Migrator needs write access to source files to perform auto source cleanup. Without write access, auto source cleanup won't work and files won't be removed from the source.

Use cases

If you ingest large volumes of data, Data Migrator migrates the data to your cloud target and then removes it from your source Hadoop Distributed File System (HDFS), local filesystem, or network-attached storage (NAS). This frees up space and keeps your buffer size and costs to a minimum.

Data Migrator does the following:

  • Checks that source files exist on the target before removing them from the source.
  • Ignores files you’ve specified to exclude from the migration to your target so these aren't removed from your source. See Configure exclusions.

We support this feature for the following use cases:

Manage auto source cleanup with the UI

Enable auto source cleanup

You can enable auto source cleanup when you create a migration or at any time afterward.

  1. In the WANdisco® UI, create a migration with HDFS, NAS, or local filesystem as a source, and a cloud target.
  2. Under Migration Type, select Live, Recurring, or One-time.
  3. Under Advanced Options, select Enable source cleanup.
  4. Select a deletion mode:
    • Immediately - Delete files from the source after verifying they’re on the target.
    • After a file has not changed for - Delete files from the source after a selected period of no activity.
      1. Enter the minimum number of hours or days you want files to have existed on the target before being deleted from source. Select hour(s) or day(s) from the dropdown.
        info

        Review the ingest contract for your selected deletion mode for guidance on interacting with your migration paths while auto source cleanup is enabled.

  5. Select the acknowledgement checkbox(es) to enable the auto source cleanup feature.
  6. Continue creating your migration.
note

Source files that are being updated after you’ve enabled auto source cleanup and started migrating data from the source won't be migrated or removed. You will receive notifications to this effect in Data Migrator.

info

Reenabling auto source cleanup will require a rescan of source data.
If you reenable auto source cleanup, it defaults to the previous deletion mode (Immediately or After a file has not changed for a specified amount of time).

Non-recurring migrations

Changing (enabling, disabling, or reenabling) auto source cleanup settings will reset a non-recurring migration. If the migration was in a Running, Live, Scheduled, or Completed state before the change, it will restart.

Disable auto source cleanup

You can disable auto source cleanup at any time after enabling it.

  1. In the WANdisco® UI, go to the migration for which you want to disable auto source cleanup.
  2. Select the Auto Source Cleanup tab.
  3. Uncheck the Enable source cleanup checkbox.
  4. Select Save.

This will return the migration to the state it was in before auto source cleanup was enabled. When auto source cleanup is disabled, data will not be removed from the source filesystem after the data is migrated successfully to a target.

Check if auto source cleanup is enabled

Select the migration from your dashboard and go to the Auto Source Cleanup page. If the Enable source cleanup checkbox is selected, auto source cleanup is enabled.

Monitor the cleanup

note

This section applies only to live migrations. Recurring and one-time migrations don't show unsupported events.

On the Notifications page, you can view notifications for “unsupported events” on the source.

Unsupported events are changes made to files and directories in a migration that has cleanup enabled, for example, file or path renames, or new files added to paths. Because we can’t remove source files or directories that are changing, we notify you of these events.

Manage auto source cleanup with the CLI

You can use auto source cleanup with the CLI using the migration add, migration auto source cleanup, and migration update configuration commands.

note

You can enable auto source cleanup when you create a migration or at any time afterward.

Create a new migration with auto source cleanup

Create a migration with auto source cleanup enabled using the migration add command with the --deletion-mode and --delayed-deletion-period parameters:

Create a migration with auto source cleanup and a delayed deletion period of 12 hours
migration add --migration-id migration1 --path /examplePath --target exampleTargetFS --deletion-mode DELAYED_DELETION --delayed-deletion-period 12H

Enable auto source cleanup for an existing migration

Enable auto source cleanup with the CLI using the migration auto source cleanup command:

migration auto source cleanup [--migration-id] string
                              [--deletion-mode] string
                              [--delayed-deletion-period] string
                              [--action-policy] string

Mandatory parameters

  • --migration-id The ID of the migration you want to update.

Optional parameters

  • --deletion-mode The deletion mode for the migration. There are three options available:
    • IMMEDIATE Delete files from the source after verifying they’re on the target.
    • DELAYED_DELETION Delete files from the source after a selected period of no activity. If you use DELAYED_DELETION, you need to specify a time period using the --delayed-deletion-period parameter.
    • NO_DELETION Don't delete files from the source. Use this option to disable auto source cleanup.
      info

      Review the ingest contract for your selected deletion mode for guidance on interacting with your migration paths while auto source cleanup is enabled.

  • --delayed-deletion-period The minimum number of hours (H) or days (D) you want files to have existed on the target before being deleted from source. For example, 6H (six hours).
  • --action-policy This parameter determines what happens if the migration encounters content in the target path with the same name and size. In the UI, this is called Skip Or Overwrite Settings.
    There are two options available:
    1. com.wandisco.livemigrator2.migration.OverwriteActionPolicy (default policy)
      Every file is replaced, even if file size is identical on the target storage. In the UI, this is called Overwrite.
    2. com.wandisco.livemigrator2.migration.SkipIfSizeMatchActionPolicy
      If the file size is identical between the source and target, the file is skipped. If it’s a different size, the whole file is replaced. In the UI, this is called Skip if Size Match.
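The two action policies described above can be sketched as a small decision function. This is an illustrative model only, not Data Migrator's implementation: the `ActionPolicy` enum and the `should_transfer` helper are hypothetical names, and only the two class names come from the list above.

```python
from enum import Enum
from typing import Optional


class ActionPolicy(Enum):
    # Values are the policy class names listed in the documentation.
    OVERWRITE = "com.wandisco.livemigrator2.migration.OverwriteActionPolicy"
    SKIP_IF_SIZE_MATCH = "com.wandisco.livemigrator2.migration.SkipIfSizeMatchActionPolicy"


def should_transfer(policy: ActionPolicy, source_size: int, target_size: Optional[int]) -> bool:
    """Hypothetical helper: decide whether a source file is written to the target.

    target_size is None when nothing exists at the target path yet.
    """
    if target_size is None:
        # Nothing to skip or overwrite: the file is simply migrated.
        return True
    if policy is ActionPolicy.OVERWRITE:
        # Every file is replaced, even if the sizes are identical.
        return True
    # Skip if Size Match: transfer only when the sizes differ.
    return source_size != target_size
```

For example, `should_transfer(ActionPolicy.SKIP_IF_SIZE_MATCH, 1024, 1024)` returns `False`, while the same sizes under `ActionPolicy.OVERWRITE` return `True`.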
Enable auto source cleanup with immediate deletion
migration auto source cleanup --migration-id migration1 --deletion-mode IMMEDIATE 
Enable auto source cleanup with a delayed deletion period of six hours
migration auto source cleanup --migration-id migration1 --deletion-mode DELAYED_DELETION --delayed-deletion-period 6H
info

You can't change the deletion mode of a migration after you've configured it.
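If you generate --delayed-deletion-period values in a script, it can help to validate them before calling the CLI. The following is a minimal sketch that converts values such as 6H or 12H into seconds. The `period_to_seconds` helper is hypothetical, and it assumes the value is a whole number followed by an uppercase H or D; whether the CLI accepts other forms is not covered here.

```python
import re

# Seconds per unit for the two units named in the documentation.
_UNIT_SECONDS = {"H": 3600, "D": 86400}


def period_to_seconds(period: str) -> int:
    """Convert a period such as '6H' or '2D' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([HD])", period)
    if match is None:
        raise ValueError(f"Expected a number followed by H or D, got {period!r}")
    count, unit = match.groups()
    return int(count) * _UNIT_SECONDS[unit]
```

With this helper, `period_to_seconds("12H")` returns `43200` and `period_to_seconds("1D")` returns `86400`.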

Disable auto source cleanup

note

You can disable auto source cleanup at any time after enabling it.

Disable auto source cleanup with the CLI using the migration auto source cleanup command with the --deletion-mode parameter set to NO_DELETION:

Disable auto source cleanup
migration auto source cleanup --migration-id migration1 --deletion-mode NO_DELETION
note

The deletionMode value output by the migration show command doesn't change when you disable auto source cleanup.

Check if auto source cleanup is enabled

Check if auto source cleanup is enabled using the migration show command with the --detailed flag:

Show detailed migration information
migration show --migration-id migration1 --detailed

The command output contains the following values:

"deletionMode": "DELAYED_DELETION",
"delayedDeletionPeriodSeconds": 86400,
"autoSourceCleanupEnabled": true
info

autoSourceCleanupEnabled displays true if auto source cleanup is enabled and false if disabled.
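The values above can be interpreted in a script roughly as follows. This sketch assumes the three values appear as keys of a single JSON object; the real command output contains other fields that aren't shown here, and `summarize_cleanup` is a hypothetical helper. It relies on autoSourceCleanupEnabled rather than deletionMode, because deletionMode doesn't change when you disable auto source cleanup.

```python
import json

# The three documented values, wrapped in braces so they form valid JSON.
sample_output = """
{
  "deletionMode": "DELAYED_DELETION",
  "delayedDeletionPeriodSeconds": 86400,
  "autoSourceCleanupEnabled": true
}
"""


def summarize_cleanup(details: dict) -> str:
    """Describe the cleanup state from the documented keys (hypothetical helper)."""
    if not details.get("autoSourceCleanupEnabled", False):
        return "auto source cleanup disabled"
    mode = details.get("deletionMode")
    if mode == "DELAYED_DELETION":
        hours = details["delayedDeletionPeriodSeconds"] / 3600
        return f"enabled, delayed deletion after {hours:g} hours"
    return f"enabled, mode {mode}"


print(summarize_cleanup(json.loads(sample_output)))
# prints "enabled, delayed deletion after 24 hours"
```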

Update delayed deletion period without disabling

Update existing auto source cleanup settings using the migration update configuration command:

Update the delayed deletion period to 12 hours
migration auto source cleanup --migration-id migration1 --delayed-deletion-period 12H

Reporting

To check that the correct files have been removed from your source, and to ensure you have accurate information for auditing purposes, you can access reports that you can download and share.

Reports are:

  • Created every four hours automatically.
    The reporting period for the current date is four hours.
    The first report runs for 00:00 - 03:59, the next for 04:00 - 07:59, 08:00 - 11:59, and so on.
  • Placed in a folder whose name is derived from the migration ID. The location of the folder is /opt/wandisco/livedata-migrator/db/sourcecleanup.
  • A record of all the files that have been removed from the source during cleanup.
  • A record of what has been deleted successfully.
  • Available for immediate and delayed deletes.
  • Available for download in the following file formats:
    • .jsonl (uncompressed)
    • tar.gz (compressed)
      The four-hour reports are compressed into a daily report.

You can view and download a report while a migration is still in progress.

The reporting period for archived reports is 24 hours, for example, from 00:00 to 23:59.
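To find which four-hour report covers a given deletion time, you can compute the reporting window from a timestamp. This is a minimal sketch based only on the schedule described above (00:00 - 03:59, 04:00 - 07:59, and so on); `report_window` is a hypothetical helper, and it assumes the timestamp is in the same time zone that Data Migrator uses for reporting.

```python
from datetime import datetime, timedelta


def report_window(timestamp: datetime) -> tuple[datetime, datetime]:
    """Return the first and last minute of the four-hour window containing timestamp."""
    start_hour = (timestamp.hour // 4) * 4
    start = timestamp.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    # Windows are written as 00:00 - 03:59, so the end is the last minute of the window.
    end = start + timedelta(hours=4) - timedelta(minutes=1)
    return start, end
```

For example, a file removed at 09:30 on 21 February 2023 falls in the window from 08:00 to 11:59 on that day.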

note

If a migration is reset, the reporting still captures files that were removed from the source before the migration was reset. All cleanup operations after the reset are captured in the same report. The cleanup report is simply added to a directory that contains the new name of the reset migration.

Reporting with the UI

  1. Select a data migration for which auto source cleanup is enabled.
  2. Select Auto Source Cleanup and go to the Source Cleanup History panel.
    If files were removed from the source, you can see the report files generated. Download the files to view them:
    • In the last 4 hours under Latest Reports. For example, 21.02.2023-08:00:00.jsonl, 21.02.2023-12:00:00.jsonl.
    • In the last 24 hours under Archived Reports. For example, 20.02.2023.jsonl.gz.
  3. To download reports to check which files were removed from your source filesystem and compare the results with your target filesystem, select the download icon for the report that matches your needs.

You can delete archived reports only.
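After downloading a report, you can load it in a script for auditing. The sketch below assumes each line of a .jsonl report is one JSON object, and that an archived report named like 20.02.2023.jsonl.gz is a gzip-compressed JSON Lines file; if the archive is a tar file instead, it would need to be unpacked first. The record fields aren't described on this page, so `read_report` (a hypothetical helper) returns the records without assuming any field names.

```python
import gzip
import json
from pathlib import Path


def read_report(path) -> list:
    """Load every record from a .jsonl or .jsonl.gz report (hypothetical helper)."""
    path = Path(path)
    # Choose a plain or gzip-aware opener based on the file extension.
    opener = gzip.open if path.suffix == ".gz" else open
    records = []
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore blank lines
                records.append(json.loads(line))
    return records
```

Calling `len(read_report("21.02.2023-08:00:00.jsonl"))` would then give the number of records in that report.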

View reports for deleted migrations

You can view reports for deleted migrations. After a migration is deleted in the UI, you can view the report in the directory /opt/wandisco/livedata-migrator/db/sourcecleanup. The sub-directory names for the cleanup reports are derived from the migration IDs.

Download reports for deleted migrations

You can download reports for deleted migrations using the CLI command migration deletion-report download. For more information, see the Command reference.

Ingest contracts

Immediately

  • Data Migrator can delete a file after it is made available for migration and successfully migrated to the target.

  • You can't interact with or modify paths within a migration with immediate deletion configured.

  • The only supported source filesystem operation for a migration with immediate deletion configured is moving content into the migration path atomically (using the mv command) from outside the migration.

  • If you replace existing content on the source for a migration with immediate deletion configured, there is no guarantee that the new content will be migrated. The old version of the file may be migrated and the new version deleted.

  • Depending on the Skip or Overwrite Settings, you can replace content on the target for a migration with immediate deletion configured if you verify that the path on the source is empty before writing to it.

    note

    If Data Migrator has deleted a path after successfully migrating it to the target, it is possible to rewrite the source content and expect that the new changes will be replicated to the target.

    Confirm that Data Migrator deleted a path by checking it doesn't exist on the source or checking the audit log to see if it's registered as a deleted path.

    New content written to the source path can be replicated safely by then adding a rescan directory to the path. For recurring migrations, the change will be picked up automatically in the future scan iterations.

    info

    If the target action policy for the migration is SKIP_IF_SIZE_MATCH, the new changes will only be replicated if the file size has changed.

  • In migrations with an event stream that have immediate deletion configured, Data Migrator ignores all events except for moving data into the migration from outside the migration.
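For a local or NAS source, the atomic move described in this contract might be scripted as follows. This is a sketch under stated assumptions, not a supported tool: `stage_and_move` is a hypothetical helper, the staging directory must be outside the migration path and on the same filesystem as it (otherwise `os.rename` fails instead of moving atomically), and the existence check is a simple guard rather than a race-free lock. For HDFS sources you would use the filesystem's own move command instead.

```python
import os
import shutil
from pathlib import Path


def stage_and_move(source, staging_dir, migration_dir) -> Path:
    """Copy source into a staging directory, then move it atomically into the migration path."""
    staging_dir = Path(staging_dir)
    migration_dir = Path(migration_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)

    # The slow copy happens outside the migration path, so a partial file never appears in it.
    staged = staging_dir / Path(source).name
    shutil.copy2(source, staged)

    final = migration_dir / staged.name
    if final.exists():
        # Replacing existing content is not supported with immediate deletion.
        raise FileExistsError(f"{final} already exists in the migration path")

    # On POSIX, a rename within one filesystem is atomic.
    os.rename(staged, final)
    return final
```

Using the example paths from the prerequisites, `stage_and_move("/data1/part-0001.csv", "/staging", "/mig1")` would leave the production copy in /data1 untouched and place a complete file in /mig1 in a single step. The file name and the /staging directory are illustrative.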

After a file has not changed for x days/hours

  • Data Migrator can delete each individual file after it meets the following criteria:
    • The age of the file on the source is at least equal to the delay period
    • It is a file
    • The file on the source is older than the file on the target
    • The file exists on the target and the source, and is consistent
    • The file on the target is older than the delay period
  • A file that can be deleted by Data Migrator is not guaranteed to be deleted immediately.
  • Interacting with a file that is ready for deletion (reading, appending, or replacing it) is neither safe nor recommended.
  • Delete operations are not supported while auto source cleanup is enabled. This is to prevent deletions made by Data Migrator being replicated to the target.
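The delayed deletion criteria above can be read as a single eligibility check. The following sketch is an interpretation for illustration only: `FileState` and `eligible_for_deletion` are hypothetical names, file age is assumed to be measured from the last-modified time, and "consistent" is assumed to mean that the sizes match. Data Migrator's actual checks may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileState:
    is_file: bool
    modified: datetime  # assumption: age is measured from the last-modified time
    size: int


def eligible_for_deletion(source: FileState, target: Optional[FileState],
                          delay: timedelta, now: datetime) -> bool:
    """Return True only if every delayed deletion criterion is met."""
    if target is None:
        return False  # the file must exist on both the source and the target
    if not source.is_file:
        return False  # only files are deleted
    if now - source.modified < delay:
        return False  # source age must be at least equal to the delay period
    if not source.modified < target.modified:
        return False  # the source file must be older than the target file
    if source.size != target.size:
        return False  # assumption: "consistent" means the sizes match
    if now - target.modified <= delay:
        return False  # the target file must be older than the delay period
    return True
```

Even when this check passes, the contract above still applies: an eligible file is not guaranteed to be deleted immediately.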