
5. Troubleshooting Guide

5.1 Logs

Git MultiSite has two sets of logs: one for the application, and one that records replication activity:

Application Logs

<install-dir>/git-multisite/logs

Use these logs to troubleshoot problems with running MultiSite:


drwxrwxr-x 2 wandisco wandisco 4096 Dec  5 14:20 logs

	-rw-rw-r-- 1 wandisco wandisco 32647705 Dec  6 09:50 replicator.log.20121205-142035
	-rw-rw-r-- 1 wandisco wandisco   775889 Dec  6 15:00 replicator.log.20121206-101206
	-rw-rw-r-- 1 wandisco wandisco     1666 Dec  5 13:21 setup.20121205-132141
	-rw-rw-r-- 1 wandisco wandisco     1674 Dec  5 13:53 setup.20121205-135328
	-rw-rw-r-- 1 wandisco wandisco      157 Dec  5 14:20 setup.20121205-142034
	-rw-rw-r-- 1 wandisco wandisco     1979 Dec  5 14:20 startup.20121205-142035
	-rw-rw-r-- 1 wandisco wandisco     1980 Dec  6 10:12 startup.20121206-101206
	-rw-rw-r-- 1 wandisco wandisco       94 Dec  6 09:57 stop.20121206-095714
	-rw-rw-r-- 1 wandisco wandisco     5136 Dec  6 10:26 ui.launch.log

Replication Logs

<install-dir>/git-multisite/replicator/logs

	-rw-rw-r--. 1 wandisco wandisco  371282 Jan 23 10:00 fileysys.0
	-rw-rw-r--. 1 wandisco wandisco       0 Jan 18 10:40 fileysys.0.lck
	-rw-rw-r--. 1 wandisco wandisco 1054726 Jan 23 06:29 fileysys.1
	-rw-rw-r--. 1 wandisco wandisco 1055024 Jan 17 11:38 fileysys.10
	-rw-rw-r--. 1 wandisco wandisco 1049842 Jan 17 11:38 fileysys.11
	-rw-rw-r--. 1 wandisco wandisco 1050660 Jan 17 11:38 fileysys.12
	-rw-rw-r--. 1 wandisco wandisco 1049618 Jan 17 11:37 fileysys.13
	-rw-rw-r--. 1 wandisco wandisco 1049228 Jan 17 11:37 fileysys.14
        

The fileysys.xx logs record all actions taken by the replicator and are a good starting point when investigating replication problems.

drwxrwxr-x. 2 wandisco wandisco    4096 Jan 17 11:43 stats

	-rw-rw-r--. 1 wandisco wandisco  22 Jan 17 11:43 111f4016-a23a-49d6-9769-ced78a250539
	-rw-rw-r--. 1 wandisco wandisco  22 Jan 17 11:43 420aa491-1bbb-4a6c-9cff-e8b60e911dd9

The stats files store the repository statistics used by the Dashboard widgets. These files are not intended to be readable by users.

drwxrwxr-x. 2 wandisco wandisco   12288 Jan 23 09:40 thread-dump

        -rw-rw-r--. 1 wandisco wandisco 55395 Jan 17 06:34 thread-dump-2013-01-17.06-34-00
        -rw-rw-r--. 1 wandisco wandisco 79150 Jan 17 07:34 thread-dump-2013-01-17.07-34-01    
        -rw-rw-r--. 1 wandisco wandisco 79162 Jan 17 08:34 thread-dump-2013-01-17.08-34-01

The thread-dump directory contains thread dumps that the replicator periodically writes to help with troubleshooting.
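
When you start an investigation it can help to pull only the more serious entries out of the replication logs. The following is a minimal sketch, assuming that each log entry contains its level name (such as SEVERE or WARNING) as plain text. The exact layout of a log entry is not described in this guide, and the function name show_level is hypothetical.

```shell
# Print every line in the given log files that mentions a logging level.
# Assumption: the level name appears literally in each log entry.
# Usage: show_level LEVEL FILE...
show_level() {
    level=$1
    shift
    # -w matches whole words only, so FINE does not also match FINER or FINEST.
    # The extra /dev/null argument makes grep prefix each match with its file name.
    grep -w -- "$level" "$@" /dev/null
}

# Example (not run here):
#   show_level SEVERE <install-dir>/git-multisite/replicator/logs/fileysys.0
```

Matching on the level name alone is deliberately crude. It may also pick up lines where the same word appears inside a message.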

Logging Levels

SEVERE:
Message level indicating a serious failure.
WARNING:
A message level indicating a potential problem.
INFO:
Interesting runtime events (startup/shutdown). Expect these to be immediately visible on a console, so be conservative and keep to a minimum.
CONFIG:
Details of configuration messages.
FINE:
Provides a standard level of trace information.
FINER:
Provides a more detailed level of trace information.
FINEST:
Provides the most detailed level of trace information, for troubleshooting hard-to-identify problems.

Changing the Logging Level

You can change the current logging level by editing the logger properties file <install-dir>/git-multisite/replicator/properties/logger.properties. You can see a sample logger.properties file.
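
If you prefer to script the change, the sketch below validates a level name against the list above and rewrites one key in a properties file, keeping a backup. It is only an illustration: the function names are hypothetical, and the key that controls the level in your logger.properties is an assumption. The level names match those of java.util.logging, where the root logger level is set with a `.level` key, so the usage comment assumes that key. Check your own file before relying on it.

```shell
# Return success if the argument is one of the documented logging levels.
is_valid_level() {
    case $1 in
        SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST) return 0 ;;
        *) return 1 ;;
    esac
}

# Replace the value of KEY in a Java-style properties FILE.
# A backup of the original is kept alongside it as FILE.bak.
# Usage: set_property FILE KEY VALUE
set_property() {
    file=$1 key=$2 value=$3
    cp -p -- "$file" "$file.bak" || return 1
    # Rewrite only lines of the form "KEY=..." or "KEY = ..."; print all others unchanged.
    awk -v k="$key" -v v="$value" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, k) == 1 && substr(line, length(k) + 1) ~ /^[ \t]*=/ { print k "=" v; next }
        { print }
    ' "$file.bak" > "$file"
}

# Example (not run here; ".level" is an assumed key name):
#   is_valid_level FINE && set_property logger.properties .level FINE
```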


5.2 Consistency Check

The Consistency Checker gives you a quick and easy way to check whether a selected repository remains consistent across the nodes of its replication group. Follow these steps to check consistency:

Limits of the Consistency Checker
The Consistency Check will tell you the last common revision shared between repository replicas. Given the dynamic nature of a replication group it's possible that there will be in-flight proposals in the system that have not yet been agreed upon at all nodes. For this reason it isn't possible for a consistency check to be completely authoritative.
  1. Log in to a site and click on the REPOSITORIES tab.

    Go to the repository


  2. Click on one of the listed repositories. This will activate the line of buttons above the list.

    Consistency Check is done on a per node basis


  3. Click on the Consistency check button. A growl message "Invoking consistency check on repository <Repository Name>" will appear.

    Consistency check in action


  4. Click on the DASHBOARD tab. The results of the consistency check will appear in the Replicator Tasks widget - you'll need to select All Tasks, instead of the default Pending Tasks.

    Repository replicas need to be identical - are they?


  5. The results include the following fields:
    Originating Node: The site from which the check was requested
    Lowest Revision Checked: The oldest revision compared across all nodes
    Number of Revisions Checked: Total number of revisions checked
    Repository is Consistent: The result of the check as a 'true' or 'false' statement
    Highest Revision Checked: The youngest revision compared across all nodes
    Repository Being Checked: The name of the repository that has had its consistency checked

Log results

It's also possible to check the results of a consistency check by viewing the replicator's log file (fileysys.##). See 5.1 Logs.

5.3 Copying repositories

This section provides advice on getting your repository data distributed prior to starting replication. You will need to use this procedure should you ever need to add additional nodes, or perform a recovery on a node.

Git installations must have:

These are a recap of the installation checklist. You must have them in place for replication to run effectively:

Copying Existing Repositories

It's simple enough to make a copy of a small repository and transfer it to each of your nodes. However, remember that any changes made to the original repository will invalidate your copies unless you perform a synchronization prior to starting replication.

If a repository needs to remain available to users during the process, you should briefly halt access in order to make a copy. The copy can then be transferred to each site. Then, when you are ready to begin replication, you need to use rsync to update each of your replicas. For more information about rsync, see 5.4 Synchronizing repositories using rsync.

New Repositories

If you are creating brand new repositories, don't create them at each site. Instead, create the repository once, then rsync it to the other nodes. You need to do this to ensure that each replica has the same UUID.
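
As a rough illustration of "create once, then copy", the sketch below prints the commands you would run rather than running them, so you can review the plan first. It assumes a bare repository, the same repository path on every node, and the rsync options used in section 5.4. The function name plan_new_repo and the node names are hypothetical.

```shell
# Print the commands needed to create a repository once and copy it to
# every other node, without running them.
# Usage: plan_new_repo REPO_PATH NODE...
plan_new_repo() {
    repo=${1%/}   # drop any trailing slash so paths are built consistently
    shift
    printf 'git init --bare %s\n' "$repo"
    for node in "$@"; do
        # The trailing slashes copy the directory contents into the same path remotely.
        printf 'rsync -rvlHtogpc %s/ %s:%s/\n' "$repo" "$node" "$repo"
    done
}

# Example: plan_new_repo /git/Repo5 nodeA nodeB
```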

5.4 Synchronizing repositories using rsync

If for any reason repositories are corrupted or unable to catch up automatically, it's usually possible to use rsync to get them back into sync.

Before you resync
To rsync, you'll need to copy data from a repository replica that is up-to-date. Before you do so, it's good practice to verify the repository with Git to be absolutely sure of its integrity.

Use Git's fsck command on the 'helper' site, from inside the repository:
git fsck
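
The check can be wrapped so that a script stops before copying from a damaged replica. This is a minimal sketch: it runs `git fsck` (the usual way of invoking the git-fsck command) inside the repository directory and passes on its exit status. The function name verify_repo is hypothetical.

```shell
# Run an integrity check on a repository before using it as the rsync source.
# Returns non-zero if the directory is missing or git reports a problem.
# Usage: verify_repo REPO_PATH
verify_repo() {
    repo=$1
    if [ ! -d "$repo" ]; then
        echo "no such repository directory: $repo" >&2
        return 2
    fi
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$repo" && git fsck )
}

# Example (not run here):
#   verify_repo /path/to/git/repo && echo "safe to use as rsync source"
```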

From the site with the up-to-date repository, run the following command:

rsync -rvlHtogpc /path/to/local/repo <Repository Name> remoteHost:/path/to/remote/repo

For example:
rsync -rvlHtogpc /path/to/git/repo/ root@172.7.2.33:/git

Then follow up with an additional rsync that ensures the contents of the locks directory are identical (by deleting locks that are not present on the originating server):

rsync -rvlHtogpc --delete /path/to/git/repo/db/locks <Repository Name> remoteHost:/path/to/remote/repo

For example:
rsync -rvlHtogpc --delete /path/to/git/repo/db/locks root@172.7.2.33:/git
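
Before running the full copy for real, you may want to preview what rsync would transfer. The sketch below wraps the first command shown above and optionally passes rsync's --dry-run option, which lists the changes without making them. The function name sync_repo is hypothetical. The locks pass can be previewed in the same way by adding --dry-run to that command.

```shell
# Copy a repository to a remote target, optionally as a preview only.
# Usage: sync_repo LOCAL_REPO REMOTE_TARGET [--dry-run]
#   -r recurse, -v verbose, -l keep symlinks, -H keep hard links,
#   -t keep modification times, -o keep owner, -g keep group,
#   -p keep permissions, -c compare files by checksum
sync_repo() {
    src=${1%/}
    dest=$2
    # ${3:+"$3"} expands to the third argument only when one was given.
    rsync -rvlHtogpc ${3:+"$3"} "$src/" "$dest"
}

# Example (not run here):
#   sync_repo /path/to/git/repo root@172.7.2.33:/git --dry-run
#   sync_repo /path/to/git/repo root@172.7.2.33:/git
```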

5.5 Adding a node to a replication group

Don't add a node during a period of high replication load
When adding nodes to a replication group that already contains three or more nodes, ensure that there isn't currently a large number of commits being replicated.

Adding a site during a period of high traffic (heavy level of commits) going to the repositories may cause the adding process to stall.

You can add nodes to an existing replication group with minimal disruption to users. Here's the procedure:

  1. Log in to a site and click on the REPLICATION GROUPS tab. Go to the replication group to which you will add a new site, then click on its View.
  2. The replication group pop-up will appear. Click CONFIGURE.
  3. On the replication group configuration screen you can see the existing nodes. Click the Add Node button. Click on one or more of the available nodes to add them. You can also bring in a brand new site that isn't currently connected by clicking on CONNECT NEW SITE, which will take you to the Nodes tab.
  4. Any new node(s) that you add will now appear on the screen as a blue icon tab. You'll be presented with the following information:
    Adding Nodes
    
    You are about to add "NodeAuckland" to the "ReplicationGroupGladius" replication group.
    Please read through the following steps before you continue:
     
    1. Click ADD NODES and choose an existing Node as a helper.
    2. Copy the repositories replicated in the group from the helper to the new node. During this period these repositories will be read-only on the helper node.
    3. Once a repository is in place, selecting it and clicking COMPLETE SELECTED will take it out of read-only on the helper and the new member nodes.
    
    WARNING: Do not close your browser or log out during this process. If you do you'll need to complete the sync process for each individual repository via the REPAIR option on the repositories screen.
    Click ADD NODES to confirm your selection and proceed with adding them.
  5. Select one of the existing nodes to be the 'helper' from the CHOOSE HELPER NODE drop-down selector. This helper site will be used to supply copies of the relevant repositories to the new site. During the helper phase the relevant repositories will be temporarily read-only for their local users.
  6. Click START SYNC to begin your copying process. Once you click the button, the helper node's relevant repositories will become read-only to local users.
  7. READ-ONLY ALERT
    Starting the Sync may cause a brief disruption to users on the helper node. You may wish to alert them or complete the work at a time where the disruption will be minimal.
  8. The helper site is now ready for you to start copying data across. You can manually copy the repository files or you can use rsync - see our guide: 5.4 Synchronizing repositories using rsync.
  9. Once the relevant repositories have been copied to the new site, and the copies have been checked to ensure they're up-to-date, it's time to finish the process by clicking COMPLETE ALL.
  10. The new site will now appear as a member of the replication group. It is added as a non-voter by default. To change its type, click on its icon and change it to a Voter or Tie-breaker.
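
One way to check that a copy is up-to-date before completing the sync is to compare the refs in the helper's repository with those in the new node's copy. The following is a sketch only, and the function name same_refs is hypothetical. It compares two saved listings in the format printed by `git show-ref`, one object ID and ref name per line. Matching refs show that both replicas point at the same commits. They do not prove the underlying objects are intact.

```shell
# Compare two saved ref listings (one "<object-id> <ref-name>" pair per line).
# Succeeds only if both list exactly the same refs at the same commits.
# Usage: same_refs HELPER_LISTING NEW_NODE_LISTING
same_refs() {
    a=$(sort "$1") || return 2
    b=$(sort "$2") || return 2
    [ "$a" = "$b" ]
}

# Example (not run here): capture a listing on each node with
#   git --git-dir=/path/to/repo show-ref > refs.txt
# then copy one file across and compare the pair with same_refs.
```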

5.6 Recover from the loss of a node

It's possible for Git MultiSite to recover from the brief outage of a member site, which should be able to resync once it is reconnected. The crucial requirement for MultiSite's continued operation is that agreement over transaction ordering must be able to continue. Votes must be cast and those votes must always result in an agreement - no situation must arise where the votes are evenly split between voters.

If, after the loss of a node, a replication group can no longer form agreements, then replication is halted. If the lost node was a voter, and there aren't enough remaining voters to form an agreement, then either the lost site must be repaired/reconnected, or the replication group must undergo emergency reconfiguration.
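
To make the voting requirement concrete, here is a small sketch that tests whether the voters still reachable could outvote the rest. It assumes agreement needs a strict majority of voters, which is a simplification: this guide does not spell out the exact rule, and the sketch ignores Tie-breaker nodes. The function name can_agree is hypothetical.

```shell
# Succeed if the reachable voters form a strict majority of all voters.
# Assumption: agreement requires more than half of the voters.
# Usage: can_agree TOTAL_VOTERS REACHABLE_VOTERS
can_agree() {
    total=$1 reachable=$2
    [ $((reachable * 2)) -gt "$total" ]
}

# Example: with 3 voters, losing one still leaves a majority (2 of 3).
# With 4 voters, losing two leaves an even split (2 of 4), so no majority.
```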

Emergency Reconfiguration is a final option for recovery

The emergency reconfiguration process can't be undone, and it represents a big shakeup of your replication system. Only complete an emergency reconfiguration if the lost site cannot be repaired or reconnected in an acceptable amount of time.

Gone but not forgotten
After a lost site has been removed and a replication group reconfigured, the lost site MUST NOT be allowed to come back online. Ensure that you perform a cleanup after you have completed the emergency reconfiguration.

5.7 Emergency Reconfiguration

Having confirmed that an emergency reconfiguration is required, follow this procedure:

  1. Verify the details of the site that is now declared 'lost'. Log in to the user interface of one of the remaining nodes and view the Nodes tab. The missing site will show a status of Down.

  2. Select the lost node by ticking its corresponding checkbox and then click the RECONFIGURE button.

  3. The Emergency Reconfiguration screen will appear. Check and confirm that you have selected the correct site, then click on the Start Reconfiguration button.

  4. A warning will appear, asking you to confirm that you are ready to start the process, and that once started the process can not be cancelled. Click CONFIRM if you are ready to proceed.

  5. The Reconfiguration process will now run, creating new, replacement replication groups, activating them, then removing the old groups. The process is finished when all items are listed as Complete.
    How Reconfiguration Works
    The emergency reconfiguration process seeks to recreate functional replication groups using the remaining member nodes. In situations where a replication group contained only two nodes, including the lost site, a reconfiguration is not possible. In this scenario, a new replication group will need to be created once a replacement site has been inducted.


  6. Finally, you should check the state of the reformed replication groups to ensure that they'll still perform according to your organization's requirements.

5.8 Restore replication on a problem node

It's possible that a problem with a single transaction can result in a repository being put into a read-only mode, causing the replication of this repository to stop at just one node. If possible, use the following procedure to get replication started again:

  1. The first sign that a transaction has not been able to complete on a node is when the repository is placed in a protective read-only state. This is done to ensure that it will remain in a condition in which it can be recovered and catch up. On the Repositories tab you'll see the repository is now flagged as locally read-only.
    Providing there are still enough nodes to reach agreement, repository changes at the other nodes can continue to be made.
  2. At the problem site, you now need to identify the cause of the problem. Check Git MultiSite's logs as well as the logs generated for Git users who are trying to commit changes to the problem repository. It may be possible to quickly fix the cause of the problem, such as a permission problem that has prevented a file from being written to on the node.
  3. When the problem has been fixed you can go to the Repositories tab and edit the read-only repository. Remove the Local RO (Read-only) tick. The node will then attempt to catch up and get the repository back into sync with its other replicas.
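
When you suspect a permission problem, a quick first check is to look for files in the repository that belong to the wrong account. The sketch below lists anything under a directory that is not owned by a given user. It assumes that the replicator runs as a single account that should own the repository files, such as the wandisco user shown in the log listings in section 5.1. The function name list_foreign_files is hypothetical.

```shell
# List files and directories under REPO_PATH that are not owned by OWNER.
# OWNER may be a user name or a numeric user ID.
# Usage: list_foreign_files REPO_PATH OWNER
list_foreign_files() {
    find "$1" ! -user "$2" -print
}

# Example (not run here):
#   list_foreign_files /path/to/git/repo wandisco
```

An empty result means ownership is consistent. It does not rule out other causes such as restrictive file modes or a full disk.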

5.9 Reusing IPs for Nodes

If you wish to reuse an IP that's previously been allocated to a MultiSite node make sure that it's been removed from all your existing MultiSite nodes' configuration.

If you don't, your existing nodes will detect the node on the pre-existing address, but be unable to communicate with it as the original configuration details are no longer present.

When a node fails, it is marked as no longer inducted. As MultiSite cannot determine why this happened, the configuration will remain until manually removed, at which point the network details can be used again.

5.10 Deleting Replication Groups

It's possible to remove replication groups from Git MultiSite, although only if they have been emptied of repositories. Run through the following procedure as an example.

  1. We have identified that replication group "VinyardRepos" is to be removed from Git MultiSite. We can see that it has a single repository associated with it. Click on the Quick View to see which one.
  2. Click on Configure.
  3. On the Replication Group configuration screen we can see that Repo5 is associated with the group. We can see that the Delete Replication Group (VinyardRepos) link is currently disabled. You can follow the link to the repositories page to remove the association.
  4. On the repositories screen, click on the associated repository, in this example it's Repo5, then click on the EDIT button.
  5. On the Edit Repository box, use the Replication Group drop-down to move the repository to a different Replication Group. Then click SAVE.
  6. Repeat this process until there are no more repositories associated with the Replication Group that you wish to delete. In this example VinyardRepos only had a single repository, so it is now empty, and can be deleted. Click on the Quick View, then on Configure.
  7. Now that Replication Group VinyardRepos is effectively empty of replication payload, the Delete link is enabled. Click on the link Delete Replication Group (VinyardRepos) to remove the replication group, taking note that there's no undo. However, no data is removed when a replication group is deleted, so it should be easy enough to recreate a group if necessary.

  8. A growl will appear confirming that the replication group has been deleted.

5.11 Performing a Synchronized Stop

The Synchronized Stop is used to stop replication between repository replicas. It can be performed on a per-repository basis or on a replication group basis (where replication will be stopped for all associated repositories). Stops are synchronized between nodes using a 'stop' proposal to which all nodes need to agree, so while not all nodes will come to a stop at the same time, they do all stop at the same point.

  1. Log in to a node's browser-based UI and click on the Repositories tab. Click on the repository that you wish to stop replicating.
  2. With the repository selected, click the Sync Stop button. A growl message will appear to confirm that a synchronized stop has been requested. Note that the process may not be completed immediately, especially if there are large proposals transferring over a WAN link.
  3. On refreshing the screen you will see that a successfully sync stopped repository will have a status of Stopped and will be Local RO (Locally Read-only) at all nodes.

5.12 Performing a Synchronized Start

Restarting replication after performing a Synchronized Stop requires that the stopped replication be started in a synchronized manner.

  1. Click on a stopped repository and click on the Sync Start button.
  2. The repository will stop being Local Read-only on all nodes and will resume replicating.