WANdisco
 Navigation:  v1.4.2-LA Build 9b16f99c | Release Notes | Install | Upgrade | Administration | Reference | Gerrit | API | Glossary | Archive

Troubleshooting Guide

1. Logs

Git MultiSite has two sets of logs, one set is used for application, the other logs replication activity:

1.1 Application logs: <install-dir>/git-multisite/logs

These logs are used for troubleshooting problems with MultiSite's running:

drwxrwxr-x 2 wandisco wandisco 4096 Dec  5 14:20 logs

	-rw-rw-r-- 1 wandisco wandisco 32647705 Dec  6 09:50 replicator.log.20121205-142035
	-rw-rw-r-- 1 wandisco wandisco   775889 Dec  6 15:00 replicator.log.20121206-101206
	-rw-rw-r-- 1 wandisco wandisco     1666 Dec  5 13:21 setup.20121205-132141
	-rw-rw-r-- 1 wandisco wandisco     1674 Dec  5 13:53 setup.20121205-135328
	-rw-rw-r-- 1 wandisco wandisco      157 Dec  5 14:20 setup.20121205-142034
	-rw-rw-r-- 1 wandisco wandisco     1979 Dec  5 14:20 startup.20121205-142035
	-rw-rw-r-- 1 wandisco wandisco     1980 Dec  6 10:12 startup.20121206-101206
	-rw-rw-r-- 1 wandisco wandisco       94 Dec  6 09:57 stop.20121206-095714
	-rw-rw-r-- 1 wandisco wandisco     5136 Dec  6 10:26 ui.launch.log

1.2 Replication logs: <install-dir>/git-multisite/replicator/logs

	-rw-rw-r--. 1 wandisco wandisco  371282 Jan 23 10:00 fileysys.0
	-rw-rw-r--. 1 wandisco wandisco       0 Jan 18 10:40 fileysys.0.lck
	-rw-rw-r--. 1 wandisco wandisco 1054726 Jan 23 06:29 fileysys.1
	-rw-rw-r--. 1 wandisco wandisco 1055024 Jan 17 11:38 fileysys.10
	-rw-rw-r--. 1 wandisco wandisco 1049842 Jan 17 11:38 fileysys.11
	-rw-rw-r--. 1 wandisco wandisco 1050660 Jan 17 11:38 fileysys.12
	-rw-rw-r--. 1 wandisco wandisco 1049618 Jan 17 11:37 fileysys.13
	-rw-rw-r--. 1 wandisco wandisco 1049228 Jan 17 11:37 fileysys.14
        

The fileysys.xx logs record all actions taken by the replicator and a good starting point in the investigation of replication problems.

drwxrwxr-x. 2 wandisco wandisco    4096 Jan 17 11:43 stats

	-rw-rw-r--. 1 wandisco wandisco  22 Jan 17 11:43 111f4016-a23a-49d6-9769-ced78a250539
	-rw-rw-r--. 1 wandisco wandisco  22 Jan 17 11:43 420aa491-1bbb-4a6c-9cff-e8b60e911dd9

The thread-dump directory contains thread dumps that the replicator periodically writes to help with troubleshooting.

1.3 Logging levels

SEVERE:
Message level indicating a serious failure.
WARNING:
A message level indicating a potential problem.
INFO:
Interesting runtime events (startup/shutdown). Expect these to be immediately visible on a console, so be conservative and keep to a minimum.
CONFIG:
Details of configuration messages.
FINE:
Provides a standard level of trace information.
FINER:
Provides a more detailed level of trace information.
FINEST:
Provides a boggling level of trace information for troubleshooting hard to identify problems.

Changing the logging level

You can change the current logging level by editing the logger properties file  install-dir>/git-multisite/replicator/properties/logger.properties. You can see a sample logger.properties file.


2. Consistency Check

The Consistency Check tool is used to verify that a repository's copies stored across different nodes are synchronized, that is, in exactly the same state for a given global sequence number (GSN).

Limits of the Consistency Checker
The Consistency Check tells you the last common revision shared between repository replicas. Given the dynamic nature of a replication group, there may be in-flight proposals in the system that have not yet been agreed upon at all nodes. Therefore, a consistency check cannot be completely authoritative.

Specifically, consistency checks should be made on replication groups that contain only Active (inc Active Voter) nodes. The presence of passive nodes causes consistency checks to fail.

If you run a consistency check for a repository that does not exist, the dashboard displays []. You also get this result if you perform an /api/consistencyCheck call on a removed node.

You will receive a consistency error if you run a consistency check when there is no quorum. Consistency checks cannot verify consistency without a quorum so shouldn't be run.

2.1 Inconsistency causes and cures

WANdisco's replication technology delivers active-active replication that will, subject to a number of external factors, ensure that all replicas are consistent. However, there are some things that can happen that break consistency that would result in a halt to replication.

Again, loss of consistency is generally caused by external factors such as environmental differences, system quirks or user error. We've never encountered a loss of sync that resulted from a deficiency in the replication engine.

2.2 How Git MultiSite verifies replica consistency

  1. Nodes take the values of all the Git refs, they then combine them into a string and generate an SHA-1 that corresponds with the string.
  2. A checkpoint creation proposal is sent by the replication system to all nodes.
  3. Each node assigns a checkpoint to the current agreement key (GSN), the nodes then send their checkpoint to all the other nodes.
  4. Each node keeps a map of the checkpoints for all nodes. Checkpoints are created on a per-repository basis and are transmitted through an empty POST API call to /dcone/repository/$repositoryid$/checkconsistency, where $repositoryid$ is the ID for the repository that the client wishes to check.
  5. The results of the consistency check are performed with a GET REST API call on /dcone/repository/$repositoryid$/checkconsistency. This results of this call may take some time to arrive as they need to come from all nodes. The responsibility for the consistency check remains with the client.

Example XML return

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<checkpoints>
	<checkpoint><gsn>2</gsn><hash>548b65f000a73f07ad7bf0b04c2177476eae45eb</hash><nodeId>decone_node_id_dcone1</nodeId><repoId>2bc6ef1c-f07a-11e2-be6a-fe5400f1cc5f</repoId></checkpoint>
	<checkpoint><gsn>2</gsn><hash>548b65f000a73f07ad7bf0b04c2177476eae45eb</hash><nodeId>decone_node_id_dcone2</nodeId><repoId>2bc6ef1c-f07a-11e2-be6a-fe5400f1cc5f</repoId></checkpoint>
	<checkpoint><gsn>2</gsn><hash>548b65f000a73f07ad7bf0b04c2177476eae45eb</hash><nodeId>decone_node_id_dcone3</nodeId><repoId>2bc6ef1c-f07a-11e2-be6a-fe5400f1cc5f</repoId></checkpoint>
</checkpoints>	

See more about running a Consistency Check via the REST API.

2.3 Running a consistency check via the admin UI

Follow this procedure to perform a consistency check for a given repository:

  1. Log in to the admin UI, click on the Repositories tab. Click on the link of the repository that you wish to check for consistency.
    consistency check 01

    Navigate to the repository that you want to check.


  2. On the repository's screen, click on the button for Consistency Check.
    consistency check 02

    Click Consistency check.


  3. The check will trigger a growl message "Invoking consistency check on repository <repository-name>.git.. Once the check has completed, the result will appear in the main panel. If all is well the check will read "All repositories are consistent".
    consistency check 03

    Check completed


  4. You can scrutinize the actual check details by clicking on the Show Details button. You'll be show a table of the results.
    consistency check 04

    Details

  5. Details key

    Name node
    This is the Node ID, the unique reference string that corresponds with the node being checked.
    GSN
    The Global Sequence Number is used by WANdisco's replication system to maintain the ordering of transactional changes. At any particular moment it is possible that replicas are not exactly the same, given that some changes may still be in-flight. However, any series of changes are applied in exactly the same order so that replicas are identical at a given GSN.
    SHA-1
    The message digest for the repository, based on its refs. Any difference between repository replicas will result in different SHA-1 hash values which will indicate a loss of consistency.


3. Copy repositories

This section provides advice on getting your repository data distributed prior to starting or replication. You will need to use this procedure should you ever need to add additional nodes, or perform a recovery on a node.

3.1 Git installation requirements

These are things are a recap of the installation checklist, you must have them in place in order for replication to run effectively:

3.2 Copy existing repositories

It's simple enough to make a copy of a small repository and transfer it to each of your sites. However, remember that any changes made to the original repository will invalidate your copies unless you perform a syncronzation prior to starting replication.

If a repository needs to remain available to users during the proccess, you should briefly halt access, in order to make a copy. The copy can then be transferred to each site. Then, when you are ready to begin replication, you need use rsync to update each of your replicas. Fore more information about rsync, see Synchronizing repositories using rysnc.

3.3 New repositories

If you are creating brand new repositories, don't create them at each site. Instead create the repository once, then rsync it to the other sites. This ensures that any edits you make to one repo are replicated everywhere.

4. Synchronize repositories using rysnc

If for any reason repositories are corrupted or unable to automatically catch up it's usually possible to use rsync to get them back into sync.

Before you resync
To rsync you'll need to copy data from a repository replica that is up-to-date, before you do so it's good practice to perform an Git verify on the repository to be absolutely sure of it's integrity.

Use git's command on the 'helper' site:
git-fsck

From the site with the up-to-date repository, type the following commands:

rsync -rvlHtogpc /path/to/local/repo <Repository Name> remoteHost:/path/to/remote/repo

For example:
rsync -rvlHtogpc /path/to/git/repo/ root@172.7.2.33:/Subversion

Then follow up with an additional rsync that will ensure that contents of the locks directory are identical (by deleting locks that are not present on the originating server)

rsync -rvlHtogpc --delete /path/to/git/repo/db/locks <Repository Name> remoteHost:/path/to/remote/repo

For example:
rsync -rvlHtogpc --delete /path/to/git/repo/db/locks root@172.7.2.33:/git

Knowledgebase
You can read a more detailed step-by-step guide to using rsync in the Knowledge Base article Reset and rsync repositories.

5. Add a node to a replication group

Don't add a node during a period of high replication load
When adding sites to a replication group that already contains three or more sites, ensure that there isn't currently a large number of commits being replicated.

Adding a site during a period of high traffic (heavy level of commits) going to the repositories may cause the adding process to stall.

It's possible to add additional sites to an existing replication group, so that there's minimal disruption to users. Here's the procedure:

  1. Login to a site, click on the REPLICATION GROUPS tab. Go to the replication group to which you will add a new site, click on its View.
  2. The replication group pop-up will appear. Click CONFIGURE.
  3. On the replication group configuration screen you can see the existing sites. Click the ADD SITES button. Click on one or more of the available sites to add them. It is possible to bring in a brand new site that isn't currently connected by clicking on CONNECT NEW SITE, this will take you to the SITES tab.
  4. Any new node(s) that you add will now appear on the screen as a blue icon tab. You'll be presented with the following information:
    Adding Nodes
    
    You are about to add "NodeAuckland" to the "ReplicationGroupGladius" replication group.
    Please read through the following steps before you continue:
     
    1. Click ADD NODES and choose an existing Node as a helper.
    2. Copy the repositories replicated in the group from the helper to the new node. During this period these repositories will be read-only on the helper node.
    3. Once a repository is in place, selecting it and clicking COMPLETE SELECTED will take it out of read-only on the helper and the new member nodes.
    
    WARNING: Do not close your browser or log out during this process. If you do you'll need to complete the sync process for each individual repository via the REPAIR option on the repositories screen.
    Click ADD NODES to confirm your selection and proceed with adding them.
  5. Select one of the existing sites to be the 'helper' from the CHOOSE HELPER NODE drop-down selector. This helper site will be used to supply copies of the relevant repositories to the new site. During the helper phase the relevant repositories will be temporarily read-only for their local users.
  6. Click START SYNC to begin your copying process. Once you click the button the helper node will's relevant repositories will bcome read-only to local users.
  7. READ-ONLY ALERT
    Starting the Sync may cause a brief disruption to users on the helper node. You may wish to alert them or complete the work at a time where the disruption will be minimal.
  8. The helper site is now ready for you to start copying data across. You can manually copy the repository files or you can use rsync - see our guide: 11. Synchronizing repositories using rsync.
    Always copy, never create
    Manually copy or rsync repositories, even if they're empty. Git repositories should match across inits (subject to the template directory matching across servers) .
  9. Once the relevant repositories have been copied to the new site, and the copies have been checked to ensure they're up-to-date, it's time finish the process by clicking COMPLETE ALL.
  10. The new site will now appear as a member of the replication group. It is added as a non-voter by default, to change its type, click on its icon and change it to a Voter or Tie-breaker .

6. Recover from the loss of a node

It's possible for Git MultiSite to recover from the brief outage of a member site, which should be able to resync once it is reconnected. The crucial requirement for MultiSite's continued operation is that agreement over transaction ordering must be able to continue. Votes must be cast and those votes must always result in an agreement - no situation must arise where the votes are evenly split between voters.

By default a node will be classed as offline (and picked up as such by other nodes) if it fails unexpectedly and is not responding after a ten minute interval (this can be changed in the Application Properties file). Any attempted commits during this window will only be accepted if there are still enough nodes left to achieve quorum.

If after the loss of a node, a replication group can no longer form agreements then replication is halted. If the lost node was a voter, and there aren't enough remaining voters to form an agreement, then either the lost site must be repaired/reconnected, or the replication group must undergo emergency reconfiguration.

6.1 Final option: emergency reconfiguration

The emergency reconfiguration is a final option for recovery and cannot be undone. It represents a big shakeup of your replication system. Only complete an emergency reconfiguration if the lost site can not be repaired or reconnected in an acceptable amount of time.

Gone but not forgotten
After a lost site has been removed and a replication group reconfigured, the lost site MUST NOT be allowed to come back online. Ensure that you perform a cleanup after you have completed the emergency reconfiguration.

7. Emergency reconfiguration

Important: Don't attempt an emergency reconfiguration without help
If you need to perform a permanent removal of a node (an emergency reconfiguration) then you should contact WANdisco's support team for assistance. The operation poses several risks to the overall operation of Access Control Plus. Therefore, we recommend that you do not attempt the procedure by yourself.

Having confirmed that an emergency reconfiguration is required, follow this procedure:

  1. Verify the details of the site that is now declared 'lost'. Login to the user interface of one of the remaining sites and view the Sites tab. The missing site will show a status of Down.

  2. Select the lost node by ticking its corresponding checkbox and then click the RECONFIGURE button.

  3. The Emergency Reconfiguration screen will appear. Check and confirm that you have selected the correct site, then click on the Start Reconfiguration button.

  4. A warning will appear, asking you to confirm that you are ready to start the process, and that once started the process can not be cancelled. Click CONFIRM if you are ready to proceed.

  5. The Reconfiguration process will now run, creating new, replacement replication groups, activating them, then removing the old groups. The process is finished when all items are listed as Complete.
    How Reconfiguration Works
    The emergency reconfiguration process seeks to recreate functional replication groups using the remaining member sites. In siutations where a replication group only contained two sites, including the lost site, then a reconfiguration is not possible, in this scenario a new replication group will need to be created once a replacement site has been inducted.


  6. Finally, you should check the state of the reformed replication groups to ensure that they'll still perform according to your organization's requirements.

8. Restore replication on a problem node

It's possible that a problem with a single transaction can result in a repository being put into a read-only mode. Causing the replication of this repository at just one node. If possible, use the following procedure to get replication started again:

  1. The first sign that a transaction has not been able to complete on a node is when the repository is placed in a protective read-only state. This is done to ensure that it will remain in a condition in which it can be recover and catch up. On the Repositories tab you'll see the repository is now flagged as locally read-only.
    Providing there are still enough nodes to reach agreement, repository changes at the other nodes can continue to be made.
  2. At the problem site, you would now need to indentify the cause of the problem. Check Git MultiSite's logs as well as the logs generated for Git users who are trying to commit changes to the problem repository. It may be possible to quickly fix the causeo f the problem, such as a permission problem that has prevented a file to be written to on the node.
  3. When the problem has been fixed you can go to the Repositories tab and edit the read-only repository. Remove the Local RO (Read-only) tick. The node will then attempt to catch up and get the repository back into sync with its other replicas.

9. Deleting replication groups

It's possible to remove replication groups from Git MultiSite , although only if they they have been emptied of repositories. Run through the following procedure as an example.

  1. We have identified that replication group "VinyardRepos" is to be removed from Git MultiSite. We can see that it has a single repository associated with it. Click on the Quick View to see which one.
  2. Click on Configure.
  3. On the Replication Group configuration screen we can see that Repo5 is associated with the group. We can see that currently the Delete Replication Group (VinyardRepos) is disable. You can follow the link to the repositories page to remove the association.
  4. On the repositories screen, click on the assoicated repository, in this example it's Repo5, then click on the EDIT button.
  5. On the Edit Repository box, use the Replication Group drop-down to move the repository to a different Replication Group. Then click SAVE.
  6. Repeat this process until there are no more repositories assoicated with the Replication Group that you wish to delete. In this example VinyardRepos only had a single repository, so it is now empty, and can be deleted. Click on the Quick View, then on the Configure.
  7. Now that Replication Group VinardRepos is effectly empty of replication payload the Delete link is enabled. Click on the link Delete Replication Group (VinyardRepos) to remove the replication group, taking note that there's no undo - although no data is removed when a replication group is deleted, it should be easy enough to recreate a group if necessary.

  8. A growl will appear confirming that the replication group has been deleted.

10. Performing a synchronized stop

The Synchronized Stop is used to stop replication between repository replicas, it can be performed on a per-repository basis or on a replication group basis (where replication will be stopped for all associated repositories). Stops are synchronized between nodes using a 'stop' proposal to which all nodes need to agree. So that while not all nodes will come to a stop at the same time they do all stop at the same point.

  1. Login to a node's browser-based UI and click on the Repositories tab. Click on the repository that you wish to stop replicating.
  2. With the repository selected, click the Sync Stop button. A growl message will appear to confirm that a synchronized stop has been requested. Note that the process may not be completed immediately, especially if there are large proposals transfering over a WAN link.
  3. On refreshing the screen you will see that a successfully sync stopped repository will have a status of Stopped and will be Local RO (Locally Read-only) at all nodes.

11. Performing a synchronized start

Restarting replication after performing a Synchronized Stop requires that the stopped replication be started in a synchronized manner.

  1. Click on a stopped repository and click on the Sync Start button.
  2. The repository will stop being Local Read-only on all nodes and will restart replicating again.

12. Running talkback

talkback is a bash script in your Git MultiSite installation to use if you need to talk to the WANdisco support team.

Manually run Talkback:

Log in to the server with admin privileges. Navigate to the Git MultiSite's binary directory: /opt/wandisco/git-multisite/bin/talkback

Run talkback.

[root@localhost bin]# ./talkback
Talkback Failure
We're aware of an issue that can cause Talkback to fail with the message:
"parser error : Start tag expected, '<' not found"
This problem only occurs if a check done by Talkback to verify that the replicator is running gets a false positive - possibly the replicator's REST API port is still connected and in a TIME_WAIT state. If you get this error, check that the replicator is running and re-run Talkback.

You'll need to provide some information during the run - also note the environmental variables noted below which can be used to further modify how the talkback script runs:


#######################################################################
# WANdisco talkback - Script for picking up system & replicator       #
# information for support                                             #
#######################################################################

    To run this script non-interactively please set following environment vars:

    ENV-VAR:
    MS_REP_UN                  Set username to login to MultiSite
    MS_REP_PS                  Set password to login to MultiSite
    MS_SUPPORT_TICKET          Set ticket number to give to WANdisco support team

    By default, your talkback is not uploaded. If you wish to upload it, you may also specify
    the following variables:

    MS_FTP_UN                  Set ftp username to upload to WANdisco support FTP server. Note that
                                specifying this may cause SSH to prompt for a password, so don't set
                                this variable if you wish to run this script non-interactively.


      ===================== INFO ========================
      The talkback agent will capture relevant configuration
      and log files to help WANdisco diagnose the problem
      you may be encountering.

Please enter replicator admin username: adminUIusername 
Please enter replicator admin password: thepasswordhere

retrieving details for repository "Repo1"
retrieving details for repository "Repo3"
retrieving details for repository "Repo4"
retrieving details for repository "repo2"
retrieving details for node "NodeSanFransisco"
retrieving details for node "NodeAuckland"
retrieving details for node "NodeParis"

Please enter your WANdisco support FTP username (leave empty to skip auto-upload process):
Skipping auto-FTP upload

TALKBACK COMPLETE

---------------------------------------------------------------
 Please upload the file:

     /opt/wandisco/git-multisite/talkback-201312191119-redhat6.3-64bit.tar.gz

 to WANdisco support with a description of the issue.

 Note: do not email the talkback files, only upload them
 via ftp or attach them via the web ticket user interface.
--------------------------------------------------------------

12.1 Example talkback output

replicator
	application.xml 
	config
	application.properties 
	license.key
	logger.properties 
	ui.properties
		users.properties
		license.xml
		locations.xml 
		md5s 
		memberships.xml
	nodes
		Node1 
			connection-test  
			location.xml 
			node.xml
		Node2  
			connection-test 
			location.xml  
			node.xml
		Node3  
			connection-test
			location.xml 
			node.xml  
	nodes.xml 
	recent-logs
		gitms.0.log
		gitms.0.log.lck 
		multisite.log 
		node-details
		replicator.20140203-170618.log
		thread-dump-2014-02-03.11-06-22 
		ui.20140203-165434.log 
		watchdog.log 
	replicationGroups.xml 
	replicator-file-list 
	repositories.xml 
	tasks.xml 
	system 
	file-max 
	file-nr
	limits.conf 
	logs
		messages
		messages-20140126  
		messages-20140202 
	netstat
	processes  
	services 
	sysctl.conf  
	sys-status 
	top

12.2 Create new users.properties file

If you need to create a fresh users.properties file for your deployment, follow this short procedure:

  1. Shut down all nodes, ensure the Git MultiSite service has stopped
  2. On one node, open the application.properties file in a text editor. Default location:
    /opt/wandisco/git-multisite/replicator/properties/application.properties
  3. Add the following entries to the file:
    application.username=admin
    application.password=wandisco
    
  4. Save the file, then restart the Git-MultiSite service on that node.
  5. Copy the newly created /opt/wandisco/git-multisite/replicator/properties/users.properties file to all other nodes.
  6. Restart the Git MultiSite services on all nodes.
  7. Once again, edit application.properties. This time you should remove the entries added in step 3 (application.username and application.password).