
Commit 20bd361

Bob Grabar committed

DOCS-467 new info on repl set troubleshooting

1 parent 06408ea commit 20bd361

File tree

1 file changed: +127 -23 lines changed

source/administration/replica-sets.txt

@@ -385,7 +385,7 @@ Removing Members
 ~~~~~~~~~~~~~~~~
 
 You may remove a member of a replica at any time. Use the
-:method:`rs.remove()` function in the :program:`mongo` shell while
+:method:`rs.remove()` method in the :program:`mongo` shell while
 connected to the current :term:`primary`. Issue the
 :method:`db.isMaster()` command when connected to *any* member of the
 set to determine the current primary. Use a command in either
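
For illustration only, assuming a hypothetical three-member set named ``rs0`` (the host names and values below are examples, not output from a real deployment), identifying the primary from any member looks roughly like this:

.. code-block:: javascript

   // Ask any member which host is currently primary.
   db.isMaster()

   // Abbreviated response:
   {
      "setName" : "rs0",
      "ismaster" : false,
      "secondary" : true,
      "primary" : "m1.example.net:27017",
      "ok" : 1
   }
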
@@ -561,43 +561,76 @@ OpenSSL package to generate "random" content for use in a key file:
 
 Key file permissions are not checked on Windows systems.
 
-Troubleshooting
----------------
+Troubleshooting Replica Sets
+----------------------------
 
-This section defines reasonable troubleshooting processes for common
-operational challenges. While there is no single causes or guaranteed
-response strategies for any of these symptoms, the following sections
-provide good places to start a troubleshooting investigation with
+This section describes common strategies for troubleshooting
 :term:`replica sets <replica set>`.
 
 .. seealso:: :doc:`/administration/monitoring`.
 
+.. _replica-set-troubleshooting-check-replication-status:
+
+Check Replica Set Status
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+To display the current state of the replica set and current state of
+each member, run the :method:`rs.status()` method in a :program:`mongo`
+shell connected to the replica set's :term:`primary`. For descriptions
+of the information displayed by :method:`rs.status()`, see
+:doc:`/reference/replica-status`.
+
+.. note:: The :method:`rs.status()` method is a wrapper that runs the
+   :doc:`/reference/command/replSetGetStatus` database command.
+
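
As a rough illustration of the preceding procedure, the abbreviated output below uses hypothetical member names and times; the field names follow the output described in :doc:`/reference/replica-status`:

.. code-block:: javascript

   rs.status()
   {
      "set" : "rs0",
      "date" : ISODate("2012-10-02T15:33:40Z"),
      "myState" : 1,
      "members" : [
         {
            "_id" : 0,
            "name" : "m1.example.net:30001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optimeDate" : ISODate("2012-10-02T15:33:40Z")
         },
         {
            "_id" : 1,
            "name" : "m2.example.net:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "optimeDate" : ISODate("2012-10-02T13:29:45Z")
         }
      ],
      "ok" : 1
   }
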
 .. _replica-set-replication-lag:
 
-Replication Lag
-~~~~~~~~~~~~~~~
+Check the Replication Lag
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Replication lag is a delay between an operation on the :term:`primary`
 and the application of that operation from the :term:`oplog` to the
-:term:`secondary`. Such lag can be a significant issue and can
+:term:`secondary`. Replication lag can be a significant issue and can
 seriously affect MongoDB :term:`replica set` deployments. Excessive
 replication lag makes "lagged" members ineligible to quickly become
 primary and increases the possibility that distributed
 read operations will be inconsistent.
 
-Identify replication lag by checking the value of
-:data:`members[n].optimeDate` for each member of the replica set
-using the :method:`rs.status()` function in the :program:`mongo`
-shell.
+To check the current duration of replication lag, do one of the following:
+
+- Run the :doc:`db.printSlaveReplicationInfo()
+  </reference/method/db.printSlaveReplicationInfo>` method in a
+  :program:`mongo` shell connected to the replica set's primary.
+
+  The output displays each member's ``syncedTo`` value, which is the
+  last time the member read from the oplog, as shown in the following
+  example:
+
+  .. code-block:: javascript
+
+     source:   m1.example.net:30001
+         syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
+             = 7475 secs ago (2.08hrs)
+     source:   m2.example.net:30002
+         syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
+             = 7475 secs ago (2.08hrs)
 
-Also, you can monitor how fast replication occurs by watching the oplog
-time in the "replica" graph in the `MongoDB Monitoring Service`_. Also
-see the `documentation for MMS`_.
+- Run the :method:`rs.status()` method in a :program:`mongo` shell
+  connected to the replica set's primary. The output displays each
+  member's ``optimeDate`` value, which is the last time the member read
+  from the oplog (see the sketch following this list).
+
+  .. note:: The :method:`rs.status()` method is a wrapper that runs the
+     :doc:`/reference/command/replSetGetStatus` database command.
+
+- Monitor how fast replication occurs by watching the oplog time in the
+  "replica" graph in the `MongoDB Monitoring Service`_. For more
+  information, see the `documentation for MMS`_.
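
The following is a minimal sketch, assuming only that :method:`rs.status()` reports a ``name`` and an ``optimeDate`` for every member as described in :doc:`/reference/replica-status`; it prints each member's lag behind the most current member:

.. code-block:: javascript

   // Compute each member's lag behind the most up-to-date member.
   var status = rs.status();
   var newest = Math.max.apply(null, status.members.map(function (m) {
      return m.optimeDate.getTime();
   }));
   status.members.forEach(function (m) {
      var lagSecs = (newest - m.optimeDate.getTime()) / 1000;
      print(m.name + " is " + lagSecs + " seconds behind");
   });
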
 
 .. _`MongoDB Monitoring Service`: http://mms.10gen.com/
 .. _`documentation for MMS`: http://mms.10gen.com/help/
 
-Possible causes of replication lag include:
+If replication lag is too large, check the following:
 
 - **Network Latency**
 
@@ -635,9 +668,9 @@ Possible causes of replication lag include:
 
 If you are performing a large data ingestion or bulk load operation
 that requires a large number of writes to the primary, the
-secondaries will not be able to read the :term:`oplog` fast enough to keep
-up with changes. Setting some level :ref:`write concern <write-concern>`, can
-slow the overall progress of the batch, but will prevent the
+secondaries will not be able to read the oplog fast enough to keep
+up with changes. Setting some level of write concern can
+slow the overall progress of the batch but will prevent the
 secondary from falling too far behind.
 
 To prevent this, use write concern so that MongoDB will perform
@@ -653,17 +686,88 @@ Possible causes of replication lag include:
 - :ref:`replica-set-write-concern`.
 - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
 - The :ref:`replica-set-oplog` topic in the :doc:`/core/replication-internals` document.
-- The :ref:`replica-set-procedure-change-oplog-size` topic this document.
+- The :ref:`replica-set-procedure-change-oplog-size` topic in this document.
 - The :doc:`/tutorial/change-oplog-size` tutorial.
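
As a minimal sketch of the write concern approach described above, assuming a hypothetical ``bulkload`` collection and arbitrarily chosen batch size, ``w`` value, and timeout, a throttled bulk load in the :program:`mongo` shell might look like this:

.. code-block:: javascript

   // Insert in batches and wait for two members to acknowledge each
   // batch before continuing, so secondaries can keep up.
   for (var i = 0; i < 100000; i++) {
      db.bulkload.insert({ _id: i, value: "example" });
      if (i % 1000 === 0) {
         // Block until w: 2 members have the writes, or 5000 ms pass.
         db.getLastError(2, 5000);
      }
   }
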

+
+.. _replica-set-troubleshooting-check-oplog-size:
+
+Check the Size of the Oplog
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The :term:`oplog` size can be the difference between a :term:`secondary`
+staying up-to-date or becoming stale.
+
+To check the size of the oplog for a given :term:`replica set` member,
+connect to the member in a :program:`mongo` shell and run the
+:doc:`db.printReplicationInfo()
+</reference/method/db.printReplicationInfo>` method.
+
+The method displays the size of the oplog and the date ranges of the
+operations contained in the oplog. In the following example, the oplog
+is about 10MB and is able to fit only about 20 minutes (1200 seconds) of
+operations:
+
+.. code-block:: javascript
+
+   configured oplog size:   10.10546875MB
+   log length start to end: 1200secs (0.33hrs)
+   oplog first event time:  Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
+   oplog last event time:   Tue Oct 02 2012 16:31:38 GMT-0400 (EDT)
+   now:                     Tue Oct 02 2012 17:04:20 GMT-0400 (EDT)
+
+The above example is likely a case where you would want to increase the
+size of the oplog. For more information on how oplog size affects
+operations, see:
+
+- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
+- The :ref:`replica-set-delayed-members` topic in this document.
+- The :ref:`replica-set-replication-lag` topic in this document.
+
+To change oplog size, see :ref:`replica-set-procedure-change-oplog-size`
+in this document or see the :doc:`/tutorial/change-oplog-size` tutorial.
+
+.. _replica-set-troubleshooting-check-connection:
+
+Test the Connection Between Each Member
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There must be connectivity from every :term:`replica set` member to
+every other member in order for replication to work. Problems with
+network or firewall rules can prevent this connectivity and prevent
+replication from working. To test the connection from every member to
+every other member, in both directions, consider the following example:
+
+.. example:: Given a replica set with three members running on three separate
+   hosts:
+
+   - ``m1.example.net``
+   - ``m2.example.net``
+   - ``m3.example.net``
+
+   You can test the connection from ``m1.example.net`` to the other hosts by running
+   the following operations from ``m1.example.net``:
+
+   .. code-block:: sh
+
+      mongo --host m2.example.net --port 27017
+
+      mongo --host m3.example.net --port 27017
+
+   Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``.
+
+   If any of these connections fails, there is a networking or firewall
+   issue that you must diagnose separately.
+
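
Assuming the standard ``name``, ``health``, and ``stateStr`` fields of :method:`rs.status()`, the following sketch lists how the current node sees each member; unreachable members report a ``health`` of ``0``:

.. code-block:: javascript

   // Print each member's health as seen from the current node.
   rs.status().members.forEach(function (m) {
      print(m.name + "  health: " + m.health + "  state: " + m.stateStr);
   });
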
 .. index:: pair: replica set; failover
 .. _replica-set-failover-administration:
 .. _failover:
 
 Failover and Recovery
 ~~~~~~~~~~~~~~~~~~~~~
 
-.. todo:: Revisit whether this belongs in troubleshooting. Perhaps this should be an H2 before troubleshooting.
+.. TODO Revisit whether this belongs in troubleshooting. Perhaps this
+   should be an H2 before troubleshooting.
 
 Replica sets feature automated failover. If the :term:`primary`
 goes offline or becomes unresponsive and a majority of the original
