
Commit 20bd361

Bob Grabar committed

DOCS-467 new info on repl set troubleshooting

1 parent 06408ea commit 20bd361

File tree

1 file changed: +127 -23 lines changed

source/administration/replica-sets.txt

@@ -385,7 +385,7 @@ Removing Members
 ~~~~~~~~~~~~~~~~
 
 You may remove a member of a replica at any time. Use the
-:method:`rs.remove()` function in the :program:`mongo` shell while
+:method:`rs.remove()` method in the :program:`mongo` shell while
 connected to the current :term:`primary`. Issue the
 :method:`db.isMaster()` command when connected to *any* member of the
 set to determine the current primary. Use a command in either
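
For illustration only, assuming a hypothetical three-member set named ``rs0`` (the host names and values below are examples, not output from a real deployment), identifying the primary from any member looks roughly like this:

.. code-block:: javascript

   // Ask any member which host is currently primary.
   db.isMaster()

   // Abbreviated response:
   {
      "setName" : "rs0",
      "ismaster" : false,
      "secondary" : true,
      "primary" : "m1.example.net:27017",
      "ok" : 1
   }
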
@@ -561,43 +561,76 @@ OpenSSL package to generate "random" content for use in a key file:
 
 Key file permissions are not checked on Windows systems.
 
-Troubleshooting
----------------
+Troubleshooting Replica Sets
+----------------------------
 
-This section defines reasonable troubleshooting processes for common
-operational challenges. While there is no single causes or guaranteed
-response strategies for any of these symptoms, the following sections
-provide good places to start a troubleshooting investigation with
+This section describes common strategies for troubleshooting
 :term:`replica sets <replica set>`.
 
 .. seealso:: :doc:`/administration/monitoring`.
 
+.. _replica-set-troubleshooting-check-replication-status:
+
+Check Replica Set Status
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+To display the current state of the replica set and current state of
+each member, run the :method:`rs.status()` method in a :program:`mongo`
+shell connected to the replica set's :term:`primary`. For descriptions
+of the information displayed by :method:`rs.status()`, see
+:doc:`/reference/replica-status`.
+
+.. note:: The :method:`rs.status()` method is a wrapper that runs the
+   :doc:`/reference/command/replSetGetStatus` database command.
+
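
As a rough illustration of the preceding procedure, the abbreviated output below uses hypothetical member names and times; the field names follow the output described in :doc:`/reference/replica-status`:

.. code-block:: javascript

   rs.status()
   {
      "set" : "rs0",
      "date" : ISODate("2012-10-02T15:33:40Z"),
      "myState" : 1,
      "members" : [
         {
            "_id" : 0,
            "name" : "m1.example.net:30001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optimeDate" : ISODate("2012-10-02T15:33:40Z")
         },
         {
            "_id" : 1,
            "name" : "m2.example.net:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "optimeDate" : ISODate("2012-10-02T13:29:45Z")
         }
      ],
      "ok" : 1
   }
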
 .. _replica-set-replication-lag:
 
-Replication Lag
-~~~~~~~~~~~~~~~
+Check the Replication Lag
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Replication lag is a delay between an operation on the :term:`primary`
 and the application of that operation from the :term:`oplog` to the
-:term:`secondary`. Such lag can be a significant issue and can
+:term:`secondary`. Replication lag can be a significant issue and can
 seriously affect MongoDB :term:`replica set` deployments. Excessive
 replication lag makes "lagged" members ineligible to quickly become
 primary and increases the possibility that distributed
 read operations will be inconsistent.
 
-Identify replication lag by checking the value of
-:data:`members[n].optimeDate` for each member of the replica set
-using the :method:`rs.status()` function in the :program:`mongo`
-shell.
+To check the current duration of replication lag, do one of the following:
+
+- Run the :doc:`db.printSlaveReplicationInfo()
+  </reference/method/db.printSlaveReplicationInfo>` method in a
+  :program:`mongo` shell connected to the replica set's primary.
+
+  The output displays each member's ``syncedTo`` value, which is the
+  last time the member read from the oplog, as shown in the following
+  example:
+
+  .. code-block:: javascript
+
+     source:   m1.example.net:30001
+         syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
+             = 7475 secs ago (2.08hrs)
+     source:   m2.example.net:30002
+         syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
+             = 7475 secs ago (2.08hrs)
 
-Also, you can monitor how fast replication occurs by watching the oplog
-time in the "replica" graph in the `MongoDB Monitoring Service`_. Also
-see the `documentation for MMS`_.
+- Run the :method:`rs.status()` method in a :program:`mongo` shell
+  connected to the replica set's primary. The output displays each
+  member's ``optimeDate`` value, which is the last time the member read
+  from the oplog (see the sketch following this list).
+
+  .. note:: The :method:`rs.status()` method is a wrapper that runs the
+     :doc:`/reference/command/replSetGetStatus` database command.
+
+- Monitor how fast replication occurs by watching the oplog time in the
+  "replica" graph in the `MongoDB Monitoring Service`_. For more
+  information, see the `documentation for MMS`_.
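
The following is a minimal sketch, assuming only that :method:`rs.status()` reports a ``name`` and an ``optimeDate`` for every member as described in :doc:`/reference/replica-status`; it prints each member's lag behind the most current member:

.. code-block:: javascript

   // Compute each member's lag behind the most up-to-date member.
   var status = rs.status();
   var newest = Math.max.apply(null, status.members.map(function (m) {
      return m.optimeDate.getTime();
   }));
   status.members.forEach(function (m) {
      var lagSecs = (newest - m.optimeDate.getTime()) / 1000;
      print(m.name + " is " + lagSecs + " seconds behind");
   });
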
 
 .. _`MongoDB Monitoring Service`: http://mms.10gen.com/
 .. _`documentation for MMS`: http://mms.10gen.com/help/
 
-Possible causes of replication lag include:
+If replication lag is too large, check the following:
 
 - **Network Latency**
 
@@ -635,9 +668,9 @@ Possible causes of replication lag include:
 
 If you are performing a large data ingestion or bulk load operation
 that requires a large number of writes to the primary, the
-secondaries will not be able to read the :term:`oplog` fast enough to keep
-up with changes. Setting some level :ref:`write concern <write-concern>`, can
-slow the overall progress of the batch, but will prevent the
+secondaries will not be able to read the oplog fast enough to keep
+up with changes. Setting some level of write concern can
+slow the overall progress of the batch but will prevent the
 secondary from falling too far behind.
 
 To prevent this, use write concern so that MongoDB will perform
@@ -653,17 +686,88 @@ Possible causes of replication lag include:
 - :ref:`replica-set-write-concern`.
 - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
 - The :ref:`replica-set-oplog` topic in the :doc:`/core/replication-internals` document.
-- The :ref:`replica-set-procedure-change-oplog-size` topic this document.
+- The :ref:`replica-set-procedure-change-oplog-size` topic in this document.
 - The :doc:`/tutorial/change-oplog-size` tutorial.
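
As a minimal sketch of the write concern approach described above, assuming a hypothetical ``bulkload`` collection and arbitrarily chosen batch size, ``w`` value, and timeout, a throttled bulk load in the :program:`mongo` shell might look like this:

.. code-block:: javascript

   // Insert in batches and wait for two members to acknowledge each
   // batch before continuing, so secondaries can keep up.
   for (var i = 0; i < 100000; i++) {
      db.bulkload.insert({ _id: i, value: "example" });
      if (i % 1000 === 0) {
         // Block until w: 2 members have the writes, or 5000 ms pass.
         db.getLastError(2, 5000);
      }
   }
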

+
+.. _replica-set-troubleshooting-check-oplog-size:
+
+Check the Size of the Oplog
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The :term:`oplog` size can be the difference between a :term:`secondary`
+staying up-to-date or becoming stale.
+
+To check the size of the oplog for a given :term:`replica set` member,
+connect to the member in a :program:`mongo` shell and run the
+:doc:`db.printReplicationInfo()
+</reference/method/db.printReplicationInfo>` method.
+
+The method displays the size of the oplog and the date ranges of the
+operations contained in the oplog. In the following example, the oplog
+is about 10MB and is able to fit only about 20 minutes (1200 seconds) of
+operations:
+
+.. code-block:: javascript
+
+   configured oplog size:   10.10546875MB
+   log length start to end: 1200secs (0.33hrs)
+   oplog first event time:  Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
+   oplog last event time:   Tue Oct 02 2012 16:31:38 GMT-0400 (EDT)
+   now:                     Tue Oct 02 2012 17:04:20 GMT-0400 (EDT)
+
+The above example is likely a case where you would want to increase the
+size of the oplog. For more information on how oplog size affects
+operations, see:
+
+- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
+- The :ref:`replica-set-delayed-members` topic in this document.
+- The :ref:`replica-set-replication-lag` topic in this document.
+
+To change oplog size, see :ref:`replica-set-procedure-change-oplog-size`
+in this document or see the :doc:`/tutorial/change-oplog-size` tutorial.
+
+.. _replica-set-troubleshooting-check-connection:
+
+Test the Connection Between Each Member
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There must be connectivity from every :term:`replica set` member to
+every other member in order for replication to work. Problems with
+network or firewall rules can prevent this connectivity and prevent
+replication from working. To test the connection from every member to
+every other member, in both directions, consider the following example:
+
+.. example:: Given a replica set with three members running on three separate
+   hosts:
+
+   - ``m1.example.net``
+   - ``m2.example.net``
+   - ``m3.example.net``
+
+   You can test the connection from ``m1.example.net`` to the other hosts by running
+   the following operations from ``m1.example.net``:
+
+   .. code-block:: sh
+
+      mongo --host m2.example.net --port 27017
+
+      mongo --host m3.example.net --port 27017
+
+   Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``.
+
+   If any of these connections fails, there is a networking or firewall
+   issue that you must diagnose separately.
+
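
Assuming the standard ``name``, ``health``, and ``stateStr`` fields of :method:`rs.status()`, the following sketch lists how the current node sees each member; unreachable members report a ``health`` of ``0``:

.. code-block:: javascript

   // Print each member's health as seen from the current node.
   rs.status().members.forEach(function (m) {
      print(m.name + "  health: " + m.health + "  state: " + m.stateStr);
   });
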
 .. index:: pair: replica set; failover
 .. _replica-set-failover-administration:
 .. _failover:
 
 Failover and Recovery
 ~~~~~~~~~~~~~~~~~~~~~
 
-.. todo:: Revisit whether this belongs in troubleshooting. Perhaps this should be an H2 before troubleshooting.
+.. TODO Revisit whether this belongs in troubleshooting. Perhaps this
+   should be an H2 before troubleshooting.
 
 Replica sets feature automated failover. If the :term:`primary`
 goes offline or becomes unresponsive and a majority of the original
