Commit 7222dec

(DOCSP-44989) Revises per second-draft edits. (#91)
* (DOCSP-44989) Revises per second-draft edits.
* Adds shared includes and rewrites leftovers to match.
1 parent c8c4a98 commit 7222dec

8 files changed: +76 -17 lines changed
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Monitor CPU usage to determine whether data is retrieved from disk instead of memory.

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Monitor whether disk IOPS approaches the maximum provisioned IOPS.
+Determine whether the cluster can handle future workloads.

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Monitor disk latency to track the efficiency of reading from and
+writing to disk.

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Monitor memory to determine whether to upgrade to a higher cluster tier. This metric represents the average value over the time period specified by the metric granularity.

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+Monitor the replication oplog window, together with replication
+headroom, to determine whether the secondary may soon require a
+full resync. The replication oplog window often helps to
+determine in advance the resilience of secondaries to planned
+and unplanned outages.

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+Monitor page faults to determine whether to increase your memory.
+This metric displays the average rate of page faults on this process per second
+over the selected sample period. In non-Windows
+environments this applies to hard page faults only.

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Monitor replication lag to determine whether the secondary might fall off the oplog.

source/monitoring-alerts.txt

Lines changed: 60 additions & 17 deletions
@@ -170,124 +170,167 @@ same condition, one for a low priority / "warning" level of severity, and one fo

 .. list-table::
    :header-rows: 1
-   :widths: 30 35 35
+   :widths: 20 25 25 30
    :stub-columns: 1

    * - Condition
      - Recommended Alert Threshold: Low Priority
      - Recommended Alert Threshold: High Priority
+     - Key Insights

    * - Oplog Window
      - < 24h for 5 minutes
      - < 1h for 10 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-oplog-window.rst

    * - :manual:`Election </core/replica-set-elections/>` events
      - > 3 for 5 minutes
      - > 30 for 5 minutes
+     - Monitor election events, which occur when a primary node steps down and a
+       secondary node is elected as the new primary. Frequent election events can
+       disrupt operations and impact availability, causing temporary write
+       unavailability and possible rollback of data. Keeping election events to
+       a minimum ensures consistent write operations and stable {+cluster+} performance.

    * - Read :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst

    * - Write :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst

    * - Read Latency
      - > 20ms for 5 minutes
      - > 50 s for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst

    * - Write Latency
      - > 20ms for 5 minutes
      - > 50ms for more than 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst

    * - Swap use
      - > 2GB for 15 minutes
      - > 2GB for 15 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-memory.rst

    * - Host down
      - 15 minutes
      - 24 hours
+     - Monitor your hosts to detect downtime promptly. A host down for more than
+       15 minutes can impact availability, while downtime exceeding 24 hours is
+       critical, risking data accessibility and application performance.

    * - No primary
      - 5 minutes
      - 5 minutes
+     - Monitor the status of your replica sets to identify instances where there
+       is no primary node. A lack of a primary for more than 5 minutes can halt
+       write operations and impact application functionality.

    * - Missing active ``mongos``
      - 15 minutes
      - 15 minutes
+     - Monitor the status of active ``mongos`` processes to ensure effective query
+       routing in sharded {+clusters+}. A missing ``mongos`` can disrupt query routing.

    * - Page faults
      - > 50/second for 5 minutes
      - > 100/second for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-page-faults.rst

    * - Replication lag
      - > 240 second for 5 minutes
      - > 1 hour for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-replication-lag.rst

    * - Failed backup
      - Any occurrence
      - None
+     - Track backup operations to ensure data integrity. A failed backup can compromise
+       data availability.

    * - Restored backup
      - Any occurrence
      - None
+     - Verify restored backups to ensure data integrity and system functionality.

    * - Fallback snapshot failed
      - Any occurrence
      - None
+     - Monitor fallback snapshot operations to ensure data redundancy and recovery
+       capability.

    * - Backup schedule behind
      - > 12 hours
      - > 12 hours
-
-   * - Available write tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
-
-   * - Available read tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
+     - Check backup schedules to ensure they are on track. Falling behind can
+       risk data loss and compromise recovery plans.
+
+   * - Queued Reads
+     - > 0-10
+     - > 10+
+     - Monitor queued reads to ensure efficient data retrieval. High levels of
+       queued reads may indicate resource constraints or performance bottlenecks,
+       requiring optimization to maintain system responsiveness.
+
+   * - Queued Writes
+     - > 0-10
+     - > 10+
+     - Monitor queued writes to maintain efficient data processing. High levels
+       of queued writes may signal resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.

    * - Restarts last hour
      - > 2
      - > 2
+     - Track the number of restarts in the last hour to detect instability or
+       configuration issues. Frequent restarts can indicate underlying problems
+       that require immediate investigation to maintain system reliability and uptime.

    * - :manual:`Primary election </core/replica-set-elections/>`
      - Any occurrence
      - None
+     - Monitor primary elections to ensure stable {+cluster+} operations. Frequent
+       elections can indicate network issues or resource constraints, potentially
+       impacting the availability and performance of the database.

    * - Maintenance no longer needed
      - Any occurrence
      - None
+     - Review unnecessary maintenance tasks to optimize resources and minimize disruptions.

    * - Maintenance started
      - Any occurrence
      - None
+     - Track the start of maintenance tasks to ensure planned activities proceed smoothly.
+       Proper oversight helps maintain system performance and minimize downtime during maintenance.

    * - Maintenance scheduled
      - Any occurrence
      - None
+     - Monitor scheduled maintenance to prepare for potential system impacts.

    * - :atlas:`Steal </alert-basics/#cpu-steal>`
      - > 5% for 5 minutes
      - > 20% for 5 minutes
+     - Monitor CPU steal on AWS EC2 {+clusters+} with Burstable Performance
+       to identify when CPU usage exceeds the guaranteed baseline due to shared
+       cores. High steal percentages indicate the CPU credit balance is depleted,
+       affecting performance.

    * - CPU
      - > 75% for 5 minutes
      - > 75% for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-cpu.rst

    * - Disk partition usage
      - > 90%
      - > 95% for 5 minutes
-
-   * - Index partition usage
-     - > 90%
-     - > 95% for 5 minutes
-
-   * - Journal partition usage
-     - > 90%
-     - > 95% for 5 minutes
+     - Monitor disk partition usage to ensure sufficient storage availability.
+       High usage levels can lead to performance degradation and potential system outages.

 To learn more, see :atlas:`Configure and Resolve Alerts </alerts>`.

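Reviewer note: many thresholds in the table take the form "beyond X for N minutes". As a rough illustration of how such a sustained condition might be evaluated, here is a minimal Python sketch. It is not how Atlas implements alerting. The names `SustainedRule` and `first_alert_time`, the sample data, and the assumption that a condition fires once the metric has stayed past its threshold for the full duration are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SustainedRule:
    """Hypothetical rule: fire when a metric stays past `threshold` for `duration_s` seconds."""
    name: str
    threshold: float
    duration_s: int
    above: bool = True  # True models "> threshold"; False models "< threshold"

    def breached(self, value: float) -> bool:
        return value > self.threshold if self.above else value < self.threshold


def first_alert_time(rule: SustainedRule,
                     samples: Iterable[Tuple[int, float]]) -> Optional[int]:
    """Return the timestamp (seconds) at which the rule first fires, or None.

    `samples` is a sequence of (timestamp_s, value) pairs in ascending time order.
    Any sample back inside the threshold resets the breach window.
    """
    breach_start: Optional[int] = None
    for ts, value in samples:
        if rule.breached(value):
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= rule.duration_s:
                return ts
        else:
            breach_start = None
    return None


# Values taken from the table: Read IOPS low priority (> 4000 for 2 minutes)
# and Oplog Window low priority (< 24h for 5 minutes). Sample data is invented.
read_iops_low = SustainedRule("Read IOPS", threshold=4000, duration_s=120)
oplog_low = SustainedRule("Oplog Window (hours)", threshold=24, duration_s=300, above=False)

iops_samples = [(0, 3500), (60, 4200), (120, 4500), (180, 4100), (240, 3000)]
oplog_samples = [(0, 23), (60, 23), (120, 25), (180, 22), (240, 22)]

iops_fired_at = first_alert_time(read_iops_low, iops_samples)
oplog_fired_at = first_alert_time(oplog_low, oplog_samples)
```

In this sketch the IOPS breach begins at the 60-second sample and has lasted the full two minutes by the 180-second sample, so the rule fires there. The oplog window recovers above 24 hours at the 120-second sample, which resets the window, so that rule does not fire.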