@@ -170,124 +170,167 @@ same condition, one for a low priority / "warning" level of severity, and one fo
 
 .. list-table::
    :header-rows: 1
-   :widths: 30 35 35
+   :widths: 20 25 25 30
    :stub-columns: 1
 
    * - Condition
      - Recommended Alert Threshold: Low Priority
      - Recommended Alert Threshold: High Priority
+     - Key Insights
 
    * - Oplog Window
      - < 24h for 5 minutes
      - < 1h for 10 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-oplog-window.rst
 
    * - :manual:`Election </core/replica-set-elections/>` events
      - > 3 for 5 minutes
      - > 30 for 5 minutes
+     - Monitor election events, which occur when a primary node steps down and a
+       secondary node is elected as the new primary. Frequent election events can
+       disrupt operations and impact availability, causing temporary write
+       unavailability and possible rollback of data. Keeping election events to
+       a minimum ensures consistent write operations and stable {+cluster+} performance.
 
    * - Read :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
 
    * - Write :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
 
    * - Read Latency
      - > 20ms for 5 minutes
      - > 50ms for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
 
    * - Write Latency
      - > 20ms for 5 minutes
      - > 50ms for more than 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
 
    * - Swap use
      - > 2GB for 15 minutes
      - > 2GB for 15 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-memory.rst
 
    * - Host down
      - 15 minutes
      - 24 hours
+     - Monitor your hosts to detect downtime promptly. A host down for more than
+       15 minutes can impact availability, while downtime exceeding 24 hours is
+       critical, risking data accessibility and application performance.
 
    * - No primary
      - 5 minutes
      - 5 minutes
+     - Monitor the status of your replica sets to identify instances where there
+       is no primary node. A lack of a primary for more than 5 minutes can halt
+       write operations and impact application functionality.
 
    * - Missing active ``mongos``
      - 15 minutes
      - 15 minutes
+     - Monitor the status of active ``mongos`` processes to ensure effective query
+       routing in sharded {+clusters+}. A missing ``mongos`` can disrupt query routing.
 
    * - Page faults
      - > 50/second for 5 minutes
      - > 100/second for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-page-faults.rst
 
    * - Replication lag
      - > 240 seconds for 5 minutes
      - > 1 hour for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-replication-lag.rst
 
    * - Failed backup
      - Any occurrence
      - None
+     - Track backup operations to ensure data integrity. A failed backup can compromise
+       data availability.
 
    * - Restored backup
      - Any occurrence
      - None
+     - Verify restored backups to ensure data integrity and system functionality.
 
    * - Fallback snapshot failed
      - Any occurrence
      - None
+     - Monitor fallback snapshot operations to ensure data redundancy and recovery
+       capability.
 
    * - Backup schedule behind
      - > 12 hours
      - > 12 hours
-
-   * - Available write tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
-
-   * - Available read tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
+     - Check backup schedules to ensure they are on track. Falling behind can
+       risk data loss and compromise recovery plans.
+
+   * - Queued Reads
+     - > 0-10
+     - > 10+
+     - Monitor queued reads to ensure efficient data retrieval. High levels of
+       queued reads may indicate resource constraints or performance bottlenecks,
+       requiring optimization to maintain system responsiveness.
+
+   * - Queued Writes
+     - > 0-10
+     - > 10+
+     - Monitor queued writes to maintain efficient data processing. High levels
+       of queued writes may signal resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.
 
    * - Restarts last hour
      - > 2
      - > 2
+     - Track the number of restarts in the last hour to detect instability or
+       configuration issues. Frequent restarts can indicate underlying problems
+       that require immediate investigation to maintain system reliability and uptime.
 
    * - :manual:`Primary election </core/replica-set-elections/>`
      - Any occurrence
      - None
+     - Monitor primary elections to ensure stable {+cluster+} operations. Frequent
+       elections can indicate network issues or resource constraints, potentially
+       impacting the availability and performance of the database.
 
    * - Maintenance no longer needed
      - Any occurrence
      - None
+     - Review unnecessary maintenance tasks to optimize resources and minimize disruptions.
 
    * - Maintenance started
      - Any occurrence
      - None
+     - Track the start of maintenance tasks to ensure planned activities proceed smoothly.
+       Proper oversight helps maintain system performance and minimize downtime during maintenance.
 
    * - Maintenance scheduled
      - Any occurrence
      - None
+     - Monitor scheduled maintenance to prepare for potential system impacts.
 
    * - :atlas:`Steal </alert-basics/#cpu-steal>`
      - > 5% for 5 minutes
      - > 20% for 5 minutes
+     - Monitor CPU steal on AWS EC2 {+clusters+} with Burstable Performance
+       to identify when CPU usage exceeds the guaranteed baseline due to shared
+       cores. High steal percentages indicate the CPU credit balance is depleted,
+       affecting performance.
 
    * - CPU
      - > 75% for 5 minutes
      - > 75% for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-cpu.rst
 
    * - Disk partition usage
      - > 90%
      - > 95% for 5 minutes
-
-   * - Index partition usage
-     - > 90%
-     - > 95% for 5 minutes
-
-   * - Journal partition usage
-     - > 90%
-     - > 95% for 5 minutes
+     - Monitor disk partition usage to ensure sufficient storage availability.
+       High usage levels can lead to performance degradation and potential system outages.
 
 To learn more, see :atlas:`Configure and Resolve Alerts </alerts>`.
 
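
For readers who automate their alerting, a row from the table above can also be expressed as an alert-configuration payload for the Atlas Administration API's ``alertConfigs`` resource. The sketch below models the low-priority "Replication lag" threshold (> 240 seconds for 5 minutes); the specific metric name, event type, and notification fields are illustrative assumptions, not values taken from this page, so verify them against the API reference before use.

```python
def replication_lag_alert_config(project_id: str) -> dict:
    """Build a request body for POST .../groups/{project_id}/alertConfigs.

    A minimal sketch: field names below (eventTypeName, metricName,
    notification typeName) are assumed, not confirmed by this page.
    """
    return {
        "groupId": project_id,
        "eventTypeName": "OUTSIDE_METRIC_THRESHOLD",  # assumed event type
        "enabled": True,
        "metricThreshold": {
            "metricName": "OPLOG_SLAVE_LAG_MASTER_TIME",  # assumed metric name
            "operator": "GREATER_THAN",
            "threshold": 240,      # "> 240 seconds" from the table
            "units": "SECONDS",
            "mode": "AVERAGE",
        },
        "notifications": [
            {
                "typeName": "EMAIL",               # assumed notification type
                "emailAddress": "ops@example.com",  # placeholder address
                "delayMin": 5,                     # mirrors "for 5 minutes"
                "intervalMin": 60,
            }
        ],
    }


if __name__ == "__main__":
    import json

    # Placeholder project ID; substitute your own before sending the request.
    print(json.dumps(replication_lag_alert_config("YOUR-PROJECT-ID"), indent=2))
```

Pairing one low-priority and one high-priority configuration per condition, as the table recommends, then just means submitting a second payload with the high-priority threshold and a different notification target.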