A production outage at 2 AM. The customer calls before you know there is a problem. This is what I set up to prevent that.
1. Uptime Monitoring
Question: Is my site up?
Tool: UptimeRobot or Checkly (free tier is enough to start)
Setup: Check every 5 minutes from multiple regions
Alert if:
- Response time > 3 seconds
- Status code != 200
- SSL certificate expires within 30 days
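If you ever want to run the same rules yourself, for example in a small cron job, they fit in one function. This is only a sketch: the `result` shape (`status`, `responseTimeMs`, `certExpiresAt`) and the name `evaluateCheck` are made up for illustration and are not part of UptimeRobot or Checkly.

```javascript
// Thresholds taken from the alert list above.
const MAX_RESPONSE_MS = 3000;
const CERT_WARNING_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/**
 * Returns the alert reasons for one probe result (empty array if healthy).
 * `result` is a hypothetical shape: { status, responseTimeMs, certExpiresAt }.
 */
function evaluateCheck(result, now = new Date()) {
  const reasons = [];
  if (result.responseTimeMs > MAX_RESPONSE_MS) {
    reasons.push(`slow response: ${result.responseTimeMs} ms`);
  }
  if (result.status !== 200) {
    reasons.push(`unexpected status: ${result.status}`);
  }
  const daysLeft = (result.certExpiresAt.getTime() - now.getTime()) / MS_PER_DAY;
  if (daysLeft <= CERT_WARNING_DAYS) {
    reasons.push(`certificate expires in ${Math.floor(daysLeft)} days`);
  }
  return reasons;
}

// Example: a fast 200 response whose certificate runs out in 10 days.
const checkedAt = new Date('2024-01-01T00:00:00Z');
console.log(evaluateCheck(
  { status: 200, responseTimeMs: 420, certExpiresAt: new Date('2024-01-11T00:00:00Z') },
  checkedAt,
));
```

An empty array means the check passed. Anything else is worth a notification.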
2. Error Tracking
Question: What exceptions are happening?
Tool: Sentry (free tier is generous)
Setup: Capture all unhandled exceptions and promise rejections
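// In your app's entry point, before anything renders
import * as Sentry from '@sentry/react';

Sentry.init({
  dsn: 'your-dsn', // placeholder: use the DSN from your Sentry project settings
  environment: process.env.NODE_ENV,
});

// Optional: Sentry's default global handlers already report unhandled
// promise rejections. Keep this listener only if you have disabled them,
// otherwise the same rejection may be reported twice.
window.addEventListener('unhandledrejection', (event) => {
  Sentry.captureException(event.reason);
});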
3. Logging + Metrics
Question: What is happening right now?
Tool: DataDog, Grafana+Prometheus, or CloudWatch
The key metrics to watch:
- Error rate (errors per minute)
- P99 latency
- CPU/Memory usage
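To make the first two metrics concrete, here is a rough sketch of how they can be computed from raw request records. The record shape (`durationMs`, `ok`) and the function names are assumptions for illustration; DataDog, Prometheus, and CloudWatch each compute these with their own query languages, and P99 here uses the simple nearest-rank method.

```javascript
/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

/** Summarizes one time window. Each request is a hypothetical { durationMs, ok } record. */
function summarizeWindow(requests, windowMinutes) {
  const errors = requests.filter((r) => !r.ok).length;
  return {
    errorsPerMinute: errors / windowMinutes,
    errorRatio: requests.length === 0 ? 0 : errors / requests.length,
    p99LatencyMs: percentile(requests.map((r) => r.durationMs), 99),
  };
}

// Example: 200 requests in a 5-minute window, 4 of them slow failures.
const windowSample = [
  ...Array.from({ length: 196 }, () => ({ durationMs: 120, ok: true })),
  ...Array.from({ length: 4 }, () => ({ durationMs: 6000, ok: false })),
];
console.log(summarizeWindow(windowSample, 5));
```

In this example the average latency would look fine, but P99 lands on one of the slow requests. That is why P99 is on the list and the average is not.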
The Alerting Rule
Only alert on things that matter:
- Site is down
- Error rate spikes above 5% of requests
- P99 latency > 5 seconds
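Written as code, the whole rule is three comparisons. This is a sketch, not the config of any particular tool: the `snapshot` shape and the name `pageWorthyReasons` are hypothetical, and it assumes the 5% error rate is measured as a fraction of requests.

```javascript
// Thresholds from the list above.
const ERROR_RATIO_LIMIT = 0.05; // 5% of requests (assumed meaning of "error rate")
const P99_LIMIT_MS = 5000;

/**
 * Returns the reasons to alert for one metrics snapshot (empty array means stay quiet).
 * `snapshot` is a hypothetical shape: { siteUp, errorRatio, p99LatencyMs }.
 */
function pageWorthyReasons(snapshot) {
  const reasons = [];
  if (!snapshot.siteUp) reasons.push('site is down');
  if (snapshot.errorRatio > ERROR_RATIO_LIMIT) reasons.push('error rate above 5%');
  if (snapshot.p99LatencyMs > P99_LIMIT_MS) reasons.push('P99 latency above 5 seconds');
  return reasons;
}

// Example: a few errors and normal latency should not wake anyone up.
console.log(pageWorthyReasons({ siteUp: true, errorRatio: 0.01, p99LatencyMs: 900 }));
// Example: an error spike should.
console.log(pageWorthyReasons({ siteUp: true, errorRatio: 0.12, p99LatencyMs: 900 }));
```

Everything that does not make this list can still go to a dashboard or a daily digest. It just should not send a notification.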
Don’t alert on every error. If you do, you will start ignoring alerts, including the ones that matter.
One More Thing
Have a status page ready. When things go wrong, update it before customers complain.