{"componentChunkName":"component---src-components-blog-template-js","path":"/blog/2025-03-16-observability-and-logging/","result":{"data":{"markdownRemark":{"frontmatter":{"title":"Observability and Logging","date":"2025-03-16"},"html":"<p>When something goes wrong in production, the first question is always \"what happened?\" Observability is about making our systems transparent enough to answer that question quickly. It rests on three pillars: logs, metrics, and traces.</p>\n<h3>Logs</h3>\n<p>Logs are the most basic form of observability. They are timestamped records of events that happened in the application. A good log entry includes enough context to understand what happened without needing to reproduce the issue.</p>\n<p>Structured logging, where log entries are formatted as JSON instead of plain text, makes logs much easier to search and filter.</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">// Instead of this\nconsole.log(\"User login failed for user 123\");\n\n// Do this\nlogger.warn({\n  event: \"login_failed\",\n  userId: 123,\n  reason: \"invalid_password\",\n  ip: req.ip,\n});</code></pre></div>\n<p>Log levels help separate noise from signal. Use <code class=\"language-text\">debug</code> for development details, <code class=\"language-text\">info</code> for normal operations, <code class=\"language-text\">warn</code> for recoverable problems, and <code class=\"language-text\">error</code> for failures that need attention. In production, we typically set the level to <code class=\"language-text\">info</code> or <code class=\"language-text\">warn</code> to avoid flooding the logs with debug output.</p>\n<h3>Metrics</h3>\n<p>Metrics are numerical measurements collected over time. They tell us how the system is performing at a high level. 
Common metrics include request rate, error rate, response time percentiles, CPU usage, memory usage, and queue depth.</p>\n<p>The RED method is a useful framework for service metrics: Rate (requests per second), Errors (failed requests per second), and Duration (response time distribution). These three metrics give a good overview of a service's health.</p>\n<p>Tools like Prometheus collect metrics by scraping endpoints on our services. Grafana dashboards visualise those metrics and make trends visible. A sudden spike in error rate or a gradual increase in response time becomes obvious on a dashboard.</p>\n<h3>Traces</h3>\n<p>In a system with multiple services, a single user request might pass through several services before completing. Distributed tracing follows that request through the entire chain, showing how long each service took and where failures or bottlenecks occurred.</p>\n<p>Each request gets a unique trace ID that is passed along to every service in the chain. Instrumentation such as the OpenTelemetry SDK records these traces, and backends like Jaeger collect them and visualise them as a waterfall diagram, making it clear where time is being spent.</p>\n<h3>Alerting</h3>\n<p>Observability data is only useful if someone acts on it. Alerts notify the team when something is wrong. Good alerts are based on symptoms, not causes. Instead of alerting when CPU usage exceeds 80 percent, alert when the error rate exceeds a threshold or response times breach the SLA. High CPU might be fine if users are unaffected.</p>\n<p>Alerts should be actionable. If an alert fires and the response is \"I will look at it later,\" it is either not urgent enough to be an alert or it needs better context to guide the response.</p>\n<h3>Practical Tips</h3>\n<p>Start with logging. It is the simplest to add and provides the most immediate value. Then add key metrics to understand system health at a glance. Add tracing when the architecture involves multiple services and debugging cross-service issues becomes painful. 
Avoid the temptation to instrument everything upfront. Instrument the critical paths first and expand as needed.</p>"}},"pageContext":{"slug":"/2025-03-16-observability-and-logging/"}},"staticQueryHashes":[]}