The Silent Failure: Real-Time Observability in Asynchronous Systems

Context

As established in the previous architecture, the LaaS system relied heavily on a "Hit-and-Forget" mechanism. This approach was designed for maximum efficiency: the application would accept a request, acknowledge it immediately, and handle the heavy lifting of WMS and FMS integrations in the background. From a stability standpoint, it was successful: as long as the system accepted the request, the user felt the app was working perfectly.

Problem

However, this efficiency created a dangerous disconnect. Because the core business logic was encapsulated in background jobs, the user interface remained static once a request was accepted. In a logistics environment where API timeouts, credential expirations, or payload mismatches are inevitable, these background failures were silent.

A user might see a shipment marked as "In Process" indefinitely, unaware that the integration had failed minutes earlier. This lack of feedback led to Data Drift: physical warehouse actions stalled because the digital command had failed without raising an alert. We had built a stable engine, but we had failed to provide a real-time dashboard for its failures.

Solution

To bridge this gap, I led the implementation of a real-time notification layer and a more granular observability strategy:

Server-Sent Events (SSE): We implemented SSE to push background job statuses directly to the client. We intentionally chose SSE over WebSockets because our requirements were strictly one-way (Server-to-Client). This provided a lightweight, standard HTTP-based stream without the infrastructure overhead of managing full-duplex WebSocket connections.

Domain-Segregated Logging: We moved away from a single "catch-all" log. Instead, we implemented Encapsulated Logging Tables specifically for Inbound and Outbound integrations. By segregating logs into dedicated database tables, we ensured that high-volume delivery logs didn't clutter the context of critical warehouse sync failures.

Worker Health Monitoring: I introduced a "heartbeat" mechanism for our background processes. This ensured that we weren't just monitoring the jobs, but the engine itself. If the background worker stalled, the system would immediately flag the lack of activity to the engineering team.
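
The heartbeat idea can be sketched in a few lines. This is a minimal illustration, not the production implementation: the class name HeartbeatMonitor, the 60-second default threshold, and the worker IDs are all assumptions made for the example. Each worker calls beat() on a schedule, and a separate check lists every worker whose last beat is older than the threshold.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """Records the last heartbeat per worker and reports workers that went quiet."""

    stale_after_seconds: float = 60.0  # assumed threshold, tune per workload
    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    _last_seen: Dict[str, float] = field(default_factory=dict, init=False)

    def beat(self, worker_id: str) -> None:
        """Called by a worker to signal that it is still alive."""
        self._last_seen[worker_id] = self.clock()

    def stalled_workers(self) -> List[str]:
        """Return the IDs of workers whose last beat is older than the threshold."""
        now = self.clock()
        return sorted(
            worker_id
            for worker_id, seen in self._last_seen.items()
            if now - seen > self.stale_after_seconds
        )
```

In a setup like this, a scheduler would call stalled_workers() periodically and forward any non-empty result to whatever alerting channel the engineering team uses.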

Outcome

The implementation of SSE transformed the LaaS app from a "Black Box" into a Responsive Platform. Users received immediate, actionable feedback on background errors, allowing them to resolve issues (such as fixing a bad address or re-authenticating a service) without manual engineering intervention. We didn't just reduce failures; we eliminated Silent Failures, which are among the most expensive types of error in logistics.

Reflection

This project reinforced that a "Hit-and-Forget" pattern is incomplete without a robust feedback loop. SSE was a pragmatic, high-value win for our web-based platform. However, I recognized a critical constraint: if our target had been a mobile application, SSE’s lack of background persistence would have forced us toward Firebase Cloud Messaging (FCM) or WebSockets.

The decision to segregate logs into dedicated tables was also a major win; it turned "searching for a needle in a haystack" into a simple filtered query. It’s a small architectural detail that saves hours of "firefighting" during a production incident.
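
To make the "simple filtered query" concrete, here is a small sketch using SQLite from the Python standard library. The table names, columns, and helper functions are hypothetical stand-ins chosen for the example; the real schema is not described in this write-up.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical table names: one dedicated log table per integration direction.
LOG_TABLES = ("inbound_integration_logs", "outbound_integration_logs")


def open_log_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create both log tables with an illustrative schema."""
    conn = sqlite3.connect(path)
    for table in LOG_TABLES:
        conn.execute(
            f"""CREATE TABLE IF NOT EXISTS {table} (
                    id INTEGER PRIMARY KEY,
                    shipment_ref TEXT NOT NULL,
                    status TEXT NOT NULL,
                    error_message TEXT
                )"""
        )
    return conn


def _check_table(table: str) -> None:
    # Table names cannot be bound as SQL parameters, so only known names pass.
    if table not in LOG_TABLES:
        raise ValueError(f"unknown log table: {table}")


def log_event(conn: sqlite3.Connection, table: str, shipment_ref: str,
              status: str, error_message: Optional[str] = None) -> None:
    """Write one log row into the table for the given integration direction."""
    _check_table(table)
    conn.execute(
        f"INSERT INTO {table} (shipment_ref, status, error_message) VALUES (?, ?, ?)",
        (shipment_ref, status, error_message),
    )


def failures_for(conn: sqlite3.Connection, table: str,
                 shipment_ref: str) -> List[Tuple[str, Optional[str]]]:
    """Return failed events for one shipment, without scanning the other domain."""
    _check_table(table)
    return conn.execute(
        f"SELECT status, error_message FROM {table} "
        "WHERE shipment_ref = ? AND status = 'failed' ORDER BY id",
        (shipment_ref,),
    ).fetchall()
```

Because each direction has its own table, a query for inbound sync failures never has to filter out outbound delivery rows first.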

Questions & Answers

Q: Why not use WebSockets instead of SSE?

SSE was chosen for its simplicity and lower resource overhead. Since we only needed the server to "push" notifications to the user about job statuses, the bi-directional nature of WebSockets would have added unnecessary complexity to our infrastructure.
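
Part of that simplicity is the wire format itself: an SSE stream is plain text served with the Content-Type text/event-stream, where each event is a few "field: value" lines followed by a blank line. The sketch below shows how a job-status update could be serialized. The function names, the "job-status" event name, and the payload fields are assumptions made for illustration, not the project's actual code.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps produces a single line by default, so one data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def job_status_stream(updates: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE message per job update, numbering them from 1."""
    for number, update in enumerate(updates, start=1):
        yield format_sse(update, event="job-status", event_id=str(number))
```

A web framework would return a generator like job_status_stream as a streaming response, and the browser would consume it through the standard EventSource API, which also reconnects automatically if the connection drops.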

Q: Storing logs in a database table can lead to massive table growth in high-traffic systems. How did you handle the scaling of these logging tables?

We treated these as Short-Term Audit Logs. We implemented a rolling retention policy where logs older than 30 days were archived to cold storage or purged. This kept database performance high while ensuring we had the "Hot Data" needed for immediate troubleshooting.
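
A rolling retention job of this kind might look like the sketch below, again using SQLite for a self-contained example. The 30-day window comes from the answer above; the table name integration_logs, its columns, and the choice to store timestamps as Unix epoch seconds are assumptions. The function returns the expired rows so the caller can hand them to cold storage before they are gone from the hot table.

```python
import sqlite3
from typing import List, Tuple

RETENTION_DAYS = 30  # retention window stated in the text
SECONDS_PER_DAY = 86_400


def create_log_table(conn: sqlite3.Connection) -> None:
    """Create an illustrative log table; created_at holds Unix epoch seconds."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS integration_logs (
               id INTEGER PRIMARY KEY,
               message TEXT NOT NULL,
               created_at INTEGER NOT NULL
           )"""
    )


def purge_expired(conn: sqlite3.Connection, now_epoch: int,
                  retention_days: int = RETENTION_DAYS) -> List[Tuple[int, str, int]]:
    """Delete rows older than the retention window and return them for archiving."""
    cutoff = now_epoch - retention_days * SECONDS_PER_DAY
    with conn:  # commits on success, rolls back if an exception is raised
        expired = conn.execute(
            "SELECT id, message, created_at FROM integration_logs "
            "WHERE created_at < ? ORDER BY id",
            (cutoff,),
        ).fetchall()
        conn.execute("DELETE FROM integration_logs WHERE created_at < ?", (cutoff,))
    return expired
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the job deterministic and easy to test. On a large production table, the delete would likely need batching or partitioning, which this sketch leaves out.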