Optimizing N+1 Query Performance in a Monolithic ERP

Context

Twin v1 was a massive, monolithic ERP that grew organically—and painfully—over time. Originally designed strictly for operational flows like sales, warehouse management, and delivery, it was eventually forced to support heavy analytical reporting. Because reporting wasn't a priority in the early design, the system lacked a dedicated analytical layer. Instead, it relied on direct, real-time queries against the live production database for all reports.

It worked well enough in the early days, but as the data volume swelled and features expanded, the system began to collapse under its own weight.

The Problem

After roughly two years of accumulating live data, the performance degradation was impossible to ignore. The root cause was a classic, compounding architectural flaw: lazy loading and the N+1 query problem , heavily exacerbated by the reporting features. We had a single "god model " for Orders that served transactions, simple reads, and complex reports simultaneously.

Over time, developers had piled more and more query logic onto this single model. Convenience abstractions like Laravel’s pagination turned into silent killers, as the underlying COUNT queries became agonizingly expensive on millions of rows.

Furthermore, with no data cutoff or archiving strategy, reports were crunching years of live transaction history in real-time. The system was trying to sprint while dragging a massive anchor of historical data.

Impact

The user experience degraded into slow page loads, constant request timeouts, and reports that simply hung forever. As the reporting traffic grew, it started choking the core operational flows. Database connections pooled up, expensive queries deadlocked, and the database server crashed multiple times under the strain.

At this stage, performance was no longer just a UX concern; it had become a critical risk to business continuity.

Constraints

A full architectural rewrite was unfeasible due to the system's scale and its critical role in daily operations. Since the legacy write-logic governed the core transactions of the business, any structural changes introduced an unacceptable level of risk. We had to optimize the system within its existing, flawed framework to ensure stability while addressing the performance bottleneck.

Decision and Approach

Working alongside my tech lead and manager, we accepted that structural redesign wasn't an option. Instead, we chose a tactical retreat, focusing entirely on read optimization to stop the bleeding.

The main steps were:

Stripping reporting logic out of Eloquent models and replacing it with raw SQL and query builders.

Aggressively reducing lazy loading in reporting paths and enforcing explicit joins.

Ripping out standard pagination in favor of simple LIMIT queries where total counts weren’t strictly required.

Applying targeted database indexes (though we knew this would have limited impact on its own).

Introducing a master-slave database replication to physically separate read-heavy reporting from write-heavy operations.

Deploying behind feature flags and temporarily disabling the heaviest reports until they could be stabilized.

Outcome

The tactical intervention was successful. Query latency dropped significantly, memory usage stabilized, and database uptime returned to 100%. By physically separating workloads via replication and optimizing the read paths, we restored the reliability of core business transactions.

However, I recognized this as a high-impact mitigation rather than a structural resolution. While we stabilized the platform, the underlying data model remained untouched. Without a formal data archiving strategy, we were essentially extending the system’s "runway" before reaching the next scaling threshold. We successfully resolved the immediate operational crisis, but the fundamental architectural debt remained a known item on the long-term technical roadmap.

Reflection

Initially, the frustration of this project made me want to abandon ORMs entirely in favor of raw queries and strict repository layers. But over time, I realized the ORM wasn’t the enemy—the lack of architectural boundaries was. ORMs are fantastic for transactional logic, but they do not scale automatically when you force them to handle heavy analytical reporting. Convenience features like lazy loading are loans with high interest rates; eventually, the technical debt comes due.

Since Twin v1, I’ve become highly intentional about separating read and write paths, defaulting to query-based approaches for reporting, and planning for data archiving on day one. It was a painful lesson, but it fundamentally changed how I approach system longevity.

Questions & Answers

Q.
Why didn't you just use a dedicated Data Warehouse (like `BigQuery` or `ClickHouse`) for the reporting?

We were actually in the process of doing exactly that. We had hired specialists and started building the ETL pipelines, but you can’t build a skyscraper while the ground is shaking. The system was crashing now. A Data Warehouse is a long-term cure, but we needed a tourniquet to stop the immediate bleeding. The Master-Slave replication was a tactical move—it was the fastest way to physically isolate the "reporting" fire from the "transactional" core without a total rewrite. In the end, the company’s trajectory shifted before the "perfect" architecture could be finished, leaving us with a half-baked but stable solution.