Business Context
Mining operations are 24/7 industrial processes. A copper mine at 4,000m altitude in Chile or an iron ore operation in the Pilbara does not have planned downtime. The software running these operations — shift management, equipment tracking, safety compliance, maintenance scheduling — must match the operational availability of the physical systems it supports.
The software portfolio across BHP and Collahuasi spanned operational management (shift handover, production tracking, operational reporting), maintenance systems (equipment health monitoring, preventive maintenance workflows, work order management), safety compliance (near-miss recording, regulatory submissions, incident management), and business intelligence (cross-domain operational dashboards, regulatory reports).
This was not consumer software. The users were plant operators, maintenance engineers, safety officers, and operational managers working under real-time production pressure. "The system is slow" has different consequences when the person reporting it is a mine shift supervisor waiting to hand over to the next crew.
Problem Space
The problems evolved across three distinct phases over 8 years:
Phase 1 (2013–2015): Greenfield with integration complexity
Building operational systems with modern (for 2013) enterprise patterns while integrating with SCADA systems, ERP platforms (SAP, JDE), and legacy operational databases from different eras of the mine's history. The integration landscape was the primary challenge: SCADA systems push data at fixed intervals using proprietary protocols and cannot respond to pull requests. Any architecture had to accommodate push-only data sources as a fundamental constraint.
Phase 2 (2016–2019): Multi-country scaling
Extending the same software platform to Collahuasi's Chilean operations. The same systems needed to accommodate different regulatory frameworks (Australian and Chilean mine safety regulations differ substantially in reporting requirements), different operational workflows, and different integration landscapes. The architecture faced its first real stress test: could it accommodate genuine variation without forking into two separate codebases?
Phase 3 (2019–2021): Operational maturity and technical debt
Systems in production for 6+ years accumulate operational knowledge, changing requirements, and technical debt. The challenge shifted from building to maintaining architectural integrity while accommodating change. This is where the quality of early architectural decisions becomes visible: the decisions that made the systems adaptable, and the decisions that made adaptation expensive.
System Constraints
The constraints that shaped every architectural decision:
- Zero planned downtime: Mining operations run continuously. There is no maintenance window, no Sunday 3am deployment. All upgrades must be deployable without stopping the system. This is a fundamental availability requirement, not an SLA target.
- Remote operation environments: Chilean mine sites have intermittent and bandwidth-limited connectivity. Systems must function in degraded connectivity modes and synchronize when full connectivity is restored. This ruled out any architecture requiring constant low-latency connections.
- Safety criticality: Safety-adjacent systems are legally regulated in both jurisdictions. Failures in certain systems trigger regulatory notification requirements. This created hard availability requirements for a specific subset of systems — not all systems, but those that couldn't fail silently.
- SCADA push-only constraint: Industrial control systems (SCADA) push data at fixed intervals. They cannot receive requests. Any integration must accommodate push-only data sources and cannot assume synchronous request-response patterns on the data acquisition path.
- Multi-country regulatory divergence: Australian and Chilean mine safety regulations differ in reporting requirements, incident classification, and submission formats. The same business events require different regulatory artifacts depending on jurisdiction. The architecture needed to encode this variation without creating separate codebases.
- 10-year maintenance horizon: Architectural decisions made in 2013 would still be in use in 2021 and beyond. Systems needed to be maintainable by teams that had not built them, documented for decisions that would be questioned years later.
Architecture
The foundational decision was domain boundary design. Mining operations involve genuinely distinct business domains with different data ownership, different consistency requirements, and different failure modes. The architecture encoded these distinctions explicitly.
Domain boundary rationale
Operations is high write-throughput, time-series heavy, and tolerates eventual consistency. A production record delayed by seconds is operationally acceptable; a production record lost is not. The consistency model optimizes for write availability.
Maintenance is lower write frequency but relationship-heavy. A work order cannot simultaneously be Open and Closed — strong consistency is required for state transitions. The domain owns equipment records independently from Operations; the same physical equipment exists in both contexts with different attributes and different ownership semantics.
Safety & Compliance is write-once, legally immutable. Safety records cannot be modified after creation — only superseded by new records. The storage model is append-only by design. Availability requirements are strict: regulatory submission windows have hard deadlines.
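To make the append-only model concrete, here is a minimal Python sketch with hypothetical field names and an in-memory store standing in for the real persistence layer: records are never edited or deleted, only superseded by new records.

```python
# Sketch of an append-only safety record store: records are only ever
# superseded by new records, never edited in place. Names and fields are
# illustrative, not the production schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class SafetyRecord:
    record_id: str
    details: dict
    supersedes: Optional[str] = None          # record_id this record replaces
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

class SafetyLog:
    """Append-only store: no update or delete paths exist by construction."""

    def __init__(self) -> None:
        self._records: list[SafetyRecord] = []

    def append(self, record: SafetyRecord) -> None:
        self._records.append(record)

    def current(self, record_id: str) -> Optional[SafetyRecord]:
        """Walk the supersession chain (in append order) to the latest version."""
        latest = None
        for r in self._records:
            if r.record_id == record_id or (latest and r.supersedes == latest.record_id):
                latest = r
        return latest
```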
Reporting reads from all domains but owns none. It maintains purpose-built read models derived from domain events. Cross-domain analytics do not go through domain APIs — they go through read models built explicitly for query patterns. This prevents ad-hoc cross-domain coupling through "just add a join" database queries.
Architecture Decision Records
ADR-001: Domain-owned databases instead of a shared operational database
Context: Centralized operational databases are common in mining ERP systems. They simplify reporting queries significantly and reduce data duplication. The existing mine data landscape included legacy centralized databases from different technology eras.
Decision: Each domain owns its database schema. Cross-domain data access goes through domain APIs, not direct database queries. The Reporting domain maintains purpose-built read models derived from domain events — it does not join across domain databases.
Consequences: Cross-domain queries require application-layer joining, which is slower than SQL joins. Domain schema changes don't break reporting until the read model is deliberately updated — this is a feature, not a bug. The isolation allows each domain to be upgraded, scaled, and deployed independently. Equipment schema changes in the Maintenance domain don't require coordinating with the Operations domain deployment.
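As an illustration of the read-model approach (not the production code), a small Python sketch with hypothetical event types and a dict-backed store shows how a Reporting projection consumes domain events instead of joining across domain databases:

```python
# Sketch of a Reporting read-model projection. Event types, fields, and the
# dict-backed store are illustrative; the real read models had their own schemas.

from dataclasses import dataclass

@dataclass
class ProductionRecorded:            # hypothetical domain event from Operations
    equipment_id: str
    tonnes: float

@dataclass
class WorkOrderClosed:               # hypothetical domain event from Maintenance
    equipment_id: str
    downtime_hours: float

class EquipmentUtilisationProjection:
    """Purpose-built read model owned by Reporting; never queries other domains."""

    def __init__(self, store: dict):
        self.store = store           # keyed by equipment_id

    def apply(self, event) -> None:
        row = self.store.setdefault(
            event.equipment_id, {"tonnes": 0.0, "downtime_hours": 0.0}
        )
        if isinstance(event, ProductionRecorded):
            row["tonnes"] += event.tonnes
        elif isinstance(event, WorkOrderClosed):
            row["downtime_hours"] += event.downtime_hours
        # unrecognised events are ignored: the projection tracks only what it needs
```

Because the projection owns its own table, a schema change in Operations or Maintenance only affects it when the projection is deliberately updated to use the new fields.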
ADR-002: Event-driven SCADA integration through adapters and the message bus
Context: SCADA systems push data at fixed intervals using proprietary protocols. They cannot accept HTTP requests or respond to pull-based queries. Building synchronous integrations would require polling, with all its latency, reliability, and coupling problems.
Decision: SCADA adapters receive pushed data, normalize it to domain event format, and emit to the enterprise message bus. Domain services consume events and update their models. No synchronous cross-boundary calls on the operational data path. The SCADA push model becomes the event bus's event stream.
Consequences: SCADA data arrives in domain systems with propagation delay (typically <100ms). For operational monitoring, this is acceptable. For real-time safety-critical alerts requiring sub-second response, this pattern cannot be used — those systems integrate directly with SCADA via proprietary protocols. The event-driven path handles analytics and monitoring; the direct path handles safety-critical alerting.
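A minimal sketch of the adapter pattern, assuming a JSON payload and a hypothetical `bus.publish(topic, payload)` interface; the real adapters spoke proprietary SCADA protocols and used richer event schemas:

```python
# Sketch of a SCADA adapter: it receives pushed readings, normalises them into a
# domain event, and emits to the message bus. The JSON payload and the
# bus.publish(topic, payload) interface are assumptions for illustration.

import json
from datetime import datetime, timezone

class ScadaAdapter:
    def __init__(self, bus, source_id: str):
        self.bus = bus
        self.source_id = source_id

    def on_push(self, raw: bytes) -> None:
        """Invoked whenever the SCADA system pushes a reading; nothing polls."""
        reading = json.loads(raw)            # stand-in for the proprietary protocol
        event = {
            "type": "SensorReadingReceived",
            "source": self.source_id,
            "tag": reading["tag"],
            "value": reading["value"],
            "observed_at": reading["timestamp"],
            "received_at": datetime.now(timezone.utc).isoformat(),
        }
        # No synchronous call back into SCADA and no direct call into a domain
        # service on this path; the event goes onto the bus and consumers catch up.
        self.bus.publish(topic="operations.sensor-readings", payload=event)
```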
ADR-003: Duplicated equipment data across domains instead of a shared table
Context: Data shared between domains (equipment records used by both Operations and Maintenance) needed to be accessible to both. The simplest technical solution was a shared table with foreign key relationships.
Decision: Each domain maintains its own representation of shared data. Equipment records exist in Operations (as operational equipment state — current status, location, utilization) and in Maintenance (as maintenance-schedulable assets — service history, maintenance schedules, component tracking). Eventual consistency between representations is maintained through domain events.
Consequences: Equipment attributes in two places can drift if event processing fails. We implemented idempotent event processors and dead-letter queues (DLQ) to manage processing failures, with operational dashboards monitoring DLQ depth. The independence gained — either domain can be deployed, upgraded, or scaled without coordinating with the other — was worth the operational overhead. This was validated concretely when the Maintenance domain database required a schema migration in 2018 that had no impact on Operations deployment.
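The idempotency and dead-letter handling can be sketched as follows, with placeholder interfaces for the processed-ID store, repository, and queue:

```python
# Sketch of an idempotent event consumer with dead-letter handling. The
# processed-ID set, repository, and queue are placeholders for real infrastructure.

class EquipmentEventConsumer:
    def __init__(self, processed_ids: set, maintenance_repo, dead_letter_queue):
        self.processed_ids = processed_ids    # IDs of events already applied
        self.repo = maintenance_repo
        self.dlq = dead_letter_queue

    def handle(self, event: dict) -> None:
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return                            # duplicate delivery: safe no-op
        try:
            # Upsert keeps Maintenance's copy of the equipment record in step with
            # Operations'; replaying the same event converges to the same state.
            self.repo.upsert_equipment(event["equipment_id"], event["attributes"])
            self.processed_ids.add(event_id)
        except Exception as exc:
            # Failed events are parked rather than lost; DLQ depth is dashboarded
            # so drift between the two representations is visible operationally.
            self.dlq.put({"event": event, "error": repr(exc)})
```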
ADR-004: Circuit breakers with explicit degraded modes for every external integration
Context: SCADA systems and ERP platforms fail in non-obvious ways. They don't always time out cleanly. Sometimes they send corrupt data. Sometimes connectivity is partially available — requests reach the integration endpoint but responses are lost. A system that blocked on integration availability would inherit the reliability characteristics of its least-reliable dependency.
Decision: All external integrations are wrapped in circuit breakers. When an integration fails beyond threshold, the circuit opens: calls fail fast, downstream systems operate in an explicitly-defined degraded mode, and operational staff are notified. Each system has an explicit specification of its degraded-mode behavior.
Consequences: Operational staff must understand degraded mode — what the system can and cannot do when SCADA data is unavailable. This required explicit UX design for degraded states and operational runbooks specifying what actions to take when specific circuit breakers open. The investment paid off consistently: SCADA maintenance windows became routine events with predictable system behavior, not operational crises.
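A compact illustration of the mechanism, with the thresholds, notifier, and fallback left as placeholders rather than anything from the real systems:

```python
# Minimal circuit-breaker sketch with an explicit degraded-mode fallback.
# Thresholds, the notifier, and the fallback behaviour are illustrative.

import time

class CircuitBreaker:
    def __init__(self, call, fallback, notify, failure_threshold=5, reset_after=60.0):
        self.call, self.fallback, self.notify = call, fallback, notify
        self.failure_threshold, self.reset_after = failure_threshold, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args, **kwargs)   # open: fail fast, degraded mode
            self.opened_at, self.failures = None, 0     # half-open: try the real call
        try:
            result = self.call(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self.notify("circuit opened")           # alert operational staff
            return self.fallback(*args, **kwargs)
```

A wrapped ERP client, for example, might fall back to its last cached response and mark it as stale, which is exactly the kind of degraded mode the runbooks then describe for operators.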
Tradeoffs Considered
Strong consistency vs. availability in the Operations domain
Strong consistency (rejected):
- All reads see the latest write
- Simpler application logic
- Requires synchronous replication
- Write availability tied to replica health
- Unacceptable for the high-availability write path
Eventual consistency (chosen):
- Write path available even during replica lag
- A production record delayed is operationally acceptable
- A production record lost is not — durability is non-negotiable
- Application must handle stale reads explicitly (sketched below)
- Read models in Reporting lag domain state by design
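The "handle stale reads explicitly" point can be sketched like this, with a hypothetical read-store accessor; the idea is that every read-model response carries the time of the last event it reflects:

```python
# Sketch of making stale reads explicit: every read-model response carries the
# time of the last event it reflects. The read store accessor is hypothetical.

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ReadResult:
    data: dict
    as_of: datetime                    # timestamp of the last event applied

    def age_seconds(self, now: Optional[datetime] = None) -> float:
        now = now or datetime.now(timezone.utc)
        return (now - self.as_of).total_seconds()

def current_shift_summary(read_model) -> ReadResult:
    row = read_model.latest()          # hypothetical accessor on the read store
    return ReadResult(data=row["summary"], as_of=row["last_event_at"])
```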
Event-driven vs. synchronous cross-domain integration
Synchronous integration (REST or RPC across domain boundaries) is simpler to implement and reason about. The failure mode is also simpler: if the downstream system is unavailable, the call fails immediately and visibly. We chose event-driven integration for the data acquisition path because the push-only SCADA constraint required it at that integration point; for consistency, we then applied the same pattern everywhere rather than mixing synchronous and asynchronous integration within a single system boundary.
The cost of event-driven integration is operational: dead-letter queues must be monitored, event replay mechanisms must exist, idempotency must be implemented and tested. These are real ongoing costs. The benefit — loose coupling with independent failure modes — was essential for the availability requirements.
Data duplication vs. shared database
We accepted data duplication as a deliberate tradeoff. The alternative — a shared database with cross-domain tables — would have made Maintenance domain schema changes a coordination problem involving the Operations team. At 8 years and across two countries, the accumulated coordination cost of that alternative would have been significant. The duplication cost (monitoring DLQ depth, handling processing failures) was smaller and predictable.
Scalability Challenges
The most significant scaling challenge emerged in phase 3: the Reporting domain's read models grew to hundreds of millions of records as operational data accumulated over years. Annual regulatory reports for both jurisdictions required querying this entire history, and the read models were not designed for this query scale.
The architectural response was tiered storage:
- Hot tier (0–90 days): Full-fidelity operational data in the primary read models, optimized for operational dashboards and day-to-day queries
- Warm tier (90 days – 2 years): Aggregated data at lower fidelity, retained for trend analysis and periodic regulatory reporting
- Cold tier (2+ years): Pre-computed aggregates for regulatory submission shapes, archived data for audit purposes
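A sketch of how a date-range query might be split across the tiers, with the cut-offs and store interfaces as placeholders; the real tiers lived in different storage systems and returned data at different fidelity:

```python
# Sketch of routing a date-range query across the tiers. Cut-offs and the
# hot/warm/cold store interfaces are placeholders; each tier returns data at
# its own fidelity (raw records, aggregates, pre-computed regulatory shapes).

from datetime import date, timedelta

HOT_DAYS = 90
WARM_YEARS = 2

def query_production(start: date, end: date, hot, warm, cold) -> list:
    today = date.today()
    hot_cutoff = today - timedelta(days=HOT_DAYS)
    warm_cutoff = today - timedelta(days=365 * WARM_YEARS)

    results = []
    if end >= hot_cutoff:
        results += hot.query(max(start, hot_cutoff), end)
    if start < hot_cutoff and end >= warm_cutoff:
        results += warm.query(max(start, warm_cutoff), min(end, hot_cutoff))
    if start < warm_cutoff:
        results += cold.query(start, min(end, warm_cutoff))
    return results
```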
This was a year-5 architectural change that should have been a year-1 design decision. Operational data growth patterns are predictable — the volume of shift data, equipment events, and production records is proportional to operational time. The growth trajectory was visible from month 1. The decision to design for it was deferred.
The lesson: data lifecycle planning belongs at architecture time, not after the storage problem becomes operational. The tiered storage architecture required 3 months of migration work on live systems. Designing it upfront would have been a 2-week engineering investment.
Observability
This is where the architecture was most deficient. Operational logging was in place from the start, but distributed tracing — the ability to correlate an event across domain boundaries — was added in year 5. Before that, debugging a cross-domain incident required manually correlating timestamps across separate log streams: Operations logs, Maintenance logs, the event bus audit log, and the SCADA adapter logs. This was slow and error-prone under operational time pressure.
The absence of distributed tracing was a deliberate decision in 2013, when the tooling landscape for distributed tracing was less mature and the instrumentation cost was genuinely higher. By 2016, the cost had dropped substantially and the incident response cost without it was visible. We didn't add it until 2018.
The lesson: distributed tracing instrumentation cost at project start is small; the incident response cost without it compounds with system complexity. In 2013 this was a reasonable tradeoff. In 2021 it would not be — the tooling is mature, the cost is low, and the operational benefit is immediate.
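Even without full tracing tooling, the minimum useful version is correlation-ID propagation. A sketch, with event shapes as plain dicts purely for illustration:

```python
# Sketch of correlation-ID propagation, the minimum that would have made manual
# timestamp matching unnecessary. Events are plain dicts here for illustration.

import uuid
from typing import Optional

def stamp(event: dict, correlation_id: Optional[str] = None) -> dict:
    """Attach a correlation ID at the edge if the event does not already carry one."""
    event.setdefault("correlation_id", correlation_id or uuid.uuid4().hex)
    return event

def derive(parent: dict, event: dict) -> dict:
    """Events produced while handling `parent` inherit its correlation ID, so
    Operations, Maintenance, bus audit, and SCADA adapter logs join on one key."""
    return stamp(event, parent["correlation_id"])
```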
Lessons Learned
- Domain boundaries should follow operational workflows, not organizational structures. Our initial Operations/Maintenance boundary aligned with organizational reporting lines. This created friction when maintenance activities affected operational readiness — the domains were more coupled than the boundary implied. The boundary needed to reflect how the domains interacted operationally, not who reported to whom organizationally.
- Architecture Decision Records were the highest-ROI documentation investment. Teams inheriting systems 4+ years later could reason about architectural decisions because the decisions were documented with context, rationale, and explicit tradeoffs. "Why do we have two representations of equipment data?" has a documented answer in ADR-003, including the specific scenario that motivated it.
- Circuit breakers required operational runbooks to be effective. The technical mechanism was correct. The operational gap was that on-call teams needed clear guidance on what to do when a specific circuit opened. "SCADA adapter circuit open" requires a different response than "ERP integration circuit open." Generic "system degraded" alerts without contextual runbooks pushed decision-making to on-call engineers who didn't always have the context.
- Eventual consistency requires explicit degraded-mode UX design. Operational interfaces that show potentially stale data need to communicate staleness. We underinvested in this initially, which led to operator confusion when SCADA data was delayed. Adding staleness indicators to the operational dashboards was straightforward; the delay in adding them was organizational, not technical.
- Data growth planning is architecture work, not operations work. The tiered storage solution we built in year 5 should have been designed in year 1. The growth trajectory was predictable. The cost of retrofitting it onto live systems was high.
What I'd Improve Today
Distributed tracing from day one. The instrumentation cost in 2021 is low enough that there's no longer a credible argument for deferring it. Correlating events across domain boundaries should be trivial, not an incident response bottleneck.
Explicit degraded-mode specification for every integration boundary. When a circuit breaker opens, what does each downstream system do? This needs to be a first-class design artifact, not something determined ad-hoc during incidents. Every integration should have a documented degraded-mode behavior that operators and on-call engineers can reference without reading the code.
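One way such an artifact could look, sketched as data rather than prose; the SCADA entry's contents are purely illustrative:

```python
# Sketch of a degraded-mode specification as data: one entry per integration
# boundary, readable without opening the code. The SCADA entry is illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class DegradedModeSpec:
    integration: str            # which circuit this describes
    still_works: list           # capabilities preserved while the circuit is open
    unavailable: list           # capabilities explicitly lost
    operator_action: str        # what the on-call runbook says to do

SCADA_ADAPTER = DegradedModeSpec(
    integration="scada-adapter",
    still_works=["manual production entry", "viewing last-known sensor values"],
    unavailable=["live equipment telemetry", "automatic utilisation updates"],
    operator_action=(
        "Confirm whether a SCADA maintenance window is in progress; switch "
        "dashboards to last-known-value mode and record readings manually."
    ),
)
```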
Data lifecycle planning at architecture time. Tiered storage, retention policies, and archival strategies for operational data should be designed when the write schemas are designed. Retrofitting these onto live systems with regulatory data is expensive and risky.
Contract testing for event schemas. Domain events are the contracts between bounded contexts. Schema changes that break consumer expectations are the event-driven equivalent of a broken REST API contract. Consumer-driven contract tests for event schemas would have caught schema evolution issues before deployment rather than in the event bus DLQ.
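A sketch of what such a test could look like for a hypothetical ProductionRecorded event, with the producer-published sample hard-coded here so the example is self-contained:

```python
# Sketch of a consumer-driven contract test for an event schema. The sample
# event stands in for an example the producing domain would publish from its
# build; here it is hard-coded so the sketch is self-contained.

REQUIRED_FIELDS = {                 # fields a hypothetical Reporting consumer relies on
    "event_id": str,
    "equipment_id": str,
    "shift_id": str,
    "tonnes": (int, float),
}

def producer_sample_event() -> dict:
    return {
        "event_id": "e-123",
        "equipment_id": "TRK-042",
        "shift_id": "2021-03-01-day",
        "tonnes": 187.5,
        "operator_id": "op-9",      # extra fields are fine; consumers ignore them
    }

def test_production_recorded_contract():
    sample = producer_sample_event()
    for name, expected_type in REQUIRED_FIELDS.items():
        assert name in sample, f"ProductionRecorded is missing field: {name}"
        assert isinstance(sample[name], expected_type), f"wrong type for {name}"
```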
Invest in domain language documentation earlier. In mining operations, each domain describes the same operational reality in its own terms: Operations thinks in shift reports, Maintenance in equipment records, Safety in incident registers. Documenting the ubiquitous language of each domain and making it explicit in code would have reduced the cognitive overhead for every engineer who worked on the system after the initial team.