Backup and Recovery
Overview
The ONES on-premise backend currently uses two database engines. TimescaleDB stores and serves time-series telemetry data, while PostgreSQL does the same for aggregated CRUD data. Both engines are currently deployed in a single-server form factor. Because they are the backbone of the ONES application, its overall availability depends on their proper functioning.
In this document, we first present a few unavailability scenarios relevant to the ONES application. For each scenario, we highlight the impact in terms of mean time between occurrences, mean time to recover, and, last but not least, whether the scenario requires data recovery or migration. Finally, we propose a standard set of solutions recommended by both TimescaleDB and PostgreSQL for handling data and service recovery in such situations.
Scenarios Considered
| Scenario | Impact | Mean Time Between Occurrences | Mean Time of Unavailability | Mean Time to Recover | Service Loss Impact | Data Loss Impact | Need for Data Recovery/Migration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ONES Application Upgrade | Low | Occasional, between releases (major as well as patch) | Low | Not applicable | Low | No | Not required |
| ONES TimescaleDB and Postgres Upgrade | Low | Occasional, when application needs new database features to be enabled | Low | Not applicable | Low | No | Not required |
| Application Crashes | Low | Occasional | Mostly low | Mostly low | Low | No | Not required |
| Database Instance Crashes (Recoverable) | Medium | Proven COTS components, very infrequent | Low | Low | Medium | No | Not required |
| Database Instance Crashes (Irrecoverable mainly due to data corruption) | High (depends on existing data volume) | Proven COTS components, very infrequent | Very High | Very High | High | No | Required within the same server |
| Media Failure (Recoverable) | High (depends on existing data volume) | Very infrequent | High | High | High | No | No |
| Media Failure (Irrecoverable) | High (depends on existing data volume) | Very infrequent | Very High | Very High | High | Yes | Required across different servers |
| Data Center Disaster | High (depends on existing data volume) | Extremely Infrequent | Extremely High | Extremely High | Extremely High | Extremely High | Cross location |
| Data Migration due to DB product replacement | High (depends on existing data volume) | Extremely Infrequent | Very High | Very High | High | Yes | Required |
From the above table, it is evident that even though the scenarios requiring data migration or recovery are infrequent, their availability impact is high. The rest of the document presents a set of solutions, the pros and cons of each, the underlying assumptions, and the end-user-side requirements.
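To make the trade-off between frequency and recovery time concrete, the sketch below uses the standard steady-state availability formula, availability = MTBO / (MTBO + MTTR), treating mean time between occurrences (MTBO) as the mean uptime between failures. The function names and the hour values are hypothetical illustrations, not measurements from ONES deployments.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbo_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between occurrences and mean time to recover."""
    if mtbo_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbo_hours must be positive and mttr_hours non-negative")
    return mtbo_hours / (mtbo_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbo_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbo_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical numbers, for illustration only:
# a recoverable crash roughly once a month that takes 5 minutes to recover from
frequent_short = expected_downtime_hours_per_year(mtbo_hours=30 * 24, mttr_hours=5 / 60)
# an irrecoverable failure roughly once in five years that takes 3 days to recover from
rare_long = expected_downtime_hours_per_year(mtbo_hours=5 * HOURS_PER_YEAR, mttr_hours=72)
```

Under these assumed numbers the rare, slow-to-recover event contributes more expected downtime than the frequent, quickly recovered one, which is the pattern the table points at.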
ONES Solution
ONES provides a DB backup service that performs periodic backups to a remote NFS backup server endpoint provided by the customer. In disaster scenarios, customers can engage our SRE teams to recover the data from these backups.
In the future, ONES will implement backup and restore using a DB backup service that performs a transparent migration from single-server Postgres/TimescaleDB to distributed TimescaleDB, handling both fresh installs and upgrades.
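As a rough illustration of the periodic backup to a customer-provided NFS endpoint described above, the following minimal sketch builds a `pg_dump` invocation that writes a timestamped dump into a directory assumed to be an NFS mount, and selects old dumps for pruning. It is not the ONES backup service itself: the mount path, database name, user, retention count, and all function names are assumptions made for the example.

```python
from __future__ import annotations

import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_pg_dump_command(dbname: str, host: str, port: int, user: str,
                          backup_dir: Path, now: datetime) -> list[str]:
    """Build a pg_dump command that writes a custom-format dump into backup_dir."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # sortable UTC timestamp in the file name
    target = backup_dir / f"{dbname}-{stamp}.dump"
    return [
        "pg_dump",
        "--format=custom",
        f"--host={host}",
        f"--port={port}",
        f"--username={user}",
        f"--dbname={dbname}",
        f"--file={target}",
    ]


def select_expired(backups: list[Path], keep: int) -> list[Path]:
    """Return the dumps to delete, keeping the `keep` newest (by timestamped name)."""
    ordered = sorted(backups, key=lambda p: p.name)
    if keep <= 0:
        return ordered
    return ordered[:-keep]


def run_backup(dbname: str, backup_dir: Path, keep: int = 7) -> None:
    """Take one backup and prune older dumps; intended to be run by a scheduler."""
    # Assumed connection details; a real deployment would read these from configuration.
    cmd = build_pg_dump_command(dbname, "localhost", 5432, "postgres",
                                backup_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if pg_dump fails
    for old in select_expired(list(backup_dir.glob(f"{dbname}-*.dump")), keep):
        old.unlink()
```

A scheduler such as cron could call `run_backup("ones", Path("/mnt/ones-backup"))` at the desired interval, where `/mnt/ones-backup` is a hypothetical mount point for the customer's NFS export. Restore is outside this sketch.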