Backup and Recovery
Overview
The current ONES on-premise backend uses two database engines. TimescaleDB stores and serves time-series telemetry data, while PostgreSQL does the same for aggregated CRUD data. Both engines are currently deployed on a single server. Because they are the backbone of the ONES application, its overall availability depends on their functioning properly.
This document first presents a set of unavailability scenarios relevant to the ONES application. For each scenario, it highlights the impact in terms of mean time between occurrences, mean time to recover, and whether the scenario requires data recovery or migration. It then proposes a standard set of solutions, recommended by both TimescaleDB and PostgreSQL, for handling data and service recovery in such situations.
Scenarios Considered
Scenario | Impact | Mean Time Between Occurrences | Mean Time of Unavailability | Mean Time to Recover | Service Loss Impact | Data Loss Impact | Need for Data Recovery/Migration |
--- | --- | --- | --- | --- | --- | --- | --- |
ONES Application Upgrade | Low | Occasional, between releases (major as well as patch) | Low | Not applicable | Low | No | Not required |
ONES TimescaleDB and Postgres Upgrade | Low | Occasional, when application needs new database features to be enabled | Low | Not applicable | Low | No | Not required |
Application Crashes | Low | Occasional | Mostly low | Mostly low | Low | No | Not required |
Database Instance Crashes (Recoverable) | Medium | Proven COTS components, very infrequent | Low | Low | Medium | No | Not required |
Database Instance Crashes (Irrecoverable mainly due to data corruption) | High (depends on existing data volume) | Proven COTS components, very infrequent | Very High | Very High | High | No | Required within the same server |
Media Failure (Recoverable) | High (depends on existing data volume) | Very infrequent | High | High | High | No | Not required |
Media Failure (Irrecoverable) | High (depends on existing data volume) | Very infrequent | Very High | Very High | High | Yes | Required across different servers |
Data Center Disaster | High (depends on existing data volume) | Extremely infrequent | Extremely High | Extremely High | Extremely High | Extremely High | Required across locations |
Data Migration due to DB product replacement | High (depends on existing data volume) | Extremely Infrequent | Very High | Very High | High | Yes | Required |
The table above shows that although the scenarios requiring data migration or recovery are infrequent, their impact on availability is high. The rest of the document presents a set of solutions, the pros and cons of each, the assumptions made, and the requirements on the end-user side.
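To make the relationship between occurrence frequency and recovery time concrete, the sketch below applies the standard steady-state availability formula, availability = MTBO / (MTBO + MTTR). The function name and the sample figures are hypothetical and chosen only for illustration; they are not ONES measurements.

```python
def availability(mtbo_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between occurrences (MTBO)
    and mean time to recover (MTTR), both expressed in hours."""
    if mtbo_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBO must be positive and MTTR must be non-negative")
    return mtbo_hours / (mtbo_hours + mttr_hours)


# Hypothetical figures, for illustration only.
scenarios = {
    "frequent, fast recovery": (2_000.0, 1.0),     # e.g. a recoverable crash
    "infrequent, slow recovery": (20_000.0, 72.0),  # e.g. an irrecoverable media failure
}

for name, (mtbo, mttr) in scenarios.items():
    print(f"{name}: availability = {availability(mtbo, mttr):.5f}")
```

With these sample inputs, the rare but slow-to-recover scenario yields the lower availability, which is why infrequent scenarios can still dominate the availability budget.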
ONES Solution
ONES provides a DB backup service that performs periodic backups to a remote NFS backup server endpoint provided by the customer. In disaster scenarios, customers can engage our SRE team to recover the data from these backups.
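As a rough illustration of what one periodic backup run could look like, the sketch below writes a timestamped `pg_dump` archive into a directory assumed to be an NFS mount of the customer's backup endpoint. This is not the ONES backup service itself: the function names, mount path, and connection parameters are hypothetical, and the choice of `pg_dump` in custom format is an assumption rather than something this document specifies.

```python
from datetime import datetime, timezone
from pathlib import Path
import subprocess


def backup_path(backup_dir: Path, dbname: str, now: datetime) -> Path:
    """Return a timestamped dump file path inside the backup directory."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return backup_dir / f"{dbname}-{stamp}.dump"


def build_pg_dump_command(host: str, port: int, user: str,
                          dbname: str, target: Path) -> list[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",  # compressed archive, restorable with pg_restore
        "--file", str(target),
        dbname,
    ]


def run_backup(backup_dir: Path, host: str, port: int,
               user: str, dbname: str) -> Path:
    """Run one backup and return the path of the archive that was written.

    Assumes backup_dir is already mounted (for example over NFS) and that
    credentials are supplied out of band, e.g. via a ~/.pgpass file.
    """
    target = backup_path(backup_dir, dbname, datetime.now(timezone.utc))
    subprocess.run(build_pg_dump_command(host, port, user, dbname, target),
                   check=True)  # raises CalledProcessError if pg_dump fails
    return target
```

A scheduler such as cron could call `run_backup(Path("/mnt/ones-backup"), "localhost", 5432, "postgres", "ones")` at the desired interval; all of those argument values are placeholders. Restoring a TimescaleDB database from such an archive involves additional steps described in the TimescaleDB documentation.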
In the future, ONES will implement backup and restore through a DB backup service that transparently migrates from single-server PostgreSQL/TimescaleDB to distributed TimescaleDB, handling both fresh installs and upgrades.