Overview
The current ONES on-premise backend uses two database engines. TimescaleDB stores and serves time-series telemetry data, while PostgreSQL does the same for aggregated CRUD data. Both engines are currently deployed on a single server. Because they are the backbone of the ONES application, its overall availability depends on their proper functioning.
In this document, we first present a few unavailability scenarios relevant to the ONES application. For each scenario, we highlight the impact in terms of mean time between occurrences and mean time to recovery, and note whether the scenario requires data recovery or migration. Finally, we propose a standard set of solutions, recommended by both TimescaleDB and Postgres, for handling data and service recovery in such situations.
From the above table, it is evident that although the scenarios requiring data migration or recovery are infrequent, their availability impact is high. The rest of the document presents a set of solutions, the pros and cons of each, the underlying assumptions, and the end-user-side requirements.
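To make the relationship between occurrence frequency, recovery time, and availability concrete, the following minimal sketch applies the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function names and the two scenarios are hypothetical illustrations, not ONES measurements.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for one failure scenario."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical numbers, chosen only to contrast the two kinds of scenario.
# Frequent but quickly recovered (e.g. a service restart): monthly, 6 minutes.
frequent = annual_downtime_hours(mtbf_hours=720, mttr_hours=0.1)
# Rare but requiring data recovery (e.g. disk loss): every two years, 2 days.
rare = annual_downtime_hours(mtbf_hours=17520, mttr_hours=48)

print(f"frequent scenario: {frequent:.1f} h/year of downtime")
print(f"rare scenario:     {rare:.1f} h/year of downtime")
```

Under these assumed numbers, the rare scenario contributes far more downtime per year than the frequent one, because the recovery time dominates the result.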
ONES provides a DB backup service that performs periodic backups to a remote NFS backup server endpoint provided by the customer. In disaster scenarios, customers can engage our SRE teams to recover the data from these backups.
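As a rough illustration of what one periodic backup run could look like, the sketch below builds a `pg_dump` invocation that writes a timestamped, custom-format dump into a directory assumed to be the customer's NFS mount. It is not the ONES backup service itself: the mount path, database name, and function names are hypothetical, and connection settings are assumed to come from the standard libpq environment variables (for example `PGHOST` and `PGUSER`).

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def backup_target(db_name: str, backup_dir: Path, now: datetime) -> Path:
    """Return a timestamped dump path so successive runs never overwrite."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return backup_dir / f"{db_name}-{stamp}.dump"


def build_backup_command(db_name: str, target: Path) -> list[str]:
    """Build a pg_dump command that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", f"--file={target}", db_name]


def run_backup(db_name: str, backup_dir: Path) -> Path:
    """Run one backup; raises CalledProcessError if pg_dump fails."""
    target = backup_target(db_name, backup_dir, datetime.now(timezone.utc))
    subprocess.run(build_backup_command(db_name, target), check=True)
    return target


# Hypothetical values: the NFS mount point and database name are assumptions.
example_dir = Path("/mnt/nfs/ones-backups")
example_time = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
example_target = backup_target("ones", example_dir, example_time)
print(" ".join(build_backup_command("ones", example_target)))
```

A scheduler such as cron could call `run_backup` at the desired interval. Restoring a TimescaleDB database from such a dump involves additional steps that are outside the scope of this sketch.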
In the future, ONES will implement backup and restore using a DB backup service that performs a transparent migration from single-server Postgres/Timescale to distributed Timescale, handling both fresh installs and upgrades.