Backup and Recovery

Overview

The current architecture of the ONES on-premise backend involves two database engines. One, TimescaleDB, stores and serves time-series telemetry data; the other, PostgreSQL, does the same for aggregated CRUD data. Both engines are currently deployed in a single-server form factor. As the backbone of the ONES application, they must function properly for the application to remain available.

In this document, we first present several unavailability scenarios relevant to the ONES application. For each scenario, we highlight the impact in terms of mean time between occurrences, mean time to recover, and whether the scenario requires data recovery or migration. Finally, we propose a standard set of solutions recommended by both TimescaleDB and Postgres for handling data and service recovery in such situations.

Scenarios Considered

| Scenario | Impact | Mean Time Between Occurrences | Mean Time of Unavailability | Mean Time to Recover | Service Loss Impact | Data Loss Impact | Need for Data Recovery/Migration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ONES Application Upgrade | Low | Occasional, between releases (major as well as patch) | Low | Not applicable | Low | No | Not required |
| ONES TimescaleDB and Postgres Upgrade | Low | Occasional, when application needs new database features to be enabled | Low | Not applicable | Low | No | Not required |
| Application Crashes | Low | Occasional | Mostly low | Mostly low | Low | No | Not required |
| Database Instance Crashes (Recoverable) | Medium | Proven COTS components, very infrequent | Low | Low | Medium | No | Not required |
| Database Instance Crashes (Irrecoverable mainly due to data corruption) | High (depends on existing data volume) | Proven COTS components, very infrequent | Very High | Very High | High | No | Required within the same server |
| Media Failure (Recoverable) | High (depends on existing data volume) | Very infrequent | High | High | High | No | No |
| Media Failure (Irrecoverable) | High (depends on existing data volume) | Very infrequent | Very High | Very High | High | Yes | Required across different servers |
| Data Center Disaster | High (depends on existing data volume) | Extremely Infrequent | Extremely High | Extremely High | Extremely High | Extremely High | Cross location |
| Data Migration due to DB product replacement | High (depends on existing data volume) | Extremely Infrequent | Very High | Very High | High | Yes | Required |

The table above shows that although the scenarios requiring data migration or recovery are infrequent, their availability impact is high. The rest of this document presents a set of solutions, the pros and cons of each, the underlying assumptions, and the requirements on the end-user side.
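To make the trade-off between frequency and recovery time concrete, the sketch below applies the commonly used steady-state availability formula, availability = MTBF / (MTBF + MTTR), treating "mean time between occurrences" as MTBF and "mean time to recover" as MTTR. The function names and all numeric figures are illustrative assumptions, not measurements from ONES deployments.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, given mean time between
    occurrences (MTBF) and mean time to recover (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures, chosen only to illustrate the shape of the trade-off:
# a frequent but quickly recovered application crash ...
app_crash = expected_downtime_hours_per_year(mtbf_hours=30 * 24, mttr_hours=5 / 60)
# ... versus a rare irrecoverable database crash with a long restore.
db_crash = expected_downtime_hours_per_year(mtbf_hours=5 * 8760, mttr_hours=48)

print(f"application crash: {app_crash:.2f} h/year")
print(f"irrecoverable DB crash: {db_crash:.2f} h/year")
```

With these made-up inputs, the rare scenario contributes roughly 9.6 hours of expected downtime per year against roughly 1 hour for the frequent one, which is the pattern the table describes: low frequency does not imply low availability impact when recovery is slow.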

ONES Solution

ONES provides a DB backup service that performs periodic backups to a remote NFS backup server endpoint provided by the customer. In disaster scenarios, customers can engage our SRE team to recover the data from these backups.
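As a rough illustration of what a periodic backup to an NFS mount can look like, the sketch below builds a `pg_dump` command that writes a timestamped custom-format dump and selects old dumps for pruning. It is not the ONES backup service: the mount path, database name, file naming scheme, retention count, and function names are all assumptions made for this example.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_pg_dump_command(db_name, backup_dir, now=None,
                          host="localhost", port=5432, user="postgres"):
    """Return (command, target_path) for a custom-format pg_dump.

    backup_dir is assumed to be a directory on the customer's NFS mount.
    """
    now = now or datetime.now(timezone.utc)
    # Fixed-width UTC timestamp so that file names sort chronologically.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(backup_dir) / f"{db_name}_{stamp}.dump"
    cmd = [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format=custom",   # compressed archive readable by pg_restore
        "--file", str(target),
        db_name,
    ]
    return cmd, target


def backups_to_prune(backup_dir, db_name, keep):
    """Return the dumps older than the newest `keep` files."""
    dumps = sorted(Path(backup_dir).glob(f"{db_name}_*.dump"))
    return dumps[:-keep] if keep > 0 else dumps


def run_backup(db_name, backup_dir, keep=7):
    """Take one backup, then delete dumps beyond the retention count."""
    cmd, target = build_pg_dump_command(db_name, backup_dir)
    subprocess.run(cmd, check=True)  # raises if pg_dump fails
    for old in backups_to_prune(backup_dir, db_name, keep):
        old.unlink()
    return target
```

A scheduler such as cron could call `run_backup("ones", "/mnt/nfs_backup")` at the desired interval; both arguments here are placeholders. Restoring a TimescaleDB database from such a dump typically involves additional steps described in the TimescaleDB documentation, which this sketch does not cover.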

In the future, ONES will implement backup and restore using a DB backup service that performs a transparent migration from single-server Postgres/Timescale to distributed Timescale, handling both fresh installs and upgrades.
