Monitoring with auto-recovery

High availability is a critical component of your SaaS, and achieving it requires several measures.

In general, Omnistrate provides full support for your control plane and your data plane (aka application) infrastructure, as well as automated L1 support for your data plane failures.

Control plane failures

We monitor, detect, and recover from any failures in your control plane to give you a 99.99% SLA.

Data plane infrastructure failures

Omnistrate automatically detects the following failures and seamlessly recovers from them:

Dead processes
Machine failures
Network partitions
Degraded storage
Zonal failures

When we detect one of these failures, we try basic recovery mechanisms, for example restarting the process, restarting the machine, or replacing the machine. If we can't recover, we alert your team through the configured mechanism so they can look into it further.
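The escalation described above (try progressively heavier recovery steps, then alert) can be pictured with a minimal Python sketch. This is only an illustration of the pattern, not Omnistrate's implementation: the `recover` function, the step names, and the `alert` callback are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a step is (name, action). The action returns True
# if the workload is healthy again after the step has run.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover(steps: Sequence[RecoveryStep], alert: Callable[[str], None]) -> str:
    """Try each recovery step in order; alert if none of them succeeds.

    Returns the name of the step that recovered the workload,
    or "alerted" if every step failed.
    """
    for name, action in steps:
        try:
            if action():
                return name
        except Exception:
            # A step that raises is treated as "did not recover";
            # fall through to the next, heavier step.
            continue
    alert("automatic recovery failed; manual investigation needed")
    return "alerted"
```

A caller would pass the steps from lightest to heaviest, for example restart the process, then restart the machine, then replace the machine, so that the alert only fires once every automated option has been exhausted.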

Data plane non-infra issues

In addition, Omnistrate provides mechanisms for you to detect and recover from process failures using a healthcheck actionhook.

To configure a healthcheck actionhook, provide a check that we can run on a regular basis to validate the health of the process.

As an example, let's say you want to verify liveness for your database application. You can perform a read-after-write query to make sure the database is making progress.
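A read-after-write check along those lines might look like the following Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for your database. The `healthcheck_probe` table and the `read_after_write_ok` function are hypothetical names, and how the result is reported back to the healthcheck actionhook is not covered here.

```python
import sqlite3
import time
import uuid


def read_after_write_ok(conn: sqlite3.Connection) -> bool:
    """Write a unique marker row, read it back, and report whether it matched."""
    marker = uuid.uuid4().hex
    try:
        # Hypothetical probe table used only by the health check.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS healthcheck_probe "
            "(marker TEXT PRIMARY KEY, written_at REAL)"
        )
        # Clear old probes so the table does not grow without bound.
        conn.execute("DELETE FROM healthcheck_probe")
        conn.execute(
            "INSERT INTO healthcheck_probe VALUES (?, ?)", (marker, time.time())
        )
        conn.commit()
        row = conn.execute(
            "SELECT marker FROM healthcheck_probe WHERE marker = ?", (marker,)
        ).fetchone()
        return row is not None and row[0] == marker
    except sqlite3.Error:
        # Any database error means the write or the read did not succeed.
        return False
```

Using a fresh random marker on every run means a stale row from an earlier probe cannot make the check pass by accident. A wrapper script could, for instance, exit with a non-zero status when the function returns `False`, assuming that is how your check signals failure.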

Note that you can specify different checks in the same healthcheck to make sure all components of your application are up and running. For example, if you also want a simple check that your process is up and running, you can add one that uses the ps utility in addition to the liveness check above.
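As a rough sketch of combining several checks into one health verdict, the Python below pairs a `ps`-based process check with any other check you supply. The function names (`process_is_running`, `run_checks`, `is_healthy`) are hypothetical, and the sketch assumes a POSIX-style `ps` that accepts `-p <pid>` and exits non-zero when the process does not exist.

```python
import subprocess
from typing import Callable, Dict


def process_is_running(pid: int) -> bool:
    """Return True if `ps -p <pid>` reports that the process exists."""
    try:
        result = subprocess.run(
            ["ps", "-p", str(pid)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # `ps` is missing or cannot be executed; treat as a failed check.
        return False
    return result.returncode == 0


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """The application is healthy only if every individual check passed."""
    return all(results.values())
```

For instance, `run_checks({"process": lambda: process_is_running(pid), "liveness": my_liveness_check})`, where `pid` and `my_liveness_check` are placeholders for your own values, returns a per-check result map. Keeping the per-check results is useful for reporting which component failed.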

If your process health check fails, we alert your team with the relevant details so they can look into it.

For more details on actionhooks, please see here

To learn more about integrating with operational tools like PagerDuty, please see here