ONES Rule Engine

⚡ Overview

In data center operations, maintaining reliability and uptime requires more than just monitoring — it demands proactive detection and rapid response.

A Rule Engine plays a critical role by continuously tracking key performance metrics and triggering alerts when thresholds are breached. This ensures that operators can:

✅ Identify anomalies early, before they escalate.
✅ Respond quickly to potential risks or outages.
✅ Safeguard critical components and services from failure.
✅ Automate escalation via integrations (Slack, Zendesk, ServiceNow).

💡 In essence, Rule Engine alerts act as the first line of defense, keeping the data center environment stable, resilient, and secure.
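The threshold check described above can be pictured with a minimal sketch. This is not the ONES implementation: the `ThresholdRule` class, its field names, and the comparison symbols are all hypothetical and are used only to illustrate the idea of comparing an observed metric against a configured limit.

```python
from dataclasses import dataclass
import operator

# Comparison operators a rule may use (hypothetical names, not ONES syntax).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # e.g. "CPU Utilization"
    comparison: str   # one of the keys in _OPS
    threshold: float  # limit that the observed value is compared against

    def is_breached(self, observed: float) -> bool:
        """Return True when the observed value crosses the threshold."""
        return _OPS[self.comparison](observed, self.threshold)


# Example: flag every CPU reading above 90 percent.
rule = ThresholdRule(metric="CPU Utilization", comparison=">", threshold=90.0)
readings = [42.0, 88.5, 93.2]
alerts = [value for value in readings if rule.is_breached(value)]
print(alerts)
```

Keeping the comparison as data rather than code means the same class can express both "too high" rules (utilization, temperature) and "too low" rules (for example, transceiver Rx power).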

Device Based Rules
  • CPU and Memory Utilization

  • CPU Core Temperature Alerts

  • Fan & PSU LED status

  • SSD Memory Utilization, Health & Temperature Status

  • Traffic Bandwidth

  • ASIC Routes (IPv4 & IPv6)

  • Device & Docker Down Alerts

  • Docker Per Process Down Alert

  • BGP Neighbour Down Alert

  • Component failure

  • Interface Flap Alerts

  • Traffic Errors & Discard Counters

  • PFC Counters

  • Device Queue Counters

  • Config Change Alert

  • Docker CPU & Memory Utilization (per service)

  • IP Change Alerts

  • Failed FANs & PSUs

  • FAN Speed

  • MCLAG (Member/Peer/Session) Down Alerts

  • NTP Drift

  • PSU Temp Alerts

  • Unhealthy Devices Alerts

Interface Based Rules
  • Broadcast/Multicast Storm

  • PFC Rx/Tx Counters

  • Port Flap

  • Queue Transmit Counters

  • Traffic InDiscards

  • Traffic InErrors

  • Traffic OutDiscards

  • Traffic OutErrors

  • Traffic Rx/Tx Utilization

  • Transceiver Rx/Tx Power

  • Transceiver Temperature

  • Transceiver Voltage

GPU/ONES Server Based Rules
  • CPU Core Temperature

  • CPU Utilization

  • DISK Health

  • DISK Temperature

  • DISK Used Memory Percent

  • Device Down

  • Docker Down

  • GPU Memory Utilization

  • GPU PSU 1 Power Draw

  • GPU PSU 2 Power Draw

  • GPU Temperature

  • GPU Utilization

  • Memory Utilization

Rule Engine alerts support efficient resource utilization, timely troubleshooting, early detection of potential issues, and overall operational stability within the data center environment.

Alert Trigger on threshold value breach

Notification

When a threshold value is breached, ONES-App can send the alert to:

  • Slack Channel

  • Zendesk Support

  • ServiceNow
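As a rough illustration of fanning one alert out to several destinations, the sketch below registers a sender per channel and delivers the same message to each. The `Notifier` class and the in-memory `outbox` are hypothetical stand-ins; real Slack, Zendesk, or ServiceNow integrations would replace the list-append senders with authenticated API calls, which are not shown here.

```python
from typing import Callable, Dict, List

Sender = Callable[[str], None]


class Notifier:
    """Deliver one alert message to every registered channel."""

    def __init__(self) -> None:
        self._senders: Dict[str, Sender] = {}

    def register(self, channel: str, sender: Sender) -> None:
        self._senders[channel] = sender

    def notify(self, message: str) -> List[str]:
        """Send the message everywhere; return the channels that succeeded."""
        delivered: List[str] = []
        for channel, sender in self._senders.items():
            try:
                sender(message)
                delivered.append(channel)
            except Exception:
                # A failing channel should not block the remaining ones.
                continue
        return delivered


# Stand-in senders: each channel simply appends to an in-memory list.
outbox: Dict[str, List[str]] = {"slack": [], "zendesk": [], "servicenow": []}
notifier = Notifier()
for name in outbox:
    notifier.register(name, outbox[name].append)

delivered = notifier.notify("leaf-01: CPU Utilization 93.2% exceeds 90%")
print(delivered)
```

Isolating each sender behind a try/except is one possible design choice: an outage in one integration then does not suppress the alert on the others.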

Rules are categorized by the level of the metric hierarchy they apply to:

  1. Device Level

  2. Interface Level

  3. Server Level
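One way to picture this categorization is to tag each metric with its level and group rules accordingly. The sketch below is illustrative only: the `Level` enum and the `METRIC_LEVEL` mapping are hypothetical names, and the mapping covers just a handful of the metrics listed on this page.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List


class Level(Enum):
    DEVICE = "Device"
    INTERFACE = "Interface"
    SERVER = "Server"


# A small sample of metrics from this page, mapped to their hierarchy level.
METRIC_LEVEL: Dict[str, Level] = {
    "Failed Fans": Level.DEVICE,
    "PSU Temperature": Level.DEVICE,
    "TX Utilization": Level.INTERFACE,
    "In Errors": Level.INTERFACE,
    "GPU Temperature": Level.SERVER,
    "GPU Utilization": Level.SERVER,
}


def group_by_level(metrics: Iterable[str]) -> Dict[Level, List[str]]:
    """Group metric names by hierarchy level, preserving input order."""
    grouped: Dict[Level, List[str]] = defaultdict(list)
    for metric in metrics:
        grouped[METRIC_LEVEL[metric]].append(metric)
    return dict(grouped)


grouped = group_by_level(
    ["In Errors", "GPU Temperature", "Failed Fans", "TX Utilization"]
)
for level, names in grouped.items():
    print(level.value, names)
```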

The table below lists every metric supported by the Rule Engine, together with the unit, measure, and value a user can set for each.

| Hierarchy     | Metrics                         | Unit           | Measure        | Value         |
|---------------|---------------------------------|----------------|----------------|---------------|
| Device/Server | CPU Utilization                 | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device/Server | Memory Utilization              | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device        | Failed Fans                     | Count          | MIN/MAX        | Count         |
| Device        | Failed PSU                      | Count          | MIN/MAX        | Count         |
| Device/Server | CPU Core Temperature            | Celsius (°C)   | AVG/MIN/MAX    | Celsius       |
| Device        | PSU Temperature                 | Celsius (°C)   | AVG/MIN/MAX    | Celsius       |
| Device        | FAN Speed                       | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device        | ASIC IPv4 Routes Utilization    | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device        | ASIC IPv6 Routes Utilization    | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device        | BGP Nbrs Operationally Down     | Count          | AVG/MIN/MAX    | Count of Nbrs |
| Device        | FRR Container CPU Utilization   | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device        | Syncd Container CPU Utilization | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Device/Server | Device Down                     | NA             | NA             | NA            |
| Device/Server | Docker Down                     | NA             | NA             | NA            |
| Device        | Queue Counter                   | Count          | AVG/MIN/MAX    | Count         |
| Device/Server | SSD Health                      | Percentage (%) | Percentage (%) | 0-100         |
| Device/Server | SSD Temperature                 | Celsius (°C)   | AVG/MIN/MAX    | Celsius       |
| Device/Server | SSD Memory                      | Percentage (%) | Percentage (%) | 0-100         |
| Interface     | Int Flap                        | NA             | NA             | NA            |
| Interface     | PFC Counters                    | Count          | AVG/MIN/MAX    | Count         |
| Interface     | Queue Counters                  | Count          | AVG/MIN/MAX    | Count         |
| Interface     | TX Utilization                  | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Interface     | RX Utilization                  | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Interface     | In Errors                       | Count          | AVG/MIN/MAX    | User defined  |
| Interface     | Out Errors                      | Count          | AVG/MIN/MAX    | User defined  |
| Interface     | In Discards                     | Count          | AVG/MIN/MAX    | User defined  |
| Interface     | Out Discards                    | Count          | AVG/MIN/MAX    | User defined  |
| Interface     | Tranx TX Power                  | dBm            | AVG/MIN/MAX    | User defined  |
| Interface     | Tranx RX Power                  | dBm            | AVG/MIN/MAX    | User defined  |
| Interface     | Tranx Temperature               | Celsius (°C)   | AVG/MIN/MAX    | User defined  |
| Interface     | Tranx Voltage                   | Volts (V)      | AVG/MIN/MAX    | User defined  |
| Server        | GPU Mem Util                    | Percentage (%) | AVG/MIN/MAX    | 0-100         |
| Server        | GPU PSU 1 Power Draw            | Count          | AVG/MIN/MAX    | User defined  |
| Server        | GPU PSU 2 Power Draw            | Count          | AVG/MIN/MAX    | User defined  |
| Server        | GPU Temperature                 | Celsius (°C)   | AVG/MIN/MAX    | User defined  |
| Server        | GPU Utilization                 | Percentage (%) | AVG/MIN/MAX    | 0-100         |
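To show how the Measure and Value columns might interact, here is a small sketch that aggregates a window of samples with AVG, MIN, or MAX and then compares the result with a threshold. The function names, the sampling window, and the "greater than" comparison are assumptions made for illustration; the document does not specify how ONES aggregates samples or which comparison it applies.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Measures from the table, mapped to standard aggregation functions.
MEASURES: Dict[str, Callable[[Sequence[float]], float]] = {
    "AVG": mean,
    "MIN": min,
    "MAX": max,
}


def validate_threshold(unit: str, value: float) -> None:
    """Reject thresholds outside 0-100 for percentage metrics."""
    if unit == "Percentage (%)" and not 0 <= value <= 100:
        raise ValueError(f"percentage threshold must be 0-100, got {value}")


def is_breached(samples: Sequence[float], measure: str, threshold: float) -> bool:
    """Aggregate the samples with the chosen measure and compare (assumed '>')."""
    aggregated = MEASURES[measure](samples)
    return aggregated > threshold


# Hypothetical CPU Utilization samples collected over one evaluation window.
samples = [70.0, 85.0, 97.0]
validate_threshold("Percentage (%)", 90.0)

results = {name: is_breached(samples, name, 90.0) for name in MEASURES}
print(results)
```

With these samples only the MAX measure crosses 90, which illustrates why the choice of measure matters: AVG smooths out short spikes, while MAX reacts to a single high reading.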
