ONES Rule Engine

⚡ Overview

In data center operations, maintaining reliability and uptime requires more than just monitoring — it demands proactive detection and rapid response.

A Rule Engine plays a critical role by continuously tracking key performance metrics and triggering alerts when thresholds are breached. This ensures that operators can:

  ✅ Identify anomalies early before they escalate.
  ✅ Respond quickly to potential risks or outages.
  ✅ Safeguard critical components & services from failure.
  ✅ Automate escalation via integrations (Slack, Zendesk, ServiceNow).

💡 In essence, Rule Engine alerts act as the first line of defense, keeping the data center environment stable, resilient, and secure.

Device Based Rules
  • CPU and Memory Utilization

  • CPU Core Temperature Alerts

  • Fan and PSU LED status

  • SSD Memory Utilization, Health and Temperature Status

  • Traffic Bandwidth

  • ASIC Routes (IPv4 and IPv6)

  • Device and Docker Down alerts

  • BGP Neighbor Down Alert

  • Component failure

  • Interface Flap Alerts

  • Traffic Errors and Discard Counters

  • PFC Counters

  • Device Queue Counters

  • Config Change alert

  • Docker CPU and Memory Utilization (per service)

  • IP Change Alerts

  • Failed Fans and PSUs

  • FAN Speed

  • MCLAG (Member/Peer/Session) Down Alerts

  • NTP Drift

  • PSU Temperature Alerts

  • Unhealthy Devices Alerts

Interface Based Rules
  • Broadcast/Multicast Storm

  • PFC Rx/Tx Counters

  • Port Flap

  • Queue Transmit Counters

  • Traffic InDiscards

  • Traffic InErrors

  • Traffic OutDiscards

  • Traffic OutErrors

  • Traffic Rx/Tx Utilization

  • Transceiver Rx/Tx Power

  • Transceiver Temperature

  • Transceiver Voltage

Server/GPU Based Rules
  • CPU Core Temperature

  • CPU Utilization

  • DISK Health

  • DISK Temperature

  • DISK Used Memory Percent

  • Device Down

  • Docker Down

  • GPU Memory Utilization

  • GPU PSU 1 Power Draw

  • GPU PSU 2 Power Draw

  • GPU Temperature

  • GPU Utilization

  • Memory Utilization

Rule Engine alerts support efficient resource utilization, timely troubleshooting, early detection of potential issues, and overall operational stability within the data center environment.

Notification

ONES-App can send alerts for breached thresholds to:

  • Slack Channel

  • Zendesk Support

  • ServiceNow
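As an illustration of what a Slack notification for a breached threshold might look like, the sketch below builds a message body and posts it to a Slack incoming webhook. It is not the ONES-App integration itself: the function names, the message wording, and the device name `leaf-01` are assumptions made for this example. The only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def build_slack_payload(device: str, metric: str, value: float, threshold: float) -> dict:
    """Build a Slack incoming-webhook message body for a breached threshold."""
    # Message wording is an assumption, not the format ONES-App emits.
    text = f"[ALERT] {device}: {metric} is {value:g}, threshold {threshold:g}"
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON to the webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# "leaf-01" is a hypothetical device name used only for illustration.
payload = build_slack_payload("leaf-01", "CPU Utilization", 93.5, 80)
print(json.dumps(payload))
```

Calling `post_to_slack` requires a webhook URL created in your own Slack workspace; the payload builder can be exercised without any network access.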

Rules are categorized by metric hierarchy:

  1. Device Level

  2. Interface Level

The Rule Engine supports the following metrics, listed with the unit, the available measures, and the values a user can set:

| Hierarchy | Metric                           | Unit           | Measure        | Value              |
|-----------|----------------------------------|----------------|----------------|--------------------|
| Device    | CPU Utilization                  | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | Memory Utilization               | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | Failed Fans                      | Count          | MIN/MAX        | Count              |
| Device    | Failed PSU                       | Count          | MIN/MAX        | Count              |
| Device    | CPU Core Temperature             | Celsius (°C)   | AVG/MIN/MAX    | Celsius            |
| Device    | PSU Temperature                  | Celsius (°C)   | AVG/MIN/MAX    | Celsius            |
| Device    | FAN Speed                        | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | ASIC IPv4 Routes Utilization     | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | ASIC IPv6 Routes Utilization     | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | BGP Neighbors Operationally Down | Count          | AVG/MIN/MAX    | Count of neighbors |
| Device    | FRR Container CPU Utilization    | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | Syncd Container CPU Utilization  | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Device    | Device Down                      | NA             | NA             | NA                 |
| Device    | Queue Counter                    | Count          | AVG/MIN/MAX    | Count              |
| Device    | SSD Health                       | Percentage (%) | Percentage (%) | 0 to 100           |
| Device    | SSD Temperature                  | Celsius (°C)   | AVG/MIN/MAX    | Celsius            |
| Device    | SSD Memory                       | Percentage (%) | Percentage (%) | 0 to 100           |
| Interface | Interface Flap                   | NA             | NA             | NA                 |
| Interface | PFC Counters                     | Count          | AVG/MIN/MAX    | Count              |
| Interface | Queue Counters                   | Count          | AVG/MIN/MAX    | Count              |
| Interface | TX Utilization                   | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Interface | RX Utilization                   | Percentage (%) | AVG/MIN/MAX    | 0 to 100           |
| Interface | In Errors                        | Count          | AVG/MIN/MAX    | User defined       |
| Interface | Out Errors                       | Count          | AVG/MIN/MAX    | User defined       |
| Interface | In Discards                      | Count          | AVG/MIN/MAX    | User defined       |
| Interface | Out Discards                     | Count          | AVG/MIN/MAX    | User defined       |
| Interface | Transceiver Tx Power             | dBm            | AVG/MIN/MAX    | User defined       |
| Interface | Transceiver Rx Power             | dBm            | AVG/MIN/MAX    | User defined       |
| Interface | Transceiver Temperature          | Celsius (°C)   | AVG/MIN/MAX    | User defined       |
| Interface | Transceiver Voltage              | Volts (V)      | AVG/MIN/MAX    | User defined       |
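To make the Measure and Value columns concrete, here is a minimal sketch of how a threshold rule could be evaluated against a window of metric samples. It is an illustration under stated assumptions, not the ONES implementation: the `Rule` class, its field names, the `is_breached` helper, and the "greater than" comparison are all hypothetical. Some metrics, such as SSD Health or transceiver Rx power, would more naturally alert when a value drops below a threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence

# Aggregations matching the Measure column (AVG/MIN/MAX).
MEASURES: Dict[str, Callable[[Sequence[float]], float]] = {
    "AVG": mean,
    "MIN": min,
    "MAX": max,
}


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape for illustration; not the ONES rule schema."""
    hierarchy: str            # "Device" or "Interface"
    metric: str               # e.g. "CPU Utilization"
    measure: str              # "AVG", "MIN", or "MAX"
    threshold: float          # expressed in the metric's unit
    percentage: bool = False  # True for metrics whose value range is 0 to 100

    def __post_init__(self) -> None:
        if self.measure not in MEASURES:
            raise ValueError(f"unsupported measure: {self.measure}")
        if self.percentage and not 0 <= self.threshold <= 100:
            raise ValueError("percentage thresholds must be within 0 to 100")


def is_breached(rule: Rule, samples: Sequence[float]) -> bool:
    """Return True when the aggregated samples exceed the rule threshold."""
    if not samples:
        return False  # assumption: no data means no alert
    return MEASURES[rule.measure](samples) > rule.threshold


cpu_rule = Rule("Device", "CPU Utilization", "AVG", 80.0, percentage=True)
print(is_breached(cpu_rule, [70.0, 85.0, 95.0]))  # the average (about 83.3) is above 80
```

The choice of measure changes the outcome for the same samples: with `MIN` instead of `AVG`, the rule above would not fire, because the lowest sample (70) is under the threshold.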
