ONES Rule Engine

Overview

In data centre operations, a rule engine that raises alerts on key metrics is essential for proactive monitoring and management of critical components and services. The ONES rule engine can raise alerts for the following conditions in a data centre environment:

  • ASIC IPv4 Routes

  • ASIC IPv6 Routes

  • BGP Neighbours Down

  • CPU Core Temperature

  • CPU Utilization

  • DISK Health

  • DISK Temperature

  • DISK Used Memory Percent

  • Device Config Change

  • Device Down

  • Device Queue Transmit Counter

  • Docker CPU Utilization

  • Docker Down

  • Docker MEM Utilization

  • Dynamic IP Change

  • Dynamic IP Change With Only Conflicts

  • FAN Speed

  • Failed FANs

  • Failed PSUs

  • MGLAG Member Link Down

  • MGLAG Peer Link Down

  • MGLAG Session Down

  • Memory Utilization

  • PSU Temperature

  • Unhealthy Devices

Rule engine alerts ensure efficient resource utilization, timely troubleshooting, early detection of potential issues, and overall operational stability within the data centre environment.

Notification

When a rule's threshold is breached, ONES-App can send a notification to:

  • Slack Channel

  • Zendesk Support

  • ServiceNow
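As a rough illustration of what a breach notification could look like, the Python sketch below formats an alert message and wraps it in the JSON body that a Slack incoming webhook accepts (a single "text" field). The Breach class, its field names, and the sample device name are hypothetical; this is not the payload ONES-App actually sends, and Zendesk or ServiceNow would each need their own request shape.

```python
import json
from dataclasses import dataclass


@dataclass
class Breach:
    """A threshold breach to report (hypothetical structure, not the ONES schema)."""
    device: str
    metric: str
    observed: float
    threshold: float
    unit: str


def format_breach(breach: Breach) -> str:
    """Render a one-line, human-readable alert message."""
    return (
        f"[ALERT] {breach.device}: {breach.metric} is "
        f"{breach.observed:g}{breach.unit} "
        f"(threshold {breach.threshold:g}{breach.unit})"
    )


def slack_payload(breach: Breach) -> str:
    """Build a JSON body with the "text" field used by Slack incoming webhooks."""
    return json.dumps({"text": format_breach(breach)})


# Made-up sample data for illustration only.
breach = Breach(device="leaf-01", metric="CPU Utilization",
                observed=93.5, threshold=90.0, unit="%")
print(slack_payload(breach))
```

The sketch only builds the message body; posting it to a webhook URL is left out so the example stays free of network calls and credentials.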

Rules are categorised by metric hierarchy:

  1. Device Based

  2. Interface Based

  3. GPU Server Based
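One way to picture the three categories is as an enumeration that each rule is tagged with. The Python sketch below groups a handful of metric names by hierarchy. The Hierarchy enum and the group_by_hierarchy helper are illustrative names, not part of ONES, and the metric list is a small sample rather than the full set.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Hierarchy(Enum):
    """The three rule categories (enum itself is a hypothetical modelling choice)."""
    DEVICE = "Device"
    INTERFACE = "Interface"
    GPU_SERVER = "Server"


# A small sample of (hierarchy, metric) pairs; not the complete metric list.
RULES: List[Tuple[Hierarchy, str]] = [
    (Hierarchy.DEVICE, "CPU Utilization"),
    (Hierarchy.INTERFACE, "TX Utilization"),
    (Hierarchy.GPU_SERVER, "GPU Temperature"),
    (Hierarchy.DEVICE, "Failed Fans"),
]


def group_by_hierarchy(
    rules: Iterable[Tuple[Hierarchy, str]],
) -> Dict[Hierarchy, List[str]]:
    """Collect metric names under the hierarchy they belong to, keeping input order."""
    grouped: Dict[Hierarchy, List[str]] = defaultdict(list)
    for hierarchy, metric in rules:
        grouped[hierarchy].append(metric)
    return dict(grouped)


for hierarchy, metrics in group_by_hierarchy(RULES).items():
    print(hierarchy.value, metrics)
```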

The following table lists every metric the rule engine supports, along with the unit, the measure, and the threshold value a user can set for each.

| Hierarchy | Metrics                         | Unit           | Measure        | Value               |
|-----------|---------------------------------|----------------|----------------|---------------------|
| Device    | CPU Utilization                 | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | Memory Utilization              | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | Failed Fans                     | Count          | MIN/MAX        | Count               |
| Device    | Failed PSU                      | Count          | MIN/MAX        | Count               |
| Device    | CPU Core Temperature            | Celsius (°C)   | AVG/MIN/MAX    | Celsius             |
| Device    | PSU Temperature                 | Celsius (°C)   | AVG/MIN/MAX    | Celsius             |
| Device    | FAN Speed                       | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | ASIC IPv4 Routes Utilization    | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | ASIC IPv6 Routes Utilization    | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | BGP Nbrs Operationally Down     | Count          | AVG/MIN/MAX    | Count of neighbours |
| Device    | FRR Container CPU Utilization   | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | Syncd Container CPU Utilization | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Device    | Device Down                     | NA             | NA             | NA                  |
| Device    | Queue Counter                   | Count          | AVG/MIN/MAX    | Count               |
| Device    | DISK Health                     | Percentage (%) | Percentage (%) | 0-100               |
| Device    | DISK Temperature                | Celsius (°C)   | AVG/MIN/MAX    | Celsius             |
| Device    | DISK Memory                     | Percentage (%) | Percentage (%) | 0-100               |
| Device    | Docker CPU Utilization          | Percentage (%) | Percentage (%) | 0-100               |
| Device    | Docker Memory Utilization       | Percentage (%) | Percentage (%) | 0-100               |
| Device    | Docker Down                     | NA             | NA             | NA                  |
| Device    | Device IP Change                | NA             | NA             | NA                  |
| Device    | Device IP Change with Conflict  | NA             | NA             | NA                  |
| Device    | Unhealthy Device                | NA             | NA             | NA                  |
| Interface | Int Flap                        | NA             | NA             | NA                  |
| Interface | PFC Counters                    | Count          | AVG/MIN/MAX    | Count               |
| Interface | Queue Transmit Counters         | Count          | AVG/MIN/MAX    | Count               |
| Interface | TX Utilization                  | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Interface | RX Utilization                  | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Interface | In Errors                       | Count          | AVG/MIN/MAX    | User defined        |
| Interface | Out Errors                      | Count          | AVG/MIN/MAX    | User defined        |
| Interface | In Discards                     | Count          | AVG/MIN/MAX    | User defined        |
| Interface | Out Discards                    | Count          | AVG/MIN/MAX    | User defined        |
| Interface | Transceiver TX Power            | dBm            | AVG/MIN/MAX    | User defined        |
| Interface | Transceiver RX Power            | dBm            | AVG/MIN/MAX    | User defined        |
| Interface | Transceiver Temperature         | Celsius (°C)   | AVG/MIN/MAX    | User defined        |
| Interface | Transceiver Voltage             | Volts (V)      | AVG/MIN/MAX    | User defined        |
| Server    | CPU Core Temperature            | Celsius (°C)   | AVG/MIN/MAX    | User defined        |
| Server    | CPU Utilization                 | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Server    | DISK Health                     | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Server    | DISK Temperature                | Celsius (°C)   | AVG/MIN/MAX    | User defined        |
| Server    | DISK Used Memory %              | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Server    | Device Down                     | NA             | NA             | NA                  |
| Server    | Docker Down                     | NA             | NA             | NA                  |
| Server    | GPU Memory Utilization          | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Server    | GPU PSU-1 Power Draw            | Watts (W)      | AVG/MIN/MAX    | User defined        |
| Server    | GPU PSU-2 Power Draw            | Watts (W)      | AVG/MIN/MAX    | User defined        |
| Server    | GPU Temperature                 | Celsius (°C)   | AVG/MIN/MAX    | User defined        |
| Server    | GPU Utilization                 | Percentage (%) | AVG/MIN/MAX    | 0-100               |
| Server    | Memory Utilization              | Percentage (%) | AVG/MIN/MAX    | 0-100               |
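To make the Measure and Value columns concrete, here is a minimal Python sketch of how a rule might be checked against collected samples: the measure (AVG, MIN, or MAX) reduces a window of samples to a single number, which is then compared with the user-set threshold. The Rule class, its field names, the "greater than" comparison, and the sample values are all assumptions made for illustration; the actual ONES evaluation logic is not described here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence

# Aggregations corresponding to the AVG/MIN/MAX entries in the Measure column.
MEASURES: Dict[str, Callable[[Sequence[float]], float]] = {
    "AVG": mean,
    "MIN": min,
    "MAX": max,
}


@dataclass
class Rule:
    """Hypothetical rule shape; field names are illustrative, not the ONES schema."""
    hierarchy: str    # "Device", "Interface", or "Server"
    metric: str       # e.g. "CPU Utilization"
    measure: str      # "AVG", "MIN", or "MAX"
    threshold: float  # expressed in the metric's unit


def is_breached(rule: Rule, samples: Sequence[float]) -> bool:
    """Return True when the aggregated samples exceed the rule's threshold.

    Assumes a breach means "aggregated value > threshold".
    """
    if not samples:
        return False  # no data in the window, so nothing to compare
    aggregated = MEASURES[rule.measure](samples)
    return aggregated > rule.threshold


rule = Rule("Device", "CPU Utilization", "AVG", 80.0)
window = [72.0, 85.0, 91.0]  # made-up utilization samples, in percent
print(is_breached(rule, window))
```

With the same window, an AVG rule at 80 is breached because the average is about 82.7, while a MAX rule at 95 is not, since the highest sample is 91.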
