A
Access Logs
Logs that capture all login, logout, and permission-based activity performed by users across the HEAL platform.
Account
An account is a HEAL tag that ties all the applications, services, and instances together. Each enterprise or organization is treated as an account.
Advance DS
Internal “Advance Data Service” that periodically pulls policy and configuration metadata for the Machine-Learning Engine (MLE) from the Percona MySQL cluster.
AIOps Platform
A big-data & ML system that ingests observability signals and automates event correlation, root-cause isolation, and remediation.
Alert Management
A feature that lets users define custom alert thresholds, notification rules, and escalation policies.
Anomaly
A fully qualified anomaly created when a metric value violates the NOR or SOR thresholds.
Anomaly Score
A severity indicator between 0 and 1 that quantifies the abnormality of an event.
Apdex Score
A user-centric performance index from 0 to 1 that classifies response times as Satisfied, Tolerating, or Frustrated.
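As an illustration, the standard Apdex formula counts every Satisfied sample fully and every Tolerating sample as half, then divides by the total number of samples. The sketch below assumes the usual convention that a response is Satisfied at or below a target threshold T and Tolerating up to 4T. The function name and sample values are hypothetical, and this is not necessarily how HEAL computes the score internally.

```python
def apdex(response_times, threshold):
    """Return the Apdex score (0..1) for response times given in seconds.

    Assumed convention: Satisfied <= T, Tolerating <= 4T, Frustrated > 4T.
    """
    if not response_times:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated samples contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical samples: 2 Satisfied, 2 Tolerating, 1 Frustrated at T = 0.5 s.
samples = [0.2, 0.4, 0.9, 1.5, 3.0]
score = apdex(samples, threshold=0.5)  # (2 + 2/2) / 5 = 0.6
```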
Application
Application is a virtual concept, like a tag, that groups a set of services.
Auto Discovery
Automatically discovers hosts, components, connections, and their attributes to populate the Service Dependency Map.
Auto-Instrumentation
Automatic injection of tracing hooks or metrics into code at runtime or build time (e.g., OpenTelemetry auto-instrumentation).
Automated Anomaly Detection
Advanced algorithms that automatically detect unusual patterns or behaviors in the system, alerting teams to potential issues before they escalate.
Automated Root Cause Analysis (RCA)
Automatically identifies the root cause of issues by correlating events and analyzing causal relationships.
Availability Metric
A binary health metric that indicates whether an instance, process, port, or connection is Up (1) or Down (0), usually derived from heartbeat or ping checks.
B
Batch Incident
A type of Incident specifically tied to long-running or scheduled jobs, raised when anomalies occur in batch job execution.
Batch Job Alerts
Alerts generated when scheduled or background batch jobs behave abnormally or fail.
Behavior KPI
Key performance indicators about how the monitored instances behave or function.
Behavior Metrics
Metrics that reflect how an instance behaves operationally (e.g., CPU, memory usage).
C
Calls
Discrete client-to-server requests (HTTP, gRPC, JDBC, AMQP, etc.) exchanged between services or from users. In HEAL each call can be captured as a span within an end-to-end trace and rolled up into Transaction KPIs.
Capacity Forecasting
Predicts future infrastructure requirements and choke points using regression models and historical usage data.
Category
A tag that groups KPIs with a similar impact on the system’s performance.
Causal Graph
Directed graph built by HEAL’s AIOps engine where nodes are events/metrics and edges represent statistically inferred cause-effect relationships. Traversing the graph surfaces the most probable root cause for an incident.
Cluster
Clusters are groups of instances in a service. A cluster can be created for hosts and for components.
Cluster (K8s)
The full Kubernetes control-plane and all worker nodes that run containers. In HEAL this maps one-to-one with a monitored Service Dependency Map domain.
Cluster-Level Metrics
Aggregated metric data presented at the cluster level, providing a high-level view of service health.
Component
A software element that runs on a host server (e.g., Apache Tomcat, WebLogic, WebSphere, Oracle).
Component Instance
A single running copy of a software component (e.g., one JVM, one Oracle DB process) deployed on a host and monitored independently for KPIs.
Component Instance Attributes
Attributes that are required to collect the KPI data of the component instance.
Component Monitoring
Provides detailed metrics from Server, JVM, Host, and Application components.
Comprehensive Dashboards
Intuitive, user-friendly interfaces that present complex data in an accessible format, allowing teams to quickly assess system health and performance metrics.
Computed Metric
A metric that is derived from other metrics using custom formulas, often used for availability.
ConfigMap / Secret
Key-value objects that inject non-confidential (ConfigMap) or sensitive (Secret) config into pods.
Configuration Watch
A feature that monitors configuration changes (e.g., file changes, system properties) and raises an Info Incident when deviations occur.
Consul
HashiCorp Consul cluster used by HEAL for distributed service discovery, health checking, and a strongly-consistent key/value store for dynamic configuration.
Control Center UI
Administrative workspace within the HEAL UI where authorised users: 1) register Accounts/Applications, 2) start/stop agents and connectors, 3) define NOR/SOR thresholds and maintenance windows, 4) manage alert policies and notification channels, and 5) set RBAC roles. All changes are versioned and stored centrally in the HEAL configuration database.
Custom Metric Framework
A framework for collecting custom metrics using scripts (e.g., shell, PowerShell), enabling support for non-standard environments.
Custom Reports & Dashboards
Allows users to create tailored views and exportable reports for operational insights.
Customizable Alerts
Flexible alerting mechanisms that can be tailored to the specific needs of the business.
D
DaemonSet
A K8s object that ensures a copy of a pod runs on every node, or on a selected subset of nodes (common for log or metric collectors).
Data Center (DC)
A physical facility that houses HEAL’s servers, storage, and networking hardware for data collection and processing.
Data Sources
A data source is the name given to a connection from an application to a database, such as UI-Service to DB or CC to DB. The name is commonly used when writing queries against the database.
Database Server
Engine (e.g., Oracle, MySQL, Percona) that stores and serves structured data via SQL. Monitored in HEAL as a Component with KPIs like query latency, cache hit %.
Deep Dive Data
Deep dive is comprehensive performance data (code level, SQL level). This data can be collected only by instrumenting a component.
Dependency
A logical entity that represents how one service depends on another.
Deployment
A K8s object that maintains a specified number of pod replicas and supports rolling updates.
Disaster Recovery (DR)
A backup setup mirroring the production system to ensure failover and high availability.
Docker
OCI-compliant container runtime and image format that packages applications + dependencies into immutable layers. HEAL can both monitor Dockerised workloads and ship its own micro-services as Docker images.
E
Early Warning
A proactive incident raised when HEAL detects related anomalies that signal emerging risk, even though entry-point transactions are not yet impacted.
End User Monitoring
Captures performance and errors from actual end-user sessions in browsers or mobile apps.
Event Correlation
Combines related or duplicate alerts into a single actionable incident to reduce alert fatigue.
Event Ranking
Prioritizes events based on relevance scores to highlight the most critical issues affecting metrics and services.
Events
Discrete state changes or alerts (e.g., deployment, pod restart, threshold breach) that provide context around metric behaviour. In HEAL, related events are auto-correlated into Incidents.
F
Forensic Action Grouping
Forensic actions are grouped based on instance and metrics category to avoid duplicate collection within a time window.
Forensic Plugin
Collects forensic data (see Forensics).
Forensics
Just-in-time diagnostic data collected when an event occurs, used for root cause analysis and troubleshooting.
G
Grafana Dashboard (Custom Dashboards)
A Grafana instance bundled with HEAL, pre-wired to the HEAL time-series and trace APIs. It lets SREs create ad-hoc or business-specific visualisations, mixed with third-party data sources, without altering the core HEAL UI.
Group Metric
A collection of related Metrics grouped together for consolidated monitoring and visualization.
H
HAPROXY
HAProxy (High Availability Proxy) is open-source proxy and load-balancing software. It provides high availability at the network (TCP) and application (HTTP/S) layers, improving speed and performance by distributing the workload across multiple servers.
HEAL App Server
Middleware (e.g., WildFly, WebLogic, Tomcat) that hosts business logic, manages threads, pools, and JEE services, often running inside a JVM.
HEAL Connectors
Interfaces used to pull telemetry data from third-party systems like AWS CloudWatch, Azure AppInsights, Prometheus, and Elasticsearch.
HEAL Control Center
Used to add and manage accounts.
HEAL DB Server
Stores configuration data, metrics, incidents, events, command jobs, and forensic records.
HEAL UI (Out-of-Box Dashboard)
The default, single-page web interface shipped with HEAL. It presents pre-built dashboards (health overviews, service maps, incident timelines) that visualise KPIs, traces, logs, and AI insights without any additional setup.
HEAL UI Service
The service used to manage the HEAL UI.
HEAL Web Server
Software (e.g., Nginx, Apache HTTPD) that handles HTTP/S requests, serves static content, and forwards dynamic calls to application servers.
Heat Map
A visual representation of service health and metrics across instances, using color coding to show anomalies or normal states.
Host
A representation of the actual or physical server on which an operating system runs and component instances are deployed (e.g., AIX, Linux, Windows Server).
Host Instance
A host instance is the representation of a single instance of the host on which one or more component instances are deployed.
Host Instance Attributes
Attributes that are required to collect the KPI data of the Host instance.
HTTP (HyperText Transfer Protocol)
A protocol used for transmitting data over the web in unencrypted form.
HTTPS (HyperText Transfer Protocol Secure)
A secure version of HTTP that encrypts data using SSL/TLS for safe communication over the internet.
I
Incident
A collection of temporally related events that together indicate abnormal service behaviour. Incidents are classified as Early Warning, Problem, or Info.
Incident Status
The lifecycle state of an incident, typically Open, Closed, or Upgraded.
Incident Timeline
A chronological view of all events and affected services associated with an incident, helping users track the evolution of issues.
Info Incident
A non-critical incident generated for informational metrics (e.g., configuration changes) that do not impact system performance directly.
Ingress
A K8s object that routes external HTTP/S traffic to cluster services, often with TLS termination.
Instance-Level Metrics
Detailed metric data for a specific instance within a cluster, used for fine-grained analysis.
Instances
An instance is a single running copy of a service or application component that operates independently and is monitored separately.
Integration Capabilities
Seamless compatibility with a wide range of technologies and platforms, enabling a unified view of all data sources for more effective monitoring and analysis.
Intrusive Monitoring
Monitors methods and tracks exceptions, errors, and slowness in code execution.
IP (Internet Protocol)
The network-layer protocol used to address and route packets between devices. An IP address is a unique numerical address assigned to each device connected to a network that identifies it for communication.
J
JIM
Monitors Java applications running on a JVM. It runs only on the target server.
JMX
JMX (Java Management Extensions) is the Java technology used to manage and monitor Java applications, systems, and networks.
K
KairosDB
A time-series database layer used to fetch time-series data from Cassandra.
Key Performance Indicator
A KPI, or Metric, is an attribute with a numeric or binary value that describes Behavior or Workload.
Keycloak
Identity and access management service used to provide single sign-on (SSO).
Kubernetes and OpenShift Monitoring
Specialized monitoring for Kubernetes and OpenShift environments.
L
Log File
A file that records either events that occur while an operating system or other software runs, or messages exchanged between users of communication software.
Log Forwarder
Collects transaction data from log files. It runs only on the target server.
Log Monitoring
Monitors and alerts on specific pre-configured error patterns in log files.
Logs
Timestamped text records emitted by applications, OSs, and middleware. HEAL parses them for error patterns, transactions, and forensic details.
M
Maintenance Window
A predefined period during which services are marked as under maintenance, preventing alerts or incidents from being raised.
Metric
A measurable value that indicates the performance, behavior, or availability of a service or component.
Metric Pod
A HEAL UI widget that shows a single metric’s current value, threshold bands, and recent trend for rapid visual assessment.
Metrics
Numeric time-series measurements collected at regular intervals (e.g., CPU %, JVM heap MB, Transactions Per Second). In HEAL they feed ML models to establish NOR/SOR bands.
ML Batch Job Monitoring
Monitors long-running batch jobs using machine learning to detect anomalies, delays, and cascading failures.
ML Insights
AI-generated insights showing key metrics and patterns leading to an incident or anomaly, aiding in root cause analysis.
MLE
The Machine Learning Engine (MLE) applies machine learning, a branch of artificial intelligence (AI) that lets software applications predict outcomes more accurately without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values. In HEAL, the MLE generates NORs.
N
Namespace
A logical partition that isolates resources (pods, services, secrets) within a cluster—useful for multi-tenant dashboards.
Node
A single VM or bare-metal host in the cluster. Node-level KPIs include CPU %, Memory %, Disk IO, and Kubelet health.
Noise Reduction
ML or rules that suppress duplicates and low-value alerts before they reach humans.
Nomad
Used for job orchestration. HEAL services are deployed through Nomad files.
Nomad File
A configuration file used to deploy HEAL services (like Notification Processor, Supervisor).
Normal Operating Range (NOR)
A dynamic threshold band generated by HEAL’s ML engine that defines healthy values for a KPI. Values outside the band are flagged as anomalous.
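To make the idea of a threshold band concrete, the sketch below derives a band from recent history as the mean plus or minus k standard deviations and flags values outside it. This is only an assumed, simplified stand-in: the glossary does not say how HEAL's ML engine actually computes a NOR, and the function names, the value of k, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def nor_band(history, k=3.0):
    """Return a (low, high) band around the historical mean.

    Assumption: mean +/- k * sample standard deviation. HEAL's real NOR
    model is not described in this glossary and is likely more elaborate.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def is_anomalous(value, band):
    """A value is flagged when it falls outside the band."""
    low, high = band
    return value < low or value > high


# Hypothetical CPU % history for one KPI.
band = nor_band([50, 52, 48, 51, 49])
flagged = is_anomalous(60, band)      # well above the band
not_flagged = is_anomalous(51, band)  # inside the band
```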
Notification Plugin
Sends SMS and Email notifications.
O
OOB (Out Of Box) Component
A component with built-in functionality that comes directly from the vendor and works immediately when the product is placed in service.
OOB Reports
Predefined reports covering transaction trends, metric thresholds, server utilization, and alert analysis.
OpenSearch
OpenSearch is a community-driven search and analytics suite that simplifies data ingest, search, visualization, and analysis.
Operator
Kubernetes controller that encodes domain knowledge to automate the lifecycle (install, upgrade, backup) of complex apps.
Organization
Representation of an organization, which may contain one or more applications.
OTLP (OpenTelemetry Protocol)
gRPC/HTTP protocol used by OpenTelemetry collectors and SDKs to export metrics, logs, and traces.
P
PERCONA
Percona is an open-source distribution of the MySQL database management system. HEAL uses it to store configuration and master data.
Performance Metric
Represents the performance information of the component and host instances. The values can be numerical values of different types.
Persistence
The minimum count of consecutive threshold breaches that must occur before HEAL generates an event, preventing alerts on transient spikes.
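The persistence rule can be sketched as a streak counter that resets whenever a value returns within the threshold. The example assumes a simple upper threshold and fires one event at the moment the streak first reaches the persistence count. The function name and sample values are hypothetical.

```python
def events_with_persistence(values, upper_threshold, persistence):
    """Return the indexes at which an event would be generated.

    An event fires only when `persistence` consecutive values have
    breached the threshold; a single in-range value resets the streak.
    """
    fired = []
    streak = 0
    for index, value in enumerate(values):
        if value > upper_threshold:
            streak += 1
            if streak == persistence:
                fired.append(index)
        else:
            streak = 0
    return fired


# The lone spike at index 0 is ignored; the sustained breach fires at index 4.
cpu = [95, 70, 92, 93, 94, 96, 60]
fired = events_with_persistence(cpu, upper_threshold=90, persistence=3)
```

With a persistence of 1, the same data would fire at indexes 0 and 2, which is the transient-spike behaviour the setting is meant to avoid.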
Persistent Volume (PV)
Cluster-provisioned storage resource.
Persistent Volume Claim (PVC)
A pod’s request for a PV with a specific size and access mode.
Pipeline
Stream-processing topology that ingests raw metrics/logs/traces, performs real-time aggregation, and persists roll-ups into the time-series store.
Pod
The smallest deployable unit—a group of one or more containers sharing network + storage. HEAL associates Sidecar agents or OpenTelemetry collectors here.
Port
A logical endpoint used to distinguish different types of network services running on the same IP address.
Predictive Insights
Utilization of AI and ML to forecast potential future issues based on historical data, allowing for proactive measures to prevent downtime.
Problem
A critical incident raised when abnormal service behaviour is actively impacting end-user transactions or service availability.
Producer
A producer is a script, procedure, or method used to collect KPI data via the Component Agent.
Producers
Sources of telemetry such as applications, agents, or services that generate and send metrics, logs, and traces into HEAL.
Q
Query APIs
APIs used to extract details of HEAL entities.
R
RABBITMQ
RabbitMQ is message-queueing software, also known as a message broker or queue manager. Simply put, it is software in which queues are defined and to which applications connect to transfer messages. It implements the AMQP protocol.
RAN
A static HTTP web server used to store and download Docker images.
Real-time Monitoring
Continuous tracking of application and infrastructure performance to identify issues as they occur.
Recommendations Validated
Recommended solutions that have been validated by logic at the backend.
Request
A request is a specific technical call (e.g., a URL call) received by a service from an external actor or another service.
Root cause
The primary underlying fault (e.g., “DB CPU starvation by runaway query order_history”) that triggers downstream anomalies; identified by correlating Causal Graph, Forensics, and Topology data.
Root Cause Analysis
The act of analyzing related “incidents” to get to the source of a “problem.”
S
Safe Operating Range (SOR)
A static range defined manually to detect metric threshold violations.
Sampling Rate
Percentage of traces kept and sent to the backend (e.g., “10 % probabilistic sampling”).
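One common way to apply a probabilistic sampling rate is to hash the trace ID into a fixed number of buckets and keep the trace when its bucket falls below the rate. The decision is then deterministic per trace, so every span of a trace is kept or dropped together. The sketch below illustrates that general technique; it is not HEAL's or OpenTelemetry's actual sampler, and the function name and bucket count are assumptions.

```python
import zlib

BUCKETS = 10_000  # assumed resolution: 0.01 % steps


def keep_trace(trace_id: str, sampling_rate: float) -> bool:
    """Decide whether to keep a trace, given a rate between 0.0 and 1.0."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # The same trace ID always maps to the same bucket.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % BUCKETS
    return bucket < sampling_rate * BUCKETS


# Hypothetical trace ID; the answer is stable across calls and services.
decision = keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
```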
Scalable Architecture
Designed to support environments of any size, from small deployments to large-scale, distributed systems.
Scheduler
The HEAL component responsible for scheduling periodic tasks such as data ingestion, report generation, and alert evaluations.
Service
Logical entity representing multiple homogeneous component instances running on one or more host instances.
Service Account
Identity used by pods to access the API-server or other resources securely.
Service Dependency Map (SDM)
The auto-generated topology view that visualises services and their inter-dependencies, including call flow and real-time health status.
Service Mesh (Istio / Envoy)
Sidecar-based data plane plus control plane that provides uniform traffic management, security, and built-in telemetry.
Services
Logical groupings of one or more instances representing a distinct application function or microservice in HEAL.
Sidecar
An auxiliary container in the same pod that intercepts traffic or collects telemetry (e.g., HEAL proxy, Fluentd).
Simple Network Management Protocol (SNMP)
An application-layer protocol used for monitoring and managing network devices such as routers, switches, servers, and printers by enabling communication between these devices and a central network management system.
SNMP Monitoring
Allows monitoring of network and hardware devices using SNMP.
Solution Recommendation Validation
Allows users to validate or provide feedback on recommended solutions to improve future suggestions.
Solution Recommendations
Automatically generated, ranked remediation steps based on historical fixes and expert rules; users can validate or rate their effectiveness.
Solutions Recommended
Solutions recommended for the root cause when a Problem or Early Warning is raised.
Span
The atomic unit of work in a trace that records operation name, start/end time, and attributes.
StatefulSet
Manages ordered, persistent pods (e.g., databases) ensuring stable hostnames and storage.
Static Operating Range
A Static Operating Range is a fixed value combined with a condition (such as “less than,” “greater than,” or “not between”) defined by the Administrator user in HEAL Control Center. KPIs for which an SOR is set are expected not to breach the set value.
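A check against such a static range can be sketched as a lookup from condition name to comparison. The sketch assumes that the condition describes when a violation occurs (for example, "greater than" 90 means any value above 90 is a violation); the glossary does not state which direction HEAL uses. The function name and limits are hypothetical.

```python
# Assumption: each condition is true when the value VIOLATES the range.
CONDITIONS = {
    "less than": lambda value, limit: value < limit,
    "greater than": lambda value, limit: value > limit,
    "not between": lambda value, low, high: not (low <= value <= high),
}


def violates_sor(value, condition, *limits):
    """Return True when `value` breaches the static range."""
    try:
        check = CONDITIONS[condition]
    except KeyError:
        raise ValueError(f"unsupported condition: {condition!r}") from None
    return check(value, *limits)


# Hypothetical KPI values and limits.
cpu_breach = violates_sor(95, "greater than", 90)       # True
heap_ok = violates_sor(50, "not between", 20, 80)       # False
```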
Supervisor
Executes the forensic scripts. It runs only on the target server.
Supervisor Controller
Collects forensic actions.
Suppression
A noise-reduction mechanism that temporarily withholds new events for repeated, similar violations after the first occurrence, preventing alert storms.
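A minimal sketch of this behaviour follows: the first violation for a given key is emitted, and further violations with the same key are withheld until a suppression window has elapsed. The class name, the (instance, KPI) key, and the fixed time window are assumptions for illustration and are not taken from HEAL's implementation.

```python
class Suppressor:
    """Withhold repeated events for the same key inside a time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._last_emitted = {}  # key -> timestamp of last emitted event

    def should_emit(self, key, timestamp):
        last = self._last_emitted.get(key)
        if last is not None and timestamp - last < self.window:
            return False  # similar violation, still inside the window
        self._last_emitted[key] = timestamp
        return True


# Hypothetical 5-minute window, keyed by (instance, KPI).
suppressor = Suppressor(window_seconds=300)
first = suppressor.should_emit(("db-1", "cpu"), timestamp=0)     # emitted
repeat = suppressor.should_emit(("db-1", "cpu"), timestamp=60)   # withheld
later = suppressor.should_emit(("db-1", "cpu"), timestamp=300)   # emitted again
```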
Synthetic Monitoring
Scheduled scripts or “robots” that emulate user journeys from multiple locations to measure uptime & latency.
T
Threshold
A point beyond which a metric is considered to have changed state. When a workload or behavior metric breaches a threshold set by the MLE or SOR mechanisms even once, the breach is termed a violation. A threshold can be static or dynamic.
Tickets
A ticket is an alert raised by an alerting or ticketing tool (e.g., ServiceNow, SolarWinds).
Topology
A real-time map of interdependencies and communication paths among services and instances within HEAL.
Trace ID / Parent–Child Context
Unique IDs that tie spans together and propagate across service boundaries.
Traces
End-to-end request flows broken into timing spans across services. Traces reveal latency hotspots and are stitched into the Service Dependency Map.
Transaction
Metrics that represent the volume and performance of application transactions (e.g., request count, response time).
Transaction Monitoring
Monitors all transactions end-to-end, with the ability to mark specific transactions for detailed monitoring.
U
Unified Ticketing Plugin (UTP)
Consolidates HEAL Incidents into ITSM tickets (ServiceNow, BMC, Remedy). Ensures one ticket per incident, and maintains bi-directional status sync.
Unknown Service
Services that are identified but not monitored, and that send requests to or receive requests from a service under monitoring, are known as ‘Unknown Services.’
User ID
A unique identifier that tracks actions or configurations applied by a specific user across HEAL components.
User Roles
Roles defined through Role-Based Access Control (RBAC) and used in HEAL to manage what users can view, configure, or modify.
User Session Timeout
A predefined period of user inactivity after which the session is automatically terminated for security purposes.
UTP (Unified Ticket Plugin)
A module which consolidates and manages incident tickets across multiple monitoring tools.
W
WildFly
WildFly is a Java application server: an extension to the JVM that provides a complete runtime environment connecting the database on one end to the web client on the other.
Workload KPI
Key Performance Indicators that describe the load served by an instance in terms of request-response pairs, including the status and response time of each request.
Z
Zulu
Azul Zulu is an open-source build of OpenJDK (JRE/JDK) that can be used to run JAR files.