Course Outline

Introduction to AIOps

  • What is AIOps and why it matters
  • Traditional monitoring vs. AIOps-driven observability
  • AIOps architecture and key components

Collecting and Normalizing Operational Data

  • Types of observability data: metrics, logs, and traces
  • Ingesting data from multiple sources (servers, containers, cloud)
  • Using agents and exporters (Prometheus, Beats, Fluentd)

Data Correlation and Anomaly Detection

  • Time series correlation and statistical methods
  • Using ML models for anomaly detection
  • Detecting incidents across distributed systems

Alerting and Noise Reduction

  • Designing intelligent alert rules and thresholds
  • Suppression, deduplication, and alert grouping
  • Integrating with Alertmanager, Slack, PagerDuty, or Opsgenie

Root Cause Analysis and Visualization

  • Using dashboards to visualize metrics and detect trends
  • Exploring events and timelines for RCA
  • Tracing issues across layers with distributed tracing tools

Automation and Remediation

  • Triggering automated scripts or workflows from incidents
  • Integrating with ITSM systems (ServiceNow, Jira)
  • Use cases: self-healing, scaling, traffic rerouting

Open Source and Commercial AIOps Platforms

  • Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace
  • Evaluation criteria for selecting an AIOps platform
  • Demo and hands-on exercises with a selected stack

Summary and Next Steps

Requirements

  • An understanding of IT operations and system monitoring concepts
  • Experience with monitoring tools or dashboards
  • Familiarity with basic log and metric formats

Audience

  • Operations teams responsible for infrastructure and applications
  • Site Reliability Engineers (SREs)
  • IT monitoring and observability teams

Duration

  • 14 hours
