Updated on 2023-11-28 GMT+08:00

Overview

Application Operations Management (AOM) is a one-stop, multi-dimensional O&M management platform for cloud applications. It monitors applications and related cloud resources in real time, and collects and associates resource metrics, logs, and events to analyze application health status. It also supports alarm reporting and data visualization, helping you detect faults promptly and track the running status of applications, resources, and services.

Specifically, AOM monitors and uniformly manages servers, storage devices, networks, web containers, and applications hosted in Docker and Kubernetes, which helps prevent problems, speeds up fault locating, and reduces O&M costs. Unlike traditional monitoring systems, AOM monitors services by application. It meets enterprises' requirements for high efficiency and fast iteration, provides effective IT support for their services, and protects and optimizes their IT assets, enabling enterprises to achieve strategic goals.

Console Description

Table 1 AOM console description

Category

Description

Overview

Both the O&M overview and dashboard are provided.

  • O&M

    The O&M page supports full-link, multi-layer, and one-stop O&M for resources, applications, and user experience.

  • Dashboard

    A dashboard displays different graphs, such as line graphs and digit graphs, on the same screen so that you can view monitoring data comprehensively.

Alarm center

The alarm center displays the alarm list, event list, alarm rules, and notification rules.

  • Alarm list

    An alarm is reported when AOM or an external service is abnormal or is in a state that may cause exceptions. Handle alarms promptly. Otherwise, service exceptions may occur.

    The alarm list displays the alarms generated within a specified time range.

  • Event list

    Events carry important information about changes in AOM or an external service. Such changes do not necessarily cause exceptions.

    The event list displays the events generated within a specified time range.

  • Alarm rules

    By setting alarm rules, you can define event conditions for services or threshold conditions for resource metrics. If the resource data of a service meets the event condition, an event alarm is generated. If the metric data of a resource meets the threshold condition, a threshold alarm is generated. If no metric data is reported, an insufficient data event is generated. In this way, you can discover and handle exceptions as early as possible.

  • Alarm notification

    AOM supports alarm notification. You can create notification rules and alarm action rules, and configure alarm noise reduction. When alarms are reported due to an exception in AOM or an external service, alarm information can be sent to specified personnel by email or Short Message Service (SMS) message. In this way, they can rectify faults in time to avoid service loss.
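
The threshold behavior described under alarm rules (a threshold alarm when metric data meets the threshold condition, an insufficient data event when no metric data is reported) can be sketched as follows. This is a minimal illustration under stated assumptions, not AOM's implementation: the `ThresholdRule` class, its fields, the `evaluate` function, the metric name, and the returned strings are all hypothetical names chosen for this example.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of comparison operators for a threshold condition.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class ThresholdRule:
    metric: str       # illustrative metric name, e.g. "cpu_usage"
    op: str           # one of the keys in _OPS
    threshold: float  # value the metric is compared against


def evaluate(rule: ThresholdRule, value: Optional[float]) -> str:
    """Classify one reported metric value against a threshold rule."""
    if value is None:
        # No metric data was reported for this evaluation.
        return "insufficient_data_event"
    if _OPS[rule.op](value, rule.threshold):
        # The metric data meets the threshold condition.
        return "threshold_alarm"
    return "normal"


rule = ThresholdRule(metric="cpu_usage", op=">", threshold=80.0)
print(evaluate(rule, 91.5))
print(evaluate(rule, None))
```

Passing `None` stands in for a period in which no data arrived. How AOM aggregates several samples within one period is not covered by this sketch.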

Monitoring

Functions such as application monitoring, component monitoring, host monitoring, container monitoring, and metric monitoring are provided.

  • Application monitoring

    An application is a group of identical or similar components divided based on service requirements. AOM supports monitoring by application.

  • Component monitoring

    Components refer to the services that you deploy, including containers and common processes.

    The Component Monitoring page displays information such as type, CPU usage, memory usage, and status of each component. AOM supports drill-down from components to instances, and then to containers, enabling multi-dimensional monitoring.

  • Host monitoring

    The Host Monitoring page enables you to monitor common system devices such as disks and file systems. It also shows the resource usage and health status of hosts and of the service processes or instances running on them.

  • Container monitoring

    For container monitoring, only workloads deployed using Cloud Container Engine (CCE) and applications created using ServiceStage are monitored.

  • Metric monitoring

    The Metric Monitoring page displays metric data of each resource. You can monitor metric values and trends in real time, add desired metrics to dashboards, create threshold rules, and export monitoring reports. In this way, you can monitor services and analyze data in real time.

  • Cloud service monitoring

    The Cloud Service Monitoring page displays historical performance curves of each cloud service instance. You can view cloud service data of the last six months.
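
The relationship between applications, components, instances, and containers described above can be pictured as a simple hierarchy. The sketch below is only an illustration of that drill-down path. The class names, field names, and the `drill_down` function are hypothetical and do not correspond to any AOM API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str


@dataclass
class Instance:
    name: str
    containers: List[Container] = field(default_factory=list)


@dataclass
class Component:
    name: str
    instances: List[Instance] = field(default_factory=list)


@dataclass
class Application:
    # An application groups identical or similar components.
    name: str
    components: List[Component] = field(default_factory=list)


def drill_down(app: Application) -> List[str]:
    """Return one application/component/instance/container path per container."""
    return [
        f"{app.name}/{comp.name}/{inst.name}/{cont.name}"
        for comp in app.components
        for inst in comp.instances
        for cont in inst.containers
    ]


shop = Application("shop", [
    Component("cart", [Instance("cart-0", [Container("cart-main")])]),
])
print(drill_down(shop))
```

Components that are common processes rather than containers would have instances with an empty `containers` list in this model, so they produce no container-level path.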

Log

Functions such as log search, log file, log dump, and path configuration are provided.

  • Log search

    AOM enables you to quickly query logs, and locate faults based on log sources and contexts.

  • Log files

    You can quickly view log files of component instances to locate faults.

  • Log dumps

    AOM enables you to dump logs to Object Storage Service (OBS) buckets for long-term storage.

  • Path configuration

    AOM can collect and display container and VM logs. Here, a VM is an Elastic Cloud Server (ECS) or a Bare Metal Server (BMS) running Linux. Before collecting logs, ensure that you have configured a log collection path.

  • Log buckets

    A log bucket is a logical group of log files. You can dump log files, create statistical rules, and view logs by log bucket.

  • Statistical rules

    A statistical rule takes effect by log bucket. You can configure keywords in statistical rules. Then, AOM periodically counts the number of such keywords in log buckets and generates log metrics.

  • Log structuring

    In log structuring, original logs can be separated by regular expressions or special characters so that structured logs can be queried and analyzed based on the SQL syntax.

  • Accessing LTS

    By adding access rules, you can map logs of CCE, Cloud Container Instance (CCI), or custom clusters in AOM to Log Tank Service (LTS). Then you can view and analyze logs on LTS. Mapping itself does not generate extra fees, but duplicate mappings do.
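
Two of the log functions above lend themselves to a short illustration: log structuring (splitting raw logs with a regular expression) and statistical rules (counting configured keywords to produce log metrics). The sketch below shows the general idea using Python's standard `re` module. The log layout, the field names, and the `structure` and `count_keywords` functions are assumptions made for this example and are not AOM's rule syntax.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Assumed raw log layout: "<date> <time> <LEVEL> <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)"
)


def structure(line: str) -> Optional[Dict[str, str]]:
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.fullmatch(line)
    return match.groupdict() if match else None


def count_keywords(lines: Iterable[str], keywords: List[str]) -> Counter:
    """Count how many times each keyword occurs across the given raw log lines."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for line in lines:
        for keyword in keywords:
            counts[keyword] += line.count(keyword)
    return counts


logs = [
    "2023-11-28 10:00:00 ERROR disk full",
    "2023-11-28 10:00:01 INFO service started",
]
print(structure(logs[0]))
print(count_keywords(logs, ["ERROR"]))
```

Once lines are split into named fields, they can be filtered or aggregated by field, which is the purpose of structuring before SQL-style analysis. The keyword counter would be run periodically over the lines of one log bucket to yield a metric value.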

Configuration management

Functions such as ICAgent management, application discovery, and log configuration are provided.

  • ICAgent management

    ICAgent collects metrics, logs, and application performance data in real time. For hosts purchased from the Elastic Cloud Server (ECS) or Bare Metal Server (BMS) console, you need to manually install the ICAgent. For hosts purchased from the CCE console, the ICAgent is automatically installed.

  • Data subscription

    AOM allows you to subscribe to metrics or alarms. After the subscription, data can be forwarded to custom Kafka or Distributed Message Service (DMS) topics for you to retrieve.

  • Application discovery

    AOM can discover applications and collect their metrics based on configured rules.

  • Log configuration

    Log quotas and delimiters can be configured.

  • Quota configuration

    Earlier metrics will be deleted when the metric quota is exceeded.

    You can change the metric quota by switching between the basic edition and pay-per-use edition. In the basic edition, limited functions are provided for free.

  • Metric configuration

    You can enable the metric collection function to collect metrics (excluding SLA and custom metrics).
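
The quota behavior described above (earlier metrics are deleted when the metric quota is exceeded) is an oldest-first eviction policy. The sketch below illustrates that policy in general terms only. It assumes the quota counts distinct metric names, which the text does not state, and the `MetricStore` class and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import List


class MetricStore:
    """Holds at most `quota` metrics and deletes the earliest ones first."""

    def __init__(self, quota: int) -> None:
        if quota < 1:
            raise ValueError("quota must be at least 1")
        self.quota = quota
        # Insertion order doubles as age order: the first key is the earliest metric.
        self._metrics: "OrderedDict[str, float]" = OrderedDict()

    def report(self, name: str, value: float) -> List[str]:
        """Store a metric value and return the names of any metrics deleted."""
        self._metrics[name] = value  # updating an existing name keeps its position
        evicted = []
        while len(self._metrics) > self.quota:
            oldest, _ = self._metrics.popitem(last=False)
            evicted.append(oldest)
        return evicted

    def names(self) -> List[str]:
        return list(self._metrics)


store = MetricStore(quota=2)
store.report("cpu_usage", 12.0)
store.report("mem_usage", 40.0)
print(store.report("disk_usage", 71.0))
```

A larger quota, as obtained by switching editions, simply delays the point at which the earliest metrics start to be deleted.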

Process for Using AOM

The following figure shows the process of using AOM.

Figure 1 Process of using AOM
  1. (Mandatory) Subscribe to AOM.
  2. (Optional) Create IAM users and set permissions.
  3. (Mandatory) Purchase a cloud host.
  4. (Mandatory) Install the ICAgent.

    ICAgent collects metric, log, and application performance data in real time.

    If a cloud host is purchased through CCE, ICAgent is automatically installed on it.

  5. (Optional) Configure an application discovery rule.

    Applications that match the built-in application discovery rules are discovered automatically after the ICAgent is installed. For applications that the built-in rules cannot discover, create a custom application discovery rule.

  6. (Optional) Configure a log collection path.

    To use AOM to monitor host logs, configure a log collection path first.

  7. (Optional) Implement O&M.

    Use AOM functions such as Monitoring Overview, Alarm Management, Resource Monitoring, and Log Management to perform routine O&M.
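
The steps above can be summarized as a small checklist that reports which mandatory steps are still outstanding. This is a sketch for planning purposes only. The `Step` class, the `missing_mandatory` function, and the `host_from_cce` flag are hypothetical names, and nothing here calls AOM.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Step:
    number: int
    title: str
    mandatory: bool


# Steps and their mandatory/optional status as listed in the process above.
STEPS = [
    Step(1, "Subscribe to AOM", True),
    Step(2, "Create IAM users and set permissions", False),
    Step(3, "Purchase a cloud host", True),
    Step(4, "Install the ICAgent", True),
    Step(5, "Configure an application discovery rule", False),
    Step(6, "Configure a log collection path", False),
    Step(7, "Implement O&M", False),
]


def missing_mandatory(done: Iterable[int], host_from_cce: bool = False) -> List[str]:
    """Return the titles of mandatory steps that have not been completed."""
    completed = set(done)
    if host_from_cce and 3 in completed:
        # A host purchased through CCE has the ICAgent installed automatically.
        completed.add(4)
    return [s.title for s in STEPS if s.mandatory and s.number not in completed]


print(missing_mandatory({1, 3}))
print(missing_mandatory({1, 3}, host_from_cce=True))
```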