
Monitoring at AppNexus

Overview

At AppNexus, we monitor the following parts of our physical infrastructure and core internals:

  • Physical Servers
  • Switches / Routers
  • GSLB
  • Local Load Balancing
  • CDN
  • AppNexus URLs
  • Databases

We do not monitor customers' applications running within instances, but we do monitor discrepancies between our database records of instance state and the actual state of each instance.  For monitoring, we use Nagios, with AlertSite as an external tool.  On each critical event, Nagios and AlertSite trigger the pagers of the SysOps members on duty.  Non-critical events (e.g., high load on a physical server for a minute) are reported by email.
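
The routing rule above (critical events page the on-duty SysOps, non-critical events go to email) can be sketched as follows.  This is only an illustration: the Severity enum, the route_event function, and the channel names are hypothetical and are not part of Nagios or AlertSite.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical two-level severity used only for this illustration."""
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


def route_event(severity: Severity) -> str:
    """Return the notification channel for an event of the given severity."""
    # Critical events page the on-duty SysOps; everything else goes to email.
    return "pager" if severity is Severity.CRITICAL else "email"
```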

Duty Schedule

Members of SysOps are on duty at all times to fill requests and monitor infrastructure.

Physical Servers

We monitor all critical server hardware metrics.  In the case of any HDD, memory, power supply, or similar issue, SysOps is immediately paged.  After investigating the issue, they decide on further hardware maintenance.  In the case of an extremely critical issue, SysOps sends an appropriate notification to the customer, suggesting immediate migration to another server.  Otherwise, regular maintenance (RMA) is scheduled, and we notify customers about it 7–10 days or more in advance.

Services

On any critical service issue, SysOps receives an alert and starts an investigation immediately.  Such issues include, but are not limited to:

  • A server goes off-line
  • A disk has failed in a storage unit
  • A host is unavailable or flapping
  • Load is critical on a server
  • An instance stops responding to ping
  • Critical disk or volume issues are detected
  • Instances are failing to launch or are taking an extremely long time to launch

URLs

AppNexus monitors the following URL resources:

If issues are detected, SysOps is alerted.

Core Internals

We use Nagios to monitor the health and load status of all important AppNexus infrastructure.  This includes, but isn't limited to:

  • Our API
  • Databases
  • Local Load Balancers
  • Puppet  

Pagers of the SysOps members on duty are triggered in case of problems with these components.

Nagios

Nagios is an open-source, enterprise-class monitoring system.  Nagios can perform checks for various services (SMTP, POP3, HTTP, NNTP, PING), as well as resource checks (CPU load, disk usage).

Checks are divided into active and passive checks.  Active checks are performed in two ways:
1) On the Nagios box, by plugins such as check_ping, check_dns, check_ssh, and check_https.
2) On monitored hosts, via the NRPE daemon.

NRPE stands for Nagios Remote Plugin Executor.  On the AppNexus side, it runs tests such as check_nrpe_disk, check_nrpe_users, check_nrpe_load, check_nrpe_swap, check_nrpe_exp_memory, check_nrpe_lvm, and many others.  When a check fails, an alert is sent to SysOps.
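
To illustrate what a plugin run by Nagios or NRPE looks like, here is a minimal sketch of a disk-usage check in Python.  It follows the standard Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).  It is not the actual AppNexus check_nrpe_disk plugin; the function names and the 80%/90% thresholds are assumptions made for the example.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a disk-usage percentage to a Nagios status code.

    The default thresholds are assumptions for this sketch.
    """
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/"):
    """Run the check and return (exit code, one-line status message)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_pct = 100.0 * usage.used / usage.total
    status = classify_usage(used_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"DISK {label} - {used_pct:.1f}% used on {path}"


def main(argv) -> int:
    """Plugin entry point: print the status line and return the exit code."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code

# When installed as a plugin, the script would end with: sys.exit(main(sys.argv))
```

Nagios reads the exit code to determine the service state and displays the printed line as the check output.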

Service checks that are performed and submitted to Nagios by external applications are called passive checks.  (More info on passive checks can be found here: http://nagios.sourceforge.net/docs/3_0/passivechecks.html).  The snmptrapd daemon routes SNMP traps to Nagios using passive checks.  Networking gear (F5, PDU, Core Switches) and NAS units are monitored via SNMP using passive checks.
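
As a rough sketch of how an external application submits a passive check result, the snippet below builds a PROCESS_SERVICE_CHECK_RESULT line in the Nagios external command format and writes it to the Nagios external command file.  The function names, host name, and service description are hypothetical, and the command file path depends on the local Nagios configuration (the command_file setting), so it is passed in rather than assumed.

```python
import time
from typing import Optional


def format_passive_result(host: str, service: str, return_code: int,
                          output: str, timestamp: Optional[int] = None) -> str:
    """Build one external command line that reports a passive service check."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN)")
    ts = int(time.time()) if timestamp is None else timestamp
    # Format: [time] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


def submit_passive_result(line: str, command_file: str) -> None:
    """Write the command line to the Nagios external command file (a named pipe).

    command_file is site-specific; use the path configured for your Nagios install.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, format_passive_result("nas01", "Disk Health", 2, "Disk failed in storage unit") produces a line reporting a CRITICAL state for a hypothetical service "Disk Health" on a hypothetical host "nas01".  Both names must match a host and service already defined in Nagios for the result to be accepted.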

More info can be found on the Nagios homepage: http://www.nagios.org/.