Alarms

Dashboards are useful tools, but for production use cases we want alarms triggered automatically if our metrics cross certain thresholds. That way we can alert our operations team, open tickets in our operations systems, or even start an automated response in some cases.

Amazon CloudWatch alarms

As a quick review, Amazon CloudWatch alarms trigger based on individual metrics or combinations of metrics. Alarms send notifications to SNS topics. The SNS topic notifies recipients via email, SMS, or an HTTP endpoint. You can also have the topic send the notification to an Amazon SQS queue or invoke an AWS Lambda function for additional automation. For example, a Lambda function could initiate a scaling action. (Amazon SQS is a fully managed message queuing service, and AWS Lambda is a serverless computing service.)
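As a rough illustration of the Lambda path, the sketch below unpacks an alarm notification that arrives through SNS. The field names AlarmName, NewStateValue, and NewStateReason follow the JSON that CloudWatch publishes to SNS; the function name summarize_alarm and the choice to only print and return a summary are assumptions made for this example.

```python
import json


def summarize_alarm(event):
    """Extract alarm details from an SNS-triggered Lambda event.

    CloudWatch publishes the alarm as a JSON string in the SNS "Message" field.
    """
    summaries = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        summaries.append({
            "alarm": message.get("AlarmName"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
        })
    return summaries


def lambda_handler(event, context):
    # Hypothetical handler: a real one might open a ticket or
    # start a scaling action here instead of just printing.
    summaries = summarize_alarm(event)
    for item in summaries:
        print(f"{item['alarm']} is {item['state']}: {item['reason']}")
    return summaries
```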

Setting thresholds

We can set an alarm trigger threshold for any Amazon CloudWatch metric. The right thresholds depend on your use case. For example, a read-heavy workload may see consistently high load on the read replicas and a lightly used primary node. Write-heavy workloads would see the opposite.

Here are a few suggestions for initial alarm thresholds, but be sure to adjust these thresholds based on your application.

  • CPUUtilization over 80%
  • DatabaseConnectionsMax over 90% of the limit
  • DatabaseCursorsMax over 90% of the limit
  • VolumeBytesUsed over 85% of the 64 TiB limit
  • NetworkThroughput over 90% of the sustainable limit
  • DBClusterReplicaLagMaximum consistently over 10 seconds
  • Buffer cache hit ratio under 90%
  • Index buffer cache hit ratio under 90%
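
Several of these suggestions are percentages of a limit, so the alarm needs the limit converted into an absolute number. The sketch below does that arithmetic. The connection, cursor, and network limits depend on your instance class and are passed in as inputs rather than hard-coded. The metric names BufferCacheHitRatio and IndexBufferCacheHitRatio, and the assumption that replica lag is reported in milliseconds, are not stated above and should be checked against the metrics shown in your CloudWatch console.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def percent_of(limit, percent):
    """Return `percent` percent of `limit`, using integer arithmetic."""
    return limit * percent // 100


def initial_thresholds(max_connections, max_cursors, network_limit_bytes_per_sec):
    """Build starting alarm thresholds keyed by metric name.

    The three arguments are instance-specific limits that you look up
    yourself; they are inputs here, not known constants.
    """
    return {
        "CPUUtilization": 80,                                    # percent
        "DatabaseConnectionsMax": percent_of(max_connections, 90),
        "DatabaseCursorsMax": percent_of(max_cursors, 90),
        "VolumeBytesUsed": percent_of(64 * TIB, 85),             # bytes
        "NetworkThroughput": percent_of(network_limit_bytes_per_sec, 90),
        "DBClusterReplicaLagMaximum": 10_000,                    # assumed milliseconds
        "BufferCacheHitRatio": 90,                               # alarm when below
        "IndexBufferCacheHitRatio": 90,                          # alarm when below
    }
```

Treat the result as a starting point only; the values still need tuning for your application.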

Creating your first alarm

Let’s set up one alarm manually in this chapter. We’ll look at automated alarm creation in the next chapter.

First, go to the Amazon CloudWatch console and make sure you are in the correct region. Now go to the Alarms section and click Create Alarm.

Amazon CloudWatch Alarms

Click Select Metric.

Amazon CloudWatch Alarm Metric

Now, just as when we set up a dashboard, select the metric you want to alarm on. Go into the DocDB namespace and select the dimension Cluster metrics by role for the cluster you want to monitor; if you created your cluster in the first chapter, the cluster identifier is getting-started-with-documentdb. In this example we can choose CPUUtilization on the read replicas, as it’s fairly easy to artificially trigger this alarm for testing.

Amazon CloudWatch Alarm Metric Selection

On the next page, we’ll configure the alarm. First we’ll select the metric aggregation and period, which defaults to the average over 5 minutes. Change the period to 1 minute.

Amazon CloudWatch Alarm Configuration

On the second half of the screen, we specify how the alarm triggers. A static value triggers on a specific limit, such as CPU utilization exceeding 80%. An anomaly detection band triggers when the metric leaves its expected range. For this example, let’s set it to trigger when the reader CPU utilization exceeds 3% for a single one-minute period. (3% is an artificially low threshold, used so we can generate an alarm.)

Amazon CloudWatch Alarm Configuration

On the next screen, we determine what happens when the alarm triggers. For the sake of a quick test, choose to have a new SNS topic configured and specify an email address to receive the notification.

Amazon CloudWatch Alarm SNS

Click Create Topic and then Next. Finally, give the alarm a name.

Amazon CloudWatch Alarm Finish

On the final preview page, click Create alarm. Check your email; you will receive an SNS confirmation email in a few minutes. You need to confirm the subscription before you’ll receive any alarm emails.
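
For reference, the console steps above correspond roughly to one call to the CloudWatch put_metric_alarm API. The sketch below builds the same alarm with boto3. It assumes that the "Cluster metrics by role" dimension maps to the dimension names DBClusterIdentifier and Role with the value READER, and that an SNS topic already exists; the alarm name and the topic ARN in the usage note are placeholders, not values from this walkthrough.

```python
def reader_cpu_alarm_params(cluster_id, topic_arn, threshold=3.0):
    """Parameters mirroring the console walkthrough: average reader CPU,
    1-minute period, alarm on a single datapoint above the threshold."""
    return {
        "AlarmName": f"{cluster_id}-reader-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/DocDB",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            # Assumed dimension names for "Cluster metrics by role".
            {"Name": "DBClusterIdentifier", "Value": cluster_id},
            {"Name": "Role", "Value": "READER"},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params, region_name=None):
    """Send the parameters to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    cloudwatch.put_metric_alarm(**params)
```

A call such as create_alarm(reader_cpu_alarm_params("getting-started-with-documentdb", "<your-topic-arn>")) would create the alarm, with the placeholder replaced by the ARN of your own topic.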

Triggering the alarm

In order to see the alarm in action, we’ll connect to the Cloud9 IDE and introduce some read-heavy load. We’ll use the Yahoo! Cloud Serving Benchmark (YCSB), a database benchmarking framework, to run workloads on the database cluster.

Navigate to the AWS Cloud9 management console and choose Open IDE to launch the AWS Cloud9 environment.

Prepare the IDE

We’ll need to resize the EBS volume used for Cloud9. Follow these steps:

  • Download the resize script onto your laptop.
  • Upload this script to Cloud9, or just copy and paste the contents.
  • Rename the script to resize.sh.
  • Make the script executable: chmod +x resize.sh
  • Execute the script: sh resize.sh 40

We also need to upgrade to JDK 8:

  • sudo yum install java-1.8.0-openjdk-devel -y
  • sudo alternatives --config java (select JDK 8)
  • sudo yum remove -y java-1.7.0-openjdk-devel

Prepare the certificate store

Since we are using TLS for the Amazon DocumentDB connection, we have to prepare a Java keystore for YCSB. First, download this script and upload it to Cloud9, or just paste it into an editor in Cloud9. Then run:

mkdir /tmp/certs
chmod +x cert.sh
./cert.sh

Configure YCSB

Now install YCSB.

curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17.0.tar.gz
tar xfvz ycsb-0.17.0.tar.gz
cd ycsb-0.17.0

We need to edit the file bin/ycsb to set the Java keystore properties. Open this file in an editor on Cloud9, and change the ycsb_command on line 331:

ycsb_command = ([java] + args.jvm_args +
                ['-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks',
                 '-Djavax.net.ssl.trustStorePassword=changeit'] +
                ["-cp", classpath,
                 main_classname, "-db", db_classname] + remaining)

Run load test

Run the load tester. We’ll use workloadb, which is a read-heavy workload. In this command, DatabaseUser and DatabasePassword are the values you specified for your Amazon DocumentDB database when you created it. DbEndpoint is available on the console, as described in the Query Cluster section. The mongodb.url property is wrapped in single quotes so that the shell does not interpret the & characters in the URL.

python2 ./bin/ycsb load mongodb -s -P workloads/workloadb -p recordcount=100000 -p 'mongodb.url=mongodb://<DatabaseUser>:<DatabasePassword>@<DbEndpoint>:27017/?ssl=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false' > load.dat
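
If your user name or password contains characters such as @, / or :, the connection URL needs them percent-encoded. The helper below is a small sketch of how the URL in the command above could be assembled safely; the function name docdb_url is an assumption for this example, and the query options are copied from the command.

```python
from urllib.parse import quote_plus


def docdb_url(user, password, endpoint, port=27017):
    """Build the MongoDB connection URL, percent-encoding the credentials."""
    query = (
        "ssl=true&replicaSet=rs0"
        "&readPreference=secondaryPreferred&retryWrites=false"
    )
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{endpoint}:{port}/?{query}"
    )
```

Pass the result to YCSB as a single quoted argument so the shell leaves the & characters alone.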

After several minutes the alarm will trigger and you’ll receive an email. You should either delete the alarm or raise the threshold so you won’t get repeated alarms; 3% is a very low threshold for CPU utilization.
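
Deleting the test alarm can also be scripted. The sketch below uses the CloudWatch delete_alarms call through boto3; the alarm name is whatever you entered when you created the alarm, and the optional cloudwatch argument exists only so the function can be exercised without AWS access.

```python
def delete_test_alarm(alarm_name, cloudwatch=None):
    """Delete one alarm by name so it stops sending notifications."""
    if cloudwatch is None:
        import boto3  # requires boto3 and AWS credentials

        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.delete_alarms(AlarmNames=[alarm_name])
    return alarm_name
```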

Additional alarms

You can add more alarms for other metrics. Beyond alarms with a static threshold on a single metric, you can use anomaly detection to trigger whenever the metric is outside of its normal range. Or, you can set a composite alarm that triggers only when multiple conditions are met. For example, you may want to send an alarm email when both CPU and memory usage are high, but not when only one is high.
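
As a sketch of the CPU-and-memory example, the code below builds the rule expression for a CloudWatch composite alarm that fires only when every named child alarm is in the ALARM state. The child alarm names are hypothetical and must already exist as ordinary metric alarms; the parameters are in the shape accepted by the put_composite_alarm API.

```python
def all_in_alarm(*alarm_names):
    """Rule expression that is true only when every child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_params(name, child_alarm_names, topic_arn):
    """Parameters for a composite alarm over existing metric alarms."""
    return {
        "AlarmName": name,
        "AlarmRule": all_in_alarm(*child_alarm_names),
        "AlarmActions": [topic_arn],
    }


def create_composite_alarm(params):
    """Send the parameters to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    boto3.client("cloudwatch").put_composite_alarm(**params)
```

Keeping the notification action on the composite alarm only, and not on the child alarms, is one way to ensure a single email is sent when both conditions hold.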