SpecialSnowflake

A CloudWatch ingester for custom metrics
"Hey baby, do you need monitoring? Because I could hit your endpoint all day long!"

Intro

This script collects metrics with a bash command that you specify and uploads the results to CloudWatch using the parameters you define. It also creates and manages the alarms defined in each flake file. AWS credentials or an IAM role are required to run it. SpecialSnowflake is a useful tool for advanced custom monitoring of disparate metrics: it can monitor anything from the output of a SQL query to a grep-and-line-count of a log file on a remote server. It is designed to be flexible, supporting any Linux tool as a source of metric data, and it can integrate with nearly any external application, such as PagerDuty or Slack, by attaching SNS endpoints to alarms. It can run inside the included Docker container or natively outside of it, provided the dependencies are available.

Requirements

  • AWS credentials with CloudWatch read/write access
    • configured either under the Linux account that will run the script OR at the AWS instance role level (see the credentials sketch after this list).
  • See the Dockerfile for a list of required packages; add any extra packages your flakes need.
  • Terraform installed on the machine that runs the script (install into "/usr/local/bin/").
    • The Dockerfile and standard Docker environment install this by default.
  • A Dockerfile and related scripts are included in this repo that you can use to test a flake. Run this command to start a test:
    • ./test-a-flake.sh <your-flake-name>
      • This will pull in credentials from infra/scripts/credentials, build the Docker image, and run any flake(s) that match the string you pass to the shell script.
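
For reference, here is one way to satisfy the credentials requirement for the Linux account that will run the script. This is a minimal sketch, not the project's required setup; the key values and region are placeholders, and on EC2 an instance role with CloudWatch read/write access works instead.

# Option 1: export credentials into the environment of the account running the script.
export AWS_ACCESS_KEY_ID="YOUR-ACCESS-KEY-ID"
export AWS_SECRET_ACCESS_KEY="YOUR-SECRET-ACCESS-KEY"
export AWS_DEFAULT_REGION="us-east-1"

# Option 2: persist them with the AWS CLI (writes ~/.aws/credentials and ~/.aws/config).
aws configure set aws_access_key_id "YOUR-ACCESS-KEY-ID"
aws configure set aws_secret_access_key "YOUR-SECRET-ACCESS-KEY"
aws configure set region "us-east-1"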

Operation

The script loops through each file in the flakes directory and sources the variables within. It then runs the flakeCommand string and puts the command's output into CloudWatch using the configuration you specify in the flake file. Errors are logged to an individual log file per flake, named after the flakeName, which can then be ingested by any log retention system such as Logstash.
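
Below is a minimal sketch of what one pass over a single flake might look like, assuming the fields from the example config later in this README and using jq plus the AWS CLI. It is only an illustration of the flow, not the actual implementation; in particular, the logs/ path and the use of flakeName as the metric name are assumptions.

#!/usr/bin/env bash
# Illustrative only: read one flake file, run its command, and push the result to CloudWatch.
flake="flakes/example-perminute.flake.json"

name=$(jq -r '.flakeName' "$flake")
cmd=$(jq -r '.flakeCommand' "$flake")
unit=$(jq -r '.flakeUnit' "$flake")
namespace=$(jq -r '.flakeMetricNamespace' "$flake")
region=$(jq -r '.flakeRegion' "$flake")

# Run the flake command; it must print a single bare number.
value=$(bash -c "$cmd" 2>>"logs/${name}.log")

# Publish the data point to CloudWatch, logging any failure to the flake's log file.
aws cloudwatch put-metric-data \
  --region "$region" \
  --namespace "$namespace" \
  --metric-name "$name" \
  --unit "$unit" \
  --value "$value" \
  || echo "$(date -u) put-metric-data failed for ${name}" >> "logs/${name}.log"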

Tips

  • Ensure that the command run by the flakeCommand variable returns bare data with no quotes or warnings. The output should be a single integer or float.
    • That said, the shell file that flakeCommand points to can perform advanced calculations and ratios between multiple query return values, including calling external scripts such as PHP (see the sketch after this list).
  • You cannot put comments in the JSON files, but you can add extra fields (like '_comment1', '_comment2', etc.) and use those to store notes or important info.
    • You can, of course, add comments to the resource scripts.
  • Use MySQL accounts that are locked down and specific to the use case. Where possible, do not store credentials in the script; instead, create MySQL users that are allowed to connect without a password only from a specific IP address.
  • Snowflake runs as a Docker container in production, so it's possible to test a newly written flake before deploying it into a production environment. Here's how to do it.
    • Get Docker running on your system (boot2docker, Kitematic, docker-machine, etc.).
    • Write your flake file and any associated external scripts, and place them in the folders appropriate to the task.
    • Run this command:
    • ./test-a-flake.sh <name-of-flake>
      • If you omit your flake name, it will just run an example flake. You can run multiple flakes by passing matching text; for example, to run all flakes that begin with "zq-", pass "zq-" as name-of-flake.
    • This builds the container, installs dependencies, and puts everything where it needs to be. It uses the same architecture and environment as production, so it is a good approximation of how your flake will run in prod.
      • It might take a while the first time because the Docker container has to build.
      • This posts real data and real alarms to AWS. Be prepared for alerts if you have defined CloudWatch Alarms and your threshold is breached.
    • If you get an HTTP 200 from CloudWatch, your flake should be good.
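
As an illustration of the calculation tip above, a resource script can combine several query results into one metric as long as it prints a single bare number at the end. The hostname, credential variables, and table below are made-up placeholders; this is a sketch of the pattern, not a script from this repo.

#!/usr/bin/env bash
# Hypothetical example: report the ratio of failed jobs to total jobs as a percentage.
failed=$(mysql -Ns -h mysqlhost.example.com -u"${user1}" -p"${pass1}" \
  -e "SELECT COUNT(*) FROM jobs WHERE status='failed';" 2>/dev/null)
total=$(mysql -Ns -h mysqlhost.example.com -u"${user1}" -p"${pass1}" \
  -e "SELECT COUNT(*) FROM jobs;" 2>/dev/null)

# Guard against division by zero, then print a single bare number for SpecialSnowflake.
if [ -z "$total" ] || [ "$total" -eq 0 ]; then
  echo 0
else
  awk -v f="$failed" -v t="$total" 'BEGIN { printf "%.2f\n", (f / t) * 100 }'
fi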

Default config

See the default config file below, located at flakes/example-perminute.flake.json, for an example of a full configuration.

{
  "flakeName": "example-perminute",
  "flakeCronstring": "* * * * *",
  "flakeCommand": "./res/test3/test3.sh",
  "flakeType": "metric",
  "flakeUnit": "Count",
  "flakeMetricNamespace": "test3/KPIs",
  "flakeRegion": "us-east-1",
  "flakeAlarms": [
    {
      "alarmName": "TestHighThreshold",
      "alarmDescription": "Alarm if the metric is too high",
      "alarmThreshold": "3",
      "alarmOperator": "GreaterThanThreshold",
      "alarmPeriodLength": "60",
      "alarmPeriods": "1",
      "alarmStatistic": "Average",
      "alarmEndpoints": {
        "ok": "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only",
        "alarm": "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only"
      }
    },
    {
      "alarmName": "TestLowThreshold",
      "alarmDescription": "Alarm if the metric is too low",
      "alarmThreshold": "3",
      "alarmOperator": "LessThanThreshold",
      "alarmPeriodLength": "60",
      "alarmPeriods": "1",
      "alarmStatistic": "Average",
      "alarmEndpoints": {
        "ok": "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only",
        "alarm": "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only"
      }
    }
  ]
}
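
For context, each entry in flakeAlarms corresponds to a standard CloudWatch alarm. The AWS CLI call below shows roughly what the first alarm above expresses. SpecialSnowflake manages the alarms for you (see the Terraform requirement), so this is only a sketch of how the fields map onto CloudWatch alarm parameters; the metric name is an assumption.

# Roughly equivalent to the "TestHighThreshold" alarm in the example flake above.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "TestHighThreshold" \
  --alarm-description "Alarm if the metric is too high" \
  --namespace "test3/KPIs" \
  --metric-name "example-perminute" \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 3 \
  --comparison-operator GreaterThanThreshold \
  --unit Count \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only" \
  --ok-actions "arn:aws:sns:us-east-1:123456789012:PagerDuty-IT-Only"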

See below for an example of a resource file containing the script that collects the metric, which in this case would be "./res/test3/test3.sh". It is very important that this script returns only a number, with no errors, extra characters, or formatting.

#!/usr/bin/env bash
# Print the MySQL "Uptime" status value as a bare integer, stripping warnings and non-digits.
mysql -Ns -h mysqlhost.example.com -u${user1} -p${pass1} -e "SHOW STATUS;" 2>&1 | grep -v 'Warning' | grep Uptime | sed 's/[^0-9]*//g' | head -n 1
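
Before wiring a script like this into a flake, it is worth checking by hand that the output really is a single bare number. A quick, hypothetical check might look like this:

# Run the resource script and confirm its output is a single integer or float.
out=$(./res/test3/test3.sh)
if echo "$out" | grep -Eq '^[0-9]+(\.[0-9]+)?$'; then
  echo "OK: got numeric value $out"
else
  echo "Problem: output was '$out'"
fi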

Flake file reference

  • flakeName
    • Name of the flake, must be unique.
  • flakeCronstring
    • Linux crontab-style time string that defines how frequently this flake should run.
  • flakeCommand
    • Shell command you want to run for this flake. It is recommended that this be a separate executable stored under the "res/" directory, to avoid having to escape and stringify special characters due to JSON syntax constraints.
  • flakeType
    • Can be "metric" or "job". A job will not trigger a failure flag on non-numerical output; it will just log it. Job types also do not send data to CloudWatch. The metric type expects a number and will fail if the output is non-numerical.
  • flakeUnit
    • CloudWatch unit for the metric value (for example "Count").
  • flakeMetricNamespace
    • CloudWatch namespace the metric is published under (for example "test3/KPIs"); you can verify the published metric with the sketch after this list.
  • flakeRegion
    • AWS region for CloudWatch reporting.
  • flakeAlarms
    • List of CloudWatch alarms to create and manage for this metric; each entry uses the alarm fields shown in the example config above.
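
Once a flake has run, one way to confirm the metric landed in the expected namespace and region is the AWS CLI's list-metrics call. The namespace and region below come from the example flake; the metric name is an assumption about how the data is keyed.

# List metrics published under the example flake's namespace in its region.
aws cloudwatch list-metrics \
  --region us-east-1 \
  --namespace "test3/KPIs"

# Optionally narrow the listing to a single metric name.
aws cloudwatch list-metrics \
  --region us-east-1 \
  --namespace "test3/KPIs" \
  --metric-name "example-perminute"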

Future Improvements

  • I would like to eventually have each flake reference a Dockerfile to run in, instead of running native scripts. As flakes are added, more and more packages and dependencies get added to the shared environment, which I think will cause problems at scale. This could be achieved via Docker-in-Docker, but I think we would need to translate it to an ECS cluster for load balancing.
  • There is a small memory leak somewhere in the storm.py threading loop that I can't pinpoint. It might be the way I'm threading. I expect Python to garbage collect after each cycle finishes, but maybe it isn't.
