docs/alerts.md
Check to see if certain EC2 instances are unhealthy by listing all the individual EC2 instances matching the blue|green clusters and clicking monitoring.
Recycle the unhealthy instances by terminating them (50% first, then the other 50% after five minutes). Note: Make sure not to terminate steward instances. Please follow these steps to recycle steward instances.
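The two-batch recycle can be planned before anything is terminated. The following is a minimal sketch, not an established team script: `recycle_plan` is a hypothetical helper that takes the unhealthy instance IDs as arguments and only prints the `aws ec2 terminate-instances` commands for review. It assumes the AWS CLI is what you use to terminate, and that steward instances have already been left out of the list.

```shell
# Print a two-batch termination plan for the given instance IDs.
# Nothing is terminated here; the aws commands are only echoed for review.
# Assumption: the caller has already excluded steward instances.
recycle_plan() {
  total=$#
  first=$(( (total + 1) / 2 ))   # first batch takes the extra ID when the count is odd
  i=0
  batch1=""
  batch2=""
  for id in "$@"; do
    i=$((i + 1))
    if [ "$i" -le "$first" ]; then
      batch1="$batch1 $id"
    else
      batch2="$batch2 $id"
    fi
  done
  echo "aws ec2 terminate-instances --instance-ids$batch1"
  if [ -n "$batch2" ]; then
    echo "sleep 300"
    echo "aws ec2 terminate-instances --instance-ids$batch2"
  fi
}
```

For example, `recycle_plan i-aaa i-bbb i-ccc i-ddd` (placeholder IDs) prints one terminate command for the first two instances, a five-minute sleep, and a second terminate command for the remaining two.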
You can look into container statistics with Hashi-UI. Please follow these steps to access it:
$ cd nomad
$ source nomad_proxy_functions.sh
$ proxy_on production
Then navigate to http://localhost:8080/nomad/production/allocations to see containers and their status. If the container has issues with sending requests to Vault (which can be seen in container status), check the CPU/Disk status of steward servers.
Here are the things to look for immediately in the Bugsnags:
ssh into the worker and run sudo rm -rf /tmp/*. You may get errors saying "Operation not permitted"; these can be ignored. Then follow these steps:
grd
cd ..
ls -l
Take note of the directory current links to. Then:
cd releases
ls -l
Remove all the past deploys using rm -rf [name], where the name is NOT the one current links to. If the directory current links to is itself huge, the only option is to make a code change and deploy it to production so that current links to the new deploy, then remove the old directory.
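To lower the risk of deleting the active release by mistake, the comparison against `current` can be scripted. This is a minimal sketch under assumptions: `stale_releases` is a hypothetical helper, and its argument is the directory that contains both `current` and `releases` (the directory you land in after `cd ..` above). It only lists candidates and deletes nothing.

```shell
# List release directories that `current` does NOT point to.
# $1 is the deploy root holding `current` and `releases/` (assumed layout).
stale_releases() {
  root=$1
  # Resolve the symlink and keep only the final path component.
  target=$(readlink "$root/current") || return 1
  active=$(basename "$target")
  # Refuse to continue if the active release could not be determined.
  [ -n "$active" ] || return 1
  for dir in "$root"/releases/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    [ "$name" = "$active" ] || echo "$name"
  done
}
```

Run `stale_releases .` from the deploy root, review the names it prints, and then remove them from inside `releases` with `rm -rf [name]` as described above.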
You will get an alert saying "Fullest disk > 90%". First check New Relic to see which servers are affected. Then ssh into each affected server and run:
sudo rm -rf /tmp/*
After that, run:
sudo find / -type f -size +20M -exec ls -lh {} \;
This lists all files larger than 20 MB. The culprit is likely a log directory that is growing out of control. If not, examine what you are about to remove before you actually remove it. If it is safe to remove, remove the files and/or directories. Check New Relic to ensure the disk usage went down. If it did not, try stopping and starting the instance. HOWEVER, follow the directions here to ensure the site does not go down.
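When the scan returns many files, sorting them by size makes the worst offenders easier to spot. The sketch below is one possible variation on the `find` command above, not the command the alert procedure prescribes: `large_files` is a hypothetical helper, and the `-xdev` flag and `du`/`sort` pipeline are additions that keep the scan on one filesystem and order the results largest first.

```shell
# List regular files under $1 larger than $2, largest first.
# $1 defaults to /, $2 defaults to 20M (a find -size value).
large_files() {
  dir=${1:-/}
  min=${2:-20M}
  # -xdev keeps find on a single filesystem, so other mounts are skipped.
  # du -k prints each file's disk usage in kilobytes for numeric sorting.
  find "$dir" -xdev -type f -size +"$min" -exec du -k {} + 2>/dev/null | sort -rn
}
```

For example, `large_files /var/log` lists oversized files under that directory only. Scanning from `/` generally needs a root shell, since a shell function cannot be invoked through `sudo` directly.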
In our production environment (and generally anywhere) you can use the following commands to check connectivity, SSL, etc.
nc -z -w 5 [host] [port]
where:
-z tells it to try connecting, then tell us :+1: or :-1:
-w 5 only wait 5 seconds
openssl s_client -showcerts -connect [host]:[port]
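The two checks can be wrapped in small functions so that results are easy to read during an incident. This is a sketch under assumptions: `check_port` and `cert_dates` are hypothetical names, and the `-servername` option and the `openssl x509 -noout -dates` step are additions not mentioned above, included to print only the certificate validity window.

```shell
# Report whether a TCP port accepts connections, waiting at most 5 seconds.
check_port() {
  host=$1
  port=$2
  if nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}

# Print the notBefore/notAfter dates of the certificate a TLS endpoint presents.
# $2 defaults to 443.
cert_dates() {
  host=$1
  port=${2:-443}
  # </dev/null closes stdin so s_client exits once the handshake is done.
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -dates
}
```

Usage follows the same pattern as the raw commands: `check_port [host] [port]` for connectivity and `cert_dates [host] [port]` for the certificate dates. `check_port` returns a non-zero status on failure, so it can also be used in a loop over several hosts.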