Each hostname will have a fully qualified domain name.
Each host group name contains the operating system followed by a two-letter, lowercase country code.
The default will be to monitor services every 2 minutes.
When monitoring detects an error, the service will be rechecked 3 times at 30-second intervals before a notification is sent.
If the production service is not critical, the service can be checked less frequently.
Test systems, if important, will be checked every 15 minutes and then rechecked 5 more times at 5-minute intervals before a notification is sent. (Only important test systems will be monitored, and they could be down for 25-45 minutes with no alerts.)
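The check and recheck intervals above determine how long a failure can go unreported. The following Python sketch works through that arithmetic. The class name `CheckSchedule`, its field names, and the simple timing model (a failure is first seen at a regular check, then every recheck must also fail before the notification goes out) are assumptions made for illustration, not part of the Nagios configuration itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckSchedule:
    """Hypothetical container for the intervals described in this document."""
    check_interval_min: float   # minutes between regular checks
    retry_count: int            # rechecks performed before notifying
    retry_interval_min: float   # minutes between rechecks

    def notification_delay_range(self):
        """Return (best, worst) minutes from a failure to its notification.

        Best case: the failure happens just before a regular check, so only
        the recheck time elapses. Worst case: it happens just after a regular
        check, so a full check interval passes before the rechecks begin.
        """
        retry_time = self.retry_count * self.retry_interval_min
        return retry_time, self.check_interval_min + retry_time


# Values taken from the text: production defaults and important test systems.
PRODUCTION = CheckSchedule(check_interval_min=2, retry_count=3, retry_interval_min=0.5)
TEST = CheckSchedule(check_interval_min=15, retry_count=5, retry_interval_min=5)

if __name__ == "__main__":
    for name, schedule in (("production", PRODUCTION), ("test", TEST)):
        best, worst = schedule.notification_delay_range()
        print(f"{name}: notification after {best} to {worst} minutes")
```

Under this simplified model a production service is reported within 1.5 to 3.5 minutes, and an important test system within 25 to 40 minutes, which is close to the 25-45 minute window quoted above. Actual delays depend on how the monitoring server schedules its checks.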
The goal of our monitoring design is to have each system handle as many of its own monitoring checks as possible, reducing the load on the main Nagios monitoring server. This will allow more frequent checks and make the monitoring server more stable.
The 4 status codes that we will standardize on are:
0 = OK: the process ran to completion and is running within acceptable parameters.
1 = Warning: the process did not fail, but it is in a state where some action may be required.
2 = Error: an error occurred with the process and action needs to be taken.
3 = Unknown: something unknown may have happened to the process, and it should be checked.
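As an illustration of how a locally run check could report these four status codes, here is a minimal Python sketch of a disk-usage check. The function names, the 80% and 90% thresholds, and the output format are assumptions chosen for the example and are not defined by this document. Note that Nagios itself labels status 2 as CRITICAL; the sketch keeps the name used here.

```python
import shutil

# The four standardized status codes from this document.
OK, WARNING, ERROR, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", ERROR: "ERROR", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are assumed)."""
    if used_percent is None or not 0 <= used_percent <= 100:
        return UNKNOWN
    if used_percent >= crit:
        return ERROR
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (status, message) for the filesystem containing path."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError):
        # The check itself could not run, so the state is unknown.
        return UNKNOWN, f"could not read disk usage for {path}"
    return classify(used_percent), f"{path} is {used_percent:.1f}% full"


def main(argv):
    """Print one status line and return the code to use as the exit status."""
    path = argv[0] if argv else "/"
    status, message = check_disk(path)
    print(f"DISK {LABELS[status]} - {message}")
    return status

# A real check script would end with: sys.exit(main(sys.argv[1:]))
```

The monitoring server only needs the exit status and the single line of output, so the measurement work stays on the monitored host, in line with the goal of keeping load off the main Nagios server.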