Many tools on the market can monitor and alert on your applications and infrastructure. The growth of containerized platforms has made it quick to set up monitoring and alerting. But we are still in a transition phase: large enterprises continue to run their core systems on legacy and enterprise platforms.
Understanding these platforms:
If you run your app in containers and use container orchestration to meet your goals for resources, availability, and stability, you are running a better system with less manual work. But not every environment looks like that.
Some applications are built on a monolithic architecture. Such apps are complex to scale and containerize. Still, we need to achieve resource utilization, stability, availability, and scale, because a business can only survive and grow with these in place.
Modifying such a complex system is risky. To maintain stability, it is very important to have the best possible monitoring and alerting for both infrastructure and applications. You need to understand which component is causing a problem and what to do in an emergency.
Monitoring with time series data:
Metrics are central to monitoring. Everything from service level agreements to average response time should be exposed as metrics. But what about applications that cannot expose such metrics because of their legacy, complex architecture? The best solution is to look at what we already have and extract more meaningful insights from it. After all, having something in place is better than nothing. Almost every application writes a log file, and these log files are mostly kept on shared storage such as NAS, NFS, or object storage.
Metric example:
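As a rough illustration of getting metrics out of a log file, here is a minimal Python sketch. The log line format, the sample lines, and the names `LINE_RE` and `summarize` are assumptions made up for this example; a real legacy application will need its own pattern.

```python
import re
from collections import Counter

# Hypothetical access-log format, assumed for illustration only:
#   <timestamp> <method> <path> <status> <duration_ms>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<duration_ms>\d+)$"
)

def summarize(lines):
    """Turn raw log lines into simple counters a time series database could store."""
    requests_by_status = Counter()
    total_duration_ms = 0
    unparsed = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            unparsed += 1  # keep count of lines the pattern does not cover
            continue
        requests_by_status[match["status"]] += 1
        total_duration_ms += int(match["duration_ms"])
    return {
        "requests_by_status": dict(requests_by_status),
        "total_duration_ms": total_duration_ms,
        "unparsed_lines": unparsed,
    }

# Made-up sample lines, including one the pattern cannot parse.
SAMPLE_LOG = [
    "2024-01-01T10:00:00Z GET /orders 200 120",
    "2024-01-01T10:00:01Z GET /orders 200 95",
    "2024-01-01T10:00:02Z POST /orders 500 1300",
    "malformed line",
]

metrics = summarize(SAMPLE_LOG)
print(metrics)
```

Counting unparsed lines is a deliberate choice here: on a legacy system the log format is rarely documented, so it helps to know how much of the file the pattern is missing.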
Google mtail is a tool that extracts the internal state of applications from their logs and exports it to a time series database. It works with Graphite and Prometheus, and with these you can also set up your monitoring and alerting.
It is always good to have active and passive setups for the mtail processes and monitoring systems. To reason about the availability of your applications, it is important to keep the whitebox monitoring and alerting itself highly available.
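mtail handles the export itself, so you would not normally write this by hand. Still, it can help to see what a single sample looks like on the wire. The sketch below formats one value as a Graphite plaintext line and as a Prometheus text exposition line. The metric names and helper functions are hypothetical, and label values are not escaped, which a real exporter would have to do.

```python
import time

def to_graphite_line(path, value, timestamp=None):
    """Graphite plaintext protocol: '<metric path> <value> <unix timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}"

def to_prometheus_line(name, value, labels=None):
    """Prometheus text exposition format: 'name{label="value"} <value>'.

    Simplification: label values are written as-is, without escaping.
    """
    if labels:
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"

# Hypothetical metric names for a legacy application.
print(to_graphite_line("legacy.app.requests.500", 1, 1700000000))
print(to_prometheus_line("legacy_app_requests_total", 1, {"status": "500"}))
```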
Don’t depend on averages!
Sometimes averages are just not useful. Consider the following use case:
If 5 out of 100 requests take longer than 1s to serve, your application is performing slowly for some users. But you cannot see this in an average metric, because 95% of requests are served within your SLA.
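To make the arithmetic concrete, here is a small Python sketch with made-up latencies: 95 requests at 0.2s and 5 requests at 3s, against an assumed 1s SLA. The mean comes out at roughly 0.34s and looks healthy, while the share of slow requests and the 99th percentile show the problem. Note that with exactly 5% of requests being slow, a nearest-rank 95th percentile still lands on a fast request, so the sketch also reports the 99th.

```python
import math

SLA_SECONDS = 1.0  # assumed SLA threshold, taken from the ">1s" in the use case

# Hypothetical latencies in seconds: 95 fast requests and 5 slow ones.
latencies = [0.2] * 95 + [3.0] * 5

def nearest_rank_percentile(values, percent):
    """Return the nearest-rank percentile; `percent` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

mean = sum(latencies) / len(latencies)
slow_share = sum(1 for v in latencies if v > SLA_SECONDS) / len(latencies)
p95 = nearest_rank_percentile(latencies, 95)
p99 = nearest_rank_percentile(latencies, 99)

print(f"mean={mean:.2f}s slow_share={slow_share:.0%} p95={p95}s p99={p99}s")
```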
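A quantile query over a latency histogram looks at the tail of the distribution instead of the average. For example, in Prometheus, the 95th percentile of request duration over the last 5 minutes:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))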
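The idea behind a histogram quantile can be sketched in a few lines of Python. This is a simplified illustration of estimating a quantile from cumulative buckets with linear interpolation, not Prometheus's actual implementation; the function name and bucket values are made up. In the real query the buckets hold per-second rates rather than raw counts, but the estimate depends only on their ratios.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with an infinite bound, similar to the `le`
    buckets of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the open-ended bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan

# Hypothetical cumulative counts: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1s.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, example_buckets))
```

The estimate is only as precise as the bucket boundaries, so it is worth placing a bucket boundary at or near your SLA threshold.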
Some considerations for the implementation of modern monitoring and alerting on legacy systems:
1. Avoid changes to the legacy systems and look for something that gives you the exact outcomes you need.
2. Always look for the right data.
3. Get the internal state of the application and infrastructure. Make sure to export metrics for the health of both.
4. Verify your metrics with your legacy monitoring systems.
5. Once metrics are exported and collected, add efficient alerting.
6. Check the status and response of the monitoring system during incidents, maintenance, and infrastructure-related operations.
7. Finally, improve the monitoring day by day, while planning in parallel to move your legacy system to a modern microservice architecture.
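Points 4 and 5 above can be sketched in Python as two small checks. This is only an illustration under assumptions: the function names, the 5% tolerance, and the sample values are hypothetical, and in practice the alerting logic would live in your alerting tool rather than in a script.

```python
def agrees_with_legacy(new_value, legacy_value, tolerance=0.05):
    """True when the new metric is within `tolerance` (relative) of the legacy reading."""
    if legacy_value == 0:
        return new_value == 0
    return abs(new_value - legacy_value) / abs(legacy_value) <= tolerance

def should_alert(samples, threshold, min_consecutive):
    """True if the last `min_consecutive` samples all exceed `threshold`.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])

# Hypothetical readings: request count from the new pipeline vs. the legacy monitor.
print(agrees_with_legacy(102, 100))
# Hypothetical latency samples in seconds, checked against a 1s threshold.
print(should_alert([0.4, 1.2, 1.5, 1.3], threshold=1.0, min_consecutive=3))
```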
I hope this article helps you improve monitoring and alerting for your legacy systems.
Some reference tools:
- Google mtail: https://github.com/google/mtail/
- Prometheus: https://prometheus.io/
- Graphite: https://graphite.readthedocs.io/en/latest/overview.html
- GoAccess: https://goaccess.io/
- Alertmanager: https://prometheus.io/docs/alerting/alertmanager/