SRE (Site Reliability Engineering)
Why do we need SREs?
Reliability (defined below) is paramount to running a service that your customers will want to use. To that end, having engineers whose expertise is helping DevOps teams (or combinations of them) ensure this reliability is a great way to make that expertise available to every team that needs it.
What is reliability?
This will differ based on the service. Sometimes it is the uptime of the service, sometimes it is being able to count on a lack of errors, and so on. Commonly it is a mixture of all of these.
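As a rough illustration of treating reliability as a mixture of uptime and lack of errors, the sketch below computes both for one measurement window and checks them against targets. The names (`WindowStats`, `availability`, `success_rate`, `meets_targets`) and the target values are assumptions made for this example, not figures from any particular service.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical measurements for one reporting window."""
    total_seconds: float
    downtime_seconds: float
    total_requests: int
    failed_requests: int


def availability(stats: WindowStats) -> float:
    """Fraction of the window during which the service was up."""
    return 1.0 - stats.downtime_seconds / stats.total_seconds


def success_rate(stats: WindowStats) -> float:
    """Fraction of requests that did not fail."""
    if stats.total_requests == 0:
        return 1.0  # no traffic means no observed errors
    return 1.0 - stats.failed_requests / stats.total_requests


def meets_targets(stats: WindowStats,
                  availability_target: float = 0.999,
                  success_target: float = 0.995) -> bool:
    """True only if both measures meet their (assumed) targets."""
    return (availability(stats) >= availability_target
            and success_rate(stats) >= success_target)


# A 30-day window with about 21 minutes of downtime and 0.2% failed requests.
month = WindowStats(total_seconds=30 * 24 * 3600,
                    downtime_seconds=1296,
                    total_requests=1_000_000,
                    failed_requests=2_000)
print(availability(month), success_rate(month), meets_targets(month))
```

Which measures matter, and what the targets should be, depends on the service. The point is only that each measure can be written down and checked.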
What do SREs do?
- Reduce toil via automation (the toil itself also has to be captured)
Techniques to improve reliability
- Chaos Engineering - Introduce chaos into production systems, such as with Netflix’s Chaos Monkey/Kong/etc., to ensure the system stays resilient even during failures
- When scaling, both vertically and horizontally, consider all levels of the application flow. For example, if the application scales up, will the services it depends on also be able to scale up?
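The scaling question above can be sketched as a simple capacity check: given an expected request rate for the application, see whether each downstream dependency can absorb the load that rate implies. The function name `find_bottlenecks`, the dependency names, and all numbers are hypothetical and exist only for this illustration.

```python
def find_bottlenecks(expected_rps, dependencies):
    """Return dependencies that cannot keep up with the scaled application.

    dependencies maps a (hypothetical) dependency name to a tuple of
    (calls made per application request, maximum requests per second).
    """
    bottlenecks = []
    for name, (calls_per_request, max_rps) in dependencies.items():
        required_rps = expected_rps * calls_per_request
        if required_rps > max_rps:
            bottlenecks.append((name, required_rps, max_rps))
    return bottlenecks


# Assumed numbers: each application request makes two database calls
# and one cache call.
deps = {
    "db": (2, 1500),
    "cache": (1, 5000),
}

for name, required, limit in find_bottlenecks(1000, deps):
    print(f"{name} needs {required} rps but is limited to {limit} rps")
```

With these assumed figures, scaling the application to 1000 requests per second would overload the database even though the application tier itself scaled without trouble.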
Resources
- Google's SRE Books - As expected, these are good sources of information and practices. Definitely worth a read on a regular basis.
- The Site Reliability Workbook
- Site Reliability Engineering