Site reliability engineering

Updated on Dec 28, 2024

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. Ben Treynor, founder of Google's Site Reliability Team, defines it as "what happens when a software engineer is tasked with what used to be called operations."

History

Site reliability engineering was created at Google around 2003, when Ben Treynor was hired to lead a team of seven software engineers running a production environment. The team was tasked with making Google's sites run smoothly, efficiently, and more reliably. Early on, Google's large-scale systems required the company to develop new paradigms for managing systems at a scale that had never existed before, while continuously introducing new features and maintaining a very high-quality end-user experience. The SRE footprint at Google is now larger than 1,500 engineers. Many products are supported by small to medium-sized SRE teams, though not all products have SREs. The SRE processes honed over the years are being adopted by other, mainly large-scale, companies that are starting to implement this paradigm. Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, and Oracle have all put together SRE teams.

Roles

A site reliability engineer (SRE) will ideally spend up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention. Because the software system an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the rest of their time, at least 50%, on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a coder who also has operational and systems knowledge and likes to whittle down complex tasks.
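
The 50% guideline can be checked with simple arithmetic. The following is a minimal sketch in Python, assuming time is tracked as hours per week; the names `WeekLog`, `ops_fraction`, and `exceeds_ops_cap` are hypothetical and are not taken from any SRE tool or from the source.

```python
from dataclasses import dataclass

# The "up to 50%" figure from the text; treated here as a hard cap (an assumption).
OPS_CAP = 0.5


@dataclass
class WeekLog:
    """Hypothetical record of one engineer's week, in hours."""
    ops_hours: float  # issues, on-call, manual intervention
    dev_hours: float  # new features, scaling, automation


def ops_fraction(log: WeekLog) -> float:
    """Return the share of logged time spent on ops work (0.0 if nothing was logged)."""
    total = log.ops_hours + log.dev_hours
    if total == 0:
        return 0.0
    return log.ops_hours / total


def exceeds_ops_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """Return True when ops work takes more than the allowed share of the week."""
    return ops_fraction(log) > cap


if __name__ == "__main__":
    week = WeekLog(ops_hours=30, dev_hours=10)
    print(f"ops share: {ops_fraction(week):.0%}, over cap: {exceeds_ops_cap(week)}")
```

A week split evenly between the two kinds of work sits exactly at the cap and is not flagged; only a week with more ops than development time is.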

DevOps vs SRE

DevOps is a practice, named by a term coined around 2008, that encompasses automation of manual tasks, continuous integration, and continuous delivery. It applies to a wide range of companies, whereas SRE might be considered a subset of DevOps that requires additional skill sets.

References

Site reliability engineering Wikipedia

(Text) CC BY-SA