When I started this blog, the goal was to make a writeup of any problem I couldn't easily duckduckgo the answer to.
Well, now I have gone quite far down the rabbit hole, as the goal has apparently changed to a problem that I can't find a good answer to in duckduckgo, several books, courses, etc. This time the topic is not technical, but I feel it's all the more important.
To cover enough ground, and avoid a wall of text, I'll split this topic into a few posts with their own themes.
Normal disclaimer: these are my own opinions based on my own experiences.
I don't think there are great resources on how to apply agile methods to run and develop services based on existing (open source) software. We have struggled with it, and in this post I try to write down practices that work for us, mostly taken from Scrum and Site Reliability Engineering (SRE) practices.
Our field is often thought of as "normal sysadmin / Linux admin work", but I don't think that's how the world works any more. I especially think that we should get out of the mindset that installing and configuring the software once at the start is the main part of the work.
This post is just the problem statement. If you are interested in the meat, jump to the next post.
In my career I have mainly worked with IT services. Usually these services are based on open source technologies (but not always, nor exclusively). Running these services reliably at a production level is not trivial. Some examples that I have personally worked with are OpenStack, OpenShift/OKD and HPC clusters.
If we look to the past, services like these have often been run in a quite traditional way, with the view that they are static and belong to traditional system administration rather than being evolving software projects. In most of my discussions, people heartily agree that agile methods would benefit these services, but there seems to be a struggle with how to implement them.
While running our OpenStack platform, we started implementing Scrum practices. We have been evolving our processes for years since, and they finally feel like they are starting to really work. While figuring out how to best organize our work, we have tried looking at different approaches and existing material, but there seems to be little material available.
What I'll try to do in these posts is to share practical experiences on how to:
- Manage production services that are based on existing (probably open source) software with Agile frameworks
- Map service lifecycles to different working methods
- Organize and resource the work in a way that keeps the admins and the organization happy
- Scale beyond one team
Here is a picture of my cat for the cat tax, so this post won't be a wall of text.
Tons and tons have been written about Agile methodologies in IT. Mostly this material, e.g. Scrum, talks about software development (where Agile is easiest to implement - author's opinion).
There are some operational agile approaches - Kanban, ScrumOps, and maybe most importantly Site Reliability Engineering (SRE), started by Google.
These all have great practices, but none of them seem to fit that well.
I think Scrum is a great starting point for a lot of Agile work. It has clear processes to implement. We can see which of them work, and evolve our practices. However, even though Scrum says it's widely applicable outside software development, I disagree. Standard Scrum falls apart quite quickly when the amount of unplanned work grows.
When running a service, our work consists of a lot of development (I'll get back to that in a later post), but also tons of different types of unplanned work. The load of unplanned work is in practice incompatible with many concepts in Scrum.
The timeline for changes in running services is, in my experience, significantly longer than when running Scrum for pure software development. Making changes to running services often either depends on, or interacts with, other parts of the organization, and attempting to plan these in three-week sprints might not be feasible.
Kanban is maybe best known for its Kanban board, which is nowadays standard practice in most places. I think most Kanban practices are good, and many of them are part of Scrum. If you just do plain Kanban, it may fit the operational work better, but it doesn't do much to address the whole picture of how to manage the service.
There are some evolutions of Scrum that are a bit better suited to work with a larger share of unplanned tasks, e.g. Scrumban and ScrumOps. While these help with some of the problems with scheduling work, they still don't look at the operations in any depth.
The Site Reliability Engineering (SRE) practices started by Google are among the best descriptions of the work we do. The practices do a good job of describing operational work. They are, however, very focused on operations and on specific operational support teams.
When you run services based on existing software, a lot of the work that would fall on a dedicated development team falls on you. How do you integrate the software into business processes? Do you need to tweak it and push changes upstream? How do you do updates, take features into use, and handle dependency management, documentation, etc.? SRE is somewhat lacking in this space.
A lot of the work of running the service is development work, even if not all of it is pure coding.
Do we need something new?
Of course, with the nature of information (and, well, to be honest, me), there may well exist a perfect framework which I haven't found. If you know of any, please reach out, so I can link to that too.
The goal of these posts is not to create a new framework to Solve Everything (TM). The main idea is to describe how to take Scrum, with its strictly defined practices, and SRE, with its few strict rules but generally less strictly defined practices, and mash them into something that makes sense. It has taken us years, and we're still learning things, so hopefully these posts can help other teams avoid some of the pitfalls.
I limit these posts to services where the goal is to create reliable production services. Running a service at production quality takes more money and people, which may not be worth it in all cases.
For example, if you need to run something small-scale for 2-3 years and know it will then be shut down, and it's needed but reliability isn't that important, you may get it much cheaper by doing it quick-and-dirty. Your technical debt will also be limited, which makes everything easier.
You might also have deployed a test platform for some project. You can tear the test platform down and recreate it when needed. This means fewer operational issues and less maintenance effort. As my favourite demotivational poster says:
The agile service practices also only work with teams, which sets a lower limit on the organization size where this is useful. If you have two IT admins in your organization, that's probably not enough.
While these posts are supposed to apply specifically to running services based on existing software, many parts are probably more generally applicable when mixing operations with development.
In the next post I'll discuss service Horizons.