This is the fourth post in the series about how to apply Agile methods to run services based on existing software.
The first post, which describes the problem at some length, is here, but the TL;DR is:
I don't think there are great resources on how to apply agile methods to run and develop services based on existing (open source) software. We have struggled with it, and I try to write down practices that work for us, mostly taken from Scrum and SRE.
The second post discussed how the service lifecycle looks and when, what kind, and how much work we need to apply to the service.
The third post tried to classify the work that goes into the day-to-day administration of a production service.
This post starts to get into the meat of the topic. While writing this intro I realized this will probably not fit into one blog post, so I will end up writing another post on this, and then a few more. I expect this blog series will turn out to be a trilogy in seven parts.
Teams and Individuals
I'll talk a lot about teams in this post. I think well-run teams are at the core of operating long-term sustainable services. But before we jump to the topic of teams, let's take a look at how admin work is done - often successfully - with individuals.
The Life and Times of the Superhero Admin (SA)
If I'm at all correct about the audience of the blog, I think you all know at least one SA. It's Brent from The Phoenix Project. It's the person who knows how every service works - in extreme detail. They can fix any problem. They seem like a bottomless well of knowledge. If they're in a meeting where somebody is trying to make bad decisions about IT infrastructure, they can point out all design flaws in one sitting - usually with some level of abrasiveness depending on the SA in question. Someone pointed out to me when discussing this topic that these sound like the "10x developers". Thinking about it, it may be the same people in different roles.
Please note, in this context I talk about Superhero Admins working alone. Many SAs are great additions to admin teams too, but there their role differs a bit.
Sometimes we run one or more services where an SA basically takes care of all the admin tasks. They are quick to react to requests, they have creative ideas for how to fix things, and they can probably get you what you need quickly. They know service architectures, and how decisions impact the service.
However, they are just one person.
This quickly becomes limiting, and it can be a very hard situation to get out of.
- We can't get around the Bus Factor -> High business risks
- At some point one person is not enough to do all the tasks -> No easy way to scale our service
- We have only one set of eyes and one brain on the task -> Hard to build in quality
These limitations can often be outweighed by the capability and flexibility of SAs in the early stages of companies/services, but at some point, we hit a wall.
Solving this issue is not as easy as just adding another person to help with the work, and just hoping people sort it out. To build a long-term sustainable team you need to focus on how a team should work.
Teams are at the core of Agile and Scrum, and their benefits work for service operation and development too. With some changes.
A good summary on Scrum teams can be found here.
I think the "Characteristics" part of the description of a Scrum team is one of the most generalizable things in Scrum. If we look at the bullet points under "Characteristics", one major bullet jumps out. Not because it's the most important, but because the team often can't affect it themselves.
The people within the Scrum Team work full time in the team
For some reason, to me it seems that a lot of people argue against this point. I have only heard one good argument against it though. I'm still a bit incredulous that we think splitting people is a good thing, yet this practice seems to be very persistent.
Let's look at some of the arguments that I hear why "Teams with 100% dedicated people might work for others, but not for us".
We only have one admin in our organization, and I can't hire a whole team.
Correct. We do need some scale to operate in this fashion. We can't do it with one admin. This is the good argument.
I can't justify the cost of a whole team to manage this service, so we can't have full time team members.
This argument often misses the main point. The point is to organize around good processes and teams. The point is not to organize around services. We may not be able to afford a team for each service, but who says that each team must only maintain one service?
If the services are neither too complex nor too large, by all means, have our teams operate several services. Preferably the services have some things in common. Having a team of four people handling four services is much better than having four admins managing one service each. The worst case is four virtual teams with four 25% allocated admins managing four services with four different managers deciding priorities.
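The three arrangements above can be compared with a small sketch. It is only an illustration under stated assumptions: the admin and service names are made up, "bus factor" is simplified to the number of people assigned to a service, and the number of teams an admin belongs to stands in for the number of backlogs and priority setters they have to juggle.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Team:
    """A team with its members and the services it operates (hypothetical model)."""
    members: Tuple[str, ...]
    services: Tuple[str, ...]


def summarize(teams: List[Team]) -> Tuple[int, int]:
    """Return (lowest bus factor of any service, most teams any one admin is in)."""
    people_per_service: Dict[str, Set[str]] = {}
    teams_per_admin: Dict[str, int] = {}
    for team in teams:
        for service in team.services:
            people_per_service.setdefault(service, set()).update(team.members)
        for member in team.members:
            teams_per_admin[member] = teams_per_admin.get(member, 0) + 1
    lowest_bus_factor = min(len(p) for p in people_per_service.values())
    most_teams = max(teams_per_admin.values())
    return lowest_bus_factor, most_teams


# Hypothetical names, used only for illustration.
admins = ("ann", "bo", "cy", "di")
services = ("svc-a", "svc-b", "svc-c", "svc-d")

# Four admins managing one service each.
solo = [Team((a,), (s,)) for a, s in zip(admins, services)]
# One team of four handling all four services.
one_team = [Team(admins, services)]
# Four virtual teams, each with all four admins at 25%.
virtual = [Team(admins, (s,)) for s in services]

for name, arrangement in [("solo", solo), ("one team", one_team), ("virtual", virtual)]:
    print(name, summarize(arrangement))
```

In this simplified model the single team and the virtual teams reach the same bus factor, but only the virtual teams leave every admin answering to four separate backlogs. The model does not capture context switching cost at all, which is an additional argument against the virtual team setup.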
But I must prevent siloing!
We need to have enough information transfer to make sure our admins are aware of what is going on around them, and what is relevant for their work. Splitting people is not the way to solve that problem.
We need to have structures for teams to communicate and exchange information efficiently (I hope I'll get deeper into that topic in a few blog posts). Splitting people may help with siloing a bit, but the downsides are drastically worse than any upsides (see below).
But we just run this as a project, and then when we push the service into production, we don't need that many people.
Please read the second blog post about service lifecycles. I think we're heading for Horizon 0, not Horizon 1.
Ah! But I have Virtual Teams! (Longish rant ahead)
We probably shouldn't. Having no teams is a better solution than virtual teams. Virtual teams can be discussion clubs, but they should never have any responsibilities.
Why are Virtual Teams so bad?
If we introduce a team instead of relying on efficient individuals, we already add overhead for communication. There is a minimum amount of communication overhead that each team introduces. In addition, each team needs to establish processes for handling its responsibilities. The same processes aren't optimal for each team, so each team should have (at least somewhat) differing processes.
In addition to the overhead created by communication and multiple sets of processes, we introduce a large amount of context switching for admins in virtual teams. Context switching is expensive by itself.
It's not enough that this makes our admins much less efficient. Now we have to try to prioritize work. With part time people. Who have large overhead in their work. Prioritizing (and having any time estimates) is hard enough with a dedicated team, but this makes it practically impossible.
So far we haven't even gotten to a problem that has wide-ranging consequences - complexity. Growing complexity of our IT environment is a challenge in itself. Having a good view of the service is challenging even working within only one team.
I'll discuss complexity more below, but for each team, the content of that team's work usually has tons of complexity. This is complexity that admins have to keep in their heads while working. This gets multiplied for each team an admin works in. In general that means our admins can't keep as much of each team's complexity in their heads as a dedicated admin could. This affects the quality of the decisions our admins make.
In short - please think of the admins and the services, let's not use virtual teams.
Team Responsibilities and Autonomy
We need to clearly define the responsibility of a team producing services. We also need to define the confines of the team and the external processes they must adhere to.
One big responsibility of a team is to optimize their work. When they know where they can freely optimize, and what their confines are, they are usually the best people to see the big picture when it comes to their service. We can get great benefits when autonomous teams are given free rein within their confines.
When our teams know what they are responsible for, the work should be a team responsibility. The team has a shared backlog and we don't have per-admin work lists.
If we have four persons in a team, and everybody only works on their own track, we probably don't actually have a team.
One easy test is vacations. Let's say one of our four admins goes on vacation for four weeks. What happens with the tasks they were working on? The answer should of course be that the tasks are finished by the time they return from vacation (or at least that they have progressed, if the scope of the tasks is large). If the answer is "nothing", do we really have a team?
The Size of the Team and The Complexity of Work
So how large should an agile service team be? Scrum says it's 7 +/- 2 persons. The question is of course much more complex, and we need to take into account the service stage (Horizon), the scale, the cost, the criticality and the complexity of the service.
The absolute minimal team I would have is probably three people, with one of the team members acting as the product owner. For production services in many cases a minimum of four would be suitable. This also depends on how much business risk we are able to take. What happens to the team's service(s) if two admins leave? A four person team has a good chance of climbing out of the situation, a three person team less so.
That's the minimum size. How about the maximum size?
I think that the maximum team size should be limited by the complexity of the services the team operates. Complexity is usually inevitable, but there should be conscious effort to minimize the complexity of services. A lot of things add to the complexity of a service. It's obvious that our design choices affect this, but complexity is also affected by the scale of the service, its redundancy and security requirements, among a lot of other things.
The more complex our services are, the more difficult our work - both the development and the operation - is. Complexity also adds a lot of time for training new people to become productive, as the context they need to know to work on things grows. And of course, complex systems fail in complex ways. I love the conciseness and simplicity of the How Complex Systems Fail website.
To summarize, another excellent quote, attributed to a variety of people.
Everything Should Be Made as Simple as Possible, But Not Simpler.
How does complexity affect team sizes? Sometimes, instead of adding responsibilities to a team, and growing the team, maybe we should split the team, or move some of the responsibilities and complexity to another team.
Please note, I haven't tested this in practice, but my feeling is that when you can justify two teams, and they are large enough, it's better to split the responsibilities between two smaller teams rather than keep one large team. A larger team has several benefits (e.g. resiliency), but I'm not sure they outweigh the benefits of reduced complexity.
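One rough way to put a number on this is to count pairwise communication paths inside a team, n(n-1)/2 for n people. This is only a proxy and an assumption of this sketch: it says nothing about the complexity of the services themselves, it ignores the coordination needed between the two resulting teams, and the team sizes are made up.

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs of people in a team of the given size."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


# Hypothetical sizes: one team of eight versus two teams of four.
one_large_team = communication_paths(8)
two_small_teams = 2 * communication_paths(4)

print(one_large_team)   # 28
print(two_small_teams)  # 12
```

Under this assumption, splitting eight people into two teams of four cuts the internal paths from 28 to 12, at the price of whatever communication the two teams then need with each other.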
The roles defined in Scrum are quite simple and useful. The team member, scrum master, and product owner roles are all important.
In our case one of the team members takes care of the scrum master responsibilities. Depending on the team size, the product owner may also actively take part in the work of the team.
The product owner role is combined with the service owner role in our case. We could call the role service owner instead, but I prefer the term product owner even for services. The product owner title immediately makes you think of an agile team, while the service owner name is more mired in the history of ITIL.
I think the product owner role for services is often not taken seriously enough. We need a product owner for our teams. Please note: not for services, for teams. If a team operates many services, they still have one product owner. That product owner is the service owner for all the services. In some cases, the product owner role takes up a whole person. In a smaller team, the product owner role may not be a 100% role.
I think Scrum has good basic rules for teams that mostly work for our case too. The idea is to have self-organized and empowered teams with clear responsibilities.
Some things in the Scrum description are not perfect fits. For example, the "Collocation" requirement may be past its expiration date. Also, the rule that a team needs three sprints to become productive seems utopian for our work.
And then we have things that I completely disagree with, like each and every item under "Responsibilities of the Scrum Team". They don't make sense for our work, and I think if we hold tight to those, we have missed the point of Agile.
Why don't those points make sense? Whoops, I think this post is getting long enough. See you in part 5 :).