We've been working with OpenStack for a while now. The first version we started playing around with was Diablo in 2011. After some time of working on proof-of-concepts, and getting our hands dirty, we were tasked with setting up a production platform.
There is a ton of information about the technical stuff you need to do when running a production platform, but one aspect which I have found underrepresented is the actual work organization. It naturally depends on the organization you work for, but even so, I haven't found many sources which discuss how to run a modern production platform in a modern way. There are the high-level Lean Enterprise things, but resources on day-to-day work organization are rarer.
First of all, I don't think the old Linux server admin ways of working (ssh + vim) work for modern services. The minimum requirement to run these is configuration management, which means code, which means workflows for code review, testing, etc. This can be manageable without strict processes as long as you're only a few people. When the team grows, you need something more formal. We noticed this when we went from a two-man show to a team. The term DevOps usually seems to come from coders who also maintain the service they develop. In our case we deploy existing software products (OpenStack et al.); we don't actually develop them. But we use similar methodologies for the deployment and operation of the service, so I will use the term DevOps for our work too.
Because of its popularity, we decided to start using Scrum. We tried to have pretty much by-the-book Scrum, and brought it into our DevOps work: 3-week sprints, task definitions, planning poker, etc. In the beginning it actually felt like this was working. We were getting organized and we got things done.
When we had been doing this for half a year, it became apparent that Scrum is not a direct fit for our work. When you run a platform like OpenStack, you have to dedicate quite a large chunk of time to actual operations. In other words, unplanned work. Scrum does not like this.
We also tried to wrestle our work into Scrum-sized bits.
"Hey, we need to deploy this service."
"Oh damn. I'm not sure we can do this task in 3 weeks."
"Hmm, let's split it into smaller tasks, and schedule them independently?"
Not OK. In retrospect, not OK at all. This meant we started splitting work into chunks which actually don't bring any benefit. E.g. cable this server, or install (but don't configure) that piece of software. This meant we had tons of stuff going on, and we seldom completed anything. On the Scrum board we didn't have a lot of work in progress (WIP), but in reality we did. It's not that we didn't get any results, it's just that we got them in spite of the process, not because of the process.
When you mix this with the invisible operations work - which was seen more as a nuisance, not as a part of the job - the work started getting demoralizing. I still loved working with OpenStack, but not as much with our process.
So thanks to our Scrum master, we don't do that any more. Not exactly that, at least. The major changes we made were:
- No locking tasks to 3-week sprints. We plan our priorities every 3 weeks and write a definition of done (DoD) for each of the most important tasks. We also take a look at our 6-month plan.
- All tasks must bring a concrete benefit to either the admins or the end-users.
- Limit the number of tasks in progress.
- Move to a Kanban board where all development and operation tasks live.
So far this has worked much better. No more arguments or guessing about what fits in a 3-week sprint. No more pressure to change priorities every 3 weeks. No more useless staring at (and being pressured by) "velocity" or other Scrum stats. Operations work is visible, and organized in the same way as development work.
Now we work on a limited number of tasks until they finish. If we have a 3-week period with tons of ops work and little development, that's visible too. Some tasks might take 2 months, but they get done.
Our current way is not perfect, but it's better. There are still a ton of small tasks that don't warrant a ticket, and then you don't know whether they should go through the process or not. Then there are also long-term collaboration tasks with other teams, and these are hard to record in a meaningful way.
But you know what they say. Perfect is the enemy of good.
Our current process in a nutshell
I have heard the term ScrumBan. I'm not sure if this is what we do, and in general it doesn't really matter, as long as it works. Here I'll use "task" to mean any unit of work (ops or development) which brings concrete benefits. Usually this is a Scrum "story", sometimes an "epic".
- Planning every 3 weeks.
- Reviews (within team) and retrospectives scheduled with planning.
- 15 minute daily meetings.
- Higher level stakeholder meetings on a ~6 week cycle.
- One Kanban board, split into ops and development.
- Ops tasks take priority when taking new tasks.
- Limit the number of tasks "In progress". A task is worked on until it's done, can't proceed, or is no longer relevant.
- A generic DoD which includes code review, documentation, etc.
- Only very high level guesses on the work amount for a task.
- All tasks need to bring a concrete benefit (if they don't, redefine the scope).
- The work can be further divided into subtasks as needed.
- Tasks are seldom finished by only a single person.
- Each week 2 persons (on rotation) are responsible for handling ops tasks that arise.
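
For the code-minded, two of the rules above (the limit on tasks in progress, and ops taking priority when pulling new work) can be sketched in a few lines of Python. This is only an illustration, not a tool we run: the `Task` and `Board` names, the `wip_limit` value and the example task titles are all made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    title: str
    kind: str  # "ops" or "dev" (hypothetical labels for this sketch)


@dataclass
class Board:
    wip_limit: int  # assumed number; pick whatever suits your team
    backlog: List[Task] = field(default_factory=list)
    in_progress: List[Task] = field(default_factory=list)
    done: List[Task] = field(default_factory=list)

    def pull_next(self) -> Optional[Task]:
        """Move one task from the backlog to in-progress, ops first."""
        # Refuse to start new work when the WIP limit is reached.
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        # Prefer the oldest ops task; otherwise take the oldest task of any kind.
        task = next((t for t in self.backlog if t.kind == "ops"), self.backlog[0])
        self.backlog.remove(task)
        self.in_progress.append(task)
        return task

    def finish(self, task: Task) -> None:
        """Mark an in-progress task as done, freeing a WIP slot."""
        self.in_progress.remove(task)
        self.done.append(task)


board = Board(
    wip_limit=2,
    backlog=[
        Task("Deploy new service", "dev"),
        Task("Replace failed disk", "ops"),
        Task("Upgrade monitoring", "dev"),
    ],
)
first = board.pull_next()   # the ops task jumps the queue
second = board.pull_next()  # then the oldest development task
third = board.pull_next()   # None: the WIP limit is reached
```

The point of the sketch is that a third task cannot be started until something in progress is finished, which is exactly the pressure we want the board to apply.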
Conclusions and some recommendations
Are we done? No! We change things up pretty much every 3 weeks and try new things to improve our work. And I think that's the most important of the recommendations. Here are some of the main points I've personally learned.
- Have retrospectives, take them seriously, and change stuff if something doesn't work.
- There are reasons for the basic rules of Scrum (or method X, Y or Z). Try to understand the reasons behind them. Those are much more important than following the rules.
- Pure Scrum is probably not a great fit for ops-heavy DevOps work.
(Hmm, that was a bit of a Wall of Text. Maybe I should include pictures of cute kittens coding).
Geek. Systems Specialist @CSCfi