I usually write an upgrade report on our OpenStack version upgrades. We recently upgraded from Juno to Liberty. I haven't gotten around to writing up how it went, partly because it went so smoothly.
Disclaimer: The preparation and work behind this was mostly done by people other than yours truly. Architecture-wise we still run monolithic controller nodes. We use Neutron with linuxbridges, VLANs, and standard l3-agents for outbound connectivity. We use Puppet for configuration.
What about Kilo?
You can't just jump from Juno to Liberty directly. Kilo introduced some flavor migrations that must be run in Kilo. So the first thing we did was to come up with this (simplified) plan.
- Shut down all OpenStack services
- Deploy the Kilo controller nodes
- Update the databases
- Tear down the Kilo control layer
- Deploy the Liberty controller nodes
- Upgrade compute / network nodes directly to Liberty
The plan was that this would impose a break of a few hours in API functionality, but otherwise it shouldn't affect the customers. The network nodes should stay up all the time, and when the Liberty versions of the Neutron agents come up, all network namespaces should be in the same state as Juno left them. The control plane for the storage will go down, but the upgrade won't touch the data plane.
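To make the "Update the databases" step concrete, here is a minimal sketch of how the Kilo-stage schema and flavor migrations could be run in a fixed order. The set of services, the order between them, and the Neutron config file paths are assumptions for illustration, not our actual procedure. The management commands themselves are the standard ones shipped with Kilo. The runner defaults to a dry run that only reports what it would execute.

```python
import shlex
import subprocess

# Assumed service set and config paths; adjust to what the cloud actually runs.
KILO_DB_STEPS = [
    ["keystone-manage", "db_sync"],
    ["glance-manage", "db_sync"],
    ["cinder-manage", "db", "sync"],
    ["nova-manage", "db", "sync"],
    # Kilo flavor data migration; it has to be done before moving to Liberty.
    ["nova-manage", "db", "migrate_flavor_data"],
    ["neutron-db-manage",
     "--config-file", "/etc/neutron/neutron.conf",
     "--config-file", "/etc/neutron/plugins/ml2/ml2_conf.ini",
     "upgrade", "head"],
]


def run_steps(steps, dry_run=True):
    """Run each command in order and stop at the first failure.

    Returns the command lines that were run (or would be run in a dry run).
    """
    done = []
    for cmd in steps:
        if not dry_run:
            # check=True raises CalledProcessError, which aborts the sequence.
            subprocess.run(cmd, check=True)
        done.append(shlex.join(cmd))
    return done


if __name__ == "__main__":
    for line in run_steps(KILO_DB_STEPS, dry_run=True):
        print(line)
```

Keeping the commands as data means the same list can be replayed unchanged in devel, staging, and production.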
First we read through all the relevant material we could find. This included the release notes, of course, upgrade reports from other sites, etc.
In our development environment
A lot of the upgrade time went to getting the right versions of the Puppet modules working. For example, not all OpenStack Kilo Puppet modules work against the latest OpenStack Kilo RPMs, and the same goes for Liberty. In general we tended to err on the side of the fresher (or freshest) modules if possible. Then it was a matter of iterating on the Puppet configuration: removing obsolete stuff, changing configuration to a newer version, etc.
The next step was to come up with a step-by-step upgrade procedure: basically, document every command you need to run, in the correct order. Then we redeployed our devel environment with Juno, launched VMs, created volumes, etc. to have a realistic environment, and went through the procedure. After that we fixed the procedure and did it all again.
After that we upgraded our staging environment according to the instructions. We treat our staging environment as a production environment, so it was basically a dress rehearsal. Based on what we found there, we fixed the procedure again.
We have learned something over the years. Not a single OpenStack upgrade has succeeded without problems updating the production database. The staging and devel environment databases haven't accumulated nearly the amount of crud that our production database has, so even if the tests work, the real thing might not.
This time we took a dump of the production database, imported it into our devel environment, and ran through the upgrade procedure. And it was a good thing we did, since the database was inconsistent due to mandatory manual hacks done at some point. It wasn't a huge deal; it took an hour to figure out and fix. That was a nice calm hour, not a sweat-soaked terrified have-to-get-it-done-now hour during the upgrade of the production system.
Having tested the database upgrade thoroughly also removed most of the "Did we really fix the issue?" thoughts, which would have come up during a frantic production fix. This was the first time we did a proper test with the production database prior to the actual upgrade, and I strongly recommend doing so.
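For anyone wanting to try the same rehearsal, a rough sketch of the dump-and-import part follows. It assumes a MySQL-compatible database, credentials in the usual client config file, and hypothetical database names, host names, and file paths. None of those details are taken from our setup.

```python
import subprocess


def dump_command(databases, outfile):
    """Build a mysqldump invocation that snapshots the given databases.

    --single-transaction gives a consistent dump of InnoDB tables without
    locking them for the duration of the dump.
    """
    return ["mysqldump", "--single-transaction",
            "--result-file=" + outfile,
            "--databases", *databases]


def import_command(host):
    """Build the mysql client invocation used to load a dump on the devel host."""
    return ["mysql", "--host=" + host]


def rehearse(databases, dumpfile, prod_host, devel_host):
    """Dump from production and load into devel (hypothetical host names)."""
    subprocess.run(dump_command(databases, dumpfile) + ["--host=" + prod_host],
                   check=True)
    with open(dumpfile, "rb") as fh:
        subprocess.run(import_command(devel_host), stdin=fh, check=True)


if __name__ == "__main__":
    # Only print the commands here; call rehearse() to actually run them.
    print(dump_command(["nova", "neutron", "cinder"], "/tmp/prod.sql"))
    print(import_command("devel-db.example.org"))
```

Once the dump is loaded, the normal upgrade procedure is run against the devel environment exactly as it would be in production.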
The actual upgrade day
It was pretty much boring. Come in early on a Saturday. Go through step by step instructions. See that everything works.
We did hit one larger problem. At the end of the upgrade, while installing Liberty on the network nodes, we noticed that our canary VMs stopped working. These are VMs we start before the upgrade. We do disk IO on them, and we ping them to verify that we haven't broken anything. Well, now we had.
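The canary idea is simple enough to sketch. The snippet below is not our actual tooling, just an illustration under assumptions: a probe that pings a VM once (Linux `ping` flags assumed), and a helper that turns a series of timestamped probe results into outage windows, so you can tell afterwards how long a break lasted.

```python
import subprocess


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host gets a reply (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def outage_windows(samples):
    """Return (start, end) pairs for each run of failed probes.

    samples is an iterable of (timestamp_seconds, ok) in time order. An
    outage starts at the first failed probe and ends at the next successful
    one. An outage still open at the end is closed at the last timestamp.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows
```

In practice a loop would call `ping_once` every few seconds for each canary and feed the results to `outage_windows` once the upgrade is done.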
When we updated the network nodes to Liberty, one came up much faster than the other. This was mainly due to facter, which gets run by Puppet. By default facter looks at every network interface to collect networking facts. With a few hundred network namespaces this takes a while. One of the network nodes had more l3 routers than the other one, which means it had more interfaces, which means it was slower.
Now, one network node coming up before the other is usually not a problem. However, we run the old-school l3-migrate cron script. The purpose of this script is to check whether all neutron-l3-agents are up. If they aren't, after a timeout it'll start migrating the routers away from the broken agent to other agents. This basically allows you to auto-recover from a failed network node.
For some reason this script didn't work in Juno, and we had planned on removing it. We didn't realize that it would suddenly start working in Liberty. The network node that came up first started migrating all routers to itself. However, it turned out that it wasn't capable of handling all the routers, and it ran out of memory and crashed. Our network nodes are VMs, so we stopped them, added memory, and restarted them. After this the routers started recovering, but the whole thing caused a network break of up to 20 minutes for the customers.
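The decision logic of such an l3-migrate script can be sketched roughly as below. This is not the script we run. The data shapes, the heartbeat-based liveness check, and the least-loaded placement are assumptions made for illustration. A real script would carry out each move with the Neutron CLI's `l3-agent-router-remove` and `l3-agent-router-add` commands.

```python
def plan_migrations(agents, routers_by_agent, now, timeout):
    """Plan router moves away from L3 agents that look dead.

    agents: dict of agent id -> timestamp of its last heartbeat
    routers_by_agent: dict of agent id -> list of router ids it hosts
    Returns a list of (router_id, from_agent, to_agent) tuples.
    """
    alive = sorted(a for a, seen in agents.items() if now - seen <= timeout)
    dead = sorted(a for a in agents if a not in alive)
    if not alive:
        return []  # nowhere to move routers to

    # Track how many routers each live agent would end up hosting.
    load = {a: len(routers_by_agent.get(a, [])) for a in alive}
    moves = []
    for src in dead:
        for router in routers_by_agent.get(src, []):
            # Assumed placement policy: pick the least loaded live agent.
            dst = min(alive, key=lambda a: (load[a], a))
            moves.append((router, src, dst))
            load[dst] += 1
    return moves
```

Note that with only one live agent, every router lands on it regardless of policy, which is exactly the situation we ended up in. Nothing in this sketch checks whether the target has the memory to host them.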
We didn't catch the problem in our earlier tests, most likely due to scale. We don't have hundreds of network namespaces in our staging environment, so the Puppet runs finish at pretty much the same time. We have hit other scale-related issues during earlier production upgrades (e.g. the number of db connections). Production-scale testing is hard.
All in all, the upgrade went well, and the aftermath of the upgrade was minimal. There were a few issues where the neutron security groups were interpreted more strictly, and blocked some traffic they didn't use to, but that's about it.