It's been a while since the last blog post. For some reason the technical posts stopped at the same time I put on the product owner hat. Interpret that as you will.
Normal disclaimer: CentOS 7, RDO on OpenStack Newton
I can't resize!
Resizing and migrating VMs are standard tools in the OpenStack arsenal. Resizing is popular with customers, especially when we get cool new hardware they can utilize. Just a resize, and boom, you have upgraded your system. If you can...
Those who have dived deeper into the workings of OpenStack know that a resize is basically just a non-live migration which changes the flavor on the destination node. A resize should go through the same scheduling that any VM placement goes through. Imagine our surprise when we introduced new hardware, and a customer came and said they had tried to resize, and got an error.
That's strange. The customer's quota allowed the resize, and the flavor's host aggregate definitely had room in it. We even checked that there weren't any silly server groups with affinity rules. For some reason the scheduler simply refused to schedule it.
Until I realize!
So, Nova debug logging it is, and let's trawl through the logs and see if we can find anything. There were a lot of logs, but this entry in nova-scheduler.log stuck out.
2018-02-15 13:49:48.508 31871 INFO nova.scheduler.host_manager [req-UUID1 UUID2 UUID3 - - -] Host filter only checking host obfuscated-hostname1.csc.fi and node obfuscated-hostname1.csc.fi
obfuscated-hostname1.csc.fi was the node where the instance was currently running. This is strange. Why would the scheduler exclude all hosts except the one the VM is already running on?
Duckduckgoing the error message led me to the code that logs it. Apparently this host was the only one in the "requested_hosts" list. I verified that the API call didn't contain anything like that, so it had to be coming from the database. To save some time, imagine a montage of backtracking through the code, with cool background music.
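In spirit, the behaviour we hit looks like the sketch below. This is a paraphrase of the idea, not Nova's actual Newton source: when the request spec carries a pinned destination, the host manager narrows the candidate list down to that single host before any other filtering happens.

```python
# Paraphrase of the scheduler behaviour we ran into (not Nova's real code):
# a pinned destination short-circuits the candidate host list to one entry.
def filter_hosts(hosts, requested_destination=None):
    if requested_destination is not None:
        host = requested_destination["host"]
        # This is the moment that produces a log line like the one above.
        print("Host filter only checking host %s" % host)
        return [h for h in hosts if h == host]
    return hosts
```

With a destination pinned to a full host, every other host is filtered out and the scheduler has nowhere to go.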
Ha! There! The nova_api database has a table called request_specs, which stores a long JSON blob in its "spec" column. For these VMs, the JSON contained a section like this.
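The exact blob from our database isn't reproduced here, but a serialized RequestSpec with a pinned destination looks roughly like this (an illustrative sketch based on Nova's versioned-object JSON format, with the unrelated fields trimmed):

```json
{
  "nova_object.name": "RequestSpec",
  "nova_object.data": {
    "requested_destination": {
      "nova_object.name": "Destination",
      "nova_object.data": {
        "host": "obfuscated-hostname1.csc.fi",
        "node": "obfuscated-hostname1.csc.fi"
      }
    }
  }
}
```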
A few quick queries showed this wasn't normal for other VMs: for most of them, "requested_destination" was null. These problem VMs had a history of rebuilds and other operations, some of which had failed. Maybe one of those left dirty data in the DB?
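In Python terms, the check amounted to the following (a hypothetical helper; in practice we ran the equivalent SELECT against the nova_api database, and the field layout is assumed from Nova's versioned-object serialization):

```python
import json

def pinned_instances(rows):
    """Given (instance_uuid, spec_json) pairs from nova_api.request_specs,
    return the UUIDs whose spec still carries a requested_destination.
    Hypothetical helper for illustration only."""
    pinned = []
    for uuid, spec_json in rows:
        spec = json.loads(spec_json)
        data = spec.get("nova_object.data", {})
        if data.get("requested_destination") is not None:
            pinned.append(uuid)
    return pinned
```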
So, in a "don't do this at home" moment, we dumped the JSON, carefully set requested_destination to null like it was for the other VMs, ran the result through a JSON verifier, and updated the request_specs row for that VM.
We're still not sure why this data was there, but we're hoping that changing it won't come back to bite us later on.