[nova][ops][stable] Any interest in backporting --dry-run and/or --instance options for heal_allocations?
Yea - this is our number one pain point with Nova and Rocky, and having this backported would be invaluable.
Since we are on the topic, here are some additional issues we are having:
- Sometimes heal_allocations just fails without a useful error message (e.g. "Compute host could not be found.").
- Errors are reported sequentially and always halt execution, so if you have a lot of errors, you end up fixing them one by one.
- Better logging when unexpected errors do happen would help (maybe a more verbose mode like --debug?).
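For reference, the two options under discussion combine like this on Train and later (a sketch; the instance UUID is a placeholder):

```shell
# Preview what heal_allocations would change without writing to placement
nova-manage placement heal_allocations --dry-run

# Heal a single instance instead of iterating over every instance
nova-manage placement heal_allocations --instance <instance-uuid>
```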
Best Regards, Erik Olof Gunnar Andersson
From: melanie witt <melwittt at gmail.com>
Sent: Wednesday, November 6, 2019 9:02 AM
To: openstack-discuss at lists.openstack.org
Subject: Re: [nova][ops][stable] Any interest in backporting --dry-run and/or --instance options for heal_allocations?
On 11/5/19 08:45, Matt Riedemann wrote:
> I was helping someone recover from a stuck live migration today where
> the migration record was stuck in pre-migrating status and somehow the
> request never hit the compute or was lost. The guest was stopped and
> basically the live migration either never started or never completed
> properly (maybe rabbit dropped the request or the compute service was
> restarted, I don't know).
> I instructed them to update the database to set the migration record
> status to 'error' and hard reboot the instance to get it running again.
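In concrete terms, that recovery step could look roughly like this (a sketch only; the migration id and database name are placeholders, and hand-editing the nova database should be a last resort, taken with a backup in hand):

```shell
# Mark the stuck migration record as failed in the cell database
mysql nova -e "UPDATE migrations SET status='error' WHERE id=<migration-id>;"

# Hard reboot the instance to get the guest running again
openstack server reboot --hard <instance-uuid>
```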
> Then they pointed out they were seeing this in the compute logs:
> "There are allocations remaining against the source host that might
> need to be removed"
> That's because the source node allocations are still tracked in
> placement by the migration record and the dest node allocations are
> tracked by the instance. Cleaning that up is non-trivial. I have a
> troubleshooting doc started for manually cleaning up that kind of
> stuff here, but ultimately just told them to delete the allocations
> in placement for both the migration and the instance and then run the
> heal_allocations command to recreate the allocations for the instance.
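With the osc-placement CLI plugin, that cleanup might look something like the following (a sketch; the UUIDs are placeholders):

```shell
# Allocations are tracked per consumer: during a migration the migration
# UUID holds the source-node allocations and the instance UUID holds the
# dest-node allocations. Drop both sets.
openstack resource provider allocation delete <migration-uuid>
openstack resource provider allocation delete <instance-uuid>

# Then recreate correct allocations for the instance from the nova side
nova-manage placement heal_allocations
```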
> Since this person's nova deployment was running Stein, they don't have
> the --dry-run or --instance options for the heal_allocations
> command. This isn't a huge problem but it does mean they could be
> healing allocations for instances they didn't expect.
> They could work around this by installing nova from train or master in
> a VM/container/virtual environment and running it against the stein
> setup, but that's maybe more work than they want to do.
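That workaround might look something like this (a sketch, assuming Train is the nova 20.x series and the Stein deployment's config file is readable from the virtualenv):

```shell
# Install Train-era nova into an isolated virtualenv
python3 -m venv /tmp/nova-train
/tmp/nova-train/bin/pip install 'nova>=20.0.0,<21.0.0'

# Run the newer command against the existing Stein deployment's config/DB
/tmp/nova-train/bin/nova-manage --config-file /etc/nova/nova.conf \
    placement heal_allocations --dry-run --instance <instance-uuid>
```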
> The question I'm posing is if people would like to see those options
> backported to stein and if so, would the stable team be OK with it?
> I'd say this falls into a gray area where these are things that are
> optional, not used by default, and are operational tooling so less
> risk to backport, but it's not zero risk. It's also worth noting that
> when I wrote those patches I did so with the intent that people could
> backport them at least internally.
I think tools like this that provide significant operability benefit are worthwhile to backport and that the value is much greater than the risk.
Related but not nearly as simple, I've backported nova-manage db purge and nova-manage db archive_deleted_rows --purge, --before, and --all-cells downstream because of the number of bugs support/operators have opened around database cleanup pain. These were all pretty difficult to backport given the number of differences and conflicts, but my point is that I understand the motivation well and support the idea.
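For the archive/purge tooling mentioned, the Train-era invocations go roughly like this (a sketch; the date is a placeholder, and flag availability varies by release):

```shell
# Move soft-deleted rows older than the date into shadow tables,
# across all cells, looping until nothing is left to archive
nova-manage db archive_deleted_rows --before 2019-01-01 --all-cells --until-complete

# Purge the shadow tables (or pass --purge to archive_deleted_rows
# to do both steps in one command)
nova-manage db purge --before 2019-01-01 --all-cells
```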
The fact that the patches in question were written with backportability in mind is A Good Thing.