[largescale-sig] RPC ping

Hey all,

TLDR: I propose a change to oslo_messaging that allows a ping over RPC,
      which is useful for monitoring the liveness of agents.

A few weeks ago, I proposed a patch to oslo_messaging [1] that adds a
ping endpoint to the RPC dispatcher.
This means that every OpenStack service using oslo_messaging RPC
endpoints (almost all OpenStack services and agents, e.g. neutron
server + agents, nova + computes, etc.) will be able to answer a
specific "ping" call over RPC.
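
To illustrate the idea, here is a minimal sketch in plain Python. It is
not the code from the patch and does not use oslo_messaging itself: the
names Dispatcher, AgentEndpoint, sync_routers, serve, call and the method
name "ping" are all hypothetical, and a queue.Queue stands in for the
RPC queue. The point is only that the dispatcher answers the ping on its
own, so every service gets it without adding anything to its endpoints.

```python
import queue
import threading

PING_METHOD = "ping"  # hypothetical name; the real patch may use another


class Dispatcher:
    """Toy stand-in for an RPC dispatcher (not oslo_messaging code)."""

    def __init__(self, endpoints):
        self.endpoints = endpoints

    def dispatch(self, method, **kwargs):
        # The ping is answered by the dispatcher itself, so services do
        # not have to implement it in their own endpoints.
        if method == PING_METHOD:
            return "pong"
        for endpoint in self.endpoints:
            func = getattr(endpoint, method, None)
            if callable(func):
                return func(**kwargs)
        raise AttributeError(f"no endpoint implements {method!r}")


class AgentEndpoint:
    """Hypothetical service endpoint with one ordinary RPC method."""

    def sync_routers(self, router_ids):
        return sorted(router_ids)


def serve(requests, dispatcher):
    """Consume (method, kwargs, reply_queue) items until a None sentinel."""
    while True:
        item = requests.get()
        if item is None:
            break
        method, kwargs, reply = item
        reply.put(dispatcher.dispatch(method, **kwargs))


def call(requests, method, timeout=1.0, **kwargs):
    """Send one request and wait for the reply.

    Raises queue.Empty if nothing answers within `timeout` seconds,
    which is what a caller sees when no consumer reads the queue.
    """
    reply = queue.Queue(maxsize=1)
    requests.put((method, kwargs, reply))
    return reply.get(timeout=timeout)


if __name__ == "__main__":
    requests = queue.Queue()
    worker = threading.Thread(
        target=serve,
        args=(requests, Dispatcher([AgentEndpoint()])),
        daemon=True,
    )
    worker.start()
    print(call(requests, "ping"))
    requests.put(None)
```

If nothing is consuming the queue, call() times out even though the
process that owns the queue may still be running, and that is the
situation a ping is meant to detect.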

I decided to propose this patch at my company mainly for 2 reasons:
1 - we are struggling to monitor our nova compute and neutron agents
  correctly:

1.1 - sometimes our agents are disconnected from RPC, but the Python process
is still running.
1.2 - sometimes the agent is still connected, but the queue / binding on
the rabbit cluster is not working anymore (after a rabbit split for
example). This one is very hard to debug, because the agent is still
reporting health correctly to the neutron server, but it's not able to
receive messages anymore.

2 - we are trying to monitor agents running in k8s pods:
when running a Python agent (neutron l3-agent for example) in a k8s pod, we
wanted to find a way to check whether it is still alive or not.

Adding an RPC ping endpoint could help us solve both of these issues.
Note that we still need an external mechanism (outside of OpenStack) to
send the ping.
We also think it could be useful to other OpenStackers, especially
large-scale operators.
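
As a rough idea of what such an external mechanism could look like,
here is a sketch of a probe that sends one RPC call and maps the result
to an exit code, which is the shape a k8s exec liveness probe expects.
Assumptions: the method name "ping" is a placeholder for whatever the
patch finally exposes, and the transport URL, exchange, topic and server
values have to be supplied by the operator. The functions rpc_ping and
probe are hypothetical helpers, not part of oslo_messaging.

```python
import sys


def rpc_ping(transport_url, exchange, topic, server, timeout):
    """Send one RPC call to a single server and wait for its reply.

    Assumption: the target service answers a method named "ping".
    """
    # Imported lazily so that probe() below can be exercised without
    # oslo.messaging installed.
    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=transport_url)
    target = oslo_messaging.Target(exchange=exchange, topic=topic,
                                   server=server)
    client = oslo_messaging.RPCClient(transport, target)
    try:
        # call() blocks until a reply arrives or the timeout expires, in
        # which case oslo_messaging raises MessagingTimeout.
        return client.prepare(timeout=timeout).call({}, "ping")
    finally:
        transport.cleanup()


def probe(send_ping, timeout=5.0):
    """Return 0 if the ping is answered within `timeout`, else 1."""
    try:
        send_ping(timeout)
    except Exception as exc:  # a timeout or a broken connection both fail
        print(f"ping failed: {exc!r}", file=sys.stderr)
        return 1
    return 0
```

A wrapper script would call sys.exit(probe(lambda t: rpc_ping(url,
exchange, topic, host, t))) with values matching the agent being
checked. Because the call goes through the agent's own queue, it covers
both the "process alive but disconnected" case and the broken queue /
binding case.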

Feel free to comment.


Arnaud Morin