[largescale-sig][nova][neutron][oslo] RPC ping


Tagging with Nova and Neutron as they are mentioned and I thought some 
people from those teams had opinions on this.

Can you refresh my memory on why we dropped this before? I recall 
talking about it in Denver, but I can't for the life of me remember what 
the conclusion was. Did we intend to use something else for this that 
has since fallen through?

On 7/27/20 4:57 AM, Arnaud Morin wrote:
> Hey all,
> 
> TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC;
>        this is useful for monitoring the liveness of agents.
> 
> 
> A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a
> ping endpoint to the RPC dispatcher.
> It means that every OpenStack service which uses oslo_messaging RPC
> endpoints (almost all OpenStack services and agents - e.g. neutron
> server + agents, nova + computes, etc.) will then be able to answer a
> specific "ping" call over RPC.
> 
> I decided to propose this patch in my company mainly for 2 reasons:
> 1 - we are struggling to monitor our nova compute and neutron agents
>    correctly:
> 
> 1.1 - sometimes our agents are disconnected from RPC, but the python process
> is still running.
> 1.2 - sometimes the agent is still connected, but the queue / binding on
> the rabbit cluster is not working anymore (after a rabbit split, for
> example). This one is very hard to debug, because the agent is still
> reporting health correctly to the neutron server, but it's not able to
> receive messages anymore.
> 
> 
> 2 - we are trying to monitor agents running in k8s pods:
> when running a python agent (neutron l3-agent for example) in a k8s pod, we
> wanted to find a way to monitor whether it is still alive or not.
> 
> 
> Adding an RPC ping endpoint could help us solve both of these issues.
> Note that we still need an external mechanism (outside of OpenStack) to do
> this ping.
> We also think it could be nice for other OpenStackers, and especially
> large scale ops.
> 
> Feel free to comment.
> 
> 
> [1] https://review.opendev.org/#/c/735385/
> 
>
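
For readers who want a concrete picture of the idea quoted above, here is a
minimal sketch in plain Python. It does not use oslo_messaging and is not the
code from the patch under review: ToyDispatcher, PingEndpoint, ToyAgent and
check_alive are hypothetical names, and an in-process queue stands in for the
rabbit queue. It only illustrates why a ping that travels the same path as
real RPC calls can catch case 1.2, where the process is alive but no longer
receives messages.

```python
import queue
import threading


class PingEndpoint:
    """Hypothetical endpoint that answers a 'ping' call with 'pong'."""

    def ping(self, ctxt):
        return "pong"


class ToyDispatcher:
    """Toy stand-in for an RPC dispatcher (not the oslo_messaging class)."""

    def __init__(self, endpoints):
        # The ping endpoint is appended to whatever the service registered.
        self.endpoints = list(endpoints) + [PingEndpoint()]

    def dispatch(self, ctxt, method, **kwargs):
        # Call the first endpoint that exposes the requested method.
        for endpoint in self.endpoints:
            func = getattr(endpoint, method, None)
            if callable(func):
                return func(ctxt, **kwargs)
        raise AttributeError("no endpoint implements %r" % method)


class ToyAgent:
    """Agent whose worker thread consumes requests from an in-process queue."""

    def __init__(self, dispatcher):
        self.requests = queue.Queue()
        self.dispatcher = dispatcher
        # Set to False to simulate a broken queue/binding: the thread keeps
        # running, but incoming messages never reach the dispatcher.
        self.consuming = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            ctxt, method, kwargs, reply_q = self.requests.get()
            if not self.consuming:
                continue  # message is lost, no reply is ever sent
            try:
                reply_q.put(self.dispatcher.dispatch(ctxt, method, **kwargs))
            except Exception as exc:
                reply_q.put(exc)


def check_alive(agent, timeout=0.5):
    """External monitor: send a ping and wait for 'pong' up to `timeout` s."""
    reply_q = queue.Queue(maxsize=1)
    agent.requests.put(({}, "ping", {}, reply_q))
    try:
        return reply_q.get(timeout=timeout) == "pong"
    except queue.Empty:
        return False
```

In this sketch, check_alive plays the role of the external mechanism mentioned
above: it returns True while the agent consumes messages, and False once
agent.consuming is set to False, even though the worker thread is still
running. A process-level health check would not see that difference.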