NUMATopologyFilter and AMD Epyc Rome


On Thu, 2020-11-19 at 12:56 +0000, Eyle Brinkhuis wrote:
> Hi Stephen,
> 
> We run:
> Compiled against library: libvirt 5.4.0
> Using library: libvirt 5.4.0
> Using API: QEMU 5.4.0
> Running hypervisor: QEMU 4.0.0
> 
> ubuntu at compute02:~$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
> XPath set is empty
> (On a node with NPS-1)
> 
> compute03:~$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
> <topology>
>       <cells num="2">
>         <cell id="0">
>           <memory unit="KiB">65854792</memory>
>           <pages unit="KiB" size="4">2383698</pages>
>           <pages unit="KiB" size="2048">27500</pages>
>           <pages unit="KiB" size="1048576">0</pages>
>           <distances>
>             <sibling id="0" value="10"/>
>             <sibling id="1" value="12"/>
>           </distances>
>           <cpus num="32">
>             <cpu id="0" socket_id="0" core_id="0" siblings="0,32"/>
>             <cpu id="1" socket_id="0" core_id="1" siblings="1,33"/>
>             <cpu id="2" socket_id="0" core_id="2" siblings="2,34"/>
>             <cpu id="3" socket_id="0" core_id="3" siblings="3,35"/>
>             <cpu id="4" socket_id="0" core_id="4" siblings="4,36"/>
>             <cpu id="5" socket_id="0" core_id="5" siblings="5,37"/>
>             <cpu id="6" socket_id="0" core_id="6" siblings="6,38"/>
>             <cpu id="7" socket_id="0" core_id="7" siblings="7,39"/>
>             <cpu id="8" socket_id="0" core_id="8" siblings="8,40"/>
>             <cpu id="9" socket_id="0" core_id="9" siblings="9,41"/>
>             <cpu id="10" socket_id="0" core_id="10" siblings="10,42"/>
>             <cpu id="11" socket_id="0" core_id="11" siblings="11,43"/>
>             <cpu id="12" socket_id="0" core_id="12" siblings="12,44"/>
>             <cpu id="13" socket_id="0" core_id="13" siblings="13,45"/>
>             <cpu id="14" socket_id="0" core_id="14" siblings="14,46"/>
>             <cpu id="15" socket_id="0" core_id="15" siblings="15,47"/>
>             <cpu id="32" socket_id="0" core_id="0" siblings="0,32"/>
>             <cpu id="33" socket_id="0" core_id="1" siblings="1,33"/>
>             <cpu id="34" socket_id="0" core_id="2" siblings="2,34"/>
>             <cpu id="35" socket_id="0" core_id="3" siblings="3,35"/>
>             <cpu id="36" socket_id="0" core_id="4" siblings="4,36"/>
>             <cpu id="37" socket_id="0" core_id="5" siblings="5,37"/>
>             <cpu id="38" socket_id="0" core_id="6" siblings="6,38"/>
>             <cpu id="39" socket_id="0" core_id="7" siblings="7,39"/>
>             <cpu id="40" socket_id="0" core_id="8" siblings="8,40"/>
>             <cpu id="41" socket_id="0" core_id="9" siblings="9,41"/>
>             <cpu id="42" socket_id="0" core_id="10" siblings="10,42"/>
>             <cpu id="43" socket_id="0" core_id="11" siblings="11,43"/>
>             <cpu id="44" socket_id="0" core_id="12" siblings="12,44"/>
>             <cpu id="45" socket_id="0" core_id="13" siblings="13,45"/>
>             <cpu id="46" socket_id="0" core_id="14" siblings="14,46"/>
>             <cpu id="47" socket_id="0" core_id="15" siblings="15,47"/>
>           </cpus>
>         </cell>
>         <cell id="1">
>           <memory unit="KiB">66014072</memory>
>           <pages unit="KiB" size="4">2423518</pages>
>           <pages unit="KiB" size="2048">27500</pages>
>           <pages unit="KiB" size="1048576">0</pages>
>           <distances>
>             <sibling id="0" value="12"/>
>             <sibling id="1" value="10"/>
>           </distances>
>           <cpus num="32">
>             <cpu id="16" socket_id="0" core_id="16" siblings="16,48"/>
>             <cpu id="17" socket_id="0" core_id="17" siblings="17,49"/>
>             <cpu id="18" socket_id="0" core_id="18" siblings="18,50"/>
>             <cpu id="19" socket_id="0" core_id="19" siblings="19,51"/>
>             <cpu id="20" socket_id="0" core_id="20" siblings="20,52"/>
>             <cpu id="21" socket_id="0" core_id="21" siblings="21,53"/>
>             <cpu id="22" socket_id="0" core_id="22" siblings="22,54"/>
>             <cpu id="23" socket_id="0" core_id="23" siblings="23,55"/>
>             <cpu id="24" socket_id="0" core_id="24" siblings="24,56"/>
>             <cpu id="25" socket_id="0" core_id="25" siblings="25,57"/>
>             <cpu id="26" socket_id="0" core_id="26" siblings="26,58"/>
>             <cpu id="27" socket_id="0" core_id="27" siblings="27,59"/>
>             <cpu id="28" socket_id="0" core_id="28" siblings="28,60"/>
>             <cpu id="29" socket_id="0" core_id="29" siblings="29,61"/>
>             <cpu id="30" socket_id="0" core_id="30" siblings="30,62"/>
>             <cpu id="31" socket_id="0" core_id="31" siblings="31,63"/>
>             <cpu id="48" socket_id="0" core_id="16" siblings="16,48"/>
>             <cpu id="49" socket_id="0" core_id="17" siblings="17,49"/>
>             <cpu id="50" socket_id="0" core_id="18" siblings="18,50"/>
>             <cpu id="51" socket_id="0" core_id="19" siblings="19,51"/>
>             <cpu id="52" socket_id="0" core_id="20" siblings="20,52"/>
>             <cpu id="53" socket_id="0" core_id="21" siblings="21,53"/>
>             <cpu id="54" socket_id="0" core_id="22" siblings="22,54"/>
>             <cpu id="55" socket_id="0" core_id="23" siblings="23,55"/>
>             <cpu id="56" socket_id="0" core_id="24" siblings="24,56"/>
>             <cpu id="57" socket_id="0" core_id="25" siblings="25,57"/>
>             <cpu id="58" socket_id="0" core_id="26" siblings="26,58"/>
>             <cpu id="59" socket_id="0" core_id="27" siblings="27,59"/>
>             <cpu id="60" socket_id="0" core_id="28" siblings="28,60"/>
>             <cpu id="61" socket_id="0" core_id="29" siblings="29,61"/>
>             <cpu id="62" socket_id="0" core_id="30" siblings="30,62"/>
>             <cpu id="63" socket_id="0" core_id="31" siblings="31,63"/>
>           </cpus>
>         </cell>
>       </cells>
>     </topology>
> (On a node with NPS-2)
> 
> > It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but of
> > course, that doesn't change the physical design of the chip and there are still multiple memory controllers, some of which will be slower to
> > access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I
> > assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA
> > nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.
> Our setup is a little different. We don't use any OVS or SR-IOV. We use FDio's VPP, with networking-vpp as switch, and use VPP's RDMA capabilities
> to haul packets left and right. Our performance tuning sessions on these machines, without an OpenStack setup (so throughput in VPP), showed that
> NPS-1 is the best setting for us. We are only using one CX5 by the way, and use both ports (2x100G) in a LACP setup for redundancy.

Interesting. When you set NPS to 4, did you ensure you had one PMD per NUMA node?
When using DPDK you should normally have one PMD per NUMA node.

The other thing to note is that you can't assume the NIC will be on NUMA node 0 when you set NPS=4, even if it is attached to socket 0.

We have seen it on other NUMA nodes in some tests we have done, so if you only have one PMD per socket enabled you would want to ensure it's
on a core in the same NUMA node as the NIC.
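
To check which NUMA node the kernel reports for the NIC, and which cores would be local to it, something like the following rough sketch could be used. It only reads the standard sysfs files `device/numa_node` and `nodeN/cpulist`; the function names and the `sysfs` parameter are my own for illustration and are not part of VPP, DPDK or nova.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-15,32-47' into a sorted list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def nic_numa_node(iface, sysfs=Path("/sys")):
    """Return the NUMA node reported for a network interface, or None if unknown.

    The kernel writes -1 when it has no locality information for the device.
    """
    path = Path(sysfs) / "class" / "net" / iface / "device" / "numa_node"
    node = int(path.read_text().strip())
    return node if node >= 0 else None


def pmd_candidate_cpus(iface, sysfs=Path("/sys")):
    """Return the CPUs on the same NUMA node as the NIC (empty list if unknown)."""
    node = nic_numa_node(iface, sysfs)
    if node is None:
        return []
    cpulist = Path(sysfs) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())
```

Calling `pmd_candidate_cpus("enp65s0f0")` (the interface name is a made-up example) would then list the cores worth considering for the PMD. The same information is visible with 'lstopo'.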
> 
> Thanks for your quick reply!
> 
> Regards,
> 
> Eyle
> 
> > On 19 Nov 2020, at 13:31, Stephen Finucane <stephenfin at redhat.com> wrote:
> > 
> > On Thu, 2020-11-19 at 12:25 +0000, Stephen Finucane wrote:
> > > On Thu, 2020-11-19 at 12:00 +0000, Eyle Brinkhuis wrote:
> > > > Hi all,
> > > > 
> > > > We're running into an issue with deploying our infrastructure to run high throughput, low latency workloads.
> > > > 
> > > > Background:
> > > > 
> > > > We run Lenovo SR635 systems with an AMD Epyc 7502P processor. In the BIOS of this system, we are able to define the number of NUMA cells per
> > > > socket (called NPS). We can set 1, 2 or 4. As we run a 2x 100Gbit/s Mellanox CX5 in this system as well, we use the preferred-io setting in
> > > > the BIOS to give preferred io throughput to the Mellanox CX5.
> > > > To make sure we get as high performance as possible, we set the NPS setting to 1, resulting in a single numa cell with 64 CPU threads
> > > > available.
> > > > 
> > > > Next, in Nova (Train release), we request huge pages. Huge pages, however, require a NUMA topology, but as this is one large NUMA cell, even
> > > > with cpu=dedicated or requesting a single NUMA domain, we fail:
> > > > 
> > > > compute03, compute03 fails NUMA topology requirements. No host NUMA topology while the instance specified one. host_passes
> > > > /usr/lib/python3/dist-packages/nova/scheduler/filters/numa_topology_filter.py:119
> > > 
> > > Oh, this is interesting. This would suggest that when NPS is configured to 1, the host is presented as a UMA system and libvirt doesn't present
> > > topology information for us to parse. That seems odd and goes against how I thought newer versions of libvirt worked.
> > > 
> > > What do you see when you run e.g.:
> > > 
> > >         $ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
> > 
> > Also, what version of libvirt are you using? Past investigations [1] led me to believe that libvirt would now always present a NUMA topology for
> > hosts, even if those hosts were in fact UMA.
> > 
> > [1] https://github.com/openstack/nova/commit/c619c3b5847de85b21ffcbf750c10421d8b7d193
> > 
> > > > Any idea how to counter this? Setting NPS-2 will create two NUMA domains, but also cut our performance way down.
> > > 
> > > It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but
> > > of course, that doesn't change the physical design of the chip and there are still multiple memory controllers, some of which will be slower to
> > > access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I
> > > assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA
> > > nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.
> > > 
> > > Stephen
> > > 
> > > > Thanks!
> > > > 
> > > > Regards,
> > > > 
> > > > Eyle
>
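
For comparing the NPS-1 and NPS-2 hosts quoted above, here is a small sketch that summarises the `<topology>` element of `virsh capabilities` output. It assumes the XML is shaped like the compute03 output in this thread. The function name is hypothetical and this is not how nova itself parses the data; it is only a quick way to see whether libvirt reports any NUMA cells at all.

```python
import xml.etree.ElementTree as ET


def host_numa_cells(capabilities_xml):
    """Map each NUMA cell id to the sorted CPU ids libvirt lists for it.

    Returns an empty dict when the host XML has no <topology> element,
    which is what the NPS-1 node in this thread reported.
    """
    root = ET.fromstring(capabilities_xml)  # root element is <capabilities>
    cells = {}
    for cell in root.findall("./host/topology/cells/cell"):
        cpu_ids = sorted(int(cpu.get("id")) for cpu in cell.findall("./cpus/cpu"))
        cells[int(cell.get("id"))] = cpu_ids
    return cells
```

Feeding it the full output of `virsh capabilities` (for example captured with `subprocess.run(["virsh", "capabilities"], capture_output=True, text=True).stdout`) gives an empty dict on a host without topology information and one entry per cell otherwise.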