Message-ID: <YnF7CjzYBhASi1Eo@fuller.cnet>
Date: Tue, 3 May 2022 15:57:14 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Christoph Lameter <cl@...two.de>
Cc: linux-kernel@...r.kernel.org, Nitesh Lal <nilal@...hat.com>,
Nicolas Saenz Julienne <nsaenzju@...hat.com>,
Frederic Weisbecker <frederic@...nel.org>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Alex Belits <abelits@...its.com>, Peter Xu <peterx@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Oscar Shiang <oscar0225@...email.tw>,
linux-rdma@...r.kernel.org
Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and
vmstat sync
On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> Ok I actually have started an opensource project that may make use of the
> oneshot interface. This is a bridging tool between two RDMA protocols
> called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf
>
> The relevant code can be found at
> https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In
> particular look at the ib2roce.c source code. This is still
> under development.
>
> The ib2roce bridging can run in a busy loop mode (-k option) where it spins
> on ibv_poll_cq() which is an RDMA call to handle incoming packets without
> kernel interaction. See busyloop() in ib2roce.c
>
> Currently I have configured the system to use CONFIG_NOHZ_FULL. With that
> I am able to reliably forward packets at a rate that saturates 100G
> Ethernet / EDR Infiniband from a single spinning thread.
>
> Without CONFIG_NOHZ_FULL any slight disturbance causes the forwarding to
> fall behind which will lead to dramatic packet loss since we are looking
> here at a potential data rate of 12.5Gbyte/sec and about 12.5Mbyte per
> msec. If the kernel interrupts the forwarding by say 10 msecs then we are
> falling behind by 125MB which would have to be buffered and processed by
> additional code. That complexity makes processing packets much slower
> which could cause the forwarding to slow down so that a recovery is not
> possible should the data continue to arrive at line rate.
Right.
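
For reference, the arithmetic in the quoted paragraph can be written out as
a small C helper. This is only a sketch: backlog_bytes() and
LINE_RATE_BYTES_PER_MS are names made up for illustration and do not come
from ib2roce, and the line rate is assumed to be exactly 100 Gbit/s with no
protocol overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed line rate: 100 Gbit/s = 12.5e9 bytes/s = 12.5e6 bytes per msec. */
#define LINE_RATE_BYTES_PER_MS 12500000ULL

/*
 * Hypothetical helper: number of bytes that arrive at line rate while the
 * forwarding thread is stalled for stall_us microseconds.
 */
static uint64_t backlog_bytes(uint64_t stall_us)
{
	/* Guard against overflow in the multiplication below. */
	assert(stall_us <= UINT64_MAX / LINE_RATE_BYTES_PER_MS);

	/* Microsecond input keeps short stalls exact; divide back to msec. */
	return LINE_RATE_BYTES_PER_MS * stall_us / 1000;
}
```

With these assumptions a 10 msec stall (stall_us = 10000) gives the 125MB
mentioned above, while a 5us interruption corresponds to roughly 62KB.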
> Isolation of the threads was done through the following kernel parameters:
>
> nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
> numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt
>
> And systemd was configured with the following affinities:
>
> system.conf:CPUAffinity=0-7,16-23
>
> This means that the second socket will be generally free of tasks and
> kernel threads.
>
> The NUMA configuration:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 94798 MB
> node 0 free: 92000 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 96765 MB
> node 1 free: 96082 MB
>
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
>
> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> provided by this patch instead of the NOHZ_FULL.
>
> What kind of metric could I be using to show the difference in idleness of
> the quality of the cpu isolation?
Interruption length and frequency:

-------|xxxxx|---------------|xxx|---------
         5us                  3us

which is what should be reported by oslat?
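
A minimal sketch of that kind of measurement, in the spirit of oslat but not
taken from its code: spin reading a monotonic clock and treat every gap
between two consecutive reads that exceeds a threshold as an interruption.
The struct, the function names and the threshold are all assumptions made
for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulator for interruptions seen by a busy loop. */
struct gap_stats {
	uint64_t count;    /* number of gaps above the threshold */
	uint64_t max_ns;   /* longest gap seen */
	uint64_t total_ns; /* sum of all gaps above the threshold */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Account for one gap between two consecutive clock reads. */
static void record_gap(struct gap_stats *s, uint64_t gap_ns,
		       uint64_t threshold_ns)
{
	assert(s);
	if (gap_ns <= threshold_ns)
		return; /* normal loop iteration, not an interruption */
	s->count++;
	s->total_ns += gap_ns;
	if (gap_ns > s->max_ns)
		s->max_ns = gap_ns;
}

/* Spin for run_ns and record every gap longer than threshold_ns. */
static void measure(struct gap_stats *s, uint64_t run_ns,
		    uint64_t threshold_ns)
{
	uint64_t start = now_ns(), prev = start, cur;

	while ((cur = now_ns()) - start < run_ns) {
		record_gap(s, cur - prev, threshold_ns);
		prev = cur;
	}
}
```

Running measure() for a few seconds on an isolated CPU, once with nohz_full
alone and once with the oneshot mode enabled, would give count and max_ns
figures that can be compared directly.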
>
> The ib2roce tool already has a CLI mode where one can monitor the
> latencies that the busyloop experiences. See the latency calculations in
> busyloop() and the CLI command "core". Stats can be reset via the "zap"
> command.
>
> I can see the usefulness of the oneshot mode but (I am very very sorry)
It's in there...
> I
> still think that this patchset overdoes what is needed and I fail to
> understand what the point of inheritance, per-syscall quiescing etc. is.
Inheritance is an attempt to support unmodified binaries like so:

1) configure task isolation parameters (e.g. sync per-CPU vmstat to global
   stats on system call returns).
2) enable inheritance (so that task isolation configuration and
   activation states are copied across to child processes).
3) enable task isolation.
4) execv(binary, params)

Per-syscall quiescing? Not sure what you mean here.
> Those cause needless overhead in syscall handling and increase the
> complexity of managing a busyloop.
Inheritance seems like a useful feature to us. Isn't it? (to be able to
configure and activate task isolation for unmodified binaries).
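
The launcher pattern itself (configure in a wrapper, then exec the
unmodified binary) can be sketched with an attribute that is already
inherited today. The example below deliberately does not use the prctl
options from this series; it substitutes CPU affinity set through
sched_setaffinity(), which is preserved across execve(), purely as a
stand-in. The function names are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/* Restrict the calling task to a single CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* First CPU the calling task is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/*
 * Configure first, then exec the unmodified binary, which keeps the
 * affinity without knowing about it. Only returns (with -1) on failure.
 */
static int launch_pinned(int cpu, char *const argv[])
{
	if (pin_to_cpu(cpu) != 0)
		return -1;
	execv(argv[0], argv);
	return -1; /* execv() only returns on error */
}
```

A wrapper calling launch_pinned() behaves roughly like taskset for one CPU;
with inheritance enabled, the task isolation configuration would travel
across execv() in the same way.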
> Special handling when the scheduler
> switches a task? If tasks are being switched that requires them to be low
> latency and undisturbed then something went very very wrong with the
> system configuration and the only thing I would suggest is to issue some
> kernel warning that this is not the way one should configure the system.
Trying to provide mechanisms, not policy?
Or from another POV: if the user desires, we can display the warning.