netdev - Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order


Message-ID: <CAGnkfhwS3J=7W7BQabFXSLLad3dV3YfocDf9qADaWqqgZKPB7g@mail.gmail.com>
Date:   Mon, 16 Jul 2018 16:54:09 +0000
From:   Matteo Croce <mcroce@...hat.com>
To:     Pravin B Shelar <pshelar@....org>
Cc:     netdev <netdev@...r.kernel.org>, dev@...nvswitch.org,
        Stefano Brivio <sbrivio@...hat.com>,
        Jiri Benc <jbenc@...hat.com>, Aaron <aconole@...hat.com>
Subject: Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in
 per-port round-robin order

On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar <pshelar@....org> wrote:
>
> On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce <mcroce@...hat.com> wrote:
> > From: Stefano Brivio <sbrivio@...hat.com>
> >
> > Open vSwitch sends to userspace all received packets that have
> > no associated flow (thus doing an "upcall"). Then the userspace
> > program creates a new flow and determines the actions to apply
> > based on its configuration.
> >
> > When a single port generates a high rate of upcalls, it can
> > prevent other ports from dispatching their own upcalls. vswitchd
> > overcomes this problem by creating many netlink sockets for each
> > port, but it quickly exceeds any reasonable maximum number of
> > open files when dealing with huge amounts of ports.
> >
> > This patch queues all the upcalls into a list, ordering them in
> > a per-port round-robin fashion, and schedules a deferred work to
> > queue them to userspace.
> >
> > The algorithm to queue upcalls in a round-robin fashion,
> > provided by Stefano, is based on these two rules:
> >  - upcalls for a given port must be inserted after all the other
> >    occurrences of upcalls for the same port already in the queue,
> >    in order to avoid out-of-order upcalls for a given port
> >  - insertion happens once the highest upcall count for any given
> >    port (excluding the one currently at hand) is greater than the
> >    count for the port we're queuing to -- if this condition is
> >    never true, upcall is queued at the tail. This results in a
> >    per-port round-robin order.
> >
> > In order to implement a fair round-robin behaviour, a variable
> > queueing delay is introduced. This will be zero if the upcalls
> > rate is below a given threshold, and grows linearly with the
> > queue utilisation (i.e. upcalls rate) otherwise.
> >
> > This ensures fairness among ports under load and with few
> > netlink sockets.
> >
> Thanks for the patch.
> This patch is adding following overhead for upcall handling:
> 1. kmalloc.
> 2. global spin-lock.
> 3. context switch to single worker thread.
> I think this could become bottle neck on most of multi core systems.
> You have mentioned issue with existing fairness mechanism, Can you
> elaborate on those, I think we could improve that before implementing
> heavy weight fairness in upcall handling.

Hi Pravin,

vswitchd allocates N * P netlink sockets, where N is the number of
online CPU cores and P is the number of ports.
In some setups this number grows quickly and can exceed the system's
maximum file descriptor limit.
I've seen a 48-core server fail with -EMFILE when trying to create
the more than 65535 netlink sockets needed to handle 1800+ ports.
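
To put numbers on this, here is a rough sketch of the arithmetic. The
function names are made up for illustration, the 65535 limit is an
assumption taken from the figure above, and the calculation ignores
every other file descriptor the process holds:

```python
def handler_sockets(online_cpus, ports):
    """Netlink sockets needed with one socket per (CPU, port) pair."""
    return online_cpus * ports


def max_ports(online_cpus, fd_limit):
    """Largest port count that fits under fd_limit (assumed limit,
    ignoring all other descriptors held by the process)."""
    return fd_limit // online_cpus


print(handler_sockets(48, 1800))  # 86400 sockets for 1800 ports
print(max_ports(48, 65535))       # 1365 ports at most
```

So under these assumptions a 48-core machine runs out of descriptors
well before 1800 ports.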

I previously tried to reduce the sockets to one per CPU, but that was
discussed and rejected on ovs-dev because it would remove fairness
among ports[1].
I think the current approach of opening a huge number of sockets
doesn't really work (it certainly doesn't scale), and some queueing
logic is still needed, either in the kernel or in user space, if we
really want to be sure that low-traffic ports get their share of
upcalls while other ports generate far more traffic.
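
As an illustration of the kind of queueing logic I mean, the two
insertion rules from the commit message quoted above could be modelled
in user space roughly like this. This is only my reading of those
rules written as a Python sketch, not the kernel code from the patch;
the function name and the list-of-port-names queue are hypothetical:

```python
from collections import Counter


def enqueue_round_robin(queue, port):
    """Insert an upcall for `port` into `queue` (a list of port names).

    Rule 1: never insert before an upcall already queued for `port`.
    Rule 2: insert at the first position where some other port has more
            upcalls ahead of it than `port` has; otherwise append.
    Returns the index at which the upcall was placed.
    """
    seen = Counter()  # per-port upcall counts ahead of position i
    last_own = max((i for i, p in enumerate(queue) if p == port), default=-1)
    for i, queued_port in enumerate(queue):
        if i > last_own:
            others = max((c for p, c in seen.items() if p != port), default=0)
            if others > seen[port]:
                queue.insert(i, port)
                return i
        seen[queued_port] += 1
    queue.append(port)  # rule 2 never matched: queue at the tail
    return len(queue) - 1


queue = ["A", "A", "A", "A"]      # port A floods the queue
enqueue_round_robin(queue, "B")   # B is placed after the first A
enqueue_round_robin(queue, "B")   # second B is placed after the second A
print(queue)  # ['A', 'B', 'A', 'B', 'A', 'A']
```

A port that is alone in the queue always ends up at the tail, so the
ordering only changes when several ports compete.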

If you are concerned about the kmalloc or the spinlock, we can address
them with a kmem_cache or with two copies of the list and RCU. I'll be
happy to discuss the implementation details, as long as we all agree
that the current implementation doesn't scale well and has an issue.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html

--
Matteo Croce
per aspera ad upstream