Date:   Thu, 16 Aug 2018 23:07:37 +0200
From:   Stefano Brivio <sbrivio@...hat.com>
To:     Pravin Shelar <pshelar@....org>
Cc:     Matteo Croce <mcroce@...hat.com>,
        Justin Pettit <jpettit@...are.com>,
        Greg Rose <gvrose8192@...il.com>, Ben Pfaff <blp@....org>,
        netdev <netdev@...r.kernel.org>, ovs dev <dev@...nvswitch.org>,
        Jiri Benc <jbenc@...hat.com>, Aaron <aconole@...hat.com>
Subject: Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in
 per-port round-robin order

Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar <pshelar@....org> wrote:

> My argument is not about proposed fairness algorithm. It is about cost
> of the fairness and I do not see it is addressed in any of the follow
> ups.

We are still working on it (especially on the points that you already
mentioned); that's why there hasn't been any follow-up yet.

After all, we marked this patch as RFC because we thought we needed to
gather some feedback before we'd reach a solution that's good enough :)

> I revisited the original patch, here is what I see in term of added
> cost to existing upcall processing:

Thanks for the review and the detailed summary; some answers follow:

> 1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" This involve allocate
> and initialize memory

We would switch to kmem_cache_alloc() here, as the queue is rather
small. We don't really need the zeroing either, so the initialisation
can be dropped.
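
To make that a bit more concrete, this is roughly what we have in
mind -- just a sketch, all names and fields below are made up:

#include <linux/skbuff.h>
#include <linux/slab.h>

/* Hypothetical shape of the queued object, for illustration only. */
struct queued_upcall {
	struct sk_buff *skb;
	u32 port_no;
	/* flow key, or a pointer to it -- see the next point */
};

static struct kmem_cache *upcall_cache;

static int upcall_cache_init(void)
{
	upcall_cache = kmem_cache_create("ovs_queued_upcall",
					 sizeof(struct queued_upcall),
					 0, SLAB_HWCACHE_ALIGN, NULL);
	return upcall_cache ? 0 : -ENOMEM;
}

/* Upcall path: no zeroing, every field is set explicitly right after. */
static struct queued_upcall *upcall_alloc(void)
{
	return kmem_cache_alloc(upcall_cache, GFP_ATOMIC);
}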

> 2. copy flow key which is more than 1 KB (upcall->key = *key)

The current idea here is to find a way to safely hold a pointer to the
flow key instead of copying it. Do you have any suggestions?

> 3. Acquire spin_lock_bh dp->upcalls.lock, which would disable bottom
> half processing on CPU while waiting for the global lock.

A double list, whose pointer is swapped when we start dequeuing
packets (as is done e.g. for the flow table on rehashing), would avoid
the need for this spinlock. We're trying that out.
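
In practice what we're experimenting with looks a lot like the
lockless llist pattern: the "pointer swap" on dequeue is just an
atomic exchange of the list head with NULL. Rough sketch, names made
up, not necessarily what will end up in the patch:

#include <linux/llist.h>

struct queued_upcall {
	struct llist_node node;
	/* skb, flow key, output port, ... */
};

static LLIST_HEAD(upcall_list);		/* one per datapath in practice */

/* Enqueue side, any CPU, softirq context: lockless push. */
static void upcall_enqueue(struct queued_upcall *up)
{
	llist_add(&up->node, &upcall_list);
}

/* Dequeue side: a single atomic exchange detaches the whole batch,
 * which is then walked without any lock held.
 */
static void upcall_drain(void)
{
	struct llist_node *batch = llist_del_all(&upcall_list);
	struct queued_upcall *up, *tmp;

	batch = llist_reverse_order(batch);	/* restore FIFO order */

	llist_for_each_entry_safe(up, tmp, batch, node) {
		/* round-robin over ports, send to userspace, free */
	}
}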

> 4. Iterate list of queued upcalls, one of objective it is to avoid out
> of order packet. But I do not see point of ordering packets from
> different streams.

Please note, though, that we also have packets from the same stream.
Actually, the whole point of this exercise is to let packets from
different streams be dequeued out of their arrival order, while
preserving ordering within any given stream.

> 5. signal upcall thread after delay ovs_dp_upcall_delay(). This adds
> further to the latency.

The idea behind ovs_dp_upcall_delay() is to schedule without delay if
we don't currently have a storm of upcalls.

But if we do, we're probably adding less latency this way than we
would by letting ovs-vswitchd absorb the whole burst at once.
Scheduling is also a fundamental requirement for fairness: we need to
schedule upcalls, and to schedule we need some delay (small, in the
overall picture). This is another point where we need to show some
detailed measurements, I guess.
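
Just to illustrate the intent -- this is *not* the
ovs_dp_upcall_delay() from the patch, thresholds and names below are
invented:

#include <linux/kernel.h>

/* Purely illustrative: no delay while the queue is short, a small
 * bounded delay (in jiffies) once upcalls start piling up.
 */
#define UPCALL_BATCH		32	/* hypothetical threshold */
#define UPCALL_MAX_DELAY	4	/* hypothetical cap, in jiffies */

static unsigned long upcall_delay(int queue_len)
{
	if (queue_len < UPCALL_BATCH)
		return 0;		/* no storm: schedule right away */

	return min_t(unsigned long,
		     queue_len / UPCALL_BATCH, UPCALL_MAX_DELAY);
}

/* e.g.: schedule_delayed_work(&dp->upcall_work, upcall_delay(len)); */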

> 6. upcall is then handed over to different thread (context switch),
> likely on different CPU.
> 8. the upcall object is freed on remote CPU.

A solution could be to use cmwq instead, with per-CPU workers and
queues. But I wonder what the drawbacks of per-CPU fairness would be;
I think that depends a lot on how ovs-vswitchd handles the upcalls. We
could check how that performs. Any thoughts?
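
In case it helps the discussion, the per-CPU variant we have in mind
would look more or less like this (again a rough sketch with made-up
names, reusing struct queued_upcall from the sketch above):

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

struct pcpu_upcall_queue {
	struct llist_head list;		/* init_llist_head() at dp init */
	struct work_struct work;	/* INIT_WORK(&work, upcall_work_fn) */
};

static struct workqueue_struct *upcall_wq;	/* alloc_workqueue() at init */
static DEFINE_PER_CPU(struct pcpu_upcall_queue, upcall_queues);

static void upcall_work_fn(struct work_struct *work)
{
	struct pcpu_upcall_queue *q =
		container_of(work, struct pcpu_upcall_queue, work);
	struct llist_node *batch = llist_del_all(&q->list);
	struct queued_upcall *up, *tmp;

	batch = llist_reverse_order(batch);

	llist_for_each_entry_safe(up, tmp, batch, node) {
		/* fairness within this CPU's batch, send to userspace,
		 * then free on this same CPU
		 */
	}
}

/* Enqueue path, softirq on the receiving CPU: object, queue and
 * worker all stay local, no global lock, no cross-CPU free.
 */
static void upcall_enqueue_pcpu(struct queued_upcall *up)
{
	struct pcpu_upcall_queue *q = this_cpu_ptr(&upcall_queues);

	llist_add(&up->node, &q->list);
	queue_work(upcall_wq, &q->work);	/* bound wq: runs locally */
}

Fairness would then be per-CPU rather than global, which is exactly
the doubt above.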

> 9. single lock essentially means OVS kernel datapath upcall processing
> is single threaded no matter number of cores in system.

This should also be solved by keeping two queues.

> I would be interested in how are we going to address these issues.
> 
> In example you were talking about netlink fd issue on server with 48
> core, how does this solution works when there are 5K ports each
> triggering upcall ? Can you benchmark your patch? Do you have
> performance numbers for TCP_CRR with and without this patch ? Also
> publish latency numbers for this patch. Please turn off megaflow to
> exercise upcall handling.

We just ran some tests showing that fairness is maintained with a much
lower number of ports, but we have no performance numbers at the
moment -- other than the observation that, when flooding with upcalls,
ovs-vswitchd is the bottleneck. We'll run proper performance tests,
focusing especially on latency (which we have somewhat ignored so far).

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

We tried with one socket per thread, and it just doesn't work. We can
definitely try a bit harder. The problem I see here is that the
current mechanism is not actually a fairness scheme: it kind of works
for most workloads, but if a port happens to flood with a given
timing, I don't see how fairness can be guaranteed.

-- 
Stefano
