Message-ID: <20210824221522.5kokuv3eekalo2ha@ast-mbp.dhcp.thefacebook.com>
Date: Tue, 24 Aug 2021 15:15:22 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: sdf@...gle.com
Cc: hjm2133@...umbia.edu, bpf@...r.kernel.org, netdev@...r.kernel.org,
ppenkov@...gle.com
Subject: Re: [RFC PATCH bpf-next 0/2] bpf: Implement shared persistent
 fast(er) sk_storage mode
On Tue, Aug 24, 2021 at 09:03:20AM -0700, sdf@...gle.com wrote:
> On 08/23, Alexei Starovoitov wrote:
> > On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote:
> > > From: Hans Montero <hjm2133@...umbia.edu>
> > >
> > > This patch set adds a BPF local storage optimization. The first patch
> > adds the
> > > feature, and the second patch extends the bpf selftests so that the
> > feature is
> > > tested.
> > >
> > > We are running BPF programs for each egress packet and noticed that
> > > bpf_sk_storage_get incurs a significant amount of cpu time. By
> > inlining the
> > > storage into struct sock and accessing that instead of performing a
> > map lookup,
> > > we expect to reduce overhead for our specific use-case.
>
> > Looks like a hack to me. Please share the perf numbers and setup details.
> > I think there should be a different way to address performance concerns
> > without going into such hacks.
>
> What kind of perf numbers would you like to see? What we see here is
> that bpf_sk_storage_get() cycles are somewhere on par with hashtable
> lookups (we've moved off of 5-tuple ht lookup to sk_storage). Looking
> at the code, it seems it's mostly coming from the following:
>
> sk_storage = rcu_dereference(sk->sk_bpf_storage);
> sdata = rcu_dereference(sk_storage->cache[smap->cache_idx]);
> return sdata->data;
>
> We do 3 cold-cache references :-(
Only if the prog doesn't read anything at all through the 'sk' pointer,
but it sounds like the bpf logic is accessing it, so on a system with millions
of sockets the first access to 'sk' will pay that penalty.
I suspect that if you rearrange the bpf prog to do sk->foo first, the cache
miss will move and bpf_sk_storage_get() won't be hot anymore.
That's why optimizing loads like this without considering the full
picture might not achieve the desired result in the end.
> This is where the idea of inlining
> something in the socket itself came from. The RFC is just to present
> the case and discuss. I was thinking about doing some kind of
> inlining at runtime (and fallback to non-inlined case) but wanted
> to start with discussing this compile-time option first.
>
> We can also try to save sdata somewhere in the socket to avoid two
> lookups for the cached case, this can potentially save us two
> rcu_dereference's.
> Is that something that looks acceptable?
Not until there is a clear 'perf report' where the issue is obvious.
> I was wondering whether you've
> considered any socket storage optimizations on your side?
Quite the opposite. We've refactored several bpf progs to use
local storage instead of hash maps and achieved nice perf wins.
> I can try to set up some office hours to discuss in person if that's
> preferred.
Indeed, it's probably best to discuss it on a call.