netdev - Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 27 Oct 2011 21:37:30 -0400
From:	Neil Horman <nhorman@...driver.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>
Subject: Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and
 napi jitter

On Fri, Oct 28, 2011 at 12:55:46AM +0200, Eric Dumazet wrote:
> Le jeudi 27 octobre 2011 à 15:53 -0400, Neil Horman a écrit :
> > I had this idea awhile ago while I was looking at the receive path for multicast
> > frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> > loop in which we traverse a hash list linearly, looking for sockets that are
> > listening to a given multicast group.  For each matching socket we clone the skb
> > to enqueue it to the corresponding socket.  This creates two problems:
> > 
> > 1) Application driven jitter in the receive path
> >    As you add processes that listen to the same multcast group, you increase the
> > number of iterations you have to preform in this loop, which can lead to
> > increases in the amount of time you spend processing each frame in softirq
> > context, expecially if you are memory constrained, and the skb_clone operation
> > has to call all the way back into the buddy allocator for more ram.  This can
> > lead to needlessly dropped frames as rx latency increases in the stack.
> > 
> 
> Hmm... time to perform this loop not depends on memory constraints,
> since GFP_ATOMIC allocations are done. It succeed or not, immediately.
> 
Thats not entirely correct.  The time it takes to allocate a new object from the
slab can be very much affected by memory contraints.  Systems under memory
pressue may need to shrink slab caches, allocate from remote nodes, etc, all of
which contributes to additional allocation time.  This time is multiplied by the
number of sockets listening to a particular group.  I'm not saying its a huge
change, but its a change that I saw the ability to address.   This change (I
hope) tries, when able, to eliminate the jitter from that path.

> Time is consumed on the copy of the skb head, and refcnt
> increases/decreases on datarefcnt. Your patch doesnt avoid this.
> 
No it doesn't, you're correct.  I was really looking at lower hanging fruit with
this change.

> When application calls recvmsg() we then perform the two atomics on skb
> refcnt and data refcnt and free them, with cache line false sharing...
> 
> > 2) Increased memory usage
> >    As you increase the number of listeners to a multicast group, you directly
> > increase the number of times you clone and skb, putting increased memory
> > pressure on the system.
> > 
> 
> One skb_head is about 256 bytes (232 bytes on 64bit arches)
> 
Yes, thats right.  Depending on the size of the received frame I can allocate
anywhere from 1 to 4 additional skbs using otherwise unused memory at the tail
of the data segment with this patch.

> > while neither of these problems is a huge concern, I thought it would be nice if
> > we could mitigate the effects of increased application instances on performance
> > in this area.  As such I came up with this patch set.  I created a new skb
> > fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> > internally fragmented space of an skb data buffer and uses that to allocate
> > additional skbs during the clone operation. Since the skb->data area is
> > allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> > size), and nominally network interfaces tend to have an mtu of around 1500
> > bytes, we typically can reclaim several hundred bytes of space at the end of an
> > skb (more if the incomming packet is not a full MTU in size).  This space, being
> > exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> > without the need for additional locking, potntially providing lower jitter in
> > napi context per frame during a receive operation, as well as some memory
> > savings.
> > 
> > I'm still collecting stats on its performance, but I thought I would post now to
> > get some early reviews and feedback on it.
> > 
> 
> I really doubt you'll find a significative performance increase.
> 
With the perf script I included in this set, I'm currently reducing the napi rx
path runtime by about 2 microsecond per iteration when multiple processes are
listening to the same group, and we can allocate skbs from the data area of the
received skb.  I'm not sure how accurate (or valuable) that is, but thats what
I'm getting.  While I've not measured, theres also the fact that kfree_skb doesn't
have any work to do with FCLONE_SCRATCH skbs, since they all get returned when
the data area gets kfreed.  Lastly, theres the savings of 256 bytes per
listening app that uses an FCLONE_SCRATCH skb, since that memory is already
reserved, and effectively free.  I can't really put a number on how significant
that is, but I think its valuable.


> I do believe its a too complex : skb code is already a nightmare if you
> ask me.
> 
I'm not sure I agree with that.  Its certainly complex in the sense that it adds
additional state to be checked in the skb_clone and kfree_skb paths, but beyond
that, it doesn't require any semantic changes to any of the other parts of the
skb api(s).  Is there something specific about this change that you find
unacceptibly complex, or something I can do to make it simpler?

> And your hack/idea wont work quite well if you have 8 receivers for each
> frame.
> 
It will work fine, to whatever degree its able, and this it will fall back on
the standard skb_clone behavior.  The performance gains will be reduced, of
course, which I think is what you are referring to, but the memory advantages
still exit.  If you have 8 receivers, but a given skb has only enough free space
at the end of the data segment to allocate 4 additional skbs, we'll allocate
those 4, and do a real kmem_cache_alloc of 4 more in skb_clone.  The last 4 have
to use the slab_allocation path as we otherwise would, so we're no worse off
than before, but we still get the memory savings (which would be about 1k per
received frame for this mcast group, which I think is significant).

> What about finding another way to queue one skb to N receive queue(s),
> so that several multicast sockets can share same skb head ?
> 
Yes, I'd experimented with some of this, and had mixed results.  Specifically I
came up with a way to use the additional space as an array of list heads, with
backpointers to the socket_queue each list_head was owned by and the skb that
owned the entry.  It was nice because it let me enqueue the same skb to hundreds
of sockets at the same time, which was really nice.  It was bad however, because
it completely broke socket accounting (any likely anything else that required on
any part of the socket state).  Any thoughts on ways I might improve on that.
If I could make some sort of reduced sk_buff so that I didn't have to allocate
all 256 bytes of a full sk_buff, that would be great!

> I always found sk_receive queue being very inefficient, since a queue or
> dequeue must dirty a lot of cache lines.
> 
> This forces us to use a spinlock to protect queue/enqueue operations,
> but also the socket lock (because of the MSG_PEEK stuff and
> sk_rmem_alloc / sk_forward_alloc)
> 
> sk_receive_queue.lock is the real jitter source.
> 
> Ideally, we could have a fast path using a small circular array per
> socket, of say 8 or 16 pointers to skbs, or allow application or
> sysadmin to size this array.
> 
> A circular buffer can be handled without any lock, using atomic
> operations (cmpxchg()) on load/unload indexes. The array of pointers is
> written only by the softirq handler cpu, read by consumers.
> 
> Since this array is small [and finite size], and skb shared, we dont
> call skb_set_owner_r() anymore, avoiding expensive atomic ops on
> sk->sk_rmem_alloc.
> 
> UDP receive path could become lockless, allowing the softirq handler to
> run without being slowed down by concurrent recvmsg()
> 
> At recvmsg() time, N-1 threads would only perform the skb->refcnt
> decrement, and the last one would free the skb and data as well.
> 
This seems like a reasonable idea, and I'm happy to experiment with it, but I
still like the idea of using that often available space at the end of an skb,
just because I think the memory savings is attractive.  Shall I roll them into
the same patch series, or do them separately?
Neil

> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html