netdev - Re: [PATCH net-next 4/5] net: sctp: decouple cleaning socket data from endpoint

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130619115519.GA15229@hmsreliant.think-freely.org>
Date:	Wed, 19 Jun 2013 07:55:19 -0400
From:	Neil Horman <nhorman@...driver.com>
To:	Vlad Yasevich <vyasevich@...il.com>
Cc:	Daniel Borkmann <dborkman@...hat.com>, davem@...emloft.net,
	netdev@...r.kernel.org, linux-sctp@...r.kernel.org
Subject: Re: [PATCH net-next 4/5] net: sctp: decouple cleaning socket data
 from endpoint

On Tue, Jun 18, 2013 at 06:55:27PM -0400, Vlad Yasevich wrote:
> On 06/18/2013 03:12 PM, Neil Horman wrote:
> >On Tue, Jun 18, 2013 at 02:15:21PM -0400, Vlad Yasevich wrote:
> >>On 06/18/2013 01:45 PM, Neil Horman wrote:
> >>>On Tue, Jun 18, 2013 at 12:24:31PM -0400, Vlad Yasevich wrote:
> >>>>On 06/18/2013 12:02 PM, Daniel Borkmann wrote:
> >>>>>On 06/18/2013 04:22 PM, Neil Horman wrote:
> >>>>>>I like this idea, but I think I'm maybe missing something from it - we
> >>>>>>reference
> >>>>>>the socket in both the receive and send paths (sctp_unpack_cookie, is
> >>>>>>specifically called from the rx path, which makes use of sp->hmac).  a
> >>>>>>socket
> >>>>>>destructor can be called from __sk_free when sk_wmem_alloc reaches
> >>>>>>zero, but we
> >>>>>>use sk_refcnt in the rx path to prevent premature socket cleanup.  If
> >>>>>>we drain
> >>>>>>our send queeue while wer'e still processing rx messages, what
> >>>>>>prevents us from
> >>>>>>freeing the socket in the tx path, via sk_free while we're still using
> >>>>>>the
> >>>>>>socket in the rx path.  Note I don't think this patch is wrong per-se,
> >>>>>>but it
> >>>>>>seems to me there is more work to do to properly interlock the use of
> >>>>>>sk_refcnt
> >>>>>>and sk_wmem_alloc here (unless I'm just missing something obvious,
> >>>>>>which is
> >>>>>>entirely possible, I've been in the sun alot lately :) ).
> >>>>>
> >>>>>Hm, __sk_free() calls sk_prot_free() which frees our socket structure
> >>>>>and in
> >>>>>sctp_wfree() we do a sctp_association_put(asoc) after sock_wfree(skb).
> >>>>>
> >>>>>So no matter if having this patch or not, couldn't this use-after-free like
> >>>>>scenario already happen with the current code?
> >>>>>
> >>>>>F.e. through a given call graph like that:
> >>>>>
> >>>>>sctp_wfree(skb):
> >>>>>  1) sock_wfree(skb)
> >>>>>     -> __sk_free()
> >>>>
> >>>>I don't think this can happen.  sk_wmem_alloc is set to 1 in sk_alloc()
> >>>>and that acts as a guard to make sure that sk_free() has been called
> >>>>before we try to free things up.  So, in this partcular case, for
> >>>>__sk_free() to be called, sk_free() had to be called meaning the
> >>>>last ref on the socket was released.  However, that's not possible since
> >>>>we are still holding the association and thus holding the socket
> >>>>associated with it.
> >>>>
> >>>I see what your saying, and I agree, with that bias added in sk_alloc, it looks
> >>>like we won't ever call __sk_free until sk_wmem_alloc is 1 _and_ sk_refcnt is 0.
> >>>It still seems messy and confusing though.  It would make more sense to me to
> >>>increment the refcount an additional time when the socket is initalized, and
> >>>then decrement it again when the socket is closed and sk_wmem_alloc reaches
> >>>zero.  That would isolate the refcounting to a single variable.
> >>
> >>See commit 2b85a34e911bf483c27cfdd124aeb1605145dc80.  The whole
> >>sk_wmem_alloc tric was done so that we dont have to do
> >>sock_hold/sock_put on transmits.
> >>
> >I know about why it was done, I'm just proposing that we don't have to do it
> >that way to get the speedup it gives us, and make the code a bit more clear at
> >the same time.  If we set the refcnt to 2 at init time, we can decrement that
> >added initial bias when the sk is marked as SOCK_DEAD and the sk_wmem_queued
> >reaches 0.  We still don't have to do a sock_hold/put on every tx, and we keep
> >all the reference counting in one location.
> 
> sorry, not following..  regardless, this seems to have gotten a bit
SOCK_DEAD set in sk->flags IIRC should indicate that the socket has been
orphaned from any user space file descriptors (i.e. the same meaning that the
initial bias to sk_wmem_alloc is telling us).  I'm just saying we can use that
to replace the sk_wmem_alloc == 0 test.

> off topic.  I am going to take a deeper look at Daniel's to make
> sure it doesn't introduce any races.
Yes, this is off topic, I was really just complaining that the current scheme
makes it harder to evaluate patches like this.

Neil

> 
> -vlad
> 
> >
> >Alternatively we can move all the reference counting to sk_wmem_alloc, and have
> >sock_hold/put increment that field by one instead of sk_refcnt, meaning we could
> >remove that field
> >
> >We don't really have to do any of this, of course, but not having looked at it
> >in some time, it seems confusing to me to have have a single additional ref
> >count tracked in sk_wmem_alloc.
> >
> >>It might be good to see if we can do that in sctp as well.
> >Not sure what you mean by "that" here.  You mean remove the additional uses of
> >sctp_hold/put?
> >
> >Neil
> >
> >>
> >>-vlad
> >>
> >>
> >>>Neil
> >>>
> >>>>-vlad
> >>>>
> >>>>>      -> sk_prot_free(.., sk)
> >>>>>       -> kmem_cache_free(.., sk) or kfree(sk)
> >>>>>  2) __sctp_write_space(asoc)
> >>>>>  3) sctp_association_put(asoc)
> >>>>>     -> sctp_association_destroy(asoc)
> >>>>>      -> sctp_endpoint_put(asoc->ep)
> >>>>>       -> sctp_endpoint_destroy(ep)
> >>>>>        -> crypto_free_hash(sctp_sk(ep->base.sk)->hmac)
> >>>>>           (etc, all unconditionally accessed while sk is
> >>>>>            already dead/freed)
> >>>>>
> >>>>>Then, this might need a fix in general. :-)
> >>>>>
> >>>>>Assuming you would reduce the buffer space via setsockopt(.., SO_SNDBUF,
> >>>>>..),
> >>>>>you might end up with a minimum buffer space of SOCK_MIN_SNDBUF [*] and
> >>>>>a call to
> >>>>>sk->sk_write_space(sk), which is sctp_write_space() and calls
> >>>>>__sctp_write_space()
> >>>>>on all asocs belonging to the socket, but it seems not to alter the current
> >>>>>sk->sk_wmem_alloc I think, but rather sk->sk_sndbuf.
> >>>>>
> >>>>>[*] Btw, shouldn't this rather be (2048 + sizeof(struct sk_buff)) or
> >>>>>     SKB_TRUESIZE(2048), at least like in SOCK_MIN_RCVBUF since we operate
> >>>>>     on skb->truesize as well?
> >>>>>--
> >>>>>To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> >>>>>the body of a message to majordomo@...r.kernel.org
> >>>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>>
> >>
> >>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html