Date:   Fri, 27 Sep 2019 17:37:53 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Pooja Trivedi <poojatrivedi@...il.com>
Cc:     netdev@...r.kernel.org, davem@...emloft.net, daniel@...earbox.net,
        john.fastabend@...il.com, davejwatson@...com, aviadye@...lanox.com,
        borisp@...lanox.com, Pooja Trivedi <pooja.trivedi@...ckpath.com>,
        Mallesham Jatharakonda <mallesh537@...il.com>
Subject: Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free
 caused by a race condition in tls_tx_records

On Tue, 24 Sep 2019 12:48:26 -0400, Pooja Trivedi wrote:
> On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski wrote:
> > On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:  
> > > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:  
> > > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:  
> > > > > Hi Jakub,
> > > > >
> > > > > I have explained one potential way for the race to happen in my
> > > > > original message to the netdev mailing list here:
> > > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > > >
> > > > > Here is the part out of there that's relevant to your question:
> > > > >
> > > > > -----------------------------------------
> > > > >
> > > > > One potential way for race condition to appear:
> > > > >
> > > > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event  
> > > >
> > > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > > >
> > > > Since the problem occurs in tls_sw_do_sendpage() and
> > > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?  
> > >
> > > That won't do, because tls_tx_records also gets called when completion
> > > callbacks schedule delayed work. That was the code path that caused
> > > the crash in my test: Cavium's nitrox crypto offload driver calls
> > > tls_encrypt_done, which calls schedule_delayed_work, and the delayed
> > > work is then processed by tx_work_handler.
> > > Notice in my previous reply,
> > > "Thread 2 code path:
> > > tx_work_handler ---> tls_tx_records"
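The interleaving above can be condensed into a small userspace sketch (plain C, no kernel APIs; `record`, `ctx`, and `double_free_count` are hypothetical stand-ins for the kTLS structures). The two "threads" are simulated sequentially, in exactly the order the race occurs, and a `freed` counter stands in for kfree() so the double free is observable rather than undefined behavior:

```c
struct record {
	struct record *next;
	int freed;		/* stand-in for kfree(); counts frees */
};

struct ctx {
	struct record *tx_list;	/* analogue of the kTLS ctx->tx_list */
};

/* Returns the number of double frees this interleaving produces. */
int double_free_count(struct ctx *c)
{
	int doubles = 0;
	struct record *rec = c->tx_list;

	/* Thread 1: tls_tx_records picks rec off tx_list, then blocks
	 * in sk_stream_wait_memory, which releases the socket lock. */

	/* Thread 2: tx_work_handler -> tls_tx_records runs while
	 * Thread 1 sleeps: it unlinks and frees the same record. */
	c->tx_list = rec->next;
	if (rec->freed++)
		doubles++;

	/* Thread 1 wakes up still holding its stale pointer and does
	 * list_del + free on the record Thread 2 already freed. */
	if (rec->freed++)
		doubles++;

	return doubles;
}
```

With a single record on the list this reports exactly one double free: the second list_del/free hits memory that was already released.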
> >
> > Right, the work handler would obviously also have to obey the exclusion
> > mechanism of choice.
> >
> > Having said that, this really does feel like we are trying to lock code,
> > not data here :(  
> 
> Agreed, and that's exactly the thought process I went through. So what
> are some other options?
> 
> 1) A lock member inside of ctx to protect tx_list
> We are load testing ktls offload with nitrox and the performance was
> quite adversely affected by this. This approach can be explored more,
> but the original design of using socket lock didn't follow this model
> either.
> 2) Allow tagging of individual record inside of tx_list to indicate if
> it has been 'processed'
> This approach would likely protect the data without compromising
> performance. It will allow Thread 2 to proceed with the TX portion of
> tls_tx_records while Thread 1 sleeps waiting for memory. There will
> need to be careful cleanup and backtracking after the thread wakes up
> to ensure a consistent state of tx_list and record transmission.
> The approach has several problems, however -- (a) It could cause
> out-of-order record tx (b) If Thread 1 is waiting for memory, Thread 2
> most likely will be too (c) Again, the socket lock wasn't designed to
> follow this model to begin with
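A minimal sketch of option (2) in userspace C (the `processed` flag and `tx_records_tagged` helper are hypothetical illustrations, not the kernel's API): each writer claims a record before doing anything that can block, so a concurrent walk skips records that are already owned:

```c
#include <stddef.h>

struct record {
	struct record *next;
	int processed;	/* hypothetical tag: record claimed by a writer */
	int sent;	/* stand-in for the record reaching the TCP layer */
};

/* Walk the list, claiming and "transmitting" only unclaimed records.
 * A second walker that runs while the first sleeps skips claimed
 * records instead of freeing them a second time. */
int tx_records_tagged(struct record *list)
{
	int sent = 0;
	struct record *rec;

	for (rec = list; rec; rec = rec->next) {
		if (rec->processed)
			continue;	/* another writer owns this one */
		rec->processed = 1;	/* claim before any blocking call */
		rec->sent = 1;		/* stand-in for tls_push_sg() */
		sent++;
	}
	return sent;
}
```

As the message above notes, the claim only prevents double processing; it does nothing for record ordering or for cleanup after a failed push.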
> 
> 
> Given that socket lock essentially was working as a code protector --
> as an exclusion mechanism to allow only a single writer through
> tls_tx_records at a time -- what other clean ways do we have to fix
> the race without a significant refactor of the design and code?

Very sorry about the delay. I don't think we can maintain the correct
semantics without sleeping :( If we just bail in tls_tx_records() when
there's already another writer, the later writer will return from the
system call even though the data has not been pushed into the TCP layer.

What was the reason for the performance impact with (1)? My feeling is
that we need to make writers wait to maintain socket write semantics,
and that implies putting writers to sleep, which is indeed very costly..

Perhaps something along the lines of:

	if (ctx->in_tcp_sendpages) {
		rc = sk_stream_wait_memory(sk, &timeo);
		...
	}

in tls_tx_records() would be the "most correct" solution? If we get
there and there is already a writer, that means the first writer has
to be waiting for memory, and so should the second..

WDYT?
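The flag-plus-wait idea can be sketched as a userspace analogue (pthreads standing in for the socket lock and sk_stream_wait_memory; `tls_ctx_like`, `in_tx`, and `writer` are hypothetical names, not kernel code): a second writer that finds the flag set goes to sleep rather than returning early, so write semantics are preserved and only one writer walks the tx list at a time:

```c
#include <pthread.h>
#include <time.h>

struct tls_ctx_like {
	pthread_mutex_t lock;	/* stands in for the socket lock */
	pthread_cond_t cond;
	int in_tx;		/* analogue of the proposed flag check */
	int max_concurrent;	/* instrumentation: peak writers inside */
	int active;
};

/* Each writer is a tls_tx_records() caller: if another writer is
 * already inside, it sleeps (the role sk_stream_wait_memory would
 * play) instead of bailing out and losing data. */
void *writer(void *arg)
{
	struct tls_ctx_like *c = arg;
	struct timespec ts = { 0, 1000000 };	/* 1 ms of "tx work" */

	pthread_mutex_lock(&c->lock);
	while (c->in_tx)
		pthread_cond_wait(&c->cond, &c->lock);
	c->in_tx = 1;
	c->active++;
	if (c->active > c->max_concurrent)
		c->max_concurrent = c->active;
	pthread_mutex_unlock(&c->lock);

	nanosleep(&ts, NULL);	/* critical region: the tx list walk */

	pthread_mutex_lock(&c->lock);
	c->active--;
	c->in_tx = 0;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}
```

Even with several writers racing in, the instrumented peak concurrency stays at one, which is the exclusion the thread is after; a real patch would still have to deal with the lock being dropped inside do_tcp_sendpages() itself.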
