netdev - Re: ESN, seqhi and out-of-order calls to advance()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [day] [month] [year] [list]

Message-ID: <20201123123745.GH15658@gauss3.secunet.de>
Date:   Mon, 23 Nov 2020 13:37:45 +0100
From:   Steffen Klassert <steffen.klassert@...unet.com>
To:     Nic Dade <nic.dade@...il.com>
CC:     <netdev@...r.kernel.org>
Subject: Re: ESN, seqhi and out-of-order calls to advance()

Hi,

I've Cced netdev, maybe other people have an opinion on this too.

On Thu, Nov 19, 2020 at 01:39:29PM -0800, Nic Dade wrote:
> I've been investigating a problem which happens when I use IPsec
> (strongswan in userspace), ESN, the default anti-replay window (32
> seqnums), on a multi-core CPU. The small window size isn't necessary, it
> just makes it easier to reproduce.

Using the same SA on multiple cores has synchronization problems anyway.
We curently try to solve that on the protocol level:

https://datatracker.ietf.org/doc/html/draft-pwouters-multi-sa-performance

> 
> It looks like it's the same problem which was found and patched in commit
> 3b59df46a4, but that commit only applies to async case, and doesn't quite
> solve the problem when it is the CPU cores which are causing it.
> 
> The trouble is that xfrm_replay_seqhi() is called twice for each packet,
> once to decide what seqhi value to use for authentication, and a second
> time to decide what seq num to advance to for anti-replay. Sometimes they
> are not the same value.
> 
> Here's what happens to cause this: two packets belonging to the same ESP SA
> are being received, and the packet's seqnums are > window apart. The
> packets pass through xfrm_input() which does the 1st call to
> xfrm_replay_seqhi(). Once the packet and seqhi have been authenticated,
> x->repl->advance() == xfrm_replay_advance_esn() is called. If the higher
> seqnum packet gets to advance() first, then it moves the auti-replay window
> forward. Then when the second packet arrives at advance(), the call
> xfrm_replay_advance_esn() makes to xfrm_replay_seqhi() thinks the seqnum <
> bottom means 32-bit seqnums wrapped, and it returns seqhi+1.
> xfrm_replay_advance_esn() stores that away in the xfrm_replay_state_esn for
> the future. From now until the ESP SA renews, or the sender gets to the
> point where seqhi really did increment, all packets will fail
> authentication.

Well analyzed! Yes, that can happen if the packets are processed on
different cpus.

> 
> This is b/c the seqhi which was authenticated != the seqhi which advance()
> believed to be correct.
> 
> I think it would be safer if the authenticated seqhi computed in
> xfrm_input() was passed to the advance() function as an argument, rather
> than having advance() recompute seqhi. That would also fix the async case.
> Or the non-async case also needs to recheck.

Right, using the authenticated seqhi in the advance() function makes
sense. We can do away the recheck() if that fixes the problem entirely.
But I did not look into this problem for some years, so not absolutely
sure if that is sufficient.


> 
> Also it seems to be that calling xfrm_replay_seqhi() while not holding
> x->lock (as xfrm_input() does) is not safe, since both x->seq_hi and x->seq
> are used by xfrm_replay_seqhi(), and that can race with the updates to seq
> and seq_hi in xfrm_replay_advance_esn()

True, indeed.

> ----------------------------------------------------------
> 
> Note I still have some mysteries. I do know for sure (from instrumenting
> the code and reproducing) that the 2nd call to xfrm_replay_seqhi() does
> result in a different answer than the 1st call, and that this happens when
> the packets are processed simultaneously and out of order as described
> above. My mystery is that my instrumentation also indicates it's the same
> CPU core (by smp_processor_id()) processing the two packets, which I cannot
> explain.

That is odd. Should not happen if the packets are processed on the same cpu 
with a sync cipher.