netdev - Re: : getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190112172636.GA26639@1wt.eu>
Date:   Sat, 12 Jan 2019 18:26:36 +0100
From:   Willy Tarreau <w@....eu>
To:     Florian Westphal <fw@...len.de>
Cc:     Lukas Tribus <lists@...i.eu>,
        Mohandass Roobesh <Roobesh_Mohandass@...fee.com>,
        netdev@...r.kernel.org
Subject: Re: : getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is
 in fact sometimes returning the source IP instead the destination IP

Hi Florian!

Sorry for the lag, I was busy with other issues.

On Sat, Jan 12, 2019 at 05:04:00PM +0100, Florian Westphal wrote:
> Lukas Tribus <lists@...i.eu> wrote:
> > The application (haproxy) needs to know the original destination IP
> > address, however it does not know whether -j REDIRECT was used or not.
> > Because of this the application always queries SO_ORIGINAL_DST, and
> > this includes configurations without -j REDIRECT.
> > 
> > Are you saying the behavior of SO_ORIGINAL_DST is undefined when not
> > used with -j REDIRECT and that this issue does not happen when -j
> > REDIRECT is actually used?
> 
> No, thats not what I said.  Because OP provided a link that mentions
> TPROXY, I concluded OP was using TPROXY, so I pointed out that the
> error source can be completely avoided by not using SO_ORIGINAL_DST.
> 
> As I said, SO_ORIGINAL_DST returns the dst address of
> the original direction *as seen by conntrack*.
> 
> In case REDIRECT or DNAT was used, the address returned is the on-wire
> one, before DNAT rewrite took place.
>
> Therefore, SO_ORIGINAL_DST is only needed when REDIRECT or DNAT was
> used. If no DNAT rewrite takes place, sockaddr returned by accept or
> getsockname can be used directly and SO_ORIGINAL_DST isn't needed.
> The returned address should be identical to the one given by accept().

That's also the way we're using it :

     http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/proto_tcp.c;h=52a4691d2ba93aec93a3e8a0edd1e90d93de968d;hb=HEAD#l600

More precisely if getsockopt() fails we only use getsockname(). It
happens that this code is very old (2002) and used to work with kernel
2.4's TPROXY which did rely on conntrack, hence the ifdef including
TPROXY. But it's irrelevant here, with a modern TPROXY the getsockopt()
here will usually fail, and the result will come from getsockname()
only.

When seeing the (old) haproxy code there, I've been thinking about
another possibility. Right now haproxy does getsockname() then *tries*
getsockopt(). In the unlikely case getsockopt() would modify part of
the address already returned by getsockname(), it could leave an
incorrect address there. But such an issue would require partial uses
of copy_to_user() which doesn't make much sense, and getorigdst()
doesn't do this either. So this one was ruled out.

I've been wondering if it was possible that from time to time, this
getsockopt() accidently succeeds and returns something wrong, maybe
due to a race with another thread, or maybe because it would return
an uninitialized field which happened to see the source address
previously. The very low amount of occurrence makes me think it could
possibly happen only upon certain error conditions. But looking at
getorigdst(), I hardly imagine how this would happen.

> If SO_ORIGINAL_DST returns the reply, then conntrack picked up
> a reply packet as the first packet of the connection, so it believes
> originator is the responder and vice versa.
> 
> One case where this can happen is when nf_conntrack_tcp_loose
> (mid-stream pickup) is enabled.

Very interesting case indeed, I hadn't thought about it! I think we
don't have enough info from the original reporter's setup but it
would definitely make sense and explain why it's the other end which
is retrieved!

I'm seeing one possibility to explain this : Let's say the OP's setup
has a short conntrack timeout and a longer haproxy timeout. If the
address is only retrieved for logging, it will be retrieved at the
end of the connection. Let's assume haproxy receives a request from
a client, a conntrack entry is created and haproxy forwards the request
to a very slow server. Before the server responds, the conntrack entry
expires, then the server responds and haproxy forwards to the client,
re-creating the entry and hitting this case before the address is
picked up for logging.

Roobesh, do you use the destination address only for logging or
anywhere else in the request path ? And could you check if you have
nf_conntrack_tcp_loose set as Florian suggests ? I really think he
figured it right.

If that's the problem, I think we can address it with documentation :
first, warn about the use case, second, explain how to store the
address into a variable once the connection is accepted since it will
not have expired and will still be valid at this moment.

> This is not a haproxy bug.
> 
> Only thing that haproxy could is to provide a knob to make it only
> use addresses returned by accept, rather than relying on
> SO_ORIGINAL_DST for those that use TPROXY to do MITM interception.

Since it's the destination we get it using getsockname(), not accept(),
but yeah maybe we'd benefit from having a tunable to disable the use of
getsockopt(SO_ORIGINAL_DST). Actually I'd prefer to avoid doing this
until the issue it confirmed because I don't feel comfortable with having
an option saying "disable the use of getsockopt(SO_ORIGINAL_DST) which
may fail once per million due to some obscure reasons". However if it's
confirmed that it indeed is the scenario above that happens, we could
possibly think about adding a setting to work around if the doc approach
is not enough.

Many thanks for your insights, these are exactly the reason I asked
Roobesh to bring the issue here :-)

Cheers,
Willy