Message-ID: <20260210071929.15602-1-ionut.nechita@windriver.com>
Date: Tue, 10 Feb 2026 09:19:29 +0200
From: "Ionut Nechita (Wind River)" <ionut.nechita@...driver.com>
To: idryomov@...il.com
Cc: amarkuze@...hat.com, bigeasy@...utronix.de, ceph-devel@...r.kernel.org,
        clrkwllms@...nel.org, ionut.nechita@...driver.com,
        ionut_n2001@...oo.com, jkosina@...e.com, jlayton@...nel.org,
        linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
        rostedt@...dmis.org, sage@...dream.net, slava@...eyko.com,
        superm1@...nel.org, xiubli@...hat.com
Subject: Re: [PATCH] libceph: handle EADDRNOTAVAIL more gracefully

Hi Ilya,

Thank you for the thorough review and the good questions. You're right
to challenge the "1-2 seconds" claim -- after looking at the dmesg data
more carefully, I agree that figure was misleading in the commit
message.

> I'm missing how an error that is typically transient and goes away in
> 1-2s can cause a delay of 15+ seconds against a 250ms, 500ms, 1s, 2s,
> 4s, 8s, 15s backoff loop.

You're absolutely right that if the address became valid in 1-2s, the
third or fourth attempt would succeed. The problem is that in our
environment EADDRNOTAVAIL does NOT resolve in 1-2 seconds; that figure
was an incorrect generalization from simple DAD (IPv6 duplicate
address detection) scenarios.

From the production dmesg (6.12.0-1-rt-amd64, StarlingX on Dell
PowerEdge R720, IPv6-only Ceph cluster), the EADDRNOTAVAIL condition
persists for much longer:

  13:20:52 - mon0 session lost, hunting begins, first error -99
  13:57:03 - mon0 session finally re-established

That's approximately 36 minutes of continuous EADDRNOTAVAIL on all
source addresses. This happens during a StarlingX rolling upgrade,
where the platform reconfigures the network stack extensively (interface
teardown/rebuild, address reassignment, routing changes).

The reason the delays compound beyond the simple backoff sequence is
that there are two independent backoff mechanisms stacking:

1) Connection-level backoff (con_fault in messenger.c):
   250ms -> 500ms -> 1s -> 2s -> 4s -> 8s -> 15s (MAX_DELAY_INTERVAL)

2) Monitor hunt-level backoff (mon_client.c delayed_work):
   3s * hunt_mult, where hunt_mult doubles each cycle up to 10x max,
   so the hunt interval grows: 3s -> 6s -> 12s -> 24s -> 30s (capped)

At steady state, each monitor gets ~30 seconds of attempts before
the hunt timer switches to the next one. Within those 30 seconds,
the connection goes through the full exponential backoff (several
attempts up to the 15s max delay). The round-trip through both
monitors takes ~60 seconds at max backoff.
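
To put numbers on this compounding, below is a rough user-space
simulation of the nominal schedule (illustrative only, not kernel
code; the constants are the ones quoted above, and the per-cycle
delay reset models the few zero-delay retries seen at the start of
each hunt cycle, more on that further down):

#include <stdio.h>

#define CON_BASE_DELAY  0.25    /* connection backoff start, seconds */
#define CON_MAX_DELAY   15.0    /* MAX_DELAY_INTERVAL, seconds */
#define HUNT_BASE       3.0     /* base monitor hunt interval, seconds */
#define HUNT_MULT_MAX   10      /* hunt_mult cap */

int main(void)
{
        double elapsed = 0.0;
        int hunt_mult = 1;

        for (int cycle = 1; cycle <= 6; cycle++) {
                double window = HUNT_BASE * hunt_mult;
                double t = 0.0, delay = 0.0;
                int attempts = 0;

                /* connect attempts within one hunt window; the delay
                 * restarts from zero because the monitor switch opens
                 * a fresh connection */
                while (t <= window) {
                        attempts++;
                        if (delay == 0.0)
                                delay = CON_BASE_DELAY;
                        else if (2 * delay > CON_MAX_DELAY)
                                delay = CON_MAX_DELAY;
                        else
                                delay *= 2;
                        t += delay;
                }

                elapsed += window;
                printf("cycle %d: window %4.1fs, %d attempts, elapsed %5.1fs\n",
                       cycle, window, attempts, elapsed);

                if (2 * hunt_mult <= HUNT_MULT_MAX)
                        hunt_mult *= 2;
                else
                        hunt_mult = HUNT_MULT_MAX;
        }
        return 0;
}

This prints 4/5/6/7/7/7 attempts per cycle with hunt windows of
3/6/12/24/30/30 seconds, i.e. roughly 7 attempts per 30s window at
steady state and ~60s to cycle through both monitors -- consistent
with the ~8 attempts per hunt cycle in the dmesg (the difference being
the extra immediate retries at each switch).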

> How many attempts do you see per session and in total for the event
> before and after this patch?

Before the patch (from the dmesg):
- Total error-99 messages: ~470 connect attempts over 36 minutes
- Per monitor session (one hunt cycle at steady state): ~8 attempts
  (immediate x3, +1s, +2s, +3s, +5s, +8s before hunt switches)
- The sync task was blocked for 983+ seconds (over 16 minutes),
  triggering repeated hung task warnings:
    12:52:11 - "task sync blocked for more than 122 seconds"
    13:31:05 - "task sync blocked for more than 122 seconds" (new sync)
    13:33:08 - 245 seconds
    13:35:11 - 368 seconds
    ...continued up to 983+ seconds at 13:45:26

After the patch:
- The ADDRNOTAVAIL_DELAY (HZ/10 = 100ms) replaces the exponential
  backoff for EADDRNOTAVAIL failures specifically, so retries happen
  at a fixed 100ms interval instead of growing to 15s
- In testing with the same rolling upgrade scenario, the total
  reconnection time dropped from 36 minutes to under 3 seconds once
  the address became available, because the client was retrying every
  100ms instead of waiting up to 15s between attempts at the
  connection level (a rough worst-case comparison is sketched right
  after this list)
- Total attempts per event: similar count, but compressed into a
  much shorter window with faster recovery once the address is valid
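
For the recovery-time numbers, the relevant quantity is how long the
client keeps sleeping after the source address actually becomes usable
again. A rough illustration (user-space only, not kernel code; the
0.1s value is the patch's ADDRNOTAVAIL_DELAY, the rest is the nominal
backoff from above):

#include <stdio.h>

/* time from "address becomes usable at t = outage" to the next
 * connect attempt, for a given retry policy */
static double extra_wait(double outage, double base, double cap,
                         int exponential)
{
        double t = 0.0, delay = 0.0;

        for (;;) {
                if (t >= outage)        /* address valid: attempt succeeds */
                        return t - outage;
                if (!exponential)
                        delay = base;   /* fixed retry interval */
                else if (delay == 0.0)
                        delay = base;
                else if (2 * delay > cap)
                        delay = cap;
                else
                        delay *= 2;
                t += delay;
        }
}

int main(void)
{
        double worst_exp = 0.0, worst_fixed = 0.0;

        /* sweep the instant the address comes back, keep the worst case */
        for (double outage = 60.0; outage <= 120.0; outage += 0.07) {
                double e = extra_wait(outage, 0.25, 15.0, 1);
                double f = extra_wait(outage, 0.10, 0.10, 0);

                if (e > worst_exp)
                        worst_exp = e;
                if (f > worst_fixed)
                        worst_fixed = f;
        }

        printf("worst extra wait, 15s-capped exponential backoff: %.2fs\n",
               worst_exp);
        printf("worst extra wait, fixed 100ms retry:              %.2fs\n",
               worst_fixed);
        return 0;
}

The worst extra wait comes out just under 15s for the capped
exponential backoff (before the hunt-level switch adds its own delay)
versus under 100ms with the fixed retry, which is why recovery now
tracks the moment the address becomes valid.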

I should correct the commit message -- the "1-2 seconds" claim was
wrong. The accurate description is that the duration of EADDRNOTAVAIL
varies widely depending on the environment: it can be brief (simple
DAD) or very long (complex network reconfiguration during rolling
upgrades). The patch helps in both cases by keeping the retry interval
short so that recovery happens as soon as the address becomes
available, rather than potentially waiting up to 15 seconds for the
next connection attempt.

I will also note that switching monitors via reopen_session() gives
only brief relief from the connection-level backoff: ceph_con_open()
does reset con->delay to 0, but the new connection immediately hits
EADDRNOTAVAIL and con_fault() puts it straight back into exponential
backoff. So although each monitor switch yields a few immediate
retries (the first 3 attempts at the start of each cycle show zero
delay), the delay quickly ramps back up toward the 15s cap.
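
To make the intended behaviour easy to eyeball for the v2 review, here
is a compact restatement of the retry-delay selection as I understand
it. This is a user-space sketch only -- next_delay() is a made-up
helper, not a messenger.c function -- using the constants discussed in
this thread:

#include <stdio.h>

#define BASE_DELAY_MS           250     /* first connection retry */
#define MAX_DELAY_MS            15000   /* exponential backoff cap */
#define ADDRNOTAVAIL_DELAY_MS   100     /* fixed retry on -EADDRNOTAVAIL */

/* cur_ms == 0 means "just (re)opened", e.g. after a monitor switch */
static unsigned int next_delay(unsigned int cur_ms, int addr_not_avail)
{
        if (addr_not_avail)
                return ADDRNOTAVAIL_DELAY_MS;   /* don't ramp up, retry fast */
        if (cur_ms == 0)
                return BASE_DELAY_MS;
        if (cur_ms >= MAX_DELAY_MS / 2)
                return MAX_DELAY_MS;
        return cur_ms * 2;
}

int main(void)
{
        unsigned int d = 0;

        /* generic faults: 250ms doubling up to the 15s cap */
        for (int i = 0; i < 8; i++) {
                d = next_delay(d, 0);
                printf("generic fault %d: next retry in %u ms\n", i + 1, d);
        }

        /* EADDRNOTAVAIL faults: fixed 100ms regardless of history */
        d = next_delay(d, 1);
        printf("EADDRNOTAVAIL: next retry in %u ms\n", d);

        /* a monitor switch resets the delay, then it ramps up again */
        d = next_delay(0, 0);
        printf("after reopen_session(): next retry in %u ms\n", d);
        return 0;
}

Again, this is only to make the behaviour described above concrete; it
is not how the code is structured.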

I've prepared a v2 with a corrected commit message that reflects
the actual production data and explains the two compounding backoff
mechanisms. The code is unchanged -- only the commit message and
the Fixes: tag are updated.

The patch is scheduled to be exercised through a full cycle of
multiple rolling upgrades to validate the improvement under sustained
production conditions; I will share the results once that testing
completes.

Thanks,
Ionut
