linux-kernel - Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap

Message-ID: <79a2c3ff-e490-4fb5-b0fc-0bdadfae1b81@infradead.org>
Date: Wed, 27 Aug 2025 21:15:07 -0700
From: Randy Dunlap <rdunlap@...radead.org>
To: Yu Kuai <yukuai1@...weicloud.com>, hch@...radead.org, corbet@....net,
 agk@...hat.com, snitzer@...nel.org, mpatocka@...hat.com, song@...nel.org,
 xni@...hat.com, hare@...e.de, linan122@...wei.com, colyli@...nel.org
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 dm-devel@...ts.linux.dev, linux-raid@...r.kernel.org, yukuai3@...wei.com,
 yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com
Subject: Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless
 bitmap



On 8/26/25 1:52 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@...wei.com>
> 
> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data vary depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
> 
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk

                   re-adding

> synchronization is required.
> 
> Key Features:
> 
>  - IO fastpath is lockless, if user issues lots of write IO to the same

                    lockless. If the user

>  bitmap bit in a short time, only the first write have additional overhead

                                                    has

>  to update bitmap bit, no additional overhead for the following writes;
>  - support only resync or recover written data, means in the case creating
>  new array or replacing with a new disk, there is no need to do a full disk
>  resync/recovery;
> 
> Key Concept:
> 
>  - State Machine:
> 
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And

                        contains 6 different states,


> there are total 8 differenct actions, see llbitmap_action, can change state:

                    different                                that can change state:

> 
> llbitmap state machine: transitions between states
> 
> |           | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | -------  |
> | Unwritten | Dirty      | x         | x       | x        |
> | Clean     | Dirty      | x         | x       | x        |
> | Dirty     | x          | x         | x       | x        |
> | NeedSync  | x          | Syncing   | x       | x        |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> 
> |           | Reload   | Daemon | Discard   | Stale     |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x        | x      | x         | x         |
> | Clean     | x        | x      | Unwritten | NeedSync  |
> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x        | x      | Unwritten | x         |
> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
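
For anyone reading along, the two tables above can be folded into one lookup
table. This is only a userspace sketch of how I read them: the enum and
function names are made up here (the patch's own types are llbitmap_state and
llbitmap_action), "x" is treated as "state unchanged", and the raid456 special
case from 2.2 is not modelled.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the identifiers used in the patch. */
enum state { UNWRITTEN, CLEAN, DIRTY, NEEDSYNC, SYNCING, NSTATES, NONE = -1 };
enum action { STARTWRITE, STARTSYNC, ENDSYNC, ABORTSYNC,
	      RELOAD, DAEMON, DISCARD, STALE, NACTIONS };

#define X NONE	/* "x" in the tables: no transition */

/* Rows are the current state, columns follow the order of enum action. */
static const enum state transitions[NSTATES][NACTIONS] = {
	[UNWRITTEN] = { DIRTY, X,       X,     X,        X,        X,     X,         X        },
	[CLEAN]     = { DIRTY, X,       X,     X,        X,        X,     UNWRITTEN, NEEDSYNC },
	[DIRTY]     = { X,     X,       X,     X,        NEEDSYNC, CLEAN, UNWRITTEN, NEEDSYNC },
	[NEEDSYNC]  = { X,     SYNCING, X,     X,        X,        X,     UNWRITTEN, X        },
	[SYNCING]   = { X,     SYNCING, DIRTY, NEEDSYNC, NEEDSYNC, X,     UNWRITTEN, NEEDSYNC },
};

/* Returns the new state, or the old state when the table entry is "x". */
static enum state apply_action(enum state s, enum action a)
{
	assert(s >= 0 && s < NSTATES && a >= 0 && a < NACTIONS);
	enum state next = transitions[s][a];
	return next == NONE ? s : next;
}
```

With this, the "common process" in 5.1 is three calls in a row:
apply_action(NEEDSYNC, STARTSYNC), then ENDSYNC, then DAEMON.
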
> 
> Typical scenarios:
> 
> 1) Create new array
> All bits will be set to Unwritten by default, if --assume-clean is set,

                                       default. If

> all bits will be set to Clean instead.
> 
> 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> rely on xor data
> 
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
> 
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
> 
> Because the initial recover for raid456 is skipped, the xor data is not build
> yet, the bit must set to NeedSync first and after lazy initial recover is
> finished, the bit will finially set to Dirty(see 5.1 and 5.4);

                         finally

> 
> 2.3) cover write
> Clean --StartWrite--> Dirty
> 
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
> 
> For degraded array, the Dirty bit will never be cleared, prevent full disk

                                                           preventing

> recovery while readding a removed disk.

                 re-adding

> 
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> 
> 5) resync and recover
> 
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> 
> 5.2) resync after power failure
> Dirty --Reload--> NeedSync
> 
> 5.3) recover while replacing with a new disk
> By default, the old bitmap framework will recover all data, and llbitmap
> implement this by a new helper, see llbitmap_skip_sync_blocks:

  implements

> 
> skip recover for bits other than dirty or clean;
> 
> 5.4) lazy initial recover for raid5:
> By default, the old bitmap framework will only allow new recover when there
> are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add

                                                                        added

> to perform raid456 lazy recover for set bits(from 2.2).
> 
> Bitmap IO:
> 
>  - Chunksize
> 
> The default bitmap size is 128k, incluing 1k bitmap super block, and

                                   including

> the default size of segment of data in the array each bit(chunksize) is 64k,
> and chunksize will adjust to twice the old size each time if the total number
> bits is not less than 127k.(see llbitmap_init)
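
A small worked example may help here. The sketch below is my reading of the
paragraph above, not the code in llbitmap_init: it assumes sizes in bytes, one
byte per bit, and a limit of 127 * 1024 bits (128k minus the 1k super block).
The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_CHUNKSIZE (64ULL * 1024)	/* 64k of array data per bit */
#define MAX_BITS          (127ULL * 1024)	/* assumed: 128k bitmap minus 1k super block */

/* Double chunksize until the number of bits drops below MAX_BITS. */
static uint64_t pick_chunksize(uint64_t array_bytes)
{
	uint64_t chunksize = DEFAULT_CHUNKSIZE;
	uint64_t chunks = (array_bytes + chunksize - 1) / chunksize;

	assert(array_bytes > 0);
	while (chunks >= MAX_BITS) {
		chunksize <<= 1;
		chunks = (array_bytes + chunksize - 1) / chunksize;
	}
	return chunksize;
}
```

Under these assumptions a 1 GiB array keeps the 64k default (16384 bits),
while a 1 TiB array ends up with a 16 MiB chunksize (65536 bits).
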
> 
>  - READ
> 
> While creating bitmap, all pages will be allocated and read for llbitmap,

                                                                  llbitmap.

> there won't be read afterwards

  There          a read afterwards.

> 
>  - WRITE
> 
> WRITE IO is divided into logical_block_size of the array, the dirty state
> of each block is tracked independently, for example:
> 
> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

                                                       bytes and contains 512 bits:

> 
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 8|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
> 
> From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> subpage will be marked dirty, such block must write first before the IO is

                         dirty;

> issued. This behaviour will affect IO performance, to reduce the impact, if

                                        performance. To

> multiple bits are changed in the same block in a short time, all bits in this
> block will be changed to Dirty/NeedSync, so that there won't be any overhead
> until daemon clears dirty bits.
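
To make the page/block/bit layout concrete, here is a rough sketch of the
index arithmetic for the 4k page and 512-byte block example, with one byte per
bit. The names are hypothetical, and it ignores any offset for the bitmap
super block.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES  4096u	/* example page size from the text */
#define BLOCK_BYTES 512u	/* example logical_block_size from the text */

struct bit_pos {
	size_t page;	/* which page holds the bit */
	size_t block;	/* which block inside that page */
	size_t offset;	/* which byte inside that block */
};

/* Each bit occupies one byte, so the bit index is also a byte offset. */
static struct bit_pos locate_bit(size_t bit)
{
	struct bit_pos pos;

	assert(PAGE_BYTES % BLOCK_BYTES == 0);
	pos.page = bit / PAGE_BYTES;
	pos.block = (bit % PAGE_BYTES) / BLOCK_BYTES;
	pos.offset = bit % BLOCK_BYTES;
	return pos;
}
```

Only the block that contains the changed bit has to be written before the IO
is issued, which is why the dirty state is tracked per block and not per page.
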
> 
> Dirty Bits syncronization:
> 
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> IO path and daemon;
> 
> IO path:
>  1) try to grab a reference, if succeed, set expire time after 5s and return;
>  2) if failed to grab a reference, wait for daemon to finish clearing dirty
>  bits;
> 
> Daemon(Daemon will be waken up every daemon_sleep seconds):

                will be woken up
or
                will be awakened

> For each page:
>  1) check if page expired, if not skip this page; for expired page:

                    expired; if not, skip this page. For expired page:

>  2) suspend the page and wait for inflight write IO to be done;
>  3) change dirty page to clean;
>  4) resume the page;
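
The IO path and daemon steps above could be modelled roughly like this. It is
a single-threaded toy with hypothetical names, meant only to show the expiry
check and the suspend/clean/resume order; it is not how llbitmap_page_ctl is
implemented and it does no real waiting or locking.

```c
#include <assert.h>
#include <stdbool.h>

#define EXPIRE_SECS 5	/* "set expire time after 5s" */

struct page_ctl {
	long expire;	/* earliest time the daemon may clean this page */
	bool dirty;
	bool suspended;	/* set while the daemon is cleaning */
};

/* IO path: returns false when the caller would have to wait for the daemon. */
static bool io_path_enter(struct page_ctl *p, long now)
{
	assert(p);
	if (p->suspended)
		return false;
	p->dirty = true;
	p->expire = now + EXPIRE_SECS;
	return true;
}

/* Daemon visit for one page: returns true if the page was cleaned. */
static bool daemon_visit(struct page_ctl *p, long now)
{
	assert(p);
	if (!p->dirty || now < p->expire)
		return false;		/* 1) not expired: skip this page */
	p->suspended = true;		/* 2) suspend; a real daemon waits for inflight IO */
	p->dirty = false;		/* 3) change dirty page to clean */
	p->suspended = false;		/* 4) resume the page */
	return true;
}
```

A write at t=100 keeps the page out of the daemon's reach until t=105, and
every further write pushes that deadline out again.
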
> 
> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
> ---
>  Documentation/admin-guide/md.rst |   20 +
>  drivers/md/Kconfig               |   11 +
>  drivers/md/Makefile              |    1 +
>  drivers/md/md-bitmap.c           |    9 -
>  drivers/md/md-bitmap.h           |   31 +-
>  drivers/md/md-llbitmap.c         | 1600 ++++++++++++++++++++++++++++++
>  drivers/md/md.c                  |    6 +
>  drivers/md/md.h                  |    4 +-
>  8 files changed, 1670 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/md/md-llbitmap.c
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 001363f81850..47d1347ccd00 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -387,6 +387,8 @@ All md devices contain:
>           No bitmap
>       bitmap
>           The default internal bitmap
> +     llbitmap
> +         The lockless internal bitmap
>  
>  If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
>  llbitmap/xxx will be created after md device KOBJ_CHANGE event.
> @@ -447,6 +449,24 @@ If bitmap_type is bitmap, then the md device will also contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>  
> +If bitmap_type is llbitmap, then the md device will also contain:
> +
> +  llbitmap/bits
> +     This is readonly, show status of bitmap bits, the number of each

                read-only; it shows the status of bitmap bits,

> +     value.
> +
> +  llbitmap/metadata
> +     This is readonly, show bitmap metadata, include chunksize, chunkshift,

                read-only; it shows bitmap metadata, including

> +     chunks, offset and daemon_sleep.
> +
> +  llbitmap/daemon_sleep
> +     This is readwrite, time in seconds that daemon function will be

                read-write, time in seconds

> +     triggered to clear dirty bits.
> +
> +  llbitmap/barrier_idle
> +     This is readwrite, time in seconds that page barrier will be idled,

                read-write,

> +     means dirty bits in the page will be cleared.
> +
>  As component devices are added to an md array, they appear in the ``md``
>  directory as new directories named::
>  

> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..88207f31c728
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,1600 @@

> +/*
> + * #### Background
> + *
> + * Redundant data is used to enhance data fault tolerance, and the storage
> + * method for redundant data vary depending on the RAID levels. And it's

      methods

> + * important to maintain the consistency of redundant data.
> + *
> + * Bitmap is used to record which data blocks have been synchronized and which
> + * ones need to be resynchronized or recovered. Each bit in the bitmap
> + * represents a segment of data in the array. When a bit is set, it indicates
> + * that the multiple redundant copies of that data segment may not be
> + * consistent. Data synchronization can be performed based on the bitmap after
> + * power failure or readding a disk. If there is no bitmap, a full disk

                       re-adding

> + * synchronization is required.
> + *
> + * #### Key Features
> + *
> + *  - IO fastpath is lockless, if user issues lots of write IO to the same

                        lockless. If the user

> + *  bitmap bit in a short time, only the first write have additional overhead

                                                        has

> + *  to update bitmap bit, no additional overhead for the following writes;

                        bit; there is no additional overhead for the following writes;

> + *  - support only resync or recover written data, means in the case creating
> + *  new array or replacing with a new disk, there is no need to do a full disk
> + *  resync/recovery;
> + *
> + * #### Key Concept
> + *
> + * ##### State Machine
> + *
> + * Each bit is one byte, contain 6 difference state, see llbitmap_state. And

                      byte, containing           states,

> + * there are total 8 differenct actions, see llbitmap_action, can change state:

                        different                              , that can change state.

> + *
> + * llbitmap state machine: transitions between states

                                                  states::

Use "::" to maintain the table spacing.

> + *
> + * |           | Startwrite | Startsync | Endsync | Abortsync|
> + * | --------- | ---------- | --------- | ------- | -------  |
> + * | Unwritten | Dirty      | x         | x       | x        |
> + * | Clean     | Dirty      | x         | x       | x        |
> + * | Dirty     | x          | x         | x       | x        |
> + * | NeedSync  | x          | Syncing   | x       | x        |
> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> + *
> + * |           | Reload   | Daemon | Discard   | Stale     |
> + * | --------- | -------- | ------ | --------- | --------- |
> + * | Unwritten | x        | x      | x         | x         |
> + * | Clean     | x        | x      | Unwritten | NeedSync  |
> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> + * | NeedSync  | x        | x      | Unwritten | x         |
> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> + *
> + * Typical scenarios:
> + *
> + * 1) Create new array
> + * All bits will be set to Unwritten by default, if --assume-clean is set,

                                           default. If

> + * all bits will be set to Clean instead.
> + *
> + * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> + * rely on xor data
> + *
> + * 2.1) write new data to raid1/raid10:
> + * Unwritten --StartWrite--> Dirty
> + *
> + * 2.2) write new data to raid456:
> + * Unwritten --StartWrite--> NeedSync
> + *
> + * Because the initial recover for raid456 is skipped, the xor data is not build

                                                                              built

> + * yet, the bit must set to NeedSync first and after lazy initial recover is

                   must be set to

> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);

                             finally be set to

> + *
> + * 2.3) cover write
> + * Clean --StartWrite--> Dirty
> + *
> + * 3) daemon, if the array is not degraded:
> + * Dirty --Daemon--> Clean
> + *
> + * For degraded array, the Dirty bit will never be cleared, prevent full disk

                                                               preventing

> + * recovery while readding a removed disk.

                     re-adding

> + *
> + * 4) discard
> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> + *
> + * 5) resync and recover
> + *
> + * 5.1) common process
> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> + *
> + * 5.2) resync after power failure
> + * Dirty --Reload--> NeedSync
> + *
> + * 5.3) recover while replacing with a new disk
> + * By default, the old bitmap framework will recover all data, and llbitmap
> + * implement this by a new helper, see llbitmap_skip_sync_blocks:

      implements

> + *
> + * skip recover for bits other than dirty or clean;
> + *
> + * 5.4) lazy initial recover for raid5:
> + * By default, the old bitmap framework will only allow new recover when there
> + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add

                     disk). A new                                           added

> + * to perform raid456 lazy recover for set bits(from 2.2).
> + *
> + * ##### Bitmap IO
> + *
> + * ##### Chunksize
> + *
> + * The default bitmap size is 128k, incluing 1k bitmap super block, and
> + * the default size of segment of data in the array each bit(chunksize) is 64k,
> + * and chunksize will adjust to twice the old size each time if the total number
> + * bits is not less than 127k.(see llbitmap_init)
> + *
> + * ##### READ
> + *
> + * While creating bitmap, all pages will be allocated and read for llbitmap,
> + * there won't be read afterwards
> + *
> + * ##### WRITE
> + *
> + * WRITE IO is divided into logical_block_size of the array, the dirty state
> + * of each block is tracked independently, for example:
> + *
> + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

                                                                 and contains 512 bits;

> + *
> + * | page0 | page1 | ... | page 31 |
> + * |       |
> + * |        \-----------------------\
> + * |                                |
> + * | block0 | block1 | ... | block 8|
> + * |        |
> + * |         \-----------------\
> + * |                            |
> + * | bit0 | bit1 | ... | bit511 |
> + *
> + * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> + * subpage will be marked dirty, such block must write first before the IO is
> + * issued. This behaviour will affect IO performance, to reduce the impact, if
> + * multiple bits are changed in the same block in a short time, all bits in this
> + * block will be changed to Dirty/NeedSync, so that there won't be any overhead
> + * until daemon clears dirty bits.
> + *
> + * ##### Dirty Bits syncronization

                       synchronization

[snip]

> +
> +static struct md_sysfs_entry llbitmap_bits =
> +__ATTR_RO(bits);

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_metadata =
> +__ATTR_RO(metadata);

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_daemon_sleep =
> +__ATTR_RW(daemon_sleep);
> +

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_barrier_idle =
> +__ATTR_RW(barrier_idle);
> +

One line, or if you feel that it must be 2 lines, the second line
should be indented.


-- 
~Randy