Message-ID: <bc215f88-a652-090f-ae99-8aaba6c591c4@huaweicloud.com>
Date: Mon, 21 Jul 2025 15:12:49 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Hannes Reinecke <hare@...e.de>, Yu Kuai <yukuai1@...weicloud.com>,
 corbet@....net, agk@...hat.com, snitzer@...nel.org, mpatocka@...hat.com,
 song@...nel.org
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 dm-devel@...ts.linux.dev, linux-raid@...r.kernel.org, yi.zhang@...wei.com,
 yangerkun@...wei.com, johnny.chenyi@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v3 11/11] md/md-llbitmap: introduce new lockless bitmap

Hi,

On 2025/07/21 14:20, Hannes Reinecke wrote:
> On 7/18/25 11:23, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@...wei.com>
>>

>> +
>> +#define BITMAP_DATA_OFFSET 1024
>> +
>> +/* 64k is the max IO size of sync IO for raid1/raid10 */
>> +#define MIN_CHUNK_SIZE (64 * 2)
>> +
> 
> Hmm? Which one is it?
> Comment says 'max IO size', but it's called 'MIN_CHUNK_SIZE'...

The max IO size here refers to the internal recovery IO size for
raid1/10, and we handle at most one llbitmap bit (one chunksize) at a
time, so the chunksize should be at least 64k; otherwise the recovery
IO size would be reduced.
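
Something like the below could make that constraint explicit at create
time (just a sketch; the helper name is made up and not part of this
patch):

static int llbitmap_check_chunksize(unsigned long chunksize)
{
	/*
	 * chunksize is in sectors, so MIN_CHUNK_SIZE (64 * 2 sectors
	 * == 64k) matches the max sync IO size of raid1/raid10;
	 * anything smaller would shrink the recovery IO size.
	 */
	if (chunksize < MIN_CHUNK_SIZE || !is_power_of_2(chunksize))
		return -EINVAL;

	return 0;
}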

>> +/*
>> + * Dirtied bits that have not been accessed for more than 5s will be
>> + * cleared by daemon.
>> + */
>> +#define BARRIER_IDLE 5
>> +
> 
> Should this be changeable, too?

Yes, ideally it should be. Perhaps a new sysfs API?
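
Something along these lines, perhaps (a rough sketch only; the
barrier_idle field and the attribute wiring are hypothetical, today
BARRIER_IDLE is a compile-time constant):

static ssize_t barrier_idle_show(struct mddev *mddev, char *page)
{
	struct llbitmap *llbitmap = mddev->bitmap;

	return sysfs_emit(page, "%u\n", llbitmap->barrier_idle);
}

static ssize_t barrier_idle_store(struct mddev *mddev, const char *buf,
				  size_t len)
{
	struct llbitmap *llbitmap = mddev->bitmap;
	unsigned int val;

	if (kstrtouint(buf, 10, &val) || !val)
		return -EINVAL;

	/* hypothetical per-bitmap field replacing the constant */
	llbitmap->barrier_idle = val;
	return len;
}

static struct md_sysfs_entry llbitmap_barrier_idle =
__ATTR(barrier_idle, 0644, barrier_idle_show, barrier_idle_store);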

> 

>> +    if (!test_bit(LLPageDirty, &pctl->flags))
>> +        set_bit(LLPageDirty, &pctl->flags);
>> +
>> +    /*
>> +     * The subpage usually contains a total of 512 bits. If any single
>> +     * bit within the subpage is marked as dirty, the entire sector will
>> +     * be written. To avoid impacting write performance, when multiple
>> +     * bits within the same sector are modified within a short time
>> +     * frame, all bits in the sector will be collectively marked as
>> +     * dirty at once.
>> +     */
> 
> How short is the 'short timeframe'?
> Is this the BARRIER_IDLE setting?
> Please clarify.

Yes, if the page is not accessed for BARRIER_IDLE seconds.

>> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
>> +{
>> +    struct mddev *mddev = llbitmap->mddev;
>> +    struct page *page = NULL;
>> +    struct md_rdev *rdev;
>> +
>> +    if (llbitmap->pctl && llbitmap->pctl[idx])
>> +        page = llbitmap->pctl[idx]->page;
>> +    if (page)
>> +        return page;
>> +
>> +    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +    if (!page)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    rdev_for_each(rdev, mddev) {
>> +        sector_t sector;
>> +
>> +        if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
>> +            continue;
>> +
>> +        sector = mddev->bitmap_info.offset +
>> +             (idx << PAGE_SECTORS_SHIFT);
>> +
>> +        if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
>> +                 true))
>> +            return page;
>> +
>> +        md_error(mddev, rdev);
>> +    }
>> +
>> +    __free_page(page);
>> +    return ERR_PTR(-EIO);
>> +}
>> +
> 
> Have you considered moving to folios here?
> 

Of course. However, because the md high-level helpers still use
struct page, I'm sticking with pages for now, which is simpler, and
planning to move all md code to folios later.
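
For what it's worth, under that assumption the read path above would
presumably end up close to the following (a sketch only; there is no
sync_folio_io() today, it stands in for a future folio counterpart of
sync_page_io()):

	struct folio *folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, 0);

	if (!folio)
		return ERR_PTR(-ENOMEM);

	rdev_for_each(rdev, mddev) {
		...
		/* hypothetical folio counterpart of sync_page_io() */
		if (sync_folio_io(rdev, sector, folio_size(folio), folio,
				  REQ_OP_READ, true))
			return folio;
		...
	}

	folio_put(folio);
	return ERR_PTR(-EIO);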


>> +static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +    unsigned long chunks;
>> +
>> +    if (chunksize == 0)
>> +        chunksize = llbitmap->chunksize;
>> +
>> +    /* If there is enough space, leave the chunksize unchanged. */
>> +    chunks = DIV_ROUND_UP(blocks, chunksize);
>> +    while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
>> +        chunksize = chunksize << 1;
>> +        chunks = DIV_ROUND_UP(blocks, chunksize);
>> +    }
>> +
>> +    llbitmap->chunkshift = ffz(~chunksize);
>> +    llbitmap->chunksize = chunksize;
>> +    llbitmap->chunks = chunks;
>> +
>> +    return 0;
>> +}
>> +
> 
> Hmm. I do get confused with the chunksize here.
> Is this the granularity of the bits in the bitmap
> (ie how many data bytes are covered by one bit)?
> Or is it the chunksize of the bitmap itself?

Yes, llbitmap->chunksize is the amount of data covered by one llbitmap
bit.
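
A worked example, with made-up numbers:

	unsigned long chunksize = 128;		/* sectors, i.e. 64k of data */
	int chunkshift = ffz(~chunksize);	/* == 7 */
	sector_t offset = 1000000;

	/* bit 7812, covering sectors 999936..1000063 */
	unsigned long bit = offset >> chunkshift;
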
> 
> In either case, if it's a 'chunksize' in the sense
> of the block layer (namely a boundary which I/O
> must not cross), shouldn't you set the request
> queue limits accordingly?

I think we don't need to; it's fine if IO crosses the
llbitmap->chunksize boundary, multiple bits will simply be recorded.
In fact, we support plugging and can record lots of bits at a time;
the only restriction is that dirty bits have to be written before
issuing the IO.
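
In other words, roughly this ordering on unplug (a sketch of the
intent only; llbitmap_write_dirty_pages() is a made-up name):

static void llbitmap_flush_then_issue(struct llbitmap *llbitmap,
				      struct bio_list *data_bios)
{
	struct bio *bio;

	/* dirty bits must be on disk before the data IO they cover */
	llbitmap_write_dirty_pages(llbitmap);

	while ((bio = bio_list_pop(data_bios)))
		submit_bio_noacct(bio);
}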

> 
>> +static int llbitmap_load(struct mddev *mddev)
>> +{
>> +    enum llbitmap_action action = BitmapActionReload;
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +
>> +    if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
>> +        action = BitmapActionStale;
>> +
>> +    llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
>> +    return 0;
>> +}
>> +
>> +static void llbitmap_destroy(struct mddev *mddev)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +
>> +    if (!llbitmap)
>> +        return;
>> +
>> +    mutex_lock(&mddev->bitmap_info.mutex);
>> +
>> +    timer_delete_sync(&llbitmap->pending_timer);
>> +    flush_workqueue(md_llbitmap_io_wq);
>> +    flush_workqueue(md_llbitmap_unplug_wq);
>> +
>> +    mddev->bitmap = NULL;
>> +    llbitmap_free_pages(llbitmap);
>> +    kfree(llbitmap);
>> +    mutex_unlock(&mddev->bitmap_info.mutex);
>> +}
>> +
>> +static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
>> +                 unsigned long sectors)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +    unsigned long start = offset >> llbitmap->chunkshift;
>> +    unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
>> +    int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +    int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +
>> +    llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
>> +
>> +    while (page_start <= page_end) {
>> +        llbitmap_raise_barrier(llbitmap, page_start);
>> +        page_start++;
>> +    }
>> +}
>> +
>> +static void llbitmap_end_write(struct mddev *mddev, sector_t offset,
>> +                   unsigned long sectors)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +    unsigned long start = offset >> llbitmap->chunkshift;
>> +    unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
>> +    int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +    int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +
>> +    while (page_start <= page_end) {
>> +        llbitmap_release_barrier(llbitmap, page_start);
>> +        page_start++;
>> +    }
>> +}
>> +
>> +static void llbitmap_start_discard(struct mddev *mddev, sector_t offset,
>> +                   unsigned long sectors)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +    unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
>> +    unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
>> +    int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +    int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +
>> +    llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard);
>> +
>> +    while (page_start <= page_end) {
>> +        llbitmap_raise_barrier(llbitmap, page_start);
>> +        page_start++;
>> +    }
>> +}
>> +
>> +static void llbitmap_end_discard(struct mddev *mddev, sector_t offset,
>> +                 unsigned long sectors)
>> +{
>> +    struct llbitmap *llbitmap = mddev->bitmap;
>> +    unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
>> +    unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
>> +    int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +    int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
>> +
>> +    while (page_start <= page_end) {
>> +        llbitmap_release_barrier(llbitmap, page_start);
>> +        page_start++;
>> +    }
>> +}
>> +
>> +static void llbitmap_unplug_fn(struct work_struct *work)
>> +{
>> +    struct llbitmap_unplug_work *unplug_work =
>> +        container_of(work, struct llbitmap_unplug_work, work);
>> +    struct llbitmap *llbitmap = unplug_work->llbitmap;
>> +    struct blk_plug plug;
>> +    int i;
>> +
>> +    blk_start_plug(&plug);
>> +
>> +    for (i = 0; i < llbitmap->nr_pages; i++) {
>> +        if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
>> +            !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
> 
> Confused. Is this some kind of micro-optimisation?
> Why not simply 'test_and_clear_bit()'?

Yes, because this is called from the IO hot path, and in most cases
only a few pages will be dirtied by plugged IO.
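
For reference, this is the usual idiom for avoiding a locked RMW on a
hot path, restating the quoted code with commentary:

	/*
	 * test_bit() is a plain load; test_and_clear_bit() is a locked
	 * read-modify-write that takes the cacheline exclusive even
	 * when the bit turns out to be clear. Checking first keeps
	 * clean pages read-only; the atomic retest closes the race
	 * for pages that really are dirty.
	 */
	if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
	    !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
		continue;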

Thanks for the review!
Kuai

> Cheers,
> 
> Hannes

