Date: Tue, 4 Jun 2024 16:01:17 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: Christoph Hellwig <hch@....de>
Cc: Ofir Gal <ofir.gal@...umez.com>, davem@...emloft.net,
 linux-block@...r.kernel.org, linux-nvme@...ts.infradead.org,
 netdev@...r.kernel.org, ceph-devel@...r.kernel.org, dhowells@...hat.com,
 edumazet@...gle.com, pabeni@...hat.com, kbusch@...nel.org, axboe@...nel.dk,
 philipp.reisner@...bit.com, lars.ellenberg@...bit.com,
 christoph.boehmwalder@...bit.com, idryomov@...il.com, xiubli@...hat.com
Subject: Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()



On 04/06/2024 11:24, Sagi Grimberg wrote:
>
>
> On 04/06/2024 7:27, Christoph Hellwig wrote:
>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>> I still don't understand how a page in the middle of a contiguous
>>>>> range ends up coming from the slab while others don't.
>>>> I haven't investigated the origin of the IO yet. I suspect the first
>>>> 2 pages are the superblocks of the raid (mdp_superblock_1 and
>>>> bitmap_super_s) and the rest of the IO is the bitmap.
>>> Well, if these indeed have different origins and just *happen* to be
>>> a mixture of slab-originated pages and non-slab pages combined
>>> together in a single bvec entry of a bio, I'd suspect that it would
>>> be more beneficial to split the bvec, essentially not allowing
>>> bio_add_page to append the page to the tail bvec depending on a
>>> queue limit (similar to how we handle sg gaps).
>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
>> fairly expensive..
>>
>
> The check needs to happen somewhere apparently, and given that it will
> be gated by a queue flag, only request queues that actually need it
> will suffer, but they will suffer anyways...
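
For context, sendpage_ok() (defined in include/linux/net.h) is the check
in question; it is roughly the following one-liner (a sketch from memory,
not copied from a specific tree), so the per-call cost is a page flag
test plus a refcount read:
--
/*
 * Can the page be spliced to a socket by reference, i.e. is it
 * refcounted and not a slab page?
 */
static inline bool sendpage_ok(struct page *page)
{
        return !PageSlab(page) && page_count(page) >= 1;
}
--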

Something like the untested patch below:
--
diff --git a/block/bio.c b/block/bio.c
index 53f608028c78..e55a4184c0e6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
  #include <linux/highmem.h>
  #include <linux/blk-crypto.h>
  #include <linux/xarray.h>
+#include <linux/net.h>

  #include <trace/events/block.h>
  #include "blk.h"
@@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
                 return false;
         if (len > queue_max_segment_size(q) - bv->bv_len)
                 return false;
+       if (q->limits.splice_pages &&
+           (sendpage_ok(bv->bv_page) ^ sendpage_ok(page)))
+               return false;
         return bvec_try_merge_page(bv, page, len, offset, same_page);
  }

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a7e820840cf7..82e2719acb9c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
         lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
         lim->max_segment_size = UINT_MAX;
         lim->dma_alignment = 3;
+       lim->splice_pages = ctrl->splice_pages;
  }

  static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3f3e26849b61..d9818330e236 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -398,6 +398,7 @@ struct nvme_ctrl {

         enum nvme_ctrl_type cntrltype;
         enum nvme_dctype dctype;
+       bool splice_pages;
  };

  static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 02076b8cb4d8..618b8f20206a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
         if (error)
                 goto out_stop_queue;

+       /*
+        * we want to prevent contig pages with conflicting splice-ability
+        * with respect to the network transmission
+        */
+       ctrl->splice_pages = true;
+
         nvme_unquiesce_admin_queue(ctrl);

         error = nvme_init_ctrl_finish(ctrl, false);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69c4f113db42..ec657ddad2a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,12 @@ struct queue_limits {
          * due to possible offsets.
          */
         unsigned int            dma_alignment;
+
+       /*
+        * Drivers that use MSG_SPLICE_PAGES to send the bvec over the 
network,
+        * will need either bvec entry contig pages spliceable or not.
+        */
+       bool                    splice_pages;
  };

  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
--

What I now see is that we will check PageSlab twice (for the bvec tail
page and for the appended page), and skb_splice_from_iter checks it
again... How many times will we check this :)

Would be great if the network stack could just check it once and fall
back to a page copy...
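
For completeness, the sendpages_ok() helper this series introduces (per
the subject line) presumably iterates sendpage_ok() over a contiguous
page range, along the lines of the sketch below (a reconstruction, not
the exact v2 code):
--
/*
 * Walk the contiguous pages covered by [offset, offset + len) and
 * require sendpage_ok() on each, so the whole range can be sent with
 * MSG_SPLICE_PAGES by reference.
 */
static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
{
        struct page *p = page + (offset >> PAGE_SHIFT);
        size_t count = 0;

        while (count < len) {
                if (!sendpage_ok(p))
                        return false;

                p++;
                count += PAGE_SIZE;
        }

        return true;
}
--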
