Message-ID: <20240212102742.34e1e2c2@jacob-builder>
Date: Mon, 12 Feb 2024 10:27:42 -0800
From: Jacob Pan <jacob.jun.pan@...ux.intel.com>
To: Jens Axboe <axboe@...nel.dk>
Cc: LKML <linux-kernel@...r.kernel.org>, X86 Kernel <x86@...nel.org>, Peter
 Zijlstra <peterz@...radead.org>, iommu@...ts.linux.dev, Thomas Gleixner
 <tglx@...utronix.de>, Lu Baolu <baolu.lu@...ux.intel.com>,
 kvm@...r.kernel.org, Dave Hansen <dave.hansen@...el.com>, Joerg Roedel
 <joro@...tes.org>, "H. Peter Anvin" <hpa@...or.com>, Borislav Petkov
 <bp@...en8.de>, Ingo Molnar <mingo@...hat.com>, Paul Luse
 <paul.e.luse@...el.com>, Dan Williams <dan.j.williams@...el.com>, Raj Ashok
 <ashok.raj@...el.com>, "Tian, Kevin" <kevin.tian@...el.com>,
 maz@...nel.org, seanjc@...gle.com, Robin Murphy <robin.murphy@....com>,
 jacob.jun.pan@...ux.intel.com
Subject: Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI

Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@...nel.dk> wrote:

> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> > 
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@...nel.dk> wrote:
> >   
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts, 
> >>  
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate, which means there is not enough coalescing going on to
> > get the performance benefits.  
> 
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
> 
> IOPS w/polled: 25.7M IOPS
> 
> Stock kernel:
> 
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> 
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> 
> Patched kernel:
> 
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> 
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
> 
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
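
As a rough cross-check of the figures quoted above, the average number of completions reaped per interrupt is simply IOPS divided by the interrupt rate. A minimal Python sketch, using only the numbers from the two runs as inputs (the function name is illustrative, not taken from any tool):

```python
def ios_per_interrupt(iops, ints_per_sec):
    """Average number of completions handled per interrupt."""
    return iops / ints_per_sec

# 4-drive run: stock vs. patched kernel (values quoted in this thread).
stock_4 = ios_per_interrupt(14.37e6, 2.82e6)
patched_4 = ios_per_interrupt(14.90e6, 2.34e6)

# 5-drive run, stock kernel, at roughly 3.7M ints/sec.
stock_5 = ios_per_interrupt(21.41e6, 3.7e6)

print(f"4 drives: {stock_4:.2f} -> {patched_4:.2f} IOs/int")
print(f"5 drives: {stock_5:.2f} IOs/int")
```

The interrupt rates are rounded in the thread, so the results are only good to about two significant digits.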
> 
> > The opportunity of IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before exiting the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps, we need to find more "useful" work to do to
> > maximize the window for coalescing.
> > 
> > I am not familiar with the optane driver; I need to look into how its hardirq
> > handler works. I have only tested NVMe gen5 in terms of storage IO, where I saw
> > a 30-50% ints/sec reduction at an even lower IRQ rate (200k/sec).  
> 
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
> 
Yeah, I also checked the driver code; it seems to just wake up the threaded
handler.
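
To illustrate why a very cheap hardirq handler leaves little room for coalescing, here is a toy Python model of the demux loop. It is only a sketch under stated assumptions: MSIs arrive at given timestamps, each handler invocation costs a fixed `handler_cost`, and any MSI that arrives before the loop goes idle is handled without a new notification. None of the names below come from the patch set.

```python
import random

def coalescing_ratio(arrival_times, handler_cost):
    """Return notifications / MSIs for a toy demux loop.

    arrival_times must be sorted; handler_cost is the (assumed constant)
    time spent in each device handler.
    """
    n = len(arrival_times)
    if n == 0:
        return 0.0
    notifications = 0
    i = 0
    while i < n:
        # The loop is idle, so the next MSI raises a new notification.
        notifications += 1
        now = arrival_times[i]
        # Drain everything that has arrived by the time each handler ends.
        while i < n and arrival_times[i] <= now:
            now += handler_cost
            i += 1
    return notifications / n

# Hypothetical workload: random arrivals with a mean gap of 1 time unit.
rng = random.Random(0)
t, arrivals = 0.0, []
for _ in range(10000):
    t += rng.expovariate(1.0)
    arrivals.append(t)

for cost in (0.0, 0.2, 0.5, 0.9):
    print(cost, round(coalescing_ratio(arrivals, cost), 3))
```

In this model a zero-cost handler gets no coalescing at all (one notification per MSI), and the ratio can only fall as the handler takes longer, which matches the intuition about needing a longer window.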

For the record, here are my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
the per-vector IRQ rate went up rather than down with this patch on my setup,
e.g. BEFORE: 185545/sec/vector
     AFTER:  220128/sec/vector
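
For a quick comparison of the two changes, a small Python sketch of the relative differences, using the rounded per-disk IOPS and the per-vector IRQ rates quoted above (the helper name is illustrative):

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

iops_gain = pct_change(1.6e6, 2.1e6)     # per-disk IOPS
irq_gain = pct_change(185545, 220128)    # IRQs/sec on one vector

print(f"IOPS: {iops_gain:+.1f}%  IRQ rate: {irq_gain:+.1f}%")
```

IOPS grow faster than the per-vector IRQ rate, so, assuming the number of vectors per disk is unchanged, each device interrupt completes more IOs on average with the patch.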

CPU: (highest non-turbo freq, may be different on yours).
echo "Set CPU frequency P1 2.7GHz"                                                                      
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done
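
The same pinning can be sketched in Python for anyone scripting the test. This assumes the standard cpufreq sysfs layout used by the shell loops above; the function name and the `sysfs_root` parameter are made up for illustration (the parameter also lets you dry-run against a scratch directory).

```python
from pathlib import Path

def pin_cpu_frequency(khz, cpus, sysfs_root="/sys/devices/system/cpu"):
    """Write the same frequency (in kHz) as both max and min for each CPU."""
    for cpu in cpus:
        cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
        # Same order as the shell loops: max first, then min.
        (cpufreq / "scaling_max_freq").write_text(f"{khz}\n")
        (cpufreq / "scaling_min_freq").write_text(f"{khz}\n")

# Usage on the test box (needs root), equivalent to the loops above:
# pin_cpu_frequency(2700000, range(128))
```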

PCI:
[root@...-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme                                                   
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express]) 
        Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]                                                                            
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM notsupported                                                                 
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+                                                                     
                LnkSta: Speed 32GT/s (ok), Width x4(ok)                                                                                    
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVME setup:                                            
nvme5n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme6n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme3n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme4n1       SAMSUNG MZWLO1T9HCJR-00A07                    

FIO:
[global]                      
bs=4k                         
direct=1                      
norandommap                   
ioengine=libaio               
randrepeat=0                  
readwrite=randread            
group_reporting               
time_based                    
iodepth=64                    
exitall                       
random_generator=tausworthe64 
runtime=30                    
ramp_time=3                   
numjobs=8                     
group_reporting=1             
                              
#cpus_allowed_policy=shared   
cpus_allowed_policy=split     
[disk_nvme6n1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme5n1]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme4n1]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme3n1]
filename=/dev/nvme3n1
cpus_allowed=24-31

iostat w/o posted MSI patch, v6.8-rc1:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read
nvme3c3n1     1615525.00   6462100.00         0.00         0.00    6462100
nvme4c4n1     1615471.00   6461884.00         0.00         0.00    6461884
nvme5c5n1     1615602.00   6462408.00         0.00         0.00    6462408
nvme6c6n1     1614637.00   6458544.00         0.00         0.00    6458544

irqtop (delta 1 sec.)					
           IRQ           TOTAL          DELTA NAME                                      							
           800         6290026         185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76							
           797         6279554         185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73							
           799         6281627         185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75							
           802         6285742         185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78							
	... ... similar irq rate for all 32 vectors

iostat w/ posted MSI patch:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd						
nvme3c3n1     2184313.00   8737256.00         0.00         0.00    8737256          0          0						
nvme4c4n1     2184241.00   8736972.00         0.00         0.00    8736972          0          0						
nvme5c5n1     2184269.00   8737080.00         0.00         0.00    8737080          0          0						
nvme6c6n1     2184003.00   8736012.00         0.00         0.00    8736012          0          0						
						
irqtop w/ posted MSI patch:
           IRQ           TOTAL           DELTA NAME                                     							
           PMN      5230078416         5502657 Posted MSI notification event            							
           423       138068935          220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80							
           425       138057654          219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82							
           426       138101745          219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83							
	... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted MSI notifications (PMN) / total MSIs = 78%
(PMN at ~5.50M/sec vs. 32 vectors at ~0.22M/sec each)
5.50M / (0.22M * 32) = 0.78125
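
The same ratio can be recomputed from the unrounded irqtop deltas. A minimal Python sketch, assuming all 32 vectors run at the rate of the first one listed (the output above only says they are similar); the function name is illustrative:

```python
def notification_ratio(pmn_per_sec, msi_per_vector_per_sec, num_vectors):
    """Posted MSI notifications per device MSI."""
    total_msi = msi_per_vector_per_sec * num_vectors
    return pmn_per_sec / total_msi

# Deltas from the irqtop output with the posted MSI patch.
ratio = notification_ratio(5502657, 220128, 32)
print(f"{ratio:.3f}")
```

A value below 1.0 means some notifications covered more than one MSI; the closer it is to 1.0, the less coalescing took place.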


Thanks,

Jacob
