aix_partition() reads the physical volume descriptor into a fixed-size
struct pvd and then scans its physical-partition-extent array:
int numpps = be16_to_cpu(pvd->pp_count);
...
for (i = 0; i < numpps; i += 1) {
struct ppe *p = pvd->ppe + i;
...
lp_ix = be16_to_cpu(p->lp_ix);
pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a
fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the
on-disk pp_count. pp_count is an unvalidated __be16 read straight from
the descriptor, so a crafted AIX image with pp_count larger than 1016
drives the loop to read pvd->ppe[i] past the end of the allocation (up
to 65535 entries, ~2 MB out of bounds).
The partition scan runs without mounting anything, when a block device
with a crafted AIX/IBM partition table appears (an attacker-supplied
image attached with losetup -P, or a device auto-scanned by udev), via
msdos_partition() -> aix_partition().
Clamp the scan to the number of entries the ppe[] array can hold.
Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blkg_conf_open_bdev_frozen() calling convention is not compatible
with lock context annotations. Fold both blkg_conf_open_bdev_frozen()
and blkg_conf_close_bdev_frozen() into their only caller. This patch
prepares for enabling lock context analysis.
The type of 'memflags' has been changed from unsigned long into unsigned
int to match the type of current->flags. See also <linux/sched.h>.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe updates from Keith:
"- Per-controller timeouts
- Multipath telemetry
- Namespace format validation
- Various other fixes"
* tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits)
nvme: export controller reconnect event count via sysfs
nvme: export controller reset event count via sysfs
nvme: export I/O failure count when no path is available via sysfs
nvme: export I/O requeue count when no path is usable via sysfs
nvme: export command error counters via sysfs
nvme: export multipath failover count via sysfs
nvme: export command retry count via sysfs
nvme: add diag attribute group under sysfs
nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
nvme-tcp: move nvme_tcp_reclassify_socket()
nvme: validate FDP configuration descriptor sizes
nvmet-auth: validate reply message payload bounds against transfer length
nvme: refresh multipath head zoned limits from path limits
nvme: fix FDP fdpcidx bounds check
nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
nvme-multipath: require exact iopolicy names for module parameter
nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools
...
When an NVMe-oF link goes down, the driver attempts to recover the
connection by repeatedly reconnecting to the remote controller at
configured intervals. A maximum number of reconnect attempts is also
configured, after which recovery stops and the controller is removed
if the connection cannot be re-established.
The driver maintains a counter, nr_reconnects, which is incremented on
each reconnect attempt. However if in case the reconnect is successful
then this counter reset to zero. Moreover, currently, this counter is
only reported via kernel log messages and is not exposed to userspace.
Since dmesg is a circular buffer, this information may be lost over
time.
So introduce a new accumulator which accumulates nr_reconnect attempts
and also expose this accumulator per-fabric ctrl via a new sysfs
attribute reconnect_count, under diag attribute grroup to provide
persistent visibility into the number of reconnect attempts made by the
host. This information can help users diagnose unstable links or
connectivity issues. Furthermore, this sysfs attribute is also writable
so user may reset it to zero, if needed.
The reconnect_count can also be consumed by monitoring tools such as
nvme-top to improve controller-level observability.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The NVMe controller transitions into the RESETTING state during error
recovery, link instability, firmware activation, or when a reset is
explicitly triggered by the user.
Expose a per-ctrl sysfs attribute reset_count, under diag attribute
group to provide visibility into these RESETTING state transitions.
Observing the frequency of reset events can help users identify issues
such as PCIe errors or unstable fabric links. This counter is also
writable thus allowing user to reset its value, if needed.
This counter can also be consumed by monitoring tools such as nvme-top
to improve controller-level observability.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When I/O is submitted to the NVMe namespace head and no available path
can handle the request, the driver fails the I/O immediately. Currently,
such failures are only reported via kernel log messages, which may be
lost over time since dmesg is a circular buffer.
Add a new ns-head sysfs counter io_fail_no_available_path_count, under
diag attribute group to expose the number of I/Os that failed due to the
absence of an available path. This provides persistent visibility into
path-related I/O failures and can help users diagnose the cause of I/O
errors. This counter is also writable and so user may reset its value,
if needed.
This counter can also be consumed by monitoring tools such as nvme-top.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When the NVMe namespace head determines that there is no currently
available path to handle I/O (for example, while a controller is
resetting/connecting or due to a transient link failure), incoming
I/Os are added to the requeue list.
Currently, there is no visibility into how many I/Os have been requeued
in this situation. Add a new ns-head sysfs counter
io_requeue_no_usable_path_count, under diag attribute group to expose
the number of I/Os that were requeued due to the absence of an available
path. This counter is also writable thus allowing user to reset it, if
needed.
This statistic can help users understand I/O slowdowns or stalls caused
by temporary path unavailability, and can be consumed by monitoring
tools such as nvme-top for real-time observability.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When an NVMe command completes with an error status, the driver
logs the error to the kernel log. However, these messages may be
lost or overwritten over time since dmesg is a circular buffer.
Expose per-path and ctrl sysfs attribute command_error_count, under
diag attribute group to provide persistent visibility into error
occurrences. This allows users to observe the total number of commands
that have failed on a given path over time, which can be useful for
diagnosing path health and stability.
This attribute is both readable and writable thus allowing user to reset
these counters. These counters can also be consumed by observability
tools such as nvme-top to provide additional insight into NVMe error
behavior.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When an NVMe command completes with a path-specific error, the NVMe
driver may retry the command on an alternate controller or path if one
is available. These failover events indicate that I/O was redirected
away from the original path.
Currently, the number of times requests are failed over to another
available path is not visible to userspace. Exposing this information
can be useful for diagnosing path health and stability.
Export per-path sysfs attribute "multipath_failover_count" under diag
attribute group. This attribute is both readable and writable and thus
allowing user to reset the counter. This counter can be consumed by
monitoring tools such as nvme-top to help identify paths that
consistently trigger failovers under load.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When Advanced Command Retry Enable (ACRE) is configured, a controller
may interrupt command execution and return a completion status
indicating command interrupted with the DNR bit cleared. In this case,
the driver retries the command based on the Command Retry Delay (CRD)
value provided in the completion status.
Currently, these command retries are handled entirely within the NVMe
driver and are not visible to userspace. As a result, there is no
observability into retry behavior, which can be a useful diagnostic
signal.
Expose a per-namespace sysfs attribute command_retries_count, under
diag attribute group to provide visibility into retry activity. This
information can help identify controller-side congestion under load
and enables comparison across paths in multipath setups (for example,
detecting cases where one path experiences significantly more retries
than another under identical workloads).
This exported metric is intended for diagnostics and monitoring tools
such as nvme-top, and does not change command retry behavior. A new
sysfs attribute named "command_retries_count" is added for this purpose.
This attribute is both readable as well as writable. So user could
reset this counter if needed.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Add a new diag attribute group under:
/sys/class/nvme/<ctrl>/
/sys/block/<nvme-path-dev>/
/sys/block/<ns-head-dev>/
This new sysfs attribute group will be used to organize NVMe diagnostic
and telemetry-related counters under it.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When NVMe-TCP controller setup and teardown are repeated with lockdep
enabled, lockdep reports false positives WARN for the following locks:
1) &q->elevator_lock : IO scheduler change context
2) &q->q_usage_counter(io) : SCSI disk probe context
3) fs_reclaim : CPU hotplug bring-up context
4) cpu_hotplug_lock : socket establishment context
5) sk_lock-AF_INET-NVME : MQ sched dispatch context for the socket
6) set->srcu : NVMe controller delete context
The lockdep WARN was observed by running blktests test case nvme/005 for
tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for
the details of the WARN.
This is a false positive because lockdep confuses lock 4) (socket
establishment) with lock 5) (socket in use) for different socket
instances. The locks belong to different sockets, but lockdep treats
them as the same due to shared static lockdep keys.
Fix this by using dynamically allocated lockdep keys per socket instance
instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add
nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue
and pass them to sock_lock_init_class_and_name() for proper lockdep
tracking. Change the argument of nvme_tcp_reclassify_socket() from
'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket
and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue()
and nvme_tcp_free_queue() to register and unregister the dynamic keys.
Additionally, move nvme_tcp_reclassify_socket() inside these guards since
it's only needed when lockdep is enabled.
Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Move nvme_tcp_reclassify_socket() in tcp.c after the struct
nvme_tcp_queue definition. This is preparation for adding a reference
to struct nvme_tcp_queue in the function, which would otherwise cause a
compile failure due to the struct being defined after the function.
Move the entire CONFIG_DEBUG_LOCK_ALLOC block along with the function
to maintain the code organization.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Validate descriptor sizes while walking the FDP configurations log so
dsze == 0 or a descriptor past the log end cannot cause unbounded
iteration or reads past the buffer.
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_auth_reply() accesses the variable-length rval[] array using
attacker-controlled hl (hash length) and dhvlen (DH value length) fields
without verifying they fit within the allocated buffer of tl bytes.
A malicious NVMe-oF initiator can craft a DHCHAP_REPLY message with a
small transfer length but large hl/dhvlen values, causing out-of-bounds
heap reads when the target processes the DH public key (rval + 2*hl) or
performs the host response memcmp.
With DH authentication configured, the OOB pointer is passed directly to
sg_init_one() and read by crypto_kpp_compute_shared_secret(), reaching
up to 526 bytes past the buffer. This is exploitable pre-authentication.
Add bounds validation ensuring sizeof(*data) + 2*hl + dhvlen <= tl before
any access to the variable-length fields.
Discovered by Atuin - Automated Vulnerability Discovery Engine.
Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Tianchu Chen <flynnnchen@tencent.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
but the old free path destroys them from bfq_pd_free(), which is tied
to blkg policy-data teardown.
That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
while bfq_queue entities refer to it, so bfq_pd_free() can drop the
policy-data reference while other bfq_group references still exist. The
following blkcg change also defers policy-data release through RCU and
leaves BFQ to run the final bfqg_put() from an RCU callback. For that
conversion, stats teardown must belong to the last bfq_group put, not to
policy-data teardown.
Move stats teardown to bfqg_put() so the embedded counters are destroyed
exactly when the last bfq_group reference is released, before kfree(bfqg).
Without this preparatory change, the RCU-delayed policy-data free
conversion reproduced the following KASAN report:
BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535
CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT ea13f83d4b74a12510d20db4a7d9a0fe8275f05c
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x54/0x70
print_address_description+0x77/0x200
? percpu_counter_destroy_many+0xf1/0x2e0
print_report+0x64/0x70
kasan_report+0x118/0x150
? percpu_counter_destroy_many+0xf1/0x2e0
percpu_counter_destroy_many+0xf1/0x2e0
__mmdrop+0x1d8/0x350
finish_task_switch+0x3f5/0x570
__schedule+0xe8e/0x18a0
schedule+0xfe/0x1c0
schedule_timeout+0x7f/0x1d0
__wait_for_common+0x26c/0x3f0
wait_for_completion_state+0x21/0x40
call_usermodehelper_exec+0x271/0x2c0
__request_module+0x296/0x410
elv_iosched_store+0x1bc/0x2c0
queue_attr_store+0x152/0x1c0
kernfs_fop_write_iter+0x1d7/0x280
vfs_write+0x580/0x630
ksys_write+0xec/0x190
do_syscall_64+0x156/0x490
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 535:
kasan_save_track+0x3e/0x80
__kasan_kmalloc+0x72/0x90
bfq_pd_alloc+0x60/0x100 [bfq]
blkg_create+0x3bb/0xbe0
blkg_lookup_create+0x3a2/0x460
blkg_conf_start+0x24a/0x2d0
bfq_io_set_weight+0x17f/0x430 [bfq]
cgroup_file_write+0x1c5/0x4b0
kernfs_fop_write_iter+0x1d7/0x280
vfs_write+0x580/0x630
ksys_write+0xec/0x190
do_syscall_64+0x156/0x490
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 0:
kasan_save_track+0x3e/0x80
kasan_save_free_info+0x46/0x50
__kasan_slab_free+0x3a/0x60
kfree+0x14e/0x4f0
rcu_core+0x6f3/0xcd0
handle_softirqs+0x1a0/0x550
__irq_exit_rcu+0x8c/0x150
irq_exit_rcu+0xe/0x20
sysvec_apic_timer_interrupt+0x6e/0x80
asm_sysvec_apic_timer_interrupt+0x1a/0x20
Last potentially related work creation:
kasan_save_stack+0x3e/0x60
kasan_record_aux_stack+0x99/0xb0
call_rcu+0x55/0x5c0
blkg_free_workfn+0x130/0x220
process_scheduled_works+0x655/0xb60
worker_thread+0x446/0x600
kthread+0x1f4/0x230
ret_from_fork+0x259/0x420
ret_from_fork_asm+0x1a/0x30
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_stack_bdev() updates the multipath head limits from the
path queue, but it does not propagate max_open_zones or
max_active_zones. As a result, a zoned multipath namespace head can
keep stale 0/0 values even after a ready path reports finite zoned
resource limits.
When refreshing the head limits in nvme_update_ns_info(), stack the
zoned resource limits directly after stacking the path queue limits.
Use min_not_zero() so the block layer's 0 value keeps its "no limit"
meaning while finite limits are combined conservatively.
This avoids advertising "no limit" on the multipath head while keeping
the zoned-limit handling local to the NVMe multipath update path.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=,
incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1).
Fixes: 30b5f20bb2 ("nvme: register fdp parameters with the block layer")
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Since commit 21c05ca88a ("workqueue: Add warnings and ensure
one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly
set WQ_PERCPU or WQ_UNBOUND when creating workqueue.
nvme_tcp_init_module() sets WQ_UNBOUND when the module param
wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering
the warning below:
workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU.
WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1
Let's set WQ_PERCPU if wq_unbound is false.
Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_execute_disc_get_log_page() validates only the dword alignment
of the host-supplied Log Page Offset (lpo). The 64-bit offset is then
added to a small kzalloc'd buffer that holds the discovery log page
and the result is passed straight to nvmet_copy_to_sgl(), which
memcpy()s data_len bytes out to the host with no source-side bound
check:
u64 offset = nvmet_get_log_page_offset(req->cmd); /* 64-bit host */
size_t data_len = nvmet_get_log_page_len(req->cmd); /* 32-bit host */
...
if (offset & 0x3) { ... } /* only check */
...
alloc_len = sizeof(*hdr) + entry_size * discovery_log_entries(req);
buffer = kzalloc(alloc_len, GFP_KERNEL);
...
status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len);
The Discovery controller is unauthenticated -- nvmet_host_allowed()
returns true unconditionally for the discovery subsystem -- so the call
is reachable pre-authentication by any TCP/RDMA/FC peer that can reach
the nvmet target. With a discovery log page of ~1 KiB, an attacker
requesting up to 4 KiB starting at offset == alloc_len reads the next
slab page out and gets its content returned over the fabric (an
empirical run on a default nvmet-tcp loopback target leaked 81
canonical kernel pointers in one Get Log Page response). Pointing the
offset at unmapped kernel memory faults the in-kernel memcpy and
crashes (or panics, on panic_on_oops=1) the target host instead.
The attacker-controlled source-side offset pattern
"nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique
to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every
other Get Log Page handler in admin-cmd.c either ignores lpo (and
silently starts every response at offset 0) or tracks a local
destination offset with a fixed source pointer.
Validate the host-supplied offset against the log page size, cap the
copy length to what is actually available, and zero-fill any remainder
of the host transfer buffer. The zero-fill matches the existing
short-response pattern in nvmet_execute_get_log_changed_ns()
(admin-cmd.c) and prevents leaking transport SGL contents when the
host asks for more bytes than the log page contains.
Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a
per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped bio
is then resubmitted through submit_bio_noacct() which calls
bio_check_eod() because BIO_REMAPPED is not set.
This races with nvme_ns_remove() which zeroes the per-path capacity
before synchronize_srcu():
CPU 0 (IO submission)
---------------------
srcu_read_lock()
nvme_find_path() -> ns
[NVME_NS_READY is set]
CPU 1 (namespace removal)
-------------------------
clear_bit(NVME_NS_READY)
set_capacity(ns->disk, 0)
synchronize_srcu() <- blocks
CPU 0 (IO submission)
---------------------
bio_set_dev(bio, ns->disk->part0)
[clears BIO_REMAPPED]
submit_bio_noacct(bio)
-> bio_check_eod() sees capacity=0
-> bio fails with IO error
The SRCU read lock prevents synchronize_srcu() from completing, but does
not prevent set_capacity(0) from executing. The bio fails the EOD check
before it reaches the NVMe driver, so nvme_failover_req() never gets a
chance to redirect it to another path of multipath. IO errors are
reported to the application despite another path being available.
On older kernels (before commit 0b64682e78 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().
Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.
NVMe per-path namespace devices are always whole disks (bd_partno=0), so
the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op.
The flag does not persist across failover and cannot go stale if the
namespace geometry changes between attempts: nvme_failover_req() calls
bio_set_dev() to redirect the bio back to the multipath head, which
clears BIO_REMAPPED. When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.
Same approach as commit 3a905c37c3 ("block: skip bio_check_eod for
partition-remapped bios").
Fixes: a7c7f7b2b6 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The iopolicy module parameter uses strncmp prefix matching, so values
like "numax" are accepted as "numa". The per-subsystem sysfs attribute
already requires an exact match via sysfs_streq(). Parse both through
a shared helper so invalid values are rejected consistently.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvme_mpath_revalidate_paths(), we are passed a NS pointer and use that
to lookup the NS head and then use that same NS pointer as an iter variable.
It makes more sense pass the NS head and use a local variable for the NS
iter.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Pull MD updates and fixes from Yu Kuai:
"Bug Fixes:
- Only requeue dm-raid bios when dm is suspending. (Benjamin Marzinski)
- Reset raid10 read_slot when reusing r10bio for discard. (Chen Cheng)
- Fix raid1/raid10 deadlock in read error recovery path.
(Abd-Alrhman Masalkhi)
- Fix raid1/raid10 error-path detection with md_cloned_bio().
(Abd-Alrhman Masalkhi)
- Fix raid1/raid10 bio accounting for split md cloned bios.
(Abd-Alrhman Masalkhi)
- Fix raid1 nr_pending leak in REQ_ATOMIC bad-block path.
(Abd-Alrhman Masalkhi)
Improvements:
- Skip redundant raid_disks updates when the value is unchanged.
(Abd-Alrhman Masalkhi)
Cleanups:
- Update MAINTAINERS email addresses. (Yu Kuai, Li Nan)
- Clean up raid1 read error handling. (Christoph Hellwig)
- Move the exceed_read_errors condition out of fix_read_error().
(Christoph Hellwig)
- Use str_plural() in raid0 dump_zones(). (Thorsten Blum)"
* tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md/raid0: use str_plural helper in dump_zones
raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path
md/raid1: move the exceed_read_errors condition out of fix_read_error
md/raid1: cleanup handle_read_error
md/raid1,raid10: fix bio accounting for split md cloned bios
md/raid1,raid10: fix error-path detection with md_cloned_bio()
md/raid1,raid10: fix deadlock in read error recovery path
md/raid10: reset read_slot when reusing r10bio for discard
md: skip redundant raid_disks update when value is unchanged
dm-raid: only requeue bios when dm is suspending
MAINTAINERS: Update Li Nan's E-mail address
MAINTAINERS: update Yu Kuai's email address
snap_count is u32 but the comparison is against a SIZE_MAX-derived value
(~2^61 on 64-bit), which clang flags as always false with
-Wtautological-constant-out-of-range-compare.
The proper check here should be that snap_count does not go over
RBD_MAX_SNAP_COUNT.
Assisted-by: Opencode:Big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Alex Elder <elder@riscstar.com>
Link: https://patch.msgid.link/20260530011255.52916-1-rosenp@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
GenDiskBuilder::build() still has fallible work after
__blk_mq_alloc_disk(), but its error path only recovers the
foreign queue data. That leaks the temporary gendisk and
request_queue until later teardown. If the caller moved the last
Arc<TagSet<T>> into build(), the leaked queue can retain blk-mq
state after the tag set is dropped.
Fix the pre-registration failure path by dropping the temporary
gendisk reference with put_disk() before recovering queue_data,
so disk_release() can tear down the owned queue.
Also pair GenDisk::drop() with put_disk() after del_gendisk().
Once a Rust GenDisk has been added with device_add_disk(),
del_gendisk() only unregisters it; the final gendisk reference
still has to be dropped to complete the release path.
Fixes: 3253aba340 ("rust: block: introduce `kernel::block::mq` module")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Link: https://patch.msgid.link/b70aff9a920cc42110fe5cf454c3099561863519.1780063368.git.royenheart@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In raid1_write_request(), each per-mirror loop iteration begins by
incrementing rdev->nr_pending. If a REQ_ATOMIC write encounters a
badblock within the requested range, the code jumps to err_handle
without dropping the reference taken for the current mirror.
err_handle's cleanup loop will only decrements for k < i and
r1_bio->bios[k] is non-NULL. The current slot is therefore skipped,
leaving its nr_pending reference leaked permanently. The reference
prevents the rdev from ever being removed, since raid1_remove_conf()
refuses to remove an rdev with nr_pending > 0.
Fix this by calling rdev_dec_pending() before jumping to err_handle.
Fixes: f2a38abf5f ("md/raid1: Atomic write support")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://patch.msgid.link/20260530151411.4119-1-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Use md_cloned_bio() to control bio accounting instead of relying
on r1bio_existed in raid1 or the io_accounting flag in raid10.
The previous logic does not reliably reflect whether a bio is an
md cloned bio. When a failed bio is split and resubmitted via
bio_submit_split_bioset() on the error path, this can lead to either
double accounting for md cloned bios, or missing accounting for bios
returned from bio_submit_split_bioset()
Fix this by using md_cloned_bio() to detect md cloned bios and
skip accounting accordingly.
Fixes: bb2a9acefa ("md/raid1: switch to use md_account_bio() for io accounting")
Fixes: 8204552383 ("md/raid10: switch to use md_account_bio() for io accounting")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Link: https://patch.msgid.link/20260501114652.590037-4-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Detect the error path using md_cloned_bio() instead of relying
on r1_bio in raid1 or r10_bio->read_slot in raid10, which may be
NULL or -1 after splitting and resubmitting a failed bio.
As a result, the error path may not be recognized and memory
allocations can incorrectly use GFP_NOIO instead of
(GFP_NOIO | __GFP_HIGH), which can lead to a deadlock under
memory pressure.
Fixes: 689389a06c ("md/raid1: simplify handle_read_error().")
Fixes: 545250f248 ("md/raid10: simplify handle_read_error()")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Link: https://patch.msgid.link/20260501114652.590037-3-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
raid1d and raid10d may resubmit a split md cloned bio while handling
a read error. In this case, resubmitting the bio can lead to a deadlock
if the array is suspended before md_handle_request() acquires an
active_io reference via percpu_ref_tryget_live().
Since the cloned bio already holds an active_io reference,
trying to acquire another reference via percpu_ref_tryget_live()
can lead to a deadlock while the array is suspended.
Fix this by using percpu_ref_get() for md cloned bios.
Fixes: bb2a9acefa ("md/raid1: switch to use md_account_bio() for io accounting")
Fixes: 8204552383 ("md/raid10: switch to use md_account_bio() for io accounting")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/20260501114652.590037-2-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>