summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
7 daysnetfilter: nf_tables: avoid softlockup warnings in nft_chain_validateFlorian Westphal
This reverts commit 314c82841602 ("netfilter: nf_tables: can't schedule in nft_chain_validate"): Since commit a60a5abe19d6 ("netfilter: nf_tables: allow iter callbacks to sleep") the iterator callback is invoked without rcu read lock held, so this cond_resched() is now valid. Signed-off-by: Florian Westphal <fw@strlen.de>
7 daysnetfilter: nf_tables: avoid chain re-validation if possibleFlorian Westphal
Hamza Mahfooz reports cpu soft lock-ups in nft_chain_validate(): watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [iptables-nft-re:37547] [..] RIP: 0010:nft_chain_validate+0xcb/0x110 [nf_tables] [..] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_immediate_validate+0x36/0x50 [nf_tables] nft_chain_validate+0xc9/0x110 [nf_tables] nft_table_validate+0x6b/0xb0 [nf_tables] nf_tables_validate+0x8b/0xa0 [nf_tables] nf_tables_commit+0x1df/0x1eb0 [nf_tables] [..] Currently nf_tables will traverse the entire table (chain graph), starting from the entry points (base chains), exploring all possible paths (chain jumps). But there are cases where we could avoid revalidation. Consider: 1 input -> j2 -> j3 2 input -> j2 -> j3 3 input -> j1 -> j2 -> j3 Then the second rule does not need to revalidate j2, and, by extension j3, because this was already checked during validation of the first rule. We need to validate it only for rule 3. This is needed because chain loop detection also ensures we do not exceed the jump stack: Just because we know that j2 is cycle free, its last jump might now exceed the allowed stack size. We also need to update all reachable chains with the new largest observed call depth. Care has to be taken to revalidate even if the chain depth won't be an issue: chain validation also ensures that expressions are not called from invalid base chains. For example, the masquerade expression can only be called from NAT postrouting base chains. Therefore we also need to keep record of the base chain context (type, hooknum) and revalidate if the chain becomes reachable from a different hook location. Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/ Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Signed-off-by: Florian Westphal <fw@strlen.de>
11 daysnetfilter: nf_tables: remove redundant chain validation on register storePablo Neira Ayuso
This validation predates the introduction of the state machine that determines when to enter slow path validation for error reporting. Currently, table validation is perform when: - new rule contains expressions that need validation. - new set element with jump/goto verdict. Validation on register store skips most checks with no basechains, still this walks the graph searching for loops and ensuring expressions are called from the right hook. Remove this. Fixes: a654de8fdc18 ("netfilter: nf_tables: fix chain dependency validation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
11 daysnetfilter: nf_nat: remove bogus direction checkFlorian Westphal
Jakub reports spurious failures of the 'conntrack_reverse_clash.sh' selftest. A bogus test makes nat core resort to port rewrite even though there is no need for this. When the test is made, nf_nat_used_tuple() would already have caused us to return if no other CPU had added a colliding entry. Moreover, nf_nat_used_tuple() would have ignored the colliding entry if their origin tuples had been the same. All that is left to check is if the colliding entry in the hash table is subject to NAT, and, if its not, if our entry matches in the reverse direction, e.g. hash table has addr1:1234 -> addr2:80, and we want to commit addr2:80 -> addr1:1234. Because we already checked that neither the new nor the committed entry is subject to NAT we only have to check origin vs. reply tuple: for non-nat entries, the reply tuple is always the inverted original. Just in case there are more problems extend the error reporting in the selftest while at it and dump conntrack table/stats on error. Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20251206175135.4a56591b@kernel.org/ Fixes: d8f84a9bc7c4 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash") Signed-off-by: Florian Westphal <fw@strlen.de>
12 daysnetfilter: always set route tuple out ifindexLorenzo Bianconi
Always set nf_flow_route tuple out ifindex even if the indev is not one of the flowtable configured devices since otherwise the outdev lookup in nf_flow_offload_ip_hook() or nf_flow_offload_ipv6_hook() for FLOW_OFFLOAD_XMIT_NEIGH flowtable entries will fail. The above issue occurs in the following configuration since IP6IP6 tunnel does not support flowtable acceleration yet: $ip addr show 5: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:11:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns1 inet6 2001:db8:1::2/64 scope global nodad valid_lft forever preferred_lft forever inet6 fe80::211:22ff:fe33:2255/64 scope link tentative proto kernel_ll valid_lft forever preferred_lft forever 6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:22:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns3 inet6 2001:db8:2::1/64 scope global nodad valid_lft forever preferred_lft forever inet6 fe80::222:22ff:fe33:2255/64 scope link tentative proto kernel_ll valid_lft forever preferred_lft forever 7: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1452 qdisc noqueue state UNKNOWN group default qlen 1000 link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr a85:e732:2c37:: inet6 2002:db8:1::1/64 scope global nodad valid_lft forever preferred_lft forever inet6 fe80::885:e7ff:fe32:2c37/64 scope link proto kernel_ll valid_lft forever preferred_lft forever $ip -6 route show 2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium 2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium 2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium default via 2002:db8:1::2 dev tun0 metric 1024 pref medium $nft list ruleset table inet filter { flowtable ft { hook ingress priority filter devices = { eth0, eth1 } } chain forward { type filter hook forward priority filter; policy accept; meta l4proto { tcp, udp } flow add @ft } } Fixes: b5964aac51e0 ("netfilter: flowtable: consolidate xmit path") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
12 daysipvs: fix ipv4 null-ptr-deref in route error pathSlavin Liu
The IPv4 code path in __ip_vs_get_out_rt() calls dst_link_failure() without ensuring skb->dev is set, leading to a NULL pointer dereference in fib_compute_spec_dst() when ipv4_link_failure() attempts to send ICMP destination unreachable messages. The issue emerged after commit ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") started calling __ip_options_compile() from ipv4_link_failure(). This code path eventually calls fib_compute_spec_dst() which dereferences skb->dev. An attempt was made to fix the NULL skb->dev dereference in commit 0113d9c9d1cc ("ipv4: fix null-deref in ipv4_link_failure"), but it only addressed the immediate dev_net(skb->dev) dereference by using a fallback device. The fix was incomplete because fib_compute_spec_dst() later in the call chain still accesses skb->dev directly, which remains NULL when IPVS calls dst_link_failure(). The crash occurs when: 1. IPVS processes a packet in NAT mode with a misconfigured destination 2. Route lookup fails in __ip_vs_get_out_rt() before establishing a route 3. The error path calls dst_link_failure(skb) with skb->dev == NULL 4. ipv4_link_failure() → ipv4_send_dest_unreach() → __ip_options_compile() → fib_compute_spec_dst() 5. fib_compute_spec_dst() dereferences NULL skb->dev Apply the same fix used for IPv6 in commit 326bf17ea5d4 ("ipvs: fix ipv6 route unreach panic"): set skb->dev from skb_dst(skb)->dev before calling dst_link_failure(). KASAN: null-ptr-deref in range [0x0000000000000328-0x000000000000032f] CPU: 1 PID: 12732 Comm: syz.1.3469 Not tainted 6.6.114 #2 RIP: 0010:__in_dev_get_rcu include/linux/inetdevice.h:233 RIP: 0010:fib_compute_spec_dst+0x17a/0x9f0 net/ipv4/fib_frontend.c:285 Call Trace: <TASK> spec_dst_fill net/ipv4/ip_options.c:232 spec_dst_fill net/ipv4/ip_options.c:229 __ip_options_compile+0x13a1/0x17d0 net/ipv4/ip_options.c:330 ipv4_send_dest_unreach net/ipv4/route.c:1252 ipv4_link_failure+0x702/0xb80 net/ipv4/route.c:1265 dst_link_failure include/net/dst.h:437 __ip_vs_get_out_rt+0x15fd/0x19e0 net/netfilter/ipvs/ip_vs_xmit.c:412 ip_vs_nat_xmit+0x1d8/0xc80 net/netfilter/ipvs/ip_vs_xmit.c:764 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Slavin Liu <slavin452@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
12 daysnetfilter: nf_conncount: fix leaked ct in error pathsFernando Fernandez Mancera
There are some situations where ct might be leaked as error paths are skipping the refcounted check and return immediately. In order to solve it make sure that the check is always called. Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
12 daysnetfilter: conntrack: warn when cleanup is stuckJakub Kicinski
nf_conntrack_cleanup_net_list() calls schedule() so it does not show up as a hung task. Add an explicit check to make debugging leaked skbs/conntack references more obvious. Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251207010942.1672972-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-28Merge tag 'nf-next-25-11-28' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: 0) Add sanity check for maximum encapsulations in bridge vlan, reported by the new AI robot. 1) Move the flowtable path discovery code to its own file, the nft_flow_offload.c mixes the nf_tables evaluation with the path discovery logic, just split this in two for clarity. 2) Consolidate flowtable xmit path by using dev_queue_xmit() and the real device behind the layer 2 vlan/pppoe device. This allows to inline encapsulation. After this update, hw_ifidx can be removed since both ifidx and hw_ifidx now point to the same device. 3) Support for IPIP encapsulation in the flowtable, extend selftest to cover for this new layer 3 offload, from Lorenzo Bianconi. 4) Push down the skb into the conncount API to fix duplicates in the conncount list for packets with non-confirmed conntrack entries, this is due to an optimization introduced in d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC"). From Fernando Fernandez Mancera. 5) In conncount, disable BH when performing garbage collection to consolidate existing behaviour in the conncount API, also from Fernando. 6) A matching packet with a confirmed conntrack invokes GC if conncount reaches the limit in an attempt to release slots. This allows the existing extensions to be used for real conntrack counting, not just limiting new connections, from Fernando. 7) Support for updating ct count objects in nf_tables, from Fernando. 8) Extend nft_flowtables.sh selftest to send IPv6 TCP traffic, from Lorenzo Bianconi. 9) Fixes for UAPI kernel-doc documentation, from Randy Dunlap. * tag 'nf-next-25-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: improve UAPI kernel-doc comments netfilter: ip6t_srh: fix UAPI kernel-doc comments format selftests: netfilter: nft_flowtable.sh: Add the capability to send IPv6 TCP traffic netfilter: nft_connlimit: add support to object update operation netfilter: nft_connlimit: update the count if add was skipped netfilter: nf_conncount: make nf_conncount_gc_list() to disable BH netfilter: nf_conncount: rework API to use sk_buff directly selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest netfilter: flowtable: Add IPIP tx sw acceleration netfilter: flowtable: Add IPIP rx sw acceleration netfilter: flowtable: use tuple address to calculate next hop netfilter: flowtable: remove hw_ifidx netfilter: flowtable: inline pppoe encapsulation in xmit path netfilter: flowtable: inline vlan encapsulation in xmit path netfilter: flowtable: consolidate xmit path netfilter: flowtable: move path discovery infrastructure to its own file netfilter: flowtable: check for maximum number of encapsulations in bridge vlan ==================== Link: https://patch.msgid.link/20251128002345.29378-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-28net: Remove KMSG_COMPONENT macroHeiko Carstens
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel message catalog" from 2008 [1] which never made it upstream. The macro was added to s390 code to allow for an out-of-tree patch which used this to generate unique message ids. Also this out-of-tree patch doesn't exist anymore. The pattern of how the KMSG_COMPONENT macro is used can also be found at some non s390 specific code, for whatever reasons. Besides adding an indirection it is unused. Remove the macro in order to get rid of a pointless indirection. Replace all users with the string it defines. In all cases this leads to a simple replacement like this: - #define KMSG_COMPONENT "af_iucv" - #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt + #define pr_fmt(fmt) "af_iucv: " fmt [1] https://lwn.net/Articles/292650/ Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Sidraya Jayagond <sidraya@linux.ibm.com> Link: https://patch.msgid.link/20251126140705.1944278-1-hca@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-28netfilter: nft_connlimit: add support to object update operationFernando Fernandez Mancera
This is useful to update the limit or flags without clearing the connections tracked. Use READ_ONCE() on packetpath as it can be modified on controlplane. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: nft_connlimit: update the count if add was skippedFernando Fernandez Mancera
Connlimit expression can be used for all kind of packets and not only for packets with connection state new. See this ruleset as example: table ip filter { chain input { type filter hook input priority filter; policy accept; tcp dport 22 ct count over 4 counter } } Currently, if the connection count goes over the limit the counter will count the packets. When a connection is closed, the connection count won't decrement as it should because it is only updated for new connections due to an optimization on __nf_conncount_add() that prevents updating the list if the connection is duplicated. To solve this problem, check whether the connection was skipped and if so, update the list. Adjust count_tree() too so the same fix is applied for xt_connlimit. Fixes: 976afca1ceba ("netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup") Closes: https://lore.kernel.org/netfilter/trinity-85c72a88-d762-46c3-be97-36f10e5d9796-1761173693813@3c-app-mailcom-bs12/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: nf_conncount: make nf_conncount_gc_list() to disable BHFernando Fernandez Mancera
For convenience when performing GC over the connection list, make nf_conncount_gc_list() to disable BH. This unifies the behavior with nf_conncount_add() and nf_conncount_count(). Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: nf_conncount: rework API to use sk_buff directlyFernando Fernandez Mancera
When using nf_conncount infrastructure for non-confirmed connections a duplicated track is possible due to an optimization introduced since commit d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC"). In order to fix this introduce a new conncount API that receives directly an sk_buff struct. It fetches the tuple and zone and the corresponding ct from it. It comes with both existing conncount variants nf_conncount_count_skb() and nf_conncount_add_skb(). In addition remove the old API and adjust all the users to use the new one. This way, for each sk_buff struct it is possible to check if there is a ct present and already confirmed. If so, skip the add operation. Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: Add IPIP tx sw accelerationLorenzo Bianconi
Introduce sw acceleration for tx path of IPIP tunnels relying on the netfilter flowtable infrastructure. This patch introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6). IPIP sw tx acceleration can be tested running the following scenario where the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP tunnel is used to access a remote site (using eth1 as the underlay device): ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2) $ip addr show 6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff inet 192.168.0.2/24 scope global eth0 valid_lft forever preferred_lft forever 7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff inet 192.168.1.1/24 scope global eth1 valid_lft forever preferred_lft forever 8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000 link/ipip 192.168.1.1 peer 192.168.1.2 inet 192.168.100.1/24 scope global tun0 valid_lft forever preferred_lft forever $ip route show default via 192.168.100.2 dev tun0 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2 192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1 192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1 $nft list ruleset table inet filter { flowtable ft { hook ingress priority filter devices = { eth0, eth1 } } chain forward { type filter hook forward priority filter; policy accept; meta l4proto { tcp, udp } flow add @ft } } Reproducing the scenario described above using veths I got the following results: - TCP stream trasmitted into the IPIP tunnel: - net-next: (baseline) ~ 85Gbps - net-next + IPIP flowtable support: ~102Gbps Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: Add IPIP rx sw accelerationLorenzo Bianconi
Introduce sw acceleration for rx path of IPIP tunnels relying on the netfilter flowtable infrastructure. Subsequent patches will add sw acceleration for IPIP tunnels tx path. This series introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6). IPIP rx sw acceleration can be tested running the following scenario where the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP tunnel is used to access a remote site (using eth1 as the underlay device): ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2) $ip addr show 6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff inet 192.168.0.2/24 scope global eth0 valid_lft forever preferred_lft forever 7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff inet 192.168.1.1/24 scope global eth1 valid_lft forever preferred_lft forever 8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000 link/ipip 192.168.1.1 peer 192.168.1.2 inet 192.168.100.1/24 scope global tun0 valid_lft forever preferred_lft forever $ip route show default via 192.168.100.2 dev tun0 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2 192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1 192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1 $nft list ruleset table inet filter { flowtable ft { hook ingress priority filter devices = { eth0, eth1 } } chain forward { type filter hook forward priority filter; policy accept; meta l4proto { tcp, udp } flow add @ft } } Reproducing the scenario described above using veths I got the following results: - TCP stream received from the IPIP tunnel: - net-next: (baseline) ~ 71Gbps - net-next + IPIP flowtbale support: ~101Gbps Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: use tuple address to calculate next hopPablo Neira Ayuso
This simplifies IPIP tunnel support coming in follow up patches. No function changes are intended. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: remove hw_ifidxPablo Neira Ayuso
hw_ifidx was originally introduced to store the real netdevice as a requirement for the hardware offload support in: 73f97025a972 ("netfilter: nft_flow_offload: use direct xmit if hardware offload is enabled") Since ("netfilter: flowtable: consolidate xmit path"), ifidx and hw_ifidx points to the real device in the xmit path, remove it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: inline pppoe encapsulation in xmit pathPablo Neira Ayuso
Push the pppoe header from the flowtable xmit path, inlining is faster than the original xmit path because it can avoid some locking. This is based on a patch originally written by wenxu. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28netfilter: flowtable: inline vlan encapsulation in xmit pathPablo Neira Ayuso
Push the vlan header from the flowtable xmit path, instead of passing the packet to the vlan device. This is based on a patch originally written by wenxu. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-27netfilter: flowtable: consolidate xmit pathPablo Neira Ayuso
Use dev_queue_xmit() for the XMIT_NEIGH case. Store the interface index of the real device behind the vlan/pppoe device, this introduces an extra lookup for the real device in the xmit path because rt->dst.dev provides the vlan/pppoe device. XMIT_NEIGH now looks more similar to XMIT_DIRECT but the check for stale dst and the neighbour lookup still remain in place which is convenient to deal with network topology changes. Note that nft_flow_route() needs to relax the check for _XMIT_NEIGH so the existing basic xfrm offload (which only works in one direction) does not break. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-27netfilter: flowtable: move path discovery infrastructure to its own filePablo Neira Ayuso
This file contains the path discovery that is run from the forward chain for the packet offloading the flow into the flowtable. This consists of a series of calls to dev_fill_forward_path() for each device stack. More topologies may be supported in the future, so move this code to its own file to separate it from the nftables flow_offload expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-27netfilter: flowtable: check for maximum number of encapsulations in bridge vlanPablo Neira Ayuso
Add a sanity check to skip path discovery if the maximum number of encapsulation is reached. While at it, check for underflow too. Fixes: 26267bf9bb57 ("netfilter: flowtable: bridge vlan hardware offload and switchdev") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-04net: Convert proto_ops connect() callbacks to use sockaddr_unsizedKees Cook
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops bind() callbacks to use sockaddr_unsizedKees Cook
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.18-rc4). No conflicts, adjacent changes: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c ded9813d17d3 ("net: stmmac: Consider Tx VLAN offload tag length for maxSDU") 26ab9830beab ("net: stmmac: replace has_xxxx with core_type") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-30netfilter: conntrack: disable 0 value for conntrack_max settingFlorian Westphal
Undocumented historical artifact inherited from ip_conntrack. If value is 0, then no limit is applied at all, conntrack table can grow to huge value, only limited by size of conntrack hashes and the kernel-internal upper limit on the hash chain lengths. This feature makes no sense; users can just set conntrack_max=2147483647 (INT_MAX). Disallow a 0 value. This will make it slightly easier to allow per-netns constraints for this value in a future patch. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-30netfilter: nf_tables: use C99 struct initializer for nft_set_iterFernando Fernandez Mancera
Use C99 struct initializer for nft_set_iter, simplifying the code and preventing future errors due to uninitialized fields if new fields are added to the struct. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-29netfilter: nft_ct: add seqadj extension for natted connectionsAndrii Melnychenko
Sequence adjustment may be required for FTP traffic with PASV/EPSV modes. due to need to re-write packet payload (IP, port) on the ftp control connection. This can require changes to the TCP length and expected seq / ack_seq. The easiest way to reproduce this issue is with PASV mode. Example ruleset: table inet ftp_nat { ct helper ftp_helper { type "ftp" protocol tcp l3proto inet } chain prerouting { type filter hook prerouting priority 0; policy accept; tcp dport 21 ct state new ct helper set "ftp_helper" } } table ip nat { chain prerouting { type nat hook prerouting priority -100; policy accept; tcp dport 21 dnat ip prefix to ip daddr map { 192.168.100.1 : 192.168.13.2/32 } } chain postrouting { type nat hook postrouting priority 100 ; policy accept; tcp sport 21 snat ip prefix to ip saddr map { 192.168.13.2 : 192.168.100.1/32 } } } Note that the ftp helper gets assigned *after* the dnat setup. The inverse (nat after helper assign) is handled by an existing check in nf_nat_setup_info() and will not show the problem. Topoloy: +-------------------+ +----------------------------------+ | FTP: 192.168.13.2 | <-> | NAT: 192.168.13.3, 192.168.100.1 | +-------------------+ +----------------------------------+ | +-----------------------+ | Client: 192.168.100.2 | +-----------------------+ ftp nat changes do not work as expected in this case: Connected to 192.168.100.1. [..] ftp> epsv EPSV/EPRT on IPv4 off. ftp> ls 227 Entering passive mode (192,168,100,1,209,129). 421 Service not available, remote server has closed connection. Kernel logs: Missing nfct_seqadj_ext_add() setup call WARNING: CPU: 1 PID: 0 at net/netfilter/nf_conntrack_seqadj.c:41 [..] __nf_nat_mangle_tcp_packet+0x100/0x160 [nf_nat] nf_nat_ftp+0x142/0x280 [nf_nat_ftp] help+0x4d1/0x880 [nf_conntrack_ftp] nf_confirm+0x122/0x2e0 [nf_conntrack] nf_hook_slow+0x3c/0xb0 .. Fix this by adding the required extension when a conntrack helper is assigned to a connection that has a nat binding. Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support") Signed-off-by: Andrii Melnychenko <a.melnychenko@vyos.io> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-29netfilter: nft_connlimit: fix possible data race on connection countFernando Fernandez Mancera
nft_connlimit_eval() reads priv->list->count to check if the connection limit has been exceeded. This value is being read without a lock and can be modified by a different process. Use READ_ONCE() for correctness. Fixes: df4a90250976 ("netfilter: nf_conncount: merge lookup and add functions") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-29netfilter: nft_ct: enable labels for get case tooFlorian Westphal
conntrack labels can only be set when the conntrack has been created with the "ctlabel" extension. For older iptables (connlabel match), adding an "-m connlabel" rule turns on the ctlabel extension allocation for all future conntrack entries. For nftables, its only enabled for 'ct label set foo', but not for 'ct label foo' (i.e. check). But users could have a ruleset that only checks for presence, and rely on userspace to set a label bit via ctnetlink infrastructure. This doesn't work without adding a dummy 'ct label set' rule. We could also enable extension infra for the first (failing) ctnetlink request, but unlike ruleset we would not be able to disable the extension again. Therefore turn on ctlabel extension allocation if an nftables ruleset checks for a connlabel too. Fixes: 1ad8f48df6f6 ("netfilter: nftables: add connlabel set support") Reported-by: Antonio Ojea <aojea@google.com> Closes: https://lore.kernel.org/netfilter-devel/aPi_VdZpVjWujZ29@strlen.de/ Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-08netfilter: nft_objref: validate objref and objrefmap expressionsFernando Fernandez Mancera
Referencing a synproxy stateful object from OUTPUT hook causes kernel crash due to infinite recursive calls: BUG: TASK stack guard page was hit at 000000008bda5b8c (stack is 000000003ab1c4a5..00000000494d8b12) [...] Call Trace: __find_rr_leaf+0x99/0x230 fib6_table_lookup+0x13b/0x2d0 ip6_pol_route+0xa4/0x400 fib6_rule_lookup+0x156/0x240 ip6_route_output_flags+0xc6/0x150 __nf_ip6_route+0x23/0x50 synproxy_send_tcp_ipv6+0x106/0x200 synproxy_send_client_synack_ipv6+0x1aa/0x1f0 nft_synproxy_do_eval+0x263/0x310 nft_do_chain+0x5a8/0x5f0 [nf_tables nft_do_chain_inet+0x98/0x110 nf_hook_slow+0x43/0xc0 __ip6_local_out+0xf0/0x170 ip6_local_out+0x17/0x70 synproxy_send_tcp_ipv6+0x1a2/0x200 synproxy_send_client_synack_ipv6+0x1aa/0x1f0 [...] Implement objref and objrefmap expression validate functions. Currently, only NFT_OBJECT_SYNPROXY object type requires validation. This will also handle a jump to a chain using a synproxy object from the OUTPUT hook. Now when trying to reference a synproxy object in the OUTPUT hook, nft will produce the following error: synproxy_crash.nft: Error: Could not process rule: Operation not supported synproxy name mysynproxy ^^^^^^^^^^^^^^^^^^^^^^^^ Fixes: ee394f96ad75 ("netfilter: nft_synproxy: add synproxy stateful object support") Reported-by: Georg Pfuetzenreuter <georg.pfuetzenreuter@suse.com> Closes: https://bugzilla.suse.com/1250237 Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-24netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrackEric Dumazet
ct_seq_show() has an opportunistic garbage collector : if (nf_ct_should_gc(ct)) { nf_ct_kill(ct); goto release; } So if one nf_conn is killed there, next time ct_get_next() runs, we skip the following item in the bucket, even if it should have been displayed if gc did not take place. We can decrement st->skip_elems to tell ct_get_next() one of the items was removed from the chain. Fixes: 58e207e4983d ("netfilter: evict stale entries when user reads /proc/net/nf_conntrack") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-24netfilter: nft_set_pipapo_avx2: fix skip of expired entriesFlorian Westphal
KASAN reports following splat: BUG: KASAN: slab-out-of-bounds in pipapo_get_avx2+0x941/0x25d0 Read of size 1 at addr ffff88814c561be0 by task nft/3944 Call Trace: pipapo_get_avx2+0x941/0x25d0 nft_pipapo_insert+0x440/0x11b0 nf_tables_newsetelem+0x220a/0x3a00 .. This bisects to commit 84c1da7b38d9 ("netfilter: nft_set_pipapo: use AVX2 algorithm for insertions too"). However, that change merely uncovers this bug. When we find a match but that match has expired or timed out, the AVX2 implementation restarts the full match loop. At that point, the pointer to the key data has already been changed and points to the keys last field. This will then result in out-of-bounds read once its incremented again for the next field. The restart logic in AVX2 is different compared to the plain C implementation, but both should follow the same logic. The C implementation just calls pipapo_refill() again do check the next entry. Do the same in the AVX2 implementation. Note that with this change, due to implementation differences of pipapo_refill vs. nft_pipapo_avx2_refill, the refill call will return the same element again. Then, on the next call, it will move to the next entry as expected. This is because avx2_refill doesn't clear the bitmap in the 'last' conditional. This is harmless. Expired/timed out elements are also not expected to be frequent. selftest is added in a followup commit. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-24netfilter: nft_set_pipapo: use 0 genmask for packetpath lookupsFlorian Westphal
In commit c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") I replaced genmask_cur() with NFT_GENMASK_ANY, but this change has no effect in the pipapo set type. New entries are unreachable from the active copy, so NFT_GENMASK_ANY has same result as genmask_cur(): current-gen elements are disabled and the new-generation elements cannot be found. Tests did not catch this incomplete fix because the change also dropped the genmask test from the AVX2 version of the algorithm, so test only fails if host cpu lacks AVX2 support. Use genmask test only from the control plane (inserts, deletions, ..). Packet path has to skip the check, use of 0 is enough for this because ext->genmask has a the relevant bit set when the element is INACTIVE in that generation: using a 0 genmask thus makes nft_set_elem_active() always return true. Fix the comment and replace NFT_GENMASK_ANY with 0. Fixes: c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-24netfilter: nfnetlink: reset nlh pointer during batch replayFernando Fernandez Mancera
During a batch replay, the nlh pointer is not reset until the parsing of the commands. Since commit bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") that is problematic as the condition to add an ACK for batch begin will evaluate to true even if NLM_F_ACK wasn't used for batch begin message. If there is an error during the command processing, netlink is sending an ACK despite that. This misleads userspace tools which think that the return code was 0. Reset the nlh pointer to the original one when a replay is triggered. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-24ipvs: Defer ip_vs_ftp unregister during netns cleanupSlavin Liu
On the netns cleanup path, __ip_vs_ftp_exit() may unregister ip_vs_ftp before connections with valid cp->app pointers are flushed, leading to a use-after-free. Fix this by introducing a global `exiting_module` flag, set to true in ip_vs_ftp_exit() before unregistering the pernet subsystem. In __ip_vs_ftp_exit(), skip ip_vs_ftp unregister if called during netns cleanup (when exiting_module is false) and defer it to __ip_vs_cleanup_batch(), which unregisters all apps after all connections are flushed. If called during module exit, unregister ip_vs_ftp immediately. Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.") Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Slavin Liu <slavin452@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-22net: replace use of system_wq with system_percpu_wqMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-12Merge tag 'nf-next-25-09-11' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH error. 2) Support fetching the current bridge ethernet address. This allows a more flexible approach to packet redirection on bridges without need to use hardcoded addresses. From Fernando Fernandez Mancera. 3) Zap a few no-longer needed conditionals from ipvs packet path and convert to READ/WRITE_ONCE to avoid KCSAN warnings. From Zhang Tengfei. 4) Remove a no-longer-used macro argument in ipset, from Zhen Ni. * tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_reject: don't reply to icmp error messages ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support netfilter: ipset: Remove unused htable_bits in macro ahash_region selftest:net: fixed spelling mistakes ==================== Link: https://patch.msgid.link/20250911143819.14753-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc6). Conflicts: net/netfilter/nft_set_pipapo.c net/netfilter/nft_set_pipapo_avx2.c c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") 84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too") Only trivial adjacent changes (in a doc and a Makefile). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enableZhang Tengfei
KCSAN reported a data-race on the `ipvs->enable` flag, which is written in the control path and read concurrently from many other contexts. Following a suggestion by Julian, this patch fixes the race by converting all accesses to use `WRITE_ONCE()/READ_ONCE()`. This lightweight approach ensures atomic access and acts as a compiler barrier, preventing unsafe optimizations where the flag is checked in loops (e.g., in ip_vs_est.c). Additionally, the `enable` checks in the fast-path hooks (`ip_vs_in_hook`, `ip_vs_out_hook`, `ip_vs_forward_icmp`) are removed. These are unnecessary since commit 857ca89711de ("ipvs: register hooks only with services"). The `enable=0` condition they check for can only occur in two rare and non-fatal scenarios: 1) after hooks are registered but before the flag is set, and 2) after hooks are unregistered on cleanup_net. In the worst case, a single packet might be mishandled (e.g., dropped), which does not lead to a system crash or data corruption. Adding a check in the performance-critical fast-path to handle this harmless condition is not a worthwhile trade-off. Fixes: 857ca89711de ("ipvs: register hooks only with services") Reported-by: syzbot+1651b5234028c294c339@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1651b5234028c294c339 Suggested-by: Julian Anastasov <ja@ssi.bg> Link: https://lore.kernel.org/lvs-devel/2189fc62-e51e-78c9-d1de-d35b8e3657e3@ssi.bg/ Signed-off-by: Zhang Tengfei <zhtfdev@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-11netfilter: ipset: Remove unused htable_bits in macro ahash_regionZhen Ni
Since the ahash_region() macro was redefined to calculate the region index solely from HTABLE_REGION_BITS, the htable_bits parameter became unused. Remove the unused htable_bits argument and its call sites, simplifying the code without changing semantics. Fixes: 8478a729c046 ("netfilter: ipset: fix region locking in hash types") Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nf_tables: restart set lookup on base_seq changeFlorian Westphal
The hash, hash_fast, rhash and bitwise sets may indicate no result even though a matching element exists during a short time window while other cpu is finalizing the transaction. This happens when the hash lookup/bitwise lookup function has picked up the old genbit, right before it was toggled by nf_tables_commit(), but then the same cpu managed to unlink the matching old element from the hash table: cpu0 cpu1 has added new elements to clone has marked elements as being inactive in new generation perform lookup in the set enters commit phase: A) observes old genbit increments base_seq I) increments the genbit II) removes old element from the set B) finds matching element C) returns no match: found element is not valid in old generation Next lookup observes new genbit and finds matching e2. Consider a packet matching element e1, e2. cpu0 processes following transaction: 1. remove e1 2. adds e2, which has same key as e1. P matches both e1 and e2. Therefore, cpu1 should always find a match for P. Due to above race, this is not the case: cpu1 observed the old genbit. e2 will not be considered once it is found. The element e1 is not found anymore if cpu0 managed to unlink it from the hlist before cpu1 found it during list traversal. The situation only occurs for a brief time period, lookups happening after I) observe new genbit and return e2. This problem exists in all set types except nft_set_pipapo, so fix it once in nft_lookup rather than each set ops individually. Sample the base sequence counter, which gets incremented right before the genbit is changed. Then, if no match is found, retry the lookup if the base sequence was altered in between. If the base sequence hasn't changed: - No update took place: no-match result is expected. This is the common case. or: - nf_tables_commit() hasn't progressed to genbit update yet. Old elements were still visible and nomatch result is expected, or: - nf_tables_commit updated the genbit: We picked up the new base_seq, so the lookup function also picked up the new genbit, no-match result is expected. If the old genbit was observed, then nft_lookup also picked up the old base_seq: nft_lookup_should_retry() returns true and relookup is performed in the new generation. This problem was added when the unconditional synchronize_rcu() call that followed the current/next generation bit toggle was removed. Thanks to Pablo Neira Ayuso for reviewing an earlier version of this patchset, for suggesting re-use of existing base_seq and placement of the restart loop in nft_set_do_lookup(). Fixes: 0cbc06b3faba ("netfilter: nf_tables: remove synchronize_rcu in commit phase") Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nf_tables: make nft_set_do_lookup available unconditionallyFlorian Westphal
This function was added for retpoline mitigation and is replaced by a static inline helper if mitigations are not enabled. Enable this helper function unconditionally so next patch can add a lookup restart mechanism to fix possible false negatives while transactions are in progress. Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would then need the same copypaste loop. This patch is separate to ease review of the actual bug fix. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nf_tables: place base_seq in struct netFlorian Westphal
This will soon be read from packet path around same time as the gencursor. Both gencursor and base_seq get incremented almost at the same time, so it makes sense to place them in the same structure. This doesn't increase struct net size on 64bit due to padding. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nft_set_rbtree: continue traversal if element is inactiveFlorian Westphal
When the rbtree lookup function finds a match in the rbtree, it sets the range start interval to a potentially inactive element. Then, after tree lookup, if the matching element is inactive, it returns NULL and suppresses a matching result. This is wrong and leads to false negative matches when a transaction has already entered the commit phase. cpu0 cpu1 has added new elements to clone has marked elements as being inactive in new generation perform lookup in the set enters commit phase: I) increments the genbit A) observes new genbit B) finds matching range C) returns no match: found range invalid in new generation II) removes old elements from the tree C New nft_lookup happening now will find matching element, because it is no longer obscured by old, inactive one. Consider a packet matching range r1-r2: cpu0 processes following transaction: 1. remove r1-r2 2. add r1-r3 P is contained in both ranges. Therefore, cpu1 should always find a match for P. Due to above race, this is not the case: cpu1 does find r1-r2, but then ignores it due to the genbit indicating the range has been removed. It does NOT test for further matches. The situation persists for all lookups until after cpu0 hits II) after which r1-r3 range start node is tested for the first time. Move the "interval start is valid" check ahead so that tree traversal continues if the starting interval is not valid in this generation. Thanks to Stefan Hanreich for providing an initial reproducer for this bug. Reported-by: Stefan Hanreich <s.hanreich@proxmox.com> Fixes: c1eda3c6394f ("netfilter: nft_rbtree: ignore inactive matching element with no descendants") Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nft_set_pipapo: don't check genbit from packetpath lookupsFlorian Westphal
The pipapo set type is special in that it has two copies of its datastructure: one live copy containing only valid elements and one on-demand clone used during transaction where adds/deletes happen. This clone is not visible to the datapath. This is unlike all other set types in nftables, those all link new elements into their live hlist/tree. For those sets, the lookup functions must skip the new elements while the transaction is ongoing to ensure consistency. As the clone is shallow, removal does have an effect on the packet path: once the transaction enters the commit phase the 'gencursor' bit that determines which elements are active and which elements should be ignored (because they are no longer valid) is flipped. This causes the datapath lookup to ignore these elements if they are found during lookup. This opens up a small race window where pipapo has an inconsistent view of the dataset from when the transaction-cpu flipped the genbit until the transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers: cpu0 cpu1 has added new elements to clone has marked elements as being inactive in new generation perform lookup in the set enters commit phase: I) increments the genbit A) observes new genbit removes elements from the clone so they won't be found anymore B) lookup in datastructure can't see new elements yet, but old elements are ignored -> Only matches elements that were not changed in the transaction II) calls nft_pipapo_commit(), clone and live pointers are swapped. C New nft_lookup happening now will find matching elements. Consider a packet matching range r1-r2: cpu0 processes following transaction: 1. remove r1-r2 2. add r1-r3 P is contained in both ranges. Therefore, cpu1 should always find a match for P. Due to above race, this is not the case: cpu1 does find r1-r2, but then ignores it due to the genbit indicating the range has been removed. At the same time, r1-r3 is not visible yet, because it can only be found in the clone. The situation persists for all lookups until after cpu0 hits II). The fix is easy: Don't check the genbit from pipapo lookup functions. This is possible because unlike the other set types, the new elements are not reachable from the live copy of the dataset. The clone/live pointer swap is enough to avoid matching on old elements while at the same time all new elements are exposed in one go. After this change, step B above returns a match in r1-r2. This is fine: r1-r2 only becomes truly invalid the moment they get freed. This happens after a synchronize_rcu() call and rcu read lock is held via netfilter hook traversal (nf_hook_slow()). Cc: Stefano Brivio <sbrivio@redhat.com> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nft_set_bitmap: fix lockdep splat due to missing annotationFlorian Westphal
Running new 'set_flush_add_atomic_bitmap' test case for nftables.git with CONFIG_PROVE_RCU_LIST=y yields: net/netfilter/nft_set_bitmap.c:231 RCU-list traversed in non-reader section!! rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/4008: #0: ffff888147f79cd8 (&nft_net->commit_mutex){+.+.}-{4:4}, at: nf_tables_valid_genid+0x2f/0xd0 lockdep_rcu_suspicious+0x116/0x160 nft_bitmap_walk+0x22d/0x240 nf_tables_delsetelem+0x1010/0x1a00 .. This is a false positive, the list cannot be altered while the transaction mutex is held, so pass the relevant argument to the iterator. Fixes tag intentionally wrong; no point in picking this up if earlier false-positive-fixups were not applied. Fixes: 28b7a6b84c0a ("netfilter: nf_tables: avoid false-positive lockdep splats in set walker") Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h c51613fa276f ("net: add sk->sk_drop_counters") 5d6b58c932ec ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04netfilter: nf_tables: Introduce NFTA_DEVICE_PREFIXPhil Sutter
This new attribute is supposed to be used instead of NFTA_DEVICE_NAME for simple wildcard interface specs. It holds a NUL-terminated string representing an interface name prefix to match on. While kernel code to distinguish full names from prefixes in NFTA_DEVICE_NAME is simpler than this solution, reusing the existing attribute with different semantics leads to confusion between different versions of kernel and user space though: * With old kernels, wildcards submitted by user space are accepted yet silently treated as regular names. * With old user space, wildcards submitted by kernel may cause crashes since libnftnl expects NUL-termination when there is none. Using a distinct attribute type sanitizes these situations as the receiving part detects and rejects the unexpected attribute nested in *_HOOK_DEVS attributes. Fixes: 6d07a289504a ("netfilter: nf_tables: Support wildcard netdev hook specs") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>