Can be used in bridge prerouting hook to redirect the packet to the
receiving physical device for processing.
table bridge nat {
chain PREROUTING {
type filter hook prerouting priority 0; policy accept;
ether daddr de:ad:00:00:be:ef meta pkttype set host ether daddr set meta ibrhwaddr accept
}
}
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
When building NFTA_{FLOWTABLE_,}HOOK_DEVS attributes, detect trailing
asterisks in interface names and transmit the leading part in a
NFTA_DEVICE_PREFIX attribute.
Deserialization (i.e., appending an asterisk to interface prefixes returned
in NFTA_DEVICE_PREFIX attributes) happens in libnftnl.
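A minimal sketch of the trailing-asterisk handling described above, written in Python purely for illustration. The helper names split_ifname and join_ifname are hypothetical; the real code is C and builds netlink attributes, and escaping of literal asterisks is not modelled here.

```python
def split_ifname(name: str) -> tuple[str, bool]:
    """Return (text, is_prefix) for an interface name.

    A trailing '*' marks a prefix match; the leading part is what would
    be carried in NFTA_DEVICE_PREFIX instead of a plain device name.
    """
    if name.endswith("*"):
        return name[:-1], True
    return name, False


def join_ifname(text: str, is_prefix: bool) -> str:
    """Inverse step (done in libnftnl per the text): re-append the asterisk."""
    return text + "*" if is_prefix else text
```

For example, split_ifname("eth*") yields the prefix "eth" flagged as a wildcard, while "eth0" is passed through unchanged.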
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
New kernels dump info for flowtable hooks the same way as for base
chains.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
An upcoming kernel change provides the packet's conntrack state in the
trace message data.
This makes it possible to see whether the packet is seen as original or reply,
the conntrack state (new, established, related) and the status bits which show
if e.g. NAT was applied. Also include the conntrack ID so users can use the
conntrack tool to query the kernel for more information via ctnetlink.
This improves debugging when e.g. packets do not pick up the expected
NAT mapping, which could e.g. also happen because of expectations
following the NAT binding of the owning conntrack entry.
Example output ("conntrack: " lines are new):
trace id 32 t PRE_RAW packet: iif "enp0s3" ether saddr [..]
trace id 32 t PRE_RAW rule tcp flags syn meta nftrace set 1 (verdict continue)
trace id 32 t PRE_RAW policy accept
trace id 32 t PRE_MANGLE conntrack: ct direction original ct state new ct id 2641368242
trace id 32 t PRE_MANGLE packet: iif "enp0s3" ether saddr [..]
trace id 32 t ct_new_pre rule jump rpfilter (verdict jump rpfilter)
trace id 32 t PRE_MANGLE policy accept
trace id 32 t INPUT conntrack: ct direction original ct state new ct status dnat-done ct id 2641368242
trace id 32 t INPUT packet: iif "enp0s3" [..]
trace id 32 t public_in rule tcp dport 443 accept (verdict accept)
v3: remove clash bit again, kernel won't expose it anymore.
v2: add more status bits: helper, clash, offload, hw-offload.
add flag explanation to documentation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Hitherto, the kernel has required constant values for the `xor` and
`mask` attributes of boolean bitwise expressions. This has meant that
the right-hand operand of a boolean binop must be constant. Now that the
kernel has support for AND, OR and XOR operations with right-hand
operands passed via registers, we can relax this restriction. Allow
non-constant right-hand operands if the left-hand operand is not
constant, e.g.:
ct mark & 0xffff0000 | meta mark & 0xffff
The kernel now supports performing AND, OR and XOR operations directly,
on one register and an immediate value or on two registers, so we need
to be able to generate and parse bitwise boolean expressions of this
form.
If a boolean operation has a constant RHS, we continue to send a
mask-and-xor expression to the kernel.
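To illustrate what the example expression computes, here is a small Python sketch. The function name combine_marks is hypothetical and the code only models the arithmetic on 32-bit marks, not the bytecode nft generates.

```python
MASK32 = 0xFFFFFFFF


def combine_marks(ct_mark: int, meta_mark: int) -> int:
    """Evaluate: ct mark & 0xffff0000 | meta mark & 0xffff"""
    # Both ANDs have a constant right-hand side, so per the text they
    # would still be sent to the kernel in mask-and-xor form.
    left = ct_mark & 0xFFFF0000
    right = meta_mark & 0x0000FFFF
    # The OR combines two non-constant values, which is the case that
    # needs the register-based operation.
    return (left | right) & MASK32
```

With ct mark 0x12345678 and meta mark 0xabcdef01 the result keeps the upper half of the former and the lower half of the latter.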
Add tests for {ct,meta} mark with variable RHS operands.
JSON support is also included.
This requires Linux kernel >= 6.13-rc.
[ Originally posted as patch 1/8 and 6/8 which has been collapsed and
simplified to focus on initial {ct,meta} mark support. Tests have
been extracted from 8/8 including a tests/py fix to payload output
due to incorrect output in original patchset. JSON support has been
extracted from patch 7/8 --pablo]
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Switch from recursive-make to a single top-level Makefile. This is the
first step, the following patches will continue this.
Unlike meson's subdir() or C's #include, automake's SUBDIRS= does not
include a Makefile. Instead, it calls `make -C $dir`.
https://www.gnu.org/software/make/manual/html_node/Recursion.html
https://www.gnu.org/software/automake/manual/html_node/Subdirectories.html
See also, "Recursive Make Considered Harmful".
https://accu.org/journals/overload/14/71/miller_2004/
This has several problems, which we can avoid with a single Makefile:
- recursive-make is harder to maintain and understand as a whole.
Recursive-make makes sense, when there are truly independent
sub-projects. Which is not the case here. The project needs to be
considered as a whole and not one directory at a time. When
we add unit tests (which we should), those would reside in separate
directories but have dependencies between directories. With a single
Makefile, we see all at once. The build setup has an inherent complexity,
and that complexity is not necessarily reduced by splitting it into more files.
On the contrary, it helps to have it all in one place, provided that it's
sensibly structured, named and organized.
- typing `make` prints irrelevant "Entering directory" messages. So much
so, that at the end of the build, the terminal is filled with such
messages and we have to scroll to see what even happened.
- with recursive-make, during build we see:
make[3]: Entering directory '.../nftables/src'
CC meta.lo
meta.c:13:2: error: #warning hello test [-Werror=cpp]
13 | #warning hello test
| ^~~~~~~
With a single Makefile we get
CC src/meta.lo
src/meta.c:13:2: error: #warning hello test [-Werror=cpp]
13 | #warning hello test
| ^~~~~~~
This shows the full filename -- assuming that the developer works from
the top level directory. The full name is useful, for example to
copy+paste into the terminal.
- single Makefile is also faster:
$ make && perf stat -r 200 -B make -j
I measure 35msec vs. 80msec.
- recursive-make limits parallel make. You have to craft the SUBDIRS= in
the correct order. The dependencies between directories are limited,
as make only sees "LDADD = $(top_builddir)/src/libnftables.la" and
not the deeper dependencies for the library.
- I presume, some people like recursive-make because of `make -C $subdir`
to only rebuild one directory. Rebuilding the entire tree is already very
fast, so this feature seems not relevant. Also, as dependency handling
is limited, we might wrongly not rebuild a target. For example,
make check
touch src/meta.c
make -C examples check
does not rebuild "examples/nft-json-file".
What we now can do with single Makefile (and better than before), is
`make examples/nft-json-file`, which works as desired and rebuilds all
dependencies.
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
All these are used to reset state in set/map elements, i.e. reset the
timeout or zero quota and counter values.
While 'reset element' expects a (list of) elements to be specified which
should be reset, 'reset set/map' will reset all elements in the given
set/map.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Iptables supports the matching of DCCP packets based on the presence
or absence of DCCP options. Extend exthdr expressions to add this
functionality to nftables.
Link: https://bugzilla.netfilter.org/show_bug.cgi?id=930
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows 'nft list hooks' to also display the bpf program id
attached. Example:
hook input {
-0000000128 nf_hook_run_bpf id 6
..
Signed-off-by: Florian Westphal <fw@strlen.de>
The "destroy" command performs a deletion like the "delete" command but does
not fail if the object does not exist. As there is no NLM_F_* flag for ignoring
such an error, it needs to be ignored directly in the error handling.
Example of use:
# nft list ruleset
table ip filter {
chain output {
}
}
# nft destroy table ip missingtable
# echo $?
0
# nft list ruleset
table ip filter {
chain output {
}
}
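The behaviour shown above can be sketched as follows. This is an illustrative Python model only: finish_command is a hypothetical name, and the mapping of errors to exit status 1 is an assumption, not taken from the nft sources.

```python
import errno


def finish_command(cmd: str, err: int) -> int:
    """Return an exit status given the command and a kernel errno (0 = ok)."""
    # "destroy" swallows "no such object"; every other error still fails.
    if cmd == "destroy" and err == errno.ENOENT:
        return 0
    return 0 if err == 0 else 1
```

Under this model, destroying a missing table returns 0 while deleting it returns a non-zero status.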
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reset rule counters and quotas in kernel, i.e. without having to reload
them. Requires respective kernel patch to support NFT_MSG_GETRULE_RESET
message type.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This patch adds the initial infrastructure to support inner header
tunnel matching and its first user: vxlan.
A new struct proto_desc field is used by payload and meta expressions to
specify that the expression refers to inner header matching.
The existing codebase to generate bytecode is fully reused, allowing for
reusing existing supported layer 2, 3 and 4 protocols.
The syntax requires specifying vxlan before the inner protocol field:
... vxlan ip protocol udp
... vxlan ip saddr 1.2.3.0/24
This also works with concatenations and anonymous sets, eg.
... vxlan ip saddr . vxlan ip daddr { 1.2.3.4 . 4.3.2.1 }
You have to restrict vxlan matching to udp traffic, otherwise it
complains on missing transport protocol dependency, e.g.
... udp dport 4789 vxlan ip daddr 1.2.3.4
The bytecode that is generated uses the new inner expression:
# nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4
netdev x y
[ meta load l4proto => reg 1 ]
[ cmp eq reg 1 0x00000011 ]
[ payload load 2b @ transport header + 2 => reg 1 ]
[ cmp eq reg 1 0x0000b512 ]
[ inner type 1 hdrsize 8 flags f [ meta load protocol => reg 1 ] ]
[ cmp eq reg 1 0x00000008 ]
[ inner type 1 hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ]
[ cmp eq reg 1 0x04030201 ]
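The constants in the comparison lines can be reproduced with a short Python sketch. It assumes the debug output above was produced on a little-endian host, where network-order payload bytes are printed as host-order register words; reg_value is a hypothetical helper name.

```python
import socket
import struct


def reg_value(raw: bytes) -> int:
    """Interpret network-order payload bytes as a little-endian register word."""
    return int.from_bytes(raw, "little")


port = struct.pack("!H", 4789)        # UDP destination port 4789, bytes 12 b5
proto = struct.pack("!H", 0x0800)     # EtherType for IPv4, bytes 08 00
addr = socket.inet_aton("1.2.3.4")    # IPv4 address, bytes 01 02 03 04

print(hex(reg_value(port)), hex(reg_value(proto)), hex(reg_value(addr)))
```

The three values correspond to the 0x0000b512, 0x00000008 and 0x04030201 operands in the listing.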
JSON support is not included in this patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add userspace support for the netdev egress hook which is queued up for
v5.16-rc1, complete with documentation and tests. Usage is identical to
the ingress hook.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update this command to display the hook datapath for a packet depending
on its family.
This patch also includes:
- Group existing hooks based on the hook location.
- Order hooks by priority, from INT_MIN to INT_MAX.
- Do not add sign to priority zero.
- Refresh include/linux/netfilter/nfnetlink_hook.h cache copy.
- Use NFNLA_CHAIN_* attributes to print the chain family, table and name.
If NFNLA_CHAIN_* attributes are not available, display the hookfn name.
- Update syntax: remove optional hook parameter, promote the 'device'
argument.
The following example shows the hook datapath for IPv4 packets coming in
from netdevice 'eth0':
# nft list hooks ip device eth0
family ip {
hook ingress {
+0000000010 chain netdev x y [nf_tables]
+0000000300 chain inet m w [nf_tables]
}
hook input {
-0000000100 chain ip a b [nf_tables]
+0000000300 chain inet m z [nf_tables]
}
hook forward {
-0000000225 selinux_ipv4_forward
0000000000 chain ip a c [nf_tables]
}
hook output {
-0000000225 selinux_ipv4_output
}
hook postrouting {
+0000000225 selinux_ipv4_postroute
}
}
Note that the listing above includes the existing netdev and inet
hooks/chains which *might* interfere in the travel of an incoming IPv4
packet. This allows users to debug the pipeline, basically, to
understand in what order the hooks/chains are evaluated for the IPv4
packets.
If the netdevice is not specified, then the ingress hooks are not
shown.
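The ordering and priority formatting rules listed above can be sketched in Python. The names format_prio and order_hooks are hypothetical, hooks are modelled as (priority, name) tuples, and column alignment of the unsigned zero is not modelled.

```python
def format_prio(prio: int) -> str:
    """Format a hook priority as sign plus ten digits; zero gets no sign."""
    if prio == 0:
        return f"{prio:010d}"
    return f"{prio:+011d}"


def order_hooks(hooks):
    """Sort (priority, name) tuples from lowest to highest priority."""
    return sorted(hooks, key=lambda hook: hook[0])


forward = [(0, "chain ip a c [nf_tables]"), (-225, "selinux_ipv4_forward")]
for prio, name in order_hooks(forward):
    print(format_prio(prio), name)
```

This reproduces the order and number format of the "hook forward" block in the example listing.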
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 4694f7230195 introduced nfnetlink_hook.h but didn't update the
automake system to take account of the new file.
Signed-off-by: Duncan Roe <duncan_roe@optusnet.com.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extend exthdr expression to support scanning through SCTP packet chunks
and matching on fixed fields' values.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Add a catchall expression (EXPR_SET_ELEM_CATCHALL).
Use the asterisk (*) to represent the catch-all set element, e.g.
table x {
set y {
type ipv4_addr
counter
elements = { 1.2.3.4 counter packets 0 bytes 0, * counter packets 0 bytes 0 }
}
}
Special handling for segtree: zap the catch-all element from the set
element list and re-add it after processing.
Remove wildcard_expr deadcode in src/parser_bison.y
This patch also adds several tests for the tests/py and tests/shell
infrastructures.
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Old kernels reject requests for elements with multiple statements because
userspace sets the flag for multi-statements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stateless SCTP header mangling doesn't work reliably.
This tells the kernel to update the checksum field using
the sctp crc32 algorithm.
Note that this needs additional kernel support to work.
Signed-off-by: Florian Westphal <fw@strlen.de>
iptables had a "-m socket --transparent" which didn't match sockets that are
bound to all addresses (e.g. 0.0.0.0 for ipv4, and ::0 for ipv6). It was
possible to override this behavior by using --nowildcard, in which case it
did match zero bound sockets as well.
The issue is that nftables never included the wildcard check, so in effect
it behaved like "iptables -m socket --transparent --nowildcard" with no
means to exclude wildcarded listeners.
This is a problem as a user-space process that binds to 0.0.0.0:<port> that
enables IP_TRANSPARENT would effectively intercept traffic going in _any_
direction on the specific port, whereas in most cases, transparent proxies
would only need this for one specific address.
The solution is to add "socket wildcard" key to the nft_socket module, which
makes it possible to match on the wildcardness of a socket from
one's ruleset.
This is how to use it:
table inet haproxy {
chain prerouting {
type filter hook prerouting priority -150; policy accept;
socket transparent 1 socket wildcard 0 mark set 0x00000001
}
}
This patch effectively depends on its counterpart in the kernel.
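As a rough model of the rule above, the following Python sketch treats a socket as "wildcard" when it is bound to the unspecified address (0.0.0.0 or ::). The function names are hypothetical and this is not the kernel's implementation.

```python
import ipaddress


def is_wildcard_bound(bound_addr: str) -> bool:
    """True if the socket is bound to 0.0.0.0 or :: (all addresses)."""
    return ipaddress.ip_address(bound_addr).is_unspecified


def rule_matches(transparent: bool, bound_addr: str) -> bool:
    """Model of: socket transparent 1 socket wildcard 0"""
    return transparent and not is_wildcard_bound(bound_addr)
```

A transparent listener bound to a specific address matches, while one bound to 0.0.0.0 is excluded.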
Signed-off-by: Balazs Scheidler <bazsi77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to group rules in a subchain, e.g.
table inet x {
chain y {
type filter hook input priority 0;
tcp dport 22 jump {
ip saddr { 127.0.0.0/8, 172.23.0.0/16, 192.168.13.0/24 } accept
ip6 saddr ::1/128 accept;
}
}
}
This also supports the `goto' chain verdict.
This patch adds a new chain binding list to avoid a chain list lookup from the
delinearize path for the usual chains. This can be simplified later on with a
single hashtable per table for all chains.
From the shell, you have to use the explicit separator ';', in bash you
have to escape this:
# nft add rule inet x y tcp dport 80 jump { ip saddr 127.0.0.1 accept\; ip6 saddr ::1 accept \; }
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to specify an interval of IP addresses in maps.
table ip x {
chain y {
type nat hook postrouting priority srcnat; policy accept;
snat ip prefix to ip saddr map { 10.141.11.0/24 : 192.168.2.0/24 }
}
}
The example above performs SNAT on packets that come from
10.141.11.0/24 using the prefix 192.168.2.0/24, e.g. 10.141.11.4 is
mangled to 192.168.2.4.
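The prefix translation in the example can be sketched in Python as follows. snat_prefix is a hypothetical helper, and the sketch assumes both prefixes have the same length, as in the example.

```python
import ipaddress


def snat_prefix(addr: str, src_net: str, dst_net: str) -> str:
    """Replace the network part of addr, keeping its host bits."""
    address = ipaddress.ip_address(addr)
    src = ipaddress.ip_network(src_net)
    dst = ipaddress.ip_network(dst_net)
    if address not in src:
        return addr  # outside the mapped interval: left untouched
    host_bits = int(address) & int(src.hostmask)
    return str(ipaddress.ip_address(int(dst.network_address) | host_bits))
```

Calling snat_prefix("10.141.11.4", "10.141.11.0/24", "192.168.2.0/24") gives the address mentioned in the text.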
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Get this header in sync with nf.git as of commit ef516e8625dd.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Get this header in sync with nf-next as of merge commit
b3a608222336 (5.6-rc1-ish).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The kernel UAPI header includes a couple of new bitwise netlink
attributes and an enum.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The comment documenting how bitwise expressions work includes a table
which summarizes the mask and xor arguments combined to express the
supported boolean operations. However, the row for OR:
mask xor
0 x
is incorrect.
dreg = (sreg & 0) ^ x
is not equivalent to:
dreg = sreg | x
What the code actually does is:
dreg = (sreg & ~x) ^ x
Update the documentation to match.
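The difference between the two rows can be checked with a few lines of Python. The helper names are hypothetical; the sketch just evaluates dreg = (sreg & mask) ^ xor on 32-bit values.

```python
MASK32 = 0xFFFFFFFF


def bitwise(sreg: int, mask: int, xor: int) -> int:
    """dreg = (sreg & mask) ^ xor, truncated to 32 bits."""
    return ((sreg & mask) ^ xor) & MASK32


def or_as_documented(sreg: int, x: int) -> int:
    """The incorrect row: mask = 0, xor = x."""
    return bitwise(sreg, 0, x)


def or_as_implemented(sreg: int, x: int) -> int:
    """The corrected row: mask = ~x, xor = x."""
    return bitwise(sreg, ~x & MASK32, x)
```

For sreg = 0b1100 and x = 0b1010, the corrected form yields 0b1110 (the OR), whereas the old row yields just x.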
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Adds "meta sdif" and "meta sdifname".
Both only work in input/forward hook of ipv4/ipv6/inet family.
Cc: Martin Willi <martin@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Add support for "synproxy" stateful object. For example (for TCP port 80 and
using maps with saddr):
table ip foo {
synproxy https-synproxy {
mss 1460
wscale 7
timestamp sack-perm
}
synproxy other-synproxy {
mss 1460
wscale 5
}
chain bar {
tcp dport 80 synproxy name "https-synproxy"
synproxy name ip saddr map { 192.168.1.0/24 : "https-synproxy", 192.168.2.0/24 : "other-synproxy" }
}
}
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These keywords introduce new checks for a timestamp, an absolute date (which is converted to a timestamp),
an hour in the day (which is converted to the number of seconds since midnight) and a day of week.
When converting an ISO date (e.g. 2019-06-06 17:00) to a timestamp,
we need to subtract the GMT offset in seconds from it, that is, the value
of the 'tm_gmtoff' field in the tm structure. This is because the kernel
doesn't know about time zones. And hence the kernel manages different timestamps
than those that are advertised in userspace when running, for instance, date +%s.
The same conversion needs to be done when converting hours (e.g. 17:00) to seconds since midnight
as well.
The result needs to be computed modulo 86400 in case GMT offset (difference in seconds from UTC)
is negative.
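A small Python sketch of the hour conversion described above. hour_to_kernel_seconds is a hypothetical name and gmtoff stands in for the 'tm_gmtoff' value. Note that Python's % always returns a non-negative result for a positive modulus, which is not true of C's % operator.

```python
SECONDS_PER_DAY = 86400


def hour_to_kernel_seconds(hh: int, mm: int, ss: int = 0, gmtoff: int = 0) -> int:
    """Convert a local hour of day to UTC seconds since midnight."""
    local = hh * 3600 + mm * 60 + ss
    # Subtracting the offset can leave the 0..86399 range in either
    # direction, so wrap the result around the day boundary.
    return (local - gmtoff) % SECONDS_PER_DAY
```

For instance, 23:00 at UTC-5 (gmtoff = -18000) becomes 14400, i.e. 04:00 UTC on the following day.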
We also introduce a new command line option (-t, --seconds) to show the actual
timestamps when printing the values, rather than the ISO dates, or the hour.
Some usage examples:
time < "2019-06-06 17:00" drop;
time < "2019-06-06 17:20:20" drop;
time < 12341234 drop;
day "Saturday" drop;
day 6 drop;
hour >= 17:00 drop;
hour >= "17:00:01" drop;
hour >= 63000 drop;
We need to convert an ISO date to a timestamp
without taking into account the time zone offset, since comparison will
be done in kernel space and there is no time zone information there.
Overwriting TZ is portable, but will cause problems when parsing a
ruleset that has 'time' and 'hour' rules. Parsing an 'hour' type must
not do time zone conversion, but that will be automatically done if TZ has
been overwritten to UTC.
Hence, we use timegm() to parse the 'time' type, even though it's not portable.
Overwriting TZ seems to be a much worse solution.
Finally, be aware that timestamps are converted to nanoseconds when
transferring to the kernel (as comparison is done with nanosecond
precision), and back to seconds when retrieving them for printing.
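The date handling can be approximated in Python with calendar.timegm(), which, like the C timegm(), interprets a broken-down time as UTC regardless of TZ. The function names and the gmtoff parameter are hypothetical; this is a sketch of the described conversion, not the nft code.

```python
import calendar
import time

NSEC_PER_SEC = 1_000_000_000


def iso_to_kernel_ns(text: str, gmtoff: int = 0) -> int:
    """Parse 'YYYY-MM-DD HH:MM[:SS]' and return nanoseconds for the kernel."""
    fmt = "%Y-%m-%d %H:%M:%S" if text.count(":") == 2 else "%Y-%m-%d %H:%M"
    tm = time.strptime(text, fmt)
    # timegm() ignores the local time zone; the offset is applied explicitly.
    seconds = calendar.timegm(tm) - gmtoff
    return seconds * NSEC_PER_SEC


def kernel_ns_to_seconds(ns: int) -> int:
    """Convert back to seconds for printing."""
    return ns // NSEC_PER_SEC
```

Round-tripping "2019-06-06 17:00" with a zero offset gives 1559840400 seconds.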
We swap left and right values in a range to properly handle
cross-day hour ranges (e.g. 23:15-03:22).
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Reviewed-by: Florian Westphal <fw@strlen.de>
Update dependency on libnftnl. Missing nf_synproxy.h in Makefile.am too.
Update release name based on the Jazz series, Fats Waller performing "Scram":
https://www.youtube.com/watch?v=c9-noJc9ifI
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Refresh it to fetch what we have in 5.3-rc1.
Remove NFT_OSF_F_VERSION definition, this is already available in
include/linux/netfilter/nf_tables.h
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for "synproxy" statement. For example (for TCP port 8888):
table ip x {
chain y {
type filter hook prerouting priority raw; policy accept;
tcp dport 8888 tcp flags syn notrack
}
chain z {
type filter hook input priority filter; policy accept;
tcp dport 8888 ct state invalid,untracked synproxy mss 1460 wscale 7 timestamp sack-perm
ct state invalid drop
}
}
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add capability to have rules matching IPv4 options. This is developed
mainly to support dropping of IP packets with loose and/or strict source
route options.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for version fingerprint in "osf" expression. Example:
table ip foo {
chain bar {
type filter hook input priority filter; policy accept;
osf ttl skip name "Linux"
osf ttl skip version "Linux:4.20"
}
}
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This can be used to match the kind type of iif or oif
interface of the packet. Example:
add rule inet raw prerouting meta iifkind "vrf" accept
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>