Project

General

Profile

Bug #3579

charon-nm delays systemd shutdown

Added by René Krell about 1 month ago. Updated 25 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
-
Category:
networkmanager (charon-nm)
Target version:
Start date:
Due date:
Estimated time:
Affected version:
5.7.2
Resolution:

Description

Excuses in advance if the description doesn't initially contain all the required information, I describe the problem as good as I can. Please ask what you might further:

I'm on Debian Buster, using the default and backport repository, no additional or self-compiled software.

I have set up two client connections with the Network manager applet
(1) Type strongswan, IKEv2 (to a LANCOM router, tried static and automatic IPv4routing)
(2) Type l2tp, (with IKEv2 encryption to a UniFi router. automatic IPv4 routing)

I just use the network manager applet or Network system settings tool to connect and disconnect, neither manual patches to config files nor command line tools.

- Start the OS
- Connect to (1) shows Connected - I can ping the 1st site's remote gateway and hosts, everything is fine.
- Disconnect from (1)
- Wait for minute, because otherwise the second connection doesn't come up, probably while disconnecting the first connection
- Connect to (2) shows Connected - I can ping the 2nd site's remote gateway and hosts, everything is fine.
- Disconnect from (2)
- Connect to (1) again shows Connected - I CANNOT ping the 1st site's remote gateway and hosts any longer
- As root, run the following commands:
service strongswan stop
service network-manager restart
service strongswan start
- Connect to (1) again shows Connected - I CAN ping the 1st site's remote gateway and hosts, everything is fine.

When being connected to the first connection, even without a second configured VPN connection, my OS often hangs when shutting down for 90s, until sytemd kills it. On the console I see that it is hanging on charon-nm here.

Can you guide me to provide the necessary information for analyzing, please?

History

#1 Updated by René Krell about 1 month ago

/etc/ipsec.conf:

conn user@site
keyexchange=ikev2
ike=aes256-sha256-modp2048!
esp=aes256-sha256!
auto=add
authby=psk
left=%defaultroute
leftid=user@site
leftauth=psk
leftsourceip=%config4, %config6
right=<REMOTE_GW_IP_ADDRESS>
rightid=<REMOTE_GW_IP_ADDRESS>
rightauth=psk
rightsubnet=0.0.0.0/0

#2 Updated by René Krell about 1 month ago

If connecting just to (1) repeatedly, the problem with the missing data doesn't occur (except charon-nm cannot be regularly shut down, which is still the case). Just after switching to the second connection (2) and back to (1).

#3 Updated by René Krell about 1 month ago

/var/log/messages when disconnecting from (2):

Sep 26 13:36:26 rkrell NetworkManager[105502]: <info>  [1601120186.8710] audit: op="connection-deactivate" uuid="6f3346b7-1756-4f3a-99e2-5ae018db5ea8" name="VPN (2)" pid=105709 uid
=1000 result="success" 
Sep 26 13:36:26 rkrell NetworkManager[105502]: <info>  [1601120186.8740] policy: set 'Wired DHCP' (enp0s31f6) as default for IPv4 routing and DNS
Sep 26 13:36:26 rkrell charon-systemd[1620]: interface ppp0 deactivated
Sep 26 13:36:26 rkrell charon-systemd[1620]: 192.168.145.1 disappeared from ppp0
Sep 26 13:36:26 rkrell NetworkManager[105502]: <info>  [1601120186.9074] device (ppp0): state change: disconnected -> unmanaged (reason 'connection-assumed', sys-iface-state: 'external')
Sep 26 13:36:26 rkrell gnome-shell[1423]: Object NM.ActiveConnection (0x563c76e19550), has been already deallocated — impossible to access it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: == Stack trace for context 0x563c772ea270 ==
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #0   563c7818d2b0 i   resource:///org/gnome/gjs/modules/overrides/GObject.js:468 (7f53b56cc040 @ 25)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #1   563c7818d208 i   resource:///org/gnome/shell/ui/status/network.js:1827 (7f53b50e1040 @ 111)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #2   563c7818d160 i   resource:///org/gnome/gjs/modules/_legacy.js:82 (7f53b56b0b80 @ 71)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #3   563c7818d0a0 i   self-hosted:979 (7f53b56f01f0 @ 440)
Sep 26 13:36:26 rkrell pppd[107709]: Terminating on signal 15
Sep 26 13:36:26 rkrell pppd[107709]: Connect time 0.4 minutes.
Sep 26 13:36:26 rkrell pppd[107709]: Sent 50396 bytes, received 61418 bytes.
Sep 26 13:36:26 rkrell pppd[107709]: Overriding mtu 1500 to 1400
Sep 26 13:36:26 rkrell pppd[107709]: Overriding mru 1500 to mtu value 1400
Sep 26 13:36:26 rkrell gnome-shell[1423]: Object NM.ActiveConnection (0x563c76e19550), has been already deallocated — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: == Stack trace for context 0x563c772ea270 ==
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #0   563c7818d488 i   resource:///org/gnome/shell/ui/status/network.js:496 (7f53b50da550 @ 46)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #1   563c7818d3e0 i   resource:///org/gnome/gjs/modules/_legacy.js:82 (7f53b56b0b80 @ 71)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #2   563c7818d340 i   resource:///org/gnome/shell/ui/status/network.js:2051 (7f53b50e1af0 @ 216)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #3   563c7818d298 i   resource:///org/gnome/gjs/modules/_legacy.js:82 (7f53b56b0b80 @ 71)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #4   563c7818d208 i   resource:///org/gnome/shell/ui/status/network.js:1854 (7f53b50e10d0 @ 138)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #5   563c7818d160 i   resource:///org/gnome/gjs/modules/_legacy.js:82 (7f53b56b0b80 @ 71)
Sep 26 13:36:26 rkrell org.gnome.Shell.desktop[1423]: #6   563c7818d0a0 i   self-hosted:979 (7f53b56f01f0 @ 440)
Sep 26 13:36:26 rkrell NetworkManager[105502]: <info>  [1601120186.9930] vpn-connection[0x558ffbbf86f0,6f3346b7-1756-4f3a-99e2-5ae018db5ea8,"VPN (2)",0]: VPN plugin: state changed: stopping (5)
Sep 26 13:36:27 rkrell nm-l2tp-service[107624]: ipsec shut down
Sep 26 13:36:27 rkrell NetworkManager[105502]: <info>  [1601120187.0897] vpn-connection[0x558ffbbf86f0,6f3346b7-1756-4f3a-99e2-5ae018db5ea8,"VPN (2)",0]: VPN plugin: state changed: stopped (6)
Sep 26 13:36:32 rkrell pppd[107709]: Connection terminated.
Sep 26 13:36:32 rkrell charon-systemd[1620]: interface ppp0 deleted
Sep 26 13:36:33 rkrell cinnamon[2659]: JS LOG: Removing a network device that was not added
Sep 26 13:36:33 rkrell gnome-shell[1423]: Removing a network device that was not added
Sep 26 13:36:33 rkrell pppd[107709]: Modem hangup
Sep 26 13:36:33 rkrell pppd[107709]: Exit.

... and after connecting back connection (2) again (now the remote gateway can't be pinged any longer):

Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.4167] audit: op="connection-activate" uuid="d41f2548-42bf-448c-97ae-6ee9b0dceff3" name="VPN (1)" pid=105709 uid=1000 result="success" 
Sep 26 13:40:43 rkrell gnome-shell[1423]: JS ERROR: TypeError: item is undefined#012setActiveConnections/<@resource:///org/gnome/shell/ui/status/network.js:1517:17#012setActiveConnections@resource:///org/gnome/shell/ui/status/network.js:1514:9#012wrapper@resource:///org/gnome/gjs/modules/_legacy.js:82:22#012_syncVpnConnections@resource:///org/gnome/shell/ui/status/network.js:1852:9#012wrapper@resource:///org/gnome/gjs/modules/_legacy.js:82:22
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.4462] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Saw the service appear; activating connection
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.6508] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: VPN plugin: state changed: starting (3)
Sep 26 13:40:43 rkrell charon-systemd[1620]: 192.168.135.169 appeared on enp0s31f6
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7899] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: VPN connection: (IP4 Config Get) reply received from old-style plugin
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7929] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data: VPN Gateway: <REMOTE_GW_IP_ADDRESS>
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7932] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data: Tunnel Device: (null)
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7933] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data: IPv4 configuration:
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7934] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Internal Address: 192.168.135.169
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7935] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Internal Prefix: 32
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7935] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Internal Point-to-Point Address: 192.168.135.169
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7936] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Static Route: 192.168.134.0/24   Next Hop: 192.168.134.1
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7936] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Static Route: 192.168.135.169/32   Next Hop: 0.0.0.0
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7937] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   Internal DNS: 192.168.134.1
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7937] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data:   DNS Domain: '(none)'
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.7937] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: Data: No IPv6 configuration
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.8134] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: VPN connection: (IP Config Get) complete
Sep 26 13:40:43 rkrell NetworkManager[105502]: <info>  [1601120443.8140] vpn-connection[0x558ffbbf86f0,d41f2548-42bf-448c-97ae-6ee9b0dceff3,"VPN (1)",0]: VPN plugin: state changed: started (4)

#4 Updated by René Krell about 1 month ago

Since it behaves like a wrong route, here's the ip route after that action, 192.168.134.1 is the remote gateway's LAN address, but it is not reachable any longer. The routing table doesn't look that bad, right?

default via 192.168.0.1 dev enp0s31f6 proto dhcp metric 100 
<REMOTE_GW_IP_ADDRESS> via 192.168.0.1 dev enp0s31f6 proto static metric 100 
169.254.0.0/16 dev enp0s31f6 scope link metric 1000 
192.168.0.0/24 dev enp0s31f6 proto kernel scope link src 192.168.0.73 metric 100 
192.168.0.1 dev enp0s31f6 proto static scope link metric 100 
192.168.134.0/24 via 192.168.134.1 dev enp0s31f6 proto static metric 50 
192.168.134.1 dev enp0s31f6 proto static scope link metric 50 
192.168.135.169 dev enp0s31f6 proto kernel scope link src 192.168.135.169 metric 50 
192.168.135.169 dev enp0s31f6 proto kernel scope link src 192.168.135.169 metric 100 

#5 Updated by René Krell about 1 month ago

Now I found that

service strongswan restart

and reconnecting to (1) is enough to reestablish a working connection (1). Restarting network-manager isn't really necessary.

#6 Updated by Tobias Brunner 29 days ago

  • Category set to networkmanager (charon-nm)
  • Status changed from New to Feedback

(2) Type l2tp, (with IKEv2 encryption to a UniFi router. automatic IPv4 routing)

Please note that that's not implemented by our NM plugin, which only supports IKEv2 with plain ESP, no L2TP. I have no idea how that actually works, but presumably it controls the regular charon daemon via legacy ipsec.conf etc. (our NM plugin uses a dedicated backend, charon-nm, and configures it directly and not via config files).

- Connect to (1) again shows Connected - I CANNOT ping the 1st site's remote gateway and hosts any longer

Check log, routes (table 220), IPsec policies etc.

- As root, run the following commands:
service strongswan stop

That will not have any effect on strongSwan's NM backend, which runs as a separate daemon directly started by NM. To restart that you might have to manually kill charon-nm. This service unit controls the legacy starter/charon daemons (in newer versions it controls the charon-systemd daemon), so this might affect whatever that L2TP NM plugin did.

service network-manager restart

Not sure actually if that stops charon-nm. If you want to make sure, kill it manually.

When being connected to the first connection, even without a second configured VPN connection, my OS often hangs when shutting down for 90s, until sytemd kills it. On the console I see that it is hanging on charon-nm here.

No idea what that could be. Check the log during shutdown.

Now I found that

[...]

and reconnecting to (1) is enough to reestablish a working connection (1). Restarting network-manager isn't really necessary.

Since I don't know what the L2TP plugin does, I can't say for sure, but it might not clean up properly when disconnecting. So restarting the starter/charon daemons that plugin configured might help with that.

#7 Updated by René Krell 28 days ago

Tobias, you're right.

The essential point is to kill /usr/lib/ipsec/charon-nm after disconnecting from l2tp.
It is also started by the l2tp plugin.

BTW, according to your advice with the separate daemon I've also disabled the strongswan service, which is probably nonsense here:

systemctl disable strongswan

Nevertheless, although reconnecting with the strongswan plugin succeeds using the same connection in case I avoid the second, L2TP, connection, its /usr/lib/ipsec/charon-nm remains hanging after disconnecting. I've checked this after cleaning up this process explicitly. Is this expected?

#8 Updated by Tobias Brunner 28 days ago

The essential point is to kill /usr/lib/ipsec/charon-nm after disconnecting from l2tp.
It is also started by the l2tp plugin.

It is? So this uses IKEv2 combined with L2TP? Why?

BTW, according to your advice with the separate daemon I've also disabled the strongswan service, which is probably nonsense here:

Well, you apparently have a configuration in ipsec.conf, which that daemon uses. But if you don't actually use that config, go ahead and disable it.

Nevertheless, although reconnecting with the strongswan plugin succeeds using the same connection in case I avoid the second, L2TP, connection, its /usr/lib/ipsec/charon-nm remains hanging after disconnecting.

Not sure what you mean with "remains hanging" because it's simply that NM does not terminate the daemons it starts anymore when connections are terminated (not sure what the reasoning was behind this as earlier versions didn't do that). While there is a "quit" D-Bus signal, it never seems to be called. Reconnecting might work better with newer versions of charon-nm where e.g. secrets are cleared when terminating a connection so they don't affect new connections.

#9 Updated by René Krell 27 days ago

Tobias Brunner wrote:

The essential point is to kill /usr/lib/ipsec/charon-nm after disconnecting from l2tp.
It is also started by the l2tp plugin.

It is? So this uses IKEv2 combined with L2TP? Why?

This is given by the UBNT UniFi system and USG router, I cannot change this at the moment.
It requires an LT2P with IPSec, alternatively a PPTP connection, at least officially.

I'll check the rest...

#10 Updated by René Krell 27 days ago

Tobias Brunner wrote:

...

BTW, according to your advice with the separate daemon I've also disabled the strongswan service, which is probably nonsense here:

Well, you apparently have a configuration in ipsec.conf, which that daemon uses. But if you don't actually use that config, go ahead and disable it.

Yes, I did an apt install --reinstall -o Dpkg::Options::="--force-confask,confnew,confmiss" strongswan-starter, which cleant this up.

Nevertheless, although reconnecting with the strongswan plugin succeeds using the same connection in case I avoid the second, L2TP, connection, its /usr/lib/ipsec/charon-nm remains hanging after disconnecting.

Not sure what you mean with "remains hanging" because it's simply that NM does not terminate the daemons it starts anymore when connections are terminated (not sure what the reasoning was behind this as earlier versions didn't do that). While there is a "quit" D-Bus signal, it never seems to be called. Reconnecting might work better with newer versions of charon-nm where e.g. secrets are cleared when terminating a connection so they don't affect new connections.

I'd just expect the "daemon" process to disappear when disconnecting.
Reconnecting seems no longer to be a problem.
The only problem I have the whole time - a charon-nm process blocks shutdown of the computer for about 90s.

#11 Updated by Tobias Brunner 27 days ago

The only problem I have the whole time - a charon-nm process blocks shutdown of the computer for about 90s.

How do you know it's charon-nm? Does it matter if the a VPN is established or not (or which one of them)? Any corresponding log messages?

#12 Updated by René Krell 27 days ago

Tobias Brunner wrote:

The only problem I have the whole time - a charon-nm process blocks shutdown of the computer for about 90s.

How do you know it's charon-nm? Does it matter if the a VPN is established or not (or which one of them)? Any corresponding log messages?

It is written on the console, a single line, that shutdown is waiting for charon-nm.
It depends probably on the VPN connection if established at least once, after that the charon-nm process does not disappear (unless killed manually).

#13 Updated by Tobias Brunner 27 days ago

It depends probably on the VPN connection if established at least once

Yeah, it won't run otherwise.

after that the charon-nm process does not disappear (unless killed manually).

Hm, I wonder what systemd tries to do to terminate it (apparently not by sending a SIGTERM). Perhaps that's when the "quit" signal comes into play. I guess I'll do some experiments.

#14 Updated by Tobias Brunner 26 days ago

  • Tracker changed from Issue to Bug
  • Target version set to 5.9.1

Perhaps that's when the "quit" signal comes into play.

That evidently is the case. I've rebased the nm-quit branch, which includes a four year old commit that implements support for that signal. With it, the daemon is signaled during shutdown and subsequently terminates itself by sending a SIGTERM to itself.

#15 Updated by René Krell 26 days ago

Tobias Brunner wrote:

Perhaps that's when the "quit" signal comes into play.

That evidently is the case. I've rebased the nm-quit branch, which includes a four year old commit that implements support for that signal. With it, the daemon is signaled during shutdown and subsequently terminates itself by sending a SIGTERM to itself.

Great job. If you need any particular information let me know.
I'm looking forward to try the new version here, would make work easier. Thanks for the answers and your activities.

#16 Updated by René Krell 26 days ago

BTW, I tried the shutdown and all I can see in the console during waiting is:

systemd-shutdown[1]: Waiting for process: charon-nm

#17 Updated by Tobias Brunner 26 days ago

  • Subject changed from Multiple VPN IKEV2 client connections require strongswan and network-manager to be restarted to transfer data to charon-nm delays systemd shutdown

If you need any particular information let me know.
I'm looking forward to try the new version here, would make work easier.

If you could try the patch that'd be great. But I understand if it's too much of a hassle to rebuild the Debian package with the patch or build from sources.

BTW, I tried the shutdown and all I can see in the console during waiting is:
[...]

Interesting. I never saw such an explicit message, the system either shutdown normally (not sure how that worked), just hung at the very last log message or there was some generic message about waiting two minutes for some task.

#18 Updated by René Krell 25 days ago

Tobias Brunner wrote:

If you need any particular information let me know.
I'm looking forward to try the new version here, would make work easier.

If you could try the patch that'd be great. But I understand if it's too much of a hassle to rebuild the Debian package with the patch or build from sources.

I'd try, I must got to find out how to do this in Debian, used OpenSuSe the last time for rebuilding RPMS :-)

BTW, I tried the shutdown and all I can see in the console during waiting is:
[...]

Interesting. I never saw such an explicit message, the system either shutdown normally (not sure how that worked), just hung at the very last log message or there was some generic message about waiting two minutes for some task.

Moreover, from several documentations and discussions regarding systemd I read that it send SIGTERM instead of SIGQUIT on shutdown.
The systemd version is 246, the one from the Debian Buster backports repository.

#19 Updated by Tobias Brunner 25 days ago

Moreover, from several documentations and discussions regarding systemd I read that it send SIGTERM instead of SIGQUIT on shutdown.
The systemd version is 246, the one from the Debian Buster backports repository.

Ah, I used that of the default repository (241), so that probably explains the different log messages. But a SIGTERM doesn't seem to be sent with either version as that's properly handled by charon-nm (as mentioned, with the fix above, it uses that signal to terminate itself). A SIGQUIT also seems unlikely as that's currently not handled i.e. would probably just crash the daemon.

The charon-nm daemon is also not a proper systemd service (i.e. there is no unit file for it), as it's a VPN plugin for NetworkManager. It gets started by NM as D-Bus service. So I'm not sure how the responsibilities during shutdown are divided between NM and systemd (and perhaps even D-Bus). I guess systemd terminates NM, which then sends that "quit" signal to the plugin/daemon. But why systemd waits explicitly for it I don't know (or why it apparently doesn't send any signals to it until it eventually kills it). Perhaps D-Bus services are automatically adopted by systemd, or it just sees charon-nm as child process of NM and that's why it waits for it.

Also available in: Atom PDF