Feature #2205

systemd service file enhancements (dependency on network-online.target, restart after an unexpected termination)

Added by Christian Ehrhardt almost 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Category:
charon-systemd
Target version:
5.5.2
Start date:
03.01.2017
Due date:
Estimated time:
Resolution:
Fixed

Description

Hi,
I'm trying to feed back some of the delta that Ubuntu has been carrying for a while to strongSwan itself.
In the service file, I think two things could be picked up:

  1. really wait for network being up according to https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
  2. restart the service on failure according to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

A change would look like:

--- a/init/systemd/strongswan.service.in
+++ b/init/systemd/strongswan.service.in
@@ -1,10 +1,11 @@
 [Unit]
 Description=strongSwan IPsec IKEv1/IKEv2 daemon using ipsec.conf
-After=syslog.target network.target
+After=syslog.target network-online.target

 [Service]
 ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork
 StandardOutput=syslog
+Restart=on-failure

 [Install]
 WantedBy=multi-user.target

Associated revisions

Revision 262bff8b (diff)
Added by Tobias Brunner almost 4 years ago

init: Depend on network-online.target instead of network.target in systemd units

This makes sure the network is "up" before connections are
loaded/initiated.

Fixes #2205.

Revision 014737dd (diff)
Added by Tobias Brunner almost 4 years ago

init: Let systemd restart daemons if they get terminated unexpectedly

Fixes #2205.

History

#1 Updated by Tobias Brunner almost 4 years ago

  • Tracker changed from Issue to Feature
  • Description updated (diff)
  • Category set to charon-systemd
  • Status changed from New to Feedback

really wait for network being up according to https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Hm, to quote that page:

network-online.target ... It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), it's primary purpose is network client software that cannot operate without network.

strongSwan is both a client and a server so whether this should be pulled in might depend on the use case (e.g. whether auto=route or auto=start connections are defined).

restart the service on failure according to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

How does that interact with starter restarting the daemon on certain signals? (or does systemd only see signals sent to starter?) Also, if the config is invalid and starter's exit code is LSB_RC_INVALID_ARGUMENT this might result in a loop. So maybe on-abnormal might be a better choice?
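That variant would look something like this (sketch only, the rest of the unit as in the diff above):

[Service]
ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork
StandardOutput=syslog
# on-abnormal restarts on unclean signals, core dumps and timeouts, but not on unclean exit codes
Restart=on-abnormal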

#2 Updated by Christian Ehrhardt almost 4 years ago

Tobias Brunner wrote:
[...]

strongSwan is both a client and a server so whether this should be pulled in might depend on the use case (e.g. whether auto=route or auto=start connections are defined).

While it's true that strongSwan is both a client and a server, the case of a "strongSwan server being happy to accept local connections" seems rather unusable to me, given that it is a VPN solution and not some random service that merely happens to have a network entry point for control. I understand that e.g. a web-administered local service should come up early because it can be reached via local connections - but a VPN from and to the same local system - I don't see that being useful.

Also, while I have no concrete case for it, I'd assume that starting strongSwan with a config for a given external network - even a server config - will have issues if the network isn't up. Not sure, though.

That said, I'd still expect network-online.target to be the better choice for strongSwan.

restart the service on failure according to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

How does that interact with starter restarting the daemon on certain signals? (or does systemd only see signals sent to starter?) Also, if the config is invalid and starter's exit code is LSB_RC_INVALID_ARGUMENT this might result in a loop. So maybe on-abnormal might be a better choice?

Yeah, I haven't seen a restart loop of strongSwan, but I understand your concerns.
In general the loop is rather safe with systemd, as it retries, rate-limits and eventually gives up.

On the other hand I'd expect the process started with "ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork" to always exit cleanly in any expected case.
So we might also keep on-failure, but add any "known reasonable exits" within "SuccessExitStatus=".
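As a rough sketch (the exit code below is only a placeholder, the real list would have to come from starter):

[Service]
ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork
StandardOutput=syslog
Restart=on-failure
# placeholder value - the actual "known reasonable" exit codes would go here
SuccessExitStatus=42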

Yet if that is unwanted, on-abnormal really might be the safer choice - restarting in weird, unexpected cases while leaving normal handling to starter.

Sorry, I'm just trying to improve things - but I'm not the biggest systemd expert on earth :-/

#3 Updated by Tobias Brunner almost 4 years ago

That said, I'd still expect network-online.target to be the better choice for strongSwan.

I don't really know systemd, so I was only reading that page and considering that strongSwan, as a responder, doesn't care whether there is a network connection or not when it starts. So I figured there is no reason to delay the system start (that's what the documentation indicates). But I guess on clients it makes more sense to wait for a network connection. If the impact isn't that severe (e.g. if there are other common services that depend on this anyway) we can certainly go with network-online.target (and probably change that in strongswan-swanctl.service.in too).

restart the service on failure according to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

How does that interact with starter restarting the daemon on certain signals? (or does systemd only see signals sent to starter?) Also, if the config is invalid and starter's exit code is LSB_RC_INVALID_ARGUMENT this might result in a loop. So maybe on-abnormal might be a better choice?

Yeah, I haven't seen a restart loop of strongSwan, but I understand your concerns.
In general the loop is rather safe with systemd, as it retries, rate-limits and eventually gives up.

Did you try with an invalid ipsec.conf file (e.g. a line with just conn on it)?

On the other hand I'd expect the process started with "ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork" to always exit cleanly in any expected case.

What do you mean by "cleanly"? Not crashing? Exit status 0?

So we might also keep on-failure, but add any "known reasonable exits" within "SuccessExitStatus=".

Maybe listing the fatal error codes in RestartPreventExitStatus could be an option to avoid the loop. Then again, starter only exits with a non-zero exit code when there is something seriously wrong during initialization. Most of these issues (config invalid, daemon not found) I wouldn't expect to get fixed automatically, so restarting does not really make sense to me in those cases (in particular if the user has to be informed that something might be wrong). After starter successfully got to its main loop it only exits with 0. There is one potential exit path that might be recoverable, which is if it finds a running process with the PID listed in its PID file. If that's not actually another starter process, it might disappear and allow starter to restart.

Yet if that is unwanted, on-abnormal really might be the safer choice - restarting in weird, unexpected cases while leaving normal handling to starter.

I do think on-abnormal might be the better choice. An alternative to it could be to run ipsec start --conftest in ExecStartPre, so at least config errors wouldn't result in a loop.
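As a sketch, the ExecStartPre variant would be something like the following (assuming starter's --conftest exits non-zero on config errors; whether this actually keeps Restart= from kicking in for config errors would need testing):

[Service]
# test the config first; a broken ipsec.conf then fails here instead of in the daemon
ExecStartPre=@SBINDIR@/@IPSEC_SCRIPT@ start --conftest
ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork
StandardOutput=syslog
Restart=on-failure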

#4 Updated by Christian Ehrhardt almost 4 years ago

Yeah, I haven't seen a restart loop of strongSwan, but I understand your concerns.
In general the loop is rather safe with systemd, as it retries, rate-limits and eventually gives up.

Did you try with an invalid ipsec.conf file (e.g. a line with just conn on it)?

Yeah, obviously that is not the case we want to address (it's more about unexpected transient issues), but a broken conf is the best test for the loop avoidance.
Adding a random string in the config gives me:

ipsec[2547]: /etc/ipsec.conf:5: syntax error, unexpected STRING [blarbl]
ipsec_starter[2547]: /etc/ipsec.conf:5: syntax error, unexpected STRING [b
ipsec[2547]: invalid config file '/etc/ipsec.conf'
ipsec_starter[2547]: invalid config file '/etc/ipsec.conf'
ipsec[2547]: unable to start strongSwan -- fatal errors in config
ipsec_starter[2547]: unable to start strongSwan -- fatal errors in config
[...]
strongswan.service: Unit entered failed state.
strongswan.service: Failed with result 'exit-code'.
pam_unix(sudo:session): session closed for user root
strongswan.service: Service hold-off time over, scheduling restart

(Note - the default time here is 100ms, but RestartSec= could override it if needed)

It tried that 5 times and then gave up:

strongswan.service: Start request repeated too quickly.
Failed to start strongSwan IPsec services.

I agree that if ipsec has a defined RC for "bad config" it could be used in RestartPreventExitStatus=, but I don't think it is required.
For a human being 5x100ms is instant, so nobody has to wait for anything.
But all sorts of weird "who knows why it failed" issues could resolve themselves in that time.
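(If the defaults ever needed tuning, something like the following would do it - the value is an arbitrary example:)

[Service]
Restart=on-failure
# hold-off between restart attempts, default is 100ms
RestartSec=2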

On the other hand I'd expect the process started with "ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork" to always exit cleanly in any expected case.

What do you mean by "cleanly"? Not crashing? Exit status 0?

By definition it is: "In this context, a clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE, and additionally, exit statuses and signals specified in SuccessExitStatus=."
But yes, I personally expected and meant an exit status of 0 for all good/normal cases.

I do think on-abnormal might be the better choice. An alternative to it could be to run ipsec start --conftest in ExecStartPre, so at least config errors wouldn't result in a loop.

As I said, I'm fine with on-abnormal if you vote for it, and I can follow your reasoning.
We can certainly start there and still evolve if I or somebody else can make a case for something else.
But IMHO on-abnormal would already give us quite some resilience against "the unexpected".

Further thoughts:
While we're discussing the service - do you think adding the following would also make sense?
AFAIK on stop the default is to send SIGKILL; not sure what reload does by default.
But we could make it more explicit via:

ExecStop=@SBINDIR@/@IPSEC_SCRIPT@ stop
ExecReload=@SBINDIR@/@IPSEC_SCRIPT@ reload

#5 Updated by Tobias Brunner almost 4 years ago

  • Target version set to 5.5.2

I agree that if ipsec has a defined RC for "bad config" it could be used in RestartPreventExitStatus=, but I don't think it is required.

2 is returned in that case.
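So, as a sketch, the config-error loop could be suppressed with something like:

[Service]
Restart=on-failure
# 2 = starter's exit code for an invalid config, as noted above
RestartPreventExitStatus=2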

For a human being 5x100ms is instant, so nobody has to wait for anything.

Sure, but it probably produces the same error messages five times, which is not really helpful.

I do think on-abnormal might be the better choice. An alternative to it could be to run ipsec start --conftest in ExecStartPre, so at least config errors wouldn't result in a loop.

As I said, I'm fine with on-abnormal if you vote for it, and I can follow your reasoning.
We can certainly start there and still evolve if I or somebody else can make a case for something else.
But IMHO on-abnormal would already give us quite some resilience against "the unexpected".

Yeah, I'd go with on-abnormal for now. I pushed a change to the 2205-systemd-updates branch (also includes the network-online.target change).
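(For reference, based on the diff in the description with Restart=on-abnormal substituted, the unit should end up roughly like this - the branch itself is authoritative and may differ in details, e.g. it could also add Wants=network-online.target, which the systemd documentation recommends alongside After=:)

[Unit]
Description=strongSwan IPsec IKEv1/IKEv2 daemon using ipsec.conf
After=syslog.target network-online.target

[Service]
ExecStart=@SBINDIR@/@IPSEC_SCRIPT@ start --nofork
StandardOutput=syslog
Restart=on-abnormal

[Install]
WantedBy=multi-user.target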

While we're discussing the service - do you think adding the following would also make sense?
AFAIK on stop the default is to send SIGKILL; not sure what reload does by default.

What systemd does if ExecStop is not specified sounds similar to what ipsec stop does, that is, send SIGTERM and after a timeout SIGKILL. The timeout in the ipsec script is 11 seconds, systemd seems to default to 90 seconds (DefaultTimeoutStopSec), so that's fine. The only difference is some /var/lock/subsys/ipsec handling that is probably not relevant anyway on any current system (at least on my, already a bit dated, Ubuntu system /var/lock/subsys does not exist). So I don't think it's necessary to specify this explicitly.

Regarding ExecReload, the problem is that there are two commands to reload the config: update and reload. The latter should be avoided if possible as it removes all configs (even unchanged ones) and loads them again. As established connections could be affected by this, it's not recommended. update only replaces the configs that have changed, so it has less of an impact on the whole system. So if anything, update should be called. But then there are the many reread... and purge... commands to reload secrets and certificates etc. And there might even be another option, which is sending a SIGHUP to the charon daemon (not starter) to reload strongswan.conf, the loggers and the plugins (there is currently no ipsec sub-command that does that). But only doing that would also be a bit strange and opaque (and doing all of it might also not be what the user wants). So it's probably best not to specify anything specific and let the user call whatever seems appropriate via ipsec.
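(If someone did want to wire something up anyway, the less destructive variant discussed above would presumably just be the following, with all the caveats mentioned:)

ExecReload=@SBINDIR@/@IPSEC_SCRIPT@ update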

Not sure if you know this, but ipsec/starter/stroke is the legacy interface that will be replaced by swanctl/vici (and if applicable charon-systemd) in the long run.

#6 Updated by Christian Ehrhardt almost 4 years ago

Tobias Brunner wrote:

[...]

Yeah, I'd go with on-abnormal for now. I pushed a change to the 2205-systemd-updates branch (also includes the network-online.target change).

I checked the branch on git://git.strongswan.org/strongswan.git, should be good for now - thank you.
I can always open a new issue if I want to further improve it.

[...]

Thanks for elaborating on the current stop handling of the script.

Not sure if you know this, but ipsec/starter/stroke is the legacy interface that will be replaced by swanctl/vici (and if applicable charon-systemd) in the long run.

Didn't know that yet - thanks for the heads-up.
Sounds like that will eventually be the way to go, to be integrated into services as needed - good to know.

#7 Updated by Tobias Brunner almost 4 years ago

  • Status changed from Feedback to Closed
  • Assignee set to Tobias Brunner
  • Resolution set to Fixed

#8 Updated by Tobias Brunner over 3 years ago

  • Subject changed from systemd service file enhancements to systemd service file enhancements (dependency on network-online.target, restart after an unexpected termination)
