Bug #3541

"unable to install policy"

Added by Richard Laager 2 months ago. Updated 27 days ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: kernel-interface
Target version: -
Start date: 17.08.2020
Due date:
Estimated time:
Affected version: 5.8.2
Resolution:

Description

We have a "road warrior" VPN setup. The clients are typically Windows, but we do have some Linux, Android, and iOS clients as well. Twice now we have seen an issue where a client's connection fails. Both times, it was a Windows client, but that may be a coincidence since that is the common case.

This time, the log messages were like this (with IPV6_SUBNET being substituted here for the actual prefix):

ipsec[4657]: 06[CFG] unable to install policy IPV6_SUBNET::1f/128 === ::/0 in for reqid 1366, the same policy for reqid 1364 exists
ipsec[4657]: 06[CFG] unable to install policy IPV6_SUBNET::1f/128 === ::/0 fwd for reqid 1366, the same policy for reqid 1364 exists
ipsec[4657]: 06[CFG] unable to install policy ::/0 === IPV6_SUBNET::1f/128 out for reqid 1366, the same policy for reqid 1364 exists

From what I recall, last time it was IPv4 where the routes were stuck.

We were previously running on Ubuntu 18.04 with strongswan 5.6.2. After the first time, we upgraded to Ubuntu 20.04 with strongswan 5.8.2, but it has just recurred.

Last time, I tried removing the "ip xfrm" rules. That didn't fix anything; restarting the strongswan service did. Likewise, restarting the strongswan-starter service fixed it this time too. So the problem seems to be in-memory state in strongswan (broadly defined, possibly literally in the charon daemon).
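
Roughly the kind of state I inspected before restarting, in case it is useful next time (a sketch from memory, not an exact transcript):

# Kernel-side SAs and policies (what I referred to as the "ip xfrm" rules):
ip xfrm state
ip xfrm policy
# The daemon's own view of the SAs, for comparison:
ipsec statusall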

Is there anything we should look for now? More importantly, is there anything we should look for next time before restarting the service?

ipsec.conf (2.91 KB) - Richard Laager, 17.08.2020 06:57

History

#1 Updated by Richard Laager 2 months ago

I have attached our configuration file, slightly sanitized.

#2 Updated by Tobias Brunner 2 months ago

  • Category changed from charon to kernel-interface
  • Status changed from New to Feedback

If there really is an active CHILD_SA with a duplicate policy, the same reqid should get assigned. There might be a weird race condition (previous CHILD_SA gone but policies not yet fully uninstalled - although that should not actually happen as the reqid is released after removing the policies). You'd have to check the log before and around the time when this happens to see what's going on (preferably with log levels for chd and knl set to 2).
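
For example, something along these lines in strongswan.conf, assuming the charon log goes to syslog's daemon facility (adjust to however you collect the log):

# /etc/strongswan.conf (sketch; merge with your existing charon section)
charon {
    syslog {
        daemon {
            # chd = CHILD_SA handling, knl = kernel interface (netlink/XFRM)
            chd = 2
            knl = 2
        }
    }
}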

#3 Updated by Richard Laager 2 months ago

I have raised the logging levels for chd and knl to 2. At this point, we are just waiting for it to recur. If the current pattern holds, it should happen again within a couple of weeks, but who knows; it was stable for a long time before this.

#4 Updated by Tobias Brunner 2 months ago

You mentioned that you had to restart the daemon to fix this. Does that mean that (re-)connecting with this client/identity (assuming the virtual IP is reassigned based on the identity) resulted in this error repeatedly?

#5 Updated by Richard Laager 2 months ago

Correct, reconnecting the client is not sufficient. As you expected, the client is reassigned the same IP, so it just keeps hitting the same issue each time. On the most recent failure, the user tried connecting 4 times, got the same address 4 times, and failed 4 times.

The desired reqid increases with each attempt, while the reqid that already exists stays the same.

In grepping the logs, it looks like there were other instances of this that I didn't hear about. It appears to have happened on the 29th (twice), and on the 2nd, 5th, 6th, and 13th. Some were IPv4 and some were IPv6.
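
The search was roughly this (hypothetical invocation; the log path depends on your syslog setup):

# Search current and rotated logs for the error shown above.
zgrep -h "unable to install policy" /var/log/syslog*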

#6 Updated by Tobias Brunner 2 months ago

Hm, then a race condition seems unlikely. Sounds more like the kernel interface's tracking of these policies got out of sync for some reason. Let's see if the logs can shed some light on it.

#7 Updated by Tobias Brunner 28 days ago

Any update on this?

#8 Updated by Richard Laager 28 days ago

It has not recurred since we raised the debug level, so we're still in a holding pattern.

#9 Updated by Tobias Brunner 27 days ago

Richard Laager wrote:
> It has not recurred since we raised the debug level, so we're still in a holding pattern.

I see. If it was some kind of race condition, more logging could definitely change the timing and subsequently prevent the issue.
