Issue #2833

strongSwan creating multiple P2 (CHILD_SA) entries

Added by vijay Rao almost 7 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Category:
configuration
Affected version:
5.6.3
Resolution:
No feedback

Description

Hey guys,

We have been having an issue with IKEv2 creating multiple CHILD_SA (Phase 2) entries every time the lifetime is renewed.

This is a site-to-site IPsec VPN between strongSwan and pfSense. strongSwan runs on an Amazon EC2 instance with Amazon Linux 2 (strongSwan U5.6.3/K4.14.62-70.117.amzn2.x86_64).

Also, when the strongSwan service is reloaded, the Phase 2 entries multiply.

The other end runs pfSense v2.4.3 / v2.3.5.

Attaching the configuration and logs from both ends.

Strongswan Logs.txt.gz (471 KB), vijay Rao, 22.11.2018 17:04
Pfsense Ipsec Logs.txt (11.7 KB), vijay Rao, 22.11.2018 17:04
Strongswan configuration.png (23.4 KB), vijay Rao, 22.11.2018 17:04
Pfsense_Phase 2_Part1.png (54.7 KB), vijay Rao, 22.11.2018 17:07
Pfsense_IPSec status.png (114 KB), vijay Rao, 22.11.2018 17:07
Pfsense_VPN_IPsec.png (40.9 KB), vijay Rao, 22.11.2018 17:07
Pfsense_Phase2_Part 2.png (14 KB), vijay Rao, 22.11.2018 17:07
Pfsense_Phase1_Part2.png (72 KB), vijay Rao, 22.11.2018 17:07
Pfsense_Phase1_Part1.png (63.2 KB), vijay Rao, 22.11.2018 17:07
Strongswan Status.txt (2.53 KB), vijay Rao, 05.12.2018 15:18
ae_ie-frankfurt.log (471 KB), vijay Rao, 05.12.2018 15:18
Sydney_logs_Strongswan.txt (1.35 MB), vijay Rao, 13.12.2018 13:43
Logs_Pfsense_Ipsec_Sydney(1).txt (1.13 MB), vijay Rao, 13.12.2018 13:46

History

#1 Updated by Tobias Brunner almost 7 years ago

  • Category changed from charon to configuration
  • Status changed from New to Feedback
  • Assignee changed from Andreas Steffen to Tobias Brunner
  • Priority changed from Urgent to Normal

That strongSwan config snippet does not match your log (different connection name). Please post complete configs, and don't use screenshots for text content.

Also, something there seems to constantly (i.e. multiple times per second) query the SAs and policies (weird DPD settings? status queries?).

Then around 2018-11-21 22:49 there is an unexplained update of the policies and SAs (apparently the peer's IP address has changed, but there was no IKE message or NAT mapping change before - at least according to the log). But there seem to be other log messages missing (e.g. the ones that are logged when a rekey job is queued and perhaps when an acquire is received from the kernel).

At 2018-11-21 23:34 the initial IKE_SA is destroyed and recreated due to DPDs, but the new one is immediately closed due to a duplicate created by the peer, and that IKE_SA is then also closed immediately by the duplicheck plugin (Wed, 2018-11-21 23:35 07[CFG] <texas_off|677> got a response on a duplicate IKE_SA for 'ex-ip.aumora.cex.au.webuy.cx', deleting new IKE_SA). Disable that plugin, it does not do what you might think it does (read its description closely to see why it deleted both IKE_SAs). That might actually be one of the main reasons for your issues.
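
For reference, a minimal strongswan.conf sketch that disables it (alternatively, remove duplicheck from an explicit charon.load line if you use one):

charon {
    plugins {
        duplicheck {
            # disable the plugin's duplicate handling entirely
            enable = no
        }
    }
}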

Another could be the combination of trap policies (auto=route, but dpdaction=hold also installs them) with reauthentication (or recreation of SAs by the peer); see ExpiryRekey for details on this issue. Depending on the configuration of both peers it might not be possible to completely avoid duplicate CHILD_SAs if both can initiate the connection concurrently, but you shouldn't get more and more of them.
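
To illustrate, an ipsec.conf conn along these lines (hypothetical name) has both of the properties discussed there, trap policies plus DPD falling back to them:

conn example-tunnel
    auto=route        # install trap policies instead of initiating at startup
    dpddelay=20s
    dpdaction=hold    # re-install trap policies if the peer times out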

#2 Updated by vijay Rao almost 7 years ago

Thank you for the information, Tobias.

We have disabled the "duplicheck" plugin in the charon configuration.

We are using the following DPD Settings at the moment:
dpddelay=20
dpdtimeout=120
dpdaction=hold
keyingtries=5

Can you suggest the ideal settings for DPD?

We are unsure about the status queries configuration. Can you provide more information on this?

What configuration needs to be checked to avoid duplication of CHILD_SAs? Also, what is the ideal log configuration setting for the charon log?

Logs as follows:

<texas_off|1162> policy 10.5.26.0/24 === 10.200.1.0/24 in already exists, increasing refcount
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> updating policy 10.5.26.0/24 === 10.200.1.0/24 in [priority 375423, refcount 8]
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> policy 10.5.26.0/24 === 10.200.1.0/24 fwd already exists, increasing refcount
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> updating policy 10.5.26.0/24 === 10.200.1.0/24 fwd [priority 375423, refcount 8]
Fri, 2018-11-23 15:56 13[IKE] <texas_off|1162> inbound CHILD_SA texas_off{6107} established with SPIs cf4820e2_i c413ace7_o and TS 10.200.1.0/24 === 10.5.26.0/24
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> policy 10.200.1.0/24 === 10.5.26.0/24 out already exists, increasing refcount
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> updating policy 10.200.1.0/24 === 10.5.26.0/24 out [priority 375423, refcount 8]
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> getting a local address in traffic selector 10.200.1.0/24
Fri, 2018-11-23 15:56 13[KNL] <texas_off|1162> no local address found in traffic selector 10.200.1.0/24
Fri, 2018-11-23 15:56 13[IKE] <texas_off|1162> outbound CHILD_SA texas_off{6107} established with SPIs cf4820e2_i c413ace7_o and TS 10.200.1.0/24 === 10.5.26.0/24
Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> deleting policy 10.200.1.0/24 === 10.5.26.0/24 out
Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> policy still used by another CHILD_SA, not removed
Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> not updating policy 10.200.1.0/24 === 10.5.26.0/24 out [priority 375423, refcount 7]

Can you please explain the last three log entries? We have seen the refcount increasing.

#3 Updated by Tobias Brunner almost 7 years ago

Can you suggest the ideal settings for DPD?

Depends on the scenario, but as mentioned above, dpdaction=hold (which installs trap policies if the peer appears dead) could be problematic if break-before-make reauthentication is used (refer to the link I gave above), or if the peer concurrently re-initiates the SAs. And dpdtimeout has no effect on IKEv2 SAs.
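
As a sketch, something like this avoids trap policies via DPD (hypothetical conn name, values are examples, not recommendations for every scenario):

conn example-tunnel
    dpddelay=30s
    dpdaction=restart   # or clear; unlike hold, neither installs trap policies
    # dpdtimeout is IKEv1-only; for IKEv2 the retransmission settings in
    # strongswan.conf determine when a peer is considered dead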

We are unsure about the status queries configuration. Can you provide more information on this?

I don't have any. I just see the queries in the log (without any apparent reason; you might see a bit more with different log settings, e.g. with the job level increased to 2). If you don't have a script running that queries the status constantly, I don't know what else it could be (have you perhaps set charon.keep_alive very low?).
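
For reference, that setting lives in strongswan.conf and defaults to 20 seconds; a much lower value here would cause constant activity:

charon {
    # interval for NAT-T keepalive packets
    keep_alive = 20s
}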

What configuration needs to be checked to avoid duplication of CHILD_SAs?

As mentioned, depending on the setup (e.g. if both peers can concurrently (re-)initiate the connection) you might not be able to avoid duplicates completely (uniqueids=yes can avoid duplicate IKE_SAs, but not always, depending on the timing).
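
For reference, that's the uniqueids option in the config setup section of ipsec.conf:

config setup
    # replace existing IKE_SAs with the same peer identity; SAs created
    # at exactly the same time may still both survive
    uniqueids=yes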

Also, what is the ideal log configuration setting for the charon log?

Depends. For debugging, use the ones on HelpRequests (perhaps with the modification above to see if there is a difference).
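
A minimal strongswan.conf logging sketch along the lines of what HelpRequests asks for, with the job modification mentioned above (adjust the path to your setup):

charon {
    filelog {
        /var/log/charon.log {
            time_format = %b %e %T
            ike_name = yes
            append = no
            default = 1
            job = 2          # raised as suggested above
            flush_line = yes
        }
    }
}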

Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> deleting policy 10.200.1.0/24 === 10.5.26.0/24 out
Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> policy still used by another CHILD_SA, not removed
Fri, 2018-11-23 15:56 09[KNL] <texas_off|1162> not updating policy 10.200.1.0/24 === 10.5.26.0/24 out [priority 375423, refcount 7]

Can you please explain the last three log entries? We have seen the refcount increasing.

CHILD_SAs are rekeyed overlapping to avoid traffic loss. But a number that high indicates that there are duplicate CHILD_SAs (plus perhaps trap policies installed).

#4 Updated by vijay Rao almost 7 years ago

Hey Tobias,

As per your suggestion, we have disabled the duplicheck plugin on the server, but we are still observing duplicate CHILD_SAs, which is consuming memory on the server.

What log configuration changes do I need to make to get more in-depth information on Phase 2?

Do you suggest using "responder only" in our case? Will it help?

#5 Updated by Tobias Brunner almost 7 years ago

What log configuration changes do I need to make to get more in-depth information on Phase 2?

As I said, use the settings on HelpRequests (optionally with job set to 2).

Do you suggest using "responder only" in our case? Will it help?

Maybe; it depends on why the duplicates are created (in particular whether they accumulate, i.e. whether you get more than one duplicate per CHILD_SA).

#6 Updated by vijay Rao almost 7 years ago

Thank you, Tobias.

We are currently investigating the logs from our side.

Can you analyze the logs for the connection "ae_ie-frankfurt"? We are still observing multiple CHILD_SA entries there. Attaching the logs as well as a file showing the current status.

#7 Updated by Tobias Brunner almost 7 years ago

Attaching the logs

Not the complete logs, though.

Anyway, this looks weird:

initiating IKE_SA ae_ie-frankfurt[89] to 0.0.0.0
...
sending packet: from 10.200.0.253[500] to 0.0.0.0[500] (400 bytes)

Why 0.0.0.0? Or did you unnecessarily replace a specific address with 0.0.0.0?

At 2018-12-02 06:33 the peer closes the IKE_SA (for unknown reasons, perhaps it does a break-before-make reauthentication?) and both hosts create new ones (this one either due to closeaction=restart or an installed trap policy e.g. because of dpdaction=hold). However, the peer creates not one but three CHILD_SAs (two duplicates) with the new IKE_SA (unique ID 651, the initiated IKE_SA with ID 650 is closed as duplicate). So check the log there to see why it does these things.
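
For reference, the two ipsec.conf knobs mentioned here, either of which can trigger that recreation (hypothetical conn name):

conn example-tunnel
    closeaction=restart   # re-initiate if the peer closes the SA
    dpdaction=hold        # install trap policies if the peer times out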

#8 Updated by vijay Rao over 6 years ago

Hey Tobias,

Why 0.0.0.0? Or did you unnecessarily replace a specific address with 0.0.0.0?

We purposely changed it for security reasons.

At 2018-12-02 06:33 the peer closes the IKE_SA (for unknown reasons, perhaps it does a break-before-make reauthentication?) and both hosts create new ones (this one either due to closeaction=restart or an installed trap policy e.g. because of dpdaction=hold). However, the peer creates not one but three CHILD_SAs (two duplicates) with the new IKE_SA (unique ID 651, the initiated IKE_SA with ID 650 is closed as duplicate). So check the log there to see why it does these things.

We have gone through the logs but haven't found anything suspicious.

There is a strongswan update command run every minute by a background cron job. Do you think that could cause the issue?

#9 Updated by Tobias Brunner over 6 years ago

We have gone through the logs but haven't found anything suspicious.

What did you look for?

There is a strongswan update command run every minute by a background cron job. Do you think that could cause the issue?

Why would you do that? Still, if the config doesn't actually change, the update command (unlike the reload command) shouldn't have any effect.
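
For reference, with stroke-based setups:

ipsec update   # applies only what changed in ipsec.conf since the last load
ipsec reload   # discards and re-adds all connections from ipsec.conf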

#10 Updated by vijay Rao over 6 years ago

Hey Tobias,

What did you look for?

When Phase 1 gets rekeyed, the CHILD_SA count increases by one. We went through the Phase 1 rekeying cycle step by step, but we are still unsure why the CHILD_SA count gets incremented.

Why would you do that? Still, if the config doesn't actually change, the update command (unlike the reload command) shouldn't have any effect.

Since we have more than 100 tunnels configured, we do this to update the IPsec configuration.

To get a bigger picture of the issue, we have attached logs from both the pfSense (Australia time) and strongSwan (UTC) sides. Can you check whether you find anything in them?

At the moment, we are trying to increase the log verbosity to see if that gives us more information.

#11 Updated by Tobias Brunner over 6 years ago

When Phase 1 gets rekeyed, the CHILD_SA count increases by one. We went through the Phase 1 rekeying cycle step by step, but we are still unsure why the CHILD_SA count gets incremented.

Something with that pfSense box is messed up. There are SAs the daemon somehow tracks but the kernel apparently already deleted. Maybe you should reboot it and follow the log closely from the beginning. Also, it seems to do a lot of status updates (maybe from the web UI?). The other log also doesn't start from the beginning, so it's hard to tell when or why the issue starts occurring (the log levels there are also not ideal, as mentioned before; e.g. knl messages are missing). As mentioned before, too, the problem could be trap policies combined with break-before-make reauthentication.

Since we have more than 100 tunnels configured, we do this to update the IPsec configuration.

What do you mean? Do these regularly change without your knowledge? Can't you just call update when there actually was a change to the config?

#12 Updated by vijay Rao over 6 years ago

Hey Tobias,

Regarding trap policies combined with break-before-make reauthentication: we have enabled make-before-break on strongSwan as well as on pfSense, but we are still observing multiple Phase 2 entries after IKE reauthentication.
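
For reference, on the strongSwan side we enabled it via the global option in strongswan.conf:

charon {
    # use make-before-break instead of break-before-make IKEv2 reauthentication
    make_before_break = yes
}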

As per your recommendation, we have collected the charon logs. Providing the links below.
Strongswan logs: https://drive.google.com/file/d/1NL8JMdmV_wiGuIEz8x1jb2J6b4zir50/view?usp=sharing
Pfsense logs: https://drive.google.com/file/d/1OhJHBWpupmvRm2hBjiHIvgiEnRwRXCzL/view?usp=sharing

#13 Updated by vijay Rao over 6 years ago

Strongswan Logs:

https://drive.google.com/open?id=1NL8JMdmV-_wiGuIEz8x1jb2J6b4zir50

#14 Updated by Tobias Brunner over 6 years ago

Ideally we'd have the complete logs from the very start (covering the same time span) to follow where duplicates might get created.

But even the beginning of the strongSwan log you sent is strange:

Sun, 2019-01-13 17:15 13[CFG] received stroke: add connection 'vietnam_tunnel'
Sun, 2019-01-13 17:15 13[CFG] added configuration 'vietnam_tunnel'
Sun, 2019-01-13 17:15 05[CFG] received stroke: initiate 'vietnam_tunnel'

If you use trap policies, why would there be a manual initiation? Or did you actually configure auto=start and not auto=route on the strongSwan side?
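
For reference, the difference in ipsec.conf:

conn example-tunnel
    auto=route   # only install trap policies; SAs are created on demand
    # vs.
    auto=start   # initiate the SAs when the daemon starts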

Also, both peers appear to create an SA concurrently, which naturally can lead to duplicates (even with a uniqueness policy).

This looks very strange:

Sun, 2019-01-13 17:15 16[CHD] <vietnam_tunnel|19> CHILD_SA vietnam_tunnel{15} state change: INSTALLING => INSTALLED
...
Sun, 2019-01-13 17:15 05[CHD] <vietnam_tunnel|33> CHILD_SA vietnam_tunnel{32} state change: INSTALLING => INSTALLED

So within the same minute, and without additional log messages in between, there is a jump from IKE_SA identifier 19 to 33 and from CHILD_SA identifier 15 to 32 (these are monotonically increasing integers). Did you filter the log?

Also IKE_SA 19 is deleted shortly after 33 is created, without any information why:

Sun, 2019-01-13 17:15 08[IKE] <vietnam_tunnel|19> deleting IKE_SA vietnam_tunnel[19] between 10.200.0.253[33.102.22.132]...33.102.44.44[ex-ip.du.cex.au.webuy.cx]

Was that a manual termination of that SA? (If it was a uniqueness check there should have been a message.)

The exact same thing described above happens again at Sat, 2019-01-19 17:15.

Also, you say you configured make-before-break reauthentication on both sides. That's definitely not the case for pfSense (where we also see acquires), which logs e.g.:

Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 08[IKE] <con2|221> reauthenticating IKE_SA con2[221]
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 08[IKE] <con2|221> deleting IKE_SA con2[221] between 192.168.1.11[33.102.44.44]...33.102.22.132[33.102.22.132]
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 08[IKE] <con2|221> sending DELETE for IKE_SA con2[221]
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 08[ENC] <con2|221> generating INFORMATIONAL request 140 [ D ]
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 08[NET] <con2|221> sending packet: from 192.168.1.11[4500] to 33.102.22.132[4500] (76 bytes)
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 07[NET] <con2|221> received packet: from 33.102.22.132[4500] to 192.168.1.11[4500] (76 bytes)
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 07[ENC] <con2|221> parsed INFORMATIONAL response 140 [ ]
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 07[IKE] <con2|221> IKE_SA deleted
...
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 07[IKE] <con2|221> restarting CHILD_SA con2
Sun Feb 10 01:00:56 2019;10.5.7.1; <30>Feb 10 01:00:47 charon: 07[IKE] <con2|221> initiating IKE_SA con2[222] to 33.102.22.132

The old IKE_SA is clearly deleted first.

Also, I saw entries like this in the strongSwan log:

creating rekey job for CHILD_SA ...
...
establishing CHILD_SA ...
...
parsed CREATE_CHILD_SA response ... [ N(TS_UNACCEPT) ]
received TS_UNACCEPTABLE notify, no CHILD_SA built
failed to establish CHILD_SA, keeping IKE_SA
...
closing expired CHILD_SA ...
...
scheduling CHILD_SA recreate after hard expire

That is, there are failed rekeyings that cause the recreation of the CHILD_SA to be scheduled once the CHILD_SA expires. This may lead to duplicates if an acquire is received while that replacement is created (if you actually use trap policies). The pfSense side just logs that the traffic selectors are unacceptable (traffic selectors 10.5.7.0/24|/0 === 10.200.1.0/24|/0 inacceptable).

I guess if the pfSense box actually uses the old config interface (stroke/ipsec.conf), this could happen if the config is replaced at runtime: the CHILD_SA config could get lost for a particular IKE_SA, and rekeying or creating new CHILD_SAs may then fail. It shouldn't happen, though, with the new config backend (vici/swanctl.conf), so check what the box is using.

Just look at something like this:

Mon Feb 11 07:18:47 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 11[ENC] <con2|227> parsed CREATE_CHILD_SA request 13119 [ SA No KE TSi TSr ]
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 11[IKE] <con2|227> traffic selectors 10.5.7.0/24|/0 === 10.200.1.0/24|/0 inacceptable
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 11[IKE] <con2|227> failed to establish CHILD_SA, keeping IKE_SA
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 11[ENC] <con2|227> generating CREATE_CHILD_SA response 13119 [ N(TS_UNACCEPT) ]
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 11[NET] <con2|227> sending packet: from 192.168.1.11[4500] to 33.102.22.132[4500] (76 bytes)
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 16[NET] <con2|227> received packet: from 33.102.22.132[4500] to 192.168.1.11[4500] (412 bytes)
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:38 charon: 16[ENC] <con2|227> parsed CREATE_CHILD_SA response 16889 [ SA No KE TSi TSr ]
Mon Feb 11 07:18:48 2019;10.5.7.1; <30>Feb 11 07:18:39 charon: 16[IKE] <con2|227> CHILD_SA con2{45342} established with SPIs c8f54e90_i cbdfc04a_o and TS 10.5.7.0/24|/0 === 10.200.1.0/24|/0

That is, the SA the peer initiated failed; however, at the same time, the SA the pfSense box initiated (with the exact same traffic selectors) was established successfully on the same IKE_SA. (When initiating a CHILD_SA, the corresponding child config is passed to the task explicitly, so that works even when the peer config of the IKE_SA has no child configs associated anymore.)
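
To check which backend the box uses, e.g.:

ipsec statusall        # connections loaded via stroke (ipsec.conf)
swanctl --list-conns   # connections loaded via vici (swanctl.conf)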

The kernel errors on the pfSense side don't look ideal either. I'm not sure what exactly causes them. Maybe the rekey timings are too tight (too little time between soft and hard expiry). Or it could be a bug in the FreeBSD kernel, or even in our kernel-pfkey plugin (e.g. no soft expire is actually sent, or we configure something in a way the kernel expects differently). I'm not sure whether they are actually a problem.

Also, did you try without reauthentication at all (i.e. using regular IKE_SA rekeying)?
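
In ipsec.conf that would look like this (hypothetical conn name, example lifetime):

conn example-tunnel
    reauth=no         # rekey the IKE_SA in place instead of reauthenticating
    ikelifetime=24h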

#15 Updated by Tobias Brunner almost 5 years ago

  • Status changed from Feedback to Closed
  • Resolution set to No feedback