Issue #2450

Traffic hole during CHILD SA rekeying when rekey collision has been detected

Added by Emeric Poupon almost 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: freebsd
Affected version: 5.5.3
Resolution: No change required

Description

Hello,

Here is the test case:
- strongSwan 5.5.3 on FreeBSD 10.3
- 100kfps traffic sent in each tunnel, for each direction
- single TUNNEL connection:

On the left side:

conn "TUNNEL" 
    type = tunnel
    auto = route
    keyexchange = ikev2
    mobike = no
    left = 192.168.0.1
    right = 192.168.0.2
    leftauth = psk
    rightauth = psk
    rightid = 192.168.0.2
    esp = aes128-sha1-modp1024-noesn!
    ike = aes128-sha1-modp1024,blowfish128-sha1-modp1024,3des-sha1-modp1024!
    lifetime = 60
    ikelifetime = 432000
    margintime = 12
    rekeyfuzz = 100%
    leftsendcert = no
    rightsendcert = no
    dpdaction = hold
    dpddelay = 0
    fragmentation = no
    replay_window = 2016
    leftsubnet = 1.1.1.1
    rightsubnet = 1.1.1.2

On the right side :

conn "TUNNEL" 
    type = tunnel
    auto = route
    keyexchange = ikev2
    mobike = no
    left = 192.168.0.2
    right = 192.168.0.1
    leftauth = psk
    rightauth = psk
    rightid = 192.168.0.1
    esp = aes128-sha1-modp1024-noesn!
    ike = aes128-sha1-modp1024,blowfish128-sha1-modp1024,3des-sha1-modp1024!
    lifetime = 60
    ikelifetime = 432000
    margintime = 12
    rekeyfuzz = 100%
    leftsendcert = no
    rightsendcert = no
    dpdaction = hold
    dpddelay = 0
    fragmentation = no
    replay_window = 2016
    leftsubnet = 1.1.1.2
    rightsubnet = 1.1.1.1

We noticed some extra ACQUIRE events from the kernel during the test. Since we rely on this event to establish the tunnel, we end up with more and more CHILD_SAs over time.
It looks like there may be a race in the handling of rekey collisions; please find the logs attached.

charon.debug.left (122 KB) Emeric Poupon, 16.10.2017 11:01
charon.debug.right (306 KB) Emeric Poupon, 16.10.2017 11:01

History

#1 Updated by Tobias Brunner almost 5 years ago

  • Status changed from New to Feedback

You should reduce the log levels for the enc and job log groups. I've also changed some things regarding CHILD_SA rekey collisions with 5.6.0, so you might want to try that. However, the reason for the acquire is the same as that noticed in #2446, i.e. the old rekeyed SA is already expired in the kernel when that SA is explicitly deleted by the peer and the new outbound SA is finally installed. Due to that there is a short while when no outbound SA is installed and the kernel triggers an acquire.
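
To make the "short while when no outbound SA is installed" concrete, here is a minimal Python sketch that finds such gaps in a list of install/removal times. The function name `outbound_gaps` and the timeline values are hypothetical and chosen only for illustration; they are not taken from the attached logs or from strongSwan.

```python
from typing import List, Tuple

# (installed_at, removed_at) in seconds; a hypothetical representation
Interval = Tuple[float, float]


def outbound_gaps(intervals: List[Interval]) -> List[Interval]:
    """Return the time ranges, between the first install and the last
    removal, during which no outbound SA is installed."""
    gaps: List[Interval] = []
    covered_until = None
    for start, end in sorted(intervals):
        if covered_until is not None and start > covered_until:
            # Nothing was installed between covered_until and start.
            gaps.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


# Illustrative timeline: the old SA vanishes from the kernel at t=42 s,
# the replacement outbound SA is only installed at t=45 s.
old_sa = (0.0, 42.0)
new_sa = (45.0, 105.0)
print(outbound_gaps([old_sa, new_sa]))  # [(42.0, 45.0)]
```

Any outbound packet matching the policy inside such a gap would be expected to make the kernel raise an acquire, which matches the symptom described in this issue.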

#2 Updated by Emeric Poupon almost 5 years ago

Tobias Brunner wrote:

You should reduce the log levels for the enc and job log groups. I've also changed some things regarding CHILD_SA rekey collisions with 5.6.0, so you might want to try that. However, the reason for the acquire is the same as that noticed in #2446, i.e. the old rekeyed SA is already expired in the kernel when that SA is explicitly deleted by the peer and the new outbound SA is finally installed. Due to that there is a short while when no outbound SA is installed and the kernel triggers an acquire.

OK, thanks for your support. We will test again with 5.6.0 and let you know.

I forgot to mention that the clocks are not synchronized: "right" is about five minutes behind.

Are you sure it is the same reason? What I see here on "right" is that the outbound SA that won the collision was installed (line 4206) after the redundant CHILD SA was deleted (line 3876).

#3 Updated by Tobias Brunner almost 5 years ago

Are you sure it is the same reason? What I see here on "right" is that the outbound SA that won the collision was installed (line 4206) after the redundant CHILD SA was deleted (line 3876).

Yes, but until then the old CHILD_SA ({4032}) should have been kept installed (its deletion is what triggers the installation of the new outbound SA). However, the outbound SA already expired and was removed from the kernel, which is why the query on line 4185 and the delete on line 4211 both fail.

#4 Updated by Emeric Poupon almost 5 years ago

Tobias Brunner wrote:

Are you sure it is the same reason? What I see here on "right" is that the outbound SA that won the collision was installed (line 4206) after the redundant CHILD SA was deleted (line 3876).

Yes, but until then the old CHILD_SA ({4032}) should have been kept installed (its deletion is what triggers the installation of the new outbound SA). However, the outbound SA already expired and was removed from the kernel, which is why the query on line 4185 and the delete on line 4211 both fail.

Indeed, you are right: it seems the SA ccc99365 disappeared.
But this is strange: the SADB_EXPIRE message for ccc99365 was received at 20:10:11, while ccc99365 was installed at 20:09:29. Since the hard lifetime is set to 60s, it should have been present until 20:10:29.
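
The timing argument can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the two timestamps and the 60s hard lifetime quoted above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

installed = datetime.strptime("20:09:29", FMT)    # SA ccc99365 installed
expire_seen = datetime.strptime("20:10:11", FMT)  # SADB_EXPIRE received
hard_lifetime = timedelta(seconds=60)             # "lifetime = 60" in the config

elapsed = expire_seen - installed
expected_hard_expiry = installed + hard_lifetime

print(elapsed.total_seconds())                                 # 42.0
print(expected_hard_expiry.strftime(FMT))                      # 20:10:29
print((expected_hard_expiry - expire_seen).total_seconds())    # 18.0
```

So the message arrived 42 seconds after installation, 18 seconds before the hard lifetime would have elapsed.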

#5 Updated by Tobias Brunner almost 5 years ago

But this is strange: the SADB_EXPIRE message for ccc99365 was received at 20:10:11, while ccc99365 was installed at 20:09:29. Since the hard lifetime is set to 60s, it should have been present until 20:10:29.

Yes, definitely looks strange. The inbound SA is still there (they have the same hard lifetime configured, however, their soft lifetime might be different due to rekeyfuzz, but there is also no soft expire for the outbound SA) and there never seemed to have been a hard SADB_EXPIRE at all (there are other SAs for which querying fails). Perhaps a kernel issue?
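
As a rough illustration of how rekeyfuzz spreads the soft lifetime, the sketch below computes the range of possible rekey times for the configuration in this issue. It assumes the formula given in the ipsec.conf documentation, rekeytime = lifetime - (margintime + random(0, margintime * rekeyfuzz)); the helper name `rekey_window` is hypothetical and not part of strongSwan.

```python
from typing import Tuple


def rekey_window(lifetime: float, margintime: float,
                 rekeyfuzz_percent: float) -> Tuple[float, float]:
    """Return (earliest, latest) soft lifetime in seconds, assuming
    rekeytime = lifetime - (margintime + random(0, margintime * rekeyfuzz))."""
    max_extra = margintime * rekeyfuzz_percent / 100.0
    earliest = float(lifetime - (margintime + max_extra))
    latest = float(lifetime - margintime)
    return earliest, latest


# Values from the configuration in this issue:
# lifetime = 60, margintime = 12, rekeyfuzz = 100%
print(rekey_window(60, 12, 100))  # (36.0, 48.0)
```

Under that assumption the soft expire of each SA would fall somewhere between 36 and 48 seconds after installation, while the hard lifetime stays at 60 seconds for both directions.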

#6 Updated by Emeric Poupon almost 5 years ago

Yes, definitely looks strange. The inbound SA is still there (they have the same hard lifetime configured, however, their soft lifetime might be different due to rekeyfuzz, but there is also no soft expire for the outbound SA) and there never seemed to have been a hard SADB_EXPIRE at all (there are other SAs for which querying fails). Perhaps a kernel issue?

OK, there are two kernel problems:
- the kernel does not send an EXPIRE message when the HARD timeout is reached,
- the kernel deletes the older IPsec SA if it is set to use only the newest one.

Both problems are already fixed in FreeBSD; we just have to upgrade and test again using strongSwan 5.6.0.
Thanks again for your support.

#7 Updated by Tobias Brunner almost 5 years ago

OK, there are two kernel problems:
- the kernel does not send an EXPIRE message when the HARD timeout is reached,
- the kernel deletes the older IPsec SA if it is set to use only the newest one.

Both problems are already fixed in FreeBSD; we just have to upgrade and test again using strongSwan 5.6.0.

I suspect it would also work with 5.6.0 on this kernel because it won't install the redundant CHILD_SA created during the collision and thus avoids the second issue (and strongSwan doesn't care about EXPIREs received for SAs that are currently being rekeyed). It's strange that the second issue only affected the outbound SA, or was that specifically targeted at those? Do you have any information regarding which kernel versions are affected or which contain the fix?

#8 Updated by Emeric Poupon almost 5 years ago

Tobias Brunner wrote:

OK, there are two kernel problems:
- the kernel does not send an EXPIRE message when the HARD timeout is reached,
- the kernel deletes the older IPsec SA if it is set to use only the newest one.

Both problems are already fixed in FreeBSD; we just have to upgrade and test again using strongSwan 5.6.0.

I suspect it would also work with 5.6.0 on this kernel because it won't install the redundant CHILD_SA created during the collision and thus avoids the second issue (and strongSwan doesn't care about EXPIREs received for SAs that are currently being rekeyed). It's strange that the second issue only affected the outbound SA, or was that specifically targeted at those? Do you have any information regarding which kernel versions are affected or which contain the fix?

Here is the problematic code on the FreeBSD 10 branch:
https://svnweb.freebsd.org/base/stable/10/sys/netipsec/key.c?view=markup#l984

The code has been partially rewritten on HEAD by "ae" and the rewrite has been backported to the 11 branch: the problematic code is still present in 11.0 but not in 11.1.
Note that FreeBSD still uses the old SA by default.

#9 Updated by Tobias Brunner over 4 years ago

  • Category set to freebsd
  • Status changed from Feedback to Closed
  • Assignee set to Tobias Brunner
  • Resolution set to No change required
