Project

General

Profile

Feature #192

Reconnecting a routed connection takes a random (60 to 120 sec) amount of time if server closes connection

Added by Ronald Uit over 8 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Low
Category:
libhydra
Target version:
Start date:
08.05.2012
Due date:
Estimated time:
Resolution:
Fixed

Description

Hello strongswan developers,

Everything is working perfectly, however there is one minor oddity that doesn't seem to be intended. When a connection is 'routed' at the client and the remote host (server) shutsdown the connection (but is still available afterwards). Then the connection should be reƫstablished immediatly if the client pings the server, right?

However, these are my observations:

client: ipsec route home
server: ipsec add remote
client: ping server

Connection succesfully established

server: ipsec down remote # for e.g. changing keys, maintanence ...

Connection succesfully terminated

client: ping server
client: ... nothing

However, after a long period of time (approx 1 or 2 minutes):

ronald@Charlie ~ :( $ ping 10.1.9.253
PING 10.1.9.253 (10.1.9.253) 56(84) bytes of data.
64 bytes from 10.1.9.253: icmp_req=120 ttl=64 time=3.90 ms
64 bytes from 10.1.9.253: icmp_req=121 ttl=64 time=1.19 ms
...

I didn't find any relevant timeout values in ipsec.conf nor in strongswan.conf. Is this intended?

ipsec-client.txt (630 Bytes) ipsec-client.txt IPSEC client configuration Ronald Uit, 08.05.2012 08:27
ipsec-server.txt (526 Bytes) ipsec-server.txt IPSEC server configuration Ronald Uit, 08.05.2012 08:27
gateway-server.log.txt (24.2 KB) gateway-server.log.txt Server log Ronald Uit, 08.05.2012 19:28
roadwarrior-client.log.txt (24.7 KB) roadwarrior-client.log.txt Client log Ronald Uit, 08.05.2012 19:28
strongswan.conf.txt (269 Bytes) strongswan.conf.txt Strongswan.conf used on both sides Ronald Uit, 08.05.2012 19:28
3-gateway.log.txt (35.4 KB) 3-gateway.log.txt Gateway annotated timestamped log Ronald Uit, 09.05.2012 15:09
3-ping.log (358 Bytes) 3-ping.log Ping output Ronald Uit, 09.05.2012 15:09
3-roadwarrior.log.txt (36.2 KB) 3-roadwarrior.log.txt Roadwarrior annotated timestamped log Ronald Uit, 09.05.2012 15:09
3-strongswan.conf (241 Bytes) 3-strongswan.conf New strongswan.conf with your suggestions Ronald Uit, 09.05.2012 15:09

History

#1 Updated by Tobias Brunner over 8 years ago

  • Status changed from New to Feedback
  • Assignee changed from Martin Willi to Tobias Brunner

Hi Ronald,

could you attach the client log? The gateway log perhaps too. And could you log with log level 2 for the KNL log group on the client.

Thanks,
Tobias

#2 Updated by Ronald Uit over 8 years ago

Here it is, I also attached the strongswan.conf with the loggin directive to verify that I correctly executed your request.

#3 Updated by Tobias Brunner over 8 years ago

Thanks for the logs. Timestamps might have helped as it is not clear how much time goes by on the client between the deletion of the SA and the acquire generated by the kernel.
Basically between this line:

01[IKE] deleting IKE_SA home[1] between 10.1.9.252[C=NL, O=Questionmark, CN=charlie]...10.1.9.253[C=NL, O=Questionmark, CN=alpha]

and this line:

08[KNL] received a XFRM_MSG_ACQUIRE

Afterwards charon reestablishes the SA and what wee see is this

02[ENC] generating IKE_AUTH request 1 [ ... ]
02[NET] sending packet: from 10.1.9.252[4500] to 10.1.9.253[4500]
12[IKE] retransmit 1 of request with message ID 1
12[NET] sending packet: from 10.1.9.252[4500] to 10.1.9.253[4500]
15[IKE] retransmit 2 of request with message ID 1
15[NET] sending packet: from 10.1.9.252[4500] to 10.1.9.253[4500]
03[NET] received packet: from 10.1.9.253[4500] to 10.1.9.252[4500]

Timestamps would have helped here too but with the default settings the second retransmit is sent after 11 seconds so this accounts for at least a part of the delay.

The gateway does receive all these three messages. Again timestamps would have helped to see when it does so exactly, but it receives the first retransmit after it sent its initial response to the IKE_AUTH request. Since the client receives only one of them it's probably save to assume that there are some problems with the connectivity between the two hosts.

What was the observed delay in this test here? More then 15 seconds?

One other (probably unrelated) issue is this:

00[KNL] received netlink error: Operation not supported (95)
00[KNL] unable to create IPv4 routing table rule

It means your kernel is not compiled with CONFIG_IP_MULTIPLE_TABLES. If you don't want to change that you might want to set charon.routing_table to 0 in strongswan.conf (or use --with-routing-table=0 with ./configure).

#4 Updated by Ronald Uit over 8 years ago

Thank you for your time.

While creating the logs for testing, the timeout was also in the 1 to 2 minute range. I don't remember exactly, but those were long delays as well. I should have confirmed this. I will use the syslog in order to enable timestamps for the next test. Sorry for that...

CONFIG_IP_MULTIPLE_TABLES doesn't have any added benefits for me, I do not require routing based on source ip or TOS. My setup is a simple server-client VPN whereby the client is a roadwarrior.

I will try setting charon.routing_table = 0 in my strongswan.conf file. And rerun the testcase and post my new observations.

Regarding your observations with respect to timeout, I have non-standard settings for these (see my strongswan.conf):

retransmit_base=1
retransmit_timeout=2
retransmit_tries=3

I have set 'base' to 1, which effectively disables the exponential behaviour in timeouts. I added this to aid testing and rule out any weird time-out related issue's.

Thank you

#5 Updated by Ronald Uit over 8 years ago

Hello,

Here is the new data, I attached all relevant logs. Output log now has timestamps, I also annotated it whenever an 'event' occurred.

The output of 3-ping.log indicates a 48 second timeout. However, after bringing down the connection from the server (gateway), it took me 30 seconds to start this command. So it could have easily been > 48. Probably 70 or 80 seconds in total.
This does not matter, if I wait 3 minutes after bringing down the connection and then ping, the connection immediatly succeeds. It's not the fact that I 'bang the door each second', but the fact that the first x seconds are completely ignored. This observation validates this testcase.

As a sidenote, thanks for mentioning the charon.routing_table option. Saves a few nasty warnings on startup ;)

#6 Updated by Tobias Brunner over 8 years ago

May 09 14:41:20 [ronald] ### Reinitiating connection with ping (at remote roadwarrior host)
...
May 09 14:42:44 [charon] 08[KNL] received a XFRM_MSG_ACQUIRE_

Yep, so there is a delay of ~80 seconds between starting to ping and the kernel finally sending the acquire.

Actually, have a look at this:

May 09 14:39:59 [charon] 08[KNL] received a XFRM_MSG_ACQUIRE
...
May 09 14:42:44 [charon] 08[KNL] received a XFRM_MSG_ACQUIRE_

The difference between these two acquires is exactly 165 seconds. Which is precisely the value we configure as lifetime for SAs acquired by the kernel (see source:src/libhydra/plugins/kernel_netlink/kernel_netlink_ipsec.c#L2691).
The default value is 30 seconds but since it is used as lifetime for SPIs allocated from the kernel we increase it in case setting up an SA takes longer than that.

The problem is that the same value (/proc/sys/net/core/xfrm_acq_expires - the name gives it away) is also used for SAs temporarily created when an acquire is sent to the daemon. This SA is very specific (addresses/protocol/ports) and could theoretically be used by the daemon (similar to the SA created when an SPI is allocated). But we currently don't really care for such a specific SA and create our own based on whatever is configured in ipsec.conf. Unfortunately, we also don't delete the temporary SA once we install our own.

So eventually, when you delete the regular SA and then ping again, the temporary SA is still there and the kernel has marked it with XFRM_STATE_ACQ, meaning an acquire has already been sent to the userland. Therefore, any packets matching this SA are ignored until it finally times out.

I'm not sure what the best way to fix this is. We can't really delete the original SA as we currently don't have the exact information of the acquire once we install our own SA (this might change in the future, though). And in most situations it does work fine (still 165 seconds is quite a long time, but then again it is the default retransmission timeout). So in your situation the easiest is probably to reduce the timeout a bit. In the future we might make this value configurable, or actually calculate it based on the configured retransmission settings.

#7 Updated by Ronald Uit over 8 years ago

Thank you for your explanation. I'll change xfrm_acq_expires and see if it doesn't have any nasty side effects...

Thank you for your time and patience!

#8 Updated by Martin Willi about 6 years ago

  • Tracker changed from Bug to Feature
  • Category changed from charon to libhydra
  • Status changed from Feedback to Closed
  • Target version set to 5.2.0
  • Resolution set to Fixed

In the future we might make this value configurable

Since 5.1.1 we have a strongswan.conf option for this purpose, see 255b9dac.

Also available in: Atom PDF