
Bug #2518

MOBIKE and handover-problems

Added by Kristian Evensen almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Category: libcharon
Target version: 5.6.2
Affected version: 5.6.1
Resolution: Fixed

Description

Hello,

I have a Linux client (kernel 4.9) running strongSwan 5.6.1 and have configured a forwarding tunnel to another Linux machine running strongSwan 5.3.5. The client is equipped with two network interfaces (two LTE modems), and routing is configured correctly, i.e., sending traffic over both interfaces (assuming both are available) works fine. The tunnel also works fine in "normal" operation: as long as there is no change in the state of the interface currently used for the tunnel, traffic is correctly forwarded via my server and on to the internet. I have tested both networks to ensure that there are no evil firewalls, NATs, etc.

The client is intended to be mobile, so I am currently testing strongSwan's MOBIKE support and am experiencing some issues making it work properly. When networks go up/down, handover sometimes works and sometimes does not. I have not found a pattern in when handover works; the order of interfaces, etc. does not seem to matter. Here is the log from when handover works (a.a.a.a is the public IP of my server, and 10.0.0.11/2.150.104.242 are the addresses of the modems):

Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> old path is not preferred anymore
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> looking for a route to a.a.a.a ...
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> requesting address change using MOBIKE
Fri Jan 19 09:46:17 2018 daemon.info : 14[ENC] <ipsec-clwy-1|10> generating INFORMATIONAL request 2 [ ]
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> checking path 10.0.0.11[4500] - a.a.a.a[4500]
Fri Jan 19 09:46:17 2018 daemon.info : 14[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to a.a.a.a[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> checking path 10.0.0.11[4500] - 10.10.5.2[4500]
Fri Jan 19 09:46:17 2018 daemon.info : 14[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to 10.10.5.2[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> checking path 10.0.0.11[4500] - 10.24.3.1[4500]
Fri Jan 19 09:46:17 2018 daemon.info : 14[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to 10.24.3.1[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> checking path 10.0.0.11[4500] - 10.10.0.1[4500]
Fri Jan 19 09:46:17 2018 daemon.info : 14[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to 10.10.0.1[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> checking path 10.0.0.11[4500] - 10.10.0.1[4500]
Fri Jan 19 09:46:17 2018 daemon.info : 14[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to 10.10.0.1[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 16[NET] <ipsec-clwy-1|10> received packet: from a.a.a.a[4500] to 10.0.0.11[4500] (76 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 16[ENC] <ipsec-clwy-1|10> parsed INFORMATIONAL response 2 [ ]
Fri Jan 19 09:46:17 2018 daemon.info : 16[ENC] <ipsec-clwy-1|10> generating INFORMATIONAL request 3 [ N(UPD_SA_ADDR) N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) ]
Fri Jan 19 09:46:17 2018 daemon.info : 16[NET] <ipsec-clwy-1|10> sending packet: from 10.0.0.11[4500] to a.a.a.a[4500] (156 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 05[NET] <ipsec-clwy-1|10> received packet: from a.a.a.a[4500] to 10.0.0.11[4500] (156 bytes)
Fri Jan 19 09:46:17 2018 daemon.info : 05[ENC] <ipsec-clwy-1|10> parsed INFORMATIONAL response 3 [ N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) ]
Fri Jan 19 09:46:17 2018 daemon.info : 05[IKE] <ipsec-clwy-1|10> local host is behind NAT, sending keep alives

And here is the log when handover does not work:

Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> old path is not preferred anymore
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> looking for a route to a.a.a.a ...
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> requesting address change using MOBIKE
Fri Jan 19 09:46:48 2018 daemon.info : 06[ENC] <ipsec-clwy-1|10> generating INFORMATIONAL request 4 [ ]
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> checking path 2.150.104.242[4500] - a.a.a.a[4500]
Fri Jan 19 09:46:48 2018 daemon.info : 06[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to a.a.a.a[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> checking path 2.150.104.242[4500] - 10.10.5.2[4500]
Fri Jan 19 09:46:48 2018 daemon.info : 06[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to 10.10.5.2[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> checking path 2.150.104.242[4500] - 10.24.3.1[4500]
Fri Jan 19 09:46:48 2018 daemon.info : 06[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to 10.24.3.1[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> checking path 2.150.104.242[4500] - 10.10.0.1[4500]
Fri Jan 19 09:46:48 2018 daemon.info : 06[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to 10.10.0.1[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 06[IKE] <ipsec-clwy-1|10> checking path 2.150.104.242[4500] - 10.10.0.1[4500]
Fri Jan 19 09:46:48 2018 daemon.info : 06[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to 10.10.0.1[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 15[NET] <ipsec-clwy-1|10> received packet: from a.a.a.a[4500] to 2.150.104.242[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 15[ENC] <ipsec-clwy-1|10> parsed INFORMATIONAL response 4 [ ]
Fri Jan 19 09:46:48 2018 daemon.info : 15[ENC] <ipsec-clwy-1|10> generating INFORMATIONAL request 5 [ N(UPD_SA_ADDR) N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) N(ADD_4_ADDR) ]
Fri Jan 19 09:46:48 2018 daemon.info : 15[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to a.a.a.a[4500] (172 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 14[IKE] <ipsec-clwy-1|10> sending address list update using MOBIKE
Fri Jan 19 09:46:48 2018 daemon.info : 05[NET] <ipsec-clwy-1|10> received packet: from a.a.a.a[4500] to 2.150.104.242[4500] (156 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 05[ENC] <ipsec-clwy-1|10> parsed INFORMATIONAL response 5 [ N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) ]
Fri Jan 19 09:46:48 2018 daemon.info : 05[ENC] <ipsec-clwy-1|10> generating INFORMATIONAL request 6 [ N(ADD_4_ADDR) ]
Fri Jan 19 09:46:48 2018 daemon.info : 05[NET] <ipsec-clwy-1|10> sending packet: from 2.150.104.242[4500] to a.a.a.a[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 10[NET] <ipsec-clwy-1|10> received packet: from a.a.a.a[4500] to 2.150.104.242[4500] (76 bytes)
Fri Jan 19 09:46:48 2018 daemon.info : 10[ENC] <ipsec-clwy-1|10> parsed INFORMATIONAL response 6 [ ]

Running tcpdump, I can see that packets are sent over the correct interface and that they arrive at the server. When checking the output of ipsec statusall, I see that the SA is installed and that the source/destination addresses are correct, so everything seems fine in all cases and on both machines. However, something goes wrong on the server, as the packet/data counters do not increase. One thing I did notice was that the tunnel recovers after rekeying. If I don't wait for rekeying, I have to restart the tunnel to get traffic going again.

Does anyone have any suggestions on what could be wrong or how to proceed in debugging the issue?

Thanks in advance for any help,
Kristian

state_client.log (1.49 KB) state_client.log Kristian Evensen, 24.01.2018 20:18
charon_debug_client.log (48.2 KB) charon_debug_client.log Kristian Evensen, 24.01.2018 20:18
policy_client.log (9.37 KB) policy_client.log Kristian Evensen, 24.01.2018 20:18
policy_server.log (4.55 KB) policy_server.log Kristian Evensen, 24.01.2018 20:18
charon_debug_server.log (36.4 KB) charon_debug_server.log Kristian Evensen, 24.01.2018 20:18
state_server.log (1.38 KB) state_server.log Kristian Evensen, 24.01.2018 20:19
charon_debug_server.log (46.5 KB) charon_debug_server.log Kristian Evensen, 06.02.2018 14:37
charon_debug_client.log (100 KB) charon_debug_client.log Kristian Evensen, 06.02.2018 14:37
charon_debug_client.log (154 KB) charon_debug_client.log Kristian Evensen, 06.02.2018 16:02
charon_debug_server.log (224 KB) charon_debug_server.log Kristian Evensen, 06.02.2018 16:02
charon_debug_recovered_client.log (107 KB) charon_debug_recovered_client.log Kristian Evensen, 06.02.2018 16:17
charon_debug_recovered_server.log (292 KB) charon_debug_recovered_server.log Kristian Evensen, 06.02.2018 16:17

Associated revisions

Revision d58a84e0
Added by Tobias Brunner over 2 years ago

Merge branch 'mobike-nat'

These changes improve MOBIKE task queuing. In particular, we don't
want to ignore the response to an update (with NAT-D payloads) if only
an address list update or DPD is queued, as that could prevent us from
updating the UDP encapsulation in the kernel.

A new optional roam trigger is added to the kernel-netlink plugin based
on routing rule changes. This only works properly, though, if the kernel
based route lookup is used as the kernel-netlink plugin does currently
not consider routing rules for its own route lookup.

Another change prevents acquires during address updates if we have to
update IPsec SAs by deleting and re-adding them. Because the outbound policy
is still installed, an acquire and a temporary SA might get triggered in
the short time no IPsec SA is installed, which could subsequently prevent the
reinstallation of the SA. To this end, we install drop policies before
updating the policies and SAs. These also replace the fallback drop policies
we previously used to prevent plaintext leaks during policy updates (which
reduces the overhead in cases where addresses never or rarely change, as
additional policies will only have to be tracked during address updates).

Fixes #2518.

History

#1 Updated by Kristian Evensen almost 3 years ago

One thing I did not notice when testing last night, but which I see now, is that one of the modems is behind a NAT and the other is not. Could this be the issue? I have now tested with two modems behind NATs (different public IPs), and handover seems to work fine; at least I haven't been able to break it.

#2 Updated by Kristian Evensen almost 3 years ago

I spent some more time today looking into this problem and noticed two things:

1) It happens independently of whether the interface is behind a NAT or not.
2) When the tunnel seemingly hangs, it recovers immediately if the previously available interface becomes available again.

#3 Updated by Tobias Brunner almost 3 years ago

  • Status changed from New to Feedback

Seems strange, especially if the SAs are correctly installed on both ends (as you claim). Could you please provide more information (output of ip -s xfrm state and ip -s xfrm policy, ip route list table all), strongSwan status, and log output (see HelpRequests for log settings; in this case a higher log level for knl might be useful) from different stages of the scenario? Are any ESP packets sent when it does not work? Is traffic broken in both directions? Have you enabled charon.prefer_best_path on the client? What kernel runs on the server?

#4 Updated by Kristian Evensen almost 3 years ago

Hi,

Thank you for your response. I am currently having trouble getting handover to work at all, so I have only been able to produce logs from a scenario where MOBIKE fails. I hope I got the parameters etc. right. My client was equipped with one WAN connection and one LTE modem, while the server only has one interface. I configured the tunnel so that traffic to 8.8.4.4 would go through it. In order to trigger the error, I started pinging 8.8.4.4 and disconnected the WAN cable after a while. When I disconnected the cable, ping stopped working. I then tried to restart ping, just in case connection tracking had gone stale, but nothing changed.

charon_debug_{client,server}.log is the charon log, state_{client,server}.log the output of ip -s xfrm state, and policy_{client,server}.log the output of ip -s xfrm policy.

I am using policy routing with prefer_best_path enabled, and I use the default routing mark and table (0x3000000/0xff000000 and table 220). Before failure, table 220 looks as follows:

8.8.4.4 via 10.0.0.1 dev eth0.2 proto static src 192.168.1.1 
192.168.1.0/24 dev br-lan proto static src 192.168.1.1 

After the failure, the table looks like this:

8.8.4.4 via 176.75.17.152 dev wwan0 proto static src 192.168.1.1
192.168.1.0/24 dev br-lan proto static src 192.168.1.1

The Strongswan status after failure on the client:

Status of IKE charon daemon (strongSwan 5.6.1, Linux 4.9.77, mips):
  uptime: 6 minutes, since Jan 24 19:00:31 2018
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 5
  loaded plugins: charon aes des rc2 sha2 sha1 md5 random nonce x509 revocation constraints pubkey pkcs1 pgp dnskey sshkey pem fips-prf gmp curve25519 xcbc hmac attr kernel-netlink resolve socket-default connmark stroke vici updown xauth-generic
Listening IP addresses:
  192.168.1.1
  176.75.17.151
Connections:
passthrough-0:  %any...%any  IKEv2, dpddelay=300s
passthrough-0:   local:  uses public key authentication
passthrough-0:   remote: uses public key authentication
passthrough-0:   child:  192.168.1.0/24 === 192.168.1.0/24 PASS, dpdaction=clear
passthrough-1:   child:  192.168.1.1/32 === 0.0.0.0/0[tcp/ssh] 0.0.0.0/0[tcp/www] 0.0.0.0/0[tcp/https] PASS, dpdaction=clear
ipsec-clwy-1:  %any...a.a.a.a  IKEv2, dpddelay=300s
ipsec-clwy-1:   local:  [client-2] uses pre-shared key authentication
ipsec-clwy-1:   remote: [server-2] uses pre-shared key authentication
ipsec-clwy-1:   child:  192.168.1.0/24 === 8.8.4.4/32 TUNNEL, dpdaction=clear
Shunted Connections:
passthrough-0:  192.168.1.0/24 === 192.168.1.0/24 PASS
passthrough-1:  192.168.1.1/32 === 0.0.0.0/0[tcp/ssh] 0.0.0.0/0[tcp/www] 0.0.0.0/0[tcp/https] PASS
Security Associations (1 up, 0 connecting):
ipsec-clwy-1[1]: ESTABLISHED 6 minutes ago, 176.75.17.151[client-2]...a.a.a.a[server-2]
ipsec-clwy-1[1]: IKEv2 SPIs: 52567eeb7d9366cc_i* 50b0df6633d04248_r, pre-shared key reauthentication in 45 minutes
ipsec-clwy-1[1]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
ipsec-clwy-1{1}:  INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c37cd252_i cb8f07e4_o
ipsec-clwy-1{1}:  AES_CBC_128/HMAC_SHA1_96, 2268 bytes_i (27 pkts, 279s ago), 30240 bytes_o (360 pkts, 0s ago), rekeying in 9 minutes
ipsec-clwy-1{1}:   192.168.1.0/24 === 8.8.4.4/32

Status on the server after failure:

Status of IKE charon daemon (strongSwan 5.3.5, Linux 4.10.0-21-generic, x86_64):
  uptime: 7 minutes, since Jan 24 19:59:30 2018
  malloc: sbrk 3289088, mmap 532480, used 1093936, free 2195152
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 2
  loaded plugins: charon test-vectors unbound ldap pkcs11 aes rc2 sha1 sha2 md4 md5 random nonce x509 revocation constraints acert pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey dnscert ipseckey pem openssl gcrypt af-alg fips-prf gmp agent chapoly xcbc cmac hmac ctr ccm gcm ntru bliss curl soup mysql sqlite attr kernel-netlink resolve socket-default connmark farp stroke updown eap-identity eap-sim eap-sim-pcsc eap-aka eap-aka-3gpp2 eap-simaka-pseudonym eap-simaka-reauth eap-md5 eap-gtc eap-mschapv2 eap-dynamic eap-radius eap-tls eap-ttls eap-peap eap-tnc xauth-generic xauth-eap xauth-pam xauth-noauth tnc-tnccs tnccs-20 tnccs-11 tnccs-dynamic dhcp whitelist lookip error-notify certexpire led radattr addrblock unity
Virtual IP pools (size/online/offline):
  10.3.0.0/24: 254/0/0
Listening IP addresses:
  a.a.a.a
  10.24.3.1
Connections:
  controller:  a.a.a.a...%any  IKEv2
  controller:   local:  [osl-controller] uses pre-shared key authentication
  controller:   remote: uses pre-shared key authentication
  controller:   child:  0.0.0.0/0 === 192.168.0.0/16 TUNNEL
    server-2:  a.a.a.a...%any  IKEv2
    server-2:   local:  [server-2] uses pre-shared key authentication
    server-2:   remote: uses pre-shared key authentication
    server-2:   child:  0.0.0.0/0 === 192.168.0.0/16 TUNNEL
Security Associations (1 up, 0 connecting):
    server-2[2]: ESTABLISHED 6 minutes ago, a.a.a.a[server-2]...176.75.17.151[client-2]
    server-2[2]: IKEv2 SPIs: cc66937deb7e5652_i 4842d03366dfb050_r*, pre-shared key reauthentication in 48 minutes
    server-2[2]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
    server-2{1}:  INSTALLED, TUNNEL, reqid 1, ESP SPIs: cb8f07e4_i c37cd252_o
    server-2{1}:  AES_CBC_128/HMAC_SHA1_96, 7644 bytes_i, 2268 bytes_o, rekeying in 8 minutes
    server-2{1}:   8.8.4.4/32 === 192.168.1.0/24

When looking at the traffic with tcpdump, I see that ESP packets are sent from the client to the server, and the TX counter output by strongSwan increases on the server.

Thanks for all the help so far.

#5 Updated by Tobias Brunner almost 3 years ago

The problem seems to be that the client uses UDP encapsulation after the handover, while the server does not. From the log it looks like the client only updates the SA before sending the UPDATE_SA_ADDRESS notify to the peer (i.e. before noticing the change of NAT state that's determined with the NATD_* payloads in that INFORMATIONAL exchange), and there does not seem to be another SA update after the exchange. I have to look into this in more detail; as a workaround, you could probably get this to work with forceencaps=yes on the client. There is actually a test case for this scenario (ikev2/mobike), which seems to work fine, so I'm not sure what exactly goes wrong in your case.
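As a rough sketch, the suggested workaround would go in the client's connection definition in ipsec.conf (the connection name and other settings here are placeholders taken from this report, not a verified configuration):

```ini
conn ipsec-clwy-1
    keyexchange=ikev2
    mobike=yes          # MOBIKE is enabled by default for IKEv2
    forceencaps=yes     # force UDP encapsulation of ESP even if no NAT is detected
```

With forceencaps=yes both peers use UDP encapsulation unconditionally, which sidesteps a NAT-state mismatch after a handover at the cost of a small per-packet overhead.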

#6 Updated by Tobias Brunner almost 3 years ago

There is actually a test case for this scenario (ikev2/mobike), which seems to work fine.

Actually, the scenario does not switch from NAT to no NAT.

#7 Updated by Kristian Evensen almost 3 years ago

Hi Tobias,

Thanks for the detailed replies and sorry for my late answer. I tested adding forceencaps and that seems to work fine. I have been switching between the networks for the better part of an hour now, and the tunnel always moves between the two networks with no interruptions.

#8 Updated by Tobias Brunner almost 3 years ago

  • Tracker changed from Issue to Bug
  • Target version set to 5.6.2

I tested adding forceencaps and that seems to work fine. I have been switching between the networks for the better part of an hour now, and the tunnel always moves between the two networks with no interruptions.

OK. I tried to reproduce the issue (see the scenario in the 2518-mobike-nat branch), but wasn't able to. Switching from a public IP to an IP behind a NAT and back works fine in the testing scenario. Moving from behind the NAT looks like this there:

As you can see, the SAs are updated again after detecting the client is not behind a NAT anymore (that log message saying so is new, but otherwise the code is unchanged).

I think the problem in your case is that there is another MOBIKE update while the address update is being performed (but after the connectivity checks already were done).

The message to update the SA is sent:

Wed, 2018-01-24 19:02 10[ENC] <ipsec-clwy-1|1> generating INFORMATIONAL request 3 [ N(UPD_SA_ADDR) N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) N(ADD_4_ADDR) ]
Wed, 2018-01-24 19:02 10[KNL] <ipsec-clwy-1|1> using 176.75.17.151 as address to reach a.a.a.a/32
Wed, 2018-01-24 19:02 10[NET] <ipsec-clwy-1|1> sending packet: from 176.75.17.151[4500] to a.a.a.a[4500] (172 bytes)

But there is a network change triggering another update, which is queued after the message above has been sent and the thread can lock the IKE_SA:

Wed, 2018-01-24 19:02 12[KNL] creating roam job due to address/link change
...
Wed, 2018-01-24 19:02 12[KNL] <ipsec-clwy-1|1> using 176.75.17.151 as address to reach a.a.a.a/32
Wed, 2018-01-24 19:02 12[IKE] <ipsec-clwy-1|1> keeping connection path 176.75.17.151 - a.a.a.a
Wed, 2018-01-24 19:02 12[IKE] <ipsec-clwy-1|1> sending address list update using MOBIKE
Wed, 2018-01-24 19:02 12[IKE] <ipsec-clwy-1|1> queueing IKE_MOBIKE task
Wed, 2018-01-24 19:02 12[IKE] <ipsec-clwy-1|1> delaying task initiation, INFORMATIONAL exchange in progress

This increases the pending updates counter and when the response is received

Wed, 2018-01-24 19:02 06[NET] <ipsec-clwy-1|1> received packet: from a.a.a.a[4500] to 176.75.17.151[4500] (156 bytes)
Wed, 2018-01-24 19:02 06[ENC] <ipsec-clwy-1|1> parsed INFORMATIONAL response 3 [ N(NATD_S_IP) N(NATD_D_IP) N(COOKIE2) ]

it is silently ignored, so the returned NAT-D payloads are not processed and there is no update of the SAs in the kernel. Then the queued MOBIKE task runs:

Wed, 2018-01-24 19:02 06[IKE] <ipsec-clwy-1|1> activating new tasks
Wed, 2018-01-24 19:02 06[IKE] <ipsec-clwy-1|1>   activating IKE_MOBIKE task
Wed, 2018-01-24 19:02 06[KNL] <ipsec-clwy-1|1> using 176.75.17.151 as address to reach a.a.a.a/32
Wed, 2018-01-24 19:02 06[ENC] <ipsec-clwy-1|1> generating INFORMATIONAL request 4 [ N(ADD_4_ADDR) ]
Wed, 2018-01-24 19:02 06[KNL] <ipsec-clwy-1|1> using 176.75.17.151 as address to reach a.a.a.a/32
Wed, 2018-01-24 19:02 06[NET] <ipsec-clwy-1|1> sending packet: from 176.75.17.151[4500] to a.a.a.a[4500] (76 bytes)
Wed, 2018-01-24 19:02 07[NET] <ipsec-clwy-1|1> received packet: from a.a.a.a[4500] to 176.75.17.151[4500] (76 bytes)
Wed, 2018-01-24 19:02 07[ENC] <ipsec-clwy-1|1> parsed INFORMATIONAL response 4 [ ]

The problem is that this update only sends additional addresses; it does not request an address update or include any NAT-D payloads, so the NAT situation is not evaluated after this exchange. The additional address is actually also redundant, as the same address was already sent with the exchange that included the UPDATE_SA_ADDR notify (but there is currently no check whether running or pending updates do the same thing).

I'm not yet sure what the best way to fix this is. The pending updates counter seems like a thing of the past, as in the current code base there can be at most one active and one queued MOBIKE task, so we should probably be able to coordinate work between the two a bit better (in particular, we should not ignore the response if the queued task does not update the SA's addresses).

#9 Updated by Tobias Brunner over 2 years ago

I've pushed some commits to the 2518-mobike-nat branch, which try to address the MOBIKE update issue seen above. First, queued MOBIKE updates are now additive (i.e. if an address list update is already queued we don't just throw away a roam event but update the task to perform both). Second, we only ignore responses to the active MOBIKE task if there actually is a roam task queued, not if it is a task used for DPD or to update the list of additional addresses.
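The "additive" queuing idea can be sketched in a few lines of C. This is purely illustrative: the flag and function names below are made up for this sketch and are not strongSwan's actual identifiers.

```c
/* Hypothetical sketch of additive MOBIKE task queuing: merge new work
 * into the already-queued task instead of dropping the event, and only
 * ignore the active task's response when real roam work is pending. */
enum {
    MOBIKE_ROAM      = 1 << 0,  /* path changed: update SA addresses */
    MOBIKE_ADDRESSES = 1 << 1,  /* advertise additional addresses */
};

typedef struct {
    unsigned queued;  /* pending MOBIKE work, 0 if nothing is queued */
} mobike_queue_t;

/* OR the requested work into the queued task so that a roam event and
 * an address list update both survive, in a single task. */
static void queue_update(mobike_queue_t *q, unsigned work)
{
    q->queued |= work;
}

/* The response to the active MOBIKE task may only be ignored if roam
 * work is queued, not for a mere address-list update or DPD. */
static int may_ignore_response(const mobike_queue_t *q)
{
    return (q->queued & MOBIKE_ROAM) != 0;
}
```

Under this scheme the NAT-D payloads in the response are always processed unless a queued roam task is about to redo the address update anyway, which is the behavior change described above.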

#10 Updated by Kristian Evensen over 2 years ago

Sorry it took a while before I was able to test your patches. I have been spending all morning switching between the two problematic networks, and the results are much better than before. Handover now usually works, but I sometimes still see failures. I haven't been able to create a reliable test case for triggering the error; I will update the issue with logs when I do.

The symptoms when handover fails are the same, though. The log messages with default logging are identical, and ipsec statusall on both server and client indicates that everything should be fine. If I connect the previous network again (i.e., the one we switched from), the tunnel recovers immediately.

Thanks for all the help so far!

#11 Updated by Kristian Evensen over 2 years ago

I was a bit too fast hitting "Send", sorry about that. There was one thing I forgot to mention: I sometimes saw ~5 second periods of packet loss when switching networks. When checking the logs, I found the following messages:

Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> old path is not preferred anymore
Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> looking for a route to a.a.a.a ...
Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> requesting address change using MOBIKE
Thu Feb  1 10:44:36 2018 daemon.info : 15[ENC] <ipsec-clwy-1|1> generating INFORMATIONAL request 84 [ ]
Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> checking path 10.0.0.11[4500] - a.a.a.a[4500]
Thu Feb  1 10:44:36 2018 daemon.info : 15[NET] <ipsec-clwy-1|1> sending packet: from 10.0.0.11[4500] to a.a.a.a[4500] (76 bytes)
Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> checking path 10.0.0.11[4500] - 10.10.5.2[4500]
Thu Feb  1 10:44:36 2018 daemon.info : 15[NET] <ipsec-clwy-1|1> sending packet: from 10.0.0.11[4500] to 10.10.5.2[4500] (76 bytes)
Thu Feb  1 10:44:36 2018 daemon.info : 15[IKE] <ipsec-clwy-1|1> checking path 10.0.0.11[4500] - 10.24.3.1[4500]
Thu Feb  1 10:44:36 2018 daemon.info : 15[NET] <ipsec-clwy-1|1> sending packet: from 10.0.0.11[4500] to 10.24.3.1[4500] (76 bytes)
Thu Feb  1 10:44:37 2018 daemon.info : 08[IKE] <ipsec-clwy-1|1> old path is not preferred anymore
Thu Feb  1 10:44:37 2018 daemon.info : 08[IKE] <ipsec-clwy-1|1> looking for a route to a.a.a.a ...
Thu Feb  1 10:44:37 2018 daemon.info : 08[IKE] <ipsec-clwy-1|1> requesting address change using MOBIKE
Thu Feb  1 10:44:37 2018 daemon.info : 09[IKE] <ipsec-clwy-1|1> old path is not preferred anymore
Thu Feb  1 10:44:37 2018 daemon.info : 09[IKE] <ipsec-clwy-1|1> looking for a route to a.a.a.a ...
Thu Feb  1 10:44:37 2018 daemon.info : 09[IKE] <ipsec-clwy-1|1> requesting address change using MOBIKE
Thu Feb  1 10:44:37 2018 daemon.info : 09[DMN] <ipsec-clwy-1|1> thread 9 received 11
Thu Feb  1 10:44:37 2018 daemon.info : 09[LIB] <ipsec-clwy-1|1> no support for capturing backtraces
Thu Feb  1 10:44:37 2018 daemon.info : 09[DMN] <ipsec-clwy-1|1> killing ourself, received critical signal
Thu Feb  1 10:44:37 2018 authpriv.info ipsec_starter[20027]: charon has died -- restart scheduled (5sec)

#12 Updated by Tobias Brunner over 2 years ago

Thu Feb  1 10:44:37 2018 daemon.info : 09[DMN] <ipsec-clwy-1|1> thread 9 received 11

Ugh, there was a small copy-n-paste error (an enumerator was destroyed twice because there was a return statement instead of a break at one location). I've pushed an updated commit to the branch.
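The bug class described here (a `return` where a `break` was intended, destroying an enumerator twice) can be illustrated with a self-contained C sketch. The types and names below are invented for the illustration, not copied from strongSwan.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an enumerator that must be destroyed
 * exactly once; `destroyed` counts destroy() calls for illustration. */
typedef struct {
    int pos, count, destroyed;
} enumerator_t;

static enumerator_t *create_enumerator(int count)
{
    enumerator_t *e = calloc(1, sizeof(*e));
    e->count = count;
    return e;
}

static int enumerate(enumerator_t *e, int *item)
{
    if (e->pos >= e->count)
        return 0;
    *item = e->pos++;
    return 1;
}

static void destroy(enumerator_t *e)
{
    e->destroyed++;  /* a real destroy() would free; a second call would crash */
}

/* Correct pattern: break out of the loop so the single destroy() call
 * below runs exactly once. The crash occurred because one code path
 * used `return` after the loop's early exit, so the shared cleanup
 * destroyed the enumerator a second time. */
static int find_first_even(enumerator_t *e)
{
    int item, found = -1;

    while (enumerate(e, &item))
    {
        if (item % 2 == 0)
        {
            found = item;
            break;  /* not `return`: cleanup below must still run */
        }
    }
    destroy(e);
    return found;
}
```

In a codebase with shared cleanup at the end of a function, an early `return` silently skips or duplicates that cleanup, which is exactly the kind of error that manifests as the SIGSEGV (signal 11) seen in the log above.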

#13 Updated by Kristian Evensen over 2 years ago

Hi,

Sorry for the delay in testing yet again. I just tested the latest patches and I still see the error. To simplify my setup, I changed it slightly: my device now has two connections to the same network (both behind NAT). The rest of the setup (configuration files, etc.) is the same as earlier. I am also using a modified version of the rule patch where routes are installed, because I am debugging an (at least seemingly unrelated) issue. When the error strikes, the routing table looks correct (a default route is installed).

The error manifests itself in the same way as earlier. I disconnect/connect the priority link, and this can work fine for a long time before the tunnel "suddenly" dies when I bring the priority link up again (the error always hits when I bring up the link). The output of ipsec statusall, etc. looks the same as in my previous comment (after failure). I have attached logs from the client and server; please let me know if I should change some logging parameters or include more information.

#14 Updated by Tobias Brunner over 2 years ago

I just tested the latest patches and I still see the error.

What error are you referring to exactly? You mean you have UDP encapsulation on one end and not on the other?

The rest of the setup (configuration files, etc.) is the same as earlier.

With or without forceencaps=yes?

(the error always hits when I bring up the link)

What's the name of that interface?

#15 Updated by Kristian Evensen over 2 years ago

Tobias Brunner wrote:

What error are you referring to exactly? You mean you have UDP encapsulation on one end and not on the other?

Sorry, I meant that data is no longer sent from the server over the tunnel.

With or without forceencaps=yes?

I have disabled forceencaps.

What's the name of that interface?

eth0.3.

#16 Updated by Tobias Brunner over 2 years ago

The client log looks fine (there are some NAT mapping triggers for address update notifies that might be superfluous, but that should not affect this issue here). Unfortunately, the server log does not match the client session (it seems like you re-uploaded the previous server log).

#17 Updated by Kristian Evensen over 2 years ago

There was one observation I forgot to mention in my previous comment: the amount of traffic going through the tunnel seems to influence the probability of a successful handover.

In order to make my tests somewhat automated and reproducible, I created a small script which disconnects/connects the priority link every 30 seconds. When I ran a small amount of traffic through the tunnel (ping), I was able to run the script for hours and did not have any problems with the tunnel seemingly dying. I verified that there were no periods of lost connectivity by checking the packet loss (0%).

When I ran a large amount of traffic through the tunnel (downloading a file), the tunnel always failed after ~3-4 handovers. The speed of the download is roughly the same as that of my connection, i.e., the link is congested. Could the problem be that a packet is lost and not retransmitted?

Also, I have not patched the server; should I do that too?

#18 Updated by Kristian Evensen over 2 years ago

Tobias Brunner wrote:

The client log looks fine (there are some NAT mapping triggers for address update notifies that might be superfluous, but that should not affect this issue here). Unfortunately, the server log does not match the client session (it seems like you re-uploaded the previous server log).

Very sorry about that. Here are the correct logs.

#19 Updated by Tobias Brunner over 2 years ago

When I ran a large amount of traffic through the tunnel (downloading a file), the tunnel always failed after ~3-4 handovers. The speed of the download is roughly the same as that of my connection, i.e., the link is congested. Could the problem be that a packet is lost and not retransmitted?

Does it recover when you do nothing (i.e. no interface/address/route changes occur)? What if you stop the data transfer but change nothing else, does that help? Or is the connection completely broken after this happens (until, perhaps, the interface changes again)? Do you have the logs from such a situation? IKE packets are retransmitted, so several packets must not be getting through to prevent an address update (which, I guess, is possible if you completely saturate the link for a longer while).

Also, I have not patched the server, I don't know if I should do that too?

Not necessary, the patches only affect the behavior on the client.

#20 Updated by Kristian Evensen over 2 years ago

Tobias Brunner wrote:

When I ran a large amount of traffic through the tunnel (downloading a file), the tunnel always failed after ~3-4 handovers. The speed of the download is roughly the same as that of my connection, i.e., the link is congested. Could the problem be that a packet is lost and not retransmitted?

Does it recover when you do nothing (i.e. no interface/address/route changes occur)? What if you stop the data transfer but change nothing else, does that help? Or is the connection completely broken after this happens (until, perhaps, the interface changes again)? Do you have the logs from such a situation? IKE packets are retransmitted, so several packets must not be getting through to prevent an address update (which, I guess, is possible if you completely saturate the link for a longer while).

Yes, the tunnel does recover after a while when doing nothing. I have attached logs from such a situation (around 15:10/11 on the client, 16:10/11 on the server), but I am not able to see exactly what the trigger is. The tunnel also recovers on rekeying. Stopping the data transfer, reconnecting interfaces, etc. have no effect (i.e., the tunnel stays completely broken).

#21 Updated by Tobias Brunner over 2 years ago

I have attached logs from such a situation (around 15:10/11 on the client, 16:10/11 on the server), but I am not able to see exactly what the trigger is.

Hm, at 15:11 there is a hard expire of an IPsec SA on the client:

Tue, 2018-02-06 15:11 10[KNL] received a XFRM_MSG_EXPIRE
Tue, 2018-02-06 15:11 10[KNL] creating delete job for CHILD_SA ESP/0xc0548b5b/a.a.a.a
Tue, 2018-02-06 15:11 10[IKE] <ipsec-clwy-1|1> queueing CHILD_DELETE task
Tue, 2018-02-06 15:11 10[IKE] <ipsec-clwy-1|1> activating new tasks
Tue, 2018-02-06 15:11 10[IKE] <ipsec-clwy-1|1>   activating CHILD_DELETE task
Tue, 2018-02-06 15:11 10[IKE] <ipsec-clwy-1|1> closing expired CHILD_SA ipsec-clwy-1{1} with SPIs c78bb3c3_i c0548b5b_o and TS 192.168.31.128/28 === 0.0.0.0/0

This should not happen, as the SAs usually get rekeyed and deleted beforehand.

There are several strange error messages regarding this SA in the log. At 15:08 there is an attempt to update it, but that fails:

Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> querying replay state from SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> deleting SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> deleted SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> updating SAD entry with SPI c0548b5b from 10.0.0.62[4500]..a.a.a.a[4500] to 10.0.0.70[4500]..a.a.a.a[4500]
Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> received netlink error: Invalid argument (22)
Tue, 2018-02-06 15:08 10[KNL] <ipsec-clwy-1|1> unable to update SAD entry with SPI c0548b5b

This is quite strange. It's also strange that the kernel complains while the SA is apparently still installed (an expire is later triggered for it after all). Such errors are currently not handled as fatal (i.e. the SA is not torn down and recreated), but, of course, outbound traffic won't be processed correctly if the SA is not properly installed.

A bit further back in the log there is another failure during an update, but without an explicit kernel error. And this finally explains what's going on (I actually noticed the acquires before, but didn't fully recognize their significance):

Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> querying SAD entry with SPI c0548b5b for update
Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> querying replay state from SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> deleting SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:07 12[KNL] received a XFRM_MSG_ACQUIRE
Tue, 2018-02-06 15:07 12[KNL]   XFRMA_TMPL
Tue, 2018-02-06 15:07 12[KNL] creating acquire job for policy 192.168.31.136/32[tcp/49232] === a.a.a.a/32[tcp/www] with reqid {1}
Tue, 2018-02-06 15:07 12[CFG] trap not found, unable to acquire reqid 1
Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> deleted SAD entry with SPI c0548b5b
Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> updating SAD entry with SPI c0548b5b from 10.0.0.62[4500]..a.a.a.a[4500] to 10.0.0.70[4500]..a.a.a.a[4500]
Tue, 2018-02-06 15:07 11[KNL] <ipsec-clwy-1|1> unable to update SAD entry with SPI c0548b5b

Let me explain what's going on. In order to update the addresses of the installed SAs in the kernel, we can't directly update the existing SAs, as the kernel uses the addresses as keys (together with the protocol and SPI). So instead we query the current state of the SA (sequence number, replay state, keys etc.), delete the SA, and install it again with the queried data and updated addresses (plus encap state if necessary).

The problem is that some outbound traffic might hit the outbound IPsec policy in that short time window when the SA is not installed. If that happens, an acquire is triggered and the kernel creates a temporary SA. Because we set the SPI in the outbound policy, that temporary SA will probably also have that SPI set. So my guess is that the error above indicates the kernel returned an EEXIST error, which we don't explicitly log but handle specifically where it applies (in this case it should never apply because we explicitly delete the SA, so it is not handled).
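The query-delete-re-add sequence and the race it opens can be sketched with a toy model. This is hypothetical Python, not strongSwan code; the names are made up, and it only models the relevant kernel behavior (SAs keyed on SPI/protocol/destination, acquires creating temporary SAs):

```python
# Toy model (hypothetical, not strongSwan code) of the address update race
# described above. The kernel keys SAs on (SPI, protocol, destination).

class KernelSAD:
    """Minimal stand-in for the kernel's SA database."""
    def __init__(self):
        self.sas = {}  # (spi, proto, dst) -> SA state

    def add(self, spi, dst, keys):
        key = (spi, 'esp', dst)
        if key in self.sas:
            return 'EEXIST'  # duplicate key: the kernel refuses the add
        self.sas[key] = {'keys': keys}
        return 'OK'

    def delete(self, spi, dst):
        self.sas.pop((spi, 'esp', dst), None)

    def outbound_packet(self, spi, dst):
        """A packet hitting the outbound policy: with no SA installed, the
        kernel raises an acquire and creates a keyless temporary SA that
        reuses the SPI set in the policy."""
        if (spi, 'esp', dst) not in self.sas:
            self.add(spi, dst, keys=None)
            return 'ACQUIRE'
        return 'SENT'

def update_sa_address(sad, spi, dst, traffic_in_window):
    """Query-delete-re-add, since the addresses are part of the lookup key.
    For an outbound SA on the client, only the source changes, so the key
    of the re-added SA is identical to the deleted one."""
    state = sad.sas[(spi, 'esp', dst)]  # query current state (keys etc.)
    sad.delete(spi, dst)                # SA gone: the race window opens
    if traffic_in_window:
        sad.outbound_packet(spi, dst)   # acquire installs a temporary SA
    return sad.add(spi, dst, state['keys'])  # re-add with updated source
```

With no traffic in the window the re-add returns OK; with traffic it returns EEXIST because the temporary SA already occupies the key, which corresponds to the guessed cause of the unexplained "unable to update SAD entry" error above.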

What happens later, and causes the EINVAL error, is that we query the temporary SA and try to install that again. But because it has no keys etc., the add message lacks a lot of required data (hence the invalid argument error). This also explains the expire we receive later: the temporary SAs the kernel creates expire, by default, 165 seconds after they were created if they are not completed (which never happens here; the temporary SA is also never deleted, because we can't actually delete such temporary SAs).

The above also explains why you don't see this issue in low-traffic situations. It's only a problem if a packet hits the outbound policy exactly when no SA is installed, which is a lot more likely if there is a lot of traffic.

OK, so how do we fix this? We could try to install the new SA before deleting the old one, however, that only works if the destination address is different (to check for existing SAs the kernel uses the tuple SPI/protocol/destination address), so for outbound SAs on the client that won't work; we'd just get an EEXIST back (same as above when we try to replace the temporary SA created by the kernel). What we could try is to resend the add message as an update, which the kernel might accept. But the kernel is pretty picky about what data it actually allows in update messages, so that might not work either. And it would also not prevent the acquire in the first place, which might be problematic if trap policies are used (in your case the acquire has no effect because no trap policy is found in the manager).

Another approach might be to fiddle with the policies. We actually already do something to that end during policy updates (which I now see is not very effective in some cases). When updating the policies (currently the SAs are updated first, then the policies), we remove the old policies while matching low-priority outbound drop policies are installed to prevent plaintext traffic from leaving the host, and then reinstall the new policies and update the fallback drop policies with the new addresses. These fallback policies wouldn't actually help if trap policies are installed (although this is currently rarely the case in such scenarios, as traps don't really work with virtual IPs) because they have lower priority than traps (so acquires would still get triggered). Anyway, I think we could replace these fallback policies with actual drop policies that have the same priority as the regular policies and are only installed while updating the policies and SAs. I've pushed a commit to the branch that does that. This change might also be relevant on the server, as it will have to update the SAs as well.
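The effect of the proposed fix can be sketched with another toy model (again hypothetical Python, not the actual commit): an equal-priority drop policy installed for the duration of the update means a packet arriving in the window is dropped instead of triggering an acquire, so no temporary SA is ever created:

```python
# Toy illustration (hypothetical, not strongSwan code) of why equal-priority
# drop policies close the race window during an address update.

class Kernel:
    def __init__(self):
        self.sa_installed = True
        self.drop_policy = False
        self.acquires = 0

    def outbound_packet(self):
        if self.drop_policy:
            return 'DROPPED'        # drop policy matches first, no acquire
        if not self.sa_installed:
            self.acquires += 1      # kernel raises an acquire and creates
            return 'ACQUIRE'        # a temporary SA (the problematic case)
        return 'SENT'

def update_addresses(kernel, traffic_in_window):
    """Bracket the SA/policy update with a same-priority drop policy."""
    kernel.drop_policy = True       # install the drop policy first
    kernel.sa_installed = False     # delete the old SA (query omitted)
    if traffic_in_window:
        kernel.outbound_packet()    # hits the drop policy, no acquire
    kernel.sa_installed = True      # re-add SA with updated addresses
    kernel.drop_policy = False      # remove the drop policy again
```

The dropped packet is lost, but the transport layer retransmits it, which is far cheaper than a broken SA that only recovers on expiry or rekeying.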

#22 Updated by Kristian Evensen over 2 years ago

I have been spending the afternoon testing with the latest patch, and it seems the problem is fixed. I have so far only patched the client, but since my tests are mainly RX (the client receiving large amounts of data), I guess the error is most likely to strike at the client. My tunnel has now survived four hours of continuous data transfer and changes in the priority link. I made one two-hour test where I changed the priority link, and another two-hour test where I enabled/disabled the priority link, to cover both cases. No errors were seen in either case.

Thanks a lot for all the help and detailed explanations!

#23 Updated by Tobias Brunner over 2 years ago

Thanks for the feedback. I pushed another commit to the branch that avoids the unnecessary double update after detecting changed NAT mappings during an address update.

#24 Updated by Kristian Evensen over 2 years ago

Thanks. I forgot to write it in my previous message, but I have also been testing the NAT/non-NAT case (as well as NAT/NAT), and handover works perfectly.

#25 Updated by Tobias Brunner over 2 years ago

  • Category set to libcharon
  • Status changed from Feedback to Closed
  • Assignee set to Tobias Brunner
  • Resolution set to Fixed

I forgot to write it in my previous message, but I have also been testing the NAT/non-NAT case (as well as NAT/NAT), and handover works perfectly.

Great, thanks. I've merged the changes to master.
