Bug #1400: charon is unable to add policy to kernel sometimes during the reauth at runtime. - strongSwan

Bug #1400

charon is unable to add policy to kernel sometimes during the reauth at runtime.

Added by Bin Liu over 9 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Tobias Brunner

Category:

kernel-interface

Target version:

5.5.0

Start date:

13.04.2016

Due date:

Estimated time:

Affected version:

5.4.0

Resolution:

Fixed

Description

Hello,
In my scenario, 10 ipsec tunnels are established between client and server with reauth enabled.About every 40 min, the client initiate reauth reqest to server. With many hours pass by, for example sometimes 20h or more, probably one or two tunnels are down and fail to establish during one reauth process.I read the logs and find that during the reauth at that time, the charon of server is suddenly unable to add policy to kernel by netlink and then delete the SAs created. Pls help me to find the possible reason why charon is unable to add policy to kernel sometimes. I tried many times, the result is the same.

...
Mar 30 23:19:18 ubuntu charon: 15[KNL] unable to add policy 172.17.0.0/16 === 172.16.1.10/32 out
...
Mar 30 23:19:18 ubuntu charon: 15[IKE] unable to install IPsec policies (SPD) in kernel
Mar 30 23:19:18 ubuntu charon: 15[IKE] failed to establish CHILD_SA, keeping IKE_SA

The total log of server is given in the attachment, one of the client ip whose tunnel reauth failed is 172.16.1.10. you can search key words or time in the log.

syslog_charon.rar (898 KB) syslog_charon.rar

Bin Liu, 13.04.2016 05:40

Related issues

History

#1 Updated by Tobias Brunner over 9 years ago

Status changed from New to Feedback
Priority changed from Urgent to Normal

I read the logs and find that during the reauth at that time, the charon of server is suddenly unable to add policy to kernel by netlink and then delete the SAs created. Pls help me to find the possible reason why charon is unable to add policy to kernel sometimes.

The reason is a race condition. For example, at Mar 30 13:33:33 there is one thread deleting the policies of a deleted SA while another thread installs policies for a new SA:

Mar 30 13:33:33 ubuntu charon: 13[KNL] deleting policy 172.16.1.6/32 === 172.17.0.0/16 in  (mark 0/0x00000000)
Mar 30 13:33:33 ubuntu charon: 06[NET] sending packet: from 172.16.254.254[500] to 172.16.1.6[500] (328 bytes)
Mar 30 13:33:33 ubuntu charon: 13[KNL] sending XFRM_MSG_DELPOLICY 39264: => 80 bytes @ 0x7ff6e9de17a0
Mar 30 13:33:33 ubuntu charon: 13[KNL]    0: 50 00 00 00 14 00 05 00 60 99 00 00 D7 0E 00 00  P.......`.......
Mar 30 13:33:33 ubuntu charon: 13[KNL]   16: AC 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 30 13:33:33 ubuntu charon: 13[KNL]   32: AC 10 01 06 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 30 13:33:33 ubuntu charon: 13[KNL]   48: 00 00 00 00 00 00 00 00 02 00 10 20 00 00 00 00  ........... ....
Mar 30 13:33:33 ubuntu charon: 13[KNL]   64: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 30 13:33:33 ubuntu charon: 13[KNL] received (2) 39264: => 36 bytes @ 0x7ff6a800cea0
Mar 30 13:33:33 ubuntu charon: 13[KNL]    0: 24 00 00 00 02 00 00 00 60 99 00 00 D7 0E 00 00  $.......`.......
Mar 30 13:33:33 ubuntu charon: 13[KNL]   16: 00 00 00 00 50 00 00 00 14 00 05 00 60 99 00 00  ....P.......`...
Mar 30 13:33:33 ubuntu charon: 13[KNL]   32: D7 0E 00 00                                      ....
Mar 30 13:33:33 ubuntu charon: 13[KNL] deleting policy 172.16.1.6/32 === 172.17.0.0/16 fwd  (mark 0/0x00000000)

We see thread 13 is starting to delete the FWD policy above. It is the last user of it so the cached reference in the kernel-netlink plugin is removed and the policy is deleted. Before it actually does so, though, it deletes the route associated with it. And at the same time thread 15 is starting to install a new SA and its policies:

Mar 30 13:33:33 ubuntu charon: 13[KNL] sending RTM_DELROUTE 8535: => 60 bytes @ 0x7ff6e9de11f0
Mar 30 13:33:33 ubuntu charon: 13[KNL]    0: 3C 00 00 00 19 00 05 00 57 21 00 00 D7 0E 00 00  <.......W!......
Mar 30 13:33:33 ubuntu charon: 13[KNL]   16: 02 20 00 00 DC 04 00 01 00 00 00 00 08 00 01 00  . ..............
Mar 30 13:33:33 ubuntu charon: 13[KNL]   32: AC 10 01 06 08 00 07 00 AC 11 11 02 08 00 05 00  ................
Mar 30 13:33:33 ubuntu charon: 13[KNL]   48: AC 10 01 06 08 00 04 00 07 00 00 00              ............
Mar 30 13:33:33 ubuntu charon: 15[KNL] sending XFRM_MSG_ALLOCSPI 39265: => 248 bytes @ 0x7ff6e8ddf5e0
...
Mar 30 13:33:33 ubuntu charon: 15[KNL] adding policy 172.17.0.0/16 === 172.16.1.6/32 out  (mark 0/0x00000000)
...
Mar 30 13:33:33 ubuntu charon: 15[KNL] adding policy 172.16.1.6/32 === 172.17.0.0/16 fwd  (mark 0/0x00000000)
Mar 30 13:33:33 ubuntu charon: 15[KNL] sending XFRM_MSG_NEWPOLICY 39270: => 184 bytes @ 0x7ff6e8ddf390

At this point the policy is still installed in the kernel, but since thread 15 didn't find a reference in the policy cache it sends an XFRM_MSG_NEWPOLICY request. However, because the policy still exists this will fail with an EEXIST error. Right after reporting that error thread 13 finally deletes the policy:

Mar 30 13:33:33 ubuntu charon: 15[KNL] unable to add policy 172.16.1.6/32 === 172.17.0.0/16 fwd
Mar 30 13:33:33 ubuntu charon: 13[KNL] sending XFRM_MSG_DELPOLICY 39271: => 80 bytes @ 0x7ff6e9de17a0

You set the "Affected version" field to 5.4.0 (or didn't change the default), however, I don't think you are actually using that version. Because since 5.3.3 the kernel-netlink plugin tries to update an existing policy (using XFRM_MSG_UPDPOLICY) if it receives an EEXIST error (you'd see a log message like policy already exists, try to update it followed by another XFRM message). This was not added for situations like these, though, but to recover from crashes. Because it would actually only have helped here if the delete by thread 13 occurred before the update by thread 15. Otherwise, the race is not handled properly as can be seen in this diagram:

Thread A                  Thread B
delete policy             add policy
<lock>
remove cache entry
</unlock>
                          <lock>
                          add new cache entry
                          </unlock
                          add policy to kernel
                          -> EEXIST
                          update policy in kernel
delete policy in kernel

As you can see, if this happens you'd end up without a policy in the kernel (I guess the race above could be resolved with some elaborate thread synchronization, e.g. a condvar that lets thread B wait for thread A's deletion to complete).

That's one of the many issues with the break-before-make reauthentication approach.

#2 Updated by Bin Liu over 9 years ago

Thanks very much for your analyse.Just as you said, I indeed forget to change the default version. I am sorry for that.
So, what is your suggestion for me to solve this problem. Will this problem be solved if I upgrade the strongswan version to 5.4.0? or need some other solutions ?

#3 Updated by Tobias Brunner over 9 years ago

Just as you said, I indeed forget to change the default version.

What version do you actually use?

So, what is your suggestion for me to solve this problem. Will this problem be solved if I upgrade the strongswan version to 5.4.0? or need some other solutions ?

As explained above it depends on the timing as not all races are currently handled. 5.4.0 (or anything >= 5.3.3) would only help if the XFRM_MSG_DELPOLICY message is sent between the XFRM_MSG_NEWPOLICY and XFRM_MSG_UPDPOLICY messages (or before any of these, of course). This depends on the thread scheduling, which can't really be controlled. Are you running this on a physical machine? With multiple CPU cores?

A workaround is to disable reauthentication and just use IKE rekeying (reauth=no on both sides). Or try to use make-before-break reauthentication as mentioned in #1395.

#4 Updated by Bin Liu over 9 years ago

---With multiple CPU cores?
multiple CPU cores on physical Intel processing board.
---As explained above it depends on the timing as not all races are currently handled.
will you fix it in higher strongswan version since not all races are handled?

#5 Updated by Tobias Brunner over 9 years ago

---As explained above it depends on the timing as not all races are currently handled.
will you fix it in higher strongswan version since not all races are handled?

Maybe.

#6 Updated by Tobias Brunner over 9 years ago

Related to Issue #1439: Reauthentication fails with load-tester plugin added

#7 Updated by Tobias Brunner over 9 years ago

Tracker changed from Issue to Bug
Category set to kernel-interface
Status changed from Feedback to Closed
Assignee set to Tobias Brunner
Target version set to 5.5.0
Resolution set to Fixed

Project

General

Profile

strongSwan

Issues