Issue #3388

DPD action and Transmit timeout

Added by Krishnamurthy Daulatabad 3 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Category:
libcharon
Affected version:
5.7.1
Resolution:
No change required

Description

In our IKE configuration we have dpd_action=restart and have enabled transmit timeouts (of 9 seconds). Once the tunnel is established, DPD keepalives are also subject to transmit timeouts. Is that right? Will transmit timeouts always trigger tunnel reestablishment in the case of an initiator?

So, given the above DPD and transmit timeout configuration, can a normal control packet retransmit timeout and a DPD transmit timeout possibly trigger tunnel reestablishments simultaneously? We are seeing this behaviour in the logs below. After some time we hit the "giving up after 5 retries" issue (MAX_RETRIES being hit in libcharon/sa/ikev2/tasks/ike_init.c) and tunnel reestablishment stops permanently.

Instance 1:

2020-03-28T16:04:14.0+0000 14[IKE] <server-test-1|31> CHILD_SA server-test-1{31} established with SPIs 6b61518e_i 10c3e0de_o and TS 172.16.105.29/32 === 103.135.57.126/32
2020-03-28T16:04:14.0+0000 14[IKE] <server-test-1|31> peer supports MOBIKE
2020-03-28T16:04:14.0+0000 09[IKE] <server-test-1|31> received message ID 1, expected 2, ignored
2020-03-28T16:04:19.0+0000 15[IKE] <server-test-1|31> retransmit 1 of request with message ID 2
2020-03-28T16:04:21.0+0000 12[IKE] <server-test-1|31> retransmit 2 of request with message ID 2
2020-03-28T16:04:23.0+0000 16[IKE] <server-test-1|31> giving up after 2 retransmits
2020-03-28T16:04:23.0+0000 16[IKE] <server-test-1|31> restarting CHILD_SA server-test-1
2020-03-28T16:04:23.0+0000 16[IKE] <server-test-1|31> initiating IKE_SA server-test-1[32] to 103.135.57.126
2020-03-28T16:04:23.0+0000 15[IKE] <server-test-1|32> initiating IKE_SA server-test-1[32] to 103.135.57.126

Instance 2:

2020-03-28T16:21:04.0+0000 15[IKE] <server-test-1|49> authentication of 'client-test-1' (myself) with RSA_EMSA_PKCS1_SHA2_256 successful
2020-03-28T16:21:04.0+0000 15[IKE] <server-test-1|49> sending end entity cert "C=, O=abc.com, OU=, CN=client-test-1" 
2020-03-28T16:21:04.0+0000 15[IKE] <server-test-1|49> establishing CHILD_SA server-test-1{74}
2020-03-28T16:21:04.0+0000 15[IKE] <server-test-1|49> get_spi: Called
2020-03-28T16:21:04.0+0000 15[ENC] <server-test-1|49> splitting IKE message (2528 bytes) into 3 fragments
2020-03-28T16:21:06.0+0000 07[IKE] <server-test-1|49> retransmit 1 of request with message ID 1
2020-03-28T16:21:08.0+0000 05[IKE] <server-test-1|49> retransmit 2 of request with message ID 1
2020-03-28T16:21:10.0+0000 11[IKE] <server-test-1|49> giving up after 2 retransmits
2020-03-28T16:21:10.0+0000 11[IKE] <server-test-1|49> peer not responding, trying again (8/0)
2020-03-28T16:21:10.0+0000 11[IKE] <server-test-1|49> initiating IKE_SA server-test-1[49] to 103.135.57.126
2020-03-28T16:21:12.0+0000 13[IKE] <server-test-1|49> retransmit 1 of request with message ID 0
2020-03-28T16:21:12.0+0000 12[IKE] <server-test-1|49> initiating IKE_SA server-test-1[49] to 103.135.57.126

History

#1 Updated by Tobias Brunner 3 months ago

  • Description updated (diff)
  • Status changed from New to Feedback

Once the tunnel is established, DPD keepalives are also subject to transmit timeouts. Is that right?

Yes, dpd_delay only enables the initiation of regular empty INFORMATIONAL exchanges (unless there has been inbound traffic). All exchanges are subject to the configured retransmission timeouts, and when they time out, the configured DPD action is triggered.
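
For reference, charon's retransmission schedule is controlled by the charon.retransmit_timeout, charon.retransmit_base and charon.retransmit_tries options in strongswan.conf. The following standalone C sketch (an illustration only, not the actual charon code) prints that schedule for the stock defaults of 4.0 s / 1.8 / 5 tries, which add up to roughly 165 seconds before an exchange is abandoned and the DPD action runs; the 9-second window described above clearly implies much smaller, custom values.

/* Illustration of strongSwan-style retransmission timing; a sketch, not the
 * actual charon code.  The delay before each retransmission grows by a
 * factor of retransmit_base, starting at retransmit_timeout seconds; after
 * retransmit_tries retransmissions the exchange is abandoned and the DPD
 * action (restart in this issue) is triggered.  Values below are the stock
 * strongswan.conf defaults; the 9-second setup in this issue obviously uses
 * much smaller ones. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double retransmit_timeout = 4.0; /* charon.retransmit_timeout */
    double retransmit_base    = 1.8; /* charon.retransmit_base    */
    int    retransmit_tries   = 5;   /* charon.retransmit_tries   */
    double elapsed = 0.0;

    for (int i = 0; i <= retransmit_tries; i++)
    {
        elapsed += retransmit_timeout * pow(retransmit_base, i);
        if (i < retransmit_tries)
        {
            printf("retransmit %d of request at ~%.0fs\n", i + 1, elapsed);
        }
        else
        {
            printf("giving up after %d retransmits at ~%.0fs, DPD action runs\n",
                   retransmit_tries, elapsed);
        }
    }
    /* with the defaults this prints retransmits up to ~89s and gives up at ~165s */
    return 0;
}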

Will transmit timeouts always trigger tunnel reestablishment in the case of an initiator?

Yes, unless there is a fatal error during the reestablishment.

So, given the above DPD and transmit timeout configuration, can a normal control packet retransmit timeout and a DPD transmit timeout possibly trigger tunnel reestablishments simultaneously?

I don't know what settings you are referring to, but only one exchange can be active at a time. However, if you configure the same DPD action on two peers, it's possible that both try to initiate the connection concurrently when connectivity breaks, which could lead to duplicate SAs.

After some time we hit the "giving up after 5 retries" issue ... and tunnel reestablishment stops permanently.

I don't see either of those in the logs you posted.

#2 Updated by Krishnamurthy Daulatabad 3 months ago

Sorry, I missed the log from when tunnel establishment stops.

2020-03-28T16:21:46.0+0000 14[IKE] <server-test-1|49> initiating IKE_SA server-test-1[49] to 103.135.57.126
2020-03-28T16:21:46.0+0000 12[IKE] <server-test-1|49> initiating IKE_SA server-test-1[49] to 103.135.57.126
2020-03-28T16:21:46.0+0000 12[IKE] <server-test-1|49> giving up after 5 retries

As you can see, two threads have initiated the tunnel to the same destination and hit MAX_RETRIES.

#3 Updated by Tobias Brunner 3 months ago

As you can see, two threads have initiated the tunnel to the same destination and hit MAX_RETRIES.

MAX_RETRIES is only used for IKE_SA_INIT retries caused by INVALID_KE_PAYLOAD or COOKIE notifies (it has nothing to do with retransmission or DPDs). So there must be some fatal error either during DH negotiation or related to cookies (increase the log levels to see more about it).
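
To make the distinction concrete, below is a minimal, hypothetical sketch of such a retry limit (it is not the actual ike_init.c code): the counter is only bumped when the IKE_SA_INIT response carries an INVALID_KE_PAYLOAD or COOKIE notify, and once it exceeds MAX_RETRIES the setup fails with a "giving up after 5 retries" message. Retransmission timeouts and DPD exchanges never touch this counter.

/* Hypothetical sketch of the IKE_SA_INIT retry limit described above; not
 * the actual libcharon code.  Only INVALID_KE_PAYLOAD and COOKIE notifies
 * received in response to IKE_SA_INIT increment the counter. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 5

typedef enum {
    NOTIFY_NONE,               /* regular response, continue with IKE_AUTH */
    NOTIFY_INVALID_KE_PAYLOAD, /* peer wants a different DH group          */
    NOTIFY_COOKIE,             /* peer requests an anti-DoS cookie         */
} notify_t;

/* returns true if IKE_SA_INIT should be sent again with adjusted payloads */
static bool handle_ike_sa_init_response(notify_t notify, int *retry_count)
{
    if (notify == NOTIFY_NONE)
    {
        return false; /* nothing to retry */
    }
    if (++(*retry_count) > MAX_RETRIES)
    {
        printf("giving up after %d retries\n", MAX_RETRIES);
        return false; /* IKE_SA setup fails */
    }
    return true; /* retry with a new KE payload or the received cookie */
}

int main(void)
{
    int retries = 0;
    /* example: a peer that keeps answering IKE_SA_INIT with COOKIE notifies */
    while (handle_ike_sa_init_response(NOTIFY_COOKIE, &retries))
    {
        printf("retrying IKE_SA_INIT (%d/%d)\n", retries, MAX_RETRIES);
    }
    return 0;
}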

#4 Updated by Krishnamurthy Daulatabad 3 months ago

This issue was seen on a live server when the link was down and flapping quite often. So can this error occur due to partial IKE fragments being received or a stale (out-of-order) message being received? I am just wondering how to simulate this issue in a test setup.

#5 Updated by Tobias Brunner 3 months ago

So can this error occur due to partial IKE fragments being received or a stale (out-of-order) message being received?

As I said, the retry counter is only increased if one of these notifies is received. Check the log to see why one of them was returned. And yes, there have been issues in older releases when delayed responses (e.g. with an INVALID_KE_PAYLOAD notify) were received, but 5.7.2 contains the fixes for that.

#6 Updated by Krishnamurthy Daulatabad 3 months ago

The log level was 1 on the system, so there are no error logs that I can see. But I see this message 4 times (on different attempts) before the error occurred.

2020-03-28T16:15:16.0+0000 08[IKE] <server-test-1|44> received message ID 1, expected 2, ignored.

I guess this will increase the retry count

#7 Updated by Tobias Brunner 3 months ago

The log level was 1 on the system.

It wasn't; otherwise there would have been way more messages (e.g. sent/received messages).

I guess this will increase the retry count

No, determine and fix the actual issue.

#8 Updated by Krishnamurthy Daulatabad 3 months ago

INVALID_KE_PAYLOAD has a DBG1 error log and I don't see that in the logs, so this error has not happened. (Also, both initiator and responder are the same strongSwan version, so DH group proposals cannot be rejected.)

The COOKIE notify type has a DBG2 log in the process_i() method, so I have to enable log level 2 and see if this error is occurring.

Are you referring to this fix done for delayed cookie handling? We don't seem to have this fix in our 5.7.2 code base:
commit 01f462f0d327bc202b88cce1d5ee3f0e90056be2
Author: Tobias Brunner <>
Date: Wed Nov 28 15:52:27 2018 +0100

ikev2: Ignore COOKIE notifies we already received

This could be due to a delayed response to an IKE_SA_INIT retransmit.

Fixes #2837.

#9 Updated by Tobias Brunner 3 months ago

INVALID_KE_PAYLOAD has a DBG1 error log and I don't see that in the logs, so this error has not happened.

There are a lot of level 1 messages that should be there but aren't (at least in the log excerpts you posted).

(Also, both initiator and responder are the same strongSwan version, so DH group proposals cannot be rejected.)

They definitely can; it's a configuration issue (although if you didn't configure any proposals on either side, it shouldn't happen).

The COOKIE notify type has a DBG2 log in the process_i() method, so I have to enable log level 2 and see if this error is occurring.

There is a log level 1 message that shows what messages are sent/received, which also shows what payloads they contain.

Are you referring to this fix done for delayed cookie handling?

Among others (e.g. changing the initiator SPI when restarting after several retransmits).

We don't seem to have this fix in our 5.7.2 code base

Then you either didn't check properly or you are not using 5.7.2, as 01f462f0d3 was definitely part of that version:

git tag --contains 01f462f0d327bc202b88cce1d5ee3f0e90056be2 
5.7.2dr4
5.7.2rc1
5.7.2
5.8.0dr2
5.8.0rc1
5.8.0
...

#10 Updated by Krishnamurthy Daulatabad 3 months ago

Sorry, we took 5.7.1 as the base, so we don't have this fix in our code; I checked the code too.

Some of the transmit and receive logs (LOG1) in the normal path were changed to level 2, as we were filling our log capacity too quickly. So I cannot see those logs anymore.

#11 Updated by Tobias Brunner 3 months ago

  • Category set to libcharon
  • Status changed from Feedback to Closed
  • Assignee set to Tobias Brunner
  • Affected version changed from 5.7.2 to 5.7.1
  • Resolution set to No change required
