Issue #1491

"init_limit_half_open" is ignored

Added by Danny Kulchinsky almost 6 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
kernel
Affected version:
5.3.5
Resolution:
Fixed

Description

We defined init_limit_half_open = 500 and restarted strongSwan; however, the setting seems to be ignored.
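
For reference, the option is set in the charon section of strongswan.conf; the snippet below only illustrates how we have it laid out (the surrounding file layout may differ per installation):

    charon {
        # reject new IKE_SA_INIT requests once this many IKE_SAs are half-open
        init_limit_half_open = 500
    }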

Below is the output of swanctl -S showing "1648 half-open":

# swanctl -S
uptime: 4 hours, since May 29 07:29:26 2016
worker threads: 256 total, 62 idle, working: 4/89/1/100
job queues: 0/0/0/0
jobs scheduled: 42153
IKE_SAs: 5592 total, 1648 half-open
mallinfo: sbrk 4890624, mmap 528384, used 2964992, free 1925632

Are we missing something in our configuration, or is our observation incorrect?

strongSwan_IKE_SA_INIT_requests_count.PNG (7.83 KB) Danny Kulchinsky, 30.05.2016 12:19
strongSwan_Assign_Virtual_IP-CHILD_SA_Established_delay_dsitribution.PNG (13.7 KB) Duration distribution between "assigning virtual ip" => "CHILD_SA XXXXXXX established" Danny Kulchinsky, 02.06.2016 09:35
Informational Messages.PNG (24.1 KB) Informational Messages (TPS) - 24 hours Danny Kulchinsky, 24.06.2016 15:06
Worker threads.PNG (45.6 KB) Worker threads - 24 hours Danny Kulchinsky, 24.06.2016 15:06
vmcore-dmesg.zip (20.9 KB) kdump dmesg Danny Kulchinsky, 26.07.2016 15:55

History

#1 Updated by Danny Kulchinsky almost 6 years ago

I raised the loglevel of NET subsystem to 1, and I can see the following messages:

2016-05-29 18:46:04.982 02[NET] ignoring IKE_SA setup from xxx.xxx.xxx.xxx, half open IKE_SA count of 4528 exceeds limit of 500
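
(For reference, the loglevel was raised via the logger section in strongswan.conf; whether this belongs in the syslog or filelog section depends on the setup - the syslog variant below is only an example:)

    charon {
        syslog {
            daemon {
                net = 1
            }
        }
    }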

It seems that it does ignore new setup requests; however, if the limit is 500, how is it that we already have 4528 in the half-open state? Shouldn't they have been blocked?

#2 Updated by Tobias Brunner almost 6 years ago

  • Status changed from New to Feedback

It seems that it does ignore new setup requests; however, if the limit is 500, how is it that we already have 4528 in the half-open state? Shouldn't they have been blocked?

IKE_SAs are only registered as half-open after the first message has successfully been handled. In what time frame did the SAs get established? Is this host responder only? Half-open IKE_SAs might also result from rekeying collisions; were there any rekeyings? The log will tell you more about what's going on.

#3 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

It seems that it does ignore new setup requests; however, if the limit is 500, how is it that we already have 4528 in the half-open state? Shouldn't they have been blocked?

IKE_SAs are only registered as half-open after the first message has successfully been handled. In what time frame did the SAs get established? Is this host responder only? Half-open IKE_SAs might also result from rekeying collisions; were there any rekeyings? The log will tell you more about what's going on.

1) This host is Responder only

2) Looking at the logs we see very few "rekeyed" messages; most of the time there are 3-4 rekeyings per hour, at most 100

3) The bursts of IKE_SA_INITs are very concentrated, lasting between 1 and 3 seconds, with intervals of relative "quiet time" (20, 30, 40 seconds). I'm attaching a graph (10-minute zoom-in) of a log file visualization; it counts the number of "parsed IKE_SA_INIT request" events per second (each column represents one second)

How the "IKE_SA_INIT Dropping" handles such short bursts ? we do see "ignoring IKE_SA setup" messages but still the number of half-open IKE_SAs is increasing.

My understanding is that the network subsystem should ignore all IKE_SA requests as long as the number of half-open IKE_SAs is above the limit, but I don't see that happening.

#4 Updated by Tobias Brunner almost 6 years ago

How the "IKE_SA_INIT Dropping" handles such short bursts ? we do see "ignoring IKE_SA setup" messages but still the number of half-open IKE_SAs is increasing.

As I said, half-open SAs are only registered after replying to the IKE_SA_INIT. So if there are lots of IKE_SA_INIT requests at the same time, the number might increase beyond the set limit. After that has happened, new IKE_SA_INIT requests are dropped, but any pending request that arrived before is still handled.

#5 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

How the "IKE_SA_INIT Dropping" handles such short bursts ? we do see "ignoring IKE_SA setup" messages but still the number of half-open IKE_SAs is increasing.

As I said, half-open SAs are only registered after replying to the IKE_SA_INIT. So if there are lots of IKE_SA_INIT requests at the same time, the number might increase beyond the set limit. After that has happened, new IKE_SA_INIT requests are dropped, but any pending request that arrived before is still handled.

Thank you Tobias!

I did some further investigation and seem to have found the culprit; I figured I'd share the results here.

We found it very bizarre that (according to the charon logs) we were receiving such huge spikes of IKE_SA_INIT requests (up to 2000) concentrated in 1-2 second time frames (the load is spread across multiple security gateways via DNS round robin).

The increase in the Medium job queue was a clue, but what helped analyze the situation better was capturing actual traffic on the public interface. This revealed that the rate of IKE_SA_INIT requests is stable throughout; however, at some point the gateway stops sending responses for 10/15/20/30 seconds, followed by a huge spike of responses within a few short seconds.

This led me to think that something was hogging the system. I reviewed a few samples of the log and found a huge delay between the following messages:

assigning virtual IP 10.11.68.15 to peer ''
CHILD_SA XXXXXX{yyyyy} established with SPIs c931ba7d_i 0ab96e4f_o and TS 10.1.1.0/24 === 10.11.68.15/32

The gap between messages was sometimes in minutes.

My understanding is that the actions taken between these two messages have to do with XFRM Netlink setup and iptables rule updates. With XFRM I don't think there's much I could do, but given that we establish quite a few tunnels per second, this probably resulted in many iptables commands executing concurrently, which seems to have caused the delay (since we did use "leftfirewall=yes").

So I decided to disable "leftfirewall" and set a general FORWARD rule in iptables instead; after applying this change the problem completely disappeared!
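
For reference, the static replacement looks roughly like this (the exact rules depend on the firewall setup; this is only a sketch using the iptables policy match):

    # leftfirewall removed from the connection definitions in ipsec.conf,
    # replaced by one static rule pair covering all tunnels:
    iptables -A FORWARD -m policy --dir in  --pol ipsec --proto esp -j ACCEPT
    iptables -A FORWARD -m policy --dir out --pol ipsec --proto esp -j ACCEPT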

Now I can see that only the Low job queue has pending jobs, since we are using the EAP-AKA authentication method (via EAP-RADIUS) and an authentication session can take ~1-2 seconds end-to-end.

I'm trying to think about how to alleviate this bottleneck. Currently we have 64 threads defined (4 reserved for High and 16 reserved for Medium); the server has 8 cores.
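
For context, this is roughly how the thread pool is currently configured in strongswan.conf:

    charon {
        threads = 64
        processor {
            priority_threads {
                high = 4
                medium = 16
            }
        }
    }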

1) I can see that the Medium job queue is 0 and at most 2 threads are active, so reserving 16 threads is probably too much? The Low job queue fluctuates between a few hundred and ~1300
2) Will it help to increase the total thread count to 128? Will this allow more parallel EAP sessions, or will it only cause contention over the CPU cores and actually make things worse?

Would appreciate your input/insights on how to proceed with tuning the system.

Thanks !
Danny

#6 Updated by Tobias Brunner almost 6 years ago

My understanding is that the actions taken between these two messages have to do with XFRM Netlink setup and iptables rule updates. With XFRM I don't think there's much I could do, but given that we establish quite a few tunnels per second, this probably resulted in many iptables commands executing concurrently, which seems to have caused the delay (since we did use "leftfirewall=yes").

leftfirewall=yes runs the default updown script with iptables argument, which in turn adds ACCEPT rules via iptables for each tunnel. Whether these are needed depends on your firewall configuration (e.g. the default policy of the INPUT/FORWARD chains).

So I decided to disable "leftfirewall" and set a general FORWARD rule in iptables instead; after applying this change the problem completely disappeared!

Interesting. The updown script actually runs after the CHILD_SA ... established with SPIs ... message has been logged.

1) I can see that the Medium job queue is 0 and at most 2 threads are active, so reserving 16 threads is probably too much? The Low job queue fluctuates between a few hundred and ~1300

You have to do tests and see how it develops over time. Medium priority is the default for most jobs, e.g. to initiate or handle rekeyings, so they may be needed only after some time.

2) Will it help to increase the total thread count to 128? Will this allow more parallel EAP sessions, or will it only cause contention over the CPU cores and actually make things worse?

Processing each client will take 1-2 seconds, you can't make that faster by throwing more threads at it (you can perhaps handle more clients concurrently but also not unlimited - the more threads you have the more overhead there will be to synchronize them). Also see EapRadius.

#7 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

My understanding is that the actions taken between these two messages have to do with XFRM Netlink setup and iptables rule updates. With XFRM I don't think there's much I could do, but given that we establish quite a few tunnels per second, this probably resulted in many iptables commands executing concurrently, which seems to have caused the delay (since we did use "leftfirewall=yes").

leftfirewall=yes runs the default updown script with iptables argument, which in turn adds ACCEPT rules via iptables for each tunnel. Whether these are needed depends on your firewall configuration (e.g. the default policy of the INPUT/FORWARD chains).

So I decided to disable "leftfirewall" and set a general FORWARD rule in iptables instead; after applying this change the problem completely disappeared!

Interesting. The updown script actually runs after the CHILD_SA ... established with SPIs ... message has been logged.

I still see that quite a bit of time is spent between the messages ("assigning virtual ip" => "CHILD_SA XXXXXXX established"). I have created a graph that shows the distribution in seconds (attached); as you can see, the majority take between 2 and 15 seconds, which sounds like a lot... What exactly happens between these two stages? What could help reduce this time?

Not sure if it's related, but we did tune the IKE_SA table to use hash tables:

ikesa_table_size = 4096
ikesa_table_segments = 16

1) I can see that the Medium job queue is 0 and at most 2 threads are active, so reserving 16 threads is probably too much? The Low job queue fluctuates between a few hundred and ~1300

You have to do tests and see how it develops over time. Medium priority is the default for most jobs, e.g. to initiate or handle rekeyings, so they may be needed only after some time.

We have very little rekeying happening as most tunnels are short-lived. I will try to reduce from 16 to 8 and monitor. By the way, changing thread priorities requires a restart of charon to take effect, right?

2) Will it help to increase the total thread count to 128? Will this allow more parallel EAP sessions, or will it only cause contention over the CPU cores and actually make things worse?

Processing each client will take 1-2 seconds, you can't make that faster by throwing more threads at it (you can perhaps handle more clients concurrently but also not unlimited - the more threads you have the more overhead there will be to synchronize them). Also see EapRadius.

I have reviewed the EapRadius wiki; we seem to be following best practices - we have 8 AAA servers and enough sockets per server to handle the load.
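
Roughly how our eap-radius server pool is laid out (names, addresses and socket counts below are placeholders):

    charon {
        plugins {
            eap-radius {
                servers {
                    aaa1 {
                        address = 10.0.1.1
                        secret = <shared-secret>
                        sockets = 20
                    }
                    # aaa2 ... aaa8 defined the same way
                }
            }
        }
    }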

Other than the time spent between "assigning virtual ip" => "CHILD_SA XXXXXXX established", the bottleneck seems to be the time spent in the different EAP-AKA authentication phases. From your comment I understand that the only way to improve the concurrency of tunnel establishment is to improve AAA response times. We are looking into that; one of the things we're considering is implementing fast re-authentication (thoughts?)

#8 Updated by Tobias Brunner almost 6 years ago

So I decided to disable "leftfirewall" and set a general FORWARD rule in iptables instead; after applying this change the problem completely disappeared!

Interesting. The updown script actually runs after the CHILD_SA ... established with SPIs ... message has been logged.

I still see that quite a bit of time is spent between the messages ("assigning virtual ip" => "CHILD_SA XXXXXXX established"). I have created a graph that shows the distribution in seconds (attached); as you can see, the majority take between 2 and 15 seconds, which sounds like a lot... What exactly happens between these two stages? What could help reduce this time?

One possible culprit could be the nexthop lookup done when installing routes in table 220 for the clients. Depending on your configuration and network topology these might not be necessary and you could disable their installation via charon.install_routes. It's also possible to use a faster lookup by setting charon.plugins.kernel-netlink.fwmark to !<mark> where <mark> is an arbitrary number that is not otherwise used as firewall mark on your system. This will add a selector to the routing rule (ip rule) that only directs traffic to routing table 220 (where strongSwan installs its routes) if that mark is not set. This allows strongSwan to set the mark in its RTM_GETROUTE message so that routing table 220 is automatically excluded from the lookups (thus avoiding having to dump the complete routing table).
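
A minimal sketch of the two alternatives in strongswan.conf (the mark value 42 is arbitrary - pick one not otherwise used on the system - and note these are alternatives, not a pair):

    charon {
        # alternative 1: don't install per-client routes in table 220 at all
        install_routes = no

        plugins {
            kernel-netlink {
                # alternative 2: keep the routes but exclude table 220 from
                # the nexthop lookups strongSwan performs
                fwmark = !42
            }
        }
    }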

By the way, changing thread priorities requires a restart of charon to take effect, right?

Yes.

We are looking into that; one of the things we're considering is implementing fast re-authentication (thoughts?)

That would probably avoid a few low priority jobs and make some of them get processed faster.

#9 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

So I decided to disable "leftfirewall" and set a general FORWARD rule in iptables instead; after applying this change the problem completely disappeared!

Interesting. The updown script actually runs after the CHILD_SA ... established with SPIs ... message has been logged.

I still see that quite a bit of time is spent between the messages ("assigning virtual ip" => "CHILD_SA XXXXXXX established"). I have created a graph that shows the distribution in seconds (attached); as you can see, the majority take between 2 and 15 seconds, which sounds like a lot... What exactly happens between these two stages? What could help reduce this time?

One possible culprit could be the nexthop lookup done when installing routes in table 220 for the clients. Depending on your configuration and network topology these might not be necessary and you could disable their installation via charon.install_routes. It's also possible to use a faster lookup by setting charon.plugins.kernel-netlink.fwmark to !<mark> where <mark> is an arbitrary number that is not otherwise used as firewall mark on your system. This will add a selector to the routing rule (ip rule) that only directs traffic to routing table 220 (where strongSwan installs its routes) if that mark is not set. This allows strongSwan to set the mark in its RTM_GETROUTE message so that routing table 220 is automatically excluded from the lookups (thus avoiding having to dump the complete routing table).

Thanks! I believe we can safely disable route installation in our setup; I'm now "draining" users so I can restart the daemon. I understand the "fwmark" tweak is another alternative, right? I don't need both.

By the way, changing thread priorities requires a restart of charon to take effect, right?

Yes.

We are looking into that; one of the things we're considering is implementing fast re-authentication (thoughts?)

That would probably avoid a few low priority jobs and make some of them get processed faster.

#10 Updated by Tobias Brunner almost 6 years ago

I understand the "fwmark" tweak is another alternative, right ? I don't need both.

Correct.

#11 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

I understand the "fwmark" tweak is another alternative, right ? I don't need both.

Correct.

As suggested we disabled "install_routes", configured the necessary routing and restarted the daemon.

We can see a major improvement in tunnel establishment speed, specifically in the delay between the two events in question; now it takes at most ~10ms in ~85% of cases, with a worst case of ~30ms - down from double-digit seconds!

Thank you Tobias!

I have also made adjustments to the thread priorities; so far things seem to be OK. We'll monitor the situation throughout the daily cycle as traffic increases and the number of concurrent tunnels grows, but I'm cautiously optimistic :)

#12 Updated by Danny Kulchinsky almost 6 years ago

Things are looking MUCH better now.

I have a question about High priority jobs. According to JobPriority, only DPD is handled by High priority threads, yet we are seeing spikes of High worker threads (30-50 threads).

Could DPD consume so many threads? dpddelay is set to 60s since we need to identify stale tunnels asap in order to trigger a delete by AAA in the user database.
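
(For reference, that is set per connection/default in ipsec.conf, e.g.:)

    conn %default
        dpddelay=60s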

#13 Updated by Tobias Brunner almost 6 years ago

I have a question about High priority jobs. According to JobPriority, only DPD is handled by High priority threads, yet we are seeing spikes of High worker threads (30-50 threads).

DPDs are just an example. All INFORMATIONAL messages are handled with that priority i.e. also those to delete SAs or for MOBIKE updates/checks and as mentioned DPDs by clients or the server. Also running with high priority are the jobs that retransmit messages and those that send NAT keepalives (probably not that relevant on the server).

#14 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

I have a question about High priority jobs. According to JobPriority, only DPD is handled by High priority threads, yet we are seeing spikes of High worker threads (30-50 threads).

DPDs are just an example. All INFORMATIONAL messages are handled with that priority i.e. also those to delete SAs or for MOBIKE updates/checks and as mentioned DPDs by clients or the server. Also running with high priority are the jobs that retransmit messages and those that send NAT keepalives (probably not that relevant on the server).

Out of the procedures you mentioned, the one causing most of the workload in our case seems to be retransmits. I see a clear correlation between time frames with many received retransmit of request with ID XXX / retransmitting response messages and increases in High worker threads.

I'm trying to understand if this is due to internet network conditions, or could there be other factors ?

#15 Updated by Tobias Brunner almost 6 years ago

I have a question about High priority jobs. According to JobPriority, only DPD is handled by High priority threads, yet we are seeing spikes of High worker threads (30-50 threads).

DPDs are just an example. All INFORMATIONAL messages are handled with that priority i.e. also those to delete SAs or for MOBIKE updates/checks and as mentioned DPDs by clients or the server. Also running with high priority are the jobs that retransmit messages and those that send NAT keepalives (probably not that relevant on the server).

Out of the procedures you mentioned, the one causing most of the workload in our case seems to be retransmits. I see a clear correlation between time frames with many received retransmit of request with ID XXX / retransmitting response messages and increases in High worker threads.

These are, however, processed with the priority derived from the message type (i.e. a retransmit of an IKE_AUTH message is handled with medium priority). I was referring to retransmits of a request sent by the responder when it initiates an exchange.

I'm trying to understand if this is due to internet network conditions, or could there be other factors ?

Could be the network condition. In particular, if the response has actually been sent already, that is, you don't see the message ignoring request with ID ..., already processing, which would be the case if handling the message and preparing the response just takes longer than the initiator is willing to wait before sending a retransmit. It's also possible that the retransmit arrived and a job got queued while handling the original message and so it only seems like it arrived after the response has already been sent (because the job with the retransmit was just not processed earlier). If there are lots of high priority jobs that could delay processing retransmits of e.g. IKE_AUTH messages with medium priority. Also, sent messages are queued to the sender, which sends them sequentially. I guess if there are a lot of messages sent/queued at the same time it's possible that the original response has not yet hit the wire when the retransmit arrives and another instance of the message is queued (you'd have to check the log/capture - e.g. increase the log level for net to 2 to see when the socket sends/receives messages).

#16 Updated by Danny Kulchinsky almost 6 years ago

I've been investigating this and I still can't put my finger precisely on the root-cause.

These are, however, processed with the priority derived from the message type

Since most of the contention is around High priority jobs and it correlates to re-transmits, would you consider DPD re-transmits as the main contributor to the load here ?

#17 Updated by Tobias Brunner almost 6 years ago

Since most of the contention is around High priority jobs and it correlates to re-transmits, would you consider DPD re-transmits as the main contributor to the load here ?

Retransmits of what exactly? You should check that in the log. And are there actually that many concurrent DPDs? You should see that in the log too. Are they initiated by the server or the clients? Or are there other jobs (or messages) with high priority (all INFORMATIONAL messages are processed with high priority, i.e. also deletes or MOBIKE updates). Are there any IKEv1 SAs involved? Is the server behind a NAT?

#18 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

Since most of the contention is around High priority jobs and it correlates to re-transmits, would you consider DPD re-transmits as the main contributor to the load here ?

Retransmits of what exactly? You should check that in the log.

I'm trying to analyze the logs, but the amount of data is staggering... because there are thousands of tunnels.

And are there actually that many concurrent DPDs? You should see that in the log too. Are they initiated by the server or the clients?

Due to the large amount of logs, I'm using Splunk to visualize the patterns. Attached graph (Informational Messages) shows TPS (5 minutes average, last 24 hours) for the following events:

1) parsed INFORMATIONAL request --> DPD Request from the Client (right ?)
2) generating INFORMATIONAL request --> DPD Request from the Server (right?)
3) sending DPD request --> this line correlates with 2 above (this makes sense, doesn't it?)

The second graph (Worker threads) shows the distribution of threads over time (measured every 5 minutes, last 24 hours).

It seems that there's a negative correlation between number of Informational Messages processed and threads utilization... as you can see HIGH threads are the busiest.

What else should I be looking for ? what is causing such a high load on these threads ?

Or are there other jobs (or messages) with high priority (all INFORMATIONAL messages are processed with high priority, i.e. also deletes or MOBIKE updates).

There are very few "received DELETE for IKE_SA" messages (you can see it in the attached graph, ~1.4 per second at worst) and no MOBIKE updates - client doesn't support it.

Are there any IKEv1 SAs involved?

No, none (only IKEv2)

Is the server behind a NAT?

No, no NAT - Public IP is defined on the external interface of the server

#19 Updated by Tobias Brunner almost 6 years ago

Due to the large amount of logs, I'm using Splunk to visualize the patterns. Attached graph (Informational Messages) shows TPS (5 minutes average, last 24 hours) for the following events:

1) parsed INFORMATIONAL request --> DPD Request from the Client (right ?)

Probably, but depends on the contents. But since the clients don't support MOBIKE and you accounted for the DELETES they are probably DPDs (unless the clients use INFORMATIONALs to send something else entirely).

2) generating INFORMATIONAL request --> DPD Request from the Server (right?)
3) sending DPD request --> this line correlates with 2 above (this makes sense, doesn't it?)

Makes sense. What DPD settings have you configured on the server? What about the clients?

The second graph (Worker threads) shows the distribution of threads over time (measured every 5 minutes, last 24 hours).

Did you also measure the job queues and scheduled jobs?

It seems that there's a negative correlation between number of Informational Messages processed and threads utilization... as you can see HIGH threads are the busiest.

It seems strange that during the busiest time the utilization would go down to nearly zero (unless you always sampled the thread counts when all threads were getting new jobs at nearly the same time). Also, the number of total threads might be a bit high as they all will eventually have to fight for the same resources (depends on the number of CPU cores your system has and whether you often have jobs that have to wait for some I/O operation to finish, e.g. RADIUS authentication).

What else should I be looking for ? what is causing such a high load on these threads ?

Analyzing which jobs are actually executed by the threads might be something to look at. Or how much time they spend on a job (e.g. how much time spent in processor_t::process_job() for a particular job and at a particular point in time). But that's not possible without code changes. And jobs don't have a type associated with them (like e.g. tasks) which does not make it easier.
There is also a configure option to profile the locks used in strongSwan (--enable-lock-profiler). I guess it has a slight overhead but will provide the total amount of time threads waited to acquire a particular lock once it is destroyed. This information is currently logged to stderr only, so you'd have to start the daemon in the foreground e.g. with ipsec start --nofork, or when using swanctl just start charon directly. But that could also be changed to logging via DBG() functions by replacing stderr with NULL in source:src/libstrongswan/threading/lock_profiler.h#L66 and the fprintf(stderr,...) call before that with DBG1(DBG_LIB,...). The thresholds might also have to be adjusted.

#20 Updated by Danny Kulchinsky almost 6 years ago

Sorry for not replying sooner, summer vacation and moving to a new place...

Here's some feedback and progress I was able to make:

1) DPD on the server was aggressive (every 60 seconds); I increased it to 3600s, which reduced the utilization of the High threads to some extent.

2) Client side DPD is 600 seconds, we do not control that.

3) The server has 8 cores; the reason for 64 threads was mainly the RADIUS response times (it sometimes takes 300-600ms to get an answer). We were running CentOS 6.5 (kernel 2.6.32) and recently upgraded to CentOS 7.2 (kernel 3.10.0) to utilize pcrypt; we also reduced the thread count to 32. It seems that individual CPUs are not spiking to 100% as often as before, but the overall CPU load is still quite high (I should mention it is a virtual machine running on VMware, however we made sure it has almost 0% Ready Time - so it doesn't wait for CPU).

4) We also upgraded to strongSwan 5.5.0

5) We installed and activated irqbalance; it helped somewhat with distributing the workload better across the cores.

6) Job queues increase from time to time to 500 - 1000, mainly on Medium and Low queues.

7) Most of the CPU usage is in the kernel (system); I can see high CPU usage on the kworker/* and ksoftirqd/* processes. When profiling these processes with the perf utility, I see the following:

kworker/* process:
Overhead  Shared Object  Symbol
  77.41%  [kernel]       [k] xfrm_policy_match
  10.39%  [kernel]       [k] xfrm_policy_lookup_bytype

kworker/* processes are better distributed across the cores.

ksoftirqd/* process:
Overhead  Shared Object  Symbol
  77.95%  [kernel]       [k] xfrm_policy_match
   7.37%  [kernel]       [k] xfrm_selector_match
   5.10%  [kernel]       [k] xfrm_policy_lookup_bytype

ksoftirqd/* seems to be working only in pairs; I can only see two such processes busy at any given point in time. A bottleneck? Could these be the XFRM crypto workqueues (one for encrypt and one for decrypt)?

charon process:
Overhead  Shared Object  Symbol
  51.25%  [kernel]       [k] __write_lock_failed
  30.98%  [kernel]       [k] xfrm_policy_match
   3.64%  [kernel]       [k] xfrm_policy_lookup_bytype
   3.36%  [kernel]       [k] xfrm_selector_match
   2.45%  [kernel]       [k] xfrm_policy_insert
   1.59%  [kernel]       [k] xfrm_policy_bysel_ctx

charon is busy in __write_lock_failed and xfrm_policy_match

The busiest kernel functions seem to be:

1) __write_lock_failed (charon)
2) xfrm_policy_match (kworker, ksoftirqd and charon)

We have around 10K tunnels on this box; overall bandwidth is quite low, <2MB/s at peak.

Any idea why so much CPU time is spent in these particular kernel functions? Any way to improve?

I'm pretty much at a dead-end here with this... not sure how to proceed.

#21 Updated by Tobias Brunner almost 6 years ago

Any idea why so much CPU time is spent in these particular kernel functions? Any way to improve?

As the names of the functions indicate these are the policy lookups when handling ESP traffic. And when charon inserts/updates policies this could conflict (it requires a write lock while handling ESP traffic requires a read lock, thus the __write_lock_failed escape of the spinlock).

Policies in the Linux kernel are stored in linear lists (one per direction) ordered by priority. So if you have lots of them that might not scale well (with 5.5.0 additional policies are actually installed in the FWD direction, you could try if using 5.4.0 makes any difference - or modify source:src/libcharon/sa/child_sa.c#L884 so these additional policies are not installed).

Policies are cached in the flow cache but if the number of flows exceeds its size a potentially expensive lookup via xfrm_policy_lookup_bytype is required. The cache has a fixed size (whenever the number of entries exceeds 4096 it is reduced again to at most 2048 entries), it is also per CPU so if flows are not kept on specific CPUs that could amplify the issue, as do lots of fluctuations regarding SAs/policies as the cache is invalidated whenever policies are changed.

For a quicker policy lookup the kernel only hashes host-host (non-prefixed) policies, by default. However, since 3.18 it is also possible to hash prefixed policies up to certain thresholds (perhaps these changes were backported to your kernel), which could help a lot depending on the traffic selectors negotiated in your scenario.

#22 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

Any idea why so much CPU time is spent in these particular kernel functions? Any way to improve?

As the names of the functions indicate these are the policy lookups when handling ESP traffic. And when charon inserts/updates policies this could conflict (it requires a write lock while handling ESP traffic requires a read lock, thus the __write_lock_failed escape of the spinlock).

Policies in the Linux kernel are stored in linear lists (one per direction) ordered by priority. So if you have lots of them that might not scale well (with 5.5.0 additional policies are actually installed in the FWD direction, you could try if using 5.4.0 makes any difference - or modify source:src/libcharon/sa/child_sa.c#L884 so these additional policies are not installed).

We have an (almost) identical machine running strongSwan 5.3.5, and it shows the same kind of behavior.

Policies are cached in the flow cache but if the number of flows exceeds its size a potentially expensive lookup via xfrm_policy_lookup_bytype is required. The cache has a fixed size (whenever the number of entries exceeds 4096 it is reduced again to at most 2048 entries), it is also per CPU so if flows are not kept on specific CPUs that could amplify the issue, as do lots of fluctuations regarding SAs/policies as the cache is invalidated whenever policies are changed.

I do see xfrm_policy_lookup_bytype (they are in the outputs of perf top above), but it's relatively low percentage compared to xfrm_policy_match

For a quicker policy lookup the kernel only hashes host-host (non-prefixed) policies, by default. However, since 3.18 it is also possible to hash prefixed policies up to certain thresholds (perhaps these changes were backported to your kernel), which could help a lot depending on the traffic selectors negotiated in your scenario.

I don't see the changes you are referring to in our kernel version; we can try upgrading to the 4.x branch and see if it helps. But will it require any changes in strongSwan to use this new functionality?

Our TS is quite simple, Clients are provided with a Virtual IP from a /16 (a.b.0.0/16) subnet and they all access the same /24 (x.y.z.0/24) subnet behind the security gateway:

CHILD_SA XXXXXXX{301762} established with SPIs xxxxxxxx_i yyyyyyyy_o and TS x.y.z.0/24 === a.b.c.7/32
CHILD_SA XXXXXXX{301764} established with SPIs xxxxxxxx_i yyyyyyyy_o and TS x.y.z.0/24 === a.b.c.126/32

Is there anything we could do to reduce the time spent looking for policies ? it seems we cannot scale beyond the point we reached :(

#23 Updated by Tobias Brunner almost 6 years ago

Policies are cached in the flow cache but if the number of flows exceeds its size a potentially expensive lookup via xfrm_policy_lookup_bytype is required. The cache has a fixed size (whenever the number of entries exceeds 4096 it is reduced again to at most 2048 entries), it is also per CPU so if flows are not kept on specific CPUs that could amplify the issue, as do lots of fluctuations regarding SAs/policies as the cache is invalidated whenever policies are changed.

I do see xfrm_policy_lookup_bytype (they are in the outputs of perf top above), but it's relatively low percentage compared to xfrm_policy_match

xfrm_policy_match is called by xfrm_policy_lookup_bytype for every enumerated policy.

I don't see the changes you are referring to in our kernel version; we can try upgrading to the 4.x branch and see if it helps. But will it require any changes in strongSwan to use this new functionality?

Yes, you'd need the following two patches (or you write your own tool that configures this, I think iproute2 does not support it yet): 52f91d978b, f7e0c7e0ea

Our TS is quite simple, Clients are provided with a Virtual IP from a /16 (a.b.0.0/16) subnet and they all access the same /24 (x.y.z.0/24) subnet behind the security gateway:

In that case, I think you could actually benefit quite a bit from the above kernel changes by setting lbits to 24 and leaving rbits at the default of 32. Then all the policies should get hashed. One exception might be the already mentioned FWD policies in the "out" direction that 5.5.0 installs, as the kernel patches seem to only consider FWD policies in the "in" direction, i.e. with lbits <= dst-prefix and rbits <= src-prefix, so the FWD "out" policies with src/24 and dst/32 would not apply - I guess you could configure lbits = rbits = 24 to fix that.
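
With the patched plugin the thresholds would then be configured roughly like this in kernel-netlink.conf (option names follow the spdh_thresh layout of the later merged support; treat the exact structure as an assumption for the branch):

    kernel-netlink {
        spdh_thresh {
            ipv4 {
                # hash prefixed policies with at least /24 selectors on either side
                lbits = 24
                rbits = 24
            }
        }
    }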

Is there anything we could do to reduce the time spent looking for policies ? it seems we cannot scale beyond the point we reached :(

If you'd manually manage policies via ip xfrm policy you might be able to simplify the inbound and forward policies (allow traffic from the complete /16 subnet to the /24 subnet) but every client still needs a separate outbound policy that directs traffic addressed to its virtual IP to the correct IPsec SA (you could install/delete these via custom updown script).

If you want to try whether this improves the situation or not you could do the following. Before starting the daemon (or via charon.start-scripts) you install a forward and an inbound policy (the latter is only necessary if the gateway itself must be reachable on a local address within the /24 subnet) allowing traffic between the two subnets:

ip xfrm policy add src a.b.0.0/16 dst x.y.z.0/24 dir fwd tmpl mode tunnel proto esp
ip xfrm policy add src a.b.0.0/16 dst x.y.z.0/24 dir in  tmpl mode tunnel proto esp

And then configure installpolicy=no and leftupdown=/path/to/your/updown-script (if you use leftfirewall=yes you have to remove that and perhaps base your script on the default script that installs these firewall rules). The script then must contain at least something like this:

#!/bin/sh

case "$PLUTO_VERB:$1" in
up-client:)
    ip xfrm policy add dst $PLUTO_PEER_CLIENT src $PLUTO_MY_CLIENT dir out tmpl src $PLUTO_ME dst $PLUTO_PEER proto esp mode tunnel reqid $PLUTO_REQID
    ;;
down-client:)
    ip xfrm policy delete dst $PLUTO_PEER_CLIENT src $PLUTO_MY_CLIENT dir out
    ;;
*)    exit 1
    ;;
esac
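
The corresponding connection entry in ipsec.conf would then contain something like this (the connection name is just a placeholder):

    conn rw
        installpolicy=no
        leftupdown=/path/to/your/updown-script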

#24 Updated by Danny Kulchinsky almost 6 years ago

Thank you for the information and detailed analysis.

We are going to upgrade to kernel 4.6.4 later today; I have already applied the necessary patches (52f91d978b, f7e0c7e0ea) to the 5.5.0 source.

Just to confirm: if I set lbits=rbits=24, do I also need to "disable" the new SPD FWD out policy in child_sa.c (we don't need it in our setup anyway), or will the lbits+rbits setting take care of that as well?

#25 Updated by Tobias Brunner almost 6 years ago

Just to confirm: if I set lbits=rbits=24, do I also need to "disable" the new SPD FWD out policy in child_sa.c (we don't need it in our setup anyway), or will the lbits+rbits setting take care of that as well?

When setting both values to 24 you don't need to disable the additional FWD policies, as they will also be hashed. Only with the "correct", minimal thresholds (where rbits would be 32) would that be necessary, because for FWD policies the kernel always applies rbits to the source selector, which for the FWD OUT policies is the local /24 subnet.

#26 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

Just to confirm: if I set lbits=rbits=24, do I also need to "disable" the new SPD FWD out policy in child_sa.c (we don't need it in our setup anyway), or will the lbits+rbits setting take care of that as well?

When setting both values to 24 you don't need to disable the additional FWD policies, as they will also be hashed. Only with the "correct", minimal thresholds (where rbits would be 32) would that be necessary, because for FWD policies the kernel always applies rbits to the source selector, which for the FWD OUT policies is the local /24 subnet.

Thanks for the clarification.

I do have a doubt regarding patch f7e0c7e0ea: the link above refers to src/libhydra, while I only have src/libcharon, and if I replace the file as is, compilation fails:

kernel_netlink_ipsec.c:38:19: fatal error: hydra.h: No such file or directory
 #include <hydra.h>
                   ^
compilation terminated.
make[4]: *** [kernel_netlink_ipsec.lo] Error 1

Can you suggest what exactly needs to be changed/updated in src/libcharon/plugins/kernel_netlink/kernel_netlink_ipsec.c?

Thanks !

#27 Updated by Tobias Brunner almost 6 years ago

I do have a doubt regarding patch f7e0c7e0ea: the link above refers to src/libhydra, while I only have src/libcharon, and if I replace the file as is, compilation fails:

Yes, that stuff was moved to libcharon with 5.4.0. Not sure where that hydra.h include comes from as it is not part of the patch itself (or did you download the complete file?). Anyway, I rebased the xfrm-spd-hash-thresh branch to the current master. Please try 5e8a3d669f and 5719830803.

#28 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

I do have a doubt regarding patch f7e0c7e0ea: the link above refers to src/libhydra, while I only have src/libcharon, and if I replace the file as is, compilation fails:

Yes, that stuff was moved to libcharon with 5.4.0. Not sure where that hydra.h include comes from as it is not part of the patch itself (or did you download the complete file?). Anyway, I rebased the xfrm-spd-hash-thresh branch to the current master. Please try 5e8a3d669f and 5719830803.

Great :) Ok, now things seem to be compiling as expected.

Unfortunately, our CentOS box keeps hanging/crashing after upgrading to kernel 4.6.4 :( I think we'll set up another box with Fedora F24 (which comes pre-installed with kernel 4.6.4)

#29 Updated by Danny Kulchinsky almost 6 years ago

Danny Kulchinsky wrote:

Tobias Brunner wrote:

I do have a doubt regarding patch f7e0c7e0ea: the link above refers to src/libhydra, while I only have src/libcharon, and if I replace the file as is, compilation fails:

Yes, that stuff was moved to libcharon with 5.4.0. Not sure where that hydra.h include comes from as it is not part of the patch itself (or did you download the complete file?). Anyway, I rebased the xfrm-spd-hash-thresh branch to the current master. Please try 5e8a3d669f and 5719830803.

Great :) Ok, now things seem to be compiling as expected.

Unfortunately, our CentOS box keeps hanging/crashing after upgrading to kernel 4.6.4 :( I think we'll set up another box with Fedora F24 (which comes pre-installed with kernel 4.6.4)

Still struggling with the Kernel crash/hang.

However, it seems that it only happens when I configure lbits and rbits in kernel-netlink.conf

We set up kdump (full vmcore-dmesg.txt attached), and this is what we see:

[ 3151.315569] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 3151.315621] IP: [<ffffffff8168a0cd>] xfrm_hash_rebuild+0x11d/0x200
[ 3151.315650] PGD 232bfb067 PUD 233f4d067 PMD 0 
[ 3151.315670] Oops: 0000 [#1] SMP 
[ 3151.315692] Modules linked in: pcrypt crypto_user nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr input_leds sg vmw_vmci i2c_piix4 parport_pc shpchp parport 8250_fintek acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt crc32c_intel serio_raw fb_sys_fops ttm mptspi ata_piix mptscsih mptbase drm vmxnet3 scsi_transport_spi libata floppy fjes dm_mirror dm_region_hash dm_log dm_mod
[ 3151.316017] CPU: 2 PID: 9083 Comm: kworker/2:0 Not tainted 4.6.4-1.el7.elrepo.x86_64 #1
[ 3151.316047] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[ 3151.316085] Workqueue: events xfrm_hash_rebuild
[ 3151.316105] task: ffff88021f042d00 ti: ffff8800ad164000 task.ti: ffff8800ad164000
[ 3151.316130] RIP: 0010:[<ffffffff8168a0cd>]  [<ffffffff8168a0cd>] xfrm_hash_rebuild+0x11d/0x200
[ 3151.316160] RSP: 0018:ffff8800ad167de0  EFLAGS: 00010202
[ 3151.316178] RAX: 0000000000000004 RBX: ffffffff81d0acc0 RCX: 00000000deadbeef
[ 3151.316203] RDX: 0000000000000004 RSI: ffff8802189e94b4 RDI: ffffffff81d0acc0
[ 3151.316226] RBP: ffff8800ad167e10 R08: 0000000000000003 R09: ffffffff81d0ad00
[ 3151.316250] R10: 0000000000000004 R11: 00000000000001c1 R12: ffffffff81d0c1b8
[ 3151.316273] R13: ffffffff81d0c444 R14: ffff8802189e9400 R15: 0000000000000018
[ 3151.316299] FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[ 3151.316326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3151.316345] CR2: 0000000000000004 CR3: 0000000233973000 CR4: 00000000000406e0
[ 3151.316408] Stack:
[ 3151.316419]  0000008000000080 ffffffff81d0c258 ffff88021f204540 ffff88023fc96640
[ 3151.316455]  ffff88023fc9b100 0000000000000080 ffff8800ad167e58 ffffffff810992e2
[ 3151.316495]  000000001f204570 0000000000000000 ffff88023fc96660 ffff88021f204570
[ 3151.316531] Call Trace:
[ 3151.317283]  [<ffffffff810992e2>] process_one_work+0x152/0x400
[ 3151.317996]  [<ffffffff81099bd5>] worker_thread+0x125/0x4b0
[ 3151.318705]  [<ffffffff81712b85>] ? __schedule+0x345/0x970
[ 3151.319393]  [<ffffffff81099ab0>] ? rescuer_thread+0x380/0x380
[ 3151.320073]  [<ffffffff8109f738>] kthread+0xd8/0xf0
[ 3151.320723]  [<ffffffff81716f02>] ret_from_fork+0x22/0x40
[ 3151.321374]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[ 3151.322017] Code: c0 fe ff ff 0f 84 98 00 00 00 41 8b 8e 98 00 00 00 41 0f b7 96 cc 01 00 00 49 8d b6 a4 00 00 00 48 89 df 83 e1 07 e8 a3 f5 ff ff <48> 8b 08 48 85 c9 0f 84 8e 00 00 00 41 8b b6 94 00 00 00 3b b1 
[ 3151.324269] RIP  [<ffffffff8168a0cd>] xfrm_hash_rebuild+0x11d/0x200
[ 3151.324918]  RSP <ffff8800ad167de0>
[ 3151.325542] CR2: 0000000000000004

Am I hitting the warning I saw in commit 5719830803?

WARNING: Due to a bug at least in 3.19, the kernel crashes with a NULL pointer
dereference if a socket policy is installed while changing hash thresholds.
Subject to further investigation.

I'm running kernel 4.6.4... so is the bug still there? Any way to fix/avoid/work around it?

#30 Updated by Tobias Brunner almost 6 years ago

Am I hitting the warning I saw in commit 5719830803?

I honestly don't know. I'm actually not sure how to interpret the comment. That is, does the crash only occur if these two things happen concurrently, or is changing the thresholds in the mere presence of any socket policies enough to trigger it? Or do thresholds cause a crash later when socket policies or regular policies (in the presence of socket policies) are installed? The first two interpretations shouldn't really cause a crash, as thresholds and socket policies are set up by the same thread and the policies can't be added before the kernel-netlink plugin has been initialized, which sets the thresholds. But the latter means thresholds are set when socket policies and regular policies are installed. So it could be this issue or something completely different.

If the problem is caused by socket policies you could try setting charon.plugins.kernel-netlink.port_bypass=yes so regular bypass policies are installed that allow traffic on UDP port 500/4500 instead of socket policies.
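
That is, in kernel-netlink.conf:

    kernel-netlink {
        # install regular bypass policies for UDP 500/4500 instead of socket policies
        port_bypass = yes
    }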

#31 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

Am I hitting the warning I saw in commit 5719830803?

I honestly don't know. I'm actually not sure how to interpret the comment. That is, does the crash only occur if these two things happen concurrently, or is changing the thresholds in the mere presence of any socket policies enough to trigger it? Or do thresholds cause a crash later when socket policies or regular policies (in the presence of socket policies) are installed? The first two interpretations shouldn't really cause a crash, as thresholds and socket policies are set up by the same thread and the policies can't be added before the kernel-netlink plugin has been initialized, which sets the thresholds. But the latter means thresholds are set when socket policies and regular policies are installed. So it could be this issue or something completely different.

If the problem is caused by socket policies you could try setting charon.plugins.kernel-netlink.port_bypass=yes so regular bypass policies are installed that allow traffic on UDP port 500/4500 instead of socket policies.

I did as you suggested (set port_bypass=yes) and it seems to have worked; it's not crashing now with lbits=rbits=24.

How can I verify that this is working as expected ?

#32 Updated by Tobias Brunner almost 6 years ago

How can I verify that this is working as expected ?

Performance improvements when handling traffic. And in perf the overhead by xfrm_policy* functions should drop down.
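
For example, watch perf top again while traffic is flowing and compare with the earlier capture:

    # run on the gateway while ESP traffic is passing through; the xfrm_policy_*
    # symbols should no longer dominate the overhead column
    perf top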

#33 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

How can I verify that this is working as expected ?

Performance improvements when handling traffic. And in perf the overhead by xfrm_policy* functions should drop down.

Got it, we'll be moving some workload back to this machine and see how it behaves.

Thanks for everything once again! :)

#34 Updated by Tobias Brunner almost 6 years ago

By the way, I tried reproducing the crash in our testing environment with the 4.6.4 kernel and the code from the xfrm-spd-hash-thresh branch but wasn't able to (with socket policies). When exactly does it happen? Directly after the daemon is started? Or when a client connects? Or does it require traffic passing through?

#35 Updated by Danny Kulchinsky almost 6 years ago

I was always curious, what exactly these settings in kernel-netlink.conf do:

    # Whether to perform concurrent Netlink ROUTE queries on a single socket.
    # parallel_route = no

    # Whether to perform concurrent Netlink XFRM queries on a single socket.
    # parallel_xfrm = no

Is this something relevant in our case ?

Thanks :)

#36 Updated by Tobias Brunner almost 6 years ago

Is this something relevant in our case ?

No, these were added for a third-party implementation of the Netlink interface. Using them on vanilla Linux kernels does not improve performance (could even deteriorate it).

#37 Updated by Danny Kulchinsky almost 6 years ago

Tobias Brunner wrote:

By the way, I tried reproducing the crash in our testing environment with the 4.6.4 kernel and the code from the xfrm-spd-hash-thresh branch but wasn't able to (with socket policies). When exactly does it happen? Directly after the daemon is started? Or when a client connects? Or does it require traffic passing through?

Sorry, didn't see this before.

It seems that it happens as soon as the daemon has started, but it's possible that immediately after the daemon started a client tried to connect (we removed this record from DNS, but there are always stale DNS cache entries out there).

I'm not sure, but I did not see anything in the logs, although flush_line was set to "no", so perhaps there was something that didn't get a chance to be flushed to the log file.

For sure there wasn't any traffic, so my guess is the crash happens when a client tries to connect (I'm basing this on the fact that at least a few times it happened 1-2 minutes after the daemon started).

I could try to reproduce it again, but now it's tricky because the box is already back in service.

Sorry for not being very scientific about this.

#38 Updated by Tobias Brunner almost 6 years ago

I was able to reproduce this crash on CentOS 7 using the ELRepo 4.6.4 and 4.7 kernels. Starting the daemon with thresholds configured triggers the crash immediately. Strangely that's not the case with the vanilla 4.6.4 kernel we use in our testing environment or the 4.6.4 kernel on Fedora 24 (4.6.4-301.fc24.x86_64). However, I had a closer look at the kernel sources and was able to identify the bug and then was able to crash the other systems too.

As suspected the issue is caused by the socket policies. If any are installed when xfrm_hash_rebuild is called the kernel crashes. Setting the thresholds schedules a call of that function. While this seems to happen pretty much immediately on the latter two systems (so that the installation of the socket policies happens afterwards), on CentOS the worker is delayed so that the rebuild happens after the socket policies were installed. On the other systems a crash may be provoked by starting two instances of charon (so socket policies are there when the second instance (re-)sets the thresholds).

The bug in the kernel is that socket policies are not skipped when rebuilding the hash tables. The latter enumerates all policies (including the socket policies) and calculates their hash values and inserts them in hash tables that are maintained per direction (IN/0, OUT/1, FWD/2). The problem is that socket policies don't use regular direction IDs but direction + MAX (3) (i.e. IN/3, OUT/4, FWD/5). Therefore, after hashing such a policy there is an array overflow causing a dereference of a NULL or invalid pointer.

I will file a bug report/patch. Until that makes it into stable kernels the workaround is to use port_bypass=yes.

#39 Updated by Danny Kulchinsky almost 6 years ago

Wow, thanks so much for the detailed analysis!

On our end things are stable with port_bypass=yes, we are gradually increasing the traffic over this server (currently ~8400 tunnels).

In terms of performance, I can see an improvement compared to the other server without the xfrm-spd-hash-thresh patch: xfrm_policy_lookup is very low (1-2%), __write_lock_failed is gone completely, and software interrupts are <1%.

I will report back when we scale beyond the previous "breaking point"

By the way, is there any side effect to setting port_bypass=yes?

#40 Updated by Danny Kulchinsky almost 6 years ago

Things are looking much better now; we are running double the load on this server with CPU <10% (the previous situation was half the load at ~75% CPU usage, mostly in software interrupts).

The only observation that seems kind of strange is the overall number of "Scheduled Jobs"; it seems to be only rising, regardless of the number of tunnels being managed.

Over the daily cycle we have between 9,000 and 13,000 active tunnels at different points in time, yet the "Scheduled Jobs" counter keeps rising. 24 hours ago it was ~300,000 with ~9000 tunnels; right now it has doubled to ~600,000 while the number of tunnels is roughly the same, ~9500.

Perhaps there's an issue with cleaning up jobs that belong to tunnels that no longer exist? Or am I missing something in my observation?

#41 Updated by Tobias Brunner almost 6 years ago

By the way, is there any side effect to setting port_bypass=yes?

I guess it's slightly less efficient and because regular policies are used they could be manipulated externally (e.g. flushed via ip xfrm policy flush), which is not possible with socket policies.

Things are looking much better now; we are running double the load on this server with CPU <10% (the previous situation was half the load at ~75% CPU usage, mostly in software interrupts).

OK, great. I submitted the fix for the socket policy issue, but it will be a while until it makes it to the stable kernels.

Perhaps there's an issue with cleaning up jobs that belong to tunnels that no longer exist? Or am I missing something in my observation?

Jobs are currently not cleaned up when an IKE_SA is deleted (there is no association between IKE_SAs and their scheduled jobs). These jobs will just be no-ops when they are finally executed and the corresponding IKE_SA does not exist anymore. So if you have lots of fluctuation and events like IKE_SA rekeyings/deletes or DPDs are scheduled a while in the future these jobs will accumulate. Besides some memory requirements this should not really have that much of an impact.

#42 Updated by Danny Kulchinsky almost 6 years ago

I guess it's slightly less efficient and because regular policies are used they could be manipulated externally (e.g. flushed via ip xfrm policy flush), which is not possible with socket policies.

Ok, I'm not too worried about this at the moment.

OK, great. I submitted the fix for the socket policy issue, but it will be a while until it makes it to the stable kernels.

Perfect, will keep an eye on this.

Jobs are currently not cleaned up when an IKE_SA is deleted (there is no association between IKE_SAs and their scheduled jobs). These jobs will just be no-ops when they are finally executed and the corresponding IKE_SA does not exist anymore. So if you have lots of fluctuation and events like IKE_SA rekeyings/deletes or DPDs are scheduled a while in the future these jobs will accumulate. Besides some memory requirements this should not really have that much of an impact.

Understood; it seems that in our case monitoring this doesn't give us any benefit/insight.
I will keep monitoring the job queues, as I think those are more important.

#43 Updated by Noel Kuntze over 3 years ago

  • Category set to kernel
  • Status changed from Feedback to Closed
  • Resolution set to Fixed

Patch applied by upstream.
