Project

General

Profile

Job priority management

Purpose

The current architecture of charon uses some synchronous, blocking operations while working on IKE_SAs. Two examples of such operations are communication with a RADIUS server via EAP-RADIUS, or fetching CRL/OCSP information during certificate chain verification. Under high load conditions, the thread pool may run out of threads, and some more important jobs, such as liveness checking, may not get executed in time.

To prevent thread starvation in such situations, strongSwan 4.5.3 introduced job priorities. The job processor will reserve some threads for higher priority jobs, these threads are not available for lower priority, locking jobs.

Implementation

Currently 4 priorities have been defined, and they are used in charon as follows:

Priority Description
CRITICAL Priority for long-running dispatcher jobs.
HIGH INFORMATIONAL exchanges, as used by liveness checking (DPD).
MEDIUM Everything not HIGH/LOW, including IKE_SA_INIT processing.
LOW IKE_AUTH message processing. RADIUS and CRL fetching block here, hence the low priority.

Although IKE_SA_INIT processing is computationally expensive, it is explicitly kept in the MEDIUM class. This allows charon to do the DH exchange while other threads are blocked in IKE_AUTH. To prevent the daemon from accepting more IKE_SA_INITs it can handle, use IKE_SA_INIT dropping (see below).

The thread pool processes jobs strictly by priority, meaning it will consume all higher priority jobs before looking for ones with lower priority. Further, it reserves threads for certain priorities. A priority class having reserved n threads will always have n threads available for this class (either currently processing a job, or waiting for one).

Configuration

To always have enough threads available for higher priority tasks, they must be reserved for each priority class. This is done in strongswan.conf:

Key Default Description
libstrongswan.processor.priority_threads.critical 0 Threads reserved for CRITICAL priority class jobs
libstrongswan.processor.priority_threads.high 0 Threads reserved for HIGH priority class jobs
libstrongswan.processor.priority_threads.medium 0 Threads reserved for MEDIUM priority class jobs
libstrongswan.processor.priority_threads.low 0 Threads reserved for LOW priority class jobs

Let's consider the following configuration:

libstrongswan {
    processor {
        priority_threads {
            high = 1
            medium = 4
        }
    }
}

With this configuration, one threads is reserved for HIGH priority tasks. As currently only liveness checking and stroke message processing is done with high priority, one or two threads should be sufficient.

The MEDIUM class mostly processes non-blocking jobs. Unless your setup is experiencing many blocks in locks while accessing shared resources, threads for one or two times the number of CPU cores is fine.

It is usually not required to reserve threads for CRITICAL jobs. Jobs in this class rarely return and do not release their thread to the pool.

The remaining threads are available for LOW priority jobs. Reserving threads does not make sense (until we have an even lower priority).

Monitoring thread usage / job load

To see what the threads are actually doing, invoke ipsec statusall. Under high load, something like this will show up:

worker threads: 2 of 32 idle, 5/1/2/22 working, job queue: 0/0/1/149, scheduled: 198

From 32 worker threads,
  • 2 are currently idle
  • 5 are running CRITICAL priority jobs (dispatching from sockets, etc.)
  • 1 is currently handling a HIGH priority job. This is actually the thread currently providing this information via stroke.
  • 2 are handling MEDIUM priority jobs, likely IKE_SA_INIT or CREATE_CHILD_SA messages
  • 22 are handling LOW priority tasks, probably waiting for a EAP-RADIUS response while processing IKE_AUTH messages

The job queue load shows how many jobs are queued for each priority, ready for execution. The single MEDIUM priority job will get executed immediately, as we have two spare threads reserved for MEDIUM class jobs.

IKE_SA_INIT dropping

If a responder receives more connection requests per second than it can handle, it does not make sense to accept more IKE_SA_INIT messages. If they are queued but can't get processed in time, an answer might be sent after the client has already given up and restarted its connection setup. This additionally increases the load on the responder.

To limit the responder load resulting from new connection attempts, the daemon can drop IKE_SA_INIT messages just after reception. There are two mechanisms to decide if this should happen, configured with strongswan.conf options:

Key Default Criteria Description
charon.init_limit_half_open 0 (disabled) Number of half open IKE_SAs Half open IKE_SAs are SAs in connecting state, but not yet established
charon.init_limit_job_load 0 (disabled) Job load on the processor Number of jobs currently queued for processing, sum over all job priorities

The second limit includes load from other jobs, such as rekeying. Choosing a good value is difficult and depends on the hardware and generated load.

The first limit is simpler to calculate, but includes the load from new connections only. If your responder is capable of negotiating 100 tunnels/s, you might set this limit to 1000. It will then drop new connection attempts if generating a response would require more than 10 seconds. If you're allowing for with a maximum response time of more than 30 seconds, consider adjusting the timeout for connecting IKE_SAs (half-open timeout) in strongswan.conf:

Key Default Description
charon.half_open_timeout 30 Timeout, in seconds, of connecting IKE_SA

A responder, by default, deletes an IKE_SA if the initiator does not establish it within 30 seconds. Under high load, a higher value might be required.