Project

General

Profile

Issue #2139

strongSwan in HA configuration with parallel crypto (pcrypt) causes a Kernel crash

Added by Danny Kulchinsky about 4 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Affected version:
5.5.0
Resolution:
Fixed

Description

We have dual active strongSwan (5.5.0) HA setup (serving as a responder/security gateway), running on CentOS 7.2 with Kernel 4.7.2 (vanilla with HA patches).

We have configured both strongSwan nodes to use parallel crypto (pcrypt) driver using crconf:

crconf add driver "pcrypt(authenc(hmac(md5),cbc(aes)))" type 3

Once the client connects and sends some traffic over the tunnel, it causes the strongSwan node that is managing the segment to crash, below is the output captured with kdump.

When we fall back to cryptd (do not preload pcrypt with crconf) everything seems to work smoothly, no crashes.

vmcore-dmesg from node-1:

[ 1562.740331] BUG: unable to handle kernel NULL pointer dereference at 00000000000002f0
[ 1562.740391] IP: [<ffffffffa047ccf8>] esp_input_done2+0x48/0x240 [esp4]
[ 1562.740421] PGD 201989067 PUD 22ca44067 PMD 0 
[ 1562.740443] Oops: 0000 [#1] SMP
[ 1562.740459] Modules linked in: pcrypt crypto_user nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key iptable_nat vmw_vsock_vmci_transport nf_nat_ipv4 vsock nf_nat xt_policy ipt_REJECT nf_reject_ipv4 ipt_CLUSTERIP nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables sb_edac edac_core intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd intel_rapl_perf ppdev vmw_balloon pcspkr input_leds sg vmw_vmci shpchp parport_pc i2c_piix4 parport acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi
[ 1562.740768]  sd_mod vmwgfx crc32c_intel drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm mptspi ata_piix scsi_transport_spi vmxnet3 mptscsih drm mptbase libata floppy fjes dm_mirror dm_region_hash dm_log dm_mod
[ 1562.740856] CPU: 1 PID: 16577 Comm: kworker/1:0 Not tainted 4.7.2+ #1
[ 1562.740879] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[ 1562.740920] Workqueue: pdecrypt padata_serial_worker
[ 1562.740939] task: ffff880231bc2d00 ti: ffff8802344c4000 task.ti: ffff8802344c4000
[ 1562.740965] RIP: 0010:[<ffffffffa047ccf8>]  [<ffffffffa047ccf8>] esp_input_done2+0x48/0x240 [esp4]
[ 1562.740999] RSP: 0018:ffff8802344c7d50  EFLAGS: 00010246
[ 1562.741016] RAX: ffff8802349463c0 RBX: ffff8800bbb37c00 RCX: 00000000ffff8802
[ 1562.741039] RDX: ffffffffffff8801 RSI: 0000000000000000 RDI: ffff8800a0087800
[ 1562.741063] RBP: ffff8802344c7da0 R08: ffff8800a00878b0 R09: ffff88023fc57740
[ 1562.741086] R10: 0000000000000000 R11: 0000016bda948d17 R12: 0000000000000000
[ 1562.741109] R13: ffff8802344c7dd8 R14: ffffe8ffffc43c10 R15: 0000000000000064
[ 1562.741133] FS:  0000000000000000(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000
[ 1562.741160] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1562.741179] CR2: 00000000000002f0 CR3: 0000000234135000 CR4: 00000000000406e0
[ 1562.741242] Stack:
[ 1562.741966]  0000000000000001 ffff880231bc2d00 ffff8802344c7dc8 ffff8802344c7dc0
[ 1562.742686]  00000000fb8e20f6 ffff8800bbb37c00 ffff88022d83361c ffff8802344c7dd8
[ 1562.743396]  ffffe8ffffc43c10 0000000000000040 ffff8802344c7db8 ffffffffa047d1a6
[ 1562.744122] Call Trace:
[ 1562.744833]  [<ffffffffa047d1a6>] esp_input_done+0x16/0x30 [esp4]
[ 1562.745537]  [<ffffffffa0552016>] pcrypt_aead_serial+0x16/0x20 [pcrypt]
[ 1562.746264]  [<ffffffff81184ba0>] padata_serial_worker+0xa0/0xe0
[ 1562.746961]  [<ffffffff810988b2>] process_one_work+0x152/0x400
[ 1562.747662]  [<ffffffff810991a5>] worker_thread+0x125/0x4b0
[ 1562.748336]  [<ffffffff81099080>] ? rescuer_thread+0x380/0x380
[ 1562.749005]  [<ffffffff8109ecd8>] kthread+0xd8/0xf0
[ 1562.749663]  [<ffffffff8172207f>] ret_from_fork+0x1f/0x40
[ 1562.750292]  [<ffffffff8109ec00>] ? kthread_park+0x60/0x60
[ 1562.750912] Code: 00 00 00 89 75 b4 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 47 68 48 8b 7f 50 8b 48 04 8d 51 ff 48 63 d2 4c 8b 64 d0 08 <49> 8b 84 24 f0 02 00 00 44 8b 28 48 8b 40 50 44 8b 70 f8 e8 60 
[ 1562.752974] RIP  [<ffffffffa047ccf8>] esp_input_done2+0x48/0x240 [esp4]
[ 1562.753588]  RSP <ffff8802344c7d50>
[ 1562.754171] CR2: 00000000000002f0

Logs from node-1 (this node crashed):

2016-10-09 14:00:46.646 30[ENC] <2> parsed IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(FRAG_SUP) ]
2016-10-09 14:00:46.646 30[IKE] <2> <Client Public IP> is initiating an IKE_SA
2016-10-09 14:00:46.647 30[IKE] <2> remote host is behind NAT
2016-10-09 14:00:46.647 30[ENC] <2> generating IKE_SA_INIT response 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(FRAG_SUP) N(MULT_AUTH) ]
2016-10-09 14:00:46.838 24[ENC] <2> unknown attribute type (16384)
2016-10-09 14:00:46.838 24[ENC] <2> parsed IKE_AUTH request 1 [ IDi N(INIT_CONTACT) IDr CPRQ(ADDR MASK DNS DNS (16384)) N(ESP_TFC_PAD_N) N(NON_FIRST_FRAG) SA TSi TSr N(EAP_ONLY) ]
2016-10-09 14:00:46.838 24[CFG] <2> looking for peer configs matching <strongSwan Cluster IP>[ims]...<Client Public IP>[<Client EAP Identity>]
2016-10-09 14:00:46.838 24[CFG] <some name|2> selected peer config 'some name'
2016-10-09 14:00:46.838 24[CFG] <some name|2> sending RADIUS Access-Request to server 'MAPR-TGA-RAD07'
2016-10-09 14:00:46.840 24[CFG] <some name|2> received RADIUS Access-Challenge from server 'MAPR-TGA-RAD07'
2016-10-09 14:00:46.840 24[IKE] <some name|2> initiating EAP_AKA method (id 0x01)
2016-10-09 14:00:46.840 24[IKE] <some name|2> received ESP_TFC_PADDING_NOT_SUPPORTED, not using ESPv3 TFC padding
2016-10-09 14:00:46.840 24[ENC] <some name|2> generating IKE_AUTH response 1 [ IDr EAP/REQ/AKA ]
2016-10-09 14:00:47.119 15[ENC] <some name|2> parsed IKE_AUTH request 2 [ EAP/RES/AKA ]
2016-10-09 14:00:47.119 15[CFG] <some name|2> sending RADIUS Access-Request to server 'MAPR-TGA-RAD07'
2016-10-09 14:00:47.458 15[CFG] <some name|2> received RADIUS Access-Challenge from server 'MAPR-TGA-RAD07'
2016-10-09 14:00:47.458 15[ENC] <some name|2> generating IKE_AUTH response 2 [ EAP/REQ/AKA ]
2016-10-09 14:00:47.897 20[ENC] <some name|2> parsed IKE_AUTH request 3 [ EAP/RES/AKA ]
2016-10-09 14:00:47.897 20[CFG] <some name|2> sending RADIUS Access-Request to server 'MAPR-TGA-RAD07'
2016-10-09 14:00:47.899 20[CFG] <some name|2> received RADIUS Access-Accept from server 'MAPR-TGA-RAD07'
2016-10-09 14:00:47.899 20[IKE] <some name|2> RADIUS authentication of '<Client EAP Identity>' successful
2016-10-09 14:00:47.899 20[IKE] <some name|2> EAP method EAP_AKA succeeded, MSK established
2016-10-09 14:00:47.899 20[ENC] <some name|2> generating IKE_AUTH response 3 [ EAP/SUCC ]
2016-10-09 14:00:48.156 31[ENC] <some name|2> parsed IKE_AUTH request 4 [ AUTH ]
2016-10-09 14:00:48.156 31[IKE] <some name|2> authentication of '<Client EAP Identity>' with EAP successful
2016-10-09 14:00:48.156 31[IKE] <some name|2> authentication of 'ims' (myself) with EAP
2016-10-09 14:00:48.156 31[IKE] <some name|2> IKE_SA some name[2] established between <strongSwan Cluster IP>[ims]...<Client Public IP>[<Client EAP Identity>]
2016-10-09 14:00:48.156 31[IKE] <some name|2> scheduling rekeying in 85806s
2016-10-09 14:00:48.156 31[IKE] <some name|2> maximum IKE_SA lifetime 86346s
2016-10-09 14:00:48.156 31[IKE] <some name|2> peer requested virtual IP %any
2016-10-09 14:00:48.156 31[CFG] <some name|2> acquired address 10.23.0.2 from HA pool 'wifi-pool-4'
2016-10-09 14:00:48.156 31[IKE] <some name|2> assigning virtual IP 10.23.0.2 to peer '<Client EAP Identity>'
2016-10-09 14:00:48.157 31[CFG] <some name|2> handling HA CHILD_SA some name{4} 10.1.1.0/24 10.1.2.0/24 === 10.23.0.2/32 (segment in: 1, out: 1)
2016-10-09 14:00:48.157 31[IKE] <some name|2> CHILD_SA some name{4} established with SPIs c4a61882_i 093f29f1_o and TS 10.1.1.0/24 10.1.2.0/24 === 10.23.0.2/32
2016-10-09 14:00:48.157 31[CFG] <some name|2> sending RADIUS Accounting-Request to server 'MAPR-TGA-RAD06'
2016-10-09 14:00:48.160 31[CFG] <some name|2> received RADIUS Accounting-Response from server 'MAPR-TGA-RAD06'
2016-10-09 14:00:48.160 31[ENC] <some name|2> generating IKE_AUTH response 4 [ AUTH CPRP(ADDR) SA TSi TSr ]

Logs from node-2:

2016-10-09 14:00:48.142 04[IKE] <1> remote host is behind NAT
2016-10-09 14:00:48.142 04[CFG] <some name|1> installed HA passive IKE_SA 'some name' <strongSwan Cluster IP>[ims]...<Client Public IP>[<Client EAP Identity>]
2016-10-09 14:00:48.149 04[CFG] <some name|1> installed HA CHILD_SA some name{1} 10.1.1.0/24 10.1.2.0/24 === 10.23.0.2/32 (segment in: 1*, out: 1*)
2016-10-09 14:00:48.149 04[CFG] <some name|1> reserved address 10.23.0.2 in HA pool 'wifi-pool-4'
2016-10-09 14:01:30.193 05[CFG] no heartbeat received, taking all segments
2016-10-09 14:01:30.194 05[IKE] <some name|1> scheduling rekeying in 85646s
2016-10-09 14:01:30.194 05[IKE] <some name|1> maximum IKE_SA lifetime 86186s
2016-10-09 14:01:30.194 05[CFG] HA segment 2 activated, now active: 1,2

ha-async-crypto.patch (1.42 KB) ha-async-crypto.patch Luka Logar, 21.08.2017 12:53
ha-async-crypto-v2.patch (1.33 KB) ha-async-crypto-v2.patch Luka Logar, 04.10.2017 10:07

History

#1 Updated by Danny Kulchinsky about 4 years ago

Did a few more tests, it is actually the node where the PASSIVE IKE_SA is installed that crashes (not sure how I got it mixed up above).

Also, I found that it doesn't happen immediately, ~15 packets will go through before it triggers the crash.

I was monitoring using watch -n 1 swanctl -l until it crashed.

# swanctl -l
Test: #3, PASSIVE, IKEv2, 043ee77edc67da80_i 7a8042c12a3dc72c_r*        
  local  '<local id>' @ <Cluster IP>[4500]
  remote '<remote id>' @ <Client IP>[56953] [10.23.0.1]
  AES_CBC-256/HMAC_MD5_96/PRF_HMAC_MD5/MODP_1024
  Test: #3, reqid 3, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-256/HMAC_SHA1_96
    installed 32s ago, rekeying in 28098s, expires in 28768s
    in  c8d088d6,      0 bytes,     0 packets
    out b2bc6177,   1260 bytes,    15 packets,     1s ago
    local  10.1.1.0/24 10.1.2.0/24
    remote 10.23.0.1/32

Connection timed out

Question - should the "out" packets/bytes increment for the PASSIVE IKE_SA ? shouldn't these have been dropped by CLUSTERIP before reaching xfrm ?
In any case, the same happens (out increments on PASSIVE IKE_SA) when switching back to cryptd, but no kernel crash then.

#2 Updated by Tobias Brunner about 4 years ago

  • Status changed from New to Feedback

Question - should the "out" packets/bytes increment for the PASSIVE IKE_SA ? shouldn't these have been dropped by CLUSTERIP before reaching xfrm ?

To sync the sequence numbers both nodes will handle outbound traffic, but the passive node should drop the packet before encryption.

The problem here is actually handling the inbound traffic on the passive node, which will process every n-th packet (default is every 16-th) to verify and increase the inbound sequence number (this is also supported by the backtrace - esp_input_done). For this purpose the patch directly calls x->type->input() (i.e. esp_input), which could return EINPROGRESS if pcrypt is used. This is currently just handled as an errors. And the problem is that the esp_input_done callback that's called after the asynchronous decryption assumes the initial call came via xfrm_input(), which it will try to resume. But that's obviously wrong in this case. I don't think this can easily be fixed, so I guess this makes the HA patches inherently incompatible with pcrypt.

#3 Updated by Danny Kulchinsky about 4 years ago

Thank you Tobias, for now we have fallen back to cryptd and things seem to be working OK.

In our scenario the strongSwan cluster needs to manage many thousands of tunnels, while the traffic per tunnel is very low (~30KB), I wonder how the lack of pcrypt is going to impact on the performance of such a setup, any thoughts ?

Would you consider finding a fix/solution for this? or you consider this out of scope.

P.S. I think it would be a good idea to add some comment/note about this limitation in the HighAvailability Wiki.

#4 Updated by Tobias Brunner about 4 years ago

Would you consider finding a fix/solution for this? or you consider this out of scope.

It's currently definitely out of scope. I'm not even sure if there is a reasonable way to make the inbound asynchronous code path more flexible.

P.S. I think it would be a good idea to add some comment/note about this limitation in the HighAvailability Wiki.

Yeah, added a note.

#5 Updated by Danny Kulchinsky about 4 years ago

Thanks Tobias!

We are proceeding without pcrypt for now, we started with a very small load.

I have noticed something strange though, it seems we have quite a bit of "Stale Connections" which I don't really understand their status and the number of half-open connections keep increasing.

In swanctl -l, I see a long list of connections in "CONNECTING" state, for example:

(unnamed): #26216, CONNECTING, IKEv2, 1d88959770f8dacd_i 3026b7b230c3d196_r*
  local  '%any' @ %any[500]
  remote '%any' @ <Client IP>[500]
  AES_CBC-256/HMAC_MD5_96/PRF_HMAC_SHA1/MODP_1024

(unnamed): #27381, CONNECTING, IKEv2, 2e4c5fb30ec4b9cd_i 60b3e68dab5fa8ba_r*
  local  '%any' @ %any[500]
  remote '%any' @ <Client IP>[500]
  AES_CBC-256/HMAC_MD5_96/PRF_HMAC_SHA1/MODP_1024

I searched the log (all subsystems are set to loglevel=1) and there's nothing about these IKE IDs (26216, 27381). Above is just a sample, but the situation is the same for most of these "CONNECTING" sessions.

It's like they are lost somewhere in the process and become stale, they are counted as half-open but are not cleared after the half_open_timeout of 30 seconds.

All the above symptoms disappear as soon as I stop strongSwan on one of the nodes and terminate the stale connections (using swanctl -t).

When I try to re-integrate the node, it is able to synchronize ~200-300 IKE SAs and seems to fail for the rest (there are ~1600 IKS_SAs), I see many of the following messages:

2016-10-10 17:04:00.046 07[CFG] passive HA IKE_SA to update not found
2016-10-10 17:04:24.615 07[CHD] IKE_SA for HA CHILD_SA not found

I found this thread (https://lists.strongswan.org/pipermail/users/2014-August/006397.html) which seems to describe a similar condition as in my case, but there's no conclusion there (I checked the items that Thomas suggest and it looks fine).

#6 Updated by Tobias Brunner about 4 years ago

It's like they are lost somewhere in the process and become stale, they are counted as half-open but are not cleared after the half_open_timeout of 30 seconds.

Might be the same issue as #1192, you may try the patch in the 1192-ha-half-open branch.

#7 Updated by Danny Kulchinsky about 4 years ago

Tobias Brunner wrote:

It's like they are lost somewhere in the process and become stale, they are counted as half-open but are not cleared after the half_open_timeout of 30 seconds.

Might be the same issue as #1192, you may try the patch in the 1192-ha-half-open branch.

Thank you Tobias, I will try this patch ASAP and report back.

I also found ha-sync-delay branch, could this help with the reintegration issue ?

#8 Updated by Tobias Brunner about 4 years ago

I also found ha-sync-delay branch, could this help with the reintegration issue ?

I don't think that syncing while not having the complete config would result in the messages you saw. It rather sounds like some messages on the sync channel got lost or reordered (e.g. if the message that adds an IKE_SA did not arrive (yet) but the following messages that try to update it or add CHILD_SAs are). If sending a sync message immediately does not work an asynchronous job is queued (see source:src/libcharon/plugins/ha/ha_socket.c#L86), which could delay the messages or get them reordered (as these jobs will be processed concurrently). You could add a log message there to see if this happens in your setup.

#9 Updated by Danny Kulchinsky about 4 years ago

Tobias Brunner wrote:

I also found ha-sync-delay branch, could this help with the reintegration issue ?

I don't think that syncing while not having the complete config would result in the messages you saw. It rather sounds like some messages on the sync channel got lost or reordered (e.g. if the message that adds an IKE_SA did not arrive (yet) but the following messages that try to update it or add CHILD_SAs are). If sending a sync message immediately does not work an asynchronous job is queued (see source:src/libcharon/plugins/ha/ha_socket.c#L86), which could delay the messages or get them reordered (as these jobs will be processed concurrently). You could add a log message there to see if this happens in your setup.

I patched our cluster with the 1192-ha-half-open patch, it seems that the half-open issue has been resolved.

But I still have some concerns about the reintegration of a failed node, perhaps the issue is due to dropped udp packets on the receiving (joining) node when the active sends thousands of messages.

I do see quite a bit of UdpInErrors & UdpRcvbufErrors on the joining node:

UdpInErrors                     72718              0.0
UdpRcvbufErrors                 72576              0.0

I have tuned the following Kernel parameters on both nodes:

net.core.rmem_default = 16777216
net.core.rmem_max = 33554432
net.core.netdev_max_backlog = 5000

I also see the following error quite often (few per second):

2016-10-11 20:46:00.311 03[NET] error writing to socket: Operation not permitted

I will make another test and update on the results, if there are any other ideas - please let me know.

#10 Updated by Danny Kulchinsky about 4 years ago

I performed re-integration tests today, unfortunately they are causing charon to crash on both nodes.

I'm moving this to a new Issue #2146

#11 Updated by Danny Kulchinsky about 4 years ago

It seems that with cryptd we still hit these Kernel crashes from time-to-time (every ~24-48 hours).

[887538.646246] BUG: unable to handle kernel NULL pointer dereference at 00000000000002f0
[887538.646305] IP: [<ffffffffa04b3cf8>] esp_input_done2+0x48/0x240 [esp4]
[887538.646336] PGD 17fe7c067 PUD 1c07c9067 PMD 0 
[887538.646356] Oops: 0000 [#1] SMP
[887538.646370] Modules linked in: iptable_nat nf_nat_ipv4 nf_nat iptable_filter ip_tables nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill authencesn binfmt_misc authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key ip6t_REJECT nf_reject_ipv6 xt_policy nf_conntrack_ipv6 ipt_REJECT nf_defrag_ipv6 nf_reject_ipv4 ipt_CLUSTERIP ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 vmw_vsock_vmci_transport vsock xt_conntrack nf_conntrack sb_edac edac_core intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev intel_rapl_perf vmw_balloon input_leds pcspkr vmw_vmci sg i2c_piix4 shpchp parport_pc parport acpi_cpufreq xfs libcrc32c sr_mod cdrom ata_generic pata_acpi
[887538.647356]  sd_mod crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm vmxnet3 mptspi ata_piix scsi_transport_spi mptscsih drm mptbase libata floppy fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[887538.647448] CPU: 2 PID: 21499 Comm: kworker/2:1 Not tainted 4.7.2+ #1
[887538.647469] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[887538.647510] Workqueue: crypto cryptd_queue_worker [cryptd]
[887538.647529] task: ffff880099d18000 ti: ffff88016fa98000 task.ti: ffff88016fa98000
[887538.647564] RIP: 0010:[<ffffffffa04b3cf8>]  [<ffffffffa04b3cf8>] esp_input_done2+0x48/0x240 [esp4]
[887538.647595] RSP: 0018:ffff88016fa9bd38  EFLAGS: 00010246
[887538.647612] RAX: ffff880234084d40 RBX: ffff88007d89d300 RCX: 00000000ffff8802
[887538.647634] RDX: ffffffffffff8801 RSI: 0000000000000000 RDI: ffff88005b5adc00
[887538.647656] RBP: ffff88016fa9bd88 R08: ffff88005b5adc00 R09: 0000000000000020
[887538.647678] R10: ffff88005003e9b0 R11: 0003273631bb275e R12: 0000000000000000
[887538.647700] R13: ffff88005b5adda0 R14: ffff88023fc9d200 R15: 0000000000000074
[887538.647723] FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[887538.647747] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[887538.647765] CR2: 00000000000002f0 CR3: 0000000182aa7000 CR4: 00000000000406e0
[887538.647842] Stack:
[887538.648607]  0000000000000000 0000000000000000 0000000000000000 ffff88005b5adc00
[887538.649309]  00000000b37af5a6 ffff88007d89d300 ffffe8ffffc826f0 ffff88005b5adda0
[887538.650068]  ffff88023fc9d200 0000000000000080 ffff88016fa9bda0 ffffffffa04b41a6
[887538.650797] Call Trace:
[887538.651497]  [<ffffffffa04b41a6>] esp_input_done+0x16/0x30 [esp4]
[887538.652211]  [<ffffffffa038c419>] cryptd_blkcipher_crypt+0x69/0xa0 [cryptd]
[887538.652915]  [<ffffffffa038c46c>] cryptd_blkcipher_decrypt+0x1c/0x20 [cryptd]
[887538.653620]  [<ffffffffa038cc34>] cryptd_queue_worker+0x64/0x90 [cryptd]
[887538.654339]  [<ffffffff810988b2>] process_one_work+0x152/0x400
[887538.655068]  [<ffffffff810991a5>] worker_thread+0x125/0x4b0
[887538.655747]  [<ffffffff81099080>] ? rescuer_thread+0x380/0x380
[887538.656392]  [<ffffffff8109ecd8>] kthread+0xd8/0xf0
[887538.657050]  [<ffffffff8172207f>] ret_from_fork+0x1f/0x40
[887538.657702]  [<ffffffff8109ec00>] ? kthread_park+0x60/0x60
[887538.658301] Code: 00 00 00 89 75 b4 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 47 68 48 8b 7f 50 8b 48 04 8d 51 ff 48 63 d2 4c 8b 64 d0 08 <49> 8b 84 24 f0 02 00 00 44 8b 28 48 8b 40 50 44 8b 70 f8 e8 60 
[887538.660217] RIP  [<ffffffffa04b3cf8>] esp_input_done2+0x48/0x240 [esp4]
[887538.660845]  RSP <ffff88016fa9bd38>
[887538.661416] CR2: 00000000000002f0

Is there any way to avoid this ? any workaround ?

#12 Updated by Tobias Brunner about 4 years ago

It seems that with cryptd we still hit these Kernel crashes from time-to-time (every ~24-48 hours).

As asynchronous crypto processing is not compatible with the HA patch (same reason why it isn't compatible with pcrypt), cryptd shouldn't be involved.

#13 Updated by Danny Kulchinsky about 4 years ago

Tobias Brunner wrote:

It seems that with cryptd we still hit these Kernel crashes from time-to-time (every ~24-48 hours).

As asynchronous crypto processing is not compatible with the HA patch (same reason why it isn't compatible with pcrypt), cryptd shouldn't be involved.

I'm not sure I understand, what other option we have in Linux for a crypto driver? or I'm missing something in my setup/configuration?

cryptd is picked up by default, can I turn it off? use something else?

#14 Updated by Tobias Brunner about 4 years ago

It seems that with cryptd we still hit these Kernel crashes from time-to-time (every ~24-48 hours).

As asynchronous crypto processing is not compatible with the HA patch (same reason why it isn't compatible with pcrypt), cryptd shouldn't be involved.

I'm not sure I understand, what other option we have in Linux for a crypto driver? or I'm missing something in my setup/configuration?

cryptd is picked up by default, can I turn it off? use something else?

Yes, I guess the module can't be disabled. And it probably only causes asynchronous processing of crypto requests if needed, so it might work without problems for a while until a request can't be handled right away (not sure if this can be controlled somehow). It looks like it is mainly used for hardware accelerated crypto on the x86 platform (e.g. by the aesni_intel module). In general ESP contexts are explicitly initialized with crypto_alloc_... and not cryptd_alloc_..., but depending on the algorithms/platform that might still cause a call to cryptd_alloc_... (the crypto framework is hugely modular, so it's quite untransparent what's going on exactly). Do you see any cryptd(...) entries in /proc/crypto (e.g. for AES-GCM)? Do all clients use the same cipher suites? Did you try disabling the aesni_intel module?

#15 Updated by Luka Logar about 4 years ago

I am following this issue closely, as we'll also have to implement HA in the near future. If I understand you (Tobias) correctly then all crypto algorithms from /proc/crypto that have "async : yes" are potentially dangerous to use in HA configuration. Most of the algorithms implemented by aesni_intel module (including cbc, ctr and gcm modes) have this flag set (and depend on cryptd), but if you instantiate cbc(aes-aesni) and not the optimized version cbc-aes-aesni you get the blkcipher (and not ablkcipher) version of the aes in cbc mode which can then further be used in authenc(hmac(sha256-ssse3),cbc(aes-aesni)) (which has "async : no"). What is interesting is, that on a fairly decent CPU (i3-4350T) we get higher throughput with authenc(hmac(sha256-ssse3),cbc(aes-aesni)) than with optimized/parallelized pcrypt(authenc(hmac(sha256-ssse3),cbc-aes-aesni)).

So to sum it up, I'd try registering this algorithm (choose the highest priority)

crconf add driver authenc(hmac(sha256-ssse3),cbc(aes-aesni)) type 3 priority 5000

which the kernel should then (I guess) select as the crypto for ESP (provided aes cbc is specified in ipsec.conf) ...

Luka

#16 Updated by Danny Kulchinsky about 4 years ago

Tobias Brunner wrote:

It seems that with cryptd we still hit these Kernel crashes from time-to-time (every ~24-48 hours).

As asynchronous crypto processing is not compatible with the HA patch (same reason why it isn't compatible with pcrypt), cryptd shouldn't be involved.

I'm not sure I understand, what other option we have in Linux for a crypto driver? or I'm missing something in my setup/configuration?

cryptd is picked up by default, can I turn it off? use something else?

Yes, I guess the module can't be disabled. And it probably only causes asynchronous processing of crypto requests if needed, so it might work without problems for a while until a request can't be handled right away (not sure if this can be controlled somehow). It looks like it is mainly used for hardware accelerated crypto on the x86 platform (e.g. by the aesni_intel module). In general ESP contexts are explicitly initialized with crypto_alloc_... and not cryptd_alloc_..., but depending on the algorithms/platform that might still cause a call to cryptd_alloc_... (the crypto framework is hugely modular, so it's quite untransparent what's going on exactly).

Do you see any cryptd(...) entries in /proc/crypto (e.g. for AES-GCM)?

Yes, we do:

name         : __cbc-aes-aesni
driver       : cryptd(__driver-cbc-aes-aesni)
module       : cryptd
priority     : 50
refcnt       : 4973
selftest     : passed
internal     : yes
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 0
geniv        : <default>

Do all clients use the same cipher suites?

Yes, all clients use the same cipher suite, specifically:

ike=aes256-md5-sha1-modp1024!
esp=aes256-md5-esn-noesn!

with ESN auto-negotiated, most of the clients do not use it, some small percentage do.

Did you try disabling the aesni_intel module?

No, I have not, per above I understand that it must be the culprit.

Do you see any way around this or AES-NI cannot be used in HA setup ?

#17 Updated by Tobias Brunner about 4 years ago

Do you see any way around this or AES-NI cannot be used in HA setup ?

Please have a look at the message from Luka (thanks, by the way). In your case something like the following might do:

crconf add driver authenc(hmac(md5),cbc(aes-aesni)) type 3 priority 5000

Depending on the architecture you might get better performance when switching to a different integrity algorithm than MD5 (SHA1 or SHA256 both have optimized versions on Intel platforms, again, see Luka's message).

#18 Updated by Danny Kulchinsky about 4 years ago

Tobias Brunner wrote:

Do you see any way around this or AES-NI cannot be used in HA setup ?

Please have a look at the message from Luka (thanks, by the way). In your case something like the following might do:

crconf add driver authenc(hmac(md5),cbc(aes-aesni)) type 3 priority 5000

Yes, Thank you both - I'm going to test this and will let you know.
I guess a note on this issue in the HA Wiki is called for :)

Depending on the architecture you might get better performance when switching to a different integrity algorithm than MD5 (SHA1 or SHA256 both have optimized versions on Intel platforms, again, see Luka's message).

I realize that MD5 is not optimal, unfortunately we don't have a say on the cipher suite selection, we must comply with the client vendor and MD5 is their choice.

#19 Updated by Danny Kulchinsky about 4 years ago

Following your suggestion, I have instantiated cbc(aes-aesni) instead of cbc-aes-aesni

Here are the crconf commands I used:

crconf add driver "authenc(hmac(md5),cbc(aes-aesni))" type 3 priority 5000
crconf add driver "authencesn(hmac(md5),cbc(aes-aesni))" type 3 priority 5000

crconf update driver "authenc(hmac(md5-generic),cbc(aes-aesni))" type 3 priority 5000
crconf update driver "authencesn(hmac(md5-generic),cbc(aes-aesni))" type 3 priority 5000

crconf del driver "authenc(hmac(md5-generic),cbc-aes-aesni)" type 3
crconf del driver "authencesn(hmac(md5-generic),cbc-aes-aesni)" type 3

Note that for some reason, when adding the driver and specifying priority 5000 it is instantiated with a priority of 3000 (hence the update).

By looking at /proc/crypto it seems that the correct algo is picked up (judging by the refcnt):

name         : echainiv(authencesn(hmac(md5),cbc(aes)))
driver       : echainiv(authencesn(hmac(md5-generic),cbc(aes-aesni)))
module       : echainiv
priority     : 5000
refcnt       : 3
selftest     : passed
internal     : no
type         : aead
async        : no
blocksize    : 16
ivsize       : 16
maxauthsize  : 16
geniv        : <none>

name         : echainiv(authenc(hmac(md5),cbc(aes)))
driver       : echainiv(authenc(hmac(md5-generic),cbc(aes-aesni)))
module       : echainiv
priority     : 5000
refcnt       : 53
selftest     : passed
internal     : no
type         : aead
async        : no
blocksize    : 16
ivsize       : 16
maxauthsize  : 16
geniv        : <none>

So far things have been stable, but this issue may take some days even to reappear.

#20 Updated by Luka Logar about 4 years ago

The priority should be high enough for the algorithm to be selected by the kernel. If there are no algorithms with priority higher than 3000 then you can leave it at 3000 (or whatever you choose).

Also, be careful not to use "crconf del driver pcrypt(...)" (not that you are using it now and also not related to ipsec), as there is a bug in pcrypt module and deregistering the pcrypt driver causes kernel to crash. See https://bugzilla.kernel.org/show_bug.cgi?id=178651

#21 Updated by Danny Kulchinsky about 4 years ago

Luka Logar wrote:

The priority should be high enough for the algorithm to be selected by the kernel. If there are no algorithms with priority higher than 3000 then you can leave it at 3000 (or whatever you choose).

Thanks Luka, but for some reason the cbc-aes-aesni is instantiated with priority 4000 (by default), so when I added initially cbc(aes-aesni) it had a priority of 3000 and wasn't picked up until I changed the priority to something higher.

Also, be careful not to use "crconf del driver pcrypt(...)" (not that you are using it now and also not related to ipsec), as there is a bug in pcrypt module and deregistering the pcrypt driver causes kernel to crash. See https://bugzilla.kernel.org/show_bug.cgi?id=178651

Good catch! thanks :)

#22 Updated by Luka Logar about 3 years ago

The inability to use async crypto (e.g. aes-ni) bothered me for quite some time.

It turned out, that the main problem lies in xfrm_input NF_HOOK() freeing (NF_DROP-ed) skb before async crypto is done. The solution is not to NF_DROP every 16th packet, instead just flag it for dropping (a one-line patch to skbuff.h required), NF_ACCEPT it and then drop it after the decryption is done. As every 16th packet destined for decryption is now accepted instead of dropped, there is also no need for special x->type->input() call inside hook, as the default xfrm_type_input() is called for these packets. All that remains is simply to advance replay window and drop the packet once the xfrm_input() is called/continued after the decryption.

The resulting (attached) patch (to be applied after the regular strongSwan HA kernel patch) is quite small and we are using it for some time now with HA and async crypto (esp=aes256gcm16...) without any kernel crashes.

#23 Updated by Tobias Brunner about 3 years ago

The solution is not to NF_DROP every 16th packet, instead just flag it for dropping (a one-line patch to skbuff.h required), NF_ACCEPT it and then drop it after the decryption is done.

Thanks for looking into it. Sounds reasonable. Instead of changing struct skb the flag could probably also be added to struct sec_path (or one of the xfrm_*_skb_cb structs) to avoid the overhead for non-IPsec traffic. I guess avoiding the flag is tricky (e.g. by calling the hook twice)?

All that remains is simply to advance replay window and drop the packet once the xfrm_input() is called/continued after the decryption.

Why not place the ha_dropit check after the existing call to repl->advance? Then we'd also get the call to repl->recheck in case that's necessary.

#24 Updated by Luka Logar about 3 years ago

Tobias Brunner wrote:

Thanks for looking into it. Sounds reasonable. Instead of changing struct skb the flag could probably also be added to struct sec_path (or one of the xfrm_*_skb_cb structs) to avoid the overhead for non-IPsec traffic. I guess avoiding the flag is tricky (e.g. by calling the hook twice)?

Since I'm not exactly a Linux kernel guru, I went with the safest choice and put flag in struct skb. Indeed, it could be moved to struct sec_path. I did this in the attached ha-async-crypto-v2.patch. I wouldn't complicate by trying to avoid using ha_dropit flag, as the kernel has to be patched anyway and one more small patch doesn't matter.

Why not place the ha_dropit check after the existing call to repl->advance? Then we'd also get the call to repl->recheck in case that's necessary.

The first version of the patch stayed as close to your original patch as possible thus returning from xfrm_input() as soon as possible. I've followed your suggestion in patch v2.

So far (for a few days) it's been running fine...

#25 Updated by Tobias Brunner about 3 years ago

So far (for a few days) it's been running fine...

Great, thanks for the updated patch and the test. I've incorporated the changes into the patches for the 4.4+, 4.6+ (this one's new because of the dev_put fix), 4.12 and 4.13 kernels on download.strongswan.org and in our repository (see the branches here).

#26 Updated by Tobias Brunner over 2 years ago

  • Status changed from Feedback to Closed
  • Resolution set to Fixed

Also available in: Atom PDF