Bug #757

Charon crash under high load

Added by richard hu over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: charon
Target version:
Start date: 03.11.2014
Due date:
Estimated time:
Affected version: 5.2.1
Resolution: Fixed

Description

We encountered a strongSwan server crash on a production system with more than 300 users online.
Here is the ipsec frontend output:

/var/log/strongswan# ipsec start --nofork
Starting strongSwan 5.2.1dr1 IPsec [starter]...
charon (12295) started after 40 ms
*** buffer overflow detected ***: /usr/lib/ipsec/charon terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7fc6a44c2e67]
/lib/x86_64-linux-gnu/libc.so.6(+0x109d60)[0x7fc6a44c1d60]
/lib/x86_64-linux-gnu/libc.so.6(+0x10ae1e)[0x7fc6a44c2e1e]
/usr/lib/ipsec/libradius.so.0(+0x3002)[0x7fc69ede2002]
/usr/lib/ipsec/libradius.so.0(+0x3731)[0x7fc69ede2731]
/usr/lib/ipsec/plugins/libstrongswan-eap-radius.so(+0x3a97)[0x7fc69efe9a97]
/usr/lib/ipsec/plugins/libstrongswan-xauth-eap.so(+0xe2c)[0x7fc69e9d9e2c]
/usr/lib/ipsec/libcharon.so.0(+0x51cb1)[0x7fc6a49e5cb1]
/usr/lib/ipsec/libcharon.so.0(+0x482e3)[0x7fc6a49dc2e3]
/usr/lib/ipsec/libcharon.so.0(+0x26faf)[0x7fc6a49bafaf]
/usr/lib/ipsec/libcharon.so.0(+0x214a7)[0x7fc6a49b54a7]
/usr/lib/ipsec/libstrongswan.so.0(+0x2b733)[0x7fc6a4e3b733]
/usr/lib/ipsec/libstrongswan.so.0(+0x3aea0)[0x7fc6a4e4aea0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fc6a477ee9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc6a44ac31d]
======= Memory map: ========

We focused on the FreeRADIUS settings in strongswan.conf and found an interesting issue:
If we set sockets = 20 and monitor UDP socket usage with ss -s, the count increases from 20 to 40 as user load grows and then stops.
If we set sockets = 1000, ss -s shows UDP usage going from 1000 to 1040, and then strongSwan crashes.
From looking at the code, the sockets parameter defines a socket pool for the RADIUS client and acts as an upper limit: requests beyond that have to wait until a socket in use is freed.
How does this affect charon's stability?

Also, for a core dump issue like this, do you have a good way to debug it, or any suggestions for stability under high load?

Associated revisions

Revision f6f3b0db
Added by Martin Willi over 5 years ago

Merge branch 'poll'

Replace relevant uses of select() by poll(). poll(2) avoids the difficulties
we have with more than 1024 open file descriptors, and seems to be fairly
portable.

Fixes #757.

History

#1 Updated by richard hu over 5 years ago

Another clue: under high load, "ipsec status all" hangs and prints no output.

#2 Updated by Tobias Brunner over 5 years ago

  • Description updated
  • Status changed from New to Feedback
  • Priority changed from High to Normal

Also, for a core dump issue like this, do you have a good way to debug it, or any suggestions for stability under high load?

A stack trace with debug symbols would help (for example from a core file and GDB, or by starting with --attach-gdb instead of --nofork), so we can see exactly where this buffer overflow happens. That assumes the crash is reproducible; otherwise, try to resolve the addresses and offsets via addr2line.

#3 Updated by richard hu over 5 years ago

Regarding the charon.plugins.eap-radius.sockets parameter in strongswan.conf: we found that when it is set to more than 500, charon crashes under high load.
We also tested various RADIUS socket settings today. With values like 20/100/200, the total number of UDP sockets increases to about 41/201/401 and then stays there without problems. With a larger value (>500), strongSwan crashes once the number of UDP sockets exceeds 1000.
Do you know of any limitations here?

BTW: for issues inside a shared library (.so), it's hard to use addr2line.

#4 Updated by Martin Willi over 5 years ago

Please provide a backtrace with debug symbols, either by inspecting the dumped core file or by attaching a debugger, as indicated by Tobias.

Regarding the charon.plugins.eap-radius.sockets parameter in strongswan.conf: we found that when it is set to more than 500, charon crashes under high load.

I don't think having more than 500 sockets makes any sense for a total of 300 users online. One socket is required for each user that is authenticated over RADIUS in parallel; it is needed only for the RADIUS authentication step, not for as long as the user is online. You need 500 sockets if you plan to handle 500 simultaneous RADIUS authentication sessions, i.e. you have 500 users connecting simultaneously and you want to authenticate them all in parallel. But it's not unlikely that you hit some system limit with that many sockets, so I highly recommend using a smaller number of sockets.

Regards
Martin

#5 Updated by richard hu over 5 years ago

I tried to use "ipsec start --attach-gdb" to debug the issue, but found that I cannot get my own compiled .so to work.
-------------------
Here is what I did:
The environment is Ubuntu.
The previously working strongSwan was installed via apt-get and works well:
apt-get install libstrongswan strongswan strongswan-ike strongswan-ikev1 strongswan-ikev2 strongswan-plugin-eap-md5 strongswan-plugin-eap-radius strongswan-plugin-openssl strongswan-plugin-unity strongswan-plugin-xauth-eap strongswan-plugin-xauth-generic strongswan-plugin-xauth-noauth strongswan-starter

To debug at the code level, I downloaded the same version (5.2.1) to my working directory
and ran configure and make all:
./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib --with-ipsecdir=/usr/lib/ipsec --with-strongswan-conf=/etc/strongswan.conf --enable-eap-radius --enable-xauth-eap --enable-eap-identity --enable-eap-mschapv2 --enable-md4 --enable-md5 --enable-openssl --enable-pkcs11 --enable-blowfish --enable-agent --enable-eap-md5 --enable-eap-peap --enable-eap-tls
Then I copied the compiled .libs/libstrongswan-xauth-eap.so to /usr/lib/ipsec/plugins/libstrongswan-xauth-eap.so
and ran ipsec start --attach-gdb.
But it stopped at:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fcf4b5f1700 (LWP 2880)]
0x00007fcf4d3fec08 in process (this=0x7fcf080037d0, in=<optimized out>, out=<optimized out>) at xauth_eap.c:225
225 name = lib->settings->get_str(lib->settings,

Could you tell me whether there is a configuration problem in my steps? How can I use my own compiled .so in this environment?

#6 Updated by richard hu over 5 years ago

I finally got the above steps to work. I don't know the exact reason, but I can debug now.

After putting the server under load, the crash backtrace is:
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdc184b0700 (LWP 53752)]
0x00007fdc256e70d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fdc256e70d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fdc256ea83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fdc2572504e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fdc257bbe67 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007fdc257bad60 in __chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007fdc257bbe1e in __fdelt_warn () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007fdc1fed1002 in request (this=0x7fdc27ed3650, request=0x7fdbd4093890) at radius_socket.c:186
#7 0x00007fdc1fed1731 in request (this=0x7fdbd4091470, req=0x7fdbd4093890) at radius_client.c:100
#8 0x00007fdc200d851f in initiate (this=0x7fdbd4092b20, out=0x7fdc184afb98) at eap_radius.c:241
#9 0x00007fdc1fac8c53 in verify_eap (backend=0x7fdbd4092b20, this=0x7fdbdc0a05f0) at xauth_eap.c:119
#10 process (this=0x7fdbdc0a05f0, in=<optimized out>, out=<optimized out>) at xauth_eap.c:241
#11 0x00007fdc25cded21 in process_i (this=0x7fdbdc0a19f0, message=<optimized out>) at sa/ikev1/tasks/xauth.c:472
#12 0x00007fdc25cd5273 in process_response (message=0x7fdc0c000970, this=0x7fdbf00aea10) at sa/ikev1/task_manager_v1.c:1212
#13 process_message (this=0x7fdbf00aea10, msg=0x7fdc0c000970) at sa/ikev1/task_manager_v1.c:1475
#14 0x00007fdc25cb3faf in process_message (this=0x7fdbf00ae5a0, message=0x7fdc0c000970) at sa/ike_sa.c:1268
#15 0x00007fdc25cae4a7 in execute (this=0x7fdc0c000ef0) at processing/jobs/process_message_job.c:74
#16 0x00007fdc261347a3 in process_job (worker=0x7fdc27fe0670, this=0x7fdc27e8f130) at processing/processor.c:235
#17 process_jobs (worker=0x7fdc27fe0670) at processing/processor.c:321
#18 0x00007fdc26143f10 in thread_main (this=0x7fdc27fe06a0) at threading/thread.c:312
#19 0x00007fdc25a77e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fdc257a531d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x0000000000000000 in ?? ()

It crashed at line 186 of radius_socket.c:
--->
FD_ZERO(&fds);
FD_SET(*fd, &fds);
res = select((*fd) + 1, &fds, NULL, NULL, &tv);

The crash is at the FD_SET line.
FD_SETSIZE on Linux is 1024, and when the crash happened, "ss -s" showed 1022 UDP sockets.

Should we add a protective check against FD_SETSIZE for this?
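
A minimal sketch of what such a protective check could look like, assuming a plain POSIX environment. The helper names fd_fits_in_fd_set() and wait_readable_select() are hypothetical and are not part of strongSwan; the sketch only illustrates refusing descriptors that do not fit into an fd_set, so that FD_SET() is never reached with an out-of-range value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not strongSwan API): can fd be stored in an fd_set? */
static bool fd_fits_in_fd_set(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}

/*
 * Hypothetical wrapper: wait up to timeout_ms for fd to become readable.
 * Returns the result of select() (1 if readable, 0 on timeout, -1 on
 * error). If fd does not fit into an fd_set, returns -1 with errno set to
 * EINVAL instead of letting FD_SET() write out of bounds or abort.
 */
static int wait_readable_select(int fd, int timeout_ms)
{
	fd_set fds;
	struct timeval tv;

	assert(timeout_ms >= 0);
	if (!fd_fits_in_fd_set(fd))
	{
		errno = EINVAL;
		return -1;
	}
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;
	FD_ZERO(&fds);
	FD_SET(fd, &fds);
	return select(fd + 1, &fds, NULL, NULL, &tv);
}
```

A check like this only turns the abort into an error the caller can handle; requests on descriptors at or above FD_SETSIZE would still fail, so it does not lift the 1024 limit itself.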

And what is the best value for charon.plugins.eap-radius.sockets under high load?
20 is safe, but it is very slow because many users have to wait in the queue for authentication.

#7 Updated by Martin Willi over 5 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Tobias Brunner to Martin Willi

Hi,

FD_SETSIZE on Linux is 1024, and when the crash happened, "ss -s" showed 1022 UDP sockets.

With that many file descriptors, you hit the limit of FD_SETSIZE. Basically having a ulimit -n higher than your FD_SETSIZE is not really supported in strongSwan.

A work-around is to use fewer RADIUS sockets; a few hundred should probably work, and you only need as many sockets as the number of clients you want to authenticate over RADIUS in parallel.

In the long term, we'll need to replace that unsafe use of fd_set with something more appropriate. As we have many other file descriptors across the process, we need to fix that everywhere we use fd_set.

Regards
Martin

#8 Updated by Martin Willi over 5 years ago

  • Tracker changed from Issue to Bug

#9 Updated by Martin Willi over 5 years ago

It seems that using a dynamically allocated buffer for fd_set is not really an option, as FORTIFY_SOURCE checks against FD_SETSIZE.

Instead, we probably should migrate most select(2) users to poll(2).

Please try the new poll branch; it uses poll(2) and should be capable of handling more than 1024 file descriptors.

Regards
Martin
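
For illustration, the select(2)-to-poll(2) migration described above might look roughly like the following for a single descriptor. This is a sketch under assumptions, not the code from the poll branch: the helper names wait_readable_poll() and read_with_timeout() are hypothetical, and read() stands in for whatever receive call the real code uses.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. struct pollfd stores
 * the descriptor as a plain int, so there is no FD_SETSIZE bound and
 * descriptors of 1024 and above can be used.
 */
static int wait_readable_poll(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int res;

	assert(fd >= 0);
	res = poll(&pfd, 1, timeout_ms);
	if (res <= 0)
	{
		return res;
	}
	if (pfd.revents & POLLIN)
	{
		return 1;
	}
	/* POLLERR, POLLHUP or POLLNVAL was reported without readable data */
	errno = EIO;
	return -1;
}

/*
 * Hypothetical helper: read from fd, giving up after timeout_ms.
 * Returns the number of bytes read, or -1 with errno set (ETIMEDOUT
 * if nothing became readable in time).
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len,
								 int timeout_ms)
{
	int res = wait_readable_poll(fd, timeout_ms);

	if (res == 0)
	{
		errno = ETIMEDOUT;
		return -1;
	}
	if (res < 0)
	{
		return -1;
	}
	return read(fd, buf, len);
}
```

Compared with the fd_set approach, the descriptor value is never used as a bit index, which is why the _FORTIFY_SOURCE check against FD_SETSIZE no longer applies.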

#10 Updated by richard hu over 5 years ago

It seems the poll branch was created very recently.
I guess it's not stable enough for production, right?
Will it be in the next release, e.g. 5.3.0 or 5.4.0?

#11 Updated by Martin Willi over 5 years ago

  • Status changed from Assigned to Feedback

Hi,

I created these changes yesterday because of your issue. They certainly need some more testing, and you are very welcome to help with that.

If we don't find any major issues with that branch, the changes will be integrated into the next release.

Regards
Martin

#12 Updated by richard hu over 5 years ago

Thanks Martin, we can help test the new branch once you have wrapped it up.

BTW: epoll seems more capable and more stable than poll. Why not try epoll?

#13 Updated by Martin Willi over 5 years ago

Hi,

epoll is Linux-specific and not portable. poll is specified in POSIX.1-2001 and widely available; on Windows we can use WSAPoll().

Regards
Martin
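
As a rough illustration of the portability argument above, a thin wrapper could select between poll() and WSAPoll() at compile time. The names portable_pollfd_t and portable_poll() are hypothetical and are not taken from strongSwan; the Windows branch also assumes that Winsock has already been initialized with WSAStartup() elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
# include <winsock2.h>
/* WSAPOLLFD has the same fd/events/revents members as struct pollfd */
typedef WSAPOLLFD portable_pollfd_t;
#else
# include <poll.h>
typedef struct pollfd portable_pollfd_t;
#endif

/*
 * Hypothetical wrapper: wait for events on nfds descriptors for up to
 * timeout_ms milliseconds (negative means wait indefinitely). Returns
 * the number of entries with non-zero revents, 0 on timeout, or a
 * negative value on error, as poll() and WSAPoll() both do.
 */
static int portable_poll(portable_pollfd_t *fds, unsigned long nfds,
						 int timeout_ms)
{
	assert(fds != NULL || nfds == 0);
#ifdef _WIN32
	return WSAPoll(fds, (ULONG)nfds, timeout_ms);
#else
	return poll(fds, (nfds_t)nfds, timeout_ms);
#endif
}
```

An epoll-based variant would need a separate implementation per platform, which is the portability cost mentioned above.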

#14 Updated by Martin Willi over 5 years ago

  • Status changed from Feedback to Closed
  • Target version set to 5.2.2
  • Resolution set to Fixed

Fixed with the associated merge commit.
