Bug #3634

Deadlock because of resolver thread

Added by Frédéric Martinsons 12 days ago. Updated 2 days ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: libstrongswan
Target version: 5.9.2
Start date:
Due date:
Estimated time:
Affected version: 5.0.2
Resolution:

Description

Hello,

I experienced a deadlock when initiating an IKE SA while DNS resolution of the remote peer fails.
Below is my analysis of the problem:

The IKE initiation (when done through the retry-initiate job) holds the IKE_SA (via checkout), and inside the initiate method there is a DNS resolution.
That brings us into the resolve method of host_resolver.c: it creates a thread that performs a blocking getaddrinfo() while the calling thread waits on a pthread condition to get the result back.
My problem is that the thread creation fails (thread_create returns NULL). I haven't found the exact root cause of this on my system (maybe a limit was reached), but it ends with the resolve method waiting for a pthread condition signal that will never come.
Hence the deadlock.
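
For illustration, here is a minimal, self-contained sketch of the pattern described above (this is not strongSwan's actual host_resolver.c: it creates one worker per query instead of using a pool and query queue, and names are simplified). The deliberately missing error check on pthread_create() reproduces the hang:

    /* Simplified illustration of the deadlock pattern (not strongSwan's actual
     * host_resolver.c: one worker per query, no thread pool, no queue). */
    #include <netdb.h>
    #include <pthread.h>
    #include <sys/socket.h>

    typedef struct {
        const char *name;          /* host name to resolve */
        struct addrinfo *result;   /* filled in by the worker thread */
        int done;                  /* set once the worker finished */
        pthread_mutex_t mutex;
        pthread_cond_t cond;
    } query_t;

    /* worker thread: blocking getaddrinfo(), then wake up the waiting caller */
    static void *resolve_thread(void *arg)
    {
        query_t *q = arg;
        struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM };

        getaddrinfo(q->name, NULL, &hints, &q->result);

        pthread_mutex_lock(&q->mutex);
        q->done = 1;
        pthread_cond_signal(&q->cond);
        pthread_mutex_unlock(&q->mutex);
        return NULL;
    }

    struct addrinfo *resolve(const char *name)
    {
        pthread_t tid;
        query_t q = { .name = name };

        pthread_mutex_init(&q.mutex, NULL);
        pthread_cond_init(&q.cond, NULL);

        /* BUG (the one described above): the return value is ignored.  If the
         * thread cannot be created, nobody ever signals q.cond and the wait
         * below blocks forever while the IKE_SA stays checked out. */
        pthread_create(&tid, NULL, resolve_thread, &q);

        pthread_mutex_lock(&q.mutex);
        while (!q.done)
        {
            pthread_cond_wait(&q.cond, &q.mutex);
        }
        pthread_mutex_unlock(&q.mutex);
        pthread_join(tid, NULL);

        pthread_cond_destroy(&q.cond);
        pthread_mutex_destroy(&q.mutex);
        return q.result;   /* caller frees the result with freeaddrinfo() */
    }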

I attached a patch to fix this. Please tell me if it suits you and whether my analysis of the problem is correct.

Thanks in advance.

History

#1 Updated by Frédéric Martinsons 12 days ago

From what I see in git, this behavior has been present since at least 5.1.0.

#2 Updated by Tobias Brunner 12 days ago

  • Tracker changed from Issue to Bug
  • Status changed from New to Feedback
  • Target version set to 5.9.2
  • Affected version changed from 5.8.1 to 5.0.2

My problem is that the thread creation fails (thread_create returns NULL). I haven't found the exact root cause of this on my system (maybe a limit was reached), but it ends with the resolve method waiting for a pthread condition signal that will never come.

If creating new threads is a problem, did you try increasing charon.host_resolver.min_threads? Unused threads are destroyed after no new query has been queued for 30 seconds.
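
For reference, that setting lives in the charon.host_resolver section of strongswan.conf; a minimal excerpt (the value is only an example, not a recommendation):

    # strongswan.conf excerpt: keep a few resolver threads around permanently
    charon {
        host_resolver {
            min_threads = 3
        }
    }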

Anyway, I think we just need to make sure there is at least one resolver thread (and do some cleaning up if not). See the patch in the 3634-resolver-threads branch.
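
To illustrate the idea on the simplified sketch from the description above (the actual patch in the 3634-resolver-threads branch works on the resolver's thread pool and query queue, so this is only an approximation of the approach, not the patch itself):

    /* Hypothetical guard on the simplified resolve() above: if no worker
     * thread could be created, clean up and fail the lookup instead of
     * waiting on a condition that will never be signalled. */
    if (pthread_create(&tid, NULL, resolve_thread, &q) != 0)
    {
        pthread_cond_destroy(&q.cond);
        pthread_mutex_destroy(&q.mutex);
        return NULL;   /* resolution fails, but the caller does not deadlock */
    }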

#3 Updated by Frédéric Martinsons 12 days ago

I didn't use min_threads, and my retry-initiate interval is set to 30s (so there are requests every 30s). I also noticed (by adding a trace in this thread) that the thread ID increases indefinitely; does this mean that resolver threads are not destroyed?
I must admit that I don't know why increasing min_threads would help me here, can you please elaborate?

I have a setup which reproduces the problem pretty easily (2 or 3 hours of failed resolutions). I'll test your commit against this setup and tell you as soon as possible.

#4 Updated by Frédéric Martinsons 9 days ago

I let the setup run the whole weekend without any issues. Thank you for the commit.

#5 Updated by Tobias Brunner 9 days ago

I didn't use min_threads, and my retry-initiate interval is set to 30s (so there are requests every 30s).

If creating a resolver thread initially worked, this does not help as it might be gone when another resolution is necessary later. And since it takes slightly more than 30s for the retry to start (scheduling the job etc.) that won't help either.

I also noticed (by adding a trace in this thread) that the thread ID increases indefinitely; does this mean that resolver threads are not destroyed?

Trace where exactly? And does that mean threads are actually created successfully and fail only later somehow?

I must admit that I don't know why increasing min_threads would help me here, can you please elaborate?

Unless thread creation fails the very first time already (which is not likely as threads are required for the proper functioning of the daemon anyway), this would keep at least some resolver threads running for future resolutions.

I let the setup run the whole weekend without any issues. Thank you for the commit.

OK, great. But do resolutions fail? Or did you also increase min_threads?

#6 Updated by Frédéric Martinsons 9 days ago

I also noticed (by adding a trace in this thread) that the thread ID increases indefinitely; does this mean that resolver threads are not destroyed?

Trace where exactly? And does that mean threads are actually created successfully and fail only later somehow?

I added traces in the resolver thread, just after the getaddrinfo() call. Yes, thread creation failed after several hours of DNS errors.

OK, great. But do resolutions fail? Or did you also increase min_threads?

Yes, my setup forces resolution errors, and no, I left min_threads at its default value (0).

#7 Updated by Tobias Brunner 9 days ago

I also noticed (by adding a trace in this thread) that the thread ID increases indefinitely; does this mean that resolver threads are not destroyed?

Trace where exactly? And does that mean threads are actually created successfully and fail only later somehow?

I added traces in the resolver thread, just after the getaddrinfo() call.

I see. If you see different thread IDs there, new threads are handling the resolution. What was the number when thread creation started to fail? Would be interesting to know what kind of resource limit causes it to fail. On what platform are you trying this?

OK, great. But do resolutions fail? Or did you also increase min_threads?

Yes, my setup forces resolution errors, and no, I left min_threads at its default value (0).

OK, so you make the actual DNS resolution via getaddrinfo() fail. But does it now also fail with the new error message ("no resolver threads") after a few hours?

#8 Updated by Frédéric Martinsons 9 days ago

I added traces in the resolver thread, just after the getaddrinfo() call.

I see. If you see different thread IDs there, new threads are handling the resolution. What was the number when thread creation started to fail? Would be interesting to know what kind of resource limit causes it to fail. On what platform are you trying this?

On a custom Linux distro (built with https://www.yoctoproject.org/) on custom hardware.

OK, so you make the actual DNS resolution via getaddrinfo() fail. But does it now also fail with the new error message ("no resolver threads") after a few hours?

Sorry, I didn't pay attention to the new error messages, I only checked that there was no more deadlock. From what I see, the number of threads (based on the thread ID) can differ (I sometimes see failures after the thread ID reaches 600, and also around 1000).
I'll do another long run tonight and try to get this info.

#9 Updated by Frédéric Martinsons 8 days ago

Hello, unfortunately I didn't manage to reproduce it anymore, although the thread ID went to around 1500 ... I clearly don't understand what is happening on my system there.
Nevertheless, I think the fix you provided is correct and I'll integrate it into our patched strongSwan.

Moreover, I lowered my retry-initiate interval to 25s so that only one resolver thread stays alive when there are consecutive DNS resolution failures.

Thank you for your time.

#10 Updated by Tobias Brunner 8 days ago

Hello, unfortunately I didn't manage to reproduce it anymore, although the thread ID went to around 1500 ... I clearly don't understand what is happening on my system there.
Nevertheless, I think the fix you provided is correct and I'll integrate it into our patched strongSwan.

OK, thanks.

Moreover, I lowered my retry-initiate interval to 25s so that only one resolver thread stays alive when there are consecutive DNS resolution failures.

Note that if resolver threads are busy, more threads will get created unless you reduce charon.host_resolver.max_threads. But if you only have one connection, that's probably fine.
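
A corresponding strongswan.conf excerpt for that cap could look like this (the value just illustrates a single-connection setup):

    # strongswan.conf excerpt: never run more than one resolver thread
    charon {
        host_resolver {
            max_threads = 1
        }
    }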

#11 Updated by Frédéric Martinsons 8 days ago

Yep, I only have one connection.

#12 Updated by Frédéric Martinsons 4 days ago

Do you plan to close this issue by merging the commits you provided, or do you need me for some more testing scenarios?

#13 Updated by Tobias Brunner 2 days ago

Do you plan to close this issue by merging the commits you provided, or do you need me for some more testing scenarios?

Yes, the fix will eventually be included in 5.9.2. No additional tests required.

#14 Updated by Frédéric Martinsons 2 days ago

OK, thanks.
