
Issue #1220

Random packet loss using AES

Added by G B almost 10 years ago. Updated about 8 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
kernel
Affected version:
Resolution:
No change required

Description

Hi,

I'm running many strongSwan 5.3.* instances on CentOS (kernels 3.16.7, 4.2.6, 4.1.*) on AWS to terminate VPNs between each other and/or to other devices across the Internet.
While investigating some application issues, I noticed that on every machine I get random packet loss (from 1% to 4% over 100 to 300 requests sent) when sending ICMP echoes across the tunnel.
This only happens when both of the following conditions are met: (a) AES encryption is used, (b) the IP packet size is shorter than about (150+8+20) bytes.

In tcpdump I can see all packets (requests and replies) being "sent", but on the destination server (on the same "LAN") they are not received... it's as if the packet is lost before it is actually serialized onto the network by the Xen NIC driver.
Pinging from the gateway itself always works fine, though.

The Internet path between the routers has no loss of any sort.
The path between the "gateway" and the internal hosts has no loss of any sort.
I have tried different kernels and different instance sizes, I've enabled/removed the AES-NI plugin, and all that sort of thing.

Does anybody have an idea where the issue might be?

Thanks,
Gabriele

t2_lspci.txt (3.25 KB) - t2 lspci - Davide Del Grande, 29.03.2017 13:51
m4_lspci.txt (3.89 KB) - m4 lspci - Davide Del Grande, 29.03.2017 13:51

History

#1 Updated by Tobias Brunner almost 10 years ago

  • Status changed from New to Feedback

Obviously, IP is potentially lossy. How can you be sure there are no packet losses on any of the links? Due to the integrity and replay checks applied with IPsec it might just be more obvious that something occasionally goes wrong. Adding virtualization into the mix certainly doesn't help either.

If you can't reproduce this reliably it's really hard to tell what the reason could be (kernel, XEN, hardware, network...).

#2 Updated by G B almost 10 years ago

Hi Tobias,

I can reproduce it on every instance, using different accounts, different kernels, different instance types... I've even tried an Amazon Linux instance, with the same result (it uses kernel 4.1.*).
Pinging the same remote host side by side with 1300-byte and 32-byte payloads shows loss only on the 32-byte ones... (tcpdump always shows that all packets are being sent, though).
What's the "low-level" difference between using AES and 3DES/NULL? Most likely none... so I'm really at a loss as to why I see this behaviour (I don't know when it started, either).

Thanks,
Gabriele

#3 Updated by Tobias Brunner almost 10 years ago

I can reproduce it on every instance, using different accounts, different kernels, different instance types... I've even tried an Amazon Linux instance, with the same result (it uses kernel 4.1.*).
Pinging the same remote host side by side with 1300-byte and 32-byte payloads shows loss only on the 32-byte ones... (tcpdump always shows that all packets are being sent, though).

Any other traffic on the same IPsec SA? Or just two ping commands sending their ICMP echo requests (with defaults, other than the size)? I can't reproduce this in our testing environment with a 4.1 or 4.2 kernel. Do you see any failures in ip -s xfrm state (on either host)? You could also check /proc/net/xfrm_stat for errors (only available if your kernels were built with CONFIG_XFRM_STATISTICS).
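A quick sketch of those two checks as shell commands (what shows up depends on the kernel configuration; counter availability and names come from the stock iproute2/XFRM output):

```shell
# Per-SA counters: the "stats:" section of `ip -s xfrm state` includes
# replay-window and integrity-failure counts for each SA.
sa_stats=$(ip -s xfrm state 2>/dev/null | grep -A 2 'stats:')

# Global XFRM error counters; /proc/net/xfrm_stat only exists when the
# kernel was built with CONFIG_XFRM_STATISTICS=y. Show non-zero counters.
if [ -r /proc/net/xfrm_stat ]; then
    xfrm_errors=$(awk '$2 != 0' /proc/net/xfrm_stat)
else
    xfrm_errors="(CONFIG_XFRM_STATISTICS not enabled)"
fi

echo "SA stats: ${sa_stats:-none found}"
echo "XFRM errors: ${xfrm_errors:-none}"
```

Run it on both gateways; a rising error counter on one side would point at where the small packets disappear.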

What's the "low-level" difference between using AES and 3DES/NULL? Most likely none... so I'm really at a loss as to why I see this behaviour (I don't know when it started, either).

I guess the timing is different. Does it happen with AES-GCM too?

Anyway, you might want to ask the kernel devs for their opinion (netdev mailing list).

#4 Updated by G B almost 10 years ago

Hi Tobias,

I've built a small "test lab" with just two such routers plus a third host used to ping the remote end... I can reproduce the issue with both kernel 3.16.7 and 4.2.6.

There is no visible error shown by either of those two commands.
I've also tried libreswan/Openswan (with PSK, since setting up certificates there is not as straightforward as in strongSwan), as suggested by AWS, with the same results... I'll check with the kernel developers to see if they can help.

By the way, using AES-GCM [ike=aes256gcm16-aesxcbc-modp2048!, esp=aes256gcm16-modp2048!] there is no packet loss.
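For reference, those proposals would sit in an ipsec.conf conn section roughly like this (a sketch only; the connection name, addresses and subnets are placeholders, not taken from this setup):

```
conn aws-to-aws
        keyexchange=ikev2
        ike=aes256gcm16-aesxcbc-modp2048!
        esp=aes256gcm16-modp2048!
        left=%defaultroute
        leftsubnet=10.1.0.0/16
        right=203.0.113.10
        rightsubnet=10.2.0.0/16
        auto=start
```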

I'll try to test with kernel-libipsec+kernel-netlink as well.

Gabriele

#5 Updated by G B almost 10 years ago

Hi Tobias,

I've tested kernel-libipsec+kernel-netlink with AES128/SHA1 and there is no packet loss (even with an aggressive "ping -c 1000 -i 0 10.1.255.100").

Gabriele

#6 Updated by J D over 9 years ago

We can reproduce consistently. All of these need to be in place for the problem to occur:

- AES-CBC encryption
- kernel-netlink
- Packet sizes under 220 bytes (pre-encapsulation)
- Amazon paravirtualized drivers (SR-IOV is unaffected)
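The size dependence described above can be probed with a small ping sweep across the tunnel (a sketch: TARGET is a placeholder for a host behind the remote end; it defaults to localhost only so the snippet runs anywhere, and -c should be raised for reliable loss figures):

```shell
# Sweep ICMP payload sizes and report the loss percentage for each.
# TARGET is a placeholder: point it at the remote test host to reproduce.
TARGET=${TARGET:-127.0.0.1}

for size in 32 64 128 192 256 512 1300; do
    # Extract the "X% packet loss" field from the ping summary line.
    loss=$(ping -c 5 -i 0.2 -s "$size" "$TARGET" 2>/dev/null \
           | awk -F', ' '/packet loss/ {print $3}')
    printf '%5s bytes: %s\n' "$size" "${loss:-no reply}"
done
```

With the bug present, the small sizes should show loss while the 1300-byte pings stay clean.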

#7 Updated by Davide Del Grande over 8 years ago

Hello,

I can confirm this bug, presumably in AWS/Xen.
I tried on AWS "t2.*" instances, with HVM (not PV - paravirtualized) Ubuntu 16.04 64-bit, same AWS region and AZ.
I briefly tried with an m4.large instance, and apparently the bug did not show up.

The underlying and emulated hardware (and presumably the drivers) are different (see the attached lspci outputs); maybe it's related?

As a workaround, I tried enabling libipsec in netlink-libipsec.conf - but I fear I did not do it correctly, because:
- I did not see a tun/ipsec interface in ifconfig
- the bug is still present.
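For reference, on builds configured with --enable-kernel-libipsec the userspace path is normally toggled in the plugin's strongswan.conf section - a sketch (the exact file layout varies by distribution; when the plugin is active, an ipsec0 TUN device should show up):

```
# strongswan.conf fragment (often split into strongswan.d/charon/*.conf)
charon {
    plugins {
        kernel-libipsec {
            # process ESP in userspace instead of the kernel's XFRM stack
            load = yes
        }
    }
}
```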

I saw another reference to the same bug: [[https://forum.vyos.net/showthread.php?tid=26931]]

It's strange, though: I have a VPN on an identical machine (t2.micro), same AWS region/AZ, same OS/kernel/libs/strongSwan config.
There, the problem seems not to be present - but the other peer is not strongSwan; it is a telco Juniper located on the Internet.
How can I check whether some minor VPN parameters are negotiated differently (logs/diagnostics)?

Thank you.
PS: I have this "testing" infrastructure that I can use to reproduce/test, if needed.

Bye

#8 Updated by Davide Del Grande over 8 years ago

Hello!

Good News!!

My test setup is:
node-1 === StrongSwan-A ==== AWS-IGW (Internet) AWS-IGW ==== StrongSwan-B === node-2

node-1: Windows 2012R2, t2 instance
node-2: Ubuntu 14.04 LTS, t2 instance
StrongSwan-A: Ubuntu 16.04 LTS (kernel 4.4.x, StrongSwan 5.3.5), t2 instance
StrongSwan-B: Ubuntu 16.04 LTS (kernel 4.4.x, StrongSwan 5.3.5), t2 instance

With this setup, ping -f node-1 -c 100000 from node-2 gives 8~9 % packet loss.

No packet loss with a ping payload size (-s) >= 147.

AWS AZ (1c or 1a tested) does not matter.

Then I alternately switched StrongSwan-A or StrongSwan-B to an "m4.large" instance.
-> Packet loss dropped to 3~4 %

So I switched BOTH to m4.large -> Packet Loss dropped to 0% !!
This might be a (costly) workaround for those who cannot apply the following solution:

I then switched BOTH StrongSwans back to "t2" instances (back to 8~9 % packet loss).

I upgraded StrongSwan-B to Ubuntu 16.10 (kernel 4.8.x, StrongSwan 5.3.5).
Same 8~9 % packet loss.

Then, quite hopelessly, I upgraded StrongSwan-B to 17.04 beta (Zesty Zapus):
Kernel 4.10.x and StrongSwan 5.5.1

Packet Loss halved to 3~4 % !!!

Brought StrongSwan-A back to "m4.large" (I cannot easily revert this instance, so no OS upgrade was possible).

ZERO PACKET LOSS :D

So the problem is either something in kernels < 4.10 (Xen drivers?) or in StrongSwan 5.3.5.

The next step might be to try StrongSwan 5.5.1 on Ubuntu 16.04, or 5.3.5 on kernel 4.10 --- any volunteer? ;P

HTH, Bye!

#9 Updated by Noel Kuntze over 8 years ago

So the problem is either something in Kernel < 4.10 (xen drivers?) or in StrongSwan version 5.3.5

strongSwan doesn't process any data packets (unless you use libipsec, but then you wouldn't have that problem with the dropped packets [1]). The kernel does. So it's the kernel.

[1] https://wiki.strongswan.org/issues/1220#note-5

#10 Updated by Noel Kuntze over 8 years ago

  • Category set to kernel
  • Affected version deleted (5.3.5)

#11 Updated by Davide Del Grande over 8 years ago

Yep, confirmed.

I recompiled strongSwan 5.5.1 for Ubuntu 16.04 and the problem is still there.
I installed kernel 4.10 ([[http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/]]) and got 0% packet loss (the strongSwan version is irrelevant).

Bye!

#12 Updated by Noel Kuntze over 8 years ago

  • Status changed from Feedback to Closed
  • Resolution set to No change required

I guess this is resolved then.

#13 Updated by Philip D'Ath over 8 years ago

Noel Kuntze wrote:

I guess this is resolved then.

I am still experiencing this same issue with kernel 4.4.0-72-generic and strongswan 5.3.5-1ubuntu3.1 on Ubuntu 16.04.2 in Amazon AWS. The behaviour is the same - it only happens when AES is used on t2 instances. It does not happen when using 3DES, or on an m4 instance.

However it does not happen on all of the instances we are running - only some. I have spent ages looking at this, and I think I have found what it might be.

If I do a "cat /proc/cpuinfo", I find that instances reporting CPU microcode version 0x2b get the intermittent loss of small packets, while instances running CPU microcode version 0x36 work perfectly (AES on t2).

I think this could be related to a CPU microcode bug.
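That check can be scripted as follows (a sketch; the microcode line only appears in /proc/cpuinfo on x86):

```shell
# Print the CPU microcode revision the instance reports; per the
# observation above, 0x2b showed the loss and 0x36 did not.
if [ -r /proc/cpuinfo ]; then
    microcode=$(awk '/^microcode/ {print $3; exit}' /proc/cpuinfo)
fi
echo "CPU microcode revision: ${microcode:-unknown}"
```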

#14 Updated by Davide Del Grande about 8 years ago

Philip,

I confirm this bug is still present up to and including kernel 4.8 (using the "HWE" kernel stack of Ubuntu 16.04.2) on t2.* instances.

If you manually upgrade the kernel to 4.10.x, it is resolved (fixed microcode in the kernel??)
I'm using 4.10.17 in production with no packet loss.

Drawback: you won't get automatic security updates for the kernel. This is not an issue for us, since the instance is restricted via Security Groups.

BTW: does this wiki have a way to send email notifications for new posts? I only came back here by accident...

#15 Updated by Tobias Brunner about 8 years ago

BTW: does this wiki have a way to send email notifications for new posts? I only came back here by accident...

That's what the "Watch" functionality is for.

#16 Updated by Noel Kuntze about 8 years ago

Davide Del Grande wrote:

Drawbacks: you won't have automatic security updates for kernel. This is not an issue for us, since instance is restricted via Security Groups.

That doesn't improve security by any amount worth mentioning. Attacks are primarily executed at the application level by exploiting the service, and then root or privileged access is attained in some way (via userspace or kernel vulnerabilities). Just having SGs doesn't cut it in any way, shape or form.

#17 Updated by Davide Del Grande about 8 years ago

Tobias Brunner wrote:

That's what the "Watch" functionality is for.

lol so easy I didn't see it! ty!

Noel Kuntze wrote:

That doesn't improve security by any amount worth mentioning. Attacks are primarily executed at the application level by exploiting the service, and then root or privileged access is attained in some way (via userspace or kernel vulnerabilities). Just having SGs doesn't cut it in any way, shape or form.

I just meant that, via SGs, we limited the peers that can talk to our VPN server to a single Internet IP - this at least helps limit the exposure.
That of course does not mean the machine is not updated; it is just updated manually (= less often than automatic apt updates). But that's off-topic, I think :D

#18 Updated by Terry Wang about 8 years ago

Davide Del Grande wrote:

Philip,

I confirm this bug is still present up to and including kernel 4.8 (using the "HWE" kernel stack of Ubuntu 16.04.2) on t2.* instances.

If you manually upgrade the kernel to 4.10.x, it is resolved (fixed microcode in the kernel??)
I'm using 4.10.17 in production with no packet loss.

Drawback: you won't get automatic security updates for the kernel. This is not an issue for us, since the instance is restricted via Security Groups.

BTW: does this wiki have a way to send email notifications for new posts? I only came back here by accident...

I can also confirm the same: the xen-netfront bug (upstream kernel patch: https://patchwork.kernel.org/patch/9338979/) appears to have been backported into the 4.10.0 series kernel.

Both sides are:

- Ubuntu 16.04 (on AWS EC2 t2.micro)
- strongSwan 5.5.3 (build from source)
- Linux 4.10.0-26-generic
- ike=aes256-sha256-modp2048!
- esp=aes256-sha256-modp2048!

On 4.4 and 4.8, the packet loss rate was normally 1% - 8%.

$ ping -c 100 -i 1 192.168.1.200
PING 192.168.1.200 (192.168.1.200) 56(84) bytes of data.
64 bytes from 192.168.1.200: icmp_seq=1 ttl=62 time=170 ms
64 bytes from 192.168.1.200: icmp_seq=2 ttl=62 time=170 ms
...
...
64 bytes from 192.168.1.200: icmp_seq=99 ttl=62 time=170 ms
64 bytes from 192.168.1.200: icmp_seq=100 ttl=62 time=170 ms

--- 192.168.1.200 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 99062ms
rtt min/avg/max/mdev = 170.685/170.982/172.749/0.611 ms

Now it's good ;-)