Version 1 - History - High Availability - strongSwan

{{>toc}}

h1. High Availability

Starting with the upcoming 4.4.0 release, the IKEv2 daemon will experimentally support pseudo active/active High Availability and Load Sharing capabilities using a cluster of (initially) two nodes.

The source code is currently maintained in a separate, not yet public branch. The HA plugin and the required kernel patches will be released with 4.4.0.

h2. Problem statement

The IKEv2/IPsec protocol is not well suited for operation in an active/active cluster. While it is possible to share the state of IKE_SAs over high-speed links in a cluster, sharing the kernel-maintained IPsec ESP SAs is very difficult. Due to the strict sequence numbering of ESP packets, the overhead for synchronizing ESP sequence numbers would be very high.

The IETF ipsecme working group "currently discusses":http://tools.ietf.org/html/draft-nir-ipsecme-ipsecha-00 the problems in more detail and will probably standardize a solution that involves the client in working around them. This will, however, require an extension to the IKEv2 protocol. Clients connecting to a highly available cluster will benefit from this feature only if they support the extension; existing clients (such as the one shipped with Windows 7) will not benefit from these efforts.

h2. Possible approaches

h3. Node to node synchronization

While synchronizing IKE state, including sequence numbers, is realistic between two nodes, exchanging state information of ESP security associations is difficult.

Synchronizing the state for each processed IPsec packet would put a high load on the nodes. Synchronizing after a certain number of packets and/or after a certain timeout would reduce the load, but would make fail-over handling much more difficult: the node taking over has to guess how many packets the failing node processed but could not synchronize before it failed.

Another problem with such an approach is that there is no way of doing load sharing between nodes. An SA is strictly bound to a single node until the event of a failure.

h3. Client to cluster synchronization

Another approach to consider is requesting state information from the client. In the event of a failure, the node taking over can request sequence numbers from the client. But this approach has the same deficiencies as previously discussed. Further, it requires extending the IKE protocol between client and gateway, making existing implementations incompatible with this approach.

h2. Functional specification

strongSwan uses a slightly different approach. Our solution should provide:

* Failure detection: If a node fails due to power loss, hardware failure, kernel oops or daemon crash, the node will be removed from the cluster.
* State synchronization: If a node is removed due to failure or for administrative purposes, the cluster should already have an up-to-date copy of the node's state to take over.
* Take over: Node failure detection and state take over should happen within 1-3 seconds.
* Transparent migration: TCP or application sessions should not be interrupted during take over.
* Load sharing: Load should be shared across all active nodes in a cluster.
* Reintegration: A repaired node can be (re-)added to an existing cluster, taking over a part of the load.
* Legacy clients: No protocol extension is required; any IKEv2 client should benefit from High Availability if connected to a cluster.

Migration of clients to another node does not affect the connection; a client usually does not detect a takeover. This allows a gateway administrator to, for example, remove a node from the load sharing cluster, apply security updates, reboot and reintegrate the node.

h2. Selected solution

The selected solution is based on the idea of "ClusterIP":http://lwn.net/Articles/108078, a Linux kernel module allowing a set of nodes to provide a service under a single virtual IP.

h3. How ClusterIP works

All nodes in a ClusterIP based setup act under a single virtual IP address. The nodes answer ARP requests for this address with a multicast MAC address. This makes the switch forward each packet to every node in the cluster.

Each received packet is assigned to a segment by calculating a hash value over it. In the simplest setup, the source address is hashed, and the hash value modulo the number of segments yields the responsible segment number. Each segment is handled by exactly one node in the cluster.

The node responsible for the packet passes it to the upper layers, while all other nodes drop the packet in the netfilter code. Because the hash is computed over fields that stay constant for a flow, e.g. TCP connections are kept on the same node. If a node fails, a remaining node will take over its segment and process packets for it.
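
The segment selection described above can be sketched as follows. This is only an illustration, not the kernel code: @zlib.crc32@ stands in for the hash ClusterIP really uses, segments are assumed to be numbered from 1, and the names @segment_for@ and @is_responsible@ are hypothetical.

```python
import ipaddress
import zlib


def segment_for(source_ip: str, segment_count: int) -> int:
    """Map a source address to a segment number in 1..segment_count.

    zlib.crc32 is only a placeholder hash for illustration; the real
    ClusterIP module uses its own hash function.
    """
    packed = ipaddress.ip_address(source_ip).packed
    return zlib.crc32(packed) % segment_count + 1


def is_responsible(source_ip: str, segment_count: int, active_segments: set) -> bool:
    """A node passes a packet up only if it serves the packet's segment."""
    return segment_for(source_ip, segment_count) in active_segments
```

With two segments split between two nodes, exactly one node reports responsibility for any given source address, and a node that has taken over both segments accepts every packet.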

h3. IPsec with ClusterIP

While the ClusterIP module itself is not designed to handle IPsec traffic or even to act as a forwarding router, the principle behind ClusterIP can be applied to both. If the IKE daemons in the cluster can synchronize the IKE state and the basic IPsec SA state without sequence numbers, a modified ClusterIP module can do the rest:

For traffic to decrypt, the SPI value of the ESP packet can be included in the hash calculation, resulting in a spread of the packet flow across all nodes. As each node processes a packet of an SA from time to time, sequence numbers are automatically advanced whenever a packet is processed. Sequence numbers are not updated if a packet fails verification by the IPsec authentication algorithm, as an attacker would otherwise be able to manipulate the SA state. Packet flows are partitioned into groups of about a dozen packets to avoid too much packet reordering.

For traffic to encrypt on the cluster, the SA is looked up and the SPI of the found SA is fed into the hash calculation. If the segment matches, the packet is processed further. If not, only the sequence number is incremented. To avoid assigning the same sequence number to different packets on multiple nodes, additional logic is required.
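
The outbound handling can be illustrated with a small simulation. It is a sketch under stated assumptions: every node sees every plaintext packet, @zlib.crc32@ replaces the real hash, and the class @OutboundSA@ is hypothetical. The "additional logic" mentioned above is not modelled.

```python
import zlib


class OutboundSA:
    """Per-node copy of an outbound ESP SA (illustrative only)."""

    def __init__(self, spi: int, segment_count: int, active_segments: set):
        self.spi = spi
        self.segment_count = segment_count
        self.active_segments = active_segments
        self.seq = 0

    def _segment(self) -> int:
        # Placeholder hash over the SPI; the real module uses its own hash.
        return zlib.crc32(self.spi.to_bytes(4, "big")) % self.segment_count + 1

    def handle(self, payload: bytes):
        """Assign the next sequence number, then process or drop the packet."""
        self.seq += 1
        if self._segment() in self.active_segments:
            return (self.seq, payload)  # stands in for the encrypted packet
        return None  # dropped, but the sequence number was still consumed
```

Because every node increments its counter for every packet, the counters stay equal on all nodes, while only the responsible node emits a packet.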

h2. Kernel Implementation

The ClusterIP Netfilter module uses an additional PREROUTING hook to mark received packets for forwarding. Two new Netfilter hooks are included in the IPsec processing, immediately before the decryption and encryption steps (XFRM_IN/XFRM_OUT).

<pre>
                 v        PLAIN        ^
    +------------------------------------------------+
    |            |                     |             |
    |     +--------------+      +--------------+     |
    |  4. |  PREROUTING  |      |   DECRYPT    |  3. |
    |     +--------------+      +--------------+     |
    |            |                     ^             |
    |            v                     |             |
    |     +--------------+      +--------------+     |
    |  5. |   XFRM_OUT   |      |   XFRM_IN    |  2. |
    |     +--------------+      +--------------+     |
    |            |                     ^             |
    |            v           ^         | ESP/AH      |
    |     +--------------+   |  +--------------+     |
    |  6. |   ENCRYPT    |   +--|    INPUT     |  1. |
    |     +--------------+      +--------------+     |
    |            |                     |             |
    +------------------------------------------------+
                 v       CRYPTED       ^
</pre>

# AH, ESP and UDP-encapsulated ESP packets are all accepted. Other traffic (e.g. IKE traffic) is subject to the ClusterIP selection algorithm based on the source IP address.
# Not yet decrypted IPsec traffic is dropped on all but the responsible node, using a ClusterIP algorithm based on the IPsec SA.
# Decryption is done on the responsible node only.
# Traffic is received on the ClusterIP multicast MAC and must be tagged as unicast traffic to advance through IP forwarding.
# After the IPsec policy lookup, unencrypted traffic is dropped on all but the responsible node, using a ClusterIP algorithm based on the IPsec SA. Outgoing sequence numbers are assigned before the packet is dropped, which keeps outgoing sequence numbers in sync on all nodes.
# Encryption is done on the responsible node only.
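
Steps 1 and 2 of the inbound path can be sketched as two small decision functions. This is an assumption-laden illustration rather than the kernel hooks themselves: the function names and parameters are hypothetical, and the segment numbers are expected to be computed elsewhere.

```python
IPPROTO_UDP = 17
IPPROTO_ESP = 50
IPPROTO_AH = 51


def input_hook(protocol: int, espinudp: bool, source_segment: int, active: set) -> bool:
    """Step 1: return True if the packet may continue past INPUT."""
    if protocol in (IPPROTO_ESP, IPPROTO_AH):
        return True  # IPsec traffic is filtered later, per SA
    if protocol == IPPROTO_UDP and espinudp:
        return True  # UDP-encapsulated ESP is treated like ESP
    # Everything else (e.g. IKE) is selected by source address segment.
    return source_segment in active


def xfrm_in_hook(sa_segment: int, active: set) -> bool:
    """Step 2: only the node serving the SA's segment goes on to decrypt."""
    return sa_segment in active
```

Deferring the IPsec decision to step 2 is what allows the segment to be derived from the SA instead of the source address.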

h2. IKE daemon implementation

A separate high availability plugin implemented for the IKEv2 daemon charon is responsible for state synchronization between the nodes in a cluster and simple monitoring functionality. It is currently designed for two nodes, but will be extended to synchronize larger clusters in the future.

h3. Daemon hooks

The plugin registers itself at several hooks in the daemon. These hooks are used for notifications about SA state changes and push information to the plugin. The following hooks are used:

* ike_keys(): receives IKE key material (DH, nonces, proposals)
* ike_updown()/ike_rekey(): monitor state changes of IKE SAs
* message(): used to update IKE message IDs
* child_keys(): receives CHILD key material
* child_state_change(): monitor state changes of CHILD SAs

The plugin registers its hook functions at the daemon bus. These hooks are sufficient to synchronize all IKE_SAs and CHILD_SAs with all the state required to do a fail-over of IKE and ESP SAs.

h3. Synchronization messages

The hook functions collect the required synchronization data and prepare messages to be sent to the other nodes in the cluster. Messages are sent as unencrypted UDP datagrams; they are sent and received on port 4510. As these messages contain sensitive key material, securing them with IPsec is recommended.

No packet acknowledge/retransmit scheme is currently implemented, so the cluster needs a reliable network with very few packet losses. It might be necessary to use a more reliable transport protocol in the future, especially if nodes start to drop packets due to an overloaded CPU.

Messages contain a protocol version, a message type and different attributes. The following synchronization message types are currently defined:

* IKE_ADD: A new IKE_SA has been established. This message contains all information needed to derive key material. If the message contains a REKEY attribute, the IKE_SA inherits all required parameters from the old SA.
* IKE_UPDATE: Update an IKE_SA with newer information (e.g. identities when authentication is complete).
* IKE_DELETE: Delete an established IKE_SA.
* CHILD_ADD: A CHILD_SA has been established; the message contains keying material.
* CHILD_DELETE: A CHILD_SA has been deleted.
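
The message layout (protocol version, message type, attributes) could look roughly like the sketch below. The wire format is not specified on this page, so everything concrete here is an assumption: the numeric type values, the version number, and the type-length-value encoding of attributes are invented for illustration.

```python
import struct
from enum import IntEnum


class MessageType(IntEnum):
    # Numeric values are made up for this sketch.
    IKE_ADD = 1
    IKE_UPDATE = 2
    IKE_DELETE = 3
    CHILD_ADD = 4
    CHILD_DELETE = 5


PROTOCOL_VERSION = 1  # assumed value


def encode(msg_type, attributes):
    """Serialize version, type and (attr_type, value) pairs as TLVs."""
    out = struct.pack("!BB", PROTOCOL_VERSION, msg_type)
    for attr_type, value in attributes:
        out += struct.pack("!BH", attr_type, len(value)) + value
    return out


def decode(data):
    """Parse a datagram produced by encode() back into its parts."""
    version, msg_type = struct.unpack_from("!BB", data, 0)
    offset, attributes = 2, []
    while offset < len(data):
        attr_type, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        attributes.append((attr_type, data[offset:offset + length]))
        offset += length
    return version, MessageType(msg_type), attributes
```

A self-describing attribute list of this kind would let a message such as IKE_ADD carry an optional REKEY attribute without changing the header.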

h3. State synchronization

Received synchronization messages are parsed, and mirrored IKE_SAs and CHILD_SAs are created from this information. Mirrored CHILD_SAs do not differ from normally negotiated ones; they are installed in the kernel and handle packets whenever ClusterIP considers the node responsible for them.

IKE_SAs are installed in a special PASSIVE state. They do not handle traffic, but accept state changes from sync messages only. PASSIVE IKE_SAs are managed in the IKE_SA manager as any other SA and are accessible through e.g. "ipsec statusall".

Key derivation is repeated on mirrored SAs the same way as it is done on the real SAs. This allows the reuse of existing installation routines and keeps the HA plugin very unobtrusive. A node responsible for an IKE_SA does not keep the keying material in memory; it just pushes the exchanged secret to the other nodes and forgets the secret afterwards.

h3. Control messages

In addition to the synchronization messages, the HA plugin uses control messages to notify about segment changes and optionally messages for simple monitoring functions:

* SEGMENT_DROP: List of segments the sending node is dropping responsibility for.
* SEGMENT_TAKE: List of segments the sending node is taking responsibility for.
* STATUS: Heartbeat message containing a list of segments the sending node is responsible for.
* RESYNC: Request for resynchronization of a list of segments.

The take/drop messages are sent to notify other nodes about changes made automatically by the daemon or manually by the administrator. The receiving node automatically performs the opposite action, so that every segment is handled exactly once.
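
The "opposite action" rule for a two-node cluster can be expressed in a few lines. This is a sketch only; the function name @apply_control@ and the use of plain strings for message types are assumptions.

```python
def apply_control(local_active: set, msg_type: str, segments) -> set:
    """Return the local node's active segments after a peer's take/drop."""
    if msg_type == "SEGMENT_TAKE":
        # The peer now serves these segments, so the local node drops them.
        return local_active - set(segments)
    if msg_type == "SEGMENT_DROP":
        # The peer gave these segments up, so the local node takes them.
        return local_active | set(segments)
    raise ValueError(f"not a take/drop message: {msg_type}")
```

For example, a node serving segments 1 and 2 that receives SEGMENT_TAKE for segment 2 continues with segment 1 only.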

If heartbeat monitoring is enabled, the status message is sent periodically. This allows a node to detect the activity of the remote node and to take over segments the remote node is not serving. It also implements node failure detection for simple errors.
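
Heartbeat based monitoring might be modelled as below. The timeout value, the class name @HeartbeatMonitor@ and the explicit @now@ timestamps are assumptions made for this sketch; the page does not state the interval or timeout the plugin uses.

```python
class HeartbeatMonitor:
    """Tracks STATUS messages from the remote node (illustrative only)."""

    def __init__(self, all_segments, timeout, now):
        self.all_segments = set(all_segments)
        self.timeout = timeout
        self.last_seen = now
        self.remote_segments = set()

    def on_status(self, segments, now):
        """Record a received STATUS message and the segments it lists."""
        self.last_seen = now
        self.remote_segments = set(segments)

    def unserved_segments(self, now):
        """Segments the remote node is not serving at time `now`."""
        if now - self.last_seen > self.timeout:
            return set(self.all_segments)  # peer considered failed
        return self.all_segments - self.remote_segments
```

The local node would then activate whatever @unserved_segments@ returns, which covers both a silent peer and a peer that reports only part of the segments.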

If a replacement for a failed node is installed, reintegration of the node can be sped up by sending the resynchronization message. The active node will start resyncing all SAs, allowing the administrator to rebalance the load distribution in the cluster afterwards.

h3. Failover

In the failover case, responsibility for complete ClusterIP segments is moved from one node to another. Responsibility for a segment can be enabled or disabled on each node. For this purpose, the plugin uses the same hashing algorithm as ClusterIP to calculate responsibility based on the source IP address.

If a segment is activated, the plugin searches for IKE_SAs in this segment and sets the state of all PASSIVE IKE_SAs to ESTABLISHED. No further action is required: The daemon handles the IKE_SA as every other one and sends out synchronization messages for state changes.

On segment deactivation, the plugin searches for IKE_SAs in the ESTABLISHED state in this segment and sets their state to PASSIVE.

CHILD_SAs are completely unaffected by activation and deactivation: they are always active and handle traffic assigned by ClusterIP.
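
The state changes on segment activation and deactivation can be sketched as follows. The types @IkeSa@ and @State@ are hypothetical stand-ins for the daemon's own structures, and only the two states named above are modelled.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PASSIVE = "PASSIVE"
    ESTABLISHED = "ESTABLISHED"


@dataclass
class IkeSa:
    name: str
    segment: int
    state: State


def activate(sas, segment):
    """Take over a segment: PASSIVE IKE_SAs in it become ESTABLISHED."""
    for sa in sas:
        if sa.segment == segment and sa.state is State.PASSIVE:
            sa.state = State.ESTABLISHED


def deactivate(sas, segment):
    """Give up a segment: ESTABLISHED IKE_SAs in it become PASSIVE."""
    for sa in sas:
        if sa.segment == segment and sa.state is State.ESTABLISHED:
            sa.state = State.PASSIVE
```

CHILD_SAs do not appear in this sketch on purpose, since they are not touched by either operation.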

h3. Reintegration

To reintegrate a failed node into a cluster, the node needs state information from scratch. If all the required state has been synced, the reintegrated node can be used as failover node again. Segments can be activated on the reintegrated node only after all required state has been exchanged.

SA state automatically gets synchronized during rekeying. Each rekeying procedure provides fresh keying material which can be used to build the mirrored IKE and CHILD_SA state. Rekeying is currently the only way to push the required state to a reintegrated node, as the key material is not stored directly on an active node.

To speed up the reintegration process, the plugin can trigger the rekeying of existing IKE_SAs and CHILD_SAs in a segment, allowing a reintegration process to complete within seconds.

h2. Configuration

Configuration is done in two places. The necessary virtual IPs and the ClusterIP rules are installed manually. This is explicitly not done by the daemon, as the rules must stay active after daemon shutdown or error conditions.

The HA plugin requires a configuration matching the installed ClusterIP rules. All nodes in the cluster need an identical connection configuration and credentials; IP addresses assigned to clients using configuration payloads must be set carefully, using either a central address pool or two distinct address pools.

h3. ClusterIP

The configuration of the extended ClusterIP module is similar to a default ClusterIP setup. For a traffic forwarding IPsec gateway, a cluster usually needs an internal virtual IP/MAC and an external virtual IP/MAC on each node.

<pre>
ip address add 192.168.0.200/24 dev eth0
iptables -A INPUT -i eth0 -d 192.168.0.200 -j CLUSTERIP --new \
   --hashmode sourceip --clustermac 01:00:5e:00:00:20 \
   --total-nodes 2 --local-node 1
</pre>

This example installs the virtual IP 192.168.0.200 on interface eth0 and adds a corresponding ClusterIP rule. ClusterIP rules are always added to the INPUT chain. To get the same result for segment responsibility calculation in the kernel and the HA plugin, the sourceip hashmode and a hash init value of 0 must be used (default).

The _total-nodes_ option must match the configuration of the HA plugin, and all nodes require the same virtual IP/MAC and ClusterIP configuration.

ClusterIP requires the _local-node_ option to be present. However, the HA plugin reassigns segment responsibility during daemon startup.

h3. HA plugin

The HA plugin configuration is handled in the _strongswan.conf_ file.

<pre>
charon {
    # ...
    ha {
        local = 10.0.0.2
        remote = 10.0.0.1
        segment_count = 2
        # secret = s!ronG-P5K-s3cret
        fifo_interface = yes
        monitor = yes
        resync = yes
    }
}
</pre>

The _local_ and _remote_ addresses are used to send and receive sync messages; _segment_count_ defines the number of segments to use.

If a _secret_ option is specified, the nodes automatically establish a pre-shared key authenticated IPsec tunnel for HA sync and control messages (experimental).

The segment responsibility administration interface is enabled with the _fifo_interface_ option. The _monitor_ parameter enables the heartbeat based remote node monitoring; the _resync_ option enables automatic state resynchronization if a node joins the cluster.

h3. Administrating segment responsibility

Segment responsibility is changed in the daemon, which propagates the changes to the kernel.

The HA plugin uses a very similar interface for segment control as ClusterIP. Instead of a proc entry, it uses a FIFO located at _/var/run/charon.ha_. Echoing +1/-1 will activate/deactivate responsibility for segment 1, while an additional command *3 will enforce a resynchronization by triggering a rekey of all SAs in segment 3.
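
The command syntax accepted on the FIFO can be illustrated with a small parser. This is a sketch of the syntax as described above, not the plugin's parser; the function name @parse_command@ and the action labels are hypothetical.

```python
def parse_command(line: str):
    """Return (action, segment) for a '+N', '-N' or '*N' command."""
    actions = {"+": "activate", "-": "deactivate", "*": "resync"}
    line = line.strip()
    if len(line) < 2 or line[0] not in actions or not line[1:].isdecimal():
        raise ValueError(f"invalid segment command: {line!r}")
    return actions[line[0]], int(line[1:])
```

Under this reading, writing the text +1 to _/var/run/charon.ha_ corresponds to the pair ("activate", 1), and *3 to ("resync", 3).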

Project

General

Profile

strongSwan

High Availability » History » Version 1

-Martin Willi
+{{>toc}}
 Martin Willi
-Martin Willi
+h1. High Availability
 Martin Willi
-Martin Willi
+Starting with the upcoming 4.4.0 release, the IKEv2 daemon will experimentally support pseudo active/active High Availability and Load Sharing capabilities using a cluster of (initially) two nodes.
 Martin Willi
-Martin Willi
+The source code is currently maintained in a separate, not yet public branch. The HA plugin and the required kernel patches will be released with 4.4.0.
 Martin Willi
-Martin Willi
+h2. Problem statement
 Martin Willi
-Martin Willi
+The IKEv2/IPsec protocol is not well suited for operation in an active/active cluster. While it is possible to share the state of IKE_SAs over high-speed links in a cluster, sharing the kernel maintained IPsec ESP SA is very difficult. Due to the strict sequence numbering of ESP packets, the overhead for synchronizing ESP sequence numbers would be very high.
 Martin Willi
-Martin Willi
+The IETF ipsecme working group "currently discusses":http://tools.ietf.org/html/draft-nir-ipsecme-ipsecha-00 the problems in more details and will probably standardize a solution which involves the client to work around the problems. This will, however, require an extension to the IKEv2 protocol. Clients connecting to a highly available cluster will benefit from this features only if they support this extension, existing clients (such as the one shipped with Windows 7) will not take advantage of this efforts.
 Martin Willi
-Martin Willi
+h2. Possible approaches
 Martin Willi
-Martin Willi
+h3. Node to node synchronization
 Martin Willi
-Martin Willi
+While the synchronization of IKE state including sequence numbers is realistic between two nodes, exchanging state information of ESP security association is difficult.
 Martin Willi
-Martin Willi
+Synchronizing the state for each processed IPsec packet will put high load on the nodes. Synchronizing after a certain amount of packets and/or after a certain timeout will reduce the load, but will make fail-over handling much more difficult, as we have to guess on the taking over node how many packets the failing node has processed, but could not synchronize before it failed.
 Martin Willi
-Martin Willi
+Another problem with such an approach is that there is no way of doing load sharing between nodes. An SA is strictly bound to a single node until the event of a failure.
 Martin Willi
-Martin Willi
+h3. Client to cluster synchronization
 Martin Willi
-Martin Willi
+Another approach to consider is requesting state information from the client. In the event of a failure, the taking over node can request sequence numbers from the client. But this approach has the same deficiencies as previously discussed. Further, it requires to extend the IKE protocol between client and gateway, making existing implementations incompatible to this approach.
 Martin Willi
-Martin Willi
+h2. Functional specification
 Martin Willi
-Martin Willi
+strongSwan uses a slightly different approach. Our solution should provide:
-Martin Willi
+* Failure detection: If a node fails due to power loss, hardware failures, kernel oops or daemon crashes, the node will be removed from the cluster.
-Martin Willi
+* State synchronization: If a node is removed due to failure or administrative purposes, the cluster should already have an up-to-date copy of the nodes state to take over.
-Martin Willi
+* Take over: Node failure detection and state take over should happen within 1-3 seconds.
-Martin Willi
+* Transparent migration: TCP or application sessions should not be interrupted during take over.
-Martin Willi
+* Load sharing: Load should be shared across all actives nodes in a cluster.
-Martin Willi
+* Reintgration: A repaired node can be (re-)added to an existing cluster, taking over a part of the load .
-Martin Willi
+* Legacy clients: No protocol extension, any IKEv2 client should benefit of High Availability if connected to a cluster.
 Martin Willi
-Martin Willi
+Migration of clients to another node does not affect the connection, a client does usually not detect a takeover. This allows a gateway administrator to e.g. remove a node from the load sharing cluster, apply security updates, reboot and reintegrate the node.
 Martin Willi
-Martin Willi
+h2. Selected solution
 Martin Willi
-Martin Willi
+The selected solution is based on the idea of "ClusterIP":http://lwn.net/Articles/108078, a Linux kernel module allowing a set of nodes to provide a service under a single virtual IP.
 Martin Willi
-Martin Willi
+h3. How ClusterIP works
 Martin Willi
-Martin Willi
+All nodes in a ClusterIP based setup act under a single virtual IP address. The nodes spoof ARP requests with a multicast MAC address. This will make the switch forwarding the packet to each node in the cluster.
 Martin Willi
-Martin Willi
+The received packet is associated to a segment by calculating a hash value of it. In the simplest setup, the source address is hashed and the hash value modulo the number of segments results in the responsible segment number. Each segment is handled by exactly one node in the cluster.
 Martin Willi
-Martin Willi
+The node responsible for the packet will pass it to upper layers, where all others just drop the packet in the netfilter code. Depending on the hash value, e.g. TCP connections are kept on the same node. If a node fails, a remaining node will take over the segment and process packets for it.
 Martin Willi
-Martin Willi
+h3. IPsec with ClusterIP
 Martin Willi
-Martin Willi
+While the ClusterIP module itself is not designed to handle IPsec traffic or even act as a forwarding router, the principle of ClusterIP is. If the IKE daemons in the cluster can synchronize the IKE state and the basic IPsec SA state without sequence numbers, a modified ClusterIP module can do the rest:
 Martin Willi
-Martin Willi
+For traffic to decrypt, the SPI value of the ESP packet can be included in the hash calculation, resulting in a spread of the packet flow across all nodes. As each node processes a packet of an SA from time to time, sequence numbers are automatically incremented if a packet is processed. Sequence numbers are not mangled if a packet is not verified using the IPsec authentication algorithm, as an attacker would be able to manipulate the SA state otherwise. Paket flows are partitioned in a dozen of packets to avoid too much packet reordering.
 Martin Willi
-Martin Willi
+For traffic to encrypt on the cluster, the SA is looked up and the hash value is feed with the SPI of the found SA. If the segment matches, the packet is further processed. If not, only the sequence number is incremented. To avoid assigning the same sequence number to different packets on multiple nodes, additional logic is required.
 Martin Willi
-Martin Willi
+h2. Kernel Implementation
 Martin Willi
-Martin Willi
+The ClusterIP Netfilter module uses an additional PREROUTING hook to mark received packets for forwarding. Two new Netfilter hooks are included in the IPsec processing, exactly before the decryption/encryption process (XFRM_IN/XFRM_OUT).
 Martin Willi
-Martin Willi
+<pre>
-Martin Willi
+                 v        PLAIN        ^
-Martin Willi
+    +------------------------------------------------+
-Martin Willi
+    |            |                     |             |
-Martin Willi
+    |     +--------------+      +--------------+     |
-Martin Willi
+    |  4. |  PREROUTING  |      |   DECRYPT    |  3. |
-Martin Willi
+    |     +--------------+      +--------------+     |
-Martin Willi
+    |            |                     ^             |
-Martin Willi
+    |            v                     |             |
-Martin Willi
+    |     +--------------+      +--------------+     |
-Martin Willi
+    |  5. |   XFRM_OUT   |      |   XFRM_IN    |  2. |
-Martin Willi
+    |     +--------------+      +--------------+     |
-Martin Willi
+    |            |                     ^             |
-Martin Willi
+    |            v           ^         | ESP/AH      |
-Martin Willi
+    |     +--------------+   |  +--------------+     |
-Martin Willi
+    |  6. |   ENCRYPT    |   +--|    INPUT     |  1. |
-Martin Willi
+    |     +--------------+      +--------------+     |
-Martin Willi
+    |            |                     |             |
-Martin Willi
+    +------------------------------------------------+
-Martin Willi
+                 v       CRYPTED       ^
-Martin Willi
+</pre>
 Martin Willi
-Martin Willi
+#  AH, ESP and UDP-Encapsulated ESP packets are all accepted. Other traffic is subject to the ClusterIP selection algorithm based on the source IP address (e.g. IKE traffic).
-Martin Willi
+# Undecrypted IPsec traffic gets dropped using a ClusterIP algorithm, based on the IPsec SA.
-Martin Willi
+# Decryption process is done on the responsible node only.
-Martin Willi
+# Traffic is received on ClusterIP multicast MAC and must be tagged as unicast traffic to advance through IP forwarding.
-Martin Willi
+# After IPsec policy lookup, unencrypted traffic gets dropped using a ClusterIP algorithm, based on IPsec SA. Outgoing sequence numbers are assigned before the packed drop, this will keep outgoing sequence numbers in sync on all nodes.
-Martin Willi
+# Encryption process is done on the responsible node only.
 Martin Willi
-Martin Willi
+h2. IKE daemon implementation
 Martin Willi
-Martin Willi
+A separate high availability plugin implemented for the IKEv2 daemon charon is responsible for state synchronization between the nodes in a cluster and simple monitoring functionality. It is currently designed for two nodes, but will be extended to synchronize larger clusters in the future.
 Martin Willi
-Martin Willi
+h3. Daemon hooks
 Martin Willi
-Martin Willi
+The plugin registers itself at several hooks in the daemon. These hooks are used for notifications about SA state changes and push information to the plugin. The following hooks are used:
-Martin Willi
+* ike_keys(): receives IKE key material (DH, nonces, proposals)
-Martin Willi
+* ike_updown()/ike_rekey(): monitor state changes of IKE SAs
-Martin Willi
+* message(): used to update IKE message IDs
-Martin Willi
+* child_keys(): receives CHILD key material
-Martin Willi
+* child_state_change(): monitor state changes of CHILD SAs
 Martin Willi
-Martin Willi
+The plugin registers its hook functions at the daemon bus. These hooks are sufficient to synchronize all IKE_SAs and CHILD_SAs with all the state required to do a fail-over of IKE and ESP SAs.
 Martin Willi
-Martin Willi
+h3. Synchronization messages
 Martin Willi
-Martin Willi
+The hook functions collect the required synchronization data and prepare messages to be sent to other nodes in the cluster. Messages are exchanged as unencrypted UDP datagrams on port 4510. As these messages contain sensitive key material, securing them with IPsec is recommended.
 Martin Willi
-Martin Willi
+No packet acknowledge/retransmit scheme is currently implemented, so the cluster needs a reliable network with very few packet losses. It might be necessary to use a more reliable transport protocol in the future, especially if nodes start to drop packets due to an overloaded CPU.
 Martin Willi
-Martin Willi
+Messages contain a protocol version, a message type and different attributes. The following synchronization message types are currently defined:
-Martin Willi
+* IKE_ADD: A new IKE_SA has been established. This message contains all information to derive key material. If the message contains a REKEY attribute, the IKE_SA inherits all required parameters from the old SA.
-Martin Willi
+* IKE_UPDATE: Update IKE_SA with newer information (e.g. identities when authentication is complete).
-Martin Willi
+* IKE_DELETE: Delete an established IKE_SA.
-Martin Willi
+* CHILD_ADD: CHILD_SA has been established, contains keying material.
-Martin Willi
+* CHILD_DELETE: CHILD_SA has been deleted.
 Martin Willi
-Martin Willi
+h3. State synchronization
 Martin Willi
-Martin Willi
+Received synchronization messages are parsed, and mirrored IKE_SAs and CHILD_SAs are created from this information. Mirrored CHILD_SAs do not differ from normally exchanged ones; they are installed in the kernel and handle packets whenever ClusterIP considers the node responsible for them.
 Martin Willi
-Martin Willi
+IKE_SAs are installed in a special PASSIVE state. They do not handle traffic, but accept state changes from sync messages only. PASSIVE IKE_SAs are managed in the IKE_SA manager as any other SA and are accessible through e.g. "ipsec statusall".
 Martin Willi
-Martin Willi
+Key derivation is repeated on mirrored SAs the same way as it is done on the real SAs. This allows the reuse of existing installation routines and keeps the HA plugin very unobtrusive. A node responsible for an IKE_SA does not keep the keying material in memory; it just pushes the exchanged secret to the other nodes and forgets it afterwards.
 Martin Willi
-Martin Willi
+h3. Control messages
 Martin Willi
-Martin Willi
+In addition to the synchronization messages, the HA plugin uses control messages to notify about segment changes and, optionally, for simple monitoring functions:
-Martin Willi
+* SEGMENT_DROP: List of segments the sending node is dropping responsibility for.
-Martin Willi
+* SEGMENT_TAKE: List of segments the sending node is taking responsibility for.
-Martin Willi
+* STATUS: Heartbeat message containing a list of segments the sending node is responsible for.
-Martin Willi
+* RESYNC: Request for resynchronization of a list of segments.
 Martin Willi
-Martin Willi
+The take/drop messages are sent to notify other nodes about changes made automatically by the daemon or manually by the administrator. The receiving node will automatically perform the opposite action, so that every segment is handled exactly once.
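The "opposite action" rule can be illustrated with a minimal Python sketch for a two-node cluster. It is not taken from the plugin source: the function name @apply_peer_notification@ and the representation of segments as a set of integers are assumptions made for illustration only.

```python
def apply_peer_notification(local_segments, msg_type, segments):
    """Return the locally active segments after a peer control message.

    Hypothetical helper: in a two-node cluster the receiver mirrors the
    sender, so each segment stays served by exactly one node.
    """
    changed = set(segments)
    if msg_type == "SEGMENT_TAKE":
        # The peer took these segments, so this node drops them.
        return set(local_segments) - changed
    if msg_type == "SEGMENT_DROP":
        # The peer dropped these segments, so this node takes them.
        return set(local_segments) | changed
    raise ValueError("not a segment control message: %r" % msg_type)
```

With both segments active locally, a SEGMENT_TAKE for segment 2 from the peer leaves only segment 1 active on this node; a later SEGMENT_DROP for segment 2 restores it.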
 Martin Willi
-Martin Willi
+If heartbeat monitoring is enabled, the status message is sent periodically. This makes it possible to detect the activity of the remote node and to take over segments the remote node is not serving. It also provides node failure detection for simple errors.
 Martin Willi
-Martin Willi
+If a replacement for a failing node is installed, reintegration of the node can be sped up by sending the resynchronization message. The active node
-Martin Willi
+will start resyncing all SAs, allowing the administrator to rebalance the load distribution in the cluster afterwards.
 Martin Willi
-Martin Willi
+h3. Failover
 Martin Willi
-Martin Willi
+In the failover case, responsibility for complete ClusterIP segments is moved from one node to another. Responsibility for a segment can be enabled or disabled on each node. For this purpose, the plugin uses the same hashing algorithm as ClusterIP to calculate responsibility based on the source IP address.
 Martin Willi
-Martin Willi
+If a segment is activated, the plugin searches for IKE_SAs in this segment and sets the state of all PASSIVE IKE_SAs to ESTABLISHED. No further action is required: The daemon handles the IKE_SA as every other one and sends out synchronization messages for state changes.
 Martin Willi
-Martin Willi
+On segment deactivation, the plugin searches for IKE_SAs in the ESTABLISHED state in this segment and sets their state to PASSIVE.
 Martin Willi
-Martin Willi
+CHILD_SAs are completely unaffected by activation and deactivation: They are always active and handle traffic assigned by ClusterIP.
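The state handling on segment activation and deactivation can be sketched as follows. This is a simplified Python model, not the plugin's code: the @IkeSa@ class and the function names are hypothetical, and @segment_of()@ uses a placeholder modulo calculation rather than the actual ClusterIP hash. Segments are assumed to be numbered starting at 1.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class IkeSa:
    """Hypothetical, heavily reduced model of an IKE_SA."""
    peer_ip: str
    state: str  # "PASSIVE" or "ESTABLISHED"


def segment_of(peer_ip, segment_count):
    # Placeholder only: the real plugin must reproduce the ClusterIP
    # sourceip hash; a plain modulo merely stands in for it here.
    return int(ipaddress.ip_address(peer_ip)) % segment_count + 1


def activate_segment(sas, segment, segment_count):
    # PASSIVE IKE_SAs in the activated segment become ESTABLISHED.
    for sa in sas:
        if segment_of(sa.peer_ip, segment_count) == segment and sa.state == "PASSIVE":
            sa.state = "ESTABLISHED"


def deactivate_segment(sas, segment, segment_count):
    # ESTABLISHED IKE_SAs in the deactivated segment become PASSIVE.
    for sa in sas:
        if segment_of(sa.peer_ip, segment_count) == segment and sa.state == "ESTABLISHED":
            sa.state = "PASSIVE"
```

CHILD_SAs do not appear in the sketch because, as described above, they are never switched between states.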
 Martin Willi
-Martin Willi
+h3. Reintegration
 Martin Willi
-Martin Willi
+To reintegrate a failed node into a cluster, the node needs to rebuild its state information from scratch. Once all the required state has been synced, the reintegrated node can be used as a failover node again. Segments can be activated on the reintegrated node only after all required state has been exchanged.
 Martin Willi
-Martin Willi
+SA state automatically gets synchronized during rekeying. Each rekeying procedure provides fresh keying material which can be used to build the
-Martin Willi
+mirrored IKE and CHILD_SA state. Rekeying is currently the only way to push the required state to a reintegrated node, as the key material is not stored directly on an active node.
 Martin Willi
-Martin Willi
+To speed up the reintegration process, the plugin can trigger the rekeying of existing IKE- and CHILD_SAs in a segment, allowing a reintegration process to complete within seconds.
 Martin Willi
 Martin Willi
-Martin Willi
+h2. Configuration
 Martin Willi
-Martin Willi
+Configuration is done in two places. The necessary virtual IPs and the ClusterIP rules are installed manually. This is explicitly not done by the daemon, as the rules must stay active after daemon shutdown or error conditions.
 Martin Willi
-Martin Willi
+The HA plugin requires a configuration matching the installed ClusterIP rules. All nodes in the cluster need an identical connection configuration and credentials; IP addresses assigned to clients using configuration payloads must be set carefully using a central or two distinct address pools.
 Martin Willi
 Martin Willi
-Martin Willi
+h3. ClusterIP
 Martin Willi
-Martin Willi
+The configuration of the extended ClusterIP module is similar to a default ClusterIP setup. For a traffic forwarding IPsec gateway, a cluster usually
-Martin Willi
+needs an internal virtual IP/MAC and an external virtual IP/MAC on each node.
 Martin Willi
-Martin Willi
+<pre>
-Martin Willi
+ip address add 192.168.0.200/24 dev eth0
-Martin Willi
+iptables -A INPUT -i eth0 -d 192.168.0.200 -j CLUSTERIP --new \
-Martin Willi
+   --hashmode sourceip --clustermac 01:00:5e:00:00:20 \
-Martin Willi
+   --total-nodes 2 --local-node 1
-Martin Willi
+</pre>
 Martin Willi
-Martin Willi
+This example installs the virtual IP 192.168.0.200 on interface eth0 and adds a corresponding ClusterIP rule. ClusterIP rules are always added to the INPUT chain. To get the same result for segment responsibility calculation in the kernel and the HA plugin, the sourceip hashmode and a hash init value of 0 (the default) must be used.
 Martin Willi
-Martin Willi
+The _total-nodes_ option must match the configuration of the HA plugin, and all nodes require the same virtual IP/MAC and ClusterIP configuration.
 Martin Willi
-Martin Willi
+ClusterIP requires the _local-node_ option to be present. However, the HA plugin reassigns segment responsibility during daemon startup.
 Martin Willi
-Martin Willi
+h3. HA plugin
 Martin Willi
-Martin Willi
+The HA plugin configuration is handled in the _strongswan.conf_ file.
 Martin Willi
-Martin Willi
+<pre>
-Martin Willi
+charon {
-Martin Willi
+    # ...
-Martin Willi
+    ha {
-Martin Willi
+        local = 10.0.0.2
-Martin Willi
+        remote = 10.0.0.1
-Martin Willi
+        segment_count = 2
-Martin Willi
+        # secret = s!ronG-P5K-s3cret
-Martin Willi
+        fifo_interface = yes
-Martin Willi
+        monitor = yes
-Martin Willi
+        resync = yes
+    }
+}
 Martin Willi
 Martin Willi
-Martin Willi
+</pre>
 Martin Willi
-Martin Willi
+The _local_ and _remote_ addresses are used to send and receive sync messages; _segment_count_ defines the number of segments to use.
 Martin Willi
-Martin Willi
+If a _secret_ option is specified, the nodes automatically establish a pre-shared key authenticated IPsec tunnel for HA sync and control messages
-Martin Willi
+(experimental).
 Martin Willi
-Martin Willi
+The segment responsibility administration interface is enabled with the _fifo_interface_ option. The _monitor_ parameter enables the heartbeat-based remote node monitoring, and the _resync_ option enables automatic state resynchronization if a node joins the cluster.
 Martin Willi
-Martin Willi
+h3. Administrating segment responsibility
 Martin Willi
-Martin Willi
+Segment responsibility is changed through the daemon, which propagates the changes to the kernel.
 Martin Willi
-Martin Willi
+The HA plugin uses a very similar interface for segment control as ClusterIP. Instead of a proc entry, it uses a FIFO located at _/var/run/charon.ha_ .
-Martin Willi
+Echoing +1/-1 will activate/deactivate responsibility for segment 1, while the additional command *3 will enforce a resynchronization by triggering a rekey of all SAs in segment 3.
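As a rough illustration, the FIFO commands described above could be issued from a small Python helper instead of @echo@. The helper name @send_segment_command@ and its validation rules are assumptions for this sketch; only the FIFO path and the +/-/* command prefixes come from the description above.

```python
def send_segment_command(command, fifo_path="/var/run/charon.ha"):
    """Write one segment command such as "+1", "-1" or "*3" to the HA FIFO.

    Hypothetical helper: the check below only enforces the
    prefix-plus-segment-number shape described in the text.
    """
    if len(command) < 2 or command[0] not in "+-*" or not command[1:].isdigit():
        raise ValueError("expected '+N', '-N' or '*N', got %r" % command)
    # Opening a FIFO for writing blocks until a reader (the daemon) has it open.
    with open(fifo_path, "w") as fifo:
        fifo.write(command + "\n")
```

Calling @send_segment_command("+1")@ would then correspond to echoing +1 into the FIFO, provided the daemon runs with _fifo_interface_ enabled and the caller has permission to write to the FIFO.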