
In an SRX chassis cluster setup, in addition to interface monitoring you can also use
IP monitoring to monitor the health of your upstream path.

[Figure: srx-chassis-ip-monitoring topology]

Above is a simple topology to explain how IP monitoring works. In this setup, node0 and node1 are part of an SRX chassis cluster, and the reth0.0 interface belongs to redundancy group 1 (RG1).

Currently node0 is the primary for RG1, as you can see from the output below:

{primary:node0}
root@node0> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual failover

Redundancy group: 0 , Failover count: 1
 node0 100 primary no no
 node1 99 secondary no no

Redundancy group: 1 , Failover count: 1
 node0 100 primary no no
 node1 99 secondary no no

Now let's configure IP monitoring to detect failures at the network layer.

root@node0# show chassis cluster redundancy-group 1 ip-monitoring
global-weight 100;
global-threshold 200;
retry-interval 3;
retry-count 5;
family {
 inet {
 172.17.11.1 {
 weight 200;
 interface reth0.0 secondary-ip-address 172.17.11.99;
 }
 }
}

The config above instructs the SRX to:

  • Monitor the IP address 172.17.11.1 by sending ICMP packets every 3 seconds (retry-interval).
  • If 5 consecutive attempts (retry-count) fail, mark 172.17.11.1 as unreachable.
  • Then deduct the IP's weight (200) from the global-threshold value (200).
  • If the result of this deduction is 0, deduct the global-weight (100) from the RG1 threshold (255).
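The arithmetic in these steps can be sketched in a few lines of Python. This is only a simplified model of the behaviour as described in this post, not the actual Junos implementation; the class and method names are made up for illustration, and the detection time is a rough estimate (retry-interval times retry-count).

```python
from dataclasses import dataclass

RG_INITIAL_THRESHOLD = 255  # starting threshold of a redundancy group, as in this post


@dataclass
class IpMonitoring:
    """Hypothetical model of the ip-monitoring stanza (names are illustrative)."""
    global_weight: int
    global_threshold: int
    retry_interval: int  # seconds between ICMP probes
    retry_count: int     # consecutive failures before the IP is marked unreachable
    ip_weights: dict     # monitored IP -> weight

    def detection_time(self) -> int:
        # Rough time in seconds before a dead IP is marked unreachable.
        return self.retry_interval * self.retry_count

    def rg_threshold(self, unreachable: set) -> int:
        # Deduct the weight of every unreachable IP from global-threshold.
        remaining = self.global_threshold - sum(self.ip_weights[ip] for ip in unreachable)
        if remaining <= 0:
            # global-threshold is used up: deduct global-weight from the RG threshold.
            return max(RG_INITIAL_THRESHOLD - self.global_weight, 0)
        return RG_INITIAL_THRESHOLD


cfg = IpMonitoring(global_weight=100, global_threshold=200,
                   retry_interval=3, retry_count=5,
                   ip_weights={"172.17.11.1": 200})

print(cfg.detection_time())               # 15
print(cfg.rg_threshold(set()))            # 255
print(cfg.rg_threshold({"172.17.11.1"}))  # 155
```

With the values from the config above, losing 172.17.11.1 leaves the RG1 threshold at 155, which is above 0.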

According to the documentation, the configured secondary-ip-address must not be the primary IP address of reth0.0, but it must be on the same subnet. In a nutshell, the primary node uses the reth0.0 interface address (172.17.11.100) as the source of its probes, while the secondary node uses 172.17.11.99 (with the MAC address of the secondary node’s child interface).

After configuring IP monitoring, here is the status. The IP address is reachable from both nodes.
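The two constraints on secondary-ip-address (different from the reth0.0 address, same subnet) are easy to check before committing. Here is a small sketch using Python's standard ipaddress module. The /24 prefix length is my assumption, since the mask is not shown in this post, and the function name is hypothetical.

```python
import ipaddress


def check_secondary_ip(reth_interface: str, secondary_ip: str) -> list:
    """Return a list of problems with the chosen secondary-ip-address (empty if fine)."""
    iface = ipaddress.ip_interface(reth_interface)  # e.g. "172.17.11.100/24"
    secondary = ipaddress.ip_address(secondary_ip)
    problems = []
    if secondary == iface.ip:
        problems.append("secondary-ip-address equals the reth primary address")
    if secondary not in iface.network:
        problems.append("secondary-ip-address is outside the reth subnet")
    return problems


# Assumption: reth0.0 is 172.17.11.100/24 (the prefix length is not given in the post).
print(check_secondary_ip("172.17.11.100/24", "172.17.11.99"))   # []
print(check_secondary_ip("172.17.11.100/24", "172.17.11.100"))
print(check_secondary_ip("172.17.11.100/24", "172.17.12.99"))
```

An empty list means the address satisfies both constraints.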

root@srx0> show chassis cluster ip-monitoring status
node0:
--------------------------------------------------------------------------

Redundancy group: 1

IP address Status Failure count Reason
172.17.11.1 reachable 0 n/a

node1:
--------------------------------------------------------------------------

Redundancy group: 1

IP address Status Failure count Reason
172.17.11.1 reachable 0 n/a

Now I disable ICMP responses on the gateway to simulate a failure, and
the IP address is marked as unreachable.

root@srx0> show chassis cluster ip-monitoring status
node0:
--------------------------------------------------------------------------
Redundancy group: 1
IP address Status Failure count Reason
172.17.11.1 unreachable 1 unknown
node1:
--------------------------------------------------------------------------
Redundancy group: 1
IP address Status Failure count Reason
172.17.11.1 reachable 0 n/a

When we check the cluster information, we can see that the RG1 threshold is now 155,
because the global-weight of 100 has been deducted from 255 (255 - 100 = 155).

{primary:node0}
root@srx0> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy mode:
 Configured mode: active-active
 Operational mode: active-active

Redundancy group: 0, Threshold: 255, Monitoring failures: none
 Events:
 Sep 19 14:34:43.366 : hold->secondary, reason: Hold timer expired
 Sep 19 14:34:59.817 : secondary->primary, reason: Better priority (100/99)

Redundancy group: 1, Threshold: 155, Monitoring failures: ip-monitoring
 Events:
 Sep 19 14:34:43.379 : hold->secondary, reason: Hold timer expired
 Sep 19 14:34:59.852 : secondary->primary, reason: Remote yeild (0/0)

You might have noticed that there is no failover yet. For a failover to happen, the RG1 threshold must reach 0, which isn’t the case in this simulation.

Now I change the config, keeping the IP weight at 200 and raising global-weight to 255, to simulate a failover:

global-weight 255;
global-threshold 200;
retry-interval 3;
retry-count 5;
family {
 inet {
 172.17.11.1 {
 weight 200;
 interface reth0.0 secondary-ip-address 172.17.11.99;
 }
 }
}

After a network failure, node1 becomes primary this time.

{primary:node0}
root@srx0> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual failover

Redundancy group: 0 , Failover count: 1
 node0 100 primary no no
 node1 99 secondary no no

Redundancy group: 1 , Failover count: 46
 node0 100 secondary no no
 node1 99 primary no no

You might notice the high failover count of 46 in the output. This happened after I made a mistake in my simulation 🙂
I assumed I could simply block ICMP from 172.17.11.100 on the Linux gateway like this:

iptables -A INPUT -p icmp -s 172.17.11.100 -j DROP

and that it would trigger a failover. That assumption was correct as far as it went: node0's probes from 172.17.11.100 fail while node1 can still ping from 172.17.11.99, so RG1 fails over. But once node1 becomes primary, it starts pinging from 172.17.11.100, which fails again and causes another failover. This put the cluster into a sort of deadlock (a failover loop), but I learned something from it 🙂
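The loop is easier to see in a toy simulation. The sketch below is a deliberately simplified model based only on the behaviour described in this post: whichever node is primary probes from 172.17.11.100, the other node probes from 172.17.11.99, and a failover happens whenever only the primary's probes fail. The function and variable names are made up for illustration.

```python
PRIMARY_SRC = "172.17.11.100"   # reth0.0 address, used by whichever node is primary
SECONDARY_SRC = "172.17.11.99"  # secondary-ip-address, used by the secondary node


def simulate(rounds: int, blocked: set) -> list:
    """Return the RG1 primary after each monitoring round (simplified model)."""
    primary = "node0"
    history = []
    for _ in range(rounds):
        primary_fails = PRIMARY_SRC in blocked
        secondary_fails = SECONDARY_SRC in blocked
        if primary_fails and not secondary_fails:
            # Only the primary's probes are dropped, so RG1 moves to the other node,
            # which then starts probing from the blocked address itself.
            primary = "node1" if primary == "node0" else "node0"
        history.append(primary)
    return history


# Gateway drops ICMP only from 172.17.11.100, like the iptables rule above.
print(simulate(4, {PRIMARY_SRC}))  # ['node1', 'node0', 'node1', 'node0']
# Nothing blocked: node0 stays primary.
print(simulate(3, set()))          # ['node0', 'node0', 'node0']
```

In this model, blocking only the primary source address makes the primary role bounce between the nodes on every round, which matches the runaway failover count above.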
