Traffic stays on lower LB interface
SonicAdmin80
Cybersecurity Overlord ✭✭✭
I'm managing an HA pair with two WAN connections. For some reason the secondary firewall became active today because of "higher link status", perhaps due to a temporary disruption in the network.
I've now noticed that all WAN traffic is using the lower-priority LB interface, even though the X1 connection is also up and both monitoring targets are alive. It's set up as Basic Failover and failback is checked.
What might cause the firewall to keep WAN traffic on the lower-priority interface even though the primary is up? I don't have any special routing rules for the WAN interfaces, just the defaults.
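To spell out what I expect Basic Failover with failback to do, here's a rough sketch of the decision logic as I understand it. This is not SonicOS code, just my mental model; the Member class and the interface labels are made up for illustration.

```python
# Sketch of a Basic Failover LB group with failback/preempt enabled (assumption,
# not SonicOS internals): the highest-priority healthy member should carry traffic.
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    priority: int     # lower number = higher priority (fiber before the 5G backup)
    link_up: bool     # physical link status
    probe_ok: bool    # logical/probe monitoring result

def pick_active(members, current_active, failback_enabled=True):
    """Return the member that should carry WAN traffic."""
    healthy = [m for m in members if m.link_up and m.probe_ok]
    if not healthy:
        return current_active            # nothing usable, keep whatever we had
    best = min(healthy, key=lambda m: m.priority)
    if failback_enabled:
        return best                      # preempt back to the primary once it is healthy
    if current_active in healthy:
        return current_active            # without failback, stay put while healthy
    return best

x1 = Member("X1 (fiber)", priority=1, link_up=True, probe_ok=True)
x2 = Member("X2 (5G)",    priority=2, link_up=True, probe_ok=True)
print(pick_active([x1, x2], current_active=x2).name)   # expect X1 with failback on
```

With failback checked I'd expect the group to preempt back to X1 as soon as its link and probes are healthy again, which is exactly what isn't happening here.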
Category: Virtual Firewall
Answers
@SonicAdmin80 I've never been in that situation; failover in HA has worked fine for me so far. Gen6 talking here, no HA experience with Gen7 at the moment. Are both WAN interfaces static, or is one some kind of dial-up?
Did you check whether the latest firmware release available for that unit might address this issue?
--Michael@BWC
@BWC Yep, both WAN interfaces are static. The main one is fiber and the secondary is 5G. Gen6 NSv hasn't received many updates anymore; this should be the latest (6.5.4.4-44v). I'll try failing back to the primary HA unit later to see if it starts using the main WAN after that.
@SonicAdmin80 yeah, a forced failback to the primary was what I meant to suggest too, but I forgot to mention it, duh.
If everything is back to normal after that, then I guess it's ticket time with Support.
It seems Gen6 NSv is pretty much dead at this point, though.
--Michael@BWC
@SonicAdmin80
Please check the sync status on the HA page. There may be a sync problem on the HA pair.
Also check the secondary device's WAN1 gateway.
HA status looked good and both WAN connections were online, so I'm not sure why it switched to the secondary in the first place. The HA events in the log don't give much detail about why these things happen. "Higher link status" doesn't say which interface caused the failover when there are multiple interfaces with monitoring IPs. None of the interfaces seemed to actually go down.
Doing a forced failover back to the primary worked without issue and the traffic also switched to use the primary WAN. I'm not sure if there's a bug in this scenario but I'm not very hopeful that support could come up with anything useful.
Actually, in the NSv console I can see "logical monitoring on interface X1 fails" messages on both units. So there does seem to have been a short problem with the primary WAN that caused the failover. There's nothing in the logs about LB probes failing, though, so those probes seem to have succeeded within the set interval.
This also doesn't explain why it kept the traffic on WAN2 when WAN1 came back online.
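My working theory for the mismatch between the two kinds of monitoring, sketched out below. The miss threshold and the probe outcomes are assumptions on my part, not the actual SonicOS values; the point is only that a brief blip can trip one monitor without the other ever logging a failure.

```python
# Illustrative only: a probe target is declared DOWN only after several
# consecutive misses, so a short outage may never be reported as an LB failure
# even if a single missed logical-monitoring ping happened at the wrong moment.
def probe_state(results, fail_threshold=3):
    """results: list of True/False probe outcomes, newest last."""
    consecutive_misses = 0
    for ok in results:
        consecutive_misses = 0 if ok else consecutive_misses + 1
    return "DOWN" if consecutive_misses >= fail_threshold else "UP"

print(probe_state([True, True, False, False, True]))    # UP: brief blip, never logged
print(probe_state([True, False, False, False, False]))  # DOWN: sustained failure
```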
It failed over again. Looks like it coincides with this in the logs:
The cache is full; 375512 cacheCurrentInUse, 0 freed from pendingFreeList (Total 0) open connections; some will be dropped
I haven't been able to track down what is using those connections, as the connection count drops back to a normal level after the failover. It could also be a bug where connections aren't released normally.
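For anyone else who runs into that log line, my understanding of a "cache is full" condition is simply that the connection table has a fixed capacity and new connections get dropped once it's exhausted. A toy sketch of that idea; the ConnectionCache class, the capacity, and the timeout are made up and not SonicOS internals:

```python
# Toy model of a bounded connection cache: once it's full, new connections are
# refused until existing entries age out. A flood that keeps refreshing its own
# entries prevents aging, so legitimate connections start getting dropped.
import time

class ConnectionCache:
    def __init__(self, capacity=375512, idle_timeout=30.0):
        self.capacity = capacity
        self.idle_timeout = idle_timeout
        self.entries = {}  # (src, dst, dport) -> last-seen timestamp

    def add(self, key):
        now = time.monotonic()
        # drop idle entries first
        self.entries = {k: t for k, t in self.entries.items()
                        if now - t < self.idle_timeout}
        if key in self.entries or len(self.entries) < self.capacity:
            self.entries[key] = now
            return True        # connection accepted
        return False           # cache full: connection dropped

cache = ConnectionCache(capacity=3)
for port in range(5):
    print(cache.add(("10.0.0.5", "192.0.2.10", 40000 + port)))
# True, True, True, False, False once the toy cache is full
```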
1) Check the connection peak status on the System page.
2) Can you assign a connection limit on all of your access rules? (See the sketch at the end of this post.)
3) Change the probe destinations on the LB groups.
4) Change the HA logical probe in the HA menu.
I suppose some user or server is attacking something.
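A minimal sketch of the idea behind the connection limit in item 2: cap how many entries a single source (or rule) can hold so one flooding host cannot exhaust the whole connection cache. The limit values and data structures here are illustrative only, not how SonicOS enforces it.

```python
# Per-source cap protecting a shared connection table (illustrative numbers).
from collections import Counter

GLOBAL_CAPACITY = 1000
PER_SOURCE_LIMIT = 100     # e.g. a connection limit applied on the access rule

per_source = Counter()
total = 0

def accept(src_ip):
    global total
    if total >= GLOBAL_CAPACITY or per_source[src_ip] >= PER_SOURCE_LIMIT:
        return False       # dropped: a limit was reached
    per_source[src_ip] += 1
    total += 1
    return True

# one misbehaving host opens 500 connections, a normal host opens 10
flood_accepted  = sum(accept("10.0.0.99") for _ in range(500))
normal_accepted = sum(accept("10.0.0.20") for _ in range(10))
print(flood_accepted, normal_accepted)   # 100 10 -> the flood is capped, others still work
```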
The connection peak has been at the maximum, so I did some limiting in the access rules. I also changed the probe destinations and the HA monitoring destinations. During the night it failed over multiple times due to "higher link status", on different interfaces at different times.
I'm starting to think something might be wrong with the ESXi hosts or the switch stack, since it shouldn't be possible for that many monitoring targets to fail.
There was one host that seemed to be spamming the network, but it was shut down yesterday.
Looks like it was a flood against one web server. Strange that it isn't shown in the AppFlow logs; perhaps the firewall was so taxed it couldn't log it. The problems stopped right after I blocked the incoming traffic, and it hasn't failed over again either. I didn't get any flood alerts, but connection limit alerts were coming in sometimes, though not always.
The source address is known to us, so at this point we suspect a malfunctioning device more than anything malicious. So it seems flooding can cause HA monitoring failures and a full connection cache, and these in turn can cause HA flapping. A bit tricky to troubleshoot, as it also looked like it could've been an internal issue.
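In case it helps someone else, this is roughly how I'd hunt for a talker like this next time: export the active connections (or a packet capture) to CSV and count flows per source IP. The file name and column header below are hypothetical and depend on what your export actually looks like.

```python
# Count flows per source IP from a connections export (hypothetical CSV layout).
import csv
from collections import Counter

def top_talkers(csv_path, column="Source IP", top_n=10):
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row[column]] += 1
    return counts.most_common(top_n)

if __name__ == "__main__":
    for ip, flows in top_talkers("connections-export.csv"):
        print(f"{ip:15s} {flows} connections")
```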
If you don't select "Report DROPPED Connection" under AppFlow, you cannot see them.
Could you please share a screenshot of the AppFlow / Flow Reporting / Settings page?
In fact, sometimes a misconfigured SQL query or similar client-side web queries against nginx can cause this. I suggest you do a detailed inspection of your web server.
This is how it looks, so reporting dropped connections isn't active. But since the connections weren't dropped until I disabled the access rule, shouldn't they show up in the AppFlow logs? The web server (NextCloud) was only used for one thing that isn't needed anymore, so I can just keep it offline for now.