NSA 4600 hangs when multiple users log in
Hi, my company has about 550 users who need to log in through a Sonicwall NSA4600 to get internet access. The Sonicwall syncs with our local AD.
But when more than, let's say, 20-30 users log in at the same time, the Sonicwall hangs. The login page stops responding, and we can't log in as admin either. If users log in 10 at a time or fewer it goes fine, but with more than 20 it stalls, and it hangs completely when more than 30 try at the same time. We have tried updating it and got some support earlier, but when the support technician restarted it in the middle of working hours and users tried to log back on, it hung again, so we had to keep all users off for the rest of the day to get the Sonicwall back up.
And when I say it hangs, it's only the login page. Those who are already logged in have internet with no problem, and we can still log in to the Sonicwall by SSH.
Isn't the NSA4600 supposed to handle this many users? Is there anything we can do? We aren't high-end IT technicians, but we know our way around a network, and we got help setting it up the first time.
Answers
Hello @Espen,
Could you please let us know the firmware version of the firewall?
Please make sure that you are on the latest version 6.5.4.7-83n.
While you have access through SSH, kindly run the commands
diag show cpu
diag show process <name_of_process>
diag show cpu shows the CPU utilization, with the most heavily utilized process at the top. Use that process name in the second command.
If the CPU is running high, this will tell you exactly what is causing it. Support should be able to dig in further and give you the complete RCA.
Thanks!
Shipra Sahu
Technical Support Advisor, Premier Services
Thanks for responding.
Yes, it's running the latest version; it wasn't that long ago that it was updated. However, this problem has been with us for the last 1.5 years, but we've managed to ride through it somehow anyway.
Below is the output of the commands you asked me to run. I don't know what it tells, but hopefully you can read something into it...
admin@C0EAE4F69E> diag show cpu
CPU Monitor:
Current 1s CPU Utilization: 100.00%
Current 10s CPU Utilization: 100.00%
Total Average CPU Utilization: 6.41%
Current MultiCore Utilization (%)
Core 0: 100
Core 1: 8
Core 2: 4
Core 3: 21
Core 4: 16
Core 5: 11
Core 6: 9
Core 7: 8
CPU Utilization Per Process:
# Name PC PRI Total% (secs) Curr% (secs)
--- ----------------- ---------- --- ------------- -------------
1. pass_to_stack 0x818c8ec8 50 0.48 (6173.15) 16.13 (0.17)
2. tWebRdrct03 0x8168e290 52 0.35 (4507.67) 14.52 (0.15)
3. tWebRdrct02 0x8168fcc8 52 0.35 (4505.63) 14.52 (0.15)
4. tWebRdrct01 0x8168fe6c 52 0.35 (4499.38) 11.29 (0.12)
5. tWebRdrct05 0x818980f0 52 0.35 (4502.93) 9.68 (0.10)
6. tWebListen 0x818c8ec8 50 0.16 (2030.15) 9.68 (0.10)
7. tWebRdrct04 0x80f4d44c 52 0.35 (4504.43) 6.45 (0.07)
--snip--
Task Total 4.84 (62866.50) 100.00 (1.03)
Idle 93.59 (1214736.25) 0.00 (0.00)
System 1.57 (20313.93) 0.00 (0.00)
CPU Utilization History for Last Minute (60 seconds ago --> now):
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,97,100,100,100,100,100,100,100,10
CPU Utilization History for Last Hour (60 minutes ago --> now):
27,2,7,20,23,37,23,35,28,23,100,100,100,100,100,60,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100
CPU Utilization History for Last Day (24 hours ago --> now):
100,100,100,100,100,100,100,100,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,100
CPU Utilization History for Last Month (30 days ago --> now):
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,100,100,100,100,0,0,100
admin@C0EAE4F69E> diag show process pass_to_stack
Process pass_to_stack (0x8fe061f0):
pass_to_st> 80142a10 8fe061f0 50 PEND 818c8ec8 8fe060f0 18 0
$0 = 0 t0 = 8c00 s0 = 50008ca1
at = fffffffffffffffe t1 = ffffffffffff00ff s1 = ffffffffffffffff
v0 = 0 t2 = 2000000 s2 = ffffffff85bbd0c8
v1 = 318d t3 = ffffffff8fe06230 s3 = 1
a0 = 50008ca1 t4 = 190 s4 = ffffffff839a0000
a1 = ffffffff87b090e0 t5 = 80 s5 = 2f836cb0
a2 = 0 t6 = d740 s6 = ffffffff82312978
a3 = 400 t7 = ffffffff9e385834 s7 = a
s8 = 1 k0 = 0 intctrl = 0
tlbhi = 20000000 gp = ffffffff840d25c0
k1 = 0 t8 = 1 ra = ffffffff817ba564
sp = ffffffff8fe060f0 t9 = ffffffff818bae50 divlo = cac083126e978d52
divhi = 1 sr = 50008ca1 pc = 818c8ec8
Stack trace of pass_to_stack:
0x8182075c -> ($13)
0x80142a54 -> 0x818ca548
0x818c8ec8
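The per-process table in the diag show cpu output above is easier to read when related tasks are added together. Below is a rough Python sketch that does this for the rows pasted in this thread. It is not a SonicWall tool: the row layout is taken from the paste, and grouping tasks by stripping trailing digits (so tWebRdrct01 to tWebRdrct05 count as one family) is an assumption made for illustration.

```python
import re
from collections import defaultdict

# Rows copied from the "CPU Utilization Per Process" table pasted above.
SAMPLE = """\
1. pass_to_stack 0x818c8ec8 50 0.48 (6173.15) 16.13 (0.17)
2. tWebRdrct03 0x8168e290 52 0.35 (4507.67) 14.52 (0.15)
3. tWebRdrct02 0x8168fcc8 52 0.35 (4505.63) 14.52 (0.15)
4. tWebRdrct01 0x8168fe6c 52 0.35 (4499.38) 11.29 (0.12)
5. tWebRdrct05 0x818980f0 52 0.35 (4502.93) 9.68 (0.10)
6. tWebListen 0x818c8ec8 50 0.16 (2030.15) 9.68 (0.10)
7. tWebRdrct04 0x80f4d44c 52 0.35 (4504.43) 6.45 (0.07)
"""

# Layout: index, name, PC, priority, Total% (secs), Curr% (secs)
ROW = re.compile(
    r"^\s*\d+\.\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+"
    r"(?P<total>[\d.]+)\s+\([\d.]+\)\s+(?P<curr>[\d.]+)\s+\([\d.]+\)\s*$"
)


def family(name):
    """Group numbered tasks together (assumption: trailing digits are an instance number)."""
    return name.rstrip("0123456789")


def curr_cpu_by_family(text):
    """Sum the Curr% column per task family, highest first."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # lines that are not table rows are skipped
            totals[family(match.group("name"))] += float(match.group("curr"))
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    for name, curr in curr_cpu_by_family(SAMPLE).items():
        print(f"{name:<16} {curr:6.2f}%")
```

On the rows above, the five tWebRdrct tasks together account for about 56% of the current CPU sample, ahead of pass_to_stack at about 16%.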
@Espen,
6.5.4.7-83n was released just a few days ago, so please make sure you are on that version. I see something similar reported on version 6.5.4.4. Also, I think the problem could be with the tWebRdrctxx tasks.
We would need the TSR and trace logs taken during the hang to analyze it further. It would be best to have a support case created for thorough analysis.
Thanks!
Shipra Sahu
Technical Support Advisor, Premier Services
I'm sorry, we aren't on the latest version. I was sure we were, but we are on 6.5.4.6-79n, and I see now that 6.5.4.7-83n is the latest. I will upgrade as soon as I can.
While googling for answers I found this page: https://www.sonicwall.com/support/knowledge-base/troubleshooting-firewall-reboots-due-to-twebmain-process-in-gen-6-devices/170504822896540/
I might try the tips in that as well as upgrading. Any last pointers? And thanks for your help so far.
@Espen,
No problem. In your case it doesn't look like the tWebMain process.
I hope those commands were run during the time of the issue. Anyway, it is best to be on 6.5.4.7 version. Please take all necessary backups before the firmware upgrade.
I hope this fixes your problem.
Thanks!
Shipra Sahu
Technical Support Advisor, Premier Services
Hi @Espen
I faced the same issue on a 4600 and took the steps below to resolve the CPU spike. It was related to the firewall logs.
1) Disabled logging for App Control globally, and enabled it only for the specific categories that really need monitoring.
2) Changed the logging level to "Notice".
After making these changes, please wait some time to see the effect.
I did try this and it didn't do much. We aren't using App Control anyway. The traffic did go down a bit later in the week, so we lived through it. Now it's the same again. The top CPU processes are:
CPU Utilization Per Process:
# Name PC PRI Total% (secs) Curr% (secs)
--- ----------------- ---------- --- ------------- -------------
1. tWebRdrct04 0x80f4d44c 52 0.52 (9539.60) 17.19 (0.18)
2. tWebRdrct01 0x8168fe7c 52 0.52 (9515.73) 17.19 (0.18)
3. tWebRdrct02 0x8168fe1c 52 0.52 (9535.87) 12.50 (0.13)
4. pass_to_stack 0x818c8ec8 50 0.56 (10192.43) 10.94 (0.12)
5. tWebRdrct05 0x8168fcf8 52 0.52 (9540.80) 10.94 (0.12)
6. tWebRdrct03 0x8168dfa8 52 0.52 (9538.80) 9.38 (0.10)
7. REAL_tDataPlane 0x818c8ec8 50 0.12 (2211.42) 9.38 (0.10)
This time I couldn't connect over SSH at first either; it timed out.
Any help?
It also seems to lose contact with our local DNS server when this happens. I tried pinging it over SSH while logged in to the Sonicwall. (I've edited out the real IP.)
admin@C0EAE4F69E> ping 10.xx.0.xx
Unable to resolve 10.xx.0.xx
admin@C0EAE4F69E> ping 10.xx.0.xx
Unable to resolve 10.xx.0.xx
admin@C0EAE4F69E> ping 10.xx.0.xx
10.xx.0.xx [10.xx.0.xx] : is alive
Ping time : 0 ms
admin@C0EAE4F69E> ping 10.xx.0.xx
Unable to resolve 10.xx.0.xx
admin@C0EAE4F69E> ping 10.xx.0.xx
Unable to resolve 10.xx.0.xx
I get a connection about 1 time in 8. How can I check whether this is part of the problem?
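One way to narrow this down is to run the same kind of repeated check from a workstation on the LAN while the firewall is struggling. If the workstation reaches the DNS server nearly every time but the firewall only manages about 1 in 8, the loss is more likely on the firewall side than on the server. Below is a minimal Python sketch of such a check. It assumes the DNS server accepts TCP connections on port 53 and takes the server address as a command-line argument; the function names are made up for this example.

```python
import socket
import sys
import time


def tcp_probe(host, port=53, timeout=1.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def success_rate(probe, attempts=20, delay=0.5):
    """Call probe() repeatedly and return the fraction of successful attempts."""
    ok = 0
    for i in range(attempts):
        if probe():
            ok += 1
        if i < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return ok / attempts


if __name__ == "__main__":
    # Usage: python probe.py <dns-server-ip>
    host = sys.argv[1]
    rate = success_rate(lambda: tcp_probe(host))
    print(f"{host}: {rate:.0%} of TCP connections to port 53 succeeded")
```

Running it once during a quiet period and once during a login rush gives two numbers to compare with the ping results seen on the firewall itself.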
Hi @Espen
In this case, please do the firmware update as suggested by @shiprasahu93 and let us know the result.
The problem is the same after the firmware update.
Hi @Espen
Please follow the KB below and try it.
Make sure your network doesn't have any switch loops.
I'm running into the same problem on the same device. Were you able to find a resolution?
Hi @KyleL ,
Please make sure your LDAP referrals are configured properly, and while the 4600 is slow during user logins, run the LDAP test connection at the same time.
NB: If you don't have multiple subdomains or multiple LDAP servers other than the primary, please disable the highlighted enabled field and try again.
Then check the LDAP connectivity and run the user authentication test.
Make sure your LDAP server supports LDAP version 3; LDAP version 2 has issues.
No solution yet, sonicwall support has been working on the case for 2 weeks now.
Hello @Espen and @KyleL. I'm sorry to hear about this inconvenience. Please, if you PM your case number to me I can work with our internal teams. Let me know.
Kind Regards,
@micah - SonicWall's Self-Service Sr. Manager