Hello,
I'm having some big throughput issues. Recently we added a fifth controller to our team with 100 extra APs, for a total of ~760 APs and peaks of 4000 concurrent users. On weekends everything seems fine; we have an OpenWrt client taking measurements all the time to track throughput, latency and so on. The problem appears as soon as the number of users rises.
When about 1.5-2K simultaneous users are connected, the manager controller's CPU stays continuously at 90-100% usage until the number of users drops; then the CPU usage falls (it still has some spikes, though).
From what I gathered, the culprit seems to be a process called rrdsampler: it hogs the CPU and it is affecting the service. It is affecting the authentication process as well. I noticed that we have far more 802.1X timeouts than before, the throughput drops drastically, and ping loss and latency increase. This happens even on an AP without many users, where the total throughput on the AP's Ethernet port is very low.
No significant interference was detected. I went on site with a spectrum analyzer to check whether it could be an RF issue, but I didn't find any problems, just a nearby AP on a different channel, so no channel overlap there (the two are 5 channels apart).
I know that RRDtool is used for graphing and storing statistics; maybe the issue here is that it tries to collect too many statistics from each and every user. When there are few users it's OK, but when that number spikes it's just not working.
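To make the "5 channels apart means no overlap" reasoning explicit, here is a minimal sketch. It assumes the APs are on the 2.4 GHz band with 20 MHz wide channels (I haven't stated the band above, so treat that as an assumption), and the function names and channel numbers are only illustrative.

```python
def center_mhz(channel: int) -> int:
    """Center frequency in MHz of a 2.4 GHz Wi-Fi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("expected a 2.4 GHz channel between 1 and 13")
    # Channel 1 is centered at 2412 MHz and channels are spaced 5 MHz apart.
    return 2407 + 5 * channel


def channels_overlap(a: int, b: int, width_mhz: int = 20) -> bool:
    """True if the bands occupied by the two channels overlap.

    Assumes both channels have the same width (20 MHz by default).
    """
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz


# Illustrative channel numbers, not the ones measured on site.
print(channels_overlap(1, 6))  # 5 channels apart -> 25 MHz between centers
print(channels_overlap(1, 4))  # 3 channels apart -> 15 MHz between centers
```

With a 5-channel gap the centers are 25 MHz apart, which is more than the channel width, so the two APs should not be overlapping under these assumptions.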
We are running 6.6.2.0 and we have 3 VSCs: 2 of them are tunneled through the controllers, but the third one is not tunneled (the AP sends its traffic directly to a VLAN tagged on its port). We are not using the team for access control, just for authentication through an external RADIUS server.
Our configuration is like this:
- We have the lower data rates disabled (only 11 Mbps or higher are allowed) to ensure a good connection for each user.
- RRM enabled with auto-channel, auto-power and AP load balancing.
- Tx protection -> RTS/CTS with an RTS threshold of 1024 to mitigate the hidden node problem (we took measurements to see if this affected the overall throughput and it didn't seem to affect it that much).
I already opened a case with support, but I would like to know if anyone else is experiencing the same issues I'm having, mainly the rrdsampler process issue. If you want to check whether the process is hogging the CPU, SSH into the controller/AP and type top.
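If you would rather script that check than watch top by hand, here is a minimal sketch in Python. It assumes you have saved one snapshot of the top output to a string (for example from batch mode, if the top on your controller supports it) and that the snapshot has a header line containing a %CPU column with the command as the last column. The function name cpu_by_command is hypothetical, and the column layout on the controller may differ, so adjust as needed.

```python
def cpu_by_command(top_output: str, name: str) -> float:
    """Sum the %CPU values of all rows whose command contains `name`.

    Assumes a header line with a '%CPU' column and whitespace-separated
    rows where the command comes after the %CPU column.
    """
    lines = top_output.splitlines()
    header_idx = next(
        (i for i, line in enumerate(lines) if "%CPU" in line.split()), None
    )
    if header_idx is None:
        raise ValueError("no header line with a %CPU column found")
    cpu_col = lines[header_idx].split().index("%CPU")

    total = 0.0
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if len(fields) <= cpu_col:
            continue  # blank or truncated row
        if name not in " ".join(fields[cpu_col + 1:]):
            continue  # some other process
        try:
            # Some top variants print the value with a trailing '%'.
            total += float(fields[cpu_col].rstrip("%"))
        except ValueError:
            continue  # row does not line up with the header; skip it
    return total
```

Calling cpu_by_command(snapshot, "rrdsampler") on snapshots taken at different user counts would give numbers to compare instead of eyeballing the screen.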
Thanks!
Aarón