Shrink the htt tx_lock critical section in tx completion

- The motivation is to reduce contention on tx_lock in the downlink path.
- The original tx completion holds tx_lock while it processes MSDUs one by one. This blocks htt_tx from filling htt->pending_tx even when free space is available.
- This CL narrows the tx_lock region in tx completion so that only the MSDU-id-related operations run under tx_lock.
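
The locking pattern described above can be sketched in userspace C. This is only an illustration, not the ath10k code: struct tx_state, struct msdu, finish_msdu() and tx_complete() are hypothetical names, and a pthread mutex stands in for the kernel spinlock behind tx_lock. The point it shows is that the lock covers only the id-table lookup and release, while the per-frame work runs unlocked.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PENDING 8                   /* hypothetical id-table size */

struct msdu { int payload; };           /* stand-in for a completed frame */

struct tx_state {
    pthread_mutex_t tx_lock;            /* stand-in for htt->tx_lock */
    struct msdu *pending[MAX_PENDING];  /* MSDU id -> in-flight frame */
    int num_pending;                    /* protected by tx_lock */
    int completed_sum;                  /* touched only by the completion path */
};

/* Per-frame work (unmap, status report, free in a real driver).
 * Runs without tx_lock, so the submit path is not blocked by it. */
static void finish_msdu(struct tx_state *s, struct msdu *m)
{
    s->completed_sum += m->payload;
}

/* Complete a batch of MSDU ids. tx_lock is held only while the id slot
 * is looked up and released; returns the number of frames completed. */
static int tx_complete(struct tx_state *s, const int *ids, size_t n)
{
    int done = 0;

    for (size_t i = 0; i < n; i++) {
        struct msdu *m;

        assert(ids[i] >= 0 && ids[i] < MAX_PENDING);

        pthread_mutex_lock(&s->tx_lock);
        m = s->pending[ids[i]];
        s->pending[ids[i]] = NULL;      /* free the id for reuse */
        if (m)
            s->num_pending--;
        pthread_mutex_unlock(&s->tx_lock);

        if (m) {                        /* unknown ids are skipped */
            finish_msdu(s, m);
            done++;
        }
    }
    return done;
}
```

Taking and dropping the lock once per MSDU costs more lock operations than holding it across the whole batch, but each hold is short, so a sender waiting on tx_lock gets in between frames.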

Performance comparisons:
- Setup: Desktop -> 1G Ethernet -> AP -> 802.11ac -> MacBook
- 802.11ac channel 52 with DFS, TCP traffic

- Performance reported by iperf -c ... -i 1 -P 3 -t 10 (best of 3 samples)

            ath10k (unchanged)   ath10k (with this CL)   LSDK
  Downlink  620 Mbps             719 Mbps                884 Mbps
  Uplink    380 Mbps             400 Mbps                750 Mbps

- Performance reported by www.speedtest.net on the MacBook (best of 3 samples)

            ath10k (unchanged)   ath10k (with this CL)   LSDK
  Downlink  570 Mbps             708 Mbps                300 Mbps
  Uplink    325 Mbps             322 Mbps                428 Mbps
(not sure why LSDK has such poor performance in my setup)

Change-Id: I1ca723f77594b8e71729c604d7a20f84aa6fbb7e
4 files changed