I used perf to profile Suricata and found that, in my case (a 10 Gbps link), the function FlowGetFlowFromHash in flow-hash.c consumes most of the CPU time. The cost comes from traversing the hash-bucket lists and from pthread lock/unlock calls. Why don't we use thread-local storage to avoid this and reduce lock contention?

As far as I can tell, when Suricata runs in the workers runmode with multiple threads processing captured packets, all of those threads share the same flow table.

My proposal: pin each worker thread to its own core, and bind the NIC's IRQs to those same cores so that each receive queue is serviced by one thread. Each thread would then manage its own flows in a thread-local (per-thread) linked list or hash table, and the NIC would be configured so that all packets belonging to the same flow are delivered to the same thread. Since no other thread touches a given table, no locks would be needed on the lookup path. What do you think? I also assume that distributing flows in hardware (in the NIC) is better than doing it in software; am I right?

Thanks.
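To make the proposal concrete, here is a minimal sketch of a lock-free per-thread flow table, under stated assumptions. All names (FlowKey, PerThreadFlowTable, flow_get, pick_worker) are hypothetical and are not Suricata's actual structures or API. The symmetric software hash in pick_worker only stands in for what the NIC would do with symmetric RSS (real NICs typically use a Toeplitz hash with a symmetric key); it shows that both directions of a connection map to the same worker, so that worker's table can be accessed without any locking.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 5-tuple key (IPv4 only, for brevity). */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
} FlowKey;

typedef struct FlowEntry {
    FlowKey key;
    uint64_t packets;
    struct FlowEntry *next; /* bucket chain */
} FlowEntry;

#define NBUCKETS 1024

/* One table per worker thread. Only its owning thread touches it,
 * so lookups and inserts need no mutex or spinlock. */
typedef struct {
    FlowEntry *buckets[NBUCKETS];
    size_t nflows;
} PerThreadFlowTable;

/* Symmetric hash: XOR-ing source and destination makes A->B and B->A
 * produce the same value, mimicking symmetric RSS in the NIC. */
static uint32_t flow_hash(const FlowKey *k) {
    uint32_t addr = k->src_ip ^ k->dst_ip;
    uint32_t port = (uint32_t)(k->src_port ^ k->dst_port);
    uint32_t h = (addr * 2654435761u) ^ ((port << 16) | k->proto);
    h ^= h >> 15;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

/* Which worker (i.e. which RX queue / core) receives this packet. */
static unsigned pick_worker(const FlowKey *k, unsigned nworkers) {
    assert(nworkers > 0);
    return flow_hash(k) % nworkers;
}

/* Match a flow in either direction. */
static int same_flow(const FlowKey *a, const FlowKey *b) {
    if (a->proto != b->proto)
        return 0;
    int fwd = a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
              a->src_port == b->src_port && a->dst_port == b->dst_port;
    int rev = a->src_ip == b->dst_ip && a->dst_ip == b->src_ip &&
              a->src_port == b->dst_port && a->dst_port == b->src_port;
    return fwd || rev;
}

/* Look up the flow in this thread's table, creating it if absent.
 * Returns NULL only if allocation fails. */
static FlowEntry *flow_get(PerThreadFlowTable *t, const FlowKey *k) {
    uint32_t b = flow_hash(k) % NBUCKETS;
    for (FlowEntry *e = t->buckets[b]; e != NULL; e = e->next) {
        if (same_flow(&e->key, k)) {
            e->packets++;
            return e;
        }
    }
    FlowEntry *e = calloc(1, sizeof *e);
    if (e == NULL)
        return NULL;
    e->key = *k;
    e->packets = 1;
    e->next = t->buckets[b];
    t->buckets[b] = e;
    t->nflows++;
    return e;
}

static void flow_table_free(PerThreadFlowTable *t) {
    for (size_t i = 0; i < NBUCKETS; i++) {
        FlowEntry *e = t->buckets[i];
        while (e != NULL) {
            FlowEntry *next = e->next;
            free(e);
            e = next;
        }
        t->buckets[i] = NULL;
    }
    t->nflows = 0;
}
```

The design choice is to move synchronization out of the fast path: instead of every thread locking a shared bucket, the hardware (or, in this sketch, pick_worker) guarantees ownership up front. Flow timeout and eviction would then also have to run per thread, which this sketch does not cover.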

