As I mentioned in my previous post, we are doing load testing at work,
specifically on our authentication system. While looking at the graphs we
noticed that a huge amount of time was being spent on bcrypt password
hashing operations. (TODO: I will have to write a post on how we implemented
telemetry in our applications.)
From the data it almost seemed like bcrypt was causing the login portion of the
authentication service to start eating itself under load, to the point where the
number of requests, and the goroutines spun up to serve them, had blown right
past the point of diminishing returns. The entire application slowed down until
responses were taking several seconds to complete! Yikes.
It would seem that Go's concurrency was too much for its own good!
To be fair, it is more likely that our naive hash checking could have been
better optimized; as it turns out, authentication systems do a lot of password
hash checking.
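For context, each of those checks is just a bcrypt comparison of a submitted password against a stored hash. Here is a minimal sketch of that single operation; the password, and generating the hash inline instead of loading it from a user store, are purely for illustration:

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// In a real system the hash comes from the user store; here we
	// generate one at the default cost just to have something to check.
	hash, err := bcrypt.GenerateFromPassword([]byte("hunter2"), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}

	// Every login attempt pays for one of these comparisons, and each one
	// is deliberately expensive; that is the whole point of bcrypt.
	if err := bcrypt.CompareHashAndPassword(hash, []byte("hunter2")); err != nil {
		fmt.Println("login rejected")
		return
	}
	fmt.Println("login ok")
}
```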
To reproduce this scenario I created this gist to mimic what we were
seeing and start getting numbers so we can turn the knobs.
I started with some very simple benchmark tests to get a feel for what the
performance was, and what it could be, in general:
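Something along these lines; the benchmark names and the throwaway test password are my own, but the shape is a serial base case plus a goroutine-per-check version:

```go
package bcryptbench

import (
	"sync"
	"testing"

	"golang.org/x/crypto/bcrypt"
)

var hash, _ = bcrypt.GenerateFromPassword([]byte("password"), bcrypt.DefaultCost)

// BenchmarkSerial checks one password per iteration on a single goroutine.
func BenchmarkSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		bcrypt.CompareHashAndPassword(hash, []byte("password"))
	}
}

// BenchmarkGoroutines spins up a goroutine per check, which is roughly
// what a naive goroutine-per-login-request server ends up doing.
func BenchmarkGoroutines(b *testing.B) {
	var wg sync.WaitGroup
	wg.Add(b.N)
	for i := 0; i < b.N; i++ {
		go func() {
			defer wg.Done()
			bcrypt.CompareHashAndPassword(hash, []byte("password"))
		}()
	}
	wg.Wait()
}
```

Both run with the usual go test -bench=. invocation.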
As seen above, I benchmark a serial implementation as the base case, and
then benchmark using as many goroutines as I can. Below are the results of
those two runs:
So far so good; my machine is a Core 2 Duo (don't make fun). We got what we
expected: double the performance of the serial version.
I then started making a sample application, as the runs in the Go benchmarks
were possibly too short to make the test realistic. Plus, there were not many
knobs to turn.
Below is the sample application to test:
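A sketch of that kind of harness is below; the flag names, the 5-second run length, and the throwaway test password are my own choices, but it covers the same knobs:

```go
package main

import (
	"flag"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"golang.org/x/crypto/bcrypt"
)

var (
	concurrency = flag.Int("c", 10, "number of concurrent login requests")
	fanIn       = flag.Bool("fanin", false, "feed a worker pool instead of max_out goroutines")
	workers     = flag.Int("workers", 2, "size of the fan-in worker pool")
	buffer      = flag.Int("buffer", 10, "size of the fan-in channel buffer")
	duration    = flag.Duration("d", 5*time.Second, "how long to run")
)

var hash, _ = bcrypt.GenerateFromPassword([]byte("password"), bcrypt.DefaultCost)

// check performs one bcrypt comparison and bumps the shared counter.
func check(count *int64) {
	bcrypt.CompareHashAndPassword(hash, []byte("password"))
	atomic.AddInt64(count, 1)
}

func main() {
	flag.Parse()

	var count int64
	done := make(chan struct{})
	time.AfterFunc(*duration, func() { close(done) })

	jobs := make(chan struct{}, *buffer)

	// Fan-in mode: a fixed pool of workers drains the buffered channel.
	var poolWG sync.WaitGroup
	if *fanIn {
		for w := 0; w < *workers; w++ {
			poolWG.Add(1)
			go func() {
				defer poolWG.Done()
				for range jobs {
					check(&count)
				}
			}()
		}
	}

	// Each "concurrent request" loops until the timer fires, either doing
	// the bcrypt itself (max_out) or handing it off to the worker pool.
	var reqWG sync.WaitGroup
	for i := 0; i < *concurrency; i++ {
		reqWG.Add(1)
		go func() {
			defer reqWG.Done()
			for {
				select {
				case <-done:
					return
				default:
				}
				if *fanIn {
					jobs <- struct{}{}
				} else {
					check(&count)
				}
			}
		}()
	}

	reqWG.Wait()
	close(jobs)
	poolWG.Wait()

	fmt.Printf("completed %d bcrypt checks in %s\n", count, *duration)
}
```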
Now this gives us more options: I can set the concurrency level, choose between
pure max_out goroutines or a buffered channel feeding a fan-in worker pool, and
set the size of the worker pool as well as the size of the fan-in buffer, all
in 71 LOC.
Here we go, let's do some test runs:
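With the sketch above, the first pair of runs would look something like this (the concurrency value is illustrative, and the flags are from my sketch rather than the original gist):

```sh
# max_out: every request hashes in its own goroutine
go run main.go -c 10

# fan-in: same concurrency, but a single worker drains the channel
go run main.go -c 10 -fanin -workers 1 -buffer 10
```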
It is fairly obvious that with one worker we get less work done than with
goroutines that can run on either of my 2 CPUs. Let's up the workers next.
Above you can see the fan-in with a number of workers in the pool, and not too
surprisingly we start losing performance as we go above the number of CPUs
available. Okay, this is great; let's see what happens when we crank up
concurrency on these two mechanisms (max_out goroutines versus fan-in worker
pool):
Wow. Just running bcrypt in 90 concurrent goroutines quickly swamps our machine
and halves the number of results we were getting with a concurrency of 30.
Meanwhile, our fan-in buffered channel approach is able to sustain its near-max
of 69 bcrypts in 5 seconds at the exact same concurrency, without slowing down
its processing.
I guess this behavior makes sense: if you keep the CPUs worrying about one
bcrypt operation at a time, instead of making them flip between stacks all the
time, you get better results.