Welcome to this talk: Moving Millions of Dollars Daily with Ruby While Still Able to Sleep at Night.
Before we start, I have a secret to share.
In my programming career,
I have not only written code
but also bugs
lots of bugs.
more than three bugs
more than six bugs
probably as many bugs as this slide can fit
I tried very hard to write bug-free code.
But sometimes there were still bugs.
When my bug was found
I would be like, "Ouch! That's not fun."
Bugs might or might not have customer impact. Most of the time, I would be able to fix them quickly and life would be fine again.
But then something changed ...
I joined the payments team,
the team that moves money.
Lots of money.
"How much money?" you might ask.
Well, at Gusto, we have over 60,000 customers.
And we move tens of billions of dollars annually.
On average, that's hundreds of millions of dollars each day.
Including me, we have four payments engineers,
which means on average each engineer moves tens of millions of dollars daily.
That's a lot of money. Deploying code now literally means moving money.
Then I remembered I had written lots of bugs in the past.
Ouch! Now the idea of potentially introducing bugs into the system becomes way scarier.
"What should I do?" I asked my manager.
"Should I quit? I don't want to cost the business millions of dollars."
"Don't quit just yet." my manager said.
"Read this book first."
So I opened the book.
And it said: "知己知彼，百战不殆。"from《孙子兵法》,
which in English means: "Know yourself and know your enemy, and you will never be defeated." from The Art of War, an ancient Chinese military treatise.
"It's not about writing bug-free code." my manager explained,
"It's about knowing that there will be bugs."
"Wow..." that was my moment of enlightenment.
Before we continue, let me tell you a bit about myself.
My name is Sihui.
I work at Gusto, a startup in San Francisco. We are building an all-in-one platform for payroll, HR, and benefits.
I first started as a full-stack engineer on the HR team building customer-facing features.
I now work in the payments engineering team building systems that move money.
I also blog about Ruby, System Design, and things I learn from work.
This talk is about how to build a mission-critical system
… that’s trustworthy
… knowing that bugs are inevitable
What I’m about to share falls into three categories: general engineering best practices, using chatbots for monitoring and alerting, and best practices around Sidekiq, a Ruby background job framework.
These are the lessons we learned as a team over the years, and we documented them in internal wikis. Credit goes to all engineers at Gusto, and mistakes are mine.
In case you wonder, we are hiring. (see: gusto.com/careers)
I figure the best way to convey most of the lessons is by going through a worst-case scenario.
Remember our premise is that there will be bugs.
So we should expect bugs.
And be prepared for them.
When it comes to handling money,
what’s the worst that can happen?
Is it losing money?
It’s losing money,
without knowing it.
Imagine one day your company receives a call from the bank telling you there’s no more money left in the company’s bank account.
That might be game over for the company. Losing money without knowing it will also cost customer trust. Without customer trust, there can be hardly any business.
How can we prevent this from happening?
That’s why we need to use chatbots to monitor the system and alert us about what’s going on.
When a maintenance job runs, a chatbot will alert a Slack channel with details of the money being moved.
We also have alerts at the company level. If a company we serve has an outstanding balance, a chatbot will alert the operations channel so the operations team can take a look and make sure things are under control.
Now with these alerts in place, what happens when an alarm goes off and tells you there’s a fire going on?
For example, if the payroll maintenance fails, the engineer that’s on-call will be notified directly.
What’s the worst that can happen in this case?
Well, let’s ask ourselves a question. Is it better to have a small fire like this?
Or a big fire that burns the whole house down like this?
On the left side is a big job that moves millions of dollars all at once. On the right side is a bunch of small jobs, each of which moves thousands of dollars. Failing the big job on the left is way scarier than failing a couple of small jobs on the right.
That’s what I mean by a big job. In this Sidekiq job, we first get all companies from the database and load all of them into memory. Then we iterate through them and move money for each of them. If the job fails in the middle, it’s very hard to tell which companies’ money has been moved and which hasn’t. Debugging what went wrong is also challenging. And creating all these company objects and bringing them into memory can be an expensive operation.
In this improved code, we first get all the company ids. Then we use Sidekiq’s push_bulk to spin off an independent job for each company id. Instead of one huge job that moves money for all companies, we now have many small jobs, each of which moves money for a single company. In this case, if for some reason the server goes down, some jobs might have succeeded and some might have failed, but it would be clear which company’s money has been moved and which hasn’t. Recovering the failed jobs is also easier.
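The fan-out pattern above can be sketched in a few lines. The class and data names here (MoveMoneyForCompanyJob, BANK_INFO) are hypothetical, and the queue is simulated in memory so the sketch is self-contained; with the real gem, the enqueue step would use Sidekiq::Client.push_bulk as shown in the comment.

```ruby
# With the real Sidekiq gem, the fan-out would look like:
#   company_ids = Company.pluck(:id)
#   Sidekiq::Client.push_bulk(
#     'class' => MoveMoneyForCompanyJob,
#     'args'  => company_ids.map { |id| [id] }
#   )
class MoveMoneyForCompanyJob
  # Simulated datastore: company id => bank info (nil means bad data).
  BANK_INFO = { 1 => "routing-1", 2 => nil, 3 => "routing-3" }

  def perform(company_id)
    info = BANK_INFO.fetch(company_id)
    raise "missing bank info for company #{company_id}" if info.nil?
    "moved money for company #{company_id}"
  end
end

# One small, independent job per company instead of one giant job.
company_ids = [1, 2, 3]
results = company_ids.map do |id|
  begin
    [id, MoveMoneyForCompanyJob.new.perform(id)]
  rescue StandardError
    [id, :failed]
  end
end

succeeded = results.select { |_, r| r != :failed }.map(&:first)
failed    = results.select { |_, r| r == :failed }.map(&:first)
# Each company's outcome is independent, so it is clear exactly whose
# money moved (companies 1 and 3) and whose did not (company 2).
```

Because each job carries a single company id, a crash mid-batch leaves an unambiguous record of which companies still need processing.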
So here comes our first Sidekiq best practice: never iterate and keep each job small. Always try to break big jobs down into small ones.
Now instead of having a big job failing, we have a few small jobs that failed. It’s time to debug and figure out what went wrong.
We need to submit transaction files to banks before 7 pm on each business day, which means we don’t have the whole day to debug. We need to do it fast.
So how can we make debugging easy and fast?
Here comes our first general best practice: be defensive and raise errors as soon as something unexpected happens.
In the payments team, we use the Contracts gem to specify the types of a method’s arguments and return value. If a passed-in argument or the method’s return value violates the contract, the gem will throw an error immediately. For things that cannot be caught by the gem, we check and raise errors explicitly. It’s better to raise an error as soon as something unexpected happens than to let it slip by and surface in a method several levels away from the root cause.
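Here is a minimal sketch of the “raise early” idea using plain Ruby guard clauses (the Contracts gem expresses the same checks declaratively; the MoneyMover class and its method are hypothetical):

```ruby
class MoneyMover
  def move(amount_cents, company_id)
    # Reject unexpected input at the boundary instead of letting it
    # propagate and blow up several methods away from the root cause.
    raise ArgumentError, "amount must be a positive Integer" unless
      amount_cents.is_a?(Integer) && amount_cents > 0
    raise ArgumentError, "company_id must be an Integer" unless
      company_id.is_a?(Integer)

    "moving #{amount_cents} cents for company #{company_id}"
  end
end
```

A call like `MoneyMover.new.move(-5, 1)` fails immediately with a clear message, instead of quietly moving a negative amount and corrupting downstream records.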
The world outside of your classes or methods is like the wild west. It’s dangerous out there and there’s no guarantee that others will use your classes and methods properly.
But we need to treat the inside of our classes or methods as an exclusive club. We need to prevent any unexpected data from passing through. Protecting data integrity is critical because correcting polluted data tends to be very difficult and expensive.
We also try to keep our code clean and embrace the single responsibility principle. A class should have one job and one job only. Simple code is easier to read and debug.
Thanks to these two practices, we were able to locate the bug quickly. It turned out that some companies’ bank information was incorrect. So we corrected it.
And now it’s time to retry the failed jobs.
Here’s where another Sidekiq best practice comes in: never save state to Sidekiq.
If we pass the company object into a Sidekiq job, the object will be serialized and stored in Redis. In that case, even if we now have the correct data in the database, rerunning the job will still use the cached, outdated data. There are two other downsides to saving state in Sidekiq. First, Redis is not a reliable data store: it might go down at any point in time, and all the data stored there would be lost. Second, serializing objects can be a performance drain.
So instead of passing the object into the job, we pass in an identifier for a company.
We also make sure our jobs are idempotent and transactional. Idempotent means running a job once and running it a thousand times should have the same result. Transactional means a job either succeeds or fails, with no in-between states. This gives us the confidence to rerun a job as many times as we want.
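The two ideas combine naturally: pass a stable identifier instead of a serialized object, reload fresh data inside the job, and make the write a no-op if it already happened. A minimal sketch, with all names (TransferLedger, record) hypothetical and an in-memory hash standing in for the database table:

```ruby
class TransferLedger
  def initialize
    @transfers = {} # (company_id, date) => amount; stands in for the DB table
  end

  # Idempotent write: if this transfer was already recorded,
  # recording it again changes nothing.
  def record(company_id, date, amount_cents)
    key = [company_id, date]
    return @transfers[key] if @transfers.key?(key)
    @transfers[key] = amount_cents
  end

  def total_cents
    @transfers.values.sum
  end
end

ledger = TransferLedger.new
# Rerunning the same job once or a thousand times has the same result,
# so retrying a failed batch is safe.
3.times { ledger.record(42, "2019-04-26", 100_00) }
```

In a real system the uniqueness check would typically be a database constraint inside a transaction, so a concurrent retry cannot slip a duplicate through.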
After correcting the data and rerunning the failed jobs, we received an alert telling us the payroll maintenance had succeeded. You might have noticed that not all alerts are the same. In fact, depending on the level of urgency, alerts go to different channels and behave differently.
For example, if it’s a critical error, the engineer on call will be notified directly so timely actions can be taken.
If the issue is not urgent but still worth attention, the system will create an investigation ticket and alert the operations channel so the operations team can take a look.
It’s important to have different levels of alerts based on levels of urgency, so issues that require immediate attention won’t get buried in noise.
These are all the practices we went through together. There are three more I want to share.
The first one is the one-query-per-job rule. Most Sidekiq jobs should have at most one read query and one write query. This is along the same lines as keeping each job small. Because the database is very good at parallelizing work, we should take advantage of that by splitting separate database operations into separate jobs.
Another one is: don’t hammer the database. If there’s a batch of long-running jobs, instead of starting all of them at the same time and trying to finish them as soon as possible, we spread them out over a period of time. For example, instead of trying to finish 10,000 heavy jobs all at once, we might spread them out over an hour. This avoids long-running jobs exhausting resources and, say, overloading and bringing down the database.
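One simple way to spread a batch is to give each job a staggered delay. A sketch using the numbers from the example above (the HeavyJob class in the comment is hypothetical; with Sidekiq, the scheduling call would be perform_in):

```ruby
# Spread 10,000 heavy jobs evenly over an hour instead of
# enqueueing all of them at once.
total_jobs     = 10_000
window_seconds = 3600

delays = (0...total_jobs).map do |i|
  # Evenly stagger start times across the window.
  (i * window_seconds.to_f / total_jobs).round
end

# With the real gem (hypothetical job class):
#   company_ids.each_with_index do |id, i|
#     HeavyJob.perform_in(delays[i], id)
#   end
```

The first job starts immediately, the last near the end of the hour, and the database sees a steady trickle of roughly three jobs per second rather than a ten-thousand-job spike.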
A payment system is essentially an accounting system. It needs to be able to tell us the balances on the books at any given point in time. As a result, we embrace immutability and try not to update or delete any records. If a record is incorrect, instead of updating the row in the database directly, we might create a new record with the correct data and a newer effective date. If a record shouldn’t be there at all, instead of deleting it from the database directly, we might create a reversed version of that record to cancel its effects out. This way, the fact that inaccurate data was in the system at that point in time is captured.
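The append-only idea can be sketched as a tiny in-memory ledger (Entry, Ledger, and their fields are hypothetical names): records are only ever appended, and a bad entry is cancelled by appending its negation rather than deleting it.

```ruby
Entry = Struct.new(:company_id, :amount_cents, :effective_date)

class Ledger
  def initialize
    @entries = []
  end

  # Records are only ever appended, never updated or deleted.
  def append(entry)
    @entries << entry
    entry
  end

  # To undo an entry, append its negation instead of deleting it.
  def reverse(entry, effective_date)
    append(Entry.new(entry.company_id, -entry.amount_cents, effective_date))
  end

  def balance(company_id)
    @entries.select { |e| e.company_id == company_id }
            .sum(&:amount_cents)
  end
end

ledger = Ledger.new
bad = ledger.append(Entry.new(7, 500_00, "2019-04-01")) # entered by mistake
ledger.reverse(bad, "2019-04-02")                       # cancels it out
# The balance is back to zero, yet the history still shows that the
# incorrect entry existed on 2019-04-01.
```

Because nothing is destroyed, the books can be replayed to answer “what did we believe the balance was on any given date?”, which is exactly what an audit needs.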
These are all the practices we went through.
Some of them might not be applicable to the system you are currently building. But a key point I’m trying to make is to expect the unexpected. We are all fallible humans. No matter how hard we try, every now and then, we will make mistakes. Instead of betting on our code being perfect 100% of the time, we should expect the unexpected and plan around that.
A reliable system is not a system without bugs.
A reliable system is a system able to get its job done in spite of bugs.
Slides and notes can be found at sihui.io/millions-ruby. You can reach me via email or Twitter.
If you enjoyed this talk, you might also like another talk I gave recently on code readability: Ouch! That code hurts my brain.