Solving Uncertainty

engineering-software

A story about constraint programming in production

I work at an EdTech company serving medical students. A few years ago, we set out to solve a problem: Students in their dedicated period (4-8 weeks of full-time study for a very important exam) had trouble planning their study. They needed to schedule hours of material each day while accounting for other schoolwork and life.

So could we do better than the existing options of buying a plan or making a big excel sheet and modifying it evertime life got in the way? Maybe something customizable?

Product envisioned a tool where students could:

Choose which resources they wanted to study (textbook chapters, video lectures, flashcard decks, sets of questions)
Choose how many hours they wanted to study each day of the week (6 hours on Mondays, 8 on Tuesdays)
Generate new plans as they changed resources, days-off, and study-time per day

To solve this, our content experts would:

Determine how long each task should take (90 mins to read this chapter, 30 mins to watch this lecture, etc.)
Label each resource with metadata
Order the metadata, so that resources with metadata A should be studied before metadata B

So our engineering department could build a tool that:

Scheduled hundreds of tasks across weeks
- In the order specified by the resource metadata
- Each day’s work not exceeding a time limit
Split and distributed larger assignments into smaller ones
Optimized for diversity of task types per day (a day of reading, watching, and flashcards is better than a day of just reading)

So engineering leadership said “Let’s use a solver! We can write less code and easily handle future requirements, nobody else in our industry has this. Go make this work with Google’s OR-Tools”

And then they gave it to junior dev me :0

Learning about solvers

So I needed to learn about solvers.

Solvers are a completely different way to tackle problems.

In normal (imperative) programming we write a process to manipulate data to result in an answer.

With solvers we

define variables and the range of values those variables can hold
write constraints relating the variables (a must be less than b)
run a “solver” that determines if it’s possible to meet the constraints

A techinal aside:
This is actually called constraint programming.

It’s an approach related to combinatorial optimization (finding an optimal solution from a intractibly large set of possible solutions).

But constraint programming is usually about determining if any feasible solution exists rather than finding the optimal one.

And the algos that do the solving are pretty cool.

There are other solvers and approaches, but this post is about contraint solvers.

Disclaimer (I’m not trained in this, if I get something wrong, pls lmk at twitter)

Example: Modeling an easier version of the problem

I worked to understand any example problems I could find online, then set out to break the big problem into smaller ones.

So let’s try an easier version of the problem: place assignments on days respecting the time limits of the days.

The challenge here is adequately modeling the problem domain and constraints.

Solution:
day 0 has assignments: [{'id': 4}]
day 1 has assignments: []
day 2 has assignments: [{'id': 0}, {'id': 1}]
day 3 has assignments: [{'id': 2}]
# Try it yourself tweaking the number of assignments/days or seconds!

Back to the big problem —after experimenting with the library, reading everything I could find on similar problems, and many attempts to break the problem down into smaller ones and model those, I found a solution.

So I got to deploy our first python microservice. To use it formulate a scheduling problem, POST it to this service, and recieve a response with the solved studyplan.

But this revealed a new problem —slow solving times.

For each additional day or assignment complexity grew more than linearly. The solving service would be too slow for users asking for long schedules. 80th percentile response time was horrible.

I attempted to optimize the model, and made some progress by splitting the problem into two steps. Instead of scheduling across days 1->2->3->4->5->6->7->8->9, what if first the solver scheduled first across chunks of days 123->456->789 and then got higher resolution 1->2->3 + 4->5->6 + 7->8->9?

This showed promising speedups, not enough for production, but enough to try a different approach.

Adventures in giving AWS money to go faster…

Our usual deployment target was AWS Lambda. A lambda with maxed out memory was not enough.
Next was AWS Fargate (read: managed AWS EC2)
Next was EC2 —I tried an m?.metal instance for fun (hundreds of vCPUs and GBs of memory :)
But still met unacceptable 95th percentile response times, so we went back to the drawing board.

Adventures in, erm, switching to topological sorting

My Engineering Manager and I each prototyped a topological ordering based solution. I wrote a production version which passed the unit tests from our solver implementation (nice to not throw those away) and quickly turned into what we use in production, on a pretty low memory AWS Lambda. :)

This project taught me the value of derisking early. Red flags should go up when your next task looks like an NP-hard lovechild of a bin packer and traveling salesman. I was so obsessed asking “is this possible?” that I neglected “is this possible in a way that’s good for users?” and “is there a simpler way?” Prototype models should have been benchmarked on the upper end of expected problem complexity, my EM needed to know how hard it would to train new hires in OR-Tools, and we should have started without being so set on a specific technology… lessons learned the hard way, but an education in solvers and identifying and addressing uncertainty I won’t forget.

But it was also neat. It was fun and challenging to wrap my brain around this type of programming. Solvers are pretty cool.