Going from nothing to a first model

I recently joined a new project. It’s an interesting problem that has a lot of upsides if we solve it successfully. However, as with all interesting problems, it comes with a lot of ambiguity.

Upon joining the project, I was told that our goal is to train a first model in a month. When I was told this, we had no data, no metrics, no objective function, and not a single line of code. The first month would be about figuring out the problem space while simultaneously laying down our project’s technical foundations. What an exciting opportunity!

I’m writing this after training our first set of models (🎉).

I’ll talk about some software design principles and research tools that helped me move fast and go from nothing to a first model.

Software design

These practices helped me on the codebase side of things.

1. Create skeletons first and then implement. The first pull request I worked on was to lay out the structure of our codebase. There was no functionality added at all at this stage. Just empty modules with __init__.py files containing docstrings about what the module is meant to contain.

This was a really important part of the process: it helped the team agree on a common structure before adding any new functionality. And this isn’t just something I did for the overall code structure. I also did it for individual modules, to make sure I understood at a high level how each module worked on its own as well as with the others.
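As a hypothetical illustration (the module names here are invented, not our actual layout), such a skeleton is nothing more than a directory structure plus docstrings:

my_project/
    data/
        __init__.py
    models/
        __init__.py
    training/
        __init__.py

where each __init__.py contains only a docstring stating intent, for example:

"""Data loading and preprocessing.

Will contain dataset classes and the utilities that turn raw data into
model-ready samples. Nothing is implemented yet; this docstring only
documents what belongs here.
"""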

Once you have structure, you can then focus on the details.

2. Don’t sweat the small stuff. This is probably controversial, but it’s an approach that I have really enjoyed. Our project lead is deeply technical, but he understands that not every change is worth reviewing. Small changes and optional features aren’t that important as long as the core modules function properly.

With that being said, how do we define “small” and how do we define “optional”?

Small changes are things like updating outdated docstrings, fixing typos, or small fixes to broken code. Major changes would be things that change design (for example, the interface of a module), or introduce new modules that we know we will use repeatedly.

I define optional modules as ones that somebody can use to make their life easier, but that the core code won’t break without. I think it’s really important to have this freedom for optional modules so we can figure out whether they’re actually useful before having a huge debate about design decisions. Worst case scenario: they’re not useful, and we just delete them, which is painless as long as they were designed in an appropriately decoupled manner.

3. That thing you’re about to create - someone else probably created it and it’s better. This is a huge cliché, but it needs to be emphasized: don’t reinvent the wheel. For the vast majority of machine learning applications, a combination of Hydra and PyTorch Lightning is more than enough. They’re open source, have wide community support, and probably already have the functionality you’re looking for.

So many times while building up our codebase I realized that in previous projects, I had reinvented features that already existed in these two libraries. So before you run off and create your own custom implementations of things, look in the documentation first!
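To make this concrete, here’s a minimal sketch of what a training entrypoint built on these two libraries can look like. The config layout and _target_ paths are made up for illustration, not our actual setup:

# train.py
import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Hydra builds the objects described in the config,
    # e.g. model._target_: my_project.models.Net
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)

    # Lightning owns the training loop, checkpointing, logging, etc.
    trainer = pl.Trainer(max_epochs=cfg.max_epochs)
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()

Between Hydra’s config management and Lightning’s Trainer, there is very little training-loop boilerplate left to write yourself.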

Research

These tools helped me iterate faster on the research side of things.

1. I used ibis for data preparation and analysis. I didn’t have a hand in pulling the raw data or in computing some of the more complicated features that we wanted to use. Once the raw data was ready, I needed to get it into a format suitable for my machine learning problem, where it’s easy to retrieve a given sample along with its labels.

When trying to massage the data into a suitable format for machine learning, there was simply no way I could do it with pandas. Instead, I used ibis, a Python library that provides a single API on top of 20+ powerful backends (such as Polars and DuckDB). It supports larger-than-memory queries and gives you a Pythonic way to build complicated SQL queries.

Its interactive mode is extremely powerful: you can very easily see a sample of your query’s outputs in a Jupyter notebook without writing anything to disk.

The interactive mode was also quite helpful when doing any kind of data analysis.
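Here’s a small sketch of what that looks like (the file and column names are invented for illustration):

import ibis

ibis.options.interactive = True  # render query previews as tables in the notebook

# Uses the default DuckDB backend; the data never has to fit in memory.
samples = ibis.read_parquet("samples.parquet")

# The query is built lazily; in interactive mode only a small preview
# of the result is computed and displayed.
(
    samples
    .filter(samples.split == "train")
    .group_by("label")
    .agg(n=samples.sample_id.count())
)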

Check out my other blog post for a flavor of ibis. I might write another one soon with all the features that I found useful, but this tutorial was also quite good.

2. I used hydra + submitit for sweeps. Way too often, I’ve seen researchers create custom scripts that launch other scripts just to run a hyperparameter sweep. Or they’ll manually write a bash file where each line invokes python train.py with different arguments for each combination of parameter values they want to try. I know because I used to be one of them.

To my past self and many current researchers I want to say: stop it!

Hydra supports submitit through a launcher plugin. Combined with Hydra’s multirun and extended sweep syntax, this lets you launch your training with different arguments from the usual training entrypoint! The syntax looks like the following:

python train.py --multirun hydra/launcher=submitit_slurm learning_rate=0.01,0.001 hidden_size=50,100,200

This launches six different jobs, one for each combination of the parameter values you specified (two learning rates times three hidden sizes).

One note: the Slurm logs get stored in a .submitit folder inside whatever multirun folder Hydra puts your outputs in.
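If you need to customize the Slurm resources, the launcher’s own parameters (such as partition and timeout_min) can be overridden from the command line in the same way. The values here are made up:

python train.py --multirun hydra/launcher=submitit_slurm hydra.launcher.partition=gpu hydra.launcher.timeout_min=120 learning_rate=0.01,0.001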

Things I want to do better / things that annoy me

1. I keep fooling myself with jupyter notebooks. Notebooks are too tempting because they’re so interactive and you get to see everything. I was doing a pretty good job of keeping visualizations separate from implementation: I would experiment at a small scale in Jupyter notebooks, then move code into the codebase, and only import from the codebase for real functionality. But during time crunches it all goes out the window and I’m calling some random function that I defined in cell 13143 from cell 3. Make it make sense!

So many dumb bugs come out of this, and it’s my own fault.
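One thing that does make the import-from-the-codebase workflow less painful is IPython’s autoreload extension, which picks up edits to the codebase without restarting the kernel:

%load_ext autoreload
%autoreload 2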

If anyone has ideas on what should and shouldn’t be allowed in notebooks, please let me know!

2. I can’t figure out if I’m stupid or poetry is stupid. Possibly the former, but why is poetry so complicated sometimes? Sometimes I’ll add a new dependency through poetry add, and that ends up removing a necessary dependency from poetry.lock? I’ll have to sit down and figure out what I’m doing wrong.

Conclusion

Moving fast is one thing, and doing things the right way is another. I think I struck a good balance between the two by following the principles and using the tools I laid out above. It’s always a process, so I’ll keep adding to (and removing from) this post as I find out what’s useful and what isn’t as useful as I thought.

For now, our team has our first model. And now we iterate!



