No more SQL: using ibis as a machine learning researcher
TL;DR:
- I am afraid of SQL
-
pandas
was great until it wasn’t- it doesn’t do so well with large scale data
- duckdb gives you a powerful backend for dealing with larger scale data, but you need to know SQL
-
ibis
lets you use pythonic, pandas-like syntax while letting you use duckdb as a backend- in total they have 20+ backends (also including polars and other SQL backends)
- using
ibis
lets you write and execute complicated SQL queries in a pythonic way
ML researchers have low self-esteem because they can’t write SQL
The machine learning researcher pipeline goes: you get really good grades in university, go to grad school, work on common benchmark datasets with a huge focus improving the state-of-the-art by developing new algorithms.
If you’re really bold, you collect and annotate your own dataset.
Then you go to a company and you feel bad.
“What? I have to access this table where? How do I query it? How do I transform it to get it into the format I want? How can I explore this? It doesn’t fit into memory as a lovely DataFrame
that I can manipulate how I want to.”
You’re stuck. You used to be a somebody.
“If only… if only someone smarter than me wrote a tool that has a simple pythonic, pandas-like API that lets me do all the cool stuff the data engineers do,” you think to yourself.
You look up. The data engineers are doing kickflips outside.
Why isn’t pandas
sufficient?
pandas
is the de-facto library that researchers have come to love and use. With its trove of useful functions and simple python syntax, it’s great when you’re not dealing with large scale data.
As Wes McKinney, the creator of pandas says in “10 Things I Hate About Pandas” (link), his rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. When you’re dealing with terabytes of data, this is impossible to achieve.
In summary:
- ease of use ✅
- processing large scale data ❌
I heard a quack: enter duckdb
duckdb on the other hand is an open source database system. You can query and transform your data anywhere, and as the website says, you can process larger-than-memory workloads! It’s a much better alternative to pandas in terms of dealing with large scale data, but you need to know SQL. For example, to get the average bill length and maximum flipper length for each group of birds (by species and island), the SQL query looks like the following:
# get average bill length and maximum flipper length
import duckdb
duckdb.sql("""
SELECT
species,
island,
AVG(bill_length_mm) AS "Mean(bill_length_mm)",
MAX(flipper_length_mm) AS "Max(flipper_length_mm)"
FROM penguins
GROUP BY species, island
""")
Writing SQL inside of a python string is ok, but maybe not the prettiest.
In summary:
- ease of use ❌ (if you don’t know SQL)
- processing large scale data ✅
The ibis
has landed
Thankfully, there’s a third alternative that combines ease of use with the ability to process large scale data.
ibis is a python package that acts as an interface between you and the query engine. ibis supports over 20 query engines including duckdb. On top of that they support polars, BiqQuery, ClickHouse, MySQL, PostgreSQL, and a variety of others!
The whole philosophy behind ibis is to create a common user interface for all the different query engines. You can change what query engine to use without changing any of your ibis code.
With ibis
, we can rewrite the example above in a much more pythonic way:
import ibis
(
penguins
.group_by([penguins.species, penguins.island])
.aggregate(
[
penguins.bill_length_mm.mean(),
penguins.flipper_length_mm.max()
]
)
)
I found the syntax much more friendly compared to duckdb. Since the backend uses duckdb, we also don’t lose out on any performance benefits by using ibis.
ibis thus meets the two things we were looking for:
- ease of use ✅
- processing large scale data ✅
Another benefit of ibis is that since it is query engine-agnostic, if one of the query engines develops a new feature that is way faster than the others, you don’t have to rewrite your ibis code! You can just change the backend.
You can pretend to be a cool data engineer now
Now that you know about ibis
you can do some of the things that the cool skateboarding data engineers can do. You won’t be able to do all the cool things they can do, but hey at least you won’t embarrass yourself any further.
Give ibis a try! I found the ibis
tutorials here very useful.
Enjoy Reading This Article?
Here are some more articles you might like to read next: