You mentioned it briefly, but this is an apples-and-pears comparison. You're taking a distributed computing framework and using it on a job that can run on a single node anyway.

In that regard, why use dbt and DuckDB at all? You could use any of the tools optimised for a single node, such as Pandas, Polars or Dask.

The real conclusion appears to be: use the right tool for the job, and don't use Spark where it isn't needed.

--

Ashraf Miah

Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years' experience in the aerospace, defence and rail industries.