How big is the dataset you were looking at? The advantage of pandas (where it can load the data in memory) is that it's still quicker for exploration, as you don't have the distribution overhead.

The difficulty I've had with the new pandas API on Spark is that gathering the data in one place to generate a plot can take a long time, almost irrespective of the cluster size.
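One common way to soften this, sketched here as an illustration rather than something I've benchmarked, is to shrink the data to a small summary before it has to be brought together for plotting. The example below uses plain pandas and NumPy on synthetic data; the helper name `summarise_for_plot`, the column name `value` and the bin count are all made up for the example. On a cluster, the same reduction would have to be written with operations the pandas API on Spark supports, which I haven't checked here.

```python
import numpy as np
import pandas as pd


def summarise_for_plot(df: pd.DataFrame, column: str, bins: int = 20) -> pd.Series:
    """Bin one numeric column and return the count per bin.

    The result has `bins` rows no matter how many rows `df` has,
    so only a small object needs to reach the plotting code.
    """
    # Evenly spaced bin edges spanning the observed range of the column.
    edges = np.linspace(df[column].min(), df[column].max(), bins + 1)
    # include_lowest=True keeps the minimum value inside the first bin.
    binned = pd.cut(df[column], bins=edges, include_lowest=True)
    # sort=False keeps the bins in their natural left-to-right order.
    return binned.value_counts(sort=False)


# Synthetic data standing in for a large dataset (hypothetical).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"value": rng.normal(size=100_000)})

counts = summarise_for_plot(df, "value", bins=20)
```

The `counts` series can then be passed to whatever plotting call you prefer (for example a bar chart), and its size depends on the number of bins rather than the number of rows.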

Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years experience in the Aerospace, Defence and Rail Industry.

Ashraf Miah