It's worth remembering that the `pandas` API is now available on PySpark itself, as `pyspark.pandas` (Spark 3.2 and later):
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
For older Spark versions, the same API is available via the separate `koalas` library:
https://koalas.readthedocs.io/en/latest/
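As a rough illustration of what this means in practice, here is a minimal sketch. The helper names `mean_score_by_team` and `run_on_spark`, and the `team`/`score` columns, are made up for the example. The helper takes the DataFrame module as an argument, so the same pandas-style code can be run with plain `pandas` or with `pyspark.pandas`; it assumes only the common calls (`DataFrame`, `groupby`, `mean`, `sort_index`, `to_dict`) that both modules provide.

```python
def mean_score_by_team(pdlib, data):
    """Average `score` per `team` using any pandas-compatible module.

    `pdlib` is expected to be `pandas` or `pyspark.pandas` (hypothetical
    parameter name); `data` is a dict of column name -> list of values.
    """
    df = pdlib.DataFrame(data)
    means = df.groupby("team")["score"].mean().sort_index()
    # to_dict() collects the (small) result into a plain Python dict.
    return means.to_dict()


def run_on_spark(data):
    """Run the same helper on Spark; needs PySpark 3.2+ and a working Spark setup."""
    # Imported lazily so the helper above stays usable without Spark installed.
    import pyspark.pandas as ps
    return mean_score_by_team(ps, data)
```

Calling `mean_score_by_team(pd, data)` after `import pandas as pd` runs locally, while `run_on_spark(data)` runs the identical logic through Spark. With `koalas` the import would be `import databricks.koalas as ks` instead. Not every pandas method is covered, so it is worth checking the linked docs for anything less common.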
Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years' experience in the aerospace, defence and rail industries.