.. _spark:

Spark
=====
Spark is probably the most popular tool in the Hadoop ecosystem nowadays. Apart from the performance improvements it offers over standard MapReduce jobs, it considerably simplifies application development because it offers high-level programming APIs such as the DataFrames API.

The default version included in the platform is Spark 2.4.0, and you can easily start an interactive shell::

    spark-shell

Similarly, to start an interactive session using Python::

    pyspark

You can also use newer versions of Spark provided through :ref:`modules`, for example::

    module load spark/3.4.2

.. note:: If using Python we recommend that you use an Anaconda version provided through :ref:`modules`.

To use Spark with Python you can load the desired version of the Anaconda module, for example::

    module load anaconda3/2020.02

If using Anaconda, you can also use ipython for the interactive pyspark session so you get a nicer CLI::

    PYSPARK_DRIVER_PYTHON=ipython pyspark

To submit a job::

    spark-submit --name testWC test.py input output

The job will be submitted to YARN and queued for execution. How long it waits before running depends on the load of the platform.

.. note:: You can access the Spark UI through the :ref:`webui` and :ref:`HUE`.

.. figure:: _static/screenshots/hue-jobs-to-sparkui.png
    :align: center

    The Spark UI showing details of a given job.

For further information on how to use Spark you can check the `Spark Tutorial`_ that we have prepared to get you started.

For more information you can check the `PySpark Course Material`_ and the `Sparklyr Course Material`_; these are courses that you can also attend to learn more.

Finally, you may also find the `Spark Guide`_ in the CDH documentation useful, as well as the great documentation provided by the `Spark project`_.

.. _Spark Tutorial: https://bigdata.cesga.es/tutorials/spark.html
.. _PySpark Course Material: https://github.com/javicacheiro/pyspark_course
.. _Sparklyr Course Material: https://github.com/aurora-mareviv/sparklyr_test
.. _Spark Guide: https://www.cloudera.com/documentation/enterprise/6/6.1/topics/spark.html
.. _Spark project: https://spark.apache.org/docs/2.4.0/
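
As an illustration of the kind of script that the ``spark-submit`` command above could run, here is a minimal word count sketch. It is not the actual ``test.py`` used on the platform: the helper names ``split_words`` and ``main`` are hypothetical, and it assumes that ``input`` and ``output`` are paths readable and writable by Spark (for example in HDFS) and that ``output`` does not exist yet::

    import re
    import sys
    from operator import add


    def split_words(line):
        """Lowercase a line and return its words, dropping punctuation."""
        return re.findall(r"[a-z0-9']+", line.lower())


    def main(input_path, output_path):
        # Imported here so that split_words can be tried without Spark installed
        from pyspark.sql import SparkSession

        # No appName is set in code, so the --name given to spark-submit applies
        spark = SparkSession.builder.getOrCreate()

        # Read the input as an RDD of lines and count each word
        lines = spark.sparkContext.textFile(input_path)
        counts = (lines.flatMap(split_words)
                       .map(lambda word: (word, 1))
                       .reduceByKey(add))

        # Write one (word, count) pair per line under output_path
        counts.saveAsTextFile(output_path)
        spark.stop()


    if __name__ == "__main__":
        # Assumed usage: spark-submit --name testWC test.py input output
        main(sys.argv[1], sys.argv[2])

Keeping the tokenizing logic in a plain function makes it easy to check in a regular Python session before submitting the job to YARN.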