In this post we will learn how to set up a PySpark learning environment on Windows.
To learn Spark with Python, we will install PySpark on Windows and use Jupyter Notebook and the Spyder IDE to test and run PySpark code.
Java must be installed first. If it is not, install Java and then continue with the PySpark setup.
Follow the steps below to set up PySpark.
- Download and install Anaconda.
Click the link below and download Anaconda.
Once it has downloaded, follow the on-screen steps to install Anaconda.
Note the Destination Folder, since Anaconda will be installed there. We will add this path to the Path environment variable so that Python can be run from cmd.
Click Install to complete the installation.
Search for Anaconda Navigator and open it.
After clicking Anaconda Navigator, wait 1-2 minutes. Anaconda Navigator will open as shown below.
We can open Anaconda Prompt and run python as shown below.
However, python will not run from the regular command prompt (cmd) yet, because the path has not been set.
It will work once the Anaconda Python path is added to the Path environment variable.
Search for "Edit the system environment variables" and add the path below to the Path variable.
After setting the Path variable, we will be able to run python from the command prompt (cmd).
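To confirm that the Path change took effect, a small script along these lines can be run from a new cmd window. It is only a sketch: the helper name `tools_on_path` is illustrative, and checking for `java` on the Path is an assumption about how Java was installed.

```python
import shutil
import sys


def tools_on_path(names=("python", "pip", "java")):
    """Map each command name to the executable found on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The interpreter running this script; it should live in the Anaconda folder.
    print("Running interpreter:", sys.executable)
    for name, location in tools_on_path().items():
        print(f"{name}: {location or 'not found on PATH'}")
```

If `python` or `pip` is reported as not found, the Anaconda folder is probably still missing from the Path variable, or the cmd window was opened before the change.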
Now we will install pyspark using pip.
pip install pyspark
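One way to check that pip installed the package into the Python on the Path is to ask the standard library for the installed version. This is a minimal sketch, and the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pyspark")
    print("pyspark version:", version or "not installed in this Python")
```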
Now we can run pyspark from the command prompt.
However, pyspark reports a winutils exception at startup, because the Hadoop libraries bundled with PySpark expect the winutils.exe helper for file system operations on Windows.
To handle this exception we need to download winutils and set the HADOOP_HOME variable.
You can copy the downloaded winutils into any folder and set HADOOP_HOME accordingly.
I copied it into the C:\hadoop\bin folder and set HADOOP_HOME as shown below. HADOOP_HOME should point to the folder that contains bin, which is C:\hadoop in this case.
Now close the command prompt and open a new one to check pyspark.
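If the exception still appears, it may help to check that HADOOP_HOME is visible to new processes and that winutils.exe sits in its bin folder. The following is a hedged sketch: the function name `winutils_path` is hypothetical, and it assumes the layout described above (winutils.exe directly under `%HADOOP_HOME%\bin`).

```python
import os
from pathlib import Path


def winutils_path(env=None):
    """Return the path to winutils.exe under HADOOP_HOME\\bin, or None if not found."""
    env = os.environ if env is None else env
    home = env.get("HADOOP_HOME")
    if not home:
        return None  # HADOOP_HOME is not set for this process
    candidate = Path(home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = winutils_path()
    print("winutils.exe:", found or "not found (check HADOOP_HOME)")
```

Run it from a freshly opened command prompt, since windows that were already open do not pick up new environment variables.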
We can now open Jupyter Notebook and run Spark.
Click New -> Python 3.
Open the Python notebook and paste the code below into a notebook cell.
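As a minimal sketch of what such a notebook cell could look like, the following starts a local Spark session, builds a tiny DataFrame, and counts its rows. The function name, the app name, and the sample rows are illustrative assumptions, not part of the setup above.

```python
def spark_smoke_test():
    """Start a local Spark session, count a three-row DataFrame, then stop Spark."""
    # Imported here so the cell can be defined even if pyspark is not installed yet.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")              # run Spark locally on all available cores
        .appName("pyspark-smoke-test")   # illustrative name shown in the Spark UI
        .getOrCreate()
    )
    try:
        rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]
        df = spark.createDataFrame(rows, ["id", "label"])
        return df.count()
    finally:
        spark.stop()
```

Calling `spark_smoke_test()` in the next cell should return the row count once Java, PySpark, and winutils are all set up. The first call can take a while because the JVM has to start.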
Note: By default, Tab autocompletion of code will not work in Jupyter Notebook. Install "pyreadline" to enable it.
pip install pyreadline
We can also run the same code in the Spyder IDE.
Launch Spyder from Anaconda Navigator.