This README assumes that, before running the Spark analytic job, you have already installed the correct versions of Java, Hadoop, and Spark, and that you are working inside Ubuntu.

Required Software Versions:
- Ubuntu 24.04 (Recommended inside VirtualBox)
- Java: Java-8-openjdk-amd64
- Hadoop: hadoop-3.2.3
- Apache Spark: spark-3.5.0-bin-hadoop3
To verify installations:
java -version
hadoop version
spark-submit --version
To install the correct version of Java, run:
sudo apt install openjdk-8-jdk
To switch between Java versions, use:
sudo update-alternatives --config java # select the desired version and press enter
Verify the installation:
java -version
✅ Expected Output:
openjdk version "1.8.0_442"
Follow these links for installation guidance:
- https://trentu.blackboard.com/ultra/courses/_65448_1/cl/outline
- https://spark.apache.org/docs/latest/api/python/getting_started/install.html
- https://www.youtube.com/watch?v=ei_d4v9c2iA
- https://www.youtube.com/watch?v=lvEN48bWO0o
Go to the Apache Spark downloads website (https://spark.apache.org/downloads.html), or issue the following commands in your terminal inside Ubuntu:
- 2.1. Navigate to the /opt directory and download Spark:
cd /opt # first command
sudo wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz # second command
- 2.2. Extract the downloaded archive and set up the directory:
# one at a time
sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz # first command
sudo mv spark-3.5.0-bin-hadoop3 spark # second command
After downloading and extracting Spark, update your environment variables.
- 2.3. Edit the .bashrc file:
nano ~/.bashrc  # sudo is not required to edit your own .bashrc
- 2.4. Add the following lines at the bottom:
export SPARK_HOME=/opt/spark  # matches the `spark` directory created in step 2.2
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
- 2.5. Save and exit (Ctrl + X, then Y, then Enter).
- 2.6. Apply the changes:
source ~/.bashrc
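If you want a quick sanity check that the new variables were picked up by your shell, a small Python 3 snippet like the one below can be run after `source ~/.bashrc` (this is only an illustrative sketch, not part of the original setup):

```python
# Sanity check: confirm SPARK_HOME and PATH were exported correctly.
import os
import shutil
import subprocess

print("SPARK_HOME =", os.environ.get("SPARK_HOME"))        # expected: /opt/spark
print("spark-submit:", shutil.which("spark-submit"))        # expected: /opt/spark/bin/spark-submit
subprocess.run(["spark-submit", "--version"], check=True)   # should print the Spark 3.5.0 banner
```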
At this point, if you have been following one of the videos linked above, you may have been prompted to issue the following command:
sudo -i
which switches you from the normal user to the root user, so your prompt looks like the following:
root@ubuntu
If so, you have probably downloaded everything as the root user. This will cause some confusion as you continue, because you will notice that you cannot use Hadoop as the root user. The best fix is to move your Spark installation from the root user to the normal user. Unless you download and configure Hadoop for the root user as well, the following steps should be useful:
- Assuming you have finished the steps above and have just started your virtual machine and logged into Ubuntu, open a terminal and issue:
sudo -i
- Move Spark to the Normal User’s Home Directory:
mv /opt/spark /home/ubuntu25/
- Change Ownership to the Normal User
chown -R ubuntu25:ubuntu25 /home/ubuntu25/spark
Update Environment Variables for the Normal User
- 4.1. Switch to the normal user
su - ubuntu25
# or
Ctrl + d
- 4.2. Open and edit .bashrc:
nano ~/.bashrc
- 4.3. Add the following Spark variables at the bottom:
# Spark Variables
# the only change is the `/home/ubuntu25/`
export SPARK_HOME=/home/ubuntu25/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
- 4.4. Save and reload the environment:
source ~/.bashrc
- 4.5. Verify the Spark installation:
spark-submit --version
✅ Expected Output:
Spark version 3.5.0 (shown in the welcome banner)
- 4.6. Test whether Spark works:
pyspark --master yarn
✅ Expected Output:
>>>
Or you can run the following:
```bash
spark-shell
```
✅ Expected Output:
```bash
scala>
```
To exit, type:
```bash
:quit
# or
Ctrl + D  # whichever works
```
Either works just fine.
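Once you are at the `>>>` prompt, a quick smoke test like the following confirms that Spark can actually run a small job (the `spark` session is created for you by the pyspark shell; this snippet is only an illustration):

```python
# Run inside the pyspark shell; `spark` already exists as a SparkSession.
df = spark.range(10)   # tiny in-memory DataFrame with a single `id` column
print(df.count())      # should print 10; with --master yarn the tasks run through YARN
df.show(3)             # displays the first three rows
```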
Now you should be back in normal user mode, and hopefully everything is working correctly.
The following assumes you are working in normal user mode as opposed to the root user. However, as already stated, if all of your installation and configuration was done as the root user, you should be fine.
Run the following commands in order:
start-dfs.sh
jps
✅ Expected output:
NameNode
DataNode
SecondaryNameNode
- If NameNode is not showing, continue with YARN and start the NameNode afterwards (shown below).
Run the following command to start YARN:
start-yarn.sh
jps
✅ Expected output:
ResourceManager
NodeManager
- Now run the following for the NameNode (note: `hdfs namenode -format` erases existing HDFS data, so only run it on a fresh setup):
hdfs namenode -format
hdfs --daemon start namenode
jps
✅ Expected output:
- ResourceManager
- NodeManager
- Jps
- NameNode
- DataNode
- SecondaryNameNode
Extra:
hdfs dfsadmin -report
✅ Expected output:
No errors and showing available storage.
The following is something I found while digging into Spark: you can manually start the Spark history server in case it is not already running, so that you can see your past Spark jobs:
# one of the following commands should work; if not, try both
start-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh
The following information was taken from these sources:
- https://spark.apache.org/docs/latest/submitting-applications.html
- https://spark.apache.org/docs/latest/running-on-yarn.html
- https://spark.apache.org/docs/latest/spark-standalone.html
- Prepare the script and dataset beforehand: place local_code_test_A2-pyspark-sql.py and pokemon.csv in the same directory (local/, in my case). A sketch of what such a script might look like is given after these steps.
- Run the script locally:
spark-submit --master local local_code_test_A2-pyspark-sql.py
- Output is stored locally (in my case, in output/local_feistiest_pokemon.csv).
- Issue encountered: Spark sometimes writes multiple part files, so rename the desired one manually:
mv output/part-00000-*.csv output/local_feistiest_pokemon.csv
This usually produces the desired CSV file. You may also need to modify the script to make sure it works for you.
- A few changes: move the files to HDFS. For the YARN job, in my case, I created a separate folder (a second one, named yarn) to avoid any mistakes:
hdfs dfs -mkdir -p /user/ubuntu25/yarn
hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/
- Modify the script to read from HDFS (yarn_code_test_A2-pyspark-sql.py):
pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True)  # this is the crucial modification; for the local job you would use file:// instead of hdfs://
- Run the job on YARN:
spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py  # another difference: for YARN you use `--master yarn` instead of `--master local`
- Lastly, retrieve output from HDFS:
hdfs dfs -get /user/ubuntu25/yarn/output output/
mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv
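For reference (as mentioned in the first step above), here is a minimal sketch of what a script like local_code_test_A2-pyspark-sql.py might contain. This is an assumption, not the actual assignment script: the column name `attack`, the idea that "feistiest" means highest attack, and the input/output paths are all illustrative and should be adjusted to your own setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a Spark session.
spark = SparkSession.builder.appName("feistiest-pokemon").getOrCreate()

# Local run: read from the local filesystem. For the YARN version, the path
# would start with hdfs:///user/ubuntu25/yarn/ instead of file://.
pokemon_df = spark.read.csv("file:///home/ubuntu25/local/pokemon.csv",
                            header=True, inferSchema=True)

# Assumption: "feistiest" = highest attack stat; adjust to whatever the assignment defines.
feistiest = pokemon_df.orderBy(F.col("attack").desc()).limit(10)

# Spark writes a directory of part-* files; coalesce(1) keeps it to a single part file,
# which can then be renamed with the mv command shown earlier.
feistiest.coalesce(1).write.mode("overwrite") \
    .option("header", True) \
    .csv("output/local_feistiest_pokemon")

spark.stop()
```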
| Feature | Local Job | YARN Job |
| --- | --- | --- |
| Execution Mode | `--master local` | `--master yarn` |
| Data Location | Reads/writes local files (`file://`) | Reads/writes from HDFS (`hdfs://`) |
| Use Case | Good for testing | Good for large datasets |
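One optional way to avoid maintaining two nearly identical scripts (a suggestion only, not part of the original setup) is to pass the input path as a command-line argument, so the same script covers both modes in the table above. The script name `feistiest.py` and the paths below are hypothetical:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feistiest-pokemon").getOrCreate()

# Take the input path from the command line, defaulting to a local file, e.g.:
#   spark-submit --master local feistiest.py file:///home/ubuntu25/local/pokemon.csv
#   spark-submit --master yarn  feistiest.py hdfs:///user/ubuntu25/yarn/pokemon.csv
input_path = sys.argv[1] if len(sys.argv) > 1 else "file:///home/ubuntu25/local/pokemon.csv"

pokemon_df = spark.read.csv(input_path, header=True, inferSchema=True)
print(pokemon_df.count())  # the rest of the analysis stays the same as in the original script

spark.stop()
```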
- Install Java 8, Hadoop 3.2.3, and Apache Spark 3.5.0 for better compatibility.
- Even if different Hadoop and Apache Spark versions are used, the code should still work as long as Hadoop, Apache Spark, and Java are configured properly and are compatible with each other.
- Configure environment variables for Spark and Hadoop.
- Ensure Spark runs under a normal user (not root).
- Execute the PySpark job with:
spark-submit --master local local_code_test_A2-pyspark-sql.py
- Output is stored locally in output/local_feistiest_pokemon.csv.
- Move the dataset to HDFS:
hdfs dfs -mkdir -p /user/ubuntu25/yarn
hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/
- Modify the script to read from HDFS:
pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True)
- Submit the job to YARN:
spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py
- Retrieve the output from HDFS:
hdfs dfs -get /user/ubuntu25/yarn/output output/
mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv
- To get the files from my virtual machine to my local machine, I uploaded them to Google Drive and downloaded them again.
🚀 Additional Debugging Steps
1️⃣ If at any point you encounter errors or bugs, Google and Stack Overflow are your best friends.
📜 All resources used have been referenced; the links are in this README file, and others are in a separate PDF document.
📂 Dataset Reference
This project uses the Pokémon dataset available on Kaggle:
🔗 https://www.kaggle.com/datasets/rounakbanik/pokemon