This README assumes that, before running the Spark analytic job, you have already installed the correct versions of Java, Hadoop, and Spark, and that you are working inside Ubuntu.

Required Software Versions:
- Ubuntu 24.04 (Recommended inside VirtualBox)
- Java: Java-8-openjdk-amd64
- Hadoop: hadoop-3.2.3
- Apache Spark: spark-3.5.0-bin-hadoop3
To verify installations:
java -version
hadoop version
spark-submit --version
To install the correct version of Java, run:
sudo apt install openjdk-8-jdk
To switch between Java versions, use:
sudo update-alternatives --config java # select the desired version and press enter
Verify the installation:
java -version
✅ Expected Output:
openjdk version "1.8.0_442"
Follow these links for installation guidance:
- https://trentu.blackboard.com/ultra/courses/_65448_1/cl/outline
- https://spark.apache.org/docs/latest/api/python/getting_started/install.html
- https://www.youtube.com/watch?v=ei_d4v9c2iA
- https://www.youtube.com/watch?v=lvEN48bWO0o
Go to the Apache Spark downloads website (https://spark.apache.org/downloads.html), or issue the following commands in your terminal inside Ubuntu:
- 2.1. Navigate to the /opt directory and download Spark:
cd /opt # first command
sudo wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz # second command
- 2.2. Extract the downloaded archive and set up the directory:
# one at a time
sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz # first command
sudo mv spark-3.5.0-bin-hadoop3 spark # second command
After downloading and extracting Spark, update your environment variables.
- 2.3. Edit the .bashrc file:
nano ~/.bashrc  # sudo is not required to edit your own .bashrc
- 2.4. Add the following lines at the bottom:
export SPARK_HOME=/opt/spark  # matches the `spark` directory created in step 2.2
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
- 2.5. Save and exit (Ctrl + X, then Y, then Enter).
- 2.6. Apply the changes:
source ~/.bashrc
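If you want a quick sanity check that the new variables were picked up by your shell, a small Python 3 snippet like the one below can be run after `source ~/.bashrc` (this is only an illustrative sketch, not part of the original setup):

```python
# Sanity check: confirm SPARK_HOME and PATH were exported correctly.
import os
import shutil
import subprocess

print("SPARK_HOME =", os.environ.get("SPARK_HOME"))        # expected: /opt/spark
print("spark-submit:", shutil.which("spark-submit"))        # expected: /opt/spark/bin/spark-submit
subprocess.run(["spark-submit", "--version"], check=True)   # should print the Spark 3.5.0 banner
```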
At this point, if you have been following one of the videos linked above, you may have been prompted to issue the following command:
sudo -i
which switches you from the normal user to the root user, so your prompt looks like the following:
root@ubuntu
If so, you have probably downloaded everything as the root user. This will cause some confusion as you continue, because you will notice that you cannot use Hadoop as the root user. The best fix is to move your Spark installation from the root user to the normal user. Unless you download and configure Hadoop for the root user as well, the following steps should be useful:
- Assuming you have finished the steps above and have just started your virtual machine and logged into Ubuntu, open a terminal and issue:
sudo -i
- Move Spark to the Normal User’s Home Directory:
mv /opt/spark /home/ubuntu25/
- Change Ownership to the Normal User
chown -R ubuntu25:ubuntu25 /home/ubuntu25/spark
Update Environment Variables for the Normal User
- 4.1. Switch to the normal user
su - ubuntu25
# or
Ctrl + d
- 4.2. Open and edit .bashrc:
nano ~/.bashrc
- 4.3. Add the following Spark variables at the bottom:
# Spark Variables
# the only change is the `/home/ubuntu25/`
export SPARK_HOME=/home/ubuntu25/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
- 4.4. Save and reload the environment:
source ~/.bashrc
- 4.5. Verify the Spark installation:
spark-submit --version
✅ Expected Output:
Spark version 3.5.0 (shown in the welcome banner)
- 4.6. Test whether Spark works:
pyspark --master yarn
✅ Expected Output:
>>>
Or you can run the following:
```bash
spark-shell
```
✅ Expected Output:
```bash
scala>
```
To exit, type:
```bash
:quit
# or
Ctrl + D  # whichever works
```
Either works just fine.
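Once you are at the `>>>` prompt, a quick smoke test like the following confirms that Spark can actually run a small job (the `spark` session is created for you by the pyspark shell; this snippet is only an illustration):

```python
# Run inside the pyspark shell; `spark` already exists as a SparkSession.
df = spark.range(10)   # tiny in-memory DataFrame with a single `id` column
print(df.count())      # should print 10; with --master yarn the tasks run through YARN
df.show(3)             # displays the first three rows
```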
Now you should be back in normal user mode, and hopefully everything is working correctly.
The following assumes you are working in normal user mode as opposed to the root user. However, as already stated, if all of your installation and configuration was done as the root user, you should be fine.
Run the following commands in order:
start-dfs.sh
jps
✅ Expected output:
NameNode
DataNode
SecondaryNameNode
- If NameNode is not showing, continue with YARN and start the NameNode afterwards (shown below).
Run the following command to start YARN:
start-yarn.sh
jps
✅ Expected output:
ResourceManager
NodeManager
- Now run the following for the NameNode (note: `hdfs namenode -format` erases existing HDFS data, so only run it on a fresh setup):
hdfs namenode -format
hdfs --daemon start namenode
jps
✅ Expected output:
- ResourceManager
- NodeManager
- Jps
- NameNode
- DataNode
- SecondaryNameNode
Extra:
hdfs dfsadmin -report
✅ Expected output:
No errors and showing available storage.
The following is something I found while digging into Spark: you can manually start the Spark history server in case it is not already running, so that you can see your past Spark jobs:
# one of the following commands should work; if not, try both
start-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh
The following information was taken from these sources:
- https://spark.apache.org/docs/latest/submitting-applications.html
- https://spark.apache.org/docs/latest/running-on-yarn.html
- https://spark.apache.org/docs/latest/spark-standalone.html
- Prepare the script and dataset beforehand: place local_code_test_A2-pyspark-sql.py and pokemon.csv in the same directory (local/, in my case). A sketch of what such a script might look like is given after these steps.
- Run the script locally:
spark-submit --master local local_code_test_A2-pyspark-sql.py
- Output is stored locally (in my case, in output/local_feistiest_pokemon.csv).
- Issue encountered: Spark sometimes writes multiple part files, so rename the desired one manually:
mv output/part-00000-*.csv output/local_feistiest_pokemon.csv
This usually produces the desired CSV file. You may also need to modify the script to make sure it works for you.
- A few changes: move the files to HDFS. For the YARN job, in my case, I created a separate folder (a second one, named yarn) to avoid any mistakes:
hdfs dfs -mkdir -p /user/ubuntu25/yarn
hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/
- Modify the script to read from HDFS (yarn_code_test_A2-pyspark-sql.py):
pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True)  # this is the crucial modification; for the local job you would use file:// instead of hdfs://
- Run the job on YARN:
spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py  # another difference: for YARN you use `--master yarn` instead of `--master local`
- Lastly, retrieve output from HDFS:
hdfs dfs -get /user/ubuntu25/yarn/output output/
mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv
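For reference (as mentioned in the first step above), here is a minimal sketch of what a script like local_code_test_A2-pyspark-sql.py might contain. This is an assumption, not the actual assignment script: the column name `attack`, the idea that "feistiest" means highest attack, and the input/output paths are all illustrative and should be adjusted to your own setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a Spark session.
spark = SparkSession.builder.appName("feistiest-pokemon").getOrCreate()

# Local run: read from the local filesystem. For the YARN version, the path
# would start with hdfs:///user/ubuntu25/yarn/ instead of file://.
pokemon_df = spark.read.csv("file:///home/ubuntu25/local/pokemon.csv",
                            header=True, inferSchema=True)

# Assumption: "feistiest" = highest attack stat; adjust to whatever the assignment defines.
feistiest = pokemon_df.orderBy(F.col("attack").desc()).limit(10)

# Spark writes a directory of part-* files; coalesce(1) keeps it to a single part file,
# which can then be renamed with the mv command shown earlier.
feistiest.coalesce(1).write.mode("overwrite") \
    .option("header", True) \
    .csv("output/local_feistiest_pokemon")

spark.stop()
```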
| Feature | Local Job | YARN Job |
| --- | --- | --- |
| Execution Mode | `--master local` | `--master yarn` |
| Data Location | Reads/writes local files (`file://`) | Reads/writes from HDFS (`hdfs://`) |
| Use Case | Good for testing | Good for large datasets |
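One optional way to avoid maintaining two nearly identical scripts (a suggestion only, not part of the original setup) is to pass the input path as a command-line argument, so the same script covers both modes in the table above. The script name `feistiest.py` and the paths below are hypothetical:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feistiest-pokemon").getOrCreate()

# Take the input path from the command line, defaulting to a local file, e.g.:
#   spark-submit --master local feistiest.py file:///home/ubuntu25/local/pokemon.csv
#   spark-submit --master yarn  feistiest.py hdfs:///user/ubuntu25/yarn/pokemon.csv
input_path = sys.argv[1] if len(sys.argv) > 1 else "file:///home/ubuntu25/local/pokemon.csv"

pokemon_df = spark.read.csv(input_path, header=True, inferSchema=True)
print(pokemon_df.count())  # the rest of the analysis stays the same as in the original script

spark.stop()
```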
- Install Java 8, Hadoop 3.2.3, and Apache Spark 3.5.0 for better compatibility.
- Even if different Hadoop and Apache Spark versions are used, the code should still work as long as Hadoop, Apache Spark, and Java are configured properly and are compatible with each other.
- Configure environment variables for Spark and Hadoop.
- Ensure Spark runs under a normal user (not root).
- Execute the PySpark job with:
spark-submit --master local local_code_test_A2-pyspark-sql.py
- Output is stored locally in output/local_feistiest_pokemon.csv.
- Move the dataset to HDFS:
hdfs dfs -mkdir -p /user/ubuntu25/yarn
hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/
- Modify the script to read from HDFS:
pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True)
- Submit the job to YARN:
spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py
- Retrieve the output from HDFS:
hdfs dfs -get /user/ubuntu25/yarn/output output/
mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv
- To get the files from my virtual machine to my local machine, I uploaded them to Google Drive and downloaded them again.
🚀 Additional Debugging Steps
1️⃣ If at any point you encounter errors or bugs, Google and Stack Overflow are your best friends.
📜 All resources used have been referenced; the links are in this README file, and others are in a separate PDF document.
📂 Dataset Reference
This project uses the Pokémon dataset available on Kaggle:
🔗 https://www.kaggle.com/datasets/rounakbanik/pokemon