🏆 Pokémon Feistiness Apache Spark Job

1. Prerequisites

This README assumes that, before running the Spark analytics job, you have already installed the correct versions of Java, Hadoop, and Spark, and that you are working inside Ubuntu.

Required software versions:

  • Ubuntu 24.04 (Recommended inside VirtualBox)
  • Java: Java-8-openjdk-amd64
  • Hadoop: hadoop-3.2.3
  • Apache Spark: spark-3.5.0-bin-hadoop3

To verify installations:

java -version
hadoop version
spark-submit --version 

2. Installation Steps

Step 1: Install Java

To install the correct version of Java, run:

sudo apt install openjdk-8-jdk

To switch between Java versions, use:

sudo update-alternatives --config java # select the desired version and press enter

Verify the installation:

java -version

✅ Expected Output:

openjdk version "1.8.0_442"

If Hadoop is not already installed, the following link should be helpful:


Step 2: Install Apache Spark

You can either download Apache Spark from the official Apache Spark download page, or issue the following commands in your terminal inside Ubuntu:

    • 2.1. Navigate to the /opt directory and download Spark:
cd /opt # first command

sudo wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz # second command

    • 2.2. Extract the downloaded archive and set up the directory:
# one at a time

sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz # first command
sudo mv spark-3.5.0-bin-hadoop3 spark # second command
  • After downloading and extracting Spark, update your environment variables.

    • 2.3. Modify the .bashrc file:
sudo nano ~/.bashrc
    • 2.4. Add the following lines at the bottom (note that step 2.2 renamed the directory to /opt/spark):
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
    • 2.5. Save and exit (CTRL + X, then Y, then Enter).
    • 2.6. Apply the changes:
source ~/.bashrc

At this point, if you have been watching the following video:

you may have been prompted to issue the following command:

sudo -i

This switches you from the normal user to the root user, whose prompt looks like the following:

root@ubuntu

You have probably then downloaded everything as the root user. This will cause some confusion as you continue, because you will notice that you cannot use Hadoop as the root user. The best thing to do is to move your Spark installation from the root user to the normal user.

Unless you download and configure Hadoop under the root user as well, the following steps could be useful:

    1. Assuming you have already finished the steps above, and you have just started your virtual machine and logged into Ubuntu, go to the terminal and issue the command:
sudo -i
    2. Move Spark to the normal user's home directory:
mv /opt/spark /home/ubuntu25/
    3. Change ownership to the normal user:
chown -R ubuntu25:ubuntu25 /home/ubuntu25/spark
    4. Update the environment variables for the normal user:

      • 4.1. Switch to the normal user
su - ubuntu25
# or
Ctrl + d
    • 4.2. Open .bashrc and edit
nano ~/.bashrc
    • 4.3. Add the following Spark variables at the bottom
# Spark Variables
# the only change is the `/home/ubuntu25/`
export SPARK_HOME=/home/ubuntu25/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
    • 4.4. Save & reload the environment
source ~/.bashrc
    • 4.5. Verify the Spark installation
spark-submit --version

✅ Expected Output:

SPARK version 3.5.0
    • 4.6. Test if Spark works:
pyspark --master yarn

✅ Expected Output:

>>>

Or you can run the following:

```bash
spark-shell
```

✅ Expected Output:
```bash
scala>
```

To exit, type:

```bash
:quit

# or

Ctrl + d # whichever works
```

Either works just fine.

Now you should be back in normal user mode, and hopefully everything is working correctly.
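As one more optional sanity check, you can open the pyspark shell again and run a tiny computation at the >>> prompt. This is a minimal sketch; any small job will do, and `spark` is the SparkSession that pyspark creates for you automatically:

```python
# run these lines at the >>> prompt of the pyspark shell
df = spark.range(100)            # small DataFrame with ids 0..99
print(df.count())                # should print 100
df.selectExpr("sum(id)").show()  # should show 4950
```

If this runs without errors, Spark itself is working.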

3. Running the Spark Job

The following assumes that you are working as the normal user, as opposed to the root user. However, as already stated, if all of your installation and configuration was done as the root user, you should be fine.

0. Start Hadoop first

Run the following commands in order:

start-dfs.sh
jps

✅ Expected output:

NameNode
DataNode
SecondaryNameNode
  • If NameNode is not showing, continue with YARN.

Run the following command to start YARN:

start-yarn.sh
jps

✅ Expected output:

ResourceManager
NodeManager
  • Now run the following to format and start the NameNode (note: formatting erases any existing HDFS data, so only do this on a fresh or empty HDFS):
hdfs namenode -format 
hdfs --daemon start namenode 
jps

✅ Expected output:

- ResourceManager
- NodeManager
- Jps
- NameNode
- DataNode
- SecondaryNameNode

Extra:

hdfs dfsadmin -report

✅ Expected output:

No errors and showing available storage.

The following is something I found while digging into Spark: you can manually start the Spark history server, in case it is not already running, so that you can see your past Spark jobs:

# one of the following commands should work; if not, try both
start-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh

Let's run the Spark job

The information below was taken from the sources referenced in this README and in the accompanying PDF document.

A. Running a Local Spark Job

  1. Prepare the script (beforehand) and dataset: place local_code_test_A2-pyspark-sql.py and pokemon.csv in the same directory (local, in my case). A sketch of what such a script might look like is shown after this list.

  2. Run the script locally:
    spark-submit --master local local_code_test_A2-pyspark-sql.py

  3. The output is stored locally, in my case in output/local_feistiest_pokemon.csv.

  4. Issue encountered: Spark sometimes writes multiple part files, so rename the desired one manually:

    mv output/part-00000-*.csv output/local_feistiest_pokemon.csv

This will usually result in the desired CSV file. Also, you might need to modify the script to make sure it works for you.
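For reference, here is a hypothetical sketch of what such a local script could look like. This is not the repository's local_code_test_A2-pyspark-sql.py: it assumes feistiness is computed as attack divided by weight_kg, and it uses the Kaggle dataset's name, attack, and weight_kg columns, so adjust the metric, column names, and paths to match the real script and your setup.

```python
# Hypothetical sketch of a local "feistiness" job -- not the repo's actual script.
# Assumption: feistiness = attack / weight_kg, using the Kaggle columns
# `name`, `attack`, `weight_kg`. Adjust paths and column names as needed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PokemonFeistinessLocal").getOrCreate()

# Read the local CSV (note the file:// scheme for a local job).
pokemon_df = spark.read.csv("file:///path/to/pokemon.csv", header=True, inferSchema=True)

feisty_df = (
    pokemon_df
    .where(F.col("weight_kg").isNotNull() & (F.col("weight_kg") > 0))
    .withColumn("feistiness", F.col("attack") / F.col("weight_kg"))
    .select("name", "attack", "weight_kg", "feistiness")
    .orderBy(F.col("feistiness").desc())
)

# coalesce(1) writes a single part file, which makes the rename in step 4 easier.
feisty_df.coalesce(1).write.mode("overwrite").csv("file:///path/to/output", header=True)

spark.stop()
```

Using coalesce(1) here is optional; without it, Spark may write several part-00000-*.csv files, which is exactly why the rename step above uses a wildcard.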


B. Running a PySpark Job on YARN

  1. A few changes: move the files to HDFS. For the YARN job, in my case, I created a separate folder (a second one, yarn) so that I could avoid any mistakes:
    hdfs dfs -mkdir -p /user/ubuntu25/yarn
    hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/

  2. Modify the script to read from HDFS (yarn_code_test_A2-pyspark-sql.py); see the sketch after this list:
    pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True) # this is a crucial modification
    # For the local job, you would have `file` instead of `hdfs`.

  3. Run the job on YARN:
    spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py # another difference: for YARN you use `--master yarn` instead of `--master local`

  4. Lastly, retrieve the output from HDFS:
    hdfs dfs -get /user/ubuntu25/yarn/output output/
    mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv
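For completeness, here is a minimal sketch of the HDFS read and write lines for the YARN variant. It is again hypothetical, assuming the same feisty_df DataFrame as in the local sketch above; the rest of the script stays the same.

```python
# Hypothetical sketch of the HDFS I/O in the YARN variant -- not the repo's exact code.
# Read the dataset from HDFS instead of the local filesystem.
pokemon_df = spark.read.csv(
    "hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True
)

# ... compute `feisty_df` exactly as in the local sketch ...

# Write the result back to HDFS so it can be fetched with `hdfs dfs -get`.
feisty_df.coalesce(1).write.mode("overwrite").csv(
    "hdfs:///user/ubuntu25/yarn/output", header=True
)
```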

4. Key Differences Between Local and YARN Jobs

| Feature | Local Job | YARN Job |
|---|---|---|
| Execution mode | `--master local` | `--master yarn` |
| Data location | Reads/writes local files (`file://`) | Reads/writes HDFS (`hdfs://`) |
| Use case | Good for testing | Good for large datasets |

5. Summary

1️⃣ Setup & Installation

  • Install Java 8, Hadoop 3.2.3, and Apache Spark 3.5.0 for better compatibility.
    • Although a different Hadoop or Apache Spark version may be used, as far as the code goes, it should work as long as Hadoop, Apache Spark, and Java are configured properly and are compatible with each other.
  • Configure environment variables for Spark and Hadoop.
  • Ensure Spark runs under a normal user (not root).

2️⃣ Running a Spark Job Locally

  • Execute the PySpark job with:
    spark-submit --master local local_code_test_A2-pyspark-sql.py
  • Output is stored locally in output/local_feistiest_pokemon.csv.

3️⃣ Running a Spark Job on YARN

  • Move dataset to HDFS:

    hdfs dfs -mkdir -p /user/ubuntu25/yarn
    hdfs dfs -put pokemon.csv /user/ubuntu25/yarn/
  • Modify the script to read from HDFS:

    pokemon_df = spark.read.csv("hdfs:///user/ubuntu25/yarn/pokemon.csv", header=True, inferSchema=True)
  • Submit job to YARN:

    spark-submit --master yarn yarn_code_test_A2-pyspark-sql.py
  • Retrieve output from HDFS:

    hdfs dfs -get /user/ubuntu25/yarn/output output/
    mv output/part-00000-*.csv output/feistiest_pokemon_yarn.csv

4️⃣ File retrieval

  • To get files from my virtual machine to my local machine, I uploaded them to Google Drive and then downloaded them.

🚀 Additional Debugging Steps

1️⃣ If at any point you encounter errors or bugs, Google and Stack Overflow are your best friends.


📜 All resources used have been referenced; the links are in this README file, and others are in a separate PDF document.


📂 Dataset Reference
This project uses the Pokémon dataset available on Kaggle:
🔗 https://www.kaggle.com/datasets/rounakbanik/pokemon
