
Commit a615b97

Rahul Gupta committed: Initial commit
1 parent 04e463a commit a615b97

File tree: 8 files changed, +6532 -1 lines


README.md (+40 -1)
# scrape-employees-data-from-greythr

This repository serves as an illustration of how to retrieve and scrape data from Greythr using Selenium.
## Features

- Automate login on Greythr using an employee username and password.
- Scrape all employees' data and store it in an xlsx file.
- Construct a bar graph that illustrates the employee count categorized by their respective designations.

![Employees count](images/image1.png)

- Generate a pie chart representing the distribution of employees based on their birth months.

![Employees birthday](images/image2.png)
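The birth-month chart hinges on one parsing step: each scraped DOB string is converted to a month number and then a month name (scrap.py below uses the `'%d %b'` format for this). A minimal sketch of that step, using made-up sample DOB values:

```python
import calendar

import pandas as pd

# Hypothetical DOB strings in the shape the scraper stores: day + abbreviated month
dob = pd.Series(['14 Mar', '02 Nov', '29 Mar'])

# Parse with the same '%d %b' format, take the month number, then map it to a name
months = pd.to_datetime(dob, format='%d %b').dt.month.apply(lambda m: calendar.month_name[m])

print(months.value_counts().to_dict())  # {'March': 2, 'November': 1}
```

The resulting `value_counts()` series is exactly what feeds the pie chart.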
## Dependencies

1. **Python**: Make sure Python is installed on your system. You can download the latest version from the official Python website (https://www.python.org) and follow the installation instructions for your operating system.

2. **pip**: Check whether pip is installed by running the following command in your command-line interface or terminal:

```sh
pip --version
```

   If pip is not installed, you can install it by following the instructions provided on the official Python website.

3. **Chrome browser**: Ensure that the Chrome web browser is installed on your system. (The provided code has been tested on Chrome version 114.0.5735.198 (Official Build) (64-bit).)

4. **ChromeDriver**: Make sure the ChromeDriver version matches the Chrome browser version installed on your system. You can download ChromeDriver from the official ChromeDriver website (https://sites.google.com/a/chromium.org/chromedriver/downloads) and follow the installation instructions. (A matching driver is also included in this repository.)

5. **Required libraries**: Once you have fulfilled the above prerequisites, install the necessary libraries using pip:

```sh
pip install -r requirements.txt
```
## Usage

1. Modify the configuration file (config.ini) by replacing **<YOUR_COMPANY_NAME>** with your company name, **<YOUR_USERNAME>** with your username, and **<YOUR_PASSWORD>** with your Greythr password.
2. To run the script, use the following command:

```sh
python3 scrap.py
```
## License

**Free Software, Hell Yeah!**

## Authors

- [Rahul Gupta](https://github.com/rahulelex)

chromedriver_linux64/LICENSE.chromedriver (+6,287): large diff not rendered.

chromedriver_linux64/chromedriver (13.2 MB): binary file not shown.

config.ini (+10)

[DEFAULT]
xlsx_file_name = employee_data.xlsx
login_url = https://<YOUR_COMPANY_NAME>.greythr.com/
employee_url = https://<YOUR_COMPANY_NAME>.greythr.com/v3/portal/ess/people/directory
data_url = https://<YOUR_COMPANY_NAME>.greythr.com/v3/api/employee/list?page=
page_size = 250

[database]
username = <YOUR_USERNAME>
password = <YOUR_PASSWORD>
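scrap.py reads these values with Python's standard `configparser` module. A minimal sketch of how the lookups behave, using a throwaway in-memory config in the same shape as config.ini (the values here are placeholders, not real credentials):

```python
import configparser

# A throwaway config mirroring the structure of config.ini above
text = """
[DEFAULT]
xlsx_file_name = employee_data.xlsx
page_size = 250

[database]
username = demo_user
password = demo_pass
"""

config = configparser.ConfigParser()
config.read_string(text)

# DEFAULT values are inherited by every section, and get() always returns strings
print(config.get('DEFAULT', 'xlsx_file_name'))  # employee_data.xlsx
print(config.get('database', 'username'))       # demo_user
print(config.get('database', 'page_size'))      # 250 (inherited from DEFAULT)
```

Note that `page_size` comes back as the string `'250'`, which is why the script can concatenate it directly into the request URL.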

images/image1.png (44.1 KB)

images/image2.png (147 KB)

requirements.txt (+5)

requests
pandas
selenium
matplotlib
configparser

scrap.py (+190)
import calendar
import configparser
import json
import os
import time

import matplotlib.pyplot as plt
import pandas as pd
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class DataExporter:
    def __init__(self):
        # Create a ConfigParser object and read the configuration file
        self.config = configparser.ConfigParser()
        self.config.read('config.ini')

        # Access the configuration values
        self.xlsx_file_path = self.config.get('DEFAULT', 'xlsx_file_name')
        self.username = self.config.get('database', 'username')
        self.password = self.config.get('database', 'password')
        self.driver = None
        self.access_token = None
        self._ga = None
        self._gid = None
        self.max_number_of_pages = 10
        self.login_url = self.config.get('DEFAULT', 'login_url')
        self.employee_url = self.config.get('DEFAULT', 'employee_url')
        self.data_url = self.config.get('DEFAULT', 'data_url')
        self.page_size = self.config.get('DEFAULT', 'page_size')

    def login(self):
        # Configure Chrome options for headless browsing
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # Run Chrome in headless mode
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chromedriver_path = 'chromedriver_linux64/chromedriver'
        service = Service(chromedriver_path)
        driver = webdriver.Chrome(service=service, options=chrome_options)
        print('Login process initiated')
        driver.get(self.login_url)
        time.sleep(5)

        username_input = driver.find_element(By.ID, 'username')
        password_input = driver.find_element(By.ID, 'password')
        username_input.send_keys(self.username)
        password_input.send_keys(self.password)

        button = driver.find_element(By.CSS_SELECTOR, 'button[type="submit"].bg-primary')
        button.click()

        time.sleep(2)

        # Navigate to the employee directory so the session cookies are set
        driver.get(self.employee_url)

        cookies = driver.get_cookies()
        for cookie in cookies:
            name = cookie.get('name')
            if name == 'access_token':
                self.access_token = cookie.get('value')
            elif name == '_ga':
                self._ga = cookie.get('value')
            elif name == '_gid':
                self._gid = cookie.get('value')

        time.sleep(2)

        if self.access_token and self._ga and self._gid:
            print('LOGIN SUCCESS')
        else:
            print('LOGIN FAILED')
            raise SystemExit(1)

    def fetch_data(self):
        payload = json.dumps({
            "cat::search": {}
        })

        headers = {
            'content-type': 'application/json',
            'cookie': ('access_token=' + self.access_token + '; _ga=' + self._ga +
                       '; _gid=' + self._gid +
                       '; _dc_gtm_UA-642192-18=1; _hjIncludedInSessionSample=1'),
        }

        session = requests.session()

        if os.path.exists(self.xlsx_file_path):
            print('File already exists, removing it.')
            os.remove(self.xlsx_file_path)

        # Collect one single-row DataFrame per employee record
        dfs = []

        for index in range(self.max_number_of_pages):
            response = session.request(
                'POST',
                self.data_url + str(index) + '&pageSize=' + str(self.page_size),
                headers=headers, data=payload)
            page = json.loads(response.text)

            if page.get('results'):
                for record in page.get('results'):
                    e_data = {
                        'Name': record.get('name'),
                        'DOB': record.get('dob'),
                        'Designation': record.get('c_designation'),
                        'Employee No': record.get('employeeno'),
                        'Employee ID': record.get('employeeid'),
                        'Email': record.get('email'),
                    }
                    # Append a DataFrame for the current record to the list
                    dfs.append(pd.DataFrame([e_data]))

        session.close()

        # Concatenate all DataFrames and write the result to an Excel file
        df_merged = pd.concat(dfs, ignore_index=True)
        df_merged.to_excel(self.xlsx_file_path, index=False)
        print('Data appended to the Excel file successfully.')

        # Re-read the xlsx file, drop duplicate rows, and write it back
        df = pd.read_excel(self.xlsx_file_path, engine='openpyxl')
        df.drop_duplicates(inplace=True)
        df.to_excel(self.xlsx_file_path, index=False)
        print('Data cleaned - removed duplicate data')

    def show_charts(self):
        df = pd.read_excel(self.xlsx_file_path, engine='openpyxl')

        # Bar graph of the employee count per designation
        value_counts = df['Designation'].value_counts()
        plt.figure(1)
        value_counts.plot(kind='bar')
        plt.title('Designation chart')
        plt.xlabel('Designation')
        plt.ylabel('Count')

        # Pie chart of the month each employee was born in
        df['Month'] = pd.to_datetime(df['DOB'], format='%d %b').dt.month
        df['Month'] = df['Month'].astype('Int64')  # nullable integer dtype

        # Replace numeric month values with month names
        df['Month'] = df['Month'].apply(lambda m: calendar.month_name[m] if pd.notnull(m) else '')

        # Count the occurrences of each month and plot the pie chart
        month_counts = df['Month'].value_counts()
        plt.figure(2)
        plt.pie(month_counts, labels=month_counts.index, autopct='%1.1f%%')
        plt.title('Month-wise birthday distribution')

        # Show both charts
        plt.show()


def main():
    exporter = DataExporter()
    exporter.login()
    exporter.fetch_data()
    exporter.show_charts()


if __name__ == '__main__':
    main()
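The script deduplicates by writing every scraped record and then dropping duplicate rows, which handles records that appear on more than one page. The core of that pattern, with hypothetical employee rows built the same way fetch_data() builds them:

```python
import pandas as pd

# Hypothetical records, one single-row DataFrame each; 'A' appears twice,
# as it would if two pages of results overlapped
dfs = [
    pd.DataFrame([{'Name': 'A', 'Designation': 'Engineer'}]),
    pd.DataFrame([{'Name': 'B', 'Designation': 'Manager'}]),
    pd.DataFrame([{'Name': 'A', 'Designation': 'Engineer'}]),
]

# Merge all rows, then drop exact-duplicate rows (first occurrence is kept)
merged = pd.concat(dfs, ignore_index=True).drop_duplicates()

print(merged['Name'].tolist())  # ['A', 'B']
```

Fetching more pages than exist is therefore harmless: duplicate or empty pages contribute nothing to the final sheet.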
