Commit fd92464

Add files via upload
1 parent 89b4f7c commit fd92464

3 files changed: +155 -2 lines changed

README.md

+102 -2

# Web Scraping with Python Selenium
[<img src="https://img.shields.io/static/v1?label=&message=python&color=brightgreen" />](https://github.com/topics/python) [<img src="https://img.shields.io/static/v1?label=&message=selenium&color=blue" />](https://github.com/topics/selenium) [<img src="https://img.shields.io/static/v1?label=&message=Web%20Scraping&color=important" />](https://github.com/topics/web-scraping)

- [Installing Selenium](#installing-selenium)
- [Testing](#testing)
- [Scraping with Selenium](#scraping-with-selenium)

In this article, we’ll cover an overview of web scraping with Selenium using a real-life example.

For a detailed tutorial on Selenium, see [our blog](https://oxylabs.io/blog/selenium-web-scraping).
## Installing Selenium

1. Create a virtual environment and activate it:

```sh
python3 -m venv .env
source .env/bin/activate
```

2. Install Selenium using pip:

```sh
pip install selenium
```

3. Install the Selenium Web Driver for your browser. See [this page](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/) for details.
## Testing

With the virtual environment activated, start the Python interactive shell by typing `python3`. Enter the following command at the prompt:

```python
>>> from selenium.webdriver import Chrome
```

If there are no errors, move on to the next step. If there is an error, ensure that `chromedriver` is added to the PATH.
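As a quick sanity check before launching a browser, a stdlib-only helper can confirm the driver binary is discoverable on the PATH. This is a sketch; the binary name `chromedriver` is the common default but may differ per platform (e.g. `chromedriver.exe` on Windows):

```python
import shutil


def driver_on_path(binary: str = "chromedriver") -> bool:
    """Return True if the given driver binary can be found on the PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    print(f"chromedriver on PATH: {driver_on_path()}")
```

If this prints `False`, move the driver into a directory listed in your PATH (or extend the PATH) before proceeding.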
## Scraping with Selenium

Import required modules as follows:

```python
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
```
Add the skeleton of the script as follows:

```python
def get_data(url) -> list:
    ...


def main():
    ...


if __name__ == '__main__':
    main()
```
Create a `ChromeOptions` object and set `headless` to `True`. Use it to create an instance of `Chrome`:

```python
browser_options = ChromeOptions()
browser_options.headless = True

driver = Chrome(options=browser_options)
```
Call the `driver.get` method to load a URL. After that, locate the link for the Humor section by link text and click it:

```python
driver.get(url)

element = driver.find_element(By.LINK_TEXT, "Humor")
element.click()
```
Create a CSS selector to find all books on this page. After that, run a loop over the books and extract the book title, price, and stock availability. Use a dictionary to store one book's information and add all these dictionaries to a list. See the code below:

```python
books = driver.find_elements(By.CSS_SELECTOR, ".product_pod")
data = []
for book in books:
    title = book.find_element(By.CSS_SELECTOR, "h3 > a")
    price = book.find_element(By.CSS_SELECTOR, ".price_color")
    stock = book.find_element(By.CSS_SELECTOR, ".instock.availability")
    book_item = {
        'title': title.get_attribute("title"),
        'price': price.text,
        'stock': stock.text
    }
    data.append(book_item)
```
Lastly, call `driver.quit()` and return the `data` list from this function.

For the complete code, see [main.py](src/main.py).

For a detailed tutorial on Selenium, see [our blog](https://oxylabs.io/blog/selenium-web-scraping).
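The list of dictionaries that `get_data` returns is easy to persist. A minimal sketch using the stdlib `csv` module, with illustrative sample records in the same shape the scraper produces (the records and the file name `books.csv` are assumptions for the example):

```python
import csv

# Illustrative records in the shape get_data() returns.
data = [
    {'title': 'Sample Book', 'price': '£12.99', 'stock': 'In stock'},
    {'title': 'Another Book', 'price': '£7.50', 'stock': 'In stock'},
]

# Write one CSV row per book, with a header row first.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'stock'])
    writer.writeheader()
    writer.writerows(data)
```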

src/main.py

+38

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By


def get_data(url) -> list:
    browser_options = ChromeOptions()
    browser_options.headless = True

    driver = Chrome(options=browser_options)
    driver.get(url)

    element = driver.find_element(By.LINK_TEXT, "Humor")
    element.click()

    books = driver.find_elements(By.CSS_SELECTOR, ".product_pod")
    data = []
    for book in books:
        title = book.find_element(By.CSS_SELECTOR, "h3 > a")
        price = book.find_element(By.CSS_SELECTOR, ".price_color")
        stock = book.find_element(By.CSS_SELECTOR, ".instock.availability")
        book_item = {
            'title': title.get_attribute("title"),
            'price': price.text,
            'stock': stock.text
        }
        data.append(book_item)

    driver.quit()
    return data


def main():
    data = get_data("https://books.toscrape.com/")
    print(data)


if __name__ == '__main__':
    main()

src/requirements.txt

+15

async-generator==1.10
attrs==22.1.0
certifi==2022.9.24
exceptiongroup==1.0.0
h11==0.14.0
idna==3.4
outcome==1.2.0
PySocks==1.7.1
selenium==4.5.0
sniffio==1.3.0
sortedcontainers==2.4.0
trio==0.22.0
trio-websocket==0.9.2
urllib3==1.26.12
wsproto==1.2.0
