
Commit 8098e23 (initial commit, 0 parents)

    initial set of files for pycon 2014 :)

16 files changed, 1142 insertions(+), 0 deletions(-)

.gitignore

Lines changed: 2 additions & 0 deletions
*~
*.pyc

README.md

Lines changed: 43 additions & 0 deletions
PyCon Introduction to Web and Data Scraping Tutorial
====================================================

A tutorial-based introduction to web scraping with Python.

Virtual Env
-----------

If you'd like to use virtual environments, follow the instructions below. This is not required for the tutorial, but it may be helpful.

For more details, see the [virtualenvwrapper documentation](http://www.doughellmann.com/projects/virtualenvwrapper/).

If you don't have virtualenvwrapper and/or pip:

    $ easy_install pip
    $ pip install virtualenvwrapper

and read the additional installation instructions [here](http://virtualenvwrapper.readthedocs.org/en/latest/install.html). Then:

    $ mkvirtualenv scraper_tutorial
    $ pip install -r requirements.txt


LXML and Selenium
-----------------

You will need both [LXML](http://lxml.de/) and [Selenium](http://selenium-python.readthedocs.org/en/latest/index.html) to follow this tutorial in its entirety.

If you are using a Mac, I would highly recommend [Homebrew](http://brew.sh/). It will make installing these libraries with pip *very easy*.

* [More help on installing LXML on Mac](http://lxml.de/installation.html#installation)
* [Additional suggestions for LXML on Mac](http://stackoverflow.com/questions/1277124/how-do-you-install-lxml-on-os-x-leopard-without-using-macports-or-fink)

If you are using Windows, it might be worth running this within a Linux virtual machine. If you are a Windows + Python guru, follow these installation instructions; I can help as needed, but I have not programmed on Windows in more than 5 years.

* [Installing Selenium on Windows](http://selenium-python.readthedocs.org/en/latest/installation.html#detailed-instructions-for-windows-users)
* [Installing LXML on Windows](http://lxml.de/installation.html#ms-windows)
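Once both libraries are installed, a quick smoke test along these lines should run cleanly (a minimal sketch, not one of the tutorial files; the URL and the Firefox driver are placeholder choices):

    import lxml.html
    from selenium import webdriver

    # lxml: parse a small HTML string and read the heading text back out.
    tree = lxml.html.fromstring('<html><body><h1>hello</h1></body></html>')
    print tree.findtext('.//h1')  # hello

    # Selenium: drive a real browser, load a page, and print its title.
    browser = webdriver.Firefox()  # assumes Firefox is installed
    browser.get('http://www.python.org')
    print browser.title
    browser.quit()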
Please reach out to me if you have any questions about getting the initial requirements set up. Thanks!

Questions?
----------

/msg kjam on freenode or @kjam on twitter

bs_scraper.py

Lines changed: 81 additions & 0 deletions
import urllib2
import smtplib

from email.mime.text import MIMEText  # modern path; email.MIMEText is deprecated
from bs4 import BeautifulSoup

GMAIL_LOGIN = 'pyladiestest@gmail.com'
GMAIL_PASSWORD = 'YOU NO CAN HAZ'


def send_email(subject, message, from_addr=GMAIL_LOGIN, to_addr=GMAIL_LOGIN):
    """Send a plain-text email through Gmail's SMTP server."""
    msg = MIMEText(message)
    msg['Subject'] = subject
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Reply-To'] = 'happyhours@noreply.com'

    server = smtplib.SMTP('smtp.gmail.com', 587)  # port 465 or 587
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(GMAIL_LOGIN, GMAIL_PASSWORD)
    server.sendmail(from_addr, to_addr, msg.as_string())
    server.close()


def get_site_html(url):
    """Return the raw HTML source of the page at url."""
    source = urllib2.urlopen(url).read()
    return source


def get_tree(url):
    """Fetch a page and parse it into a BeautifulSoup tree."""
    source = get_site_html(url)
    tree = BeautifulSoup(source)
    return tree


if __name__ == '__main__':

    stuff_i_like = ['burger', 'wine', 'sushi', 'sweet potato fries', 'BBQ']
    found_happy_hours = []
    my_happy_hours = []

    # First, I'm going to identify the areas of the page I want to look at
    tables = get_tree(
        'http://www.downtownla.com/3_10_happyHours.asp?action=ALL')

    # Then, I'm going to sort out the *exact* parts of the page
    # that match what I'm looking for...
    for t in tables.findAll('p', {'class': 'calendar_EventTitle'}):
        text = t.text
        for s in t.findNextSiblings():
            text += '\n' + s.text
        found_happy_hours.append(text)

    print "The scraper found %d happy hours!" % len(found_happy_hours)

    # Now I'm going to loop through the food I like
    # and see if any of the happy hour descriptions match
    for food in stuff_i_like:
        for hh in found_happy_hours:
            # checking for text AND making sure I don't have duplicates
            if food in hh and hh not in my_happy_hours:
                print "YAY! I found some %s!" % food
                my_happy_hours.append(hh)

    print "I think you might like %d of them, yipeeeee!" % len(my_happy_hours)

    # Now, let's make a mail message we can read:
    message = 'Hey Katharine,\n\n\n'
    message += 'OMG, I found some stuff for you in Downtown, take a look.\n\n'
    message += '==============================\n'.join(my_happy_hours)
    message = message.encode('utf-8')
    # To read more about encoding:
    # http://diveintopython.org/xml_processing/unicode.html
    message = message.replace('\t', '').replace('\r', '')
    message += '\n\nXOXO,\n Your Py Script'

    # And email it to ourselves!
    email = 'katharine@pyladies.com'
    send_email('Happy Hour Update', message, from_addr=GMAIL_LOGIN,
               to_addr=email)
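The selection pattern above (find each event title, then fold in the sibling tags that follow it) is easier to see on a tiny snippet. A minimal sketch, not part of the commit: the HTML is made up, and 'p' is passed to findNextSiblings explicitly so the demo only walks tags:

    from bs4 import BeautifulSoup

    snippet = '''
    <p class="calendar_EventTitle">Taco Night</p>
    <p>Half-price margaritas</p>
    <p>5pm to 7pm</p>
    '''
    soup = BeautifulSoup(snippet)
    # Each title starts a record; the following sibling tags are its details.
    for title in soup.findAll('p', {'class': 'calendar_EventTitle'}):
        text = title.text
        for sib in title.findNextSiblings('p'):
            text += '\n' + sib.text
        print text
    # prints:
    # Taco Night
    # Half-price margaritas
    # 5pm to 7pm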

csv_scraper.py

Lines changed: 10 additions & 0 deletions
from csv import DictReader
from datetime import datetime

with open('/home/katharine/Downloads/schedule.csv') as document:
    reader = DictReader(document)
    for row in reader:
        day = datetime.strptime(row.get('START_DATE'), '%m/%d/%y')
        # weekday() numbers Monday as 0, so > 4 means Saturday or Sunday
        if 'PNC' in row.get('LOCATION') and day.weekday() > 4:
            print 'HOME WEEKEND GAME!! %s on %s' % (
                row.get('SUBJECT'), row.get('START_DATE'))
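schedule.csv itself isn't in the commit, but from the columns the script reads (SUBJECT, START_DATE, LOCATION), it presumably looks something like this; these rows are hypothetical:

    SUBJECT,START_DATE,LOCATION
    Pirates vs. Reds,06/21/14,PNC Park
    Pirates vs. Mets,05/27/14,Citi Field

Only the first row would print: its LOCATION contains 'PNC' and 06/21/14 falls on a Saturday.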

data/.~lock.crunchbase.xlsx#

Lines changed: 1 addition & 0 deletions
katharine ,katharine,kjamistan,03.04.2014 11:59,file:///home/katharine/.config/libreoffice/4;

data/crunchbase.xlsx

30.4 MB
Binary file not shown.
