MSc Methods & Statistics
Working at Jibes Data Analytics
Open source projects:
20 lines mime vs ..>
rather than a single page, consider the whole domain
don't abuse
gather from reddit, github, pypi, BDFL, twitter, stackoverflow
latest and greatest: xtoy
yagmail      send emails in 2 lines (html/attach)    246
sky          next-gen intelligent web scraping        57
gittyleaks   find users/keys/pass in git repos        18
pytrending   discover trending python                 10
xtoy         automatic prep/model/predict              2

Interesting for python because:
"big data"
cloud
scraping
sending email

{"domain": "http://www.gtbit.org",
"url": "http://www.gtbit.org/news/viewitem.php?id=40",
"injectable": true,
"on line": true,
"error": false,
"at line": false,
"time": "Wed Oct 28 00:59:39 2015",
"warning": true,
"failed_request": false,
"emails": ["gtbit@rediffmail.com", "inderjeet@gmail.com"],
"sql": true}
control micro fleet
script being run on aws
gather from S3 using GreenPool, do the computations
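The gather-then-compute step above can be sketched with a bounded pool of concurrent workers. This is only an illustration: it uses the standard library's ThreadPoolExecutor as a stand-in for GreenPool, and FAKE_BUCKET, fetch, compute, process and run are hypothetical names that read from an in-memory dict instead of S3.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an S3 bucket: key -> file contents.
FAKE_BUCKET = {
    "part-0": "http://a.example/item.php?id=1\nhttp://a.example/about",
    "part-1": "http://b.example/index.html",
    "part-2": "http://c.example/view.php?id=7",
}


def fetch(key):
    """Hypothetical download step; a real version would read from S3."""
    return FAKE_BUCKET[key]


def compute(text):
    """Keep only the lines (URLs) that contain 'php?id='."""
    return [line for line in text.splitlines() if "php?id=" in line]


def process(key):
    """Fetch one object and run the computation on it."""
    return key, compute(fetch(key))


def run(keys, pool_size=2):
    """Process all keys with at most pool_size concurrent workers."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return dict(pool.map(process, keys))


results = run(sorted(FAKE_BUCKET))
```

The pool size caps how many downloads are in flight at once, which is the same role the pool plays when the fetches are real network reads.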
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)
for wet_path in wetpaths:
    swp = slugger(wet_path)
    if swp in dones:
        continue
    t1 = time.time()
    results = []
    # Start a connection to one of the WARC files
    k = Key(pds, wet_path)
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    for i, record in enumerate(f):
        if record.url is not None and 'php?id=' in record.url:
            results.append(record.url)
    print(time.time() - t1)
    save_file_s3('\n'.join(results), swp)
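To see what the two filters in the snippet pick up, here is a small standalone sketch. The regex and the 'php?id=' test are copied from the code above; the sample text, the sample URLs and the example.com/example.org addresses are made up for illustration.

```python
import re

# Same pattern as in the snippet: "part" is any run of characters
# that are not ? @ space > < ' " : \ or /
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)

# Made-up page text, for illustration only.
sample_text = 'mail alice@example.com or <bob@example.org> for info'
found_emails = email_re.findall(sample_text)

# Made-up URLs; only those containing 'php?id=' are kept.
sample_urls = [
    'http://site.example/news/viewitem.php?id=40',
    'http://site.example/about.html',
    None,
]
found_urls = [u for u in sample_urls if u is not None and 'php?id=' in u]

print(found_emails)
print(found_urls)
```

The pattern is deliberately loose: it does not validate addresses, it only finds text shaped like something@something.something.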