Commit 9394ef5

oxyjonas committed: fix code style

1 parent 1cec922 commit 9394ef5

File tree

3 files changed: +139 -82 lines changed

README.md

Lines changed: 95 additions & 47 deletions
````diff
@@ -1,80 +1,105 @@
 # Integrating Oxylabs' Residential Proxies with AIOHTTP
-[<img src="https://img.shields.io/static/v1?label=&message=Python&color=brightgreen" />](https://github.com/topics/python) [<img src="https://img.shields.io/static/v1?label=&message=Web%20Scraping&color=important" />](https://github.com/topics/web-scraping) [<img src="https://img.shields.io/static/v1?label=&message=Residential%20Proxy&color=blueviolet" />](https://github.com/topics/residential-proxy) [<img src="https://img.shields.io/static/v1?label=&message=Aiohttp&color=blue" />](https://github.com/topics/aiohttp) [<img src="https://img.shields.io/static/v1?label=&message=Asyncio&color=yellow" />](https://github.com/topics/asyncio)
+[<img src="https://img.shields.io/static/v1?label=&message=Python&color=brightgreen" />](https://github.com/topics/python)
+[<img src="https://img.shields.io/static/v1?label=&message=Web%20Scraping&color=important" />](https://github.com/topics/web-scraping)
+[<img src="https://img.shields.io/static/v1?label=&message=Residential%20Proxy&color=blueviolet" />](https://github.com/topics/residential-proxy)
+[<img src="https://img.shields.io/static/v1?label=&message=Aiohttp&color=blue" />](https://github.com/topics/aiohttp)
+[<img src="https://img.shields.io/static/v1?label=&message=Asyncio&color=yellow" />](https://github.com/topics/asyncio)

 ## Requirements for the Integration
-For the integration to work you'll need to install `aiohttp` library, use `Python 3.6` version or higher and Residential Proxies. <br> If you don't have `aiohttp` library, you can install it by using `pip` command:
+
+For the integration to work you'll need to install `aiohttp` library, use `Python 3.6`
+version or higher and Residential Proxies. <br> If you don't have `aiohttp` library,
+you can install it by using `pip` command:
+
 ```bash
 pip install aiohttp
 ```
+
 You can get Residential Proxies here: https://oxylabs.io/products/residential-proxy-pool

 ## Proxy Authentication
+
 There are 2 ways to authenticate proxies with `aiohttp`.<br>
-The first way is to authorize and pass credentials along with the proxy URL using `aiohttp.BasicAuth`:
+The first way is to authorize and pass credentials along with the proxy URL
+using `aiohttp.BasicAuth`:
+
 ```python
+import aiohttp
+
 USER = "user"
 PASSWORD = "pass"
 END_POINT = "pr.oxylabs.io:7777"

 async def fetch():
     async with aiohttp.ClientSession() as session:
-        proxy_auth = aiohttp.BasicAuth(USER, PASS)
-        async with session.get("http://ip.oxylabs.io",
-            proxy="http://pr.oxylabs.io:7777",
-            proxy_auth=proxy_auth
+        proxy_auth = aiohttp.BasicAuth(USER, PASSWORD)
+        async with session.get(
+            "http://ip.oxylabs.io",
+            proxy="http://pr.oxylabs.io:7777",
+            proxy_auth=proxy_auth,
         ) as resp:
             print(await resp.text())
 ```
+
 The second one is by passing authentication credentials in proxy URL:
+
 ```python
+import aiohttp
+
 USER = "user"
 PASSWORD = "pass"
 END_POINT = "pr.oxylabs.io:7777"

 async def fetch():
     async with aiohttp.ClientSession() as session:
-        async with session.get("http://ip.oxylabs.io",
-            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}"
+        async with session.get(
+            "http://ip.oxylabs.io",
+            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
         ) as resp:
             print(await resp.text())
 ```
-In order to use your own proxies, adjust `user` and `pass` fields with your Oxylabs account credentials.
+
+In order to use your own proxies, adjust `user` and `pass` fields with your
+Oxylabs account credentials.

 ## Testing Proxies
+
 To see if the proxy is working, try visiting https://ip.oxylabs.io.
-If everything is working correctly, it will return an IP address of a proxy that you're currently using.
+If everything is working correctly, it will return an IP address of a proxy
+that you're currently using.

 ## Sample Project: Extracting Data From Multiple Pages
-To better understand how residential proxies can be utilized for asynchronous data extracting operations, we wrote a sample project to scrape product listing data and save the output to a `CSV` file. The proxy rotation allows us to send multiple requests at once risk-free – meaning that we don't need to worry about CAPTCHA or getting blocked. This makes the web scraping process extremely fast and efficient – now you can extract data from thousands of products in a matter of seconds!
+
+To better understand how residential proxies can be utilized for asynchronous
+data extracting operations, we wrote a sample project to scrape product listing
+data and save the output to a `CSV` file. The proxy rotation allows us to send
+multiple requests at once risk-free – meaning that we don't need to worry about
+CAPTCHA or getting blocked. This makes the web scraping process extremely fast
+and efficient – now you can extract data from thousands of products in a matter
+of seconds!
+
 ```python
 import asyncio
 import time
 import sys
 import os

-from bs4 import BeautifulSoup
-import pandas as pd
 import aiohttp
+import pandas as pd
+from bs4 import BeautifulSoup

 USER = "user"
 PASSWORD = "pass"
 END_POINT = "pr.oxylabs.io:7777"

-# Generate a list of URLs to scrape
+# Generate a list of URLs to scrape.
 url_list = [
-    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
-    for page_num
-    in range(1, 51)
+    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
+    for page_num in range(1, 51)
 ]

-async def fetch(session, sem, url):
-    async with sem:
-        async with session.get(url,
-            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}"
-        ) as response:
-            await parse_url(await response.text())

-async def parse_url(text):
+async def parse_data(text, results_list):
     soup = BeautifulSoup(text, "lxml")
     for product_data in soup.select("ol.row > li > article.product_pod"):
         data = {
@@ -83,39 +108,62 @@ async def parse_url(text):
             "product_price": product_data.select_one("p.price_color").text,
             "stars": product_data.select_one("p")["class"][1],
         }
-        final_list.append(data)
-        print(f"Grabing book: {data['title']}")
-
-async def create_jobs():
-    final_res = []
+        results_list.append(data)  # Fill results_list by reference.
+        print(f"Extracted data for a book: {data['title']}")
+
+
+async def fetch(session, sem, url, results_list):
+    async with sem:
+        async with session.get(
+            url,
+            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
+        ) as response:
+            await parse_data(await response.text(), results_list)
+
+
+async def create_jobs(results_list):
     sem = asyncio.Semaphore(4)
     async with aiohttp.ClientSession() as session:
-        await asyncio.gather(*[fetch(session, sem, url)
-            for url in url_list
-        ])
-
+        await asyncio.gather(
+            *[fetch(session, sem, url, results_list) for url in url_list]
+        )
+
+
 if __name__ == "__main__":
-    final_list = []
+    results = []
     start = time.perf_counter()
-    # Different Event Loop Policy must be loaded if you're using Windows OS
-    # This helps to avoid "Event Loop is closed" error
+
+    # Different EventLoopPolicy must be loaded if you're using Windows OS.
+    # This helps to avoid "Event Loop is closed" error.
     if sys.platform.startswith("win") and sys.version_info.minor >= 8:
         asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+
     try:
-        asyncio.run(create_jobs())
-    except Exception:
+        asyncio.run(create_jobs(results))
+    except Exception as e:
+        print(e)
         print("We broke, but there might still be some results")
-
-    print(f"""\nTotal of {len(final_list)} products gathered in
-    {time.perf_counter() - start:.2f} seconds")"""
-    df = pd.DataFrame(final_list)
-    df["url"] = df["url"].map(lambda x: ''.join(["https://books.toscrape.com/catalogue", x]))
+
+    print(
+        f"\nTotal of {len(results)} products from {len(url_list)} pages "
+        f"gathered in {time.perf_counter() - start:.2f} seconds.",
+    )
+    df = pd.DataFrame(results)
+    df["url"] = df["url"].map(
+        lambda x: "".join(["https://books.toscrape.com/catalogue", x])
+    )
     filename = "scraped-books.csv"
-    df.to_csv(filename, encoding='utf-8-sig', index=False)
+    df.to_csv(filename, encoding="utf-8-sig", index=False)
     print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}")
 ```
-If you want to test the project's script by yourself, you'll need to install some additional packages. To do that, simply download `requirements.txt` file and use `pip` command:
+
+If you want to test the project's script by yourself, you'll need to install
+some additional packages. To do that, simply download `requirements.txt` file
+and use `pip` command:
+
 ```bash
 pip install -r requirements.txt
 ```
-If you're having any trouble integrating proxies with `aiohttp` and this guide didn't help you - feel free to contact Oxylabs customer support at support@oxylabs.io.
+
+If you're having any trouble integrating proxies with `aiohttp` and this guide
+didn't help you - feel free to contact Oxylabs customer support at support@oxylabs.io.
````
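The README's authentication snippets define `fetch()` but never show it being invoked. A minimal sketch of one way to run the `aiohttp.BasicAuth` example end to end is shown below; it is not part of the commit, and `USER`/`PASSWORD` are placeholders for real Oxylabs credentials.

```python
# Hypothetical entry point for the BasicAuth example above; not part of the commit.
# USER and PASSWORD are placeholders for real Oxylabs credentials.
import asyncio

import aiohttp

USER = "user"
PASSWORD = "pass"


async def fetch():
    async with aiohttp.ClientSession() as session:
        proxy_auth = aiohttp.BasicAuth(USER, PASSWORD)
        async with session.get(
            "http://ip.oxylabs.io",
            proxy="http://pr.oxylabs.io:7777",
            proxy_auth=proxy_auth,
        ) as resp:
            # ip.oxylabs.io returns the IP the request arrived from, so this
            # should print the proxy's exit IP rather than your own.
            print(await resp.text())


if __name__ == "__main__":
    asyncio.run(fetch())
```

As the "Testing Proxies" section notes, a successful run prints the IP address of the proxy currently in use.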

requirements.txt

Lines changed: 0 additions & 1 deletion
````diff
@@ -1,4 +1,3 @@
-openpyxl
 pandas
 bs4
 aiohttp
````

sample_project.py

Lines changed: 44 additions & 34 deletions
````diff
@@ -3,29 +3,22 @@
 import sys
 import os

-from bs4 import BeautifulSoup
-import pandas as pd
 import aiohttp
+import pandas as pd
+from bs4 import BeautifulSoup

 USER = "user"
 PASSWORD = "pass"
 END_POINT = "pr.oxylabs.io:7777"

-# Generate a list of URLs to scrape
+# Generate a list of URLs to scrape.
 url_list = [
-    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
-    for page_num
-    in range(1, 51)
+    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
+    for page_num in range(1, 51)
 ]

-async def fetch(session, sem, url):
-    async with sem:
-        async with session.get(url,
-            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}"
-        ) as response:
-            await parse_url(await response.text())

-async def parse_url(text):
+async def parse_data(text, results_list):
     soup = BeautifulSoup(text, "lxml")
     for product_data in soup.select("ol.row > li > article.product_pod"):
         data = {
@@ -34,33 +27,50 @@ async def parse_url(text):
             "product_price": product_data.select_one("p.price_color").text,
             "stars": product_data.select_one("p")["class"][1],
         }
-        final_list.append(data)
-        print(f"Grabing book: {data['title']}")
-
-async def create_jobs():
-    final_res = []
+        results_list.append(data)  # Fill results_list by reference.
+        print(f"Extracted data for a book: {data['title']}")
+
+
+async def fetch(session, sem, url, results_list):
+    async with sem:
+        async with session.get(
+            url,
+            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
+        ) as response:
+            await parse_data(await response.text(), results_list)
+
+
+async def create_jobs(results_list):
     sem = asyncio.Semaphore(4)
     async with aiohttp.ClientSession() as session:
-        await asyncio.gather(*[fetch(session, sem, url)
-            for url in url_list
-        ])
-
+        await asyncio.gather(
+            *[fetch(session, sem, url, results_list) for url in url_list]
+        )
+
+
 if __name__ == "__main__":
-    final_list = []
+    results = []
     start = time.perf_counter()
-    # Different Event Loop Policy must be loaded if you're using Windows OS
-    # This helps to avoid "Event Loop is closed" error
+
+    # Different EventLoopPolicy must be loaded if you're using Windows OS.
+    # This helps to avoid "Event Loop is closed" error.
     if sys.platform.startswith("win") and sys.version_info.minor >= 8:
         asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+
     try:
-        asyncio.run(create_jobs())
-    except Exception:
+        asyncio.run(create_jobs(results))
+    except Exception as e:
+        print(e)
         print("We broke, but there might still be some results")
-
-    print(f"""\nTotal of {len(final_list)} products gathered in
-    {time.perf_counter() - start:.2f} seconds")"""
-    df = pd.DataFrame(final_list)
-    df["url"] = df["url"].map(lambda x: ''.join(["https://books.toscrape.com/catalogue", x]))
+
+    print(
+        f"\nTotal of {len(results)} products from {len(url_list)} pages "
+        f"gathered in {time.perf_counter() - start:.2f} seconds.",
+    )
+    df = pd.DataFrame(results)
+    df["url"] = df["url"].map(
+        lambda x: "".join(["https://books.toscrape.com/catalogue", x])
+    )
     filename = "scraped-books.csv"
-    df.to_csv(filename, encoding='utf-8-sig', index=False)
-    print(f"\nExtracted data can be accessed at {os.path.join(os.getcwd(), filename)}")
+    df.to_csv(filename, encoding="utf-8-sig", index=False)
+    print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}")
````
