Some sites accept your requests at first but ban your IP after a while. The Requests library's answer to this is proxy support:
```python
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('https://www.taobao.com', proxies=proxies)
```
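Besides the `proxies=` argument, Requests also picks up proxies from the standard `HTTP_PROXY`/`HTTPS_PROXY` environment variables, so you can configure them once per process instead of on every call. A minimal sketch (the proxy addresses are the same placeholders as above):

```python
import os
import urllib.request

# Setting these environment variables makes Requests (and urllib) route
# traffic through the proxy without passing proxies= on every call.
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:1080'

# urllib's getproxies() shows which proxies the environment now provides;
# Requests consults the same settings when trust_env is enabled (the default).
print(urllib.request.getproxies())
```

Note that per-call `proxies=` still takes precedence over the environment.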
If the proxy requires HTTP Basic Auth, put the credentials in the proxy URL:
```python
import requests

proxies = {
    'https': 'http://user:password@10.10.1.10:3128/',
}
requests.get('https://www.taobao.com', proxies=proxies)
```
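One caveat with the `user:password@host` form: if the password contains characters like `@` or `:`, the URL becomes ambiguous, so the credentials should be percent-encoded first. A small sketch, where `make_proxy_url` is a hypothetical helper and the credentials are made up:

```python
from urllib.parse import quote

def make_proxy_url(user, password, host, port):
    # Percent-encode the credentials so characters such as '@' or ':'
    # in the password cannot be confused with URL delimiters.
    return 'http://{}:{}@{}:{}'.format(
        quote(user, safe=''), quote(password, safe=''), host, port)

proxies = {
    'http': make_proxy_url('user', 'p@ss:word', '10.10.1.10', 3128),
    'https': make_proxy_url('user', 'p@ss:word', '10.10.1.10', 3128),
}
print(proxies['http'])  # http://user:p%40ss%3Aword@10.10.1.10:3128
```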
But what if we want to hit the site from many different IPs? Many sites enforce a minimum interval between requests from a single IP, so a single proxy only moves the problem. One workaround: first scrape a pool of addresses from a free proxy-list site, then rotate through them while crawling the target:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) '
                  'Gecko/20100101 Firefox/43.0'
}

# Scrape the free proxy list (xicidaili lists IP and port in table cells)
r = requests.get("https://www.xicidaili.com/nt/", headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
ips = soup.findAll('tr')

proxy_list = []
for x in range(1, len(ips)):  # skip the header row
    tds = ips[x].findAll("td")
    ip_temp = 'http://' + tds[1].contents[0] + ":" + tds[2].contents[0]
    proxy_list.append(ip_temp)

# The proxies are collected; now crawl the target site through them
run_times = 100000
for i in range(run_times):
    for item in proxy_list:
        proxies = {
            'http': item,
            'https': item,
        }
        print(proxies)
        try:
            requests.get('<target site URL>', proxies=proxies, timeout=1)
            print('ok')
        except requests.exceptions.RequestException:
            # Free proxies are often dead or slow; skip to the next one
            continue
```
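The rotation logic above can be factored out so each request simply pulls the next proxy from an endless cycle. A minimal sketch, where `proxy_cycle` is a hypothetical helper and the two addresses are placeholders:

```python
import itertools

def proxy_cycle(proxy_list):
    # Endlessly cycle through the scraped proxies so consecutive
    # requests leave through different IPs instead of hammering one.
    for item in itertools.cycle(proxy_list):
        yield {'http': item, 'https': item}

pool = proxy_cycle(['http://1.2.3.4:80', 'http://5.6.7.8:8080'])
print(next(pool))  # {'http': 'http://1.2.3.4:80', 'https': 'http://1.2.3.4:80'}
print(next(pool))  # {'http': 'http://5.6.7.8:8080', 'https': 'http://5.6.7.8:8080'}
print(next(pool))  # wraps back to the first proxy
```

Each `requests.get(..., proxies=next(pool))` call then goes out through the next address in the pool.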