When writing a crawler to scrape data, we often run into the following error:
HTTP Error 403: Forbidden
I previously wrote a note about rotating multiple headers, but that approach still uses a single IP and only disguises the crawler as different browsers. To further reduce the chance of being blocked, the IP also needs to be changed from time to time. This note records how to crawl through proxies in Python. PS: try not to crawl too frequently.
Go straight to the code:
import random
import urllib2  # Python 2 only; in Python 3 this module became urllib.request

# Proxy IPs I was using at the time; replace them with proxies that currently work.
proxy_list = [
'202.106.169.142:80',
'220.181.35.109:8080',
'124.65.163.10:8080',
'117.79.131.109:8080',
'58.30.233.200:8080',
'115.182.92.87:8080',
'210.75.240.62:3128',
'211.71.20.246:3128',
'115.182.83.38:8080',
'121.69.8.234:8080',
]
# Next, wherever your code uses urllib2, bind a randomly chosen proxy as follows.
proxy = random.choice(proxy_list)
# Only http:// URLs are routed through the proxy with this mapping.
proxy_handler = urllib2.ProxyHandler({'http': proxy})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
# Then use urllib2 as usual; listurl and headers are defined elsewhere in your crawler.
req = urllib2.Request(listurl, headers=headers)
content = urllib2.urlopen(req).read()
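urllib2 exists only in Python 2. For readers on Python 3, the following is a minimal sketch of the same idea using urllib.request. The function names build_proxy_opener and fetch_via_random_proxy are illustrative and not part of the original code, and the sketch assumes each proxy is a plain HTTP proxy given as "host:port".

```python
import random
import urllib.request


def build_proxy_opener(proxy):
    """Return an opener that sends http:// requests through `proxy` ("host:port")."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    return urllib.request.build_opener(handler)


def fetch_via_random_proxy(url, proxy_list, headers=None, timeout=10):
    """Pick one proxy at random and fetch `url` through it (hypothetical helper)."""
    proxy = random.choice(proxy_list)
    opener = build_proxy_opener(proxy)
    req = urllib.request.Request(url, headers=headers or {})
    # opener.open() is used instead of install_opener(), so the proxy
    # applies to this request only and not to the whole process.
    with opener.open(req, timeout=timeout) as resp:
        return resp.read()
```

Calling fetch_via_random_proxy(listurl, proxy_list, headers=headers) would then play the role of the last two lines of the urllib2 version.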
Some notes from my experience crawling time.com and Douban Movies:
– Free proxies are not very stable. If you crawl large volumes over a long period, it is worth paying a little for proxies; they are very cheap.
– If you do use free proxy IPs, pick high-anonymity ones. I recommend this site
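Because free proxies fail often, one way to cope is to try several proxies in turn instead of relying on a single random choice. The sketch below is an assumption of mine and not the author's code: fetch_with_rotation is a hypothetical helper, and fetch stands for any callable of the form fetch(url, proxy) that downloads a page through the given proxy.

```python
import http.client
import random


def fetch_with_rotation(url, proxy_list, fetch, max_tries=3):
    """Try up to `max_tries` distinct proxies in random order.

    Returns (proxy, body) for the first proxy that works and raises
    RuntimeError if every attempted proxy fails.
    """
    candidates = random.sample(proxy_list, min(max_tries, len(proxy_list)))
    last_error = None
    for proxy in candidates:
        try:
            return proxy, fetch(url, proxy)
        except (OSError, http.client.HTTPException) as exc:
            # urllib's URLError and HTTPError (including 403) and socket
            # timeouts are all OSError subclasses, so they land here.
            last_error = exc
    raise RuntimeError("all attempted proxies failed") from last_error
```

A 403 response counts as a failure here, so a blocked proxy is simply skipped and the next one is tried.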