When scraping data with a crawler, we often run into the following response:
HTTP Error 403: Forbidden
I previously wrote a note on rotating through multiple request headers, but that approach still sends everything from a single IP; it only disguises the crawler as different browsers. So to further reduce the chance of being blocked, you also need to switch to a different IP from time to time. Here is a record of how to crawl through proxies in
Python. PS: try not to crawl too frequently.
Go straight to the code:
# These are the proxy IPs I used at the time; replace them with IPs that still work
proxy_list = [
    '202.106.169.142:80',
    '220.181.35.109:8080',
    '124.65.163.10:8080',
    '117.79.131.109:8080',
    '58.30.233.200:8080',
    '115.182.92.87:8080',
    '210.75.240.62:3128',
    '211.71.20.246:3128',
    '115.182.83.38:8080',
    '121.69.8.234:8080',
]
# Next, bind a randomly chosen proxy into urllib2's opener, as follows:
import random
import urllib2

proxy = random.choice(proxy_list)
urlhandle = urllib2.ProxyHandler({'http': proxy})
opener = urllib2.build_opener(urlhandle)
urllib2.install_opener(opener)

# Then use urllib2 as normal (listurl and headers are defined elsewhere)
req = urllib2.Request(listurl, headers=headers)
content = urllib2.urlopen(req).read()
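The snippet above targets Python 2's urllib2. On Python 3 the same module lives at urllib.request; here is a minimal sketch of the equivalent setup (the proxy addresses are just samples from the list above and are likely dead by now):

```python
import random
import urllib.request

# Sample proxies from the list above; replace with addresses that still work
proxy_list = [
    '202.106.169.142:80',
    '220.181.35.109:8080',
]

def build_proxy_opener(proxies):
    """Pick a random proxy and return (opener, proxy) that routes HTTP through it."""
    proxy = random.choice(proxies)
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler), proxy

opener, chosen = build_proxy_opener(proxy_list)
# opener.open(url, timeout=10) would then fetch the page through `chosen`
```

Using build_opener directly, instead of install_opener, keeps the proxy choice local to one opener, so different parts of the crawler can use different proxies at the same time.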
A few notes based on my experience crawling time.com and Douban movies:
– Free proxies are not very stable. If you crawl heavily over a long period, you are better off spending a little money on paid proxies; they are very cheap.
– When hunting for free proxy IPs, use elite (high-anonymity) proxies. I recommend this site
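Because free proxies drop out so often, it also helps to retry a failed request with a different proxy instead of giving up. Below is a sketch of that idea (Python 3; the function and parameter names are mine, not from the original post, and the `fetch` hook is only there so the rotation logic can be exercised without a live proxy):

```python
import random
import urllib.error
import urllib.request

def fetch_with_rotation(url, proxies, headers=None, tries=3, timeout=10,
                        fetch=None):
    """Try up to `tries` distinct random proxies; return the page bytes.

    `fetch(url, proxy)` performs the actual request; it defaults to urllib,
    but can be swapped out for testing.
    """
    def default_fetch(url, proxy):
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({'http': proxy}))
        req = urllib.request.Request(url, headers=headers or {})
        return opener.open(req, timeout=timeout).read()

    fetch = fetch or default_fetch
    pool = list(proxies)
    last_err = None
    for _ in range(min(tries, len(pool))):
        proxy = random.choice(pool)
        pool.remove(proxy)  # never retry a proxy that just failed
        try:
            return fetch(url, proxy)
        except (urllib.error.URLError, OSError) as err:
            last_err = err
    raise last_err
```

Removing a failed proxy from the pool before the next attempt guarantees each retry goes through a different IP.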