
How to Solve the Python Crawler Error "HTTP Error 403: Forbidden": Use Proxies to Avoid IP Blocking

When writing a crawler to scrape data, we often run into the following error:

HTTP Error 403: Forbidden

I previously wrote a note about rotating multiple headers, but that approach still uses a single IP; it only disguises the crawler as different browsers. To further reduce the chance of being blocked, the IP also needs to be changed from time to time. This note records how to crawl through proxies in Python. PS: try not to crawl too frequently.

Go straight to the code:

# These are the proxy IPs I used at the time; replace them with ones that currently work.
proxy_list = [
    '202.106.169.142:80',
    '220.181.35.109:8080',
    '124.65.163.10:8080',
    '117.79.131.109:8080',
    '58.30.233.200:8080',
    '115.182.92.87:8080',
    '210.75.240.62:3128',
    '211.71.20.246:3128',
    '115.182.83.38:8080',
    '121.69.8.234:8080',
]

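import random
import urllib2  # Python 2 only

# Next, wherever your code uses urllib2, bind a randomly chosen proxy as follows.
proxy = random.choice(proxy_list)
urlhandle = urllib2.ProxyHandler({'http': proxy})  # only http:// requests go through the proxy
opener = urllib2.build_opener(urlhandle)
urllib2.install_opener(opener)  # make this opener the default for urllib2.urlopen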

# Then use urllib2 as usual; listurl and headers are defined elsewhere in your crawler.
req = urllib2.Request(listurl, headers=headers)
content = urllib2.urlopen(req).read()
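
The code above targets Python 2, where urllib2 exists. In Python 3 the same classes live in urllib.request, so a rough equivalent might look like the sketch below. The helper name build_proxy_opener is an assumption made for illustration and is not part of the original note; the proxy address and URL in the usage comment are placeholders.

```python
import random
import urllib.request


def build_proxy_opener(proxy_list):
    """Pick one proxy at random and return (proxy, opener).

    Hypothetical helper: it mirrors the urllib2 steps above using
    Python 3's urllib.request.
    """
    proxy = random.choice(proxy_list)
    # Route plain-HTTP requests through the chosen proxy, as in the original code.
    handler = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(handler)
    return proxy, opener


# Usage (placeholder values, not run here because it needs a live proxy):
# proxy, opener = build_proxy_opener(['203.0.113.5:8080'])
# urllib.request.install_opener(opener)
# req = urllib.request.Request('http://example.com/', headers={'User-Agent': 'Mozilla/5.0'})
# content = urllib.request.urlopen(req).read()
```

Returning the opener instead of installing it globally is a design choice: it lets each request use a different proxy without changing process-wide state.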

Notes from my experience crawling time.com and Douban Movies:
– Free proxies are not very stable. If you crawl heavily over a long period, it is worth paying a little for proxies; they are very cheap.
– If you do use free proxy IPs, choose high-anonymity proxies. I recommend this site
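
Because free proxies fail often, one way to cope is to try another proxy when a request fails. The following is only a minimal sketch under assumptions: it uses Python 3's urllib.request, and the names open_via_proxy and fetch_with_rotation, along with the max_tries, delay, and timeout parameters, are hypothetical and do not come from the original note.

```python
import random
import time
import urllib.request


def open_via_proxy(url, proxy, headers, timeout):
    """Fetch url through a single HTTP proxy and return the response body."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(handler)
    req = urllib.request.Request(url, headers=headers)
    with opener.open(req, timeout=timeout) as resp:
        return resp.read()


def fetch_with_rotation(url, proxy_list, headers=None, max_tries=3,
                        delay=1.0, timeout=10, fetch=open_via_proxy):
    """Try up to max_tries distinct proxies, pausing between failed attempts."""
    candidates = random.sample(proxy_list, min(max_tries, len(proxy_list)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy, headers or {}, timeout)
        except OSError as exc:
            # URLError, HTTPError (e.g. 403) and socket timeouts all derive from OSError.
            last_error = exc
            time.sleep(delay)  # back off a little before trying the next proxy
    raise RuntimeError('all proxies failed: %r' % (last_error,))
```

The fetch parameter exists so the retry logic can be exercised with a stand-in function instead of a live proxy; in normal use it can be left at its default.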