Solve Python running error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb0 in position 166: illegal multibyte sequence
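Solution: specify encoding='utf-8' when opening the file, so Python does not fall back to the system default codec (gbk on Chinese-language Windows).
Example: response = open(path, 'r', encoding='utf-8')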
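A minimal sketch of the fix, assuming the file on disk really is UTF-8. The sample text and temporary file are made up for illustration. Reading the same bytes as gbk either raises UnicodeDecodeError or silently produces garbled text, depending on the bytes involved, so the sketch records whichever happens.

```python
import os
import tempfile

text = "温度: 25°C\n"  # sample content, chosen only for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write the file as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding returns the original text.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

    # Reading with the wrong codec either fails or yields different text.
    try:
        with open(path, "r", encoding="gbk") as f:
            gbk_result = f.read()
    except UnicodeDecodeError as exc:
        gbk_result = exc

print(read_back == text)
print(repr(gbk_result))
```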
Core code:
import csv
import requests
from bs4 import BeautifulSoup

def ipPools(numPage):
    headers = randomHeads()  # helper defined elsewhere in the project
    url = 'http://www.xicidaili.com/nn/'
    saveFsvFile = open('ips.csv', 'wb')  # in Python 3.6 this must be 'w' (see the notes below)
    writer = csv.writer(saveFsvFile)
    for num in range(1, numPage + 1):
        full_url = url + str(num)
        re = requests.get(full_url, headers=headers)
        soup = BeautifulSoup(re.text, 'lxml')
        res = soup.find(id="ip_list").find_all('tr')
        for item in res:
            try:
                temp = []
                tds = item.find_all('td')
                proxyIp = tds[1].text.encode("utf-8")
                proxyPort = tds[2].text.encode("utf-8")
                temp.append(proxyIp)
                temp.append(proxyPort)
                writer.writerow(temp)
                print('Saved to Excel successfully!')
            except IndexError:
                pass  # rows without enough <td> cells (such as the header row) are skipped
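Points to note:
A file opened in binary mode ('wb') accepts only bytes, so str values have to be converted first:
str.encode("utf-8")
How to open the file in Python 3.6:
In open('ips.csv', 'wb'), change wb to w. This is exactly where I got the error. In Python 3, csv.writer expects a text-mode file and str values, so the .encode("utf-8") calls are no longer needed once the mode is w. If you hit the same error, you can use this as a reference.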
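As a rough illustration of the point about file modes, the sketch below writes two rows to a temporary CSV file in text mode and reads them back, then shows that binary mode is rejected by csv.writer in Python 3. The rows and the temporary path are placeholders, not real scraped data.

```python
import csv
import os
import tempfile

rows = [["202.106.169.142", "80"], ["220.181.35.109", "8080"]]  # placeholder rows

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ips.csv")

    # Text mode ('w') with newline='' is what the csv module expects in Python 3.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

    with open(path, "r", newline="", encoding="utf-8") as f:
        loaded = list(csv.reader(f))

    # Binary mode ('wb') fails as soon as csv.writer tries to write a str line.
    try:
        with open(path, "wb") as f:
            csv.writer(f).writerow(rows[0])
        binary_mode_error = None
    except TypeError as exc:
        binary_mode_error = exc

print(loaded)
print(type(binary_mode_error).__name__)
```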
The main cause is the ^M character (a carriage return, \r).
It comes from different line-ending conventions: a .sh or .py file edited on Windows may contain invisible characters, so the error above is reported when the file is executed on Linux. Windows ends each line with CR LF (\r\n), while Linux uses LF (\n) only.
Solution:
1) Conversion in Windows:
Use an editor such as UltraEdit or EditPlus to convert the script's line endings first, and then copy it to Linux for execution. In UltraEdit: File -> Conversions -> DOS -> UNIX.
2) Direct replacement under Linux
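sed -i 's/^M//g' filename
(Type ^M as Ctrl+V followed by Ctrl+M, not as the two characters ^ and M. With GNU sed, sed -i 's/\r$//' filename removes the trailing carriage returns as well.)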
3) It can also be converted in Linux
First, make sure that the file has executable permissions
#sh> chmod a+x filename
Then change the file format
#sh> vi filename
Use the following command to view the file format
:set ff or :set fileformat
You can see the following information
fileformat=dos or fileformat=unix
Use the following command to modify the file format
:set ff=unix or :set fileformat=unix
:wq (save and exit)
Finally, execute the file
#sh> ./filename
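The same conversion can also be sketched in Python, which is handy when neither sed nor vi is available. The function names below are hypothetical helpers written for this note; the conversion simply replaces every CR LF pair with LF.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CR LF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    # Binary mode avoids any automatic newline translation by Python.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(crlf_to_lf(data))


windows_script = b"#!/bin/sh\r\necho hello\r\n"
unix_script = crlf_to_lf(windows_script)
print(unix_script)
```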
When the following statement is executed
import urllib.request

def set_IPlsit():
    url = 'https://www.whatismyip.com/'
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
The following exception occurred:
C:\Users\54353\AppData\Local\Programs\Python\Python36\python.exe "C:/Users/54353/PycharmProjects/untitled/爬虫/图片 - 某网站.py"
Traceback (most recent call last):
File "C:/Users/54353/PycharmProjects/untitled/crawler/pic.py", line 100, in <module>
ip = set_IPlsit2()
File "C:/Users/54353/PycharmProjects/untitled/crawler/pic.py", line 95, in set_IPlsit2
response = ure.urlopen(url)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\54353\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Process finished with exit code 1
The reason for this exception is that when you open a URL with urllib.request.urlopen, the server receives only a bare page request, with no information about the browser, operating system, or hardware platform that sent it (by default urllib identifies itself as Python-urllib/3.x). Requests lacking this information are often automated access, such as crawlers.
To block this kind of access, some websites check the User-Agent header of each request. If the User-Agent is abnormal or missing, the request is rejected.
Add a User-Agent to the request; the code is as follows
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=chaper_url, headers=headers)
urllib.request.urlopen(req).read()
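Put together, the approach might look like the sketch below. build_request and fetch are hypothetical helper names, and the timeout value is an assumption. Only the request object is built here; fetch is defined but not called, so nothing goes over the network.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the snippet above.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) "
                  "Gecko/20100101 Firefox/23.0"
}


def build_request(url, headers=None):
    """Create a Request that carries a User-Agent header."""
    return urllib.request.Request(url=url, headers=headers or DEFAULT_HEADERS)


def fetch(url, timeout=10):
    """Return the response body, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        print("Request failed with HTTP status", exc.code)
        return None


req = build_request("https://www.whatismyip.com/")
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header("User-agent"))
```

A site can still answer 403 for other reasons (IP blocking, cookies, rate limits), so a User-Agent header is a first thing to try rather than a guaranteed fix.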
from sklearn.cross_validation import train_test_split
ERROR:
ImportError: No module named sklearn.cross_validation
Solution:
This is caused by the renaming and deprecation of the cross_validation submodule in favor of model_selection. Try substituting cross_validation -> model_selection.
train_test_split is now in model_selection. Just type:
from sklearn.model_selection import train_test_split
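If the same script has to run against both old and new scikit-learn versions, one option is to try the new module path first and fall back to the old one. import_first below is a hypothetical helper written for this note, not part of scikit-learn; the demonstration uses a standard-library module so that it runs even where scikit-learn is not installed.

```python
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first module in `candidates` that can be imported."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module path does not exist in this environment
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError("none of %r provides %r" % (candidates, attr))


# Intended use (commented out because scikit-learn may not be installed here):
# train_test_split = import_first(
#     ["sklearn.model_selection", "sklearn.cross_validation"],
#     "train_test_split",
# )

# Demonstration with the standard library: the first name does not exist,
# so the helper falls back to os.path.
join = import_first(["no_such_module_xyz", "os.path"], "join")
print(join("a", "b"))
```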
Reference:
https://stackoverflow.com/questions/30667525/importerror-no-module-named-sklearn-cross-validation
When writing a crawler to collect data, we often encounter the following message
HTTP Error 403: Forbidden
I previously wrote a note on using multiple headers, but that approach still uses a single IP and only disguises the crawler as different browsers. To further avoid being blocked, the IP also needs to be changed from time to time. This note records how to crawl through proxies in Python. PS: try not to crawl too frequently.
Go straight to the code:
# These are the proxy IPs I used at the time; replace them with ones that currently work
proxy_list = [
'202.106.169.142:80',
'220.181.35.109:8080',
'124.65.163.10:8080',
'117.79.131.109:8080',
'58.30.233.200:8080',
'115.182.92.87:8080',
'210.75.240.62:3128',
'211.71.20.246:3128',
'115.182.83.38:8080',
'121.69.8.234:8080',
]
# Next, pick a random proxy and bind it to urllib2 (Python 2; needs import random and import urllib2)
proxy = random.choice(proxy_list)
urlhandle = urllib2.ProxyHandler({'http': proxy})
opener = urllib2.build_opener(urlhandle)
urllib2.install_opener(opener)
# After that, use urllib2 as usual
req = urllib2.Request(listurl, headers=headers)
content = urllib2.urlopen(req).read()
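The code above targets Python 2's urllib2. In Python 3 the same classes live in urllib.request, so a rough equivalent might look like the sketch below. make_proxy_opener is a hypothetical helper name, and the two sample addresses are taken from the list above and are very likely no longer usable. The sketch only builds the opener and sends no request.

```python
import random
import urllib.request


def make_proxy_opener(proxy_list, scheme="http"):
    """Pick a random proxy and return (opener, proxy) without installing it globally."""
    proxy = random.choice(proxy_list)
    handler = urllib.request.ProxyHandler({scheme: proxy})
    opener = urllib.request.build_opener(handler)
    return opener, proxy


sample_proxies = ["202.106.169.142:80", "220.181.35.109:8080"]
opener, chosen = make_proxy_opener(sample_proxies)
print("Requests made with this opener would go through", chosen)

# To use it for a single request:   opener.open(req)
# To make it the process default:   urllib.request.install_opener(opener)
```

Returning the opener instead of installing it keeps the choice local, which makes it easier to switch to a different proxy for each request.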
Notes from my experience crawling time.com and Douban movies:
- Free proxies are not very stable. If you crawl a lot over a long period, it is better to pay for proxies; they are very cheap.
- If you do use free proxy IPs, choose high-anonymity ones. Recommend this site
__file__ is a variable that Python sets when a module is imported. When __file__ cannot be used, how do you get the path of the current file?
# Option 1: take the file name from the current stack frame
import inspect, os.path
filename = inspect.getframeinfo(inspect.currentframe()).filename
path = os.path.dirname(os.path.abspath(filename))
# Option 2: ask inspect where a function defined in this file comes from
import inspect
import os
os.path.abspath(inspect.getsourcefile(lambda:0))
My code:
The content is treated as a string
content[len(content)/2:len(content)/2+5]
Error:
TypeError: slice indices must be integers or None or have an __index__ method
After looking through a lot of material, I found that in Python 3 the "/" operator always returns a float, even when both operands are integers, and a float cannot be used as a slice index. Change "/" to "//" (floor division) and the code runs.
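A small sketch of the corrected slice, using a made-up sample string. middle_slice is a hypothetical helper name; the try block reproduces the original failure for comparison.

```python
def middle_slice(content, width=5):
    """Return `width` characters starting at the middle of `content`."""
    start = len(content) // 2  # floor division keeps the index an int
    return content[start:start + width]


sample = "abcdefghij"
result = middle_slice(sample)
print(result)

# The original expression uses "/", which produces a float index in Python 3.
try:
    sample[len(sample) / 2:len(sample) / 2 + 5]
    slice_error = None
except TypeError as exc:
    slice_error = exc
print(slice_error)
```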
Error:
Today, while writing a simple Python class definition, I ran into this error: TypeError: drive() takes 2 positional arguments but 3 were given
The code is as follows
class Car:
    speed = 0

    def drive(self, distance):
        time = distance / self.speed
        print(time)

bike = Car()
bike.speed = 60
bike.drive(60, 80)
After investigation, the cause turned out to be the self parameter of the drive(self, distance) method in the class definition.
Now let's take a brief look at the basics of self in Python.
self refers to the created class instance itself. Because self points to the instance, a method can bind attributes to it. When calling a method, you must pass arguments that match the method's parameters, but you do not pass self: the Python interpreter passes the instance automatically. That is why drive(60, 80) counts as 3 arguments (the instance, 60 and 80).
So there are two solutions.
Method 1: pass only one argument. If you want to pass two arguments, see Method 2.
class Car:
    speed = 0

    def drive(self, distance):
        time = distance / self.speed
        print(time)

bike = Car()
bike.speed = 60
bike.drive(80)
Method 2:
class Car:
    speed = 0

    def drive(self, distance, speed):
        time = distance / speed
        print(time)

bike = Car()
bike.drive(80, 50)
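To see where the extra argument in the error message comes from, the sketch below calls the method both ways: through the instance, and through the class with the instance passed explicitly. Here drive returns the time instead of printing it, a small change made only so the values are easy to compare.

```python
class Car:
    speed = 0

    def drive(self, distance):
        # Return the travel time so the two call styles can be compared.
        return distance / self.speed


bike = Car()
bike.speed = 60

# These two calls are equivalent: Python supplies `bike` as `self`.
bound_result = bike.drive(120)
explicit_result = Car.drive(bike, 120)
print(bound_result, explicit_result)

# Passing two values means the method receives three: self, 60 and 80.
try:
    bike.drive(60, 80)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```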
Source code that raises the error:
from django.http import HttpResponse

# Receive request data
def search(request):
    request.encoding = 'utf-8'
    if 'q' in request.GET:
        message = 'You searched for: ' + request.GET['q'].encode('utf-8')
    else:
        message = 'You submitted an empty form'
    return HttpResponse(message)
In the line that builds message inside the if branch, the encode function is used for transcoding. encode returns data of type bytes, which cannot be concatenated directly with data of type str.
Since request.encoding is already set to 'utf-8' on the first line of the function, we remove the encode call here, and the error is resolved.
The updated code is:
from django.http import HttpResponse

# Receive request data
def search(request):
    request.encoding = 'utf-8'
    if 'q' in request.GET:
        message = 'You searched for: ' + request.GET['q']
    else:
        message = 'You submitted an empty form'
    return HttpResponse(message)
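The underlying problem can be reproduced without Django. The sketch below uses a plain string in place of request.GET['q'] (an assumption made for illustration) and shows the failing concatenation next to two working alternatives.

```python
query = "python"  # stands in for request.GET['q']

# str + bytes is not allowed in Python 3.
try:
    message = "You searched for: " + query.encode("utf-8")
    concat_error = None
except TypeError as exc:
    concat_error = exc
print(type(concat_error).__name__)

# Fix 1: keep everything as str (what the updated view does).
fixed = "You searched for: " + query

# Fix 2: if the value really is bytes, decode it before concatenating.
decoded = "You searched for: " + query.encode("utf-8").decode("utf-8")

print(fixed)
print(decoded)
```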