Problem description
UnicodeError: URL contains non-ASCII characters (Python 2.7)
I managed to build a crawler that follows every link it finds and, when it reaches a product link, extracts all the product information. On certain pages, however, it fails with a Unicode error :/
import urllib
import urlparse
from itertools import ifilterfalse
from urllib2 import URLError, HTTPError
from bs4 import BeautifulSoup

urls = ["http://www.kiabi.es/"]
visited = []

def get_html_text(url):
    try:
        return urllib.urlopen(current_url).read()
    except (URLError, HTTPError, urllib.ContentTooShortError):
        print "Error getting " + current_url

def find_internal_links_in_html_text(html_text, base_url):
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for tag in soup.findAll('a', href=True):
        url = urlparse.urljoin(base_url, tag['href'])
        domain = urlparse.urlparse(base_url).hostname
        if domain in url:
            links.append(url)
    return links

def is_url_already_visited(url):
    return url in visited

while urls:
    current_url = urls.pop()
    word = '#C'
    if word in current_url:
        [do sth]
    #print "Parsing", current_url
    html_text = get_html_text(current_url)
    visited.append(current_url)
    found_urls = find_internal_links_in_html_text(html_text, current_url)
    new_urls = ifilterfalse(is_url_already_visited, found_urls)
    urls.extend(new_urls)
Error:
Traceback (most recent call last):
  File "<ipython-input-1-67c2b4cf7175>", line 1, in <module>
    runfile('S:/Consultas_python/Kiabi.py', wdir='S:/Consultas_python')
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "S:/Consultas_python/Kiabi.py", line 91, in <module>
    html_text = get_html_text(current_url)
  File "S:/Consultas_python/Kiabi.py", line 30, in get_html_text
    return urllib.urlopen(current_url).read()
  File "C:\Anaconda2\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Anaconda2\lib\urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "C:\Anaconda2\lib\urllib.py", line 1070, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'http://www.kiabi.es/Barbapap\xe1_s1' contains non-ASCII characters
or
UnicodeError: URL u'http://www.kiabi.es/Petit-B\xe9guin_s2' contains non-ASCII characters
Is there a way to fix this?
Reference solutions
Method 1:
You need to percent-encode the UTF-8 representation of your Unicode string.
The conversion rule is:
All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.
In Python code, that means:
import urllib
url = urllib.quote(url.encode('utf8'), ':/')
The second argument to quote, ':/', prevents the colon in the protocol part (http:) and the path separator (/) from being encoded.
(In Python 3, the quote function has been moved to the urllib.parse module.)
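With ':/' as the only safe characters, quote would also encode characters such as ?, =, & and # if a link contains a query string or a fragment. A minimal sketch of one way around that is to encode each URL component separately. The helper name iri_to_uri and the chosen sets of safe characters are assumptions made for illustration, not part of the original answer. The sketch assumes the input is a Unicode string and leaves a non-ASCII host name untouched:

```python
# -*- coding: utf-8 -*-
try:
    # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def iri_to_uri(url):
    """Percent-encode the non-ASCII characters of a Unicode URL.

    Hypothetical helper: the path, query and fragment are encoded
    separately so that their delimiters survive unchanged.
    """
    parts = urlsplit(url)
    # '%' is kept safe so that sequences that are already
    # percent-encoded are not encoded a second time.
    path = quote(parts.path.encode('utf8'), "/%:@!$&'()*+,;=~")
    query = quote(parts.query.encode('utf8'), "/%:@!$&'()*+,;=?~")
    fragment = quote(parts.fragment.encode('utf8'), "/%:@!$&'()*+,;=?~")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(iri_to_uri(u'http://www.kiabi.es/Barbapap\xe1_s1'))
```

In the crawler, such a helper could be applied to each URL just before it is passed to urlopen, so the visited list can keep storing the original links.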
Method 2:
You can try encoding the URLs as ASCII. Your code might look like:
def get_html_text(url):
    try:
        # Use the function argument rather than the global current_url
        return urllib.urlopen(url.encode('ascii', 'ignore')).read()
    except (URLError, HTTPError, urllib.ContentTooShortError):
        print "Error getting " + url
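One caveat with this approach: the 'ignore' error handler silently drops every non-ASCII character, so the URL that is requested is no longer the one the page linked to, and the server may answer with a different page or an error. The short comparison below, using one of the failing URLs from the question, illustrates the difference between the two methods. The variable names are chosen for this example only:

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

url = u'http://www.kiabi.es/Barbapap\xe1_s1'

# Method 2: the non-ASCII character is dropped from the URL.
dropped = url.encode('ascii', 'ignore').decode('ascii')

# Method 1: the non-ASCII character is kept as percent-encoded UTF-8 bytes.
encoded = quote(url.encode('utf8'), ':/')

print(dropped)  # http://www.kiabi.es/Barbapap_s1
print(encoded)  # http://www.kiabi.es/Barbapap%C3%A1_s1
```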
(by Joel Sánchez López, memoselyk, Rahul)