UnicodeError: URL contains non-ASCII characters (Python 2.7)


Problem description


So I managed to build a crawler: it follows all the links, and when it reaches a product page it extracts all the product information. But when it arrives at certain pages I get a Unicode error :/

import urllib
import urlparse
from itertools import ifilterfalse
from urllib2 import URLError, HTTPError

from bs4 import BeautifulSoup

urls = ["http://www.kiabi.es/"]
visited = []


def get_html_text(url):
    try:
        # use the url argument, not the global current_url
        return urllib.urlopen(url).read()
    except (URLError, HTTPError, urllib.ContentTooShortError):
        print "Error getting " + url


def find_internal_links_in_html_text(html_text, base_url):
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for tag in soup.findAll('a', href=True):
        url = urlparse.urljoin(base_url, tag['href'])
        domain = urlparse.urlparse(base_url).hostname
        if domain in url:
            links.append(url)
    return links


def is_url_already_visited(url):
    return url in visited


while urls:
    current_url = urls.pop()
    word = '#C'
    if word in current_url:
        pass  # [do sth]
    #print "Parsing", current_url
    html_text = get_html_text(current_url)
    visited.append(current_url)
    found_urls = find_internal_links_in_html_text(html_text, current_url)
    new_urls = ifilterfalse(is_url_already_visited, found_urls)
    urls.extend(new_urls)

Error:

Traceback (most recent call last):

File "<ipython-input-1-67c2b4cf7175>", line 1, in <module>
runfile('S:/Consultas_python/Kiabi.py', wdir='S:/Consultas_python')

File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)

File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)

File "S:/Consultas_python/Kiabi.py", line 91, in <module>
html_text = get_html_text(current_url)

File "S:/Consultas_python/Kiabi.py", line 30, in get_html_text
return urllib.urlopen(current_url).read()

File "C:\Anaconda2\lib\urllib.py", line 87, in urlopen
return opener.open(url)

File "C:\Anaconda2\lib\urllib.py", line 185, in open
fullurl = unwrap(toBytes(fullurl))

File "C:\Anaconda2\lib\urllib.py", line 1070, in toBytes
" contains non-ASCII characters")

UnicodeError: URL u'http://www.kiabi.es/Barbapap\xe1_s1' contains non-ASCII characters

or

UnicodeError: URL u'http://www.kiabi.es/Petit-B\xe9guin_s2' contains non-ASCII characters

How can I fix it?


Solutions

Method 1:

You need to percent-encode the UTF-8 representation of your Unicode string.

As explained here:

All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.

In python code, that means:

import urllib
url = urllib.quote(url.encode('utf8'), ':/')

The second argument to quote, ':/', prevents the colon in the scheme part http: and the path separator / from being encoded.

(In Python 3, the quote function has been moved to the urllib.parse module).
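In Python 3, where quote lives in urllib.parse, a minimal sketch of the same fix (using the URL from the traceback above) looks like:

```python
from urllib.parse import quote

# URL from the traceback, containing the non-ASCII character U+00E1 ('á')
url = u'http://www.kiabi.es/Barbapap\xe1_s1'

# Percent-encode the UTF-8 bytes, leaving ':' and '/' alone so the
# scheme and the path separators survive intact.
safe_url = quote(url.encode('utf8'), ':/')

print(safe_url)  # http://www.kiabi.es/Barbapap%C3%A1_s1
```

The 'á' (UTF-8 bytes 0xC3 0xA1) becomes %C3%A1, producing a valid ASCII-only URI that still refers to the same resource.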

Method 2:

You can try to encode the urls. Your code may look like:

def get_html_text(url):
    try:
        return urllib.urlopen(url.encode('ascii', 'ignore')).read()
    except (URLError, HTTPError, urllib.ContentTooShortError):
        print "Error getting " + url
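Note that .encode('ascii', 'ignore') silently drops the accented characters rather than encoding them, so the resulting URL may point to a different (possibly nonexistent) page. A small Python 3 sketch of the difference between the two methods, using the second URL from the traceback:

```python
from urllib.parse import quote

url = u'http://www.kiabi.es/Petit-B\xe9guin_s2'  # URL from the traceback

# Method 2: strip non-ASCII characters entirely -- the 'é' disappears.
stripped = url.encode('ascii', 'ignore').decode('ascii')
print(stripped)  # http://www.kiabi.es/Petit-Bguin_s2

# Method 1: percent-encode the UTF-8 bytes -- the 'é' is preserved as %C3%A9.
quoted = quote(url.encode('utf8'), ':/')
print(quoted)    # http://www.kiabi.es/Petit-B%C3%A9guin_s2
```

If the server actually expects the accented path, only Method 1 keeps the URL correct.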

(by Joel Sánchez López, memoselyk, Rahul)

References

  1. UnicodeError: URL contains non-ASCII characters (Python 2.7) (CC BY-SA 2.5/3.0/4.0)

#web-crawler #Python #non-ascii-characters #beautifulsoup #unicode





