Crawling data or using API


Problem Description


How do these sites gather all their data ‑ questionhub, bigresource, devsea, developerbay?

Is it legal to display data in frames, as bigresource does?


Reference Solution

Method 1:

@amazed

EDITED: fixed some spelling issues 20110310

How these sites gather all data‑ questionhub, bigresource ...

Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:

  1. Spider program (google "spider program" to learn more)

    a. configured to start reading web pages at stackoverflow.com (for example)

    b. Run the program so it goes to the home page of stackoverflow.com and starts visiting all links that it finds on those pages.

    c. Returns the HTML data from all of those pages

  2. Search Index Program

    Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND what URLs those words were found at

  3. User Interface web‑page

    Provides a feature-rich user interface so you can search the sites that have been spidered.
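The three pieces above (spider, search index, search UI) can be sketched in a few lines of Python. This is only an illustration of the idea: the "web" here is an in-memory dict of made-up URLs and HTML standing in for real HTTP fetches, and all names are hypothetical.

```python
from html.parser import HTMLParser
from collections import defaultdict

# Toy "web": URL -> HTML. A real spider would fetch these over HTTP.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">python spider</a>',
    "http://example.com/a": "<p>search index basics</p>",
}

class LinkTextParser(HTMLParser):
    """Collects href links and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(start):
    """Step 1: start at one page and follow every link found (breadth-first)."""
    seen, queue, results = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = LinkTextParser()
        parser.feed(PAGES[url])
        results[url] = parser.words
        queue.extend(parser.links)
    return results

def build_index(crawled):
    """Step 2: inverted index mapping each word -> the set of URLs it appeared on."""
    index = defaultdict(set)
    for url, words in crawled.items():
        for word in words:
            index[word].add(url)
    return index

index = build_index(crawl("http://example.com/"))
print(sorted(index["spider"]))  # URLs where the word "spider" appears
```

Step 3, the user-facing search page, is then just a lookup into that index (possibly ranked), which is what sites like questionhub.com layer their own styling on top of.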

Is it legal to show data in a frame as bigresource does?

To be technical, "it all depends" ;‑)

Normally, websites want to be visible in Google, so why not in other search engines too?

Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.

A remote site can 'request' that spiders do NOT go through some/all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow OR by running a query on Google.
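Python's standard library ships a robots.txt parser that a polite spider can consult before each fetch. The rules below are a hypothetical example file, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt a site might serve: everyone may crawl,
# except for anything under /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved spider checks before fetching each URL.
print(rp.can_fetch("MySpider", "http://example.com/public/page"))   # True
print(rp.can_fetch("MySpider", "http://example.com/private/page"))  # False
```

In a real crawler you would load the file from `http://<site>/robots.txt` (e.g. via `RobotFileParser.set_url` plus `read()`) and skip any URL for which `can_fetch` returns False.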

There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions/answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.

I hope this helps.

P.S. As you appear to be a new user, if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or ‑) as a useful answer. This goes for your other posts here too ;‑)

(by amazedshellter)

References

  1. Crawling data or using API (CC BY‑SA 3.0/4.0)

#web-crawler #html #RegEx






Related Questions

UnicodeError: URL contains non-ASCII characters (Python 2.7)

Crawling output - connecting two variables

How to make an effective crawler in Python 2.7

How to tell Google this text is part of another article

How to parse a JavaScript object from an HTML page I crawl?

Crawling data or using API

Crawl only internal links from a website using Python

How to scrape the HTML code from the response received?

Scraping class name on a website using PHP

Craigslist Scraper using Scrapy Spider not performing functions

BeautifulSoup: how to get all article links from this link?

I'm the client. Can I optionally remove a header from an HTTP response?
