Problem description
Crawling data or using an API
How do these sites gather all their data: questionhub, bigresource, devsea, developerbay?
Is it legal to display data in a frame, as bigresource does?
Reference solution
Method 1:
@amazed
EDITED: fixed some spelling issues 20110310
How these sites gather all data‑ questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:
1. Spider program (google "spider program" to learn more)
   a. Configured to start reading web pages at stackoverflow.com (for example).
   b. Runs so that it goes to the home page of stackoverflow.com and then visits all the links that it finds on those pages.
   c. Returns the HTML data from all of those pages.
2. Search index program
   Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND the URLs those words were found at.
3. User interface web page
   Provides a feature-rich user interface so you can search the sites that have been spidered.
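To make the three pieces above a bit more concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not how questionhub.com or any of the other sites actually work. The names `PageParser`, `crawl`, `build_index`, `search` and the `FAKE_SITE` dictionary are all made up for this example, and the spider reads pages from an in-memory dictionary instead of making real HTTP requests.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects link targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch, max_pages=100):
    """Spider: breadth-first visit of every link reachable from start_url.

    `fetch(url)` is supplied by the caller and returns HTML or None.
    """
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Search index: map each word to the set of URLs it was found at."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """User interface back end: URLs that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical in-memory "site" standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.test/": '<a href="/q1">Q1</a> <a href="/q2">Q2</a> Home page',
    "http://example.test/q1": "<p>How to parse HTML in Python</p>",
    "http://example.test/q2": '<p>How to parse JSON</p> <a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_SITE.get)
index = build_index(pages)
print(sorted(search(index, "parse html")))  # ['http://example.test/q1']
```

A real spider would swap `FAKE_SITE.get` for a function that downloads the page over HTTP, and would need politeness delays, error handling, and persistent storage for the index.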
Is this legal to show data in frame as bigresource do?
To be technical, "it all depends" ;‑)
Normally, websites want to be visible in Google, so why not in other search engines too?
Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding a rule to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on stackoverflow OR by running a query on Google.
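As a rough illustration of how a well-behaved spider might check robots.txt before fetching a page, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example-spider` user agent, the `allowed` helper, and the URLs are all invented for the example. A real spider would download the file from the site's `/robots.txt` rather than hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-spider"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.test/questions/1"))   # True
print(allowed("http://example.test/private/data"))  # False
```

Calling a check like `allowed(url)` before every fetch is what "honoring robots.txt" amounts to in practice. Nothing forces a spider to do it, which is why sites fall back on blocking IP addresses.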
There are several industries (besides Google) built around what you are asking about. There are tags on stackoverflow for search-engine and search; read some of those questions and answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)