Problem description
BeautifulSoup: how to get all article links from this link?
I want to get all article links from "https://www.cnnindonesia.com/search?query=covid". Here is my code:
links = []
base_url = requests.get(f"https://www.cnnindonesia.com/search?query=covid")
soup = bs(base_url.text, 'html.parser')
cont = soup.findall('div', class='container')
for l in cont:
    l_cont = l.findall('div', class='l_content')
    for bf in l_cont:
        bf_cont = bf.findall('div', class='box feed')
        for lm in bf_cont:
            lmcont = lm.find('div', class='list media_rows middle')
            for article in lm_cont.find_all('article'):
                a_cont = article.find('a', href=True)
                if url:
                    link = a['href']
                    links.append(link)
The result is:
links
[]
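Note that the code as posted cannot produce `[]` at all: `class=` is a Python syntax error (`class` is a reserved word), and BeautifulSoup's method is `find_all`, not `findall`. A minimal sketch using a made-up HTML snippet shows the correct names:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one search result.
html = '<article class="col_4"><a href="https://example.com/post">t</a></article>'
soup = BeautifulSoup(html, 'html.parser')

# The method is find_all, and the keyword argument is class_
# (a bare "class" keyword argument raises a SyntaxError).
articles = soup.find_all('article', class_='col_4')
print(len(articles))  # 1
```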
Reference solution
Method 1:
Each article has this structure:
<article class="col_4">
  <a href="https://www.cnnindonesia.com/...">
    <span>...</span>
    <h2 class="title">...</h2>
  </a>
</article>
It is simpler to iterate over the article elements and then look for the a element inside each one.
Try:
from bs4 import BeautifulSoup
import requests

links = []
response = requests.get("https://www.cnnindonesia.com/search?query=covid")
soup = BeautifulSoup(response.text, 'html.parser')

for article in soup.find_all('article'):
    url = article.find('a', href=True)
    if url:
        link = url['href']
        print(link)
        links.append(link)

print(links)
Output:
https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara
...
['https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara', ...
'https://www.cnnindonesia.com/gaya-hidup/...ikut-penerbangan-gravitasi-nol']
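If you only want the links inside the `<div class="list media_rows middle">` container (assuming it is present in the static HTML, which may not hold for this site; see the update below), a CSS selector expresses the whole nesting in one call instead of four loops:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the page layout.
html = '''
<div class="list media_rows middle">
  <article><a href="https://example.com/a">A</a></article>
  <article><a href="https://example.com/b">B</a></article>
</div>
<article><a href="https://example.com/outside">C</a></article>
'''
soup = BeautifulSoup(html, 'html.parser')

# "list media_rows middle" is three separate classes, so chain
# them with dots in the selector; a[href] skips anchors without hrefs.
links = [a['href'] for a in soup.select('div.list.media_rows.middle article a[href]')]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```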
Update:
If you want to extract the URLs that are dynamically added by JavaScript inside the <div class="list media_rows middle"> element, then you must use something like Selenium, which can extract the content after the full page has been rendered in a web browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.cnnindonesia.com/search?query=covid'
links = []

options = webdriver.ChromeOptions()
pathToChromeDriver = "chromedriver.exe"
browser = webdriver.Chrome(executable_path=pathToChromeDriver,
                           options=options)
try:
    browser.get(url)
    browser.implicitly_wait(10)
    html = browser.page_source
    content = browser.find_element(By.CLASS_NAME, 'media_rows')
    for elt in content.find_elements(By.TAG_NAME, 'article'):
        link = elt.find_element(By.TAG_NAME, 'a')
        href = link.get_attribute('href')
        if href:
            print(href)
            links.append(href)
finally:
    browser.quit()
(by MrX, CodeMonkey)
References