
yahoo-finance-scraper/
├── pyproject.toml
├── README.md
├── yahoo_finance_scraper/
│   └── __init__.py
└── tests/
    └── __init__.py
Navigate to the project directory and install Playwright:
cd yahoo-finance-scraper
poetry add playwright
poetry run playwright install
Yahoo Finance loads its content dynamically with JavaScript. Playwright can render JavaScript, which makes it well suited for scraping dynamic content from Yahoo Finance.
Open the pyproject.toml file to check your project's dependencies, which should include:
[tool.poetry.dependencies]
python = "^3.12"
playwright = "^1.46.0"
Finally, create a main.py file in the yahoo_finance_scraper folder to write your scraping logic.
The updated project structure should look like this:
yahoo-finance-scraper/
├── pyproject.toml
├── README.md
├── yahoo_finance_scraper/
│   ├── __init__.py
│   └── main.py
└── tests/
    └── __init__.py
Your environment is now set up, and you're ready to start writing Python Playwright code to scrape Yahoo Finance.
Note: If you'd rather not set all of this up on your local machine, you can deploy the code directly on Apify. Later in this tutorial, I'll show you how to deploy and run your scraper on Apify.
First, let's launch a Chromium browser instance using Playwright. Although Playwright supports various browser engines, we'll use Chromium in this tutorial:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)  # Launch a Chromium browser
        context = await browser.new_context()
        page = await context.new_page()


if __name__ == "__main__":
    asyncio.run(main())
To run this script, you need to execute the main() function with an event loop at the end of the script.
Next, navigate to the Yahoo Finance page for the stock you want to scrape. The URL of a Yahoo Finance stock page looks like this:
https://finance.yahoo.com/quote/{ticker_symbol}
A ticker symbol is a unique code that identifies a publicly listed company on a stock exchange, such as AAPL for Apple Inc. or TSLA for Tesla, Inc. When the ticker symbol changes, the URL changes too. So you should replace {ticker_symbol} with the specific ticker you want to scrape.
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        # ...
        ticker_symbol = "AAPL"  # Replace with the desired ticker symbol
        yahoo_finance_url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(yahoo_finance_url)  # Navigate to the Yahoo Finance page


if __name__ == "__main__":
    asyncio.run(main())
Here's the complete script so far:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        # Launch a Chromium browser
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        ticker_symbol = "AAPL"  # Replace with the desired ticker symbol
        yahoo_finance_url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(yahoo_finance_url)  # Navigate to the Yahoo Finance page

        # Wait for a few seconds
        await asyncio.sleep(3)

        # Close the browser
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
When you run this script, it opens the Yahoo Finance page for a few seconds before terminating.
Great! Now you can scrape data for any stock of your choice just by changing the ticker symbol.
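As a side note, the fixed asyncio.sleep(3) above is fine for a quick demo, but waiting for a specific element is usually more reliable than a fixed delay. A minimal alternative sketch, using the qsp-price selector that this article identifies later:

# Instead of `await asyncio.sleep(3)`, wait until JavaScript has rendered the price element:
await page.wait_for_selector('[data-testid="qsp-price"]')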
Note: launching the browser with a UI (headless=False) is great for testing and debugging. If you want to save resources and run the browser in the background, switch to headless mode:
browser = await playwright.chromium.launch(headless=True)
When accessing Yahoo Finance from a European IP address, you may encounter a cookie consent modal that has to be dealt with before you can continue scraping.
To reach the page you want, you need to interact with the modal by clicking either "Accept all" or "Reject all". To do this, right-click the "Accept all" button and select Inspect to open the browser's DevTools.
In DevTools, you can see that the button can be selected with the following CSS selector:
button.accept-all
To click this button automatically in Playwright, you can use the following script:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        ticker_symbol = "AAPL"
        url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(url)

        try:
            # Click the "Accept All" button to bypass the modal
            await page.locator("button.accept-all").click()
        except:
            pass

        await browser.close()


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
If the cookie consent modal appears, this script attempts to click the "Accept All" button so you can continue scraping without interruption.
To scrape data effectively, you first need to understand the page's DOM structure. Suppose you want to extract the regular market price (224.72), the change (+3.00), and the change percent (+1.35%). All of these values are contained within a single div element. Inside that div, you'll find three fin-streamer elements representing the market price, the change, and the percentage, respectively.
To pinpoint these elements precisely, you can use the following CSS selectors:
[data-testid="qsp-price"]
[data-testid="qsp-price-change"]
[data-testid="qsp-price-change-percent"]
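Before wiring these selectors into the full scraper, you can sanity-check them from the running script in the previous section. A quick sketch, assuming the page object already points at the quote page:

# Read the three regular-market values using the data-testid selectors above
price = await page.locator('[data-testid="qsp-price"]').text_content()
change = await page.locator('[data-testid="qsp-price-change"]').text_content()
percent = await page.locator('[data-testid="qsp-price-change-percent"]').text_content()
print(price, change, percent)  # e.g. 224.72 +3.00 (+1.35%)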
Great! Next, let's look at how to extract the market close time, which appears on the page as "4 PM EDT".
To select the close time, use the following CSS selector:
div[slot="marketTimeNotice"] > span
Now, let's move on to extracting key company data from the table, such as market cap, previous close, and volume.
As you can see, the data is structured as a table with multiple li tags, one per field, starting with "Previous Close" and ending with "1y Target Est".
To extract specific fields like "Previous Close" and "Open", you can use the data-field attribute, which uniquely identifies each element:
[data-field="regularMarketPreviousClose"]
[data-field="regularMarketOpen"]
The data-field attribute provides a simple way to select elements. In some cases, though, no such attribute exists. For example, the "Bid" value lacks a data-field attribute or any other unique identifier. In that case, we first locate the "Bid" label by its text content and then move to the next sibling element to extract the corresponding value.
Here's the combined selector you can use:
span:has-text('Bid') + span.value
Now that you've identified the elements to scrape, it's time to write the Playwright script that extracts the data from Yahoo Finance.
Let's define a new function called scrape_data to handle the scraping. It accepts a ticker symbol, navigates to the Yahoo Finance page, and returns a dictionary containing the extracted financial data.
Here's how it works:
from playwright.async_api import async_playwright, Playwright


async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    try:
        # Launch the browser in headless mode
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        url = f"https://finance.yahoo.com/quote/{ticker}"
        await page.goto(url, wait_until="domcontentloaded")

        try:
            # Click the "Accept All" button if present
            await page.locator("button.accept-all").click()
        except:
            pass  # If the button is not found, continue without any action

        data = {"Ticker": ticker}

        # Extract regular market values
        data["Regular Market Price"] = await page.locator(
            '[data-testid="qsp-price"]'
        ).text_content()
        data["Regular Market Price Change"] = await page.locator(
            '[data-testid="qsp-price-change"]'
        ).text_content()
        data["Regular Market Price Change Percent"] = await page.locator(
            '[data-testid="qsp-price-change-percent"]'
        ).text_content()

        # Extract market close time
        market_close_time = await page.locator(
            'div[slot="marketTimeNotice"] > span'
        ).first.text_content()
        data["Market Close Time"] = market_close_time.replace("At close: ", "")

        # Extract other financial metrics
        data["Previous Close"] = await page.locator(
            '[data-field="regularMarketPreviousClose"]'
        ).text_content()
        data["Open Price"] = await page.locator(
            '[data-field="regularMarketOpen"]'
        ).text_content()
        data["Bid"] = await page.locator(
            "span:has-text('Bid') + span.value"
        ).text_content()
        data["Ask"] = await page.locator(
            "span:has-text('Ask') + span.value"
        ).text_content()
        data["Day's Range"] = await page.locator(
            '[data-field="regularMarketDayRange"]'
        ).text_content()
        data["52 Week Range"] = await page.locator(
            '[data-field="fiftyTwoWeekRange"]'
        ).text_content()
        data["Volume"] = await page.locator(
            '[data-field="regularMarketVolume"]'
        ).text_content()
        data["Avg. Volume"] = await page.locator(
            '[data-field="averageVolume"]'
        ).text_content()
        data["Market Cap"] = await page.locator(
            '[data-field="marketCap"]'
        ).text_content()
        data["Beta"] = await page.locator(
            "span:has-text('Beta (5Y Monthly)') + span.value"
        ).text_content()
        data["PE Ratio"] = await page.locator(
            "span:has-text('PE Ratio (TTM)') + span.value"
        ).text_content()
        data["EPS"] = await page.locator(
            "span:has-text('EPS (TTM)') + span.value"
        ).text_content()
        data["Earnings Date"] = await page.locator(
            "span:has-text('Earnings Date') + span.value"
        ).text_content()
        data["Dividend & Yield"] = await page.locator(
            "span:has-text('Forward Dividend & Yield') + span.value"
        ).text_content()
        data["Ex-Dividend Date"] = await page.locator(
            "span:has-text('Ex-Dividend Date') + span.value"
        ).text_content()
        data["1y Target Est"] = await page.locator(
            '[data-field="targetMeanPrice"]'
        ).text_content()

        return data
    except Exception as e:
        print(f"An error occurred while processing {ticker}: {e}")
        return {"Ticker": ticker, "Error": str(e)}
    finally:
        await context.close()
        await browser.close()
The code extracts data using the CSS selectors identified earlier: the locator method targets each element, and text_content() grabs its text. The scraped metrics are stored in a dictionary, where each key names a financial metric and each value holds the scraped text.
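One caveat with this approach: if any single metric is missing from the page layout, text_content() eventually raises and the whole ticker falls through to the except branch. If you want per-field resilience, a small optional helper can swallow the failure and return None instead. This helper (and the name safe_text) is my own addition, not part of the original script:

async def safe_text(page, selector: str, timeout: float = 2000) -> str | None:
    """Return the matched element's text, or None if the selector finds nothing in time."""
    try:
        return await page.locator(selector).first.text_content(timeout=timeout)
    except Exception:
        return None

# Example usage inside scrape_data:
#   data["Bid"] = await safe_text(page, "span:has-text('Bid') + span.value")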
Finally, define a main function that orchestrates the whole process by iterating over each ticker and collecting its data:
async def main():
    # Define the ticker symbol
    ticker = "AAPL"

    async with async_playwright() as playwright:
        # Collect data for the ticker
        data = await scrape_data(playwright, ticker)
        print(data)


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
At the end of the scraping run, the extracted data is printed to the console.
Now that we've captured real-time data, let's look at the historical stock information Yahoo Finance provides. This data reflects a stock's past performance and is useful for making investment decisions. You can query different time ranges at daily, weekly, or monthly intervals, such as the past month, the past year, or even the stock's entire history.
To access historical stock data on Yahoo Finance, you need to customize the URL by modifying specific parameters:
- frequency: specifies the data interval, such as daily (1d), weekly (1wk), or monthly (1mo).
- period1 and period2: set the start and end dates for the data as Unix timestamps.

For example, the following URL queries weekly historical data for Amazon (AMZN) from August 16, 2023 to August 16, 2024:
https://finance.yahoo.com/quote/AMZN/history/?frequency=1wk&period1=1692172771&period2=1723766400
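If you'd rather compute period1 and period2 from calendar dates than copy Unix timestamps by hand, Python's standard datetime module does the conversion. A small sketch reproducing the date range from the AMZN URL above (midnight UTC, so period1 differs slightly from the URL's value):

from datetime import datetime, timezone

# Unix timestamps for Aug 16, 2023 and Aug 16, 2024 (midnight UTC)
period1 = int(datetime(2023, 8, 16, tzinfo=timezone.utc).timestamp())
period2 = int(datetime(2024, 8, 16, tzinfo=timezone.utc).timestamp())
print(period1, period2)  # 1692144000 1723766400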
After navigating to this URL, you'll see a table containing the historical data. In our case, it shows data for the past year at weekly intervals.
To extract this data, you can use Playwright's query_selector_all method with the CSS selector .table tbody tr:
rows = await page.query_selector_all(".table tbody tr")
Each row contains multiple cells (td tags) holding the data. Here's how to extract the text content from each cell:
for row in rows:
    cells = await row.query_selector_all("td")
    date = await cells[0].text_content()
    open_price = await cells[1].text_content()
    high_price = await cells[2].text_content()
    low_price = await cells[3].text_content()
    close_price = await cells[4].text_content()
    adj_close = await cells[5].text_content()
    volume = await cells[6].text_content()
Next, create a function that generates Unix timestamps, which we'll use to define the start (period1) and end (period2) dates for the data:
import time


def get_unix_timestamp(
    years_back: int = 0,
    months_back: int = 0,
    days_back: int = 0,
) -> int:
    """Get a Unix timestamp for a specified number of years, months, or days back from today."""
    current_time = time.time()
    seconds_in_day = 86400
    return int(
        current_time
        - (years_back * 365 + months_back * 30 + days_back) * seconds_in_day
    )
Now, let's write a function to scrape the historical data:
from playwright.async_api import async_playwright, Playwright


async def scrape_historical_data(
    playwright: Playwright,
    ticker: str,
    frequency: str,
    period1: int,
    period2: int,
):
    url = f"https://finance.yahoo.com/quote/{ticker}/history?frequency={frequency}&period1={period1}&period2={period2}"

    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url, wait_until="domcontentloaded")

    try:
        await page.locator("button.accept-all").click()
    except:
        pass

    # Wait for the table to load
    await page.wait_for_selector(".table-container")

    # Extract table rows
    rows = await page.query_selector_all(".table tbody tr")

    # Prepare data storage
    data = []
    for row in rows:
        cells = await row.query_selector_all("td")
        date = await cells[0].text_content()
        open_price = await cells[1].text_content()
        high_price = await cells[2].text_content()
        low_price = await cells[3].text_content()
        close_price = await cells[4].text_content()
        adj_close = await cells[5].text_content()
        volume = await cells[6].text_content()

        # Add row data to list
        data.append(
            [date, open_price, high_price, low_price, close_price, adj_close, volume]
        )

    print(data)

    await context.close()
    await browser.close()
    return data
The scrape_historical_data function constructs the Yahoo Finance URL from the given parameters, navigates to the page while handling any cookie prompt, waits for the historical data table to fully load, and then extracts the relevant data and prints it to the console.
Finally, let's look at how to run this script with different settings:
async def main():
    async with async_playwright() as playwright:
        ticker = "TSLA"

        # Weekly data for the last year
        period1 = get_unix_timestamp(years_back=1)
        period2 = get_unix_timestamp()
        weekly_data = await scrape_historical_data(
            playwright, ticker, "1wk", period1, period2
        )


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
Adjust the parameters to customize the data period and frequency:
# Daily data for the last month
period1 = get_unix_timestamp(months_back=1)
period2 = get_unix_timestamp()
await scrape_historical_data(playwright, ticker, "1d", period1, period2)
# Monthly data for the stock's lifetime
period1 = 1
period2 = 999999999999
await scrape_historical_data(playwright, ticker, "1mo", period1, period2)
Here's the complete script we've written so far for scraping historical data from Yahoo Finance:
import asyncio
import time

from playwright.async_api import async_playwright, Playwright


def get_unix_timestamp(
    years_back: int = 0, months_back: int = 0, days_back: int = 0
) -> int:
    """Get a Unix timestamp for a specified number of years, months, or days back from today."""
    current_time = time.time()
    seconds_in_day = 86400
    return int(
        current_time
        - (years_back * 365 + months_back * 30 + days_back) * seconds_in_day
    )


async def scrape_historical_data(
    playwright: Playwright, ticker: str, frequency: str, period1: int, period2: int
):
    url = f"https://finance.yahoo.com/quote/{ticker}/history?frequency={frequency}&period1={period1}&period2={period2}"

    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url, wait_until="domcontentloaded")

    try:
        await page.locator("button.accept-all").click()
    except:
        pass

    # Wait for the table to load
    await page.wait_for_selector(".table-container")

    # Extract table rows
    rows = await page.query_selector_all(".table tbody tr")

    # Prepare data storage
    data = []
    for row in rows:
        cells = await row.query_selector_all("td")
        date = await cells[0].text_content()
        open_price = await cells[1].text_content()
        high_price = await cells[2].text_content()
        low_price = await cells[3].text_content()
        close_price = await cells[4].text_content()
        adj_close = await cells[5].text_content()
        volume = await cells[6].text_content()

        # Add row data to list
        data.append(
            [date, open_price, high_price, low_price, close_price, adj_close, volume]
        )

    print(data)

    await context.close()
    await browser.close()
    return data


async def main() -> None:
    async with async_playwright() as playwright:
        ticker = "TSLA"

        # Weekly data for the last year
        period1 = get_unix_timestamp(years_back=1)
        period2 = get_unix_timestamp()
        weekly_data = await scrape_historical_data(
            playwright, ticker, "1wk", period1, period2
        )


if __name__ == "__main__":
    asyncio.run(main())
Running this script prints all the historical stock data matching your specified parameters to the console.
So far, we've scraped data for a single stock. To collect data for multiple stocks at once, we can modify the script to accept ticker symbols as command-line arguments and process each one:
import sys


async def main() -> None:
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        return

    tickers = sys.argv[1:]

    async with async_playwright() as playwright:
        # Collect data for all tickers
        all_data = []
        for ticker in tickers:
            data = await scrape_data(playwright, ticker)
            all_data.append(data)
        print(all_data)


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
To run the script, pass ticker symbols as arguments:
python yahoo_finance_scraper/main.py AAPL MSFT TSLA
This scrapes and displays data for Apple Inc. (AAPL), Microsoft Corporation (MSFT), and Tesla, Inc. (TSLA).
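As an optional tweak: since each scrape_data call launches its own browser, the per-ticker scrapes are independent and could run concurrently with asyncio.gather. A hedged sketch (my own variation, untested against Yahoo's rate limits, so use with care):

import asyncio
from playwright.async_api import async_playwright

async def scrape_many(tickers: list[str]) -> list[dict]:
    async with async_playwright() as playwright:
        # Each scrape_data call launches its own browser, so the tasks share no state
        return list(
            await asyncio.gather(*(scrape_data(playwright, t) for t in tickers))
        )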
Websites often detect and block automated scraping using rate limiting, IP blocking, and checks on browsing patterns. Here are some effective ways to avoid detection while web scraping:
1. Random intervals between requests
Adding random delays between requests is a simple way to avoid detection. This basic technique makes your scraping less obvious to the website.
Here's how to add random delays in your Playwright script:
import asyncio
import random

from playwright.async_api import Playwright, async_playwright


async def scrape_data(playwright: Playwright, ticker: str):
    browser = await playwright.chromium.launch()
    context = await browser.new_context()
    page = await context.new_page()

    url = f"https://example.com/{ticker}"  # Example URL
    await page.goto(url)

    # Random delay to mimic human-like behavior
    await asyncio.sleep(random.uniform(2, 5))

    # Your scraping logic here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
This script introduces random delays of 2 to 5 seconds between requests, making the activity less predictable and reducing the chance of it being flagged as bot traffic.
2. Setting and rotating User-Agents
Websites often identify the browser and device behind a request from its User-Agent string. By rotating User-Agent strings, you make your scraper's requests appear to come from different browsers and devices, which helps avoid detection.
Here's how to implement User-Agent rotation in Playwright:
import asyncio
import random

from playwright.async_api import Playwright, async_playwright


async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)

    # List of user-agents
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0",
    ]

    # Select a random user-agent from the list to rotate between requests
    user_agent = random.choice(user_agents)

    # Apply the chosen user-agent when creating the browser context
    # (a context's user agent must be set at creation time via new_context)
    context = await browser.new_context(user_agent=user_agent)

    page = await context.new_page()
    url = f"https://example.com/{ticker}"  # Example URL with ticker
    await page.goto(url)

    # Your scraping logic goes here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
This approach uses a list of User-Agent strings and randomly picks one for each request. It helps mask your scraper's identity and reduces the likelihood of being blocked.
Note: You can refer to sites like useragentstring.com for a complete list of User-Agent strings.
3. Using Playwright-Stealth
To further reduce the risk of detection and improve your scraper, you can use the playwright-stealth library. It applies various techniques to make your scraping look like the browsing activity of a real user.
First, install playwright-stealth:
poetry add playwright-stealth
Then, modify your script:
import asyncio

from playwright.async_api import Playwright, async_playwright
from playwright_stealth import stealth_async


async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()

    # Apply stealth techniques to the page to avoid detection
    await stealth_async(page)

    url = f"https://finance.yahoo.com/quote/{ticker}"
    await page.goto(url)

    # Your scraping logic here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
These techniques help avoid blocks, but you may still run into issues. If so, try more advanced methods such as using proxies, rotating IP addresses, or implementing CAPTCHA solvers. For that, see the detailed guide 21 tips on how to crawl a website without getting blocked, an essential resource for choosing proxies wisely, fighting Cloudflare, solving CAPTCHAs, avoiding honeypots, and more.
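As one concrete example of the proxy route: Playwright accepts a proxy option at browser launch, so all page traffic is routed through it. A minimal sketch (the server address and credentials below are placeholders, not a real endpoint):

# Route all browser traffic through a proxy (placeholder values)
browser = await playwright.chromium.launch(
    headless=True,
    proxy={
        "server": "http://proxy.example.com:8000",  # hypothetical proxy endpoint
        "username": "your-username",                # hypothetical credentials
        "password": "your-password",
    },
)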
Once you've scraped the stock data you need, the next step is to export it to a CSV file for easy analysis, sharing, or importing into other data-processing tools.
Here's how to save the extracted data to a CSV file:
# ...
import csv


async def main() -> None:
    # ...

    async with async_playwright() as playwright:
        # Collect data for all tickers
        all_data = []
        for ticker in tickers:
            data = await scrape_data(playwright, ticker)
            all_data.append(data)

        # Define the CSV file name
        csv_file = "stock_data.csv"

        # Write the data to a CSV file
        with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=all_data[0].keys())
            writer.writeheader()
            writer.writerows(all_data)


if __name__ == "__main__":
    asyncio.run(main())
The code first collects data for each ticker symbol. It then creates a CSV file named stock_data.csv and writes the data with Python's csv.DictWriter, using writeheader() for the column headers and writerows() for each row of data.
Let's combine everything into a single script. This final snippet covers all the steps, from scraping Yahoo Finance data to exporting it to a CSV file:
import asyncio
import csv
import sys

from playwright.async_api import async_playwright, Playwright


async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    """
    Extracts financial data from Yahoo Finance for a given stock ticker.

    Args:
        playwright (Playwright): The Playwright instance used to control the browser.
        ticker (str): The stock ticker symbol to retrieve data for.

    Returns:
        dict: A dictionary containing the extracted financial data for the given ticker.
    """
    try:
        # Launch a headless browser
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Form the URL using the ticker symbol
        url = f"https://finance.yahoo.com/quote/{ticker}"

        # Navigate to the page and wait for the DOM content to load
        await page.goto(url, wait_until="domcontentloaded")

        # Try to click the "Accept All" button for cookies, if it exists
        try:
            await page.locator("button.accept-all").click()
        except:
            pass  # If the button is not found, continue without any action

        # Dictionary to store the extracted data
        data = {"Ticker": ticker}

        # Extract regular market values
        data["Regular Market Price"] = await page.locator(
            '[data-testid="qsp-price"]'
        ).text_content()
        data["Regular Market Price Change"] = await page.locator(
            '[data-testid="qsp-price-change"]'
        ).text_content()
        data["Regular Market Price Change Percent"] = await page.locator(
            '[data-testid="qsp-price-change-percent"]'
        ).text_content()

        # Extract market close time
        market_close_time = await page.locator(
            'div[slot="marketTimeNotice"] > span'
        ).first.text_content()
        data["Market Close Time"] = market_close_time.replace("At close: ", "")

        # Extract other financial metrics
        data["Previous Close"] = await page.locator(
            '[data-field="regularMarketPreviousClose"]'
        ).text_content()
        data["Open Price"] = await page.locator(
            '[data-field="regularMarketOpen"]'
        ).text_content()
        data["Bid"] = await page.locator(
            "span:has-text('Bid') + span.value"
        ).text_content()
        data["Ask"] = await page.locator(
            "span:has-text('Ask') + span.value"
        ).text_content()
        data["Day's Range"] = await page.locator(
            '[data-field="regularMarketDayRange"]'
        ).text_content()
        data["52 Week Range"] = await page.locator(
            '[data-field="fiftyTwoWeekRange"]'
        ).text_content()
        data["Volume"] = await page.locator(
            '[data-field="regularMarketVolume"]'
        ).text_content()
        data["Avg. Volume"] = await page.locator(
            '[data-field="averageVolume"]'
        ).text_content()
        data["Market Cap"] = await page.locator(
            '[data-field="marketCap"]'
        ).text_content()
        data["Beta"] = await page.locator(
            "span:has-text('Beta (5Y Monthly)') + span.value"
        ).text_content()
        data["PE Ratio"] = await page.locator(
            "span:has-text('PE Ratio (TTM)') + span.value"
        ).text_content()
        data["EPS"] = await page.locator(
            "span:has-text('EPS (TTM)') + span.value"
        ).text_content()
        data["Earnings Date"] = await page.locator(
            "span:has-text('Earnings Date') + span.value"
        ).text_content()
        data["Dividend & Yield"] = await page.locator(
            "span:has-text('Forward Dividend & Yield') + span.value"
        ).text_content()
        data["Ex-Dividend Date"] = await page.locator(
            "span:has-text('Ex-Dividend Date') + span.value"
        ).text_content()
        data["1y Target Est"] = await page.locator(
            '[data-field="targetMeanPrice"]'
        ).text_content()

        return data
    except Exception as e:
        # Handle any exceptions and return an error message
        print(f"An error occurred while processing {ticker}: {e}")
        return {"Ticker": ticker, "Error": str(e)}
    finally:
        # Ensure the browser is closed even if an error occurs
        await context.close()
        await browser.close()


async def main() -> None:
    """
    Main function to run the Yahoo Finance data extraction for multiple tickers.

    Reads ticker symbols from command-line arguments, extracts data for each,
    and saves the results to a CSV file.
    """
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        return

    tickers = sys.argv[1:]

    # Use async_playwright context to handle browser automation
    async with async_playwright() as playwright:
        # List to store data for all tickers
        all_data = []
        for ticker in tickers:
            # Extract data for each ticker and add it to the list
            data = await scrape_data(playwright, ticker)
            all_data.append(data)

        # Define the CSV file name
        csv_file = "stock_data.csv"

        # Write the extracted data to a CSV file
        with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=all_data[0].keys())
            writer.writeheader()
            writer.writerows(all_data)

        print(f"Data for tickers {', '.join(tickers)} has been saved to {csv_file}")


# Run the main function using asyncio
if __name__ == "__main__":
    asyncio.run(main())
You can run the script from your terminal by passing one or more ticker symbols as command-line arguments:
python yahoo_finance_scraper/main.py AAPL GOOG TSLA AMZN META
After the script runs, a CSV file named stock_data.csv is created in the same directory, holding all the data in an organized format. The CSV file will look like this:
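The header row comes directly from the dictionary keys built in scrape_data, with one row per ticker underneath (values omitted here):

Ticker,Regular Market Price,Regular Market Price Change,Regular Market Price Change Percent,Market Close Time,Previous Close,Open Price,Bid,Ask,Day's Range,52 Week Range,Volume,Avg. Volume,Market Cap,Beta,PE Ratio,EPS,Earnings Date,Dividend & Yield,Ex-Dividend Date,1y Target Est
AAPL,...
MSFT,...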
With your scraper ready, it's time to deploy it to the cloud with Apify. That lets you run the scraper on a schedule and take advantage of Apify's features. For this task, we'll use the Python Playwright template for a quick setup. On Apify, scrapers are called Actors.
Start by cloning the Playwright + Chrome template from the Apify Python template library.
First, you'll need to install the Apify CLI, which helps you manage Actors. On macOS or Linux, you can do this with Homebrew:
brew install apify/tap/apify-cli
Or via NPM:
npm -g install apify-cli
With the CLI installed, create a new Actor using the Python Playwright + Chrome template:
apify create yf-scraper -t python-playwright
This command sets up a yf-scraper project in your directory, installing all the necessary dependencies and providing some boilerplate code to get you started.
Navigate to the new project folder and open it with your preferred code editor. In this example, I'm using VS Code:
cd yf-scraper
code .
The template comes with a fully functional scraper. You can test it by running apify run to see it in action; the results are saved in storage/datasets.
Next, modify the code in src/main.py to suit scraping Yahoo Finance.
The modified code looks like this:
from apify import Actor
from playwright.async_api import async_playwright


async def extract_stock_data(page, ticker):
    data = {"Ticker": ticker}

    data["Regular Market Price"] = await page.locator(
        '[data-testid="qsp-price"]'
    ).text_content()
    data["Regular Market Price Change"] = await page.locator(
        '[data-testid="qsp-price-change"]'
    ).text_content()
    data["Regular Market Price Change Percent"] = await page.locator(
        '[data-testid="qsp-price-change-percent"]'
    ).text_content()
    data["Previous Close"] = await page.locator(
        '[data-field="regularMarketPreviousClose"]'
    ).text_content()
    data["Open Price"] = await page.locator(
        '[data-field="regularMarketOpen"]'
    ).text_content()
    data["Bid"] = await page.locator("span:has-text('Bid') + span.value").text_content()
    data["Ask"] = await page.locator("span:has-text('Ask') + span.value").text_content()
    data["Day's Range"] = await page.locator(
        '[data-field="regularMarketDayRange"]'
    ).text_content()
    data["52 Week Range"] = await page.locator(
        '[data-field="fiftyTwoWeekRange"]'
    ).text_content()
    data["Volume"] = await page.locator(
        '[data-field="regularMarketVolume"]'
    ).text_content()
    data["Avg. Volume"] = await page.locator(
        '[data-field="averageVolume"]'
    ).text_content()
    data["Market Cap"] = await page.locator('[data-field="marketCap"]').text_content()
    data["Beta"] = await page.locator(
        "span:has-text('Beta (5Y Monthly)') + span.value"
    ).text_content()
    data["PE Ratio"] = await page.locator(
        "span:has-text('PE Ratio (TTM)') + span.value"
    ).text_content()
    data["EPS"] = await page.locator(
        "span:has-text('EPS (TTM)') + span.value"
    ).text_content()
    data["Earnings Date"] = await page.locator(
        "span:has-text('Earnings Date') + span.value"
    ).text_content()
    data["Dividend & Yield"] = await page.locator(
        "span:has-text('Forward Dividend & Yield') + span.value"
    ).text_content()
    data["Ex-Dividend Date"] = await page.locator(
        "span:has-text('Ex-Dividend Date') + span.value"
    ).text_content()
    data["1y Target Est"] = await page.locator(
        '[data-field="targetMeanPrice"]'
    ).text_content()
    return data


async def main() -> None:
    """
    Main function to run the Apify Actor and extract stock data using Playwright.

    Reads input configuration from the Actor, enqueues URLs for scraping,
    launches Playwright to process requests, and extracts stock data.
    """
    async with Actor:
        # Retrieve input parameters
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])
        tickers = actor_input.get("tickers", [])

        if not start_urls:
            Actor.log.info("No start URLs specified in actor input. Exiting...")
            await Actor.exit()

        base_url = start_urls[0].get("url", "")

        # Enqueue requests for each ticker
        default_queue = await Actor.open_request_queue()
        for ticker in tickers:
            url = f"{base_url}{ticker}"
            await default_queue.add_request({"url": url, "userData": {"depth": 0}})

        # Launch Playwright and open a new browser context
        Actor.log.info("Launching Playwright...")
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(headless=Actor.config.headless)
            context = await browser.new_context()

            # Process requests from the queue
            while request := await default_queue.fetch_next_request():
                url = request["url"]
                Actor.log.info(f"Scraping {url} ...")
                try:
                    # Open the URL in a new Playwright page
                    page = await context.new_page()
                    await page.goto(url, wait_until="domcontentloaded")

                    # Extract the ticker symbol from the URL
                    ticker = url.rsplit("/", 1)[-1]
                    data = await extract_stock_data(page, ticker)

                    # Push the extracted data to Apify
                    await Actor.push_data(data)
                except Exception as e:
                    Actor.log.exception(f"Error extracting data from {url}: {e}")
                finally:
                    # Ensure the page is closed and the request is marked as handled
                    await page.close()
                    await default_queue.mark_request_as_handled(request)
Before running the code, update the input_schema.json file in the .actor/ directory to include the Yahoo Finance quote page URL and add a tickers field.
Here's the updated input_schema.json file:
{
    "title": "Python Playwright Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                {
                    "url": "https://finance.yahoo.com/quote/"
                }
            ],
            "editor": "requestListSources"
        },
        "tickers": {
            "title": "Tickers",
            "type": "array",
            "description": "List of stock ticker symbols to scrape data for",
            "items": {
                "type": "string"
            },
            "prefill": [
                "AAPL",
                "GOOGL",
                "AMZN"
            ],
            "editor": "stringList"
        },
        "max_depth": {
            "title": "Maximum depth",
            "type": "integer",
            "description": "Depth to which to scrape to",
            "default": 1
        }
    },
    "required": [
        "start_urls",
        "tickers"
    ]
}
Also, update the input.json file by changing its URL to a Yahoo Finance page to prevent conflicts during execution, or simply delete the file.
To run your Actor, execute the following command in your terminal:
apify run
The scraped results are saved in storage/datasets, where each ticker symbol gets its own JSON file, like this:
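For instance, an AAPL record looks roughly like this (the three regular-market values reuse the example figures quoted earlier in this article; the remaining fields are omitted):

{
  "Ticker": "AAPL",
  "Regular Market Price": "224.72",
  "Regular Market Price Change": "+3.00",
  "Regular Market Price Change Percent": "(+1.35%)"
}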
To deploy your Actor, first create an Apify account if you don't already have one. Then, get your API token from the Apify Console under Settings → Integrations, and log in with your token using the following command:
apify login -t YOUR_APIFY_TOKEN
Finally, push your Actor to Apify:
apify push
After a moment, your Actor should appear in the Apify Console under Actors → My actors.
Your scraper is now ready to run on the Apify platform. Click the Start button to begin. Once the run finishes, you can preview and download the data in various formats from the Storage tab.
Bonus: a major advantage of running scrapers on Apify is being able to save different configurations for the same Actor and set up automatic scheduling. Let's set this up for our Playwright Actor.
On the Actor page, click Create empty task.
Next, click Actions, then Schedule.
Finally, choose how often you want the Actor to run and click Create.
To start scraping with Python on the Apify platform, you can use the Python code templates. These templates cover popular libraries such as Requests, Beautiful Soup, Scrapy, Playwright, and Selenium, letting you quickly build scrapers for a variety of web scraping tasks.
Yahoo Finance offers a free API that gives users access to a wealth of financial information, including real-time stock quotes, historical market data, and the latest financial news. The API exposes various endpoints that return information in formats such as JSON, CSV, and XML, so you can easily integrate the data into your projects and use it however best fits your needs.
You've now built a practical system that extracts financial data from Yahoo Finance using Playwright. The code handles multiple ticker symbols and saves the results to a CSV file. You've also learned how to get around blocking mechanisms to keep your scraper running smoothly.
Original article: https://blog.apify.com/scrape-yahoo-finance-python/