├── pyproject.toml
├── README.md
├── google_finance_scraper/
│   └── __init__.py
└── tests/
    └── __init__.py

Navigate to the project directory and install Playwright:

cd google-finance-scraper
poetry add playwright
poetry run playwright install

Google Finance loads its content dynamically with JavaScript. Playwright can render JavaScript, which makes it well suited for scraping dynamic content from Google Finance.

Open the pyproject.toml file to check your project's dependencies, which should include:

[tool.poetry.dependencies]
python = "^3.12"
playwright = "^1.46.0"

??*Note:* At the time of writing, the playwright version is 1.46.0, but it may have changed since. Check for the latest version and update your pyproject.toml if necessary.
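To confirm which version is actually locked in your environment, you can run:

poetry show playwright

This prints the resolved version along with the package metadata, so you can quickly check whether it matches the version this tutorial assumes.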

Finally, create a main.py file inside the google_finance_scraper folder to hold your scraping logic.

The updated project structure should look like this:

google-finance-scraper/
├── pyproject.toml
├── README.md
├── google_finance_scraper/
│   ├── __init__.py
│   └── main.py
└── tests/
    └── __init__.py

Your environment is now set up, and you're ready to start writing Python Playwright code to scrape Google Finance.

2. Connect to the target Google Finance page

First, let's launch a Chromium browser instance using Playwright. While Playwright supports various browser engines, we'll use Chromium in this tutorial:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)  # Launch a Chromium browser
        context = await browser.new_context()
        page = await context.new_page()

if __name__ == "__main__":
    asyncio.run(main())

To run this script, you need to execute the main() function with an event loop at the end of the script, which is what asyncio.run(main()) does.

Next, navigate to the Google Finance page of the stock you want to scrape. The URL of a Google Finance stock page has the following format:

https://www.google.com/finance/quote/{ticker_symbol}

A ticker symbol is a unique code that identifies a publicly listed company on a stock exchange, such as AAPL for Apple Inc. or TSLA for Tesla, Inc. When the ticker changes, so does the URL, so you should replace {ticker_symbol} with the specific ticker you want to scrape.

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        # ...

        ticker_symbol = "AAPL:NASDAQ"  # Replace with the desired ticker symbol
        google_finance_url = f"https://www.google.com/finance/quote/{ticker_symbol}"

        await page.goto(google_finance_url)  # Navigate to the Google Finance page

if __name__ == "__main__":
    asyncio.run(main())

Here's the complete script so far:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        # Launch a Chromium browser
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        ticker_symbol = "AAPL:NASDAQ"  # Replace with the desired ticker symbol
        google_finance_url = f"https://www.google.com/finance/quote/{ticker_symbol}"

        # Navigate to the Google Finance page
        await page.goto(google_finance_url)

        # Wait for a few seconds
        await asyncio.sleep(3)

        # Close the browser
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

When you run this script, it opens the Google Finance page and keeps it open for a few seconds before terminating.
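The fixed asyncio.sleep(3) is only there so you can watch the page open. Once you start extracting data, a more reliable approach is to wait for a specific element to render instead of sleeping for a fixed time. A minimal sketch, using the price selector introduced in the next step:

# Wait up to 10 seconds for the dynamically rendered price element
# to appear before reading anything from the page.
await page.wait_for_selector("div.YMlKec.fxKbKc", timeout=10000)

wait_for_selector raises a TimeoutError if the element never appears, which is usually more useful than silently scraping an empty page.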

Great! Now you can scrape data for any stock of your choice simply by changing the ticker symbol.

Note that launching the browser with a UI (headless=False) is great for testing and debugging. If you want to save resources and run the browser in the background, switch to headless mode:

browser = await playwright.chromium.launch(headless=True)

3. Inspect the page to select the elements to scrape

To scrape data effectively, you first need to understand the DOM structure of the web page. Suppose you want to extract the regular market price ($229.79), the change (+1.46), and the change percentage (+3.30%). These values are all contained within a single div element.

You can extract the price from Google Finance with the selector div.YMlKec.fxKbKc, the percentage change with div.enJeMd div.JwB6zf, and the value change with span.P2Luy.ZYVHBb:

div.YMlKec.fxKbKc
div.enJeMd div.JwB6zf
span.P2Luy.ZYVHBb

Great! Next, let's look at how to extract the close time, which appears on the page as "06:02:19 UTC-4".

To select the close time, use the following XPath expression:

//div[contains(text(), "Closed:")]
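Note that Playwright accepts both CSS selectors and XPath expressions in the same query methods, so you can mix the two freely. A minimal sketch, assuming the page object from the previous step:

# CSS selector for the price element
price_el = await page.query_selector("div.YMlKec.fxKbKc")
# XPath expression for the close time element
closed_el = await page.query_selector('//div[contains(text(), "Closed:")]')
if price_el and closed_el:
    print(await price_el.inner_text(), await closed_el.inner_text())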

Now, let's move on to extracting key company data from the table, such as market cap, previous close, and volume.

The data is structured as a table, with a div tag representing each field, starting with "Previous close" and ending with "Primary exchange".

You can use the .mfs7Fc selector to extract the labels from the Google Finance table and .P6K39c for the corresponding values. These selectors target elements by their class names, letting you retrieve and process the table's data in pairs, as sketched below the selector list.

.mfs7Fc
.P6K39c
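As a quick illustration of the pairing idea (the full version appears in the next step), query_selector_all returns elements in document order, so zip lines each label up with its value:

labels = await page.query_selector_all(".mfs7Fc")
values = await page.query_selector_all(".P6K39c")
for label_el, value_el in zip(labels, values):
    # Index i of each list belongs to the same table row
    print(await label_el.inner_text(), "->", await value_el.inner_text())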

4. Scrape the stock data

Now that you've identified the elements you need, it's time to write the Playwright script that extracts the data from Google Finance.

Let's define a new function called scrape_data to handle the scraping process. It takes a ticker symbol, navigates to the Google Finance page, and returns a dictionary containing the extracted financial data.

Here's how it works:

import asyncio
from playwright.async_api import async_playwright, Playwright

async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    financial_data = {
        "ticker": ticker.split(":")[0],
        "price": None,
        "price_change_value": None,
        "price_change_percentage": None,
        "close_time": None,
        "previous_close": None,
        "day_range": None,
        "year_range": None,
        "market_cap": None,
        "avg_volume": None,
        "p/e_ratio": None,
        "dividend_yield": None,
        "primary_exchange": None,
    }

    try:
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(f"https://www.google.com/finance/quote/{ticker}")

        price_element = await page.query_selector("div.YMlKec.fxKbKc")
        if price_element:
            price_text = await price_element.inner_text()
            financial_data["price"] = price_text.replace(",", "")

        percentage_element = await page.query_selector("div.enJeMd div.JwB6zf")
        if percentage_element:
            percentage_text = await percentage_element.inner_text()
            financial_data["price_change_percentage"] = percentage_text.strip()

        value_element = await page.query_selector("span.P2Luy.ZYVHBb")
        if value_element:
            value_text = await value_element.inner_text()
            value_parts = value_text.split()
            if value_parts:
                financial_data["price_change_value"] = value_parts[0].replace("$", "")

        close_time_element = await page.query_selector('//div[contains(text(), "Closed:")]')
        if close_time_element:
            close_time_text = await close_time_element.inner_text()
            close_time = close_time_text.split("·")[0].replace("Closed:", "").strip()
            # Replace the narrow no-break space (U+202F) with a regular space
            clean_close_time = close_time.replace("\u202f", " ")
            financial_data["close_time"] = clean_close_time

        label_elements = await page.query_selector_all(".mfs7Fc")
        value_elements = await page.query_selector_all(".P6K39c")

        for label_element, value_element in zip(label_elements, value_elements):
            label = await label_element.inner_text()
            value = await value_element.inner_text()
            label = label.strip().lower().replace(" ", "_")
            if label in financial_data:
                financial_data[label] = value.strip()

    except Exception as e:
        print(f"An error occurred for {ticker}: {str(e)}")
    finally:
        await context.close()
        await browser.close()

    return financial_data

The code first navigates to the stock's page and extracts the various metrics, such as the price and market cap, using query_selector and query_selector_all. These are common Playwright methods for selecting elements and fetching data from them based on CSS selectors and XPath queries.

Then, inner_text() extracts the text from each element, and the results are stored in a dictionary where each key represents a financial metric (e.g., price, market cap) and each value is the corresponding extracted text. Finally, the browser session is closed to free up resources.

Now, define the main function, which coordinates the whole process by preparing the ticker and collecting the data.

async def main():
    # Define the ticker symbol
    ticker = "AAPL"

    # Append ":NASDAQ" to the ticker for the Google Finance URL
    ticker = f"{ticker}:NASDAQ"

    async with async_playwright() as playwright:
        # Collect data for the ticker
        data = await scrape_data(playwright, ticker)
        print(data)

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())

When the scraping run finishes, the extracted data is printed to the console.

5. Scrape multiple stocks

So far, we've scraped data for a single stock. To collect data for several stocks from Google Finance at once, we can modify the script to accept ticker symbols as command-line arguments and process each one. Make sure to import the sys module:

import sys

async def main():
    # Get ticker symbols from command line arguments
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        sys.exit(1)

    tickers = sys.argv[1:]
    async with async_playwright() as playwright:
        results = []
        for ticker in tickers:
            data = await scrape_data(playwright, f"{ticker}:NASDAQ")
            results.append(data)
        print(results)

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())

To run the script, pass the ticker symbols as arguments:

python google_finance_scraper/main.py aapl meta amzn

This scrapes and displays the data for Apple, Meta, and Amazon.
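As a variation (not part of the original script), you could scrape the tickers concurrently with asyncio.gather instead of a sequential loop. Keep in mind that each scrape_data call launches its own browser instance, so this trades memory for speed; a hedged sketch:

async def main():
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        sys.exit(1)

    tickers = sys.argv[1:]
    async with async_playwright() as playwright:
        # Run all scrapes concurrently; each call opens its own browser,
        # so keep the ticker list reasonably small.
        results = await asyncio.gather(
            *(scrape_data(playwright, f"{t}:NASDAQ") for t in tickers)
        )
    print(results)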

6. Avoid getting blocked

Websites often detect and block automated scraping with techniques like rate limiting, IP blocking, and analyzing browsing patterns. When scraping data from websites, it's crucial to employ strategies that avoid detection. Here are some effective ways to stay under the radar:

1. Random intervals between requests

A simple way to reduce the risk of detection is to introduce random delays between requests. This simple technique can significantly lower the chance of being identified as an automated scraper.

Here's how to add random delays in your Playwright script:

import asyncio
import random
from playwright.async_api import Playwright, async_playwright

async def scrape_data(playwright: Playwright, ticker: str):
    browser = await playwright.chromium.launch()
    context = await browser.new_context()
    page = await context.new_page()

    url = f"https://www.google.com/finance/quote/{ticker}"
    await page.goto(url)

    # Random delay to mimic human-like behavior
    await asyncio.sleep(random.uniform(2, 5))

    # Your scraping logic here...

    await context.close()
    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL:NASDAQ")

if __name__ == "__main__":
    asyncio.run(main())

This script introduces a random delay of 2 to 5 seconds between requests, making the activity less predictable and reducing the likelihood of being flagged as a bot.

2. Setting and rotating user agents

Websites often use the User-Agent string to identify the browser and device behind each request. By rotating User-Agent strings, you can make your scraping requests appear to come from different browsers and devices, helping you avoid detection.

Here's how to implement User-Agent rotation in Playwright:

import asyncio
import random
from playwright.async_api import Playwright, async_playwright

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0",
]

async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)

    context = await browser.new_context(user_agent=random.choice(user_agents))

    page = await context.new_page()

    url = f"https://www.google.com/finance/quote/{ticker}"
    await page.goto(url)

    # Your scraping logic goes here...

    await context.close()
    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL:NASDAQ")

if __name__ == "__main__":
    asyncio.run(main())

This approach uses a list of User-Agent strings and randomly picks one for each request. The technique helps mask your scraper's identity and reduces the likelihood of being blocked.

??*Note*: You can refer to websites like useragentstring.com for a comprehensive list of User-Agent strings.
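Because the User-Agent is fixed for the lifetime of a browser context, rotating it per request means creating a fresh context for each page. A minimal sketch under that assumption, reusing the user_agents list and the browser object from above:

# Reuse one browser but give every request its own context and UA
for ticker in ["AAPL:NASDAQ", "MSFT:NASDAQ"]:
    context = await browser.new_context(user_agent=random.choice(user_agents))
    page = await context.new_page()
    await page.goto(f"https://www.google.com/finance/quote/{ticker}")
    # ... scraping logic here ...
    await context.close()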

3. Using playwright-stealth

To further minimize detection and strengthen your scraping, you can use the playwright-stealth library, which applies various techniques to make your scraping activity look like that of a real user.

First, install playwright-stealth:

poetry add playwright-stealth

If you run into a ModuleNotFoundError for pkg_resources, it's most likely because the setuptools package isn't installed. To fix this, install setuptools as well:

poetry add setuptools

Then, modify your script:

import asyncio
from playwright.async_api import Playwright, async_playwright
from playwright_stealth import stealth_async

async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()

    page = await context.new_page()

    # Apply stealth techniques to the page to avoid detection
    await stealth_async(page)

    url = f"https://www.google.com/finance/quote/{ticker}"
    await page.goto(url)

    # Your scraping logic here...

    await context.close()
    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL:NASDAQ")

if __name__ == "__main__":
    asyncio.run(main())

These techniques can help you avoid getting blocked, but you may still run into issues. If so, try more advanced methods, such as using proxies, rotating IP addresses, or implementing CAPTCHA solvers.
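For example, Playwright can route all traffic through a proxy at launch time. A minimal sketch; the server address and credentials below are placeholders, not a real endpoint:

# Launch Chromium through an authenticated HTTP proxy
browser = await playwright.chromium.launch(
    headless=True,
    proxy={
        "server": "http://proxy.example.com:8000",  # hypothetical proxy server
        "username": "user",  # placeholder credentials
        "password": "pass",
    },
)

Combined with a pool of proxies chosen at random per request, this effectively rotates your IP address as well.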

7. Export the scraped stock data to CSV

Once you've scraped the stock data you need, the next step is to export it to a CSV file, making it easy to analyze, share with others, or import into other data-processing tools.

Here's how to save the extracted data to a CSV file:

# ...

import csv

async def main() -> None:
    # ...

    async with async_playwright() as playwright:

        # Collect data for all tickers
        results = []
        for ticker in tickers:
            data = await scrape_data(playwright, ticker)
            results.append(data)

    # Define the CSV file name
    csv_file = "financial_data.csv"

    # Write data to CSV
    with open(csv_file, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)

if __name__ == "__main__":
    asyncio.run(main())

The code first collects the data for each ticker symbol. It then creates a CSV file named financial_data.csv and writes the data using Python's csv.DictWriter, first writing the column headers with the writeheader() method and then adding each row of data with writerows().
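As a quick sanity check (not part of the original script), you can read the file back with csv.DictReader. Note that DictWriter serializes None values as empty strings, so any field the scraper couldn't find comes back as "":

import csv

with open("financial_data.csv", newline="") as file:
    for row in csv.DictReader(file):
        # Each row is a dict keyed by the CSV header
        print(row["ticker"], row["price"])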

8. Putting it all together

Let's combine everything into a single script. This final code snippet includes all the steps, from scraping data from Google Finance to exporting it to a CSV file.

import asyncio
import sys
import csv
from playwright.async_api import async_playwright, Playwright

async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    """
    Scrape financial data for a given stock ticker from Google Finance.

    Args:
        playwright (Playwright): The Playwright instance.
        ticker (str): The stock ticker symbol.

    Returns:
        dict: A dictionary containing the scraped financial data.
    """
    financial_data = {
        "ticker": ticker.split(":")[0],
        "price": None,
        "price_change_value": None,
        "price_change_percentage": None,
        "close_time": None,
        "previous_close": None,
        "day_range": None,
        "year_range": None,
        "market_cap": None,
        "avg_volume": None,
        "p/e_ratio": None,
        "dividend_yield": None,
        "primary_exchange": None,
    }

    try:
        # Launch the browser and navigate to the Google Finance page for the ticker
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(f"https://www.google.com/finance/quote/{ticker}")

        # Scrape current price
        price_element = await page.query_selector("div.YMlKec.fxKbKc")
        if price_element:
            price_text = await price_element.inner_text()
            financial_data["price"] = price_text.replace(",", "")

        # Scrape price change percentage
        percentage_element = await page.query_selector("div.enJeMd div.JwB6zf")
        if percentage_element:
            percentage_text = await percentage_element.inner_text()
            financial_data["price_change_percentage"] = percentage_text.strip()

        # Scrape price change value
        value_element = await page.query_selector("span.P2Luy.ZYVHBb")
        if value_element:
            value_text = await value_element.inner_text()
            value_parts = value_text.split()
            if value_parts:
                financial_data["price_change_value"] = value_parts[0].replace("$", "")

        # Scrape close time
        close_time_element = await page.query_selector('//div[contains(text(), "Closed:")]')
        if close_time_element:
            close_time_text = await close_time_element.inner_text()
            close_time = close_time_text.split("·")[0].replace("Closed:", "").strip()
            # Replace the narrow no-break space (U+202F) with a regular space
            clean_close_time = close_time.replace("\u202f", " ")
            financial_data["close_time"] = clean_close_time

        # Scrape additional financial data
        label_elements = await page.query_selector_all(".mfs7Fc")
        value_elements = await page.query_selector_all(".P6K39c")

        for label_element, value_element in zip(label_elements, value_elements):
            label = await label_element.inner_text()
            value = await value_element.inner_text()
            label = label.strip().lower().replace(" ", "_")
            if label in financial_data:
                financial_data[label] = value.strip()

    except Exception as e:
        print(f"An error occurred for {ticker}: {str(e)}")
    finally:
        # Ensure browser is closed even if an exception occurs
        await context.close()
        await browser.close()

    return financial_data

async def main():
    """
    Main function to scrape financial data for multiple stock tickers and save to CSV.
    """
    # Get ticker symbols from command line arguments
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        sys.exit(1)

    tickers = sys.argv[1:]
    async with async_playwright() as playwright:
        results = []
        for ticker in tickers:
            data = await scrape_data(playwright, f"{ticker}:NASDAQ")
            results.append(data)

    # Define CSV file name
    csv_file = "financial_data.csv"

    # Write data to CSV
    with open(csv_file, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)

    print(f"Data exported to {csv_file}")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())

You can run the script from your terminal by providing one or more ticker symbols as command-line arguments:

python google_finance_scraper/main.py AAPL META AMZN TSLA

After running the script, a CSV file named financial_data.csv is created in the same directory, containing all the data in an organized format.

9. Deploy the code to Apify

Once your scraper is ready, it's time to deploy it to the cloud with Apify. This lets you run your scraper on a schedule and take advantage of Apify's platform features. For this task, we'll use the Python Playwright template for a quick setup. On Apify, scrapers are called Actors.

Start by cloning the Playwright + Chrome template from the Apify Python template library.

To do that, you first need to install the Apify CLI, which helps you manage your Actors. On macOS or Linux, you can do this with Homebrew:

brew install apify-cli

Or via NPM:

npm -g install apify-cli

Once the CLI is installed, create a new Actor using the Python Playwright + Chrome template:

apify create gf-scraper -t python-playwright

This command sets up a project in the gf-scraper directory, installing all the necessary dependencies and providing some boilerplate code to get you started.

Navigate to the new project folder and open it with your favorite code editor. In this example, I'm using VS Code:

cd gf-scraper
code .

The template comes with a fully functional scraper. You can test it by running the apify run command to see it in action. The results are saved in storage/datasets.

Next, modify the code in src/main.py to adapt it for scraping Google Finance.

Here's the modified code:

from playwright.async_api import async_playwright
from apify import Actor

async def extract_stock_data(page, ticker):
    financial_data = {
        "ticker": ticker.split(":")[0],
        "price": None,
        "price_change_value": None,
        "price_change_percentage": None,
        "close_time": None,
        "previous_close": None,
        "day_range": None,
        "year_range": None,
        "market_cap": None,
        "avg_volume": None,
        "p/e_ratio": None,
        "dividend_yield": None,
        "primary_exchange": None,
    }

    # Scrape current price
    price_element = await page.query_selector("div.YMlKec.fxKbKc")
    if price_element:
        price_text = await price_element.inner_text()
        financial_data["price"] = price_text.replace(",", "")

    # Scrape price change percentage
    percentage_element = await page.query_selector("div.enJeMd div.JwB6zf")
    if percentage_element:
        percentage_text = await percentage_element.inner_text()
        financial_data["price_change_percentage"] = percentage_text.strip()

    # Scrape price change value
    value_element = await page.query_selector("span.P2Luy.ZYVHBb")
    if value_element:
        value_text = await value_element.inner_text()
        value_parts = value_text.split()
        if value_parts:
            financial_data["price_change_value"] = value_parts[0].replace("$", "")

    # Scrape close time
    close_time_element = await page.query_selector('//div[contains(text(), "Closed:")]')
    if close_time_element:
        close_time_text = await close_time_element.inner_text()
        close_time = close_time_text.split("·")[0].replace("Closed:", "").strip()
        # Replace the narrow no-break space (U+202F) with a regular space
        clean_close_time = close_time.replace("\u202f", " ")
        financial_data["close_time"] = clean_close_time

    # Scrape additional financial data
    label_elements = await page.query_selector_all(".mfs7Fc")
    value_elements = await page.query_selector_all(".P6K39c")

    for label_element, value_element in zip(label_elements, value_elements):
        label = await label_element.inner_text()
        value = await value_element.inner_text()
        label = label.strip().lower().replace(" ", "_")
        if label in financial_data:
            financial_data[label] = value.strip()
    return financial_data

async def main() -> None:
    """
    Main function to run the Apify Actor and extract stock data using Playwright.

    Reads input configuration from the Actor, enqueues URLs for scraping,
    launches Playwright to process requests, and extracts stock data.
    """
    async with Actor:

        # Retrieve input parameters
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])
        tickers = actor_input.get("tickers", [])

        if not start_urls:
            Actor.log.info("No start URLs specified in actor input. Exiting...")
            await Actor.exit()
        base_url = start_urls[0].get("url", "")

        # Enqueue requests for each ticker
        default_queue = await Actor.open_request_queue()
        for ticker in tickers:
            url = f"{base_url}{ticker}:NASDAQ"
            await default_queue.add_request(url)

        # Launch Playwright and open a new browser context
        Actor.log.info("Launching Playwright...")
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(headless=Actor.config.headless)
            context = await browser.new_context()

            # Process requests from the queue
            while request := await default_queue.fetch_next_request():
                url = request.url  # Use attribute access instead of dictionary-style access
                Actor.log.info(f"Scraping {url} ...")

                try:
                    # Open the URL in a new Playwright page
                    page = await context.new_page()
                    await page.goto(url, wait_until="domcontentloaded")

                    # Extract the ticker symbol from the URL
                    ticker = url.rsplit("/", 1)[-1]
                    data = await extract_stock_data(page, ticker)

                    # Push the extracted data to Apify
                    await Actor.push_data(data)
                except Exception as e:
                    Actor.log.exception(f"Error extracting data from {url}: {e}")
                finally:
                    # Ensure the page is closed and the request is marked as handled
                    await page.close()
                    await default_queue.mark_request_as_handled(request)

Before running the code, update the input_schema.json file in the .actor/ directory to include the Google Finance quote page URL and add a tickers field.

Here's the updated input_schema.json file:

{
    "title": "Python Playwright Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                {
                    "url": "https://www.google.com/finance/quote/"
                }
            ],
            "editor": "requestListSources"
        },
        "tickers": {
            "title": "Tickers",
            "type": "array",
            "description": "List of stock ticker symbols to scrape data for",
            "items": {
                "type": "string"
            },
            "prefill": [
                "AAPL",
                "GOOGL",
                "AMZN"
            ],
            "editor": "stringList"
        },
        "max_depth": {
            "title": "Maximum depth",
            "type": "integer",
            "description": "Depth to which to scrape to",
            "default": 1
        }
    },
    "required": [
        "start_urls",
        "tickers"
    ]
}

Also, update the input.json file by changing its URL to a Google Finance page to prevent conflicts during execution, or simply delete the file.
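For reference, an input matching this schema might look like the following (ticker values illustrative):

{
    "start_urls": [
        { "url": "https://www.google.com/finance/quote/" }
    ],
    "tickers": ["AAPL", "META", "AMZN"]
}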

To run your Actor, run the following command in your terminal:

apify run

The scraped results are saved in storage/datasets, where each ticker symbol gets its own JSON file.

To deploy your Actor, first create an Apify account if you don't already have one. Then, get your API token from the Apify Console under Settings → Integrations, and finally log in with your token using the following command:

apify login -t YOUR_APIFY_TOKEN

Finally, push your Actor to Apify:

apify push

After a few moments, your Actor should appear in the Apify Console under Actors → My actors.

Your scraper is now ready to run on the Apify platform. Click the Start button to begin. Once the run is finished, you can preview and download the data in various formats from the Storage tab.

Bonus: One major advantage of running your scraper on Apify is the ability to save different configurations for the same Actor and set up automatic scheduling. Let's set this up for our Playwright Actor.

On the Actor page, click Create empty task.

Next, click Actions and then Schedule.

Finally, choose how often you want the Actor to run and click Create.

Perfect! Your Actor is now set to run automatically at the times you specified. You can view and manage all scheduled runs in the Schedules tab of the Apify platform.


To start scraping with Python on the Apify platform, you can use Python code templates. Templates are available for popular libraries such as Requests, Beautiful Soup, Scrapy, Playwright, and Selenium, letting you quickly build scrapers for a variety of web scraping tasks.

Does Google Finance have an API?

No, Google Finance doesn't have a public API. It used to offer one, but it was deprecated in 2012. Since then, Google hasn't released a new public API for accessing financial data through Google Finance.

Conclusion

You've learned how to use Playwright to interact with Google Finance and extract valuable financial data. You've also explored ways to avoid getting blocked and built a solution where you simply pass one or more ticker symbols and all the desired data is stored in a single CSV file. In addition, you now have a solid understanding of how to use the Apify platform and its Actor framework to build scalable web scrapers and schedule them to run at the most convenient time.

Source: https://blog.apify.com/scrape-google-finance-python/#does-google-finance-allow-scraping
