
yahoo-finance-scraper/
├── pyproject.toml
├── README.md
├── yahoo_finance_scraper/
│   └── __init__.py
└── tests/
    └── __init__.py
Navigate to the project directory and install Playwright:
cd yahoo-finance-scraper
poetry add playwright
poetry run playwright install
Yahoo Finance loads its content dynamically with JavaScript. Playwright can render JavaScript, which makes it well suited for scraping dynamic content from Yahoo Finance.
Open the pyproject.toml file to check your project's dependencies, which should include:
[tool.poetry.dependencies]
python = "^3.12"
playwright = "^1.46.0"
Finally, create a main.py file in the yahoo_finance_scraper folder to write your scraping logic.
The updated project structure should look like this:
yahoo-finance-scraper/
├── pyproject.toml
├── README.md
├── yahoo_finance_scraper/
│   ├── __init__.py
│   └── main.py
└── tests/
    └── __init__.py
Your environment is now set up, and you're ready to start writing Python Playwright code to scrape Yahoo Finance.
Note: If you'd rather not set all of this up on your local machine, you can deploy the code directly on Apify. Later in this tutorial, I'll show you how to deploy and run your scraper on Apify.
First, let's launch a Chromium browser instance using Playwright. Although Playwright supports various browser engines, we'll use Chromium in this tutorial:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)  # Launch a Chromium browser
        context = await browser.new_context()
        page = await context.new_page()


if __name__ == "__main__":
    asyncio.run(main())
To run this script, you need to execute the main() function with an event loop at the end of the script.
Next, navigate to the Yahoo Finance page for the stock you want to scrape. The URL of a Yahoo Finance stock page looks like this:
https://finance.yahoo.com/quote/{ticker_symbol}
A ticker symbol is a unique code that identifies a publicly listed company on a stock exchange, such as AAPL for Apple Inc. or TSLA for Tesla, Inc. When the ticker symbol changes, the URL changes too. So you should replace {ticker_symbol} with the specific ticker you want to scrape.
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        # ...
        ticker_symbol = "AAPL"  # Replace with the desired ticker symbol
        yahoo_finance_url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(yahoo_finance_url)  # Navigate to the Yahoo Finance page


if __name__ == "__main__":
    asyncio.run(main())
Here's the complete script so far:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        # Launch a Chromium browser
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        ticker_symbol = "AAPL"  # Replace with the desired ticker symbol
        yahoo_finance_url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(yahoo_finance_url)  # Navigate to the Yahoo Finance page

        # Wait for a few seconds
        await asyncio.sleep(3)

        # Close the browser
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
When you run this script, it opens the Yahoo Finance page for a few seconds before terminating.
Great! Now you can scrape data for any stock of your choice just by changing the ticker symbol.
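As a side note, the fixed asyncio.sleep(3) above is fine for a quick demo, but waiting for a specific element is usually more reliable than a fixed delay. A minimal alternative sketch, using the qsp-price selector that this article identifies later:

# Instead of `await asyncio.sleep(3)`, wait until JavaScript has rendered the price element:
await page.wait_for_selector('[data-testid="qsp-price"]')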
Note: launching the browser with a UI (headless=False) is great for testing and debugging. If you want to save resources and run the browser in the background, switch to headless mode:
browser = await playwright.chromium.launch(headless=True)
When accessing Yahoo Finance from a European IP address, you may encounter a cookie consent modal that has to be dealt with before you can continue scraping.
To reach the page you want, you need to interact with the modal by clicking either "Accept all" or "Reject all". To do this, right-click the "Accept all" button and select Inspect to open the browser's DevTools.
In DevTools, you can see that the button can be selected with the following CSS selector:
button.accept-all
To click this button automatically in Playwright, you can use the following script:
import asyncio

from playwright.async_api import async_playwright, Playwright


async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        ticker_symbol = "AAPL"
        url = f"https://finance.yahoo.com/quote/{ticker_symbol}"
        await page.goto(url)

        try:
            # Click the "Accept All" button to bypass the modal
            await page.locator("button.accept-all").click()
        except:
            pass

        await browser.close()


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
If the cookie consent modal appears, this script attempts to click the "Accept All" button so you can continue scraping without interruption.
To scrape data effectively, you first need to understand the page's DOM structure. Suppose you want to extract the regular market price (224.72), the change (+3.00), and the change percent (+1.35%). All of these values are contained within a single div element. Inside that div, you'll find three fin-streamer elements representing the market price, the change, and the percentage, respectively.
To pinpoint these elements precisely, you can use the following CSS selectors:
[data-testid="qsp-price"]
[data-testid="qsp-price-change"]
[data-testid="qsp-price-change-percent"]
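Before wiring these selectors into the full scraper, you can sanity-check them from the running script in the previous section. A quick sketch, assuming the page object already points at the quote page:

# Read the three regular-market values using the data-testid selectors above
price = await page.locator('[data-testid="qsp-price"]').text_content()
change = await page.locator('[data-testid="qsp-price-change"]').text_content()
percent = await page.locator('[data-testid="qsp-price-change-percent"]').text_content()
print(price, change, percent)  # e.g. 224.72 +3.00 (+1.35%)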
Great! Next, let's look at how to extract the market close time, which appears on the page as "4 PM EDT".
To select the close time, use the following CSS selector:
div[slot="marketTimeNotice"] > span
Now, let's move on to extracting key company data from the table, such as market cap, previous close, and volume.
As you can see, the data is structured as a table with multiple li tags, one per field, starting with "Previous Close" and ending with "1y Target Est".
To extract specific fields like "Previous Close" and "Open", you can use the data-field attribute, which uniquely identifies each element:
[data-field="regularMarketPreviousClose"]
[data-field="regularMarketOpen"]
The data-field attribute provides a simple way to select elements. In some cases, though, no such attribute exists. For example, the "Bid" value lacks a data-field attribute or any other unique identifier. In that case, we first locate the "Bid" label by its text content and then move to the next sibling element to extract the corresponding value.
Here's the combined selector you can use:
span:has-text('Bid') + span.value
Now that you've identified the elements to scrape, it's time to write the Playwright script that extracts the data from Yahoo Finance.
Let's define a new function called scrape_data to handle the scraping. It accepts a ticker symbol, navigates to the Yahoo Finance page, and returns a dictionary containing the extracted financial data.
Here's how it works:
from playwright.async_api import async_playwright, Playwright


async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    try:
        # Launch the browser in headless mode
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        url = f"https://finance.yahoo.com/quote/{ticker}"
        await page.goto(url, wait_until="domcontentloaded")

        try:
            # Click the "Accept All" button if present
            await page.locator("button.accept-all").click()
        except:
            pass  # If the button is not found, continue without any action

        data = {"Ticker": ticker}

        # Extract regular market values
        data["Regular Market Price"] = await page.locator(
            '[data-testid="qsp-price"]'
        ).text_content()
        data["Regular Market Price Change"] = await page.locator(
            '[data-testid="qsp-price-change"]'
        ).text_content()
        data["Regular Market Price Change Percent"] = await page.locator(
            '[data-testid="qsp-price-change-percent"]'
        ).text_content()

        # Extract market close time
        market_close_time = await page.locator(
            'div[slot="marketTimeNotice"] > span'
        ).first.text_content()
        data["Market Close Time"] = market_close_time.replace("At close: ", "")

        # Extract other financial metrics
        data["Previous Close"] = await page.locator(
            '[data-field="regularMarketPreviousClose"]'
        ).text_content()
        data["Open Price"] = await page.locator(
            '[data-field="regularMarketOpen"]'
        ).text_content()
        data["Bid"] = await page.locator(
            "span:has-text('Bid') + span.value"
        ).text_content()
        data["Ask"] = await page.locator(
            "span:has-text('Ask') + span.value"
        ).text_content()
        data["Day's Range"] = await page.locator(
            '[data-field="regularMarketDayRange"]'
        ).text_content()
        data["52 Week Range"] = await page.locator(
            '[data-field="fiftyTwoWeekRange"]'
        ).text_content()
        data["Volume"] = await page.locator(
            '[data-field="regularMarketVolume"]'
        ).text_content()
        data["Avg. Volume"] = await page.locator(
            '[data-field="averageVolume"]'
        ).text_content()
        data["Market Cap"] = await page.locator(
            '[data-field="marketCap"]'
        ).text_content()
        data["Beta"] = await page.locator(
            "span:has-text('Beta (5Y Monthly)') + span.value"
        ).text_content()
        data["PE Ratio"] = await page.locator(
            "span:has-text('PE Ratio (TTM)') + span.value"
        ).text_content()
        data["EPS"] = await page.locator(
            "span:has-text('EPS (TTM)') + span.value"
        ).text_content()
        data["Earnings Date"] = await page.locator(
            "span:has-text('Earnings Date') + span.value"
        ).text_content()
        data["Dividend & Yield"] = await page.locator(
            "span:has-text('Forward Dividend & Yield') + span.value"
        ).text_content()
        data["Ex-Dividend Date"] = await page.locator(
            "span:has-text('Ex-Dividend Date') + span.value"
        ).text_content()
        data["1y Target Est"] = await page.locator(
            '[data-field="targetMeanPrice"]'
        ).text_content()

        return data
    except Exception as e:
        print(f"An error occurred while processing {ticker}: {e}")
        return {"Ticker": ticker, "Error": str(e)}
    finally:
        await context.close()
        await browser.close()
The code extracts data using the CSS selectors identified earlier: the locator method targets each element, and text_content() grabs its text. The scraped metrics are stored in a dictionary, where each key names a financial metric and each value holds the scraped text.
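One caveat with this approach: if any single metric is missing from the page layout, text_content() eventually raises and the whole ticker falls through to the except branch. If you want per-field resilience, a small optional helper can swallow the failure and return None instead. This helper (and the name safe_text) is my own addition, not part of the original script:

async def safe_text(page, selector: str, timeout: float = 2000) -> str | None:
    """Return the matched element's text, or None if the selector finds nothing in time."""
    try:
        return await page.locator(selector).first.text_content(timeout=timeout)
    except Exception:
        return None

# Example usage inside scrape_data:
#   data["Bid"] = await safe_text(page, "span:has-text('Bid') + span.value")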
Finally, define a main function that orchestrates the whole process by iterating over each ticker and collecting its data:
async def main():
    # Define the ticker symbol
    ticker = "AAPL"

    async with async_playwright() as playwright:
        # Collect data for the ticker
        data = await scrape_data(playwright, ticker)
        print(data)


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
At the end of the scraping run, the extracted data is printed to the console.
Now that we've captured real-time data, let's look at the historical stock information Yahoo Finance provides. This data reflects a stock's past performance and is useful for making investment decisions. You can query different time ranges at daily, weekly, or monthly intervals, such as the past month, the past year, or even the stock's entire history.
To access historical stock data on Yahoo Finance, you need to customize the URL by modifying specific parameters:
- frequency: specifies the data interval, such as daily (1d), weekly (1wk), or monthly (1mo).
- period1 and period2: set the start and end dates for the data as Unix timestamps.

For example, the following URL queries weekly historical data for Amazon (AMZN) from August 16, 2023 to August 16, 2024:
https://finance.yahoo.com/quote/AMZN/history/?frequency=1wk&period1=1692172771&period2=1723766400
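If you'd rather compute period1 and period2 from calendar dates than copy Unix timestamps by hand, Python's standard datetime module does the conversion. A small sketch reproducing the date range from the AMZN URL above (midnight UTC, so period1 differs slightly from the URL's value):

from datetime import datetime, timezone

# Unix timestamps for Aug 16, 2023 and Aug 16, 2024 (midnight UTC)
period1 = int(datetime(2023, 8, 16, tzinfo=timezone.utc).timestamp())
period2 = int(datetime(2024, 8, 16, tzinfo=timezone.utc).timestamp())
print(period1, period2)  # 1692144000 1723766400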
After navigating to this URL, you'll see a table containing the historical data. In our case, it shows data for the past year at weekly intervals.
To extract this data, you can use Playwright's query_selector_all method with the CSS selector .table tbody tr:
rows = await page.query_selector_all(".table tbody tr")
Each row contains multiple cells (td tags) holding the data. Here's how to extract the text content from each cell:
for row in rows:
    cells = await row.query_selector_all("td")
    date = await cells[0].text_content()
    open_price = await cells[1].text_content()
    high_price = await cells[2].text_content()
    low_price = await cells[3].text_content()
    close_price = await cells[4].text_content()
    adj_close = await cells[5].text_content()
    volume = await cells[6].text_content()
Next, create a function that generates Unix timestamps, which we'll use to define the start (period1) and end (period2) dates for the data:
import time


def get_unix_timestamp(
    years_back: int = 0,
    months_back: int = 0,
    days_back: int = 0,
) -> int:
    """Get a Unix timestamp for a specified number of years, months, or days back from today."""
    current_time = time.time()
    seconds_in_day = 86400
    return int(
        current_time
        - (years_back * 365 + months_back * 30 + days_back) * seconds_in_day
    )
Now, let's write a function to scrape the historical data:
from playwright.async_api import async_playwright, Playwright


async def scrape_historical_data(
    playwright: Playwright,
    ticker: str,
    frequency: str,
    period1: int,
    period2: int,
):
    url = f"https://finance.yahoo.com/quote/{ticker}/history?frequency={frequency}&period1={period1}&period2={period2}"

    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url, wait_until="domcontentloaded")

    try:
        await page.locator("button.accept-all").click()
    except:
        pass

    # Wait for the table to load
    await page.wait_for_selector(".table-container")

    # Extract table rows
    rows = await page.query_selector_all(".table tbody tr")

    # Prepare data storage
    data = []
    for row in rows:
        cells = await row.query_selector_all("td")
        date = await cells[0].text_content()
        open_price = await cells[1].text_content()
        high_price = await cells[2].text_content()
        low_price = await cells[3].text_content()
        close_price = await cells[4].text_content()
        adj_close = await cells[5].text_content()
        volume = await cells[6].text_content()

        # Add row data to list
        data.append(
            [date, open_price, high_price, low_price, close_price, adj_close, volume]
        )

    print(data)

    await context.close()
    await browser.close()
    return data
The scrape_historical_data function constructs the Yahoo Finance URL from the given parameters, navigates to the page while handling any cookie prompt, waits for the historical data table to fully load, and then extracts the relevant data and prints it to the console.
Finally, let's look at how to run this script with different settings:
async def main():
    async with async_playwright() as playwright:
        ticker = "TSLA"

        # Weekly data for the last year
        period1 = get_unix_timestamp(years_back=1)
        period2 = get_unix_timestamp()
        weekly_data = await scrape_historical_data(
            playwright, ticker, "1wk", period1, period2
        )


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
Adjust the parameters to customize the data period and frequency:
# Daily data for the last month
period1 = get_unix_timestamp(months_back=1)
period2 = get_unix_timestamp()
await scrape_historical_data(playwright, ticker, "1d", period1, period2)
# Monthly data for the stock's lifetime
period1 = 1
period2 = 999999999999
await scrape_historical_data(playwright, ticker, "1mo", period1, period2)
Here's the complete script we've written so far for scraping historical data from Yahoo Finance:
import asyncio
import time

from playwright.async_api import async_playwright, Playwright


def get_unix_timestamp(
    years_back: int = 0, months_back: int = 0, days_back: int = 0
) -> int:
    """Get a Unix timestamp for a specified number of years, months, or days back from today."""
    current_time = time.time()
    seconds_in_day = 86400
    return int(
        current_time
        - (years_back * 365 + months_back * 30 + days_back) * seconds_in_day
    )


async def scrape_historical_data(
    playwright: Playwright, ticker: str, frequency: str, period1: int, period2: int
):
    url = f"https://finance.yahoo.com/quote/{ticker}/history?frequency={frequency}&period1={period1}&period2={period2}"

    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url, wait_until="domcontentloaded")

    try:
        await page.locator("button.accept-all").click()
    except:
        pass

    # Wait for the table to load
    await page.wait_for_selector(".table-container")

    # Extract table rows
    rows = await page.query_selector_all(".table tbody tr")

    # Prepare data storage
    data = []
    for row in rows:
        cells = await row.query_selector_all("td")
        date = await cells[0].text_content()
        open_price = await cells[1].text_content()
        high_price = await cells[2].text_content()
        low_price = await cells[3].text_content()
        close_price = await cells[4].text_content()
        adj_close = await cells[5].text_content()
        volume = await cells[6].text_content()

        # Add row data to list
        data.append(
            [date, open_price, high_price, low_price, close_price, adj_close, volume]
        )

    print(data)

    await context.close()
    await browser.close()
    return data


async def main() -> None:
    async with async_playwright() as playwright:
        ticker = "TSLA"

        # Weekly data for the last year
        period1 = get_unix_timestamp(years_back=1)
        period2 = get_unix_timestamp()
        weekly_data = await scrape_historical_data(
            playwright, ticker, "1wk", period1, period2
        )


if __name__ == "__main__":
    asyncio.run(main())
Running this script prints all the historical stock data matching your specified parameters to the console.
So far, we've scraped data for a single stock. To collect data for multiple stocks at once, we can modify the script to accept ticker symbols as command-line arguments and process each one:
import sys


async def main() -> None:
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        return

    tickers = sys.argv[1:]

    async with async_playwright() as playwright:
        # Collect data for all tickers
        all_data = []
        for ticker in tickers:
            data = await scrape_data(playwright, ticker)
            all_data.append(data)
        print(all_data)


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
To run the script, pass ticker symbols as arguments:
python yahoo_finance_scraper/main.py AAPL MSFT TSLA
This scrapes and displays data for Apple Inc. (AAPL), Microsoft Corporation (MSFT), and Tesla, Inc. (TSLA).
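As an optional tweak: since each scrape_data call launches its own browser, the per-ticker scrapes are independent and could run concurrently with asyncio.gather. A hedged sketch (my own variation, untested against Yahoo's rate limits, so use with care):

import asyncio
from playwright.async_api import async_playwright

async def scrape_many(tickers: list[str]) -> list[dict]:
    async with async_playwright() as playwright:
        # Each scrape_data call launches its own browser, so the tasks share no state
        return list(
            await asyncio.gather(*(scrape_data(playwright, t) for t in tickers))
        )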
Websites often detect and block automated scraping using rate limiting, IP blocking, and checks on browsing patterns. Here are some effective ways to avoid detection while web scraping:
1. Random intervals between requests
Adding random delays between requests is a simple way to avoid detection. This basic technique makes your scraping less obvious to the website.
Here's how to add random delays in your Playwright script:
import asyncio
import random

from playwright.async_api import Playwright, async_playwright


async def scrape_data(playwright: Playwright, ticker: str):
    browser = await playwright.chromium.launch()
    context = await browser.new_context()
    page = await context.new_page()

    url = f"https://example.com/{ticker}"  # Example URL
    await page.goto(url)

    # Random delay to mimic human-like behavior
    await asyncio.sleep(random.uniform(2, 5))

    # Your scraping logic here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
This script introduces random delays of 2 to 5 seconds between requests, making the activity less predictable and reducing the chance of it being flagged as bot traffic.
2. Setting and rotating User-Agents
Websites often identify the browser and device behind a request from its User-Agent string. By rotating User-Agent strings, you make your scraper's requests appear to come from different browsers and devices, which helps avoid detection.
Here's how to implement User-Agent rotation in Playwright:
import asyncio
import random

from playwright.async_api import Playwright, async_playwright


async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)

    # List of user-agents
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0",
    ]

    # Select a random user-agent from the list to rotate between requests
    user_agent = random.choice(user_agents)

    # Apply the chosen user-agent when creating the browser context
    # (a context's user agent must be set at creation time via new_context)
    context = await browser.new_context(user_agent=user_agent)

    page = await context.new_page()
    url = f"https://example.com/{ticker}"  # Example URL with ticker
    await page.goto(url)

    # Your scraping logic goes here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
This approach uses a list of User-Agent strings and randomly picks one for each request. It helps mask your scraper's identity and reduces the likelihood of being blocked.
Note: You can refer to sites like useragentstring.com for a complete list of User-Agent strings.
3. Using Playwright-Stealth
To further reduce the risk of detection and improve your scraper, you can use the playwright-stealth library. It applies various techniques to make your scraping look like the browsing activity of a real user.
First, install playwright-stealth:
poetry add playwright-stealth
Then, modify your script:
import asyncio

from playwright.async_api import Playwright, async_playwright
from playwright_stealth import stealth_async


async def scrape_data(playwright: Playwright, ticker: str) -> None:
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context()
    page = await context.new_page()

    # Apply stealth techniques to the page to avoid detection
    await stealth_async(page)

    url = f"https://finance.yahoo.com/quote/{ticker}"
    await page.goto(url)

    # Your scraping logic here...

    await context.close()
    await browser.close()


async def main():
    async with async_playwright() as playwright:
        await scrape_data(playwright, "AAPL")  # Example ticker


if __name__ == "__main__":
    asyncio.run(main())
These techniques help avoid blocks, but you may still run into issues. If so, try more advanced methods such as using proxies, rotating IP addresses, or implementing CAPTCHA solvers. For that, see the detailed guide 21 tips on how to crawl a website without getting blocked, an essential resource for choosing proxies wisely, fighting Cloudflare, solving CAPTCHAs, avoiding honeypots, and more.
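As one concrete example of the proxy route: Playwright accepts a proxy option at browser launch, so all page traffic is routed through it. A minimal sketch (the server address and credentials below are placeholders, not a real endpoint):

# Route all browser traffic through a proxy (placeholder values)
browser = await playwright.chromium.launch(
    headless=True,
    proxy={
        "server": "http://proxy.example.com:8000",  # hypothetical proxy endpoint
        "username": "your-username",                # hypothetical credentials
        "password": "your-password",
    },
)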
Once you've scraped the stock data you need, the next step is to export it to a CSV file for easy analysis, sharing, or importing into other data-processing tools.
Here's how to save the extracted data to a CSV file:
# ...
import csv


async def main() -> None:
    # ...

    async with async_playwright() as playwright:
        # Collect data for all tickers
        all_data = []
        for ticker in tickers:
            data = await scrape_data(playwright, ticker)
            all_data.append(data)

        # Define the CSV file name
        csv_file = "stock_data.csv"

        # Write the data to a CSV file
        with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=all_data[0].keys())
            writer.writeheader()
            writer.writerows(all_data)


if __name__ == "__main__":
    asyncio.run(main())
The code first collects data for each ticker symbol. It then creates a CSV file named stock_data.csv and writes the data with Python's csv.DictWriter, using writeheader() for the column headers and writerows() for each row of data.
Let's combine everything into a single script. This final snippet covers all the steps, from scraping Yahoo Finance data to exporting it to a CSV file:
import asyncio
import csv
import sys

from playwright.async_api import async_playwright, Playwright


async def scrape_data(playwright: Playwright, ticker: str) -> dict:
    """
    Extracts financial data from Yahoo Finance for a given stock ticker.

    Args:
        playwright (Playwright): The Playwright instance used to control the browser.
        ticker (str): The stock ticker symbol to retrieve data for.

    Returns:
        dict: A dictionary containing the extracted financial data for the given ticker.
    """
    try:
        # Launch a headless browser
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Form the URL using the ticker symbol
        url = f"https://finance.yahoo.com/quote/{ticker}"

        # Navigate to the page and wait for the DOM content to load
        await page.goto(url, wait_until="domcontentloaded")

        # Try to click the "Accept All" button for cookies, if it exists
        try:
            await page.locator("button.accept-all").click()
        except:
            pass  # If the button is not found, continue without any action

        # Dictionary to store the extracted data
        data = {"Ticker": ticker}

        # Extract regular market values
        data["Regular Market Price"] = await page.locator(
            '[data-testid="qsp-price"]'
        ).text_content()
        data["Regular Market Price Change"] = await page.locator(
            '[data-testid="qsp-price-change"]'
        ).text_content()
        data["Regular Market Price Change Percent"] = await page.locator(
            '[data-testid="qsp-price-change-percent"]'
        ).text_content()

        # Extract market close time
        market_close_time = await page.locator(
            'div[slot="marketTimeNotice"] > span'
        ).first.text_content()
        data["Market Close Time"] = market_close_time.replace("At close: ", "")

        # Extract other financial metrics
        data["Previous Close"] = await page.locator(
            '[data-field="regularMarketPreviousClose"]'
        ).text_content()
        data["Open Price"] = await page.locator(
            '[data-field="regularMarketOpen"]'
        ).text_content()
        data["Bid"] = await page.locator(
            "span:has-text('Bid') + span.value"
        ).text_content()
        data["Ask"] = await page.locator(
            "span:has-text('Ask') + span.value"
        ).text_content()
        data["Day's Range"] = await page.locator(
            '[data-field="regularMarketDayRange"]'
        ).text_content()
        data["52 Week Range"] = await page.locator(
            '[data-field="fiftyTwoWeekRange"]'
        ).text_content()
        data["Volume"] = await page.locator(
            '[data-field="regularMarketVolume"]'
        ).text_content()
        data["Avg. Volume"] = await page.locator(
            '[data-field="averageVolume"]'
        ).text_content()
        data["Market Cap"] = await page.locator(
            '[data-field="marketCap"]'
        ).text_content()
        data["Beta"] = await page.locator(
            "span:has-text('Beta (5Y Monthly)') + span.value"
        ).text_content()
        data["PE Ratio"] = await page.locator(
            "span:has-text('PE Ratio (TTM)') + span.value"
        ).text_content()
        data["EPS"] = await page.locator(
            "span:has-text('EPS (TTM)') + span.value"
        ).text_content()
        data["Earnings Date"] = await page.locator(
            "span:has-text('Earnings Date') + span.value"
        ).text_content()
        data["Dividend & Yield"] = await page.locator(
            "span:has-text('Forward Dividend & Yield') + span.value"
        ).text_content()
        data["Ex-Dividend Date"] = await page.locator(
            "span:has-text('Ex-Dividend Date') + span.value"
        ).text_content()
        data["1y Target Est"] = await page.locator(
            '[data-field="targetMeanPrice"]'
        ).text_content()

        return data
    except Exception as e:
        # Handle any exceptions and return an error message
        print(f"An error occurred while processing {ticker}: {e}")
        return {"Ticker": ticker, "Error": str(e)}
    finally:
        # Ensure the browser is closed even if an error occurs
        await context.close()
        await browser.close()


async def main() -> None:
    """
    Main function to run the Yahoo Finance data extraction for multiple tickers.

    Reads ticker symbols from command-line arguments, extracts data for each,
    and saves the results to a CSV file.
    """
    if len(sys.argv) < 2:
        print("Please provide at least one ticker symbol as a command-line argument.")
        return

    tickers = sys.argv[1:]

    # Use async_playwright context to handle browser automation
    async with async_playwright() as playwright:
        # List to store data for all tickers
        all_data = []
        for ticker in tickers:
            # Extract data for each ticker and add it to the list
            data = await scrape_data(playwright, ticker)
            all_data.append(data)

        # Define the CSV file name
        csv_file = "stock_data.csv"

        # Write the extracted data to a CSV file
        with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=all_data[0].keys())
            writer.writeheader()
            writer.writerows(all_data)

        print(f"Data for tickers {', '.join(tickers)} has been saved to {csv_file}")


# Run the main function using asyncio
if __name__ == "__main__":
    asyncio.run(main())
You can run the script from your terminal by passing one or more ticker symbols as command-line arguments:
python yahoo_finance_scraper/main.py AAPL GOOG TSLA AMZN META
After the script runs, a CSV file named stock_data.csv is created in the same directory, holding all the data in an organized format. The CSV file will look like this:
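The header row comes directly from the dictionary keys built in scrape_data, with one row per ticker underneath (values omitted here):

Ticker,Regular Market Price,Regular Market Price Change,Regular Market Price Change Percent,Market Close Time,Previous Close,Open Price,Bid,Ask,Day's Range,52 Week Range,Volume,Avg. Volume,Market Cap,Beta,PE Ratio,EPS,Earnings Date,Dividend & Yield,Ex-Dividend Date,1y Target Est
AAPL,...
MSFT,...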
With your scraper ready, it's time to deploy it to the cloud with Apify. That lets you run the scraper on a schedule and take advantage of Apify's features. For this task, we'll use the Python Playwright template for a quick setup. On Apify, scrapers are called Actors.
Start by cloning the Playwright + Chrome template from the Apify Python template library.
First, you'll need to install the Apify CLI, which helps you manage Actors. On macOS or Linux, you can do this with Homebrew:
brew install apify/tap/apify-cli
Or via NPM:
npm -g install apify-cli
With the CLI installed, create a new Actor using the Python Playwright + Chrome template:
apify create yf-scraper -t python-playwright
This command sets up a yf-scraper project in your directory, installing all the necessary dependencies and providing some boilerplate code to get you started.
Navigate to the new project folder and open it with your preferred code editor. In this example, I'm using VS Code:
cd yf-scraper
code .
The template comes with a fully functional scraper. You can test it by running apify run to see it in action; the results are saved in storage/datasets.
Next, modify the code in src/main.py to suit scraping Yahoo Finance.
The modified code looks like this:
from apify import Actor
from playwright.async_api import async_playwright


async def extract_stock_data(page, ticker):
    data = {"Ticker": ticker}

    data["Regular Market Price"] = await page.locator(
        '[data-testid="qsp-price"]'
    ).text_content()
    data["Regular Market Price Change"] = await page.locator(
        '[data-testid="qsp-price-change"]'
    ).text_content()
    data["Regular Market Price Change Percent"] = await page.locator(
        '[data-testid="qsp-price-change-percent"]'
    ).text_content()
    data["Previous Close"] = await page.locator(
        '[data-field="regularMarketPreviousClose"]'
    ).text_content()
    data["Open Price"] = await page.locator(
        '[data-field="regularMarketOpen"]'
    ).text_content()
    data["Bid"] = await page.locator("span:has-text('Bid') + span.value").text_content()
    data["Ask"] = await page.locator("span:has-text('Ask') + span.value").text_content()
    data["Day's Range"] = await page.locator(
        '[data-field="regularMarketDayRange"]'
    ).text_content()
    data["52 Week Range"] = await page.locator(
        '[data-field="fiftyTwoWeekRange"]'
    ).text_content()
    data["Volume"] = await page.locator(
        '[data-field="regularMarketVolume"]'
    ).text_content()
    data["Avg. Volume"] = await page.locator(
        '[data-field="averageVolume"]'
    ).text_content()
    data["Market Cap"] = await page.locator('[data-field="marketCap"]').text_content()
    data["Beta"] = await page.locator(
        "span:has-text('Beta (5Y Monthly)') + span.value"
    ).text_content()
    data["PE Ratio"] = await page.locator(
        "span:has-text('PE Ratio (TTM)') + span.value"
    ).text_content()
    data["EPS"] = await page.locator(
        "span:has-text('EPS (TTM)') + span.value"
    ).text_content()
    data["Earnings Date"] = await page.locator(
        "span:has-text('Earnings Date') + span.value"
    ).text_content()
    data["Dividend & Yield"] = await page.locator(
        "span:has-text('Forward Dividend & Yield') + span.value"
    ).text_content()
    data["Ex-Dividend Date"] = await page.locator(
        "span:has-text('Ex-Dividend Date') + span.value"
    ).text_content()
    data["1y Target Est"] = await page.locator(
        '[data-field="targetMeanPrice"]'
    ).text_content()
    return data


async def main() -> None:
    """
    Main function to run the Apify Actor and extract stock data using Playwright.

    Reads input configuration from the Actor, enqueues URLs for scraping,
    launches Playwright to process requests, and extracts stock data.
    """
    async with Actor:
        # Retrieve input parameters
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])
        tickers = actor_input.get("tickers", [])

        if not start_urls:
            Actor.log.info("No start URLs specified in actor input. Exiting...")
            await Actor.exit()

        base_url = start_urls[0].get("url", "")

        # Enqueue requests for each ticker
        default_queue = await Actor.open_request_queue()
        for ticker in tickers:
            url = f"{base_url}{ticker}"
            await default_queue.add_request({"url": url, "userData": {"depth": 0}})

        # Launch Playwright and open a new browser context
        Actor.log.info("Launching Playwright...")
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(headless=Actor.config.headless)
            context = await browser.new_context()

            # Process requests from the queue
            while request := await default_queue.fetch_next_request():
                url = request["url"]
                Actor.log.info(f"Scraping {url} ...")
                try:
                    # Open the URL in a new Playwright page
                    page = await context.new_page()
                    await page.goto(url, wait_until="domcontentloaded")

                    # Extract the ticker symbol from the URL
                    ticker = url.rsplit("/", 1)[-1]
                    data = await extract_stock_data(page, ticker)

                    # Push the extracted data to Apify
                    await Actor.push_data(data)
                except Exception as e:
                    Actor.log.exception(f"Error extracting data from {url}: {e}")
                finally:
                    # Ensure the page is closed and the request is marked as handled
                    await page.close()
                    await default_queue.mark_request_as_handled(request)
Before running the code, update the input_schema.json file in the .actor/ directory to include the Yahoo Finance quote page URL and add a tickers field.
Here's the updated input_schema.json file:
{
    "title": "Python Playwright Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                {
                    "url": "https://finance.yahoo.com/quote/"
                }
            ],
            "editor": "requestListSources"
        },
        "tickers": {
            "title": "Tickers",
            "type": "array",
            "description": "List of stock ticker symbols to scrape data for",
            "items": {
                "type": "string"
            },
            "prefill": [
                "AAPL",
                "GOOGL",
                "AMZN"
            ],
            "editor": "stringList"
        },
        "max_depth": {
            "title": "Maximum depth",
            "type": "integer",
            "description": "Depth to which to scrape to",
            "default": 1
        }
    },
    "required": [
        "start_urls",
        "tickers"
    ]
}
Also, update the input.json file by changing its URL to a Yahoo Finance page to prevent conflicts during execution, or simply delete the file.
To run your Actor, execute the following command in your terminal:
apify run
The scraped results are saved in storage/datasets, where each ticker symbol gets its own JSON file, like this:
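For instance, an AAPL record looks roughly like this (the three regular-market values reuse the example figures quoted earlier in this article; the remaining fields are omitted):

{
  "Ticker": "AAPL",
  "Regular Market Price": "224.72",
  "Regular Market Price Change": "+3.00",
  "Regular Market Price Change Percent": "(+1.35%)"
}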
To deploy your Actor, first create an Apify account if you don't already have one. Then, get your API token from the Apify Console under Settings → Integrations, and log in with your token using the following command:
apify login -t YOUR_APIFY_TOKEN
Finally, push your Actor to Apify:
apify push
After a moment, your Actor should appear in the Apify Console under Actors → My actors.
Your scraper is now ready to run on the Apify platform. Click the Start button to begin. Once the run finishes, you can preview and download the data in various formats from the Storage tab.
Bonus: a major advantage of running scrapers on Apify is being able to save different configurations for the same Actor and set up automatic scheduling. Let's set this up for our Playwright Actor.
On the Actor page, click Create empty task.
Next, click Actions, then Schedule.
Finally, choose how often you want the Actor to run and click Create.
To start scraping with Python on the Apify platform, you can use the Python code templates. These templates cover popular libraries such as Requests, Beautiful Soup, Scrapy, Playwright, and Selenium, letting you quickly build scrapers for a variety of web scraping tasks.
Yahoo Finance offers a free API that gives users access to a wealth of financial information, including real-time stock quotes, historical market data, and the latest financial news. The API exposes various endpoints that return information in formats such as JSON, CSV, and XML, so you can easily integrate the data into your projects and use it however best fits your needs.
You've now built a practical system that extracts financial data from Yahoo Finance using Playwright. The code handles multiple ticker symbols and saves the results to a CSV file. You've also learned how to get around blocking mechanisms to keep your scraper running smoothly.
Original article: https://blog.apify.com/scrape-yahoo-finance-python/