FBRef home page

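Moving on, our goal is to build a time-series dataset that combines the information available on this page across several dates with the information behind each match report link, which holds more detailed statistics about the match. The image below shows an example match report.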

FBRef match report

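Looking at the site and its structure, it is clear that we do not need to deal with JavaScript-rendered content, which would make the scraping task somewhat more complicated, so from here on we use BeautifulSoup. We should now plan the scraper's structure around the information we need, because a scraper works linearly to capture it. The code lives in the class "scrapper", and its functionality is implemented as methods of that class.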

class scrapper:



    """

    Class used to scrape football data

    :param path:            The chrome driver path in your computer. Only used to get today matches information.

    :def getMatches():      Gets past matches information from the leagues chosen in a certain period.

                            Uses beautifulSoup framework

    :def getMatchesToday(): Gets predicted lineups and odds about matches to be played today.

                            Uses selenium framework

    """



    def __init__(self, path='D:/chromedriver_win32/chromedriver.exe'):



        self.originLink = 'https://fbref.com'

        self.path=path



        self.baseFolder = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

        self.dataFolder = os.path.join(self.baseFolder, 'data')



        self.scoresHome = []

        self.scoresAway = []

        self.homeTeams = []

        self.awayTeams = []

        self.dates = []

        self.homeXG = []

        self.awayXG = []

So, let's go through the steps I followed:

Old Matches Scraper

1. On the matches page, go to the specified date

yearNow, monthNow, dayNow = self._getDate(day)

urlDay = self.originLink + "/en/matches/{year}-{month}-{day}".format(year=yearNow, month=monthNow, day=dayNow)

print(urlDay)

html = urlopen(urlDay)

bs = BeautifulSoup(html.read(), 'html.parser')



def _getDate(self, date):
    """
    Helper function used to format url in the desired date in getMatches()

    :param date: datetime.date object
    :return: The formatted year, month and day of the date object
    """
    year = str(date.year)
    # Zero-pad month and day so the URL reads e.g. 2022-03-07
    month = str(date.month).zfill(2)
    day = str(date.day).zfill(2)
    return year, month, day

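This step and all the steps below run once per day over a date range defined by the user. The function getMatches() takes a start date and an end date, which set the boundaries within which the scraper runs.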
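The day-by-day iteration can be sketched as follows. This is a minimal illustration rather than the actual getMatches(): the helper names iter_days and build_day_url are hypothetical, treating the end date as inclusive is an assumption, and only the URL pattern comes from the code above.

```python
from datetime import date, timedelta

ORIGIN_LINK = "https://fbref.com"  # same base URL as self.originLink


def iter_days(start, end):
    """Yield every date from start to end, both inclusive (hypothetical helper)."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def build_day_url(day):
    """Format the matches-page URL for one day, zero-padding month and day."""
    return f"{ORIGIN_LINK}/en/matches/{day:%Y-%m-%d}"


urls = [build_day_url(d) for d in iter_days(date(2022, 2, 27), date(2022, 3, 1))]
for url in urls:
    print(url)
```

Keeping the date loop separate from the URL formatting makes it easy to test the boundaries (month ends, leap years) without sending any request.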

2. Get each championship table

championshipTables = bs.find_all('div', {'class':'table_wrapper'})
validTables = []
for table in championshipTables:
  try:
    # Tables without a competition link make find() return None,
    # so get_text() raises AttributeError and the table is skipped
    leagueName = table.find('a', {'href':re.compile('^/en/comps/')}).get_text()
  except AttributeError:
    continue
  validTables.append((leagueName, table))
# Keep only the tables of the leagues chosen by the user
desiredTables = [table for leagueName, table in validTables if leagueName in leagues]

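Continuing from the first step, the leagues variable can be supplied by the user, who chooses which leagues to scrape. The code also contains a try-except clause, which handles structural anomalies such as the fake tables that can appear on the site.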

3. From each championship table, get the information from the match rows

for table in desiredTables:
  time.sleep(4)  # pause between tables to limit the request rate
  matchesLinks = []
  homeTeams = table.find_all('td', {'data-stat':'home_team'})
  for team in homeTeams:
    self.homeTeams.append(team.get_text())
    self.dates.append(day)
  awayTeams = table.find_all('td', {'data-stat':'away_team'})
  for team in awayTeams:
    self.awayTeams.append(team.get_text())
  scores = table.find_all('td', {'data-stat':'score'})
  for score in scores:
    scoreHome, scoreAway = self._getScore(score.get_text())
    self.scoresHome.append(scoreHome)
    self.scoresAway.append(scoreAway)
    # the match report link is taken from the score cell
    matchesLinks.append(score.find('a', {'href':re.compile('^/')})['href'])

  if table.find_all('td', {'data-stat':'home_xg'}):
    homeXG = table.find_all('td', {'data-stat':'home_xg'})
    awayXG = table.find_all('td', {'data-stat':'away_xg'})
    for xg in homeXG:
      self.homeXG.append(xg.get_text())
    for xg in awayXG:
      self.awayXG.append(xg.get_text())
  else:
    # no xG columns for this league: fill one NaN pair per match
    for team in homeTeams:
      self.homeXG.append(np.nan)
      self.awayXG.append(np.nan)

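Here, besides appending the information we wanted from the start to our lists, I want to highlight the sleep call, which limits the number of requests we send in a given period and keeps our IP from being banned. Also worth noting is how each match report link is stored: it is taken from the score cell. By capturing the link from the score cell rather than from the "Match Report" cell, we avoid keeping links for postponed or cancelled matches in memory. This leads us to the next step: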
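The _getScore helper called in the loop is not shown in this article. Below is a minimal sketch of what such a helper might do, assuming the score cell holds text like "2–1" (with an en dash or a hyphen) and is empty for matches without a result. The name parse_score and the None placeholders are hypothetical stand-ins, not the original implementation.

```python
import re

# Matches "2–1" (en dash) or "2-1" (hyphen), with optional spaces around the dash.
SCORE_PATTERN = re.compile(r"(\d+)\s*[–-]\s*(\d+)")


def parse_score(text):
    """Return (home_goals, away_goals), or (None, None) when no score is present."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


print(parse_score("2–1"))
print(parse_score(""))
```

Returning a placeholder pair instead of raising keeps the score lists the same length as the team lists, which matters later when they become data frame columns.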

4. Get each match report and retrieve its information

for link in matchesLinks:

  dfMatchStats.loc[len(dfMatchStats)] = self._getMatchStats(link)



def _getMatchStats(self, url):

  """

    Helper function to extract the match stats for each match in getMatches()

    :param url: The match report url - is extracted in getMatches()

    :return: List with match stats

  """



  stats={"Fouls":[np.nan, np.nan], "Corners":[np.nan, np.nan], "Crosses":[np.nan, np.nan], "Touches":[np.nan, np.nan],

        "Tackles":[np.nan, np.nan], "Interceptions":[np.nan, np.nan],"Aerials Won":[np.nan, np.nan],

        "Clearances":[np.nan, np.nan], "Offsides":[np.nan, np.nan], "Goal Kicks":[np.nan, np.nan], "Throw Ins":[np.nan, np.nan],

        "Long Balls":[np.nan, np.nan]}



  matchStatsList = []

  htmlMatch = urlopen(self.originLink + url)

  bsMatch = BeautifulSoup(htmlMatch.read(), 'html.parser')

  homeLineup = bsMatch.find('div', {'class':'lineup', 'id':'a'})

  if not homeLineup:

    homePlayers = []

    awayPlayers = []

    for i in range(0,11):

      homePlayers.append(np.nan)

      awayPlayers.append(np.nan)

    yellowCardsHome = np.nan

    redCardsHome = np.nan

    yellowCardsAway = np.nan

    redCardsAway = np.nan

    matchStatsList.extend([yellowCardsHome, redCardsHome, yellowCardsAway, redCardsAway])

    for key, value in stats.items():

      matchStatsList.extend(value)

    return homePlayers + awayPlayers + matchStatsList

  homePlayers = homeLineup.find_all('a', {'href':re.compile('^/en/players')})[0:11]

  homePlayers = [player.get_text() for player in homePlayers]

  awayLineup = bsMatch.find('div', {'class':'lineup', 'id':'b'})

  awayPlayers = awayLineup.find_all('a', {'href':re.compile('^/en/players')})[0:11]

  awayPlayers = [player.get_text() for player in awayPlayers]

  matchCards = bsMatch.find_all('div', {'class':'cards'})

  yellowCardsHome = len(matchCards[0].find_all('span', {'class':'yellow_card'})) + len(matchCards[0].find_all('span', {'class':'yellow_red_card'}))

  redCardsHome = len(matchCards[0].find_all('span', {'class':'red_card'})) + len(matchCards[0].find_all('span', {'class':'yellow_red_card'}))

  yellowCardsAway = len(matchCards[1].find_all('span', {'class':'yellow_card'})) + len(matchCards[1].find_all('span', {'class':'yellow_red_card'}))

  redCardsAway = len(matchCards[1].find_all('span', {'class':'red_card'})) + len(matchCards[1].find_all('span', {'class':'yellow_red_card'}))

  matchStatsList.extend([yellowCardsHome, redCardsHome, yellowCardsAway, redCardsAway])



  extraStatsPanel = bsMatch.find("div", {"id":"team_stats_extra"})

  for statColumn in extraStatsPanel.find_all("div", recursive=False):

    column = statColumn.find_all("div")

    columnValues = [value.get_text() for value in column]

    for index, value in enumerate(columnValues):

      if not value.isdigit() and value in stats:

        stats[value] = [int(columnValues[index-1]), int(columnValues[index+1])]

  for key, value in stats.items():

    matchStatsList.extend(value)



  return homePlayers + awayPlayers + matchStatsList

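As you can see, this part is a bit tricky, so let's go through a brief explanation. Yellow and red cards are counted by summing the number of card elements with the yellow or red class; a second yellow (the yellow_red_card class) counts toward both totals. The other statistics are obtained by: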

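Checking whether the statistic is in the dictionary of expected statistics
If so, updating the dictionary with the values linked to that statistic, which are the values immediately before and after the statistic's name

Attentive readers may have noticed that step 2 (getting each championship table) is not mandatory, but it gives us the flexibility to keep only the matches from the leagues we want, which is the approach I took.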
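The previous-value/next-value lookup for the extra statistics can be illustrated on plain lists, without any HTML. This is a sketch under assumptions: extract_stats is a hypothetical name, the sample cell texts are made up, and None stands in for the np.nan defaults used in _getMatchStats().

```python
def extract_stats(column_values, expected):
    """Fill expected stats from a flat list of cell texts.

    Each stat name is assumed to sit between its home value (previous cell)
    and its away value (next cell), as in [..., "12", "Fouls", "9", ...].
    """
    stats = {name: [None, None] for name in expected}
    for index, value in enumerate(column_values):
        # Only known stat names with a neighbour on both sides are used.
        if value in stats and 0 < index < len(column_values) - 1:
            stats[value] = [
                int(column_values[index - 1]),
                int(column_values[index + 1]),
            ]
    return stats


# Made-up cell texts: two header cells, then home value, stat name, away value.
cells = ["Home FC", "Away FC", "12", "Fouls", "9", "5", "Corners", "3"]
result = extract_stats(cells, ["Fouls", "Corners", "Offsides"])
print(result)
```

Statistics that never appear in the panel keep their placeholder pair, so every match yields the same number of columns.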

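As an extra step, I realized a checkpoint trigger was needed, because the scraper may hit unforeseen errors, or FBRef may stop accepting new requests from your IP, and either case would mean a lot of wasted time. So, on the first day of every month, we save the scraper's work so far; if an unexpected error occurs, we have a safe checkpoint to recover from.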
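The checkpoint idea can be sketched with the standard library alone. The scraper itself pickles a pandas data frame; here a list of dictionaries and the pickle module stand in for it, and the names maybe_checkpoint and load_checkpoint are hypothetical.

```python
import os
import pickle
import tempfile
from datetime import date


def maybe_checkpoint(day, rows, folder):
    """Pickle the rows collected so far when day is the first of a month."""
    if day.day != 1:
        return None
    path = os.path.join(folder, "checkPoint.pkl")
    with open(path, "wb") as handle:
        pickle.dump(rows, handle)
    return path


def load_checkpoint(path):
    """Read back the rows saved by maybe_checkpoint()."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


rows = [{"homeTeam": "A", "awayTeam": "B", "date": date(2022, 2, 28)}]
with tempfile.TemporaryDirectory() as folder:
    skipped = maybe_checkpoint(date(2022, 2, 28), rows, folder)  # not day 1
    saved = maybe_checkpoint(date(2022, 3, 1), rows, folder)     # day 1: saved
    restored = load_checkpoint(saved)
print(skipped, restored == rows)
```

After a crash, a run could load the checkpoint and restart from the first day of the last saved month instead of from the original start date.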

That's all there is to it. At the bottom of the code below, you can see the date-update iterator and the operations needed to format the final data frame.

if day.day == 1:

  # if the process crashes, we have a checkpoint from the start of each month

  dfCheckpoint = dfMatchStats.copy()

  dfCheckpoint["homeTeam"] = self.homeTeams

  dfCheckpoint["awayTeam"] = self.awayTeams

  dfCheckpoint["scoreHome"] = self.scoresHome

  dfCheckpoint["scoreAway"] = self.scoresAway

  dfCheckpoint["homeXG"] = self.homeXG

  dfCheckpoint["awayXG"] = self.awayXG

  dfCheckpoint["date"] = self.dates

  dfCheckpoint.to_pickle(os.path.join(self.dataFolder, 'checkPoint.pkl'))



day = day + timedelta(days=1)

dfMatchStats["homeTeam"] = self.homeTeams

dfMatchStats["awayTeam"] = self.awayTeams

dfMatchStats["scoreHome"] = self.scoresHome

dfMatchStats["scoreAway"] = self.scoresAway

dfMatchStats["homeXG"] = self.homeXG

dfMatchStats["awayXG"] = self.awayXG

dfMatchStats["date"] = self.dates



return dfMatchStats

Data frame preview

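The whole process lets us scrape data to build models that predict football matches, but we still need to scrape data about upcoming matches so that we can do something useful with what we have already collected. The best source I found for this is SofaScore, an app that also collects and stores information about matches and players and, in addition, provides the actual Bet365 odds for each match.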

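SofaScore relies heavily on JavaScript, which means the HTML is not fully available for us to use with BeautifulSoup, so we need another framework to scrape its information. I chose the widely used Selenium package, which lets us browse the web through Python code the way a human user would. You can actually watch the web driver clicking and navigating in the browser of your choice; I chose Chrome.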

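In the image below, you can see the SofaScore home page with the ongoing and upcoming matches, and on the right you can see what happens when you click a specific match and then click "LINEUPS".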

SofaScore interface

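Because Selenium, as I explained, works like a human user browsing the web, you might expect the process to be a bit slower, and it is. We therefore have to be more careful at each step so that we do not click buttons that do not exist yet, since some JavaScript content is rendered only after the user does something. For example, when we click a specific match, the server takes some time to render the side menu shown in the second image, and if the code tries to click the lineups button during that time, it raises an error. Now, let's go through the code.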
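The idea of waiting for an element before acting on it can be shown without a browser. The sketch below is a plain-Python stand-in for an explicit wait, not Selenium's implementation: wait_until and the simulated side_menu_rendered check are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll_frequency)


# Simulated page: the side menu "renders" only on the third check.
checks = {"count": 0}


def side_menu_rendered():
    checks["count"] += 1
    return "LINEUPS" if checks["count"] >= 3 else None


label = wait_until(side_menu_rendered, timeout=2.0, poll_frequency=0.01)
print(label, checks["count"])
```

Polling with a timeout is preferable to a fixed sleep: the code proceeds as soon as the element is ready and fails loudly if it never appears.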

Upcoming Matches Scraper

1. Open the home page and activate the "Show odds" button

def _getDriver(self, path='D:/chromedriver_win32/chromedriver.exe'):

  chrome_options = Options()

  return webdriver.Chrome(executable_path=path, options=chrome_options)  



def getMatchesToday(self):

  self.driver = self._getDriver(path=self.path)

  self.driver.get("https://www.sofascore.com/")



  WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "slider")))

  oddsButton = self.driver.find_element(By.CLASS_NAME, "slider")

  oddsButton.click()



  homeTeam=[]

  awayTeam=[]

  odds=[]

  homeOdds = []

  drawOdds = []

  awayOdds = []

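As I mentioned, after starting the driver and reaching SofaScore's URL, we need to wait until the odds button is rendered before clicking it. We also create the lists that will store the scraped information.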

2. Store the main match information

WebDriverWait(self.driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, 'fvgWCd')))

matches = self.driver.find_elements(By.CLASS_NAME, 'js-list-cell-target')

for match in matches:

  if self._checkExistsByClass('blXay'):

    homeTeam.append(match.find_element(By.CLASS_NAME, 'blXay').text)

    awayTeam.append(match.find_element(By.CLASS_NAME, 'crsngN').text)



    if match.find_element(By.CLASS_NAME, 'haEAMa').text == '-':

      oddsObject = match.find_elements(By.CLASS_NAME, 'fvgWCd')

      for odd in oddsObject:

        odds.append(odd.text)



while(len(odds) > 0):

  homeOdds.append(odds.pop(0))

  drawOdds.append(odds.pop(0))

  awayOdds.append(odds.pop(0))

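Nothing special here, but note that the check comparing the text of the 'haEAMa' element to '-' keeps odds only for matches that have not started yet. I did this because handling odds for ongoing matches is trickier, it is not yet clear how the future betting simulator will work, and it may not work properly with live results.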
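The while loop that pops three odds at a time can also be written with slicing. This is an equivalent sketch, assuming (as the loop above does) that every match contributes exactly three odds in home, draw, away order. The name split_odds and the sample values are made up.

```python
def split_odds(odds):
    """Group a flat [home, draw, away, home, draw, away, ...] list into three lists."""
    if len(odds) % 3 != 0:
        raise ValueError("expected three odds per match")
    home_odds = odds[0::3]  # every third value, starting at index 0
    draw_odds = odds[1::3]  # every third value, starting at index 1
    away_odds = odds[2::3]  # every third value, starting at index 2
    return home_odds, draw_odds, away_odds


odds = ["1.90", "3.40", "4.20", "2.50", "3.10", "2.90"]  # made-up values
home, draw, away = split_odds(odds)
print(home, draw, away)
```

Unlike repeated pop(0) calls, slicing leaves the original list intact, and the length check surfaces a match with missing odds instead of raising an IndexError halfway through.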

3. Get the lineups

df = pd.DataFrame({"homeTeam":homeTeam, "awayTeam":awayTeam, "homeOdds":homeOdds, "drawOdds":drawOdds, "awayOdds":awayOdds})

lineups = self._getLineups()



df = pd.concat([df, lineups], axis=1).iloc[:,:-1]



return df



def _getLineups(self):



  matches = self.driver.find_elements(By.CLASS_NAME, "kusmLq")



  nameInPanel = ""



  df = pd.DataFrame(columns=["{team}Player{i}".format(team="home" if i <=10 else "away", i=i+1 if i <=10 else i-10) for i in range(0,22)])

  df["homeTeam"] = []



  for match in matches:



    self.driver.execute_script("arguments[0].click()", match)



    #wait until panel is refreshed



    waiter = WebDriverWait(driver=self.driver, timeout=10, poll_frequency=1)

    waiter.until(lambda drv: drv.find_element(By.CLASS_NAME, "dsMMht").text != nameInPanel)

    nameInPanel = self.driver.find_element(By.CLASS_NAME, "dsMMht").text



    if self._checkExistsByClass("jwanNG") and self.driver.find_element(By.CLASS_NAME, "jwanNG").text == "LINEUPS":



      lineupButton = self.driver.find_element(By.CLASS_NAME, "jwanNG")

      lineupButton.click()

      # wait until players are available

      WebDriverWait(self.driver, 20).until(EC.visibility_of_element_located((By.CLASS_NAME, "kDQXnl")))

      players = self.driver.find_elements(By.CLASS_NAME, "kDQXnl")

      playerNames=[]

      for player in players:

        playerNames.append(player.find_elements(By.CLASS_NAME, "sc-eDWCr")[2].accessible_name)

      playerNames = [self._isCaptain(playerName) for playerName in playerNames]

      playerNames.append(nameInPanel)



      df.loc[len(df)] = playerNames

    else:

      df.loc[len(df), "homeTeam"] = nameInPanel



  return df



def _isCaptain(self, name):
  # Captains are rendered with a "(c) " prefix; strip it to keep the bare name
  if name.startswith("(c) "):
    name = name[4:]
  return name

Data frame preview

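To summarize the code block above: we wait until the match's side menu has loaded, click the lineups button, and get the player names. We need to be a little careful, because each team captain's name is formatted differently on the site (it carries a "(c) " prefix), so we created a helper function to handle it. We then store each match's player names in a data frame and, once the whole process is done, concatenate the match information with the predicted lineups.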
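How one lineup row is put together can be sketched without a browser or pandas. The example assumes, as the column naming in _getLineups() implies, that the first 11 names belong to the home team and the next 11 to the away team. The function names and player names are hypothetical.

```python
def strip_captain(name):
    """Drop the "(c) " prefix that marks a team captain."""
    prefix = "(c) "
    return name[len(prefix):] if name.startswith(prefix) else name


def lineup_columns():
    """Column names homePlayer1..homePlayer11 followed by awayPlayer1..awayPlayer11."""
    home = [f"homePlayer{i}" for i in range(1, 12)]
    away = [f"awayPlayer{i}" for i in range(1, 12)]
    return home + away


def build_lineup_row(player_names, home_team):
    """Map 22 scraped names to their columns and attach the home team name."""
    if len(player_names) != 22:
        raise ValueError("expected 11 home and 11 away players")
    cleaned = [strip_captain(name) for name in player_names]
    row = dict(zip(lineup_columns(), cleaned))
    row["homeTeam"] = home_team
    return row


# Made-up names; the fourth home player is marked as captain.
names = [f"Home {i}" for i in range(1, 12)] + [f"Away {i}" for i in range(1, 12)]
names[3] = "(c) " + names[3]
row = build_lineup_row(names, "Home FC")
print(row["homePlayer4"], row["awayPlayer11"], row["homeTeam"])
```

Keeping the home team name in each row gives a key for checking that a lineup row lines up with the matching row of odds before the two tables are concatenated.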