
How to Quickly Implement REST API Integration to Optimize Business Processes
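The second item in the left-hand menu is Web Scraper API. Click it to open the Web Scraper API details page.
That page lists some of the more popular APIs, such as those for Facebook, Instagram, TikTok, and Twitter data, along with the API logs for the jobs we are running.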
The list contains hundreds of APIs, covering market data, B2B data, e-commerce data, financial data, news data, real estate data, social media data, travel data, and more. Here I choose the popular Facebook - Comments - Collect by URL API from the social media category.
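Here you fill in the URLs to collect data from, the number of posts (num_of_posts), the posts to exclude (posts_to_not_include), the start date (start_date), and the end date (end_date). In this example we crawl 10 posts from each of three Facebook users.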
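The form fields above map naturally onto one small dictionary per page. The following is a minimal sketch, assuming the field names shown in the form and the MM-DD-YYYY date format that appears in the sample output later in this article; build_input is a hypothetical helper, not part of any Bright Data library.

```python
from datetime import datetime


def build_input(url, num_of_posts, start_date, end_date, posts_to_not_include=None):
    """Build one input entry using the field names listed in the form."""
    # The sample output in this article shows dates as MM-DD-YYYY; check that shape.
    for value in (start_date, end_date):
        datetime.strptime(value, "%m-%d-%Y")
    if num_of_posts <= 0:
        raise ValueError("num_of_posts must be positive")
    entry = {
        "url": url,
        "num_of_posts": num_of_posts,
        "start_date": start_date,
        "end_date": end_date,
    }
    # Only send the exclusion field when the caller actually provided one.
    if posts_to_not_include is not None:
        entry["posts_to_not_include"] = posts_to_not_include
    return entry


# The three pages and date ranges used in this article.
inputs = [
    build_input("https://www.facebook.com/LeBron/", 10, "10-20-2024", "11-01-2024"),
    build_input("https://www.facebook.com/gagadaily/", 10, "10-20-2024", "10-20-2024"),
    build_input("https://www.facebook.com/apple/", 10, "10-20-2024", "11-01-2024"),
]
```

Validating the dates locally catches a malformed value before a collection job is started.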
Next, get an API token: click Get API token to generate it. Save the token locally, because you will need it when calling the API below.
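Then run the request command. Here I chose the Linux Bash version of the command. Be sure to replace API_TOKEN in the command with the token you generated above. Running the command then produces a snapshot id.
The resulting snapshot id looks like this: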
{"snapshot_id":"s_m342n89p1h56iw97em"}
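For readers who prefer Python to the Bash command, the trigger step can be sketched with the standard library alone. This is a sketch under assumptions: the endpoint URL below is assumed rather than taken from this article, so copy the real one from the command shown in the dashboard, and build_trigger_request and parse_snapshot_id are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; copy the real URL from the command shown in the dashboard.
TRIGGER_URL = "https://api.brightdata.com/datasets/v3/trigger"


def build_trigger_request(api_token, dataset_id, inputs):
    """Prepare (but do not send) the POST request that starts a collection."""
    query = urllib.parse.urlencode({"dataset_id": dataset_id})
    return urllib.request.Request(
        f"{TRIGGER_URL}?{query}",
        data=json.dumps(inputs).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_snapshot_id(body):
    """Extract snapshot_id from the response text."""
    # A shell such as zsh may show a trailing "%" after output that lacks a
    # newline; drop it in case it was copied along with the response.
    return json.loads(body.strip().rstrip("%"))["snapshot_id"]
```

To actually send the request, pass the result to urllib.request.urlopen and feed the decoded response body to parse_snapshot_id.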
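Two ways of getting the results are offered: downloading the snapshot, or delivering it to storage. Here I choose to download the snapshot, with JSON as the file format and Compress files (.gz) selected, and then run the command shown on the right. Fill in the snapshot id first, then replace API_TOKEN with the token you generated.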
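When you run the download command, a message appears showing that the status is running. Wait a moment (if the date range or data volume selected earlier is large, the wait will be somewhat longer).
Running the same command again then shows that the snapshot is being built: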
{"status":"building","message":"Snapshot is building, try again in 10s"}
Wait a while and run the same command once more, and the crawled data finally appears.
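Re-running the command by hand can be replaced by a small polling loop. The sketch below assumes that a pending response is a JSON object whose status is "running" or "building" (only the "building" shape is shown in this article) and waits 10 seconds between tries, as the message suggests. wait_for_snapshot is a hypothetical helper; fetch stands for whatever function performs the download request and returns the response text.

```python
import json
import time


def wait_for_snapshot(fetch, max_attempts=30, delay_seconds=10, sleep=time.sleep):
    """Call fetch() until it stops reporting a pending status, then return the body."""
    pending = {"running", "building"}
    for _ in range(max_attempts):
        body = fetch()
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            # Not a single JSON document (for example line-delimited output):
            # treat it as the data itself.
            return body
        if isinstance(parsed, dict) and parsed.get("status") in pending:
            sleep(delay_seconds)
            continue
        return body
    raise TimeoutError("snapshot was not ready after the allowed attempts")
```

Passing sleep as a parameter keeps the loop easy to test without real waiting.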
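Reading the data directly in the terminal is not very convenient, so it is easier to write it to a JSON file. Just append --output data.json to the previous command, and a data.json file is created in the current directory.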
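If the response is fetched from Python instead, the same result can be had by writing the body to disk. This is a minimal sketch, assuming the body is a JSON array of records as in the excerpts in this article; save_snapshot is a hypothetical helper name.

```python
import json
from pathlib import Path


def save_snapshot(body, path="data.json"):
    """Write the downloaded JSON text to disk, pretty-printed, and return the record count."""
    records = json.loads(body)
    # ensure_ascii=False keeps non-ASCII post text readable in the file.
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

Pretty-printing with indent=2 makes the file much easier to scan than the single-line terminal output.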
In the generated JSON data, the crawls for https://www.facebook.com/gagadaily/ and https://www.facebook.com/apple/ failed for the start_date and end_date ranges set for each:
"post_id": null,
"page_name": null,
"post_external_image": null,
"post_type": null,
"following": null,
"link_description_text": null,
"timestamp": "2024-11-05T06:31:43.199Z",
"input": {
"url": "https://www.facebook.com/gagadaily/",
"num_of_posts": 10,
"start_date": "10-20-2024",
"end_date": "10-20-2024"
},
"warning": "posts for the specified period were not found",
"warning_code": "dead_page"
},
{
"timestamp": "2024-11-05T06:36:23.938Z",
"input": {
"url": "https://www.facebook.com/apple/",
"num_of_posts": 10,
"start_date": "10-20-2024",
"end_date": "11-01-2024"
},
"error": "Crawler error: Timed out waiting for graphql response",
"error_code": "timeout"
},
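Failed entries like the two above are mixed in with the successful records, so it helps to separate them before analysis. The sketch below assumes, based only on these two samples, that a failed entry carries either an "error" or a "warning" key plus a matching "error_code" or "warning_code"; split_records is a hypothetical helper.

```python
def split_records(records):
    """Separate usable records from ones that carry an error or warning."""
    ok, failed = [], []
    for record in records:
        if "error" in record or "warning" in record:
            reason = record.get("error_code") or record.get("warning_code")
            # Keep the requested URL and the reason code for a retry list.
            failed.append((record["input"]["url"], reason))
        else:
            ok.append(record)
    return ok, failed
```

The failed list can then be used to retry those URLs, for example with a wider date range.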
Because there is too much data to show, here is just one of the records:
{
"url": "https://www.facebook.com/LeBron/videos/7922013201234317/",
"post_id": "1112318133592414",
"user_url": "https://www.facebook.com/LeBron",
"user_username_raw": "LeBron James",
"content": "What are we even talking about here?? When I think about my kids and my family and how they will grow up, the choice is clear to me. VOTE KAMALA HARRIS!!!",
"date_posted": "2024-10-31T21:28:41.000Z",
"num_comments": 1983,
"num_shares": 4085,
"num_likes_type": {
"type": "Like",
"num": 2556
},
"page_name": "LeBron James",
"profile_id": "100044427126625",
"page_intro": "The Official LeBron James Facebook page.\n\nwww.lebronjames.com",
"page_category": "Athlete",
"page_logo": "https://scontent.fmnl17-3.fna.fbcdn.net/v/t39.30808-1/461936413_1091563265667901_6592324197866706840_n.jpg?stp=dst-jpg_s200x200&_nc_cat=1&ccb=1-7&_nc_sid=f4b9fd&_nc_ohc=qTe8zYXlYsQQ7kNvgHBfFD2&_nc_zt=24&_nc_ht=scontent.fmnl17-3.fna&_nc_gid=AYpf7yucZIySMKrlXBSh-pJ&oh=00_AYAZuaCma8ReH0PhBPf2K46WnXGbnxsc6N4OEP1crs2mkA&oe=672F87E7",
"page_followers": 27000000,
"page_is_verified": true,
"attachments": [
{
"id": "7922013201234317",
"type": "Video",
"url": "https://scontent.fmnl17-6.fna.fbcdn.net/v/t15.5256-10/465066739_890906873146323_7371909864090599845_n.jpg?stp=dst-jpg_p296x100&_nc_cat=109&ccb=1-7&_nc_sid=7965db&_nc_ohc=GAT9utKXJdoQ7kNvgEDaPy4&_nc_zt=23&_nc_ht=scontent.fmnl17-6.fna&_nc_gid=Ab94zEj6O3ME80PjpwtPl_C&oh=00_AYCYKzhNEZ6FLxoQoEKI1uQgrhK58t6sh4iGrC5mOq_skA&oe=672F7951",
"video_length": "75108",
"attachment_url": "https://www.facebook.com/LeBron/videos/7922013201234317/",
"video_url": "https://video.fmnl17-3.fna.fbcdn.net/o1/v/t2/f2/m69/AQM4uas0Hm2iFEVJe8Z0ww2is_mZJJlW2zUYYO3FOi_88_3uUPuhZuDPQvFUcK4xVKwBhM-vKp2fFCDt7l-s78hX.mp4?efg=eyJ4cHZfYXNzZXRfaWQiOjEyNzAzNTIyNDM5OTUwMTcsInZlbmNvZGVfdGFnIjoieHB2X3Byb2dyZXNzaXZlLkZBQ0VCT09LLi5DM2UuNzIwLmRhc2hfaDI2NC1iYXNpYy1nZW4yXzcyMHAifQ&_nc_ht=video.fmnl17-3.fna.fbcdn.net&_nc_cat=104&strext=1&vs=45419d027a7075ba&_nc_vs=HBksFQIYOnBhc3N0aHJvdWdoX2V2ZXJzdG9yZS9HTHB0dHh1QU9UUkZYbnNFQVBZOXdWVEtVQlZUYm1kakFBQUYVAALIAQAVAhg6cGFzc3Rocm91Z2hfZXZlcnN0b3JlL0dFaFp1UnNHUkJid01zWU5BQmRpRDZZdjhHby1ickZxQUFBRhUCAsgBACgAGAAbAogHdXNlX29pbAExEnByb2dyZXNzaXZlX3JlY2lwZQExFQAAJpLG8eOd2MEEFQIoA0MzZSwXQFLG6XjU_fQYGWRhc2hfaDI2NC1iYXNpYy1nZW4yXzcyMHARAHUCAA&ccb=9-4&oh=00_AYBtuf70c0Pv2GUxzxMa5xQg403E4P1OzWYe-T_iE758ZA&oe=672BAE2B&_nc_sid=1d576d"
}
],
"post_external_image": null,
"page_url": "https://www.facebook.com/LeBron",
"header_image": "https://scontent.fmnl17-1.fna.fbcdn.net/v/t1.6435-9/139267227_247937373363832_6589163605052708194_n.jpg?stp=dst-jpg_s960x960&_nc_cat=100&ccb=1-7&_nc_sid=cc71e4&_nc_ohc=jxGtOqQH7PQQ7kNvgElz9kR&_nc_zt=23&_nc_ht=scontent.fmnl17-1.fna&_nc_gid=AYpf7yucZIySMKrlXBSh-pJ&oh=00_AYBH8GeOiJeU3E69PAzYJEIL2b5YCczNFLKfNzBdzuH2aA&oe=6751412E",
"avatar_image_url": "https://scontent.fmnl17-3.fna.fbcdn.net/v/t39.30808-1/461936413_1091563265667901_6592324197866706840_n.jpg?stp=dst-jpg_s200x200&_nc_cat=1&ccb=1-7&_nc_sid=f4b9fd&_nc_ohc=qTe8zYXlYsQQ7kNvgHBfFD2&_nc_zt=24&_nc_ht=scontent.fmnl17-3.fna&_nc_gid=AYpf7yucZIySMKrlXBSh-pJ&oh=00_AYAZuaCma8ReH0PhBPf2K46WnXGbnxsc6N4OEP1crs2mkA&oe=672F87E7",
"profile_handle": "LeBron",
"is_sponsored": false,
"shortcode": "1112318133592414",
"video_view_count": 55668,
"likes": 2556,
"post_type": "Post",
"following": 114,
"link_description_text": null,
"timestamp": "2024-11-05T06:31:43.816Z",
"input": {
"url": "https://www.facebook.com/LeBron/",
"num_of_posts": 10,
"posts_to_not_include": "",
"start_date": "10-20-2024",
"end_date": "11-01-2024"
}
},
The post behind this crawled record can be seen on the home page of his Facebook account.
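Each record carries many fields, most of which are long media URLs. A minimal sketch of reducing a record to the handful of fields useful for analysis follows, using only keys that appear in the excerpt above; summarize_post is a hypothetical helper, and the timestamp format is assumed from that one sample.

```python
from datetime import datetime


def summarize_post(record):
    """Reduce one crawled record to a few fields of interest."""
    # Timestamp format assumed from the sample: 2024-10-31T21:28:41.000Z
    posted = datetime.strptime(record["date_posted"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "page": record["page_name"],
        "posted": posted.date().isoformat(),
        "likes": record["likes"],
        "comments": record["num_comments"],
        "shares": record["num_shares"],
        "url": record["url"],
    }
```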
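Under the Management APIs menu there is Get snapshots list, which requires a Dataset ID and a Status (ready, running, failed). Use this API to retrieve the list of data snapshots, that is, the saved versions of the collected data; each one has a status of "ready", "running", or "failed" to indicate its processing stage.
Copy the code on the right and run it in the terminal, remembering to replace the token.
This returns the following data, which is my snapshot list: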
[
{
"id": "s_m33ruu64vapj5x5e",
"dataset_id": "gd_lkaxegm826bjpoo9m5",
"status": "ready",
"dataset_size": 1110,
"created": "2024-11-05T01:29:04.060Z"
},
{
"id": "s_m33rva5t1901k40t9f",
"dataset_id": "gd_lkaxegm826bjpoo9m5",
"status": "ready",
"dataset_size": 1358,
"created": "2024-11-05T01:29:24.785Z"
},
{
"id": "s_m33vhh4y1sqjtfgmws",
"dataset_id": "gd_lkaxegm826bjpoo9m5",
"status": "ready",
"dataset_size": 1683,
"created": "2024-11-05T03:10:39.106Z"
},
{
"id": "s_m341tbg4lwht5mr2e",
"dataset_id": "gd_lkaxegm826bjpoo9m5",
"status": "ready",
"dataset_size": 11,
"created": "2024-11-05T06:07:49.300Z"
},
{
"id": "s_m342n89p1h56iw97em",
"dataset_id": "gd_lkaxegm826bjpoo9m5",
"status": "ready",
"dataset_size": 9,
"created": "2024-11-05T06:31:04.861Z"
}
]
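A list like this is easy to work with programmatically. The sketch below builds the query string from the two parameters named above and picks the newest ready snapshot. The endpoint URL is an assumption, so copy the real one from the command in the dashboard, and the function names are hypothetical.

```python
import urllib.parse

# Assumed endpoint; copy the real URL from the command shown in the dashboard.
SNAPSHOTS_URL = "https://api.brightdata.com/datasets/v3/snapshots"


def snapshots_url(dataset_id, status="ready"):
    """Build the list URL from the Dataset ID and Status parameters."""
    query = urllib.parse.urlencode({"dataset_id": dataset_id, "status": status})
    return f"{SNAPSHOTS_URL}?{query}"


def latest_ready(snapshots):
    """Return the most recently created snapshot whose status is ready, or None."""
    ready = [s for s in snapshots if s["status"] == "ready"]
    # UTC timestamps in this fixed ISO 8601 format sort correctly as plain strings.
    return max(ready, key=lambda s: s["created"], default=None)


def total_size(snapshots):
    """Sum dataset_size across all listed snapshots."""
    return sum(s["dataset_size"] for s in snapshots)
```

For the list shown above, latest_ready returns the snapshot s_m342n89p1h56iw97em, the one created in this walkthrough.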
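Use this API to check the status of your data collection. Enter the snapshot ID returned in the response of the "Trigger data collection API". It returns "running" while the data is being collected and "ready" once the data is available.
Run the command on the right, remembering to replace the token.
The output shows that this snapshot is already in the ready state.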
{"status":"ready","snapshot_id":"s_m33rva5t1901k40t9f","dataset_id":"gd_lkaxegm826bjpoo9m5","error_codes":{"timeout":1},"records":1358,"errors":1,"collection_duration":2170955}
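The status response also reports record and error counts, which can be turned into a quick health check. This is a minimal sketch using only the keys visible in the output above; summarize_progress is a hypothetical helper, and it leaves collection_duration alone because this article does not state its unit.

```python
import json


def summarize_progress(body):
    """Turn the status response text into a small summary dictionary."""
    info = json.loads(body)
    records = info.get("records", 0)
    errors = info.get("errors", 0)
    return {
        "ready": info["status"] == "ready",
        "records": records,
        "errors": errors,
        # Errors relative to the record count; 0.0 when nothing was collected.
        "errors_per_record": errors / records if records else 0.0,
        "error_codes": info.get("error_codes", {}),
    }
```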
The API logs menu shows the data collection status for the current snapshot id; here it indicates that the data is still being crawled.
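Bright Data's Web Scraper API covers scraping for a wide range of use cases, with no need to develop and maintain your own web scraper. A single API call is enough to extract large amounts of web data.
In the example in this article, Bright Data's Web Scraper API proved very efficient. It supports custom configuration, so you can choose the pages, data volume, and date range to scrape, which makes it a good fit for different data requirements. It can also be integrated easily into existing data processing or analysis pipelines. For developers in particular, a small amount of code is usually enough to call the API and process the data. During scraping, throughput was high, no personal information was leaked, and the cost was modest and easy to manage.
Overall, the Web Scraper API offers powerful and flexible data acquisition, and I strongly recommend it.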
This article is reposted from the WeChat official account @前端Coldplay