
A Translated Database of Every Article in the Journal of Rural Studies
(3,007 articles in total)
With some light formatting afterwards, it reads comfortably:
① Preview titles, keywords, etc. → ② Find the papers you need → ③ Click the link → ④ Read the article closely
Screening Literature Quickly: Chinese vs. English
To find a paper relevant to your own research, you should first screen broadly, identify the articles that fit your topic, and only then read them closely to study their arguments, methods, and so on. Compared with English, we read Chinese faster and can use key information more effectively to filter out the papers we need, avoiding the situation where you spend ages finishing an article only to find it has little to do with your research.
Hence the idea: if English papers could be batch-translated automatically into a database, it would not only make reading easier but also let you absorb more information in the same amount of time, widening the breadth of literature screening, with no network latency to interrupt your train of thought. I experimented with this at home over the winter break and the results were decent; the implementation is described below.
路徑 & 框架
? ? 要想實現這個想法,總體操作流程應該有兩大塊。一是利用 Python 爬取相關的數據,二是調用 百度翻譯API 接口進行自動翻譯。詳細流程整理如下圖:
源碼 & 實現
本次使用 Journal of Rural Studies期刊作為測試,具體的網址如下,任務就是爬取該期刊從創刊以來到現在所有的文章信息。
https://www.journals.elsevier.com/journal-of-rural-studies/
# Import libraries
import requests as re   # note: the requests library is aliased as re throughout this post
from lxml import etree
import pandas as pd
import time

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
First test a single page to check the XPath parsing results.
url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url, headers=headers).text
res = etree.HTML(res)
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata
It turned out that parsing the first-level page yielded only a single lower-level link, whereas in principle it should contain all of the page's links. After several attempts it became clear that, in the page's design, the first lower-level link is loaded with a GET request while the remaining ones are loaded via POST, and the list spans 2 pages in total. For convenience, it is easier to expand all the links in the browser, save the pages as HTML files, and then load those files instead.
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")

LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)

TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
TLINKS now holds the links to all first-level (issue) pages; checking its length gives 158 entries, so the data were collected correctly. Next, collect all the second-level links. This is a good time to watch a livestream or something, since requests to the overseas site are a bit slow. When it finishes there are 3,007 second-level links, i.e. 3,007 articles.
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')

LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
With the second-level links in hand, the next step is to analyse the structure of the third-level (article) pages, filter out the fields we need, and store them as a dictionary per article.
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")   # timu = title
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "finished; overall progress:", (LINKS.index(LINK)+1)/len(LINKS))

df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')
With that, the scraping is done and we have a DataFrame containing the information for every article.
Next, strip the extra characters from the data and split the fields that were merged during scraping, producing a DataFrame ready for translation.
# Remove the extra characters (stray brackets and quotes left over from the scraped lists)
data = df.copy()
for col in ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))
# Split the merged information
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
Once we have a DataFrame with the full article information, we call the Baidu Translate API for batch translation. It is worth reading the official documentation; the required request parameters are described there in detail, and a worked example of the sign field follows the table below.
https://api.fanyi.baidu.com/doc/21
| Field | Type | Required | Description | Notes |
| --- | --- | --- | --- | --- |
| q | TEXT | Y | Text to translate (query) | UTF-8 encoded |
| from | TEXT | Y | Source language | zh Chinese, en English |
| to | TEXT | Y | Target language | zh Chinese, en English |
| salt | TEXT | Y | Random number | |
| appid | TEXT | Y | APP ID | Apply for your own |
| sign | TEXT | Y | Signature | MD5 of appid + q + salt + secret key |
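As a worked illustration of the sign field described in the table, here is a minimal sketch of the signature computation; the appid, secret key and query text below are placeholders, not real credentials:
import hashlib
import random

appid = 'YOUR_APPID'            # placeholder
secret_key = 'YOUR_SECRET_KEY'  # placeholder
q = 'Rural revitalization and land use change'
salt = str(random.randint(32768, 65536))
# sign = MD5(appid + q + salt + secret key), hex-encoded
sign = hashlib.md5((appid + q + salt + secret_key).encode('utf-8')).hexdigest()
print(sign)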
# Import the relevant libraries
import http.client
import hashlib
import urllib
import random
import json
import requests as re

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # The JSON response contains a trans_result list; each item holds the
        # source text (src) and its translation (dst)
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
After building the function, a quick test shows it returns the translation correctly; when the input parameter is empty, the response has no trans_result field, so the except branch simply prints 'trans_result' (the resulting KeyError).
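For instance, a quick spot check might look like this; the sample sentence is arbitrary, and valid appid/secret values are needed for the call to succeed:
# Smoke test: translate one English sentence into Chinese
print(translateBaidu('Rural depopulation reshapes local land use.'))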
Everything is in place; all that remains is to run the scraped article data through translateBaidu and build the new DataFrame columns.
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and assign
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the documentation, at most 10 requests per second are allowed
    time.sleep(0.5)
print('ALL FINISHED')
Let's take a look at the translation results.
Finally, the data are written into a database through an ODBC connection. With the data saved, running the script before bed every once in a while keeps the literature database up to date. The same recipe can be applied to every journal you read regularly, making it easy to keep an eye on new literature...
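The database step itself is not shown in the post; as a minimal sketch of one possible approach, using SQLAlchemy with pandas rather than a raw ODBC call (the connection string and table name are placeholders):
from sqlalchemy import create_engine

# Placeholder connection string: adjust the driver, credentials and database name to your setup
engine = create_engine('mysql+pymysql://user:password@localhost:3306/literature')
data.to_sql('jrs_articles', engine, if_exists='replace', index=False)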
Machine Translation vs. Human Translation
After the translation finished, I was still slightly worried about the quality of Baidu's machine translation (the Google API is a bit of a hassle), so I randomly sampled a few records to check the quality, as sketched below. Emmm... having skimmed them, they honestly read better than my own translations (tongue firmly in cheek)...
[Accuracy: keyword translations > title translations > abstracts > highlights]
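A spot check like that can be done directly on the DataFrame, for example (column names follow the script above; the sample size is arbitrary):
# Randomly sample a few records and compare the original and translated fields side by side
data.sample(5)[['timu', 'trans-timu', 'keywords', 'trans-keywords']]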
A rough read-through shows no real problems: the gist comes across clearly and comprehension is not affected. The complete script is collected below.
# Import the required libraries
import requests as re   # the requests library is aliased as re throughout
from lxml import etree
import pandas as pd
import time
import http.client
import hashlib
import urllib
import random
import json

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
# Get the first-level (issue) page links from the saved HTML files
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
# Get the second-level (article) page links
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
# Scrape the data from the third-level (article) pages
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    # Organise the fields into a dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "finished; overall progress:", (LINKS.index(LINK)+1)/len(LINKS))

# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')
# Initial data cleaning
data = df.copy()
for col in ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # The JSON response contains a trans_result list of {src, dst} items
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and assign
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the documentation, at most 10 requests per second are allowed
    time.sleep(0.5)
print('ALL FINISHED')

# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')
This article is reposted from the WeChat public account @OCD Planners.