
Malware Family Classification with Deep Learning on API Sequences
Dynamic feature extraction:
Unlike static features, dynamic features are costlier to extract because the sample must actually be executed. They typically include:
– API call relations: a fairly direct feature; which APIs a sample invokes indicates the corresponding functionality (a minimal extraction sketch follows this list)
– Control flow graph: widely used in software engineering; for machine learning it is embedded as a vector and fed to a classifier
– Data flow graph: likewise common in software engineering, and likewise embedded as a vector for classification
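For the API-call feature, the sequence is usually pulled out of the sandbox's behavior report. The snippet below is only a minimal sketch that assumes a CAPE/Cuckoo-style JSON report with calls stored under behavior → processes → calls; the exact keys vary across CAPE versions, so treat them as assumptions to adjust.

# Minimal sketch: extract an API call sequence from a CAPE/Cuckoo-style report.
# The layout (behavior -> processes -> calls -> api) is an assumption here;
# adjust the keys to whatever your CAPE version actually emits.
import json

def extract_api_sequence(report_path):
    with open(report_path, encoding='utf-8') as f:
        report = json.load(f)
    apis = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            if "api" in call:
                apis.append(call["api"])
    return ";".join(apis)  # same semicolon-separated format used below

#print(extract_api_sequence("report.json"))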
Earlier articles in this series described in detail how to extract static and dynamic malware features, including API sequences. Next we build deep learning models that learn from the API sequences to classify samples. The basic pipeline is as follows:
The dataset contains samples from 5 malware families; for each sample, the dynamic API sequence was successfully extracted with the CAPE tool introduced earlier. The class distribution is shown below. (Readers are encouraged to extract features from their own datasets, e.g., BIG2015 or BODMAS.)
Malware family | Class | Samples | Train | Test
---|---|---|---|---
AAAA | class1 | 352 | 242 | 110
BBBB | class2 | 335 | 235 | 100
CCCC | class3 | 363 | 243 | 120
DDDD | class4 | 293 | 163 | 130
EEEE | class5 | 548 | 358 | 190
The dataset is split into training and test sets, as shown in the figure below:
Each record has four fields: the index (no), the malware family label (type), the MD5 hash, and the API sequence or feature string.
Note that feature extraction involves a good deal of preprocessing and cleaning, which readers should adapt to their own needs. The following code, for example, filters out records whose extracted feature is empty.
#coding:utf-8
#By:Eastmount CSDN 2023-05-31
import csv
import re
import os

csv.field_size_limit(500 * 1024 * 1024)

filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"
fw = open(writename, mode="w", newline="")
writer = csv.writer(fw)
writer.writerow(['no', 'type', 'md5', 'api'])
with open(filename, encoding='utf-8') as fr:
    reader = csv.reader(fr)
    no = 1
    for row in reader:  # ['no','type','md5','api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        #print(no, tt, md5, api)
        # Skip rows whose API field is empty (or a stray header row)
        if api == "" or api == "api":
            continue
        else:
            writer.writerow([str(no), tt, md5, api])
            no += 1
fw.close()
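A small design note: the api == "api" test also drops stray header rows left over when several exported CSVs are concatenated, and the row counter only advances for rows actually written, so the no column stays contiguous in the cleaned file.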
The basic steps of this model are as follows:
Step 1: Read the data
Step 2: Encode labels with OneHotEncoder()
Step 3: Encode the API tokens with Tokenizer
Step 4: Build and train the CNN model
Step 5: Predict and evaluate
Step 6: Validate the algorithm
The model structure is shown in the figure below:
The complete code is as follows:
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

"""
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
"""

start = time.clock()  # removed in Python 3.8+; use time.perf_counter() there

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")

# Force string dtype if needed, otherwise empty text raises
# AttributeError: 'float' object has no attribute 'lower'
# train_df.SentimentText = train_df.SentimentText.astype(str)
print(train_df.head())

# Configure Chinese font rendering in matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly
#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (fields: no apt md5 api)
train_y = train_df.apt
print("Label:")
print(train_y[:10])
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_
print(Labname)

# One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])
#-------------------------------Step 3: Encode the API tokens with Tokenizer-------------------------------
# After creating a Tokenizer, fit_on_texts() splits each text on whitespace
# (the semicolons in the API strings are stripped by the default filters) and
# assigns every token an integer index by frequency: the more frequent the
# token, the smaller its index.
max_words = 1000
max_len = 200
tok = Tokenizer(num_words=max_words)  # keep at most the 1000 most frequent tokens
print(train_df.api[:5])
print(type(train_df.api))

# Extract the api column as token strings
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
# saving
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# word_index maps each token to its integer index
# word_counts maps each token to its frequency
for ii, iterm in enumerate(tok.word_index.items()):
    if ii < 10:
        print(iterm)
    else:
        break
print("===================")
for ii, iterm in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(iterm)
    else:
        break

# texts_to_sequences() turns each text into a sequence of token indices, and
# sequence.pad_sequences() pads/truncates them to a common length, so every
# sample becomes a fixed-size vector.
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print(train_seq_mat.shape)  # (1241, 200)
print(val_seq_mat.shape)    # (459, 200)
print(test_seq_mat.shape)   # (650, 200)
print(train_seq_mat[:2])
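# Note: pad_sequences() defaults to padding='pre' and truncating='pre', so a
# sequence longer than max_len keeps only its LAST max_len API calls, while a
# shorter one is zero-padded on the left. Pass truncating='post' if the head
# of the sequence matters more for your data.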
#-------------------------------Step 4: Build and train the CNN model-------------------------------
num_labels = 5
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
# Embedding layer (frozen; note that no pretrained vectors are actually loaded
# here, so with trainable=False these weights stay at random initialization)
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=False)(inputs)
# Convolution with window size 3
cnn = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn = MaxPool1D(pool_size=3)(cnn)
flat = Flatten()(cnn)
drop = Dropout(0.4)(flat)
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',  # RMSprop() is an alternative
              metrics=["accuracy"])

# Switch between training and prediction so the model is not retrained every run
flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.001)]  # stop once val_loss stops improving
                          )
    # Save the model
    model.save('cnn_model.h5')
    del model  # deletes the existing model
    # Elapsed time
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
    model = load_model('cnn_model.h5')  # reload for the evaluation below
else:
    print("Model prediction")
    # Load the trained model
    model = load_model('cnn_model.h5')
#--------------------------------------Step 5: Predict and evaluate--------------------------------
# Predict on the test set
test_pre = model.predict(test_seq_mat)

# Evaluate predictions: confusion matrix and classification report
confm = metrics.confusion_matrix(np.argmax(test_y, axis=1),
                                 np.argmax(test_pre, axis=1))
print(confm)
print(metrics.classification_report(np.argmax(test_y, axis=1),
                                    np.argmax(test_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                         np.argmax(test_pre, axis=1)))

# Store the results
f1 = open("cnn_test_pre.txt", "w")
for n in np.argmax(test_pre, axis=1):
    f1.write(str(n) + "\n")
f1.close()
f2 = open("cnn_test_y.txt", "w")
for n in np.argmax(test_y, axis=1):
    f2.write(str(n) + "\n")
f2.close()

# Plot the confusion matrix
plt.figure(figsize=(8, 8))
sns.heatmap(confm.T, square=True, annot=True,
            fmt='d', cbar=False, linewidths=.6,
            cmap="YlGnBu")
plt.xlabel('True label', size=14)
plt.ylabel('Predicted label', size=14)
plt.xticks(np.arange(5) + 0.5, Labname, size=12)
plt.yticks(np.arange(5) + 0.5, Labname, size=12)
plt.savefig('cnn_result.png')
plt.show()

#--------------------------------------Step 6: Validate the algorithm--------------------------------
# Re-preprocess the validation set with the saved tok
val_seq = tok.texts_to_sequences(val_content)
# Pad every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
# Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1),
                                    np.argmax(val_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                         np.argmax(val_pre, axis=1)))
# Elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
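As a side note, here is a minimal sketch of how the saved tok.pickle and cnn_model.h5 could be reused to classify a single new API sequence. The sample string is hypothetical; the semicolons need no special handling because Tokenizer's default filters strip them during tokenization.

# Minimal sketch: score one new API sequence with the saved artifacts.
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing import sequence

with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)
model = load_model('cnn_model.h5')

api_seq = "NtOpenFile;NtCreateSection;NtMapViewOfSection"  # hypothetical sample
seq = tok.texts_to_sequences([api_seq])        # default filters strip the ';' separators
mat = sequence.pad_sequences(seq, maxlen=200)  # must match max_len used in training
print(np.argmax(model.predict(mat), axis=1))   # predicted family index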
The final run results and generated files are shown in the figure below:
The intermediate output is as follows:
no ... api
0 1 ... GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1 2 ... GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2 3 ... NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3 4 ... NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4 5 ... NtOpenFile;NtCreateSection;NtMapViewOfSection;...
[5 rows x 4 columns]
Label:
0 class1
1 class1
2 class1
3 class1
4 class1
5 class1
6 class1
7 class1
8 class1
9 class1
Name: apt, dtype: object
LabelEncoder
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]]
1241
['class1' 'class2' 'class3' 'class4' 'class5']
OneHotEncoder:
[[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]]
0 GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1 GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2 NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3 NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4 NtOpenFile;NtCreateSection;NtMapViewOfSection;...
Name: api, dtype: object
<class 'pandas.core.series.Series'>
<keras_preprocessing.text.Tokenizer object at 0x0000028E55D36B08>
('regqueryvalueexw', 1)
('ntclose', 2)
('ldrgetprocedureaddress', 3)
('regopenkeyexw', 4)
('regclosekey', 5)
('ntallocatevirtualmemory', 6)
('sendmessagew', 7)
('ntwritefile', 8)
('process32nextw', 9)
('ntdeviceiocontrolfile', 10)
===================
('getsysteminfo', 2651)
('heapcreate', 2996)
('ntallocatevirtualmemory', 115547)
('ntqueryvaluekey', 24120)
('getsystemtimeasfiletime', 52727)
('ldrgetdllhandle', 25135)
('ldrgetprocedureaddress', 199952)
('memcpy', 9008)
('setunhandledexceptionfilter', 1504)
('ntcreatefile', 43260)
(1241, 200)
(459, 200)
(650, 200)
[[ 3 135 3 3 2 21 3 3 4 3 96 3 3 4 96 4 96 20
22 20 3 6 6 23 128 129 3 103 23 56 2 103 23 20 3 23
3 3 3 3 4 1 5 23 12 131 12 20 3 10 2 10 2 20
3 4 5 27 3 10 2 6 10 2 3 10 2 10 2 3 10 2
10 2 10 2 10 2 10 2 3 10 2 10 2 10 2 10 2 3
3 3 36 4 3 23 20 3 5 207 34 6 6 6 11 11 6 11
6 6 6 6 6 6 6 6 6 11 6 6 11 6 11 6 11 6
6 11 6 34 3 141 3 140 3 3 141 34 6 2 21 4 96 4
96 4 96 23 3 3 12 131 12 10 2 10 2 4 5 27 10 2
6 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10
2 10 2 10 2 10 2 36 4 23 5 207 6 3 3 12 131 12
132 3]
[ 27 4 27 4 27 4 27 4 27 27 5 27 4 27 4 27 27 27
27 27 27 27 5 27 4 27 4 27 4 27 4 27 4 27 4 27
4 27 4 27 4 27 5 52 2 21 4 5 1 1 1 5 21 25
2 52 12 33 51 28 34 30 2 52 2 21 4 5 27 5 52 6
6 52 4 1 5 4 52 54 7 7 20 52 7 52 7 7 6 4
4 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 5
5 3 7 50 50 50 95 50 50 50 50 50 4 1 5 4 3 3
3 3 3 7 7 7 3 7 3 7 3 60 3 3 7 7 7 7
60 3 7 7 7 7 7 7 7 7 52 20 3 3 3 14 14 60
18 19 18 19 2 21 4 5 18 19 18 19 18 19 18 19 7 7
7 7 7 7 7 7 7 7 7 52 7 7 7 7 7 60 7 7
7 7]]
The training process is as follows:
Model training
Epoch 1/15
1/20 [>.............................] - ETA: 5s - loss: 1.5986 - accuracy: 0.2656
2/20 [==>...........................] - ETA: 1s - loss: 1.6050 - accuracy: 0.2266
3/20 [===>..........................] - ETA: 1s - loss: 1.5777 - accuracy: 0.2292
4/20 [=====>........................] - ETA: 2s - loss: 1.5701 - accuracy: 0.2500
5/20 [======>.......................] - ETA: 2s - loss: 1.5628 - accuracy: 0.2719
6/20 [========>.....................] - ETA: 3s - loss: 1.5439 - accuracy: 0.3125
7/20 [=========>....................] - ETA: 3s - loss: 1.5306 - accuracy: 0.3348
8/20 [===========>..................] - ETA: 3s - loss: 1.5162 - accuracy: 0.3535
9/20 [============>.................] - ETA: 3s - loss: 1.5020 - accuracy: 0.3698
10/20 [==============>...............] - ETA: 3s - loss: 1.4827 - accuracy: 0.3969
11/20 [===============>..............] - ETA: 3s - loss: 1.4759 - accuracy: 0.4020
12/20 [=================>............] - ETA: 3s - loss: 1.4734 - accuracy: 0.4036
13/20 [==================>...........] - ETA: 3s - loss: 1.4456 - accuracy: 0.4255
14/20 [====================>.........] - ETA: 3s - loss: 1.4322 - accuracy: 0.4353
15/20 [=====================>........] - ETA: 2s - loss: 1.4157 - accuracy: 0.4469
16/20 [=======================>......] - ETA: 2s - loss: 1.4093 - accuracy: 0.4482
17/20 [========================>.....] - ETA: 2s - loss: 1.4010 - accuracy: 0.4531
18/20 [==========================>...] - ETA: 1s - loss: 1.3920 - accuracy: 0.4601
19/20 [===========================>..] - ETA: 0s - loss: 1.3841 - accuracy: 0.4638
20/20 [==============================] - ETA: 0s - loss: 1.3763 - accuracy: 0.4674
20/20 [==============================] - 20s 1s/step - loss: 1.3763 - accuracy: 0.4674 - val_loss: 1.3056 - val_accuracy: 0.4837
Time used: 26.1328806
{'loss': [1.3762551546096802], 'accuracy': [0.467365026473999],
'val_loss': [1.305567979812622], 'val_accuracy': [0.48366013169288635]}
The final prediction results are as follows:
Model prediction
[[ 40 14 11 1 44]
[ 16 57 10 0 17]
[ 6 30 61 0 23]
[ 12 20 15 47 36]
[ 11 14 19 0 146]]
precision recall f1-score support
0 0.4706 0.3636 0.4103 110
1 0.4222 0.5700 0.4851 100
2 0.5259 0.5083 0.5169 120
3 0.9792 0.3615 0.5281 130
4 0.5489 0.7684 0.6404 190
accuracy 0.5400 650
macro avg 0.5893 0.5144 0.5162 650
weighted avg 0.5980 0.5400 0.5323 650
accuracy 0.54
precision recall f1-score support
0 0.9086 0.4517 0.6034 352
1 0.5943 0.5888 0.5915 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0
accuracy 0.4837 459
macro avg 0.3006 0.2081 0.2390 459
weighted avg 0.8353 0.4837 0.6006 459
accuracy 0.48366013071895425
Time used: 14.170902800000002
Food for thought:
The overall prediction performance here is rather poor. Why might that be? Can it be improved by tuning hyperparameters, and how else could the algorithm be improved? This article only provides the basic ideas and code; further optimization and refinement are left for readers to work out on their own (two quick starting points are sketched below). Good luck!
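As a starting point, here is a sketch of two low-cost adjustments, written against the same variable names as the CNN script above; it shows only the lines you would change, not a tuned drop-in solution. First, the Embedding layer was frozen (trainable=False) without loading any pretrained vectors, so it never moves off its random initialization. Second, the five families are imbalanced, which class weights can partially offset.

# Sketch of two quick adjustments (same variables as the CNN script above).
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# 1) Let the embedding train instead of freezing random weights.
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=True)(inputs)

# 2) Weight the loss by inverse class frequency to soften the imbalance.
labels = np.argmax(train_y, axis=1)
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
          validation_data=(val_seq_mat, val_y),
          class_weight=dict(zip(classes, weights)))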
The basic steps of this model are as follows:
Step 1: Read the data
Step 2: Encode labels with OneHotEncoder()
Step 3: Encode the API tokens with Tokenizer
Step 4: Build and train the BiLSTM model
Step 5: Predict and evaluate
Step 6: Validate the algorithm
The model structure is shown in the figure below:
The complete code is as follows:
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()  # removed in Python 3.8+; use time.perf_counter() there

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering in matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (fields: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the API tokens with Tokenizer-------------------------------
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract the api column as token strings
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Turn each text into a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
#-------------------------------Step 4: Build and train the BiLSTM model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words + 1, 128, input_length=max_len))
#model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.1)))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)]
                          )
    # Save the model
    model.save('bilstm_model.h5')
    del model  # deletes the existing model
    # Elapsed time
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
    model = load_model('bilstm_model.h5')  # reload for the evaluation below
else:
    print("Model prediction")
    model = load_model('bilstm_model.h5')
#--------------------------------------Step 5: Predict and evaluate--------------------------------
# Predict on the test set
test_pre = model.predict(test_seq_mat)
confm = metrics.confusion_matrix(np.argmax(test_y, axis=1),
                                 np.argmax(test_pre, axis=1))
print(confm)
print(metrics.classification_report(np.argmax(test_y, axis=1),
                                    np.argmax(test_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                         np.argmax(test_pre, axis=1)))

# Store the results
f1 = open("bilstm_test_pre.txt", "w")
for n in np.argmax(test_pre, axis=1):
    f1.write(str(n) + "\n")
f1.close()
f2 = open("bilstm_test_y.txt", "w")
for n in np.argmax(test_y, axis=1):
    f2.write(str(n) + "\n")
f2.close()

# Plot the confusion matrix
plt.figure(figsize=(8, 8))
sns.heatmap(confm.T, square=True, annot=True,
            fmt='d', cbar=False, linewidths=.6,
            cmap="YlGnBu")
plt.xlabel('True label', size=14)
plt.ylabel('Predicted label', size=14)
plt.xticks(np.arange(5) + 0.5, Labname, size=12)
plt.yticks(np.arange(5) + 0.5, Labname, size=12)
plt.savefig('bilstm_result.png')
plt.show()

#--------------------------------------Step 6: Validate the algorithm--------------------------------
# Re-preprocess the validation set with the saved tok
val_seq = tok.texts_to_sequences(val_content)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
# Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1),
                                    np.argmax(val_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                         np.argmax(val_pre, axis=1)))
# Elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
The training output is as follows:
Model training
Epoch 1/15
1/20 [>.............................] - ETA: 40s - loss: 1.6114 - accuracy: 0.2031
2/20 [==>...........................] - ETA: 10s - loss: 1.6055 - accuracy: 0.2969
3/20 [===>..........................] - ETA: 10s - loss: 1.6015 - accuracy: 0.3281
4/20 [=====>........................] - ETA: 10s - loss: 1.5931 - accuracy: 0.3477
5/20 [======>.......................] - ETA: 10s - loss: 1.5914 - accuracy: 0.3469
6/20 [========>.....................] - ETA: 10s - loss: 1.5827 - accuracy: 0.3698
7/20 [=========>....................] - ETA: 10s - loss: 1.5785 - accuracy: 0.3884
8/20 [===========>..................] - ETA: 10s - loss: 1.5673 - accuracy: 0.4121
9/20 [============>.................] - ETA: 9s - loss: 1.5610 - accuracy: 0.4149
10/20 [==============>...............] - ETA: 9s - loss: 1.5457 - accuracy: 0.4187
11/20 [===============>..............] - ETA: 8s - loss: 1.5297 - accuracy: 0.4148
12/20 [=================>............] - ETA: 8s - loss: 1.5338 - accuracy: 0.4128
13/20 [==================>...........] - ETA: 7s - loss: 1.5214 - accuracy: 0.4279
14/20 [====================>.........] - ETA: 6s - loss: 1.5176 - accuracy: 0.4286
15/20 [=====================>........] - ETA: 5s - loss: 1.5100 - accuracy: 0.4271
16/20 [=======================>......] - ETA: 4s - loss: 1.5065 - accuracy: 0.4258
17/20 [========================>.....] - ETA: 3s - loss: 1.5021 - accuracy: 0.4237
18/20 [==========================>...] - ETA: 2s - loss: 1.4921 - accuracy: 0.4288
19/20 [===========================>..] - ETA: 1s - loss: 1.4822 - accuracy: 0.4334
20/20 [==============================] - ETA: 0s - loss: 1.4825 - accuracy: 0.4327
20/20 [==============================] - 33s 2s/step - loss: 1.4825 - accuracy: 0.4327 - val_loss: 1.4187 - val_accuracy: 0.4074
Time used: 38.565846900000004
{'loss': [1.4825222492218018], 'accuracy': [0.4327155649662018],
'val_loss': [1.4187402725219727], 'val_accuracy': [0.40740740299224854]}
The final prediction results are as follows:
Model prediction
[[36 18 37 1 18]
[14 46 34 0 6]
[ 8 29 73 0 10]
[16 29 14 45 26]
[47 15 33 0 95]]
precision recall f1-score support
0 0.2975 0.3273 0.3117 110
1 0.3358 0.4600 0.3882 100
2 0.3822 0.6083 0.4695 120
3 0.9783 0.3462 0.5114 130
4 0.6129 0.5000 0.5507 190
accuracy 0.4538 650
macro avg 0.5213 0.4484 0.4463 650
weighted avg 0.5474 0.4538 0.4624 650
accuracy 0.45384615384615384
precision recall f1-score support
0 0.9189 0.3864 0.5440 352
1 0.4766 0.4766 0.4766 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0
accuracy 0.4074 459
macro avg 0.2791 0.1726 0.2041 459
weighted avg 0.8158 0.4074 0.5283 459
accuracy 0.4074074074074074
Time used: 32.2772881
The basic steps of this model are as follows:
Step 1: Read the data
Step 2: Encode labels with OneHotEncoder()
Step 3: Encode the API tokens with Tokenizer
Step 4: Build and train the BiGRU model
Step 5: Predict and evaluate
Step 6: Validate the algorithm
The model structure is shown in the figure below:
The complete code is as follows:
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import GRU, LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()  # removed in Python 3.8+; use time.perf_counter() there

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering in matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (fields: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the API tokens with Tokenizer-------------------------------
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract the api column as token strings
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Turn each text into a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
#-------------------------------Step 4: Build and train the BiGRU model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words + 1, 256, input_length=max_len))
#model.add(Bidirectional(GRU(128, dropout=0.2, recurrent_dropout=0.1)))
model.add(Bidirectional(GRU(256)))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.005)]
                          )
    # Save the model
    model.save('gru_model.h5')
    del model  # deletes the existing model
    # Elapsed time
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
    model = load_model('gru_model.h5')  # reload for the evaluation below
else:
    print("Model prediction")
    model = load_model('gru_model.h5')
#--------------------------------------Step 5: Predict and evaluate--------------------------------
# Predict on the test set
test_pre = model.predict(test_seq_mat)
confm = metrics.confusion_matrix(np.argmax(test_y, axis=1),
                                 np.argmax(test_pre, axis=1))
print(confm)
print(metrics.classification_report(np.argmax(test_y, axis=1),
                                    np.argmax(test_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                         np.argmax(test_pre, axis=1)))

# Store the results
f1 = open("gru_test_pre.txt", "w")
for n in np.argmax(test_pre, axis=1):
    f1.write(str(n) + "\n")
f1.close()
f2 = open("gru_test_y.txt", "w")
for n in np.argmax(test_y, axis=1):
    f2.write(str(n) + "\n")
f2.close()

# Plot the confusion matrix
plt.figure(figsize=(8, 8))
sns.heatmap(confm.T, square=True, annot=True,
            fmt='d', cbar=False, linewidths=.6,
            cmap="YlGnBu")
plt.xlabel('True label', size=14)
plt.ylabel('Predicted label', size=14)
plt.xticks(np.arange(5) + 0.5, Labname, size=12)
plt.yticks(np.arange(5) + 0.5, Labname, size=12)
plt.savefig('gru_result.png')
plt.show()

#--------------------------------------Step 6: Validate the algorithm--------------------------------
# Re-preprocess the validation set with the saved tok
val_seq = tok.texts_to_sequences(val_content)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
# Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1),
                                    np.argmax(val_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                         np.argmax(val_pre, axis=1)))
# Elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
The training output is as follows:
Model training
Epoch 1/15
1/20 [>.............................] - ETA: 47s - loss: 1.6123 - accuracy: 0.1875
2/20 [==>...........................] - ETA: 18s - loss: 1.6025 - accuracy: 0.2656
3/20 [===>..........................] - ETA: 18s - loss: 1.5904 - accuracy: 0.3333
4/20 [=====>........................] - ETA: 18s - loss: 1.5728 - accuracy: 0.3867
5/20 [======>.......................] - ETA: 17s - loss: 1.5639 - accuracy: 0.4094
6/20 [========>.....................] - ETA: 17s - loss: 1.5488 - accuracy: 0.4375
7/20 [=========>....................] - ETA: 16s - loss: 1.5375 - accuracy: 0.4397
8/20 [===========>..................] - ETA: 16s - loss: 1.5232 - accuracy: 0.4434
9/20 [============>.................] - ETA: 15s - loss: 1.5102 - accuracy: 0.4358
10/20 [==============>...............] - ETA: 14s - loss: 1.5014 - accuracy: 0.4250
11/20 [===============>..............] - ETA: 13s - loss: 1.5053 - accuracy: 0.4233
12/20 [=================>............] - ETA: 12s - loss: 1.5022 - accuracy: 0.4232
13/20 [==================>...........] - ETA: 11s - loss: 1.4913 - accuracy: 0.4279
14/20 [====================>.........] - ETA: 9s - loss: 1.4912 - accuracy: 0.4286
15/20 [=====================>........] - ETA: 8s - loss: 1.4841 - accuracy: 0.4365
16/20 [=======================>......] - ETA: 7s - loss: 1.4720 - accuracy: 0.4404
17/20 [========================>.....] - ETA: 5s - loss: 1.4669 - accuracy: 0.4375
18/20 [==========================>...] - ETA: 3s - loss: 1.4636 - accuracy: 0.4349
19/20 [===========================>..] - ETA: 1s - loss: 1.4544 - accuracy: 0.4383
20/20 [==============================] - ETA: 0s - loss: 1.4509 - accuracy: 0.4400
20/20 [==============================] - 44s 2s/step - loss: 1.4509 - accuracy: 0.4400 - val_loss: 1.3812 - val_accuracy: 0.3660
Time used: 49.7057119
{'loss': [1.4508591890335083], 'accuracy': [0.4399677813053131],
'val_loss': [1.381193995475769], 'val_accuracy': [0.3660130798816681]}
The final prediction results are as follows:
Model prediction
[[ 30 8 9 17 46]
[ 13 50 9 13 15]
[ 10 4 58 29 19]
[ 11 8 8 73 30]
[ 25 3 23 14 125]]
precision recall f1-score support
0 0.3371 0.2727 0.3015 110
1 0.6849 0.5000 0.5780 100
2 0.5421 0.4833 0.5110 120
3 0.5000 0.5615 0.5290 130
4 0.5319 0.6579 0.5882 190
accuracy 0.5169 650
macro avg 0.5192 0.4951 0.5016 650
weighted avg 0.5180 0.5169 0.5120 650
accuracy 0.5169230769230769
precision recall f1-score support
0 0.8960 0.3182 0.4696 352
1 0.7273 0.5234 0.6087 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0
accuracy 0.3660 459
macro avg 0.3247 0.1683 0.2157 459
weighted avg 0.8567 0.3660 0.5020 459
accuracy 0.3660130718954248
Time used: 60.106339399999996
The basic steps of this model are as follows:
Step 1: Read the data
Step 2: Encode labels with OneHotEncoder()
Step 3: Encode the API tokens with Tokenizer
Step 4: Build the Attention mechanism
Step 5: Build and train the Attention+CNN+BiLSTM model
Step 6: Predict and evaluate
Step 7: Validate the algorithm
The model structure is as follows:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
inputs (InputLayer) [(None, 100)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 100, 256) 256256 inputs[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D) (None, 100, 256) 196864 embedding[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, 100, 256) 262400 embedding[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D) (None, 100, 256) 327936 embedding[0][0]
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 25, 256) 0 conv1d[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D) (None, 25, 256) 0 conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D) (None, 25, 256) 0 conv1d_2[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 25, 768) 0 max_pooling1d[0][0]
max_pooling1d_1[0][0]
max_pooling1d_2[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional) (None, 25, 256) 918528 concatenate[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 25, 128) 32896 bidirectional[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 25, 128) 0 dense[0][0]
__________________________________________________________________________________________________
attention_layer (AttentionLayer (None, 128) 6500 dropout[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 5) 645 attention_layer[0][0]
==================================================================================================
Total params: 2,002,025
Trainable params: 1,745,769
Non-trainable params: 256,256
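Reading the summary: each of the three parallel Conv1D branches keeps the 100-step resolution and is max-pooled down to 25 steps; the branches are concatenated into (None, 25, 768), compressed by the BiLSTM to (None, 25, 256) and a Dense layer to (None, 25, 128), and the attention layer collapses those 25 steps into a single 128-dimensional vector for the softmax. The 256,256 non-trainable parameters are exactly the frozen Embedding (1001 × 256).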
The complete code is as follows:
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, GRU, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()  # removed in Python 3.8+; use time.perf_counter() there

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering in matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (fields: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the API tokens with Tokenizer-------------------------------
max_words = 1000
max_len = 100
tok = Tokenizer(num_words=max_words)

# Extract the api column as token strings
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Turn each text into a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
#-------------------------------Step 4: Build the Attention mechanism-------------------------------
"""
Keras does not ship a ready-made Attention layer for this use case, so we
build a custom layer ourselves. A custom Keras layer has four parts:
- __init__: initialize the parameters the layer needs
- build: define the layer's weights
- call: the core part, defining how the tensors are combined
- compute_output_shape: define the layer's output shape
Recommended reading: https://blog.csdn.net/huanghaocs/article/details/95752379
Recommended reading: https://zhuanlan.zhihu.com/p/29201491
"""
# Hierarchical Model with Attention
from keras import initializers
from keras import constraints
from keras import activations
from keras import regularizers
from keras import backend as K
from keras.engine.topology import Layer

K.clear_session()

class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    # Workaround for: "The graph tensor has name: model/attention_layer/Reshape:0"
    # https://blog.csdn.net/weixin_54227557/article/details/129898614
    def call(self, inputs):
        #self.V = K.reshape(self.V, (-1, 1))
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        #score = K.softmax(K.dot(H, self.V), axis=1)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]
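# What call() computes, for input X of shape (batch, time_steps, hidden):
#   H = tanh(X . W + b)            -> (batch, time_steps, attention_size)
#   score = softmax(H . V, axis=1) -> (batch, time_steps, 1)
#   output = sum_t score_t * X_t   -> (batch, hidden)
# i.e. a learned weighted average over the time dimension.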
#-------------------------------Step 5: Build and train the Attention+CNN+BiLSTM model-------------------------------
# Build the TextCNN part
num_labels = 5
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=False)(inputs)
cnn1 = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn1 = MaxPool1D(pool_size=4)(cnn1)
cnn2 = Convolution1D(256, 4, padding='same', strides=1, activation='relu')(layer)
cnn2 = MaxPool1D(pool_size=4)(cnn2)
cnn3 = Convolution1D(256, 5, padding='same', strides=1, activation='relu')(layer)
cnn3 = MaxPool1D(pool_size=4)(cnn3)
# Concatenate the three branches' output vectors
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)

# BiLSTM + Attention
#bilstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(cnn)
bilstm = Bidirectional(LSTM(128, return_sequences=True))(cnn)  # return_sequences keeps the output 3-dimensional
layer = Dense(128, activation='relu')(bilstm)
layer = Dropout(0.3)(layer)
attention = AttentionLayer(attention_size=50)(layer)
output = Dense(num_labels, activation='softmax')(attention)
model = Model(inputs=inputs, outputs=output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "test"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0005)]
                          )
    # Save the model
    model.save('cnn_bilstm_model.h5')
    del model  # deletes the existing model
    # Elapsed time
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
    model = load_model('cnn_bilstm_model.h5',
                       custom_objects={'AttentionLayer': AttentionLayer},
                       compile=False)  # reload for the evaluation below
else:
    print("Model prediction")
    # custom_objects maps the saved layer name to the custom class
    model = load_model('cnn_bilstm_model.h5',
                       custom_objects={'AttentionLayer': AttentionLayer},
                       compile=False)
#--------------------------------------Step 6: Predict and evaluate--------------------------------
# Predict on the test set
test_pre = model.predict(test_seq_mat)
confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
print(confm)
print(metrics.classification_report(np.argmax(test_y, axis=1),
                                    np.argmax(test_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                         np.argmax(test_pre, axis=1)))

# Store the results
f1 = open("cnn_bilstm_test_pre.txt", "w")
for n in np.argmax(test_pre, axis=1):
    f1.write(str(n) + "\n")
f1.close()
f2 = open("cnn_bilstm_test_y.txt", "w")
for n in np.argmax(test_y, axis=1):
    f2.write(str(n) + "\n")
f2.close()

# Plot the confusion matrix
plt.figure(figsize=(8, 8))
sns.heatmap(confm.T, square=True, annot=True,
            fmt='d', cbar=False, linewidths=.6,
            cmap="YlGnBu")
plt.xlabel('True label', size=14)
plt.ylabel('Predicted label', size=14)
plt.xticks(np.arange(5) + 0.5, Labname, size=12)
plt.yticks(np.arange(5) + 0.5, Labname, size=12)
plt.savefig('cnn_bilstm_result.png')
plt.show()

#--------------------------------------Step 7: Validate the algorithm--------------------------------
# Re-preprocess the validation set with the saved tok and predict with the trained model
val_seq = tok.texts_to_sequences(val_content)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
# Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1),
                                    np.argmax(val_pre, axis=1),
                                    digits=4))
print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                         np.argmax(val_pre, axis=1)))
# Elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
The training output is as follows:
Model training
Epoch 1/15
1/10 [==>...........................] - ETA: 18s - loss: 1.6074 - accuracy: 0.2188
2/10 [=====>........................] - ETA: 2s - loss: 1.5996 - accuracy: 0.2383
3/10 [========>.....................] - ETA: 2s - loss: 1.5903 - accuracy: 0.2500
4/10 [===========>..................] - ETA: 2s - loss: 1.5665 - accuracy: 0.2793
5/10 [==============>...............] - ETA: 2s - loss: 1.5552 - accuracy: 0.2750
6/10 [=================>............] - ETA: 1s - loss: 1.5346 - accuracy: 0.2930
7/10 [====================>.........] - ETA: 1s - loss: 1.5229 - accuracy: 0.3103
8/10 [=======================>......] - ETA: 1s - loss: 1.5208 - accuracy: 0.3135
9/10 [==========================>...] - ETA: 0s - loss: 1.5132 - accuracy: 0.3281
10/10 [==============================] - ETA: 0s - loss: 1.5046 - accuracy: 0.3400
10/10 [==============================] - 9s 728ms/step - loss: 1.5046 - accuracy: 0.3400 - val_loss: 1.4659 - val_accuracy: 0.5599
Time used: 13.8141568
{'loss': [1.5045626163482666], 'accuracy': [0.34004834294319153],
'val_loss': [1.4658586978912354], 'val_accuracy': [0.5599128603935242]}
The final prediction results are as follows:
Model prediction
[[ 56 13 1 0 40]
[ 31 53 0 0 16]
[ 54 47 3 1 15]
[ 27 14 1 51 37]
[ 39 16 8 2 125]]
precision recall f1-score support
0 0.2705 0.5091 0.3533 110
1 0.3706 0.5300 0.4362 100
2 0.2308 0.0250 0.0451 120
3 0.9444 0.3923 0.5543 130
4 0.5365 0.6579 0.5910 190
accuracy 0.4431 650
macro avg 0.4706 0.4229 0.3960 650
weighted avg 0.4911 0.4431 0.4189 650
accuracy 0.4430769230769231
precision recall f1-score support
0 0.8571 0.5625 0.6792 352
1 0.6344 0.5514 0.5900 107
2 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0
accuracy 0.5599 459
macro avg 0.3729 0.2785 0.3173 459
weighted avg 0.8052 0.5599 0.6584 459
accuracy 0.5599128540305011
Time used: 23.0178675
This article is reposted from the WeChat public account @娜璋AI安全之家.