2. Dynamic Features

Compared with static features, dynamic features are more time-consuming to extract, because the code must actually be executed. They typically include:
– API call relations: a fairly obvious feature; which APIs are called reflects the corresponding functionality
– Control-flow graphs: commonly used in software engineering; for machine learning they are represented as vectors and then classified
– Data-flow graphs: commonly used in software engineering; for machine learning they are represented as vectors and then classified

Dynamic feature extraction methods:
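
As a minimal sketch of this step (not from the original post): the snippet below flattens a sandbox report into the semicolon-separated API string format used throughout this article. It assumes the common Cuckoo/CAPE report layout, where behavior -> processes -> calls -> api holds the API names; adjust the keys to your CAPE version, and note the file name is hypothetical.

# Hedged sketch: extract an API-call sequence from a CAPE JSON report.
# The key layout follows the common Cuckoo/CAPE report format and may
# need adjusting for your CAPE version.
import json

def extract_api_sequence(report_path):
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    apis = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            if "api" in call:
                apis.append(call["api"])
    # Join with ';' to match the api field format used in the CSVs below
    return ";".join(apis)

#print(extract_api_sequence("report.json"))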


II. CNN-Based Malware Family Detection

The earlier articles in this series described in detail how to extract static and dynamic features of malware, including API sequences. Next, we build deep learning models that learn from the API sequences to perform classification. The basic pipeline is as follows:

1. Dataset

The whole dataset contains samples from five malware families; for each sample, the dynamic API sequence was successfully extracted with the CAPE tool described earlier. The dataset distribution is shown below. (Readers are encouraged to extract features from their own datasets, such as BIG2015 or BODMAS.)

Malware Family   Category   Total   Training Set   Test Set
AAAA             class1     352     242            110
BBBB             class2     335     235            100
CCCC             class3     363     243            120
DDDD             class4     293     163            130
EEEE             class5     548     358            190

The dataset is split into a training set and a test set, as shown in the figure below:

Each record in the dataset contains four fields: the index, the malware family category, the MD5 hash, and the API sequence (the feature).
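
Once the per-family CSVs have been cleaned (see the filtering code below) and merged, the train/test split can be reproduced with a stratified sampler. A minimal sketch, assuming a merged file named all_result_final.csv (hypothetical) whose label column is named apt, as in the training code that follows; note that a proportional stratified split will not land exactly on the per-family counts in the table above:

# Hedged sketch: stratified train/test split of the merged, cleaned data.
# File names are hypothetical; the label column is 'apt' as used below.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("all_result_final.csv")  # columns: no, apt, md5, api
train_df, test_df = train_test_split(df, test_size=650,  # 650 mirrors the test set above
                                     stratify=df["apt"], random_state=42)
train_df.to_csv("train_dataset.csv", index=False)
test_df.to_csv("test_dataset.csv", index=False)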

Note that feature extraction involves a large amount of data preprocessing and cleaning, which readers should adapt to their own needs. For example, the following code filters out records whose extracted feature is empty.

#coding:utf-8
#By:Eastmount CSDN 2023-05-31
import csv
import re
import os

csv.field_size_limit(500 * 1024 * 1024)
filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"
fw = open(writename, mode="w", newline="")
writer = csv.writer(fw)
writer.writerow(['no', 'type', 'md5', 'api'])
with open(filename, encoding='utf-8') as fr:
    reader = csv.reader(fr)
    no = 1
    for row in reader:  # ['no','type','md5','api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        #print(no,tt,md5,api)
        # Filter out rows whose api field is empty (or a stray header row)
        if api == "" or api == "api":
            continue
        else:
            writer.writerow([str(no), tt, md5, api])
            no += 1
fw.close()
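
The same filtering pass can then be repeated for the remaining family CSVs (BBBB_result.csv through EEEE_result.csv, following the family naming above) before the cleaned files are merged and split into the training and test sets.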

2. Model Construction

The basic steps of this model are as follows:

The model structure is shown in the figure below:

The complete code is as follows:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

"""
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
"""

start = time.perf_counter()  # time.clock() was removed in Python 3.8

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")

# Force the text column to str if needed; otherwise empty texts raise
# AttributeError: 'float' object has no attribute 'lower'
# train_df.SentimentText = train_df.SentimentText.astype(str)
print(train_df.head())

# Configure matplotlib for Chinese labels
plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly in saved figures

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column of the dataset (columns: no apt md5 api)
train_y = train_df.apt
print("Label:")
print(train_y[:10])
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
Labname = le.classes_
print(Labname)

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#-------------------------------Step 3: Encode the tokens with Tokenizer-------------------------------
# After creating a Tokenizer object, its fit_on_texts() method splits the
# input and assigns an integer id to every token; ids are ordered by
# frequency, so the most frequent token gets the smallest id.
max_words = 1000
max_len = 200
tok = Tokenizer(num_words=max_words)  # keep at most the 1000 most frequent tokens
print(train_df.api[:5])
print(type(train_df.api))

# Extract the tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
# saving
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# word_index maps each token to its id
# word_counts maps each token to its frequency
for ii,iterm in enumerate(tok.word_index.items()):
    if ii < 10:
        print(iterm)
    else:
        break
print("===================")
for ii,iterm in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(iterm)
    else:
        break

# tok.texts_to_sequences() turns each text into a sequence of token ids
# sequence.pad_sequences() pads every sequence to the same length
# Once every token has an id, each sample's API sequence becomes a fixed-length vector
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)
print(train_seq_mat.shape)  #(1241, 200)
print(val_seq_mat.shape)    #(459, 200)
print(test_seq_mat.shape)   #(650, 200)
print(train_seq_mat[:2])

#-------------------------------Step 4: Build and train the CNN model-------------------------------
num_labels = 5
inputs = Input(name='inputs',shape=[max_len], dtype='float64')

# Word embedding (trainable=False: intended for pretrained vectors, though none are loaded here)
layer = Embedding(max_words+1, 256, input_length=max_len, trainable=False)(inputs)

# A single convolution branch with kernel size 3 (the TextCNN variant in Section V uses 3, 4, and 5)
cnn = Convolution1D(256, 3, padding='same', strides = 1, activation='relu')(layer)
cnn = MaxPool1D(pool_size=3)(cnn)

# Flatten, regularize, and classify
flat = Flatten()(cnn)
drop = Dropout(0.4)(flat)
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',  # RMSprop()
              metrics=["accuracy"])

# Guard so an already-trained model is not retrained
flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.001)]  # stop when val_loss stops improving
                          )

    # Save the model
    model.save('cnn_model.h5')
    del model  # deletes the existing model

    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)

else:
    print("Model prediction")
    # Load the trained model
    model = load_model('cnn_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)

    # Evaluate the predictions with a confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),
                                     np.argmax(test_pre,axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y,axis=1),
                                        np.argmax(test_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("cnn_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("cnn_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(5)+0.5, Labname, size = 12)
    plt.yticks(np.arange(5)+0.5, Labname, size = 12)
    plt.savefig('cnn_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with tok
    val_seq = tok.texts_to_sequences(val_content)
    # Pad every sequence to the same length
    val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1),
                                        np.argmax(val_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
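
As a usage sketch (not part of the original pipeline), once tok.pickle and cnn_model.h5 have been saved, a single API sequence can be classified on its own; the sample string here is hypothetical:

# Hedged sketch: classify one API sequence with the saved artifacts.
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing import sequence

with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)
model = load_model('cnn_model.h5')

sample = "GetSystemInfo;HeapCreate;NtAllocateVirtualMemory"  # hypothetical
seq = tok.texts_to_sequences([sample])  # Tokenizer's default filters split on ';'
seq_mat = sequence.pad_sequences(seq, maxlen=200)  # same max_len as training
pred = model.predict(seq_mat)
print("Predicted family index:", np.argmax(pred, axis=1)[0])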

3. Experimental Results

The final run and the files it generates are shown in the figure below:

The intermediate output is as follows:

no  ...                                                api
0 1 ... GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1 2 ... GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2 3 ... NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3 4 ... NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4 5 ... NtOpenFile;NtCreateSection;NtMapViewOfSection;...

[5 rows x 4 columns]
Label:
0 class1
1 class1
2 class1
3 class1
4 class1
5 class1
6 class1
7 class1
8 class1
9 class1
Name: apt, dtype: object
LabelEncoder
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]]
1241
['class1' 'class2' 'class3' 'class4' 'class5']
OneHotEncoder:
[[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]]
0 GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1 GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2 NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3 NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4 NtOpenFile;NtCreateSection;NtMapViewOfSection;...
Name: api, dtype: object
<class 'pandas.core.series.Series'>
<keras_preprocessing.text.Tokenizer object at 0x0000028E55D36B08>

('regqueryvalueexw', 1)
('ntclose', 2)
('ldrgetprocedureaddress', 3)
('regopenkeyexw', 4)
('regclosekey', 5)
('ntallocatevirtualmemory', 6)
('sendmessagew', 7)
('ntwritefile', 8)
('process32nextw', 9)
('ntdeviceiocontrolfile', 10)
===================
('getsysteminfo', 2651)
('heapcreate', 2996)
('ntallocatevirtualmemory', 115547)
('ntqueryvaluekey', 24120)
('getsystemtimeasfiletime', 52727)
('ldrgetdllhandle', 25135)
('ldrgetprocedureaddress', 199952)
('memcpy', 9008)
('setunhandledexceptionfilter', 1504)
('ntcreatefile', 43260)

(1241, 200)
(459, 200)
(650, 200)
[[ 3 135 3 3 2 21 3 3 4 3 96 3 3 4 96 4 96 20
22 20 3 6 6 23 128 129 3 103 23 56 2 103 23 20 3 23
3 3 3 3 4 1 5 23 12 131 12 20 3 10 2 10 2 20
3 4 5 27 3 10 2 6 10 2 3 10 2 10 2 3 10 2
10 2 10 2 10 2 10 2 3 10 2 10 2 10 2 10 2 3
3 3 36 4 3 23 20 3 5 207 34 6 6 6 11 11 6 11
6 6 6 6 6 6 6 6 6 11 6 6 11 6 11 6 11 6
6 11 6 34 3 141 3 140 3 3 141 34 6 2 21 4 96 4
96 4 96 23 3 3 12 131 12 10 2 10 2 4 5 27 10 2
6 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10
2 10 2 10 2 10 2 36 4 23 5 207 6 3 3 12 131 12
132 3]
[ 27 4 27 4 27 4 27 4 27 27 5 27 4 27 4 27 27 27
27 27 27 27 5 27 4 27 4 27 4 27 4 27 4 27 4 27
4 27 4 27 4 27 5 52 2 21 4 5 1 1 1 5 21 25
2 52 12 33 51 28 34 30 2 52 2 21 4 5 27 5 52 6
6 52 4 1 5 4 52 54 7 7 20 52 7 52 7 7 6 4
4 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 5
5 3 7 50 50 50 95 50 50 50 50 50 4 1 5 4 3 3
3 3 3 7 7 7 3 7 3 7 3 60 3 3 7 7 7 7
60 3 7 7 7 7 7 7 7 7 52 20 3 3 3 14 14 60
18 19 18 19 2 21 4 5 18 19 18 19 18 19 18 19 7 7
7 7 7 7 7 7 7 7 7 52 7 7 7 7 7 60 7 7
7 7]]

The model training process is as follows:

Model training
Epoch 1/15

1/20 [>.............................] - ETA: 5s - loss: 1.5986 - accuracy: 0.2656
2/20 [==>...........................] - ETA: 1s - loss: 1.6050 - accuracy: 0.2266
3/20 [===>..........................] - ETA: 1s - loss: 1.5777 - accuracy: 0.2292
4/20 [=====>........................] - ETA: 2s - loss: 1.5701 - accuracy: 0.2500
5/20 [======>.......................] - ETA: 2s - loss: 1.5628 - accuracy: 0.2719
6/20 [========>.....................] - ETA: 3s - loss: 1.5439 - accuracy: 0.3125
7/20 [=========>....................] - ETA: 3s - loss: 1.5306 - accuracy: 0.3348
8/20 [===========>..................] - ETA: 3s - loss: 1.5162 - accuracy: 0.3535
9/20 [============>.................] - ETA: 3s - loss: 1.5020 - accuracy: 0.3698
10/20 [==============>...............] - ETA: 3s - loss: 1.4827 - accuracy: 0.3969
11/20 [===============>..............] - ETA: 3s - loss: 1.4759 - accuracy: 0.4020
12/20 [=================>............] - ETA: 3s - loss: 1.4734 - accuracy: 0.4036
13/20 [==================>...........] - ETA: 3s - loss: 1.4456 - accuracy: 0.4255
14/20 [====================>.........] - ETA: 3s - loss: 1.4322 - accuracy: 0.4353
15/20 [=====================>........] - ETA: 2s - loss: 1.4157 - accuracy: 0.4469
16/20 [=======================>......] - ETA: 2s - loss: 1.4093 - accuracy: 0.4482
17/20 [========================>.....] - ETA: 2s - loss: 1.4010 - accuracy: 0.4531
18/20 [==========================>...] - ETA: 1s - loss: 1.3920 - accuracy: 0.4601
19/20 [===========================>..] - ETA: 0s - loss: 1.3841 - accuracy: 0.4638
20/20 [==============================] - ETA: 0s - loss: 1.3763 - accuracy: 0.4674
20/20 [==============================] - 20s 1s/step - loss: 1.3763 - accuracy: 0.4674 - val_loss: 1.3056 - val_accuracy: 0.4837

Time used: 26.1328806
{'loss': [1.3762551546096802], 'accuracy': [0.467365026473999],
'val_loss': [1.305567979812622], 'val_accuracy': [0.48366013169288635]}

The final prediction results are as follows:

Model prediction
[[ 40 14 11 1 44]
[ 16 57 10 0 17]
[ 6 30 61 0 23]
[ 12 20 15 47 36]
[ 11 14 19 0 146]]
precision recall f1-score support

0 0.4706 0.3636 0.4103 110
1 0.4222 0.5700 0.4851 100
2 0.5259 0.5083 0.5169 120
3 0.9792 0.3615 0.5281 130
4 0.5489 0.7684 0.6404 190

accuracy 0.5400 650
macro avg 0.5893 0.5144 0.5162 650
weighted avg 0.5980 0.5400 0.5323 650

accuracy 0.54

precision recall f1-score support

0 0.9086 0.4517 0.6034 352
1 0.5943 0.5888 0.5915 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0

accuracy 0.4837 459
macro avg 0.3006 0.2081 0.2390 459
weighted avg 0.8353 0.4837 0.6006 459

accuracy 0.48366013071895425
Time used: 14.170902800000002

Thoughts:
However, the overall prediction performance is rather poor. Why is that? Can it be improved through hyperparameter tuning, and how might the algorithm itself be improved? This article only provides the basic approach and code; further optimization and refinement are left for the reader to work out independently. Keep at it!
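
One concrete lead, offered as a hedged suggestion rather than the author's fix: the Embedding layer above is frozen (trainable=False) even though no pretrained weights are ever loaded, so the classifier sits on top of random embeddings; the families are also imbalanced (293 to 548 samples). The sketch below, reusing names from the script above, lets the embedding train and weights the loss by class frequency:

# Hedged sketch of two quick experiments (drop-in edits to the script above):
from sklearn.utils.class_weight import compute_class_weight

# 1) Let the embedding train, since no pretrained vectors are loaded
layer = Embedding(max_words+1, 256, input_length=max_len, trainable=True)(inputs)

# 2) Weight the loss by class frequency to counter family imbalance
labels = np.argmax(train_y, axis=1)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(labels), y=labels)
#model.fit(..., class_weight=dict(enumerate(weights)), ...)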


III. BiLSTM-Based Malware Family Detection

1. Model Construction

The basic steps of this model are as follows:

The model structure is shown in the figure below:

The complete code is as follows:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.perf_counter()  # time.clock() was removed in Python 3.8

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure matplotlib for Chinese labels
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column of the dataset (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the tokens with Tokenizer-------------------------------
# Encode the tokens with Tokenizer
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract the tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# tok.texts_to_sequences() turns each text into a sequence of token ids
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)

#-------------------------------Step 4: Build and train the BiLSTM model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words+1, 128, input_length=max_len))
#model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.1)))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)]
                          )

    # Save the model
    model.save('bilstm_model.h5')
    del model  # deletes the existing model

    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)

else:
    print("Model prediction")
    model = load_model('bilstm_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),
                                     np.argmax(test_pre,axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y,axis=1),
                                        np.argmax(test_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("bilstm_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("bilstm_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(5)+0.5, Labname, size = 12)
    plt.yticks(np.arange(5)+0.5, Labname, size = 12)
    plt.savefig('bilstm_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with tok
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)

    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1),
                                        np.argmax(val_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)

2. Experimental Results

The training output is shown below:

Model training
Epoch 1/15
1/20 [>.............................] - ETA: 40s - loss: 1.6114 - accuracy: 0.2031
2/20 [==>...........................] - ETA: 10s - loss: 1.6055 - accuracy: 0.2969
3/20 [===>..........................] - ETA: 10s - loss: 1.6015 - accuracy: 0.3281
4/20 [=====>........................] - ETA: 10s - loss: 1.5931 - accuracy: 0.3477
5/20 [======>.......................] - ETA: 10s - loss: 1.5914 - accuracy: 0.3469
6/20 [========>.....................] - ETA: 10s - loss: 1.5827 - accuracy: 0.3698
7/20 [=========>....................] - ETA: 10s - loss: 1.5785 - accuracy: 0.3884
8/20 [===========>..................] - ETA: 10s - loss: 1.5673 - accuracy: 0.4121
9/20 [============>.................] - ETA: 9s - loss: 1.5610 - accuracy: 0.4149
10/20 [==============>...............] - ETA: 9s - loss: 1.5457 - accuracy: 0.4187
11/20 [===============>..............] - ETA: 8s - loss: 1.5297 - accuracy: 0.4148
12/20 [=================>............] - ETA: 8s - loss: 1.5338 - accuracy: 0.4128
13/20 [==================>...........] - ETA: 7s - loss: 1.5214 - accuracy: 0.4279
14/20 [====================>.........] - ETA: 6s - loss: 1.5176 - accuracy: 0.4286
15/20 [=====================>........] - ETA: 5s - loss: 1.5100 - accuracy: 0.4271
16/20 [=======================>......] - ETA: 4s - loss: 1.5065 - accuracy: 0.4258
17/20 [========================>.....] - ETA: 3s - loss: 1.5021 - accuracy: 0.4237
18/20 [==========================>...] - ETA: 2s - loss: 1.4921 - accuracy: 0.4288
19/20 [===========================>..] - ETA: 1s - loss: 1.4822 - accuracy: 0.4334
20/20 [==============================] - ETA: 0s - loss: 1.4825 - accuracy: 0.4327
20/20 [==============================] - 33s 2s/step - loss: 1.4825 - accuracy: 0.4327 - val_loss: 1.4187 - val_accuracy: 0.4074

Time used: 38.565846900000004
{'loss': [1.4825222492218018], 'accuracy': [0.4327155649662018],
'val_loss': [1.4187402725219727], 'val_accuracy': [0.40740740299224854]}

The final prediction results are as follows:

Model prediction
[[36 18 37 1 18]
[14 46 34 0 6]
[ 8 29 73 0 10]
[16 29 14 45 26]
[47 15 33 0 95]]
precision recall f1-score support

0 0.2975 0.3273 0.3117 110
1 0.3358 0.4600 0.3882 100
2 0.3822 0.6083 0.4695 120
3 0.9783 0.3462 0.5114 130
4 0.6129 0.5000 0.5507 190

accuracy 0.4538 650
macro avg 0.5213 0.4484 0.4463 650
weighted avg 0.5474 0.4538 0.4624 650

accuracy 0.45384615384615384

precision recall f1-score support

0 0.9189 0.3864 0.5440 352
1 0.4766 0.4766 0.4766 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0

accuracy 0.4074 459
macro avg 0.2791 0.1726 0.2041 459
weighted avg 0.8158 0.4074 0.5283 459

accuracy 0.4074074074074074
Time used: 32.2772881

IV. BiGRU-Based Malware Family Detection

1. Model Construction

The basic steps of this model are as follows:

The model structure is shown in the figure below:

The complete code is as follows:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import GRU, LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.perf_counter()  # time.clock() was removed in Python 3.8

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure matplotlib for Chinese labels
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column of the dataset (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the tokens with Tokenizer-------------------------------
# Encode the tokens with Tokenizer
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract the tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# tok.texts_to_sequences() turns each text into a sequence of token ids
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)

#-------------------------------Step 4: Build and train the BiGRU model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words+1, 256, input_length=max_len))
#model.add(Bidirectional(GRU(128, dropout=0.2, recurrent_dropout=0.1)))
model.add(Bidirectional(GRU(256)))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.005)]
                          )

    # Save the model
    model.save('gru_model.h5')
    del model  # deletes the existing model

    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)

else:
    print("Model prediction")
    model = load_model('gru_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),
                                     np.argmax(test_pre,axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y,axis=1),
                                        np.argmax(test_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("gru_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("gru_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(5)+0.5, Labname, size = 12)
    plt.yticks(np.arange(5)+0.5, Labname, size = 12)
    plt.savefig('gru_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with tok
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)

    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1),
                                        np.argmax(val_pre,axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)

2. Experimental Results

The training output is shown below:

Model training
Epoch 1/15

1/20 [>.............................] - ETA: 47s - loss: 1.6123 - accuracy: 0.1875
2/20 [==>...........................] - ETA: 18s - loss: 1.6025 - accuracy: 0.2656
3/20 [===>..........................] - ETA: 18s - loss: 1.5904 - accuracy: 0.3333
4/20 [=====>........................] - ETA: 18s - loss: 1.5728 - accuracy: 0.3867
5/20 [======>.......................] - ETA: 17s - loss: 1.5639 - accuracy: 0.4094
6/20 [========>.....................] - ETA: 17s - loss: 1.5488 - accuracy: 0.4375
7/20 [=========>....................] - ETA: 16s - loss: 1.5375 - accuracy: 0.4397
8/20 [===========>..................] - ETA: 16s - loss: 1.5232 - accuracy: 0.4434
9/20 [============>.................] - ETA: 15s - loss: 1.5102 - accuracy: 0.4358
10/20 [==============>...............] - ETA: 14s - loss: 1.5014 - accuracy: 0.4250
11/20 [===============>..............] - ETA: 13s - loss: 1.5053 - accuracy: 0.4233
12/20 [=================>............] - ETA: 12s - loss: 1.5022 - accuracy: 0.4232
13/20 [==================>...........] - ETA: 11s - loss: 1.4913 - accuracy: 0.4279
14/20 [====================>.........] - ETA: 9s - loss: 1.4912 - accuracy: 0.4286
15/20 [=====================>........] - ETA: 8s - loss: 1.4841 - accuracy: 0.4365
16/20 [=======================>......] - ETA: 7s - loss: 1.4720 - accuracy: 0.4404
17/20 [========================>.....] - ETA: 5s - loss: 1.4669 - accuracy: 0.4375
18/20 [==========================>...] - ETA: 3s - loss: 1.4636 - accuracy: 0.4349
19/20 [===========================>..] - ETA: 1s - loss: 1.4544 - accuracy: 0.4383
20/20 [==============================] - ETA: 0s - loss: 1.4509 - accuracy: 0.4400
20/20 [==============================] - 44s 2s/step - loss: 1.4509 - accuracy: 0.4400 - val_loss: 1.3812 - val_accuracy: 0.3660

Time used: 49.7057119
{'loss': [1.4508591890335083], 'accuracy': [0.4399677813053131],
'val_loss': [1.381193995475769], 'val_accuracy': [0.3660130798816681]}

The final prediction results are as follows:

Model prediction
[[ 30 8 9 17 46]
[ 13 50 9 13 15]
[ 10 4 58 29 19]
[ 11 8 8 73 30]
[ 25 3 23 14 125]]
precision recall f1-score support

0 0.3371 0.2727 0.3015 110
1 0.6849 0.5000 0.5780 100
2 0.5421 0.4833 0.5110 120
3 0.5000 0.5615 0.5290 130
4 0.5319 0.6579 0.5882 190

accuracy 0.5169 650
macro avg 0.5192 0.4951 0.5016 650
weighted avg 0.5180 0.5169 0.5120 650

accuracy 0.5169230769230769

precision recall f1-score support

0 0.8960 0.3182 0.4696 352
1 0.7273 0.5234 0.6087 107
2 0.0000 0.0000 0.0000 0
3 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0

accuracy 0.3660 459
macro avg 0.3247 0.1683 0.2157 459
weighted avg 0.8567 0.3660 0.5020 459

accuracy 0.3660130718954248
Time used: 60.106339399999996

V. CNN+BiLSTM+Attention-Based Malware Family Detection

1. Model Construction

The basic steps of this model are as follows:

The model structure is shown below:

Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
inputs (InputLayer) [(None, 100)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 100, 256) 256256 inputs[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D) (None, 100, 256) 196864 embedding[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, 100, 256) 262400 embedding[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D) (None, 100, 256) 327936 embedding[0][0]
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 25, 256) 0 conv1d[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D) (None, 25, 256) 0 conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D) (None, 25, 256) 0 conv1d_2[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 25, 768) 0 max_pooling1d[0][0]
max_pooling1d_1[0][0]
max_pooling1d_2[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional) (None, 25, 256) 918528 concatenate[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 25, 128) 32896 bidirectional[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 25, 128) 0 dense[0][0]
__________________________________________________________________________________________________
attention_layer (AttentionLayer (None, 128) 6500 dropout[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 5) 645 attention_layer[0][0]
==================================================================================================
Total params: 2,002,025
Trainable params: 1,745,769
Non-trainable params: 256,256

The complete code is as follows:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, GRU, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.perf_counter()  # time.clock() was removed in Python 3.8

#---------------------------------------Step 1: Read the data------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure matplotlib for Chinese labels
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column of the dataset (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Encode the tokens with Tokenizer-------------------------------
# Encode the tokens with Tokenizer
max_words = 1000
max_len = 100
tok = Tokenizer(num_words=max_words)

# Extract the tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# tok.texts_to_sequences() turns each text into a sequence of token ids
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)

#-------------------------------Step 4: Build the Attention mechanism-------------------------------
"""
Keras does not ship a ready-made Attention layer for this, so we build one
as a custom layer. A custom Keras layer consists of four parts:
__init__: initialize the parameters the layer needs
build: define the layer's weights
call: the core part, defining how the tensors are combined
compute_output_shape: define the shape of the layer's output

Recommended reading: https://blog.csdn.net/huanghaocs/article/details/95752379
Recommended reading: https://zhuanlan.zhihu.com/p/29201491
"""
# Hierarchical Model with Attention
from keras import initializers
from keras import constraints
from keras import activations
from keras import regularizers
from keras import backend as K
from keras.engine.topology import Layer

K.clear_session()

class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size

        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    # Workaround for: "The graph tensor has name: model/attention_layer/Reshape:0"
    # https://blog.csdn.net/weixin_54227557/article/details/129898614
    def call(self, inputs):
        #self.V = K.reshape(self.V, (-1, 1))
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        #score = K.softmax(K.dot(H, self.V), axis=1)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]

#-------------------------------Step 5: Build and train the Attention+CNN model-------------------------------
# Build the TextCNN model
num_labels = 5
inputs = Input(name='inputs',shape=[max_len], dtype='float64')
layer = Embedding(max_words+1, 256, input_length=max_len, trainable=False)(inputs)
cnn1 = Convolution1D(256, 3, padding='same', strides = 1, activation='relu')(layer)
cnn1 = MaxPool1D(pool_size=4)(cnn1)
cnn2 = Convolution1D(256, 4, padding='same', strides = 1, activation='relu')(layer)
cnn2 = MaxPool1D(pool_size=4)(cnn2)
cnn3 = Convolution1D(256, 5, padding='same', strides = 1, activation='relu')(layer)
cnn3 = MaxPool1D(pool_size=4)(cnn3)

# Concatenate the output vectors of the three branches
cnn = concatenate([cnn1,cnn2,cnn3], axis=-1)

# BiLSTM+Attention
#bilstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(cnn)
bilstm = Bidirectional(LSTM(128, return_sequences=True))(cnn)  # return_sequences=True keeps the output 3-D
layer = Dense(128, activation='relu')(bilstm)
layer = Dropout(0.3)(layer)
attention = AttentionLayer(attention_size=50)(layer)

output = Dense(num_labels, activation='softmax')(attention)
model = Model(inputs=inputs, outputs=output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "test"
if flag == "train":
    print("Model training")
    # Train the model
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=15,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0005)]
                          )

    # Save the model
    model.save('cnn_bilstm_model.h5')
    del model  # deletes the existing model

    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)

else:
    print("Model prediction")
    model = load_model('cnn_bilstm_model.h5', custom_objects={'AttentionLayer': AttentionLayer(50)}, compile=False)

    #--------------------------------------Step 6: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y,axis=1),
                                        np.argmax(test_pre,axis=1),
                                        digits=4))
    print("accuracy",metrics.accuracy_score(np.argmax(test_y,axis=1),
                                            np.argmax(test_pre,axis=1)))
    # Store the results
    f1 = open("cnn_bilstm_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("cnn_bilstm_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(5)+0.5, Labname, size = 12)
    plt.yticks(np.arange(5)+0.5, Labname, size = 12)
    plt.savefig('cnn_bilstm_result.png')
    plt.show()

    #--------------------------------------Step 7: Validation--------------------------------
    # Re-preprocess the validation set with tok and predict with the trained model
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)

    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y, axis=1),
                                        np.argmax(val_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
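
One detail worth noting: the get_config() override on AttentionLayer, together with the custom_objects argument to load_model(), is what lets the custom layer survive the save/load round trip; without them, Keras cannot re-instantiate the layer from the saved H5 file.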

2. Experimental Results

The training output is shown below:

Model training
Epoch 1/15

1/10 [==>...........................] - ETA: 18s - loss: 1.6074 - accuracy: 0.2188
2/10 [=====>........................] - ETA: 2s - loss: 1.5996 - accuracy: 0.2383
3/10 [========>.....................] - ETA: 2s - loss: 1.5903 - accuracy: 0.2500
4/10 [===========>..................] - ETA: 2s - loss: 1.5665 - accuracy: 0.2793
5/10 [==============>...............] - ETA: 2s - loss: 1.5552 - accuracy: 0.2750
6/10 [=================>............] - ETA: 1s - loss: 1.5346 - accuracy: 0.2930
7/10 [====================>.........] - ETA: 1s - loss: 1.5229 - accuracy: 0.3103
8/10 [=======================>......] - ETA: 1s - loss: 1.5208 - accuracy: 0.3135
9/10 [==========================>...] - ETA: 0s - loss: 1.5132 - accuracy: 0.3281
10/10 [==============================] - ETA: 0s - loss: 1.5046 - accuracy: 0.3400
10/10 [==============================] - 9s 728ms/step - loss: 1.5046 - accuracy: 0.3400 - val_loss: 1.4659 - val_accuracy: 0.5599

Time used: 13.8141568
{'loss': [1.5045626163482666], 'accuracy': [0.34004834294319153],
'val_loss': [1.4658586978912354], 'val_accuracy': [0.5599128603935242]}

The final prediction results are as follows:

Model prediction
[[ 56 13 1 0 40]
[ 31 53 0 0 16]
[ 54 47 3 1 15]
[ 27 14 1 51 37]
[ 39 16 8 2 125]]
precision recall f1-score support

0 0.2705 0.5091 0.3533 110
1 0.3706 0.5300 0.4362 100
2 0.2308 0.0250 0.0451 120
3 0.9444 0.3923 0.5543 130
4 0.5365 0.6579 0.5910 190

accuracy 0.4431 650
macro avg 0.4706 0.4229 0.3960 650
weighted avg 0.4911 0.4431 0.4189 650

accuracy 0.4430769230769231

precision recall f1-score support

0 0.8571 0.5625 0.6792 352
1 0.6344 0.5514 0.5900 107
2 0.0000 0.0000 0.0000 0
4 0.0000 0.0000 0.0000 0

accuracy 0.5599 459
macro avg 0.3729 0.2785 0.3173 459
weighted avg 0.8052 0.5599 0.6584 459

accuracy 0.5599128540305011
Time used: 23.0178675

This article is reproduced from the WeChat public account @娜璋AI安全之家.
