Training set
The dataset contains a large number of toxic comments from Wikipedia, each annotated with the following six labels:
toxic
severe_toxic
obscene
threat
insult
identity_hate
- Build a model that predicts, for each comment, the probability of each label.
Basic parameters
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

MAX_SEQUENCE_LENGTH = 100
MAX_VOCAB_SIZE = 20000
EMBEDDING_DIM = 100  # embedding dimension (matches glove.6B.100d)
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128
EPOCHS = 10
Load the word vectors pre-trained with GloVe. word2vec is a dictionary built from the pre-trained GloVe vectors: every word maps to a corresponding 100-dimensional vector.
word2vec = {}
with open('D:/software engineering/nlp_translate/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                               # first token is the word
        vec = np.asarray(values[1:], dtype='float32')  # the rest is its 100-d vector
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
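A quick sanity check on the loaded dictionary (the word "king" is just an illustrative lookup; any common English word appears in glove.6B):
print(word2vec['king'].shape)  # expected: (100,)
print(word2vec['king'][:5])    # first five components of its vector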
Use the text and sequence preprocessing utilities from Keras. Tokenizer is a class for vectorizing text, i.e. turning texts into integer sequences.
train = pd.read_csv("D:/software engineering/nlp_translate/train.csv")
sentences = train["comment_text"].fillna("DUMMY_VALUE").values
possible_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
targets = train[possible_labels].values

# note: len(s) here is the character length of each comment, not its token count
print("max sequence length:", max(len(s) for s in sentences))
print("min sequence length:", min(len(s) for s in sentences))
lengths = sorted(len(s) for s in sentences)
print("median sequence length:", lengths[len(lengths) // 2])

# convert the sentences into sequences of integer word indices
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
word2idx = tokenizer.word_index
print('Found %s unique tokens.' % len(word2idx))

# pad (or truncate) every sequence to the same length
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)
Fitting the tokenizer on the comment_text column of train builds the dictionary word2idx, which maps every word to an integer index; texts_to_sequences() then turns each comment into a list of those indices, looked up from word_index.
For example, for the word nonsense, word2idx.get("nonsense") equals 845.
pad_sequences then pads (or truncates) each sequence to length 100, so every sequence has the same dimension.
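A minimal sketch of the same mechanics on toy data (the variable toy and the printed indices are illustrative assumptions; actual indices depend on word frequency):
toy = Tokenizer(num_words=10)
toy.fit_on_texts(["the cat sat", "the cat ran"])
print(toy.word_index)                       # e.g. {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4}
print(toy.texts_to_sequences(["the cat"]))  # [[1, 2]]
print(pad_sequences(toy.texts_to_sequences(["the cat"]), maxlen=5))  # left-padded: [[0 0 0 1 2]]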
Build an embedding_matrix that keeps only the top 20,000 words, and fill in their pretrained vectors.
num_words = min(MAX_VOCAB_SIZE, len(word2idx) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word2idx.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            # words not found in the GloVe index stay all zeros
            embedding_matrix[i] = embedding_vector
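It can be useful to check how much of the kept vocabulary actually received a pretrained vector (rows that stay all-zero fall back to a zero embedding); a small sketch:
n_covered = np.count_nonzero(np.any(embedding_matrix != 0, axis=1))
print('%d / %d words have a GloVe vector' % (n_covered, num_words))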
Load this matrix into an Embedding layer, setting trainable=False so the layer is frozen during training. If the input does not need pre-learned semantic features, a plain Embedding layer alone already yields a word-vector matrix; when semantic features are needed, the GloVe pre-trained vectors are fed into the Embedding layer instead.
embedding_layer = Embedding(
num_words,
EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=False
)
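For contrast, the plain variant mentioned above, trained from scratch with no pretrained weights, would look like this sketch (not used in this model):
scratch_embedding = Embedding(
    num_words,
    EMBEDDING_DIM,
    input_length=MAX_SEQUENCE_LENGTH  # weights start random and are learned with the task
)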
Build the training model:
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(input_)
x = Conv1D(128, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
# one independent sigmoid per label: multi-label classification, not softmax
output = Dense(len(possible_labels), activation='sigmoid')(x)
model = Model(input_, output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
print('Training model...')
r = model.fit(data, targets, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=VALIDATION_SPLIT)
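Since fit() returns a History object, the training curves can be inspected afterwards (matplotlib is an extra dependency assumed here):
import matplotlib.pyplot as plt
plt.plot(r.history['loss'], label='train loss')
plt.plot(r.history['val_loss'], label='val loss')
plt.legend()
plt.show()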
Results
Code repository: https://github.com/Dennis174698/toxic