
Commit 9e4644c

initial commit

0 parents  commit 9e4644c

18 files changed: 1950 additions & 0 deletions

README.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
## Description

This project is an implementation of Roth and Lapata's "Neural Semantic Role Labeling with Dependency Path Embeddings" [1], a semantic role labeling model with an LSTM at its core.

## Requirements

#### Software

Python >= 3.4.3, until this issue (https://github.com/tensorflow/tensorflow/issues/4588) lands in tensorflow; after that, any Python 3.x should be fine. <br />
tensorflow >= 0.10.0 <br />
Perl >= 5.8.1 for evaluation with the CoNLL 2009 scorer<br />

#### Data

Download the CoNLL 2008 [2] or 2009 [3] Shared Task data:<br />
2008: https://catalog.ldc.upenn.edu/LDC2009T12<br />
2009: https://catalog.ldc.upenn.edu/LDC2012T04<br />
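
For reference, each token in a CoNLL 2009 file is one tab-separated line, with roughly these columns: ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, and one APRED column per predicate. The two-token sample below is purely illustrative and not taken from the LDC data:

    1   John    john    john    NNP  NNP  _  _  2  2  SBJ   SBJ   _  _         A0
    2   sleeps  sleep   sleep   VBZ  VBZ  _  _  0  0  ROOT  ROOT  Y  sleep.01  _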

#### Scorer

If you want to evaluate a dataset with the official CoNLL scorer, you have to download it here:<br />
2009: https://ufal.mff.cuni.cz/conll2009-st/scorer.html<br />
Put the scorer, named 'eval09.pl', into the 'score_scripts' folder. The Python script uses it internally and can even evaluate input files in the 2008 format. The scorer is executed via a 'perl' command, so make sure Perl is installed. You can switch off warnings in the Perl script for better readability.
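
If you want to run the scorer by hand on the files written by the test step (see below), the call looks roughly like this; I believe -g (gold) and -s (system) are the scorer's flags, but run 'perl eval09.pl' without arguments to confirm the usage message of your version:

    perl score_scripts/eval09.pl -g ./output/[input_file].GOLD -s ./output/[input_file].PRED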

#### Model

I trained a model on the CoNLL 2008 'train.closed' file. The parameters were set as in the paper, except that there were only 2 iterations for each argument identification model and 5 iterations for argument classification. This model scores 73.89% labeled F1 on the in-domain test file and 67.61% on the out-of-domain test file.<br />
Download model: https://www.dropbox.com/sh/au325tntluwe8us/AAC8tUeO5SH6txKUT5DC2Mu6a?dl=0 <br />
After the download, place all *.model and *.meta files into the 'model' folder. Only the main *.model file (the one without 'LSTM' in its name) needs to be passed to the run script; the others are identified by their names.

## Run

#### Training

    python3 run.py train ./data/train/train.closed conll2008 ./output/

Trains a model with the 'train.closed' file from the CoNLL 2008 Shared Task and writes the model files into 'output'. Be sure to set your desired parameters in 'config.cfg'. <br />
The provided model took around 21 hours to train on a p2.xlarge EC2 instance with one NVIDIA K80 GPU.

#### Testing

    python3 run.py test ./data/test.wsj/test.wsj.closed.GOLD conll2008 ./output/ ./model/train.closed_pi_ai_ac.model

The input file needs to have gold labels to be evaluated against. Writes the predicted labels to 'output/[input_file].PRED' and the gold labels to 'output/[input_file].GOLD'. The full evaluation is written to 'output/[input_file].RESULTS'.

#### Prediction

    python3 run.py predict ./data/test.wsj/my_own_sentences.conll conll2009 ./output/ ./model/train.closed_pi_ai_ac.model

The input file has to be in the CoNLL 2008 or 2009 format (without argument labels, but with sense-disambiguated predicates!). Predicts the argument labels and writes them to 'output/[input_file].PRED'.

## ToDos

1. Implement Predicate Prediction and Disambiguation
2. Implement Reranker

[1] Roth and Lapata, 2016, https://arxiv.org/abs/1605.07515 <br />
[2] Surdeanu et al., 2008, http://dl.acm.org/citation.cfm?id=1596411 <br />
[3] Hajic et al., 2009, http://dl.acm.org/citation.cfm?id=1596324.1596352

config.cfg

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@

>default
lstm_size = 28
hidden_layer = 128
learning_rate = 0.001
batch_size = 1000
drop_out_rate = 0.50
class_weights = True
iteration_factor = 5
features = ArgForm ArgPOS ArgDeprel PredForm PredLemma PredSense PredPOS

>ai_V
lstm_size = 25
hidden_layer = 90
learning_rate = 0.0006
drop_out_rate = 0.42
features = ArgForm ArgPOS ArgDeprel PredDeprel PredSense ArgLeftPOS ArgRightPOS POSPath POSDepPath DeprelPath Position PredChildDepSet PredParentForm PredParentPOS ArgRightSiblingForm ArgRightSiblingPOS

>ai_N
lstm_size = 16
hidden_layer = 125
learning_rate = 0.0009
drop_out_rate = 0.25
features = ArgForm ArgPOS ArgDeprel PredForm PredLemma PredSense PredPOS ArgLeftPOS ArgRightForm ArgRightPOS POSPath POSDepPath DeprelPath Position PredParentForm PredChildFormSet

>ac_V
lstm_size = 5
hidden_layer = 300
learning_rate = 0.0155
drop_out_rate = 0.50
iteration_factor = 50
features = PredSense PredLemma PredPOS ArgForm ArgPOS ArgDeprel ArgRightForm ArgRightPOS ArgLeftPOS ArgLeftSiblingPOS POSPath POSDepPath DeprelPath Position PredChildDepSet PredParentForm PredParentPOS

>ac_N
lstm_size = 88
hidden_layer = 500
learning_rate = 0.0055
drop_out_rate = 0.46
iteration_factor = 50
features = PredForm PredSense PredLemma ArgForm ArgPOS ArgRightForm ArgRightPOS ArgLeftForm ArgLeftPOS ArgLeftSiblingForm ArgLeftSiblingPOS ArgRightSiblingPOS POSPath POSDepPath Position PredChildPOSSet
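
As a minimal sketch (not the project's actual loader; the Config class name and the assumption that step sections inherit unset keys from '>default' are mine), the '>section' format above could be read like this, matching the get_value(key, transform) access pattern that feature_set.py uses:

    # hypothetical reader for the '>section' config format above
    class Config(object):
        def __init__(self, path, section):
            sections = {}
            current = None
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    if line.startswith('>'):
                        current = line[1:]
                        sections[current] = {}
                    else:
                        key, _, value = line.partition('=')
                        sections[current][key.strip()] = value.strip()
            # assumed: a step section inherits every key it does not set from '>default'
            self.values = dict(sections.get('default', {}))
            self.values.update(sections.get(section, {}))

        def get_value(self, key, transform=None):
            # mirrors the call in feature_set.py:
            # config.get_value('features', lambda s: s.split(' '))
            value = self.values[key]
            return transform(value) if transform else value

    # usage: feature names for the verbal argument-identification step
    # config = Config('config.cfg', 'ai_V')
    # features = config.get_value('features', lambda s: s.split(' '))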

data/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@

# Ignore everything in this directory
*
# Except this file
!.gitignore

feature_set.py

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@

from features import available_features, LSTMFeature
from scipy.sparse import csr_matrix
import numpy as np
from random import randint


class FeatureSet(object):

    def __init__(self, iterator, step_name, pos_type, config, freezed, label_func=None, vocabs=None, class_names=None):
        self.config = config
        self.name = step_name + pos_type
        self.iterator = iterator
        self.step_name = step_name
        self.pos_type = pos_type
        self.freezed = freezed
        self.vocabs = vocabs
        self.label_func = label_func
        self.class_names = class_names
        feature_names = self.config.get_value('features', lambda s: s.split(' '))
        self.binary_features = [available_features[f_name]() for f_name in feature_names]
        self.lstm_feature = LSTMFeature()
        if self.vocabs:
            self._set_vocabs()

        self.binary_feature_matrix = None
        self.binary_feature_width = None
        self.lstm_feature_vectors = None
        self.lstm_feature_row_width = None
        self.label_array = None
        self.number_of_instances = None
        self.num_classes = None if self.class_names is None else len(self.class_names)
        self.class_indices = None

        self.print('get binary feature matrix')
        binary_features_vectors = list()
        for feature in self.binary_features:
            feature_vectors = feature.get_vector_batch(self.iterator, self.freezed)
            binary_features_vectors.append(feature_vectors)
        features_length = [len(f) for f in self.binary_features]
        self.print('binary features length ' + str(features_length))
        self.binary_feature_matrix = dicts_to_sparse_matrix(binary_features_vectors, features_length, add_bias=True)
        self.binary_feature_width = self.binary_feature_matrix[0].shape[1]
        self.print('finished binary feature matrix')
        self.print('get lstm feature matrix')
        self.lstm_feature_vectors = self.lstm_feature.get_vector_batch(self.iterator, self.freezed)
        self.lstm_feature_row_width = len(self.lstm_feature)
        self.print('finished lstm feature matrix')

        # get the labels
        if self.label_func is not None:
            label_array_raw = self.label_func(self.iterator)
            self.class_names = list(np.unique(label_array_raw).tolist())
            self.num_classes = len(self.class_names)
            label_array = list()
            # calculate the class weights by inverse frequency
            class_weights = [1 - (list(label_array_raw).count(c) / float(len(list(label_array_raw))))
                             for c in self.class_names]
            self.class_weights = [w / min(class_weights) for w in class_weights]
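            # Illustrative: if class A covers 90% of the labels and class B 10%,
            # class_weights starts as [0.1, 0.9]; dividing by the minimum gives
            # [1.0, 9.0], so the rarer class weighs 9 times more in the loss.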
            print('classes: ', self.class_names)

            # one-hot encode the labels
            for i, label_raw in enumerate(label_array_raw):
                label = [0 for _ in range(self.num_classes)]
                label[self.class_names.index(label_raw)] = 1
                label_array.append(label)

            self.label_array = np.asarray(label_array)

        # check that the numbers of lstm feature instances, binary feature instances and labels are identical
        if (self.binary_feature_matrix.shape[0] != len(self.lstm_feature_vectors)) or \
                (self.label_func and (self.binary_feature_matrix.shape[0] != len(self.label_array))):
            raise ValueError('No equal number of instances')
        self.number_of_instances = self.binary_feature_matrix.shape[0]

    def get_binary_feature_matrix(self):
        return self.binary_feature_matrix

    def get_lstm_features(self):
        return self.lstm_feature, self.lstm_feature_row_width

    def get_training_batch(self, batch_size, epoch):
        # take a contiguous run of instances starting at a random offset, wrapping around via modulo
        random_int = randint(0, self.number_of_instances)
        indices = [(i + random_int) % self.number_of_instances for i in range(batch_size)]
        batch_lstm_instances = list()
        batch_binary_instances = self.binary_feature_matrix[indices].toarray()
        labels = list()
        for i in indices:
            batch_lstm_instances.append(self.lstm_feature_vectors[i])
            labels.append(self.label_array[i])
        lstm_instance_time_major, sequence_lengths = self._lstm_time_major(batch_lstm_instances)
        labels = np.asarray(labels)
        batch_binary_instances = np.asarray(batch_binary_instances)
        return batch_binary_instances, lstm_instance_time_major, sequence_lengths, labels

    def get_prediction_instances(self, start, stop):
        lstm_instance_time_major, sequence_lengths = self._lstm_time_major(self.lstm_feature_vectors[start:stop])
        binary_instances = self.binary_feature_matrix[start:stop].toarray()
        return binary_instances, lstm_instance_time_major, sequence_lengths

    def get_prediction_instance(self, i):
        feature_vector = self.lstm_feature_vectors[i]
        sequence_lengths = [len(feature_vector)]
        lstm_instance_time_major = list()
        for row_index in range(len(feature_vector)):
            row = [1 if r in feature_vector[row_index] else 0 for r in range(self.lstm_feature_row_width)]
            lstm_instance_time_major.append([row])

        lstm_instance_time_major = np.asarray(lstm_instance_time_major, dtype=np.float32)
        sequence_lengths = np.asarray(sequence_lengths, dtype=np.int32)
        return lstm_instance_time_major, sequence_lengths

    def _lstm_time_major(self, lstm_feature_instances):
        # TensorFlow needs this format for sequences of different lengths:
        # shape (max_time, batch_size, row_width), zero-padded beyond each sequence's length
        sequence_lengths = [len(sequence) for sequence in lstm_feature_instances]
        max_sequence_length = max(sequence_lengths)

        instance_time_major = np.zeros(shape=(max_sequence_length, len(lstm_feature_instances),
                                              self.lstm_feature_row_width), dtype=np.float32)
        for s_id, sequence in enumerate(lstm_feature_instances):
            for r_id, row in enumerate(sequence):
                for key in row.keys():
                    instance_time_major[r_id][s_id][key] = 1

        sequence_lengths = np.asarray(sequence_lengths, dtype=np.int32)
        return instance_time_major, sequence_lengths
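
    # Illustrative shape check for _lstm_time_major: two sequences of lengths
    # 2 and 3 with row width 4 yield instance_time_major of shape (3, 2, 4);
    # the shorter sequence is all zeros at time step 2, and
    # sequence_lengths == [2, 3] marks where the padding begins.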

    def get_vocabs(self):
        # get the vocabularies from the features of this step to be saved in the model object
        vocabs = {}
        for feature in self.binary_features:
            if hasattr(feature, 'get_vocab'):
                vocabs.update(feature.get_vocab())
        return vocabs

    def _set_vocabs(self):
        # load the vocabularies into the features to be able to reproduce one-hot vectors of the trained model
        for feature in self.binary_features:
            if hasattr(feature, 'set_vocab'):
                feature.set_vocab(self.vocabs)

    def get_label_array(self):
        if self.label_func is None:
            raise ValueError('Need label function to generate labels')
        return self.label_array

    def print(self, s):
        print(self.step_name + ' - ' + self.pos_type + ': ' + s)


def dicts_to_sparse_matrix(features_vectors, features_length, add_bias=False):
    # each feature contributes one {column_index: value} dict per instance;
    # concatenate them column-wise into one csr_matrix of shape (instances, sum(features_length))
    if any([len(features_vectors[0]) != len(f) for f in features_vectors]):
        raise ValueError('Every feature needs values for every instance')

    shape = (len(features_vectors[0]), sum(features_length))
    row = list()
    col = list()
    data = list()

    for feature_idx, feature_vectors in enumerate(features_vectors):
        col_offset = sum(features_length[0:feature_idx])
        for row_idx, feature_vector in enumerate(feature_vectors):
            for col_idx, v in feature_vector.items():
                col_idx += col_offset
                row.append(row_idx)
                col.append(col_idx)
                data.append(v)
    if add_bias:
        # append a constant 1 bias column
        for i in range(shape[0]):
            row.append(i)
            col.append(shape[1])
            data.append(1)
        shape = (shape[0], shape[1] + 1)
    return csr_matrix((data, (row, col)), shape=shape)
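
As a quick, self-contained illustration of dicts_to_sparse_matrix (toy data made up for this example: two features of widths 3 and 2, two instances, plus the bias column appended by add_bias=True):

    # toy usage of dicts_to_sparse_matrix
    from feature_set import dicts_to_sparse_matrix

    # one {column_index: value} dict per instance, per feature
    feature_a = [{0: 1}, {2: 1}]   # width 3
    feature_b = [{1: 1}, {0: 1}]   # width 2, columns offset by 3 in the result
    m = dicts_to_sparse_matrix([feature_a, feature_b], [3, 2], add_bias=True)
    print(m.toarray())
    # [[1 0 0 0 1 1]
    #  [0 0 1 1 0 1]]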

0 commit comments
