DongjunLee
diff --git a/README.md b/README.md
Lines changed: 39 additions & 12 deletions
diff --git a/config/kaggle_movie_review.yml b/config/kaggle_movie_review.yml
Lines changed: 9 additions & 3 deletions
diff --git a/config/rt-polarity.yml b/config/rt-polarity.yml
Lines changed: 12 additions & 5 deletions
diff --git a/data_loader.py b/data_loader.py
Lines changed: 78 additions & 22 deletions
diff --git a/dataset.py b/dataset.py
Lines changed: 0 additions & 104 deletions
@@ -11,15 +11,26 @@ This code implements [Convolutional Neural Networks for Sentence Classification]
 
 - Python 3.6
 - TensorFlow 1.4
-- hb-config
+- [hb-config](https://github.com/hb-research/hb-config) (Singleton Config)
 - tqdm
 
-## Features
+## Project Structure
+
+Project initialized with [hb-base](https://github.com/hb-research/hb-base)
+
+    .
+    ├── config                  # Config files (.yml, .json) used with hb-config
+    ├── data                    # dataset path
+    ├── notebooks               # Prototyping with numpy or tf.InteractiveSession
+    ├── text-cnn                # text-cnn architecture graphs (from input to logits)
+    │   └── __init__.py             # Graph logic
+    ├── data_loader.py          # raw_data -> processed_data -> generate_batch (using Dataset)
+    ├── hook.py                 # training or test hook feature (e.g. print_variables)
+    ├── main.py                 # define experiment_fn
+    └── model.py                # define EstimatorSpec
+
+Reference : [hb-config](https://github.com/hb-research/hb-config), [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator), [experiment_fn](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment), [EstimatorSpec](https://www.tensorflow.org/api_docs/python/tf/estimator/EstimatorSpec)
 
-- Using Higher-APIs in TensorFlow
-	- [Estimator](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator)
-	- [Experiment](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment)
-	- [Dataset](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset)
 - Dataset : [rt-polarity](https://github.com/yoonkim/CNN_sentence), [Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)
 
 ## Todo
@@ -30,7 +41,6 @@ This code implements [Convolutional Neural Networks for Sentence Classification]
 	- CNN-nonstatic
 	- CNN-multichannel
 
-
 ## Config
 
 example: kaggle\_movie\_review.yml
@@ -46,7 +56,8 @@ data:
   PAD_ID: 0
 
 model:
-  embed_type: 'rand'  (rand, static, non-static, multichannel)
+  batch_size: 64
+  embed_type: 'rand'     #(rand, static, non-static, multichannel)
   pretrained_embed: "" 
   embed_dim: 300
   num_filters: 256
@@ -58,11 +69,13 @@ model:
   dropout: 0.5
 
 train:
-  batch_size: 64
   learning_rate: 0.00005
+  
   train_steps: 100000
   model_dir: 'logs/kaggle_movie_review'
-  save_checkpoints_steps: 2000
+  
+  save_checkpoints_steps: 1000
+  loss_hook_n_iter: 1000
   check_hook_n_iter: 1000
   min_eval_frequency: 1000
 ```
@@ -81,6 +94,20 @@ sh prepare_kaggle_movie_reviews.sh
 python main.py --config kaggle_movie_review --mode train_and_evaluate
 ```
 
+### Experiment modes
+
+:white_check_mark: : Working  
+:white_medium_small_square: : Not tested yet.
+
+- :white_check_mark: `evaluate` : Evaluate on the evaluation data.
+- :white_medium_small_square: `extend_train_hooks` : Extends the hooks for training.
+- :white_medium_small_square: `reset_export_strategies` : Resets the export strategies with the new_export_strategies.
+- :white_medium_small_square: `run_std_server` : Starts a TensorFlow server and joins the serving thread.
+- :white_medium_small_square: `test` : Tests training, evaluating and exporting the estimator for a single step.
+- :white_check_mark: `train` : Fit the estimator using the training data.
+- :white_check_mark: `train_and_evaluate` : Interleaves training and evaluation.
+
+
 ### Tensorboard
 
 ```tensorboard --logdir logs```
@@ -101,5 +128,5 @@ python main.py --config kaggle_movie_review --mode train_and_evaluate
 ## Reference
 
 - [Implementing a CNN for Text Classification in TensorFlow](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/) by Denny Britz
-- [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882) (2014) by Y Kim
-- [A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1510.03820.pdf) (2015) Y Zhang
+- [Paper - Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882) (2014) by Y Kim
+- [Paper - A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1510.03820.pdf) (2015) by Y Zhang
@@ -8,7 +8,8 @@ data:
   PAD_ID: 0
 
 model:
-  embed_type: 'rand'
+  batch_size: 64
+  embed_type: 'rand'     #(rand, static, non-static, multichannel)
   pretrained_embed: ""
   embed_dim: 300
   num_filters: 256
@@ -20,10 +21,15 @@ model:
   dropout: 0.5
 
 train:
-  batch_size: 64
   learning_rate: 0.00005
+
   train_steps: 100000
   model_dir: 'logs/kaggle_movie_review'
-  save_checkpoints_steps: 2000
+
+  save_checkpoints_steps: 1000
+  loss_hook_n_iter: 1000
   check_hook_n_iter: 1000
   min_eval_frequency: 1000
+
+  print_verbose: True
+  debug: False
@@ -8,21 +8,28 @@ data:
   PAD_ID: 0
 
 model:
-  embed_type: 'rand'
+  batch_size: 64
+  embed_type: 'rand'     #(rand, static, non-static, multichannel)
   pretrained_embed: ""
-  embed_dim: 128
-  num_filters: 128
+  embed_dim: 300
+  num_filters: 256
   filter_sizes:
+    - 2
     - 3
     - 4
     - 5
   dropout: 0.5
 
 train:
-  batch_size: 64
   learning_rate: 0.00001
+
   train_steps: 20000
   model_dir: 'logs/rt-polarity'
-  save_checkpoints_steps: 1000
+
+  save_checkpoints_steps: 100
+  loss_hook_n_iter: 100
   check_hook_n_iter: 100
   min_eval_frequency: 100
+
+  print_verbose: True
+  debug: False
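Review note: both config hunks above move `batch_size` from the `train` section to the `model` section, so any caller that still reads it from `train` would stop finding it. A minimal sketch of a lookup that tolerates both layouts, using plain dicts in place of hb-config. The helper name `get_batch_size` and its default of 64 are illustrative assumptions, not part of this change.

```python
def get_batch_size(config, default=64):
    """Return batch_size from the new location (model), else the old one (train)."""
    for section in ("model", "train"):
        # A missing or empty section is treated as an empty dict.
        value = (config.get(section) or {}).get("batch_size")
        if value is not None:
            return value
    # Hypothetical fallback; 64 matches the value used in both config files.
    return default


# Layout before this change: batch_size lives under train.
old_layout = {"model": {"embed_dim": 300}, "train": {"batch_size": 64}}
# Layout after this change: batch_size lives under model.
new_layout = {"model": {"batch_size": 32, "embed_dim": 300}, "train": {}}

print(get_batch_size(old_layout))
print(get_batch_size(new_layout))
```

Checking `model` first means the new layout wins if a stale `train.batch_size` is left behind in a config file.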
@@ -8,9 +8,9 @@
 import re
 
 import numpy as np
-from tqdm import tqdm
 from hbconfig import Config
-
+import tensorflow as tf
+from tqdm import tqdm
 
 
 def clean_str(string):
@@ -214,26 +214,31 @@ def process_data():
     token2id('test_X')
 
 
-def make_train_and_test_set():
+def make_train_and_test_set(shuffle=True):
     print("make Training data and Test data Start....")
 
-    set_max_seq_length(['train_X_ids', 'test_X_ids'])
+    if Config.data.get('max_seq_length', None) is None:
+        set_max_seq_length(['train_X_ids', 'test_X_ids'])
 
     train_X, train_y = load_data('train_X_ids', 'train_y')
     test_X, test_y = load_data('test_X_ids', 'test_y')
 
-    if len(train_X) == len(train_y) and len(test_X) == len(test_y):
-        print(f"train data count : {len(train_X)}")
-        print(f"test data count : {len(test_X)}")
-        return train_X, test_X, train_y, test_y
-    else:
-        train_count = min(len(train_X), len(train_y))
-        test_count = min(len(test_X), len(test_y))
+    assert len(train_X) == len(train_y)
+    assert len(test_X) == len(test_y)
+
+    print(f"train data count : {len(train_y)}")
+    print(f"test data count : {len(test_y)}")
 
-        print(f"train data count : {train_count}")
-        print(f"test data count : {test_count}")
+    if shuffle:
+        print("shuffle dataset ...")
+        train_p = np.random.permutation(len(train_y))
+        test_p = np.random.permutation(len(test_y))
 
-        return train_X[:train_count], test_X[:test_count], train_y[:train_count], test_y[:test_count]
+        return ((train_X[train_p], train_y[train_p]),
+                (test_X[test_p], test_y[test_p]))
+    else:
+        return ((train_X, train_y),
+                (test_X, test_y))
 
 
 def load_data(X_fname, y_fname):
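Review note: the shuffle branch in `make_train_and_test_set` indexes `train_X` and `train_y` with the same permutation, which keeps each example paired with its label. This only works if `load_data` returns NumPy arrays (the hunk does not show that); indexing a plain Python list with a permutation array raises a `TypeError`. A minimal standalone sketch of the same idea, where the helper name `shuffle_in_unison` and the `seed` parameter are illustrative assumptions:

```python
import numpy as np


def shuffle_in_unison(X, y, seed=None):
    """Shuffle X and y with one shared permutation so rows stay paired."""
    assert len(X) == len(y)
    rng = np.random.RandomState(seed)  # seed is optional, for reproducibility
    p = rng.permutation(len(y))        # one permutation reused for both arrays
    return X[p], y[p]


X = np.arange(10).reshape(5, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(5)                 # label i belongs to row i
X_s, y_s = shuffle_in_unison(X, y, seed=0)

# Every shuffled row still matches its original label.
assert all(X_s[i, 0] == 2 * y_s[i] for i in range(len(y_s)))
```

As in the diff, the train and test splits would each get their own permutation by calling the helper once per split.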
@@ -281,15 +286,66 @@ def set_max_seq_length(dataset_fnames):
     print(f"Setting max_seq_length to Config : {max_seq_length}")
 
 
-def _reshape_batch(inputs, size, batch_size):
-    """ Create batch-major inputs. Batch inputs are just re-indexed inputs
-    """
-    batch_inputs = []
-    for length_id in range(size):
-        batch_inputs.append(np.array([inputs[batch_id][length_id]
-                                      for batch_id in range(batch_size)], dtype=np.int32))
-    return batch_inputs
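+def make_batch(data, buffer_size=10000, batch_size=64, scope="train"):
+    """Build an input function and the hook that initializes its iterator.
+
+    `data` is an (X, y) pair. `buffer_size` is currently unused because the
+    shuffle call below is commented out.
+    """
+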
+    class IteratorInitializerHook(tf.train.SessionRunHook):
+        """Hook to initialise data iterator after Session is created."""
+
+        def __init__(self):
+            super(IteratorInitializerHook, self).__init__()
+            self.iterator_initializer_func = None
+
+        def after_create_session(self, session, coord):
+            """Initialise the iterator after the session has been created."""
+            self.iterator_initializer_func(session)
+
+
+    def get_inputs():
+
+        iterator_initializer_hook = IteratorInitializerHook()
+
+        def train_inputs():
+            with tf.name_scope(scope):
+
+                X, y = data
+
+                # Define placeholders
+                input_placeholder = tf.placeholder(
+                    tf.int32, [None, Config.data.max_seq_length])
+                output_placeholder = tf.placeholder(
+                    tf.int32, [None, Config.data.num_classes])
+
+                # Build dataset iterator
+                dataset = tf.data.Dataset.from_tensor_slices(
+                    (input_placeholder, output_placeholder))
+
+                if scope == "train":
+                    dataset = dataset.repeat(None)  # Infinite iterations
+                else:
+                    dataset = dataset.repeat(1)  # 1 Epoch
+                # dataset = dataset.shuffle(buffer_size=buffer_size)
+                dataset = dataset.batch(batch_size)
+
+                iterator = dataset.make_initializable_iterator()
+                next_X, next_y = iterator.get_next()
+
+                tf.identity(next_X[0], 'input_0')
+                tf.identity(next_y[0], 'target_0')
+
+                # Set runhook to initialize iterator
+                iterator_initializer_hook.iterator_initializer_func = \
+                    lambda sess: sess.run(
+                        iterator.initializer,
+                        feed_dict={input_placeholder: X,
+                                   output_placeholder: y})
+
+                # Return batched (features, labels)
+                return next_X, next_y
+
+        # Return function and hook
+        return train_inputs, iterator_initializer_hook
 
+    return get_inputs()
 
 if __name__ == '__main__':
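Review note: in `make_batch`, the `train` scope calls `repeat(None)` and every other scope calls `repeat(1)`, and in both cases `repeat` runs before `batch`. A pure-Python model of that ordering follows. It is a sketch of the semantics only, not TensorFlow code, and the name `repeat_then_batch` is an illustrative assumption.

```python
import itertools


def repeat_then_batch(examples, batch_size, epochs=None):
    """Yield lists of up to batch_size items; epochs=None repeats forever."""
    if epochs is None:
        stream = itertools.cycle(examples)  # like repeat(None)
    else:
        stream = itertools.chain.from_iterable(
            itertools.repeat(examples, epochs))  # like repeat(epochs)
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return  # stream exhausted (only possible with finite epochs)
        yield batch


data = [1, 2, 3, 4, 5]

# Evaluation-style: one pass, the last batch is smaller.
eval_batches = list(repeat_then_batch(data, batch_size=2, epochs=1))
print(eval_batches)   # [[1, 2], [3, 4], [5]]

# Training-style: endless stream, so take only the first four batches.
train_batches = list(itertools.islice(repeat_then_batch(data, batch_size=2), 4))
print(train_batches)  # [[1, 2], [3, 4], [5, 1], [2, 3]]
```

Because repetition happens before batching, a training batch can straddle an epoch boundary (`[5, 1]` above), while the single-epoch case ends with a short final batch.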