This repository has been archived by the owner on May 24, 2018. It is now read-only.

Data Input

Tianqi Chen edited this page Apr 9, 2014 · 18 revisions

Introduction

This page introduces the data input methods in cxxnet. cxxnet uses data iterators to provide data to the neural network. Iterators perform some preprocessing and generate batches for the neural network.

  • We provide basic iterators for MNIST, CIFAR-10, Image, and Binary Image.
  • To boost performance, we provide a thread buffer for loading.
    • Putting a threadbuffer iterator after the input iterator opens an independent thread that fetches from the input, so learning and data fetching can run in parallel.
    • We recommend using the thread buffer in all cases to avoid an I/O bottleneck.

Declare the iterator in the form

iter = iterator_type
options 1 = 
options 2 = 
...
iter = end
  • The basic iterator types are mnist, cifar, image, and imgbin.
  • To use the thread buffer, declare it in this form:
iter = iterator_type
options 1 = 
options 2 = 
...
iter = threadbuffer
iter = end
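
The two declaration forms above differ only in the extra `iter = threadbuffer` line. As a rough illustration, the following Python sketch renders such a block from a dictionary of options. The helper `render_iterator` and the file paths are hypothetical and are not part of cxxnet; only the `iter = ...` layout and the option names come from this page.

```python
def render_iterator(iter_type, options, threadbuffer=False):
    """Render a cxxnet-style iterator block as text.

    Hypothetical helper for illustration; not part of cxxnet.
    """
    lines = ["iter = %s" % iter_type]
    for key, value in options.items():
        lines.append("%s = %s" % (key, value))
    if threadbuffer:
        # The thread buffer is declared after the input iterator it wraps.
        lines.append("iter = threadbuffer")
    lines.append("iter = end")
    return "\n".join(lines)


# The paths below are placeholders, not files shipped with cxxnet.
mnist_block = render_iterator(
    "mnist",
    {
        "path_img": '"./data/train-images.gz"',
        "path_label": '"./data/train-labels.gz"',
        "input_flat": 1,
        "shuffle": 1,
    },
    threadbuffer=True,
)
print(mnist_block)
```

Generating the block this way makes it easy to keep the training and evaluation iterators consistent, but writing the block by hand in the config file works just as well.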

Iterators

MNIST/CIFAR Preprocessing Options
shuffle = 1
  • shuffle: set to 1 to shuffle the training data.

MNIST Iterator
  • Required fields
path_img = path to gz file of image
path_label = path to gz file of label
input_flat = 1
  • input_flat selects the shape of the loaded data: 1 loads each image in the flat shape 1,1,784, and 0 loads it in the shape 1,28,28.
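
To make the two shapes concrete, here is a small Python sketch, assuming that `input_flat = 1` corresponds to the flat shape and `input_flat = 0` to the 28 by 28 layout. The functions `mnist_shape` and `to_rows` are hypothetical names used only for this illustration and do not exist in cxxnet.

```python
def mnist_shape(input_flat):
    """Return the per-image shape for a given input_flat setting.

    Assumption of this sketch: 1 means flat, 0 means 28x28.
    """
    return (1, 1, 784) if input_flat else (1, 28, 28)


def to_rows(flat_pixels, width=28):
    """Split a flat list of pixels into rows of `width` values."""
    return [flat_pixels[i:i + width] for i in range(0, len(flat_pixels), width)]


flat = list(range(784))   # stand-in for the 784 pixels of one image
rows = to_rows(flat)      # the same pixels arranged as 28 rows of 28

print(mnist_shape(1))
print(mnist_shape(0), len(rows), len(rows[0]))
```

The flat shape is the natural input for a fully connected network, while the 1,28,28 layout keeps the spatial structure that convolution layers need.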

CIFAR Iterator
  • Required fields
path = path to CIFAR file folder
input_flat = 0
test = 1
batch1 = 1
batch2 = 0
batch3 = 1
batch4 = 0
batch5 = 0
  • test, batch1, batch2, batch3, batch4, and batch5 are binary flags that choose which batch files are used.

Image and Image Binary Iterator

There are two ways to load images: the image iterator, which takes a list of images on disk, and the image binary iterator, which reads images from a packed binary file. I/O is usually the bottleneck, and the image binary iterator makes training faster. However, we also provide the image iterator for convenience.

  • Preprocessing options for Image/Image Binary
rand_crop = 1
rand_mirror = 1
divideby = 256
image_mean = "img_mean.bin"
  • rand_crop: set to 1 to randomly crop the training images.
  • rand_mirror: set to 1 to randomly mirror the training data.
  • divideby: normalizes the data by dividing it by the given value.
  • image_mean: normalizes the data by subtracting the mean of all images. The value is the path to the mean image file. If the file doesn't exist, cxxnet will generate one.
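
The normalization and mirroring options can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not cxxnet's implementation: the function names are hypothetical, the order of operations (subtract the mean, then divide) is an assumption, and so is the 50% mirroring probability. Cropping is left out.

```python
import random


def normalize(image, mean_image, divideby=256.0):
    """Subtract the mean image, then divide every pixel by `divideby`.

    `image` and `mean_image` are lists of rows with the same shape.
    The order of the two steps is an assumption of this sketch.
    """
    return [
        [(p - m) / divideby for p, m in zip(row, mean_row)]
        for row, mean_row in zip(image, mean_image)
    ]


def mirror(image):
    """Mirror an image horizontally by reversing every row."""
    return [row[::-1] for row in image]


image = [[0, 128], [256, 64]]
mean_image = [[64, 64], [64, 64]]

result = normalize(image, mean_image, divideby=256.0)
if random.random() < 0.5:   # assumed mirroring probability
    result = mirror(result)
print(result)
```

Dividing by 256 brings 8-bit pixel values into a range of roughly one unit, and subtracting the mean image centers them around zero.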
Image Iterator
  • Required fields
image_list = path to the image list file
image_root = path to the image folder
Image list file

The image_list file is a tab-separated text file with one image per line. The format is

image_index \t label \t file_name

A valid image list file looks like the following (no header line):

1       0       cat.5396.jpg
2       0       cat.11780.jpg
3       1       dog.11254.jpg
4       0       cat.6791.jpg
5       0       cat.7937.jpg
6       1       dog.9329.jpg
  • image_root is the path to the folder that contains the files listed in the image list file.
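
An image list in this format can be produced with a few lines of Python. The sketch below writes `index<TAB>label<TAB>file_name` lines and reads them back. The function names and the file name `train.lst` are hypothetical, and numbering the images from 1 simply follows the example above.

```python
import os
import tempfile


def write_image_list(path, entries):
    """Write (label, file_name) pairs as 'index<TAB>label<TAB>file_name' lines."""
    with open(path, "w") as f:
        for index, (label, file_name) in enumerate(entries, start=1):
            f.write("%d\t%d\t%s\n" % (index, label, file_name))


def read_image_list(path):
    """Parse an image list file into (index, label, file_name) tuples."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            index, label, file_name = line.split("\t")
            records.append((int(index), int(label), file_name))
    return records


# Labels: 0 for cat, 1 for dog, as in the example list above.
entries = [(0, "cat.5396.jpg"), (0, "cat.11780.jpg"), (1, "dog.11254.jpg")]
list_path = os.path.join(tempfile.mkdtemp(), "train.lst")
write_image_list(list_path, entries)
records = read_image_list(list_path)
print(records)
```

Reading the file back is a cheap way to check that every line has exactly three tab-separated fields before starting a long training run.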
Image binary iterator

The image binary iterator aims to reduce the I/O cost of random seeks. It is especially useful when dealing with large amounts of data, as in ImageNet.

  • Required fields
image_list = path to the image list file
image_bin = path to the image binary file
  • The image_list file is described above.
  • To generate the image_bin file, use the im2bin tool in the tools folder.