Part 2: Preparing Data
Now that you’re all set up, it’s time to get your project off the ground. The first step is getting data ready to be processed by the model you will build. Preparing your data and defining your model go hand in hand, so you may have to go back and forth between Parts 2, 3, and 4 before you see how it all fits together.
The preferred, yet least documented, format for storing data is in TFRecord files. TFRecord files (*.tfrecord) store data as records in a binary format.
Let’s see how to take data and write it as TFRecord files.
Writing Data to TFRecords
To convert your data to TFRecords, you will follow these steps:
- Gather all the data (e.g. a list of images and corresponding labels)
- Create a TFRecordWriter
- Create an Example out of Features for each datapoint
- Serialize the Examples
- Write the Examples to your TFRecord
Running Example: to follow along with the example, download this data and unzip it into your data folder.
Imports you will need for this example
###########################################
### Inside data/create_tfrecords.py ###
###########################################
import tensorflow as tf
import glob, imageio, shutil, os
Gathering Data
This is the most variable part of the data conversion pipeline. You may have a small batch of image (png/jpeg) files you can load into memory at once, a large dataset you need to load piece by piece, or you may be generating data (e.g. by interacting with a simulator) and writing to TFRecords dynamically.
Let’s handle the simple case of loading in a single set of jpeg images. The data provided is a subset of the Caltech 101 dataset. It is far too small to train a good deep network; we are using it only to keep the example simple. Download the data, move it to your data directory, and unzip it.
###########################################
### Inside data/create_tfrecords.py ###
###########################################
# Gather file paths to all images
data_dir = 'Caltech50'
object_dirs = glob.glob(data_dir + '/*')
objects = {}
for d in object_dirs:
    # Key each list of image paths by its category (directory) name
    objects[os.path.basename(d)] = glob.glob(d + '/*.jpg')
# Create an integer label for each object category
categories = list(objects.keys())
category_labels = {}
for i in range(len(categories)):
    category_labels[categories[i]] = i
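One thing to watch for: glob returns directories in whatever order the filesystem lists them, so the integer assigned to each category can differ between machines. Below is a minimal sketch of one way to make the labels reproducible, assuming the same one-folder-per-category layout. The helper name `build_label_map` and the demo category names are hypothetical and not part of the original script.

```python
import os
import tempfile


def build_label_map(data_dir):
    """Map each category sub-directory name to an integer label.

    Sorting the names makes the mapping independent of the order in
    which the filesystem happens to list the directories.
    """
    categories = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )
    return {name: label for label, name in enumerate(categories)}


# Demo on a throwaway directory tree (category names are made up).
demo_dir = tempfile.mkdtemp()
for name in ["panda", "gerenuk", "llama"]:
    os.mkdir(os.path.join(demo_dir, name))

label_map = build_label_map(demo_dir)
print(label_map)  # {'gerenuk': 0, 'llama': 1, 'panda': 2}
```

Saving this mapping next to the TFRecord files (for example as a small text file) is one way to keep the label meanings recoverable later.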
That was easy enough. Now we have the file paths for all the images we want to store in our TFRecords.
Creating TFRecord Writer
The TFRecordWriter is what we will use to write each Example once it has been constructed.
# Say we want one of the files to be named 'gerenuk.tfrecord' for one of the categories
tfrecord_filename = 'gerenuk.tfrecord'
# In TensorFlow 2.x the same class is available as tf.io.TFRecordWriter
writer = tf.python_io.TFRecordWriter(tfrecord_filename)
We will be creating one tfrecord file for every category below.
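A writer should be closed when you are done with it so that everything is flushed to disk. A small sketch of doing that with a with-block is shown here. It assumes a recent TensorFlow release where the writer is exposed as tf.io.TFRecordWriter (the same class this tutorial reaches through tf.python_io), and it writes two made-up byte strings to a temporary file, just to show that a record is nothing more than a chunk of bytes.

```python
import os
import tempfile

import tensorflow as tf

# Temporary location for the demo file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')

# The with-block closes the file even if an exception is raised while writing.
with tf.io.TFRecordWriter(path) as writer:
    for payload in [b'first record', b'second record']:
        writer.write(payload)

print(os.path.getsize(path))
```

In the rest of this part the payload will always be a serialized Example rather than a raw byte string.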
Creating an Example
An Example represents one datapoint in our dataset – e.g. an image/label pair for image classification or state/value for reinforcement learning.
Each example is constructed from Features. One Feature is created for each subpiece of the datapoint - i.e. the image is one Feature and the label is another Feature.
The following helper functions create Features for different data types.
###########################################
### Inside data/create_tfrecords.py ###
###########################################
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def float_feature(value):
    # Floats go in the float_list field, not bytes_list
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
To write a single image to our TFRecord file we would do the following:
# Load the image into memory and grab its label
category = list(objects.keys())[0] # choose the first category -- 'gerenuk'
image = imageio.imread(objects[category][0]) # load the first gerenuk image as a numpy array
shape = image.shape # grab the shape so we can add it as meta-data
label = category_labels[category]
# Create a dictionary of this example's features
features = {
    'height' : int64_feature(shape[0]),
    'width' : int64_feature(shape[1]),
    'depth' : int64_feature(shape[2]),  # assumes a 3-D (height, width, channels) array
    'image' : bytes_feature(image.tobytes()),  # tobytes() replaces the deprecated tostring()
    'label' : int64_feature(int(label))
}
# Now we both construct tf.train.Features from our feature dictionary and
# construct a tf.train.Example from those features.
example = tf.train.Example(features=tf.train.Features(feature=features))
# To write it to our TFRecord file, we serialize the example and call write
# from our TFRecordWriter handle.
writer.write(example.SerializeToString())
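Before writing thousands of Examples, it can be worth checking that one of them survives a round trip. The sketch below parses the serialized bytes straight back into an Example and rebuilds the array. It assumes a small synthetic uint8 image in place of a real jpeg; the random image and the label value 3 are made up for illustration.

```python
import numpy as np
import tensorflow as tf


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


# Synthetic stand-in for an image loaded with imageio (height, width, depth).
image = np.random.randint(0, 256, size=(4, 5, 3), dtype=np.uint8)

example = tf.train.Example(features=tf.train.Features(feature={
    'height': int64_feature(image.shape[0]),
    'width': int64_feature(image.shape[1]),
    'depth': int64_feature(image.shape[2]),
    'image': bytes_feature(image.tobytes()),
    'label': int64_feature(3),
}))
serialized = example.SerializeToString()

# Parse the bytes back into an Example and rebuild the array from the
# stored shape. The dtype is not stored, so we must know it was uint8.
parsed = tf.train.Example.FromString(serialized)
feat = parsed.features.feature
shape = tuple(feat[key].int64_list.value[0] for key in ('height', 'width', 'depth'))
restored = np.frombuffer(feat['image'].bytes_list.value[0], dtype=np.uint8).reshape(shape)

print(shape, np.array_equal(restored, image))
```

Note that the raw bytes carry neither shape nor dtype, which is why the height, width, and depth are stored as separate features.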
Warning: be mindful of the data type of your image before converting it to a feature. I sometimes find my numpy data has been converted to float64 when I would much rather store it as uint8! Loading a png/jpeg with imageio.imread returns data as uint8.
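To make the cost of an accidental float64 conversion concrete, here is a small sketch using a synthetic 32x32 RGB array (an assumption for illustration, not part of the running example). It compares the number of bytes that would end up in the record for each data type.

```python
import numpy as np

# Synthetic 32x32 RGB image with pixel values in [0, 255].
image_uint8 = np.zeros((32, 32, 3), dtype=np.uint8)

# Many numpy operations silently promote to float64, e.g. true division.
image_float = image_uint8 / 1.0

print(image_uint8.dtype, len(image_uint8.tobytes()))  # uint8 3072
print(image_float.dtype, len(image_float.tobytes()))  # float64 24576

# Cast back explicitly before serializing if uint8 is what you intend
# to store (this assumes the values are still within [0, 255]).
image_fixed = image_float.astype(np.uint8)
```

The float64 version takes eight times the space, and reading it back as uint8 later would produce garbage rather than an error.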
Creating a Dataset
The specifics of what you need for training/validation/test will determine exactly how you wrap this in a loop and keep calling writer.write(...) on all examples. See the TFRecord template for how I reuse some boilerplate code for this process.
Running Example: to complete the code (in addition to all code blocks above that say “Inside data/create_tfrecords.py”) for our example, add the following. This code creates a separate TFRecord file for each object class with an 80/20 split between training and validation data.
###########################################
### Inside data/create_tfrecords.py ###
###########################################
# Create train/valid directories to store our TFRecords
# (os.makedirs also creates the parent 'tfrecords' directory if needed)
os.makedirs('tfrecords/train', exist_ok=True)
os.makedirs('tfrecords/valid', exist_ok=True)
object_names = list(objects.keys())
# Create a separate TFRecord file for each object category
for o in object_names:
    print(o)
    # Create this object's TFRecord files
    train_writer = tf.python_io.TFRecordWriter('tfrecords/train/' + o + '.tfrecord')
    valid_writer = tf.python_io.TFRecordWriter('tfrecords/valid/' + o + '.tfrecord')
    # Write each image of the object into one of those files
    num_images = len(objects[o])
    for index in range(num_images):
        i = objects[o][index]
        # Let's make 80% train and leave 20% for validation
        if index < num_images * 0.8:
            writer = train_writer
        else:
            writer = valid_writer
        image = imageio.imread(i)
        shape = image.shape
        label = category_labels[o]
        # Create features dict for this image
        features = {
            'height' : int64_feature(shape[0]),
            'width' : int64_feature(shape[1]),
            'depth' : int64_feature(shape[2]),
            'image' : bytes_feature(image.tobytes()),
            'label' : int64_feature(int(label))
        }
        # Create Example out of this image and write it to the TFRecord
        example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(example.SerializeToString())
    # Close both files once every image in this category has been written
    train_writer.close()
    valid_writer.close()
Running Example: the complete create_tfrecords.py file can be found here.
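The index-based split in the running example takes the first 80% of each category's files in whatever order glob returned them. If you would rather shuffle first, and also hold out a test set, something like the following sketch may help. It assumes a fixed random seed is acceptable; `split_paths` and its parameters are hypothetical names, not part of the original script, and the file names are made up.

```python
import random


def split_paths(paths, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle paths reproducibly and split into train/valid/test lists."""
    shuffled = sorted(paths)  # start from a deterministic order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left over
    return train, valid, test


paths = ['image_%04d.jpg' % i for i in range(50)]
train, valid, test = split_paths(paths)
print(len(train), len(valid), len(test))  # 40 5 5
```

Each of the three lists can then be fed through the same write loop with its own writer.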
Advanced Formatting
There are several extensions of this basic process you may want to use when creating your records.
- Splitting into train/validation/test records
- Splitting data into multiple TFRecords to avoid giant files.
- Loading small chunks of your data at a time
- Interleaving data generation (e.g. interacting with the OpenAI Gym simulator) with writing the data to a TFRecord
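Two of these extensions, splitting into multiple files and loading small chunks at a time, mostly come down to bookkeeping around the same write loop. A minimal sketch of that bookkeeping follows, with the actual image loading and writing left out. The helper names `chunked` and `shard_filename` are hypothetical, and the `name-00000-of-00003` pattern is just a common naming convention, not something TFRecord requires.

```python
def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def shard_filename(prefix, shard_index, num_shards):
    """Build a file name such as 'train-00000-of-00003.tfrecord'."""
    return '%s-%05d-of-%05d.tfrecord' % (prefix, shard_index, num_shards)


paths = ['image_%04d.jpg' % i for i in range(10)]  # made-up file names
chunks = list(chunked(paths, 4))
num_shards = len(chunks)

for shard_index, chunk in enumerate(chunks):
    filename = shard_filename('train', shard_index, num_shards)
    # In the real script: open a TFRecordWriter on `filename`, load only
    # the images in `chunk`, write their Examples, then close the writer.
    print(filename, len(chunk))
```

Because only one chunk of images is in memory at a time, the same loop works whether the dataset has ten images or ten million.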
Continue Reading
In Part 3 we will see how to load these TFRecord files and prepare them to be used in a model we will build in Part 4.