class Dataset final
Matcha comes with a seamlessly integrated, modular dataset pipeline system. Pipelines let you load (or generate) data and manipulate it just-in-time, or otherwise tune it to your needs, in a way that is both memory- and time-efficient.
?> NOTE: Use datasets for large amounts of data. A dataset loads from disk only the part that is needed at the moment, which prevents wasting memory.
class Instance final
A single piece of data is called an Instance. Instances behave as a dictionary of tensors with string keys. Why? It's quite common to store more than one piece of information per data entry: for example, we might have labeled images (*image of a cat* - "cat", and so on). For labeled data, the convention is to use the key "x" for the data and "y" for the label.
The list of all keys can be obtained by calling the keys() method:
// we will show this later...
Dataset labeled_images;
// get an instance from the dataset
Instance i = labeled_images.get();
for (auto& key: i.keys()) {
std::cout << key << ": " << i[key].frame() << std::endl;
}
// x: Float[224, 224]
// y: Int[]
In general, there are two types of pipeline components: sources, which produce instances, and relays, which transform instances coming from another dataset.
Let's look at sources first.
Sources can load various files from the disk while using the minimum amount of memory possible.
The following snippet loads the hand-written digits dataset MNIST.
The training set has 60'000 images in total, 28x28 pixels each.
However, since every image is stored as a single CSV row, the images come flattened, i.e. with shape 28 * 28 = 784.
Dataset mnist = load("mnist_train.csv");
std::cout << mnist.size() << std::endl; // 60000
Instance i = mnist.get();
tensor digit = i["x"];
std::cout << digit.dtype() << std::endl; // Float
std::cout << digit.shape() << std::endl; // [784]
image(digit, "digit_original.png");
digit = digit.reshape(28, 28);
std::cout << digit.shape() << std::endl; // [28, 28]
image(digit, "digit_reshaped.png");
Contents of digit_original.png and digit_reshaped.png:
This is kind of unfortunate. We would like a dataset that provides 28x28 images right away. We can easily get one by mapping the original dataset. Map is an example of a relay: it modifies the instances of the underlying dataset in whatever way we want. Let's define a function that does the reshaping:
Instance toSquare(Instance& i) {
i["x"] = i["x"].reshape(28, 28);
return i;
}
Now we can use it to map the original dataset:
mnist = mnist.map(toSquare);
This is the magic of pipelines. The new dataset behaves exactly like the original one, performing all the necessary operations just-in-time, as we need them. Let's modify it further. The original dataset's pixels have values 0-255; we would like them normalized to 0-1. We can simply take the dataset and map it once more, this time using a lambda function:
mnist = mnist.map([](Instance& i) {
i["x"] /= 255;
return i;
});
There are too many source and relay datasets to cover them all here. For more, see sources and relays.
Finally, we would like to get instances out of the dataset. We have already seen one way of doing it:
Instance i = mnist.get();
The method Dataset::get() retrieves the next Instance from the dataset. Calling it repeatedly iterates through the entire dataset:
while (Instance i = mnist.get()) {
std::cout << i << std::endl;
}
Alternatively, we can use a range-based for loop:
for (Instance i: mnist) {
std::cout << i << std::endl;
}
Datasets behave as linear streams. As such, they provide a tell method that returns the current position, and a seek method for changing the current position manually. Note that since get reads the next Instance, it advances the position by one. The reset method is a shorthand for seek(0).
If the current position is greater than or equal to the dataset size, eof returns true; otherwise it returns false.
std::cout << mnist.size() << std::endl; // 60000
mnist.seek(5);
std::cout << mnist.tell() << std::endl; // 5
mnist.seek(3);
std::cout << mnist.tell() << std::endl; // 3
mnist.get();
std::cout << mnist.tell() << std::endl; // 4
std::cout << mnist.eof() << std::endl; // 0
mnist.seek(60000);
std::cout << mnist.tell() << std::endl; // 60000
std::cout << mnist.eof() << std::endl; // 1
mnist.reset();
std::cout << mnist.tell() << std::endl; // 0
std::cout << mnist.eof() << std::endl; // 0
mnist.get();
std::cout << mnist.tell() << std::endl; // 1
std::cout << mnist.eof() << std::endl; // 0
!> Depending on the concrete pipeline components used, jumping frequently through the dataset may be inefficient.
It's recommended to avoid unnecessary jumping and to proceed linearly instead.