Data augmentation

3.3.4. Data augmentation

Deep learning datasets are often very large. When training a deep learning model we usually cannot feed the whole dataset into the model at once, because it may be larger than the machine's RAM. For this reason, deep learning frameworks support training from a generator: the data is not materialised up front but is created piece by piece, in small chunks called batches, as training progresses.

Depending on the data format (text, image, data frame, numpy array, ...), we use different modules to build the training data.

Generator

A generator does not return its result immediately. Calling a generator function only creates a generator object that describes how the values are computed, so no computation cost is paid at that point. In effect the result is deferred: a value is computed only when it is requested through next().

Generators have two advantages:

They do not produce all the data at once, so they use less memory.
Processing can continue as soon as one item is ready instead of waiting for the whole loop to finish, which reduces waiting time.

This is why generators are the usual choice for training deep learning models on large datasets.

def _gen_interest_rate(month):
    yield (1+0.01)**month - 1


periods = [1, 3, 6, 9, 12]
scales = [_gen_interest_rate(month) for month in periods]
print('scales of origin balance: ', scales)

[next(_gen_interest_rate(n)) for n in periods]

scales of origin balance:  [<generator object _gen_interest_rate at 0x106acd5b0>, <generator object _gen_interest_rate at 0x106acd1c0>, <generator object _gen_interest_rate at 0x106acd3f0>, <generator object _gen_interest_rate at 0x106acd7e0>, <generator object _gen_interest_rate at 0x106acd850>]

[0.010000000000000009,
 0.030301000000000133,
 0.061520150601000134,
 0.09368527268436089,
 0.12682503013196977]
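The same lazy behaviour is what makes batching cheap. The following is a minimal sketch, assuming a hypothetical helper `batch_indices` (not part of any library), that yields one batch of sample indices at a time instead of building every batch up front:

```python
def batch_indices(n_samples, batch_size):
    """Lazily yield lists of sample indices, one batch at a time."""
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        # Nothing is computed for a batch until the consumer asks for it.
        yield list(range(start, stop))


gen = batch_indices(n_samples=10, batch_size=4)
first_batch = next(gen)      # only the first batch has been materialised
remaining = list(gen)        # exhaust the rest of the generator
print(first_batch)
print(remaining)
```

The last batch is smaller than `batch_size` because 10 is not a multiple of 4; a training loop has to either accept or drop such a partial batch.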

3.3.4.1. In memory Dataset

The dataset is created up front and held in memory. The In memory Dataset approach suits small datasets that fit in RAM. Training this way is faster than with a Generator Dataset because the data is already prepared and no time is spent waiting for each batch to be created. However, it can easily run out of memory during training when the dataset is large.

from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step
(60000, 28, 28)
(10000, 28, 28)
(60000,)
(10000,)
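Whether a dataset fits in RAM can be estimated from its shape and dtype before loading it. The sketch below uses a hypothetical helper `array_nbytes` and assumes the images are stored as uint8 (as MNIST is) and cast to float32 for training:

```python
import numpy as np


def array_nbytes(shape, dtype):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize


train_shape = (60000, 28, 28)                        # shape printed above for X_train
as_uint8 = array_nbytes(train_shape, np.uint8)       # as loaded from disk
as_float32 = array_nbytes(train_shape, np.float32)   # after casting for training

print(f"uint8:   {as_uint8 / 1e6:.1f} MB")
print(f"float32: {as_float32 / 1e6:.1f} MB")
```

At roughly 47 MB (uint8) or 188 MB (float32), MNIST comfortably fits in memory, which is why the in-memory approach is reasonable here.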

The train and test splits of the mnist dataset are now loaded in memory. Next we create a Dataset from this in-memory data with tf.data.Dataset.from_tensor_slices(), which declares the input data for the training pipeline.

import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
valid_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))

Metal device set to: Apple M1 Pro

We can now fit the model on the tf.data.Dataset built from (X_train, y_train).
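To see what from_tensor_slices() produces, here is a small sketch on toy arrays (the names `X_toy` and `y_toy` are illustrative stand-ins, not part of the original notebook). Each dataset element is one (image, label) pair sliced along the first axis:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for (X_train, y_train): 6 "images" of 4 x 4 pixels and 6 labels.
X_toy = np.arange(6 * 4 * 4, dtype=np.uint8).reshape(6, 4, 4)
y_toy = np.array([0, 1, 0, 1, 0, 1], dtype=np.uint8)

toy_dataset = tf.data.Dataset.from_tensor_slices((X_toy, y_toy))

# Take the first element and count how many elements the dataset holds.
x0, y0 = next(iter(toy_dataset))
n_elements = int(toy_dataset.cardinality().numpy())
print(x0.shape, y0.numpy(), n_elements)
```

The dataset has one element per row of the input arrays; batching is a separate step applied afterwards.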

We can also apply transformations with functions such as Dataset.map() or Dataset.batch() before fitting the model; see the tf.data.Dataset documentation for more. For example, before a batch is passed to training, I standardise it by subtracting the batch mean and dividing by the batch standard deviation.

import numpy as np
import tensorflow as tf

def _normalize(X_batch, y_batch):
    '''
    X_batch: batch of digit images, shape batch_size x 28 x 28
    y_batch: digit labels.
    '''
    X_batch = tf.cast(X_batch, dtype = tf.float32)
    # Zero-pad height and width by 2 pixels on each side to get shape 32 x 32
    pad = tf.constant([[0, 0], [2, 2], [2, 2]])
    X_batch = tf.pad(X_batch, paddings=pad, mode='CONSTANT', constant_values=0)
    X_batch = tf.expand_dims(X_batch, axis=-1)
    # Standardise with the mean and standard deviation of the whole batch
    mean = tf.math.reduce_mean(X_batch)
    std = tf.math.reduce_std(X_batch)
    X_norm = (X_batch-mean)/std
    return X_norm, y_batch

# batch(32): group the (X_train, y_train) samples into batches of size 32.
# map(_normalize): apply _normalize() to each batch (X_batch, y_batch) of size 32.
# The result is the batch-standardised X_batch together with the unchanged y_batch.
train_dataset = train_dataset.batch(32).map(_normalize)
valid_dataset = valid_dataset.batch(32).map(_normalize)
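The effect of the padding and standardisation can be checked without TensorFlow. This is a NumPy sketch of the same steps on a random toy batch (an assumption standing in for real MNIST images), not the pipeline itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy batch standing in for 32 MNIST images with pixel values in [0, 255].
X_batch = rng.integers(0, 256, size=(32, 28, 28)).astype(np.float32)

# Pad 2 zero pixels on each side of height and width: 28 x 28 -> 32 x 32.
X_pad = np.pad(X_batch, pad_width=((0, 0), (2, 2), (2, 2)), mode="constant")
# Add a trailing channel axis, then standardise with the batch-wide statistics.
X_pad = X_pad[..., np.newaxis]
X_norm = (X_pad - X_pad.mean()) / X_pad.std()

print(X_norm.shape)
print(float(X_norm.mean()), float(X_norm.std()))
```

After standardisation the batch has mean close to 0 and standard deviation close to 1, and its shape matches the (32, 32, 1) input expected by the model below.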

train model

from tensorflow.keras.applications import MobileNet
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

base_extractor = MobileNet(input_shape = (32, 32, 1), include_top = False, weights = None)
flat = Flatten()
den = Dense(10, activation='softmax')
model = Sequential([base_extractor, 
                   flat,
                   den])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 mobilenet_1.00_32 (Function  (None, 1, 1, 1024)       3228288   
 al)                                                             
                                                                 
 flatten (Flatten)           (None, 1024)              0         
                                                                 
 dense (Dense)               (None, 10)                10250     
                                                                 
=================================================================
Total params: 3,238,538
Trainable params: 3,216,650
Non-trainable params: 21,888
_________________________________________________________________

fit data batch

model.compile(Adam(), loss='sparse_categorical_crossentropy', metrics = ['accuracy'])
model.fit(train_dataset, validation_data = valid_dataset, epochs = 5)

Epoch 1/5

1875/1875 [==============================] - 68s 33ms/step - loss: 0.5088 - accuracy: 0.8390 - val_loss: 0.2128 - val_accuracy: 0.9398
Epoch 2/5
1875/1875 [==============================] - 63s 33ms/step - loss: 0.1546 - accuracy: 0.9567 - val_loss: 0.1376 - val_accuracy: 0.9616
Epoch 3/5
1875/1875 [==============================] - 62s 33ms/step - loss: 0.1276 - accuracy: 0.9672 - val_loss: 0.1276 - val_accuracy: 0.9689
Epoch 4/5
1875/1875 [==============================] - 63s 33ms/step - loss: 0.1001 - accuracy: 0.9742 - val_loss: 0.1002 - val_accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 63s 33ms/step - loss: 0.0882 - accuracy: 0.9783 - val_loss: 0.0625 - val_accuracy: 0.9851

<keras.callbacks.History at 0x177635e70>

3.3.4.2. Generator Dataset

With a Generator Dataset we define how the data is produced through a generator function. Batches are created as training reaches them, so a large dataset can be loaded batch by batch, with each batch sized to fit in RAM. This makes it possible to train on datasets much larger than RAM by splitting them into batches. Preprocessing steps can also be applied before the data is fed to training. For these reasons this is usually the preferred approach for training deep learning models.
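As an illustration of loading data batch by batch, here is a minimal sketch that streams a CSV file in fixed-size chunks. The helper `csv_batches` and the file `toy.csv` are hypothetical and exist only for this example:

```python
import csv
import os
import tempfile


def csv_batches(path, batch_size):
    """Yield lists of rows from a CSV file, at most `batch_size` rows each."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                      # last, possibly smaller, batch
        yield batch


# Write a small throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([[i, i * i] for i in range(7)])

    batch_sizes = [len(batch) for batch in csv_batches(path, batch_size=3)]

print(batch_sizes)
```

Only one batch of rows is held in memory at a time, regardless of how large the file is.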

3.3.4.2.1. Example

import pandas as pd

hanoi = ['bún chả hà nội', 'chả cá lã vọng hà nội', 'cháo lòng hà nội', 'ô mai sấu hà nội', 'ô mai', 'chả cá', 'cháo lòng']
hochiminh = ['bánh canh sài gòn', 'hủ tiếu nam vang sài gòn', 'hủ tiếu bò sài gòn', 'banh phở sài gòn', 'bánh phở', 'hủ tiếu']
city = ['hanoi'] * len(hanoi) + ['hochiminh'] * len(hochiminh)
corpus = hanoi+hochiminh

data = pd.DataFrame({'city': city, 'food': corpus})
data.sample(5)

	city	food
12	hochiminh	hủ tiếu
10	hochiminh	banh phở sài gòn
1	hanoi	chả cá lã vọng hà nội
8	hochiminh	hủ tiếu nam vang sài gòn
5	hanoi	chả cá

class Voc(object):
    """Builds a word-to-index dictionary for the whole corpus (collection of texts)."""
    def __init__(self, corpus):
        self.corpus = corpus                     # list of all food names
        self.dictionary = {'unk': 0}
        self._initialize_dict(corpus)

    def _add_dict_sentence(self, sentence):
        words = sentence.split(' ')
        for word in words:
            if word not in self.dictionary:
                # A new word gets the next free index
                max_indice = max(self.dictionary.values())
                self.dictionary[word] = (max_indice + 1)

    def _initialize_dict(self, sentences):
        for sentence in sentences:
            self._add_dict_sentence(sentence)

    def _tokenize(self, sentence):
        words = sentence.split(' ')
        # Words that are not in the dictionary fall back to the 'unk' index
        token_seq = [self.dictionary.get(word, self.dictionary['unk']) for word in words]
        return np.array(token_seq)

voc = Voc(corpus = corpus)
voc.dictionary

{'unk': 0,
 'bún': 1,
 'chả': 2,
 'hà': 3,
 'nội': 4,
 'cá': 5,
 'lã': 6,
 'vọng': 7,
 'cháo': 8,
 'lòng': 9,
 'ô': 10,
 'mai': 11,
 'sấu': 12,
 'bánh': 13,
 'canh': 14,
 'sài': 15,
 'gòn': 16,
 'hủ': 17,
 'tiếu': 18,
 'nam': 19,
 'vang': 20,
 'bò': 21,
 'banh': 22,
 'phở': 23}

Next we create a random_generator that picks a random food name from the corpus and tokenizes it.

import tensorflow as tf

cat_indices = {
    'hanoi': 0,
    'hochiminh': 1
}

def generators():
    # Endless generator: each step picks one random row of `data`
    while True:
        i = np.random.choice(data.shape[0])
        sentence = data.iloc[i, 1]
        x_indice = voc._tokenize(sentence)
        label = data.iloc[i, 0]
        y_indice = cat_indices[label]
        yield x_indice, y_indice

# Token ids and labels are integers; an integer dtype such as tf.int32 would also be a natural choice here
random_generator = tf.data.Dataset.from_generator(
    generators,
    output_types = (tf.float16, tf.float16),
    output_shapes = ((None,), ())
)

random_generator

<_FlatMapDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.float16, name=None), TensorSpec(shape=(), dtype=tf.float16, name=None))>
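Recent TensorFlow 2.x releases also accept an `output_signature` argument in from_generator(), which describes dtype and shape together through tf.TensorSpec. The sketch below is an alternative formulation, not the notebook's code: `toy_generator` and its three hard-coded samples are assumptions, and integer dtypes are used for the token ids and labels:

```python
import numpy as np
import tensorflow as tf


def toy_generator():
    """Yield (token_ids, label) pairs with variable-length token sequences."""
    samples = [([1, 2, 3, 4], 0), ([13, 23], 1), ([17, 18, 21], 1)]
    for token_ids, label in samples:
        yield np.array(token_ids, dtype=np.int32), np.int32(label)


toy_dataset = tf.data.Dataset.from_generator(
    toy_generator,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length sequence
        tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar label
    ),
)

lengths = [int(tokens.shape[0]) for tokens, _ in toy_dataset]
print(lengths)
```

Unlike the endless generators() above, this toy generator is finite, so the dataset can be iterated to the end.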

import numpy as np

# shuffle(20) shuffles the stream using a buffer of 20 elements
# The data is then split into batches of size 20
# padded_batch() pads each sequence with 0 up to the length of the longest sentence in the batch
random_generator_batch = random_generator.shuffle(20).padded_batch(20, padded_shapes=([None], []))
sequence_batch, label = next(iter(random_generator_batch))

print(sequence_batch)
print(label)

tf.Tensor(
[[ 1.  2.  3.  4.  0.  0.]
 [ 8.  9.  3.  4.  0.  0.]
 [10. 11.  0.  0.  0.  0.]
 [13. 23.  0.  0.  0.  0.]
 [ 1.  2.  3.  4.  0.  0.]
 [10. 11.  0.  0.  0.  0.]
 [ 8.  9.  3.  4.  0.  0.]
 [10. 11.  0.  0.  0.  0.]
 [13. 23.  0.  0.  0.  0.]
 [17. 18.  0.  0.  0.  0.]
 [ 1.  2.  3.  4.  0.  0.]
 [10. 11. 12.  3.  4.  0.]
 [17. 18.  0.  0.  0.  0.]
 [ 2.  5.  0.  0.  0.  0.]
 [ 2.  5.  0.  0.  0.  0.]
 [10. 11.  0.  0.  0.  0.]
 [ 1.  2.  3.  4.  0.  0.]
 [17. 18. 21. 15. 16.  0.]
 [10. 11.  0.  0.  0.  0.]
 [ 2.  5.  6.  7.  3.  4.]], shape=(20, 6), dtype=float16)
tf.Tensor([0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.], shape=(20,), dtype=float16)
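What padded_batch() does to sequences of different lengths can be reproduced in plain NumPy. This is a sketch with a hypothetical helper `pad_sequences_to_longest`, using token ids from the dictionary above:

```python
import numpy as np


def pad_sequences_to_longest(sequences, pad_value=0):
    """Right-pad every sequence with `pad_value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded


# 'bún chả hà nội', 'ô mai', 'chả cá lã vọng hà nội' as token ids.
batch = pad_sequences_to_longest([[1, 2, 3, 4], [10, 11], [2, 5, 6, 7, 3, 4]])
print(batch)
```

Note that the padding value 0 is the same index as 'unk' in the dictionary, so with this vocabulary a padded position and an unknown word cannot be told apart.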

3.3.4.2.2. ImageGenerator

ImageGenerator (the ImageDataGenerator class in keras) is a data generator built on the keras framework and dedicated to image data.

It is a high-level API, so the syntax is simple and easy to use, but it offers limited room for customising or intervening deeply in the data.

When creating an ImageGenerator we declare the image preprocessing steps to apply before training. I will not go into these preprocessing techniques in detail; interested readers can refer to the ImageDataGenerator documentation.

import glob2
root_folder = 'Datasets/Dog-Cat-Classifier/Data/Train_Data/'

image_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale = 1./255, 
    rotation_range = 20,
    horizontal_flip = True
)


images, labels = next(image_gen.flow_from_directory(root_folder))

Found 1399 images belonging to 2 classes.

images.shape

(32, 256, 256, 3)

flow_from_directory() reads the images under root_folder and returns the transformed image arrays together with their labels. The directory tree under root_folder looks like this:

| root-folder
----| sub-folder-class-1
----| sub-folder-class-2
----| ...
----| sub-folder-class-C

Each sub-folder-class-i contains all the images belonging to one class. flow_from_directory() automatically determines which files are images and loads them for training. Here root_folder has 2 sub-folders corresponding to the 2 classes dog and cat.
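The folder-to-label convention can be sketched with the standard library alone. The helper `list_images_with_labels` below is hypothetical and only roughly mimics the idea (one class per sub-folder, classes indexed in alphabetical order, non-image files skipped); it is not how keras implements it:

```python
import os
import tempfile
from pathlib import Path


def list_images_with_labels(root, extensions=(".jpg", ".jpeg", ".png")):
    """Map each image under root/<class>/ to an integer class index."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_index = {name: i for i, name in enumerate(class_names)}
    samples = []
    for name in class_names:
        for path in sorted((root / name).iterdir()):
            if path.suffix.lower() in extensions:    # skip non-image files
                samples.append((path.name, class_to_index[name]))
    return class_to_index, samples


# Build a tiny throwaway tree: dog/ has one image and one text file, cat/ has one image.
with tempfile.TemporaryDirectory() as tmp_dir:
    for class_name, files in {"dog": ["d1.jpg", "notes.txt"], "cat": ["c1.png"]}.items():
        os.makedirs(os.path.join(tmp_dir, class_name))
        for file_name in files:
            Path(tmp_dir, class_name, file_name).touch()
    class_to_index, samples = list_images_with_labels(tmp_dir)

print(class_to_index)
print(samples)
```

With alphabetical ordering, cat receives index 0 and dog index 1, so it is worth checking the class-to-index mapping before interpreting predictions.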

# Next we create a tf.data.Dataset from the generator with from_generator().
# The dtypes of the images and labels are declared with output_types and their shapes with output_shapes.
# Each element is a batch of up to 32 images of size 256 x 256 with the matching one-hot labels
# (flow_from_directory() defaults to class_mode='categorical', so the labels have one column per class;
# the batch dimension is None because the last batch of an epoch can hold fewer than 32 images).
image_gen_dataset = tf.data.Dataset.from_generator(
    lambda: image_gen.flow_from_directory(root_folder),
    output_types=(tf.float32, tf.float32),
    output_shapes=([None,256,256,3], [None,2])
)

3.3.4.2.3. Customize ImageGenerator

Suppose you have an image dataset in which the images have different sizes, and you also want finer control over each image before training, such as denoising with a Gaussian blur filter, rotating, cropping or zooming. The built-in image preprocessing of ImageGenerator limits you to the transformations it supports. A high-level framework is convenient, but it is hard to intervene deeply. To gain that control we need to customise the ImageGenerator a little.
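As an example of the kind of transformation one might plug into a custom generator, here is a NumPy-only sketch of a random horizontal flip followed by a random crop. The function `random_flip_and_crop` is hypothetical and the input is a random toy image:

```python
import numpy as np


def random_flip_and_crop(image, crop_size, rng):
    """Randomly flip an H x W x C image left-right, then take a random crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                 # horizontal flip
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    # rng.integers excludes the upper bound, so these are all valid offsets.
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]


rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # toy image
augmented = random_flip_and_crop(image, crop_size=(224, 224), rng=rng)
print(augmented.shape)
```

A function like this could be called on each image inside the batch-building method of a custom generator, before the image is written into the batch array.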

!pip install opencv-python

Collecting opencv-python
  Downloading opencv_python-4.7.0.72-cp37-abi3-macosx_11_0_arm64.whl (32.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32.6/32.6 MB 9.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.21.2 in /Users/datkhong/miniconda3/lib/python3.10/site-packages (from opencv-python) (1.24.2)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.7.0.72

import os
import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical
import cv2
import glob2

class DataGenerator(Sequence):
    'Generates data for Keras'
    def __init__(self,
                 all_filenames, 
                 labels, 
                 batch_size, 
                 index2class,
                 input_dim,
                 n_channels,
                 n_classes=2, 
                 shuffle=True):
        '''
        all_filenames: list of all filenames
        labels: labels of all files
        batch_size: size of one batch
        index2class: index of each class (dict: class name -> index)
        input_dim: (width, height) of the input image
        n_channels: number of image channels
        n_classes: number of classes
        shuffle: whether to shuffle the data after each epoch
        '''
        self.all_filenames = all_filenames
        self.labels = labels
        self.batch_size = batch_size
        self.index2class = index2class
        self.input_dim = input_dim
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        '''
        Number of steps in one epoch.
        
        return:
          Number of batches per epoch
        '''
        return int(np.floor(len(self.all_filenames) / self.batch_size))

    def __getitem__(self, index):
        '''
        During training we need to access each batch of the dataset.
        __getitem__() builds the batch at the position given by index.
        
        params:
          index: index of the batch
        return:
          X, y for the batch at position index
        '''
        # Sample indexes of the batch at position index
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Filenames in this batch
        all_filenames_temp = [self.all_filenames[k] for k in indexes]

        # Generate the data
        X, y = self.__data_generation(all_filenames_temp)

        return X, y

    def on_epoch_end(self):
        '''
        Keras calls this method automatically at the end of every training epoch;
        it is also called once in __init__() so that the indexes exist before the first epoch.
        Here we define what happens between epochs, for example:
        - Whether to shuffle the data.
        - Rebalancing the class ratio before fitting the model, etc.
        '''
        self.indexes = np.arange(len(self.all_filenames))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, all_filenames_temp):
        '''
        Called from __getitem__(). __data_generation() transforms the data
        directly and decides what is returned to the caller.
        Image preprocessing can be done here.
        
        params:
          all_filenames_temp: list of the filenames in one batch
        return:
          The values (X, y) of one batch.
        '''
        X = np.empty((self.batch_size, *self.input_dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)

        # Fill in the data
        for i, fn in enumerate(all_filenames_temp):
            # Read the image file and resize it to the common input size
            img = cv2.imread(fn)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, self.input_dim)
            # The first 3 characters of the file name give the class ('dog' or 'cat')
            label = os.path.basename(fn)
            label = self.index2class[label[:3]]
    
            X[i,] = img

            # Store the class
            y[i] = label
        return X, y

import os
dict_labels = {
    'dog': 0,
    'cat': 1
}
root = r"/Users/datkhong/Library/CloudStorage/GoogleDrive-k55.1613310017@ftu.edu.vn/My Drive/GitCode/My_learning/1. DA - DS/3. Learning/2_Notebooks/5_Deep_learning/"
root_folder = root + r'Datasets/Dog-Cat-Classifier/Data/Train_Data/*/*'
fns = glob2.glob(root_folder)
print(len(fns))

image_generator = DataGenerator(
    all_filenames = fns,
    labels = None,
    batch_size = 32,
    index2class = dict_labels,
    input_dim = (224, 224),
    n_channels = 3,
    n_classes = 2,
    shuffle = True
)

X, y = image_generator.__getitem__(1)

print(X.shape)
print(y.shape)

1399
(32, 224, 224, 3)
(32,)

As we can see, at each training step the model takes a batch of size 32. Although our images have different sizes, they have all been resized to a common size of width x height = 224 x 224.
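The number of steps per epoch follows directly from __len__(). A short arithmetic sketch with the figures printed above (1399 files, batch size 32):

```python
import math

n_images = 1399      # number of files found above
batch_size = 32

steps_floor = n_images // batch_size             # what __len__() above returns
dropped = n_images - steps_floor * batch_size    # images skipped in each epoch
steps_ceil = math.ceil(n_images / batch_size)    # alternative: keep a partial last batch

print(steps_floor, dropped, steps_ceil)
```

With floor division there are 43 steps per epoch and 23 images are left out each time; because the indexes are reshuffled between epochs, a different set of images is skipped on each pass. Switching to the ceiling would also require __data_generation() to handle a batch smaller than batch_size.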

from tensorflow.keras.applications import MobileNet
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

base_extractor = MobileNet(input_shape = (224, 224, 3), include_top = False, weights = 'imagenet')
flat = Flatten()
den = Dense(1, activation='sigmoid')
model = Sequential([base_extractor, flat, den])
model.summary()

# we only need to pass the generator in place of the training data in fit()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics = ['accuracy'])
model.fit(image_generator, epochs = 5)

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 mobilenet_1.00_224 (Functio  (None, 7, 7, 1024)       3228864   
 nal)                                                            
                                                                 
 flatten_12 (Flatten)        (None, 50176)             0         
                                                                 
 dense_12 (Dense)            (None, 1)                 50177     
                                                                 
=================================================================
Total params: 3,279,041
Trainable params: 3,257,153
Non-trainable params: 21,888
_________________________________________________________________
Epoch 1/5
43/43 [==============================] - 11s 167ms/step - loss: 0.8531 - accuracy: 0.9041
Epoch 2/5
43/43 [==============================] - 7s 164ms/step - loss: 0.3894 - accuracy: 0.9448
Epoch 3/5
43/43 [==============================] - 7s 165ms/step - loss: 0.1809 - accuracy: 0.9688
Epoch 4/5
43/43 [==============================] - 7s 165ms/step - loss: 0.1497 - accuracy: 0.9709
Epoch 5/5
43/43 [==============================] - 7s 163ms/step - loss: 0.1265 - accuracy: 0.9724

<keras.callbacks.History at 0x306e78f10>