Image Classification
In this tutorial, we are going to build a neural network that can classify handwritten digits (0-9)! Does that sound exciting? Probably not in this problem domain. However, the principles we apply to handwritten digit recognition carry over to other visual recognition challenges. Would you like to build a 1000-class image classifier? Are you developing an automated vehicle? Do you want to replicate the human visual system? Read on.
Disclaimer: the latter two objectives are far more complicated than this tutorial makes vision seem. It would be an oversimplification to treat human vision as nothing more than convolution.
Notice and Copyright
This tutorial was written to complete the Quiz 14 requirement of Data Mining:
Complete the MNIST Classifier shown in class and submit the code+output screenshot.
Change the network to contain 4 convolution layers with 6, 32, 64, 16 layers, and 3 fully connected layers with 256, 64, 10 nodes in each layer respectively.
Use sigmoid activation in all layers except the output layer.
And later extended for the Assignment 1 requirement of the same class:
Your goal is building CIFAR-10 image classifier.
All comments and code were written from memory. No papers, books, Google, stack overflow, or Internet unless noted.
Copyright © Jacob Valdez 2021. Released under MIT License.
Getting Started
As you start to explore GitHub, you'll observe a few common nicknames that we give our packages. I'm just going to import my default go-tos for now:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as display
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as tfkl
import tensorflow.keras.datasets as datasets
The Data
Let's load the mnist dataset and observe a few elements.
(X_train, Y_train), (X_test, Y_test) = datasets.mnist.load_data()
for i in range(4):
plt.imshow(X_train[i])
display.display(Y_train[i])
plt.show()
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11501568/11490434 [==============================] - 0s 0us/step
5
0
4
1
(60000, 28, 28) (60000,) (10000, 28, 28) (10000,)
Notice that each number has an image (stored in X_train/X_test) and a label (stored in Y_train/Y_test). Each image is 28 by 28 pixels, and there are 60000 training examples and 10000 test examples. Note that this dataset is supplied as integers, so I'm going to convert it to a floating point representation for our neural network:
print('before', X_train.dtype)
X_train, X_test = X_train/255., X_test/255.
print('after', X_train.dtype)
before uint8
after float64
The Classifier
We're just going to build a plain-old convolutional neural network. The idea behind convolutions is that not every part of an image carries information pertaining to every other part. As we analyze a scene, we can often decompose the visual relationships into a spatially segmented hierarchy. Convolutional neural networks carry this inductive bias by performing a miniature perceptron operation at every receptive-field location in an image. Unless commented below, we'll use Keras's default implementations to achieve this:
model = keras.Sequential([
tfkl.Input(shape=(28, 28)),
tfkl.Reshape(target_shape=(28, 28, 1)), # give each pixel a 1 dimensional channel
tfkl.Conv2D(filters=6, kernel_size=(3,3), activation='sigmoid'),
tfkl.Conv2D(filters=32, kernel_size=(3,3), activation='sigmoid'),
tfkl.Conv2D(filters=64, kernel_size=(3,3), activation='sigmoid'),
tfkl.Conv2D(filters=16, kernel_size=(3,3), activation='sigmoid'),
tfkl.GlobalMaxPooling2D(), # this layer will take the highest value features over all pixels for each of the 16 filters
tfkl.Dense(256, activation='sigmoid'),
tfkl.Dense(64, activation='sigmoid'),
tfkl.Dense(10),
])
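One way to see the "miniature perceptron" idea concretely is to count parameters: each Conv2D filter reuses the same small kernel at every spatial location instead of learning separate weights per pixel. Here's a quick back-of-the-envelope sketch; the result matches the first Conv2D row of the model summary shown later.
# Sketch: parameter count of the first Conv2D layer above.
# A 3x3 kernel over a 1-channel input with 6 filters has
# 3*3*1*6 weights plus 6 biases, shared across every receptive-field location.
kernel_h, kernel_w, in_channels, filters = 3, 3, 1, 6
conv_params = kernel_h * kernel_w * in_channels * filters + filters
print(conv_params)  # 60 -- compare with the first Conv2D row of model.summary()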
Training
Next, we're going to train our classifier. Since the data is supplied with integer labels but our model outputs scores over 10 classes, we cannot directly compare the two without either

- converting the Y_train and Y_test integer labels into one-hot encodings, or
- using a sparse categorical loss function.
I select the latter option for computational and information-theoretic reasons. Cross entropy $H(p,q)$ represents the expected number of bits needed to encode samples from $p$ using a code optimized for $q$. Formally, $$H(p,q)=E_{x \sim p(x)}[-\log{q(x)}]$$ This is ideal when our model serves as the approximating distribution $q(x,y)$ and the dataset defines the data distribution $p(x,y)$. Our loss function will then be the sparse categorical cross entropy between our model's estimates and the dataset labels. Keras provides a high-level interface to implement this in the model.compile function:
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
reshape (Reshape)            (None, 28, 28, 1)         0
_________________________________________________________________
conv2d (Conv2D)              (None, 26, 26, 6)         60
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 32)        1760
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 22, 22, 64)        18496
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 20, 20, 16)        9232
_________________________________________________________________
global_max_pooling2d (Global (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 256)               4352
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448
_________________________________________________________________
dense_2 (Dense)              (None, 10)                650
=================================================================
Total params: 50,998
Trainable params: 50,998
Non-trainable params: 0
_________________________________________________________________
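As a quick sanity check on the loss choice, the sparse variant computed on integer labels should agree with ordinary categorical cross entropy computed on one-hot labels. Here's a minimal sketch; the logits are random stand-ins for model outputs:
y_true_int = tf.constant([3, 7])                  # integer labels
y_true_onehot = tf.one_hot(y_true_int, depth=10)  # same labels, one-hot encoded
logits = tf.random.normal((2, 10))                # stand-in model outputs (logits)
sparse_ce = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dense_ce = keras.losses.CategoricalCrossentropy(from_logits=True)
print(sparse_ce(y_true_int, logits).numpy())      # these two values match...
print(dense_ce(y_true_onehot, logits).numpy())    # ...without materializing one-hot labels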
Now it's time to actually train the model. Let's supply our training and testing data and see how training progresses:
history = model.fit(
x=X_train,
y=Y_train,
batch_size=64,
epochs=10,
verbose=2,
validation_data=(X_test, Y_test),
validation_batch_size=64,
)
Epoch 1/10
938/938 - 22s - loss: 2.3095 - val_loss: 2.3083
Epoch 2/10
938/938 - 5s - loss: 2.0046 - val_loss: 1.8036
Epoch 3/10
938/938 - 5s - loss: 1.5575 - val_loss: 1.3240
Epoch 4/10
938/938 - 5s - loss: 1.0506 - val_loss: 0.8737
Epoch 5/10
938/938 - 5s - loss: 0.6946 - val_loss: 0.5424
Epoch 6/10
938/938 - 5s - loss: 0.5413 - val_loss: 0.5187
Epoch 7/10
938/938 - 5s - loss: 0.4601 - val_loss: 0.4580
Epoch 8/10
938/938 - 5s - loss: 0.3984 - val_loss: 0.3420
Epoch 9/10
938/938 - 5s - loss: 0.3344 - val_loss: 0.2832
Epoch 10/10
938/938 - 5s - loss: 0.2850 - val_loss: 0.2613
What's happening? The loss is improving, but painfully slowly.
Why can't we just plug and chug whatever data we want into our model? Consider two reasons:
- There is no globally optimal universal approximator, and specialized models such as this CNN may not have sufficient inductive priors to estimate their data-generating distribution.
- Sigmoid-type activation functions saturate relatively easily. When an input is far into the positive or negative extreme, its gradient is effectively zero. During backpropagation, gradients barely make it past the top layer and only slowly work their way down into the model; the sketch below illustrates this. (See the paper that introduced batch norm and The Principles of Deep Learning Theory for a longer discussion of these points.)
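Here's a minimal sketch of that saturation on a handful of test inputs: sigmoid gradients collapse toward zero for large-magnitude inputs, while relu passes gradients through unchanged for positive inputs (printed values are approximate):
x = tf.constant([-10., -1., 0., 1., 10.])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)                   # treat x as a trainable input
    y_sigmoid = tf.nn.sigmoid(x)
    y_relu = tf.nn.relu(x)
print(tape.gradient(y_sigmoid, x).numpy())  # ~[0.00005, 0.20, 0.25, 0.20, 0.00005]
print(tape.gradient(y_relu, x).numpy())     # [0., 0., 0., 1., 1.]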
We can solve this problem by changing our activation function to something that is still nonlinear but lets gradients flow more freely through the layers. My go-to activation function is the rectified linear unit, relu:
model = keras.Sequential([
tfkl.Input(shape=(28, 28)),
tfkl.Reshape(target_shape=(28, 28, 1)), # give each pixel a 1 dimensional channel
tfkl.Conv2D(filters=6, kernel_size=(3,3), activation='relu'),
tfkl.Conv2D(filters=32, kernel_size=(3,3), activation='relu'),
tfkl.Conv2D(filters=64, kernel_size=(3,3), activation='relu'),
tfkl.Conv2D(filters=16, kernel_size=(3,3), activation='relu'),
tfkl.GlobalMaxPooling2D(), # this layer will take the highest value features over all pixels for each of the 16 filters
tfkl.Dense(256, activation='relu'),
tfkl.Dense(64, activation='relu'),
tfkl.Dense(10),
])
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
history = model.fit(
x=X_train,
y=Y_train,
batch_size=64,
epochs=10,
verbose=2,
validation_data=(X_test, Y_test),
validation_batch_size=64,
)
Epoch 1/10
938/938 - 6s - loss: 0.6999 - val_loss: 0.3878
Epoch 2/10
938/938 - 5s - loss: 0.2282 - val_loss: 0.1778
Epoch 3/10
938/938 - 5s - loss: 0.1500 - val_loss: 0.1543
Epoch 4/10
938/938 - 5s - loss: 0.1216 - val_loss: 0.1268
Epoch 5/10
938/938 - 5s - loss: 0.1027 - val_loss: 0.0911
Epoch 6/10
938/938 - 5s - loss: 0.0929 - val_loss: 0.0902
Epoch 7/10
938/938 - 5s - loss: 0.0808 - val_loss: 0.0770
Epoch 8/10
938/938 - 5s - loss: 0.0733 - val_loss: 0.0844
Epoch 9/10
938/938 - 5s - loss: 0.0680 - val_loss: 0.0790
Epoch 10/10
938/938 - 5s - loss: 0.0632 - val_loss: 0.0964
What a significant change. relu definitely performed better in the first 10 epochs than sigmoid. Feel free to experiment with this model yourself.
# your code here (download this notebook at https://raw.githubusercontent.com/JacobFV/jacobfv.github.io/source/notebooks/MNIST_Classifier.ipynb)
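For example, one quick experiment is to check classification accuracy on the held-out test set, since we only tracked the loss above. A minimal sketch, assuming the relu model trained above is still in scope:
logits = model.predict(X_test, batch_size=64)   # raw class scores, shape (10000, 10)
predictions = tf.argmax(logits, axis=-1)        # most likely digit for each image
accuracy = tf.reduce_mean(tf.cast(predictions == tf.cast(Y_test, tf.int64), tf.float32))
print('test accuracy:', float(accuracy))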
How to Overfit Your Dev Set
I hope you've enjoyed learning about machine learning by tweaking the hyperparameters of your model. Likely you realize at this point that we could tweak hyperparameters forever. Why not let the machine learn machine learning instead? ray-tune is a powerful tool we can use to find the optimal hyperparameters for a model. Per its official docs, ray.tune frames its optimization problem as a run -- report metric -- optimize iteration loop. To give you the idea, here's their quick start code:
from ray import tune
def objective(step, alpha, beta):
return (0.1 + alpha * step / 100)**(-1) + beta * 0.1
def training_function(config):
# Hyperparameters
alpha, beta = config["alpha"], config["beta"]
for step in range(10):
# Iterative training function - can be any arbitrary training procedure.
intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
tune.report(mean_loss=intermediate_score)
analysis = tune.run(
training_function,
config={
"alpha": tune.grid_search([0.001, 0.01, 0.1]),
"beta": tune.choice([1, 2, 3])
})
print("Best config: ", analysis.get_best_config(
metric="mean_loss", mode="min"))
# Get a dataframe for analyzing trial results.
df = analysis.results_df
Let's make an isomorphic case with our MNIST classifier: we'll have a triple optimization loop. On the inside, SGD, Adam, RMSProp, or another first-order optimizer will backpropagate gradients into the trainable parameters. After 10 epochs, a hyperparameter optimizer will tune our choice of activation function, hidden convolution and dense layers, hidden depth, loss function, and inner optimizer. Finally, we'll be the slow outer optimizer, making changes to the primary and secondary optimization loops when needed. Let's start by defining our meta-objective:
# on a separate console run:
# pip install -q ray[tune]
# ray start --head --num-cpus 2 --num-gpus 1
import ray
import ray.tune as tune
ray.init(address='auto', _redis_password='5241590000000000')
2021-11-01 13:39:23,542 INFO worker.py:827 -- Connecting to existing Ray cluster at address: 10.138.0.10:6379
{'node_ip_address': '10.138.0.10', 'raylet_ip_address': '10.138.0.10', 'redis_address': '10.138.0.10:6379', 'object_store_address': '/tmp/ray/session_2021-11-01_13-39-17_184816_6838/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-11-01_13-39-17_184816_6838/sockets/raylet', 'webui_url': None, 'session_dir': '/tmp/ray/session_2021-11-01_13-39-17_184816_6838', 'metrics_export_port': 49266, 'node_id': '06fd01adf83b504b561996944ea682fc4135d39c2ebd6199b713c524'}
Iteration 1
def meta_loss(config):
# load dataset
(X_train, Y_train), (X_test, Y_test) = datasets.mnist.load_data()
X_train, X_test = X_train / 255., X_test / 255.
# number of units in each layer (if applicable)
N1 = config['N1']
N2 = config['N2']
N3 = config['N3']
N4 = config['N4']
N5 = config['N5']
N6 = config['N6']
N7 = config['N7']
# layer type: conv2d, dense, maxpooling2d, flatten, dropout, none
# exactly one flatten layer is allowed and conv2d must be placed before flatten
# errors are indicated by massive negative losses
L1 = config['L1'].lower()
L2 = config['L2'].lower()
L3 = config['L3'].lower()
L4 = config['L4'].lower()
L5 = config['L5'].lower()
L6 = config['L6'].lower()
L7 = config['L7'].lower()
conv_activation = config['conv_activation'].lower() # 'sigmoid', 'relu', 'tanh', 'elu', 'selu', 'softplus', 'softsign'
dense_activation = config['dense_activation'].lower() # 'sigmoid', 'relu', 'tanh', 'elu', 'selu', 'softplus', 'softsign'
initial_learning_rate = 10 ** config['initial_learning_rate_exp'] # -4.0 to -1.0
learning_rate_rate = 10 ** config['learning_rate_rate'] # 0.0 to 1.0
optimizer_name = config['optimizer_name'].lower() # 'adam', 'sgd', 'rmsprop', 'adagrad', 'adadelta', 'adamax', or 'nadam'
batch_size = config['batch_size'] # 4 to 1024, integers only
loss_name = config['loss_name'].lower() # 'mse', 'mae', 'mape', 'categorical_crossentropy', or 'sparse_categorical_crossentropy'
# activation functions
activations = {
'sigmoid': tf.nn.sigmoid,
'relu': tf.nn.relu,
'tanh': tf.nn.tanh,
'elu': tf.nn.elu,
'selu': tf.nn.selu,
'softplus': tf.nn.softplus,
'softsign': tf.nn.softsign,
}
conv_activation = activations[conv_activation]
dense_activation = activations[dense_activation]
# make the loss function
losses = {
'mse': keras.losses.MeanSquaredError(),
'mae': keras.losses.MeanAbsoluteError(),
'mape': keras.losses.MeanAbsolutePercentageError(),
'categorical_crossentropy': keras.losses.CategoricalCrossentropy(),
'sparse_categorical_crossentropy': keras.losses.SparseCategoricalCrossentropy(from_logits=False),
}
loss = losses[loss_name]
# convert labels to one-hot vectors for dense loss penalties
    if loss_name != 'sparse_categorical_crossentropy':
Y_train = tf.one_hot(Y_train, depth=10)
Y_test = tf.one_hot(Y_test, depth=10)
# optimizer and learning rate
    optimizers = {
        'sgd': keras.optimizers.SGD,
        'rmsprop': keras.optimizers.RMSprop,
        'adagrad': keras.optimizers.Adagrad,
        'adadelta': keras.optimizers.Adadelta,
        'adam': keras.optimizers.Adam,
        'adamax': keras.optimizers.Adamax,
        'nadam': keras.optimizers.Nadam,
    }
    optimizer = optimizers[optimizer_name](initial_learning_rate)
learning_rate_scheduler = keras.callbacks.LearningRateScheduler(
lambda epoch, _: initial_learning_rate * (learning_rate_rate ** epoch))
# build model
model = keras.Sequential([
tfkl.Input(shape=(28, 28)),
tfkl.Reshape(target_shape=(28, 28, 1)) # give each pixel a 1 dimensional channel
])
flattened = False
for L, N in zip([L1, L2, L3, L4, L5, L6, L7], [N1, N2, N3, N4, N5, N6, N7]):
if L == 'conv2d':
if not flattened:
model.add(tfkl.Conv2D(filters=N, kernel_size=(3,3),
activation=conv_activation))
elif L == 'maxpooling2d':
if not flattened:
model.add(tfkl.MaxPooling2D(pool_size=(2,2)))
elif L == 'flatten':
if not flattened:
model.add(tfkl.Flatten())
flattened = True
elif L == 'dropout':
model.add(tfkl.Dropout(rate=0.1))
elif L == 'dense':
if flattened:
model.add(tfkl.Dense(N, activation=dense_activation))
elif L == 'none': # no more hidden layers
break
else:
raise ValueError(f'unknown layer type {L}')
model.add(tfkl.Dense(10, activation='softmax')) # softmax activation is used for classification
model.compile(loss=loss, optimizer=optimizer)
# train model
history = model.fit(
x=X_train,
y=Y_train,
batch_size=batch_size,
epochs=10,
verbose=2,
callbacks=[learning_rate_scheduler],
validation_data=(X_test, Y_test),
validation_batch_size=64,
)
# report validation loss
final_val_loss = history.history['val_loss'][-1]
tune.report(validation_loss=final_val_loss)
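To make the two learning-rate hyperparameters concrete, here's a small worked sketch of the schedule the callback above produces for one hypothetical config draw (initial_learning_rate_exp = -3.0 and learning_rate_rate = 0.3 before the 10 ** transform):
# Sketch: per-epoch learning rate for one hypothetical config draw.
initial_learning_rate = 10 ** -3.0   # initial_learning_rate_exp drawn from [-4.0, -1.0]
learning_rate_rate = 10 ** 0.3       # learning_rate_rate drawn from [0.0, 1.0]
for epoch in range(10):
    # under this parametrization the rate is multiplied by ~2 each epoch
    print(epoch, initial_learning_rate * (learning_rate_rate ** epoch))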
Now, just looking at the hyperparameter space we've defined, you can see why this is overkill for MNIST. Each run of meta_loss performs a full 10 epochs of the optimization loop beneath it. To meet these computation demands, I'm running this notebook on a deep-learning-optimized Google Cloud VM (n1-highmem-2 with an nvidia-tesla-k80). Learn how you can do this on your own for AWS or GCP from my previous notebook.
Without further hesitation (the assignment due date is approaching), let's start tuning!
analysis = tune.run(
meta_loss,
resources_per_trial={'gpu': 1},
config={
'N1': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N2': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N3': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N4': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N5': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N6': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'N7': tune.grid_search([8, 10, 12, 16, 20, 32, 64, 96, 128, 192, 256]),
'L1': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L2': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L3': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L4': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L5': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L6': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'L7': tune.grid_search(['conv2d', 'dense', 'maxpooling2d', 'flatten', 'dropout', 'none']),
'dense_activation': tune.grid_search(['sigmoid', 'relu', 'tanh', 'elu', 'selu', 'softplus', 'softsign']),
'conv_activation': tune.grid_search(['sigmoid', 'relu', 'tanh', 'elu', 'selu', 'softplus', 'softsign']),
'initial_learning_rate_exp': tune.uniform(-4.0, -1.0),
'learning_rate_rate': tune.uniform(0.0, 1.0),
'optimizer_name': tune.choice(['adam', 'sgd', 'rmsprop', 'adagrad', 'adadelta', 'adamax', 'nadam']),
'batch_size': tune.choice([4, 8, 16, 24, 32, 48, 64, 128]),
'loss_name': tune.choice(['mse', 'mae', 'mape', 'categorical_crossentropy', 'sparse_categorical_crossentropy']),
})
print("Best config: ", analysis.get_best_config(
    metric="validation_loss", mode="min"))
# Get a dataframe for analyzing trial results.
df = analysis.results_df
2021-11-01 12:24:01,660 WARNING function_runner.py:559 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`. /opt/conda/lib/python3.7/site-packages/ray/tune/suggest/basic_variant.py:289: UserWarning: The number of pre-generated samples (267302874351744) exceeds the serialization threshold (1000000). Resume ability is disabled. To fix this, reduce the number of dimensions/size of the provided grid search. f"The number of pre-generated samples ({grid_vals}) "
Memory usage on this node: 1.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 16/267302874351744 (16 PENDING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00000 | PENDING | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | PENDING | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | PENDING | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | PENDING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad |
(ImplicitFunc pid=1145) Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
16384/11490434 [..............................] - ETA: 0s
11493376/11490434 [==============================] - 0s 0us/step
11501568/11490434 [==============================] - 0s 0us/step
Memory usage on this node: 2.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 17/267302874351744 (16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00000 | RUNNING | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | PENDING | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | PENDING | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | PENDING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd |
(pid=1145) 2021-11-01 12:24:08.783944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:08.941310: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:08.942175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:08.944842: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=1145) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. (pid=1145) 2021-11-01 12:24:08.945198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:08.946028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:08.946751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:11.377990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:11.378975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:11.379808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1145) 2021-11-01 12:24:11,436 ERROR function_runner.py:266 -- Runner Thread raised error. 
(pid=1145) Traceback (most recent call last): (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run (pid=1145) self._entrypoint() (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint (pid=1145) self._status_reporter.get_checkpoint()) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 449, in _resume_span (pid=1145) return method(self, *_args, **_kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func (pid=1145) output = fn() (pid=1145) File "/tmp/ipykernel_886/1119233880.py", line 61, in meta_loss (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper (pid=1145) return target(*args, **kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4338, in one_hot (pid=1145) on_value = ops.convert_to_tensor(1, dtype, name="on_value") (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped (pid=1145) return func(*args, **kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor (pid=1145) ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function (pid=1145) return constant_op.constant(value, dtype, name=name) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant (pid=1145) allow_broadcast=True) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl (pid=1145) return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl (pid=1145) t = convert_to_eager_tensor(value, ctx, dtype) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor (pid=1145) ctx.ensure_initialized() (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized (pid=1145) context_handle = pywrap_tfe.TFE_NewContext(opts) (pid=1145) MemoryError: std::bad_alloc (pid=1145) Exception in thread Thread-2: (pid=1145) Traceback (most recent call last): (pid=1145) File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner (pid=1145) self.run() (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 279, in run (pid=1145) raise e (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run (pid=1145) self._entrypoint() (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint (pid=1145) self._status_reporter.get_checkpoint()) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 449, in _resume_span (pid=1145) return method(self, *_args, **_kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func (pid=1145) output = fn() (pid=1145) File 
"/tmp/ipykernel_886/1119233880.py", line 61, in meta_loss (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper (pid=1145) return target(*args, **kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4338, in one_hot (pid=1145) on_value = ops.convert_to_tensor(1, dtype, name="on_value") (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped (pid=1145) return func(*args, **kwargs) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor (pid=1145) ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function (pid=1145) return constant_op.constant(value, dtype, name=name) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant (pid=1145) allow_broadcast=True) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl (pid=1145) return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl (pid=1145) t = convert_to_eager_tensor(value, ctx, dtype) (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor (pid=1145) ctx.ensure_initialized() (pid=1145) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized (pid=1145) context_handle = pywrap_tfe.TFE_NewContext(opts) (pid=1145) MemoryError: std::bad_alloc (pid=1145) 2021-11-01 12:24:11,639 ERROR trial_runner.py:846 -- Trial meta_loss_98963_00000: Error processing event. Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial results = self.trial_executor.fetch_result(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1621, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1145, ip=10.138.0.10, repr=<types.ImplicitFunc object at 0x7fda288a30d0>) File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 189, in train_buffered result = self.train() File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 248, in train result = self.step() File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 379, in step self._report_thread_runner_error(block=True) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error ("Trial raised an exception. Traceback:\n{}".format(err_tb_str) ray.tune.error.TuneError: Trial raised an exception. 
Traceback: ray::ImplicitFunc.train_buffered() (pid=1145, ip=10.138.0.10, repr=<types.ImplicitFunc object at 0x7fda288a30d0>) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run self._entrypoint() File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint self._status_reporter.get_checkpoint()) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func output = fn() File "/tmp/ipykernel_886/1119233880.py", line 61, in meta_loss File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper return target(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4338, in one_hot on_value = ops.convert_to_tensor(1, dtype, name="on_value") File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function return constant_op.constant(value, dtype, name=name) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant allow_broadcast=True) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl t = convert_to_eager_tensor(value, ctx, dtype) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor ctx.ensure_initialized() File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized context_handle = pywrap_tfe.TFE_NewContext(opts) MemoryError: std::bad_alloc
Result for meta_loss_98963_00000: {}
Memory usage on this node: 1.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 18/267302874351744 (1 ERROR, 16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00001 | RUNNING | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | PENDING | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | PENDING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad |
Number of errored trials: 1
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
Memory usage on this node: 2.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 18/267302874351744 (1 ERROR, 16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00001 | RUNNING | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | PENDING | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | PENDING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad |
Number of errored trials: 1
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
(pid=1185) 2021-11-01 12:24:17.581494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:17.592053: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:17.592853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:17.594109: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=1185) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. (pid=1185) 2021-11-01 12:24:17.594570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:17.595438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:17.596336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:18.164144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:18.165161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:18.166119: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=1185) 2021-11-01 12:24:18,167 ERROR function_runner.py:266 -- Runner Thread raised error. 
(pid=1185) Traceback (most recent call last):
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run
(pid=1185)     self._entrypoint()
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=1185)     self._status_reporter.get_checkpoint())
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 449, in _resume_span
(pid=1185)     return method(self, *_args, **_kwargs)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=1185)     output = fn()
(pid=1185)   File "/tmp/ipykernel_886/1119233880.py", line 61, in meta_loss
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
(pid=1185)     return target(*args, **kwargs)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4338, in one_hot
(pid=1185)     on_value = ops.convert_to_tensor(1, dtype, name="on_value")
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
(pid=1185)     return func(*args, **kwargs)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
(pid=1185)     ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
(pid=1185)     return constant_op.constant(value, dtype, name=name)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant
(pid=1185)     allow_broadcast=True)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
(pid=1185)     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
(pid=1185)     t = convert_to_eager_tensor(value, ctx, dtype)
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor
(pid=1185)     ctx.ensure_initialized()
(pid=1185)   File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized
(pid=1185)     context_handle = pywrap_tfe.TFE_NewContext(opts)
(pid=1185) MemoryError: std::bad_alloc
(pid=1185) Exception in thread Thread-2: (the runner thread re-raises the same MemoryError traceback)
2021-11-01 12:24:18,370 ERROR trial_runner.py:846 -- Trial meta_loss_98963_00001: Error processing event.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1185, ip=10.138.0.10, repr=<types.ImplicitFunc object at 0x7fd6ffc6cad0>)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 189, in train_buffered
    result = self.train()
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 248, in train
    result = self.step()
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. (the embedded traceback is the same worker-side MemoryError: std::bad_alloc shown above)
Result for meta_loss_98963_00001: {}
Memory usage on this node: 2.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 19/267302874351744 (2 ERROR, 16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00002 | RUNNING | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | PENDING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00018 | PENDING | conv2d | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.26361 | 0.83558 | mse | adagrad | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | ERROR | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad |
Number of errored trials: 2
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
meta_loss_98963_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00001_1_L1=dense,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,ba_2021-11-01_12-24-03/error.txt |
(pid=1231) 2021-11-01 12:24:24,777 ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=1231) MemoryError: std::bad_alloc (same TFE_NewContext traceback as for trial 00001 above)
2021-11-01 12:24:24,979 ERROR trial_runner.py:846 -- Trial meta_loss_98963_00002: Error processing event. ray.tune.error.TuneError: Trial raised an exception. (same MemoryError: std::bad_alloc traceback as above)
Result for meta_loss_98963_00002: {}
Memory usage on this node: 1.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 20/267302874351744 (3 ERROR, 16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00003 | RUNNING | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | PENDING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00018 | PENDING | conv2d | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.26361 | 0.83558 | mse | adagrad | |
meta_loss_98963_00019 | PENDING | dense | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -3.09267 | 0.103705 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | ERROR | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | ERROR | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax |
Number of errored trials: 3
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
meta_loss_98963_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00001_1_L1=dense,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,ba_2021-11-01_12-24-03/error.txt |
meta_loss_98963_00002 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00002_2_L1=maxpooling2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,_2021-11-01_12-24-11/error.txt |
(pid=1270) 2021-11-01 12:24:31,415 ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=1270) MemoryError: std::bad_alloc (same TFE_NewContext traceback as for trial 00001 above)
2021-11-01 12:24:31,618 ERROR trial_runner.py:846 -- Trial meta_loss_98963_00003: Error processing event. ray.tune.error.TuneError: Trial raised an exception. (same MemoryError: std::bad_alloc traceback as above)
Result for meta_loss_98963_00003: {}
Memory usage on this node: 1.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 21/267302874351744 (4 ERROR, 16 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00004 | RUNNING | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00005 | PENDING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00006 | PENDING | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00018 | PENDING | conv2d | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.26361 | 0.83558 | mse | adagrad | |
meta_loss_98963_00019 | PENDING | dense | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -3.09267 | 0.103705 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | ERROR | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | ERROR | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | ERROR | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd |
... 1 more trials not shown (1 PENDING)
Number of errored trials: 4
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
meta_loss_98963_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00001_1_L1=dense,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,ba_2021-11-01_12-24-03/error.txt |
meta_loss_98963_00002 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00002_2_L1=maxpooling2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,_2021-11-01_12-24-11/error.txt |
meta_loss_98963_00003 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00003_3_L1=flatten,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,_2021-11-01_12-24-18/error.txt |
(pid=1308) 2021-11-01 12:24:38,065 ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=1308) MemoryError: std::bad_alloc (same TFE_NewContext traceback as for trial 00001 above)
2021-11-01 12:24:38,268 ERROR trial_runner.py:846 -- Trial meta_loss_98963_00004: Error processing event. ray.tune.error.TuneError: Trial raised an exception. (same MemoryError: std::bad_alloc traceback as above)
Result for meta_loss_98963_00004: {}
2021-11-01 12:24:40,402 WARNING worker.py:1227 -- The autoscaler failed with the following error: Terminated with signal 15 File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 430, in <module> monitor.run() File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 331, in run self._run() File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 248, in _run time.sleep(AUTOSCALER_UPDATE_INTERVAL_S) 2021-11-01 12:24:40,508 ERROR worker.py:1229 -- listen_error_messages_raylet: Connection closed by server. 2021-11-01 12:24:40,511 ERROR import_thread.py:88 -- ImportThread: Connection closed by server. 2021-11-01 12:24:40,517 ERROR worker.py:475 -- print_logs: Connection closed by server. 2021-11-01 12:24:40,545 ERROR ray_trial_executor.py:600 -- Trial meta_loss_98963_00006: Unexpected error starting runner. Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/redis/connection.py", line 559, in connect sock = self._connect() File "/opt/conda/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect raise err File "/opt/conda/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect sock.connect(socket_address) ConnectionRefusedError: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 590, in start_trial return self._start_trial(trial, checkpoint, train=train) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 465, in _start_trial runner = self._setup_remote_runner(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 307, in _setup_remote_runner trainable_cls = trial.get_trainable_cls() File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial.py", line 646, in get_trainable_cls return get_trainable_cls(self.trainable_name) File "/opt/conda/lib/python3.7/site-packages/ray/tune/registry.py", line 31, in get_trainable_cls validate_trainable(trainable_name) File "/opt/conda/lib/python3.7/site-packages/ray/tune/registry.py", line 36, in validate_trainable if not has_trainable(trainable_name): File "/opt/conda/lib/python3.7/site-packages/ray/tune/registry.py", line 27, in has_trainable return _global_registry.contains(TRAINABLE_CLASS, trainable_name) File "/opt/conda/lib/python3.7/site-packages/ray/tune/registry.py", line 151, in contains value = _internal_kv_get(_make_key(category, key)) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 57, in _internal_kv_get return ray.worker.global_worker.redis_client.hget(key, "value") File "/opt/conda/lib/python3.7/site-packages/redis/client.py", line 3010, in hget return self.execute_command('HGET', name, key) File "/opt/conda/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command conn = self.connection or pool.get_connection(command_name, **options) File "/opt/conda/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection connection.connect() File "/opt/conda/lib/python3.7/site-packages/redis/connection.py", line 563, in connect raise ConnectionError(self._error_message(e)) redis.exceptions.ConnectionError: Error 111 connecting to 10.138.0.10:6379. Connection refused. 
2021-11-01 12:24:42,551 WARNING util.py:166 -- The `start_trial` operation took 2.006 s, which may be a performance bottleneck.
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.21 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01
Number of trials: 22/267302874351744 (6 ERROR, 15 PENDING, 1 RUNNING)
Trial name | status | loc | L1 | L2 | L3 | L4 | L5 | L6 | L7 | N1 | N2 | N3 | N4 | N5 | N6 | N7 | batch_size | conv_activation | dense_activation | initial_learning_rate_exp | learning_rate_rate | loss_name | optimizer_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
meta_loss_98963_00005 | RUNNING | none | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.50222 | 0.955176 | mape | sgd | |
meta_loss_98963_00007 | PENDING | dense | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -1.66232 | 0.355708 | mae | adagrad | |
meta_loss_98963_00008 | PENDING | maxpooling2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 64 | sigmoid | sigmoid | -1.15489 | 0.736778 | mae | nadam | |
meta_loss_98963_00009 | PENDING | flatten | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.96978 | 0.814146 | categorical_crossentropy | adamax | |
meta_loss_98963_00010 | PENDING | dropout | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -1.84345 | 0.335166 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00011 | PENDING | none | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -3.12773 | 0.515024 | categorical_crossentropy | adamax | |
meta_loss_98963_00012 | PENDING | conv2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -2.44683 | 0.949793 | mse | adamax | |
meta_loss_98963_00013 | PENDING | dense | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.7508 | 0.0587266 | categorical_crossentropy | nadam | |
meta_loss_98963_00014 | PENDING | maxpooling2d | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -3.15043 | 0.594349 | mse | adagrad | |
meta_loss_98963_00015 | PENDING | flatten | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.01626 | 0.938932 | mse | adagrad | |
meta_loss_98963_00016 | PENDING | dropout | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | sigmoid | sigmoid | -3.70958 | 0.0631872 | mape | sgd | |
meta_loss_98963_00017 | PENDING | none | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.41504 | 0.482399 | mae | nadam | |
meta_loss_98963_00018 | PENDING | conv2d | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 48 | sigmoid | sigmoid | -2.26361 | 0.83558 | mse | adagrad | |
meta_loss_98963_00019 | PENDING | dense | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -3.09267 | 0.103705 | sparse_categorical_crossentropy | adadelta | |
meta_loss_98963_00000 | ERROR | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 24 | sigmoid | sigmoid | -2.51525 | 0.553209 | mape | adagrad | |
meta_loss_98963_00001 | ERROR | dense | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -2.37358 | 0.314978 | sparse_categorical_crossentropy | adagrad | |
meta_loss_98963_00002 | ERROR | maxpooling2d | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.8581 | 0.512218 | sparse_categorical_crossentropy | adamax | |
meta_loss_98963_00003 | ERROR | flatten | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | sigmoid | sigmoid | -1.27046 | 0.454703 | mape | sgd | |
meta_loss_98963_00004 | ERROR | dropout | conv2d | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 | sigmoid | sigmoid | -3.53965 | 0.861406 | categorical_crossentropy | adadelta | |
meta_loss_98963_00006 | ERROR | conv2d | dense | conv2d | conv2d | conv2d | conv2d | conv2d | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 128 | sigmoid | sigmoid | -2.40886 | 0.878881 | mae | adadelta |
... 2 more trials not shown (2 PENDING)
Number of errored trials: 6
Trial name | # failures | error file |
---|---|---|
meta_loss_98963_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00000_0_L1=conv2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,b_2021-11-01_12-24-02/error.txt |
meta_loss_98963_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00001_1_L1=dense,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,ba_2021-11-01_12-24-03/error.txt |
meta_loss_98963_00002 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00002_2_L1=maxpooling2d,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,_2021-11-01_12-24-11/error.txt |
meta_loss_98963_00003 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00003_3_L1=flatten,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,_2021-11-01_12-24-18/error.txt |
meta_loss_98963_00004 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00004_4_L1=dropout,L2=conv2d,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,_2021-11-01_12-24-25/error.txt |
meta_loss_98963_00006 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-24-01/meta_loss_98963_00006_6_L1=conv2d,L2=dense,L3=conv2d,L4=conv2d,L5=conv2d,L6=conv2d,L7=conv2d,N1=8,N2=8,N3=8,N4=8,N5=8,N6=8,N7=8,ba_2021-11-01_12-24-38/error.txt |
Obviously, I got carried away. There are far too many tunable parameters to expect the search to converge. Let's try a smaller search space with only units changing.
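To get a feel for why that first search was hopeless, here is a rough back-of-the-envelope sketch. The layer and unit counts below are illustrative assumptions, not the exact grid used above, but they show how quickly grid-searched axes multiply:
layer_type_options = 6   # e.g. conv2d, dense, maxpooling2d, flatten, dropout, none
layer_slots = 7          # L1..L7
unit_options = 4         # hypothetical candidate widths for each unit slot
unit_slots = 7           # N1..N7
grid_points = (layer_type_options ** layer_slots) * (unit_options ** unit_slots)
print(f'{grid_points:,} grid combinations')  # ~4.6 billion, before any continuous parameter is even sampled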
Iteration 2¶
def meta_loss(config):
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import IPython.display as display
    import tensorflow as tf
    import tensorflow.keras as keras
    import tensorflow.keras.layers as tfkl
    import tensorflow.keras.datasets as datasets
    from ray import tune  # imported here so the function is self-contained on ray workers

    # load dataset
    (X_train, Y_train), (X_test, Y_test) = datasets.mnist.load_data()
    X_train, X_test = X_train / 255., X_test / 255.

    units = config['units']  # number of units in the middle layer (if applicable)
    conv_activation = config['conv_activation']  # 'relu', 'elu', 'softplus'
    dense_activation = config['dense_activation']  # 'relu', 'elu', 'softplus'
    initial_learning_rate = 10 ** config['initial_learning_rate_exp']  # exponent in -4.0 to -1.0
    learning_rate_rate = config['learning_rate_rate']  # per-epoch decay factor, 0.0 to 1.0
    optimizer_name = config['optimizer_name']  # 'Adam', 'SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adamax', or 'Nadam'
    batch_size = config['batch_size']  # 4 to 1024, integers only

    # activation functions
    activations = {
        'relu': tf.nn.relu,
        'elu': tf.nn.elu,
        'softplus': tf.nn.softplus,
    }
    conv_activation = activations[conv_activation]
    dense_activation = activations[dense_activation]

    # optimizer and learning rate
    optimizers = {
        'SGD': keras.optimizers.SGD,
        'RMSprop': keras.optimizers.RMSprop,
        'Adagrad': keras.optimizers.Adagrad,
        'Adadelta': keras.optimizers.Adadelta,
        'Adam': keras.optimizers.Adam,
        'Adamax': keras.optimizers.Adamax,
        'Nadam': keras.optimizers.Nadam,
    }
    optimizer = optimizers[optimizer_name](initial_learning_rate)
    learning_rate_scheduler = keras.callbacks.LearningRateScheduler(
        lambda epoch, _: initial_learning_rate * (learning_rate_rate ** epoch))

    # build model
    model = keras.Sequential([
        tfkl.Input(shape=(28, 28)),
        tfkl.Reshape(target_shape=(28, 28, 1)),  # give each pixel a 1-dimensional channel
        tfkl.Conv2D(filters=8, kernel_size=(3, 3), activation=conv_activation),
        tfkl.Conv2D(filters=(units+8)//2, kernel_size=(3, 3), activation=conv_activation),
        tfkl.GlobalMaxPooling2D(),  # take the highest activation over all positions for each filter
        tfkl.Dense(units, activation=dense_activation),
        tfkl.Dense((units+10)//2, activation=dense_activation),
        tfkl.Dense(10),
    ])
    model.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer)

    # train model
    history = model.fit(
        x=X_train,
        y=Y_train,
        batch_size=batch_size,
        epochs=10,
        verbose=2,
        callbacks=[learning_rate_scheduler],
        validation_data=(X_test, Y_test),
        validation_batch_size=64,
    )

    # report validation loss
    final_val_loss = history.history['val_loss'][-1]
    tune.report(validation_loss=final_val_loss)
large_config = {
    'units': tune.grid_search([8, 12, 16, 32, 64, 128, 256]),
    'dense_activation': tune.grid_search(['relu', 'elu', 'softplus']),
    'conv_activation': tune.grid_search(['relu', 'elu', 'softplus']),
    'initial_learning_rate_exp': tune.uniform(-4.0, -1.0),
    'learning_rate_rate': tune.uniform(0.0, 1.0),
    'optimizer_name': tune.choice(['Adam', 'SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adamax', 'Nadam']),
    'batch_size': tune.choice([4, 8, 16, 24, 32, 48, 64, 128]),
    'loss_name': tune.choice(['mse', 'mae', 'mape', 'categorical_crossentropy', 'sparse_categorical_crossentropy']),
}
small_config = {
    'units': tune.grid_search([8, 16, 32]),
    'dense_activation': tune.grid_search(['relu']),
    'conv_activation': tune.grid_search(['relu']),
    'initial_learning_rate_exp': -2.0,
    'learning_rate_rate': 0.9,
    'optimizer_name': tune.choice(['Adam', 'SGD']),
    'batch_size': tune.choice([16, 32]),
}
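One thing worth noting (as I understand ray tune's search-space semantics): only the grid_search axes are enumerated exhaustively, while choice and uniform parameters are sampled once per generated trial. So small_config should produce just the product of its grid axes:
n_trials = 3 * 1 * 1  # units grid x dense_activation grid x conv_activation grid in small_config
print(n_trials)       # 3, which matches the "Number of trials: 3/3" line in the tune.run output below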
meta_loss({
    'units': 8,
    'dense_activation': 'relu',
    'conv_activation': 'relu',
    'initial_learning_rate_exp': -2.0,
    'learning_rate_rate': 0.9,
    'optimizer_name': 'SGD',
    'batch_size': 32,
})
2021-11-01 12:29:39.483154: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.494402: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.495286: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.496558: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-11-01 12:29:39.496951: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.497758: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.498647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.914087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.914874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-11-01 12:29:39.915628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
--------------------------------------------------------------------------- MemoryError Traceback (most recent call last) /tmp/ipykernel_1785/850133443.py in <module> 6 'learning_rate_rate': 0.9, 7 'optimizer_name': 'SGD', ----> 8 'batch_size': 32 9 }) /tmp/ipykernel_1785/823405744.py in meta_loss(config) 56 tfkl.Dense(units, activation=dense_activation), 57 tfkl.Dense((units+10)//2, activation=dense_activation), ---> 58 tfkl.Dense(10), 59 ]) 60 model.compile( /opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs) 528 self._self_setattr_tracking = False # pylint: disable=protected-access 529 try: --> 530 result = method(self, *args, **kwargs) 531 finally: 532 self._self_setattr_tracking = previous_value # pylint: disable=protected-access /opt/conda/lib/python3.7/site-packages/keras/engine/sequential.py in __init__(self, layers, name) 106 # Skip the init in FunctionalModel since model doesn't have input/output yet 107 super(functional.Functional, self).__init__( # pylint: disable=bad-super-call --> 108 name=name, autocast=False) 109 base_layer.keras_api_gauge.get_cell('Sequential').set(True) 110 self.supports_masking = True /opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs) 528 self._self_setattr_tracking = False # pylint: disable=protected-access 529 try: --> 530 result = method(self, *args, **kwargs) 531 finally: 532 self._self_setattr_tracking = previous_value # pylint: disable=protected-access /opt/conda/lib/python3.7/site-packages/keras/engine/training.py in __init__(self, *args, **kwargs) 287 self._steps_per_execution = None 288 --> 289 self._init_batch_counters() 290 self._base_model_initialized = True 291 /opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs) 528 self._self_setattr_tracking = False # pylint: disable=protected-access 529 try: --> 530 result = method(self, *args, **kwargs) 531 finally: 532 self._self_setattr_tracking = previous_value # pylint: disable=protected-access /opt/conda/lib/python3.7/site-packages/keras/engine/training.py in _init_batch_counters(self) 295 # `evaluate`, and `predict`. 296 agg = tf.VariableAggregation.ONLY_FIRST_REPLICA --> 297 self._train_counter = tf.Variable(0, dtype='int64', aggregation=agg) 298 self._test_counter = tf.Variable(0, dtype='int64', aggregation=agg) 299 self._predict_counter = tf.Variable( /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs) 266 return cls._variable_v1_call(*args, **kwargs) 267 elif cls is Variable: --> 268 return cls._variable_v2_call(*args, **kwargs) 269 else: 270 return super(VariableMetaclass, cls).__call__(*args, **kwargs) /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape) 260 synchronization=synchronization, 261 aggregation=aggregation, --> 262 shape=shape) 263 264 def __call__(cls, *args, **kwargs): /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py in <lambda>(**kws) 241 shape=None): 242 """Call on Variable class. 
Useful to force the signature.""" --> 243 previous_getter = lambda **kws: default_variable_creator_v2(None, **kws) 244 for _, getter in ops.get_default_graph()._variable_creator_stack: # pylint: disable=protected-access 245 previous_getter = _make_getter(getter, previous_getter) /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs) 2673 synchronization=synchronization, 2674 aggregation=aggregation, -> 2675 shape=shape) 2676 2677 /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs) 268 return cls._variable_v2_call(*args, **kwargs) 269 else: --> 270 return super(VariableMetaclass, cls).__call__(*args, **kwargs) 271 272 /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape) 1611 aggregation=aggregation, 1612 shape=shape, -> 1613 distribute_strategy=distribute_strategy) 1614 1615 def _init_from_args(self, /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape) 1745 initial_value = ops.convert_to_tensor(initial_value, 1746 name="initial_value", -> 1747 dtype=dtype) 1748 if shape is not None: 1749 if not initial_value.shape.is_compatible_with(shape): /opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs) 161 with Trace(trace_name, **trace_kwargs): 162 return func(*args, **kwargs) --> 163 return func(*args, **kwargs) 164 165 return wrapped /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types) 1564 1565 if ret is None: -> 1566 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) 1567 1568 if ret is NotImplemented: /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***) 50 def _default_conversion_function(value, dtype, name, as_ref): 51 del as_ref # Unused. 
---> 52 return constant_op.constant(value, dtype, name=name) 53 54 /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name) 270 """ 271 return _constant_impl(value, dtype, shape, name, verify_shape=False, --> 272 allow_broadcast=True) 273 274 /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast) 281 with trace.Trace("tf.constant"): 282 return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) --> 283 return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) 284 285 g = ops.get_default_graph() /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape) 306 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape): 307 """Creates a constant on the current device.""" --> 308 t = convert_to_eager_tensor(value, ctx, dtype) 309 if shape is None: 310 return t /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype) 103 except AttributeError: 104 dtype = dtypes.as_dtype(dtype).as_datatype_enum --> 105 ctx.ensure_initialized() 106 return ops.EagerTensor(value, ctx.device_name, dtype) 107 /opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in ensure_initialized(self) 553 pywrap_tfe.TFE_ContextOptionsSetRunEagerOpAsFunction( 554 opts, self._run_eager_op_as_function) --> 555 context_handle = pywrap_tfe.TFE_NewContext(opts) 556 finally: 557 pywrap_tfe.TFE_DeleteContextOptions(opts) MemoryError: std::bad_alloc
analysis = tune.run(
    meta_loss,
    resources_per_trial={'gpu': 0, 'cpu': 2},
    config=small_config)

print("Best config: ", analysis.get_best_config(
    metric="validation_loss", mode="min"))  # metric name must match the keyword passed to tune.report

# Get a dataframe for analyzing trial results.
df = analysis.results_df
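All three trials end up erroring below, so df comes back empty here, but in a successful run one could rank trials by the reported metric, along these lines:
# hypothetical inspection of a successful run; 'validation_loss' is the name reported via tune.report
best_rows = df.sort_values('validation_loss').head()
print(best_rows['validation_loss'])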
2021-11-01 12:30:23,241 WARNING function_runner.py:559 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23
Number of trials: 3/3 (3 PENDING)
Trial name | status | loc | batch_size | conv_activation | dense_activation | optimizer_name | units |
---|---|---|---|---|---|---|---|
meta_loss_7c005_00000 | PENDING | 32 | relu | relu | SGD | 8 | |
meta_loss_7c005_00001 | PENDING | 16 | relu | relu | Adam | 16 | |
meta_loss_7c005_00002 | PENDING | 32 | relu | relu | SGD | 32 |
Memory usage on this node: 1.9/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23
Number of trials: 3/3 (2 PENDING, 1 RUNNING)
Trial name | status | loc | batch_size | conv_activation | dense_activation | optimizer_name | units |
---|---|---|---|---|---|---|---|
meta_loss_7c005_00000 | RUNNING | 32 | relu | relu | SGD | 8 | |
meta_loss_7c005_00001 | PENDING | 16 | relu | relu | Adam | 16 | |
meta_loss_7c005_00002 | PENDING | 32 | relu | relu | SGD | 32 |
(pid=1939) 2021-11-01 12:30:29.807450: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected (pid=1939) 2021-11-01 12:30:29.807537: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-7-vm (pid=1939) 2021-11-01 12:30:29.807554: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-7-vm (pid=1939) 2021-11-01 12:30:29.807744: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.73.1 (pid=1939) 2021-11-01 12:30:29.807788: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.73.1 (pid=1939) 2021-11-01 12:30:29.807800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.73.1 (pid=1939) 2021-11-01 12:30:29.808361: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=1939) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-11-01 12:30:30,147 ERROR trial_runner.py:846 -- Trial meta_loss_7c005_00000: Error processing event. Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial results = self.trial_executor.fetch_result(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. 2021-11-01 12:30:30,152 WARNING worker.py:1227 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff8bb23fe5668d6d46702920b803000000 Worker ID: ae85aa106aa3d6af84167e08626e369bf854be3ac9d9b62787c6181d Node ID: de2fbc5360fb324acfae8b140ae1c049d7b7834725943bf2cdb2e73e Worker IP address: 10.138.0.10 Worker port: 10005 Worker PID: 1939 (pid=1939) *** SIGSEGV received at time=1635769830 on cpu 0 *** (pid=1939) PC: @ 0x56017b7d2a8e (unknown) PyErr_SetObject (pid=1939) @ 0x7efca180d730 (unknown) (unknown) (pid=1939) @ 0x56017ba1f0e0 (unknown) (unknown) (pid=1939) [2021-11-01 12:30:30,119 E 1939 1964] logging.cc:315: *** SIGSEGV received at time=1635769830 on cpu 0 *** (pid=1939) [2021-11-01 12:30:30,120 E 1939 1964] logging.cc:315: PC: @ 0x56017b7d2a8e (unknown) PyErr_SetObject (pid=1939) [2021-11-01 12:30:30,122 E 1939 1964] logging.cc:315: @ 0x7efca180d730 (unknown) (unknown) (pid=1939) [2021-11-01 12:30:30,125 E 1939 1964] logging.cc:315: @ 0x56017ba1f0e0 (unknown) (unknown)
Result for meta_loss_7c005_00000: {}
Memory usage on this node: 1.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23
Number of trials: 3/3 (1 ERROR, 1 PENDING, 1 RUNNING)
Trial name | status | loc | batch_size | conv_activation | dense_activation | optimizer_name | units |
---|---|---|---|---|---|---|---|
meta_loss_7c005_00001 | RUNNING | 16 | relu | relu | Adam | 16 | |
meta_loss_7c005_00002 | PENDING | 32 | relu | relu | SGD | 32 | |
meta_loss_7c005_00000 | ERROR | 32 | relu | relu | SGD | 8 |
Number of errored trials: 1
Trial name | # failures | error file |
---|---|---|
meta_loss_7c005_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00000_0_batch_size=32,conv_activation=relu,dense_activation=relu,optimizer_name=SGD,units=8_2021-11-01_12-30-23/error.txt |
(pid=1977) 2021-11-01 12:30:36.149869: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected (pid=1977) 2021-11-01 12:30:36.149961: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-7-vm (pid=1977) 2021-11-01 12:30:36.149977: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-7-vm (pid=1977) 2021-11-01 12:30:36.150150: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.73.1 (pid=1977) 2021-11-01 12:30:36.150192: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.73.1 (pid=1977) 2021-11-01 12:30:36.150204: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.73.1 (pid=1977) 2021-11-01 12:30:36.150687: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=1977) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. (pid=1977) *** SIGSEGV received at time=1635769836 on cpu 1 *** (pid=1977) PC: @ 0x5573b8b2ba8e (unknown) PyErr_SetObject (pid=1977) @ 0x7f4d76ff4730 (unknown) (unknown) (pid=1977) @ 0x5573b8d780e0 (unknown) (unknown) (pid=1977) [2021-11-01 12:30:36,443 E 1977 2002] logging.cc:315: *** SIGSEGV received at time=1635769836 on cpu 1 *** (pid=1977) [2021-11-01 12:30:36,443 E 1977 2002] logging.cc:315: PC: @ 0x5573b8b2ba8e (unknown) PyErr_SetObject (pid=1977) [2021-11-01 12:30:36,446 E 1977 2002] logging.cc:315: @ 0x7f4d76ff4730 (unknown) (unknown) (pid=1977) [2021-11-01 12:30:36,450 E 1977 2002] logging.cc:315: @ 0x5573b8d780e0 (unknown) (unknown) 2021-11-01 12:30:36,476 ERROR trial_runner.py:846 -- Trial meta_loss_7c005_00001: Error processing event. Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial results = self.trial_executor.fetch_result(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. 2021-11-01 12:30:36,477 WARNING worker.py:1227 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffd685dca337370d9be30994e603000000 Worker ID: dc32c63cf446d421627c3e8460cfa6fb17cc6aac0f09d251fa6193a7 Node ID: de2fbc5360fb324acfae8b140ae1c049d7b7834725943bf2cdb2e73e Worker IP address: 10.138.0.10 Worker port: 10006 Worker PID: 1977
Result for meta_loss_7c005_00001: {}
Memory usage on this node: 1.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23
Number of trials: 3/3 (2 ERROR, 1 RUNNING)
Trial name | status | loc | batch_size | conv_activation | dense_activation | optimizer_name | units |
---|---|---|---|---|---|---|---|
meta_loss_7c005_00002 | RUNNING | 32 | relu | relu | SGD | 32 | |
meta_loss_7c005_00000 | ERROR | 32 | relu | relu | SGD | 8 | |
meta_loss_7c005_00001 | ERROR | 16 | relu | relu | Adam | 16 |
Number of errored trials: 2
Trial name | # failures | error file |
---|---|---|
meta_loss_7c005_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00000_0_batch_size=32,conv_activation=relu,dense_activation=relu,optimizer_name=SGD,units=8_2021-11-01_12-30-23/error.txt |
meta_loss_7c005_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00001_1_batch_size=16,conv_activation=relu,dense_activation=relu,optimizer_name=Adam,units=16_2021-11-01_12-30-24/error.txt |
(pid=2015) 2021-11-01 12:30:42.757520: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected (pid=2015) 2021-11-01 12:30:42.757606: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-7-vm (pid=2015) 2021-11-01 12:30:42.757630: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-7-vm (pid=2015) 2021-11-01 12:30:42.757796: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.73.1 (pid=2015) 2021-11-01 12:30:42.757837: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.73.1 (pid=2015) 2021-11-01 12:30:42.757848: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.73.1 (pid=2015) 2021-11-01 12:30:42.758400: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=2015) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-11-01 12:30:43,077 ERROR trial_runner.py:846 -- Trial meta_loss_7c005_00002: Error processing event. Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial results = self.trial_executor.fetch_result(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. 2021-11-01 12:30:43,085 WARNING worker.py:1227 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffe5361df5652c1dd93637d63803000000 Worker ID: e61b3c45187b164f348cd5f9432119b19d3c99d4f0b87f5075a51879 Node ID: de2fbc5360fb324acfae8b140ae1c049d7b7834725943bf2cdb2e73e Worker IP address: 10.138.0.10 Worker port: 10007 Worker PID: 2015
Result for meta_loss_7c005_00002: {}
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23
Number of trials: 3/3 (3 ERROR)
Trial name | status | loc | batch_size | conv_activation | dense_activation | optimizer_name | units |
---|---|---|---|---|---|---|---|
meta_loss_7c005_00000 | ERROR | 32 | relu | relu | SGD | 8 | |
meta_loss_7c005_00001 | ERROR | 16 | relu | relu | Adam | 16 | |
meta_loss_7c005_00002 | ERROR | 32 | relu | relu | SGD | 32 |
Number of errored trials: 3
Trial name | # failures | error file |
---|---|---|
meta_loss_7c005_00000 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00000_0_batch_size=32,conv_activation=relu,dense_activation=relu,optimizer_name=SGD,units=8_2021-11-01_12-30-23/error.txt |
meta_loss_7c005_00001 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00001_1_batch_size=16,conv_activation=relu,dense_activation=relu,optimizer_name=Adam,units=16_2021-11-01_12-30-24/error.txt |
meta_loss_7c005_00002 | 1 | /home/jacobfv123/ray_results/meta_loss_2021-11-01_12-30-23/meta_loss_7c005_00002_2_batch_size=32,conv_activation=relu,dense_activation=relu,optimizer_name=SGD,units=32_2021-11-01_12-30-30/error.txt |
(pid=2015) *** SIGSEGV received at time=1635769843 on cpu 0 *** (pid=2015) PC: @ 0x55d94c62aa8e (unknown) PyErr_SetObject (pid=2015) @ 0x7f72b39e2730 (unknown) (unknown) (pid=2015) @ 0x55d94c8770e0 (unknown) (unknown) (pid=2015) [2021-11-01 12:30:43,049 E 2015 2040] logging.cc:315: *** SIGSEGV received at time=1635769843 on cpu 0 *** (pid=2015) [2021-11-01 12:30:43,049 E 2015 2040] logging.cc:315: PC: @ 0x55d94c62aa8e (unknown) PyErr_SetObject (pid=2015) [2021-11-01 12:30:43,052 E 2015 2040] logging.cc:315: @ 0x7f72b39e2730 (unknown) (unknown) (pid=2015) [2021-11-01 12:30:43,055 E 2015 2040] logging.cc:315: @ 0x55d94c8770e0 (unknown) (unknown)
--------------------------------------------------------------------------- TuneError Traceback (most recent call last) /tmp/ipykernel_1868/3609324667.py in <module> 2 meta_loss, 3 resources_per_trial={'gpu': 0, 'cpu': 2}, ----> 4 config=small_config) 5 6 print("Best config: ", analysis.get_best_config( /opt/conda/lib/python3.7/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, max_concurrent_trials, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote) 609 if incomplete_trials: 610 if raise_on_failed_trial and not state[signal.SIGINT]: --> 611 raise TuneError("Trials did not complete", incomplete_trials) 612 else: 613 logger.error("Trials did not complete: %s", incomplete_trials) TuneError: ('Trials did not complete', [meta_loss_7c005_00000, meta_loss_7c005_00001, meta_loss_7c005_00002])
There were several outer-loop iterations behind the code blocks above. It seems like the behavior of meta_loss changes during a tuning session, causing errors to be raised. While debugging this issue I found a Keras-native example and read the ray tune documentation. That Keras example did not actually work either, so I probed the training pipeline and found that the model.fit function raises TensorFlow errors while executing inside ray workers. To work around this, I wrote my own training loop using tf.GradientTape and further scaled down the hyperparameter space:
Iteration 3¶
def trainable(config):
    import tensorflow as tf
    keras = tf.keras
    tfkl = keras.layers
    from ray import tune  # imported here so the function is self-contained on ray workers

    (X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()

    model = keras.Sequential([
        tfkl.Flatten(input_shape=(28, 28)),
        tfkl.Dense(config['hidden_units'], activation="relu"),
        tfkl.Dropout(0.2),
        tfkl.Dense(10, activation="softmax")
    ])

    epochs = 1
    train_batch_size = config['train_batch_size']
    loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = tf.keras.optimizers.SGD(0.005)

    # for some reason keras' model.fit raises errors inside ray workers, so I wrote my own training loop
    for epoch in range(epochs):
        minibatches = 2  # X_train.shape[0]//train_batch_size
        for step in range(minibatches):
            Xmb_train = X_train[step*train_batch_size:(step+1)*train_batch_size]
            Ymb_train = Y_train[step*train_batch_size:(step+1)*train_batch_size]
            with tf.GradientTape() as tape:
                probits = model(Xmb_train, training=True)
                loss_value = loss_fn(Ymb_train, probits)
            grads = tape.gradient(loss_value, model.trainable_weights)
            optimizer.apply_gradients(zip(grads, model.trainable_weights))

    """
    # evaluate model
    for step in range(X_train.shape[0]//batch_size):
        Xmb_train = X_train[step*batch_size:(step+1)*batch_size]
        Ymb_train = Y_train[step*batch_size:(step+1)*batch_size]
        with tf.GradientTape() as tape:
            probits = model(Xmb_test)
            loss_value = loss_fn(Ymb_test, probits)"""
    """
    history = model.fit(
        X_train,
        Y_train,
        batch_size=config['train_batch_size'],
        epochs=10,
        verbose=0,
        validation_data=(X_test, Y_test),
        validation_batch_size=64,
    )"""

    # report validation loss
    probits = model(X_test[:16], training=False)
    final_val_loss = loss_fn(Y_test[:16], probits)
    tune.report(val_loss=final_val_loss)
analysis = tune.run(
    trainable,
    config={
        'train_batch_size': 16,
        'hidden_units': tune.choice([16, 32, 64])
    },
    resources_per_trial={'gpu': 1}
)

print("best config: ", analysis.get_best_config(metric="val_loss", mode="min"))
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/1 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/trainable_2021-11-01_13-40-26
Number of trials: 1/1 (1 PENDING)
Trial name | status | loc | hidden_units |
---|---|---|---|
trainable_45a03_00000 | PENDING | 16 |
Memory usage on this node: 1.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/1 CPUs, 1.0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/trainable_2021-11-01_13-40-26
Number of trials: 1/1 (1 RUNNING)
Trial name | status | loc | hidden_units |
---|---|---|---|
trainable_45a03_00000 | RUNNING | 16 |
(pid=7034) 2021-11-01 13:40:34.073976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.085830: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.086687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.087968: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA (pid=7034) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. (pid=7034) 2021-11-01 13:40:34.088281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.089149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.089986: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.541694: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.542547: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34.543459: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero (pid=7034) 2021-11-01 13:40:34,544 ERROR function_runner.py:266 -- Runner Thread raised error. 
(pid=7034) Traceback (most recent call last): (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run (pid=7034) self._entrypoint() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint (pid=7034) self._status_reporter.get_checkpoint()) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 449, in _resume_span (pid=7034) return method(self, *_args, **_kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func (pid=7034) output = fn() (pid=7034) File "/tmp/ipykernel_6823/3481960557.py", line 13, in trainable (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/sequential.py", line 108, in __init__ (pid=7034) name=name, autocast=False) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 289, in __init__ (pid=7034) self._init_batch_counters() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 297, in _init_batch_counters (pid=7034) self._train_counter = tf.Variable(0, dtype='int64', aggregation=agg) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 268, in __call__ (pid=7034) return cls._variable_v2_call(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 262, in _variable_v2_call (pid=7034) shape=shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 243, in <lambda> (pid=7034) previous_getter = lambda **kws: default_variable_creator_v2(None, **kws) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2675, in default_variable_creator_v2 (pid=7034) shape=shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 270, in __call__ (pid=7034) return super(VariableMetaclass, cls).__call__(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1613, in __init__ (pid=7034) distribute_strategy=distribute_strategy) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1747, in _init_from_args (pid=7034) dtype=dtype) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped (pid=7034) return func(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor (pid=7034) ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function (pid=7034) return constant_op.constant(value, dtype, name=name) (pid=7034) 
File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant (pid=7034) allow_broadcast=True) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl (pid=7034) return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl (pid=7034) t = convert_to_eager_tensor(value, ctx, dtype) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor (pid=7034) ctx.ensure_initialized() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized (pid=7034) context_handle = pywrap_tfe.TFE_NewContext(opts) (pid=7034) MemoryError: std::bad_alloc (pid=7034) Exception in thread Thread-2: (pid=7034) Traceback (most recent call last): (pid=7034) File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner (pid=7034) self.run() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 279, in run (pid=7034) raise e (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run (pid=7034) self._entrypoint() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint (pid=7034) self._status_reporter.get_checkpoint()) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 449, in _resume_span (pid=7034) return method(self, *_args, **_kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func (pid=7034) output = fn() (pid=7034) File "/tmp/ipykernel_6823/3481960557.py", line 13, in trainable (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/sequential.py", line 108, in __init__ (pid=7034) name=name, autocast=False) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 289, in __init__ (pid=7034) self._init_batch_counters() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper (pid=7034) result = method(self, *args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 297, in _init_batch_counters (pid=7034) self._train_counter = tf.Variable(0, dtype='int64', aggregation=agg) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 268, in __call__ (pid=7034) return cls._variable_v2_call(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 262, in _variable_v2_call (pid=7034) shape=shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 243, in <lambda> (pid=7034) previous_getter = lambda **kws: default_variable_creator_v2(None, **kws) (pid=7034) File 
"/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2675, in default_variable_creator_v2 (pid=7034) shape=shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 270, in __call__ (pid=7034) return super(VariableMetaclass, cls).__call__(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1613, in __init__ (pid=7034) distribute_strategy=distribute_strategy) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1747, in _init_from_args (pid=7034) dtype=dtype) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped (pid=7034) return func(*args, **kwargs) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor (pid=7034) ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function (pid=7034) return constant_op.constant(value, dtype, name=name) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant (pid=7034) allow_broadcast=True) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl (pid=7034) return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl (pid=7034) t = convert_to_eager_tensor(value, ctx, dtype) (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor (pid=7034) ctx.ensure_initialized() (pid=7034) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized (pid=7034) context_handle = pywrap_tfe.TFE_NewContext(opts) (pid=7034) MemoryError: std::bad_alloc (pid=7034) 2021-11-01 13:40:34,747 ERROR trial_runner.py:846 -- Trial trainable_45a03_00000: Error processing event. 
Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial results = self.trial_executor.fetch_result(trial) File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT) File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1621, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=7034, ip=10.138.0.10, repr=<types.ImplicitFunc object at 0x7f9637a4b150>) File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 189, in train_buffered result = self.train() File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 248, in train result = self.step() File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 379, in step self._report_thread_runner_error(block=True) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error ("Trial raised an exception. Traceback:\n{}".format(err_tb_str) ray.tune.error.TuneError: Trial raised an exception. Traceback: ray::ImplicitFunc.train_buffered() (pid=7034, ip=10.138.0.10, repr=<types.ImplicitFunc object at 0x7f9637a4b150>) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 260, in run self._entrypoint() File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 329, in entrypoint self._status_reporter.get_checkpoint()) File "/opt/conda/lib/python3.7/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func output = fn() File "/tmp/ipykernel_6823/3481960557.py", line 13, in trainable File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper result = method(self, *args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/keras/engine/sequential.py", line 108, in __init__ name=name, autocast=False) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper result = method(self, *args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 289, in __init__ self._init_batch_counters() File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 530, in _method_wrapper result = method(self, *args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/keras/engine/training.py", line 297, in _init_batch_counters self._train_counter = tf.Variable(0, dtype='int64', aggregation=agg) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 268, in __call__ return cls._variable_v2_call(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 262, in _variable_v2_call shape=shape) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 243, in <lambda> previous_getter = lambda **kws: default_variable_creator_v2(None, **kws) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2675, in default_variable_creator_v2 shape=shape) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 270, in __call__ return super(VariableMetaclass, cls).__call__(*args, 
**kwargs) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1613, in __init__ distribute_strategy=distribute_strategy) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1747, in _init_from_args dtype=dtype) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped return func(*args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function return constant_op.constant(value, dtype, name=name) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 272, in constant allow_broadcast=True) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl return _constant_eager_impl(ctx, value, dtype, shape, verify_shape) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl t = convert_to_eager_tensor(value, ctx, dtype) File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 105, in convert_to_eager_tensor ctx.ensure_initialized() File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 555, in ensure_initialized context_handle = pywrap_tfe.TFE_NewContext(opts) MemoryError: std::bad_alloc
Result for trainable_45a03_00000: {}
Memory usage on this node: 1.8/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/1 CPUs, 0/1 GPUs, 0.0/7.2 GiB heap, 0.0/3.6 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /home/jacobfv123/ray_results/trainable_2021-11-01_13-40-26
Number of trials: 1/1 (1 ERROR)
Trial name | status | loc | hidden_units |
---|---|---|---|
trainable_45a03_00000 | ERROR | | 16 |
Number of errored trials: 1
Trial name | # failures | error file |
---|---|---|
trainable_45a03_00000 | 1 | /home/jacobfv123/ray_results/trainable_2021-11-01_13-40-26/trainable_45a03_00000_0_hidden_units=16_2021-11-01_13-40-27/error.txt |
--------------------------------------------------------------------------- TuneError Traceback (most recent call last) /tmp/ipykernel_6823/3481960557.py in <module> 65 'hidden_units': tune.choice([16, 32, 64]) 66 }, ---> 67 resources_per_trial={'gpu': 1} 68 ) 69 /opt/conda/lib/python3.7/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, max_concurrent_trials, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote) 609 if incomplete_trials: 610 if raise_on_failed_trial and not state[signal.SIGINT]: --> 611 raise TuneError("Trials did not complete", incomplete_trials) 612 else: 613 logger.error("Trials did not complete: %s", incomplete_trials) TuneError: ('Trials did not complete', [trainable_45a03_00000])
Iteration 4¶
At this point, I began to suspect serious issues with the way I was using ray.tune. Perhaps you have found a fix; maybe I have by now as well, since Ray Tune is such a flexible and useful tool. Along the way I came across a narrower library, keras-tuner, that does exactly what I want here. Check out its getting-started guide for details. In my case, deploying keras-tuner was as simple as:
!pip install -q -U keras-tuner
import keras_tuner as kt
import tensorflow as tf
keras = tf.keras
tfkl = keras.layers
(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()
def model_builder(hp):
hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
hp_units2 = hp.Int('units2', min_value=32, max_value=512, step=32)
hp_conv_activation = hp.Choice('conv_activation', values=['relu', 'elu', 'tanh'])
hp_dense_activation = hp.Choice('dense_activation', values=['relu', 'elu', 'tanh'])
hp_num_conv_layers = hp.Int('num_conv_layers', min_value=0, max_value=3, step=1)
hp_num_dense_layers = hp.Int('num_dense_layers', min_value=0, max_value=2, step=1)
hp_dropout_factor = hp.Choice('dropout_factor', values=[0.0, 0.05, 0.1, 0.15, 0.2])
hp_optimizer = hp.Choice('optimizer', values=['SGD', 'Adam', 'Adadelta'])
#hp_learning_rate = hp.choice('learning_rate', values=[0.0005, 0.001, 0.002, 0.005, 0.01])
model = keras.Sequential([
tfkl.Input(shape=(28, 28)),
tfkl.Reshape(target_shape=(28, 28, 1)), # give each pixel a 1 dimensional channel
] + [
tfkl.Conv2D(filters=(3*1+1*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Conv2D(filters=(2*1+2*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Conv2D(filters=(1*1+3*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
][:hp_num_conv_layers] + [
tfkl.Flatten(input_shape=(28, 28)),
] + [
tfkl.Dense(hp_units, activation=hp_dense_activation),
tfkl.Dense(hp_units2, activation=hp_dense_activation),
][:hp_num_dense_layers] + [
tfkl.Dropout(hp_dropout_factor),
tfkl.Dense(10, activation="softmax")
])
model.compile(loss='sparse_categorical_crossentropy',
optimizer=hp_optimizer, metrics=['accuracy'])
return model
The following code is directly copied from https://www.tensorflow.org/tutorials/keras/keras_tuner
tuner = kt.Hyperband(model_builder,
objective='val_accuracy',
max_epochs=10,
factor=3,
directory='my_dir',
project_name='intro_to_kt')
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
tuner.search(X_train, Y_train, epochs=50, validation_split=0.2, callbacks=[stop_early])
# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
print('Hyperparameter search is complete. Best hyperparameters:', best_hps.values)
Trial 30 Complete [00h 00m 59s] val_accuracy: 0.09950000047683716 Best val_accuracy So Far: 0.9877499938011169 Total elapsed time: 00h 21m 51s INFO:tensorflow:Oracle triggered exit Hyperparameter search is complete. Best hyperparameters: {'units': 480, 'units2': 160, 'conv_activation': 'tanh', 'dense_activation': 'relu', 'num_conv_layers': 1, 'num_dense_layers': 2, 'dropout_factor': 0.05, 'optimizer': 'SGD', 'tuner/epochs': 10, 'tuner/initial_epoch': 4, 'tuner/bracket': 2, 'tuner/round': 2, 'tuner/trial_id': '91393a31d51f30707bd93361fb509e4c'}
I was surprised that the tuner chose tanh and the humble SGD optimizer. That's convenient, since SGD is the cheapest per step of the optimizers tested. Let's now retrain with these hyperparameters and see whether we can reach the 0.9877 validation accuracy that keras-tuner reports:
model = tuner.hypermodel.build(best_hps)
history = model.fit(
x=X_train,
y=Y_train,
epochs=10,
verbose=2,
validation_data=(X_test, Y_test),
validation_batch_size=64,
)
Epoch 1/10 1875/1875 - 8s - loss: 0.1581 - accuracy: 0.9518 - val_loss: 0.0658 - val_accuracy: 0.9798 Epoch 2/10 1875/1875 - 8s - loss: 0.0452 - accuracy: 0.9865 - val_loss: 0.0565 - val_accuracy: 0.9824 Epoch 3/10 1875/1875 - 8s - loss: 0.0214 - accuracy: 0.9938 - val_loss: 0.0425 - val_accuracy: 0.9863 Epoch 4/10 1875/1875 - 8s - loss: 0.0112 - accuracy: 0.9974 - val_loss: 0.0413 - val_accuracy: 0.9861 Epoch 5/10 1875/1875 - 8s - loss: 0.0048 - accuracy: 0.9994 - val_loss: 0.0438 - val_accuracy: 0.9871 Epoch 6/10 1875/1875 - 8s - loss: 0.0028 - accuracy: 0.9997 - val_loss: 0.0450 - val_accuracy: 0.9869 Epoch 7/10 1875/1875 - 8s - loss: 0.0017 - accuracy: 0.9999 - val_loss: 0.0400 - val_accuracy: 0.9879 Epoch 8/10 1875/1875 - 8s - loss: 0.0013 - accuracy: 0.9999 - val_loss: 0.0401 - val_accuracy: 0.9884 Epoch 9/10 1875/1875 - 8s - loss: 0.0011 - accuracy: 0.9999 - val_loss: 0.0406 - val_accuracy: 0.9876 Epoch 10/10 1875/1875 - 8s - loss: 0.0010 - accuracy: 0.9999 - val_loss: 0.0395 - val_accuracy: 0.9877
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy']))
plt.title("accuracy")
plt.show()
Very nice performance! Nearly 99% validation accuracy. Let's play one more game with our model before concluding.
Fine-tuning and Transfer Learning¶
For transfer learning, the idea is: pretrain in domain X, then train and test in domain Y. Transfer learning catches models that are only "studying for the exam" rather than actually learning the content. A 'good' image model shouldn't have trouble retraining for cifar10. A narrow but useful application of transfer learning is fine-tuning: retraining a subset of the parameters on a new objective without major architectural changes. We're going to try fine-tuning our model for cifar10 and see how representative its internal activations are of arbitrary images. This is an easy transfer to attempt since
- both are image classification problems and
- the data is shaped similarly in both datasets.
We'll do this by freezing the parameters of all layers except the final one and then retraining on the new dataset. Along the way, we'll evaluate (but not run gradient descent on) the model's loss on its original mnist classification task. Then we'll try to train a model that can classify images from either domain, and finally throw the inner optimizer into a hyperparameter tuning loop.
Let's start by preparing the data and writing a custom evaluation callback:
mnist_train, mnist_test = keras.datasets.mnist.load_data()
cifar10_train, cifar10_test = keras.datasets.cifar10.load_data()
# crop the cifar10 images to (28,28) and average away the RGB channels
cifar10_train = cifar10_train[0][:, 2:-2, 2:-2].mean(-1), cifar10_train[1]
cifar10_test = cifar10_test[0][:, 2:-2, 2:-2].mean(-1), cifar10_test[1]
class CustomEvaluation(keras.callbacks.Callback):
def __init__(self, data, ds_name, verbose=0):
self.X, self.Y = data
self.ds_name = ds_name
self.loss_key = f'{self.ds_name}_loss'
self.acc_key = f'{self.ds_name}_acc'
self.verbose = verbose
def on_epoch_end(self, epoch, logs=None):
if not logs:
logs = dict()
loss, acc = self.model.evaluate(self.X, self.Y, verbose = 0)
logs[self.loss_key] = loss
logs[self.acc_key] = acc
if self.verbose > 0:
print('%s: %0.5f - %s: %0.5f' %
(self.loss_key, loss, self.acc_key, acc))
mnist_eval_cb = CustomEvaluation(mnist_test, 'mnist', verbose=1)
cifar10_eval_cb = CustomEvaluation(cifar10_test, 'cifar10', verbose=1)
# only let the last layer learn
for layer in model.layers[:-1]:
layer.trainable = False
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
epochs=10,
verbose=2,
callbacks=[mnist_eval_cb, cifar10_eval_cb]
)
Epoch 1/10 1563/1563 - 7s - loss: 2.7380 - accuracy: 0.0981 mnist_loss: 2.35357 - mnist_acc: 0.15990 cifar10_loss: 2.30254 - cifar10_acc: 0.10020 Epoch 2/10 1563/1563 - 6s - loss: 2.3028 - accuracy: 0.0961 mnist_loss: 2.30807 - mnist_acc: 0.17150 cifar10_loss: 2.30258 - cifar10_acc: 0.10000 Epoch 3/10 1563/1563 - 6s - loss: 2.3028 - accuracy: 0.0984 mnist_loss: 2.31940 - mnist_acc: 0.15170 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 4/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0986 mnist_loss: 2.32376 - mnist_acc: 0.14780 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 5/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0970 mnist_loss: 2.32900 - mnist_acc: 0.14830 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 6/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0981 mnist_loss: 2.32387 - mnist_acc: 0.15160 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 7/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0966 mnist_loss: 2.32490 - mnist_acc: 0.15050 cifar10_loss: 2.30257 - cifar10_acc: 0.10010 Epoch 8/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0976 mnist_loss: 2.33507 - mnist_acc: 0.14700 cifar10_loss: 2.30259 - cifar10_acc: 0.10000 Epoch 9/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0984 mnist_loss: 2.33530 - mnist_acc: 0.14980 cifar10_loss: 2.30258 - cifar10_acc: 0.10000 Epoch 10/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0981 mnist_loss: 2.31990 - mnist_acc: 0.15250 cifar10_loss: 2.30257 - cifar10_acc: 0.10010
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'cifar10_loss', 'mnist_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'cifar10_acc', 'mnist_acc']))
plt.title("accuracy")
plt.show()
This is not good. By forcing only the last layer to adapt to the new task, the model actually performed slightly worse than chance (its accuracy hovered just below 1/10 = 0.1)! I'm going to help the model out by thawing the rest of the layers. Let's see if that helps:
for layer in model.layers:
layer.trainable=True
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
epochs=10,
verbose=2,
callbacks=[mnist_eval_cb, cifar10_eval_cb]
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'cifar10_loss', 'mnist_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'cifar10_acc', 'mnist_acc']))
plt.title("accuracy")
plt.show()
Epoch 1/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0990 mnist_loss: 2.31927 - mnist_acc: 0.15290 cifar10_loss: 2.30257 - cifar10_acc: 0.10010 Epoch 2/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0978 mnist_loss: 2.31963 - mnist_acc: 0.15480 cifar10_loss: 2.30257 - cifar10_acc: 0.10000 Epoch 3/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0977 mnist_loss: 2.31968 - mnist_acc: 0.15230 cifar10_loss: 2.30256 - cifar10_acc: 0.09990 Epoch 4/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0981 mnist_loss: 2.32013 - mnist_acc: 0.15150 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 5/10 1563/1563 - 6s - loss: 2.3040 - accuracy: 0.0983 mnist_loss: 2.29998 - mnist_acc: 0.16470 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 6/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0964 mnist_loss: 2.30030 - mnist_acc: 0.16330 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 7/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0985 mnist_loss: 2.30038 - mnist_acc: 0.16020 cifar10_loss: 2.30257 - cifar10_acc: 0.09990 Epoch 8/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0992 mnist_loss: 2.29983 - mnist_acc: 0.16680 cifar10_loss: 2.30257 - cifar10_acc: 0.10000 Epoch 9/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0976 mnist_loss: 2.29999 - mnist_acc: 0.16140 cifar10_loss: 2.30257 - cifar10_acc: 0.10000 Epoch 10/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0999 mnist_loss: 2.29976 - mnist_acc: 0.16130 cifar10_loss: 2.30258 - cifar10_acc: 0.10010
Sadly it does not. I think the problem is that I removed the RGB channels when I averaged over the last axis. I will modify the model to accept 3 channels instead of 1 and then tile the mnist dataset on a new last axis.
mnist_train, mnist_test = keras.datasets.mnist.load_data()
cifar10_train, cifar10_test = keras.datasets.cifar10.load_data()
# crop the cifar10 images to (28,28), keeping the RGB channels this time
cifar10_train = cifar10_train[0][:, 2:-2, 2:-2], cifar10_train[1]
cifar10_test = cifar10_test[0][:, 2:-2, 2:-2], cifar10_test[1]
# insert and repeat 3 channels on greyscale mnist
mnist_train = tf.repeat(mnist_train[0][..., None], 3, axis=-1), mnist_train[1]
mnist_test = tf.repeat(mnist_test[0][..., None], 3, axis=-1), mnist_test[1]
mnist_train[0].shape, mnist_test[0].shape
# sanity check: make sure everything is shaped compatibly
for ds in [mnist_train, mnist_test, cifar10_train, cifar10_test]:
print(ds[0].shape, ds[1].shape)
(60000, 28, 28, 3) (60000,) (10000, 28, 28, 3) (10000,) (50000, 28, 28, 3) (50000, 1) (10000, 28, 28, 3) (10000, 1)
def model_builder(hp):
hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
hp_units2 = hp.Int('units2', min_value=32, max_value=512, step=32)
hp_conv_activation = hp.Choice('conv_activation', values=['relu', 'elu', 'tanh'])
hp_dense_activation = hp.Choice('dense_activation', values=['relu', 'elu', 'tanh'])
hp_num_conv_layers = hp.Int('num_conv_layers', min_value=0, max_value=3, step=1)
hp_num_dense_layers = hp.Int('num_dense_layers', min_value=0, max_value=2, step=1)
hp_dropout_factor = hp.Choice('dropout_factor', values=[0.0, 0.05, 0.1, 0.15, 0.2])
hp_optimizer = hp.Choice('optimizer', values=['SGD', 'Adam', 'Adadelta'])
model = keras.Sequential([
tfkl.Input(shape=(28, 28, 3)),
] + [
tfkl.Conv2D(filters=(3*1+1*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Conv2D(filters=(2*1+2*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Conv2D(filters=(1*1+3*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
][:hp_num_conv_layers] + [
tfkl.Flatten(),
] + [
tfkl.Dense(hp_units, activation=hp_dense_activation),
tfkl.Dense(hp_units2, activation=hp_dense_activation),
][:hp_num_dense_layers] + [
tfkl.Dropout(hp_dropout_factor),
tfkl.Dense(10, activation="softmax")
])
model.compile(loss='sparse_categorical_crossentropy',
optimizer=hp_optimizer, metrics=['accuracy'])
return model
model = model_builder(best_hps)
history = model.fit(
x=mnist_train[0],
y=mnist_train[1],
epochs=10,
verbose=2,
validation_data=mnist_test,
validation_batch_size=64,
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy']))
plt.title("accuracy")
plt.show()
Epoch 1/10 1875/1875 - 8s - loss: 0.1521 - accuracy: 0.9533 - val_loss: 0.0689 - val_accuracy: 0.9784 Epoch 2/10 1875/1875 - 8s - loss: 0.0467 - accuracy: 0.9856 - val_loss: 0.0649 - val_accuracy: 0.9802 Epoch 3/10 1875/1875 - 8s - loss: 0.0236 - accuracy: 0.9930 - val_loss: 0.0500 - val_accuracy: 0.9837 Epoch 4/10 1875/1875 - 8s - loss: 0.0131 - accuracy: 0.9962 - val_loss: 0.0496 - val_accuracy: 0.9843 Epoch 5/10 1875/1875 - 8s - loss: 0.0067 - accuracy: 0.9985 - val_loss: 0.0420 - val_accuracy: 0.9880 Epoch 6/10 1875/1875 - 8s - loss: 0.0035 - accuracy: 0.9995 - val_loss: 0.0434 - val_accuracy: 0.9874 Epoch 7/10 1875/1875 - 8s - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.0425 - val_accuracy: 0.9880 Epoch 8/10 1875/1875 - 8s - loss: 0.0012 - accuracy: 1.0000 - val_loss: 0.0429 - val_accuracy: 0.9886 Epoch 9/10 1875/1875 - 8s - loss: 0.0012 - accuracy: 0.9999 - val_loss: 0.0447 - val_accuracy: 0.9880 Epoch 10/10 1875/1875 - 8s - loss: 8.9660e-04 - accuracy: 0.9999 - val_loss: 0.0444 - val_accuracy: 0.9882
Ok. The mnist classifier is back on its feet, but now it can see in color. The model summary looks the same, except that the number of parameters in the first layer has increased:
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_6 (Conv2D) (None, 26, 26, 120) 3360
_________________________________________________________________
flatten_2 (Flatten) (None, 81120) 0
_________________________________________________________________
dense_6 (Dense) (None, 480) 38938080
_________________________________________________________________
dense_7 (Dense) (None, 160) 76960
_________________________________________________________________
dropout_2 (Dropout) (None, 160) 0
_________________________________________________________________
dense_8 (Dense) (None, 10) 1610
=================================================================
Total params: 39,020,010
Trainable params: 39,020,010
Non-trainable params: 0
_________________________________________________________________
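As a quick sanity check on that first-layer count (my own arithmetic, not an output of this notebook): with units=480 the builder gives the first convolution (3*1 + 1*480)//4 = 120 filters, and a 3x3 kernel over 3 input channels, plus biases, accounts for exactly the 3,360 parameters shown above:
filters = (3*1 + 1*480) // 4        # 120 filters, from the model_builder filter formula
params = 3*3*3*filters + filters    # kernel height * width * input channels * filters, plus biases
print(filters, params)              # 120 3360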
Let's try fine-tuning again; this time we'll train the first and last layers (but not any intermediate layers):
mnist_eval_cb = CustomEvaluation(mnist_test, 'mnist', verbose=1)
cifar10_eval_cb = CustomEvaluation(cifar10_test, 'cifar10', verbose=1)
# only let the first and last layer learn
for layer in model.layers:
layer.trainable = False
model.layers[0].trainable = True
model.layers[-1].trainable = True
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
epochs=10,
verbose=2,
callbacks=[mnist_eval_cb, cifar10_eval_cb]
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'cifar10_loss', 'mnist_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'cifar10_acc', 'mnist_acc']))
plt.title("accuracy")
plt.show()
Epoch 1/10 1563/1563 - 7s - loss: 2.7930 - accuracy: 0.0993 mnist_loss: 2.23506 - mnist_acc: 0.27090 cifar10_loss: 2.30257 - cifar10_acc: 0.10010 Epoch 2/10 1563/1563 - 6s - loss: 2.3028 - accuracy: 0.0971 mnist_loss: 2.23064 - mnist_acc: 0.25680 cifar10_loss: 2.30257 - cifar10_acc: 0.10010 Epoch 3/10 1563/1563 - 6s - loss: 2.3030 - accuracy: 0.0968 mnist_loss: 2.22431 - mnist_acc: 0.29550 cifar10_loss: 2.30257 - cifar10_acc: 0.09990 Epoch 4/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0972 mnist_loss: 2.25326 - mnist_acc: 0.24510 cifar10_loss: 2.30258 - cifar10_acc: 0.10000 Epoch 5/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0971 mnist_loss: 2.26340 - mnist_acc: 0.24070 cifar10_loss: 2.30257 - cifar10_acc: 0.09990 Epoch 6/10 1563/1563 - 6s - loss: 2.3031 - accuracy: 0.0971 mnist_loss: 2.23891 - mnist_acc: 0.24970 cifar10_loss: 2.30259 - cifar10_acc: 0.09990 Epoch 7/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0973 mnist_loss: 2.26161 - mnist_acc: 0.22930 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 8/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0987 mnist_loss: 2.25460 - mnist_acc: 0.22980 cifar10_loss: 2.30259 - cifar10_acc: 0.10010 Epoch 9/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0972 mnist_loss: 2.24977 - mnist_acc: 0.23900 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 10/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0981 mnist_loss: 2.24748 - mnist_acc: 0.24250 cifar10_loss: 2.30258 - cifar10_acc: 0.10010
It still looks like we'll have to train the whole model. You know the drill:
# unfreeze everything
for layer in model.layers:
layer.trainable = True
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
epochs=10,
verbose=2,
callbacks=[mnist_eval_cb, cifar10_eval_cb]
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'cifar10_loss', 'mnist_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'cifar10_acc', 'mnist_acc']))
plt.title("accuracy")
plt.show()
Epoch 1/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0981 mnist_loss: 2.25516 - mnist_acc: 0.23380 cifar10_loss: 2.30260 - cifar10_acc: 0.10020 Epoch 2/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0970 mnist_loss: 2.26369 - mnist_acc: 0.22390 cifar10_loss: 2.30261 - cifar10_acc: 0.10000 Epoch 3/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0944 mnist_loss: 2.24401 - mnist_acc: 0.24480 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 4/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0980 mnist_loss: 2.27201 - mnist_acc: 0.21970 cifar10_loss: 2.30258 - cifar10_acc: 0.09990 Epoch 5/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0979 mnist_loss: 2.27206 - mnist_acc: 0.22140 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 6/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0987 mnist_loss: 2.27167 - mnist_acc: 0.22160 cifar10_loss: 2.30259 - cifar10_acc: 0.10010 Epoch 7/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0978 mnist_loss: 2.27237 - mnist_acc: 0.22290 cifar10_loss: 2.30258 - cifar10_acc: 0.09990 Epoch 8/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0984 mnist_loss: 2.27085 - mnist_acc: 0.22390 cifar10_loss: 2.30259 - cifar10_acc: 0.10010 Epoch 9/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0977 mnist_loss: 2.27129 - mnist_acc: 0.22430 cifar10_loss: 2.30258 - cifar10_acc: 0.10010 Epoch 10/10 1563/1563 - 6s - loss: 2.3027 - accuracy: 0.0987 mnist_loss: 2.27180 - mnist_acc: 0.22000 cifar10_loss: 2.30259 - cifar10_acc: 0.10010
Still no improvement. Let's see if the hyperparameter tuner can help:
tuner = kt.Hyperband(model_builder,
objective='val_accuracy',
max_epochs=10,
factor=3,
directory='cifar10',
project_name='intro_to_kt')
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
tuner.search(cifar10_train[0], cifar10_train[1], epochs=50,
validation_split=0.2, callbacks=[stop_early])
# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
print('Hyperparameter search is complete. Best hyperparameters:', best_hps.values)
Trial 30 Complete [00h 01m 42s] val_accuracy: 0.49459999799728394 Best val_accuracy So Far: 0.5548999905586243 Total elapsed time: 00h 19m 53s INFO:tensorflow:Oracle triggered exit Hyperparameter search is complete. Best hyperparameters: {'units': 480, 'units2': 480, 'conv_activation': 'tanh', 'dense_activation': 'tanh', 'num_conv_layers': 3, 'num_dense_layers': 1, 'dropout_factor': 0.0, 'optimizer': 'Adadelta', 'tuner/epochs': 10, 'tuner/initial_epoch': 4, 'tuner/bracket': 2, 'tuner/round': 2, 'tuner/trial_id': '7c00f5b4ad4a41a7e4469bf866adfe6d'}
Still no improvement. Let's introduce some stronger translation-invariance priors. Concepts first, then code.
Translation invariance refers to the idea that object representations can ideally be factored into separate part-object and object-scene relationships, where the former do not change as the object moves around the scene. (Side note: neuroscience commonly refers to the ventral and dorsal visual pathways as the 'what' and 'where' streams.) A robust classifier should classify objects regardless of how they are framed within an image. To reach that lofty goal, machine learning researchers and engineers build various translation-invariant priors into their models. 2D convolution is the classic example: strictly speaking it is translation equivariant, and pooling on top of it is what buys invariance. Other such priors include:
- 1x1 convolution (i.e. a dense layer with shared weights applied at every pixel)
- MaxPooling
- MeanPooling
- GlobalMaxPooling
- Attention
- Normalization
Networks like ResNet and ViT also make extensive use of skip connections, highways, or residual connections. Basically, those terms mean we're going to break out of the 'sequential' mode of thinking and build acyclic connection topologies instead. There's too much to explain here, but I have included several references at the bottom of this notebook on the aforementioned topics. A minimal sketch of a residual block with a pooling-based head follows; after that, we'll give our hyperparameter tuner the chance to explore all of these priors.
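Here is that sketch (my own illustration, not code from the tuner below) of two of these ideas in the Keras functional API: a residual block whose branch output is added back onto the main stream, and a global-pooling head that discards absolute position before classification:
import tensorflow.keras as keras
import tensorflow.keras.layers as tfkl

inputs = tfkl.Input(shape=(32, 32, 3))
x = tfkl.Conv2D(64, (3, 3), padding='same', activation='relu')(inputs)
# residual block: process a branch, then add it back onto the main stream
branch = tfkl.BatchNormalization()(x)
branch = tfkl.Conv2D(64, (3, 3), padding='same', activation='relu')(branch)
branch = tfkl.Conv2D(64, (1, 1), padding='same')(branch)  # 1x1 conv = shared dense applied per pixel
x = tfkl.Add()([x, branch])
# global pooling keeps the 'what' and throws away the 'where' before the classifier
x = tfkl.GlobalAveragePooling2D()(x)
outputs = tfkl.Dense(10, activation='softmax')(x)
sketch_model = keras.Model(inputs, outputs)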
The Hyper Residual Attention Network¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as display
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as tfkl
import tensorflow.keras.datasets as datasets
!pip install -q einops
import einops
!pip install -q -U keras-tuner
import keras_tuner as kt
# load data
mnist_train, mnist_test = datasets.mnist.load_data()
cifar10_train, cifar10_test = datasets.cifar10.load_data()
mnist_train = mnist_train[0].astype('float32')/255., mnist_train[1]
mnist_test = mnist_test[0].astype('float32')/255., mnist_test[1]
cifar10_train = cifar10_train[0].astype('float32')/255., cifar10_train[1]
cifar10_test = cifar10_test[0].astype('float32')/255., cifar10_test[1]
# insert and repeat 3 channels on greyscale mnist
mnist_train = tf.repeat(mnist_train[0][..., None], 3, axis=-1), mnist_train[1]
mnist_test = tf.repeat(mnist_test[0][..., None], 3, axis=-1), mnist_test[1]
# squeeze the extra axis off the cifar10 labels
cifar10_train = cifar10_train[0], cifar10_train[1][..., 0]
cifar10_test = cifar10_test[0], cifar10_test[1][..., 0]
# sanity check: make sure datasets are shaped similarly with different H's and W's
for ds in [mnist_train, mnist_test, cifar10_train, cifar10_test]:
print(ds[0].shape, ds[1].shape)
(60000, 28, 28, 3) (60000,) (10000, 28, 28, 3) (10000,) (50000, 32, 32, 3) (50000,) (10000, 32, 32, 3) (10000,)
def hyper_resisudal_attention_model_builder(hp):
### declare hyperparameters ###
# input conv layer
hp_augment = hp.Choice('augment', values=[True, False])
hp_conv_act_input = hp.Choice('conv_act_input', values=['relu', 'tanh', 'linear', 'sigmoid'])
# residual depth
hp_res_units = hp.Int('res_units', min_value=8, max_value=128, step=8)
hp_not_residual = hp.Choice('not_residual', values=[True, False])
# residual conv hparams
hp_conv_block_size = hp.Choice('conv_block_size', values=[0, 1, 2, 3, 4])
hp_num_conv_blocks = hp.Choice('num_conv_blocks', values=[0, 1, 2, 3])
# first layer
hp_conv_type0 = hp.Choice('conv_type0', values=['conv_1x1', 'conv_3x3', 'conv_5x5', 'conv_mixed',
'attention'])
hp_conv_units0 = hp.Int('conv_units0', min_value=8, max_value=32, step=16)
hp_conv_act0 = hp.Choice('conv_act0', values=['relu', 'tanh', 'linear', 'sigmoid'])
# second layer
hp_conv_type1 = hp.Choice('conv_type1', values=['conv_1x1', 'conv_3x3', 'conv_5x5', 'conv_mixed',
'max_pool', 'mean_pool'])
hp_conv_units1 = hp.Int('conv_units1', min_value=8, max_value=32, step=16)
hp_conv_act1 = hp.Choice('conv_act1', values=['linear'])
# third layer
hp_conv_type2 = hp.Choice('conv_type2', values=['conv_1x1', 'conv_3x3', 'conv_5x5', 'conv_mixed',
'max_pool', 'mean_pool'])
hp_conv_units2 = hp.Int('conv_units2', min_value=8, max_value=32, step=16)
hp_conv_act2 = hp.Choice('conv_act2', values=['relu', 'linear'])
# fourth layer
hp_conv_type3 = hp.Choice('conv_type3', values=['max_pool', 'mean_pool'])
hp_conv_units3 = hp.Int('conv_units3', min_value=8, max_value=32, step=16)
hp_conv_act3 = hp.Choice('conv_act3', values=['linear'])
# final dense layer (1x1 conv)
hp_conv_act_final = hp.Choice('conv_act_final', values=['relu', 'linear'])
# convert image shaped tensor to vector representation
hp_collapse_type = hp.Choice('collapse_type', values=['global_max_pool', 'global_mean_pool', 'flatten'])
# residual dense hparams
hp_dense_block_size = hp.Choice('dense_block_size', values=[1, 2, 3])
hp_num_dense_blocks = hp.Choice('num_dense_blocks', values=[1, 2, 3])
hp_dense_units0 = hp.Int('dense_units0', min_value=8, max_value=512, step=16)
hp_dense_act0 = hp.Choice('dense_act0', values=['relu', 'tanh', 'linear', 'sigmoid'])
hp_dense_units1 = hp.Int('dense_units1', min_value=8, max_value=512, step=16)
hp_dense_act1 = hp.Choice('dense_act1', values=['relu', 'linear'])
hp_dense_units2 = hp.Int('dense_units2', min_value=8, max_value=512, step=16)
hp_dense_act2 = hp.Choice('dense_act2', values=['relu', 'linear'])
hp_dense_units3 = hp.Int('dense_units3', min_value=8, max_value=512, step=16)
hp_dense_act3 = hp.Choice('dense_act3', values=['relu', 'linear'])
hp_dense_act_final = hp.Choice('dense_act_final', values=['relu', 'linear'])
# optimizer hparams
hp_optimizer = hp.Choice('optimizer', values=['SGD', 'Adam'])
hp_learning_rate = hp.Choice('learning_rate', values=[0.0005, 0.001, 0.0025, 0.005, 0.01])
if hp_collapse_type == 'flatten':
hp_not_residual = True
### get helpers ready ###
# activation functions
activations = {
'sigmoid': tf.nn.sigmoid,
'relu': tf.nn.relu,
'tanh': tf.nn.tanh,
'linear': (lambda x: x),
}
# utility function for arbitrary image-tensor processing layers
def build_conv_layer(input_layer, layer_type, units, activation):
if layer_type == 'conv_1x1':
return tfkl.Conv2D(units, (1,1), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
elif layer_type == 'conv_3x3':
return tfkl.Conv2D(units, (3,3), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
elif layer_type == 'conv_5x5':
return tfkl.Conv2D(units, (5,5), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
elif layer_type == 'conv_mixed':
# not a precise implementation of vgg
conv1 = tfkl.Conv2D(units, (1,1), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
conv3 = tfkl.Conv2D(units, (3,3), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
conv5 = tfkl.Conv2D(units, (5,5), (1,1), 'same',
activation=activations[activation],
kernel_initializer='he_uniform')(input_layer)
cat = tfkl.Maximum()([conv1, conv3, conv5])
return cat
elif layer_type == 'max_pool':
return tfkl.UpSampling2D()(tfkl.MaxPool2D()(input_layer))
elif layer_type == 'mean_pool':
return tfkl.UpSampling2D()(tfkl.AveragePooling2D()(input_layer))
elif layer_type == 'attention':
# class Attn2D(tfkl.Layer):
#
# def __init__(self, patch_size=4, *args, **kwargs):
# super(Attn2D, self).__init__(*args, **kwargs)
# self.patch_size = patch_size
#
# def build(self, input_shape):
# self.attn_layer = tfkl.MultiHeadAttention(num_heads=4, key_dim=units, attention_axes=(1,2))
# super(Attn2D, self).build(input_shape)
# print(input_shape)
#
# def call(self, inputs):
# x = inputs
# orig_shape = tf.shape(inputs)
# attn_shape = (orig_shape[0],
# orig_shape[1]//self.patch_size,
# orig_shape[2]//self.patch_size,
# 4*orig_shape[3])
# x = tf.reshape(inputs, attn_shape)
# x = self.attn_layer(x, x)
# x = tf.reshape(x, orig_shape)
# attended = Attn2D(patch_size=4)(input_layer)
# @tf.function
# def upscale(x):
# tf.print('before', x.shape)
# x = einops.repeat(attended, f'b h w c -> b (h Hrepeat) (w Wrepeat) c',
# Hrepeat=attn_block_len, Wrepeat=attn_block_len)
# tf.print('after', x.shape)
# return x
# upscaled = tfkl.Lambda(upscale)(attended)
# upscaled = einops.repeat(attended, f'b h w c -> b (h repeat) (w repeat) c', repeat=attn_block_len)
attn_block_len = 4
downscaled = tfkl.AveragePooling2D((attn_block_len, attn_block_len))(input_layer)
attended = tfkl.MultiHeadAttention(num_heads=4, key_dim=units, attention_axes=(1,2)) \
(downscaled, downscaled, return_attention_scores=False) # [B, H/s, W/s, N]
upscaled = tfkl.UpSampling2D((attn_block_len, attn_block_len))(attended)
return upscaled
else:
raise ValueError(f'Layer type {layer_type} not supported')
# layers which turn [B, H, W, hp_res_units] into [B, hp_res_units]
collapse_layers = {
'global_max_pool': tfkl.GlobalMaxPool2D(),
'global_mean_pool': tfkl.GlobalAveragePooling2D(),
'flatten': tfkl.Flatten(),
}
optimizers = {
'SGD': keras.optimizers.SGD,
'RMSProp': keras.optimizers.RMSprop,
'Adam': keras.optimizers.Adam,
}
def Tagger(tag):
def lambda_log(x):
tf.print(tag, x.shape)
return x
return tfkl.Lambda(lambda_log)
### build model ###
any_input_shape = (None, None, 3) # accept input for arbitrarily sized images
fixed_input_shape = (32,32,3) # sadly the flatten layer outperforms globalmaxpooling
input_layer = tfkl.Input(fixed_input_shape)
if hp_augment:
preprocessed = keras.Sequential([
tfkl.RandomTranslation(
height_factor=0.2,
width_factor=0.2),
tfkl.RandomZoom(
height_factor=0.2,
width_factor=0.2),
tfkl.RandomRotation(0.2),
tfkl.RandomFlip(),
])(input_layer)
else:
preprocessed = input_layer
# start 2D residual stream
res_stream = tfkl.Conv2D(hp_res_units, (3,3), (1,1), 'same', use_bias=False,
activation=activations[hp_conv_act_input],
kernel_initializer='he_uniform')(preprocessed)
# make residual processing blocks
if hp_conv_block_size > 0 and hp_num_conv_blocks > 0:
for block_num in range(hp_num_conv_blocks):
with tf.name_scope(f'conv_block{block_num}'):
# normalize the input to handle deep residual networks
block_stream = tfkl.BatchNormalization()(res_stream)
# include `hp_conv_block_size` number of layers with their own hyperparameters
for i, layer_type, units, activation in zip(
list(range(hp_conv_block_size)),
[hp_conv_type0, hp_conv_type1, hp_conv_type2, hp_conv_type3],
[hp_conv_units0, hp_conv_units1, hp_conv_units2, hp_conv_units3],
[hp_conv_act0, hp_conv_act1, hp_conv_act2, hp_conv_act3],
):
with tf.name_scope(f'layer{i}'):
block_stream = build_conv_layer(block_stream, layer_type, units, activation)
# shortcircuit
if hp_not_residual:
res_stream = block_stream
continue
# fuse internal block stream into main residual stream
block_stream = tfkl.Conv2D(hp_res_units, (1,1), (1,1), 'same', use_bias=False,
activation=activations[hp_conv_act_final],
kernel_initializer='he_uniform')(block_stream)
res_stream = tfkl.Add()([res_stream, block_stream])
res_stream = tfkl.MaxPool2D((2,2))(res_stream)
# boost residual depth with each block
hp_res_units = 2 * hp_res_units
res_stream = tfkl.Conv2D(hp_res_units, (1,1), (1,1), 'same', use_bias=False,
activation=activations[hp_conv_act_final],
kernel_initializer='he_uniform')(res_stream)
# collapse 2D residual stream into a vector
res_stream = collapse_layers[hp_collapse_type](res_stream) # [B, H, W, C] -> [B, hp_res_units]
# make residual processing blocks
if hp_dense_block_size > 0 and hp_num_dense_blocks > 0:
for block_num in range(hp_num_dense_blocks):
with tf.name_scope(f'dense_block{block_num}'):
# normalize the input to handle deep residual networks
print(1, res_stream.shape)
block_stream = tfkl.LayerNormalization()(res_stream)
print(2, block_stream.shape)
# include `hp_dense_block_size` number of layers with their own hyperparameters
for i, units, activation in zip(
list(range(hp_dense_block_size)),
[hp_dense_units0, hp_dense_units1, hp_dense_units2, hp_dense_units3],
[hp_dense_act0, hp_dense_act1, hp_dense_act2, hp_dense_act3],
):
with tf.name_scope(f'layer{i}'):
block_stream = tfkl.Dense(units, activation=activations[activation])(block_stream)
# shortcircuit
if hp_not_residual:
res_stream = block_stream
continue
# fuse internal block stream into main residual stream
block_stream = tfkl.Dense(hp_res_units, activation=activations[hp_dense_act_final])(block_stream)
res_stream = tfkl.Add()([res_stream, block_stream])
# make linear classifier
classified = tfkl.Dense(10)(res_stream)
# build optimizer
optimizer = optimizers[hp_optimizer](hp_learning_rate)
# make, compile, and return model
model = keras.Model(inputs=input_layer, outputs=classified)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer,
metrics=['accuracy',
keras.metrics.SparseTopKCategoricalAccuracy(5, name='top-5-accuracy'),
keras.metrics.SparseTopKCategoricalAccuracy(2, name='top-2-accuracy')
])
return model
# test with arbitrary hyperparameters
hparams = kt.HyperParameters()
hparams.values = {
'augment': False,
'conv_act_input': 'relu',
'res_units': 40,
'not_residual': True,
'conv_block_size': 2,
'num_conv_blocks': 1,
'conv_type0': 'conv_3x3',
'conv_units0': 40,
'conv_act0': 'relu',
'conv_type1': 'max_pool',
'conv_units1': 40,
'conv_act1': 'relu',
'conv_type2': 'max_pool',
'conv_units2': 40,
'conv_act2': 'relu',
'conv_type3': 'max_pool',
'conv_units3': 256,
'conv_act3': 'relu',
'conv_act_final': 'relu',
'collapse_type': 'flatten',
'dense_block_size': 1,
'num_dense_blocks': 1,
'dense_units1': 128,
'dense_act1': 'sigmoid',
'dense_units2': 256,
'dense_act2': 'relu',
'dense_units3': 64,
'dense_act3': 'linear',
'dense_act_final': 'relu',
'optimizer': 'SGD',
'learning_rate': 0.001,
}
model = hyper_resisudal_attention_model_builder(hparams)
#model.summary(128)
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
batch_size=64,
validation_split=0.1,
epochs=3,
verbose=1,
)
# visualize training
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy',
'top-2-accuracy', 'val_top-2-accuracy',
'top-5-accuracy', 'val_top-5-accuracy' ]))
plt.title("accuracy")
plt.show()
#keras.utils.plot_model(model)
1 (None, 40960) 2 (None, 40960) Epoch 1/3 704/704 [==============================] - 5s 7ms/step - loss: 6.7000 - accuracy: 0.1003 - top-5-accuracy: 0.6132 - top-2-accuracy: 0.2427 - val_loss: 6.4022 - val_accuracy: 0.0970 - val_top-5-accuracy: 0.6470 - val_top-2-accuracy: 0.3164 Epoch 2/3 704/704 [==============================] - 5s 7ms/step - loss: 6.7141 - accuracy: 0.0977 - top-5-accuracy: 0.5434 - top-2-accuracy: 0.2246 - val_loss: 6.7187 - val_accuracy: 0.1058 - val_top-5-accuracy: 0.5118 - val_top-2-accuracy: 0.2096 Epoch 3/3 704/704 [==============================] - 5s 7ms/step - loss: 5.8834 - accuracy: 0.0994 - top-5-accuracy: 0.4992 - top-2-accuracy: 0.1989 - val_loss: 5.2743 - val_accuracy: 0.1058 - val_top-5-accuracy: 0.5124 - val_top-2-accuracy: 0.2096
After some manual experiments (none surpassed 0.2 accuracy), I tried throwing the model at the hyperparameter optimizer. The results were not any better.
tuner = kt.Hyperband(hyper_resisudal_attention_model_builder,
objective='val_accuracy',
max_epochs=10,
factor=3,
directory='hyper_resisudal_attention_model-1',
project_name='hyper_resisudal_attention_model-1')
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
tuner.search(cifar10_train[0], cifar10_train[1], epochs=50,
validation_split=0.2, callbacks=[stop_early])
# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
print('Hyperparameter search is complete. Best hyperparameters:', best_hps.values)
Trial 3 Complete [00h 00m 33s] val_accuracy: 0.11219999939203262 Best val_accuracy So Far: 0.1282999962568283 Total elapsed time: 00h 02m 08s Search: Running Trial #4 Hyperparameter |Value |Best Value So Far conv_act_input |linear |sigmoid res_units |32 |48 conv_block_size |1 |3 num_conv_blocks |1 |2 conv_type0 |conv_1x1 |conv_1x1 conv_units0 |24 |24 conv_act0 |sigmoid |sigmoid conv_type1 |max_pool |conv_3x3 conv_units1 |8 |8 conv_act1 |tanh |linear conv_type2 |max_pool |conv_1x1 conv_units2 |8 |24 conv_act2 |linear |linear conv_type3 |conv_1x1 |conv_mixed conv_units3 |24 |24 conv_act3 |relu |tanh conv_act_final |linear |tanh collapse_type |global_max_pool |global_mean_pool dense_block_size |3 |3 num_dense_blocks |5 |4 dense_units0 |472 |408 dense_act0 |sigmoid |tanh dense_units1 |408 |168 dense_act1 |relu |linear dense_units2 |56 |456 dense_act2 |linear |tanh dense_units3 |40 |360 dense_act3 |relu |tanh dense_act_final |relu |relu optimizer |RMSProp |SGD learning_rate |0.0025 |0.0025 tuner/epochs |4 |2 tuner/initial_e...|2 |0 tuner/bracket |2 |2 tuner/round |1 |0 tuner/trial_id |29d19f098949c88...|None
Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment'
Invalid model 0/5
Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment'
Invalid model 1/5 Invalid model 2/5 Invalid model 3/5
Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment' Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment' Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment'
Invalid model 4/5 Invalid model 5/5
Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py", line 127, in build model = self.hypermodel.build(hp) File "<ipython-input-214-aa31e6d19965>", line 6, in hyper_resisudal_attention_model_builder hp_augment = hp.Choice('augment', values=[True, False]) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 798, in Choice return self._retrieve(hp) File "/usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py", line 707, in _retrieve return self.values[hp.name] KeyError: 'augment'
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py in build(self, hp) 126 with maybe_distribute(self.distribution_strategy): --> 127 model = self.hypermodel.build(hp) 128 except: <ipython-input-214-aa31e6d19965> in hyper_resisudal_attention_model_builder(hp) 5 # input conv layer ----> 6 hp_augment = hp.Choice('augment', values=[True, False]) 7 hp_conv_act_input = hp.Choice('conv_act_input', values=['relu', 'tanh', 'linear', 'sigmoid']) /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py in Choice(self, name, values, ordered, default, parent_name, parent_values) 797 ) --> 798 return self._retrieve(hp) 799 /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hyperparameters.py in _retrieve(self, hp) 706 if self._conditions_are_active(hp.conditions): --> 707 return self.values[hp.name] 708 return None # Ensures inactive values are not relied on by user. KeyError: 'augment' During handling of the above exception, another exception occurred: RuntimeError Traceback (most recent call last) <ipython-input-215-ce793d92a41c> in <module>() 7 stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5) 8 tuner.search(cifar10_train[0], cifar10_train[1], epochs=50, ----> 9 validation_split=0.2, callbacks=[stop_early]) 10 11 # Get the optimal hyperparameters /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/base_tuner.py in search(self, *fit_args, **fit_kwargs) 174 175 self.on_trial_begin(trial) --> 176 self.run_trial(trial, *fit_args, **fit_kwargs) 177 self.on_trial_end(trial) 178 self.on_search_end() /usr/local/lib/python3.7/dist-packages/keras_tuner/tuners/hyperband.py in run_trial(self, trial, *fit_args, **fit_kwargs) 368 fit_kwargs["epochs"] = hp.values["tuner/epochs"] 369 fit_kwargs["initial_epoch"] = hp.values["tuner/initial_epoch"] --> 370 super(Hyperband, self).run_trial(trial, *fit_args, **fit_kwargs) 371 372 def _build_model(self, hp): /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/multi_execution_tuner.py in run_trial(self, trial, *fit_args, **fit_kwargs) 88 copied_fit_kwargs["callbacks"] = callbacks 89 ---> 90 history = self._build_and_fit_model(trial, fit_args, copied_fit_kwargs) 91 for metric, epoch_values in history.history.items(): 92 if self.oracle.objective.direction == "min": /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/tuner.py in _build_and_fit_model(self, trial, fit_args, fit_kwargs) 146 The fit history. 147 """ --> 148 model = self.hypermodel.build(trial.hyperparameters) 149 return model.fit(*fit_args, **fit_kwargs) 150 /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py in _build_wrapper(self, hp, *args, **kwargs) 82 # to the search space. 83 hp = hp.copy() ---> 84 return self._build(hp, *args, **kwargs) 85 86 /usr/local/lib/python3.7/dist-packages/keras_tuner/engine/hypermodel.py in build(self, hp) 133 134 if i == self._max_fail_streak: --> 135 raise RuntimeError("Too many failed attempts to build model.") 136 continue 137 RuntimeError: Too many failed attempts to build model.
It may be time to back out of this search track and fall back on the older model. I'm going to run it with a little more training and call that final:
def model_builder(hp):
hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
hp_units2 = hp.Int('units2', min_value=32, max_value=512, step=32)
hp_conv_activation = hp.Choice('conv_activation', values=['relu', 'elu', 'tanh'])
hp_dense_activation = hp.Choice('dense_activation', values=['relu', 'elu', 'tanh'])
hp_num_conv_layers = hp.Int('num_conv_layers', min_value=0, max_value=3, step=1)
hp_num_dense_layers = hp.Int('num_dense_layers', min_value=0, max_value=2, step=1)
hp_dropout_factor = hp.Choice('dropout_factor', values=[0.0, 0.05, 0.1, 0.15, 0.2])
hp_optimizer = hp.Choice('optimizer', values=['SGD', 'Adam', 'Adadelta'])
model = keras.Sequential([
tfkl.Input(shape=(32, 32, 3)),
tfkl.BatchNormalization(),
tfkl.Conv2D(filters=(3*1+1*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Conv2D(filters=(2*1+2*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.MaxPool2D((2,2)),
tfkl.BatchNormalization(),
tfkl.Conv2D(filters=(1*1+3*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
tfkl.Flatten(),
tfkl.Dense(hp_units, activation=hp_dense_activation),
tfkl.Dense(10, activation="softmax")
])
model.compile(loss='sparse_categorical_crossentropy',
optimizer=hp_optimizer, metrics=['accuracy'])
return model
best_hps = kt.HyperParameters()
best_hps.values = {
'units': 480,
'units2': 480,
'conv_activation': 'tanh',
'dense_activation': 'tanh',
'optimizer': 'Adadelta',
}
model = model_builder(best_hps)
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
batch_size=64,
epochs=50,
verbose=1,
validation_data=cifar10_test,
validation_batch_size=64,
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy']))
plt.title("accuracy")
plt.show()
Epoch 1/50 782/782 [==============================] - 16s 20ms/step - loss: 1.7817 - accuracy: 0.3760 - val_loss: 1.6082 - val_accuracy: 0.4456 Epoch 2/50 782/782 [==============================] - 15s 20ms/step - loss: 1.5368 - accuracy: 0.4723 - val_loss: 1.4981 - val_accuracy: 0.4827 Epoch 3/50 782/782 [==============================] - 15s 20ms/step - loss: 1.4296 - accuracy: 0.5137 - val_loss: 1.4211 - val_accuracy: 0.5062 Epoch 4/50 782/782 [==============================] - 16s 20ms/step - loss: 1.3418 - accuracy: 0.5455 - val_loss: 1.3625 - val_accuracy: 0.5282 Epoch 5/50 782/782 [==============================] - 15s 20ms/step - loss: 1.2695 - accuracy: 0.5748 - val_loss: 1.3039 - val_accuracy: 0.5524 Epoch 6/50 782/782 [==============================] - 15s 20ms/step - loss: 1.2059 - accuracy: 0.5953 - val_loss: 1.2565 - val_accuracy: 0.5671 Epoch 7/50 782/782 [==============================] - 15s 20ms/step - loss: 1.1523 - accuracy: 0.6147 - val_loss: 1.2173 - val_accuracy: 0.5795 Epoch 8/50 782/782 [==============================] - 16s 20ms/step - loss: 1.1067 - accuracy: 0.6332 - val_loss: 1.1889 - val_accuracy: 0.5882 Epoch 9/50 782/782 [==============================] - 16s 20ms/step - loss: 1.0673 - accuracy: 0.6456 - val_loss: 1.1591 - val_accuracy: 0.6015 Epoch 10/50 782/782 [==============================] - 15s 20ms/step - loss: 1.0296 - accuracy: 0.6598 - val_loss: 1.1365 - val_accuracy: 0.6074 Epoch 11/50 782/782 [==============================] - 16s 20ms/step - loss: 0.9985 - accuracy: 0.6713 - val_loss: 1.1185 - val_accuracy: 0.6140 Epoch 12/50 782/782 [==============================] - 15s 20ms/step - loss: 0.9713 - accuracy: 0.6800 - val_loss: 1.0963 - val_accuracy: 0.6204 Epoch 13/50 782/782 [==============================] - 15s 20ms/step - loss: 0.9424 - accuracy: 0.6936 - val_loss: 1.0849 - val_accuracy: 0.6206 Epoch 14/50 782/782 [==============================] - 15s 20ms/step - loss: 0.9197 - accuracy: 0.7005 - val_loss: 1.0683 - val_accuracy: 0.6316 Epoch 15/50 781/782 [============================>.] - ETA: 0s - loss: 0.8968 - accuracy: 0.7096
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-18-1bcff8f4ef3c> in <module>() 45 verbose=1, 46 validation_data=cifar10_test, ---> 47 validation_batch_size=64, 48 ) 49 /usr/local/lib/python3.7/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing) 1224 use_multiprocessing=use_multiprocessing, 1225 return_dict=True, -> 1226 _use_cached_eval_dataset=True) 1227 val_logs = {'val_' + name: val for name, val in val_logs.items()} 1228 epoch_logs.update(val_logs) /usr/local/lib/python3.7/dist-packages/keras/engine/training.py in evaluate(self, x, y, batch_size, verbose, sample_weight, steps, callbacks, max_queue_size, workers, use_multiprocessing, return_dict, **kwargs) 1499 with tf.profiler.experimental.Trace('test', step_num=step, _r=1): 1500 callbacks.on_test_batch_begin(step) -> 1501 tmp_logs = self.test_function(iterator) 1502 if data_handler.should_sync: 1503 context.async_wait() /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds) 883 884 with OptionalXlaContext(self._jit_compile): --> 885 result = self._call(*args, **kwds) 886 887 new_tracing_count = self.experimental_get_tracing_count() /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds) 922 # In this case we have not created variables on the first call. So we can 923 # run the first trace but we should fail if variables are created. --> 924 results = self._stateful_fn(*args, **kwds) 925 if self._created_variables and not ALLOW_DYNAMIC_VARIABLE_CREATION: 926 raise ValueError("Creating variables on a non-first call to a function" /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs) 3038 filtered_flat_args) = self._maybe_define_function(args, kwargs) 3039 return graph_function._call_flat( -> 3040 filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access 3041 3042 @property /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager) 1962 # No tape is watching; skip to running the function. 1963 return self._build_call_outputs(self._inference_function.call( -> 1964 ctx, args, cancellation_manager=cancellation_manager)) 1965 forward_backward = self._select_forward_and_backward_functions( 1966 args, /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager) 594 inputs=args, 595 attrs=attrs, --> 596 ctx=ctx) 597 else: 598 outputs = execute.execute_with_cancellation( /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name) 58 ctx.ensure_initialized() 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, ---> 60 inputs, attrs, num_outputs) 61 except core._NotOkStatusException as e: 62 if name is not None: KeyboardInterrupt:
I interrupted that run. Next, I'm going to try a network with wider layers and add some data augmentation to get more out of the dataset:
model = keras.Sequential([
    tfkl.Input(shape=(32, 32, 3)),
    # data augmentation layers: these only perturb inputs during training
    tfkl.RandomTranslation(
        height_factor=0.2,
        width_factor=0.2),
    tfkl.RandomZoom(
        height_factor=0.2,
        width_factor=0.2),
    tfkl.RandomRotation(0.2),
    tfkl.RandomFlip(),
    # wider convolutional trunk than before
    tfkl.Conv2D(filters=128, kernel_size=(3,3), activation='tanh'),
    tfkl.BatchNormalization(),
    tfkl.Conv2D(filters=128, kernel_size=(3,3), activation='tanh'),
    tfkl.MaxPool2D((2,2)),
    tfkl.Flatten(),
    tfkl.Dense(64, activation='tanh'),
    tfkl.Dense(10, activation="softmax")
])
model.compile(loss='sparse_categorical_crossentropy',
optimizer='SGD', metrics=['accuracy'])
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
batch_size=64,
epochs=5,
verbose=1,
validation_data=cifar10_test,
validation_batch_size=64,
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy']))
plt.title("accuracy")
plt.show()
Epoch 1/5 782/782 [==============================] - 10s 11ms/step - loss: 1.9807 - accuracy: 0.2866 - val_loss: 1.9366 - val_accuracy: 0.3048 Epoch 2/5 782/782 [==============================] - 9s 11ms/step - loss: 1.8617 - accuracy: 0.3275 - val_loss: 2.2440 - val_accuracy: 0.2452 Epoch 3/5 782/782 [==============================] - 9s 11ms/step - loss: 1.8053 - accuracy: 0.3452 - val_loss: 1.9037 - val_accuracy: 0.3239 Epoch 4/5 782/782 [==============================] - 9s 11ms/step - loss: 1.7721 - accuracy: 0.3596 - val_loss: 1.8500 - val_accuracy: 0.3320 Epoch 5/5 782/782 [==============================] - 9s 11ms/step - loss: 1.7458 - accuracy: 0.3710 - val_loss: 2.0825 - val_accuracy: 0.2981
Interestingly, the data augmentation layers can push validation accuracy above training accuracy (as in epoch 1 here), because the random augmentations are only applied during training, so evaluation sees clean images. Overall, though, the training profile is noisy: validation loss bounces around from epoch to epoch and accuracy is still low.
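To illustrate that training-only behavior, here is a minimal sketch (assuming the same tfkl alias used throughout this notebook): Keras preprocessing layers such as RandomFlip perturb their input only when called with training=True and act as the identity otherwise.
import tensorflow as tf
import tensorflow.keras.layers as tfkl

# Minimal sketch: augmentation layers are only active in training mode.
flip = tfkl.RandomFlip()
x = tf.random.uniform((1, 32, 32, 3))

augmented = flip(x, training=True)     # may be flipped
passthrough = flip(x, training=False)  # identity at inference time

print(bool(tf.reduce_all(x == passthrough)))  # True: input is unchanged
When the model runs through fit, these layers receive training=True; evaluate and predict call them with training=False, which is why the validation pass sees unaugmented images.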
The due date (10 minutes!) is coming up, so I'm going to cut this notebook short. Let's train the hyperparameter-tuned network below for the full 50 epochs and then evaluate it on cifar10_test:
def model_builder(hp):
    # hyperparameter search space (some of these are declared but not yet
    # wired into the architecture below, e.g. units2, the layer counts,
    # and the dropout factor)
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    hp_units2 = hp.Int('units2', min_value=32, max_value=512, step=32)
    hp_conv_activation = hp.Choice('conv_activation', values=['relu', 'elu', 'tanh'])
    hp_dense_activation = hp.Choice('dense_activation', values=['relu', 'elu', 'tanh'])
    hp_num_conv_layers = hp.Int('num_conv_layers', min_value=0, max_value=3, step=1)
    hp_num_dense_layers = hp.Int('num_dense_layers', min_value=0, max_value=2, step=1)
    hp_dropout_factor = hp.Choice('dropout_factor', values=[0.0, 0.05, 0.1, 0.15, 0.2])
    hp_optimizer = hp.Choice('optimizer', values=['SGD', 'Adam', 'Adadelta'])

    model = keras.Sequential([
        tfkl.Input(shape=(32, 32, 3)),
        tfkl.BatchNormalization(),
        # filter counts interpolate linearly from 1 toward hp_units
        # as the conv stack gets deeper
        tfkl.Conv2D(filters=(3*1+1*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
        tfkl.Conv2D(filters=(2*1+2*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
        tfkl.MaxPool2D((2,2)),
        tfkl.BatchNormalization(),
        tfkl.Conv2D(filters=(1*1+3*hp_units)//4, kernel_size=(3,3), activation=hp_conv_activation),
        tfkl.Flatten(),
        tfkl.Dense(hp_units, activation=hp_dense_activation),
        tfkl.Dense(10, activation="softmax")
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=hp_optimizer, metrics=['accuracy'])
    return model
best_hps = kt.HyperParameters()
best_hps.values = {
'units': 480,
'units2': 480,
'conv_activation': 'tanh',
'dense_activation': 'tanh',
'optimizer': 'Adadelta',
}
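These values were presumably found by an earlier tuner run. As a point of reference, here is a minimal, hedged sketch of how a keras-tuner search over model_builder's space could produce comparable hyperparameters (assuming kt is keras_tuner; the directory and project names are just illustrative, not from the original run):
import keras_tuner as kt  # assumed to be the `kt` alias used above

# Hypothetical search; 'kt_cifar10' and 'search' are illustrative names.
tuner = kt.RandomSearch(
    model_builder,
    objective='val_accuracy',
    max_trials=10,
    directory='kt_cifar10',
    project_name='search',
)
tuner.search(
    cifar10_train[0], cifar10_train[1],
    epochs=3,
    batch_size=64,
    validation_split=0.1,
)
searched_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(searched_hps.values)  # compare against the hand-set values above
For this run, though, the hand-set best_hps above are what feed model_builder below.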
model = model_builder(best_hps)
history = model.fit(
x=cifar10_train[0],
y=cifar10_train[1],
batch_size=64,
epochs=50,
verbose=1,
validation_data=cifar10_test,
validation_batch_size=64,
)
data = pd.DataFrame(history.history)
sns.lineplot(data=data.filter(['loss', 'val_loss']))
plt.title("loss")
plt.show()
sns.lineplot(data=data.filter(['accuracy', 'val_accuracy']))
plt.title("accuracy")
plt.show()
Epoch 1/50 782/782 [==============================] - 16s 20ms/step - loss: 1.7603 - accuracy: 0.3836 - val_loss: 1.6039 - val_accuracy: 0.4469 Epoch 2/50 782/782 [==============================] - 16s 20ms/step - loss: 1.5296 - accuracy: 0.4749 - val_loss: 1.5090 - val_accuracy: 0.4809 Epoch 3/50 782/782 [==============================] - 16s 20ms/step - loss: 1.4248 - accuracy: 0.5146 - val_loss: 1.4216 - val_accuracy: 0.5137 Epoch 4/50 782/782 [==============================] - 16s 20ms/step - loss: 1.3406 - accuracy: 0.5459 - val_loss: 1.3630 - val_accuracy: 0.5339 Epoch 5/50 782/782 [==============================] - 16s 20ms/step - loss: 1.2669 - accuracy: 0.5717 - val_loss: 1.3045 - val_accuracy: 0.5509 Epoch 6/50 782/782 [==============================] - 16s 20ms/step - loss: 1.2027 - accuracy: 0.5956 - val_loss: 1.2608 - val_accuracy: 0.5646 Epoch 7/50 782/782 [==============================] - 16s 20ms/step - loss: 1.1504 - accuracy: 0.6143 - val_loss: 1.2222 - val_accuracy: 0.5795 Epoch 8/50 782/782 [==============================] - 16s 20ms/step - loss: 1.1036 - accuracy: 0.6319 - val_loss: 1.1934 - val_accuracy: 0.5898 Epoch 9/50 782/782 [==============================] - 16s 20ms/step - loss: 1.0637 - accuracy: 0.6475 - val_loss: 1.1616 - val_accuracy: 0.6012 Epoch 10/50 782/782 [==============================] - 16s 20ms/step - loss: 1.0285 - accuracy: 0.6583 - val_loss: 1.1374 - val_accuracy: 0.6080 Epoch 11/50 782/782 [==============================] - 16s 20ms/step - loss: 0.9945 - accuracy: 0.6737 - val_loss: 1.1215 - val_accuracy: 0.6106 Epoch 12/50 782/782 [==============================] - 15s 20ms/step - loss: 0.9674 - accuracy: 0.6840 - val_loss: 1.0969 - val_accuracy: 0.6241 Epoch 13/50 782/782 [==============================] - 16s 20ms/step - loss: 0.9402 - accuracy: 0.6928 - val_loss: 1.0846 - val_accuracy: 0.6278 Epoch 14/50 782/782 [==============================] - 16s 20ms/step - loss: 0.9146 - accuracy: 0.7012 - val_loss: 1.0680 - val_accuracy: 0.6308 Epoch 15/50 782/782 [==============================] - 16s 20ms/step - loss: 0.8935 - accuracy: 0.7108 - val_loss: 1.0573 - val_accuracy: 0.6361 Epoch 16/50 782/782 [==============================] - 16s 20ms/step - loss: 0.8717 - accuracy: 0.7187 - val_loss: 1.0443 - val_accuracy: 0.6383 Epoch 17/50 782/782 [==============================] - 15s 20ms/step - loss: 0.8528 - accuracy: 0.7242 - val_loss: 1.0345 - val_accuracy: 0.6442 Epoch 18/50 782/782 [==============================] - 16s 20ms/step - loss: 0.8325 - accuracy: 0.7326 - val_loss: 1.0238 - val_accuracy: 0.6469 Epoch 19/50 712/782 [==========================>...] - ETA: 1s - loss: 0.8161 - accuracy: 0.7390
83%. EDIT: 73% (I was rushing to submit before the due date and completely misreported my model's performance.) Awesome! This was a long notebook. I hope you enjoyed it. If you walked through it yourself and ran the exercises, take a moment to congratulate yourself. What would you move to the inner optimization loop? What would you keep in the slow lane? Would you make the gradient optimizer differentiable as well? Would you run a second-order Hessian optimizer on the final layers? Please share your thoughts with GPT-3 and me.
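To avoid misreading the progress bar the way I did, the number worth reporting is the held-out test accuracy. Here is a minimal sketch of how to check it, assuming the trained model and the cifar10_test tuple defined earlier:
# Evaluate on the held-out test split; with metrics=['accuracy'],
# the second returned value is the test accuracy.
test_loss, test_acc = model.evaluate(
    cifar10_test[0], cifar10_test[1],
    batch_size=64, verbose=0,
)
print(f"test accuracy: {test_acc:.3f}")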
More reading¶
- Understanding Intermediate Layers Using Linear Classifier Probes
- Attention Is All You Need
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Deep Residual Learning for Image Recognition
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift