Cross the Finish Line with Keras Sequential (Neural Network) Modeling

by Jordan Barr, Lead Data Scientist, Elder Research


If this describes your situation, then trying Keras sequential models could be your next best step, especially when combined with the other best practices I describe below. With Keras, you may quickly find that extra 10% of performance without a huge investment of time and resources. With luck, cutting your losses, taking no action, and maintaining status quo operations won't prove necessary. In my own experience, Keras has more often than not delivered enough incremental model improvement to validate my business and analytic intuition and push my models across the finish line into deployment and profitability.

Tips and Tricks

glmnet, and variable selection and reduction techniques such as Boruta can help to improve performance on unseen data by reducing the effective degrees of freedom in the model, thereby reducing the chance of overfitting. This strengthens confidence that our model will perform as expected when deployed. If regularization and feature engineering are insufficient, you can try neural networks, which provide a rich set of potential nonlinear relationships. While neural networks are renowned for their potential for additional accuracy, they are notorious for being black boxes whose results cannot be interpreted.

But what does a client mean when they demand interpretability? A critic of black box models decries the lack of a functional or mechanistic connection between inputs and output. To them, interpretability means traversing the branches in a decision tree to read out a rule, or inspecting the sign and magnitude of regression coefficients to know the weight and direction of each factor. However, a pragmatic definition of the interpretability some are seeking means: Can I easily poke the model (i.e. change the inputs) and get the output response? If the answer is “yes”, then the user has the feeling of interpretability, and that may be all that matters!
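That pragmatic definition is easy to make concrete. The sketch below pokes a model by nudging one input and reporting the output response; the `model_predict` lambda is a hypothetical stand-in for any trained model's predict function, not code from any particular library:

```python
def probe(model_predict, x, feature_index, delta):
    """Poke the model: nudge one input feature and report the output response."""
    x_perturbed = list(x)
    x_perturbed[feature_index] += delta
    return model_predict(x_perturbed) - model_predict(list(x))

# Hypothetical stand-in for a trained model's predict function.
model_predict = lambda v: 2.0 * v[0] + 3.0 * v[1]

response = probe(model_predict, [1.0, 1.0], feature_index=1, delta=0.5)
# The output moved by 1.5, so feature 1 pushes predictions upward.
```

Any model, a Keras network included, supports this kind of probing through its predict method, and for many stakeholders that is all the interpretability they are really asking for.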

Keras for Neural Networks

The user has great flexibility in regularizing the model to prevent overfitting by controlling the training duration (via the number of epochs) and adjusting the node dropout rate within each hidden layer of the network. Note that dropout is short for randomly “dropping out”, or omitting, both hidden and visible nodes within a neural network during model training. In addition, the user can easily modify the structure by adding or subtracting layers and by specifying the number of nodes in each layer. With so many tunable features to choose from, don’t lose sight of the end goal: building a model that performs well on out-of-sample (unseen) data. Here are some practical guidelines that worked well in my tests.
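Before those guidelines, it may help to see dropout's mechanism itself. Here is a minimal numpy sketch of inverted dropout, an illustration of the idea only (not Keras's internal implementation); the 40% rate is an arbitrary choice for the demo:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` during training,
    rescaling the survivors so the expected activation is unchanged."""
    keep = rng.random(activations.shape) >= rate
    return np.where(keep, activations / (1.0 - rate), 0.0)

rng = np.random.default_rng(1)
a = np.ones(10_000)                  # pretend layer activations
out = dropout(a, rate=0.4, rng=rng)
# Roughly 40% of units are zeroed; the rest are scaled by 1 / (1 - 0.4),
# so the mean activation stays close to the original.
```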

  • Choose a validation split fraction of 20% to 30% along with early stopping. By tracking model performance in real time, I could quickly learn how many epochs were needed to train the model without overfitting. When I did not have a good choice of model structure (number of layers and nodes per layer), I could fail early and move on.
  • Choose an initial model structure of moderate complexity (not too simple, not too complex). After reading the discussion here, I experimented with three hidden layers, each having between 4 and 128 nodes. The best starting number of nodes will depend on the number of inputs and the number of cases in your data set.
  • Vary dropout rate while keeping structure constant. With the structure set (above), I tried varying the dropout rate within hidden layers between 20% and 80% and discovered that 40% to 50% generally worked best. A structure with more nodes per layer generally benefitted from a higher dropout rate. This result was not surprising; a more complex model with more free parameters requires more regularization via dropout to generalize well out of sample.
  • Final model build stage. Once I was happy with the structure of my Keras model and the amount of training, I removed the validation split and trained on all available training data using the number of epochs determined from the tests above. Then, I validated my model on a final holdout set. In my application, the holdout set included only data collected after the most recent training data, which gave me the best chance of producing a model that will perform well on unseen data. For data where the time stamp is less important, exercise restraint by reserving a holdout dataset that you will use only for final (or near-final) model validation purposes.
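The guidelines above can be sketched end to end. This is a hedged illustration using TensorFlow's Keras API on synthetic data; the layer sizes, 40% dropout rate, and 25% validation split are placeholder choices for the demo, not recommendations for your data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic binary-classification data standing in for a real problem.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.4),          # 40%-50% generally worked best in my tests
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 20%-30% validation split plus early stopping, per the guidelines above.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
history = model.fit(X, y, validation_split=0.25, epochs=50,
                    callbacks=[early_stop], verbose=0)

# Number of epochs actually run; reuse it when retraining on all data
# (without the validation split) for the final build.
best_epochs = len(history.history["val_loss"])
```

For the final build stage you would then call `model.fit(X, y, epochs=best_epochs)` on the full training set and score the result against your reserved holdout set.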

Other Considerations

Let’s say you followed the steps I advise above and have implemented Keras to solve your regression or classification problem. It performs better than your baseline models (decision trees, logistic regression, etc.), but you are still a few percentage points short of the finish line. Now what? First, consider your choice of loss function. For binary classification, binary cross entropy is a good choice, while mean absolute error (MAE) is a good choice for regression. In Keras, an optimizer must also be set, and Adam (a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments) has proven to be a good choice in my experiments with sequential modeling. Trying multiple loss functions and optimizers is a relatively straightforward process and should be attempted first. If these do not prove fruitful, ensembling is another good option: multiple Keras models may be built by bagging and their results averaged. Some amount of individual model overfit is acceptable in ensembles, since weak learners can combine to generalize well. Ensembling and its role within the Keras sequential modeling framework remains an area of active research. Learn more about the power and promise of model ensembles here.
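Bagging can be sketched without the cost of repeated Keras training runs. In the illustration below, an ordinary-least-squares fit is a hypothetical stand-in for one Keras model (swapped in to keep the sketch fast); each ensemble member trains on a bootstrap sample and the member predictions are averaged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def fit_base_model(Xb, yb):
    # Stand-in for training one Keras model: ordinary least squares here.
    coef, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    return coef

n_models = 10
member_preds = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (bagging)
    coef = fit_base_model(X[idx], y[idx])
    member_preds.append(X @ coef)

# Averaging smooths out each member's individual overfit.
ensemble_pred = np.mean(member_preds, axis=0)
```

With Keras members, the loop body would instead build, fit, and `predict` a fresh Sequential model on each bootstrap sample.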

Building a Keras model is not difficult, though ensuring it is high-performing can be daunting. The model building process requires domain expertise and some intuition regarding the interplay of multiple hyperparameters (number of layers and nodes, number of epochs, etc.). While I have set out some guidelines for tuning these hyperparameters, you may discover my suggested processes insufficient to meet your needs. If you still find yourself a bit short of the finish line, consider wrapping a global search around Keras hyperparameters. Dr. Elder’s Global Rd Optimization when Probes are Expensive (GROPE) algorithm is a great candidate for performing that search. Here, obtaining a single probe is “expensive” because it involves training and evaluating a Keras neural network.
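If GROPE is not available to you, even a simple random search over a hyperparameter grid captures the spirit of budgeting expensive probes. The `evaluate` function below is a hypothetical stand-in for training and scoring one Keras model at a given setting; its shape and the grid values are invented for the sketch:

```python
import random

def evaluate(params):
    # Hypothetical probe: pretend holdout error that happens to favor
    # ~64 nodes and ~0.45 dropout. In practice this would train and
    # score a Keras model, which is what makes each probe expensive.
    return (params["nodes"] - 64) ** 2 / 1000 + (params["dropout"] - 0.45) ** 2

search_space = {"nodes": [16, 32, 64, 128], "dropout": [0.2, 0.3, 0.4, 0.5]}

random.seed(0)
best_params, best_score = None, float("inf")
for _ in range(12):                      # probes are expensive, so budget them
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(params)
    if score < best_score:
        best_params, best_score = params, score
```

GROPE improves on this by modeling the response surface so each new probe is placed where it is most informative, rather than at random.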

Closing Remarks

Originally published on April 2, 2021.

A leading consulting company in data science, machine learning, and AI. Transforming data and domain knowledge to deliver business value and analytics ROI.