A *sequence-to-one* model is a model which takes a sequence of data as input and predicts a single value. A well-known application of such models is the classification of the sentiment of a sentence as positive or negative. In the following post, I will go over how to build a sequence-to-one model in Flux.

I will use a toy example somewhat similar to that of the seminal paper of Hochreiter and Schmidhuber (1997), which introduces Long Short-Term Memory (LSTM) networks.

## The toy example

Suppose we observe a succession of letters drawn from *A*, *B*, *X*, *Y*, and *Z*. The sequence can contain any number of *X*, *Y*, and *Z*, but exactly two of its letters are *A* or *B*: one *A* and one *B*, two *A*’s, or two *B*’s. We are interested in the order in which *A* and *B* appear in the sequence, i.e., we want to find out whether the series contains *A → B*, *B → A*, *A → A*, or *B → B*. You can think of *X*, *Y*, and *Z* as noise.

A few observations of this example sequence might look like this:

Sequence | Label |
---|---|
XBXZAXXZ | B → A |
XXAZYZBX | A → B |
ZYZBXXXYZBZX | B → B |
YYYZXAZXAZ | A → A |

Thus, we have four different labels in total. While determining the correct label is a childishly easy task for a human, we will see that it is daunting for a feedforward neural network. Luckily, recurrent neural networks can help us tackle this problem.
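
To make the labelling rule concrete, here is a minimal sketch in plain Julia. The helper `label_of` is hypothetical and not part of the original code; it simply drops the noise letters and keeps the *A*'s and *B*'s in their order of appearance.

```julia
# Hypothetical helper (not from the original post): the label of a sequence
# is what remains once the noise letters X, Y, and Z are removed.
label_of(seq::AbstractString) = filter(c -> c in ('A', 'B'), seq)

label_of("XBXZAXXZ")      # "BA"
label_of("YYYZXAZXAZ")    # "AA"
```

A rule this short is exactly why the task is trivial for a human: only the relative order of two letters matters, wherever they sit in the sequence.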

## Encoding the sequence

To input the sequence into some neural network, we must first encode it, e.g., using one-hot encoding. One-hot encoding replaces each letter in the series with an \(n\)-dimensional vector, where \(n\) is the number of distinct letters in the alphabet (here, \(n = 5\)). In this example, we can imagine the vector \(\left[\text{A} \ \text{B} \ \text{X} \ \text{Y} \ \text{Z} \right]^\top\) with entries \(1\) for the matching letter and \(0\) otherwise, so that each column encodes one position of the sequence, e.g., the sequence *AZBXYZ* becomes \(\left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0& 0& 0 \\ 0 & 0 & 0 & 1& 0& 0 \\ 0 & 0 & 0 & 0& 1& 0 \\ 0 & 1 & 0 & 0& 0& 1 \\\end{array}\right]\).
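
The matrix above can be reproduced with a few lines of plain Julia. This is only an illustrative sketch: `onehot_matrix` is a hypothetical helper, and the post itself relies on Flux's built-in encoder instead.

```julia
# Hypothetical helper: one-hot encode a sequence over a fixed alphabet.
# Rows follow the order of `alphabet`, columns follow the positions in `seq`.
function onehot_matrix(seq::AbstractString, alphabet::AbstractString="ABXYZ")
    letters = collect(alphabet)
    chars = collect(seq)
    [letter == c for letter in letters, c in chars]
end

E = onehot_matrix("AZBXYZ")
size(E)    # (5, 6): five letters in the alphabet, six positions in the sequence
```

Every column of `E` contains exactly one `true`, located in the row of the letter found at that position.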

The following code defines a function to generate pairs of sequences and labels, along with a function that generates a whole batch and encodes it in its one-hot representation. Notice how Flux provides a nifty helper to do the encoding.

    # Load necessary packages
    using Flux
    using Random
    using StatsBase

    # Generates a sequence of length `seqlen` which follows the rules of the toy example
    function generate_sequence(seqlen::Int)
        @assert seqlen > 2 "The sequence should have a length of at least 3"
        # Randomize A/B placement in the sequence and randomly fill the rest with XYZ
        seq = sample(['X', 'Y', 'Z'], seqlen)
        label = sample(['A', 'B'], 2)
        idx = sort(sample(1:seqlen, 2, replace=false)) # Indexes of A/B
        seq[idx] .= label # Fill-in A/B
        seq, label
    end

    function generate_sequence_batch(seqlen::Int, nseq::Int)
        # Initialize empty arrays for sequences and labels
        seqs = Vector{Vector{Char}}(undef, nseq)
        labs = Vector{Vector{Char}}(undef, nseq)
        # Fill arrays
        for i ∈ 1:nseq
            seqs[i], labs[i] = generate_sequence(seqlen)
        end
        # Encode sequences using one-hot encoding and return
        seqs = map(x -> Flux.onehot.(x, "ABXYZ"), seqs)
        labs = map(x -> Flux.onehot(x, ("AA", "AB", "BA", "BB")), join.(labs))
        seqs, labs
    end

We now have a method of generating arrangements according to our toy example and encoding them using one-hot encoding. We can create a train and a test dataset and try solving the problem. Let us begin with a sequence length of 12, i.e., a short sequence.

    Random.seed!(72) # Set seed for replication
    seqlen = 12 # Define sequence length

    # Train and test data sets with 10,000 observations each
    Xtrain, ytrain = generate_sequence_batch(seqlen, 10_000)
    Xtest, ytest = generate_sequence_batch(seqlen, 10_000)

    # Define an accuracy measure (percentage of correctly predicted labels)
    function accuracy(m, X, y)
        Flux.reset!(m) # Only important for recurrent network
        100mean(isequal.(
            map(x -> findmax(x)[2], eachcol(m(X))),
            map(x -> findmax(x)[2], eachcol(y))
        ))
    end

## Feedforward neural network

Recall how I mentioned that this classification task was challenging for a feedforward neural network? Let’s start by verifying this claim. I won’t detail the tuning of hyperparameters or the choice of architecture as it is not the main topic of this post. You can try other combinations and convince yourself that the network will not perform well regardless of the chosen specifications.

I will use a deep feedforward neural network with three fully connected (dense) layers, ReLU activation functions on the hidden layers, and the Adam optimization algorithm. We will use the cross-entropy loss as the objective to minimize.
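
For readers who want to see what this objective computes, here is a hedged sketch of a cross-entropy loss on raw scores (logits) in plain Julia. The function `logit_crossentropy` is a hypothetical stand-in written for illustration; the training code in this post calls `Flux.logitcrossentropy` rather than this helper.

```julia
# Hypothetical helper: apply a log-softmax to each column of raw scores, then
# average the negative log-probability assigned to the true class.
function logit_crossentropy(logits::AbstractMatrix, onehot_targets::AbstractMatrix)
    # Subtract the column-wise maximum for numerical stability
    shifted = logits .- maximum(logits, dims=1)
    logprobs = shifted .- log.(sum(exp.(shifted), dims=1))
    -sum(onehot_targets .* logprobs) / size(logits, 2)
end

logits  = zeros(4, 3)                           # three observations, uniform scores
targets = Float64[1 0 0; 0 1 0; 0 0 0; 0 0 1]   # one-hot labels, one per column
logit_crossentropy(logits, targets)             # ≈ log(4), the loss of a uniform guess
```

With four classes, a model that has learned nothing sits at a loss of about `log(4)`, which gives a useful baseline when watching the training loss.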

The first difficulty we face with a feedforward network is that we must concatenate our features into a vector format. Indeed, with a sequence of length 12, we have 12 one-hot vectors of size 5. Hence, we have an input vector with 5×12 = 60 features. On the other hand, the output is the same for both the recurrent and the feedforward nets. It is a one-hot encoded vector of size four, i.e., we have four possible classes to choose from when predicting.
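
The concatenation can be illustrated on a single sequence in plain Julia. The example sequence and variable names below are assumptions chosen for illustration only.

```julia
alphabet = collect("ABXYZ")
seq = collect("XBXZAXXZYYZX")   # an arbitrary sequence of length 12

# One one-hot vector of length 5 per time step
steps = [Float32.(alphabet .== c) for c in seq]

# Stack the 12 vectors on top of each other to obtain one input vector
x = reduce(vcat, steps)
length(x)    # 60
x[6:10]      # encoding of the second letter, B: [0, 1, 0, 0, 0]
```

In this flattened layout, the position of a letter in the sequence is only encoded by which block of five entries it occupies, so the network has to learn about every position separately.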

    # Reshape the outputs for Flux, this is the same for both RNN and feedforward nets
    ytrain = hcat(ytrain...)
    ytest = hcat(ytest...)

    # Create train and test datasets for feedforward neural network
    # (each observation is a vector of length 12×5=60, stored as one column)
    Xtrain_ffnn = hcat([vcat(x...) for x ∈ Xtrain]...)
    Xtest_ffnn = hcat([vcat(x...) for x ∈ Xtest]...)

    # Create feedforward neural network, initialize optimizer
    ffnn = Chain(
        Dense(5seqlen => 128, relu),
        Dense(128 => 128, relu),
        Dense(128 => 4)
    )
    opt_ffnn = ADAM()
    θ_ffnn = Flux.params(ffnn) # Keep track of the trainable parameters

    epochs = 100 # Train the model for 100 epochs
    for epoch ∈ 1:epochs
        # Train the model using batches of size 32
        # Note: size(Xtrain_ffnn, 1) is the feature dimension (60), so each epoch
        # draws its batches from the first 60 observations only; use
        # size(Xtrain_ffnn, 2) to iterate over all 10,000 observations
        for idx ∈ Iterators.partition(shuffle(1:size(Xtrain_ffnn, 1)), 32)
            X, y = Xtrain_ffnn[:, idx], ytrain[:, idx]
            ∇ = gradient(θ_ffnn) do
                Flux.logitcrossentropy(ffnn(X), y)
            end
            Flux.update!(opt_ffnn, θ_ffnn, ∇)
        end
    end

    # Compute accuracy of feedforward neural network
    accuracy(ffnn, Xtrain_ffnn, ytrain) # 46.16
    accuracy(ffnn, Xtest_ffnn, ytest) # 46.12

The exact numbers might change depending on the selected seed, but the accuracy should remain in the same range. After 100 epochs of training, the presented deep feedforward neural network reaches only 46% accuracy, which is rather dire for such an easy task. Let us look at how a recurrent neural network performs in comparison.

## Recurrent neural network

While I have already written about how to construct a sequence-to-sequence model in Flux, the sequence-to-one model setup differs drastically. In fact, we must build our own `struct` with recurrent layers and dense layers. Here is one way to do it:

    struct Seq2One
        rnn # Recurrent layers
        fc # Fully-connected layers
    end
    Flux.@functor Seq2One # Make the structure differentiable

    # Define behavior of passing data to an instance of this struct
    function (m::Seq2One)(X)
        # Run recurrent layers on all but final data point
        [m.rnn(x) for x ∈ X[1:end-1]]
        # Pass last data point through both recurrent and fully-connected layers
        m.fc(m.rnn(X[end]))
    end
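
To see what happens under the hood, the same idea can be sketched without Flux. The snippet below is a hand-rolled illustration, not Flux's internals: all names are hypothetical, the weights are random, and the `tanh` activation is an assumption made for the sketch (the model in this post uses `relu`).

```julia
using Random

# One vanilla recurrent step: the new hidden state depends on the current
# input x and on the previous hidden state h.
rnn_step(Wx, Wh, b, h, x) = tanh.(Wx * x .+ Wh * h .+ b)

# Sequence-to-one forward pass: consume every time step, keep only the final
# hidden state, and map it to class scores with a dense layer (Wo, bo).
function seq2one_forward(Wx, Wh, b, Wo, bo, xs)
    h = zeros(eltype(Wx), size(Wh, 1))   # fresh hidden state, akin to a reset
    for x in xs
        h = rnn_step(Wx, Wh, b, h, x)
    end
    Wo * h .+ bo
end

Random.seed!(1)
nin, nhid, nout = 5, 8, 4
Wx, Wh, b = randn(nhid, nin), randn(nhid, nhid), zeros(nhid)
Wo, bo = randn(nout, nhid), zeros(nout)

xs = [Float64.(1:nin .== k) for k in (3, 2, 5, 1)]   # "XBZA", one-hot encoded
scores = seq2one_forward(Wx, Wh, b, Wo, bo, xs)
length(scores)    # 4, one score per label
```

The intermediate hidden states are discarded, but each one feeds into the next step, which is how information about an early *A* or *B* can reach the final prediction.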

That’s it. It might seem challenging at first, but once you get used to Julia and Flux, it all becomes relatively straightforward. Lastly, we can build our model, which we will do using vanilla RNN layers. I won’t go over the data reshaping for recurrent networks again. Still, if you are unfamiliar with it, I recommend you look at one of my past posts on recurrent models: introduction to recurrent models and batching time series data.
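
As a quick reminder of that reshaping, here is a minimal sketch in plain Julia. The helper `to_time_major` is hypothetical; it turns a batch of sequences into one features × batch matrix per time step, which is the layout expected by the recurrent layers used in this post.

```julia
# Hypothetical helper: turn a batch of sequences (one vector of time-step
# vectors per observation) into one features × batch matrix per time step.
function to_time_major(batch)
    seqlen = length(first(batch))
    [reduce(hcat, [seq[t] for seq in batch]) for t in 1:seqlen]
end

# Three observations, four time steps, five features each
batch = [[rand(Float32, 5) for _ in 1:4] for _ in 1:3]
X = to_time_major(batch)
length(X), size(X[1])    # (4, (5, 3))
```

Column `j` of `X[t]` holds time step `t` of observation `j`, so feeding `X[1]`, `X[2]`, and so on to the recurrent layers processes all observations of the batch in parallel.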

    # Create the sequence-to-one network using a similar layer architecture as above
    seq2one = Seq2One(
        Chain(
            RNN(5 => 128, relu),
            RNN(128 => 128, relu)
        ),
        Dense(128 => 4)
    )
    opt_rnn = ADAM()
    θ_rnn = Flux.params(seq2one) # Keep track of the trainable parameters

    epochs = 10 # Train the model for 10 epochs
    for epoch ∈ 1:epochs
        # Train the model using batches of size 32
        for idx ∈ Iterators.partition(shuffle(1:size(Xtrain, 1)), 32)
            Flux.reset!(seq2one) # Reset hidden state
            X, y = Xtrain[idx], ytrain[:, idx]
            X = [hcat([x[i] for x ∈ X]...) for i ∈ 1:seqlen] # Reshape X for RNN format
            ∇ = gradient(θ_rnn) do
                Flux.logitcrossentropy(seq2one(X), y)
            end
            Flux.update!(opt_rnn, θ_rnn, ∇)
        end
    end

    # Reshape full input for RNN format
    Xtrain_rnn = [hcat([x[i] for x ∈ Xtrain]...) for i ∈ 1:seqlen]
    Xtest_rnn = [hcat([x[i] for x ∈ Xtest]...) for i ∈ 1:seqlen]

    # Compute accuracy of the sequence-to-one recurrent model
    accuracy(seq2one, Xtrain_rnn, ytrain) # 100.0
    accuracy(seq2one, Xtest_rnn, ytest) # 100.0

Alright! That’s a 100% accuracy for the sequence-to-one model on both the train and test set after **only 10 epochs** while the feedforward net didn’t pass 50% accuracy with 100 epochs. Pretty neat!

That’s all for today. I suggest you try and play around with the sequence length. It’s a fun exercise that shows how the vanilla RNN layers struggle with long sequences and how replacing them with LSTM layers can yield much better results.