The Machine Learning “Advent Calendar” Day 23: CNN in Excel


Convolutional Neural Networks (CNNs) were first introduced for images, and for images they are often easy to understand.

A filter slides over pixels and detects edges, shapes, or textures. You can read the article I wrote earlier to see how CNNs work on images, also explained in Excel.

For text, the idea is the same.

Instead of pixels, we slide filters over words.

Instead of visual patterns, we detect linguistic patterns.

And many important patterns in text are very local. Let’s take these very simple examples:

  • “good” is positive
  • “bad” is negative
  • “not good” is negative
  • “not bad” is often positive

In my previous article, we saw how to represent words as numbers using embeddings.

We also saw a key limitation: when we used a global average, word order was completely ignored.

From the model’s point of view, “not good” and “good not” looked exactly the same.

So the next challenge is clear: we want the model to take word order into account.

A 1D Convolutional Neural Network is a natural tool for this, because it scans a sentence with small sliding windows and reacts when it recognizes familiar local patterns.

1. Understanding a 1D CNN for Text: Architecture and Depth

1.1. Building a 1D CNN for text in Excel

In this article, we build a 1D CNN architecture in Excel with the following components:

  • Embedding dictionary
    We use a 2-dimensional embedding, because one dimension is not enough for this task.
    One dimension encodes sentiment, and the second dimension encodes negation.
  • Conv1D layer
    This is the core component of a CNN architecture.
    It consists of filters that slide across the sentence with a window length of 2 words. We choose a window of 2 words to keep things simple.
  • ReLU and global max pooling
    These steps keep only the strongest matches detected by the filters.
    We will also discuss the fact that ReLU is optional.
  • Logistic regression
    This is the final classification layer, which combines the detected patterns into a probability.
1D CNN in Excel – all images by author

This pipeline corresponds to a standard CNN text classifier.
The only difference here is that we explicitly write and visualize the forward pass in Excel.
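
For readers who prefer code to spreadsheets, here is a rough Keras equivalent of this pipeline. It is only a sketch, not the Excel sheet itself, and vocab_size is just a placeholder vocabulary size:

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 1000  # placeholder vocabulary size

model = tf.keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=2),        # 2-dimensional embedding
    layers.Conv1D(filters=4, kernel_size=2, activation="relu"),  # 4 filters, window of 2 words
    layers.GlobalMaxPooling1D(),                                  # strongest match per filter
    layers.Dense(1, activation="sigmoid"),                        # logistic regression
])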

1.2. What “deep learning” means in this architecture

Before going further, let us take a step back.
Yes, I know, I do this often, but having a global view of models really helps to understand them.

The definition of deep learning is often blurred.
For many people, deep learning simply means “many layers”.

Here, I will take a slightly different point of view.

What really characterizes deep learning is not the number of layers, but the depth of the transformation applied to the input data.

With this definition:

  • Even a model with a single convolution layer can be considered deep learning,
  • because the input is transformed into a more structured and abstract representation.

On the other hand, taking raw input data, applying one-hot encoding, and stacking many fully connected layers does not necessarily make a model deep in a meaningful sense.
In theory, if the layers do not really transform the representation, a single layer is enough.

In CNNs, the presence of multiple layers has a very concrete motivation.

Consider a sentence like:

This movie is not very good

With a single convolution layer and a small window, we can detect simple local patterns such as: “very + good”

But we cannot yet detect higher-level patterns such as: “not + (very good)”

This is why CNNs are often stacked:

  • the first layer detects simple local patterns,
  • the second layer combines them into more complex ones.

In this article, we deliberately focus on one convolution layer.
This makes every step visible and easy to understand in Excel, while keeping the logic identical to deeper CNN architectures.

2. Turning words into embeddings

Let us start with some simple words. We want to detect negation, so we will use the following terms, surrounded by other words that we will not model:

  • “good”
  • “bad”
  • “not good”
  • “not bad”

We keep the representation intentionally small so that every step is visible.

We will only use a dictionary of three words: good, bad, and not.

All other words will have a zero embedding.

2.1 Why one dimension is not enough

In a previous article on sentiment detection, we used a single dimension.
That worked for “good” versus “bad”.

But now we want to handle negation.

One dimension can only represent one concept well.
So we need two dimensions:

  • senti: sentiment polarity
  • neg: negation marker

2.2 The embedding dictionary

Each word becomes a 2D vector:

  • good → (senti = +1, neg = 0)
  • bad → (senti = -1, neg = 0)
  • not → (senti = 0, neg = +1)
  • any other word → (0, 0)

This is not how real embeddings look. Real embeddings are learned, high-dimensional, and not directly interpretable.

But for understanding how Conv1D works, this toy embedding is perfect.

In Excel, this is just a lookup table.
In a real neural network, this embedding matrix would be trainable.
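
As a minimal sketch, the same lookup table can be written in a few lines of Python:

# Toy embedding dictionary: (senti, neg) for each known word
EMBEDDINGS = {
    "good": (1.0, 0.0),
    "bad": (-1.0, 0.0),
    "not": (0.0, 1.0),
}

def embed(word):
    # Any word outside the dictionary gets the neutral vector (0, 0)
    return EMBEDDINGS.get(word, (0.0, 0.0))

print(embed("not"))    # (0.0, 1.0)
print(embed("movie"))  # (0.0, 0.0)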


3. Conv1D filters as sliding pattern detectors

Now we arrive at the core idea of a 1D CNN.

A Conv1D filter is nothing mysterious. It is just a small set of weights plus a bias that slides over the sentence.

Because:

  • each word embedding has 2 values (senti, neg)
  • our window contains 2 words

each filter has:

  • 4 weights (2 dimensions × 2 positions)
  • 1 bias

That is all.

You can think of a filter as repeatedly asking the same question at every position:

“Do these two neighboring words match a pattern I care about?”
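
In code form, answering that question for one window is just a dot product plus a bias. Here is a tiny sketch with weights chosen by hand (we will meet these exact weights again as Filter 3 below):

# Window ("not", "good") flattened into 4 values: (senti, neg) of each word
window  = (0.0, 1.0, 1.0, 0.0)   # "not" = (0, 1), "good" = (1, 0)
weights = (0.0, 1.0, 1.0, 0.0)   # 4 weights: 2 dimensions x 2 positions
bias = -1.0

z = sum(w * x for w, x in zip(weights, window)) + bias
print(z)  # 1.0 -> this filter fires on "not good"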

3.1 Sliding windows: how Conv1D sees a sentence

Consider this sentence:

it is not bad at all

We choose a window size of 2 words.

That means the model looks at every adjacent pair:

  • (it, is)
  • (is, not)
  • (not, bad)
  • (bad, at)
  • (at, all)

Important point:
The filters slide everywhere, even when both words are neutral (all zeros).
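
In code, the sliding window is nothing more than a loop over adjacent pairs of tokens (a sketch, assuming simple whitespace tokenization):

sentence = "it is not bad at all"
tokens = sentence.split()

windows = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
print(windows)
# [('it', 'is'), ('is', 'not'), ('not', 'bad'), ('bad', 'at'), ('at', 'all')]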


3.2 Four intuitive filters

To make the behavior easy to understand, we use four filters.


Filter 1 – “I see GOOD”

This filter looks only at the sentiment of the current word.

Plain-text equation for one window:

z = senti(current_word)

If the word is “good”, z = 1
If the word is “bad”, z = -1
If the word is neutral, z = 0

After ReLU, negative values become 0 (we will come back to the fact that ReLU is optional).

Filter 2 – “I see BAD”

This one is symmetric.

z = -senti(current_word)

So:

  • “bad” → z = 1
  • “good” → z = -1 → ReLU → 0

Filter 3 – “I see NOT GOOD”

This filter looks at two things at the same time:

  • neg(previous_word)
  • senti(current_word)

Equation:

z = neg(previous_word) + senti(current_word) – 1

Why the “-1”?
It acts like a threshold so that both conditions must be true.

Results:

  • “not good” → 1 + 1 – 1 = 1 → activated
  • “is good” → 0 + 1 – 1 = 0 → not activated
  • “not bad” → 1 – 1 – 1 = -1 → ReLU → 0

Filter 4 – “I see NOT BAD”

Same idea, slightly different sign:

z = neg(previous_word) + (-senti(current_word)) – 1

Results:

  • “not bad” → 1 + 1 – 1 = 1
  • “not good” → 1 – 1 – 1 = -1 → 0

This is a very important intuition:

A CNN filter can behave like a local logical rule, learned from data.
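
To make this concrete, here is a minimal NumPy sketch of the whole Conv1D scan, with the four filters written by hand to match the four rules above (in a real CNN these weights would be learned, not hand-picked):

import numpy as np

EMB = {"good": (1.0, 0.0), "bad": (-1.0, 0.0), "not": (0.0, 1.0)}

def embed(word):
    return EMB.get(word, (0.0, 0.0))

# One row per filter: 4 weights (prev senti, prev neg, curr senti, curr neg) then the bias
FILTERS = np.array([
    [0, 0,  1, 0,  0],   # Filter 1: "I see GOOD"
    [0, 0, -1, 0,  0],   # Filter 2: "I see BAD"
    [0, 1,  1, 0, -1],   # Filter 3: "I see NOT GOOD"
    [0, 1, -1, 0, -1],   # Filter 4: "I see NOT BAD"
])

def conv1d(tokens):
    rows = []
    for prev, curr in zip(tokens, tokens[1:]):
        x = np.array(embed(prev) + embed(curr))           # one window = 4 input values
        rows.append(FILTERS[:, :4] @ x + FILTERS[:, 4])   # z for the 4 filters
    return np.array(rows)                                  # shape: (number of windows, 4)

z = conv1d("it is not bad at all".split())
print(z)  # the (not, bad) window activates Filter 2 ("bad") and Filter 4 ("not bad")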

3.3 Final result of sliding windows

Here are the final results of these 4 filters.


4. ReLU and max pooling: from local to global

4.1 ReLU

After computing z for every window, we apply ReLU:

ReLU(z) = max(0, z)

Meaning:

  • negative evidence is ignored
  • positive evidence is kept

Each filter becomes a presence detector.

By the way, ReLU is simply the activation function of this neural network. So a neural network is not that difficult after all.


4.2 Global Max pooling

Then comes global max pooling.

For each filter, we keep only:

max activation over all windows

Interpretation:
“I do not care where the pattern appears, only whether it appears strongly somewhere.”
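
Continuing the NumPy sketch from section 3.2, ReLU and global max pooling are one line each:

a = np.maximum(0, z)       # ReLU: negative evidence is dropped
pooled = a.max(axis=0)     # global max pooling: strongest activation per filter
print(pooled)              # [0. 1. 0. 1.] for "it is not bad at all"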

At this point, the whole sentence is summarized by 4 numbers:

  • strongest “good” signal
  • strongest “bad” signal
  • strongest “not good” signal
  • strongest “not bad” signal

4.3 What happens if we remove ReLU?

Without ReLU:

  • negative values stay negative
  • max pooling may select negative values

This mixes two ideas:

  • absence of a pattern
  • opposite of a pattern

The filter stops being a clean detector and becomes a signed score.

The model could still work mathematically, but interpretation becomes harder.

5. The final layer is logistic regression

Now we combine these signals.

We compute a score using a linear combination:

score = 2 × F_good – 2 × F_bad – 3 × F_not_good + 3 × F_not_bad – bias


Then we convert the score into a probability:

probability = 1 / (1 + exp(-score))

That is exactly logistic regression.
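
In code, this last step is a handful of lines (same toy weights as above; the bias is left at 0 for illustration):

import math

def predict(pooled, bias=0.0):
    # pooled = (F_good, F_bad, F_not_good, F_not_bad)
    weights = (2.0, -2.0, -3.0, 3.0)
    score = sum(w * f for w, f in zip(weights, pooled)) - bias
    return 1.0 / (1.0 + math.exp(-score))   # sigmoid -> probability of positive sentiment

print(predict((0, 1, 0, 1)))  # "not bad" outweighs "bad" -> about 0.73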

So yes:

  • the CNN extracts features: this step can be seen as automated feature engineering,
  • logistic regression makes the final decision: it is a classic machine learning model that we know well

6. Full examples with sliding filters

Example 1

“it is bad, so it is not good at all”

The sentence contains “bad”, “good” (inside “not good”), and the pattern “not good”, but never “not bad”.

After max pooling:

  • F_good = 1 (because “good” exists)
  • F_bad = 1
  • F_not_good = 1
  • F_not_bad = 0

Final score becomes strongly negative.
Prediction: negative sentiment.
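
Concretely, with the weights from the score formula and ignoring the bias:

score = 2 × 1 – 2 × 1 – 3 × 1 + 3 × 0 = –3
probability = 1 / (1 + exp(3)) ≈ 0.05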


Example 2

“it is good. yes, not bad.”

The sentence contains “good”, “bad” (inside “not bad”), and the pattern “not bad”, but never “not good”.

After max pooling:

  • F_good = 1
  • F_bad = 1 (because the word “bad” appears)
  • F_not_good = 0
  • F_not_bad = 1

The final linear layer learns that “not bad” should outweigh “bad”.

Prediction: positive sentiment.
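
Again with the same weights and ignoring the bias:

score = 2 × 1 – 2 × 1 – 3 × 0 + 3 × 1 = +3
probability = 1 / (1 + exp(-3)) ≈ 0.95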

This also shows something important: max pooling keeps all strong signals.
The final layer decides how to combine them.


Example 3 – a limitation that explains why CNNs get deeper

Try this sentence:

“it is not very bad”

With a window of size 2, the model sees:

  • (it, is)
  • (is, not)
  • (not, very)
  • (very, bad)

It never sees (not, bad), so the “not bad” filter never fires.

This explains why real models use:

  • larger windows (illustrated just below)
  • multiple convolution layers
  • or other architectures for longer dependencies
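
To illustrate the first option, a window of 3 words (kernel_size=3 in a framework like Keras) would let a filter see the triple (not, very, bad) directly:

tokens = "it is not very bad".split()
windows = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
print(windows)
# [('it', 'is', 'not'), ('is', 'not', 'very'), ('not', 'very', 'bad')]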

Conclusion

The strength of Excel is visibility.

You can see:

  • the embedding dictionary
  • all filter weights and biases
  • every sliding window
  • every ReLU activation
  • the max pooling result
  • the logistic regression parameters

Training is simply the process of adjusting these numbers.

Once you see that, CNNs stop being mysterious.

They become what they really are: structured, trainable pattern detectors that slide over data.
