Neural Style Transfer (NST) takes the semantic structure of one image (the content) and renders it in the texture and visual style of another. It is based on the pre-trained VGG-19 Convolutional Neural Network (CNN). This implementation is inspired by the official TensorFlow Style Transfer tutorial and extends it with stronger diagnostics (effective learning rate, optional Lipschitz constant estimation), reproducibility (logging, checkpoints), and clear notation.
A Quick Example
To illustrate the potential of NST, let Fig. 1 be the content image that we want to redraw using Fig. 2 as the style image. The result is the image in Fig. 3.
Fig. 1. The content image to be redrawn.
Fig. 2. The style image used to redraw the content image in Fig. 1.
Fig. 3. Neural Style Transfer: the content image (Fig. 1) is redrawn using the style image (Fig. 2).
In particular, the lion’s fur texture has been replaced with the cartoon fox’s. Some smoothing issues (dark patches) are visible in the top right corner of Fig. 3 and on top of the lion’s head. This shows the model’s sensitivity to the parameters (you can tune them with the command-line arguments and run your own tests). Nevertheless, the purpose of this project is to experiment with CNNs and explore their potential for Generative AI.
Getting the code
You can download the code from its GitHub repository with
$ git clone https://github.com/CoffeeDynamics/Neural-Style-Transfer.git

The sections below give some insights about the theory and the implementation, and show some example results.
Problem setup and notation
We optimise directly over the pixels of an image. Let $x$ be the optimisation variable (the stylized image), $x_c$ the content image, and $x_s$ the style image. One has
\begin{equation}
x \in [0,1]^{H \times W \times 3},
\end{equation}
where $H$ and $W$ are the image height and width, and the final dimension of size 3 corresponds to RGB (Red – Green – Blue) colour channels.
A frozen VGG-19 provides perceptual features. For a layer $\ell$, the activation tensor is
\begin{equation}
\phi_\ell(x) \in \mathbb{R}^{H_\ell \times W_\ell \times C_\ell},
\end{equation}
where $H_\ell$ and $W_\ell$ are the spatial dimensions at layer $\ell$ and $C_\ell$ is the number of channels (filters).
We define the set of content layers,
\begin{equation}
L_c = \left\{\text{block5_conv2}\right\},
\end{equation}
and the set of style layers,
\begin{equation}
L_s = \left\{\text{block1_conv1}, \text{block2_conv1}, \text{block3_conv1}, \text{block4_conv1}, \text{block5_conv1}\right\}.
\end{equation}
To build style statistics, we flatten spatial dimensions so rows enumerate spatial locations and columns enumerate channels:
\begin{equation}
F_\ell(x) \in \mathbb{R}^{(H_\ell W_\ell)\times C_\ell}, \qquad
F_\ell(x)_{p,c} := \phi_\ell(x)_{i_p,j_p,c}.
\end{equation}
Here, $p$ indexes spatial positions after flattening, $(i_p,j_p)$ are the corresponding 2D coordinates, and $c$ indexes channels.
The normalized Gram matrix captures channel–channel correlations:
\begin{equation}
G_\ell(x) := \frac{1}{H_\ell W_\ell}F_\ell(x)^\top F_\ell(x) \in \mathbb{R}^{C_\ell \times C_\ell}.
\end{equation}
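As a concrete illustration, the flattening and Gram computation can be written as a single einsum contraction over the spatial indices, in the spirit of the TensorFlow tutorial this project builds on (a minimal sketch, not necessarily the script's exact code):

```python
import tensorflow as tf

def gram_matrix(feature_map):
    """Normalized Gram matrix G_l(x) for an activation tensor of shape (1, H_l, W_l, C_l)."""
    # Sum phi[i, j, c] * phi[i, j, d] over all spatial positions (i, j);
    # the leading batch dimension of size 1 is kept.
    result = tf.linalg.einsum('bijc,bijd->bcd', feature_map, feature_map)
    shape = tf.shape(feature_map)
    num_locations = tf.cast(shape[1] * shape[2], tf.float32)  # H_l * W_l
    return result / num_locations
```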
Loss terms
Content loss. Squared Euclidean distance of deep features:
\begin{equation}
L_{\mathrm{content}}(x,x_c) = \frac{1}{|L_c|}\sum_{\ell \in L_c} \left\|\phi_\ell(x) - \phi_\ell(x_c)\right\|_F^2.
\end{equation}
Here, $\|\cdot\|_F$ denotes the Frobenius norm, i.e., the Euclidean norm of all entries in the tensor after flattening.
Style loss. Squared Frobenius distance between Gram matrices across style layers:
\begin{equation}
L_{\mathrm{style}}(x,x_s) = \frac{1}{|L_s|}\sum_{\ell \in L_s} \left\|G_\ell(x) - G_\ell(x_s)\right\|_F^2.
\end{equation}
Total variation (TV) loss. Spatial regularizer that discourages high-frequency artefacts. The implementation uses an isotropic form:
\begin{equation}
L_{\mathrm{TV}}(x) = \sum_{i,j} \sqrt{ \big(x_{i+1,j} - x_{i,j}\big)^2 + \big(x_{i,j+1} - x_{i,j}\big)^2 }.
\end{equation}
Here $i,j$ index spatial coordinates, and the sum is applied channel-wise across the three RGB channels.
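For concreteness, here is a minimal sketch of this isotropic TV term for a tensor of shape $1 \times H \times W \times 3$ (the small constant inside the square root is my own addition to keep the gradient finite at zero, and may differ from the actual implementation):

```python
import tensorflow as tf

def tv_loss(x):
    """Isotropic total variation of an image tensor of shape (1, H, W, 3)."""
    dx = x[:, 1:, :, :] - x[:, :-1, :, :]   # differences along i (height)
    dy = x[:, :, 1:, :] - x[:, :, :-1, :]   # differences along j (width)
    # Crop both difference maps so they align on the same (i, j) locations.
    return tf.reduce_sum(
        tf.sqrt(tf.square(dx[:, :, :-1, :]) + tf.square(dy[:, :-1, :, :]) + 1e-12))
```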
Total objective. Weights $\alpha$ (content), $\beta$ (style), $\tau$ (TV) trade off structure, texture, and smoothness:
\begin{equation}
L(x) = \alpha L_{\mathrm{content}}(x,x_c) + \beta L_{\mathrm{style}}(x,x_s) + \tau L_{\mathrm{TV}}(x).
\end{equation}
This is the objective minimized with respect to the pixels of $x$.
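Putting the pieces together, the objective can be assembled as sketched below, reusing the `gram_matrix` and `tv_loss` helpers from above. The `extract_features` function and the precomputed targets `content_feats` (deep features of $x_c$) and `style_grams` (Gram matrices of $x_s$) are hypothetical names, not the script's actual API:

```python
import tensorflow as tf

def total_loss(x, content_feats, style_grams, alpha, beta, tau):
    """Weighted sum of content, style, and TV losses (sketch)."""
    feats = extract_features(x)  # hypothetical helper: {'content': {...}, 'style': {...}}
    # Content: mean over content layers of the squared Frobenius feature distance.
    l_content = tf.add_n([
        tf.reduce_sum(tf.square(feats['content'][name] - content_feats[name]))
        for name in content_feats]) / len(content_feats)
    # Style: mean over style layers of the squared Frobenius Gram distance.
    l_style = tf.add_n([
        tf.reduce_sum(tf.square(gram_matrix(feats['style'][name]) - style_grams[name]))
        for name in style_grams]) / len(style_grams)
    return alpha * l_content + beta * l_style + tau * tv_loss(x)
```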
VGG-19: why and how it’s used
VGG-19 is a deep yet architecturally uniform CNN: stacks of $3\times 3$ convolutions with max-pooling between blocks. Early layers encode colours and fine textures; mid layers encode motifs; deep layers encode semantics.
In NST:
- Content comes from a deep semantic layer: block5_conv2.
- Style comes from the first convolution layer of each block: block{1..5}_conv1.
We keep the ImageNet-pre-trained weights frozen, turning VGG-19 into a stable perceptual feature extractor.
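In code, such a frozen feature extractor can be built from Keras' pre-trained VGG-19, along the lines of the TensorFlow tutorial (a sketch; note that VGG-19 also expects its own input preprocessing, e.g. `tf.keras.applications.vgg19.preprocess_input` applied to pixels rescaled to $[0, 255]$ before the forward pass):

```python
import tensorflow as tf

CONTENT_LAYERS = ['block5_conv2']
STYLE_LAYERS = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

def build_feature_extractor(layer_names):
    """Return a frozen model mapping an image to the requested VGG-19 activations."""
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False  # keep the ImageNet weights frozen
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model([vgg.input], outputs)

extractor = build_feature_extractor(STYLE_LAYERS + CONTENT_LAYERS)
```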
For a generic input of size $H\times W$ (resized so that $\max(H,W)\le 512$), Table 1 shows the shape evolution as information is propagated into the network.
| Block | Layer(s) | Output shape $(H_\ell \times W_\ell \times C_\ell)$ |
|---|---|---|
| — | Input | $H \times W \times 3$ |
| B1 | block1_conv1, block1_conv2 | $H \times W \times 64$ |
| | MaxPooling | $\tfrac{H}{2}\times \tfrac{W}{2}\times 64$ |
| B2 | block2_conv1, block2_conv2 | $\tfrac{H}{2}\times \tfrac{W}{2}\times 128$ |
| | MaxPooling | $\tfrac{H}{4}\times \tfrac{W}{4}\times 128$ |
| B3 | block3_conv1, block3_conv2, block3_conv3, block3_conv4 | $\tfrac{H}{4}\times \tfrac{W}{4}\times 256$ |
| | MaxPooling | $\tfrac{H}{8}\times \tfrac{W}{8}\times 256$ |
| B4 | block4_conv1, block4_conv2, block4_conv3, block4_conv4 | $\tfrac{H}{8}\times \tfrac{W}{8}\times 512$ |
| | MaxPooling | $\tfrac{H}{16}\times \tfrac{W}{16}\times 512$ |
| B5 | block5_conv1, block5_conv2, block5_conv3, block5_conv4 | $\tfrac{H}{16}\times \tfrac{W}{16}\times 512$ |
| | MaxPooling | $\tfrac{H}{32}\times \tfrac{W}{32}\times 512$ |
Preprocessing
Both content and style images are automatically resized so that their longest side is 512 pixels (aspect ratio preserved). They are converted to float32 tensors in $[0,1]$, and a batch dimension is added, so the network input has shape $1 \times H \times W \times 3$.
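A sketch of this loading and preprocessing step, close to the routine used in the TensorFlow tutorial (the file path is illustrative):

```python
import tensorflow as tf

def load_image(path, max_dim=512):
    """Load an image as a float32 tensor in [0, 1] whose longest side equals max_dim."""
    img = tf.io.read_file(path)
    img = tf.image.decode_image(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)    # values in [0, 1]
    shape = tf.cast(tf.shape(img)[:-1], tf.float32)        # (H, W)
    scale = max_dim / tf.reduce_max(shape)                 # longest side -> max_dim
    new_shape = tf.cast(shape * scale, tf.int32)
    img = tf.image.resize(img, new_shape)                  # aspect ratio preserved
    return img[tf.newaxis, :]                              # add batch dim: (1, H, W, 3)
```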
Optimisation and diagnostics
We optimise the pixels of $x$ (initialised from $x_c$) with the Adam optimiser and clip back to $[0,1]$ after each step.
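A minimal sketch of one such step, assuming the `load_image` and `total_loss` helpers sketched above, with the content/style targets precomputed elsewhere (names are illustrative):

```python
import tensorflow as tf

# `content_feats` and `style_grams` are assumed to be precomputed from x_c and x_s.
image = tf.Variable(load_image('path/to/content.jpg'))      # x, initialised from x_c
opt = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9,
                               beta_2=0.999, epsilon=1e-7)  # CLI defaults (see below)
alpha, beta, tau = 1e4, 1e-2, 3e2                           # default loss weights (see below)

@tf.function
def train_step(image):
    with tf.GradientTape() as tape:
        loss = total_loss(image, content_feats, style_grams, alpha, beta, tau)
    grad = tape.gradient(loss, image)
    opt.apply_gradients([(grad, image)])
    image.assign(tf.clip_by_value(image, 0.0, 1.0))          # project back onto [0, 1]
    return loss
```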
To make optimiser behaviour observable, we log the effective learning rate:
\begin{equation}
\alpha_{\mathrm{eff}} = \frac{\alpha}{\sqrt{\hat v} + \epsilon},
\end{equation}
where $\alpha$ is the Adam base learning rate, $\hat v$ is the bias-corrected second-moment estimate, and $\epsilon$ is a small constant for numerical stability. We report min/mean/max of $\alpha_{\mathrm{eff}}$ across all pixels each iteration.
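These statistics can be reconstructed from the optimiser state roughly as follows, assuming the TF 2.x `tf.keras.optimizers.Adam` (or its `legacy` variant), which exposes per-variable slots via `get_slot`; Keras 3 optimisers store their state differently:

```python
import tensorflow as tf

def effective_learning_rate_stats(opt, var):
    """Return (min, mean, max) of the per-pixel effective Adam learning rate."""
    v = opt.get_slot(var, 'v')                        # raw second-moment estimate
    t = tf.cast(opt.iterations, tf.float32)           # number of steps taken so far
    beta_2 = tf.cast(opt.beta_2, tf.float32)
    v_hat = v / (1.0 - tf.pow(beta_2, t))             # bias-corrected second moment
    alpha = tf.cast(opt.learning_rate, tf.float32)
    alpha_eff = alpha / (tf.sqrt(v_hat) + opt.epsilon)
    return (tf.reduce_min(alpha_eff),
            tf.reduce_mean(alpha_eff),
            tf.reduce_max(alpha_eff))
```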
Optional Lipschitz estimation. With --optim, we estimate the largest eigenvalue, $L$, of the pixel-space Hessian via power iteration on Hessian–vector products, providing a stability heuristic. For vanilla gradient descent, the step size, $\eta$, must satisfy
\begin{equation}
\eta < \frac{2}{L}.
\end{equation}
Here $\eta$ denotes the learning rate of plain gradient descent (analogous to $\alpha$ in Adam). Although Adam is not plain GD, monitoring $\alpha_{\mathrm{eff}}\cdot L$ is informative: large values correlate with oscillations; tiny values with stagnation.
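A sketch of such a power iteration with nested gradient tapes, assuming `image` is the `tf.Variable` being optimised and `loss_fn` evaluates the total loss (the iteration count is illustrative):

```python
import tensorflow as tf

def estimate_largest_eigenvalue(loss_fn, image, num_iters=20):
    """Estimate L, the dominant Hessian eigenvalue magnitude, by power iteration."""
    v = tf.random.normal(tf.shape(image))
    v /= tf.norm(v)
    eig = tf.constant(0.0)
    for _ in range(num_iters):
        with tf.GradientTape() as outer:
            with tf.GradientTape() as inner:
                loss = loss_fn(image)
            grad = inner.gradient(loss, image)
            grad_dot_v = tf.reduce_sum(grad * v)    # scalar grad(x) . v
        hvp = outer.gradient(grad_dot_v, image)     # Hessian-vector product H v
        eig = tf.norm(hvp)                          # |H v| -> |lambda_max| as v converges
        v = hvp / (tf.norm(hvp) + 1e-12)            # re-normalise for the next iteration
    return eig
```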
Command-line interface
The script is designed as a CLI. Minimal usage:
$ python neural_style_transfer.py --content path/to/content.jpg \
--style path/to/style.jpg --output out/stylized.png

Key flags:
- `--nsteps`: number of steps (default: 1,000)
- `--content_weight`: content loss weight, $\alpha$ (default: 10,000)
- `--style_weight`: style loss weight, $\beta$ (default: 0.01)
- `--tv_weight`: total variation loss weight, $\tau$ (default: 300)
- Adam hyperparameters:
  - `--adam_learning_rate`: learning rate (default: 0.01)
  - `--adam_beta_1`: $\beta_1$ coefficient (default: 0.9)
  - `--adam_beta_2`: $\beta_2$ coefficient (default: 0.999)
  - `--adam_epsilon`: $\epsilon$ coefficient (default: $10^{-7}$)
- Logging and saving cadence:
  - `--save_every`: save every $N$ iterations (default: 100)
  - `--log_every`: log every $N$ iterations (default: 1)
  - `--ckpt_write_every`: write a checkpoint every $N$ iterations (default: 100)
- Checkpoints:
  - `--resume`: resume from checkpoint if available (default: start fresh)
  - `--resume_from`: path to a specific checkpoint to restore (overrides `--resume`)
  - `--ckpt_in`: path to checkpoint folder to restore from (default: None)
  - `--ckpt_keep_max`: keep only the $N$ latest checkpoints (default: 5)
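For reference, a few of these flags could be declared with `argparse` roughly as follows (an illustrative sketch, not the script's actual parser; marking the paths as required is my assumption):

```python
import argparse

parser = argparse.ArgumentParser(description='Neural Style Transfer')
parser.add_argument('--content', required=True, help='path to the content image')
parser.add_argument('--style', required=True, help='path to the style image')
parser.add_argument('--output', required=True, help='path of the stylized output image')
parser.add_argument('--nsteps', type=int, default=1000, help='number of steps')
parser.add_argument('--content_weight', type=float, default=1e4, help='content loss weight alpha')
parser.add_argument('--style_weight', type=float, default=1e-2, help='style loss weight beta')
parser.add_argument('--tv_weight', type=float, default=3e2, help='TV loss weight tau')
parser.add_argument('--adam_learning_rate', type=float, default=0.01, help='Adam learning rate')
args = parser.parse_args()
```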
To see all flags:
$ python neural_style_transfer.py -h

What gets saved
- Images: the style image, the initial content snapshot, periodic stylized results every `--save_every` steps, and a final image with suffix `_final.png`. Filenames embed the content/style base names and step count.
- Checkpoints: TensorFlow checkpoints in `.../checkpoints/`, storing both the image variable and optimiser state, so training can be resumed exactly.
- Logs: a `.txt` log with environment info and hyperparameters in the header. Each row reports: step, content/style/TV losses, total loss, and min/mean/max effective learning rates (plus $\alpha_{\mathrm{eff,max}}\cdot L$ if `--optim` is set).
Sample log excerpt (--optim not set):
# Step Content Loss Style loss Total Variation Total Loss alpha_eff_min alpha_eff_mean alpha_eff_max
1 0.00000000e+00 7.93783552e+08 2.47124060e+07 8.18495936e+08 1.88675564e-08 2.50102585e-06 9.89065990e-02
2 3.57562225e+06 6.26893376e+08 2.52307880e+07 6.55699776e+08 2.10574669e-08 5.46445506e-07 4.36682312e-04
3 7.84002150e+06 4.79356608e+08 2.58209200e+07 5.13017568e+08 2.55993946e-08 4.54668054e-07 4.92347754e-05

Visualizing logs with gnuplot
The training script writes a structured .txt log. To make these diagnostics interpretable, I use a companion gnuplot.plt script to plot the content, style, TV, and total losses over steps, as shown in Fig. 4.

These plots complement the image snapshots: they show not just what the algorithm produces, but how it behaves during optimisation.
Results
The lion-fox illustration in Figs. 1-3 shows the core idea on a single pair. To examine the behaviour systematically, I generated 25 outputs by combining 5 content images (c1, …, c5) with 5 style images (s1, …, s5) using the default parameters (1,000 steps, $\alpha=10^4$, $\beta=10^{-2}$, and $\tau=3\cdot 10^2$), as shown in Table 2.
Table 2. A 5 × 5 grid of stylized outputs: each row corresponds to a content image (c1–c5) and each column to a style image (s1–s5).
For example, the resulting image corresponding to the combination of content c1 and style s1 has been generated with
$ python neural_style_transfer.py --content content_files/c1.png \
--style style_files/s1.png \
--output output_files/c1_s1/run0/result.png

The final output image was written to output_files/c1_s1/run0/result_c1_s1_final.png in approximately 18 minutes on my desktop. This information is available in the log file result_c1_s1_log.txt written in the same folder.
Qualitative trends. Early iterations preserve structure with faint textures; mid-iterations intensify textures (TV regularisation curbs noise); at convergence, the balance is set by $(\alpha,\beta,\tau)$.
Conclusion
- VGG-19 as a perceptual extractor: deep content, shallow-to-mid style.
- Well-posed objective: content/style/TV terms with clear roles and weights.
- Pixel-space optimisation with visibility: effective learning rate logging, optional Lipschitz constant estimation, and gnuplot visualizations help explain stability and convergence.
- Reproducibility: deterministic preprocessing, logs, checkpoints, and periodic snapshots.
NST is both an artistic tool and a compact demonstration of deep learning’s flexibility: pre-trained CNN features, differentiable objectives, and iterative optimisation working together to transform images in a controlled, repeatable way.





































