ANALYZING STYLE TRANSFER ALGORITHMS FOR SEGMENTED IMAGES

By

Seyed Hadi Seyed

December, 2024

Director of Thesis: David Hart, PhD

Major Department: Computer Science

ABSTRACT

The recently developed Segment Anything Model has made grabbing semantically meaning-

ful regions of an image easier than before. This will allow for new applications that build

on this approach that weren’t previously possible. This thesis investigates integrating the

Segment Anything Model with style transfer. Specifically, it proposes Partial Convolution

as a way to improve style transfer for segmented regions. Additionally, it investigates how

different style transfer techniques are affected by different mask sizes, image statistics, etc.


ANALYZING STYLE TRANSFER ALGORITHMS FOR SEGMENTED IMAGES

A Thesis

Presented to The Faculty of the Department of Computer Science

East Carolina University

In Partial Fulfillment of the Requirements for the Degree

Master of Science in Data Science

By

Seyed Hadi Seyed

December, 2024

Director of Thesis: David Hart, PhD

Thesis Committee Members:

Nic Herndon, PhD

Rui Wu, PhD


©Seyed Hadi Seyed, 2024


DEDICATION

This thesis is dedicated to my family, whose love and support have been the foundation of

my achievements. To my parents, thank you for inspiring my passion for learning.


ACKNOWLEDGEMENTS

I sincerely thank everyone who supported me during this journey. I am especially grateful
to my advisor, Dr. David Hart, for his invaluable guidance, encouragement, and insight, and
to the respected committee for their constructive and valuable feedback. To my family and
friends, your unwavering belief in me gave me the strength to complete this work. I would
like to thank my parents for providing me with the opportunity to gain this tremendous
education and the necessary tools to succeed in life.


Table of Contents

LIST OF FIGURES

CHAPTER 1: INTRODUCTION
1.1 What is the Segment Anything Model?
1.2 What is Style Transfer?
1.3 Combining Style Transfer with the Segment Anything Model

CHAPTER 2: BACKGROUND
2.1 Style Transfer
2.2 Segment Anything Models
2.3 Segmented Style Transfer

CHAPTER 3: METHODOLOGY
3.1 Style Transfer by Linear Transformation
3.2 Integrating Partial Convolution into the Style Transfer Network
3.3 Segmented Style Transfer Techniques
3.4 Predicting Which Technique Will Perform Best

CHAPTER 4: RESULTS
4.1 Qualitative Comparisons
4.2 Quantitative Analysis
4.2.1 Comparing Distributions
4.2.2 Analyzing Contrast
4.2.3 Insight on Stylization Preferences

CHAPTER 5: CONCLUSION
5.1 Limitations
5.2 Future Work
5.3 Potential Applications

BIBLIOGRAPHY


LIST OF FIGURES

1.1 Example output using the Segment Anything Model
1.2 Example output using the Linear Style Transfer algorithm for a full image
1.3 Example output using a Partial Convolution algorithm for style transfer
3.1 Overview of the original style transfer network
3.2 Visualization of the Partial Convolution layer
3.3 Pipeline for the Style-then-Mask algorithm
3.4 Example output using the Style-then-Mask algorithm
3.5 Pipeline for the Mask-then-Style algorithm
3.6 Example output using the Mask-then-Style algorithm
3.7 Pipeline for the Partial Convolution algorithm
3.8 Example output using the Partial Convolution algorithm
4.1 Example outputs from the three techniques - part 1
4.2 Example outputs from the three techniques - part 2
4.3 Comparison of the three techniques - part 1
4.4 Comparison of the three techniques - part 2
4.5 Color histograms for image and masked region in RGB and grayscale
4.6 Contrast scatter plot on content image and masked region
4.7 Example input with varied contrast
4.8 Contrast scatter plot on output stylizations
4.9 Example 1 output of contrast insight
4.10 Example 2 output of contrast insight


Chapter 1

Introduction

In 2023, Meta AI released the Segment Anything Model [8], a new approach for automatically

detecting and outlining meaningful regions of an image. In this thesis, I will combine this

technique with Style Transfer [2], an approach for making an image look like it was painted

or styled in a particular way. I will now describe both of these fundamental pieces in more

detail.

1.1 What is the Segment Anything Model?

The Segment Anything Model (SAM) is a powerful AI model developed by Meta AI that can

automatically identify and segment objects in images with minimal human input. Previous

approaches relied on class labels to segment objects [17, 4] whereas SAM was trained without

labels. SAM is designed to work with a wide variety of objects and environments, even those

it hasn’t seen before, making it highly versatile for various applications in computer vision.

The output of SAM is a list of masks for the objects in the scene. Each mask indicates

the pixels that are a part of the object. Masks can overlap and the set of masks does

not necessarily need to cover every pixel in the image. An example output from SAM is

illustrated in Figure 1.1.

Figure 1.1: Example output using the Segment Anything Model. Given an input image (top)
and a point from the user, the model can determine all pixels associated with the object
surrounding that point, providing a mask of the object (bottom).

SAM has key features that will continue to make it more accessible to the public and more
mainstream in the coming years. Those features include:

• Zero-shot Segmentation: SAM can segment objects in an image without requiring
task-specific training. It can generalize well across different objects, including
unfamiliar ones.

• Interactive Segmentation: The model allows users to interact with the segmentation
process by providing various inputs like points, bounding boxes, or text prompts.

• General-purpose: Unlike traditional segmentation models that are trained to identify
specific objects or categories, SAM can segment “anything.” It is not restricted to a
predefined set of object categories and can work in a wide range of domains.

• Pre-trained on Massive Data: SAM is trained on a massive dataset, which gives it
a broad understanding of object shapes, boundaries, and contexts. This pre-training
allows it to generalize across new images and unseen categories. SAM leverages trans-
formers, a neural network architecture commonly used in natural language processing,
for its ability to capture global context in images. It also uses mask generation to
highlight different areas of interest within an image, offering flexibility in how objects
are segmented.

Clearly, the Segment Anything Model represents a leap in general-purpose segmentation,

making it easier to use AI in areas where accurate object identification is beneficial. It

will have broad applications in the fields of image editing, medical imaging, robotics, and

augmented/virtual reality.

1.2 What is Style Transfer?

Style Transfer is a technique in computer vision and deep learning that involves applying the
visual style features of one image onto another image. The technique was popularized by the
paper “A Neural Algorithm of Artistic Style” by Leon Gatys and colleagues in 2015 [2]. It creates a new image

that retains the content from one image (content image) and adopts the artistic or visual

style of another image (style image). The process usually uses neural networks, especially

Convolutional Neural Networks (CNNs), which are good at recognizing spatial patterns like

textures, shapes, and edges.

In particular, this work uses the style transfer approach presented by Li et al. [13]. In
this approach, a content image is provided which contains the objects, shapes, and structure
that should be preserved. A style image is also provided which contains the artistic style
(brushstrokes, color patterns, textures) that should be applied to the content image. Both of
these are fed into a pre-trained CNN and their representative features are blended together
to create the output. An example of the style transfer output is shown in Figure 1.2.

Figure 1.2: Example output using the Linear Style Transfer algorithm for a full image. A
content and style image (left) are fed into the style transfer network, resulting in a stylized
version of the content image (right).

Style Transfer continues to be actively researched and has a wide range of applications in art
creation, image editing, and film/game design. However, almost all approaches have focused

on whole-image stylization without addressing how these approaches behave when working

on individual regions of an image. This is the focus of my work.

1.3 Combining Style Transfer with the Segment Anything Model

The goal of this work is to combine Style Transfer with the Segment Anything Model. This

can result in powerful and creative applications that allow users to selectively apply artistic

styles to specific objects or regions within an image.

The naive approach would be to simply apply style transfer to the whole image, and then

to mask the image after the stylization. However, I show that this method sometimes leads
to under-stylization, where the features of the style image are not well represented in the

stylized content image.

To resolve this issue, I implement Partial Convolution [16] inside the style transfer CNN.
Partial convolution is a convolution technique that can properly convolve across masked
regions. Partial Convolution was selected for its inherent robustness for mask-aware processing
and seamless integration into previously trained CNNs. By using this at each layer of the
network and handling the resized mask at each step, the style transfer can be more accurate
to the original style. An example output of the partial-convolution-based approach is shown
in Figure 1.3.

Figure 1.3: Example output using a Partial Convolution algorithm for style transfer. Given a
content image and style image (left) and a masked object (the bird), the partial convolution
style transfer algorithm can apply the stylization to the masked region exclusively (right).

To verify the approach, the results are analyzed both qualitatively and quantitatively.

Partial convolution results will be compared to naive alternatives. Additionally, the results

are numerically analyzed to determine why and when partial convolution does better than

alternatives. Specifically, I find that the level of contrast in the image versus the masked

region can indicate the degree to which partial convolution will improve the stylization.

Lastly, this work also provides insight into why users think a stylization is good or bad
based on the level of contrast within the stylization.

In summary, the contributions of this thesis are as follows:

• A robust style transfer network that uses partial convolution in each layer.

• Qualitative results showing improvement of stylization in segmented images.

• A quantitative analysis and comparison of style transfer results, providing insights
into when partial convolution performs better and what causes users to prefer one
stylization over another.


Chapter 2

Background

2.1 Style Transfer

Modern style transfer techniques begin with the groundbreaking work of Gatys et al. [2] who

presented the first CNN-based style transfer algorithm. This approach was optimization-

based and operated on a single image and single style at a time. Other works improved the

speed and control of this approach [3, 9, 20].

Later, a feed-forward method was presented by Johnson et al. [6] that trained a network

for a single style image. After this training process, any content image could be fed into

the network and stylized in a quick feed-forward pass. Others also improved this approach

[10, 21, 23].

In 2017, Li et al. [14] reformulated style transfer as a modified image reconstruction

process. After a standard image reconstruction network is trained (usually an autoencoder),

the content image is fed into the network and the intermediate representation is altered

based on the style statistics. This approach is considered state-of-the-art and has CNN [13],

vision transformer [1], and even diffusion network [24] implementations.

For this thesis, I use the implementation of “Learning Linear Transformations for Fast

Image and Video Style Transfer” [13]. This network is used because it is a fully convolutional

neural network, making it easy to modify the convolution layers to be partial convolutions.


2.2 Segment Anything Models

Segmentation is a common computer vision task. Before SAM, modern AI-based approaches

were class-based, relying on class labels to complete the segmentation [17]. Approaches

included ResNet [4] and Detectron [22]. The Segment Anything Model [8] is unique because

it was trained in an iterative fashion that does not require class labels, thus allowing it to

segment objects, even if they have not been seen by the network before or have no semantic

meaning. Many works have built on this approach [18, 26].

2.3 Segmented Style Transfer

Not many works attempt to complete style transfer on segmented regions. Other approaches

use segmentation to influence style transfer [25, 15] or use class-based labels [12, 11]. Re-

cently, a work titled SAMStyler [19] has combined the Segment-Anything Model with Style

Transfer techniques, but relies on the slow optimization-based approach of the original Gatys

implementation. In this work, I will present an approach that completes classless style trans-

fer on segmented regions in a fast feed-forward way.



Chapter 3

Methodology

In this chapter, I explain the original network used for Style Transfer, how the network is

modified using Partial Convolution, and the qualitative and quantitative metrics I will use

to evaluate this new network.

3.1 Style Transfer by Linear Transformation

For this work, I use the Linear Style Transfer network model proposed by Li et al. [13]. I
use this model because it has state-of-the-art results and can readily integrate partial
convolution. The model contains two feed-forward networks, a symmetric encoder-decoder image
reconstruction module and a transformation learning module, as shown in Figure 3.1. The
encoder-decoder is trained to reconstruct any input image faithfully. It is then fixed and
serves as a base network in the remaining training procedures for stylization. The
transformation learning module contains two small CNNs which take features of the content and
style images from the encoder and output a transformation matrix T. The image style
is transferred through linear multiplication between the content features and the
transformation matrix T in the same layer. A pre-trained and fixed VGG-19 network is used to
compute style losses at multiple levels and one content loss in a way similar to prior
work [7, 5]. The model is a pure feed-forward convolutional neural network, which is able to
transfer arbitrary styles efficiently (140 fps).
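The multiplication between the content features and T can be illustrated with a short PyTorch sketch. This is only a minimal illustration under stated assumptions: the function name `apply_linear_transform` and the tensor shapes are hypothetical, and any feature normalization or channel compression that the actual network may perform around this step is omitted.

```python
import torch


def apply_linear_transform(content_feat, T):
    """Multiply content features by a per-image transformation matrix T.

    content_feat: (B, C, H, W) encoder features of the content image (assumed shape).
    T:            (B, C, C) matrix produced by the transformation module (assumed shape).
    """
    b, c, h, w = content_feat.shape
    flat = content_feat.reshape(b, c, h * w)   # flatten spatial dims: (B, C, H*W)
    out = torch.bmm(T, flat)                   # linear transform applied at every pixel
    return out.reshape(b, c, h, w)             # restore the spatial layout for the decoder
```

Because the same matrix is applied at every spatial location, the operation costs one batched matrix product per image, which is consistent with the feed-forward efficiency described above.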


Figure 3.1: Overview of the original style transfer network. The model contains a pre-trained
encoder, decoder, and a transformation module. Only the transformation module is learnable
during the style transfer training, while all the others are fixed. The transformation module
takes the content and style features and outputs a learned feature transform T. The content
features are multiplied by T and then fed into the decoder to generate the final output.

3.2 Integrating Partial Convolution into the Style Transfer Network

Partial convolution was first proposed by Liu et al. as a way of improving inpainting techniques

[16]. The partial convolution not only acts on an image, but is provided a mask that indicates

which pixels to include and which to ignore. Only pixels within the mask are multiplied by

the learned weights and affect the convolution output. A visual of this method is provided

in Figure 3.2.
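The behavior described above can be sketched as a small PyTorch layer. This is a minimal sketch, assuming the renormalization and mask-update rule of the original inpainting formulation [16] (scale by window size over the number of valid pixels, and mark an output pixel valid if any input pixel under the kernel was valid). The class name `PartialConv2d` and the `(B, 1, H, W)` mask layout are assumptions made for illustration and may differ from the code in the thesis repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartialConv2d(nn.Conv2d):
    """Convolution that aggregates only over pixels where mask == 1."""

    def forward(self, x, mask):
        # x: (B, C, H, W); mask: (B, 1, H, W) with 1 inside the region and 0 outside.
        mask = mask.to(x.dtype)
        with torch.no_grad():
            ones = torch.ones(1, 1, *self.kernel_size, device=x.device, dtype=x.dtype)
            # Count the valid pixels under each kernel window.
            valid = F.conv2d(mask, ones, stride=self.stride,
                             padding=self.padding, dilation=self.dilation)
            window = float(self.kernel_size[0] * self.kernel_size[1])
            # Rescale so windows with few valid pixels are not darkened.
            ratio = torch.where(valid > 0, window / valid.clamp(min=1.0),
                                torch.zeros_like(valid))
            new_mask = (valid > 0).to(x.dtype)
        # Convolve the masked input without bias, rescale, then add the bias back.
        out = F.conv2d(x * mask, self.weight, None, self.stride,
                       self.padding, self.dilation, self.groups)
        out = out * ratio
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1)
        return out * new_mask, new_mask
```

Since the sketch subclasses `nn.Conv2d`, weights from an ordinary convolution can be copied with `load_state_dict(conv.state_dict())`, and with an all-ones mask the layer reduces to the standard convolution.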

In addition to inpainting, partial convolution can also be integrated into the style transfer

task since it can replace any standard convolution layer. Partial convolution was also selected

for its natural mask-aware processing. This processing is robust and flexible to any mask
and image content and seamlessly handles boundaries in the mask during the convolution
operation.

In the original inpainting work, partial convolution layers replaced regular convolution
layers during the training process. In comparison, the style transfer network was not trained
with partial convolutions. However, partial convolutions can replace the regular convolution
in the style transfer network as long as the same convolution weights are used. No additional
fine-tuning is needed.

Figure 3.2: Visualization of the Partial Convolution layer. As the kernel for the CNN moves
across the image, it is multiplied by the mask for the image, causing aggregation to only
occur across pixels within the mask. This can effectively separate foreground information
from background information.

To set up the partial convolution style transfer network, every convolution layer in the en-

coder, decoder, and transformation blocks is replaced with partial convolutions, making sure

to transfer over the same weights. When providing an input image, the input mask is also

provided. In the encoder, the mask goes through each padding and pooling layer that the

image goes through to guarantee they stay the same size. In the decoder, bilinear interpolation
of the original mask is used to guarantee the input mask is the correct size at each layer.

Finally, the matrix multiplication that performs the style transformation is also masked at

the feature level. The final output is alpha blended with the original image to place the styl-

ized region back onto the background. All of these steps modify the style transfer network to

only stylize the region inside the masked area without influencing the stylization with image

values outside the masked area. The code for this new network is provided in a GitHub
repository at https://github.com/HadiSeyed/analyzing-segmented-style-transfer.
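The mask handling and final blending described in this section can be illustrated with a few helper functions. This is a rough sketch rather than the repository code: the function names are hypothetical, and the use of max pooling for the encoder-side mask is an assumption about the pooling layers involved.

```python
import torch
import torch.nn.functional as F


def downsample_mask(mask, kernel_size=2, stride=2):
    """Pass the mask through the same kind of pooling as the encoder features.

    Max pooling is assumed here; the mask is a float tensor of shape (B, 1, H, W).
    """
    return F.max_pool2d(mask, kernel_size=kernel_size, stride=stride)


def resize_mask(mask, size):
    """Bilinearly resize the original mask to the spatial size of a decoder layer."""
    return F.interpolate(mask, size=size, mode="bilinear", align_corners=False)


def alpha_blend(stylized, original, mask):
    """Keep stylized pixels where the mask is 1 and original pixels where it is 0."""
    return mask * stylized + (1.0 - mask) * original
```

For example, `alpha_blend(output, content, mask)` with a `(B, 1, H, W)` mask broadcasts over the three color channels and places the stylized region back onto the unmodified background.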



3.3 Segmented Style Transfer Techniques

The partial convolution network can perform style transfer on segmented images. I compare

it to two naive approaches for segmented/masked style transfer. This gives three total

segmented style transfer approaches:

• Style-then-mask: Stylize the whole image, then mask the stylized output.

• Mask-then-style: Mask the image, then stylize the modified image.

• Partial Convolution: Provide the mask in the style network using partial convolu-
tion.

I will describe each of these methods in detail.

Style-then-Mask

This method applies style transfer to the entire image first and then applies a mask to

keep or remove parts of the stylized output. The process is as follows:

1. Apply Style Transfer Globally: The entire content image undergoes a style transfer.

2. Apply Mask After Styling: After the stylization is complete, a binary mask is applied

to the result. The mask dictates which parts of the stylized image are retained, and

which parts will revert back to the original content after an alpha blending.

This pipeline is illustrated in Fig. 3.3. Visually, I find that this tends to lead to under-stylized

results. Since the whole image affected the stylization, the stylization in the region of interest

does not usually capture all of the style details that were present in the style image. For

most styles, this leads to darker colors in the stylizations than are present in the overall style

image. An example output is shown in Fig. 3.4.
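The two steps of Style-then-Mask can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: `stylize` is a hypothetical placeholder callable standing in for the full style transfer network, and images are assumed to be float arrays of shape (H, W, 3).

```python
import numpy as np


def style_then_mask(content, style, mask, stylize):
    """Stylize the whole image, then keep the result only inside the mask.

    content, style: float arrays of shape (H, W, 3).
    mask:           float array of shape (H, W), 1 inside the region of interest.
    stylize:        hypothetical callable (content, style) -> stylized image.
    """
    stylized = stylize(content, style)   # the whole image influences the stylization
    alpha = mask[..., None]              # broadcast the mask over the color channels
    # Alpha blend: stylized pixels inside the mask, original content elsewhere.
    return alpha * stylized + (1.0 - alpha) * content
```

Note that the mask plays no role until the final blend, which is why pixels outside the region of interest still affect the colors that end up inside it.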

Figure 3.3: Pipeline for the Style-then-Mask algorithm. The mask for the content image is
only applied during the blending stage.

Figure 3.4: Example output using the Style-then-Mask algorithm. A content image and
style image (left) give the following stylized output for the masked region (right). The
resulting stylization tends to be darker than the input style features.

Mask-then-Style

In this method, the mask is applied first, removing all other content from the image. The
process is as follows:

1. Apply Mask to the Content Image: A binary mask is applied to the content image to
select specific regions where the style transfer will occur. The pixels surrounding the
region of interest become black.

2. Apply Style Transfer to the Modified Image: The whole modified image is stylized.
The output is alpha blended using the mask to place the stylized output onto the
original content image.

This pipeline is illustrated in Fig. 3.5. Visually, I find that this tends to lead to poor results.

When masking beforehand, the region of interest is surrounded by constant dark pixels.

This makes the region of interest appear as a very bright object against a dark background.

Thus, the stylization tends to make very bright outputs when using this technique and these

outputs tend to have poor quality and do not match the style features. An example output

is given in Fig. 3.6.
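For comparison, the Mask-then-Style steps can be sketched the same way. As before, this is only an illustration: `stylize` is a hypothetical placeholder for the style transfer network, and images are assumed to be float arrays of shape (H, W, 3).

```python
import numpy as np


def mask_then_style(content, style, mask, stylize):
    """Black out everything outside the mask, stylize, then blend back.

    content, style: float arrays of shape (H, W, 3).
    mask:           float array of shape (H, W), 1 inside the region of interest.
    stylize:        hypothetical callable (content, style) -> stylized image.
    """
    alpha = mask[..., None]                      # broadcast the mask over color channels
    masked_content = alpha * content             # pixels outside the region become black
    stylized = stylize(masked_content, style)    # the network sees a dark background
    # Alpha blend the stylized region back onto the original content image.
    return alpha * stylized + (1.0 - alpha) * content
```

The only difference from Style-then-Mask is the input to `stylize`: here the network receives the region surrounded by constant black pixels, which is the source of the overly bright outputs described above.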

Partial Convolution

This method applies style transfer to the unmodified image, but the mask is provided to
the style transfer network with partial convolutions replacing regular convolution layers as

described in Section 3.2. The process is as follows:

1. Modify the Style Transfer Network: Modify the style transfer network to use partial

convolution while maintaining the original weights.

2. Apply the Modified Style Transfer Network: Input content, style, and mask into the

modified style transfer network.

3. Apply Blending After Stylization: Since partial convolution was used in the neural

network, the output is already masked. The output is alpha blended with the original

content based on the mask.

This pipeline is illustrated in Fig. 3.7. Visually, I find that partial convolution better matches
the style image statistics since no stylized pixels are removed from the final output during
the alpha blending. Standard style transfer blends content and style features globally across
the entire image, regardless of whether parts of the image are masked. With partial convolution,
convolutions are applied only to valid pixels inside the mask, so the style is applied
selectively and in a more controlled way to the selected region. An example output is given
in Fig. 3.8.

Figure 3.5: Pipeline for the Mask-then-Style algorithm. The mask for the content image is
applied before the image is fed into the style transfer network.

Figure 3.6: Example output using the Mask-then-Style algorithm. A content image and
style image (left) give the following stylized output for the masked region (right). The
resulting stylization tends to be much brighter than the input style features.


Figure 3.7: Pipeline for the Partial Convolution algorithm. The mask for the content image
is provided to the network and a partial convolution layer is used at each stage of the style
transfer network.

Figure 3.8: Example output using the Partial Convolution algorithm. A content image and
style image (left) give the following stylized output for the masked region (right). The
resulting stylization tends to be closer to the input style features than the other two approaches.


3.4 Predicting Which Technique Will Perform Best

In Chapter 4, the outputs of all three segmented style transfer techniques are visually com-

pared side-by-side. This qualitative comparison is ultimately the best comparison for a task

that is visual and subjective in nature. I found that partial convolution usually

stylizes the best, but sometimes style-then-mask performs similarly to partial convolution,

and sometimes even better. Mask-then-style appears to never stylize as well as the other

two techniques.

Since partial convolution is not the best technique in every case, the last part of this

work is to understand what conditions in the initial image lead to a particular method

doing better than another. A quantitative analysis between the inputs and outputs of the

techniques is conducted. Specifically, I analyze the statistics on the following:

Data

• Original Image

• Masked Region

• Style Image

• Style-then-Mask Output

• Mask-then-Style Output

• Partial Convolution Output

Statistics

• Mean RGB Color Value

• Mean Grayscale Value

• Standard Deviation of RGB Color Value

• Standard Deviation of Grayscale Value

• Contrast in RGB Color (Max Value - Min Value)

• Contrast in Grayscale (Max Value - Min Value)



Additionally, distributions in RGB and grayscale color values can be compared for each kind

of data. The findings of this analysis are presented in Chapter 4.
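The statistics listed above can be computed with a short NumPy function. This is a minimal sketch under stated assumptions: the function name and dictionary keys are hypothetical, contrast in RGB is computed per channel, and the grayscale conversion uses the common 0.299/0.587/0.114 luma weights, which may differ from the conversion used in the actual experiments.

```python
import numpy as np


def region_statistics(image, mask=None):
    """Mean, standard deviation, and contrast (max - min) in RGB and grayscale.

    image: (H, W, 3) RGB array; mask: optional (H, W) boolean array selecting a region.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)
    if mask is not None:
        pixels = pixels[mask.reshape(-1).astype(bool)]   # keep only pixels inside the mask
    gray = pixels @ np.array([0.299, 0.587, 0.114])      # assumed luma weights
    return {
        "mean_rgb": pixels.mean(axis=0).tolist(),
        "mean_gray": float(gray.mean()),
        "std_rgb": pixels.std(axis=0).tolist(),
        "std_gray": float(gray.std()),
        "contrast_rgb": (pixels.max(axis=0) - pixels.min(axis=0)).tolist(),
        "contrast_gray": float(gray.max() - gray.min()),
    }
```

Calling the function once without a mask and once with it gives the whole-image and masked-region statistics for the same content image, which is the comparison used in the analysis.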



Chapter 4

Results

In this chapter, I present qualitative comparisons between the different style transfer tech-

niques, as well as quantitative analyses of the inputs and results.

4.1 Qualitative Comparisons

To understand which method works best, I compare all of them side by side for different
mask sizes: large, medium, small, and very small. The following process was used to
generate results.

First, the content image is loaded and fed into the Segment Anything Model (SAM).
SAM then generates and saves masks for each distinct region of interest in the image. Next,
the original content image is displayed in an interactive visualization tool, developed using
OpenCV, in which the masks from SAM are highlighted and overlaid on the image. Masks are
selected interactively with one click to choose only the regions of interest. All selected
masks are then combined into a single mask that is applied uniformly across the image. An
alpha channel is created from the combined mask to manage transparency and layering, and the
image is normalized, if needed, so that its values fall within the range expected for
processing. Then, the content, style, and mask images are converted to tensors and fed into
the three style transfer techniques, which are implemented in PyTorch. Finally, the
three outputs are alpha blended with the original image to map the stylized regions back
onto the background.
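The step that merges the selected masks into one alpha channel can be sketched as follows. This is only an illustration: the function name is hypothetical, and the selected SAM masks are assumed to be available as boolean arrays of identical shape.

```python
import numpy as np


def combine_masks(masks):
    """Union of boolean (H, W) masks, returned as a float alpha channel in [0, 1]."""
    combined = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        combined |= m.astype(bool)       # a pixel is selected if any chosen mask covers it
    return combined.astype(np.float32)   # 1.0 inside the selection, 0.0 outside
```

Because SAM masks can overlap, a logical union is used here rather than a sum, so overlapping regions still produce an alpha value of exactly 1.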


Figure 4.1: Example outputs from the three techniques - part 1: style-then-mask (left),
mask-then-style (middle), partial convolution (right). Content images and style images give
the following stylized output for the masked region.

Figure 4.2: Example outputs from the three techniques - part 2: style-then-mask (left),
mask-then-style (middle), partial convolution (right). Content images and style images give
the following stylized output for the masked region.

Figure 4.3: Comparison of the three techniques - part 1: style-then-mask (left),
mask-then-style (middle), partial convolution (right) - partial convolution is better. A content
image and style image (top) give the following stylized output for the masked region (bottom).
The resulting stylization from partial convolution tends to be closer to the input style
features than the other two approaches.

Figure 4.4: Comparison of the three techniques - part 2: style-then-mask (left),
mask-then-style (middle), partial convolution (right) - style-then-mask is better. A content
image and style image (top) give the following stylized output for the masked region (bottom).
The style-then-mask approach matches closer to the content and partial convolution matches
closer to the style.

Many visual comparisons are provided for different content images, style images, and

mask sizes in Figures 4.1 and 4.2. From these visuals, some observations can be made.
First, the mask-then-style method has brighter outputs than the other two and usually has

poor results. In general, mask-then-style is not a good approach for segmented stylization.

Second, partial convolution does better than the other two in most cases. This is because

the partial convolution approach can better match the statistics within the style for the

masked region, with style-then-mask usually being too dark and mask-then-style being too

bright. A clear visual of this behavior is shown in Figure 4.3.

Third, in some cases, style-then-mask and partial convolution results look the same,

usually in large masks. For small masks, style-then-mask can sometimes do better. Also,

visual quality is ultimately a subjective matter and it can be unclear which one did better.

Style-then-mask tends to hold to the content better and partial convolution holds to the

style features better. A good example of this is shown in Figure 4.4.



4.2 Quantitative Analysis

In this section, the aim is to understand why partial convolution does better on some images

but not on others. Ultimately, that difference has to depend on the content, style, and mask

inputs. During the stylization process, statistics about the content image, the masked region

of the content image, and the style image were saved to a JSON file for each experiment.

By analyzing these saved statistics, the effects they have on the final stylized outputs can

be better understood for each of the three methods.

Overall, I present the following hypothesis: partial convolution stylizes the best when the

distribution of colors in the masked region is different than the distribution for the whole

image. Specifically, partial convolution stylizes better when the contrast in the masked

region is smaller than the contrast in the whole image. This conclusion is supported with

two experiments as presented in the following subsections:

4.2.1 Comparing Distributions

First, a visual comparison between the original image and the masked region is provided.
The plot-channel-histogram function is defined to create a histogram plot for the image
channels. Each histogram is normalized by dividing by the total number of pixels in the
image or mask, so that it shows relative frequencies. The red, green, blue, and grayscale
histograms are plotted separately.
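
The normalization step can be sketched as follows. This is not the plot-channel-histogram
function itself. It is a minimal NumPy example under the assumptions that the image is an
8-bit RGB array, that the mask is a boolean array, and that grayscale is obtained with
BT.601 weights. The names normalized_histogram and channel_histograms are hypothetical.

```python
import numpy as np

GRAY_WEIGHTS = np.array([0.299, 0.587, 0.114])  # assumed grayscale conversion


def normalized_histogram(values, bins=256):
    """Relative-frequency histogram of 8-bit intensities."""
    counts, _ = np.histogram(values, bins=bins, range=(0, 256))
    # Dividing by the number of pixels makes image and mask histograms comparable.
    return counts / values.size


def channel_histograms(rgb, mask):
    """Histograms for the whole image and the masked region, per channel.

    rgb:  H x W x 3 uint8 array.
    mask: H x W boolean array selecting at least one pixel.
    """
    channels = {
        "red": rgb[..., 0],
        "green": rgb[..., 1],
        "blue": rgb[..., 2],
        "gray": rgb @ GRAY_WEIGHTS,
    }
    return {
        name: {
            "image": normalized_histogram(values.ravel()),
            "mask": normalized_histogram(values[mask]),
        }
        for name, values in channels.items()
    }
```

Because both histograms sum to one, the curve for a small masked region can be overlaid on
the curve for the full image without being dwarfed by the difference in pixel counts.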

The histograms for the bird image in Figure 4.3 are plotted in Figure 4.5. It can be seen
that the distributions for the full content image and the masked region are very different.
This difference leads to the better visual output for partial convolution.

Figure 4.5: The histograms of the bird image in the red, green, and blue channels and in
grayscale. The statistics for the original image and the masked region are overlaid on each
other. The histograms show that the distributions of colors for the full image and the masked
region are disparate. This leads to partial convolution outperforming the other two methods.

4.2.2 Analyzing Contrast

In order to accurately analyze the input images, it is first necessary to sort the outputs
into three categories:

• partialconv-better: When partial convolution had better stylization than
style-then-mask

• same: When partial convolution had similar stylization to style-then-mask

• stylethenmask-better: When style-then-mask had better stylization than partial
convolution

Mask-then-style was not included because there were no cases in which it performed better.
For each input image and mask, the outputs were visually analyzed to determine which method
performed best. Then, the associated JSON statistics files were separated into different
folders based on that categorization. Finally, the statistics were loaded and processed by
iterating through the partialconv-better, same, and stylethenmask-better directories, and a
scatter plot was generated from all of the JSON files. This allows for a visual comparison
of the extracted metrics across the categories. The figures were saved for review and
further analysis.
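
The loading step can be sketched as below. The three directory names come from the
categories above, but the function name load_category_points and the nested JSON keys
("content", "masked", "gray", "contrast") are hypothetical and would need to match whatever
layout the statistics files actually use.

```python
import json
from pathlib import Path

CATEGORIES = ("partialconv-better", "same", "stylethenmask-better")


def load_category_points(root,
                         x_key=("content", "gray", "contrast"),
                         y_key=("masked", "gray", "contrast")):
    """Collect one (x, y) point per JSON file, grouped by category directory."""

    def lookup(record, path):
        # Walk a nested dictionary along the given sequence of keys.
        for key in path:
            record = record[key]
        return record

    points = {}
    for category in CATEGORIES:
        points[category] = []
        for json_path in sorted(Path(root, category).glob("*.json")):
            with open(json_path) as handle:
                record = json.load(handle)
            points[category].append((lookup(record, x_key), lookup(record, y_key)))
    return points
```

Each category's list of points can then be passed to a plotting call such as matplotlib's
plt.scatter with a distinct color per category, and changing x_key and y_key allows other
pairings of statistics to be examined with the same loader.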

For this experiment, 36 different images were loaded and their outputs categorized. The
JSON files were read independently for each category and the statistics were plotted for
comparison. Statistics were computed for the content image, the masked region of the content
image, and the style image. The mean, standard deviation, and maximum contrast were analyzed
for both the RGB channels and grayscale. All possible pairings of statistics and data were
considered across all color channels.

Figure 4.6: Contrast scatter plot in the gray channel for the three categories:
partialconv-better, same, and stylethenmask-better. Statistics for 36 images are plotted to
show that the partial convolution method does better where the content image contrast in
grayscale is high but the masked region contrast is low.

After completing this thorough analysis, two statistics seemed to be most closely correlated
with the cases where partial convolution does better. Specifically, the contrast in the gray
channel appears to be the key factor: partial convolution does better when the content image
contrast is high but the masked region contrast is low. The scatter plot showing this
pattern is given in Figure 4.6. Intuitively, this makes sense, since partial convolution will
do a better job of spreading the style across the masked region statistics. An example of
this phenomenon is shown in Figure 4.7.
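
One simple way to check such a pattern numerically, rather than only by eye, is to compare
the average gap between content contrast and masked-region contrast in each category. The
sketch below assumes that each category is given as a list of (content contrast, masked
contrast) pairs. The function name mean_contrast_gap is hypothetical, and the numbers in the
example are made up for illustration; they are not results from this thesis.

```python
from statistics import mean


def mean_contrast_gap(points):
    """Average (content contrast - masked-region contrast) per category.

    points maps a category name to a list of
    (content_contrast, masked_contrast) pairs; empty categories are skipped.
    """
    return {
        category: mean(content - masked for content, masked in pairs)
        for category, pairs in points.items()
        if pairs
    }


# Made-up values for illustration only; these are not results from the thesis.
example = {
    "partialconv-better": [(0.9, 0.2), (0.8, 0.3)],
    "stylethenmask-better": [(0.6, 0.5), (0.7, 0.7)],
    "same": [],
}
gaps = mean_contrast_gap(example)
```

If the pattern holds, the gap for the partialconv-better category should be clearly larger
than the gap for the stylethenmask-better category.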



Figure 4.7: Example with different contrast in the content image vs. the masked region,
before and after stylization. A content image, style image, and mask image give the stylized
outputs for the masked region (bottom). The resulting stylization shows that partial
convolution does a better job of spreading the style across the masked region statistics
than the style-then-mask method due to the high contrast.



Figure 4.8: Visualization of grayscale contrast values for output stylizations. This plot leads
to the insight that contrast in the output may be linked to the selection of the best style
method. For example, when style-then-mask had high contrast but partial convolution did not,
style-then-mask was selected as the better method.

4.2.3 Insight on Stylization Preferences

In addition to the content, masked region, and style statistics, statistics on the three
output images were computed. As part of this analysis, an interesting insight was discovered.
The categorization of which method performed best appears to be highly correlated with the
contrast of the output stylization. The stylization that has the highest contrast tends to
be selected as the best method, and outputs with similar contrasts were often marked as
performing the same. This might indicate that contrast can be used to understand user
preferences in stylization. A visualization of this pattern is given in Figure 4.8. Two
examples illustrating contrast's effect on preference are given in Figures 4.9 and 4.10.
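
The observed relationship can be written as a small rule of thumb. The sketch below is only
an illustration of that rule, not a validated predictor: the function name
predict_preference and the tolerance value are hypothetical, and the contrast values are
assumed to be measured on the same scale for both outputs.

```python
def predict_preference(stylethenmask_contrast, partialconv_contrast, tolerance=0.05):
    """Guess the preferred method from output contrast alone.

    tolerance is a hypothetical threshold below which the two outputs
    are treated as equally good.
    """
    difference = partialconv_contrast - stylethenmask_contrast
    if abs(difference) <= tolerance:
        return "same"
    return "partialconv-better" if difference > 0 else "stylethenmask-better"
```

Comparing the output of such a rule against the manual categorization would give a rough
measure of how much of the preference is explained by output contrast alone.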



Figure 4.9: Example 1 of the contrast insight on stylization preferences. Style-then-mask
(left), mask-then-style (center), and partial convolution (right) style outputs. The resulting
stylization shows that the partial convolution contrast is high but the style-then-mask
contrast is low, so partial convolution is more visually appealing, since it does a better job
of spreading the style across the masked region statistics than the style-then-mask method.

Figure 4.10: Example 2 of the contrast insight on stylization preferences. Style-then-mask
(left), mask-then-style (center), and partial convolution (right) style outputs. The resulting
stylization shows that style-then-mask and partial convolution both have high contrast, so
both would have similar user preferences.



Chapter 5

Conclusion

This work presented an effective technique for combining style transfer with the Segment
Anything Model. It showed that introducing partial convolution into the style transfer
network created stylizations that more closely matched the style image statistics. It
verified this finding through both quantitative and qualitative means. It compared multiple
segmented stylization techniques and analyzed them through side-by-side visuals. Finally,
statistics were calculated for the masked and unmasked regions and shown to play a
significant role in the output stylization. Specifically, contrast is a key factor in
determining the effectiveness of stylization with partial convolution.

5.1 Limitations

Combining style transfer with the Segment Anything Model (SAM) offers creative control
over selected areas of an image, but there are some limitations to be aware of:

• Boundary Precision: SAM segments may not perfectly align with edges in complex
images, which can lead to blending artifacts when style transfer is applied only to
selected regions. This can make the styled area look unnatural or mismatched with
the surrounding content.

• Consistency Across Frames: In videos or multiple-image sequences, SAM may produce
slightly different masks for each frame, making it challenging to achieve consistent style
transfer across frames, resulting in flickering or temporal distortion.

• Detail Loss: Applying style transfer only to specific regions can sometimes blur or alter
fine details within those areas, especially if the style heavily emphasizes textures or
colors, potentially reducing important visual features within the segmented areas.


• Performance Overhead: Combining SAM with style transfer requires multiple process-
ing steps, such as segmenting, masking, and transferring styles, which can be compu-
tationally intensive, especially for high-resolution images or real-time applications.

• Limited Adaptability to Complex Textures: SAM’s segmentation may struggle with
highly intricate or overlapping textures, which can hinder precise masking and reduce
control over where the style is applied. This can be especially challenging when using
complex or textured styles that require accurate segmentation.

• Fixed Semantic Segmentation: SAM does not inherently prioritize semantic informa-
tion (e.g., distinguishing background from foreground in every context), so certain style
transfers may apply to unintended areas if the segmentation is not carefully refined.

These limitations highlight the need for careful mask selection and tuning when integrat-

ing SAM with style transfer to achieve a natural, cohesive look in stylized images.

5.2 Future Work

As future work, the following directions could be explored:

• Implementing additional distribution metrics, as well as comparing stylized results with
perceptual loss common to the literature.

• Completing an ablation study across multiple configurations of the network.

• Using machine learning algorithms to develop a more precise mathematical model for
how image and mask values affect the performance of partial convolution.

• Comparing techniques in the context of stylizing multiple regions at a time, each with
its own style image.

• Conducting a user study to see whether more people agree on the best stylization result
for each image.

• Computing the statistics with more images. The prediction scatter plot was generated
with only 36 images; a larger set of images would further solidify the findings.

5.3 Potential Applications

The fusion of segmentation with style transfer opens up new possibilities in both artistic
and practical domains by enhancing control over how and where style transfer is applied.
Unlike traditional style transfer, which affects the entire image, SAM’s ability to segment
specific objects or regions in an image will allow for more precise control over which parts
of an image are stylized and which remain in their original form. Additionally, artists will
be able to use SAM to segment different objects or regions interactively and apply different
styles to each. Flexible tools can be created that improve control of both the mask and the
stylization, giving graphic designers, photographers, and other digital artists more
expressive control while removing difficult or time-consuming barriers.



BIBLIOGRAPHY

[1] Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., and Xu, C. Stytr2:
Image style transfer with transformers. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (2022), pp. 11326–11336.

[2] Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolu-
tional neural networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition (2016), pp. 2414–2423.

[3] Gatys, L. A., Ecker, A. S., Bethge, M., Hertzmann, A., and Shechtman,
E. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE
conference on computer vision and pattern recognition (2017), pp. 3985–3993.

[4] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recogni-
tion. In Proceedings of the IEEE conference on computer vision and pattern recognition
(2016), pp. 770–778.

[5] Huang, X., and Belongie, S. J. Arbitrary style transfer in real-time with adaptive
instance normalization. CoRR abs/1703.06868 (2017).

[6] Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style trans-
fer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 (2016),
Springer, pp. 694–711.

[7] Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style
transfer and super-resolution. CoRR abs/1603.08155 (2016).

[8] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L.,
Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023),
pp. 4015–4026.

[9] Kolkin, N., Salavon, J., and Shakhnarovich, G. Style transfer by relaxed
optimal transport and self-similarity. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (June 2019).

[10] Kotovenko, D., Sanakoyeu, A., Ma, P., Lang, S., and Ommer, B. A content
transformation block for image style transfer. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (June 2019).


[11] Kulkarni, H., Khare, O., Barve, N., and Mane, S. Improved object-based style
transfer with single deep network, 2024.

[12] Kurzman, L., Vazquez, D., and Laradji, I. Class-based styling: Real-time local-
ized style transfer with semantic segmentation. In Proceedings of the IEEE International
Conference on Computer Vision Workshops (2019), pp. 0–0.

[13] Li, X., Liu, S., Kautz, J., and Yang, M.-H. Learning linear transformations for
fast image and video style transfer. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (2019), pp. 3809–3817.

[14] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., and Yang, M.-H. Universal style
transfer via feature transforms. Advances in neural information processing systems 30
(2017).

[15] Lin, Z., Wang, Z., Chen, H., Ma, X., Xie, C., Xing, W., Zhao, L., and Song,
W. Image style transfer algorithm based on semantic segmentation. IEEE Access 9
(2021), 54518–54529.

[16] Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro,
B. Image inpainting for irregular holes using partial convolutions. In The European
Conference on Computer Vision (ECCV) (2018).

[17] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for
semantic segmentation. In Proceedings of the IEEE conference on computer vision and
pattern recognition (2015), pp. 3431–3440.

[18] Ma, J., He, Y., Li, F., Han, L., You, C., and Wang, B. Segment anything in
medical images. Nature Communications 15, 1 (2024), 654.

[19] Psychogyios, K., Leligou, H. C., Melissari, F., Bourou, S., Anastasakis,
Z., and Zahariadis, T. B. Samstyler: Enhancing visual creativity with neural style
transfer and segment anything model (sam). IEEE Access 11 (2023), 100256–100267.

[20] Puy, G., and Perez, P. A flexible convolutional solver for fast style transfers. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019).

[21] Ulyanov, D., Vedaldi, A., and Lempitsky, V. Improved texture networks: Max-
imizing quality and diversity in feed-forward stylization and texture synthesis. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).

[22] Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. Detectron2.
https://github.com/facebookresearch/detectron2, 2019.

[23] Yanai, K., and Tanno, R. Conditional fast style transfer network. In Proceedings of
the 2017 ACM on International Conference on Multimedia Retrieval (New York, NY,
USA, 2017), ICMR ’17, ACM, pp. 434–437.



[24] Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., and Xu, C.
Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition (2023), pp. 10146–10156.

[25] Zhao, H.-H., Rosin, P. L., Lai, Y.-K., and Wang, Y.-N. Automatic semantic
style transfer using deep convolutional neural networks and soft masks. The Visual
Computer 36, 7 (2020), 1307–1324.

[26] Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao,
J., and Lee, Y. J. Segment everything everywhere all at once. Advances in Neural
Information Processing Systems 36 (2024).


