Blockchain Development + Image Processing

Neo Quest: Blockchain Powered AI Art Competition

In 2022, at the very beginning of the whole AI buzz (the one that started with AI image generators), when artists were worried that AI was going to take their jobs, I worked on an interesting project: an AI art competition concept where participants compete to recreate a target image using only AI prompts. The idea was simple:

If you think it's easy, buy a ticket and join the competition.

The goal was to show artists that it's not that easy to recreate what's in your mind using only your words. Well, NOW it's 2025 and that's not valid anymore, but in my defense, back then we only had Stable Diffusion and an early version of Midjourney (which you could only use via Discord), and the quality wasn't that good. Here is the very first image I generated using Midjourney:

Occupied Mars

The platform ran daily competitions and was built on Blockchain (so everything was transparent and safe). Users needed to buy a ticket and then had 1 hour and 100 prompts to recreate the target image. The winner was determined by whose generated image was most similar to the target, and the prize was 80% of the collected ticket sales. In this post I will share how I approached the image similarity scoring system (the blockchain part isn't that big of a deal to explain).

The Problem

The core challenge was simple: given one target image per competition, how do you accurately rank candidate submissions? This might sound straightforward at first, but as I dug deeper, I realized the classical approaches I had used before were fundamentally flawed for this task.

I chose the Girl with a Pearl Earring as the target image, and found different meme versions of it online to use as test attempts for the system.

A young woman wearing a yellow dress with white collar and a blue and yellow turban and a large pearl earring.

How I Did It Initially (Classic Image Processing)

In my first attempts at image comparison, I threw everything at the problem. I combined dozens of hand-crafted metrics from different libraries and image processing techniques. The idea was to calculate all these metrics between the target and candidate images, normalize them somehow, and combine them with some weights to produce a final similarity score. I'll briefly explain the methods I tried and show the results for these two images:

Global Histogram Comparisons and Pixel Statistics

This method compares the overall color or intensity distribution of two images by analyzing their histograms and basic pixel statistics, providing a simple measure of similarity. It has two sides:

  • Histogram Difference: Compares color or intensity histograms
  • Flattened Correlation: Correlation of flattened grayscale pixel arrays.

Histogram difference is global and ignores pixel order, while correlation checks the linear relationship between pixel values. To measure similarity with histogram difference, we compare the frequency distributions of pixel intensities or colors between the two images, typically by computing the sum of absolute differences across histogram bins; lower values indicate greater similarity in overall color or intensity distribution. For the correlation, we convert both images to grayscale, flatten them into one-dimensional arrays, and calculate the Pearson correlation coefficient between these arrays. A correlation value closer to 1 signifies high similarity in pixel intensity patterns, even if spatial positions differ.
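
To give an idea of what this looks like in practice, here is a minimal sketch of both checks with OpenCV and NumPy (the function name and the bin count are just illustrative):

import cv2
import numpy as np

def histogram_and_correlation(path_a, path_b, bins=64):
    a = cv2.imread(path_a)
    b = cv2.imread(path_b)
    b = cv2.resize(b, (a.shape[1], a.shape[0]))  # align sizes before comparing

    # Histogram difference: sum of absolute differences over normalized per-channel histograms
    hist_diff = 0.0
    for ch in range(3):
        ha = cv2.calcHist([a], [ch], None, [bins], [0, 256]).ravel()
        hb = cv2.calcHist([b], [ch], None, [bins], [0, 256]).ravel()
        hist_diff += np.abs(ha / ha.sum() - hb / hb.sum()).sum()

    # Flattened correlation: Pearson coefficient of the grayscale pixel arrays
    ga = cv2.cvtColor(a, cv2.COLOR_BGR2GRAY).ravel().astype(np.float64)
    gb = cv2.cvtColor(b, cv2.COLOR_BGR2GRAY).ravel().astype(np.float64)
    correlation = np.corrcoef(ga, gb)[0, 1]

    return hist_diff, correlation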

That's why it wasn't enough: it only tells you whether you used the same colors with the same brightness and contrast. I needed a method that considers structure and the "actual" similarity...

Structural and Perceptual Metrics

These metrics refer to a family of image similarity measures that go beyond simple pixel-wise comparisons by modeling how humans perceive visual quality and fidelity. They evaluate aspects like luminance (brightness), contrast, structure (spatial relationships), and information preservation, aiming to align with human visual perception instead of just mathematical differences. For these metrics I used the SEWAR library, which includes the following:

  • SSIM (Structural Similarity Index): Measures luminance, contrast, and structure similarity (closer to human perception)
  • VIFP (Visual Information Fidelity): Measures information preserved in the image.
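
Using them is only a couple of lines; a rough sketch with SEWAR (paths are illustrative, and as far as I remember sewar's ssim returns a tuple, so treat the exact return values as an assumption):

import cv2
from sewar.full_ref import ssim, vifp

target = cv2.imread("target.jpg")
candidate = cv2.imread("candidate.jpg")
candidate = cv2.resize(candidate, (target.shape[1], target.shape[0]))

print("SSIM:", ssim(target, candidate))  # structural similarity
print("VIFP:", vifp(target, candidate))  # visual information fidelity (pixel domain)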

They might have some use in detecting similarities, but only on low-level attributes such as luminance, contrast, and structure; they overlook semantic content, contextual meaning, and high-level features. This can be problematic: two images might score as highly similar if they share structural elements despite depicting completely different subjects or scenes, and conversely, images of the same object under varying conditions might appear dissimilar. Moreover, these metrics can be tricked by adversarial perturbations that subtly alter pixel values to preserve perceptual scores while changing the image's interpretation, or by transformations like rotations and scalings that maintain statistics but disrupt spatial relationships beyond their evaluation scope.

So I needed something that detects matching features, which is where keypoint matching came in. By the end, the full classical pipeline combined:

  • Global histogram comparisons and pixel statistics
  • Full-reference metrics from SEWAR
  • Keypoint matching with ORB, SIFT, and FLANN as the matcher (a minimal ORB sketch follows this list)
  • Simple correlation of flattened grayscale pixels
  • Heuristic scaling and clipping of metric values
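
For reference, the ORB side of the keypoint matching looked roughly like this (a minimal sketch using OpenCV's brute-force Hamming matcher; the ratio threshold is illustrative):

import cv2

def orb_match_count(path_a, path_b, n_features=1000):
    a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=n_features)
    _, des_a = orb.detectAndCompute(a, None)
    _, des_b = orb.detectAndCompute(b, None)
    if des_a is None or des_b is None:
        return 0  # low-texture images often yield no keypoints at all

    # Brute-force Hamming matching with Lowe's ratio test to drop ambiguous matches
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good)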

The results looked interesting at first, but there were some issues:

The Problems with This Approach

After implementing this system and running it on test images, several critical issues became clear:

1. Mixed scales and ad-hoc normalization: Each metric has different ranges and statistical meanings. MSE could be in thousands while SSIM is between 0 and 1. I tried to normalize them with arbitrary constants (like subtracting from 1e16) which completely broke interpretability and made the system unstable.

2. Fragile to common variations: Histogram comparison and pixel correlation fail badly under lighting changes, small crops, translations, or compression artifacts. These are exactly the kinds of variations that legitimate AI-generated attempts might have.

3. Keypoint matching failures: For images with low texture or uniform regions, SIFT and ORB would find very few keypoints. When viewpoint, scale, or illumination changed even slightly, the matches dropped dramatically even for perceptually similar images.

4. Over-engineering and redundancy: I was computing 15+ different metrics, many of which were highly correlated. This added computation cost but very little new information. The weighted average at the end was just guessing.

5. Poor statistical grounding: There was no principled way to set the weights. Each new target image or dataset would require manual retuning. The final score had no clear interpretation.

[IMAGE: Scatter plot showing correlation between different classical metrics]

6. Undefined behavior and bugs: Looking back at the code, there were several places with undefined variables, inconsistent preprocessing, and odd transforms that made the results unreliable and impossible to reproduce.

The most telling sign that this approach was wrong came when I tried to tune the weights. No matter how I adjusted them, some obvious matches scored poorly while some clearly different images scored well. The system was fundamentally not learning what "similar" meant for this task.

Then I remembered I know Deep Learning...

I was thinking: what if, instead of hand-crafting similarity metrics and relying on trial and error, I let a model learn what makes the target image unique?

The key insight was this: I have exactly one target image per competition, and I need to score how well candidates match that specific target. This is not a general image similarity problem. It's a single-image manifold learning problem.

If I could train a model that learns the "space of valid variations" of the target image, then candidates that fall within that learned manifold should score well, and candidates that fall outside should score poorly. The reconstruction error of an autoencoder trained on augmented versions of the target would be exactly this measure.

The New Approach: Per-Target Autoencoder

I decided to train a small convolutional autoencoder for each target image. The training data would be many augmented versions of that single target, teaching the model what kinds of transformations are acceptable while preserving the essential structure.

Here's the architecture I settled on:

Encoder:

  • Four convolutional layers with stride=2 downsampling
  • Progressive channel increase: 3 -> 64 -> 128 -> 256 -> 512
  • BatchNorm + ReLU after each conv
  • Reduces 256x256 input to 16x16 feature maps

Bottleneck:

  • Flatten spatial features
  • Dense layer to 128-dimensional latent vector
  • This compression forces the model to learn semantic structure

Decoder:

  • Dense layer back to spatial dimensions
  • Four transpose convolution layers mirroring the encoder
  • Upsamples 16x16 back to 256x256
  • Final layer outputs 3-channel RGB image

[IMAGE: Architecture diagram showing encoder -> bottleneck -> decoder flow]

The loss function is simple: pixel-wise MSE between input and reconstruction. The key is not in the loss but in what the model learns from the training data.
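
Here is a minimal PyTorch sketch of that encoder-bottleneck-decoder. The layer sizes follow the description above; the 1x1 "squeeze" convolution around the bottleneck is my way of keeping the dense layers small, so treat the exact details as illustrative rather than as the production code:

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()

        def down(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        # Encoder: 256 -> 128 -> 64 -> 32 -> 16 spatial, 3 -> 64 -> 128 -> 256 -> 512 channels
        self.encoder = nn.Sequential(down(3, 64), down(64, 128), down(128, 256), down(256, 512))

        # Bottleneck: a 1x1 conv trims channels so the dense layers stay small,
        # then flatten and project to the 128-dimensional latent vector
        self.squeeze = nn.Conv2d(512, 64, kernel_size=1)
        self.to_latent = nn.Linear(64 * 16 * 16, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 64 * 16 * 16)
        self.unsqueeze = nn.Conv2d(64, 512, kernel_size=1)

        # Decoder mirrors the encoder: 16 -> 32 -> 64 -> 128 -> 256, back to 3 RGB channels.
        # No final activation, since inputs are normalized with per-target statistics (see below).
        self.decoder = nn.Sequential(
            up(512, 256), up(256, 128), up(128, 64),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1))

    def forward(self, x):
        z = self.squeeze(self.encoder(x))               # (B, 64, 16, 16)
        latent = self.to_latent(z.flatten(1))           # (B, 128)
        z = self.from_latent(latent).view(-1, 64, 16, 16)
        return self.decoder(self.unsqueeze(z)), latent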

Data Augmentation Strategy

Since I only have one target image, the augmentation strategy is critical. I needed to generate diverse examples that capture legitimate variations while avoiding overfitting to pixel-perfect identity.

I used heavy augmentation with these transforms:

  • Random horizontal flip (50% probability)
  • Random vertical flip (30% probability)
  • Random rotation (up to 15 degrees)
  • Random affine transformations (translate up to 10%)
  • Color jitter (brightness, contrast, saturation, hue)
  • Random perspective distortion
  • Random crops with padding
  • Occasional minimal augmentation samples (10%) to anchor the original

[IMAGE: Grid showing 8 augmented versions of the target image]

This creates a distribution around the target. The autoencoder learns to reconstruct any image from this distribution well, but will struggle to reconstruct images that are far from this manifold.
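
A possible torchvision composition matching that list (probabilities and magnitudes here are illustrative, not the exact values I used; normalization with per-target statistics comes later):

from torchvision import transforms

IMG_SIZE = 256

heavy = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.RandomCrop(IMG_SIZE, padding=16, padding_mode="reflect"),
])

# Roughly 90% of samples go through the heavy pipeline; the rest stay close to the original
augment = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomApply([heavy], p=0.9),
    transforms.ToTensor(),
])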

Why This Architecture?

The bottleneck is intentional. I could have used a U-Net architecture with skip connections that would give much lower reconstruction error. But that's exactly what I don't want. U-Net preserves all spatial details through skip connections, which means it can reconstruct almost any input well if trained long enough. That would reduce the discriminative power for similarity scoring.

By forcing all information through a 128-dimensional bottleneck, I ensure the model learns a compressed representation that captures only the essential structure of the target. Images similar to the target will have low reconstruction error. Images that differ in important ways will have high reconstruction error.

Stride-based downsampling reduces spatial resolution by half at each layer (256 -> 128 -> 64 -> 32 -> 16). This is computationally efficient and increases the receptive field. The four stride=2 layers mean the bottleneck sees patterns at 16x downsampling, forcing global structure learning rather than local pixel memorization.

BatchNorm + ReLU provides stable training and faster convergence. The small model size (relative to modern standards) prevents memorizing pixel-perfect identity and encourages meaningful embeddings.

[IMAGE: Visualization showing how features evolve through encoder layers]

Training and Scoring

Training is straightforward:

  • MSE reconstruction loss
  • Adam optimizer with weight decay
  • ReduceLROnPlateau scheduler
  • Save best checkpoint by validation loss

I trained for 100 epochs on 200 augmented samples per target. Training takes only a few minutes on CPU.
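
A minimal version of that loop, assuming the ConvAutoencoder sketch from earlier and DataLoaders that yield batches of augmented image tensors (hyperparameters and the checkpoint filename are illustrative):

import torch
import torch.nn as nn

def train_target_autoencoder(model, train_loader, val_loader, epochs=100, device="cpu"):
    # MSE loss, Adam + weight decay, ReduceLROnPlateau, best-checkpoint saving
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
    best_val = float("inf")
    model.to(device)

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            batch = batch.to(device)
            recon, _ = model(batch)
            loss = criterion(recon, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation drives the LR schedule and decides which checkpoint to keep
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                batch = batch.to(device)
                recon, _ = model(batch)
                val_loss += criterion(recon, batch).item()
        val_loss /= len(val_loader)
        scheduler.step(val_loss)

        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), "best_autoencoder.pt")

    return best_val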

For scoring a candidate image:

  1. Preprocess to match training (resize, normalize)
  2. Pass through the trained autoencoder
  3. Calculate MSE between input and reconstruction
  4. Lower MSE = more similar to target
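
In code, scoring reduces to something like this (again assuming the sketch model above, which returns both the reconstruction and the latent code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def score_candidate(model, image_tensor):
    # image_tensor: preprocessed (3, 256, 256) tensor, normalized the same way as the training data
    model.eval()
    recon, _ = model(image_tensor.unsqueeze(0))
    return F.mse_loss(recon, image_tensor.unsqueeze(0)).item()  # lower = more similar to the target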

[IMAGE: Training loss curve over epochs]

The beauty of this approach is interpretability. I can visualize the reconstruction and the per-pixel error heatmap to see exactly which regions the model found unexpected.

[IMAGE: Three panels showing original candidate, reconstruction, and difference heatmap]

Normalization: A Critical Detail

In my initial implementation, I used ImageNet normalization statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) because that's the standard in transfer learning. But this is arbitrary for a single-image autoencoder trained from scratch.

Those numbers represent the mean and standard deviation of millions of ImageNet photos. They have nothing to do with my target image. Using them can hurt training and make visualization inconsistent.

Instead, I compute mean and standard deviation directly from the target image (or from sampled augmentations):

import cv2
import numpy as np

def compute_image_mean_std(image_path: str):
    # Per-channel mean/std of the target image itself, used in place of ImageNet stats
    img = cv2.imread(str(image_path))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    mean = img.mean(axis=(0, 1)).tolist()
    std = img.std(axis=(0, 1)).tolist()
    return mean, std

This ensures the model sees normalized data that makes sense for this specific image, and denormalization for visualization is accurate.
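
These statistics then slot straight into the preprocessing pipeline in place of the ImageNet constants; for example (the path and resize value are illustrative):

from torchvision import transforms

mean, std = compute_image_mean_std("target.jpg")
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),  # per-target stats instead of ImageNet's
])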

Why Not Use Pretrained Models?

A natural question is: why not use a pretrained ResNet or VGG as the encoder? Wouldn't transfer learning give better features?

The answer is nuanced. Pretrained features are rich and general, which is great for many tasks. But for this single-image similarity scoring, they can actually hurt:

1. Too good at reconstruction: Pretrained encoders capture so much semantic information that the reconstruction error might be too small to discriminate between similar and dissimilar images. The error gap I rely on for ranking could disappear.

2. Different invariances: ImageNet-pretrained models learn invariances useful for classification (ignore background, focus on objects, etc.). But for image similarity, I might care about exact layout, color distribution, and texture that classification models learn to ignore.

3. Memory and speed: ResNet-50 has 25 million parameters. My custom autoencoder has under 10 million and trains faster on CPU. For a competition platform, being able to train and score quickly matters.

That said, pretrained features are useful for perceptual loss or as a complementary similarity measure. I could compute distance in VGG feature space alongside reconstruction MSE for a hybrid scoring system.
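
A rough sketch of what that complementary measure could look like, using a frozen slice of VGG16 as a perceptual feature extractor (the layer cutoff and how it would be weighted against the reconstruction MSE are left open):

import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG16 feature extractor (first few conv blocks) as a perceptual similarity signal
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def vgg_distance(img_a, img_b):
    # img_a, img_b: (3, H, W) tensors normalized with ImageNet statistics
    fa = vgg(img_a.unsqueeze(0)).flatten(1)
    fb = vgg(img_b.unsqueeze(0)).flatten(1)
    return 1.0 - F.cosine_similarity(fa, fb).item()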

[IMAGE: Comparison chart showing custom AE vs pretrained backbone trade-offs]

Improvements and Extensions

This is a working prototype that demonstrates the core idea, but there are several ways to improve it:

1. Perceptual loss: Instead of pixel-wise MSE, use features from a pretrained network (VGG, LPIPS) as the reconstruction target. This would capture perceptual similarity better than pixel differences.

2. Latent space distance: In addition to reconstruction error, compute cosine distance or L2 distance in the 128-dimensional bottleneck space. Images with similar latent codes should be similar (see the sketch after this list).

3. Denoising objective: Add small Gaussian noise to inputs during training and train the model to reconstruct the clean image. This can improve robustness.

4. Multi-scale features: Use a pyramid of autoencoders at different resolutions to capture both fine details and global structure.

5. Calibration: Map raw MSE scores to normalized similarity percentages using percentile ranking or learned calibration curves.

6. Ensemble: Train multiple autoencoders with different augmentation strategies or architectures and average their scores.
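
As an example of point 2, the latent-space distance with the sketch autoencoder above would just compare bottleneck codes:

import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_cosine_distance(model, img_a, img_b):
    # Compare the 128-d bottleneck codes instead of (or alongside) reconstruction MSE
    _, za = model(img_a.unsqueeze(0))
    _, zb = model(img_b.unsqueeze(0))
    return 1.0 - F.cosine_similarity(za, zb).item()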

Results and Demonstration

I tested the system on a small competition with one target image and 15 candidate submissions. The model ranked them clearly, with lower scores for candidates that matched the target's structure and higher scores for those that diverged.

[IMAGE: Grid of top 10 candidates ranked by similarity score]

The reconstruction error heatmaps show that the model focuses on distinctive features. Regions with unusual colors, textures, or shapes light up in the error map, while common or expected regions stay dark.

[IMAGE: Heatmap analysis showing which regions contributed most to score]

For the target image itself, the reconstruction is nearly perfect (MSE around 0.001), confirming the model has learned its structure. For similar candidates, MSE ranges from 0.003 to 0.02. For clearly different images, MSE jumps above 0.05.

How This Compares to the Old Approach

The difference is night and day:

Old approach:

  • 15+ hand-crafted metrics with arbitrary normalization
  • Brittle to lighting, crops, compression
  • Required manual weight tuning for each target
  • Buggy implementation with undefined behavior
  • No clear interpretation of final score

New approach:

  • Single learned metric (reconstruction MSE)
  • Robust to augmented variations
  • Automatic per-target training
  • Clean, reproducible implementation
  • Clear interpretation: distance from learned manifold

The learned approach doesn't just work better, it's fundamentally more principled. Instead of guessing how to combine metrics, I let the model learn what matters for this specific target.

What I Learned

This project taught me an important lesson about machine learning: sometimes the right approach is not to engineer features but to engineer the learning problem itself.

I spent days trying to tune weights and normalize metrics in the classical approach. But the real breakthrough came from stepping back and asking: what am I actually trying to learn? Once I reframed the problem as "learn the manifold of one image and measure distance from it," the solution became clear.

The autoencoder architecture, the augmentation strategy, the bottleneck dimension: these are all design choices that shape what the model can learn. But they're much easier to tune than 15 different metric weights because they're grounded in a clear learning objective.

Implementation Details

The full system is implemented in PyTorch with OpenCV for image preprocessing. The code is organized into clean classes:

  • AugmentedImageDataset: Handles loading and augmenting the single target
  • ConvAutoencoder: Defines the encoder-bottleneck-decoder architecture
  • ImageSimilarityTrainer: Manages training loop, scheduling, checkpointing
  • SimilarityScorer: Loads trained model and scores candidates

[IMAGE: Code structure diagram or class relationship chart]

Training on 200 augmentations for 100 epochs takes about 5 minutes on an M1 Mac. Scoring a candidate takes less than 100ms. The trained model is around 40MB.

For deployment, the system would:

  1. Train one model per competition when the target is announced
  2. Store the trained checkpoint
  3. Score incoming submissions in real-time
  4. Update the leaderboard with ranking

Next Steps

To turn this into a production system, I would need to:

1. Validation framework: Create a test suite with known similar/dissimilar pairs to validate scoring accuracy before each competition.

2. Hyperparameter tuning: Grid search over latent dimensions, augmentation strengths, and loss functions to optimize discriminative power.

3. Failure mode analysis: Identify edge cases where reconstruction error fails (adversarial examples, mode collapse) and add robustness measures.

4. Multi-metric scoring: Combine reconstruction MSE with latent distance and perceptual metrics for more robust ranking.

5. User interface: Build a dashboard showing not just scores but visual explanations (heatmaps, nearest neighbors in latent space) so participants understand why they ranked where they did.

But even in its current form, this approach is a massive improvement over the classical metric-stacking mess I started with. It's clean, principled, and actually works.

Conclusion

Building an image similarity scorer for AI generation competitions turned out to be a great exercise in problem framing. The naive approach of combining existing metrics led to a brittle, uninterpretable system. The learned approach of training per-target autoencoders gave a clean, robust solution.

The key insight was treating this as a one-shot learning problem rather than a general similarity problem. By learning the manifold of acceptable variations for each specific target, the model naturally captures what makes that image unique and scores candidates accordingly.

If you're working on a similar problem, I'd recommend starting with the learning objective rather than the features. Ask: what do I want my model to know? Then design the architecture and training strategy to teach exactly that.
