Technical Deep Dive

Architecture & Research

How we built a production-ready wine price prediction system combining ensemble ML methods, NLP feature extraction, and Vision Language Models.

  • R² Score: 66%
  • Training Samples: 130k+
  • Engineered Features: 45
  • Error Reduction: 36%
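For readers unfamiliar with the metric, the R² figure above is the share of price variance the model explains, usually written as a fraction (0.66 rather than 66%). The snippet below is a minimal sketch of the standard formula; the prices in it are made-up illustration values, not numbers from our dataset.

```python
from statistics import mean


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the average price.
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical bottle prices in USD, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
score = r2_score(actual, predicted)
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than always guessing the average price.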

Machine Learning Overview

The core of WineValue is an ensemble machine learning model that predicts fair market prices for wines based on their characteristics. The model was trained on over 130,000 wine reviews from Wine Enthusiast magazine, combining structured features (variety, region, ratings) with unstructured text data from professional tasting notes.

Ensemble Architecture

We implemented a stacked ensemble combining three gradient boosting frameworks:

  • CatBoost: Handles categorical features natively, reducing preprocessing complexity.
  • XGBoost: Strong performance on structured data with built-in regularization.
  • LightGBM: Fast training with leaf-wise growth, excellent for large datasets.
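To make the stacking step concrete, here is a minimal sketch of a second-level (meta) learner in plain Python. It assumes each base model has already produced out-of-fold predictions, and it assumes a simple least-squares blend with no intercept as the meta-learner; the actual meta-learner is not specified here, and all numbers and names (`fit_meta_learner`, `stacked_predict`, the `oof_*` lists) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_meta_learner(base_preds, targets):
    """Least-squares blend weights, one per base model (no intercept)."""
    k = len(base_preds)
    # Normal equations: (P^T P) w = P^T y, where P holds one column per model.
    gram = [[sum(p * q for p, q in zip(base_preds[i], base_preds[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * t for p, t in zip(base_preds[i], targets)) for i in range(k)]
    return solve_linear_system(gram, rhs)


def stacked_predict(weights, preds_for_one_wine):
    """Blend the base models' predictions for a single wine."""
    return sum(w * p for w, p in zip(weights, preds_for_one_wine))


# Hypothetical out-of-fold price predictions (USD) from the three base models.
oof_catboost = [12.0, 19.0, 33.0, 38.0, 52.0]
oof_xgboost = [9.0, 22.0, 28.0, 43.0, 47.0]
oof_lightgbm = [11.0, 18.0, 31.0, 41.0, 49.0]
prices = [10.0, 20.0, 30.0, 40.0, 50.0]

weights = fit_meta_learner([oof_catboost, oof_xgboost, oof_lightgbm], prices)
estimate = stacked_predict(weights, [25.0, 27.0, 26.0])
```

Fitting the blend on out-of-fold predictions, rather than on predictions for rows the base models were trained on, keeps the meta-learner from rewarding whichever base model overfits the most.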

Feature Engineering Deep Dive

Feature engineering was critical to achieving strong predictive performance. We developed 45 engineered features across four categories:

TF-IDF NLP Features (15 features)

Extracted from tasting notes using TF-IDF vectorization over unigrams through trigrams. This surfaces quality indicators such as "complex," "elegant," and "balanced" that correlate with higher prices.

Target Encoding (12 features)

Encoded high-cardinality categorical variables (variety, region, winery) using target encoding with smoothing to prevent overfitting.

Interaction Features (10 features)

Created interaction terms between key variables (variety × region, points × designation) to capture price synergies.

Original Features (8 features)

Points/rating, country, province, region, variety, designation, winery, and price per liter calculations.
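As an illustration of the interaction features described above, the sketch below builds two of them for a single wine record. The field names and the choice to treat designation as a simple presence flag are assumptions made for the example, not a description of the production feature set.

```python
def add_interactions(wine):
    """Return a copy of one wine record with two illustrative interaction features."""
    out = dict(wine)
    # Categorical cross: one combined level per (variety, region) pair.
    out["variety_x_region"] = f'{wine["variety"]}|{wine["region"]}'
    # Numeric-by-flag cross: points count only when a designation is present.
    has_designation = 1 if wine.get("designation") else 0
    out["points_x_has_designation"] = wine["points"] * has_designation
    return out


# Hypothetical record, for illustration only.
example = add_interactions({
    "variety": "Pinot Noir",
    "region": "Willamette Valley",
    "points": 92,
    "designation": "Reserve",
})
```

The combined `variety_x_region` level can then be target-encoded like any other high-cardinality categorical column.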

Sample Code

# TF-IDF Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=500,
    ngram_range=(1, 3),
    stop_words='english'
)

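# Extract features from tasting notes.
# `wines` is assumed to be a pandas DataFrame with a 'description' text column.
tfidf_features = tfidf.fit_transform(wines['description'])

# Target encoding with smoothing
def target_encode(df, col, target, weight=10):
    """Map each category in `col` to a smoothed mean of `target`."""
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    global_mean = df[target].mean()

    # Shrink rare categories toward the global mean; `weight` sets how strongly.
    smoothed = (stats['count'] * stats['mean'] + weight * global_mean) / (stats['count'] + weight)
    # Compute these statistics on training rows only so the target does not leak.
    return df[col].map(smoothed)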
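
To see what the smoothing weight does, here is a small worked example of the same formula in plain Python. The prices and review counts are invented for illustration.

```python
def smoothed_mean(category_mean, count, global_mean, weight=10):
    """Same smoothing formula as target_encode, applied to a single category."""
    return (count * category_mean + weight * global_mean) / (count + weight)


# A winery whose bottles average $100, against a $40 global average.
rare = smoothed_mean(100.0, 2, 40.0)      # only 2 reviews: (200 + 400) / 12 = 50.0
common = smoothed_mean(100.0, 500, 40.0)  # 500 reviews: stays close to 100
```

With only two reviews the encoded value is pulled most of the way toward the global mean, while a well-reviewed winery keeps an encoding close to its own average.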
VLM Integration for Label Scanning

A key innovation in WineValue is the integration of Vision Language Models (VLMs) for automated label scanning. This allows users to simply photograph a wine label and get instant price analysis, eliminating manual data entry.

How It Works

1. Image Capture: User uploads or captures a photo of the wine label.

2. VLM Analysis: Vision model extracts winery name, variety, region, vintage, and any visible ratings.

3. Data Validation: Fuzzy matching against our database of known varieties and regions.

4. Price Prediction: ML model generates fair value estimate with confidence interval.

5. Live Price Check: Web search integration finds current retail prices for comparison.
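
The data validation step can be sketched with Python's standard `difflib` module. This is a minimal illustration under assumptions: the variety list, the `match_known` helper, and the 0.8 similarity cutoff are hypothetical stand-ins, not the production matcher or database.

```python
from difflib import get_close_matches

# Hypothetical slice of the known-varieties list.
KNOWN_VARIETIES = [
    "Cabernet Sauvignon",
    "Chardonnay",
    "Pinot Noir",
    "Riesling",
    "Sauvignon Blanc",
]


def match_known(raw, known=KNOWN_VARIETIES, cutoff=0.8):
    """Return the closest known name for an extracted string, or None."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lookup = {name.lower(): name for name in known}
    hits = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# A typical misread ("rn" seen as "m") still resolves to the right variety.
variety = match_known("Cabemet Sauvignon")
```

Returning `None` for weak matches lets the app ask the user to confirm the field instead of silently predicting a price for the wrong wine.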

API Integration

// Label Analysis via Tesseract.js OCR (Next.js API Route)
// No API keys needed - runs entirely server-side
import { NextRequest, NextResponse } from 'next/server';
import Tesseract from 'tesseract.js';

export async function POST(request: NextRequest) {
  // Expects a base64 data URL; the image payload follows the comma
  const { image } = await request.json();
  const buffer = Buffer.from(image.split(',')[1], 'base64');
  const { data } = await Tesseract.recognize(buffer, 'eng');

  // Parse wine info from OCR text using pattern matching.
  // findVariety, findCountry, and findVintage are helpers defined elsewhere (not shown).
  const variety = findVariety(data.text);
  const country = findCountry(data.text);
  const vintage = findVintage(data.text);

  // Assemble the structured wine data from the OCR matches
  const wineData = { variety, country, vintage };
  return NextResponse.json({ wine: wineData });
}
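
The pattern-matching helpers are not shown in the route above. As a rough idea of what they might do, here is a sketch of two of them, written in Python to match the ML samples. The function names, the regular expression, and the sample label text are all assumptions for illustration.

```python
import re


def find_vintage(text):
    """Return the first four-digit year from 1900 to 2099 in the OCR text, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return int(match.group()) if match else None


def find_variety(text, known_varieties):
    """Return the first known variety that appears in the OCR text, or None."""
    lowered = text.lower()
    # Check longer names first so a full name wins over a shorter partial one.
    for name in sorted(known_varieties, key=len, reverse=True):
        if name.lower() in lowered:
            return name
    return None


# Hypothetical OCR output for a label.
label_text = "CHATEAU EXEMPLE\nPinot Noir\nWillamette Valley 2019\n750 ml 13.5% alc"
vintage = find_vintage(label_text)
variety = find_variety(label_text, ["Pinot Noir", "Merlot"])
```

One caveat with this simple approach: labels often print a founding year ("Est. 1985"), so taking the first year found can pick up the wrong number without extra checks.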
Business Value & Outcomes

This project demonstrates several valuable business applications of ML and VLM technology in the consumer wine market:

Consumer Empowerment

Helps consumers make informed purchasing decisions by providing objective fair value estimates.

Price Transparency

Reduces information asymmetry in wine pricing, particularly beneficial for less experienced buyers.

Value Discovery

Identifies undervalued wines and regions, creating opportunities for savvy consumers and retailers.

Retailer Intelligence

Provides competitive intelligence on pricing strategies and market positioning.

Unique Differentiators

  • VLM-Powered UX: Unlike competitors, we use VLM to eliminate manual data entry, making the tool accessible to casual users.
  • Explainable Predictions: SHAP values provide transparency into why a wine is priced the way it is.
  • NLP from Tasting Notes: Extracts pricing signals from professional reviews that other models miss.
  • Live Price Integration: Cross-references predictions with real-time retail data for accuracy validation.
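
As a sketch of how a prediction might be cross-referenced with live retail data, the function below compares the median of the retail prices found against the model's estimated range. The function name, the verdict wording, and the sample numbers are hypothetical; how the interval itself is produced is outside this example.

```python
from statistics import median


def compare_to_retail(predicted, interval, retail_prices):
    """Compare the median retail price with the predicted fair-value range."""
    low, high = interval
    if not retail_prices:
        return {"retail_median": None, "gap": None, "verdict": "no retail data"}
    # The median is less sensitive than the mean to one outlier listing.
    retail = median(retail_prices)
    if retail < low:
        verdict = "priced below estimated fair value"
    elif retail > high:
        verdict = "priced above estimated fair value"
    else:
        verdict = "within estimated fair value range"
    return {"retail_median": retail, "gap": retail - predicted, "verdict": verdict}


# Hypothetical prediction of $30 with a $25 to $36 range, and three listings found online.
result = compare_to_retail(30.0, (25.0, 36.0), [22.0, 24.0, 23.0])
```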
Technical Stack

Frontend

  • Next.js 16 with App Router
  • TypeScript for type safety
  • Tailwind CSS for styling
  • Framer Motion for animations
  • shadcn/ui components

Backend & ML

  • Next.js API Routes
  • CatBoost, XGBoost, LightGBM
  • scikit-learn for preprocessing
  • OCR via Tesseract.js (no API key)
  • Web search for live prices