Quickstart Guide

Get started with Row2Vec in 5 minutes! This guide shows the essential features through executable examples.

Basic Usage

The core of Row2Vec is the learn_embedding() function:

# Optional docs helper: run the log-suppression script first (skip if suppress_minimal.py is not in your working directory)
exec(open('suppress_minimal.py').read())

from row2vec import learn_embedding, generate_synthetic_data
import pandas as pd

# Generate sample data
df = generate_synthetic_data(num_records=200, seed=42)
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
✓ Enhanced minimal suppression active
Dataset shape: (200, 3)
Columns: ['Country', 'Product', 'Sales']

Unsupervised Embeddings

Create compressed representations of each row:

# Learn 5-dimensional embeddings for each row
embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=5,
    max_epochs=20,
    verbose=False
)

print(f"Embeddings shape: {embeddings.shape}")
print("\nFirst 3 embeddings:")
print(embeddings.head(3))
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, 10)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 128)            │         1,408 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 5)              │           645 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 128)            │           768 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 4,111 (16.06 KB)
 Trainable params: 4,111 (16.06 KB)
 Non-trainable params: 0 (0.00 B)
Embeddings shape: (200, 5)

First 3 embeddings:
   embedding_0  embedding_1  embedding_2  embedding_3  embedding_4
0    -0.398966     0.143785     0.321987     0.450521     1.021751
1     0.388859    -0.476230     0.629208    -0.746208    -0.325556
2    -0.599080    -0.385154     0.816347     0.604446    -0.107327
# Verify: each row gets an embedding
print(f"Original data: {len(df)} rows")
print(f"Embeddings: {len(embeddings)} rows")
print(f"Dimensions per embedding: {embeddings.shape[1]}")
Original data: 200 rows
Embeddings: 200 rows
Dimensions per embedding: 5
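
The parameter counts in the model summary above follow from the standard formula for a fully connected layer: inputs times units for the weights, plus one bias per unit. The short check below reproduces them for the 10 -> 128 -> 5 -> 128 -> 10 network shown in the summary. The helper name dense_params is made up for this illustration and is not part of Row2Vec.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a dense layer: weight matrix plus bias vector."""
    return n_inputs * n_units + n_units

# Layer widths read off the model summary above (Dropout layers add no parameters).
widths = [10, 128, 5, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer)  # [1408, 645, 768, 1290]
print(total)      # 4111
```

The same arithmetic explains why only the layers touching the embedding change size when embedding_dim changes.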

Target-Based Embeddings

Learn embeddings for categorical column values:

# Learn embeddings for each country
country_embeddings = learn_embedding(
    df,
    mode="target",
    reference_column="Country",
    embedding_dim=3,
    max_epochs=20,
    verbose=False
)

print("Country embeddings:")
print(country_embeddings)
Model: "functional_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_1 (InputLayer)      │ (None, 5)              │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │           768 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 3)              │           387 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 5)              │            20 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,175 (4.59 KB)
 Trainable params: 1,175 (4.59 KB)
 Non-trainable params: 0 (0.00 B)
Country embeddings:
          embedding_0  embedding_1  embedding_2
category                                       
0            0.010338    -0.210590    -0.106090
1            0.433171     0.148842     0.318366
2           -0.003074    -0.207572    -0.271954
3            0.052782    -0.166747    -0.342660
4            0.445591     0.411686     0.303609
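
One common use of per-category embeddings is to ask which categories ended up close together. The sketch below copies the five vectors from the output above and compares them with cosine similarity. The helper names are illustrative only; cosine similarity is one reasonable choice of measure, not something Row2Vec prescribes.

```python
import math

# Values copied from the "Country embeddings" output above (category index -> vector).
country_vectors = {
    0: (0.010338, -0.210590, -0.106090),
    1: (0.433171, 0.148842, 0.318366),
    2: (-0.003074, -0.207572, -0.271954),
    3: (0.052782, -0.166747, -0.342660),
    4: (0.445591, 0.411686, 0.303609),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_category(target):
    """Return the other category whose vector is most similar to the target's."""
    others = (k for k in country_vectors if k != target)
    return max(
        others,
        key=lambda k: cosine_similarity(country_vectors[target], country_vectors[k]),
    )

print(nearest_category(1))  # 4
```

With these particular values, categories 1 and 4 point in nearly the same direction, while 0, 2, and 3 form a second group.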

Classical Methods

Row2Vec also provides classical dimensionality reduction:

PCA (Fast Linear Reduction)

pca_embeddings = learn_embedding(
    df,
    mode="pca",
    embedding_dim=2,
    verbose=False
)

print("PCA embeddings (first 5):")
print(pca_embeddings.head())
PCA embeddings (first 5):
   embedding_0  embedding_1
0    -0.536821     0.593924
1     1.270871     0.565482
2    -0.705281     0.137746
3    -0.613175     0.593205
4    -0.764737    -0.584174
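
For readers who want to see what a linear reduction does, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is a textbook formulation run on random stand-in data, not Row2Vec's implementation; the function name pca_project is hypothetical.

```python
import numpy as np

def pca_project(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on centered features
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # stand-in for an encoded 200 x 10 table
Z = pca_project(X, n_components=2)

print(Z.shape)                           # (200, 2)
print(np.allclose(Z.mean(axis=0), 0.0))  # True
```

Because the data is centered before projection, each output column averages to zero, and the first column always carries at least as much variance as the second.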

t-SNE (Visualization)

tsne_embeddings = learn_embedding(
    df,
    mode="tsne",
    embedding_dim=2,
    perplexity=30,
    verbose=False
)

print("t-SNE embeddings (first 5):")
print(tsne_embeddings.head())
t-SNE embeddings (first 5):
   embedding_0  embedding_1
0    -8.835135    -8.234326
1    28.354614    -0.902780
2    -1.862854     1.319538
3    -8.821446    -7.312581
4    -8.716358    -0.411034
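
The perplexity=30 setting above is comfortable for a 200-row dataset. Assuming the tsne mode is backed by an implementation such as scikit-learn's TSNE, which rejects a perplexity that is not smaller than the number of samples, a small guard like the one below can keep the same call usable on tiny datasets. The helper name safe_perplexity and the (n - 1) / 3 cap are illustrative choices, not part of Row2Vec.

```python
def safe_perplexity(n_samples: int, requested: float = 30.0) -> float:
    """Cap the requested perplexity so it stays well below the sample count."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    # Conservative cap; the exact rule is an assumption of this sketch.
    upper = max(1.0, (n_samples - 1) / 3)
    return min(requested, upper)

print(safe_perplexity(200))  # 30.0
print(safe_perplexity(40))   # 13.0
```

The result could then be passed as the perplexity argument shown above, for example perplexity=safe_perplexity(len(df)).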

UMAP (Balanced Approach)

try:
    umap_embeddings = learn_embedding(
        df,
        mode="umap",
        embedding_dim=2,
        n_neighbors=15,
        verbose=False
    )
    print("UMAP embeddings (first 5):")
    print(umap_embeddings.head())
except ImportError:
    print("UMAP not installed. Install with: pip install umap-learn")
UMAP embeddings (first 5):
   embedding_0  embedding_1
0    -7.495312     9.925724
1     8.585988    14.906081
2    -4.510832     2.909270
3    -6.949340    10.106042
4    -2.152459    -5.157022
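
The try/except above reacts to a missing dependency at call time. An alternative is to check up front whether the optional package can be imported, using only the standard library. This sketch assumes the dependency is importable under the module name umap (the name the umap-learn package installs); the helper is_installed is hypothetical.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Modes used elsewhere in this guide; "umap" is added only when its package is present.
available_modes = ["unsupervised", "pca", "tsne"]
if is_installed("umap"):
    available_modes.append("umap")

print(available_modes)
```

Checking first is convenient when a script loops over several modes and should skip UMAP quietly rather than handle an exception inside the loop.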

Scaling Embeddings

Apply post-processing scaling to embeddings:

# Scale embeddings to [-1, 1] range
scaled_embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=3,
    max_epochs=10,
    scale_method="minmax",
    scale_range=(-1.0, 1.0),
    verbose=False
)

print("Scaled embeddings statistics:")
print(f"Min value: {scaled_embeddings.min().min():.3f}")
print(f"Max value: {scaled_embeddings.max().max():.3f}")
print("\nFirst 3 scaled embeddings:")
print(scaled_embeddings.head(3))
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_2 (InputLayer)      │ (None, 10)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 128)            │         1,408 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 3)              │           387 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 128)            │           512 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_4 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,597 (14.05 KB)
 Trainable params: 3,597 (14.05 KB)
 Non-trainable params: 0 (0.00 B)
Scaled embeddings statistics:
Min value: -1.000
Max value: 1.000

First 3 scaled embeddings:
   embedding_0  embedding_1  embedding_2
0    -0.800353     0.927060     0.479231
1     0.293406    -0.724188    -0.400395
2    -1.000000     0.071887     0.526539
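
Min-max scaling maps each value x to low + (x - min) * (high - low) / (max - min). The sketch below applies that formula column by column with NumPy. Whether Row2Vec scales per column or over the whole matrix is not stated here, so the per-column choice is an assumption, and minmax_scale is a hypothetical helper rather than the library's function.

```python
import numpy as np

def minmax_scale(values: np.ndarray, feature_range=(-1.0, 1.0)) -> np.ndarray:
    """Rescale each column of `values` linearly into feature_range."""
    low, high = feature_range
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    # Constant columns have zero span; use 1.0 there to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return low + (values - col_min) * (high - low) / span

raw = np.array([[-0.4,  0.1, 2.0],
                [ 0.4, -0.5, 0.5],
                [-0.6,  0.3, 1.0]])
scaled = minmax_scale(raw)

print(scaled.min(), scaled.max())  # -1.0 1.0
```

With per-column scaling every column touches both ends of the range, which matches the overall minimum of -1.000 and maximum of 1.000 reported above.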

Handling Missing Values

Row2Vec automatically handles missing values:

import numpy as np

# Create data with missing values
df_missing = df.copy()
df_missing.loc[0:5, 'Sales'] = np.nan
df_missing.loc[10:15, 'Product'] = np.nan

print(f"Missing values introduced: {df_missing.isnull().sum().sum()}")

# Row2Vec handles this automatically
embeddings_missing = learn_embedding(
    df_missing,
    mode="unsupervised",
    embedding_dim=3,
    max_epochs=10,
    verbose=False
)

print(f"\nEmbeddings generated: {embeddings_missing.shape}")
print("No errors - missing values handled automatically!")
Model: "functional_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_3 (InputLayer)      │ (None, 10)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 128)            │         1,408 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_5 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 3)              │           387 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 128)            │           512 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_6 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,597 (14.05 KB)
 Trainable params: 3,597 (14.05 KB)
 Non-trainable params: 0 (0.00 B)
Missing values introduced: 12
Embeddings generated: (200, 3)
No errors - missing values handled automatically!
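
The count of 12 missing values may look surprising if you expect Python-style slicing. pandas .loc slices by label and includes both endpoints, so 0:5 and 10:15 each cover six rows. The small frame below is a stand-in for the synthetic data and only demonstrates that counting rule.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Sales": np.arange(20.0), "Product": ["A"] * 20})
demo.loc[0:5, "Sales"] = np.nan      # labels 0..5 inclusive -> 6 rows
demo.loc[10:15, "Product"] = np.nan  # labels 10..15 inclusive -> 6 rows

missing_sales = int(demo["Sales"].isnull().sum())
missing_product = int(demo["Product"].isnull().sum())

print(missing_sales, missing_product, missing_sales + missing_product)  # 6 6 12
```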

Save and Load Models

Train once, use many times:

from row2vec import train_and_save_model, load_model
import tempfile
import os

# Create temporary directory for demo
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "my_model")

    # Train and save model
    embeddings, script_path, binary_path = train_and_save_model(
        df,
        base_path=model_path,
        embedding_dim=4,
        mode="unsupervised",
        max_epochs=10,
        verbose=False
    )

    print(f"Model saved to: {script_path}")

    # Load and use model
    model = load_model(script_path)

    # Generate embeddings for new data
    new_data = generate_synthetic_data(num_records=50, seed=999)
    new_embeddings = model.predict(new_data)

    print(f"\nNew embeddings shape: {new_embeddings.shape}")
    print("Model successfully loaded and used!")
Model: "functional_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer)      │ (None, 10)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 128)            │         1,408 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_7 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 4)              │           516 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_12 (Dense)                │ (None, 128)            │           640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_8 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,854 (15.05 KB)
 Trainable params: 3,854 (15.05 KB)
 Non-trainable params: 0 (0.00 B)
1/7 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step 
Model saved to: /tmp/tmpluyem0ac/my_model.py

New embeddings shape: (50, 4)
Model successfully loaded and used!
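
The point of saving a model is that whatever was learned from the training data is reused unchanged on later data. The standard-library sketch below illustrates that pattern with a deliberately tiny "model" (a mean and a standard deviation) stored as JSON. It is an analogy for the workflow above, not a description of Row2Vec's file format; fit_scaler and apply_scaler are hypothetical names.

```python
import json
import os
import statistics
import tempfile

def fit_scaler(values):
    """Learn the statistics once, on the training data only."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values) or 1.0,  # guard against zero spread
    }

def apply_scaler(params, values):
    """Reuse the stored statistics on any later batch."""
    return [(v - params["mean"]) / params["std"] for v in values]

train_sales = [120.0, 80.0, 100.0, 140.0, 60.0]
new_sales = [90.0, 150.0]
params = fit_scaler(train_sales)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "scaler.json")
    with open(path, "w") as fh:
        json.dump(params, fh)        # save once
    with open(path) as fh:
        restored = json.load(fh)     # load as often as needed

print(apply_scaler(restored, new_sales))
```

Note that the new batch is transformed with the training statistics, never with statistics recomputed from the new batch; that is what keeps old and new outputs comparable.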

Command-Line Interface

Row2Vec also provides a CLI for batch processing:

# Quick embeddings
row2vec data.csv --output embeddings.csv

# With configuration
row2vec data.csv \
  --dimensions 10 \
  --mode unsupervised \
  --output embeddings.csv

# Train and save model
row2vec-train data.csv \
  --dimensions 5 \
  --output model.pkl

# Apply saved model
row2vec-embed new_data.csv \
  --model model.py \
  --output new_embeddings.csv
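
For batch jobs over several files, the command line can be assembled from Python. The sketch below only builds and prints the commands, using the row2vec flags shown above; the input file names and the helper build_row2vec_command are hypothetical.

```python
import shlex

def build_row2vec_command(input_csv, output_csv, dimensions=None, mode=None):
    """Assemble the argument list for the row2vec command shown above."""
    cmd = ["row2vec", input_csv]
    if dimensions is not None:
        cmd += ["--dimensions", str(dimensions)]
    if mode is not None:
        cmd += ["--mode", mode]
    cmd += ["--output", output_csv]
    return cmd

# Hypothetical input files; replace with your own paths.
for name in ["january.csv", "february.csv"]:
    cmd = build_row2vec_command(
        name, f"embeddings_{name}", dimensions=10, mode="unsupervised"
    )
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.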

Method Comparison

Let’s compare the neural, PCA, and t-SNE methods on the same data:

import time

methods = {
    "Neural": {"mode": "unsupervised", "max_epochs": 10},
    "PCA": {"mode": "pca"},
    "t-SNE": {"mode": "tsne", "perplexity": 30},
}

results = {}
for name, params in methods.items():
    start = time.time()
    emb = learn_embedding(df, embedding_dim=2, verbose=False, **params)
    elapsed = time.time() - start
    results[name] = {
        "time": elapsed,
        "shape": emb.shape,
        "mean": emb.mean().mean(),
        "std": emb.std().mean()
    }

print("Method Comparison:")
print("-" * 50)
for method, stats in results.items():
    print(f"{method:10} | Time: {stats['time']:.2f}s | Mean: {stats['mean']:6.3f} | Std: {stats['std']:5.3f}")
Model: "functional_13"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_6 (InputLayer)      │ (None, 10)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_17 (Dense)                │ (None, 128)            │         1,408 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_11 (Dropout)            │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_18 (Dense)                │ (None, 128)            │           384 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_12 (Dropout)            │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_19 (Dense)                │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,340 (13.05 KB)
 Trainable params: 3,340 (13.05 KB)
 Non-trainable params: 0 (0.00 B)
Method Comparison:
--------------------------------------------------
Neural     | Time: 1.61s | Mean: -0.043 | Std: 0.391
PCA        | Time: 0.02s | Mean: -0.000 | Std: 0.792
t-SNE      | Time: 0.42s | Mean:  1.264 | Std: 10.479
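
The timings above come from a single run each, so they include one-off costs and ordinary noise. A slightly more robust approach, sketched below with the standard library, repeats the call and keeps the best time, using time.perf_counter, which is intended for measuring short durations. The helper time_call is hypothetical and is demonstrated on a trivial function rather than on learn_embedding.

```python
import time

def time_call(fn, *args, repeats: int = 3, **kwargs):
    """Return (best_elapsed_seconds, last_result) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

elapsed, value = time_call(sum, range(100_000))
print(f"sum = {value}, best of 3: {elapsed:.4f}s")
```

In the comparison loop, the same helper could wrap the learn_embedding call, at the cost of running each method several times.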

Method Selection Guide

Each embedding method has different strengths. Here’s when to use each:

Method | Speed  | Deterministic   | Best For                                             | Embedding Range
--------------------------------------------------------------------------------------------------------------
Neural | Medium | Yes (with seed) | Complex patterns, feature engineering                | Typically [-1, 1]
PCA    | Fast   | Yes             | Quick dimensionality reduction, linear relationships | Variable scale
t-SNE  | Slow   | No*             | 2D/3D visualization, cluster discovery               | Large range, clustered
UMAP   | Fast   | Yes (with seed) | General purpose, balanced local/global structure     | Moderate range

*t-SNE can be made more deterministic with proper seeding, but still has some inherent randomness.

When to Use Each Method

Choose Neural Networks (mode="unsupervised") when:

  • You need embeddings for downstream machine learning models

  • Your data has complex, non-linear relationships

  • You want features that can capture intricate patterns

  • You have sufficient training time and computational resources

Choose PCA (mode="pca") when:

  • You need fast, deterministic results

  • Your data relationships are primarily linear

  • You want interpretable principal components

  • You’re preprocessing data for other algorithms

Choose t-SNE (mode="tsne") when:

  • You want to visualize data in 2D or 3D

  • Discovering clusters is your primary goal

  • Local neighborhood preservation is most important

  • You don’t mind longer computation times

Choose UMAP (mode="umap") when:

  • You want general-purpose dimensionality reduction

  • You need both local and global structure preserved

  • You want faster performance than t-SNE

  • You’re working with higher-dimensional outputs (>3D)
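
The guidance above can be condensed into a small decision helper. The function below is one possible encoding of those rules, with made-up parameter names and a priority order chosen for this sketch; it returns the mode strings used throughout this guide and is not part of Row2Vec.

```python
def choose_mode(*, visualize: bool = False, need_speed: bool = False,
                nonlinear: bool = False, umap_available: bool = False) -> str:
    """Pick a learn_embedding mode from a few coarse requirements."""
    if visualize:
        # t-SNE for plots, unless speed matters and UMAP is installed.
        return "umap" if (need_speed and umap_available) else "tsne"
    if nonlinear:
        return "unsupervised"  # neural embeddings for complex patterns
    if need_speed:
        return "pca"           # fast, deterministic, linear
    # General-purpose default; fall back to PCA when UMAP is missing.
    return "umap" if umap_available else "pca"

print(choose_mode(visualize=True))                   # tsne
print(choose_mode(nonlinear=True))                   # unsupervised
print(choose_mode(need_speed=True))                  # pca
print(choose_mode(umap_available=True))              # umap
```

Real projects usually weigh these criteria against each other rather than in a fixed order, so treat the function as a starting point.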

Next Steps

Now you know the basics! For more detailed examples: