Quickstart Guide#
Get started with Row2Vec in 5 minutes! This guide shows the essential features through executable examples.
Basic Usage#
The core of Row2Vec is the learn_embedding() function:
from row2vec import learn_embedding, generate_synthetic_data
import pandas as pd
# Generate sample data
df = generate_synthetic_data(num_records=200, seed=42)
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
Dataset shape: (200, 3)
Columns: ['Country', 'Product', 'Sales']
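A quick look at the raw rows and their column types before embedding them:

# Inspect the sample rows and column dtypes
print(df.head())
print(df.dtypes)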
Unsupervised Embeddings#
Create compressed representations of each row:
# Learn 5-dimensional embeddings for each row
embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=5,
    max_epochs=20,
    verbose=False,
)
print(f"Embeddings shape: {embeddings.shape}")
print("\nFirst 3 embeddings:")
print(embeddings.head(3))
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer (InputLayer) │ (None, 10) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 128) │ 1,408 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 5) │ 645 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 128) │ 768 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_1 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_2 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 4,111 (16.06 KB)
Trainable params: 4,111 (16.06 KB)
Non-trainable params: 0 (0.00 B)
Embeddings shape: (200, 5)
First 3 embeddings:
embedding_0 embedding_1 embedding_2 embedding_3 embedding_4
0 -0.398966 0.143785 0.321987 0.450521 1.021751
1 0.388859 -0.476230 0.629208 -0.746208 -0.325556
2 -0.599080 -0.385154 0.816347 0.604446 -0.107327
# Verify: each row gets an embedding
print(f"Original data: {len(df)} rows")
print(f"Embeddings: {len(embeddings)} rows")
print(f"Dimensions per embedding: {embeddings.shape[1]}")
Original data: 200 rows
Embeddings: 200 rows
Dimensions per embedding: 5
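These row embeddings are plain numeric features, so they drop straight into downstream models. A minimal sketch, assuming scikit-learn is installed:

from sklearn.cluster import KMeans

# Cluster the 200 rows in the 5-dimensional embedding space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(f"Cluster sizes: {pd.Series(labels).value_counts().to_dict()}")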
Target-Based Embeddings#
Learn one embedding vector per unique value of a categorical column:
# Learn embeddings for each country
country_embeddings = learn_embedding(
    df,
    mode="target",
    reference_column="Country",
    embedding_dim=3,
    max_epochs=20,
    verbose=False,
)
print("Country embeddings:")
print(country_embeddings)
Model: "functional_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer_1 (InputLayer) │ (None, 5) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_3 (Dense) │ (None, 128) │ 768 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_2 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 3) │ 387 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_4 (Dense) │ (None, 5) │ 20 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,175 (4.59 KB)
Trainable params: 1,175 (4.59 KB)
Non-trainable params: 0 (0.00 B)
Country embeddings:
embedding_0 embedding_1 embedding_2
category
0 0.010338 -0.210590 -0.106090
1 0.433171 0.148842 0.318366
2 -0.003074 -0.207572 -0.271954
3 0.052782 -0.166747 -0.342660
4 0.445591 0.411686 0.303609
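Each country is now a point in a shared 3-dimensional space, so categories can be compared directly. For example, pairwise cosine similarity between the five country vectors (again assuming scikit-learn):

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise similarity between the learned country vectors
sim = cosine_similarity(country_embeddings)
print(pd.DataFrame(sim,
                   index=country_embeddings.index,
                   columns=country_embeddings.index).round(2))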
Classical Methods#
Row2Vec also provides classical dimensionality reduction methods:
PCA (Fast Linear Reduction)#
pca_embeddings = learn_embedding(
    df,
    mode="pca",
    embedding_dim=2,
    verbose=False,
)
print("PCA embeddings (first 5):")
print(pca_embeddings.head())
PCA embeddings (first 5):
embedding_0 embedding_1
0 -0.536821 0.593924
1 1.270871 0.565482
2 -0.705281 0.137746
3 -0.613175 0.593205
4 -0.764737 -0.584174
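For intuition, mode="pca" is conceptually similar to one-hot encoding the frame, standardizing it, and running scikit-learn's PCA. A rough sketch of that idea (not necessarily Row2Vec's exact preprocessing):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# One-hot encode categoricals, standardize, then project to 2 components
encoded = pd.get_dummies(df)
scaled = StandardScaler().fit_transform(encoded)
manual_pca = PCA(n_components=2).fit_transform(scaled)
print(f"Manual PCA shape: {manual_pca.shape}")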
t-SNE (Visualization)#
tsne_embeddings = learn_embedding(
    df,
    mode="tsne",
    embedding_dim=2,
    perplexity=30,
    verbose=False,
)
print("t-SNE embeddings (first 5):")
print(tsne_embeddings.head())
t-SNE embeddings (first 5):
embedding_0 embedding_1
0 -8.835135 -8.234326
1 28.354614 -0.902780
2 -1.862854 1.319538
3 -8.821446 -7.312581
4 -8.716358 -0.411034
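t-SNE output is meant to be plotted. A quick scatter, assuming matplotlib is available:

import matplotlib.pyplot as plt

# Color each point by its Country value to see whether clusters align with it
fig, ax = plt.subplots(figsize=(6, 4))
for country, group in tsne_embeddings.groupby(df["Country"]):
    ax.scatter(group["embedding_0"], group["embedding_1"], label=country, s=15)
ax.set_title("t-SNE embeddings by Country")
ax.legend()
plt.show()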
UMAP (Balanced Approach)#
try:
    umap_embeddings = learn_embedding(
        df,
        mode="umap",
        embedding_dim=2,
        n_neighbors=15,
        verbose=False,
    )
    print("UMAP embeddings (first 5):")
    print(umap_embeddings.head())
except ImportError:
    print("UMAP not installed. Install with: pip install umap-learn")
UMAP embeddings (first 5):
embedding_0 embedding_1
0 -7.495312 9.925724
1 8.585988 14.906081
2 -4.510832 2.909270
3 -6.949340 10.106042
4 -2.152459 -5.157022
Scaling Embeddings#
Apply post-processing scaling to embeddings:
# Scale embeddings to [-1, 1] range
scaled_embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=3,
    max_epochs=10,
    scale_method="minmax",
    scale_range=(-1.0, 1.0),
    verbose=False,
)
print("Scaled embeddings statistics:")
print(f"Min value: {scaled_embeddings.min().min():.3f}")
print(f"Max value: {scaled_embeddings.max().max():.3f}")
print("\nFirst 3 scaled embeddings:")
print(scaled_embeddings.head(3))
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer_2 (InputLayer) │ (None, 10) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_5 (Dense) │ (None, 128) │ 1,408 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_3 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 3) │ 387 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_6 (Dense) │ (None, 128) │ 512 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_4 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_7 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,597 (14.05 KB)
Trainable params: 3,597 (14.05 KB)
Non-trainable params: 0 (0.00 B)
Scaled embeddings statistics:
Min value: -1.000
Max value: 1.000
First 3 scaled embeddings:
embedding_0 embedding_1 embedding_2
0 -0.800353 0.927060 0.479231
1 0.293406 -0.724188 -0.400395
2 -1.000000 0.071887 0.526539
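To inspect the range of each embedding dimension after scaling:

# Per-dimension min and max after minmax scaling
print(scaled_embeddings.agg(["min", "max"]).round(3))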
Handling Missing Values#
Row2Vec automatically handles missing values:
import numpy as np
# Create data with missing values
df_missing = df.copy()
df_missing.loc[0:5, 'Sales'] = np.nan
df_missing.loc[10:15, 'Product'] = np.nan
print(f"Missing values introduced: {df_missing.isnull().sum().sum()}")
# Row2Vec handles this automatically
embeddings_missing = learn_embedding(
    df_missing,
    mode="unsupervised",
    embedding_dim=3,
    max_epochs=10,
    verbose=False,
)
print(f"\nEmbeddings generated: {embeddings_missing.shape}")
print("No errors - missing values handled automatically!")
Model: "functional_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer_3 (InputLayer) │ (None, 10) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_8 (Dense) │ (None, 128) │ 1,408 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_5 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 3) │ 387 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_9 (Dense) │ (None, 128) │ 512 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_6 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_10 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,597 (14.05 KB)
Trainable params: 3,597 (14.05 KB)
Non-trainable params: 0 (0.00 B)
Missing values introduced: 12
Embeddings generated: (200, 3)
No errors - missing values handled automatically!
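Beyond not raising errors, you can check that the output itself is complete:

# The embedding matrix contains no missing values
assert embeddings_missing.isnull().sum().sum() == 0
print("Embedding output contains no NaNs")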
Save and Load Models#
Train once, use many times:
from row2vec import train_and_save_model, load_model
import tempfile
import os
# Create temporary directory for demo
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "my_model")

    # Train and save model
    embeddings, script_path, binary_path = train_and_save_model(
        df,
        base_path=model_path,
        embedding_dim=4,
        mode="unsupervised",
        max_epochs=10,
        verbose=False,
    )
    print(f"Model saved to: {script_path}")

    # Load and use model
    model = load_model(script_path)

    # Generate embeddings for new data
    new_data = generate_synthetic_data(num_records=50, seed=999)
    new_embeddings = model.predict(new_data)
    print(f"\nNew embeddings shape: {new_embeddings.shape}")
    print("Model successfully loaded and used!")
Model: "functional_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer_4 (InputLayer) │ (None, 10) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_11 (Dense) │ (None, 128) │ 1,408 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_7 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 4) │ 516 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_12 (Dense) │ (None, 128) │ 640 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_8 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,854 (15.05 KB)
Trainable params: 3,854 (15.05 KB)
Non-trainable params: 0 (0.00 B)
1/7 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Model saved to: /tmp/tmpluyem0ac/my_model.py
New embeddings shape: (50, 4)
Model successfully loaded and used!
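The loaded model stays usable in memory for as long as your session runs. A minimal sketch that persists embeddings for later use (wrapping the result in a DataFrame in case predict returns a plain array):

# Re-embed the original frame and save the result to disk
more_embeddings = pd.DataFrame(model.predict(df))
more_embeddings.to_csv("embeddings.csv", index=False)
print(f"Saved {len(more_embeddings)} embeddings to embeddings.csv")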
Command-Line Interface#
Row2Vec also provides a CLI for batch processing:
# Quick embeddings
row2vec data.csv --output embeddings.csv

# With configuration
row2vec data.csv \
    --dimensions 10 \
    --mode unsupervised \
    --output embeddings.csv

# Train and save model
row2vec-train data.csv \
    --dimensions 5 \
    --output model.py

# Apply saved model
row2vec-embed new_data.csv \
    --model model.py \
    --output new_embeddings.csv
Method Comparison#
Let’s compare the methods on the same data:
import time
methods = {
    "Neural": {"mode": "unsupervised", "max_epochs": 10},
    "PCA": {"mode": "pca"},
    "t-SNE": {"mode": "tsne", "perplexity": 30},
}

results = {}
for name, params in methods.items():
    start = time.time()
    emb = learn_embedding(df, embedding_dim=2, verbose=False, **params)
    elapsed = time.time() - start
    results[name] = {
        "time": elapsed,
        "shape": emb.shape,
        "mean": emb.mean().mean(),
        "std": emb.std().mean(),
    }

print("Method Comparison:")
print("-" * 50)
for method, stats in results.items():
    print(f"{method:10} | Time: {stats['time']:.2f}s | Mean: {stats['mean']:6.3f} | Std: {stats['std']:5.3f}")
Model: "functional_13"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_layer_6 (InputLayer) │ (None, 10) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_17 (Dense) │ (None, 128) │ 1,408 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_11 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ embedding (Dense) │ (None, 2) │ 258 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_18 (Dense) │ (None, 128) │ 384 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_12 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_19 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,340 (13.05 KB)
Trainable params: 3,340 (13.05 KB)
Non-trainable params: 0 (0.00 B)
Method Comparison:
--------------------------------------------------
Neural | Time: 1.61s | Mean: -0.043 | Std: 0.391
PCA | Time: 0.02s | Mean: -0.000 | Std: 0.792
t-SNE | Time: 0.42s | Mean: 1.264 | Std: 10.479
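UMAP can join the comparison with the same import guard used earlier:

try:
    start = time.time()
    emb = learn_embedding(df, mode="umap", embedding_dim=2, n_neighbors=15, verbose=False)
    elapsed = time.time() - start
    print(f"{'UMAP':10} | Time: {elapsed:.2f}s | Mean: {emb.mean().mean():6.3f} | Std: {emb.std().mean():5.3f}")
except ImportError:
    print(f"{'UMAP':10} | not installed (pip install umap-learn)")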
Method Selection Guide#
Each embedding method has different strengths. Here’s when to use each:
| Method | Speed | Deterministic | Best For | Embedding Range |
|---|---|---|---|---|
| Neural | Medium | Yes (with seed) | Complex patterns, feature engineering | Typically [-1, 1] |
| PCA | Fast | Yes | Quick dimensionality reduction, linear relationships | Variable scale |
| t-SNE | Slow | No* | 2D/3D visualization, cluster discovery | Large range, clustered |
| UMAP | Fast | Yes (with seed) | General purpose, balanced local/global structure | Moderate range |
*t-SNE can be made more deterministic with proper seeding, but still has some inherent randomness.
When to Use Each Method#
Choose Neural Networks (mode="unsupervised") when:

- You need embeddings for downstream machine learning models
- Your data has complex, non-linear relationships
- You want features that can capture intricate patterns
- You have sufficient training time and computational resources

Choose PCA (mode="pca") when:

- You need fast, deterministic results
- Your data relationships are primarily linear
- You want interpretable principal components
- You’re preprocessing data for other algorithms

Choose t-SNE (mode="tsne") when:

- You want to visualize data in 2D or 3D
- Discovering clusters is your primary goal
- Local neighborhood preservation is most important
- You don’t mind longer computation times

Choose UMAP (mode="umap") when:

- You want general-purpose dimensionality reduction
- You need both local and global structure preserved
- You want faster performance than t-SNE
- You’re working with higher-dimensional outputs (>3D)
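If you'd rather encode these rules of thumb directly, a small dispatch helper is easy to write. This is a hypothetical convenience wrapper, not part of the Row2Vec API:

# Hypothetical helper -- maps an intended use to a learn_embedding() call
def embed_for(df, purpose, dim=2):
    """Pick an embedding mode based on the intended use."""
    mode_params = {
        "features": {"mode": "unsupervised", "max_epochs": 20},  # downstream ML
        "fast": {"mode": "pca"},                                 # quick, deterministic
        "viz": {"mode": "tsne", "perplexity": 30},               # 2D/3D plots
        "general": {"mode": "umap", "n_neighbors": 15},          # needs umap-learn
    }
    return learn_embedding(df, embedding_dim=dim, verbose=False, **mode_params[purpose])

features = embed_for(df, "features", dim=5)
print(f"Feature embeddings: {features.shape}")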
Next Steps#
Now you know the basics! For more detailed examples:
📊 Titanic Example - Complete walkthrough with real data
🏠 Housing Example - Regression features
🎯 Advanced Features - Neural architecture search, imputation
💻 CLI Guide - Command-line workflows
📚 API Reference - Complete documentation