Command-Line Interface Guide#
Row2Vec provides a comprehensive CLI for batch processing and production workflows.
CLI Commands Overview#
Row2Vec offers three main commands:
- row2vec - Direct embedding generation (most common)
- row2vec-train - Train and save models
- row2vec-embed - Apply saved models to new data
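For a quick feel for how the three commands fit together, here is a minimal end-to-end sketch (file names are illustrative; the flags are the ones documented below):
# One-off embeddings with no saved model
row2vec sales.csv --dimensions 5 --output sales_embeddings.csv
# Train once and persist the model...
row2vec-train sales.csv --dimensions 5 --output sales_model.pkl
# ...then apply it to new data with the same schema
row2vec-embed new_sales.csv --model sales_model.pkl --output new_sales_embeddings.csv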
Installation#
The CLI is included with Row2Vec:
pip install "row2vec[cli]"  # include CLI dependencies (quotes keep zsh from globbing the brackets)
Or install CLI dependencies manually:
pip install click rich pyyaml
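If the shell can't find the commands afterwards, confirm the entry points landed on your PATH:
command -v row2vec row2vec-train row2vec-embed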
Basic Usage Examples#
Quick Embeddings#
# Simple 2D embeddings
row2vec data.csv --output embeddings.csv
# With specific dimensions
row2vec data.csv --dimensions 10 --output embeddings.csv
# Target-based embeddings
row2vec data.csv --target-column category --output embeddings.csv
Working with Different Modes#
# Neural network (default)
row2vec data.csv --mode unsupervised --dimensions 5 --output neural_emb.csv
# PCA for fast linear reduction
row2vec data.csv --mode pca --dimensions 5 --output pca_emb.csv
# t-SNE for visualization
row2vec data.csv --mode tsne --dimensions 2 --output tsne_emb.csv
# UMAP for general purpose
row2vec data.csv --mode umap --dimensions 3 --output umap_emb.csv
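When you are unsure which reduction suits your data, a quick sweep over all four modes makes side-by-side comparison easy (output names are illustrative):
# Generate one embedding file per mode for comparison
for mode in unsupervised pca tsne umap; do
    row2vec data.csv --mode "$mode" --dimensions 2 --output "emb_${mode}.csv"
done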
Advanced Configuration#
Using Configuration Files#
Create a YAML configuration file for complex setups:
# config.yaml
neural:
  max_epochs: 100
  batch_size: 128
  dropout_rate: 0.3
  hidden_units: [512, 256]
  early_stopping: true

preprocessing:
  categorical_encoding_strategy: "adaptive"
  numeric_scaling: "standard"
  handle_missing: "median"

scaling:
  method: "minmax"
  feature_range: [-1.0, 1.0]

logging:
  level: "INFO"
  enable_performance: true
Use the configuration:
row2vec data.csv --config config.yaml --output embeddings.csv
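A malformed YAML file only fails once the run starts, so it can be worth a syntax check first. This sketch leans on PyYAML, which is already a CLI dependency:
# Fail fast on YAML syntax errors before a long run
python -c "import yaml; yaml.safe_load(open('config.yaml'))" && \
    row2vec data.csv --config config.yaml --output embeddings.csv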
Large Dataset Processing#
# Sample large datasets
row2vec huge_data.csv --sample-size 10000 --output embeddings.csv
# Batch processing with specific settings
row2vec big_data.parquet \
--dimensions 20 \
--batch-size 512 \
--sample-size 50000 \
--output embeddings.parquet
Training and Model Management#
Train Models#
# Basic model training
row2vec-train data.csv --output model.pkl
# Advanced training
row2vec-train data.csv \
--dimensions 15 \
--mode unsupervised \
--max-epochs 100 \
--batch-size 256 \
--validation-split 0.2 \
--save-embeddings \
--output production_model.pkl
Apply Trained Models#
# Basic inference
row2vec-embed new_data.csv --model model.pkl --output embeddings.csv
# With validation
row2vec-embed data.csv \
--model trained_model.pkl \
--strict-validation \
--output embeddings.parquet
File Format Support#
Row2Vec CLI supports multiple formats:
Input formats:
- CSV (.csv)
- Parquet (.parquet)
- Excel (.xlsx, .xls)
- JSON (.json)
- TSV (.tsv)
Output formats:
- CSV (.csv) - Default, widely compatible
- Parquet (.parquet) - Recommended for large datasets
- Excel (.xlsx) - For reports and analysis
- JSON (.json) - For APIs and web services
# Different format examples
row2vec data.parquet --output embeddings.parquet # Parquet to Parquet
row2vec data.xlsx --output embeddings.csv # Excel to CSV
row2vec data.json --output embeddings.json # JSON to JSON
Real-World Workflow Examples#
Data Science Pipeline#
# 1. Explore with quick embeddings
row2vec exploration_data.csv --dimensions 2 --mode tsne --output explore.csv
# 2. Train production model
row2vec-train clean_data.csv \
--dimensions 50 \
--max-epochs 200 \
--validation-split 0.3 \
--save-embeddings \
--output production_model.pkl
# 3. Apply to new data
row2vec-embed daily_data.csv \
--model production_model.pkl \
--validate-schema \
--output daily_embeddings.csv
Customer Analytics#
# Customer segmentation embeddings
row2vec customer_data.csv \
--dimensions 10 \
--mode unsupervised \
--categorical-strategy adaptive \
--numeric-scaling robust \
--output customer_segments.csv
# Category analysis
row2vec customer_data.csv \
--mode target \
--target-column customer_type \
--dimensions 3 \
--output customer_types.csv
A/B Testing Setup#
# Train baseline model
row2vec-train historical_data.csv \
--config baseline_config.yaml \
--output baseline_model.pkl
# Generate embeddings for test groups
row2vec-embed test_group_a.csv \
--model baseline_model.pkl \
--output test_a_embeddings.csv
row2vec-embed test_group_b.csv \
--model baseline_model.pkl \
--output test_b_embeddings.csv
Monitoring and Debugging#
Verbose Output#
# Enable detailed logging
row2vec data.csv --verbose --output embeddings.csv
# Custom log levels and files
row2vec data.csv \
--log-level DEBUG \
--log-file training.log \
--output embeddings.csv
Performance Monitoring#
# Time execution
time row2vec large_data.csv --output embeddings.csv
# Monitor memory usage
row2vec data.csv --log-level INFO --output embeddings.csv 2>&1 | grep memory
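For a hard number on peak memory rather than grepping log lines, GNU time works independently of Row2Vec's own logging (on macOS, /usr/bin/time -l gives the equivalent):
# Prints "Maximum resident set size" along with timing details
/usr/bin/time -v row2vec data.csv --output embeddings.csv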
Error Handling#
# Relaxed validation for messy data
row2vec-embed messy_data.csv \
--model model.pkl \
--no-strict-validation \
--output embeddings.csv
# Skip problematic rows (if your installation supports --handle-errors)
row2vec problematic_data.csv \
--handle-errors skip \
--output embeddings.csv
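If your build lacks --handle-errors, a shell-level pre-filter is a workable fallback. The sketch below assumes the bad rows are those with the wrong field count; 12 columns is illustrative, and the naive comma split will miscount quoted fields:
# Keep only rows with the expected number of comma-separated fields
awk -F',' 'NF == 12' problematic_data.csv > cleaned_data.csv
row2vec cleaned_data.csv --output embeddings.csv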
Integration Examples#
Shell Scripting#
#!/bin/bash
# process_daily_data.sh

DATA_DIR="/data/daily"
MODEL_PATH="/models/production_model.pkl"
OUTPUT_DIR="/embeddings/daily"

for file in "$DATA_DIR"/*.csv; do
    basename=$(basename "$file" .csv)
    echo "Processing $basename..."
    row2vec-embed "$file" \
        --model "$MODEL_PATH" \
        --output "$OUTPUT_DIR/${basename}_embeddings.csv" \
        --log-file "$OUTPUT_DIR/${basename}.log"
done

echo "Daily processing complete!"
Python Integration#
# Generate embeddings then process with Python
row2vec data.csv --output embeddings.csv
python analyze_embeddings.py embeddings.csv
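A quick sanity check before handing the file to downstream code: assuming the CLI emits one embedding row per input row plus a header, the line counts should match:
# Both files carry a header, so equal line counts mean equal row counts
[ "$(wc -l < data.csv)" -eq "$(wc -l < embeddings.csv)" ] && echo "row counts match"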
Docker Usage#
# Dockerfile
FROM python:3.10
RUN pip install "row2vec[cli]"
COPY . /app
WORKDIR /app
# Process data in container
CMD ["row2vec", "input/data.csv", "--output", "output/embeddings.csv"]
# Build and run
docker build -t row2vec-processor .
docker run -v $(pwd)/data:/app/input -v $(pwd)/output:/app/output row2vec-processor
Performance Tips#
Optimization Strategies#
# Large datasets: use sampling
row2vec huge_data.csv \
--sample-size 100000 \
--batch-size 1024 \
--output embeddings.csv
# Fast iteration: use PCA
row2vec data.csv --mode pca --dimensions 10 --output quick_emb.csv
# Production quality: more epochs
row2vec-train data.csv \
--max-epochs 500 \
--early-stopping \
--output final_model.pkl
Memory Management#
# Process very large files in chunks, keeping the CSV header with every chunk
head -n 1 huge_data.csv > header.csv
tail -n +2 huge_data.csv | split -l 10000 - chunk_
for chunk in chunk_*; do
    cat header.csv "$chunk" > "${chunk}.csv"
    row2vec "${chunk}.csv" --output "${chunk}_emb.csv"
done
# Concatenate results, keeping a single header row
head -n 1 chunk_aa_emb.csv > all_embeddings.csv
tail -q -n +2 chunk_*_emb.csv >> all_embeddings.csv
Troubleshooting#
Common Issues#
Memory errors:
# Reduce batch size and sample data
row2vec large_data.csv \
--sample-size 10000 \
--batch-size 32 \
--output embeddings.csv
Slow performance:
# Use PCA for quick results
row2vec data.csv --mode pca --dimensions 5 --output fast_emb.csv
# Or reduce epochs
row2vec data.csv --max-epochs 10 --output quick_emb.csv
Schema validation errors:
# Disable strict validation
row2vec-embed data.csv \
--model model.pkl \
--no-strict-validation \
--output embeddings.csv
Getting Help#
# General help
row2vec --help
# Command-specific help
row2vec-train --help
row2vec-embed --help
# Version information
row2vec --version
Configuration Reference#
Complete YAML configuration example:
# Complete configuration example
neural:
  max_epochs: 100
  batch_size: 128
  dropout_rate: 0.25
  hidden_units: [512, 256, 128]
  early_stopping: true
  early_stopping_patience: 10

classical:
  n_neighbors: 20    # UMAP
  perplexity: 50.0   # t-SNE
  min_dist: 0.05     # UMAP
  n_iter: 1000       # t-SNE

preprocessing:
  categorical_encoding_strategy: "adaptive"
  numeric_scaling: "standard"
  handle_missing: "adaptive"

scaling:
  method: "minmax"
  feature_range: [-1.0, 1.0]

logging:
  level: "INFO"
  enable_performance: true
  enable_memory: false
The CLI provides powerful batch processing capabilities while maintaining Row2Vec’s simplicity. Use it for production workflows, automation, and large-scale data processing.
Advanced Workflows#
Complete ML Pipeline#
# 1. Explore with quick embeddings
row2vec exploration_data.csv --dimensions 10 --output exploration.csv
# 2. Train production model
row2vec-train training_data.csv \
--target outcome \
--validation-split 0.2 \
--max-epochs 100 \
--output production_model.pkl
# 3. Apply to new data
row2vec-embed new_data.csv \
--model production_model.pkl \
--output predictions.csv
Production Deployment#
# 1. Model training with validation
row2vec-train historical_data.csv \
--target conversion \
--validation-split 0.3 \
--max-epochs 150 \
--batch-size 256 \
--categorical-strategy adaptive \
--output production_model.pkl
# 2. Daily batch inference
row2vec-embed daily_data.csv \
--model production_model.pkl \
--validate-schema \
--output daily_embeddings.parquet
# 3. Real-time inference
row2vec-embed realtime_batch.csv \
--model production_model.pkl \
--no-strict-validation \
--output realtime_embeddings.json
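For the daily batch, a cron entry is often all the orchestration needed (paths are illustrative; make sure the console scripts are on cron's PATH):
# Run the daily batch at 02:00 every day
0 2 * * * row2vec-embed /data/daily_data.csv --model /models/production_model.pkl --output /embeddings/daily_embeddings.parquet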
Schema Validation#
The CLI provides robust schema validation for production use:
# Strict validation (default)
row2vec-embed data.csv --model model.pkl --strict-validation --output embeddings.csv
# Relaxed validation for data drift
row2vec-embed data.csv --model model.pkl --no-strict-validation --output embeddings.csv
Performance Tips#
- Use Parquet for large datasets: Better compression and faster I/O
- Sample large datasets: Use --sample-size to manage memory
- Enable early stopping: Reduces training time while maintaining quality
- Use appropriate batch sizes: Larger batches for GPUs, smaller for CPUs
- Cache trained models: Reuse models across similar datasets (see the sketch after this list)
- Monitor memory usage: Enable performance logging for optimization
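The model-caching tip can be as simple as a file-existence check, so repeated runs skip training (paths are illustrative):
# Train only when no saved model exists, then reuse it
MODEL="models/customers.pkl"
if [ ! -f "$MODEL" ]; then
    row2vec-train customer_data.csv --dimensions 10 --output "$MODEL"
fi
row2vec-embed customer_data.csv --model "$MODEL" --output customer_embeddings.csv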
Troubleshooting#
Common Issues#
Import errors: Ensure all dependencies are installed
pip install click rich pyyaml pandas scikit-learn tensorflow
Memory errors: Use sampling for large datasets
row2vec large_data.csv --sample-size 10000 --output embeddings.csv
Schema validation failures: Use relaxed validation for data drift
row2vec-embed data.csv --model model.pkl --no-strict-validation --output embeddings.csv
Training convergence: Adjust epochs and early stopping
row2vec-train data.csv --max-epochs 200 --no-early-stopping --output model.pkl
Debug Mode#
Enable verbose output and logging for troubleshooting:
row2vec data.csv --verbose --log-level DEBUG --log-file debug.log --output embeddings.csv
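Afterwards, filter the log for the interesting lines (this assumes standard level names appear in the log format):
# Surface warnings and errors from the debug log
grep -E "WARNING|ERROR" debug.log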
Next Steps#
📖 API Reference - Complete function documentation
🏠 Examples - Return to interactive examples
⚙️ Advanced Features - Neural architecture search and more