API Reference#

Complete documentation of Row2Vec’s Python API.

Core Functions#

learn_embedding()#

The main function for generating embeddings from tabular data.

learn_embedding(df, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, contrastive_loss='triplet', margin=1.0, negative_samples=5, config=None)[source]#

Learns a low-dimensional embedding from a pandas DataFrame.

Note

Current version supports numeric and categorical features. Textual and temporal features are not directly supported - please preprocess them yourself using appropriate tools (e.g., BERT-like embeddings for text, temporal libraries for time series). Support for these feature types is planned for future versions.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing numeric and categorical features.

  • embedding_dim (int) – The dimensionality of the embedding space.

  • mode (str) – Embedding method - ‘unsupervised’ (autoencoder), ‘target’ (supervised), ‘pca’ (Principal Component Analysis), ‘tsne’ (t-SNE), ‘umap’ (UMAP), or ‘contrastive’ (contrastive learning).

  • reference_column (str) – The target column for ‘target’ mode.

  • max_epochs (int) – The maximum number of training epochs (neural methods only).

  • batch_size (int) – The batch size for training (neural methods only).

  • dropout_rate (float) – The dropout rate for regularization (neural methods only).

  • hidden_units (Union[int, list[int]]) – Hidden layer configuration - single int for one layer or list of ints for multiple layers (neural methods only).

  • early_stopping (bool) – Whether to use early stopping (neural methods only).

  • seed (int) – A random seed for reproducibility.

  • verbose (bool) – Whether to print training progress.

  • scale_method (str, optional) – Scaling method for embeddings. Options: ‘none’, ‘minmax’, ‘standard’, ‘l2’, ‘tanh’.

  • scale_range (tuple, optional) – Range for minmax scaling. Default: (0, 1).

  • log_level (str) – Logging level (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’).

  • log_file (str, optional) – File path for logging output.

  • enable_logging (bool) – Whether to enable structured logging.

  • n_neighbors (int) – Number of neighbors for UMAP (default: 15).

  • perplexity (float) – Perplexity parameter for t-SNE (default: 30.0).

  • min_dist (float) – Minimum distance for UMAP (default: 0.1).

  • n_iter (int) – Number of iterations for t-SNE (default: 1000).

  • similar_pairs (list[tuple[int, int]], optional) – List of (row_idx1, row_idx2) pairs that should have similar embeddings (for contrastive mode).

  • dissimilar_pairs (list[tuple[int, int]], optional) – List of (row_idx1, row_idx2) pairs that should have dissimilar embeddings (for contrastive mode).

  • auto_pairs (str, optional) – Strategy for automatic pair generation. Options: ‘cluster’ (cluster-based), ‘neighbors’ (k-NN based), ‘categorical’ (same category values), ‘random’ (random sampling).

  • contrastive_loss (str) – Contrastive loss function. Options: ‘triplet’, ‘contrastive’.

  • margin (float) – Margin parameter for contrastive loss functions (default: 1.0).

  • negative_samples (int) – Number of negative samples per positive pair (default: 5).

  • config (EmbeddingConfig, optional) – Configuration for preprocessing and model behavior. If None, intelligent defaults are used based on data analysis.

Returns:

A DataFrame containing the learned embeddings.

Return type:

pd.DataFrame

Raises:
  • ValueError – If input validation fails or unsupported mode is specified.

  • TypeError – If input types are incorrect.
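
A minimal usage sketch, assuming learn_embedding is exported from the top-level row2vec package and that the file and column names below exist in your data:

import pandas as pd
from row2vec import learn_embedding  # assumed import path

df = pd.read_csv("data.csv")  # hypothetical dataset with numeric and categorical columns

# Unsupervised autoencoder embeddings
embeddings = learn_embedding(df, embedding_dim=10, mode="unsupervised")

# Supervised embeddings driven by a reference column (hypothetical column name)
target_embeddings = learn_embedding(df, mode="target", reference_column="category", embedding_dim=5)

# Classical UMAP embeddings, scaled to the default (0, 1) range
umap_embeddings = learn_embedding(df, mode="umap", embedding_dim=2, n_neighbors=30, scale_method="minmax")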

learn_embedding_with_model()#

Extended function that returns model components for advanced use cases.

learn_embedding_with_model(df, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, negative_samples=5, contrastive_loss='triplet', margin=1.0, config=None)[source]#

Extended version of learn_embedding that also returns the model, preprocessor, and training metadata.

This function is designed for use with the serialization system to capture all necessary components for saving and loading trained models.

Parameters (same as learn_embedding):

  • df (DataFrame)

  • embedding_dim (int)

  • mode (str)

  • reference_column (str | None)

  • max_epochs (int)

  • batch_size (int)

  • dropout_rate (float)

  • hidden_units (int | list[int])

  • early_stopping (bool)

  • seed (int)

  • verbose (bool)

  • scale_method (str | None)

  • scale_range (tuple[float, float] | None)

  • log_level (str)

  • log_file (str | None)

  • enable_logging (bool)

  • n_neighbors (int)

  • perplexity (float)

  • min_dist (float)

  • n_iter (int)

  • similar_pairs (list[tuple[int, int]] | None)

  • dissimilar_pairs (list[tuple[int, int]] | None)

  • auto_pairs (str | None)

  • negative_samples (int)

  • contrastive_loss (str)

  • margin (float)

  • config (EmbeddingConfig | None)

Returns:

Tuple of (embeddings, model, preprocessor, metadata):
  • embeddings: DataFrame with the learned embeddings

  • model: Trained model (Keras model for neural methods, sklearn estimator for classical methods)

  • preprocessor: Fitted sklearn ColumnTransformer for data preprocessing

  • metadata: Dictionary containing training metadata and configuration

Return type:

tuple[DataFrame, Model | BaseEstimator, ColumnTransformer, dict[str, Any]]
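
A sketch of unpacking the returned tuple, assuming the same top-level import path as learn_embedding:

from row2vec import learn_embedding_with_model  # assumed import path

embeddings, model, preprocessor, metadata = learn_embedding_with_model(
    df, embedding_dim=10, mode="unsupervised"
)
# embeddings: DataFrame, model: Keras model or sklearn estimator,
# preprocessor: fitted ColumnTransformer, metadata: dict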

V2 API Functions#

The newer, more flexible API with configuration objects.

learn_embedding_v2(df, config=None, auto_architecture=False, architecture_search_config=None, **config_overrides)[source]#

Modern config-based API for learning embeddings from tabular data.

This is the new recommended API that uses configuration objects instead of long parameter lists. It provides better organization, type safety, and extensibility.

Parameters:
  • df (DataFrame) – Input DataFrame containing the data to embed

  • config (EmbeddingConfig | None) – Complete embedding configuration. If None, default config is used.

  • auto_architecture (bool) – Enable automatic neural architecture search for neural modes

  • architecture_search_config (ArchitectureSearchConfig | None) – Custom architecture search configuration

  • **config_overrides – Override specific config values (supports nested keys with dots)

Returns:

DataFrame containing the learned embeddings

Return type:

DataFrame

Examples

# Basic usage with defaults
embeddings = learn_embedding_v2(df)

# Using a custom config
config = EmbeddingConfig(
    mode="contrastive",
    embedding_dim=50,
    contrastive=ContrastiveConfig(loss_type="triplet", margin=2.0),
)
embeddings = learn_embedding_v2(df, config)

# With automatic architecture search
embeddings = learn_embedding_v2(df, config, auto_architecture=True)

# Quick overrides without a config object
embeddings = learn_embedding_v2(df, embedding_dim=20, mode="target", reference_column="category")

# Loading from YAML
config = EmbeddingConfig.from_yaml("my_config.yaml")
embeddings = learn_embedding_v2(df, config)

learn_embedding_with_model_v2(df, config=None, **config_overrides)[source]#

Modern config-based API for learning embeddings with model artifacts.

This function returns the embeddings along with the trained model, preprocessor, and metadata for serialization purposes.

Parameters:
  • df (DataFrame) – Input DataFrame containing the data to embed

  • config (EmbeddingConfig | None) – Complete embedding configuration. If None, default config is used.

  • **config_overrides – Override specific config values

Returns:

Tuple of (embeddings, model, preprocessor, metadata)

Return type:

tuple[DataFrame, Any | BaseEstimator, ColumnTransformer, dict[str, Any]]

learn_embedding_unsupervised(df, embedding_dim=10, **overrides)[source]#

Learn unsupervised embeddings with optimized defaults.

Parameters:
  • df (DataFrame)

  • embedding_dim (int)

Return type:

DataFrame

learn_embedding_target(df, reference_column, embedding_dim=10, **overrides)[source]#

Learn target-based embeddings with optimized defaults.

Parameters:
  • df (DataFrame)

  • reference_column (str)

  • embedding_dim (int)

Return type:

DataFrame

learn_embedding_classical(df, method='pca', embedding_dim=10, **overrides)[source]#

Learn classical ML embeddings (PCA, t-SNE, UMAP) with optimized defaults.

Parameters:
  • df (DataFrame)

  • method (str)

  • embedding_dim (int)

Return type:

DataFrame
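
A sketch of the convenience wrappers, assuming they are exported from the top-level row2vec package; the category column is hypothetical:

from row2vec import (
    learn_embedding_unsupervised,
    learn_embedding_target,
    learn_embedding_classical,
)

emb = learn_embedding_unsupervised(df, embedding_dim=10)
emb_target = learn_embedding_target(df, reference_column="category", embedding_dim=5)
emb_pca = learn_embedding_classical(df, method="pca", embedding_dim=3)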

Configuration Classes#

EmbeddingConfig#

class EmbeddingConfig(embedding_dim=10, mode='unsupervised', reference_column=None, seed=1305, verbose=False, neural=<factory>, classical=<factory>, contrastive=<factory>, scaling=<factory>, logging=<factory>, preprocessing=<factory>)[source]#

Bases: object

Complete configuration for embedding learning.

Parameters:
  • embedding_dim (int)

  • mode (str)

  • reference_column (str | None)

  • seed (int)

  • verbose (bool)

  • neural (NeuralConfig)

  • classical (ClassicalConfig)

  • contrastive (ContrastiveConfig)

  • scaling (ScalingConfig)

  • logging (LoggingConfig)

  • preprocessing (PreprocessingConfig)

classmethod from_dict(config_dict)[source]#

Create config from dictionary (e.g., from YAML).

Parameters:

config_dict (dict[str, Any])

Return type:

EmbeddingConfig

classmethod from_yaml(yaml_path)[source]#

Create config from YAML file.

Parameters:

yaml_path (str | Path)

Return type:

EmbeddingConfig

to_dict()[source]#

Convert config to dictionary.

Return type:

dict[str, Any]

to_yaml(yaml_path)[source]#

Save config to YAML file.

Parameters:

yaml_path (str | Path)

Return type:

None
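
A round-trip sketch of the conversion helpers; the YAML file name is hypothetical:

from row2vec.config import EmbeddingConfig

config = EmbeddingConfig(embedding_dim=20, mode="unsupervised")
config.to_yaml("embedding_config.yaml")                 # save to YAML
restored = EmbeddingConfig.from_yaml("embedding_config.yaml")
as_dict = restored.to_dict()                            # plain dict, e.g. for logging
same_config = EmbeddingConfig.from_dict(as_dict)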

NeuralConfig#

class NeuralConfig(max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, activation='relu', early_stopping=True)[source]#

Bases: object

Configuration for neural network-based embedding methods.

Parameters:
  • max_epochs (int)

  • batch_size (int)

  • dropout_rate (float)

  • hidden_units (int | list[int])

  • activation (str)

  • early_stopping (bool)

ClassicalConfig#

class ClassicalConfig(n_neighbors=15, min_dist=0.1, perplexity=30.0, n_iter=1000)[source]#

Bases: object

Configuration for classical ML dimensionality reduction methods.

Parameters:
  • n_neighbors (int)

  • min_dist (float)

  • perplexity (float)

  • n_iter (int)

ScalingConfig#

class ScalingConfig(method=None, range=None)[source]#

Bases: object

Configuration for embedding scaling/normalization.

Parameters:
  • method (str | None)

  • range (tuple[float, float] | None)

LoggingConfig#

class LoggingConfig(level='INFO', file=None, enabled=True)[source]#

Bases: object

Configuration for logging and output.

Parameters:
  • level (str)

  • file (str | None)

  • enabled (bool)

PreprocessingConfig#

class PreprocessingConfig(handle_missing='auto', numeric_scaling='standard', categorical_encoding_strategy='adaptive', categorical_onehot_threshold=20, categorical_target_threshold=100, categorical_entity_threshold=1000)[source]#

Bases: object

Configuration for data preprocessing including categorical encoding.

Parameters:
  • handle_missing (str)

  • numeric_scaling (str)

  • categorical_encoding_strategy (str)

  • categorical_onehot_threshold (int)

  • categorical_target_threshold (int)

  • categorical_entity_threshold (int)

Auto Dimension Selection#

auto_select_dimension()#

auto_select_dimension(df, config=None, target_column=None, methods=None, **selector_kwargs)[source]#

Convenience function for automatic dimension selection.

Parameters:
  • df (DataFrame) – Input dataframe

  • config (EmbeddingConfig | None) – Base embedding configuration (uses defaults if None)

  • target_column (str | None) – Optional target column for supervised evaluation

  • methods (list[str] | None) – List of selection methods to use

  • **selector_kwargs – Additional arguments for AutoDimensionSelector

Returns:

Tuple of (optimal_dimension, selection_metadata)

Return type:

tuple[int, dict[str, Any]]
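
A sketch combining dimension selection with the v2 API, assuming both functions are exported from the top-level package:

from row2vec import auto_select_dimension, learn_embedding_v2  # assumed import path

optimal_dim, selection_info = auto_select_dimension(df)
embeddings = learn_embedding_v2(df, embedding_dim=optimal_dim)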

AutoDimensionSelector#

class AutoDimensionSelector(methods=None, performance_weight=0.4, efficiency_weight=0.3, intrinsic_weight=0.3, max_dimension=None, min_dimension=2, n_trials=5, verbose=True)[source]#

Bases: object

Automatically selects optimal embedding dimensions using multiple strategies.

Combines data-driven analysis, performance optimization, and heuristic rules to determine the best embedding dimension for a given dataset.

Parameters:
  • methods (list[str] | None)

  • performance_weight (float)

  • efficiency_weight (float)

  • intrinsic_weight (float)

  • max_dimension (int | None)

  • min_dimension (int)

  • n_trials (int)

  • verbose (bool)

select_dimension(df, config, target_column=None, candidate_dims=None)[source]#

Select optimal embedding dimension for the given data.

Parameters:
  • df (DataFrame) – Input dataframe

  • config (EmbeddingConfig) – Base embedding configuration (dimension will be overridden)

  • target_column (str | None) – Optional target for supervised evaluation

  • candidate_dims (list[int] | None) – Specific dimensions to evaluate (auto-generated if None)

Returns:

Tuple of (optimal_dimension, selection_metadata)

Return type:

tuple[int, dict[str, Any]]

Imputation#

ImputationConfig#

class ImputationConfig(numeric_strategy='adaptive', categorical_strategy='adaptive', prefer_speed=True, missing_threshold=0.7, row_missing_threshold=0.9, knn_neighbors=5, preserve_missing_patterns=False, missing_indicator_suffix='_was_missing', auto_detect_patterns=True, warn_high_missingness=True, categorical_fill_value='Missing')[source]#

Bases: object

Configuration for intelligent missing value imputation strategies.

This class provides comprehensive control over how missing values are handled, with sensible defaults that work well for most datasets while allowing power users to fine-tune every aspect of the imputation process.

Parameters:
  • numeric_strategy (str)

  • categorical_strategy (str)

  • prefer_speed (bool)

  • missing_threshold (float)

  • row_missing_threshold (float)

  • knn_neighbors (int)

  • preserve_missing_patterns (bool)

  • missing_indicator_suffix (str)

  • auto_detect_patterns (bool)

  • warn_high_missingness (bool)

  • categorical_fill_value (str)

numeric_strategy: str = 'adaptive'#

Numeric imputation strategy. Options:
  • "adaptive": Automatically selects the best strategy based on the missing percentage

  • "mean": Mean imputation (fastest, good for <10% missing)

  • "median": Median imputation (robust to outliers, good for 10-30% missing)

  • "knn": K-nearest neighbors imputation (better for >30% missing)

  • "iterative": MICE-style iterative imputation (best quality, slowest)

categorical_strategy: str = 'adaptive'#

Categorical imputation strategy. Options:
  • "adaptive": Automatically selects the best strategy based on data characteristics

  • "mode": Most frequent value imputation

  • "constant": Fill with a specified constant value

  • "missing_category": Create an explicit "Missing" category

prefer_speed: bool = True#

Whether to prefer faster methods over more accurate but slower ones. When True, uses simpler strategies by default. When False, prefers more sophisticated methods even if they take longer.

missing_threshold: float = 0.7#

Columns with more than this fraction of missing values will be flagged. Conservative default of 0.7 to avoid dropping useful but sparse columns.

row_missing_threshold: float = 0.9#

Rows with more than this fraction of missing values will be flagged. Very conservative default to avoid losing data.

knn_neighbors: int = 5#

Number of neighbors for KNN imputation. Should be odd to avoid ties.

preserve_missing_patterns: bool = False#

Whether to preserve missing patterns when they might be informative.

When True, adds binary indicator columns for originally missing values. This is useful when missingness itself carries information (e.g., customers not providing income information might be systematically different).

Example

Original: [1.0, NaN, 3.0] -> after imputation: [1.0, 2.0, 3.0]
With preservation, an indicator column [False, True, False] marking the originally missing values is added.

missing_indicator_suffix: str = '_was_missing'#

Suffix for missing indicator columns when preserve_missing_patterns=True.

auto_detect_patterns: bool = True#

Whether to automatically analyze missing data patterns and adjust strategies.

warn_high_missingness: bool = True#

Whether to warn users about columns/rows with high missing percentages.

categorical_fill_value: str = 'Missing'#

Fill value when using ‘constant’ strategy for categorical data.
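
A configuration sketch for a dataset with substantial missingness; the field values are chosen for illustration and the import path is assumed:

from row2vec import ImputationConfig  # assumed import path

imputation_config = ImputationConfig(
    numeric_strategy="knn",            # better suited to heavily missing numeric columns
    knn_neighbors=7,                   # odd to avoid ties
    categorical_strategy="missing_category",
    preserve_missing_patterns=True,    # adds *_was_missing indicator columns
)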

AdaptiveImputer#

class AdaptiveImputer(config)[source]#

Bases: BaseEstimator

Adaptive imputer that automatically selects and applies appropriate imputation strategies based on data characteristics.

Parameters:

config (ImputationConfig)

fit(X, y=None)[source]#

Fit the adaptive imputer to the data.

Parameters:
  • X (DataFrame) – Input DataFrame with potential missing values

  • y (Any) – Ignored, present for API compatibility

Returns:

Fitted imputer

Return type:

self

transform(X)[source]#

Transform the data by applying imputation strategies.

Parameters:

X (DataFrame) – Input DataFrame with potential missing values

Returns:

DataFrame with missing values imputed

Return type:

DataFrame

fit_transform(X, y=None, **fit_params)[source]#

Fit the imputer and transform the data in one step.

Parameters:
  • X (DataFrame)

  • y (Any)

  • fit_params (Any)

Return type:

DataFrame

get_imputation_report()[source]#

Get detailed report about the imputation process.

Returns:

Dict containing analysis and imputation details

Return type:

dict[str, Any]
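
A sketch of the fit/transform workflow, assuming AdaptiveImputer and ImputationConfig are importable from the top-level package; df_with_missing is a hypothetical raw DataFrame:

from row2vec import AdaptiveImputer, ImputationConfig  # assumed import path

imputer = AdaptiveImputer(ImputationConfig())
clean_df = imputer.fit_transform(df_with_missing)   # impute missing values
report = imputer.get_imputation_report()            # per-column analysis and chosen strategies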

MissingPatternAnalyzer#

class MissingPatternAnalyzer(config)[source]#

Bases: object

Analyzes missing data patterns to inform imputation strategy selection.

Parameters:

config (ImputationConfig)

analyze(df)[source]#

Analyze missing data patterns in the DataFrame.

Parameters:

df (DataFrame) – Input DataFrame to analyze

Returns:

Dict containing analysis results and recommendations

Return type:

dict[str, Any]

Categorical Encoding#

CategoricalEncodingConfig#

class CategoricalEncodingConfig(encoding_strategy='adaptive', onehot_threshold=20, target_threshold=100, entity_threshold=1000, correlation_threshold=0.1, target_smoothing=1.0, target_noise=0.01, target_cv_folds=5, embedding_dim_ratio=0.5, min_embedding_dim=2, max_embedding_dim=50, entity_epochs=50, entity_batch_size=256, prefer_speed=True, preserve_interpretability=False, enable_feature_selection=False, feature_importance_threshold=0.01, custom_strategies=<factory>, handle_unknown='ignore', random_state=42)[source]#

Bases: object

Configuration for intelligent categorical encoding strategies.

This class provides comprehensive control over how categorical variables are encoded, with intelligent defaults that automatically select optimal strategies based on data characteristics while allowing expert users to fine-tune every aspect.

Parameters:
  • encoding_strategy (str)

  • onehot_threshold (int)

  • target_threshold (int)

  • entity_threshold (int)

  • correlation_threshold (float)

  • target_smoothing (float)

  • target_noise (float)

  • target_cv_folds (int)

  • embedding_dim_ratio (float)

  • min_embedding_dim (int)

  • max_embedding_dim (int)

  • entity_epochs (int)

  • entity_batch_size (int)

  • prefer_speed (bool)

  • preserve_interpretability (bool)

  • enable_feature_selection (bool)

  • feature_importance_threshold (float)

  • custom_strategies (dict[str, str])

  • handle_unknown (str)

  • random_state (int)

encoding_strategy: str = 'adaptive'#

Encoding strategy selection. Options:
  • "adaptive": Automatically selects the best strategy based on data analysis

  • "onehot": One-hot encoding for all categorical features

  • "target": Target encoding for all categorical features

  • "entity": Entity embeddings for all categorical features

  • "ordinal": Ordinal encoding (assumes a natural order)

  • "mixed": Use custom strategies per column (requires custom_strategies)

onehot_threshold: int = 20#

Use OneHot encoding if cardinality <= this threshold and correlation is low.

target_threshold: int = 100#

Use target encoding if cardinality is between onehot_threshold and this value.

entity_threshold: int = 1000#

Use entity embeddings if cardinality > target_threshold and <= this value.

correlation_threshold: float = 0.1#

Minimum mutual information score to prefer target/entity over onehot.

target_smoothing: float = 1.0#

Bayesian smoothing factor for target encoding. Higher values = more smoothing.

target_noise: float = 0.01#

Gaussian noise standard deviation added to target encodings to prevent overfitting.

target_cv_folds: int = 5#

Number of cross-validation folds for target encoding to prevent data leakage.

embedding_dim_ratio: float = 0.5#

Embedding dimension as ratio of sqrt(cardinality). Controls embedding size.

min_embedding_dim: int = 2#

Minimum embedding dimension for entity embeddings.

max_embedding_dim: int = 50#

Maximum embedding dimension for entity embeddings.

entity_epochs: int = 50#

Number of training epochs for entity embedding networks.

entity_batch_size: int = 256#

Batch size for entity embedding training.

prefer_speed: bool = True#

Whether to prefer faster methods over more accurate but slower ones.

preserve_interpretability: bool = False#

Whether to prefer interpretable encodings (OneHot/Ordinal) when possible.

enable_feature_selection: bool = False#

Whether to enable automatic feature selection based on importance.

feature_importance_threshold: float = 0.01#

Minimum feature importance score to keep feature (only if enable_feature_selection=True).

custom_strategies: dict[str, str]#

Custom encoding strategy for specific columns. Format: {column_name: strategy}.

handle_unknown: str = 'ignore'#

How to handle unknown categories. Options: 'ignore', 'error', 'infrequent_if_exist'.

random_state: int = 42#

Random state for reproducible results.

CategoricalEncoder#

class CategoricalEncoder(config=None)[source]#

Bases: BaseEstimator, TransformerMixin

Intelligent categorical encoder with adaptive strategy selection.

This encoder analyzes categorical data characteristics and automatically selects optimal encoding strategies while providing full control for advanced users.

Parameters:

config (CategoricalEncodingConfig | None)

fit(X, y=None)[source]#

Fit the categorical encoder on training data.

Parameters:
  • X (pd.DataFrame) – Categorical features to encode

  • y (pd.Series, optional) – Target variable for supervised encoding strategies

Returns:

self – Fitted encoder instance

Return type:

CategoricalEncoder

transform(X)[source]#

Transform categorical data using fitted encoders.

Parameters:

X (pd.DataFrame) – Categorical data to transform

Returns:

Encoded categorical data

Return type:

pd.DataFrame

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (list[str] | None)

Return type:

list[str]

get_analysis_report()[source]#

Get detailed analysis report for all columns.

Return type:

dict[str, dict[str, Any]]
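
A sketch of adaptive categorical encoding; the import path, column names, and target series are illustrative assumptions:

from row2vec import CategoricalEncoder, CategoricalEncodingConfig  # assumed import path

config = CategoricalEncodingConfig(encoding_strategy="adaptive", onehot_threshold=10)
encoder = CategoricalEncoder(config)

X_cat = df[["city", "product"]]                       # hypothetical categorical columns
encoded = encoder.fit_transform(X_cat, df["label"])   # target enables supervised strategies
report = encoder.get_analysis_report()                # per-column strategy recommendations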

CategoricalAnalyzer#

class CategoricalAnalyzer(config)[source]#

Bases: object

Analyzes categorical data to recommend optimal encoding strategies.

Parameters:

config (CategoricalEncodingConfig)

analyze_column(series, target=None)[source]#

Analyze a categorical column to recommend encoding strategy.

Parameters:
  • series (pd.Series) – Categorical column to analyze

  • target (pd.Series, optional) – Target variable for correlation analysis

Returns:

Analysis results and strategy recommendation

Return type:

Dict[str, Any]

Model Serialization#

Row2VecModel#

class Row2VecModel(model=None, preprocessor=None, metadata=None)[source]#

Bases: object

Complete Row2Vec model with preprocessing pipeline and metadata.

This class encapsulates the trained model, preprocessing pipeline, and all metadata needed for inference.

Parameters:
  • model (Any | BaseEstimator | None)

  • preprocessor (ColumnTransformer | None)

  • metadata (Row2VecModelMetadata | None)

validate_input_schema(df, strict=True)[source]#

Validate input DataFrame schema against expected schema.

Parameters:
  • df (DataFrame) – Input DataFrame to validate

  • strict (bool) – If True, fails on any schema mismatch. If False, warns only.

Returns:

True if schema is valid

Return type:

bool

Raises:

ValueError – If strict=True and schema validation fails

predict(df, validate_schema=True)[source]#

Generate embeddings for new data.

Parameters:
  • df (DataFrame) – Input DataFrame

  • validate_schema (bool) – Whether to validate input schema

Returns:

DataFrame with embeddings

Raises:

ValueError – If model is not loaded or schema validation fails

Return type:

DataFrame

Row2VecModelMetadata#

class Row2VecModelMetadata(embedding_dim, mode, reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, scale_method=None, scale_range=None, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, training_history=None, final_loss=None, epochs_trained=None, training_time=None, original_columns=None, preprocessed_feature_names=None, data_shape=None, data_types=None, expected_schema=None)[source]#

Bases: object

Container for Row2Vec model training metadata.

Parameters:
  • embedding_dim (int)

  • mode (str)

  • reference_column (str | None)

  • max_epochs (int)

  • batch_size (int)

  • dropout_rate (float)

  • hidden_units (int)

  • early_stopping (bool)

  • seed (int)

  • scale_method (str | None)

  • scale_range (tuple[float, float] | None)

  • n_neighbors (int)

  • perplexity (float)

  • min_dist (float)

  • n_iter (int)

  • training_history (dict[str, Any] | None)

  • final_loss (float | None)

  • epochs_trained (int | None)

  • training_time (float | None)

  • original_columns (list[str] | None)

  • preprocessed_feature_names (list[str] | None)

  • data_shape (tuple[int, int] | None)

  • data_types (dict[str, str] | None)

  • expected_schema (dict[str, Any] | None)

to_dict()[source]#

Convert metadata to dictionary for serialization.

Return type:

dict[str, Any]

classmethod from_dict(data)[source]#

Create metadata from dictionary.

Parameters:

data (dict[str, Any])

Return type:

Row2VecModelMetadata

save_model()#

save_model(model, base_path, overwrite=False)[source]#

Save a Row2Vec model using the two-file approach.

Parameters:
  • model (Row2VecModel) – The Row2Vec model to save

  • base_path (str | Path) – Base path for saving (without extension)

  • overwrite (bool) – Whether to overwrite existing files

Returns:

Tuple of (script_path, binary_path)

Raises:
  • FileExistsError – If files exist and overwrite=False

  • ValueError – If model is incomplete

Return type:

tuple[str, str]

load_model()#

load_model(script_path)[source]#

Load a Row2Vec model from the script file.

Parameters:

script_path (str | Path) – Path to the Python script file

Returns:

Loaded Row2Vec model

Raises:
  • FileNotFoundError – If script or binary file not found

  • ValueError – If loading fails

Return type:

Row2VecModel
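
A save/load sketch; the paths are hypothetical and it assumes r2v_model is a fully populated Row2VecModel with model, preprocessor, and metadata set:

from row2vec import save_model, load_model  # assumed import path

# Save using the two-file approach (script + binary)
script_path, binary_path = save_model(r2v_model, "models/customer_embedder", overwrite=True)

# Later, possibly in another process
loaded = load_model(script_path)
new_embeddings = loaded.predict(new_df)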

train_and_save_model()#

train_and_save_model(df, base_path, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, negative_samples=5, contrastive_loss='triplet', margin=1.0, overwrite=False, include_training_history=True)[source]#

Train a Row2Vec model and save it using the two-file approach.

This is a convenience function that combines training and saving.

Parameters:
  • df (DataFrame) – Input DataFrame for training

  • base_path (str | Path) – Base path for saving the model

  • **kwargs – All parameters from learn_embedding

  • overwrite (bool) – Whether to overwrite existing model files

  • include_training_history (bool) – Whether to include full training history in metadata

  • embedding_dim (int)

  • mode (str)

  • reference_column (str | None)

  • max_epochs (int)

  • batch_size (int)

  • dropout_rate (float)

  • hidden_units (int)

  • early_stopping (bool)

  • seed (int)

  • verbose (bool)

  • scale_method (str | None)

  • scale_range (tuple[float, float] | None)

  • log_level (str)

  • log_file (str | None)

  • enable_logging (bool)

  • n_neighbors (int)

  • perplexity (float)

  • min_dist (float)

  • n_iter (int)

  • similar_pairs (list[tuple[int, int]] | None)

  • dissimilar_pairs (list[tuple[int, int]] | None)

  • auto_pairs (str | None)

  • negative_samples (int)

  • contrastive_loss (str)

  • margin (float)

Returns:

Tuple of (embeddings, script_path, binary_path)

Return type:

tuple[DataFrame, str, str]
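
A sketch of the combined train-and-save workflow; the base path is hypothetical and the import path is assumed:

from row2vec import train_and_save_model, load_model  # assumed import path

embeddings, script_path, binary_path = train_and_save_model(
    df,
    base_path="models/demo_model",
    embedding_dim=10,
    mode="unsupervised",
    overwrite=True,
)
model = load_model(script_path)  # reload for inference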

Utilities#

generate_synthetic_data()#

generate_synthetic_data(num_records, seed=1305)[source]#

Generates a synthetic DataFrame for demonstration purposes.

Parameters:
  • num_records (int) – The number of records to generate.

  • seed (int) – A random seed for reproducibility.

Returns:

A synthetic DataFrame with mixed data types.

Return type:

pd.DataFrame
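
A quick-start sketch using the synthetic data helper, assuming both functions are exported from the top-level package:

from row2vec import generate_synthetic_data, learn_embedding  # assumed import path

df = generate_synthetic_data(num_records=1000, seed=1305)
embeddings = learn_embedding(df, embedding_dim=5)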

create_dataframe_schema()#

create_dataframe_schema(df)[source]#

Create a schema dictionary from a DataFrame for validation purposes.

Parameters:

df (DataFrame) – DataFrame to analyze

Returns:

Dictionary containing schema information

Return type:

dict[str, Any]

validate_dataframe_schema()#

validate_dataframe_schema(df, expected_schema, allow_extra_columns=False, allow_missing_columns=False)[source]#

Validate DataFrame schema against expected schema.

Parameters:
  • df (DataFrame) – DataFrame to validate

  • expected_schema (dict[str, Any]) – Expected schema dictionary

  • allow_extra_columns (bool) – Whether to allow extra columns in df

  • allow_missing_columns (bool) – Whether to allow missing columns in df

Raises:

ValueError – If schema validation fails

Return type:

None
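
A sketch of schema capture and validation between training and inference; the DataFrame names are hypothetical and the import path is assumed:

from row2vec import create_dataframe_schema, validate_dataframe_schema  # assumed import path

schema = create_dataframe_schema(train_df)
# Raises ValueError if inference_df deviates from the training schema
validate_dataframe_schema(inference_df, schema, allow_extra_columns=True)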

Logging#

get_logger()#

get_logger(name='row2vec', level='INFO', log_file=None, **kwargs)[source]#

Create a Row2Vec logger with standard configuration.

Parameters:
  • name (str) – Logger name

  • level (str) – Logging level

  • log_file (str | Path | None) – Optional log file path

  • **kwargs (Any) – Additional arguments for Row2VecLogger

Returns:

Configured Row2VecLogger instance

Return type:

Row2VecLogger
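
A logging sketch using only the Row2VecLogger methods documented below; the log file name is hypothetical and the import path is assumed:

from row2vec import get_logger  # assumed import path

logger = get_logger(level="DEBUG", log_file="row2vec.log")
logger.start_training(mode="unsupervised", embedding_dim=10)
logger.log_epoch_metrics(epoch=1, loss=0.42)
logger.end_training(final_loss=0.31, total_epochs=25)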

Row2VecLogger#

class Row2VecLogger(name='row2vec', level='INFO', log_file=None, include_performance=True, include_memory=True)[source]#

Bases: object

Centralized logging system for Row2Vec operations.

Provides structured logging for training progress, debug information, and performance metrics with configurable output formats and levels.

Parameters:
  • name (str)

  • level (str)

  • log_file (str | Path | None)

  • include_performance (bool)

  • include_memory (bool)

start_training(**kwargs)[source]#

Log training start with configuration details.

Parameters:

kwargs (Any)

Return type:

None

start_epoch(epoch, total_epochs)[source]#

Log epoch start.

Parameters:
  • epoch (int)

  • total_epochs (int)

Return type:

None

log_epoch_metrics(epoch, loss, val_loss=None, additional_metrics=None)[source]#

Log epoch completion with metrics.

Parameters:
  • epoch (int)

  • loss (float)

  • val_loss (float | None)

  • additional_metrics (dict[str, float] | None)

Return type:

None

log_early_stopping(epoch, reason)[source]#

Log early stopping event.

Parameters:
  • epoch (int)

  • reason (str)

Return type:

None

end_training(final_loss, total_epochs)[source]#

Log training completion with summary.

Parameters:
  • final_loss (float)

  • total_epochs (int)

Return type:

None

log_data_preprocessing(df_shape, processing_steps)[source]#

Log data preprocessing information.

Parameters:
  • df_shape (tuple[int, int])

  • processing_steps (list[str])

Return type:

None

log_preprocessing_result(original_shape, processed_shape, processing_time)[source]#

Log preprocessing completion.

Parameters:
  • original_shape (tuple[int, int])

  • processed_shape (tuple[int, int])

  • processing_time (float)

Return type:

None

log_model_architecture(model_summary)[source]#

Log model architecture details.

Parameters:

model_summary (str)

Return type:

None

log_embedding_stats(embeddings)[source]#

Log embedding statistics.

Parameters:

embeddings (DataFrame)

Return type:

None

log_performance_warning(message)[source]#

Log performance-related warnings.

Parameters:

message (str)

Return type:

None

log_validation_issue(message)[source]#

Log validation or data quality issues.

Parameters:

message (str)

Return type:

None

log_debug_info(message, data=None)[source]#

Log debug information with optional data context.

Parameters:
  • message (str)

  • data (dict[str, Any] | None)

Return type:

None

log_error(error, context=None)[source]#

Log error with context information.

Parameters:
  • error (Exception)

  • context (str | None)

Return type:

None

log_completion(message='Embedding generation completed successfully!')[source]#

Log completion of embedding generation.

Parameters:

message (str)

Return type:

None

Pipeline Building#

PipelineBuilder#

class PipelineBuilder(config=None)[source]#

Bases: object

Intelligent pipeline builder that analyzes data and constructs optimal preprocessing pipelines with adaptive strategies.

Parameters:

config (EmbeddingConfig | None)

build_preprocessing_pipeline(df, target=None, mode='unsupervised')[source]#

Build intelligent preprocessing pipeline based on data analysis.

Parameters:
  • df (pd.DataFrame) – Input dataset to analyze

  • target (pd.Series, optional) – Target variable for supervised preprocessing

  • mode (str) – Embedding mode that influences preprocessing strategy

Returns:

Fitted preprocessing pipeline and analysis report

Return type:

Tuple[ColumnTransformer, Dict[str, Any]]

get_analysis_report()[source]#

Get detailed analysis report of the dataset.

Return type:

dict[str, Any]

get_pipeline_description()[source]#

Get human-readable description of the constructed pipeline.

Return type:

dict[str, Any]

build_adaptive_pipeline()#

build_adaptive_pipeline(df, target=None, config=None, mode='unsupervised')[source]#

Build adaptive preprocessing pipeline for Row2Vec.

This is the main entry point for intelligent pipeline construction. It analyzes the dataset and automatically selects optimal preprocessing strategies based on data characteristics.

Parameters:
  • df (pd.DataFrame) – Input dataset

  • target (pd.Series, optional) – Target variable for supervised preprocessing

  • config (EmbeddingConfig, optional) – Configuration for preprocessing. If None, intelligent defaults are used.

  • mode (str) – Embedding mode (“unsupervised”, “target”, etc.)

Returns:

Preprocessing pipeline and analysis report

Return type:

Tuple[ColumnTransformer, Dict[str, Any]]

Examples

Basic usage with automatic configuration:

>>> pipeline, report = build_adaptive_pipeline(df)
>>> X_processed = pipeline.fit_transform(df)

With custom configuration:

>>> config = EmbeddingConfig()
>>> config.preprocessing.categorical_encoding_strategy = "entity"
>>> pipeline, report = build_adaptive_pipeline(df, target=y, config=config)

sklearn Integration#

When scikit-learn integration is available:

Row2VecTransformer#

class Row2VecTransformer(embedding_dim=10, mode='unsupervised', reference_column=None, config=None, **kwargs)[source]#

Bases: BaseEstimator, TransformerMixin

Scikit-learn compatible transformer for Row2Vec embeddings.

This transformer can be used in sklearn pipelines and follows the standard fit/transform API. It internally uses Row2Vec’s config-based API for flexibility and type safety.

Parameters:
  • embedding_dim (int, default=10) – Dimensionality of the embedding space.

  • mode (str, default="unsupervised") – Embedding mode. Options: “unsupervised”, “target”, “pca”, “tsne”, “umap”, “contrastive”.

  • reference_column (str, optional) – Reference column name for supervised (“target”) mode.

  • config (EmbeddingConfig, optional) – Pre-configured EmbeddingConfig object. If provided, other parameters are ignored.

  • **kwargs – Additional parameters passed to the embedding configuration. These can include nested parameters like neural__max_epochs=100.

config_#

The configuration object used for embedding generation.

Type:

EmbeddingConfig

model_#

The trained Row2Vec model (if using model-based modes).

Type:

object

feature_names_in_#

Names of features seen during fit.

Type:

ndarray of shape (n_features,)

n_features_in_#

Number of features seen during fit.

Type:

int

Examples

>>> from row2vec.sklearn import Row2VecTransformer
>>> import pandas as pd
>>>
>>> # Simple usage
>>> transformer = Row2VecTransformer(embedding_dim=5, mode="unsupervised")
>>> X_embedded = transformer.fit_transform(df)
>>>
>>> # In a pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.cluster import KMeans
>>>
>>> pipeline = Pipeline([
...     ('embed', Row2VecTransformer(embedding_dim=10)),
...     ('cluster', KMeans(n_clusters=3))
... ])
>>> pipeline.fit(df)
>>>
>>> # With configuration object
>>> from row2vec.config import EmbeddingConfig, NeuralConfig
>>> config = EmbeddingConfig(
...     mode="unsupervised",
...     embedding_dim=15,
...     neural=NeuralConfig(max_epochs=100, batch_size=32)
... )
>>> transformer = Row2VecTransformer(config=config)
>>> X_embedded = transformer.fit_transform(df)
fit(X, y=None)[source]#

Fit the Row2Vec transformer.

Parameters:
  • X (DataFrame or array-like of shape (n_samples, n_features)) – Training data.

  • y (array-like of shape (n_samples,), optional) – Target values (ignored, exists for sklearn compatibility).

Returns:

self – Returns the instance itself.

Return type:

Row2VecTransformer

transform(X)[source]#

Transform data to embedding space.

Parameters:

X (DataFrame or array-like of shape (n_samples, n_features)) – Data to transform.

Returns:

X_embedded – Embedded data.

Return type:

ndarray of shape (n_samples, embedding_dim)

fit_transform(X, y=None, **fit_params)[source]#

Fit the transformer and transform the data.

Parameters:
  • X (DataFrame or array-like of shape (n_samples, n_features)) – Training data.

  • y (array-like of shape (n_samples,), optional) – Target values (ignored, exists for sklearn compatibility).

  • **fit_params (dict) – Additional parameters (ignored, exists for sklearn compatibility).

Returns:

X_embedded – Embedded training data.

Return type:

ndarray of shape (n_samples, embedding_dim)

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) – Not used, exists for sklearn compatibility.

Returns:

feature_names_out – Feature names for the embedded space.

Return type:

ndarray of shape (embedding_dim,), dtype=str

Row2VecClassifier#

class Row2VecClassifier(embedding_dim=10, classifier=None, embedding_config=None, **embedding_kwargs)[source]#

Bases: BaseEstimator

Scikit-learn compatible classifier using Row2Vec embeddings.

This combines Row2Vec embedding generation with a downstream classifier, making it easy to use embeddings for classification tasks in sklearn pipelines.

Parameters:
  • embedding_dim (int, default=10) – Dimensionality of the embedding space.

  • classifier (sklearn classifier, optional) – The downstream classifier. If None, uses LogisticRegression.

  • embedding_config (EmbeddingConfig, optional) – Configuration for embedding generation.

  • **embedding_kwargs – Additional parameters for embedding configuration.

Examples

>>> from row2vec.sklearn import Row2VecClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # With default classifier
>>> clf = Row2VecClassifier(embedding_dim=15)
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)
>>>
>>> # With custom classifier
>>> clf = Row2VecClassifier(
...     embedding_dim=20,
...     classifier=RandomForestClassifier(n_estimators=100)
... )
>>> clf.fit(X_train, y_train)
fit(X, y)[source]#

Fit the embedding and classifier.

Parameters:
  • X (Any)

  • y (Any)

Return type:

Row2VecClassifier

predict(X)[source]#

Make predictions on new data.

Parameters:

X (Any)

Return type:

ndarray[Any, Any]

predict_proba(X)[source]#

Predict class probabilities.

Parameters:

X (Any)

Return type:

ndarray[Any, Any]

pandas Integration#

When pandas integration is available:

DataFrame.row2vec Accessor#

The .row2vec accessor provides direct embedding methods on pandas DataFrames:

import pandas as pd
from row2vec import *

df = pd.read_csv('data.csv')

# Generate embeddings directly from DataFrame
embeddings = df.row2vec.embed(mode='unsupervised', embedding_dim=5)

# Target-based embeddings
category_embeddings = df.row2vec.embed_target('category_column', embedding_dim=3)

# Quick visualization embeddings
viz_embeddings = df.row2vec.embed_2d()

Type Hints#

Row2Vec is fully type-annotated. Key type aliases:

from typing import Union, List, Tuple, Optional, Dict, Any
import pandas as pd
import numpy as np

# Common type aliases used in Row2Vec
DataFrame = pd.DataFrame
NDArray = np.ndarray
ModelType = Union['keras.Model', 'sklearn.base.BaseEstimator']
EmbeddingDimensions = Union[int, List[int]]
ScalingRange = Tuple[float, float]
ConfigDict = Dict[str, Any]

Error Handling#

Row2Vec defines custom exceptions for better error handling:

# Common exceptions you might encounter
from row2vec.exceptions import (
    Row2VecError,           # Base exception
    ConfigurationError,     # Invalid configuration
    DataValidationError,    # Data validation failed
    ModelError,             # Model-related errors
    SerializationError      # Save/load errors
)
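
A sketch of catching these exceptions, assuming the row2vec.exceptions module shown above; note that learn_embedding may also raise plain ValueError or TypeError as documented earlier:

from row2vec import learn_embedding
from row2vec.exceptions import Row2VecError, ConfigurationError

try:
    embeddings = learn_embedding(df, mode="target")  # reference_column deliberately omitted
except (ConfigurationError, ValueError) as exc:
    print(f"Invalid configuration: {exc}")
except Row2VecError as exc:
    print(f"Row2Vec error: {exc}")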

Performance Considerations#

Memory Usage#

For large datasets:

  • Use the sample_size parameter to limit memory usage

  • Tune the batch_size parameter for neural networks

  • Use appropriate data types (float32 vs float64)

Speed Optimization#

  • Use PCA mode for fastest results

  • Reduce max_epochs for quick prototyping

  • Use a larger batch_size when memory allows

  • Enable early_stopping to avoid overtraining
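
A sketch applying these tips with parameters from learn_embedding's signature:

# Fast prototyping: classical PCA, no training loop
quick = learn_embedding(df, mode="pca", embedding_dim=10)

# Faster neural training: fewer epochs, larger batches, early stopping
fast_neural = learn_embedding(
    df,
    mode="unsupervised",
    max_epochs=10,
    batch_size=512,
    early_stopping=True,
)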

GPU Support#

Row2Vec automatically uses GPU when available through TensorFlow:

import tensorflow as tf
print("GPU Available: ", tf.config.list_physical_devices('GPU'))

# Force CPU usage if needed
with tf.device('/CPU:0'):
    embeddings = learn_embedding(df, mode='unsupervised')

Version Information#

Check Row2Vec version and dependencies:

import row2vec
print(f"Row2Vec version: {row2vec.__version__}")

# Check feature availability
print(f"Pandas integration: {row2vec._PANDAS_AVAILABLE}")
print(f"sklearn integration: {row2vec._SKLEARN_AVAILABLE}")