API Reference#
Complete documentation of Row2Vec’s Python API.
Core Functions#
learn_embedding()#
The main function for generating embeddings from tabular data.
- learn_embedding(df, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, contrastive_loss='triplet', margin=1.0, negative_samples=5, config=None)[source]#
Learns a low-dimensional embedding from a pandas DataFrame.
Note
Current version supports numeric and categorical features. Textual and temporal features are not directly supported - please preprocess them yourself using appropriate tools (e.g., BERT-like embeddings for text, temporal libraries for time series). Support for these feature types is planned for future versions.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing numeric and categorical features.
embedding_dim (int) – The dimensionality of the embedding space.
mode (str) – Embedding method - ‘unsupervised’ (autoencoder), ‘target’ (supervised), ‘pca’ (Principal Component Analysis), ‘tsne’ (t-SNE), ‘umap’ (UMAP), or ‘contrastive’ (contrastive learning).
reference_column (str) – The target column for ‘target’ mode.
max_epochs (int) – The maximum number of training epochs (neural methods only).
batch_size (int) – The batch size for training (neural methods only).
dropout_rate (float) – The dropout rate for regularization (neural methods only).
hidden_units (Union[int, list[int]]) – Hidden layer configuration - single int for one layer or list of ints for multiple layers (neural methods only).
early_stopping (bool) – Whether to use early stopping (neural methods only).
seed (int) – A random seed for reproducibility.
verbose (bool) – Whether to print training progress.
scale_method (str, optional) – Scaling method for embeddings. Options: ‘none’, ‘minmax’, ‘standard’, ‘l2’, ‘tanh’.
scale_range (tuple, optional) – Range for minmax scaling. Default: (0, 1).
log_level (str) – Logging level (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’).
log_file (str, optional) – File path for logging output.
enable_logging (bool) – Whether to enable structured logging.
n_neighbors (int) – Number of neighbors for UMAP (default: 15).
perplexity (float) – Perplexity parameter for t-SNE (default: 30.0).
min_dist (float) – Minimum distance for UMAP (default: 0.1).
n_iter (int) – Number of iterations for t-SNE (default: 1000).
similar_pairs (list[tuple[int, int]], optional) – List of (row_idx1, row_idx2) pairs that should have similar embeddings (for contrastive mode).
dissimilar_pairs (list[tuple[int, int]], optional) – List of (row_idx1, row_idx2) pairs that should have dissimilar embeddings (for contrastive mode).
auto_pairs (str, optional) – Strategy for automatic pair generation. Options: ‘cluster’ (cluster-based), ‘neighbors’ (k-NN based), ‘categorical’ (same category values), ‘random’ (random sampling).
contrastive_loss (str) – Contrastive loss function. Options: ‘triplet’, ‘contrastive’.
margin (float) – Margin parameter for contrastive loss functions (default: 1.0).
negative_samples (int) – Number of negative samples per positive pair (default: 5).
config (EmbeddingConfig, optional) – Configuration for preprocessing and model behavior. If None, intelligent defaults are used based on data analysis.
- Returns:
A DataFrame containing the learned embeddings.
- Return type:
pd.DataFrame
- Raises:
ValueError – If input validation fails or unsupported mode is specified.
TypeError – If input types are incorrect.
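A minimal usage sketch (the CSV path and column names are illustrative, and the top-level import is assumed to match the package exports shown elsewhere on this page):
import pandas as pd
from row2vec import learn_embedding

df = pd.read_csv('customers.csv')  # placeholder dataset with numeric and categorical columns

# Unsupervised autoencoder embeddings (default mode)
embeddings = learn_embedding(df, embedding_dim=5)

# Supervised embeddings relative to a target column ('churned' is illustrative)
target_embeddings = learn_embedding(df, embedding_dim=3, mode='target', reference_column='churned')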
learn_embedding_with_model()#
Extended function that returns model components for advanced use cases.
- learn_embedding_with_model(df, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, negative_samples=5, contrastive_loss='triplet', margin=1.0, config=None)[source]#
Extended version of learn_embedding that also returns the model, preprocessor, and training metadata.
This function is designed for use with the serialization system to capture all necessary components for saving and loading trained models.
- Parameters:
All parameters are the same as for learn_embedding:
df (DataFrame)
embedding_dim (int)
mode (str)
reference_column (str | None)
max_epochs (int)
batch_size (int)
dropout_rate (float)
hidden_units (int | list[int])
early_stopping (bool)
seed (int)
verbose (bool)
scale_method (str | None)
scale_range (tuple[float, float] | None)
log_level (str)
log_file (str | None)
enable_logging (bool)
n_neighbors (int)
perplexity (float)
min_dist (float)
n_iter (int)
similar_pairs (list[tuple[int, int]] | None)
dissimilar_pairs (list[tuple[int, int]] | None)
auto_pairs (str | None)
negative_samples (int)
contrastive_loss (str)
margin (float)
config (EmbeddingConfig | None)
- Returns:
Tuple of (embeddings, model, preprocessor, metadata):
- embeddings: DataFrame with learned embeddings
- model: Trained model (Keras model for neural methods, sklearn estimator for classical)
- preprocessor: Fitted sklearn ColumnTransformer for data preprocessing
- metadata: Dictionary containing training metadata and configuration
- Return type:
tuple[DataFrame, Model | BaseEstimator, ColumnTransformer, dict[str, Any]]
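A sketch of unpacking the extended return value; the fitted preprocessor can then be reused on new data (top-level import assumed):
from row2vec import learn_embedding_with_model

embeddings, model, preprocessor, metadata = learn_embedding_with_model(
    df, embedding_dim=10, mode='unsupervised'
)
print(metadata.keys())                     # training metadata and configuration
X_processed = preprocessor.transform(df)   # reuse the fitted ColumnTransformer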
V2 API Functions#
The newer, more flexible API with configuration objects.
- learn_embedding_v2(df, config=None, auto_architecture=False, architecture_search_config=None, **config_overrides)[source]#
Modern config-based API for learning embeddings from tabular data.
This is the new recommended API that uses configuration objects instead of long parameter lists. It provides better organization, type safety, and extensibility.
- Parameters:
df (DataFrame) – Input DataFrame containing the data to embed
config (EmbeddingConfig | None) – Complete embedding configuration. If None, default config is used.
auto_architecture (bool) – Enable automatic neural architecture search for neural modes
architecture_search_config (ArchitectureSearchConfig | None) – Custom architecture search configuration
**config_overrides – Override specific config values (supports nested keys with dots)
- Returns:
DataFrame containing the learned embeddings
- Return type:
DataFrame
Examples
# Basic usage with defaults
embeddings = learn_embedding_v2(df)

# Using a custom config
config = EmbeddingConfig(
    mode="contrastive", embedding_dim=50,
    contrastive=ContrastiveConfig(loss_type="triplet", margin=2.0)
)
embeddings = learn_embedding_v2(df, config)

# With automatic architecture search
embeddings = learn_embedding_v2(df, config, auto_architecture=True)

# Quick overrides without config object
embeddings = learn_embedding_v2(df, embedding_dim=20, mode="target", reference_column="category")

# Loading from YAML
config = EmbeddingConfig.from_yaml("my_config.yaml")
embeddings = learn_embedding_v2(df, config)
- learn_embedding_with_model_v2(df, config=None, **config_overrides)[source]#
Modern config-based API for learning embeddings with model artifacts.
This function returns the embeddings along with the trained model, preprocessor, and metadata for serialization purposes.
- Parameters:
df (DataFrame) – Input DataFrame containing the data to embed
config (EmbeddingConfig | None) – Complete embedding configuration. If None, default config is used.
**config_overrides – Override specific config values
- Returns:
Tuple of (embeddings, model, preprocessor, metadata)
- Return type:
tuple[DataFrame, Any | BaseEstimator, ColumnTransformer, dict[str, Any]]
- learn_embedding_unsupervised(df, embedding_dim=10, **overrides)[source]#
Learn unsupervised embeddings with optimized defaults.
- Parameters:
df (DataFrame)
embedding_dim (int)
- Return type:
DataFrame
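A minimal sketch of the convenience wrapper (top-level import assumed):
from row2vec import learn_embedding_unsupervised

embeddings = learn_embedding_unsupervised(df, embedding_dim=8)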
Configuration Classes#
EmbeddingConfig#
- class EmbeddingConfig(embedding_dim=10, mode='unsupervised', reference_column=None, seed=1305, verbose=False, neural=<factory>, classical=<factory>, contrastive=<factory>, scaling=<factory>, logging=<factory>, preprocessing=<factory>)[source]#
Bases:
object
Complete configuration for embedding learning.
- Parameters:
embedding_dim (int)
mode (str)
reference_column (str | None)
seed (int)
verbose (bool)
neural (NeuralConfig)
classical (ClassicalConfig)
contrastive (ContrastiveConfig)
scaling (ScalingConfig)
logging (LoggingConfig)
preprocessing (PreprocessingConfig)
- classmethod from_dict(config_dict)[source]#
Create config from dictionary (e.g., from YAML).
- Parameters:
config_dict (dict[str, Any])
- Return type:
NeuralConfig#
- class NeuralConfig(max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, activation='relu', early_stopping=True)[source]#
Bases:
object
Configuration for neural network-based embedding methods.
- Parameters:
max_epochs (int)
batch_size (int)
dropout_rate (float)
hidden_units (int | list[int])
activation (str)
early_stopping (bool)
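A short sketch of building a config programmatically and from a plain dictionary; the dictionary keys mirror the dataclass fields and are an assumption beyond what the signatures above guarantee:
from row2vec import learn_embedding_v2
from row2vec.config import EmbeddingConfig, NeuralConfig

config = EmbeddingConfig(
    mode="unsupervised",
    embedding_dim=16,
    neural=NeuralConfig(max_epochs=100, batch_size=32, hidden_units=[128, 64]),
)
embeddings = learn_embedding_v2(df, config)

# Equivalent config from a dictionary (e.g., parsed from YAML); flat keys assumed
config = EmbeddingConfig.from_dict({"mode": "unsupervised", "embedding_dim": 16})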
ClassicalConfig#
ScalingConfig#
LoggingConfig#
PreprocessingConfig#
- class PreprocessingConfig(handle_missing='auto', numeric_scaling='standard', categorical_encoding_strategy='adaptive', categorical_onehot_threshold=20, categorical_target_threshold=100, categorical_entity_threshold=1000)[source]#
Bases:
object
Configuration for data preprocessing including categorical encoding.
- Parameters:
handle_missing (str)
numeric_scaling (str)
categorical_encoding_strategy (str)
categorical_onehot_threshold (int)
categorical_target_threshold (int)
categorical_entity_threshold (int)
Architecture Search#
ArchitectureSearchConfig#
- class ArchitectureSearchConfig(method='random', max_trials=30, max_time=1800, patience=10, min_improvement=0.01, layer_range=(1, 4), max_layers=4, width_options=<factory>, dropout_options=<factory>, activation_options=<factory>, initial_epochs=10, intermediate_epochs=25, final_epochs=50, top_k_intermediate=10, top_k_final=3, reconstruction_weight=0.4, clustering_weight=0.3, efficiency_weight=0.2, stability_weight=0.1, verbose=True, random_seed=None, return_full_history=False)[source]#
Bases:
object
Configuration for automatic neural architecture search.
This class defines the search space, evaluation criteria, and stopping conditions for finding optimal neural network architectures.
- Parameters:
method (str)
max_trials (int)
max_time (float | None)
patience (int)
min_improvement (float)
layer_range (tuple[int, int])
max_layers (int)
width_options (list[int])
dropout_options (list[float])
activation_options (list[str])
initial_epochs (int)
intermediate_epochs (int)
final_epochs (int)
top_k_intermediate (int)
top_k_final (int)
reconstruction_weight (float)
clustering_weight (float)
efficiency_weight (float)
stability_weight (float)
verbose (bool)
random_seed (int | None)
return_full_history (bool)
search_architecture()#
- search_architecture(df, base_config, search_config=None, target_column=None)[source]#
Perform automatic neural architecture search.
This is the main entry point for architecture search functionality.
- Parameters:
df (DataFrame) – Input dataframe for embedding generation
base_config (EmbeddingConfig) – Base embedding configuration
search_config (ArchitectureSearchConfig | None) – Architecture search configuration (uses defaults if None)
target_column (str | None) – Optional target column for supervised evaluation
- Returns:
Tuple of (best_architecture_dict, full_search_result)
- Return type:
tuple[dict[str, Any], ArchitectureSearchResult]
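A hedged sketch of running a small search; import paths are assumed and only documented fields are set:
from row2vec import search_architecture, ArchitectureSearchConfig
from row2vec.config import EmbeddingConfig

search_config = ArchitectureSearchConfig(method="random", max_trials=10, verbose=True)
best_architecture, search_result = search_architecture(
    df,
    base_config=EmbeddingConfig(embedding_dim=10),
    search_config=search_config,
)
print(best_architecture)  # dict describing the best architecture found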
ArchitectureSearcher#
- class ArchitectureSearcher(config)[source]#
Bases:
object
Main class for performing neural architecture search.
Implements multiple search strategies to find optimal neural network architectures for embedding generation tasks.
- Parameters:
config (ArchitectureSearchConfig)
- search(df, base_config, target_column=None)[source]#
Perform architecture search on the given dataset.
- Parameters:
df (DataFrame) – Input dataframe for embedding generation
base_config (EmbeddingConfig) – Base embedding configuration
target_column (str | None) – Optional target column for supervised evaluation
- Returns:
ArchitectureSearchResult containing the best architecture and metadata
- Return type:
ArchitectureSearchResult
Auto Dimension Selection#
auto_select_dimension()#
- auto_select_dimension(df, config=None, target_column=None, methods=None, **selector_kwargs)[source]#
Convenience function for automatic dimension selection.
- Parameters:
df (DataFrame) – Input dataframe
config (EmbeddingConfig | None) – Base embedding configuration (uses defaults if None)
target_column (str | None) – Optional target column for supervised evaluation
methods (list[str] | None) – List of selection methods to use
**selector_kwargs – Additional arguments for AutoDimensionSelector
- Returns:
Tuple of (optimal_dimension, selection_metadata)
- Return type:
tuple[int, dict[str, Any]]
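A minimal sketch, assuming the top-level import; the selected dimension is then fed back into learn_embedding:
from row2vec import auto_select_dimension, learn_embedding

optimal_dim, selection_metadata = auto_select_dimension(df)
embeddings = learn_embedding(df, embedding_dim=optimal_dim)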
AutoDimensionSelector#
- class AutoDimensionSelector(methods=None, performance_weight=0.4, efficiency_weight=0.3, intrinsic_weight=0.3, max_dimension=None, min_dimension=2, n_trials=5, verbose=True)[source]#
Bases:
object
Automatically selects optimal embedding dimensions using multiple strategies.
Combines data-driven analysis, performance optimization, and heuristic rules to determine the best embedding dimension for a given dataset.
- Parameters:
methods (list[str] | None)
performance_weight (float)
efficiency_weight (float)
intrinsic_weight (float)
max_dimension (int | None)
min_dimension (int)
n_trials (int)
verbose (bool)
- select_dimension(df, config, target_column=None, candidate_dims=None)[source]#
Select optimal embedding dimension for the given data.
- Parameters:
df (DataFrame) – Input dataframe
config (EmbeddingConfig) – Base embedding configuration (dimension will be overridden)
target_column (str | None) – Optional target for supervised evaluation
candidate_dims (list[int] | None) – Specific dimensions to evaluate (auto-generated if None)
- Returns:
Tuple of (optimal_dimension, selection_metadata)
- Return type:
tuple[int, dict[str, Any]]
Imputation#
ImputationConfig#
- class ImputationConfig(numeric_strategy='adaptive', categorical_strategy='adaptive', prefer_speed=True, missing_threshold=0.7, row_missing_threshold=0.9, knn_neighbors=5, preserve_missing_patterns=False, missing_indicator_suffix='_was_missing', auto_detect_patterns=True, warn_high_missingness=True, categorical_fill_value='Missing')[source]#
Bases:
object
Configuration for intelligent missing value imputation strategies.
This class provides comprehensive control over how missing values are handled, with sensible defaults that work well for most datasets while allowing power users to fine-tune every aspect of the imputation process.
- Parameters:
numeric_strategy (str)
categorical_strategy (str)
prefer_speed (bool)
missing_threshold (float)
row_missing_threshold (float)
knn_neighbors (int)
preserve_missing_patterns (bool)
missing_indicator_suffix (str)
auto_detect_patterns (bool)
warn_high_missingness (bool)
categorical_fill_value (str)
- numeric_strategy: str = 'adaptive'#
Numeric imputation strategy. Options:
- “adaptive”: Automatically selects the best strategy based on missing percentage
- “mean”: Mean imputation (fastest, good for <10% missing)
- “median”: Median imputation (robust to outliers, good for 10-30% missing)
- “knn”: K-nearest neighbors imputation (better for >30% missing)
- “iterative”: MICE-style iterative imputation (best quality, slowest)
- categorical_strategy: str = 'adaptive'#
Categorical imputation strategy. Options:
- “adaptive”: Automatically selects the best strategy based on data characteristics
- “mode”: Most frequent value imputation
- “constant”: Fill with a specified constant value
- “missing_category”: Create an explicit “Missing” category
- prefer_speed: bool = True#
Whether to prefer faster methods over more accurate but slower ones. When True, uses simpler strategies by default. When False, prefers more sophisticated methods even if they take longer.
- missing_threshold: float = 0.7#
Columns with more than this fraction of missing values will be flagged. Conservative default of 0.7 to avoid dropping useful but sparse columns.
- row_missing_threshold: float = 0.9#
Rows with more than this fraction of missing values will be flagged. Very conservative default to avoid losing data.
- knn_neighbors: int = 5#
Number of neighbors for KNN imputation. Should be odd to avoid ties.
- preserve_missing_patterns: bool = False#
Whether to preserve missing patterns when they might be informative.
When True, adds binary indicator columns for originally missing values. This is useful when missingness itself carries information (e.g., customers not providing income information might be systematically different).
Example
Original: [1.0, NaN, 3.0] -> After imputation: [1.0, 2.0, 3.0]
With preservation: adds column [False, True, False] indicating missingness
- missing_indicator_suffix: str = '_was_missing'#
Suffix for missing indicator columns when preserve_missing_patterns=True.
- auto_detect_patterns: bool = True#
Whether to automatically analyze missing data patterns and adjust strategies.
- warn_high_missingness: bool = True#
Whether to warn users about columns/rows with high missing percentages.
- categorical_fill_value: str = 'Missing'#
Fill value when using ‘constant’ strategy for categorical data.
AdaptiveImputer#
- class AdaptiveImputer(config)[source]#
Bases:
BaseEstimator
Adaptive imputer that automatically selects and applies appropriate imputation strategies based on data characteristics.
- Parameters:
config (ImputationConfig)
- fit(X, y=None)[source]#
Fit the adaptive imputer to the data.
- Parameters:
X (DataFrame) – Input DataFrame with potential missing values
y (Any) – Ignored, present for API compatibility
- Returns:
Fitted imputer
- Return type:
self
- transform(X)[source]#
Transform the data by applying imputation strategies.
- Parameters:
X (DataFrame) – Input DataFrame with potential missing values
- Returns:
DataFrame with missing values imputed
- Return type:
DataFrame
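A sketch of configuring and applying the imputer; the top-level import paths are assumed:
from row2vec import ImputationConfig, AdaptiveImputer

config = ImputationConfig(numeric_strategy="knn", preserve_missing_patterns=True)
imputer = AdaptiveImputer(config)
imputer.fit(df)
df_imputed = imputer.transform(df)  # adds '*_was_missing' indicator columns when patterns are preserved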
MissingPatternAnalyzer#
- class MissingPatternAnalyzer(config)[source]#
Bases:
object
Analyzes missing data patterns to inform imputation strategy selection.
- Parameters:
config (ImputationConfig)
Categorical Encoding#
CategoricalEncodingConfig#
- class CategoricalEncodingConfig(encoding_strategy='adaptive', onehot_threshold=20, target_threshold=100, entity_threshold=1000, correlation_threshold=0.1, target_smoothing=1.0, target_noise=0.01, target_cv_folds=5, embedding_dim_ratio=0.5, min_embedding_dim=2, max_embedding_dim=50, entity_epochs=50, entity_batch_size=256, prefer_speed=True, preserve_interpretability=False, enable_feature_selection=False, feature_importance_threshold=0.01, custom_strategies=<factory>, handle_unknown='ignore', random_state=42)[source]#
Bases:
object
Configuration for intelligent categorical encoding strategies.
This class provides comprehensive control over how categorical variables are encoded, with intelligent defaults that automatically select optimal strategies based on data characteristics while allowing expert users to fine-tune every aspect.
- Parameters:
encoding_strategy (str)
onehot_threshold (int)
target_threshold (int)
entity_threshold (int)
correlation_threshold (float)
target_smoothing (float)
target_noise (float)
target_cv_folds (int)
embedding_dim_ratio (float)
min_embedding_dim (int)
max_embedding_dim (int)
entity_epochs (int)
entity_batch_size (int)
prefer_speed (bool)
preserve_interpretability (bool)
enable_feature_selection (bool)
feature_importance_threshold (float)
custom_strategies (dict[str, str])
handle_unknown (str)
random_state (int)
- encoding_strategy: str = 'adaptive'#
Encoding strategy selection. Options:
- “adaptive”: Automatically selects the best strategy based on data analysis
- “onehot”: One-hot encoding for all categorical features
- “target”: Target encoding for all categorical features
- “entity”: Entity embeddings for all categorical features
- “ordinal”: Ordinal encoding (assumes natural order)
- “mixed”: Use custom strategies per column (requires custom_strategies)
- onehot_threshold: int = 20#
Use OneHot encoding if cardinality <= this threshold and correlation is low.
- target_threshold: int = 100#
Use target encoding if cardinality is between onehot_threshold and this value.
- entity_threshold: int = 1000#
Use entity embeddings if cardinality > target_threshold and <= this value.
- correlation_threshold: float = 0.1#
Minimum mutual information score to prefer target/entity over onehot.
- target_smoothing: float = 1.0#
Bayesian smoothing factor for target encoding. Higher values = more smoothing.
- target_noise: float = 0.01#
Gaussian noise standard deviation added to target encodings to prevent overfitting.
- target_cv_folds: int = 5#
Number of cross-validation folds for target encoding to prevent data leakage.
- embedding_dim_ratio: float = 0.5#
Embedding dimension as ratio of sqrt(cardinality). Controls embedding size.
- min_embedding_dim: int = 2#
Minimum embedding dimension for entity embeddings.
- max_embedding_dim: int = 50#
Maximum embedding dimension for entity embeddings.
- entity_epochs: int = 50#
Number of training epochs for entity embedding networks.
- entity_batch_size: int = 256#
Batch size for entity embedding training.
- prefer_speed: bool = True#
Whether to prefer faster methods over more accurate but slower ones.
- preserve_interpretability: bool = False#
Whether to prefer interpretable encodings (OneHot/Ordinal) when possible.
- enable_feature_selection: bool = False#
Whether to enable automatic feature selection based on importance.
- feature_importance_threshold: float = 0.01#
Minimum feature importance score to keep feature (only if enable_feature_selection=True).
- custom_strategies: dict[str, str]#
Custom encoding strategy for specific columns. Format: {column_name: strategy}
- handle_unknown: str = 'ignore'#
How to handle unknown categories. Options: ‘ignore’, ‘error’, ‘infrequent_if_exist’
- random_state: int = 42#
Random state for reproducible results.
CategoricalEncoder#
- class CategoricalEncoder(config=None)[source]#
Bases:
BaseEstimator, TransformerMixin
Intelligent categorical encoder with adaptive strategy selection.
This encoder analyzes categorical data characteristics and automatically selects optimal encoding strategies while providing full control for advanced users.
- Parameters:
config (CategoricalEncodingConfig | None)
- fit(X, y=None)[source]#
Fit the categorical encoder on training data.
- Parameters:
X (pd.DataFrame) – Categorical features to encode
y (pd.Series, optional) – Target variable for supervised encoding strategies
- Returns:
self – Fitted encoder instance
- Return type:
- transform(X)[source]#
Transform categorical data using fitted encoders.
- Parameters:
X (pd.DataFrame) – Categorical data to transform
- Returns:
Encoded categorical data
- Return type:
pd.DataFrame
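A sketch of mixed per-column encoding; the import paths, the column names, and the X_categorical/y variables are illustrative:
from row2vec import CategoricalEncodingConfig, CategoricalEncoder

config = CategoricalEncodingConfig(
    encoding_strategy="mixed",
    custom_strategies={"city": "target", "country": "onehot"},  # column names are illustrative
)
encoder = CategoricalEncoder(config)
encoder.fit(X_categorical, y)        # y enables supervised (target) encoding
X_encoded = encoder.transform(X_categorical)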
CategoricalAnalyzer#
- class CategoricalAnalyzer(config)[source]#
Bases:
object
Analyzes categorical data to recommend optimal encoding strategies.
- Parameters:
config (CategoricalEncodingConfig)
- analyze_column(series, target=None)[source]#
Analyze a categorical column to recommend encoding strategy.
- Parameters:
series (pd.Series) – Categorical column to analyze
target (pd.Series, optional) – Target variable for correlation analysis
- Returns:
Analysis results and strategy recommendation
- Return type:
Dict[str, Any]
Model Serialization#
Row2VecModel#
- class Row2VecModel(model=None, preprocessor=None, metadata=None)[source]#
Bases:
object
Complete Row2Vec model with preprocessing pipeline and metadata.
This class encapsulates the trained model, preprocessing pipeline, and all metadata needed for inference.
- Parameters:
model (Any | BaseEstimator | None)
preprocessor (ColumnTransformer | None)
metadata (Row2VecModelMetadata | None)
- validate_input_schema(df, strict=True)[source]#
Validate input DataFrame schema against expected schema.
- Parameters:
df (DataFrame) – Input DataFrame to validate
strict (bool) – If True, fails on any schema mismatch. If False, warns only.
- Returns:
True if schema is valid
- Return type:
bool
- Raises:
ValueError – If strict=True and schema validation fails
- predict(df, validate_schema=True)[source]#
Generate embeddings for new data.
- Parameters:
df (DataFrame) – Input DataFrame
validate_schema (bool) – Whether to validate input schema
- Returns:
DataFrame with embeddings
- Raises:
ValueError – If model is not loaded or schema validation fails
- Return type:
DataFrame
Row2VecModelMetadata#
- class Row2VecModelMetadata(embedding_dim, mode, reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, scale_method=None, scale_range=None, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, training_history=None, final_loss=None, epochs_trained=None, training_time=None, original_columns=None, preprocessed_feature_names=None, data_shape=None, data_types=None, expected_schema=None)[source]#
Bases:
object
Container for Row2Vec model training metadata.
- Parameters:
embedding_dim (int)
mode (str)
reference_column (str | None)
max_epochs (int)
batch_size (int)
dropout_rate (float)
hidden_units (int)
early_stopping (bool)
seed (int)
scale_method (str | None)
scale_range (tuple[float, float] | None)
n_neighbors (int)
perplexity (float)
min_dist (float)
n_iter (int)
training_history (dict[str, Any] | None)
final_loss (float | None)
epochs_trained (int | None)
training_time (float | None)
original_columns (list[str] | None)
preprocessed_feature_names (list[str] | None)
data_shape (tuple[int, int] | None)
data_types (dict[str, str] | None)
expected_schema (dict[str, Any] | None)
save_model()#
- save_model(model, base_path, overwrite=False)[source]#
Save a Row2Vec model using the two-file approach.
- Parameters:
model (Row2VecModel) – The Row2Vec model to save
base_path (str | Path) – Base path for saving (without extension)
overwrite (bool) – Whether to overwrite existing files
- Returns:
Tuple of (script_path, binary_path)
- Raises:
FileExistsError – If files exist and overwrite=False
ValueError – If model is incomplete
- Return type:
tuple[str, str]
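A minimal sketch, assuming model is an already trained Row2VecModel and that the function is importable from the top-level package:
from row2vec import save_model

script_path, binary_path = save_model(model, "artifacts/customer_embedder", overwrite=True)  # base path is illustrative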
load_model()#
train_and_save_model()#
- train_and_save_model(df, base_path, embedding_dim=10, mode='unsupervised', reference_column=None, max_epochs=50, batch_size=64, dropout_rate=0.2, hidden_units=128, early_stopping=True, seed=1305, verbose=False, scale_method=None, scale_range=None, log_level='INFO', log_file=None, enable_logging=True, n_neighbors=15, perplexity=30.0, min_dist=0.1, n_iter=1000, similar_pairs=None, dissimilar_pairs=None, auto_pairs=None, negative_samples=5, contrastive_loss='triplet', margin=1.0, overwrite=False, include_training_history=True)[source]#
Train a Row2Vec model and save it using the two-file approach.
This is a convenience function that combines training and saving.
- Parameters:
df (DataFrame) – Input DataFrame for training
base_path (str | Path) – Base path for saving the model
**kwargs – All parameters from learn_embedding
overwrite (bool) – Whether to overwrite existing model files
include_training_history (bool) – Whether to include full training history in metadata
embedding_dim (int)
mode (str)
reference_column (str | None)
max_epochs (int)
batch_size (int)
dropout_rate (float)
hidden_units (int)
early_stopping (bool)
seed (int)
verbose (bool)
scale_method (str | None)
scale_range (tuple[float, float] | None)
log_level (str)
log_file (str | None)
enable_logging (bool)
n_neighbors (int)
perplexity (float)
min_dist (float)
n_iter (int)
similar_pairs (list[tuple[int, int]] | None)
dissimilar_pairs (list[tuple[int, int]] | None)
auto_pairs (str | None)
negative_samples (int)
contrastive_loss (str)
margin (float)
- Returns:
Tuple of (embeddings, script_path, binary_path)
- Return type:
tuple[DataFrame, str, str]
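A sketch of training and persisting in one call; the import path and base path are illustrative:
from row2vec import train_and_save_model

embeddings, script_path, binary_path = train_and_save_model(
    df,
    base_path="models/customer_embedder",
    embedding_dim=8,
    mode="unsupervised",
    overwrite=True,
)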
Utilities#
generate_synthetic_data()#
- generate_synthetic_data(num_records, seed=1305)[source]#
Generates a synthetic DataFrame for demonstration purposes.
- Parameters:
num_records (int) – The number of records to generate.
seed (int) – A random seed for reproducibility.
- Returns:
A synthetic DataFrame with mixed data types.
- Return type:
pd.DataFrame
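A minimal sketch (top-level imports assumed):
from row2vec import generate_synthetic_data, learn_embedding

df = generate_synthetic_data(num_records=500, seed=42)
embeddings = learn_embedding(df, embedding_dim=5)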
create_dataframe_schema()#
validate_dataframe_schema()#
- validate_dataframe_schema(df, expected_schema, allow_extra_columns=False, allow_missing_columns=False)[source]#
Validate DataFrame schema against expected schema.
- Parameters:
df (DataFrame) – DataFrame to validate
expected_schema (dict[str, Any]) – Expected schema dictionary
allow_extra_columns (bool) – Whether to allow extra columns in df
allow_missing_columns (bool) – Whether to allow missing columns in df
- Raises:
ValueError – If schema validation fails
- Return type:
None
Logging#
get_logger()#
- get_logger(name='row2vec', level='INFO', log_file=None, **kwargs)[source]#
Create a Row2Vec logger with standard configuration.
- Parameters:
name (str) – Logger name
level (str) – Logging level
log_file (str | Path | None) – Optional log file path
**kwargs (Any) – Additional arguments for Row2VecLogger
- Returns:
Configured Row2VecLogger instance
- Return type:
Row2VecLogger#
- class Row2VecLogger(name='row2vec', level='INFO', log_file=None, include_performance=True, include_memory=True)[source]#
Bases:
object
Centralized logging system for Row2Vec operations.
Provides structured logging for training progress, debug information, and performance metrics with configurable output formats and levels.
- Parameters:
name (str)
level (str)
log_file (str | Path | None)
include_performance (bool)
include_memory (bool)
- start_training(**kwargs)[source]#
Log training start with configuration details.
- Parameters:
kwargs (Any)
- Return type:
None
- start_epoch(epoch, total_epochs)[source]#
Log epoch start.
- Parameters:
epoch (int)
total_epochs (int)
- Return type:
None
- log_epoch_metrics(epoch, loss, val_loss=None, additional_metrics=None)[source]#
Log epoch completion with metrics.
- Parameters:
epoch (int)
loss (float)
val_loss (float | None)
additional_metrics (dict[str, float] | None)
- Return type:
None
- log_early_stopping(epoch, reason)[source]#
Log early stopping event.
- Parameters:
epoch (int)
reason (str)
- Return type:
None
- end_training(final_loss, total_epochs)[source]#
Log training completion with summary.
- Parameters:
final_loss (float)
total_epochs (int)
- Return type:
None
- log_data_preprocessing(df_shape, processing_steps)[source]#
Log data preprocessing information.
- Parameters:
df_shape (tuple[int, int])
processing_steps (list[str])
- Return type:
None
- log_preprocessing_result(original_shape, processed_shape, processing_time)[source]#
Log preprocessing completion.
- Parameters:
original_shape (tuple[int, int])
processed_shape (tuple[int, int])
processing_time (float)
- Return type:
None
- log_model_architecture(model_summary)[source]#
Log model architecture details.
- Parameters:
model_summary (str)
- Return type:
None
- log_embedding_stats(embeddings)[source]#
Log embedding statistics.
- Parameters:
embeddings (DataFrame)
- Return type:
None
- log_performance_warning(message)[source]#
Log performance-related warnings.
- Parameters:
message (str)
- Return type:
None
- log_validation_issue(message)[source]#
Log validation or data quality issues.
- Parameters:
message (str)
- Return type:
None
- log_debug_info(message, data=None)[source]#
Log debug information with optional data context.
- Parameters:
message (str)
data (dict[str, Any] | None)
- Return type:
None
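A sketch of driving the logger manually, using only the methods documented above; the import path, file path, and metric values are illustrative:
from row2vec import get_logger

logger = get_logger(level="DEBUG", log_file="row2vec.log")
logger.start_training(mode="unsupervised", embedding_dim=10)
logger.log_epoch_metrics(epoch=1, loss=0.42, val_loss=0.47)
logger.end_training(final_loss=0.31, total_epochs=25)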
Pipeline Building#
PipelineBuilder#
- class PipelineBuilder(config=None)[source]#
Bases:
object
Intelligent pipeline builder that analyzes data and constructs optimal preprocessing pipelines with adaptive strategies.
- Parameters:
config (EmbeddingConfig | None)
- build_preprocessing_pipeline(df, target=None, mode='unsupervised')[source]#
Build intelligent preprocessing pipeline based on data analysis.
- Parameters:
df (pd.DataFrame) – Input dataset to analyze
target (pd.Series, optional) – Target variable for supervised preprocessing
mode (str) – Embedding mode that influences preprocessing strategy
- Returns:
Fitted preprocessing pipeline and analysis report
- Return type:
Tuple[ColumnTransformer, Dict[str, Any]]
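A sketch of using the builder directly (top-level import assumed); build_preprocessing_pipeline is documented to return a fitted pipeline plus an analysis report:
from row2vec import PipelineBuilder

builder = PipelineBuilder()  # intelligent defaults when no config is supplied
pipeline, report = builder.build_preprocessing_pipeline(df, mode="unsupervised")
X_processed = pipeline.transform(df)  # pipeline is returned already fitted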
build_adaptive_pipeline()#
- build_adaptive_pipeline(df, target=None, config=None, mode='unsupervised')[source]#
Build adaptive preprocessing pipeline for Row2Vec.
This is the main entry point for intelligent pipeline construction. It analyzes the dataset and automatically selects optimal preprocessing strategies based on data characteristics.
- Parameters:
df (pd.DataFrame) – Input dataset
target (pd.Series, optional) – Target variable for supervised preprocessing
config (EmbeddingConfig, optional) – Configuration for preprocessing. If None, intelligent defaults are used.
mode (str) – Embedding mode (“unsupervised”, “target”, etc.)
- Returns:
Preprocessing pipeline and analysis report
- Return type:
Tuple[ColumnTransformer, Dict[str, Any]]
Examples
Basic usage with automatic configuration:
>>> pipeline, report = build_adaptive_pipeline(df)
>>> X_processed = pipeline.fit_transform(df)
With custom configuration:
>>> config = EmbeddingConfig()
>>> config.preprocessing.categorical_encoding_strategy = "entity"
>>> pipeline, report = build_adaptive_pipeline(df, target=y, config=config)
sklearn Integration#
When scikit-learn integration is available:
Row2VecTransformer#
- class Row2VecTransformer(embedding_dim=10, mode='unsupervised', reference_column=None, config=None, **kwargs)[source]#
Bases:
BaseEstimator, TransformerMixin
Scikit-learn compatible transformer for Row2Vec embeddings.
This transformer can be used in sklearn pipelines and follows the standard fit/transform API. It internally uses Row2Vec’s config-based API for flexibility and type safety.
- Parameters:
embedding_dim (int, default=10) – Dimensionality of the embedding space.
mode (str, default="unsupervised") – Embedding mode. Options: “unsupervised”, “target”, “pca”, “tsne”, “umap”, “contrastive”.
reference_column (str, optional) – Reference column name for supervised (“target”) mode.
config (EmbeddingConfig, optional) – Pre-configured EmbeddingConfig object. If provided, other parameters are ignored.
**kwargs – Additional parameters passed to the embedding configuration. These can include nested parameters like neural__max_epochs=100.
- config_#
The configuration object used for embedding generation.
- Type:
- model_#
The trained Row2Vec model (if using model-based modes).
- Type:
object
- feature_names_in_#
Names of features seen during fit.
- Type:
ndarray of shape (n_features,)
- n_features_in_#
Number of features seen during fit.
- Type:
int
Examples
>>> from row2vec.sklearn import Row2VecTransformer
>>> import pandas as pd
>>>
>>> # Simple usage
>>> transformer = Row2VecTransformer(embedding_dim=5, mode="unsupervised")
>>> X_embedded = transformer.fit_transform(df)
>>>
>>> # In a pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.cluster import KMeans
>>>
>>> pipeline = Pipeline([
...     ('embed', Row2VecTransformer(embedding_dim=10)),
...     ('cluster', KMeans(n_clusters=3))
... ])
>>> pipeline.fit(df)
>>>
>>> # With configuration object
>>> from row2vec.config import EmbeddingConfig, NeuralConfig
>>> config = EmbeddingConfig(
...     mode="unsupervised",
...     embedding_dim=15,
...     neural=NeuralConfig(max_epochs=100, batch_size=32)
... )
>>> transformer = Row2VecTransformer(config=config)
>>> X_embedded = transformer.fit_transform(df)
- fit(X, y=None)[source]#
Fit the Row2Vec transformer.
- Parameters:
X (DataFrame or array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,), optional) – Target values (ignored, exists for sklearn compatibility).
- Returns:
self – Returns the instance itself.
- Return type:
- transform(X)[source]#
Transform data to embedding space.
- Parameters:
X (DataFrame or array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_embedded – Embedded data.
- Return type:
ndarray of shape (n_samples, embedding_dim)
- fit_transform(X, y=None, **fit_params)[source]#
Fit the transformer and transform the data.
- Parameters:
X (DataFrame or array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,), optional) – Target values (ignored, exists for sklearn compatibility).
**fit_params (dict) – Additional parameters (ignored, exists for sklearn compatibility).
- Returns:
X_embedded – Embedded training data.
- Return type:
ndarray of shape (n_samples, embedding_dim)
- get_feature_names_out(input_features=None)[source]#
Get output feature names for transformation.
- Parameters:
input_features (array-like of str or None, default=None) – Not used, exists for sklearn compatibility.
- Returns:
feature_names_out – Feature names for the embedded space.
- Return type:
ndarray of shape (embedding_dim,), dtype=str
Row2VecClassifier#
- class Row2VecClassifier(embedding_dim=10, classifier=None, embedding_config=None, **embedding_kwargs)[source]#
Bases:
BaseEstimator
Scikit-learn compatible classifier using Row2Vec embeddings.
This combines Row2Vec embedding generation with a downstream classifier, making it easy to use embeddings for classification tasks in sklearn pipelines.
- Parameters:
embedding_dim (int, default=10) – Dimensionality of the embedding space.
classifier (sklearn classifier, optional) – The downstream classifier. If None, uses LogisticRegression.
embedding_config (EmbeddingConfig, optional) – Configuration for embedding generation.
**embedding_kwargs – Additional parameters for embedding configuration.
Examples
>>> from row2vec.sklearn import Row2VecClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # With default classifier
>>> clf = Row2VecClassifier(embedding_dim=15)
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)
>>>
>>> # With custom classifier
>>> clf = Row2VecClassifier(
...     embedding_dim=20,
...     classifier=RandomForestClassifier(n_estimators=100)
... )
>>> clf.fit(X_train, y_train)
pandas Integration#
When pandas integration is available:
DataFrame.row2vec Accessor#
The .row2vec accessor provides direct embedding methods on pandas DataFrames:
import pandas as pd
from row2vec import *
df = pd.read_csv('data.csv')
# Generate embeddings directly from DataFrame
embeddings = df.row2vec.embed(mode='unsupervised', embedding_dim=5)
# Target-based embeddings
category_embeddings = df.row2vec.embed_target('category_column', embedding_dim=3)
# Quick visualization embeddings
viz_embeddings = df.row2vec.embed_2d()
Type Hints#
Row2Vec is fully type-annotated. Key type aliases:
from typing import Union, List, Tuple, Optional, Dict, Any
import pandas as pd
import numpy as np
# Common type aliases used in Row2Vec
DataFrame = pd.DataFrame
NDArray = np.ndarray
ModelType = Union['keras.Model', 'sklearn.base.BaseEstimator']
EmbeddingDimensions = Union[int, List[int]]
ScalingRange = Tuple[float, float]
ConfigDict = Dict[str, Any]
Error Handling#
Row2Vec defines custom exceptions for better error handling:
# Common exceptions you might encounter
from row2vec.exceptions import (
Row2VecError, # Base exception
ConfigurationError, # Invalid configuration
DataValidationError, # Data validation failed
ModelError, # Model-related errors
SerializationError # Save/load errors
)
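For example, a sketch of catching the specific exception first and falling back to the base class (only names imported above are used):
from row2vec import learn_embedding
from row2vec.exceptions import Row2VecError, DataValidationError

try:
    embeddings = learn_embedding(df, embedding_dim=5)
except DataValidationError as exc:
    print(f"Input data failed validation: {exc}")
except Row2VecError as exc:
    print(f"Row2Vec error: {exc}")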
Performance Considerations#
Memory Usage#
For large datasets:
Use the sample_size parameter to limit memory usage
Consider the batch_size parameter for neural networks
Use appropriate data types (float32 vs float64)
Speed Optimization#
Use PCA mode for fastest results
Reduce max_epochs for quick prototyping
Use a larger batch_size when sufficient memory is available
Enable early_stopping to avoid overtraining
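A quick-prototyping sketch combining the tips above (parameter values are illustrative):
# Fast, deterministic baseline with a classical method
quick_embeddings = learn_embedding(df, mode='pca', embedding_dim=2)

# Quick neural draft: few epochs, larger batches, early stopping enabled
draft_embeddings = learn_embedding(
    df, mode='unsupervised', embedding_dim=10,
    max_epochs=10, batch_size=256, early_stopping=True,
)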
GPU Support#
Row2Vec automatically uses GPU when available through TensorFlow:
import tensorflow as tf
print("GPU Available: ", tf.config.list_physical_devices('GPU'))
# Force CPU usage if needed
with tf.device('/CPU:0'):
    embeddings = learn_embedding(df, mode='unsupervised')
Version Information#
Check Row2Vec version and dependencies:
import row2vec
print(f"Row2Vec version: {row2vec.__version__}")
# Check feature availability
print(f"Pandas integration: {row2vec._PANDAS_AVAILABLE}")
print(f"sklearn integration: {row2vec._SKLEARN_AVAILABLE}")