Boston Housing Dataset Example

This example applies Row2Vec to the Boston Housing dataset, which has continuous features and a regression target (MEDV, the median home value).

Load and Explore Data

# Run the output-suppression helper first so that later imports stay quiet
with open('suppress_minimal.py') as f:
    exec(f.read())

import pandas as pd
import numpy as np
from row2vec import learn_embedding
import os

# Load the Boston Housing dataset (the file name still says "ames" because the data was originally mislabeled as Ames)
data_path = os.path.join('..', 'data', 'ames_housing.csv')

# Boston housing dataset column names
column_names = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
    'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'
]

df = pd.read_csv(data_path, header=None, names=column_names)

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"Total columns: {len(df.columns)}")

✓ Enhanced minimal suppression active

Dataset shape: (506, 14)

Column names: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
Total columns: 14

Data Overview

# Focus on a manageable subset of important features
important_cols = [
    'CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS',
    'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV'
]

df_subset = df[important_cols].copy()
print(f"Working with {len(important_cols)} key features:")
print(df_subset.columns.tolist())

print(f"\nDataset shape: {df_subset.shape}")
print(f"\nMissing values:")
missing = df_subset.isnull().sum()
print(missing[missing > 0] if missing.any() else "No missing values")

Working with 12 key features:
['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']

Dataset shape: (506, 12)

Missing values:
No missing values

# Basic statistics
print("Target variable (MEDV - Median home value) statistics:")
print(f"Mean: ${df_subset['MEDV'].mean():,.0f}k")
print(f"Median: ${df_subset['MEDV'].median():,.0f}k")
print(f"Min: ${df_subset['MEDV'].min():,.0f}k")
print(f"Max: ${df_subset['MEDV'].max():,.0f}k")

print("\nSample of the data:")
print(df_subset.head())

Target variable (MEDV - Median home value) statistics:
Mean: $23k
Median: $21k
Min: $5k
Max: $50k

Sample of the data:
      CRIM    ZN  INDUS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO  \
0  0.00632  18.0   2.31  0.538  6.575  65.2  4.0900    1  296.0     15.3   
1  0.02731   0.0   7.07  0.469  6.421  78.9  4.9671    2  242.0     17.8   
2  0.02729   0.0   7.07  0.469  7.185  61.1  4.9671    2  242.0     17.8   
3  0.03237   0.0   2.18  0.458  6.998  45.8  6.0622    3  222.0     18.7   
4  0.06905   0.0   2.18  0.458  7.147  54.2  6.0622    3  222.0     18.7   

   LSTAT  MEDV  
0   4.98  24.0  
1   9.14  21.6  
2   4.03  34.7  
3   2.94  33.4  
4   5.33  36.2  

Prepare Features for Embedding

# Separate features from target
df_features = df_subset.drop(columns=['MEDV'])
print(f"Features for embedding: {df_features.columns.tolist()}")

# Check data types
print(f"\nData types:")
print(df_features.dtypes)

# Check for any categorical columns (should be none for Boston housing)
cat_cols = df_features.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    print(f"\nCategorical column cardinalities:")
    for col in cat_cols:
        print(f"  {col}: {df_features[col].nunique()} unique values")
else:
    print(f"\nAll features are numeric (no categorical columns)")

Features for embedding: ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']

Data types:
CRIM       float64
ZN         float64
INDUS      float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX        float64
PTRATIO    float64
LSTAT      float64
dtype: object

All features are numeric (no categorical columns)

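Unsupervised House Embeddings

Generate an 8-dimensional embedding for each row. Each row of the Boston Housing data describes a census tract rather than a single house; this example calls the rows houses for short: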

# Generate 8D embeddings for houses
house_embeddings = learn_embedding(
    df_features,
    mode="unsupervised",
    embedding_dim=8,
    max_epochs=40,
    batch_size=64,
    dropout_rate=0.2,
    hidden_units=256,
    verbose=False,
    seed=42
)

print(f"House embeddings shape: {house_embeddings.shape}")
print("\nEmbedding statistics:")
print(house_embeddings.describe().round(3))

Model: "functional"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 256)            │         3,072 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 256)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 8)              │         2,056 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 256)            │         2,304 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 256)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 11)             │         2,827 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 10,259 (40.07 KB)

 Trainable params: 10,259 (40.07 KB)

 Non-trainable params: 0 (0.00 B)

House embeddings shape: (506, 8)

Embedding statistics:
       embedding_0  embedding_1  embedding_2  embedding_3  embedding_4  \
count      506.000      506.000      506.000      506.000      506.000   
mean        -0.214        0.105       -0.408        0.042       -0.061   
std          0.804        0.832        0.767        1.260        0.926   
min         -2.581       -2.101       -2.726       -3.906       -2.505   
25%         -0.785       -0.332       -0.923       -0.850       -0.639   
50%         -0.251        0.066       -0.408        0.181        0.052   
75%          0.293        0.616        0.131        1.074        0.528   
max          2.220        4.167        3.160        3.110        4.960   

       embedding_5  embedding_6  embedding_7  
count      506.000      506.000      506.000  
mean         0.036        0.014        0.016  
std          1.647        1.008        0.877  
min         -2.863       -4.309       -1.751  
25%         -1.832       -0.443       -0.697  
50%          0.501        0.002       -0.149  
75%          1.159        0.547        0.591  
max          4.014        3.149        2.713  
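
The parameter counts in the model summary above can be checked by hand. The sketch below assumes the standard fully connected layer count (inputs × units weights plus one bias per unit) and 4-byte float32 weights for the size estimate; the layer widths are read off the summary, and the helper name `dense_params` is made up for this illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Parameter count of a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths from the 8D model summary: 11 -> 256 -> 8 -> 256 -> 11.
# Dropout layers add no parameters, so they are left out.
widths = [11, 256, 8, 256, 11]
layer_counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total_params = sum(layer_counts)

# Size estimate, assuming each parameter is stored as a 4-byte float32
size_kb = total_params * 4 / 1024

print(layer_counts)
print(total_params, round(size_kb, 2))
```

The same formula accounts for the smaller models further down the page, which use 128 hidden units where `hidden_units` is not passed.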

Visualize House Embeddings

import matplotlib.pyplot as plt

# Create 2D embeddings for visualization
house_embeddings_2d = learn_embedding(
    df_features,
    mode="unsupervised",
    embedding_dim=2,
    max_epochs=30,
    batch_size=64,
    verbose=False,
    seed=42
)

# Create price quartile categories (not used by the plots below, which color by the continuous MEDV value)
price_quartiles = df_subset['MEDV'].quantile([0.25, 0.5, 0.75])
price_categories = pd.cut(
    df_subset['MEDV'],
    bins=[0, price_quartiles[0.25], price_quartiles[0.5], price_quartiles[0.75], float('inf')],
    labels=['Low', 'Medium-Low', 'Medium-High', 'High']
)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Colored by price
scatter1 = axes[0].scatter(
    house_embeddings_2d.iloc[:, 0],
    house_embeddings_2d.iloc[:, 1],
    c=df_subset['MEDV'],
    cmap='viridis',
    alpha=0.6,
    s=20
)
axes[0].set_xlabel('Embedding Dimension 0')
axes[0].set_ylabel('Embedding Dimension 1')
axes[0].set_title('House Embeddings Colored by Median Value')
plt.colorbar(scatter1, ax=axes[0], label='Median Value ($k)')

# Plot 2: Colored by crime rate
scatter2 = axes[1].scatter(
    house_embeddings_2d.iloc[:, 0],
    house_embeddings_2d.iloc[:, 1],
    c=df_subset['CRIM'],
    cmap='coolwarm',
    alpha=0.6,
    s=20
)
axes[1].set_xlabel('Embedding Dimension 0')
axes[1].set_ylabel('Embedding Dimension 1')
axes[1].set_title('House Embeddings Colored by Crime Rate')
plt.colorbar(scatter2, ax=axes[1], label='Crime Rate')

plt.tight_layout()
plt.show()

print("Notice how expensive/high-quality houses cluster together!")

Model: "functional_2"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_1 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │         1,536 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 128)            │           384 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 11)             │         1,419 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 3,597 (14.05 KB)

 Trainable params: 3,597 (14.05 KB)

 Non-trainable params: 0 (0.00 B)

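[Figure: 2D house embeddings in two scatter panels, "House Embeddings Colored by Median Value" and "House Embeddings Colored by Crime Rate"; axes are Embedding Dimension 0 and Embedding Dimension 1]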

Notice how expensive/high-quality houses cluster together!

Categorical Zone Embeddings

Create zone categories based on accessibility to radial highways:

# Create accessibility zones based on RAD (radial highway access)
df_subset['AccessZone'] = pd.cut(
    df_subset['RAD'],
    bins=[0, 5, 10, 25],
    labels=['Low', 'Medium', 'High']
)

# Learn zone embeddings
zone_embeddings = learn_embedding(
    df_subset,
    mode="target",
    reference_column="AccessZone",
    embedding_dim=2,
    max_epochs=40,
    batch_size=128,
    verbose=False,
    seed=42
)

# Label the rows with the category names (this assumes one row per category, returned in the order Low, Medium, High)
zone_embeddings.index = ['Low', 'Medium', 'High']

print(f"Number of access zones: {len(zone_embeddings)}")
print("\nAccess zone embeddings (2D):")
print(zone_embeddings.round(3))

Model: "functional_4"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_2 (InputLayer)      │ (None, 12)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 128)            │         1,664 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_4 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 3)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 1,931 (7.54 KB)

 Trainable params: 1,931 (7.54 KB)

 Non-trainable params: 0 (0.00 B)

Number of access zones: 3

Access zone embeddings (2D):
        embedding_0  embedding_1
Low           3.889       -0.614
Medium        3.207        0.560
High         -5.042       -4.838
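
The zones above come from `pd.cut`, which by default builds right-closed intervals: a value equal to a bin edge falls into the bin that ends at that edge, and anything outside the outer edges gets no label. A minimal pure-Python sketch of that rule follows; the helper `cut_right_closed` and the sample RAD values are made up for illustration and are not part of pandas or Row2Vec.

```python
from bisect import bisect_left

def cut_right_closed(value, edges, labels):
    """Label of the interval (edges[i], edges[i+1]] containing value, or None if outside."""
    i = bisect_left(edges, value) - 1
    if 0 <= i < len(labels):
        return labels[i]
    return None

# Same edges and labels as the AccessZone binning above
edges = [0, 5, 10, 25]
labels = ['Low', 'Medium', 'High']

# Illustrative RAD values, including one that sits exactly on a bin edge (5)
zones = {rad: cut_right_closed(rad, edges, labels) for rad in [1, 5, 6, 8, 24]}
print(zones)
```

With these edges, a RAD of 5 is still "Low" and a RAD of 6 is the first "Medium" value.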

Analyze Zone Relationships

# Calculate average median value by access zone for comparison
zone_prices = df_subset.groupby('AccessZone')['MEDV'].agg(['mean', 'count']).round(1)
zone_prices.columns = ['Avg_Value', 'House_Count']
zone_prices = zone_prices.sort_values('Avg_Value', ascending=False)

print("Access zones by average median value:")
print(zone_prices)

Access zones by average median value:
            Avg_Value  House_Count
AccessZone                        
Medium           25.9           67
Low              24.4          307
High             16.4          132

# Visualize zone embeddings with price information
# Since we only have 2D embeddings, we can plot them directly
zones = zone_embeddings.index.tolist()
zone_avg_values = [zone_prices.loc[z, 'Avg_Value'] for z in zones]

plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    zone_embeddings.iloc[:, 0],
    zone_embeddings.iloc[:, 1],
    c=zone_avg_values,
    cmap='viridis',
    s=200,
    alpha=0.8
)

# Label all zones
for i, zone in enumerate(zones):
    plt.annotate(
        f'{zone}\n(${zone_avg_values[i]:.1f}k)',
        (zone_embeddings.iloc[i, 0], zone_embeddings.iloc[i, 1]),
        xytext=(10, 10),
        textcoords='offset points',
        fontsize=12,
        fontweight='bold',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7)
    )

plt.xlabel('Embedding Dimension 0')
plt.ylabel('Embedding Dimension 1')
plt.title('Access Zone Embeddings (All zones labeled with avg values)')
plt.colorbar(scatter, label='Average Median Value ($k)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Zones with similar accessibility patterns cluster in embedding space!")

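[Figure: "Access Zone Embeddings (All zones labeled with avg values)", one labeled point per zone colored by average median value ($k); axes are Embedding Dimension 0 and Embedding Dimension 1]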

Zones with similar accessibility patterns cluster in embedding space!
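
One way to put a number on the clustering claim is to measure pairwise Euclidean distances between the zone embeddings. The sketch below copies the three 2D vectors from the printed output above, so it reflects that single training run (seed 42) rather than a general property of the method.

```python
from itertools import combinations
from math import dist

# 2D zone embeddings copied from the printed output above
zone_points = {
    'Low': (3.889, -0.614),
    'Medium': (3.207, 0.560),
    'High': (-5.042, -4.838),
}

# Euclidean distance for every unordered pair of zones
pairwise = {
    (a, b): dist(zone_points[a], zone_points[b])
    for a, b in combinations(zone_points, 2)
}

for (a, b), d in pairwise.items():
    print(f"{a:>6} - {b:<6}: {d:.2f}")
```

Low and Medium end up much closer to each other than either is to High, which matches the price table: High is also the zone with the lowest average median value.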

Age Category Embeddings

Create age categories from AGE, the proportion of owner-occupied units built prior to 1940:

# Create age categories based on AGE (proportion of owner-occupied units built prior to 1940)
df_subset['AgeCategory'] = pd.cut(
    df_subset['AGE'],
    bins=[0, 30, 70, 100],
    labels=['New', 'Medium', 'Old']
)

# Learn age category embeddings (df_subset now also contains AccessZone, so the model sees 15 inputs instead of 12)
age_embeddings = learn_embedding(
    df_subset,
    mode="target",
    reference_column="AgeCategory",
    embedding_dim=2,
    max_epochs=30,
    verbose=False,
    seed=42
)

# Set proper index with category names
age_embeddings.index = ['New', 'Medium', 'Old']

print("Age category embeddings:")
print(age_embeddings.round(3))

# Compare with actual values
age_values = df_subset.groupby('AgeCategory')['MEDV'].agg(['mean', 'count'])
age_values.columns = ['Avg_Value', 'Count']
print(f"\nAge categories by average median value:")
print(age_values.sort_values('Avg_Value', ascending=False).round(1))

Model: "functional_6"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_3 (InputLayer)      │ (None, 15)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 128)            │         2,048 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_5 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 3)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 2,315 (9.04 KB)

 Trainable params: 2,315 (9.04 KB)

 Non-trainable params: 0 (0.00 B)

Age category embeddings:
        embedding_0  embedding_1
New           4.992        4.203
Medium        0.689        3.404
Old          -5.824       -0.799

Age categories by average median value:
             Avg_Value  Count
AgeCategory                  
New               27.4     64
Medium            25.6    155
Old               19.8    287

Use Embeddings for Price Prediction

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = house_embeddings  # 8D embeddings as features
y = df_subset['MEDV']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"House median value prediction using 8D embeddings:")
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.1f}k")
print(f"Mean Absolute Error: ${np.mean(np.abs(y_test - y_pred)):,.1f}k")

# Feature importance (though these are embedding dimensions, not original features)
print(f"\nEmbedding dimension importance:")
for i, importance in enumerate(rf.feature_importances_):
    print(f"  Dimension {i}: {importance:.3f}")

House median value prediction using 8D embeddings:
R² Score: 0.788
RMSE: $3.9k
Mean Absolute Error: $2.5k

Embedding dimension importance:
  Dimension 0: 0.076
  Dimension 1: 0.039
  Dimension 2: 0.043
  Dimension 3: 0.356
  Dimension 4: 0.022
  Dimension 5: 0.038
  Dimension 6: 0.376
  Dimension 7: 0.051
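
For readers who want to see what the three reported metrics measure, here is a from-scratch sketch of R², RMSE, and mean absolute error using their standard definitions. The function name `regression_metrics` and the four-value toy input are made up for illustration; the numbers are not taken from the housing data.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)                # squared error of the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # squared error of predicting the mean
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Toy values for illustration only
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

RMSE and MAE are in the units of the target, so for MEDV they read as thousands of dollars; R² is unitless and compares the model against always predicting the mean.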

Compare Prediction Performance

from sklearn.preprocessing import StandardScaler, LabelEncoder

# Compare embeddings vs raw features
# Prepare raw features manually
df_raw = df_features.copy()

# Label encode categorical columns (if any)
le_dict = {}
if len(cat_cols) > 0:
    for col in cat_cols:
        le = LabelEncoder()
        df_raw[col] = le.fit_transform(df_raw[col])
        le_dict[col] = le

# Scale raw features
scaler = StandardScaler()
X_raw_scaled = scaler.fit_transform(df_raw)

# Split raw features
X_raw_train, X_raw_test, y_train_raw, y_test_raw = train_test_split(
    X_raw_scaled, y, test_size=0.2, random_state=42
)

# Train on raw features
rf_raw = RandomForestRegressor(n_estimators=100, random_state=42)
rf_raw.fit(X_raw_train, y_train_raw)
y_pred_raw = rf_raw.predict(X_raw_test)

# Compare performance
r2_raw = r2_score(y_test_raw, y_pred_raw)
rmse_raw = np.sqrt(mean_squared_error(y_test_raw, y_pred_raw))

print("Performance Comparison:")
print("-" * 40)
print(f"Embeddings (8D):     R² = {r2:.3f}, RMSE = ${rmse:,.1f}k")
print(f"Raw Features ({X_raw_scaled.shape[1]}D):  R² = {r2_raw:.3f}, RMSE = ${rmse_raw:,.1f}k")
print(f"\nDimensionality reduction: {X_raw_scaled.shape[1]} → {X.shape[1]} ({(1-X.shape[1]/X_raw_scaled.shape[1]):.1%} reduction)")

Performance Comparison:
----------------------------------------
Embeddings (8D):     R² = 0.788, RMSE = $3.9k
Raw Features (11D):  R² = 0.887, RMSE = $2.9k

Dimensionality reduction: 11 → 8 (27.3% reduction)
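
The comparison above can be summarized as a trade-off between size and accuracy. The short calculation below uses only the rounded figures printed in the comparison, so the percentages are approximate.

```python
# Rounded figures copied from the performance comparison above
emb = {"dims": 8, "r2": 0.788, "rmse": 3.9}
raw = {"dims": 11, "r2": 0.887, "rmse": 2.9}

dim_reduction = 1 - emb["dims"] / raw["dims"]   # share of dimensions removed
r2_drop = raw["r2"] - emb["r2"]                 # absolute loss in R²
rmse_increase = emb["rmse"] / raw["rmse"] - 1   # relative growth in RMSE

print(f"Dimensions removed: {dim_reduction:.1%}")
print(f"R² drop:            {r2_drop:.3f}")
print(f"RMSE increase:      {rmse_increase:.1%}")
```

On this split, dropping about a quarter of the dimensions costs roughly 0.1 of R² and raises RMSE by about a third, so the embeddings trade a noticeable amount of accuracy for compactness here.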

Classical Methods Comparison

# Compare neural embeddings with classical methods
methods = {
    "Neural": {"mode": "unsupervised", "max_epochs": 20},
    "PCA": {"mode": "pca"}
}

# Use smaller sample for faster execution (Boston housing has 506 rows)
sample_size = 400
df_sample = df_features.sample(n=sample_size, random_state=42)
y_sample = y.loc[df_sample.index]

comparison_results = {}
for name, params in methods.items():
    print(f"Training {name}...")

    # Generate embeddings
    emb = learn_embedding(
        df_sample,
        embedding_dim=8,
        verbose=False,
        seed=42,
        **params
    )

    # Train and evaluate
    X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
        emb, y_sample, test_size=0.2, random_state=42
    )

    rf_comp = RandomForestRegressor(n_estimators=50, random_state=42)
    rf_comp.fit(X_train_comp, y_train_comp)
    y_pred_comp = rf_comp.predict(X_test_comp)

    r2_comp = r2_score(y_test_comp, y_pred_comp)
    rmse_comp = np.sqrt(mean_squared_error(y_test_comp, y_pred_comp))

    comparison_results[name] = {"r2": r2_comp, "rmse": rmse_comp}

print(f"\nMethod comparison (sample of {sample_size} houses):")
print("-" * 50)
for method, results in comparison_results.items():
    print(f"{method:8}: R² = {results['r2']:.3f}, RMSE = ${results['rmse']:,.1f}k")

Model: "functional_8"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 128)            │         1,536 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_6 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 8)              │         1,032 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 128)            │         1,152 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_7 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_12 (Dense)                │ (None, 11)             │         1,419 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 5,139 (20.07 KB)

 Trainable params: 5,139 (20.07 KB)

 Non-trainable params: 0 (0.00 B)

Training Neural...

Training PCA...

Method comparison (sample of 400 houses):
--------------------------------------------------
Neural  : R² = 0.806, RMSE = $3.7k
PCA     : R² = 0.829, RMSE = $3.5k

Production Pipeline

from row2vec import train_and_save_model
import tempfile
import os

# Create production model
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "housing_model")

    embeddings_final, script_path, binary_path = train_and_save_model(
        df_features,
        base_path=model_path,
        embedding_dim=10,
        mode="unsupervised",
        max_epochs=50,
        batch_size=128,
        dropout_rate=0.2,
        hidden_units=512,
        scale_method="standard",
        verbose=False,
        seed=42
    )

    print(f"Housing model saved: {os.path.basename(script_path)}")

    # Demonstrate model loading and usage
    from row2vec import load_model
    model = load_model(script_path)

    # Apply the loaded model to a sample of rows (drawn from the training data here, standing in for new data)
    test_houses = df_features.sample(n=50, random_state=999)
    test_embeddings = model.predict(test_houses)

    print(f"\nModel applied to {len(test_houses)} test houses")
    print(f"Generated embeddings shape: {test_embeddings.shape}")
    print(f"Training metadata:")
    print(f"  Epochs trained: {model.metadata.epochs_trained}")
    print(f"  Final loss: {model.metadata.final_loss if model.metadata.final_loss is not None else 'N/A'}")
    print(f"  Training time: {model.metadata.training_time:.2f}s")

Model: "functional_10"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_5 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 512)            │         6,144 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_8 (Dropout)             │ (None, 512)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 10)             │         5,130 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 512)            │         5,632 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_9 (Dropout)             │ (None, 512)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_15 (Dense)                │ (None, 11)             │         5,643 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 22,549 (88.08 KB)

 Trainable params: 22,549 (88.08 KB)

 Non-trainable params: 0 (0.00 B)

Housing model saved: housing_model.py

Model applied to 50 test houses
Generated embeddings shape: (50, 10)
Training metadata:
  Epochs trained: 50
  Final loss: N/A
  Training time: 7.68s
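
The production call above passes `scale_method="standard"`. Assuming this refers to the usual z-score standardization (an assumption; Row2Vec's internals are not shown on this page), each column is shifted to zero mean and scaled to unit variance using statistics from the training data, and those same statistics are reused for any new rows. The sketch below shows the idea for one column in plain Python; the helper names and the two "new" RM values are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply a previously fitted z-score transform."""
    return [(v - mu) / sigma for v in values]

# RM values from the first five rows shown earlier, used here as a tiny training set
rm_train = [6.575, 6.421, 7.185, 6.998, 7.147]
mu, sigma = fit_standardizer(rm_train)
rm_train_scaled = standardize(rm_train, mu, sigma)

# Hypothetical new values: scaled with the training statistics, not refitted
rm_new_scaled = standardize([6.0, 7.5], mu, sigma)

print([round(v, 3) for v in rm_train_scaled])
print([round(v, 3) for v in rm_new_scaled])
```

Reusing the fitted statistics is what lets a saved model treat new rows consistently with the rows it was trained on.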

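Key Insights

Continuous Features: Row2Vec embeds this all-numeric housing data directly, with no categorical columns to encode
Urban Patterns: In the access zone embeddings, Low and Medium sit close together while High, the zone with the lowest average median value, sits far from both
Dimensionality Reduction: 11 features → 8 embedding dimensions (27.3% reduction), at a cost in accuracy: R² drops from 0.887 to 0.788 and RMSE rises from $2.9k to $3.9k
Predictive Power: A random forest on the 8D embeddings reaches R² = 0.788 for median value; on the 400-row sample, PCA (R² = 0.829) slightly outperforms the neural embedding (R² = 0.806)
Feature Relationships: In the age category embeddings, Old, the category with the lowest average median value, lies farthest from New and Medium; coloring the 2D house embeddings by crime rate offers a visual check for similar structure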

Next Steps

Learn about Advanced Features like architecture search
Explore the CLI Guide for processing large real estate datasets
Check the API Reference for complete parameter details