Boston Housing Dataset Example#
This example applies Row2Vec to the Boston Housing dataset, which consists of continuous features and a regression target.
Load and Explore Data#
# Run the minimal output-suppression helper first
exec(open('suppress_minimal.py').read())
import pandas as pd
import numpy as np
from row2vec import learn_embedding
import os
# Load the Boston Housing dataset (the file is named ames_housing.csv but contains the Boston data)
data_path = os.path.join('..', 'data', 'ames_housing.csv')
# Boston housing dataset column names
column_names = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'
]
df = pd.read_csv(data_path, header=None, names=column_names)
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"Total columns: {len(df.columns)}")
✓ Enhanced minimal suppression active
Dataset shape: (506, 14)
Column names: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
Total columns: 14
Data Overview#
# Focus on a manageable subset of important features
important_cols = [
'CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV'
]
df_subset = df[important_cols].copy()
print(f"Working with {len(important_cols)} key features:")
print(df_subset.columns.tolist())
print(f"\nDataset shape: {df_subset.shape}")
print(f"\nMissing values:")
missing = df_subset.isnull().sum()
print(missing[missing > 0] if missing.any() else "No missing values")
Working with 12 key features:
['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
Dataset shape: (506, 12)
Missing values:
No missing values
# Basic statistics
print("Target variable (MEDV - Median home value) statistics:")
print(f"Mean: ${df_subset['MEDV'].mean():,.0f}k")
print(f"Median: ${df_subset['MEDV'].median():,.0f}k")
print(f"Min: ${df_subset['MEDV'].min():,.0f}k")
print(f"Max: ${df_subset['MEDV'].max():,.0f}k")
print("\nSample of the data:")
print(df_subset.head())
Target variable (MEDV - Median home value) statistics:
Mean: $23k
Median: $21k
Min: $5k
Max: $50k
Sample of the data:
CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO \
0 0.00632 18.0 2.31 0.538 6.575 65.2 4.0900 1 296.0 15.3
1 0.02731 0.0 7.07 0.469 6.421 78.9 4.9671 2 242.0 17.8
2 0.02729 0.0 7.07 0.469 7.185 61.1 4.9671 2 242.0 17.8
3 0.03237 0.0 2.18 0.458 6.998 45.8 6.0622 3 222.0 18.7
4 0.06905 0.0 2.18 0.458 7.147 54.2 6.0622 3 222.0 18.7
LSTAT MEDV
0 4.98 24.0
1 9.14 21.6
2 4.03 34.7
3 2.94 33.4
4 5.33 36.2
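Before embedding, it is worth a quick sanity check of how strongly each feature relates to the target. A minimal sketch using only pandas (already imported above):
# Rank features by absolute Pearson correlation with MEDV
correlations = (
    df_subset.corr(numeric_only=True)['MEDV']
    .drop('MEDV')
    .abs()
    .sort_values(ascending=False)
)
print("Features ranked by |correlation| with MEDV:")
print(correlations.round(3))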
Prepare Features for Embedding#
# Separate features from target
df_features = df_subset.drop(columns=['MEDV'])
print(f"Features for embedding: {df_features.columns.tolist()}")
# Check data types
print(f"\nData types:")
print(df_features.dtypes)
# Check for any categorical columns (should be none for Boston housing)
cat_cols = df_features.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    print(f"\nCategorical column cardinalities:")
    for col in cat_cols:
        print(f"  {col}: {df_features[col].nunique()} unique values")
else:
    print(f"\nAll features are numeric (no categorical columns)")
Features for embedding: ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']
Data types:
CRIM float64
ZN float64
INDUS float64
NOX float64
RM float64
AGE float64
DIS float64
RAD int64
TAX float64
PTRATIO float64
LSTAT float64
dtype: object
All features are numeric (no categorical columns)
Unsupervised House Embeddings#
Generate embeddings for each house:
# Generate 8D embeddings for houses
house_embeddings = learn_embedding(
df_features,
mode="unsupervised",
embedding_dim=8,
max_epochs=40,
batch_size=64,
dropout_rate=0.2,
hidden_units=256,
verbose=False,
seed=42
)
print(f"House embeddings shape: {house_embeddings.shape}")
print("\nEmbedding statistics:")
print(house_embeddings.describe().round(3))
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 256)            │         3,072 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 256)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 8)              │         2,056 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 256)            │         2,304 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 256)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 11)             │         2,827 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 10,259 (40.07 KB)
Trainable params: 10,259 (40.07 KB)
Non-trainable params: 0 (0.00 B)
House embeddings shape: (506, 8)
Embedding statistics:
embedding_0 embedding_1 embedding_2 embedding_3 embedding_4 \
count 506.000 506.000 506.000 506.000 506.000
mean -0.214 0.105 -0.408 0.042 -0.061
std 0.804 0.832 0.767 1.260 0.926
min -2.581 -2.101 -2.726 -3.906 -2.505
25% -0.785 -0.332 -0.923 -0.850 -0.639
50% -0.251 0.066 -0.408 0.181 0.052
75% 0.293 0.616 0.131 1.074 0.528
max 2.220 4.167 3.160 3.110 4.960
embedding_5 embedding_6 embedding_7
count 506.000 506.000 506.000
mean 0.036 0.014 0.016
std 1.647 1.008 0.877
min -2.863 -4.309 -1.751
25% -1.832 -0.443 -0.697
50% 0.501 0.002 -0.149
75% 1.159 0.547 0.591
max 4.014 3.149 2.713
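One immediate use of the 8D embeddings is similarity search: houses that are close in embedding space should have similar overall profiles. A hedged sketch using scikit-learn's NearestNeighbors (not part of the Row2Vec API):
from sklearn.neighbors import NearestNeighbors
# Find the 3 houses most similar to house 0 in embedding space
nn = NearestNeighbors(n_neighbors=4)  # 4 because the query house is its own nearest neighbor
nn.fit(house_embeddings)
distances, indices = nn.kneighbors(house_embeddings.iloc[[0]])
print(f"Houses most similar to house 0: {indices[0][1:].tolist()}")
print(df_subset.iloc[indices[0]][['RM', 'LSTAT', 'MEDV']])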
Visualize House Embeddings#
import matplotlib.pyplot as plt
# Create 2D embeddings for visualization
house_embeddings_2d = learn_embedding(
df_features,
mode="unsupervised",
embedding_dim=2,
max_epochs=30,
batch_size=64,
verbose=False,
seed=42
)
# Create price categories for coloring
price_quartiles = df_subset['MEDV'].quantile([0.25, 0.5, 0.75])
price_categories = pd.cut(
df_subset['MEDV'],
bins=[0, price_quartiles[0.25], price_quartiles[0.5], price_quartiles[0.75], float('inf')],
labels=['Low', 'Medium-Low', 'Medium-High', 'High']
)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Plot 1: Colored by price
scatter1 = axes[0].scatter(
house_embeddings_2d.iloc[:, 0],
house_embeddings_2d.iloc[:, 1],
c=df_subset['MEDV'],
cmap='viridis',
alpha=0.6,
s=20
)
axes[0].set_xlabel('Embedding Dimension 0')
axes[0].set_ylabel('Embedding Dimension 1')
axes[0].set_title('House Embeddings Colored by Median Value')
plt.colorbar(scatter1, ax=axes[0], label='Median Value ($k)')
# Plot 2: Colored by crime rate
scatter2 = axes[1].scatter(
house_embeddings_2d.iloc[:, 0],
house_embeddings_2d.iloc[:, 1],
c=df_subset['CRIM'],
cmap='coolwarm',
alpha=0.6,
s=20
)
axes[1].set_xlabel('Embedding Dimension 0')
axes[1].set_ylabel('Embedding Dimension 1')
axes[1].set_title('House Embeddings Colored by Crime Rate')
plt.colorbar(scatter2, ax=axes[1], label='Crime Rate')
plt.tight_layout()
plt.show()
print("Notice how expensive/high-quality houses cluster together!")
Model: "functional_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_1 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │         1,536 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 128)            │           384 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 11)             │         1,419 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,597 (14.05 KB)
Trainable params: 3,597 (14.05 KB)
Non-trainable params: 0 (0.00 B)
Notice how expensive/high-quality houses cluster together!
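To get a feel for what the two visualized dimensions encode, you can correlate them against the raw columns. A minimal sketch (assuming the embedding_0/embedding_1 column names shown in the statistics above):
# Correlate each 2D embedding dimension with the original features
combined = pd.concat([house_embeddings_2d, df_subset[important_cols]], axis=1)
corr = combined.corr(numeric_only=True)
print(corr.loc[['embedding_0', 'embedding_1'], important_cols].round(2))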
Categorical Zone Embeddings#
Create zone categories based on accessibility to radial highways:
# Create accessibility zones based on RAD (radial highway access)
df_subset['AccessZone'] = pd.cut(
df_subset['RAD'],
bins=[0, 5, 10, 25],
labels=['Low', 'Medium', 'High']
)
# Learn zone embeddings
zone_embeddings = learn_embedding(
df_subset,
mode="target",
reference_column="AccessZone",
embedding_dim=2,
max_epochs=40,
batch_size=128,
verbose=False,
seed=42
)
# Attach category names as the index (assumes rows follow the categorical level order: Low, Medium, High)
zone_embeddings.index = ['Low', 'Medium', 'High']
print(f"Number of access zones: {len(zone_embeddings)}")
print("\nAccess zone embeddings (2D):")
print(zone_embeddings.round(3))
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_2 (InputLayer)      │ (None, 12)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 128)            │         1,664 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_4 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 3)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,931 (7.54 KB)
Trainable params: 1,931 (7.54 KB)
Non-trainable params: 0 (0.00 B)
Number of access zones: 3
Access zone embeddings (2D):
embedding_0 embedding_1
Low 3.889 -0.614
Medium 3.207 0.560
High -5.042 -4.838
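The printed vectors already suggest that the Low and Medium zones sit close together while High sits far away. Pairwise distances make this explicit; a small sketch using scipy (installed alongside scikit-learn):
from scipy.spatial.distance import pdist, squareform
# Pairwise Euclidean distances between the zone embedding vectors
dist = squareform(pdist(zone_embeddings.values))
print(pd.DataFrame(dist, index=zone_embeddings.index, columns=zone_embeddings.index).round(2))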
Analyze Zone Relationships#
# Calculate average median value by access zone for comparison
zone_prices = df_subset.groupby('AccessZone')['MEDV'].agg(['mean', 'count']).round(1)
zone_prices.columns = ['Avg_Value', 'House_Count']
zone_prices = zone_prices.sort_values('Avg_Value', ascending=False)
print("Access zones by average median value:")
print(zone_prices)
Access zones by average median value:
Avg_Value House_Count
AccessZone
Medium 25.9 67
Low 24.4 307
High 16.4 132
# Visualize zone embeddings with price information
# Since we only have 2D embeddings, we can plot them directly
zones = zone_embeddings.index.tolist()
zone_avg_values = [zone_prices.loc[z, 'Avg_Value'] for z in zones]
plt.figure(figsize=(10, 8))
scatter = plt.scatter(
zone_embeddings.iloc[:, 0],
zone_embeddings.iloc[:, 1],
c=zone_avg_values,
cmap='viridis',
s=200,
alpha=0.8
)
# Label all zones
for i, zone in enumerate(zones):
    plt.annotate(
        f'{zone}\n(${zone_avg_values[i]:.1f}k)',
        (zone_embeddings.iloc[i, 0], zone_embeddings.iloc[i, 1]),
        xytext=(10, 10),
        textcoords='offset points',
        fontsize=12,
        fontweight='bold',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7)
    )
plt.xlabel('Embedding Dimension 0')
plt.ylabel('Embedding Dimension 1')
plt.title('Access Zone Embeddings (All zones labeled with avg values)')
plt.colorbar(scatter, label='Average Median Value ($k)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Zones with similar accessibility patterns cluster in embedding space!")
Zones with similar accessibility patterns cluster in embedding space!
Age Category Embeddings#
Create age categories from the AGE feature:
# Create age categories based on AGE (proportion of owner-occupied units built prior to 1940)
df_subset['AgeCategory'] = pd.cut(
df_subset['AGE'],
bins=[0, 30, 70, 100],
labels=['New', 'Medium', 'Old']
)
# Learn age category embeddings
age_embeddings = learn_embedding(
df_subset,
mode="target",
reference_column="AgeCategory",
embedding_dim=2,
max_epochs=30,
verbose=False,
seed=42
)
# Attach category names as the index (same row-ordering assumption as for the access zones)
age_embeddings.index = ['New', 'Medium', 'Old']
print("Age category embeddings:")
print(age_embeddings.round(3))
# Compare with actual values
age_values = df_subset.groupby('AgeCategory')['MEDV'].agg(['mean', 'count'])
age_values.columns = ['Avg_Value', 'Count']
print(f"\nAge categories by average median value:")
print(age_values.sort_values('Avg_Value', ascending=False).round(1))
Model: "functional_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_3 (InputLayer)      │ (None, 15)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 128)            │         2,048 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_5 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 2)              │           258 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 3)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 2,315 (9.04 KB)
Trainable params: 2,315 (9.04 KB)
Non-trainable params: 0 (0.00 B)
Age category embeddings:
embedding_0 embedding_1
New 4.992 4.203
Medium 0.689 3.404
Old -5.824 -0.799
Age categories by average median value:
Avg_Value Count
AgeCategory
New 27.4 64
Medium 25.6 155
Old 19.8 287
Use Embeddings for Price Prediction#
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Prepare data
X = house_embeddings # 8D embeddings as features
y = df_subset['MEDV']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict
y_pred = rf.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"House median value prediction using 8D embeddings:")
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.1f}k")
print(f"Mean Absolute Error: ${np.mean(np.abs(y_test - y_pred)):,.1f}k")
# Feature importance (though these are embedding dimensions, not original features)
print(f"\nEmbedding dimension importance:")
for i, importance in enumerate(rf.feature_importances_):
    print(f"  Dimension {i}: {importance:.3f}")
House median value prediction using 8D embeddings:
R² Score: 0.788
RMSE: $3.9k
Mean Absolute Error: $2.5k
Embedding dimension importance:
Dimension 0: 0.076
Dimension 1: 0.039
Dimension 2: 0.043
Dimension 3: 0.356
Dimension 4: 0.022
Dimension 5: 0.038
Dimension 6: 0.376
Dimension 7: 0.051
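Dimensions 3 and 6 carry most of the predictive signal. To interpret what they encode, correlate them with the raw features — a hedged sketch, assuming the embedding_* column names used above:
# Which raw features do the two dominant embedding dimensions track?
for dim in ['embedding_3', 'embedding_6']:
    corrs = df_features.corrwith(house_embeddings[dim]).abs().sort_values(ascending=False)
    print(f"\n{dim} - top correlated features:")
    print(corrs.head(3).round(2))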
Compare Prediction Performance#
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Compare embeddings vs raw features
# Prepare raw features manually
df_raw = df_features.copy()
# Label encode categorical columns (if any)
le_dict = {}
if len(cat_cols) > 0:
    for col in cat_cols:
        le = LabelEncoder()
        df_raw[col] = le.fit_transform(df_raw[col])
        le_dict[col] = le
# Scale raw features
scaler = StandardScaler()
X_raw_scaled = scaler.fit_transform(df_raw)
# Split raw features
X_raw_train, X_raw_test, y_train_raw, y_test_raw = train_test_split(
X_raw_scaled, y, test_size=0.2, random_state=42
)
# Train on raw features
rf_raw = RandomForestRegressor(n_estimators=100, random_state=42)
rf_raw.fit(X_raw_train, y_train_raw)
y_pred_raw = rf_raw.predict(X_raw_test)
# Compare performance
r2_raw = r2_score(y_test_raw, y_pred_raw)
rmse_raw = np.sqrt(mean_squared_error(y_test_raw, y_pred_raw))
print("Performance Comparison:")
print("-" * 40)
print(f"Embeddings (8D): R² = {r2:.3f}, RMSE = ${rmse:,.1f}k")
print(f"Raw Features ({X_raw_scaled.shape[1]}D): R² = {r2_raw:.3f}, RMSE = ${rmse_raw:,.1f}k")
print(f"\nDimensionality reduction: {X_raw_scaled.shape[1]} → {X.shape[1]} ({(1-X.shape[1]/X_raw_scaled.shape[1]):.1%} reduction)")
Performance Comparison:
----------------------------------------
Embeddings (8D): R² = 0.788, RMSE = $3.9k
Raw Features (11D): R² = 0.887, RMSE = $2.9k
Dimensionality reduction: 11 → 8 (27.3% reduction)
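A single 80/20 split is noisy on 506 rows, so the gap above should be read cautiously. Cross-validation gives a steadier comparison; a minimal sketch (note the embeddings were trained on all rows, so this is a rough check rather than a leak-free evaluation):
from sklearn.model_selection import cross_val_score
# 5-fold CV R² for embeddings vs. scaled raw features
for label, features in [("Embeddings (8D)", X), ("Raw features (11D)", X_raw_scaled)]:
    scores = cross_val_score(
        RandomForestRegressor(n_estimators=100, random_state=42),
        features, y, cv=5, scoring='r2'
    )
    print(f"{label}: R² = {scores.mean():.3f} ± {scores.std():.3f}")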
Classical Methods Comparison#
# Compare neural embeddings with classical methods
methods = {
"Neural": {"mode": "unsupervised", "max_epochs": 20},
"PCA": {"mode": "pca"}
}
# Use a 400-row sample for faster execution (the full dataset has 506 rows)
sample_size = 400
df_sample = df_features.sample(n=sample_size, random_state=42)
y_sample = y.loc[df_sample.index]
comparison_results = {}
for name, params in methods.items():
    print(f"Training {name}...")
    # Generate embeddings
    emb = learn_embedding(
        df_sample,
        embedding_dim=8,
        verbose=False,
        seed=42,
        **params
    )
    # Train and evaluate
    X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
        emb, y_sample, test_size=0.2, random_state=42
    )
    rf_comp = RandomForestRegressor(n_estimators=50, random_state=42)
    rf_comp.fit(X_train_comp, y_train_comp)
    y_pred_comp = rf_comp.predict(X_test_comp)
    r2_comp = r2_score(y_test_comp, y_pred_comp)
    rmse_comp = np.sqrt(mean_squared_error(y_test_comp, y_pred_comp))
    comparison_results[name] = {"r2": r2_comp, "rmse": rmse_comp}
print(f"\nMethod comparison (sample of {sample_size} houses):")
print("-" * 50)
for method, results in comparison_results.items():
    print(f"{method:8}: R² = {results['r2']:.3f}, RMSE = ${results['rmse']:,.1f}k")
Model: "functional_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 128)            │         1,536 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_6 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 8)              │         1,032 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 128)            │         1,152 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_7 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_12 (Dense)                │ (None, 11)             │         1,419 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 5,139 (20.07 KB)
Trainable params: 5,139 (20.07 KB)
Non-trainable params: 0 (0.00 B)
Training Neural...
Training PCA...
Method comparison (sample of 400 houses):
--------------------------------------------------
Neural : R² = 0.806, RMSE = $3.7k
PCA : R² = 0.829, RMSE = $3.5k
Production Pipeline#
from row2vec import train_and_save_model
import tempfile
import os
# Create production model
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "housing_model")
    embeddings_final, script_path, binary_path = train_and_save_model(
        df_features,
        base_path=model_path,
        embedding_dim=10,
        mode="unsupervised",
        max_epochs=50,
        batch_size=128,
        dropout_rate=0.2,
        hidden_units=512,
        scale_method="standard",
        verbose=False,
        seed=42
    )
    print(f"Housing model saved: {os.path.basename(script_path)}")
    # Demonstrate model loading and usage
    from row2vec import load_model
    model = load_model(script_path)
    # Test on new data
    test_houses = df_features.sample(n=50, random_state=999)
    test_embeddings = model.predict(test_houses)
    print(f"\nModel applied to {len(test_houses)} test houses")
    print(f"Generated embeddings shape: {test_embeddings.shape}")
    print(f"Training metadata:")
    print(f"  Epochs trained: {model.metadata.epochs_trained}")
    print(f"  Final loss: {model.metadata.final_loss if model.metadata.final_loss is not None else 'N/A'}")
    print(f"  Training time: {model.metadata.training_time:.2f}s")
Model: "functional_10"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_5 (InputLayer)      │ (None, 11)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 512)            │         6,144 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_8 (Dropout)             │ (None, 512)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embedding (Dense)               │ (None, 10)             │         5,130 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 512)            │         5,632 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_9 (Dropout)             │ (None, 512)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_15 (Dense)                │ (None, 11)             │         5,643 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 22,549 (88.08 KB)
Trainable params: 22,549 (88.08 KB)
Non-trainable params: 0 (0.00 B)
Housing model saved: housing_model.py
Model applied to 50 test houses
Generated embeddings shape: (50, 10)
Training metadata:
Epochs trained: 50
Final loss: N/A
Training time: 7.68s
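In a downstream pipeline, the generated embeddings are typically joined back onto the source rows before being written to a table or feature store. A hedged sketch in plain pandas (assumes predict() preserves input row order):
# Wrap predictions in a DataFrame keyed by the source rows' index
emb_df = pd.DataFrame(
    np.asarray(test_embeddings),
    index=test_houses.index,
    columns=[f"emb_{i}" for i in range(test_embeddings.shape[1])],
)
enriched = test_houses.join(emb_df)
print(enriched.head())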
Key Insights#
Continuous Features: Row2Vec effectively captures patterns in continuous housing data
Urban Patterns: Properties with similar accessibility and socioeconomic factors cluster together
Dimensionality Reduction: 11 features → 8 embeddings at a modest performance cost (R² 0.887 → 0.788 in the raw-vs-embedding comparison)
Predictive Power: Embeddings alone reach R² ≈ 0.79 for median value prediction
Feature Relationships: Crime rates, accessibility, and property age show meaningful embedding patterns
Next Steps#
Learn about Advanced Features like architecture search
Explore the CLI Guide for processing large real estate datasets
Check the API Reference for complete parameter details