Data Formatting
This guide shows how to structure your data so it works with elwood-spatial.
The package expects a long-format DataFrame and a spatial network dict.
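If your measurements start out in wide format (one column per sensor), they need to be reshaped first. The following is a minimal sketch using pandas melt, assuming a hypothetical wide table whose column names are the sensor ids; the output columns id, timestamp, and value are chosen to match the defaults used in this guide.

```python
import pandas as pd

# Hypothetical wide-format table: one row per timestamp, one column per sensor.
wide = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-07-15 00:00", "2024-07-15 01:00"]),
    "abc123": [42, 45],
    "def234": [38, 41],
})

# Reshape to long format: one row per (sensor, timestamp) pair.
long_df = (
    wide.melt(id_vars=["timestamp"], var_name="id", value_name="value")
        .sort_values(["id", "timestamp"])
        .reset_index(drop=True)
)[["id", "timestamp", "value"]]

print(long_df)
```

Sorting by id and timestamp is optional here; it only makes the reshaped table easier to inspect.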
1. The DataFrame
For batch processing with detect_outliers_batch, your data should be a long-format DataFrame with one row per (sensor, timestamp) pair:
import pandas as pd
# Long-format: one row per (sensor, timestamp)
df = pd.DataFrame([
    {"id": "abc123", "timestamp": "2024-07-15 00:00", "value": 42},
    {"id": "abc123", "timestamp": "2024-07-15 01:00", "value": 45},
    {"id": "def234", "timestamp": "2024-07-15 00:00", "value": 38},
    {"id": "def234", "timestamp": "2024-07-15 01:00", "value": 41},
    # ...
])
df["timestamp"] = pd.to_datetime(df["timestamp"])
#        id         timestamp  value
# 0  abc123  2024-07-15 00:00     42
# 1  abc123  2024-07-15 01:00     45
# 2  def234  2024-07-15 00:00     38
# 3  def234  2024-07-15 01:00     41
The default column names are id, timestamp, and value, but you can override them:
import elwood_spatial as es
# bins, network, and params are covered in the sections below
result = es.detect_outliers_batch(
    df, bins, network, params,
    id_column="sensor_id",
    time_column="time",
    value_column="aqi"
)
2. Building a Network
If your sensors have geographic coordinates, build the network from a GeoDataFrame:
import geopandas as gpd
from shapely.geometry import Point
import elwood_spatial as es
# GeoDataFrame with sensor locations as lon/lat points (geographic CRS, EPSG:4326)
gdf = gpd.GeoDataFrame([
    {"id": "abc123", "geometry": Point(-122.4, 47.6)},
    {"id": "def234", "geometry": Point(-122.3, 47.6)},
    {"id": "ghi345", "geometry": Point(-122.4, 47.5)},
    {"id": "jkl456", "geometry": Point(-122.3, 47.5)},
], crs="EPSG:4326")
# Build with a 50 km threshold; distances are computed after projecting to projected_crs
network = es.build_network(
    gdf, threshold=50_000, id_column="id",
    projected_crs="EPSG:26910"  # UTM Zone 10N
)
Manual Network
If you already know which sensors are neighbors, you can build the network dict directly:
network = {
    "abc123": {"neighbors": ["def234", "ghi345"], "weights": [1.0, 0.9]},
    "def234": {"neighbors": ["abc123", "ghi345"], "weights": [1.0, 0.8]},
    "ghi345": {"neighbors": ["abc123", "def234"], "weights": [0.9, 0.8]},
}
3. Defining Bins
Bins discretize your continuous measurements into categories. You can define them from tuples or from dicts with labels:
import elwood_spatial as es
# From tuples: simple range definitions
temp_bins = es.BinSpec.from_tuples([
    (-10, 0), (1, 10), (11, 20), (21, 30), (31, 45)
])
# From dicts: with human-readable labels
aqi_bins = es.BinSpec.from_dicts([
    {"cat": "Good", "range": [0, 50]},
    {"cat": "Moderate", "range": [51, 100]},
    {"cat": "USG", "range": [101, 150]},
    {"cat": "Unhealthy", "range": [151, 200]},
])
# Or use the built-in AQI presets:
from elwood_spatial.air_quality import AQI_BINS, AQI_MODIFIED_BINS
4. Putting It Together
import elwood_spatial as es
from elwood_spatial.air_quality import AQI_MODIFIED_BINS
# Detect outliers across the whole time series
result = es.detect_outliers_batch(
    df,
    bins=AQI_MODIFIED_BINS,
    network=network,
    params=es.PARAMS_OPERATIONAL
)
# Check which sensor-hours were flagged
outlier_rows = result[result["is_outlier"]]
print(f"Flagged {len(outlier_rows)} out of {len(result)} rows")
print(outlier_rows[["id", "timestamp", "value", "information", "entropy", "bin_deviation"]])
The returned DataFrame has the original columns plus: bin_index, is_outlier, information, entropy, bin_deviation.
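To see which sensors are flagged most often, the result can be aggregated per sensor. The following is a minimal sketch that relies only on the id and is_outlier columns; the small result frame is made-up stand-in data, not real output from detect_outliers_batch.

```python
import pandas as pd

# Stand-in for the DataFrame returned by detect_outliers_batch (made-up values).
result = pd.DataFrame({
    "id": ["abc123", "abc123", "def234", "def234"],
    "timestamp": pd.to_datetime(["2024-07-15 00:00", "2024-07-15 01:00"] * 2),
    "value": [42, 45, 38, 190],
    "is_outlier": [False, False, False, True],
})

# Count flagged rows and the flagged fraction for each sensor.
summary = (
    result.groupby("id")["is_outlier"]
          .agg(n_flagged="sum", n_rows="count", flagged_fraction="mean")
          .sort_values("n_flagged", ascending=False)
)
print(summary)
```

A sensor with a persistently high flagged fraction may point to a faulty device rather than a real local event, so this table is a reasonable first thing to inspect.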