Data Formatting

This guide shows how to structure your data so it works with elwood-spatial. The package expects three inputs: a long-format DataFrame, a spatial network dict, and a bin specification.

1. The DataFrame

For batch processing with detect_outliers_batch, your data should be a long-format DataFrame with one row per (sensor, timestamp) pair:

import pandas as pd

# Long-format: one row per (sensor, timestamp)
df = pd.DataFrame([
    {"id": "abc123", "timestamp": "2024-07-15 00:00", "value": 42},
    {"id": "abc123", "timestamp": "2024-07-15 01:00", "value": 45},
    {"id": "def234", "timestamp": "2024-07-15 00:00", "value": 38},
    {"id": "def234", "timestamp": "2024-07-15 01:00", "value": 41},
    # ...
])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# print(df) after the datetime conversion:
#        id           timestamp  value
# 0  abc123 2024-07-15 00:00:00     42
# 1  abc123 2024-07-15 01:00:00     45
# 2  def234 2024-07-15 00:00:00     38
# 3  def234 2024-07-15 01:00:00     41
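
If your data arrives in wide format (one row per timestamp, one column per sensor), it can be reshaped with pandas before use. The sketch below uses DataFrame.melt; the wide table and its values are made up for illustration and mirror the sample rows above.

```python
import pandas as pd

# Hypothetical wide-format input: one row per timestamp, one column per sensor
wide = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-07-15 00:00", "2024-07-15 01:00"]),
    "abc123": [42, 45],
    "def234": [38, 41],
})

# Reshape to long format: one row per (sensor, timestamp)
long_df = wide.melt(id_vars="timestamp", var_name="id", value_name="value")

# Order rows by sensor, then time, and put columns in the default order
long_df = (
    long_df.sort_values(["id", "timestamp"])
    .reset_index(drop=True)[["id", "timestamp", "value"]]
)
print(long_df)
```

The resulting frame uses the default column names id, timestamp, and value.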

The default column names are id, timestamp, and value, but you can override them:

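import elwood_spatial as es

# bins, network, and params are built as described in the sections below
result = es.detect_outliers_batch(
    df, bins, network, params,
    id_column="sensor_id",
    time_column="time",
    value_column="aqi"
)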
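
Before running a batch, it can help to confirm that the frame really is long-format under the column names you plan to pass. The helper below is a hypothetical pre-flight check, not part of elwood-spatial. It assumes the time column should hold datetimes (as the pd.to_datetime conversion earlier suggests) and that each (sensor, timestamp) pair should appear only once.

```python
import pandas as pd

def check_long_format(df, id_column="id", time_column="timestamp", value_column="value"):
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    missing = [c for c in (id_column, time_column, value_column) if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if not pd.api.types.is_datetime64_any_dtype(df[time_column]):
        problems.append(f"{time_column} is not a datetime column")
    n_dupes = int(df.duplicated(subset=[id_column, time_column]).sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate ({id_column}, {time_column}) rows")
    return problems

# Made-up sample with custom column names and one repeated (sensor, time) pair
sample = pd.DataFrame({
    "sensor_id": ["abc123", "abc123", "abc123"],
    "time": pd.to_datetime(["2024-07-15 00:00", "2024-07-15 01:00", "2024-07-15 01:00"]),
    "aqi": [42, 45, 46],
})

print(check_long_format(sample, "sensor_id", "time", "aqi"))
# ['1 duplicate (sensor_id, time) rows']
print(check_long_format(sample))
# ["missing columns: ['id', 'timestamp', 'value']"]
```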

2. Building a Network

If your sensors have geographic coordinates, build the network from a GeoDataFrame:

import geopandas as gpd
from shapely.geometry import Point
import elwood_spatial as es

# GeoDataFrame with sensor locations as lon/lat points (EPSG:4326, a geographic CRS);
# distances are computed in a projected CRS, supplied to build_network below
gdf = gpd.GeoDataFrame([
    {"id": "abc123", "geometry": Point(-122.4, 47.6)},
    {"id": "def234", "geometry": Point(-122.3, 47.6)},
    {"id": "ghi345", "geometry": Point(-122.4, 47.5)},
    {"id": "jkl456", "geometry": Point(-122.3, 47.5)},
], crs="EPSG:4326")

# Build with a 50 km threshold (50_000 m), projecting to projected_crs for distance calculation
network = es.build_network(
    gdf, threshold=50_000, id_column="id",
    projected_crs="EPSG:26910"  # UTM Zone 10N
)
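
To sanity-check a threshold before building the network, pairwise distances can be estimated straight from the lon/lat coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km. It is an independent check using only the standard library, and is not necessarily how build_network measures distance, since the call above works in a projected CRS.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed spherical Earth)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Same (lon, lat) coordinates as the GeoDataFrame above
sensors = {
    "abc123": (-122.4, 47.6),
    "def234": (-122.3, 47.6),
    "ghi345": (-122.4, 47.5),
    "jkl456": (-122.3, 47.5),
}

threshold_m = 50_000
distances = {
    (a, b): haversine_m(*sensors[a], *sensors[b])
    for a, b in combinations(sensors, 2)
}
within = {pair: d for pair, d in distances.items() if d <= threshold_m}

for (a, b), d in sorted(within.items(), key=lambda item: item[1]):
    print(f"{a} <-> {b}: {d / 1000:.1f} km")
```

For these four points every pair is well under 50 km, so under this estimate each sensor would have the other three as candidates within the threshold.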

Manual Network

If you already know which sensors are neighbors, you can build the network dict directly:

network = {
    "abc123": {"neighbors": ["def234", "ghi345"], "weights": [1.0, 0.9]},
    "def234": {"neighbors": ["abc123", "ghi345"], "weights": [1.0, 0.8]},
    "ghi345": {"neighbors": ["abc123", "def234"], "weights": [0.9, 0.8]},
}
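
A hand-written network is easy to get subtly wrong, so a quick consistency check may be worthwhile. The function below is a hypothetical helper, not part of elwood-spatial. It checks that each neighbors list lines up with its weights list, that every neighbor is itself a key, and that neighbor links are reciprocal. Whether the library requires all of these properties is not stated here, so treat them as conservative assumptions.

```python
def check_network(network):
    """Return a list of inconsistencies found in a manually built network dict."""
    problems = []
    for sensor, entry in network.items():
        neighbors, weights = entry["neighbors"], entry["weights"]
        if len(neighbors) != len(weights):
            problems.append(f"{sensor}: {len(neighbors)} neighbors but {len(weights)} weights")
        if sensor in neighbors:
            problems.append(f"{sensor}: lists itself as a neighbor")
        for other in neighbors:
            if other not in network:
                problems.append(f"{sensor}: unknown neighbor {other}")
            elif sensor not in network[other]["neighbors"]:
                problems.append(f"{sensor}: {other} does not list it back")
    return problems

network = {
    "abc123": {"neighbors": ["def234", "ghi345"], "weights": [1.0, 0.9]},
    "def234": {"neighbors": ["abc123", "ghi345"], "weights": [1.0, 0.8]},
    "ghi345": {"neighbors": ["abc123", "def234"], "weights": [0.9, 0.8]},
}
print(check_network(network))  # []

# A deliberately broken variant: one weight missing, one neighbor undefined
broken = {
    "abc123": {"neighbors": ["def234", "zzz999"], "weights": [1.0]},
    "def234": {"neighbors": ["abc123"], "weights": [1.0]},
}
for problem in check_network(broken):
    print(problem)
```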

3. Defining Bins

Bins discretize your continuous measurements into categories. You can define them from tuples or from dicts with labels:

import elwood_spatial as es

# From tuples — simple range definitions
temp_bins = es.BinSpec.from_tuples([
    (-10, 0), (1, 10), (11, 20), (21, 30), (31, 45)
])

# From dicts — with human-readable labels
aqi_bins = es.BinSpec.from_dicts([
    {"cat": "Good",      "range": [0, 50]},
    {"cat": "Moderate",  "range": [51, 100]},
    {"cat": "USG",       "range": [101, 150]},
    {"cat": "Unhealthy", "range": [151, 200]},
])

# Or use the built-in AQI presets:
from elwood_spatial.air_quality import AQI_BINS, AQI_MODIFIED_BINS
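
To make the idea of discretization concrete, the sketch below maps a value to the index of the range that contains it. It is plain Python and does not use BinSpec. It assumes both bounds of each range are inclusive, which may differ from how the library treats boundaries and gaps.

```python
def bin_index(value, ranges):
    """Return the index of the first (low, high) range containing value, else None."""
    for i, (low, high) in enumerate(ranges):
        if low <= value <= high:  # assumption: both bounds inclusive
            return i
    return None

# Same ranges as the temp_bins tuples above
temp_ranges = [(-10, 0), (1, 10), (11, 20), (21, 30), (31, 45)]

print(bin_index(15, temp_ranges))    # 2
print(bin_index(-10, temp_ranges))   # 0
print(bin_index(0.5, temp_ranges))   # None
print(bin_index(50, temp_ranges))    # None
```

Under this inclusive-bounds assumption, a fractional reading such as 0.5 falls between (-10, 0) and (1, 10) and matches no range, so it is worth confirming how your bin definitions handle non-integer values.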

4. Putting It Together

import elwood_spatial as es
from elwood_spatial.air_quality import AQI_MODIFIED_BINS

# Detect outliers across the whole time series
result = es.detect_outliers_batch(
    df,
    bins=AQI_MODIFIED_BINS,
    network=network,
    params=es.PARAMS_OPERATIONAL
)

# Check which sensor-hours were flagged
outlier_rows = result[result["is_outlier"]]
print(f"Flagged {len(outlier_rows)} out of {len(result)} rows")
print(outlier_rows[["id", "timestamp", "value", "information", "entropy", "bin_deviation"]])

The returned DataFrame has the original columns plus: bin_index, is_outlier, information, entropy, bin_deviation.
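
Once rows are flagged, a per-sensor summary often shows whether outliers are concentrated on a few devices. The sketch below uses a made-up stand-in for the returned DataFrame, containing only the id, timestamp, value, and is_outlier columns, and summarizes it with a pandas groupby.

```python
import pandas as pd

# Made-up stand-in for the result frame; values and flags are illustrative only
result = pd.DataFrame({
    "id": ["abc123", "abc123", "def234", "def234"],
    "timestamp": pd.to_datetime([
        "2024-07-15 00:00", "2024-07-15 01:00",
        "2024-07-15 00:00", "2024-07-15 01:00",
    ]),
    "value": [42, 180, 38, 41],
    "is_outlier": [False, True, False, False],
})

# Count flagged rows and total rows per sensor, then compute the flagged share
summary = (
    result.groupby("id")["is_outlier"]
    .agg(flagged="sum", total="count")
    .assign(share=lambda s: s["flagged"] / s["total"])
    .sort_values("share", ascending=False)
)
print(summary)
```

Sensors with a high share of flagged rows are candidates for a closer look, for example to tell a faulty device apart from a genuine local event.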