Examples
Basic Usage
import pandas as pd
from atlas_profiler import process_dataset

# Load data into DataFrame
df = pd.read_csv("geospatial_data.csv")

# Process with all features enabled
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
    plots=True,
    include_sample=True,
)

# Access spatial coverage
spatial_coverage = metadata.get('spatial_coverage', {})
print(f"Bounding box: {spatial_coverage.get('bbox')}")
Getting Started with NYC Traffic Data
from atlas_profiler import process_dataset
import pandas as pd

# Load sample NYC traffic volume data
nyc_traffic_volume_sample_df = pd.read_csv(
    "hf://datasets/oscur/automated-traffic-volume-counts-sample/sample_traffic.csv"
)
nyc_traffic_volume_sample_df.head()

# Profile the dataset with geo classifier enabled
profiled_columns = process_dataset(
    nyc_traffic_volume_sample_df,
    coverage=False,
    indexes=False,
    geo_classifier=True,
    load_max_size=1000000,
)
profiled_columns
Output Structure
The process_dataset function returns a dictionary of metadata about the profiled dataset. The main fields, as they appear in actual output, are listed below.
Dataset-level Information:
- nb_rows: Total number of rows in the dataset
- nb_profiled_rows: Number of rows actually profiled (may be less if sampling was used)
- nb_columns: Total number of columns
- nb_spatial_columns: Number of columns identified as spatial
- nb_categorical_columns: Number of columns identified as categorical
- nb_numerical_columns: Number of columns identified as numerical
- types: List of dataset-level type categories present (e.g., ["categorical", "numerical", "spatial"])
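As a minimal sketch of how these dataset-level fields might be read, the helper below builds a one-line summary. The `summarize_dataset` function and the `sample_metadata` values are hypothetical and written by hand for illustration; they are not real profiler output, and only the key names come from the list above.

```python
from typing import Any

# Hypothetical sample shaped like the dataset-level fields described above.
# The values are invented for illustration, not real profiler output.
sample_metadata: dict[str, Any] = {
    "nb_rows": 1000,
    "nb_profiled_rows": 800,
    "nb_columns": 5,
    "nb_spatial_columns": 2,
    "nb_categorical_columns": 1,
    "nb_numerical_columns": 2,
    "types": ["categorical", "numerical", "spatial"],
}


def summarize_dataset(metadata: dict[str, Any]) -> str:
    """Build a one-line summary from the dataset-level fields."""
    nb_rows = metadata.get("nb_rows", 0)
    # Fall back to nb_rows when nb_profiled_rows is absent.
    nb_profiled = metadata.get("nb_profiled_rows", nb_rows)
    # Fewer profiled rows than total rows suggests sampling was used.
    suffix = " (sampled)" if nb_profiled < nb_rows else ""
    nb_columns = metadata.get("nb_columns", 0)
    nb_spatial = metadata.get("nb_spatial_columns", 0)
    types = ", ".join(metadata.get("types", []))
    return (
        f"{nb_profiled}/{nb_rows} rows profiled{suffix}, "
        f"{nb_columns} columns, {nb_spatial} spatial; types: {types}"
    )


summary = summarize_dataset(sample_metadata)
print(summary)
```

Using `.get()` with defaults keeps the helper tolerant of missing keys, which seems prudent since the exact set of fields may vary with the options passed to process_dataset.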
Column-level Information:
- columns: List of column metadata dictionaries, each containing:
  - name: Column name
  - structural_type: Detected structural type using schema.org URLs (e.g., "http://schema.org/Integer", "http://schema.org/Text", "http://schema.org/GeoCoordinates")
  - semantic_types: List of semantic type annotations using schema.org URLs (e.g., ["http://schema.org/identifier"], ["http://schema.org/AdministrativeArea"], ["http://schema.org/address"])
  - unclean_values_ratio: Ratio of unclean/missing values (0.0 to 1.0)
  - num_distinct_values: Number of unique values in the column
  - geo_classifier (optional): Geo classifier results for spatial columns, containing:
    - label: Predicted spatial label (e.g., "borough", "point", "address")
    - confidence: Confidence score (0.0 to 1.0)
    - source: Source of the classification (e.g., "ml+validated", "ml")
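The column-level fields can be filtered in plain Python. The sketch below collects the columns whose optional geo_classifier entry meets a confidence cutoff. The `confident_geo_columns` helper, the column names, and all values in `sample_metadata` are hypothetical; only the key names follow the structure described above.

```python
from typing import Any

# Hypothetical column metadata shaped like the structure described above.
# Names and values are invented for illustration, not real profiler output.
sample_metadata: dict[str, Any] = {
    "columns": [
        {
            "name": "boro",
            "structural_type": "http://schema.org/Text",
            "semantic_types": ["http://schema.org/AdministrativeArea"],
            "unclean_values_ratio": 0.0,
            "num_distinct_values": 5,
            "geo_classifier": {
                "label": "borough",
                "confidence": 0.97,
                "source": "ml+validated",
            },
        },
        {
            "name": "street",
            "structural_type": "http://schema.org/Text",
            "semantic_types": ["http://schema.org/address"],
            "unclean_values_ratio": 0.02,
            "num_distinct_values": 120,
            "geo_classifier": {
                "label": "address",
                "confidence": 0.42,
                "source": "ml",
            },
        },
        {
            # No geo_classifier key: the field is optional.
            "name": "vol",
            "structural_type": "http://schema.org/Integer",
            "semantic_types": [],
            "unclean_values_ratio": 0.1,
            "num_distinct_values": 250,
        },
    ]
}


def confident_geo_columns(
    metadata: dict[str, Any], min_confidence: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return (name, label, confidence) for columns at or above the cutoff."""
    results = []
    for column in metadata.get("columns", []):
        geo = column.get("geo_classifier")
        if geo is None:
            continue  # Column was not classified as spatial.
        confidence = geo.get("confidence", 0.0)
        if confidence >= min_confidence:
            results.append((column["name"], geo.get("label"), confidence))
    return results


print(confident_geo_columns(sample_metadata))
```

The default cutoff of 0.5 is an arbitrary choice for this sketch; a post-hoc filter like this is independent of whatever threshold the profiler applied internally.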
Additional Metadata:
- attribute_keywords: List of keywords extracted from all column names
- _profiling_times: Performance timing information:
  - steps: Dictionary of timing for each profiling step (in seconds)
  - total: Total profiling time (in seconds)
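The timing information can be used to see where profiling time goes. The sketch below ranks steps by duration and reports each step's share of the total. It assumes the timing key is spelled `_profiling_times`; the `slowest_steps` helper, the step names, and the timings in `sample_metadata` are hypothetical and invented for illustration.

```python
from typing import Any

# Hypothetical timing metadata; step names and values are invented,
# and the "_profiling_times" key spelling is an assumption.
sample_metadata: dict[str, Any] = {
    "attribute_keywords": ["boro", "street", "vol"],
    "_profiling_times": {
        "steps": {"load": 0.5, "types": 1.0, "geo_classifier": 2.5},
        "total": 4.0,
    },
}


def slowest_steps(
    metadata: dict[str, Any], top: int = 2
) -> list[tuple[str, float, float]]:
    """Return (step, seconds, share_of_total) for the slowest steps."""
    times = metadata.get("_profiling_times", {})
    steps = times.get("steps", {})
    # Use the reported total when present, else the sum of the steps.
    total = times.get("total") or sum(steps.values())
    ranked = sorted(steps.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, seconds, seconds / total if total else 0.0)
        for name, seconds in ranked[:top]
    ]


for name, seconds, share in slowest_steps(sample_metadata):
    print(f"{name}: {seconds:.2f}s ({share:.0%} of total)")
```

The reported total may include overhead not attributed to any single step, so the shares need not sum to 100% on real output.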