OpenStreetMap Data in Parquet Format
Efficiently managing and analyzing OpenStreetMap data
Working with OpenStreetMap (OSM) data can be challenging, especially when dealing with global datasets rather than regional exports. Traditional tools such as osmium or osm2pgsql simplify some aspects of data handling but often fall short in areas such as processing complex way geometries, which can be resource-intensive. Furthermore, these tools don't always integrate seamlessly with modern data science ecosystems such as Apache Spark, Polars, or DuckDB.
To address these challenges, we've transformed the native OSM XML-based data into a highly optimized Parquet format and made it available in S3-compatible object storage. This approach significantly improves compatibility with leading data lake technologies and popular data science software, streamlining your analytical workflows.
Accessing the Data
Option 1: Object Storage (S3-compatible)
Accessing our OSM Parquet datasets directly from S3-compatible object storage is straightforward. You can connect using the following details:
- Endpoint URL: https://object-store.geo-lake.com
- Bucket name: data-lakehouse
For those using the AWS CLI, you can list the available Parquet files with this command:
aws s3 ls --endpoint-url https://object-store.geo-lake.com --no-sign-request s3://data-lakehouse/bronze/osm/parquet/
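If you'd rather explore the bucket from Python, here is a minimal sketch of an equivalent anonymous listing with boto3 (the prefix mirrors the CLI example above):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client pointed at the S3-compatible endpoint
s3 = boto3.client(
    's3',
    endpoint_url='https://object-store.geo-lake.com',
    config=Config(signature_version=UNSIGNED),
)

# List the Parquet files under the OSM prefix
response = s3.list_objects_v2(Bucket='data-lakehouse', Prefix='bronze/osm/parquet/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])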
Option 2: Direct File Downloads
Alternatively, if you prefer not to access the object storage directly, you can download individual Parquet files. Below you'll find a list of download links for OSM nodes, ways (including ways with geometries), and relations.
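As a rough sketch of that route, the snippet below downloads a single file over HTTPS and inspects it with pyarrow; the file name is hypothetical, so substitute one of the actual links listed below.

import urllib.request
import pyarrow.parquet as pq

# Hypothetical file URL for illustration -- replace it with a real link from the list below
url = ('https://object-store.geo-lake.com/data-lakehouse/'
       'bronze/osm/parquet/node/version=2025-05-17T14:31:37Z/part-00000.parquet')
local_path = 'osm_nodes_part-00000.parquet'

# Download the file and print its schema
urllib.request.urlretrieve(url, local_path)
print(pq.read_table(local_path).schema)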
Usage Examples
This section demonstrates how to query these OSM Parquet files with popular data science libraries and shows how easily they slot into typical analytical workflows.
DuckDB
Here's an example demonstrating how to query OSM node data using DuckDB:
-- Enable reading remote files over HTTP(S)/S3
INSTALL httpfs;
LOAD httpfs;

-- Point DuckDB at the S3-compatible endpoint (path-style URLs, no credentials)
SET s3_region='us-east-1';
SET s3_url_style='path';
SET s3_endpoint='object-store.geo-lake.com';

-- Look up a single node by its OSM id
SELECT *
FROM read_parquet('s3://data-lakehouse/bronze/osm/parquet/node/version=2025-05-17T14:31:37Z/*.parquet')
WHERE id = 2480654035;
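The same query can also be run from Python with the duckdb package; this sketch simply wraps the SQL shown above:

import duckdb

con = duckdb.connect()

# Same configuration as the SQL example above
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")
con.execute("SET s3_url_style='path';")
con.execute("SET s3_endpoint='object-store.geo-lake.com';")

# Fetch a single node by its OSM id into a pandas DataFrame
df = con.execute("""
    SELECT *
    FROM read_parquet('s3://data-lakehouse/bronze/osm/parquet/node/version=2025-05-17T14:31:37Z/*.parquet')
    WHERE id = 2480654035
""").df()
print(df)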
Polars (Python)
This Polars (Python) snippet illustrates how to efficiently load and collect OSM relation data directly from the S3 storage into a DataFrame:
import polars as pl

# Connection settings for the S3-compatible endpoint (anonymous access)
storage_options = {
    'aws_region': 'us-east-1',
    'aws_endpoint_url': 'https://object-store.geo-lake.com',
    'skip_signature': 'true',
}

# Lazily scan the relation Parquet files, then collect into a DataFrame
df = pl.scan_parquet(
    's3://data-lakehouse/bronze/osm/parquet/relation/version=2025-05-17T14:31:37Z/*.parquet',
    storage_options=storage_options,
).collect()
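Because scan_parquet builds a lazy query, filters applied before collect() are pushed down so that only matching row groups are fetched from object storage. Here is a small sketch, assuming the relation files expose the same id column as the node example above (the id value is purely illustrative):

import polars as pl

storage_options = {
    'aws_region': 'us-east-1',
    'aws_endpoint_url': 'https://object-store.geo-lake.com',
    'skip_signature': 'true',
}

# Filter before collect(): the predicate is pushed down to the Parquet scan
relation = (
    pl.scan_parquet(
        's3://data-lakehouse/bronze/osm/parquet/relation/version=2025-05-17T14:31:37Z/*.parquet',
        storage_options=storage_options,
    )
    .filter(pl.col('id') == 62422)  # illustrative relation id
    .collect()
)
print(relation)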