docs/getting-started/nyc-taxi/README.md
Get started with TimescaleDB using New York City taxi trip data. This example demonstrates how to handle high-volume transportation data with location-based analytics and time-series aggregations.
What you'll need:
- The `psql` PostgreSQL client
- Familiarity with basic SQL (the queries below also use TimescaleDB's `time_bucket()` function)

You have two options to start TimescaleDB:
The easiest way to get started:
Important: This script is intended for local development and testing only. Do not use it for production deployments. For production-ready installation options, see the TimescaleDB installation guide.
Linux/Mac:
curl -sL https://tsdb.co/start-local | sh
This command downloads a setup script and starts a local TimescaleDB instance in Docker.
Alternatively, you can run TimescaleDB directly with Docker:
docker run -d --name timescaledb \
-p 6543:5432 \
-e POSTGRES_PASSWORD=password \
timescale/timescaledb-ha:pg18
Note: We use port 6543 (mapped to container port 5432) to avoid conflicts if you have other PostgreSQL instances running on the standard port 5432.
Wait about 1-2 minutes for TimescaleDB to download & initialize.
Connect using psql:
psql -h localhost -p 6543 -U postgres
# When prompted, enter password: password
You should see the PostgreSQL prompt. Verify TimescaleDB is installed:
SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';
Expected output:
extname | extversion
-------------+------------
timescaledb | 2.x.x
Prefer a GUI? If you'd rather use a graphical tool instead of the command line, you can download pgAdmin and connect to TimescaleDB using the same connection details (host: localhost, port: 6543, user: postgres, password: password).
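You can also connect from application code. Below is a minimal sketch that builds a libpq-style connection string matching the quick-start details; the commented-out driver calls assume psycopg2, but any PostgreSQL driver that accepts a DSN works:

```python
# Build a libpq-style DSN matching the Docker quick-start above.
# Adjust host/port/user/password if you changed the container mapping.
def build_dsn(host="localhost", port=6543, user="postgres",
              password="password", dbname="postgres"):
    return (f"host={host} port={port} user={user} "
            f"password={password} dbname={dbname}")

dsn = build_dsn()

# With psycopg2 installed, you would then connect like:
#   import psycopg2
#   conn = psycopg2.connect(dsn)
#   cur = conn.cursor()
#   cur.execute("SELECT extversion FROM pg_extension "
#               "WHERE extname = 'timescaledb'")
```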
Create the optimized hypertable by running this SQL in your psql session:
-- Create the hypertable with optimal settings for NYC Taxi data
-- This automatically enables columnstore for fast analytical queries
CREATE TABLE trips (
vendor_id TEXT,
pickup_boroname VARCHAR,
pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
passenger_count NUMERIC,
trip_distance NUMERIC,
pickup_longitude NUMERIC,
pickup_latitude NUMERIC,
rate_code INTEGER,
dropoff_longitude NUMERIC,
dropoff_latitude NUMERIC,
payment_type VARCHAR,
fare_amount NUMERIC,
extra NUMERIC,
mta_tax NUMERIC,
tip_amount NUMERIC,
tolls_amount NUMERIC,
improvement_surcharge NUMERIC,
total_amount NUMERIC
) WITH (
tsdb.hypertable,
tsdb.partition_column='pickup_datetime',
tsdb.enable_columnstore=true,
tsdb.segmentby='pickup_boroname',
tsdb.orderby='pickup_datetime DESC'
);
-- Create indexes
CREATE INDEX idx_trips_pickup_time ON trips (pickup_datetime DESC);
CREATE INDEX idx_trips_borough_time ON trips (pickup_boroname, pickup_datetime DESC);
This creates a trips table with:
- Time partitioning on pickup_datetime
- Columnstore compression segmented by pickup_boroname for optimal compression (6 boroughs)

First, download and decompress the sample data:
# Download the sample data
wget https://assets.timescale.com/timescaledb-datasets/nyc_taxi_sample_nov_dec_2015.csv.gz
# Decompress the CSV file
gunzip nyc_taxi_sample_nov_dec_2015.csv.gz
# This will create nyc_taxi_sample_nov_dec_2015.csv ready for loading
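Before loading, it can help to sanity-check the CSV structure. Here is a small sketch using Python's csv module; the inline sample and its column names are assumptions for illustration — verify them against the actual file header:

```python
import csv
import io

# Inline stand-in for the first lines of the real file. In practice you
# would use open("nyc_taxi_sample_nov_dec_2015.csv") instead.
sample = io.StringIO(
    "vendor_id,pickup_boroname,pickup_datetime,fare_amount\n"
    "1,Manhattan,2015-11-01 00:01:00,12.50\n"
)

reader = csv.DictReader(sample)
first = next(reader)
# Note: csv yields strings; PostgreSQL's COPY casts them to the
# column types (TIMESTAMP, NUMERIC, ...) during the load.
```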
We provide two approaches for loading data. Choose based on your needs:
Option 1: direct columnstore copy. This approach writes data directly to the columnstore, bypassing the rowstore entirely, so you get analytical performance immediately.
From command line:
psql -h localhost -p 6543 -U postgres \
-v ON_ERROR_STOP=1 \
-c "SET timescaledb.enable_direct_compress_copy = on;
COPY trips FROM STDIN WITH (FORMAT csv, HEADER true);" \
< nyc_taxi_sample_nov_dec_2015.csv
This command reads the CSV file from your local filesystem and pipes it to PostgreSQL, which loads it directly into the columnstore.
Verify data loaded:
SELECT COUNT(*) FROM trips;
Option 2: standard copy. This approach loads data into the rowstore first; a background policy converts it to the columnstore later (typically within 12-24 hours) for faster querying.
From command line:
psql -h localhost -p 6543 -U postgres \
-v ON_ERROR_STOP=1 \
-c "COPY trips FROM STDIN WITH (FORMAT csv, HEADER true);" \
< nyc_taxi_sample_nov_dec_2015.csv
Verify data loaded:
SELECT COUNT(*) FROM trips;
Manually convert to columnstore (optional):
If you loaded data using the standard copy, a background process will convert your rowstore data to the columnstore within 12-24 hours. To get the best query performance right away, you can convert it manually:
DO $$
DECLARE
  ch REGCLASS;  -- chunk identifier; convert_to_columnstore expects REGCLASS
BEGIN
  FOR ch IN SELECT show_chunks('trips') LOOP
    CALL convert_to_columnstore(ch);
  END LOOP;
END $$;
Now let's explore the data with some analytical queries. Run these in your psql session:
Query 1: Overall statistics
\timing on
SELECT
COUNT(*) as total_trips,
ROUND(SUM(fare_amount)::numeric, 2) as total_revenue,
ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
ROUND(AVG(trip_distance)::numeric, 2) as avg_distance
FROM trips;
Query 2: Breakdown by vendor
SELECT
vendor_id,
COUNT(*) as trips,
ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
ROUND(AVG(tip_amount)::numeric, 2) as avg_tip,
ROUND(AVG(passenger_count)::numeric, 2) as avg_passengers
FROM trips
GROUP BY vendor_id
ORDER BY trips DESC;
Query 3: Hourly patterns using time_bucket
SELECT
time_bucket('1 hour', pickup_datetime) AS hour,
COUNT(*) as trips,
ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
ROUND(SUM(tip_amount)::numeric, 2) as total_tips
FROM trips
GROUP BY hour
ORDER BY hour DESC
LIMIT 24;
Query 4: Payment type analysis
SELECT
payment_type,
COUNT(*) as trip_count,
ROUND(SUM(fare_amount)::numeric, 2) as total_revenue,
ROUND(AVG(trip_distance)::numeric, 2) as avg_distance,
ROUND(AVG(tip_amount)::numeric, 2) as avg_tip
FROM trips
GROUP BY payment_type
ORDER BY total_revenue DESC;
Query 5: Daily statistics by borough
SELECT
time_bucket('1 day', pickup_datetime) AS day,
pickup_boroname,
COUNT(*) as trips,
ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
ROUND(MAX(fare_amount)::numeric, 2) as max_fare
FROM trips
GROUP BY day, pickup_boroname
ORDER BY day DESC, pickup_boroname
LIMIT 20;
Query 6: Trips by distance category
SELECT
CASE
WHEN trip_distance < 1 THEN 'Short (< 1 mile)'
WHEN trip_distance < 5 THEN 'Medium (1-5 miles)'
WHEN trip_distance < 10 THEN 'Long (5-10 miles)'
ELSE 'Very Long (> 10 miles)'
END as distance_category,
COUNT(*) as trips,
ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
ROUND(AVG(tip_amount)::numeric, 2) as avg_tip
FROM trips
GROUP BY distance_category
ORDER BY trips DESC;
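If you also process trips in application code, the CASE bucketing from Query 6 can be mirrored there. A sketch with the same boundaries:

```python
def distance_category(trip_distance: float) -> str:
    # Mirrors the CASE expression in Query 6: boundaries are
    # half-open, so exactly 5 miles falls into the "Long" bucket.
    if trip_distance < 1:
        return "Short (< 1 mile)"
    elif trip_distance < 5:
        return "Medium (1-5 miles)"
    elif trip_distance < 10:
        return "Long (5-10 miles)"
    return "Very Long (> 10 miles)"
```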
When you create a table with tsdb.hypertable, TimescaleDB automatically converts it into a hypertable, partitioning your data into time-based chunks on pickup_datetime.

With tsdb.enable_columnstore=true, chunks are compressed into the columnstore, which speeds up analytical queries and reduces storage.

When you use SET timescaledb.enable_direct_compress_copy = on, COPY writes incoming rows straight to the columnstore instead of the rowstore.

The tsdb.segmentby='pickup_boroname' setting groups compressed data by borough; with only 6 distinct values, this compresses well and speeds up per-borough queries.
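To build intuition for why segmenting by a low-cardinality column like pickup_boroname helps, here is a toy sketch. It is not TimescaleDB's actual compression algorithm — it only illustrates that grouping rows by segment lets each segment store the borough value once instead of once per row:

```python
from collections import defaultdict

# Toy rows: (borough, fare). Real columnstore compression is far more
# sophisticated; this only shows the segment-by grouping idea.
rows = [("Manhattan", 12.5), ("Brooklyn", 8.0), ("Manhattan", 20.0),
        ("Queens", 15.0), ("Brooklyn", 9.5), ("Manhattan", 7.0)]

segments = defaultdict(list)
for borough, fare in rows:
    segments[borough].append(fare)

# Naive row storage: one borough string per row.
naive_borough_values = len(rows)
# Segmented storage: one borough string per segment
# (at most 6 for the whole NYC dataset).
segmented_borough_values = len(segments)
```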
TimescaleDB's time_bucket() is like PostgreSQL's date_trunc() but more powerful: it supports arbitrary bucket widths such as 5 minutes, 1 hour, 1 day, etc.

See nyc-taxi-queries.sql for the complete set of queries. Each query demonstrates time_bucket() for time-based aggregation against the key table settings used in this example:
- partition_column='pickup_datetime'
- segmentby='pickup_boroname'
- orderby='pickup_datetime DESC'
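The bucketing behavior of time_bucket() can be approximated in Python. This is a simplified sketch — fixed-width intervals floored from the Unix epoch, ignoring time zones and the origin/offset options the real function supports:

```python
from datetime import datetime, timedelta, timezone

def time_bucket(width: timedelta, ts: datetime) -> datetime:
    # Floor ts to a multiple of `width` since the Unix epoch,
    # analogous to time_bucket('1 hour', pickup_datetime).
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    seconds = (ts - epoch).total_seconds()
    floored = int(seconds // width.total_seconds()) * width.total_seconds()
    return epoch + timedelta(seconds=floored)

bucket = time_bucket(timedelta(minutes=5),
                     datetime(2015, 11, 1, 0, 7, 30, tzinfo=timezone.utc))
# bucket -> 2015-11-01 00:05:00+00:00
```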
For real-time dashboards, you can create continuous aggregates that automatically update:
-- Create a continuous aggregate for hourly statistics by borough
CREATE MATERIALIZED VIEW trips_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', pickup_datetime) AS hour,
pickup_boroname,
COUNT(*) as trip_count,
AVG(fare_amount) as avg_fare,
SUM(fare_amount) as total_revenue,
AVG(trip_distance) as avg_distance
FROM trips
GROUP BY hour, pickup_boroname;
-- Add a refresh policy to keep it updated
SELECT add_continuous_aggregate_policy('trips_hourly',
start_offset => INTERVAL '2 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
Now you can query trips_hourly for instant results on pre-aggregated data.
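To see what the policy above actually refreshes, here is a sketch of the window arithmetic: each scheduled run materializes buckets in the interval [now - start_offset, now - end_offset):

```python
from datetime import datetime, timedelta

def refresh_window(now: datetime,
                   start_offset: timedelta,
                   end_offset: timedelta):
    # Each scheduled run refreshes [now - start_offset, now - end_offset).
    return now - start_offset, now - end_offset

# With start_offset = 2 hours and end_offset = 1 hour, a run at noon
# re-materializes the 10:00-11:00 window.
lo, hi = refresh_window(datetime(2015, 12, 1, 12, 0),
                        timedelta(hours=2), timedelta(hours=1))
```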
If something doesn't work, a few things to check:
- Inspect the first few rows in your data file (or query the table with LIMIT 10) to confirm the data looks right.
- Verify the extension version: SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';
- Make sure SET timescaledb.enable_direct_compress_copy = on; runs in the same session as the COPY.
- Check the hypertable: SELECT * FROM timescaledb_information.hypertables WHERE hypertable_name = 'trips';
- Check its chunks: SELECT * FROM timescaledb_information.chunks WHERE hypertable_name = 'trips';

This NYC Taxi example demonstrates patterns applicable to other high-volume, location-based, time-series workloads.
When you're done experimenting:

If you used the quick-start script:
# Stop the container
docker stop timescaledb-ha-pg18-quickstart
# Remove the container
docker rm timescaledb-ha-pg18-quickstart
# Remove the persistent data volume
docker volume rm timescaledb_data
# (Optional) Remove the Docker image
docker rmi timescale/timescaledb-ha:pg18
If you started TimescaleDB with the manual docker run command:
# Stop the container
docker stop timescaledb
# Remove the container
docker rm timescaledb
# (Optional) Remove the Docker image
docker rmi timescale/timescaledb-ha:pg18
Note: If you created a named volume with the manual Docker command, you can remove it with docker volume rm <volume_name>.
Found an issue or want to improve this example? Contributions welcome! Open an issue or PR on GitHub.
Questions? Check out Timescale Community Forums or TimescaleDB Documentation.