support Duckdb #39

Draft
wants to merge 1 commit into
base: main

Conversation

@derhuerst derhuerst self-assigned this Apr 11, 2023

socket-security bot commented Apr 11, 2023

All alerts resolved. Learn more about Socket for GitHub.

This PR previously contained dependency changes with security issues that have been resolved, removed, or ignored.

socket-security bot commented Feb 26, 2024

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | @duckdb/node-api@1.2.2-alpha.19 | 100 | 100 | 100 | 95 | 100 |

@matthiasfeist

Oh, nice to see this PR :) I was just opening an issue suggesting that DuckDB could be a much quicker way to do data analytics on GTFS datasets.

@derhuerst
Member Author

I'm making progress! With the current state, importing the 2025-05-09 VBB GTFS works.

@derhuerst
Member Author

derhuerst commented May 12, 2025

I stumbled upon this weird behaviour (bug?) in DuckDB v1.2.2's query plan output.

I redefined `arrivals_departures` as follows:
```sql
CREATE OR REPLACE VIEW "main.arrivals_departures" AS
SELECT
	(
		to_base64(encode(trip_id))
		|| ':' || to_base64(encode(
			extract(ISOYEAR FROM "date")
			|| '-' || lpad(extract(MONTH FROM "date")::text, 2, '0')
			|| '-' || lpad(extract(DAY FROM "date")::text, 2, '0')
		))
		|| ':' || to_base64(encode(stop_sequence::text))
		-- frequencies_row
		|| ':' || to_base64(encode('-1'))
		-- frequencies_it
		|| ':' || to_base64(encode('-1'))
	) as arrival_departure_id,

	-- todo: expose local arrival/departure "wall clock time"?

	-1 AS frequencies_row,
	-1 AS frequencies_it,

	stop_times_based.*
	EXCLUDE (
		arrival_time,
		departure_time
	)
FROM (
	SELECT
		agency.agency_id,
		trips.route_id,
		route_short_name,
		route_long_name,
		route_type,
		s.trip_id,
		trips.direction_id,
		trips.trip_headsign,
		trips.wheelchair_accessible,
		trips.bikes_allowed,
		service_days.service_id,
		trips.shape_id,
		"date",
		stop_sequence,
		stop_sequence_consec,
		stop_headsign,
		pickup_type,
		drop_off_type,
		shape_dist_traveled,
		timepoint,
		agency.agency_timezone as tz,
		arrival_time,
		(
			make_timestamptz(
				date_part('year', "date")::int,
				date_part('month', "date")::int,
				date_part('day', "date")::int,
				12, 0, 0,
				agency.agency_timezone
			)
			- INTERVAL '12 hours'
			+ arrival_time
		) t_arrival,
		departure_time,
		(
			make_timestamptz(
				date_part('year', "date")::int,
				date_part('month', "date")::int,
				date_part('day', "date")::int,
				12, 0, 0,
				agency.agency_timezone
			)
			- INTERVAL '12 hours'
			+ departure_time
		) t_departure,
		trip_start_time,
		s.stop_id, stops.stop_name,
		stations.stop_id station_id, stations.stop_name station_name,
		-- todo: PR #47
		coalesce(
			nullif(stops.wheelchair_boarding, 'no_info_or_inherit'),
			nullif(stations.wheelchair_boarding, 'no_info_or_inherit'),
			'no_info_or_inherit'
		) AS wheelchair_boarding
	FROM (
		"main.stop_times" s
		JOIN "main.stops" stops ON s.stop_id = stops.stop_id
		LEFT JOIN "main.stops" stations ON stops.parent_station = stations.stop_id
		JOIN "main.trips" trips ON s.trip_id = trips.trip_id
		JOIN "main.routes" routes ON trips.route_id = routes.route_id
		LEFT JOIN "main.agency" agency ON (
			-- The GTFS spec allows routes.agency_id to be NULL if there is exactly one agency in the feed.
			-- Note: We implicitly rely on other parts of the code base to validate that agency has just one row!
			-- It seems that GTFS has allowed this at least since 2016:
			-- https://github.com/google/transit/blame/217e9bf/gtfs/spec/en/reference.md#L544-L554
			routes.agency_id IS NULL -- match first (and only) agency
			OR routes.agency_id = agency.agency_id -- match by ID
		)
		JOIN "main.service_days" service_days ON trips.service_id = service_days.service_id
	)
	-- todo: this slows down slightly
	-- ORDER BY route_id, s.trip_id, "date", stop_sequence
) stop_times_based;
```

Look at the time (0.36s) of the sequential scan over `main.stop_times` at the bottom of the query plan tree: it is larger than the "Total Time: 0.0924s" reported at the top. Does DuckDB use multiple cores for scanning?

query plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM "main.arrivals_departures" WHERE t_departure >= '2025-05-09 18:00:00+02:00' AND t_departure < '2025-05-09 18:20:00+02:00' AND (date = '2025-05-08' OR date = '2025-05-09') AND station_id = 'de:11000:900100001';
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0924s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│    arrival_departure_id   │
│      frequencies_row      │
│       frequencies_it      │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            ...            │
│    stop_sequence_consec   │
│       stop_headsign       │
│        pickup_type        │
│       drop_off_type       │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│         t_arrival         │
│        t_departure        │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            date           │
│       stop_sequence       │
│    stop_sequence_consec   │
│       stop_headsign       │
│        pickup_type        │
│       drop_off_type       │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│         t_arrival         │
│        t_departure        │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            date           │
│       stop_sequence       │
│    stop_sequence_consec   │
│            ...            │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│ CAST(CAST("year"(date) AS │
│     INTEGER) AS BIGINT)   │
│ CAST(CAST("month"(date) AS│
│     INTEGER) AS BIGINT)   │
│  CAST(CAST("day"(date) AS │
│     INTEGER) AS BIGINT)   │
│        arrival_time       │
│       departure_time      │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  (((make_timestamptz(CAST │
│   (CAST("year"(date) AS   │
│  INTEGER) AS BIGINT), CAST│
│   (CAST("month"(date) AS  │
│  INTEGER) AS BIGINT), CAST│
│    (CAST("day"(date) AS   │
│  INTEGER) AS BIGINT), 12, │
│  0, 0.0, agency_timezone) │
│ - '12:00:00'::INTERVAL) + │
│  departure_time) BETWEEN  │
│ '2025-05-09 16:00:00+00': │
│ :TIMESTAMP WITH TIME ZONE │
│  AND '2025-05-09 16:20:00 │
│ +00'::TIMESTAMP WITH TIME │
│            ZONE)          │
│                           │
│          48 Rows          │
│          (0.01s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        ├──────────────┐
│  service_id = service_id  │              │
│                           │              │
│         2686 Rows         │              │
│          (0.00s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│           FILTER          ││     BLOCKWISE_NL_JOIN     │
│    ────────────────────   ││    ────────────────────   │
│ (date = '2025-05-09 00:00 ││      Join Type: RIGHT     │
│      :00'::TIMESTAMP)     ││                           │
│                           ││         Condition:        ├──────────────┐
│                           ││  (agency_id = agency_id)  │              │
│                           ││                           │              │
│          995 Rows         ││         13256 Rows        │              │
│          (0.00s)          ││          (0.00s)          │              │
└─────────────┬─────────────┘└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        ││         HASH_JOIN         │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│           Table:          ││     Table: main.agency    ││      Join Type: INNER     │
│     main.service_days     ││   Type: Sequential Scan   ││                           │
│                           ││                           ││        Conditions:        │
│   Type: Sequential Scan   ││        Projections:       ││    route_id = route_id    │
│                           ││         agency_id         ││                           ├──────────────┐
│        Projections:       ││      agency_timezone      ││                           │              │
│         service_id        ││                           ││                           │              │
│            date           ││                           ││                           │              │
│                           ││                           ││                           │              │
│        193254 Rows        ││          34 Rows          ││         13256 Rows        │              │
│          (0.00s)          ││          (0.00s)          ││          (0.00s)          │              │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘              │
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         TABLE_SCAN        ││         HASH_JOIN         │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │     Table: main.routes    ││      Join Type: INNER     │
                                                          │   Type: Sequential Scan   ││                           │
                                                          │                           ││        Conditions:        │
                                                          │        Projections:       ││     trip_id = trip_id     │
                                                          │          route_id         ││                           │
                                                          │         agency_id         ││                           ├──────────────┐
                                                          │      route_short_name     ││                           │              │
                                                          │      route_long_name      ││                           │              │
                                                          │         route_type        ││                           │              │
                                                          │                           ││                           │              │
                                                          │          884 Rows         ││         13256 Rows        │              │
                                                          │          (0.00s)          ││          (0.01s)          │              │
                                                          └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                       ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                       │         TABLE_SCAN        ││         HASH_JOIN         │
                                                                                       │    ────────────────────   ││    ────────────────────   │
                                                                                       │     Table: main.trips     ││      Join Type: INNER     │
                                                                                       │   Type: Sequential Scan   ││                           │
                                                                                       │                           ││        Conditions:        │
                                                                                       │        Projections:       ││     stop_id = stop_id     │
                                                                                       │          trip_id          ││                           │
                                                                                       │          route_id         ││                           │
                                                                                       │         service_id        ││                           ├──────────────┐
                                                                                       │        direction_id       ││                           │              │
                                                                                       │       trip_headsign       ││                           │              │
                                                                                       │   wheelchair_accessible   ││                           │              │
                                                                                       │       bikes_allowed       ││                           │              │
                                                                                       │          shape_id         ││                           │              │
                                                                                       │                           ││                           │              │
                                                                                       │        266359 Rows        ││         13256 Rows        │              │
                                                                                       │          (0.01s)          ││          (0.06s)          │              │
                                                                                       └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                                                    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                    │         TABLE_SCAN        ││         HASH_JOIN         │
                                                                                                                    │    ────────────────────   ││    ────────────────────   │
                                                                                                                    │           Table:          ││      Join Type: INNER     │
                                                                                                                    │      main.stop_times      ││                           │
                                                                                                                    │                           ││        Conditions:        │
                                                                                                                    │   Type: Sequential Scan   ││parent_station = station_id│
                                                                                                                    │                           ││                           │
                                                                                                                    │        Projections:       ││                           │
                                                                                                                    │          stop_id          ││                           │
                                                                                                                    │          trip_id          ││                           │
                                                                                                                    │       stop_sequence       ││                           │
                                                                                                                    │    stop_sequence_consec   ││                           ├──────────────┐
                                                                                                                    │       stop_headsign       ││                           │              │
                                                                                                                    │        pickup_type        ││                           │              │
                                                                                                                    │       drop_off_type       ││                           │              │
                                                                                                                    │    shape_dist_traveled    ││                           │              │
                                                                                                                    │         timepoint         ││                           │              │
                                                                                                                    │        arrival_time       ││                           │              │
                                                                                                                    │       departure_time      ││                           │              │
                                                                                                                    │      trip_start_time      ││                           │              │
                                                                                                                    │                           ││                           │              │
                                                                                                                    │        2899055 Rows       ││          177 Rows         │              │
                                                                                                                    │          (0.36s)          ││          (0.00s)          │              │
                                                                                                                    └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                                                                                 ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                                                 │         TABLE_SCAN        ││         TABLE_SCAN        │
                                                                                                                                                 │    ────────────────────   ││    ────────────────────   │
                                                                                                                                                 │     Table: main.stops     ││     Table: main.stops     │
                                                                                                                                                 │   Type: Sequential Scan   ││      Type: Index Scan     │
                                                                                                                                                 │                           ││                           │
                                                                                                                                                 │        Projections:       ││        Projections:       │
                                                                                                                                                 │          stop_id          ││          stop_id          │
                                                                                                                                                 │       parent_station      ││         stop_name         │
                                                                                                                                                 │         stop_name         ││    wheelchair_boarding    │
                                                                                                                                                 │    wheelchair_boarding    ││                           │
                                                                                                                                                 │                           ││          Filters:         │
                                                                                                                                                 │          Filters:         ││     stop_id='de:11000     │
                                                                                                                                                 │  parent_station='de:11000 ││        :900100001'        │
                                                                                                                                                 │        :900100001'        ││                           │
                                                                                                                                                 │                           ││                           │
                                                                                                                                                 │          177 Rows         ││           1 Rows          │
                                                                                                                                                 │          (0.00s)          ││          (0.00s)          │
                                                                                                                                                 └───────────────────────────┘└───────────────────────────┘

edit: maybe duckdb/duckdb#17607 is related, but most likely not
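
One way to check the multi-core question, as a rough sketch: re-run the same `EXPLAIN ANALYZE` with DuckDB limited to a single thread and compare the numbers. This assumes that the per-operator times are accumulated across worker threads, which has not been verified here against v1.2.2.

```sql
-- Assumption (not verified for v1.2.2): per-operator timings are accumulated
-- over all worker threads, so a parallel scan can report more time than the
-- wall-clock "Total Time" of the whole query.

-- 1. Check how many threads DuckDB is currently allowed to use.
SELECT current_setting('threads') AS threads;

-- 2. Force single-threaded execution and re-run the same query.
SET threads = 1;
EXPLAIN ANALYZE
SELECT *
FROM "main.arrivals_departures"
WHERE t_departure >= '2025-05-09 18:00:00+02:00'
	AND t_departure < '2025-05-09 18:20:00+02:00'
	AND (date = '2025-05-08' OR date = '2025-05-09')
	AND station_id = 'de:11000:900100001';

-- 3. Restore the default thread count.
RESET threads;
```

If the scan time stays at or below the total time with `threads = 1`, the discrepancy is probably just per-thread accounting rather than a bug in the plan output.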
