support Duckdb #39
Oh, nice to see this PR :) I was just about to open an issue suggesting that DuckDB could be a much quicker way to do data analytics on GTFS datasets.
3b631aa to 68b2d4e
I'm making progress! With the current state, importing the |
I stumbled upon this weird behaviour (bug?) in DuckDB v1.2.2's query plan output. I redefined `arrivals_departures` as follows:

CREATE OR REPLACE VIEW "main.arrivals_departures" AS
SELECT
    (
        to_base64(encode(trip_id))
        || ':' || to_base64(encode(
            extract(ISOYEAR FROM "date")
            || '-' || lpad(extract(MONTH FROM "date")::text, 2, '0')
            || '-' || lpad(extract(DAY FROM "date")::text, 2, '0')
        ))
        || ':' || to_base64(encode(stop_sequence::text))
        -- frequencies_row
        || ':' || to_base64(encode('-1'))
        -- frequencies_it
        || ':' || to_base64(encode('-1'))
    ) as arrival_departure_id,
    -- todo: expose local arrival/departure "wall clock time"?
    -1 AS frequencies_row,
    -1 AS frequencies_it,
    stop_times_based.*
    EXCLUDE (
        arrival_time,
        departure_time
    )
FROM (
    SELECT
        agency.agency_id,
        trips.route_id,
        route_short_name,
        route_long_name,
        route_type,
        s.trip_id,
        trips.direction_id,
        trips.trip_headsign,
        trips.wheelchair_accessible,
        trips.bikes_allowed,
        service_days.service_id,
        trips.shape_id,
        "date",
        stop_sequence,
        stop_sequence_consec,
        stop_headsign,
        pickup_type,
        drop_off_type,
        shape_dist_traveled,
        timepoint,
        agency.agency_timezone as tz,
        arrival_time,
        (
            make_timestamptz(
                date_part('year', "date")::int,
                date_part('month', "date")::int,
                date_part('day', "date")::int,
                12, 0, 0,
                agency.agency_timezone
            )
            - INTERVAL '12 hours'
            + arrival_time
        ) t_arrival,
        departure_time,
        (
            make_timestamptz(
                date_part('year', "date")::int,
                date_part('month', "date")::int,
                date_part('day', "date")::int,
                12, 0, 0,
                agency.agency_timezone
            )
            - INTERVAL '12 hours'
            + departure_time
        ) t_departure,
        trip_start_time,
        s.stop_id, stops.stop_name,
        stations.stop_id station_id, stations.stop_name station_name,
        -- todo: PR #47
        coalesce(
            nullif(stops.wheelchair_boarding, 'no_info_or_inherit'),
            nullif(stations.wheelchair_boarding, 'no_info_or_inherit'),
            'no_info_or_inherit'
        ) AS wheelchair_boarding
    FROM (
        "main.stop_times" s
        JOIN "main.stops" stops ON s.stop_id = stops.stop_id
        LEFT JOIN "main.stops" stations ON stops.parent_station = stations.stop_id
        JOIN "main.trips" trips ON s.trip_id = trips.trip_id
        JOIN "main.routes" routes ON trips.route_id = routes.route_id
        LEFT JOIN "main.agency" agency ON (
            -- The GTFS spec allows routes.agency_id to be NULL if there is exactly one agency in the feed.
            -- Note: We implicitly rely on other parts of the code base to validate that agency has just one row!
            -- It seems that GTFS has allowed this at least since 2016:
            -- https://github.com/google/transit/blame/217e9bf/gtfs/spec/en/reference.md#L544-L554
            routes.agency_id IS NULL -- match first (and only) agency
            OR routes.agency_id = agency.agency_id -- match by ID
        )
        JOIN "main.service_days" service_days ON trips.service_id = service_days.service_id
    )
    -- todo: this slows down slightly
    -- ORDER BY route_id, s.trip_id, "date", stop_sequence
) stop_times_based;

Look at the time ( query plan
edit: maybe duckdb/duckdb#17607 is related, but most likely not
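
For readers who want to check the `arrival_departure_id` values produced by the view definition above, here is a rough Python sketch of the same scheme: five parts (trip ID, date, `stop_sequence`, `frequencies_row`, `frequencies_it`), each Base64-encoded and joined with `:`. The function names `arrival_departure_id` and `parse_arrival_departure_id` are hypothetical helpers, not part of this PR, and the sketch assumes that `to_base64(encode(...))` yields standard padded Base64 over the UTF-8 bytes.

```python
import base64
from datetime import date


def b64(text: str) -> str:
    # Assumption: mirrors to_base64(encode(text)), i.e. standard Base64 of UTF-8 bytes.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def arrival_departure_id(trip_id: str, day: date, stop_sequence: int,
                         frequencies_row: int = -1, frequencies_it: int = -1) -> str:
    # Hypothetical helper. The view uses extract(ISOYEAR FROM "date"), so the
    # ISO week-numbering year is used here as well; it can differ from the
    # calendar year for dates close to New Year.
    iso_year = day.isocalendar()[0]
    day_str = f"{iso_year}-{day.month:02d}-{day.day:02d}"
    parts = [trip_id, day_str, str(stop_sequence),
             str(frequencies_row), str(frequencies_it)]
    return ":".join(b64(part) for part in parts)


def parse_arrival_departure_id(value: str) -> list[str]:
    # ':' is not in the Base64 alphabet, so splitting on it is unambiguous.
    return [base64.b64decode(part).decode("utf-8") for part in value.split(":")]
```

Because every part is Base64-encoded before joining, a `trip_id` that itself contains `:` still round-trips: `parse_arrival_departure_id(arrival_departure_id("a:b", date(2025, 5, 7), 3))` returns the original five parts.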