During my college I used a lot the bus service and it bothered me a lot how close were the two next bus stop, they were extremely close. After investigating a bit I discovered that TITSA post they google data on the open.
It took me a bit to understand the GTFS but the official docs are quite good.
The code attached is a notebook developed on top docker stack using Spark 3.2 (but most functions are retrocompat)
Also there is a docker-composer file containing the image and the volume mounted (the password for jupyter is my-password)
If you want to run the notebook run the command download.sh
it will download the last zip files from Titsa webpages and unzip them in the proper directory.
Both stops could be close to each other but some may be for specific lines, so using stop_times you can get the trip stop sequence and join it with the stop master data for the coords.
For calculating the distance I used the harvesine distance (kudos to [https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark]) and compare each consecutive stop.
There were some errores reporting the same stop for the same trip those were discarded... And well, I could found the same name in two similars id's and they were a few degrees deviated one from another, as in the following img:
So I discarded also the same name as this is not really clear. And well I ended up with two pretty similar bus stops that were both the same place but included the (T) from terminal.
So after discarded that error I found...
Which was hilarious, the difference between both is about 20 meters and it takes much more in the bus.
In case you're instered: the bus stops I was referring to were ranked as: 157.
Obviously there are some bus stations but they should share id or at least name.
stop_id | stop_name | diff_routes |
---|---|---|
9181 | INTERCAMBIADOR STA.CRUZ | 44 |
2625 | INTERCAMBIADOR LAGUNA (T) | 36 |
9413 | MERIDIANO | 25 |
9450 | INTERCAMBIADOR STA.CRUZ | 23 |
2582 | COROMOTO (T) | 22 |
2549 | LEOCADIO MACHADO | 22 |
2692 | FRANCISCO SÁNCHEZ (T) | 21 |
Well I expected some magic output but there are the main bus stations and the previous / next stops. As we can see Santa Cruz station is splitted so I tried ot group also the stops by name.
And it yield some interesting results, a lot of bus stops share it's name even if they're not related the most common one is "Cementerio" (graveyard) and the second one is "Centro de salud" (health centre).
I include here the top list and it's quite funny tough in such a small location to have so many collisions in names.
stop_name | stop_id | diff_stops |
---|---|---|
CEMENTERIO | [1137, 1141, 1204, 1225, 1376, 4074, 4124, 4926, 5027, 5029, 7076, 7095, 7256, 7362, 9105, 9106] | 16 |
CENTRO DE SALUD | [1219, 1883, 1924, 1928, 2587, 2789, 7257, 7361, 7364, 7382, 7455, 9409] | 12 |
EL PINO | [1636, 1647, 2130, 2145, 2314, 2704, 4957, 7577, 7603, 7735, 7782] | 11 |
EL CALVARIO | [1203, 1226, 1258, 1259, 4016, 4035, 4217, 4356, 4359, 4739] | 10 |
EL MOLINO | [1519, 1571, 1971, 1977, 2573, 2574, 4301, 4308, 4642] | 9 |
CAMPO DE FUTBOL | [1622, 1628, 2128, 2147, 4389, 4533, 9362, 9370] | 8 |
LAS TOSCAS | [1206, 1223, 1305, 1350, 1765, 2310, 4728, 4733] | 8 |
This question should be quite straightforward as we have for any trip all the stops and the predicted "arrival" and "departure" for each one. So we just need to group it and... What is this?
trip_id | start | end |
---|---|---|
3927806 | 24:10:00 | 24:34:19 |
3927807 | 25:30:00 | 25:54:19 |
3927809 | 24:50:00 | 25:08:38 |
3927810 | 26:20:00 | 26:38:38 |
3928436 | 25:10:00 | 25:20:20 |
3928763 | 24:05:00 | 24:21:39 |
3928766 | 24:15:00 | 24:44:20 |
3928769 | 24:45:00 | 25:01:39 |
3930761 | 28:40:00 | 29:37:11 |
3930762 | 25:00:00 | 25:51:52 |
3930763 | 27:30:00 | 28:21:52 |
3930764 | 24:00:00 | 24:50:36 |
3930765 | 26:25:00 | 27:15:36 |
3930767 | 24:00:00 | 24:51:52 |
3930768 | 26:25:00 | 27:16:52 |
3930769 | 25:20:00 | 26:10:36 |
3932368 | 24:05:00 | 24:31:00 |
3932373 | 24:40:00 | 25:01:17 |
3934883 | 24:05:00 | 24:55:17 |
3934886 | 25:00:00 | 25:50:17 |
Seems that the people decided to put hour 24 and so on for representing the next day.
Checking the standard from the gtfs reference is correct:
Service day - A service day is a time period used to indicate route scheduling. The exact definition of service day varies from agency to agency but service days often do not correspond with calendar days. A service day may exceed 24:00:00 if service begins on one day and ends on a following day. For example, service that runs from 08:00:00 on Friday to 02:00:00 on Saturday, could be denoted as running from 08:00:00 to 26:00:00 on a single service day.
So... Let's fix the time and convert it into a timestamp doing this is awful in pyspark.
And after doing this, here are the top results:
route_short_name | elapsed | rank |
---|---|---|
330 | INTERVAL '0 02:47:22' DAY TO SECOND | 1 |
330 | INTERVAL '0 02:44:55' DAY TO SECOND | 2 |
325 | INTERVAL '0 02:36:24' DAY TO SECOND | 3 |
343 | INTERVAL '0 02:28:52' DAY TO SECOND | 4 |
342 | INTERVAL '0 02:14:25' DAY TO SECOND | 5 |
108 | INTERVAL '0 02:13:36' DAY TO SECOND | 6 |
342 | INTERVAL '0 01:59:51' DAY TO SECOND | 7 |
343 | INTERVAL '0 01:56:35' DAY TO SECOND | 8 |
325 | INTERVAL '0 01:53:27' DAY TO SECOND | 9 |
34 | INTERVAL '0 01:52:27' DAY TO SECOND | 10 |
The 330 is a "beltway", it starts and finishes on the same point, so it makes a lot of sense to have such a large estimate time.
And the 325, makes an amazing way. It only works 5 times on labour days and 3 times on weekends.
More questions will be solved soon :)