Community Portal

Allow for processed data to be re-processed quickly next time

Created on Friday 5 February 2021, 06:01

  • ID
    642050
  • Project
    Metabolism of Cities Data Hub
  • Status
    Open
  • Priority
    Medium
  • Type
    Programming work
  • Assigned to
    No one yet
  • Subscribers
    Jens Peters
    Paul Hoekman


Description

Sometimes the work of preparing data to be processed takes quite some time. Raw data may need to be manipulated quite a bit to get it into the right shape and form. Once this work is done, the system can finally read and record the data. However, the process will need to be repeated the next time the same dataset is updated by the original author. It would be good to see if there are ways in which we can streamline this process.

Some possibilities include:

  • Looking at common processing patterns (e.g. converting columns to rows) and trying to embed this functionality on the site, instead of having people do it in the spreadsheet.
  • Somehow saving the processing as a formula/macro in the spreadsheet, so it can be re-run easily.
  • Looking at Python scripts / Jupyter notebooks to help with the processing, and thus allow for re-processing new data with the click of a button (see the sketch after this list).
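
As a rough illustration of the last option, here is a minimal pandas sketch of the "convert columns to rows" pattern. The file name raw_data.xlsx and the identifier columns date and reference space are assumptions for the sake of the example, not part of any actual dataset; the idea is simply that the same script can be re-run whenever the original author publishes an update.

    import pandas as pd

    def reshape_columns_to_rows(path):
        # Read the raw spreadsheet as downloaded from the original source
        raw = pd.read_excel(path)

        # Keep the identifier columns as-is and turn the remaining (value)
        # columns into rows, one row per original column per record
        long_format = raw.melt(
            id_vars=["date", "reference space"],  # assumed identifier columns
            var_name="indicator",
            value_name="value",
        )

        # Drop rows where no value was reported
        return long_format.dropna(subset=["value"])

    if __name__ == "__main__":
        processed = reshape_columns_to_rows("raw_data.xlsx")  # hypothetical file
        processed.to_csv("processed_data.csv", index=False)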

Let's start by collecting some examples of datasets that are time-consuming to process, so that we can make an informed decision on the best route forward. Anyone who has a good example, please post it below.

Discussion and updates


New task was created


Here is a dataset Jens shared in this thread (which is what this task is based on in the first place).


I'm adding some more datasets. The ones I am currently dealing with basically all provide several columns of data for a common timeline (rows). The goal should be to make appending data easy, e.g. by simply pasting the raw data table as downloaded into one Excel tab, while a second tab (the one that is then uploaded) links to it and presents the data in the correct format right away. A simple possibility would be to duplicate all the required columns with a suffix per dataset, e.g.:
product (name) [A] | product (code) [A] | Amount [A] | Unit [A] | Segment [A] | product (name) [B] | product (code) [B] | Amount [B] | Unit [B] | and so on (date, period and reference space can be common for all).
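
To illustrate, here is a minimal pandas sketch of how such a layout could be stacked back into one long table. The column names are taken from the example above and the helper name stack_suffixed_groups is made up; the shared date/period/reference space columns are kept, and the [A], [B], ... groups are concatenated underneath each other.

    import re
    import pandas as pd

    COMMON = ["date", "period", "reference space"]  # columns shared by all groups

    def stack_suffixed_groups(df):
        # Collect every suffix used in the header, e.g. "A", "B", ...
        suffixes = set()
        for col in df.columns:
            match = re.search(r"\[(\w+)\]$", col)
            if match:
                suffixes.add(match.group(1))

        frames = []
        for s in sorted(suffixes):
            cols = [c for c in df.columns if c.endswith(f"[{s}]")]
            part = df[COMMON + cols].copy()
            # Strip the "[A]"-style suffix so all groups share the same headers
            part.columns = COMMON + [re.sub(r"\s*\[\w+\]$", "", c) for c in cols]
            frames.append(part)

        return pd.concat(frames, ignore_index=True)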


It would be great if the system could also ignore rows with an empty date. Some data tables have gaps due to their format, and not having to revise these would be nice (though it's also not such a big deal to do it manually). See the examples.
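
For what it's worth, skipping rows with an empty date is a one-liner in pandas; the file name and the column name "date" below are assumptions for the sake of the example.

    import pandas as pd

    df = pd.read_excel("uploaded_file.xlsx")   # hypothetical uploaded file
    df = df.dropna(subset=["date"])            # rows with an empty date are skipped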


Task was assigned to Paul Hoekman


Thanks for the info, Jens - this is very helpful! It will take some analysis and time to figure out the best way forward, but this is something we'll look at.

If someone has similar (or other) ideas about how to make updating of data more efficient - or sample files to share - please do post!


Paul Hoekman is no longer responsible for this task