Need functionality that facilitates cross study analysis #145

satagopam7 · 2020-01-22T14:07:18Z

We need some functionality on both the frontend and the backend to support cross-study data pooling and/or comparison. I can provide more details if needed.

sherzinger · 2020-01-23T07:45:19Z

I thought about this a bit yesterday evening. Given the current tools and structure of Ada, I can propose this approach:

We give the user the possibility to merge two datasets into a third one. This could be done via a simple form where the user can select DataSet1, DataSet2, and the field name to use for merging (e.g. sampleID). Of course this needs to be properly designed so the user understands "why" and "how", without additional training.
In the background we use an existing or new (should be easy to implement) data transformation to achieve this. It is however important that we create a new field "source_data_set".
The user now has the option to create views, charts, or analyses using the "source_data_set" field if separation is needed (e.g. an Age boxplot).

This is technically the most straightforward approach I can think of.
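A minimal sketch of the pooling idea above, in plain Python rather than Ada's own code. It assumes that "merging" means stacking the records of both datasets row-wise and tagging each record with its origin; the function names and sample records are hypothetical, and only the "source_data_set" field name comes from the proposal.

```python
from collections import defaultdict
from statistics import median


def pool_datasets(datasets, source_field="source_data_set"):
    """Concatenate records from several datasets, tagging each with its origin."""
    pooled = []
    for name, records in datasets.items():
        for record in records:
            tagged = dict(record)  # copy so the source records stay untouched
            tagged[source_field] = name
            pooled.append(tagged)
    return pooled


def group_values(records, value_field, group_field="source_data_set"):
    """Collect the values of one field per group, skipping records that lack it."""
    groups = defaultdict(list)
    for record in records:
        if record.get(value_field) is not None:
            groups[record[group_field]].append(record[value_field])
    return dict(groups)


# Hypothetical sample data for two studies
study_a = [{"sampleID": "A1", "age": 61}, {"sampleID": "A2", "age": 70}]
study_b = [{"sampleID": "B1", "age": 55}, {"sampleID": "B2", "age": 58}, {"sampleID": "B3"}]

pooled = pool_datasets({"study_a": study_a, "study_b": study_b})
ages_by_source = group_values(pooled, "age")
medians = {source: median(values) for source, values in ages_by_source.items()}
```

Grouping by "source_data_set" is what would let a chart such as the Age boxplot show one box per study instead of a single pooled box.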

satagopam7 · 2020-01-23T10:23:27Z

Yes, that is more or less the direction I had in mind. This is very similar to views in an RDBMS (Oracle, PostgreSQL): users can create different pooled datasets (analogous to views) derived from two or more source datasets. This is ‘cross-study data pooling’. The other case that also needs to be addressed is ‘cross-study comparison’: no pooling here, but the datasets still need to be compared.

— Dr. Venkata Satagopam Bioinformatics Core Luxembourg Centre For Systems Biomedicine (LCSB) University of Luxembourg Campus Belval, House of Biomedicine II 6, avenue du Swing L-4367 Belvaux T +352-466-644-6421 F +352-466-644-36421 [email protected] or [email protected] http://lcsb.uni.lu ----- This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies. -----

sherzinger · 2020-01-23T11:03:34Z

Hey Venkata,

Given the current design of Ada, comparing datasets without merging them into a third one could take considerable effort. The views, tree, analyses, dictionaries, filters, etc. are all centred around the currently selected data set. Pulling (part of) another dataset into these features would require design (and probably architectural) changes to all of them.

We could however think about making the process of merging invisible to the user.
I was thinking about a tab/button "Compare Datasets" that, when clicked, allows you to pick two datasets. We could also add a "Stop comparison" button, which would delete the merged dataset.
The user would not even know that they operate on a new data set.
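One way to picture the "Compare Datasets" / "Stop comparison" lifecycle is a temporary merged dataset that is created on entry and deleted on exit. The sketch below is plain Python under stated assumptions: `InMemoryStore`, `comparison`, and the temporary naming scheme are hypothetical stand-ins, not Ada's API.

```python
from contextlib import contextmanager


class InMemoryStore:
    """Hypothetical stand-in for a dataset store."""

    def __init__(self):
        self.datasets = {}

    def create(self, name, records):
        self.datasets[name] = list(records)

    def delete(self, name):
        self.datasets.pop(name, None)


@contextmanager
def comparison(store, first, second):
    """Create a temporary merged dataset and remove it when the comparison ends."""
    temp_name = f"__compare__{first}__{second}"
    merged = [
        {**record, "source_data_set": name}
        for name in (first, second)
        for record in store.datasets[name]
    ]
    store.create(temp_name, merged)  # "Compare Datasets"
    try:
        yield temp_name
    finally:
        store.delete(temp_name)  # "Stop comparison"


store = InMemoryStore()
store.create("study_a", [{"age": 61}])
store.create("study_b", [{"age": 55}])

with comparison(store, "study_a", "study_b") as temp:
    rows_during = len(store.datasets[temp])

names_after = sorted(store.datasets)
```

The `finally` clause makes sure the merged dataset is removed even if something fails while the comparison is open, so the user never sees a leftover dataset.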

satagopam7 · 2020-01-23T11:17:56Z

Hi Sascha, I’m aware that this needs quite some effort. It needs some discussion. Let’s catch up.

peterbanda · 2020-01-23T19:12:34Z

For cross-study comparison there are essentially three options:

1. Merge transformation - That's exactly the option Sascha pointed out. It is currently supported and works out of the box (although it is allowed only for admins). This transformation can be applied to any number of data sets (not just two), where compatibility is checked by field types. There are two flavors: (a) all fields are merged automatically by name; note that not every field needs to be present in each data set (some can be unique to one); (b) fields are linked manually by name. Regardless of the flavor, the result is a new data set with its own dictionary, views, filters, etc.
2. Data source virtualization - This is something I have presented a couple of times. It would combine several repo sources into one and effectively hide them. The resulting union would again implement the CRUD repo interface, so it would be pluggable wherever a single data source is currently supported. This has already been reported in "Introduce a virtual repo defined as an union of partial repos" #4. The main implementation problem would be sorting (e.g. for box plots); the offset and limit operators could also be tricky. For Akka streaming, the best order-preserving approach would be to use the mergeSorted function, which is currently used for the optimized linking transformation (not yet released).
3. Multi-source visualization - We can of course allow different widgets (charts) in a single view to be generated from different sources, which would mean integration at the visualization level (as was done by Fractalis). This can be supported rather quickly (some ad-hoc experiments along these lines worked), but the resulting artifacts/views would not fit into the single data set abstraction. Therefore new metadata, a tree node/type, controllers, and permissions would need to be introduced (somewhat ad hoc). Also, to allow a closer comparison of field values from different data sets (studies), a multi-field distribution widget would be quite handy (low-hanging fruit). This has already been reported in "Extend the numeric distribution widget to support multiple fields (at once) with grouping" #50.
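To illustrate why sorting, offset, and limit are the tricky part of a virtual union, here is a small Python sketch. It uses `heapq.merge` as an analogue of a sorted stream merge; it is not the Akka implementation, and `virtual_find` and the sample records are hypothetical. It assumes every partial source already returns its records sorted by the requested field.

```python
import heapq
from itertools import islice


def virtual_find(sources, sort_field, offset=0, limit=None):
    """Lazily merge pre-sorted sources into one sorted stream, then page it."""
    merged = heapq.merge(*sources, key=lambda record: record[sort_field])
    stop = None if limit is None else offset + limit
    return list(islice(merged, offset, stop))


# Hypothetical partial sources, each already sorted by "age"
study_a = [{"id": "A1", "age": 50}, {"id": "A2", "age": 64}, {"id": "A3", "age": 71}]
study_b = [{"id": "B1", "age": 48}, {"id": "B2", "age": 66}]

page = virtual_find([study_a, study_b], "age", offset=1, limit=3)
page_ids = [record["id"] for record in page]
```

Note that the offset and limit are applied to the merged stream. A global offset cannot simply be forwarded to each partial source, because how many records to skip in each source depends on how their values interleave.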

Naturally, as is probably implied, proper harmonization is expected for all of the presented options. Moreover, solutions 1 and 3 don’t necessarily require matching fields to have the same names, whereas solution 2 would most likely require that (to keep the implementation clean).
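The difference between merging by name and linking fields manually, together with the type-compatibility check, can be sketched as follows. This is an illustrative Python sketch, not Ada's merge transformation: `link_fields`, `auto_links`, and the type labels ("Integer", "Enum", "Double") are hypothetical.

```python
def link_fields(dictionaries, links):
    """Check that linked fields have compatible types and return the merged dictionary.

    dictionaries: {dataset_name: {field_name: field_type}}
    links: {target_field: {dataset_name: source_field}}
    """
    merged = {}
    for target, sources in links.items():
        types = {dictionaries[ds][field] for ds, field in sources.items()}
        if len(types) != 1:
            raise ValueError(f"Incompatible types for '{target}': {sorted(types)}")
        merged[target] = types.pop()
    return merged


def auto_links(dictionaries):
    """Build links for 'merge by name': each field maps to itself wherever it exists."""
    links = {}
    for ds, fields in dictionaries.items():
        for field in fields:
            links.setdefault(field, {})[ds] = field
    return links


# Hypothetical data set dictionaries (field name -> field type)
dictionaries = {
    "study_a": {"age": "Integer", "gender": "Enum"},
    "study_b": {"age_years": "Integer", "gender": "Enum", "bmi": "Double"},
}

# Manual linking: fields with different names are mapped onto one target field
manual = link_fields(dictionaries, {
    "age": {"study_a": "age", "study_b": "age_years"},
    "gender": {"study_a": "gender", "study_b": "gender"},
})

# Automatic merge by name: "age" and "age_years" stay separate fields
automatic = link_fields(dictionaries, auto_links(dictionaries))
```

With manual linking, "age" and "age_years" end up in one field; with the automatic merge they remain two partially filled fields, which is why harmonized names matter more for approaches that rely on name matching.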

sherzinger · 2020-01-24T08:20:29Z

Hi @peterbanda,

Could you show us (maybe in the meeting next week?) what you did with 2.? I've not seen that yet I think.

Regarding option 3: This is exactly the type of issue I was referring to further up, albeit in less technical language. Technically, injecting some data into a widget is relatively easy, as you mentioned. The problems come from everything else:

How do filters work in this case?
Do we have to prefix every single field with the source dataset, in order to show to the user where the field came from?
What happens when you update the view by clicking on the widgets?
What happens when you update the view by clicking on the foreign study field e.g. in the pie chart?
Will it still be possible to save a view?
How do we indicate everywhere in the UI which studies the filters, fields, and views belong to?
What happens if a multi-dataset view is saved (somehow) and access to the dataset is removed?
How do you specify filters for the fields you pull in from the other dataset?
Do we need to modify every single widget, such that they can account for the new dimension "source study" alongside e.g. "gender"?

These are just some of the questions that came to my mind, and this is largely just UI design. As you correctly pointed out, this would also need to be addressed on an architectural level in many locations.

Maybe limiting option 3 to single analyses/charts (not within a view!) would be doable in a reasonable amount of time if that satisfies the requirements?

And just to underline the fact: Option 1 is already there. We can compare datasets. It just needs to be wrapped in a user-friendly interface.

sherzinger · 2020-01-24T12:31:45Z

Note: I discussed the issue with Venkata and I think we came to an agreement.
I'll prepare a mockup for the next meeting, so we can talk about it in detail.

sherzinger added the labels calculation (issue concerning statistics), server (issues concerning the back end), and UI (issues concerning the user interface) on Jan 22, 2020
