Commit

Merge pull request #13 from mpi2/5-redundant-batch-request-functions
[Batch_solr_request] refactor: function to batch query and multiple value search
dpavam authored Oct 23, 2024
2 parents 89df667 + 25264fe commit ee61e82
Showing 10 changed files with 1,212 additions and 218 deletions.
132 changes: 106 additions & 26 deletions impc_api_helper/README.md
@@ -12,23 +12,43 @@ The functions in this package are intended for use on a Jupyter Notebook.
### Available functions
The available functions can be imported as:

`from impc_api_helper import solr_request, batch_request, iterator_solr_request`
```
from impc_api_helper import solr_request, batch_solr_request
```

### Solr request
## 1. Solr request
The most basic request to the IMPC solr API
```
num_found, df = solr_request( core='genotype-phenotype', params={
'q': '*:*'
'rows': 10
'q': '*:*',
'rows': 10,
'fl': 'marker_symbol,allele_symbol,parameter_stable_id'
}
)
```

#### Solr request validation
A common pitfall when writing a query is the misspelling of `core` and `fields` arguments. For this, we have included an `validate` argument that raises a warning when these values are not as expected. Note this does not prevent you from executing a query; it just alerts you to a potential issue.
### a. Facet request
`solr_request` allows facet requests

```
num_found, df = solr_request(
core="genotype-phenotype",
params={
"q": "*:*",
"rows": 0,
"facet": "on",
"facet.field": "zygosity",
"facet.limit": 15,
"facet.mincount": 1,
},
)
```
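With Solr's default JSON output, the counts for a faceted field come back as one flat list that alternates value and count. The sketch below shows one way such a list could be reshaped into value/count pairs. It is an illustration only, not the package's own code: `facet_pairs` is a hypothetical helper and the counts are made up.

```python
def facet_pairs(flat_counts):
    """Reshape Solr's flat [value, count, value, count, ...] list into (value, count) pairs."""
    if len(flat_counts) % 2 != 0:
        raise ValueError("expected an even number of items (value/count pairs)")
    # Even positions hold the facet values, odd positions hold their counts.
    return list(zip(flat_counts[0::2], flat_counts[1::2]))


# Made-up counts, for illustration only.
example = ["homozygote", 120, "heterozygote", 45, "hemizygote", 7]
pairs = facet_pairs(example)
print(pairs)
```

A list of pairs like this can be passed straight to `pandas.DataFrame(pairs, columns=["zygosity", "count"])` if a DataFrame is preferred.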

### b. Solr request validation
A common pitfall when writing a query is the misspelling of `core` and `fields` arguments. For this, we have included a `validate` argument that raises a warning when these values are not as expected. Note this does not prevent you from executing a query; it just alerts you to a potential issue.


##### Core validation
#### Core validation
```
num_found, df = solr_request( core='invalid_core', params={
'q': '*:*',
@@ -41,7 +61,7 @@ num_found, df = solr_request( core='invalid_core', params={
> dict_keys(['experiment', 'genotype-phenotype', 'impc_images', 'phenodigm', 'statistical-result']))
```

##### Field list validation
#### Field list validation
```
num_found, df = solr_request( core='genotype-phenotype', params={
'q': '*:*',
Expand All @@ -54,31 +74,91 @@ num_found, df = solr_request( core='genotype-phenotype', params={
> To see expected fields check the documentation at: https://www.ebi.ac.uk/mi/impc/solrdoc/
```
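The warn-but-do-not-block behaviour described in this section could look roughly like the sketch below. It is not the package's implementation: `check_query`, `KNOWN_CORES` and `KNOWN_FIELDS` are hypothetical names, the core names are copied from the warning message above, and the field list is an assumed subset used only for this example.

```python
import warnings

# Core names copied from the warning message shown above.
KNOWN_CORES = {
    "experiment",
    "genotype-phenotype",
    "impc_images",
    "phenodigm",
    "statistical-result",
}
# Hypothetical subset of fields, for illustration only.
KNOWN_FIELDS = {
    "genotype-phenotype": {"marker_symbol", "allele_symbol", "parameter_stable_id"},
}


def check_query(core, params):
    """Warn (never raise) about an unexpected core or unexpected `fl` fields."""
    problems = []
    if core not in KNOWN_CORES:
        problems.append(f"Unexpected core: {core!r}")
    else:
        expected = KNOWN_FIELDS.get(core)
        requested = [f.strip() for f in params.get("fl", "").split(",") if f.strip()]
        if expected is not None:
            unknown = [f for f in requested if f not in expected]
            if unknown:
                problems.append(f"Unexpected field(s): {unknown}")
    for message in problems:
        # A warning alerts the user but still lets the query go ahead.
        warnings.warn(message, UserWarning, stacklevel=2)
    return problems


check_query("genotype-phenotype", {"q": "*:*", "fl": "marker_symbol,invalid_field"})
```

Returning the list of problems as well as warning makes the check easy to reuse in scripts that want to stop on a suspicious query.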

### Batch request
For larger requests, use the batch request function to query the API responsibly.
## 2. Batch Solr Request
`batch_solr_request` is available for large queries. This solves issues where a request is too large to fit into memory or where it puts a lot of strain on the API.

Use `batch_solr_request` for:
- Large queries (more than 1,000,000 results)
- Querying multiple items in a list
- Downloading data in `json` or `csv` format.

### Large queries
For large queries you can choose between seeing them in a DataFrame or downloading them in `json` or `csv` format.

### a. Large query - see in DataFrame
This will fetch your data using the API responsibly and return a Pandas DataFrame

When your request is larger than recommended and you have not opted for downloading the data, a warning will be presented and you should follow the instructions to proceed.

```
df = batch_solr_request(
core='genotype-phenotype',
params={
'q':'*:*'
},
download=False,
batch_size=30000
)
print(df.head())
```
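Solr paginates results with its standard `start` and `rows` parameters, so a large query can be split into consecutive windows of `batch_size` documents. The sketch below only computes those windows. It is an assumption about how batching might be organised, not a description of what `batch_solr_request` does internally, and `batch_windows` is a hypothetical helper.

```python
def batch_windows(num_found, batch_size):
    """Yield (start, rows) pairs that cover num_found documents in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_found, batch_size):
        # The last window may be smaller than batch_size.
        yield start, min(batch_size, num_found - start)


windows = list(batch_windows(num_found=70000, batch_size=30000))
print(windows)  # [(0, 30000), (30000, 30000), (60000, 10000)]
```

Each pair could then be merged into the query parameters (`params["start"]`, `params["rows"]`) for one request per window.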

### b. Large query - Download
When using the `download=True` option, a file with the requested information will be saved as `filename`. The format is selected based on the `wt` parameter.
A DataFrame is also returned, provided it fits in the memory available on your machine. If the DataFrame is too large, an error is raised. In that case, we recommend reading the downloaded file in batches/chunks.

```
df = batch_solr_request(
core='genotype-phenotype',
params={
'q':'*:*',
'wt':'csv'
},
download=True,
filename='geno_pheno_query',
batch_size=100000
)
print(df.head())
```
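When a downloaded CSV is too large to load at once, pandas can read it in chunks through the `chunksize` argument of `read_csv`. The sketch below writes a tiny stand-in file first so that it runs on its own. The file name with a `.csv` extension, the columns and the values are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a downloaded file: a tiny CSV with made-up values.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "geno_pheno_query.csv")
pd.DataFrame(
    {
        "marker_symbol": ["Zfp580", "Firrm", "Gpld1", "Mbip", "Zfp580"],
        "p_value": [0.01, 0.20, 0.03, 0.50, 0.04],
    }
).to_csv(csv_path, index=False)

# Read two rows at a time and keep only the rows of interest from each chunk,
# so the whole file never has to sit in memory at once.
kept = []
for chunk in pd.read_csv(csv_path, chunksize=2):
    kept.append(chunk[chunk["p_value"] < 0.05])

significant = pd.concat(kept, ignore_index=True)
print(significant)
```

For a real download, a much larger `chunksize` (tens of thousands of rows) is usually more practical. The value of 2 here only makes the chunking visible on a five-row file.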

### c. Query by multiple values
`batch_solr_request` can also search for multiple items in a list, provided they all belong to the same field.
Pass the list as `field_list` and name the field its values belong to in `field_type`.

```
df = batch_request(
core="genotype-phenotype",
# List of gene symbols
genes = ["Zfp580","Firrm","Gpld1","Mbip"]
df = batch_solr_request(
core='genotype-phenotype',
params={
'q': 'top_level_mp_term_name:"cardiovascular system phenotype" AND effect_size:[* TO *] AND life_stage_name:"Late adult"',
'fl': 'allele_accession_id,life_stage_name,marker_symbol,mp_term_name,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name,effect_size'
'q':'*:*',
'fl': 'marker_symbol,mp_term_name,p_value',
'field_list': genes,
'field_type': 'marker_symbol'
},
batch_size=100
download = False
)
print(df.head())
```
The results of a multiple-value query can also be downloaded:

### Iterator solr request
To pass a list of different fields and download a file with the information
```
# Genes example
# List of gene symbols
genes = ["Zfp580","Firrm","Gpld1","Mbip"]
# Initial query parameters
params = {
'q': "*:*",
'fl': 'marker_symbol,allele_symbol,parameter_stable_id',
'field_list': genes,
'field_type': "marker_symbol"
}
iterator_solr_request(core='genotype-phenotype', params=params, filename='marker_symbol', format ='csv')
df = batch_solr_request(
core='genotype-phenotype',
params={
'q':'*:*',
'fl': 'marker_symbol,mp_term_name,p_value',
'field_list': genes,
'field_type': 'marker_symbol'
},
download = True,
filename='gene_list_query'
)
print(df.head())
```
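In Solr query syntax, several values for one field can be matched with `field:("a" OR "b")`. The sketch below builds such a clause from a list of values. How `batch_solr_request` actually combines `field_list` and `field_type` is not shown in this README, so treat this as an assumption: `values_to_query` is a hypothetical helper.

```python
def values_to_query(field, values):
    """Build a Solr clause of the form field:("a" OR "b" ...) with each value quoted."""
    if not values:
        raise ValueError("values must not be empty")
    quoted = []
    for value in values:
        # Escape backslashes first, then double quotes, so each value stays one phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:({' OR '.join(quoted)})"


genes = ["Zfp580", "Firrm", "Gpld1", "Mbip"]
query = values_to_query("marker_symbol", genes)
print(query)  # marker_symbol:("Zfp580" OR "Firrm" OR "Gpld1" OR "Mbip")
```

For long lists, the values could be split into smaller groups and sent as one clause per request, which keeps each request URL short.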



6 changes: 3 additions & 3 deletions impc_api_helper/impc_api_helper/__init__.py
@@ -1,6 +1,6 @@
from .solr_request import solr_request, batch_request
from .iterator_solr_request import iterator_solr_request
from .solr_request import solr_request
from .batch_solr_request import batch_solr_request
from .utils import validators, warnings

# Control what gets imported by client
__all__ = ["solr_request", "batch_request", "iterator_solr_request"]
__all__ = ["solr_request", "batch_solr_request"]