Data uploader tool for TranSMART, specifically for PostgreSQL databases. For loading observations, it copies data from table-specific data files to the database tables, only substituting indexes with database identifiers for subjects, trial visits and studies.
The latest version can be downloaded here: transmart-copy-17.2.15.jar.

```shell
# Download transmart-copy
curl -f -L https://repo.thehyve.nl/service/local/repositories/releases/content/org/transmartproject/transmart-copy/17.2.15/transmart-copy-17.2.15.jar -o transmart-copy.jar
```
Usage:

```shell
java -jar transmart-copy.jar [-h|--help] [--delete <STUDY_ID>]
```
Parameters:

- `-d`, `--directory`: Specifies a data directory.
- `-m`, `--mode <study|pedigree>`: Load mode, specifies what type(s) of data to load (default: `study`).
- `-I`, `--incremental`: Enable incremental loading of patient data for a study (supported only for study mode).
- `-D <STUDY_ID>`, `--delete <STUDY_ID>`: Deletes the study with id `<STUDY_ID>` and related data.
- `-U`, `--update-concept-paths`: Workaround. Updates concept paths, names and tree nodes when there is a concept code collision.
- `-n`, `--base-on-max-instance-num`: Adds a base to each `instance_num` to avoid primary key collisions in `observation_fact`. The base is autodetected as `max(observation_fact.instance_num)`.
- `-i`, `--drop-indexes`: Drop indexes when loading.
- `-r`, `--restore-indexes`: Restore indexes.
- `-u`, `--unlogged`: Set the observations table to unlogged when loading.
- `-v`, `--vacuum-analyze`: Vacuum analyze the `observation_fact` table.
- `-b`, `--batch-size`: Number of observations to insert in a batch (default: `500`).
- `-f`, `--flush-size`: Number of batches to flush to the database (default: `1000`).
- `-w <file>`, `--write <file>`: Write observations to TSV file `<file>`.
- `-p`, `--partition`: Partition the `observation_fact` table based on `trial_visit_num` (experimental).
- `-h`, `--help`: Shows the available parameters and exits.
- `-V`, `--version`: Prints the application version and exits.
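Combining the flags above, a full reload of one study could be sketched as follows. The study id and data directory are placeholders; the flags are taken from the list above.

```shell
# Illustrative only: build the commands for deleting and then reloading
# a study. STUDY_ID and ./mystudy are placeholders.
STUDY_ID=EXAMPLE_STUDY
DELETE_CMD="java -jar transmart-copy.jar --delete $STUDY_ID"
# -i/-r drop and restore indexes around the load; -v vacuum-analyzes
# observation_fact afterwards.
LOAD_CMD="java -jar transmart-copy.jar -d ./mystudy -i -r -v"
echo "$DELETE_CMD"
echo "$LOAD_CMD"
```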
The program reads table data from the current working directory
and inserts new rows into the database if a row with the same identifier
does not yet exist.
The input directory should have the same structure as the database:
two directories, `i2b2metadata` and `i2b2demodata`, representing the schemas,
containing `.tsv` files for each of the tables.
See the example directory for an example of the directory structure.
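As a sketch, such a directory could be set up like this. The table file names are illustrative (include only the tables you actually load), and `mystudy` is a placeholder.

```shell
# Create the two schema directories with a few table files (empty here,
# just to show the layout; real files contain tab-separated data).
mkdir -p mystudy/i2b2metadata mystudy/i2b2demodata
touch mystudy/i2b2metadata/i2b2_secure.tsv
touch mystudy/i2b2demodata/study.tsv \
      mystudy/i2b2demodata/concept_dimension.tsv \
      mystudy/i2b2demodata/patient_dimension.tsv \
      mystudy/i2b2demodata/patient_mapping.tsv \
      mystudy/i2b2demodata/observation_fact.tsv
find mystudy -type f | sort
```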
For shared data (used across studies), the identifiers of existing records are fetched first. If a record already exists, the data is not updated. The tool only adds new records.
For patients, visits, studies, trial visits, dimension descriptions and relation types,
identifiers are generated by the database. In the `.tsv` files, an index
should be used in these columns. E.g., the first data row has the number `0`
in the identifier column instead of the identifier.
We always assume the first row to have the column names, exactly matching
the columns that exist in the database.
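For example, a minimal `study.tsv` might look as sketched below. The column names are an assumption based on a typical `study` table; the key point is that the header row names the database columns and the identifier column carries the index `0` instead of a database-generated id.

```shell
# Hypothetical study.tsv: the identifier column (study_num, assumed here)
# holds the index 0, not a real database identifier.
printf 'study_num\tstudy_id\tsecure_obj_token\n'  > study.tsv
printf '0\tEXAMPLE_STUDY\tPUBLIC\n'              >> study.tsv
cat study.tsv
```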
If a study in the input data already exists in the database, the program aborts, unless incremental data loading is enabled.
For shared data, the following columns are used to identify if a record already exists:
| Table | Column(s) |
|---|---|
| `patient_dimension` | the `patient_ide` and `patient_ide_source` columns of `patient_mapping` are used |
| `visit_dimension` | the `encounter_ide` and `encounter_ide_source` columns of `encounter_mapping` are used |
| `concept_dimension` | `concept_cd` |
| `study` | `study_id` |
| `i2b2_secure` | `path` |
| `i2b2_tags` | `(path, tag_type, tags_idx)` |
| `relation_type` | `label` |
Currently, for patients and visits, only one identifier is allowed per patient or visit. I.e.,
if the mapping contains multiple identifiers (from different sources) for the same patient, it fails.
The patients and visits in the mapping files are expected to be numbered consecutively, starting from 0.
The `patient_ide_source` is expected to be `SUBJ_ID`.
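A sketch of a matching `patient_mapping.tsv` (column names assumed from the i2b2 `patient_mapping` table): `patient_num` carries the consecutive index starting at 0, and `patient_ide_source` is `SUBJ_ID`.

```shell
# Two subjects, numbered 0 and 1 in the (assumed) patient_num column.
printf 'patient_ide\tpatient_ide_source\tpatient_num\n' > patient_mapping.tsv
printf 'SUBJ_001\tSUBJ_ID\t0\n' >> patient_mapping.tsv
printf 'SUBJ_002\tSUBJ_ID\t1\n' >> patient_mapping.tsv
cat patient_mapping.tsv
```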
Observations are inserted without checking, because it is assumed that no
data for the study is already present in the database.
For incremental data loading, pass the `--incremental` or `-I` option. Then, prior to data loading,
all observations for patients in the input data are deleted for the studies that are uploaded.
This allows updating data for a subset of patients of an existing study.
For relations data, the `relation` table is first truncated, and then
the data from `relation.tsv` is loaded.
With the `--delete` parameter, a study and associated data (trial visits, observations)
can be deleted from the database.
The database settings are read from the environment variables:

- `PGHOST`: the hostname of the database server (default: `localhost`)
- `PGPORT`: the database server port (default: `5432`)
- `PGDATABASE`: the database name (default: `transmart`)
- `PGUSER`: the database admin user (required)
- `PGPASSWORD`: the password of the admin user (required)
- `MAXPOOLSIZE`: maximum connection pool size, to avoid a "too many clients" issue in the database (default: `8`)
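Putting it together, a load against a local database could be configured as sketched below. Credentials and the data directory are placeholders, and the final command is left commented out since it needs a running database.

```shell
# Connection settings read by transmart-copy (placeholder credentials).
export PGHOST=localhost
export PGPORT=5432
export PGDATABASE=transmart
export PGUSER=tm_admin        # placeholder admin user
export PGPASSWORD=secret      # placeholder password
# With the settings in place, run the load:
# java -jar transmart-copy.jar -d ./mystudy
echo "Loading into $PGDATABASE on $PGHOST:$PGPORT"
```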
To build the application:

```shell
gradle shadowJar
```

Run it with the wrapper script:

```shell
./transmart-copy.sh [-h|--help] [--delete <STUDY_ID>]
```