Benchmark for the encoding of PDB to UniProt residue-level mappings from the EBI SIFTS resource in compressed columnar data formats.
The SIFTS (Structure Integration with Function, Taxonomy and Sequence) project provides residue-level mappings between PDB sequences, PDB residues, and UniProt residues [1].
SIFTS provides a mapping XML file for each PDB entry, e.g., 1xyz.xml.gz.
In this project we explore the use of columnar data formats to represent the residue-level mappings for the entire PDB in a single file that is efficient to download and process.
Columnar data formats can achieve very high levels of compression due to their column-wise data representation and column-wise packing strategies, including built-in bit-packing, delta and run-length encoding, followed by entropy encoding.
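To illustrate why these encodings work so well on residue-level mapping data, here is a minimal, simplified sketch of delta and run-length encoding applied to a residue-number column. The data are made up for illustration, and real Parquet/ORC encoders are considerably more sophisticated.

```python
def delta_encode(values):
    """Replace each value by its difference from the previous value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

# Residue numbers are typically consecutive within a chain, so the
# deltas are almost all 1 and collapse into a handful of runs.
residue_numbers = list(range(1, 301)) + list(range(1, 201))  # two chains
deltas = delta_encode(residue_numbers)
print(run_length_encode(deltas))
# [(1, 300), (-299, 1), (1, 199)] -- 500 values reduced to 3 pairs
```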
SIFTS residue-level mappings were downloaded on July 28, 2018, resulting in 105,594,955 residue-level mappings. The encoded files were generated with the CreatePdbToUniProtMappingFile command-line application.
The data were converted to Parquet [2] and ORC [3] files and compressed with the compression codecs available in Apache Spark.
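A minimal PySpark sketch of such a conversion is shown below. The input path, the schema, and the output paths are placeholders; the actual files were produced with the CreatePdbToUniProtMappingFile application mentioned above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SiftsEncoding").getOrCreate()

# Hypothetical input: one row per residue-level mapping
# (file name and column names are assumptions, not the actual schema).
df = spark.read.csv("sifts_mappings.csv", header=True, inferSchema=True)

# Parquet with gzip compression
df.write.option("compression", "gzip").parquet("pdb2uniprot.parquet.gzip")

# ORC with lzo compression
df.write.option("compression", "lzo").orc("pdb2uniprot.orc.lzo")
```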
Dataset name | File format | Compression codec | Size (MB) |
---|---|---|---|
xml_gzip | xml | gzip | ~5200 |
csv_gzip | csv | gzip | 519.7 |
parquet_snappy | parquet | snappy | 145.1 |
parquet_gzip | parquet | gzip | 57.9 |
orc_zlib | orc | zlib | 41.9 |
orc_lzo | orc | lzo | 41.7 |
In terms of file size, the Parquet files with gzip compression and the ORC files with lzo compression are the best options for representing the SIFTS mapping data.
To evaluate the performance of operating on these datasets, we set up four benchmarks for the two best datasets (orc_lzo and parquet_gzip); a sketch of the benchmark operations follows the results table below. The benchmarks were run on a MacBook Pro (Retina, 13-inch, Late 2013; 2.8 GHz Intel Core i7; 16 GB 1600 MHz DDR3; SSD drive).
Benchmark | orc_lzo (seconds) | parquet_gzip (seconds) |
---|---|---|
Count | 3.7 | 4.1 |
Query | 11.9 | 20.2 |
Join | 12.0 | 23.3 |
Convert | 6.0 | 7.9 |
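The four operations correspond roughly to the hedged PySpark sketch below. The column name `pdbId` and the lookup values are assumptions, and interpreting the Convert step as writing the result to another format is likewise an assumption; the benchmark notebooks define the exact operations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SiftsBenchmarks").getOrCreate()
df = spark.read.orc("pdb2uniprot.orc.lzo")  # or spark.read.parquet(...)

# Count: total number of residue-level mappings
df.count()

# Query: filter on a column ("pdbId" is an assumed column name)
df.filter(df.pdbId == "1XYZ").count()

# Join: join against a small table of PDB entries of interest
entries = spark.createDataFrame([("1XYZ",), ("2ABC",)], ["pdbId"])
df.join(entries, on="pdbId").count()

# Convert: materialize the result in another representation
df.write.option("compression", "snappy").parquet("converted.parquet")
```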
Due to efficient indexing and predicate pushdown, the ORC file format outperforms the Parquet file format for this dataset.
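Predicate pushdown can be checked in PySpark by inspecting the physical plan: with ORC filter pushdown enabled, a pushed filter appears in the `PushedFilters` section of the file scan node. The column name is again an assumption, and the exact plan output varies between Spark versions.

```python
# Enable ORC predicate pushdown (enabled by default in newer Spark versions).
spark.conf.set("spark.sql.orc.filterPushdown", "true")

df = spark.read.orc("pdb2uniprot.orc.lzo")
df.filter(df.pdbId == "1XYZ").explain()
# Look for "PushedFilters: [IsNotNull(pdbId), EqualTo(pdbId,1XYZ)]"
# in the printed physical plan.
```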
Jupyter notebooks for this benchmark are available for three dataframe implementations: Pandas, PySpark, and Spark.
For this benchmark, the entire dataset was encoded in the two compressed columnar file formats. Each file was then read completely into memory and the parsing times were recorded in seconds; a sketch of this read benchmark follows the table below. Note that it is generally not necessary to load the whole dataset into PySpark/Spark; these timings are provided to compare performance with Pandas, which always loads all the data.
Dataset name | Pandas[4] (seconds) | PySpark[5] (seconds) | Spark[6] (seconds) |
---|---|---|---|
parquet_gzip | 86 | 164 | 177 |
parquet_snappy | 88 | 144 | 148 |
orc_zlib | na | 92 | 92 |
orc_lzo | na | 85 | 85 |
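A hedged sketch of this read benchmark is shown below. The file path is a placeholder, Pandas needs the pyarrow (or fastparquet) engine to read Parquet, and the cache-plus-count idiom is one common way to force Spark to materialize the full dataset.

```python
import time
import pandas as pd
from pyspark.sql import SparkSession

path = "pdb2uniprot.parquet.gzip"  # placeholder path

# Pandas: read_parquet always loads the full dataset into memory.
start = time.time()
pdf = pd.read_parquet(path)
print(f"pandas: {time.time() - start:.1f} s, {len(pdf):,} rows")

# PySpark: cache + count forces a full read and materialization.
spark = SparkSession.builder.appName("ReadBenchmark").getOrCreate()
start = time.time()
df = spark.read.parquet(path).cache()
n = df.count()  # action triggers the actual read
print(f"pyspark: {time.time() - start:.1f} s, {n:,} rows")
```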
[1] Velankar et al., Nucleic Acids Research 41, D483 (2013)
[2] Apache Parquet
[3] Apache ORC
[4] Pandas
[5] PySpark
[6] Apache Spark