Additional format support for the Python Libraries, Google Summer of Code 2019

Xavier Figueroa edited this page Sep 4, 2019 · 3 revisions

Project Overview

SPDX is an open standard for conveying the components, licenses and copyright information of software in an unambiguous way that is both human- and machine-readable.

The SPDX community has developed supporting materials such as the SPDX specification and tools for several programming languages, among others.

Among the programming language tools, there is a Python tool that allows its users to read and write SPDX documents represented in two formats: RDF/XML and tag/value.

This Google Summer of Code 2019 project consists of extending the format support to include the JSON, YAML and XML formats.

A wider range of formats for interchanging SPDX documents should make their adoption easier, because the formats will fit the habits and guidelines of more and more development communities. This will help spread the standard and bring SPDX closer to its goals.

Work Summary

To achieve the project goals, three main components had to be implemented: parsers, builders and writers. Since this project extends an already available capability (parsing and creating SPDX documents), every piece of code was written to fit the established code style.

Parsers are responsible for taking the information from format-specific files and performing some shallow validations, such as checking for the presence or absence of required fields.

After the parsers have done their job, the parsed information is passed to the builders, which are responsible for deeper validations, such as verifying the correctness of the field formats and the building order, and for finally storing all the information in the library models.
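
The split between the two validation levels can be sketched as follows. This is a minimal illustration, not the library's code: the field names, the `DocumentBuilder` class and the dictionary used as a model are all hypothetical stand-ins.

```python
import re

# Hypothetical field names, chosen only for this sketch.
REQUIRED_FIELDS = ("specVersion", "name")


def parse_document(raw):
    """Parser step: shallow validation, only checks that required fields exist."""
    missing = [field for field in REQUIRED_FIELDS if field not in raw]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return raw


class DocumentBuilder:
    """Builder step: deeper validation, then storage in a stand-in model."""

    VERSION_RE = re.compile(r"SPDX-(\d+)\.(\d+)")

    def __init__(self):
        self.model = {}

    def set_version(self, value):
        # Each field may be set only once.
        if "version" in self.model:
            raise ValueError("version was already set")
        # Field format check: the value must look like "SPDX-<major>.<minor>".
        match = self.VERSION_RE.fullmatch(value)
        if match is None:
            raise ValueError("malformed version: %r" % value)
        self.model["version"] = (int(match.group(1)), int(match.group(2)))

    def set_name(self, value):
        # Building order check: the version has to be set before the name.
        if "version" not in self.model:
            raise ValueError("set the version before the name")
        self.model["name"] = value


def build(parsed):
    """Feed the parsed fields to the builder in the expected order."""
    builder = DocumentBuilder()
    builder.set_version(parsed["specVersion"])
    builder.set_name(parsed["name"])
    return builder.model
```

Calling `build(parse_document(raw))` on a dictionary with both fields returns the populated model, while a missing field fails in the parser and a malformed version fails in the builder.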

In the other direction of the process, writers are responsible for taking the SPDX document information from the library models and creating a format-specific file with that information in it.

Parsers were created from scratch, and just one set of them was needed to parse JSON, YAML and XML. To handle the three new formats, format-specific interfaces were created to load each file using the most suitable library for it (the json Python module, PyYAML or xmltodict). After a file is loaded that way, all the information has the same structure regardless of whether it comes from JSON, YAML or XML, so it can be handled the same way.
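
A rough sketch of that idea, under stated assumptions: the loader functions, the `LOADERS` table and the `Document`/`name` keys are hypothetical names for illustration, not the library's interfaces. PyYAML and xmltodict are imported lazily so the JSON path runs with the standard library alone.

```python
import io
import json


def load_json(stream):
    """Load JSON text into plain dictionaries and lists."""
    return json.load(stream)


def load_yaml(stream):
    """Load YAML text; requires PyYAML, imported only when this is called."""
    import yaml
    return yaml.safe_load(stream)


def load_xml(stream):
    """Load XML text; requires xmltodict, imported only when this is called."""
    import xmltodict
    return xmltodict.parse(stream.read())


# One loader per format; everything after loading is shared.
LOADERS = {"json": load_json, "yaml": load_yaml, "xml": load_xml}


def load_document(stream, fmt):
    """Return the document as nested dictionaries, whatever the source format."""
    return LOADERS[fmt](stream)


def document_name(data):
    """Format-independent handling: read a field from the loaded structure."""
    # "Document" and "name" are hypothetical keys used only in this sketch.
    return data["Document"]["name"]


json_source = io.StringIO('{"Document": {"name": "Sample"}}')
loaded = load_document(json_source, "json")
```

The same `document_name` function would then work on the result of the YAML or XML loader, provided the source files carry the same keys.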

Writers were created the same way. The information from the models is first put into a single common representation, and then the format-specific library is responsible for creating the file.
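
The writing direction can be sketched in the same spirit. The `Document` class, its attributes and the output keys below are hypothetical stand-ins for the library models; only the JSON writer is exercised here, and the YAML and XML writers assume PyYAML and xmltodict are installed.

```python
import io
import json


class Document:
    """Hypothetical stand-in for a library model."""

    def __init__(self, name, version):
        self.name = name
        self.version = version


def document_to_dict(document):
    """Common representation shared by every writer (keys are illustrative)."""
    return {"Document": {"name": document.name, "specVersion": document.version}}


def write_json(data, out):
    json.dump(data, out, indent=4)


def write_yaml(data, out):
    import yaml  # PyYAML, imported only when this writer is used
    yaml.safe_dump(data, out, default_flow_style=False)


def write_xml(data, out):
    import xmltodict  # imported only when this writer is used
    # unparse expects a single root key, which document_to_dict provides.
    out.write(xmltodict.unparse(data, pretty=True))


WRITERS = {"json": write_json, "yaml": write_yaml, "xml": write_xml}


def write_document(document, out, fmt):
    """Convert the model once, then let the format-specific library write it."""
    WRITERS[fmt](document_to_dict(document), out)


buffer = io.StringIO()
write_document(Document("Sample", "SPDX-2.1"), buffer, "json")
```

Keeping the conversion in one function means that supporting another format only requires a new entry in `WRITERS`.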

Since builders do not differ much across formats, a lot of legacy code was reused by inheritance.

There are several examples that show the new parsing and creating capabilities in action.

Related Pull Request: https://github.com/spdx/tools-python/pull/96

Additional Work

Besides the main project tasks, some additional work was done: adding full license expression support and fixing legacy bugs. Here is a summary.

License Expression Support

This library lacks the capability to parse complex license expressions. It is only possible to parse expressions with a single operator (AND or OR), but not combinations of them or WITH exceptions. This additional work consists of integrating the license-expression library to add full support for license expressions. This work is not finished yet.
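
To illustrate what "full support" has to cover, here is a toy recursive-descent parser for expressions that mix AND, OR, WITH and parentheses. It is not the license-expression library and does not use its API; it is only a sketch that assumes WITH binds tighter than AND, and AND tighter than OR. It also does not check that the right-hand side of WITH is a plain identifier.

```python
import re

# A token is a parenthesis or a run of identifier characters.
TOKEN_RE = re.compile(r"\s*(\(|\)|[A-Za-z0-9.+-]+)")
OPERATORS = ("AND", "OR", "WITH")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expression(text):
    """Return a nested tuple tree such as ("OR", left, right)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_with()
        while peek() == "AND":
            advance()
            node = ("AND", node, parse_with())
        return node

    def parse_with():
        node = parse_atom()
        if peek() == "WITH":
            advance()
            node = ("WITH", node, parse_atom())
        return node

    def parse_atom():
        token = peek()
        if token is None or token == ")" or token in OPERATORS:
            raise ValueError("expected a license identifier")
        advance()
        if token == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        return token

    tree = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: %s" % peek())
    return tree
```

For example, `parse_expression("MIT AND Apache-2.0 OR GPL-2.0 WITH Classpath-exception-2.0")` groups the AND pair and the WITH pair under a single OR node, which a parser limited to one operator cannot express.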

Related Pull Request: https://github.com/spdx/tools-python/pull/111


Several legacy bugs and issues were encountered and fixed while working on the project, not just to unblock the project's development but also to make spdx-tools better. The following is a list of them.

Duplicate extracted licenses when parsing from RDF

The list of extracted licenses contained duplicate objects when they came from RDF documents. Extracted licenses were being added as many times as they were encountered in other license types, such as concluded or declared licenses. Now extracted licenses are parsed and added only once, when the extracted licenses section of the document is parsed.

Related issue: https://github.com/spdx/tools-python/issues/97

Related Pull Request: https://github.com/spdx/tools-python/pull/98


Version model fields must be integers but sometimes they aren't

Because a Version object was created explicitly in a method's default parameter, Version fields that are supposed to be integers were being stored as strings, causing problems in type-sensitive formats such as JSON. Now that Version object is created with the method suited to handling string-like integers.
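
The effect is easy to reproduce with a small stand-in. The `Version` class below is a hypothetical sketch written for this page, not the library's model; it only shows why string fields matter once the value reaches a type-sensitive format.

```python
import json
import re


class Version:
    """Hypothetical stand-in for a version model with two numeric fields."""

    def __init__(self, major, minor):
        # The constructor stores whatever it is given, strings included.
        self.major = major
        self.minor = minor

    @classmethod
    def from_str(cls, value):
        """Build a Version from text such as "2.1", converting to integers."""
        match = re.fullmatch(r"(\d+)\.(\d+)", value)
        if match is None:
            raise ValueError("malformed version: %r" % value)
        return cls(int(match.group(1)), int(match.group(2)))


# Explicit construction with strings keeps the fields as strings.
as_strings = Version("2", "1")
# The conversion method yields integers.
as_integers = Version.from_str("2.1")

print(json.dumps({"major": as_strings.major}))   # {"major": "2"}
print(json.dumps({"major": as_integers.major}))  # {"major": 2}
```

In JSON the two outputs are different types, so a consumer expecting a number would reject or misread the first one.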

Related issue: https://github.com/spdx/tools-python/issues/102

Related Pull Request: https://github.com/spdx/tools-python/pull/103


rdflib objects are being stored in SPDX models

When parsing RDF documents, some fields were being stored as rdflib objects (rdflib is the library used to parse RDF files). This was causing difficulties when converting from RDF to other formats. Now all information from RDF files is stored as Python types or as this library's models.
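
One way to picture the fix is a conversion step applied before a value is stored. The sketch below does not import rdflib so that it runs anywhere: `FakeLiteral` is a hypothetical stand-in that mimics the `toPython()` method offered by rdflib's `Literal`, and `to_plain` is an illustrative helper, not a function of this library.

```python
class FakeLiteral(str):
    """Hypothetical stand-in for an RDF literal: a str subclass with toPython()."""

    def toPython(self):
        return str(self)


def to_plain(value):
    """Return a plain Python value, converting RDF-style objects when possible."""
    converter = getattr(value, "toPython", None)
    if callable(converter):
        return converter()
    return value


raw_value = FakeLiteral("Sample package")
stored_value = to_plain(raw_value)

# The stored value is an exact built-in type, not a library-specific subclass.
print(type(raw_value) is str)     # False
print(type(stored_value) is str)  # True
```

Storing only built-in types keeps the models independent of the parsing library, which matters for writers that inspect exact types when serializing.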

Related issue: https://github.com/spdx/tools-python/issues/91

Related Pull Request: https://github.com/spdx/tools-python/pull/110


Section artifactOf is not completely supported by RDF parsers and writers

RDF parsers do not handle the projectUri field, which is part of the artifactOf section, and writers do not write the artifactOf section at all. This work is not finished yet.

Related issue: https://github.com/spdx/tools-python/issues/104

Related Pull Request: https://github.com/spdx/tools-python/pull/115


Some of the pull requests linked above have already been merged, and the others are expected to be merged after Google Summer of Code 2019 ends.