RFC: Design of fetool #5

Open
wants to merge 4 commits into
base: main
107 changes: 107 additions & 0 deletions sparkfe/design_of_fetool.md
@@ -0,0 +1,107 @@
- Start Date: 2021-05-13
- Target Major Version: 1.0
- Reference Issues: https://github.com/4paradigm/SparkFE/issues/58 and https://github.com/4paradigm/SparkFE/issues/69
- Implementation PR:

# Summary

We should provide tools with easy-to-use APIs, such as a Python API, so that users can meet common feature extraction requirements like generating FEDB DDL.


# Basic example

Users can use the `fetool` command to generate the FEDB DDL for creating tables.

```
fetool gen_ddl sql.yaml
```

Users can also use `fetool` commands to complete the following tasks without writing any code.

```
fetool csv_to_parquet /csv_files /parquet_files
fetool sample_parquet /parquet_files
fetool inspect_parquet /parquet_files
fetool check_skew /parquet_files
fetool benchmark 'spark-submit --master local /pyspark_app.py'
```

# Motivation

Users can currently use SparkFE for feature extraction through its SQL API. Development is still required, since SparkFE is a distributed computing library and does not solve problems without specific SQL statements.

However, some common tools are useful across general feature extraction scenarios. For example, we may sample the input dataset and check whether its distribution is balanced. Such tools are general-purpose and can reduce the cost of development when using SparkFE for AI. In addition, if we inspect the distribution of the dataset in advance, the window skew optimization may use that distribution to achieve better performance.

Therefore, providing common tools for feature extraction is useful for developers. The easiest way to use them is from the command line, and we should also provide Python and Java APIs for different developers.

# Detailed design

We want to use Python to wrap the feature extraction tools, since Python is easy to use and integrates well with other programming languages.

Users can install this tool with `pip`, and all functions can be called both from the command-line tool and as Python functions. fetool should be a standard Python package and command-line tool. There are several kinds of scenarios that fetool should support.

* For data processing jobs like converting and sampling datasets, we can use the PySpark API, which lets us submit the jobs from Python.
* For the benchmark tool, which requires running multiple jobs with different Spark distributions, we may use the `subprocess` API to invoke shell commands.
* For utilities written in Java/Scala, like generating FEDB DDL, we may use `py4j` to invoke the Java functions from the Python API and the command line.
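
As an illustration of the `subprocess` scenario above, here is a minimal sketch of how a benchmark helper might time a command. The function name `benchmark`, the `runs` parameter, and the returned list of durations are assumptions made for this example; the RFC does not define them.

```python
import shlex
import subprocess
import sys
import time


def benchmark(command, runs=3):
    """Run `command` `runs` times and return the wall-clock seconds of each run.

    Hypothetical helper: the name and the `runs` parameter are not part of the RFC.
    """
    # Split the quoted command string into an argument list, as a POSIX shell would.
    args = shlex.split(command)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the job exits with a non-zero status.
        subprocess.run(args, check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in for a real job such as 'spark-submit --master local /pyspark_app.py'.
    timings = benchmark(shlex.quote(sys.executable) + ' -c "pass"', runs=2)
    print(["%.3fs" % t for t in timings])
```

Comparing Spark distributions would then amount to calling such a helper once per distribution with the matching `spark-submit` path and comparing the returned timings.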

A Python package can meet the above requirements and is easy to maintain. The codebase would look like this.

```
- python
  setup.py
  requirements.txt
  - fetool
    __init__.py
    fetool.py
    csv_to_parquet.py
    sample_parquet.py
    inspect_parquet.py
    check_skew.py
    gen_fdb_ddl.py
    ......
```

The command to install fetool should be `pip install fetool` or `pip3 install fetool`.

The command-line help should look like this.

```
$ fetool -h
usage: fetool [-h]
              {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark}
              ...

positional arguments:
  {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark}
    version             Print the version of fetool
    csv_to_parquet      csv_to_parquet $input_csv_path $output_parquet_path
    sample_parquet      sample_parquet $parquet_path
    inspect_parquet     inspect_parquet $parquet_path
    check_skew          check_skew $parquet_path
    benchmark           benchmark $command

optional arguments:
  -h, --help            show this help message and exit
```

Developers can extend the functionality by adding new Python scripts and sub-commands to fetool.
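
A sub-command layout like the help output above could be built with the standard `argparse` module. The following is only a sketch: `build_parser`, `main`, the `FETOOL_VERSION` placeholder, and the argument names are assumptions for illustration, and the handlers are stubbed out rather than calling real fetool code.

```python
import argparse

FETOOL_VERSION = "0.1.0"  # placeholder value, not a real release number


def build_parser():
    """Build a parser with one sub-parser per fetool sub-command (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="fetool")
    subparsers = parser.add_subparsers(dest="command")

    subparsers.add_parser("version", help="Print the version of fetool")

    csv = subparsers.add_parser(
        "csv_to_parquet",
        help="csv_to_parquet $input_csv_path $output_parquet_path",
    )
    csv.add_argument("input_csv_path")
    csv.add_argument("output_parquet_path")

    # These three sub-commands all take a single Parquet path.
    for name in ("sample_parquet", "inspect_parquet", "check_skew"):
        sub = subparsers.add_parser(name, help=f"{name} $parquet_path")
        sub.add_argument("parquet_path")

    bench = subparsers.add_parser("benchmark", help="benchmark $command")
    # Stored as `shell_command` so it does not clash with dest="command" above.
    bench.add_argument("shell_command", metavar="command")
    return parser


def main(argv=None):
    """Parse arguments and dispatch; returns a process exit code."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        parser.print_help()
        return 1
    if args.command == "version":
        print(FETOOL_VERSION)
        return 0
    # A real implementation would call the matching script here.
    print(f"would run {args.command} with {vars(args)}")
    return 0
```

Under this layout, adding a new sub-command means registering one more `add_parser` call and a handler for it, for example `main(["check_skew", "/parquet_files"])` when called from Python.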

# Drawbacks

Since fetool is an extension of SparkFE, it has no drawbacks for the existing core project.

The implementation and maintenance cost is small relative to the benefit if the tools are used by most developers, because those developers no longer need to maintain equivalent tools themselves.

# Alternatives

What other designs have been considered? What is the impact of not doing this?

The command-line tool could be implemented in Java, C++ or other programming languages. However, those may need to be compiled before use, which is less convenient than Python.

# Adoption strategy

Users may use the source Python scripts or install the tool with `pip install fetool`.

# Unresolved questions

None.