Test project for pyspark

Goals:

Easy Mode

  1. Use the Spark SQL and DataFrame APIs for data processing
    1. Write SQL code in all src/main/resources/sql/task*/
    2. Write PySpark code for all DataFrames in pyspark_task.py
    3. Optimize imports (the Spark session must be created when a function is invoked, not at import time) for
      1. pyspark_task.py
      2. test_app.py
    4. Add parameters to test_app.py so that you can invoke subsets of tests for
      1. Data Frame
      2. SQLs
      3. Task group
      4. Particular Task
    5. Make sure that all tests pass
      1. run the command

      ./bash/start-docker.sh y y

      2. or, in the master container, execute

      pytest /opt/spark-apps/test
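
Item 3 of the list above asks that the Spark session be created inside a function call rather than at import time. A minimal sketch of one way to do this follows. The helper names `get_spark` and `load_accounts` and the app name are hypothetical, not taken from the repository; the input path is the accounts table listed under Inputs.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_spark(app_name: str = "pyspark_task"):
    """Create the SparkSession on the first call and reuse it afterwards.

    pyspark is imported inside the function, so importing this module
    neither loads pyspark nor starts a JVM.
    """
    from pyspark.sql import SparkSession  # deferred import

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_accounts(path: str = "data/tables/accounts"):
    """Hypothetical task helper: the session is requested only when called."""
    return get_spark().read.parquet(path)
```

With this layout, test modules can import the task module freely, and the session is started only by the first test that actually needs data.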

Hard Mode

  1. Implement easy mode
  2. Create your own data comparison framework (write your own pyspark_task_validator.py)
  3. Test all created transformations for the SQL and DataFrame APIs using pytest-spark (write your own test_app.py)
  4. Add logging to all your functions using decorators (write your own project_logs.py)
  5. Create a Docker image and run a Spark cluster (1 master, 2 workers) on it (add your own Docker Compose file and Dockerfile)
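
The logging requirement in item 4 can be met with a small decorator. The sketch below is only one possible shape for project_logs.py; the logger name, the decorator name `log_call`, and the `add` function are illustrative assumptions.

```python
import functools
import logging
import time

logger = logging.getLogger("project_logs")  # assumed logger name


def log_call(func):
    """Log start, finish, duration and failures of the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        started = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the traceback, then let the caller see the original error.
            logger.exception("failed %s", func.__name__)
            raise
        elapsed = time.perf_counter() - started
        logger.info("finish %s in %.3fs", func.__name__, elapsed)
        return result

    return wrapper


@log_call
def add(a, b):
    """Stand-in for a task function."""
    return a + b
```

`functools.wraps` keeps the original function name and docstring, so log messages and pytest reports still refer to the task function rather than to `wrapper`.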

Extra Hard Mode

  1. Implement hard mode
  2. Create a UI using Flask to execute the implemented tasks. You should be able to
    1. Choose a task from a drop-down list
    2. Choose the method of execution (sql, dataframe or both) from a drop-down list
    3. Start execution with a button
    4. See the logs generated by your script in real time on your web page
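
One way to approach the real-time log requirement is to push log records onto a queue and stream them to the browser as Server-Sent Events. The sketch below uses only the standard library; the class name, function name and stop marker are assumptions made for illustration, and single-line log messages are assumed.

```python
import logging
import queue


class QueueLogHandler(logging.Handler):
    """Logging handler that puts each formatted record onto a queue."""

    def __init__(self, log_queue):
        super().__init__()
        self.log_queue = log_queue

    def emit(self, record):
        self.log_queue.put(self.format(record))


def sse_events(log_queue, stop_marker="__done__", timeout=1.0):
    """Yield queued log lines as Server-Sent Events until stop_marker arrives."""
    while True:
        try:
            line = log_queue.get(timeout=timeout)
        except queue.Empty:
            # An SSE comment line keeps the connection open while idle.
            yield ": keep-alive\n\n"
            continue
        if line == stop_marker:
            return
        yield f"data: {line}\n\n"
```

In a Flask view, the generator could be returned as `Response(sse_events(q), mimetype="text/event-stream")` and consumed in the page with the browser's `EventSource` API, while the task runs in a background thread and puts the stop marker on the queue when it finishes.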

Expert Mode

  1. Implement Extra Hard mode
  2. Make this solution work on any cloud
  3. Add CI/CD to your git project (https://circleci.com/)

Requirements:

  • Docker (on Linux, or with WSL support, to run the bash scripts)
    • 6 cores, 12 GB RAM
      • SPARK_WORKER_CORES : 2 * 3
      • SPARK_WORKER_MEMORY : 2G * 3
      • SPARK_DRIVER_MEMORY : 1G * 3
      • SPARK_EXECUTOR_MEMORY : 1G * 3

How to work with the project environment:

  1. First-time execution needs:

    1. Permissions

    chmod -R 755 ./*

    2. Docker image build

    ./bash/start-docker.sh y

    3. If the bash scripts don't work for you, run the commands below

      docker build --build-arg SPARK_VERSION=3.0.2 --build-arg HADOOP_VERSION=3.2 -t cluster-apache-spark:3.0.2 ./

      docker compose up -d

      docker container exec -it py_spark_test_tasks-spark-master-1 /bin/bash

  2. To connect to the Docker container:

./bash/start-docker.sh n

  3. To run all tests:

./bash/start-docker.sh n y

  4. To run failed tests:

./bash/start-docker.sh n f

  5. To run tasks using the UI, open the link below in your browser:

http://localhost:8000/run_task

Project data

Tasks Description:

  • Task_Description.txt

Inputs:

  • data/tables/accounts/*.parquet
  • data/tables/country_abbreviation/*.parquet
  • data/tables/transactions/*.parquet

Outputs:

  • data/df/task.../...
  • data/sql/task.../...

Expected outputs:

  • test/task1/expected_output/..
  • test/task../expected_output/..

Project implementation files

  • src/pyspark_task.py - DataFrame and SQL definitions

  • src/pyspark_task_validator.py - module to invoke and test the DataFrame and SQL definitions

  • src/sql/.. - SQL files with the same logic as the DataFrames

  • src/web/.. - Flask web UI for task invocation

  • test/test_app.py - definitions of all tests

  • bash/start-docker.sh - script to start the project

  • bash/... - other files related to the Spark environment config
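
The validator module listed above has to decide whether a produced output matches the expected one. A minimal sketch of such a comparison, assuming rows are compared as an unordered multiset; `compare_rows` is a hypothetical name, not the repository's API.

```python
from collections import Counter


def compare_rows(actual, expected):
    """Compare two row collections as multisets, ignoring row order.

    Returns (missing, unexpected): rows that were expected but are absent,
    and rows that are present but were not expected, each mapped to its
    surplus count.
    """
    actual_counts = Counter(map(tuple, actual))
    expected_counts = Counter(map(tuple, expected))
    missing = expected_counts - actual_counts
    unexpected = actual_counts - expected_counts
    return dict(missing), dict(unexpected)
```

With PySpark this could be called as `compare_rows(actual_df.collect(), expected_df.collect())`, provided both DataFrames select their columns in the same order and are small enough to collect on the driver. For larger outputs, `DataFrame.exceptAll` performs a comparable multiset difference inside Spark.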