
Distributed Systems (521290S) - Course Project Report

Jakuri

| Group Member | Student ID | Email |
| --- | --- | --- |
| Jarkko Kotaniemi | 2676270 | [email protected] |
| Vili Tiinanen | 2612506 | [email protected] |

Course project overview

This project is a distributed function-solving program. A broker distributes functions to workers, which solve the function and return the output. Distributing the functions relates to code migration topics; we might have to virtualize our function-running environments. The broker acts as a coordinator for the workers. The system uses a publish-subscribe architecture with event-based coordination. We plan to use Python for this. Some open-source Function-as-a-Service solutions might be applicable to the problem.

We will measure our code migration solution against a static code implementation. Results per second is the measured performance indicator. We’ll measure this performance at different scales (1..30 instances). We’ll measure each case 10 times and use the average. We will implement at least one function to test; an interesting test case would be brute-forcing a hash function.

To reduce our scope, we plan to implement the static code part first. If we have enough time, we will do the dynamic code migration part.

1. Architecture

(Architecture diagram)

The clients subscribe to the event bus, where the coordinator server publishes a desired output and a function (the function itself is contained in the clients), e.g., a hash that it wants cracked. The clients then indicate their readiness to complete the task, and the broker gives each of them a set of inputs to compute. The clients compute the function with their set of inputs and return whether they were successful; if successful, they also return the correct input.
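As a rough illustration of this exchange, the event payloads could look something like the following. The field names and structure here are only an assumption, not our final wire format.

```python
# Sketch of the event payloads exchanged over the bus.
# Field names are illustrative assumptions, not the real wire format.

job_announcement = {      # published by the coordinator
    "function": "shacrackprod",            # name of a function the workers already contain
    "target": "87e93406a19d1116...",       # the desired output, e.g. the hash to crack
}

worker_ready = {          # published by a worker that is free
    "worker": "Worker-3f2a...",            # see section 4, Naming
}

job_assignment = {        # published by the coordinator to one worker
    "worker": "Worker-3f2a...",
    "inputs": {"start": 0, "end": 100000}, # the batch of inputs to try
}

job_result = {            # published by the worker when the batch is done
    "worker": "Worker-3f2a...",
    "success": True,
    "input": "apple",                      # the correct input, only if successful
}
```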

Middleware is used to send and receive instructions from the event bus.

The current design demonstrates process migration, but it can be expanded to also demonstrate code migration. Processes are distributed to the clients so that as the number of clients increases, so does the performance of the system.

We have a strong feeling that we can pull the implementation off by using off-the-shelf programs and libraries such as Redis. The evaluation is conducted by invoking a function call from the server and clocking all the runs.

2. Implementation

We spin up the Redis instance with Docker.

The dispatcher server is a Python program that coordinates the dispatching.

A worker runs a Python program. The inner “function” of the worker can be requested through a common interface. The worker waits for a specific event to occur, and work is started only through these events.
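A minimal sketch of such a worker loop, assuming the redis-py client and illustrative channel names (not our exact code):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance
pubsub = r.pubsub()
pubsub.subscribe("jobs")                       # channel name is an assumption

def shacrackprod(inputs):
    # Placeholder for the worker's inner function; the real one
    # brute-forces a batch of candidate strings (see Evaluation).
    return {"success": False}

# The worker blocks until a job event arrives, runs the requested
# function and publishes the result back for the coordinator.
for message in pubsub.listen():
    if message["type"] != "message":
        continue
    job = json.loads(message["data"])
    result = shacrackprod(job["inputs"])
    r.publish("results", json.dumps(result))
```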

On top of the publish/subscribe communication we create a middleware layer that handles function invocation and worker assignment.

We can expand our worker capacity with external machines if the system firewalls allow it. This will of course introduce latency to our communication.

3. Communication

Communication is done using TCP/IP. We used Redis as our messaging channel. Redis provides the Pub/Sub functionality out of the box. Python clients need a Redis library in order to communicate with the Redis instance.
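For reference, the Pub/Sub primitives look like this with the redis-py client (the channel name is a placeholder):

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # TCP connection to the Redis instance

# Subscribing side
pubsub = r.pubsub()
pubsub.subscribe("some-channel")

# Publishing side
r.publish("some-channel", "hello workers")

# Returns the next pending message for this subscription, or None
print(pubsub.get_message())
```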

4. Naming

Clients create their own identifiers using a universally unique identifier (UUID). The worker name takes the form “Worker-{UUID}”. The probability of a name collision should be insignificant.

Functions are appended to the worker name with a “.” separator, so that the function call looks like “Worker-{UUID}.{function}”.
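A minimal sketch of how such a name can be formed with Python’s uuid module:

```python
import uuid

# Each worker generates its own identifier on startup.
worker_name = f"Worker-{uuid.uuid4()}"          # e.g. "Worker-9f1c2e3a-..."

# A function offered by that worker is addressed by appending ".{function}".
function_call = f"{worker_name}.shacrackprod"
print(function_call)
```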

5. Coordination

The dispatcher server must coordinate the whole procedure of invoking workers to start working. We do this by using event matching, or more precisely, notification filtering. We do this in a centralized manner, which allows us to use the publish-subscribe system more efficiently.

The coordinator first pushes a job to the worker, and the worker pulls that job to itself. After the job is done, the worker pushes the result to the coordinator, and the coordinator pulls the result.
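A minimal sketch of the coordinator side of this cycle, again assuming redis-py and illustrative channel names:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
results = r.pubsub()
results.subscribe("results")                   # subscribe before pushing any jobs

# Push one job (a batch of inputs) per worker.
jobs = [{"inputs": {"start": i * 100_000, "end": (i + 1) * 100_000}} for i in range(4)]
for job in jobs:
    r.publish("jobs", json.dumps(job))

# Pull the results as the workers finish their batches.
done = 0
for message in results.listen():
    if message["type"] != "message":
        continue
    result = json.loads(message["data"])
    done += 1
    if result.get("success"):
        print("found input:", result["input"])
        break
    if done == len(jobs):
        break
```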

6. Consistency and replication

Docker allows us to replicate the worker, thus enabling scalability and parallelism.

7. Fault tolerance

We are not going to implement fault tolerance. Some tolerance mechanisms would be possible to implement since we are running multiple workers, and faults will become more common as the number of workers increases, but this is not in our scope.

8. Security

Security relies on the fact that the communication between the server and the client(s) is limited to a certain set of commands. Escaping those limitations to escalate access is highly unlikely. Simplicity should be favored, since complexity introduces insecurities (the Unix philosophy enforces simplicity too).

Evaluation

We evaluated our implementation by running the same function with different numbers of workers. We ran the function 10 times for each number of workers and calculated the average of the processing times. We had 30 workers in total, distributed across two computers, each with 6 cores and 12 logical processors.

The function tries to brute-force a SHA256-hashed string. Our coordinator calculates the maximum possible number of variations for a given character set. From that number it divides a batch of inputs for each job to solve and distributes the jobs to the workers, as sketched below. Workers report results to the coordinator.
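As an illustration of the batching idea (not our exact implementation): the keyspace size for a character set and maximum length is the sum of |charset|^k for k = 1..max_len, each job gets a contiguous index range, and a worker maps indices back to candidate strings and hashes them.

```python
import hashlib

CHARSET = "abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 5

def keyspace_size(charset, max_len):
    # Number of candidate strings of length 1..max_len.
    return sum(len(charset) ** k for k in range(1, max_len + 1))

def index_to_candidate(index, charset, max_len):
    # Map a flat index to one candidate string of length 1..max_len.
    for k in range(1, max_len + 1):
        block = len(charset) ** k
        if index < block:
            chars = []
            for _ in range(k):
                index, rem = divmod(index, len(charset))
                chars.append(charset[rem])
            return "".join(chars)
        index -= block
    raise IndexError("index outside keyspace")

def crack_batch(target_hash, start, end):
    # Worker-side loop over one batch of indices.
    for i in range(start, end):
        candidate = index_to_candidate(i, CHARSET, MAX_LEN)
        if hashlib.sha256(candidate.encode()).hexdigest() == target_hash:
            return candidate
    return None

# Coordinator-side batching: split the keyspace evenly across the workers.
total = keyspace_size(CHARSET, MAX_LEN)
workers = 4
batch = -(-total // workers)                   # ceiling division
jobs = [(i * batch, min((i + 1) * batch, total)) for i in range(workers)]
```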

We benchmarked the function with the character set “abcdefghijklmnopqrstuvwxyz” and a maximum string length of 5. Below is an example invocation of the program for 1 worker.

```
python3 server-coordinator.py -f shacrackprod -H 87e93406a19d11166fd4aff9addf299aad2221cbd45febc596a527b65269b78f -l 5 -c abcdefghijklmnopqrstuvwxyz -w 1
```

(Benchmark test graph)

The benefit of using more workers seems to diminish logarithmically, and in our test runs the major benefits were reached at around 10 workers; after that the gain is marginal. Using 30 workers compared to 1 worker resulted in a run roughly 7 times faster.

We would like to point out that we chose the function to fit our problem. If the function is too “light” and communication takes more CPU time than processing the function itself, there is no time gain at all; at worst, the runs take more time than a plain Python script executing the function.

In conclusion, a distributed function runner can boost performance by introducing parallelism. Care should be taken when choosing the function to be run this way, since performance might degrade if the function is too “light”.

Workload distribution

| Student name | Tasks | Estimated wl. (h) | Real wl. (h) |
| --- | --- | --- | --- |
| Jarkko Kotaniemi | Project overview | 2 | 2 |
| Vili Tiinanen | Project overview | 2 | 2 |
| Jarkko Kotaniemi | Design | 8 | 3.5 |
| Vili Tiinanen | Design | 10 | 5.5 |
| Jarkko Kotaniemi | Implementation | 20 | 20 |
| Vili Tiinanen | Implementation | 20 | 25 |
| Vili Tiinanen | Finalization | 3 | 2 |
| Jarkko Kotaniemi | Update wiki | 1 | 0.5 |