Update Various C++ Documentation Examples to Current Interface #398

Merged: 6 commits, Oct 18, 2024
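Summary of the interface updates in this PR: parameter vectors are now declared against the kp::Memory base class instead of kp::Tensor, the sync operations kp::OpTensorSyncDevice and kp::OpTensorSyncLocal become kp::OpSyncDevice and kp::OpSyncLocal, sequence and tensor members are accessed through the shared_ptr arrow operator, and results are read back with the vector() / data() accessors. As a quick orientation, here is a minimal before/after fragment; it is not taken verbatim from any one file and assumes the mgr, tensor and algorithm objects set up as in the README example below:

    // Old interface
    std::vector<std::shared_ptr<kp::Tensor>> params = { tensorInA, tensorInB, tensorOutA, tensorOutB };
    mgr.sequence()
        ->record<kp::OpTensorSyncDevice>(params)   // copy host data to the GPU
        ->record<kp::OpAlgoDispatch>(algorithm)
        ->record<kp::OpTensorSyncLocal>(params)    // copy results back to the host
        ->eval();

    // Current interface (what this PR moves the docs to)
    std::vector<std::shared_ptr<kp::Memory>> params = { tensorInA, tensorInB, tensorOutA, tensorOutB };
    mgr.sequence()
        ->record<kp::OpSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm)
        ->record<kp::OpSyncLocal>(params)
        ->eval();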
README.md: 6 changes (3 additions, 3 deletions)
@@ -93,7 +93,7 @@ void kompute(const std::string& shader) {
auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

- std::vector<std::shared_ptr<kp::Tensor>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};
+ std::vector<std::shared_ptr<kp::Memory>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};

// 3. Create algorithm based on shader (supports buffers & push/spec constants)
kp::Workgroup workgroup({3, 1, 1});
@@ -110,15 +110,15 @@ void kompute(const std::string& shader) {

// 4. Run operation synchronously using sequence
mgr.sequence()
- ->record<kp::OpTensorSyncDevice>(params)
+ ->record<kp::OpSyncDevice>(params)
->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
->eval() // Evaluates the two recorded operations
->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
->eval(); // Evaluates only last recorded operation

// 5. Sync results from the GPU asynchronously
auto sq = mgr.sequence();
- sq->evalAsync<kp::OpTensorSyncLocal>(params);
+ sq->evalAsync<kp::OpSyncLocal>(params);

// ... Do other work asynchronously whilst GPU finishes

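The README hunk above ends before the asynchronous read-back is awaited. A sketch of how that tail is typically finished, using only calls that appear elsewhere in this PR (evalAwait and the vector() accessor); this is illustrative only, not the README's actual continuation:

    // Block until the asynchronous OpSyncLocal submitted above has completed
    sq->evalAwait();

    // Host-side copies of the output tensors are now up to date
    std::cout << "Output A[0]: " << tensorOutA->vector()[0] << std::endl;
    std::cout << "Output B[0]: " << tensorOutB->vector()[0] << std::endl;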
benchmark/TestBenchmark.cpp: 6 changes (3 additions, 3 deletions)
@@ -48,7 +48,7 @@ TEST(TestBenchmark, TestMultipleSequenceOperationMostlyGPU)
std::shared_ptr<kp::TensorT<uint32_t>> tensorInB = mgr.tensorT<uint32_t>(std::vector<uint32_t>(numElems, elemValue));
std::shared_ptr<kp::TensorT<float>> tensorOut = mgr.tensor(std::vector<float>(numElems, 0));

- std::vector<std::shared_ptr<kp::Tensor>> params = { tensorInA, tensorInB, tensorOut };
+ std::vector<std::shared_ptr<kp::Memory>> params = { tensorInA, tensorInB, tensorOut };

// Opt: Avoiding using anonimous sequences when we will reuse
std::vector<std::shared_ptr<kp::Sequence>> sequences(numSeqs);
@@ -63,7 +63,7 @@ TEST(TestBenchmark, TestMultipleSequenceOperationMostlyGPU)
}
}

- mgr.sequence()->eval<kp::OpTensorSyncDevice>({ tensorInA });
+ mgr.sequence()->eval<kp::OpSyncDevice>({ tensorInA });

auto startTime = std::chrono::high_resolution_clock::now();

@@ -83,7 +83,7 @@ TEST(TestBenchmark, TestMultipleSequenceOperationMostlyGPU)
std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime)
.count();

- mgr.sequence()->eval<kp::OpTensorSyncLocal>({ tensorOut });
+ mgr.sequence()->eval<kp::OpSyncLocal>({ tensorOut });

EXPECT_EQ(tensorOut->vector(), std::vector<float>(numElems, elemValue * numIter * numOps * numSeqs));

docs/overview/advanced-examples.rst: 61 changes (31 additions, 30 deletions)
@@ -69,12 +69,12 @@ The example below shows how you can enable the "VK_EXT_shader_atomic_float" exte
mgr.algorithm({ tensor }, spirv, kp::Workgroup({ 1 }), {}, { 0.0, 0.0, 0.0 });

sq = mgr.sequence()
- ->record<kp::OpTensorSyncDevice>({ tensor })
+ ->record<kp::OpSyncDevice>({ tensor })
->record<kp::OpAlgoDispatch>(algo,
std::vector<float>{ 0.1, 0.2, 0.3 })
->record<kp::OpAlgoDispatch>(algo,
std::vector<float>{ 0.3, 0.2, 0.1 })
- ->record<kp::OpTensorSyncLocal>({ tensor })
+ ->record<kp::OpSyncLocal>({ tensor })
->eval();

EXPECT_EQ(tensor->data(), std::vector<float>({ 0.4, 0.4, 0.4 }));
@@ -92,12 +92,12 @@ We also provide tools that allow you to `convert shaders into C++ headers <https
.. code-block:: cpp
:linenos:

- class OpMyCustom : public OpAlgoDispatch
+ class OpMyCustom : public kp::OpAlgoDispatch
{
public:
- OpMyCustom(std::vector<std::shared_ptr<Tensor>> tensors,
+ OpMyCustom(std::vector<std::shared_ptr<kp::Memory>> tensors,
std::shared_ptr<kp::Algorithm> algorithm)
- : OpAlgoBase(algorithm)
+ : kp::OpAlgoDispatch(algorithm)
{
if (tensors.size() != 3) {
throw std::runtime_error("Kompute OpMult expected 3 tensors but got " + tensors.size());
@@ -135,7 +135,7 @@ We also provide tools that allow you to `convert shaders into C++ headers <https

algorithm->rebuild(tensors, spirv);
}
- }
+ };


int main() {
@@ -148,13 +148,13 @@ We also provide tools that allow you to `convert shaders into C++ headers <https
auto tensorOut = mgr.tensor({ 0., 0., 0. });

mgr.sequence()
- ->record<kp::OpTensorSyncDevice>({tensorLhs, tensorRhs, tensorOut})
- ->record<kp::OpMyCustom>({tensorLhs, tensorRhs, tensorOut}, mgr.algorithm())
- ->record<kp::OpTensorSyncLocal>({tensorLhs, tensorRhs, tensorOut})
+ ->record<kp::OpSyncDevice>({tensorLhs, tensorRhs, tensorOut})
+ ->record<OpMyCustom>({tensorLhs, tensorRhs, tensorOut}, mgr.algorithm())
+ ->record<kp::OpSyncLocal>({tensorLhs, tensorRhs, tensorOut})
->eval();

// Prints the output which is { 0, 4, 12 }
- std::cout << fmt::format("Output: {}", tensorOutput.data()) << std::endl;
+ std::cout << fmt::format("Output: {}", tensorOut->vector()) << std::endl;
}

Async/Await Example
@@ -170,8 +170,8 @@ First we are able to create the manager as we normally would.
// You can allow Kompute to create the GPU resources, or pass your existing ones
kp::Manager mgr; // Selects device 0 unless explicitly requested

- // Creates tensor an initializes GPU memory (below we show more granularity)
- auto tensor = mgr.tensor(10, 0.0);
+ // Creates tensor and initializes GPU memory (below we show more granularity)
+ auto tensor = mgr.tensorT<float>(10);
axsaucedo marked this conversation as resolved.

We can now run our first asynchronous command, which in this case we can use the default sequence.

@@ -181,7 +181,7 @@ Sequences can be executed in synchronously or asynchronously without having to c
:linenos:

// Create tensors data explicitly in GPU with an operation
- mgr.sequence()->eval<kp::OpTensorSyncDevice>({tensor});
+ mgr.sequence()->eval<kp::OpSyncDevice>({tensor});


While this is running we can actually do other things like in this case create the shader we'll be using.
@@ -231,7 +231,7 @@ The parameter provided is the maximum amount of time to wait in nanoseconds. Whe
.. code-block:: cpp
:linenos:

- auto sq = mgr.sequence()
+ auto sq = mgr.sequence();

// Run Async Kompute operation on the parameters provided
sq->evalAsync<kp::OpAlgoDispatch>(algo);
@@ -240,7 +240,7 @@ The parameter provided is the maximum amount of time to wait in nanoseconds. Whe

// When we're ready we can wait
// The default wait time is UINT64_MAX
- sq.evalAwait()
+ sq->evalAwait();


Finally, below you can see that we can also run syncrhonous commands without having to change anything.
@@ -250,11 +250,11 @@ Finally, below you can see that we can also run syncrhonous commands without hav

// Sync the GPU memory back to the local tensor
// We can still run synchronous jobs in our created sequence
- sq.eval<kp::OpTensorSyncLocal>({ tensor });
+ sq->eval<kp::OpSyncLocal>({ tensor });

// Prints the output: B: { 100000000, ... }
std::cout << fmt::format("B: {}",
- tensor.data()) << std::endl;
+ tensor->vector()) << std::endl;


Parallel Operation Submission
@@ -318,20 +318,20 @@ It's worth mentioning you can have multiple sequences referencing the same queue
// We need to create explicit sequences with their respective queues
// The second parameter is the index in the familyIndex array which is relative
// to the vector we created the manager with.
- sqOne = mgr.sequence(0);
- sqTwo = mgr.sequence(1);
+ auto sqOne = mgr.sequence(0);
+ auto sqTwo = mgr.sequence(1);

We create the tensors without modifications.

.. code-block:: cpp
:linenos:

// Creates tensor an initializes GPU memory (below we show more granularity)
- auto tensorA = mgr.tensor({ 10, 0.0 });
- auto tensorB = mgr.tensor({ 10, 0.0 });
+ auto tensorA = mgr.tensorT<float>(10);
+ auto tensorB = mgr.tensorT<float>(10);
axsaucedo marked this conversation as resolved.

// Copies the data into GPU memory
- mgr.sequence().eval<kp::OpTensorSyncDevice>({tensorA tensorB});
+ mgr.sequence()->eval<kp::OpSyncDevice>({tensorA, tensorB});

Similar to the asyncrhonous usecase above, we can still run synchronous commands without modifications.

@@ -367,38 +367,39 @@ Similar to the asyncrhonous usecase above, we can still run synchronous commands
// See shader documentation section for compileSource
std::vector<uint32_t> spirv = compileSource(shader);

- std::shared_ptr<kp::Algorithm> algo = mgr.algorithm({tensorA, tenssorB}, spirv);
+ std::shared_ptr<kp::Algorithm> algoOne = mgr.algorithm({ tensorA }, spirv);
+ std::shared_ptr<kp::Algorithm> algoTwo = mgr.algorithm({ tensorB }, spirv);

Now we can actually trigger the parallel processing, running two OpAlgoBase Operations - each in a different sequence / queue.

.. code-block:: cpp
:linenos:

// Run the first parallel operation in the `queueOne` sequence
- sqOne->evalAsync<kp::OpAlgoDispatch>(algo);
+ sqOne->evalAsync<kp::OpAlgoDispatch>(algoOne);

// Run the second parallel operation in the `queueTwo` sequence
- sqTwo->evalAsync<kp::OpAlgoDispatch>(algo);
+ sqTwo->evalAsync<kp::OpAlgoDispatch>(algoTwo);
Contributor Author

Is there a better way to signify running the same shader but with a different tensor bound for this example, other than creating another Algorithm? The code before my change just ran both sequences updating tensorA, and tensorB was never updated. The print below appears to expect both to be updated, which makes sense for a more useful example.

Member

I'm not sure I follow, could you provide an example of what you mean?

Contributor Author

@axsaucedo Anything that more closely resembles what the code was previously trying to do, with reuse of algo instead of creating two separate algorithms. I've not looked into the backend, but this could be to avoid any overhead around spirv being duplicated (like making sure it's loaded/ready on device).

I'm going to guess that, since there wasn't an immediate suggestion to change this, there is no concern with having the separate algorithms, and we should continue as-is.

Contributor Author

Perhaps a poor example as it doesn't make sense with how everything is laid out, but I could imagine a world where you would see something similar to:

sqOne->evalAsync<kp::OpAlgoDispatch>(algo, tensorA);
sqTwo->evalAsync<kp::OpAlgoDispatch>(algo, tensorB);

Member

Ah, I see what you mean now; yes, this would actually be much more desirable. When designing Kompute this was the initial aim as well. Unfortunately, due to the design of the underlying Vulkan architecture it's not possible (namely because of the dependency between the descriptor sets, the algorithm and the tensors), which means initialisation is coupled. This is something I hope is addressed in the design of Vulkan at some point, but at least in the medium term it doesn't seem to be planned. Hope this provides further context.
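To summarise the thread for readers skimming the diff: the pattern the updated example settles on is one kp::Algorithm per tensor binding, each dispatched on its own sequence and queue. A minimal sketch using the names from the diff above:

    // Descriptor sets tie an algorithm to the tensors it was built with,
    // so the same algorithm cannot be re-bound to a different tensor per dispatch.
    auto algoOne = mgr.algorithm({ tensorA }, spirv);
    auto algoTwo = mgr.algorithm({ tensorB }, spirv);

    // Each dispatch goes to its own sequence, and therefore its own queue
    sqOne->evalAsync<kp::OpAlgoDispatch>(algoOne);
    sqTwo->evalAsync<kp::OpAlgoDispatch>(algoTwo);

    // Wait for both, then sync results back to the host
    sqOne->evalAwait();
    sqTwo->evalAwait();
    mgr.sequence()->eval<kp::OpSyncLocal>({ tensorA, tensorB });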



Similar to the asynchronous example above, we are able to do other work whilst the tasks are executing.

- We are able to wait for the tasks to complete by triggering the `evalOpAwait` on the respective sequence.
+ We are able to wait for the tasks to complete by triggering the `evalAwait` on the respective sequence.

.. code-block:: cpp
:linenos:

// Here we can do other work

// We can now wait for the two parallel tasks to finish
- sqOne.evalOpAwait()
- sqTwo.evalOpAwait()
+ sqOne->evalAwait();
+ sqTwo->evalAwait();

// Sync the GPU memory back to the local tensor
- mgr.sequence()->eval<kp::OpTensorSyncLocal>({ tensorA, tensorB });
+ mgr.sequence()->eval<kp::OpSyncLocal>({ tensorA, tensorB });

// Prints the output: A: 100000000 B: 100000000
std::cout << fmt::format("A: {}, B: {}",
- tensorA.data()[0], tensorB.data()[0]) << std::endl;
+ tensorA->data()[0], tensorB->data()[0]) << std::endl;

