Code embeddings are used in machine learning to map source code into a dense vector space. Various models have been proposed to learn this mapping, and they take different information from the code as input (e.g. only the tokens, the AST, control/data flow, …). EmbeddedKittens extracts this information from Scratch projects and transforms it into the input format required by the respective machine learning model.
EmbeddedKittens is developed at the Chair of Software Engineering II of the University of Passau.
It originally started as a fork/extension of the Scratch static code analysis tool LitterBox. Internally, it uses the parser, AST and data flow information as obtained from LitterBox.
EmbeddedKittens is built using Maven. To produce an executable JAR file, run the following command:
mvn package
Note
Until LitterBox is published on Maven Central, you have to build it yourself and install it into the local Maven cache:
# clone LitterBox in version 1.9
git clone -b 1.9 https://github.com/se2p/LitterBox
cd LitterBox
# install the LitterBox JAR into the local Maven Cache so it can be found in this project
mvn install -DskipTests
Now, the package command above should work in this repository.
The build produces target/embedded-kittens-1.0.full.jar.
Pre-built JARs are also available for each release on GitHub.
To see an overview of the available command line options, type:
java -jar embedded-kittens-1.0.full.jar --help
All subcommands also accept the --help flag to show information about their specific parameters. For example:
java -jar embedded-kittens-1.0.full.jar code2vec --help
The currently supported formats are suitable for the following models:
To use the code2vec model with the programming language Scratch, a Scratch parser is needed to generate the required input representation. For each Scratch program, EmbeddedKittens produces a file that follows the rules described at https://github.com/tech-srl/code2vec#extending-to-other-languages. It takes a path to a single project file or to a folder with multiple projects and writes the results to the given output folder.
java -jar embedded-kittens-1.0.full.jar code2vec \
--output "<path/to/folder/for/the/output>" \
--path "<path/to/json/project/or/folder/with/projects>"
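To sanity-check generated files, it can help to read them back in. The following is a minimal sketch, assuming the output follows the generic code2vec text format from the page linked above: one example per line, fields separated by spaces, the first field being the target label (sub-tokens separated by `|`), and every further field being a path context of the form `token,path,token`. The names `PathContext` and `parse_code2vec_line` and the sample line are made up for illustration and are not part of EmbeddedKittens.

```python
from typing import List, NamedTuple, Tuple


class PathContext(NamedTuple):
    """One code2vec path context: start token, path, end token."""
    start_token: str
    path: str
    end_token: str


def parse_code2vec_line(line: str) -> Tuple[List[str], List[PathContext]]:
    """Split one example line into label sub-tokens and path contexts.

    Assumes the generic code2vec format; the exact shape of the files
    written by EmbeddedKittens may differ in details.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty line")

    # The first field is the target label, e.g. "cat|sprite".
    label_subtokens = fields[0].split("|")

    # Each remaining field is one context with exactly three components.
    contexts = []
    for field in fields[1:]:
        parts = field.split(",")
        if len(parts) != 3:
            raise ValueError(f"malformed path context: {field!r}")
        contexts.append(PathContext(*parts))
    return label_subtokens, contexts


# Hypothetical example line; real tokens and paths will look different.
example = "cat|sprite hello,123456,10 10,-98765,steps"
label, contexts = parse_code2vec_line(example)
print(label)
print(contexts)
```

A parser like this is only meant for inspection and quick plausibility checks; the code2vec preprocessing scripts consume the files directly.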
There are some differences between Scratch and "normal" programming languages like Java. The most important one is that sprites are primarily split into unnamed scripts rather than named methods. Because of that, LitterBox treats sprite names like method names and creates path contexts for every sprite in a project.
EmbeddedKittens can also generate the path contexts per script and procedure. Given a Scratch program as input, it then produces one file for each script and procedure, containing the input representation required by the code2vec model.
java -jar embedded-kittens-1.0.full.jar code2vec \
--output "<path/to/folder/for/the/output>" \
--path "<path/to/json/project/or/folder/with/projects>" \
--scripts
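When the tool is called from a data pipeline, the command above can be assembled programmatically. The sketch below only uses the subcommand and flags shown in this README (`code2vec`, `--output`, `--path`, `--scripts`); the helper names `build_code2vec_command` and `run_code2vec` are hypothetical, and creating the output folder up front is an assumption made for safety, not a documented requirement.

```python
import subprocess
from pathlib import Path
from typing import List


def build_code2vec_command(jar: Path, project_path: Path, output_dir: Path,
                           per_script: bool = False) -> List[str]:
    """Assemble the argument list for the code2vec subcommand."""
    command = [
        "java", "-jar", str(jar), "code2vec",
        "--output", str(output_dir),
        "--path", str(project_path),
    ]
    if per_script:
        # Generate path contexts per script and procedure instead of per sprite.
        command.append("--scripts")
    return command


def run_code2vec(jar: Path, project_path: Path, output_dir: Path,
                 per_script: bool = False) -> None:
    """Run EmbeddedKittens and raise CalledProcessError on a non-zero exit code."""
    # Assumption: creating the folder beforehand does no harm if the tool
    # would create it on its own.
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_code2vec_command(jar, project_path, output_dir, per_script),
        check=True,
    )


# Only builds and prints the command; call run_code2vec(...) to execute it.
command = build_code2vec_command(
    Path("embedded-kittens-1.0.full.jar"), Path("projects"), Path("out"),
    per_script=True,
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with project paths that contain spaces.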
Please open an issue if you find a bug. We are open to pull requests both for fixes and for the support of new model formats. For larger features or restructurings, please open an issue first to discuss the best approach. If possible, please split larger changes into smaller pull/merge requests to make them easier to review and integrate step by step.
EmbeddedKittens is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
EmbeddedKittens is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.