This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Validations of Functions

jialincheoh edited this page Sep 10, 2022 · 2 revisions

For the validation, I think we will need an F-score, which combines precision (are the retrieved functions all relevant?) and recall (have all the relevant functions been retrieved?). Perhaps we can make a table in the paper with 6 apps and their respective precision and recall values (the 3 apps with the highest number of functions and the 3 apps with the lowest number of functions).
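To make the proposed table concrete, here is a minimal sketch of how precision, recall, and F1 would be computed for one app. The function names and counts are hypothetical, purely for illustration:

```python
# Minimal sketch: precision, recall, and F1 for unranked retrieval
# (IR-book chapter 8.3 style), applied to sets of function names.

def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two sets of function names."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 of the 5 retrieved functions are relevant,
# and 1 relevant function was missed by the extraction.
retrieved = {"init", "render", "update", "destroy", "logDebug"}
relevant = {"init", "render", "update", "destroy", "onResize"}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.8 0.8
```

Each row of the table would then just be one call to this function per app.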

There have been many refinements of the F1 score, such as this paper -> https://aclanthology.org/2020.eval4nlp-1.9/, but since we are not measuring system rankings in information retrieval, we don't need measures that are this complex.

For the precision and recall idea, I am citing this book https://www.cambridge.org/highereducation/books/introduction-to-information-retrieval/669D108D20F556C5C30957D63B5AB65C#overview -> chapter 8.3, "Evaluation of unranked retrieval sets". Link to the PDF -> https://nlp.stanford.edu/IR-book/pdf/08eval.pdf

But I am still debating whether this kind of validation would be worthwhile: the top participants have a few thousand functions each, which makes counting them by hand pretty tedious, so perhaps we should omit the validation altogether. Additionally, even if we miss a few functions, the F1 score will still be high.
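A quick back-of-the-envelope check of that robustness claim, with hypothetical numbers for a top participant:

```python
# With a few thousand relevant functions, missing a handful barely
# moves F1. All counts below are hypothetical, for illustration only.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

total_relevant = 2000   # hypothetical function count for a top participant
missed = 10             # relevant functions the extraction missed
spurious = 10           # irrelevant functions wrongly retrieved

retrieved = total_relevant - missed + spurious        # 2000 items retrieved
precision = (total_relevant - missed) / retrieved     # 1990/2000 = 0.995
recall = (total_relevant - missed) / total_relevant   # 1990/2000 = 0.995
print(round(f1(precision, recall), 4))  # 0.995
```

So even with 10 misses and 10 false positives out of 2000, F1 stays above 0.99, which supports the point that small counting errors would not show up in the score.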

For the function validation, another idea that comes to mind is to use only the participants with the lowest numbers of functions, from people with 1 function up to people with 50 functions. This would make the validation easier for a human, compared to taking the people with the largest numbers of functions.

One thing often missing in research publications is that they use just one parameter setting (F1 is very popular), but depending on the specific application, F1 may not be the best choice. Varying the parameter beta may be needed: for example F2, F3, and F4, with beta values higher than 1, give more weight to recall (all functions retrieved) than to precision (accuracy of the retrieved functions). At the very least, showing F-measure values at multiple settings of beta would be more informative. So we could have a table with varying beta values and show the results from the human validation. Below is the formula of the F-measure that I am talking about. I will attach the relevant information-retrieval publications that do this.

F_β = (1 + β²) · P · R / (β² · P + R), where P is precision and R is recall.
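A small sketch of what the proposed varying-beta table could look like, using the standard F-beta formula. The precision and recall values here are hypothetical placeholders for the human-validation results:

```python
# F-beta at several beta settings; beta > 1 weights recall more heavily
# than precision. The precision/recall values are hypothetical.

def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.90, 0.70  # hypothetical validation results
for beta in (0.5, 1, 2, 3, 4):
    print(f"F{beta} = {f_beta(precision, recall, beta):.3f}")
```

With recall lower than precision, the recall-weighted scores (F2, F3, F4) come out lower than F1, which is exactly the kind of difference the table would surface.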