Group Members

**Ashwin Patil <[email protected]>
Chinmay Saraf <[email protected]>
Kedar Takwane <[email protected]>
Maahi Patel <[email protected]>**

Repository Link

https://github.com/theashwin/ml4se

Introduction

In this project, we evaluate ChatGPT’s ability to reason about a code snippet’s execution, generate tests, and produce a semantically equivalent version of a given function. We use Python and Java repositories from the CodeSearchNet corpus, interact with ChatGPT through the OpenAI API, and engineer prompts to elicit the required outputs from the model. We then evaluate the generated reasoning, tests, and methods by setting them up and running them. We present our observations and an overview of the results in this report.
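
To make the interaction concrete, the snippet below is a minimal sketch of a single prompt/response round trip through the OpenAI chat completions API in Python. The model name, prompt text, and environment-variable handling are illustrative assumptions rather than our exact script configuration.

```python
# Minimal sketch: send one prompt to ChatGPT via the OpenAI chat completions API.
# Model name and prompt are illustrative assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = (
    "Given the following Python function, reason step by step about what it "
    "returns for the call shown:\n\n"
    "def add(a, b):\n    return a + b\n\nadd(2, 3)"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; any chat-capable model works
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```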

Methodology

  1. Data Collection - We selected 6 repositories from CodeSearchNet: 3 each for Python and Java (the repositories are listed further down). From these 6 repositories, we selected 50 functions each for Python and Java. Limiting the source repositories to 6 makes testing the generated test cases easier, since only a handful of projects need to be set up (a sketch of the sampling step follows this list).
  2. Data Categorization - To gauge ChatGPT’s accuracy and effectiveness on the 3 tasks, we categorized the functions into three difficulty levels based on whether the inputs, outputs, and variables used in the function body are primitive or non-primitive. We concede there may be better ways to categorize the dataset, but this is a reasonable approximation (an illustrative heuristic appears after this list).
    1. Easy - The variables used are all of primitive datatypes.
    2. Medium - The variables used include both primitive and non-primitive datatypes.
    3. Hard - The variables used are all of non-primitive datatypes.
  3. Prompt Engineering - Using random data points from the dataset, we tried multiple prompts to elicit the required response from the model. The prompts we tried, along with screenshots, can be found in the folder milestone-2/observations. For easier navigation, the prompts are further categorized by task.
  4. Automating Runs - Once the dynamic prompts were finalized, we ran them for 50 functions each for Python and Java. Outputs can be found at milestone-2/out.
    1. Dynamic - We simulate back-and-forth conversations with ChatGPT by breaking each task into parts, which is known to help the model understand the task and give better responses (see the conversation sketch after this list).
    2. Markdown - Since ChatGPT’s output is in markdown format (an interesting observation in itself), we save each response as a .md file, which is visually pleasing and a better way to present our work.
  5. Evaluated Method Execution Reasoning - After generating the outputs, we verify and evaluate them and add our observations to the individual function files.
  6. Tested Generated Tests - To evaluate this task we had to set up the individual repositories so the generated tests could be run. Once set up, we follow this procedure (a sketch of the test runner follows this list):
    1. If the test is generated as a standalone test file, we create a new file for it and try to run it.
    2. If the test is generated as a single function, we find a suitable existing test file to append it to and then evaluate it.
    3. We make the following practical changes to ensure the generated tests can be evaluated:
      1. Adjust indentation (tabs vs. spaces) to match the existing code, as Python is indentation-sensitive.
      2. Apply other formatting changes that do not alter the syntax or semantics of the function, to ensure a fair evaluation.
  7. Tested Generated Semantically Equivalent Methods - Similar to the task above, we evaluate the generated semantically equivalent methods by appending the generated function to the existing code file and renaming it so it can be called in place of the original.
    1. Testing
      1. Since the focus here is on the generated function, not the generated tests, we use the existing tests for the function whenever they exist.
      2. Otherwise, we use the test generated in the previous task, modified to make it runnable and semantically correct.
  8. Report Generation - With everything else out of the way, we generate this report to sum up our observations, give statistics on what works and what does not, and describe in which cases the model struggles and where it excels.
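
The sampling described in the Data Collection step can be sketched as follows. It assumes the CodeSearchNet dumps are available locally as JSONL files with repo, func_name, code, and language fields; the file path, field names, and repository names are assumptions for illustration, not our exact pipeline.

```python
# Illustrative sketch of sampling 50 functions from CodeSearchNet JSONL dumps,
# restricted to the selected repositories. Paths, field names, and repo names are assumed.
import json
import random

SELECTED_REPOS = {"example/repo-a", "example/repo-b", "example/repo-c"}  # hypothetical names

def sample_functions(jsonl_path: str, n: int = 50, seed: int = 0) -> list[dict]:
    """Return n random function records drawn only from the selected repositories."""
    candidates = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("repo") in SELECTED_REPOS:
                candidates.append(record)
    random.seed(seed)
    return random.sample(candidates, min(n, len(candidates)))

python_sample = sample_functions("codesearchnet_python.jsonl", n=50)
```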
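
The difficulty labels from the Data Categorization step were assigned by inspecting each function, but a rough automated heuristic along the same lines could look only at type annotations, as in this Python-only sketch (unannotated code is not handled, so this is just an approximation of our manual labelling):

```python
# Illustrative heuristic: label a function easy/medium/hard from its annotated
# parameter and return types. Our actual labels were assigned manually.
import ast

PRIMITIVES = {"int", "float", "bool", "str", "bytes", "None"}

def classify_difficulty(func_source: str) -> str:
    tree = ast.parse(func_source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    annotations = [arg.annotation for arg in func.args.args if arg.annotation]
    if func.returns is not None:
        annotations.append(func.returns)
    names = {ast.unparse(a) for a in annotations}
    if not names:
        return "unlabelled"  # no annotations to go on; label manually instead
    primitive = names & PRIMITIVES
    non_primitive = names - PRIMITIVES
    if primitive and non_primitive:
        return "medium"
    return "easy" if primitive else "hard"

print(classify_difficulty("def add(a: int, b: int) -> int:\n    return a + b"))  # easy
```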
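
The dynamic prompting used in the Automating Runs step boils down to keeping a growing message list and sending each sub-prompt with the model’s earlier replies as context, then saving the markdown replies to a .md file. The sketch below shows this pattern; the model name and output path are assumptions.

```python
# Sketch of a dynamic (multi-turn) conversation: sub-prompts are sent in order and
# every reply stays in the context. Model name and output path are assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def run_dynamic_prompts(sub_prompts: list[str], out_path: str) -> list[str]:
    messages, replies = [], []
    for prompt in sub_prompts:
        messages.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed model
            messages=messages,
        )
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    # ChatGPT already answers in markdown, so the replies are saved as-is.
    Path(out_path).write_text("\n\n".join(replies), encoding="utf-8")
    return replies
```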
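
For the Tested Generated Tests step, a small wrapper like the one below is enough on the Python side: it writes the generated test into the target repository and runs pytest on it, recording whether it passes. The directory layout and file name are assumptions; the Java repositories were run with their own build tooling instead.

```python
# Sketch: drop a generated test into a repository and run it with pytest.
# Repository layout and file name are illustrative assumptions.
import subprocess
from pathlib import Path

def run_generated_test(repo_dir: str, test_code: str, name: str = "test_generated.py") -> bool:
    """Write the generated test into the repo's tests/ folder and return True if pytest passes."""
    test_path = Path(repo_dir) / "tests" / name
    test_path.write_text(test_code, encoding="utf-8")
    result = subprocess.run(
        ["pytest", str(test_path), "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    print(result.stdout)  # keep the pytest summary for the observation notes
    return result.returncode == 0
```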

Important Links

  1. Data - https://github.com/theashwin/ml4se/tree/main/milestone-2/data