Intro

A work-in-progress experiment in resolving real-world software engineering tasks efficiently.

lcs yolo "Let's log entire json conversation for each client implementation to a separate file. One call to send() should be logged to a separate temp file (persisted, not removed) and full name of this file should be written to normal log. Create a new class ConversationLogger which would handle this and used from every client."

This uses the index and tools, generates patches, and applies them automatically, producing https://github.com/okuvshynov/lucas/commit/960369fd05788ed22d2db51d545be1997d687c9b
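For illustration, here is a minimal sketch of what a ConversationLogger matching that request could look like. It is not the code from the commit above; the method name log_conversation and the directory layout are assumptions, chosen to resemble the "Conversation logged to:" lines in the run log further down.

```python
import json
import logging
import os
import tempfile
from datetime import datetime


class ConversationLogger:
    """Writes each request/response pair to its own persisted JSON file.

    Hypothetical sketch: names and layout are assumptions, not the
    implementation from the linked commit.
    """

    def __init__(self, client_name: str):
        self.client_name = client_name
        # One sub-directory per client under the system temp directory.
        self.log_dir = os.path.join(
            tempfile.gettempdir(), f"lucas_{client_name}_logs"
        )
        os.makedirs(self.log_dir, exist_ok=True)

    def log_conversation(self, request: dict, response: dict) -> str:
        # Microsecond timestamp keeps file names unique per send() call.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        path = os.path.join(
            self.log_dir, f"{self.client_name}_{timestamp}.json"
        )
        with open(path, "w") as f:
            json.dump({"request": request, "response": response}, f, indent=2)
        # The file is kept on disk; only its full name goes to the normal log.
        logging.info("Conversation logged to: %s", path)
        return path
```

Each client would create one logger and call log_conversation once per send().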

Summary:

  • Focus on making changes to medium-sized codebases, where a change is spread across multiple files and the project doesn't fit into the context.
  • No autocomplete and no generating a snake game for the 101st time.
  • Human-readable indexing with LLMs that can be debugged and understood; no multidimensional vector stores.
  • Focus on patch generation/application rather than generating entire files from scratch, to improve cost and latency.
  • Support for local models (llama.cpp server) and remote LLM providers (Claude, Mistral, Groq, Cerebras).
  • Support for version control tools (looking up commit info, blame, etc.).

First attempt at SWE-bench

Here's a single example, sqlfluff__sqlfluff-2419, from the SWE-bench dev dataset.

This project is too big to fit into the context entirely, but its index fits easily.

Sonnet 3.5 is used as the 'main' model and a locally running Llama 3.1 70B is used for indexing.

I ran it manually. Indexing locally takes a while, so I need to figure out a better/faster way to do that (e.g. call a fast Llama 70B API such as Cerebras or Groq).

git clone https://github.com/sqlfluff/sqlfluff.git
cd sqlfluff

Switch to the correct revision:

git checkout -b sqlfluff__sqlfluff-2419 f1dba0e1dd764ae72d67c3d5e1471cf14d3db030

Create lucas.conf with the following content:

{
    "chunk_size": 4096,
    "index_client": {"type": "LocalClient", "endpoint": "http://localhost:8080/v1/chat/completions", "max_req_size" : 65536},
    "query_client": {
        "type": "ClaudeClient",
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 8192,
        "tokens_rate": 200000,
        "cache": "ephemeral"
    },
    "crawler": {"includes": "*.py", "traverse": "git"},
    "token_counter" : {"type": "local_counter", "endpoint": "http://localhost:8080/tokenize"}
}

Run indexing. It took a few hours on an M2 Ultra:


lcs index

Save the content of the query to the file p2419.in:

Rule L060 could give a specific error message At the moment rule L060 flags something like this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'IFNULL' or 'NVL'. ``` Since we likely know the wrong word, it might be nice to actually flag that instead of both `IFNULL` and `NVL` - like most of the other rules do. That is it should flag this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'IFNULL'. ``` Or this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'NVL'. ``` As appropriate. What do you think @jpy-git ?

Run lcs yolof on that file to produce the patch:

(base) studio ~/projects/sqlfluff % lcs yolof p2419.in
2024-10-26 08:42:28 INFO: {'directory': '/Users/oleksandr/projects/sqlfluff', 'message': "Rule L060 could give a specific error message At the moment rule L060 flags something like this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'IFNULL' or 'NVL'. ``` Since we likely know the wrong word, it might be nice to actually flag that instead of both `IFNULL` and `NVL` - like most of the other rules do. That is it should flag this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'IFNULL'. ``` Or this: ``` L: 21 | P: 9 | L060 | Use 'COALESCE' instead of 'NVL'. ``` As appropriate. What do you think @jpy-git ?\n", 'index_file': '/Users/oleksandr/projects/sqlfluff/lucas.idx', 'client': {'type': 'ClaudeClient', 'model': 'claude-3-5-sonnet-20241022', 'max_tokens': 8192, 'tokens_rate': 200000, 'cache': 'ephemeral'}}
2024-10-26 08:42:28 INFO: loaded index
2024-10-26 08:42:28 INFO: sending payload, size = 19097
2024-10-26 08:42:33 INFO: Conversation logged to: /var/folders/qc/n9929lxs1hv_y70zclc7gpjh0000gn/T/lucas_claude_logs/claude_20241026_084233_383526.json
2024-10-26 08:42:33 INFO: Aggregate usage: {'input_tokens': 19, 'cache_creation_input_tokens': 21507, 'cache_read_input_tokens': 0, 'output_tokens': 53}
2024-10-26 08:42:33 INFO: input_tokens: 19, 0.00$ total
2024-10-26 08:42:33 INFO: cache_creation_input_tokens: 21507, 0.08$ total
2024-10-26 08:42:33 INFO: cache_read_input_tokens: 0, 0.00$ total
2024-10-26 08:42:33 INFO: output_tokens: 53, 0.00$ total
2024-10-26 08:42:33 INFO: requested tool: get_files({'filepaths': ['src/sqlfluff/rules/L060.py']})
2024-10-26 08:42:33 INFO: sending payload, size = 19740
2024-10-26 08:42:43 INFO: Conversation logged to: /var/folders/qc/n9929lxs1hv_y70zclc7gpjh0000gn/T/lucas_claude_logs/claude_20241026_084243_209730.json
2024-10-26 08:42:43 INFO: Aggregate usage: {'input_tokens': 671, 'cache_creation_input_tokens': 43028, 'cache_read_input_tokens': 0, 'output_tokens': 447}
2024-10-26 08:42:43 INFO: input_tokens: 671, 0.00$ total
2024-10-26 08:42:43 INFO: cache_creation_input_tokens: 43028, 0.16$ total
2024-10-26 08:42:43 INFO: cache_read_input_tokens: 0, 0.00$ total
2024-10-26 08:42:43 INFO: output_tokens: 447, 0.01$ total
2024-10-26 08:42:43 INFO: received final reply
2024-10-26 08:42:43 INFO: received 1 patches, applied 1.

The right file was requested from the start. Note that the Claude prompt cache was not reused here (cache_read_input_tokens stays at 0). This is because tool_choice differs between the first iteration and the later ones:

if toolset is not None:
    request["tools"] = toolset.definitions()
    # need something like 'any' here
    # TODO: this breaks prompt caching on first iteration
    tool_choice = {"type": "auto"} if i > 0 else {"type": "any"}
    request["tool_choice"] = tool_choice

The cost could be cut further if this is improved.
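To make the cost figures in the log concrete, here is a small sketch that recomputes them from the final aggregate usage. The per-million-token prices are an assumption (public Claude 3.5 Sonnet pricing at the time of the run), and usage_cost is a hypothetical helper, not a function from lucas. The last lines give a rough, hypothetical estimate of the saving if the 21507 tokens cached by the first call had been read from the cache on the second call instead of being written again.

```python
# Assumed USD prices per million tokens for Claude 3.5 Sonnet (not from lucas).
PRICE_PER_MTOK = {
    "input_tokens": 3.00,
    "cache_creation_input_tokens": 3.75,
    "cache_read_input_tokens": 0.30,
    "output_tokens": 15.00,
}


def usage_cost(usage: dict) -> dict:
    """Return the USD cost per usage category (hypothetical helper)."""
    return {k: n * PRICE_PER_MTOK[k] / 1_000_000 for k, n in usage.items()}


# Final aggregate usage, copied from the log above.
final_usage = {
    "input_tokens": 671,
    "cache_creation_input_tokens": 43028,
    "cache_read_input_tokens": 0,
    "output_tokens": 447,
}

costs = usage_cost(final_usage)
for key, cost in costs.items():
    print(f"{key}: {final_usage[key]}, {cost:.2f}$ total")

# Hypothetical: the prefix cached by the first call (21507 tokens) is read
# from the cache on the second call instead of being written a second time.
cached_prefix = 21507
saving = cached_prefix * (
    PRICE_PER_MTOK["cache_creation_input_tokens"]
    - PRICE_PER_MTOK["cache_read_input_tokens"]
) / 1_000_000
print(f"estimated saving: {saving:.2f}$")
```

Under these assumed prices the per-category amounts round to the same values as in the log (0.00$, 0.16$, 0.00$, 0.01$).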

Check the patch:

(base) studio ~/projects/sqlfluff % git diff
diff --git a/src/sqlfluff/rules/L060.py b/src/sqlfluff/rules/L060.py
index 836941edc..5057f59a5 100644
--- a/src/sqlfluff/rules/L060.py
+++ b/src/sqlfluff/rules/L060.py
@@ -59,4 +59,6 @@ class Rule_L060(BaseRule):
             ],
         )

-        return LintResult(context.segment, [fix])
+        return LintResult(
+            context.segment, [fix],
+            description=f"Use 'COALESCE' instead of '{context.segment.raw}'.")

Looks good, but I still need to set up the environment and implement faster indexing.
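As a quick illustration of what the patched description evaluates to, here is a sketch with a stand-in segment class. FakeSegment and l060_description are hypothetical names for this example only and are not part of sqlfluff; only the f-string mirrors the patch.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Stand-in for a parsed segment; only the raw text is modelled."""
    raw: str


def l060_description(segment: FakeSegment) -> str:
    # Same f-string as in the generated patch.
    return f"Use 'COALESCE' instead of '{segment.raw}'."


print(l060_description(FakeSegment("IFNULL")))
print(l060_description(FakeSegment("nvl")))
```

Because the raw text is interpolated as written, a lowercase nvl in the query is echoed in lowercase. Whether the message should normalise case is a separate design choice.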

Next experiments

Example

We'll use this repository as an example.

First, create a configuration file (lucas.conf):

{
    "chunk_size": 4096,
    "index_client": {"type": "GroqClient"},
    "query_client": {
        "type": "ClaudeClient",
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 6144,
        "tokens_rate": 200000,
        "cache": "ephemeral"
    },
    "crawler": {"includes": "*.py,*.json,*.txt", "traverse": "git"},
    "token_counter" : {"type": "local_counter", "endpoint": "http://localhost:8080/tokenize"}
}

We need to configure two separate LLM clients:

  1. index_client - used for indexing. This can be a slightly weaker model, as it will process the entire codebase. The example uses the Groq API, but a local model works as well.
  2. query_client - the bot that uses tools and tries to complete the task. I use a stronger model here.
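A minimal sketch of reading and sanity-checking such a file is shown below. The functions parse_config and load_config are hypothetical, and treating all five top-level keys as required is an assumption made for illustration; lucas itself may handle missing keys differently.

```python
import json

# Top-level keys that appear in the example configuration. Treating them all
# as required is an assumption, not documented lucas behaviour.
EXPECTED_KEYS = (
    "chunk_size", "index_client", "query_client", "crawler", "token_counter"
)


def parse_config(text: str) -> dict:
    """Parse configuration text and check the two client sections."""
    conf = json.loads(text)
    missing = [k for k in EXPECTED_KEYS if k not in conf]
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")
    for role in ("index_client", "query_client"):
        if "type" not in conf[role]:
            raise ValueError(f"{role} needs a 'type' field")
    return conf


def load_config(path: str = "lucas.conf") -> dict:
    """Read the configuration file from disk (hypothetical helper)."""
    with open(path) as f:
        return parse_config(f.read())
```

For the example above, load_config()["index_client"]["type"] would be "GroqClient" and load_config()["query_client"]["type"] would be "ClaudeClient".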

Install locally:

pip install -e .

Commands

Indexing

lcs index

This produces the file lucas.idx, a human-readable JSON file with summaries of files and directories. If the file already exists, lcs checks which files are new, changed, or deleted and re-indexes the changed files and their parent directories.
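The incremental update described above can be sketched as follows. This is an illustration under assumptions, not the lucas implementation: it supposes that a per-file checksum is stored at indexing time, and plan_reindex is a hypothetical function name.

```python
from pathlib import PurePosixPath


def plan_reindex(indexed: dict, current: dict):
    """Decide what to redo, given path -> checksum maps.

    indexed: checksums recorded when the index was built (assumed format).
    current: checksums of the files in the working tree now.
    Returns (files_to_index, dirs_to_resummarize, files_to_drop).
    """
    # New files have no stored checksum; changed files have a different one.
    to_index = {p for p, c in current.items() if indexed.get(p) != c}
    to_drop = set(indexed) - set(current)

    # Every ancestor directory of a touched file has a stale summary.
    # "." stands for the repository root.
    stale_dirs = set()
    for path in to_index | to_drop:
        for parent in PurePosixPath(path).parents:
            stale_dirs.add(str(parent))

    return sorted(to_index), sorted(stale_dirs), sorted(to_drop)
```

Files whose checksum is unchanged are left alone, so only the touched files and the directory summaries above them need another round trip to the indexing model.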

Querying

lcs query "What different LLM clients are used by this project?"

Uses the index and tools to answer the question.

Getting index stats

lcs stat

Shows size of the index (files, dirs, tokens).

Docker

Not sure yet how I should set this up, but something like:

docker build -t lucas_swebench2 .
docker run -e GITHUB_REPO="https://github.com/okuvshynov/lucas.git" -e COMMIT_HASH="e08a5145c13940c59bd4021346dbd5bd069a6813" lucas_swebench2