python_files.txt

<files>
<file>
<file_path>docs/src/examples/modal_langchain.py</file_path>
<file_content>
import sys
from modal import Secret, Stub, Image, web_endpoint
import lancedb
import re
import pickle
import requests
import zipfile
from pathlib import Path

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import LanceDB
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

lancedb_image = Image.debian_slim().pip_install(
    "lancedb", "langchain", "openai", "pandas", "tiktoken", "unstructured", "tabulate"
)

stub = Stub(
    name="example-langchain-lancedb",
    image=lancedb_image,
    secrets=[Secret.from_name("my-openai-secret")],
)

docsearch = None
docs_path = Path("docs.pkl")
db_path = Path("lancedb")


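def get_document_title(document):
    # Extract the page path between "pandas.documentation" and ".html".
    m = str(document.metadata["source"])
    title = re.findall(r"pandas\.documentation(.*)\.html", m)
    # re.findall returns an empty list when nothing matches, so test the list itself.
    if title:
        return title[0]
    return ""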


def download_docs():
    pandas_docs = requests.get(
        "https://eto-public.s3.us-west-2.amazonaws.com/datasets/pandas_docs/pandas.documentation.zip"
    )
    # Fail early on an HTTP error instead of writing an error page to disk.
    pandas_docs.raise_for_status()
    with open(Path("pandas.documentation.zip"), "wb") as f:
        f.write(pandas_docs.content)

    # Close the archive once extraction finishes.
    with zipfile.ZipFile(Path("pandas.documentation.zip")) as file:
        file.extractall(path=Path("pandas_docs"))


def store_docs():
    docs = []

    if not docs_path.exists():
        for p in Path("pandas_docs/pandas.documentation").rglob("*.html"):
            if p.is_dir():
                continue
            loader = UnstructuredHTMLLoader(p)
            raw_document = loader.load()

            m = {}
            m["title"] = get_document_title(raw_document[0])
            m["version"] = "2.0rc0"
            raw_document[0].metadata = raw_document[0].metadata | m
            raw_document[0].metadata["source"] = str(raw_document[0].metadata["source"])
            docs = docs + raw_document

        with docs_path.open("wb") as fh:
            pickle.dump(docs, fh)
    else:
        with docs_path.open("rb") as fh:
            docs = pickle.load(fh)

    return docs


def qanda_langchain(query):
    download_docs()
    docs = store_docs()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    documents = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings()

    db = lancedb.connect(db_path)
    table = db.create_table(
        "pandas_docs",
        data=[
            {
                "vector": embeddings.embed_query("Hello World"),
                "text": "Hello World",
                "id": "1",
            }
        ],
        mode="overwrite",
    )
    docsearch = LanceDB.from_documents(documents, embeddings, connection=table)
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever()
    )
    return qa.run(query)


@stub.function()
@web_endpoint(method="GET")
def web(query: str):
    answer = qanda_langchain(query)
    return {
        "answer": answer,
    }


@stub.function()
def cli(query: str):
    answer = qanda_langchain(query)
    print(answer)

</file_content>
<file_context>
<line>
<line_number>0, 1, 2</line_number>
<line_content>import sys, 
from modal import Secret, Stub, Image, web_endpoint, 
import lancedb</line_content>
<context>
The __import__() function is a wrapper around importlib.__import__(). import_module() simplifies importing modules and is the recommended programmatic way to import. find_spec() helps check if a module can be imported without loading it.</context>
</line>
<line>
<line_number>4, 5, 6</line_number>
<line_content>import pickle, 
import requests, 
import zipfile</line_content>
<context>
The zipimport module adds the ability to import Python modules from ZIP archives. It allows sys.path to contain paths to ZIP files, enabling modules inside those archives to be imported. The archive can have subdirectories to support package imports.</context>
</line>
<line>
<line_number>7, 9, 10</line_number>
<line_content>from pathlib import Path, 
from langchain.document_loaders import UnstructuredHTMLLoader, 
from langchain.embeddings import OpenAIEmbeddings</line_content>
<context>
The TraversableResources abstract base class extends ResourceReader to provide a concrete implementation for serving files through the importlib.resources module. It allows a loader to support reading package resources through both the</context>
</line>
<line>
<line_number>11, 12, 13</line_number>
<line_content>from langchain.text_splitter import RecursiveCharacterTextSplitter, 
from langchain.vectorstores import LanceDB, 
from langchain.llms import OpenAI</line_content>
<context>
Various methods exist for Unicode objects like concatenation, splitting, joining, finding substrings, replacing, comparison and formatting. The PyUnicode_InternInPlace and PyUnicode_InternFromString functions can intern strings.</context>
</line>
<line>
<line_number>14, 16, 17</line_number>
<line_content>from langchain.chains import RetrievalQA, 
lancedb_image = Image.debian_slim().pip_install(, 
'lancedb', 'langchain', 'openai', 'pandas', 'tiktoken', 'unstructured', 'tabulate'</line_content>
<context>
The importlib.metadata module provides access to the metadata of installed Python distribution packages. It can get entry points, metadata, version strings, files, and requirements for a distribution.</context>
</line>
<line>
<line_number>20, 21, 22</line_number>
<line_content>stub = Stub(, 
name='example-langchain-lancedb',, 
image=lancedb_image,</line_content>
<context>
how they are called. The Mock class removes the need to create multiple stubs and allows configuring return values, side effects, and tracking call arguments. Mock also supports mocking magic methods like __str__ and __len__.</context>
</line>
<line>
<line_number>23, 26, 27</line_number>
<line_content>secrets=[Secret.from_name('my-openai-secret')],, 
docsearch = None, 
docs_path = Path('docs.pkl')</line_content>
<context>
path. The search path can be customized by setting the PYTHONHOME or PYTHONPATH environment variables before calling Py_Initialize().</context>
</line>
<line>
<line_number>28, 31, 32</line_number>
<line_content>db_path = Path('lancedb'), 
def get_document_title(document):, 
m = str(document.metadata['source'])</line_content>
<context>
The pydoc module generates documentation for Python modules, functions, classes, and methods. It displays documentation derived from docstrings in multiple formats - as text on the console, served to a web browser, or saved as HTML files.</context>
</line>
<line>
<line_number>33, 34, 35</line_number>
<line_content>title = re.findall('pandas.documentation(.*).html', m), 
if title[0] is not None:, 
return title[0]</line_content>
<context>
The html module defines utilities for manipulating HTML in Python code. The key functions are html.escape() and html.unescape().</context>
</line>
<line>
<line_number>39, 40, 41</line_number>
<line_content>def download_docs():, 
pandas_docs = requests.get(, 
'https://eto-public.s3.us-west-2.amazonaws.com/datasets/pandas_docs/pandas.documentation.zip'</line_content>
<context>
The urllib.request module provides functions and classes for fetching URLs and making HTTP requests in python. Some key points:</context>
</line>
<line>
<line_number>43, 44, 46</line_number>
<line_content>with open(Path('pandas.documentation.zip'), 'wb') as f:, 
f.write(pandas_docs.content), 
file = zipfile.ZipFile(Path('pandas.documentation.zip'))</line_content>
<context>
The zipfile module in Python provides tools for working with ZIP archives. The module allows you to create, read, write, append, and list the contents of a ZIP file.</context>
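The write, list, and extract steps described above can be sketched as follows. This is a minimal illustration using only the standard library; the archive name `docs.zip`, the `extracted` directory, and the helper `roundtrip_zip` are hypothetical and not part of the original script.

```python
import tempfile
import zipfile
from pathlib import Path


def roundtrip_zip(files: dict[str, str]) -> dict[str, str]:
    """Write `files` (name -> text) into a ZIP archive, extract it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "docs.zip"    # hypothetical archive name
        out_dir = Path(tmp) / "extracted"   # hypothetical output directory

        # Create the archive in write mode and add each member from a string.
        with zipfile.ZipFile(archive, "w") as zf:
            for name, text in files.items():
                zf.writestr(name, text)

        # Reopen in read mode, list the members, and extract them all.
        with zipfile.ZipFile(archive) as zf:
            names = zf.namelist()
            zf.extractall(path=out_dir)

        return {name: (out_dir / name).read_text() for name in names}


result = roundtrip_zip(
    {"index.html": "<h1>pandas</h1>", "api/frame.html": "<p>DataFrame</p>"}
)
```

Opening the archive in a `with` block closes the file handle even if extraction raises.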
</line>
<line>
<line_number>47, 50, 53</line_number>
<line_content>file.extractall(path=Path('pandas_docs')), 
def store_docs():, 
if not docs_path.exists():</line_content>
<context>
in future Python versions. The new API with files() and traversables is recommended instead.</context>
</line>
<line>
<line_number>54, 55, 57</line_number>
<line_content>for p in Path('pandas_docs/pandas.documentation').rglob('*.html'):, 
if p.is_dir():, 
loader = UnstructuredHTMLLoader(p)</line_content>
<context>
The html module defines utilities for manipulating HTML in Python code. The key functions are html.escape() and html.unescape().</context>
</line>
<line>
<line_number>58, 61, 62</line_number>
<line_content>raw_document = loader.load(), 
m['title'] = get_document_title(raw_document[0]), 
m['version'] = '2.0rc0'</line_content>
<context>
The format has changed across Python versions for compatibility reasons. There is a version argument to select the format to use. The current version is 4.</context>
</line>
<line>
<line_number>63, 64, 65</line_number>
<line_content>raw_document[0].metadata = raw_document[0].metadata | m, 
raw_document[0].metadata['source'] = str(raw_document[0].metadata['source']), 
docs = docs + raw_document</line_content>
<context>
- ast.parse - Parses source code into an AST
- ast.unparse - Unparses an AST back into source code
- ast.literal_eval - Safely evaluates a string with a Python literal expression 
- ast.get_docstring - Gets the docstring of a node</context>
</line>
<line>
<line_number>67, 68, 70</line_number>
<line_content>with docs_path.open('wb') as fh:, 
pickle.dump(docs, fh), 
with docs_path.open('rb') as fh:</line_content>
<context>
The pickle interface consists of two main functions - dumps() to serialize objects to a byte stream, and loads() to deserialize the byte stream back into Python objects. There are also convenience functions like dump() and load() to work directly</context>
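A short sketch of both pairs of functions, assuming a small list of dictionaries as a stand-in for the loaded documents. The sample data is illustrative; the cache file name mirrors the `docs.pkl` path used in the script.

```python
import pickle
import tempfile
from pathlib import Path

docs = [{"title": "frame", "version": "2.0rc0"}]  # illustrative sample data

# In-memory round trip: dumps() returns bytes, loads() rebuilds the object.
blob = pickle.dumps(docs)
restored = pickle.loads(blob)

# File-based round trip with dump()/load(), similar to a docs.pkl cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "docs.pkl"
    with cache.open("wb") as fh:
        pickle.dump(docs, fh)
    with cache.open("rb") as fh:
        cached = pickle.load(fh)
```

Unpickling can execute arbitrary code, so a cache like this should only be loaded from a trusted location.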
</line>
<line>
<line_number>71, 73, 76</line_number>
<line_content>docs = pickle.load(fh), 
return docs, 
def qanda_langchain(query):</line_content>
<context>
The pickle interface consists of two main functions - dumps() to serialize objects to a byte stream, and loads() to deserialize the byte stream back into Python objects. There are also convenience functions like dump() and load() to work directly</context>
</line>
<line>
<line_number>77, 78, 80</line_number>
<line_content>download_docs(), 
docs = store_docs(), 
text_splitter = RecursiveCharacterTextSplitter(</line_content>
<context>
The pydoc module generates documentation for Python modules, functions, classes, and methods. It displays documentation derived from docstrings in multiple formats - as text on the console, served to a web browser, or saved as HTML files.</context>
</line>
<line>
<line_number>81, 82, 84</line_number>
<line_content>chunk_size=1000,, 
chunk_overlap=200,, 
documents = text_splitter.split_documents(docs)</line_content>
<context>
of a chunk, you create a new Chunk instance to process the next chunk. This continues until the end of the file is reached and creating a new Chunk fails with EOFError.</context>
</line>
<line>
<line_number>85, 87, 88</line_number>
<line_content>embeddings = OpenAIEmbeddings(), 
db = lancedb.connect(db_path), 
table = db.create_table(</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>89, 92, 93</line_number>
<line_content>'pandas_docs',, 
'vector': embeddings.embed_query('Hello World'),, 
'text': 'Hello World',</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>94, 97, 99</line_number>
<line_content>'id': '1',, 
mode='overwrite',, 
docsearch = LanceDB.from_documents(documents, embeddings, connection=table)</line_content>
<context>
The doctest module checks examples in docstrings and text files, executing them and comparing the output to expected results. It contains APIs for using doctest functionality in different ways.</context>
</line>
<line>
<line_number>100, 101, 103</line_number>
<line_content>qa = RetrievalQA.from_chain_type(, 
llm=OpenAI(), chain_type='stuff', retriever=docsearch.as_retriever(), 
return qa.run(query)</line_content>
<context>
- urlencode() - Convert a dictionary into a urlencoded query string to be appended to a URL.

- parse_qs() and parse_qsl() - Parse query strings into Python data structures.</context>
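As a rough illustration of these two functions, the snippet below encodes a single `query` parameter, the same parameter name the `web` endpoint accepts, and parses it back. The question text is made up for the example.

```python
from urllib.parse import parse_qs, urlencode

params = {"query": "How do I merge two DataFrames?"}  # illustrative question

# urlencode() percent-encodes reserved characters and turns spaces into "+".
encoded = urlencode(params)

# parse_qs() reverses the encoding; each key maps to a list of values.
decoded = parse_qs(encoded)
```

Here `encoded` is `query=How+do+I+merge+two+DataFrames%3F`, and `decoded["query"]` is a one-element list holding the original question.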
</line>
<line>
<line_number>106, 107, 108</line_number>
<line_content>@stub.function(), 
@web_endpoint(method='GET'), 
def web(query: str):</line_content>
<context>
The urllib.request module provides functions and classes for fetching URLs and making HTTP requests in python. Some key points:</context>
</line>
<line>
<line_number>109, 111, 115</line_number>
<line_content>answer = qanda_langchain(query), 
'answer': answer,, 
@stub.function()</line_content>
<context>
- urlencode() - Convert a dictionary into a urlencoded query string to be appended to a URL.

- parse_qs() and parse_qsl() - Parse query strings into Python data structures.</context>
</line>
<line>
<line_number>116, 117, 118</line_number>
<line_content>def cli(query: str):, 
answer = qanda_langchain(query), 
print(answer)</line_content>
<context>
- urlencode() - Convert a dictionary into a urlencoded query string to be appended to a URL.

- parse_qs() and parse_qsl() - Parse query strings into Python data structures.
</context>
</line>
</file_context>
</file>
<file>
<file_path>docs/src/notebooks/diffusiondb/datagen.py</file_path>
<file_content>
#!/usr/bin/env python
#
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""Dataset hf://poloclub/diffusiondb
"""

import io
from argparse import ArgumentParser
from multiprocessing import Pool

import lance
import lancedb
import pyarrow as pa
from datasets import load_dataset
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, CLIPTokenizerFast

MODEL_ID = "openai/clip-vit-base-patch32"

device = "cuda"

tokenizer = CLIPTokenizerFast.from_pretrained(MODEL_ID)
model = CLIPModel.from_pretrained(MODEL_ID).to(device)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

schema = pa.schema(
    [
        pa.field("prompt", pa.string()),
        pa.field("seed", pa.uint32()),
        pa.field("step", pa.uint16()),
        pa.field("cfg", pa.float32()),
        pa.field("sampler", pa.string()),
        pa.field("width", pa.uint16()),
        pa.field("height", pa.uint16()),
        pa.field("timestamp", pa.timestamp("s")),
        pa.field("image_nsfw", pa.float32()),
        pa.field("prompt_nsfw", pa.float32()),
        pa.field("vector", pa.list_(pa.float32(), 512)),
        pa.field("image", pa.binary()),
    ]
)


def pil_to_bytes(img) -> bytes:
    # Serialize a PIL image to PNG-encoded bytes.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def generate_clip_embeddings(batch) -> dict:
    # Returns the same batch mapping with "vector" and "image_bytes" columns added.
    image = processor(text=None, images=batch["image"], return_tensors="pt")[
        "pixel_values"
    ].to(device)
    img_emb = model.get_image_features(image)
    batch["vector"] = img_emb.cpu().tolist()

    with Pool() as p:
        batch["image_bytes"] = p.map(pil_to_bytes, batch["image"])
    return batch


def datagen(args):
    """Generate DiffusionDB dataset, and use CLIP model to generate image embeddings."""
    dataset = load_dataset("poloclub/diffusiondb", args.subset)
    data = []
    for b in dataset.map(
        generate_clip_embeddings, batched=True, batch_size=256, remove_columns=["image"]
    )["train"]:
        b["image"] = b["image_bytes"]
        del b["image_bytes"]
        data.append(b)
    tbl = pa.Table.from_pylist(data, schema=schema)
    return tbl


def main():
    parser = ArgumentParser()
    parser.add_argument(
        "-o", "--output", metavar="DIR", help="Output lance directory", required=True
    )
    parser.add_argument(
        "-s",
        "--subset",
        choices=["2m_all", "2m_first_10k", "2m_first_100k"],
        default="2m_first_10k",
        help="subset of the Hugging Face dataset",
    )

    args = parser.parse_args()

    batches = datagen(args)
    lance.write_dataset(batches, args.output)


if __name__ == "__main__":
    main()

</file_content>
<file_context>
<line>
<line_number>15, 19, 20</line_number>
<line_content>'''Dataset hf://poloclub/diffusiondb, 
from argparse import ArgumentParser, 
from multiprocessing import Pool</line_content>
<context>
parsing arguments. Additional information covers building values in Python from C values.</context>
</line>
<line>
<line_number>22, 23, 24</line_number>
<line_content>import lance, 
import lancedb, 
import pyarrow as pa</line_content>
<context>
PyImport_ImportModuleEx() imports a module by name, with additional globals, locals, and fromlist arguments similar to Python's __import__() function. It returns the imported module or NULL if there was an error.</context>
</line>
<line>
<line_number>25, 26, 27</line_number>
<line_content>from datasets import load_dataset, 
from PIL import Image, 
from transformers import CLIPModel, CLIPProcessor, CLIPTokenizerFast</line_content>
<context>
The main functions are tomllib.load() which parses a TOML file, and tomllib.loads() which parses a TOML string. They take the TOML source as the first argument, and return a dict of the parsed data. An optional parse_float argument can be passed to</context>
</line>
<line>
<line_number>29, 31, 33</line_number>
<line_content>MODEL_ID = 'openai/clip-vit-base-patch32', 
device = 'cuda', 
tokenizer = CLIPTokenizerFast.from_pretrained(MODEL_ID)</line_content>
<context>
The encode_* functions raise TypeError if passed a multipart message instead of encoding the subparts individually. They extract the payload, encode it, and reset the payload to the encoded value.</context>
</line>
<line>
<line_number>34, 35, 37</line_number>
<line_content>model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32').to(device), 
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32'), 
schema = pa.schema(</line_content>
<context>
PySlice_New creates a new slice object given start, stop, and step values (any can be None). PySlice_GetIndices and PySlice_GetIndicesEx extract the start, stop, and step values from a slice assuming a sequence of a given length, clipping out of</context>
</line>
<line>
<line_number>39, 40, 41</line_number>
<line_content>pa.field('prompt', pa.string()),, 
pa.field('seed', pa.uint32()),, 
pa.field('step', pa.uint16()),</line_content>
<context>
__name__ attribute. PyModule_GetState returns the module state.</context>
</line>
<line>
<line_number>42, 43, 44</line_number>
<line_content>pa.field('cfg', pa.float32()),, 
pa.field('sampler', pa.string()),, 
pa.field('width', pa.uint16()),</line_content>
<context>
- Get a string representation of an object, like with PyObject_Str and PyObject_Repr. These implement str() and repr().

- Get the length or size of an object, with PyObject_Length and PyObject_Size. These implement len().</context>
</line>
<line>
<line_number>45, 46, 47</line_number>
<line_content>pa.field('height', pa.uint16()),, 
pa.field('timestamp', pa.timestamp('s')),, 
pa.field('image_nsfw', pa.float32()),</line_content>
<context>
PyFloat_GetInfo returns a structseq with info on float precision, max, and min. PyFloat_GetMax returns the max float DBL_MAX. PyFloat_GetMin returns the min float DBL_MIN.</context>
</line>
<line>
<line_number>48, 49, 50</line_number>
<line_content>pa.field('prompt_nsfw', pa.float32()),, 
pa.field('vector', pa.list_(pa.float32(), 512)),, 
pa.field('image', pa.binary()),</line_content>
<context>
the __name__ and __qualname__ attributes from the passed in name and qualname arguments. The PyCoro_New function steals a reference to the frame object passed in.</context>
</line>
<line>
<line_number>55, 56, 57</line_number>
<line_content>def pil_to_bytes(img) -> list[bytes]:, 
buf = io.BytesIO(), 
img.save(buf, format='PNG')</line_content>
<context>
You can concatenate bytes objects with PyBytes_Concat, which creates a new bytes object with the contents of the old and new bytes objects.</context>
</line>
<line>
<line_number>58, 61, 62</line_number>
<line_content>return buf.getvalue(), 
def generate_clip_embeddings(batch) -> pa.RecordBatch:, 
image = processor(text=None, images=batch['image'], return_tensors='pt')[</line_content>
<context>
- format_list() - Takes a list from extract_tb() and formats the frames for printing.

- format_exception() - Formats exception and traceback info into a list of strings for printing.</context>
</line>
<line>
<line_number>63, 64, 65</line_number>
<line_content>'pixel_values', 
].to(device), 
img_emb = model.get_image_features(image)</line_content>
<context>
and PyCode_GetFreevars return the names of local variables, cell variables, and free variables respectively.</context>
</line>
<line>
<line_number>66, 68, 69</line_number>
<line_content>batch['vector'] = img_emb.cpu().tolist(), 
with Pool() as p:, 
batch['image_bytes'] = p.map(pil_to_bytes, batch['image'])</line_content>
<context>
You can concatenate bytes objects with PyBytes_Concat, which creates a new bytes object with the contents of the old and new bytes objects.</context>
</line>
<line>
<line_number>70, 73, 74</line_number>
<line_content>return batch, 
def datagen(args):, 
'''Generate DiffusionDB dataset, and use CLIP model to generate image embeddings.'''</line_content>
<context>
The command line interface allows creating an archive from a directory containing Python code. It has options to specify the output file, Python interpreter to use in the shebang line, main function to call, whether to compress files, and to display</context>
</line>
<line>
<line_number>75, 77, 78</line_number>
<line_content>dataset = load_dataset('poloclub/diffusiondb', args.subset), 
for b in dataset.map(, 
generate_clip_embeddings, batched=True, batch_size=256, remove_columns=['image']</line_content>
<context>
For example, rgb_to_hsv can convert an (R, G, B) tuple to (H, S, V) and hsv_to_rgb does the reverse conversion. The colorsys module enables flexible color space conversions in Python.</context>
</line>
<line>
<line_number>79, 80, 81</line_number>
<line_content>)['train']:, 
b['image'] = b['image_bytes'], 
del b['image_bytes']</line_content>
<context>
Python encodes them into bytes using an encoding like UTF-8.</context>
</line>
<line>
<line_number>82, 83, 84</line_number>
<line_content>data.append(b), 
tbl = pa.Table.from_pylist(data, schema=schema), 
return tbl</line_content>
<context>
PyList_Insert inserts an item at a given index. PyList_Append adds an item to the end of a list.</context>
</line>
<line>
<line_number>87, 88, 89</line_number>
<line_content>def main():, 
parser = ArgumentParser(), 
parser.add_argument(</line_content>
<context>
The add_argument() method is used to register arguments with the parser. It allows specifying options like the type, default value, help text, and more. The parse_args() method then does the parsing and conversion to create the namespace object.</context>
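A minimal sketch of that flow, modeled loosely on the options in `main()`. The argument list is passed explicitly so the example runs without a real command line; the value `out.lance` is a hypothetical output path.

```python
from argparse import ArgumentParser

parser = ArgumentParser()
# A required option with a metavar shown in the generated help text.
parser.add_argument("-o", "--output", metavar="DIR", required=True, help="Output directory")
# An option restricted to a fixed set of choices, with a default.
parser.add_argument(
    "-s",
    "--subset",
    choices=["2m_first_10k", "2m_first_100k"],
    default="2m_first_10k",
)

# parse_args() accepts an explicit list instead of reading sys.argv.
args = parser.parse_args(["-o", "out.lance"])
```

The resulting namespace exposes `args.output` and `args.subset`; omitted options fall back to their defaults.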
</line>
<line>
<line_number>90, 92, 94</line_number>
<line_content>'-o', '--output', metavar='DIR', help='Output lance directory', required=True, 
parser.add_argument(, 
'--subset',</line_content>
<context>
- report(), report_partial_closure(), report_full_closure() - Print reports of the directory comparison.

- left, right - The left and right directories. 

- common_dirs, common_files - Common subdirectories and files.</context>
</line>
<line>
<line_number>95, 96, 97</line_number>
<line_content>choices=['2m_all', '2m_first_10k', '2m_first_100k'],, 
default='2m_first_10k',, 
help='subset of the hg dataset',</line_content>
<context>
After parsing, opts contains a list of (option, value) tuples and args contains the remaining non-option arguments. These can then be processed in the script.</context>
</line>
<line>
<line_number>100, 102, 103</line_number>
<line_content>args = parser.parse_args(), 
batches = datagen(args), 
lance.write_dataset(batches, args.output)</line_content>
<context>
Overall, the argparse module provides powerful command-line parsing with features like positional arguments, optional arguments, help text generation, and argument conversion. The examples show many typical use cases for the module.
</context>
</line>
</file_context>
</file>
<file>
<file_path>docs/test/md_testing.py</file_path>
<file_content>
import glob
from typing import Iterator
from pathlib import Path

excluded_files = [
    "../src/fts.md",
    "../src/embedding.md",
    "../src/examples/serverless_lancedb_with_s3_and_lambda.md",
    "../src/examples/serverless_qa_bot_with_modal_and_langchain.md",
    "../src/examples/youtube_transcript_bot_with_nodejs.md"
]
languages = ["py", "javascript"]
glob_string = "../src/**/*.md"

def yield_lines(lines: Iterator[str], prefix: str, suffix: str, languages: list):
    current_language = {language: False for language in languages}
    for line in lines:
        for language in languages:
            if line.strip().startswith(prefix + language):
                current_language[language] = True
            elif current_language[language] and line.strip().startswith(suffix):
                current_language[language] = False
                yield ("\n", language)
            elif current_language[language]:
                yield (line, language)

def create_code_files(prefix: str, suffix: str, file_ending: str = ""):
    for file in filter(lambda file: file not in excluded_files, glob.glob(glob_string, recursive=True)):
        with open(file, "r") as f:
            lines = list(yield_lines(iter(f), prefix, suffix, languages))
            python_lines = [line[0] for line in lines if line[1] == "py"]
            node_lines = [line[0] for line in lines if line[1] == "javascript"]

        if len(python_lines) > 0:
            # Path.stem drops only the ".md" suffix; str.strip(".md") would also trim leading or trailing ".", "m", "d" characters.
            python_out_path = Path("python") / Path(file).stem / (Path(file).stem + file_ending + ".py")
            python_out_path.parent.mkdir(exist_ok=True, parents=True)
            with open(python_out_path, "w") as python_out:
                python_out.writelines(python_lines)

        if len(node_lines) > 0:
            # Path.stem drops only the ".md" suffix; str.strip(".md") would also trim leading or trailing ".", "m", "d" characters.
            node_out_path = Path("node") / Path(file).stem / (Path(file).stem + file_ending + ".js")
            node_out_path.parent.mkdir(exist_ok=True, parents=True)
            with open(node_out_path, "w") as node_out:
                node_out.write("(async () => {\n")
                node_out.writelines(node_lines)
                node_out.write("})();")

# Setup doc code
create_code_files("<!--", "-->", "-setup")

# Actual doc code
create_code_files("```", "```")

</file_content>
<file_context>
<line>
<line_number>0, 1, 2</line_number>
<line_content>import glob, 
from typing import Iterator, 
from pathlib import Path</line_content>
<context>
with '[]'. glob.glob() returns a list of matching pathnames, which can be absolute or relative paths. The glob.iglob() function returns an iterator instead of a list. The glob module uses os.scandir() and fnmatch.fnmatch() internally. Files starting</context>
</line>
<line>
<line_number>4, 5, 6</line_number>
<line_content>excluded_files = [, 
'../src/fts.md',, 
'../src/embedding.md',</line_content>
<context>
- include *.txt to add all .txt files
- recursive-include examples *.py to add all .py files recursively under examples/
- prune examples/tmp to exclude the examples/tmp directory</context>
</line>
<line>
<line_number>7, 8, 9</line_number>
<line_content>'../src/examples/serverless_lancedb_with_s3_and_lambda.md',, 
'../src/examples/serverless_qa_bot_with_modal_and_langchain.md',, 
'../src/examples/youtube_transcript_bot_with_nodejs.md'</line_content>
<context>
with the ext_modules option using Extension objects. Scripts are listed under the scripts option. Data files and extra files can also be included.</context>
</line>
<line>
<line_number>11, 12, 14</line_number>
<line_content>languages = ['py', 'javascript'], 
glob_string = '../src/**/*.md', 
def yield_lines(lines: Iterator[str], prefix: str, suffix: str, languages: list):</line_content>
<context>
Some examples of usage:

- Append a tr command to uppercase:
  t.append('tr a-z A-Z', '--')
  
- Open a file-like object for writing text through the pipeline: 
  f = t.open('outfile', 'w')</context>
</line>
<line>
<line_number>15, 16, 17</line_number>
<line_content>current_language = {language: False for language in languages}, 
for line in lines:, 
for language in languages:</line_content>
<context>
Python does not use braces for blocks in "if", "while", "for", etc because it was influenced by the ABC language which found it improved readability. The colon introduces the block and eliminates ambiguity about scope. Editors can also use the colon</context>
</line>
<line>
<line_number>18, 19, 20</line_number>
<line_content>if line.strip().startswith(prefix + language):, 
current_language[language] = True, 
elif current_language[language] and line.strip().startswith(suffix):</line_content>
<context>
returns true if Python is currently initialized.</context>
</line>
<line>
<line_number>21, 22, 23</line_number>
<line_content>current_language[language] = False, 
yield ('\n', language), 
elif current_language[language]:</line_content>
<context>
The language is named after Monty Python's Flying Circus. The tutorial invites you to play with the Python interpreter while reading to learn the language through examples. It covers basic language elements like expressions, data types, functions</context>
</line>
<line>
<line_number>24, 26, 27</line_number>
<line_content>yield (line, language), 
def create_code_files(prefix: str, suffix: str, file_ending: str = ''):, 
for file in filter(lambda file: file not in excluded_files, glob.glob(glob_string, recursive=True)):</line_content>
<context>
txt_files = fnmatch.filter(names, '*.txt')

So in summary, fnmatch provides simple Unix style filename matching functionality in Python. It can be useful for filtering lists of files and doing glob style matching.</context>
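The filtering described above can be combined with an exclusion list, similar in spirit to how the script skips some Markdown files. This is a sketch only; the file names below are invented for the example.

```python
import fnmatch

# Invented candidate paths; only the .md entries should survive the filter.
names = [
    "../src/fts.md",
    "../src/index.md",
    "../src/logo.png",
    "../src/examples/demo.md",
]
excluded = {"../src/fts.md"}

# fnmatch.filter() keeps the names that match the Unix-style pattern.
markdown = fnmatch.filter(names, "*.md")

# Drop anything on the exclusion list, preserving the original order.
selected = [name for name in markdown if name not in excluded]
```

Unlike shell globbing, `fnmatch` does not treat `/` specially, so `*.md` matches paths in nested directories too.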
</line>
<line>
<line_number>28, 29, 30</line_number>
<line_content>with open(file, 'r') as f:, 
lines = list(yield_lines(iter(f), prefix, suffix, languages)), 
python_lines = [line[0] for line in lines if line[1] == 'py']</line_content>
<context>
A Python program is divided into logical lines terminated by NEWLINE tokens. Logical lines can span multiple physical lines using explicit line joining with backslashes or implicit line joining by enclosing expressions in</context>
</line>
<line>
<line_number>31, 33, 34</line_number>
<line_content>node_lines = [line[0] for line in lines if line[1] == 'javascript'], 
if len(python_lines) > 0:, 
python_out_path = Path('python') / Path(file).name.strip('.md') / (Path(file).name.strip('.md') + file_ending + '.py')</line_content>
<context>
A Python program is divided into logical lines terminated by NEWLINE tokens. Logical lines can span multiple physical lines using explicit line joining with backslashes or implicit line joining by enclosing expressions in</context>
</line>
<line>
<line_number>35, 36, 37</line_number>
<line_content>python_out_path.parent.mkdir(exist_ok=True, parents=True), 
with open(python_out_path, 'w') as python_out:, 
python_out.writelines(python_lines)</line_content>
<context>
The PYTHONPATH environment variable can add more directories to the search path.</context>
</line>
<line>
<line_number>39, 40, 41</line_number>
<line_content>if len(node_lines) > 0:, 
node_out_path = Path('node') / Path(file).name.strip('.md') / (Path(file).name.strip('.md') + file_ending + '.js'), 
node_out_path.parent.mkdir(exist_ok=True, parents=True)</line_content>
<context>
Methods like Path.exists(), Path.is_dir(), Path.is_file(), Path.open() allow querying properties of a filesystem path and interacting with the filesystem. Path.rmdir(), Path.unlink(), Path.rename(), and Path.replace() perform system calls to remove,</context>
</line>
<line>
<line_number>42, 43, 44</line_number>
<line_content>with open(node_out_path, 'w') as node_out:, 
node_out.write('(async () => {\n'), 
node_out.writelines(node_lines)</line_content>
<context>
The asyncio streams module provides high-level async/await-ready primitives to work with network connections and streams. The key functions are asyncio.open_connection(), asyncio.start_server(), asyncio.open_unix_connection(), and</context>
</line>
<line>
<line_number>45, 48, 51</line_number>
<line_content>node_out.write('})();'), 
create_code_files('<!--', '-->', '-setup'), 
create_code_files('```', '```')</line_content>
<context>
- Can output code directly into the file, or into separate buffers/files for cleaner code organization.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/__init__.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from .db import URI, LanceDBConnection


def connect(uri: URI) -> LanceDBConnection:
    """Connect to a LanceDB instance at the given URI

    Parameters
    ----------
    uri: str or Path
        The uri of the database.

    Examples
    --------

    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb")

    Returns
    -------
    conn : LanceDBConnection
        A connection to a LanceDB database.
    """
    return LanceDBConnection(uri)

</file_content>
<file_context>
<line>
<line_number>13, 16, 17</line_number>
<line_content>from .db import URI, LanceDBConnection, 
def connect(uri: URI) -> LanceDBConnection:, 
'''Connect to a LanceDB instance at the given URI</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>19, 20, 21</line_number>
<line_content>Parameters, 
----------, 
uri: str or Path</line_content>
<context>
- urlunparse() - Puts a parsed URL back together into a complete URL string. This is the inverse of urlparse(). 

- urlsplit() - Similar to urlparse() but doesn't split params and query.</context>
</line>
<line>
<line_number>22, 27, 29</line_number>
<line_content>The uri of the database., 
For a local directory, provide a path for the database:, 
>>> import lancedb</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>30, 32, 34</line_number>
<line_content>>>> db = lancedb.connect('~/.lancedb'), 
For object storage, use a URI prefix:, 
>>> db = lancedb.connect('s3://my-bucket/lancedb')</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>38, 39, 41</line_number>
<line_content>conn : LanceDBConnection, 
A connection to a LanceDB database., 
return LanceDBConnection(uri)</line_content>
<context>
A Connection represents a db connection. It can create Cursor objects to execute SQL statements. Connection provides methods like commit(), rollback(), close() for transaction control. The isolation_level attribute controls implicit transaction
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/common.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
from pathlib import Path
from typing import List, Union

import numpy as np
import pandas as pd
import pyarrow as pa

VEC = Union[list, np.ndarray, pa.Array, pa.ChunkedArray]
URI = Union[str, Path]

# TODO support generator
DATA = Union[List[dict], dict, pd.DataFrame]
VECTOR_COLUMN_NAME = "vector"

</file_content>
<file_context>
<line>
<line_number>12, 13, 15</line_number>
<line_content>from pathlib import Path, 
from typing import List, Union, 
import numpy as np</line_content>
<context>
The Python standard library contains many built-in modules and functions that provide common functionality for Python programmers. It includes modules for file I/O, system access, data types like lists and dictionaries, text processing, networking,</context>
</line>
<line>
<line_number>16, 17, 19</line_number>
<line_content>import pandas as pd, 
import pyarrow as pa, 
VEC = Union[list, np.ndarray, pa.Array, pa.ChunkedArray]</line_content>
<context>
The python list data type has a built-in sort() method that sorts the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.</context>
</line>
<line>
<line_number>20, 23, 24</line_number>
<line_content>URI = Union[str, Path], 
DATA = Union[List[dict], dict, pd.DataFrame], 
VECTOR_COLUMN_NAME = 'vector'</line_content>
<context>
- urlencode() - Convert a dictionary into a urlencoded query string to be appended to a URL.

- parse_qs() and parse_qsl() - Parse query strings into Python data structures.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/conftest.py</file_path>
<file_content>
import builtins
import os

import pytest

# import lancedb so we don't have to in every example
import lancedb


@pytest.fixture(autouse=True)
def doctest_setup(monkeypatch, tmpdir):
    # disable color for doctests so we don't have to include
    # escape codes in docstrings
    monkeypatch.setitem(os.environ, "NO_COLOR", "1")
    # Explicitly set the column width
    monkeypatch.setitem(os.environ, "COLUMNS", "80")
    # Work in a temporary directory
    monkeypatch.chdir(tmpdir)

</file_content>
<file_context>
<line>
<line_number>0, 3, 6</line_number>
<line_content>import builtins, 
import pytest, 
import lancedb</line_content>
<context>
installation paths and configuration variables. The builtins module contains built-in Python objects.</context>
</line>
<line>
<line_number>9, 10, 11</line_number>
<line_content>@pytest.fixture(autouse=True), 
def doctest_setup(monkeypatch, tmpdir):, 
# disable color for doctests so we don't have to include</line_content>
<context>
doctest supports finer grained control through:

- Example - Encapsulates a single Python statement and expected output.

- DocTest - Collects Examples extracted from a docstring.

- OutputChecker - Compares actual to expected output.</context>
</line>
<line>
<line_number>12, 13, 14</line_number>
<line_content># escape codes in docstrings, 
monkeypatch.setitem(os.environ, 'NO_COLOR', '1'), 
# Explicitly set the column width</line_content>
<context>
The format has changed across Python versions for compatibility reasons. There is a version argument to select the format to use. The current version is 4.</context>
</line>
<line>
<line_number>15, 16, 17</line_number>
<line_content>monkeypatch.setitem(os.environ, 'COLUMNS', '80'), 
# Work in a temporary directory, 
monkeypatch.chdir(tmpdir)</line_content>
<context>
- os.environ - dictionary containing environment variables
- os.getcwd() - get current working directory
- os.chdir() - change current working directory
- os.mkdir() - create a directory
- os.rmdir() - remove a directory
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/context.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
from __future__ import annotations

import pandas as pd
from .exceptions import MissingValueError, MissingColumnError


def contextualize(raw_df: pd.DataFrame) -> Contextualizer:
    """Create a Contextualizer object for the given DataFrame.

    Used to create context windows. Context windows are rolling subsets of text
    data.

    The input text column should already be separated into rows that will be the
    unit of the window. So to create a context window over tokens, start with
    a DataFrame with one token per row. To create a context window over sentences,
    start with a DataFrame with one sentence per row.

    Examples
    --------
    >>> from lancedb.context import contextualize
    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
    ...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
    ...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
    ... })

    ``window`` determines how many rows to include in each window. In our case
    this is how many tokens, but depending on the input data, it could be sentences,
    paragraphs, messages, etc.

    >>> contextualize(data).window(3).stride(1).text_col('token').to_df()
                    token  document_id
    0     The quick brown            1
    1     quick brown fox            1
    2    brown fox jumped            1
    3     fox jumped over            1
    4     jumped over the            1
    5       over the lazy            1
    6        the lazy dog            1
    7          lazy dog I            1
    8          dog I love            1
    9   I love sandwiches            2
    10    love sandwiches            2
    >>> contextualize(data).window(7).stride(1).min_window_size(7).text_col('token').to_df()
                                      token  document_id
    0   The quick brown fox jumped over the            1
    1  quick brown fox jumped over the lazy            1
    2    brown fox jumped over the lazy dog            1
    3        fox jumped over the lazy dog I            1
    4       jumped over the lazy dog I love            1
    5   over the lazy dog I love sandwiches            1

    ``stride`` determines how many rows to skip between each window start. This can
    be used to reduce the total number of windows generated.

    >>> contextualize(data).window(4).stride(2).text_col('token').to_df()
                        token  document_id
    0     The quick brown fox            1
    2   brown fox jumped over            1
    4    jumped over the lazy            1
    6          the lazy dog I            1
    8   dog I love sandwiches            1
    10        love sandwiches            2

    ``groupby`` determines how to group the rows. For example, we would like to have
    context windows that don't cross document boundaries. In this case, we can
    pass ``document_id`` as the group by.

    >>> contextualize(data).window(4).stride(2).text_col('token').groupby('document_id').to_df()
                       token  document_id
    0    The quick brown fox            1
    2  brown fox jumped over            1
    4   jumped over the lazy            1
    6           the lazy dog            1
    9      I love sandwiches            2

    ``min_window_size`` determines the minimum size of the context windows that are
    generated. This can be used to trim the last few context windows which have size
    less than ``min_window_size``. By default context windows of size 1 are skipped.

    >>> contextualize(data).window(6).stride(3).text_col('token').groupby('document_id').to_df()
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1
    6                     the lazy dog            1
    9                I love sandwiches            2

    >>> contextualize(data).window(6).stride(3).min_window_size(4).text_col('token').groupby('document_id').to_df()
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1

    """
    return Contextualizer(raw_df)


class Contextualizer:
    """Create context windows from a DataFrame. See [lancedb.context.contextualize][]."""

    def __init__(self, raw_df):
        self._text_col = None
        self._groupby = None
        self._stride = None
        self._window = None
        self._min_window_size = 2
        self._raw_df = raw_df

    def window(self, window: int) -> Contextualizer:
        """Set the window size. i.e., how many rows to include in each window.

        Parameters
        ----------
        window: int
            The window size.
        """
        self._window = window
        return self

    def stride(self, stride: int) -> Contextualizer:
        """Set the stride. i.e., how many rows to skip between each window.

        Parameters
        ----------
        stride: int
            The stride.
        """
        self._stride = stride
        return self

    def groupby(self, groupby: str) -> Contextualizer:
        """Set the groupby column. i.e., how to group the rows.
        Windows don't cross groups.

        Parameters
        ----------
        groupby: str
            The groupby column.
        """
        self._groupby = groupby
        return self

    def text_col(self, text_col: str) -> Contextualizer:
        """Set the text column used to make the context window.

        Parameters
        ----------
        text_col: str
            The text column.
        """
        self._text_col = text_col
        return self

    def min_window_size(self, min_window_size: int) -> Contextualizer:
        """Set the (optional) minimum size for the context window.

        Parameters
        ----------
        min_window_size: int
            The min_window_size.
        """
        self._min_window_size = min_window_size
        return self

    def to_df(self) -> pd.DataFrame:
        """Create the context windows and return a DataFrame."""

        if self._text_col not in self._raw_df.columns.tolist():
            raise MissingColumnError(self._text_col)

        if self._window is None or self._window < 1:
            raise MissingValueError(
                "The value of window is None or less than 1. Specify the "
                "window size (number of rows to include in each window)"
            )

        if self._stride is None or self._stride < 1:
            raise MissingValueError(
                "The value of stride is None or less than 1. Specify the "
                "stride (number of rows to skip between each window)"
            )

        def process_group(grp):
            # For each group, create the text rolling window
            # with values of size >= min_window_size
            text = grp[self._text_col].values
            contexts = grp.iloc[:: self._stride, :].copy()
            windows = [
                " ".join(text[start_i : min(start_i + self._window, len(grp))])
                for start_i in range(0, len(grp), self._stride)
                if start_i + self._window <= len(grp)
                or len(grp) - start_i >= self._min_window_size
            ]
            # if last few rows dropped
            if len(windows) < len(contexts):
                contexts = contexts.iloc[: len(windows)]
            contexts[self._text_col] = windows
            return contexts

        if self._groupby is None:
            return process_group(self._raw_df)
        # concat result from all groups
        return pd.concat(
            [process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]
        )

</file_content>
<file_context>
<line>
<line_number>12, 14, 15</line_number>
<line_content>from __future__ import annotations, 
import pandas as pd, 
from .exceptions import MissingValueError, MissingColumnError</line_content>
<context>
If annotations are stringized, you may need to manually evaluate them into Python values using eval(). The exact eval usage depends on the type of object. It's best to only evaluate strings when explicitly requested.</context>
</line>
<line>
<line_number>18, 19, 21</line_number>
<line_content>def contextualize(raw_df: pd.DataFrame) -> Contextualizer:, 
'''Create a Contextualizer object for the given DataFrame., 
Used to create context windows. Context windows are rolling subsets of text</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>24, 25, 26</line_number>
<line_content>The input text column should already be separated into rows that will be the, 
unit of the window. So to create a context window over tokens, start with, 
a DataFrame with one token per row. To create a context window over sentences,</line_content>
<context>
The contextvars module provides a Context class to represent context, a ContextVar class to represent context variables, and a Token class to represent a change in context.</context>
</line>
<line>
<line_number>27, 31, 32</line_number>
<line_content>start with a DataFrame with one sentence per row., 
>>> from lancedb.context import contextualize, 
>>> import pandas as pd</line_content>
<context>
'Hello {name}, your age is {age}'.format(name='Bob', age=25)

- Open a file for writing UTF-8 encoded text:

  f = open('data.txt', 'w', encoding='utf-8')

- Write a JSON string to a file:

  json.dump(data, f)

- Read in a file's contents:</context>
</line>
<line>
<line_number>33, 34, 35</line_number>
<line_content>>>> data = pd.DataFrame({, 
...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',, 
...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],</line_content>
<context>
'Hello {name}, your age is {age}'.format(name='Bob', age=25)

- Open a file for writing UTF-8 encoded text:

  f = open('data.txt', 'w', encoding='utf-8')

- Write a JSON string to a file:

  json.dump(data, f)

- Read in a file's contents:</context>
</line>
<line>
<line_number>36, 39, 40</line_number>
<line_content>...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2], 
``window`` determines how many rows to include in each window. In our case, 
this is how many tokens, but depending on the input data, it could be sentences,</line_content>
<context>
doctest supports finer grained control through:

- Example - Encapsulates a single Python statement and expected output.

- DocTest - Collects Examples extracted from a docstring.

- OutputChecker - Compares actual to expected output.</context>
</line>
<line>
<line_number>41, 43, 44</line_number>
<line_content>paragraphs, messages, etc., 
>>> contextualize(data).window(3).stride(1).text_col('token').to_df(), 
token  document_id</line_content>
<context>
The tokenize module provides functions for lexical tokenizing of Python source code. The main function is tokenize(), which takes a readline callable as input and returns a generator that yields 5-tuple tokens representing the token type, string</context>
</line>
<line>
<line_number>45, 46, 47</line_number>
<line_content>0     The quick brown            1, 
1     quick brown fox            1, 
2    brown fox jumped            1</line_content>
<context>
0.1 is represented internally as the binary fraction 3602879701896397 / 2 ** 55. This is close to but not exactly equal to 0.1.</context>
</line>
<line>
<line_number>48, 49, 50</line_number>
<line_content>3     fox jumped over            1, 
4     jumped over the            1, 
5       over the lazy            1</line_content>
<context>
- The turtle can be controlled through a functional or object-oriented interface. Using object-oriented style allows multiple turtles on one screen.</context>
</line>
<line>
<line_number>51, 52, 53</line_number>
<line_content>6        the lazy dog            1, 
7          lazy dog I            1, 
8          dog I love            1</line_content>
<context>
includes functions like count, cycle, repeat, chain, groupby, tee, and more. There are also recipes demonstrating common iteration patterns like iterating over permutations and combinations of data. The functools module provides higher-order</context>
</line>
<line>
<line_number>54, 55, 56</line_number>
<line_content>9   I love sandwiches            2, 
10    love sandwiches            2, 
>>> contextualize(data).window(7).stride(1).min_window_size(7).text_col('token').to_df()</line_content>
<context>
The tokenize module provides functions for lexical tokenizing of Python source code. The main function is tokenize(), which takes a readline callable as input and returns a generator that yields 5-tuple tokens representing the token type, string</context>
</line>
<line>
<line_number>57, 58, 59</line_number>
<line_content>token  document_id, 
0   The quick brown fox jumped over the            1, 
1  quick brown fox jumped over the lazy            1</line_content>
<context>
resets, hard-to-guess URLs, etc. The default token size is 32 bytes but you can specify the number of bytes.</context>
</line>
<line>
<line_number>60, 61, 62</line_number>
<line_content>2    brown fox jumped over the lazy dog            1, 
3        fox jumped over the lazy dog I            1, 
4       jumped over the lazy dog I love            1</line_content>
<context>
includes functions like count, cycle, repeat, chain, groupby, tee, and more. There are also recipes demonstrating common iteration patterns like iterating over permutations and combinations of data. The functools module provides higher-order</context>
</line>
<line>
<line_number>63, 65, 66</line_number>
<line_content>5   over the lazy dog I love sandwiches            1, 
``stride`` determines how many rows to skip between each window start. This can, 
be used to reduce the total number of windows generated.</line_content>
<context>
Notebook - Manages and displays child windows as tabs. Options like height, padding, tab state. Generates <<NotebookTabChanged>> event.

Progressbar - Shows status of long-running operation. Options: orient, length, mode, maximum, value.</context>
</line>
<line>
<line_number>68, 69, 70</line_number>
<line_content>>>> contextualize(data).window(4).stride(2).text_col('token').to_df(), 
token  document_id, 
0     The quick brown fox            1</line_content>
<context>
'Hello {name}, your age is {age}'.format(name='Bob', age=25)

- Open a file for writing UTF-8 encoded text:

  f = open('data.txt', 'w', encoding='utf-8')

- Write a JSON string to a file:

  json.dump(data, f)

- Read in a file's contents:</context>
</line>
<line>
<line_number>71, 72, 73</line_number>
<line_content>2   brown fox jumped over            1, 
4    jumped over the lazy            1, 
6          the lazy dog I            1</line_content>
<context>
- The turtle can be controlled through a functional or object-oriented interface. Using object-oriented style allows multiple turtles on one screen.</context>
</line>
<line>
<line_number>74, 75, 77</line_number>
<line_content>8   dog I love sandwiches            1, 
10        love sandwiches            2, 
``groupby`` determines how to group the rows. For example, we would like to have</line_content>
<context>
includes functions like count, cycle, repeat, chain, groupby, tee, and more. There are also recipes demonstrating common iteration patterns like iterating over permutations and combinations of data. The functools module provides higher-order</context>
</line>
<line>
<line_number>78, 79, 81</line_number>
<line_content>context windows that don't cross document boundaries. In this case, we can, 
pass ``document_id`` as the group by., 
>>> contextualize(data).window(4).stride(2).text_col('token').groupby('document_id').to_df()</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>82, 83, 84</line_number>
<line_content>token  document_id, 
0    The quick brown fox            1, 
2  brown fox jumped over            1</line_content>
<context>
resets, hard-to-guess URLs, etc. The default token size is 32 bytes but you can specify the number of bytes.</context>
</line>
<line>
<line_number>85, 86, 87</line_number>
<line_content>4   jumped over the lazy            1, 
6           the lazy dog            1, 
9      I love sandwiches            2</line_content>
<context>
singledispatch converts a function into a generic function that can have overloaded implementations registered for different types. It dispatches on the type of the first argument.</context>
</line>
<line>
<line_number>89, 90, 91</line_number>
<line_content>``min_window_size`` determines the minimum size of the context windows that are generated, 
This can be used to trim the last few context windows which have size less than, 
``min_window_size``. By default context windows of size 1 are skipped.</line_content>
<context>
The module provides the BasicContext and ExtendedContext standard contexts. New contexts can be created via the Context constructor to control parameters like precision, rounding, flags, and traps. Each thread has its own current context.</context>
</line>
<line>
<line_number>93, 94, 95</line_number>
<line_content>>>> contextualize(data).window(6).stride(3).text_col('token').groupby('document_id').to_df(), 
token  document_id, 
0  The quick brown fox jumped over            1</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>96, 97, 98</line_number>
<line_content>3     fox jumped over the lazy dog            1, 
6                     the lazy dog            1, 
9                I love sandwiches            2</line_content>
<context>
The MultiCall class allows batching multiple calls to a remote server in one request for better performance. The convenience functions dumps and loads convert between Python objects and XML-RPC encoded data.</context>
</line>
<line>
<line_number>100, 101, 102</line_number>
<line_content>>>> contextualize(data).window(6).stride(3).min_window_size(4).text_col('token').groupby('document_id').to_df(), 
token  document_id, 
0  The quick brown fox jumped over            1</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>103, 106, 109</line_number>
<line_content>3     fox jumped over the lazy dog            1, 
return Contextualizer(raw_df), 
class Contextualizer:</line_content>
<context>
subclasses with appropriate encoding/decoding behaviors.</context>
</line>
<line>
<line_number>110, 112, 113</line_number>
<line_content>'''Create context windows from a DataFrame. See [lancedb.context.contextualize][].''', 
def __init__(self, raw_df):, 
self._text_col = None</line_content>
<context>
The Context class allows creating and managing context objects. The PyContext_New function creates a new empty context. PyContext_Copy makes a shallow copy of a context. PyContext_Enter and PyContext_Exit allow pushing and popping contexts.</context>
</line>
<line>
<line_number>114, 115, 116</line_number>
<line_content>self._groupby = None, 
self._stride = None, 
self._window = None</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>117, 118, 120</line_number>
<line_content>self._min_window_size = 2, 
self._raw_df = raw_df, 
def window(self, window: int) -> Contextualizer:</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>121, 123, 124</line_number>
<line_content>'''Set the window size. i.e., how many rows to include in each window., 
Parameters, 
----------</line_content>
<context>
displays tabular data, and the PanedWindow allows manipulating widget sizes interactively.</context>
</line>
<line>
<line_number>125, 126, 128</line_number>
<line_content>window: int, 
The window size., 
self._window = window</line_content>
<context>
PyTuple_SetItem and PyTuple_SET_ITEM set an item at a given index in a tuple. _PyTuple_Resize resizes a tuple.</context>
</line>
<line>
<line_number>129, 131, 132</line_number>
<line_content>return self, 
def stride(self, stride: int) -> Contextualizer:, 
'''Set the stride. i.e., how many rows to skip between each window.</line_content>
<context>
There are functions like PyBuffer_FromContiguous and PyBuffer_ToContiguous to copy data between contiguous buffers. PyBuffer_FillContiguousStrides fills in stride values for contiguous arrays. PyBuffer_FillInfo handles requests for buffer access on</context>
</line>
<line>
<line_number>134, 135, 136</line_number>
<line_content>Parameters, 
----------, 
stride: int</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>137, 139, 140</line_number>
<line_content>The stride., 
self._stride = stride, 
return self</line_content>
<context>
"self" is used explicitly in Python method definitions and calls due to influences from Modula-3. It makes the use of instance variables and methods obvious and resolves syntactic issues with assignment. The explicit nature also allows calling</context>
</line>
<line>
<line_number>142, 143, 144</line_number>
<line_content>def groupby(self, groupby: str) -> Contextualizer:, 
'''Set the groupby column. i.e., how to group the rows., 
Windows don't cross groups</line_content>
<context>
The python language uses indentation instead of braces for grouping statements because it is more elegant and improves code clarity. Indentation-based syntax avoids ambiguities that can occur in other languages and reduces conflicts over coding style</context>
</line>
<line>
<line_number>146, 147, 148</line_number>
<line_content>Parameters, 
----------, 
groupby: str</line_content>
<context>
- grp.getgrall() - Returns a list of all available group entries.</context>
</line>
<line>
<line_number>149, 151, 152</line_number>
<line_content>The groupby column., 
self._groupby = groupby, 
return self</line_content>
<context>
- grp.getgrall() - Returns a list of all available group entries.</context>
</line>
<line>
<line_number>154, 155, 157</line_number>
<line_content>def text_col(self, text_col: str) -> Contextualizer:, 
'''Set the text column used to make the context window., 
Parameters</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>158, 159, 160</line_number>
<line_content>----------, 
text_col: str, 
The text column.</line_content>
<context>
The print() function and f-strings provide basic ways to output values and formatted strings in Python. For more advanced formatting, the str.format() method and manual string formatting techniques like str.ljust() allow padding and precise control</context>
</line>
<line>
<line_number>162, 163, 165</line_number>
<line_content>self._text_col = text_col, 
return self, 
def min_window_size(self, min_window_size: int) -> Contextualizer:</line_content>
<context>
So in summary, the tkinter.font module lets you create and configure named Font instances with properties like family and size, measure text rendered in the font, and get available font names and families in Tkinter.</context>
</line>
<line>
<line_number>166, 168, 169</line_number>
<line_content>'''Set the (optional) min_window_size size for the context window., 
Parameters, 
----------</line_content>
<context>
- tcgetwinsize(): Gets the terminal window size.

- tcsetwinsize(): Sets the terminal window size.

You pass a file descriptor like sys.stdin.fileno() to specify the terminal.</context>
</line>
<line>
<line_number>170, 171, 173</line_number>
<line_content>min_window_size: int, 
The min_window_size., 
self._min_window_size = min_window_size</line_content>
<context>
PyFloat_GetInfo returns a structseq with info on float precision, max, and min. PyFloat_GetMax returns the max float DBL_MAX. PyFloat_GetMin returns the min float DBL_MIN.</context>
</line>
<line>
<line_number>174, 176, 177</line_number>
<line_content>return self, 
def to_df(self) -> pd.DataFrame:, 
'''Create the context windows and return a DataFrame.'''</line_content>
<context>
The Context class allows creating and managing context objects. The PyContext_New function creates a new empty context. PyContext_Copy makes a shallow copy of a context. PyContext_Enter and PyContext_Exit allow pushing and popping contexts.</context>
</line>
<line>
<line_number>179, 180, 182</line_number>
<line_content>if self._text_col not in self._raw_df.columns.tolist():, 
raise MissingColumnError(self._text_col), 
if self._window is None or self._window < 1:</line_content>
<context>
Python 2.6 includes many new features and improvements. The xrange() function is now an iterator instead of returning a list. There is a new syntax for catching exceptions - "except TypeError as exc". The with statement no longer needs to be imported</context>
</line>
<line>
<line_number>183, 184, 185</line_number>
<line_content>raise MissingValueError(, 
'The value of window is None or less than 1. Specify the ', 
'window size (number of rows to include in each window)'</line_content>
<context>
Exceptions

The StatisticsError exception is raised for errors like empty inputs.

Normal Distribution</context>
</line>
<line>
<line_number>188, 189, 190</line_number>
<line_content>if self._stride is None or self._stride < 1:, 
raise MissingValueError(, 
'The value of stride is None or less than 1. Specify the '</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>191, 194, 195</line_number>
<line_content>'stride (number of rows to skip between each window)', 
def process_group(grp):, 
# For each group, create the text rolling window</line_content>
<context>
- grp.getgrall() - Returns a list of all available group entries.</context>
</line>
<line>
<line_number>196, 197, 198</line_number>
<line_content># with values of size >= min_window_size, 
text = grp[self._text_col].values, 
contexts = grp.iloc[:: self._stride, :].copy()</line_content>
<context>
The Python curses module provides an interface to the underlying C library. It simplifies some functions like addstr() which handles displaying text at the current position or a specified coordinate. Windows represent rectangular areas and support</context>
</line>
<line>
<line_number>199, 200, 201</line_number>
<line_content>windows = [, 
' '.join(text[start_i : min(start_i + self._window, len(grp))]), 
for start_i in range(0, len(grp), self._stride)</line_content>
<context>
The built-in functions in Python provide basic functionality that is always available. Some common ones include:

- print() - prints objects to the text stream 

- len() - returns length of an object

- range() - generates a sequence of numbers</context>
</line>
<line>
<line_number>202, 203, 205</line_number>
<line_content>if start_i + self._window <= len(grp), 
or len(grp) - start_i >= self._min_window_size, 
# if last few rows dropped</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>206, 207, 208</line_number>
<line_content>if len(windows) < len(contexts):, 
contexts = contexts.iloc[: len(windows)], 
contexts[self._text_col] = windows</line_content>
<context>
The syntax of the with statement was expanded to allow multiple context managers in a single statement, removing the need for the contextlib.nested() function. Also, the new sys.version_info named tuple provides information about the current Python</context>
</line>
<line>
<line_number>209, 211, 212</line_number>
<line_content>return contexts, 
if self._groupby is None:, 
return process_group(self._raw_df)</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>213, 214, 215</line_number>
<line_content># concat result from all groups, 
return pd.concat(, 
[process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]</line_content>
<context>
- grp.getgrall() - Returns a list of all available group entries.
</context>
</line>
</file_context>
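The window, stride, and min_window_size settings summarized above can be illustrated with a small pure-Python sketch. It mirrors the list comprehension shown in `process_group`, but it operates on a plain list of strings rather than a DataFrame. The function name `rolling_text_windows` and the sample rows are hypothetical and are not part of the library.

```python
def rolling_text_windows(rows, window, stride, min_window_size=2):
    """Join consecutive rows into overlapping text windows.

    Hypothetical helper: a window starting at `start` is kept if it is
    full-sized, or if at least `min_window_size` rows remain.
    """
    n = len(rows)
    return [
        " ".join(rows[start:min(start + window, n)])
        for start in range(0, n, stride)
        if start + window <= n or n - start >= min_window_size
    ]


# Five rows, windows of three rows, advancing two rows at a time.
sample_rows = ["a", "b", "c", "d", "e"]
print(rolling_text_windows(sample_rows, window=3, stride=2))
# The trailing one-row window is kept only when min_window_size allows it.
print(rolling_text_windows(sample_rows, window=3, stride=2, min_window_size=1))
```

With the default minimum of 2, the last start position (a single remaining row) is dropped, which corresponds to the "if last few rows dropped" branch that trims `contexts` to the number of windows.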
</file>
<file>
<file_path>python/lancedb/db.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from __future__ import annotations

import os
from pathlib import Path

import pyarrow as pa
from pyarrow import fs

from .common import DATA, URI
from .table import LanceTable
from .util import get_uri_scheme, get_uri_location


class LanceDBConnection:
    """
    A connection to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The root uri of the database.

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2},
    ...                                   {"vector": [0.5, 1.3], "b": 4}])
    LanceTable(my_table)
    >>> db.create_table("another_table", data=[{"vector": [0.4, 0.4], "b": 6}])
    LanceTable(another_table)
    >>> db.table_names()
    ['another_table', 'my_table']
    >>> len(db)
    2
    >>> db["my_table"]
    LanceTable(my_table)
    >>> "my_table" in db
    True
    >>> db.drop_table("my_table")
    >>> db.drop_table("another_table")
    """

    def __init__(self, uri: URI):
        is_local = isinstance(uri, Path) or get_uri_scheme(uri) == "file"
        if is_local:
            if isinstance(uri, str):
                uri = Path(uri)
            uri = uri.expanduser().absolute()
            Path(uri).mkdir(parents=True, exist_ok=True)
        self._uri = str(uri)

    @property
    def uri(self) -> str:
        return self._uri

    def table_names(self) -> list[str]:
        """Get the names of all tables in the database.

        Returns
        -------
        list of str
            A list of table names.
        """
        try:
            filesystem, path = fs.FileSystem.from_uri(self.uri)
        except pa.ArrowInvalid as err:
            raise NotImplementedError("Unsupported scheme: " + self.uri) from err

        try:
            paths = filesystem.get_file_info(
                fs.FileSelector(get_uri_location(self.uri))
            )
        except FileNotFoundError:
            # It is ok if the file does not exist since it will be created
            paths = []
        tables = [
            os.path.splitext(file_info.base_name)[0]
            for file_info in paths
            if file_info.extension == "lance"
        ]
        return tables

    def __len__(self) -> int:
        return len(self.table_names())

    def __contains__(self, name: str) -> bool:
        return name in self.table_names()

    def __getitem__(self, name: str) -> LanceTable:
        return self.open_table(name)

    def create_table(
        self,
        name: str,
        data: DATA = None,
        schema: pa.Schema = None,
        mode: str = "create",
    ) -> LanceTable:
        """Create a table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: list, tuple, dict, pd.DataFrame; optional
            The data to insert into the table.
        schema: pyarrow.Schema; optional
            The schema of the table.
        mode: str; default "create"
            The mode to use when creating the table.
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".

        Note
        ----
        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

        Returns
        -------
        LanceTable
            A reference to the newly created table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        >>> db.create_table("my_table", data)
        LanceTable(my_table)
        >>> db["my_table"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> db.create_table("table2", data)
        LanceTable(table2)
        >>> db["table2"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a PyArrow table directly.

        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> db.create_table("table3", data, schema = custom_schema)
        LanceTable(table3)
        >>> db["table3"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]
        """
        if data is not None:
            tbl = LanceTable.create(self, name, data, schema, mode=mode)
        else:
            tbl = LanceTable(self, name)
        return tbl

    def open_table(self, name: str) -> LanceTable:
        """Open a table in the database.

        Parameters
        ----------
        name: str
            The name of the table.

        Returns
        -------
        A LanceTable object representing the table.
        """
        return LanceTable(self, name)

    def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        filesystem, path = fs.FileSystem.from_uri(self.uri)
        table_path = os.path.join(path, name + ".lance")
        filesystem.delete_dir(table_path)

</file_content>
<file_context>
<line>
<line_number>13, 16, 19</line_number>
<line_content>from __future__ import annotations, 
from pathlib import Path, 
import pyarrow as pa</line_content>
<context>
The pkgutil module provides utilities for Python's import system, especially package support. The extend_path function extends the search path for modules in a package. The ImpImporter and ImpLoader classes provide backwards compatibility wrappers</context>
</line>
<line>
<line_number>20, 22, 23</line_number>
<line_content>from pyarrow import fs, 
from .common import DATA, URI, 
from .table import LanceTable</line_content>
<context>
The urllib.request module provides functions and classes for fetching URLs and making HTTP requests in python. Some key points:</context>
</line>
<line>
<line_number>24, 27, 29</line_number>
<line_content>from .util import get_uri_scheme, get_uri_location, 
class LanceDBConnection:, 
A connection to a LanceDB database.</line_content>
<context>
The urllib.request module provides functions and classes for fetching URLs and making HTTP requests in python. Some key points:</context>
</line>
<line>
<line_number>31, 32, 33</line_number>
<line_content>Parameters, 
----------, 
uri: str or Path</line_content>
<context>
- urlunparse() - Puts a parsed URL back together into a complete URL string. This is the inverse of urlparse(). 

- urlsplit() - Similar to urlparse() but doesn't split params and query.</context>
</line>
<line>
<line_number>34, 38, 39</line_number>
<line_content>The root uri of the database., 
>>> import lancedb, 
>>> db = lancedb.connect('./.lancedb')</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>40, 41, 42</line_number>
<line_content>>>> db.create_table('my_table', data=[{'vector': [1.1, 1.2], 'b': 2},, 
...                                   {'vector': [0.5, 1.3], 'b': 4}]), 
LanceTable(my_table)</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>43, 44, 45</line_number>
<line_content>>>> db.create_table('another_table', data=[{'vector': [0.4, 0.4], 'b': 6}]), 
LanceTable(another_table), 
>>> db.table_names()</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>46, 47, 49</line_number>
<line_content>['another_table', 'my_table'], 
>>> len(db), 
>>> db['my_table']</line_content>
<context>
The len() function returns the length of an object. For strings, this is the number of characters. For lists, tuples, dicts, and sets, it is the number of items.</context>
</line>
<line>
<line_number>50, 51, 53</line_number>
<line_content>LanceTable(my_table), 
>>> 'my_table' in db, 
>>> db.drop_table('my_table')</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>54, 57, 58</line_number>
<line_content>>>> db.drop_table('another_table'), 
def __init__(self, uri: URI):, 
is_local = isinstance(uri, Path) or get_uri_scheme(uri) == 'file'</line_content>
<context>
use methods on the traversable to open or read resources. as_file() provides a file path object.</context>
</line>
<line>
<line_number>59, 60, 61</line_number>
<line_content>if is_local:, 
if isinstance(uri, str):, 
uri = Path(uri)</line_content>
<context>
resolving relative paths, and checking properties like whether a path is absolute, a file, or directory. The pathlib API helps avoid a lot of errors compared to using string operations to handle paths.</context>
</line>
<line>
<line_number>62, 63, 64</line_number>
<line_content>uri = uri.expanduser().absolute(), 
Path(uri).mkdir(parents=True, exist_ok=True), 
self._uri = str(uri)</line_content>
<context>
The PYTHONPATH environment variable can add more directories to the search path.</context>
</line>
<line>
<line_number>67, 68, 70</line_number>
<line_content>def uri(self) -> str:, 
return self._uri, 
def table_names(self) -> list[str]:</line_content>
<context>
dictionaries. It provides fundamental building blocks for working with URLs in Python.</context>
</line>
<line>
<line_number>71, 75, 76</line_number>
<line_content>'''Get the names of all tables in the database., 
list of str, 
A list of table names.</line_content>
<context>
getpwall returns a list of all available password database entries.</context>
</line>
<line>
<line_number>79, 80, 81</line_number>
<line_content>filesystem, path = fs.FileSystem.from_uri(self.uri), 
except pa.ArrowInvalid:, 
raise NotImplementedError('Unsupported scheme: ' + self.uri)</line_content>
<context>
use methods on the traversable to open or read resources. as_file() provides a file path object.</context>
</line>
<line>
<line_number>84, 85, 87</line_number>
<line_content>paths = filesystem.get_file_info(, 
fs.FileSelector(get_uri_location(self.uri)), 
except FileNotFoundError:</line_content>
<context>
The PyOS_FSPath function returns the filesystem path representation for a given path object. It handles str, bytes, and PathLike objects. 

Py_FdIsInteractive checks if a file is interactive based on its file descriptor.</context>
</line>
<line>
<line_number>88, 89, 90</line_number>
<line_content># It is ok if the file does not exist since it will be created, 
paths = [], 
tables = [</line_content>
<context>
in future Python versions. The new API with files() and traversables is recommended instead.</context>
</line>
<line>
<line_number>91, 92, 93</line_number>
<line_content>os.path.splitext(file_info.base_name)[0], 
for file_info in paths, 
if file_info.extension == 'lance'</line_content>
<context>
like size and modification time, and splitting paths into parts like the directory, base filename, and extension. os.path works with strings rather than objects.</context>
</line>
<line>
<line_number>95, 97, 98</line_number>
<line_content>return tables, 
def __len__(self) -> int:, 
return len(self.table_names())</line_content>
<context>
The len() function returns the length of an object. For strings, this is the number of characters. For lists, tuples, dicts, and sets, it is the number of items.</context>
</line>
<line>
<line_number>100, 101, 103</line_number>
<line_content>def __contains__(self, name: str) -> bool:, 
return name in self.table_names(), 
def __getitem__(self, name: str) -> LanceTable:</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>104, 106, 108</line_number>
<line_content>return self.open_table(name), 
def create_table(, 
name: str,</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>109, 110, 111</line_number>
<line_content>data: DATA = None,, 
schema: pa.Schema = None,, 
mode: str = 'create',</line_content>
<context>
__get__(), __set__(), and __delete__() instead of the default attribute lookup behavior.</context>
</line>
<line>
<line_number>112, 113, 115</line_number>
<line_content>) -> LanceTable:, 
'''Create a table in the database., 
Parameters</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>116, 118, 119</line_number>
<line_content>----------, 
The name of the table., 
data: list, tuple, dict, pd.DataFrame; optional</line_content>
<context>
The collections module provides specialized container datatypes that provide alternatives to Python's general built-in containers like dict, list, set, and tuple.</context>
</line>
<line>
<line_number>120, 121, 122</line_number>
<line_content>The data to insert into the table., 
schema: pyarrow.Schema; optional, 
The schema of the table.</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>123, 124, 125</line_number>
<line_content>mode: str; default 'create', 
The mode to use when creating the table., 
By default, if the table already exists, an exception is raised.</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>126, 130, 131</line_number>
<line_content>If you want to overwrite the table, use mode='overwrite'., 
The vector index won't be created by default., 
To create the index, call the `create_index` method on the table.</line_content>
<context>
write() writes the config to a file object, remove_section() and remove_option() remove sections and options.</context>
</line>
<line>
<line_number>135, 136, 141</line_number>
<line_content>LanceTable, 
A reference to the newly created table., 
Can create with list of tuples or dictionaries:</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>143, 144, 145</line_number>
<line_content>>>> import lancedb, 
>>> db = lancedb.connect('./.lancedb'), 
>>> data = [{'vector': [1.1, 1.2], 'lat': 45.5, 'long': -122.7},</line_content>
<context>
'Hello {name}, your age is {age}'.format(name='Bob', age=25)

- Open a file for writing UTF-8 encoded text:

  f = open('data.txt', 'w', encoding='utf-8')

- Write a JSON string to a file:

  json.dump(data, f)

- Read in a file's contents:</context>
</line>
<line>
<line_number>146, 147, 148</line_number>
<line_content>...         {'vector': [0.2, 1.8], 'lat': 40.1, 'long':  -74.1}], 
>>> db.create_table('my_table', data), 
LanceTable(my_table)</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>149, 150, 151</line_number>
<line_content>>>> db['my_table'].head(), 
pyarrow.Table, 
vector: fixed_size_list<item: float>[2]</line_content>
<context>
PyTuple_SetItem and PyTuple_SET_ITEM set an item at a given index in a tuple. _PyTuple_Resize resizes a tuple.</context>
</line>
<line>
<line_number>152, 153, 154</line_number>
<line_content>child 0, item: float, 
lat: double, 
long: double</line_content>
<context>
slicing syntax like L[1:10:2] is now supported by built-in types like lists, tuples, and strings, allowing more powerful slicing operations.</context>
</line>
<line>
<line_number>156, 157, 158</line_number>
<line_content>vector: [[[1.1,1.2],[0.2,1.8]]], 
lat: [[45.5,40.1]], 
long: [[-122.7,-74.1]]</line_content>
<context>
The vectorcall arguments are an array of positional args, the number of args, and a tuple of keyword argument names. There are some specialized call functions like PyObject_CallOneArg() and PyObject_CallNoArgs() for efficiency.</context>
</line>
<line>
<line_number>160, 162, 163</line_number>
<line_content>You can also pass a pandas DataFrame:, 
>>> import pandas as pd, 
>>> data = pd.DataFrame({</line_content>
<context>
'Hello {name}, your age is {age}'.format(name='Bob', age=25)

- Open a file for writing UTF-8 encoded text:

  f = open('data.txt', 'w', encoding='utf-8')

- Write a JSON string to a file:

  json.dump(data, f)

- Read in a file's contents:</context>
</line>
<line>
<line_number>164, 165, 166</line_number>
<line_content>...    'vector': [[1.1, 1.2], [0.2, 1.8]],, 
...    'lat': [45.5, 40.1],, 
...    'long': [-122.7, -74.1]</line_content>
<context>
The vectorcall arguments are an array of positional args, the number of args, and a tuple of keyword argument names. There are some specialized call functions like PyObject_CallOneArg() and PyObject_CallNoArgs() for efficiency.</context>
</line>
<line>
<line_number>168, 169, 170</line_number>
<line_content>>>> db.create_table('table2', data), 
LanceTable(table2), 
>>> db['table2'].head()</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>171, 172, 173</line_number>
<line_content>pyarrow.Table, 
vector: fixed_size_list<item: float>[2], 
child 0, item: float</line_content>
<context>
PyTuple_SetItem and PyTuple_SET_ITEM set an item at a given index in a tuple. _PyTuple_Resize resizes a tuple.</context>
</line>
<line>
<line_number>174, 175, 177</line_number>
<line_content>lat: double, 
long: double, 
vector: [[[1.1,1.2],[0.2,1.8]]]</line_content>
<context>
The vectorcall arguments are an array of positional args, the number of args, and a tuple of keyword argument names. There are some specialized call functions like PyObject_CallOneArg() and PyObject_CallNoArgs() for efficiency.</context>
</line>
<line>
<line_number>178, 179, 181</line_number>
<line_content>lat: [[45.5,40.1]], 
long: [[-122.7,-74.1]], 
Data is converted to Arrow before being written to disk. For maximum</line_content>
<context>
the current position, read data from the chunk, and skip to the end.</context>
</line>
<line>
<line_number>182, 183, 185</line_number>
<line_content>control over how data is saved, either provide the PyArrow schema to, 
convert to or else provide a PyArrow table directly., 
>>> custom_schema = pa.schema([</line_content>
<context>
It is recommended to use PyObject_GetBuffer() (or the "y*" or "w*" format codes with PyArg_ParseTuple()) to get a buffer view over an object, and PyBuffer_Release() when the view can be released instead of the Old Buffer Protocol.</context>
</line>
<line>
<line_number>186, 187, 188</line_number>
<line_content>...   pa.field('vector', pa.list_(pa.float32(), 2)),, 
...   pa.field('lat', pa.float32()),, 
...   pa.field('long', pa.float32())</line_content>
<context>
There are also functions to convert between numeric types, like PyNumber_Long to convert to a long integer and PyNumber_Float to convert to a float. PyNumber_Index converts to an integer and raises an exception on failure.</context>
</line>
<line>
<line_number>190, 191, 192</line_number>
<line_content>>>> db.create_table('table3', data, schema = custom_schema), 
LanceTable(table3), 
>>> db['table3'].head()</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>193, 194, 195</line_number>
<line_content>pyarrow.Table, 
vector: fixed_size_list<item: float>[2], 
child 0, item: float</line_content>
<context>
PyTuple_SetItem and PyTuple_SET_ITEM set an item at a given index in a tuple. _PyTuple_Resize resizes a tuple.</context>
</line>
<line>
<line_number>196, 197, 199</line_number>
<line_content>lat: float, 
long: float, 
vector: [[[1.1,1.2],[0.2,1.8]]]</line_content>
<context>
Floating point numbers are represented in Python and most other languages as binary fractions. This can cause decimal fractions like 0.1 to be approximated when stored as floating point values. For example, 0.1 is represented internally as the binary</context>
</line>
<line>
<line_number>200, 201, 203</line_number>
<line_content>lat: [[45.5,40.1]], 
long: [[-122.7,-74.1]], 
if data is not None:</line_content>
<context>
the current position, read data from the chunk, and skip to the end.</context>
</line>
<line>
<line_number>204, 206, 207</line_number>
<line_content>tbl = LanceTable.create(self, name, data, schema, mode=mode), 
tbl = LanceTable(self, name), 
return tbl</line_content>
<context>
The shelve module provides a persistent dictionary-like object that can store arbitrary python objects. A shelf acts like a dictionary, but the values are written to a database file to allow them to persist after the python process ends.</context>
</line>
<line>
<line_number>209, 210, 212</line_number>
<line_content>def open_table(self, name: str) -> LanceTable:, 
'''Open a table in the database., 
Parameters</line_content>
<context>
opening a shelf directly from a filename without supplying a dict-like database object.</context>
</line>
<line>
<line_number>213, 215, 219</line_number>
<line_content>----------, 
The name of the table., 
A LanceTable object representing the table.</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>221, 223, 224</line_number>
<line_content>return LanceTable(self, name), 
def drop_table(self, name: str):, 
'''Drop a table from the database.</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>226, 227, 229</line_number>
<line_content>Parameters, 
----------, 
The name of the table.</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>231, 232, 233</line_number>
<line_content>filesystem, path = pa.fs.FileSystem.from_uri(self.uri), 
table_path = os.path.join(path, name + '.lance'), 
filesystem.delete_dir(table_path)</line_content>
<context>
Methods like Path.exists(), Path.is_dir(), Path.is_file(), Path.open() allow querying properties of a filesystem path and interacting with the filesystem. Path.rmdir(), Path.unlink(), Path.rename(), and Path.replace() perform system calls to remove,
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/embeddings.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import math
import sys
from typing import Callable, Union

import numpy as np
import pandas as pd
import pyarrow as pa
from lance.vector import vec_to_table
from retry import retry


def with_embeddings(
    func: Callable,
    data: Union[pa.Table, pd.DataFrame],
    column: str = "text",
    wrap_api: bool = True,
    show_progress: bool = False,
    batch_size: int = 1000,
) -> pa.Table:
    """Add a vector column to a table using the given embedding function.

    The new column will be called "vector".

    Parameters
    ----------
    func : Callable
        A function that takes a list of strings and returns a list of vectors.
    data : pa.Table or pd.DataFrame
        The data to add an embedding column to.
    column : str, default "text"
        The name of the column to use as input to the embedding function.
    wrap_api : bool, default True
        Whether to wrap the embedding function in a retry and rate limiter.
    show_progress : bool, default False
        Whether to show a progress bar.
    batch_size : int, default 1000
        The number of row values to pass to each call of the embedding function.

    Returns
    -------
    pa.Table
        The input table with a new column called "vector" containing the embeddings.
    """
    func = EmbeddingFunction(func)
    if wrap_api:
        func = func.retry().rate_limit()
    func = func.batch_size(batch_size)
    if show_progress:
        func = func.show_progress()
    if isinstance(data, pd.DataFrame):
        data = pa.Table.from_pandas(data, preserve_index=False)
    embeddings = func(data[column].to_numpy())
    table = vec_to_table(np.array(embeddings))
    return data.append_column("vector", table["vector"])


class EmbeddingFunction:
    def __init__(self, func: Callable):
        self.func = func
        self.rate_limiter_kwargs = {}
        self.retry_kwargs = {}
        self._batch_size = None
        self._progress = False

    def __call__(self, text):
        # Get the embedding with retry
        if len(self.retry_kwargs) > 0:

            @retry(**self.retry_kwargs)
            def embed_func(c):
                return self.func(c.tolist())

        else:

            def embed_func(c):
                return self.func(c.tolist())

        if len(self.rate_limiter_kwargs) > 0:
            v = int(sys.version_info.minor)
            if v >= 11:
                print(
                    "WARNING: rate limiter only supports Python up to 3.10, proceeding without rate limiter"
                )
            else:
                import ratelimiter

                max_calls = self.rate_limiter_kwargs["max_calls"]
                limiter = ratelimiter.RateLimiter(
                    max_calls, period=self.rate_limiter_kwargs["period"]
                )
                embed_func = limiter(embed_func)
        batches = self.to_batches(text)
        embeds = [emb for c in batches for emb in embed_func(c)]
        return embeds

    def __repr__(self):
        return f"EmbeddingFunction(func={self.func})"

    def rate_limit(self, max_calls=0.9, period=1.0):
        self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)
        return self

    def retry(self, tries=10, delay=1, max_delay=30, backoff=3, jitter=1):
        self.retry_kwargs = dict(
            tries=tries,
            delay=delay,
            max_delay=max_delay,
            backoff=backoff,
            jitter=jitter,
        )
        return self

    def batch_size(self, batch_size):
        self._batch_size = batch_size
        return self

    def show_progress(self):
        self._progress = True
        return self

    def to_batches(self, arr):
        length = len(arr)

        def _chunker(arr):
            for start_i in range(0, len(arr), self._batch_size):
                yield arr[start_i : start_i + self._batch_size]

        if self._progress:
            from tqdm.auto import tqdm

            yield from tqdm(_chunker(arr), total=math.ceil(length / self._batch_size))
        else:
            yield from _chunker(arr)

</file_content>
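The batching performed by `EmbeddingFunction.to_batches` and `__call__` above can be illustrated with a minimal standalone sketch. It uses plain lists instead of NumPy arrays, and `chunk`, `embed_in_batches`, and `fake_embed` are hypothetical names introduced only for this illustration; they are not part of lancedb.

```python
import math
from typing import Callable, Iterator, List, Sequence


def chunk(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices holding at most batch_size items each."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    func: Callable[[List[str]], List[List[float]]],
    items: Sequence[str],
    batch_size: int,
) -> List[List[float]]:
    """Call func once per batch and flatten the per-batch results."""
    return [vec for batch in chunk(items, batch_size) for vec in func(list(batch))]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real embedding API: [length, vowel count]."""
    return [[float(len(t)), float(sum(ch in "aeiou" for ch in t))] for t in texts]


texts = ["alpha", "beta", "gamma", "delta", "epsilon"]
vectors = embed_in_batches(fake_embed, texts, batch_size=2)

# The same formula is used above to size the progress bar.
num_batches = math.ceil(len(texts) / 2)
```

The last batch is simply shorter when the input length is not a multiple of the batch size, so one vector is still produced per input row.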
<file_context>
<line>
<line_number>13, 14, 15</line_number>
<line_content>import math, 
import sys, 
from typing import Callable, Union</line_content>
<context>
- Allows calling functions with these data types and converting between Python objects and C data types automatically. Supports C calling conventions like cdecl and stdcall.</context>
</line>
<line>
<line_number>17, 18, 19</line_number>
<line_content>import numpy as np, 
import pandas as pd, 
import pyarrow as pa</line_content>
<context>
Python modules without needing to run setup.py install.</context>
</line>
<line>
<line_number>20, 21, 24</line_number>
<line_content>from lance.vector import vec_to_table, 
from retry import retry, 
def with_embeddings(</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>25, 26, 27</line_number>
<line_content>func: Callable,, 
data: Union[pa.Table, pd.DataFrame],, 
column: str = 'text',</line_content>
<context>
The print() function and f-strings provide basic ways to output values and formatted strings in Python. For more advanced formatting, the str.format() method and manual string formatting techniques like str.ljust() allow padding and precise control</context>
</line>
<line>
<line_number>28, 29, 30</line_number>
<line_content>wrap_api: bool = True,, 
show_progress: bool = False,, 
batch_size: int = 1000,</line_content>
<context>
- run_until_complete and run_forever to run the event loop until a Future completes or the loop is stopped. stop can be used to stop the loop. 

- call_soon and call_later to schedule callbacks. call_later schedules a callback after a delay.</context>
</line>
<line>
<line_number>31, 32, 34</line_number>
<line_content>) -> pa.Table:, 
'''Add a vector column to a table using the given embedding function., 
The new column will be called 'vector'.</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>36, 37, 38</line_number>
<line_content>Parameters, 
----------, 
func : Callable</line_content>
<context>
Function call semantics assign values to all parameters, from positional arguments, keyword arguments, or defaults. * and ** can receive excess positional or keyword parameters.</context>
</line>
<line>
<line_number>39, 40, 41</line_number>
<line_content>A function that takes a list of strings and returns a list of vectors., 
data : pa.Table or pd.DataFrame, 
The data to add an embedding column to.</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>42, 43, 44</line_number>
<line_content>column : str, default 'text', 
The name of the column to use as input to the embedding function., 
wrap_api : bool, default True</line_content>
<context>
wraps calls update_wrapper as a convenience decorator factory. It ensures wrapper functions have names, docstrings etc. reflecting the wrapped function.</context>
</line>
<line>
<line_number>45, 46, 47</line_number>
<line_content>Whether to wrap the embedding function in a retry and rate limiter., 
show_progress : bool, default False, 
Whether to show a progress bar.</line_content>
<context>
The key functions include:

- print_tb() - Prints a traceback object to a file. Can limit the number of stack trace entries printed.</context>
</line>
<line>
<line_number>48, 49, 54</line_number>
<line_content>batch_size : int, default 1000, 
The number of row values to pass to each call of the embedding function., 
The input table with a new column called 'vector' containing the embeddings.</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>56, 57, 58</line_number>
<line_content>func = EmbeddingFunction(func), 
if wrap_api:, 
func = func.retry().rate_limit()</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>59, 60, 61</line_number>
<line_content>func = func.batch_size(batch_size), 
if show_progress:, 
func = func.show_progress()</line_content>
<context>
To run a command silently, call subprocess.run() and ignore the returned CompletedProcess object. To check for errors, use subprocess.run(..., check=True) which raises CalledProcessError if the program exits with a non-zero code.</context>
</line>
<line>
<line_number>62, 63, 64</line_number>
<line_content>if isinstance(data, pd.DataFrame):, 
data = pa.Table.from_pandas(data, preserve_index=False), 
embeddings = func(data[column].to_numpy())</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>65, 66, 69</line_number>
<line_content>table = vec_to_table(np.array(embeddings)), 
return data.append_column('vector', table['vector']), 
class EmbeddingFunction:</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>70, 71, 72</line_number>
<line_content>def __init__(self, func: Callable):, 
self.func = func, 
self.rate_limiter_kwargs = {}</line_content>
<context>
a new method from a callable and instance. PyMethod_Function gets the underlying function from a method. PyMethod_Self gets the bound instance from a method.</context>
</line>
<line>
<line_number>73, 74, 75</line_number>
<line_content>self.retry_kwargs = {}, 
self._batch_size = None, 
self._progress = False</line_content>
<context>
Py_FatalError prints an error and aborts. Py_Exit calls Py_FinalizeEx and then exits the process. Py_AtExit registers a cleanup function to be called on exit.</context>
</line>
<line>
<line_number>77, 78, 79</line_number>
<line_content>def __call__(self, text):, 
# Get the embedding with retry, 
if len(self.retry_kwargs) > 0:</line_content>
<context>
The encode_* functions raise TypeError if passed a multipart message instead of encoding the subparts individually. They extract the payload, encode it, and reset the payload to the encoded value.</context>
</line>
<line>
<line_number>81, 82, 83</line_number>
<line_content>@retry(**self.retry_kwargs), 
def embed_func(c):, 
return self.func(c.tolist())</line_content>
<context>
The encode_* functions raise TypeError if passed a multipart message instead of encoding the subparts individually. They extract the payload, encode it, and reset the payload to the encoded value.</context>
</line>
<line>
<line_number>87, 88, 90</line_number>
<line_content>def embed_func(c):, 
return self.func(c.tolist()), 
if len(self.rate_limiter_kwargs) > 0:</line_content>
<context>
The Py_RETURN_NONE macro is used to properly handle returning Py_None from a C function. It takes care of incrementing the reference count of Py_None before returning it.</context>
</line>
<line>
<line_number>91, 92, 94</line_number>
<line_content>v = int(sys.version_info.minor), 
if v >= 11:, 
'WARNING: rate limiter only supports Python up to 3.10, proceeding without rate limiter'</line_content>
<context>
minor, micro, release level and serial occupying different bytes and bits. PY_VERSION_HEX can be used for numeric comparisons of versions.</context>
</line>
<line>
<line_number>97, 99, 100</line_number>
<line_content>import ratelimiter, 
max_calls = self.rate_limiter_kwargs['max_calls'], 
limiter = ratelimiter.RateLimiter(</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>101, 103, 104</line_number>
<line_content>max_calls, period=self.rate_limiter_kwargs['period'], 
embed_func = limiter(embed_func), 
batches = self.to_batches(text)</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>105, 106, 108</line_number>
<line_content>embeds = [emb for c in batches for emb in embed_func(c)], 
return embeds, 
def __repr__(self):</line_content>
<context>
To embed Python, the C/C++ application must initialize the interpreter by calling Py_Initialize(). The application can then execute Python code by calling API functions like PyRun_SimpleString() to run Python code from a string.</context>
</line>
<line>
<line_number>109, 111, 112</line_number>
<line_content>return f'EmbeddingFunction(func={self.func})', 
def rate_limit(self, max_calls=0.9, period=1.0):, 
self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)</line_content>
<context>
PyFunction_GetGlobals returns the globals dictionary associated with a function object.

PyFunction_GetModule returns the __module__ attribute of a function object.

PyFunction_GetDefaults returns the default argument values of a function object.</context>
</line>
<line>
<line_number>113, 115, 116</line_number>
<line_content>return self, 
def retry(self, tries=10, delay=1, max_delay=30, backoff=3, jitter=1):, 
self.retry_kwargs = dict(</line_content>
<context>
built-in TimeoutError. CancelledError is raised when a Task or Future is cancelled. It allows custom handling of cancellation and should usually be re-raised. InvalidStateError is raised when a Task or Future is in an invalid internal state, such as</context>
</line>
<line>
<line_number>117, 118, 119</line_number>
<line_content>tries=tries,, 
delay=delay,, 
max_delay=max_delay,</line_content>
<context>
- run_until_complete and run_forever to run the event loop until a Future completes or the loop is stopped. stop can be used to stop the loop. 

- call_soon and call_later to schedule callbacks. call_later schedules a callback after a delay.</context>
</line>
<line>
<line_number>120, 121, 123</line_number>
<line_content>backoff=backoff,, 
jitter=jitter,, 
return self</line_content>
<context>
faulthandler.dump_traceback_later() dumps the tracebacks after a timeout, optionally repeating. faulthandler.cancel_dump_traceback_later() cancels this.</context>
</line>
<line>
<line_number>125, 126, 127</line_number>
<line_content>def batch_size(self, batch_size):, 
self._batch_size = batch_size, 
return self</line_content>
<context>
The PyTuple_Size and PyTuple_GET_SIZE functions return the size of a tuple. PyTuple_GetItem and PyTuple_GET_ITEM get an item at a given index from a tuple. PyTuple_GetSlice gets a slice from a tuple.</context>
</line>
<line>
<line_number>129, 130, 131</line_number>
<line_content>def show_progress(self):, 
self._progress = True, 
return self</line_content>
<context>
A Future object can be awaited to get its result when available. The result() method returns the result when it is set via set_result(). If the result is not ready, result() raises an InvalidStateError.</context>
</line>
<line>
<line_number>133, 134, 136</line_number>
<line_content>def to_batches(self, arr):, 
length = len(arr), 
def _chunker(arr):</line_content>
<context>
You instantiate a Chunk object at the start of each chunk. Then you can use the various methods to get information about the chunk, read its data, and move around within it. When you reach the end of a chunk, you create a new Chunk instance to</context>
</line>
<line>
<line_number>137, 138, 140</line_number>
<line_content>for start_i in range(0, len(arr), self._batch_size):, 
yield arr[start_i : start_i + self._batch_size], 
if self._progress:</line_content>
<context>
PySlice_New creates a new slice object given start, stop, and step values (any can be None). PySlice_GetIndices and PySlice_GetIndicesEx extract the start, stop, and step values from a slice assuming a sequence of a given length, clipping out of</context>
</line>
<line>
<line_number>141, 143, 145</line_number>
<line_content>from tqdm.auto import tqdm, 
yield from tqdm(_chunker(arr), total=math.ceil(length / self._batch_size)), 
yield from _chunker(arr)</line_content>
<context>
PySlice_New creates a new slice object given start, stop, and step values (any can be None). PySlice_GetIndices and PySlice_GetIndicesEx extract the start, stop, and step values from a slice assuming a sequence of a given length, clipping out of
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/exceptions.py</file_path>
<file_content>
"""Custom exception handling"""


class MissingValueError(ValueError):
    """Exception raised when a required value is missing."""

    pass


class MissingColumnError(KeyError):
    """
    Exception raised when a specified column name is not in
    the DataFrame object.
    """

    def __init__(self, column_name):
        self.column_name = column_name

    def __str__(self):
        return (
            f"Error: Column '{self.column_name}' does not exist in the DataFrame object"
        )

</file_content>
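As a rough illustration of how an exception like `MissingColumnError` behaves when raised, the sketch below repeats the class definition locally so that it runs without lancedb installed. `require_column` is a hypothetical helper invented for this example and is not part of the library.

```python
class MissingColumnError(KeyError):
    """Local copy of the class above, repeated so the sketch is self-contained."""

    def __init__(self, column_name):
        self.column_name = column_name

    def __str__(self):
        return (
            f"Error: Column '{self.column_name}' does not exist in the DataFrame object"
        )


def require_column(columns, name):
    """Hypothetical helper: return name if present, else raise MissingColumnError."""
    if name not in columns:
        raise MissingColumnError(name)
    return name


caught = None
try:
    require_column(["id", "vector"], "text")
except KeyError as err:  # the subclass is still caught as a plain KeyError
    caught = err

message = str(caught)
```

Because the class derives from `KeyError`, existing `except KeyError` handlers keep working, while the overridden `__str__` supplies a more readable message.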
<file_context>
<line>
<line_number>0, 3, 4</line_number>
<line_content>'''Custom exception handling''', 
class MissingValueError(ValueError):, 
'''Exception raised when a required value is missing.'''</line_content>
<context>
Built-in exceptions like ZeroDivisionError, NameError, and TypeError are raised when typical problems occur, like dividing by zero, using an undeclared name, or mismatching operand types. Custom exception classes can also be defined by inheriting</context>
</line>
<line>
<line_number>9, 11, 12</line_number>
<line_content>class MissingColumnError(KeyError):, 
Exception raised when a specified column name is not in, 
the DataFrame object.</line_content>
<context>
Python raises exceptions when errors occur. Exceptions can be handled with try/except blocks to allow recovery from errors. The raise statement explicitly raises an exception. Exceptions are identified by class instances. The except clause selects</context>
</line>
<line>
<line_number>15, 16, 18</line_number>
<line_content>def __init__(self, column_name):, 
self.column_name = column_name, 
def __str__(self):</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/fts.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""Full text search index using tantivy-py"""
import os
from typing import List, Tuple

import pyarrow as pa

try:
    import tantivy
except ImportError:
    raise ImportError(
        "Please install tantivy-py `pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985` to use the full text search feature."
    )

from .table import LanceTable


def create_index(index_path: str, text_fields: List[str]) -> tantivy.Index:
    """
    Create a new Index (not populated)

    Parameters
    ----------
    index_path : str
        Path to the index directory
    text_fields : List[str]
        List of text fields to index

    Returns
    -------
    index : tantivy.Index
        The index object (not yet populated)
    """
    # Declaring our schema.
    schema_builder = tantivy.SchemaBuilder()
    # special field that we'll populate with row_id
    schema_builder.add_integer_field("doc_id", stored=True)
    # data fields
    for name in text_fields:
        schema_builder.add_text_field(name, stored=True)
    schema = schema_builder.build()
    os.makedirs(index_path, exist_ok=True)
    index = tantivy.Index(schema, path=index_path)
    return index


def populate_index(index: tantivy.Index, table: LanceTable, fields: List[str]) -> int:
    """
    Populate an index with data from a LanceTable

    Parameters
    ----------
    index : tantivy.Index
        The index object
    table : LanceTable
        The table to index
    fields : List[str]
        List of fields to index

    Returns
    -------
    int
        The number of rows indexed
    """
    # first check the fields exist and are string or large string type
    for name in fields:
        f = table.schema.field(name)  # raises KeyError if not found
        if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):
            raise TypeError(f"Field {name} is not a string type")

    # create a tantivy writer
    writer = index.writer()
    # write data into index
    dataset = table.to_lance()
    row_id = 0
    for b in dataset.to_batches(columns=fields):
        for i in range(b.num_rows):
            doc = tantivy.Document()
            doc.add_integer("doc_id", row_id)
            for name in fields:
                doc.add_text(name, b[name][i].as_py())
            writer.add_document(doc)
            row_id += 1
    # commit changes
    writer.commit()
    return row_id


def search_index(
    index: tantivy.Index, query: str, limit: int = 10
) -> Tuple[Tuple[int], Tuple[float]]:
    """
    Search an index for a query

    Parameters
    ----------
    index : tantivy.Index
        The index object
    query : str
        The query string
    limit : int
        The maximum number of results to return

    Returns
    -------
    ids_and_scores : tuple[tuple[int], tuple[float]]
        A tuple of two tuples, the first containing the document ids
        and the second containing the scores
    """
    searcher = index.searcher()
    query = index.parse_query(query)
    # get top results
    results = searcher.search(query, limit)
    if results.count == 0:
        return tuple(), tuple()
    return tuple(
        zip(
            *[
                (searcher.doc(doc_address)["doc_id"][0], score)
                for score, doc_address in results.hits
            ]
        )
    )

</file_content>
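The return statement of `search_index` above relies on the `zip(*pairs)` idiom to turn a list of `(doc_id, score)` pairs into one tuple of ids and one tuple of scores. A minimal sketch of that idiom follows. It uses plain integers in place of tantivy document addresses, and `split_hits` is a hypothetical name chosen for this illustration.

```python
from typing import List, Tuple


def split_hits(hits: List[Tuple[float, int]]) -> Tuple[tuple, tuple]:
    """Transpose (score, doc_id) hits into (ids, scores).

    Assumption: each hit already carries an integer doc id, whereas the
    function above first resolves a tantivy doc address to its "doc_id".
    """
    if not hits:
        # zip(*[]) would yield nothing, so the empty case is handled explicitly.
        return tuple(), tuple()
    return tuple(zip(*[(doc_id, score) for score, doc_id in hits]))


hits = [(1.7, 12), (0.9, 3), (0.4, 27)]
ids, scores = split_hits(hits)
```

Handling the empty case up front keeps the result shape stable, so callers can always unpack two tuples.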
<file_context>
<line>
<line_number>13, 15, 17</line_number>
<line_content>'''Full text search index using tantivy-py''', 
from typing import List, Tuple, 
import pyarrow as pa</line_content>
<context>
the Py_UNICODE type API, platform.popen(), imp.find_module(), BaseException.message, and XML toolkit compatibility properties. The array module's u type code for Unicode strings is also deprecated in preparation for removal in Python 4.</context>
</line>
<line>
<line_number>20, 21, 22</line_number>
<line_content>import tantivy, 
except ImportError:, 
raise ImportError(</line_content>
<context>
ImportError on failure and remove failed modules from sys.modules.</context>
</line>
<line>
<line_number>23, 26, 29</line_number>
<line_content>'Please install tantivy-py `pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985` to use the full text search feature.', 
from .table import LanceTable, 
def create_index(index_path: str, text_fields: List[str]) -> tantivy.Index:</line_content>
<context>
like search, match, fullmatch, split, findall, finditer, sub, and subn for searching and replacing.</context>
</line>
<line>
<line_number>31, 33, 34</line_number>
<line_content>Create a new Index (not populated), 
Parameters, 
----------</line_content>
<context>
- Parameters define the arguments a function can take. Default values can be specified for parameters to make them optional.</context>
</line>
<line>
<line_number>35, 36, 37</line_number>
<line_content>index_path : str, 
Path to the index directory, 
text_fields : List[str]</line_content>
<context>
Python uses both methods and functions for builtins depending on semantics. Methods like list.index() imply modifications or behaviors relevant to the list itself. Functions like len(list) take a list as an argument but are general operations not</context>
</line>
<line>
<line_number>38, 42, 43</line_number>
<line_content>List of text fields to index, 
index : tantivy.Index, 
The index object (not yet populated)</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>45, 46, 47</line_number>
<line_content># Declaring our schema., 
schema_builder = tantivy.SchemaBuilder(), 
# special field that we'll populate with row_id</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>48, 49, 50</line_number>
<line_content>schema_builder.add_integer_field('doc_id', stored=True), 
# data fields, 
for name in text_fields:</line_content>
<context>
update_wrapper updates a wrapper function to look like the wrapped function. It assigns and updates attributes like __name__, __doc__ etc.</context>
</line>
<line>
<line_number>51, 52, 53</line_number>
<line_content>schema_builder.add_text_field(name, stored=True), 
schema = schema_builder.build(), 
os.makedirs(index_path, exist_ok=True)</line_content>
<context>
attribute. Since this is an implementation detail, alternate Python implementations may not provide __builtins__.</context>
</line>
<line>
<line_number>54, 55, 58</line_number>
<line_content>index = tantivy.Index(schema, path=index_path), 
return index, 
def populate_index(index: tantivy.Index, table: LanceTable, fields: List[str]) -> int:</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>60, 62, 63</line_number>
<line_content>Populate an index with data from a LanceTable, 
Parameters, 
----------</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>64, 65, 66</line_number>
<line_content>index : tantivy.Index, 
The index object, 
table : LanceTable</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>67, 68, 69</line_number>
<line_content>The table to index, 
fields : List[str], 
List of fields to index</line_content>
<context>
- Lists are defined with [] and can contain mixed type elements. Lists support indexing, slicing, adding/removing elements, concatenation with +, and length with the len() function.</context>
</line>
<line>
<line_number>74, 76, 77</line_number>
<line_content>The number of rows indexed, 
# first check the fields exist and are string or large string type, 
for name in fields:</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>78, 79, 80</line_number>
<line_content>f = table.schema.field(name)  # raises KeyError if not found, 
if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):, 
raise TypeError(f'Field {name} is not a string type')</line_content>
<context>
flags like Py_TPFLAGS_HAVE_GC. Functions like PyType_Check and PyType_HasFeature can check properties of type objects.</context>
</line>
<line>
<line_number>82, 83, 84</line_number>
<line_content># create a tantivy writer, 
writer = index.writer(), 
# write data into index</line_content>
<context>
The csv.writer() function returns a writer object to convert data to delimited strings on a given file-like object. The writer's writerow() method writes a row of data to the file. The dialect parameter again allows specifying formatting options.</context>
</line>
<line>
<line_number>85, 86, 87</line_number>
<line_content>dataset = table.to_lance(), 
row_id = 0, 
for b in dataset.to_batches(columns=fields):</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>88, 89, 90</line_number>
<line_content>for i in range(b.num_rows):, 
doc = tantivy.Document(), 
doc.add_integer('doc_id', row_id)</line_content>
<context>
Python lacks a ++ increment operator because it is inconsistent with the clean, explicit style recommended by Guido van Rossum and the Python style guide. The += augmented assignment operator serves the same purpose in a more explicit way.</context>
</line>
<line>
<line_number>91, 92, 93</line_number>
<line_content>for name in fields:, 
doc.add_text(name, b[name][i].as_py()), 
writer.add_document(doc)</line_content>
<context>
The pydoc module generates documentation for Python modules, functions, classes, and methods. It displays documentation derived from docstrings in multiple formats - as text on the console, served to a web browser, or saved as HTML files.</context>
</line>
<line>
<line_number>94, 95, 96</line_number>
<line_content>row_id += 1, 
# commit changes, 
writer.commit()</line_content>
<context>
Python lacks a ++ increment operator because it is inconsistent with the clean, explicit style recommended by Guido van Rossum and the Python style guide. The += augmented assignment operator serves the same purpose in a more explicit way.</context>
</line>
<line>
<line_number>97, 100, 101</line_number>
<line_content>return row_id, 
def search_index(, 
index: tantivy.Index, query: str, limit: int = 10</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>102, 104, 106</line_number>
<line_content>) -> Tuple[Tuple[int], Tuple[float]]:, 
Search an index for a query, 
Parameters</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>107, 108, 109</line_number>
<line_content>----------, 
index : tantivy.Index, 
The index object</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>110, 111, 112</line_number>
<line_content>query : str, 
The query string, 
limit : int</line_content>
<context>
as numbers, since no operations can alter a string value.</context>
</line>
<line>
<line_number>113, 117, 118</line_number>
<line_content>The maximum number of results to return, 
ids_and_score: list[tuple[int], tuple[float]], 
A tuple of two tuples, the first containing the document ids</line_content>
<context>
Variables like ST_MODE, ST_SIZE, ST_ATIME provide indexes into the tuple returned by os.stat() with info like the file permissions mode, size, access time, etc.</context>
</line>
<line>
<line_number>119, 121, 122</line_number>
<line_content>and the second containing the scores, 
searcher = index.searcher(), 
query = index.parse_query(query)</line_content>
<context>
The key function extracts a key from each element for comparison. This allows searching complex data structures.

Examples show looking up grades from scores, and inserting movies into a sorted list by release year.</context>
</line>
<line>
<line_number>123, 124, 125</line_number>
<line_content># get top results, 
results = searcher.search(query, limit), 
if results.count == 0:</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>126, 127, 130</line_number>
<line_content>return tuple(), tuple(), 
return tuple(, 
(searcher.doc(doc_address)['doc_id'][0], score)</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.
</context>
</line>
</file_context>
</file>
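The `search_index` docstring above describes its result as two parallel tuples, document ids and scores, while the hits are gathered as `(doc_id, score)` pairs. A minimal sketch of that transposition follows. It assumes the pairs are transposed with `zip(*...)`, and the helper name `split_hits` and the sample hits are hypothetical, not taken from the LanceDB source. It also shows why the empty case likely gets its own early return.

```python
from typing import List, Tuple


def split_hits(hits: List[Tuple[int, float]]) -> Tuple[Tuple[int, ...], Tuple[float, ...]]:
    """Transpose (doc_id, score) pairs into (doc_ids, scores)."""
    if not hits:
        # zip(*[]) yields nothing, so unpacking it into two names would fail;
        # return two empty tuples explicitly instead.
        return tuple(), tuple()
    doc_ids, scores = zip(*hits)
    return doc_ids, scores


# Hypothetical hits, as (doc_id, score) pairs.
ids, scores = split_hits([(3, 1.5), (0, 0.75)])
print(ids, scores)
```

Returning two empty tuples keeps the call site `row_ids, scores = search_index(...)` valid even when nothing matches.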
<file>
<file_path>python/lancedb/query.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
from __future__ import annotations
from typing import Literal

import numpy as np
import pandas as pd
import pyarrow as pa

from .common import VECTOR_COLUMN_NAME


class LanceQueryBuilder:
    """
    A builder for nearest neighbor queries for LanceDB.

    Examples
    --------
    >>> import lancedb
    >>> data = [{"vector": [1.1, 1.2], "b": 2},
    ...         {"vector": [0.5, 1.3], "b": 4},
    ...         {"vector": [0.4, 0.4], "b": 6},
    ...         {"vector": [0.4, 0.4], "b": 10}]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=data)
    >>> (table.search([0.4, 0.4])
    ...       .metric("cosine")
    ...       .where("b < 10")
    ...       .select(["b"])
    ...       .limit(2)
    ...       .to_df())
       b      vector  score
    0  6  [0.4, 0.4]    0.0
    """

    def __init__(self, table: "lancedb.table.LanceTable", query: np.ndarray):
        self._metric = "L2"
        self._nprobes = 20
        self._refine_factor = None
        self._table = table
        self._query = query
        self._limit = 10
        self._columns = None
        self._where = None

    def limit(self, limit: int) -> LanceQueryBuilder:
        """Set the maximum number of results to return.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._limit = limit
        return self

    def select(self, columns: list) -> LanceQueryBuilder:
        """Set the columns to return.

        Parameters
        ----------
        columns: list
            The columns to return.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._columns = columns
        return self

    def where(self, where: str) -> LanceQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        return self

    def metric(self, metric: Literal["L2", "cosine"]) -> LanceQueryBuilder:
        """Set the distance metric to use.

        Parameters
        ----------
        metric: "L2" or "cosine"
            The distance metric to use. By default "L2" is used.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._metric = metric
        return self

    def nprobes(self, nprobes: int) -> LanceQueryBuilder:
        """Set the number of probes to use.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        See discussion in [Querying an ANN Index][../querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._nprobes = nprobes
        return self

    def refine_factor(self, refine_factor: int) -> LanceQueryBuilder:
        """Set the refine factor to use, increasing the number of vectors sampled.

        As an example, a refine factor of 2 will sample 2x as many vectors as
        requested, re-rank them, and return the top half most relevant results.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._refine_factor = refine_factor
        return self

    def to_df(self) -> pd.DataFrame:
        """
        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns the vector
        column and a "score" column, which holds the distance between the
        query vector and each returned vector.
        """
        ds = self._table.to_lance()
        tbl = ds.to_table(
            columns=self._columns,
            filter=self._where,
            nearest={
                "column": VECTOR_COLUMN_NAME,
                "q": self._query,
                "k": self._limit,
                "metric": self._metric,
                "nprobes": self._nprobes,
                "refine_factor": self._refine_factor,
            },
        )
        return tbl.to_pandas()


class LanceFtsQueryBuilder(LanceQueryBuilder):
    def to_df(self) -> pd.DataFrame:
        try:
            import tantivy
        except ImportError:
            raise ImportError(
                "Please install tantivy-py `pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985` to use the full text search feature."
            )

        from .fts import search_index

        # get the index path
        index_path = self._table._get_fts_index_path()
        # open the index
        index = tantivy.Index.open(index_path)
        # get the scores and doc ids
        row_ids, scores = search_index(index, self._query, self._limit)
        if len(row_ids) == 0:
            return pd.DataFrame()
        scores = pa.array(scores)
        output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
        output_tbl = output_tbl.append_column("score", scores)
        return output_tbl.to_pandas()

</file_content>
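Each setter in `LanceQueryBuilder` stores one option and returns `self`, which is what allows calls such as `.metric("cosine").where("b < 10").limit(2)` to be chained before `to_df()` runs the query. The sketch below isolates that pattern in plain Python. `QuerySketch` and its `params()` method are hypothetical stand-ins for illustration only and are not part of LanceDB.

```python
class QuerySketch:
    """Hypothetical stand-in for a fluent query builder (not the LanceDB class)."""

    def __init__(self, query):
        self._query = query
        self._limit = 10       # same defaults as LanceQueryBuilder.__init__
        self._metric = "L2"
        self._where = None

    def limit(self, limit: int) -> "QuerySketch":
        self._limit = limit
        return self            # returning self is what enables chaining

    def metric(self, metric: str) -> "QuerySketch":
        self._metric = metric
        return self

    def where(self, where: str) -> "QuerySketch":
        self._where = where
        return self

    def params(self) -> dict:
        """Collect the accumulated settings; nothing is executed until this point."""
        return {
            "q": self._query,
            "k": self._limit,
            "metric": self._metric,
            "filter": self._where,
        }


params = QuerySketch([0.4, 0.4]).metric("cosine").where("b < 10").limit(2).params()
print(params)
```

Because every setter only mutates the builder, the order of the chained calls does not matter; the query is assembled once, at the end.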
<file_context>
<line>
<line_number>12, 13, 15</line_number>
<line_content>from __future__ import annotations, 
from typing import Literal, 
import numpy as np</line_content>
<context>
The __annotations__ attribute is a dictionary that stores type hints for functions, classes, and modules in Python. It was introduced in Python 3.0.</context>
</line>
<line>
<line_number>16, 17, 19</line_number>
<line_content>import pandas as pd, 
import pyarrow as pa, 
from .common import VECTOR_COLUMN_NAME</line_content>
<context>
pydoc can document modules, packages, functions, classes, methods, or a path to a .py file. Importing the module runs any code at the module level, so use if __name__ == '__main__' guards.</context>
</line>
<line>
<line_number>22, 24, 28</line_number>
<line_content>class LanceQueryBuilder:, 
A builder for nearest neighbor queries for LanceDB., 
>>> import lancedb</line_content>
<context>
and some useful functions for analyzing pickled data. The module is most relevant for Python core developers working on pickle, not typical users.</context>
</line>
<line>
<line_number>29, 30, 31</line_number>
<line_content>>>> data = [{'vector': [1.1, 1.2], 'b': 2},, 
...         {'vector': [0.5, 1.3], 'b': 4},, 
...         {'vector': [0.4, 0.4], 'b': 6},</line_content>
<context>
The python list data type has a built-in sort() method that sorts the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.</context>
</line>
<line>
<line_number>32, 33, 34</line_number>
<line_content>...         {'vector': [0.4, 0.4], 'b': 10}], 
>>> db = lancedb.connect('./.lancedb'), 
>>> table = db.create_table('my_table', data=data)</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>35, 36, 37</line_number>
<line_content>>>> (table.search([0.4, 0.4]), 
...       .metric('cosine'), 
...       .where('b < 10')</line_content>
<context>
Additional utility functions like index, find_lt, etc are provided for common searching tasks on sorted lists.</context>
</line>
<line>
<line_number>38, 39, 40</line_number>
<line_content>...       .select(['b']), 
...       .limit(2), 
...       .to_df())</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>41, 42, 45</line_number>
<line_content>b      vector  score, 
0  6  [0.4, 0.4]    0.0, 
def __init__(self, table: 'lancedb.table.LanceTable', query: np.ndarray):</line_content>
<context>
The entry_points() function returns EntryPoint objects representing entry points in a distribution, which have attributes like name, group, and value. These can be loaded to resolve the entry point. Entry points can be selected by group or other</context>
</line>
<line>
<line_number>46, 47, 48</line_number>
<line_content>self._metric = 'L2', 
self._nprobes = 20, 
self._refine_factor = None</line_content>
<context>
The documentation describes several constant values that exist in the built-in python namespace. These include False, True, None, NotImplemented, Ellipsis, and __debug__.</context>
</line>
<line>
<line_number>49, 50, 51</line_number>
<line_content>self._table = table, 
self._query = query, 
self._limit = 10</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>52, 53, 55</line_number>
<line_content>self._columns = None, 
self._where = None, 
def limit(self, limit: int) -> LanceQueryBuilder:</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>56, 58, 59</line_number>
<line_content>'''Set the maximum number of results to return., 
Parameters, 
----------</line_content>
<context>
- max() - returns largest item in an iterable

Built-in functions cover types like numbers, sequences, mappings, classes, modules, files, and more. They allow you to easily perform common operations without having to write additional code yourself.</context>
</line>
<line>
<line_number>60, 61, 65</line_number>
<line_content>limit: int, 
The maximum number of results to return., 
LanceQueryBuilder</line_content>
<context>
- max() - returns largest item in an iterable

Built-in functions cover types like numbers, sequences, mappings, classes, modules, files, and more. They allow you to easily perform common operations without having to write additional code yourself.</context>
</line>
<line>
<line_number>66, 68, 69</line_number>
<line_content>The LanceQueryBuilder object., 
self._limit = limit, 
return self</line_content>
<context>
limit. An example implements an echo server using start_server().</context>
</line>
<line>
<line_number>71, 72, 74</line_number>
<line_content>def select(self, columns: list) -> LanceQueryBuilder:, 
'''Set the columns to return., 
Parameters</line_content>
<context>
The fields can have default values set normally in Python. Default values can also be set by the field() function which allows for things like mutable default values using default_factory. The field() function has parameters that mirror those in the</context>
</line>
<line>
<line_number>75, 76, 77</line_number>
<line_content>----------, 
columns: list, 
The columns to return.</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>81, 82, 84</line_number>
<line_content>LanceQueryBuilder, 
The LanceQueryBuilder object., 
self._columns = columns</line_content>
<context>
field() function allows for additional per-field configurations.</context>
</line>
<line>
<line_number>85, 87, 88</line_number>
<line_content>return self, 
def where(self, where: str) -> LanceQueryBuilder:, 
'''Set the where clause.</line_content>
<context>
A Future object can be awaited to get its result when available. The result() method returns the result when it is set via set_result(). If the result is not ready, result() raises an InvalidStateError.</context>
</line>
<line>
<line_number>90, 91, 92</line_number>
<line_content>Parameters, 
----------, 
where: str</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>93, 97, 98</line_number>
<line_content>The where clause., 
LanceQueryBuilder, 
The LanceQueryBuilder object.</line_content>
<context>
Expression statements are used to compute and print a value or call a procedure that returns no result. Assignment statements bind names to values and modify attributes or items of mutable objects. Augmented assignment combines a binary operation</context>
</line>
<line>
<line_number>100, 101, 103</line_number>
<line_content>self._where = where, 
return self, 
def metric(self, metric: Literal['L2', 'cosine']) -> LanceQueryBuilder:</line_content>
<context>
The def statement defines a function in Python. Functions can have documentation strings, parameters, default argument values, arbitrary argument lists, and return statements.</context>
</line>
<line>
<line_number>104, 106, 107</line_number>
<line_content>'''Set the distance metric to use., 
Parameters, 
----------</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>108, 109, 113</line_number>
<line_content>metric: 'L2' or 'cosine', 
The distance metric to use. By default 'L2' is used., 
LanceQueryBuilder</line_content>
<context>
measures of central location, measures of spread or dispersion, statistics for relations between two inputs, and exceptions.</context>
</line>
<line>
<line_number>114, 116, 117</line_number>
<line_content>The LanceQueryBuilder object., 
self._metric = metric, 
return self</line_content>
<context>
attribute. Since this is an implementation detail, alternate Python implementations may not provide __builtins__.</context>
</line>
<line>
<line_number>119, 120, 122</line_number>
<line_content>def nprobes(self, nprobes: int) -> LanceQueryBuilder:, 
'''Set the number of probes to use., 
Higher values will yield better recall (more likely to find vectors if</line_content>
<context>
and uses PEP 526 type annotations to define the member variables to use in these generated methods.</context>
</line>
<line>
<line_number>123, 125, 126</line_number>
<line_content>they exist) at the expense of latency., 
See discussion in [Querying an ANN Index][../querying-an-ann-index] for, 
tuning advice.</line_content>
<context>
Additional utility functions like index, find_lt, etc are provided for common searching tasks on sorted lists.</context>
</line>
<line>
<line_number>128, 129, 130</line_number>
<line_content>Parameters, 
----------, 
nprobes: int</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>131, 135, 136</line_number>
<line_content>The number of probes to use., 
LanceQueryBuilder, 
The LanceQueryBuilder object.</line_content>
<context>
methods execute code under tracing. The "results()" method returns a "CoverageResults" object with the cumulative tracing results.</context>
</line>
<line>
<line_number>138, 139, 141</line_number>
<line_content>self._nprobes = nprobes, 
return self, 
def refine_factor(self, refine_factor: int) -> LanceQueryBuilder:</line_content>
<context>
attribute. Since this is an implementation detail, alternate Python implementations may not provide __builtins__.</context>
</line>
<line>
<line_number>142, 144, 145</line_number>
<line_content>'''Set the refine factor to use, increasing the number of vectors sampled., 
As an example, a refine factor of 2 will sample 2x as many vectors as, 
requested, re-ranks them, and returns the top half most relevant results.</line_content>
<context>
various ways like sorting by cumulative time or filtering to a subset of functions.</context>
</line>
<line>
<line_number>147, 148, 150</line_number>
<line_content>See discussion in [Querying an ANN Index][querying-an-ann-index] for, 
tuning advice., 
Parameters</line_content>
<context>
Some functions like PyNumber_AsSsize_t convert to specific C numeric types. Others like PyIndex_Check test if an object implements the index protocol and can be used as an array index.</context>
</line>
<line>
<line_number>151, 152, 153</line_number>
<line_content>----------, 
refine_factor: int, 
The refine factor to use.</line_content>
<context>
The fractions module implements rational numbers.</context>
</line>
<line>
<line_number>157, 158, 160</line_number>
<line_content>LanceQueryBuilder, 
The LanceQueryBuilder object., 
self._refine_factor = refine_factor</line_content>
<context>
PyDescr_NewMethod creates a method descriptor from a PyMethodDef structure.</context>
</line>
<line>
<line_number>161, 163, 165</line_number>
<line_content>return self, 
def to_df(self) -> pd.DataFrame:, 
Execute the query and return the results as a pandas DataFrame.</line_content>
<context>
Some key steps are:

- Convert data from C to Python types with API functions 
- Call Python interface routines using the converted values
- Convert return values back from Python to C</context>
</line>
<line>
<line_number>166, 167, 168</line_number>
<line_content>In addition to the selected columns, LanceDB also returns a vector, 
and also the 'score' column which is the distance between the query, 
vector and the returned vector.</line_content>
<context>
The entry_points() function returns EntryPoint objects representing entry points in a distribution, which have attributes like name, group, and value. These can be loaded to resolve the entry point. Entry points can be selected by group or other</context>
</line>
<line>
<line_number>170, 171, 172</line_number>
<line_content>ds = self._table.to_lance(), 
tbl = ds.to_table(, 
columns=self._columns,</line_content>
<context>
- in_table_d1 - returns True if the codepoint has bidirectional property 'R' or 'AL'</context>
</line>
<line>
<line_number>173, 175, 176</line_number>
<line_content>filter=self._where,, 
'column': VECTOR_COLUMN_NAME,, 
'q': self._query,</line_content>
<context>
Filters provide more advanced filtering based on criteria beyond log level. They have a filter() method to determine if a record should be processed.</context>
</line>
<line>
<line_number>177, 178, 179</line_number>
<line_content>'k': self._limit,, 
'metric': self._metric,, 
'nprobes': self._nprobes,</line_content>
<context>
Identifiers consist of letters, digits, underscores and certain Unicode characters. Names like __*__ are system-defined. Names like _* and __* are special. Keywords like def, class, etc. are reserved.</context>
</line>
<line>
<line_number>180, 183, 186</line_number>
<line_content>'refine_factor': self._refine_factor,, 
return tbl.to_pandas(), 
class LanceFtsQueryBuilder(LanceQueryBuilder):</line_content>
<context>
attribute. Since this is an implementation detail, alternate Python implementations may not provide __builtins__.</context>
</line>
<line>
<line_number>187, 189, 190</line_number>
<line_content>def to_df(self) -> pd.DataFrame:, 
import tantivy, 
except ImportError:</line_content>
<context>
The PyFrame_Type object represents the Python types.FrameType type for frame objects. PyFrame_Check() checks if an object is a frame object.</context>
</line>
<line>
<line_number>191, 192, 195</line_number>
<line_content>raise ImportError(, 
'Please install tantivy-py `pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985` to use the full text search feature.', 
from .fts import search_index</line_content>
<context>
The key functions in imp are:

- imp.find_module() - Searches for a module and returns a file handle, path, and description tuple if found. Raises ImportError otherwise.</context>
</line>
<line>
<line_number>197, 198, 199</line_number>
<line_content># get the index path, 
index_path = self._table._get_fts_index_path(), 
# open the index</line_content>
<context>
use methods on the traversable to open or read resources. as_file() provides a file path object.</context>
</line>
<line>
<line_number>200, 201, 202</line_number>
<line_content>index = tantivy.Index.open(index_path), 
# get the scores and doc ids, 
row_ids, scores = search_index(index, self._query, self._limit)</line_content>
<context>
Variables like ST_MODE, ST_SIZE, ST_ATIME provide indexes into the tuple returned by os.stat() with info like the file permissions mode, size, access time, etc.</context>
</line>
<line>
<line_number>203, 204, 205</line_number>
<line_content>if len(row_ids) == 0:, 
return pd.DataFrame(), 
scores = pa.array(scores)</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>206, 207, 208</line_number>
<line_content>output_tbl = self._table.to_lance().take(row_ids, columns=self._columns), 
output_tbl = output_tbl.append_column('score', scores), 
return output_tbl.to_pandas()</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.
</context>
</line>
</file_context>
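The `refine_factor` docstring describes over-fetching candidates and re-ranking them. The following is a rough sketch of that idea in pure Python, not Lance's actual implementation. It assumes the index ranks with lossy (for example, quantized) vectors, represented here by a hypothetical `approx` list, and that re-ranking uses the exact vectors. The function name and all data are made up for illustration.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_search(query, exact, approx, k, refine_factor=1):
    """Pick k * refine_factor candidates by approximate distance, then re-rank exactly."""
    # Step 1: over-fetch using the cheap, lossy vectors.
    n_candidates = k * refine_factor
    by_approx = sorted(range(len(approx)), key=lambda i: l2(query, approx[i]))
    candidates = by_approx[:n_candidates]
    # Step 2: re-rank the candidates with the exact vectors and keep the top k.
    return sorted(candidates, key=lambda i: l2(query, exact[i]))[:k]


query = [0.4, 0.4]
exact = [[0.55, 0.55], [0.1, 0.1], [2.0, 2.0]]
# Hypothetical lossy copies of the vectors above (rounded to integers).
approx = [[1, 1], [0, 0], [2, 2]]

print(refine_search(query, exact, approx, k=1, refine_factor=1))
print(refine_search(query, exact, approx, k=1, refine_factor=2))
```

With `refine_factor=1` only the approximate ranking is used, so row 1 is returned; with `refine_factor=2` row 0 enters the candidate set and wins the exact re-ranking, at the cost of computing more exact distances.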
</file>
<file>
<file_path>python/lancedb/table.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from __future__ import annotations

import os
import shutil
from functools import cached_property
from typing import List, Union

import lance
import numpy as np
import pandas as pd
import pyarrow as pa
from lance import LanceDataset
from lance.vector import vec_to_table

from .common import DATA, VEC, VECTOR_COLUMN_NAME
from .query import LanceFtsQueryBuilder, LanceQueryBuilder
from .util import get_uri_scheme


def _sanitize_data(data, schema):
    if isinstance(data, list):
        data = pa.Table.from_pylist(data)
        data = _sanitize_schema(data, schema=schema)
    if isinstance(data, dict):
        data = vec_to_table(data)
    if isinstance(data, pd.DataFrame):
        data = pa.Table.from_pandas(data)
        data = _sanitize_schema(data, schema=schema)
    if not isinstance(data, pa.Table):
        raise TypeError(f"Unsupported data type: {type(data)}")
    return data


class LanceTable:
    """
    A table in a LanceDB database.

    Examples
    --------

    Create using [LanceDBConnection.create_table][lancedb.LanceDBConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
    >>> table.head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Append new data with [LanceTable.add][lancedb.table.LanceTable.add].

    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])
    2

    Query the table with [LanceTable.search][lancedb.table.LanceTable.search].

    >>> table.search([0.4, 0.4]).select(["b"]).to_df()
       b      vector  score
    0  4  [0.5, 1.3]   0.82
    1  2  [1.1, 1.2]   1.13

    Search queries are much faster when an index is created. See
    [LanceTable.create_index][lancedb.table.LanceTable.create_index].

    """

    def __init__(
        self, connection: "lancedb.db.LanceDBConnection", name: str, version: int = None
    ):
        self._conn = connection
        self.name = name
        self._version = version

    def _reset_dataset(self):
        try:
            del self.__dict__["_dataset"]
        except KeyError:
            pass

    @property
    def schema(self) -> pa.Schema:
        """Return the schema of the table.

        Returns
        -------
        pa.Schema
            A PyArrow schema object."""
        return self._dataset.schema

    def list_versions(self):
        """List all versions of the table"""
        return self._dataset.versions()

    @property
    def version(self) -> int:
        """Get the current version of the table"""
        return self._dataset.version

    def checkout(self, version: int):
        """Checkout a version of the table. This is an in-place operation.

        This allows viewing previous versions of the table.

        Parameters
        ----------
        version : int
            The version to checkout.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [1.1, 0.9], "type": "vector"}])
        >>> table.version
        1
        >>> table.to_pandas()
               vector    type
        0  [1.1, 0.9]  vector
        >>> table.add([{"vector": [0.5, 0.2], "type": "vector"}])
        2
        >>> table.version
        2
        >>> table.checkout(1)
        >>> table.to_pandas()
               vector    type
        0  [1.1, 0.9]  vector
        """
        max_ver = max(v["version"] for v in self._dataset.versions())
        if version < 1 or version > max_ver:
            raise ValueError(f"Invalid version {version}")
        self._version = version
        self._reset_dataset()

    def __len__(self):
        return self._dataset.count_rows()

    def __repr__(self) -> str:
        return f"LanceTable({self.name})"

    def __str__(self) -> str:
        return self.__repr__()

    def head(self, n=5) -> pa.Table:
        """Return the first n rows of the table."""
        return self._dataset.head(n)

    def to_pandas(self) -> pd.DataFrame:
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return self.to_arrow().to_pandas()

    def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table"""
        return self._dataset.to_table()

    @property
    def _dataset_uri(self) -> str:
        return os.path.join(self._conn.uri, f"{self.name}.lance")

    def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
        """Create an index on the table.

        Parameters
        ----------
        metric: str, default "L2"
            The distance metric to use when creating the index. Valid values are "L2" or "cosine".
            L2 is Euclidean distance.
        num_partitions: int
            The number of IVF partitions to use when creating the index.
            Default is 256.
        num_sub_vectors: int
            The number of PQ sub-vectors to use when creating the index.
            Default is 96.
        """
        self._dataset.create_index(
            column=VECTOR_COLUMN_NAME,
            index_type="IVF_PQ",
            metric=metric,
            num_partitions=num_partitions,
            num_sub_vectors=num_sub_vectors,
        )
        self._reset_dataset()

    def create_fts_index(self, field_names: Union[str, List[str]]):
        """Create a full-text search index on the table.

        Warning - this API is highly experimental and is highly likely to change
        in the future.

        Parameters
        ----------
        field_names: str or list of str
            The name(s) of the field to index.
        """
        from .fts import create_index, populate_index

        if isinstance(field_names, str):
            field_names = [field_names]
        index = create_index(self._get_fts_index_path(), field_names)
        populate_index(index, self, field_names)

    def _get_fts_index_path(self):
        return os.path.join(self._dataset_uri, "_indices", "tantivy")

    @cached_property
    def _dataset(self) -> LanceDataset:
        return lance.dataset(self._dataset_uri, version=self._version)

    def to_lance(self) -> LanceDataset:
        """Return the LanceDataset backing this table."""
        return self._dataset

    def add(self, data: DATA, mode: str = "append") -> int:
        """Add data to the table.

        Parameters
        ----------
        data: list-of-dict, dict, pd.DataFrame
            The data to insert into the table.
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".

        Returns
        -------
        int
            The number of vectors in the table.
        """
        data = _sanitize_data(data, self.schema)
        lance.write_dataset(data, self._dataset_uri, mode=mode)
        self._reset_dataset()
        return len(self)

    def search(self, query: Union[VEC, str]) -> LanceQueryBuilder:
        """Create a search query to find the nearest neighbors
        of the given query vector.

        Parameters
        ----------
        query: list, np.ndarray, or str
            The query vector. A str is treated as a full-text search query
            and returns a LanceFtsQueryBuilder instead.

        Returns
        -------
        LanceQueryBuilder
            A query builder object representing the query.
            Once executed, the query returns selected columns, the vector,
            and also the "score" column which is the distance between the query
            vector and the returned vector.
        """
        if isinstance(query, str):
            # fts
            return LanceFtsQueryBuilder(self, query)

        if isinstance(query, list):
            query = np.array(query)
        if isinstance(query, np.ndarray):
            query = query.astype(np.float32)
        else:
            raise TypeError(f"Unsupported query type: {type(query)}")
        return LanceQueryBuilder(self, query)

    @classmethod
    def create(cls, db, name, data, schema=None, mode="create"):
        tbl = cls(db, name)
        data = _sanitize_data(data, schema)
        lance.write_dataset(data, tbl._dataset_uri, mode=mode)
        return tbl


def _sanitize_schema(data: pa.Table, schema: pa.Schema = None) -> pa.Table:
    """Ensure that the table has the expected schema.

    Parameters
    ----------
    data: pa.Table
        The table to sanitize.
    schema: pa.Schema; optional
        The expected schema. If not provided, this just converts the
        vector column to fixed_size_list(float32) if necessary.
    """
    if schema is not None:
        if data.schema == schema:
            return data
        # cast the columns to the expected types
        data = data.combine_chunks()
        data = _sanitize_vector_column(data, vector_column_name=VECTOR_COLUMN_NAME)
        return pa.Table.from_arrays(
            [data[name] for name in schema.names], schema=schema
        )
    # just check the vector column
    return _sanitize_vector_column(data, vector_column_name=VECTOR_COLUMN_NAME)


def _sanitize_vector_column(data: pa.Table, vector_column_name: str) -> pa.Table:
    """
    Ensure that the vector column exists and has type fixed_size_list(float32)

    Parameters
    ----------
    data: pa.Table
        The table to sanitize.
    vector_column_name: str
        The name of the vector column.
    """
    if vector_column_name not in data.column_names:
        raise ValueError(f"Missing vector column: {vector_column_name}")
    vec_arr = data[vector_column_name].combine_chunks()
    if pa.types.is_fixed_size_list(vec_arr.type):
        return data
    if not pa.types.is_list(vec_arr.type):
        raise TypeError(f"Unsupported vector column type: {vec_arr.type}")
    values = vec_arr.values
    if not pa.types.is_float32(values.type):
        values = values.cast(pa.float32())
    # pyarrow expects an integer list size; "/" would produce a float here
    list_size = len(values) // len(data)
    vec_arr = pa.FixedSizeListArray.from_arrays(values, list_size)
    return data.set_column(
        data.column_names.index(vector_column_name), vector_column_name, vec_arr
    )

</file_content>
<file_context>
<line>
<line_number>13, 16, 17</line_number>
<line_content>from __future__ import annotations, 
import shutil, 
from functools import cached_property</line_content>
<context>
PyFunction_GetAnnotations returns the annotations dictionary of a function object.

PyFunction_SetAnnotations sets the annotations dictionary for a function object.</context>
</line>
<line>
<line_number>18, 20, 21</line_number>
<line_content>from typing import List, Union, 
import lance, 
import numpy as np</line_content>
<context>
The python list data type has a built-in sort() method that sorts the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.</context>
</line>
<line>
<line_number>22, 23, 24</line_number>
<line_content>import pandas as pd, 
import pyarrow as pa, 
from lance import LanceDataset</line_content>
<context>
The PyDictObject represents a Python dictionary. The PyDict_Type is the Python dict type. Functions are provided to check if an object is a dict, create a new empty dict, and clear an existing dict.</context>
</line>
<line>
<line_number>25, 27, 28</line_number>
<line_content>from lance.vector import vec_to_table, 
from .common import DATA, VEC, VECTOR_COLUMN_NAME, 
from .query import LanceFtsQueryBuilder, LanceQueryBuilder</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>29, 32, 33</line_number>
<line_content>from .util import get_uri_scheme, 
def _sanitize_data(data, schema):, 
if isinstance(data, list):</line_content>
<context>
The urllib package in Python contains several modules for working with URLs. The main module is urllib.request, which handles opening and reading URLs. It allows you to open a URL like a file and read from it. The urllib.error module contains</context>
</line>
<line>
<line_number>34, 35, 36</line_number>
<line_content>data = pa.Table.from_pylist(data), 
data = _sanitize_schema(data, schema=schema), 
if isinstance(data, dict):</line_content>
<context>
- Delete attributes, like with PyObject_DelAttr.

- Compare two objects, like with PyObject_RichCompare and PyObject_RichCompareBool. These implement comparison operators like <, ==, != etc.</context>
</line>
<line>
<line_number>37, 38, 39</line_number>
<line_content>data = vec_to_table(data), 
if isinstance(data, pd.DataFrame):, 
data = pa.Table.from_pandas(data)</line_content>
<context>
The PyFrame_Type object represents the Python types.FrameType type for frame objects. PyFrame_Check() checks if an object is a frame object.</context>
</line>
<line>
<line_number>40, 41, 42</line_number>
<line_content>data = _sanitize_schema(data, schema=schema), 
if not isinstance(data, pa.Table):, 
raise TypeError(f'Unsupported data type: {type(data)}')</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>43, 46, 48</line_number>
<line_content>return data, 
class LanceTable:, 
A table in a LanceDB database.</line_content>
<context>
returns a file-like object that can be used to read the data returned.</context>
</line>
<line>
<line_number>53, 54, 56</line_number>
<line_content>Create using [LanceDBConnection.create_table][lancedb.LanceDBConnection.create_table], 
(more examples in that method's documentation)., 
>>> import lancedb</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>57, 58, 59</line_number>
<line_content>>>> db = lancedb.connect('./.lancedb'), 
>>> table = db.create_table('my_table', data=[{'vector': [1.1, 1.2], 'b': 2}]), 
>>> table.head()</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>60, 61, 62</line_number>
<line_content>pyarrow.Table, 
vector: fixed_size_list<item: float>[2], 
child 0, item: float</line_content>
<context>
PyTuple_SetItem and PyTuple_SET_ITEM set an item at a given index in a tuple. _PyTuple_Resize resizes a tuple.</context>
</line>
<line>
<line_number>65, 68, 70</line_number>
<line_content>vector: [[[1.1,1.2]]], 
Can append new data with [LanceTable.add][lancedb.table.LanceTable.add]., 
>>> table.add([{'vector': [0.5, 1.3], 'b': 4}])</line_content>
<context>
- append - add an item to the end of the array
- extend - extend array by appending another array or iterable 
- insert - insert an item at a given index
- pop - remove and return item at given index
- tolist - convert the array to a regular list</context>
</line>
<line>
<line_number>73, 75, 76</line_number>
<line_content>Can query the table with [LanceTable.search][lancedb.table.LanceTable.search]., 
>>> table.search([0.4, 0.4]).select(['b']).to_df(), 
b      vector  score</line_content>
<context>
The key function extracts a key from each element for comparison. This allows searching complex data structures.

Examples show looking up grades from scores, and inserting movies into a sorted list by release year.</context>
</line>
<line>
<line_number>77, 78, 80</line_number>
<line_content>0  4  [0.5, 1.3]   0.82, 
1  2  [1.1, 1.2]   1.13, 
Search queries are much faster when an index is created. See</line_content>
<context>
Additional utility functions like index, find_lt, etc are provided for common searching tasks on sorted lists.</context>
</line>
<line>
<line_number>81, 85, 86</line_number>
<line_content>[LanceTable.create_index][lancedb.table.LanceTable.create_index]., 
def __init__(, 
self, connection: 'lancedb.db.LanceDBConnection', name: str, version: int = None</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>88, 89, 90</line_number>
<line_content>self._conn = connection, 
self.name = name, 
self._version = version</line_content>
<context>
argument called egg_info. This should be a distutils.command.install_egg_info.install_egg_info object.</context>
</line>
<line>
<line_number>92, 94, 95</line_number>
<line_content>def _reset_dataset(self):, 
del self.__dict__['_dataset'], 
except AttributeError:</line_content>
<context>
The asdict() and astuple() functions can convert dataclass instances to dicts and tuples respectively. The make_dataclass() function creates a new dataclass programmatically. The replace() function creates a new instance, replacing specified fields.</context>
</line>
<line>
<line_number>99, 100, 105</line_number>
<line_content>def schema(self) -> pa.Schema:, 
'''Return the schema of the table., 
A PyArrow schema object.'''</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>106, 108, 109</line_number>
<line_content>return self._dataset.schema, 
def list_versions(self):, 
'''List all versions of the table'''</line_content>
<context>
The collections module provides specialized container datatypes that provide alternatives to Python's general built-in containers like dict, list, set, and tuple.</context>
</line>
<line>
<line_number>110, 113, 114</line_number>
<line_content>return self._dataset.versions(), 
def version(self) -> int:, 
'''Get the current version of the table'''</line_content>
<context>
The asdict() and astuple() functions can convert dataclass instances to dicts and tuples respectively. The make_dataclass() function creates a new dataclass programmatically. The replace() function creates a new instance, replacing specified fields.</context>
</line>
<line>
<line_number>115, 117, 118</line_number>
<line_content>return self._dataset.version, 
def checkout(self, version: int):, 
'''Checkout a version of the table. This is an in-place operation.</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>120, 122, 123</line_number>
<line_content>This allows viewing previous versions of the table., 
Parameters, 
----------</line_content>
<context>
Rather than expose the large tables directly, it exposes their functionality through codepoint checking and mapping functions.</context>
</line>
<line>
<line_number>124, 125, 129</line_number>
<line_content>version : int, 
The version to checkout., 
>>> import lancedb</line_content>
<context>
Additional metadata like author, url, license, etc can provide more information on the package. Versions should follow major.minor[.patch] format. Classifiers, keywords, and platforms can also be specified.</context>
</line>
<line>
<line_number>130, 131, 132</line_number>
<line_content>>>> db = lancedb.connect('./.lancedb'), 
>>> table = db.create_table('my_table', [{'vector': [1.1, 0.9], 'type': 'vector'}]), 
>>> table.version</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>134, 135, 136</line_number>
<line_content>>>> table.to_pandas(), 
vector    type, 
0  [1.1, 0.9]  vector</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>137, 139, 141</line_number>
<line_content>>>> table.add([{'vector': [0.5, 0.2], 'type': 'vector'}]), 
>>> table.version, 
>>> table.checkout(1)</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>142, 143, 144</line_number>
<line_content>>>> table.to_pandas(), 
vector    type, 
0  [1.1, 0.9]  vector</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>146, 147, 148</line_number>
<line_content>max_ver = max([v['version'] for v in self._dataset.versions()]), 
if version < 1 or version > max_ver:, 
raise ValueError(f'Invalid version {version}')</line_content>
<context>
minor, micro, release level and serial occupying different bytes and bits. PY_VERSION_HEX can be used for numeric comparisons of versions.</context>
</line>
<line>
<line_number>149, 150, 152</line_number>
<line_content>self._version = version, 
self._reset_dataset(), 
def __len__(self):</line_content>
<context>
minor, micro, release level and serial occupying different bytes and bits. PY_VERSION_HEX can be used for numeric comparisons of versions.</context>
</line>
<line>
<line_number>153, 155, 156</line_number>
<line_content>return self._dataset.count_rows(), 
def __repr__(self) -> str:, 
return f'LanceTable({self.name})'</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>158, 159, 161</line_number>
<line_content>def __str__(self) -> str:, 
return self.__repr__(), 
def head(self, n=5) -> pa.Table:</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>162, 163, 165</line_number>
<line_content>'''Return the first n rows of the table.''', 
return self._dataset.head(n), 
def to_pandas(self) -> pd.DataFrame:</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.</context>
</line>
<line>
<line_number>166, 170, 172</line_number>
<line_content>'''Return the table as a pandas DataFrame., 
pd.DataFrame, 
return self.to_arrow().to_pandas()</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>174, 175, 179</line_number>
<line_content>def to_arrow(self) -> pa.Table:, 
'''Return the table as a pyarrow Table., 
pa.Table'''</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>180, 183, 184</line_number>
<line_content>return self._dataset.to_table(), 
def _dataset_uri(self) -> str:, 
return os.path.join(self._conn.uri, f'{self.name}.lance')</line_content>
<context>
returns a SymbolTable instance for a given piece of Python source code. The SymbolTable class represents a namespace and provides methods to inspect the identifiers defined within it, like get_type, get_name, get_lineno, etc. Specific subclasses</context>
</line>
<line>
<line_number>186, 187, 189</line_number>
<line_content>def create_index(self, metric='L2', num_partitions=256, num_sub_vectors=96):, 
'''Create an index on the table., 
Parameters</line_content>
<context>
PySlice_Unpack and PySlice_AdjustIndices are safer, more modern ways to get slice indices - unpacking the slice object and then adjusting indices for a sequence length.</context>
</line>
<line>
<line_number>190, 191, 192</line_number>
<line_content>----------, 
metric: str, default 'L2', 
The distance metric to use when creating the index. Valid values are 'L2' or 'cosine'.</line_content>
<context>
The slice functions let you extract the slice indices from a slice object or properly adjust them for a sequence length.</context>
</line>
<line>
<line_number>193, 194, 195</line_number>
<line_content>L2 is euclidean distance., 
num_partitions: int, 
The number of IVF partitions to use when creating the index.</line_content>
<context>
The slice functions let you extract the slice indices from a slice object or properly adjust them for a sequence length.</context>
</line>
<line>
<line_number>196, 197, 198</line_number>
<line_content>Default is 256., 
num_sub_vectors: int, 
The number of PQ sub-vectors to use when creating the index.</line_content>
<context>
Some functions like PyNumber_AsSsize_t convert to specific C numeric types. Others like PyIndex_Check test if an object implements the index protocol and can be used as an array index.</context>
</line>
<line>
<line_number>199, 201, 202</line_number>
<line_content>Default is 96., 
self._dataset.create_index(, 
column=VECTOR_COLUMN_NAME,</line_content>
<context>
The fields can have default values set normally in Python. Default values can also be set by the field() function which allows for things like mutable default values using default_factory. The field() function has parameters that mirror those in the</context>
</line>
<line>
<line_number>203, 204, 205</line_number>
<line_content>index_type='IVF_PQ',, 
metric=metric,, 
num_partitions=num_partitions,</line_content>
<context>
PySlice_Unpack and PySlice_AdjustIndices are safer, more modern ways to get slice indices - unpacking the slice object and then adjusting indices for a sequence length.</context>
</line>
<line>
<line_number>206, 208, 210</line_number>
<line_content>num_sub_vectors=num_sub_vectors,, 
self._reset_dataset(), 
def create_fts_index(self, field_names: Union[str, List[str]]):</line_content>
<context>
PySlice_Unpack and PySlice_AdjustIndices are safer, more modern ways to get slice indices - unpacking the slice object and then adjusting indices for a sequence length.</context>
</line>
<line>
<line_number>211, 213, 214</line_number>
<line_content>'''Create a full-text search index on the table., 
Warning - this API is highly experimental and is highly likely to change, 
in the future.</line_content>
<context>
like search, match, fullmatch, split, findall, finditer, sub, and subn for searching and replacing.</context>
</line>
<line>
<line_number>216, 217, 218</line_number>
<line_content>Parameters, 
----------, 
field_names: str or list of str</line_content>
<context>
like -f or --foo as well as positional arguments. The parse_args() method is then used to convert the argument strings into objects and assign them as attributes of a namespace.</context>
</line>
<line>
<line_number>219, 221, 223</line_number>
<line_content>The name(s) of the field to index., 
from .fts import create_index, populate_index, 
if isinstance(field_names, str):</line_content>
<context>
By default, importlib.metadata provides metadata discovery for filesystem and zip file distributions. Additional finders can be added by implementing the DistributionFinder abstract base class. The find_distributions() method should return</context>
</line>
<line>
<line_number>224, 225, 226</line_number>
<line_content>field_names = [field_names], 
index = create_index(self._get_fts_index_path(), field_names), 
populate_index(index, self, field_names)</line_content>
<context>
The fields can have default values set normally in Python. Default values can also be set by the field() function which allows for things like mutable default values using default_factory. The field() function has parameters that mirror those in the</context>
</line>
<line>
<line_number>228, 229, 231</line_number>
<line_content>def _get_fts_index_path(self):, 
return os.path.join(self._dataset_uri, '_indices', 'tantivy'), 
@cached_property</line_content>
<context>
The os.path module contains functions for working with file paths and directory paths. It allows you to extract components of paths, check if a path exists, get metadata like size/timestamps, normalize paths to their absolute version, and join path</context>
</line>
<line>
<line_number>232, 233, 235</line_number>
<line_content>def _dataset(self) -> LanceDataset:, 
return lance.dataset(self._dataset_uri, version=self._version), 
def to_lance(self) -> LanceDataset:</line_content>
<context>
Python classes provide a way to bundle data and functionality together. Classes define attributes and methods. Instances of classes (objects) can be created and interacted with.</context>
</line>
<line>
<line_number>236, 237, 239</line_number>
<line_content>'''Return the LanceDataset backing this table.''', 
return self._dataset, 
def add(self, data: DATA, mode: str = 'append') -> int:</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>240, 242, 243</line_number>
<line_content>'''Add data to the table., 
Parameters, 
----------</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>244, 245, 247</line_number>
<line_content>data: list-of-dict, dict, pd.DataFrame, 
The data to insert into the table., 
The mode to use when writing the data. Valid values are</line_content>
<context>
The csv.DictReader and csv.DictWriter classes make it easier to read and write CSV data as dictionaries instead of lists. DictReader uses the first row of the CSV file as dictionary keys by default.</context>
</line>
<line>
<line_number>248, 253, 255</line_number>
<line_content>'append' and 'overwrite'., 
The number of vectors in the table., 
data = _sanitize_data(data, self.schema)</line_content>
<context>
PySet_Add adds an element to a set. PySet_Discard removes an element if present, not raising an error if not found. PySet_Pop removes and returns an arbitrary element. PySet_Clear empties a set of all elements.</context>
</line>
<line>
<line_number>256, 257, 258</line_number>
<line_content>lance.write_dataset(data, self._dataset_uri, mode=mode), 
self._reset_dataset(), 
return len(self)</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>260, 261, 262</line_number>
<line_content>def search(self, query: Union[VEC, str]) -> LanceQueryBuilder:, 
'''Create a search query to find the nearest neighbors, 
of the given query vector.</line_content>
<context>
Additional utility functions like index, find_lt, etc are provided for common searching tasks on sorted lists.</context>
</line>
<line>
<line_number>264, 265, 266</line_number>
<line_content>Parameters, 
----------, 
query: list, np.ndarray</line_content>
<context>
Parameters may have default values, which are evaluated from left to right when the function is defined. Parameters after * or *identifier are keyword-only and may only be passed by keyword. Parameters before / are positional-only and may only be</context>
</line>
<line>
<line_number>267, 271, 272</line_number>
<line_content>The query vector., 
LanceQueryBuilder, 
A query builder object representing the query.</line_content>
<context>
and pop, or queues with collections.deque and its append and popleft methods.</context>
</line>
<line>
<line_number>273, 274, 275</line_number>
<line_content>Once executed, the query returns selected columns, the vector,, 
and also the 'score' column which is the distance between the query, 
vector and the returned vector.</line_content>
<context>
The key function extracts a key from each element for comparison. This allows searching complex data structures.

Examples show looking up grades from scores, and inserting movies into a sorted list by release year.</context>
</line>
<line>
<line_number>277, 279, 281</line_number>
<line_content>if isinstance(query, str):, 
return LanceFtsQueryBuilder(self, query), 
if isinstance(query, list):</line_content>
<context>
False and True represent boolean false and true values. Assignments to them raise SyntaxErrors. None represents the absence of a value and is used for default arguments. NotImplemented is returned by special methods to indicate the operation is not</context>
</line>
<line>
<line_number>282, 283, 284</line_number>
<line_content>query = np.array(query), 
if isinstance(query, np.ndarray):, 
query = query.astype(np.float32)</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>286, 287, 289</line_number>
<line_content>raise TypeError(f'Unsupported query type: {type(query)}'), 
return LanceQueryBuilder(self, query), 
@classmethod</line_content>
<context>
The @dataclass decorator examines the class to find fields, which are defined as class variables with a type annotation. It adds various dunder methods to the class like __init__, __repr__, and __eq__ if they don't already exist. The order of the</context>
</line>
<line>
<line_number>290, 291, 292</line_number>
<line_content>def create(cls, db, name, data, schema=None, mode='create'):, 
tbl = LanceTable(db, name), 
data = _sanitize_data(data, schema)</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>293, 294, 297</line_number>
<line_content>lance.write_dataset(data, tbl._dataset_uri, mode=mode), 
return tbl, 
def _sanitize_schema(data: pa.Table, schema: pa.Schema = None) -> pa.Table:</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>298, 300, 301</line_number>
<line_content>'''Ensure that the table has the expected schema., 
Parameters, 
----------</line_content>
<context>
- in_table_d1 - returns True if the codepoint has bidirectional property 'R' or 'AL'</context>
</line>
<line>
<line_number>302, 303, 304</line_number>
<line_content>data: pa.Table, 
The table to sanitize., 
schema: pa.Schema; optional</line_content>
<context>
- Providing finer control over data attributes by using getter and setter functions instead of directly exposing members. This allows validating values and restricting deletions.</context>
</line>
<line>
<line_number>305, 306, 308</line_number>
<line_content>The expected schema. If not provided, this just converts the, 
vector column to fixed_size_list(float32) if necessary., 
if schema is not None:</line_content>
<context>
There are also functions to convert between numeric types, like PyNumber_Long to convert to a long integer and PyNumber_Float to convert to a float. PyNumber_Index converts to an integer and raises an exception on failure.</context>
</line>
<line>
<line_number>309, 310, 311</line_number>
<line_content>if data.schema == schema:, 
return data, 
# cast the columns to the expected types</line_content>
<context>
- in_table_d1 - returns True if the codepoint has bidirectional property 'R' or 'AL'</context>
</line>
<line>
<line_number>312, 313, 314</line_number>
<line_content>data = data.combine_chunks(), 
data = _sanitize_vector_column(data, vector_column_name=VECTOR_COLUMN_NAME), 
return pa.Table.from_arrays(</line_content>
<context>
There are functions like PyBuffer_FromContiguous and PyBuffer_ToContiguous to copy data between contiguous buffers. PyBuffer_FillContiguousStrides fills in stride values for contiguous arrays. PyBuffer_FillInfo handles requests for buffer access on</context>
</line>
<line>
<line_number>315, 317, 318</line_number>
<line_content>[data[name] for name in schema.names], schema=schema, 
# just check the vector column, 
return _sanitize_vector_column(data, vector_column_name=VECTOR_COLUMN_NAME)</line_content>
<context>
- Get and set attributes on an object, like PyObject_GetAttr and PyObject_SetAttr. These act similar to accessing attributes in Python using dot notation. 

- Delete attributes, like with PyObject_DelAttr.</context>
</line>
<line>
<line_number>321, 323, 325</line_number>
<line_content>def _sanitize_vector_column(data: pa.Table, vector_column_name: str) -> pa.Table:, 
Ensure that the vector column exists and has type fixed_size_list(float32), 
Parameters</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>326, 327, 328</line_number>
<line_content>----------, 
data: pa.Table, 
The table to sanitize.</line_content>
<context>
Rather than expose the large tables directly, it exposes their functionality through codepoint checking and mapping functions.</context>
</line>
<line>
<line_number>329, 330, 332</line_number>
<line_content>vector_column_name: str, 
The name of the vector column., 
if vector_column_name not in data.column_names:</line_content>
<context>
The Sniffer class can be used to automatically deduce the format of a CSV file. The has_header() method guesses if the first row contains headers.</context>
</line>
<line>
<line_number>333, 334, 335</line_number>
<line_content>raise ValueError(f'Missing vector column: {vector_column_name}'), 
vec_arr = data[vector_column_name].combine_chunks(), 
if pa.types.is_fixed_size_list(vec_arr.type):</line_content>
<context>
The encode_* functions raise TypeError if passed a multipart message instead of encoding the subparts individually. They extract the payload, encode it, and reset the payload to the encoded value.</context>
</line>
<line>
<line_number>336, 337, 338</line_number>
<line_content>return data, 
if not pa.types.is_list(vec_arr.type):, 
raise TypeError(f'Unsupported vector column type: {vec_arr.type}')</line_content>
<context>
flags like Py_TPFLAGS_HAVE_GC. Functions like PyType_Check and PyType_HasFeature can check properties of type objects.</context>
</line>
<line>
<line_number>339, 340, 341</line_number>
<line_content>values = vec_arr.values, 
if not pa.types.is_float32(values.type):, 
values = values.cast(pa.float32())</line_content>
<context>
There are also functions to convert between numeric types, like PyNumber_Long to convert to a long integer and PyNumber_Float to convert to a float. PyNumber_Index converts to an integer and raises an exception on failure.</context>
</line>
<line>
<line_number>342, 343, 344</line_number>
<line_content>list_size = len(values) / len(data), 
vec_arr = pa.FixedSizeListArray.from_arrays(values, list_size), 
return data.set_column(</line_content>
<context>
Arrays are useful when storing and operating on contiguous homogeneous numeric data, especially for types not natively supported by Python like int16. The enforced typechecking and compact size make them efficient compared to regular Python lists.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/lancedb/util.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from urllib.parse import ParseResult, urlparse

from pyarrow import fs


def get_uri_scheme(uri: str) -> str:
    """
    Get the scheme of a URI. If the URI does not have a scheme, assume it is a file URI.

    Parameters
    ----------
    uri : str
        The URI to parse.

    Returns
    -------
    str: The scheme of the URI.
    """
    parsed = urlparse(uri)
    scheme = parsed.scheme
    if not scheme:
        scheme = "file"
    elif scheme in ["s3a", "s3n"]:
        scheme = "s3"
    elif len(scheme) == 1:
        # Windows drive names are parsed as the scheme
        # e.g. "c:\path" -> ParseResult(scheme="c", netloc="", path="\path", ...)
        # So we add special handling here for schemes that are a single character
        scheme = "file"
    return scheme


def get_uri_location(uri: str) -> str:
    """
    Get the location of a URI. If the parameter is not a URL, assume it is just a path.

    Parameters
    ----------
    uri : str
        The URI to parse.

    Returns
    -------
    str: Location part of the URL, without scheme
    """
    parsed = urlparse(uri)
    if not parsed.netloc:
        return parsed.path
    else:
        return parsed.netloc + parsed.path
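

# Illustrative sketch only: the helper name `_split_uri` is hypothetical and
# not part of lancedb. It exposes the raw urlparse() pieces that the two
# functions above normalize, which makes it easy to see why a Windows drive
# letter shows up as a one-character scheme and why "s3a"/"s3n" need mapping.
def _split_uri(uri: str) -> tuple:
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path


if __name__ == "__main__":
    # Example inputs are assumptions chosen to hit each branch above.
    for example in ["s3a://bucket/db", "/tmp/db", "c:\\data\\db"]:
        print(example, "->", _split_uri(example))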

</file_content>
<file_context>
<line>
<line_number>13, 15, 18</line_number>
<line_content>from urllib.parse import ParseResult, urlparse, 
from pyarrow import fs, 
def get_uri_scheme(uri: str) -> str:</line_content>
<context>
The urllib.parse module provides functions for parsing and manipulating URLs. Some key functions include:

- urlparse() - Parses a URL into components like scheme, netloc, path, params, query, and fragment. This breaks the URL into pieces.</context>
</line>
<line>
<line_number>20, 22, 23</line_number>
<line_content>Get the scheme of a URI. If the URI does not have a scheme, assume it is a file URI., 
Parameters, 
----------</line_content>
<context>
get_default_scheme(), get_path(), and get_paths() functions can be used to get information on the installation schemes and paths.</context>
</line>
<line>
<line_number>25, 29, 31</line_number>
<line_content>The URI to parse., 
str: The scheme of the URI., 
parsed = urlparse(uri)</line_content>
<context>
The urllib.parse module provides functions for parsing and manipulating URLs. Some key functions include:

- urlparse() - Parses a URL into components like scheme, netloc, path, params, query, and fragment. This breaks the URL into pieces.</context>
</line>
<line>
<line_number>32, 33, 34</line_number>
<line_content>scheme = parsed.scheme, 
if not scheme:, 
scheme = 'file'</line_content>
<context>
The mimetypes module provides a guess_type function that determines the MIME type from a filename. This allows mapping file extensions to MIME types.</context>
</line>
<line>
<line_number>35, 36, 37</line_number>
<line_content>elif scheme in ['s3a', 's3n']:, 
scheme = 's3', 
elif len(scheme) == 1:</line_content>
<context>
Python 3.8 adds a number of new features and improvements. Some highlights:

- Assignment expressions (':=') allow assigning values to variables as part of a larger expression. This helps avoid calling a function multiple times.</context>
</line>
<line>
<line_number>38, 39, 40</line_number>
<line_content># Windows drive names are parsed as the scheme, 
# e.g. 'c:\path' -> ParseResult(scheme='c', netloc='', path='/path', ...), 
# So we add special handling here for schemes that are a single character</line_content>
<context>
path type (instantiating it creates either a PosixPath or WindowsPath).</context>
</line>
<line>
<line_number>41, 42, 45</line_number>
<line_content>scheme = 'file', 
return scheme, 
def get_uri_location(uri: str) -> str:</line_content>
<context>
The wsgiref.util module provides utility functions for working with WSGI environments such as guessing the scheme, constructing request URIs, shifting path info, and setting up testing defaults. It also includes the FileWrapper class for converting</context>
</line>
<line>
<line_number>47, 49, 50</line_number>
<line_content>Get the location of a URI. If the parameter is not a url, assumes it is just a path, 
Parameters, 
----------</line_content>
<context>
The urllib.parse module provides functions for parsing and manipulating URLs. Some key functions include:

- urlparse() - Parses a URL into components like scheme, netloc, path, params, query, and fragment. This breaks the URL into pieces.</context>
</line>
<line>
<line_number>52, 56, 58</line_number>
<line_content>The URI to parse., 
str: Location part of the URL, without scheme, 
parsed = urlparse(uri)</line_content>
<context>
The urllib.parse module provides functions for parsing and manipulating URLs. Some key functions include:

- urlparse() - Parses a URL into components like scheme, netloc, path, params, query, and fragment. This breaks the URL into pieces.</context>
</line>
<line>
<line_number>59, 60, 62</line_number>
<line_content>if not parsed.netloc:, 
return parsed.path, 
return parsed.netloc + parsed.path</line_content>
<context>
The netrc constructor takes an optional file path argument specifying the netrc file to parse. If no argument is given, it defaults to ~/.netrc. It raises a FileNotFoundError if the file doesn't exist or a NetrcParseError if there are syntax errors
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/setup.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import setuptools

if __name__ == "__main__":
    setuptools.setup()

</file_content>
<file_context>
<line>
<line_number>13, 15, 16</line_number>
<line_content>import setuptools, 
if __name__ == '__main__':, 
setuptools.setup()</line_content>
<context>
installer runs setup.py scripts with setuptools even if they only import distutils.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_context.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import pandas as pd
import pytest

from lancedb.context import contextualize


@pytest.fixture
def raw_df() -> pd.DataFrame:
    return pd.DataFrame(
        {
            "token": [
                "The",
                "quick",
                "brown",
                "fox",
                "jumped",
                "over",
                "the",
                "lazy",
                "dog",
                "I",
                "love",
                "sandwiches",
            ],
            "document_id": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
        }
    )


def test_contextualizer(raw_df: pd.DataFrame):
    result = (
        contextualize(raw_df)
        .window(6)
        .stride(3)
        .text_col("token")
        .groupby("document_id")
        .to_df()["token"]
        .to_list()
    )

    assert result == [
        "The quick brown fox jumped over",
        "fox jumped over the lazy dog",
        "the lazy dog",
        "I love sandwiches",
    ]


def test_contextualizer_with_threshold(raw_df: pd.DataFrame):
    result = (
        contextualize(raw_df)
        .window(6)
        .stride(3)
        .text_col("token")
        .groupby("document_id")
        .min_window_size(4)
        .to_df()["token"]
        .to_list()
    )

    assert result == [
        "The quick brown fox jumped over",
        "fox jumped over the lazy dog",
    ]
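

# Illustrative sketch only, not lancedb's implementation: a plain-Python
# sliding window that produces the same per-document lists the tests above
# expect. The helper name `_sliding_windows` and its parameters are
# hypothetical; it handles one document's tokens at a time.
def _sliding_windows(tokens, window, stride, min_window_size=1):
    """Join `window` tokens at a time, advancing `stride` tokens per step."""
    out = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start : start + window]
        # Trailing windows may be shorter than `window`; drop those that fall
        # below the minimum size.
        if len(chunk) >= min_window_size:
            out.append(" ".join(chunk))
    return out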

</file_content>
<file_context>
<line>
<line_number>13, 14, 16</line_number>
<line_content>import pandas as pd, 
import pytest, 
from lancedb.context import contextualize</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>19, 20, 21</line_number>
<line_content>@pytest.fixture, 
def raw_df() -> pd.DataFrame:, 
return pd.DataFrame(</line_content>
<context>
It is recommended to use PyObject_GetBuffer() (or the "y*" or "w*" format codes with PyArg_ParseTuple()) to get a buffer view over an object, and PyBuffer_Release() when the view can be released instead of the Old Buffer Protocol.</context>
</line>
<line>
<line_number>23, 35, 37</line_number>
<line_content>'token': [, 
'sandwiches',, 
'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],</line_content>
<context>
The tokenize module provides functions for lexical tokenizing of Python source code. The main function is tokenize(), which takes a readline callable as input and returns a generator that yields 5-tuple tokens representing the token type, string</context>
</line>
<line>
<line_number>42, 43, 44</line_number>
<line_content>def test_contextualizer(raw_df: pd.DataFrame):, 
result = (, 
contextualize(raw_df)</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>45, 46, 47</line_number>
<line_content>.window(6), 
.stride(3), 
.text_col('token')</line_content>
<context>
The tokenize module provides functions for lexical tokenizing of Python source code. The main function is tokenize(), which takes a readline callable as input and returns a generator that yields 5-tuple tokens representing the token type, string</context>
</line>
<line>
<line_number>48, 49, 50</line_number>
<line_content>.groupby('document_id'), 
.to_df()['token'], 
.to_list()</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>53, 54, 55</line_number>
<line_content>assert result == [, 
'The quick brown fox jumped over',, 
'fox jumped over the lazy dog',</line_content>
<context>
The assert statement inserts debugging assertions into a program. The pass statement is a no-op when executed. The del statement deletes bindings from namespaces. The return statement exits a function and returns a value. The yield statement yields</context>
</line>
<line>
<line_number>56, 57, 61</line_number>
<line_content>'the lazy dog',, 
'I love sandwiches',, 
def test_contextualizer_with_threshold(raw_df: pd.DataFrame):</line_content>
<context>
The contextvars module provides APIs for managing context-local state in Python. The ContextVar class declares a new Context Variable that can store values specific to the current context. ContextVars should be used by context managers instead of</context>
</line>
<line>
<line_number>62, 63, 64</line_number>
<line_content>result = (, 
contextualize(raw_df), 
.window(6)</line_content>
<context>
PyFrame_GetGlobals(), PyFrame_GetLocals(), and PyFrame_GetBuiltins() get the respective f_globals, f_locals, and f_builtins attributes of a frame. PyFrame_GetLasti() gets the f_lasti attribute.</context>
</line>
<line>
<line_number>65, 66, 67</line_number>
<line_content>.stride(3), 
.text_col('token'), 
.groupby('document_id')</line_content>
<context>
The main functions are:

- grp.getgrgid(id) - Returns group info for the given numeric id. Raises KeyError if not found.

- grp.getgrnam(name) - Returns group info for the given group name. Raises KeyError if not found.</context>
</line>
<line>
<line_number>68, 69, 70</line_number>
<line_content>.min_window_size(4), 
.to_df()['token'], 
.to_list()</line_content>
<context>
The tokenize module provides functions for lexical tokenizing of Python source code. The main function is tokenize(), which takes a readline callable as input and returns a generator that yields 5-tuple tokens representing the token type, string</context>
</line>
<line>
<line_number>73, 74, 75</line_number>
<line_content>assert result == [, 
'The quick brown fox jumped over',, 
'fox jumped over the lazy dog',</line_content>
<context>
The assert statement inserts debugging assertions into a program. The pass statement is a no-op when executed. The del statement deletes bindings from namespaces. The return statement exits a function and returns a value. The yield statement yields
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_db.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import pandas as pd
import pytest

import lancedb


def test_basic(tmp_path):
    db = lancedb.connect(tmp_path)

    assert db.uri == str(tmp_path)
    assert db.table_names() == []

    table = db.create_table(
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )
    rs = table.search([100, 100]).limit(1).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "bar"

    rs = table.search([100, 100]).where("price < 15").limit(2).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "foo"

    assert db.table_names() == ["test"]
    assert "test" in db
    assert len(db) == 1

    assert db.open_table("test").name == db["test"].name


def test_ingest_pd(tmp_path):
    db = lancedb.connect(tmp_path)

    assert db.uri == str(tmp_path)
    assert db.table_names() == []

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    table = db.create_table("test", data=data)
    rs = table.search([100, 100]).limit(1).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "bar"

    rs = table.search([100, 100]).where("price < 15").limit(2).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "foo"

    assert db.table_names() == ["test"]
    assert "test" in db
    assert len(db) == 1

    assert db.open_table("test").name == db["test"].name


def test_create_mode(tmp_path):
    db = lancedb.connect(tmp_path)
    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    db.create_table("test", data=data)

    with pytest.raises(Exception):
        db.create_table("test", data=data)

    new_data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["fizz", "buzz"],
            "price": [10.0, 20.0],
        }
    )
    tbl = db.create_table("test", data=new_data, mode="overwrite")
    assert tbl.to_pandas().item.tolist() == ["fizz", "buzz"]


def test_delete_table(tmp_path):
    db = lancedb.connect(tmp_path)
    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    db.create_table("test", data=data)

    with pytest.raises(Exception):
        db.create_table("test", data=data)

    assert db.table_names() == ["test"]

    db.drop_table("test")
    assert db.table_names() == []

    db.create_table("test", data=data)
    assert db.table_names() == ["test"]

</file_content>
<file_context>
<line>
<line_number>13, 14, 16</line_number>
<line_content>import pandas as pd, 
import pytest, 
import lancedb</line_content>
<context>
Some key steps are:

- Convert data from C to Python types with API functions 
- Call Python interface routines using the converted values
- Convert return values back from Python to C</context>
</line>
<line>
<line_number>19, 20, 22</line_number>
<line_content>def test_basic(tmp_path):, 
db = lancedb.connect(tmp_path), 
assert db.uri == str(tmp_path)</line_content>
<context>
resolving relative paths, and checking properties like whether a path is absolute, a file, or directory. The pathlib API helps avoid a lot of errors compared to using string operations to handle paths.</context>
</line>
<line>
<line_number>23, 25, 28</line_number>
<line_content>assert db.table_names() == [], 
table = db.create_table(, 
{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},</line_content>
<context>
Converters allow custom handling of SQLite types when converting to Python types. They are registered with register_converter(). Column names or declared types are used for looking up converters.</context>
</line>
<line>
<line_number>29, 32, 33</line_number>
<line_content>{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},, 
rs = table.search([100, 100]).limit(1).to_df(), 
assert len(rs) == 1</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>34, 36, 37</line_number>
<line_content>assert rs['item'].iloc[0] == 'bar', 
rs = table.search([100, 100]).where('price < 15').limit(2).to_df(), 
assert len(rs) == 1</line_content>
<context>
- PEP 308 added conditional expressions to Python using a new syntax: x = true_value if condition else false_value. This provides a concise way to write conditional logic.</context>
</line>
<line>
<line_number>38, 40, 41</line_number>
<line_content>assert rs['item'].iloc[0] == 'foo', 
assert db.table_names() == ['test'], 
assert 'test' in db</line_content>
<context>
ANY can be passed in assert calls to ignore arguments. FILTER_DIR filters mock dir() output. mock_open() helps mock open() and provides configurable read data.

Sealing a mock disables automatic mock creation on attribute access.</context>
</line>
<line>
<line_number>42, 44, 47</line_number>
<line_content>assert len(db) == 1, 
assert db.open_table('test').name == db['test'].name, 
def test_ingest_pd(tmp_path):</line_content>
<context>
The test module contains utilities for writing tests for Python code. test.support provides classes and functions to assist with testing, such as assert methods, mock objects, and tools to manage warnings and the environment. test.regrtest drives the</context>
</line>
<line>
<line_number>48, 50, 51</line_number>
<line_content>db = lancedb.connect(tmp_path), 
assert db.uri == str(tmp_path), 
assert db.table_names() == []</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>53, 55, 56</line_number>
<line_content>data = pd.DataFrame(, 
'vector': [[3.1, 4.1], [5.9, 26.5]],, 
'item': ['foo', 'bar'],</line_content>
<context>
The python list data type has a built-in sort() method that sorts the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.</context>
</line>
<line>
<line_number>57, 60, 61</line_number>
<line_content>'price': [10.0, 20.0],, 
table = db.create_table('test', data=data), 
rs = table.search([100, 100]).limit(1).to_df()</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>62, 63, 65</line_number>
<line_content>assert len(rs) == 1, 
assert rs['item'].iloc[0] == 'bar', 
rs = table.search([100, 100]).where('price < 15').limit(2).to_df()</line_content>
<context>
- PEP 308 added conditional expressions to Python using a new syntax: x = true_value if condition else false_value. This provides a concise way to write conditional logic.</context>
</line>
<line>
<line_number>66, 67, 69</line_number>
<line_content>assert len(rs) == 1, 
assert rs['item'].iloc[0] == 'foo', 
assert db.table_names() == ['test']</line_content>
<context>
if identifiers are valid keyword names in Python code.</context>
</line>
<line>
<line_number>70, 71, 73</line_number>
<line_content>assert 'test' in db, 
assert len(db) == 1, 
assert db.open_table('test').name == db['test'].name</line_content>
<context>
The test module contains utilities for writing tests for Python code. test.support provides classes and functions to assist with testing, such as assert methods, mock objects, and tools to manage warnings and the environment. test.regrtest drives the</context>
</line>
<line>
<line_number>76, 77, 78</line_number>
<line_content>def test_create_mode(tmp_path):, 
db = lancedb.connect(tmp_path), 
data = pd.DataFrame(</line_content>
<context>
The unittest module provides a rich set of tools for constructing and running tests in Python. It supports test automation, sharing of setup and shutdown code, aggregation of tests into collections, and independence of tests from the reporting</context>
</line>
<line>
<line_number>80, 81, 82</line_number>
<line_content>'vector': [[3.1, 4.1], [5.9, 26.5]],, 
'item': ['foo', 'bar'],, 
'price': [10.0, 20.0],</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>85, 87, 88</line_number>
<line_content>db.create_table('test', data=data), 
with pytest.raises(Exception):, 
db.create_table('test', data=data)</line_content>
<context>
The unittest module provides a rich set of tools for constructing and running tests in Python. It supports test automation, sharing of setup and shutdown code, aggregation of tests into collections, and independence of tests from the reporting</context>
</line>
<line>
<line_number>90, 92, 93</line_number>
<line_content>new_data = pd.DataFrame(, 
'vector': [[3.1, 4.1], [5.9, 26.5]],, 
'item': ['fizz', 'buzz'],</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>94, 97, 98</line_number>
<line_content>'price': [10.0, 20.0],, 
tbl = db.create_table('test', data=new_data, mode='overwrite'), 
assert tbl.to_pandas().item.tolist() == ['fizz', 'buzz']</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>101, 102, 103</line_number>
<line_content>def test_delete_table(tmp_path):, 
db = lancedb.connect(tmp_path), 
data = pd.DataFrame(</line_content>
<context>
Methods like Path.exists(), Path.is_dir(), Path.is_file(), Path.open() allow querying properties of a filesystem path and interacting with the filesystem. Path.rmdir(), Path.unlink(), Path.rename(), and Path.replace() perform system calls to remove,</context>
</line>
<line>
<line_number>105, 106, 107</line_number>
<line_content>'vector': [[3.1, 4.1], [5.9, 26.5]],, 
'item': ['foo', 'bar'],, 
'price': [10.0, 20.0],</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>110, 112, 113</line_number>
<line_content>db.create_table('test', data=data), 
with pytest.raises(Exception):, 
db.create_table('test', data=data)</line_content>
<context>
The unittest module provides a rich set of tools for constructing and running tests in Python. It supports test automation, sharing of setup and shutdown code, aggregation of tests into collections, and independence of tests from the reporting</context>
</line>
<line>
<line_number>115, 117, 118</line_number>
<line_content>assert db.table_names() == ['test'], 
db.drop_table('test'), 
assert db.table_names() == []</line_content>
<context>
ANY can be passed in assert calls to ignore arguments. FILTER_DIR filters mock dir() output. mock_open() helps mock open() and provides configurable read data.

Sealing a mock disables automatic mock creation on attribute access.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_embeddings.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
import sys

import numpy as np
import pyarrow as pa
from lancedb.embeddings import with_embeddings


def mock_embed_func(input_data):
    return [np.random.randn(128).tolist() for _ in range(len(input_data))]


def test_with_embeddings():
    for wrap_api in [True, False]:
        if wrap_api and sys.version_info >= (3, 11):
            # ratelimiter package doesn't work on 3.11
            continue
        data = pa.Table.from_arrays(
            [
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            names=["text", "price"],
        )
        data = with_embeddings(mock_embed_func, data, wrap_api=wrap_api)
        assert data.num_columns == 3
        assert data.num_rows == 2
        assert data.column_names == ["text", "price", "vector"]
        assert data.column("text").to_pylist() == ["foo", "bar"]
        assert data.column("price").to_pylist() == [10.0, 20.0]

</file_content>
<file_context>
<line>
<line_number>12, 14, 15</line_number>
<line_content>import sys, 
import numpy as np, 
import pyarrow as pa</line_content>
<context>
PyImport_ImportModuleEx() imports a module by name, with additional globals, locals, and fromlist arguments similar to Python's __import__() function. It returns the imported module or NULL if there was an error.</context>
</line>
<line>
<line_number>16, 19, 20</line_number>
<line_content>from lancedb.embeddings import with_embeddings, 
def mock_embed_func(input_data):, 
return [np.random.randn(128).tolist() for _ in range(len(input_data))]</line_content>
<context>
- Spec and autospec arguments ensure mocks behave like the real objects they are replacing.

- Mocking the import system with patch.dict can allow mocking modules that are imported locally within functions.</context>
</line>
<line>
<line_number>23, 24, 25</line_number>
<line_content>def test_with_embeddings():, 
for wrap_api in [True, False]:, 
if wrap_api and sys.version_info.minor >= 11:</line_content>
<context>
wraps calls update_wrapper as a convenience decorator factory. It ensures wrapper functions have names, docstrings etc. reflecting the wrapped function.</context>
</line>
<line>
<line_number>26, 28, 30</line_number>
<line_content># ratelimiter package doesn't work on 3.11, 
data = pa.Table.from_arrays(, 
pa.array(['foo', 'bar']),</line_content>
<context>
To create a for loop using an iterator, you first get the iterator object from your iterable with PyObject_GetIter. Then you call PyIter_Next in a loop to retrieve values until it returns NULL. Make sure to decrement the reference count on each</context>
</line>
<line>
<line_number>31, 33, 35</line_number>
<line_content>pa.array([10.0, 20.0]),, 
names=['text', 'price'],, 
data = with_embeddings(mock_embed_func, data, wrap_api=wrap_api)</line_content>
<context>
Using the embedding API an application can also extend Python by exposing functions and data from the application itself to Python code. This allows Python code to call back into the application.</context>
</line>
<line>
<line_number>36, 37, 38</line_number>
<line_content>assert data.num_columns == 3, 
assert data.num_rows == 2, 
assert data.column_names == ['text', 'price', 'vector']</line_content>
<context>
The Row class represents a result row. It allows accessing columns by index or case-insensitive name. Row provides a memory efficient alternative to tuples.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_fts.py</file_path>
<file_content>
# Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
import os
import random

import lancedb.fts
import numpy as np
import pandas as pd
import pytest
import tantivy

import lancedb as ldb


@pytest.fixture
def table(tmp_path) -> ldb.table.LanceTable:
    db = ldb.connect(tmp_path)
    vectors = [np.random.randn(128) for _ in range(100)]

    nouns = ("puppy", "car", "rabbit", "girl", "monkey")
    verbs = ("runs", "hits", "jumps", "drives", "barfs")
    adv = ("crazily.", "dutifully.", "foolishly.", "merrily.", "occasionally.")
    adj = ("adorable", "clueless", "dirty", "odd", "stupid")
    text = [
        " ".join(
            [
                nouns[random.randrange(0, 5)],
                verbs[random.randrange(0, 5)],
                adv[random.randrange(0, 5)],
                adj[random.randrange(0, 5)],
            ]
        )
        for _ in range(100)
    ]
    table = db.create_table(
        "test", data=pd.DataFrame({"vector": vectors, "text": text, "text2": text})
    )
    return table


def test_create_index(tmp_path):
    index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
    assert isinstance(index, tantivy.Index)
    assert os.path.exists(str(tmp_path / "index"))


def test_populate_index(tmp_path, table):
    index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
    assert ldb.fts.populate_index(index, table, ["text"]) == len(table)


def test_search_index(tmp_path, table):
    index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
    ldb.fts.populate_index(index, table, ["text"])
    index.reload()
    results = ldb.fts.search_index(index, query="puppy", limit=10)
    assert len(results) == 2
    assert len(results[0]) == 10  # row_ids
    assert len(results[1]) == 10  # scores


def test_create_index_from_table(tmp_path, table):
    table.create_fts_index("text")
    df = table.search("puppy").limit(10).select(["text"]).to_df()
    assert len(df) == 10
    assert "text" in df.columns


def test_create_index_multiple_columns(tmp_path, table):
    table.create_fts_index(["text", "text2"])
    df = table.search("puppy").limit(10).to_df()
    assert len(df) == 10
    assert "text" in df.columns
    assert "text2" in df.columns


def test_empty_rs(tmp_path, table, mocker):
    table.create_fts_index(["text", "text2"])
    mocker.patch("lancedb.fts.search_index", return_value=([], []))
    df = table.search("puppy").limit(10).to_df()
    assert len(df) == 0

</file_content>
<file_context>
<line>
<line_number>13, 15, 16</line_number>
<line_content>import random, 
import lancedb.fts, 
import numpy as np</line_content>
<context>
The random module in Python implements functions for generating pseudo-random numbers for various distributions. Some key things the module provides:</context>
</line>
<line>
<line_number>17, 18, 19</line_number>
<line_content>import pandas as pd, 
import pytest, 
import tantivy</line_content>
<context>
pydoc can document modules, packages, functions, classes, methods, or a path to a .py file. Importing the module runs any code at the module level, so use if __name__ == '__main__' guards.</context>
</line>
<line>
<line_number>21, 24, 25</line_number>
<line_content>import lancedb as ldb, 
@pytest.fixture, 
def table(tmp_path) -> ldb.table.LanceTable:</line_content>
<context>
the PyDictObject structure and PyDict_Type. The examples demonstrate proper usage patterns.</context>
</line>
<line>
<line_number>26, 27, 29</line_number>
<line_content>db = ldb.connect(tmp_path), 
vectors = [np.random.randn(128) for _ in range(100)], 
nouns = ('puppy', 'car', 'rabbit', 'girl', 'monkey')</line_content>
<context>
the __name__ and __qualname__ attributes from the passed in name and qualname arguments. The PyCoro_New function steals a reference to the frame object passed in.</context>
</line>
<line>
<line_number>30, 31, 32</line_number>
<line_content>verbs = ('runs', 'hits', 'jumps', 'drives', 'barfs'), 
adv = ('crazily.', 'dutifully.', 'foolishly.', 'merrily.', 'occasionally.'), 
adj = ('adorable', 'clueless', 'dirty', 'odd', 'stupid')</line_content>
<context>
like capwords which capitalizes words in a string.</context>
</line>
<line>
<line_number>36, 37, 38</line_number>
<line_content>nouns[random.randrange(0, 5)],, 
verbs[random.randrange(0, 5)],, 
adv[random.randrange(0, 5)],</line_content>
<context>
a range, normalvariate() to sample from a normal distribution, choice() to pick a random element from a sequence, and shuffle() to shuffle a list randomly in-place.</context>
</line>
<line>
<line_number>39, 42, 44</line_number>
<line_content>adj[random.randrange(0, 5)],, 
for _ in range(100), 
table = db.create_table(</line_content>
<context>
The documentation describes a function for generating random numbers in Python. The random module implements a pseudorandom number generator that can generate random floats between 0 and 1, integers within a specified range, sample from normal</context>
</line>
<line>
<line_number>45, 47, 50</line_number>
<line_content>'test', data=pd.DataFrame({'vector': vectors, 'text': text, 'text2': text}), 
return table, 
def test_create_index(tmp_path):</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>51, 52, 53</line_number>
<line_content>index = ldb.fts.create_index(str(tmp_path / 'index'), ['text']), 
assert isinstance(index, tantivy.Index), 
assert os.path.exists(str(tmp_path / 'index'))</line_content>
<context>
resolving relative paths, and checking properties like whether a path is absolute, a file, or directory. The pathlib API helps avoid a lot of errors compared to using string operations to handle paths.</context>
</line>
<line>
<line_number>56, 57, 58</line_number>
<line_content>def test_populate_index(tmp_path, table):, 
index = ldb.fts.create_index(str(tmp_path / 'index'), ['text']), 
assert ldb.fts.populate_index(index, table, ['text']) == len(table)</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>61, 62, 63</line_number>
<line_content>def test_search_index(tmp_path, table):, 
index = ldb.fts.create_index(str(tmp_path / 'index'), ['text']), 
ldb.fts.populate_index(index, table, ['text'])</line_content>
<context>
The PYTHONPATH environment variable can add more directories to the search path.</context>
</line>
<line>
<line_number>64, 65, 66</line_number>
<line_content>index.reload(), 
results = ldb.fts.search_index(index, query='puppy', limit=10), 
assert len(results) == 2</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>67, 68, 71</line_number>
<line_content>assert len(results[0]) == 10  # row_ids, 
assert len(results[1]) == 10  # scores, 
def test_create_index_from_table(tmp_path, table):</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>72, 73, 74</line_number>
<line_content>table.create_fts_index('text'), 
df = table.search('puppy').limit(10).select(['text']).to_df(), 
assert len(df) == 10</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>75, 78, 79</line_number>
<line_content>assert 'text' in df.columns, 
def test_create_index_multiple_columns(tmp_path, table):, 
table.create_fts_index(['text', 'text2'])</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>80, 81, 82</line_number>
<line_content>df = table.search('puppy').limit(10).to_df(), 
assert len(df) == 10, 
assert 'text' in df.columns</line_content>
<context>
The len() function returns the length of an object. For strings, this is the number of characters. For lists, tuples, dicts, and sets, it is the number of items.</context>
</line>
<line>
<line_number>83, 86, 87</line_number>
<line_content>assert 'text2' in df.columns, 
def test_empty_rs(tmp_path, table, mocker):, 
table.create_fts_index(['text', 'text2'])</line_content>
<context>
ANY can be passed in assert calls to ignore arguments. FILTER_DIR filters mock dir() output. mock_open() helps mock open() and provides configurable read data.

Sealing a mock disables automatic mock creation on attribute access.</context>
</line>
<line>
<line_number>88, 89, 90</line_number>
<line_content>mocker.patch('lancedb.fts.search_index', return_value=([], [])), 
df = table.search('puppy').limit(10).to_df(), 
assert len(df) == 0</line_content>
<context>
over the sequence until an IndexError is raised.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_io.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import os
import pytest

import lancedb

# You need to set up AWS credentials and a base path to run this test. Example:
#    AWS_PROFILE=default TEST_S3_BASE_URL=s3://my_bucket/dataset pytest tests/test_io.py


@pytest.mark.skipif(
    (os.environ.get("TEST_S3_BASE_URL") is None),
    reason="please setup s3 base url",
)
def test_s3_io():
    db = lancedb.connect(os.environ.get("TEST_S3_BASE_URL"))
    assert db.table_names() == []

    table = db.create_table(
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )
    rs = table.search([100, 100]).limit(1).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "bar"

    rs = table.search([100, 100]).where("price < 15").limit(2).to_df()
    assert len(rs) == 1
    assert rs["item"].iloc[0] == "foo"

    assert db.table_names() == ["test"]
    assert "test" in db
    assert len(db) == 1

    assert db.open_table("test").name == db["test"].name

</file_content>
<file_context>
<line>
<line_number>14, 16, 22</line_number>
<line_content>import pytest, 
import lancedb, 
@pytest.mark.skipif(</line_content>
<context>
PyImport_ImportModuleNoBlock() is a deprecated alias for PyImport_ImportModule().</context>
</line>
<line>
<line_number>23, 24, 26</line_number>
<line_content>(os.environ.get('TEST_S3_BASE_URL') is None),, 
reason='please setup s3 base url',, 
def test_s3_io():</line_content>
<context>
path. The search path can be customized by setting the PYTHONHOME or PYTHONPATH environment variables before calling Py_Initialize().</context>
</line>
<line>
<line_number>27, 28, 30</line_number>
<line_content>db = lancedb.connect(os.environ.get('TEST_S3_BASE_URL')), 
assert db.table_names() == [], 
table = db.create_table(</line_content>
<context>
The sqlite3 module provides a DB-API 2.0 compliant interface for working with SQLite databases in Python. It allows executing SQL statements and fetching results. Key components include the Connection, Cursor, and Row classes.</context>
</line>
<line>
<line_number>33, 34, 37</line_number>
<line_content>{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},, 
{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},, 
rs = table.search([100, 100]).limit(1).to_df()</line_content>
<context>
over the sequence until an IndexError is raised.</context>
</line>
<line>
<line_number>38, 39, 41</line_number>
<line_content>assert len(rs) == 1, 
assert rs['item'].iloc[0] == 'bar', 
rs = table.search([100, 100]).where('price < 15').limit(2).to_df()</line_content>
<context>
- PEP 308 added conditional expressions to Python using a new syntax: x = true_value if condition else false_value. This provides a concise way to write conditional logic.</context>
</line>
<line>
<line_number>42, 43, 45</line_number>
<line_content>assert len(rs) == 1, 
assert rs['item'].iloc[0] == 'foo', 
assert db.table_names() == ['test']</line_content>
<context>
if identifiers are valid keyword names in Python code.</context>
</line>
<line>
<line_number>46, 47, 49</line_number>
<line_content>assert 'test' in db, 
assert len(db) == 1, 
assert db.open_table('test').name == db['test'].name</line_content>
<context>
The test module contains utilities for writing tests for Python code. test.support provides classes and functions to assist with testing, such as assert methods, mock objects, and tools to manage warnings and the environment. test.regrtest drives the
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_query.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import lance
import numpy as np
import pandas as pd
import pandas.testing as tm
import pyarrow as pa
import pytest
from lancedb.query import LanceQueryBuilder


class MockTable:
    def __init__(self, tmp_path):
        self.uri = tmp_path

    def to_lance(self):
        return lance.dataset(self.uri)


@pytest.fixture
def table(tmp_path) -> MockTable:
    df = pa.table(
        {
            "vector": pa.array(
                [[1, 2], [3, 4]], type=pa.list_(pa.float32(), list_size=2)
            ),
            "id": pa.array([1, 2]),
            "str_field": pa.array(["a", "b"]),
            "float_field": pa.array([1.0, 2.0]),
        }
    )
    lance.write_dataset(df, tmp_path)
    return MockTable(tmp_path)


def test_query_builder(table):
    df = LanceQueryBuilder(table, [0, 0]).limit(1).select(["id"]).to_df()
    assert df["id"].values[0] == 1
    assert all(df["vector"].values[0] == [1, 2])


def test_query_builder_with_filter(table):
    df = LanceQueryBuilder(table, [0, 0]).where("id = 2").to_df()
    assert df["id"].values[0] == 2
    assert all(df["vector"].values[0] == [3, 4])


def test_query_builder_with_metric(table):
    query = [4, 8]
    df_default = LanceQueryBuilder(table, query).to_df()
    df_l2 = LanceQueryBuilder(table, query).metric("L2").to_df()
    tm.assert_frame_equal(df_default, df_l2)

    df_cosine = LanceQueryBuilder(table, query).metric("cosine").limit(1).to_df()
    assert df_cosine.score[0] == pytest.approx(
        cosine_distance(query, df_cosine.vector[0]),
        abs=1e-6,
    )
    assert 0 <= df_cosine.score[0] <= 1


def cosine_distance(vec1, vec2):
    """Return the cosine distance (1 - cosine similarity) between two vectors."""
    return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

</file_content>
<file_context>
<line>
<line_number>13, 14, 15</line_number>
<line_content>import lance, 
import numpy as np, 
import pandas as pd</line_content>
<context>
Some key steps are:

- Convert data from C to Python types with API functions 
- Call Python interface routines using the converted values
- Convert return values back from Python to C</context>
</line>
<line>
<line_number>16, 17, 18</line_number>
<line_content>import pandas.testing as tm, 
import pyarrow as pa, 
import pytest</line_content>
<context>
- Once ported, update setup.py to indicate Python 3 support and use CI/testing tools like pytest and tox to prevent regressions and ensure both Python 2 and 3 stay supported.</context>
</line>
<line>
<line_number>19, 22, 23</line_number>
<line_content>from lancedb.query import LanceQueryBuilder, 
class MockTable:, 
def __init__(self, tmp_path):</line_content>
<context>
create_autospec() creates a mock with a spec from another object. Keyword arguments to patch() etc. are passed to the mock constructor.</context>
</line>
<line>
<line_number>24, 26, 27</line_number>
<line_content>self.uri = tmp_path, 
def to_lance(self):, 
return lance.dataset(self.uri)</line_content>
<context>
The urllib.request module provides functions and classes for fetching URLs and making HTTP requests in python. Some key points:</context>
</line>
<line>
<line_number>30, 31, 32</line_number>
<line_content>@pytest.fixture, 
def table(tmp_path) -> MockTable:, 
df = pa.table(</line_content>
<context>
The unittest.mock module provides useful tools for mocking objects in python tests, including the Mock class, patch decorators, magic method support, and helpers like sentinel and call.</context>
</line>
<line>
<line_number>34, 35, 37</line_number>
<line_content>'vector': pa.array(, 
[[1, 2], [3, 4]], type=pa.list_(pa.float32(), list_size=2), 
'id': pa.array([1, 2]),</line_content>
<context>
Arrays are useful when storing and operating on contiguous homogeneous numeric data, especially for types not natively supported by Python like int16. The enforced typechecking and compact size make them efficient compared to regular Python lists.</context>
</line>
<line>
<line_number>38, 39, 42</line_number>
<line_content>'str_field': pa.array(['a', 'b']),, 
'float_field': pa.array([1.0, 2.0]),, 
lance.write_dataset(df, tmp_path)</line_content>
<context>
The struct module in Python provides functions to convert between Python values and C structs represented as Python bytes objects. It allows packing and unpacking different data types according to format strings that describe the data layout.</context>
</line>
<line>
<line_number>43, 46, 47</line_number>
<line_content>return MockTable(tmp_path), 
def test_query_builder(table):, 
df = LanceQueryBuilder(table, [0, 0]).limit(1).select(['id']).to_df()</line_content>
<context>
The unittest.mock module provides useful tools for mocking objects in python tests, including the Mock class, patch decorators, magic method support, and helpers like sentinel and call.</context>
</line>
<line>
<line_number>48, 49, 52</line_number>
<line_content>assert df['id'].values[0] == 1, 
assert all(df['vector'].values[0] == [1, 2]), 
def test_query_builder_with_filter(table):</line_content>
<context>
ANY can be passed in assert calls to ignore arguments. FILTER_DIR filters mock dir() output. mock_open() helps mock open() and provides configurable read data.

Sealing a mock disables automatic mock creation on attribute access.</context>
</line>
<line>
<line_number>53, 54, 55</line_number>
<line_content>df = LanceQueryBuilder(table, [0, 0]).where('id = 2').to_df(), 
assert df['id'].values[0] == 2, 
assert all(df['vector'].values[0] == [3, 4])</line_content>
<context>
Python lacks a "switch" or "case" statement because it can be emulated through a sequence of "if", "elif", and "else". For large numbers of cases, a dictionary mapping cases to functions provides similar capability. The built in "getattr()" function</context>
</line>
<line>
<line_number>58, 59, 60</line_number>
<line_content>def test_query_builder_with_metric(table):, 
query = [4, 8], 
df_default = LanceQueryBuilder(table, query).to_df()</line_content>
<context>
The fields can have default values set normally in Python. Default values can also be set by the field() function which allows for things like mutable default values using default_factory. The field() function has parameters that mirror those in the</context>
</line>
<line>
<line_number>61, 62, 64</line_number>
<line_content>df_l2 = LanceQueryBuilder(table, query).metric('L2').to_df(), 
tm.assert_frame_equal(df_default, df_l2), 
df_cosine = LanceQueryBuilder(table, query).metric('cosine').limit(1).to_df()</line_content>
<context>
PyBool_FromLong(1) would return Py_True since 1 evaluates to True. PyBool_FromLong(0) would return Py_False.</context>
</line>
<line>
<line_number>65, 66, 69</line_number>
<line_content>assert df_cosine.score[0] == pytest.approx(, 
cosine_distance(query, df_cosine.vector[0]),, 
assert 0 <= df_cosine.score[0] <= 1</line_content>
<context>
yields values rather than calling PyGen_New() or PyGen_NewWithQualName() directly.
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_table.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from pathlib import Path

import pandas as pd
import pyarrow as pa
import pytest
from lancedb.table import LanceTable


class MockDB:
    def __init__(self, uri: Path):
        self.uri = uri


@pytest.fixture
def db(tmp_path) -> MockDB:
    return MockDB(tmp_path)


def test_basic(db):
    ds = LanceTable.create(
        db,
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    ).to_lance()

    table = LanceTable(db, "test")
    assert table.name == "test"
    assert table.schema == ds.schema
    assert table.to_lance().to_table() == ds.to_table()


def test_create_table(db):
    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32(), 2)),
            pa.field("item", pa.string()),
            pa.field("price", pa.float32()),
        ]
    )
    expected = pa.Table.from_arrays(
        [
            pa.FixedSizeListArray.from_arrays(pa.array([3.1, 4.1, 5.9, 26.5]), 2),
            pa.array(["foo", "bar"]),
            pa.array([10.0, 20.0]),
        ],
        schema=schema,
    )
    data = [
        [
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ]
    ]
    df = pd.DataFrame(data[0])
    data.append(df)
    data.append(pa.Table.from_pandas(df, schema=schema))

    for i, d in enumerate(data):
        tbl = (
            LanceTable.create(db, f"test_{i}", data=d, schema=schema)
            .to_lance()
            .to_table()
        )
        assert expected == tbl


def test_add(db):
    table = LanceTable.create(
        db,
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )

    # table = LanceTable(db, "test")
    assert len(table) == 2

    count = table.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])
    assert count == 3

    expected = pa.Table.from_arrays(
        [
            pa.FixedSizeListArray.from_arrays(
                pa.array([3.1, 4.1, 5.9, 26.5, 6.3, 100.5]), 2
            ),
            pa.array(["foo", "bar", "new"]),
            pa.array([10.0, 20.0, 30.0]),
        ],
        schema=pa.schema(
            [
                pa.field("vector", pa.list_(pa.float32(), 2)),
                pa.field("item", pa.string()),
                pa.field("price", pa.float64()),
            ]
        ),
    )
    assert expected == table.to_arrow()


def test_versioning(db):
    table = LanceTable.create(
        db,
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )

    assert len(table.list_versions()) == 1
    assert table.version == 1

    table.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])
    assert len(table.list_versions()) == 2
    assert table.version == 2
    assert len(table) == 3

    table.checkout(1)
    assert table.version == 1
    assert len(table) == 2

</file_content>
<file_context>
<line>
<line_number>13, 15, 16</line_number>
<line_content>from pathlib import Path, 
import pandas as pd, 
import pyarrow as pa</line_content>
<context>
pydoc can document modules, packages, functions, classes, methods, or a path to a .py file. Importing the module runs any code at the module level, so use if __name__ == '__main__' guards.</context>
</line>
<line>
<line_number>17, 18, 21</line_number>
<line_content>import pytest, 
from lancedb.table import LanceTable, 
class MockDB:</line_content>
<context>
The unittest.mock module provides useful tools for mocking objects in python tests, including the Mock class, patch decorators, magic method support, and helpers like sentinel and call.</context>
</line>
<line>
<line_number>22, 23, 26</line_number>
<line_content>def __init__(self, uri: Path):, 
self.uri = uri, 
@pytest.fixture</line_content>
<context>
path. The search path can be customized by setting the PYTHONHOME or PYTHONPATH environment variables before calling Py_Initialize().</context>
</line>
<line>
<line_number>27, 28, 31</line_number>
<line_content>def db(tmp_path) -> MockDB:, 
return MockDB(tmp_path), 
def test_basic(db):</line_content>
<context>
The unittest.mock module provides useful tools for mocking objects in python tests, including the Mock class, patch decorators, magic method support, and helpers like sentinel and call.</context>
</line>
<line>
<line_number>32, 36, 37</line_number>
<line_content>ds = LanceTable.create(, 
{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},, 
{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},</line_content>
<context>
Additional functions can copy a dict, get items, keys, and values as lists, set defaults, and merge dictionaries. The documentation provides examples for properly iterating through keys and values and modifying values during iteration.</context>
</line>
<line>
<line_number>39, 41, 42</line_number>
<line_content>).to_lance(), 
table = LanceTable(db, 'test'), 
assert table.name == 'test'</line_content>
<context>
- in_table_d1 - returns True if the codepoint has bidirectional property 'R' or 'AL'</context>
</line>
<line>
<line_number>43, 44, 47</line_number>
<line_content>assert table.schema == ds.schema, 
assert table.to_lance().to_table() == ds.to_table(), 
def test_create_table(db):</line_content>
<context>
- Mock objects record their calls in the mock_calls attribute, which is useful for making additional assertions about the sequencing and details of calls.</context>
</line>
<line>
<line_number>48, 50, 51</line_number>
<line_content>schema = pa.schema(, 
pa.field('vector', pa.list_(pa.float32(), 2)),, 
pa.field('item', pa.string()),</line_content>
<context>
- Get and set attributes on an object, like PyObject_GetAttr and PyObject_SetAttr. These act similar to accessing attributes in Python using dot notation. 

- Delete attributes, like with PyObject_DelAttr.</context>
</line>
<line>
<line_number>52, 55, 57</line_number>
<line_content>pa.field('price', pa.float32()),, 
expected = pa.Table.from_arrays(, 
pa.FixedSizeListArray.from_arrays(pa.array([3.1, 4.1, 5.9, 26.5]), 2),</line_content>
<context>
When you perform operations with floats in Python, small errors can accumulate due to the inexact internal representation. For example, summing 0.1 three times may not yield exactly 0.3. This happens because 0.1 cannot be represented exactly in</context>
</line>
<line>
<line_number>58, 59, 61</line_number>
<line_content>pa.array(['foo', 'bar']),, 
pa.array([10.0, 20.0]),, 
schema=schema,</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>65, 66, 69</line_number>
<line_content>{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},, 
{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},, 
df = pd.DataFrame(data[0])</line_content>
<context>
The python list data type has a built-in sort() method that sorts the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.</context>
</line>
<line>
<line_number>70, 71, 73</line_number>
<line_content>data.append(df), 
data.append(pa.Table.from_pandas(df, schema=schema)), 
for i, d in enumerate(data):</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>75, 76, 77</line_number>
<line_content>LanceTable.create(db, f'test_{i}', data=d, schema=schema), 
.to_lance(), 
.to_table()</line_content>
<context>
- Creating records in tables with CreateRecord.

- Initializing a new database with init_database.

- Adding data to tables with add_data. 

- Adding tables to a database with add_tables.

- Creating views on database tables with OpenView.</context>
</line>
<line>
<line_number>79, 82, 83</line_number>
<line_content>assert expected == tbl, 
def test_add(db):, 
table = LanceTable.create(</line_content>
<context>
Autospeccing limits the api of mocks to the original object's api. It happens recursively so attributes of mocks only have apis of the original object's attributes. Mocked functions have the same signature as the originals. create_autospec() creates</context>
</line>
<line>
<line_number>87, 88, 92</line_number>
<line_content>{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},, 
{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},, 
# table = LanceTable(db, 'test')</line_content>
<context>
floats, booleans, datetimes, arrays, and tables into native Python datatypes.</context>
</line>
<line>
<line_number>93, 95, 96</line_number>
<line_content>assert len(table) == 2, 
count = table.add([{'vector': [6.3, 100.5], 'item': 'new', 'price': 30.0}]), 
assert count == 3</line_content>
<context>
The enumerate() function takes an iterable and returns an enumerate object that produces tuples containing indices and values.</context>
</line>
<line>
<line_number>98, 100, 101</line_number>
<line_content>expected = pa.Table.from_arrays(, 
pa.FixedSizeListArray.from_arrays(, 
pa.array([3.1, 4.1, 5.9, 26.5, 6.3, 100.5]), 2</line_content>
<context>
When you perform operations with floats in Python, small errors can accumulate due to the inexact internal representation. For example, summing 0.1 three times may not yield exactly 0.3. This happens because 0.1 cannot be represented exactly in</context>
</line>
<line>
<line_number>103, 104, 106</line_number>
<line_content>pa.array(['foo', 'bar', 'new']),, 
pa.array([10.0, 20.0, 30.0]),, 
schema=pa.schema(</line_content>
<context>
TOML datetimes to Python datetime, TOML arrays to Python list, and TOML tables to Python dict.</context>
</line>
<line>
<line_number>108, 109, 110</line_number>
<line_content>pa.field('vector', pa.list_(pa.float32(), 2)),, 
pa.field('item', pa.string()),, 
pa.field('price', pa.float64()),</line_content>
<context>
PyFloat_GetInfo returns a structseq with info on float precision, max, and min. PyFloat_GetMax returns the max float DBL_MAX. PyFloat_GetMin returns the min float DBL_MIN.</context>
</line>
<line>
<line_number>114, 117, 118</line_number>
<line_content>assert expected == table.to_arrow(), 
def test_versioning(db):, 
table = LanceTable.create(</line_content>
<context>
- Mock objects record their calls in the mock_calls attribute, which is useful for making additional assertions about the sequencing and details of calls.</context>
</line>
<line>
<line_number>122, 123, 127</line_number>
<line_content>{'vector': [3.1, 4.1], 'item': 'foo', 'price': 10.0},, 
{'vector': [5.9, 26.5], 'item': 'bar', 'price': 20.0},, 
assert len(table.list_versions()) == 1</line_content>
<context>
- Lists are defined with [] and can contain mixed type elements. Lists support indexing, slicing, adding/removing elements, concatenation with +, and length with the len() function.</context>
</line>
<line>
<line_number>128, 130, 131</line_number>
<line_content>assert table.version == 1, 
table.add([{'vector': [6.3, 100.5], 'item': 'new', 'price': 30.0}]), 
assert len(table.list_versions()) == 2</line_content>
<context>
- Lists are defined with [] and can contain mixed type elements. Lists support indexing, slicing, adding/removing elements, concatenation with +, and length with the len() function.</context>
</line>
<line>
<line_number>132, 133, 135</line_number>
<line_content>assert table.version == 2, 
assert len(table) == 3, 
table.checkout(1)</line_content>
<context>
- in_table_d1 - returns True if the codepoint has bidirectional property 'R' or 'AL'
</context>
</line>
</file_context>
</file>
<file>
<file_path>python/tests/test_util.py</file_path>
<file_content>
#  Copyright 2023 LanceDB Developers
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from lancedb.util import get_uri_scheme


def test_normalize_uri():
    uris = [
        "relative/path",
        "/absolute/path",
        "file:///absolute/path",
        "s3://bucket/path",
        "gs://bucket/path",
        "c:\\windows\\path",  # Windows drive letter, expected to resolve to "file"
    ]
    schemes = ["file", "file", "file", "s3", "gs", "file"]

    for uri, expected_scheme in zip(uris, schemes):
        parsed_scheme = get_uri_scheme(uri)
        assert parsed_scheme == expected_scheme
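

# A minimal, hypothetical sketch of the kind of scheme extraction that
# test_normalize_uri exercises. This is NOT the lancedb implementation:
# `sketch_uri_scheme` is an illustrative name, and treating a one-letter
# scheme as a Windows drive letter is an assumption made here so that a
# path such as c:\windows\path maps to "file".
from urllib.parse import urlparse


def sketch_uri_scheme(uri: str) -> str:
    scheme = urlparse(uri).scheme
    # An empty scheme (plain relative or absolute path) and a single-letter
    # scheme (assumed to be a drive letter) are both treated as local files.
    if len(scheme) <= 1:
        return "file"
    return scheme


def test_sketch_uri_scheme():
    # Same inputs and expected schemes as test_normalize_uri, checked
    # against the sketch instead of lancedb.util.get_uri_scheme.
    cases = {
        "relative/path": "file",
        "/absolute/path": "file",
        "file:///absolute/path": "file",
        "s3://bucket/path": "s3",
        "gs://bucket/path": "gs",
        "c:\\windows\\path": "file",
    }
    for uri, expected_scheme in cases.items():
        assert sketch_uri_scheme(uri) == expected_scheme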

</file_content>
<file_context>
<line>
<line_number>13, 16, 18</line_number>
<line_content>from lancedb.util import get_uri_scheme, 
def test_normalize_uri():, 
'relative/path',</line_content>
<context>
The wsgiref.util module provides utility functions for working with WSGI environments such as guessing the scheme, constructing request URIs, shifting path info, and setting up testing defaults. It also includes the FileWrapper class for converting</context>
</line>
<line>
<line_number>19, 20, 21</line_number>
<line_content>'/absolute/path',, 
'file:///absolute/path',, 
's3://bucket/path',</line_content>
<context>
The os.path module contains functions for working with file paths and directory paths. It allows you to extract components of paths, check if a path exists, get metadata like size/timestamps, normalize paths to their absolute version, and join path</context>
</line>
<line>
<line_number>22, 23, 25</line_number>
<line_content>'gs://bucket/path',, 
'c:\\windows\\path',, 
schemes = ['file', 'file', 'file', 's3', 'gs', 'file']</line_content>
<context>
with '[]'. glob.glob() returns a list of matching pathnames, which can be absolute or relative paths. The glob.iglob() function returns an iterator instead of a list. The glob module uses os.scandir() and fnmatch.fnmatch() internally. Files starting</context>
</line>
<line>
<line_number>27, 28, 29</line_number>
<line_content>for uri, expected_scheme in zip(uris, schemes):, 
parsed_scheme = get_uri_scheme(uri), 
assert parsed_scheme == expected_scheme</line_content>
<context>
get_default_scheme(), get_path(), and get_paths() functions can be used to get information on the installation schemes and paths.
</context>
</line>
</file_context>
</file>
</files>