DSPy and outlines preparation

DSPy and outlines preparation on macOS 10.13.6

DSPy, a framework for algorithmically optimizing LM prompts and weights, and Outlines, an open-source decoding framework for structured generation similar to OpenAI's Structured Outputs API, both rely on Apache Arrow, which defines a data format for fast access without serialization overhead. However, on some legacy systems, such as my macOS 10.13.6 machine (kept at that version for eGPU support), there is no pre-built Arrow library, which makes setting up DSPy and Outlines infeasible. This tutorial describes, step by step, how to build the Apache Arrow library on macOS 10.13.6 and how to verify Outlines and DSPy after the setup.

Section 1: Apache Arrow setup

1. Preparation of required libraries: do not run brew update && brew bundle --file=arrow/cpp/Brewfile directly, because Homebrew on macOS 10.13.6 is now outdated and many libraries, such as llvm 14 and abseil-cpp, can no longer be upgraded to the latest version.

Conda installation for the required libraries:

conda install utf8proc lz4-c libthrift
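
Before starting the longer builds, it can save time to confirm that the command-line tools used in this tutorial are on PATH. The following is a minimal sketch; the tool list and the helper name missing_tools are assumptions drawn from the commands in this tutorial (ninja is needed by the ninja-release-python preset), not something Arrow itself provides.

```python
import shutil

# Tools invoked by the commands in this tutorial; adjust to your own setup.
REQUIRED_TOOLS = ("git", "cmake", "ninja", "make", "conda")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required build tools were found.")
```

Anything reported as missing would need to be installed (for example through conda) before continuing.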

Apache ORC installation:

git clone https://github.com/apache/orc
cd orc
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=DEBUG
make package
make test-out
tar -xvf ORC-2.0.2-Darwin.tar.gz
cp -rf ORC-2.0.2-Darwin /usr/local/Cellar/ORC-2.0.2
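
Since the Arrow build later points ORC_ROOT at this directory, a quick sanity check of the copied package can catch a bad path early. This is only a sketch: the expected layout (headers under include/orc and a liborc library under lib) is an assumption about the ORC package, and check_orc_prefix is a hypothetical helper name.

```python
from pathlib import Path


def check_orc_prefix(prefix):
    """Return a list of problems found under an ORC install prefix."""
    root = Path(prefix)
    if not root.is_dir():
        return [f"{root} does not exist"]
    problems = []
    # Assumed layout: C++ headers in include/orc, libraries in lib.
    if not (root / "include" / "orc").is_dir():
        problems.append("missing include/orc headers")
    lib_dir = root / "lib"
    if not (lib_dir.is_dir() and any(lib_dir.glob("liborc*"))):
        problems.append("missing liborc library in lib/")
    return problems


if __name__ == "__main__":
    for problem in check_orc_prefix("/usr/local/Cellar/ORC-2.0.2"):
        print("ORC prefix problem:", problem)
```

An empty result means the directory at least looks like a usable ORC_ROOT.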

Apache Arrow 17.0 installation:

mkdir apache-arrow
cd apache-arrow
git clone https://github.com/llv22/arrow_forward arrow
mkdir dist
  1. Set environment variables to let Arrow’s build system know about our build toolchain
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
  2. Build the Arrow C++ library, cleaning the build folder first and specifying ORC_ROOT
rm -rf ./arrow/cpp/build
cmake -S arrow/cpp -B arrow/cpp/build -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -DORC_ROOT=/usr/local/Cellar/ORC-2.0.2 --preset ninja-release-python
cmake --build arrow/cpp/build --target install
  3. Build the Arrow Python wheel
pushd arrow/python
export PYARROW_PARALLEL=4
export ARROW_BUILD_TYPE=debug
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --bundle-arrow-cpp bdist_wheel
popd
  4. Install Outlines and DSPy
git clone https://github.com/outlines-dev/outlines.git
cd outlines
python -m build # wheel is located in dist/
cd ..
git clone https://github.com/stanfordnlp/dspy.git
cd dspy
python setup.py clean bdist_wheel
python -m build # wheel is located in dist/
cd ..
pip install "dspy-ai[chromadb]" # or [groq] or [marqo] or [milvus] or [mongodb] or [myscale] or [pinecone] or [qdrant] or [snowflake] or [weaviate]
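
Once the pyarrow wheel has been installed into the active environment, a small round trip is a reasonable smoke test that the freshly built library actually works. The sketch below assumes pyarrow is importable; the function name pyarrow_roundtrip_ok is hypothetical, and it returns None instead of failing when pyarrow cannot be imported.

```python
def pyarrow_roundtrip_ok():
    """Return True if a small table survives an in-memory IPC round trip.

    Returns None when pyarrow cannot be imported in this environment.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return None
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    # Write the table to an in-memory Arrow IPC stream, then read it back.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    restored = pa.ipc.open_stream(sink.getvalue()).read_all()
    return restored.equals(table)


if __name__ == "__main__":
    print("pyarrow round trip:", pyarrow_roundtrip_ok())
```

A result of None points to an installation problem (the wheel is not installed, or its shared libraries cannot be loaded) rather than to a data problem.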

Section 2: Outlines

Outlines supports JSON and other structured data output on top of open-source transformers models.

from enum import Enum
from pydantic import BaseModel, constr

import outlines
import torch

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"

class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Construct structured sequence generator
generator = outlines.generate.json(model, Character)

# Draw a sample
seed = 789001

character = generator("Give me a character description", seed=seed)

print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)

character = generator("Give me an interesting character description", seed=seed)

print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)
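
To give an intuition for why the output above always fits the schema, here is a toy, dependency-free sketch of constrained decoding over a fixed set of choices. It is not Outlines' implementation: the real library constrains model tokens, while this sketch works on single characters, and the score function is a hypothetical stand-in for a language model's next-token scores.

```python
WEAPONS = ["sword", "axe", "mace", "spear", "bow", "crossbow"]


def allowed_next_chars(prefix, choices):
    """Characters that keep `prefix` a prefix of at least one choice."""
    return {
        choice[len(prefix)]
        for choice in choices
        if choice.startswith(prefix) and len(choice) > len(prefix)
    }


def constrained_decode(score, choices):
    """Greedily build a string that is guaranteed to be one of `choices`.

    `score(prefix, char)` stands in for a model's preference for `char`.
    Decoding stops at the first complete match, so a choice that merely
    extends another choice is never produced by this toy version.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    output = ""
    while output not in choices:
        # Mask out every character that would leave the set of valid strings.
        candidates = allowed_next_chars(output, choices)
        output += max(candidates, key=lambda char: score(output, char))
    return output


if __name__ == "__main__":
    # A toy "model" that prefers letters late in the alphabet.
    print(constrained_decode(lambda prefix, char: ord(char), WEAPONS))  # sword
```

Whatever the scoring function prefers, the result is always a member of the allowed set, which is the property that lets a weapon field only ever hold a valid Weapon value.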

OpenAI constrained structured output

from enum import Enum
from typing import Union

from pydantic import BaseModel

import openai
from openai import OpenAI

class Table(str, Enum):
    orders = "orders"
    customers = "customers"
    products = "products"

class Column(str, Enum):
    id = "id"
    status = "status"
    expected_delivery_date = "expected_delivery_date"
    delivered_at = "delivered_at"
    shipped_at = "shipped_at"
    ordered_at = "ordered_at"
    canceled_at = "canceled_at"

class Operator(str, Enum):
    eq = "="
    gt = ">"
    lt = "<"
    le = "<="
    ge = ">="
    ne = "!="

class OrderBy(str, Enum):
    asc = "asc"
    desc = "desc"

class DynamicValue(BaseModel):
    column_name: str

class Condition(BaseModel):
    column: str
    operator: Operator
    value: Union[str, int, DynamicValue]

class Query(BaseModel):
    table_name: Table
    columns: list[Column]
    conditions: list[Condition]
    order_by: OrderBy

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. The current date is August 6, 2024. You help users query for the data they are looking for by calling the query function.",
        },
        {
            "role": "user",
            "content": "look up all my orders in may of last year that were fulfilled but not delivered on time",
        },
    ],
    tools=[
        openai.pydantic_function_tool(Query),
    ],
)
print(completion.choices[0].message.tool_calls[0].function.parsed_arguments)
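
The final print assumes the model actually returned a tool call. A slightly more defensive access pattern might look like the sketch below; first_parsed_arguments and fake_completion are hypothetical helpers, and the stand-in objects only mimic the attribute layout used above rather than real OpenAI response types.

```python
from types import SimpleNamespace


def first_parsed_arguments(completion):
    """Return the parsed arguments of the first tool call, or None."""
    message = completion.choices[0].message
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        # The model answered without calling the tool.
        return None
    return tool_calls[0].function.parsed_arguments


def fake_completion(tool_calls):
    """Build a stand-in with the same attribute layout as a completion."""
    message = SimpleNamespace(tool_calls=tool_calls)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


if __name__ == "__main__":
    call = SimpleNamespace(
        function=SimpleNamespace(parsed_arguments={"table_name": "orders"})
    )
    print(first_parsed_arguments(fake_completion([call])))
    print(first_parsed_arguments(fake_completion(None)))
```

With a real response, the same function would be called on the completion object returned by client.beta.chat.completions.parse.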

Section 3: DSPy

import dspy


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        # Retrieve the top-k passages, then answer with chain-of-thought prompting.
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer


# Assumes a language model and a retrieval model were already configured
# through dspy.settings.configure(lm=..., rm=...).
rag = RAG()  # zero-shot, uncompiled version of RAG
rag("what is the capital of France?").answer  # -> "Paris"
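
For readers who want to see the retrieve-then-generate control flow without configuring any model, the following dependency-free sketch mirrors the structure of the module above. Everything in it is a hypothetical stand-in (the corpus, the word-overlap retriever, and the template generator); none of it is a DSPy API.

```python
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]


def _words(text):
    """Lowercase word set with basic punctuation removed."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(question, k):
    """Hypothetical retriever: rank passages by words shared with the question."""
    query = _words(question)
    ranked = sorted(CORPUS, key=lambda p: len(query & _words(p)), reverse=True)
    return ranked[:k]


def generate_answer(context, question):
    """Hypothetical generator: a real one would prompt an LM with both inputs."""
    return f"Q: {question} | best context: {context[0]}"


class ToyRAG:
    def __init__(self, num_passages=3):
        self.num_passages = num_passages

    def __call__(self, question):
        # Same two steps as RAG.forward: fetch context, then answer with it.
        context = retrieve(question, self.num_passages)
        return generate_answer(context=context, question=question)


if __name__ == "__main__":
    print(ToyRAG()("what is the capital of France?"))
```

In the DSPy version, both steps are handled by the configured retrieval model and language model, which is what makes the pipeline optimizable.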