Over the past 10 months, we've built instructor around the principle of 'easy to try, easy to delete'. We accomplished this by patching the OpenAI client with the instructor package and adding new arguments like response_model, max_retries, and validation_context. As a result, I truly believe instructor is the best way to get structured data out of LLM APIs.
But as a result, we've been a bit stuck on getting typing to work well while giving you more control at development time. I'm excited to launch version 1.0.0, which cleans up the API with respect to typing without compromising ease of use.
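To make this concrete, here's a minimal sketch of the 1.0 API; the model, prompt, and UserDetail schema are illustrative:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

# instructor 1.0: from_openai returns a patched client whose create
# call is typed to return the response_model you pass in.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,  # the return type is now UserDetail
    max_retries=2,              # re-ask the model on validation errors
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
)
print(user.name, user.age)
```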
When asking language models to summarize text, there's a risk that the generated summary ends up in English, even if the source text is in another language. This is likely due to the instructions being provided in English, biasing the model towards English output.
In this post, we explore techniques to ensure the language of the generated summary matches the language of the source text. We leverage Pydantic for data validation and the langdetect library for language identification.
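As a minimal sketch of the approach (the Summary model and validator below are our own names, not part of instructor), we can attach a Pydantic field validator that detects the summary's language with langdetect and compares it against a source_language value passed in through instructor's validation_context:

```python
import instructor
from langdetect import detect
from openai import OpenAI
from pydantic import BaseModel, ValidationInfo, field_validator

class Summary(BaseModel):
    summary: str

    @field_validator("summary")
    @classmethod
    def language_matches(cls, v: str, info: ValidationInfo) -> str:
        # `source_language` is supplied at call time via validation_context
        expected = (info.context or {}).get("source_language")
        if expected is not None:
            detected = detect(v)
            if detected != expected:
                raise ValueError(
                    f"Summary language '{detected}' does not match source language '{expected}'"
                )
        return v

client = instructor.from_openai(OpenAI())

source_text = "Die Katze saß auf der Matte und beobachtete die Vögel."  # German example

# Detect the source language up front and pass it through validation_context;
# if the validator rejects the output, max_retries lets instructor re-ask.
summary = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=Summary,
    max_retries=2,
    validation_context={"source_language": detect(source_text)},
    messages=[{"role": "user", "content": f"Summarize the following text: {source_text}"}],
)
```

When a summary comes back in the wrong language, the validator's error message is fed back to the model on the retry, nudging it toward a corrected, language-matched summary.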
A special shoutout to Shreya for her contributions to the Anthropic support. As of now, all features are operational except streaming support.
For those eager to experiment, simply patch the client with the ANTHROPIC_JSON mode, which enables you to use the Anthropic client for making requests.
```bash
pip install "instructor[anthropic]"
```
Missing Features
We want to acknowledge that we're still missing partial streaming and better re-asking support for XML. We're working on both and will ship them soon.
```python
from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Patch the Anthropic client with instructor for structured outputs
anthropic_client = instructor.from_anthropic(
    anthropic.Anthropic(),
    mode=instructor.Mode.ANTHROPIC_JSON,
)

class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

user_response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    max_retries=0,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)
print(user_response.model_dump_json(indent=2))
"""
{
  "name": "John",
  "age": 25,
  "properties": [
    {
      "name": "favorite_color",
      "value": "blue"
    }
  ]
}
"""
```
We're still working through challenges with deeply nested types, and we eagerly invite the community to test, provide feedback, and suggest improvements as we round out the Anthropic client's support.
One thing people have been using instructor for is generating synthetic data rather than extracting it. We can even use the json_schema_extra fields to provide specific examples that control how the data is generated.
Consider the example below. We'll likely generate very simple names.
```python
from typing import Iterable
from pydantic import BaseModel
import instructor
from openai import OpenAI

# Define the UserDetail model
class UserDetail(BaseModel):
    name: str
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Alice' age=25
    #> name='Bob' age=30
    #> name='Charlie' age=35
    #> name='David' age=40
    #> name='Eve' age=22
```
We might want to set examples as part of the prompt by leveraging Pydantic's configuration. We can set examples directly in the JSON schema itself.
```python
from typing import Iterable
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# Define the UserDetail model with per-field examples
class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='John Doe' age=25
    #> name='Jane Smith' age=30
    #> name='Michael Johnson' age=22
    #> name='Emily Davis' age=28
    #> name='David Brown' age=35
```
By incorporating the names of celebrities as examples, we've shifted the generation toward well-known personalities and away from the simplistic, single-word names we saw before.
To generate synthetic examples with more nuance, let's upgrade to the "gpt-4-turbo-preview" model and use model-level examples rather than attribute-level examples:
```python
import instructor
from typing import Iterable
from pydantic import BaseModel, ConfigDict
from openai import OpenAI

# Define the UserDetail model with model-level examples
class UserDetail(BaseModel):
    """Old Wizards"""

    name: str
    age: int

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic examples"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Merlin' age=1000
    #> name='Saruman the White' age=700
    #> name='Radagast the Brown' age=600
    #> name='Elminster Aumar' age=1200
    #> name='Mordenkainen' age=850
```
By adjusting the descriptions within our Pydantic models, we can subtly influence the nature of the synthetic data generated. This method allows for a more nuanced control over the output, ensuring that the generated data aligns more closely with our expectations or requirements.
For instance, specifying "Fancy French sounding names" as a description for the name field in our UserDetail model directs the generation process to produce names that fit this particular criterion, resulting in a dataset that is both diverse and tailored to specific linguistic characteristics.
```python
import instructor
from typing import Iterable
from pydantic import BaseModel, Field
from openai import OpenAI

# Define the UserDetail model with a steering description on `name`
class UserDetail(BaseModel):
    name: str = Field(description="Fancy French sounding names")
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Jean Luc' age=30
    #> name='Claire Belle' age=25
    #> name='Pierre Leclair' age=40
    #> name='Amelie Rousseau' age=35
    #> name='Etienne Lefevre' age=28
```
Instructor has expanded its capabilities for language models. It started with API interactions via the OpenAI SDK, using Pydantic for structured data validation. Now, Instructor supports multiple models and platforms.
Instructor now works with cloud-based APIs and local models for structured data extraction. Developers can refer to our guide on Patching for information on using JSON mode with different models.
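As a hedged sketch, here's what patching an OpenAI-compatible local server might look like; the base_url, api_key, and model name below are placeholders for whatever your local setup exposes (an Ollama-style endpoint is assumed):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

# Point the OpenAI SDK at a local OpenAI-compatible server and use
# Mode.JSON, since local models typically lack function calling.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),  # assumed local endpoint
    mode=instructor.Mode.JSON,
)

user = client.chat.completions.create(
    model="llama2",  # placeholder local model name
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
print(user)
```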
Large language models (LLMs) like GPTs are incredibly powerful, but working with their open-ended text outputs can be challenging. This is where the Instructor library shines - it allows you to easily map LLM outputs to structured data using Python type annotations.
It's a common misconception that LangChain's LangSmith is only compatible with LangChain's models. In reality, LangSmith is a unified DevOps platform for developing, collaborating on, testing, deploying, and monitoring LLM applications. In this blog we will explore how LangSmith can be used to enhance the OpenAI client alongside instructor.
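As a minimal sketch of the combination (assuming the langsmith package is installed and a LangSmith API key is configured in your environment), LangSmith's wrap_openai instruments the client so every request and response is traced, and instructor then patches the wrapped client as usual:

```python
import instructor
from langsmith.wrappers import wrap_openai
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

# wrap_openai traces all calls in LangSmith; instructor adds response_model
client = instructor.from_openai(wrap_openai(OpenAI()))

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
```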
I just released a free course on Weights & Biases. It covers the material from the tutorial. Check it out at wandb.courses; it's free, open to everyone, and just under an hour long!