Integrating Vision using the latest OpenAI API
Athos Georgiou
Welcome to the latest instalment of the series on building an AI chat assistant from scratch. In this article, I'll be integrating vision into the AI chat assistant I've been building in the previous articles. This work will include creating new UI components for the Vision API, as well as creating new API routes to support the new functionality. I'll also be refactoring the existing code to improve readability and maintainability.
In prior articles, I've covered the following topics:
- Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
- Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
- Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
- Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI
Although these topics don't cover the entire project, they provide a good starting point for anyone interested in building their own AI chat assistant and serve as the foundation for the rest of the series.
If you'd prefer to skip ahead and grab the code yourself, you can find it on GitHub.
Overview
OpenAI Vision is a recent feature that allows you to get insights from images. It can be used to detect objects, faces, landmarks, text, and more. The API is currently in beta and is subject to change. For more information, check out the OpenAI Vision API documentation.
Generally, integrating Vision into your application involves the following steps:
- Create a new API route to handle the API calls.
- Create a new UI component to allow the user to add images or URLs.
- Show the results of the API calls in the chat UI component, using streaming chat.
Our goal is to create a customized configuration component for the Vision API. This component will allow the user to add images or URLs and display the results of the API calls in the chat UI component, using streaming chat. The user will also be able to delete all Vision-related data/files and enable/disable the Vision API with a switch.
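To make the chat side of this concrete, here is a minimal sketch of the kind of completions call that sits behind it, using the openai Node SDK: the user's question and an image URL are sent together as content parts. The model name and the standalone function are assumptions for illustration; in Titanium the call is wired into the streaming chat route rather than returned directly.
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Sketch only: ask a question about a single image URL and return the answer.
// 'gpt-4-vision-preview' was the vision-capable model at the time of writing.
async function askAboutImage(question: string, imageUrl: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4-vision-preview',
    max_tokens: 500,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}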
Prerequisites
Before we dive in, make sure you have the following:
- A basic understanding of React and Material-UI
- Node.js and npm installed in your environment.
- A React project set up, ideally using Next.js. Keep in mind that I'll be using Titanium, a template that already has a lot of the basic functionality set up for building an AI Assistant. You can use it as a starting point for your own project, or just follow along and copy/paste the code snippets you need.
Step 1: Creating new UI Components
In this step, we'll be creating the new UI components needed to support the new functionality. This includes a new VisionDialog component, which will be used to display the Vision configuration dialog. We'll also be updating the CustomizedInputBase component to support it.
Update the CustomizedInputBase component
I've updated the CustomizedInputBase component to support the new functionality. This includes adding a new VisionDialog component, which will be used to display the Vision configuration dialog, and a new Vision entry in the input menu, which opens that dialog.
Although the initial implementation used several state variables to track the state of the UI, I've refactored the code to use react-hook-form instead. This allowed me to simplify the code and make it more readable/maintainable.
You can view the full CustomizedInputBase file on GitHub: CustomizedInputBase, but the UI component changes are shown below:
...
<>
<Paper
component="form"
sx={{
p: '2px 4px',
display: 'flex',
alignItems: 'center',
width: isSmallScreen ? '100%' : 650,
}}
onKeyDown={(event) => {
if (event.key === 'Enter') {
event.preventDefault();
handleSendClick();
}
}}
>
<IconButton
sx={{ p: '10px' }}
aria-label="menu"
onClick={handleMenuOpen}
>
<MenuIcon />
</IconButton>
<Menu
anchorEl={anchorEl}
open={Boolean(anchorEl)}
onClose={handleMenuClose}
anchorOrigin={{
vertical: 'top',
horizontal: 'right',
}}
transformOrigin={{
vertical: 'top',
horizontal: 'right',
}}
>
<MenuItem onClick={handleAssistantsClick}>
<ListItemIcon>
<AssistantIcon />
</ListItemIcon>
Assistant
</MenuItem>
<MenuItem onClick={handleVisionClick}>
<ListItemIcon>
<VisionIcon fontSize="small" />
</ListItemIcon>
Vision
</MenuItem>
</Menu>
<InputBase
sx={{ ml: 1, flex: 1 }}
placeholder="Enter your message"
value={inputValue}
onChange={handleInputChange}
/>
<IconButton
type="button"
sx={{ p: '10px' }}
aria-label="send"
onClick={handleSendClick}
>
<SendIcon />
</IconButton>
</Paper>
<AssistantDialog
open={isAssistantDialogOpen}
onClose={() => setIsAssistantDialogOpen(false)}
/>
<VisionDialog
open={isVisionDialogOpen}
onClose={() => setIsVisionDialogOpen(false)}
/>
</>
...
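For context, the handlers referenced in the JSX above are wired roughly as follows. This is a sketch based on the names in the excerpt, not the exact Titanium code:
// Inside the CustomizedInputBase component: menu/dialog state wiring assumed by the JSX above.
const [anchorEl, setAnchorEl] = useState<null | HTMLElement>(null);
const [isAssistantDialogOpen, setIsAssistantDialogOpen] = useState(false);
const [isVisionDialogOpen, setIsVisionDialogOpen] = useState(false);

const handleMenuOpen = (event: React.MouseEvent<HTMLElement>) =>
  setAnchorEl(event.currentTarget);
const handleMenuClose = () => setAnchorEl(null);

// Each menu item closes the menu and opens its own configuration dialog.
const handleAssistantsClick = () => {
  handleMenuClose();
  setIsAssistantDialogOpen(true);
};
const handleVisionClick = () => {
  handleMenuClose();
  setIsVisionDialogOpen(true);
};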
Create the VisionDialog component
I've created a new VisionDialog component, which will be used to display the Vision configuration dialog. It allows the user to add image URLs and see the results of the API calls in the chat UI component, using streaming chat. The user can also delete all Vision-related data/files and enable/disable the Vision API with a switch.
Thankfully, a lot of the grunt work had already been done for the Assistant implemented earlier in the series, so it was mostly a matter of repurposing the existing code to support the new functionality. I've added a new VisionFileList component, which will be used to display the files added to the Vision API, and a new AddUrlDialog component, which will be used to display the dialog for adding URLs to the Vision API.
You can view the full VisionDialog file on GitHub: VisionDialog, but the UI component changes are shown below:
...
<Dialog open={open} onClose={onClose}>
<DialogTitle style={{ textAlign: 'center' }}>
Add Vision Images
</DialogTitle>
<DialogContent style={{ paddingBottom: 8 }}>
<VisionFileList files={visionFiles} onDelete={handleRemoveUrl} />
</DialogContent>
<DialogActions style={{ paddingTop: 0 }}>
<Box
display="flex"
flexDirection="column"
alignItems="stretch"
width="100%"
>
<Button
onClick={handleUpdate}
style={{ marginBottom: '8px' }}
variant="outlined"
color="success"
>
Update
</Button>
<Box display="flex" justifyContent="center" alignItems="center">
<Button onClick={handleCloseClick}>Close Window</Button>
<Button onClick={handleAddUrlClick}>Add URL</Button>
<Typography variant="caption" sx={{ mx: 1 }}>
Disable
</Typography>
<Switch
checked={isVisionEnabled}
onChange={handleToggle}
name="activeVision"
/>
<Typography variant="caption" sx={{ mx: 1 }}>
Enable
</Typography>
<input
type="file"
ref={visionFileInputRef}
style={{ display: 'none' }}
/>
</Box>
</Box>
</DialogActions>
</Dialog>
<AddUrlDialog
open={isAddUrlDialogOpen}
onClose={() => setIsAddUrlDialogOpen(false)}
onAddUrl={handleAddUrl}
/>
</>
...
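The handlers behind this dialog follow the flow described above: URLs added via the dialog go into local state first, the Update button pushes them to the API, and the switch toggles Vision on or off. The endpoint paths and the session-based userEmail below are assumptions for illustration; the actual Titanium routes may live at different paths:
// Sketch of the VisionDialog handlers (inside the component; paths are assumptions).
const handleAddUrl = (name: string, url: string) => {
  // New URLs are kept in local state until the user clicks Update.
  setVisionFiles((prev) => [...prev, { id: crypto.randomUUID(), name, url }]);
};

const handleToggle = async (event: React.ChangeEvent<HTMLInputElement>) => {
  const isVisionEnabled = event.target.checked;
  setIsVisionEnabled(isVisionEnabled);
  // userEmail comes from the NextAuth session set up earlier in the series.
  await fetch('/api/vision/update', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ isVisionEnabled, userEmail: session?.user?.email }),
  });
};

const handleUpdate = async () => {
  // Push the URLs in local state to the add-URL route, one file per request.
  for (const file of visionFiles) {
    await fetch('/api/vision/add-url', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ file, userEmail: session?.user?.email }),
    });
  }
};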
Create the VisionFileList component
I've created a new VisionFileList component, which will be used to display the files added to the Vision API and allows the user to delete all Vision-related data/files. This component is almost identical to the one I used for the Assistant, so again it was relatively straightforward.
One challenge I faced was ensuring that the user can delete all Vision-related data/files. The Vision API doesn't support deleting files, so I had to come up with a workaround. More on this when I go over the API routes.
Also, to avoid clutter, I did some refactoring to make the AssistantFileList and VisionFileList components reusable. This involved moving the AssistantFileList component into a new FileList component, which is used by both the AssistantDialog and VisionDialog components.
It looks like the code below, but if you want the full file, you can view VisionFileList on GitHub: VisionFileList
...
<FilePaper
files={files}
renderFileItem={(file) => (
<ListItem
key={file.id}
secondaryAction={
<IconButton
edge="end"
aria-label="delete"
onClick={() => onDelete(file)}
>
<DeleteIcon />
</IconButton>
}
>
<ListItemAvatar>
<Avatar>
<FolderIcon />
</Avatar>
</ListItemAvatar>
<ListItemText primary={file.name} />
</ListItem>
)}
/>
...
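The shared piece boils down to a small render-prop contract, roughly like the sketch below. The IFile fields and the styling are assumptions; the real shared component in Titanium may carry more props:
import { List, Paper } from '@mui/material';
import { ReactNode } from 'react';

// Assumed minimal file shape; the real IFile likely has more fields.
interface IFile {
  id: string;
  name: string;
  url?: string;
  visionId?: string;
}

interface FilePaperProps {
  files: IFile[];
  renderFileItem: (file: IFile) => ReactNode;
}

// Reusable list container: the parent decides how each file row is rendered.
const FilePaper = ({ files, renderFileItem }: FilePaperProps) => (
  <Paper variant="outlined" sx={{ maxHeight: 300, overflow: 'auto' }}>
    <List dense>{files.map((file) => renderFileItem(file))}</List>
  </Paper>
);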
Create the AddUrlDialog component
This is a simple popup dialog that allows the user to add URLs. The user can add multiple URLs, which are stored in the visionFiles state variable, and can then chat with the completions API, asking questions about the images.
As of now, only URLs are supported, but I'm planning to add support for image uploads in the future.
You can view the full AddUrlDialog file on GitHub: AddUrlDialog, or see some of the code below:
...
<Dialog open={open} onClose={handleClose}>
<DialogTitle sx={{ textAlign: 'center' }}>Add URL</DialogTitle>
<DialogContent style={{ paddingBottom: 8, width: '600px' }}>
<FormControl
fullWidth
margin="dense"
error={error.name}
variant="outlined"
>
<TextField
fullWidth
label="Name"
variant="outlined"
value={nameInput}
onChange={(e) => setNameInput(e.target.value)}
error={error.name}
helperText={error.name ? 'Name is required' : ' '}
/>
</FormControl>
<FormControl
fullWidth
margin="dense"
error={error.url}
variant="outlined"
>
<TextField
fullWidth
label="URL"
variant="outlined"
value={urlInput}
onChange={(e) => setUrlInput(e.target.value)}
error={error.url}
helperText={error.url ? 'URL is required' : ' '}
/>
</FormControl>
</DialogContent>
<DialogActions>
<Box display="flex" justifyContent="center" width="100%">
<Button onClick={handleAddUrl} color="primary">
Add
</Button>
<Button onClick={handleClose}>Cancel</Button>
</Box>
</DialogActions>
</Dialog>
...
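Behind the Add button there is only a little validation before the dialog hands the values back to the parent; something like this sketch, with the error state shape inferred from the fields above:
// Sketch of the AddUrlDialog submit handler (inside the component).
const handleAddUrl = () => {
  const newError = {
    name: nameInput.trim() === '',
    url: urlInput.trim() === '',
  };
  setError(newError);
  if (newError.name || newError.url) {
    return;
  }
  // Hand the values back to VisionDialog, then reset and close.
  onAddUrl(nameInput.trim(), urlInput.trim());
  setNameInput('');
  setUrlInput('');
  handleClose();
};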
Step 2: Creating new API routes
In this step, we'll be creating the new API routes needed to support the new functionality. These routes will handle retrieving the user's Vision configuration, enabling or disabling Vision, and adding or deleting image URLs.
Retrieve the Vision configuration (On load)
import { NextRequest, NextResponse } from 'next/server';
import {
getDatabaseAndUser,
getDb,
handleErrorResponse,
sendErrorResponse,
} from '@/app/lib/utils/db';
export async function GET(req: NextRequest): Promise<NextResponse> {
try {
const db = await getDb();
const userEmail = req.headers.get('userEmail') as string;
const serviceName = req.headers.get('serviceName');
const { user } = await getDatabaseAndUser(db, userEmail);
if (serviceName === 'vision' && user.visionId) {
const fileCollection = db.collection<IFile>('files');
const visionFileList = await fileCollection
.find({ visionId: user.visionId })
.toArray();
return NextResponse.json(
{
message: 'Vision retrieved',
visionId: user.visionId,
visionFileList,
isVisionEnabled: user.isVisionEnabled,
},
{ status: 200 }
);
}
return sendErrorResponse('Vision not configured for the user', 200);
} catch (error: any) {
return handleErrorResponse(error);
}
}
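On the client side, the dialog calls this route when it opens, passing the user's email and the service name as request headers. A sketch of that call is below; the /api/vision/retrieve path is an assumption, so use whatever path this route file lives at in your project:
// Sketch: load the Vision configuration when the dialog opens.
const loadVisionConfig = async (userEmail: string) => {
  const response = await fetch('/api/vision/retrieve', {
    method: 'GET',
    headers: { userEmail, serviceName: 'vision' },
  });
  const data = await response.json();
  if (response.ok && data.visionId) {
    setVisionFiles(data.visionFileList ?? []);
    setIsVisionEnabled(data.isVisionEnabled ?? false);
  }
};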
Update the Vision settings (On load and on clicking the Update button)
import { NextRequest, NextResponse } from 'next/server';
import { getDb, getUserByEmail, sendErrorResponse } from '@/app/lib/utils/db';
import { Collection } from 'mongodb';
async function updateVision(
user: IUser,
usersCollection: Collection<IUser>,
isVisionEnabled: boolean
): Promise<void> {
let isAssistantEnabled = isVisionEnabled ? false : user.isAssistantEnabled;
let visionId = user.visionId;
if (!visionId) {
console.log('No visionId found. Creating a new one');
visionId = crypto.randomUUID();
}
await usersCollection.updateOne(
{ email: user.email },
{
$set: {
isAssistantEnabled: isAssistantEnabled,
isVisionEnabled: isVisionEnabled,
visionId: visionId,
},
}
);
}
export async function POST(req: NextRequest): Promise<NextResponse> {
try {
const db = await getDb();
const { isVisionEnabled, userEmail } = (await req.json()) as {
isVisionEnabled: boolean;
userEmail: string;
};
const usersCollection = db.collection<IUser>('users');
const user = await getUserByEmail(usersCollection, userEmail);
if (!user) {
return sendErrorResponse('User not found', 404);
}
await updateVision(user, usersCollection, isVisionEnabled);
return NextResponse.json(
{
message: 'Vision updated',
visionId: user.visionId,
isVisionEnabled: isVisionEnabled,
},
{ status: 200 }
);
} catch (error: any) {
console.error('Error in vision update:', error);
return sendErrorResponse('Error in vision update', 500);
}
}
Add a URL (User Action)
import { NextRequest, NextResponse } from 'next/server';
import {
getDatabaseAndUser,
getDb,
sendErrorResponse,
} from '@/app/lib/utils/db';
export async function POST(req: NextRequest): Promise<NextResponse> {
try {
const db = await getDb();
const { file, userEmail } = await req.json();
const { user } = await getDatabaseAndUser(db, userEmail);
let visionId;
const usersCollection = db.collection<IUser>('users');
if (!user.visionId) {
console.log('No visionId found. Creating a new one');
visionId = crypto.randomUUID();
await usersCollection.updateOne(
{ email: user.email },
{ $set: { visionId: visionId } }
);
} else {
visionId = user.visionId;
}
file.visionId = visionId;
const fileCollection = db.collection<IFile>('files');
const insertFileResponse = await fileCollection.insertOne(file);
return NextResponse.json({
message: 'File processed successfully',
response: insertFileResponse,
file: file,
status: 200,
});
} catch (error) {
console.error(error);
return sendErrorResponse('Error processing file', 500);
}
}
Delete a URL (User Action)
import { NextRequest, NextResponse } from 'next/server';
import {
getDatabaseAndUser,
getDb,
sendErrorResponse,
} from '@/app/lib/utils/db';
export async function POST(req: NextRequest): Promise<NextResponse> {
try {
const db = await getDb();
const { file, userEmail } = await req.json();
const { user } = await getDatabaseAndUser(db, userEmail);
if (user.visionId !== file.visionId) {
return sendErrorResponse('User VisionId not found', 404);
}
const fileCollection = db.collection<IFile>('files');
const deleteFileResponse = await fileCollection.deleteOne({
visionId: file.visionId,
});
return NextResponse.json({
status: 200,
message: 'Url deleted successfully',
response: deleteFileResponse,
});
} catch (error) {
return sendErrorResponse('Error deleting file', 500);
}
}
Step 3: Honorable mentions
If you've followed the series until now, you'll likely have noticed that some refactoring has gone into the application to improve the code and make it more readable/maintainable. More to come on this in a future article, but for now I'll just throw this out there: react-hook-form is awesome!
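As a quick taste of what that refactor looks like, here is a minimal react-hook-form version of the name/URL form from the AddUrlDialog. The component and field names are assumptions for illustration; Controller is used because it is the usual way to pair react-hook-form with MUI's TextField:
import { Button, TextField } from '@mui/material';
import { Controller, useForm } from 'react-hook-form';

interface AddUrlFormValues {
  name: string;
  url: string;
}

// Sketch: one form object replaces the separate useState calls per field.
const AddUrlForm = ({ onAddUrl }: { onAddUrl: (name: string, url: string) => void }) => {
  const { control, handleSubmit, reset } = useForm<AddUrlFormValues>({
    defaultValues: { name: '', url: '' },
  });

  const onSubmit = (values: AddUrlFormValues) => {
    onAddUrl(values.name.trim(), values.url.trim());
    reset();
  };

  return (
    <form onSubmit={handleSubmit(onSubmit)}>
      <Controller
        name="name"
        control={control}
        rules={{ required: 'Name is required' }}
        render={({ field, fieldState }) => (
          <TextField
            {...field}
            fullWidth
            margin="dense"
            label="Name"
            error={!!fieldState.error}
            helperText={fieldState.error?.message ?? ' '}
          />
        )}
      />
      <Controller
        name="url"
        control={control}
        rules={{ required: 'URL is required' }}
        render={({ field, fieldState }) => (
          <TextField
            {...field}
            fullWidth
            margin="dense"
            label="URL"
            error={!!fieldState.error}
            helperText={fieldState.error?.message ?? ' '}
          />
        )}
      />
      <Button type="submit">Add</Button>
    </form>
  );
};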
Step 4: Testing the new functionality
Now that we have the new UI components and API routes in place, we can test the new functionality. To do this, we'll need to start the application and open the browser. Once the application is running, we can open the Vision configuration dialog by clicking on the Vision button in the chat UI component.
We can add URLs by clicking on the Add URL button, which will open the Add URL dialog. We can then enter the name and URL of the image we want to add and click on the Add button. The URL will be added to the list of URLs and the Add URL dialog will close.
We can then click on the Update button, which will update the Vision API with the new URLs, and chat with the Vision API, asking questions about the images. As I mentioned earlier, the results are pretty impressive, with Vision able to detect objects, faces, landmarks, and text with remarkable accuracy.
We can also delete the URLs by clicking on the delete button next to the URL. This will delete the URL from the list of URLs and the Vision API.
Finally, we can enable/disable the Vision API by clicking on the switch. This will enable/disable the Vision API and update the UI accordingly.
The easiest way to test the new functionality is to use the Titanium template, where you'll also find instructions on getting it up and running on your local machine or deploying it to Vercel.
Conclusion and Next Steps
In this article, I've covered how to integrate Vision using the latest OpenAI API, including the new UI components and API routes needed to support the new functionality, as well as some notes on refactoring the existing code to improve readability and maintainability.
To be honest with you, I was pleasantly surprised by how easy it was to integrate Vision into the existing application. I was expecting a lot more work, but it was relatively straightforward. And the results are pretty cool! Vision is capable of detecting objects, faces, landmarks, and text with remarkable accuracy. I'm really looking forward to seeing what people will build with this new API.
Feel free to check out Titanium, which already has a lot of the basic functionality set up for building an AI Assistant. You can use it as a starting point for your own project, or just follow along and copy/paste the code snippets you need.
If you have any questions or comments, feel free to reach out to me on GitHub, LinkedIn, or via email.
See ya around and happy coding!