Notes on

Building Generative AI Services with FastAPI

by Alireza Parandeh



Most of my notes are in the form of code after re-implementing what’s detailed in the book. The implementations in the book are quite basic, and, as the author notes, production-ready systems require much more.

Achieving Concurrency in AI Workloads

FastAPI is built on Starlette, an ASGI framework; without it, FastAPI would effectively run synchronously. Through ASGI, it supports concurrency via both multithreading (a thread pool) and asynchronous programming (an event loop). This lets it serve multiple requests concurrently, without blocking the main server process.

Ideally, run handlers on the event loop, as it can be more efficient than running them in the thread pool (each thread in the pool has to acquire the GIL before it can execute bytecode, which adds overhead).
The FastAPI docs tell you not to worry much about declaring your handlers async def or def. From personal experience, always use async def.
Just make sure not to block the event loop with blocking operations inside async routes.

Authentication and Authorization

OAuth Authorization Code Flow (ACF)

  1. User clicks login
  2. User is redirected to Identity Provider’s (IDP’s) login page (app sends along client ID to identify itself)
  3. User logs into their account and is shown a consent screen presenting them the scopes (i.e. permissions) the application is requesting on their behalf.
  4. User grants all, some, or no scopes
  5. If consent is not rejected by the user (i.e., the resource owner), the IDP’s authorization server redirects them back to a redirect URI you’ve provided, along with an application grant code.
  6. Once your app has a grant code (tied to the user session, the permitted scopes, and the app’s client ID), it can exchange it with the authorization server for a short-lived access token and a longer-lived refresh token. The refresh token lets you request a new access token without restarting the whole authentication process.
  7. App can now use the access token to access the provider’s resource server to perform actions on behalf of the user on their resources (given the appropriate scopes).

A more secure variant of ACF uses proof key for code exchange (PKCE). During this flow, you add a hashed secret called the code_challenge when sending the initial request to the identity provider. Then you present the unhashed secret code_verifier again to exchange the authorization code for an access token.
This protects against authorization code interception attacks.
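Deriving the PKCE pair needs only the standard library; a minimal sketch of the S256 method (base64url-encoded SHA-256 of the verifier, padding stripped):

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    # code_verifier: high-entropy random string the client keeps secret.
    verifier = secrets.token_urlsafe(48)
    # code_challenge: base64url(SHA-256(verifier)) without '=' padding.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

You send the challenge in the initial authorization request and the verifier in the token exchange; the server re-hashes the verifier and checks it matches.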

And other types for other applications:

  • Implicit flow for SPAs with no separate backend. Skips the grant code and returns an access token directly.
  • Client credentials flow. Useful when building a backend service for machine-to-machine communication where no browser is involved. You exchange the client ID and secret for an access token to access your own resources on the IDP’s servers (so no acting on behalf of users).
  • Resource owner password credentials flow - like the client credentials flow, but uses a user’s username and password to get an access token. Avoid this flow, as user credentials are exchanged directly with the authorization server.
  • Device authorization flow, which is mostly used for devices with limited input capabilities like smart TVs. You just scan a QR code and use a web browser from your phone.

A few common authorization models:

  • Role-based access control (RBAC): user is assigned roles, dictating their rights
  • Relationship-based access control (ReBAC): authorization is determined by the user’s relationship to another entity, e.g. a team
  • Attribute-based access control (ABAC): e.g. user has “paid” attribute and can therefore access premium resources, but all users can access public resources.

It’s common for larger applications to combine features of those authorization models.
Creating a separate authorization service can help decouple your application code from authorization rules that change often (who can do what, etc.). Essentially, you’d use it to answer: “Given this actor, can they perform ACTION on RESOURCE?”
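As a toy sketch of that question using RBAC (all role and permission names below are hypothetical; a real system would query the dedicated authorization service rather than an in-process dict):

```python
# Map roles to the (action, resource) pairs they permit.
ROLE_PERMISSIONS: dict[str, set[tuple[str, str]]] = {
    "viewer": {("read", "document")},
    "editor": {("read", "document"), ("write", "document")},
}

def is_allowed(roles: list[str], action: str, resource: str) -> bool:
    # "Given this actor (their roles), can they perform ACTION on RESOURCE?"
    return any(
        (action, resource) in ROLE_PERMISSIONS.get(role, set())
        for role in roles
    )
```

In FastAPI you’d typically wrap a check like this in a dependency so routes can declare the permission they require.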

Securing AI Services

Guardrails

Adding guardrails to your inputs/outputs adds latency. You can parallelize by starting the guardrail check and the request to the AI model at the same time. If the guardrail decides the request isn’t allowed, you immediately cancel the in-flight model call and reject the request. This costs more, but affords you shorter response times.
I’ve seen various providers do this. The second the model starts talking about disallowed topics, the request is rejected, mid-stream.
That also hints at how you might parallelize output + guardrails on the output: simply stream the text and evaluate continuously.
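A minimal asyncio sketch of that racing pattern (both calls are hypothetical stand-ins; the guardrail is assumed to finish before the model):

```python
import asyncio

async def run_guardrail(prompt: str) -> bool:
    # Hypothetical moderation call; True means the prompt is allowed.
    await asyncio.sleep(0.05)
    return "forbidden" not in prompt

async def run_model(prompt: str) -> str:
    # Hypothetical, slower model call.
    await asyncio.sleep(0.2)
    return f"answer to: {prompt}"

async def generate_guarded(prompt: str) -> str:
    # Start both concurrently instead of guardrail-then-model.
    guard = asyncio.create_task(run_guardrail(prompt))
    model = asyncio.create_task(run_model(prompt))
    if not await guard:
        model.cancel()  # abort the in-flight model call
        raise ValueError("request rejected by guardrail")
    return await model
```

For streamed outputs the same idea applies per chunk: feed each emitted chunk to the guardrail and cancel the stream the moment it flags a violation.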
