
The Real Cost of Vector Storage: S3 Vectors vs OpenSearch vs pgvector vs Pinecone
At re:Invent 2025, AWS announced that Amazon S3 Vectors was GA with a 40x capacity bump (2 billion vectors per index). The widely quoted headline was "up to 90% lower cost than specialized vector databases." That claim still holds, but as with almost any technology, the details vary widely with your situation and workload.
I have been working on this article for a while, but on May 28, 2026, AWS quietly knocked out one of the comparison's load-bearing arguments. The next-generation Amazon OpenSearch Serverless went GA with scale-to-zero compute and no minimum OCU floor. I'm a big fan of serverless services and seeing something that fits that definition much more than before is great news. The "$700/month idle" criticism that disqualified OpenSearch from every dev, demo, and bursty workload last year is mostly gone now. So I modified what I was working on here to compare S3 Vectors against the NEW OpenSearch Serverless (now called "NextGen"), with the old architecture (now "Classic") available as a historical reference column in the calculator.
S3 Vectors' pricing is still one of the most workload-sensitive models AWS has shipped in years. It charges almost nothing when you don't query, then ramps hard with both query count and index size. NextGen OpenSearch Serverless now also scales to zero, but you pay a 10-30s cold-start (we measured ~15s at 50K vectors) on the first query after idle. Aurora pgvector falls in the middle. Pinecone Serverless has a minimum that swallows hobby workloads but flattens out at scale. None of these stores is the cheapest at every shape of workload, and with NextGen reshuffling the deck the crossovers between them are surprising.
This post walks through a hands-on cost model and a live benchmark harness across all four stores. Everything in here runs from one Terraform apply and a make load && make bench, on Python 3.14, against us-east-1. The companion repo is at github.com/RDarrylR/aws-vector-hosting-comparison. You can clone it and try the working four-way comparison yourself.

Why the blog comparisons don't always tell the full story with a real workload
Most "S3 Vectors vs X" articles you can read today pick a single point on the workload curve and report the winner at that point. The trouble is that the four contenders use four genuinely different pricing models, and the rank order flips three times as you move along just one axis - query rate.
| Store | What you actually pay for |
|---|---|
| Amazon S3 Vectors | Storage per logical GB-month + per-GB writes + per-million query API + per-TB query data processed |
| Amazon OpenSearch Serverless NextGen | OCU-hours when warm, $0 compute when idle (10-min idle timeout), + S3-backed storage per GB-month |
| Amazon OpenSearch Serverless Classic (pre-May 2026) | OCU-hours with 2+2 (or 1+1) OCU minimum floor + S3-backed storage per GB-month |
| Aurora PostgreSQL Serverless v2 + pgvector | ACU-hours + storage per GB-month + I/O per million (or I/O-Optimized flat rate) |
| Pinecone Serverless (Standard) | Storage per GB-month + Read Units + Write Units + $50/month minimum |
That last column is the only thing that matters. As of May 28, 2026, all three AWS-native options can reach a true idle-cost floor in different ways: S3 Vectors has no provisioned compute, NextGen OSS can scale compute to zero after a 10-minute idle, and Aurora Serverless v2 can auto-pause when configured with min_capacity = 0. Pinecone is different: the Starter tier can be free for small demos, but the paid Standard tier still carries a $50/month minimum regardless of usage. This demo deliberately keeps Aurora at a min_capacity = 0.5 floor so its warm queries stay sub-100ms; if you'd rather pay nothing while idle and accept the resume latency on the next query, drop the floor to 0 in infrastructure/modules/aurora-pgvector/main.tf. The "cheapest store for a vector index" question has no answer until you commit to a workload shape, and most workloads have several shapes layered on top of each other (idle development, bursty user traffic, periodic batch reindexing).
So instead of a single answer, the goal of this post is a decision framework with the math underneath it, plus a repo you can clone to plug in your own numbers.
The four contenders, current versions
We are running this on the May 2026 release of each:
- Amazon S3 Vectors: GA December 2025, expanded to 17 additional regions in March 2026. Up to 2B vectors per index, 10K indexes per vector bucket. Native PUT/GET/Query/Delete Vector APIs. SSE-S3 or SSE-KMS encryption.
- Amazon OpenSearch Serverless NextGen: GA May 28, 2026. VECTORSEARCH collection type, FAISS HNSW. Scale-to-zero on a 10-minute idle timeout, no minimum OCU floor when configured for it. Compute and storage decoupled via a "collection group" resource. AWS documents 10-30s first-query latency when waking from zero; we measured ~15s at 50K vectors in this demo. Once warm, client-side query latency sits around 90-100ms p50 at this index size (server-side query time is faster, but the boto3/HTTPS/JSON round trip adds a fixed ~50ms of overhead). The original architecture is still available as "Classic" with the old OCU floor. Worth watching out for: depending on how you create the collection group (Terraform, SDK, CLI version, console), you can still accidentally end up with Classic infrastructure. The cost model in this article assumes a collection group created with generation = "NEXTGEN" and minIndexingCapacityInOCU = minSearchCapacityInOCU = 0. Check the console's Serverless generation field after creation to confirm.
- Aurora PostgreSQL Serverless v2, engine 17.4 with pgvector 0.8.0. HNSW indexes, vector_cosine_ops. ACU range 0.5 to 16 in this demo. Data API (HTTP/IAM) for queries so Lambda can talk to it without a VPC.
- Pinecone Serverless 2026.05 on AWS us-east-1. The cost model in this post quotes the Standard tier ($50/month minimum, $0.33/GB storage, $4.50/M Write Units, $18/M Read Units) because that's what you actually pay at production scale. The demo itself runs entirely inside Pinecone's free Starter tier (2 GB storage, 2M WU/month, 1M RU/month, no minimum, no credit card) - 50K × 1024-dim vectors is ~250 MB, well below the 2 GB ceiling, and a full bench run uses ~15K Read Units against the 1M monthly allowance. Sign up at pinecone.io, put the API key in .env, and make load-pinecone && make bench --include-pinecone will work without a paid plan.
Embeddings throughout are 1024-dimensional float32, generated by Amazon Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0). Titan V2 charges $0.00002 per 1,000 input tokens at the time of writing, which keeps the embedding budget for 50k storm-event narratives well under a dollar.
The current pricing, in one place
Every dollar figure later in the post flows from this table. The companion repo's calculator/src/vector_cost/prices.py is the source of truth - the article reflects whatever is in that file. All prices are us-east-1, May 2026.
| Resource | Price |
|---|---|
| S3 Vectors - storage | $0.06 / GB-month |
| S3 Vectors - writes (PUT) | $0.20 / GB uploaded |
| S3 Vectors - query API | $2.50 / million requests |
| S3 Vectors - query data processed, first 100K vectors | $0.004 / TB |
| S3 Vectors - query data processed, beyond 100K vectors | $0.002 / TB |
| OpenSearch Serverless NextGen - indexing OCU-hour | $0.24 |
| OpenSearch Serverless NextGen - search OCU-hour | $0.24 |
| OpenSearch Serverless NextGen - minimum OCU floor | None (scales to zero) |
| OpenSearch Serverless NextGen - idle timeout | 10 minutes |
| OpenSearch Serverless NextGen - managed storage | $0.024 / GB-month |
| OpenSearch Serverless Classic - OCU-hour (for historical comparison) | $0.24 |
| OpenSearch Serverless Classic - minimum floor (2+2 OCUs redundant) | ~$701 / month idle |
| Aurora PostgreSQL Serverless v2 - ACU-hour (Standard) | $0.12 |
| Aurora PostgreSQL Serverless v2 - storage (Standard) | $0.10 / GB-month |
| Aurora PostgreSQL Serverless v2 - I/O (Standard) | $0.20 / million requests |
| Pinecone Serverless (Standard) - storage | $0.33 / GB-month |
| Pinecone Serverless (Standard, AWS us-east-1) - Write Units | $4.50 / million |
| Pinecone Serverless (Standard, AWS us-east-1) - Read Units | $18.00 / million |
| Pinecone Serverless (Standard) - monthly minimum | $50.00 |
| Bedrock Titan Embed Text V2 | $0.00002 / 1K input tokens |
The two prices that produce most of the surprises are S3 Vectors' query data processed ($0.002 per TB scanned beyond the first 100K vectors per index) and Pinecone's Read Units, which grow with namespace size. Everything else behaves the way you'd expect.
The cost model: ~200 lines of pure Python
One thing I built is a cost calculator. It isn't a calculator in the AWS Pricing Calculator sense - that one won't model S3 Vectors query data processed against a fluctuating index size. It's also not a benchmark tool. It's a single dataclass per store plus a function that returns the monthly bill broken down into named line items, given a Workload.
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    n_vectors: int
    dim: int = 1024
    queries_per_month: int = 0
    writes_per_month: int = 0
    metadata_kb: float = 1.0
    bytes_per_value: int = 4

    @property
    def vector_bytes(self) -> float:
        # Raw vector payload + 16 bytes of per-vector overhead + metadata.
        return self.dim * self.bytes_per_value + 16 + self.metadata_kb * 1024
For S3 Vectors specifically, the cost function looks like this (the rest are in model.py in the repo):
def S3VectorsCosts(w: Workload) -> StoreCost:
    p = PRICES["s3vectors"]
    storage = w.total_gb * p.storage_per_gb_month
    puts = (w.writes_per_month * w.vector_bytes / BYTES_PER_GB) * p.put_per_gb
    query_api = (w.queries_per_month / 1_000_000) * p.query_api_per_million
    # Query data processed is tiered: first 100K vectors, then the rest.
    tier1_vectors = min(w.n_vectors, p.tier1_cutoff_vectors)
    tier2_vectors = max(0, w.n_vectors - p.tier1_cutoff_vectors)
    tier1_tb = tier1_vectors * w.vector_bytes / BYTES_PER_TB
    tier2_tb = tier2_vectors * w.vector_bytes / BYTES_PER_TB
    query_data = w.queries_per_month * (
        tier1_tb * p.query_data_per_tb_tier1
        + tier2_tb * p.query_data_per_tb_tier2
    )
    return StoreCost(
        store="S3 Vectors",
        monthly_total=storage + puts + query_api + query_data,
        line_items={
            "storage": storage, "writes": puts,
            "query_api": query_api, "query_data_processed": query_data,
        },
    )
The line-item dict is the whole point. The article quotes each component separately; the calculator does the same so you can argue with my numbers by inspecting which term blew up.
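To make the arithmetic checkable without cloning the repo, here is a minimal self-contained sketch of the same S3 Vectors formula using the prices from the table above. The class and function names (S3VectorsPrices, s3_vectors_monthly) are hypothetical, and the binary unit constants (2**30, 2**40) are an assumption: they happen to reproduce the figures quoted in this article, but the repo may define them differently. Writes are left out for brevity.

```python
from dataclasses import dataclass

# Assumption: binary GB/TB. These reproduce the article's numbers,
# but are not confirmed to be the repo's constants.
BYTES_PER_GB = 2**30
BYTES_PER_TB = 2**40


@dataclass(frozen=True)
class S3VectorsPrices:
    storage_per_gb_month: float = 0.06
    query_api_per_million: float = 2.50
    query_data_per_tb_tier1: float = 0.004   # first 100K vectors
    query_data_per_tb_tier2: float = 0.002   # beyond 100K vectors
    tier1_cutoff_vectors: int = 100_000


def s3_vectors_monthly(
    n_vectors: int,
    queries_per_month: int,
    dim: int = 1024,
    metadata_kb: float = 1.0,
    prices: S3VectorsPrices = S3VectorsPrices(),
) -> dict:
    """Return the monthly bill as named line items (writes omitted)."""
    vector_bytes = dim * 4 + 16 + metadata_kb * 1024
    storage = n_vectors * vector_bytes / BYTES_PER_GB * prices.storage_per_gb_month
    query_api = queries_per_month / 1_000_000 * prices.query_api_per_million
    tier1 = min(n_vectors, prices.tier1_cutoff_vectors)
    tier2 = max(0, n_vectors - prices.tier1_cutoff_vectors)
    per_query = (
        tier1 * vector_bytes / BYTES_PER_TB * prices.query_data_per_tb_tier1
        + tier2 * vector_bytes / BYTES_PER_TB * prices.query_data_per_tb_tier2
    )
    return {
        "storage": storage,
        "query_api": query_api,
        "query_data_processed": queries_per_month * per_query,
    }


items = s3_vectors_monthly(n_vectors=100_000_000, queries_per_month=10_000_000)
total = sum(items.values())
```

Under those assumptions, the 100M-vector, 10M q/mo workload comes out at roughly $9,405, with nearly all of it in the query_data_processed line item.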
A CLI wraps it:
$ uv run vector-cost report --dim 1024 --metadata-kb 1 --output markdown
prints the canonical scenarios I'm using below.
Cost at three scales, six workload shapes
We have six scenarios and four stores. These can be pulled from the calculator with make cost. NextGen is the default OpenSearch column; the parenthesized Classic numbers are what you would have paid for the same workload before May 28.
| Scenario | S3 Vectors | OSS NextGen (Classic) | Aurora pgvector | Pinecone |
|---|---|---|---|---|
| 10M vectors, 1K q/mo (idle dev) | $3 | $121 ($702) | $529 | $50 |
| 10M vectors, 1M q/mo (steady) | $100 | $527 ($702) | $541 | $358 |
| 100M vectors, 10K q/mo (infrequent) | $38 | $2,114 ($2,290) | $1,450 | $163 |
| 100M vectors, 10M q/mo (steady) | $9,405 | $2,114 ($2,290) | $1,578* | $4,478 |
| 1B vectors, 1K q/mo (archival RAG) | $297 | $4,206 ($18,169) | $1,882* | $1,584 |
| 1B vectors, 10M q/mo (steady) | $93,746 | $17,985 ($18,169) | $2,010* | $6,803 |
* Aurora pgvector pricing is calculated with the cluster capped at 16 ACUs (32 GiB RAM). The HNSW working set for 100M-1B vectors won't fit and queries will fall off a cliff. The model is happy because the model only knows about price; the database isn't happy. We come back to this.
Four things jump out:
- S3 Vectors is still the cheapest for truly idle workloads, but the gap to OpenSearch Serverless just collapsed from ~230x to ~40x. At 10M vectors and 1K queries per month, S3 Vectors costs $3 and NextGen OSS costs $121 - both viable. Before May 28, the same workload on Classic OSS would have been $702. That's a big change.
- The S3 Vectors cost curve gets vertical above ~1M queries/month at 100M vectors. Query data processed scales linearly with vectors × queries. At 100M × 10M q/mo you're processing roughly 4.7 million TB (~4.7 EB) per month - 100M vectors × ~5 KB on the read path × 10M queries works out to that ballpark - and even at $0.002/TB that's about $9,400. The same workload on NextGen OSS is $2,114 because OSS isn't paying per query.
- Aurora pgvector still dominates the middle band (10M-100M vectors, steady traffic) when the working set fits in RAM. At 10M vectors and a million queries a month, Aurora is $541 against NextGen OSS's $527 - they're effectively tied, and the right answer depends on whether you need SQL joins (Aurora) or you'd rather pay the 10-30s cold-start tax to get cheaper idle (OSS NextGen). Warm client-side latency's roughly in the same band for both at this scale (see the latency section), so it's not a tiebreaker on its own.
- NextGen OSS scales down on archival workloads. The 1B-vector archival row is the most striking change: $4,206/mo on NextGen vs $18,169/mo on Classic - a 4x reduction purely because 1K queries/month is one query every 43 minutes, well past the 10-minute idle timeout, so the collection spends most of the month at scale-zero. The Classic floor of 2+2 OCUs was paying for compute that was almost never used.
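The archival argument in the last bullet can be sketched as a warm-time fraction. This is a rough model, not the calculator's actual code: it assumes queries are evenly spaced, that each query keeps compute warm for one idle timeout, and a 30-day month (which is what gives the "one query every 43 minutes" figure). The function name warm_fraction is hypothetical.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, matching the ~43-minute figure


def warm_fraction(
    queries_per_month: int,
    idle_timeout_min: float = 10.0,
    hours_per_month: float = HOURS_PER_MONTH,
) -> float:
    """Fraction of the month compute stays warm, for evenly spaced queries."""
    if queries_per_month <= 0:
        return 0.0
    interval_min = hours_per_month * 60 / queries_per_month
    # Each query holds compute warm for one idle timeout; once queries arrive
    # faster than the timeout, the collection never scales down.
    return min(1.0, idle_timeout_min / interval_min)


archival = warm_fraction(1_000)     # one query every ~43 minutes
steady = warm_fraction(100_000)     # one query every ~26 seconds
```

With 1K queries a month the collection is warm for a bit under a quarter of the time, which is in line with the roughly 4x gap between the NextGen and Classic columns. At 100K queries a month the fraction is 1.0 and scale-to-zero stops helping.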
Where the crossovers actually are
Plotting cost as a function of monthly query volume at a fixed 100M-vector index makes the framework concrete. From the calculator:
| Queries/month | S3 Vectors | OSS NextGen | Aurora pgvector | Pinecone | Cheapest |
|---|---|---|---|---|---|
| 0 (idle) | $29 | $12 | $1,450 | $158 | OSS NextGen |
| 10 q/mo | $29 | $17 | $1,450 | $158 | OSS NextGen |
| 1k q/mo | $30 | $492 | $1,450 | $159 | S3 Vectors |
| 100k q/mo | $123 | $2,114 | $1,451 | $202 | S3 Vectors |
| 1M q/mo | $966 | $2,114 | $1,462 | $590 | Pinecone |
| 2.5M q/mo | $2,373 | $2,114 | $1,482 | $1,238 | Pinecone |
| 5M q/mo | $4,717 | $2,114 | $1,514 | $2,318 | Aurora pgvector |
| 10M q/mo | $9,405 | $2,114 | $1,578 | $4,478 | Aurora pgvector |
Four regimes, separated by three crossovers, worth knowing at this index size:
- NextGen OSS owns the idle and ultra-low-volume regions thanks to scale-to-zero. At 0-10 queries per month it's $12-17 against S3 Vectors' flat $29 (S3 Vectors still pays storage on the full 100M-vector index, while NextGen OSS pays only for its cheaper S3-managed storage when scaled down). This flip is small in absolute dollars but real, and it didn't exist before May 28, 2026.
- S3 Vectors wins the low-to-medium volume band (~1k-300k queries/month). NextGen OSS has warmed up and is paying its full OCU rate; S3 Vectors' per-query charges are still negligible at this scale.
- Pinecone takes over in the middle (~300k-4M q/mo). Its $50/month floor that punishes it at low scale becomes irrelevant once you have real traffic, and its per-Read-Unit pricing is cheaper than S3 Vectors' per-TB data-processed and cheaper than OSS NextGen's warm OCU floor in this band. The break-even with S3 Vectors at 100M vectors lands near 300k q/mo; with Aurora and NextGen OSS, near 3-4M q/mo.
- Aurora pgvector wins the high-volume regime (~4M queries/month and up). It has no per-query charges, so once the index is loaded, additional traffic costs almost nothing - the bill is dominated by ACU-hours which barely move with query volume.
The S3 Vectors line is purely queries × index_size. The pure S3 Vectors vs NextGen OSS crossover (ignoring the other two) lands at ~2.2M q/mo; it's just that Pinecone is cheaper than both in the band where that crossover happens, so neither is the overall cheapest there. With Classic OSS the comparable crossover was at ~2.4M q/mo because the warm price is similar; NextGen doesn't shift the high-end crossover much, it shifts the low end by making the idle region cheap.
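For readers who want to locate a crossover themselves, a break-even between two stores can be approximated by treating each bill as a fixed monthly cost plus a per-query cost. The sketch below is an illustration using the rounded dollar figures from the 100M-vector table, not the calculator's output; breakeven_queries is a hypothetical helper, and treating warm NextGen OSS as a flat $2,114 is a simplification.

```python
from typing import Optional


def breakeven_queries(
    fixed_a: float, per_query_a: float, fixed_b: float, per_query_b: float
) -> Optional[float]:
    """Monthly query count where two linear cost curves intersect.

    Returns None when the curves are parallel or only cross at q <= 0.
    """
    slope_diff = per_query_a - per_query_b
    if slope_diff == 0:
        return None
    q = (fixed_b - fixed_a) / slope_diff
    return q if q > 0 else None


# Rounded figures read off the 100M-vector table above.
s3_fixed = 29.0                                    # S3 Vectors at 0 queries
s3_per_query = (9405.0 - 29.0) / 10_000_000        # slope from the 10M q/mo row
oss_warm_flat = 2114.0                             # NextGen OSS once always warm

crossover = breakeven_queries(s3_fixed, s3_per_query, oss_warm_flat, 0.0)
```

With these inputs the S3 Vectors and NextGen OSS lines meet a little above 2.2M queries per month, between the 1M and 2.5M rows of the table.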
Building the harness with Terraform
Everything above is a model. The companion repo also stands up real infrastructure so you can sanity-check the math against actual latency numbers and S3 Vectors invoices. The whole stack is one Terraform module against hashicorp/aws ~> 6.43.
The S3 Vectors module is the smallest of the four:
resource "aws_s3vectors_vector_bucket" "this" {
  vector_bucket_name = "${var.name_prefix}-vectors-${var.suffix}"
  force_destroy      = true

  encryption_configuration {
    sse_type = "AES256"
  }
}

resource "aws_s3vectors_index" "this" {
  vector_bucket_name = aws_s3vectors_vector_bucket.this.vector_bucket_name
  index_name         = "${var.name_prefix}-idx"
  data_type          = "float32"
  dimension          = var.vector_dim
  distance_metric    = "cosine"

  metadata_configuration {
    non_filterable_metadata_keys = ["text"]
  }
}
Note non_filterable_metadata_keys: anything in there is excluded from S3 Vectors query data processed. It still counts toward storage and PUT logical GB, but it doesn't increase the per-query data-processing charge. We keep the actual document text in non-filterable metadata for exactly that reason: returning it on a hit doesn't add to the S3 Vectors query-data-processed charge, and you never filter on full-text content anyway.
The OpenSearch Serverless module is the heaviest because OSS still uses the three-policy pattern from 2023 (encryption, network, data access), and NextGen adds one more resource: a collection_group of generation = "NEXTGEN" that holds the scale-to-zero capacity settings.
There's a wrinkle here worth flagging. The aws_opensearchserverless_collection_group resource itself shipped in the AWS provider (it was added in 6.40 around April 2026) and exposes name, standby_replicas, description, and a capacity_limits block with min_* / max_* fields. What it does NOT expose at the time of writing is the generation attribute - and the AOSS API defaults that field to CLASSIC. A Classic collection group runs the same workload but pays the per-OCU floor 24/7, so an apparently-correct Terraform plan can quietly land you on the wrong architecture.
The AWS CLI picked up generation in v2.34.56, so aws opensearchserverless create-collection-group --cli-input-json '{"generation":"NEXTGEN", ...}' works on a current CLI but not on anything older. The companion repo drives boto3 directly through a null_resource plus local-exec rather than shelling out to the CLI - boto3 is pinned in pyproject.toml/uv.lock so the version's deterministic across machines, where a CLI floor would be implicit and easy to break. When the AWS provider adds generation to the resource (track the hashicorp/terraform-provider-aws changelog), the null_resource can be swapped for a native aws_opensearchserverless_collection_group block and the helper script deleted:
resource "null_resource" "collection_group_nextgen" {
  triggers = {
    name             = local.collection_group_name
    region           = var.aws_region
    profile          = var.aws_profile
    max_indexing_ocu = "4"
    max_search_ocu   = "4"
    script_path      = "${path.module}/manage_collection_group.py"
  }

  provisioner "local-exec" {
    when        = create
    command     = "uv run python3 ${self.triggers.script_path} create --name ${self.triggers.name} --region ${self.triggers.region} --max-indexing-ocu ${self.triggers.max_indexing_ocu} --max-search-ocu ${self.triggers.max_search_ocu}"
    environment = { AWS_PROFILE = self.triggers.profile }
  }

  provisioner "local-exec" {
    when        = destroy
    command     = "uv run python3 ${self.triggers.script_path} delete --name ${self.triggers.name} --region ${self.triggers.region}"
    environment = { AWS_PROFILE = self.triggers.profile }
  }
}

resource "aws_opensearchserverless_collection" "this" {
  name                  = local.collection_name
  type                  = "VECTORSEARCH"
  standby_replicas      = "ENABLED" # NEXTGEN requires ENABLED
  collection_group_name = local.collection_group_name

  depends_on = [
    null_resource.collection_group_nextgen,
    aws_opensearchserverless_security_policy.encryption,
    aws_opensearchserverless_security_policy.network,
    aws_opensearchserverless_access_policy.data,
  ]
}
The boto3 helper at modules/opensearch/manage_collection_group.py is short - it calls create_collection_group(name=..., standbyReplicas="ENABLED", generation="NEXTGEN", capacityLimits=...), treats ConflictException and missing-on-delete as success so it's idempotent, and disables botocore's client-side parameter validation so the call works across boto3 minor versions. A few things to look out for if you're adapting this to your own module:
- standbyReplicas = "ENABLED" is required on a NEXTGEN collection group; the API rejects DISABLED. The collection inside the group also needs standby_replicas = "ENABLED". The doubled cost compared to a Classic dev collection is real (roughly 2x the warm OCU rate) but the scale-to-zero savings dwarf it for anything that isn't continuously busy.
- Scale-to-zero is set by minIndexingCapacityInOCU = 0 and minSearchCapacityInOCU = 0 on the collection group. NEXTGEN groups default both to 0 if you omit them, so it's easy to get scale-to-zero by accident; this module sets them explicitly so the intent's obvious in the source. Any non-zero minimum is a permanent OCU floor and disables scale-to-zero. The max_*_capacity_in_ocu values cap how far the group can scale up under load (4+4 here is fine for the demo; production indexes typically need more).
- After apply, the AOSS console shows a Serverless generation field on the collection detail page. If it reads NextGen, scale-to-zero is active. If it reads Classic, the boto3 step didn't run or didn't pass generation - and the cost model in this article doesn't apply to that deployment.
Aurora is straightforward Serverless v2 plus an HTTP/IAM Data API endpoint, which is the difference between "the benchmarker Lambda needs a VPC" and "the benchmarker Lambda is a tiny stateless function." The Lambda's IAM policy is the typical short list:
statement {
  sid    = "DataApiAccess"
  effect = "Allow"
  actions = [
    "rds-data:ExecuteStatement",
    "rds-data:BatchExecuteStatement",
    "rds-data:BeginTransaction",
    "rds-data:CommitTransaction",
    "rds-data:RollbackTransaction",
  ]
  resources = [aws_rds_cluster.this.arn]
}

statement {
  sid       = "SecretAccess"
  effect    = "Allow"
  actions   = ["secretsmanager:GetSecretValue"]
  resources = [aws_rds_cluster.this.master_user_secret[0].secret_arn]
}
Least-privilege all the way through: the Lambda gets s3vectors:QueryVectors (not PutVectors) on the specific index ARN, not * (the IAM policy uses arn:aws:s3vectors:REGION:ACCOUNT:bucket/BUCKET/index/INDEX), only the Data API verbs on the one Aurora cluster ARN, and only GetSecretValue on the one secret ARN. For OpenSearch Serverless, IAM's coarser (aoss:APIAccessAll is the only data-plane action AOSS exposes), so the real least-privilege lives in the data access policy. Ours grants only the verbs the benchmarker actually needs - the Lambda doesn't need write or admin permissions on indexes:
resource "aws_opensearchserverless_access_policy" "data" {
  name = "${var.name_prefix}-data-${var.suffix}"
  type = "data"
  policy = jsonencode([{
    Rules = [
      {
        Resource     = ["collection/${local.collection_name}"]
        Permission   = ["aoss:DescribeCollectionItems"]
        ResourceType = "collection"
      },
      {
        Resource     = ["index/${local.collection_name}/*"]
        Permission   = ["aoss:DescribeIndex", "aoss:ReadDocument"]
        ResourceType = "index"
      },
    ]
    Principal = [var.benchmarker_role_arn]
  }])
}
The repo data policy is slightly broader because the loader scripts need create/write to bootstrap the index; in production you'd split that into a separate principal that only runs at ingest time. The full IAM document is in infrastructure/modules/benchmarker/main.tf, and the access policy is in infrastructure/modules/opensearch/main.tf.
The embedding pipeline: 50,000 storm-event narratives
For an apples-to-apples comparison the same vectors have to land in all four stores. The corpus is 50,000 short weather-event narratives - a procedurally-generated dataset that lives in demo/src/vector_demo/corpus.py and was written specifically for this repo. The seed is fixed, so two clones produce the same vectors.
A sample doc looks like this:
At 14:33 UTC, a supercell with rotating mesocyclone moved through Topeka.
Observers reported the storm tore the roof off a metal-framed warehouse.
Storm total: 2.4 in of precipitation; peak gust 87 mph. No fatalities reported.
That's deliberately not a wine-review or product-description dataset. Every public S3 Vectors / pgvector tutorial uses one of those. We want to embed text the model has never seen verbatim.
Embedding runs against Titan V2:
import json

from tenacity import retry, stop_after_attempt, wait_exponential_jitter

MODEL_ID = "amazon.titan-embed-text-v2:0"

@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=0.5, max=30))
def _invoke(client, text: str, dim: int) -> list[float]:
    body = json.dumps({"inputText": text, "dimensions": dim, "normalize": True})
    resp = client.invoke_model(modelId=MODEL_ID, body=body)
    return json.loads(resp["body"].read())["embedding"]
50,000 docs × ~40 input tokens each = 2M tokens at $0.02/1M = $0.04 to embed the entire corpus. Embedding is essentially free at this scale.
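The embedding budget is simple enough to express as a one-liner, which is handy when you swap in your own corpus size. A small sketch, with the hypothetical helper name embedding_cost_usd and the Titan V2 price from the pricing table as the default:

```python
def embedding_cost_usd(
    n_docs: int,
    tokens_per_doc: float,
    price_per_1k_tokens: float = 0.00002,  # Titan Text Embeddings V2, from the table above
) -> float:
    """One-off cost to embed a corpus, given an average token count per doc."""
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / 1000 * price_per_1k_tokens


corpus_cost = embedding_cost_usd(n_docs=50_000, tokens_per_doc=40)
```

The ~40 tokens per document is the article's own estimate; longer documents scale the result linearly.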
Each loader is a small Python file. The S3 Vectors loader is the most concise because the API batches up to 500 vectors per call:
@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=0.5, max=30))
def _put(client, bucket: str, index: str, vectors: list[dict]) -> None:
    client.put_vectors(
        vectorBucketName=bucket,
        indexName=index,
        vectors=vectors,
    )
A vector looks like:
{
    "key": "d0042131-3f8b...",
    "data": {"float32": [0.0231, -0.0117, ...]},
    "metadata": {"text": "At 14:33 UTC, a supercell..."},
}
Aurora is one INSERT ... ON CONFLICT DO NOTHING per batch against the Data API. OpenSearch is helpers.streaming_bulk against the AOSS endpoint signed with SigV4. Pinecone is index.upsert. The full loaders are about 60 lines each.
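The batching step that feeds a loader like _put can be sketched as follows. This is an illustration rather than the repo's loader: to_record and batched are hypothetical helper names, and the record shape simply mirrors the sample vector shown above.

```python
from typing import Iterator

MAX_VECTORS_PER_PUT = 500  # per-call batch ceiling mentioned above


def to_record(key: str, embedding: list[float], text: str) -> dict:
    """Build one record in the shape shown in the sample vector."""
    return {
        "key": key,
        "data": {"float32": embedding},
        "metadata": {"text": text},
    }


def batched(records: list[dict], size: int = MAX_VECTORS_PER_PUT) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


records = [to_record(f"doc-{i}", [0.0, 0.0], "placeholder") for i in range(1250)]
batch_sizes = [len(batch) for batch in batched(records)]
```

Each yielded batch would then be handed to a single put call, so 1,250 records become three calls.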
Latency in the wild
Once the data's loaded, the harness runs the canonical query set against each store for ten iterations with top_k=10, discards the first iteration as warmup, and captures wall-clock latency from inside the benchmarker Lambda (so the numbers include the SDK / HTTPS / parsing stack a real application would also pay). The load is 50K vectors, dim=1024, run from an arm64 Lambda in us-east-1 against stores in the same region. The Pinecone row is from a separate make bench --include-pinecone run against the free Starter tier; Pinecone's data plane lives outside AWS, so its latencies include the public-internet round-trip to Pinecone's edge.
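The measurement loop described above can be sketched in a few lines. This is a simplified stand-in for the harness, assuming a hypothetical run_query callable that issues one query against one store; the real benchmarker lives in the repo.

```python
import time
from typing import Callable, Sequence


def time_queries(
    run_query: Callable[[Sequence[float]], object],
    query_vecs: Sequence[Sequence[float]],
    iterations: int = 10,
    warmup_iterations: int = 1,
) -> list[float]:
    """Run every query `iterations` times; return wall-clock ms after warmup."""
    latencies_ms: list[float] = []
    for i in range(iterations):
        for vec in query_vecs:
            start = time.perf_counter()
            run_query(vec)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # The first pass is executed but not recorded, so cold caches
            # and connection setup do not pollute the warm percentiles.
            if i >= warmup_iterations:
                latencies_ms.append(elapsed_ms)
    return latencies_ms


calls = []
samples = time_queries(lambda vec: calls.append(vec), [[0.0, 1.0]] * 5)
```

Timing around the whole client call, rather than reading a server-side "took" field, is what makes the numbers include the SDK and HTTPS overhead.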
The table below is from an actual make apply && make load && make bench cycle on 2026-05-30. Numbers will move with cold starts, network jitter, and OCU/ACU autoscaling - the point is the order of magnitude, not the second decimal.
| Store | Warm p50 | Warm p95 | Warm p99 | Cold first-query |
|---|---|---|---|---|
| S3 Vectors | 68ms | 100ms | 107ms | 104ms (no observable cold start) |
| OSS NextGen | 94ms | 219ms | 282ms | 15,212ms (search-OCU scale-up) |
| Aurora pgvector | 56ms | 180ms | 190ms | 485ms (ACU scale-up from 0.5) |
| Pinecone Serverless | 61ms | 123ms | 672ms | 61ms (no observable cold start)* |
* Pinecone p99 of 672ms is one outlier in 200 samples; the Starter tier is shared infrastructure and occasional spikes are normal. Pinecone numbers include the laptop-to-Pinecone round-trip from the make bench host because Pinecone isn't an AWS service.
The story isn't "one store is fast and one is slow." All four warm p50s sit in a tight band of 56-94ms. The real spread is in the tails: S3 Vectors has the tightest p99 (107ms), Aurora sits at p99 190ms, OSS NextGen at p99 282ms (noisier because of background OCU scaling), and Pinecone's 672ms p99 is one shared-tenant outlier in 200 samples. AWS positions NextGen and Aurora as "single-digit milliseconds when warm," and that's roughly right on the server side of the query - but a real client paying boto3, HTTPS, JSON parsing, and rds-data Data-API overhead sees 55-100ms per call. That overhead's the same whether you're querying 50K vectors or 50M, so it's a fixed tax on top of whatever the underlying engine does.
The cold-start column is where NextGen earns the scale-to-zero cost savings - and pays for them. The first OSS query against a freshly-loaded NextGen collection took ~15 seconds while search OCUs spun up. This is the documented behavior, finally captured in a benchmark: the collection's indexing OCUs were warm from the load that just completed, but its search OCUs hadn't been touched yet, and AOSS scales them on demand. Subsequent queries hit the now-warm search OCUs and dropped to sub-300ms.
Aurora's 485ms first-query is the Serverless v2 autoscaler bumping ACUs up from the 0.5 minimum this demo keeps. Aurora doesn't go fully cold at min_capacity = 0.5 (that's the floor of always-on ACU we're paying for), but the ACU scaler still has its own warmup. If you instead set min_capacity = 0, Aurora auto-pauses after idle and resume takes longer than the 485ms you see here. S3 Vectors stays flat across cold and warm because it's S3 under the hood, with nothing to scale.
The implication for your application: NextGen's 10-30s cold start is fine for batch RAG pipelines (the cost is amortized across the batch), tolerable for back-office tools (users wait), and a non-starter for live chat. If you need NextGen's warm-path behavior but can't tolerate the first-query delay, you can keep search OCUs warm with a synthetic query every ~9 minutes. That deliberately trades away some or all of the search-side scale-to-zero savings, so model it as warm OpenSearch capacity (one or more search OCU-hours, billed continuously) rather than as a free workaround. If you can't tolerate the cold start AND can't afford continuously-warm OCUs, S3 Vectors is the better fit.
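To put a number on the keep-warm trade-off, a rough sketch follows. It assumes a 730-hour billing month (the same basis that yields the ~$701 Classic floor for 4 OCUs at $0.24) and that a synthetic query every 9 minutes keeps the stated number of search OCUs continuously billed. How many OCUs a warm collection actually holds is workload-dependent, so search_ocus is an input, not a fact. Both function names are hypothetical.

```python
import math


def keep_warm_monthly_cost(
    search_ocus: float,
    ocu_hour_price: float = 0.24,
    hours_per_month: float = 730,
) -> float:
    """Compute cost of search OCUs that never get to scale down."""
    return search_ocus * ocu_hour_price * hours_per_month


def pings_per_month(interval_min: float = 9, hours_per_month: float = 730) -> int:
    """Synthetic queries needed to stay inside the 10-minute idle timeout."""
    return math.ceil(hours_per_month * 60 / interval_min)


one_ocu_cost = keep_warm_monthly_cost(search_ocus=1)
ping_count = pings_per_month()
```

A single always-warm search OCU works out to about $175 a month under these assumptions, which is the figure to weigh against the cold-start delay.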
The cost-vs-latency tradeoff still holds, but the warm-latency numbers are tighter than the original draft suggested. At 100M vectors and 100k queries per month, S3 Vectors costs $123/mo and NextGen OSS costs $2,114/mo. The warm-p50 gap between them is roughly 25ms (S3V 68ms vs OSS 94ms - OSS NextGen is actually a touch slower at p50 for this collection size; S3 Vectors also has the tighter p99). You're paying ~$2,000/month for the option to handle a burst at high query volume without ballooning per-query charges, not for warm-path speed. Worth it for some apps, absurd for others.
The benchmark Lambda
I wanted the latency numbers to be repeatable, not a thing I ran once on my laptop. The harness ships an arm64 Lambda on the python3.14 runtime, running on a 6-hour EventBridge schedule. arm64 is ~20% cheaper than x86_64 for the same compute budget. Powertools provides three things in one decorator stack: EMF metrics, structured JSON logs, and X-Ray spans.
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger()
tracer = Tracer()
metrics = Metrics()

@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)
def handler(event, context):
    query_vecs = [_embed(q) for q in QUERIES]
    for name, fn in [
        ("S3Vectors", _bench_s3vectors),
        ("OpenSearchServerless", _bench_opensearch),
        ("AuroraPgvector", _bench_aurora),
    ]:
        try:
            latencies = fn(query_vecs)
            _publish(name, latencies)
        except Exception:
            # One failing store must not abort the run for the others.
            logger.exception("store failed", extra={"store": name})
            metrics.add_dimension(name="Store", value=name)
            metrics.add_metric(name="StoreError", unit=MetricUnit.Count, value=1)
    return {"ok": True}
_publish emits QueryLatencyP50, QueryLatencyP95, QueryLatencyP99 as EMF metrics with a Store dimension. CloudWatch turns those into per-store latency dashboards you can trend over weeks alongside the modeled cost.
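The percentile step inside a function like _publish could look like the sketch below. It uses the nearest-rank method, which is an assumption: the repo may interpolate instead, and the helper names percentile and summarize are hypothetical. The metric names match the ones listed above.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict:
    """Map latency samples to the three metric names the harness emits."""
    return {
        "QueryLatencyP50": percentile(samples, 50),
        "QueryLatencyP95": percentile(samples, 95),
        "QueryLatencyP99": percentile(samples, 99),
    }


summary = summarize([float(ms) for ms in range(1, 101)])
```

Nearest-rank always returns an observed sample, so with only a couple of hundred samples the p99 is one of the two or three slowest calls rather than a smoothed estimate.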
One caveat on log_event=True: the demo logs the full event for observability while benchmarking against a synthetic corpus. For real user traffic, turn that off (or redact query/metadata/retrieved text before logging) - vector queries and the documents they return can carry PII, customer prompts, or other sensitive content that you don't want sitting in CloudWatch Logs.
This is the part of the repo that's deliberately small. Production benchmarking is its own deep rabbit hole (warm-up loops, network jitter, percentile aggregation) and we aren't chasing nines here - we're showing that the cost crossovers are stable enough to base a decision on.
When to choose each store
The framework that falls out of the math:
For demos, PoCs, and side projects (the entry point everyone actually starts at):
- NextGen OSS is now a credible default when you want OpenSearch semantics (hybrid search, filters, BM25 alongside vectors) or when your traffic shape is uncertain. For the tiny 50K-vector demo in this repo, idle storage is effectively pocket change; for larger indexes the idle cost is still storage-driven, so check the calculator rather than assuming it's always pennies. Once warm, queries return in tens of milliseconds at this collection size (sub-10ms server-side, ~90-100ms wall-clock from the client after the boto3/HTTPS round-trip). The 10-30s cold start on the first query after a coffee break is the price you pay. Before the May 28 launch this wasn't an option because Classic OSS billed $700/month regardless. For pure vector lookup at very low query volume, S3 Vectors is still the simpler and cheaper answer, with consistent first-query latency.
- S3 Vectors still wins if you can't tolerate cold starts at all (e.g., a "hit the API and impress the stakeholder" demo). It costs ~$3/month at 10M vectors instead of NextGen's ~$120/month, but serves at a consistent 80ms.
- Pinecone free Starter tier is the third option if you want zero AWS resources and just an API key.
Pick S3 Vectors when you have:
- Many tenants with mostly-cold indexes (per-tenant chat RAG, archived embeddings, multi-region replicas you query rarely) AND need consistent latency on the first query (S3 Vectors never cold-starts).
- High aggregate vector count (100M+) and a query rate below ~300k/month per index (the new NextGen crossover).
- Latency budgets where 100ms is fine and the application can't tolerate a 10s pause on the first query of the day.
- Bedrock Knowledge Base as the consumer - the integration is native and self-synchronizing.
Pick OpenSearch Serverless NextGen when you have:
- Any workload with idle periods longer than 10 minutes - the scale-to-zero benefit kicks in immediately.
- Sustained mid-to-high throughput (300k-10M+ queries/month) where warm cost beats S3 Vectors' per-query charges.
- Tolerance for a 10-30s cold start on the first query after idle. If your users can't tolerate it directly, hide it behind a batch boundary or a "your query is processing" UX, or pre-warm with a synthetic ping.
- Hybrid search needs (BM25 + vector + filters) that exceed what pgvector reasonably supports.
Pick Aurora PostgreSQL Serverless v2 + pgvector when you have:
- A working set that fits comfortably in RAM (call it < 50M 1024-dim vectors, < 256 GB working set).
- Hard requirements on transactional consistency, joins to relational tables, or SQL-native tooling.
- An existing Postgres team and you don't want to bring up another data store.
Pick Pinecone Serverless when you have:
- A team without AWS data-plane skills, for whom Pinecone's UI/UX is worth the $50+/month minimum at production scale.
- Cross-cloud portability requirements that rule out AWS-native options.
- A workload at moderate scale (10M-100M vectors, 1M-10M q/mo) where Pinecone's read-unit pricing happens to land cheaper than OpenSearch for your specific query mix.
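The four lists above can be compressed into a rough first-pass filter. This is a deliberately simplified sketch: the Workload fields and suggest_store are hypothetical names, only the ~300k queries/month crossover and the ~50M-vector pgvector ceiling come from this post, and cold-start tolerance and Pinecone's price band are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    vectors: int
    queries_per_month: int
    needs_hybrid_search: bool = False
    needs_sql_or_transactions: bool = False
    needs_cross_cloud: bool = False


def suggest_store(w: Workload) -> str:
    """First-pass suggestion only; run the calculator before deciding."""
    # Hard requirements first: they rule stores in or out regardless of cost.
    if w.needs_cross_cloud:
        return "Pinecone Serverless"
    if w.needs_sql_or_transactions and w.vectors < 50_000_000:
        return "Aurora pgvector"
    if w.needs_hybrid_search:
        return "OpenSearch Serverless NextGen"
    # Cost shape next: ~300k queries/month is the crossover quoted above.
    if w.queries_per_month < 300_000:
        return "S3 Vectors"
    return "OpenSearch Serverless NextGen"


print(suggest_store(Workload(vectors=100_000_000, queries_per_month=100_000)))
print(suggest_store(Workload(vectors=100_000_000, queries_per_month=5_000_000)))
```

The ordering is the design choice: functional requirements are checked before cost, because a cheaper store that can't do hybrid search or joins isn't actually a candidate.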
When NOT to use this combination
S3 Vectors specifically is a poor fit when:
- Query latency below ~80ms matters. S3 Vectors is a near-100ms service in this benchmark. If your users are typing into an autocomplete box and you need a 20ms p50, it will feel sluggish.
- You query the whole index very frequently. At 1B vectors and 10M queries a month we modeled $93k a month. That's not a bug in the model, it's the pricing model working as designed. If you're going to scan a multi-TB index millions of times a month, your costs are going to look more like OpenSearch's even if your latencies look like S3 Vectors'.
- You need cross-index joins or transactional semantics. S3 Vectors does similarity search; it doesn't pretend to be a database. Aurora pgvector is the right shape there.
- You depend on rich relevance scoring beyond cosine/Euclidean/dot product. No hybrid BM25, no learned-to-rank, no custom scoring scripts. OpenSearch has all of those.
OpenSearch Serverless NextGen is a poor fit for latency-sensitive interactive applications with sporadic traffic. The 10-30s cold start on the first query after a 10-minute idle window is a non-starter for a chat app where a user opens the page once an hour. Either keep the collection warm with a synthetic ping (which gives back most of the scale-to-zero savings), pre-warm before a known traffic burst, or use S3 Vectors instead.
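Pre-warming before a known burst can be as simple as firing throwaway queries until one comes back quickly. The sketch below assumes a hypothetical prewarm helper and takes the search call as a plain callable, so it doesn't depend on any particular client; the threshold and attempt count are placeholders to tune.

```python
import time
from typing import Callable


def prewarm(
    probe: Callable[[], object],
    warm_threshold_s: float = 1.0,
    max_attempts: int = 3,
) -> list[float]:
    """Issue throwaway probe queries until one returns under the threshold.

    Returns the wall-clock seconds each attempt took, so the caller can log
    how long the cold start actually was.
    """
    durations: list[float] = []
    for _ in range(max_attempts):
        start = time.perf_counter()
        probe()  # result is discarded on purpose
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        if elapsed < warm_threshold_s:
            break  # capacity looks warm; stop pinging
    return durations


# Stand-in for a real search call: slow on the first call, fast afterwards.
calls = {"n": 0}


def fake_probe() -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.05)


durations = prewarm(fake_probe, warm_threshold_s=0.02)
print([round(d, 3) for d in durations])
```

Run on a schedule, the same probe becomes the keep-warm ping, with the cost caveat noted above: a collection that never idles never scales to zero.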
OpenSearch Serverless Classic (the pre-May 2026 architecture) is a poor fit for almost anything new in 2026. The 4-OCU floor billed $700/month even when idle - the entire reason NextGen exists. Existing Classic collections still work and are still billed the same way; just don't start a new project on Classic.
Aurora pgvector is a poor fit beyond ~100M vectors at 1024 dimensions. The HNSW index needs the working set in shared buffers, and Serverless v2 caps at 256 ACUs (~512 GB RAM) which is far less than 1B × 4 KB. You can split the index across pgvector tables, but at that point you're rebuilding what OpenSearch already does.
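The sizing argument is easy to reproduce. This sketch counts only the raw float32 payload, so the real footprint is larger once HNSW graph links and row overhead are included; the 2 GiB-per-ACU figure is my assumption, chosen because it matches the ~512 GB quoted above for 256 ACUs.

```python
BYTES_PER_FLOAT32 = 4
GIB = 1024**3

MAX_ACUS = 256
GIB_PER_ACU = 2  # assumed ratio, consistent with ~512 GB at 256 ACUs
max_ram_gib = MAX_ACUS * GIB_PER_ACU


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Raw float32 payload only; index and row overhead come on top."""
    return num_vectors * dims * BYTES_PER_FLOAT32


for n in (50_000_000, 100_000_000, 1_000_000_000):
    gib = raw_vector_bytes(n, 1024) / GIB
    verdict = "fits" if gib < max_ram_gib else "does not fit"
    print(f"{n:>13,} vectors -> {gib:8.1f} GiB raw ({verdict} in {max_ram_gib} GiB)")
```

Note that 100M vectors still fits on raw payload alone, which is why the ceiling above is a soft "~100M": the instance memory also has to hold the HNSW graph and everything else Postgres needs.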
Pinecone Serverless is a strong managed option, but in AWS-first environments the decision isn't only about cost. Make sure the security team is comfortable with API-key based access, third-party data-plane exposure, and the governance model for embeddings, metadata, and retrieved text (all of which leave your AWS account and live in Pinecone's infrastructure). Pinecone is also a poor fit for anything under ~$50/month of actual usage on Standard tier - the minimum floor turns small workloads into a flat tax - and a poor fit if your data residency requirements rule out third-party SaaS entirely.
Going to production
A few non-obvious things to wire up before this gets in front of users:
- Encryption keys. S3 Vectors supports SSE-S3 (default) and SSE-KMS with a customer-managed key. If your data is sensitive enough to require BYOK, switching from SSE-S3 to a CMK requires the right key policy or grants for the S3 Vectors service path; test it carefully. Don't assume the same key policy pattern you use for ordinary S3 object encryption will automatically work for vector indexes - the call path is different, and a working bucket-key setup can still fail on a vector PUT/query if the CMK's policy doesn't list the right principal and actions (typically kms:Decrypt and kms:GenerateDataKey).
- Multi-tenant isolation. If you're isolating tenants, use one vector index per tenant (you get 10K indexes per vector bucket) and use IAM resource ARNs (arn:aws:s3vectors:region:account:bucket/X/index/tenant-Y) to constrain reads. AWS Storage Blog covered the agent-tool-selection pattern in February 2026 and it generalizes to any multi-tenant case.
- Backups. None of these stores has an obvious "back up your vectors" story. The pragmatic move is to keep the source documents in S3 (which you're doing for ingest anyway) and treat the vector index as derived state you can rebuild. The repo's make embed && make load-s3vectors is the rebuild path.
- Monitoring the cost curve in production. S3 Vectors cost is dominated by query data processed - which is invisible to your application code but scales with index size. Set a CloudWatch alarm on AWS/S3Vectors NumberOfQueries with a threshold that matches the calculator's break-even point for your index size. The repo emits modeled monthly cost as a custom metric so you can correlate.
- Pinecone connection pooling. The Pinecone client opens a new HTTP connection per process by default. In a high-concurrency Lambda this destroys p99 latency. Reuse the Pinecone client across invocations as a module-level singleton.
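For the multi-tenant item, a per-tenant read policy might look like the sketch below. The ARN layout follows the pattern quoted in the list; the helper name, the bucket name, and the tenant- index prefix are illustrative, and the two action names are my assumption of the read-side actions, so confirm them against the S3 Vectors IAM reference before relying on this.

```python
import json


def tenant_query_policy(region: str, account_id: str, bucket: str, tenant_id: str) -> dict:
    """Build an IAM policy document scoped to a single tenant's vector index."""
    index_arn = (
        f"arn:aws:s3vectors:{region}:{account_id}:"
        f"bucket/{bucket}/index/tenant-{tenant_id}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TenantReadOnly",
                "Effect": "Allow",
                # Assumed action names; verify against the service's IAM docs.
                "Action": ["s3vectors:QueryVectors", "s3vectors:GetVectors"],
                "Resource": index_arn,
            }
        ],
    }


# Placeholder account ID and bucket name, for illustration only.
policy = tenant_query_policy("us-east-1", "123456789012", "rag-vectors", "42")
print(json.dumps(policy, indent=2))
```

Generating the document from the tenant ID keeps the resource ARN and the index naming convention in one place, which is where per-tenant isolation tends to drift.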
Things to look out for
Several things bit me while building this:
- S3 Vectors data_type is "float32", not "float". The API rejects the latter with a not-particularly-helpful schema error. The Terraform resource accepts both spellings in 6.43 but only the former actually deploys.
- non_filterable_metadata_keys is set-typed. If you don't pre-declare a key as non-filterable, the system makes it filterable and charges you for it in query data processed. This is the single biggest "I didn't know it was a setting" footgun in the S3 Vectors API - we put the document text in non-filterable so it doesn't blow up query costs.
- OpenSearch Serverless aoss:APIAccessAll is the only IAM action for data-plane API access. There's no aoss:Query or aoss:Index. Data-plane permissions are enforced through the data access policy, not IAM. This took me an hour the first time.
- The Terraform AWS provider doesn't yet expose generation on collection groups, and the API default is CLASSIC. The resource itself (aws_opensearchserverless_collection_group) shipped in provider 6.40 with name, standby_replicas, description, and a capacity_limits block, but it doesn't yet surface the generation attribute that the AOSS API uses to decide whether the group runs as NextGen or Classic. AWS CLI v2.34.56+ accepts the field, and so does any recent boto3 - the companion repo's null_resource + local-exec (see modules/opensearch/manage_collection_group.py) drives boto3 directly to bridge the gap. NEXTGEN groups also require standbyReplicas=ENABLED (Classic accepts DISABLED) and minIndexingCapacityInOCU=minSearchCapacityInOCU=0 to actually scale to zero. Without all three together, you get a Classic group (or a NEXTGEN group with a permanent OCU floor) under the same Terraform resources and 24/7 OCU billing instead of scale-to-zero - so check the console's Serverless generation field after apply to confirm what you actually deployed.
- NextGen OSS cold-start latency is real and substantial. The first query against a NextGen collection after search OCUs have scaled to zero takes ~15 seconds in our 50K-vector benchmark, while AOSS spins search capacity back up. Subsequent queries hit warm capacity and drop to sub-300ms. If you're benchmarking and the first run looks pathological, that's expected; run the bench a second time without delay or use the benchmarker Lambda's warmup=true option, which discards the first iteration.
- NextGen OSS rejects engine on knn_vector mappings. Classic accepted (and required) "engine": "faiss". NextGen's index API treats engine as an illegal field and returns illegal_argument_exception at index-creation time. Omit it on NextGen - the engine is implicit. If you're porting Classic code to NextGen, this will be your first index-creation failure.
- Aurora Data API requires enable_http_endpoint = true on the cluster and manage_master_user_password = true so Secrets Manager owns the password. Without the second flag, Data API calls fail because there's no secret to GetSecretValue against.
- Pinecone Serverless namespaces aren't free. Splitting a single index into many namespaces grows your storage cost (Pinecone bills storage per namespace including overhead). For most workloads, one namespace per logical tenant inside one index is the right unit, not "every customer gets their own index."
- Bedrock Titan V2's dimensions parameter is request-time, not configuration-time. You can ask for 256 / 512 / 1024 per call. If half your code paths ask for 1024 and half for 512, your index will be unusable. Pick one at the start.
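For the Titan dimensions item, one defensive pattern is to route every embedding call through a single constant and to check each returned vector before it is written. The sketch below is an illustration, not the repo's code: the helper names are hypothetical, and the request keys (inputText, dimensions, normalize) are my understanding of the Titan Text Embeddings V2 body, so check them against the model documentation.

```python
import json

EMBED_DIMENSIONS = 1024  # single source of truth; must match the index dimension
ALLOWED_DIMENSIONS = (256, 512, 1024)  # the sizes mentioned above

if EMBED_DIMENSIONS not in ALLOWED_DIMENSIONS:
    raise ValueError(f"unsupported embedding size: {EMBED_DIMENSIONS}")


def titan_v2_body(text: str) -> str:
    """Build the request body; callers cannot choose a different dimension."""
    return json.dumps(
        {"inputText": text, "dimensions": EMBED_DIMENSIONS, "normalize": True}
    )


def check_embedding(vector: list[float]) -> list[float]:
    """Fail loudly before a wrong-sized vector reaches the index."""
    if len(vector) != EMBED_DIMENSIONS:
        raise ValueError(
            f"expected {EMBED_DIMENSIONS} dimensions, got {len(vector)}"
        )
    return vector


body = json.loads(titan_v2_body("hello vectors"))
print(body["dimensions"])
```

The string returned by titan_v2_body would be passed as the body of the Bedrock runtime call, and check_embedding wraps whatever comes back. Because no call site can pass its own dimension, the mixed 512/1024 failure described above can't happen by accident.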
Cost of running the demo itself
Running this end-to-end once - provision, embed, load, bench, tear down - costs well under USD $1 at 50,000 vectors on us-east-1. Where the spend goes:
| Item | What it costs |
|---|---|
| Bedrock Titan V2 embedding 50K docs (~2M tokens) | $0.04 |
| Aurora Serverless v2 cluster, ~30 min alive at 0.5-2 ACU | $0.06 - $0.20 |
| OpenSearch Serverless NextGen, ~30 min of active use (~2 OCUs during load + bench, $0 when idle) | $0.10 - $0.25 |
| S3 Vectors storage + 50K PUT writes (~245 MB logical) | $0.06 |
| Benchmarker Lambda invocations + CloudWatch logs/metrics | < $0.01 |
| Optional Pinecone free Starter tier (50K vectors, no card) | $0.00 |
Aurora Serverless v2 still bills continuously while the cluster is alive. NextGen OSS bills only while warm - if you leave the bench running for an hour you'll see ~$0.50, but if you make apply then go to lunch the OSS bill stays near zero. Run the full sequence in one sitting and the total is well under $1.
Cleanup
make destroy
That removes every AWS resource the Terraform module created: the S3 Vectors bucket and index, the OpenSearch Serverless collection and its three policies, the Aurora cluster, the benchmarker Lambda, EventBridge schedule, CloudWatch log group, and IAM roles.
Two things make destroy won't clean up:
- CloudWatch metrics emitted by the benchmarker stay on the account for the standard 15-month retention. Metrics are free at this volume.
- Pinecone if you enabled it. Sign in to the Pinecone console and delete the vector-hosting-comparison index, or call pc.delete_index("vector-hosting-comparison"). The Pinecone Standard-tier minimum is prorated daily so leaving it overnight isn't free.
To verify nothing is still running:
aws --region us-east-1 s3vectors list-vector-buckets
aws --region us-east-1 opensearchserverless list-collections
aws --region us-east-1 rds describe-db-clusters \
--query 'DBClusters[].DBClusterIdentifier'
All three should return empty after make destroy.
Wrapping up
The "right" vector store is a question of workload shape, not vendor. From the math and the harness, after the May 28 NextGen launch:
- S3 Vectors is still dramatically cheaper for infrequent or massive-scale-low-query workloads where every query needs a consistent latency floor (no cold starts). Still dramatically more expensive at sustained high query volume on a large index.
- OpenSearch Serverless NextGen is now a strong default candidate for "I want production-grade vector search and don't know my traffic shape yet." Scale-to-zero means a PoC costs cents; a busy production app pays the same warm-OCU rates as Classic did, but only during peak hours. The 10-30s cold start is the new tradeoff to manage.
- Aurora pgvector is the unfashionable but correct middle-ground answer for moderate-scale workloads that fit in RAM and want SQL on top.
- Pinecone Serverless is a defensible choice when you need a third-party SaaS or cross-cloud portability - and the free Starter tier still makes it a no-cost option for getting started.
The 90% cost-reduction claim from S3 Vectors GA was true. Five months later, NextGen OSS closed half of that gap for the workloads most people actually run. The headlines are converging; the workload-shape thinking that picks between them is more important than ever.
You can clone the repo, run make plan to see what gets created, make apply to provision (~10 minutes), then make embed && make load && make bench && make cost. The whole comparison runs end-to-end in under thirty minutes and tears down with make destroy for under a dollar of AWS spend at 50K vectors.
If you find a workload shape where the framework doesn't survive contact with reality, the repo's calculator is in vector_cost/model.py.
Resources
Companion code
- github.com/RDarrylR/aws-vector-hosting-comparison - everything in this post, runnable end-to-end
- calculator/src/vector_cost/model.py - the cost model itself, ~200 lines
- calculator/src/vector_cost/prices.py - the published prices (single source of truth)
- lambdas/benchmarker/src/benchmarker/handler.py - arm64 Lambda + Powertools EMF metrics
Amazon S3 Vectors
- Amazon S3 Vectors now generally available with increased scale and performance - the December 2025 GA announcement
- Working with S3 Vectors and vector buckets - user-guide overview, APIs, and limits
- S3 Vectors best practices - batched inserts, multi-tenant isolation, non-filterable metadata
- Optimize agent tool selection using S3 Vectors and Bedrock Knowledge Bases - the agent tool-routing pattern
- Query billion-scale vectors with SQL: Integrating S3 Vectors and Aurora PostgreSQL - tiered hot/cold pattern
- S3 pricing page - S3 Vectors section, source of every dollar figure here
OpenSearch Serverless NextGen and pgvector
- Introducing the next generation of Amazon OpenSearch Serverless - May 28, 2026 GA announcement
- The next generation of Amazon OpenSearch Serverless: Built from the ground up for agents - scale-to-zero, idle timeout, cold-start behavior
- Amazon OpenSearch Serverless pricing - NextGen and Classic OCU-hour rates
- Creating NextGen collections - collection groups, capacity limits, scale-to-zero
- pgvector on Aurora PostgreSQL - HNSW indexes, vector_cosine_ops
- Aurora Serverless v2 pricing - ACU-hour, Aurora Standard vs I/O-Optimized
Bedrock embeddings and Pinecone
- Amazon Titan Text Embeddings V2 - model card, dimension parameter
- Amazon Bedrock pricing - on-demand token pricing for Titan and others
- Pinecone Serverless pricing - tier matrix, Read/Write Units, regional adjustments
Terraform and tooling
- hashicorp/aws provider docs - aws_s3vectors_vector_bucket
- hashicorp/aws provider docs - aws_s3vectors_index
- hashicorp/aws provider docs - aws_opensearchserverless_collection
- hashicorp/aws provider docs - aws_opensearchserverless_collection_group - NextGen collection group, scale-to-zero capacity limits
- AWS Lambda Powertools for Python - EMF metrics, structured logging, tracing
- uv - the Python package manager used in the repo
Related posts on my site
- Serverless Analytics from Your Laptop: S3 Tables, DuckDB, and an OpenAQ Lakehouse - the same "managed object store as a primary data store" pattern, applied to Iceberg tables
- Aurora PostgreSQL Express Configuration: From Zero to Production Database in 30 Seconds - background on Aurora Serverless v2 setup
- Lambda Managed Instances with Terraform - more on Lambda packaging and the arm64 cost case
Connect with me on X, Bluesky, LinkedIn, GitHub, Medium, Dev.to, or the AWS Builder Center. Check out more of my projects at darryl-ruggles.cloud and join the Believe In Serverless community.