Thanos uses object storage as primary storage for metrics and metadata related to them. In this document you can learn how to configure your object storage and what the data layout and format are for the primary Thanos components that are "block" aware, such as sidecar, compact, receive and store gateway.
Thanos supports any object store that can be implemented against the Thanos objstore.Bucket interface.
All clients can be configured using --objstore.config-file to reference a configuration file, or --objstore.config to pass the YAML content directly.
We recommend the latter as it gives an explicit static view of configuration for each component. It also saves you the fuss of creating and managing an additional file.
Don't be afraid of multiline flags!
In Kubernetes it is as easy as (on Thanos sidecar example):
- args:
  - sidecar
  - |
    --objstore.config=type: GCS
    config:
      bucket: <bucket>
  - --prometheus.url=http://localhost:9090
  - |
    --tracing.config=type: STACKDRIVER
    config:
      service_name: ""
      project_id: <project>
      sample_factor: 16
  - --tsdb.path=/prometheus-data
Current object storage client implementations:
Provider | Maturity | Aimed For | Auto-tested on CI | Maintainers |
---|---|---|---|---|
Google Cloud Storage | Stable | Production Usage | yes | @bwplotka |
AWS/S3 (and all S3-compatible storages e.g disk-based Minio) | Stable | Production Usage | yes | @bwplotka |
Azure Storage Account | Stable | Production Usage | no | @vglafirov |
OpenStack Swift | Beta (working PoC) | Production Usage | yes | @FUSAKLA |
Tencent COS | Beta | Production Usage | no | @jojohappy,@hanjm |
AliYun OSS | Beta | Production Usage | no | @shaulboozhiao,@wujinhu |
Local Filesystem | Stable | Testing and Demo only | yes | @bwplotka |
Missing support for some object storage? Check out the section on how to add your client.
NOTE: Currently Thanos requires strong consistency (write-read) for object store implementation for singleton Compaction purposes.
Thanos uses the minio client library to upload Prometheus data into AWS S3.
You can configure an S3 bucket as an object store with YAML, either by passing the configuration directly to the --objstore.config parameter, or (preferably) by passing the path to a configuration file to the --objstore.config-file option.
NOTE: The minio client was built mainly for AWS S3, but it can be configured against other S3-compatible object storages, e.g. Ceph.
type: S3
config:
  bucket: ""
  endpoint: ""
  region: ""
  access_key: ""
  insecure: false
  signature_version2: false
  secret_key: ""
  put_user_metadata: {}
  http_config:
    idle_conn_timeout: 1m30s
    response_header_timeout: 2m
    insecure_skip_verify: false
    tls_handshake_timeout: 10s
    expect_continue_timeout: 1s
    max_idle_conns: 100
    max_idle_conns_per_host: 100
    max_conns_per_host: 0
  trace:
    enable: false
  list_objects_version: ""
  part_size: 67108864
  sse_config:
    type: ""
    kms_key_id: ""
    kms_encryption_context: {}
    encryption_key: ""
At a minimum, you will need to provide a value for the bucket, endpoint, access_key, and secret_key keys. The rest of the keys are optional.
The AWS region to endpoint mapping can be found in this link.
Make sure you use the correct signature version. Currently AWS requires signature v4, so it needs signature_version2: false. If you don't specify it, you will get an Access Denied error. On the other hand, several S3 compatible APIs use signature_version2: true.
You can configure the timeout settings for the HTTP client by setting the http_config.idle_conn_timeout and http_config.response_header_timeout keys. As a rule of thumb, if you are seeing errors like timeout awaiting response headers in your logs, you may want to increase the value of http_config.response_header_timeout.
Please refer to the documentation of the Transport type in the net/http package for detailed information on what each option does.
part_size is specified in bytes and refers to the minimum file size used for multipart uploads, as some custom S3 implementations may have different requirements. A value of 0 means to use a default 128 MiB size.
Set list_objects_version: "v1" for S3 compatible APIs that don't support ListObjectsV2 (e.g. some versions of Ceph). The default value ("") is equivalent to "v2".
For debug and testing purposes you can set:
- insecure: true to switch to plain insecure HTTP instead of HTTPS
- http_config.insecure_skip_verify: true to disable TLS certificate verification (if your S3 based storage is using a self-signed certificate, for example)
- trace.enable: true to enable the minio client's verbose logging. Each request and response will be logged into the debug logger, so debug level logging must be enabled for this functionality.
SSE can be configured using the sse_config block. SSE-S3, SSE-KMS, and SSE-C are supported.
- If type is set to SSE-S3 you do not need to configure other options.
- If type is set to SSE-KMS you must set kms_key_id. The kms_encryption_context is optional, as AWS provides a default encryption context.
- If type is set to SSE-C you must provide a path to the encryption key using encryption_key.
If the sse_config block is set but the type is not one of SSE-S3, SSE-KMS, or SSE-C, an error is raised.
You will also need to apply the following AWS IAM policy for the user to access the KMS key:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KMSAccess",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKey",
                "kms:Encrypt",
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:<region>:<account>:key/<KMS key id>"
        }
    ]
}
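The sse_config rules described above can be sketched as a small validation routine. This is only an illustration of the documented rules, not the code Thanos runs: the function name validate_sse_config and the wording of the messages are hypothetical.

```python
# Hypothetical helper illustrating the documented sse_config rules.
VALID_SSE_TYPES = {"SSE-S3", "SSE-KMS", "SSE-C"}


def validate_sse_config(sse_config):
    """Return a list of problems found in an sse_config mapping (empty if valid)."""
    sse_type = sse_config.get("type", "")
    if sse_type not in VALID_SSE_TYPES:
        # Any other type is rejected when the block is set.
        return [f"unsupported sse_config type: {sse_type!r}"]
    problems = []
    if sse_type == "SSE-KMS" and not sse_config.get("kms_key_id"):
        # kms_encryption_context stays optional, so it is not checked.
        problems.append("SSE-KMS requires kms_key_id")
    if sse_type == "SSE-C" and not sse_config.get("encryption_key"):
        problems.append("SSE-C requires encryption_key (a path to the key)")
    return problems
```

An empty result means the block satisfies the rules listed above; SSE-S3 needs no further keys, so {"type": "SSE-S3"} passes as-is.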
By default Thanos will try to retrieve credentials from the following sources:
- From the config file, if BOTH access_key and secret_key are present.
- From the standard AWS environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- From ~/.aws/credentials.
- IAM credentials retrieved from an instance profile.
NOTE: Getting the access key from the config file and the secret key from another method (and vice versa) is not supported.
Example working AWS IAM policy for user:
- For deployment (policy for Thanos services):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>/*",
                "arn:aws:s3:::<bucket>"
            ]
        }
    ]
}
(No bucket policy)
To test the policy, set the environment variables for S3 access to an empty, unused bucket, as well as:
THANOS_TEST_OBJSTORE_SKIP=GCS,AZURE,SWIFT,COS,ALIYUNOSS
THANOS_ALLOW_EXISTING_BUCKET_USE=true
And run: GOCACHE=off go test -v -run TestObjStore_AcceptanceTest_e2e ./pkg/...
- For testing (policy to run e2e tests):
We need access to CreateBucket and DeleteBucket and access to all buckets:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:CreateBucket",
                "s3:DeleteBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>/*",
                "arn:aws:s3:::<bucket>"
            ]
        }
    ]
}
With this policy you should be able to set THANOS_TEST_OBJSTORE_SKIP=GCS,AZURE,SWIFT,COS,ALIYUNOSS, unset S3_BUCKET, and run all tests using make test.
Details about AWS policies: https://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html
To configure a Google Cloud Storage bucket as an object store you need to set bucket to the GCS bucket name and configure Google Application credentials.
For example:
type: GCS
config:
  bucket: ""
  service_account: ""
Application credentials are configured via a JSON file, and only the bucket needs to be specified. The client looks for:
- A JSON file whose path is specified by the GOOGLE_APPLICATION_CREDENTIALS environment variable.
- A JSON file in a location known to the gcloud command-line tool. On Windows, this is %APPDATA%/gcloud/application_default_credentials.json. On other systems, $HOME/.config/gcloud/application_default_credentials.json.
- On Google App Engine it uses the appengine.AccessToken function.
- On Google Compute Engine and Google App Engine Managed VMs, it fetches credentials from the metadata server. (In this final case any provided scopes are ignored.)
You can read more on how to get application credential json file in https://cloud.google.com/docs/authentication/production
Another possibility is to inline the ServiceAccount into the Thanos configuration and only maintain one file. This feature was added, so that the Prometheus Operator only needs to take care of one secret file.
type: GCS
config:
  bucket: "thanos"
  service_account: |-
    {
      "type": "service_account",
      "project_id": "project",
      "private_key_id": "abcdefghijklmnopqrstuvwxyz12345678906666",
      "private_key": "-----BEGIN PRIVATE KEY-----\...\n-----END PRIVATE KEY-----\n",
      "client_email": "thanos@gitpods.iam.gserviceaccount.com",
      "client_id": "123456789012345678901",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/thanos%40gitpods.iam.gserviceaccount.com"
    }
Note: GCS policies should be applied at the project level, not at the bucket level.
For deployment: Storage Object Creator and Storage Object Viewer.
For testing: Storage Object Admin, for the ability to create and delete temporary buckets.
To test the policy is working as expected, exec into the sidecar container, eg:
kubectl exec -it -n <namespace> <prometheus with sidecar pod name> -c <sidecar container name> -- /bin/sh
Then test that you can at least list objects in the bucket, eg:
thanos tools bucket ls --objstore.config="${OBJSTORE_CONFIG}"
To use Azure Storage as a Thanos object store, you need to create a storage account in advance, either from the Azure portal or using the Azure CLI. Follow the instructions from the Azure Storage documentation: https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account
To configure an Azure Storage account as an object store you need to provide the path to an Azure storage config file in the --objstore.config-file flag.
The config file format is the following:
type: AZURE
config:
  storage_account: ""
  storage_account_key: ""
  container: ""
  endpoint: ""
  max_retries: 0
  msi_resource: ""
  user_assigned_id: ""
  pipeline_config:
    max_tries: 0
    try_timeout: 0s
    retry_delay: 0s
    max_retry_delay: 0s
  reader_config:
    max_retry_requests: 0
  http_config:
    idle_conn_timeout: 0s
    response_header_timeout: 0s
    insecure_skip_verify: false
    tls_handshake_timeout: 0s
    expect_continue_timeout: 0s
    max_idle_conns: 0
    max_idle_conns_per_host: 0
    max_conns_per_host: 0
    disable_compression: false
If msi_resource is used, authentication is done via a system-assigned managed identity. The value for Azure should be https://<storage-account-name>.blob.core.windows.net.
If user_assigned_id is used, authentication is done via a user-assigned managed identity. When using user_assigned_id, the msi_resource defaults to https://<storage_account>.<endpoint>
The generic max_retries will be used as the value for pipeline_config's max_tries and reader_config's max_retry_requests. For more control, max_retries can be left unset (0) and the specific retry values set individually.
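One way to read the retry rule above is sketched below. It assumes that a non-zero max_retries takes precedence over the two specific settings, which the text implies but does not state outright; the helper name effective_retries is hypothetical.

```python
# Hypothetical helper; assumes a non-zero generic max_retries wins over
# the specific pipeline_config / reader_config values.
def effective_retries(config):
    """Resolve (max_tries, max_retry_requests) from an AZURE config mapping."""
    max_retries = config.get("max_retries", 0)
    if max_retries > 0:
        # The generic value is used for both specific settings.
        return max_retries, max_retries
    pipeline = config.get("pipeline_config", {})
    reader = config.get("reader_config", {})
    return pipeline.get("max_tries", 0), reader.get("max_retry_requests", 0)
```

For example, a config with only max_retries: 5 resolves to five tries for both settings, while a config with max_retries: 0 falls back to whatever is given under pipeline_config and reader_config.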
Thanos uses the ncw/swift client to upload Prometheus data into OpenStack Swift.
Below is an example configuration file for Thanos to use an OpenStack Swift container as an object store. Note that if the name of a user, project or tenant is used, one must also specify its domain by ID or name. Various examples for OpenStack authentication can be found in the official documentation.
By default, OpenStack Swift has a maximum file size limit of 5 GiB. Thanos index files are often larger than that. To resolve this issue, Thanos uses Static Large Objects (SLO), which are uploaded as segments. These are by default put into the segments directory of the same container. The default limit for using SLO is 1 GiB, which is also the maximum size of a segment. If you don't want to use the same container for the segments (best practice is to use <container_name>_segments to avoid polluting the listing of the container objects), you can use the large_object_segments_container_name option to override the default and put the segments into another container. In rare cases you can switch to Dynamic Large Objects (DLO) by setting use_dynamic_large_objects to true, but use it with caution since it relies even more on eventual consistency.
type: SWIFT
config:
  auth_version: 0
  auth_url: ""
  username: ""
  user_domain_name: ""
  user_domain_id: ""
  user_id: ""
  password: ""
  domain_id: ""
  domain_name: ""
  project_id: ""
  project_name: ""
  project_domain_id: ""
  project_domain_name: ""
  region_name: ""
  container_name: ""
  large_object_chunk_size: 1073741824
  large_object_segments_container_name: ""
  retries: 3
  connect_timeout: 10s
  timeout: 5m
  use_dynamic_large_objects: false
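To get a feel for the segmenting described above, the following sketch estimates how many segments a file would be split into with the default large_object_chunk_size of 1073741824 bytes (1 GiB). The function name and the exact boundary handling (a file of exactly the chunk size is treated as a plain upload) are assumptions made for illustration.

```python
# Hypothetical helper: estimates the number of SLO segments for a file.
DEFAULT_CHUNK_SIZE = 1073741824  # 1 GiB, the large_object_chunk_size default


def slo_segment_count(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return the number of segments, or 0 if the file fits in a plain upload."""
    if file_size <= chunk_size:
        return 0
    # Ceiling division using integers only.
    return (file_size + chunk_size - 1) // chunk_size
```

With these assumptions, an index file of 1962383742 bytes (about 1.8 GiB) would be uploaded as 2 segments, while a small meta.json would not be segmented at all.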
To use Tencent COS as an object store, you should first apply for a Tencent Account and create an object storage bucket. Details can be found in the Tencent Cloud documentation: https://cloud.tencent.com/document/product/436
To configure Thanos to use COS as an object store, you need to set these parameters in YAML format, stored in a file:
type: COS
config:
  bucket: ""
  region: ""
  app_id: ""
  secret_key: ""
  secret_id: ""
  http_config:
    idle_conn_timeout: 1m30s
    response_header_timeout: 2m
    tls_handshake_timeout: 10s
    expect_continue_timeout: 1s
    max_idle_conns: 100
    max_idle_conns_per_host: 100
    max_conns_per_host: 0
Set the --objstore.config-file flag to reference the configuration file.
In order to use AliYun OSS object storage, you should first create a bucket with the proper storage class and ACLs, and get the access key on the AliYun cloud. Go to https://www.alibabacloud.com/product/oss for more details.
To use AliYun OSS object storage, please specify the following YAML configuration in the --objstore.config* flag.
type: ALIYUNOSS
config:
  endpoint: ""
  bucket: ""
  access_key_id: ""
  access_key_secret: ""
Use --objstore.config-file to reference this configuration file.
In order to use Baidu BOS object storage, you should first apply for a Baidu Account and create an object storage bucket. Refer to the Baidu Cloud documentation for more details. To use Baidu BOS object storage, please specify the following YAML configuration in the --objstore.config* flag.
type: BOS
config:
  bucket: ""
  endpoint: ""
  access_key: ""
  secret_key: ""
This storage type is used when a user wants to store and access the bucket in the local filesystem. We treat the filesystem the same way we would treat object storage, so all optimizations for remote buckets apply even though the files might be local.
NOTE: This storage type is experimental and might be inefficient. It is NOT advised to use it as the main storage for metrics in production environment. Particularly there is no planned support for distributed filesystems like NFS. This is mainly useful for testing and demos.
type: FILESYSTEM
config:
  directory: ""
The following checklist describes how to add a new Go client to the supported providers:
- Create a new directory under pkg/objstore/<provider>
- Implement the objstore.Bucket interface
- Add a NewTestBucket constructor for testing purposes that creates and deletes a temporary bucket.
- Use the created NewTestBucket in the ForeachStore method to ensure we can run tests against the new provider. (In PR)
- Run the TestObjStoreAcceptanceTest against your provider to ensure it fits. Fix any errors found until the test passes. (In PR)
- Add the client implementation to the factory in the factory code. (Using as few flags as possible in every command)
- Add the client config struct to bucketcfggen to allow config auto generation.
At that point, anyone can use your provider by spec.
Thanos supports writing and reading data in native Prometheus TSDB blocks in TSDB format. This is the format used by the Prometheus TSDB database for persisting data on the local disk. With its efficient index and chunk binary formats, it is also well suited to being read directly from object storage using the range GET API.
The following sections explain this format in detail, together with the additional files and entries that the Thanos system supports.
Official docs for the Prometheus TSDB format can be found here, but this section lists the most important elements.
A TSDB block is a set of blobs (files) in a single directory (or prefix, in object storage terms) named with a ULID, e.g. 01ARZ3NDEKTSV4RRFFQ69G5FAV.
Those files contain series (labels with compressed samples) for a particular time duration (e.g. 2h) from a particular source (e.g. Prometheus or Thanos Receive).
In the Thanos system, all files are strictly immutable. (NOTE: In Prometheus too, but with some caveats like tombstones.) This means that any modification like rewrite, deletion or compaction has to be done by creating a new block and removing (with delay!) the old one.
NOTE: Any other not-known file present in this directory is ignored when reading the data. However, those can be removed when the block is being deleted from object storage/disk.
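Since block directories are named with ULIDs, a reader can tell block directories apart from other entries by checking the name. The sketch below relies on the general ULID format (26 characters of Crockford base32, of which the first 10 encode a millisecond timestamp), not on anything specific to Thanos; both function names are hypothetical.

```python
# Crockford base32 alphabet used by ULIDs (no I, L, O or U).
CROCKFORD_BASE32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def looks_like_block_id(name):
    """Return True if name has the shape of a ULID."""
    return (
        len(name) == 26
        and name[0] in "01234567"  # a 128-bit value caps the first character at 7
        and all(c in CROCKFORD_BASE32 for c in name)
    )


def ulid_timestamp_ms(ulid):
    """Decode the creation time (Unix milliseconds) from the first 10 characters."""
    value = 0
    for c in ulid[:10]:
        value = value * 32 + CROCKFORD_BASE32.index(c)
    return value
```

Because the timestamp comes first, sorting block IDs as strings also sorts them by creation time.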
Example block file structure (on the local filesystem) can look like this:
01DN3SK96XDAEKRB1AN30AAW6E:
total 2209344
drwxr-xr-x 2 bwplotka bwplotka 4096 Dec 10 2019 chunks
-rw-r--r-- 1 bwplotka bwplotka 1962383742 Dec 10 2019 index
-rw-r--r-- 1 bwplotka bwplotka 6761 Dec 10 2019 meta.json
-rw-r--r-- 1 bwplotka bwplotka 111 Dec 10 2019 delete-mark.json # <-- Optional marker.
-rw-r--r-- 1 bwplotka bwplotka 124 Dec 10 2019 no-compact-mark.json # <-- Optional marker.
01DN3SK96XDAEKRB1AN30AAW6E/chunks:
total 8202452
-rw-r--r-- 1 bwplotka bwplotka 536870490 Dec 10 2019 000001
-rw-r--r-- 1 bwplotka bwplotka 536869843 Dec 10 2019 000002
-rw-r--r-- 1 bwplotka bwplotka 536869848 Dec 10 2019 000003
-rw-r--r-- 1 bwplotka bwplotka 536868209 Dec 10 2019 000004
-rw-r--r-- 1 bwplotka bwplotka 536869517 Dec 10 2019 000005
-rw-r--r-- 1 bwplotka bwplotka 536870654 Dec 10 2019 000006
-rw-r--r-- 1 bwplotka bwplotka 536855168 Dec 10 2019 000007
-rw-r--r-- 1 bwplotka bwplotka 536859441 Dec 10 2019 000008
-rw-r--r-- 1 bwplotka bwplotka 536862863 Dec 10 2019 000009
-rw-r--r-- 1 bwplotka bwplotka 536868432 Dec 10 2019 000010
-rw-r--r-- 1 bwplotka bwplotka 536861395 Dec 10 2019 000011
-rw-r--r-- 1 bwplotka bwplotka 536870859 Dec 10 2019 000012
-rw-r--r-- 1 bwplotka bwplotka 536854971 Dec 10 2019 000013
-rw-r--r-- 1 bwplotka bwplotka 536846973 Dec 10 2019 000014
-rw-r--r-- 1 bwplotka bwplotka 536866732 Dec 10 2019 000015
-rw-r--r-- 1 bwplotka bwplotka 346266827 Dec 10 2019 000016
Let's look at each file one by one.
NOTE: Currently supported meta.json version: v1
NOTE: Currently supported meta.json Thanos section version: v1
This file is an important entry that describes the block and its data.
This file allows you to find for example:
- The block ID (ulid)
- Duration of the block (minTime and maxTime)
- Important statistics (stats.numSeries)
- How many times the block was re-compacted (compaction.level)
- Which initial smaller block IDs are part of this block (compaction.sources)
- Which smaller (including intermediate) block IDs are part of this block (compaction.parents)
- Thanos section (only visible for blocks generated by Thanos components like sidecar, receive or compact):
  - External labels for the block (identifying producers) (thanos.labels)
  - Downsampling resolution if downsampling was done on this block (thanos.downsample.resolution). 0 means no downsampling.
  - Which component created the block (thanos.source)
  - Files and their sizes that are part of this block (thanos.files)
NOTE: In theory, you can modify this data manually. However, components like Compactor and Store Gateway currently cache meta.json indefinitely (sometimes on disk, if configured), so manual cache removal and a restart might be needed.
Example meta.json file:
{
  "ulid": "01DN3SK96XDAEKRB1AN30AAW6E",
  "minTime": 1567641600000,
  "maxTime": 1568851200000,
  "stats": {
    "numSamples": 5397517846,
    "numSeries": 8377876,
    "numChunks": 67874256
  },
  "compaction": {
    "level": 4,
    "sources": [
      "01DKZNX70TQQ0R025G66ZF1V5P",
      "01DKZWS55317K7JGVMCSBR68Z2", // Trimmed items for readability.
      "01DN3GH4A71RD6NYQ2VZPBQTFH"
    ],
    "parents": [
      {
        "ulid": "01DM4WK3F9ZGW19W16MZJJFF6T",
        "minTime": 1567641600000,
        "maxTime": 1567814400000
      },
      {
        "ulid": "01DMA1BXHK3G2KDKAPMBTVATRT",
        "minTime": 1567814400000,
        "maxTime": 1567987200000
      },
      {
        "ulid": "01DMF65TY6JSTCDVTPZ094B5D6",
        "minTime": 1567987200000,
        "maxTime": 1568160000000
      },
      {
        "ulid": "01DMMB0SK28FKC55RNK7ZZWS1A",
        "minTime": 1568160000000,
        "maxTime": 1568332800000
      },
      {
        "ulid": "01DMSFSXNE8Y76G5KCQ2BABYFA",
        "minTime": 1568332800000,
        "maxTime": 1568505600000
      },
      {
        "ulid": "01DMYMM5SW0FPJSHQQQM05FBN9",
        "minTime": 1568505600000,
        "maxTime": 1568678400000
      },
      {
        "ulid": "01DN3SDE1M9W1JG7JFSM5QFP2Y",
        "minTime": 1568678400000,
        "maxTime": 1568851200000
      }
    ]
  },
  "version": 1,
  "thanos": {
    "labels": {
      "cluster": "eu1",
      "monitor": "prometheus",
      "tenant": "team-a",
      "replica": "1"
    },
    "downsample": {
      "resolution": 0
    },
    "source": "compactor",
    "files": [
      {
        "rel_path": "index",
        "size_bytes": 1313
      }, // Trimmed items for readability.
    ],
    "version": 1
  }
}
Format in Go code can be found here.
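As a quick illustration of how these fields can be used, the sketch below loads a trimmed-down copy of the example meta.json and derives the block duration and downsampling state. The helper names are hypothetical; minTime and maxTime are millisecond timestamps.

```python
import json

# A trimmed-down meta.json using values from the example above.
meta = json.loads("""
{
  "ulid": "01DN3SK96XDAEKRB1AN30AAW6E",
  "minTime": 1567641600000,
  "maxTime": 1568851200000,
  "compaction": {"level": 4},
  "thanos": {
    "labels": {"cluster": "eu1", "replica": "1"},
    "downsample": {"resolution": 0}
  }
}
""")


def block_duration_hours(meta):
    """Duration covered by the block, in hours (timestamps are in milliseconds)."""
    return (meta["maxTime"] - meta["minTime"]) / 3_600_000


def is_downsampled(meta):
    """A resolution of 0 (or a missing Thanos section) means no downsampling."""
    return meta.get("thanos", {}).get("downsample", {}).get("resolution", 0) != 0
```

For the example block this gives 336 hours, i.e. 14 days, which matches the seven 2-day parents listed under compaction.parents.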
External labels are extremely important block metadata. They are stored in meta.json in the thanos.labels section and allow identifying the producer and owner of those blocks. This information is used further by different Thanos components:
- Those labels will be visible when data is queried. You can aggregate across those in PromQL etc.
- Querier uses them to filter out the store APIs to touch during query requests.
- Many object storage readers like compactor and store gateway group the blocks by external labels. This grouping allows horizontal scalability like sharding or concurrency.
- Some of those labels can be chosen as replication labels. Querier and Compactor will then deduplicate such blocks identified by the same HA groups.
- Some of those labels can be chosen as tenancy labels. This allows read, write and storage isolation mechanisms.
The thanos.labels in meta.json are filled during block upload/creation. For example:
- Each TSDB block produced by Prometheus is labelled with the Prometheus external labels by sidecar before upload to object storage.
- Each TSDB block produced by compact is labelled with whatever the source blocks had. The exception is the deduplication process that removes the chosen replica flag(s).
- Each TSDB block produced by receive is labelled with the labels given in the repeated receive --labels flag.
The recommended information that should be given in those labels:
Example Prometheus useful external labels:
- Replication information e.g. replica="0"
- Cluster, environment, zone, so target origin e.g. cluster="eu-1-production" or cluster="1",env="production",region="us-west1"
- Tenancy information e.g. tenant="organizationABC"
NOTE: Be careful with receive external labels. Remote Write clients can stream any labels. If an incoming label duplicates an external label of receive, it will be masked by what the receiver has specified. This is why it's recommended to add a receive_ prefix to all receive labels (e.g. to avoid confusion with Prometheus replicas).
Example Receive useful external labels:
- Replication information e.g. receive_replica="0" (to not confuse with the often used Prometheus replica label).
- Cluster, environment, zone, so target origin e.g. receive_cluster="eu-west1-production-1" or receive_cluster="1",receive_env="production",receive_region="us-west1"
- Tenancy information e.g. tenant="organizationABC"
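The masking behaviour mentioned in the note above can be pictured as a dictionary merge in which the receiver's external labels win. This is a simplified model for illustration only, not the actual receive code path, and the function name is hypothetical.

```python
# Simplified model: external labels of receive override incoming series labels.
def apply_external_labels(series_labels, external_labels):
    """Return the labels a series ends up with once external labels are applied."""
    merged = dict(series_labels)
    merged.update(external_labels)  # colliding names take the receiver's value
    return merged
```

With an external label replica="0", an incoming series that already carries replica="prom-1" loses its own value. With receive_replica="0" instead, both labels survive side by side, which is the reason for the recommended receive_ prefix.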
NOTE: Currently supported index file versions: v1 and v2
This file stores the index created to allow efficient lookup of series and their samples.
All entries are sorted lexicographically unless stated otherwise.
At a high level, it allows finding:
- Label names
- Label values for a label name
- All series labels
- Chunk references for given (or all) series. These can be used to find the chunks with samples in the chunk files
The following describes the format of the index file found in each block directory. It is terminated by a table of contents which serves as an entry point into the index.
┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Symbol Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Series │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ TOC │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
When the index is written, an arbitrary number of padding bytes may be added between the lined out main sections above. When sequentially scanning through the file, any zero bytes after a section's specified length must be skipped.
Most of the sections described below start with a len field. It always specifies the number of bytes just before the trailing CRC32 checksum. The checksum is always calculated over those len bytes.
The symbol table holds a sorted list of deduplicated strings that occurred in label pairs of the stored series. They can be referenced from subsequent sections and significantly reduce the total index size.
The section contains a sequence of the string entries, each prefixed with the string's length in raw bytes. All strings are utf-8 encoded. Strings are referenced by sequential indexing. The strings are sorted in lexicographically ascending order.
┌────────────────────┬─────────────────────┐
│ len <4b> │ #symbols <4b> │
├────────────────────┴─────────────────────┤
│ ┌──────────────────────┬───────────────┐ │
│ │ len(str_1) <uvarint> │ str_1 <bytes> │ │
│ ├──────────────────────┴───────────────┤ │
│ │ . . . │ │
│ ├──────────────────────┬───────────────┤ │
│ │ len(str_n) <uvarint> │ str_n <bytes> │ │
│ └──────────────────────┴───────────────┘ │
├──────────────────────────────────────────┤
│ CRC32 <4b> │
└──────────────────────────────────────────┘
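The symbol table layout above can be sketched as follows. The sketch assumes big-endian encoding for the fixed 4-byte fields and leaves out the trailing CRC32; the function names are hypothetical and this is not the Prometheus implementation.

```python
import struct


def put_uvarint(value):
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_uvarint(buf, pos):
    """Decode an unsigned varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_symbols(symbols):
    """Build len <4b>, #symbols <4b> and the length-prefixed strings (no CRC32)."""
    entries = sorted({s.encode("utf-8") for s in symbols})  # deduplicated, sorted
    body = struct.pack(">I", len(entries))
    for raw in entries:
        body += put_uvarint(len(raw)) + raw
    return struct.pack(">I", len(body)) + body


def decode_symbols(buf):
    """Read back the strings written by encode_symbols."""
    _length, count = struct.unpack_from(">II", buf, 0)
    pos, out = 8, []
    for _ in range(count):
        size, pos = read_uvarint(buf, pos)
        out.append(buf[pos:pos + size].decode("utf-8"))
        pos += size
    return out
```

Other sections refer to a string by its sequential position in this list, so each label name and value is stored only once.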
The section contains a sequence of series that hold the label set of the series as well as its chunks within the block. The series are sorted lexicographically by their label sets. Each series section is aligned to 16 bytes. The ID for a series is the offset/16. This serves as the series' ID in all subsequent references. Thereby, a sorted list of series IDs implies a lexicographically sorted list of series label sets.
┌───────────────────────────────────────┐
│ ┌───────────────────────────────────┐ │
│ │ series_1 │ │
│ ├───────────────────────────────────┤ │
│ │ . . . │ │
│ ├───────────────────────────────────┤ │
│ │ series_n │ │
│ └───────────────────────────────────┘ │
└───────────────────────────────────────┘
Every series entry first holds its number of labels, followed by tuples of symbol table references that contain the label name and value. The label pairs are lexicographically sorted. After the labels, the number of indexed chunks is encoded, followed by a sequence of metadata entries containing the chunk's minimum (mint) and maximum (maxt) timestamp and a reference to its position in the chunk file. The mint is the time of the first sample and maxt is the time of the last sample in the chunk. Holding the time range data in the index allows dropping chunks irrelevant to queried time ranges without accessing them directly.
The mint of the first chunk is stored as is, and its maxt is stored as a delta to that mint. For subsequent chunks, mint and maxt are encoded as deltas to the previous time. Similarly, the reference of the first chunk is stored as is, and each following ref is stored as a delta to the previous one.
┌──────────────────────────────────────────────────────────────────────────┐
│ len <uvarint> │
├──────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ labels count <uvarint64> │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ ref(l_i.name) <uvarint32> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(l_i.value) <uvarint32> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ chunks count <uvarint64> │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_0.mint <varint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_0.maxt - c_0.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_0.data) <uvarint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ c_i.mint - c_i-1.maxt <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ c_i.maxt - c_i.mint <uvarint64> │ │ │
│ │ ├────────────────────────────────────────────┤ │ │
│ │ │ ref(c_i.data) - ref(c_i-1.data) <varint64> │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────────────────┤
│ CRC32 <4b> │
└──────────────────────────────────────────────────────────────────────────┘
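The delta scheme for chunk metadata shown in the diagram can be sketched as below. The sketch works on plain integers and leaves out the varint serialization; the function names are hypothetical.

```python
def delta_encode_chunks(chunks):
    """Turn a time-ordered list of (mint, maxt, ref) into the delta sequence."""
    out = []
    prev_maxt = prev_ref = None
    for i, (mint, maxt, ref) in enumerate(chunks):
        if i == 0:
            # First chunk: absolute mint, maxt relative to mint, absolute ref.
            out += [mint, maxt - mint, ref]
        else:
            # Later chunks: mint relative to the previous maxt, ref relative
            # to the previous ref.
            out += [mint - prev_maxt, maxt - mint, ref - prev_ref]
        prev_maxt, prev_ref = maxt, ref
    return out


def delta_decode_chunks(values):
    """Inverse of delta_encode_chunks."""
    chunks = []
    for i in range(0, len(values), 3):
        a, b, c = values[i:i + 3]
        if not chunks:
            mint, ref = a, c
        else:
            mint, ref = chunks[-1][1] + a, chunks[-1][2] + c
        chunks.append((mint, mint + b, ref))
    return chunks
```

Because consecutive chunks of a series are close in time and usually close in the file, the deltas are small numbers that take only a few bytes as varints.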
A label index section indexes the existing (combined) values for one or more label names. The #names field determines the number of indexed label names, followed by the total number of entries in the #entries field. The body holds #entries / #names tuples of symbol table references, each tuple being of #names length. The value tuples are sorted in lexicographically increasing order. This is no longer used.
┌───────────────┬────────────────┬────────────────┐
│ len <4b> │ #names <4b> │ #entries <4b> │
├───────────────┴────────────────┴────────────────┤
│ ┌─────────────────────────────────────────────┐ │
│ │ ref(value_0) <4b> │ │
│ ├─────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├─────────────────────────────────────────────┤ │
│ │ ref(value_n) <4b> │ │
│ └─────────────────────────────────────────────┘ │
│ . . . │
├─────────────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────────────┘
For instance, a single label name with 4 different values will be encoded as:
┌────┬───┬───┬──────────────┬──────────────┬──────────────┬──────────────┬───────┐
│ 24 │ 1 │ 4 │ ref(value_0) | ref(value_1) | ref(value_2) | ref(value_3) | CRC32 |
└────┴───┴───┴──────────────┴──────────────┴──────────────┴──────────────┴───────┘
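The len value of 24 in this example follows directly from the field sizes: 4 bytes for #names, 4 bytes for #entries and 4 bytes per symbol reference. A small sketch of that arithmetic, with a hypothetical function name:

```python
def label_index_len(num_names, num_value_tuples):
    """Return (len, #entries) for a label index section."""
    entries = num_names * num_value_tuples  # #entries counts every reference
    length = 4 + 4 + 4 * entries            # #names + #entries + the references
    return length, entries
```

For one label name with 4 values this yields a len of 24 and 4 entries, matching the encoded row above.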
The sequence of label index sections is finalized by a label offset table containing label offset entries that point to the beginning of each label index section for a given label name.
Postings sections store monotonically increasing lists of series references that contain a given label pair associated with the list.
┌────────────────────┬────────────────────┐
│ len <4b> │ #entries <4b> │
├────────────────────┴────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ ref(series_1) <4b> │ │
│ ├─────────────────────────────────────┤ │
│ │ ... │ │
│ ├─────────────────────────────────────┤ │
│ │ ref(series_n) <4b> │ │
│ └─────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────┘
The sequence of postings sections is finalized by a postings offset table containing postings offset entries that point to the beginning of each postings section for a given label pair.
A label offset table stores a sequence of label offset entries. Every label offset entry holds the label name and the offset to its values in the label index section. They are used to track label index sections. This is no longer used.
┌─────────────────────┬──────────────────────┐
│ len <4b> │ #entries <4b> │
├─────────────────────┴──────────────────────┤
│ ┌────────────────────────────────────────┐ │
│ │ n = 1 <1b> │ │
│ ├──────────────────────┬─────────────────┤ │
│ │ len(name) <uvarint> │ name <bytes> │ │
│ ├──────────────────────┴─────────────────┤ │
│ │ offset <uvarint64> │ │
│ └────────────────────────────────────────┘ │
│ . . . │
├────────────────────────────────────────────┤
│ CRC32 <4b> │
└────────────────────────────────────────────┘
A postings offset table stores a sequence of postings offset entries, sorted by label name and value. Every postings offset entry holds the label name/value pair and the offset to its series list in the postings section. They are used to track postings sections. They are partially read into memory when an index file is loaded.
┌─────────────────────┬──────────────────────┐
│ len <4b> │ #entries <4b> │
├─────────────────────┴──────────────────────┤
│ ┌────────────────────────────────────────┐ │
│ │ n = 2 <1b> │ │
│ ├──────────────────────┬─────────────────┤ │
│ │ len(name) <uvarint> │ name <bytes> │ │
│ ├──────────────────────┼─────────────────┤ │
│ │ len(value) <uvarint> │ value <bytes> │ │
│ ├──────────────────────┴─────────────────┤ │
│ │ offset <uvarint64> │ │
│ └────────────────────────────────────────┘ │
│ . . . │
├────────────────────────────────────────────┤
│ CRC32 <4b> │
└────────────────────────────────────────────┘
The table of contents serves as an entry point to the entire index and points to various sections in the file. If a reference is zero, it indicates the respective section does not exist and empty results should be returned upon lookup.
┌─────────────────────────────────────────┐
│ ref(symbols) <8b> │
├─────────────────────────────────────────┤
│ ref(series) <8b> │
├─────────────────────────────────────────┤
│ ref(label indices start) <8b> │
├─────────────────────────────────────────┤
│ ref(label offset table) <8b> │
├─────────────────────────────────────────┤
│ ref(postings start) <8b> │
├─────────────────────────────────────────┤
│ ref(postings offset table) <8b> │
├─────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────┘
NOTE: Currently supported chunk file versions: v1.
NOTE: Don't confuse this with the chunks format (XOR encoded, Gorilla compressed set of samples). Chunk files contain multiple series chunks.
The following describes the format of a chunks file, which is created in the chunks/ directory of a block. The maximum size per segment file is 512MiB.
Chunks in the files are referenced from the index by a uint64 composed of the in-file offset (lower 4 bytes) and the segment sequence number (upper 4 bytes).
┌──────────────────────────────┐
│ magic(0x85BD40DD) <4 byte> │
├──────────────────────────────┤
│ version(1) <1 byte> │
├──────────────────────────────┤
│ padding(0) <3 byte> │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │ Chunk 1 │ │
│ ├──────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────┤ │
│ │ Chunk N │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
┌───────────────┬───────────────────┬──────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <bytes> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴──────────────┴────────────────┘
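The chunk reference layout described above (segment sequence number in the upper 4 bytes, in-file offset in the lower 4 bytes) can be sketched with two small helpers. The function names are hypothetical.

```python
def make_chunk_ref(segment_seq, offset):
    """Pack a segment sequence number and an in-file offset into one uint64."""
    return (segment_seq << 32) | offset


def split_chunk_ref(ref):
    """Unpack a chunk reference into (segment_seq, offset)."""
    return ref >> 32, ref & 0xFFFFFFFF
```

Given the 8-byte header shown in the diagram (4 bytes of magic, 1 byte of version, 3 bytes of padding), the first chunk in a segment file starts at offset 8.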
Thanos ignores any tombstones files. They are also deleted by sidecar on upload.