vector

"A lightweight, ultra-fast tool for building observability pipelines" - https://vector.dev

You can think of vector as a replacement for fluentd or fluentbit. It is good at reading inputs, transforming them, and sending them elsewhere, for example reading logs and shipping them.

Links

Examples

Show the supported sources, transforms, sinks

I'm not going to paste them here because the list is long and likely differs between versions, but you can view them via:

vector list

The list as of vector 0.22.0 includes things from aws, gcp, splunk, prometheus, kafka, influxdb, elasticsearch, azure, and more.
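Because the list is long, it can help to filter it. The following is a minimal sketch that uses only `vector list` and `grep`; the helper name `vector_components` is my own and not part of vector.

```shell
#!/bin/bash
# Hypothetical helper (not part of vector): filter the output of
# `vector list` by a case-insensitive keyword.
vector_components() {
  vector list | grep -i -- "$1"
}
```

For example, `vector_components kafka` prints only the lines of the list that mention kafka.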

Spawn a process and handle its stdout and stderr

One problem with reading stdout and stderr on Linux is that they are two separate file descriptors, so you have to handle them separately. A tool that aggregates them back into a single stream, annotated with the stream each line came from, is very useful. This example shows how to use vector to spawn a subprocess, remove some fields, and print to stdout:

#!/bin/bash
# Filename: /tmp/stream-test.sh

for _ in {1..5} ; do
  echo "This is stdout"
  echo "This is stderr" >&2
  sleep 0.$(( RANDOM ))
done

The default config file format (as of vector 0.22.0) is TOML, but the example below uses YAML because it is my preference. You can convert between them with dasel.

# Filename: vector.yaml
---
# https://vector.dev/docs/reference/configuration/sources/exec
sources:
  exec:
    command:
      - /tmp/stream-test.sh
    decoding:
      codec: bytes
    mode: streaming
    streaming:
      respawn_on_exit: false
    type: exec

# https://vector.dev/docs/reference/configuration/transforms
transforms:
  remove_exec_fields:
    inputs:
      - exec
    # https://vector.dev/docs/reference/vrl/
    source: |-
      del(.command)
      del(.host)
      del(.source_type)
    type: remap

# https://vector.dev/docs/reference/configuration/sinks/console
sinks:
  print:
    encoding:
      codec: json
    inputs:
      - remove_exec_fields
    type: console

$ vector --config vector.yaml
2022-06-01T21:29:35.914895Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-06-01T21:29:35.915019Z  INFO vector::app: Loading configs. paths=["vector.yaml"]
2022-06-01T21:29:35.916968Z  INFO vector::topology::running: Running healthchecks.
2022-06-01T21:29:35.917095Z  INFO vector: Vector has started. debug="false" version="0.22.0" arch="x86_64" build_id="5e937e3 2022-06-01"
2022-06-01T21:29:35.917138Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2022-06-01T21:29:35.917152Z  INFO vector::topology::builder: Healthcheck: Passed.
{"message":"This is stderr","pid":2470931,"stream":"stderr","timestamp":"2022-06-01T21:29:35.918778044Z"}
{"message":"This is stdout","pid":2470931,"stream":"stdout","timestamp":"2022-06-01T21:29:35.918821210Z"}
{"message":"This is stderr","pid":2470931,"stream":"stderr","timestamp":"2022-06-01T21:29:36.679150968Z"}
{"message":"This is stdout","pid":2470931,"stream":"stdout","timestamp":"2022-06-01T21:29:36.679193905Z"}
{"message":"This is stderr","pid":2470931,"stream":"stderr","timestamp":"2022-06-01T21:29:36.959284295Z"}
{"message":"This is stdout","pid":2470931,"stream":"stdout","timestamp":"2022-06-01T21:29:36.959315187Z"}
{"message":"This is stdout","pid":2470931,"stream":"stdout","timestamp":"2022-06-01T21:29:37.124459926Z"}
{"message":"This is stderr","pid":2470931,"stream":"stderr","timestamp":"2022-06-01T21:29:37.124598441Z"}
{"message":"This is stderr","pid":2470931,"stream":"stderr","timestamp":"2022-06-01T21:29:37.241035793Z"}
{"message":"This is stdout","pid":2470931,"stream":"stdout","timestamp":"2022-06-01T21:29:37.241074381Z"}
2022-06-01T21:29:37.484711Z  INFO vector::shutdown: All sources have finished.
2022-06-01T21:29:37.484751Z  INFO vector: Vector has stopped.

Even in the above example you can see how difficult it is to aggregate stdout and stderr in accurate order. In the script, stderr always comes second, but in all but one of these iterations stderr was handled before stdout. This is not a vector problem. It is a fundamental POSIX problem caused by stdout and stderr being separate streams. However, vector seems to have a method for handling this when a timestamp shows up in the stream: if I replace echo with date "+%FT%T%z.%N foo" in both streams, they are consistently in order. Another way to handle this is to emit logs as structured data with the timestamp attached at the source, but you will not always have control over the source log format.
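To illustrate why a timestamp written at the source makes the order recoverable, here is a small sketch that does not involve vector at all. It captures stdout and stderr separately, then merges them by timestamp. It assumes GNU date (for %N nanoseconds), and the helper names `emit_timestamped` and `merge_by_timestamp` are made up for this example. It is not a claim about how vector handles ordering internally.

```shell
#!/bin/bash
# Sketch only: assumes GNU date, which supports %N (nanoseconds).

# Hypothetical emitter: writes a UTC timestamped line to each stream,
# stdout first and stderr second, like /tmp/stream-test.sh.
emit_timestamped() {
  for i in 1 2 3; do
    date -u "+%FT%T.%NZ stdout $i"
    date -u "+%FT%T.%NZ stderr $i" >&2
  done
}

# Hypothetical collector: captures each stream in its own file, the way
# a reader of two separate file descriptors would see them, then merges.
merge_by_timestamp() {
  tmpdir=$(mktemp -d)
  "$@" >"$tmpdir/stdout.log" 2>"$tmpdir/stderr.log"
  # Fixed-width UTC timestamps sort lexicographically in time order.
  LC_ALL=C sort "$tmpdir/stdout.log" "$tmpdir/stderr.log"
  rm -rf "$tmpdir"
}

merge_by_timestamp emit_timestamped
```

The merged output alternates stdout and stderr in the order they were written, even though the two streams were collected independently. Without the timestamps, nothing in the two files would say how their lines interleave.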

Another aspect of this setup is that you can use vector as a kind of init system: setting sources.exec.streaming.respawn_on_exit = true re-launches the process if it exits for any reason.
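A minimal sketch of such a respawning config is below, written out from a shell heredoc. It reuses /tmp/stream-test.sh from the earlier example. The output path /tmp/vector-respawn.yaml is an arbitrary choice of mine, and respawn_interval_secs (the delay between restarts) is an option I believe the exec source supports; check the exec source documentation for your version.

```shell
#!/bin/bash
# Sketch: write a vector config that re-launches the subprocess when it exits.
# The config path is a hypothetical choice for this example.
config=/tmp/vector-respawn.yaml

cat > "$config" <<'EOF'
---
sources:
  exec:
    type: exec
    command:
      - /tmp/stream-test.sh
    mode: streaming
    streaming:
      respawn_on_exit: true
      # Assumed option: seconds to wait before re-launching.
      respawn_interval_secs: 5

sinks:
  print:
    type: console
    inputs:
      - exec
    encoding:
      codec: json
EOF

echo "Wrote $config"
```

Running `vector --config /tmp/vector-respawn.yaml` with this file should keep restarting the script after each exit, so vector will not stop on its own the way it did in the earlier run.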

Tap a running vector instance

https://vector.dev/guides/level-up/vector-tap-guide/

Vector has a feature called tap that lets you hook into a running instance and see what is coming through. You can enable this in your vector config via:

# Filename: vector.toml
[api]
enabled = true

Then simply run:

vector tap

This shows both pre-transform inputs and post-transform outputs, which is useful when you are not seeing the output you expect, because you can see the before and after right next to each other. There are also further arguments you can pass to vector tap to filter for specific inputs or outputs. See vector tap --help for the syntax.

Debug syntax using a REPL

https://vector.dev/docs/reference/vrl/

Vector has a REPL feature that can be used for developing configs and debugging. Launch it with vector vrl. Once inside, type help to get guidance on how to proceed.