Skip to content

Latest commit

 

History

History
515 lines (400 loc) · 16.2 KB

README.md

File metadata and controls

515 lines (400 loc) · 16.2 KB

Standalone Linux IO Tracer

Standalone Linux IO Tracer (iotrace) is a tool for block device I/O tracing and management of created traces

For each I/O to target device(s) basic metadata information is captured (IO operation type, address, size), supplemented with extended classification. Extended classification contains information about I/O type (direct / filesystem metadata / file) and target file attributes(e.g. file size).

iotrace is based on Open CAS Telemetry Framework (OCTF). Collected traces are stored in OCTF trace location. Traces can later be converted to JSON or CSV format.

iotrace consists of a kernel tracing module (iotrace.ko) and an executable (iotrace) with command line interface.

In this readme:

Supported OS

Right now the compilation of Standalone Linux IO Tracer is tested on the following OSes:

OS Version Comment
RHEL/CentOS 7.7
RHEL/CentOS 8.1
Ubuntu 18.04
Fedora 31

Source Code

Source code is available in the official Standalone Linux IO Tracer GitHub repository:

git clone https://github.com/open-cas/standalone-linux-io-tracer
cd standalone-linux-io-tracer

Deployment

Checkout

To get stable version of iotrace checkout latest release:

git clone https://github.com/Open-CAS/standalone-linux-io-tracer/
cd standalone-linux-io-tracer
git checkout $(git tag | grep "^v[[:digit:]]*.[[:digit:]]*.[[:digit:]]*$" | tail -1)

But if you are going to develop iotrace, it is ok to checkout master branch.

Prerequisites

  • To build and use Standalone Linux IO Tracer, setup prerequisites first in the following way:

    git submodule update --init --recursive
    sudo ./setup_dependencies.sh
    

    Installed dependencies include OCTF, Google Protocol Buffers, CMake and Google Test. The dependencies are either installed with yum/apt or installed to a dedicated directory /opt/octf/ to avoid overwriting already installed ones.

Build

Both the executable and the kernel module (and OCTF if submodule is present) are built with:

make

Installation

Both the executable and the kernel module (and OCTF if submodule is present) are installed with:

sudo make install

Theory of operation

Standalone Linux IO Tracer captures request data by registering to multiple trace points surfaced by the Linux kernel (e.g. BIO queueing, BIO splitting, BIO completion). This allows for gathering of IO metadata at the request level and passing it between kernel and userspace.

A circular buffer is allocated and shared between kernel and userspace for each logical CPU core and trace events (example shown below) are then pushed into it.

struct iotrace_event_hdr {
    /** Event sequence ID */
    log_sid_t sid;

    /** Time stamp */
    uint64_t timestamp;

    /** Trace event type, iotrace_event_type enunerator */
    uint32_t type;

    /** Size of this event, including header */
    uint32_t size;
} __attribute__((packed, aligned(8)));

...

struct iotrace_event {
    /** Trace event header */
    struct iotrace_event_hdr hdr;
    /**
     * @brief IO ID
     *
     * This ID can be used by the tracing environment to assign an ID to the IO.
     *
     * @note Zero means not set.
     */
    uint64_t id;

    /** Address of IO in sectors */
    uint64_t lba;

    /** Size of IO in sectors */
    uint32_t len;

    /** IO class of IO */
    uint32_t io_class;

    /** Device ID */
    uint32_t dev_id;

    /** Operation flags: flush, fua, ... .
     * Values according to iotrace_event_flag_t enum
     * are summed (OR-ed) together. */
    uint32_t flags;

    /** Operation type: read, write, discard
     * (iotrace_event_operation_t enumerator) **/
    uint8_t operation;

    /** Write hint associated with IO */
    uint8_t write_hint;
} __attribute__((packed, aligned(8)));

The events declaration file can be found here.

The userspace part of the Standalone Linux IO Tracer reads the entries from the circular buffer and translates them into Google Protocol Buffer format (see example below), for easier portability. The data is then serialized in trace files in a per CPU basis (e.g. octf.trace.0).

message EventHeader {
    /** Event sequence ID */
    uint64 sid = 1;

    /** Time stamp */
    uint64 timestamp = 2;
}

...

enum IoType {
    UnknownIoType = 0;
    Read = 1;
    Write = 2;
    Discard = 3;
}

...

message EventIo {
    /** Address of IO in sectors */
    uint64 lba = 1;

    /** Size of IO in sectors */
    uint32 len = 2;

    /** IO class of IO */
    uint32 ioClass = 3;

    /** Device ID */
    uint64 deviceId = 4;

    /** Operation type: read, write, trim */
    IoType operation = 5;

    /** Flush flag */
    bool flush = 6;

    /** FUA flag */
    bool fua = 7;

    /** Write (lifetime) hint */
    uint32 writeHint = 8;

    /**
     * This ID can be used by the tracing environment to assign an ID to the IO.
     * Zero means not set.
     */
    uint64 id = 9;
}

...

message Event {
    /** Trace event header */
    EventHeader header = 1;

    oneof EventType {
        EventIo io = 2;
        EventDeviceDescription deviceDescription = 3;
        EventIoFilesystemMeta filesystemMeta = 4;
        EventIoCompletion ioCompletion = 5;
        EventIoFilesystemFileName filesystemFileName = 6;
        EventIoFilesystemFileEvent filesystemFileEvent = 7;
    }
}

The protobuf events declaration file can be found here.

You may see the results of translating into the above protobuf format, by executing the following command:

iotrace --trace-parser --io --path "kernel/2020-07-02_08:52:51" --raw

Output:

...
{"header":{"sid":"1","timestamp":"14193058940837"},"deviceDescription":{"id":"271581186","name":"nvme0n1","size":"732585168","model":"INTEL SSDPED1K375GA"}}
{"header":{"sid":"73","timestamp":"14196894550578"},"io":{"lba":"1652296","len":256,"ioClass":19,"deviceId":"271581186","operation":"Write","flush":false,"fua":false,"writeHint":0,"id":"110842991263647"}}
{"header":{"sid":"74","timestamp":"14196894550696"},"filesystemMeta":{"refSid":"110842991263647","fileId":{"partitionId":"271581186","id":"76","creationDate":"2020-07-02T06:52:55.712990641Z"},"fileOffset":"0","fileSize":"241960"}}
...

After tracing is complete these singular trace events may be parsed, combined and translated into different Google Protocol Buffer messages (or other formats, such as CSV) when executing Standalone Linux IO Tracer trace parser commands.

For example the --trace-parser --io command analyzes multiple submission, split and completion events to give a more complete view of a given IO request such as: its latency, queue depth, file size and path (if applicable) etc.

Examples

  • Start tracing two block devices for 1 hour, or until trace file is 1GiB:

    sudo iotrace --start-tracing --devices /dev/sda,/dev/sdb1 --time 3600 --size 1024
    

    NOTE: To allow tracing of block devices, Linux kernel tracing module needs to be loaded first. It is done automatically. After collecting traces the module will be unloaded.

  • The below output example is based on sample traces found here. The traces were captured during YCSB workload type A benchmark on RocksDB.

    Traces can be unpacked using the following command:

    tar -xf sample_trace.tar.xz
    

    They can then be moved to the trace repository directory, the path of which can be extracted using:

    iotrace --trace-config --get-trace-repository-path
    
  • List created traces:

    iotrace --trace-management --list-traces
    

    Output:

    {
    "trace": [
      {
       "tracePath": "kernel/2020-07-02_08:52:51",
       "state": "COMPLETE"
       "label": "ycsb;a;rocksdb;uniform;xfs;381MiB;16000000"
      }
    ]
    }
    
  • Parse traces (note usage of path returned in --list-traces):

    iotrace --trace-parser --io --path "kernel/2020-07-02_08:52:51" --format json
    

    Output:

    {"header":{"sid":"1","timestamp":"3835590186"},"io":{"lba":"1652296","len":256,"ioClass":19,"operation":"Write","flush":false,"fua":false,"error":false,"latency":"83797","qd":"1","writeHint":0},"device":{"id":"271581186","name":"nvme0n1","partition":"271581186","model":"INTEL SSDPED1K375GA"},"file":{"id":"76","offset":"0","size":"241960","path":"/000014.sst","eventType":"Access","creationDate":"2020-07-02T06:52:55.712990641Z"}}
    {"header":{"sid":"2","timestamp":"3835625267"},"io":{"lba":"1652552","len":256,"ioClass":19,"operation":"Write","flush":false,"fua":false,"error":false,"latency":"95069","qd":"2","writeHint":0},"device":{"id":"271581186","name":"nvme0n1","partition":"271581186","model":"INTEL SSDPED1K375GA"},"file":{"id":"76","offset":"256","size":"241960","path":"/000014.sst","eventType":"Access","creationDate":"2020-07-02T06:52:55.712990641Z"}}
    {"header":{"sid":"3","timestamp":"3835638717"},"io":{"lba":"1652808","len":256,"ioClass":19,"operation":"Write","flush":false,"fua":false,"error":false,"latency":"130094","qd":"3","writeHint":0},"device":{"id":"271581186","name":"nvme0n1","partition":"271581186","model":"INTEL SSDPED1K375GA"},"file":{"id":"76","offset":"512","size":"241960","path":"/000014.sst","eventType":"Access","creationDate":"2020-07-02T06:52:55.712990641Z"}}
    {"header":{"sid":"4","timestamp":"3835652180"},"io":{"lba":"1653064","len":256,"ioClass":19,"operation":"Write","flush":false,"fua":false,"error":false,"latency":"203209","qd":"4","writeHint":0},"device":{"id":"271581186","name":"nvme0n1","partition":"271581186","model":"INTEL SSDPED1K375GA"},"file":{"id":"76","offset":"768","size":"241960","path":"/000014.sst","eventType":"Access","creationDate":"2020-07-02T06:52:55.712990641Z"}}
    ...
    

    NOTE: Any mention of LBA assumes a 512B sector size, even if the underlying drive is formatted to 4096B sectors. Similarly a 512B sector is the unit of length.

  • Show trace statistics:

    iotrace --trace-parser --statistics -p "kernel/2020-07-02_08:52:51/"
    

    Output:

    "statistics": [
    {
     "desc": {
      "device": {
       "id": "271581186",
       "name": "nvme0n1",
       "size": "732585168" >  NOTE:  In sectors
      }
     },
     "duration": "24650525234",  NOTE:  In nanoseconds
     ...
     "write": {  NOTE:  Statistics for writes
      "size": {
       "unit": "sector",
       "average": "254",  NOTE:  Average write I/O size in sectors
       "min": "1",  NOTE:  Minimum write I/O size in sectors
       "max": "256",  NOTE:  Maximum write I/O size in sectors
       "total": "16338524",  NOTE:  Total write I/O size in sectors
       ...
      "latency": {
       "unit": "ns",
       "average": "15071365",  NOTE:  Average write I/O latency in ns
       "min": "9942",  NOTE:  Minimum write I/O latency in ns
       "max": "63161202",  NOTE:  Maximum write I/O latency in ns
      ...
      "count": "64188",  NOTE:  Number of write operations
      "metrics": {
       "throughput": {
        "unit": "IOPS",
        "value": 2603.9201757643164  NOTE:  Average write IOPS
       },
       "workset": {
        "unit": "sector",
        "value": 2237616  NOTE:  Number of distinct sectors written
       },
       "bandwidth": {
        "unit": "MiB/s",
        "value": 323.63590009317039  NOTE:  Average write bandwidth
       }
    ...
    

    NOTE: Similar statistics exist for any read, discard or flush operations. There's a section for the combined statistics as well (Total).

  • Show file system statistics:

    iotrace --trace-parser --fs-statistics -p "kernel/2020-07-02_08:52:51/"
    

    Output:

    "entries": [
     {
      "deviceId": "271581186",
      "partitionId": "271581186",
      "statistics": {
     ...
     "write": {
       "size": {
        "unit": "sector",
        "average": "255",  NOTE:  Average write I/O size in sectors
        "min": "8",  NOTE:  Minimum write I/O size in sectors
        "max": "256",  NOTE:  Maximum write I/O size in sectors
        "total": "16336216",  NOTE:  Total write I/O size in sectors
        "percentiles": {}
       },
       "count": "63984",
       "metrics": {
        "throughput": {
         "unit": "IOPS",
         "value": 2602.9390810897676
        },
        "workset": {
         "unit": "sector",
         "value": 2237096  NOTE:  Number of distinct sectors written
        },
        "write invalidation factor": {
         "unit": "",
         "value": 7.3024206381845032  NOTE:  Average number of times a given sector in the workset is written to
        },
        "bandwidth": {
         "unit": "MiB/s",
         "value": 324.49957478019985  NOTE:  Average write bandwidth
        }
       },
       ...
       "directory": "/"  NOTE:  Directories are relative to the filesystem mountpoint
    

    NOTE: Similar statistics exist for any read, discard or flush operations. There's a section for the combined statistics as well (Total).

    NOTE: File system statistics are gathered for detected groups of related I/O requests with common attributes like directory, file extension, file name prefix or IO class.

  • Show latency histogram:

    iotrace --trace-parser --latency-histogram -p "kernel/2020-07-02_08:52:51/"
    

    Output:

    ...
     "write": {
      "unit": "ns",
      "range": [
       {
        "begin": "0",
        "end": "0",
        "count": "0"
       },
       ...
       {
        "begin": "8192",  NOTE:  Minimum bound of latency histogram bucket in nanoseconds
        "end": "16383",  NOTE:  Maximum bound of latency histogram bucket in nanoseconds
        "count": "95"  NOTE:  Number of I/O requests in the bucket
       },
    ...
    

    NOTE: In the above example 95 write requests had latency between 8192 and 16383 ns.

    NOTE: Similar statistics exist for any read, discard or flush operations. There's a section for the combined statistics as well (Total).

  • Show latency histogram:

    iotrace --trace-parser --lba-histogram -p "kernel/2020-07-02_08:52:51/"
    

    Output:

    ...
     "write": {
      "unit": "sector",
      "range": [
       {
        "begin": "0",   NOTE:  Minimum disk LBA bound of latency histogram bucket
        "end": "20479",  NOTE:  Maximum disk LBA bound of latency histogram bucket
        "count": "177"  NOTE:  Number of I/O requests in the bucket
       },
    ...
    

    NOTE: In the above example 177 write requests were issued to disk LBAs 0 through 20479.

    NOTE: Similar statistics exist for any read, discard or flush operations. There's a section for the combined statistics as well (Total).

Tests

See our tests README

Contributing

Please refer to the OCTF contributing guide.

Related Projects

Please explore related projects:

Notice

NOTICE contains more information