Commit d00c222

775 parse unixtime microseconds (#776)
* Add parse unixtime milli, micro, nanoseconds
* Update gelf output ts format options
* Fix docs links for github pages
* Add pipeline and related docs
* Add datetime parse doc in pipeline
1 parent eecad76 commit d00c222

19 files changed: +552 −57 lines changed

cfg/matchrule/README.md

Lines changed: 53 additions & 0 deletions

# Match rules

Match rules are lightweight checks against the raw byte contents of a log. Rules are combined into rulesets: all rules in a ruleset are joined with logical `and` or `or`, the result of a ruleset can be inverted, and rules can check values case-insensitively.

## Rule

**`values`** *`[]string`*

List of values to check the content against.

<br>

**`mode`** *`string`* *`required`* *`options=prefix|suffix|contains`*

Content check mode. In `prefix` mode only the first bytes of the content are checked, in `suffix` mode only the last bytes, and in `contains` mode the content is searched for each value as a substring.

<br>

**`case_insensitive`** *`bool`* *`default=false`*

When `case_insensitive` is set to `true`, all `values` and the checked content are converted to lowercase. Avoid this mode where possible, because it can impact the throughput and performance of log collection.

<br>

**`invert`** *`bool`* *`default=false`*

Flag indicating whether to negate the match result. For example, if all of the rules match and `invert` is set to `true`, the whole ruleset reports no match. Use it when it is easier to list the items that should not match the rules.

<br>

## RuleSet

**`name`** *`string`*

The name of the ruleset. Has some additional semantics in [antispam exceptions](/pipeline/antispam/README.md#exception-parameters).

<br>

**`cond`** *`string`* *`default=and`* *`options=and|or`*

Logical operation used to combine the rules. If set to `and`, the ruleset matches only when all rules match. If set to `or`, the ruleset matches when at least one of the rules matches.

<br>

**`rules`** *`[]`[Rule](/cfg/matchrule/README.md#rule)*

List of rules to check the log against.

<br>

## RuleSets

List of [RuleSet](/cfg/matchrule/README.md#ruleset). Rulesets are always combined with logical `or`: the list matches when at least one of the rulesets matches.
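
A minimal sketch of a ruleset in use, assuming the `antispam_exceptions` placement described in the pipeline docs added in this commit (names and values are illustrative):

```yaml
pipeline:
  antispam_threshold: 2000
  antispam_exceptions:
    - name: skip_health_checks
      cond: or                   # match if any rule below matches
      rules:
        - mode: contains         # substring search
          values: ['/healthz', '/readyz']
          case_insensitive: true
        - mode: prefix           # check only the first bytes of the log
          values: ['{"level":"debug"']
```
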

docs/architecture.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -6,13 +6,13 @@ Here is a bit simplified architecture of the **file.d** solution.
 
 What's going on here:
 
-- **Input plugin** pulls data from external systems and pushes it next to the pipeline controller. Full list of input plugins available is [here](../plugin/input).
+- **Input plugin** pulls data from external systems and pushes it next to the pipeline controller. Full list of input plugins available is [here](/plugin/input/README.md).
 - The **pipeline controller** creates **streams** of the data and is in charge of converting data to event and subsequent routing.
 - The **event pool** provides fast event instancing.
 - Events are processed by one or more **processors**. Every processor holds all **action plugins** from the configuration.
 - Every moment the processor gets a stream of data, process 1 or more events and returns the stream to a **streamer** that is a pool of streams.
-- Action plugins act on the events which meet particular criteria.
-- Finally, the event goes to the **output plugins** and is dispatched to the external system.
+- Action plugins act on the events which meet particular criteria. Full list of action plugins available is [here](/plugin/action/README.md).
+- Finally, the event goes to the **output plugins** and is dispatched to the external system. Full list of output plugins available is [here](/plugin/output/README.md).
 
 You can extend `file.d` by adding your own input, action, and output plugins.
```

docs/examples.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -74,7 +74,7 @@ pipelines:
 ```
 
 ## What's next?
-1. [Input](/plugin/input) plugins documentation
-2. [Action](/plugin/action) plugins documentation
-3. [Output](/plugin/output) plugins documentation
+1. [Input](/plugin/input/README.md) plugins documentation
+2. [Action](/plugin/action/README.md) plugins documentation
+3. [Output](/plugin/output/README.md) plugins documentation
 4. [Helm-chart](/charts/filed/README.md) and examples for running in Kubernetes
````

docs/installation.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -6,7 +6,7 @@ Images are available
 on [GitHub container registry](https://github.com/ozontech/file.d/pkgs/container/file.d/versions?filters%5Bversion_type%5D=tagged).
 
 **Note**:
-If you are using [journalctl](https://github.com/ozontech/file.d/tree/master/plugin/input/journalctl) input plugin, we
+If you are using [journalctl](https://github.com/ozontech/file.d/tree/master/plugin/input/journalctl/README.md) input plugin, we
 recommend choosing the ubuntu version that matches the host machine
 version.
 For example, if the host machine with which you want to collect logs using journald has a version of Ubuntu 18.04, you
```

pipeline/README.idoc.md

Lines changed: 150 additions & 0 deletions

# Pipeline

Pipeline is the entity which handles data. It consists of an input plugin, a list of action plugins and an output plugin. The input plugin sends data to the `pipeline.In` controller, where the data is validated: empty data is discarded, the data size is checked, and the behaviour for overly long logs is defined by the `cut_off_event_by_limit` setting. Then the data is checked by the `antispam`, if it is enabled. After all checks have passed, the data is converted to the `Event` structure (the number of in-flight events is limited by the `EventPool`) and decoded depending on the [pipeline settings](#settings). The event is sent to a stream, and streams are handled by `processors`. In a processor the event is passed through the list of action plugins and then sent to the output plugin. The output plugin commits the `Event` by calling the `pipeline.Commit` function, and once the commit finishes the data is considered processed. More details on the architecture are presented on the [architecture page](/docs/architecture.md).
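
To make the flow concrete, here is a minimal configuration sketch; the plugin types and their parameters (`file`, `watching_dir`, `discard`, `match_fields`, `stdout`) are illustrative assumptions, see the plugin docs for the real options:

```yaml
pipelines:
  example:
    settings:               # pipeline settings described below
      capacity: 1024
    input:
      type: file            # input plugin: where the data comes from
      watching_dir: /var/log
    actions:                # action plugins applied in order
      - type: discard
        match_fields:
          level: debug
    output:
      type: stdout          # output plugin: where events are dispatched
```

The `settings` block maps onto the options documented below.
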

## Settings

**`capacity`** *`int`* *`default=1024`*

Capacity of the `EventPool`. No more than `capacity` events can be processed at the same time. It can be considered one of the rate-limiting tools, but its primary role is to control the amount of RAM used by file.d.

<br>

**`avg_log_size`** *`int`* *`default=4096`*

Expected average size of the input logs in bytes. Used in the standard event pool to release buffer memory when the buffer size exceeds this value.

<br>

**`max_event_size`** *`int`* *`default=0`*

Maximum allowed size of the input logs in bytes. If set to 0, logs of any size are allowed. If set to a value greater than 0, logs larger than `max_event_size` are discarded unless `cut_off_event_by_limit` is set to `true`.

<br>

**`cut_off_event_by_limit`** *`bool`* *`default=false`*

Flag indicating whether to cut logs which have exceeded `max_event_size`. If set to `true`, huge logs are cut and only the first `max_event_size` bytes are passed further. If set to `false`, huge logs are discarded. Only works if `max_event_size` is greater than 0, otherwise it does nothing. Useful when there are huge logs which affect the logging system but it is preferable to deliver them at least partially.

<br>

**`cut_off_event_by_limit_field`** *`string`*

Field to add to a log if it was cut by `max_event_size`. E.g. with `cut_off_event_by_limit_field: _cropped`, if the log was cut, the output event will have the field `"_cropped":true`. Only works if `cut_off_event_by_limit` is set to `true` and `max_event_size` is greater than 0. Useful for marking cut logs.
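
For instance, a sketch of the size-limit settings working together (the pipeline name is illustrative):

```yaml
pipelines:
  example:
    settings:
      max_event_size: 32768                  # discard or cut logs larger than 32 KiB
      cut_off_event_by_limit: true           # cut instead of discarding
      cut_off_event_by_limit_field: _cropped # cut logs get "_cropped": true
```
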

<br>

**`decoder`** *`string`* *`default=auto`*

Which decoder to use on every log from the input plugin. Defaults to `auto`, meaning the decoder suggested by the input plugin is used. Currently the `json` decoder is suggested most of the time; the only exception is the [k8s input plugin](/plugin/input/k8s/README.md) with a CRI type other than docker, in which case the `cri` decoder is suggested. The full list of decoders is available on the [decoders page](/decoder/readme.md).

<br>

**`decoder_params`** *`map[string]any`*

Additional parameters for the chosen decoder. The list of params varies; it can be found on the [decoders page](/decoder/readme.md) for each decoder.
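
A sketch of pinning the decoder explicitly; `json_max_fields_size` is a hypothetical parameter name used only for illustration, the real params are listed on the decoders page:

```yaml
pipelines:
  example:
    settings:
      decoder: json              # override `auto` and force the json decoder
      decoder_params:
        json_max_fields_size:    # hypothetical param, for illustration only
          message: 1024
```
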

<br>

**`stream_field`** *`string`* *`default=stream`*

Which field in the log indicates the `stream`. Mostly used for distinguishing `stdout` from `stderr` in k8s logs.

<br>

**`maintenance_interval`** *`string`* *`default=5s`*

How often to perform maintenance. Maintenance includes antispammer maintenance and metric cleanup, metric holder maintenance, increasing basic pipeline metrics by the accumulated deltas, and logging pipeline stats. The value must be passed in duration format (`<number>(ms|s|m|h)`).

<br>

**`event_timeout`** *`string`* *`default=30s`*

How long an event may be processed by action plugins, blocking its stream in the streamer, before it is marked as a timeout event and the stream is unlocked so that the whole pipeline does not get stuck. The value must be passed in duration format (`<number>(ms|s|m|h)`).

<br>

**`antispam_threshold`** *`int`* *`default=0`*

Threshold value for the [antispammer](/pipeline/antispam/README.md#antispammer) to ban sources. If set to 0, the antispammer is disabled. If set to a value greater than 0, the antispammer is enabled and bans sources which write `antispam_threshold` or more logs within the `maintenance_interval` time.

<br>

**`antispam_exceptions`** *`[]`[antispam.Exception](/pipeline/antispam/README.md#exception-parameters)*

The list of antispammer exceptions. If a log matches at least one of the exceptions, it is not counted by the antispammer.

<br>

**`meta_cache_size`** *`int`* *`default=1024`*

Amount of entries in the metadata cache.

<br>

**`source_name_meta_field`** *`string`*

The metadata field used to retrieve the name or origin of a data source. You can use it for antispam. Metadata is configured via the `meta` parameter in the input plugin. For example:

```yaml
input:
  type: k8s
  meta:
    pod_namespace: '{{ .pod_name }}.{{ .namespace_name }}'
pipeline:
  antispam_threshold: 2000
  source_name_meta_field: pod_namespace
```

<br>

**`is_strict`** *`bool`* *`default=false`*

Whether to exit with a fatal error on a decoding error.

<br>

**`metric_hold_duration`** *`string`* *`default=30m`*

The amount of time a metric can be idle until it is deleted. Used for deleting rarely updated metrics to save metrics-storage resources. The value must be passed in duration format (`<number>(ms|s|m|h)`).

<br>

**`pool`** *`string`* *`options=std|low_memory`*

Type of the `EventPool`. The `std` pool is the original pool with a slice of `Event` pointers and slices of free-event indicators. The `low_memory` pool is a leveled pool based on multiple `sync.Pool` instances for events of different sizes. The latter is experimental.
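
For example, a sketch of opting into the experimental pool (values are illustrative):

```yaml
pipelines:
  example:
    settings:
      capacity: 4096      # larger EventPool for bursty traffic
      pool: low_memory    # experimental leveled pool built on sync.Pool
```
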

<br>

## Datetime parse formats

Most of the plugins which parse datetimes call the `pipeline.ParseTime` function. It accepts datetime layouts the same way as Go's `time.Parse` (a reference datetime such as `2006-01-02T15:04:05.999999999Z07:00`), except for unix timestamp formats, which can only be specified via the aliases below.

For convenience, there are aliases for some datetime formats:

+ `ansic` - Mon Jan _2 15:04:05 2006
+ `unixdate` - Mon Jan _2 15:04:05 MST 2006
+ `rubydate` - Mon Jan 02 15:04:05 -0700 2006
+ `rfc822` - 02 Jan 06 15:04 MST
+ `rfc822z` - 02 Jan 06 15:04 -0700
+ `rfc850` - Monday, 02-Jan-06 15:04:05 MST
+ `rfc1123` - Mon, 02 Jan 2006 15:04:05 MST
+ `rfc1123z` - Mon, 02 Jan 2006 15:04:05 -0700
+ `rfc3339` - 2006-01-02T15:04:05Z07:00
+ `rfc3339nano` - 2006-01-02T15:04:05.999999999Z07:00
+ `kitchen` - 3:04PM
+ `stamp` - Jan _2 15:04:05
+ `stampmilli` - Jan _2 15:04:05.000
+ `stampmicro` - Jan _2 15:04:05.000000
+ `stampnano` - Jan _2 15:04:05.000000000
+ `nginx_errorlog` - 2006/01/02 15:04:05
+ `unixtime` - unix timestamp in seconds: 1739959880
+ `unixtimemilli` - unix timestamp in milliseconds: 1739959880999
+ `unixtimemicro` - unix timestamp in microseconds: 1739959880999999 (e.g. `journalctl` writes the timestamp in this format to the `__REALTIME_TIMESTAMP` field when using the json output format)
+ `unixtimenano` - unix timestamp in nanoseconds: 1739959880999999999

**Note**: with `unixtime(|milli|micro|nano)`, if the value is a float, its whole part is always treated as seconds and the fractional part as fractions of a second.
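
As a usage sketch, an action parsing journald microsecond timestamps; the `convert_date` action and its `field`/`source_formats`/`target_format` parameters are assumptions to be checked against the action plugin docs:

```yaml
pipelines:
  example:
    actions:
      - type: convert_date                 # assumed action plugin
        field: __REALTIME_TIMESTAMP        # journald's unix-microsecond timestamp
        source_formats: ['unixtimemicro']  # alias from the list above
        target_format: rfc3339
```
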

## Match modes

> Note: consider using [DoIf match rules](/pipeline/doif/README.md) instead, since they are an advanced version of match modes.

@match-modes|header-description

pipeline/README.md

Lines changed: 150 additions & 0 deletions

Identical content to `pipeline/README.idoc.md` above, with the `@match-modes|header-description` placeholder expanded into the match-modes documentation:

## Match modes

> Note: consider using [DoIf match rules](/pipeline/doif/README.md) instead, since they are an advanced version of match modes.

#### And
`match_mode: and` — matches fields with AND operator
