=pod

=for vim
   vim: tw=72 ts=3 sts=3 sw=3 et ai :

=encoding utf8

=head1 NAME

Data::Tubes::Plugin::Parser

=head1 DESCRIPTION

This module contains factory functions to generate I<tubes> that ease
parsing of input records.

Each of the generated tubes has the following contract:

=over

=item *

the input record MUST be a hash reference;

=item *

one field in the hash (according to factory argument C, set to C by
default) points to the input text that has to be parsed;

=item *

one field in the hash (according to factory argument C, set to C by
default) is set to the output of the parsing operation.

=back

The factory functions below have two names, one starting with
C<parse_> and the other without this prefix. They are perfectly
equivalent to each other; the short version can be handier e.g. when
using C or C from L.

=head1 FUNCTIONS

=head2 B<< by_format >>

   my $tube = by_format($format, %args); # OR
   my $tube = by_format(%args); # OR
   my $tube = by_format(\%args);

parse the input text according to a template format string (passed via
factory argument C or through the first unnamed parameter C<$format>).
This string is supposed to be composed of word and non-word sequences,
where each word sequence is assumed to be the name of a field, and each
non-word sequence is a separator. Example:

   $format = 'foo;bar;baz';

is interpreted as follows:

   @field_names = ('foo', 'bar', 'baz');
   @separators  = (';', ';');

Example:

   $format = 'foo;bar~~~baz';

is interpreted as follows:

   @field_names = ('foo', 'bar', 'baz');
   @separators  = (';', '~~~');

In the first case, i.e. when all separators are equal to each other,
L</by_split> will be called, as it is (arguably) slightly more
efficient. Otherwise, L</by_separators> will be called. Whatever the
chosen factory returns is returned unchanged.

All C<@field_names> MUST be different from one another.
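The interpretation of the format string described above can be sketched
in plain Perl. This is only an illustration of the word/non-word rule,
not the module's actual implementation; the helper names
C<split_format> and C<all_equal> are hypothetical and exist only in
this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): split a format string
# into field names (word runs) and separators (non-word runs).
sub split_format {
   my ($format) = @_;
   my @field_names = $format =~ m{(\w+)}g;
   my @separators  = $format =~ m{(\W+)}g;
   return (\@field_names, \@separators);
}

# Hypothetical helper: true when all the strings passed are identical,
# i.e. the case where a plain single-separator split is enough.
sub all_equal {
   my ($first, @rest) = @_;
   return !grep { $_ ne $first } @rest;
}

for my $format ('foo;bar;baz', 'foo;bar~~~baz') {
   my ($names, $seps) = split_format($format);
   my $kind = all_equal(@$seps) ? 'single separator' : 'mixed separators';
   print "$format: names=(@$names) seps=(@$seps) -> $kind\n";
}
```

With the two example formats, the first is reported as using a single
separator and the second as using mixed separators, which mirrors the
choice between the two underlying factories.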
The following arguments are supported:

=over

=item C

set to the number of missing trailing elements that you are fine to
lose, in case the format is composed of a single separator only and
L</by_split> is used behind the scenes. This allows you to set an
optional I trailing parameter to collect whatever you are not really
interested in, also allowing for its absence.

As an example, consider the following input lines:

   FOO0,BAR0,BAZ0,WHATEVER
   FOO1,BAR1,BAZ1
   FOO2,BAR2,BAZ2,WHAT2,EVER2,

Assuming that you're really interested in the first three parameters,
disregarding whatever comes after, you can set the following format:

   foo,bar,baz,rest

and also set C to 1, indicating that you can sustain the lack of
C<rest> (which you really don't care about);

=item C

the format to use for splitting the inputs. This parameter is the I
one, so it can also be passed as the first, unnamed parameter (see third calling convention); =item C name of the input field, defaults to C; =item C name of the tube, useful for debugging; =item C name of the output field, defaults to C; =item C remove leading and trailing whitespaces from the extracted values; =item C set how you are going to accept input values, e.g. escaped or quoted. See L for details. =back =head2 B<< by_regex >> my $tube = by_regex($regex, %args); # OR my $tube = by_regex(%args); # OR my $tube = by_regex(\%args); parse the input text based on a regular expression, passed as argument C or C<$regex> as unnamed first parameter. The regular expression is supposed to have named captures, that will eventually be used to populate the rendered output. The following arguments are supported: =over =item C name of the input field, defaults to C; =item C name of the tube, useful for debugging; =item C name of the output field, defaults to C; =item C the regular expression to use for splitting the inputs. This is the I
argument, and can be passed also as the first unnamed one in the argument list. =back =head2 B<< by_separators >> my $tube = by_separators($separators, %args); # OR my $tube = by_separators(%args); # OR my $tube = by_separators(\%args); parse the input according to a series of separators, that will be applied in sequence. For example, if the list of separators is the following: @separators = (';', '~~'); the following input: $text = 'foo;bar~~/baz/'; will be split as: @split = ('foo', 'bar', '/baz/'); The following arguments are supported: =over =item C name of the input field, defaults to C; =item C a reference to an array containing the list of keys to be associated to the values from the split; =item C name of the tube, useful for debugging; =item C name of the output field, defaults to C; =item C a reference to an array containing the list of separators to be used for splitting the input. This parameter can also be passed as the first, unnamed argument. Each separator can be: =over =item * a I, that is invoked once with a reference to the arguments, and must return either of the following forms; =item * a I, that will be used as-is at the right place; =item * a I, that will be matched verbatim (through a regular expression matching the string after passing it through C); =back =item C remove leading and trailing whitespaces from the extracted values. Example: @seps = qw< : ; , >; $input = ' what : ever ;you,do '; @elements = ('what', 'ever', 'you', 'do'); =item C this is how you provide a description of what you consider a I. It can be multiple things: =over =item * a I, that is called and MUST provide back one of the following alternatives; =item * a I, that is used directly; =item * a I, that is turned into an array reference by creating an anonymous array with the string as its only element, then processed as in the following bullet; =item * an I with elements inside, that will be described in the following list. 
=back

If you end up with an I<array reference>, each element will be put in a
big regular expression that is the C of all elements. Each can be:

=over

=item *

a I<regular expression>, that is fit as-is in the big regular
expression;

=item *

the string C, that is the same as having put the three strings C, C
and C;

=item *

the string C, that is the same as having put the two strings C and C;

=item *

the string C<single-quoted> (or C), that allows you to match a string
that is delimited by single quotes, with no escaping inside. This is
always put at the beginning of the big regular expression (although C
strings can be fit before actually);

=item *

the string C<double-quoted> (or C), that allows you to match a string
that is delimited by double quotes, also allowing escaped elements
inside (via backslashes). This is always put at the beginning of the
big regular expression;

=item *

the string C, that allows you to match a non-greedy sequence of escaped
characters (via backslash). If C<single-quoted> is also specified,
single quotes need to be escaped too. If C<double-quoted> is also
specified, double quotes need to be escaped too. This is always set at
the end of the big regular expression (except for C<whatever>, that
might appear after it);

=item *

the string C<whatever>, that allows you to match a non-greedy sequence
of characters, i.e. it is a synonym of regular expression
C<(?ms:.*?)>. If present, it is always set at the end of the big
regular expression.

=back

For example, if you want to accept single quoted, double quoted and
unquoted strings, you might provide the following:

   [qw< single-quoted double-quoted whatever >]

=back

=head2 B<< by_split >>

   my $tube = by_split(%args); # OR
   my $tube = by_split(\%args); # OR
   my $tube = by_split($separator, %args);

split the input according to a separator string, passed either as the
first unnamed parameter C<$separator> or as hash option C.

The following arguments are supported:

=over

=item C

set to the number of missing trailing elements that you are fine to
lose, in case you also provide C (see below).
This is particularly important when this function is called behind the scenes by L, because I sets C. In practice, suppose that you set the following C: [qw< foo bar baz whatever >] A normal parsing will expect to find at least four elements, so the following input would fail: FOO,BAR,BAZ On the other hand, if you set C to 1, you are accepting that there might be a missing value for C, that will be filled with the undefined value. =item C name of the input field, defaults to C; =item C optional reference to an array containing a list of keys to be associated to the split data. If present, it will be used as such; if absent, a reference to an array will be set as output. =item C name of the tube, useful for debugging; =item C name of the output field, defaults to C; =item C the separator to be used for C. If it is a code reference, it is invoked once with the provided arguments to get the separator back. After this, it can be either a regular expression, used as-is, or a string that is passed through C before being used; =item C remove leading and trailing whitespaces from the extracted values. As you might expect, if the C is a colon, the following input: $input = ' what : ever :you:do '; would be split into the following elements: @elements = ('what', 'ever', 'you', 'do'); =back =head2 B<< by_value_separator >> $tube = by_value_separator($separator, %args); # OR $tube = by_value_separator(%args); # OR $tube = by_value_separator(\%args); parse a sequence of value-and-separator. This is a generalization of L, where you can provide a way to specify what you consider I values, e.g. to allow for escaping or quoting (hence also allowing having the separator inside your values). B: this function uses the regular expression construct C<(?{...})> internally. While it is supported as of perl 5.10, this has evolved in time, up to perl 5.18 where it was stabilized. 
In particular, before perl 5.18 it was not possible to use lexical
variables in the construct, so for older perls C<by_value_separator>
uses a package variable for collecting values. This should not be a
problem, but might be.

Just to make an example, suppose that you are using semicolons as
separators. C<by_value_separator> would allow you to take this:

   'some;thing'; what\;ever ; "this;\"goes\";fine"

and turn it into this:

   ['some;thing', 'what;ever', 'this;"goes";fine']

As noted, it is similar to L; as a matter of fact, this might be
re-implemented (less efficiently) through L. Unless there are bugs, of
course.

Like L, you can provide a C parameter (also via the first, unnamed
parameter) that can be either a sub reference, a string or a regular
expression.

Additionally, you can provide a C parameter that tells what is
considered an I input value. A value can be different things (see
below), but it boils down to providing regular expressions, indications
of pre-canned matching expressions, or a combination.

When you match values, you can then I<decode> them. For example, if you
specify that you want to accept double-quoted strings, it makes sense
to remove the quotes and un-escape the remaining sequence before using
it. Depending on what you pass as a definition for a valid C, your
decoding approach might vary.

Decoding can happen in two ways: either you provide a C function that
will be applied to each value, or a C that is applied to the whole
values array. You might want to choose the latter for improving
performance (1 sub call against N).

Normally, an input would be split and an array reference would populate
the C field (that is, the field indicated by the C argument). If you
would rather get a hash, you can pass C to use, in order. If this is
the case, you can also accept getting more values than you have keys
for with C, or fewer of them with C.

Last, you might want to take advantage of C if your values shouldn't
have leading/trailing spaces.
Be sure to read the fine print about trimming quoted strings, though.

Accepted arguments are:

=over

=item C

=item C

these are integer values that set how many fewer/more values you are
willing to admit with respect to the provided C (see below). Hence,
they only work when C is set. By default they are set to 0, meaning
that you expect to have exactly the same number of values as there are
keys. Allowing I means that you accept getting fewer values than there
are keys; the keys left over will be associated to C. Allowing I means
that you're willing to ditch that number of exceeding values;

=item C

name of the input field, defaults to C;

=item C

an array reference with the keys to be associated (one-by-one, in
order) to the extracted values;

=item C

name of the tube, useful for debugging. Defaults to C;

=item C

name of the output field, defaults to C;

=item C

the separator to be used between two consecutive valid I<value>s. It
can be one of the following:

=over

=item *

a I<sub reference>, that is called with whatever arguments provided (as
a hash reference) and MUST return one of the following two
alternatives;

=item *

a I<regular expression>, that will be matched for the separator;

=item *

a I<string>, that will be matched verbatim.

=back

There is no default, you MUST provide one either as the first, unnamed
parameter or as argument C;

=item C

remove leading and trailing whitespaces from the extracted values. This
is applied I<before> decoding is applied, which means that
leading/trailing whitespaces I<inside> quoted strings will be kept.
Defaults to a I<false> value, meaning that no trimming is performed;

=item C

this is how you provide a description of what you consider a I<value>.
It can be multiple things:

=over

=item *

a I<sub reference>, that is called and MUST provide back one of the
following alternatives;

=item *

a I<regular expression>, that is used directly;

=item *

a I<string>, that is turned into an array reference by creating an
anonymous array with the string as its only element, then processed as
in the following bullet;

=item *

an I<array reference> with elements inside, that will be described in
the following list.

=back

If you end up with an I<array reference>, each element will be put in a
big regular expression that is the C of all elements. Each can be:

=over

=item *

a I<regular expression>, that is fit as-is in the big regular
expression;

=item *

the string C, that is the same as having put the three strings C, C
and C;

=item *

the string C, that is the same as having put the two strings C and C;

=item *

the string C<single-quoted> (or C), that allows you to match a string
that is delimited by single quotes, with no escaping inside. This is
always put at the beginning of the big regular expression (although C
strings can be fit before actually);

=item *

the string C<double-quoted> (or C), that allows you to match a string
that is delimited by double quotes, also allowing escaped elements
inside (via backslashes). This is always put at the beginning of the
big regular expression;

=item *

the string C, that allows you to match a non-greedy sequence of escaped
characters (via backslash). If C<single-quoted> is also specified,
single quotes need to be escaped too. If C<double-quoted> is also
specified, double quotes need to be escaped too. This is always set at
the end of the big regular expression (except for C<whatever>, that
might appear after it);

=item *

the string C<whatever>, that allows you to match a non-greedy sequence
of characters, i.e. it is a synonym of regular expression
C<(?ms:.*?)>. If present, it is always set at the end of the big
regular expression.
=back

For example, if you want to accept single quoted, double quoted and
unquoted strings, you might provide the following:

   [qw< single-quoted double-quoted whatever >]

=back

=head2 B<< ghashy >>

   my $tube = ghashy(%args); # OR
   my $tube = ghashy(\%args);

parse the input text as a hash, generalized. The algorithm used is the
same as L. It is a generalization of L</hashy> below.

Accepts all arguments as L, with the same default values except for C
that is set to the empty string (as opposed to not being defined). This
means that stand-alone values will always be accepted. This setting is
in line with L</hashy> and has been set for backwards/mutual
compatibility.

The following arguments are recognised too:

=over

=item C

a hash reference with default values for the output;

=item C

name of the input field, defaults to C;

=item C

name of the tube, useful for debugging. Defaults to C;

=item C

name of the output field, defaults to C;

=back

=head2 B<< hashy >>

   my $tube = hashy(%args); # OR
   my $tube = hashy(\%args);

parse the input text as a hash. The algorithm used is the same as L.

=over

=item C

character used to divide chunks in the input, defaults to a space
character (ASCII 0x20);

=item C

the default key to be used when a key is not present in a chunk,
defaults to the empty string;

=item C

a hash reference with default values for the output;

=item C

name of the input field, defaults to C;

=item C

character used to divide the key from the value in a chunk, defaults to
the equal sign C<=>;

=item C

name of the tube, useful for debugging. Defaults to C;

=item C

name of the output field, defaults to C;

=back

This tube factory is strict in what it accepts as input, in that the
separators MUST be single characters and there is no escaping
mechanism. If you need something more flexible, see L</ghashy> above.

=head2 B<< parse_by_format >>

Alias for L</by_format>.

=head2 B<< parse_by_regex >>

Alias for L</by_regex>.

=head2 B<< parse_by_separators >>

Alias for L</by_separators>.

=head2 B<< parse_by_split >>

Alias for L</by_split>.
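As an illustration of the splitting behaviour that this alias shares
with C<by_split> (single separator, optional trimming, optional keys,
tolerance for missing trailing values), here is a minimal plain-Perl
sketch. It is not the module's code: the function C<split_record> and
its option names (C<separator>, C<keys>, C<trim>, C<allow_missing>) are
assumptions made for this example, and it only handles a string
separator.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the documented behaviour; the function and
# option names are illustrative, not taken from the module.
sub split_record {
   my ($text, %opt) = @_;
   my $sep    = quotemeta $opt{separator};    # match separator verbatim
   my @values = split /$sep/, $text, -1;      # -1 keeps trailing empties
   if ($opt{trim}) { s{\A\s+|\s+\z}{}g for @values }
   my $keys = $opt{keys} or return \@values;  # no keys: array reference
   my $missing = @$keys - @values;
   die "too few values\n" if $missing > ($opt{allow_missing} // 0);
   my %record;
   @record{@$keys} = @values;    # keys without a value are set to undef
   return \%record;
}

my $list = split_record(' what : ever :you:do ',
   separator => ':', trim => 1);
print join('|', @$list), "\n";

my $hash = split_record('FOO,BAR,BAZ',
   separator     => ',',
   keys          => [qw< foo bar baz whatever >],
   allow_missing => 1);
print defined $hash->{whatever} ? "defined\n" : "undefined\n";
```

Without the tolerance for one missing value, the second call would
fail, because three values cannot fill four keys.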
=head2 B<< parse_by_value_separator >>

Alias for L</by_value_separator>.

=head2 B<< parse_ghashy >>

Alias for L</ghashy>.

=head2 B<< parse_hashy >>

Alias for L</hashy>.

=head2 B<< parse_single >>

Alias for L</single>.

=head2 B<< single >>

   my $tube = single(%args); # OR
   my $tube = single(\%args);

consider the input text as already parsed, and generate as output a
hash reference where the text is associated to a key.

=over

=item C

name of the input field, defaults to C;

=item C

key to use for associating the input text;

=item C

name of the tube, useful for debugging;

=item C

name of the output field, defaults to C;

=back

=head1 BUGS AND LIMITATIONS

Report bugs either through RT or GitHub (patches welcome).

=head1 AUTHOR

Flavio Poletti

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2016 by Flavio Poletti

This module is free software. You can redistribute it and/or modify it
under the terms of the Artistic License 2.0.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

=cut