package CWB::CEQL::Parser;

use Carp;
use Encode;

use warnings;
use strict;

=head1 NAME

CWB::CEQL::Parser - Deterministic Perl parser for simplified query languages

=head1 SYNOPSIS

  ##### DPP grammar in separate module #####
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub my_rule {
    my ($self, $input) = @_;
    ## body of grammar rule "my_rule" goes here, transforming $input => $output
    return $output;
  }

  sub default {
    ## default rule will be called if parser is applied to string
    my ($self, $input) = @_;
    return $self->Call("my_rule", $input);        # recursively call other grammar rules
  }

  1;

  ##### main program #####
  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);             # applies 'default' rule
  $result = $grammar->Parse($string, "my_rule");  # parse as given constituent

=head1 DESCRIPTION

B<CWB::CEQL::Parser> implements a B<heuristic>-driven, B<deterministic>,
B<top-down> parser for extended context-free grammars written in B<Perl>,
called a B<D>eterministic B<P>erl B<P>arser (B<DPP>).  This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).

The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other.  While such cascades are very easy to write in
Perl and perform efficiently, there are two important limitations: it would
often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).

B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing.  Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input.  If the selected expansion
does not match the input string, the entire parsing process fails.  Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.

Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>).  It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent.  In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures.  In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method.  Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.

The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.


=head1 WRITING GRAMMARS

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>).  If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules.  The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified.  The basic skeleton of a DPP grammar therefore
looks like this:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub some_rule {
    ## body of grammar rule "some_rule" goes here
  }

  sub default {
    ## default rule will be called if parser is applied to string
  }

  1; # usually, a grammar is implemented as a separate module file

The user instantiates a parser for this grammar as an object of type
B<MyGrammar>, then calls the B<Parse> method to analyse an input string
(optionally specifying the expected constituent type if it is different from
C<default>).  B<Parse> returns an analysis of the input string in the form
chosen by the grammar author.  In most cases, including the standard
B<CWB::CEQL> grammar, this will simply be a string containing a low-level
query expression.  Additional information can be returned by using objects of
class B<CWB::CEQL::String>, which behave like strings in most contexts
(through operator overloading) but can also be assigned a user-specified type
(see the L<CWB::CEQL::String> manpage for details).  Alternatively, an
arbitrary data structure or object can be returned instead of a string.  We
will assume in the following that DPP rules always return plain strings.

  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);              # applies 'default' rule
  $result = $grammar->Parse($string, "some_rule"); # parse as given constituent type

If parsing fails, the B<Parse> method returns B<undef>.  A full description of
the error and where it occurred can then be obtained with the B<ErrorMessage>
and B<HtmlErrorMessage> methods:

  @lines_of_text = $grammar->ErrorMessage;
  $html_code = $grammar->HtmlErrorMessage;

The latter takes care of encoding special characters as HTML entities where
necessary and has been included to simplify the integration of DPP grammars
into Web interfaces.

Internally, B<Parse> will invoke appropriate grammar rules.  In the first
example above, the B<default> method would be called with argument I<$string>;
in the second example, B<some_rule> would be called.  A typical DPP rule
performs the following operations:

=over 4

=item 1.

examine the input string to decide whether it appears to be a suitable
constituent, and to determine its internal structure

=item 2.

if the test in Step 1 fails, B<die> with a meaningful error message;
the B<Parse> method will catch this exception and report it to the user
together with a stack trace of where the error occured

=item 3.

make a deterministic choice of further processing steps, i.e. the rule commits
to a specific analysis of the input string and cannot go back on this decision
later on

=item 4.

apply other grammar rules to substrings of the input using the B<Call> method,
or split the input into a sequence of tokens passed to the B<Apply> method
(which invokes a shift-reduce-type parser); note that the rule has to make sure
that these substrings are constituents of the specified type

=item 5.

collect the parsing results for subconstituents returned in Step 4, optionally
add further parts of the input string, and apply necessary transformations;
the resulting string is returned as the rule's transformation of the input
(alternatively, an arbitrary data structure can be returned as an analysis of
the input string)

=back

Note that DPP rules always return an analysis or transformation of their
input; they are I<not allowed> to return B<undef> in order to show that the
input string failed to parse.  This is a consequence of the deterministic
nature of the DPP approach: the caller guarantees that the input is a
constituent of the specified type -- anything else is an error condition and
causes the rule to B<die>.  The two main adavantages of the DPP approach are
that (i) the parser does not have to perform any backtracking and (ii) grammar
rules do not need to check the return values of subrules invoked with B<Call>
or B<Apply>.

Sometimes, it may be unavoidable to try different analyses of an input string
in sequence.  In such exceptional cases, grammar writers can use the B<Try>
method to perform a simple type of backtracking.  B<Try> works exactly like
B<Call>, but will catch any exception raised due to a parse failure and return
B<undef> in this case.  Grammar writers are strongly advised to avoid
backtracking whenever possible, though: the deterministic nature of DPP is
essential for efficient parsing, and repeated backtracking will greatly
increase its computational complexity.

DPP grammars can be B<customised> in two ways.  One possibility is to
B<override existing rules> by subclassing the grammar, as described in the
section on L<"EXTENDING GRAMMARS">.  This offers an extremely flexible way of
changing grammar behaviour, but requires a detailed knowledge of the
B<CWB::CEQL::Parser> module and the internal design of the grammar.

A much easier customisation strategy is for grammar writers to define named
B<parameters>, which can then be set by end users in order to control certain
features of the grammar.  Typical applications of parameters include the
following:

=over 4

=item *

customisation of corpus attribute names (e.g., a parameter C<pos_attribute>
might specify the appropriate positional attribute for part-of-speech tags,
such as C<pos> or C<tag>)

=item *

activating or deactivating certain grammar rules (e.g., a parameter
C<has_lemma> might indicate whether a corpus includes lemmatisation
information or not; if it is FALSE, then input strings including lemma
constraints will raise parse errors in the respective grammar rules)

=item *

definition of lookup tables for simplified part-of-speech tags (which have to
be adapted to the tagset used by a particular corpus)

=back

Named parameters have to be defined in the constructor (i.e. the B<new>
method) of a grammar by calling the B<NewParam> method, which also sets a
default value for the new parameter.  They can then be modified or read out at
any time using the B<SetParam> and B<GetParam> methods.  It is an error to
set or read the value of a parameter that hasn't previously been defined.

A typical skeletion of a DPP grammar with parameters looks as follows:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub new {
    my $class = shift;
    my $self = new CWB::CEQL::Parser;
    $self->NewParam("pos_attribute", "pos");
    return bless($self, $class);
  }

  sub pos_tag {
    my ($self, $input) = @_;
    my $pos_att = $self->GetParam("pos_attribute");
    die "'$input' does not appear to be a valid POS tag\n"
      unless $input =~ /^[A-Z0-9]+$/;
    return "$pos_att = '$input'"; # CQP constraint for POS tag
  }

  # ... other grammar rules, including "default" ...

  1;

If your grammar does not define its own parameters, it is not necessary to
provide an explicit implementation of the B<new> method (unless some other
initialisation has to be performed).

A user can now apply B<MyGrammar> to a corpus that stores POS tags in
a p-attribute named C<tag>:

  use MyGrammar;

  our $grammar = new MyGrammar;
  $grammar->SetParam("pos_attribute", "tag");

  $cqp_query = $grammar->Parse($simple_query);

The following section presents some typical design patterns for DPP rules and
explains the use of B<Call>, B<Apply> and B<Try>.  Complete function
references are found in the sections L<"USER-VISIBLE METHODS"> and L<"METHODS
USED BY GRAMMAR AUTHORS">.  If you want to see an example of a complete DPP
grammar, it is a good idea to take a look at the implementation of the
standard CEQL grammar in the B<CWB::CEQL> module.  Knowledge of this grammar
implementation is essential if you want to build your own custom CEQL
extensions.


=head1 GRAMMAR RULES

=head2 Stand-alone rules

The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules.  These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages.  In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query.  The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular
expression, and B<die>s with an informative error message otherwise.

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;
    return $regexp;
  }

Alternatively, the rule could escape all regular expression metacharacters in
the input string so they are matched literally by the regular expression.
This version of the grammar might use an internal subroutine for translating
strings with wildcards safely into regular expressions:

  sub wildcard_expression {
    my ($self, $input) = @_;
    return _wildcard_to_regexp($input);
  }

  # note leading underscore for internal subroutine (this is not a method!)
  sub _wildcard_to_regexp {
    my $s = quotemeta(shift);
    $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
    $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
    return $s;
  }

=head2 Handling parse errors

DPP rules should always carry out strict checks to ensure that their input is
a well-formed constituent of the required type, and B<die> with a clear and
informative error message otherwise.  This helps users to locate and correct
syntax errors in their input.  If errors are caught too late, i.e. in deeply
nested subrules, it may be difficult to recognise the true cause of the
problem.

The error message passed to B<die> should be limited to a single line of text
if possible.  Always append a newline character (C<\n>) in order to suppress
the automatic Perl stack trace, which provides no useful information for
grammar users and is likely to be confusing.  B<CWB::CEQL::Parser> will add
its own stack trace of subrule invocations so that users can pinpoint the
precise location of the syntax error.  In order to make this stack trace
readable and informative, DPP rules should always be given descriptive names: use
C<wildcard_expression> or C<part_of_speech> rather than C<rule1723a>.

The B<HtmlErrorMessage> method will automatically convert HTML metacharacters
and non-ASCII characters to entities, so it is safe to include the returned
HTML code directly in a Web page.  Error messages may use basic wiki-style
formatting: C<''...''> for typewriter font, C<//...//> for italics and
C<**...**> for bold font.  Note that such markup is non-recursive and nested
formatting will be ignored.  User input should always be enclosed in
C<''...''> in error messages so that C<//> and C<**> sequences in the input
are not mistaken as formatting instructions.

=head2 Calling subrules

Most DPP rules divide the input string into one or more subconstituents,
similar to the rules of a standard context-free grammar.  The main difference
is that a DPP rule has to settle on the specific positions and categories
of the subconstituents, rather than just listing possible category sequences.
Many DPP rules will also remove syntactic operators and delimiters, so that
only complex subconstituents are passed to other rules for parsing with the
B<Call> method.

The following example allows users to search for a word form using either a
wildcard pattern or a regular expression enclosed in C</.../>.  The return
value is a CQP query.  As an additional optimisation, wildcard patterns that
do not contain any wildcards are matched literally (which is faster than a
regular expression and avoids possible conflicts with regexp metacharacters).

  sub wordform_pattern {
    my ($self, $input) = @_;
    die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
      if $input =~ /\s|\"/;
    if ($input =~ /^\/(.+)\/$/) {
      my $regexp = $1; # regular expression query: simply wrap in double quotes
      return "\"$regexp\"";
    }
    else {
      if ($input =~ /[?*+]/) {
        my $regexp = $self->Call("wildcard_expression", $input); # call subrule
        return "\"$regexp\"";
      }
      else {
        return "\"$input\"\%l";
      }
    }
  }

It would probably be a good idea to signal an error if the wordform pattern
starts or ends with a slash (C</>) but is not enclosed in C</.../> as a
regular expression query.  This is likely to be a typing mistake and the user
will be confused if the input is silently interpreted as a wildcard
expression.

=head2 Parsing sequences

If the input string consists of a variable number of subconstituents of the
same type, the B<Apply> method provides a convenient alternative to repeated
subrule calls.  It parses all specified subconstituents, collects the parse
results and returns them as a list.  The following example processes queries
that consist of a sequence wordform patterns separated by blanks (each pattern
is either a wildcard expression or regular expression, according to the DPP
rules defined above), and returns an equivalent CQP query.

  sub wordform_sequence {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_patterns = $self->Apply("wordform_pattern", @items);
    return "@cqp_patterns";
  }

Recall that the list returned by B<Apply> does not have to be validated: if
any error occurs, the respective subrule will B<die> and abort the complete
parse.

=head2 The shift-reduce parser for nested bracketing

The B<Apply> method is more than a convenient shorthand for parsing lists of
constituents.  Its main purpose is to parse nested bracketing structures,
which are very common in the syntax of formal languages (examples include
arithmetical formulae, regular expressions and most computer programming
languages).  When parsing the constituents of a list with nested bracketing,
two special methods, B<BeginGroup> and B<EndGroup>, are called to mark opening
and closing delimiters.  Proper nesting will then automatically be verified by
the DPP parser.  If the syntax allows different types of groups to be mixed,
optional names can be passed to the B<BeginGroup> and B<EndGroup> calls in
order to ensure that the different group types match properly.

The output generated by the items of a bracketing group is collected
separately and returned when B<EndGroup> is called.  From this list, the rule
processing the closing delimiter has to construct a single expression for the
entire group.  Note that the return value of the DPP rule calling
B<BeginGroup> becomes part of the bracketing group output.  If this is not
desired, the rule must return an empty string (C<"">).  Rules can also check
whether they are in a nested group with the help of the B<NestingLevel> method
(which returns 0 at the top level).

The example below extends our simple query language with regexp-style
parenthesised groups, quantifiers (C<?>, C<*>, C<+>) and alternatives (C<|>).
In order to simplify the implementation, metacharacters must be separated from
wordform patterns and from other metacharacters by blanks; and quantifiers
must be attached directly to a closing parenthesis (otherwise, the question
mark in C<) ?> would be ambiguous between a quantifier and a wildcard pattern
matching a single character).  Note that the C<simple_query> rule is
practically identical to C<wordform_sequence> above, but has been renamed to
reflect its new semantics.

  sub simple_query {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_tokens = $self->Apply("simple_query_item", @items);
    return "@cqp_tokens";
  }

  # need to define single rule to parse all items of a list with nested bracketing
  sub simple_query_item {
    my ($self, $item) = @_;
    # opening delimiter: (
    if ($item eq "(") {
      $self->BeginGroup();
      return "";  # opening delimiter should not become part of group output
    }
    # alternatives separator: | (only within nested group)
    elsif ($item eq "|") {
      die "a group of alternatives (|) must be enclosed in parentheses\n"
        unless $self->NestingLevel > 0; # | metacharacter is not allowed at top level
      return "|";
    }
    # closing delimiter: ) with optional quantifier
    elsif ($item =~ /^\)([?*+]?)$/) {
      my $quantifier = $1;
      my @cqp_tokens = $self->EndGroup();
      die "empty groups '( )' are not allowed\n"
        unless @cqp_tokens > 0;
      return "(@cqp_tokens)$quantifier";
    }
    # all other tokens should be wordform patterns
    else {
      return $self->Call("wordform_pattern", $item);
    }
  }

For a complete grammar implementation, don't forget to specify the default rule!

  sub default {
    my ($self, $input) = @_;
    $self->Call("simple_query", $input);
  }

=head2 Structural transformations with the shift-reduce parser

The B<Apply> mechanism does not implement a full-fledged shift-reduce parser.
It is well suited for nested bracketing, where structures have explicit start
and end markers, but it cannot automatically handle structural transformations
that are needed e.g. to parse infix operators.  The running example in this
subsection will be a grammar for simple arithmetic expressions, consisting of
numbers, C<+> and C<-> operators, and parentheses for grouping.  For instance,
the expression C<42 - (9 - 6)> should be transformed into nested function calls
C<sub(42, sub(9, 6))>.

One strategy for parsing such expressions is simply to collect all elements
within a group, and then perform necessary transformations on the list
returned by the B<EndGroup> method when a closing delimiter is encountered.
This approach is facilitated by the B<CWB::CEQL::String> module, which allows
strings returned by grammar rules to be annotated with type information
(technically, B<CWB::CEQL::String> objects are complex data structures which
can be interpolated like ordinary strings).  In our example, strings are
either operators (type C<Op>) or terms (numbers or parenthesised
subexpressions, type C<Term>).  A flat sequence of terms and operators is
translated into nested function calls by the internal function
C<_shift_reduce>, which repeatedly collapses a sequence C<Term Op Term> into a
single C<Term>.  Note that C<_shift_reduce> has to be called in two places in
the grammar: (1) at the end of each bracketing group and (2) for the top-level
sequence returned by the B<Apply> method.

  package Arithmetic;
  use base 'CWB::CEQL::Parser';
  use CWB::CEQL::String;

  sub default {
    my ($self, $input) = @_;
    return $self->Call("arithmetic_expression", $input);
  }

  sub arithmetic_expression {
    my ($self, $input) = @_;
    $input =~ s/([()+-])/ $1 /g;            # insert whitespace around metacharacters
    $input =~ s/^\s+//; $input =~ s/\s+$//; # strip leading/trailing whitespace
    my @items = split " ", $input;          # split on whitespace into items (numbers, operators, parentheses)
    my @terms_ops = $self->Apply("arithmetic_item", @items); # returns list of Term's and Op's
    return $self->_shift_reduce(@terms_ops);
  }

  sub arithmetic_item {
    my ($self, $item) = @_;
    if ($item eq "+")    { return new CWB::CEQL::String "add", "Op" }
    elsif ($item eq "-") { return new CWB::CEQL::String "sub", "Op" }
    elsif ($item eq "(") { $self->BeginGroup("subexpression"); return "" }
    elsif ($item eq ")") {
      my @terms_ops = $self->EndGroup("subexpression");
      return $self->_shift_reduce(@terms_ops);
    }
    elsif ($item =~ /^[0-9]+$/) { return new CWB::CEQL::String $item, "Term" }
    else { die "invalid element '' $item '' in arithmetic expression\n" }
  }

  sub _shift_reduce {
    my ($self, @terms_ops) = @_;
    while (@terms_ops >= 3) {
      # reduce first three items (which must be Term Op Term) to single Term
      my @types = map {$_->type} @terms_ops;
      die "syntax error in arithmetic expression\n"
        unless "@types" =~ /^Term Op Term/; # wrong sequence of terms and operators
      my $t1 = shift @terms_ops;
      my $op = shift @terms_ops;
      my $t2 = shift @terms_ops;
      my $new_term = new CWB::CEQL::String "$op($t1, $t2)", "Term";
      unshift @terms_ops, $new_term;
    }
    die "syntax error in arithmetic expression\n"
      unless @terms_ops == 1;     # wrong number of items
    return shift @terms_ops;
  }

The obvious drawback of this approach is the difficulty of signaling the
precise location of a syntax error to the user (in the example grammar above,
the parser will simply print C<syntax error> if there is any problem in a
sequence of terms and operators).  By the time the error is detected, all
items in the active group have already been pre-processed and subexpressions
have been collapsed.  Printing the current list of terms and operators would
only add to the user's confusion.

In order to signal errors immediately where they occur, each item should be
validated before it is added to the result list (e.g. an operator may not be
pushed as first item on a result list), and the reduce operation (C<< Term Op
Term => Term >>) should be applied as soon as possible.  The rule
C<arithmetic_item> needs direct access to the currently active result list for
this purpose: (1) to check how many items have already been pushed when
validating a new item, and (2) to reduce a sequence C<Term Op Term> to a single
C<Term> in the result list.

A pointer to the currently active result list is obtained with the internal
B<currentGroup> method, allowing a grammar rule to manipulate the result list.
The B<proximity queries> in the B<CWB::CEQL> grammar illustrate this advanced
form of shift-reduce parsing.


=head2 Backtracking with Try()

B<** TODO **>


=head1 EXTENDING GRAMMARS

B<** TODO **>


=head1 USER-VISIBLE METHODS

Methods that are called by the "end users" of a grammar.

=over 4

=item I<$grammar> = B<new> MyGrammar;

Create parser object I<$grammar> for the specified grammar (which must be a
class derived from B<CWB::CEQL::Parser>).  Note that the parser itself is not
reentrant, but multiple parsers for the same grammar can be run in parallel.
The return value I<$grammar> is an object of class B<MyGrammar>.

=cut

sub new {
  my $class = shift;
  my $self = {
              'PARAM_DEFAULTS' => {},  # globally set default values for parameters
              'PARAM' => undef,        # working copies of parameters during parse
              'INPUT' => undef,        # input string (defined while parsing)
              'ERROR' => undef,        # error message generated by last parse (undef = no error)
              'CALLSTACK' => [],       # call stack for backtrace in case of error
              'GROUPS' => undef,       # group structure for shift-reduce parser (undef if not active)
              'GROUPSTACK' => undef,   # stack of nested bracketing groups (undef if not active)
             };
  bless($self, $class);
}

=item I<$result> = I<$grammar>->B<Parse>(I<$string> [, I<$rule>]);

Parse input string I<$string> as a constituent of type I<$rule> (if
unspecified, the C<default> rule will be used).  The return value I<$result>
is typically a string containing the transformed query, but may also be an
arbitrary data structure or object (such as a parse tree for I<$input>).
Consult the relevant grammar documentation for details.  If parsing fails,
B<undef> is returned.

=cut

sub Parse {
  croak 'Usage:  $result = $grammar->Parse($string [, $rule]);'
    unless @_ == 2 or @_ == 3;
  my ($self, $input, $rule) = @_;
  $rule = "default"
    unless defined $rule;
  confess "CWB::CEQL::Parser: Parse() method is not re-entrant\n(tried to parse '$input' while parsing '".$self->{INPUT}."')"
    if defined $self->{INPUT};

  $self->{INPUT} = $input;
  %{$self->{PARAM}} = %{$self->{PARAM_DEFAULTS}}; # shallow copy of hash
  $self->{CALLSTACK} = [];              # re-initialise call stack (should destroy information from last parse)
  $self->{GROUPS} = undef;              # indicate that shift-reduce parser is not active
  $self->{CURRENT_GROUP} = undef;
  $self->{GROUPSTACK} = undef;
  $self->{ERROR} = undef;               # clear previous errors

  my $result = eval { $self->Call($rule, $input) }; # catch exceptions from parse errors
  if (not defined $result) {
    my $error = $@;
    chomp($error);                      # remove trailing newline
    $error = "parse of '' ".$self->{INPUT}." '' returned no result (reason unknown)"
      if $error eq "";
    $error =~ s/\s*\n\s*/ **::** /g;
    $self->{ERROR} = $error;
  }

  $self->{INPUT} = undef;               # no active parse
  $self->{PARAM} = undef;               # restore global parameter values (PARAM_DEFAULTS)
  return $result;                       # undef if parse failed
}

=item I<@lines_of_text> = I<$grammar>->B<ErrorMessage>;

If the last parse failed, returns a detailed error message and backtrace of
the callstack as a list of text lines (without newlines).  Otherwise, returns
empty list.

=cut

sub ErrorMessage {
  my $self = shift;
  my $error = $self->{ERROR};
  return ()
    unless defined $error;

  my @lines = "**Error:** $error";
  my $previous_frame = { RULE => "", INPUT => "" }; # init do dummy frame to avoid special case below
  foreach my $frame (reverse @{$self->{CALLSTACK}}) {
    my $rule = $frame->{RULE};
    if ($rule eq "APPLY") {
      my @done = @{$frame->{APPLY_DONE}};
      my @remain = @{$frame->{APPLY_ITEMS}};
      push @lines, " - at this location: '' @done ''**<==**'' @remain ''";
    }
    else {
      my $input = $frame->{INPUT};
      my $previous_input = $previous_frame->{INPUT} || "";
      if (($previous_input eq $input) and ($previous_frame->{RULE} ne "APPLY")) {
        $lines[-1] .= ", **$rule**";
      }
      else {
        push @lines, " - when parsing '' $input '' as **$rule**";
      }
    }
    $previous_frame = $frame;
  }
  return @lines;
}

=item I<$html_code> = I<$grammar>->B<HtmlErrorMessage>;

If the last parse failed, returns HTML-formatted error message and backtrace
of the callstack.  The string I<$html_code> is valid HTML and can directly be
included in a generated Web page.  In particular, unsafe and non-ASCII
characters have been encoded as HTML entities.  Simple, non-recursive
wiki-style markup in an error message is interpreted in the following way:

  **<text>**    <text> is shown in bold font (<b> ... </b>)
  //<text>//    <text> is displayed in italics (<i> ... </i>)
  ''<text>''    <text> is shown in typewriter font (<code> ... </code>)

Lines starting with C< - > (note the two blanks) are converted into list items.

=cut

sub HtmlErrorMessage {
  my $self = shift;
  my @text_lines = $self->ErrorMessage();
  if (@text_lines > 0) {
    return $self->formatHtmlText(@text_lines);
  }
  else {
    return undef;
  }
}

=item I<$grammar>->B<SetParam>(I<$name>, I<$value>);

=item I<$value> = I<$grammar>->B<GetParam>(I<$name>);

Set the value of parameter I<$name> (B<SetParam>), or read its current value
(B<GetParam>).  The parameter I<$name> must have been defined by the grammar
class (which I<$grammar> is an instance of) and should be described in the
grammar's documentation.

=cut

sub SetParam {
  croak 'Usage:  $grammar->SetParam($name, $value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  $param_set->{$name} = $value;
}

sub GetParam {
  croak 'Usage:  $grammar->GetParam($name)'
    unless @_ == 2;
  my ($self, $name) = @_;
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  return $param_set->{$name};
}

=back


=head1 METHODS USED BY GRAMMAR AUTHORS

Methods for grammar authors.  Since these methods are intended for use in the
rules of a DPP grammar, they are typically applied to the object I<$self>.

=over 4

=item I<$self>->B<NewParam>(I<$name>, I<$default_value>);

Define new parameter I<$name> with default value I<$default_value>.  This
method is normally called in the constructor (method B<new>) of a
parameterized grammar.  If it is used in a rule body, the new parameter
will be created in the working copy of the parameter set and will only be
available during the current parse.

=cut

sub NewParam {
  confess 'Usage:  $self->NewParam($name, $default_value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  confess "CWB::CEQL::Parser: parameter '$name' already exists, cannot create with NewParam()"
    if exists $param_set->{$name};
  $param_set->{$name} = $value;
}

=item I<$result> = I<$self>->B<Call>(I<$rule>, I<$input>);

Apply rule I<$rule> to input string I<$input>.  The return value I<$result>
depends on the grammar rule, but is usually a string containing a translated
version of I<$input>.  Grammar rules may also annotate this string with
B<attributes> or by B<bless>ing it into a custom class, or return a complex
data structure such as a parse tree for I<$input>.  The caller has to be aware
what kind of value I<$rule> returns.

Note that B<Call> never returns B<undef>.  In case of an error, the entire
parse is aborted.

=cut

sub Call {
  confess 'Usage:  $result = $self->Call($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};
  my $method = $self->can("$rule");
  confess "the rule **$rule** does not exist in grammar **".ref($self)."** (internal error)\n"
    unless defined $method;
  my $frame = {RULE => $rule,
               INPUT => $input};
  push @{$self->{CALLSTACK}}, $frame; 
  my $result = $method->($self, $input);
  die "rule **$rule** failed to return a result (internal error)\n"
    unless defined $result;
  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return $result;
}

=item I<$result> = I<$self>->B<Try>(I<$rule>, I<$input>);

Tentatively apply rule I<$rule> to the input string.  If I<$input> is parsed
successfully, B<Try> returns the translated version I<$result> (or an
arbitrary data structure such as a parse tree for I<$input>) just as B<Call>
would.  If parsing fails, B<Try> does not abort but simply returns B<undef>,
ignoring any error messages generated during the attempt.  In addition, the
call stack is restored and all parameters are reset to their previous values,
so that parsing can continue as if nothing had happened (note, however, that
this is based on flat backup copies, so complex data structures may have been
altered destructively).

=cut

sub Try {
  confess 'Usage:  $result = $self->Try($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};

  ## make flat backup copies of important data structures and ensure they are restored upon return
  ## (this is not completely safe, but should undo most changes that a failed parse may have made)
  my $back_param = [ @{$self->{PARAM}} ];
  my $back_callstack = [ @{$self->{CALLSTACK}} ]; 
  my ($back_groups, $back_current_group, $back_groupstack) = (undef, undef, undef);
  if (defined $self->{GROUPS}) {
    $back_groups = [ @{$self->{GROUPS}} ];
    $back_current_group = [ @{$back_groups->[0]} ]
      if @$back_groups > 0;
  }
  if (defined $self->{GROUPSTACK}) {
    $back_groupstack = [ @{$self->{GROUPSTACK}} ];
  }

  my $result = eval { $self->Call($rule, $input) };

  ## if parsing failed, restore internal data structures from backup copies
  if (not defined $result) {
    $self->{PARAM} = $back_param;
    $self->{CALLSTACK} = $back_callstack;
    $self->{GROUPS} = $back_groups;
    if (defined $back_groups and defined $back_current_group) {
      $self->{GROUPS}->[0] = $back_current_group;
    }
    $self->{GROUPSTACK} = $back_groupstack;
  }

  return $result;
}

=item I<@results> = I<$self>->B<Apply>(I<$rule>, I<@items>);

Apply rule I<$rule> to each input string in the list I<@items>.  The return
values are collected and returned as a list I<@results>, which has to be
further processed by the caller.  Note that empty strings (C<"">) are
automatically removed from the list of return values.

=cut

sub Apply {
  confess 'Usage:  @results = $self->Apply($rule, @items);'
    unless @_ >= 2;
  my $self = shift;
  my $rule = shift;
  my $frame = {RULE => "APPLY",
               INPUT => undef,
               APPLY_ITEMS => [ @_ ],
               APPLY_DONE => []};
  push @{$self->{CALLSTACK}}, $frame;

  ## data structures for nested groups and result values must be restored on exit (in case of nested Apply())
  local $self->{GROUPS} = [ [] ];   # set up data structure to collect result values of nested groups
  local $self->{GROUPSTACK} = [];   # stack of nested groups (keeps track of nesting depth and ensures proper nesting)

  ## process each input item in turn
  while (@{$frame->{APPLY_ITEMS}}) {
    my $input = shift @{$frame->{APPLY_ITEMS}};
    push @{$frame->{APPLY_DONE}}, $input;
    my $result = $self->Call($rule, $input);
    push @{$self->{GROUPS}->[0]}, $result
      unless $result eq "" and not ref $result;  # plain empty string indicates that this item does not generate output
  }

  ## check that nested bracketing was balanced and that data structure for result values is sane
  if (@{$self->{GROUPSTACK}} > 0) {
    my $next_type = pop @{$self->{GROUPSTACK}};
    my $type_msg = ($next_type eq "*") ? "" : "of type ''$next_type''";
    die "bracketing is not balanced: too many opening delimiters $type_msg\n";
  }
  confess "data structure for result values is corrupt in Apply() call (internal error)"
    unless @{$self->{GROUPS}} == 1;
  my @results = @{$self->{GROUPS}->[0]};

  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return @results;
}

=item I<$self>->B<BeginGroup>([I<$name>]);

Marks the start of a nested group, when an opening delimiter is encountered.
B<BeginGroup> may only be called while the shift-reduce parser is active
during an B<Apply> operation.  The optional parameter I<$name> can be used to
ensure proper nesting of different types of groups; the default group name is
C<*>.  After calling B<BeginGroup>, a DPP rule will often return C<""> since
the opening determiner has a purely syntactic function and is not generate
output directly.

=cut

sub BeginGroup {
  my ($self, $name) = @_;
  $name = "*"
    unless defined $name;
  my $group_stack = $self->{GROUPSTACK};
  confess "BeginGroup() called outside Apply() operation (internal error)"
    unless defined $group_stack;
  push @$group_stack, $name;
  unshift @{$self->{GROUPS}}, [];
}

=item I<@group_results> = I<$self>->B<EndGroup>([I<$name>]);

Marks the end of a nested group, when a closing delimiter is encountered.  The
optional parameter I<$name> (or the default name C<*>) must be identical to
the group name of the matching opening delimiter.  B<EndGroup> returns a list
containing the result values collected from this nested group.

=cut

sub EndGroup {
  my ($self, $name) = @_;
  $name = "*"
    unless defined $name;
  my $group_stack = $self->{GROUPSTACK};
  my $groups = $self->{GROUPS};
  confess "EndGroup() called outside Apply() operation (internal error)"
    unless defined $group_stack;
  my $type_msg = ($name eq "*") ? "" : "of type ''$name''";
  die "bracketing is not balanced: too many closing delimiters $type_msg\n"
    unless @$group_stack > 0;
  confess "data structure for result values is corrupt in Apply() call (internal error)"
    unless @$groups == @$group_stack + 1;
  my $active_group = pop @$group_stack;
  die "opening delimiter of type ''$active_group'' paired with closing delimiter of type ''$name''\n"
    unless $name eq $active_group;
  my $group_results = shift @$groups;
  return @$group_results;
}

=item I<$n> = I<$self>->B<NestingLevel>;

Returns the nesting depth I<$n> of the current group during an B<Apply>
operation.  A nesting depth of 0 corresponds to the top level.
B<NestingLevel> may only be called while the shift-reduce parser is active and
will B<die> otherwise.

=cut

sub NestingLevel {
  my $self = shift;
  my $group_stack = $self->{GROUPSTACK};
  confess "NestingLevel() called outside Apply() operation (internal error)"
    unless defined $group_stack;
  return scalar @{$group_stack};
}

=back


=head1 INTERNAL METHODS

Internal methods of B<CWB::CEQL::Parser>.

=over 4

=item I<$array_ref> = I<$self>->B<currentGroup>;

Returns a pointer to the currently active result list during an B<Apply>
operation (either the top-level result list, or the local result list in a
nested group).  This pointer can be used to access previously collected
return values (before B<EndGroup> is called), and to manipulate the result
list (e.g. to perform advanced shift-reduce parsing).

It is an error to call this method while the shift-reduce parser is not
active.

=cut

sub currentGroup {
  my $self = shift;
  confess "currentGroup() called outside Apply() operation (internal error)"
    unless defined $self->{GROUPSTACK};
  return $self->{GROUPS}->[0];
}

=item I<$html_code> = I<$grammar>->B<formatHtmlText>(I<@lines_of_text>);

Format one or more text lines with simple wiki-style markup as HTML.  The
string I<$html_code> is valid HTML and can directly be included in a generated
Web page.  In particular, unsafe and non-ASCII characters are automatically
encoded as HTML entities.  The following typographic markup is supported:

=over 4

=item *

C<< **<text>** >> - <text> is displayed in bold face (C<< <b> ... </b> >>)

=item *

C<< //<text>// >> - <text> is displayed in italics (C<< <i> ... </i> >>)

=item *

C<< ''<text>'' >> - <text> is shown in typewriter font (C<< <code> ... </code> >>)

=item *

lines starting with C< - > (note the two blanks before and after the
hyphen) are converted into list items

=item *

all other lines are formatted as separate paragraphs (C<< <p> ... </p> >>)

=back

The wiki markup is non-recursive, i.e. no substitutions will be applied to
the text wrapped in C<''...''> etc.  This behaviour is intentional, so that
e.g. B<**> in a query expression will not be mistaken for a bold face marker,
(as long as the query is displayed in typewriter font, i.e. as C<''<query>''>).

=cut

sub formatHtmlText {
  my $self = shift;
  my @html_lines = ();
  my $in_list = 0;
  while (@_) {
    my $line = shift;
    my $list_item = ($line =~ s{^ -\s+}{}) ? 1 : 0;
    $line = $self->encodeEntities($line); # does not affect **, // and ''
    $line =~ s{(\*\*|//|'')(.*?)(\1)}{
      if ($1 eq "**")    { "<b>$2</b>" }
      elsif ($1 eq "//") { "<i>$2</i>" }
      else               { "<code>$2</code>" }
    }ge;
    if ($list_item) {
      push @html_lines, "<ul>"
        unless $in_list;
      push @html_lines, "<li>$line</li>";
      $in_list = 1;
    }
    else {
      push @html_lines, "</ul>"
        if $in_list;
      push @html_lines, "<p>$line</p>";
      $in_list = 0;
    }
  }
  push @html_lines, "</ul>"
    if $in_list;
  return join("\n", @html_lines);
}

=item I<$html> = I<$grammar>->B<encodeEntities>(I<$string>);

Replacement for B<encode_entities> function from B<HTML::Entities>, to avoid
dependency on this package (which is not part of the standard library).
Transforms unsafe characters C<E<lt>>, C<E<gt>>, C<&> and C<"> into HTML
entities, normalises whitespace and removes other control characters.

If I<$string> is a Unicode string, all non-ASCII characters are replaced
by numerical entities (otherwise, an unknown 8-bit character set is assumed,
so no substitutions can be made).

=cut

sub encodeEntities {
  my ($self, $s) = @_;
  my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );
  $s =~ s/([<>&"])/$entity{$1}/ge;  # unsafe characters => entities
  $s =~ s/[ \t]+/ /g;               # normalise whitespace (but not line breaks)
  $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g; # remove other control characters except LF and CR
  if (Encode::is_utf8($s)) {
    $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
  }
  return $s;
}

=back

=head2 Internal structure of CWB::CEQL::Parser objects

A DPP parser object (i.e. an object that belongs to B<CWB::CEQL::Parser> or
one of its subclasses) is a data structure (hashref) with the following
variables:

=over 4

=item PARAM_DEFAULTS

A hashref containing the global values of grammar parameters, i.e. values set
by the main program for this parser object or the default values defined by
the grammar class.

=item PARAM

Working copy of the grammar parameters, which is used while parsing and may be
modified by grammar rules without affecting the global values.  During a
parse, the B<NewParam>, B<SetParam> and B<GetParam> methods operate on this
working copy.

The C<PARAM> variable is re-initialised before each parse with a flat copy of
the C<PARAM_DEFAULTS> hashref.  Therefore, care has to be taken when modifying
complex parameter values within grammar rules, as the changes will affect the
global values in C<PARAM_DEFAULTS>.  If complex values need to be changed
internally, the grammar rule should always update the parameter with
B<SetParam> and a deep copy of the previous parameter value.

=item INPUT

The current input string passed to the B<Parse> method.  This variable is
mostly used to indicate whether the parser is currently active or not (e.g. in
order to avoid nested B<Parse> calls).

=item ERROR

Error message generated by the last parse, or B<undef> if the parse was
successful.  This error message is returned by B<ErrorMessage> and
B<HtmlErrorMessage> together with a backtrace of the parser's call stack.

=item CALLSTACK

The C<CALLSTACK> variable is an arrayref with information about the nested
calls of grammar rules and their input strings.  Each array element
corresponds to a nested rule invocation and is a hashref with the following
fields:

=over 4

=item RULE 

Name of the grammar rule (i.e. Perl B<method>) invoked.  When the shift-reduce
parser is called with B<Apply>, a special rule named C<APPLY> is pushed on the
stack.

=item INPUT

Input string for the grammar rule (which should be a constituent of the
respective type).

=item APPLY_ITEMS (optional, "APPLY" rule only)

List (arrayref) of items passed to B<Apply> for processing by the shift-reduce
parser.  This field is only present in the call stack entry for the special
C<APPLY> rule.  Items are shifted from this list to C<APPLY_DONE> as they are
processed by the shift-reduce parser.

=item APPLY_DONE (optiona, "APPLY" rule only) 

Items from the list passed to B<Apply> that have already been handled by the
shift-reduce parser.  The main purpose of C<APPLY_ITEMS> and C<APPLY_DONE> is
to narrow down the location of parse errors in a nested bracketing structure.

=back

=item GROUPS

List (arrayref) of arrayrefs collecting parse results for nested bracketing
groups.  The first element of this list corresponds to the currently active
bracketing group.  The C<GROUPS> variable is only defined while the
shift-reduce parser is active.

=item GROUPSTACK

Stack (arrayref) of nested bracketing groups.  Each stack element corresponds
to one level of nesting and is a string giving the type of the respective
group.  If no type has been specified by the user, the default value C<*> is
used.  The length of this array can be used to determine the current nesting
depth.

=back


=head1 COPYRIGHT

Copyright (C) 1999-2010 Stefan Evert [http::/purl.org/stefan.evert]

This software is provided AS IS and the author makes no warranty as to
its use and performance. You may use the software, redistribute and
modify it under the same terms as Perl itself.

=cut

1;