Dictionary Definition
parser n : a computer program that divides code
up into functional components; "compilers must parse source code in
order to translate it into object code"
User Contributed Dictionary
Pronunciation
-
- Rhymes: -ɑː(r)zə(r)
Noun
- One who parses.
- A computer program that parses.
Translations
Extensive Definition
In computer
science and linguistics, parsing (more
formally: syntactic analysis) is the process of analyzing a
sequence of tokens to
determine grammatical structure with respect to a given (more or
less) formal
grammar. A parser is thus one of the components in an interpreter or compiler, where it captures the
implied hierarchy of the input text and transforms it into a form
suitable for further processing (often some kind of parse tree,
abstract
syntax tree or other hierarchical structure) and normally
checks for syntax errors at the same time. The parser often uses a
separate lexical
analyser to create tokens from the sequence of input
characters. Parsers may be programmed by hand or may be
semi-automatically generated (in some programming language) by a
tool (such as
Yacc) from a grammar written in Backus-Naur
form.
Parsing is also an earlier term for the
diagramming of sentences of natural languages, and is still used
for the diagramming of inflected
languages, such as the Romance
languages or Latin.
Parsers can also be constructed as executable
specifications of grammars in functional programming languages.
Frost, Hafiz and Callaghan have built on the work of others to
construct a set of higher-order
functions (called parser
combinators) which allow polynomial time and space complexity
top-down parser to be constructed as executable specifications of
ambiguous grammars containing left-recursive productions. The
X-SAIGA
site has more about the algorithms and implementation
details.
Human languages
- Also see :Category:Natural language parsing
In some machine
translation and
natural language processing systems, human languages are parsed
by computer programs. Human sentences are not easily parsed by
programs, as there is substantial ambiguity
in the structure of human language. In order to parse natural
language data, researchers must first agree on the grammar to be used. The choice
of syntax is affected by both linguistic and computational
concerns; for instance some parsing systems use lexical
functional grammar, but in general, parsing for grammars of
this type is known to be NP-complete.
Head-driven phrase structure grammar is another linguistic
formalism which has been popular in the parsing community, but
other research efforts have focused on less complex formalisms such
as the one used in the Penn Treebank. Shallow
parsing aims to find only the boundaries of major constituents
such as noun phrases. Another popular strategy for avoiding
linguistic controversy is dependency
grammar parsing.
Most modern parsers are at least partly statistical; that is, they
rely on a corpus of training data which has already been annotated
(parsed by hand). This approach allows the system to gather
information about the frequency with which various constructions
occur in specific contexts. (See machine
learning.) Approaches which have been used include
straightforward PCFGs (probabilistic
context free grammars), maximum
entropy, and neural nets.
Most of the more successful systems use lexical statistics (that
is, they consider the identities of the words involved, as well as
their part of
speech). However such systems are vulnerable to overfitting and require some
kind of smoothing to be effective.
Parsing algorithms for natural language cannot
rely on the grammar having 'nice' properties as with
manually-designed grammars for programming languages. As mentioned
earlier some grammar formalisms are very computationally difficult
to parse; in general, even if the desired structure is not context-free,
some kind of context-free approximation to the grammar is used to
perform a first pass. Algorithms which use context-free grammars
often rely on some variant of the CKY
algorithm, usually with some
heuristic to prune away unlikely analyses to save time. (See
chart
parsing.) However some systems trade speed for accuracy using,
eg, linear-time versions of the shift-reduce
algorithm. A somewhat recent development has been parse
reranking in which the parser proposes some large number of
analyses, and a more complex system selects the best option. It is
normally branching of one part and its subparts
Programming languages
The most common use of a parser is as a component
of a compiler or
interpreter. This
parses the source code
of a
computer programming language to create some form of internal
representation. Programming languages tend to be specified in terms
of a context-free
grammar because fast and efficient parsers can be written for
them. Parsers are written by hand or generated by parser
generators.
Context-free grammars are limited in the extent
to which they can express all of the requirements of a language.
Informally, the reason is that the memory of such a language is
limited. The grammar cannot remember the presence of a construct
over an arbitrarily long input; this is necessary for a language in
which, for example, a name must be declared before it may be
referenced. More powerful grammars that can express this
constraint, however, cannot be parsed efficiently. Thus, it is a
common strategy to create a relaxed parser for a context-free
grammar which accepts a superset of the desired language constructs
(that is, it accepts some invalid constructs); later, the unwanted
constructs can be filtered out.
Overview of process
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.The first stage is the token generation, or
lexical
analysis, by which the input character stream is split into
meaningful symbols defined by a grammar of regular
expressions. For example, a calculator program would look at an
input such as "12*(3+4)^2" and split it into the tokens 1,2, *, (,
3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the
context of an arithmetic expression. The parser would contain rules
to tell it that the characters *, +, ^, ( and ) mark the start of a
new token, so meaningless tokens like "12*" or "(3" will not be
generated.
The next stage is parsing or syntactic analysis,
which is checking that the tokens form an allowable expression.
This is usually done with reference to a context-free
grammar which recursively defines components that can make up
an expression and the order in which they must appear. However, not
all rules defining programming languages can be expressed by
context-free grammars alone, for example type validity and proper
declaration of identifiers. These rules can be formally expressed
with attribute
grammars.
The final phase is
semantic parsing or analysis, which is working out the
implications of the expression just validated and taking the
appropriate action. In the case of a calculator or interpreter, the
action is to evaluate the expression or program; a compiler, on the
other hand, would generate some kind of code. Attribute grammars
can also be used to define these actions.
Types of parsers
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:- Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse-trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules . LL parsers and recursive-descent parser are examples of top-down parsers, which cannot accommodate left recursive productions. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithm for top-down parsing have been created by Frost, Hafiz, and Callaghan which accommodates ambiguity and left recursion in polynomial time and which generates polynomial-size representations of the potentially-exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input w.r.t. a given CFG.
- Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.
Another important distinction is whether the
parser generates a leftmost derivation or a rightmost derivation
(see context-free
grammar). LL parsers will generate a leftmost derivation and LR parsers
will generate a rightmost derivation (although usually in reverse)
.
Examples of parsers
Top-down parsers
Some of the parsers that use top-down parsing include:- Recursive descent parser
- LL parser (Left-to-right, Leftmost derivation)
- X-SAIGA - eXecutable SpecificAtIons of GrAmmars. Contains publications related to top-down parsing algorithm that supports left-recursion and ambiguity in polynomial time and space.
Bottom-up parsers
Some of the parsers that use bottom-up parsing include:- Precedence parser
- BC (bounded context) parsing
- LR parser (Left-to-right, Rightmost derivation)
- CYK parser
References
- Parsing Techniques - A Practical Guide web page of book includes downloadable pdf.
See also
Parsing concepts
Parser development software
See also the comparison of parser generators.Wikimedia
parser in Catalan: Analitzador sintàctic
parser in Czech: Parser
parser in German: Parser
parser in Spanish: Analizador sintáctico
parser in Persian: تحلیلگر نحوی
parser in French: Décomposition analytique
parser in Korean: 구문 분석
parser in Croatian: Parsiranje
parser in Italian: Parsing
parser in Hungarian: Elemző (informatika)
parser in Macedonian: Парсер
parser in Dutch: Parser
parser in Japanese: 構文解析
parser in Polish: Parser
parser in Portuguese: Análise sintática
(computação)
parser in Russian: Синтаксический анализ
parser in Serbian: Парсер
parser in Finnish: Jäsennin
parser in Swedish: Parser
parser in Tamil: இலக்கணப் பாகுபடுத்தி
parser in Turkish: Metin ayrıştırıcı
parser in Ukrainian: Синтаксичний
аналіз