We benchmark a number of statistical approaches to Aspect-Based Sentiment Analysis (ABSA), using public SemEval data, against a linguistic approach. We discuss parsing, which underlies most of the benchmarked systems, and its two main branches, probabilistic and symbolic parsing. Finally, we propose an alternative approach that combines the best of both paradigms: linguistic/symbolic processing for topic and polarity detection, and Machine Learning for aspect categorization.
We present a grammar-based approach to Aspect-Based Sentiment Analysis (also known as Topic-Based Sentiment Analysis) that is currently available in 20+ languages. By available we mean that all of these languages are in production in numerous commercial projects, mainly Voice of the Customer (VoC) and survey coding projects. We describe the typical ingredients of a linguistic platform:
- on the software side: a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser
- on the data side: corpus-based lexicons (with up to 300 million entries for morphologically complex languages like Finnish); and unification grammars (with anything from 500 to 1000 rules per language);
- on the customization side: sentiment rules (around 1000 per language); domain-specific categorization rules; etc.
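To make the grammar side of these ingredients concrete, the core operation behind a unification grammar can be sketched as follows. This is a hypothetical, minimal illustration (the function, feature names, and rule shown are our own, not the system's actual data): two feature structures unify if their features are compatible, and fail on a clash, which is how rules such as agreement constraints are enforced.

```python
# Minimal sketch of feature-structure unification, the operation at the
# heart of unification grammars. All names and features are illustrative.

def unify(fs1, fs2):
    """Unify two feature structures (nested dicts); return None on clash."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat not in result:
            result[feat] = val
        elif isinstance(result[feat], dict) and isinstance(val, dict):
            sub = unify(result[feat], val)
            if sub is None:
                return None
            result[feat] = sub
        elif result[feat] != val:
            return None  # feature clash: unification fails
    return result

# A rule like NP -> Det N can enforce agreement by unifying 'agr' features:
det = {"cat": "Det", "agr": {"num": "sg"}}
noun = {"cat": "N", "agr": {"num": "sg", "gender": "f"}}
print(unify(det["agr"], noun["agr"]))  # {'num': 'sg', 'gender': 'f'}

noun_pl = {"cat": "N", "agr": {"num": "pl"}}
print(unify(det["agr"], noun_pl["agr"]))  # None: number disagreement
```

In a full grammar, each of the 500-1000 rules per language would pair such constraints with a phrase-structure skeleton, and the GLR parser would explore alternative analyses non-deterministically, discarding those whose unifications fail.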
Arguably the engine's main advantage is that adding support for a new language is a matter of changing the data side (grammar and dictionaries), which can be done quickly and efficiently.
The system achieves 70% accuracy out of the box on most domains and types of text, and up to 90% accuracy when adapted to a specific domain. Domain adaptation is carried out by adding a small number of domain-dependent rules; the process is incremental and predictable. The grammars can be efficiently adapted to the peculiarities of different text types, from social media to news. No manually annotated corpora are needed, since the system requires no training of any kind.
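The incremental nature of domain adaptation can be illustrated with a toy example. This sketch is ours, not the system's actual rule format: generic sentiment rules assign a default polarity to a term, and a small set of domain rules overrides them where the domain changes a word's connotation (e.g. "unpredictable" is negative in general but can be praise for a film plot).

```python
# Toy illustration of rule-based domain adaptation: domain rules simply
# override generic ones, so adapting a domain means adding a few entries
# rather than retraining a model. Rule sets and terms are hypothetical.

GENERIC_RULES = {"great": "+", "terrible": "-", "unpredictable": "-"}
MOVIE_DOMAIN_RULES = {"unpredictable": "+"}  # an unpredictable plot is good

def polarity(term, domain_rules=None):
    """Look up a term's polarity, letting domain rules take precedence."""
    rules = dict(GENERIC_RULES)
    if domain_rules:
        rules.update(domain_rules)
    return rules.get(term, "0")  # "0" = neutral / unknown term

print(polarity("unpredictable"))                      # "-"
print(polarity("unpredictable", MOVIE_DOMAIN_RULES))  # "+"
print(polarity("great", MOVIE_DOMAIN_RULES))          # "+" (generic rule)
```

Because each added rule has a local, inspectable effect, accuracy improves predictably as rules are added, which is what makes the adaptation process incremental.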