Data By the Bay is the first Data Grid conference: a matrix of 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Tuesday, May 17 • 2:10pm - 2:50pm
Aspect Based Sentiment Analysis in 20+ Languages


We benchmark a number of statistical approaches to Aspect-Based Sentiment Analysis (ABSA), using SemEval public data, against a linguistic approach. We discuss parsing, which is used in most of the benchmarked systems, and its two main branches: probabilistic and symbolic parsing. Finally, we propose an alternative approach that combines the best of both paradigms: linguistic/symbolic processing for topic and polarity detection, and Machine Learning for aspect categorization.

We present a grammar-based approach to Aspect-Based Sentiment Analysis (also known as Topic-Based Sentiment Analysis), which is currently available in 20+ languages. By "available" we mean that these 20+ languages are in production in numerous commercial projects, mainly Voice of the Customer (VoC) and survey coding projects. We describe the typical ingredients of a linguistic platform:
  • on the software side: a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser;
  • on the data side: corpus-based lexicons (with up to 300 million entries for morphologically complex languages like Finnish) and unification grammars (with anything from 500 to 1000 rules per language);
  • on the customization side: sentiment rules (around 1000 per language), domain-specific categorization rules, etc.
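
As a rough illustration of what a unification grammar checks, the sketch below unifies two feature structures represented as nested Python dictionaries. The `unify` function and the sample features (`num`, `per`) are assumptions made for this example only; they are not taken from the platform described in the talk, whose grammar formalism is not specified here.

```python
def unify(a, b):
    """Unify two feature structures; return None if they clash.

    A feature structure is a nested dict whose leaves are atomic strings.
    (Hypothetical toy representation, for illustration only.)
    """
    # Atomic values (or an atom against a dict) unify only if they are equal.
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None

    result = dict(a)  # shallow copy so the inputs are not modified
    for key, b_value in b.items():
        if key in result:
            merged = unify(result[key], b_value)
            if merged is None:
                return None  # conflicting values for the same feature
            result[key] = merged
        else:
            result[key] = b_value  # feature only present in b
    return result


# Toy agreement check between a determiner and a noun.
these = {"cat": "det", "agr": {"num": "pl"}}
rooms = {"cat": "noun", "agr": {"num": "pl", "per": "3"}}
room = {"cat": "noun", "agr": {"num": "sg", "per": "3"}}

agreement_ok = unify(these["agr"], rooms["agr"])   # compatible features merge
agreement_bad = unify(these["agr"], room["agr"])   # plural vs. singular clash
```

A grammar rule such as "determiner + noun" would succeed only when the agreement features unify, which is how a rule set of this kind can reject "these room" while accepting "these rooms".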

Probably the main advantage of the engine is that adding support for a new language is a matter of changing the data side (grammar and dictionaries), which can be done quickly and efficiently.
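
The separation between a fixed engine and swappable language data can be sketched as follows. This is a deliberately simplified keyword scan, not a GLR parser, and every name in it (`LANGUAGE_DATA`, `analyze`, the tiny word lists) is hypothetical; it only shows that the same code can serve a second language when given a different data pack.

```python
import re

# Hypothetical per-language data packs. Real lexicons and grammars are
# far larger; these few entries exist only to make the example run.
LANGUAGE_DATA = {
    "en": {
        "topics": {"room", "service", "breakfast"},
        "polarity": {"great": 1, "clean": 1, "dirty": -1, "slow": -1},
        "negators": {"not", "never"},
    },
    "es": {
        "topics": {"habitación", "servicio", "desayuno"},
        "polarity": {"excelente": 1, "limpia": 1, "sucia": -1, "lento": -1},
        "negators": {"no", "nunca"},
    },
}


def analyze(text, lang):
    """Return (topic, polarity) pairs using only the data pack for `lang`."""
    data = LANGUAGE_DATA[lang]
    tokens = re.findall(r"\w+", text.lower())
    results = []
    current_topic = None
    negated = False
    for token in tokens:
        if token in data["topics"]:
            current_topic = token   # start collecting opinions on this topic
            negated = False
        elif token in data["negators"]:
            negated = True          # flip the next polarity word
        elif token in data["polarity"] and current_topic is not None:
            score = data["polarity"][token]
            results.append((current_topic, -score if negated else score))
            negated = False
    return results


english = analyze("The room was not clean but the service was great", "en")
spanish = analyze("La habitación no estaba limpia y el servicio fue excelente", "es")
```

Note that `analyze` contains no language-specific logic: supporting another language in this toy means adding one more entry to `LANGUAGE_DATA`. The scan only attaches opinions that follow a topic word, a limitation a real parser would not have.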

The system achieves 70% accuracy out-of-the-box on most domains and types of text, and up to 90% accuracy when adapted to a specific domain. Domain adaptation is carried out by adding a small number of domain-dependent rules; the process is incremental and predictable. The grammars can be efficiently adapted to match the peculiarities of different types of text, from social media to news. No manually annotated corpora are needed, since the system does not require training of any kind.
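
One way to picture rule-based domain adaptation is as a small overlay of domain rules on top of a general rule set, as in the sketch below. The rule tables, the `beer_reviews` domain, and the four labeled sentences are all invented for this example, and the accuracy values it computes describe only this toy data; they are unrelated to the 70% and 90% figures quoted above.

```python
# General-purpose polarity rules (hypothetical toy entries).
GENERAL_RULES = {"cheap": 1, "cold": -1, "small": -1, "fast": 1}

# Hypothetical domain overlays: a few rules that override the general ones.
DOMAIN_RULES = {
    "beer_reviews": {"cold": 1},    # a cold beer is a good thing
    "phone_reviews": {"small": 1},  # a small phone may be a good thing
}


def build_rules(domain=None):
    """Return the general rules, overridden by any rules for `domain`."""
    rules = dict(GENERAL_RULES)
    rules.update(DOMAIN_RULES.get(domain, {}))
    return rules


def polarity(text, rules):
    """Sum word scores and return the sign: 1, 0, or -1."""
    score = sum(rules.get(word, 0) for word in text.lower().split())
    return (score > 0) - (score < 0)


def accuracy(examples, rules):
    """Fraction of (text, expected_polarity) pairs predicted correctly."""
    correct = sum(polarity(text, rules) == label for text, label in examples)
    return correct / len(examples)


# Invented evaluation sentences for the toy "beer_reviews" domain.
TOY_EXAMPLES = [
    ("served cold and cheap", 1),
    ("cold beer", 1),
    ("small glass", -1),
    ("fast service", 1),
]

baseline = accuracy(TOY_EXAMPLES, build_rules())
adapted = accuracy(TOY_EXAMPLES, build_rules("beer_reviews"))
```

Because each domain rule is an explicit override, its effect on the output can be traced to a single entry, which is one reading of why such adaptation is described as incremental and predictable.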


Antonio Valderrábanos

CEO and Founder, Bitext
I have long experience in using Deep Linguistic Analysis to solve business problems, particularly in the area of Text Analytics. I started out working for large R&D labs, at IBM and Novell. I developed the first grammar checker...

Tuesday May 17, 2016 2:10pm - 2:50pm
