Linux Foundation

Evaluating Text Extraction: Apache Tika's New Tika-Eval Module

Linux Foundation via YouTube

Overview

Explore the new tika-eval module for evaluating text extraction tools in this 44-minute conference talk by Tim Allison from The MITRE Corporation. Learn about the importance of text extraction in various applications, including search and natural language processing. Discover how Apache Tika™ detects file types and extracts metadata and text from numerous file formats. Gain insights into the evaluation methodology for content extraction systems, including metrics, limitations, and real-world results from testing on public domain documents. Understand common challenges in text extraction, such as hidden problems and missing text. Delve into topics like regression testing, evaluation metrics, and the importance of human interpretation in the evaluation process. Benefit from the speaker's extensive experience in natural language processing and content extraction as he shares valuable resources and conclusions about this crucial component in many popular tools like Solr™, Nutch™, and Elasticsearch.

Syllabus

Introduction
Overview
What's different
Content Extraction
Metadata
Blood on the Highway
Search
Regression Testing
What Can Go Wrong
Hidden Problems
Example of Missing Text
Dream
Evaluation Metric
TikaEval Overview
TikaEval Definitions
Why TikaEval
TikaEval
Profile
Compare
StartDB
Profile Reports
Common Words Metric
Similarity Metric
Common Word Metric
Evaluation Metric Public
Limitations
Human Interpretation
Conclusion
Resources
Thank you
Data import handler
Metadata normalization
Application dependent

Taught by

Linux Foundation
