Overview
Explore the new tika-eval module for evaluating text extraction tools in this 44-minute conference talk by Tim Allison from The MITRE Corporation. Learn about the importance of text extraction in various applications, including search and natural language processing. Discover how Apache Tika™ detects file types and extracts metadata and text from numerous file formats. Gain insights into the evaluation methodology for content extraction systems, including metrics, limitations, and real-world results from testing on public domain documents. Understand common challenges in text extraction, such as hidden problems and missing text. Delve into topics like regression testing, evaluation metrics, and the importance of human interpretation in the evaluation process. Benefit from the speaker's extensive experience in natural language processing and content extraction as he shares valuable resources and conclusions about this crucial component in many popular tools like Solr™, Nutch™, and Elasticsearch.
Syllabus
Introduction
Overview
What's different
Content Extraction
Metadata
Blood on the Highway
Search
Regression Testing
What Can Go Wrong
Hidden Problems
Example of Missing Text
Dream
Evaluation Metric
TikaEval Overview
TikaEval Definitions
Why TikaEval
TikaEval
Profile
Compare
StartDB
Profile Reports
Common Words Metric
Similarity Metric
Common Word Metric
Evaluation Metric Public
Limitations
Human Interpretation
Conclusion
Resources
Thank you
Data import handler
Metadata normalization
Application dependent
Taught by
Linux Foundation