6TH MAY 16
Zeropoint is a Belgian enterprise that gives businesses a competitive
advantage through offshoring. Our model is aimed at providing
precisely the talented people you need while allowing you to
stay in charge of all your processes.
Introduction
Solution
Python Example
Reference
Questions
Problem
Document handling is still not an easy task. While working on a
personal project, I came across a problem where I had to extract
the document information along with the text using Python.
A custom ContentHandler and custom document parsers were not
suitable, as different document types have different formats.
The complexity of the problem rose with the introduction of
different file types, such as:
Scanned PDFs
Scanned images with text (PNG, JPG, etc.)
Actual documents such as DOC, PPT, ODT, PDF, etc.
Solution
Apache Tika
The Apache Tika toolkit detects and extracts metadata and text from
over a thousand different file types (such as PPT, XLS, PDF and
many more).
Files can be parsed through a single interface
Useful for search engine indexing
Content Analysis
Translations
Python Usage
import tika
from tika import parser
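The rest of the slide's listing did not survive, so the following is only a minimal sketch of how the two imports above are typically used, not the original code. It assumes the tika package from PyPI and a Java runtime are installed, and that parser.from_file returns a dictionary with "metadata" and "content" keys. The helper names parse_document and summarize are hypothetical, introduced here for illustration.

```python
from typing import Any, Dict


def parse_document(path: str) -> Dict[str, Any]:
    """Parse a file with Tika and return the raw result dictionary."""
    # Imported lazily so summarize() below can be used without tika installed.
    # Assumption: `pip install tika` and a Java runtime are available.
    from tika import parser

    # from_file hands the file to a Tika server, which the library
    # starts on first use, and returns the parsed result as a dict.
    return parser.from_file(path)


def summarize(parsed: Dict[str, Any], preview_chars: int = 200) -> Dict[str, Any]:
    """Reduce a Tika result to its content type, metadata keys and a text preview."""
    # Both values may be missing or None, e.g. for a file with no extractable text.
    metadata = parsed.get("metadata") or {}
    content = (parsed.get("content") or "").strip()
    return {
        "content_type": metadata.get("Content-Type"),
        "metadata_keys": sorted(metadata),
        "preview": content[:preview_chars],
    }
```

A call such as summarize(parse_document("example.pdf")), where example.pdf is a placeholder file name, would then give the metadata and the text through the same interface regardless of the file type.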
Further information:
https://tika.apache.org
https://tika.apache.org/0.9/formats.html
Questions?