
FRIDAY

6TH MAY 16

Zeropoint is a Belgian enterprise offering businesses a competitive advantage through offshoring. Our model is aimed at offering
you precisely the talented people you need while allowing you to
stay in charge of all your processes.

EXTRACTING META-DATA AND TEXT


NAJAM ALVI

Introduction Solution Python Example Reference Questions

Meta-data & Text Extraction

Introduction

Solution

Python Example

Reference

Questions

Extracting Meta-data and Text by Najam Alvi


Problem
Document handling is still not an easy task. While working on a
personal project, I had to extract document information (meta-data)
along with the text using Python.
Custom ContentHandlers and custom document parsers were not
suitable, as different document types have different formats.
The complexity of the problem rose with the introduction of
different file types, such as:
Scanned PDFs
Scanned images with text (PNG, JPG, etc.)
Actual documents such as doc, ppt, odt, PDF, etc.


Solution

Apache Tika
The Apache Tika toolkit detects and extracts meta-data and text from
over a thousand different file types (such as PPT, XLS, PDF and
many more).
Files can be parsed through a single interface
Useful for search engine indexing
Content analysis
Translation
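To make the "detects" step concrete, here is a minimal sketch of content-based type detection using leading magic bytes. It is not Tika's detector, only an illustration of the idea; the function name `sniff_type`, the signature table, and the label used for legacy Office files are assumptions made for this example.

```python
# Illustrative only: a tiny magic-byte sniffer, not Tika's actual detector.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    # ZIP container, used by odt, docx, pptx and others.
    (b"PK\x03\x04", "application/zip"),
    # OLE2 container, used by legacy doc, ppt, xls (unofficial label).
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),
]


def sniff_type(head: bytes) -> str:
    """Guess a media type from the first bytes of a file."""
    for signature, media_type in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return media_type
    # Unknown signature: fall back to the generic binary type.
    return "application/octet-stream"
```

A sniffer like this only identifies the container; each format would still need its own parser, which is the work a single parsing interface hides.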


App & JAXRS


Tika App mode: works at the network pipe level
Starting the server: java -jar tika-app.jar --server --port XXXX
Usage: nc 127.0.0.1 XXXX < [FILENAME]
Tika JAXRS: provides a full RESTful interface
Starting the server: java -jar tika-server.jar --host=HOSTNAME --port=XXXX
Usage: curl -X PUT --data-binary @FILENAME http://localhost:XXXX/tika --header "Content-type: application/pdf"
Command line: java -jar tika-app.jar [options...] [file]
Usage: java -jar tika-app.jar --xml test.pdf
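The curl call against the RESTful interface can also be sketched in Python with only the standard library. This is a rough equivalent, not part of Tika itself: the helper names `build_tika_request` and `put_file` are hypothetical, and the default port 9998 is an assumption standing in for XXXX.

```python
import urllib.request


def build_tika_request(data: bytes, content_type: str,
                       host: str = "localhost",
                       port: int = 9998) -> urllib.request.Request:
    """Build (but do not send) a PUT request for the /tika endpoint."""
    url = f"http://{host}:{port}/tika"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def put_file(path: str, content_type: str, **server) -> str:
    """Send a file to a running Tika server and return the response body."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), content_type, **server)
    # Requires a server started as shown above; raises URLError otherwise.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `put_file("test.pdf", "application/pdf", port=9292)` would mirror the curl usage line. Building the request separately from sending it keeps the first helper easy to inspect without a server.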


Python Usage

import tika
from tika import parser

parsed = parser.from_file('/path/to/file', 'http://localhost:9292/tika')
print(parsed["metadata"])
print(parsed["content"])
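Once a result has been parsed, a small helper can make it safer to print. The sketch below assumes the result is a dict with "metadata" and "content" keys as in the example above, that the content may be missing (for instance, a scanned file with no extractable text), and that a metadata value may arrive as either a string or a list. The name `summarize` and the output fields are hypothetical.

```python
def summarize(parsed: dict, preview_chars: int = 80) -> dict:
    """Condense a parse result into a few fields that are safe to print."""
    metadata = parsed.get("metadata") or {}
    # Content may be None when no text could be extracted.
    content = (parsed.get("content") or "").strip()
    content_type = metadata.get("Content-Type", "unknown")
    # Assumed: repeated metadata keys may be delivered as a list.
    if isinstance(content_type, list):
        content_type = content_type[0] if content_type else "unknown"
    return {
        "content_type": content_type,
        "has_text": bool(content),
        "preview": content[:preview_chars],
    }
```

For example, `summarize(parsed)["has_text"]` gives a quick way to flag files that may need OCR before their text can be used.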


Further information:

https://tika.apache.org
https://tika.apache.org/0.9/formats.html


Questions?
