Full text search with PDF

Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Full text search with PDF

Sun Feb 22, 2015 4:34 pm

Hi,

I have just set up LogicalDOC on my QNAP NAS.
When I started filling it with scans that I have made, I noticed that full-text search does not seem to work on documents that I have scanned to PDF (using an OCR scanner).
Does LogicalDOC index the text layer of such a PDF?
Can I check the indexes to see what is in them at this point?
mmeschieri
Posts: 242
Joined: Mon Apr 19, 2010 3:40 pm

Re: Full text search with PDF

Mon Feb 23, 2015 2:36 pm

The OCR feature is only available in the commercial editions and requires some configuration, as described in the installation guide. Once you create a document in LogicalDOC, it is not immediately indexed and available for full-text searches. You have to wait for the indexer task to process the new documents. Your PDF will then be processed and OCRed.

When a document is indexed you will see a small white cylinder icon. Click on it and all the extracted text will be downloaded as a .txt file for easy inspection.
agaspa
Posts: 714
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Mon Feb 23, 2015 2:46 pm

Hi Bertsjuhn,
I confirm that LogicalDOC indexes the text in PDFs.

However, this operation is typically performed as a scheduled task.
Indexing is performed by the "Document Indexing" task
(see the attached image)
http://help.logicaldoc.com/en/administr ... uled-tasks

Once the document has been indexed, an icon representing a small gray silo appears to its left.
By clicking on this icon you can view the text that the indexer has extracted.
Attachments
02-indexed-text.gif (Indexed text)
01-Scheduled-tasks.gif (Scheduled tasks)
Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Mon Feb 23, 2015 3:53 pm

Thanks for the reply :)

I just checked the extracted text.

This text was selected directly from the PDF:

Een schadegeval i s altijd vervelend. Mocht u echter schade hebben, dan kunt u rekenen
op onze persoonlijke dienstverlening en snelle en adequate schadebehandeling.
This is the same part of the document, but taken from the extracted text file:

E e n s c h a d e g e v a l i s a l t i j d v e r v e l e n d . Mocht u e c h t e r schade h e b b e n , d a n k u n t u r e k e n e n
op onze p e r s o o n l i j k e d i e n s t v e r l e n i n g en s n e l l e en adequate schadebehande l ing .
mmeschieri
Posts: 242
Joined: Mon Apr 19, 2010 3:40 pm

Re: Full text search with PDF

Mon Feb 23, 2015 4:15 pm

That PDF is probably the result of OCR. While your PDF viewer shows the words correctly, in the file each character was placed as an independent word.
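
As a rough illustration only (this is not how LogicalDOC works internally), here is a small Python sketch for spotting this kind of extraction problem in a downloaded .txt file. It measures how many whitespace-separated tokens are a single character long. The function names and the 0.5 threshold are assumptions chosen for the example; the two sample strings are the first lines quoted earlier in this thread.

```python
def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text in which most tokens are single characters.

    The threshold is an arbitrary illustrative cut-off, not a LogicalDOC setting.
    """
    return single_char_ratio(text) > threshold


# Text as selected in the PDF viewer (first line of the sample above).
selected = ("Een schadegeval i s altijd vervelend. "
            "Mocht u echter schade hebben, dan kunt u rekenen")

# The same line as it appears in the extracted text file.
extracted = ("E e n s c h a d e g e v a l i s a l t i j d v e r v e l e n d . "
             "Mocht u e c h t e r schade h e b b e n , d a n k u n t u r e k e n e n")

print(looks_letter_spaced(selected))   # mostly whole words
print(looks_letter_spaced(extracted))  # mostly single letters
```

Note that a check like this can only detect the problem. Once the letters of several words have been run together with single spaces, the original word boundaries cannot be recovered reliably from the text alone.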
Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Mon Feb 23, 2015 8:00 pm

The strange part is that if I extract the text from the same PDF myself, I do get the correct text,
without the spaces between every letter.
agaspa
Posts: 714
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Tue Feb 24, 2015 9:14 am

If possible, please send your PDF to our support service; we would like to examine it.
Send it to support at logicaldoc.com or attach it to this thread (as a .zip file).
Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Tue Feb 24, 2015 10:08 pm

Since the file is too large, I have sent an email :)
agaspa
Posts: 714
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Wed Feb 25, 2015 11:53 am

Hello Bertsjuhn,

We tried your file, and the extracted text does indeed contain all the letters as separate characters.
Currently LogicalDOC is not able to index it properly.
Commercial versions of LogicalDOC have an integrated OCR engine (Tesseract), which can perform character recognition on images and raster PDFs.
You should try adding a document without an OCR text layer to LogicalDOC and let LogicalDOC's internal OCR process it.

More information about Tesseract in LogicalDOC is available here:
http://help.logicaldoc.com/en/installat ... ware-linux
http://help.logicaldoc.com/en/installat ... /tesseract

The same guide is also available for Ubuntu.
On Windows, Tesseract is installed by the LogicalDOC setup, so you don't need to worry about it.

See the images below to configure OCR in LogicalDOC.
Attachments
tesseract-OCR-windows-enabled.gif (Tesseract OCR Windows, OCR enabled)
tesseract-OCR-windows-disabled.gif (Tesseract OCR Windows, OCR disabled)
tesseract-OCR-linux-02.gif (Tesseract OCR Linux)
neuromanticx
Posts: 1
Joined: Tue Nov 03, 2015 11:25 am

Re: Full text search with PDF

Tue Nov 03, 2015 2:48 pm

Hi Bertsjuhn,

I read in your interesting post that you were able to install LogicalDOC on a QNAP NAS. I need some help and advice on how to get root access on the QNAP to install it, since SSH and telnet access to the QNAP is restricted to the admin user.

Hope to hear from you soon.

Best regards,
Melvyn
