Full text search with PDF

We tried to make LogicalDOC as intuitive as possible, but advice is always welcome.

Moderators: car031

Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Full text search with PDF

Post by Bertsjuhn » Sun Feb 22, 2015 4:34 pm

Hi,

I have just set up LogicalDOC on my QNAP NAS.
When I started filling LogicalDOC with scans that I have made, I noticed that full-text search does not seem to work on documents that I scanned to PDF (using an OCR scanner).
Does LogicalDOC index the text layer of such a PDF?
Can I check the indexes to see what is in them at this point?

mmeschieri
Posts: 235
Joined: Mon Apr 19, 2010 3:40 pm

Re: Full text search with PDF

Post by mmeschieri » Mon Feb 23, 2015 2:36 pm

The OCR feature is only available in the commercial editions and requires some configuration, as described in the installation guide. Once you create a document in LogicalDOC, it is not immediately indexed and available for full-text searches. You have to wait for the indexer task to process the new documents. Your PDF will then be processed and OCRed.

When a document has been indexed you will see a small white cylinder icon. Click on it and all the extracted text will be downloaded as a .txt file for easy inspection.

agaspa
Posts: 501
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Post by agaspa » Mon Feb 23, 2015 2:46 pm

Hi Bertsjuhn,
I confirm that LogicalDOC indexes the text in PDF files.

However, this operation typically runs as a scheduled task.
Indexing is performed by the "Document Indexing" task
(see the attached image):
http://help.logicaldoc.com/en/administr ... uled-tasks

Once the document has been indexed, an icon representing a small gray silo appears to its left.
By clicking on this icon you can view the text that the indexer extracted.
Attachments
02-indexed-text.gif: Indexed text (24.31 KiB)
01-Scheduled-tasks.gif: Scheduled tasks (29.91 KiB)

Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Post by Bertsjuhn » Mon Feb 23, 2015 3:53 pm

Thanks for the reply :)

I just checked the extracted text.

This text is directly selected out of the PDF:

Een schadegeval i s altijd vervelend. Mocht u echter schade hebben, dan kunt u rekenen
op onze persoonlijke dienstverlening en snelle en adequate schadebehandeling.

This is the same passage, taken from the extracted text file:

E e n s c h a d e g e v a l i s a l t i j d v e r v e l e n d . Mocht u e c h t e r schade h e b b e n , d a n k u n t u r e k e n e n
op onze p e r s o o n l i j k e d i e n s t v e r l e n i n g en s n e l l e en adequate schadebehande l ing .

mmeschieri
Posts: 235
Joined: Mon Apr 19, 2010 3:40 pm

Re: Full text search with PDF

Post by mmeschieri » Mon Feb 23, 2015 4:15 pm

That PDF is probably the output of an OCR process. While your PDF viewer displays the words correctly, inside the file each character is stored as a separate word.
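
To check a downloaded .txt file for this symptom, a rough heuristic is to measure how many whitespace-separated tokens are single letters. The sketch below is only an illustration and is not part of LogicalDOC; the function names and the 0.5 threshold are assumptions chosen for this example.

```python
def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are a single letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for tok in tokens if len(tok) == 1 and tok.isalpha())
    return singles / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): flag text where most tokens are isolated letters."""
    return single_char_ratio(text) >= threshold


# Samples taken from the two excerpts quoted earlier in this thread.
viewer_text = "Een schadegeval i s altijd vervelend."
indexed_text = "E e n s c h a d e g e v a l i s a l t i j d v e r v e l e n d ."

print(looks_letter_spaced(viewer_text))
print(looks_letter_spaced(indexed_text))
```

Text that is flagged this way will not match normal word queries, because the index holds single letters rather than whole words.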

Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Post by Bertsjuhn » Mon Feb 23, 2015 8:00 pm

The strange part is that if I extract the text from the same PDF myself, I do get the correct text,
without the spaces between every letter.

agaspa
Posts: 501
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Post by agaspa » Tue Feb 24, 2015 9:14 am

If possible, please send your PDF to our support service; we would like to examine it.
Send it to support at logicaldoc.com or attach it to this thread (as a .zip file).

Bertsjuhn
Posts: 4
Joined: Sun Feb 22, 2015 2:20 pm

Re: Full text search with PDF

Post by Bertsjuhn » Tue Feb 24, 2015 10:08 pm

Since the file is too large, I have sent an email :)

agaspa
Posts: 501
Joined: Tue Apr 20, 2010 8:24 am

Re: Full text search with PDF

Post by agaspa » Wed Feb 25, 2015 11:53 am

Hello Bertsjuhn,

We tried your file, and the extracted text does indeed have every letter separated.
Currently, LogicalDOC is not able to index it properly.
The commercial versions of LogicalDOC include an integrated OCR engine (Tesseract), which can perform character recognition on images and raster PDFs.
You could try adding a copy of the document without the OCR text layer to LogicalDOC and let LogicalDOC's internal OCR process it.
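
For comparison outside LogicalDOC, Tesseract can also be run by hand on a single page image. The sketch below only shows the standard command-line form (`tesseract <image> <output base> -l <language>`); it is not how LogicalDOC invokes Tesseract internally. The file names are hypothetical, and `nld` assumes the Dutch language data is installed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, output_base: str, lang: str = "nld") -> List[str]:
    """Build the argument list for a plain Tesseract run on one page image."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "nld") -> Optional[str]:
    """Run Tesseract if it is on the PATH; return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not installed on this machine
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"  # Tesseract appends .txt to the output base


# Hypothetical file names, for illustration only.
print(build_tesseract_command("page-1.png", "page-1"))
```

Comparing the resulting text file with the text LogicalDOC extracted shows whether a fresh OCR pass produces whole words for the same page.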

More information about Tesseract in LogicalDOC is available here:
http://help.logicaldoc.com/en/installat ... ware-linux
http://help.logicaldoc.com/en/installat ... /tesseract

The same guide is also available for Ubuntu.
On Windows, Tesseract is installed by the LogicalDOC setup, so you don't need to worry about it.

See the images below to configure the OCR in LogicalDOC
Attachments
tesseract-OCR-windows-enabled.gif: Tesseract OCR Windows (OCR enabled) (27.51 KiB)
tesseract-OCR-windows-disabled.gif: Tesseract OCR Windows (OCR disabled) (27.19 KiB)
tesseract-OCR-linux-02.gif: Tesseract OCR Linux (25.92 KiB)

neuromanticx
Posts: 1
Joined: Tue Nov 03, 2015 11:25 am

Re: Full text search with PDF

Post by neuromanticx » Tue Nov 03, 2015 2:48 pm

Hi Bertsjuhn,

I read in your interesting post that you were able to install LogicalDOC on a QNAP NAS. I need some help and advice on how you got root access on the QNAP to install it, since SSH and telnet access to the QNAP is restricted to the admin user.

Hope to hear from you soon.

Best regards,
Melvyn
