Errors in admin.log after recreating indexes

Document ID : KB000125287
Last Modified Date : 25/01/2019
Question:
I followed the steps in KB000019522 to perform a Filestore Reindex (https://comm.support.ca.com/kb/srch02001-error-while-retrieving-search-contents-contact-your-system-administrator-while-searching-or-retrieving-or-adding-documents-to-the-knowledge-store/kb000019522 ) and I can now access documents without any issue. However, admin.log shows the errors below. Why?

1) (admin) Attempt to output character of integral value 0 that is not represented in specified output encoding of UTF-8. 

2) (admin) [Fatal Error] :2:3: The markup in the document preceding the root element must be well-formed. 

3) (admin) java.lang.Throwable: Warning: You did not close the PDF Document 
1/23/19 11:01 AM (admin) at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:420) 
1/23/19 11:01 AM (admin) at java.lang.System$2.invokeFinalize(System.java:1270) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer.access$100(Finalizer.java:34) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:213) 

4) (admin) java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTPictureBaseImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) 
1/23/19 11:10 AM (admin) at java.lang.Class.getConstructor0(Class.java:3082) 

 
Answer:
These are 'acceptable' failures. The explanation follows: 

Our search capabilities are text-based (i.e. there is a search box, and the words entered there are used as search terms and matched against the text in the documents). 

The indexer therefore determines the file type (text, MS Word, MS Excel, PDF, etc.) and then uses code that can read the file and extract all the text that can be indexed and used as search terms. 
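As a rough illustration of that "determine the type, then pick an extractor" step, here is a minimal sketch in Python. It is not the product's code (the stack traces above show the real indexer is Java-based); the names EXTRACTORS, pick_extractor and extract_text are hypothetical, and detecting the type from the file extension is an assumption made only to keep the example short.

```python
from pathlib import Path


def extract_plain_text(data: bytes) -> str:
    # Plain text needs no parsing; undecodable bytes are replaced, not fatal.
    return data.decode("utf-8", errors="replace")


# Hypothetical registry mapping a file extension to the code that can read it.
# A real indexer would also register Word, Excel, PDF, etc. parsers here.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".log": extract_plain_text,
}


def pick_extractor(filename: str):
    # Returns None when no parser is registered for this file type.
    return EXTRACTORS.get(Path(filename).suffix.lower())


def extract_text(filename: str, data: bytes) -> str:
    extractor = pick_extractor(filename)
    if extractor is None:
        raise ValueError(f"no text extractor registered for {filename!r}")
    return extractor(data)
```

In this sketch a file whose type has no registered parser raises an error instead of producing text, which is one way a document can end up partly or wholly unindexed.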

Sometimes these files also contain non-text data. It may be embedded binary content (digital signatures, passwords, encrypted data such as protected worksheets, etc.), or the file may simply be in a 'format' or 'version' of the document type that our file-type parser cannot handle. In other cases, a few bytes or characters in a file can genuinely be interpreted in one of two ways: what looks like a UTF byte sequence at first may in fact be the same bytes used for something else. 
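The ambiguity can be seen with a few bytes in Python. This is only an illustration with made-up sample bytes, not data from any real document: the same byte string is valid under two encodings and yields different text, and binary content can contain a byte of value 0, which decodes without complaint but is not text anyone would search for.

```python
# The same five bytes, read two ways.
data = b"caf\xc3\xa9"

as_utf8 = data.decode("utf-8")      # two bytes C3 A9 form one character
as_latin1 = data.decode("latin-1")  # each byte is its own character

# Binary content often contains zero bytes. They decode to the character
# with integral value 0, which is not meaningful text.
binary_chunk = b"\x00\x01sig"
decoded_chunk = binary_chunk.decode("utf-8")
has_nul = "\x00" in decoded_chunk
```

Neither decoding above raises an error, so a parser cannot always tell from the bytes alone which reading was intended.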

In all of these cases an error is reported, and part or all of a document might not be indexed. They are nevertheless considered 'acceptable' failures, because the content of the document that could not be indexed was non-text. 
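The behaviour described above, where a failure is logged but indexing carries on and the remaining documents stay searchable, can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the product's implementation: index_documents, strict_utf8 and the in-memory dictionary standing in for the search index are all hypothetical.

```python
import logging

logger = logging.getLogger("indexer")


def strict_utf8(data: bytes) -> str:
    # Stand-in extractor: fails on bytes that are not valid UTF-8.
    return data.decode("utf-8")


def index_documents(documents, extract=strict_utf8):
    """Index every document that can be read; record the ones that cannot.

    documents maps a document name to its raw bytes.
    Returns (index, failures): extracted text per name, and failed names.
    """
    index = {}
    failures = []
    for name, data in documents.items():
        try:
            index[name] = extract(data)
        except Exception as exc:
            # Report the problem, then continue with the next document.
            logger.warning("could not index %s: %s", name, exc)
            failures.append(name)
    return index, failures
```

For example, calling index_documents({"a.txt": b"hello", "b.bin": b"\xff\xfe\x00"}) indexes a.txt, logs a warning for b.bin, and returns b.bin in the list of failures.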

The same applies to the NoSuchMethodException. In fact, it is probably the clearest example of what happens when a file uses a newer format version than our libraries can index (e.g. a Word document that contains a feature or add-in from a newer Microsoft Office release, which is too new for our version of PPM to parse and index).