SecurityXploded.com
Investigating Corrupt/Malicious PDF Document | www.SecurityXploded.com
 
 
Investigating Corrupt/Malicious PDF Document
Author: Ayush Anand 
 
 
 
 
 
Introduction
Today, I will show you how to analyze and troubleshoot a corrupted or malicious PDF document. In this exercise I will use a sample PDF file for illustration, which you can download from here [Reference 2]. Before proceeding, I highly recommend reading the article 'PDF Overview - Peering into the Internals of PDF' [Reference 1] for a better understanding of the internal structure and components of a PDF.
 
This article will give you a better understanding of the inner workings and flow of a PDF file, which helps with PDF malware analysis or any research work involving PDF.
 
 
 
Requirements
Before we get our hands dirty, we need the following tools:
  1. Acrobat Reader
  2. Notepad++ or any other text editor
Starting Corrupted PDF
 
Now download the sample document 'multipages.pdf' [Reference 2] and open it in the PDF reader.

On launching it, you will see the following error:
 
pdf error
 
 
 
Tracing and Fixing the Error in PDF
 
Let's investigate what went wrong with this PDF document.

To look inside, open the corrupt PDF file in Notepad++. You will see the contents shown below:
 

1 0 obj
<<
        /Pages 2 0 R
        /Type /Catalog
>>
endobj
2 0 obj
<<
        /Count 2
        /Kids [ 3 0 R 5 0 R 7 0 R 9 0 R 11 0 R ]
        /Type /Pages
>>
endobj
3 0 obj
<<
        /MediaBox [ 0 0 795 842 ]
        /Parent 2 0 R
        /Contents 4 0 R
        /Resources <<
               /Font <<
                       /F1 <<
                               /Name /F1
                               /BaseFont /Helvetica
                               /Subtype /Type1
                               /Type /Font
                       >>
               >>
        >>
        /Type /Page
>>
endobj
4 0 obj
<<
        /Length 55
>>stream
BT
/F1 18 Tf
186 690 Td
20 TL
(www.secsavvy.com) Tj
ET

endstream
endobj
5 0 obj
<<
        /MediaBox [ 0 0 795 842 ]
        /Parent 2 0 R
        /Contents 6 0 R
        /Resources <<
               /Font <<
                       /F1 <<
                               /Name /F1
                               /BaseFont /Helvetica
                               /Subtype /Type1
                               /Type /Font
                       >>
               >>
        >>
        /Type /Page
>>
endobj
6 0 obj
<<
        /Length 45
>>stream
BT
/F1 15 Tf
186 690 Td
20 TL
(Page 1) Tj
ET

endstream
endobj
7 0 obj
<<
        /MediaBox [ 0 0 795 842 ]
        /Parent 2 0 R
        /Contents 8 0 R
        /Resources <<
               /Font <<
                       /F1 <<
                               /Name /F1
                               /BaseFont /Helvetica
                               /Subtype /Type1
                               /Type /Font
                       >>
               >>
        >>
        /Type /Page
>>
endobj
8 0 obj
<<
        /Length 45
>>stream
BT
/F1 15 Tf
186 690 Td
20 TL
(Page 2) Tj
ET

endstream
endobj
9 0 obj
<<
        /MediaBox [ 0 0 795 842 ]
        /Parent 2 0 R
        /Contents 10 0 R
        /Resources <<
               /Font <<
                       /F1 <<
                               /Name /F1
                               /BaseFont /Helvetica
                               /Subtype /Type1
                               /Type /Font
                       >>
               >>
        >>
        /Type /Page
>>
endobj
10 0 obj
<<
        /Length 45
>>stream
BT
/F1 15 Tf
186 690 Td
20 TL
(Page 3) Tj
ET

endstream
endobj
11 0 obj
<<
        /MediaBox [ 0 0 795 842 ]
        /Parent 2 0 R
        /Content 12 0 R
        /Resources <<
               /Font <<
                       /F1 <<
                               /Name /F1
                               /BaseFont /Helvetica
                               /Subtype /Type1
                               /Type /Font
                       >>
               >>
        >>
        /Type /Page
>>
endobj
12 0 obj
<<
        /Length 47
>>stream
BT
/F1 15 Tf
186 690 Td
20 TL
(Password) Tj
ET

endstream
endobj
xref
0 13
0000000000 65535 f
0000000010 00000 n
0000000067 00000 n
0000000161 00000 n
0000000398 00000 n
0000000510 00000 n
0000000747 00000 n
0000000849 00000 n
0000001086 00000 n
0000001188 00000 n
0000001426 00000 n
0000001529 00000 n
0000001768 00000 n
trailer
<<
        /Root 1 0 R
        /Size 13
>>
startxref
1873
%%EOF
 
 
A PDF file consists of four elements:
  1. A header identifying the version of the PDF specification.
  2. A body containing the objects that make up the document.
  3. A cross-reference table containing information about the indirect objects in the file.
  4. A trailer giving the location of the cross-reference table and of certain special objects within the body of the file.
If you look closely at the file above, there is no header, so we will add a PDF header and try to open the PDF again.
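The same check can be roughly automated. The following is a minimal Python sketch, not a real PDF parser: it only searches the raw bytes for the markers of the four elements listed above. The function names check_pdf_structure and add_header are hypothetical, chosen for this illustration, and the default version 1.7 is an assumption.

```python
import re


def check_pdf_structure(data: bytes) -> dict:
    """Report which of the four structural elements appear in the raw bytes.

    This is a plain text search, so it only works for simple, uncompressed
    files like the sample in this article.
    """
    return {
        # The header must be the very first thing in the file.
        "header": data.startswith(b"%PDF-"),
        # At least one indirect object such as "1 0 obj".
        "body": re.search(rb"\d+\s+\d+\s+obj\b", data) is not None,
        # A line consisting only of the keyword "xref".
        "xref": re.search(rb"^xref\s*$", data, re.MULTILINE) is not None,
        # The trailer dictionary and the pointer to the xref table.
        "trailer": b"trailer" in data and b"startxref" in data,
    }


def add_header(data: bytes, version: bytes = b"1.7") -> bytes:
    """Prepend a %PDF-x.y header line if the data does not start with one."""
    if data.startswith(b"%PDF-"):
        return data
    return b"%PDF-" + version + b"\n" + data
```

To try it, read the file with open('multipages.pdf', 'rb').read(), pass the bytes to check_pdf_structure, and write the result of add_header to a new file. Keep in mind that the offsets in the cross-reference table are byte positions counted from the start of the file, so adding bytes at the front can make them stale; whether a given reader tolerates or repairs that depends on the reader.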
 
%PDF-1.7
 
Let's add this missing header at the beginning of the file. Now you can open it without a problem, as shown below.
 
[Screenshot: the PDF opened in the reader]
 
Well, that's good, but not everything is right. From the picture above you can see that the document shows a total of 2 pages. Let's investigate further.

Here is a screenshot showing a brief analysis of the page-linking structure of this PDF file:
 
[Screenshot: page-linking structure of the PDF file]
 
We can now see that this PDF actually has a total of 5 pages, because the Kids array references five page objects. So edit the Count from 2 to 5 and open the PDF again, as shown below.
 
%PDF-1.7
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
2 0 obj
<<
/Count 5
/Kids [ 3 0 R 5 0 R 7 0 R 9 0 R 11 0 R ]
/Type /Pages
>>
endobj
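The mismatch between Count and Kids can also be spotted with a few lines of Python. This is a minimal sketch under two assumptions: the objects are stored uncompressed as plain text, and the page tree is flat, so every entry in Kids is a page (in a nested page tree, Count is the number of leaf pages and need not equal the length of Kids). The function name count_vs_kids is hypothetical.

```python
import re


def count_vs_kids(data: bytes):
    """Return (declared /Count, number of /Kids references) for the first
    /Type /Pages object found, or None if there is no such object."""
    for match in re.finditer(rb"\d+\s+\d+\s+obj(.*?)endobj", data, re.DOTALL):
        body = match.group(1)
        # Skip everything that is not a page-tree node.
        if not re.search(rb"/Type\s*/Pages\b", body):
            continue
        count = re.search(rb"/Count\s+(\d+)", body)
        kids = re.search(rb"/Kids\s*\[(.*?)\]", body, re.DOTALL)
        if count is None or kids is None:
            continue
        # Each kid is an indirect reference of the form "3 0 R".
        refs = re.findall(rb"\d+\s+\d+\s+R\b", kids.group(1))
        return int(count.group(1)), len(refs)
    return None
```

For the corrupt sample this should report a declared count of 2 against five references, which is exactly the inconsistency fixed by hand above.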
 
Now we are able to see all 5 pages, but the last page is blank, so we will investigate further.

The last page is the one pointed to by the indirect object reference 11 0 R, as is clear from the snippet below:
 
11 0 obj
<<
/MediaBox [ 0 0 795 842 ]
/Parent 2 0 R
/Content 12 0 R
/Resources <<
/Font <<
/F1 <<
/Name /F1
/BaseFont /Helvetica
/Subtype /Type1
/Type /Font
>>
>>
>>
/Type /Page
>>
endobj
 
In PDF, the 'Contents' entry of a page object references the content stream that describes what is drawn on the page. If this entry is absent, the page is empty.

But here, in object 11, the entry that references object 12 is written as 'Content' (note the missing 's' at the end). The PDF reader does not recognize the name Content, so it ignores the entry without giving any error.

To fix this, simply replace Content with Contents and open the PDF. Now you will be able to see all five pages.
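Because the reader ignores the unknown key silently, it can help to search for page objects that have no Contents entry at all. The sketch below does this with regular expressions on the raw bytes; it assumes uncompressed, plain-text objects like those in the sample, and the function name pages_without_contents is hypothetical.

```python
import re


def pages_without_contents(data: bytes) -> list:
    """Return the object numbers of /Type /Page objects lacking /Contents."""
    missing = []
    for match in re.finditer(rb"(\d+)\s+\d+\s+obj(.*?)endobj", data, re.DOTALL):
        number, body = int(match.group(1)), match.group(2)
        # Match /Page but not /Pages (the page-tree node).
        is_page = re.search(rb"/Type\s*/Page(?![A-Za-z])", body)
        # Match the exact name /Contents; a misspelled /Content does not count.
        has_contents = re.search(rb"/Contents(?![A-Za-z])", body)
        if is_page and not has_contents:
            missing.append(number)
    return missing
```

A page without Contents is legal and simply renders blank, so a hit is a hint to inspect that object by hand, not proof of corruption.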

You can download this fixed PDF 'MultiplePages_Fixed' [Reference 2] and test it for yourself.
 
 
 
Video Demonstration
 
Here is the video demonstration of this entire analysis and fixing process.
 
 
 
 
Reference
 
  1. PDF Overview - Peering into the Internals of PDF
  2. Download Sample PDF File
  3. PDF resources on Didier Stevens Blog
 
 
Conclusion
 
I hope you enjoyed this article and learned more about the working flow of a PDF document.

If you are interested in reading more about PDF, I recommend visiting the excellent blog of Didier Stevens [Reference 3].
 
 
 