Extract PDF images with JavaScript
To extract images from a PDF using pdf.js in JavaScript, follow these steps:
- Load the PDF: Initialize pdf.js and load the PDF document.
- Access Each Page: For each page, use page.getOperatorList() to access its operators.
- Extract Images: Check the operators for paintImageXObject commands. Each one references an embedded image by name; the decoded image data itself is retrieved from the page's object store (page.objs).
- Render Image: Render the images onto a canvas for display or extraction.
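The operator scan described in the second and third steps can be sketched as a small helper that runs without a browser. The name findImageNames and its paintImageOp parameter are illustrative assumptions, not part of pdf.js; the only thing assumed about the library is that the operator list exposes parallel fnArray and argsArray arrays and that the image name is the first argument, as in the example that follows.

```javascript
// Hypothetical helper (not a pdf.js API): collect the names of images painted on a page.
// `operatorList` is assumed to have the shape returned by page.getOperatorList():
// parallel arrays fnArray (operator codes) and argsArray (their arguments).
// `paintImageOp` is the operator code to match, e.g. pdfjsLib.OPS.paintImageXObject.
function findImageNames(operatorList, paintImageOp) {
  // A Set drops repeats when the same image is painted more than once.
  const names = new Set();
  operatorList.fnArray.forEach((fn, i) => {
    if (fn === paintImageOp) {
      // The first argument of the paint operator is the image's object name.
      names.add(operatorList.argsArray[i][0]);
    }
  });
  return [...names];
}
```

In a page script this would be called as findImageNames(operatorList, pdfjsLib.OPS.paintImageXObject), with each returned name then looked up in the page's object store.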
Here’s an example setup using pdf.js:
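```html
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
<div id="imageContainer"></div>
<script>
  // The worker script must match the library version loaded above.
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.worker.min.js';
  const url = 'path/to/your.pdf';

  pdfjsLib.getDocument(url).promise.then(async (pdf) => {
    for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
      const page = await pdf.getPage(pageNum);
      const operatorList = await page.getOperatorList();

      operatorList.fnArray.forEach((fn, i) => {
        if (fn === pdfjsLib.OPS.paintImageXObject) {
          const imgName = operatorList.argsArray[i][0];
          // Images shared across pages (names starting with "g_") live in commonObjs.
          const objs = imgName.startsWith('g_') ? page.commonObjs : page.objs;
          // The callback form waits until the image has been decoded.
          objs.get(imgName, (img) => renderImageToCanvas(img));
        }
      });
    }
  });

  function renderImageToCanvas(img) {
    // pdf.js supplies { width, height, data }, not an ImageData object.
    const canvas = document.createElement('canvas');
    canvas.width = img.width;
    canvas.height = img.height;
    const ctx = canvas.getContext('2d');
    const imageData = ctx.createImageData(img.width, img.height);
    const pixelCount = img.width * img.height;

    if (img.data.length === pixelCount * 4) {
      imageData.data.set(img.data); // already RGBA
    } else if (img.data.length === pixelCount * 3) {
      // Expand RGB to RGBA with full opacity.
      for (let s = 0, d = 0; s < img.data.length; s += 3, d += 4) {
        imageData.data.set(img.data.subarray(s, s + 3), d);
        imageData.data[d + 3] = 255;
      }
    } else {
      return; // other pixel layouts are not handled in this example
    }
    ctx.putImageData(imageData, 0, 0); // Render the image data onto the canvas
    document.getElementById('imageContainer').appendChild(canvas);
  }
</script>
```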
This script locates each image embedded in the PDF and renders it to a canvas.
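In the pdf.js 2.x builds, the object stored for an image is typically a plain record of width, height, and raw pixel bytes rather than an ImageData instance, so the bytes usually need converting to RGBA before putImageData will accept them. A minimal sketch of that conversion follows. The helper name toRGBA is hypothetical, and the three byte layouts it recognizes (RGBA, RGB, and 1-bit grayscale with rows padded to whole bytes, most significant bit first, 1 meaning white) are assumptions about what the decoder returns; other layouts are rejected rather than guessed at.

```javascript
// Hypothetical helper (not a pdf.js API): expand decoded image bytes into RGBA.
// Assumes `img` is { width, height, data } with data tightly packed as one of:
//   RGBA (4 bytes/pixel), RGB (3 bytes/pixel), or
//   1-bit grayscale (rows padded to whole bytes, MSB first, 1 = white).
function toRGBA(img) {
  const { width, height, data } = img;
  const pixelCount = width * height;
  const rowBytes = Math.ceil(width / 8); // bytes per row in the 1-bit layout
  const out = new Uint8ClampedArray(pixelCount * 4);

  if (data.length === pixelCount * 4) {
    out.set(data); // already RGBA, copy as-is
  } else if (data.length === pixelCount * 3) {
    // RGB: copy the three channels and add an opaque alpha byte.
    for (let s = 0, d = 0; s < data.length; s += 3, d += 4) {
      out[d] = data[s];
      out[d + 1] = data[s + 1];
      out[d + 2] = data[s + 2];
      out[d + 3] = 255;
    }
  } else if (data.length === rowBytes * height) {
    // 1-bit grayscale: unpack each bit into a black or white opaque pixel.
    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        const bit = (data[y * rowBytes + (x >> 3)] >> (7 - (x & 7))) & 1;
        const d = (y * width + x) * 4;
        out[d] = out[d + 1] = out[d + 2] = bit ? 255 : 0;
        out[d + 3] = 255;
      }
    }
  } else {
    throw new Error('Unsupported pixel layout');
  }
  return out;
}
```

In a browser, the result can be drawn with ctx.putImageData(new ImageData(toRGBA(img), img.width, img.height), 0, 0), and canvas.toDataURL('image/png') then yields the image as a PNG for saving.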