OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF/A Files with Sidecar Text Extraction and Batch Processing

June 28, 2026
in AI & Technology
Reading Time: 3 mins read

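import logging
import os
import re
import subprocess
import sys
import textwrap
import time
from pathlib import Path

def sh(cmd):
   """Run a shell command. Assumed stand-in: the tutorial's own `sh` helper is not shown in this excerpt."""
   subprocess.run(cmd, shell=True, check=True)

def _purge(*prefixes):
   """Drop cached modules matching the given prefixes so the next import reloads them."""
   for name in [m for m in list(sys.modules)
                if any(m == p or m.startswith(p + ".") for p in prefixes)]:
       del sys.modules[name]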
def _load_ocrmypdf():
   _purge("PIL", "ocrmypdf")
   import ocrmypdf
   return ocrmypdf
try:
   ocrmypdf = _load_ocrmypdf()
except ImportError as e:
   if "_Ink" in str(e) or "PIL" in str(e):
       print("Repairing an incompatible Pillow (reinstalling pillow<12)...")
       sh(f'"{sys.executable}" -m pip install -q --force-reinstall "pillow<12"')
       try:
           ocrmypdf = _load_ocrmypdf()
           print("Pillow repaired; continuing without a restart.")
       except Exception as exc:
           # An already-loaded Pillow C extension cannot always be swapped in-process.
           raise RuntimeError(
               "Pillow is still incompatible in this session. Use the Colab menu: "
               "Runtime > Restart session, then run this cell again."
           ) from exc
   else:
       raise
from ocrmypdf.exceptions import (
   ExitCode,
   PriorOcrFoundError,
   EncryptedPdfError,
   MissingDependencyError,
   TaggedPDFError,
   DigitalSignatureError,
   DpiError,
   InputFileError,
   UnsupportedImageFormatError,
)
from ocrmypdf.helpers import check_pdf
from ocrmypdf.pdfa import file_claims_pdfa
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ImageFilter
logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logging.getLogger("ocrmypdf").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("PIL").setLevel(logging.WARNING)
SAMPLE_TEXT_PAGES = [
   "Optical Character Recognition, commonly abbreviated as OCR, is the "
   "process of converting images of typed or printed text into machine "
   "encoded text. This page was generated as a synthetic scan so that the "
   "OCRmyPDF pipeline has something realistic to recognize and search.",
   "On 14 March 2026 the archive contained 1,482 pages across 37 folders. "
   "Roughly 92 percent of those pages were scanned at 200 to 300 dots per "
   "inch. The remaining 8 percent were skewed and required deskewing before "
   "any reliable recognition was possible.",
   "After OCRmyPDF finishes, the output is a searchable PDF/A file. You can "
   "select text, copy it, and run full text search across thousands of "
   "documents. The original image resolution is preserved while a hidden "
   "text layer is placed accurately underneath the page image.",
]
def _find_font():
   for cand in (
       "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
       "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
   ):
       if os.path.exists(cand):
           return cand
   return None
_FONT_PATH = _find_font()
FONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default()
def _add_speckle(img, n=6000, dark=60):
   """Scatter n dark specks over the image to imitate scanner noise (motivates --clean)."""
   import random
   px = img.load()
   w, h = img.size
   for _ in range(n):
       px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark)
   return img
def render_page(text, skew=False):
   """Render one A4 page (1654x2339 px ≈ 200 DPI) of dark text on white."""
   W, H = 1654, 2339
   img = Image.new("L", (W, H), 255)
   draw = ImageDraw.Draw(img)
   draw.multiline_text((150, 180), textwrap.fill(text, width=58),
                       fill=25, font=FONT, spacing=18)
   if skew:
       # Simulate a poor scan: rotate 6 degrees, blur slightly, then add speckle noise.
       img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255)
       img = img.filter(ImageFilter.GaussianBlur(0.6))
       img = _add_speckle(img)
   return img
def build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1):
   """Render pages to PNGs and wrap them losslessly into an image-only PDF."""
   pngs = []
   for i, text in enumerate(pages_text):
       img = render_page(text, skew=(i == skew_index))
       p = pdf_path.parent / f"_pg_{pdf_path.stem}_{i}.png"
       img.save(p, format="PNG", dpi=(200, 200))
       pngs.append(str(p))
   with open(pdf_path, "wb") as f:
       f.write(img2pdf.convert(pngs))
   for p in pngs:
       os.remove(p)
   return pdf_path
def do_ocr(input_file, output_file, **kw):
   """Wrapper around ocrmypdf.ocr() that disables the progress bar and times it."""
   kw.setdefault("progress_bar", False)
   t0 = time.perf_counter()
   rc = ocrmypdf.ocr(input_file, output_file, **kw)
   return rc, time.perf_counter() - t0
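def tokens(s: str):
   """Lowercase a string and split it into alphanumeric tokens."""
   return re.findall(r"[a-z0-9]+", s.lower())
def kb(path) -> str:
   """Return the file size formatted in kilobytes, e.g. '12.3 KB'."""
   return f"{Path(path).stat().st_size / 1024:,.1f} KB"
def banner(title: str):
   """Print a section title framed by horizontal rules."""
   line = "─" * 74
   print(f"\n{line}\n  {title}\n{line}")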
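
The excerpt above does not show how the tokens() helper is used. One plausible use is scoring how much of the known sample text survives OCR. The sketch below illustrates that idea under stated assumptions: token_recall() is a hypothetical helper introduced here, not part of the tutorial or of OCRmyPDF, and the ocr_text string is an invented stand-in for real sidecar output.

```python
import re
from collections import Counter


def tokens(s: str):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def token_recall(expected: str, recognized: str) -> float:
    """Fraction of expected tokens (counted with multiplicity) found in the OCR text.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    want = Counter(tokens(expected))
    got = Counter(tokens(recognized))
    total = sum(want.values())
    if total == 0:
        return 1.0  # nothing was expected, so nothing can be missing
    matched = sum(min(n, got[tok]) for tok, n in want.items())
    return matched / total


# Ground truth taken from the second sample page in the tutorial.
source = "On 14 March 2026 the archive contained 1,482 pages across 37 folders."
# Invented OCR output with one misread word ("acr0ss"), for demonstration only.
ocr_text = "On 14 March 2026 the archive contained 1,482 pages acr0ss 37 folders"

print(f"token recall: {token_recall(source, ocr_text):.2%}")
```

Counting tokens with multiplicity, rather than comparing sets, stops a repeated word in the source from being credited more often than it appears in the recognized text. In a real run, the second argument would be the contents of the sidecar text file instead of a hard-coded string.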

