Parsing Chinese Text and Tables in PDFs with Python: pdfplumber and tabula-py - KOEI的旅行

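There are four common ways to parse PDFs in Python: pdfplumber, tabula-py, pdfminer, and pypdf2.
I tested each on a PDF containing Chinese text and tables, with these results:
1. pdfplumber: reads tables into a pandas DataFrame and extracts Chinese text correctly; for complex tables, adjusting the parameters gives a chance of reading them correctly.
2. tabula-py: reads tables into a pandas DataFrame and extracts Chinese text correctly.
3. pdfminer: extracts Chinese text correctly but has no table-reading feature. pdfplumber is built on pdfminer and adds table extraction, so installing pdfplumber alone is enough.
4. pypdf2 (1.26): could not extract the Chinese text; the output was unreadable.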

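So this post covers how to use pdfplumber and tabula-py. Before reading a PDF, confirm two things:

1. The Chinese text in the PDF can be copied out correctly. I have run into PDFs that display Chinese on screen, but because they use an uncommon Chinese encoding, pasting the content into Notepad shows no Chinese. Neither pdfplumber nor tabula can parse such files; the text comes out as cid codes.

2. pdfplumber or tabula is a recent version. I once had the package installed and the Chinese still could not be read; reinstalling the latest version fixed it.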
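
A quick way to spot the first problem in extracted text is to look for the "(cid:NNN)" placeholders that pdfminer-based tools such as pdfplumber print when a glyph has no Unicode mapping. The sketch below is only a heuristic; the function names and the 0.2 threshold are my own assumptions, not part of any library.

```python
import re

# pdfminer-based extractors write "(cid:NNN)" for glyphs they cannot map to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Return the fraction of characters in text that belong to (cid:NNN) placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(text))
    return cid_chars / len(text)


def looks_unmapped(text, threshold=0.2):
    """Heuristic: True when a large share of the text is cid placeholders.

    The threshold is an arbitrary illustrative value.
    """
    return cid_ratio(text) >= threshold
```

If looks_unmapped returns True for the text of a page, the PDF most likely has the encoding problem described in point 1, and upgrading the libraries is unlikely to help.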

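I. Installing and using pdfplumber

(Environment: Windows 10 64-bit, Anaconda 3)

Installing pdfplumber: make sure you are online, then enter the install command at the Anaconda Prompt: pip install pdfplumber
pip downloads and installs pdfplumber automatically (currently pdfplumber-0.5.11.tar.gz). pdfplumber's image features require ImageMagick, but newer versions skip this check, so pdfplumber installs fine without ImageMagick.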
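
Since an outdated package was one cause of unreadable Chinese, it can help to print the versions that are actually installed. This is a small sketch using the standard library's importlib.metadata, which assumes Python 3.8 or later; the helper name installed_version is my own.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used with pip install.
for name in ("pdfplumber", "tabula-py"):
    print(name, installed_version(name) or "not installed")
```

If the version looks old, pip install --upgrade pdfplumber (or tabula-py) brings it up to date.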

Start by reading plain text:

import pandas as pd
import pdfplumber
pdffile = "D:/Python/Test.pdf"  # path and file name of the PDF
pdf = pdfplumber.open(pdffile)
p0 = pdf.pages[0]  # first page
text = p0.extract_text()  # extract the text
print(text)
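
The snippet above reads only the first page and leaves the file open. A possible variant, sketched here under the assumption that extract_text may return None or an empty string for a page with no text, reads every page inside a with-block. The helper names are hypothetical.

```python
def join_page_texts(texts, separator="\n\n"):
    """Join per-page text, skipping pages where nothing was extracted (None or empty)."""
    return separator.join(t for t in texts if t)


def read_all_text(pdf_path):
    """Extract the text of every page of a PDF with pdfplumber."""
    import pdfplumber  # imported here so join_page_texts can be used without it

    # The with-block closes the file even if extraction raises an error.
    with pdfplumber.open(pdf_path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)


# Example call (the path is the same placeholder used in this post):
# print(read_all_text("D:/Python/Test.pdf"))
```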

Reading tables:

table = p0.extract_table()  # read the table on the first page
df0 = pd.DataFrame(table[1:], columns=table[0])
for page in range(1, len(pdf.pages)):  # once page 1 works, read tables page by page from page 2
    table = pdf.pages[page].extract_table()
    df0 = pd.concat([df0, pd.DataFrame(table[1:], columns=table[0])], ignore_index=True)
pdf.close()
df0.replace(to_replace=r'\n', value=' ', regex=True, inplace=True)  # replace line breaks with spaces
df0
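
The loop above assumes every page contains a table whose first row is the header. extract_table returns None when it finds no table, so a page without one would raise an error. The following is a hedged sketch of a more forgiving merge; merge_page_tables is a hypothetical helper of mine, and it assumes that any row identical to the first header is a repeated header.

```python
def merge_page_tables(page_tables):
    """Combine per-page tables (lists of rows) into one header and one list of rows.

    Pages with no table (None or empty) are skipped. The first row of the first
    table is taken as the header, and any row equal to it is dropped as a
    repeated header.
    """
    header, rows = None, []
    for table in page_tables:
        if not table:
            continue
        if header is None:
            header = table[0]
        rows.extend(row for row in table if row != header)
    return header, rows


# Possible usage with pdfplumber and pandas:
# with pdfplumber.open(pdffile) as pdf:
#     header, rows = merge_page_tables(p.extract_table() for p in pdf.pages)
# df0 = pd.DataFrame(rows, columns=header)
```

One side effect of this choice is that a genuine data row that happens to match the header exactly would also be dropped.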

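Simple tables are read correctly, but complex ones can come out wrong. These can be tuned with table_settings; see the pdfplumber GitHub page for details. Note that I had problems running it on 32-bit Windows; it only ran smoothly on 64-bit Windows.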
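
As an illustration of table_settings, the sketch below tries a few strategy combinations in order and keeps the first one that yields a table with enough columns. vertical_strategy and horizontal_strategy are real pdfplumber options, but the candidate list, the helper name first_usable_table, and the min_columns check are assumptions of mine, not recommended values.

```python
# Candidate settings, tried in order. The values are illustrative guesses.
CANDIDATE_SETTINGS = [
    {},  # library defaults, which rely on ruling lines drawn in the PDF
    {"vertical_strategy": "text", "horizontal_strategy": "text"},  # tables without ruling lines
    {"vertical_strategy": "lines", "horizontal_strategy": "text"},  # column lines only
]


def first_usable_table(page, candidates=CANDIDATE_SETTINGS, min_columns=2):
    """Return (table, settings) for the first settings dict that yields a usable table.

    page is any object with an extract_table(table_settings=...) method, such as
    a pdfplumber page. Returns (None, None) when no candidate works.
    """
    for settings in candidates:
        table = page.extract_table(table_settings=settings)
        if table and len(table[0]) >= min_columns:
            return table, settings
    return None, None


# Possible usage:
# table, used = first_usable_table(pdf.pages[0])
```

A column count is a crude quality check; for a real document it is worth printing the result of each candidate and comparing it with the PDF by eye.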

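II. Installing and using tabula-py

Installing tabula-py: make sure you are online and that Java is already installed, then enter the install command at the Anaconda Prompt: pip install tabula-py
Note that the package to install is tabula-py, not tabula; the names differ only by the -py suffix, but they are different packages. pip downloads and installs tabula-py automatically (currently tabula_py-1.3.1-py2.py3-none-any.whl).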
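
Because tabula-py depends on Java, it can save time to confirm that a java executable is reachable before calling it. This is a minimal sketch using only the standard library (Python 3.7 or later for capture_output); both function names are my own.

```python
import shutil
import subprocess


def java_available():
    """Return True if a java executable can be found on PATH."""
    return shutil.which("java") is not None


def java_version_banner():
    """Return the output of `java -version`, or None when Java is not on PATH."""
    if not java_available():
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, timeout=30
    )
    return (result.stderr or result.stdout).strip()


print("Java found:", java_available())
```

If this prints False, install Java or add it to PATH before using tabula-py.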

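Reading tables:

import tabula
import pandas as pd
pdffile = "D:/Python/Test.pdf"  # path and file name of the PDF
# spreadsheet=True is the tabula-py 1.x spelling; tabula-py 2.x uses lattice=True instead
df = tabula.read_pdf(pdffile, pages=1, spreadsheet=True, pandas_options={'header': None})  # read page 1
df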

The result can also be written straight to a CSV file:

tabula.convert_into(pdffile, "output.csv", output_format="csv")

Once the first page reads correctly, read every page by changing the pages argument to 'all':

df = tabula.read_pdf(pdffile, pages='all', spreadsheet=True, pandas_options={'header': None})

df.replace(to_replace=r'\r', value=' ', regex=True, inplace=True)  # replace line breaks with spaces
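
One caveat for readers on newer releases: to my knowledge, tabula-py 2.x returns a list of DataFrames from read_pdf by default rather than a single DataFrame, so calling replace directly on the result would fail. The sketch below handles both shapes; clean_tables is a hypothetical helper, and it only assumes that each table object has a pandas-style replace method.

```python
def clean_tables(result):
    """Normalize a read_pdf result to a list and replace line breaks in cells with spaces.

    result may be a single DataFrame (tabula-py 1.x style) or a list of
    DataFrames (the assumed tabula-py 2.x default).
    """
    tables = result if isinstance(result, list) else [result]
    # replace returns a new object here, so the input tables are left unchanged.
    return [t.replace(to_replace=r"[\r\n]+", value=" ", regex=True) for t in tables]


# Possible usage with tabula-py and pandas:
# tables = clean_tables(tabula.read_pdf(pdffile, pages="all", pandas_options={"header": None}))
# df = pd.concat(tables, ignore_index=True)
```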

tabula-py is simpler to use but offers less control over table detection than pdfplumber, so it is best suited to simple tables.

KOEI
