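- For work, I need to collect the dates that appear in several thousand Word files so they can be archived.
- A Python script can walk through every Word file in a directory, find the dates in each one, and export them to a txt file in `filename-date` format.
- First, the requirements:
- Both doc and docx files must be read. The python-docx library handles docx files and win32com.client handles doc files.
- Dates in the documents are written as xx年xx月xx日 (year, month, day) and are matched with a regular expression.
- The output is a txt file.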
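The date-matching requirement can be tried out on its own before any Word files are involved. The following is a minimal sketch: `match_dates` and the sample sentence are hypothetical, made up for illustration. One point worth checking is the use of `\b`: in Python 3, Chinese characters count as word characters, so a pattern wrapped in `\b` only matches dates that are surrounded by spaces, punctuation, or line ends. The sketch uses a negative lookbehind instead, which is one possible choice rather than the only one.

```python
import re

# Date pattern from the requirement: xx年xx月xx日.
# (?<!\d) only prevents a match from starting in the middle of a longer number.
DATE_PATTERN = re.compile(r'(?<!\d)\d{1,4}年\d{1,2}月\d{1,2}日')

# The same pattern wrapped in \b, kept here only for comparison.
BOUNDED_PATTERN = re.compile(r'\b\d{1,4}年\d{1,2}月\d{1,2}日\b')


def match_dates(text):
    """Return every xx年xx月xx日 date found in text (hypothetical helper)."""
    return DATE_PATTERN.findall(text)


sample = '合同于2023年5月1日签订,2024年12月31日到期。'
print(match_dates(sample))              # ['2023年5月1日', '2024年12月31日']
print(BOUNDED_PATTERN.findall(sample))  # []
```

The second call returns nothing because each date touches a Chinese character on at least one side, so no word boundary exists there.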
Python code:
import os
import re

import win32com.client  # provided by the pywin32 package, Windows only
from docx import Document


def find_dates(text):
    # No \b anchors: Python 3 treats Chinese characters as word characters,
    # so \b would miss dates embedded in running text such as "于2023年5月1日".
    pattern = r'(?<!\d)\d{1,4}年\d{1,2}月\d{1,2}日'
    return re.findall(pattern, text)


def process_word_files(directory):
    results = []
    for filename in os.listdir(directory):
        if filename.startswith('~$'):
            continue  # skip the lock files Word creates for open documents
        path = os.path.abspath(os.path.join(directory, filename))
        name = filename.lower()
        if name.endswith('.docx'):
            doc = Document(path)
            text = ' '.join(paragraph.text for paragraph in doc.paragraphs)
            for date in find_dates(text):
                results.append(f'{filename}-{date}')
        elif name.endswith('.doc'):
            # Legacy .doc files are read through a background Word instance.
            word = win32com.client.DispatchEx('Word.Application')
            word.Visible = False
            # Word resolves relative paths itself, so pass an absolute path.
            doc = word.Documents.Open(path)
            text = doc.Content.Text
            for date in find_dates(text):
                results.append(f'{filename}-{date}')
            doc.Close(False)  # close without saving changes
            word.Quit()
    return results


def save_to_txt(results, output_filename):
    # Explicit UTF-8 so the Chinese dates do not depend on the system locale.
    with open(output_filename, 'w', encoding='utf-8') as f:
        for line in results:
            f.write(line + '\n')


directory = r'C:\Users\user\Desktop\1\2023'
output_filename = 'output.txt'
results = process_word_files(directory)
save_to_txt(results, output_filename)
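The directory walk is the part most likely to need adjusting for a real archive: `os.listdir` looks at one folder only and matches extensions case-sensitively. As a rough sketch of an alternative, the helper below uses `pathlib` to collect doc and docx files, optionally from subfolders too. The name `iter_word_files` and its `recursive` parameter are hypothetical and not part of the script above, and whether subfolders should be included is an assumption about the folder layout.

```python
from pathlib import Path

WORD_SUFFIXES = {'.doc', '.docx'}


def iter_word_files(directory, recursive=True):
    """Yield .doc/.docx paths under directory, in sorted order."""
    root = Path(directory)
    candidates = root.rglob('*') if recursive else root.iterdir()
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if path.name.startswith('~$'):
            # Lock files that Word creates next to an open document.
            continue
        if path.suffix.lower() in WORD_SUFFIXES:
            yield path
```

Each yielded `Path` can be passed to `Document(...)` or, after `str(path.resolve())`, to Word's `Documents.Open`.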
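- For convenience, the results can be saved to an Excel file with the pandas library.
- With many files, the code above ran out of memory in practice, because reading the legacy doc format opens Word in the background.
- The code below improves on this by closing the background Word process after each file has been processed.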
Python code:
import os
import re

import pandas as pd
import win32com.client  # provided by the pywin32 package, Windows only
from docx import Document


def find_dates(text):
    # No \b anchors: Python 3 treats Chinese characters as word characters,
    # so \b would miss dates embedded in running text such as "于2023年5月1日".
    pattern = r'(?<!\d)\d{1,4}年\d{1,2}月\d{1,2}日'
    return re.findall(pattern, text)


def process_word_files(directory):
    results = []
    for filename in os.listdir(directory):
        if filename.startswith('~$'):
            continue  # skip the lock files Word creates for open documents
        path = os.path.abspath(os.path.join(directory, filename))
        name = filename.lower()
        if name.endswith('.docx'):
            doc = Document(path)
            text = ' '.join(paragraph.text for paragraph in doc.paragraphs)
        elif name.endswith('.doc'):
            # Legacy .doc files are read through a background Word instance.
            word = win32com.client.DispatchEx('Word.Application')
            try:
                word.Visible = False
                doc = word.Documents.Open(path)
                try:
                    text = doc.Content.Text
                finally:
                    doc.Close(False)  # close without saving changes
            finally:
                word.Quit()  # always shut down the background Word process
        else:
            continue  # not a Word file
        for date in find_dates(text):
            results.append({'filename': filename, 'date': date})
    return results


def save_to_excel(results, output_filename):
    # Fixed column names keep the header row even when no dates were found.
    df = pd.DataFrame(results, columns=['filename', 'date'])
    df.to_excel(output_filename, index=False)


directory = r'C:\Users\user\Desktop\word\data'
output_filename = 'output.xlsx'
results = process_word_files(directory)
save_to_excel(results, output_filename)
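Starting and quitting Word for every doc file keeps memory in check but is slow across thousands of files. One possible variation, sketched here under assumptions rather than taken from the script above, is to start Word once, reuse it for every file, and let a context manager guarantee that `Quit` runs even if a file fails to open. `word_application`, `read_doc_text`, and the `factory` parameter are hypothetical names. The only Word calls used are the ones that already appear in the script (`DispatchEx`, `Documents.Open`, `Content.Text`, `Close`, `Quit`).

```python
from contextlib import contextmanager


def _start_word():
    # Imported lazily so this module also loads where pywin32 is absent.
    import win32com.client
    return win32com.client.DispatchEx('Word.Application')


@contextmanager
def word_application(factory=_start_word):
    """Yield one background Word instance and always quit it afterwards."""
    word = factory()
    try:
        word.Visible = False
        yield word
    finally:
        word.Quit()


def read_doc_text(word, path):
    """Return the full text of one document, closing it without saving."""
    doc = word.Documents.Open(path)
    try:
        return doc.Content.Text
    finally:
        doc.Close(False)


# Usage on Windows with Word installed (paths should be absolute):
# with word_application() as word:
#     for path in doc_paths:
#         text = read_doc_text(word, path)
```

The `factory` argument exists only so the cleanup logic can be exercised with a stand-in object on machines without Word.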
Shell command to install the dependencies (`win32com.client` is provided by the `pywin32` package):
pip install python-docx pywin32 pandas openpyxl pyarrow