Column: GitHubStore
Sharing interesting open-source projects

DocAI: Extracting Structured Data from Unstructured Documents

GitHubStore · WeChat Official Account · 2024-09-14 17:45

Article preview

Project overview

Extract structured data from unstructured documents using Answer.AI's Byaldi, OpenAI's gpt-4o, and Langchain's structured output.

Installation

    pyenv virtualenv 3.10.6 docai
    pyenv activate docai
    poetry install

Environment variables

Make sure OPENAI_API_KEY and HF_TOKEN are set in your environment.

    export OPENAI_API_KEY=
    export HF_TOKEN=

Usage example

Build an index from the pdfs/ folder:

    python scripts/build_index.py --folder "pdfs/" --index_name "application"

Sample output

What losses have occurred in the past 5 years?

    LossHistory(
        losses=[
            Loss(loss_date='2/20/21', loss_amount=7003.0, loss_description='Claimant was in his sleeper when his truck got hit by insured driver on the left', date_of_claim='4/19/21'),
            Loss(loss_date='2/4/21', loss_amount=92584.0, loss_description='The IV was attempting to merge on the highway when the IV lost control and struck ', date_of_claim=' 4 / 30 / 21 '),
            Loss(loss_date=' 9 / 14 / 21 ', loss_amount=5583.0, loss ………………………………
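The sample output above implies a nested record shape: a LossHistory holding a list of Loss entries. The preview does not show the project's actual schema, so the following is only a minimal sketch of that shape. The field names are copied from the sample output; the field types, the use of standard-library dataclasses (Langchain's structured output is commonly paired with Pydantic models instead), and the total_amount helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loss:
    # Field names come from the sample output; the types are assumptions.
    loss_date: str          # kept as a raw string, e.g. '2/20/21'
    loss_amount: float
    loss_description: str
    date_of_claim: str


@dataclass
class LossHistory:
    losses: List[Loss] = field(default_factory=list)

    def total_amount(self) -> float:
        # Hypothetical helper, not part of the project: sums all loss amounts.
        return sum(loss.loss_amount for loss in self.losses)


# Rebuild the first record from the sample output.
history = LossHistory(
    losses=[
        Loss(
            loss_date="2/20/21",
            loss_amount=7003.0,
            loss_description=(
                "Claimant was in his sleeper when his truck got hit "
                "by insured driver on the left"
            ),
            date_of_claim="4/19/21",
        ),
    ]
)

print(history)
print(history.total_amount())
```

Dates are kept as plain strings here because the sample output shows inconsistent spacing (for example ' 4 / 30 / 21 '), so parsing them would need a normalization step that the preview does not describe.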
