Extract Data from .pdf with ChatGPT

Code
import json
import os

import pandas as pd
import pdfplumber

os.getcwd()
'c:\\Users\\joost\\ReposWindows\\bedrock-web\\sandbox'

Upload the AGS3-3-1-2005.pdf

Then use a prompt like this:

Put the Groups and Headings table in a CSV with columns group_name,contents,notes,parent_group
The Groups and Headings table starts at the bottom of page 13 and finishes on page 17.

It might require a bit of chatting with ChatGPT before you get a proper table in your chat. Once you do, you can copy-paste it to a .tsv file. A .tsv file is a Tab Separated Value file instead of CSV - Comma Separated Value file, which is what you get when you copy the table from ChatGPT. TSV’s are also handy, because they don’t run into issues when one of the values contains a comma…

Now it could be that ChatGPT doesn’t always return the same results and that the table isn’t exactly as in the AGS 3 .pdf document. This was the case for me:

Code
df1 = pd.read_csv("ags3_chatgpt_groups_and_headings.tsv", sep="\t")
df1["df"] = 1
df2 = pd.read_csv("ags3_chatgpt_groups_and_headings2.tsv", sep="\t")
df2["df"] = 2

# Concatenate the two DataFrames
df_concat = pd.concat([df1, df2])

# Find the duplicate rows
manual_duplicates = df_concat.duplicated(
    subset=df_concat.columns.difference(["df"]), keep=False
)

# Find the rows that are not duplicates (i.e., the rows that are unique to one DataFrame)
unique_rows = df_concat[~manual_duplicates]

unique_rows.sort_index()
group_name contents parent_group df
32 HPGI Horizontal Profile Gauge Installation Details HOLE 1
32 HPGI Horizontal Profile Gauge Installation HOLE 2
37 IFID On Site Volatile Headspace Testing (FID) HOLE 2
37 IFID On Site Volatile Headspace Testing Using Flame... HOLE 1
38 INST Single Point Instrument Installation HOLE 2
38 INST Single Point Instrument Installation Details HOLE 1
40 IPID On Site Volatile Headspace Testing (PID) HOLE 2
71 UNIT Definition of <UNITS> and CNMT_UNIT - 1
72 UNIT Definition of Units - 2

Extract AGS 3 and 4 Data Dictionaries from their corresponding AGS .pdf documents

Code
def extract_ags3_data_dict_table(table):
    headings = []
    for row in table[2:]:  # Skip first 2 rows: 1st = title, 2nd = headings
        headings.append(
            {
                "status": None if row[0] == "" else row[0].strip(),
                "heading": row[1].strip(),
                "unit": None if row[2] == "" else row[2].strip().replace("\n", " "),
                "description": row[3].strip().replace("\n", " "),
                "example": None if row[4] == "" else row[4].strip().replace("\n", " "),
            }
        )
    return headings


def extract_ags4_data_dict_table(table):
    # Skip rows that don't contain data
    for i, row in enumerate(table):
        if "Suggested\nUnit / Type" in row or "Unit / Type" in row:
            first_data_row = i + 1
            break

    headings = []
    for row in table[first_data_row:]:
        row = [x for x in row if x is not None]
        headings.append(
            {
                "status": None if row[0] == "" else row[0].strip(),
                "heading": row[1].strip(),
                "unit": None if row[2] == "" else row[2].strip().replace("\n", ""),
                "type": row[3].strip(),
                "description": row[4].strip().replace("\n", " "),
                "example": None if row[5] == "" else row[5].strip().replace("\n", " "),
            }
        )
    return headings
Code
ags_version = 4  # AGS version 3 or 4
pdf_dict = {
    3: {"pdf_file": "AGS3_v3-1-2005.pdf", "from_page": 22, "to_page": 69},
    4: {"pdf_file": "AGS4-v4-1-1-2022.pdf", "from_page": 18, "to_page": 160},
}
Code
pdf_file, from_page, to_page = pdf_dict[ags_version].values()

# List to store extracted data for each group
extracted_data = []
previous_group_name = ""
with pdfplumber.open(pdf_file) as pdf:
    # Adjust the page range based on where the tables are located
    for page_number in range(from_page, to_page):
        page = pdf.pages[page_number - 1]  # pdfplumber is 0-based, so subtract 1
        tables_on_current_page = page.extract_tables()  # Extract tables from the page

        # Iterate through all tables found on the page
        for table in tables_on_current_page:
            if ags_version == 3:
                table_title = table[0][0].strip()  # Get table title from AGS3
            elif ags_version == 4:
                table_title = table[0][1].strip()  # Get table title from AGS4
            print(table_title)

            parts = table_title.split(": ", 1)  # Split on the first occurrence of ': '
            if "Group Name" in parts[0]:
                group_name = parts[1].split(" - ")[0]
                group_description = " - ".join(parts[1].split(" - ")[1:])
                group_description = group_description.replace("\n", " ")
                if ags_version == 3:
                    headings = extract_ags3_data_dict_table(table)
                elif ags_version == 4:
                    headings = extract_ags4_data_dict_table(table)

                if group_name == previous_group_name:
                    extracted_data[-1]["headings"].extend(headings)
                else:
                    extracted_data.append(
                        {
                            "group_name": group_name,
                            "group_description": group_description,
                            "headings": headings,
                        }
                    )
                previous_group_name = group_name
Group Name: PROJ - Project Information
Group Name: ABBR - Abbreviation Definitions
Group Name: DICT - User Defined Groups and Headings
Group Name: FILE - Associated Files
Group Name: TRAN - Data File Transmission Information / Data Status
Group Name: TYPE - Definition of Data Types
Group Name: UNIT - Definition of Units
Group Name: AAVT - Aggregate Abrasion Tests
Group Name: AAVT - Aggregate Abrasion Tests
Group Name: ACVT - Aggregate Crushing Value Tests
Group Name: AELO - Aggregate Elongation Index Tests
Group Name: AFLK - Aggregate Flakiness Tests
Group Name: AFLK - Aggregate Flakiness Tests
Group Name: AIVT - Aggregate Impact Value Tests
Group Name: AIVT - Aggregate Impact Value Tests
Group Name: ALOS - Los Angeles Abrasion Tests
Group Name: ALOS - Los Angeles Abrasion Tests
Group Name: APSV - Aggregate Polished Stone Tests
Group Name: ARTW - Aggregate Determination of the Resistance to Wear (micro-Deval)
Group Name: ASDI - Slake Durability Index Tests
Group Name: ASNS - Aggregate Soundness Tests
Group Name: AWAD - Aggregate Water Absorption Tests
Group Name: AWAD - Aggregate Water Absorption Tests
Group Name: BKFL - Exploratory Hole Backfill Details
Group Name: CBRG - California Bearing Ratio Tests - General
Group Name: CBRT - California Bearing Ratio Tests - Data
Group Name: CDIA - Casing Diameter by Depth
Group Name: CHIS - Chiselling Details
Group Name: CHOC - Chain of Custody Information
Group Name: CMPG - Compaction Tests - General
Group Name: CMPT - Compaction Tests - Data
Group Name: CONG - Consolidation Tests - General
Group Name: CONG - Consolidation Tests - General
Group Name: CONS - Consolidation Tests - Data
Group Name: CONS - Consolidation Tests - Data
Group Name: CTRC - Cyclic Triaxial Tests - Consolidation
Group Name: CTRC - Cyclic Triaxial Tests - Consolidation
Group Name: CTRD - Cyclic Triaxial Tests - Data
Group Name: CTRG - Cyclic Triaxial Test - General
Group Name: CTRG - Cyclic Triaxial Test - General
Group Name: CTRP - Cyclic Triaxial Test - Derived Parameters
Group Name: CTRP - Cyclic Triaxial Test - Derived Parameters
Group Name: CTRS - Cyclic Triaxial Test - Saturation
Group Name: CORE - Coring Information
Group Name: DCPG - Dynamic Cone Penetrometer Tests - General
Group Name: DCPG - Dynamic Cone Penetrometer Tests - General
Group Name: DCPT - Dynamic Cone Penetrometer Tests - Data
Group Name: DETL - Stratum Detail Descriptions
Group Name: DETL - Stratum Detail Descriptions
Group Name: DISC - Discontinuity Data
Group Name: DISC - Discontinuity Data
Group Name: DLOG - Driller Geological Description
Group Name: DOBS - Drilling/Advancement Observations & Parameters
Group Name: DOBS - Drilling/Advancement Observations & Parameters
Group Name: DPRG - Dynamic Probe Tests - General
Group Name: DPRB - Dynamic Probe Tests - Data
Group Name: DREM - Depth Related Remarks
Group Name: ECTN - Sample Container Details
Group Name: ELRG - Environmental Laboratory Reporting
Group Name: ELRG - Environmental Laboratory Reporting
Description
Depth to top of test
specimen
Laboratory analytical
name

Group Name: ERES - Environmental Contaminant Testing
Group Name: ERES - Environmental Contaminant Testing
Group Name: ESCG - Effective Stress Consolidation Tests - General
Group Name: ESCG - Effective Stress Consolidation Tests - General
Group Name: ESCT - Effective Stress Consolidation Tests - Data
Group Name: ESCT - Effective Stress Consolidation Tests - Data
Group Name: FGHG - Field Geohydraulic Testing - General
Group Name: FGHI - Field Geohydraulic Testing - Instrumentation Details
Group Name: FGHS - Field Geohydraulic Testing - Test Results (per stage)
Group Name: FGHT - Field Geohydraulic Testing - Data
Group Name: FLSH - Drilling Flush Details
Group Name: FRAC - Fracture Spacing
Group Name: FRST - Frost Susceptibility Tests
Group Name: FRST - Frost Susceptibility Tests
Group Name: GCHM - Geotechnical Chemistry Testing
Group Name: GCHM - Geotechnical Chemistry Testing
Group Name: GEOL - Field Geological Descriptions
Group Name: GRAG - Particle Size Distribution Analysis - General
Group Name: GRAG - Particle Size Distribution Analysis - General
Group Name: GRAT - Particle Size Distribution Analysis - Data
Group Name: GRAT - Particle Size Distribution Analysis - Data
Group Name: HDIA - Hole Diameter by Depth
Group Name: HDPH - Depth Related Exploratory Hole Information
Group Name: HDPH - Depth Related Exploratory Hole Information
Group Name: HORN - Exploratory Hole Orientation and Inclination
Group Name: ICBR - In Situ California Bearing Ratio Tests
Group Name: ICBR - In Situ California Bearing Ratio Tests
Group Name: IDEN - In Situ Density Tests
Group Name: IFID - On Site Volatile Headspace Testing Using Flame Ionisation Detector
Group Name: IPEN - In Situ Hand Penetrometer Tests
Group Name: IPID - On Site Volatile Headspace Testing by Photo Ionisation Detector
Group Name: IPID - On Site Volatile Headspace Testing by Photo Ionisation Detector
Group Name: IPRG - In Situ Permeability Tests - General
Group Name: IPRG - In Situ Permeability Tests - General
Group Name: IPRT - In Situ Permeability Tests - Data
Group Name: IRDX - In Situ Redox Tests
Group Name: IRES - In Situ Resistivity Tests
Group Name: IRES - In Situ Resistivity Tests
Group Name: ISAG - Soakaway Tests - General
Group Name: ISAG - Soakaway Tests - General
Group Name: ISAT - Soakaway Tests - Data
Group Name: ISPT - Standard Penetration Test Results
Group Name: ISPT - Standard Penetration Test Results
Group Name: IVAN - In Situ Vane Tests
Group Name: LBSG - Testing Schedule
Group Name: LBST - Testing Schedule Details
Group Name: LBST - Testing Schedule Details
Group Name: LDEN - Density Tests
Group Name: LDEN - Density Tests
Group Name: LDYN - Dynamic Testing
Group Name: LDYN - Dynamic Testing
Group Name: LFCN – Laboratory Fall Cone Test
Group Name: LFCN – Laboratory Fall Cone Test
Group Name: LLIN - Linear Shrinkage Tests
Group Name: LLIN - Linear Shrinkage Tests
Group Name: LLPL - Liquid and Plastic Limit Tests
Group Name: LLPL - Liquid and Plastic Limit Tests
Group Name: LNMC - Water/moisture Content Tests
Group Name: LOCA - Location Details
Group Name: LOCA - Location Details
Group Name: LPDN - Particle Density Tests
Group Name: LPEN - Laboratory Hand Penetrometer Tests
Group Name: LRES - Laboratory Resistivity Tests
Group Name: LRES - Laboratory Resistivity Tests
Group Name: LSLT - Shrinkage Limit Tests
Group Name: LSTG - Initial Consumption of Lime Tests - General
Group Name: LSTG - Initial Consumption of Lime Tests - General
Group Name: LSTT - Initial Consumption of Lime Tests - Data
Group Name: LSWL - Swelling Index Testing
Group Name: LSWL - Swelling Index Testing
Group Name: LTCH - Laboratory Thermal Conductivity
Group Name: LTCH - Laboratory Thermal Conductivity
Group Name: LUCT - Laboratory Unconfined Compression Test
Group Name: LUCT - Laboratory Unconfined Compression Test
Group Name: LVAN - Laboratory Vane Tests
Group Name: LVAN - Laboratory Vane Tests
Group Name: MCVG - MCV Tests - General
Group Name: MCVG - MCV Tests - General
Group Name: MCVT - MCV Tests - Data
Group Name: MCVT - MCV Tests - Data
Group Name: MOND - Monitoring Readings
Group Name: MONG - Monitoring Installations and Instruments
Group Name: MONG - Monitoring Installations and Instruments
Group Name: PIPE - Monitoring Installation Pipe Work
Group Name: PLTG - Plate Loading Tests - General
Group Name: PLTT - Plate Loading Tests - Data
Group Name: PMTG - Pressuremeter Test Results - General
Group Name: PMTG - Pressuremeter Test Results - General
Group Name: PMTD - Pressuremeter Test Data
Group Name: PMTL - Pressuremeter Test Results - Individual Loops
Group Name: PREM - Project Specific Time Related Remarks
Group Name: PTIM - Boring/Drilling Progress by Time
Group Name: PTST - Laboratory Permeability Tests
Group Name: PTST - Laboratory Permeability Tests
Group Name: PTST - Laboratory Permeability Tests
Group Name: PUMG - Pumping Tests - General
Group Name: PUMT - Pumping Tests - Data
Group Name: RCAG - Rock Abrasiveness Tests - General
Group Name: RCAG - Rock Abrasiveness Tests - General
Group Name: RCAT - Rock Abrasiveness Tests - Data
Group Name: RCAT - Rock Abrasiveness Tests - Data
Group Name: RCCV - Chalk Crushing Value Tests
Group Name: RCCV - Chalk Crushing Value Tests
Group Name: RDEN - Rock Porosity and Density Tests
Group Name: RDEN - Rock Porosity and Density Tests
Group Name: RELD - Relative Density Tests
Group Name: RELD - Relative Density Tests
Group Name: RESC - Resonant Column Test - Consolidation
Group Name: RESC - Resonant Column Test - Consolidation
Group Name: RESD - Resonant Column Test – Data
Group Name: RESD - Resonant Column Test – Data
Group Name: RESG - Resonant Column Test – General
Group Name: RESG - Resonant Column Test – General
Group Name: RESP - Resonant Column Test - Derived Parameters
Group Name: RESS - Resonant Column Test – Saturation
Group Name: RPLT - Point Load Testing
Group Name: RPLT - Point Load Testing
Group Name: RSCH - Schmidt Rebound Hardness Tests
Group Name: RSCH - Schmidt Rebound Hardness Tests
Group Name: RSCH - Schmidt Rebound Hardness Tests
Group Name: RSHR - Shore Scleroscope Hardness Tests
Group Name: RTEN - Tensile Strength Testing
Group Name: RUCS - Rock Uniaxial Compressive Strength and Deformability Tests
Group Name: RUCS - Rock Uniaxial Compressive Strength and Deformability Tests
Group Name: RWCO - Water Content of Rock Tests
Group Name: RWCO - Water Content of Rock Tests
Group Name: SAMP - Sample Information
Group Name: SAMP - Sample Information
Group Name: SCDG - Static Cone Dissipation Tests - General
Group Name: SCDT - Static Cone Dissipation Tests - Data
Group Name: SCDT - Static Cone Dissipation Tests - Data
Group Name: SCPG - Static Cone Penetration Tests - General
Group Name: SCPP - Static Cone Penetration Tests - Derived Parameters
Group Name: SCPT - Static Cone Penetration Tests - Data
Group Name: SCPT - Static Cone Penetration Tests - Data
Group Name: SHBG - Shear Box Testing - General
Group Name: SHBG - Shear Box Testing - General
Group Name: SHBT - Shear Box Testing - Data
Group Name: SHBT - Shear Box Testing - Data
Group Name: STND - Standards / Specifications
Group Name: SUCT - Suction Tests
Group Name: SUCT - Suction Tests
Group Name: TNPC - Ten Per Cent Fines
Group Name: TNPC - Ten Per Cent Fines
Group Name: TREG - Triaxial Tests - Effective Stress - General
Group Name: TRET - Triaxial Tests - Effective Stress - Data
Group Name: TRET - Triaxial Tests - Effective Stress - Data
Group Name: TREM - Location Specific Time Related Remarks
Group Name: TRIG - Triaxial Tests - Total Stress - General
Group Name: TRIT - Triaxial Tests - Total Stress - Data
Group Name: TRIT - Triaxial Tests - Total Stress - Data
Group Name: WADD - Water Added Records
Group Name: WADD - Water Added Records
Group Name: WETH - Weathering
Group Name: WGPG - Wireline Geophysics - General
Group Name: WGPT - Wireline Geophysics - Readings
Tool
Calliper X-Y (2 x two
arm Calliper)
Downhole Magnetic
Resonance

Group Name: WINS - Window or Windowless Sampling Run Details
Group Name: WSTG - Water Strike - General
Group Name: WSTD - Water Strike - Details
Code
with open(f"ags{ags_version}_manual_groups.json", "r") as f:
    manual_groups = json.load(f)

extracted_groups = [d["group_name"] for d in extracted_data]

print(set(manual_groups) - set(extracted_groups))
print(set(extracted_groups) - set(manual_groups))
{'LFCN'}
{'LFCN – Laboratory Fall Cone Test'}
Code
# Save the extracted data to a JSON file
with open(f"ags{ags_version}_data_dict_p{from_page}-{to_page}.json", "w") as json_file:
    json.dump(extracted_data, json_file, indent=2)