AI Data Extraction Research

A research spike exploring the feasibility of using foundation models to extract faculty and staff profile data for the University of Idaho website

Introduction

The Person Profile Data Extraction Spike was a feasibility study conducted to evaluate the potential of using foundation models to extract structured data from faculty and staff profile pages on the University of Idaho website.

Rather than relying on a poorly documented legacy database, this approach aimed to extract key information directly from public profile pages, creating a more reliable and maintainable dataset for the university's website.

Project Highlights

  • 92.79% success rate in extracting profile data
  • Cost-effective solution at $0.0012 per profile
  • Processed 901 URLs in under 2 hours
  • Used LangGraph state machine for orchestration
  • Implemented ethical web crawling practices

Project Overview

The goal of this spike was to determine if foundation models could reliably extract structured data from faculty and staff profile pages on the University of Idaho website. The extracted data would include key fields such as Name, Title, Email, Degrees, and other information defined in the project's user story.

By leveraging AI models instead of relying on direct database access, the project aimed to create a more maintainable and flexible approach to keeping the university's website profile information up-to-date.

The spike focused on evaluating the accuracy, cost, and performance of this approach to determine its feasibility for full-scale implementation.

[Figure: Workflow diagram illustrating the data extraction process]

Technology Stack

Python · LangGraph · LangChain · Google Gemini · BeautifulSoup4 · Pydantic · LangSmith · Pandas

Approach & Methodology

Three-Step Process

The project followed a structured three-step approach to test the feasibility of using foundation models for profile data extraction:

  1. Identify Profiles: Started with a list of known profile URLs stored in data/uidaho_urls.json, generated by analyzing the website's sitemap.xml file to identify faculty and staff profile pages (see the sketch after this list).
  2. Crawl & Extract: Processed these URLs using a LangGraph state machine with steps for fetching pages, preprocessing HTML, extracting data with foundation models, and validating the results.
  3. Evaluate: Analyzed the results based on accuracy (using an LLM-as-a-judge pattern), cost, and processing time, with detailed tracing and debugging via LangSmith.
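
The URL discovery in step 1 can be reproduced with a short script along the following lines. This is a sketch under assumptions: the sitemap location and the /people/ path filter are illustrative guesses, since the exact heuristics applied to sitemap.xml are not documented here.

# Sketch: harvest candidate profile URLs from sitemap.xml.
# The /people/ filter is an assumed heuristic, not the project's exact rule.
import json
import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://www.uidaho.edu/sitemap.xml"  # assumed location

resp = requests.get(SITEMAP_URL, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.content, "xml")  # the "xml" parser requires lxml
urls = [loc.text.strip() for loc in soup.find_all("loc")]

# Keep only pages that look like individual profile pages
profile_urls = [u for u in urls if "/people/" in u]

with open("data/uidaho_urls.json", "w") as f:
    json.dump(profile_urls, f, indent=2)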

LangGraph State Machine

The core of the extraction process was implemented as a LangGraph state machine with the following key nodes (a wiring sketch follows the list):

  • fetch_page: Retrieves HTML content respectfully (using configured delays)
  • preprocess_html: Parses and cleans HTML using BeautifulSoup
  • extract_data: Uses Gemini Flash to extract information into a Pydantic schema
  • validate_data: Employs an LLM-as-a-judge pattern to evaluate accuracy
  • handle_error: Captures and logs errors at each step
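
Only the node names above come from the project; in the sketch below, the state shape, the stub node bodies, and the error-routing convention are assumptions for illustration.

# Sketch: wiring the five nodes into a LangGraph state machine.
# Node bodies are hypothetical stubs; only the node names come from the project.
from typing import Optional, TypedDict
from langgraph.graph import StateGraph, END

class ProfileState(TypedDict, total=False):  # assumed state shape
    url: str
    html: str
    cleaned_text: str
    profile: dict
    validation: dict
    error: Optional[str]

def fetch_page(state: ProfileState) -> ProfileState: ...
def preprocess_html(state: ProfileState) -> ProfileState: ...
def extract_data(state: ProfileState) -> ProfileState: ...
def validate_data(state: ProfileState) -> ProfileState: ...
def handle_error(state: ProfileState) -> ProfileState: ...

def route(state: ProfileState) -> str:
    # Divert to the error handler whenever a node recorded an error
    return "handle_error" if state.get("error") else "continue"

graph = StateGraph(ProfileState)
for name, fn in [("fetch_page", fetch_page), ("preprocess_html", preprocess_html),
                 ("extract_data", extract_data), ("validate_data", validate_data),
                 ("handle_error", handle_error)]:
    graph.add_node(name, fn)

graph.set_entry_point("fetch_page")
graph.add_conditional_edges("fetch_page", route,
                            {"continue": "preprocess_html", "handle_error": "handle_error"})
graph.add_conditional_edges("preprocess_html", route,
                            {"continue": "extract_data", "handle_error": "handle_error"})
graph.add_conditional_edges("extract_data", route,
                            {"continue": "validate_data", "handle_error": "handle_error"})
graph.add_edge("validate_data", END)
graph.add_edge("handle_error", END)

app = graph.compile()  # app.invoke({"url": ...}) runs one profile through the graph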

Ethical Crawling Practices

The project implemented responsible web crawling practices to ensure minimal impact on the university's web servers (see the sketch after this list):

  • Configured delays between requests to prevent server overload
  • Respected robots.txt directives and crawl-delay settings
  • Used proper user-agent identification
  • Implemented error handling to back off on server errors
  • Limited concurrent requests to maintain server health
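
A polite fetcher implementing these practices might look like the sketch below; the user-agent string, delay, and retry values are illustrative assumptions rather than the project's configured settings.

# Sketch: a polite page fetcher. Delay, user-agent, and retry
# values are assumptions, not the project's actual configuration.
import time
from typing import Optional
from urllib import robotparser
import requests

USER_AGENT = "uidaho-profile-spike/0.1 (research crawler)"  # illustrative
REQUEST_DELAY = 5.0   # seconds between requests (assumed)
MAX_RETRIES = 3

robots = robotparser.RobotFileParser("https://www.uidaho.edu/robots.txt")
robots.read()
# Honor a crawl-delay directive if the site declares one
delay = robots.crawl_delay(USER_AGENT) or REQUEST_DELAY

def fetch(url: str) -> Optional[str]:
    if not robots.can_fetch(USER_AGENT, url):
        return None  # respect robots.txt disallow rules
    for attempt in range(MAX_RETRIES):
        time.sleep(delay * (2 ** attempt))  # base delay, doubled after each server error
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code < 500:
            resp.raise_for_status()  # surface 4xx errors (e.g. 404s) to the error handler
            return resp.text
    return None  # gave up after repeated server errors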

Technical Implementation

Data Schemas

Pydantic models were used to define the structure of the extracted data and validation results:

ProfileData Schema

from typing import List, Optional

from pydantic import BaseModel

class ProfileData(BaseModel):
    name: str
    title: str
    email: Optional[str] = None
    phone: Optional[str] = None
    office: Optional[str] = None
    department: Optional[str] = None
    degrees: Optional[List[str]] = None
    bio: Optional[str] = None
    research_interests: Optional[List[str]] = None
    courses_taught: Optional[List[str]] = None
    publications: Optional[List[str]] = None
    website: Optional[str] = None

ValidationResult Schema

from typing import List, Literal, Optional

from pydantic import BaseModel

class FieldValidation(BaseModel):
    field: str
    status: Literal["Correct", "Incorrect", "Missing"]
    explanation: Optional[str] = None

class ValidationResult(BaseModel):
    overall_accuracy: float
    field_validations: List[FieldValidation]
    suggestions: Optional[List[str]] = None

LLM Integration

The project leveraged Google's Gemini Flash model for both extraction and validation:

Model Configuration

# Imports for the chat model wrapper and the safety-setting enums
from langchain_google_genai import ChatGoogleGenerativeAI
from google.generativeai.types import HarmBlockThreshold, HarmCategory

def make_gemini_model() -> ChatGoogleGenerativeAI:
    # Low temperature for near-deterministic extraction; safety categories
    # set to BLOCK_NONE, as in the original configuration
    return ChatGoogleGenerativeAI(
        model="gemini-flash",
        temperature=0.1,
        convert_system_message_to_human=True,
        safety_settings={
            HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
        },
    )

# Same model, different instances for extraction and validation
extraction_model = make_gemini_model()
validation_model = make_gemini_model()

The extraction process used structured prompts to guide the model in extracting specific fields from the HTML content, while the validation process employed an LLM-as-a-judge pattern to evaluate the accuracy of the extracted data.
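
A minimal sketch of that extraction call, assuming LangChain's with_structured_output helper was used to bind the Pydantic schema (the prompt wording here is illustrative, not the project's actual prompt):

# Sketch: extract a validated ProfileData object from cleaned page text.
# The prompt wording is illustrative; only the schema comes from the project.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the person's profile fields from the page text. "
               "Leave a field unset if the page does not mention it."),
    ("human", "{page_text}"),
])

# Binding the schema makes the model return a parsed ProfileData instance
structured_extractor = prompt | extraction_model.with_structured_output(ProfileData)

cleaned_text = "..."  # output of the preprocess_html node
profile = structured_extractor.invoke({"page_text": cleaned_text})

The validation node can be built the same way by binding ValidationResult and prompting the model to judge the extracted fields against the page text.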

Monitoring and Debugging

LangSmith was used for comprehensive monitoring and debugging of the extraction process.

Features Used

  • Detailed tracing of each step in the LangGraph state machine
  • Token usage tracking for cost estimation
  • Latency measurement for performance analysis
  • Error logging and categorization
  • Prompt inspection and refinement
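
LangSmith tracing for LangChain/LangGraph applications is typically switched on through environment variables before the graph runs; a sketch follows, in which the project name is an assumption.

# Sketch: enable LangSmith tracing via environment variables.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "uidaho-profile-extraction"  # assumed project name

# With these set, every LangChain/LangGraph call is traced automatically,
# including per-node latency and token usage.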

Metrics Collected

  • Accuracy
  • Token Usage
  • Estimated Cost
  • Latency
  • Success Rate

Results

High Accuracy

92.8%

Success Rate: 92.79% (836 successful extractions out of 901 URLs)

Error Analysis: Of the 65 failed extractions, 64 (98.5%) were due to HTTP 404 errors, indicating the URLs from the sitemap no longer exist on the website.

Cost-Effective

$0.0012

per successful profile

Total Cost: $1.0254 for processing all 901 profiles

Token Usage: Average of 3,132 tokens per successful profile, with 2,618,463 tokens used across all profiles

Efficient Processing

6.6s

per successful profile

Total Time: 1h 32m 27s for processing all 901 profiles

Note: Most processing time was due to intentional delays between requests to ensure ethical crawling of the university website.
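
The headline per-profile figures follow directly from the reported totals, as this quick check shows (all divided by the 836 successful extractions):

# Deriving the headline per-profile figures from the reported totals
successes, total_urls = 836, 901
total_cost = 1.0254                      # USD
total_seconds = 1*3600 + 32*60 + 27      # 1h 32m 27s
total_tokens = 2_618_463

print(f"success rate:   {successes / total_urls:.2%}")      # 92.79%
print(f"cost/profile:   ${total_cost / successes:.4f}")     # $0.0012
print(f"time/profile:   {total_seconds / successes:.1f}s")  # 6.6s
print(f"tokens/profile: {total_tokens / successes:,.0f}")   # 3,132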

Detailed Performance Metrics

Metrics Collected

  • Accuracy: Field-level correctness reported by the validation node
  • Token Usage: Input, output, and total tokens per LLM call
  • Estimated Cost: Calculated from tiktoken token counts and per-token model pricing
  • Latency: Processing time per profile and per node
  • Success Rate: Percentage of URLs processed without errors
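
A cost estimator in that style might look like the sketch below. The per-token prices are placeholders rather than Gemini's actual rates, and tiktoken's cl100k_base encoding only approximates Gemini's tokenizer.

# Sketch: estimate per-call cost from token counts.
# Prices are placeholders; cl100k_base only approximates Gemini's tokenizer.
import tiktoken

INPUT_PRICE_PER_1K = 0.00035   # USD per 1K input tokens (placeholder)
OUTPUT_PRICE_PER_1K = 0.00105  # USD per 1K output tokens (placeholder)

enc = tiktoken.get_encoding("cl100k_base")

def estimate_cost(prompt_text: str, completion_text: str) -> float:
    prompt_tokens = len(enc.encode(prompt_text))
    completion_tokens = len(enc.encode(completion_text))
    return (prompt_tokens / 1000 * INPUT_PRICE_PER_1K
            + completion_tokens / 1000 * OUTPUT_PRICE_PER_1K)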

Limitations & Challenges

Sitemap 404s

The uidaho.edu sitemap contains URLs that return 404 errors, indicating it's outdated. Relying solely on the sitemap for comprehensive coverage is not viable.

HTML Variability

The HTML structure varies across faculty profile pages (e.g., different class names and layouts). The preprocess_html step absorbed these variations well during the spike, but it remains a point of fragility if page templates change.

Multiple Profiles Page

One failure occurred on a page listing multiple faculty profiles (/people/adjuncts), which didn't conform to the single-profile schema our extraction currently supports.
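
A future handler could address this by wrapping the schema so the extractor returns a list of people per page, e.g. this hypothetical sketch building on the structured extractor shown earlier:

# Sketch: a hypothetical wrapper schema for listing pages with several people.
# Not implemented in the spike; reuses ProfileData and the extractor pattern above.
from typing import List
from pydantic import BaseModel

class ProfileList(BaseModel):
    profiles: List[ProfileData]

multi_extractor = prompt | extraction_model.with_structured_output(ProfileList)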


Recommendations

Feasibility Assessment

The foundation model approach has proven highly feasible for extracting the required profile data, with a 92.79% success rate. Based on the results of this spike, I recommend proceeding with this approach for the full implementation.

Model Choice

Gemini Flash provided excellent accuracy at a low cost ($0.0012 per profile). Based on these results, evaluating more expensive models like Claude 3.7 Sonnet or GPT-4o is unnecessary.

Process Improvements

  • Develop a more reliable strategy for identifying profile URLs beyond the outdated sitemap
  • Consider a special handler for pages with multiple profiles
  • Split the nodes into files for better code organization
  • Move the prompts into the config file for easier maintenance

Next Steps

Short-Term Actions

  1. Refine URL Discovery: Develop a more reliable strategy for identifying profile URLs beyond the outdated sitemap
  2. Support Multiple Profiles: Add support for pages containing multiple profiles
  3. Documentation: Update documentation with final findings and procedures
  4. Code Refactoring: Split the nodes into files and move prompts to the config file

Long-Term Vision

This spike demonstrates the potential for using foundation models in other data extraction and integration scenarios at the university:

  • Expand to other types of university content (courses, events, news)
  • Create an automated pipeline for regular updates to keep data fresh
  • Develop a validation interface for human review of extracted data
  • Integrate with the university's content management system

Conclusion

The Person Profile Data Extraction Spike has successfully demonstrated the feasibility of using foundation models to extract structured data from faculty and staff profile pages on the University of Idaho website.

With a 92.79% success rate, low cost per profile ($0.0012), and reasonable processing time (6.6 seconds per profile), this approach has proven to be a viable solution for extracting faculty and staff profile data.

The use of foundation models for this task not only eliminates the need to rely on poorly documented legacy databases but also provides a more flexible and maintainable approach to keeping the university's website profile information up-to-date. This successful spike paves the way for similar applications of AI in other data extraction and integration scenarios at the university.
