AI Data Extraction Research
A research spike exploring the feasibility of using foundation models to extract faculty and staff profile data for the University of Idaho website
Introduction
The Person Profile Data Extraction Spike was a feasibility study conducted to evaluate the potential of using foundation models to extract structured data from faculty and staff profile pages on the University of Idaho website.
Rather than relying on a poorly documented legacy database, this approach aimed to directly extract key information from public profile pages, creating a more reliable and maintainable dataset for the university's website.
Project Highlights
- 92.79% success rate in extracting profile data
- Cost-effective solution at $0.0012 per profile
- Processed 901 URLs in under 2 hours
- Used a LangGraph state machine for orchestration
- Implemented ethical web crawling practices
Project Overview
The goal of this spike was to determine if foundation models could reliably extract structured data from faculty and staff profile pages on the University of Idaho website. The extracted data would include key fields such as Name, Title, Email, Degrees, and other information defined in the project's user story.
By leveraging AI models instead of relying on direct database access, the project aimed to create a more maintainable and flexible approach to keeping the university's website profile information up-to-date.
The spike focused on evaluating the accuracy, cost, and performance of this approach to determine its feasibility for full-scale implementation.
Technology Stack
The spike was built in Python, using LangGraph for orchestration, Google's Gemini Flash model (via LangChain) for extraction and validation, Pydantic for data schemas, BeautifulSoup for HTML preprocessing, LangSmith for tracing and monitoring, and tiktoken for token-count estimates.
Approach & Methodology
Three-Step Process
The project followed a structured three-step approach to test the feasibility of using foundation models for profile data extraction:
1. Identify Profiles: Started with a list of known profile URLs stored in data/uidaho_urls.json, generated by analyzing the website's sitemap.xml file to identify faculty and staff profile pages (a sketch of this step follows the list).
2. Crawl & Extract: Processed these URLs using a LangGraph state machine with steps for fetching pages, preprocessing HTML, extracting data with foundation models, and validating the results.
3. Evaluate: Analyzed the results based on accuracy (using an LLM-as-a-judge pattern), cost, and processing time, with detailed tracing and debugging via LangSmith.
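The URL-discovery step can be illustrated with a short sketch. This is a minimal reconstruction, assuming the standard sitemap namespace and a hypothetical /people/ URL filter; the spike's actual selection heuristics are not documented here.

import json
import re
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.uidaho.edu/sitemap.xml"
PROFILE_PATTERN = re.compile(r"/people/")  # hypothetical filter for profile pages

# Download and parse the sitemap, then keep only profile-like URLs.
with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in tree.findall(".//sm:loc", ns) if loc.text]
profile_urls = [u for u in urls if PROFILE_PATTERN.search(u)]

with open("data/uidaho_urls.json", "w") as f:
    json.dump(profile_urls, f, indent=2)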
LangGraph State Machine
The core of the extraction process was implemented as a LangGraph state machine with the following key nodes:
- fetch_page: Retrieves HTML content respectfully (using configured delays)
- preprocess_html: Parses and cleans HTML using BeautifulSoup
- extract_data: Uses Gemini Flash to extract information into a Pydantic schema
- validate_data: Employs an LLM-as-a-judge pattern to evaluate accuracy
- handle_error: Captures and logs errors at each step
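To make the orchestration concrete, here is a minimal sketch of how such a graph could be wired with langgraph. The node names come from the list above; the state shape, the stub bodies, and the error-routing condition are assumptions for illustration.

from typing import Optional, TypedDict
from langgraph.graph import END, StateGraph

class CrawlState(TypedDict, total=False):
    url: str
    html: Optional[str]
    profile: Optional[dict]
    validation: Optional[dict]
    error: Optional[str]

# Stub nodes: each real node returns a partial state update.
def fetch_page(state: CrawlState) -> dict: return {}
def preprocess_html(state: CrawlState) -> dict: return {}
def extract_data(state: CrawlState) -> dict: return {}
def validate_data(state: CrawlState) -> dict: return {}
def handle_error(state: CrawlState) -> dict: return {}

def route(state: CrawlState) -> str:
    # Divert to the error handler whenever a node recorded an error.
    return "handle_error" if state.get("error") else "continue"

graph = StateGraph(CrawlState)
for name, fn in [("fetch_page", fetch_page), ("preprocess_html", preprocess_html),
                 ("extract_data", extract_data), ("validate_data", validate_data),
                 ("handle_error", handle_error)]:
    graph.add_node(name, fn)

graph.set_entry_point("fetch_page")
graph.add_conditional_edges("fetch_page", route,
                            {"continue": "preprocess_html", "handle_error": "handle_error"})
graph.add_conditional_edges("preprocess_html", route,
                            {"continue": "extract_data", "handle_error": "handle_error"})
graph.add_conditional_edges("extract_data", route,
                            {"continue": "validate_data", "handle_error": "handle_error"})
graph.add_edge("validate_data", END)
graph.add_edge("handle_error", END)

app = graph.compile()  # app.invoke({"url": ...}) runs one profile through the pipeline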
Ethical Crawling Practices
The project implemented responsible web crawling practices to ensure minimal impact on the university's web servers:
- Configured delays between requests to prevent server overload
- Respected robots.txt directives and crawl-delay settings
- Used proper user-agent identification
- Implemented error handling to back off on server errors
- Limited concurrent requests to maintain server health
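A polite fetcher along these lines can be sketched as follows; the user-agent string, fallback delay, and retry count here are illustrative placeholders, not the spike's actual configuration.

import time
import urllib.robotparser
import requests

USER_AGENT = "profile-extraction-spike/0.1"  # placeholder identification
FALLBACK_DELAY = 5.0  # seconds between requests when robots.txt sets no crawl-delay

robots = urllib.robotparser.RobotFileParser("https://www.uidaho.edu/robots.txt")
robots.read()
DELAY = robots.crawl_delay(USER_AGENT) or FALLBACK_DELAY

def fetch(url: str, retries: int = 3) -> str | None:
    if not robots.can_fetch(USER_AGENT, url):
        return None  # respect robots.txt disallow rules
    for attempt in range(retries):
        time.sleep(DELAY)  # fixed spacing between requests
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code >= 500:
            time.sleep(DELAY * 2 ** attempt)  # back off on server errors
            continue
        resp.raise_for_status()  # surface 4xx errors such as the 404s seen in this spike
        return resp.text
    return None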
Technical Implementation
Data Schemas
Pydantic models were used to define the structure of the extracted data and validation results:
ProfileData Schema
from typing import List, Optional
from pydantic import BaseModel

class ProfileData(BaseModel):
    name: str
    title: str
    email: Optional[str] = None
    phone: Optional[str] = None
    office: Optional[str] = None
    department: Optional[str] = None
    degrees: Optional[List[str]] = None
    bio: Optional[str] = None
    research_interests: Optional[List[str]] = None
    courses_taught: Optional[List[str]] = None
    publications: Optional[List[str]] = None
    website: Optional[str] = None
ValidationResult Schema
from typing import List, Literal, Optional
from pydantic import BaseModel

class FieldValidation(BaseModel):
    field: str
    status: Literal["Correct", "Incorrect", "Missing"]
    explanation: Optional[str] = None

class ValidationResult(BaseModel):
    overall_accuracy: float
    field_validations: List[FieldValidation]
    suggestions: Optional[List[str]] = None
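For illustration only (all values below are hypothetical), the schemas behave like any other Pydantic models:

profile = ProfileData(
    name="Jane Doe",
    title="Professor of Computer Science",
    email="jdoe@example.edu",
    degrees=["Ph.D., University of Somewhere"],
)

report = ValidationResult(
    overall_accuracy=0.92,
    field_validations=[
        FieldValidation(field="name", status="Correct"),
        FieldValidation(field="phone", status="Missing",
                        explanation="No phone number appears on the page"),
    ],
)
print(profile.model_dump_json(indent=2))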
LLM Integration
The project leveraged Google's Gemini Flash model for both extraction and validation:
Model Configuration
from langchain_google_genai import (
    ChatGoogleGenerativeAI,
    HarmBlockThreshold,
    HarmCategory,
)

# Initialize the Gemini model
extraction_model = ChatGoogleGenerativeAI(
    model="gemini-flash",
    temperature=0.1,
    convert_system_message_to_human=True,
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)

# Initialize the validation model (same model, different instance)
validation_model = ChatGoogleGenerativeAI(
    model="gemini-flash",
    temperature=0.1,
    convert_system_message_to_human=True,
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)
The extraction process used structured prompts to guide the model in extracting specific fields from the HTML content, while the validation process employed an LLM-as-a-judge pattern to evaluate the accuracy of the extracted data.
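A hedged sketch of what those two calls can look like with LangChain's structured-output helper follows; the prompt text and the cleaned_html value are illustrative, since the spike's actual prompts are not reproduced here.

from langchain_core.prompts import ChatPromptTemplate

extraction_prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract this person's profile from the cleaned HTML. "
               "Only include fields that actually appear on the page."),
    ("human", "{cleaned_html}"),
])

# Bind the Pydantic schemas so each model returns parsed, validated objects.
extractor = extraction_prompt | extraction_model.with_structured_output(ProfileData)
judge = validation_model.with_structured_output(ValidationResult)

cleaned_html = "<div>...</div>"  # output of the preprocess_html node
profile = extractor.invoke({"cleaned_html": cleaned_html})  # -> ProfileData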
Monitoring and Debugging
LangSmith was used for comprehensive monitoring and debugging of the extraction process.
Features Used
- Detailed tracing of each step in the LangGraph state machine
- Token usage tracking for cost estimation
- Latency measurement for performance analysis
- Error logging and categorization
- Prompt inspection and refinement
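Enabling LangSmith tracing requires no code changes beyond environment configuration; a minimal sketch (the project name is a placeholder):

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "profile-extraction-spike"  # placeholder name
# With these set, every node in the LangGraph state machine and every LLM call
# is traced automatically, including token counts and latencies.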
Results
High Accuracy
Success Rate: 92.79% (836 successful extractions out of 901 URLs)
Error Analysis: Of the 65 failed extractions, 64 (98.5%) were due to HTTP 404 errors, indicating the URLs from the sitemap no longer exist on the website.
Cost-Effective
$0.0012 per profile
Total Cost: $1.0254 for processing all 901 profiles
Token Usage: Average of 3,132 tokens per successful profile, with 2,618,463 tokens used across all profiles
Efficient Processing
~6.6 seconds per profile
Total Time: 1h 32m 27s for processing all 901 profiles
Note: Most processing time was due to intentional delays between requests to ensure ethical crawling of the university website.
Detailed Performance Metrics
Metrics Collected
- Accuracy: Field-level correctness reported by the validation node
- Token Usage: Input, output, and total tokens per LLM call
- Estimated Cost: Calculated from token usage and model pricing, with token counts estimated via tiktoken (see the sketch after this list)
- Latency: Processing time per profile and per node
- Success Rate: Percentage of URLs processed without errors
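As a sketch of how such an estimate can be computed, the following assumes tiktoken's cl100k_base encoding as a proxy tokenizer and purely illustrative per-token prices; the spike's actual pricing constants are not shown here.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer for estimation

# Illustrative prices in dollars per million tokens (not the real rates).
INPUT_PRICE_PER_M = 0.35
OUTPUT_PRICE_PER_M = 1.05

def estimate_cost(prompt: str, completion: str) -> float:
    input_tokens = len(enc.encode(prompt))
    output_tokens = len(enc.encode(completion))
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000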
Limitations & Challenges
Sitemap 404s
The uidaho.edu sitemap contains URLs that return 404 errors, indicating it is outdated. Relying solely on the sitemap for comprehensive coverage is not viable.
HTML Variability
The HTML structure of different faculty profile pages varies (e.g., different class names and layouts). The preprocess_html step handled these variations well.
Multiple Profiles Page
One failure occurred on a page listing multiple faculty profiles (/people/adjuncts), which didn't conform to the single-profile schema the extraction currently supports.
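One possible direction, sketched here as an untested assumption rather than a design the spike implemented, is a wrapper schema that lets the extractor return every profile found on a listing page:

from typing import List
from pydantic import BaseModel

class ProfilePage(BaseModel):
    """Hypothetical wrapper for listing pages such as /people/adjuncts."""
    profiles: List[ProfileData]

# The extractor would target ProfilePage instead of a single ProfileData:
multi_extractor = extraction_model.with_structured_output(ProfilePage)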
Recommendations
Feasibility Assessment
The foundation model approach has proven highly feasible for extracting the required profile data, with a 92.79% success rate. Based on the results of this spike, I recommend proceeding with this approach for the full implementation.
Model Choice
Gemini Flash provided excellent accuracy at a low cost ($0.0012 per profile). Based on these results, evaluating more expensive models like Claude 3.7 Sonnet or GPT-4o is unnecessary.
Process Improvements
- Develop a more reliable strategy for identifying profile URLs beyond the outdated sitemap
- Consider a special handler for pages with multiple profiles
- Split the nodes into files for better code organization
- Move the prompts into the config file for easier maintenance
Next Steps
Short-Term Actions
- Refine URL Discovery: Develop a more reliable strategy for identifying profile URLs beyond the outdated sitemap
- Support Multiple Profiles: Add support for pages containing multiple profiles
- Documentation: Update documentation with final findings and procedures
- Code Refactoring: Split the nodes into files and move prompts to the config file
Long-Term Vision
This spike demonstrates the potential for using foundation models in other data extraction and integration scenarios at the university:
- Expand to other types of university content (courses, events, news)
- Create an automated pipeline for regular updates to keep data fresh
- Develop a validation interface for human review of extracted data
- Integrate with the university's content management system
Conclusion
The Person Profile Data Extraction Spike has successfully demonstrated the feasibility of using foundation models to extract structured data from faculty and staff profile pages on the University of Idaho website.
With a 92.79% success rate, low cost per profile ($0.0012), and reasonable processing time (6.6 seconds per profile), this approach has proven to be a viable solution for extracting faculty and staff profile data.
The use of foundation models for this task not only eliminates the need to rely on poorly documented legacy databases but also provides a more flexible and maintainable approach to keeping the university's website profile information up-to-date. This successful spike paves the way for similar applications of AI in other data extraction and integration scenarios at the university.