The Article Extractor processor is an enhanced content extraction tool for the JoomGrabber component. It uses the PHP-Readability library as its primary extraction engine, with optional AI refinement and content spinning.
Features
- PHP-Readability Integration: Primary content extraction using Readability library
- AI Refinement: Optional AI processing to clean and enhance extracted content
- Content Spinning: Integrated content rewriting for unique article generation
- Multiple AI Providers: Supports OpenAI (GPT) and Google AI Studio (Gemini)
- HTML Structure Preservation: Maintains article formatting, images, and links
- Smart Fallback System: Graceful degradation when AI services are unavailable
- Comprehensive Error Handling: Detailed error messages and fallback mechanisms
Configuration Options
Extraction Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| Use AI Refinement | Radio | Yes | Enable/disable AI content refinement |
| AI Service | Select | | Choose between OpenAI or Google AI Studio |
| Spin Content | Radio | No | Enable content rewriting (requires AI) |
| Spinning Intensity | Select | Moderate | Control the amount of content rewriting |
Spinning Intensity Options:
- Light: 20-30% changes
- Moderate: 50-60% changes
- Heavy: 80-90% changes
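The intensity levels above map naturally onto a lookup table. The following is a minimal sketch, not the processor's actual code: the names `SPIN_INTENSITY` and `build_spin_instruction` and the instruction wording are assumptions made for illustration. Only the percentage ranges come from the list above.

```python
# Hypothetical mapping of the documented intensity levels to their change ranges.
SPIN_INTENSITY = {
    "light": (20, 30),
    "moderate": (50, 60),
    "heavy": (80, 90),
}


def build_spin_instruction(intensity: str = "moderate") -> str:
    """Return an illustrative rewriting instruction for the given intensity.

    The wording of the instruction is an assumption; the real prompt sent
    to the AI service is not documented here.
    """
    key = intensity.strip().lower()
    if key not in SPIN_INTENSITY:
        raise ValueError(f"unknown spinning intensity: {intensity!r}")
    low, high = SPIN_INTENSITY[key]
    return (
        f"Rewrite roughly {low}-{high}% of the text while keeping "
        "the original meaning and HTML structure."
    )
```

Defaulting to `"moderate"` mirrors the default shown in the settings table.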
API Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| OpenAI API Key | Text | Empty | API key for OpenAI services |
| Google AI API Key | Text | Empty | API key for Google AI Studio |
| OpenAI Token Limit | Text | 100000 | Maximum characters to send to OpenAI |
| Google Token Limit | Text | 60000 | Maximum characters to send to Google AI |
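Although the parameters are named "Token Limit", the descriptions state that they cap the number of characters sent. A minimal sketch of that cap, assuming plain character truncation (the names `CHAR_LIMITS` and `truncate_for_provider` are hypothetical, and the processor may trim content differently):

```python
# Defaults taken from the table above; the dict and function names are hypothetical.
CHAR_LIMITS = {"openai": 100_000, "google": 60_000}


def truncate_for_provider(content: str, provider: str,
                          limits: dict = CHAR_LIMITS) -> str:
    """Cut content to the provider's character limit before sending it."""
    limit = limits[provider]
    # Counts characters, not model tokens, matching the parameter descriptions.
    return content if len(content) <= limit else content[:limit]
```

A plain slice like this can cut through an HTML tag, so a production version would likely need to trim at an element boundary instead.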
Input/Output Fields
Input Fields
- url: URL of the page to extract content from
- html: Raw HTML content (alternative to URL)
Output Fields
- extracted_article: Extracted HTML content
- title: Article title
- summary: Brief summary (max 3 sentences)
- stop: Error handling object with state and msg properties
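To make the output shape concrete, here is a small sketch of a result carrying the four fields. The helper `make_result` is hypothetical, and it assumes that `stop.state` is a boolean set to true when an error occurred and that `stop.msg` carries the error text. The source only says that `stop` has `state` and `msg` properties.

```python
from typing import Optional


def make_result(article_html: str, title: str, summary: str,
                error: Optional[str] = None) -> dict:
    """Build an output record with the documented field names."""
    return {
        "extracted_article": article_html,
        "title": title,
        "summary": summary,
        # Assumption: state is True only when an error message is present.
        "stop": {"state": error is not None, "msg": error or ""},
    }
```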
Processing Workflow
- Content Extraction
  - Fetches content from URL if provided
  - Uses PHP-Readability for primary extraction
  - Extracts title, content, and generates summary
- AI Refinement (Optional)
  - Sends content to selected AI service
  - Removes non-article elements (ads, navigation)
  - Preserves HTML structure and important elements
- Content Spinning (Optional)
  - Integrated with AI refinement step
  - Rewrites content based on intensity setting
  - Maintains original meaning and structure
- Error Handling
  - Falls back to Readability-only results if AI fails
  - Returns original content on spinning failure
  - Provides detailed error messages
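The fallback behaviour in the workflow above can be sketched as a wrapper around the optional AI step. This is an illustration under stated assumptions, not the processor's implementation: `refine_with_fallback` is a hypothetical name, and `refine` stands in for whatever call reaches the selected AI service.

```python
from typing import Callable, Optional, Tuple


def refine_with_fallback(
    extracted_html: str,
    refine: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Return (content, message).

    content is the refined HTML when refinement succeeds, otherwise the
    Readability-only result; message is empty unless a fallback happened.
    """
    if refine is None:
        # AI refinement is disabled: use the Readability result as-is.
        return extracted_html, ""
    try:
        refined = refine(extracted_html)
        if not refined.strip():
            raise ValueError("AI service returned empty content")
        return refined, ""
    except Exception as exc:
        # Graceful degradation: keep the original extraction and report why.
        return extracted_html, f"AI refinement failed: {exc}"
```

Because spinning is integrated with the refinement step, the same wrapper also covers the "returns original content on spinning failure" case.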
Debugging
Enable debugging by adding ?pdebug=1 to the URL. This will output:
- Input data
- Processor parameters
- API responses
- Error messages
- Processing steps
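When the URL already has a query string, the flag has to be appended with `&` rather than `?`. A minimal sketch using Python's standard `urllib.parse` (the helper name `with_pdebug` and the example URLs are hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_pdebug(url: str) -> str:
    """Return url with pdebug=1 set in its query string."""
    parts = urlparse(url)
    # Note: collecting into a dict keeps only the last value of a repeated key.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["pdebug"] = "1"
    return urlunparse(parts._replace(query=urlencode(query)))
```

For example, `with_pdebug("https://example.com/page")` yields `https://example.com/page?pdebug=1`.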