# Intelligent PDF Page Naming System

## Overview

PDF-to-PNG extraction now detects each page's type from its OCR text and names the output file accordingly.

## File Naming Logic

### Detection Strategy

1. **Analyze All Pages**: Extract OCR text from each page
2. **Find Map Page**: Search the text for "Google Map View" or related indicators
3. **Classify Pages**:
   - **Before Map**: `booth_info 1.png`, `booth_info 2.png`, etc.
   - **Map Page**: `map.png`
   - **After Map**: `voters 1.png`, `voters 2.png`, etc.

### Expected Output Pattern

For a 45-page PDF:
```
[booth_info 1.png, booth_info 2.png, map.png, voters 1.png, voters 2.png, ..., voters 42.png]
```
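
The pattern above can be reproduced with a short sketch. The function name `buildPageNames` is hypothetical and is not taken from the project's code; it only assumes that the map page number is already known.

```php
<?php
// Hypothetical helper (not the project's actual code): build the output
// file names for a PDF once the 1-based map page number is known.
function buildPageNames(int $pageCount, int $mapPage): array
{
    $names = [];
    $boothCounter = 0;
    $voterCounter = 0;

    for ($page = 1; $page <= $pageCount; $page++) {
        if ($page < $mapPage) {
            $names[$page] = sprintf('booth_info %d.png', ++$boothCounter);
        } elseif ($page === $mapPage) {
            $names[$page] = 'map.png';
        } else {
            $names[$page] = sprintf('voters %d.png', ++$voterCounter);
        }
    }

    return $names; // keyed by 1-based page number
}

$names = buildPageNames(45, 3);
echo $names[1], ', ', $names[3], ', ', $names[45], PHP_EOL;
// prints: booth_info 1.png, map.png, voters 42.png
```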

### Real Example from Electoral Roll

**Page 1** - Electoral Roll Header
- Contains: "ELECTORAL ROLL 2025 U07 Puducherry"
- Contains: "Details of part and polling area"
- Contains: Street listings
- **Named as**: `booth_info 1.png`

**Page 2** - Continuation of Booth Details
- Contains: Starting/Ending Serial Numbers
- Contains: Net Electors count
- **Named as**: `booth_info 2.png`, because the map page (page 3) comes after it; every page before the map gets its own numbered `booth_info` file

**Page 3** - Maps and Photos
- Contains: "Nazri Naksha", "Google Map View"
- Contains: "Polling Station Building Front View"
- Contains: "Cad View", "Key MAP View"
- **Named as**: `map.png`

**Page 4+** - Voter Cards
- Contains: Grid of voter profiles (3 columns × 10 rows)
- Individual cards with voter details
- **Named as**: `voters 1.png`, `voters 2.png`, etc.

## Map Page Detection Indicators

The system looks for these text patterns (case-insensitive):

### Primary Indicators
- `google map view` ✓ Most reliable
- `nazri naksha` ✓ Hindi for map sketch

### Secondary Indicators
- `polling station building front view`
- `polling station front view`
- `cad view`
- `key map view`
- `google map`
- `map view`

## Booth Info Page Detection

**Booth information pages** are identified as:
- All pages **before** the map page
- Typically contain:
  - Assembly constituency details
  - Revision information
  - Polling area details
  - Section and street names
  - Polling station address
  - Elector statistics

## Voter Page Detection

**Voter pages** are identified as:
- All pages **after** the map page
- Contain voter card grids with:
  - Serial numbers
  - Names
  - Parent/spouse names
  - House numbers
  - Age and gender
  - Voter ID photos

## Fallback Logic

If **no map page is detected**:
- System assumes **page 3** (index 2) as map page
- Pages 1-2 become booth_info pages
- Pages 4+ become voter pages

This default follows the standard Electoral Roll layout, where the map is on page 3. A PDF that deviates from that layout will be misnamed under the fallback.
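
The fallback can be sketched as a small helper that also records whether the result was assumed, so the caller can log it. `resolveMapPage` and `DEFAULT_MAP_PAGE` are hypothetical names; how the real code handles PDFs with fewer than three pages is not described here, so the sketch does not cover that case.

```php
<?php
// Hypothetical helper: choose the map page, falling back to page 3.
const DEFAULT_MAP_PAGE = 3; // 1-based page number (index 2 in a 0-based list)

function resolveMapPage(?int $detectedPage): array
{
    if ($detectedPage !== null) {
        return ['page' => $detectedPage, 'assumed' => false];
    }

    // Nothing detected: use the standard Electoral Roll position.
    return ['page' => DEFAULT_MAP_PAGE, 'assumed' => true];
}

$choice = resolveMapPage(null); // ['page' => 3, 'assumed' => true]
```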

## API Response Enhancement

The response now includes a `type` field (`booth_info`, `map`, or `voter`) for each extracted file:

```json
{
  "success": true,
  "files": [
    {
      "page": 1,
      "type": "booth_info",
      "filename": "booth_info 1.png",
      "path": "/path/to/file",
      "size": 1234567
    },
    {
      "page": 2,
      "type": "booth_info",
      "filename": "booth_info 2.png",
      "path": "/path/to/file",
      "size": 1234567
    },
    {
      "page": 3,
      "type": "map",
      "filename": "map.png",
      "path": "/path/to/file",
      "size": 1234567
    },
    {
      "page": 4,
      "type": "voter",
      "filename": "voters 1.png",
      "path": "/path/to/file",
      "size": 1234567
    }
  ]
}
```
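
A client could use the `type` field to group files without parsing file names. The sketch below works on a trimmed copy of the sample response; the variable names are illustrative.

```php
<?php
// Trimmed copy of the sample response shown above.
$json = '{"success": true, "files": [
    {"page": 1, "type": "booth_info", "filename": "booth_info 1.png"},
    {"page": 3, "type": "map", "filename": "map.png"},
    {"page": 4, "type": "voter", "filename": "voters 1.png"}
]}';

$response = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Group file names by their reported page type.
$byType = [];
foreach ($response['files'] as $file) {
    $byType[$file['type']][] = $file['filename'];
}
```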

## Integration with ProcessVoterImageBatch

The `ProcessVoterImageBatch` job already handles these file names correctly:

### Booth Info Detection
```php
$boothInfoFile = $images->first(fn($file) => 
    str_contains(strtolower($file->getFilename()), 'booth_info')
);
```
✓ Matches: `booth_info 1.png`, `booth_info 2.png`, etc. Note that `first()` returns only the first matching file in collection order.

### Voter Pages Detection
```php
$voterPages = $images->filter(fn($file) => 
    str_contains(strtolower($file->getFilename()), 'voter')
)->values();
```
✓ Matches: `voters 1.png`, `voters 2.png`, etc.
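
Because the names carry unpadded counters, plain string sorting would place `voters 10.png` before `voters 2.png`. Whether the job depends on file order is not stated here, so the following is only a sketch of how natural ordering could be applied if it does. It uses plain arrays rather than a Laravel collection.

```php
<?php
$files = ['voters 10.png', 'map.png', 'voters 2.png', 'booth_info 1.png', 'voters 1.png'];

// Same substring test as the job's voter filter, applied to a plain array.
$voterPages = array_values(array_filter(
    $files,
    fn (string $name): bool => str_contains(strtolower($name), 'voter')
));

// Natural ordering keeps "voters 2.png" ahead of "voters 10.png".
sort($voterPages, SORT_NATURAL | SORT_FLAG_CASE);
// $voterPages is now ['voters 1.png', 'voters 2.png', 'voters 10.png']
```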

## Processing Workflow

### Step 1: Extract PDF to PNG
```bash
# Multipart upload of the PDF plus its target folder fields
curl -X POST http://localhost:8000/api/pdf-to-png/extract \
  -F "pdf_file=@electoral_roll.pdf" \
  -F "constituency=Orleanpet" \
  -F "booth_number=19"
```

**Result**: Creates intelligently named files
```
Constituency/Orleanpet/19/
├── booth_info 1.png    ← Page 1 (Electoral Roll header, streets)
├── booth_info 2.png    ← Page 2 (Elector statistics) 
├── map.png             ← Page 3 (Maps and photos)
├── voters 1.png        ← Page 4 (First voter grid)
├── voters 2.png        ← Page 5 (Second voter grid)
└── voters 3.png        ← Page 6 (Third voter grid)
```

### Step 2: Process with OCR
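```bash
# JSON body with the same constituency and booth used for extraction
curl -X POST http://localhost:8000/api/image-import/run \
  -H "Content-Type: application/json" \
  -d '{"constituency": "Orleanpet", "booth_number": "19"}'
```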

**Result**: 
- Extracts booth details from `booth_info 1.png`, `booth_info 2.png`
- Processes voters from `voters 1.png`, `voters 2.png`, etc.
- Skips `map.png` (matched by neither the booth_info filter nor the voter filter)

## Technical Implementation

### Page Analysis Process

1. **First Pass**: Analyze all pages
   ```php
   $pageTypes = $this->analyzePageTypes($pdfPath, $pageCount, $convertPath, $dpi);
   ```

2. **Find Map Page**:
   - Convert each page to temporary PNG
   - Extract text using Tesseract OCR
   - Check for map indicators
   - Record map page index

3. **Assign Names**:
   - Pages before map: `booth_info {counter}.png`
   - Map page: `map.png`
   - Pages after map: `voters {counter}.png`

4. **Second Pass**: Extract with correct names
   - Convert each page to final PNG with determined name
   - Save to constituency folder
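
The two external commands behind these passes might look like the sketch below. It only builds command strings and executes nothing. `-density`, the 0-based `[n]` page selector, and `stdout` as Tesseract's output base are real CLI features, but the exact options the project passes are not shown in this document, so treat the function names and flags as assumptions.

```php
<?php
// Builds the shell commands only; nothing is executed here.
// Parameter names mirror the analyzePageTypes() call above.
function convertPageCommand(
    string $convertPath,
    string $pdfPath,
    int $pageIndex,
    int $dpi,
    string $outPath
): string {
    // ImageMagick selects a single PDF page with a 0-based "[n]" suffix.
    return sprintf(
        '%s -density %d %s %s',
        escapeshellarg($convertPath),
        $dpi,
        escapeshellarg($pdfPath . '[' . $pageIndex . ']'),
        escapeshellarg($outPath)
    );
}

function ocrCommand(string $pngPath): string
{
    // "stdout" as the output base makes Tesseract print the recognized text.
    return sprintf('tesseract %s stdout', escapeshellarg($pngPath));
}

// Page 3 of the PDF is index 2.
$convertCmd = convertPageCommand('convert', 'roll.pdf', 2, 300, '/tmp/page_3.png');
$ocrCmd = ocrCommand('/tmp/page_3.png');
```

Quoting every path with `escapeshellarg()` matters here because the generated file names contain spaces.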

### Performance

- **Analysis Phase**: ~1-2 seconds per page (OCR)
- **Extraction Phase**: ~0.5-1 second per page (conversion)
- **Total**: For a 45-page PDF ≈ 68-135 seconds (45 pages × 1.5-3 seconds per page)

**Optimization**: Analysis runs once per PDF to determine the names; extraction then proceeds without repeating OCR.
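
The per-page figures translate into total-time bounds by simple multiplication. The helper below is a hypothetical illustration of that arithmetic; it returns the extremes, and a typical run would be expected to fall somewhere between them.

```php
<?php
// Hypothetical estimator using the per-page figures listed above.
function estimateSeconds(int $pages): array
{
    $analysis = [1.0, 2.0];   // seconds per page (OCR)
    $extraction = [0.5, 1.0]; // seconds per page (conversion)

    return [
        'min' => $pages * ($analysis[0] + $extraction[0]),
        'max' => $pages * ($analysis[1] + $extraction[1]),
    ];
}

$estimate = estimateSeconds(45); // ['min' => 67.5, 'max' => 135.0]
```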

## Error Handling

### Tesseract Not Available
- System falls back to default naming pattern
- Warning logged
- Uses page 3 as default map page

### OCR Extraction Fails
- Continues with next page
- Uses default naming based on page position
- Logs warning but doesn't fail entire process

### No Map Page Detected
- Uses page 3 (index 2) as default
- Logs warning with assumption
- Continues processing normally
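
These cases can be combined in one loop that never lets a single page abort the run. In the sketch below, `$ocr` stands for any callable that returns a page's text or throws; the function name, the warning messages, and the use of only the two primary indicators are simplifying assumptions.

```php
<?php
// Hypothetical sketch: $ocr is any callable that returns a page's text or throws.
function findMapPage(int $pageCount, callable $ocr, array &$warnings): int
{
    for ($page = 1; $page <= $pageCount; $page++) {
        try {
            $text = strtolower($ocr($page));
        } catch (Throwable $e) {
            // One bad page should not fail the whole process.
            $warnings[] = "OCR failed on page {$page}: {$e->getMessage()}";
            continue;
        }

        if (str_contains($text, 'google map view') || str_contains($text, 'nazri naksha')) {
            return $page;
        }
    }

    // Nothing matched: fall back to the default position.
    $warnings[] = 'No map page detected; assuming page 3';
    return 3;
}

// Fake OCR for illustration: page 2 fails, page 4 is the map.
$fakeOcr = function (int $page): string {
    if ($page === 2) {
        throw new RuntimeException('tesseract exited with an error');
    }
    return $page === 4 ? 'Google Map View' : 'Details of part and polling area';
};

$warnings = [];
$mapPage = findMapPage(5, $fakeOcr, $warnings); // 4, with one warning recorded
```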

## Benefits

✅ **Automatic Detection**: No manual configuration needed  
✅ **Consistent Naming**: Standard pattern across all PDFs  
✅ **Integration Ready**: Works with existing import system  
✅ **Flexible**: Handles PDFs with varying page counts  
✅ **Robust**: Fallback logic for edge cases  
✅ **Logged**: Detailed logs for debugging  

## Testing

### Test with Sample Electoral Roll PDF

```bash
curl -X POST http://localhost:8000/api/pdf-to-png/extract \
  -F "pdf_file=@electoral_roll_orleanpet_19.pdf" \
  -F "constituency=Orleanpet" \
  -F "booth_number=19" \
  -F "dpi=300"
```

### Verify Output

```bash
ls -lh /var/www/Constituency/Orleanpet/19/
# Expected:
# booth_info 1.png
# booth_info 2.png
# map.png
# voters 1.png
# voters 2.png
# ...
```

### Check Logs

```bash
tail -f storage/logs/laravel.log | grep "Page analysis"
# Output:
# Page analysis complete: total_pages=47, booth_info_pages=2, map_page=3, voter_pages=44
```
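
The counts in that line can be sanity-checked: booth_info pages, one map page, and voter pages should add up to the total. The sketch assumes the `key=value` format shown in the sample output above.

```php
<?php
$line = 'Page analysis complete: total_pages=47, booth_info_pages=2, map_page=3, voter_pages=44';

// Collect every key=value pair with an integer value.
preg_match_all('/(\w+)=(\d+)/', $line, $m);
$stats = array_map('intval', array_combine($m[1], $m[2]));

// One map page plus the booth_info and voter pages should cover the PDF.
$consistent = $stats['booth_info_pages'] + 1 + $stats['voter_pages'] === $stats['total_pages'];
```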

## Prerequisites

Both tools must be installed. ImageMagick reads PDFs through its Ghostscript delegate, so Ghostscript must also be available.

### ImageMagick
```bash
brew install imagemagick  # macOS
sudo apt install imagemagick  # Ubuntu
```

### Tesseract OCR
```bash
brew install tesseract  # macOS
sudo apt install tesseract-ocr  # Ubuntu
```

### Verify Installation
```bash
convert -version   # on ImageMagick 7, `magick -version` is the preferred form
tesseract --version
```
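
The same check can be made from PHP before a run starts. `findExecutable` is a hypothetical helper that searches the directories of a `PATH`-style string; it is not part of the project.

```php
<?php
// Hypothetical helper: look for a binary in the directories of a PATH string.
function findExecutable(string $name, string $path): ?string
{
    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null; // not found in any directory
}

foreach (['convert', 'tesseract'] as $tool) {
    $found = findExecutable($tool, (string) getenv('PATH'));
    echo $tool, ': ', $found ?? 'not found', PHP_EOL;
}
```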

## Troubleshooting

### Wrong Page Named as Map

**Issue**: The system identifies the wrong page as the map page

**Solution**: Check the quality of the OCR text extracted from that page
```bash
# Test Tesseract on specific page
tesseract page_3.png output
cat output.txt
```

### All Pages Named as Voters

**Issue**: No map page was detected, so the fallback applies and every page after page 3 is named as a voter page

**Solution**: 
- Verify map page contains expected text
- Check Tesseract is installed and working
- Review logs for OCR errors

### Booth Info Not Extracted

**Issue**: No booth information in import results

**Solution**:
- Verify `booth_info 1.png` file exists
- Check file naming matches pattern
- Confirm ProcessVoterImageBatch detects booth_info files

## Future Enhancements

Potential improvements:
- Cache OCR results to avoid re-extraction
- Parallel page analysis for faster processing
- Machine learning for page type classification
- Support for custom page type patterns
- User-configurable detection keywords
