Reverse Image Search
AGT-FOR-006 detects stolen, recycled, or internet-sourced images that are fraudulently submitted as genuine claim evidence. The agent computes a perceptual hash of each submitted image and queries multiple reverse image search services in parallel: the Google Vision API, the TinEye API, and a proprietary internal hash database of previously seen claim images. It also checks whether the image, or a visually near-identical copy, already exists publicly on the internet (news sites, auto dealer listings, social media). A claimant submitting a publicly available photo of a vehicle fire as evidence of their own incident is a definitive fraud indicator. The agent returns all discovered source URLs, similarity scores, and the earliest known publication date to establish an unambiguous prior-existence timeline.
Tech Stack
Input
A single image file and optional context about the claimed incident for relevance filtering.
Accepted Formats
Fields
| Name | Type | Req | Description |
|---|---|---|---|
| image_file | binary | Yes | Raw image bytes to search for |
| claim_id | string | Yes | Claim ID for tracking and internal hash cross-reference |
| incident_type | string | No | Incident category for result relevance scoring (e.g. motor, fire, flood) |
| max_results | int | No | Maximum number of matching URLs to return per source (default: 10) |
Output
Match results from all search sources, the computed image hashes, and a final verdict on whether the image is original or recycled.
Format: JSON
Fields
| Name | Type | Description |
|---|---|---|
| phash | string | 64-bit perceptual hash (hex) of the submitted image |
| internal_match | object \| null | If found in internal database: {claim_id, submission_date, similarity_pct} |
| google_matches | array<object> | Google Vision web detection results: {url, title, score, first_seen} |
| tineye_matches | array<object> | TinEye results: {url, crawl_date, score, image_url} |
| earliest_known_date | string \| null | ISO-8601 date of the earliest known web publication of this image |
| flags | array<string> | FLAG_INTERNET_SOURCE, FLAG_DUPLICATE_CLAIM, FLAG_STOCK_PHOTO, FLAG_NEWS_ARTICLE |
| risk_score | float | Normalised risk contribution 0.0–1.0 |
| verdict | string | PASS \| FLAG \| INCONCLUSIVE |
Example Response
{
  "phash": "f8e4c2a0b6d4e8f0",
  "internal_match": null,
  "google_matches": [
    {"url": "https://vnexpress.net/...", "title": "Xe bốc cháy trên cao tốc", "score": 0.98, "first_seen": "2023-03-12"}
  ],
  "tineye_matches": [],
  "earliest_known_date": "2023-03-12",
  "flags": ["FLAG_INTERNET_SOURCE", "FLAG_NEWS_ARTICLE"],
  "risk_score": 0.97,
  "verdict": "FLAG"
}
How It Works
AGT-FOR-006 operates on the forensic principle that a genuine incident photo is unique — it has never appeared anywhere on the internet before the moment of submission. Any prior appearance on the web, regardless of where, indicates the image was not taken at the claimant's incident.
The agent's first layer of defence is the internal claim image database. Every image submitted to any claim is hashed and indexed. Before querying expensive external APIs, the agent checks whether this hash (or a close match) already exists in a previous claim — the most direct form of duplicate fraud detection.
The second layer uses external reverse image search services in parallel (via aiohttp async calls) to query both Google Vision API and TinEye. These services have indexed billions of web pages and can find an image even if it has been resized, recompressed, cropped, or watermarked.
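The concurrent fan-out can be sketched with asyncio.gather. The stub coroutines below are hypothetical stand-ins for the real aiohttp-backed Google Vision and TinEye clients and return canned data; only the concurrency pattern is the point here:

```python
import asyncio

# Hypothetical stub clients: real versions would make aiohttp calls to the
# Google Vision and TinEye APIs. Shown only to illustrate the fan-out.
async def query_google_vision(image_bytes: bytes) -> list[dict]:
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return [{"url": "https://example.com/a.jpg", "score": 0.97}]

async def query_tineye(image_bytes: bytes) -> list[dict]:
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return []

async def search_all(image_bytes: bytes) -> dict:
    # Both external services are queried concurrently, so total latency is
    # bounded by the slowest service rather than the sum of both.
    google, tineye = await asyncio.gather(
        query_google_vision(image_bytes),
        query_tineye(image_bytes),
    )
    return {"google_matches": google, "tineye_matches": tineye}
```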
The critical insight is the temporal analysis: if the earliest known publication of the image predates the claimed incident, the image cannot be genuine evidence of that incident. For example, a claimant submitting a photo of a car fire that appeared in a newspaper three months ago cannot claim that photo as evidence of their own incident today.
The agent also detects stock photo usage (images from Getty Images or Shutterstock commonly appear in fraudulent claims) and social media reuse (an image posted to Facebook or Twitter before the incident).
All matches are returned with URLs, dates, and similarity scores, providing adjudicators with direct evidence they can verify independently.
Thinking Steps
Image Preprocessing & Hash Computation
Load the image with Pillow, convert to grayscale, resize to 64×64 for hashing. Compute three hash types: pHash (DCT-based, robust to compression), dHash (difference hash, fast), and wHash (wavelet hash, robust to blur). Store all three for multi-algorithm matching.
Using multiple hash algorithms reduces both false positives and false negatives: pHash catches resized/recompressed copies, wHash catches blurred or watermarked versions.
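As an illustration of the difference-hash step, here is a pure-Python dHash over a grayscale pixel grid. It mirrors the compare-adjacent-pixels scheme that imagehash.dhash applies after a Pillow resize, with a simple box downsample standing in for Pillow's resampling:

```python
def dhash(pixels: list[list[int]], hash_size: int = 8) -> int:
    """Difference hash over a grayscale grid (rows of 0-255 values).

    The grid is box-downsampled to (hash_size+1) x hash_size, then each
    cell is compared to its right-hand neighbour: brighter -> 1, else 0.
    With the default hash_size=8 this yields a 64-bit hash.
    """
    h, w = len(pixels), len(pixels[0])
    tw, th = hash_size + 1, hash_size

    def cell(r: int, c: int) -> float:
        # Average the block of source pixels mapping to target cell (r, c).
        r0, r1 = r * h // th, max(r * h // th + 1, (r + 1) * h // th)
        c0, c1 = c * w // tw, max(c * w // tw + 1, (c + 1) * w // tw)
        vals = [pixels[i][j] for i in range(r0, r1) for j in range(c0, c1)]
        return sum(vals) / len(vals)

    bits = 0
    for r in range(th):
        for c in range(hash_size):
            bits = (bits << 1) | (1 if cell(r, c) > cell(r, c + 1) else 0)
    return bits
```

Because the hash is computed on a heavily downsampled grid, resizing or recompressing the source image leaves most bits unchanged, which is exactly why near-duplicates land within a small Hamming distance of each other.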
Internal Database Cross-Reference
Query the Redis hash index (built from all previously submitted claim images) using Hamming distance ≤ 10 bits as the similarity threshold. A match here means this exact (or near-identical) image was previously submitted in another claim — a definitive duplicate fraud signal.
The Redis sorted set structure enables O(log N) approximate nearest-neighbour search on the hash space.
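A minimal sketch of the Hamming-distance matching, with a plain dict standing in for the Redis index. A linear scan is shown for clarity; production code would first prune candidates via the sorted-set lookup:

```python
def hamming(a: int, b: int) -> int:
    # Number of differing bits between two 64-bit perceptual hashes.
    return bin(a ^ b).count("1")

def find_internal_matches(phash: int, index: dict[str, int],
                          max_distance: int = 10) -> list[tuple[str, int]]:
    """Return (claim_id, distance) pairs within the Hamming threshold,
    closest first. `index` maps claim IDs to previously stored hashes."""
    hits = [(cid, hamming(phash, h)) for cid, h in index.items()]
    return sorted(((c, d) for c, d in hits if d <= max_distance),
                  key=lambda t: t[1])
```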
Google Vision Web Detection
Submit the image to Google Cloud Vision API's webDetection feature and extract: fullMatchingImages (exact copies on the web), partialMatchingImages (cropped or partially matching), visuallySimilarImages, and webEntities. Record all URLs and their first-crawled dates.
Google's webDetection often finds images that TinEye misses because Google crawls more recently indexed pages.
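A sketch of flattening that payload into uniform match records. The input is assumed to be a plain dict following the shape of Vision's WebDetection message (fullMatchingImages, partialMatchingImages, pagesWithMatchingImages), not the SDK's protobuf object:

```python
def extract_web_matches(web_detection: dict, max_results: int = 10) -> list[dict]:
    """Flatten a webDetection-shaped dict into {url, kind} records,
    exact and partial image matches first, then hosting pages."""
    matches = []
    for kind in ("fullMatchingImages", "partialMatchingImages"):
        for item in web_detection.get(kind, []):
            matches.append({"url": item.get("url"), "kind": kind})
    for page in web_detection.get("pagesWithMatchingImages", []):
        matches.append({"url": page.get("url"), "kind": "page"})
    return matches[:max_results]
```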
TinEye Reverse Search
Submit the image to TinEye's REST API, which specialises in exact and near-duplicate matching across its index of over 60 billion images. TinEye returns the match count, URLs, and crawl dates. Its strength is finding images that have been slightly cropped or recoloured.
TinEye excels at finding recycled stock photos, which are a common source for fraudulent 'damage' photos.
Timeline Analysis
Extract the earliest known publication date from all discovered URLs. If the earliest publication date predates the claimed incident date, this proves the image cannot be an original photo of the incident.
A delta of even one day before the incident is conclusive evidence of image recycling.
Source Classification & Flag Assignment
Classify discovered sources: news articles → FLAG_NEWS_ARTICLE, stock photo sites (Shutterstock, Getty) → FLAG_STOCK_PHOTO, previous claims in internal DB → FLAG_DUPLICATE_CLAIM, any web source → FLAG_INTERNET_SOURCE. Risk score increases with the number of web matches and their similarity.
A single exact match on a news article from before the incident date is sufficient for a high-confidence FLAG regardless of other signals.
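A simplified sketch of the domain-based classification. The domain sets here are illustrative stand-ins for the agent's real allowlists:

```python
from urllib.parse import urlparse

# Illustrative stand-ins for the production domain allowlists.
STOCK_DOMAINS = {"shutterstock.com", "gettyimages.com", "istockphoto.com"}
NEWS_DOMAINS = {"vnexpress.net", "bbc.com", "reuters.com"}

def classify_source(url: str) -> str:
    """Map a match URL to the most specific flag its domain supports;
    any other web source falls through to FLAG_INTERNET_SOURCE."""
    host = urlparse(url).netloc.lower()
    domain = ".".join(host.split(".")[-2:])  # strip subdomains like "www."
    if domain in STOCK_DOMAINS:
        return "FLAG_STOCK_PHOTO"
    if domain in NEWS_DOMAINS:
        return "FLAG_NEWS_ARTICLE"
    return "FLAG_INTERNET_SOURCE"
```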
Thinking Tree
- Root Question: Is this image original and unpublished before the claimed incident?
  - Check internal claim database
    - Hash match found in previous claim → FLAG_DUPLICATE_CLAIM
    - No internal match → proceed to web search
  - Google Vision Web Detection
    - Exact match found on web
      - Publication date before incident → FLAG_INTERNET_SOURCE (high confidence)
      - Publication date after incident → weak signal, note only
    - No exact web match → check TinEye
  - TinEye near-duplicate search
    - Stock photo site match → FLAG_STOCK_PHOTO
    - News article match → FLAG_NEWS_ARTICLE
    - No match on any service → PASS
Decision Tree
1. Does the image hash match any previous claim in the internal DB?
   Yes → FLAG — DUPLICATE_CLAIM: Same image used in a previous claim
2. Does Google Vision find an exact or near-duplicate web match?
3. Is the earliest known publication date before the incident date?
   Yes to both → FLAG — INTERNET_SOURCE: Image was published online before the incident
4. Does TinEye find a match on a stock photo site?
   Yes → FLAG — STOCK_PHOTO: Image sourced from commercial stock photo library
5. Does TinEye find a match on any news or media site?
   Yes → FLAG — NEWS_ARTICLE: Image found in news/media coverage unrelated to claimant
All No → PASS: No prior web publication found; image appears original
Technical Design
Architecture
AGT-FOR-006 is an async FastAPI microservice. All three search operations (internal DB, Google Vision, TinEye) run concurrently via asyncio.gather to minimise latency. The internal Redis hash index enables sub-millisecond duplicate detection before expensive external API calls. Total p95 latency is approximately 3–5 seconds depending on Google Vision response time.
Components
| Component | Role | Technology |
|---|---|---|
| HashComputer | Computes pHash, dHash, wHash from image | imagehash 4.x + Pillow |
| InternalDBChecker | Queries Redis hash index with Hamming distance filter | Redis ZRANGEBYSCORE + Python bitcount |
| GoogleVisionClient | Calls Google Cloud Vision webDetection endpoint | google-cloud-vision Python SDK |
| TinEyeClient | Calls TinEye REST API | aiohttp + TinEye API v2 |
| TimelineAnalyser | Extracts and compares publication dates | Python datetime + dateparser |
| SourceClassifier | Categorises match URLs into source types | URL pattern matching + domain allowlist |
| ResultAggregator | Merges results from all sources into unified verdict | Pure Python |
Architecture Diagram
┌──────────────────────────────┐
│ POST /analyze (image + │
│ claim_id) │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ HashComputer │
│ (pHash + dHash + wHash) │
└──────┬───────────────────────┘
│
├──────────────────────┐
▼ ▼
┌──────────────┐ ┌──────────────────────┐
│InternalDB │ │ Async API Calls │
│Checker │ │ ┌─────────────────┐ │
│(Redis pHash) │ │ │ GoogleVisionCli │ │
└──────┬───────┘ │ └────────┬────────┘ │
│ │ │ │
│ │ ┌────────▼────────┐ │
│ │ │ TinEyeClient │ │
│ │ └────────┬────────┘ │
│ └──────────┼───────────┘
│ │
└──────────┬───────────┘
│
▼
┌────────────────────────┐
│ TimelineAnalyser + │
│ SourceClassifier │
└──────────┬─────────────┘
│
▼
┌────────────────────────┐
│ ResultAggregator │
└────────────────────────┘
Data Flow