Web Information Gathering
Web reconnaissance is the systematic collection of information about a target website or web application; it forms the foundation of the information-gathering phase of a penetration test.
Certificate Transparency Logs
Certificate Transparency (CT) logs are public, append-only records of issued TLS certificates; searching them is a fast way to discover subdomains.
Top Tools:
crt.sh
Censys
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u # Coarse filter: keeps whole entries containing "dev"
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[].name_value | split("\n")[] | select(contains("dev"))' | sort -u # Hostname-level extraction: splits multi-name entries, then filters
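To list every name found in the logs rather than only those containing "dev", the same query can be run without the filter (a minimal variant of the commands above):
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[].name_value | split("\n")[]' | sort -u # All unique hostnames from the CT logs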
WHOIS
WHOIS is a protocol for querying the databases that store domain registration details, used to identify who owns or manages a domain.
root@kakarot$ whois domain.com
Some of the information returned by whois:
Domain Name
Registrar
Registrant Contact
Administrative Contact
Technical Contact
Creation and Expiration Dates
Name Servers
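To pull just these fields out of the raw output, a simple grep works (a sketch; the exact field labels vary between registries and TLDs):
root@kakarot$ whois domain.com | grep -iE 'Registrar:|Registrant|Admin|Tech|Creation Date|Registry Expiry|Name Server'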
DNS
Top Tools:
dig
nslookup
host
dnsenum
fierce
dnsrecon
theHarvester
Online DNS Lookup Services
root@kakarot$ dig domain.com # Default A record lookup
root@kakarot$ dig domain.com A # Get IPv4 address (A record)
root@kakarot$ dig domain.com AAAA # Get IPv6 address (AAAA record)
root@kakarot$ dig domain.com MX # Show mail servers (MX records)
root@kakarot$ dig domain.com NS # Show authoritative name servers (NS)
root@kakarot$ dig domain.com TXT # Show TXT records (SPF, verification, etc.)
root@kakarot$ dig domain.com CNAME # Show CNAME (alias) record
root@kakarot$ dig domain.com SOA # Show SOA (zone authority info)
root@kakarot$ dig @1.1.1.1 domain.com # Query using a specific DNS server (1.1.1.1)
root@kakarot$ dig +trace domain.com # Trace full DNS resolution path
root@kakarot$ dig -x 192.168.1.1 # Reverse lookup: IP -> hostname
root@kakarot$ dig +short domain.com # Short, concise answer only
root@kakarot$ dig +noall +answer domain.com # Show only the ANSWER section
root@kakarot$ dig domain.com ANY # Request all records (may be ignored by servers)
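To collect several record types in one pass, the individual lookups above can be wrapped in a small shell loop (a convenience sketch built from the same dig flags):
root@kakarot$ for type in A AAAA MX NS TXT SOA; do echo "== $type =="; dig +short domain.com $type; done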
Subdomain Bruteforcing
Top Tools:
root@kakarot$ dnsenum --enum inlanefreight.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt -r
--enum
: Shortcut option equivalent to --threads 5 -s 15 -w.
-r, --recursion
: Recursion on subdomains; brute force all discovered subdomains that have an NS record.
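Gobuster's DNS mode is a common alternative for the same brute force (a sketch using the same SecLists wordlist as above):
root@kakarot$ gobuster dns -d inlanefreight.com -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt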
DNS Zone Transfers
A zone transfer (AXFR) replicates an entire DNS zone from a primary name server to a secondary one; if a server is misconfigured to answer AXFR requests from anyone, a single query can dump every record in the zone.
+------------------+ +------------------+
| secondaryServer | | primaryServer |
+------------------+ +------------------+
| |
| --------- AXFR Request (Zone Transfer) ------>|
| |
| <--------- SOA Record (Start of Authority) ---|
| |
+---------------------- loop ------------------ +
| |
| | <--------- DNS Record -------------------| |
| | | |
| +------------------- end loop -------------+ |
| |
| <---------- Zone Transfer Complete -----------|
| |
| --------- ACK (Acknowledgement) ------------->|
| |
+------------------+ +------------------+
| secondaryServer | | primaryServer |
+------------------+ +------------------+
Exploiting Zone Transfers Using Dig:
root@kakarot$ dig axfr domain.com
root@kakarot$ dig axfr @DnsServer domain.com
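A quick way to test every authoritative server is to feed the NS records from dig into the AXFR request (a sketch combining the commands above):
root@kakarot$ for ns in $(dig +short domain.com NS); do dig axfr domain.com @$ns; done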
Virtual Hosts
Top Tools:
Gobuster (vhost mode)
ffuf
There are three types of virtual hosting:
Name-Based Virtual Hosting
IP-Based Virtual Hosting
Port-Based Virtual Hosting
root@kakarot$ gobuster vhost -u http://<target-ip> -w <wordlist> --append-domain
-u
: The target URL
-w
: Path to the wordlist
--append-domain
: Append main domain from URL to words from wordlist
-t
: Number of concurrent threads (default: 10)
-k
: Skip TLS certificate verification (default: false)
-o
: Output file to write results to (defaults to stdout)
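ffuf can fuzz the Host header for the same purpose (a sketch; the -fs value filters out the default response size and must be tuned per target):
root@kakarot$ ffuf -w <wordlist> -u http://<target-ip> -H "Host: FUZZ.domain.com" -fs <default-response-size>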
Fingerprinting
Top Tools:
Nmap
Wappalyzer
BuiltWith
WhatWeb
wafw00f
Netcraft
Fingerprinting Techniques:
Banner Grabbing
Analysing HTTP Headers
Probing for Specific Responses
Analysing Page Content
Banner Grabbing
root@kakarot$ curl -I domain.com
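The same banner can be grabbed manually with netcat by sending a raw HTTP HEAD request (a sketch for plain HTTP on port 80):
root@kakarot$ printf 'HEAD / HTTP/1.1\r\nHost: domain.com\r\nConnection: close\r\n\r\n' | nc domain.com 80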
Wafw00f
root@kakarot$ wafw00f domain.com
Nikto
root@kakarot$ nikto -h domain.com -Tuning b
-h
: Target host
-Tuning b
: Software Identification
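WhatWeb, listed in the tools above, fingerprints the technology stack in a single command (a sketch; -a 3 raises the aggression level above the stealthy default):
root@kakarot$ whatweb -a 3 domain.com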
Crawling
Types of crawling strategies:
Breadth-First Crawling
Depth-First Crawling
Some Valuable Information:
Comments
Links (Internal and External)
Sensitive Files
MetaData
robots.txt
Structure:
User-agent
Directives
- Disallow
- Allow
- Crawl-delay
- Sitemap
Some /.well-known URIs (fetch examples follow this list):
/.well-known/change-password
openid-configuration
security.txt
mta-sts.txt
assetlinks.json
and more
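Both robots.txt and the .well-known URIs can simply be fetched with curl before crawling (a quick sketch; not every target exposes these files):
root@kakarot$ curl -s https://domain.com/robots.txt
root@kakarot$ curl -s https://domain.com/.well-known/security.txt
root@kakarot$ curl -s https://domain.com/.well-known/openid-configuration | jq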
Top Web Crawlers:
Scrapy (Python Framework)
Apache Nutch (Scalable Crawler)
Burp Suite Spider
OWASP ZAP (Zed Attack Proxy)
Using ReconSpider:
import scrapy
import json
import re
from urllib.parse import urlparse
from scrapy.crawler import CrawlerProcess
from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware


class CustomOffsiteMiddleware(OffsiteMiddleware):
    def should_follow(self, request, spider):
        if not self.host_regex:
            return True
        # This modification allows domains with ports
        host = urlparse(request.url).netloc.split(':')[0]
        return bool(self.host_regex.search(host))


class WebReconSpider(scrapy.Spider):
    name = 'ReconSpider'

    def __init__(self, start_url, *args, **kwargs):
        super(WebReconSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc.split(':')[0]]
        self.visited_urls = set()
        self.results = {
            'emails': set(),
            'links': set(),
            'external_files': set(),
            'js_files': set(),
            'form_fields': set(),
            'images': set(),
            'videos': set(),
            'audio': set(),
            'comments': set(),
        }

    def parse(self, response):
        self.visited_urls.add(response.url)

        # Only process text responses (default to b'' so .decode() works when the header is missing)
        if response.headers.get('Content-Type', b'').decode('utf-8').startswith('text'):
            # Extract emails
            emails = set(re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', response.text))
            self.results['emails'].update(emails)

            # Extract links
            links = response.css('a::attr(href)').getall()
            for link in links:
                if link.startswith('mailto:'):
                    continue
                parsed_link = urlparse(link)
                if not parsed_link.scheme:
                    link = response.urljoin(link)
                if urlparse(link).netloc == urlparse(response.url).netloc:
                    if link not in self.visited_urls:
                        yield response.follow(link, callback=self.parse)
                self.results['links'].add(link)

            # Extract external files (CSS, PDFs, etc.)
            external_files = response.css('link::attr(href), a::attr(href)').re(r'.*\.(css|pdf|docx?|xlsx?)$')
            for ext_file in external_files:
                self.results['external_files'].add(response.urljoin(ext_file))

            # Extract JS files
            js_files = response.css('script::attr(src)').getall()
            for js_file in js_files:
                self.results['js_files'].add(response.urljoin(js_file))

            # Extract form fields
            form_fields = response.css('input::attr(name), textarea::attr(name), select::attr(name)').getall()
            self.results['form_fields'].update(form_fields)

            # Extract images
            images = response.css('img::attr(src)').getall()
            for img in images:
                self.results['images'].add(response.urljoin(img))

            # Extract videos
            videos = response.css('video::attr(src), source::attr(src)').getall()
            for video in videos:
                self.results['videos'].add(response.urljoin(video))

            # Extract audio
            audio = response.css('audio::attr(src), source::attr(src)').getall()
            for aud in audio:
                self.results['audio'].add(response.urljoin(aud))

            # Extract comments
            comments = response.xpath('//comment()').getall()
            self.results['comments'].update(comments)
        else:
            # For non-text responses, just collect the URL
            self.results['external_files'].add(response.url)

        self.log(f"Processed {response.url}")

    def closed(self, reason):
        self.log("Crawl finished, converting results to JSON.")
        # Convert sets to lists for JSON serialization
        for key in self.results:
            self.results[key] = list(self.results[key])
        with open('results.json', 'w') as f:
            json.dump(self.results, f, indent=4)
        self.log("Results saved to results.json")


def run_crawler(start_url):
    process = CrawlerProcess(settings={
        'LOG_LEVEL': 'INFO',
        'DOWNLOADER_MIDDLEWARES': {
            '__main__.CustomOffsiteMiddleware': 500,
        }
    })
    process.crawl(WebReconSpider, start_url=start_url)
    process.start()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="ReconSpider")
    parser.add_argument("start_url", help="The starting URL for the web crawler")
    args = parser.parse_args()
    run_crawler(args.start_url)
root@kakarot$ python3 ReconSpider.py http://domain.com
Search Engine
Search Operators:
Operator | What It Does | Example | Explanation |
---|---|---|---|
site: | Limit search to a single website or domain | site:kakarot.info | See only pages from kakarot.info |
inurl: | Find pages that have a certain word in the URL | inurl:forum site:kakarot.info | Look for forum pages on kakarot.info |
filetype: | Search for specific file types | filetype:pdf site:kakarot.info | Find PDF files on kakarot.info |
intitle: | Look for a word in the page title | intitle:"guide" site:kakarot.info | Find pages with "guide" in the title |
intext: | Search for a word in the main page content | intext:"Dragon Ball" site:kakarot.info | Find pages that mention "Dragon Ball" |
cache: | See Google's stored copy of a page | cache:kakarot.info | View the cached version of kakarot.info |
link: | Find pages that link to a specific site | link:kakarot.info | Discover who links to kakarot.info |
related: | Find websites similar to a domain | related:kakarot.info | See sites related to kakarot.info |
info: | Get basic info about a website | info:kakarot.info | Shows title, description, and summary of kakarot.info |
define: | Get the meaning of a word | define:kakarot | Look up the definition of "kakarot" |
numrange: | Search for numbers in a range | site:kakarot.info numrange:1-100 | Find pages mentioning numbers from 1 to 100 |
allintext: | Find pages containing all words in the content | allintext:"character stats" site:kakarot.info | Pages containing both "character" and "stats" |
allinurl: | Find pages containing all words in the URL | allinurl:forum topic site:kakarot.info | Look for URLs with both "forum" and "topic" |
allintitle: | Find pages containing all words in the title | allintitle:"Dragon Ball guide" site:kakarot.info | Pages with "Dragon Ball" and "guide" in the title |
AND | Make search results include all terms | site:kakarot.info AND intext:"Saiyan" | Only pages with both conditions |
OR | Include any of multiple terms | site:kakarot.info "Goku" OR "Vegeta" | Pages mentioning either "Goku" or "Vegeta" |
NOT | Exclude a term | site:kakarot.info NOT intext:"ads" | Exclude pages that mention ads |
* (wildcard) | Match any word | site:kakarot.info "character * stats" | Search for any word between "character" and "stats" |
.. (range) | Search between two numbers | site:kakarot.info "level" 1..50 | Find pages mentioning levels from 1 to 50 |
" " | Search exact phrase | "Dragon Ball Z guide" site:kakarot.info | Find pages that mention exactly "Dragon Ball Z guide" |
- | Exclude words | site:kakarot.info -intext:"forum" | Exclude pages about forums |
Google Dorking:
- Looking For Login Pages:
  - site:website.com inurl:login
  - site:website.com (inurl:login OR inurl:admin)
- Looking For Exposed Files:
  - site:website.com filetype:pdf
  - site:website.com (filetype:xls OR filetype:docx)
- Looking For Configuration Files:
  - site:website.com inurl:config.php
  - site:website.com (ext:cnf OR ext:conf)
- Locating Database Backups:
  - site:website.com inurl:backup
  - site:website.com filetype:sql
Automating Recon
Top Tools:
FinalRecon
Recon-ng
theHarvester
SpiderFoot
Recon Using FinalRecon:
root@kakarot$ finalrecon -h
usage: finalrecon [-h] [--url URL] [--headers] [--sslinfo] [--whois] [--crawl] [--dns] [--sub] [--dir] [--wayback] [--ps] [--full] [-nb] [-dt DT] [-pt PT] [-T T] [-w W] [-r] [-s] [-sp SP]
[-d D] [-e E] [-o O] [-cd CD] [-k K]
FinalRecon - All in One Web Recon | v1.1.7
options:
-h, --help show this help message and exit
--url URL Target URL
--headers Header Information
--sslinfo SSL Certificate Information
--whois Whois Lookup
--crawl Crawl Target
--dns DNS Enumeration
--sub Sub-Domain Enumeration
--dir Directory Search
--wayback Wayback URLs
--ps Fast Port Scan
--full Full Recon
Extra Options:
-nb Hide Banner
-dt DT Number of threads for directory enum [ Default : 30 ]
-pt PT Number of threads for port scan [ Default : 50 ]
-T T Request Timeout [ Default : 30.0 ]
-w W Path to Wordlist [ Default : wordlists/dirb_common.txt ]
-r Allow Redirect [ Default : False ]
-s Toggle SSL Verification [ Default : True ]
-sp SP Specify SSL Port [ Default : 443 ]
-d D Custom DNS Servers [ Default : 1.1.1.1 ]
-e E File Extensions [ Example : txt, xml, php ]
-o O Export Format [ Default : txt ]
-cd CD Change export directory [ Default : ~/.local/share/finalrecon ]
-k K Add API key [ Example : shodan@key ]
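For example, to run only a few modules instead of a full scan, combine the flags from the help output above:
root@kakarot$ finalrecon --headers --whois --sslinfo --url http://domain.com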