Web Information Gathering
Web reconnaissance is the systematic collection of information about a target website or web application; it forms the foundation of the information-gathering phase of a penetration test.
Certificate Transparency Logs
Certificate Transparency (CT) logs are public, append-only records of issued TLS certificates; searching them is a fast way to discover subdomains.
Top Tools:
crt.sh
Censys
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u # Coarse filter: keeps whole entries containing "dev"
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[].name_value | split("\n")[] | select(contains("dev"))' | sort -u # Hostname-level extraction: splits multi-name entries, then filters
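To list every name found in the logs rather than only those containing "dev", the same query can be run without the filter (a minimal variant of the commands above):
root@kakarot$ curl -s "https://crt.sh/?q=tesla.com&output=json" | jq -r '.[].name_value | split("\n")[]' | sort -u # All unique hostnames from the CT logs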
WHOIS
WHOIS is a protocol for querying the databases that store domain registration details, used to identify who owns or manages a domain.
root@kakarot$ whois domain.com
Some of the information returned by whois:
Domain Name
Registrar
Registrant Contact
Administrative Contact
Technical Contact
Creation and Expiration Dates
Name Servers
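To pull just these fields out of the raw output, a simple grep works (a sketch; the exact field labels vary between registries and TLDs):
root@kakarot$ whois domain.com | grep -iE 'Registrar:|Registrant|Admin|Tech|Creation Date|Registry Expiry|Name Server'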
DNS
Top Tools:
dig
nslookup
host
dnsenum
fierce
dnsrecon
theHarvester
Online DNS Lookup Services
root@kakarot$ dig domain.com # Default A record lookup
root@kakarot$ dig domain.com A # Get IPv4 address (A record)
root@kakarot$ dig domain.com AAAA # Get IPv6 address (AAAA record)
root@kakarot$ dig domain.com MX # Show mail servers (MX records)
root@kakarot$ dig domain.com NS # Show authoritative name servers (NS)
root@kakarot$ dig domain.com TXT # Show TXT records (SPF, verification, etc.)
root@kakarot$ dig domain.com CNAME # Show CNAME (alias) record
root@kakarot$ dig domain.com SOA # Show SOA (zone authority info)
root@kakarot$ dig @1.1.1.1 domain.com # Query using a specific DNS server (1.1.1.1)
root@kakarot$ dig +trace domain.com # Trace full DNS resolution path
root@kakarot$ dig -x 192.168.1.1 # Reverse lookup: IP -> hostname
root@kakarot$ dig +short domain.com # Short, concise answer only
root@kakarot$ dig +noall +answer domain.com # Show only the ANSWER section
root@kakarot$ dig domain.com ANY # Request all records (may be ignored by servers)
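To collect several record types in one pass, the individual lookups above can be wrapped in a small shell loop (a convenience sketch built from the same dig flags):
root@kakarot$ for type in A AAAA MX NS TXT SOA; do echo "== $type =="; dig +short domain.com $type; done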
Subdomain Bruteforcing
Top Tools:
root@kakarot$ dnsenum --enum inlanefreight.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt -r
--enum
: Shortcut option equivalent to --threads 5 -s 15 -w.
-r, --recursion
: Recursion on subdomains; brute force all discovered subdomains that have an NS record.
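Gobuster's DNS mode is a common alternative for the same brute force (a sketch using the same SecLists wordlist as above):
root@kakarot$ gobuster dns -d inlanefreight.com -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt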
DNS Zone Transfers
A zone transfer (AXFR) replicates an entire DNS zone from a primary name server to a secondary one; if a server is misconfigured to answer AXFR requests from anyone, a single query can dump every record in the zone.
+------------------+ +------------------+
| secondaryServer | | primaryServer |
+------------------+ +------------------+
| |
| --------- AXFR Request (Zone Transfer) ------>|
| |
| <--------- SOA Record (Start of Authority) ---|
| |
+---------------------- loop ------------------ +
| |
| | <--------- DNS Record -------------------| |
| | | |
| +------------------- end loop -------------+ |
| |
| <---------- Zone Transfer Complete -----------|
| |
| --------- ACK (Acknowledgement) ------------->|
| |
+------------------+ +------------------+
| secondaryServer | | primaryServer |
+------------------+ +------------------+
Exploiting Zone Transfers Using Dig:
root@kakarot$ dig axfr domain.com
root@kakarot$ dig axfr @DnsServer domain.com
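A quick way to test every authoritative server is to feed the NS records from dig into the AXFR request (a sketch combining the commands above):
root@kakarot$ for ns in $(dig +short domain.com NS); do dig axfr domain.com @$ns; done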
Virtual Hosts
Top Tools:
Gobuster (vhost mode)
ffuf
There are three types of virtual hosting:
Name-Based Virtual Hosting
IP-Based Virtual Hosting
Port-Based Virtual Hosting
root@kakarot$ gobuster vhost -u http://<target-ip> -w <wordlist> --append-domain
-u
: The target URL
-w
: Path to the wordlist
--append-domain
: Append main domain from URL to words from wordlist
-t
: Number of concurrent threads (default: 10)
-k
: Skip TLS certificate verification (default: false)
-o
: Output file to write results to (defaults to stdout)
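ffuf can fuzz the Host header for the same purpose (a sketch; the -fs value filters out the default response size and must be tuned per target):
root@kakarot$ ffuf -w <wordlist> -u http://<target-ip> -H "Host: FUZZ.domain.com" -fs <default-response-size>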
Fingerprinting
Top Tools:
Nmap
Wappalyzer
BuiltWith
WhatWeb
wafw00f
Netcraft
Fingerprinting Techniques:
Banner Grabbing
Analysing HTTP Headers
Probing for Specific Responses
Analysing Page Content
Banner Grabbing
root@kakarot$ curl -I domain.com
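The same banner can be grabbed manually with netcat by sending a raw HTTP HEAD request (a sketch for plain HTTP on port 80):
root@kakarot$ printf 'HEAD / HTTP/1.1\r\nHost: domain.com\r\nConnection: close\r\n\r\n' | nc domain.com 80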
Wafw00f
root@kakarot$ wafw00f domain.com
Nikto
root@kakarot$ nikto -h domain.com -Tuning b
-h
: Target host
-Tuning b
: Software Identification
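WhatWeb, listed in the tools above, fingerprints the technology stack in a single command (a sketch; -a 3 raises the aggression level above the stealthy default):
root@kakarot$ whatweb -a 3 domain.com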
Crawling
Types of crawling strategies:
Breadth-First Crawling
Depth-First Crawling
Some Valuable Information:
Comments
Links (Internal and External)
Sensitive Files
MetaData
robots.txt
Structure:
User-agent
Directives
- Disallow
- Allow
- Crawl-delay
- Sitemap
Some /.well-known URIs (fetch examples follow this list):
/.well-known/change-password
openid-configuration
security.txt
mta-sts.txt
assetlinks.json
and more
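Both robots.txt and the .well-known URIs can simply be fetched with curl before crawling (a quick sketch; not every target exposes these files):
root@kakarot$ curl -s https://domain.com/robots.txt
root@kakarot$ curl -s https://domain.com/.well-known/security.txt
root@kakarot$ curl -s https://domain.com/.well-known/openid-configuration | jq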
Top Web Crawlers:
Scrapy (Python Framework)
Apache Nutch (Scalable Crawler)
Burp Suite Spider
OWASP ZAP (Zed Attack Proxy)
Using ReconSpider:
import scrapy
import json
import re
from urllib.parse import urlparse
from scrapy.crawler import CrawlerProcess
from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware


class CustomOffsiteMiddleware(OffsiteMiddleware):
    def should_follow(self, request, spider):
        if not self.host_regex:
            return True
        # This modification allows domains with ports
        host = urlparse(request.url).netloc.split(':')[0]
        return bool(self.host_regex.search(host))


class WebReconSpider(scrapy.Spider):
    name = 'ReconSpider'

    def __init__(self, start_url, *args, **kwargs):
        super(WebReconSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc.split(':')[0]]
        self.visited_urls = set()
        self.results = {
            'emails': set(),
            'links': set(),
            'external_files': set(),
            'js_files': set(),
            'form_fields': set(),
            'images': set(),
            'videos': set(),
            'audio': set(),
            'comments': set(),
        }

    def parse(self, response):
        self.visited_urls.add(response.url)

        # Only process text responses (default to b'' so .decode() works when the header is missing)
        if response.headers.get('Content-Type', b'').decode('utf-8').startswith('text'):
            # Extract emails
            emails = set(re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', response.text))
            self.results['emails'].update(emails)

            # Extract links
            links = response.css('a::attr(href)').getall()
            for link in links:
                if link.startswith('mailto:'):
                    continue
                parsed_link = urlparse(link)
                if not parsed_link.scheme:
                    link = response.urljoin(link)
                if urlparse(link).netloc == urlparse(response.url).netloc:
                    if link not in self.visited_urls:
                        yield response.follow(link, callback=self.parse)
                self.results['links'].add(link)

            # Extract external files (CSS, PDFs, etc.)
            external_files = response.css('link::attr(href), a::attr(href)').re(r'.*\.(css|pdf|docx?|xlsx?)$')
            for ext_file in external_files:
                self.results['external_files'].add(response.urljoin(ext_file))

            # Extract JS files
            js_files = response.css('script::attr(src)').getall()
            for js_file in js_files:
                self.results['js_files'].add(response.urljoin(js_file))

            # Extract form fields
            form_fields = response.css('input::attr(name), textarea::attr(name), select::attr(name)').getall()
            self.results['form_fields'].update(form_fields)

            # Extract images
            images = response.css('img::attr(src)').getall()
            for img in images:
                self.results['images'].add(response.urljoin(img))

            # Extract videos
            videos = response.css('video::attr(src), source::attr(src)').getall()
            for video in videos:
                self.results['videos'].add(response.urljoin(video))

            # Extract audio
            audio = response.css('audio::attr(src), source::attr(src)').getall()
            for aud in audio:
                self.results['audio'].add(response.urljoin(aud))

            # Extract comments
            comments = response.xpath('//comment()').getall()
            self.results['comments'].update(comments)
        else:
            # For non-text responses, just collect the URL
            self.results['external_files'].add(response.url)

        self.log(f"Processed {response.url}")

    def closed(self, reason):
        self.log("Crawl finished, converting results to JSON.")
        # Convert sets to lists for JSON serialization
        for key in self.results:
            self.results[key] = list(self.results[key])
        with open('results.json', 'w') as f:
            json.dump(self.results, f, indent=4)
        self.log("Results saved to results.json")


def run_crawler(start_url):
    process = CrawlerProcess(settings={
        'LOG_LEVEL': 'INFO',
        'DOWNLOADER_MIDDLEWARES': {
            '__main__.CustomOffsiteMiddleware': 500,
        }
    })
    process.crawl(WebReconSpider, start_url=start_url)
    process.start()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="ReconSpider")
    parser.add_argument("start_url", help="The starting URL for the web crawler")
    args = parser.parse_args()
    run_crawler(args.start_url)
root@kakarot$ python3 ReconSpider.py http://domain.com
Search Engine
Search Operators:
Operator | What It Does | Example | Explanation |
---|---|---|---|
site: | Limit search to a single website or domain | site:kakarot.info | See only pages from kakarot.info |
inurl: | Find pages that have a certain word in the URL | inurl:forum site:kakarot.info | Look for forum pages on kakarot.info |
filetype: | Search for specific file types | filetype:pdf site:kakarot.info | Find PDF files on kakarot.info |
intitle: | Look for a word in the page title | intitle:"guide" site:kakarot.info | Find pages with "guide" in the title |
intext: | Search for a word in the main page content | intext:"Dragon Ball" site:kakarot.info | Find pages that mention "Dragon Ball" |
cache: | See Google's stored copy of a page | cache:kakarot.info | View the cached version of kakarot.info |
link: | Find pages that link to a specific site | link:kakarot.info | Discover who links to kakarot.info |
related: | Find websites similar to a domain | related:kakarot.info | See sites related to kakarot.info |
info: | Get basic info about a website | info:kakarot.info | Shows title, description, and summary of kakarot.info |
define: | Get the meaning of a word | define:kakarot | Look up the definition of "kakarot" |
numrange: | Search for numbers in a range | site:kakarot.info numrange:1-100 | Find pages mentioning numbers from 1 to 100 |
allintext: | Find pages containing all words in the content | allintext:"character stats" site:kakarot.info | Pages containing both "character" and "stats" |
allinurl: | Find pages containing all words in the URL | allinurl:forum topic site:kakarot.info | Look for URLs with both "forum" and "topic" |
allintitle: | Find pages containing all words in the title | allintitle:"Dragon Ball guide" site:kakarot.info | Pages with "Dragon Ball" and "guide" in the title |
AND | Make search results include all terms | site:kakarot.info AND intext:"Saiyan" | Only pages with both conditions |
OR | Include any of multiple terms | site:kakarot.info "Goku" OR "Vegeta" | Pages mentioning either "Goku" or "Vegeta" |
NOT | Exclude a term | site:kakarot.info NOT intext:"ads" | Exclude pages that mention ads |
* (wildcard) | Match any word | site:kakarot.info "character * stats" | Search for any word between "character" and "stats" |
.. (range) | Search between two numbers | site:kakarot.info "level" 1..50 | Find pages mentioning levels from 1 to 50 |
" " | Search exact phrase | "Dragon Ball Z guide" site:kakarot.info | Find pages that mention exactly "Dragon Ball Z guide" |
- | Exclude words | site:kakarot.info -intext:"forum" | Exclude pages about forums |
Google Dorking:
- Looking For Login Pages:
  - site:website.com inurl:login
  - site:website.com (inurl:login OR inurl:admin)
- Looking For Exposed Files:
  - site:website.com filetype:pdf
  - site:website.com (filetype:xls OR filetype:docx)
- Looking For Configuration Files:
  - site:website.com inurl:config.php
  - site:website.com (ext:cnf OR ext:conf)
- Locating Database Backups:
  - site:website.com inurl:backup
  - site:website.com filetype:sql
Automating Recon
Top Tools:
FinalRecon
Recon-ng
theHarvester
SpiderFoot
Recon Using FinalRecon:
root@kakarot$ finalrecon -h
usage: finalrecon [-h] [--url URL] [--headers] [--sslinfo] [--whois] [--crawl] [--dns] [--sub] [--dir] [--wayback] [--ps] [--full] [-nb] [-dt DT] [-pt PT] [-T T] [-w W] [-r] [-s] [-sp SP]
[-d D] [-e E] [-o O] [-cd CD] [-k K]
FinalRecon - All in One Web Recon | v1.1.7
options:
-h, --help show this help message and exit
--url URL Target URL
--headers Header Information
--sslinfo SSL Certificate Information
--whois Whois Lookup
--crawl Crawl Target
--dns DNS Enumeration
--sub Sub-Domain Enumeration
--dir Directory Search
--wayback Wayback URLs
--ps Fast Port Scan
--full Full Recon
Extra Options:
-nb Hide Banner
-dt DT Number of threads for directory enum [ Default : 30 ]
-pt PT Number of threads for port scan [ Default : 50 ]
-T T Request Timeout [ Default : 30.0 ]
-w W Path to Wordlist [ Default : wordlists/dirb_common.txt ]
-r Allow Redirect [ Default : False ]
-s Toggle SSL Verification [ Default : True ]
-sp SP Specify SSL Port [ Default : 443 ]
-d D Custom DNS Servers [ Default : 1.1.1.1 ]
-e E File Extensions [ Example : txt, xml, php ]
-o O Export Format [ Default : txt ]
-cd CD Change export directory [ Default : ~/.local/share/finalrecon ]
-k K Add API key [ Example : shodan@key ]
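For example, to run only a few modules instead of a full scan, combine the flags from the help output above:
root@kakarot$ finalrecon --headers --whois --sslinfo --url http://domain.com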