Commit be94895

Update README.md
1 parent 9dc2cf1 commit be94895

1 file changed: +15 -0 lines changed

README.md

Lines changed: 15 additions & 0 deletions
@@ -122,6 +122,21 @@ The spider functionality is what gives Crawlector the capability to find additio
- You may account for outbound/external links as well, for the main page only, via the config. option `add_ext_links`. This feature honours the `exclude_url` and `include_url` config. options.
- You may account for outbound/external links of the main page only, excluding all other URLs, via the config. option `ext_links_only`. This feature honours the `exclude_url` and `include_url` config. options.

# Site Ranking Functionality

- This feature checks the ranking of a website.
- You supply a file in CSV format that lists websites along with their rankings.
- Services that provide website ranking lists include Alexa top-1m (discontinued as of May 2022), [Cisco Umbrella](https://umbrella-static.s3-us-west-1.amazonaws.com/index.html), [Majestic](https://majestic.com/reports/majestic-million), Quantcast, Farsight, and [Tranco](https://tranco-list.eu/), among others.
- CSV file format (2 columns only): the first column holds the ranking, and the second column holds the domain name.
- If a cell contains quoted data, it is automatically dequoted.
- Line breaks aren't allowed in quoted text.
- Leading and trailing spaces are trimmed from the cells read.
- Empty lines and comment lines are skipped.
- The `site_ranking` section in the configuration file provides some options that alter how the CSV file is read.
- The performance of this query depends on the number of records in the CSV file.
- Crawlector compares every entry in the CSV file against the domain being investigated, and not the other way around.
- Only the registered/pay-level domain is compared.

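
The CSV rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not Crawlector's actual implementation: the function names `read_ranking_rows` and `find_rank`, the `#` comment prefix, the comma delimiter, and the sample data are all assumptions made for this example.

```python
import csv
from typing import Iterator, Optional, Tuple

# Hypothetical sample data; real lists come from a ranking provider.
SAMPLE = """\
# rank,domain
1,"example.com"

2 , example.org
"""


def read_ranking_rows(text: str, comment_prefix: str = "#") -> Iterator[Tuple[int, str]]:
    """Yield (rank, domain) pairs from two-column CSV text."""
    # Skip empty lines and comment lines (the "#" prefix is an assumption).
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_prefix)
    ]
    # csv.reader strips the quotes from quoted cells.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 2:
            continue  # assumption: ignore rows that do not have two columns
        rank, domain = (cell.strip() for cell in row)  # trim surrounding spaces
        if not rank.isdigit():
            continue  # assumption: ignore rows whose first cell is not a number
        yield int(rank), domain.lower()


def find_rank(text: str, registered_domain: str) -> Optional[int]:
    """Compare every CSV entry against one domain; return its rank or None."""
    target = registered_domain.strip().lower()
    for rank, domain in read_ranking_rows(text):
        if domain == target:
            return rank
    return None


print(find_rank(SAMPLE, "example.org"))
```

The lookup is a linear scan over the file, which is why its cost grows with the number of records. The sketch expects the caller to pass the registered/pay-level domain already extracted from the URL; that step typically relies on the Public Suffix List and is not shown here.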
# Design Considerations

- A URL page is retrieved by sending a GET request to the server, reading the server response body, and passing it to the Yara engine for detection.
