Textpattern CMS support forum

#1 2018-07-09 08:37:23

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 178

A page of my website is being crawled a lot by the same IP

This is a question to the community I trust most.

Some context:
As I said here before, I maintain a listing of providers claiming to be GDPR compliant, with links to their privacy policies and DPAs.
It’s here: gdpr4saas.eu/providers-list
This page is being crawled by several different IPs (one of them from AWS) about 4 times an hour, even though it changes at most once every 24 hours.
This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…

Question:
What should I do?
Let the scraping happen? Prevent it? How?
Any other suggestion is welcome.

#2 2018-07-09 11:08:56

gaekwad
Member
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 2,327

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312919:

This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…

Put restrictions in place to greatly reduce crawling. There will always be bad bots that ignore the restrictions, but you can stop most of them, including by blocking the IP addresses that keep coming back.

What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.

#3 2018-07-09 11:38:36

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 178

Re: A page of my website is being crawled a lot by the same IP

gaekwad wrote #312920:

What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.

Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?

#4 2018-07-09 11:46:30

gaekwad
Member
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 2,327

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312921:

Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?

Yes. If the bot brings you no value, and/or you’re not sure of its intentions, then block it at the IP address level. You don’t owe it anything, and if you want to protect your work then blocking access is a straightforward step to take.
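
For illustration, a deny rule of the kind discussed here might look like the sketch below. It assumes Apache 2.2 syntax (mod_authz_host) and a host that allows these directives in .htaccess. The addresses are placeholders from the documentation ranges, not the crawler's real IPs, so replace them with the ones from your access log.

```apache
# Apache 2.2 syntax. On Apache 2.4 these directives only work with
# mod_access_compat; the native 2.4 equivalent uses "Require".
# With "Order Allow,Deny", a request that matches both an Allow and a
# Deny rule is rejected, so the Deny lines win.
Order Allow,Deny
Allow from all

# Placeholder single address (hypothetical, replace with the real one)
Deny from 203.0.113.45

# Placeholder CIDR range, useful if the crawler rotates within a pool
Deny from 198.51.100.0/24
```

Blocked clients receive a 403 Forbidden response for every URL covered by the .htaccess file, not just the one page.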

#5 2018-07-09 12:01:17

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 178

Re: A page of my website is being crawled a lot by the same IP

Yes, I was thinking of that.
One thing, though. Is there a chance that these IP addresses are shared with other websites or services? (Being from AWS)

#6 2018-07-09 12:17:13

gaekwad
Member
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 2,327

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312923:

Is there a chance that these IP addresses are shared with other websites or services? (Being from AWS)

In my experience, some AWS instances use a static IP address, and some use one or more pools of IP addresses. From your description, it sounds like somebody has spun up an AWS instance to scrape your content, which is a common thing these days.

Your content has a value to you, and a value to others – you decide how much the content is worth on both sides, and you are very much permitted (expected?) to impose restrictions on content access to your own standards.

Simply, if something is scraping your content without permission, then it’s arguably stealing. You can choose to restrict access to that content.

#7 2018-07-09 15:43:49

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 6,768

Re: A page of my website is being crawled a lot by the same IP

Hi planeth
You could just try to stop the bad bots for now. I maintain a list in my .htaccess file on GitHub: github.com/colak/neme/blob/master/.htaccess#L77
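
As a rough idea of how blocking by user agent can work on Apache 2.2, here is a generic sketch. It is not a copy of the list linked above. The bot names and the `bad_bot` variable are made-up placeholders, and it assumes mod_setenvif is available and usable from .htaccess.

```apache
# Flag requests whose User-Agent header matches a pattern (case-insensitive).
# "ExampleScraper" and "AnotherBot" are hypothetical names; substitute the
# user agents you actually see in your logs.
SetEnvIfNoCase User-Agent "ExampleScraper" bad_bot
SetEnvIfNoCase User-Agent "AnotherBot" bad_bot

# Apache 2.2 syntax: allow everyone except requests flagged above.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

This only stops bots that announce themselves honestly. A scraper that sends a browser-like user agent will pass, which is where IP-level blocking comes in.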


Yiannis
——————————
neme.org | hblack.net | LABS | State Machines | NeMe @ github

#8 2018-07-10 11:31:22

gaekwad
Member
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 2,327

Re: A page of my website is being crawled a lot by the same IP

colak wrote #312930:

You could just try to stop the bad bots for now.

I was quietly hoping you’d appear, colak – you’re a professional at bot blocking in my eyes!
