Avatar image for rick
#2 Edited by rick (119 posts) - - Show Bio

Your ISP probably has a lot of bad actors on it who are trying to hurt us. That is why you have to prove you're not a malicious bot every 24 hours. This will probably remain the case for as long as you're on that ISP. Sorry. If you send your IP address to ipbans@gamespot.com they can give you more information.

Avatar image for atomicfox
#4 Posted by atomicfox (1 posts) - - Show Bio

I verified my IP an hour or so ago (at least, I went to Comic Vine and completed the "I'm not a robot" captcha), but I still seem to be blocked when I try to use the ComicRack scraper. Is there a time frame on how long an address stays limited?

Avatar image for fieldhouse
#5 Posted by fieldhouse (3 posts) - - Show Bio

Could we please get some guidelines on acceptable use of the API? As of today we are hunting around in the dark, trying to make sense of the little information that has been given out by a variety of sources. Whether intentional or not, there is currently a 72-hour API ban being placed on any IP address that has exceeded some unknown set of criteria.

ipbans@gamespot has stated that the malicious-bot blocks expire after 24 hours: "Blocks expire after 24 hours unless an admin manually changes that for super malicious stuff." (per email from ipbans)

They don't. In fact, other correspondence implies a longer period: "it seems like they may put at least a 72 hour block on IP's for scraping" and "Been banned for the last 3 days as well. Guess they messed up."

The 24 hours mentioned does match the interval between "are you human" captcha prompts, so it appears that "blocked" IPs are required to prove their "humanity" every 24 hours and are unable to use the API at all for the 72-hour duration of the ban.

The author of the ComicVine plugin has no additional information on this issue: "I'm just struggling because there's not much I can do about it. Comic Vine has effectively broken their web API by banning certain unlucky users, and I'm not sure how to convince them to stop doing that."

We have confirmed that the ComicVine API is still accessible through a browser GET request: you can use a browser to successfully visit an API link. This indicates that some authentication layer has been placed in front of the REST API. This layer requires cookie authentication, which in turn requires a human to interpret captchas at least once every 24 hours. This change to the API, and the additional authentication on top of the already-required individual API key, has not been described or explained.

I don't believe the majority of ComicVine plugin users are being intentionally malicious, and without any information or guidelines on API usage it is extremely difficult to comply with "acceptable" API usage. Please let us know what guidelines are being used to determine "acceptable" API use.

Is the 72 hour API ban intentional? Can the API be exempted from the scraping ban?

What information is being communicated to API key holders when an IP ban is put in place? (We have not found any, but we are hopeful that we have just overlooked it.)

Hopefully we can get this issue resolved relatively quickly and all of us can get back to enjoying our comic books instead of debugging API calls. :)

tomf

Avatar image for solidus0079
#6 Posted by solidus0079 (11 posts) - - Show Bio

I would also like to know the exact guidelines Comicvine has for its API use. I'm happy to follow them; we just don't know what they are! Being blocked at apparently random intervals is not something we can easily learn from.

Avatar image for tuathane
#7 Posted by tuathane (3 posts) - - Show Bio

I too am blocked, and would appreciate any information as to when / if this can be resolved.

Avatar image for cbanack
#8 Posted by cbanack (124 posts) - - Show Bio

@edgework:

Guys, is it possible that the new security measures (the ones where you have to prove you're not a bot) are unintentionally breaking the web API? Users of the web API don't have any way to prove that they aren't a bot, so this new security measure effectively bans those users from the web API altogether. This might just be an oversight, since the web API already has strictly enforced usage limitations to prevent abuse--so there's really no need to check for bots when using the web API.

Couldn't this problem be fixed simply by making http://api.comicvine.com exempt from the new security measures?

Avatar image for iohanr
#9 Posted by iohanr (1 posts) - - Show Bio

I am also blocked. I use ComicRack/ComicVine Scraper plugin. I have not scraped anything in a week, and now just trying to scrape 1 book fails and the log says "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> "

Would appreciate any advice on what to do to get myself onto the "unblocked" list. I am for sure not doing any "abusive" downloading/scraping from the API, unless you count 100-300 scrapes a week as abusive... I thought that was already taken care of by the limits on my individual API key.

Avatar image for hyperspacerebel
#10 Posted by Hyperspacerebel (8 posts) - - Show Bio

So, I fixed my problem with cbanack's plugin for ComicRack. I don't know if everyone had the exact same problem as me, but I know a couple did at least.

I went into the cvconnection.py file in the Script folder, and changed all instances of comicvine.com to www.comicvine.com, and now it's working flawlessly again. So far....

Make sure to restart ComicRack after making the change for it to go through.
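If you'd rather script the edit than do it by hand, here's a rough sketch. The filename comes from the post above; the function name and regex are illustrative, and the actual contents of cvconnection.py may differ, so check the result before relying on it:

```python
# One-off helper to apply the fix described above: rewrite every bare
# "comicvine.com" reference in cvconnection.py to "www.comicvine.com".
# Adjust the path to wherever your ComicRack Scripts folder lives.
import re

def patch_cvconnection(path="cvconnection.py"):
    with open(path, "r", encoding="utf-8") as f:
        src = f.read()
    # Only touch hosts that don't already carry the "www." prefix.
    patched = re.sub(r"(?<!www\.)comicvine\.com", "www.comicvine.com", src)
    with open(path, "w", encoding="utf-8") as f:
        f.write(patched)
    return src != patched  # True if anything changed
```

Running it a second time is a no-op, since already-prefixed hosts are skipped. As the post says, restart ComicRack afterwards for the change to take effect.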

Avatar image for kinoregis
#11 Posted by kinoregis (1 posts) - - Show Bio

I tried that but it does not work for me.

Avatar image for rick
#12 Edited by rick (119 posts) - - Show Bio

A lot of this turns out to be our data center blocking API requests as malicious. This is probably because ComicRack does some crazy things. Also, it seems there's been an update to ComicRack that makes this worse. We really don't support ComicRack in the first place; it's been a thorn in the engineers' side since it's been a thing. ComicRack needs to be throttled back significantly. We'll be adjusting the API soon to deal with this.

In any case the data center blocking has been stopped.

Avatar image for overmonitor
#13 Posted by Overmonitor (1515 posts) - - Show Bio

Are all these single-digit posters from an app of some sort?

Avatar image for fieldhouse
#14 Edited by fieldhouse (3 posts) - - Show Bio

@edgework said:

A lot of this turns out to be our data center blocking API requests as malicious. This is probably because ComicRack does some crazy things. Also, it seems there's been an update to ComicRack that makes this worse. We really don't support ComicRack in the first place; it's been a thorn in the engineers' side since it's been a thing. ComicRack needs to be throttled back significantly. We'll be adjusting the API soon to deal with this.

In any case the data center blocking has been stopped.

Thanks for the update, @edgework. As a couple of people have mentioned, we're happy to work on changing the CVS behavior if we have some idea of what needs to change. We haven't gotten much feedback other than the IP bans.

ipbans@gamespot just said:

Sorry, we can't comment on security procedures in the forums, nor can we give any advice about how to avoid getting banned other than "don't look malicious."

Which doesn't give us much to go on. :(

Cory's code is on GitHub and was linked above: https://github.com/cbanack/comic-vine-scraper If there's anything "crazy" logic-wise, we would love to know.

Are all these single-digit posters from an app of some sort?

Sorry for the influx of chatter, @overmonitor. I posted a link to this thread on the ComicRack forum, so I would assume those replies are from a few of the people impacted by the API block. We have each individually registered for an API key in order to use the ComicVine plugin.

Avatar image for krandor
#15 Posted by krandor (11 posts) - - Show Bio

@edgework said:

A lot of this turns out to be our data center blocking API requests as malicious. This is probably because ComicRack does some crazy things. Also, it seems there's been an update to ComicRack that makes this worse. We really don't support ComicRack in the first place; it's been a thorn in the engineers' side since it's been a thing. ComicRack needs to be throttled back significantly. We'll be adjusting the API soon to deal with this.

In any case the data center blocking has been stopped.

I will add, just as a suggestion, that there are a LOT of ComicRack users who would have no problem paying for a "premium" subscription to ComicVine to avoid a lot of the throttling and to help you guys scale better to handle the load.

Avatar image for cbanack
#16 Edited by cbanack (124 posts) - - Show Bio

Hi guys, I am the developer who built and maintains Comic Vine Scraper (CVS), which is the ComicRack plugin that makes use of the Comic Vine API.

Your comments are a little harsh, @edgework. I thought I'd better respond so at least you guys know I'm here, in case you ever want to contact me.

@edgework said:

This is probably because ComicRack does some crazy things.

I assure you, it does not. As mentioned above, the source code is there for all to peruse, and I am happy to guide you to all the relevant sections. I am a capable, responsible engineer who has gone out of his way to use the Comic Vine API respectfully and as minimally as possible. You guys have put strictly enforced limits on the API usage, so I have a lot of pressure from some of my users to find ways to use the API more efficiently so they don't bump up against those limits.

@edgework said:

Also it seems there's been an update to ComicRack that makes this worse.

I can confirm that there have been NO updates to how ComicRack/CVS accesses the Comic Vine API for years. The last minor update was in June, and has no effect on the API. It is possible that some individuals have extended or modified their own version of CVS, since it is an open source app. But since each user is required to use their own API key, even those users have no way of going beyond Comic Vine's prescribed usage limits.

@edgework said:

We really don't support ComicRack in the first place, it's been a thorn in the engineers' side since its been a thing.

Well, no one expects you to provide technical support for ComicRack. But supporting your own public API is another story. A lot of @overmonitor's 'single digit posters' are coming to this thread from the ComicRack forums because I sent them here. My inbox has been filling up with complaints about how my app seems to be broken. As I told them, "only the Comic Vine engineers can fix this problem for you".

I have to say, I really resent my application (which represents 7 years of blood and sweat, and a lot of high-quality work) being referred to as a "thorn in the engineers' side" simply because it has a large number of users. If my app is popular, that helps make Comic Vine popular.

The fact is, I have always been careful to work within whatever (poorly publicized) limits that Comic Vine wishes to impose. Over the years I have made major modifications to my application at the request of Comic Vine engineers, and I remain completely willing to do so. ComicRack users are generally big fans of Comic Vine, and my app drives users to the Comic Vine website. Whenever I hear Comic Vine publicly complaining about me or my app, I am forced to wonder why you guys even bother having a public API if you don't want us to use it?

Also, as mentioned above (and hundreds of times on the ComicRack forums) we have a legion of ComicRack users who would be more than happy to pay for extra access to the ComicVine API. Since everyone already has their own API key, that might not even be too hard to implement on your end.

@edgework said:

In any case the data center blocking has been stopped.

That's great news, thank you. A quick glance at my e-mail suggests that the number of bug reports I am receiving has already dropped significantly.

Avatar image for rick
#17 Edited by rick (119 posts) - - Show Bio

@cbanack: I say it's a thorn in our sides because it exposes the weaknesses in the API, not because your app sucks. Sorry if it came across that way. That said, you really need to throttle that thing, especially when doing such a large data pull. We can always tell when a new user is running it because it spikes our database usage. It's our fault that the API allows that, and we're going to be making changes so it doesn't any more. All your requests should be paged, with no more than one completed every second.

I assumed at first that you had made a change, because all the error logs showed the ComicRack API pattern. That pattern does look malicious, which is why it triggered the DDoS firewall. Too much, too fast.
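For plugin authors reading along, the one-request-per-second rule above can be enforced client-side with something as small as this. A sketch only; the class and method names are illustrative, not from the actual plugin:

```python
# Minimal client-side throttle: guarantees a minimum interval between
# successive API calls, no matter how fast the surrounding code runs.
import time

class Throttle:
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0  # timestamp of the previous call

    def wait(self):
        """Sleep just long enough that calls are >= min_interval apart."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each GET keeps the client under the stated one-per-second limit; the first call goes through without delay.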

Avatar image for hyperspacerebel
#18 Posted by Hyperspacerebel (8 posts) - - Show Bio

@edgework: Not to preempt cbanack's reply, as he probably knows much more about it than me, but I'm curious about you asking us to throttle to no more than 1 request a second. As it stands, the absolute fastest I can get my tagger to run (e.g. a perfectly named series of issues where the tagger can chug along without any user input) is 59 issues in 2 minutes 15 seconds. Now, my understanding is that 59 issues of the same series should invoke approximately 60 API calls total (1 for the initial series grab, though it might be a couple more if it's paginated in some way, and then 1 for each issue). That puts my personal usage at around 1 request every 2.25 seconds, and that is only on the really rare occasion that I actually have a large, perfectly-named series to tag. Most of the time there is user input involved for what I need to tag, which adds tens of seconds of delay.

So, ultimately, my question: is my usage within acceptable parameters? The tagger gives the user the ability to add a delay to each scrape, and I'm more than happy to use it if that's what you guys want. But if the culprits you're really after are people hitting the API more than once a second, then I know for sure I'm not one of them, and I don't imagine anyone using the scraper like I do could be either.

Avatar image for jothay
#19 Posted by Jothay (3 posts) - - Show Bio

I'm getting into details you probably don't want to share right now, but here's a series of questions that would illuminate the underlying issues if answered:

  1. Why are you allocating time and resources to 'breaking' the API further with more throttling, paging, and other undocumented changes, making it harder and harder to code against and use, instead of improving the underlying architecture so it can handle virtually any volume of requests without issue?
  2. Is it a matter of the person in charge of the code having had this huge system dumped in his lap because another guy quit, and not knowing how to take care of it? I can totally understand that; it's happened to me.
  3. Are you stuck on some ancient, extremely-difficult-to-manage relational database from the early 90's? Some organizations refuse to ever change from an initial platform, regardless of how much it hurts them not to update to modern tools.
  4. Do you need the assistance of a person or persons not directly working for CV to help locate these pressure points, debug them, and ultimately solve them? Granting access on a restricted basis to people you could use as 'consultants' could cut down the time it takes to git-er-done.
  5. Do you have live statistics available showing how many hits the API is getting on a constant basis and how those hits are being handled by your 1 or more servers under load balancing?
  6. How heavily has the database itself been analyzed for performance?
  7. Do you have proper indexing on all the tables to speed up calls commonly made by the API?
  8. What frameworks are you using to make these calls (between the services layer and the DB itself)? EntityFramework? Straight SQL requests? How are you specifically calling Relational data to a primary record (like Batman Comic with all Issues for that Comic), could that process be optimized?
  9. How long on average does it take for a query of 'Batman' to come back? 1ms? 100ms? 1s? 5s? Anything more than a few ms tells me that something is very wrong and would explain all of these issues.
  10. How big is the database itself? I can't imagine you have millions of Comic Titles, but certainly a few hundred thousand issues of comics. Are you storing your Thumbnails directly inside the database or just the file paths to where they are stored physically on the server (for use in the website)?

I'd be happy to take a look at the architecture of your system (in private of course) and outline pain points and how to solve them. I'll even help make some fixes for free! (The payment I get out of it is no more issues scraping as many comics as I want because your system could handle it.)

Avatar image for cbanack
#20 Edited by cbanack (124 posts) - - Show Bio

@edgework: ComicRack's scraper app is already throttled to do no more than one comic book per second, and there is also an option to increase (but not decrease) that length of time. However, locating a single comic book requires an inordinately large amount of API usage, which probably explains why you still see clusters of API hits. Allow me to describe the problem in general terms, as this is something that I'm sure every API user is facing.

Take a particularly bad but very common example: a long running comic with a single word name. Say, 'Batman', 'Spider-man', 'X-men', 'Avengers', 'Superman', 'Iron Man', 'The Hulk', etc.

When my app user wants to access metadata for her latest Batman comic, my app queries the API for 'Batman'. But the API returns at most 100 volumes per page, and there are over 1000 volumes with the word 'Batman' in them. So I have to do 10 API hits in a row just to read all the pages and show the user a list of available 'Batman' volumes to choose from. Then, based on the user's choice (or on an automatic choice by my app) I have to do another API hit to find the issue she is looking for. Then I have to do one more API hit to finally obtain Comic Vine metadata about the specific issue of Batman that my user is interested in.

This is a very inefficient API design. It means API users have to do a whole lot of API hits to accomplish what is undoubtedly the most common task that the API is ever used for: to find Comic Vine metadata about a particular issue of a particular volume. Even worse, there is very little user interaction here--by the time my app starts talking to the API, it pretty much knows exactly what volume and issue it wants. So the user waits while the app is forced to pound out 8 or 10 or 12 API hits just to get data about a single comic book. :(

I think solving the problem might be as simple as allowing a much, much larger page size in the search results and volume issue listings. Returning one page with all those results is going to be less resource intensive than returning all those same results spread out over many pages. This improvement could be particularly effective if it was also paired with a strategy to cache search terms and volume issue listings that return a large (>100) number of results. Even a cache that was just refreshed on a nightly basis would work pretty well.

I believe that if you guys did this, you would see a significant improvement when it comes to the resource usage of your API.

Thanks for reading!

Avatar image for cbanack
#21 Posted by cbanack (124 posts) - - Show Bio

Oh yeah, another way to deal with the paging issue I just described might be to give us an "exact match" option when doing string-based searches with the API.

Then the search for 'Batman' would only return one volume!
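Until something like that exists server-side, a client can at least post-filter a fetched page down to exact name matches. Sketch only, with result dicts shaped like the field_list used in this thread; note this saves no API hits, it just trims the list shown to the user:

```python
# Client-side stand-in for the missing exact-match option: keep only
# volumes whose name matches the query exactly, case-insensitively.
def exact_matches(results, name):
    wanted = name.strip().lower()
    return [r for r in results
            if r.get("name", "").strip().lower() == wanted]
```

So `exact_matches(page["results"], "Batman")` drops "Batman Adventures" and friends, but only a server-side exact-match switch would actually reduce the paging cost.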

Avatar image for jkthemac
#22 Edited by JKtheMac (6 posts) - - Show Bio

From a user perspective, I can second this. I have a workflow whereby I add into the ComicRack database the new comics I have either bought or seen on the stands and may be interested in reading. Once that data is there, I can do very quick and easy searching with very sophisticated queries. This helps me write reviews, cross-reference comics for publishing reading lists, and generally access your data without hitting your excellent website multiple times per hour.

However, as I look down the list of new comics, I note that I am adding on the order of 40-50 comics to my database each week. Some weeks that can include 10-20 of these vague single-name titles like Batman and X-Men. That means what should be about 100 hits to your API each Wednesday can easily turn into 400 hits for no real reason, all because the database front end can't ask for exact matching.

Grabbing data for approximately 50 comics should be relatively painless and ultimately reduce the load on your servers, but the current system turns that into a potentially excessive load every Wednesday and Thursday.

Avatar image for rick
#23 Edited by rick (119 posts) - - Show Bio

@cbanack: Here's an actual sample from our server logs. Notice how often two requests per second are made. That's what I'm talking about.

74.x.x.x 2015-10-05 14:19:52 GET api/search/?api_key=xxx&client=cvscraper&format=xml&li

74.x.x.x 2015-10-05 14:19:53 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:19:56 GET api/issue/4000-445851/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:19:56 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:19:57 GET api/volume/4050-38558/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:19:59 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:04 GET api/issue/4000-402327/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:04 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:07 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:08 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:08 GET api/issue/4000-342687/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:10 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:11 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:12 GET api/issue/4000-321672/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:14 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:15 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:15 GET api/issue/4000-342686/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:17 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:19 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:20 GET api/issue/4000-427159/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:22 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:23 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:24 GET api/issue/4000-342688/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:26 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:46 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:46 GET api/issue/4000-358868/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:48 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:51 GET api/issue/4000-260228/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:51 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:53 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:55 GET api/issue/4000-260247/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:20:55 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:58 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:20:59 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:21:00 GET api/issue/4000-358871/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:21:01 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:21:04 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:21:05 GET api/issue/4000-260248/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:21:07 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:21:08 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:21:09 GET api/issue/4000-381772/?api_key=xxx&client=cvscraper&fo

74.x.x.x 2015-10-05 14:22:47 GET api/search/?api_key=xxx&client=cvscraper&format=xml&li

74.x.x.x 2015-10-05 14:22:48 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

74.x.x.x 2015-10-05 14:22:52 GET api/issues/?api_key=xxx&client=cvscraper&format=xml&fi

Avatar image for rick
#24 Posted by rick (119 posts) - - Show Bio

The thing is, you're using the API to essentially copy our database to users' machines. Our API isn't meant for that. What you want to do with the data is something we don't provide. If you want to do interesting things with our data, you should load your own central database with the API or feeds, run all your queries against that, and limit what goes over the wire to what's necessary. We have no problem with you copying our entire database for this kind of thing. You can design the schema that best fits your needs. Just dumping the DB in whole or in part on the client is wasteful and anachronistic. That kind of thing will definitely be limited very soon.
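A rough sketch of that central-cache idea: mirror the fields you need into a local SQLite table once, then answer searches locally instead of paying an API hit per lookup. Schema and names are illustrative, not anything Comic Vine prescribes:

```python
# Local mirror of volume metadata: populate it once from the API (or a
# feed), then serve searches from it with zero further API traffic.
import sqlite3

def open_cache(path="cv_cache.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS volumes (
                      id INTEGER PRIMARY KEY,
                      name TEXT NOT NULL,
                      start_year TEXT,
                      count_of_issues INTEGER)""")
    db.execute("CREATE INDEX IF NOT EXISTS idx_volumes_name ON volumes(name)")
    return db

def store_volumes(db, rows):
    # rows are dicts shaped like the API's volume results.
    db.executemany("INSERT OR REPLACE INTO volumes VALUES (?,?,?,?)",
                   [(r["id"], r["name"], r.get("start_year"),
                     r.get("count_of_issues")) for r in rows])
    db.commit()

def local_search(db, name):
    """Exact-name lookup served from the cache: zero API hits."""
    cur = db.execute("SELECT id, name FROM volumes WHERE name = ?", (name,))
    return cur.fetchall()
```

The refresh job (however often it runs) is then the only thing that touches the API, which is exactly the traffic shape being asked for here.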

Avatar image for rick
#25 Posted by rick (119 posts) - - Show Bio

To specifically address your "batman" example: you will likely only need 2 API queries. (Using JSON for this example only to simplify the results.)

http://www.comicvine.com/api/search/?api_key=xxx&limit=1&resources=volume&field_list=id&format=json&query=batman

gives you this

{
  error: "OK",
  limit: 1,
  offset: 0,
  number_of_page_results: 1,
  number_of_total_results: 1003,
  status_code: 1,
  results: [
    { id: 51951, resource_type: "volume" }
  ],
  version: "1.0"
}

Take number_of_total_results, subtract (say) 100 from it, and use that as the offset parameter. Then run

http://www.comicvine.com/api/search/?api_key=xxx&format=json&limit=100&offset=903&resources=volume&field_list=name,start_year,publisher,id,image,count_of_issues&query=batman

That will be the last page. You didn't need to go through the other 9 pages to get to it.

Avatar image for cbanack
#26 Posted by cbanack (124 posts) - - Show Bio

When I first started work on the Comic Vine Scraper tool 7 years ago, I was very open and upfront with the Comic Vine engineers about what I was doing. I've continued to be very upfront about it over the years, working with Comic Vine as needed to make changes and write more efficient API calls as the number of users of my app grew. At no point did anyone from Comic Vine ever suggest that my app was using the database incorrectly or violating any terms of service, even when I asked that question directly.

@edgework said:

What you want to do with the data is something we don't provide.

I think what you mean is: "What you have been doing for 7 years with the data is something we no longer provide."

@edgework said:

Our API isn't meant for that.

What you want to do with the data is something we don't provide.

Just dumping the DB in whole or in part on the client is wasteful and anachronistic.

That kind of thing will definitely be limited very soon.

Here's the thing: you (Comic Vine, not you personally) are clearly no longer an ally in this project. You obviously view ComicRack/Comic Vine Scraper as a problem and a nuisance, not a beneficial use of your web API. And remember, I've been doing all this work and maintaining this project for free in my spare time. So I don't want to continue pouring time and energy into something that you guys obviously don't support and are more than likely going to shut down or render unusable in the near future.

So I will shut the project down. On the bright side, that should make it a lot easier for you to keep your API usage numbers lower. Note that if you want to explicitly ban Comic Vine Scraper users (who will continue to use the app whether I'm maintaining it or not), the easiest way is to block all GET requests that contain the string "&client=cvscraper". (I put that marker in at the request of a Comic Vine developer some years ago.)

Avatar image for cbanack
#27 Posted by cbanack (124 posts) - - Show Bio

@edgework said:

To specifically address your "batman" example: you will likely only need 2 API queries. (Using JSON for this example only to simplify the results.)

http://www.comicvine.com/api/search/?api_key=xxx&limit=1&resources=volume&field_list=id&format=json&query=batman

gives you this

{
  error: "OK",
  limit: 1,
  offset: 0,
  number_of_page_results: 1,
  number_of_total_results: 1003,
  status_code: 1,
  results: [
    { id: 51951, resource_type: "volume" }
  ],
  version: "1.0"
}

Take number_of_total_results, subtract (say) 100 from it, and use that as the offset parameter. Then run

http://www.comicvine.com/api/search/?api_key=xxx&format=json&limit=100&offset=903&resources=volume&field_list=name,start_year,publisher,id,image,count_of_issues&query=batman

That will be the last page. You didn't need to go through the other 9 pages to get to it.

Unfortunately, this only works if I somehow know ahead of time which page is going to contain the volume that the user is looking for when they search on 'batman'.

As I mentioned earlier, I believe this is a very common problem for pretty much all your API users (the vast majority of whom will be unable to set up their own server, download a copy of your entire database, and then write their own, more efficient API for accessing it).

Avatar image for rick
#28 Posted by rick (119 posts) - - Show Bio

@cbanack: 7 years is a long time. 7 years ago few people had the internet in their front pockets, and 10MB to the home was incredible. Things change quickly. Also, our DB was much smaller then and we had far fewer users. I don't want you to take everything down and stop working on it. I want you to work on it! Update it for today's realities. I'd be happy to help you. We're still an ally, one that wants you to keep up with us.

Regarding my example: it's still valid. You should work backwards in most cases; there will be fewer API hits when you do.

And yeah, scraping HTML pages was never allowed, at least not since CBS became the owner. It worked for a long time, but we've just now caught up to 2015, and stuff like that is detected and blocked. Comic Vine in particular is sensitive to scrapers since it sits on a fine line between fair use and copyright infringement.

Avatar image for jkthemac
#29 Posted by JKtheMac (6 posts) - - Show Bio

@edgework: I think you may be confusing what the ComicVine API scraper is doing. The whole point of the scraper is to copy discrete and manageable chunks of your data so that we DON'T access your servers at all times of day and night.

The problem, at root, is not that what we are doing will break your system; it is that how you allow us to do it will break your system. The more you focus on restrictions rather than bottlenecks, the less efficient your whole system will become.

Once the data is scraped into our home databases, your system is no longer interrogated, even if we run complex and resource-heavy database operations. If we were unable to scrape the data, your servers would be hit all the time, probably via the website. That would carry a much higher overhead for you.

Traffic on your website will only drive revenue if you get multiple users to visit, not the same person desperately trying to work out which issue of Spider-Man was the one where he teamed up with Bombshell, slowly trawling through 30 web pages to figure it out.

Avatar image for claym
#30 Posted by ClayM (5 posts) - - Show Bio

@edgework I'd like to hear what you guys believe the appropriate usage of the API is.

As far as I can see, the API is treated by CVS like every other API on the planet is treated. What is it doing that you find unacceptable?

Avatar image for jothay
#31 Posted by Jothay (3 posts) - - Show Bio

ComicRack maintains a 'db' in the form of an XML file storing information about the comics the user has on their system; that's part of the program and always has been. For the last 7 years, in open communication with CV, cbanack has had in the wild a Python-based plugin for CR that grabs comic issue data from CV for the end user to match up to their issues. This saves the end user of CR hours of filling in dozens of data points per issue. For most, this is just the few comics a week that they keep up with. Some maintain most of the issues that come out each week, so dozens of data points times 100-200 issues would be hours of work; your CV API and the scraper turn that into minutes of watching it work, with extra steps only where it can't figure things out on its own.

No one is (or should be) debating the above because it is a fact.

I believe the thought process behind why CV wants to limit the API is as follows: the CV website and the API share a database. Too many calls from the API slow the website down and cause CV to lose business, and therefore money, from people coming to the website and clicking on ads. We can all understand this: slow website means less money. OK.

The thing that we take issue with is how you are handling it. Instead of making your operation more efficient, such that it could handle the workload it's getting more easily, you are breaking your API so as to make it harder and harder to work with.

Why do you have an API at all if you hate the fact that it gets used? I see in your log shown above that sometimes more than one request happens in a second, but with the API keys blocked out, are they all from the same API key, or is that an actual snapshot of everything with client=cvscraper in that timeframe? If it's the former, I can't see how you could be upset by that; it should only be a problem if all of those requests happened in the same second, maybe.

Now, you've stated above that you would rather the entire database be copied out so something else could be hit. Were you serious? I could build that in a matter of days; I do it for a living. If you PM me and talk to me about making a clone of your existing database, I can take all your API woes away. The only thing hitting your existing database would be a revolving check for new/updated data once a minute to keep the API database current. There would be virtually zero impact on your website.

Please, please contact me directly so we can talk this out. I am very interested in taking this on.
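The once-a-minute mirror described above could be sketched roughly like this. Everything here is illustrative: the table layout, column names, and the `fetch_updates` callable are made up, not an actual Comic Vine schema.

```python
# Minimal sketch of a local mirror database that is refreshed by a
# once-a-minute poll for new/updated records, so end users hit the
# mirror instead of Comic Vine. Names are hypothetical.
import sqlite3
import time

def apply_updates(conn, records):
    """Upsert a batch of new/updated issue records into the local mirror."""
    conn.executemany(
        "INSERT OR REPLACE INTO issues (id, name, updated) VALUES (?, ?, ?)",
        [(r["id"], r["name"], r["updated"]) for r in records],
    )
    conn.commit()

def run_mirror(fetch_updates, db_path="mirror.db", poll_seconds=60):
    """Poll for deltas once per interval and fold them into the mirror."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS issues "
                 "(id INTEGER PRIMARY KEY, name TEXT, updated TEXT)")
    while True:
        # fetch_updates() is the only thing that touches Comic Vine:
        # one upstream request per cycle, regardless of end-user load.
        apply_updates(conn, fetch_updates())
        time.sleep(poll_seconds)
```

The key property is that end-user query load lands entirely on the mirror; upstream traffic is bounded by the poll interval, not by the number of users.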

Avatar image for rick
#32 Edited by rick (119 posts) - - Show Bio

@claym: Comic Vine Scraper is hitting an HTML page and scraping it. That's not allowed. Bots are only allowed to hit api and feed urls.

Reference

Avatar image for hyperspacerebel
#33 Edited by Hyperspacerebel (8 posts) - - Show Bio
@edgework said:

@claym: Comic Vine Scraper is hitting an HTML page and scraping it. That's not allowed. Bots are only allowed to hit api and feed urls.

Reference

Okay, now we're getting somewhere. All the talk about "moving to 2015" and other PR speak really muddied the waters.

So, you don't want people scraping the HTML. Great. Banning people who are using the scraper innocently won't solve your problem, because they will keep trying to use it without knowing they are doing something wrong. Let's take a step back and find a solution.

Per the code, the HTML is queried for two things: "first pass: find all the alternate cover image urls" and "second pass: find the community rating (stars) for this comic". Looking at the API reference (and I'm sure cbanack pored over it extensively), I don't see a way to obtain that data. Can we get API methods for that data, or better yet, let us request it on an existing issue query? Are you guys still extending the API? Are you willing to work with us as you've said? Or is the issue actually deeper than us hitting the HTML?
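If Comic Vine did expose those two data points through the API, both HTML fetches could collapse into one field-limited issue query. A minimal sketch, assuming hypothetical `alternate_images` and `community_rating` field names (the real API would have to add them first; the `4000-` issue prefix and `field_list` parameter do exist today):

```python
# Sketch: replace the two HTML scrapes with a single API request that
# asks only for the fields the scraper needs. "alternate_images" and
# "community_rating" are assumed, not-yet-existing field names.
from urllib.parse import urlencode

API_BASE = "https://comicvine.gamespot.com/api"

def build_issue_query(issue_id, api_key, fields):
    """Build an /api/issue/ URL requesting only the listed fields,
    which also keeps the response payload small."""
    params = {
        "api_key": api_key,
        "format": "json",
        "field_list": ",".join(fields),
    }
    # 4000 is the Comic Vine resource-type prefix for issues
    return f"{API_BASE}/issue/4000-{issue_id}/?{urlencode(params)}"

url = build_issue_query(123, "MY_KEY",
                        ["name", "alternate_images", "community_rating"])
```

Limiting `field_list` to exactly what the plugin consumes would also trim the server's per-request cost, which seems aligned with what edgework is asking for.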

Avatar image for claym
#34 Edited by ClayM (5 posts) - - Show Bio

@edgework said:

@claym: Comic Vine Scraper is hitting an HTML page and scraping it. That's not allowed. Bots are only allowed to hit api and feed urls.

Reference

So the next question is: can CVS retrieve the information it's trying to get without scraping? Does the API provide that data?

If the API does, you have a legitimate gripe. If the API doesn't, well, we have something we can move forward on.

Avatar image for rick
#35 Posted by rick (119 posts) - - Show Bio

@hyperspacerebel: The issue is that something that isn't a browser is hitting a page that isn't /api or /feeds. The firewall detects that, knows the bot isn't Google, Bing, etc., and blocks it for 24 hours (humans have to pass a reCAPTCHA to get in from that IP). That all happens before you even get into the site. ...and if the User-Agent is spoofed, that's worse and gets you hard-banned as malicious, so don't try that. Much of that is relatively new, which is why it's only now becoming a "thing". Still, the Terms of Service aren't new.

The alt covers and ratings were added to the site after the API was designed. We could add that stuff.
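For plugin authors, the rules described above boil down to: stay on /api, send an honest User-Agent, and go easy on the request rate. A minimal sketch of a client that follows them; the UA string and one-second throttle are illustrative guesses, not official limits:

```python
# Sketch of a "well-behaved bot": /api URLs only, an honest (never
# browser-spoofed) User-Agent, and a self-imposed delay between hits.
import time
import urllib.request

USER_AGENT = "comicvine-scraper/1.0"   # honest UA; spoofing gets you hard-banned
MIN_INTERVAL = 1.0                     # illustrative self-throttle, in seconds

_last_request = 0.0

def polite_get(url):
    """Fetch an API URL while obeying the firewall rules above."""
    global _last_request
    if not url.startswith("https://comicvine.gamespot.com/api/"):
        raise ValueError("bots may only hit /api (or /feeds) URLs")
    wait = MIN_INTERVAL - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.time()
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    return urllib.request.urlopen(req)
```

The URL check is the important part here: it fails fast in the client rather than tripping the firewall and earning the user a 24-hour block.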

Avatar image for hyperspacerebel
#36 Posted by Hyperspacerebel (8 posts) - - Show Bio
@edgework said:

@hyperspacerebel: The issue is that something that isn't a browser is hitting a page that isn't /api or /feeds. The firewall detects that, knows the bot isn't Google, Bing, etc., and blocks it for 24 hours (humans have to pass a reCAPTCHA to get in from that IP). That all happens before you even get into the site. ...and if the User-Agent is spoofed, that's worse and gets you hard-banned as malicious, so don't try that. Much of that is relatively new, which is why it's only now becoming a "thing". Still, the Terms of Service aren't new.

The alt covers and ratings were added to the site after the API was designed. We could add that stuff.

Yes, I understand all that. But in my opinion that doesn't excuse the way this was carried out: the handling and the communication we have received on the issue have caused a massive setback for the many users who only visit and contribute to the site because of cbanack's incredibly hard work. If ComicVine had approached this issue with the goal of helping users, cbanack would probably still be willing to work with you, and none of us would be as frustrated as we are. Instead, ComicVine went all gung-ho after the "bad guys" (i.e. people innocently using the CVS plugin who really have no idea what it's doing behind the scenes).

Thank you for considering new API functionality though, because that is what we all need to solve both your problems and ours. If it's just a matter of replacing the two HTML scrapes with new API calls, I or another of us could probably integrate that into CVS.

Avatar image for deactivated-5c901e667a76c
#37 Edited by deactivated-5c901e667a76c (36557 posts) - - Show Bio

@edgework: Does this have anything to do with me getting an "Abnormal traffic detected" message every time I open Comic Vine?

Moderator
Avatar image for rick
#38 Edited by rick (119 posts) - - Show Bio

@hyperspacerebel: Sorry, nothing was changed and nothing new was added; the firewall just started to work and this came up. None of you are bad guys. I know it's innocent and you want every last bit of esoterica regarding your passion; I totally get that. If we did make a change, it would get communicated, as we've done in the past, like when we auto-blocked commercial providers.

All this has caused me to look closely at what's going on, and I'm seeing an explanation for a lot of our problems with the database, which is why I'm trying to get some changes made and also working out what we need to do on our end so these kinds of uses don't affect us so much. I knew of ComicRack and the problems it caused, but I never dealt directly with them; my predecessor did. The database issues have been nagging at me for a while, but I never had time to investigate, and now that I see one huge contributor, you have my full attention.

Avatar image for krandor
#39 Posted by krandor (11 posts) - - Show Bio

@edgework: Earlier you mentioned allowing us to make a central copy of the DB that the actual scraper could then hit, which would pull all those hits off your DB apart from requests to check for updates.

Is that a serious option for us? We have people ready to start building it almost immediately if we have your blessing to do so. Naturally the copy would only be for API calls.

Avatar image for rick
#40 Edited by rick (119 posts) - - Show Bio

@krandor: Yeah, totally. That's not at all uncommon. That way, end users don't need to worry about API keys, etc., and the central server will pull data from us once rather than having a bunch of end users pulling the same thing. When you're ready to begin the initial ingestion, let me know; I'll schedule a time that works best for everyone. After that you should just pull deltas.

Avatar image for krandor
#41 Posted by krandor (11 posts) - - Show Bio

@edgework: awesome. I'll pass that along. That seems to be the best long term solution for everyone.

Avatar image for jothay
#42 Posted by Jothay (3 posts) - - Show Bio

@edgework: You just redeemed my faith, which I'm sorry to say had been shaken. I'll be working with @krandor and a couple of other volunteers from ComicRack to put a solution together quickly.

Since I've only ever looked at the existing API once, I'll need to familiarize myself with what it returns and how, so I can replicate it. We'll also need to discuss how to get the existing data over safely and securely.

Working with deltas will commonly mean a modified-since timestamp search. Do all the existing API calls allow for this, or is there a specific call we could feed a time to and have it return all the deltas in one swoop for us to process on our end?

Avatar image for rick
#43 Posted by rick (119 posts) - - Show Bio

We can add a modified-since parameter to everything. I'll get that into the dev queue.

Avatar image for krandor
#44 Edited by krandor (11 posts) - - Show Bio

I am really glad to see that we now seem to be working together on a solution instead of being at odds with each other. Very refreshing.