How to access Wayback Machine programmatically?


What I'm trying to do

For a list of websites, I want to get the pages indexed by year, if they were archived at any point that year. So if I'm looking at example1.com and example2.com, I want to be able to get:

2010: example1.com, example2.com (the html from these archived pages)

2011: example1.com (example2.com, say, was not archived in 2011)

2012: example2.com

2013: example1.com, example2.com

and so on.

Question

Is this possible to get using the Wayback Machine API? I looked at their API listing and it didn't seem like I could do what I was trying to do. Maybe I'm missing something, but it seems like a fairly plausible use case. Any other suggestions?

asked Nov 19 '15 at 18:25

ShivanKaul

web-scraping


2 Answers

The key thing to understand about the Wayback Machine APIs is that there are (from what I can tell) three different ways to work with them.

Wayback Availability JSON API

The first is the API which is documented near the top of the Wayback Machine API page you already mentioned.

That API returns the archived snapshot nearest to a given date for a given page. So you can check the Wayback Machine for copies of the Google homepage archived around New Year's Day like so:

http://archive.org/wayback/available?url=google.com&timestamp=20080101
http://archive.org/wayback/available?url=google.com&timestamp=20090101
http://archive.org/wayback/available?url=google.com&timestamp=20100101
etc.

Each response is a small JSON document that includes the URL of the closest snapshot, so you can easily download the content programmatically.
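As a minimal sketch of that workflow in Python: the endpoint and the `url` / `timestamp` parameters are the ones shown above, while the function names are hypothetical and the response layout (`archived_snapshots` → `closest`) is assumed to match what the Availability API documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"


def availability_url(page, timestamp):
    """Build the query URL for one page and one target date (YYYYMMDD)."""
    query = urlencode({"url": page, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot(response):
    """Return (timestamp, snapshot_url) from a decoded response, or None.

    Assumes the documented layout:
    {"archived_snapshots": {"closest": {"available": ..., "url": ..., "timestamp": ...}}}
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def fetch_closest(page, timestamp):
    """Query the API over the network and return the closest snapshot, if any."""
    with urlopen(availability_url(page, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

Note that "closest" can land in a different year from the one requested, so for a per-year listing it is worth comparing the first four digits of the returned timestamp against the year you asked for.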

Wayback CDX Server API

Next we have the Wayback Machine CDX Server API, which exposes a much richer interface. Most notably, you can quickly retrieve a list of every snapshot of a URL that you are interested in:

http://web.archive.org/cdx/search/cdx?url=www.fredtrotter.com
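A rough Python sketch of calling that endpoint is below. It assumes the CDX server's `output=json` mode, in which the first row of the result is a header naming the fields; the helper names are hypothetical, and extra query parameters such as `collapse=timestamp:4` (keep one capture per year) are taken from the CDX server documentation rather than from this answer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(page, **params):
    """Build a CDX query URL; extra keyword arguments become query parameters."""
    query = {"url": page, "output": "json"}
    query.update(params)
    return CDX_ENDPOINT + "?" + urlencode(query)


def parse_cdx_json(rows):
    """Turn the JSON array-of-arrays into a list of dicts keyed by the header row."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def list_captures(page, **params):
    """Fetch and parse the captures for one URL (performs a network request)."""
    with urlopen(cdx_url(page, **params)) as resp:
        body = resp.read().decode("utf-8")
    # An empty body is treated as "no captures" rather than a JSON error.
    return parse_cdx_json(json.loads(body) if body.strip() else [])
```

For example, `list_captures("www.fredtrotter.com", collapse="timestamp:4")` should return at most one record per year, each with a `timestamp` and `original` field.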

Memento API

Lastly we have the deep and mysterious resource that is the Wayback Machine Memento API. That link is to a blog post about the functionality, but from what I can garner, this is about working with the Wayback Machine at a protocol level, where the Memento Protocol is a well-thought-out specification of the way an archive site should operate.
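To give a feel for what "protocol level" means, here is a tentative Python sketch of a Memento-style request. In the Memento protocol (RFC 7089) the client sends an `Accept-Datetime` header in HTTP-date format and the server redirects to the capture closest to that date. The assumption here, not confirmed by this answer, is that `http://web.archive.org/web/` followed by the original URL acts as the TimeGate; the function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate prefix for the Wayback Machine.
TIMEGATE_PREFIX = "http://web.archive.org/web/"


def timegate_request(page, when):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    rendered in the HTTP-date format that Accept-Datetime expects.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + page, {"Accept-Datetime": http_date}
```

The returned pair can be passed to `urllib.request.Request(url, headers=headers)`; following the redirect should end at the archived copy nearest to the requested date.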

Final thoughts

In all cases, please be gentle and respectful with your scripting. The Wayback Machine API does not currently require credentials, which is a very generous and open posture, in keeping with the Internet Archive's role as a "Wonder of the Virtual World". So do not abuse it, because that is how we ensure that we keep having nice things.

Thanks to Greg, and the rest of the Wayback Machine team, for the excellent work you do to keep the Internet a source of personal freedom and expression.

edited Nov 22 '18 at 12:32

sampablokuper


answered Feb 19 '17 at 8:30

ftrotter


Our CDX API allows you to make 2 separate calls to get a list of all captures for the URL or domain example1.com and the URL or domain example2.com. You can then produce whatever summary you like.
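One possible way to produce the per-year summary from the question, sketched in Python. It assumes you have already collected the capture timestamps (14-digit `YYYYMMDDhhmmss` strings, as the CDX API reports them) for each site; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def sites_by_year(captures):
    """Map each year to the sorted list of sites captured at least once that year.

    `captures` maps a site name to an iterable of 14-digit timestamps.
    """
    summary = defaultdict(set)
    for site, timestamps in captures.items():
        for ts in timestamps:
            summary[ts[:4]].add(site)  # first four digits are the year
    return {year: sorted(sites) for year, sites in sorted(summary.items())}


def snapshot_url(timestamp, original):
    """Build the Wayback Machine URL from which a capture's HTML can be downloaded."""
    return f"http://web.archive.org/web/{timestamp}/{original}"


# Made-up timestamps mirroring the example in the question.
captures = {
    "example1.com": ["20100315120000", "20110702080000", "20130101000000"],
    "example2.com": ["20100520093000", "20120811140000", "20131231235959"],
}
summary = sites_by_year(captures)
```

Here `summary` maps `"2010"` and `"2013"` to both sites, `"2011"` to example1.com only, and `"2012"` to example2.com only.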

answered Jun 17 '16 at 21:53

Greg Lindahl


This answer is somewhat confusing. On whose behalf was it written? (I.e. who is "our" in this context?) Also, any chance you could state the URL for the API that you referred to, to avoid ambiguity and to eliminate needless web searches? (Cf. "Brevity is acceptable, but fuller explanations are better.")

– sampablokuper
Nov 22 '18 at 12:19

I was working at the Internet Archive on the Wayback when I wrote my answer, and the API I refer to is linked in the question and the other answer. Here it is again: archive.org/help/wayback_api.php (scroll to the bottom to the section labeled "Wayback CDX Server API")

– Greg Lindahl
Nov 23 '18 at 17:09

Upvoted. Thanks for clarifying, and thanks for working on Wayback :)

– sampablokuper
Nov 27 '18 at 1:21
