I want to make a Greasemonkey script that, while you are on URL_1, parses the whole HTML web page of URL_2 in the background in order to extract a text element from it.
To be specific, I want to download the whole page's HTML code (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName()[0] to extract the text I want from the element with class name "critic_consensus".
I've found this on MDN: HTML in XMLHttpRequest, so I ended up with this unfortunately non-working code:
var xhr = new XMLHttpRequest();
xhr.onload = function() {
    alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
};
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/", true);
xhr.responseType = "document";
xhr.send();
It shows this error message when I run it in Firefox Scratchpad:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://www.rottentomatoes.com/m/godfather/. This can be fixed by moving the resource to the same domain or enabling CORS.
PS. The reason why I don't use the Rotten Tomatoes API is that they've removed the critics consensus from it.
- 2 What is not working? What error do you get? – Bergi Commented Nov 5, 2014 at 19:20
- 2 There was no error message inside Firefox's Scratchpad. After seeing Igor Barinov's reply, I checked the Firefox Web Console, and that's where the error message he mentioned appears. I added the error message to my question. – darkred Commented Nov 5, 2014 at 19:52
- I edited my answer with new idea, give it a try! – Igor Barinov Commented Nov 5, 2014 at 20:38
3 Answers
For cross-origin requests, where the fetched site has not helpfully set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)
GM_xmlhttpRequest is expressly designed to allow cross-origin requests.
To get your target information, create a DOMParser and run it on the result. Do not use jQuery methods for parsing, as that will cause extraneous images, scripts, and objects to load, slowing things down or even crashing the page.
Here's a complete script that illustrates the process:
// ==UserScript==
// @name     _Parse Ajax Response for specific nodes
// @include  http://stackoverflow.com/questions/*
// @require  http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant    GM_xmlhttpRequest
// ==/UserScript==
GM_xmlhttpRequest ( {
    method: "GET",
    url: "http://www.rottentomatoes.com/m/godfather/",
    onload: function (response) {
        var parser = new DOMParser ();
        /* IMPORTANT!
            1) For Chrome, see
            https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
            for a work-around (a minimal fallback sketch also appears after this script).
            2) jQuery.parseHTML() and similar are bad because they cause images, etc., to be loaded.
        */
        var doc = parser.parseFromString (response.responseText, "text/html");
        var criticTxt = doc.getElementsByClassName ("critic_consensus")[0].textContent;

        $("body").prepend ('<h1>' + criticTxt + '</h1>');
    },
    onerror: function (e) {
        console.error ('**** error ', e);
    },
    onabort: function (e) {
        console.error ('**** abort ', e);
    },
    ontimeout: function (e) {
        console.error ('**** timeout ', e);
    }
} );
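In case you need the Chrome work-around referenced in the script's comment, here is a minimal, generic sketch (not part of the original answer; the helper name parseHTMLString is just for illustration). It falls back to document.implementation.createHTMLDocument() when DOMParser does not accept the "text/html" type:

// Sketch only: parse an HTML string into a detached Document, falling back
// for older browsers whose DOMParser rejects the "text/html" type.
function parseHTMLString (markup) {
    try {
        var doc = new DOMParser ().parseFromString (markup, "text/html");
        if (doc) {
            return doc;
        }
    } catch (e) {
        // DOMParser exists but does not support "text/html"; use the fallback below.
    }
    // Build an empty detached HTML document and inject the fetched markup into it.
    var fallbackDoc = document.implementation.createHTMLDocument ("");
    fallbackDoc.documentElement.innerHTML = markup;
    return fallbackDoc;
}

With a helper like that, the onload handler above could call parseHTMLString(response.responseText) instead of using DOMParser directly.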
The problem is: XMLHttpRequest cannot load http://www.rottentomatoes.com/m/godfather/. No 'Access-Control-Allow-Origin' header is present on the requested resource.
Because you are not the owner of the resource, you cannot set up this header.
What you can do is set up a proxy on Heroku which will proxy all requests to the Rotten Tomatoes web site. Here is a small Node.js proxy: https://gist.github.com/igorbarinov/a970cdaf5fc9451f8d34
var https = require('https'),
    http = require('http'),
    util = require('util'),
    path = require('path'),
    fs = require('fs'),
    colors = require('colors'),
    url = require('url'),
    httpProxy = require('http-proxy'),
    dotenv = require('dotenv');

dotenv.load();

var proxy = httpProxy.createProxyServer({});
var host = "www.rottentomatoes.com";
var port = Number(process.env.PORT || 5000);

process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

var server = require('http').createServer(function(req, res) {
    // You can define here your custom logic to handle the request
    // and then proxy the request.
    var path = url.parse(req.url, true).path;
    req.headers.host = host;
    res.setHeader("Access-Control-Allow-Origin", "*");
    proxy.web(req, res, {
        target: "http://" + host + path
    });
}).listen(port);

proxy.on('proxyRes', function (res) {
    console.log('RAW Response from the target', JSON.stringify(res.headers, true, 2));
});

util.puts('Proxying to ' + host + '. Server'.blue + ' started '.green.bold + 'on port '.blue + port);
I modified the code from https://github.com/massive/firebase-proxy/ for this.
I published the proxy at http://peaceful-cove-8072.herokuapp.com/ and you can test it at http://peaceful-cove-8072.herokuapp.com/m/godfather
Here is a fiddle to test it: http://jsfiddle.net/uuw8nryy/
var xhr = new XMLHttpRequest();
xhr.onload = function() {
    alert(this.responseXML.getElementsByClassName("critic_consensus")[0]);
};
xhr.open("GET", "http://peaceful-cove-8072.herokuapp.com/m/godfather", true);
xhr.responseType = "document";
xhr.send();
The JavaScript same origin policy prevents you from accessing content that belongs to a different domain.
The above reference also gives you four techniques for relaxing this rule (CORS being one of them).
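For completeness, if you did control the server that serves the data (which is not the case with Rotten Tomatoes), enabling CORS only requires sending the Access-Control-Allow-Origin response header. Below is a minimal, generic Node.js sketch, purely for illustration; the port and the wildcard origin are placeholders:

// Illustration only: a server you control opting in to CORS.
// '*' allows any origin; a real deployment would usually name a specific origin.
var http = require('http');

http.createServer(function (req, res) {
    res.setHeader('Access-Control-Allow-Origin', '*');
    res.setHeader('Content-Type', 'text/plain');
    res.end('This response can be read cross-origin via XMLHttpRequest.');
}).listen(8080);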