Selenium cannot retrieve url when running in Google Colab

I built a small web scraper that has run successfully in a Google Colab notebook over the last few months. It downloads a set of billing codes from the CMS website. Recently the driver started throwing timeout exceptions when retrieving some, but not all, URLs. The reprex below downloads a file from two URLs. It executes successfully when I run it locally, but when it runs in Google Colab it fails while trying to retrieve the second URL.

The timeout happens in driver.get(url). Strangely, the code works as long as the driver has not previously visited another URL. For example, in the code below, not_working_url will successfully retrieve the webpage and download the file if it does not come after working_url.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def download_documents() -> None:
    """Download billing code documents from CMS"""

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)

    working_url = ".aspx?articleid=59626&ver=6"
    not_working_url = ".aspx?lcdid=36377&ver=19"

    for row in [working_url, not_working_url]:
        print(f"Retrieving from {row}...")
        driver.get(row)  # Fails on the second URL when run in Colab

        print("Wait for webdriver...")
        wait = WebDriverWait(driver, 2)

        print("Attempting license accept...")
        # Accept license
        try:
            wait.until(EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))).click()
        except TimeoutException:
            pass
        wait = WebDriverWait(driver, 4)
        print("Attempting pop up close...")
        # Click on Close button of the second pop-up
        try:
            wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                    )
                )
            ).click()
        except TimeoutException:
            pass
        print("Attempting download...")
        driver.find_element(By.ID, "btnDownload").click()

download_documents()

Expected behavior: The code above runs successfully in Google Colab, just like it does locally.

A potentially related issue: Selenium TimeoutException in Google Colab


2 Answers


I was able to run my script successfully by initializing (and closing) the driver on every iteration of the loop rather than just once before it started.

For example, the loop below retrieves each URL without timing out. I would still appreciate any commentary explaining why the driver ever needs to be reinitialized, regardless of my programming environment, but hopefully this solution is helpful for others who run into this issue.

for row in [working_url, not_working_url]:
    driver = webdriver.Chrome(options=chrome_options)  # start a fresh browser session for each URL
    driver.get(row)
    driver.close()
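
For reference, here is a minimal sketch of how the question's download_documents function might look with a fresh driver per URL. It reuses the CMS URLs and element IDs from the question; the try/finally with driver.quit() is my addition so each session is fully torn down before the next URL.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def download_documents() -> None:
    """Download billing code documents from CMS, one driver per URL."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    urls = [
        "https://www.cms.gov/medicare-coverage-database/view/article.aspx?articleid=59626&ver=6",
        "https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19",
    ]

    for url in urls:
        # Start a fresh browser session for each URL.
        driver = webdriver.Chrome(options=chrome_options)
        try:
            print(f"Retrieving from {url}...")
            driver.get(url)

            # Accept the license agreement if it appears.
            try:
                WebDriverWait(driver, 2).until(
                    EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))
                ).click()
            except TimeoutException:
                pass

            # Close the tracking sheet pop-up if it appears.
            try:
                WebDriverWait(driver, 4).until(
                    EC.element_to_be_clickable(
                        (
                            By.XPATH,
                            "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                        )
                    )
                ).click()
            except TimeoutException:
                pass

            print("Attempting download...")
            driver.find_element(By.ID, "btnDownload").click()
        finally:
            # Fully end the session before moving on to the next URL.
            driver.quit()


download_documents()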

Try the arguments below:

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
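
These options plug into the question's setup unchanged. A minimal usage sketch, assuming chrome_options has been configured as above (the page-load timeout is an optional extra safeguard, not part of the original answer):

from selenium import webdriver

# chrome_options configured with the arguments shown above
driver = webdriver.Chrome(options=chrome_options)

# Optional: raise a TimeoutException quickly instead of hanging if navigation stalls.
driver.set_page_load_timeout(60)

driver.get("https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19")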
