Selenium cannot retrieve url when running in Google Colab

I built a small web scraper that has run successfully in a Google Colab notebook over the last few months. It downloads a set of billing codes from the CMS website. Recently the driver started throwing timeout exceptions when retrieving some, but not all, URLs. The reprex below downloads a file from two URLs. It executes successfully when I run it locally, but when it runs in Google Colab it fails while trying to retrieve the second URL.

The timeout happens in driver.get(url). Strangely, the code works as long as the driver has not previously visited another URL. For example, in the code below, not_working_url will successfully retrieve the webpage and download the file if it does not come after working_url.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def download_documents() -> None:
    """Download billing code documents from CMS"""

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)

    working_url = ".aspx?articleid=59626&ver=6"
    not_working_url = ".aspx?lcdid=36377&ver=19"

    for row in [working_url, not_working_url]:
        print(f"Retrieving from {row}...")
        driver.get(row)  # Fails on the second URL when run in Colab

        print("Wait for webdriver...")
        wait = WebDriverWait(driver, 2)

        print("Attempting license accept...")
        # Accept license
        try:
            wait.until(EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))).click()
        except TimeoutException:
            pass
        wait = WebDriverWait(driver, 4)
        print("Attempting pop up close...")
        # Click on Close button of the second pop-up
        try:
            wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                    )
                )
            ).click()
        except TimeoutException:
            pass
        print("Attempting download...")
        driver.find_element(By.ID, "btnDownload").click()

download_documents()

Expected behavior: The code above runs successfully in Google Colab, just like it does locally.

A potentially related issue: Selenium TimeoutException in Google Colab


2 Answers


I was able to run my script successfully by initializing (and closing) the driver on every iteration of the loop rather than just once before it started.

For example, the loop below retrieves each URL without timing out. I would still appreciate any commentary explaining why the driver ever needs to be reinitialized, regardless of my programming environment, but hopefully this solution is helpful for others who run into this issue.

for row in [working_url, not_working_url]:
    driver = webdriver.Chrome(options=chrome_options)  # start a fresh browser session for each URL
    driver.get(row)
    driver.close()
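
For reference, here is a minimal sketch of how the question's download_documents function might look with a fresh driver per URL. It reuses the CMS URLs and element IDs from the question; the try/finally with driver.quit() is my addition so each session is fully torn down before the next URL.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def download_documents() -> None:
    """Download billing code documents from CMS, one driver per URL."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    urls = [
        "https://www.cms.gov/medicare-coverage-database/view/article.aspx?articleid=59626&ver=6",
        "https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19",
    ]

    for url in urls:
        # Start a fresh browser session for each URL.
        driver = webdriver.Chrome(options=chrome_options)
        try:
            print(f"Retrieving from {url}...")
            driver.get(url)

            # Accept the license agreement if it appears.
            try:
                WebDriverWait(driver, 2).until(
                    EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))
                ).click()
            except TimeoutException:
                pass

            # Close the tracking sheet pop-up if it appears.
            try:
                WebDriverWait(driver, 4).until(
                    EC.element_to_be_clickable(
                        (
                            By.XPATH,
                            "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                        )
                    )
                ).click()
            except TimeoutException:
                pass

            print("Attempting download...")
            driver.find_element(By.ID, "btnDownload").click()
        finally:
            # Fully end the session before moving on to the next URL.
            driver.quit()


download_documents()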

Try the arguments below:

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
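
These options plug into the question's setup unchanged. A minimal usage sketch, assuming chrome_options has been configured as above (the page-load timeout is an optional extra safeguard, not part of the original answer):

from selenium import webdriver

# chrome_options configured with the arguments shown above
driver = webdriver.Chrome(options=chrome_options)

# Optional: raise a TimeoutException quickly instead of hanging if navigation stalls.
driver.set_page_load_timeout(60)

driver.get("https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19")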
