Python Web Scrape

The GitHub repository for this project can be viewed here.

Phone Number Scraper

In this project, I created a simple Python script designed to collect publicly available phone numbers from Google Search results. The tool automates queries, extracts visible phone numbers, and populates them into a structured CSV file for easy enrichment and further use.

⚡️ Disclaimer: This tool is designed to strictly scrape publicly available phone numbers. Ensure your usage complies with Google’s Terms of Service and local data privacy regulations.


🚀 Features

  • Automated querying via Google Search.
  • Extraction of publicly visible phone numbers from result pages.
  • CSV output (Enriched_Dataset.csv) with enriched data.
  • Built-in basic error handling for common scraping issues.
  • Written in Jupyter Notebook (Scrapping_script.ipynb).

📂 Files in the Repository

FileDescription
Scrapping_script.ipynbJupyter Notebook containing the scraping script and all processing logic.
Enriched_Dataset.csvOutput CSV containing the extracted and enriched phone number data.
requirements.txtContains the dependencies required to run this script.

🛠️ Requirements

  • Python 3.7+
  • Jupyter Notebook
  • Key Libraries:
    • requests
    • beautifulsoup4
    • gspread
    • oauth2client
    • re (regular expressions)

Install the required packages using:

pip install -r requirements.txt

🧹 How to Use

  1. Clone this repository:

    git clone https://github.com/your-username/phone-number-scraper.git
    cd phone-number-scraper
  2. Install dependencies.

  3. Run the Notebook:

    Open Scrapping_script.ipynb using Jupyter Notebook or JupyterLab, and execute the cells sequentially.

  4. Customize Search Terms (Optional):

    Modify the query section inside the notebook to change the search keywords according to your needs.

  5. View Results:

    After execution, the enriched phone numbers will be saved in Enriched_Dataset.csv.


⚠️ Important Notes

  • Respect Robots.txt: Always check and respect the robots.txt file of any site you scrape.
  • Rate Limiting: Add delays between requests to avoid being IP-banned.
  • Legal Compliance: Scrape responsibly and ensure you adhere to all applicable data privacy laws.

🧺 Future Improvements

  • Integrate proxy support.
  • Implement CAPTCHA handling.
  • Add multi-threaded scraping for faster data collection.
  • Deploy as a Python package or CLI tool.

📜 License

This project is licensed under the MIT License. See the LICENSE file in the GitHub repository for more details.