Wednesday 21 September 2016

Things to take care while doing Web Scraping!!!

Web scraping has become one of the most popular terms in data science. Put simply, web scraping means extracting information from websites using pre-written programs and scripts. Many organizations have successfully used web scraping to build relevant and useful databases that they rely on daily to further their business interests. This is the age of big data, and web scraping is one of the trending techniques in data science.

Over the course of learning web scraping and completing many scraping projects, I have picked up some lessons worth sharing. In this post, I’m going to discuss some approaches to take and some to avoid when scraping the web.

Use Proxies: Anonymously scraping data from websites

Do not scrape a website from a single IP address. When you repeatedly request pages from the same address, the remote web server may block your IP, preventing further requests. To avoid this, scrape through proxy servers (anonymous scraping), which hide your network details from the remote server and reduce the risk of being detected and blacklisted. You may also use a VPN instead of proxies to scrape websites anonymously.
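As a rough illustration of rotating requests across proxies, here is a minimal sketch using only Python's standard library. The proxy addresses and the helper names (make_opener_cycle, fetch) are made up for this example; substitute proxies you are actually permitted to use.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation-only IP range); replace with your own.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

def make_opener_cycle(proxies):
    """Yield (proxy, opener) pairs forever, moving to the next proxy each time."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)

openers = make_opener_cycle(PROXIES)

def fetch(url, timeout=10):
    """Fetch url through the next proxy in the rotation and return the raw bytes."""
    proxy, opener = next(openers)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to fetch goes out through a different proxy in round-robin order, so no single address sends every request.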

Take maximum data and store it.

Do not process each web page as it arrives from the remote server. Instead, save everything you download to disk. This pays off when your scraping algorithm breaks midway, because you don’t have to start scraping again. Never download the same content more than once, as that just wastes bandwidth. Try to download all content to disk in one go and then do the processing.
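One way to follow this advice is a small on-disk cache keyed by URL. The sketch below is only an assumption about how you might structure it: the directory name raw_pages and the functions cache_path and get_page are invented for the example, and the actual download step is passed in as a callable.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("raw_pages")  # hypothetical directory name

def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable file name under cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")

def get_page(url, download, cache_dir=CACHE_DIR):
    """Return the raw bytes for url, downloading only if it is not on disk yet.

    `download` is any callable that takes a URL and returns bytes.
    """
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    body = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

If the parsing step crashes later, rerunning the script reads the saved files instead of hitting the website again.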

Follow strict rules in parsing:

Apply validation rules while parsing information from the website. For example, if you expect a value to be a date, check that it really is a date. This can greatly improve the quality of the information. When you get unexpected data, the algorithm needs to be changed accordingly.
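The date check mentioned above could look something like this. The field names ("date", "price") and the expected YYYY-MM-DD format are assumptions made for the sketch, not something the target website guarantees.

```python
from datetime import datetime

def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date if text matches fmt exactly, otherwise None."""
    if not isinstance(text, str):
        return None
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None

def validate_record(record):
    """Return a list of problems found in one scraped record (empty if clean)."""
    problems = []
    if parse_date(record.get("date")) is None:
        problems.append("date is not a valid YYYY-MM-DD date: %r" % (record.get("date"),))
    try:
        float(record.get("price"))
    except (TypeError, ValueError):
        problems.append("price is not a number: %r" % (record.get("price"),))
    return problems
```

Records that come back with a non-empty problem list are a signal that the page layout or the parsing logic needs another look.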

Respect Robots.txt

Robots.txt specifies the rules that web crawlers and robots should follow. I strongly advise you to adjust your crawler to fully respect it. The file states which pages each user-agent may and may not crawl, and it can also state the required interval between page requests. Following these instructions minimizes the chance of being blacklisted or banned by the website owner.
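Python's standard library ships a robots.txt parser, so a crawler can check the rules before each request. In this sketch the robots.txt content and the user-agent name are made up, and the file is parsed from a string so the example runs offline; a real crawler would load the site's own file instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from text so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical user-agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """Return True if the parsed rules let USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)

# Seconds to wait between requests, or None if the file gives no Crawl-delay.
delay = parser.crawl_delay(USER_AGENT)
```

Calling allowed(url) before every download, and sleeping for delay seconds between requests when it is set, keeps the crawler within the published rules.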

Use XPath Smartly

XPath lets you select elements of an HTML document more flexibly than CSS selectors. Be careful, though: the HTML structure can change from page to page, so an XPath written for one page may fail to extract data on another.
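One defensive pattern is to keep a short list of candidate XPaths and fall back to a looser one when the strict one finds nothing. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath and requires well-formed markup; real-world HTML usually calls for a more lenient parser. The page fragments and the extract_price helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two made-up, well-formed fragments whose structure differs slightly.
PAGE_A = "<html><body><div class='product'><span class='price'>10.00</span></div></body></html>"
PAGE_B = "<html><body><div class='item'><p class='price'>12.50</p></div></body></html>"

# Candidate XPaths, tried in order. The first assumes PAGE_A's layout;
# the second is a looser fallback that matches any element with class 'price'.
PRICE_XPATHS = [
    ".//div[@class='product']/span[@class='price']",
    ".//*[@class='price']",
]

def extract_price(markup):
    """Return the text of the first matching price element, or None."""
    root = ET.fromstring(markup)
    for xpath in PRICE_XPATHS:
        node = root.find(xpath)
        if node is not None and node.text:
            return node.text.strip()
    return None
```

Returning None instead of raising makes it easy to log the pages where every candidate failed and inspect them by hand.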

Obey Website T&C:

Some websites make it absolutely clear in their terms and conditions that they are opposed to web scraping of their content. Ignoring this can expose you to ethical and legal consequences.

Test sample scrape and verify the data with actual scrape

Once your web scraping project is set up, test it for some time. Check the extracted data. If something is wrong, find the cause and make changes accordingly, and repeat until the project produces reliable results.
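A simple way to check extracted data is to verify a handful of records by hand and compare the scraper's output against them. This is only a sketch; the compare_sample function and the dictionary layout (records keyed by some identifier) are assumptions for illustration.

```python
def compare_sample(scraped, expected):
    """Return (key, field, scraped_value, expected_value) for every mismatch.

    Both arguments map a record key to a dict of field values; `expected`
    holds the values you verified by hand on the live site.
    """
    mismatches = []
    for key, expected_row in expected.items():
        scraped_row = scraped.get(key, {})
        for field, expected_value in expected_row.items():
            actual = scraped_row.get(field)
            if actual != expected_value:
                mismatches.append((key, field, actual, expected_value))
    return mismatches
```

An empty result on the sample is a reasonable sign that the full scrape can go ahead; anything else points to the field that needs fixing.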

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Friday 9 September 2016

How to Use Microsoft Excel as a Web Scraping Tool

Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. The immense popularity of Excel is not without reasons. It is like the Swiss army knife of data with its great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.

Web queries are an Excel feature used to fetch data from a web page into a worksheet. A web query automatically finds the tables on the web page and lets you pick the particular table you need data from. Beyond extracting data from web pages, web queries can also be handy in situations where an ODBC connection is impossible to maintain. Let’s see how web queries work and how you can scrape HTML tables off the web using them.

Getting started

We’ll start with a simple web query to scrape data from a Yahoo! Finance page. This page is particularly easy to scrape and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the URL we will be using for the tutorial:

http://finance.yahoo.com/q/hp?s=GOOG

To create a new Web query:

1. Select the cell in which you want the data to appear.
2. Click on Data -> From Web.
3. The New Web Query box will pop up.
4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button.
5. Click on the yellow-black buttons next to the table you need to extract data from.
6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet.
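For readers curious about what the steps above amount to, here is a rough Python sketch of the same idea: find the tables in a page's HTML and pull out their rows. It is not how Excel implements web queries. The TableExtractor class and the sample markup are made up for the example, and the sketch assumes table cells are explicitly closed and tables are not nested.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

def extract_tables(html_text):
    """Return all tables in html_text as nested lists: tables -> rows -> cells."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables

# Made-up sample markup with placeholder values.
SAMPLE = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Sep 8, 2016</td><td>100.00</td></tr>
</table>
"""
```

Picking a table in the web query dialog corresponds to choosing one entry from the list that extract_tables returns.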

Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting etc. to better understand or present the data in a simpler way.

Customizing the query

Once you have created a web query, you have the option to customize it according to your requirements. To do this, access the web query properties by right-clicking on a cell with the extracted data. The page you were querying appears again; click on the Options button to the right of the address bar. A new pop-up box will be displayed where you can customize how the web query interacts with the target page. The options here let you change some basic things related to web pages, such as formatting and redirections.

Apart from this, you can also alter the data range options by right-clicking on any cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize, like ‘Stock Prices’.

Auto refresh

Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can set the extracted data to refresh automatically, so that your worksheet re-fetches the data at a fixed interval and picks up any changes on the source website. You can set how often the data should be updated from the source web page in the data range options menu. Enable the feature by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.

Web scraping at scale

Although extracting data using Excel can be a great way to scrape HTML tables from the web, it is nowhere close to a real web scraping solution. It can prove useful if you are collecting data for a college research paper or you are a hobbyist looking for a cheap way to get your hands on some data. If you need data for business, you will have to depend on a web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process that web scraping is will also give you more room to deal with other things that need extra attention, such as marketing your business.

Source: https://www.promptcloud.com/blog/how-to-use-excel-to-scrape-websites

Thursday 1 September 2016

Why Healthcare Companies should look towards Web Scraping

The internet is a massive storehouse of information, available as text, media and other formats. To be competitive in the modern world, most businesses need access to this storehouse. However, not all of this information is freely accessible, as several websites do not allow you to save the data. This is where web scraping comes in handy.

Web scraping is not new: it has been widely used by financial organizations for detecting fraud, by marketers for marketing and cross-selling, and by manufacturers for maintenance scheduling and quality control. Web scraping has endless uses for business and personal users. Every business or individual can have a particular need for collecting data. You might want to access data belonging to a particular category from several websites, but different websites in the same category display information in non-uniform formats. Even if you are browsing a single website, you may not be able to access all the data in one place.

The data may be distributed across multiple pages under various heads. In a market that is vast and evolving rapidly, strategic decision-making demands accurate and thorough data, analyzed on a periodic basis. Web scraping can help you mine data from several websites and store it in a single place, so that it becomes convenient for you to analyze the data and deliver results.

In the context of healthcare, web scraping is gaining a foothold gradually but steadily. Several factors have led to the use of web scraping in healthcare. The voluminous data produced by the healthcare industry is too complex to be analyzed by traditional techniques. Web scraping along with data extraction can improve decision-making by determining trends and patterns in huge amounts of intricate data. Such intensive analyses are becoming progressively vital owing to financial pressures that have increased the need for healthcare organizations to arrive at conclusions based on the analysis of financial and clinical data. Furthermore, increasing cases of medical insurance fraud and abuse are encouraging healthcare insurers to resort to web scraping and data extraction techniques.

Healthcare is no longer a sector relying solely on person-to-person interaction. Healthcare has gone digital in its own way, and different stakeholders of this industry such as doctors, nurses, patients and pharmacists are upping the ante technologically to remain in sync with the changing times. In the existing setup, where all choices are data-centric, web scraping in healthcare can impact lives, educate people, and create awareness. As people no longer depend only on doctors and pharmacists, web scraping in healthcare can improve lives by offering rational solutions.

To be successful in the healthcare sector, it is important to come up with innovative and informative ways to gather and present information to patients and customers. Web scraping offers a plethora of solutions for the healthcare industry. With web scraping and data extraction solutions, healthcare companies can monitor and gather information as well as track how their healthcare product is being received, used and implemented in different locales. Web scraping offers safer and more comprehensive access to data, allowing healthcare experts to make the right decisions, which ultimately leads to a better clinical experience for patients.

Web scraping not only gives healthcare professionals access to enterprise-wide information but also simplifies the process of data conversion for predictive analysis and reports. Analyzing user reviews about precautions and symptoms for diseases that remain incurable to date, and for which effective treatments are still being researched, can mitigate people’s fear. Such analysis can draw on data that patients themselves make available, and it is one way of creating awareness among people.

Hence, web scraping can increase the significance of data collection and help doctors make sense of the raw data. With web scraping and data extraction techniques, healthcare insurers can reduce fraud attempts, healthcare organizations can focus on better customer relationship management decisions, doctors can identify effective cures and best practices, and patients can get more affordable and better healthcare services.

Web scraping applications in healthcare can have remarkable utility and potential. However, the success of web scraping and data extraction techniques in the healthcare sector depends on the availability of clean healthcare data. For this, it is imperative that the healthcare industry think about how data can be better recorded, stored, primed, and scraped. For instance, the healthcare sector can consider standardizing clinical vocabulary and allowing data to be shared across organizations to heighten the benefits of healthcare web scraping practices.

The healthcare sector is one of the top sectors where data is multiplying exponentially with time, and it requires planned and structured storage of that data. Continuous web scraping and data extraction are necessary to gain useful insights for renewing health insurance policies periodically as well as for offering affordable and better public health solutions. Web scraping and data extraction together can process the mammoth mounds of healthcare data and transform them into information useful for decision-making.

To reduce the gap between the various components of the healthcare sector (patients, doctors, pharmacies and hospitals), healthcare organizations and websites will have to tap technology to collect data in all formats and present it in a usable form. The healthcare sector needs to overcome the lag in implementing effective web scraping and data extraction techniques as well as quicken its pace of technology adoption. Web scraping can contribute enormously to the healthcare industry and help organizations methodically collect and process data to identify inadequacies and best practices that improve patient care and reduce costs.

Source: https://www.promptcloud.com/blog/why-health-care-companies-should-use-web-scraping