Cyber Artifacts: Mining for Data

I have to admit, this week's reading had me out of my depth, and I found myself googling (ironically) some of the terms in the articles on data mining. What I took from the readings was that data mining means finding the key words and phrases in documents, in context, so that search engines can give us the best results possible. Patrick Leary's article Googling the Victorians really helped to put this in context when he pointed out that this is the type of thing historians do on a daily basis with the books and articles that we read. Historians don't read every single word of other people's scholarship, but instead skim for key ideas and the evidence to support them. Search engines seem to do the same thing, just on a much larger scale, with millions of sites at the same time.
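As a rough illustration of that skimming idea, here is a minimal Python sketch that counts the most frequent non-trivial words in a passage. The stop-word list and the function name top_keywords are made up for illustration, and real search engines do far more than this.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "is", "that", "with"}

def top_keywords(text, n=3):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

# Invented sample passage, used only to show the function in action.
sample = (
    "Historians skim books for key ideas. "
    "Search engines skim pages for key words and key phrases."
)
print(top_keywords(sample, n=2))
```

Counting words is only the crudest version of "finding the key words," but it shows why a machine can skim millions of pages in the time a person skims one chapter.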

Leary also brings up another argument that people are still making seven years later: the internet versus the library or the archive. Which is better? Only yesterday I was on a database of African-American newspapers from the 20th century, looking up articles spanning a decade from several different publications. I would never have been able to do that with the original print media in the short time it took me online. Leary seems to be of the opinion that many have adopted today: the internet and the digital age are clearly here to stay, so we might as well embrace them and make the most of them. However, there is one thing that I have not been able to duplicate in an online search that a physical library gives me instantly: related works. In a library, I can look up a book on, for example, Alice Paul, and when I go to that section I can find multiple other books about her and the feminist movement that would be extremely useful. The closest thing that I have found online (and if anyone out there knows something different, please let me know) is the set of recommendations that Amazon gives you at the bottom of a product page. Instead, when searching online, I have to type several variations of the same query into the search engine to make sure that a representative set of sources comes up.

This speaks to what the other two articles were about. In his 2006 blog post, Searching for History, William J. Turkel talked about the release of data on searches made by about 500,000 AOL users. My first thought was that the fact that the release came from users of AOL showed the age of the article right off the bat. However, it was really informative about the ways different people search for historical subjects. Even today, looking at this data can show you how to set up your site so that a search engine will grab it over other, similar sites. For example, Turkel states that most people search "American history" or "U.S. history" rather than "history of the United States."
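The kind of comparison Turkel describes can be sketched as a simple tally over a query log. This is only a sketch under loud assumptions: the queries below are invented for illustration and are not taken from the AOL release, count_phrasings is a hypothetical helper, and the plain substring matching it uses is much cruder than a real analysis would be.

```python
from collections import Counter

def count_phrasings(queries, phrasings):
    """Count how many queries contain each candidate phrasing (case-insensitive)."""
    # Start every phrasing at zero so unmatched ones still appear in the result.
    counts = Counter({p: 0 for p in phrasings})
    for query in queries:
        # Lowercase and collapse extra whitespace before matching.
        normalized = " ".join(query.lower().split())
        for phrasing in phrasings:
            # Simple substring match; an assumption made for brevity.
            if phrasing in normalized:
                counts[phrasing] += 1
    return counts

# Invented sample log, NOT real AOL data.
sample_log = [
    "American history timeline",
    "us history for kids",
    "american history museum",
    "history of the united states",
]
phrasings = ["american history", "us history", "history of the united states"]

for phrasing, count in count_phrasings(sample_log, phrasings).most_common():
    print(f"{phrasing}: {count}")
```

A site owner could run a tally like this over their own search logs to see which wording visitors actually use before deciding how to title a page.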

The final article, From Babel to Knowledge: Data Mining Large Digital Collections by Daniel J. Cohen, talked about the benefits of using a specialized search engine, rather than just relying on something like Google, to get more accurate results. He spoke of a program he created called H-bot to use for quick history facts. My only problem with this is that it is six years later and this is the first time that I am hearing of this program. Now, maybe I'm not as connected to things as I should be and just missed this site, but to my knowledge people just use Google to look for their quick facts. I do know that specialized search engines can be extremely beneficial, however. I personally have used Google Scholar to find articles or journals for papers and found things that I would not have been able to find with a regular Google search. Paring the results down to only peer-reviewed scholarly sources makes the search much more efficient. This being said, I understand the premise of H-bot, and it is a really good idea, but how can it compete with a giant like Google when the results are the same?

This article also spoke of the importance of non-profit archives having and running their own search engines, a premise that seems way out of the reach of many not-for-profit organizations. This may have changed since the article was written, but now many sites are using Google to search within themselves. Many search bars have the little "Powered by Google" sign posted underneath them, with the option of searching just within the site or on the entire web. I'm not sure if this is the way to go about things, but the comfort of a familiar search engine like Google running these pages makes it much easier to browse through the results.


Thursday, October 25, 2012
