# Digital Text, Data Creation and Usage
## Problems and approaches
For historians who aim to build up a larger digital corpus, the main problem is evident: data often has to be generated from physical, “analog” sources. Typically, historians retrieve physical documents from a physical archive, digitize them with a camera or scanner, and perform OCR to recognize the written characters. But what do we do then? Even these first steps are already problematic (e.g., what about a paleographer’s needs, or hand-written texts in general? Interesting links on that: [Transcription](https://transcription.si.edu/) or [Transkribus](https://transkribus.eu/Transkribus/#scholar-content)). We then have to transform the images into a text file format (such as those listed below). At the same time, for large corpora it is vital to give structure to our different entities by creating structural metadata and by defining digital objects.

In the case of my project, OCR has already been performed, and my main material is text. But what exactly is text?

## Text formats
There are many different text formats available, and choosing one depends on how the textual data is to be used. Here I list some, ordered from [unstructured to structured data](https://www.datamation.com/big-data/structured-vs-unstructured-data.html) (whereas simple binary code would be the most structured data):

| **Format** | **Definition** | **Category** | **Use Cases** | **Advantages** | **Disadvantages** |
| ------ | ------ | ------ | ------ | ------ | ------ |
| **Plain text** | A format used to represent written language that draws upon a fixed set of reusable symbols arranged to express meaning. Data are stored in bytes using a [Unicode encoding](https://www.youtube.com/watch?v=5aJKKgSEUnY), such as UTF-8. Example: ASCII text file | Unstructured | Store and exchange unformatted text | Popular and human-readable, thus simple to create and process | The lack of structure impedes software processing |
| **Comma Separated Values (CSV)** | A format built upon plain text to encode tabular data in rows, columns, and cells. A comma separates one data cell from another; a newline or line break separates rows. Example: tabular data exported from a spreadsheet program | Structured | Store a single table of data; export and exchange format for tabular data | Popular; simple to create and process; flexible | Can only represent one table of data |
| **JavaScript Object Notation (JSON)** | A format built upon plain text that uses key:value pairs to represent lists of items and dictionaries of items. Items may be strings, lists, or dictionaries. Example: output of a call to an API | Structured | Store and exchange structured data; commonly used for web-based [APIs](https://www.youtube.com/watch?v=s7wmiS2mSXY) | Simple to create and process; can represent complex data | Rigid structure |
| **Hypertext Markup Language (HTML)** | A format built upon plain text used to represent complex documents, used extensively on the World Wide Web. HTML uses a defined set of tags with which to mark up documents to indicate typographic and other features. Example: web document | Semi-structured | Store web-based documents | Standard for representing complex documents; widely supported across all systems; can represent text formatting and structural units | Standard is evolving; the same HTML document may be rendered differently across different systems |
| **eXtensible Markup Language (XML)** | A format similar to HTML that is used to represent complex documents and data in a way that can be reliably processed automatically. XML supports user-defined tags to mark up textual data into semantically significant units. Examples: TEI file, RSS feed | Semi-structured | Store program configuration data; store and exchange data between web applications; annotate textual materials for programmatic processing | Has a single, correct interpretation; the Text Encoding Initiative (TEI) XML standard is **widely used in the humanities** | Imposes a strict hierarchy on document elements; difficult to represent overlapping text elements; harder for humans to read than JSON |
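
As a quick illustration of the trade-offs in the table, the same record can be serialized to CSV and to JSON with Python's standard library. The record (a fictitious newspaper article) and its field names are invented for this sketch:

```python
import csv
import io
import json

# A hypothetical record describing one digitized newspaper article.
record = {"title": "Local Election Results", "date": "1912-05-03", "page": 4}

# CSV: flat and tabular -- one header row, one data row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
csv_text = buf.getvalue()

# JSON: self-describing key:value pairs, which could also hold nested data.
json_text = json.dumps(record, indent=2)

print(csv_text)
print(json_text)
```

CSV keeps the data minimal but loses the field types; JSON carries the structure with every record, at the cost of some verbosity.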
## Digitizing Objects as Defining Objects
A different issue for me is creating metadata and defining my [digital objects](https://www2.archivists.org/glossary/terms/d/digital-object): how do I define my object if it is not simply a given, a datum?

Many humanities practitioners, such as archaeologists and art historians, study not only texts but the broader world of objects that humans create. Such objects may include the pictures people take, the physical artworks they produce, or the everyday utilitarian objects they use. Digital Humanities practitioners may want to digitize objects for a variety of reasons. For instance, some seek to understand the properties of objects at an aggregate or statistical scale via computational strategies. In this case, digitization might involve collecting tabular data that contains measurements of an artifact and representing it digitally in a relational database. So the issue here is: how do I define my object if I want to browse a century of newspapers, for example? Is my object one single issue, one page, or one article?
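
The aggregate, database-backed approach mentioned above can be sketched with Python's built-in `sqlite3` module; the table layout and the measurements below are invented for illustration:

```python
import sqlite3

# A hypothetical table of artifact measurements (all values invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (id INTEGER, kind TEXT, width_cm REAL)")
conn.executemany(
    "INSERT INTO artifacts VALUES (?, ?, ?)",
    [(1, "vase", 12.5), (2, "vase", 14.0), (3, "coin", 2.1)],
)

# An aggregate/statistical question: average width per kind of artifact.
rows = conn.execute(
    "SELECT kind, AVG(width_cm) FROM artifacts GROUP BY kind ORDER BY kind"
).fetchall()
print(rows)
```

The point of the sketch is that once objects are defined as rows with measured attributes, statistical questions become simple queries.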
## Examples of Online Repositories
A lot of data for Digital Humanities use is available online in pre-structured file formats. Here are some examples:

* [**Projekt Gutenberg**](https://www.gutenberg.org)
* [**Registry of Research Data Repositories**](https://www.re3data.org/): a searchable registry of over 2000 repositories that host research data
* [**Harvard Dataverse**](https://dataverse.harvard.edu)
## Web Scraping and HTML
Web scraping refers to any technique that exploits the consistency of document structure across multiple pages to automatically extract data from those pages. Web sites provide access to information in a semi-structured format called HTML (see above), which is designed to deliver just enough information to a Web browser to create an interactive, visual representation of mixed-media content (any combination of text, image, video, etc.), using a "page metaphor" in which an HTML document corresponds to a web page of any length.

A key objective of the structural elements of HTML is to provide enough information about how the content should be displayed and interacted with, so that a browser can accurately reproduce the desired layout of the material on the virtual page. Consequently, much of the markup **in HTML is closely related to aspects of layout**: for example, that a particular piece of text should be displayed in a particular font, or that a particular image should be placed at a certain distance below and to the left of another page element.

Although the structural elements HTML contains are not intended to facilitate access to semantic units in the data, in practice a combination of layout-related descriptors often uniquely identifies a particular semantic element on a page. That is where web scraping comes in: on a newspaper Web site such as *Spiegel*, it will often be the case that all headings of a particular type are displayed in a particular font, size, and style, and that they appear at certain regular positions on each page; the same holds for the main text of the article and for many other meaningful data. Although there is no way to know in advance what these formatting properties will be, they are generally consistent within a particular Web site or a section of it. Once these formatting properties are known, they can be used to programmatically extract the appropriate information from any similar page; HTML class names, for instance, can then serve as tags for extraction. Because the vast majority of Web sites are database-driven (that is, parts of their core content are actually stored as structured data in a database and then combined with layout information to generate readable HTML pages), the systems that implement Web site functionality generally follow precise and consistent rules, which makes the generated HTML content predictable. Nevertheless, web-scraping software has to be re-tuned for different websites (e.g., for different newspaper archives).
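
A minimal sketch of this idea, using only Python's standard library: the parser below collects the text of every element whose `class` attribute contains `headline`. The HTML snippet and the class name are invented assumptions standing in for a real newspaper page.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collect text inside elements whose class contains 'headline'."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # The layout-related class name serves as our extraction tag.
        classes = dict(attrs).get("class", "")
        if "headline" in classes.split():
            self.in_headline = True

    def handle_endtag(self, tag):
        self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


# Invented sample page; a real page would be fetched over HTTP first.
sample_html = """
<html><body>
  <h2 class="headline">Flood in the Old Town</h2>
  <p class="article-text">Heavy rain caused ...</p>
  <h2 class="headline">Council Approves Budget</h2>
</body></html>
"""

scraper = HeadlineScraper()
scraper.feed(sample_html)
print(scraper.headlines)
```

For real projects, libraries such as Beautiful Soup offer more robust selection, but the principle (matching consistent, layout-related markup) is the same.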
Examples of web scraping: [Internet search engines such as Google rely heavily on web scraping to create an index of content available on the web](https://blog.proxycrawl.com/how-google-scrape-websites/). Although they primarily index the full-text content of pages, in practice search engines also attempt some degree of structured data extraction, ranging from simply distinguishing the main content of an individual page from the “boilerplate” navigation content that appears in similar form on many pages, to more detailed aspects of structure such as dates of publication and sections within pages.

There are several problems with web scraping: some concern copyright, while others are more technical in nature, such as overloading a website by sending too many requests.
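
One common courtesy measure for the technical side is to respect a site's `robots.txt` and to pause between requests. A minimal sketch with Python's standard library; the robots.txt content and the URLs are invented for illustration:

```python
import time
import urllib.robotparser

# Invented robots.txt content; normally fetched from the site itself,
# e.g. https://example.com/robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

urls = [
    "https://example.com/archive/1912/05/03",
    "https://example.com/private/admin",
]

# Keep only URLs the site allows our (hypothetical) scraper to fetch.
allowed = [url for url in urls if rp.can_fetch("my-scraper", url)]
print(allowed)

# When actually fetching, pause between requests to avoid overload:
# for url in allowed:
#     html = urllib.request.urlopen(url).read()
#     time.sleep(2)  # honor the crawl delay
```

This does not settle the copyright questions, which depend on the source and jurisdiction, but it addresses the most common technical complaint against scrapers.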
Other sources: Wikipedia, edX