A Strategy for Visual Structural Data Analysis of Labor Accident Data

Labor accidents are a serious social problem that results in damages to employees, employers, and governments, and consumes a significant portion of the world's GDP. In Brazil, the Brazilian Federal Labor Prosecution Office is the institutional service responsible for the defense of worker rights, and among its functions is the supervision and control of labor health and safety. It collects data on labor accidents in Brazilian territory and makes an anonymized version of this data publicly available. This process generates a large volume of data containing important strategic information, which is often not straightforward to extract through manual analysis. Information visualization is a research area that studies the creation of visual representations for abstract structured or non-structured data, aiming to help people execute tasks more effectively. We propose a computational strategy that combines Information Visualization techniques to perform a visual analysis of labor accident data, while not being restricted to this scenario. We developed a system that implements our strategy and comprises two complementary visualizations: i) a multidimensional projection layout + a political map, and ii) a treemap layout + a parallel sets layout. We performed several exploratory analyses in order to exploit the visualizations' complementary capacities in providing simultaneous analysis of different data aspects. We obtained interesting results, identifying profiles associated with small/large geographical areas, similarities among geographically distant localities, occurrence patterns related to cities' size and economic development, and the frequency distribution of labor accident types in Brazil, and characterizing labor accidents in terms of occupation type, gender differences, and causer agent, among other aspects.
We believe that the proposed strategy facilitates and enhances the analysis of labor accident data, providing effective and efficient means to help governments to evaluate current public policies and foment the creation of new ones to reduce labor accidents and grant safety to employees, and also encourage transparency in governments and citizen participation.

"Should any political party attempt to abolish social security, unemployment insurance, and eliminate labor laws and farm programs, you would not hear of that party again in our political history." (Dwight D. Eisenhower)

Introduction
Labor Accidents (LAs) represent serious problems for the economy of a locality, because they result in physical and psychological disorders for the employee, loss of manpower for the employer, and significant expenses with compensation benefits and health-related costs for the government. Every year more than 2.78 million deaths occur due to LAs directly or to illnesses related to them, in addition to another 374 million non-fatal accidents. The estimated expenses associated with these accidents are approximately 3.94% of the global yearly GDP (ILO, 2019). In Brazil, in the period from 2012 to 2018, 4,503,631 LAs were registered, with 16,455 deaths. The expenses with accidents registered in this period amounted to R$29,145,635,014, and when counting the expenses for accidents that occurred earlier but are still being paid, the sum goes up to R$79,000,041,558 (Smartlab de Trabalho Decente MPT -OIT, 2017).
The Brazilian Federal Labor Prosecution Office (BFLPO) is the institutional service responsible for the defense of worker rights, and among its functions is the supervision and control of labor health and safety. The BFLPO collects data on labor accidents in Brazilian territory using Labor Accident Communications (LACs), documents issued to register labor accidents, route accidents, and/or occupational diseases (INSS, 2018). An anonymized version of this data is publicly available in (Smartlab de Trabalho Decente MPT -OIT, 2017). In a country of continental proportions, the volume of data generated by LA occurrences is huge, which makes manual analysis of such data tedious, difficult, and error prone. On the other hand, automatic analysis of LA data without human intervention, employing machine learning strategies or purely statistical calculations, might limit the extraction of information from the data repository, because such approaches do not allow the refinement of results in an intuitive way to address the particular needs of the analyst, nor do they easily accommodate interactive exploration of the data. Furthermore, those automatic strategies make it difficult to understand the relationship between the produced results and the data structure; conclusions are thus reached without human involvement, which might not be the best case when public interests are at stake. Information Visualization (INFOVIS) is a research area that studies the creation of visual representations for datasets, in order to enhance human cognitive capacities and help people fulfill data exploration tasks more easily, in a form that is both efficient and effective (MUNZNER, 2014). The BFLPO provides a website named Observatório Digital de Saúde e Segurança do Trabalho (ODSST), in which LAs data can be analyzed using simple visual approaches.
Such layouts are limited in their ability to highlight patterns and data aspects, and offer few or no interaction tools, which reduces the capacity to explore the repository and diminishes the potential for extracting strategic information. Using more sophisticated INFOVIS techniques may help BFLPO specialists in the analysis and exploration of LA data. The idea is to use a combination of several widely used techniques to create visualizations that communicate the underlying structure of those data, that is, the inherent relationships between different data instances without regard to temporal aspects, exploiting their heterogeneity and associated hierarchical organization, among other aspects.
In this context, INFOVIS techniques might be appropriate for this analysis, since they represent a potential tool for aiding in the identification of patterns and trends in the data that represent strategic information for BFLPO specialists. These visualizations may allow the identification of profiles for accidents, localities, and occupations, and the correlation of several indices measured in diverse regions, highlighting similarities and differences among distinct localities, even from different hierarchy levels. An important aspect of INFOVIS is the extensive use of interactions and visual metaphors when displaying data, allowing a natural and effective exploration. While interacting with a visual representation of a dataset, new facets of the data are revealed, and this flexibility improves knowledge discovery.

Objectives
The analysis of LAs data is important to the BFLPO, as it aids the institution in its primary task of providing labor safety. However, important as it may be, there is still a lack of proper analysis tools addressing its needs. Thus, the objective of this project is to develop an exploratory visual analysis strategy for LA analysis and exploration, employing INFOVIS techniques, in order to highlight different aspects of the underlying structure of the data.
The specific objectives are:
❏ Create layouts that communicate the structure of the repository data, highlighting patterns, trends and correlations among measured indices;
❏ Create interaction tools and visual metaphors to improve data exploration, as well as to maximize information extraction;
❏ Develop a system implementing the proposed strategy, for exploratory analysis of the BFLPO repository and also to evaluate the proposed strategy.

Hypothesis
A visual analysis strategy employing information visualization techniques highlights the underlying structural characteristics of the BFLPO dataset, and is effective for the analysis and exploration of governmental data, improving decision making by specialists.

Contributions
This project has the following contributions:
❏ A system coordinating different layouts that employ established interactive visualization techniques in the analysis and exploration of LAs data;
❏ The combination of several widely used INFOVIS techniques in a coherent and effective way to aid the analysis of specific data types;
❏ A visual analysis strategy that broadens the comprehension of analysis and exploration of LAs data, particularly in Brazil, providing a better understanding of LA profiles, by means of exploring the particular LA dynamics in the country, helping to choose appropriate policies to prevent them.

Thesis Organization
The remainder of this thesis is organized as follows: Chapter 2 presents the required background knowledge as well as related work. Chapter 3 presents the proposed strategy and describes the data used. Chapter 4 describes the developed system and its implementation details. Chapter 5 discusses the results obtained. Finally, Chapter 6 concludes with an overview of what was achieved, the limitations of our approach, and future work.

Fundamentals
In this chapter we present some Information Visualization (INFOVIS) fundamentals and a discussion of related works. We first introduce basic concepts that are important to comprehend the work developed in this project, including concepts related to governmental data. We also present INFOVIS concepts, focusing on the techniques employed in this project. Finally, we discuss several strategies that employ INFOVIS techniques in the analysis of governmental data.

Basic concepts
This section presents some basic concepts used in this research and in the related work.
Attribute: Also called variable, feature or characteristic, is a measured property of a data instance.
Instance: A representation of a data entry in the original space, described by all its attribute values.
Dataset: Also called data collection, is a set of several data instances.

Dimension: Represents an attribute in a dataset when its instances are to be represented as vectors in a specific space.
Layout: A single basic visual representation which can be combined with other layouts and interactions.
Visualization: A more elaborate visual representation, composed of one or more layouts and interactions.
Interaction: Actions performed by the user of a visualization, which cause an alteration in the appearance of the visualization. The modification of the visualization itself may also be called interaction in specific situations.
Overview: Visual presentation of the general context of a dataset.
Original Space: Multidimensional space composed of all the instances with all their attributes.
Visualization Space: Space with a specific number of dimensions in which a visualization is presented.

Point: A representation of an instance in the visualization space, resulting from the application of a multidimensional projection technique.

Governmental Data
This section presents some works that address governmental data analysis, highlighting its importance to several sectors of society. Governmental data is data of any type or format collected or used by a government from all sectors of society. Given its nature, this data is usually voluminous and heterogeneous, which poses important challenges in dealing with it. Governmental data is not always publicly available, but usually is, which encourages transparency in public institutions and promotes innovative civic-centric services (OECD, 2019).
According to (RADL et al., 2013), the analysis of governmental data presents enormous potential in two ways. First, it allows the validation of previous knowledge and the gaining of insights and new knowledge about specific fields. Second, it can also help the administration of a locality by enabling the creation of new semantic technologies for the population and government. The data provided by the government in several forms, when correctly comprehended, can aid in the creation of public policies, or in the identification of areas lacking attention, thus optimizing government expenditure.
The importance of open governmental data is also reported in (GRAVES; HENDLER, 2013), which argues that simply making data open is not enough to keep the population well informed. Thus, it is necessary to create mechanisms that enable the operations required for this data to make sense.
Governmental data is also useful to promote population participation, as shown in (DÍAZ; AEDO; HERRANZ, 2014). The authors discuss how population participation is important for the development of policies for crisis management, and that for this participation to be effective the population also needs to have more confidence in the data provided by the government. It is also shown in (YING; XIALING; WEI, 2017) how a government can benefit from analyzing large volumes of data to optimize expenditure and improve the interaction between government and population.
Other data sources can also be used for governmental purposes, as in (ALOWIBDI; GHANI; MOKBEL, 2014). The authors present an application that suggests holiday locations based on Twitter data via a flow map. Maps are interesting to communicate data, since they present a familiar layout, allowing fast comprehension by the general public. The target audience of this application is citizens who want to choose a good place to spend a holiday, and the government, which can focus its tourism policies on specific places depending on the time of the year.
In Brazil, governmental data is provided by several public agencies. The Brazilian Federal Labor Prosecution Office (BFLPO) is the institution responsible for supervising work conditions, as well as intervening on labor issues to guarantee worker rights. The BFLPO has an ample dataset of labor accident records, composed of notifications made, preferably by the employer, via LACs. These data are of strategic importance to Brazil, since their correct comprehension can help direct public policies to improve work relationships. The BFLPO also makes these data freely available.
Governmental data can be an important source of information both for government and population in general, because they allow for greater transparency in the services provided by the government, and the monitoring of its actions by the population, thus increasing social welfare. However, the lack of methods capable of communicating this data in a comprehensive and efficient manner is noticeable, as well as the lack of tools for analyzing them, as discussed in Section 2.6.

Information Visualization
Information visualization studies the creation of visual representations for abstract structured or non-structured data. The representations are designed to help people execute tasks more effectively, with the use of the space for the visual encoding chosen by the designer, and one of its primary concerns is determining which techniques are appropriate for a given combination of dataset and task (MUNZNER, 2014). The tasks that concern INFOVIS are those in which the human being is crucial, such as tasks in which the problem is not well defined, or in which it is not known for sure what is being searched for. For these types of tasks INFOVIS harnesses the human being's innate capacity for visual pattern recognition.
The sense of vision has a prominent role in INFOVIS. The reason to use vision is that it is the most accurate sense in a human, capable of transmitting great quantities of data at the same time to the brain, where visual stimuli are perceived in parallel. This characteristic parallelism of vision can be contrasted with hearing, which only perceives information in a sequential manner and is much less sensitive to nuances outside the main context. As for other senses, there are limitations of a technical nature, because there are no tools yet, nor even sufficient research to begin exploring their capacities (MUNZNER, 2014).
Figure 1 depicts the classic INFOVIS workflow. First, data are obtained, which can be structured or not. Those data are organized and transformed to treat missing or spurious values, or to use another desired encoding, and then may be filtered by the selection of subsets of the data. In the mapping stage, each data subset is associated with graphic primitives (lines, circles, etc.) or more complex visual metaphors, and with graphic attributes (color, size, etc.). This mapping is rendered into an image, which is the main visual abstraction of the data, and a set of interactions is provided to manipulate it. The last stage in the process is the visualization, where the user interacts with the rendered image by means of controls in a graphical interface to explore the dataset (LIU et al., 2014). INFOVIS techniques are many and diverse, and a good overview of them can be found in (HEER; BOSTOCK; OGIEVETSKY, 2010).

Multidimensional Projections
Multidimensional projection refers to the act of representing an n-dimensional dataset in a p-dimensional space, where p << n and usually p = 2 for visualization purposes, while also trying to preserve the characteristics present in the many attributes in a reduced set of synthetic attributes. To reach this goal, dimensionality reduction techniques are applied to the dataset. The results of these techniques are commonly visualized using a scatterplot layout. Assuming that the original relationships among instances are preserved in this layout, it is possible to visualize the general structure of the dataset, to perceive similarities between seemingly distinct instances, as well as to detect outliers and to select subsets of instances for other specific analyses. Several multidimensional projection techniques can be found in the literature. We present here some commonly used techniques.
Principal Component Analysis (PCA) (JEONG et al., 2009) is a technique that projects the instances of a dataset onto a new system of coordinates based on the eigenvalues and eigenvectors of the covariance matrix of the data, minimizing the redundancy and maximizing the variance between these data instances. The eigenvectors with the largest eigenvalues are the ones that describe the most significant relationships among the data instances, and are called principal components. For visualization purposes, the first two or three principal components are used to determine the coordinates of each instance in the visualization space.
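As an illustration of the PCA description above, the following is a minimal sketch using NumPy, not the implementation used in this work. The function name pca_project and the random example data are assumptions made for illustration only.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project instances onto the principal components with the largest eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each attribute
    cov = np.cov(Xc, rowvar=False)             # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    components = eigvecs[:, order]             # principal components as columns
    return Xc @ components, eigvals[order]

# Hypothetical dataset: 50 instances described by 5 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y, variances = pca_project(X)                  # Y holds the 2D scatterplot coordinates
```

The returned eigenvalues equal the variance captured along each retained component, which is one way to judge how much of the original structure a 2D scatterplot can show.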
Multidimensional Scaling (MDS) (COX; COX, 2000) is a family of several linear and nonlinear techniques. The goal is to associate instances to specific positions in a space, in such a way that the Euclidean distances between them in this space best reflect the proximities, which can be similarities or dissimilarities, observed among them in the original space. To measure the correspondence between the obtained distances and the original relationships, a stress function is commonly used: the smaller the obtained value, the better the correspondence.
t-SNE (MAATEN; HINTON, 2008) is a nonlinear method that tries to minimize the distance between similar data instances. A probability distribution is constructed over pairs of instances, so that pairs of similar objects are chosen with high probability and dissimilar pairs are chosen with a significantly low, almost infinitesimal, probability. Similarly, a distribution is constructed for a lower dimensional space, and the final positions are obtained by moving the points in this lower dimensional space so as to minimize the Kullback-Leibler divergence between the two distributions.
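The two objective functions mentioned above, the MDS stress and the Kullback-Leibler divergence minimized by t-SNE, can be written out as follows. The notation is an assumption introduced here for illustration: x_i are instances in the original space, y_i their projected points, N the number of instances, \delta_{ij} the original dissimilarities, d_{ij} the Euclidean distances in the projection, and \sigma_i a per-instance bandwidth. The stress shown is one common form (Kruskal's stress-1); other variants exist.

```latex
% MDS: one common stress function (Kruskal's stress-1)
\[
\mathrm{stress} = \sqrt{\frac{\sum_{i<j} \left(\delta_{ij} - d_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}
\]

% t-SNE: pairwise similarities in the original space
\[
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
\]

% t-SNE: pairwise similarities in the low-dimensional space (Student-t kernel)
\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}
\]

% t-SNE: cost minimized by moving the points y_i
\[
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
```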
Least Square Projection (LSP) is a method that aims to reduce the dimensionality of a set of instances, while also preserving the original neighborhood relationships among the instances. This method comprises two main steps. The first one consists in applying the MDS technique to a subset of those instances, called control points, to project them into the desired dimensional space. The second step consists in building a linear system using the neighborhood relationships of the original set and the Cartesian coordinates of the projected control points. The solutions to this linear system are the coordinates of the points in the new visualization space.
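The second step described above can be sketched as a least-squares problem. This is a simplified illustration under stated assumptions, not the reference LSP implementation: the control points are assumed to be already projected (ctrl_Y is given rather than computed by MDS), neighborhoods are plain k-nearest neighbors, each instance is asked to sit at the centroid of its neighbors, and the function name lsp is hypothetical.

```python
import numpy as np

def lsp(X, ctrl_idx, ctrl_Y, k=2):
    """Solve the neighborhood-plus-control-point linear system in the least-squares sense."""
    X = np.asarray(X, dtype=float)
    ctrl_Y = np.asarray(ctrl_Y, dtype=float)
    n = len(X)
    # Squared distances in the original space; a point is not its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # Neighborhood rows: y_i - (1/k) * sum of neighbor coordinates = 0.
    L = np.eye(n)
    for i in range(n):
        L[i, neighbors[i]] = -1.0 / k
    # Control-point rows: y_c = projected coordinates of control point c.
    C = np.zeros((len(ctrl_idx), n))
    C[np.arange(len(ctrl_idx)), ctrl_idx] = 1.0
    A = np.vstack([L, C])
    b = np.vstack([np.zeros((n, ctrl_Y.shape[1])), ctrl_Y])
    Y, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Y

# Hypothetical data: two well-separated groups, one control point in each.
X = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0],
              [10, 10, 10], [10.1, 10, 10], [10, 10.1, 10]])
Y = lsp(X, ctrl_idx=[0, 3], ctrl_Y=[[0, 0], [1, 1]], k=2)
```

In this toy case each group collapses onto its control point, which shows how the control points anchor the layout while the neighborhood rows pull each instance toward its neighbors.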
Local Affine Multidimensional Projection (LAMP) (JOIA et al., 2011) also employs the concept of control points. Using neighborhood information from those points, orthogonal affine mappings are built, one for each instance of the dataset. As the LAMP projection depends on the configuration of the control points, the projection might be improved interactively by the user by manipulating these control points, and the mapping is thus able to reflect user-specific perspectives of the data.
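A compact sketch of the per-instance orthogonal affine mapping is given below. It follows the commonly described formulation of LAMP (inverse squared-distance weights, weighted centroids, and an orthogonal map obtained from a singular value decomposition), but it is an illustration and not the code used in this work. The function name lamp, the eps guard, and the example data are assumptions.

```python
import numpy as np

def lamp(X, ctrl_X, ctrl_Y, eps=1e-12):
    """Project each instance with its own orthogonal affine map built from control points."""
    X = np.asarray(X, dtype=float)
    ctrl_X = np.asarray(ctrl_X, dtype=float)
    ctrl_Y = np.asarray(ctrl_Y, dtype=float)
    Y = np.empty((len(X), ctrl_Y.shape[1]))
    for i, x in enumerate(X):
        # Control points closer to x receive larger weights (eps avoids division by zero).
        d2 = np.sum((ctrl_X - x) ** 2, axis=1)
        alpha = 1.0 / np.maximum(d2, eps)
        # Weighted centroids in the original and in the visualization space.
        x_tilde = alpha @ ctrl_X / alpha.sum()
        y_tilde = alpha @ ctrl_Y / alpha.sum()
        # Weighted, centered control points.
        A = np.sqrt(alpha)[:, None] * (ctrl_X - x_tilde)
        B = np.sqrt(alpha)[:, None] * (ctrl_Y - y_tilde)
        # Orthogonal map M = U V^T from the SVD of A^T B.
        U, _, Vt = np.linalg.svd(A.T @ B, full_matrices=False)
        M = U @ Vt
        Y[i] = (x - x_tilde) @ M + y_tilde
    return Y

# Hypothetical control points lying on the plane z = 0, already placed in 2D.
ctrl_X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
ctrl_Y = ctrl_X[:, :2]
X = np.array([[0.3, 0.7, 0.0], [1.0, 0.0, 0.0]])
Y = lamp(X, ctrl_X, ctrl_Y)
```

Because the mapping depends only on ctrl_Y, moving a control point in the layout and calling the function again is enough to obtain the interactively adjusted projection.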
A more detailed discussion of these and other techniques can be found in (NONATO;
perceiving the volume of data that intercepts a given value in an axis, the modification proposed in (TUOR; EVÉQUOZ; LALANNE, 2016) is interesting. The layout, called Parallel bubbles, places circles of variable radius, the bubbles, on the axis ticks, and the radius of a circle is proportional to the volume of lines intersecting each point, as can be seen in Figure 3, in which continuous values in the left axis are mapped to categorical values in the right axis. In a traditional parallel coordinates layout, a large volume of data can cause overlapping lines, and using bubbles makes it easier to detect and compare the volume of data that crosses an axis at a given point. A visualization called Pivotviz is proposed in (NIELSEN; GRØNBÆK, 2015). It combines a parallel coordinates layout with summarizations in a pivot table representing the result of a specific selection. This work employs a modification of the parallel coordinates, which is the use of a single line of variable width to combine identical transactions. This approach is useful for revealing clusters, but subtle groupings might be masked. Figure 4 shows an example of this visualization, in which the value "Renew" is selected, which causes a great number of lines to be highlighted. The analysis of correlations between axes might be impaired, since it is not possible to compare the slope of a set of lines. This may occur because variable width lines might overshadow other lines or be almost imperceptible. Depicting summarizations of the data using a visualization based on a parallel coordinates strategy may be useful to reveal important information in massive datasets, such as the one provided by the ODSST.
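The sizing rule of the Parallel bubbles layout described above can be sketched in a few lines. This is only an illustration of the proportionality between radius and line count. The function name bubble_radii, the max_radius parameter, and the example values are assumptions, not part of the cited work.

```python
from collections import Counter

def bubble_radii(values, max_radius=20.0):
    """Map each distinct axis value to a radius proportional to the number of lines crossing it."""
    counts = Counter(values)
    if not counts:
        return {}
    largest = max(counts.values())
    # The most frequent value gets max_radius; the others scale linearly with their count.
    return {value: max_radius * count / largest for value, count in counts.items()}

# Hypothetical categorical axis: one entry per polyline crossing the axis.
axis_values = ["fall", "cut", "fall", "burn", "fall", "cut"]
radii = bubble_radii(axis_values)
```

A design using area rather than radius as the proportional quantity would apply a square root to the ratio; the version above follows the radius-based description in the text.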
Source: http://kk.datavis.dk/thebooksof.html
The combination of proportional and nonproportional layouts is explored in (WITTENBURG; TURCHI, 2016). A proportional layout represents attributes as areas proportional to their values. This research work employs a combination of a treemap with nonproportional layouts, such as line charts and histograms. The goal is to facilitate the comparisons between parallel structures in a treemap, by embedding any nonproportional layout inside the treemap divisions. Although useful, the embedding might limit the visibility of parallel structures, as stated by the authors. We believe that the combination of a treemap with other proportional layouts may improve the visualization of multiple attributes, providing a more detailed analysis.
A visual analytics system for financial sector data, more specifically stock market data, is presented in (LEI; ZHANG, 2010). According to the authors, the analysis of this type of data is usually performed by examining multiple charts in conjunction, with no combination among them, consuming an excessive amount of time. In this sense, a holistic view of the data greatly benefits the analysis, because it permits the identification of trends and the rapid making of projections. Figure 5 shows an example of two combined layouts in the system: the ring-shaped layout on top shows the volatility of stocks, while the one on the bottom shows price clusters for stocks selected by clicking on the ring. This work is more concerned with analyzing data over time; however, the characteristics present here, that is, the combination of layouts and interactions for the exploration and general visualization of a dataset in a visual analytics system, can be readily transposed to a structural visualization-centric approach. This can be done using techniques such as multidimensional projections, parallel coordinates, and treemaps, providing similar benefits.
Four visualizations with different foci are shown in (SHARMIN et al., 2015): spatiotemporal, temporal, contextual, and event-centric. The visualizations are intended to aid just-in-time adaptive intervention, geared towards patients who suffer from stress. The idea of a visualization that uses contextual information, as shown in Figure 6, is interesting for this research, as the LACs data have several contextual attributes for each accident occurrence, which can be exploited to identify profiles of accidents. Information such as where an LA occurred, the occupation of the injured, the economic activity, and the sex of the employee can be related in layouts such as parallel coordinates and its variations, or even directly combined with multidimensional projections.
The research works cited in this section present a variety of INFOVIS techniques that help to highlight the structural characteristics of a dataset. This research explores the use of those techniques, or modifications of them, that may improve the analysis of the data under study, as well as the extraction of information about LAs.

Governmental Data Visualization
This section presents some works that make use of INFOVIS techniques for the visual communication of governmental data, showing the potential of these techniques in highlighting patterns that represent strategic information for managers and population.
In Albania, a website was developed in which visualizations of governmental data are provided (HOXHA; BRAHAJ; VRANDEČIĆ, 2011), in an attempt to make this data more accessible. This website presents data visualizations on education, economy, demography, poverty, science and technology, justice, tourism, agriculture, and energy. While this initiative is interesting, the visualizations employed are simple and have low interactivity, basically composed of static graphs with tooltips. Figure 7 presents some examples taken from the website, showing pie charts, bar charts, and lines with bars. The layouts' simplicity may impair data analysis, because it may be difficult to perceive complex patterns, relate different instances, and identify groups, among other tasks.
Dataviva (DATAVIVA, 2013) is a platform for the visualization of Brazilian governmental data. It provides data about employment and income, commerce, education, and health, with the target audience being educators, entrepreneurs, as well as the general populace. Visualizations available include bar chart, treemap, ThemeRiver, and choropleth map, among others. Figure 8 shows a choropleth map, available in the platform, encoding the number of employees in higher education in Brazil in 2016. A divergent color scale is used, ranging from blue to red, in which red represents higher concentration. It is possible to notice that the Sul and Sudeste regions concentrate the highest number of employees in higher education, with São Paulo being responsible for the highest concentration. However, for states that present similar hues, it is not possible to distinguish which has higher values, or what the level of similarity between them is. It is also difficult to know if there is any relationship between the states by looking only at this layout. There is also a lack of coordination among the layouts, and it is not possible to compare indices in a practical manner without alternating between layouts.
Furthermore, the layouts do not offer any interactivity except for the exhibition of tooltips. Other similar approaches can be found applied to open governmental data of other countries (International Food Policy Research Institute (IFPRI); DATAWHEEL, 2017; DATAWHEEL, 2016; DATAWHEEL, 2018).
In order to highlight the geographical distribution of governmental data, maps with pins are used in (MENDONÇA; MACIEL; FILHO, 2014) to monitor the incidence of the Aedes aegypti mosquito in the city of Cuiabá, using data from the Mato Grosso State Secretary of Health. However, the visualization also has low interactivity and exploration capacity, and it is only accessible to authorized personnel, which restricts the potential to make this data transparent, limiting its analysis and popular participation. Also using maps to show incidence indices, there is a website named Observatório da Intervenção (Intervention Observatory), which monitors the cases of violence and rights violation resulting from the military intervention in Rio de Janeiro. This website presents a bubble map, in which each bubble represents a value corresponding to the number of violations in the region. Clicking on a bubble zooms into the map region and splits the bubbles to give greater precision on the locality of the occurrences, up to street level. An example of this map can be seen in Figure 9. There are also some tables available containing information about the occurrences.
Source: http://observatoriodaintervencao.com.br/dados/mapa-da-intervencao/
The works previously discussed fulfill more of an informative than an analytical role, but there are other works that employ governmental data for analytical purposes, such as (ZHIYUAN et al., 2017), which makes use of several different visualizations to analyze the flow of passengers in the Shanghai Metro. This work shows that it is important to offer different perspectives of the data over different layouts to understand big and complex datasets. The system uses bubble, bar, and line charts, heat and flow maps, and a starplot, highlighting different aspects of the data, such as passenger movements, how full a station is, and the most used paths, among others.
In (RODRIGUES et al., 2017) a web application was developed to provide information about energy production in Germany. The application combines maps with glyphs and uses ThemeRiver to show the evolution of energy production in the power plants. An example of this application is shown in Figure 10, which shows the integration between the two layouts. Selections and filters can be used on the map, which in turn modify the ThemeRiver visualization. The focus of this research is to foment political discussion, nudging, and storytelling, having an ample target audience, including the general population. In our research work we also combine different layouts and interaction tools to enable the analysis of different data aspects, thus aiding specialist analysis and the population's comprehension of the data.
Source: (RODRIGUES et al., 2017)
In the specific case of LAs, Groeger, Grabell and Cotts (2015) present a visualization that shows the amount of compensation received, in the USA, by workers who lost limbs due to LAs. The visualization consists of pictograms representing the human body, one for each American state, with the value of the compensation encoded as the size of each limb, as shown in Figure 11. It can be noticed that the state of Alabama pays the lowest compensations in general, while Georgia, a bordering state, grants much bigger compensations. This visualization works better as an informative tool than for deep analysis, but it shows how intuitive visual metaphors can facilitate the comprehension of data and reveal similar behavior between geographically distinct regions.
In Brazil, the ODSST (Smartlab de Trabalho Decente MPT -OIT, 2017) is the platform currently provided by the BFLPO to communicate data about LAs. Figure 12 shows some examples of layouts available in the website, and it is noticeable how simple the visualizations are. Figure 12A shows a non-interactive pie chart that shows only a very limited part of the data, impairing its comprehension. This also occurs in the example of Figure 12B, in which the use of bubbles to encode the number of accidents might not be appropriate, since bigger bubbles might limit the visibility of smaller ones.
In (LIMA; PAIVA, 2017) data provided by DataViva are displayed using the PCA and MDS techniques. In the generated layouts it was possible to identify groups of localities with similar behavior profiles, as well as some localities with peculiar behaviors in relation to what was expected with the considered set of indices. Figure 13 shows PCA and MDS scatterplots. Figure 13A shows the PCA layout of municipalities from the mesoregions of Triângulo Mineiro and Norte de Minas with respect to international commerce, in which 4 groups can be noticed, 3 of them containing cities from different mesoregions, denoted by colors, presenting similar international commerce profiles. The fourth group is composed of a single outlier, the city of Uberaba, which shows commerce characteristics totally different from the others. Figure 13B shows an MDS layout regarding higher education in microregions from the Santa Catarina and Rio Grande do Sul states. The distinct colors represent the state to which a microregion belongs, and some intersection among localities from different geographic regions can be noticed, as well as two distinct outliers. Group 2 represents a microregion, Sananduva, with weak educational infrastructure, and group 3 the microregion of Porto Alegre, which stands out from the rest because of the large values of the indices measured there.
This research work shows how analysis strategies focused on the structural characteristics of the data can contribute to the extraction of relevant information. INFOVIS techniques have seldom been employed for the visualization of governmental data, although with interesting results. There is still a lack of visual analytics systems for governmental data and none of them, to the best of our knowledge and up to the writing of this document, has been employed in the analysis of LA data.

Final Considerations
Adequate analysis of governmental data has enormous potential to improve government policies and population welfare. However, there is a lack of appropriate tools to facilitate this analysis and communicate such data. We showed that the use of INFOVIS techniques might be a solution to this problem. There are interesting research works capable of capturing the underlying structural characteristics of a dataset. However, the application of such techniques to the analysis of governmental data, specifically LA data, is still lacking. This research aims to create a strategy for the analysis of governmental data, focusing on LACs, combining structural visualization techniques, such as maps, treemaps and multidimensional projections, with sophisticated interactions that help coordinate those layouts efficiently.

Proposal
This research work proposes a strategy for the visual structural analysis of LA occurrence data. The developed strategy intends to facilitate the comprehension of the structure of those data, by identifying accident behavior profiles and correlations between data measured in different geographic regions, among other tasks. We propose two visualizations, i) a combination of a multidimensional projection with a political map, and ii) a combination of a treemap with a parallel sets layout.
While this strategy is not restricted to a particular dataset, it was conceived to address the lack of analysis strategies capable of meeting the needs of BFLPO data analysis, and it should fulfill the following requirements: r1: identify work profiles; r2: identify areas lacking attention; r3: characterize localities, independently of their geographical position; r4: characterize wide geographical areas.
Thus, this chapter details the data from this repository, all necessary preprocessing steps, as well as the design decisions taken to produce the visualizations for the analysis.

Data Description
LA data are freely available in the BFLPO repository. Table 1 describes all the attributes used in this research. We decided not to consider the temporal attributes, as individually they do not add any strategic information to the analysis. The exception is the attribute "time", which was transformed into a binary attribute indicating whether the accident occurred during the day or the night shift. We refer to this new attribute as "shift".
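The transformation of "time" into "shift" can be illustrated with a minimal sketch. It assumes that the accident time is available as an "HH:MM" string and that the day shift spans 06:00 to 17:59. Both the input format and the shift boundaries are illustrative assumptions, since they are not stated in the text.

```python
from datetime import datetime

# Assumed boundaries of the day shift (hypothetical; the text does not state them).
DAY_START_HOUR = 6
DAY_END_HOUR = 18  # exclusive


def to_shift(time_str: str) -> str:
    """Map an 'HH:MM' accident time to the binary 'shift' attribute."""
    hour = datetime.strptime(time_str, "%H:%M").hour
    return "day" if DAY_START_HOUR <= hour < DAY_END_HOUR else "night"


# Example times (hypothetical) covering both sides of each boundary.
shifts = [to_shift(t) for t in ["07:30", "17:59", "18:00", "23:15", "02:40"]]
```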

LAC Emitter (Nominal): Indicates who reported the LA occurrence to the authorities through the LAC.
Only the state and city where an accident occurred are present in the original BFLPO data, so we added the complete Brazilian territorial division provided by the IBGE (see Table 2), in order to provide detailed analysis for all geographic hierarchy levels. However, this information was not taken into account when generating the multidimensional projections, as the aim there is to identify behavior profiles that are independent of geographical location.

Data preprocessing
The main idea of the proposed strategy is to allow the identification and characterization of groups of localities based on their structural characteristics, that is, the inherent relationships among data instances, manifested via correlations among measured attributes and behavior profiles, among others.
As can be seen in Table 1, the majority of the attributes are categorical, and some of them present a large number of categories. Thus, we decided to group them into fewer and broader categories, keeping their original meaning. We manually grouped the attribute "ds_parte_corpo_atingida", and grouped the attributes "ds_cbo", "ds_cnae_classe_cat" and "ds_agente_causador" according to external official criteria. The sources used for each grouping and the resulting reduction in categories are presented in Table 3.

Table 2 – Brazil's geographical division, comprising five hierarchical levels, from largest to smallest. Only states and cities have political autonomy, with their own laws and constitution but subordinated to federal laws and constitution.
Hierarchy | Level | Description | # of Categories
Region | 1 | Division of Brazil's states in groups based on similarities to help the interpretation of statistics, without political autonomy. | 5
State | 2 | Subdivisions of the country with political autonomy but with some subordination to the country's government. | 27
Mesoregion | 3 | Subdivision of the states to help the interpretation of statistics, without political autonomy. | 137
Microregion | 4 | Subdivision of the mesoregions to help the interpretation of statistics, without political autonomy. | 554
City | 5 | Administrative subdivision of the states with political autonomy but subordinate to the power of the states and the country. | 5570

It is possible to use multidimensional projections for the analysis of a variety of data aspects, from different perspectives. We decided to perform the analysis from the perspective of the cities. We then summarized all LACs for each city, and used these summaries to represent each of them. However, effective summarization of categorical data is not trivial. In this work, we transformed all categorical attributes into numeric attributes by creating dummy variables. The summarization is then performed by summing, for each city, the values of each dummy variable and computing the mean over all of the city's occurrences.
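The dummy-variable summarization described above can be sketched as follows. This is a minimal illustration on toy records, not the preprocessing code used in this work. The attribute names, category values and the helper summarize_by_city are hypothetical.

```python
from collections import defaultdict

# Toy LAC records (hypothetical values): one dict per accident communication.
records = [
    {"city": "A", "gender": "male", "shift": "day"},
    {"city": "A", "gender": "female", "shift": "day"},
    {"city": "A", "gender": "male", "shift": "night"},
    {"city": "B", "gender": "female", "shift": "night"},
]
categorical = ["gender", "shift"]

# One dummy column per (attribute, category) pair observed in the data.
columns = sorted({(a, r[a]) for r in records for a in categorical})


def summarize_by_city(rows):
    """Sum the 0/1 dummy values per city, then divide by the city's LAC count."""
    sums = defaultdict(lambda: [0] * len(columns))
    counts = defaultdict(int)
    for r in rows:
        counts[r["city"]] += 1
        for j, (attr, cat) in enumerate(columns):
            sums[r["city"]][j] += 1 if r[attr] == cat else 0
    return {city: [s / counts[city] for s in sums[city]] for city in sums}


# Each city becomes one numeric vector, ready to be fed to a projection technique.
city_vectors = summarize_by_city(records)
```

Under this reading, each entry of a city's vector is the fraction of that city's LACs falling in the corresponding category, so cities with very different numbers of LACs remain comparable.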

Figure 16 – Labor Accidents in Brazil
Parallel coordinates is a widely used layout to show multivariate data, as discussed in Chapter 2, Section 2.5. However, it is not appropriate for massively categorical datasets, which is the case of the BFLPO repository. Parallel sets (KOSARA; BENDIX; HAUSER, 2006) is a layout derived from parallel coordinates and specially designed for categorical data visualization. This layout uses parallel axes to visualize several attributes simultaneously, but focuses on showing sets and subsets of items, instead of individual data points, and on showing how these different sets relate to each other. Each attribute in the dataset is represented by an axis, divided into pieces (categories) whose sizes are proportional to the frequency of each category. A "ribbon" connects a category in one axis to a category in another axis if they occur simultaneously, so that the width of a ribbon is proportional to the size of the intersection of the sets up to that point. Thus, "paths" are formed by these ribbons, which can be changed by reordering the axes/categories, resulting in different visual patterns.
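The quantities encoded by a parallel sets layout can be illustrated with a short sketch for two axes. The records below are hypothetical, and the snippet only computes segment and ribbon sizes. It does not draw the layout.

```python
from collections import Counter

# Toy categorical records (hypothetical), as (gender, accident place) pairs.
records = [
    ("male", "workplace"), ("male", "workplace"), ("male", "public way"),
    ("female", "workplace"), ("female", "public way"), ("male", "rural area"),
]
total = len(records)

# Axis segments: the size of each category is proportional to its frequency.
axis_gender = Counter(g for g, _ in records)
axis_place = Counter(p for _, p in records)

# Ribbons: one per pair of co-occurring categories; width = size of the intersection.
ribbons = Counter(records)
ribbon_share = {pair: n / total for pair, n in ribbons.items()}
```

With more than two axes, the same counting would be applied to longer tuples of categories, one tuple prefix per path.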
An example of parallel sets showing data on survivors of the Titanic shipwreck can be seen in Figure 17. In this figure one notices that even though the number of survivors was almost evenly split between women and men, only a small fraction of the men survived, while the majority of the women survived. The further ribbon paths help to narrow down the survivor profile, allowing one to notice, for instance, that approximately half of the surviving men were part of the ship's crew.
Parallel sets is appropriate to show the relationship between the different attributes characterizing LAs, as well as to permit the exploration of this data from different perspectives, by starting with different attributes such as gender or economic activity. The treemap enables the exploration of the data at different granularities in the parallel sets and contextualizes the number of LACs in a locality. The resulting visualization can be seen in Figure 18.
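The values a treemap needs for such a hierarchy can be sketched by counting LACs per node. The example uses only three of the five hierarchy levels (region, state, city) for brevity, and the LAC counts are hypothetical. The helper children is introduced only for illustration.

```python
from collections import Counter

# Toy LACs (hypothetical counts), each tagged with its path in the geographic hierarchy.
lac_paths = [
    ("Sudeste", "São Paulo", "Santo André"),
    ("Sudeste", "São Paulo", "Santo André"),
    ("Sudeste", "Minas Gerais", "Uberaba"),
    ("Centro-Oeste", "Distrito Federal", "Brasília"),
]

# Every prefix of a path is a treemap node; its value is the number of LACs below it.
node_counts = Counter(path[:depth] for path in lac_paths for depth in range(1, 4))


def children(node):
    """Direct children of a node with their LAC counts (the rectangles drawn inside it)."""
    return {p: n for p, n in node_counts.items()
            if len(p) == len(node) + 1 and p[:len(node)] == node}


sudeste_states = children(("Sudeste",))
```

Selecting a node would then restrict the records passed to the parallel sets to those whose path starts with that node.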

System Description
This chapter presents the system developed to implement the analysis method proposed in Chapter 3. It details each visualization and its accompanying interactions, as well as how we combine them to provide several data perspectives. This chapter also presents the materials used for the development of the system.

Political map
We employ a scatterplot layout to present the results of the multidimensional projection. As discussed in Chapter 3, we expect that the layout will reveal patterns that communicate the behavior profile of the cities, as well as how they relate to each other regarding these profiles. Each point in the scatterplot represents a city, using information from 2012 to 2017, and is colored according to the mode of a specific categorical attribute. The categorical attributes available for coloring are: Region, State, Mesoregion, Microregion, Causer Agent, Injury Nature, Occupation, Accident Type, Accident Locality Type, Injured Body Part, Economic Activity, Shift, and LAC Issuer.
After choosing a coloring attribute, another dropdown menu is used to select multiple categories of that attribute, which toggles the color of all the points having that category as mode. That is, points not currently highlighted are colored black, and the others receive the color corresponding to their mode category. Finally, further to the left there is a checkbox which allows the user to toggle the color of all points at once, that is, they can all be colored black, or each can receive the color of its category. Figure 19 shows an example in which the colors of all cities were toggled off with the checkbox, and subsequently the category "Minas Gerais" of the attribute "uf" was selected.
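The coloring rule described above can be sketched as follows. This is a simplified model of the interaction, not the system's actual code. The palette, the toy data and the function names are hypothetical, and the highlight state is assumed to be a set of categories that the dropdown toggles and the checkbox fills or clears.

```python
from collections import Counter

BLACK = "#000000"
# Hypothetical palette: one color per category of the chosen attribute.
palette = {"day": "#1f77b4", "night": "#ff7f0e"}

# Toy data (hypothetical): the attribute value of every LAC registered in each city.
lacs_by_city = {
    "City A": ["day", "day", "night"],
    "City B": ["night", "night", "day"],
    "City C": ["day"],
}


def mode_category(values):
    """Most frequent category among a city's LACs (ties go to the first one seen)."""
    return Counter(values).most_common(1)[0][0]


def toggle(highlighted, category):
    """Dropdown action: flip the highlight state of a single category."""
    return highlighted ^ {category}


def point_colors(lacs, highlighted):
    """Color each city by its mode category; non-highlighted points are black."""
    colors = {}
    for city, values in lacs.items():
        mode = mode_category(values)
        colors[city] = palette[mode] if mode in highlighted else BLACK
    return colors


# Checkbox off (nothing highlighted), then one category selected in the dropdown.
state = toggle(set(), "night")
only_night = point_colors(lacs_by_city, state)
```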
A point or a group of points can be freely selected using a polygonal selection tool, and the selected points are highlighted by coloring the circle borders in red. Figure 20 shows an example of a possible selection with this tool. The formed selection shapes can

Results
This chapter presents the results of applying our proposed strategy in the analysis of the BFLPO data set. We start by detailing the analysis procedure adopted in the experiments, and then discuss our findings, highlighting the importance of each layout for the analysis, as well as how the complementarity of the visualizations may improve the analysis.

Analysis Procedure
We performed several analyses using the visualizations independently, in order to explore their individual capabilities. However, we also explored how each visualization may shed more light on the findings from the other, to produce a more complete analysis.
Using the multidimensional projection + political map, we first focused on analyzing the distribution of points in the layout, identifying groups of cities, as well as isolated cities that may present anomalous behavior. We also investigated the categorizations produced by each coloring attribute in order to understand how they are related to the LA behavior. The geographical map was used to better identify and understand profiles that are independent of the cities' geographical location.
We freely explored the parallel sets + treemap visualization to characterize broader locality groups, defined by their geographic location, such as regions and states, by investigating the role of each measured variable in characterizing or distinguishing them, as well as if and how they correlate to each other. This visualization was also used to further explore some interesting findings of the multidimensional projection layout, to help in the comprehension of how each measured variable influences the characterization of the groups present there, and in the identification of correlations among the involved attributes, the distribution of instances among attributes, and behavior trends, among other tasks.
In each analysis we highlight the elements in the proposed layouts that helped reveal each pattern and led to judicious decision making, and we associate these analyses with the requirements presented in Chapter 3, in order to highlight how these requirements were fulfilled. Finally, whenever we felt it necessary to explain some behavior, we consulted external sources, such as news or the raw data.

Choosing a projection technique
In our experiments, one of the visualizations employs a scatterplot which represents the results of a multidimensional projection technique applied to the data. Thus, it is important to choose an adequate technique, which results in a layout that is capable of highlighting patterns representing important aspects related to the LA information. Several multidimensional projection techniques exist, as shown in Chapter 2, and we investigated three state-of-the-art techniques, implemented in the mp package described in Chapter 4, in order to decide which was best for our dataset: LSP (PAULOVICH et al., 2008), t-SNE (MAATEN; HINTON, 2008) and LAMP (JOIA et al., 2011). The parameters used for each technique are summarized in Table 4, and were suggested by the authors.
Figure 25 shows the three layouts generated by LSP, t-SNE and LAMP, respectively. LSP depicted huge numbers of localities (> 1000) overlapped in single points and no distinguishable groups. t-SNE created an oval shape, also with no distinguishable groups. LAMP generated a layout with better group separation for the BFLPO dataset. Even though some groups were found in the LSP and t-SNE layouts, upon investigation they did not help in revealing any information leading to interesting results. In order to confirm that LAMP better represents the relationships observed in the original space, we employed the Neighborhood Preservation (PAULOVICH; MINGHIM, 2008) quality measure on the three layouts. This quality assessment measures the proportion of the k nearest neighbors of all data instances that are also nearest neighbors in the visualization space, and is used to evaluate how well the projection preserves the relationships observed in the original space. The results are shown in Figure 26. It is possible to notice that t-SNE provides better results for up to 7 neighbors, due to its local preservation capabilities. However, both "human health and social services", which represents few but important cities.
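The Neighborhood Preservation measure described above can be sketched as follows. This is a minimal brute-force illustration that follows the textual definition, assuming Euclidean distance in both spaces. Details such as tie handling in (PAULOVICH; MINGHIM, 2008) or in the mp package may differ, and the toy points are hypothetical.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def neighborhood_preservation(original, projected, k):
    """Mean fraction of each instance's k nearest neighbors kept by the projection."""
    n = len(original)
    kept = sum(len(knn(original, i, k) & knn(projected, i, k)) for i in range(n))
    return kept / (n * k)


# Toy example (hypothetical data): 3-D points and two candidate 2-D layouts.
original = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (10, 10, 10), (11, 10, 10)]
faithful = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10)]
scrambled = [(0, 0), (10, 10), (11, 10), (1, 0), (0, 2)]

np_faithful = neighborhood_preservation(original, faithful, k=2)
np_scrambled = neighborhood_preservation(original, scrambled, k=2)
```

Evaluating the function for a range of k values and plotting one curve per technique gives a comparison of the kind reported in Figure 26. In this toy case the faithful layout keeps every neighborhood, while the scrambled one scores lower.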
of occurrence of a LA and the resulting lesions. Analyzing this parallel sets layout, one can notice that the majority of LAs occur in the employment location (blue ribbons) and that these LAs result in a greater number of lesions to upper members than to lower members. However, for rural areas (green ribbons) this distribution is more balanced, supporting the findings in the scatterplot. By hovering over these ribbons, a tooltip shows the percentage of the total that these LAs represent. LAs occurring in the employment location account for 33.74% of the total when they result in lesions to upper members and for 11.69% when they result in lesions to lower members. For LAs happening in rural areas, lesions to upper members are 1.6% and lesions to lower members are 1.19%. This example also shows how the interactive tools, in this case the tooltip, help the analysis with the parallel sets. The parallelograms formed by the ribbons are sometimes misleading when one wants to compare sizes, but the tooltip clearly shows the amount each ribbon represents. The depicted parallel sets also helped us to identify that accidents happening on public ways (red ribbons) also have a more balanced distribution of LAs resulting in lesions to upper members (8.57%) and lesions to lower members (10.56%), which we found difficult to identify by looking only at the scatterplot. This type of finding can help the government to prevent specific types of accidents by showing in which place a type of accident is more likely to occur. All the analyses up to this point fulfill requirement r1.
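The notion of a "more balanced" distribution can be made explicit with a short calculation on the tooltip percentages quoted above. The upper-to-lower ratio is an illustrative choice made here, not a measure used by the system.

```python
# Tooltip percentages quoted in the text: share of all LAs, by place and injured body part.
shares = {
    "employment location": {"upper": 33.74, "lower": 11.69},
    "rural area": {"upper": 1.6, "lower": 1.19},
    "public way": {"upper": 8.57, "lower": 10.56},
}

# Upper-to-lower ratio: values near 1 indicate a balanced distribution.
ratios = {place: round(s["upper"] / s["lower"], 2) for place, s in shares.items()}
```

The ratio is close to 2.9 for the employment location, but about 1.3 for rural areas and about 0.8 for public ways, which is the contrast that the ribbons alone make hard to judge.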
Identifying areas that may be lacking attention can be readily done when one combines
Brasília has a more balanced distribution of accidents between men and women across the economic activities, suggesting a better work force distribution, with 62% of LACs related to LAs involving men, which is 10 percentage points below the region average (72%). The economic activity with the most registered accidents is "human health and social services", as already seen in the scatterplot, but the parallel sets allows us to perceive other economic activities with a meaningful number of registered LAs. Still related to economic activity, "transformation industries" is responsible for only 6% of accidents in Brasília, far from the rest of the region. Regarding occupations, "Arts and Science" and "Human Health" register more accidents involving women, while "production of goods and services" is more associated with accidents involving men. "Service providers and sellers" is the occupation with the largest number of LACs. Brasília, being the capital of Brazil, is an administrative center, concentrating the highest number of public agencies, and has a particular focus on providing services. These characteristics seem to be well reflected in the economic activities and occupations registering most of the LACs. Regarding causer agent, the one with most occurrences is "fall from the same level", followed by "transportation vehicles". With respect to injury types, the most common in Brasília is "contusion", while in the region it is "fracture".
As expected from the scatterplot, Santo André displays a LA profile similar to that of Brasília. The distribution of accidents between men and women is also more balanced: 59% involve men, which is 6 percentage points below the region average. "Human health and social services" is also an economic activity responsible for a significant number of LACs (23%), even though the top listed here is "transformation industries" (27%). The occupation profile is also similar to that of Brasília. LAs involving women are more prevalent in occupations in "Arts and Science" and in "Health". Regarding causer agent, the one with most LACs in Santo André is "chemical agent", closely followed by "fall from the same level", while the other top listed causers are the same as in Brasília. This analysis of both cities fulfills requirement r3.
The parallel sets + treemap visualization is effective in characterizing entire regions as well. The Sul and Sudeste regions have similar behavior, as can be seen in Figure 38. These are the regions with the most balanced proportion of male to female LAs: 65% involve men. The economic activities responsible for the greatest number of LACs are "transformation industries", "commerce and repair of vehicles", and "human health and social services". This similar behavior extends to the other attributes, and the only difference is the distribution of LAs among the states. Figure 39 presents the treemaps of both these regions, showing the total number of LACs for each of their states. It is possible to notice that the Sul region is more homogeneous than Sudeste, with its three states registering similar numbers of LACs, while in the Sudeste region the state of São Paulo is responsible for 60% of all LACs in the region, and Espírito Santo is responsible for only 4%. Figure 40 shows the parallel sets for the Nordeste and Centro-Oeste regions. A more significant number of accidents in rural areas can be perceived in these regions, around 8%. Additionally, the percentage of accidents involving men is greater than 70%. Similar

Conclusion
In this research work we presented a strategy to perform visual analysis of LA data, aiming to explore the underlying structural characteristics of the BFLPO dataset, including its heterogeneity and associated hierarchical organization. A system implementing this strategy was developed as a means to validate it.
The chosen layouts were able to communicate the underlying data structure, and the interactive tools provided an effective exploration. By using our proposed strategy, we were able to identify profiles associated with individual cities and large geographical areas, as well as behavior patterns associated with the whole country. Combining the patterns observed in both visualizations, we were able to find similarities among geographically distant localities, occurrence patterns related to the cities' size and economic development, and the frequency distribution of LA types in Brazil, and to characterize those LAs in terms of occupation type, gender differences, and causer agent, among other aspects. Moreover, the visualizations' complementary capacities provided simultaneous analysis of different data aspects. As an example, the scatterplot depicts only cities, and the insights about higher hierarchical levels are gathered from the shape formed by the points' positions and colors, while the treemap and parallel sets visualization allows the characterization of larger geographical areas, at all hierarchical levels. We believe that several of our findings would be difficult to uncover using solely tabular data or common statistical graphs. Among these findings, we identified relationships among cities that would not be easy to identify if each city were compared only with cities located in the same or nearby localities, as well as patterns associated with entire regions.
The system can contribute to policy making, because it clearly shows important strategic information, such as major accident causers, localities lacking proper monitoring, and odd behaviors, among others. This could potentially aid the government in evaluating whether an ongoing strategy is effective, or even whether a newly proposed policy would be adequate to the situation. Additionally, the developed system also makes the available data more transparent, by easily communicating the situation of LA occurrences in the country, enabling a more effective comprehension of their occurrence and providing a simple but effective communication channel between citizens and government. This could foment a more active participation of the population in matters regarding labor issues, allowing the population to better monitor government policies and demand appropriate solutions.
The strategy and corresponding system, as well as the analysis presented in this thesis, focused on LA data provided by the BFLPO. However, we believe this strategy can readily be used to analyze other datasets with similar characteristics, that is, datasets that present hierarchical and geographical aspects, as well as a massive number of categorical measured attributes. A large number of governmental datasets present these characteristics, and we thus believe the strategy is widely applicable in several strategic analysis scenarios.

Limitations
During the development of this research work, we were able to identify two types of limitations: data related and strategy related.
Regarding the data, the number of LACs is small when compared to the number of LAs that really occur, and this is especially noticeable in poorer regions, like Norte. This underrepresentation is a serious problem and potentially impairs a proper analysis, making it difficult to explain a specific behavior, as well as to trust some of the revealed patterns. Furthermore, some LACs are not properly reported, resulting in loss of information that may be important for the analysis. We partially address this problem by shedding light on anomalous behavior and by identifying areas that may lack attention. We expect that this will aid the government in identifying deficiencies in the LA reporting process, in order to promote stricter inspection.
Regarding our strategy and implemented system, the parallelograms in the parallel sets, when individually evaluated, may sometimes mislead the analysis, because they tend to get distorted when the path from a category to the next one generates sharp angles. Adding some visual cue to the axis, such as ticks, could improve this situation. We currently address this issue by using the tooltip information.
The treemap, although useful to navigate the parallel sets and to add more context to the analysis, does not allow the comparison of localities from the same level, but in different branches, such as mesoregions from different states. This may make such analysis slower and more prone to errors.
The BFLPO dataset is massively categorical, and in order to employ the point-placement techniques, it was necessary to transform this data into numerical values. Thus, our evaluations became strongly dependent on these transformation procedures, which may result in information loss. However, we believe that the layouts produced from the transformed data already yielded satisfactory results in terms of revealing useful patterns for analysis.

Future Work
After we performed the analysis described in Chapter 5, several interesting research directions were found. This research work explored the structural aspect of LA data; however, another interesting aspect of the data is the temporal one. Combining these two aspects in a single analysis strategy could potentially reveal more patterns and improve the analysis as a whole. A temporal visual LA analysis tool has already been developed in (BRITO; RODRIGUES; PAIVA, 2019), and we intend to combine the techniques proposed there with our approach to perform such analysis. Additionally, there are many small improvements that can be made to the developed system, including new interactions and better communication between the layouts/visualizations. Some examples include improving the scatterplot by showing more than one attribute at once, for instance using different shapes for points or transparency, among others. New multidimensional projections could be made based on user selections. The treemap could show different values besides the number of accidents. Finally, we intend to perform qualitative experiments with specialists from the BFLPO, who are domain experts and may significantly benefit from the information provided by our tool. These experiments could lead to further improvements in our strategy and positively impact the decision-making process.